StepFun and Geely Join Forces to Open-Source Two Multimodal Large Models

On February 18, 2025, StepFun and Geely Auto Group jointly announced that they will open-source two jointly developed Step-series multimodal large models to global developers: Step-Video-T2V, currently the largest and best-performing open-source video generation model in the world, and Step-Audio, the industry's first product-level open-source multimodal large model for voice interaction. Starting today, users can try both models in the StepChat app.

StepFun is a strategic technology partner of Geely Auto Group. The two parties cooperate deeply, with complementary strengths in computing power, algorithms, scenario training, and other areas that significantly enhance the performance of multimodal large models. This joint open-source initiative aims to promote sharing and innovation in large model technology and drive the inclusive development of artificial intelligence, contributing strong multimodal large model capabilities to the open-source world and forming another force from China in the global open-source large model landscape.

Geely Auto Group CEO Gan Jiayue said: "Geely is committed to becoming a leader and promoter of intelligent automotive AI technology. As early as 2021, Geely began building an end-to-end self-developed system and ecosystem alliance around chips, software operating systems, data, and satellite networks, creating a complete 'Intelligent Geely Technology Ecosystem' that continuously evolves the user experience in intelligent driving and smart cabins. Geely's full-stack self-developed Xingrui AI large model has now achieved deep integration with Step-Video-T2V, Step-Audio, and other large models. This will bring users more intelligent and advanced cabin interaction and smart driving experiences, promoting the adoption of AI technology in intelligent automobiles."

It is understood that this is also the first time StepFun has open-sourced its Step-series base models. Dr. Jiang Daxin, founder and CEO of StepFun, stated: "StepFun has always been dedicated to developing foundational large models with the goal of achieving AGI (Artificial General Intelligence). We are well aware that achieving AGI requires the collective efforts of global developers. We chose to open-source not only to share our latest technological achievements and give back to the open-source community, but also because we believe multimodal models are essential to realizing AGI, and the technology is still in its early stages. We look forward to brainstorming with community developers to expand the boundaries of model technology together and drive industrial applications."

Step-Video-T2V is currently the largest and best-performing open-source video generation model in the world, with 30 billion parameters. It can directly generate high-quality 204-frame videos at 540P resolution, giving the generated content extremely high information density and strong temporal consistency.

In terms of generation quality, Step-Video-T2V demonstrates powerful capabilities in complex motion, aesthetically appealing characters, visual imagination, basic text rendering, native bilingual (Chinese and English) input, and cinematic camera language. Its strong semantic understanding and instruction-following abilities help video creators realize their intended concepts precisely. Users can try Step-Video-T2V's video generation capabilities on the StepChat website or through the StepChat app, as sketched for developers below.
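For developers, the practical significance of the release is that the model weights can be downloaded and run locally. The sketch below illustrates the typical workflow for an open-source text-to-video checkpoint using Hugging Face diffusers; the repository id, resolution, and generation argument names are assumptions for illustration, not confirmed details of the Step-Video-T2V API.

```python
# A minimal, hypothetical sketch of running an open-source text-to-video
# checkpoint with Hugging Face diffusers. The repo id and argument names
# are assumptions, not confirmed details of the Step-Video-T2V release.
import torch
from diffusers import DiffusionPipeline

REPO_ID = "stepfun-ai/stepvideo-t2v"  # assumed repository id

pipe = DiffusionPipeline.from_pretrained(
    REPO_ID,
    torch_dtype=torch.bfloat16,  # reduced precision to fit GPU memory
    trust_remote_code=True,      # custom pipelines ship their own code
).to("cuda")

# Frame count and resolution chosen to mirror the announced
# 204-frame, 540P capability; the actual argument names may differ.
result = pipe(
    prompt="A panda riding a bicycle through a bamboo forest",
    num_frames=204,
    height=544,
    width=960,
)
frames = result.frames[0]  # most diffusers video pipelines return PIL frames
```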

Step-Audio is the industry's first product-level open-source voice interaction model. It can generate expressive speech with emotions, dialects, multiple languages, singing voices, and personalized styles according to different scenario requirements, and it can hold natural, high-quality conversations with users. The voice it generates is exceptionally natural and emotionally intelligent, and the model also supports high-quality voice cloning and role-playing, meeting application needs in industries such as film and television entertainment, social networking, and gaming; a loading sketch for integrators follows below.
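Similarly, an integrator would begin by loading the released checkpoint before wiring up audio input and output. The sketch below shows one plausible loading path via Hugging Face transformers; the repository id and interface are assumptions, and the actual Step-Audio release may expose different entry points.

```python
# A hypothetical sketch of loading a speech-interaction checkpoint with
# Hugging Face transformers. The repo id and loading path are assumptions
# for illustration; consult the official Step-Audio release for the real
# entry points and audio I/O utilities.
from transformers import AutoModelForCausalLM, AutoTokenizer

REPO_ID = "stepfun-ai/Step-Audio-Chat"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(REPO_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    REPO_ID,
    device_map="auto",       # shard across available GPUs
    trust_remote_code=True,  # multimodal audio code ships with the checkpoint
)
# Downstream, the release's own chat/audio helpers would wrap this model to
# accept speech input and synthesize speech output.
```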

On five mainstream public benchmarks, including LLaMA Questions and Web Questions, Step-Audio outperformed comparable open-source models in the industry, ranking first. Its performance on the HSK-6 (Chinese Proficiency Test, Level 6) evaluation was particularly outstanding, making it the open-source speech interaction model best versed in the Chinese language.