Alibaba Open Sources Video Generation Model Wanxiang 2.1, Performance Surpasses Sora
On the evening of February 25th, Alibaba Group Holding Limited fully open-sourced its video generation model Wanxiang 2.1. The model, a key member of Alibaba Cloud's Tongyi family of AI models, was first released in January 2025. On the VBench benchmark, it scored 86.22% overall, significantly outperforming domestic and foreign models such as Sora, Luma, and Pika and taking the top spot.
Wanxiang 2.1 pairs a self-developed, efficient variational autoencoder (VAE) with a diffusion transformer (DiT) architecture to strengthen its spatiotemporal context modeling. This design lets the model capture and simulate dynamic changes in the real world more accurately, while parameter sharing keeps training costs down.
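The article does not disclose Wanxiang 2.1's actual block design, so as background only, here is a minimal PyTorch sketch of the generic architecture class it names: a DiT block with adaLN-Zero timestep conditioning, as popularized by the original DiT paper. Every dimension and name below is illustrative, not taken from the model.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Generic diffusion-transformer block (illustrative, not Wanxiang 2.1's)."""

    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # adaLN-Zero: the timestep embedding produces per-block shift/scale/gate,
        # zero-initialized so each block starts as an identity mapping.
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) spatiotemporal patch tokens; t_emb: (batch, dim)
        s1, sc1, g1, s2, sc2, g2 = self.ada(t_emb).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + sc1.unsqueeze(1)) + s1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + sc2.unsqueeze(1)) + s2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)
```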
By splitting videos into chunks and caching intermediate features, the model sidesteps the complexity of traditional end-to-end encoding and decoding, supporting efficient generation and processing of 1080P videos of unlimited length.
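To make the chunk-and-cache idea concrete, here is a hedged sketch of the control flow: a long latent video is decoded chunk by chunk, with cached features carried across chunk boundaries so memory stays bounded regardless of length. `decode_chunk` and its cache format are hypothetical stand-ins; the real VAE internals are not described in this article.

```python
from typing import Callable, Iterable, Optional

def decode_long_video(latent_chunks: Iterable, decode_chunk: Callable) -> list:
    """Decode a long latent video chunk by chunk with a rolling feature cache."""
    frames: list = []
    cache: Optional[dict] = None  # intermediate features reused by the next chunk
    for chunk in latent_chunks:
        # decode_chunk (hypothetical) decodes one chunk of latents into frames,
        # consuming and returning cached boundary features so adjacent chunks
        # stay temporally consistent without end-to-end decoding.
        out, cache = decode_chunk(chunk, cache)
        frames.extend(out)
    return frames
```

Because each iteration only holds one chunk plus a small cache, peak memory is independent of video length, which is what enables the unlimited-length claim.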
It is also the first video generation model able to render Chinese text, and it supports text effects in both Chinese and English. On instruction following, it outputs videos that strictly adhere to directions such as camera movements, and it accurately understands and executes long text prompts.
The model can also faithfully simulate real-world physics, such as raindrops splashing on an umbrella or natural transitions in human movement. In complex motions like figure skating or swimming, Wanxiang 2.1 keeps body parts coordinated and motion trajectories realistic.
Alibaba has open-sourced all of Wanxiang 2.1's inference code and weights in two sizes, 14B and 1.3B parameters, under the Apache 2.0 license. Developers worldwide can download and try them on GitHub, Hugging Face, and the ModelScope (MoDa) community.
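For reference, fetching the weights from Hugging Face takes a few lines with the huggingface_hub library. The repo id below is an assumption based on the Wan-AI organization's naming; check the model card for the exact id.

```python
from huggingface_hub import snapshot_download

# Assumed repo id for the 1.3B text-to-video checkpoint; verify before use.
local_dir = snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-1.3B")
print(f"Model files downloaded to: {local_dir}")
```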
The 14B model excels at instruction following and complex scene generation, while the 1.3B version runs on consumer graphics cards, generating high-quality video with only 8.2GB of VRAM. That makes it well suited to secondary development and academic research, and greatly lowers the barrier to entry.
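As an illustration of consumer-GPU usage, the sketch below assumes the Diffusers-format checkpoint and the WanPipeline integration added to Hugging Face's diffusers library, neither of which is stated in the article; CPU offloading is what keeps peak VRAM low, and the exact footprint depends on resolution and settings.

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.utils import export_to_video

# Assumed Diffusers-format repo id for the 1.3B model; verify on the model card.
model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # swap weights to CPU between steps to cap VRAM use

frames = pipe(
    prompt="A cat walking through falling snow",
    height=480,
    width=832,
    num_frames=81,
    guidance_scale=5.0,
).frames[0]
export_to_video(frames, "output.mp4", fps=15)
```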
Alibaba's release is not the industry's first open-source video generation model: StepStar previously open-sourced Step-Video-T2V, at the time the largest open-source video generation model by parameter count and the best-performing among open-source peers.
For the AI industry, open sourcing puts powerful tools in developers' hands, accelerating innovation and new applications in video generation. Chinese AI standout DeepSeek continues to champion open source, and Baidu has announced it will fully open-source its ERNIE 4.5 model family starting June 30th.