Alibaba Open Sources Video Generation Model Wanxiang 2.1
On the evening of February 25th, Alibaba Group Holding Limited announced the full open-sourcing of its video generation model, Wanxiang 2.1. The model, part of Alibaba Cloud's Tongyi family of AI models, was first released in January 2025. On the VBench benchmark, Wanxiang 2.1 topped the leaderboard with a total score of 86.22%, surpassing competitors such as Sora, Luma, and Pika.
Wanxiang 2.1 is built on a self-developed architecture that pairs an efficient variational autoencoder (VAE) with a diffusion transformer (DiT). This design strengthens the model's ability to capture the spatiotemporal dynamics of real-world scenes, and effective parameter sharing within the architecture also reduces training costs.
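To make the pipeline concrete, the sketch below shows the general VAE-plus-DiT latent diffusion pattern such an architecture follows: generation starts from noise in a compact latent space, a transformer iteratively denoises the latent, and the VAE decodes the result back to pixels. All classes, shapes, and the scheduler step are toy stand-ins for illustration, not Alibaba's actual implementation.

```python
import torch
import torch.nn as nn

# Toy stand-ins for the two components named above: a video VAE that
# compresses pixels into a latent video, and a diffusion transformer (DiT)
# that predicts noise on that latent. Internals and shapes are illustrative.

class ToyVideoVAE(nn.Module):
    def __init__(self, channels=3, latent_channels=4):
        super().__init__()
        # 3D convolutions compress space and time; real video VAEs are
        # far deeper and compress much more aggressively.
        self.enc = nn.Conv3d(channels, latent_channels, 4, stride=4)
        self.dec = nn.ConvTranspose3d(latent_channels, channels, 4, stride=4)

    def encode(self, video):   # (B, C, T, H, W) pixels -> latent video
        return self.enc(video)

    def decode(self, latent):  # latent video -> (B, C, T, H, W) pixels
        return self.dec(latent)

class ToyDiT(nn.Module):
    """Stand-in denoiser; a real DiT also conditions on timestep and text."""
    def __init__(self, latent_channels=4):
        super().__init__()
        self.net = nn.Conv3d(latent_channels, latent_channels, 3, padding=1)

    def forward(self, noisy_latent, t):
        return self.net(noisy_latent)

vae, dit = ToyVideoVAE(), ToyDiT()
latent = torch.randn(1, 4, 4, 16, 16)      # generation starts from noise
for t in reversed(range(10)):              # simplified denoising loop
    noise_pred = dit(latent, t)
    latent = latent - 0.1 * noise_pred     # placeholder scheduler update
video = vae.decode(latent)                 # -> (1, 3, 16, 64, 64) pixels
print(video.shape)
```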
One of its distinctive features is how it processes video. By splitting videos into chunks along the timeline and caching intermediate features, Wanxiang 2.1 sidesteps the memory demands of traditional end-to-end encoding and decoding, allowing it to efficiently generate and handle 1080P video of any length.
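The sketch below illustrates this chunk-and-cache idea with a causal temporal convolution: because each chunk only needs the last few cached frames of the previous chunk, the chunked output matches whole-video processing exactly while memory stays bounded. Layer and chunk sizes here are hypothetical, not Wanxiang 2.1's.

```python
import torch
import torch.nn as nn

# Chunk-wise processing with feature caching. A causal temporal convolution
# only needs the last (kernel_t - 1) frames of the previous chunk to
# reproduce whole-video output exactly, so memory use is bounded no matter
# how long the video is.

class CachedCausalConv(nn.Module):
    def __init__(self, channels=8, kernel_t=3):
        super().__init__()
        self.kernel_t = kernel_t
        # No temporal padding: causality is handled via the cache below.
        self.conv = nn.Conv3d(channels, channels, (kernel_t, 3, 3),
                              padding=(0, 1, 1))
        self.cache = None  # trailing frames carried over between chunks

    def forward(self, chunk):  # chunk: (B, C, T, H, W)
        if self.cache is None:
            # Very first chunk: left-pad the time axis with zeros.
            pad = torch.zeros_like(chunk[:, :, : self.kernel_t - 1])
        else:
            pad = self.cache
        x = torch.cat([pad, chunk], dim=2)
        self.cache = x[:, :, -(self.kernel_t - 1):].detach()
        return self.conv(x)   # output has the same T as the input chunk

layer = CachedCausalConv()
video = torch.randn(1, 8, 32, 16, 16)

# Reference: the whole zero-padded video in a single pass.
padded = torch.cat([torch.zeros(1, 8, 2, 16, 16), video], dim=2)
full = layer.conv(padded)

# Chunked: four 8-frame chunks; the cache links them seamlessly.
chunks = [layer(c) for c in video.split(8, dim=2)]
print(torch.allclose(torch.cat(chunks, dim=2), full, atol=1e-5))  # True
```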
Wanxiang 2.1 is also the first video generation model able to render Chinese text, and it can apply special effects to both Chinese and English text. The model excels at instruction following, producing videos that adhere closely to prompts specifying camera movements and complex, long-form directions.
In terms of realism, Wanxiang 2.1 convincingly simulates real-world physical phenomena, such as raindrops bouncing off an umbrella or the fluid transitions of human movement. Even in demanding scenarios like figure skating or swimming, the model keeps bodies and motion coherently coordinated so that every element looks authentic.
Alibaba has made the complete inference code and model weights for Wanxiang 2.1 publicly available in two parameter sizes, 14 billion and 1.3 billion, under the Apache 2.0 license. Developers worldwide can access these resources through GitHub, Hugging Face, and the ModelScope community.
The larger 14-billion-parameter model stands out for its superior instruction following and its ability to generate complex scenes. The 1.3-billion-parameter model, in contrast, runs on consumer graphics cards with as little as 8.2GB of VRAM, making it well suited to secondary development and academic research and effectively lowering the barrier to entry.
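For developers who want to try the weights, a minimal download sketch using the huggingface_hub client is shown below. The repository ID mirrors the naming Alibaba used on Hugging Face at release and should be verified before running.

```python
# Hypothetical download sketch; the repo ID below is an assumption based
# on the published Hugging Face naming and should be double-checked.
from huggingface_hub import snapshot_download

# Fetch the lightweight 1.3B text-to-video variant for consumer GPUs.
local_dir = snapshot_download(repo_id="Wan-AI/Wan2.1-T2V-1.3B")
print(f"Weights downloaded to: {local_dir}")
```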
Wanxiang 2.1 is not the first open-source video generation model on the market: StepFun's Step-Video-T2V, for example, was previously noted for its large parameter count and strong performance among open-source offerings.
The broader trend of open-source development in the AI sector is acting as a catalyst for rapid innovation in video generation technology. Companies such as DeepSeek maintain an open-source ethos, and Baidu has announced plans to fully open-source its ERNIE 4.5 series beginning June 30th.
SEE ALSO: Baidu: ERNIE’s 4.5 Series Will Be Open Source From June 30th