ByteDance’s Doubao Video Generation Large Model Released

On September 24, Volcano Engine officially launched two Doubao video generation large models, PixelDance and Seaweed, opening invited testing for the enterprise market. This marks ByteDance’s official entry into AI video generation. Note: The new Doubao video generation models are currently available for limited testing in the Dreamina AI beta version.

The models combine efficient DiT fusion computing units, a newly designed diffusion-model training method, and a deeply optimized Transformer structure, enabling more effective compression of video and text. Together, these allow consistent multi-shot generation and significantly enhance the generalization ability of video generation.
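ByteDance has not published the internals of PixelDance or Seaweed, but the DiT (Diffusion Transformer) design the announcement references is well documented in the open literature. The sketch below is a minimal, illustrative DiT block with adaLN-Zero conditioning in the style of Peebles & Xie (2023); all dimensions, names, and wiring here are assumptions, not Doubao’s actual implementation.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """One Diffusion Transformer (DiT) block with adaLN-Zero conditioning.
    Illustrative sketch only: PixelDance/Seaweed internals are unpublished,
    so every dimension and wiring choice here is an assumption."""

    def __init__(self, dim: int = 768, num_heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )
        # adaLN-Zero: the conditioning vector (pooled timestep + text
        # embedding) regresses per-block scale/shift/gate parameters;
        # zero-init makes each block start as the identity function.
        self.adaLN = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.adaLN[1].weight)
        nn.init.zeros_(self.adaLN[1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, tokens, dim) spatio-temporal video patch tokens
        # cond: (batch, dim) pooled timestep + text conditioning
        s1, b1, g1, s2, b2, g2 = self.adaLN(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        x = x + g2.unsqueeze(1) * self.mlp(h)
        return x
```

A full video DiT would stack dozens of such blocks over spatio-temporal patch tokens produced by a video tokenizer, which is consistent with, though not confirmed by, the “fusion computing units” described in the announcement.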

According to the official introduction, the Doubao Video Generation large model achieves industry-leading standards in semantic understanding, complex interactions between multiple entities, and content consistency during multi-shot transitions.

Volcano Engine President Tan Dai stated, “Video generation faces many challenges that need to be overcome. The two Doubao models will continue to evolve, exploring more possibilities in addressing key issues and accelerating the expansion of AI video creation and application.”

Tan added that the Doubao Video Generation large model supports consistent multi-shot generation across various styles and ratios, making it applicable in fields such as e-commerce marketing, animation education, urban culture and tourism, and micro-script development.

Furthermore, Tan reported a dramatic increase in usage since the release of the Doubao large model. As of September, the daily token usage for the Doubao language model has surpassed 1.3 trillion, a tenfold increase compared to its initial release in May. The multimodal data processing volume has also reached 50 million images and 850,000 hours of audio daily.

Earlier video generation models could execute only simple instructions, whereas the Doubao Video Generation model produces natural, coherent multi-shot actions and complex interactions among multiple entities. Creators with early access found that the generated videos not only followed intricate prompts but also let different characters interact through sequences of action commands. The characters’ appearances, clothing details, and even accessories remained consistent across camera angles, approaching the quality of live-action footage.

According to Volcano Engine, the Doubao Video Generation model is built on the DiT architecture, using efficient DiT fusion computing units to let videos switch seamlessly between dynamic movements and camera angles. It offers camera-language capabilities such as zooming, orbiting, panning, scaling, and target tracking. The model also features professional-grade lighting and color harmonization, producing visually striking and realistic images.
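The announcement does not say how these camera moves are specified to the model. One common pattern, shown here purely as a hypothetical sketch, is to embed a discrete camera-move token and add it to the per-block conditioning vector that a DiT block (like the one sketched above) consumes; every name below is invented for illustration.

```python
import torch
import torch.nn as nn

# Hypothetical camera-move vocabulary mirroring the capabilities listed above.
CAMERA_MOVES = ["static", "zoom", "orbit", "pan", "scale", "track"]

class CameraConditioner(nn.Module):
    """Adds a learned camera-move embedding to the conditioning vector.
    Illustrative only; not Doubao's published interface."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.embed = nn.Embedding(len(CAMERA_MOVES), dim)

    def forward(self, cond: torch.Tensor, move: str) -> torch.Tensor:
        # cond: (batch, dim) pooled timestep + text conditioning
        idx = torch.tensor([CAMERA_MOVES.index(move)], device=cond.device)
        return cond + self.embed(idx)  # (1, dim) broadcasts over the batch
```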

The deeply optimized Transformer structure significantly enhances the generalization ability of Doubao video generation, supporting a variety of styles including 3D animation, 2D animation, traditional Chinese painting, black-and-white, and thick paint. It adapts to the aspect ratios of various devices, including film, television, computer, and mobile screens. This makes it suitable not only for enterprise scenarios like e-commerce marketing, animation education, urban culture and tourism, and micro-script development, but also as a creative aid for professional creators and artists.

SEE ALSO: ByteDance’s New Large Model, Doubao, Delivers Stellar Performance