ByteDance’s Doubao Team Releases Open-Source VideoWorld Model for Video-Based AI Learning

On February 10, ByteDance's Doubao large-model team, in collaboration with Beijing Jiaotong University and the University of Science and Technology of China, unveiled VideoWorld, a video generation model that has now been open-sourced.

Unlike mainstream multimodal models such as Sora, DALL-E, and Midjourney, VideoWorld learns from visual information alone: it relies solely on unlabeled video data, rather than text or language supervision, to develop reasoning, planning, and decision-making capabilities.

To train VideoWorld, the team constructed two experimental environments: video-recorded Go gameplay and robotic control simulation. The model learns from an offline dataset of video demonstrations, adopting a naive autoregressive framework that pairs a VQ-VAE encoder-decoder with an autoregressive Transformer.
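
To make the setup concrete, here is a minimal sketch of such a naive autoregressive pipeline. All module names, sizes, and the training step are illustrative assumptions for exposition, not VideoWorld's released code: a VQ-VAE tokenizes each frame into discrete codes, and a causal Transformer learns next-token prediction over the flattened code sequence.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: names and hyperparameters are assumptions,
# not VideoWorld's released implementation.

class NaiveVideoAR(nn.Module):
    def __init__(self, codebook_size=8192, d_model=512, n_layers=8, n_heads=8):
        super().__init__()
        self.tok_emb = nn.Embedding(codebook_size, d_model)
        self.pos_emb = nn.Parameter(torch.zeros(1, 4096, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, tokens):                      # tokens: (B, T) VQ-VAE codes
        x = self.tok_emb(tokens) + self.pos_emb[:, : tokens.size(1)]
        T = tokens.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.transformer(x, mask=causal)        # causal self-attention
        return self.head(h)                         # next-code logits per position

# Each training video becomes one flat sequence of per-frame VQ-VAE codes;
# the Transformer simply learns next-token prediction over that sequence.
model = NaiveVideoAR()
codes = torch.randint(0, 8192, (2, 256))            # stand-in for VQ-VAE output
logits = model(codes[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 8192), codes[:, 1:].reshape(-1)
)
loss.backward()
```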


Conventional video encoding requires hundreds to thousands of discrete tokens per frame to capture visual detail, which spreads task-relevant knowledge thinly across the sequence. To address this, VideoWorld introduces a Latent Dynamics Model (LDM) that compresses inter-frame visual changes into compact latent representations, significantly improving the efficiency of knowledge extraction.
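
Conceptually, the LDM trades a full per-frame token grid for a handful of "what changed" codes per step. The following is a hedged sketch of that idea, with all shapes, names, and the nearest-neighbor quantizer assumed for illustration: an encoder looks at a frame pair, emits a few quantized latents describing the change, and a decoder reconstructs the next frame from the current frame plus those latents.

```python
import torch
import torch.nn as nn

# Hedged sketch of the latent-dynamics idea: compress inter-frame change
# into a few compact codes instead of hundreds of per-frame tokens.
# Shapes, names, and the quantizer are illustrative assumptions.

class LatentDynamicsSketch(nn.Module):
    def __init__(self, frame_dim=1024, n_latents=4, n_codes=512, d=128):
        super().__init__()
        self.encode = nn.Sequential(                 # (frame_t, frame_t+1) -> change
            nn.Linear(2 * frame_dim, n_latents * d), nn.GELU()
        )
        self.codebook = nn.Embedding(n_codes, d)     # discrete change vocabulary
        self.decode = nn.Linear(frame_dim + n_latents * d, frame_dim)
        self.n_latents, self.d = n_latents, d

    def quantize(self, z):
        # Nearest-codebook-entry lookup per latent slot
        # (straight-through gradient omitted for brevity).
        cb = self.codebook.weight.expand(z.size(0), -1, -1)
        idx = torch.cdist(z, cb).argmin(-1)          # (B, n_latents)
        return self.codebook(idx), idx

    def forward(self, frame_t, frame_next):
        z = self.encode(torch.cat([frame_t, frame_next], -1))
        z = z.view(-1, self.n_latents, self.d)
        zq, idx = self.quantize(z)                   # a few compact change codes
        recon = self.decode(torch.cat([frame_t, zq.flatten(1)], -1))
        return recon, idx

ldm = LatentDynamicsSketch()
f_t, f_next = torch.randn(2, 1024), torch.randn(2, 1024)
recon, change_codes = ldm(f_t, f_next)               # 4 codes stand in for a frame
loss = nn.functional.mse_loss(recon, f_next)
```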

In Go, for example, successive board changes are strongly correlated across moves, much as sequential actions in robotic control must be coordinated. By compressing these multi-step changes into compact embeddings, VideoWorld obtains a denser policy representation and, at the same time, encodes guidance for forward planning, as the rollout sketch below illustrates.
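
To make the forward-planning point concrete, a hypothetical look-ahead under the same illustrative modules as above: in a full system the autoregressive model would propose the next change codes, and the LDM decoder would turn them into imagined future frames; here random codes stand in for the Transformer's predictions.

```python
# Hypothetical planning rollout (reusing the ldm sketch defined above).
import torch

frame = torch.randn(1, 1024)                 # current observation features
plan = []
for step in range(3):                        # 3-step horizon, chosen arbitrarily
    # Random codes stand in for the Transformer's next-code predictions.
    idx = torch.randint(0, 512, (1, ldm.n_latents))
    zq = ldm.codebook(idx).flatten(1)
    frame = ldm.decode(torch.cat([frame, zq], -1))   # imagined next frame
    plan.append(idx)                         # compact record of each planned step
```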

Despite having only 300 million parameters, VideoWorld reaches the professional 5-dan level in 9×9 Go and carries out robotic tasks across diverse environments.

The advancement of AI's visual learning capabilities is expected to accelerate new AI applications. A research report from Great Wall Securities highlights the ongoing improvement of multimodal AI models in China, including ByteDance's Doubao and Kuaishou's Kling, which are enhancing video generation with precise semantic understanding, consistent multi-shot composition, and dynamic camera movement. As underlying capabilities mature, AI applications are expected to iterate rapidly, driving higher token consumption and broader adoption across industries.