
Meituan Open-Sources “LongCat-Video,” a 5-Minute Text-to-Video AI Model
Meituan open-sources LongCat-Video, a breakthrough AI model that generates 5-minute HD videos from text or images, advancing China’s generative video tech.
Chinese tech giant Meituan has released its new LongCat-Video model, claiming a breakthrough in text-to-video generation by producing coherent, high-definition clips up to five minutes long. The company has also open-sourced the model on GitHub and Hugging Face to support broader research collaboration.
According to Meituan, LongCat-Video is built on a Diffusion Transformer (DiT) architecture and supports three modes — text-to-video, image-to-video, and video continuation. The model can transform a text prompt or a single reference image into a smooth 720p/30 fps sequence, or extend existing footage into longer scenes with consistent style, motion, and physics.
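Because the weights are published openly, pulling them down is straightforward. The sketch below uses huggingface_hub's snapshot_download, a standard Hugging Face API; the repo ID "meituan-longcat/LongCat-Video" is an assumption inferred from the announcement rather than a confirmed identifier, and Meituan's own inference scripts (which ship with the GitHub repo) are not reproduced here.

```python
# Sketch: fetch the released LongCat-Video weights from Hugging Face.
# NOTE: the repo ID is an assumption based on the announcement; check the
# model card for the exact ID and for the project's own inference entry
# points, which this snippet does not reproduce.
from huggingface_hub import snapshot_download

weights_dir = snapshot_download(repo_id="meituan-longcat/LongCat-Video")
print(f"Model files downloaded to: {weights_dir}")
```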
The team said the model addresses a persistent challenge in generative video — maintaining quality and temporal stability across extended durations. LongCat-Video can generate continuous, multi-minute content without the typical frame degradation that affects most diffusion-based systems.
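To make the "video continuation" idea concrete, the sketch below shows the general chunked-generation pattern long-video systems use: each new chunk is conditioned on the tail frames of the clip so far, which keeps style and motion consistent across minutes of footage. This is a generic, runnable illustration of the technique the article describes, not Meituan's implementation; generate_chunk is a hypothetical stand-in for the actual diffusion sampler.

```python
# Generic illustration of chunked video continuation (not Meituan's code).
import numpy as np

def generate_chunk(cond_frames: np.ndarray, num_frames: int) -> np.ndarray:
    """Hypothetical stand-in for a diffusion sampler: drifts the last
    conditioning frame with small noise so the example runs end to end."""
    last = cond_frames[-1]
    drift = np.cumsum(np.random.normal(0, 0.01, (num_frames,) + last.shape), axis=0)
    return np.clip(last + drift, 0.0, 1.0)

def continue_video(seed_frames, total_frames: int, chunk: int = 48, overlap: int = 8):
    """Extend a clip chunk by chunk, conditioning each chunk on the tail
    frames of the video so far to preserve temporal consistency."""
    video = list(seed_frames)
    while len(video) < total_frames:
        cond = np.stack(video[-overlap:])      # condition on the most recent frames
        video.extend(generate_chunk(cond, chunk))
    return np.stack(video[:total_frames])

seed = np.random.rand(8, 64, 64, 3)                 # 8 seed frames, 64x64 RGB
clip = continue_video(seed, total_frames=30 * 10)   # ~10 s at 30 fps
print(clip.shape)                                   # (300, 64, 64, 3)
```

In a real system, the sampler would also re-denoise the overlap region rather than copying it verbatim, which is one way diffusion pipelines suppress the frame degradation mentioned above.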
Meituan described LongCat-Video as a step toward “world-model” AI, capable of learning real-world geometry, semantics, and motion to simulate physical environments. The model is publicly available through Meituan’s repositories on GitHub and Hugging Face.