MiniMax Releases Its First Text-to-Video Model
On August 31, MiniMax quietly launched its first video generation large model, releasing a two-minute video titled ‘Magic Coin’ generated entirely by the model.
Notably, MiniMax has not yet disclosed the model’s specific parameters or technical details. That day, MiniMax founder Yan Junjie said during a media group interview, ‘We have indeed made significant progress on the video model, and based on internal evaluations and scores, our video generation performs better than Runway’s.’
According to the company, the current video generation model is only a first version; a new version will be released soon, with continuous iteration on data, algorithms, and usability. At present it supports only text-to-video; image-to-video and combined text-and-image-to-video generation will follow.
‘Our strategy is to wait another week or two, until the new features reach a satisfactory state, before we consider commercialization,’ Yan Junjie added.
MiniMax’s commercialization currently has two parts: its open platform, which serves more than two thousand clients, and an advertising mechanism built into the company’s own products. ‘At this stage, the most important thing is not commercialization, but getting the technology to a level where it can be widely available,’ Yan Junjie said.
Still, MiniMax’s video generation model arrived a month or two later than Kuaishou’s KLING.
Yan Junjie explained that during this period the team was focused on harder technical problems, chiefly how to train such a compute-intensive capability. The difficulty is that training video generation requires converting videos into tokens, and the resulting token sequences are very long; the longer the sequence, the higher the computational complexity. The MiniMax team ultimately reduced this complexity through algorithmic work, achieving a higher compression ratio, which delayed the release by one to two months.
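To make the sequence-length issue concrete, here is a minimal back-of-the-envelope sketch in Python. The patch-based tokenizer, patch sizes, and clip settings are illustrative assumptions rather than anything MiniMax has disclosed; the point is how fast video token counts grow, and why a better compression ratio pays off so heavily when attention cost scales roughly with the square of sequence length.

```python
# Back-of-the-envelope sketch of the sequence-length problem.
# Patch sizes and clip settings are illustrative assumptions,
# not MiniMax's disclosed tokenizer design.

def video_token_count(frames: int, height: int, width: int,
                      patch_t: int = 4, patch_h: int = 16, patch_w: int = 16) -> int:
    """Tokens from a hypothetical spatiotemporal patch tokenizer."""
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

# A 5-second, 24 fps, 720p clip:
raw = video_token_count(frames=5 * 24, height=720, width=1280)
print(raw)  # 108000 tokens before any further compression

# Self-attention cost grows roughly with the square of sequence length,
# so a 2x better compression ratio cuts attention compute by ~4x.
for ratio in (1, 2, 4):
    tokens = raw // ratio
    print(f"compression {ratio}x -> {tokens} tokens, ~{tokens**2:.2e} attention pairs")
```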
He also said that whether for video, text, or sound, the MiniMax team’s core R&D philosophy is not to hunt for an algorithm that improves results by 5% or 10%: ‘If we can improve it several times over, we must make it happen; a mere 5% improvement is not worth pursuing.’
On why text-to-video matters, Yan Junjie argued that most of the content humans consume daily is images and video, with text accounting for only a small share. To reach broader user coverage and usage, a model must produce multimodal content rather than text alone, a position the company has held consistently.
Generating large video models is genuinely difficult. Yan Junjie explained that video is more complex to work with than text: the context of a video is inherently very long, which makes it hard to process.
Moreover, the volume of video data is enormous. A 5-second clip can run to several megabytes, while 100 characters of text may occupy less than 1 kilobyte, a difference of several thousand times in storage.
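The arithmetic behind that gap, using the article’s rough figures as assumptions (a clip of about 5 MB, 100 Chinese characters at roughly 3 bytes each in UTF-8):

```python
# The storage gap, using assumed sizes: a ~5 MB clip vs. 100 characters
# of UTF-8 text at roughly 3 bytes per (Chinese) character.
video_bytes = 5 * 1024 * 1024   # 5,242,880 bytes
text_bytes = 100 * 3            # 300 bytes, still under 1 KB
print(video_bytes / text_bytes) # ~17476, several thousand times larger
```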
A further challenge is that the infrastructure built for text does not carry over to video generation: processing, cleaning, and labeling video data all require the underlying infrastructure to be upgraded.
At the press conference that day, Yan Junjie emphasized ‘speed.’ In the long run, he believes, faster is better: whether the work is on MoE, linear attention, or other explorations, the essence is making models of the same quality run faster. As he put it: ‘Speed means that with the same computing power, what you train can become better.’
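As an illustration of the speed-for-equal-quality trade Yan Junjie describes, the sketch below contrasts standard softmax attention with a kernelized linear-attention variant. This is a generic textbook-style formulation, not MiniMax’s implementation; the feature map phi and all shapes are assumptions. Standard attention materializes an n-by-n matrix, while the linear variant reorders the multiplication so cost grows linearly in sequence length.

```python
import numpy as np

def softmax_attention(Q, K, V):
    # O(n^2 * d): materializes the full n x n attention matrix.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0) + 1e-6):
    # O(n * d^2): computes phi(K)^T V (a d x d matrix) once and
    # never forms the n x n attention matrix.
    Qp, Kp = phi(Q), phi(K)
    kv = Kp.T @ V                       # (d, d)
    z = Kp.sum(axis=0)                  # (d,) normalizer
    return (Qp @ kv) / (Qp @ z)[:, None]

n, d = 4096, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (4096, 64), without ever allocating a 4096 x 4096 matrix
```

The two functions are not numerically equivalent; kernelized attention approximates softmax attention in spirit rather than reproducing it. The design point is purely the order of operations: computing phi(K)^T V first keeps the intermediate at d-by-d instead of n-by-n.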
Also that day, Wei Weiye, head of the MiniMax open platform, said that large models still face challenges in effectiveness, cost, and multimodality.
First, large models inevitably hallucinate and can produce outputs that fall short of expectations because of weak instruction following and language understanding; hence, he argued, MiniMax must keep building models that are higher, faster, and stronger.
Second, from last year through the first half of this year, cost was the limiting factor that kept many companies from affording large models.
Since May of this year, a price war has swept the large model field, with API prices dropping to ‘bargain’ levels. Wei believes low costs will stimulate more application scenarios, and that API costs will fall further.
Third, multimodal capabilities will unlock more application scenarios. For instance, combining text and voice can help large models better recognize and express emotion, and integrating voice and video can produce short videos and advertisements with voiceovers.
There are still many open, non-consensus questions in the large model field: focus on B2B (business) or B2C (consumer)? Target the domestic market or go overseas? Will the Scaling Law continue to hold? On these shared industry questions, Yan Junjie was frank: despite the many challenges, MiniMax is among the most optimistic companies, optimistic about technological progress, about users, and about the efficiency of product iteration.