ModelBest Releases MiniCPM-V 2.6, Matching GPT-4V in Edge Performance

Visual information in the real world is dynamic, and edge-side video understanding has a natural advantage in processing it. Devices at the edge, such as smartphones, PCs, AR headsets, robots, and smart vehicles, come with built-in cameras that give them inherent multimodal input capabilities.

Compared to the cloud, the edge sits closer to the user, with a shorter data path and higher efficiency, and it also offers stronger information-security guarantees.

On August 6th, Chinese large model company ModelBest officially released MiniCPM-V 2.6, an edge-side model whose performance fully matches GPT-4V.

According to the introduction, MiniCPM-V 2.6 is the first edge-side model to fully surpass GPT-4V across the core multimodal capabilities of single-image, multi-image, and video understanding, achieving state-of-the-art results among models below 20 billion parameters in all three. Its single-image understanding performance is on par with Gemini 1.5 Pro and GPT-4o mini.

In terms of knowledge density, MiniCPM-V 2.6 uses 30% fewer visual tokens than the previous generation and 75% fewer than comparable models, giving it twice the pixel density per visual token (token density) of GPT-4o.
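Read loosely (this is a simplified, illustrative definition rather than ModelBest's published formula), token density measures how many image pixels each visual token ends up encoding:

```latex
% Illustrative definition: pixels encoded per visual token, at a fixed input resolution
\text{token density} = \frac{\text{number of image pixels encoded}}{\text{number of visual tokens produced}}
```

Under this reading, producing 75% fewer visual tokens for the same input resolution would raise token density roughly fourfold, which is why the metric serves as a proxy for how cheaply the model represents an image.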

It is worth mentioning that ModelBest has also brought capabilities such as ‘real-time’ video understanding, multi-image joint understanding, and multi-image in-context learning (ICL) to the edge for the first time.

After quantization, the model occupies only 6 GB of memory on the device, and edge inference speed reaches 18 tokens/s, 33% faster than the previous generation. It supports llama.cpp, ollama, and vLLM inference from the moment of release, and supports multiple languages.
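As a rough sketch of what running the quantized model locally could look like, here is a minimal example using the ollama Python client; the model tag `minicpm-v`, the image path, and the prompt are assumptions for illustration rather than details from the release:

```python
# Minimal local-inference sketch via the ollama Python client.
# Assumes `pip install ollama`, a running ollama server, and that a MiniCPM-V
# build is published under the tag "minicpm-v" (an assumption; check `ollama list`).
import ollama

response = ollama.chat(
    model="minicpm-v",  # hypothetical tag for the quantized edge build
    messages=[{
        "role": "user",
        "content": "Describe this photo in one sentence.",
        "images": ["./photo.jpg"],  # local image path handed to the multimodal model
    }],
)
print(response["message"]["content"])
```

The same prompt-plus-image pattern carries over to llama.cpp and vLLM, though each backend has its own loading and quantization options.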

Additionally, for ‘too long; didn’t watch’ videos, you can now drag the file straight in and let the model summarize the key information for you, with no need to watch to the end, speed up playback, or fast-forward.

In one demo, an approximately one-minute weather forecast video, MiniCPM-V 2.6 leverages its strong video OCR capability to read the dense on-screen text without any audio, giving detailed weather descriptions for the different cities covered in different segments of the video.
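A minimal sketch of how such a summary could be driven programmatically is shown below. It assumes the `openbmb/MiniCPM-V-2_6` checkpoint and the chat-style interface from its Hugging Face model card (loaded with `trust_remote_code=True`); the sampling rate, file name, and prompt are illustrative:

```python
# Sketch: sample frames from a video and ask MiniCPM-V 2.6 to summarize them.
# Assumes the openbmb/MiniCPM-V-2_6 checkpoint and the chat() interface shown on
# its model card; exact arguments may differ between releases.
import cv2
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained("openbmb/MiniCPM-V-2_6", trust_remote_code=True)

def sample_frames(path, every_n=30, max_frames=32):
    """Keep one frame out of every `every_n` (about 1 fps for 30 fps footage)."""
    cap = cv2.VideoCapture(path)
    frames, i = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        i += 1
    cap.release()
    return frames

frames = sample_frames("weather_forecast.mp4")
question = "Summarize the weather forecast for each city shown on screen."
msgs = [{"role": "user", "content": frames + [question]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```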

Beyond video understanding, the newly released MiniCPM-V 2.6 also brings multi-image joint understanding and multi-image ICL (in-context few-shot learning) to an edge model for the first time, a capability that GPT-4V previously took pride in.

For multi-image joint understanding, consider a familiar scenario: bookkeeping or expense reimbursement, where the dense numbers on receipts are hard to decipher, to say nothing of the tedious work of adding up the total.

Here, you can simply photograph the receipts and send them all to MiniCPM-V 2.6. Drawing on its OCR and CoT (chain-of-thought) capabilities, it can identify the amount on each receipt and calculate the total.
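Reusing the `model` and `tokenizer` from the video sketch above, a multi-image query of this kind could look like the following; the receipt file names and the prompt are illustrative:

```python
# Sketch: send several receipt photos in one turn and ask for a step-by-step total.
# Reuses `model` and `tokenizer` from the earlier sketch; the chat() call follows
# the same model-card pattern and the file names are placeholders.
from PIL import Image

receipt_paths = ["receipt_1.jpg", "receipt_2.jpg", "receipt_3.jpg"]
receipts = [Image.open(p).convert("RGB") for p in receipt_paths]
question = "Read the total amount on each receipt, then add them up step by step."
msgs = [{"role": "user", "content": receipts + [question]}]
print(model.chat(image=None, msgs=msgs, tokenizer=tokenizer))
```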

Not only that: MiniCPM-V 2.6 has also caught up with GPT-4V in multimodal reasoning on the edge.

For example, consider the classic task demonstrated with GPT-4V: adjusting a bicycle seat. The problem is trivial for humans but quite challenging for a model, as it tests a multimodal model's complex reasoning ability and its grasp of physical knowledge.

The 8B MiniCPM-V 2.6 proves well up to this challenge: through a multi-image, multi-turn dialogue, it clearly communicates each detailed step needed to lower the bicycle seat, and it can also help you find the right tool based on the instructions and a photo of the toolbox.

Achieving more with less is the core competitive strength of edge models.

The 8-billion-parameter MiniCPM-V 2.6 not only catches up with GPT-4V in overall performance but also marks the first time an edge model has comprehensively surpassed GPT-4V across the three core multimodal capabilities of single-image, multi-image, and video understanding, achieving state-of-the-art performance among models under 20 billion parameters in all three.

In terms of knowledge compression, MiniCPM-V 2.6 is exceptionally efficient, achieving a token density (pixels encoded per visual token) twice that of GPT-4o among large multimodal models.

For single-image understanding, MiniCPM-V 2.6 significantly surpasses Gemini 1.5 Pro and GPT-4o mini on the authoritative OpenCompass evaluation. For multi-image understanding, it achieves state-of-the-art performance among open-source models on Mantis-Eval and surpasses GPT-4V. And for video understanding, it reaches edge-side state-of-the-art on Video-MME, again surpassing GPT-4V.

Additionally, on OCRBench, MiniCPM-V 2.6's OCR performance is state-of-the-art among both open-source and closed-source models, while on the Object HalBench hallucination leaderboard (where a lower hallucination rate is better) it also outperforms commercial models such as GPT-4o, GPT-4V, and Claude 3.5 Sonnet.

According to ModelBest, MiniCPM-V 2.6's leap from single-point strengths to across-the-board advantages comes not only from the stronger Qwen2-7B base model but also from a unified high-resolution visual architecture, which lets the traditional multimodal strengths of single-image understanding carry over seamlessly to the other capabilities.
