Zhipu AI Releases and Open-Sources Its GLM-4.6V Multimodal Model Series, Cuts API Prices by 50%

Published: December 9, 2025
Reading Time: 2 min read

GLM-4.6V delivers native multimodal tool calling and state-of-the-art benchmark performance, marking a major step in Zhipu AI's push toward unified visual-language intelligence.

Zhipu AI announced the release and open-sourcing of its new GLM-4.6V multimodal large model series on December 8. The lineup includes:

  • GLM-4.6V (106B-A12B): A foundation model designed for cloud environments and high-performance clusters;
  • GLM-4.6V-Flash (9B): A lightweight variant optimized for on-device deployment and low-latency use cases.

Zhipu AI highlights that traditional tool calling relies heavily on text input, making it inefficient and lossy when handling images, videos, or complex documents. GLM-4.6V, built around the design philosophy of "images as parameters, results as context," introduces native multimodal tool calling to eliminate these bottlenecks:

  • Multimodal Input: Images, screenshots, and document pages can be directly fed into tools without converting them to text descriptions first, reducing information loss and engineering overhead.

  • Multimodal Output: The model can visually interpret returned tool outputs—such as charts, rendered webpage snapshots, or retrieved product images—and integrate them into downstream reasoning.

This creates a full pipeline from perception to understanding to execution, enabling GLM-4.6V to better handle complex tasks such as mixed-format content generation, product recognition and price-value recommendations, and advanced agent workflows.
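To make the "images as parameters, results as context" idea concrete, below is a minimal sketch of what such a tool call could look like against an OpenAI-compatible chat-completions endpoint. The base URL, the model identifier "glm-4.6v", and the "lookup_product" tool are illustrative assumptions for this example, not confirmed details from Zhipu AI's announcement.

```python
# Sketch of a multimodal tool call: the image is passed to the model and to a
# tool directly, rather than being converted to a text description first.
from openai import OpenAI

client = OpenAI(
    base_url="https://open.bigmodel.cn/api/paas/v4/",  # assumed endpoint
    api_key="YOUR_API_KEY",
)

# Hypothetical tool: accepts an image as a parameter ("images as parameters")
# and returns product listings the model can reason over ("results as context").
tools = [{
    "type": "function",
    "function": {
        "name": "lookup_product",
        "description": "Look up a product from a photo and return price listings.",
        "parameters": {
            "type": "object",
            "properties": {
                "image_url": {
                    "type": "string",
                    "description": "URL of the product photo",
                },
            },
            "required": ["image_url"],
        },
    },
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Find this product and recommend the best-value listing."},
            {"type": "image_url", "image_url": {"url": "https://example.com/product.jpg"}},
        ],
    }],
    tools=tools,
)

# The model may emit a tool call whose argument is the image itself; the tool's
# output (e.g. retrieved product images and prices) is then fed back into the
# conversation for the final recommendation.
print(response.choices[0].message)
```

In a text-only tool-calling setup, the same workflow would first require captioning or OCR-ing the image into text before any tool could act on it, which is the information loss and engineering overhead the announcement says native multimodal tool calling avoids.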

Across over 30 mainstream multimodal benchmarks—including MMBench, MathVista, and OCRBench—GLM-4.6V demonstrates substantial improvements over its predecessor. At comparable parameter scales, the model achieves state-of-the-art performance in multimodal interaction, logical reasoning, and long-context understanding.

The compact GLM-4.6V-Flash (9B) outperforms Qwen3-VL-8B, while the 106B-total, 12B-active-parameter GLM-4.6V delivers performance competitive with Qwen3-VL-235B, despite the latter having more than twice as many total parameters.