On August 25th, Alibaba Cloud launched two open-source large vision language models (LVLMs): Qwen-VL and its conversationally fine-tuned variant, Qwen-VL-Chat. Qwen-VL is the multimodal counterpart of Qwen-7B, the 7-billion-parameter version of Alibaba Cloud’s large language model Tongyi Qianwen. Capable of understanding both image inputs and text prompts in English and Chinese, Qwen-VL can perform tasks such as answering open-ended questions about images and generating image captions.
Qwen-VL is a vision language (VL) model that supports multiple languages including Chinese and English. Compared to previous VL models, Qwen-VL not only has basic abilities in image recognition, description, question answering, and dialogue, but also adds capabilities such as visual localization and understanding of text within images.
For example, if a foreign tourist who doesn’t understand Chinese goes to a hospital for treatment and doesn’t know how to reach the right department, they can take a picture of the floor guide map and ask Qwen-VL, “Which floor is the orthopedics department on?” or “Where should I go for ENT?” Qwen-VL will reply in text based on the information in the image. This is its image question-answering capability. As another example, given a photo of Shanghai’s Bund and a request to find the Oriental Pearl Tower, Qwen-VL can accurately outline the corresponding building with a detection box. This demonstrates its visual localization ability.
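Localization results like the detection box above arrive as inline markup in the model’s text output: each located object is wrapped in `<ref>…</ref>` tags followed by a `<box>(x1,y1),(x2,y2)</box>` span, with coordinates normalized to a 0–1000 grid. As an illustrative sketch (the tag convention follows the released model’s output format; the helper function itself is ours, not an official API), such output can be parsed with a regular expression:

```python
import re

# Qwen-VL grounding output embeds boxes as:
#   <ref>object name</ref><box>(x1,y1),(x2,y2)</box>
# with coordinates normalized to a 0-1000 grid.
BOX_PATTERN = re.compile(
    r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)

def parse_grounding(text, image_width, image_height):
    """Extract (label, pixel-space box) pairs from Qwen-VL output."""
    results = []
    for label, x1, y1, x2, y2 in BOX_PATTERN.findall(text):
        # Rescale the normalized 0-1000 coordinates to pixel space.
        box = (
            int(x1) * image_width // 1000,
            int(y1) * image_height // 1000,
            int(x2) * image_width // 1000,
            int(y2) * image_height // 1000,
        )
        results.append((label.strip(), box))
    return results

# Example: a hypothetical reply locating the Oriental Pearl Tower
# (东方明珠) in a 1000x800 photo of the Bund.
reply = "<ref>东方明珠</ref><box>(310,120),(480,860)</box>"
print(parse_grounding(reply, 1000, 800))
```

A downstream application (a robot controller, say) can then act on the pixel-space boxes directly.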
Qwen-VL is built on the Qwen-7B language model, adding a visual encoder to the architecture to support visual input. Through the design of its training process, the model can perceive and understand visual signals at a fine-grained level. Qwen-VL supports an image input resolution of 448×448, higher than the 224×224 typically supported by previously open-sourced LVLMs. Building on Qwen-VL, the Tongyi Qianwen team has developed Qwen-VL-Chat, an LLM-based visual AI assistant trained with alignment techniques, which lets developers quickly build dialogue applications with multimodal capabilities.
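Because the model accepts interleaved image and text input, a multimodal query is ultimately serialized into a single prompt string in which each image is numbered and wrapped in `<img>…</img>` tags. The sketch below mimics that serialization (the tag convention follows the list format accepted by the released Qwen-VL-Chat tokenizer; the helper itself is illustrative, not the official API):

```python
def build_query(segments):
    """Serialize interleaved image/text segments into a Qwen-VL prompt.

    Each segment is a dict with either an 'image' key (a path or URL)
    or a 'text' key. Images are numbered and wrapped in <img>...</img>
    tags so the model can distinguish image references from prose.
    """
    parts = []
    image_index = 0
    for seg in segments:
        if "image" in seg:
            image_index += 1
            parts.append(f"Picture {image_index}: <img>{seg['image']}</img>\n")
        elif "text" in seg:
            parts.append(seg["text"])
    return "".join(parts)

# A single-image question, as in the hospital floor-guide example above.
query = build_query([
    {"image": "floor_guide.jpg"},
    {"text": "Which floor is the orthopedics department on?"},
])
print(query)
```

In a real application, the resulting string would be passed to the model’s chat interface along with the running dialogue history.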
Multimodality is one of the important technological directions in general artificial intelligence. It is widely believed that moving from a single-sensory, text-only language model to a multimodal model that accepts text, images, audio, and other forms of input is a significant step toward more general and capable models. Multimodality enhances the understanding capabilities of large models and greatly expands their range of applications.
Vision is the primary sensory ability of humans and it is also the first modality that researchers aim to incorporate into large-scale models. Following the release of M6 and OFA series multimodal models, Alibaba Cloud’s Tongyi Qianwen team has now open-sourced a large-scale vision language model (LVLM) called Qwen-VL, based on Qwen-7B.
Qwen-VL is the industry’s first universal model that supports Chinese open-domain visual localization. The ability of open-domain visual localization determines the accuracy of large models’ “vision”, that is, whether they can accurately identify desired objects in images. This is crucial for the practical application of VL models in scenarios such as robot control.
In mainstream multimodal task evaluation and multimodal conversational ability evaluation, Qwen-VL has achieved performance far beyond that of equivalent-sized general models.
In standard English evaluations across four major multimodal tasks (zero-shot captioning, VQA, DocVQA, and grounding), Qwen-VL achieved the best performance among open-source LVLMs of similar size. To test multimodal dialogue capability, the Tongyi Qianwen team built a test set called “Shijinshi” (TouchStone) based on a GPT-4 scoring mechanism and ran comparative tests on Qwen-VL-Chat and other models. Qwen-VL-Chat achieved the best results among open-source LVLMs in both the Chinese and English alignment evaluations.
Qwen-VL and its visual AI assistant Qwen-VL-Chat have been launched on ModelScope; both are open-source, free, and available for commercial use. Users can download the models directly from ModelScope, or access and invoke Qwen-VL and Qwen-VL-Chat through Alibaba Cloud DashScope. Alibaba Cloud provides users with comprehensive services covering model training, inference, deployment, and fine-tuning.
In early August, Alibaba Cloud open-sourced the general-purpose model Qwen-7B and the dialogue model Qwen-7B-Chat, each with 7 billion parameters, making it the first major Chinese technology company to join the ranks of open-source large models. The release immediately gained widespread attention and quickly climbed Hugging Face’s trending list that week. In less than a month, the project received over 3,400 stars on GitHub, and its cumulative downloads have exceeded 400,000.