
Alibaba Cloud's Qwen2-VL second-generation vision-language model is open source

2024-09-02


IT Home reported on September 2 that Alibaba Cloud Tongyi Qianwen announced today that it has open-sourced Qwen2-VL, its second-generation vision-language model, releasing 2B and 7B sizes along with their quantized versions. At the same time, the API of the flagship model Qwen2-VL-72B has gone live on the Alibaba Cloud Bailian platform, where users can call it directly.

According to Alibaba Cloud, Qwen2-VL's baseline performance has improved across the board compared with the previous generation:

it can understand images of different resolutions and aspect ratios, achieving world-leading results on benchmarks such as DocVQA, RealWorldQA, and MTVQA;

it can understand videos longer than 20 minutes and supports video-based question answering, dialogue, and content creation;

it has strong visual-agent capabilities: with complex reasoning and decision-making, Qwen2-VL can be integrated into mobile phones, robots, and other devices and operate them autonomously based on the visual environment and text instructions;

it can understand multilingual text in images and videos, including Chinese, English, most European languages, Japanese, Korean, Arabic, Vietnamese, and more.

Qwen2-VL keeps the series' ViT-plus-Qwen2 architecture. All three model sizes use a roughly 600M-parameter ViT and support unified image and video input.

To let the model perceive visual information more clearly and understand videos better, the team upgraded the architecture in two ways:

First, it adds full support for native dynamic resolution. Unlike the previous generation, Qwen2-VL can process image inputs of any resolution: images of different sizes are converted into a dynamic number of tokens, with a minimum of just 4. This design mirrors the way human vision naturally works, keeps the model input highly consistent with the original image, and lets the model handle images of any size flexibly and efficiently.
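The back-of-the-envelope sketch below shows how an arbitrary resolution could map to a dynamic token count. The patch size (14 px) and the 2x2 patch-merge factor are assumptions about the released models, not figures from this article; the article itself only states that the minimum is 4 tokens.

```python
import math

def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    # Round each side up to a whole number of merged patches, then count them.
    # patch and merge are assumed values for illustration only.
    unit = patch * merge                         # pixels covered per token along one side
    h_tokens = max(math.ceil(height / unit), 2)  # floor of 2 x 2 keeps at least 4 tokens
    w_tokens = max(math.ceil(width / unit), 2)
    return h_tokens * w_tokens

print(visual_token_count(56, 56))     # -> 4, the stated minimum
print(visual_token_count(1024, 768))  # larger images get proportionally more tokens
```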

Second, it uses multimodal rotary position embedding (M-RoPE). Traditional rotary position embedding can only capture the position information of one-dimensional sequences; M-RoPE lets the language model simultaneously capture and fuse positional information from one-dimensional text sequences, two-dimensional images, and three-dimensional videos, giving it strong multimodal processing and reasoning ability and allowing it to model complex multimodal data more effectively.
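As a conceptual sketch of the idea, each token can carry three position indices (temporal, height, width), and the attention head dimension can be split into channel groups rotated by their own axis. The split ratio and the interleaved-pair rotation below are illustrative assumptions, not the exact released implementation.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE frequencies for one axis; returns (seq_len, dim // 2) angles.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    return torch.outer(positions.to(torch.float32), inv_freq)

def apply_mrope(x: torch.Tensor, pos_t: torch.Tensor,
                pos_h: torch.Tensor, pos_w: torch.Tensor) -> torch.Tensor:
    # x: (seq_len, head_dim). Assumed split: half the channels follow the temporal
    # index, a quarter each follow image height and width.
    seq_len, head_dim = x.shape
    dims = (head_dim // 2, head_dim // 4, head_dim // 4)
    angles = torch.cat([
        rope_angles(pos_t, dims[0]),
        rope_angles(pos_h, dims[1]),
        rope_angles(pos_w, dims[2]),
    ], dim=-1)                                   # (seq_len, head_dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin  # rotate each channel pair
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

# Image patches keep one temporal index and vary over the (row, column) grid;
# text tokens reuse the same index on all three axes.
q = torch.randn(6, 64)                      # 4 image patches followed by 2 text tokens
pos_t = torch.tensor([0, 0, 0, 0, 1, 2])
pos_h = torch.tensor([0, 0, 1, 1, 1, 2])
pos_w = torch.tensor([0, 1, 0, 1, 1, 2])
rotated = apply_mrope(q, pos_t, pos_h, pos_w)
```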

Among the open-sourced and hosted Qwen2-VL models, the flagship Qwen2-VL-72B is available as an API on the Alibaba Cloud Bailian platform, and users can call it directly there.
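A minimal sketch of calling the hosted model through the DashScope Python SDK is shown below. The model identifier "qwen2-vl-72b-instruct" and the image URL are assumptions; check the Bailian console for the exact model name available to your account.

```python
import os
import dashscope
from dashscope import MultiModalConversation

dashscope.api_key = os.environ["DASHSCOPE_API_KEY"]  # key issued on the Bailian platform

messages = [{
    "role": "user",
    "content": [
        {"image": "https://example.com/invoice.png"},   # hypothetical image URL
        {"text": "Summarize the key fields in this document."},
    ],
}]

response = MultiModalConversation.call(
    model="qwen2-vl-72b-instruct",  # assumed identifier for the flagship model
    messages=messages,
)
print(response.output.choices[0].message.content)  # assistant reply payload
```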

At the same time, the Tongyi Qianwen team has open-sourced Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license, and the open-source code has been integrated into Hugging Face Transformers, vLLM, and other third-party frameworks. Developers can download the models from Hugging Face and ModelScope, or try them via the Tongyi official website and the main dialogue page of the Tongyi app.
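For local inference with the open-sourced weights, a minimal sketch using Hugging Face Transformers might look like the following, assuming a recent Transformers release that ships Qwen2-VL support; the image URL and prompt are illustrative.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Hypothetical image URL used only for illustration.
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "What does this chart show?"},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```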