Chinese multimodal understanding rankings released: Tencent Hunyuan ranked first in China

2024-08-03

Multimodal understanding is one of the key capabilities of large models to understand the complex real world.

On August 2, the August edition of the SuperCLUE-V benchmark for Chinese multimodal large models was released. Tencent's Hunyuan large model stood out among the many participating models with its outstanding multimodal understanding, taking first place in the domestic large-model ranking and landing firmly in the "Excellence Leader" quadrant.


Multimodal understanding, commonly known as "image-to-text", requires a model to accurately identify the elements in an image, understand the relationships between them, and generate a natural-language description. This tests not only the accuracy of image recognition but also the model's comprehensive grasp of the scene, its insight into details, and its understanding of the complex real world.

This evaluation covers 12 of the most representative multimodal understanding models from China and abroad, including 4 overseas models and 8 domestic ones. It spans two major directions, basic capabilities and application capabilities, and tests the models with open-ended questions. Tencent's Hunyuan large model scored a high 71.95 across both, demonstrating comprehensive strength at the technical and application layers.


According to SuperCLUE's official introduction, the evaluation criteria cover dimensions such as understanding accuracy, response relevance, and reasoning depth. The scoring rules combine automated quantitative scoring with expert review to ensure the evaluation is scientific and fair.

The results show that domestic large models have already approached the top overseas models in the basic capabilities of multimodal understanding. Tencent's Hunyuan large model scored only slightly below GPT-4o overall and outperformed Claude 3.5 Sonnet and Gemini 1.5 Pro, reflecting the rapid iteration of domestic models on basic capabilities. On application capabilities, the Hunyuan model shows great potential for practical use thanks to its deep understanding of Chinese context and its well-rounded performance in general knowledge, common sense, image understanding, and other areas.


Built on the technical foundation of Tencent's Hunyuan large model, the AI-native application Tencent Yuanbao has offered multimodal understanding since its initial release. Whether the input is a document screenshot, a portrait or landscape photo, a cash-register receipt, or any casual snapshot, Yuanbao can analyze and interpret the content of the picture.


Jiang Jie, vice president of Tencent, previously said that multimodality is a "must-answer question" for Tencent's Hunyuan large model. The Hunyuan model is currently expanding its technology from multimodality toward full modality. Users will soon be able to experience these capabilities in the Tencent Yuanbao app and in Tencent's internal businesses and scenarios, and they will also be opened to external applications through Tencent Cloud.

At present, Tencent's Hunyuan large model has grown to a trillion-parameter scale and was the first in China to adopt a Mixture-of-Experts (MoE) architecture. Building on the capabilities of Tencent's large language model, its multimodal understanding continues to improve and has reached a domestically leading level.

Leifeng Network