
Multimodal model evaluation framework lmms-eval released! Comprehensive coverage, low cost, zero contamination

2024-08-21


AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world and effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

As research on large models deepens, how to extend them to more modalities has become a hot topic in both academia and industry. GPT-4o, Claude 3.5, and others already have strong image understanding capabilities, and open-source models such as LLaVA-NeXT, MiniCPM, and InternVL are showing performance that is increasingly close to that of closed-source models.


In an era of "80,000 kilograms per mu" claims and "a new SoTA every 10 days", a simple, easy-to-use, standardized, transparent, and reproducible multimodal evaluation framework is becoming increasingly important, but building one is no easy task.


To address these problems, researchers from LMMs-Lab at Nanyang Technological University have open-sourced LMMs-Eval, an evaluation framework designed specifically for large multimodal models (LMMs), providing a one-stop, efficient solution for LMM evaluation.


  • Code repository: https://github.com/EvolvingLMMs-Lab/lmms-eval

  • Official homepage: https://lmms-lab.github.io/

  • Paper address: https://arxiv.org/abs/2407.12772

  • Leaderboard: https://huggingface.co/spaces/lmms-lab/LiveBench


Since its release in March 2024, the LMMs-Eval framework has received contributions from the open-source community, companies, and universities. It has earned 1.1K stars on GitHub, has more than 30 contributors, and covers more than 80 datasets and over 10 models, with all of these numbers still growing.

 

A Standardized Evaluation Framework


In order to provide a standardized evaluation platform, LMMs-Eval includes the following features:


  1. Unified interface: LMMs-Eval builds on and extends the text evaluation framework lm-evaluation-harness. By defining a unified interface for models, datasets, and evaluation metrics, it makes it easy for users to add new multimodal models and datasets on their own.

  2. One-click launch: LMMs-Eval hosts more than 80 (and growing) datasets on Hugging Face, carefully converted from the original sources and including all variants, versions, and splits. Users need no preparation: with a single command, multiple datasets and models are downloaded and tested automatically, and results arrive in minutes (see the launch sketch after this list).

  3. Transparency and reproducibility: LMMs-Eval has a built-in unified logging tool. Every question answered by the model, and whether it was answered correctly, is recorded, ensuring reproducibility and transparency and making it easy to compare the strengths and weaknesses of different models.
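As a concrete illustration of the one-click launch above, the sketch below drives the lmms-eval command line from Python. It is a minimal sketch only: the model name, checkpoint, task list, and flags are assumptions for illustration, and the exact CLI options should be checked against the repository README.

```python
import subprocess

# Minimal sketch of a one-command evaluation run.
# Model name, checkpoint, tasks, and flags below are illustrative assumptions;
# consult the lmms-eval README for the options supported by your version.
cmd = [
    "python", "-m", "lmms_eval",
    "--model", "llava",                                      # which model wrapper to use
    "--model_args", "pretrained=liuhaotian/llava-v1.5-7b",   # checkpoint to evaluate
    "--tasks", "mme,mmbench_en",                             # datasets are downloaded automatically
    "--batch_size", "1",
    "--log_samples",                                         # record every question/answer for reproducibility
    "--output_path", "./logs/",
]
subprocess.run(cmd, check=True)
```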


The vision behind LMMs-Eval is that future multimodal model developers will no longer need to write their own data processing, inference, and submission code. Given the current multimodal test-set landscape, doing so is not only impractical, it also makes the resulting scores hard to compare directly with those of other models. By plugging into LMMs-Eval, model trainers can focus on improving and optimizing the model itself, rather than spending time on evaluation and on aligning results.


The “Impossible Triangle” of Evaluation


The ultimate goal of LMMs-Eval is a way to evaluate LMMs with (1) wide coverage, (2) low cost, and (3) zero data contamination. However, even with LMMs-Eval, the team found it difficult, and perhaps impossible, to achieve all three at the same time.


As shown in the figure below, once the evaluation suite grew to more than 50 datasets, running a full evaluation across all of them became very time-consuming. In addition, these benchmarks are susceptible to contamination during training. To address this, LMMs-Eval proposes LMMs-Eval-Lite to balance wide coverage with low cost, and LiveBench to achieve low cost and zero contamination.

 

LMMs-Eval-Lite: Wide Coverage Lightweight Evaluation

 

When evaluating large models, the sheer number of parameters and test tasks drives evaluation time and cost up dramatically, so people often fall back on smaller or task-specific datasets. Such limited evaluation, however, leaves gaps in understanding a model's capabilities. To balance evaluation diversity against evaluation cost, LMMs-Eval introduces LMMs-Eval-Lite.

 

LMMs-Eval-Lite aims to build a pruned benchmark suite that provides useful, fast signal during model development, avoiding the bloat of current test sets. If a subset of an existing test set can be found on which models' absolute scores and relative rankings stay close to those on the full set, then pruning to that subset can be considered safe.


To find representative data points within each dataset, LMMs-Eval first converts the multimodal evaluation data into vector embeddings using CLIP and BGE models, and then applies k-center greedy selection to pick those points. In tests, the resulting smaller datasets still show evaluation behavior similar to the full sets.
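Below is a minimal sketch of how such representative points might be selected, assuming the CLIP/BGE embeddings have already been computed: a k-center greedy pass picks, at each step, the example farthest from everything chosen so far. This illustrates the general technique, not the exact LMMs-Eval implementation.

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Pick k indices whose points cover the embedding space:
    each step adds the point farthest from the current selection."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]                         # arbitrary starting point
    # distance from every point to its nearest selected point
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))                           # farthest-from-coverage point
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Usage sketch: embeddings could come from CLIP (images) and/or BGE (text),
# concatenated per example; k is the target size of the Lite subset.
# subset_idx = k_center_greedy(example_embeddings, k=500)
```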

 

Using the same method, LMMs-Eval then built a Lite version covering additional datasets, designed to cut evaluation costs during development so that model performance can be judged quickly.

 

LiveBench: Dynamic Testing of LMMs

Traditional benchmarks rely on static evaluation with fixed questions and answers. As multimodal research has progressed, open-source models often outperform commercial models such as GPT-4V in score comparisons, yet fall short in actual user experience. Dynamic, user-facing evaluations such as Chatbot Arena and WildVision are becoming increasingly popular, but they require collecting thousands of user preferences, which makes them extremely expensive to run.


The core idea of LiveBench is to evaluate models on a continuously updated dataset, achieving zero contamination while keeping costs low. The team built a pipeline that automatically collects the latest global information from news sites and community forums. To ensure timeliness and authenticity, sources were selected from more than 60 news outlets, including CNN, the BBC, Japan's Asahi Shimbun, and China's Xinhua News Agency, as well as forums such as Reddit. The specific steps are as follows:


  1. Capture a screenshot of the homepage and remove ads and non-news elements.

  2. Use the most capable current multimodal models, such as GPT-4V, Claude-3-Opus, and Gemini-1.5-Pro, to design question and answer sets.

  3. Have another model review and revise the questions to ensure accuracy and relevance.

  4. The final question-and-answer set is manually reviewed; approximately 500 questions are collected each month, of which 100-300 are retained as the final LiveBench question set.

  5. The scoring criteria of LLaVA-Wilder and Vibe-Eval are adopted: a scoring model grades each answer against the provided reference answer on a scale of [1, 10]. The default scoring model is GPT-4o, with Claude-3-Opus and Gemini 1.5 Pro included as alternatives. Reported results are converted from these scores into an accuracy metric on a 0-100 scale (a small aggregation sketch follows this list).
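As referenced in step 5, here is a small sketch of the final aggregation step: per-question judge scores in [1, 10] are averaged and mapped to a 0-100 accuracy-style number. The `judge` callable and the linear (score - 1) / 9 * 100 mapping are assumptions for illustration; the exact judge prompt and conversion used by LiveBench may differ.

```python
from statistics import mean
from typing import Callable

def livebench_report(
    judge: Callable[[str, str, str], int],     # (question, reference_answer, model_answer) -> score in [1, 10]
    examples: list[tuple[str, str, str]],
) -> float:
    """Average per-question judge scores and rescale to 0-100.

    The judge callable stands in for a scoring model such as GPT-4o;
    the linear (score - 1) / 9 * 100 mapping is an illustrative assumption,
    not necessarily the exact conversion LiveBench uses.
    """
    scores = [judge(q, ref, ans) for q, ref, ans in examples]
    return (mean(scores) - 1) / 9 * 100

# Example with a dummy judge that always returns 7: reports ~66.7 on the 0-100 scale.
print(livebench_report(lambda q, ref, ans: 7, [("q1", "ref1", "ans1"), ("q2", "ref2", "ans2")]))
```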

 

Going forward, the latest evaluation data and results for multimodal models can be viewed each month on the dynamically updated leaderboard.