
Designed to tackle large models "grinding" test questions: Jia Jiaya's team's new benchmark has models find errors instead of solving problems

2024-07-18


  • Contributed by the MR-Ben team
    QbitAI (Quantum Bit) | WeChat official account QbitAI

The problem of large models scoring high on tests yet performing poorly in real scenarios may now have a solution.

Jia Jiaya's team, in collaboration with several well-known universities, proposed a new evaluation method that immediately exposed the true capabilities of some models.

There is no longer a worry that a model which has "ground through" too many practice questions will make the test set fail to reflect its real ability.



This new evaluation dataset is called MR-Ben, and it reuses existing questions from datasets such as GSM8K and MMLU.

However, the model's role in the test changes from "student answering the questions" to "teacher grading the exam": it must point out the errors in existing solution steps.

This way, the model can no longer answer questions correctly by memorizing or guessing, and there is no need to worry about test questions being leaked.

Using MR-Ben, Jia's team evaluated many open-source and closed-source models, including GPT4-Turbo, Claude 3.5 Sonnet, GLM4, and Qwen2-70B.

All the code and data for this dataset have now been open-sourced.

Familiar test questions, new tasks

At present, the mainstream approach to testing large models is to use human-style standardized tests: multiple-choice and fill-in-the-blank questions.

The advantages of this testing method are that the standards are clear, the metrics are intuitive, and the quantitative results naturally attract attention.

However, the authors argue that since today's large models generally generate their final answers through step-by-step chain-of-thought reasoning, this method is not "reliable".

A pre-trained model has already seen trillions of tokens during pre-training, so it is hard to tell whether the model being evaluated has already seen the corresponding data and is simply answering correctly by "reciting the questions".

And since the evaluation mainly relies on checking the final answer, it is also unclear whether the model picks the correct option through genuine understanding and reasoning.

Although the academic community has kept upgrading and adapting datasets such as GSM8K and MMLU, for example by building the multilingual MGSM version on top of GSM8K or introducing harder questions based on MMLU, these efforts still do not break out of the rut of multiple-choice and fill-in-the-blank questions.

Moreover, these datasets face a serious saturation problem: large language models' scores on them have peaked and are gradually losing their discriminative power.

To this end, Jia Jiaya's team, in collaboration with several well-known universities such as MIT, Tsinghua University, and Cambridge, together with leading domestic annotation companies, annotated MR-Ben, an evaluation dataset focused on the reasoning process of complex problems.



MR-Ben takes questions from GSM8K, MMLU, LogiQA, MHPP and other datasets that large models are routinely tested on, and applies a "paper marking" paradigm shift. The resulting dataset is harder and more discriminative, and can more faithfully reflect a model's reasoning ability.

There is no need to create new questions or to transform existing ones to test the model's robustness. MR-Ben directly changes the model from a "question answerer" into a "marker" that judges the existing solution processes in the dataset. By having the large model act as a teacher, the benchmark tests its actual grasp of the knowledge points.

Specifically, Jia Jiaya's team sorted the mainstream evaluation datasets on the market, such as GSM8K, MMLU, LogiQA, and MHPP, into categories including mathematics, physics, chemistry, biology, code, logic, and medicine, and distinguished different difficulty levels.

For each category and each collected question, the team carefully gathered the corresponding step-by-step solution process and had it annotated by trained, professional master's- and doctoral-level annotators.

During marking, the annotators point out in detail whether the solution process is correct, where the error occurs, and what causes it. Comparing the large model's marking results with those of human experts then reveals how well the model has mastered the knowledge points.
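To make the annotation concrete, here is a minimal sketch in Python of what one graded record and a naive model-vs-expert agreement check could look like. The field names and the metric are illustrative assumptions, not the released MR-Ben schema or official scoring code.

```python
# Illustrative sketch only: field names and the metric are assumptions,
# not the official MR-Ben data format or scoring script.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MarkingRecord:
    subject: str                      # e.g. "math", "logic", "medicine"
    question: str                     # original question from GSM8K / MMLU / LogiQA / MHPP
    solution_steps: List[str]         # the step-by-step candidate solution to be graded
    is_correct: bool                  # expert verdict: is the whole solution correct?
    first_error_step: Optional[int]   # index of the first incorrect step, if any
    error_reason: Optional[str]       # expert explanation of why that step is wrong


def agreement(model_verdicts: List[dict], gold: List[MarkingRecord]) -> float:
    """Toy metric: fraction of solutions where the model's verdict and,
    for wrong solutions, its located error step match the experts'."""
    hits = 0
    for pred, ref in zip(model_verdicts, gold):
        same_verdict = pred["is_correct"] == ref.is_correct
        same_step = ref.is_correct or pred.get("first_error_step") == ref.first_error_step
        hits += int(same_verdict and same_step)
    return hits / max(len(gold), 1)
```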



In terms of the evaluation method, MR-Ben requires the model to carefully analyze the premises, assumptions, and logic of each step in the solution process, and to rehearse the reasoning to judge whether the current step can lead to the correct answer.
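As a rough illustration of this task format, the sketch below assembles one possible grading prompt, optionally prepending in-context demonstrations (the k=0 / k=1 settings reported in the results). The wording is an assumption for illustration; MR-Ben's actual prompts may differ.

```python
# Hypothetical prompt construction for the "marking" task; not MR-Ben's exact prompt.
def build_marking_prompt(question: str, solution_steps: list[str],
                         demos: list[str] | None = None) -> str:
    demo_text = "\n\n".join(demos) if demos else ""   # k=0 -> no demo, k=1 -> one worked demo
    steps = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(solution_steps))
    return (
        f"{demo_text}\n\n"
        "You are grading a student's step-by-step solution.\n"
        f"Question: {question}\n"
        f"Proposed solution:\n{steps}\n\n"
        "For each step, check its premises, assumptions, and logic, then answer:\n"
        "1. Is the overall solution correct? (yes/no)\n"
        "2. If not, which step is the first incorrect one?\n"
        "3. Briefly explain the cause of the error."
    ).strip()
```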

This "marking" type of assessment is much more difficult than the assessment method of just answering questions, but it can effectively avoid the problem of inflated scores caused by memorizing questions from models. However, it is difficult for a student who can only memorize questions to become a qualified marker.

GPT4-Turbo performs best

Jia Jiaya's team evaluated several well-known large models, and some models had multiple versions involved in the test.



It can be seen that among the closed-source models, GPT4-Turbo performed the best (even though calculation errors still surfaced during its "marking"). In most subjects, it was ahead of the other models both with a demonstration (k=1) and without one (k=0).

The GLM model from the Zhipu team ranked second, surpassing Claude's latest 3.5 Sonnet.

However, the models are clearly differentiated: even the strongest, GPT4-Turbo, scored below 50 points on the MR-Ben dataset, which shows that performance on it is far from saturated.



In addition, some open source models with strong performance have already caught up with some commercial models.



In addition, the MR-Ben team also discovered some interesting phenomena during their work, such as:

  • In low-resource scenarios, small models can still shine. In the MR-Ben evaluation, Phi-3-mini stands out among the small models, matching or even surpassing large models with tens of billions of parameters, which demonstrates the importance of fine-tuning data.
  • The MR-Ben scenario involves complex logical parsing and step-by-step reasoning. In few-shot mode, an excessively long context can confuse the model and cause performance to drop.
  • MR-Ben ran a number of generate-reflect-regenerate ablation experiments to compare different prompting strategies. These had no effect on low-tier models and only a marginal effect on top-tier models such as GPT4-Turbo; mid-tier models improved slightly, since they sometimes corrected wrong judgments into right ones (and occasionally the reverse).
  • After roughly dividing the MR-Ben subjects into knowledge-, logic-, calculation-, and algorithm-oriented types, different models show their own strengths and weaknesses across these reasoning types.

Jia Jiaya's team has uploaded a one-click evaluation method on GitHub. Each test run consumes roughly 12M tokens. Developers can evaluate their own models and submit the results, and the MR-Ben team will update the corresponding leaderboard in a timely manner.

Paper address:
https://arxiv.org/abs/2406.13975
Project homepage:
https://randolph-zeng.github.io/Mr-Ben.github.io/
Github Repo:
https://github.com/dvlab-research/Mr-Ben