
ACL 2024 | Mathematical evaluation of 25 open-source and closed-source models, GPT-3.5-Turbo barely passed

2024-07-18


AIxiv is a column where Synced publishes academic and technical content. Over the past few years, Synced's AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

The authors of this article are from the University of Hong Kong and Tencent. Authors: Li Qintong, Leyang Cui, Zhao Xueliang, Kong Lingpeng, and Wei Bi. The first author, Li Qintong, is a doctoral student in the Natural Language Processing Laboratory at the University of Hong Kong, working on natural language generation and text reasoning; he and doctoral student Zhao Xueliang are both advised by Professor Kong Lingpeng. Leyang Cui and Wei Bi are senior researchers at Tencent.

Preface

Large language models (LLMs) are increasingly demonstrating extraordinary problem-solving ability. One notable recent phenomenon is that these models have achieved remarkable results on a number of mathematical reasoning benchmarks. For example, GPT-4 performs well on GSM8K [1], a challenging test set of elementary school word problems, with an accuracy above 90%, and many open-source models also show impressive strength, with accuracies above 80%.

However, we often find that when a math problem is changed only slightly, LLMs may make low-level errors, as shown in the following figure:



Figure 1: GPT-3.5-Turbo correctly solves a math problem (left), but when a restriction is added to the original problem (right), Turbo misuses an operator because it fails to distinguish between the directions of "leaving" and "returning".

We can’t help but ask: Do large language models really grasp the essence of mathematical knowledge? How do they achieve such high scores on these tests? Is it just because they imitate the surface reasoning patterns in a large amount of training data? Whether LLMs truly understand mathematical concepts is still a question worth exploring.

To explore this issue, the authors designed an evaluation benchmark, GSM-Plus. It applies 8 different fine-grained mathematical transformations to each problem in order to systematically evaluate the ability of current LLMs to handle basic mathematical word problems. On this new benchmark, the paper rigorously evaluates 25 different LLMs, including both open-source and closed-source models.

Experimental results show that GSM-Plus is a challenging benchmark for most LLMs. For example, GPT-3.5-Turbo achieves 73.62% accuracy on GSM8K but only 61.19% on GSM-Plus. This work was accepted to ACL 2024 with review scores of 4, 4, and 4.5.



Paper title: GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

Paper address: https://arxiv.org/pdf/2402.19255

Paper homepage: https://qtli.github.io/GSM-Plus/

Background

Mathematical reasoning is an important part of the development of artificial intelligence. It requires rigorous problem understanding, strategy formulation, and computational execution. In the past few years, many public datasets have been used to evaluate the mathematical reasoning ability of artificial intelligence systems. Early mathematical datasets focused on equation-based math problems. Subsequently, more difficult datasets were introduced, covering elementary school, high school, and college-level math problems.

As the difficulty of evaluation data has continued to increase, LLMs have developed very rapidly. To improve LLM performance in the mathematical domain, supervised fine-tuning (SFT) on diverse task data can be used to quickly adapt LLMs to this domain. At inference time, the mathematical ability of LLMs can also be effectively elicited through cleverly designed input prompts (for example, Chain-of-Thought and Program-of-Thought).

For most LLMs, there is still a lot of room for improvement on math problems at the high school level and above. In elementary school math, however, LLMs have shown great potential. This makes us wonder: can LLMs maintain this high performance in real-world environments?

Adversarial Evaluation Dataset GSM-Plus

This study aims to introduce a comprehensive benchmark, GSM-Plus, to systematically examine the robustness of LLMs in solving basic mathematical problems. Inspired by the taxonomy of mathematical problem-solving capabilities in Polya’s principles [2], this paper identifies five guiding principles for constructing the GSM-Plus dataset:

To make these perturbations easier to understand, we use the following question as a running example: "Janet's ducks lay 16 eggs every day. She eats three eggs for breakfast every morning and uses four eggs to bake muffins for her friends. She sells the remaining eggs at the farmer's market for $2 each. How many dollars does she make at the farmer's market every day?"

(1) Numerical changes: refers to changing numerical data or its type. This paper defines three subcategories:

Numerical substitution: Replace a value with another of the same number of digits and the same type, for example, replacing "16" in the question with "20".

Digit expansion: Increase the number of digits in a value, for example, replacing "16" with "1600".

Integer-Decimal-Fractional Conversion: Change an integer to a decimal or fraction, such as converting "2" to "2.5".

(2) Arithmetic changes: refers to introducing additional operations or reversing a math problem, but is limited to addition, subtraction, multiplication, and division operations:

Operational expansion: Add restrictions to the original problem. For example, add a new condition "She also uses two eggs to make her own hair mask every day."

Operation reversal: Convert a known condition of the original problem into the variable to be solved in the GSM-Plus variant. For example, the statement in the original problem in Figure 2, "each duck egg costs 2 dollars", becomes the question of the new problem, "what is the price of each duck egg?", while the original question, "how many dollars does she make at the farmer's market every day?", becomes the known condition "she makes 18 dollars at the farmer's market every day." (The sketch after this list works through the arithmetic of these two variants.)

(3) Problem understanding: refers to restating a math problem in different words without changing the meaning, such as "Janet raises a flock of ducks that lay 16 eggs every day. She consumes three eggs for breakfast and then uses four eggs to bake muffins for her friends. Janet sells the remaining eggs at the farmer's market for $2 each. How much money does she make from selling eggs at the farmer's market every day?"

(4) Interference item insertion: refers to inserting into the original problem a sentence that is related to the topic and contains numerical values but is useless for solving it, for example, "Janet also wants to use two duck eggs to feed her pet parrot. Fortunately, her neighbor gives her two duck eggs every day to feed the parrot."

(5) Critical thinking: focuses on whether LLMs can question or express doubt when a math problem lacks a necessary condition, for example, "Janet's ducks lay eggs every day. She eats three eggs for breakfast every morning and uses four eggs to bake muffins for her friends every day. She sells the remaining eggs at the farmer's market every day for $2 each. How many dollars does she make at the farmer's market every day?"
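To make the arithmetic behind the "arithmetic change" perturbations concrete, here is a short Python sketch that works through the Janet example and the two variants described above. The numbers come directly from the text; the variable names are purely illustrative.

```python
# Illustrative arithmetic for the running "Janet" example (values taken from the text).
eggs_laid = 16
eggs_breakfast = 3
eggs_muffins = 4
price_per_egg = 2

# Seed problem: daily earnings at the farmer's market.
earnings = (eggs_laid - eggs_breakfast - eggs_muffins) * price_per_egg
assert earnings == 18

# "Operation expansion" variant: two more eggs go to a hair mask each day.
eggs_hair_mask = 2
earnings_expanded = (eggs_laid - eggs_breakfast - eggs_muffins - eggs_hair_mask) * price_per_egg
assert earnings_expanded == 14

# "Operation reversal" variant: the earnings (18 dollars) are given, and the egg price is asked for.
known_earnings = 18
price_solved = known_earnings / (eggs_laid - eggs_breakfast - eggs_muffins)
assert price_solved == 2
```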

Based on the 1,319 test problems in GSM8K, we created eight variants for each problem, resulting in the GSM-Plus dataset containing 10,552 problem variants (a test subset containing 2,400 problem variants is also provided for quick evaluation). By testing LLMs with each problem and its eight variants, GSM-Plus can help researchers comprehensively evaluate the robustness of LLMs in solving mathematical problems.



Figure 2: Based on a seed math problem, 8 perturbations from 5 angles are used to generate problem variants. The main modifications are highlighted in green.

By using GSM-Plus to evaluate 25 LLMs of different sizes, pre-training approaches, and task-specific fine-tuning, combined with 4 commonly used prompting techniques, this paper finds that LLMs can solve GSM8K problems accurately on the whole, but run into obvious difficulties on the variant questions in GSM-Plus. The main findings are as follows:

Task-specific optimization, i.e. fine-tuning on mathematically relevant datasets, can usually improve accuracy on downstream tasks, while the level of robustness depends more on the choice of base model and fine-tuning dataset.

The performance of LLMs drops rapidly when "critical thinking" is required or when "arithmetic changes" and "interference item insertion" are involved; in contrast, performance is relatively stable under the "numerical changes" and "problem understanding" perturbations.

Previous prompting techniques (e.g., CoT, PoT, LtM, and Complexity-based CoT) have insignificant effects on robustness, especially for "arithmetic changes" and "critical thinking". Building on previous work, this paper further explores a combined prompting method that iteratively generates and verifies each reasoning step, which improves the performance of LLMs on both GSM8K and GSM-Plus.

GSM-Plus Features

Quality assurance: GSM-Plus evaluation questions are generated in two stages. First, GPT-4's question-rewriting ability is used to generate question variants, and then candidate answers are generated for these variants. To ensure data quality, all question variants and answers generated by GPT-4 are strictly checked by a manual annotation team, which corrected 18.85% of the questions rewritten by GPT-4.

Fine-grained evaluation: For each test question in the mainstream evaluation dataset GSM8K, GSM-Plus provides 8 variant questions in different perturbation directions, fully testing the ability of large models to flexibly solve mathematical word problems in different contexts.

Challenge: Compared with GSM8K, the GSM-Plus problem variants are more challenging, and the performance of all evaluated LLMs drops significantly. In the following analysis, this paper specifically examines the robustness of LLMs under different types of perturbations.

Comparison with other elementary school math word problems



Table 1: Different colors represent different types of perturbations.



As can be seen from the table above, previous studies used various perturbations to test the robustness of mathematical reasoning, but their evaluation settings covered only some perturbation types, and most perturbations were introduced by automatic methods whose quality is difficult to guarantee. In contrast, GSM-Plus uses eight different mathematical reasoning skills to perturb a single problem, with more comprehensive coverage and strict quality control.

Experimental Analysis

Evaluation metrics

Performance drop rate (PDR): the drop in LLM performance on the perturbed problems compared with the original problems.

Percentage of problem pairs solved simultaneously (ASP): the proportion of problem pairs in which both the original problem and its corresponding variant are answered correctly by the LLM.
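As a concrete illustration, the sketch below computes both metrics over a toy set of correctness flags. The PDR formula here is an assumption consistent with the description above (a relative accuracy drop); the paper's precise definition may differ slightly.

```python
def accuracy(flags):
    """Accuracy over a list of 0/1 correctness flags."""
    return sum(flags) / len(flags)

def pdr(seed_correct, variant_correct):
    """Performance drop rate: relative accuracy drop on perturbed questions
    versus their seed questions (assumed formulation, see text)."""
    return 1.0 - accuracy(variant_correct) / accuracy(seed_correct)

def asp(seed_correct, variant_correct):
    """Percentage of (seed, variant) question pairs solved simultaneously."""
    both = sum(1 for s, v in zip(seed_correct, variant_correct) if s and v)
    return both / len(seed_correct)

# Toy example with hypothetical correctness flags (1 = answered correctly).
seed    = [1, 1, 0, 1]   # results on four GSM8K seed questions
variant = [1, 0, 0, 1]   # results on the corresponding GSM-Plus variants
print(f"PDR: {pdr(seed, variant):.1%}")   # 1 - 0.50/0.75 -> 33.3%
print(f"ASP: {asp(seed, variant):.1%}")   # 2 of 4 pairs solved -> 50.0%
```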

Overall Performance

As shown in the following table, the performance of most LLMs on GSM-Plus is significantly reduced compared to GSM8K.

GPT-4 shows the highest robustness, with the smallest PDR of only 8.23%. CodeLlama has the largest PDRs: 40.56%, 39.71%, and 34.27% for the 7B, 13B, and 34B models respectively, exceeding its base model LLaMA-2-7B (39.49%) as well as mathematical SFT models fine-tuned on it, such as SEGO-7B (34.91%). This suggests that reasoning only in program language is vulnerable to perturbations.

When facing mathematical perturbations, larger models tend to have more stable performance. Although supervised fine-tuning can improve accuracy on downstream tasks, it does not significantly enhance robustness to perturbations (i.e., it does not yield a lower PDR). The data used for supervised fine-tuning is very important for robustness: fine-tuning LLaMA-2 on different data leads to large differences in both the accuracy and the robustness of the resulting models.



Table 2: Overall performance

Fine-grained experimental analysis

Performance of LLMs under different perturbations

This paper further evaluates the performance stability of LLMs under the 8 problem variants. Compared with the human baseline, LLM performance drops significantly under the "critical thinking" (purple), "operation expansion" and "operation reversal" (blue), "interference item insertion" (pink), and "integer-decimal-fraction conversion" (orange) perturbations. For "numerical substitution" and "problem understanding", performance is stable or even slightly improved.



Figure 3: Fine-grained experimental analysis

Transferability of mathematical reasoning ability

The previous analysis was based on the dataset as a whole. Next, this paper splits the two datasets according to whether each math question is answered correctly, and analyzes whether an LLM that successfully solves a GSM8K problem is also more likely to answer the corresponding GSM-Plus variants correctly (i.e., a high ASP value), and vice versa. If so, the LLM can be considered robust on that specific subset of math questions, even if it is not robust on the entire dataset. In this setting, each GSM8K problem and its variants in GSM-Plus form 8 question pairs; the results are shown in Figure 4.



Figure 4: Transferability of reasoning by LLMs between GSM8K and GSM-Plus problem pairs. Purple (both correct) and blue (both incorrect) bars indicate consistent model behavior, while red (GSM8K correct & GSM-Plus incorrect) and yellow (GSM8K incorrect & GSM-Plus correct) bars indicate inconsistent behavior. The sum of the purple and red bar heights is the number of GSM8K problems the LLM solves correctly.

The presence of red bars (the LLM answers the original question correctly but not the variant) indicates that most models have limited performance transferability. Although the LLMs differ in accuracy on GSM8K problems (the combined height of the purple and red bars), their transferability is similar (the height of the red bars). This means that existing benchmarks cannot accurately assess a model's true mathematical reasoning ability: high accuracy does not equate to strong reasoning robustness.

Do prompting techniques help the performance robustness of LLMs?

Previous work has shown that well-designed prompts are important for eliciting the mathematical ability of language models. This paper selects 4 representative LLMs and tests their problem-solving performance under different prompts. As shown in the figure below, LLMs perform most stably against perturbations when complex examples are used as in-context demonstrations (Complexity-based CoT); in contrast, they are more susceptible to perturbations when only program language is used to represent intermediate reasoning (Program-of-Thought). Overall, these prompting techniques are not enough for LLMs to maintain their GSM8K performance on GSM-Plus.



Figure 5: Effect of prompts on the performance robustness of LLMs

Are combination prompts effective?

How can the robustness of LLMs be enhanced on top of existing prompting methods? This paper finds that LLMs often overlook important conditions or make calculation errors during problem solving. To this end, the paper explores a combined prompting method, Comp. It first prompts the LLM to extract the numerical conditions necessary for the problem (Prompt1). Then, based on the problem and the key conditions, the LLM is instructed to iteratively generate reasoning goals (Prompt2) and calculation goals (Prompt3), and to give feedback on the problem-solving steps generated so far to determine whether the final answer has been reached (Prompt4). The specific implementation is shown in Figure 6.
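The sketch below outlines how such a Comp-style loop might be wired up. It is a minimal illustration based only on the description above: `call_llm` is a hypothetical stand-in for whatever chat API is used, and the prompt strings merely paraphrase Prompt1 through Prompt4 rather than reproducing the paper's templates.

```python
def comp_solve(question: str, call_llm, max_steps: int = 8) -> str:
    """Comp-style iterative prompting loop (illustrative paraphrase of Prompt1-4).

    `call_llm` is any callable mapping a prompt string to the model's reply,
    e.g. a thin wrapper around a chat-completion API.
    """
    # Prompt1: extract the numerical conditions needed to solve the problem.
    conditions = call_llm(
        "List the key numerical conditions needed to solve this problem:\n" + question
    )
    history = []  # (reasoning goal, computed result) pairs generated so far
    for _ in range(max_steps):
        # Prompt2: propose the next reasoning goal from the question, conditions, and history.
        goal = call_llm(
            f"Question: {question}\nConditions: {conditions}\n"
            f"Steps so far: {history}\nWhat quantity should be computed next?"
        )
        # Prompt3: carry out the calculation for that goal.
        result = call_llm(
            f"Conditions: {conditions}\nSteps so far: {history}\nCompute: {goal}"
        )
        history.append((goal, result))
        # Prompt4: check the steps so far and decide whether the final answer is reached.
        verdict = call_llm(
            f"Question: {question}\nSteps: {history}\n"
            "If the final answer is reached, reply 'ANSWER: <value>'; otherwise reply 'CONTINUE'."
        )
        if verdict.strip().startswith("ANSWER:"):
            return verdict.strip().removeprefix("ANSWER:").strip()
    return history[-1][1] if history else ""
```

In this form, each reasoning step costs several extra model calls in exchange for explicit self-verification, which matches the observation below that Comp narrows, but does not fully close, the robustness gap.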



Figure 6: Schematic diagram of the Comp iteration prompt method

It can be seen that Comp improves the performance of LLMs under various types of problem variation through iterative generation and self-verification, but it still cannot close the performance gap between the standard and adversarial test sets. This study anticipates more methods in the future that further improve model robustness and promote the development of LLMs in mathematical reasoning.



Table 3: Performance of the Comp iterative prompting method

Example

The figure below shows the performance of GPT-3.5-Turbo under different prompting techniques on a GSM8K problem and its GSM-Plus "operation reversal" rewrite. Although all prompts lead Turbo to answer the GSM8K question accurately, only Comp helps Turbo generate the correct answer on the GSM-Plus variant.



Figure 7: Examples of the model answering math questions under different prompt settings

Conclusion

This paper introduces a set of adversarial elementary school math word problems, GSM-Plus, which aims to systematically analyze the robustness of LLMs in solving math word problems. Experimental analysis finds that when faced with perturbations, the performance of most LLMs significantly degrades compared to their performance on standard benchmarks, far from reaching human performance levels. The researchers hope that the work in this paper will promote more future research, including but not limited to: (1) systematically evaluating the mathematical skills of LLMs; (2) building models that can flexibly perform mathematical reasoning.

[1] Cobbe, Karl, et al. "Training verifiers to solve math word problems." arXiv preprint arXiv:2110.14168 (2021). https://paperswithcode.com/sota/arithmetic-reasoning-on-gsm8k

[2] George Polya. 2004. How to Solve It: A New Aspect of Mathematical Method, volume 85. Princeton University Press.