
Large models solve math problems differently from humans: knowledge is clearly lacking, and GPT-4o performs best

2024-07-23


AIxiv is a column where Synced publishes academic and technical content. Over the past few years, Synced's AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

The authors of this article are from Beijing University of Posts and Telecommunications, Tencent WeChat, Huazhong University of Science and Technology, and Beijing Institute of Technology. List of authors: Qiao Runqi, Tan Qiuna, Dong Guanting, Wu Minhui, Sun Chong, Song Xiaoshuai, Gong Que Zhuoma, Lei Shanglin, Wei Zhe, Zhang Miaoxuan, Qiao Runfeng, Zhang Yifan, Zong Xiao, Xu Yida, Diao Muxi, Bao Zhimin, Li Chen, Zhang Honggang. Among them, the co-first author Qiao Runqi is a doctoral student at Beijing University of Posts and Telecommunications, Tan Qiuna is a master's student at Beijing University of Posts and Telecommunications, and the corresponding author is Associate Professor Zhang Honggang of Beijing University of Posts and Telecommunications. This article was completed by Qiao Runqi during his internship at WeChat.

With the rapid development of artificial intelligence technology, large multimodal models (LMMs) that can process information from multiple modalities have become a hot research topic. By integrating information from different modalities, LMMs demonstrate a degree of reasoning and understanding ability and perform well in tasks such as visual question answering, image generation, and cross-modal retrieval. This multimodal capability gives LMMs great potential in a variety of complex scenarios. To test rigorously and scientifically whether AI has strong reasoning capabilities, mathematical question answering has become an important benchmark for measuring model reasoning ability.

Looking back at the development of AI, we find that human cognition and the way we think about problems have had a profound influence on it. Breakthroughs such as neural networks and attention mechanisms are closely related to human thinking patterns. Consider how a human solves a math problem: they first need to be familiar with the knowledge points the question tests, and then use that knowledge to reason step by step toward the answer. But when a model answers, is its reasoning process consistent with a human's?

Focusing on math problems, we found that models can answer complex questions yet stumble on the simple ones contained within them. To explore the reasons for this phenomenon, and inspired by the way humans think through problems, we first modeled the problem-solving process of mastering the relevant knowledge points and then applying them in logical reasoning.
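A plausible way to write this modeling, assuming the factorization implied by the symbol definitions that follow (this exact form is a reconstruction, not a quotation of the paper), is:

P(Y | X) = P_reason · ∏_{i=1}^{n} P(y_i | x_i)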



Here (X, Y) and (x_i, y_i) denote the question-answer pair of the overall math problem and of each sub-problem, respectively, and P_reason denotes the comprehensive application ability of LMMs (knowledge generalization). Building on this, We-Math first constructs a multi-level, tree-like knowledge system around 67 atomic knowledge points, and then, grounded in atomic knowledge and the reasoning answers, explores the model's answering mechanism by decomposing complex problems that involve multiple knowledge points into sub-problems, each corresponding to one atomic knowledge point.



  • Title: WE-MATH: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?
  • Paper: https://arxiv.org/pdf/2407.01284
  • Home page: https://we-math.github.io/
  • Code: https://github.com/We-Math/We-Math
  • Dataset: https://huggingface.co/datasets/We-Math/We-Math

Currently, We-Math ranks first on HuggingFace's Daily Papers list for the day and has received more than 10K views on Twitter!



We-Math Benchmark

1. Data composition

The We-Math evaluation dataset contains 6.5k multimodal elementary-school math problems organized under a multi-level knowledge architecture. Each problem is associated with 1-3 knowledge points, and the knowledge points of all problems are covered by a 5-layer, 99-node hierarchy whose last layer contains the 67 knowledge points. As shown in the figure below, in order to alleviate problems caused by knowledge the model may inherently lack during problem solving, we draw on textbooks and Wikipedia to heuristically introduce descriptions of the 67 knowledge points, providing the necessary knowledge hints for the reasoning process of LMMs.
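For concreteness, here is a minimal sketch of how one We-Math item and the knowledge tree described above might be represented in code; the class and field names are illustrative assumptions rather than the dataset's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class KnowledgeNode:
    """One node in the 5-layer, 99-node knowledge tree (67 leaf knowledge points)."""
    name: str
    description: str = ""          # hint text drawn from textbooks / Wikipedia
    children: list["KnowledgeNode"] = field(default_factory=list)

@dataclass
class WeMathProblem:
    """One multimodal problem; every We-Math problem tests 1-3 leaf knowledge points."""
    question: str                  # field names here are assumptions, not the released schema
    image_path: str
    answer: str
    knowledge_points: list[str]    # names of the 1-3 leaf nodes being tested

    def __post_init__(self) -> None:
        assert 1 <= len(self.knowledge_points) <= 3, "We-Math problems cover 1-3 knowledge points"
```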





2. Question breakdown

To reasonably evaluate the model's answering mechanism, we strictly follow the standard human-written solutions and decompose each complex question into n sub-questions, where n is the number of knowledge points the complex question involves.

As shown in the figure below, consider a complex problem: Mary walked from the northernmost point of a circular flower bed along its edge to the easternmost point, covering a distance of 50.24 meters; find the area of the circular flower bed. To solve it, one first uses the knowledge point "East, South, West, and North Directions": from the conditions "northernmost" and "easternmost", the central angle corresponding to Mary's path is 90 degrees. Next, using the knowledge point "Circumference of a Circle", the 90-degree central angle and the length of the path give the circumference of the flower bed, from which its radius follows. Finally, using the knowledge point "Area of a Circle", the radius obtained above gives the area of the circular flower bed, completing the solution.

Analyzing the above problem-solving process, and in order to probe the model's answering mechanism and its fine-grained reasoning performance, the original question can be decomposed into three sub-questions according to its corresponding knowledge points. Specifically, sub-question 1: Mary walked from the northernmost point of a circular flower bed along its edge to the easternmost point; find the degree of the central angle of the arc corresponding to her path. Sub-question 2: In the circular flower bed, the arc corresponding to a 90-degree central angle is 50.24 m long; find the radius of the flower bed. Sub-question 3: Find the area of a circular flower bed with a radius of 32 m.
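The arithmetic linking the three sub-questions can be verified directly; here is a minimal check, assuming the textbook convention π ≈ 3.14 that the problem's numbers imply:

```python
# Quick check of the three sub-questions' arithmetic, using pi ~= 3.14 as the
# source problem's numbers suggest (50.24 = 0.25 * 2 * 3.14 * 32).
PI = 3.14

# Sub-question 1: walking from the northernmost to the easternmost point spans
# a quarter of the circle, i.e. a 90-degree central angle.
central_angle_deg = 90

# Sub-question 2: the 50.24 m path is a 90-degree arc, so the full circumference
# is four times the arc length, and the radius follows from C = 2 * pi * r.
arc_length = 50.24
circumference = arc_length * 360 / central_angle_deg   # about 200.96 m
radius = circumference / (2 * PI)                       # about 32 m

# Sub-question 3: area of the circular flower bed.
area = PI * radius ** 2                                 # about 3215.36 m^2

print(round(radius, 2), round(area, 2))  # 32.0 3215.36
```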



3. Metrics

On this basis, as shown in the figure below, we introduce a new four-dimensional metric: insufficient knowledge (IK), insufficient generalization (IG), complete mastery (CM), and rote memorization (RM); a minimal classification sketch follows the list.

  • Insufficient Knowledge (IK): The model cannot answer the complex question and also makes mistakes on the sub-questions. We speculate that the model fails on the complex question because it lacks the necessary knowledge.
  • Insufficient Generalization (IG): The model cannot answer the complex question, yet it answers all sub-questions correctly. We speculate that the model fails on the complex question because it lacks comprehensive application ability (generalization ability).
  • Complete Mastery (CM): The model answers the complex question and all of its sub-questions correctly, which is the reasonable and expected case.
  • Rote Memorization (RM): The model answers the complex question but makes mistakes on the sub-questions, which runs contrary to human logical thinking. If a model can solve a complex multi-step problem yet cannot answer the single-step questions required along the way, we consider this unreasonable and attribute it to rote memorization.
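Given per-question correctness of the complex question and of its sub-questions, the four categories above reduce to a simple lookup. A minimal sketch (the function name and signature are assumptions, not the paper's released code):

```python
def classify_outcome(complex_correct: bool, sub_correct: list[bool]) -> str:
    """Map one question's results to IK / IG / CM / RM.

    complex_correct: whether the model solved the original multi-step question.
    sub_correct:     per-sub-question correctness (one entry per knowledge point).
    """
    all_subs_right = all(sub_correct)
    if complex_correct and all_subs_right:
        return "CM"  # Complete Mastery
    if complex_correct and not all_subs_right:
        return "RM"  # Rote Memorization: solves the whole but fails a required step
    if not complex_correct and all_subs_right:
        return "IG"  # Insufficient Generalization: knows the pieces, cannot combine them
    return "IK"      # Insufficient Knowledge: fails the whole and at least one step

# Example: complex question solved, but one sub-question missed -> "RM"
print(classify_outcome(True, [True, False, True]))
```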



There is a progressive relationship among IK, IG, and CM: from lacking the required knowledge, to possessing the knowledge but failing to generalize it, to complete mastery.

Experiments and Conclusions

We-Math evaluates 17 large models, including 4 closed-source models and 13 open-source models. Table 1 and Figure 6 report the results of LMMs under different numbers of knowledge points and their performance on the second-level knowledge points; Table 2 and Figures 7, 8, and 9 report the results of LMMs on the four-dimensional metric and the overall scores under the strict and loose standards; Figure 10 shows how the KCA strategy helps the models mitigate the IK problem.

Performance of LMMs under different numbers of knowledge points and under the second-level knowledge points



  • The models' answering performance is significantly negatively correlated with the number of knowledge points a question involves: the more knowledge points a question contains, the worse the models perform. We therefore propose that a question's difficulty can be modeled by the number of knowledge points it contains.
  • The models perform well on computation-related knowledge points but poorly on fine-grained visual problems. This further suggests that LMMs are good at applying formulas but still have limitations in understanding and comprehensively applying knowledge.
  • GPT-4o performs best, keeping its lead across questions with different numbers of knowledge points and largely maintaining it across the individual knowledge points.
  • LMMs show some potential for parameter compression. Among the different LMMs, LLaVA-NeXT-110B performs closest to GPT-4. Surprisingly, despite their smaller parameter sizes, models such as InternVL-Chat-V1.5, GLM-4V-9B, and InternLM-XC2 also perform well.

Performance of LMMs on the four-dimensional metric and their overall scores under the strict and loose standards





  • Most models suffer from the problems of insufficient knowledge (IK) and rote memorization (RM), and these are especially evident in smaller models. Insufficient knowledge remains the main problem for most models.
  • GPT-4o is significantly ahead of other models on the rote memorization (RM) dimension, which further indicates that GPT-4o is closer to the way humans solve problems and that its results are more reliable: the model has truly learned the knowledge rather than merely "memorized" it.
  • GPT-4o is also significantly ahead of other models on the insufficient knowledge (IK) dimension and has gradually moved on to the next stage, where it needs to further improve its knowledge generalization ability.

Performance of LMMs under the KCA strategy



  • The overall performance of the models improves under the KCA strategy. As shown in the figure above, LMMs of different parameter sizes show consistent gains on both the strict and the loose metrics after the KCA strategy is introduced.
  • The KCA strategy significantly alleviates the IK problem, but the improvement on the IG problem is not obvious. This matches human intuition: the knowledge descriptions mainly fill gaps in the knowledge needed for reasoning, whereas solving the IG problem requires comprehensively improving the knowledge generalization ability of LMMs, which points out a direction for future research. A rough sketch of this kind of knowledge-hint prompting follows the list.
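Based on the descriptions above (textual explanations of the tested knowledge points, drawn from textbooks and Wikipedia, supplied to the model alongside the question), here is a rough sketch of what KCA-style prompt construction could look like; the prompt template and all names are assumptions, not the paper's exact implementation:

```python
def build_kca_prompt(question: str, knowledge_points: list[str],
                     descriptions: dict[str, str]) -> str:
    """Prepend textual descriptions of the tested knowledge points to the question.

    descriptions: maps each of the 67 knowledge-point names to its textbook/Wikipedia-style hint.
    (Template and argument names are illustrative assumptions.)
    """
    hints = "\n".join(
        f"- {kp}: {descriptions[kp]}" for kp in knowledge_points if kp in descriptions
    )
    return (
        "You may use the following knowledge concepts:\n"
        f"{hints}\n\n"
        f"Question: {question}\nAnswer step by step."
    )

# Hypothetical usage with a single knowledge point:
demo = build_kca_prompt(
    "Find the area of a circular flower bed with a radius of 32 m.",
    ["Area of a Circle"],
    {"Area of a Circle": "The area of a circle equals pi times the radius squared."},
)
print(demo)
```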

Summary

In this paper, we proposed WE-MATH, a comprehensive benchmark for fine-grained evaluation of the answering mechanism of LMMs in visual mathematical reasoning tasks. WE-MATH contains a total of 6.5k visual mathematical questions, covering a multi-level knowledge architecture with 5 layers and 67 knowledge points. We innovatively decompose the questions into multiple sub-questions based on the knowledge points required, and introduce a new four-dimensional metric for fine-grained reasoning evaluation. Through WE-MATH, we comprehensively evaluate the performance of existing LMMs in visual mathematical reasoning, and reveal that the model's answering performance shows a significant negative correlation with the number of knowledge points contained in the questions.

In addition, we find that rote memorization (RM) and insufficient knowledge (IK) are the biggest flaws of most LMMs. For GPT-4o, however, the main challenge has gradually shifted from IK to IG, indicating that it is the first model to move toward the next stage. Finally, our analysis of the KCA strategy and of error cases offers heuristic guidance for moving existing LMMs toward human-like visual mathematical reasoning.