2024-07-23
AIxiv is a column where Synced publishes academic and technical content. In the past few years, Synced's AIxiv column has received more than 2,000 articles, covering top laboratories in major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work to share, please submit or contact us for reporting. Submission email: [email protected]; [email protected]
The authors of this article are from Beijing University of Posts and Telecommunications, Tencent WeChat, Huazhong University of Science and Technology, and Beijing Institute of Technology. List of authors: Qiao Runqi, Tan Qiuna, Dong Guanting, Wu Minhui, Sun Chong, Song Xiaoshuai, Gong Que Zhuoma, Lei Shanglin, Wei Zhe, Zhang Miaoxuan, Qiao Runfeng, Zhang Yifan, Zong Xiao, Xu Yida, Diao Muxi, Bao Zhimin, Li Chen, Zhang Honggang. Among them, the co-first author Qiao Runqi is a doctoral student at Beijing University of Posts and Telecommunications, Tan Qiuna is a master's student at Beijing University of Posts and Telecommunications, and the corresponding author is Associate Professor Zhang Honggang of Beijing University of Posts and Telecommunications. This article was completed by Qiao Runqi during his internship at WeChat.
With the rapid development of artificial intelligence technology, large multimodal models (LMMs) that can process information from multiple modalities have become a hot research topic. By integrating information across modalities, LMMs demonstrate real reasoning and understanding capabilities, and perform well in tasks such as visual question answering, image generation, and cross-modal retrieval. This multimodal capability gives LMMs great potential in complex application scenarios. To test rigorously and scientifically whether AI possesses strong reasoning capabilities, mathematical question answering has become an important benchmark for measuring a model's reasoning ability.
Looking back at the development of AI, we find that human cognition and the way we think about problems have had a profound impact on the development of AI. Breakthroughs such as neural networks and attention mechanisms are closely related to human thinking patterns. Imagine that when humans solve a math problem, they first need to be familiar with the knowledge points tested in the question, and then use the relevant knowledge to reason step by step to get the answer. But when the model answers, is its reasoning process consistent with that of humans?
Focusing on math problems, we found that models can answer complex questions yet often stumble on simple ones. To explore the reasons for this phenomenon, and inspired by how humans think through problems, we model the process of first mastering knowledge points and then applying them in logical reasoning as follows:
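The formula image was lost in extraction; a plausible reconstruction, consistent with the definitions given immediately below (the exact form in the We-Math paper may differ), is:

```latex
P(Y \mid X) \;=\; P_{\mathrm{reason}}\!\bigl(Y \mid X,\, (x_1, y_1), \dots, (x_n, y_n)\bigr)\; \prod_{i=1}^{n} P(y_i \mid x_i)
```

That is, answering the composite problem factors into mastering each atomic sub-problem and then composing them via the model's generalization ability.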
Here (X, Y) denotes the composite math problem and its answer, (x_i, y_i) the question and answer of each sub-problem, and P_reason the comprehensive application ability of LMMs (knowledge generalization). On this basis, We-Math first builds a multi-level, tree-like knowledge system over 67 atomic knowledge points, and then, grounded in atomic knowledge and reasoning answers, probes the model's answering mechanism by decomposing complex problems involving multiple knowledge points into sub-problems, each corresponding to a single atomic knowledge point.
We-Math ranked first on the HuggingFace Daily Papers list for its day of release and has received 10K+ views on Twitter!
We-Math Benchmark
1. Data composition
The We-Math evaluation dataset contains 6.5k multimodal elementary-school math problems organized under a multi-level knowledge architecture. Each problem is tagged with 1-3 knowledge points, and all knowledge points are covered by a 5-layer, 99-node hierarchy whose last layer contains the 67 atomic knowledge points. As shown in the figure below, to mitigate models' lack of prerequisite knowledge during problem solving, we consulted textbooks and Wikipedia and heuristically wrote descriptions for the 67 knowledge points, providing the knowledge hints that LMMs need during reasoning.
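The hierarchy plus per-leaf hints can be pictured as a nested mapping. The sketch below is purely illustrative: the node names and hint wording are invented for the example and are not taken from the released dataset.

```python
# Hypothetical sketch of We-Math's multi-level knowledge tree (the real one
# has 5 layers, 99 nodes, and 67 leaf knowledge points). Leaf nodes carry the
# textbook-style hint used to prompt the model.
knowledge_tree = {
    "Plane Figures": {                      # inner node (illustrative name)
        "Circle": {                          # inner node
            "Circumference of a Circle": {   # leaf: atomic knowledge point
                "hint": "C = 2 * pi * r; an arc over a central angle "
                        "theta spans theta / 360 of the circumference."
            },
            "Area of a Circle": {
                "hint": "S = pi * r ** 2."
            },
        },
    },
}

def leaf_hints(tree):
    """Collect (knowledge point, hint) pairs from the leaf nodes."""
    pairs = []
    for name, child in tree.items():
        if "hint" in child:                  # leaf nodes store a hint string
            pairs.append((name, child["hint"]))
        else:                                # inner nodes: recurse
            pairs.extend(leaf_hints(child))
    return pairs
```

In this layout a leaf is simply any node carrying a `hint`, so walking the tree yields exactly the atomic knowledge points and their descriptions.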
2. Question breakdown
In order to reasonably evaluate the answering mechanism of the model, we strictly follow the standard answers given by humans and break down complex questions into n sub-questions according to the knowledge points contained in the complex questions, where n represents the number of knowledge points contained in the complex question.
As shown in the figure below, consider a composite problem: Mary walked from the northernmost point of a circular flower bed along its edge to the easternmost point, covering a distance of 50.24 meters; find the area of the flower bed. The solution takes three steps. First, using the knowledge point "Cardinal Directions (North, South, East, West)", the conditions "northernmost" and "easternmost" fix the central angle of the arc Mary walked at 90 degrees. Next, using "Circumference of a Circle", the 90-degree central angle together with the length of the path gives the circumference of the flower bed, from which the radius follows. Finally, using "Area of a Circle", the radius yields the area, completing the solution.
Analyzing this solving process, and in order to probe the model's answering mechanism and fine-grained reasoning performance, the original question can be broken down into three sub-questions according to its knowledge points. First sub-question: Mary walked from the northernmost point of a circular flower bed along the edge to the easternmost point; find the degree of the central angle of the arc she walked. Second sub-question: in the circular flower bed, the arc corresponding to a 90-degree central angle is 50.24 m long; find the radius of the flower bed. Third sub-question: find the area of a circular flower bed with a radius of 32 m.
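The arithmetic behind the three sub-questions can be verified in a few lines, using pi = 3.14 as the original elementary-school problem does:

```python
# Checking the three sub-questions of the flower-bed example.
PI = 3.14  # the problem uses the school approximation, not math.pi

# Sub-question 1: northernmost -> easternmost is a quarter turn.
central_angle = 90  # degrees

# Sub-question 2: the 50.24 m arc covers 90/360 of the circumference.
arc_length = 50.24
circumference = arc_length * 360 / central_angle   # 200.96 m
radius = circumference / (2 * PI)                  # 32.0 m

# Sub-question 3: area from the recovered radius.
area = PI * radius ** 2                            # 3215.36 square meters

print(central_angle, radius, area)
```

Each intermediate value feeds the next sub-question, which is exactly the chain the decomposition makes explicit.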
3. Metrics
On this basis, as shown in the figure below, we introduce a new four-dimensional metric: insufficient knowledge (IK), insufficient generalization (IG), complete mastery (CM), and rote memorization (RM).
IK, IG, and CM form a progression: a model must first overcome insufficient knowledge, then insufficient generalization, before reaching complete mastery, while RM flags answers that are correct despite broken intermediate steps.
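A minimal sketch of how the four categories could be assigned from per-question results; the function shape and boolean inputs are illustrative, and the paper's exact scoring rules may differ:

```python
def classify(composite_correct, sub_correct):
    """Assign one of We-Math's four diagnostic categories to a model's
    attempt at a multi-step problem.

    composite_correct: bool, did the model solve the original problem?
    sub_correct: list[bool], per-sub-question results after decomposition.
    """
    all_subs = all(sub_correct)
    if composite_correct and all_subs:
        return "CM"   # complete mastery: composite and every step correct
    if composite_correct:
        return "RM"   # rote memorization: right answer, failed sub-steps
    if all_subs:
        return "IG"   # insufficient generalization: knows the pieces,
                      # cannot compose them into the full solution
    return "IK"       # insufficient knowledge: fails at the atomic level
```

For example, a model that answers the flower-bed problem correctly but cannot find the 90-degree central angle on its own would be flagged RM rather than credited with mastery.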
Experiments and Conclusions
We-Math evaluates 17 large models, including 4 closed-source and 13 open-source models. Table 1 and Figure 6 show the results of LMMs under different numbers of knowledge points and their performance on the second-level knowledge points; Table 2 and Figures 7, 8, and 9 show the results of LMMs on the four-dimensional metric and the composite scores under strict and loose standards; Figure 10 shows how the KCA strategy mitigates the models' IK problem.
Performance of LMMs with different numbers of knowledge points and their performance with second-level knowledge points
Performance of LMMs under four-dimensional indicators and their comprehensive scoring results under strict and loose standards
Performance of LMMs under KCA strategy
Summary
In this paper, we proposed WE-MATH, a comprehensive benchmark for fine-grained evaluation of the answering mechanism of LMMs in visual mathematical reasoning tasks. WE-MATH contains a total of 6.5k visual mathematical questions, covering a multi-level knowledge architecture with 5 layers and 67 knowledge points. We innovatively decompose the questions into multiple sub-questions based on the knowledge points required, and introduce a new four-dimensional metric for fine-grained reasoning evaluation. Through WE-MATH, we comprehensively evaluate the performance of existing LMMs in visual mathematical reasoning, and reveal that the model's answering performance shows a significant negative correlation with the number of knowledge points contained in the questions.
In addition, we find that rote memorization (RM) and insufficient knowledge (IK) are the most serious flaws of most LMMs. Notably, GPT-4o's main challenge has gradually shifted from IK to IG, suggesting it is the first model to advance to the next stage. Finally, our analysis of the KCA strategy and of error cases further points existing LMMs toward human-like visual mathematical reasoning.