
A hit talk at ICML 2024! Meta's Zeyuan Allen-Zhu reveals the inner world of large models: reasoning that differs from humans'

2024-08-05




AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world, and has effectively promoted academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

How do large language models (LLMs) solve math problems? Do they memorize templates, or do they genuinely learn to reason? What does the model's mental computation look like? What reasoning skills can it learn? Are those skills the same as humans', or do they go beyond humans? Does learning only one type of math problem help the development of general intelligence? Why do LLMs make reasoning errors? How large and how deep does an LLM have to be in order to reason?



Paper address: https://arxiv.org/abs/2407.20311

Recently, a four-person team from Meta FAIR, CMU, and MBZUAI (Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu) published a new arXiv paper, "Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process," which answers the questions above with carefully controlled experiments. Twitter user @xlr8harder commented, "This result should settle, once and for all, the debate over whether LLMs can reason or are merely stochastic parrots."

Editor's note: The entire "Physics of Language Models" series was invited to give a two-hour talk at ICML 2024 on July 22, which was warmly received, reportedly with sustained applause. Here we present Part 2.1 of the series.



Figure 1

A Detailed Look at the Paper

First, following the convention of this series, the authors argue that we should not try to infer the thinking process of large models such as GPT-4 merely by conversing with them. That approach is akin to studying animal behavior: feasible, but not rigorous enough to scientifically reveal GPT-4's internal thought process.

Moreover, from a data perspective, only with full access to the model's pretraining data can we know which problems the model has already seen and which it solved by reasoning. Even if a model scores highly on GSM8K (a benchmark of roughly 8,000 grade-school math problems), it is hard to tell whether it has seen variants of those problems (for example, versions in other languages, or rewrites produced by GPT-4).

To this end, the authors created iGSM, a synthetic dataset of grade-school-level math reasoning problems, and pretrained models from scratch on iGSM so that they could control exactly which types of problems the model is exposed to. Notably, iGSM contains no common-sense knowledge, only addition, subtraction, and multiplication modulo 23, and every solution is written out step by step as a chain of thought (CoT). With iGSM, one can run controlled experiments that isolate the model's reasoning ability from other factors (such as arithmetic with large integers). Figure 2 shows a simple example.



Figure 2
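The authors' iGSM generator is considerably richer than what fits here, but the following toy sketch, under assumptions of our own (the variable names, statement templates, and generator logic are all illustrative), conveys the flavor: variables that depend on earlier variables, all arithmetic done modulo 23, and an answer that can be written out step by step.

```python
import operator
import random

MOD = 23                                            # iGSM keeps all arithmetic modulo 23
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def make_toy_problem(n_vars=6, seed=0):
    """Generate a toy problem: each variable is either a plain number or a simple
    combination of two earlier variables, followed by a question and its answer."""
    rng = random.Random(seed)
    values, statements = {}, []
    for i in range(n_vars):
        name = f"x{i}"
        if i < 2 or rng.random() < 0.3:             # leaf variable: a plain number
            values[name] = rng.randrange(MOD)
            statements.append(f"{name} equals {values[name]}.")
        else:                                       # derived variable: depends on two earlier ones
            a, b = rng.sample(sorted(values), 2)
            sym = rng.choice(list(OPS))
            values[name] = OPS[sym](values[a], values[b]) % MOD
            statements.append(f"{name} equals {a} {sym} {b}.")
    query = f"x{n_vars - 1}"
    return " ".join(statements), f"What is {query} mod {MOD}?", values[query]

problem, question, answer = make_toy_problem()
print(problem)
print(question, "->", answer)
```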

Using this dataset, the authors first tested GPT-2 (with rotary positional embedding, RoPE). Let op denote the number of arithmetic operations needed to solve a problem. They found that a model trained only on problems with op ≤ 21 not only reached 99% accuracy on such problems, but also kept 83% accuracy on harder ones (e.g., op = 32); see Figure 3. This shows that the model has learned some reasoning skill, since it never saw problems with op > 21 during training. (Incidentally, GPT-4o can only handle problems up to about op = 10 on this dataset; beyond that difficulty its answers are close to blind guessing. We return to this point at the end of the article.)

So what reasoning skill did the model learn? There are at least two ways to solve iGSM problems. The first is what the authors call "level 0 reasoning": compute whatever can be computed, by brute force. Because the variables in a problem can have complex dependencies, some can be computed directly while others require other variables to be computed first. For example, if Xiao Zhang has 3 times as many fruits as Xiao Wang, we must first work out how many apples and pears Xiao Wang has and add them up before we can compute Xiao Zhang's fruit count. "Level 0 reasoning" simply enumerates as many variables as possible: at each step it picks some computable variable at random, computes it, and continues.
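A minimal sketch of "level 0 reasoning" on such a dependency graph, using a hypothetical data structure of our own (each variable is either a constant or an operation over earlier variables; the names mirror the fruit example):

```python
import operator

# Illustrative dependency-graph format (not the paper's data structure):
# each variable is either ("const", value) or ("op", fn, [parent names]).
defs = {
    "wang_apples": ("const", 4),
    "wang_pears":  ("const", 5),
    "li_apples":   ("const", 7),   # irrelevant to "how many fruits does Zhang have?"
    "wang_fruit":  ("op", operator.add, ["wang_apples", "wang_pears"]),
    "zhang_fruit": ("op", lambda w: 3 * w, ["wang_fruit"]),
}

def level0_solve(defs, mod=23):
    """'Level 0' reasoning: keep computing *any* variable whose inputs are already
    known, with no regard for whether the question actually needs it."""
    known = {}
    while len(known) < len(defs):               # each pass computes at least one new variable (DAG)
        for name, d in defs.items():
            if name in known:
                continue
            if d[0] == "const":
                known[name] = d[1] % mod
            elif all(p in known for p in d[2]):
                known[name] = d[1](*(known[p] for p in d[2])) % mod
    return known

print(level0_solve(defs))   # computes everything, including the unnecessary li_apples
```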

The alternative is "level 1 reasoning": using topological sorting, work backwards from the question to determine which variables need to be computed, then compute upward from the leaf nodes, aiming for the shortest solution. Humans usually solve such math problems with level 1 reasoning and do not compute "unnecessary variables". For example, if Xiao Zhang has 3 times as many fruits as Xiao Wang and the question asks how many fruits Xiao Zhang has, then the number of Xiao Li's apples is an unnecessary variable, while the numbers of Xiao Wang's apples and pears are necessary.
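A matching sketch of "level 1 reasoning", reusing the illustrative defs example from the previous snippet: first mark which variables are necessary by walking backwards from the query, then compute only those, leaf-first.

```python
def level1_solve(defs, query, mod=23):
    """'Level 1' reasoning: mark only the necessary variables by working backwards
    from the query, then compute them leaf-first, giving the shortest solution."""
    necessary, stack = set(), [query]
    while stack:                                  # backward pass: nece(A) = True
        name = stack.pop()
        if name not in necessary:
            necessary.add(name)
            d = defs[name]
            if d[0] == "op":
                stack.extend(d[2])
    known = {}
    def value(name):                              # forward pass over necessary variables only
        if name not in known:
            d = defs[name]
            known[name] = (d[1] if d[0] == "const"
                           else d[1](*(value(p) for p in d[2]))) % mod
        return known[name]
    return value(query), necessary

answer, needed = level1_solve(defs, "zhang_fruit")
print(answer, needed)   # li_apples is never marked necessary and never computed
```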

As shown in Figure 3, the authors found that GPT-2 learns level 1 reasoning and gives the shortest solution almost every time. This is remarkably hard: before the model generates its first sentence, it must already have completed the entire topological sort in its head; otherwise, how would it know which variables are unnecessary? If it starts by generating "Xiao Li has 7 apples," it can no longer go back and produce the shortest solution.



Figure 3

So how does the model learn "level 1 reasoning"? To find out, the authors probed the model's internal states (see Figure 4). The conclusion (see the paper for the exact probing method) is that before the model generates its first sentence, it has already determined, by mental computation, which variables A are "necessary" (nece(A) = True). Moreover, after generating each sentence, the model also mentally computes which variables A have become "computable next" (cannext(A) = True). The model then only needs to repeatedly take the logical AND of nece and cannext to produce the complete solution step by step, starting from the leaf nodes.
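The paper uses its own probing protocol; the snippet below is only a generic linear-probing sketch under assumed inputs (per-variable hidden states extracted from the model and binary nece labels), meant to show the kind of measurement involved rather than the authors' exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_nece_probe(hidden_states: np.ndarray, labels: np.ndarray, train_frac: float = 0.8):
    """Linear-probe sketch: hidden_states[i] is the model's hidden state for variable A_i,
    read off right before the first answer token; labels[i] is 1 if nece(A_i) is True.
    High held-out accuracy is evidence that necessity is already encoded internally."""
    split = int(train_frac * len(labels))
    probe = LogisticRegression(max_iter=1000).fit(hidden_states[:split], labels[:split])
    return probe, probe.score(hidden_states[split:], labels[split:])

# Example with random placeholder data, just to show the shapes involved.
rng = np.random.default_rng(0)
h = rng.normal(size=(1000, 768))          # (n_examples, hidden_dim)
y = rng.integers(0, 2, size=1000)         # binary nece labels
probe, acc = train_nece_probe(h, y)
print(f"held-out probe accuracy: {acc:.2f}")   # ~0.5 on random data, by construction
```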

Notably, these complex mental-computation abilities were never demonstrated in the training set. The model only saw the "language" part of the iGSM data (questions and answers), yet it autonomously learned a human-like planning process and arrived at the optimal solution. In other words, this study rebuts the claim we reported a week ago in "Language ≠ Thinking, Large Models Can't Learn Reasoning: a Nature Article That Stirred Up the AI Community" and shows, with scientific methods, that large models can indeed learn to think through language.

What is even more surprising is that the model learns more than this. In Figure 4, the authors also found that the model mentally computes a great deal of information that is useless for the current problem. For example, right after the variable relationships have been described, even before the question is asked, the model already knows whether any two variables A and B are recursively dependent, even when those variables are irrelevant to the question. Humans usually work backwards from the question and ignore unnecessary variables, whereas a language model like GPT-2 combs through the entire relationship graph in preparation for whatever question might be asked. The authors call this ability "level 2 reasoning".
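To make the "level 2" information concrete, the sketch below (again on the illustrative defs format from the earlier snippets) computes, for every variable, the full set of variables it recursively depends on, i.e., the all-pairs dependency relation that the probes suggest the model tracks regardless of the question.

```python
def all_pair_dependencies(defs):
    """Transitive closure of the dependency graph: for every variable, the set of
    variables it recursively depends on, computed for all pairs up front."""
    memo = {}
    def ancestors(name):
        if name not in memo:
            d = defs[name]
            parents = d[2] if d[0] == "op" else []
            memo[name] = set(parents)
            for p in parents:
                memo[name] |= ancestors(p)
        return memo[name]
    return {name: ancestors(name) for name in defs}

print(all_pair_dependencies(defs))   # includes pairs that no particular question needs
```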

Although "level 2 reasoning" is not needed to solve the problems, it is a more general skill: the model uses its parallel computation to do a great deal of causal sorting of the information. The language model acquires this ability while learning to solve problems, even though nothing in the data taught it to do so. The authors speculate that this may be a potential source of the "general" in artificial general intelligence (AGI): a language model can go beyond the skills its dataset teaches and learn more general abilities.



Figure 4

Next, the authors studied why the model makes mistakes. In short, on the iGSM dataset the model makes almost only two kinds of errors: computing an unnecessary variable, and computing a variable that is not yet computable, as shown in Figure 5.

For the former, the authors found that if the model makes a mental-computation error before generating the answer and mistakenly believes that some variable A is "necessary" (nece(A) = True), it will very likely force itself to compute A while generating the answer, producing a non-shortest solution. This finding is interesting because it implies that many errors are systematic: one can tell, via probing, that the model is going to make a mistake before it generates its first token. Such errors have nothing to do with the randomness of generation or with beam search.
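A small bookkeeping sketch of this error analysis (illustrative only, not the paper's code; the argument names and dict layout are assumptions): among problems where the probe, read off before the first answer token, wrongly flags some variable as necessary, how often does the generated solution indeed compute a variable the shortest solution does not need?

```python
def unnecessary_variable_rate(probe_nece, true_nece, generated_vars):
    """All three arguments map a problem id to a set of variable names:
    probe_nece  - variables the probe flags as necessary before generation,
    true_nece   - variables actually needed for the shortest solution,
    generated_vars - variables the generated answer actually computes."""
    flagged = [pid for pid in probe_nece if probe_nece[pid] - true_nece[pid]]
    if not flagged:
        return 0.0
    hits = sum(bool(generated_vars[pid] - true_nece[pid]) for pid in flagged)
    return hits / len(flagged)
```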

As for the latter, the authors also attribute it to mental-computation errors, and they plan a full follow-up paper, Part 2.2, devoted to improving the model's mental computation and thereby its problem-solving accuracy. That paper has not yet been released; we will continue to follow and report on it.



Figure 5

The next result is that the authors push back on the "bigger is all that matters" reading of large-model scaling laws, i.e., the view that model performance depends only on the number of parameters, not on width or depth. This view was first put forward in OpenAI's scaling-law paper and has been followed by nearly all subsequent work.

The authors ran a controlled experiment on the iGSM dataset, as shown in Figure 6. Comparing a smaller-but-deeper model with a larger-but-wider one, they found that for solving iGSM math problems, the depth of the model matters far more than its width. For example, a 20-layer, 9-head model performs much better than a 4-layer, 30-head model, even though the latter has twice as many parameters.

Furthermore, the authors found that this reliance on depth stems from the complexity of the model's mental computation. By probing at different depths of the model, they found that for variables A far from the question, mentally deciding nece(A) requires more layers. Specifically, if variable A is at distance t from the question variable, then t steps of mental computation are needed to conclude that nece(A) = True; the larger t is, the more layers the model needs, as shown in Figure 6.
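For concreteness, the distance t can be read off the dependency graph directly; the sketch below (again on the illustrative defs format from the earlier snippets) computes it with a backward breadth-first search from the query variable.

```python
from collections import deque

def distance_to_query(defs, query):
    """dist[A] = t means A is t dependency hops away from the question variable.
    Probing suggests concluding nece(A) = True for such an A takes roughly t mental
    steps, which is why larger t calls for more transformer layers."""
    dist, frontier = {query: 0}, deque([query])
    while frontier:                               # BFS backwards over dependency edges
        name = frontier.popleft()
        d = defs[name]
        for p in (d[2] if d[0] == "op" else []):
            if p not in dist:
                dist[p] = dist[name] + 1
                frontier.append(p)
    return dist                                   # variables absent from dist are unnecessary

print(distance_to_query(defs, "zhang_fruit"))     # e.g. wang_apples sits at distance 2
```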

The authors stress that the model's reliance on depth cannot be offset by chain of thought (CoT). In fact, the iGSM solutions already use CoT as much as possible: every calculation is broken into steps. Even so, the model still has to mentally plan what the first CoT step should be, and that mental planning may itself take several steps. This explains why the model depends on depth.



Figure 6

In summary, unlike the more than 99% of papers that study the external behavior of LLMs, this work takes a different approach and reveals the mental process by which an LLM solves math problems, offering a new lens for understanding LLM intelligence.

At the end of the paper, the authors point out that even GPT-4 can only reliably perform about 10 steps of reasoning on the iGSM dataset. In other words, even today's strongest model, allegedly trained on all the data on the Internet, still cannot accurately carry out more than 10 reasoning steps. This suggests that the pretraining data used by existing large models still leaves considerable room for improvement, and that creating synthetic data along the lines of this paper may offer a new way to strengthen models' reasoning and information-organizing abilities.