
Crash! Which is bigger, 9.11 or 9.9? A reporter tested 12 large models, and 8 of them got it wrong

2024-07-17


A math problem that an elementary school student could solve has stumped a number of large AI models at home and abroad.

Which is bigger, 9.11 or 9.9? The China Business Network reporter put this question to 12 large models. Alibaba's Tongyi Qianwen, Baidu's Wenxin Yiyan, MiniMax, and Tencent's Yuanbao answered correctly, while ChatGPT-4o, ByteDance's Doubao, Moonshot AI's Kimi, Zhipu's Qingyan, 01.AI's Wanzhi, StepFun's Yuewen, Baichuan Intelligence's Baixiaoying, and SenseTime's SenseChat all answered incorrectly, each in a different way.

Most of the models compared the digits after the decimal point incorrectly, concluding that 9.11 is greater than 9.9. Since the numbers could be read in different contexts, the reporter restricted the question to a mathematical context, and models such as ChatGPT still gave the wrong answer.

Behind this lies a long-standing problem: large models are poor at mathematics. Some industry insiders believe that generative language models are, by design, more like liberal arts students than science students, though targeted corpus training may gradually improve their scientific ability.

8 large models answered incorrectly

The models' arithmetic failure was first noticed by Yuchen Lin, a researcher at the Allen Institute for AI. In screenshots he posted on the X platform, ChatGPT-4o answered that 13.11 is larger than 13.8. "On the one hand, AI is getting better and better at solving math Olympiad problems, but on the other hand, common sense is still hard," he said.

Riley Goodside, a prompt engineer at Scale AI, then picked up the idea, varied the question, and asked the strongest current models, ChatGPT-4o, Google Gemini Advanced, and Claude 3.5 Sonnet: which is bigger, 9.11 or 9.9? All of these mainstream models gave the wrong answer, and the topic spread widely.


In fact, the issue can be traced back to a hot search tied to a domestic variety show last weekend. On July 13, in the rankings released by the latest episode of "Singer", domestic singer Sun Nan and foreign singer Chanté Moore received 13.8% and 13.11% of the votes respectively. Some netizens questioned the ranking, believing that 13.11% was greater than 13.8%, and the comparison of 13.8 and 13.11 became a trending topic.

At the time, some netizens quipped, "If you can't work it out, why not ask AI?" As it turns out, many AIs really can't.

The China Business Network reporter put the question "Which is bigger, 9.11 or 9.9?" to ChatGPT and the mainstream domestic models, covering five major companies such as Alibaba and Baidu and six AI unicorns such as Moonshot AI. Four of the models, Alibaba's Tongyi Qianwen, Baidu's Wenxin Yiyan, MiniMax, and Tencent's Yuanbao, answered correctly, while the other eight answered incorrectly.

The models that answered correctly gave similar explanations, while those that answered incorrectly each had their own logic and phrasing. When the reporter pushed back on or denied the wrong answers, almost all of the models admitted the mistake and gave the correct answer.

The first is ChatGPT, widely regarded as a first-tier model worldwide. Asked "Which is bigger, 9.11 or 9.9", it replied that for the digits after the decimal point "11 is greater than 9", so 9.11 is bigger.


The reporter asked ChatGPT whether there were other comparison methods. It converted the decimals into fractions and correctly concluded that "11/100 is smaller than 90/100", but then still concluded that "therefore 9.11 is larger than 9.9".

Some suggested that the wrong answers might be a matter of context: read as software version numbers, for instance, 9.11 does come after 9.9. The reporter therefore added the qualifier "mathematically" to the question, and ChatGPT still answered incorrectly.
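The two readings genuinely diverge. As a minimal illustration in Python (added here for clarity; the reporter's test involved no code):

```python
# As version numbers, 9.11 comes after 9.9; as decimals, 9.11 is smaller.
as_versions = tuple(map(int, "9.11".split("."))) > tuple(map(int, "9.9".split(".")))
as_numbers = float("9.11") > float("9.9")

print(as_versions)  # True:  version 9.11 is newer than version 9.9
print(as_numbers)   # False: the number 9.11 is smaller than 9.9
```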

Turning to the domestic models, the reporter first asked Moonshot AI's Kimi. When comparing the decimal parts, it claimed that the first decimal place of 9.11 is 1 while the first decimal place of 9.9 is 0; having misread that digit, it concluded that 9.11 is larger.


When the reporter pushed back and appealed to common sense, Kimi conceded that its answer was wrong and gave the correct comparison method.

Asked the same question, ByteDance's Doubao not only gave an answer but also offered everyday examples to make it easier to understand; the examples were just as wrong. Doubao claimed that, for two sums of money, "9.11 yuan is 0.21 yuan more than 9.9 yuan", and that when measuring length, "9.11 meters is longer than 9.9 meters".


In its answer, Zhipu Qingyan correctly noted that the tenths digit of 9.11 is 1 while the tenths digit of 9.9 is 9, yet it still concluded that "9.11 is greater than 9.9 as a whole." It even emphasized that "this result may be surprising, because intuitively you may think that 9.9 is greater, but according to mathematical rules, 9.11 is indeed the larger number."


After the reporter questioned the answer, Zhipu Qingyan first said, "Your understanding is a common misunderstanding," then worked through the comparison itself, arrived at the correct answer, and admitted that its previous answer was wrong.

SenseTime's model also gave a wrong answer at first. When the reporter asked how it made the comparison, it correctly concluded that the decimal part 0.11 is less than 0.9, but then abruptly stated, "So 9.11 is greater than 9.9." The reporter pointed out this logical break, and SenseTime's model later admitted that its "explanation was wrong."


StepFun's Yuewen likewise answered that 9.11 is larger than 9.9, having compared the decimal parts incorrectly. When the reporter pressed further, Yuewen's reasoning grew visibly confused, and, interestingly, it seemed unaware that its own answer had changed.


In its explanation, Yuewen first said that it "understood your confusion" and stated that in daily life 9.9 is indeed larger than 9.11, but that in mathematics "the two numbers need to be compared more precisely." It then worked through the comparison and concluded that, by mathematical rules, "9.11 is smaller than 9.9", without ever acknowledging that its earlier answer was wrong.

Two more companies, Baichuan Intelligence and 01.AI, first gave wrong answers, but when the reporter asked "why", they quietly switched to the correct answer in the course of their derivations.


Only when the reporter pointed this out did they acknowledge that their previous answers were wrong.


Judging from the answers, the models that responded correctly followed very similar solution processes. Wenxin Yiyan, for example, correctly separated the integer and decimal parts and compared them in turn.
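The procedure those models described amounts to a few mechanical steps. A minimal Python sketch of it (an illustration of the procedure as the article describes it, not any model's actual code):

```python
def compare_decimals(a: str, b: str) -> str:
    """Compare two decimal strings the way the correct answers describe:
    integer parts first, then fractional parts padded to equal length."""
    a_int, _, a_frac = a.partition(".")
    b_int, _, b_frac = b.partition(".")
    if int(a_int) != int(b_int):
        return a if int(a_int) > int(b_int) else b
    # Pad so that "9" vs "11" is compared as "90" vs "11".
    width = max(len(a_frac), len(b_frac))
    a_frac, b_frac = a_frac.ljust(width, "0"), b_frac.ljust(width, "0")
    if int(a_frac) == int(b_frac):
        return "equal"
    return a if int(a_frac) > int(b_frac) else b

print(compare_decimals("9.11", "9.9"))  # -> 9.9
```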


Among these, Tencent's Yuanbao not only gave the correct answer but also compiled some of the ongoing public discussion, noting reference sources and links.


"Liberal arts students" are poor at mathematics

Why can't supposedly intelligent large models handle an elementary school math problem? This is not a new issue. Mathematical ability has long been a weak point of large models, and the industry has repeatedly discussed their poor math and complex reasoning skills. Even GPT-4, widely considered the best large model, still has plenty of room for improvement.

Most recently, China Business Network reported in June that in a full-paper test on China's college entrance examination run on the OpenCompass (Sinan) evaluation system, seven major models including GPT-4 generally performed well in Chinese and English but all failed mathematics, with the highest score only 75 points.

When grading the models' math papers, teachers found that the answers to subjective questions were messy and the working confused; in some cases the working was wrong yet the final answer was right. This suggests that large models are good at memorizing formulas but cannot apply them flexibly when actually solving problems.

Some industry professionals attribute the poor math skills to the architecture of LLMs (large language models), which are typically trained to predict the next word. Simply put, a large text dataset is fed to the model, and after training the model predicts a probability distribution over the next word given the current input text. By constantly comparing its prediction with the actual next word, the language model gradually absorbs the regularities of language and learns to predict and generate text.
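In code, that training loop is just "predict the next token, compare with the real one, adjust". A toy PyTorch sketch, with random data and a deliberately tiny model, purely to show the shape of the objective:

```python
import torch
import torch.nn as nn

# Toy next-token predictor: embedding + linear head. Vocabulary and
# "text" are random; only the structure of the objective matters here.
vocab_size, dim = 100, 32
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (1000,))  # a fake token sequence
inputs, targets = tokens[:-1], tokens[1:]       # target = the actual next token

for step in range(100):
    logits = model(inputs)           # predicted distribution over the next token
    loss = loss_fn(logits, targets)  # compare prediction with the real next token
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```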

An algorithm engineer argues that generative language models are indeed more like liberal arts students than science students. What a language model learns during data training is correlation, which gets AI to roughly average human level in text writing, whereas mathematical reasoning depends more on causality. Mathematics is highly abstract and logic-driven, fundamentally different from the language data the model processes. For a large model to learn math well, it needs not only world knowledge but also training in thinking, so that it acquires the ability to reason and deduce.

In addition, faced with the models' collective failure on simple math, most industry professionals immediately think of how the tokenizer segments numbers. In a large language model, the tokenizer splits the input text into smaller pieces (tokens) for the model to process. But tokenizers are not designed specifically for math, so numbers may be split into unnatural pieces, destroying their integrity and making them hard for the model to understand and compute with.

Zhang Junlin, head of new technology R&D at Sina Weibo, explained that early LLM tokenizers generally did no special handling of numbers and often merged several consecutive digits into one token. For example, "13579" might be cut into three tokens: "13", "57", and "9". Which digits get merged depends on statistics of the training data. With no certainty about which digit fragments form a token, multi-digit arithmetic becomes very difficult for an LLM.
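This is easy to observe with a real tokenizer. A small sketch using OpenAI's open-source tiktoken library (the exact splits depend on the vocabulary, so the outputs are indicative rather than guaranteed):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era vocabulary

for text in ["13579", "9.11", "9.9"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]  # show how each string was cut
    print(text, "->", pieces)

# Digit runs come out as multi-digit chunks determined by corpus
# statistics, e.g. "13579" may be split as ["135", "79"].
```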

However, these problems are gradually being solved. On the thinking-ability side, the core issue may still be the training corpus. Large language models are trained mainly on Internet text, which contains relatively few math problems and worked solutions, leaving the model with limited exposure to mathematical reasoning and problem-solving skills.

Addressing the shortfall in complex reasoning, Lin Dahua, a leading scientist at the Shanghai Artificial Intelligence Laboratory, told Caixin in an interview that future training of large models cannot simply rely on collecting and pouring in Internet data; the data must be constructed in a more systematic way.

The key to complex reasoning is constructing large amounts of procedural content: for example, building hundreds of millions of examples of the step-by-step process of solving geometry problems and training the model on them, so it gradually learns how solutions unfold. Such data is hard to obtain at scale from the Internet. "In the future, in terms of model training data, especially in breaking through to higher levels of intelligence, we will rely more and more on constructed data, not directly crawled data," Lin Dahua believes.
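Even for the decimal comparison that tripped these models up, such procedural data can be generated mechanically. A toy sketch of one constructed training item (the schema is invented here for illustration, not any lab's actual pipeline):

```python
import random

def make_example() -> dict:
    """One synthetic item: a decimal-comparison question plus a worked,
    step-by-step solution. The format is invented for illustration."""
    a = round(random.uniform(1, 20), 2)
    b = round(random.uniform(1, 20), 1)
    if a == b:
        return make_example()  # rare collision: draw again
    bigger, smaller = (a, b) if a > b else (b, a)
    solution = (
        f"Write both numbers with two decimal places: {a:.2f} and {b:.2f}. "
        f"Compare digit by digit from the left. "
        f"Conclusion: {bigger} > {smaller}."
    )
    return {"question": f"Which is bigger, {a} or {b}?", "solution": solution}

# A real constructed corpus would vary the templates and repeat at scale.
for _ in range(3):
    print(make_example())
```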

It is worth noting that complex reasoning matters especially because it bears on reliability and accuracy, a key capability for deploying large models in scenarios such as finance and industry.

"Nowadays, many large models are used in customer service, chat, etc. In chat scenarios, talking nonsense does not have much impact, but it is difficult to implement in very serious business occasions." Lin Dahua previously said that complex reasoning is related to the reliability of large models when they are applied. For example, in scenarios such as finance, there can be no errors in numbers, and there will be high requirements for mathematical reliability. In addition, as large models enter commercial use, if you want to analyze a company's financial statements, or even some technical documents in the industrial field, mathematical computing capabilities will become a barrier.