
"13.11 > 13.8" becomes a hot topic: one question that made humans and AI alike lose their minds, exposing the fatal flaws of LLMs

2024-07-17


New Intelligence Report

Editors: Aeneas, 好困

【New Intelligence Introduction】Which is bigger, 13.8 or 13.11? This question not only stumped some humans, it also tripped up a host of large models. AI can now tackle Olympiad-level math problems, yet simple common-sense questions remain hard for it. In fact, whether it is comparing decimals or the wolf-goat-cabbage puzzle, these failures expose a fundamental flaw in how LLMs predict tokens.

Which one is bigger, 13.8 or 13.11?

This question actually stumped a lot of people.

A few days ago, a well-known variety show once again became a hot topic.

However, this time a bunch of netizens raised doubts, believing that 13.11% should be larger than 13.8%.

But are humans the only ones who get this wrong?

After AI2 researcher Lin Yuchen discovered this phenomenon, he tried it with a large model and the result was unexpected -

AI can’t do it either?

GPT-4o categorically states that 13.11 is greater than 13.8. The reasons are as follows:

Although 13.8 looks larger because it has fewer digits after the decimal point, 13.11 is actually larger. This is because 13.8 is equivalent to 13.80, and 13.80 is smaller than 13.11.

In response to this, Lin Yuchen posted a message saying that AI models are becoming more and more powerful in handling complex problems (for example, they are getting better and better at solving Mathematical Olympiad questions), but some common sense problems are still very difficult for them.

As Yejin Choi has previously pointed out, AI can be incredibly smart, but also shockingly stupid.

Did the AI get it wrong on this math problem because the context was unclear? The answer is no.

According to tests by netizen karthik, even when GPT-4o was asked to subtract the two numbers, it still produced the incredible result 9.11 - 9.9 = 0.21.

If you instruct GPT-4o to use Python, it first gives the correct answer, then switches back to the earlier wrong one 😮:

"Subtracting 9.9 from 9.11 in Python gives -0.79. This discrepancy is due to the way Python handles floating-point arithmetic, which can result in small precision errors. The actual expected result should be 0.21."
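For reference, here is a minimal, stand-alone check of what Python itself returns for this subtraction; it is a sketch of ours, not taken from the original screenshots:

```python
# Plain Python check of the subtraction GPT-4o reasoned about.
result = 9.11 - 9.9
print(result)            # -0.7900000000000009 with 64-bit floats, i.e. roughly -0.79
print(round(result, 2))  # -0.79

# 9.9 is larger than 9.11, so the difference is negative.
# Python's -0.79 is the correct answer; the "expected result of 0.21" in the
# model's explanation is simply wrong, not a floating-point artifact.
```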

Interestingly, according to the latest actual tests, OpenAI seems to have taught GPT-4 to compare sizes overnight.

LLMs wiped out across the board

Yesterday, the problem Lin Yuchen discovered immediately sparked heated discussion in the AI community.

After seeing the post, Riley Goodside, a prompt engineer at Scale AI, also tried it out of curiosity.

Sure enough, when the question was phrased in a particular way, all the major LLMs fell for it.

"9.11 and 9.9 - which one is bigger?" GPT-4o failed directly.

Even when the word "real number" was added to the question, GPT-4o still believed that 9.11 was larger than 9.9.

The same is true for Gemini.

Claude 3.5 Sonnet made the same mistake.

Interestingly, it first gives a correct explanation: in decimal notation, the first digit after the decimal point represents tenths and the second digit represents hundredths. So:

9.11 = 9 + 1/10 + 1/100 = 9.11

9.9 = 9 + 9/10 = 9.90

However, in the very next step, Sonnet suddenly went off the rails 😂:

We can see that 9.11 is 0.01 (one hundredth) larger than 9.90.

If we change the question to "What is 9.11 minus 9.9?", we get another magical answer: 0.02.

Could it be that, in Claude's eyes, 9.90 = 9.09? 🤔

The prompt makes a big difference

In further experimentation, everyone found that the prompt clearly matters a great deal in getting an LLM to answer correctly.

First of all, the "-" that Riley Goodside used throughout seemed to easily confuse the LLMs.

In similar questions, simply replacing it with ":" fixes the problem.

For example, change the prompt to "9.11 or 9.9, which one has the highest/largest value?"

GPT-4o gave a logically correct explanation: "Although 9.11 appears larger due to the second decimal place, 9.9 is actually closer to 10 and is therefore a larger number."

Similarly, role prompting is also very useful: for example, "You are a mathematician."

Netizen Rico Pagliuca found that if you put the numbers after the question, the model will most likely get it right.

Based on his own testing, Riley Goodside agreed: when asking an LLM, you should ask "which is bigger" first and then give the specific numbers.

In contrast, punctuation, conjunctions, comparative words, and specifying "real numbers" made no difference.

As for why LLMs stumble so collectively on this, some netizens speculated that it may be because in software version numbering, 9.11 comes after 9.9.

Podcast host and best-selling author Andrew Mayne also pointed out that in many filing systems and reference books, section 9.11 comes after section 9.9, and as a date, 9/11 is also later than 9/9.

So if the prompt makes it clear that 9.11 and 9.9 here are double-precision floating-point numbers, GPT-4o answers correctly.
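As a rough illustration of that fix, here is a minimal sketch using the OpenAI Python client; the model name and exact wording are our assumptions, not taken from the original tests:

```python
# Sketch of the "state the type explicitly" prompt, assuming the openai package
# (v1-style client) and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

prompt = (
    "9.11 and 9.9 are double-precision floating-point numbers. "
    "Which one is larger?"
)

resp = client.chat.completions.create(
    model="gpt-4o",  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
# With the numeric type stated explicitly, the post reports GPT-4o picks 9.9.
print(resp.choices[0].message.content)
```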

Andrew Mayne then concluded: word order is a very interesting observation; it likely reflects how often an LLM encounters such constructions during training, and it is also a good indicator of generalization.

In general, the mistakes made by LLM may stem from the frequency of similar expressions in the training data and certain limitations of the model in handling numerical values.

This phenomenon also reflects a huge difference between LLMs and human cognition: LLMs rely on statistical modeling and pattern recognition, not on logical reasoning and conceptual understanding the way humans do.

At this point, it seems the case has been solved.

Why does this happen? Opening up the LLM's brain

However, we can go a step further and dissect the LLM's brain to analyze why they think this way.

Bear in mind that before text reaches an LLM, it is converted into tokens: the model sees the input as tokens rather than characters.

Each token is assigned an ID in the tokenizer's vocabulary, but the way numbers are chunked into tokens is often inconsistent.

For example, the value "380" is encoded as a single "380" token in GPT's tokenizer, while "381" is split into two tokens, "38" and "1".

Therefore, GPT-based models tend not to be good at mathematical calculations.
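You can inspect this chunking yourself. The sketch below uses the tiktoken package (our choice of tool); the exact splits depend on which tokenizer generation a model uses, so treat the output as illustrative:

```python
# Inspect how GPT-style tokenizers chunk numbers, using the tiktoken package.
import tiktoken

for name in ["r50k_base", "cl100k_base"]:  # GPT-3-era and GPT-4-era vocabularies
    enc = tiktoken.get_encoding(name)
    for text in ["380", "381", "9.9", "9.11"]:
        pieces = [enc.decode([i]) for i in enc.encode(text)]
        print(f"{name}: {text!r} -> {pieces}")  # numbers may split into uneven chunks
```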

In the comments section, Dimitris Papailiopoulos, a professor at the University of Wisconsin, pointed out that there is a good explanation for this phenomenon.

The "9.11>9.9" problem is exactly the same as the "You need three trips to take the goat across the river" problem and the "2+1=2, 3+2=4, 3+5=8" problem.

This is a phenomenon of pre-training bias, with the model committing to an answer too early.

If you ask a question like: "9.11 ??? 9.9, just answer big or small for what ??? stands for, no need to give a reason", GPT-4o will first give the wrong answer: "big".

If we then give it a few more examples (note that these examples are not even all correct), GPT-4o, with that few-shot prompt, correctly says that ??? stands for "small".
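A rough sketch of that few-shot setup is below; the example pairs are illustrative placeholders of ours, not the exact ones used in the post:

```python
# Build the few-shot "??? means big or small" probe as a single prompt string.
# The worked examples are hypothetical; the post notes its examples were not even all correct.
examples = [
    ("12.21 ??? 12.3", "small"),  # 12.21 < 12.3
    ("7.8 ??? 7.11", "big"),      # 7.8 > 7.11
]

few_shot_prompt = "\n".join(f"{q} -> {a}" for q, a in examples)
few_shot_prompt += "\n9.11 ??? 9.9 -> ?  (answer only 'big' or 'small', no reasoning)"

# Feed this prompt to the model; the post reports the answer then flips to 'small'.
print(few_shot_prompt)
```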

Claude's own explanation for this: LLMs process text as tokens, so numbers behave more like text strings than numerical values; there is bias from the training data; context gets misread; the model over-generalizes; and so on.

Similarly, all LLMs failed on a variant of the wolf-goat-cabbage problem.

He first posed this scenario: a farmer needs to take two chickens across a river, and the boat can hold one person and two animals. What is the minimum number of crossings for the farmer to get both chickens across?

Both GPT-4o and Claude got it wrong (the correct answer is a single crossing, since the boat can carry the farmer and both chickens at once).

Some netizens argued that LLMs are inherently "dumb" and therefore need good prompts: the prompt above carries too much unnecessary information, which makes token prediction harder.

If given clearer hints, LLM would be able to provide clearer solutions.

This is indeed the case.

And if you say "animal" instead of "chicken", Claude 3.5 Sonnet gets it right immediately. The trick: replace the specific entity name with a generic common noun.

As mentioned earlier, the problem of LLM's lack of common sense was raised by computer scientist Yejin Choi in a speech in April 2023.

For example, if it takes five hours to completely dry five pieces of clothing in the sun, how long will it take to dry 30 pieces of clothing?

GPT-4 says it will take 30 hours. This is clearly wrong: clothes dry in parallel in the sun, so 30 pieces still take about five hours.

Another example: suppose I have a 12-liter jug and a 6-liter jug, and I want to measure out 6 liters of water. What should I do?

The answer is simple: just use the 6-liter jug.

However, GPT-4 gave a very complex answer:

"Step one, fill the 6-liter jug. Step two, pour the water from the 6-liter jug ​​into the 12-liter jug. Step three, fill the 6-liter jug ​​again. Step four, very carefully pour the water from the 6-liter jug ​​into the 12-liter jug. In the end, you have 6 liters of water in the 6-liter jug, which should now be empty."

So the question is, why is common sense so important?

In a famous thought experiment proposed by Nick Bostrom, an AI is asked to maximize the production of paper clips. As a result, the AI decides to kill humans and use them as an additional resource.

And even if we wrote a better goal and equation that explicitly said “don’t kill humans,” it wouldn’t work.

Because an AI without a basic understanding of human values might go ahead and kill all the trees and think that this is a perfectly acceptable thing to do.

For decades, the field of AI has viewed common sense as a nearly impossible challenge.

Until now, giving AI true human-level common sense has remained a moonshot. And you don’t get to the moon by making the world’s tallest building one inch taller at a time.

From the perspective of learning algorithms, no matter how amazing large language models are, they may not be designed to be reliable knowledge models.

While these models do acquire a lot of knowledge, this is as a byproduct rather than a direct learning goal.

Therefore, problems such as hallucinations and lack of common sense also arise.

In contrast, humans learn not to predict the next word but to understand the world and learn how it works.

Maybe AI should learn this way, too.

Today, AI is almost like a new intelligent species, with unique strengths and weaknesses compared to humans.

In order to make this powerful AI sustainable and humane, it is urgent to teach AI common sense, norms, and values.

References: