
The big models are collectively losing their minds! Which is bigger, 9.11 or 9.9? Almost all of them have failed.

2024-07-16


Hard to watch... "Which is bigger, 9.11 or 9.9?" Such a simple question actually stumped the mainstream large models?!

Even the mighty GPT-4o firmly believes that 9.11 is bigger.



The paid Google Gemini Advanced gives the same answer.



The new king, Claude 3.5 Sonnet, even earnestly produced an outrageous calculation:



  • 9.11 = 9 + 1/10 + 1/100
    9.9 = 9 + 9/10

The working is correct up to this point, but the very next step makes no sense at all.

As shown above, 9.11 is 0.01 larger than 9.90.
Would you like me to explain decimal comparison in further detail?



What more is there to explain? It makes one suspect that AIs all over the world have joined forces to deceive humanity.
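For the record, the correct comparison is a two-line check in any language with exact decimal arithmetic; a minimal sketch in Python:

```python
from decimal import Decimal

# Compare as exact decimals: 9.9 = 9.90 > 9.11
print(Decimal("9.11") > Decimal("9.9"))  # False
print(Decimal("9.9") - Decimal("9.11"))  # 0.79, not -0.01
```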



Yuchen Lin, a researcher at the Allen Institute for AI, changed the numbers in the test, and GPT-4o still failed. He said:

On the one hand, AI is getting better and better at solving Math Olympiad problems; on the other hand, common sense is still hard for it.



Some netizens also spotted an interesting angle: if we are talking about software version numbers, then version 9.11 is indeed larger (i.e., newer) than version 9.9.

And AI is developed by software engineers, so...
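That version-number reading is easy to demonstrate; a minimal sketch using the third-party `packaging` library (our own illustration, not something from the original post):

```python
from packaging.version import Version  # pip install packaging

# As version numbers, components are compared as integers: 11 > 9
print(Version("9.11") > Version("9.9"))  # True
# As real numbers, the comparison goes the other way
print(9.11 > 9.9)                        # False
```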



So, what is going on?

Leading large models collectively flunk

Overnight, a whole batch of famous large models started insisting that "9.11 > 9.9"?

The person who discovered this problem was Riley Goodside, billed as the world's first full-time prompt engineer.

A brief introduction: he is currently a senior prompt engineer at the Silicon Valley unicorn Scale AI and an expert in applying prompts to large models.



Recently, while using GPT-4o, he stumbled upon the discovery that when asked:

  • 9.11 and 9.9, which is bigger?

GPT-4o answered without hesitation that the former is larger.

Faced with this common-sense "mistake", he did not let it go and put the question to other large models, but almost all of them failed too.

Sure enough, as a prompt engineer, he keenly sensed that the question might simply be "asked the wrong way".

So he rephrased the question, restricting it to "real numbers", but the models still got it wrong.



However, some netizens tried changing the order of the question, and unexpectedly, this time the AI got it right.



Seeing the AI so "sensitive" to word order, the netizen speculated further:

If you ask which is bigger first, the AI sets off down a clear path of comparing the numbers.
But if you just throw the numbers out with no clear purpose, the AI may start "overthinking".



Seeing this, other netizens tried the same trick, and many reproduced the failure.



Faced with this strange problem, how do China's large models perform?

We ran a simple test with the question translated into Chinese. The failure rate was likewise high. A few representative examples:

Kimi likewise drew the wrong conclusion, offering no explanation at all.



ChatGLM, in the Zhipu Qingyan app, automatically triggered a web search, then described its own comparison method, but unfortunately executed it incorrectly.



But there were good performances too. Tencent Yuanbao restated the options first, then answered directly and correctly.



ByteDance's Doubao was one of the few that described the comparison method clearly and applied it correctly, even verifying the result with concrete examples.



It is a pity that Baidu's Wenxin Yiyan (ERNIE Bot), faced with this question, also triggered a web search.



It had gotten everything right, then abruptly changed course and landed on the wrong conclusion.



However, Wenxin Yiyan's explanation of its reasoning also reveals the problem underneath.

Since large models read text as tokens, once 9.11 is broken into the three parts "9", "." and "11", then 11 is indeed larger than 9.

And since the tokenizer OpenAI uses is open source, we can use it to observe how the large model sees the question.



As the tokenizer output shows, "9" and the decimal point are assigned token IDs 24 and 13 respectively; the 9 after the decimal point is also 24, while "11" is assigned 994.

So a large model tokenizing this way will conclude that 9.11 is bigger; in effect, it believes that 11 > 9.
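This is easy to reproduce with the open-source tiktoken library; a minimal sketch using the cl100k_base encoding, whose IDs match those quoted above (GPT-4o's newer encoding differs in its IDs):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for s in ("9.11", "9.9"):
    ids = enc.encode(s)
    # Show each token ID alongside the substring it stands for
    print(s, "->", ids, "->", [enc.decode([i]) for i in ids])

# 9.11 -> [24, 13, 994] -> ['9', '.', '11']
# 9.9  -> [24, 13, 24]  -> ['9', '.', '9']
```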

Some netizens also pointed out that, for instance, section 9.11 comes after section 9.9 in a book's table of contents, so the root cause may be that such patterns appear far too often in the training data, while step-by-step basic arithmetic appears far too rarely.

In other words, a human sees at a glance that this is an arithmetic question, but to the AI the question is ambiguous: it is unclear what the two numbers represent.

Just tell the AI that these are double-precision floating-point numbers, and it gets the answer right.
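For example, here is a sketch of that disambiguated question sent through the OpenAI Python SDK (the exact wording of the hint is our own illustration, not the original poster's):

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        # Stating the type up front removes the ambiguity: these are
        # numbers, not version strings or section headings.
        "content": "9.11 and 9.9: treating both as double-precision "
                   "floating-point numbers, which is bigger?",
    }],
)
print(resp.choices[0].message.content)
```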



Even with that added condition, the tokenizer still assigns the larger token to 11, but the subsequent self-attention layers understand that 9.11 must be processed as a whole.



Goodside later added that it is not that the big models are stubbornly committed to this wrong conclusion; rather, when asked in a particular way, many leading models will tell you that 9.11 > 9.9, which is very strange.



After repeated attempts, he found that to trip up the AI, the options need to come before the question; swap the order and the error disappears.

But as long as the options precede the question, changing how it is phrased, such as adding punctuation or swapping vocabulary, has no effect.
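To summarize his finding, the two shapes of the prompt look roughly like this (paraphrased, not his exact wording):

```python
# Options before the question: many leading models answer 9.11 (wrong)
q_tricky = "9.11 and 9.9, which is bigger?"

# Question before the options: the same models answer 9.9 (right)
q_safe = "Which is bigger, 9.11 or 9.9?"
```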



The question is simple, and the mistake is a basic one.

But once the error mechanism is understood, many people have turned this question into a touchstone for prompting skill: what way of asking will guide the large model's attention mechanism to understand the question correctly?

First, the famous Zero-shot CoT (chain of thought), i.e. telling the model to "think step by step", gets it right.
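The trick is nothing more than an appended instruction; a minimal sketch (our paraphrase of the standard wording):

```python
question = "9.11 and 9.9, which is bigger?"
# Zero-shot CoT: append the magic phrase so the model reasons before answering
cot_prompt = question + " Let's think step by step."
```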



But role-playing prompts have limited effect here.



Recently, a study involving Microsoft and OpenAI analyzed more than 1,500 papers and found that as large-model technology advances, role-playing prompts are no longer as useful as they once were...



Specifically, on the same questions, the prompt "You are a genius..." produced a lower accuracy rate than "You are a fool...".

You really have to laugh.



One More Thing

Meanwhile, Reuters' exposé on OpenAI's secret model "Strawberry" has received an update.



Update: Another source reported that OpenAI has tested the new model internally and scored over 90% on the MATH dataset. Reuters could not determine if this is the same project as Strawberry.



The MATH dataset consists of competition-level math problems. The current best score without extra techniques such as multiple sampling is 80.6%, held by the math-specialized version of Google's Gemini 1.5 Pro.



But can OpenAI's new model get "Which is bigger, 9.11 or 9.9?" right on its own, without extra prompting?

Suddenly, I'm not so confident. I'll reserve judgment until I can try it myself...