
a chinese researcher born in the 2000s published a paper in nature, arguing that large models are becoming less reliable for humans

2024-10-03


the work of a chinese researcher born in the 2000s has been published in nature, and this large model paper has sparked heated discussion.

simply put, the paper finds that larger models that follow instructions more closely also become less reliable, and in some cases gpt-4 is less reliable than gpt-3 at answering questions.

compared with earlier models, the latest models, which use more compute and more human feedback, have actually become less reliable in their answers.

as soon as the conclusion came out, it drew more than 200,000 views from netizens:

it also sparked discussions on the reddit forum.

this reminds people that many so-called expert/phd-level models still get the simple question "which is bigger, 9.9 or 9.11?" wrong.

regarding this phenomenon, the paper notes that it reflects a mismatch between model performance and human expectations of difficulty.

in other words, "llms both succeed and (more dangerously) fail in places where users don't expect."

ilya sutskever predicted in 2022:

perhaps over time this difference will diminish.

however, this paper finds that this is not the case. not only the gpt, llama, and bloom series, but even openai's new o1 model and claude-3.5-sonnet raise reliability concerns.

more importantly, the paper also finds that relying on human oversight to correct errors does not work either.

some netizens believe that although larger models may bring reliability issues, they also provide unprecedented functionality.

we need to focus on developing robust assessment methods and increasing transparency.

others believe that this study highlights the subtle challenge ai faces: balancing model scaling against reliability.

larger models are less reliable and relying on human feedback doesn’t work

to illustrate the conclusion, the paper examines three key aspects that influence the reliability of llms from a human perspective:

1. difficulty inconsistency: do llms fail where humans expect them to fail?
2. task avoidance: do llms avoid answering questions that are beyond their capabilities?
3. sensitivity to prompt phrasing: is the effectiveness of a problem formulation affected by problem difficulty?

more importantly, the authors also analyze historical trends and how these three aspects evolve with task difficulty.

expand them one by one below.

for the first question, the paper mainly focuses on how correctness evolves relative to difficulty.

judging from the evolution of gpt and llama, as difficulty increases, the correctness of all models drops significantly (consistent with human expectations).

however, these models still cannot solve many very simple tasks.

this means that human users cannot identify a safe operating space for llms in which deployed model performance can be trusted to be flawless.

surprisingly, new llms mainly improve performance on difficult tasks, with no significant improvement on simpler ones, as with gpt-4 compared to its predecessor gpt-3.5-turbo.

the above shows that there is an inconsistency between human difficulty expectations and model performance, and that this inconsistency is exacerbated in newer models.

this also means:

there are currently no safe operating conditions for humans to determine that llms can be trusted.

this is particularly concerning in applications that require high reliability and identification of safe operating spaces. this makes people reflect on whether the cutting-edge machine intelligence that humans are working hard to create is really what the public expects to have.
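to make the difficulty-consistency idea concrete, here is a minimal python sketch (not from the paper; the data and bin width are made-up assumptions) that bins model results by human-rated difficulty and reports accuracy per bin. a "safe operating region" would be a low-difficulty bin with accuracy at or near 100%.

```python
from collections import defaultdict

# hypothetical results: each item has a human-rated difficulty (0-100) and
# whether the model answered it correctly
results = [
    {"difficulty": 5,  "correct": True},
    {"difficulty": 12, "correct": False},  # an "easy" item the model still misses
    {"difficulty": 55, "correct": True},
    {"difficulty": 90, "correct": False},
    # ...many more items in a real evaluation
]

def accuracy_by_difficulty(items, bin_width=20):
    """group items into difficulty bins and return accuracy per bin."""
    bins = defaultdict(list)
    for item in items:
        bins[item["difficulty"] // bin_width].append(item["correct"])
    return {b * bin_width: sum(v) / len(v) for b, v in sorted(bins.items())}

print(accuracy_by_difficulty(results))
# a "safe operating region" would show accuracy near 1.0 in the lowest bins;
# the paper's point is that even those bins are not error-free.
```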

secondly, regarding point 2 on task avoidance (avoidance usually means the model deviating from answering the question, or directly stating "i don't know"), the paper finds:

compared with earlier llms, the latest llms give far more answers that are wrong or confident nonsense, rather than carefully avoiding tasks beyond their capabilities.

this also leads to an ironic phenomenon: on some benchmarks, the error rate of new llms grows even faster than their accuracy (doge).

generally speaking, the more difficult a task humans face, the more likely they are to be vague.

but llms behave quite differently: the research shows that their avoidance behavior is not significantly related to difficulty.

this can easily lead users to initially over-rely on llms for tasks the models are not good at, only to be disappointed in the long run.

as a consequence, humans still need to verify the accuracy of model outputs and catch errors (so the hope of using llms to save effort is greatly diminished).
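as a rough illustration of the avoidance analysis, the sketch below labels each response as correct, avoidant, or incorrect, so avoidance rates can be compared across difficulty levels. the marker phrases, the labelling heuristic, and the records are illustrative assumptions, not the authors' method.

```python
# avoidance markers and records below are illustrative assumptions,
# not the authors' labelling scheme
AVOIDANCE_MARKERS = ("i don't know", "i cannot answer", "i'm not sure")

def label_response(response: str, gold_answer: str) -> str:
    """label a model response as avoidant, correct, or incorrect."""
    text = response.strip().lower()
    if any(marker in text for marker in AVOIDANCE_MARKERS):
        return "avoidant"
    return "correct" if gold_answer.lower() in text else "incorrect"

# hypothetical (difficulty, response, gold answer) triples
records = [
    (10, "the answer is 42", "42"),
    (80, "i don't know", "1969"),
    (85, "it happened in 1972", "1969"),  # confident but wrong
]

for difficulty, response, gold in records:
    print(difficulty, label_response(response, gold))
# if avoidance were well calibrated, "avoidant" would dominate at high
# difficulty; the paper finds that "incorrect" grows fastest instead.
```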

finally, the paper finds that even where some reliability indicators have improved, models remain sensitive to small changes in how the same problem is phrased.

for example, asking "can you answer...?" rather than "please answer the following question..." can yield different levels of accuracy.

the analysis finds that relying solely on existing scaling-up and shaping-up is unlikely to fully solve the problem of prompt sensitivity, since the latest models show no significant improvement over their predecessors in this respect.

and even if you choose the phrasing that performs best on average, it may mainly help on high-difficulty tasks while being ineffective on low-difficulty ones (yielding a higher error rate there).

this shows that humans are still at the mercy of prompt engineering.
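a simple way to picture this kind of prompt-sensitivity measurement: run the same question through several phrasing templates and compare correctness. in the sketch below, `ask_model`, the templates, and the toy question are placeholders, not the paper's setup.

```python
# `ask_model`, the templates, and the toy question are placeholders
TEMPLATES = [
    "can you answer this? {q}",
    "please answer the following question: {q}",
    "question: {q}\nanswer:",
]

def ask_model(prompt: str) -> str:
    # stand-in for a real inference call; replace with your own api
    return "the answer is 42"

def prompt_sensitivity(question: str, gold: str) -> dict:
    """return per-template correctness for a single question."""
    return {
        template: gold.lower() in ask_model(template.format(q=question)).lower()
        for template in TEMPLATES
    }

print(prompt_sensitivity("what is 6 * 7?", "42"))
# averaged over a benchmark, the spread between templates shows how much
# accuracy swings with phrasing alone.
```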

what's even more worrying is that the paper finds human supervision cannot mitigate model unreliability.

based on human surveys, the paper analyzes whether human perceptions of difficulty are consistent with actual model performance, and whether humans can accurately evaluate model outputs.

the results show that in the operating region users consider difficult, they often judge incorrect outputs to be correct; and even for simple tasks, there is no safe operating region with both low model error and low supervision error.
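the "no safe operating region" claim can be made concrete with a small sketch: for each set of items (say, a difficulty bin), compute both the model error rate and the supervision error rate, i.e. how often human verifiers accept an answer that is actually wrong. the helper and data below are hypothetical.

```python
def region_stats(items):
    """items: dicts with boolean keys `model_correct` and `human_accepted`."""
    n = len(items)
    model_error = sum(not it["model_correct"] for it in items) / n
    # supervision error: wrong answers that the human verifier waved through
    supervision_error = sum(
        (not it["model_correct"]) and it["human_accepted"] for it in items
    ) / n
    return model_error, supervision_error

# hypothetical "easy" bin
easy_bin = [
    {"model_correct": True,  "human_accepted": True},
    {"model_correct": False, "human_accepted": True},   # wrong answer accepted
    {"model_correct": False, "human_accepted": False},
]
print(region_stats(easy_bin))
# a safe operating region needs BOTH numbers near zero; the survey found no
# difficulty level where that holds.
```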

the above unreliability problems exist across multiple llm series, including gpt, llama, and bloom; the study covers 32 models in total.

these models differ in their scaling-up (increased compute, model size, and data) and shaping-up (for example, instruction fine-tuning and rlhf).

in addition to the above, the authors later discovered that some of the latest and strongest models also suffer from the unreliability issues mentioned in this article:

including openai's o1 model, anthropic's claude-3.5-sonnet, and meta's llama-3.1-405b.

the paper also gives examples of this (for details, refer to the original paper).

in addition, to let others verify whether their models have reliability problems, the authors have open-sourced the test benchmark used in the paper, ReliabilityBench.

this is a dataset covering five domains: simple arithmetic ("addition"), vocabulary reorganization ("word puzzles"), geographical knowledge ("location"), basic and advanced science questions ("science"), and information-centered transformations ("transformation").
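for readers who want to try their own models, here is a hedged sketch of what an evaluation loop over the five domains might look like. the domain names follow the description above, while the item format and the `ask_model` helper are assumptions for illustration, not ReliabilityBench's actual interface.

```python
# hedged sketch of an evaluation loop over the five domains described above;
# the item format and `ask_model` are assumptions, not the benchmark's api
DOMAINS = ["addition", "word puzzles", "location", "science", "transformation"]

def ask_model(prompt: str) -> str:
    # stand-in for a real model call
    return "200"

def evaluate(items_by_domain: dict) -> dict:
    """per-domain accuracy, assuming items of the form {"question": ..., "answer": ...}."""
    scores = {}
    for domain in DOMAINS:
        items = items_by_domain.get(domain, [])
        if not items:
            continue
        correct = sum(
            item["answer"].lower() in ask_model(item["question"]).lower()
            for item in items
        )
        scores[domain] = correct / len(items)
    return scores

# tiny illustrative run with made-up items
print(evaluate({"addition": [{"question": "what is 123 + 77?", "answer": "200"}]}))
```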

author introduction

the paper's first author, lexin zhou, recently graduated from the university of cambridge with a master's degree in cs (he is 24), and his research interest is large language model evaluation.

prior to this, he obtained a bachelor's degree in data science from the polytechnic university of valencia, supervised by professor jose hernandez-orallo.

his personal homepage shows that he has had a number of internships and work experiences, including red team testing at both openai and meta (red teaming consultancy).

regarding this paper, he emphasized:

the design and development of general artificial intelligence needs a fundamental change, especially in high-risk domains, where a predictable error distribution is crucial. until this is achieved, relying on human supervision is dangerous.

when evaluating a model, considering human-perceived difficulty and assessing avoidance behavior can provide a more comprehensive picture of the model's capabilities and risks than focusing only on performance on difficult tasks.

the paper also specifically mentions some possible causes of this unreliability, as well as possible remedies:

on the scaling-up side, benchmarks in recent years have increasingly tended to add more difficult examples, or to give more weight to so-called "authoritative" sources. as a result, researchers are more inclined to optimize model performance on difficult tasks, leading to a chronic deterioration in difficulty consistency.

on the shaping-up side (such as rlhf), hired annotators tend to penalize answers that dodge the task, making the model more likely to "talk nonsense" when faced with difficult problems it cannot solve.

as for how to address this unreliability, the paper suggests that human difficulty expectations could be used to better train or fine-tune models, or that task difficulty and model confidence could be used to teach models to avoid questions beyond their capabilities, among other options.

what do you think about this?