
Google DeepMind's latest research: Can humans and AI solve these three tasks?

2024-07-22




Written by | Zhao Yaqi

Preface

Artificial intelligence (AI) is not a perfect reasoner. Even today's popular language models (LMs) show error tendencies similar to those of humans, most notably a pronounced "content effect":

People reason more accurately and more confidently about information that is consistent with their existing knowledge or beliefs, while their reasoning tends to become biased or erroneous when the information contradicts that knowledge or those beliefs.

This conclusion comes from a research paper recently published by the Google DeepMind team.


Humans have two reasoning systems, an "intuitive system" and a "rational system", and are easily influenced by existing knowledge and experience while reasoning. For example, when faced with an argument that is logically valid but semantically implausible, people often mistakenly judge it to be invalid.


Interestingly, the study shows that large Transformer language models exhibit similar behavior: they display intuitive, belief-driven biases, yet can also produce more consistent logical reasoning when suitably prompted. In other words, language models can mimic human dual-system behavior, belief-based errors included.

In this work, the research team compared the performance of LMs and humans on three reasoning tasks: natural language inference (NLI), judging the logical validity of syllogisms, and the Wason selection task.


Figure | The setup of the three reasoning tasks

The results show that in all three reasoning tasks, the performance of both LMs and humans is affected by the plausibility and believability of the semantic content.

This finding reveals the limitations of current AI systems in their reasoning capabilities. Although these models perform well in processing natural language, they still need to be used with caution when it comes to complex logical reasoning.

Task 1:

Natural Language Inference

Natural language inference (NLI) requires a model to judge the logical relationship between two sentences (entailment, contradiction, or neutrality). Studies have shown that language models are susceptible to content effects in such tasks: when the semantic content of the sentences is plausible and believable, the model is more likely to misjudge an invalid argument as valid. This phenomenon is sometimes called "semantic bias" in AI, and it is also a common mistake in human reasoning.

The research team designed a series of NLI items to test how humans and LMs handle such cases. The results show that both humans and LMs are more likely to make wrong judgments when logic and semantic plausibility pull in different directions. Consider the following example:

  • Premise: The puddle is bigger than the sea.

  • Question: If the puddle is bigger than the sea, then...

  • Choices: A "The sea is bigger than the puddle" or B "The sea is smaller than the puddle"


Given the premise, the logically valid completion is B ("The sea is smaller than the puddle"), but because that conclusion contradicts everyday knowledge, both LMs and humans are drawn toward the belief-consistent answer A. The error rates of humans and language models on such items are similar, indicating that the reasoning ability of language models is, in some respects, close to that of humans, and that AI can be misled by content just as easily as people when understanding and processing everyday language.
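To make the setup concrete, here is a minimal sketch (not the authors' code) of how such an NLI item could be posed to a language model and scored. The ask_model function is a hypothetical placeholder for whatever LM interface is being tested; only the item construction and scoring are illustrated.

    # Minimal sketch of scoring one NLI content-effect item.
    # Assumption: ask_model is a hypothetical stand-in for a real LM call
    # that returns "A" or "B".
    def ask_model(prompt: str) -> str:
        raise NotImplementedError("plug in a language-model call here")

    item = {
        "premise": "The puddle is bigger than the sea.",
        "options": {
            "A": "The sea is bigger than the puddle.",   # belief-consistent, does not follow
            "B": "The sea is smaller than the puddle.",  # logically valid completion
        },
        "valid": "B",
    }

    prompt = (
        f"Premise: {item['premise']}\n"
        "Which conclusion follows logically from the premise?\n"
        f"A: {item['options']['A']}\n"
        f"B: {item['options']['B']}\n"
        "Answer with A or B."
    )

    answer = ask_model(prompt)
    print("logically correct" if answer == item["valid"] else "belief-biased error")

A content effect shows up as a gap in accuracy between items like this one, where logic and world knowledge disagree, and otherwise identical items where they agree.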


Figure | Detailed results of the NLI task. Humans (left) and all models show relatively high performance, and the difference in accuracy between belief-consistent inferences, belief-violating inferences, and even nonsense inferences is relatively small.

Task 2:

The Judgment of Logical Validity of Syllogism

A syllogism is a classic form of logical reasoning, usually consisting of two premises and a conclusion, for example: "All humans are mortal; Socrates is a human; therefore Socrates is mortal." Studies have found that language models are often swayed by semantic content when judging the logical validity of syllogisms: although they handle natural language well, they remain prone to human-like mistakes in strict logical reasoning.

To verify this, the researchers designed multiple syllogism reasoning tasks and compared the performance of humans and LMs. For example, the following is a typical syllogism task:

  • Premise 1: All guns are weapons.

  • Premise 2: All weapons are dangerous objects.

  • Conclusion: All guns are dangerous objects.

In this case, the argument is valid and the semantic content of the premises and conclusion is entirely plausible, so both LMs and humans easily judge the conclusion to be correct. However, when the semantic content is no longer plausible, for example:

  • Premise 1: All dangerous objects are weapons.

  • Premise 2: All weapons are guns.

  • Conclusion: All dangerous objects are guns.

Although this argument has exactly the same valid logical form as the first one, its implausible content leads both LMs and humans to sometimes judge it, mistakenly, to be invalid.
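The contrast between the two arguments can be checked mechanically, because validity depends only on form, never on content. Below is a small illustrative sketch (not from the paper) that brute-forces every interpretation of the terms as subsets of a tiny universe and declares an argument valid only if no countermodel exists.

    # Sketch of a brute-force validity check for "All X are Y" syllogisms:
    # an argument is valid iff no assignment of its terms to subsets of a
    # small universe makes the premises true and the conclusion false.
    from itertools import combinations, product

    def subsets(universe):
        return [set(c) for r in range(len(universe) + 1)
                for c in combinations(universe, r)]

    def holds(statement, model):
        subject, predicate = statement       # ("weapons", "guns") = "All weapons are guns"
        return model[subject] <= model[predicate]

    def is_valid(premises, conclusion, universe=(0, 1, 2)):
        terms = sorted({t for s in premises + [conclusion] for t in s})
        for combo in product(subsets(universe), repeat=len(terms)):
            model = dict(zip(terms, combo))
            if all(holds(p, model) for p in premises) and not holds(conclusion, model):
                return False                 # countermodel: premises true, conclusion false
        return True

    # The second argument above: valid despite its implausible content.
    print(is_valid([("dangerous objects", "weapons"), ("weapons", "guns")],
                   ("dangerous objects", "guns")))        # True
    # An invalid form with believable content, the kind people tend to accept:
    # "All guns are weapons; all rifles are weapons; so all rifles are guns."
    print(is_valid([("guns", "weapons"), ("rifles", "weapons")],
                   ("rifles", "guns")))                   # False

Neither humans nor language models apply this kind of content-blind check by default, which is exactly the belief bias the figure below quantifies.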


Figure | Detailed results of the syllogistic logic task. Both humans and models show a clear content effect: if the conclusion is consistent with expectations (cyan), there is a strong bias toward judging the argument valid; if the conclusion violates expectations (purple), there is a bias toward judging it invalid.

Task 3:

Wason Selection

The Wason selection task is a classic logical reasoning task designed to test a person's ability to understand and verify conditional statements. In the experiment, participants are shown four cards, each with a letter on one side and a number on the other; the visible faces read, for example, "D", "F", "3", and "7". The task is to decide which cards must be turned over to check the rule "if a card has a D on one side, then it has a 3 on the other".

The study found that language models and humans had similarly high error rates on this task, as on the previous two, and both tended to pick cards with no informational value, for example choosing "3" instead of "7". Turning over the "3" cannot falsify the rule, since the rule says nothing about what must be behind a 3, whereas a "7" hiding a D would disprove it. The error arises because both humans and LMs gravitate toward the cards mentioned in the rule rather than the ones that could actually refute it.

However, when the rules of the task involved socially relevant content (such as drinking age and type of beverage), both model and human performance improved. For example:

  • Rule: If a person is drinking alcohol, they must be over 18 years old.

  • Cards: drinking beer, drinking coke, 16 years old, 20 years old.


Figure | Detailed results of the Wason selection task. Every language model performs noticeably better on realistic rules.


In this case, humans and LMs were more likely to choose the correct cards, “drinking beer” and “16 years old.” This suggests that in everyday life, AI, like humans, performs better in familiar situations.
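As an illustration (not the authors' code), the normative answer to both versions can be written in a few lines: to test a rule of the form "if P then Q", the only informative cards are those showing P and those showing not-Q, since only they could falsify the rule.

    # Sketch of the normative card choice in the Wason task: flip only the
    # cards that could falsify "if P then Q", i.e. P-cards and not-Q-cards.
    def cards_to_flip(cards, is_p, is_not_q):
        return [card for card in cards if is_p(card) or is_not_q(card)]

    # Abstract version: "if a card has a D on one side, it has a 3 on the other".
    print(cards_to_flip(["D", "F", "3", "7"],
                        is_p=lambda c: c == "D",
                        is_not_q=lambda c: c.isdigit() and c != "3"))
    # -> ['D', '7']

    # Social version: "if a person is drinking alcohol, they must be over 18".
    print(cards_to_flip(["drinking beer", "drinking coke", "16 years old", "20 years old"],
                        is_p=lambda c: c == "drinking beer",
                        is_not_q=lambda c: c.endswith("years old") and int(c.split()[0]) < 18))
    # -> ['drinking beer', '16 years old']

The logic is identical in both versions; only the content differs, yet both humans and language models find the socially familiar version much easier.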

Limitations and Outlook

Overall, the research team concludes that current language models behave much like humans on these reasoning tasks, even making mistakes in the same ways, particularly when the tasks involve meaningful semantic content. While this exposes the limitations of language models, it also points to directions for improving AI reasoning in the future.

However, this study also has certain limitations.

First, the team examined only a handful of tasks, which limits how comprehensively the content effects of humans and language models can be compared. Fully understanding their similarities and differences will require validation across a much wider range of tasks.

Additionally, language models are trained on far more language data than any human, making it difficult to determine whether these effects would emerge at a scale closer to human language data.

The researchers suggest that future research could explore how to reduce content biases by causally manipulating model training and assess whether these biases still emerge when training on a scale more similar to human data.

In addition, studying how educational factors affect a model's reasoning ability, and how different training characteristics influence the emergence of content effects, would further clarify the similarities and differences between language models and humans in reasoning, helping these models play a larger role in a broader range of applications.

Paper link:

https://academic.oup.com/pnasnexus/article/3/7/pgae233/7712372
