2024-08-19
Machine Heart Report
Editor: Zenan, Yali
The anthropomorphic behavior of large models gives us an uncanny valley feeling.
"The Turing test is a bad test standard, because conversational ability and reasoning are completely different things." In recent days, this new point of view has become popular in AI circles.
Now is the era of generative AI, and the standards we use to evaluate intelligence should change.
“Can machines think?” This was the question Alan Turing posed in his 1950 paper “Computing Machinery and Intelligence.” Turing quickly pointed out that the question was “meaningless and unworthy of discussion” given the difficulty of defining “thinking.” As is common in philosophical debates, he suggested replacing it with another question.
Turing envisioned an “imitation game” in which a human judge talks to a computer and a human (the foil), with each side trying to convince the judge that they are the real human.
Importantly, the computer, the foil, and the judges could not see each other, and they communicated entirely through text. After talking to each candidate, the judges guessed who was the real human.
Turing’s new question was: “Is there a conceivable digital computer that can perform well in the Imitation Game?”
Paper link:
https://academic.oup.com/mind/article/LIX/236/433/986238?login=false
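To make the setup concrete, here is a minimal sketch of the three-player imitation game as a test harness, written in Python. Turing specified only the roles, not any implementation, so the judge, machine, and human foil below are stand-in callables and the sample question is purely illustrative.

```python
import random

def run_imitation_game(judge_ask, judge_guess, machine, foil, n_questions=5):
    """One round of Turing's three-player imitation game.

    judge_ask(transcripts)   -> the judge's next question (a string)
    judge_guess(transcripts) -> "A" or "B", the channel the judge believes is human
    machine, foil            -> callables mapping a question to an answer string
    """
    # Randomly hide which anonymized text channel carries the machine,
    # since the judge must rely on text alone.
    if random.random() < 0.5:
        players, human_channel = {"A": machine, "B": foil}, "B"
    else:
        players, human_channel = {"A": foil, "B": machine}, "A"

    transcripts = {"A": [], "B": []}
    for _ in range(n_questions):
        for label, player in players.items():
            question = judge_ask(transcripts)
            transcripts[label].append((question, player(question)))

    # The machine "wins" the round if the judge misidentifies the human.
    return judge_guess(transcripts) != human_channel

# Toy usage: a judge that asks a fixed question and then guesses at random.
fooled = run_imitation_game(
    judge_ask=lambda t: "What do you do on a rainy Sunday?",
    judge_guess=lambda t: random.choice(["A", "B"]),
    machine=lambda q: "I like to read.",
    foil=lambda q: "Mostly I nap and read.",
)
print("Machine fooled the judge:", fooled)
```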
This game proposed by Turing, now widely known as the Turing Test, was designed to refute the widespread intuitive belief that "because of the mechanical nature of computers, it is impossible for them to think in principle."
Turing’s point was: if a computer is indistinguishable from a human in its performance (apart from its appearance and other physical features), then why shouldn’t we consider it a thinking entity? Why should we restrict the qualification of “thinking” to humans (or, more broadly, to entities made of biological cells)? As computer scientist Scott Aaronson described it, Turing’s proposal was a plea against “meat chauvinism.”
The Turing test is an idea rather than a "method"
Turing intended his test as a philosophical thought experiment rather than a way to actually measure machine intelligence. Yet in the public eye, the Turing test has become the ultimate milestone in artificial intelligence (AI) — the primary criterion for judging whether general machine intelligence has arrived.
Today, nearly 75 years later, coverage of AI is rife with claims that the Turing test has been passed, especially with the introduction of chatbots like OpenAI’s ChatGPT and Anthropic’s Claude.
Last year, OpenAI CEO Sam Altman wrote: "In the face of technological change, people's adaptability and resilience have been well demonstrated: the Turing test has quietly passed, and most people continue with their lives."
Major media outlets published similar headlines, such as one newspaper report that “ChatGPT passed the famous ‘Turing test’ — showing that the AI bot has human-like intelligence.”
The Daily Mail, a long-established British daily newspaper
Even the BBC, one of the world's largest and most influential public media organizations, reported in 2014 that a computer AI had passed the Turing test.
https://www.bbc.com/news/technology-27762088
The question, however, is: do modern chatbots actually pass the Turing test? And if so, should we give them the status of “thinking” as Turing proposed?
Surprisingly, despite its widespread cultural importance, the AI community has long disagreed on the criteria for passing the Turing test. Many have questioned whether having conversational skills that can deceive a human truly reveals anything about a system’s underlying intelligence or “thinking” ability.
In a thousand people's eyes, there may well be a thousand different standards for the Turing test.
Turing Award winner Geoffrey Hinton described his own "Turing test criterion" in an interview: if a chatbot such as PaLM can explain why a joke is funny, that can be regarded as a sign of intelligence. Today's large models, such as GPT-4, are already very good at explaining why a joke is funny.
Compared with other scientists' more rigorous definitions of the Turing test, Hinton's criterion, though humorous, still reveals his thinking about the ultimate question of whether artificial intelligence can think.
Interview video link: https://www.youtube.com/watch?v=PTF5Up1hMhw
A "Turing farce"
Turing never gave complete practical instructions for his test. His description of the imitation game is short on detail:
How long should the test last?
What types of questions are allowed?
What qualifications does a human judge or "foil" need to have?
Turing did not elaborate on these specifics, but he did make one concrete prediction: "I believe that in about fifty years' time it will be possible to programme computers ... to make them play the imitation game so well that an average interrogator will not have more than 70 per cent chance of making the right identification after five minutes of questioning." In short, in a five-minute conversation, the average judge would be misled at least 30% of the time.
Some take this offhand prediction as the "official" criterion for passing the Turing test. In 2014, a Turing test competition based on it was held at the Royal Society in London, featuring five computer programs, 30 human foils, and 30 judges.
The human foils were a diverse group: young and old, native and non-native English speakers, computer experts and non-experts. Each judge held several parallel five-minute conversations with a pair of contestants (one human, one machine), after which the judge had to guess which was the human.
A chatbot named "Eugene Goostman", which posed as a teenager, successfully deceived 10 of the 30 judges (a deception rate of 33.3%), clearly exceeding the 30% threshold in Turing's prediction.
Eugene Goostman simulated a 13-year-old boy.
Based on the criterion of a "30% deception rate within five minutes", the organizers announced: "The iconic 65-year-old Turing test has been passed for the first time by the computer program 'Eugene Goostman'... this milestone will go down in history..."
But after reading transcripts of Eugene Goostman's conversations, AI experts scoffed at the claim, arguing that a chatbot this simple and this unconvincingly human had not passed the test Turing envisioned.
The limited conversation time and uneven expertise of the judges made the test more of a test of human credulity than a demonstration of machine intelligence. The result was a stark example of the “ELIZA effect” — named after the 1960s chatbot ELIZA, which, despite its extreme simplicity, fooled many people into thinking it was an understanding and sympathetic psychotherapist.
This highlights our human psychological tendency to attribute intelligence to entities that can converse with us.
ELIZA, built in the mid-1960s, was one of the earliest chatbots to appear after Turing's paper was published. It was a very basic simulation of a Rogerian psychotherapist.
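For intuition about just how simple ELIZA was, here is a minimal sketch of ELIZA-style pattern matching in Python. The regex rules and pronoun reflections below are illustrative stand-ins, not Weizenbaum's original script.

```python
import re

# A few ELIZA-style rules: a regex that captures part of the user's
# input, plus a template that reflects the captured fragment back.
RULES = [
    (re.compile(r"i need (.*)", re.I), "Why do you need {0}?"),
    (re.compile(r"i am (.*)", re.I), "How long have you been {0}?"),
    (re.compile(r"my (.*)", re.I), "Tell me more about your {0}."),
]
DEFAULT = "Please go on."  # fallback when nothing matches

# First-person words are swapped so the reflection reads naturally.
REFLECTIONS = {"i": "you", "my": "your", "me": "you", "am": "are"}

def reflect(fragment: str) -> str:
    words = fragment.rstrip(".!?").split()
    return " ".join(REFLECTIONS.get(w.lower(), w) for w in words)

def respond(user_input: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(user_input)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT

print(respond("I am unhappy about my job"))
# -> How long have you been unhappy about your job?
```

No model of the conversation, no knowledge, no reasoning: just keyword spotting and pronoun swapping, which was enough to convince many users that they were understood.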
Another Turing test competition, the Loebner Prize, allows longer conversations, invites more expert judges, and requires the machine to fool at least half of the judges. With these higher standards, no machine ever passed this version of the test in nearly 30 years of annual competition.
The Turing test takes a turn
Although Turing's original paper lacked details about how to implement the test, it was clear that the Imitation Game required three players: a computer, a human foil, and a human judge. Over time, however, the term "Turing Test" has evolved in public discussion to mean a significantly weaker version: any interaction between a person and a computer that behaves sufficiently like a human is considered to have passed the Turing Test.
For example, when The Washington Post reported in 2022 that “Google’s AI passed a famous test — and showed how the test is broken,” it wasn’t referring to the imitation game, but to the fact that Google engineer Blake Lemoine believed Google’s LaMDA chatbot had become sentient.
In 2024, a Stanford University press release announced that the Stanford team’s research “marks one of the first times an artificial intelligence has passed a rigorous Turing test.” But the “Turing test” here consisted of comparing GPT-4’s behavioral statistics in psychological surveys and interactive games with those of humans.
This definition may be far from Turing's original intention: the Stanford team's definition is "We believe that an AI passes the Turing test when its responses are statistically indistinguishable from randomly selected human responses."
The latest claim that a chatbot has passed the Turing test comes from a 2024 study that used a “two-player” test: unlike Turing’s three-player imitation game (in which the judge questions both a computer and a human foil), here each judge interacted with either a computer or a human, but not both.
The researchers recruited 500 human participants, each of whom was assigned to be a judge or a human foil. Each judge played a five-minute round of the game with either the foil, GPT-4 (which was prompted with human-written suggestions on how to deceive the judge), or a version of the ELIZA chatbot. After talking for five minutes via a web interface, the judges guessed whether their conversation partner was a human or a machine.
In the end, the human foils were judged human in 67% of rounds; GPT-4 was judged human in 54% of rounds, and ELIZA in 22%. The authors defined "passing" as fooling the judges more than 50% of the time, that is, more often than random guessing. By this definition, GPT-4 passed, even though the human foils still scored higher.
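To see why "more than 50%" alone is a weak criterion, here is a small Python sketch that runs an exact one-sided binomial test on these deception rates. The per-condition round counts are hypothetical (the study is summarized here only by percentages); the point is simply that whether 54% is meaningfully above chance depends on how many rounds were played.

```python
from math import comb

def binom_tail(successes: int, trials: int, p: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, p): the chance of being
    judged human at least this often if judges were guessing at random."""
    return sum(comb(trials, k) * p**k * (1 - p) ** (trials - k)
               for k in range(successes, trials + 1))

# Reported rates: human foils 67%, GPT-4 54%, ELIZA 22%.
# 100 rounds per condition is an assumed, illustrative figure.
for name, rate, rounds in [("human foil", 0.67, 100),
                           ("GPT-4", 0.54, 100),
                           ("ELIZA", 0.22, 100)]:
    judged_human = round(rate * rounds)
    print(f"{name:10s} rate={rate:.0%}  "
          f"P(at least this often by chance)={binom_tail(judged_human, rounds):.3f}")
```

With these assumed counts, 67% is far above chance while 54% is not clearly so; larger samples are what make the 50% criterion statistically sharp.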
Worryingly, more than half of the human judges were fooled by GPT-4 after just five minutes of conversation. Using generative AI systems to impersonate humans in order to spread disinformation or commit fraud is a risk society must address. But do today’s chatbots really pass the Turing test?
The answer, of course, is that it depends on which version of the test you’re talking about. A three-person imitation game with expert judges and longer conversations still hasn’t been passed by any machine (someone is planning a super-rigorous version for 2029).
Because the Turing test focuses on fooling humans rather than testing intelligence more directly, many AI researchers have long regarded it as a distraction: a test designed not for AI to pass, but for humans to fail. Yet in most people's eyes, the test still looms large.
Having conversations is an important way for each of us to assess other humans, and it’s natural to assume that an agent that can converse fluently must possess human-like intelligence and other psychological traits, such as beliefs, desires, and self-awareness.
Yet if the history of AI has taught us anything, it’s that these assumptions are often based on false intuitions. Decades ago, many prominent AI experts believed that creating a machine capable of beating a human at chess would require something on par with full human intelligence.
“If one could design a successful chess-playing machine, he would seem to have penetrated to the heart of human intelligence,” AI pioneers Allen Newell and Herbert Simon wrote in 1958. Cognitive scientist Douglas Hofstadter predicted in 1979 that in the future “there may be programs that can beat anyone at chess, but … they will be programs of general intelligence.”
Of course, over the next two decades, IBM’s Deep Blue defeated world chess champion Garry Kasparov using a brute-force approach that was a far cry from what we call “general intelligence.” Similarly, advances in artificial intelligence have shown that tasks once thought to require general intelligence — speech recognition, natural language translation, and even autonomous driving — can be accomplished by machines that lack human-level comprehension.
The Turing test may well become yet another casualty of our changing notions of intelligence. In 1950, Turing intuitively felt that the ability to converse like a human should be strong evidence of “thinking” and all its attendant abilities. That intuition remains compelling today. But perhaps what we learned from ELIZA and Eugene Goostman, and what we may still learn from ChatGPT and its ilk, is that being able to speak natural language fluently, like playing chess, is not conclusive evidence of the existence of general intelligence.
In fact, there is growing evidence from neuroscience that language fluency is surprisingly disconnected from other aspects of cognition. In a series of careful and convincing experiments, MIT neuroscientist Ev Fedorenko and colleagues have shown that the brain networks underlying what they call "formal linguistic competence" (the abilities involved in producing and comprehending language) are largely separate from those underlying common sense, reasoning, and the other aspects of what we might call "thinking." These researchers argue that our intuitive belief that fluent language is sufficient for general intelligence is a "fallacy."
“I believe that at the end of the century the use of words and general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted,” Turing wrote in his 1950 paper. We are not there yet. Was Turing’s prediction simply off by a few decades? Did the real change occur in our concept of “thinking”? Or is true intelligence more complex and subtle than Turing, and the rest of us, have realized? It remains to be seen.
Interestingly, former Google CEO Eric Schmidt also expressed his views in a recent speech at Stanford University.
For a long time, the universe seemed mysterious to humans, and the scientific revolution changed that. Now, however, AI once again confronts us with results whose underlying principles we cannot truly understand. Is the nature of knowledge changing? Should we start accepting the outputs of these AI models without demanding that they explain themselves to us?
Schmidt puts it this way: we can compare such a system to a teenager. If you have a teenager, you know they are a human being, but you don't fully understand their thinking. And yet society has clearly adapted to the existence of teenagers. Likewise, we may have knowledge systems we don't fully understand, but whose range of capabilities we do understand.
This is probably the best we can get.