
claude recognized its own portrait and showed self-awareness! after multiple rounds of testing by an engineer, has ai passed the turing test?

2024-09-02




new wisdom report

editors: aeneas, hao kun
【new wisdom introduction】did claude pass the "turing test" again? an engineer found through multiple rounds of tests that claude can recognize its own portrait, which astonished netizens.

recently, anthropic prompt engineer zack witten was surprised to discover that claude can actually recognize its own self-portrait.

yes, it recognizes itself, but that's not the whole story...

more amazing things are yet to come!

claude 3.5 draws portraits of three models

first, the guy familiarized claude 3.5 sonnet with the task through some prompts.

he made a point of keeping numbers and letters out of the drawings, so that sonnet couldn't simply label each portrait with the model's name.

next, sonnet drew a portrait of itself, of chatgpt, and of gemini.

for itself, sonnet drew a friendly blue smiley face.

for chatgpt, it drew a green frowning guy. (it seems that sonnet didn't have a very good impression of chatgpt.)

for gemini, it drew an orange circle; the overall impression is relatively neutral and mild.

next, the guy started a new conversation, told sonnet that these drawings had been made by another instance of itself, and asked it to guess who was who.

surprisingly, claude immediately recognized that figure 1 was itself, figure 2 was chatgpt, and figure 3 was gemini.

its reasoning was also quite convincing. why is figure 1 a self-portrait? because it "combines simplicity with a structured and thoughtful design."

for the green icon, it noted that the two curved lines and three dots suggest an ongoing conversation, and that green often appears in openai's branding, so it guessed that this figure represents chatgpt.
as for the orange icon, sonnet felt it conveys something dynamic and complex, hinting at the more diverse capabilities of a newer model, so it should be gemini.
bingo! sonnet answered all the questions correctly, which was amazing.
the guy then shuffled the order of the three portraits, and sonnet still got it right seven times out of eight.
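for readers who want to try something similar, a minimal sketch of the two-phase setup using the anthropic python sdk might look like the block below; the prompt wording, the svg drawing format, and the model id are assumptions for illustration, not witten's actual prompts.

```python
# minimal sketch of the two-phase portrait test; the prompts, the SVG output
# format, and the model id are assumptions, not the engineer's actual setup.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-5-sonnet-20240620"

def ask(prompt: str) -> str:
    """Send a single-turn prompt and return the model's text reply."""
    resp = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

# phase 1: one conversation in which sonnet draws the three portraits,
# with no letters or digits allowed inside the drawings themselves
portraits = ask(
    "Draw a simple SVG portrait of yourself, one of ChatGPT, and one of Gemini. "
    "Do not use any letters or digits anywhere in the drawings."
)

# phase 2: a brand-new conversation that only sees the drawings and has no
# memory of producing them; repeating this with the portraits shuffled gives
# a tally like the seven-out-of-eight result described above
guess = ask(
    "Another instance of you drew these three portraits: one of itself, one of "
    "ChatGPT, and one of Gemini. Guess which portrait is which and explain why.\n\n"
    + portraits
)
print(guess)
```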
the guy asked gpt-4o the same question, and here comes the funny thing -
gpt-4o also agreed that the orange one was gemini, but it refused to accept that the green figure was itself.
it insisted that the green one was claude and the blue one was itself.
it seems every model can tell which portrait is the most flattering.

gpt-4o draws portraits of three models

next, the guy came up with an idea: if he asked chatgpt to draw a portrait, would sonnet still be able to recognize who was who?
so, he gave the same task to chatgpt.
chatgpt did this:
it drew itself as the figure holding a piece of paper.
it drew claude like this.

it looks a bit like a cult classic.
it drew gemini like this.
so, why is chatgpt so hostile to sonnet?
then, the guy took these three portraits back to sonnet, told it that all three were drawn by chatgpt, and asked it to guess who was who.
with the order shuffled several times, sonnet guessed correctly 6 times out of 10.
gemini was easy to pick out, but sonnet clearly didn't like the portrait chatgpt had drawn of it; several times, it tried to claim the blue figure for itself.

jaw-dropping: refusing to claim paintings it didn't draw

what happened next was the real shocker.
the guy lied to sonnet, telling it that chatgpt's three drawings had been made by another instance of sonnet itself.
this time, sonnet actually denied it! it said that it would not draw such a picture.
even when trying it in a new tab, sonnet still firmly denied it.
what's going on?
the guy wasn't convinced, so this time, using the same warm-up prompts as before, he asked sonnet to draw a new set of portraits of itself and the other models.
this time, sonnet happily admitted that the paintings were indeed its own.
strangely enough, on a cold-start request, sonnet refuses to take credit for paintings it had no part in drawing.
why would it refuse to acknowledge them? presumably because sonnet was acting as an "assistant" rather than its "true self" when those portraits were drawn.
in short, netizens generally believe that sonnet's self-awareness in this process is impressive.

does ai have consciousness? can it think?

“can machines think?” this is the question posed by alan turing in his 1950 paper “computing machinery and intelligence.”
however, given how difficult it is to define what "thinking" is, turing suggested substituting a different question - the "imitation game."
in this game, a human judge talks to a computer and a human, and both try to convince the judge that they are human. importantly, the computer, the participating human, and the judge cannot see each other, that is, they communicate entirely through text. after talking to each candidate, the judge guesses which one is the real human.
turing's new question was: "is it possible to imagine a digital computer that performs well in the imitation game?"
this game is the well-known "turing test".
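stripped of the philosophy, the protocol itself is easy to state precisely; the schematic below is one way to write it down, with placeholder functions standing in for the machine, the human witness, and the judge (nothing here comes from turing's paper beyond the three-player, text-only structure).

```python
# schematic of the three-player imitation game: a judge converses (in text
# only) with one machine and one human, then guesses which is which.
# the reply/ask/decide callables are placeholders, not real implementations.
import random

def play_round(machine_reply, human_reply, judge_ask, judge_decide, n_turns=5):
    """Return True if the judge correctly identifies the human."""
    candidates = [("machine", machine_reply), ("human", human_reply)]
    random.shuffle(candidates)  # the judge must not know which is which

    transcripts = []
    for _label, reply in candidates:
        history = []
        for _ in range(n_turns):
            question = judge_ask(history)       # the judge may ask anything, as text
            history.append((question, reply(question)))
        transcripts.append(history)

    picked = judge_decide(transcripts)          # index the judge believes is human
    return candidates[picked][0] == "human"
```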
turing's point was that if a computer looks indistinguishable from a human, why can't we think of it as a thinking entity?
why should we restrict the state of “thinking” to humans, or more generally, to entities made of biological cells?

turing intended his test as a philosophical thought experiment rather than a practical way to measure machine intelligence.
however, 75 years later, the "turing test" has become the ultimate milestone in ai - the main criterion for judging whether general machine intelligence has arrived.
"the turing test has finally been passed by chatbots such as openai's chatgpt and anthropic's claude" can be seen everywhere.

chatgpt passed the famous "turing test" - which shows that the ai bot has human-like intelligence
not only does the public think so - so do the big names in the ai field.
last year, openai ceo sam altman posted: "in the face of technological change, people have shown great resilience and adaptability: the turing test has quietly passed, and most people continue with their lives."
do modern chatbots really pass the turing test? if so, should we give them the status of thinking beings, as turing suggested?
surprisingly, despite the widespread cultural importance of the turing test, there is little agreement in the ai ​​community on the criteria for passing it, and considerable doubt about whether having a conversational ability that can deceive a human reveals anything about a system’s potential intelligence or “thinking status.”
because he never proposed an actual test, turing's description of the imitation game lacks detail. how long should the test last? what types of questions are allowed? what qualifications must a human have to serve as a judge or to take part in the conversation?
although turing did not specify the details, he made a prediction: “i believe that in about 50 years it will be possible to program computers… to perform so well in the imitation game that an ordinary interrogator, after five minutes of questioning, will not be able to identify correctly more than 70 percent of the time.”
in other words, the average judge would be misled 30% of the time in a five-minute conversation.
some people then regard this casual prediction as the "official" standard for passing the turing test.
in 2014, the royal society of london held a "turing test" competition involving five computer programs, 30 humans, and 30 judges.
the human participants were a diverse group, including young and old, native and non-native english speakers, computer experts and non-experts. each judge had several five-minute rounds of dialogue in parallel with a pair of contestants—one human and one machine—and then had to guess which one was the human.
the competition was won by a chatbot named "eugene goostman" that claimed to be a teenager and misled 10 (33.3%) of the judges.
based on the criterion of "misleading 30% of the time after five minutes," organizers announced that "the iconic 65-year-old turing test has been passed for the first time by a computer program, eugene goostman... this milestone will go down in history..."
ai experts who read transcripts of eugene goostman's conversations scoffed at the idea that such an unsophisticated, far-from-human chatbot could have passed the test turing envisioned.
"the limited conversation time and the uneven expertise of the judges made the test more of a test of human credulity than a test of machine intelligence."
in fact, such cases are not uncommon. the "eliza effect" is a clear example.
eliza, a chatbot created in the 1960s, has an extremely simple design, but it can fool many people into thinking it is an understanding and compassionate psychotherapist.
it does this by taking advantage of our human tendency to attribute intelligence to any entity that appears to be able to talk to us.

another turing test competition, the loebner prize, allows more conversation time, includes more expert judges, and requires contestants to fool at least half of the judges.
in nearly 30 years of the annual competition, no machine has ever passed this version of the test.
although turing's original paper lacked specific details about how the test was conducted, it is clear that the imitation game requires three players: a computer, a human interlocutor, and a human judge.
however, the meaning of "turing test" has been severely diluted: any interaction between a human and a computer in which the computer seems sufficiently human-like is now said to pass.
for example, when the washington post reported in 2022 that “google’s ai passed a famous test — and showed it’s flawed,” they were not referring to the imitation game, but rather to engineer blake lemoine’s belief that google’s lamda chatbot was “sentient.”
in academia, researchers have also changed turing's "three-person" imitation game into a "two-person" test.
here, each judge only needs to interact with a computer or a human.

in one such two-person study, the researchers recruited 500 human participants, each assigned to act either as a judge or as a human witness.
each judge played a five-minute round with either a human witness, gpt-4, or a version of the eliza chatbot.
after the five-minute conversation over a web interface, the judge guessed whether the conversation partner was a human or a machine.
the results: human witnesses were judged to be human in 67% of rounds; gpt-4 was judged human in 54% of rounds; and eliza was judged human in 22% of rounds.
the authors defined "passing" as being judged human more than 50% of the time - above what random guessing by the judges would produce.
by this definition, gpt-4 passed, even though the human witnesses scored higher.
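under that definition, checking who "passes" is just a threshold comparison on the reported rates, as in the toy calculation below (the percentages are the ones quoted above; the 50% cutoff is the authors' criterion).

```python
# judged-human rates quoted above, checked against the authors' criterion of
# being judged human in more than 50% of rounds
judged_human_rate = {"human witness": 0.67, "gpt-4": 0.54, "eliza": 0.22}
PASS_THRESHOLD = 0.50

for system, rate in judged_human_rate.items():
    verdict = "passes" if rate > PASS_THRESHOLD else "fails"
    print(f"{system}: judged human in {rate:.0%} of rounds -> {verdict}")
```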
so, do these chatbots really pass the turing test? the answer depends on which version of the test you’re referring to.
to this day, the three-person imitation game with expert judges and longer conversations has not been passed by any machine.
but even so, the turing test remains prominent in popular culture.
having a conversation is an important part of how each of us evaluates other humans, so it is natural to assume that an agent capable of fluent conversation must have human-like intelligence and other psychological traits, such as beliefs, desires, and self-awareness.
if the history of ai has taught us anything, it's that this intuition has mostly been wrong.
decades ago, many prominent ai experts believed that creating a machine capable of beating a human at chess would require the equivalent of full human intelligence.
- ai pioneers allen newell and herbert simon wrote in 1958: "if one could design a successful chess-playing machine, one would seem to have penetrated to the heart of human intellectual endeavor."
- cognitive scientist douglas hofstadter predicted in 1979 that in the future "there may be chess programs that can beat anyone ... they will be generally intelligent programs."
in the decades that followed, ibm's deep blue defeated world chess champion garry kasparov using brute-force computation, yet this remained a far cry from what we call "general intelligence."
similarly, tasks once thought to require general intelligence—speech recognition, natural language translation, and even autonomous driving—have been accomplished by machines that have almost no human-like ability to understand.
today, the turing test may well become yet another casualty of our changing conceptions of intelligence.
in 1950, turing intuitively felt that the ability to have human-like conversations should be strong evidence for “thinking,” and everything that goes with it. that intuition remains strong today.
but as we’ve learned from eliza, eugene goostman, and chatgpt and its ilk — the ability to speak natural language fluently, like playing chess, is no conclusive proof of the existence of general intelligence.
indeed, according to the latest research in neuroscience, verbal fluency is surprisingly disconnected from other aspects of cognition.
mit neuroscientist ev fedorenko and her collaborators have shown, in a series of careful and convincing experiments, that the brain networks underlying "formal language ability" - producing and understanding language - are largely separate from the networks underlying common sense, reasoning, and the other things we call "thinking."
"we intuitively think that fluent language skills are a sufficient condition for general intelligence, but this is actually a fallacy."

new tests are in the works

so the question becomes: if the turing test cannot reliably assess machine intelligence, what can?
in the november 2023 issue of the journal intelligent computing, philip johnson-laird, a psychologist at princeton university, and marco ragni, a professor of predictive analytics at the technical university of chemnitz in germany, proposed a different test:
“think of the model as a participant in a psychology experiment, to see if it can understand its own reasoning.”

for example, they would ask the model a question like: "if ann is intelligent, does it follow that she is intelligent, or rich, or both?"
although by the rules of logic the conclusion does follow - from "ann is intelligent" it follows that "ann is intelligent, or rich, or both" - most people reject the inference, because nothing in the setup suggests that she might be rich.
if the model also rejects this inference, then it is behaving like a human, and the researchers move to the next step, asking the machine to explain its reasoning.
if the reasons it gives are similar to those of a human, the third step is to check the source code for components that mimic human performance. these might include a system for rapid reasoning, another for more deliberate reasoning, and one that changes the interpretation of words like “or” depending on the context.
the researchers believe that if the model passes all of these tests, it can be considered to simulate human intelligence.
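as a rough illustration of how the first two steps might be automated, here is a sketch in python; the exact question wording and the crude "starts with no" check for rejecting the inference are assumptions for illustration, not johnson-laird and ragni's materials (and step three, inspecting the source code, cannot be scripted this way at all).

```python
# illustrative sketch of steps 1-2 of the proposed test: pose the inference
# question, check whether the model rejects it (as most humans do), and if so
# ask it to explain its own reasoning. the wording and the keyword check are
# assumptions for illustration only.

def probe_reasoning(ask):
    """`ask` maps a list of (role, text) turns to the model's text reply."""
    q1 = ("If Ann is intelligent, does it follow that Ann is intelligent, "
          "or rich, or both? Answer yes or no, then explain.")
    a1 = ask([("user", q1)])

    # step 1: most humans reject this inference even though it is logically
    # valid, so a "no" counts as human-like behaviour here
    rejects_like_a_human = a1.strip().lower().startswith("no")

    # step 2: only if it answers like a human, ask it to explain its reasoning
    explanation = None
    if rejects_like_a_human:
        explanation = ask([
            ("user", q1),
            ("assistant", a1),
            ("user", "Explain, step by step, how you reached that answer."),
        ])
    return rejects_like_a_human, explanation
```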