
AI large models' "college entrance exam" results are out: nearly all lean toward the liberal arts, show weakness in mathematics, and are notably "stubborn" in their problem-solving approach

2024-07-26



As soon as the 2024 National College Entrance Examination ended, OpenCompass, the open-source, open evaluation system for large models developed by the Shanghai Artificial Intelligence Laboratory, selected seven large AI models from China and abroad to sit every subject of the exam. The seven AI candidates' papers were graded by teachers with college entrance examination marking experience, who did not know which candidate had produced which answers.

The results were released recently: the Wenquxing language model of the Shusheng·Puyu 2.0 series, the Qwen2-72B model from Alibaba's Tongyi Qianwen, and GPT-4o took the top three places among the AI candidates. Measured against this year's admission cut-offs in Henan Province, all three scored above the "first-tier line" in the liberal arts track and comfortably above the "second-tier line" in the science track.

After analyzing the answers the AI candidates submitted, industry experts concluded that, at this stage, large models follow a very different train of thought from humans when tackling memorization and logic problems, and that this difference also points the way for AI's future evolution.

Strong on the language papers, but the math free-response questions proved an "insurmountable hurdle"

The results show that every AI candidate is lopsided across subjects; all of them look like "liberal arts students".

Among the seven large models, four scored above 130 on the New Curriculum Standard Paper I English exam. GPT-4o led the field in English, and one English examiner praised its essay as "rich in sentence patterns and flawless in language". Its word count fell slightly short, however, so the teacher deducted one point at their discretion.

The AI candidates also did well on the Chinese New Curriculum Standard Paper I: their average scoring rate in modern text reading, classical poetry reading, famous-passage dictation, and essay writing was above 70%.

AI is usually credited with strong logical reasoning, but in this test the AI candidates were all but wiped out on the New Curriculum Standard Mathematics Paper I: none of them reached even half of the full score (i.e. 75 points). The free-response questions became a hurdle these candidates could not clear, with an average scoring rate of only 18.9% across the five free-response questions.

Zhang Junping, a professor at the School of Computer Science and Technology at Fudan University, noted that the AI candidates were all large language models trained on text corpora, which gave them an advantage on the language papers. Mathematics and physics, by contrast, demand a degree of reasoning ability, which has always been a weak point of large models.

The "fast system" thinking mode prevents AI test takers from "drafting"

Why are the AI test-takers' scores so lopsided? Many artificial intelligence researchers point out that it has much to do with the way large models "think" at this stage.

"When doing questions, people usually form a solution idea first and then answer. But AI is different. They just try to do it without thinking, and then make up a solution if they can't solve it." The person in charge of the Shanghai Artificial Intelligence Laboratory told reporters that the problem-solving process of math and science questions is extremely uncertain. Therefore, human test takers usually sort out their ideas on a draft paper before starting to answer the questions. The large model generates texts sequentially and lacks the ability to "draft". If their ideas go astray when answering questions, there is basically no room for recovery.

"The two thinking modes of AI candidates and human candidates can be respectively analogized to the 'fast system' and 'slow system' proposed by Daniel Kahneman in 'Thinking, Fast and Slow'." Zhang Junping explained that AI always outputs answers quickly and uses probability calculations to simulate the reasoning process, while humans' understanding of problems often relies on accumulated experience and can look at things as a whole and macro, so they can also see more deeply.

The problems exposed on the test papers are themselves a "new exam" for AI's development

In the college entrance examination, humans still hold a clear lead over AI. "Having large AI models sit the college entrance examination is a way to gauge their true current level, identify problems, and keep pushing the technology forward," the relevant person in charge at the Shanghai Artificial Intelligence Laboratory emphasized. The AI candidates' results exposed both the strengths and the weaknesses of large models, and raised many directions worth considering for their future development.

The person in charge also told reporters that most models cannot correct their own errors: once they make a mistake they "press ahead" regardless, or even paper over it by "talking nonsense". Improving this self-correction ability may therefore deserve special attention in the future training of large models.

In addition, "hallucination" in large models persists: they will fabricate content with a straight face. "In this test, some large models made up poems, and some examiners mistakenly assumed the invented poem really existed and that they simply didn't know it," the relevant person in charge at the laboratory added, noting that making AI more trustworthy remains a work in progress.

Author: Zhang Feiya

Text: Trainee reporter Zhang Feiya | Photo: Visual China | Editor: Zhang Feiya | Editor-in-chief: Fan Liping

Please indicate the source when reprinting this article.