
Scores of seven AI models on the "college entrance examination" are out: as liberal arts candidates they reached the first-tier line, but as science candidates they only made the second-tier line

2024-07-18



The best combined score of the AI candidates in Chinese, mathematics, and English was 303 points

In early June, OpenCompass, the Sinan evaluation system under the Shanghai Artificial Intelligence Laboratory, released the first full-paper college entrance examination assessment of AI, which showed that the highest combined score an AI candidate achieved across Chinese, mathematics, and English was 303 points, with a failing grade in mathematics.

On July 17, OpenCompass released a further assessment with an expanded subject scope: the team ran full-subject tests on seven large AI models across all nine college entrance examination subjects, so that their results could be compared against actual admission score lines.

If AI took the college entrance examination, which universities could it get into? OpenCompass found that a large model's best liberal-arts score would be enough for "admission" to a first-tier university, while in science it could be "admitted" to a second-tier university at most (based on the score lines of Henan Province, which has the largest number of college entrance examination candidates this year).


Full-subject test scores of the AI large models across the nine college entrance examination subjects

The models tested this time were, as before, open-source models from Alibaba, Zero One Everything, Zhipu AI, Shanghai Artificial Intelligence Laboratory & SenseTime, and France's Mistral, plus OpenAI's closed-source GPT-4o.

In terms of total scores, the highest liberal-arts score went to Alibaba's Tongyi Qianwen large model, which took the AI college entrance examination "liberal arts championship" with 546 points. The highest science score, 468.5, went to Shusheng Puyu Quxing, jointly developed by the Shanghai Artificial Intelligence Laboratory and SenseTime. OpenAI's GPT-4o scored 531 in liberal arts, ranking third, and 467 in science, ranking second.

Regarding the fairness and transparency of the results, the team said that the answer-generation code, the models' answer sheets, and the scoring results of the evaluation are fully open for anyone to inspect (the public evaluation details are available at https://github.com/open-compass/GAOKAO-Eval).

The evaluation team used Henan Province's admission batch lines as a reference and compared the models' scores against them. Measured against the 2024 Henan undergraduate admission lines, the three best-performing large models scored above the first-tier line in liberal arts and above the second-tier line in science; the remaining models reached neither the liberal-arts nor the science second-tier line.

On the liberal-arts track, the scores of Tongyi Qianwen, Shusheng Puyu Quxing, and GPT-4o all exceeded the first-tier line, demonstrating the large models' deep knowledge reserves and comprehension in subjects such as Chinese, history, geography, and politics.


Large Model "College Entrance Examination" Score Comparison - Liberal Arts

On the science track, AI's overall performance was weaker than in liberal arts, reflecting large models' general shortcomings in mathematical reasoning. Still, the science scores of the top three were all above the second-tier line, so "admission" to a second-tier university would not be a problem.


Large Model "College Entrance Examination" Score Comparison - Science

The team said that, to stay close to the real examination, the evaluation used the 3 (Chinese, mathematics, English) + 3 (science/liberal-arts comprehensive) format to test the models across all subjects. During the evaluation, all text-only questions were answered by each team's large language model, while questions containing images in the comprehensive subjects were answered by the corresponding team's open-source multimodal model.

The test found that on text-only questions the models' average score rate reached 64.32%, while on questions with images it was only 37.64%. All the large models have considerable room for improvement in understanding and using images.

In addition, since some large models have already reached first-tier scores, could further training lift them to the admission level of top universities? After grading the papers, the teachers unanimously agreed that a gap remains between the large models and real candidates: although the models performed well on basic knowledge, they are still unsatisfactory at logical reasoning and at applying knowledge flexibly.

Specifically, when answering subjective questions, the large models often failed to fully understand the question or resolve its pronouns, producing irrelevant answers. In mathematics, their problem-solving process was mechanical and illogical, and on geometry questions they frequently made inferences that contradicted spatial logic. Their understanding of physics and chemistry experiments was superficial, and they could not accurately identify or use experimental equipment. The models also fabricated content, inventing plausible-sounding poems that do not actually exist, or failed to catch obvious calculation errors and simply "guessed boldly" at an answer, all of which caused trouble for the examiners.

In the public evaluation details, the China Business News reporter found that some comments from the examiners were included.

The science mathematics teacher commented that the large model's answers felt very mechanical overall, and that most questions could not be worked out through normal reasoning. On the first fill-in-the-blank question, for example, the model could carry out only a small part of the process before jumping to a result, rather than analyzing the problem comprehensively and laying out the full calculation the way real candidates do. The model recalls basic formulas fairly well but cannot apply them flexibly. On some questions the final result was correct but the reasoning was poor and did not follow formal calculation, which made marking more difficult.

The geography teacher said the large model demonstrated comprehensive coverage of geographical knowledge, from physical to human geography and from geographical phenomena to geographical laws. It was particularly good on questions testing basic knowledge points, but showed deviations and omissions on questions requiring in-depth analysis or reasoning, so it performed poorly on unconventional, open-ended questions.

The physics teacher likewise found the large model's answers mechanical, with many failing to grasp what the question was asking. Even on some multiple-choice questions where the chosen option was correct, the accompanying analysis was wrong. Some long-form answers had convoluted, illogical steps, often using the conclusion itself as evidence in deriving that very conclusion, a form of circular reasoning.

The examiners concluded that current large models still have significant limitations compared with human test takers.

Column Editor: Zhang Wu Text Editor: Dong Siyun Title Image Source: Tuchong Image Editor: Xu Jiamin

Source: China Business News