
five dimensions to evaluate openai's new o1 model: its code writing, game making and other abilities are "amazing", but it "stumbles" on factual knowledge

2024-09-18


the legendary "strawberry" model was suddenly launched today without any warning!

the latest model released by openai is called o1, the first in a planned series of reasoning models, and it ships in two versions: o1-preview and o1-mini (a smaller version).

currently, o1-preview and o1-mini are available to chatgpt plus and team subscribers, while enterprise and edu users will get access early next week. openai said it plans to provide o1-mini access to all free users of chatgpt, but has not yet determined a release date.

according to openai, the o1 model is closer to human thinking than any previous model in terms of problem-solving capabilities and is able to "reason" about mathematical, coding, and scientific tasks.

in order to verify whether the new model is as powerful as openai claims, a reporter from the daily economic news tested the o1-preview model on five dimensions: the classic "strawberry test", code writing, mini-game production, mathematics and economics, and factual knowledge.

the results show that o1-preview demonstrates programming and mathematical reasoning capabilities that surpass the large models previously released by openai: it can write smoothly running code and still reason out solutions in complex environments. moreover, during the test, the reporter also felt that o1-preview has greatly improved in human-like behavior, showing real-life thinking. the new model is not without shortcomings, however: it stumbled in the factual knowledge test.

the legendary "strawberry" is here

on september 12th local time, openai released a new model called o1, which is the first version in a series of "reasoning" models in its plan, and is also the "strawberry" model that has been rumored in the industry for a long time.

image source: x platform

for openai, o1 represents another step toward its goal of human-like ai. openai considers the capability o1 represents so important that the company decided to start over from the current gpt-4 model, "resetting the counter to 1" and even dropping the "gpt" branding that has defined its chatbots and the entire generative-ai craze so far.

o1 is built as a system that solves problems carefully and logically through a series of discrete steps, each step building on the previous one, similar to how humans reason.

openai chief scientist jakub pachocki said that previous models would start answering immediately upon receiving a user's query. "and this model (referring to o1) will take its time. it thinks about the problem, tries to break it down, finds angles, and strives to provide the best answer." this is much like what most people were told by their parents as children: think before you speak.

openai said o1 ranked in the 89th percentile on competitive programming problems (codeforces), placed among the top 500 u.s. students in a qualifier (aime) for the usa mathematical olympiad, and exceeded human phd-level accuracy on a benchmark (gpqa) of physics, biology, and chemistry problems.

in the research and blog posts released by openai, o1 appears to have very strong "reasoning" capabilities: it can not only solve advanced math and coding problems, but also crack complex ciphers and answer complex questions from experts and scholars about genetics, economics and quantum physics. a large number of charts show that in internal evaluations, o1 has surpassed the company's most advanced language model, gpt-4o, and potentially even humans, on problems in coding, math, and various sciences.

image source: openai official website

five dimensions of actual testing: code writing, game making and other abilities are "amazing", but it "failed" the factual knowledge test

in order to gain a deeper understanding of the powerful capabilities of the o1 model, a reporter from the daily economic news tested the o1-preview model from five dimensions: classic strawberry test, code writing, mini-game production, mathematics and economics, and factual knowledge.

1) strawberry test

first, the reporter tested a simple question that almost all large models have failed: "how many r's are there in the word strawberry?" judging from the generated result, o1-preview delivered a small surprise.
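for reference, the check the "strawberry test" asks the model to perform is trivial to do programmatically, which is exactly why models that pattern-match instead of reasoning tend to get it wrong; a one-line python sketch:

```python
# the "strawberry test": count how many times the letter r
# appears in the word "strawberry" (the correct answer is 3)
word = "strawberry"
count = word.count("r")
print(count)  # 3
```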

2) code writing

the reporter first asked o1-preview one of the best-known simple algorithm problems on the online programming platform leetcode: the two sum problem. o1 gave a very detailed reasoning process and answer.

then the reporter deliberately asked for an optimized answer. after thinking for 9 seconds, o1 recognized that the answer it had provided was already the optimal solution, explained why, and "thoughtfully" offered a suboptimal alternative as well. in the reporter's previous tests, other models would simply apologize and then change the answer to a suboptimal solution.
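for context, the optimal answer o1 stood by is the standard single-pass hash-map solution to two sum, which runs in O(n) time; a minimal python sketch:

```python
def two_sum(nums, target):
    """return indices [i, j] with nums[i] + nums[j] == target, or [] if none.

    single pass with a hash map: for each value, check whether its
    complement has already been seen. O(n) time and O(n) space, better
    than the O(n^2) brute-force pairwise check.
    """
    seen = {}  # value -> index where it was seen
    for i, x in enumerate(nums):
        if target - x in seen:
            return [seen[target - x], i]
        seen[x] = i
    return []

print(two_sum([2, 7, 11, 15], 9))  # [0, 1]
```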

3) mini game production

in its demonstration of the o1 model, openai showed off "writing a small game from a single sentence". during the test, the reporter asked o1-preview to recommend useful coding tools and help write a ping-pong game.

o1-preview took only 19 seconds to produce code that ran smoothly, complete with a study guide and words of encouragement, which is very user-friendly.
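the heart of such a ping-pong game is a simple bounce update. a minimal, framework-free sketch of the collision logic (the function name and fields here are hypothetical; a real game would add paddles, scoring and a render loop):

```python
def step(pos, vel, width, height):
    """advance the ball one tick, reflecting its velocity off the walls."""
    x, y = pos[0] + vel[0], pos[1] + vel[1]
    vx, vy = vel
    if x <= 0 or x >= width:   # hit left/right wall: reverse horizontal direction
        vx = -vx
    if y <= 0 or y >= height:  # hit top/bottom wall: reverse vertical direction
        vy = -vy
    return (x, y), (vx, vy)

# the ball moves freely until it reaches a wall, then bounces back
pos, vel = (1, 1), (3, 2)
pos, vel = step(pos, vel, width=10, height=6)  # -> (4, 3), vel unchanged
pos, vel = step(pos, vel, width=10, height=6)  # -> (7, 5), vel unchanged
pos, vel = step(pos, vel, width=10, height=6)  # both walls hit: vel flips to (-3, -2)
```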

to prevent o1-preview from cheating by answering from memory rather than reasoning, the reporter also asked o1-preview to change the code's running environment to a jupyter notebook. this is a python environment specialized for data analysis, and developers essentially never use it to develop small games.

after thinking it over, o1 still produced code that ran, though with noticeably more bugs than the previous version. this indirectly suggests the answer was genuinely reasoned out, rather than a canned solution memorized during training.

to further verify the innovative reasoning ability of o1-preview, the reporter then asked the model to develop a more complex and interesting mini-game based on this mini-game.

this time, o1's performance was genuinely surprising. building on the collision mechanism of the ping-pong game, the model iterated on its own to create a climbing jump game. other large models generally require users to describe their needs clearly before producing a good answer, but in this test the reporter gave no additional prompts, and o1 output a small game that ran smoothly and was, in the reporter's eyes, interesting enough.

4) science test

in terms of science tests, the reporter focused on testing o1-preview's performance in mathematics and economics.

first, the reporter asked o1-preview a mathematical reasoning question: possible approaches to the finite-time blow-up of the euler equations (the subject of a discussion article published just this week by terence tao, the famous chinese-american mathematician and fields medalist).

although o1 did not give a clear solution, it offered an approach, and that approach overlaps with professor tao's article in part (though only a small part).

in the economics field, the reporter asked o1-preview a complex economic-system question. the feedback had no major problems: the overall logic was clear and the thinking covered diverse dimensions. the mathematical formulas it gave contained some minor errors, but nothing serious.

5) factual knowledge and language comprehension

in this segment, the reporter asked o1-preview for anecdotes about zhu yuanzhang, the founding emperor of the ming dynasty, but o1 interpreted "anecdotes" as events that actually happened and narrated zhu yuanzhang's entire historical story.

at the same time, the reporter also asked this question to the gpt-4o model. in comparison, gpt-4o was able to understand the reporter's question well and told two widely circulated folk stories.

overall, openai's claim that the o1 model can approach human performance seems to be true in some respects.

what surprised the reporter most was that openai shows the model's thinking process to users as text. in that visible chain of thought, the model made extensive use of first-person phrases like "i think" and "i intend", which feels more human, as if a real person were explaining his or her reasoning in front of the user.

but this does not mean the o1 model is perfect. openai itself admitted that o1 is far inferior to gpt-4o in design, writing, and text editing, and o1 also lacks the ability to browse the web or handle documents and images.

what bothered the reporter most was that even for a very simple request, such as translating the output into chinese, o1 takes more than ten seconds to think, while gpt-4o handles the same request very quickly.

even in areas where openai has an advantage, the o1 model can suddenly show degraded performance and lazy output. andrej karpathy, a founding member of openai who has since left, complained: "it has refused to solve the riemann hypothesis for me. model laziness remains a major problem."

openai said the company will address these issues in subsequent updates; after all, this is just an early preview of its reasoning model.

daily economic news
