
openai's strawberry model drops in a late-night surprise release! phd-level physics, chemistry, and biology, far better than gpt-4o, now available in chatgpt

2024-09-13


author | vanilla

editor | li shuiqing

zhidongxi reported on september 13 that in the early hours of this morning, openai suddenly released a partial preview of the long-rumored "strawberry" model: openai o1-preview. this is a new series of ai models that can reason through complex tasks and solve harder problems in science, programming, and mathematics than previous models could.

▲openai releases o1 model

openai o1 is the first in a new series of ai models. unlike previous models, it has stronger reasoning ability and thinks carefully before answering, generating a long internal chain of thought. it ranks in the 89th percentile on competitive programming problems, places among the top 500 in the qualifier for the us mathematical olympiad, and exceeds human phd-level accuracy on benchmarks of physics, biology, and chemistry problems.

also newly released is o1-mini, a faster, smaller model trained with a framework similar to o1's. o1-mini excels in science and engineering, especially mathematics and programming, and costs 80% less than o1-preview.

openai regards these two models as such a major advance in complex reasoning that it reset the counter and named them o1, rather than treating them as a continuation of the gpt series.

however, even the reasoning-enhanced o1 model still stumbled badly on the notorious trap question of whether 9.9 or 9.11 is larger.

▲the o1 model answers the 9.9 vs. 9.11 comparison question

andrej karpathy, a founding member of openai who has since left to start his own company and a former senior director of ai at tesla, posted this morning, complaining: "o1-mini has always refused to solve the riemann hypothesis for me. model laziness is still a major problem 😞"

▲andrej karpathy complains about the o1 mini being "lazy"

openai says it has rigorously tested and evaluated the o1 preview version to ensure the model can be released safely. chatgpt plus and team users can select the two new models starting today, and tier 5 developers are the first to get api access to them.

openai also announced the core team behind the o1 model: 21 foundational contributors, among them ilya sutskever, the former chief scientist of openai who has since left to start his own company, plus 7 team leaders.

1. on par with human experts on mmlu, programming performance roughly 8x that of gpt-4o

as previously revealed, openai o1 is trained to spend more time thinking about a question before responding. it thinks before answering, generating a very long internal chain of thought, and, like a human, can refine its thought process, keep trying new strategies, and recognize its own mistakes.

as an early preview, openai o1 currently supports only text chat and lacks multimodal capabilities such as browsing the web for information or uploading files and images.

in terms of performance, openai o1 performs about as well as phd students on benchmark tasks in physics, chemistry, and biology, and excels at mathematics and programming.

▲openai o1's evaluation benchmarks in mathematics and programming

in the qualifying exam for the international mathematical olympiad (imo), openai's previous-generation model gpt-4o solved only 13% of the problems, while openai o1 reached 83%. in codeforces programming competitions, openai o1 ranks in the 89th percentile, versus just the 11th for gpt-4o. even the o1-preview version performs several times better than gpt-4o.

on most benchmarks, o1 performs significantly better than gpt-4o, improving on 54 of the 57 mmlu subcategories. with vision perception enabled, o1 scores 78.2% on mmmu, making it the first model to be competitive with human experts on that benchmark.

▲performance comparison between o1 preview version and gpt-4o

here are a few examples from the openai o1 preview:

1. solve a complex logic puzzle

the input is a complex age riddle: the princess is as old as the prince will be when the princess is twice as old as the prince was when the princess's age was half the sum of their current ages. how old are the prince and princess? give all solutions to this problem.

the model thought for more than 20 seconds before answering, and the logic of its answer is very coherent: it first sets up the age equations, converting the given statements into mathematical form, and then finds all possible solutions that satisfy them. its step-by-step analysis proceeds as follows:

the first step is to define the variables, using p for prince and q for princess; the second step is to understand the two conditions in the problem; the third step is to convert the conditions into equations; the fourth step is to solve the equations; the fifth step is to verify all the conditions with these values; the sixth step is to give all possible solutions.

finally, it presents its conclusion.
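since the exact demo prompt is hard to recover from the translation, here is a minimal brute-force sketch of the step-by-step approach described above, assuming the classic phrasing of the riddle; the variable names (p for the prince, q for the princess) and the search range are purely illustrative.

```python
# a brute-force sketch of the age riddle, assuming the classic phrasing
# (the exact demo prompt may differ): "the princess is as old as the
# prince will be when the princess is twice as old as the prince was
# when the princess's age was half the sum of their current ages."
# p = prince's current age, q = princess's current age.

def age_riddle_solutions(max_age: int = 60):
    solutions = []
    for p in range(1, max_age + 1):
        for q in range(1, max_age + 1):
            # t1: years ago when the princess was half their combined current ages
            # q - t1 == (p + q) / 2  =>  t1 = (q - p) / 2
            t1 = (q - p) / 2
            prince_then = p - t1
            # t2: years from now when the princess is twice prince_then
            # q + t2 == 2 * prince_then
            t2 = 2 * prince_then - q
            prince_at_t2 = p + t2
            # condition: the princess's current age equals the prince's age at t2
            if q == prince_at_t2 and t1 >= 0 and prince_then > 0:
                solutions.append((p, q))
    return solutions

# under this phrasing, every solution has princess : prince = 4 : 3,
# e.g. prince 6 and princess 8
print(age_riddle_solutions(24))  # [(3, 4), (6, 8), (9, 12), (12, 16), (15, 20), (18, 24)]
```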

2. translating a sentence full of errors

adding extra, unnecessary consonants disrupts korean reading. such sentences feel unnatural to native speakers, who automatically correct and understand them at a glance, but they pose a difficult challenge for a model.

after being given a severely corrupted korean prompt, openai o1 first recognized that the input contained garbled or misplaced korean characters and asked the user whether they wanted to check for input errors.

the o1 model first works out the underlying structure; after about 10 seconds of thinking, it decodes the garbled characters, deciphers the text, refines the translation, grasps the meaning, and converts it back into coherent language.

unlike gpt-4o, the o1 model thinks about the question before producing an answer, checks the text, and then reconstructs the correct sentence as if cracking a code. after about 15 seconds of thinking, o1 gives the final, polished translation.

this demonstrates that reasoning ability becomes a powerful tool for problem solving.

3. answering a notorious stumper for large language models: counting the letters in a word

this example is very simple: enter the word strawberry and ask the model how many r's are in the word.

as a result, gpt-4o gave the wrong answer: "2."

why would such an advanced model make such a simple mistake? because models like gpt-4o process text as tokens rather than individual characters or words, they can slip on questions that require reasoning about the letters inside a word.
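as a small illustration of that gap, the sketch below uses the open-source tiktoken library (with the o200k_base encoding associated with gpt-4o) to show that a model sees sub-word token ids rather than letters, while ordinary string code counts characters directly; the exact token split is incidental.

```python
# illustration: token-level view vs. character-level view of "strawberry".
# requires the tiktoken package (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("o200k_base")   # encoding associated with gpt-4o
tokens = enc.encode("strawberry")           # integer token ids, not letters
pieces = [enc.decode_single_token_bytes(t) for t in tokens]

print(tokens)                   # what a token-based model actually "sees"
print(pieces)                   # the word split into sub-word chunks
print("strawberry".count("r"))  # character-level count: 3
```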

the new reasoning-based o1 model, by contrast, gives the correct answer, 3, after thinking for a few seconds.

4. programming video games

the model is asked to write a video game called squirrel finder with pygame, given the following requirements: the player guides a "koala" icon around the screen with the arrow keys, avoids the floating strawberries, and wins by finding a squirrel within the 3-second time limit.

this was difficult for previous models, but o1-preview can do it. o1 spent 21 seconds thinking, using its thought process to plan the code structure (the game layout details, drawing instructions, screen setup, and so on), and then output the final game code.

copying and pasting the code into the sublime text editor and running it prints a few brief lines of instructions.

then you can start playing the "find the squirrel" game.
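for readers who want a feel for what such a program looks like, here is a minimal pygame sketch written to the requirements stated above. it is an illustrative reconstruction, not o1's actual output, and it uses labeled colored rectangles in place of the koala, strawberry, and squirrel icons.

```python
# minimal "squirrel finder" sketch: arrow keys move the koala, floating
# strawberries must be avoided, and touching the squirrel before the
# 3-second timer runs out wins. illustrative only, not o1's output.
import random
import sys

import pygame

WIDTH, HEIGHT = 640, 480
TIME_LIMIT_MS = 3000  # 3-second limit mentioned in the demo


def make_rect():
    return pygame.Rect(random.randint(0, WIDTH - 40), random.randint(0, HEIGHT - 40), 40, 40)


def main():
    pygame.init()
    screen = pygame.display.set_mode((WIDTH, HEIGHT))
    pygame.display.set_caption("squirrel finder")
    clock = pygame.time.Clock()
    font = pygame.font.SysFont(None, 36)

    koala = pygame.Rect(WIDTH // 2, HEIGHT // 2, 40, 40)
    squirrel = make_rect()
    strawberries = [make_rect() for _ in range(5)]
    velocities = [[random.choice((-3, 3)), random.choice((-3, 3))] for _ in strawberries]
    start = pygame.time.get_ticks()
    message = None

    while True:
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()

        keys = pygame.key.get_pressed()
        if message is None:
            # arrow keys move the koala; keep it inside the window
            koala.x += (keys[pygame.K_RIGHT] - keys[pygame.K_LEFT]) * 5
            koala.y += (keys[pygame.K_DOWN] - keys[pygame.K_UP]) * 5
            koala.clamp_ip(screen.get_rect())

            # strawberries drift and bounce off the walls
            for berry, vel in zip(strawberries, velocities):
                berry.x += vel[0]
                berry.y += vel[1]
                if berry.left < 0 or berry.right > WIDTH:
                    vel[0] = -vel[0]
                if berry.top < 0 or berry.bottom > HEIGHT:
                    vel[1] = -vel[1]

            if koala.colliderect(squirrel):
                message = "you win!"
            elif any(koala.colliderect(b) for b in strawberries):
                message = "hit a strawberry - you lose"
            elif pygame.time.get_ticks() - start > TIME_LIMIT_MS:
                message = "time's up - you lose"

        screen.fill((30, 30, 30))
        pygame.draw.rect(screen, (150, 150, 255), koala)      # koala
        pygame.draw.rect(screen, (160, 110, 60), squirrel)    # squirrel
        for berry in strawberries:
            pygame.draw.rect(screen, (220, 40, 60), berry)    # strawberries
        if message:
            screen.blit(font.render(message, True, (255, 255, 255)), (20, 20))
        pygame.display.flip()
        clock.tick(60)


if __name__ == "__main__":
    main()
```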

the o1 model exhibits significantly enhanced planning capabilities compared to previous models.

2. the mini version is 3~5x faster at 1/5 the cost of the standard version

openai also released"small cup version" model openai o1-mini,thatfaster, cheaper, and like the standard version, it excels in mathematics and programming.

openai o1-mini is optimized during pre-training for stem (science, technology, engineering, and mathematics) reasoning. after being trained with the same high-compute reinforcement learning (rl) pipeline as o1, o1-mini performs well on many reasoning tasks while being significantly more cost-efficient.

openai o1-mini is 80% cheaper than openai o1-preview, making it suitable for applications that require reasoning but not broad world knowledge. on some benchmarks that demand intelligence and reasoning, o1-mini even outperforms o1-preview.

▲mathematical performance and reasoning cost curve

in the high school mathematics competition aime, o1-mini scored 70%, roughly on par with the top 500 american high school students. o1 and o1-preview scored 74.4% and 44.6% respectively, but o1-mini is far cheaper than both.

for human preference evaluation, openai asked human raters to compare o1-mini and o1-preview against gpt-4o on challenging, open-ended prompts across different domains. like o1-preview, o1-mini is preferred over gpt-4o in reasoning-heavy domains, but not in language-centric ones.

▲human preference evaluation results

in terms of speed, gpt-4o, o1-mini, and o1-preview took 3 seconds, 9 seconds, and 32 seconds respectively to answer the same word-reasoning question. gpt-4o's answer was wrong, while the other two were correct, and o1-mini reached the answer quickly, roughly 3~5 times faster than o1-preview.

▲gpt-4o, o1-mini and o1-preview answer speed

of course, as a "castrated version" after all, openai o1-mini also has certain limitations. in terms of factual knowledge of non-stem topics such as dates, biographies, and daily trivia, o1-mini is limited and performs comparable to small models such as gpt-4o mini. openai said it will improve these limitations in future versions and expand the model to other professions and modalities outside of stem.

3. introducing reasoning tokens and using chains of thought to crack hard problems

like a human, o1 thinks for a long time before answering a difficult question, using a chain of thought.

through reinforcement learning, o1 learned to hone its chain of thought and refine its strategies. it can recognize and correct its mistakes, break tricky steps into simpler ones, and try a different approach when the current one isn't working. this process greatly improves the model's reasoning ability.

specifically, the o1 model introduces reasoning tokens. it uses these tokens to "think": breaking down its understanding of the prompt and considering multiple ways to respond. after the reasoning tokens are generated, the model produces the answer as visible completion tokens and discards the reasoning tokens from its context.

below is an example of a multi-step conversation between a user and the model. the input and output tokens of each step are preserved, while the reasoning tokens are discarded.

▲o1 model reasoning process
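in practice, this is what the pattern looks like from the developer's side. the sketch below uses the official openai python sdk and simply appends each visible user and assistant message to the conversation history; the reasoning tokens o1 generates internally are billed but never returned, so there is nothing to carry over. error handling is omitted.

```python
# minimal multi-turn sketch with the official openai python sdk
# (assumes the package is installed and OPENAI_API_KEY is set).
from openai import OpenAI

client = OpenAI()
history = []  # visible conversation only

def ask(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="o1-preview",   # or "o1-mini"
        messages=history,     # no system message: not supported at launch
    )
    answer = response.choices[0].message.content
    # response.usage.completion_tokens_details (if your sdk version exposes
    # it) reports how many hidden reasoning tokens the turn consumed.
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("How many r's are in 'strawberry'?"))
print(ask("And how many in 'raspberry'?"))  # only visible turns are reused
```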

it is worth noting that while training its large-scale reinforcement learning algorithm, openai found that o1's performance keeps improving as thinking time increases, that is, as train-time and test-time compute grow. this is very different from the scaling laws of large-model pre-training.

▲o1 performance improves steadily with train-time and test-time compute

to demonstrate the leap achieved by o1, openai released a preview of the thought chains generated by o1 when solving difficult problems such as programming, mathematics, decoding, and english.

for example, given a decoding question, gpt-4o first breaks down the input, output, and examples, and then begins to analyze possible decoding methods.

▲gpt-4o disassembles input, output and examples

it guessed that the first phrase might follow the same structure as the example, realized that the input text seemed to be divided into groups based on natural separations or patterns, but then gave up, saying it needed more context about the possible transformations or letter shifts involved.

▲gpt-4o said more information is needed

openai o1-preview, by contrast, thought it through and answered accurately.

▲o1-preview correctly answers the decoding question

although the final answer is brief, o1's thinking process is very long, and its style and wording are strikingly human. it first asks itself "what happened here?", then restates the request, and then starts breaking down the task and clarifying the goal.

▲o1's thinking process

next, o1 starts observing the information it has and analyzing it step by step.

▲o1's thinking process

after some reasoning, o1 begins to propose different solutions. during this process, it will suddenly say, like a human, "wait a minute, i think...", and then start trying new approaches.

▲o1's thinking process

what's more, o1's thinking process even contains colloquial, emotional expressions such as "hmm" and "interesting".

▲o1's thinking process

the complete chain of thought is very long, so we won't go through all of it here. overall, it is true, as openai says, that o1 can continually improve its thinking process like a human, trying new strategies and recognizing and fixing its mistakes. and "like a human" here applies not only to the way it thinks but also to its tone.

4. 30~50 messages per week, ilya listed among the foundational contributors

unlike in the past, openai did not dangle a future release this time; the two models went live directly.

starting today, chatgpt plus and team users can access the o1 models in chatgpt by manually selecting o1-preview or o1-mini in the model selector; enterprise and education users get access starting next week, and openai plans to open access to free users in the future.

▲users can access the o1 model in chatgpt

but perhaps for safety or cost reasons, both models currently cap the number of messages: 30 per week for the preview version and 50 per week for the mini version. openai says it is working to raise these quotas and to let chatgpt automatically pick the appropriate model for a given prompt.

openai has also launched an api (application programming interface) for the o1 models. qualified developers can start prototyping with the apis for both models now, with a rate limit of 20 requests per minute. the apis do not yet include features such as function calling, streaming, or support for system messages.

▲ o1, o1 mini model api

the api documentation shows that both models have a 128k context window, while the mini version's maximum output is twice as long as o1-preview's. the training data of both models runs up to october 2023.

openai also revealed the core team members behind the o1 model.

▲core team members behind the o1 model

there are 21 foundational contributors, including ilya sutskever, openai's former chief scientist, who has since left the company to start his own business.

there are 7 team leaders, namely jakub pachocki, jerry tworek (overall), liam fedus, lukasz kaiser, mark chen, szymon sidor, wojciech zaremba. the project managers are lauren yang and mianna chen.

according to team members, reasoning is the ability to turn thinking time into better results. they invested more compute than before to train the model to produce coherent lines of thought, yielding performance quite unlike anything before.

they use reinforcement learning to train the model to generate and hone its own chains of thought, which can work even better than chains of thought written for it by humans. training the model to generate its own thought process in this way significantly improves its ability to recognize and correct errors, and early o1 models already achieved higher scores in data tests.

the list of core contributors and other contributors is as follows:

▲ list of o1 core contributors and other contributors

the executive leadership includes openai ceo sam altman, president greg brockman, cto mira murati, and eight others, plus eight support leaders.

▲o1 executive leadership and support leadership

the new o1 model can reason about context and apply safety rules more effectively. openai has rigorously tested and evaluated o1-preview to ensure the model can be released safely without introducing risks beyond what existing resources already make possible.

conclusion: openai flips the table; will "strawberry" reshape the large-model landscape?

from the mysterious q* model to the "strawberry" model, openai's new model has finally been released. ever since the openai "coup" last november, this model has been reported as one of the key factors behind altman's brief dismissal; at the time, a demonstration of the q* model was rumored to have circulated inside openai, and its pace of progress alarmed some ai safety researchers.

unlike with gpt-4o, openai chose to start a brand-new numbered series for the o1 model rather than continue the gpt line, which shows how much weight the company puts on it.

while many large-model vendors are racing to roll out multimodal, multi-application models, openai's release of the text-only o1 may once again draw public attention back to improving core model capabilities. whether the large-model landscape will be reshaped under o1's influence remains to be seen.