news

openai's o1 model is released, climbing another rung on the five-level agi ladder! its reasoning surpasses phd level, with major contributions from chinese researchers from tsinghua, peking university and fudan university

2024-09-13


just now, openai's most powerful model series yet, o1, suddenly went live. with no warning at all, openai dropped this bombshell.

the strawberry model that was said to be online within two weeks actually arrived within two days!

starting today, o1-preview is rolling out to all plus and team users in chatgpt, and to tier 5 developers in the api.

at the same time, openai also released o1-mini, a cost-effective reasoning model that is very good at stem, especially math and coding.

the o1 models still have flaws and limitations, and they seem more impressive on first use than after extended use.

the new o1 series has taken the performance of complex reasoning to a whole new level, and can be said to have true general reasoning capabilities.

across a series of benchmarks, o1 once again improves dramatically over gpt-4o, reaching olympiad-medalist-level ability in mathematics. on a benchmark of physics, biology, and chemistry problems, it outright surpasses the level of human phds!

openai researcher jason wei said that o1-mini is the most surprising research result he has seen in the past year. such a small model actually achieved a score of more than 60% in the aime math competition.

however, judging from the appendix of openai's article, the preview and mini versions released this time appear to be only cut-down versions of the full o1.

a new paradigm for reasoning scaling is opening up

nvidia senior scientist jim fan further analyzed the principles behind the o1 model.

he said that the new paradigm of inference-time scaling is being popularized and deployed at scale. as sutton put it in "the bitter lesson", only two techniques scale indefinitely with compute: learning and search.

now, it’s time to turn our focus to the latter.

1. you don't need a huge model to perform reasoning.

2. a large amount of compute is shifting from pre-training/post-training to serving inference.

3. openai must have figured out the inference scaling law long ago, while academia is only now catching up.

4. deploying o1 in real-world applications is much harder than scoring well on academic benchmarks.

5. strawberry could easily become a data flywheel (see the sketch after this list).
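openai has not disclosed how o1's training pipeline actually works, but the data-flywheel idea jim fan describes is easy to sketch: sample reasoning traces, keep only those whose final answer can be verified, and recycle them as training data. everything below is a hypothetical illustration; `generate_with_reasoning` and `check` are assumed interfaces, not a real api.

```python
# hypothetical sketch of a reasoning data flywheel (not openai's published pipeline)

def flywheel_round(model, problems, samples_per_problem=8):
    """one round: sample chain-of-thought traces, keep the verified ones as new training data."""
    new_training_data = []
    for problem in problems:
        for _ in range(samples_per_problem):
            trace, answer = model.generate_with_reasoning(problem.prompt)  # assumed sampler api
            if problem.check(answer):  # verifier: ground-truth answer, unit tests, etc.
                new_training_data.append((problem.prompt, trace, answer))
    return new_training_data

# each round the model is fine-tuned on its own verified traces, so better reasoning
# yields better data, which in turn yields better reasoning -- a flywheel.
```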

according to openai's previously published five-level classification of agi, o1 has reached level 2, "reasoners".

one early tester found that o1 successfully wrote an extremely difficult poem; the planning and deliberation needed to pull off the task were intense, and the inference-time computation was impressive to watch.

however, ai expert karpathy complained after testing o1-mini, "it has been refusing to solve the riemann hypothesis for me. model laziness is still a major problem, which is sad."

nyu assistant professor xie saining also tried the classic question "which is bigger, 9.11 or 9.8?" unexpectedly, o1-preview still got it wrong.

the classic question "how many r's are there in strawberry" is naturally a piece of cake for o1.

matthew sabia, a well-known influencer, said the most frightening thing is the claim that gpt-5 will be 69 times more powerful than the o1 model, and that ordinary people simply cannot grasp reasoning and logical ability at that scale.

is humanity really ready?

o1 cracks logic problems that stump humans

we all know that logical reasoning has long been a mountain that llms struggled to climb.

but this time, the o1 model's ability to solve complex logical problems was surprising.

for example, the following logic problem:

the princess is as old as the prince will be when the princess is twice as old as the prince was when the princess's age was half the sum of their current ages. how old are the princess and the prince now? give all solutions.

this question is genuinely hard to parse. even humans have to work carefully just to translate it into a well-posed problem.

shockingly, after a few steps of thinking, the o1 model actually gave the correct answer!

by defining variables, unpacking the statement, and solving the resulting equations, it concludes that the princess is 8k years old and the prince 6k years old, where k is a positive integer.
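the answer is easy to double-check by writing the three conditions as equations; below is a small sympy sketch (my own verification, not o1's output), with p and q the current ages of the princess and prince, x the years back to the referenced past moment, and y the years forward to the referenced future moment.

```python
from sympy import symbols, solve

p, q, x, y = symbols('p q x y', positive=True)

# x years ago, the princess's age was half the sum of their current ages
eq1 = (p - x) - (p + q) / 2
# y years from now, the princess will be twice as old as the prince was x years ago
eq2 = (p + y) - 2 * (q - x)
# the princess's current age equals the prince's age y years from now
eq3 = p - (q + y)

print(solve([eq1, eq2, eq3], [x, y, q]))
# {q: 3*p/4, x: p/8, y: p/4}
```

so the only constraint is q = 3p/4, and requiring every age that appears in the statement to be a whole number forces p to be a multiple of 8, giving p = 8k and q = 6k, matching o1's answer.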

in another demo, jason wei showed us how o1 wrote a video game based on prompts.

as you can see, he copied the prompt to the o1 model.

the model then thought for 21 seconds, showing the entire thinking process.

the model then produced the code.

after running the code, it turned out to be a very smooth little game!

we even threw a bunch of confusing korean sentences to o1 and asked it to translate them into english, and it actually did it.

even though the sentences are garbled and ungrammatical, o1 still decodes them step by step.

finally, o1 gave the answer and added, with some humor, that no translator on earth could manage it but any korean speaker could easily recognize it, since it is a way of scrambling korean by altering vowels and consonants.

in contrast, gpt-4o was completely confused and unable to understand.

it can be seen that the superb performance of o1 has brought logical reasoning to a new level.

how does it do that?

reinforcement learning delivers again: the "alphago moment" for large models

what makes the o1 series model different from previous ones is that it spends more time "thinking about the problem" before answering it, just like humans.

through training, they learn to refine their thought processes, try different strategies, and recognize mistakes on their own.

behind this, the powerful algorithm of reinforcement learning does the heavy lifting, the same family of techniques that once powered alphago's victories over human go players.

this training is highly data-efficient and teaches the llm to think productively using a chain of thought (cot).

jason wei, the openai researcher who originally proposed cot, said that o1 does not produce its chain of thought purely through prompting; instead, the model is trained with rl to carry out chain-of-thought reasoning better.

moreover, the openai team also observed a new kind of scaling law for the model.

o1's performance keeps improving as more reinforcement learning (train-time compute) and more thinking time (test-time compute) are invested.

the constraints on scaling this approach are very different from those of llm pre-training.

the performance of o1 improves steadily as the amount of computation increases during the training and testing phases.

gold medal team list

reasoning research

among the foundational contributors is ilya sutskever, who has since left to start his own company; unlike greg brockman and others, he is not listed under executive leadership. presumably his earlier research laid the groundwork for o1.

since ilya's departure, openai has continued to publish research he contributed to, such as a study on the interpretability of the gpt-4 model.

meanwhile, ssi, the company he is founding, is also thriving: without any product yet, it has already raised $1 billion at a $5 billion valuation.

Hongyu Ren

hongyu ren graduated with a bachelor's degree in computer science from peking university and received his ph.d. from stanford. he joined openai in july last year and previously had work experience at companies such as google, apple, nvidia, and microsoft.

Jason Wei

jason wei is currently a researcher at openai. he worked at google brain from 2020 to 2023, where he proposed the famous chain-of-thought (cot) prompting and instruction fine-tuning, and published work on emergent abilities of large models.

Kevin Yu

kevin yu is currently a researcher at openai. he received a master's degree in physics and astrophysics and a doctorate in neuroscience from uc berkeley in 2014 and 2021, respectively.

Shengjia Zhao

shengjia zhao graduated from tsinghua university with a bachelor's degree and also received his ph.d. from stanford. after graduating in june 2022, he joined the openai technical team. he is also one of the authors of gpt-4.

Wenda Zhou

wenda zhou joined openai last year after being a moore-sloan fellow at the new york university center for data science.

he received a master's degree from the university of cambridge in 2015 and a phd in statistics from columbia university in 2020.

Francis Song

francis song received a bachelor's degree in physics from harvard university and a doctorate in physics from yale university. he joined openai in 2022 and previously served as a research scientist at deepmind and an assistant research scientist at new york university.

Mark Chen

mark chen joined openai in 2018 and heads frontiers research, leading a group under vice president of research bob mcgrew.

chen graduated from mit with a double bachelor's degree in mathematics and computer science. during college he interned at microsoft and at trading firms, and was a visiting scholar at harvard university.

currently, he also serves as the coach of the us ioi training team.

the information once speculated that mark chen would become a member of openai's leadership in the future.

in addition, the leadership team also includes jakub pachocki, chief scientist who succeeded ilya, and wojciech zaremba, one of the few remaining co-founders of openai.

reasoning safety

Jieqi Yu

jieqi yu graduated from fudan university with a bachelor's degree in electronic engineering. she went to the hong kong university of science and technology for an exchange and then received her doctorate from princeton university. she worked at facebook for 12 years, transitioning from a software engineer to a software engineering manager, and joined openai as an engineering manager in august last year.

Kai Xiao

kai xiao earned both his bachelor's and doctoral degrees at mit, with a double undergraduate major in mathematics and computer science. he spent time as a visiting researcher at oxford and interned at companies including deepmind and microsoft before joining openai in september 2022.

Lilian Weng

lilian weng is currently the head of openai's safety systems team, working mainly on machine learning and deep learning research.

she graduated from peking university with a bachelor's degree in information systems and computer science. she went to the university of hong kong for a short-term exchange and then received her ph.d. from indiana university bloomington.

like mark chen, lilian is also considered a rising star in openai's leadership.

the full team roster is as follows:

physics, chemistry, and biology: beyond human phd level

as a new series of models created by openai, what is the strength of o1?

it ranks in the 89th percentile on competitive programming problems (codeforces) and places among the top 500 students in the united states on the aime, a qualifying exam for the usa mathematical olympiad.

most importantly, it surpasses human phd level performance on a benchmark test of physics, biology, and chemistry questions (gpqa).

on commonly used reasoning benchmarks such as math and gsm8k, o1 and many recent frontier models have saturated performance and are hard to tell apart. openai therefore mainly used aime to evaluate the model's mathematical and reasoning abilities, alongside other human exams and benchmarks.

aime is designed to challenge the mathematical abilities of the best high school students in the united states. in the 2024 aime exam, gpt-4o solved only 12% (1.8/15) of the questions on average.

o1's improvement is dramatic: it solves 74% (11.1/15) of the problems on average, 83% (12.5/15) with majority voting over 64 samples, and 93% (13.9/15) when 1000 samples are re-ranked with a scoring function.

a score of 13.9 puts o1 among the top 500 students nationally and above the qualifying cutoff for the usa mathematical olympiad.
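the "majority vote over 64 samples" and "re-rank 1000 samples with a scoring function" numbers correspond to two standard test-time strategies, consensus voting and best-of-n re-ranking. a minimal sketch of both, assuming a hypothetical `model.solve()` sampler and a learned `score()` function (neither is a real openai api):

```python
from collections import Counter

def consensus_at_k(model, problem, k=64):
    """majority vote (cons@k): sample k answers and return the most common one."""
    answers = [model.solve(problem) for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]

def best_of_n(model, score, problem, n=1000):
    """best-of-n: sample n candidates and keep the one the scoring function ranks highest."""
    candidates = [model.solve(problem) for _ in range(n)]
    return max(candidates, key=lambda answer: score(problem, answer))
```

both strategies trade extra test-time compute for accuracy, which is exactly the scaling behavior described above.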

on challenging tasks such as codeforces and gpqa diamond, o1 far exceeds gpt-4o.

on challenging reasoning benchmarks, o1 significantly outperforms gpt-4o

gpqa diamond tests expertise in the fields of chemistry, physics, and biology. to compare the model with humans, the team recruited experts with phds to answer questions.

as a result, o1 (78.0) outperformed these human experts (69.7), becoming the first model to surpass humans on this benchmark.

however, this result does not mean that o1 is better than humans with phds in all aspects, it just shows that it can solve some problems of corresponding level more proficiently.

in addition, o1 also refreshed the sota in benchmark tests such as math, mmlu, and mathvista.

with visual perception enabled, o1 achieves 78.1% on mmmu, becoming the first model competitive with human experts, and surpasses gpt-4o in 54 of the 57 mmlu subcategories.

o1 outperforms gpt-4o on a wide range of benchmarks, including 54/57 mmlu subclasses

chain of thought

through reinforcement learning, o1 learned to recognize and correct its own mistakes and to break down complex steps into simpler ones.

it also tries different approaches when the current one doesn’t work. this process significantly improves the model’s reasoning capabilities.

let’s take cryptography as an example.

the question: given that "oyfjdnisdr rtqwainr acxz mynzbhhx" is the encrypted form of "think step by step", work out what "oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz" means.

as you can see, gpt-4o is completely helpless with this type of question.

o1 worked out the encryption scheme from the worked example and arrived at the correct answer: there are three r's in strawberry.
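the cipher itself is simple once spotted: each pair of ciphertext letters maps to the letter whose alphabet position is the average of the pair's positions. a few lines of python are enough to check it (my own verification, not o1's transcript):

```python
def decode(ciphertext: str) -> str:
    """decode by mapping each letter pair to the letter at the average of their alphabet positions."""
    words = []
    for word in ciphertext.lower().split():
        positions = [ord(c) - 96 for c in word]          # 'a' -> 1, ..., 'z' -> 26
        pairs = zip(positions[0::2], positions[1::2])
        words.append(''.join(chr((a + b) // 2 + 96) for a, b in pairs))
    return ' '.join(words)

print(decode("oyfjdnisdr rtqwainr acxz mynzbhhx"))
# think step by step
print(decode("oyekaijzdf aaptcg suaokybhai ouow aqht mynznvaatzacdfoulxxz"))
# there are three rs in strawberry
```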

programming

in this evaluation, openai further trained a programming-enhanced model based on o1.

in the 2024 international olympiad in informatics (ioi), the new model scored 213 points, ranking in the 49th percentile.

in this simulation, the model had ten hours to solve six challenging algorithmic problems, with 50 submissions allowed per problem.

when the submission limit is relaxed, the model's performance can be significantly improved. when 10,000 submissions are allowed per problem, the model reaches 362.14 points - exceeding the gold medal threshold.

finally, openai also simulated competitive programming contests hosted by codeforces, following the strict contest rules and allowing 10 submissions per problem.

gpt-4o's elo rating is 808, placing it in the 11th percentile of human competitors. the new model far surpasses both gpt-4o and o1, reaching 1807 and outperforming 93% of competitors.
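for a sense of what that elo gap means, the standard elo formula (generic competitive-rating math, not something from openai's report) gives the expected head-to-head score:

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """standard elo expected score of player a against player b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# fine-tuned model (1807) vs gpt-4o (808): expected score ~0.997,
# i.e. it would be expected to win essentially every head-to-head matchup.
print(round(elo_expected_score(1807, 808), 3))
```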

further fine-tuning on programming competitions improved o1: the improved model ranked in the 49th percentile under the competition rules in the 2024 international olympiad in informatics

human preference assessment

in addition to exams and academic benchmarks, openai also evaluated human preference for o1-preview versus gpt-4o on challenging, open-ended prompts across a wide range of domains.

in this evaluation, human raters saw anonymized responses from o1-preview and gpt-4o to the same prompt and voted for the response they preferred.

in reasoning-heavy categories such as data analysis, programming, and mathematics, people prefer o1-preview. but in some natural language tasks, gpt-4o is superior.

in other words, o1-preview is not currently suitable for all usage scenarios.

in areas where reasoning ability is more important, people are more likely to choose o1-preview

o1-mini is very cost-effective

to give developers a more efficient option, openai also released o1-mini, a faster and cheaper reasoning model.

as a smaller model, the o1-mini is 80% cheaper than the o1-preview.

this makes it a powerful and cost-effective model for applications that require reasoning but do not require general world knowledge.

however, the o1 series is still in its early days: capabilities such as web browsing, file uploads, and image inputs have not yet been integrated. in the short term, gpt-4o remains the stronger all-round player.

references:

https://openai.com/index/learning-to-reason-with-llms/