
OpenAI's new model is comparable to a PhD? I asked PhDs from Peking University and Tsinghua University to try it out: wake up!

2024-09-14


To be honest, I can't stand these companies. They always pull off something big in the middle of the night...

I am calling out OpenAI in particular: without any warning, it released the new model that everyone had been waiting on for so long.

Earlier the talk was all about "Strawberry", and a single picture of strawberries kept people excited for days.

As it turns out, the new model has nothing to do with strawberries at all and carries a completely new name: OpenAI o1.

It is already being billed as OpenAI's pinnacle technology, and Sam Altman posted directly that this is their most capable and best-aligned model to date.

Unlike previous times, OpenAI did not actually brag about how great it is. It simply threw out a few charts, and those were enough to make your scalp tingle.

As shown in the figure below, the results of three tests tell the story: the International Mathematical Olympiad, a programming competition, and doctoral-level science questions.

The leftmost bar is GPT-4o, the middle one is the currently available o1-preview, and the tall red bar on the right is the full-strength o1. In almost every respect o1 beats its predecessors, in places by nearly 8x...

Broken down further, the new o1 surpasses GPT-4o in almost every discipline and field.

What really scares me is that OpenAI says it specifically invited PhD experts to answer the same questions.

On these doctoral-level questions, o1's scores exceeded those of the PhD experts: o1 scored 78, while the humans scored 69.7...

If even the PhDs lost, what chance do I have?

Quick-to-react netizens immediately went wild and began shouting that a new god had arrived.

Scroll through the comments and you will see glowing reviews everywhere: "the best", "it's just awesome!", "the closest thing to human reasoning".

Quite a few readers even messaged us in the backend to say that o1 really is something.

Sounds impressive, doesn't it? OpenAI apparently thinks so too.

OpenAI has not yet disclosed exactly how much it spent on the model, but from the user's side it is obvious that this thing is expensive.

o1-preview costs $15 per million input tokens and $60 per million output tokens.
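To get a feel for what that pricing means per request, here is a minimal sketch that converts token counts into dollars at the rates quoted above. The function name and the example token counts are made up for illustration, and the sketch assumes billing is strictly linear in tokens.

```python
# Rates quoted above for o1-preview, in US dollars per million tokens.
INPUT_USD_PER_MILLION = 15.0
OUTPUT_USD_PER_MILLION = 60.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request, assuming purely linear per-token billing."""
    input_cost = input_tokens / 1_000_000 * INPUT_USD_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_USD_PER_MILLION
    return input_cost + output_cost


# Hypothetical request: 2,000 tokens in, 10,000 tokens out.
print(f"${estimate_cost_usd(2_000, 10_000):.2f}")  # $0.63
```

Since output is four times as expensive as input at these rates, long answers dominate the bill.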

What users get this time is not even the full version, but an early preview version and a cut-down mini version.

Even if you just want to be among the first to try it, it is not free. And even paying members are capped: the preview version allows only 30 messages per week, and the mini version only 50 messages per week...

Expensive as it is, we certainly can't just let OpenAI blow its own trumpet.

Didn't they say it beat PhDs? I created several accounts and found some PhDs to test it ourselves.

To keep the evaluation professional and objective, we invited PhDs covering the three core science subjects: biology, solid-state physics, and materials chemistry.

Of the three, Dr. Cui, who studies solid-state physics at Nanjing University, gave the highest rating. He thinks o1 has reached a level of 60-80 points (out of 100).

Some answers could even be given 90 points.

Dr. Cui's first question: is there any way to overcome white noise when distributing entangled photons over long distances?

In about 9 seconds, o1 listed 10 feasible measures.

Of course, I didn't understand a single one of them. But Dr. Cui's verdict was decent: the answers were comprehensive, in line with the latest research progress, and at a popular-science level.

The adaptive optics direction it mentioned even reflects research results from this year.

Compared with the older GPT-4o, the gap is immediately obvious.

Never mind whether the new direction is mentioned; the number of measures alone is quite different.

So we followed up specifically on the new adaptive optics direction: what principle of quantum entanglement is used to improve the signal-to-noise ratio, and can it be extended to quantum adaptive optics?

After several rounds of answers, Dr. Cui gave it a high score of 80-90 points, and frankly admitted that some of its reasoning touched on his own weak spots and gave him hints about where to look.

However, as we kept asking, its problems were exposed. When we pressed for harder experimental details, o1's answers became less useful.

Overall, though, o1's performance in physics is pretty good, roughly a 20-point improvement over the old version.

However, physics has always been the highest-scoring subject in OpenAI's own tests. So we brought in Dr. K, who studies materials chemistry at Peking University, to put some tough questions to it in chemistry, its lowest-scoring subject.

Dr. K asked a series of questions about Fe-N4, and o1 responded with a long list of answers. To save space, we show only some of the questions and results here.

After the full test, Dr. K's evaluation was similar: it may be at the level of a graduate student, but its deeper understanding and its ability to propose solutions are rather vague, and it mainly answers from known content.

For example, if you ask how to tune Fe-N4, o1 can say that it is done by adjusting the electronic state, but if you ask it ... adjust, it gets a bit stuck.

Although o1 talks less nonsense than GPT-4o, neither can offer much advice on specific problems. The old version rambles without details, while the new version runs out of things to say because of its limited ability.

Beyond these two, biology is of course the indispensable third science subject.

We also consulted Dr. Xin, who studies biology at Tsinghua University. His question: "How can lactoylation and carboxyethyl modification of lysine residues be distinguished in mass spectrometry data sets?"

I didn't understand this one either, but o1 again gave a very long answer that read like a review paper, complete with references at the end.

Surprisingly, when we showed the answer to Dr. Xin, he spotted something wrong as soon as he read it, the kind of problem you can call at a glance.

It's not that the AI's answer was entirely wrong, but the references were made up: the cited paper does not exist at all!

It fabricated some things, but not everything. Overall, the Tsinghua PhD still thinks it is much better than previous AI. At the very least its comprehension is visibly improved, and its fabrications look very convincing...

However, the PhDs from different fields gave different evaluations, which may also have to do with o1's own areas of strength.

Judging from the official science scores, GPT-4o scored higher in biology than in chemistry and physics, but with o1 the picture is completely different.

o1's physics score reached 92.8, far higher than the other two subjects. This may be why Dr. Cui is so positive about it.

Overall, the PhDs believe it will take some time before it surpasses professional doctoral students.

Dr. Cui said bluntly that in real research, scholars still have to do most of the work themselves and AI can only point in a general direction, so there is little point in paying for AI for such specialized work.

He would recommend it more to undergraduates. At the master's or doctoral level, the AI's answers will not actually meet your advisor's standards, and you will definitely get criticized at the group meeting.

Dr. Xin from Tsinghua University holds the same view. Setting aside hallucination and fabricated references, in terms of expertise the AI's answers can only fool "distant peers", that is, people in the same broad discipline who work in a different direction. To "close peers" who specialize in the field, the AI's problems are still very obvious.

Dr. K from Peking University went further. He believes this AI can only be said to have reached a master's student's level of understanding, and that it merely stitches existing material together and cannot produce creative results. In terms of creativity, AI is far below the level of master's and doctoral students, and this is an important problem AI still needs to solve.

From the PhDs' evaluations we can pick out one key point: o1 is comparatively stronger because it has a higher-level pattern of cognition and thinking.

This is also the core of the o1 update. In the article "Learning to Reason with LLMs" on OpenAI's official website, the company says it mainly relies on a long chain of thought (CoT) rather than the traditional chain of prompts.

That may sound confusing at first, but put simply, the model has moved away from the old pattern of "you ask, I answer".

Previously, a large model answered questions almost reflexively. If you asked me what color the sky is, I would say blue without even thinking. That requires me to already know the fact and simply return it.

A long chain of thought is more like this: I not only need to know that the sky is blue, I can also work out why it is blue, taking atmospheric scattering and light wavelengths into account.

This requires the AI to be able to build logic and reason. In other words, it not only needs to grow a brain, it also needs to use it.

Although the chain-of-thought concept was proposed by Google in 2022, OpenAI is the first to actually implement it this time.
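The contrast between a reflexive answer and a chain of thought can be illustrated at the prompt level. This is only a sketch: the function names and the prompt wording are invented for illustration, and it says nothing about how o1 itself is trained or run.

```python
def build_direct_prompt(question: str) -> str:
    """Ask for the answer only, with no intermediate reasoning."""
    return f"Q: {question}\nA:"


def build_cot_prompt(question: str) -> str:
    """Ask the model to reason step by step before committing to an answer."""
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, then state the final answer on the last line."
    )


# Hypothetical example question, echoing the sky-color analogy above.
question = "What color is the sky, and why?"
print(build_direct_prompt(question))
print(build_cot_prompt(question))
```

The first prompt invites a one-word reply such as "blue"; the second leaves room for the scattering and wavelength reasoning before the final answer.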

In practice, when you talk to the o1 model, besides getting the answer you can also choose to expand the response and see the reasoning behind it. Its thinking is visible rather than a black box.

For example, take Dr. Cui's question, "What are the ways to overcome white noise in long-distance entangled photon distribution?" The o1 model's thinking process is as follows:

However, just as it can stumble on problems in specialized fields, some simple everyday questions can apparently stump it too.

Take the classic example of comparing 9.11 and 9.8. Xiaohongshu user @小水刚醒 found that it "crashed as soon as the difficulty was increased... infinite loop and crazy chain of thought (CoT)".
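Part of why this comparison is a classic trap is that the answer depends on how the two strings are read. A minimal sketch, with helper names that are purely illustrative: read as decimal numbers, 9.8 is larger; read as dotted version numbers, 9.11 is larger.

```python
from decimal import Decimal


def larger_as_decimal(a: str, b: str) -> str:
    """Compare the strings as decimal numbers (9.8 means 9.80)."""
    return a if Decimal(a) > Decimal(b) else b


def larger_as_version(a: str, b: str) -> str:
    """Compare the strings as dotted version numbers (9.11 means major 9, minor 11)."""
    def key(s: str):
        # Split on dots and compare the integer parts left to right.
        return tuple(int(part) for part in s.split("."))
    return a if key(a) > key(b) else b


print(larger_as_decimal("9.11", "9.8"))  # 9.8
print(larger_as_version("9.11", "9.8"))  # 9.11
```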

Our editorial team ran into the same problem in our own testing, but when we asked it why, it immediately realized there was an error in its reasoning and worked through the problem again.

Well, well, well. Truly worthy of a PhD: you're good at catching mistakes, aren't you?

After the full round of testing, I have to admit that it really has improved a great deal.

In terms of results it is clearly better than the previous generation, and the use of long chains of thought bodes well for the future development of AI.

But after several PhDs took turns grilling it, its problems were exposed quite clearly. In some respects, such as creativity, it cannot replace human PhD experts.

That said, OpenAI researcher Noam Brown revealed that future versions of o1 will think for hours, days, or even weeks. That will cost more, but it is worthwhile for tasks such as developing anti-cancer drugs.

In addition, I think the chain-of-thought approach implemented in OpenAI o1 is likely, like the Transformer and DiT architectures before it, to set the direction for large models worldwide.

So the road to AGI is neither near nor far. We look forward to the next round of players taking the stage.

Written by: Naxi & Four Major

Edited by: Jiang Jiang & Noodles

Art: Huan Yan

Image sources: OpenAI, X, IBM, Xiaohongshu, and others on the web