
GPT-4o mini tops the large model arena; Altman: fine-tuning is free for the next two months

2024-07-24


  • Cressey from Aofei Temple
    Quantum Bit | Public Account QbitAI

Just now, GPT-4o mini had its "highlight moment":

It climbed to the top of the LMSYS large model arena, tying for first place with the full-size GPT-4o and leaving Claude 3.5 behind.



Unlike conventional dataset benchmarks, the large model arena's results come from users posing their own questions and voting with their feet. There is no shortcut via "question grinding" (training on known test items), so the results are more realistic.

When the result came out, even OpenAI CEO Sam Altman was excited:

We try to stay reserved about evaluation results, but we were very excited to see GPT-4o mini performing on par with the full version at only 1/20 of the price.



Netizens acknowledged the result, but were more concerned with when the "Her"-style voice mode demonstrated at the GPT-4o launch event would actually ship.



At the same time, OpenAI brought another piece of good news, this one for developers:

Fine-tuning for GPT-4o mini is being rolled out gradually. It is currently open to tier 4 and tier 5 users and will be expanded to more tiers in due course.

And from now until September 23, 2 million training tokens per day are free.
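
For developers who already have access, starting a job follows OpenAI's standard fine-tuning flow. Below is a minimal sketch using the official Python SDK; the training file path is a placeholder, and we assume the dated snapshot name gpt-4o-mini-2024-07-18 is the fine-tunable model ID:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a chat-formatted JSONL training file (path is a placeholder)
training_file = client.files.create(
    file=open("training_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Launch a fine-tuning job against the GPT-4o mini snapshot
# (assumption: the dated snapshot name is the fine-tunable model ID)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)

print(job.id, job.status)  # poll later with client.fine_tuning.jobs.retrieve(job.id)
```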



Mini on par with the full version

After millions of rounds of 1v1 battles among more than 80 models, GPT-4o mini's score on the LMSYS leaderboard was only 7 points behind the full version.

Under LMSYS's ranking rules, a 7-point difference is not enough to change the rank, so the two models are counted as tied for first place.

They were followed by Claude 3.5 and the Gemini family, and then by two further versions of GPT-4.



Looking at the raw data, GPT-4o mini's average win rate of 0.6 is second only to the full version's.



And in head-to-head battles between the two, they are evenly matched.



LMSYS's results draw attention because of its unique competition format:

Instead of using a fixed dataset, users pose their own questions; two models are randomly selected for a 1-on-1 battle, and the user votes for whichever performs better.

Until the vote is cast, the models stay anonymous: the user does not know which two models are competing, and if a model reveals its own identity, the vote is invalidated.



Scores obtained this way are more realistic, since they avoid inflated results from "question grinding" and are also closer to real user experience.
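
For a sense of how pairwise votes turn into a leaderboard score, here is a toy online Elo update. LMSYS's published pipeline uses a statistical (Bradley-Terry-style) fit rather than exactly this, so treat it purely as an illustration; the model names, K factor, and base rating below are all arbitrary:

```python
from collections import defaultdict

K = 32           # update step size (arbitrary choice here)
BASE = 1000.0    # starting rating (arbitrary choice here)

ratings = defaultdict(lambda: BASE)

def expected(r_a: float, r_b: float) -> float:
    """Win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def record_vote(model_a: str, model_b: str, score_a: float) -> None:
    """score_a: 1.0 if A wins the battle, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (score_a - e_a)
    ratings[model_b] += K * ((1.0 - score_a) - (1.0 - e_a))

# A few example votes; the pairings here are purely illustrative
record_vote("gpt-4o-mini", "claude-3.5-sonnet", 1.0)
record_vote("gpt-4o-mini", "gpt-4o", 0.5)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```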

This large model arena was also recently accepted at ICML 2024, a top machine learning conference.



Moreover, LMSYS's evaluation is popular with OpenAI itself: before the official launch, an early version of GPT-4o mini was listed there under the pseudonym gpt-mini.

It already ranked 4th at the time, on the same level as GPT-4 Turbo.



Earlier still, before GPT-4o went live, it was tested on LMSYS under the name gpt2-chatbot.



However, some have raised doubts, saying that while GPT-4o mini's performance is indeed very good, claiming it surpasses Claude 3.5 Sonnet is a bit of an exaggeration.



Some even bluntly stated that the integrity of the LMSYS method is starting to break down and that changes are needed, or it will no longer be a useful benchmark.



The "small model" is also rolled up

The launch of the mini version focuses on cost-effectiveness.

It costs 15 cents per million input tokens and 60 cents per million output tokens (about 1.09/4.36 RMB), less than half the price of GPT-3.5 Turbo.



Compared with text-davinci-003, the best GPT-3-series model available two years ago, the price has dropped by 99%.
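
As a quick sanity check of those numbers, a few lines of arithmetic. The GPT-3.5 Turbo and text-davinci-003 prices below are the list prices of the time as we recall them, not figures from this announcement:

```python
# Back-of-the-envelope check of the prices quoted above. The GPT-3.5
# Turbo ($0.50/$1.50) and text-davinci-003 ($20) per-million-token
# prices are assumed from the list prices of the time.
MINI_IN, MINI_OUT = 0.15, 0.60      # USD per 1M tokens
TURBO_IN, TURBO_OUT = 0.50, 1.50    # USD per 1M tokens (assumed)
DAVINCI = 20.00                     # USD per 1M tokens (assumed)

print(MINI_IN / TURBO_IN, MINI_OUT / TURBO_OUT)   # 0.3 0.4 -> under half
print(f"drop vs text-davinci-003: {1 - MINI_IN / DAVINCI:.1%}")  # 99.2%
```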

Beyond offering small models to users, OpenAI has also found a new use for them:

In the final paper from the now-disbanded Superalignment team, a small model with one hundredth to one thousandth of the large model's parameters was used to optimize the large model.

In the experiment, the large and small models play a game against each other: the large model must keep adjusting its output until the small model is convinced it is telling the truth.

During this "game", the capabilities of the large model have been improved, and comprehensibility has been greatly improved without a significant loss of accuracy.



In addition to OpenAI, other companies have also started to develop small models.

For example, before GPT-4o mini, Google and Anthropic launched Gemini Flash and Claude 3 Haiku respectively.

GPT-4o mini can even be read as OpenAI's counterattack against those two models, beating both on performance and price alike.



In the same week that GPT-4o mini was released, Hugging Face and "European OpenAI" Mistral both launched small models.

Even Apple released its own 7B model, open-sourcing the entire training process and resources in one go.

In short, as long as performance meets the needs of the task, a small model is undoubtedly the more economical choice.

At the same time, a smaller scale also makes on-device deployment possible, bringing advantages in privacy protection and beyond.

It is not difficult to understand why the “small” models are becoming more and more popular.

Reference Links:
[1] https://x.com/sama/status/1815877987696533897/
[2] https://x.com/OpenAIDevs/status/1815836887631946015