
Behind the smaller, more powerful GPT-4o mini: the future of AI models is no longer "bigger is better"

2024-07-27


Late at night last week, OpenAI released GPT-4o mini, which pushed GPT-3.5 Turbo into retirement and even surpassed GPT-4 in the LMSYS large model arena.
When Meta released its new Llama models this week, the 405B flagship landing in the first tier was expected; the bigger surprises came from the new 8B and 70B versions.
And this may not be the end of the small model competition, but more likely a new starting point.
It's not that large models are unaffordable, but that small models are more cost-effective
In the vast world of AI, small models always have their own legends.
Looking outward, Mistral 7B, which made a splash last year, was hailed as the "best 7B model" as soon as it was released. It outperformed the 13B parameter model Llama 2 in multiple evaluation benchmarks and surpassed Llama 34B in reasoning, mathematics, and code generation.
This year, Microsoft also open-sourced phi-3-mini, its most capable small-parameter model. Although it has only 3.8B parameters, its evaluation results far exceed those of models at the same parameter scale and are comparable to larger models such as GPT-3.5 and Claude 3 Sonnet.
Looking inward, Mianbi Intelligence launched MiniCPM, an on-device language model with only 2B parameters, in early February. It delivers stronger performance from a smaller size, outperforming the popular French model Mistral-7B, and is known as the "small steel cannon."
Not long ago, MiniCPM-Llama3-V2.5, with only 8B parameters, surpassed larger models such as GPT-4V and Gemini Pro in overall multimodal performance and OCR capability, and was even plagiarized by a Stanford University AI team as a result.
Then last week, OpenAI launched what it described as "the most powerful and cost-effective small-parameter model", GPT-4o mini, bringing everyone's attention back to small models.
Ever since OpenAI dragged the world into the imagination of generative AI, development at home and abroad has revolved around one logic: staying at the table by moving toward commercialization, from competing on context length, to competing on parameter counts, to intelligent agents, to today's price war.
Therefore, the most eye-catching takeaway in the court of public opinion is that OpenAI, by cutting prices, seems to have joined the price war.
Many people may not have a clear sense of GPT-4o mini's pricing: 15 cents per 1 million input tokens and 60 cents per 1 million output tokens, more than 60% cheaper than GPT-3.5 Turbo.
That means GPT-4o mini can generate a 2,500-page book's worth of text for only about 60 cents.
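As a rough back-of-the-envelope check (the words-per-page and tokens-per-word figures below are our own assumptions for illustration, not OpenAI numbers), the arithmetic works out like this:

```python
# Back-of-the-envelope cost estimate for GPT-4o mini output tokens.
# Pricing from the article: $0.60 per 1M output tokens.
# The page/word/token ratios are rough assumptions for illustration only.

OUTPUT_PRICE_PER_MILLION_TOKENS = 0.60  # USD
WORDS_PER_PAGE = 300                    # assumed typical book page
TOKENS_PER_WORD = 1.3                   # assumed average for English text

def output_cost_usd(pages: int) -> float:
    """Estimated cost of generating `pages` pages of text with GPT-4o mini."""
    tokens = pages * WORDS_PER_PAGE * TOKENS_PER_WORD
    return tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION_TOKENS

if __name__ == "__main__":
    print(f"~2,500 pages: ${output_cost_usd(2500):.2f}")  # roughly $0.59
```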
OpenAI CEO Sam Altman also remarked on X that compared with the strongest model of two years ago, GPT-4o mini not only far outperforms it but is also 100 times cheaper to use.
As the price war for large models becomes increasingly fierce, some efficient and economical open source small models are more likely to attract market attention. After all, it is not that large models are unaffordable, but that small models are more cost-effective.
On the one hand, with GPUs being bought up or even out of stock around the world, small open source models with low training and deployment costs are enough to gradually gain the upper hand.
For example, MiniCPM from Mianbi Intelligence achieves a cliff-like drop in inference cost thanks to its smaller parameter count, and can even run inference on a CPU. It needs only a single machine for continued training and a single graphics card for parameter fine-tuning, with room for further cost reductions.
If you are an experienced developer, you can even build a vertical model for the legal field on top of a small model yourself, and the inference cost may be only one-thousandth that of fine-tuning a large model.
The application of such "small models" on devices has let many vendors see the dawn of profitability. For example, Mianbi Intelligence helped the Shenzhen Intermediate People's Court launch an AI-assisted trial system, proving the technology's value to the market.
Of course, it would be more accurate to say that the change we are beginning to see is not a shift from large models to small models, but a shift from a single category of model to a portfolio of models, with the right model chosen according to an organization's specific needs, the complexity of the task, and the available resources.
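To make the "portfolio of models" idea concrete, here is a minimal, hypothetical routing sketch; the model names, complexity ceilings, and prices are illustrative placeholders rather than any vendor's actual lineup:

```python
# Hypothetical model-portfolio router: pick a model by task complexity.
# Names, ceilings, and prices are illustrative assumptions only.

PORTFOLIO = [
    # (model name, rough complexity ceiling it can handle, USD per 1M output tokens)
    ("small-on-device-3b", 0.3, 0.0),   # runs locally, no API cost
    ("mid-tier-70b", 0.7, 0.9),
    ("frontier-400b", 1.0, 5.0),
]

def pick_model(task_complexity: float) -> str:
    """Return the smallest model whose ceiling covers the task (complexity in [0, 1])."""
    for name, ceiling, _cost in PORTFOLIO:
        if task_complexity <= ceiling:
            return name
    return PORTFOLIO[-1][0]

print(pick_model(0.2))   # -> small-on-device-3b
print(pick_model(0.85))  # -> frontier-400b
```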
On the other hand, small models are easier to deploy and integrate in mobile devices, embedded systems, or low-power environments.
Small models have relatively small parameter counts and need less computing power and memory than large models, so they run more smoothly on resource-constrained devices. In addition, end-side devices usually have much stricter limits on power consumption and heat, and small models designed for them can better fit within those limits.
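A rough illustration of why parameter count matters so much on a device: model weights translate almost directly into memory. The bytes-per-parameter values below are standard for those precisions, while treating a phone's usable budget as a few gigabytes is our assumption:

```python
# Rough memory footprint of model weights at different precisions.
# Bytes-per-parameter values are standard; the phone RAM budget is an assumption.

BYTES_PER_PARAM = {"fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    return params_billion * 1e9 * BYTES_PER_PARAM[precision] / 1e9

for size in (3, 7, 70):
    print(f"{size}B model, int4 weights: ~{weight_memory_gb(size, 'int4'):.1f} GB")
# A 3B (~1.5 GB) or 7B (~3.5 GB) model can fit within a phone's memory budget;
# a 70B model (~35 GB) clearly cannot.
```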
Honor CEO Zhao Ming has said that because of on-device AI compute constraints, on-device model parameters may fall between 1B and 10B, while cloud-based network models can reach tens of billions of parameters, 100 billion, or even more; that is the gap between the two.
A mobile phone offers very limited space; supporting a 7-billion-parameter model within the limits of battery, heat dissipation, and storage, with so many constraints, must be the hardest setting of all.
We have also previously uncovered the heroes behind Apple Intelligence: a fine-tuned 3B small model dedicated to tasks such as summarization and text polishing. Paired with adapters, its capabilities exceed Gemma-7B, making it well suited to running on mobile devices.
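The adapter approach mentioned above can be sketched generically. The LoRA-style module below is one common technique for this kind of lightweight fine-tuning, not Apple's actual implementation, and the layer sizes are arbitrary:

```python
# Minimal LoRA-style adapter sketch (a generic technique, not Apple's implementation).
# A small low-rank update is trained on top of a frozen base linear layer.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the base weights
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~65k trainable adapter params vs ~16.8M frozen base params
```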
So we see Andrej Karpathy, formerly of OpenAI, recently judging that the competition in model size will "involute in reverse": models will not keep getting bigger, but will compete on who can be smaller and more nimble.
Why small models can win big
Andrej Karpathy's prediction is not groundless.
In this data-centric era, models are rapidly becoming larger and more complex. Much of the capacity of super-large models such as GPT-4, trained on massive amounts of data, is actually spent memorizing large numbers of irrelevant details, that is, rote memorization of information.
However, a fine-tuned small model can be even more powerful than a big one on certain tasks, with usability comparable to that of many "super-large models."
Hugging Face CEO Clem Delangue has also suggested that up to 99% of use cases can be solved by using small models, and predicted that 2024 will be the year of small language models.
Before we look into the reasons, a bit of background is in order.
In 2020, OpenAI proposed a famous law in a paper: the Scaling Law, which says that model performance improves predictably as model size, dataset size, and training compute increase. With the launch of models such as GPT-4, the advantages of the Scaling Law gradually became apparent.
Researchers and engineers in the field of AI firmly believe that by increasing the number of parameters in the model, the model's learning and generalization capabilities can be further improved. In this way, we have witnessed the scale of models jump from billions of parameters to hundreds of billions, and even climb towards trillion-parameter models.
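For reference, the 2020 paper expressed this relationship as a power law in the (non-embedding) parameter count N; the simplified form and constants below follow that paper's reported fits and should be read as approximate:

```latex
% Simplified power-law form of the 2020 scaling law (Kaplan et al.),
% with N = non-embedding parameters; constants are approximate fits.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
\qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}
```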
In the world of AI, the size of a model is not the only measure of its intelligence.
On the contrary, a cleverly designed small model, by optimizing algorithms, improving data quality, and adopting advanced compression techniques, can often demonstrate performance comparable to or even better than large models on specific tasks.
This strategy of using the small to beat the big is becoming a new trend in the AI field. Among the approaches, improving data quality is one of the ways small models beat big ones.
Satish Jayanthi, CTO and co-founder of Coalesce, once described the role of data in models:
If an LLM such as ChatGPT were asked whether the Earth is round or flat and it replied that the Earth is flat, that would be because the data we fed it convinced it this was the case. The data we provide to an LLM and the way we train it directly affect its output.
To produce high-quality results, large language models need to be trained with high-quality, targeted data for specific topics and fields. Just like students need high-quality textbooks to learn, LLMs also need high-quality data sources.
Abandoning the traditional brute-force aesthetic of "great force works miracles," Liu Zhiyuan, a tenured associate professor in Tsinghua University's Department of Computer Science and chief scientist of Mianbi Intelligence, recently proposed the "Mianbi Law" (Wall-Facing Law) for the era of large models: the knowledge density of models keeps increasing, doubling on average every eight months.
Where knowledge density = model capability / model parameters involved in the calculation.
Liu Zhiyuan explained it vividly: if you are given 100 IQ test questions, your score depends not only on how many you answer correctly, but also on how many neurons you mobilized to answer them. Completing more tasks with fewer neurons means a higher IQ.
This is the core idea of knowledge density:
It has two elements. One is the capability the model can achieve; the other is the number of neurons that capability requires, or the corresponding compute consumed.
Compared with the 175-billion-parameter GPT-3 released by OpenAI in 2020, MiniCPM-2.4B, released in 2024 with only 2.4 billion parameters yet comparable performance, represents an increase in knowledge density of about 86 times.
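Written out with the article's own numbers, the definition and the comparison look roughly like this (the 86x figure is Mianbi's reported measurement, based on its own capability evaluations rather than parameter counts alone):

```latex
% Knowledge density as defined above; the ratio is the article's reported figure.
\rho = \frac{\text{model capability}}{\text{parameters involved in computation}},
\qquad
\frac{\rho_{\text{MiniCPM-2.4B}}}{\rho_{\text{GPT-3}}} \approx 86
```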
A study from the University of Toronto also showed that not all data is necessary: high-quality subsets identified from large datasets are easier to process while retaining all the information and diversity of the original dataset.
Even if up to 95% of the training data is removed, the model's predictive performance within a specific distribution may not be significantly affected.
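The underlying idea of data pruning can be sketched in a few lines; the quality scorer below is a toy placeholder, not the method used in the University of Toronto study:

```python
# Sketch of data pruning: keep only the highest-scoring fraction of a dataset.
# `quality_score` is a placeholder; real pipelines use learned or heuristic scorers.
from typing import Callable

def prune_dataset(examples: list[str],
                  quality_score: Callable[[str], float],
                  keep_fraction: float = 0.05) -> list[str]:
    """Keep the top `keep_fraction` of examples by quality score."""
    ranked = sorted(examples, key=quality_score, reverse=True)
    keep_n = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep_n]

# Toy usage: prefer lines with more distinct words as a stand-in for "quality".
corpus = ["useless spam spam spam", "a detailed explanation of gradient descent", "hi"]
print(prune_dataset(corpus, quality_score=lambda s: len(set(s.split())), keep_fraction=0.34))
```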
The most recent example is the Meta Llama 3.1 model.
Meta fed 15T tokens of training data into Llama 3, but Thomas Scialom, the Meta AI researcher responsible for post-training on Llama 2 and Llama 3, said: the text on the Internet is full of useless information, and training on that information is a waste of computing resources.
Llama 3 post-training did not involve any human-written answers… it just leveraged the purely synthetic data from Llama 2.
In addition, knowledge distillation is also an important method to "win big with small".
Knowledge distillation refers to the use of a large and complex "teacher model" to guide the training of a small and simple "student model", which can transfer the powerful performance and superior generalization ability of the large model to a lighter and less computationally expensive small model.
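The standard training objective behind distillation mixes a "soft" term, matching the teacher's temperature-scaled output distribution, with the ordinary "hard" label loss. The sketch below shows the generic recipe, not the specific pipeline used for any particular model:

```python
# Generic knowledge-distillation loss sketch (not any vendor's exact recipe).
# The student matches the teacher's softened output distribution plus the true labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy shapes: batch of 4, vocabulary of 10.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10), torch.randint(0, 10, (4,)))
print(loss.item())
```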
After the release of Llama 3.1, Meta CEO Zuckerberg also emphasized the importance of fine-tuning and distilling small models in his long article "Open Source AI Is the Path Forward".
We need to train, fine-tune, and distill our own models. Every organization has different needs that are best met by using models that are trained or fine-tuned at different scales and using specific data.

Now you can use state-of-the-art Llama models, continue training them on your own data, and then distill them to the model size that best suits your needs — without us or anyone else seeing your data.
It is also generally believed in the industry that the 8B and 70B versions of Meta's Llama 3.1 were distilled from the 405B flagship, which is why their overall performance improved significantly while model efficiency is also higher.
Alternatively, optimizing the model architecture is also key. MobileNet, for example, was designed from the start to run efficient deep learning models on mobile devices.
It drastically reduces the number of parameters through depthwise separable convolutions, which use roughly 8 to 9 times fewer parameters than standard convolutions.
Due to the reduced number of parameters, MobileNet is computationally more efficient. This is particularly important for resource-constrained environments such as mobile devices, as it can significantly reduce computation and storage requirements without sacrificing too much performance.
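The parameter saving is easy to verify by comparing a standard convolution with its depthwise-separable counterpart; the channel sizes below are arbitrary examples:

```python
# Parameter count: standard conv vs. depthwise-separable conv (MobileNet-style block).
import torch.nn as nn

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

c_in, c_out, k = 128, 256, 3

standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # depthwise: one filter per channel
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise: 1x1 conv mixes channels
)

print(n_params(standard), n_params(separable), round(n_params(standard) / n_params(separable), 1))
# 294912 vs 33920 -> roughly 8.7x fewer parameters, in line with the ~8-9x figure above
```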
Despite these technical advances, the AI industry itself still faces long-term investment and high costs, with a relatively long payback period.
According to incomplete statistics from the Daily Economic News, about 305 large models had been launched in China as of the end of April this year, yet as of May 16 about 165 of them had still not completed regulatory filing.
Baidu founder Robin Li has publicly argued that the sheer number of foundation models is a waste of resources, and suggested that more resources be devoted to exploring how models can be combined with industries and to developing the next potential super application.
This is also a core issue in today's AI industry: the mismatch between the surging number of models and their limited real-world application.
Faced with this challenge, the industry's focus has gradually shifted to accelerating the implementation of AI technology, and small models with low deployment costs and higher efficiency have become a more suitable breakthrough point.
Some small models focused on specific domains have also begun to emerge, such as "large models" for cooking or for live streaming. The names may sound a bit grandiose, but they are squarely on the right track.
In short, the AI of the future will no longer be a single, massive entity, but something more diverse and personalized. The rise of small models reflects this trend, and their excellent performance on specific tasks proves that "small and beautiful" can also win respect and recognition.
One more thing
If you want to get an early taste of running models on your iPhone, you might try HuggingChat, an iOS app launched by Hugging Face.
After downloading the app (a foreign-region App Store account, and possibly a VPN, is required), users can access and use a variety of open-source models, including but not limited to Phi 3, Mixtral, and Command R+.
A friendly reminder: for a better experience and performance, the latest-generation Pro iPhone is recommended.