
The large-model trend has shifted: OpenAI and Apple have changed course

2024-07-22



Zhidongxi (Smart Things)
Author: ZeR0
Editor: Mo Ying

There seems to be an unwritten rule in generative AI: every so often, blockbuster releases "collide" head-on.

This year alone, the release of Google's Gemini 1.5 Pro coincided with the launch of OpenAI's video-generation model Sora, and the release of OpenAI's GPT-4o landed right on top of the Google I/O developer conference, letting onlookers around the world catch a strong whiff of gunpowder in the contest for large-model supremacy.

If those earlier coincidences suggested that OpenAI was deliberately upstaging Google, then the fact that Hugging Face, OpenAI, Mistral, and Apple released their strongest lightweight models in quick succession over the past four days is unmistakably a sign of the AI industry's latest trend.

Large AI models are no longer racing only to get "bigger and stronger"; they are now competing fiercely to get "smaller and finer."

Surpassing GPT-4o is no longer the only KPI. As large models enter a critical period of competition for the market, vendors cannot rely on flexing raw technical strength alone; they must also prove that their models are more cost-effective: smaller at the same performance, and higher-performing and cheaper at the same parameter count.


▲The lightweight models GPT-4o mini and Mistral NeMo released last week are both very cost-effective (Source: Artificial Analysis)

In fact, this trend of miniaturizing large models began brewing in the second half of last year.

The game changers are two companies. One is the French AI startup Mistral AI, which beat the 13-billion-parameter Llama 2 with a 7-billion-parameter model last September and became famous in the developer community. The other is the Chinese AI startup Mianbi Intelligence, whose compact on-device model MiniCPM, launched this February, surpassed the 13B-parameter Llama 2 with only 2.4 billion parameters.

Both startups enjoy a good reputation in the developer community, and many of their models have topped open-source trending lists. Mianbi Intelligence in particular, incubated from Tsinghua University's Natural Language Processing Laboratory, caused an uproar this year when its multimodal model was copied ("re-skinned") by a team from a top American university; the original work has since been recognized by academic circles at home and abroad, a point of pride for China's open-source AI models.

Apple, too, began researching on-device models better suited to phones last year. OpenAI, which has always pursued aggressive, brute-force scaling, is the more unexpected entrant: last week's launch of the lightweight GPT-4o mini means the industry leader has voluntarily stepped down from its pedestal and begun following the trend, trying to pry open a wider market with cheaper, more accessible models.

2024 will be a critical year for the "miniaturization" of large models!


▲Incomplete list of lightweight general language models released in 2024; only general language models with ≤8B parameters that can be deployed on device are included, excluding multimodal models (Source: Zhidongxi)

1. Moore's Law in the Era of Big Models: Efficiency is the Key to Sustainability

Large-model development has been sliding into a kind of inertia: brute force works miracles.

In 2020, an OpenAI paper showed that model performance is strongly correlated with scale: swallow more high-quality data and train a bigger model, and you get higher performance.


Following this simple but effective path, a global race toward ever-larger models has played out over the past two years. This has planted the hidden danger of hegemony: only teams with sufficient funding and computing power can afford to stay in the race for the long term.

Last year, OpenAI CEO Sam Altman revealed that training GPT-4 cost at least $100 million. Without a high-margin business model, even the deepest-pocketed technology giants cannot keep investing at any cost indefinitely, and the ecosystem cannot sustain such a bottomless cash-burning game.

The performance gap between the top large language models is visibly narrowing. GPT-4o ranks first, but its benchmark scores are not far ahead of Claude 3 Opus and Gemini 1.5 Pro, and on some capabilities models in the tens-of-billions-parameter class can even do better. Model size is no longer the sole determinant of performance.

It is not that top-tier large models lack appeal; it is that lightweight models are more cost-effective.

The chart below, shared on social media at the end of March by AI engineer Karina Nguyen, traces the relationship since 2022 between large language models' performance on the MMLU benchmark and their cost: over time, language models have reached higher MMLU accuracy while the associated cost has fallen sharply. New models reach roughly 80% accuracy at costs several orders of magnitude lower than a few years ago.


The landscape is shifting fast: the past few months have brought a large wave of cost-effective lightweight model launches.


▲ Smaller models can achieve excellent performance at a lower cost (Source: Embedded AI)

"The race over large language model size is intensifying... backwards!" AI guru Andrej Karpathy bets: "We're going to see some very, very small models that 'think' very well and reliably."

Knowledge density, defined as model capability divided by the parameters involved in computation, captures how much intelligence a model packs into a given parameter budget. GPT-3, released in June 2020, has 175 billion parameters; MiniCPM-2.4B, released this February with comparable performance, has 2.4 billion, corresponding to an increase in knowledge density of roughly 86 times.
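As a back-of-the-envelope check of the definition above (the 86-fold figure is Mianbi's own measurement; the calculation below simply assumes the two models' capability scores C are roughly comparable):

\[
\rho=\frac{\text{model capability}}{\text{parameters involved in computation}},\qquad
\frac{\rho_{\text{MiniCPM-2.4B}}}{\rho_{\text{GPT-3}}}\approx\frac{C/2.4\,\text{B}}{C/175\,\text{B}}=\frac{175}{2.4}\approx 73.
\]

The reported roughly 86-fold increase is of the same order; the extra factor presumably reflects MiniCPM's measured capability sitting somewhat above, rather than exactly at, GPT-3's level.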


Based on these trends, Liu Zhiyuan, tenured associate professor in Tsinghua University's Department of Computer Science and chief scientist of Mianbi Intelligence, recently put forward an interesting proposition: the era of large models has its own "Moore's Law."

Specifically: with the coordinated development of data, computing power, and algorithms, the knowledge density of large models keeps rising, doubling every eight months on average.


▲Changes in the OpenCompass leaderboard show that small-parameter, high-performance models are becoming a trend

By packing circuits ever more densely onto chips, devices of a given computing power have shrunk from supercomputers that filled several rooms to phones that fit in a pocket. Large models, Liu argues, will follow a similar law, which he has named the "Mianbi Law" (literally, the facing-the-wall law).

If the trend continues, the capability of a 100-billion-parameter model trained today could be matched by a 50-billion-parameter model in eight months, and by a 25-billion-parameter model eight months after that.
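Written as a formula (merely a restatement of the claimed eight-month doubling, not an independently verified law), the parameter count N(t) needed to match a fixed capability level would shrink as

\[
N(t)\approx N_0\cdot 2^{-t/8},\qquad t\ \text{in months},
\]

so \(N(8)\approx N_0/2\) and \(N(16)\approx N_0/4\), which is exactly the 100B to 50B to 25B progression described above.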

2. Splitting into Camps: The Closed-Source Price War Is in Full Swing, While Open Source Is Led by China, the US, and Europe

Players in the lightweight large-model race currently fall into several camps.

OpenAI, Google, and Anthropic have all taken the closed-source route. Their flagship models, such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, hold the top performance tier, with parameter counts reaching hundreds of billions or even trillions.

Their lightweight models are slimmed-down versions of the flagships. After last week's launch, GPT-4o mini became the most cost-effective option in the sub-10B class on the market, with performance exceeding Gemini Flash and Claude Haiku. On the consumer side it replaced GPT-3.5 as the free default for users; on the business side it slashed API prices, lowering the threshold for adopting large-model technology even further.


Andriy Burkov, author of Machine Learning Engineering, inferred from the pricing that GPT-4o mini has roughly 7B parameters. Li Dahai, CEO of Mianbi Intelligence, speculated instead that GPT-4o mini is a "wide MoE" model with a large number of experts rather than an on-device model, positioned as a cost-effective cloud model to sharply cut the cost of deploying large models in industry.

The camp of open source lightweight models is even larger, with representative players from China, the United States and Europe.

In China, companies such as Alibaba, Mianbi Intelligence, SenseTime, and the Shanghai AI Laboratory have all open-sourced lightweight models. Alibaba's Qwen series is a regular fixture in lightweight-model benchmark comparisons, while Mianbi Intelligence's MiniCPM series has become a showcase for small models outclassing much larger ones and is highly praised in the open-source community.

Mianbi Intelligence is a notably forward-looking team. In 2020 it was the first in China to commit to the large-model route and began exploring efficient fine-tuning techniques to cut training costs. Early last year it turned to AI agents, released a trillion-parameter multimodal model in August, and applied large models and agent technology to scenarios such as finance, education, government affairs, and smart devices. At the end of the year it set a course of device-cloud collaboration, and this year it has launched a string of high-efficiency, low-power on-device models.

Over the past six months, Mianbi Intelligence has released the base models MiniCPM 2.4B and MiniCPM 1.2B, the long-text model MiniCPM-2B-128k, the multimodal models MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5 (the latter reaching GPT-4V-level performance), the mixture-of-experts model MiniCPM-MoE-8x2B, and more. To date, the MiniCPM series has accumulated nearly 950,000 downloads and 12,000 GitHub stars.

The startup has also built the more energy-efficient MiniCPM-S 1.2B using an efficient sparse architecture: its knowledge density is 2.57 times that of the same-scale dense model MiniCPM 1.2B and 12.1 times that of Mistral-7B, further bearing out the "Mianbi Law" and markedly lowering inference costs.


▲Mianbi Intelligence's MiniCPM series is iterating rapidly and raising knowledge density

In the US open-source lightweight camp, the big technology companies are heavily involved, including Meta, Microsoft, Google, Apple, and Stability AI, and the spectacle of newcomers overtaking their predecessors plays out again and again.

Hugging Face also launched three SmolLM models last week, at 135M, 360M, and 1.7B parameters, with very competitive performance against models of the same size; the 1.7B version beats Microsoft's Phi-1.5, Meta's MobileLLM-1.5B, and Alibaba's Qwen2-1.5B on multiple benchmarks.

Apple, known for its "closed" culture, has been surprisingly active in open-source AI: it released the Ferret multimodal model last October; in April it released four OpenELM pre-trained models with 270 million to 3 billion parameters; and its latest DCLM models include a 6.9B version that outperforms Mistral 7B and a 1.4B version whose MMLU score exceeds SmolLM-1.7B.


▲Apple's model trained on DCLM-Baseline (orange) performs well against closed-source models (crosses) and other open-source datasets and models (circles)

Europe's representative player is none other than the French large-model unicorn Mistral AI. Just last week it released Mistral NeMo, a compact 12B model that supports a 128k context window and outperforms Google's Gemma 2 9B and Llama 3 8B; its reasoning, world knowledge, and coding abilities are the strongest among open-source models of its class.

These advances are showing potential for miniaturizing large models.

Hugging Face co-founder and CEO Clem Delangue predicts that smaller, cheaper, faster, more personalized models will cover 99% of use cases: you don't need a $1 million Formula 1 car to drive to work every day, and you don't need a bank's customer chatbot to tell you the meaning of life.

3. How Did Lightweight Models Become the Money-Saving Experts of the Large-Model Industry?

The miniaturization of large models is the inevitable trend for the universal use of AI.

Not every application needs the most powerful model. Commercial competition weighs cost-effectiveness and prizes good quality at a low price, and different scenarios and businesses have very different requirements for output quality versus cost.

Ultra-large models impose steep learning costs on developers and are time-consuming from training through deployment. A leaner model improves the input-output ratio, building competitive systems with less money, data, and hardware and shorter training cycles, which in turn cuts infrastructure costs, improves accessibility, and speeds up deployment and real-world adoption.


▲According to Apple's DataComp-LM paper, the fewer model parameters, the less computing power and time required for training

For specific applications, lightweight models need less data and are therefore easier to fine-tune for particular tasks to hit the required performance and efficiency. Thanks to their leaner architecture, they need less storage and compute; once optimized for on-device hardware, they can run locally on laptops, smartphones, and other small devices, offering low latency, easy access, and privacy protection, since personal data need not leave the device.

A lightweight high-performance model may be small, but "condensing knowledge into fewer parameters under limited compute and energy budgets" is no low technical bar.

The training recipe is to go big first, then go small: distill the essence of a complex large model into a compact one. Google's smaller Gemma 2 model, for example, was distilled from the knowledge of the 27B version.
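To make the "bigger first, then smaller" recipe concrete, below is a minimal sketch of a generic teacher-student distillation loss in PyTorch. It shows the standard soft-label formulation, not the specific procedure Google used for Gemma 2; the temperature, weighting, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend soft teacher targets with ordinary hard-label cross-entropy."""
    # Soft targets: push the student's softened distribution toward the teacher's.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude stays comparable across temperatures
    # Hard targets: standard next-token cross-entropy against the ground truth.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a batch of 4 token positions over a 32,000-entry vocabulary.
student_logits = torch.randn(4, 32000, requires_grad=True)
teacher_logits = torch.randn(4, 32000)
labels = torch.randint(0, 32000, (4,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```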

But in terms of specific technical routes, different players have different approaches.

Take training data. Meta generously fed Llama 3 with 15T tokens, while Microsoft and Apple focused on optimizing datasets and innovating on data methods: Microsoft's Phi-3 used only 3.3T tokens, and Apple's DCLM 7B only 2.6T. According to Apple's DataComp-LM paper, improving the training dataset strikes a balance between compute and performance and reduces training costs. Mistral NeMo, released last week, compresses text and code more efficiently than earlier models thanks to its new Tekken tokenizer.
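Tokenizer efficiency of the kind Mistral claims for Tekken can be checked empirically by counting how many tokens different tokenizers need for the same text. The sketch below uses two public Hugging Face checkpoints (gpt2 and Qwen/Qwen2-1.5B) purely as stand-ins, not Tekken itself, and the sample text is made up.

```python
from transformers import AutoTokenizer

# The same mixed prose / Chinese / code snippet is fed to each tokenizer.
sample = (
    "Lightweight models are closing the gap with flagship models. "
    "大模型正在变得更小、更便宜。\n"
    "def add(a, b):\n    return a + b\n"
)

for name in ["gpt2", "Qwen/Qwen2-1.5B"]:
    tok = AutoTokenizer.from_pretrained(name)
    n_tokens = len(tok(sample)["input_ids"])
    # Fewer tokens for the same text means better compression and cheaper inference.
    print(f"{name}: {n_tokens} tokens, {len(sample) / n_tokens:.2f} characters per token")
```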

"Becoming smaller" also requiresArchitecture InnovationFor example, Apple's OpenELM model is designed to fine-tune the model layering to address hardware bottlenecks in order to improve the operating efficiency on the end side; the MiniCPM-S 1.2B efficient sparse model of FaceWall Intelligence achieves a sparsity of nearly 88%, reducing the energy consumption of the full link layer to 84%, and increasing the decoding speed by 2.8 times compared to the corresponding dense model without compromising performance.


▲Classification of technologies for implementing resource-efficient large language models (Image source: Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models)

Building large models is systems engineering, and it calls for exploring the direction of "scientific AI": continuously iterating on algorithms, architecture, data governance, multimodal fusion, and other technical elements so that models can be trained more reliably, more predictably, and to higher quality, steadily raising their knowledge density.

Training and optimizing models quickly requires an efficient production line: a full-pipeline tool suite and an efficient, scalable training strategy. Mianbi Intelligence's model sandbox mechanism, for example, uses small models to predict the performance of large ones and shares hyperparameter schemes between large and small models, allowing model capability to be built up rapidly.
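The flavor of "use small runs to predict big ones" can be illustrated with a scaling-law fit. The sketch below is a generic example with made-up loss numbers, not Mianbi's actual sandbox method: it fits a power law to small-model results and extrapolates to a larger parameter count.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical validation losses from small "sandbox" training runs (made-up numbers).
params_b = np.array([0.1, 0.3, 0.5, 1.2, 2.4])   # parameter counts, in billions
losses = np.array([3.62, 3.28, 3.15, 2.96, 2.84])

def power_law(n, a, b, c):
    # Loss decays as a power of parameter count toward an irreducible floor c.
    return a * n ** (-b) + c

(a, b, c), _ = curve_fit(power_law, params_b, losses, p0=[1.0, 0.3, 2.0], maxfev=10000)
print(f"predicted loss for a 10B-parameter run: {power_law(10.0, a, b, c):.2f}")
```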


▲Actual comparison of inference decoding speed between MiniCPM 1.2B and MiniCPM-S 1.2B

To accelerate bringing large models to smart devices, Mianbi Intelligence recently open-sourced MobileCPM, billed as the industry's first out-of-the-box on-device large-model toolset, along with a step-by-step, hand-holding tutorial that lets developers integrate a large model into an app with one click.


▲Mianbi Intelligence's on-device large-model toolset MobileCPM

This year marks the breakout year for edge AI. Chip giants such as Intel, NVIDIA, AMD, and Qualcomm, together with the major AI PC and smartphone makers, are all pushing rich edge AI applications, and device makers have begun partnering with general-purpose model vendors to bring lightweight models to a broad range of edge devices.

As on-device chips grow more powerful and model knowledge density rises, the models that can run locally keep getting larger and better. With GPT-4V-level models already running on device, Liu Zhiyuan predicts that GPT-3.5-level models will run on device within a year, and GPT-4o-level models within two.

Conclusion: A Large-Model Race That Doesn't Burn Money

In technology, the historical pattern of getting smaller, cheaper, and easier to use repeats again and again. In the mainframe era, computers were high-end luxuries accessible only to the wealthy and the elite; through the minicomputer era, advances made computing devices ever more portable and approachable, until PCs and mobile phones entered the daily work and lives of ordinary people.

Just as we need supercomputers with massive computing power and mobile phones that ordinary people can fit in their pockets, the era of generative AI requires extremely intelligent large models, as well as economical models that are closer to users, more cost-effective, and can meet specific application needs.

OpenAI GPT-4o still stands at the top of the most powerful AI big models, but it is no longer as invincible as before. Many GPT-4-level big models have achieved similar performance. At the same time, more compact and efficient big models are challenging the idea that "bigger is better". The new trend of "small wins big" is expected to change the way AI is developed and open up new possibilities for the implementation of AI in enterprise and consumer environments.

The shift from piling on scale to miniaturization marks a major change in the AI industry: competition among large models is moving from raw performance gains toward the finer-grained needs of the real world. Within this trend, China's open-source forces, represented by Mianbi Intelligence, are growing vigorously, using a string of technical innovations to validate the knowledge-density law of large models in a more economical and practical way, and ultimately to push large models into real application scenarios.