
Is the era of big models over? Industry heavyweights predict: AI models may need to scale down before they can scale up again

2024-07-22



New Intelligence Report

Editor: Ears

【New Intelligence Introduction】Small models are arriving in force. Is the "era of big models" coming to an end?

"Miniature Week" has passed, and the newest battlefield for miniatures has just opened up.

Last week, GPT-4o mini and Mistral NeMo were released in quick succession. These small but fully-equipped models have become a new direction that industry leaders are paying close attention to.


So, are large models about to fall out of favor? Is the scaling law about to stop holding?

Andrej Karpathy, a former OpenAI and Tesla AI researcher who has recently turned to AI education, published a tweet pointing out the trend behind the tech giants' shift toward small models: the race for ever-larger AI models is about to reverse.

He predicts that future models will be smaller but still smarter.


AI giants and some new unicorns have recently released AI models that are more compact, powerful, and affordable than their peers. The latest example is OpenAI’s GPT-4o mini.

Karpathy predicts that this trend will continue. “My bet is that we’ll see a lot of models that can think effectively and reliably, and that are very small,” he writes.

Small Models: Standing on the Shoulders of Giants

In the early stages of LLM development, processing more data and building larger models was an inevitable trend, mainly for the following reasons:

First, data-driven demand.

In an era of data explosion, the sheer volume and diversity of available data require more powerful models to process and understand it.

Large models have the ability to accommodate and process massive amounts of data, and through large-scale data training, they can uncover deep patterns and rules.

Second, the improvement of computing power.

The continuous advancement of hardware technology and the development of high-performance computing devices such as GPUs have provided powerful computing support for the training of large models, making it possible to train large and complex models.

Furthermore, the pursuit of higher performance and precision.

Large models can usually demonstrate excellent performance in multiple fields such as language understanding, generation, and image recognition. The more they understand, the more accurate the results they generate.

Finally, stronger generalization ability.

Large models can better handle new problems and new tasks that have never been seen before, can make reasonable inferences and answers based on previously learned knowledge, and have stronger generalization capabilities.

In addition, competition in the AI field is fierce: research institutions and tech giants alike are committed to developing larger, more powerful models to demonstrate their technical strength and leadership. Scaling up models naturally became the general direction of LLM development.

Karpathy also attributes the size of today's most powerful models to the complexity of their training data, adding that large language models excel at memorization, far surpassing human memory.

Imagine a closed-book exam during finals week that asks you to recite a passage from the textbook given only its first few words.

This is essentially the pre-training goal of today's large models. Karpathy says they are like a greedy snake that just wants to swallow all available data.

Not only can they recite the SHA hashes of common numbers, they can also remember all kinds of knowledge across every field.

However, this way of studying is like memorizing everything in the entire library and on the Internet for an exam.

Admittedly, only a genius could pull off that kind of memorization, yet in the end the exam only draws on a single page of it.

For this kind of gifted student, the reason it is hard for LLMs to do better is that, in the training data, demonstrations of thinking are "entangled" with knowledge.

Moreover, from the perspective of practical applications, large models face high costs and resource consumption during deployment and operation, including computing resources, storage resources, and energy consumption.

Small models are easier to deploy in various devices and scenarios, meeting the requirements of ease of use and low power consumption.

From the perspective of technical maturity, on the other hand, once the nature and regularities of a problem have been thoroughly explored with large models, that knowledge and those patterns can be distilled and applied to the design and optimization of small models.

This lets small models shrink in scale and cost while matching, or even exceeding, the performance of large models.
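The usual way this "large teaches small" refinement is implemented is knowledge distillation. Below is a minimal PyTorch-style sketch under assumed conditions: `teacher`, `student`, and the data batch are hypothetical placeholders, and the article does not describe any vendor's actual recipe.

```python
# Minimal knowledge-distillation sketch. `teacher` and `student` are
# hypothetical models; nothing here is taken from a specific product.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      temperature=2.0, alpha=0.5):
    """Blend a soft-target KL loss (teacher -> student) with the hard-label loss."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    kl = F.kl_div(soft_student, soft_teacher, reduction="batchmean")
    kl = kl * (temperature ** 2)                    # standard temperature scaling
    ce = F.cross_entropy(student_logits, targets)   # ordinary hard-label loss
    return alpha * kl + (1 - alpha) * ce

def train_step(student, teacher, batch, optimizer):
    inputs, targets = batch
    with torch.no_grad():                 # the big teacher is frozen
        teacher_logits = teacher(inputs)
    student_logits = student(inputs)      # the small student learns
    loss = distillation_loss(student_logits, teacher_logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design intuition matches the article's point: the large model's output distribution encodes the "patterns and rules" it has uncovered, and the small model learns from that distribution rather than from the raw data dump alone.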

Although large-model development has hit bottlenecks and small models are gradually becoming the new trend, Karpathy emphasized that large models are still needed, even if they are trained inefficiently, because small models are distilled from them.

Karpathy expects that each model will continue to improve, generating training data for the next model, until there is a "perfect training set."

Even an outdated model like GPT-2, with its 1.5 billion parameters, could become a very powerful and intelligent model by today's standards if retrained on such a perfect training set.

Such a GPT-2, trained on a perfect training set, might score slightly lower on benchmarks like Massive Multitask Language Understanding (MMLU), which covers 57 tasks ranging from elementary mathematics and American history to computer science and law and is used to evaluate a model's breadth of knowledge and understanding.


But the smarter AI models of the future will not win on sheer memorized volume; they will retrieve information and verify facts more reliably.

Just like a top student taking an open-book exam: he may not know everything by heart, but he can quickly locate the correct answer.

OpenAI's Strawberry project reportedly focuses on solving this problem.
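The "open-book" behavior described above is typically implemented with retrieval: instead of memorizing everything, the model looks relevant passages up at answer time. The sketch below is purely illustrative and not a description of OpenAI's systems; `embed` and `small_model` are hypothetical callables.

```python
# Toy "open-book" retrieval sketch. `embed` is a hypothetical text-embedding
# function (any sentence-embedding model could stand in); `small_model`
# is a hypothetical prompt -> text callable.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(question, documents, embed, top_k=3):
    """Return the top_k documents most similar to the question."""
    q_vec = embed(question)
    scored = [(cosine(q_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def answer_open_book(question, documents, embed, small_model):
    """A small model answers from retrieved context instead of raw memory."""
    context = "\n".join(retrieve(question, documents, embed))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return small_model(prompt)
```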

"Slimming" of the "pseudo-fat" model

As Karpathy put it, much of the capacity of super-large models (such as GPT-4) trained on massive amounts of data actually goes into remembering a huge number of irrelevant details, that is, rote memorization.

This is tied to the pre-training objective. During pre-training, the model is required to reproduce the continuation of a text as accurately as possible, which is equivalent to reciting it from memory: the more accurate the recitation, the higher the score.
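Concretely, "reproduce the continuation as accurately as possible" is the standard next-token prediction objective: the model is scored on how much probability it assigns to each actual next token. A minimal sketch under assumed conditions (`model` is a hypothetical network mapping token ids to per-position logits):

```python
# Next-token prediction (pre-training) objective, as a minimal sketch.
# `model` maps token ids to per-position logits and is a placeholder here.
import torch
import torch.nn.functional as F

def pretraining_loss(model, token_ids):
    """Cross-entropy of predicting token t+1 from tokens up to t."""
    inputs  = token_ids[:, :-1]          # everything except the last token
    targets = token_ids[:, 1:]           # everything except the first token
    logits = model(inputs)               # shape: (batch, seq_len - 1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten positions
        targets.reshape(-1),                   # align with the next tokens
    )
```

The better the model "memorizes" the corpus, the lower this loss, which is exactly the exam-style recitation the article describes.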

Although the model can learn the knowledge that recurs in the data, the data also contains errors and biases, and the model has to memorize all of it first and only sort it out later during fine-tuning.

Karpathy believes that with a higher quality training data set, it is possible to train a smaller, more powerful, and more reasoning-capable model.

With the help of super large models, higher quality training data sets can be automatically generated and cleaned.

GPT-4o mini, for example, is trained on data cleaned by GPT-4.
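One hedged illustration of "cleaning data with a larger model": score every raw document with the big model and keep only high-quality ones for training the small model. The `big_model_quality_score` function below is hypothetical; the actual pipelines behind GPT-4o mini are not public.

```python
# Sketch of filtering a raw corpus using a large model's quality judgments.
# `big_model_quality_score` is a hypothetical callable returning 0.0-1.0.
def clean_corpus(raw_documents, big_model_quality_score, threshold=0.8):
    cleaned = []
    for doc in raw_documents:
        score = big_model_quality_score(doc)   # large model judges quality
        if score >= threshold:                 # keep only high-quality text
            cleaned.append(doc)
    return cleaned

# The small model is then trained on `cleaned`, not on the raw data dump.
```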

First make the model bigger, and then "slim it down" on this basis. This may be a new trend in model development.

To use a vivid analogy: today's large models are bloated from ingesting too much data. After data cleaning and plenty of training, they can be transformed into small models with lean muscle.


This process is like a step-by-step evolution, where each generation of models helps generate training data for the next generation until we finally get a "perfect training set".
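That step-by-step evolution can be pictured as a loop in which each generation's model rewrites or synthesizes the data for the next. The sketch below is purely illustrative; every function in it is a hypothetical placeholder for a much larger system.

```python
# Illustrative generation-by-generation refinement loop. `train`,
# `refine_dataset`, and `is_good_enough` are hypothetical placeholders.
def bootstrap_models(initial_dataset, train, refine_dataset, is_good_enough,
                     max_generations=5):
    dataset = initial_dataset
    model = None
    for generation in range(max_generations):
        model = train(dataset)                    # train on the current data
        dataset = refine_dataset(model, dataset)  # model cleans/synthesizes data
        if is_good_enough(dataset):               # approaching a "perfect" set
            break
    return model, dataset
```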

OpenAI CEO Sam Altman made similar remarks, declaring the era of giant AI models over as early as April 2023.

Moreover, it is increasingly becoming a consensus that data quality is a key success factor for AI training, whether it is real data or synthetic data.

The key question, Altman believes, is how AI systems can learn more from less data.

Microsoft researchers reached the same conclusion when developing the Phi models, and Hugging Face researchers likewise embraced the pursuit of high-quality data and released high-quality training datasets.

This means blind scaling is no longer the tech giants' only goal; even small, high-quality models benefit from more diverse and higher-quality data.

Returning to smaller, more efficient models can be seen as the goal of the next stage, and OpenAI's latest release clearly signals that direction of development.

Comments: Correct, pertinent, and incisive

Karpathy also mentioned Tesla's similar approach with its Autopilot network.


Tesla has something called an “offline tracker” that generates cleaner training data by running a previous weaker model.

Hearing Tesla's technology described as ahead of its time, Musk quickly showed up in the comments section:


Netizens in the comment section also expressed their admiration for Karpathy’s foresight and said, “I second the opinion!”

For future general artificial intelligence, smaller and more efficient AI models may redefine "intelligence" in AI and challenge the assumption that "bigger is better."


Sebastian Raschka, author of "Python Machine Learning", sees this as akin to knowledge distillation, such as distilling a smaller Gemma-2 model from the 27B version.

He also reminded us that multiple-choice tests such as the MMLU can test knowledge but cannot fully reflect actual ability.


Some commenters had another idea: if small models perform well and each has its own expertise, why not use several small models together to generate answers?

Gather 10 AI assistants and let the smartest one make the final summary. It's simply an AI version of a think tank.
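That "AI think tank" idea is easy to sketch: several specialized small models each draft an answer, and one aggregator model writes the final summary. Everything below is a hypothetical illustration, not a description of any shipping product; `experts` and `aggregator` are assumed prompt-to-text callables.

```python
# "Think tank" of small models: each drafts an answer, one aggregates.
# `experts` and `aggregator` are hypothetical callables (prompt -> text).
def think_tank(question, experts, aggregator):
    drafts = [expert(question) for expert in experts]   # independent answers
    briefing = "\n\n".join(
        f"Assistant {i + 1} says:\n{draft}" for i, draft in enumerate(drafts)
    )
    prompt = (
        f"Question: {question}\n\n{briefing}\n\n"
        "Write the best final answer, reconciling the drafts above."
    )
    return aggregator(prompt)                            # final summary
```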


So, is AGI a large, all-powerful model, or is it the result of the collaboration of many small models?

References:

https://the-decoder.com/ai-models-might-need-to-scale-down-to-scale-up-again/

https://x.com/karpathy/status/1814038096218083497