news

Jia Yangqing: Large model sizes are following the old path of CNNs; Musk: It's the same at Tesla

2024-08-01


Hengyu from Aofei Temple
Quantum Bit | Public Account QbitAI

The way Transformer-based large models are evolving in size is following the old path of CNNs!

Seeing everyone captivated by LLaMA 3.1, Jia Yangqing voiced this observation.



Comparing the development of large model sizes with that of CNNs, an obvious trend emerges:

In the ImageNet era, researchers and practitioners watched parameter counts grow rapidly, and then began moving toward smaller, more efficient models.

Sound familiar? GPT models kept scaling up their parameters, the industry broadly settled on the Scaling Law, and then GPT-4o mini, Apple DCLM-7B, and Google Gemma 2B appeared.

Jia Yangqing remarked with a smile, "This is from the pre-large-model era, so many people may not remember it well :)".



Moreover, Jia Yangqing is not the only one who senses this. AI luminary Andrej Karpathy thinks so too:

  • The race over large model sizes is heating up… but in the opposite direction!
  • Models must first get "bigger" before they can get "smaller," because we need that process to help us reconstruct the training data into an ideal, synthetic format (a toy sketch of the idea follows this list).
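Purely as an illustration of that point, here is a hypothetical sketch: a large "teacher" model is asked to rewrite raw documents into clean question-and-answer pairs that a smaller model can then be trained on. The `call_teacher` function and the prompt format are placeholders of my own, not anything described in the article.

```python
# Hypothetical sketch of "reconstructing training data into a synthetic format".
# call_teacher is a placeholder for whatever large-model endpoint you use.
import json

def call_teacher(prompt: str) -> str:
    """Placeholder: send a prompt to a large 'teacher' model and return its reply."""
    raise NotImplementedError("wire this to your own large-model API")

def make_synthetic_example(raw_document: str) -> dict:
    """Ask the big model to condense a raw document into one clean Q/A pair."""
    prompt = (
        "Rewrite the following document as one high-quality question and a "
        "concise, correct answer. Reply as JSON with keys 'question' and 'answer'.\n\n"
        f"Document:\n{raw_document}"
    )
    return json.loads(call_teacher(prompt))

# Usage idea: map make_synthetic_example over a raw corpus, deduplicate and
# filter the outputs, and train the smaller model on the resulting pairs.
```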

He even bet that we will eventually see models that are both good and reliable at thinking, yet have very small parameter counts.



Even Musk repeatedly voiced agreement in Karpathy's comment section:



The above could probably be summed up as "great minds think alike."

Expanding on this

Jia Yangqing's musing was prompted by LLaMA 3.1, which held the "strongest model" throne for only a day.

It was the first time that "the strongest open-source model = the strongest model" had been true, and, as expected, it drew a great deal of attention.

However, Jia Yangqing raised a point at this time:

"But I think the industry will really thrive on small, vertical models."

As for what small vertical models are, Jia Yangqing was also very specific, pointing to excellent small and mid-sized models such as Patronus AI's Lynx (the company's hallucination-detection model, which surpasses GPT-4o on hallucination tasks).



Jia Yangqing said that, as far as personal preference goes, he is very fond of hundred-billion-parameter models.

But in practice, he has observed that models in the 7B-70B parameter range are more convenient for people to use:

  • They are easier to host and do not require huge traffic to be profitable (a rough memory estimate is sketched after this list);
  • As long as you ask clear questions, you can get decent-quality output, contrary to some earlier beliefs.
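A rough back-of-the-envelope sketch of the "easier to host" point, with numbers that are my own approximations rather than anything from the article: weight memory alone for a 7B versus a 70B model at different precisions (real serving also needs room for the KV cache and activations).

```python
# Rough estimate of the GPU memory needed just to hold model weights.
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes."""
    return num_params * bytes_per_param / 1e9

for name, params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(params, 2)    # 16-bit weights
    int4 = weight_memory_gb(params, 0.5)  # 4-bit quantized weights
    print(f"{name}: ~{fp16:.0f} GB in fp16, ~{int4:.0f} GB at 4-bit")

# Approximate output: 7B -> ~14 GB fp16 / ~4 GB 4-bit; 70B -> ~140 GB fp16 / ~35 GB 4-bit.
# That gap is roughly why 7B-70B models are considered practical to self-host.
```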

At the same time, he heard that OpenAI’s newest, fastest models were starting to become smaller than “state-of-the-art” large models.



"If my understanding is correct, then this is definitely indicative of an industry trend," Jia said, directly stating his point of view, "that is, to use models that are applicable, cost-effective, and still powerful in the real world."

Jia Yangqing then briefly retraced the development history of CNNs.

First came the era of the rise of CNNs.

Starting with AlexNet (2012), a roughly three-year period of model-size growth began.

VGGNet, which appeared in 2014, was a model of formidable performance and scale.

Next came a period of downsizing.

In 2015, GoogleNet shrank model size from the "GB" level to the "MB" level, roughly 100 times smaller, yet performance did not drop sharply and remained strong.

SqueezeNet, released in 2015, followed a similar trend.
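To make the downsizing concrete, here is an illustrative sketch using commonly cited, rounded parameter counts (my own approximations, not figures from the article), converting each classic CNN into a weights-only fp32 storage size:

```python
# Approximate parameter counts for classic CNNs (rounded, commonly cited values).
approx_params = {
    "AlexNet (2012)":   61e6,
    "VGG-16 (2014)":   138e6,
    "GoogLeNet (2015)":  7e6,
    "SqueezeNet":      1.2e6,
}

for name, n in approx_params.items():
    size_mb = n * 4 / 1e6  # 4 bytes per fp32 parameter, weights only
    print(f"{name:<16} ~{n / 1e6:6.1f}M params -> ~{size_mb:6.1f} MB")

# The roughly 100x drop in parameters from VGG-16 to SqueezeNet mirrors the
# "about 100 times smaller" downsizing described above.
```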

Then, for a while, development focused on striking a balance.

Subsequent work, such as ResNet (2015) and ResNeXt (2016), kept model sizes moderate.

Notably, controlling model size did not mean reducing computation; in fact, people were willing to invest more compute in pursuit of "same parameters, greater efficiency."

What followed was a period when CNNs danced at the edge, that is, on devices.

MobileNet, for example, is an interesting piece of work that Google launched in 2017.

What makes it interesting is that it consumes very few resources yet delivers excellent performance.

Just last week, someone mentioned to Jia Yangqing: "Wow, we are still using MobileNet because it can run on devices and its feature embeddings generalize so well."

Finally, Jia Yangqing borrowed a picture from "A Survey on Efficient Convolutional Neural Networks and Hardware Acceleration" by Ghimire et al.:



And once again posed his own question:

Will large model sizes follow the same trend as in the CNN era?

What do netizens think?

In fact, the development of large models has already produced many examples, like GPT-4o mini, of going small rather than big.

When the figures above voiced these views, others immediately nodded along and offered similar examples to show they see the same trend.

Someone immediately followed up:

  • Here is another supporting example: Gemma 2 distills the knowledge of a 27B-parameter model into smaller versions.



Other netizens noted that building larger models first means the training of subsequent generations of smaller, more vertical models can be "intensified."

This iterative process eventually produces what is called a “perfect training set”.

In this way, smaller models can be as smart as, or even smarter than, today's huge-parameter models in specific domains.

In a nutshell, models must first get bigger before they can get smaller.
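The distillation mentioned in the Gemma 2 example above can be sketched in a few lines. This is a minimal, generic knowledge-distillation loss in PyTorch, assuming a frozen large teacher and a small student; the temperature, mixing weight, and toy shapes are placeholder choices of mine, not Gemma 2's actual recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the soft-target KL loss (teacher -> student) with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale to keep gradient magnitude comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage: a batch of 4 examples over a 10-class vocabulary.
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)  # produced by the frozen big model
labels = torch.randint(0, 10, (4,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```

The intuition is that the big model's soft probabilities carry richer signal than hard labels alone, which is one concrete sense in which "bigger first" helps the smaller model.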



Most of those discussing this view agree with the trend, and some said bluntly that "this is a good thing, more practical and useful than a 'my model is bigger than yours' parameter contest."

But, of course!

Scrolling through the comment sections, though, some people hold different opinions.

For example, this user left a comment under Jia Yangqing's tweet:

  • Mistral AI (behind Mistral Large), Meta (behind LLaMA 3.1), and OpenAI, the companies with the most competitive models, may well be training even larger models right now.
  • I don't see a trend of smaller models achieving technological breakthroughs.



Faced with this question, Jia Yangqing responded promptly.

Here’s what he said: “Exactly! When I say that large model sizes may be following the path of CNNs, I’m definitely not calling for everyone to stop training larger models.”

He further explained that his original point was that as these technologies (CNNs and large models alike) are put into practice more and more widely, people have begun to pay more and more attention to cost-effective models.



So perhaps more efficient, compact models can redefine what counts as AI "intelligence" and challenge the assumption that bigger is better.

Do you agree with this view?

Reference Links:
[1]https://x.com/jiayq/status/1818703217263624385
[2]https://x.com/fun000001/status/1818791560697594310
[3]https://www.patronus.ai/
[4]https://twitter.com/karpathy/status/1814038096218083497