
Meta launches MobileLLM, a 350M-parameter small model for mobile devices, to challenge the Scaling Law. Its performance is comparable to that of LLaMA-v2 7B.

2024-07-22



New Intelligence Report

Editor: Qiao Yang

【New Intelligence Introduction】The Scaling Law has not yet run its course, yet "small models" have become a trend the technology giants are racing to catch up with. Meta's recently released MobileLLM series pushes the scale below 1B: the two versions have only 125M and 350M parameters respectively, yet achieve better performance than larger-scale models.

From the press conferences of several technology giants in May and June, an important trend in AI was already discernible: from cloud data centers to individual users, and from large servers to laptops and mobile devices.

Following the Scaling Law is no longer the only path, and the story of "small wins big" models continues to unfold.

Microsoft moved first, and Google followed.

On the hardware side, we have seen AI capabilities gradually being deeply integrated into electronic products.

Microsoft's controversial Recall feature, for example, is a key part of this push; Apple has launched its own effort under the banner of Apple Intelligence, striving to integrate AI seamlessly into iOS.

Today's LLMs often have tens of billions of parameters. Apple's 3B on-device model looks tiny by comparison, yet even that is a high bar for mobile devices such as phones.

Not only does that model use mixed 2-bit and 4-bit precision compression (an average of 3.5 bits per weight), it also requires at least 8GB of memory and an M1 chip to run.

A recent paper from Meta shows that the parameter count can be pushed even lower: the newly proposed MobileLLM models have fewer than 1B parameters, yet their performance remains impressive.


Paper address: https://arxiv.org/abs/2402.14905

LeCun also tweeted his endorsement of the research, praising the series of techniques used to slim down the parameter count.


This paper has been accepted by ICML 2024, and the training code of the model has been open sourced on GitHub.


GitHub address: https://github.com/facebookresearch/MobileLLM

Introduction

Let's start with a thought experiment: if GPT-4 (roughly one trillion parameters) were deployed for everyday use at an inference speed of 50 tokens/s, what hardware would it take?

The answer is on the order of 100 million H100 GPUs. Never mind mobile devices; you couldn't even fit that in your home.

What if we lower the bar and use a model like LLaMA-v2 7B with 8-bit quantization?

A quick calculation shows that merely storing the model weights takes about 7GB, and that is not disk space but precious RAM (DRAM).


Moreover, an AI model cannot occupy all of DRAM. Leaving room for the operating system and other applications, the LLM's share of memory should not exceed about 10%.

According to the statistics in Figure 2, mobile devices recently released by the major brands generally ship with 6-12GB of DRAM. This means that for a model to be deployed on a phone, its parameter count should be brought down to <1B.

RAM is not the only problem; power consumption is another. A 7B model consumes roughly 0.7 J per token, and a fully charged iPhone has about 50 kJ to spare. At a generation rate of 10 tokens/s, a full charge is only enough for about two hours of conversation with the model.
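
To make these figures concrete, here is a short back-of-the-envelope calculation using only the numbers quoted above:

```python
# Back-of-the-envelope estimates for on-device deployment of a 7B model.
params = 7e9                  # LLaMA-v2 7B
bytes_per_param = 1           # 8-bit quantization
print(params * bytes_per_param / 1e9, "GB of DRAM just for the weights")  # ~7 GB

battery_j = 50e3              # ~50 kJ in a fully charged iPhone
joules_per_token = 0.7        # energy cost of a 7B model
tokens_per_s = 10             # generation rate
hours = battery_j / joules_per_token / tokens_per_s / 3600
print(round(hours, 1), "hours of continuous generation per charge")       # ~2 hours
```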

Given these constraints, a sub-1B model is the more realistic choice for on-device deployment. MobileLLM is therefore positioned at 125M/350M parameters, an order of magnitude smaller than Apple's 3B model: a mini among minis.

But don't be constrained by the Scaling Law: a small parameter count does not have to mean weak capability. The importance of model architecture deserves renewed attention.


MobileLLM not only achieves SOTA performance among models of its size, it also argues that architectural depth matters more than width: a "deep and narrow" small model can still learn abstract concepts.

Architecture and Methodology

With only 125M/350M parameters to work with, how to optimize the architecture within such a tight budget becomes the key question.

For LLMs under 1B parameters, the authors explore four effective architecture design techniques (a minimal code sketch of the first follows the list):

1) Using a SwiGLU feed-forward network

2) Making the overall network shape "long and narrow", that is, deep and thin

3) Revisiting the embedding-sharing method

4) Using grouped-query attention
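
As a concrete illustration of technique 1, here is a minimal PyTorch sketch of a SwiGLU feed-forward block; the dimensions are illustrative placeholders, not MobileLLM's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Feed-forward block with SwiGLU gating, as used in many recent LLMs."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)  # gate branch
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)    # value branch
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)  # project back

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: SiLU(x @ W_gate) elementwise-multiplied with (x @ W_up)
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

ffn = SwiGLUFFN(dim=512, hidden_dim=1376)          # illustrative sizes only
print(ffn(torch.randn(2, 16, 512)).shape)          # torch.Size([2, 16, 512])
```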


On top of this, the authors also propose a block-wise layer-sharing method, which further improves accuracy without introducing extra memory overhead, at the cost of additional inference latency during decoding.

The variant with the layer-sharing mechanism is labeled MobileLLM-LS.

Refuting the Scaling Law: architectural design matters for small models

The 2020 paper that proposed the Scaling Law argued that the amount of training data, the number of parameters, and the number of training iterations are the key determinants of performance, while the influence of model architecture is almost negligible.

However, the authors of this paper show through comparative experiments that this law does not hold for small models.

When the parameter count is fixed at 125M or 350M, "long and narrow" models with 30 to 42 layers perform significantly better than "short and fat" models with about 12 layers (Figure 4). The same trend appears across eight benchmarks covering commonsense reasoning, question answering, and reading comprehension.


This is actually a very interesting finding, because in the past, when designing architectures for small models of the 125M scale, the number of layers would generally not exceed 12.
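
As a rough sanity check of how a deep-and-narrow design can match a shallow-and-wide one at the same budget, the common approximation of about 12·d² parameters per transformer layer (ignoring SwiGLU and GQA adjustments) gives the following; these are illustrative numbers, not the paper's configurations:

```python
# Approximate decoder-only transformer size: ~12*d^2 per layer plus a tied 32k-vocab embedding.
def approx_params(layers: int, dim: int, vocab: int = 32_000) -> float:
    return layers * 12 * dim ** 2 + vocab * dim

for layers, dim in [(12, 768), (30, 512)]:         # shallow-wide vs deep-narrow
    print(f"{layers:>2} layers, dim {dim}: ~{approx_params(layers, dim) / 1e6:.0f}M params")
# 12 layers, dim 768: ~110M params
# 30 layers, dim 512: ~111M params
```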

Why revisit "embedding sharing"?

The "embedding sharing" method was first proposed by small models such as OPT, because the parameters of the encoding layer in small models account for a considerable proportion.

For example, the 125M model uses a vocabulary of 32k and an embedding dimension of 512, so the input and output embedding layers contain 16M parameters, accounting for about 20% of the model.

By comparison, the embedding layers of large models account for a negligible share of the parameters: in LLaMA-7B the ratio drops to 3.7%, and in LLaMA-70B it is only 0.7%. Embedding sharing is therefore dispensable for large LLMs.
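
A quick check of these ratios, assuming a 32k vocabulary and the public LLaMA embedding dimensions (the exact share for the 125M model depends on whether the input and output tables are tied):

```python
# Embedding-parameter share for a few model sizes (assumed 32k vocabulary).
VOCAB = 32_000

def embedding_share(dim: int, total_params: float, tied: bool):
    tables = 1 if tied else 2            # tied = one table shared for input and output
    emb = VOCAB * dim * tables
    return emb / 1e6, emb / total_params

print(embedding_share(512, 125e6, tied=True))    # ~16M-parameter table in a 125M model
print(embedding_share(4096, 7e9, tied=False))    # ~262M, ~3.7% (the LLaMA-7B figure)
print(embedding_share(8192, 70e9, tied=False))   # ~524M, ~0.7% (the LLaMA-70B figure)
```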

The fact that embedding sharing fell out of favor in the era of large models does not mean the technique no longer applies to small models; it makes the architecture more compact and more efficient.

As shown in Table 1, with embedding sharing the model maintains its original overall performance while shedding 16M parameters, and even improves on some benchmarks.


Layer Sharing Mechanism

As noted above, the experiments found that making a small model "slim", that is, deeper, helps performance. The authors therefore asked: if a layer-sharing mechanism is introduced, wouldn't that be equivalent to increasing the model's depth while keeping the total parameter count unchanged?

Experiments show that this does improve performance, and the paper compares several layer-sharing schemes (Figure 6). After weighing device memory, accuracy, and inference latency, immediate block-wise sharing (Figure 6b) was chosen.
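
To show the idea, here is a minimal sketch of immediate block-wise sharing over a generic stack of transformer blocks; MobileLLM's actual implementation is in the open-sourced repository and may differ in detail:

```python
import torch
import torch.nn as nn

class SharedDepthStack(nn.Module):
    """Apply each stored block `repeat` times back-to-back (immediate block-wise sharing):
    effective depth = num_blocks * repeat, while parameters stay at num_blocks layers."""
    def __init__(self, num_blocks: int, dim: int, repeat: int = 2):
        super().__init__()
        # Generic encoder layers stand in for the model's transformer blocks.
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                       dim_feedforward=4 * dim, batch_first=True)
            for _ in range(num_blocks)
        )
        self.repeat = repeat

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            for _ in range(self.repeat):   # the same weights are reused immediately
                x = block(x)
        return x

stack = SharedDepthStack(num_blocks=4, dim=512)    # 4 stored blocks, 8 effective layers
print(stack(torch.randn(2, 16, 512)).shape)        # torch.Size([2, 16, 512])
```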


Evaluation Experiment

The authors built MobileLLM/MobileLLM-LS models at 125M and 350M parameters and pre-trained them on a dataset of 1T tokens.

The pre-trained models are evaluated zero-shot on multiple datasets, including common benchmarks such as ARC-easy, ARC-challenge, HellaSwag, WinoGrande, TQA, and RACE.

Table 3 shows the zero-shot commonsense-reasoning results. The MobileLLM series achieves essentially across-the-board SOTA, not only surpassing earlier classics such as OPT and BLOOM, but also outperforming more recently released models with more parameters, such as GPT-Neo, Galactica, and RWKV.


On question answering and reading comprehension, MobileLLM also performs well (Table 4). On TQA, MobileLLM 125M and 350M improve on comparable models by more than 6.4 points and about 10 points respectively.

Downstream tasks

In addition to benchmarking, the paper also takes into account the various requirements for the model when deployed in application scenarios and conducts corresponding evaluations.

AlpacaEval and MT-Bench test single-turn and multi-turn chat performance respectively. Against the other three baseline models, MobileLLM again performs best; with only 350M parameters it can even surpass models with more than 1B parameters.


Beyond chat, in the API-calling scenario MobileLLM's exact-match (EM) score can rival that of the 7B-parameter LLaMA-v2.


MobileLLM also takes well to post-training quantization (PTQ): after W8A8 quantization the model's performance drops by less than 0.5 points, and it remains compatible with the layer-sharing mechanism, so it can be deployed under even tighter hardware constraints.
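
For readers unfamiliar with W8A8, here is a minimal sketch of symmetric per-tensor int8 quantization applied to a weight matrix and an activation batch; it illustrates the general PTQ idea, not the paper's specific quantization recipe:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: returns int8 values plus a float scale."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(512, 512).astype(np.float32)    # weights     -> "W8"
x = np.random.randn(16, 512).astype(np.float32)     # activations -> "A8"

wq, ws = quantize_int8(w)
xq, xs = quantize_int8(x)

# int8 matmul accumulated in int32, then rescaled back to float
y_quant = (xq.astype(np.int32) @ wq.astype(np.int32).T) * (xs * ws)
y_ref = x @ w.T
print(float(np.max(np.abs(y_quant - y_ref))))        # small quantization error
```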


About the Author

The paper's corresponding author, Zechun Liu, is a research scientist at Meta Reality Labs. She received her undergraduate degree from Fudan University and her doctorate from the Hong Kong University of Science and Technology. Before joining Meta, she spent more than two years as a visiting scholar at CMU.


Zechun's research interests center on applying deep learning in real-world settings with resource constraints and trade-offs between compute and accuracy, with a focus on network binarization and quantization, channel pruning, architecture design, and knowledge distillation.

References:

https://x.com/ylecun/status/1810035281472491665

https://arxiv.org/abs/2402.14905