
USTC and Huawei Noah's Ark Lab's entropy law reveals the relationship between large model performance and data compression ratio

2024-07-22


AIxiv is a column where Synced publishes academic and technical content. In the past few years, Synced's AIxiv column has received more than 2,000 articles, covering top laboratories in major universities and companies around the world, effectively promoting academic exchanges and dissemination. If you have excellent work to share, please submit or contact us for reporting. Submission email: [email protected]; [email protected]

This work was completed by the team of Professor Chen Enhong (IEEE Fellow) at the National Key Laboratory of Cognitive Intelligence, USTC, together with Huawei Noah's Ark Lab. Professor Chen's team has long worked in data mining and machine learning, has published many papers in top journals and conferences, and has over 20,000 citations on Google Scholar. Noah's Ark Lab is Huawei's laboratory for basic research in artificial intelligence; it holds that theoretical research and application innovation are equally important, and is committed to driving technological innovation and development in the field of artificial intelligence.

Data is the cornerstone of the success of large language models (LLMs), but not all data is beneficial for model learning. Intuitively, high-quality samples should teach an LLM more efficiently, so existing methods usually focus on quality-based data selection. However, most of these methods evaluate data samples independently, ignoring the complex combinational effects among samples. As shown in Figure 1, even if every sample is of perfect quality, their combination may still be suboptimal because the information they carry is redundant or inconsistent: although the quality-based subset consists of three high-quality samples, the knowledge they encode is actually redundant and conflicting. In contrast, a data subset made up of several relatively lower-quality but diverse samples may convey more information when teaching an LLM. Quality-based data selection therefore does not fully achieve the goal of maximizing the LLM's mastery of knowledge.

This paper aims to reveal the intrinsic relationship between LLM performance and data selection. Inspired by the information-compression nature of LLMs, we uncover an entropy law that links LLM performance to the data compression ratio and to the loss in the first few steps of model training, which respectively reflect the information redundancy of the dataset and how well the LLM has mastered the knowledge encoded in it. Through theoretical derivation and empirical evaluation, we find that model performance is negatively correlated with the compression ratio of the training data, and a higher compression ratio also usually yields a lower training loss. Based on this entropy law, we propose a highly efficient and general data selection method for LLM training, named ZIP, which prioritizes data subsets with a low compression ratio. ZIP greedily selects diverse data in multiple stages and ultimately obtains a data subset with good diversity.
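The article does not spell out how the compression ratio is measured in practice, so the following is only a minimal sketch using Python's standard zlib as a stand-in compressor; the function name and the direction of the ratio (original size over compressed size, so that a lower value means higher information density, matching the usage below) are illustrative choices, not the authors' implementation.

```python
import zlib

def compression_ratio(texts, level=9):
    """Hypothetical proxy for the dataset compression ratio R.

    Defined here as original size / compressed size, so a LOWER ratio means
    the data is harder to compress, i.e. less redundant and of higher
    information density, as described in this article.
    """
    raw = "\n".join(texts).encode("utf-8")
    compressed = zlib.compress(raw, level)
    return len(raw) / len(compressed)

# A redundant corpus compresses far better (high ratio) than a diverse one (low ratio).
redundant = ["the cat sat on the mat"] * 1000
diverse = [f"sample {i}: unrelated fact number {i * 37}" for i in range(1000)]
print(compression_ratio(redundant))  # large ratio: highly redundant
print(compression_ratio(diverse))    # ratio closer to 1: more diverse, information dense
```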



Team: Chen Enhong's team at the National Key Laboratory of Cognitive Intelligence, USTC, and Huawei Noah's Ark Lab

Paper link: https://arxiv.org/pdf/2407.06645

Code link: https://github.com/USTC-StarTeam/ZIP



Figure 1

Entropy law

We theoretically analyze the relationship between data compression and LLM performance. Intuitively, the correctness and diversity of the training data affect the performance of the final model. Likewise, if the data contains serious internal conflicts, or if the model grasps the information encoded in the data poorly, LLM performance may be suboptimal. Based on these assumptions, we denote LLM performance as Z, which we expect to be affected by the following factors:

Data compression ratio R: Intuitively, a dataset with a lower compression ratio has a higher information density.

Training loss L: Indicates whether the data is hard for the model to memorize. Given the same base model, a high training loss is usually caused by noise or inconsistent information in the dataset.

Data consistency C: Reflected by the entropy of the next-token probability given the preceding context. Higher data consistency usually leads to lower training loss (an illustrative proxy is sketched after this list).

Average data quality Q: Reflects the average sample-level quality of the data, which can be measured from various objective and subjective aspects.
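As an illustration of how the consistency factor C could be estimated in the spirit of the description above, here is a minimal sketch that averages the next-token predictive entropy under a small reference language model; the model choice (gpt2) and the function name are assumptions for illustration, not the paper's measurement protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def avg_next_token_entropy(texts, model_name="gpt2", max_length=512):
    """Average per-token predictive entropy under a reference LM.

    Higher average entropy suggests the corpus is less internally consistent
    (the next token is harder to predict from the preceding context).
    """
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    per_text = []
    with torch.no_grad():
        for text in texts:
            ids = tok(text, return_tensors="pt", truncation=True,
                      max_length=max_length).input_ids
            log_probs = torch.log_softmax(model(ids).logits, dim=-1)
            entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (1, seq_len)
            per_text.append(entropy.mean().item())
    return sum(per_text) / len(per_text)
```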



Based on the entropy law, we make two inferences:

If C is treated as a constant, the training loss is directly affected by the compression ratio, and model performance is therefore controlled by it: if the data compression ratio R is high, Z is usually worse. We verify this in our experiments.

Under the same compression ratio, a higher training loss implies lower data consistency, so the effective knowledge learned by the model may be more limited. This can be used to predict LLM performance on different datasets with similar compression ratios and sample quality. We show a practical application of this inference later.
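Since the entropy-law formula itself is not reproduced in this article, the following is only a schematic restatement of the two inferences, written in the symbols defined above (Z, R, L, C, Q); it is not the paper's exact formulation.

```latex
% Schematic only; not the paper's exact formula.
Z \approx f(R, C, Q), \qquad L \approx g(R, C)
% Inference 1: with C (and Q) held fixed, a higher R leads to a lower Z.
% Inference 2: at a fixed R, a higher L implies a lower C,
%              and hence less effective knowledge is learned.
```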

ZIP: A highly lightweight data selection algorithm

Guided by the entropy law, we propose ZIP, a data selection method that chooses samples according to data compression ratio, aiming to maximize the effective information under a limited training-data budget. For efficiency, we adopt an iterative, multi-stage greedy paradigm to obtain an approximate solution with a relatively low compression ratio. In each iteration, a global selection stage first picks a pool of candidate samples with low compression ratios, identifying samples with high information density. A coarse-grained local selection stage then chooses a smaller set of samples that are least redundant with respect to the already selected samples. Finally, a fine-grained local selection stage minimizes the similarity among the samples to be added. This process continues until enough data has been obtained. The specific algorithm is as follows:
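As a rough, hypothetical sketch of the three stages just described (not the released implementation at the GitHub link; the pool sizes k1, k2, k3 and the zlib proxy from the earlier sketch are illustrative assumptions):

```python
import zlib

def ratio(texts):
    """Original size / compressed size; lower means more information-dense."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

def zip_select(corpus, budget, k1=1000, k2=100, k3=10):
    """Hypothetical multi-stage greedy selection guided by compression ratio."""
    selected, remaining = [], list(corpus)
    while len(selected) < budget and remaining:
        # Stage 1 (global): keep the k1 candidates with the lowest individual
        # compression ratio, i.e. the highest information density.
        pool = sorted(remaining, key=lambda s: ratio([s]))[:k1]
        # Stage 2 (coarse-grained local): among them, keep the k2 candidates
        # that are least redundant with respect to the already selected data.
        pool = sorted(pool, key=lambda s: ratio(selected + [s]))[:k2]
        # Stage 3 (fine-grained local): greedily add k3 samples that stay
        # mutually diverse (the locally added batch remains hard to compress).
        batch = []
        while pool and len(batch) < k3:
            best = min(pool, key=lambda s: ratio(batch + [s]))
            batch.append(best)
            pool.remove(best)
        selected.extend(batch)
        for s in batch:
            remaining.remove(s)
    return selected[:budget]
```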



Experimental Results

1. Effectiveness of the ZIP selection algorithm for different LLMs and at different LLM alignment stages

Compared with other SFT data selection algorithms, models trained on ZIP-selected data show advantages in both performance and efficiency. The specific results are shown in the table below:



Because ZIP is model-agnostic and content-agnostic, it can also be applied to data selection in the preference-alignment stage, where ZIP-selected data again shows clear advantages. The specific results are shown in the table below:



2. Experimental verification of the entropy law

Based on the SFT data selection experiments, we fit relationship curves among model performance, data compression ratio, and the model's loss in the first few training steps. The results are shown in Figures 2 and 3, which reveal a close relationship among the three factors. First, data with a low compression ratio usually leads to better model performance. This is because the learning process of LLMs is closely related to information compression: if we regard the LLM as a data compressor, data with a lower compression ratio carries more knowledge and is therefore more valuable to the compressor. At the same time, a lower compression ratio is usually accompanied by a higher training loss, because hard-to-compress data carries more knowledge and poses a greater challenge for the LLM to absorb.
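The curve fitting itself is straightforward; the sketch below shows the fitting step on entirely made-up placeholder (compression ratio, score) pairs, purely to illustrate how such a trend can be checked. These numbers are not the measurements behind Figures 2 and 3.

```python
import numpy as np

# Placeholder (compression_ratio, benchmark_score) pairs -- illustrative only,
# NOT the measurements used in the paper.
R = np.array([2.8, 3.0, 3.2, 3.4, 3.6])
score = np.array([58.1, 57.4, 56.9, 55.8, 55.2])

# A first-order fit is enough to expose the negative trend the entropy law predicts.
slope, intercept = np.polyfit(R, score, deg=1)
print(f"fitted score ~= {slope:.2f} * R + {intercept:.2f}")  # slope < 0
```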



Figure 2 Mistral-7B



Figure 3 Llama-3-8B

3. Practical Application of Entropy Law

We provide an application of the entropy law for guiding incremental updates of LLM training data in a real-world scenario. In this setting, the total amount of training data remains relatively stable and only a small portion of it is modified in each update. The results are shown in Figure 4, which covers five incrementally updated data versions; for confidentiality reasons, only the relative ordering of model performance under the different compression ratios is given. According to the entropy-law prediction, and assuming that data quality does not drop significantly after each incremental update, model performance should improve as the data compression ratio decreases. This prediction is consistent with the data versions shown in the figure.

However, one of the data versions shows an abnormal increase in both loss and data compression ratio, indicating a potential drop in model performance caused by decreased consistency of the training data. This prediction was further confirmed by subsequent model performance evaluation. The entropy law can therefore serve as a guiding principle for LLM training, predicting the potential risk of training failure without having to train the model to convergence on the full dataset, which is particularly important given the high cost of training LLMs.
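In practice, this early-warning use of the entropy law can be reduced to a cheap pre-training check: recompute the corpus compression ratio after each incremental update and flag versions where it rises. A minimal sketch, assuming the same zlib proxy as earlier and placeholder version names:

```python
import zlib

def compression_ratio(texts):
    # original size / compressed size; lower = more information-dense
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw, 9))

def flag_risky_versions(versions):
    """versions: ordered mapping {version_name: list_of_texts}, one per update.

    Flags versions whose compression ratio increased relative to the previous
    version, which the entropy law associates with a risk of degraded performance.
    """
    flagged, prev = [], None
    for name, texts in versions.items():
        r = compression_ratio(texts)
        if prev is not None and r > prev:
            flagged.append(name)
        prev = r
    return flagged
```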



Figure 4