
LLM alignment data can now be synthesized fully automatically! A Chinese PhD student at UW proposed the Magpie method, which even runs on a MacBook Air

2024-07-26



New Intelligence Report

Editor: Qiao Yang

【New Intelligence Introduction】A recent paper from the University of Washington and Allen AI proposes a novel and interesting data synthesis method: by fully exploiting the autoregressive nature of LLMs, the model can be guided to automatically generate high-quality instruction fine-tuning data.

Data is crucial for LLM training, yet attention usually goes to pre-training and evaluation data while fine-tuning data is overlooked.

For example, in the Llama family of models, the weights are open (as with Llama-3-Instruct), but the fine-tuning dataset remains private.

A large part of an LLM's success depends on instruction fine-tuning, a process that lets the model generalize better to tasks it was not exposed to during training.

Just as the effectiveness of training depends on the quality of the training corpus, the effectiveness of instruction fine-tuning also depends on the availability of high-quality instruction datasets.

However, compared with unlabeled self-supervised training corpora, high-quality fine-tuning and alignment datasets are much harder to construct and scale, because they require extensive manual annotation and a pre-defined range of prompts.

Even companies that specialize in supplying data to the AI giants cannot fully automate labeling at this stage, and still have to hire domain experts at high salaries to help build fine-tuning and alignment datasets.

Alexandr Wang, CEO of Scale AI, has made this point himself.

Recently, a paper jointly published by the University of Washington and the research institute Allen AI focused on how to get an aligned LLM to synthesize high-quality fine-tuning data on its own.


Paper address: https://arxiv.org/abs/2406.08464

The method proposed in the paper automates the entire process without requiring any seed questions. Even more remarkably, the code not only runs locally, it also uses the LLM to automatically generate very reliable, high-quality data.

After fine-tuning the Llama-3-8B base model on the SFT dataset they generated, the authors obtained a model that performs better than the officially fine-tuned Llama-3-Instruct.


The paper was shared and endorsed by Sebastian Raschka, a well-known figure in the AI community.


At first he did not believe the method could really run locally on a MacBook Air, but after trying it himself he was pleasantly surprised to find that it could.


Raschka is the author of several technical bestsellers, including Build a Large Language Model (From Scratch) and Python Machine Learning, and currently works as a research engineer at Lightning AI.



The first author of the paper, Zhangchen Xu, is a second-year PhD student in the Network Security Lab at the University of Washington, advised by Professor Radha Poovendran. His research interests are the security, privacy, and fairness of machine learning, and he currently focuses on how to build trustworthy LLMs.


Let us take a closer look at how this efficient data synthesis method is achieved.

Method Overview

A typical LLM input generally consists of three parts:

- Pre-query template

- Query content (query)

- Post-query template

The two templates are usually pre-defined by the model developer to ensure that the model is prompted in the expected format.

For example, the input format of Llama-2-chat is:

[INST] Hi! [/INST]
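
To make the three-part structure concrete, here is a minimal sketch in Python. It uses the publicly documented Llama-3-Instruct chat markup as an illustration; if your tokenizer uses a different template, treat these strings as assumptions.

```python
# Illustrative only: the three parts of a chat prompt, using Llama-3-Instruct
# style markup (an assumption for illustration; Llama-2-chat would use
# "[INST] ... [/INST]" instead).
PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def full_prompt(query: str) -> str:
    """Normal usage: pre-query template + query content + post-query template."""
    return PRE_QUERY + query + POST_QUERY

# The trick explored below: send only the pre-query template, so the
# autoregressive model "completes" the user turn, i.e. it invents a query.
instruction_eliciting_prompt = PRE_QUERY
```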

In previous studies, there were usually two ways to construct fine-tuning datasets. One is to have humans write them by hand, which is obviously time- and resource-consuming. The other is to start from a small number of manually annotated seed instructions and prompt an LLM to synthesize more.

Although the second method saves manpower, it places heavy demands on prompt engineering and on the choice of the initial seed questions; in other words, it is difficult to scale in a controllable way.

A more serious problem is that the synthesized instructions are often very close to the seed instructions, which severely limits the diversity of large-scale datasets. Creating high-quality, diverse instruction datasets in a scalable way remains a challenging problem in the LLM field.

However, the authors made an interesting discovery in early experiments: because of the autoregressive nature of LLMs, when only the pre-query template is given as input, the model spontaneously synthesizes queries that appear to have good quality and diversity. This suggests the model is effectively drawing on the capabilities it learned during alignment.

Inspired by this, the authors proposed the following idea to construct an instruction dataset: use the pre-query template as a prompt, input it into the aligned LLM, and automatically generate instruction data.

As shown in the figure below, each instruction data instance contains one or more instruction-response pairs, and specifies the roles of the instruction provider and follower.


Figure 1 describes the entire automatic data generation pipeline, which consists of roughly two steps.

The first is instruction generation. The MAGPIE method formats the query in the LLM's predefined instruction template, but includes only the role of the instruction provider (such as user) and no specific instruction content.

Using this as the LLM input, the model generates an instruction autoregressively. This process keeps the generated instructions diverse, since no specific prompt-engineering techniques are required and no seed questions are used.

In the second step, MAGPIE feeds the previously generated instruction back into the LLM to obtain the response.

By repeating these two steps, multi-turn instruction data can be obtained. If you want to generate data for a specific domain, you can add corresponding prompts.
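
The following Python sketch illustrates the two steps with Hugging Face transformers. The model name, sampling settings, and hand-written chat markup are assumptions for illustration, not the authors' exact configuration.

```python
# A minimal sketch of the two-step instruction/response generation loop.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

PRE_QUERY = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
POST_QUERY = "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    out = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=True, temperature=1.0
    )
    # Keep only the newly generated tokens.
    return tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Step 1: instruction generation -- only the pre-query template is given,
# so the model autoregressively invents a user query. In practice you would
# truncate the output at the model's end-of-turn token.
instruction = generate(PRE_QUERY, max_new_tokens=128).strip()

# Step 2: response generation -- wrap the sampled instruction in the full
# template and ask the same model to answer it.
response = generate(PRE_QUERY + instruction + POST_QUERY).strip()

print({"instruction": instruction, "response": response})
```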


After obtaining the raw generations, the authors also filtered them by indicators such as text length, task category, input quality, and input difficulty.
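
As a rough illustration of what such filtering might look like, here is a hedged sketch; the field names and thresholds are hypothetical, and the paper's exact criteria may differ.

```python
# Hypothetical post-generation filter over annotated instances.
def keep_instance(inst: dict,
                  min_chars: int = 10,
                  max_chars: int = 2048,
                  allowed_quality: tuple = ("average", "good", "excellent")) -> bool:
    n_chars = len(inst["instruction"])
    if not (min_chars <= n_chars <= max_chars):
        return False                      # drop degenerate or runaway text
    if inst.get("input_quality") not in allowed_quality:
        return False                      # drop low-quality instructions
    if inst.get("task_category") == "other":
        return False                      # drop uncategorizable instances
    return True

example = {
    "instruction": "Explain the difference between a stack and a queue.",
    "response": "A stack is last-in-first-out; a queue is first-in-first-out.",
    "input_quality": "good",
    "task_category": "information seeking",
}
print(keep_instance(example))  # True under these illustrative thresholds
```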


The paper uses the Llama-3-8B-Instruct and Llama-3-70B-Instruct models to construct two datasets, MAGPIE-Air and MAGPIE-Pro, and gives examples of generated instructions in the appendix:


As you can see, the text quality is indeed good, fully comparable to instructions written by humans.

However, evaluating such a large amount of data cannot rely on subjective impressions alone, so the authors conducted a quantitative analysis of the generated MAGPIE-Pro dataset.

Dataset Analysis

Coverage

To assess the diversity of the instruction text, an effective indicator is the coverage of the text embeddings in semantic space.

The authors randomly sampled instruction texts from MAGPIE-Pro, encoded them into embedding vectors, and projected them into two dimensions with t-SNE. Three baseline datasets were used for comparison: Alpaca, Evol Instruct, and UltraChat.

Each t-SNE projection in the figure below is computed from 10,000 randomly selected instructions. The projection of MAGPIE-Pro largely covers the regions of the other three datasets, indicating that it spans a wider range of topics.
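
A rough sketch of this coverage analysis is shown below; the sentence-transformers encoder named here is an assumption for illustration, since the article does not specify which embedding model was used.

```python
# Embed sampled instructions and project them to 2D with t-SNE.
from sentence_transformers import SentenceTransformer
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_coverage(magpie_texts, baseline_texts, baseline_label="baseline"):
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder
    embeddings = encoder.encode(list(magpie_texts) + list(baseline_texts))
    xy = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    n = len(magpie_texts)
    plt.scatter(xy[:n, 0], xy[:n, 1], s=2, label="MAGPIE-Pro")
    plt.scatter(xy[n:, 0], xy[n:, 1], s=2, label=baseline_label)
    plt.legend()
    plt.show()

# Usage: plot_coverage(sampled_magpie_instructions, sampled_alpaca_instructions, "Alpaca")
```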


Instruction attributes

The paper uses the Llama-3-8B-Instruct model to evaluate various attributes of the MAGPIE instruction data, such as the instructions' task category, quality, difficulty, and similarity, as well as the quality of the responses.

The task categories of the generated instructions are dominated by information seeking, which accounts for more than half; they also include creative writing, asking for advice, planning, math, reasoning, brainstorming, and editing, broadly consistent with the mainstream needs of human users.


The quality and difficulty of instructions are also automatically evaluated using the Llama-3-8B-Instruct model.

It can be seen that in both datasets, most instances are judged to be above average, and the overall quality of MAGPIE-Pro is better than that of MAGPIE-Air.

The difficulty distributions of the two datasets are broadly similar, with more than 60% of instructions rated "easy", and the Pro dataset slightly more challenging than Air.


Instruction similarity offers another angle on diversity. The paper uses FAISS to find the nearest neighbor of each text embedding and measures similarity by the distance between the two.
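
Here is a minimal sketch of that nearest-neighbor check. Using an L2 flat index over the embeddings is an assumption for illustration; the article only says FAISS is used to find each embedding's nearest neighbor.

```python
# Nearest-neighbor distances with FAISS as a diversity proxy.
import faiss
import numpy as np

def nearest_neighbor_distances(embeddings: np.ndarray) -> np.ndarray:
    """For each embedding, return the distance to its closest other embedding."""
    emb = np.ascontiguousarray(embeddings.astype("float32"))
    index = faiss.IndexFlatL2(emb.shape[1])
    index.add(emb)
    # k=2 because the closest hit is always the vector itself (distance 0).
    distances, _ = index.search(emb, k=2)
    return distances[:, 1]

# Very small nearest-neighbor distances point to near-duplicate instructions;
# a distribution of larger distances indicates a more diverse dataset.
```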

In terms of response quality, FsfairX-LLaMA3-RM-v0.1 is used as the reward model, with URIAL as the baseline for comparison. A positive reward difference indicates higher quality, which benefits the instruction fine-tuning process.

As can be seen from Figure 5b, the MAGPIE distribution is shifted to the right and has a lower peak than the baseline, indicating better overall response quality.
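
To clarify what the reward difference measures, the short sketch below compares each instruction's MAGPIE response score against the corresponding baseline response score from the same reward model; how the reward model itself is loaded is left abstract, since that detail is model-specific and not spelled out here.

```python
# Reward difference: RM score of the MAGPIE response minus the RM score of
# the baseline response for the same instruction.
import numpy as np

def reward_differences(magpie_scores, baseline_scores):
    """Positive values mean the reward model preferred the MAGPIE response."""
    return np.asarray(magpie_scores) - np.asarray(baseline_scores)

# Toy numbers for illustration only.
diffs = reward_differences([1.8, 0.4, 2.1], [0.9, 0.7, 1.2])
print((diffs > 0).mean())  # fraction of instructions where MAGPIE wins
```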


Safety

In addition, on instruction safety, the authors used Llama-Guard-2 for automatic evaluation and found that the MAGPIE dataset is mostly safe, though it still contains under 1% harmful instructions or responses.


Results Evaluation

One of the biggest highlights of this research is its low running cost and a fully automated pipeline that requires no human intervention.

Creating the 3M-instance MAGPIE-Air dataset took 1.55 hours for instruction generation and 50 hours for response generation on 4 A100 GPUs; generating the 1M-instance MAGPIE-Pro dataset took 3.5 hours and 150 hours, respectively.

If run on a cloud server, the cost is also very modest: about $0.12 per 1k instances for the Air dataset and $1.10 per 1k instances for Pro, which works out to roughly $360 for the full 3M Air dataset and about $1,100 for the 1M Pro dataset.

To truly demonstrate the advantages of the MAGPIE method, the paper applies the generated datasets to fine-tuning base models and compares the results with the officially released fine-tuned versions.

The authors selected six state-of-the-art open-source instruction fine-tuning datasets as baselines, including ShareGPT and Evol Instruct. ShareGPT and WildChat are written by humans, while Evol Instruct and UltraChat are synthetic.

The fine-tuned base models include Llama-3 and Qwen-1.5, and performance is evaluated with two widely used benchmarks, AlpacaEval and Arena-Hard.

The detailed comparisons in the two tables show that, regardless of the base model, the dataset generated by MAGPIE yields higher quality: it outperforms all baseline datasets and beats the officially released fine-tuned models on most metrics.



As LLM scaling laws gradually hit the data wall, the method in this paper opens a new door for synthetic data. Perhaps, with carefully designed algorithms and techniques, LLM-synthesized data can gradually become a mainstay of public datasets.

References:

https://arxiv.org/abs/2406.08464