
A 10,000-word review of instruction tuning data for large models! Jointly produced by Tencent and Shanghai Jiao Tong University

2024-08-15


  • Contributed by Tencent Youtu Lab
    Quantum Bit | Official Account QbitAI

With the rapid development of large models, instruction tuning plays a vital role in improving model performance and generalization capabilities.

However, data evaluation and selection methods for instruction tuning datasets lack a unified framework, and comprehensive, in-depth reviews of the area are scarce.

To fill this gap, Tencent Youtu Lab and Shanghai Jiao Tong University released a comprehensive review.

It runs to more than 10,000 words and covers more than 400 papers.



The survey covers data evaluation and selection methods along three main dimensions: quality, diversity, and importance, with each dimension categorized and discussed in detail.

The authors also track the latest developments and trends in this field, including emerging techniques and methods such as data scoring with powerful language models like GPT and coreset sampling based on bi-level optimization.

Comprehensive evaluation of instruction tuning datasets

The goal of developing LLMs is to unlock generalization across natural language processing (NLP) tasks. Instruction tuning plays an important role in this, and data quality is crucial to how well instruction tuning works.

The authors conducted an in-depth study of data evaluation and selection methods across a variety of instruction tuning datasets, classifying and discussing them along three dimensions: quality, diversity, and importance.



★Quality assessment and selection

"Quality" mainly refers to the completeness, accuracy and rationality of instruction response data points. Existing methods usually develop a unified scoring mechanism to comprehensively consider these dimensions.

Regarding dataset quality, the authors summarize four main evaluation approaches:

  • Hand-crafted metrics: evaluate data quality through vocabulary, syntax, semantic similarity, and so on. The advantage is that the metrics are explicit and easy to compute, but they cannot detect mismatched instruction-response pairs.
  • Model-based metrics: use trainable models and model-derived signals (such as perplexity and multi-dimensional scoring evaluators), often combined in hybrid schemes with training-aware indicators (such as uncertainty and reward scores). This approach shows promise for selecting unbiased, high-quality samples; a minimal perplexity-based sketch follows this list.
  • GPT scoring: hand the data directly to GPT by calling the OpenAI APIs to score the instruction tuning dataset automatically. This approach aligns closely with human preferences; after collecting a small number of GPT-scored samples, an open-source LLM can be fine-tuned as a quality scorer to improve cost efficiency.
  • Human evaluation: indispensable when building preference-alignment datasets, and useful for supplying high-quality data for model training. However, annotations can be inconsistent, so detailed guidelines are needed, supplemented by other measures such as GPT scores.
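
As a concrete illustration of the model-based route, here is a minimal sketch (not the survey's implementation) that ranks instruction-response pairs by the perplexity a small causal LM assigns to the concatenated pair; lower perplexity is read as a rough quality proxy. The choice of gpt2 as the scoring model and the prompt format are assumptions made purely for illustration.

```python
# Minimal sketch: rank instruction-response pairs by the perplexity a small
# causal LM assigns to the concatenated pair (gpt2 chosen only for illustration).
# Lower perplexity is treated as a rough proxy for higher quality.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def pair_perplexity(instruction: str, response: str) -> float:
    # Simple prompt format (an assumption, not prescribed by the survey).
    text = f"Instruction: {instruction}\nResponse: {response}"
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # With labels == input_ids the model returns the mean token-level
        # cross-entropy; exp(loss) is the perplexity of the whole sequence.
        out = model(**enc, labels=enc["input_ids"])
    return float(torch.exp(out.loss))

pairs = [
    ("Explain overfitting in one sentence.",
     "Overfitting is when a model memorizes training data and fails to generalize."),
    ("Explain overfitting in one sentence.",
     "Banana purple seventeen the the the."),
]
for inst, resp in sorted(pairs, key=lambda p: pair_perplexity(*p)):
    print(f"{pair_perplexity(inst, resp):8.2f}  {resp}")
```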

★Diversity assessment and selection

The diversity here refers to the individual diversity (such as vocabulary and semantic richness) and overall diversity (such as data distribution) of the instruction dataset. Selecting a diverse dataset can enhance the generalization ability of the model.

The authors also summarize four ways to measure the diversity of datasets:

  • Hand-designed metrics: These include lexical diversity (such as Type-token ratio, vocd-D, MTLD, HD-D, etc.) and semantic diversity (such as calculating distances through k-NN graphs, calculating variance using BERT embeddings, etc.).
  • Model-based metrics: Evaluate diversity through entropy-related methods (such as vanilla entropy, Rényi entropy, Simpson's Index, Vendi Score, etc.), Task2Vec embeddings, open-label diversity markers, etc.
  • Coreset sampling based on geometric features: select the most informative and diverse subsets through methods such as k-center greedy and herding so that the subset represents the entire dataset and training on it performs close to training on the full dataset; clustering techniques help reveal the underlying data structure (a minimal k-center greedy sketch follows this list).
  • Coreset sampling based on bi-level optimization: treat coreset selection as a bi-level optimization problem, choosing the subset by optimizing hard masks or soft weights, with an inner loop over model parameters and an outer loop over data selection. Some methods improve robustness and efficiency by introducing validation sets, gradient matching, and other optimization techniques.
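
The geometric route above can be sketched in a few lines. The snippet below is a minimal, illustrative k-center greedy selection over embeddings (random vectors stand in for real instruction embeddings, e.g. from BERT): at each step it adds the point farthest from the current selection, which spreads the chosen subset across the embedding space. It is a sketch under these assumptions, not the exact procedure of any particular paper.

```python
# Minimal sketch of k-center greedy coreset selection over embeddings.
# Random vectors stand in for real instruction embeddings (e.g. BERT/SBERT).
import numpy as np

def k_center_greedy(embeddings: np.ndarray, budget: int, seed: int = 0) -> list:
    """Greedily pick `budget` indices so every point is near some selected center."""
    rng = np.random.default_rng(seed)
    n = embeddings.shape[0]
    selected = [int(rng.integers(n))]              # arbitrary starting center
    # Distance from every point to its nearest selected center so far.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < budget:
        nxt = int(np.argmax(dists))                # farthest point from current centers
        selected.append(nxt)
        new_d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        dists = np.minimum(dists, new_d)           # refresh nearest-center distances
    return selected

emb = np.random.default_rng(1).normal(size=(1000, 768))   # fake embeddings
subset = k_center_greedy(emb, budget=50)
print(len(subset), subset[:5])
```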

★Importance assessment and selection

Importance refers to how necessary a sample is for model training; it depends on both the target task and the model's current performance. Easy samples may not require additional tuning, while difficult samples are crucial for model training.

There are several indicators and methods for evaluating importance:

  • Hand-crafted metrics: evaluate text difficulty through readability-style metrics (such as grammar, vocabulary, and reasoning dependencies), and use challenging samples to test model robustness and to build discriminative NLP benchmarks.
  • Model-based metrics: include uncertainty (such as prompt uncertainty), reward scores (using reward models to judge how necessary a sample is for shaping model behavior), and datamodels (models that predict the impact of data points on model behavior; for example, DSIR estimates importance scores from distribution similarity, MATES continually selects the most effective subsets, and Xie et al. select samples matching the target distribution via importance resampling).
  • Coreset sampling based on loss and error: estimate importance by recording sample errors during training (such as forgetting scores, memorization, and influence) and select samples that contribute heavily to the loss or cause poor performance. Some studies accelerate the computation of these marginal effects with iterative approximations and small proxy models.
  • Gradient-based coreset sampling: exploit the fact that gradients directly drive language model optimization. Data are selected via gradient matching (approximating the gradient of the full dataset) and gradient-based influence (measuring a sample's effect on model parameters through up-weighted gradient inner products), with techniques such as low-rank gradient similarity search and moving-sample approximation used to accelerate computation while balancing the accuracy and efficiency of the approximation (a minimal gradient-influence sketch follows this list).
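
To make the gradient-based idea concrete, here is a minimal sketch that scores each candidate example by the inner product between its loss gradient and the gradient of a validation batch on a toy linear model: a larger positive score means a small step on that example would also reduce the validation loss. The toy model and synthetic data are assumptions for illustration; real pipelines rely on low-rank projections and other approximations to make this tractable at LLM scale.

```python
# Minimal sketch of gradient-based influence scoring on a toy model:
# score(x) = <grad of loss on x, grad of loss on a validation batch>.
# A positive score suggests x pushes parameters in a validation-helpful direction.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 1)                      # toy stand-in for a language model
loss_fn = nn.MSELoss()

def flat_grad(loss: torch.Tensor) -> torch.Tensor:
    # Gradient of `loss` w.r.t. all model parameters, flattened into one vector.
    grads = torch.autograd.grad(loss, list(model.parameters()))
    return torch.cat([g.flatten() for g in grads])

# Synthetic candidate pool and validation batch (purely for illustration).
X_pool, y_pool = torch.randn(100, 8), torch.randn(100, 1)
X_val, y_val = torch.randn(20, 8), torch.randn(20, 1)

val_grad = flat_grad(loss_fn(model(X_val), y_val))

scores = [
    torch.dot(flat_grad(loss_fn(model(X_pool[i:i+1]), y_pool[i:i+1])), val_grad).item()
    for i in range(X_pool.shape[0])
]

# Keep the examples whose gradients align best with the validation gradient.
top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:10]
print(top)
```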



Current challenges and future directions

The authors find a gap between the effectiveness of data selection and the performance models report on benchmarks, caused by factors such as the weak correlation between evaluation loss and benchmark performance, and test-set contamination.

In the future, there is a need to build specialized benchmarks to evaluate instruction tuning models and selected data points, and to decouple data selection and model evaluation to exclude the impact of data contamination.

There is currently no unified standard to distinguish between "good" and "bad" instructions. Existing quality measurement methods are task-oriented and lack interpretability. In the future, more unified and universal definitions and improved interpretability of selection pipelines are needed to meet the needs of different downstream tasks.

As datasets grow, determining the optimal selection ratio becomes difficult because of increased noise, overfitting, and forgetting. The authors recommend setting this ratio through a quality measurement scheme that emphasizes diversity and accounts for similarity to the pre-training data, and optimizing the data evaluation and selection pipeline for scalability.

Beyond datasets, large models themselves keep growing, so the cost-efficiency of data evaluation and selection keeps falling. This calls for efficient proxy models, together with a rethink of traditional machine learning techniques such as optimization methods and dimensionality reduction.

Project homepage:
https://github.com/yuleiqin/fantastic-data-engineering
Paper address:
https://arxiv.org/abs/2408.02085