2024-08-15
- Contributed by Tencent Youtu Lab
Quantum Bit | WeChat official account QbitAI
With the rapid development of large models, instruction tuning plays a vital role in improving model performance and generalization capabilities.
However, data assessment and selection methods for instruction tuning datasets lack a unified framework, and comprehensive, in-depth surveys of the area have been missing.
To fill this gap, Tencent Youtu Lab has released a complete survey.
It runs to more than 10,000 words and covers over 400 references.
The survey covers data assessment and selection methods along three main dimensions: quality, diversity, and importance. Methods within each dimension are classified and elaborated in detail.
The authors also track the latest developments and trends in the field, including emerging techniques such as data scoring with powerful language models like GPT and coreset sampling based on bilevel optimization.
LLMs are developed to unlock generalization across natural language processing (NLP) tasks. Instruction tuning plays an important role in this, and data quality is crucial to how well instruction tuning works.
The authors conducted an in-depth study on data evaluation and selection methods for various instruction tuning datasets, and classified and elaborated them from three aspects: quality, diversity, and importance.
★Quality assessment and selection
"Quality" mainly refers to the completeness, accuracy and rationality of instruction response data points. Existing methods usually develop a unified scoring mechanism to comprehensively consider these dimensions.
For dataset quality, the authors summarize four main families of assessment methods; one scoring approach is sketched below.
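As one concrete illustration of the GPT-based scoring mentioned above, the following minimal sketch prompts a judge LLM to rate an instruction-response pair on completeness, accuracy, and reasonableness. The OpenAI client, the model name gpt-4o-mini, the prompt wording, and the 1-5 scale are assumptions for illustration, not the survey's specification.

```python
# Minimal sketch: LLM-as-judge quality scoring for one instruction-response pair.
# Assumptions (not from the survey): OpenAI chat API, model "gpt-4o-mini",
# a 1-5 rating scale, and this particular prompt wording.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = """Rate the following instruction-response pair on a 1-5 scale for each of:
completeness, accuracy, reasonableness. Reply with JSON only, e.g.
{{"completeness": 4, "accuracy": 5, "reasonableness": 4}}.

Instruction: {instruction}
Response: {response}"""

def score_pair(instruction: str, response: str) -> dict:
    """Ask the judge model for per-dimension quality scores."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": PROMPT.format(instruction=instruction, response=response)}],
        temperature=0,
    )
    # A production pipeline would validate the JSON; this sketch assumes a clean reply.
    return json.loads(reply.choices[0].message.content)

# Example: keep only pairs whose average score clears a threshold.
# scores = score_pair("Explain overfitting.", "Overfitting is ...")
# keep = sum(scores.values()) / len(scores) >= 4.0
```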
★Diversity assessment and selection
Diversity here covers both the individual diversity (e.g., lexical and semantic richness) and the overall diversity (e.g., the data distribution) of the instruction dataset. Selecting a diverse dataset enhances the model's generalization ability.
The authors likewise summarize four ways to assess dataset diversity; one embedding-based approach is sketched below.
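As a sketch of one overall-diversity strategy, the snippet below performs k-center greedy selection in an embedding space using only NumPy. It assumes the instructions have already been embedded by some sentence encoder; the function name and interface are illustrative rather than taken from the survey.

```python
# Minimal sketch: k-center greedy selection for overall diversity.
# Assumes each instruction has already been embedded; the interface is illustrative.
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int) -> list:
    """Pick k indices that spread out over the embedding space.

    Greedily adds the point farthest from the current selection, a classic
    coreset heuristic that favors coverage (diversity) over density.
    """
    n = embeddings.shape[0]
    selected = [int(np.random.randint(n))]  # arbitrary seed point
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    while len(selected) < min(k, n):
        nxt = int(np.argmax(dists))  # farthest from the current selection
        selected.append(nxt)
        # keep, for each point, its distance to the nearest selected point
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Example with random vectors standing in for real instruction embeddings:
embs = np.random.randn(1000, 384).astype(np.float32)
subset = k_center_greedy(embs, k=100)
```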
★Importance assessment and selection
Importance refers to how necessary a sample is for model training; it depends on the target task as well as on the model's performance. Easy samples may not require additional tuning, while difficult samples are crucial for training.
The survey covers several indicators and methods for evaluating importance; a loss-based example is sketched below.
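One widely used family of importance signals compares how hard the target response is for a model with and without the instruction as context. The sketch below uses a small causal LM from Hugging Face transformers (gpt2 as a stand-in proxy) and a simple loss ratio; the model choice and the exact formula are assumptions for illustration, not the survey's definition.

```python
# Minimal sketch: loss-based importance scoring with a small proxy causal LM.
# The model name and the loss-ratio formulation are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in proxy model; any small causal LM would do
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL).eval()

@torch.no_grad()
def response_loss(context: str, response: str) -> float:
    """Average token loss of `response`, optionally conditioned on `context`."""
    ctx_ids = tok(context, return_tensors="pt").input_ids if context else None
    resp_ids = tok(response, return_tensors="pt").input_ids
    ids = resp_ids if ctx_ids is None else torch.cat([ctx_ids, resp_ids], dim=1)
    labels = ids.clone()
    if ctx_ids is not None:
        labels[:, : ctx_ids.shape[1]] = -100  # score only the response tokens
    return lm(ids, labels=labels).loss.item()

def importance(instruction: str, response: str) -> float:
    """Higher ratio: the instruction helps less, i.e., a harder, potentially
    more informative sample for tuning."""
    return response_loss(instruction, response) / response_loss("", response)

# Example: rank samples and keep the hardest ones for tuning.
# scores = [importance(x["instruction"], x["response"]) for x in dataset]
```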
Current challenges and future directions
The authors find a gap between the measured effectiveness of data selection and the performance that models report on benchmarks, caused in part by the weak correlation between evaluation loss and benchmark performance and by test-set contamination.
In the future, dedicated benchmarks are needed to evaluate instruction-tuned models and the selected data points, and data selection should be decoupled from model evaluation to exclude the impact of data contamination.
There is currently no unified standard for distinguishing "good" instructions from "bad" ones. Existing quality measures are task-oriented and lack interpretability; more unified and general definitions, as well as more interpretable selection pipelines, are needed to serve different downstream tasks.
As datasets grow, determining the optimal selection ratio becomes difficult because of increased noise, overfitting, and forgetting. The authors recommend setting the selection ratio through a quality measurement scheme, emphasizing diversity, accounting for similarity to the pre-training data, and making the data assessment and selection pipeline scalable.
Beyond the datasets, large models themselves keep growing, which makes data assessment and selection less cost-efficient; this calls for efficient proxy models and a rethinking of classic machine learning techniques such as optimization and dimensionality reduction methods.
Project homepage:
https://github.com/yuleiqin/fantastic-data-engineering
Paper address:
https://arxiv.org/abs/2408.02085