
Nature cover: AI trains AI, and the more it trains, the dumber it gets

2024-07-27


Baijiao from Aofei Temple
Quantum Bit | Public Account QbitAI

AI training AI could make AI dumber?!

Researchers from Oxford, Cambridge, and other institutions have recently found that large models may collapse when trained on synthetic data. Their work was selected as the latest Nature cover.

Put bluntly: GARBAGE OUT!



Keep in mind that most tech companies' large models now rely on synthetic data to ease the "data shortage". This finding pours cold water on the entire industry.

The research team gave such an example.

They tested Meta's OPT-125m model, asking it for information about medieval architecture.



Each round of fine-tuning was trained on data generated by the previous round. The outputs looked fine for the first few rounds, but by the ninth round the model had started spouting nonsense...

What the hell is this about rabbits?!

The paper's lead author said the team had anticipated that synthetic data could introduce errors into large models, but had not expected them to deteriorate so quickly.

Three sources of error cause models to collapse

First, the team defined what model collapse is.

Model collapse is a degenerative process in which the content a model generates pollutes the training set of the next generation; models trained on that polluted data then easily misperceive reality.

This cycle repeats itself, with each generation getting worse than the previous one.
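To make the cycle concrete, here is a toy numerical sketch (not the paper's code) that treats a one-dimensional Gaussian fit as the "model": each generation estimates a mean and standard deviation from the previous generation's samples, then produces new samples from that fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
n_samples = 100
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for gen in range(1, 2001):
    # "Train" generation g: here the model is just a Gaussian fit (mean, std).
    mu, sigma = data.mean(), data.std()
    # Its synthetic samples become the training data for generation g + 1.
    data = rng.normal(loc=mu, scale=sigma, size=n_samples)
    if gen % 400 == 0:
        print(f"generation {gen:4d}: mu = {mu:+.3f}, sigma = {sigma:.3e}")
```

Run for a couple of thousand generations, sigma drifts toward zero while the mean wanders: the fitted distribution ends up bearing almost no resemblance to the original standard normal.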



Over time, collapse unfolds in two stages: early model collapse and late model collapse.

In early model collapse, the model begins to lose tail information (low-probability events in the distribution); in late model collapse, the model converges to a distribution that bears almost no resemblance to the original one.

How this process plays out depends on the model design, the learning procedure, and the quality of the data used.

In theory, three sources of error are mainly responsible for the trained model's deviation from the original one.

  • Statistical approximation error. This is the primary type of error; it arises because the number of samples is finite and it vanishes as the number of samples tends to infinity, since at every resampling step there is a non-zero probability of losing information (see the sketch after this list).
  • Function expressiveness error. This error arises from the limited expressive power of the function approximator; in particular, neural networks are universal approximators only in the limit of infinite size. In the absence of the other two errors, it occurs only in the first generation.
  • Function approximation error. This error mainly stems from limitations of the learning procedure itself, such as the structural biases of stochastic gradient descent or the choice of objective. It can be viewed as the error that would remain even with infinite data and perfect expressiveness at every generation.
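To see the statistical approximation error in isolation, here is a small, self-contained simulation (an illustration with assumed numbers, not taken from the paper): a long-tailed discrete distribution is repeatedly estimated from a finite sample and then resampled, and the rare events gradually disappear.

```python
import numpy as np

rng = np.random.default_rng(1)

# Long-tailed "ground truth": 10 common events and 90 rare ones.
k = 100
probs = np.ones(k)
probs[:10] = 50.0
probs /= probs.sum()

n_samples = 1_000
samples = rng.choice(k, size=n_samples, p=probs)

for gen in range(1, 11):
    # Estimate the distribution from the finite sample (the "model"),
    # then resample from that estimate to build the next generation's data.
    est = np.bincount(samples, minlength=k) / n_samples
    samples = rng.choice(k, size=n_samples, p=est)
    print(f"generation {gen}: {np.count_nonzero(est)} of {k} events still have mass")
```

Each pass drops a few more tail events that happened not to be drawn, and once an event's estimated probability hits zero it can never come back.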
Impact on language models

The researchers then evaluated the impact of model collapse on language models. Since training a large model from scratch is very expensive, they chose the most common setting for language models: fine-tuning.

Each training cycle starts from a pre-trained model that is fine-tuned on the latest data, and that data is generated by the fine-tuned model of the previous cycle.

They used Meta's causal language model OPT-125m, fine-tuned on wikitext2.

To generate data from the trained model, the team used five-way beam search. Training sequences were 64 tokens long; for each token sequence in the training set, the model was asked to predict the next 64 tokens.

They went through the entire original training set and generated an artificial dataset of the same size. If the model had zero error, it would simply reproduce the original wikitext2 dataset.
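As a rough idea of what that generation step could look like with the Hugging Face transformers library (the beam width and 64-token block length follow the description above; the prompt and all other details are illustrative assumptions, not the authors' code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()

def continue_block(prompt_ids: torch.Tensor) -> torch.Tensor:
    """Given a block of training tokens, predict the next 64 with 5-way beam search."""
    with torch.no_grad():
        out = model.generate(
            prompt_ids,
            num_beams=5,
            do_sample=False,
            max_new_tokens=64,
        )
    return out[:, prompt_ids.shape[1]:]  # keep only the newly generated tokens

# Example: take (up to) 64 tokens of one training sequence and generate its continuation.
text = "The pointed arches and ribbed vaults of the cathedral ..."
prompt_ids = tokenizer(text, return_tensors="pt").input_ids[:, :64]
synthetic_ids = continue_block(prompt_ids)
print(tokenizer.decode(synthetic_ids[0], skip_special_tokens=True))
```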

To make the difference clearer, they used two settings: in one, no original training data was used after the initial training; in the other, 10% of the original data was retained. A sketch of how such a mix could be assembled follows below.
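A minimal sketch of assembling the next generation's training set under these two settings (a hypothetical helper, not the authors' pipeline):

```python
import random

def build_next_training_set(original, synthetic, keep_original=0.10, seed=0):
    """Mix a fraction of original samples with synthetic ones, keeping the size fixed.

    keep_original=0.0 corresponds to the first setting (synthetic data only after
    the initial training); keep_original=0.10 corresponds to the second.
    """
    rng = random.Random(seed)
    n = len(original)
    n_orig = int(keep_original * n)
    mixed = rng.sample(list(original), n_orig) + rng.sample(list(synthetic), n - n_orig)
    rng.shuffle(mixed)
    return mixed
```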



The results show that the errors produced by the model accumulate over time. Before breaking down entirely, the model forgets the low-probability events in the dataset and its outputs become increasingly homogeneous; eventually, it collapses.

In addition, similar collapse was observed in VAEs (variational autoencoders) and GMMs (Gaussian mixture models).





Emily Wenger, a professor at Duke University, said that, so far, there is no easy way to alleviate the problem.

Leading technology companies have deployed one mitigation: embedding "watermarks" that flag AI-generated content so it can be filtered out of training data.

The difficulty is that this requires coordination among technology companies, and may therefore not be commercially viable.

Seen this way, companies that collected their training data from the internet before it was flooded with AI-generated content have models that better represent the real world, so the first wave of large models enjoys a first-mover advantage.

What do you think about this view?

Reference Links:
[1] https://www.nature.com/articles/d41586-024-02420-7
[2] https://www.nature.com/articles/d41586-024-02355-z
[3] https://www.nature.com/articles/s41586-024-07566-y