
Oxford and Cambridge "poison" AI into collapse after nine rounds and land the cover of Nature, sparking heated debate in academia: can AI trained on AI break the curse of model collapse?

2024-07-27



New Intelligence Report

Editors: Aeneas, 好困

[New Intelligence Introduction] The Oxford and Cambridge paper, in which nine rounds of recursive training "poison" a model into collapse, has drawn plenty of criticism: does this really merit publication in Nature? The ensuing discussion in academic circles has largely converged on one point: many people treat synthetic data as a panacea, but there is no free lunch.

In the AI era, data is the new oil. As the world's supply of human-generated data is gradually exhausted, is synthetic data our future?

The recent controversy over a Nature cover paper has made one thing clear: what matters is not "synthetic data" itself, but "using synthetic data correctly."

This Thursday, a paper from institutions including Oxford, Cambridge, Imperial College London, and the University of Toronto appeared on the cover of Nature.


Unexpectedly, however, the paper set off a wave of discussion in the AI community as soon as it was published.



Some people believe that the core of the problem is not "synthetic data" but "data quality".

Even if all the data is manual, if the quality is too poor, the result will still be "garbage in, garbage out".



Some even felt that the researchers deliberately adopted methods that do not match real-world practice, and were essentially "sensationalizing."


On this point, Professor Ma Yi remarked that we have entered an era lacking in scientific ideas and methods.

Many studies are nothing more than rediscovering some scientific common sense.


How to avoid model collapse?

So the question is, how can we avoid model collapse when using AI to synthesize data?

Hybrid data is the future

Alexandr Wang, CEO of Scale AI, strongly agrees with this article on the cover of Nature.

He said that using purely synthetic data to train models will not bring information gain.

Usually, when evaluation metrics go up thanks to self-distillation, it is probably because of subtler trade-offs:

  • Synthetic data can improve your evaluation results in the short term, but you will pay the price later with model collapse

  • You accumulate hidden debt while training or fine-tuning your model, and this debt will be difficult to repay


Specifically, in consecutive generations of synthetic training, errors mainly come from three aspects:

  • Statistical approximation error

  • Functional expressivity error

  • Functional approximation error

That is, every time you train a new model using data generated by the previous model, you lose some information and precision, causing the model to become increasingly hollow and eventually stop working properly.
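
To make this concrete, here is a minimal toy sketch (my own illustration, not code from any of the papers discussed): the "model" is just a Gaussian fitted to a finite sample, and each new generation is trained only on samples drawn from the previous generation's fit.

```python
# Toy sketch of compounding approximation error (not from the paper).
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100        # small finite sample: noticeable statistical approximation error
n_generations = 500

# Generation 0: real data drawn from the true distribution N(0, 1).
data = rng.normal(loc=0.0, scale=1.0, size=n_samples)

for gen in range(1, n_generations + 1):
    mu, sigma = data.mean(), data.std()        # "train" the model on the current data
    data = rng.normal(mu, sigma, n_samples)    # the next generation sees only synthetic samples
    if gen % 100 == 0:
        print(f"generation {gen:3d}: fitted mean = {mu:+.3f}, fitted std = {sigma:.3f}")

# Each refit on a finite synthetic sample loses a little information about the
# original distribution; over many generations the fitted std drifts toward 0
# and the tails of N(0, 1) disappear: model collapse in miniature.
```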


While these experiments were performed on small-scale models (100M parameters), the fundamental effects observed also emerge over time on larger models.

For example, most models today cannot generate blog posts in the style of Slate Star Codex, again due to model collapse: as we continuously train models on model-generated data, they gradually lose the ability to predict over a broad distribution.


In Wang’s view, hybrid data is the future development direction, which can avoid all the thorny problems related to model collapse.

That is, synthetic data must be generated from some new source of information:

(1) using real-world data as seeds

(2) human experts in the loop

(3) a formal logic engine

In contrast, developers who inadvertently train their models on synthetic data with no information gain will eventually find that their models grow stranger and dumber over time.
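
As a rough illustration of points (1) and (3), here is a minimal sketch, entirely my own and not Scale AI's pipeline: each synthetic example is derived from a real "seed" record, and its answer is recomputed exactly by a rule engine, so the new data carries information that did not come from a model's own guesses.

```python
# Minimal hybrid-data sketch (my own toy, not Scale AI's pipeline): synthetic
# examples are anchored to real seed records and their answers are recomputed
# exactly, so they add information instead of recycling model guesses.
import random

random.seed(0)

# Pretend these records came from a real corpus or from human experts.
real_seeds = [
    {"question": "2 + 3", "answer": 5},
    {"question": "7 - 4", "answer": 3},
    {"question": "6 * 2", "answer": 12},
]

def synthesize_from_seed(seed: dict) -> dict:
    """Create a variant of a real seed; the answer is recomputed by a rule
    engine (a stand-in for a formal logic engine), not predicted by a model."""
    a, op, b = seed["question"].split()
    a, b = int(a) + random.randint(1, 9), int(b) + random.randint(1, 9)
    answer = {"+": a + b, "-": a - b, "*": a * b}[op]
    return {"question": f"{a} {op} {b}", "answer": answer}

# Hybrid training set: real data is kept alongside the grounded synthetic variants.
hybrid_dataset = real_seeds + [synthesize_from_seed(random.choice(real_seeds))
                               for _ in range(10)]
for example in hybrid_dataset[:5]:
    print(example)
```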

Reinforcement is all you need

Researchers from Meta, New York University, and Peking University proposed a "ranking and pruning" feedback method, using humans or even weaker models to restore, and in some cases exceed, the model's original performance.

LeCun also reposted this research and expressed his support.


It is well known that, for both humans and machines, telling a good example from a bad one is much easier than generating a high-quality sample from scratch.

Based on this, the authors proposed a new method: preventing model collapse through feedback on synthetic data.


Paper address: https://arxiv.org/abs/2406.07515

To examine this question, the authors first provide analytical results in a theoretical setting.

Here, the authors study Gaussian mixture models and linear models in the high-dimensional limit as classifiers, and let a verifier (e.g., a human or an oracle) select or prune the generated data.

The results show that as the number of synthetic data points tends to infinity, models trained on the selected data can match the best results achievable by training on the original data.

Simulations on synthetic data show that oracle supervision consistently produces near-optimal results compared to using the original annotations.

In addition, since having humans verify and select high-quality data is easier and cheaper than having them annotate data directly, this provides strong evidence for the effectiveness of human supervision.
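
A minimal numpy simulation in the spirit of this setting (my own simplification, with a least-squares linear classifier standing in for the paper's models): a weak generator estimates the class means from a tiny real sample, produces abundant synthetic points, and an oracle verifier prunes the points whose labels disagree with the true decision rule.

```python
# Toy version of the theoretical setting: two-class Gaussian mixture, a weak
# generator, and an oracle verifier that prunes badly labeled synthetic points.
import numpy as np

rng = np.random.default_rng(0)
d, n_real, n_synth, n_test = 10, 20, 5000, 20000
mu = np.full(d, 1.0 / np.sqrt(d))            # true class means are +mu and -mu

def sample_real(n):
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + rng.normal(size=(n, d))
    return x, y

def fit_linear(x, y):
    # least-squares linear classifier (stand-in for the paper's linear models)
    w, *_ = np.linalg.lstsq(x, y, rcond=None)
    return w

def acc(w, x, y):
    return float(np.mean(np.sign(x @ w) == y))

x_real, y_real = sample_real(n_real)
x_test, y_test = sample_real(n_test)

# "Generator": class means estimated from the small real sample only.
mu_hat_pos = x_real[y_real > 0].mean(axis=0)
mu_hat_neg = x_real[y_real < 0].mean(axis=0)
y_syn = rng.choice([-1.0, 1.0], size=n_synth)
x_syn = np.where(y_syn[:, None] > 0, mu_hat_pos, mu_hat_neg) + rng.normal(size=(n_synth, d))

# Oracle verifier: prune synthetic points whose label disagrees with the
# true (Bayes-optimal) rule sign(x . mu).
keep = np.sign(x_syn @ mu) == y_syn

print("Bayes-optimal       :", acc(mu, x_test, y_test))
print("real only (n=20)    :", acc(fit_linear(x_real, y_real), x_test, y_test))
print("raw synthetic       :", acc(fit_linear(x_syn, y_syn), x_test, y_test))
print("oracle-pruned synth :", acc(fit_linear(x_syn[keep], y_syn[keep]), x_test, y_test))
# In typical runs the pruned synthetic set recovers most of the gap to the
# Bayes accuracy; exact numbers vary with the random seed.
```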


A Gaussian mixture model with a linear generator and a linear pruner: the pruner improves performance by selecting enhanced synthetic data

Next, the authors conducted two large-scale experiments:

1. Train a Transformer on an arithmetic task (matrix eigenvalue prediction) and use the distance from the true value to prune a large amount of synthetic data

2. News summarization with a large language model (Llama 2) and a limited amount of synthetic data

The results show that in both cases, relying solely on generated data leads to degraded performance, with model collapse occurring even when the amount of data increases.

Furthermore, selecting the best output from the generated pool based solely on perplexity does not improve performance; that is, the model itself cannot reliably pick its best prediction by perplexity.

In contrast, with oracle supervision, a synthetic dataset based on feedback enhancement can be obtained, whose performance exceeds the original dataset as the amount of data increases.
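
The verifier step of the first experiment can be sketched as follows. The "generator" here is only a stand-in (true eigenvalues plus noise, not the paper's Transformer), and the tolerance is an arbitrary choice of mine; the point is that the oracle can cheaply build a feedback-enhanced synthetic dataset.

```python
# Sketch of the verifier in the eigenvalue task: keep only synthetic
# (matrix, prediction) pairs whose prediction is close to the ground truth.
# The noisy "generator" below is my stand-in, not the paper's Transformer.
import numpy as np

rng = np.random.default_rng(0)
dim, n_candidates, tol = 5, 1000, 0.05       # tol is an arbitrary pruning threshold

# Random symmetric matrices and their true eigenvalues (the oracle's answer key).
a = rng.normal(size=(n_candidates, dim, dim))
matrices = (a + a.transpose(0, 2, 1)) / 2
true_eigs = np.linalg.eigvalsh(matrices)

# Stand-in generator: predictions of varying quality around the truth.
noise_scale = rng.uniform(0.0, 0.5, size=(n_candidates, 1))
pred_eigs = true_eigs + noise_scale * rng.normal(size=true_eigs.shape)

# Verifier: prune by distance from the true values (mean absolute error).
errors = np.abs(pred_eigs - true_eigs).mean(axis=1)
keep = errors < tol

feedback_dataset = [(m, p) for m, p, k in zip(matrices, pred_eigs, keep) if k]
print(f"kept {keep.sum()} / {n_candidates} synthetic examples; "
      f"mean error kept = {errors[keep].mean():.4f}, "
      f"discarded = {errors[~keep].mean():.4f}")
```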


Human and model reinforcement can improve performance and prevent model collapse; without reinforcement, performance degrades

Therefore, when training a new model with synthetic data, we should not only focus on the quality of the generator, but also need a high-quality verifier to select the data.

To sum it up in one sentence: reinforcement is all you need!

Real data + synthetic data

Rylan Schaeffer, a doctoral student at Stanford University, expressed his understanding of readers' complaints about the Nature cover paper.

He noted that model collapse typically occurs when researchers deliberately adopt methods that do not match actual practice.

Data accumulation may or may not collapse, depending entirely on the specific operational details.

If you deliberately make it collapse, of course it will collapse.


In a paper co-authored by researchers at Stanford, Maryland, and MIT, Schaeffer examined how accumulating data affects model collapse.

After experiments, they confirmed that replacing the original real data with each generation of synthetic data would indeed cause the model to collapse.

However, if successive generations of synthetic data are accumulated together with the original real data, model collapse can be avoided.


Paper address: https://arxiv.org/abs/2404.01413

In practice, successive LLMs are trained on ever larger datasets: Llama 1 was trained on 1.4 trillion tokens, Llama 2 on 2 trillion, and Llama 3 on 15 trillion.

In a sense, this data accumulation setting is extremely pessimistic.

In this hypothetical future, synthetic data is dumped uncontrollably onto the internet to be used to train the next iteration of the model.


As shown on the right side of the figure, accumulating data can avoid model collapse

The researchers used three different experimental setups: a causal Transformer, a diffusion model, and a variational autoencoder, trained on real text, molecular conformation, and image datasets respectively.

They found that replacing data leads to model collapse for all models and all datasets, while accumulating data avoids model collapse.

Causal Language Modeling with Transformers

First, they trained a causal Transformer on text data.

Specifically, a 9M-parameter GPT-2 and 12M-, 42M-, and 125M-parameter Llama 2 language models were pre-trained for a single epoch on TinyStories.

The latter is a 470M-token dataset of kindergarten-reading-level short stories generated by GPT-3.5/4.

For each model-fitting iteration n ≥ 2, the researchers sampled a new dataset of the same size as TinyStories from the previous iteration's language model, and then either replaced the previous dataset with the newly generated one or concatenated the two.

In each model-fitting iteration, they pre-train a freshly initialized model on the replaced or concatenated dataset from the previous iteration.
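
The replace-versus-concatenate protocol is easy to reproduce at toy scale. The sketch below substitutes a unigram "language model" and a random token distribution for GPT-2/Llama and TinyStories (both are my stand-ins), but follows the same loop: sample a synthetic dataset from the previous model, replace or concatenate, then retrain from scratch.

```python
# Toy replace-vs-accumulate loop: a unigram "language model" and a random
# token distribution stand in for GPT-2/Llama and TinyStories.
import numpy as np

rng = np.random.default_rng(0)
vocab, n_tokens, n_iters = 50, 10_000, 10

true_probs = rng.dirichlet(np.ones(vocab))            # the "real" token distribution
real_data = rng.choice(vocab, size=n_tokens, p=true_probs)

def fit(tokens):
    """'Pre-train' a unigram model: smoothed token frequencies."""
    counts = np.bincount(tokens, minlength=vocab) + 1e-3
    return counts / counts.sum()

def cross_entropy(model_probs):
    # test cross-entropy of the model against the true distribution
    return float(-(true_probs * np.log(model_probs)).sum())

for mode in ("replace", "accumulate"):
    data = real_data.copy()
    model = fit(data)
    for _ in range(n_iters):
        synthetic = rng.choice(vocab, size=n_tokens, p=model)   # sample from the previous model
        data = synthetic if mode == "replace" else np.concatenate([data, synthetic])
        model = fit(data)                                       # retrain from scratch
    print(f"{mode:10s}: cross-entropy after {n_iters} iterations = {cross_entropy(model):.4f}")

# In typical runs, "replace" ends with a visibly higher cross-entropy than
# "accumulate", mirroring the qualitative gap in Figure 2 at toy scale.
```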


The results show that for all architectures, parameter counts, and sampling temperatures, replacing the data leads to an increase in test cross entropy as the number of model fitting iterations increases (Figure 2 left).

They also found that for all architectures, parameter counts, and sampling temperatures, accumulating data keeps the test cross-entropy equal to or lower than its initial value as the number of model-fitting iterations increases (Figure 2, right).

Figure 3 shows the learning curves for each model fitting iteration when repeatedly replacing the data (top) and accumulating the data (bottom).

The results show that data accumulation avoids model collapse in language modeling.


Both the 125M-parameter Llama 2 and the 9M-parameter GPT-2 show quality degradation when replacing data (R), but maintain high-quality text generation when accumulating data (A).


Diffusion Models for Molecular Conformation Data

Next, they trained a sequence of diffusion models on molecular conformation data.

Specifically, the researchers trained GeoDiff, a geometric diffusion model for molecular conformation generation, on the GEOM-Drugs dataset.

They downsampled the training portion of the GEOM-Drugs dataset to 40,000 molecular conformations, used this as the initial training set, and performed 50 diffusion steps for each prediction.

Results: after 8 model-fitting iterations, the researchers found that the test loss increased when replacing data, matching the language-model experiments, while it remained relatively constant when accumulating data (Figure 4).


Unlike with language models, they found that when replacing data, performance deteriorates sharply in the first model-fitting iteration on synthetic data and does not degrade much further in subsequent iterations.

Variational Autoencoders for Image Data

Finally, the researchers trained a sequence of variational autoencoders (VAEs) on CelebA, a dataset of 200,000 face images split into training and test sets.

This choice balances realism (a large number of samples, color images, reasonable resolution) against the computational feasibility of training models for many iterations on accumulated data.

They found that replacing the data at each iteration once again caused the model to collapse:

The test error rises rapidly with each additional iteration, and each iteration produces lower-quality, less diverse generated faces, until the model's generations collapse to a single mode.


In contrast, accumulating data at each iteration significantly slows down model collapse:

With each additional iteration, the rate at which the test error increases slows down significantly.

While the diversity of the generations has indeed decreased (compare the middle and right panels of Figure 6), they still capture the main axes of variation in the dataset, such as gender; however, the model no longer seems to generate details along the shorter axes of the data manifold, such as glasses and accessories.

It is also interesting to note that, unlike language modeling, the test error on accumulated data does increase with the number of iterations (although much more slowly than with replacement).

Why this difference exists is left as a question for future research.

References:

https://x.com/alexandr_wang/status/1816491442069782925

https://x.com/RylanSchaeffer/status/1816535790534701304

https://arxiv.org/abs/2404.01413

https://arxiv.org/abs/2406.07515