
“By 2028, all high-quality text data on the Internet will have been used up”

2024-08-01


Research firm Epoch AI predicts that all high-quality text data on the internet will be used up by 2028, and that machine learning datasets could exhaust all “high-quality language data” by 2026.

Researchers warn that training future generations of machine learning models on datasets generated by artificial intelligence (AI) may lead to "model collapse". Whether training data for large AI models is running short has recently become a hot topic in the media once again.

Recently, The Economist published an article titled "AI firms will soon exhaust most of the internet's data", pointing out that as high-quality internet data is depleted, the AI field faces a "data wall". For companies building large AI models, the challenge now is to find new data sources or sustainable alternatives.

The article cites a forecast by research firm Epoch AI that all high-quality text data on the Internet will be used up by 2028, and that machine learning datasets may run out of "high-quality language data" by 2026, a phenomenon known in the industry as the "data wall". How to get past the "data wall" is one of the biggest problems AI companies face today, and perhaps the one most likely to slow their training progress. The article notes that as pre-training data from the Internet dries up, post-training becomes more important; data-labeling companies such as Scale AI and Surge AI earn hundreds of millions of dollars each year collecting post-training data.


Chart from Epoch AI, as cited by The Economist

In fact, voices warning of "data exhaustion" have been around for some time. The Paper noted that in early July 2023, Stuart Russell, a professor of computer science at the University of California, Berkeley and author of Artificial Intelligence: A Modern Approach, warned that AI chatbots such as ChatGPT may soon "exhaust the text in the universe", and that the technique of training such bots on large collections of text "is beginning to run into difficulties."

But there are dissenting voices in the industry. In a May 2024 interview with Bloomberg technology reporter Emily Chang, Fei-Fei Li, the noted computer scientist, Stanford University professor, and co-director of Stanford's artificial intelligence laboratory, made clear that she does not share the pessimistic view that "our artificial intelligence models are running out of data for training." In her view, this framing is too narrow: even for language models alone, a large amount of differentiated data remains to be mined to build more customized models.

At present, one proposed solution to the problem of limited training data is synthetic data, which is machine-generated and therefore effectively unlimited. But synthetic data carries its own risks. On July 24, the international academic journal Nature published a computer science paper showing that training future generations of machine learning models on AI-generated datasets can contaminate their outputs, a phenomenon called "model collapse": because each model is trained on polluted data, it ends up perceiving reality incorrectly.
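The mechanism can be illustrated with a toy sketch (this is not the Nature paper's actual experiment): each "generation" fits a Gaussian to samples produced by the previous generation's fit, with the sampler mildly underrepresenting the tails, as generative models tend to do. The spread of the learned distribution then shrinks generation after generation.

```python
import random
import statistics

random.seed(0)

def truncated_gauss(mu, sigma, cutoff=2.0):
    """Sample from a Gaussian but reject the tails, mimicking a
    generative model that underrepresents rare events."""
    while True:
        x = random.gauss(mu, sigma)
        if abs(x - mu) <= cutoff * sigma:
            return x

def collapse_demo(generations=15, n=2000):
    """Each generation 'trains' (fits mean and spread) only on the
    previous generation's outputs; tail loss compounds over time."""
    mu, sigma = 0.0, 1.0  # the "real" data distribution
    spreads = [sigma]
    for _ in range(generations):
        data = [truncated_gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
        spreads.append(sigma)
    return spreads

spreads = collapse_demo()
print(f"spread: gen 0 = {spreads[0]:.2f}, gen 15 = {spreads[-1]:.2f}")
```

In this sketch the learned spread falls well below its starting value within a few generations, which is the "forgetting the tails" effect the paper describes in far more general settings.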

The research team showed that in language-model learning tasks, the tails of the underlying distribution matter a great deal: large-scale use of large language models to publish content on the Internet will pollute the data collected to train their successors, so genuine data from human interactions with large language models will become increasingly valuable. The team also noted, however, that AI-generated data is not entirely unusable; it must simply be filtered strictly. For example, each generation of models could retain 10% or 20% of the original data in its training set, use more diversified data such as human-generated data, or adopt more robust training algorithms.
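The mitigation described above (anchoring each generation's training set with a fixed share of original human data and filtering the synthetic portion) might be sketched like this; the function and parameter names are hypothetical, not taken from any real pipeline.

```python
import random

random.seed(0)

def build_training_mix(human_data, synthetic_data, human_fraction=0.2,
                       quality_filter=None):
    """Assemble one generation's training set with a fixed share of
    original human data (the article suggests keeping 10-20%).

    quality_filter is a hypothetical hook: a callable returning True
    for synthetic samples worth keeping. A real pipeline might use a
    learned classifier or heuristics here.
    """
    if quality_filter is not None:
        synthetic_data = [s for s in synthetic_data if quality_filter(s)]
    # Choose enough human samples so they make up human_fraction of the mix.
    n_human = int(len(synthetic_data) * human_fraction / (1.0 - human_fraction))
    n_human = min(n_human, len(human_data))
    mix = random.sample(human_data, n_human) + synthetic_data
    random.shuffle(mix)
    return mix

# Illustrative usage with placeholder "documents":
human = [f"human_{i}" for i in range(1000)]
synthetic = [f"synth_{i}" for i in range(800)]
mix = build_training_mix(human, synthetic, human_fraction=0.2)
print(len(mix), sum(s.startswith("human_") for s in mix))
```

With 800 filtered synthetic samples and a 20% human share, the mix contains 200 human documents out of 1,000 total, so the human "anchor" persists at a constant fraction across generations rather than being diluted away.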