
We need to be alert to the risk of AI "model collapse"

2024-10-01


Image source: The Week (US)
[Today's Viewpoint]
By our reporter Zhang Jiaxin
From customer service to content creation, artificial intelligence (AI) has driven progress in numerous areas. But a growing problem known as "model collapse" could undo those achievements.
"Model collapse" is a problem identified in a research paper published in the British journal Nature in July this year. It refers to training future generations of machine-learning models on AI-generated data sets, which can seriously "contaminate" their output.
Multiple foreign media outlets have reported that this is not just a technical issue for data scientists to worry about. If left unchecked, "model collapse" could have a profound impact on businesses, technology, and the entire digital ecosystem. Professor Xiong Deyi, head of the Natural Language Processing Laboratory at Tianjin University, explained "model collapse" from a professional perspective in an interview with a Science and Technology Daily reporter.
What is going on with "model collapse"?
Most AI models, such as GPT-4, are trained on large amounts of data, most of which comes from the internet. Initially, this data is generated by humans and reflects the diversity and complexity of human language, behavior, and culture. AI learns from this data and uses it to generate new content.
However, as AI searches the web for new data to train the next generation of models, it is likely to absorb some of the content it has itself generated, creating a feedback loop in which the output of one AI becomes the input of another. When generative AI is trained on its own content, its output can drift away from reality. It is like making copy after copy of a document: each version loses some of the original detail, until the result is blurry and less accurate.
The New York Times reported that when AI is cut off from human-generated input, the quality and diversity of its output decline.
Xiong Deyi explained: "The distribution of real human language data usually conforms to Zipf's law, that is, word frequency is inversely proportional to word rank. Zipf's law reveals a long-tail phenomenon in human language data, namely a large amount of low-frequency, diverse content."
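In standard notation (a textbook formulation added here for illustration, not a quotation from the interview), Zipf's law says the frequency of the word at rank r falls off roughly as the inverse of its rank:

```latex
% Zipf's law: f(r) is the frequency of the r-th most common word.
% The exponent s is close to 1 for natural-language corpora.
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
```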
Xiong Deyi further explained that, owing to errors such as approximation and sampling, the long tail of the real distribution gradually disappears from the data the model generates. The distribution of model-generated data gradually converges to one that no longer matches the real distribution, diversity shrinks, and the result is "model collapse".
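As a rough illustration of the mechanism Xiong describes, the toy sketch below (an illustrative simulation, not code from the Nature paper; the vocabulary size, sample size, and number of generations are arbitrary assumptions) repeatedly samples a finite corpus from a Zipf-like word distribution and refits the next "generation" of the model from those counts. Words that are never sampled drop to zero probability and can never return, so the long tail and the overall diversity shrink generation by generation.

```python
# Toy sketch of "model collapse" via sampling error over successive generations.
# Each generation: draw a finite synthetic corpus from the previous model's
# distribution, then refit the next model from the empirical counts.
import numpy as np

rng = np.random.default_rng(0)

VOCAB = 10_000          # number of distinct "words"
SAMPLE_SIZE = 50_000    # tokens generated per generation
GENERATIONS = 10

# Generation 0: a Zipf-like (power-law) distribution over the vocabulary.
ranks = np.arange(1, VOCAB + 1)
dist = (1.0 / ranks) / np.sum(1.0 / ranks)

for gen in range(GENERATIONS + 1):
    diversity = np.count_nonzero(dist)  # words the model can still produce
    print(f"generation {gen}: {diversity} words with nonzero probability")
    # Sample a synthetic corpus from the current model...
    counts = rng.multinomial(SAMPLE_SIZE, dist)
    # ...and "train" the next model on it by relative counts (maximum likelihood).
    # Words never sampled get probability zero and are lost for good.
    dist = counts / counts.sum()
```

Running the sketch prints the number of words with nonzero probability at each generation, which falls steadily as the low-frequency tail dies out.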
Is AI "cannibalizing" itself a bad thing?
regarding "model collapse", the american "theweek" magazine recently published an article commenting that this means that ai is "cannibalizing" itself.
xiong deyi believes that with the emergence of this phenomenon, the higher the proportion of model-generated data in subsequent model iterative training, the more information the subsequent model will lose about real data, making model training more difficult.
at first glance, "model collapse" may seem like a niche problem that only ai researchers need to worry about in the laboratory, but its impact will be far-reaching and long-lasting.
an article in the american "atlantic monthly" pointed out that in order to develop more advanced ai products, technology giants may have to provide synthetic data to programs, that is, simulated data generated by ai systems. however, because the output of some generative ai is full of bias, disinformation, and absurd content, these will be passed on to the next version of the ai ​​model.
the us "forbes" magazine reported that "model collapse" may also exacerbate problems of bias and inequality in ai.
That does not mean all synthetic data is bad, though. The New York Times noted that in some cases synthetic data can help AI learn, for example when the output of a large AI model is used to train a smaller one, or when the correct answer can be verified, such as the solution to a math problem or the best strategy in games like chess and Go.
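A minimal sketch of what "verifiable" synthetic data might look like in practice (a hypothetical example; the function names and the arithmetic task are illustrative assumptions, not taken from any of the cited reports): candidate model outputs are kept for training only when an independent check confirms they are correct, so errors cannot compound the way unverified AI output can.

```python
# Hypothetical pipeline for verified synthetic training data.
import random

def propose_answer(a: int, b: int) -> int:
    """Stand-in for a model's (possibly wrong) answer to 'a + b'."""
    answer = a + b
    if random.random() < 0.2:           # simulate occasional model errors
        answer += random.choice([-1, 1])
    return answer

def verify(a: int, b: int, answer: int) -> bool:
    """Ground-truth check; only possible because the task is verifiable."""
    return answer == a + b

random.seed(0)
synthetic_training_set = []
for _ in range(1000):
    a, b = random.randint(0, 99), random.randint(0, 99)
    ans = propose_answer(a, b)
    if verify(a, b, ans):                # discard incorrect synthetic examples
        synthetic_training_set.append((f"{a} + {b} = ?", ans))

print(f"kept {len(synthetic_training_set)} verified examples out of 1000")
```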
Is AI taking over the internet?
The problem of training new AI models points to a larger challenge. Scientific American reported that AI content is taking over the internet: text generated by large language models is flooding hundreds of websites, and compared with human-created content, AI content can be produced faster and in far larger quantities.
OpenAI CEO Sam Altman said in February this year that the company generates about 100 billion words every day, equivalent to the text of one million novels, a large share of which flows onto the internet.
The abundance of AI content on the internet, including bot tweets, absurd images, and fake comments, has fueled a more pessimistic view. Forbes reported that the "dead internet theory" holds that most of the internet's traffic, posts, and users have been replaced by bots and AI-generated content, and that humans no longer determine the direction of the internet. The idea initially circulated only on online forums but has recently gained wider traction.
Fortunately, experts say the "dead internet theory" has yet to become reality. Forbes pointed out that the vast majority of widely circulated posts, including insightful opinions, sharp language, keen observations, and fresh framings of new things in new contexts, are not generated by AI.
However, Xiong Deyi still emphasized: "As large models are applied ever more widely, the proportion of AI-synthesized data on the internet may grow higher and higher. A large amount of low-quality AI-synthesized data will not only cause a degree of 'model collapse' in models subsequently trained on internet data, it will also have a negative impact on society, for example when generated misinformation misleads people. Therefore, AI-generated content is not only a technical issue but also a social one, and it needs an effective response from the dual perspectives of safe governance and AI technology."
(Source: Science and Technology Daily)