
Study: repeatedly training AI on AI-generated content can cause "model collapse"

2024-09-05


IT Home reported on September 5 that, according to Forbes on September 4 local time, Dr. Ilya Shumailov of the University of Oxford and his team found that when generative AI software relies solely on AI-generated content, the quality of its answers begins to deteriorate. The research has been published in the journal Nature.

After the first two query cycles, the answers gradually drifted away from accuracy; by the fifth, quality had dropped significantly, and by the ninth consecutive cycle the answers had degenerated into meaningless gibberish. The researchers call this cyclical overuse of AI-generated content "model collapse": the AI's outputs gradually deviate from reality and eventually become worthless as the model continuously pollutes its own training set.

"It's surprising how quickly and imperceptibly model collapse occurs," Shumailov said. "Initially, it affects minority data, the underrepresented data. Then it affects the diversity of the outputs, causing variance to decrease. Sometimes you observe a small improvement on the majority data, but that improvement masks deteriorating performance on the minority data. Model collapse can have serious consequences."
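The variance collapse Shumailov describes can be illustrated with a toy simulation: repeatedly refit a simple Gaussian "model" to samples drawn from its own previous version. This is a minimal sketch of the recursive-training dynamic, not the Oxford team's actual LLM-based method; the sample size and cycle count here are arbitrary assumptions.

```python
import random
import statistics

# Toy illustration of "model collapse": a model is repeatedly refit
# on data sampled from its own previous version. Here the "model" is
# just a Gaussian; N and GENERATIONS are assumed values, not taken
# from the study.

random.seed(0)        # fixed seed for reproducibility
N = 20                # samples per self-training cycle (deliberately small)
GENERATIONS = 1000    # number of self-training cycles

mu, sigma = 0.0, 1.0  # generation 0: the "real" data distribution
history = []
for _ in range(GENERATIONS):
    # Draw training data from the current model, then refit the
    # model (its mean and standard deviation) to those samples.
    samples = [random.gauss(mu, sigma) for _ in range(N)]
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)
    history.append(sigma)

# The fitted spread shrinks across generations: rare tail values (the
# "minority data" in the quote above) are undersampled at each step,
# so the variance estimate drifts downward and the errors compound.
print(f"sigma after 1 cycle: {history[0]:.3f}, "
      f"after {GENERATIONS} cycles: {history[-1]:.3g}")
```

Each cycle loses a little of the distribution's tails, and because every generation trains only on the previous generation's output, those losses accumulate instead of averaging out.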

The researchers identified the "model collapse" phenomenon by starting with a pre-trained AI model fed Wikipedia content, then letting the model update itself on its own generated output. The contaminated data gradually eroded the original training set until the output became unintelligible. For example, after the ninth query cycle, a Wikipedia entry in the study had drifted absurdly from content about 14th-century English church spires to a treatise on short-tailed rabbits of various colors.

According to another study released in June by a team at Amazon Web Services, about 57% of text online has been translated by AI algorithms. If human-generated data on the internet is quickly crowded out by AI-filtered content, and Shumailov's findings hold, then AI may be "self-destructing" and "destroying" the internet at the same time.

The study concluded that the only way to achieve long-term sustainability for AI is to ensure that models retain access to existing non-AI-generated content and that new human-generated content is continuously introduced.