What to do when human data runs out? Fudan University professor Xiao Yanghua proposes two solutions

2024-09-07

Red Star Capital Bureau, September 7: In the era of big data, the value of data was never fully explored. In the era of AI, however, data is being consumed so quickly that synthetic data technology has become a hot field. According to a report released in June by the research institution Epoch AI, starting in 2026 the amount of new data generated by humans will be less than the amount of new data that models learn from, and large language models are estimated to run out of human data by 2028.
Data determines the upper limit of intelligence to a certain extent, which means that the more breakthroughs large-model technology makes, the more data technology must be "aligned" with it.
During the 2024 Bund Conference, Xiao Yanghua, a professor at Fudan University and director of the Shanghai Key Laboratory of Data Science, told Red Star Capital Bureau and other media in an interview that there may be two ways to address data exhaustion in the future: the first is synthetic data, and the second is moving into the private domain.
"Many people have annotated the Four Books and Five Classics, and that annotation process is much like data synthesis. We can keep thinking about, associating, and integrating the original data to generate more data, which is synthetic data." Xiao Yanghua pointed out that synthetic data is a very important idea whose significance goes beyond alleviating data exhaustion.
"Most synthetic data captures our thinking process. Through synthetic data, a large amount of implicit, unrecorded, unexpressed, thought-oriented data can be made explicit. This kind of data is crucial to stimulating the IQ, or rational ability, of large models."
Xiao Yanghua said that today's large models "only have intellect but no rationality": they remember more facts, but that does not make them smarter, and their rationality has not increased. Synthetic data is an important way to improve that rationality.
"Training a large model on synthetic data that simulates the thinking process helps it learn how to think through problems. Synthetic data is therefore used both to alleviate the data 'famine' and to improve the rationality of large models."
The other important idea Xiao Yanghua raised is moving into the private domain. "More high-quality, high-value data sits in the private domain, in vertical industries, and across all walks of life. Going a step further, there is personal data. The private domain and individuals therefore still hold a large amount of valuable, original, and real data, but we have not activated it: it has not been injected into large models, and large models have not learned this knowledge. How to use the characteristics of the private domain to unlock the potential of large models will also be very important in the future."
Xiao Yanghua said that private-domain data all resides in database systems. These databases hold a large amount of high-quality private-domain and industry data in various forms, and how to turn it into a training corpus for large models is an important question. If private-domain data can be used to train large models, they could become industry experts.
"Current large models have only general knowledge capabilities and are not yet up to professional tasks. Making good use of private-domain data may be the key to getting there, so the potential of data still waiting to be mined is very large."
Xiao Yanghua also looked ahead to "personal data". He pointed out that using personal data to train large models has only just begun, and he believes the next step must be to combine the two. Turning large models into "personalized large models" that serve individuals still has great potential, but there is a long way to go.
Red Star News reporter Wang Tian
Editor: Deng Lingyao