
When AI runs short of training data, can academic papers make up the difference?

2024-08-17


On August 14, Nature's official website published an article reporting that many academic publishers are selling access to research papers to technology companies for training AI models. In many cases the authors were not consulted about these deals, which has caused strong dissatisfaction among some researchers.
The papers that were "sold out"
Nature
The UK academic publisher Taylor & Francis has signed a $10 million deal allowing Microsoft to access its data to improve AI systems. In June, it was reported that the American publisher Wiley had earned $23 million by allowing a company to use its content to train models.
Nature
These papers span the natural sciences, social sciences and other fields, and they have become an important corpus for training AI models.
A painting robot demonstrates painting at the 2024 "AI for Good Global Summit" held in Geneva, Switzerland.
Image source: Xinhua News Agency
The Nature article argues that authors of academic papers currently have almost no power to intervene when publishers sell their copyrighted works. For articles that have been openly published, there is no existing mechanism for confirming whether the content has been used as AI training data. How to establish a fairer mechanism that protects creators' rights in the use of large language models deserves broad discussion in academic and copyright circles.
Not enough data for AI, so papers "make up" the shortfall
The development of large AI models rests on three key elements: data, algorithms, and computing power. As AI technology advances rapidly, the demand for training data keeps growing. According to some media reports, at the end of 2021 OpenAI ran into a difficult problem while training GPT-4: it had exhausted the reliable English text available on the internet. To deal with this, OpenAI transcribed video and audio from YouTube into large amounts of conversational text and then used that text for training.
On July 2, staff interacted with digital humans in the Digital Economy Immersive Experience Area of the 2024 Global Digital Economy Conference.
Photo by Xinhua News Agency reporter Ren Chao
Now technology giants have turned their attention to academic publishers. Academic papers, as the crystallization of scientific research, have become a sought-after commodity. Take natural language processing as an example: by training on a large number of papers, AI models can better understand knowledge in academic fields and improve their accuracy.
Google, Microsoft and other tech giants are investing huge sums to purchase corpora in order to gain a competitive advantage in AI. The Financial Times has already sold access to its content to OpenAI at a considerable price, and Reddit has reached a similar agreement with Google. These deals reflect publishers' attempts to license their content legally rather than have it scraped by AI models for free.
The training data dilemma: how to break the deadlock
Working with publishers is just one example of how technology companies are trying to solve the shortage of training data. The Economist recently published an article saying that AI companies will soon consume most of the internet's data, and predicting that the available human text data will be exhausted within four years.
On April 23, at the Hannover Industrial Fair in Germany, visitors played the "Rock, Paper, Scissors" game with an intelligent robot.
Photo by Xinhua News Agency reporter Ren Pengfei
Faced with this problem, OpenAI co-founder and CEO Sam Altman has proposed a way forward: companies like OpenAI will eventually move to training AI on AI-generated data (also known as synthetic data). As developers create increasingly powerful technologies, this would also reduce reliance on copyrighted data. Whether this technical path is feasible, of course, remains contested.
A cover article in Nature argues that if large models are trained on data they generated themselves, the AI may degrade: within just a few generations, the original content degenerates into irreversible gibberish.
Beyond synthetic data, further opening up and sharing public data is also considered an effective path. A report from Industrial Securities points out that strengthening the open development of public data is an important measure for addressing the current fragmentation of data and improving the quality of training data.
Written by: Li Fei, Ma Jingyuan. Typesetting: Li Wenjian. Coordination: Li Zhengwei
References: Nature, The Paper, Cailian Press, 21st Century Business Herald
Produced by Guangming Online
Source: World Internet Conference