
Nature reveals shocking inside story: Papers are sold at sky-high prices to feed AI! Publishers make hundreds of millions, authors earn nothing

2024-08-15




  New Intelligence Report

Editor: Editorial Department
【New Intelligence Introduction】An article in Nature revealed: the papers you published may have been used to train models! One publisher has already made $23 million by selling data. Meanwhile, the authors who worked hard to write those papers don't get a penny. Is this reasonable?

Recently, an article in Nature revealed that even scientific research papers are being used to train AI...
It reports that many academic publishers have authorized technology companies to access their papers to train AI models.
From the first idea to the final draft, a paper embodies its authors' countless days and nights of work, yet it may now become training data for AI without their knowledge.
Is this reasonable?
What's even more infuriating is that publishers are the ones profiting from those papers.
According to a Nature report, last month British academic publisher Taylor & Francis signed a $10 million deal with Microsoft to allow Microsoft to access its data to improve AI systems.
An investor update in June showed that the US publisher Wiley earned $23 million from allowing a company to train models on its content!
But none of this money goes to the authors of the papers.
Lucy Lu Wang, an AI researcher at the University of Washington, says that anything readable online, whether or not it sits in an open-access repository, has most likely already been fed into an LLM.
What is even more worrying is that once a paper has been used as training data, it cannot be removed from the model after training is complete.
And if your paper hasn't been used to train an AI yet, don't worry: it soon will be!

Data sets are like gold, and major companies are bidding


We all know that LLMs need to be trained on massive amounts of data, usually scraped from the Internet.
It is from the billions of tokens in this training data that an LLM learns the patterns it uses to generate text, images, and code.
Academic papers are long and information-dense, which makes them some of the most valuable data that can be fed to an LLM.
Moreover, training LLMs on large amounts of scientific information can also greatly improve their reasoning ability on scientific topics.
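As a rough illustration of what those tokens look like, the sketch below counts tokens in a short abstract-like passage using the open GPT-2 tokenizer from Hugging Face transformers; the sample text and the per-paper extrapolation are made up for illustration only.

```python
# A rough illustration of how text becomes the "tokens" that LLMs train on.
# Uses the open GPT-2 tokenizer from Hugging Face `transformers`; the sample
# text and the per-paper estimate below are made up for illustration only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

abstract = (
    "Academic papers are long and information-dense, which makes them "
    "unusually valuable as training data for large language models."
)

token_ids = tokenizer.encode(abstract)
print(f"{len(abstract.split())} words -> {len(token_ids)} tokens")

# A full paper of roughly 8,000 words would contribute on the order of
# 10,000 tokens, so a corpus of millions of papers quickly adds up to billions.
```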
Wang co-created S2ORC, a dataset built from 81.1 million academic papers. S2ORC was initially developed for text mining, but it has since been used to train LLMs.
The Pile, built by the nonprofit organization EleutherAI in 2020, is one of the most widely used large open-source datasets in NLP research, totaling about 800GB. It contains a large amount of text from academic sources, with arXiv papers accounting for 8.96%, and it also draws on other sources such as PubMed, FreeLaw, and the NIH.
The 1T-token dataset MINT, open-sourced not long ago, also tapped arXiv's trove, extracting a total of 870,000 documents and 9B tokens from it.
MINT's data-processing pipeline shows just how high the quality of paper data is: it needs very little filtering or deduplication, and the proportion of documents retained is extremely high.
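For context, here is a minimal sketch of the kind of exact-deduplication and length-filtering pass such pipelines run over scraped documents; the threshold is an arbitrary placeholder, not the one MINT or the Pile actually used.

```python
# Minimal sketch of an exact-deduplication plus length-filter pass, of the kind
# pretraining pipelines run over scraped documents. The threshold is an
# arbitrary placeholder, not the one MINT or the Pile actually used.
import hashlib

def clean_corpus(documents, min_words=200):
    seen_hashes = set()
    kept = []
    for text in documents:
        # Drop fragments too short to be a useful piece of a paper.
        if len(text.split()) < min_words:
            continue
        # Exact deduplication on a hash of the normalized content.
        digest = hashlib.sha256(text.strip().lower().encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept

# With high-quality paper data, almost every document survives this pass.
```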
Now, to head off copyright disputes, major model companies have begun paying real money for high-quality datasets.
This year, the Financial Times sold its content to OpenAI for a considerable price; Reddit also reached a similar agreement with Google.
In the future, there will be more and more transactions like this.

It is extremely difficult to prove that a paper was used to train an LLM


Some AI developers open up their data sets, but many companies that develop AI models keep most of their training data confidential.
Stefan Baack, an AI training data analyst at the Mozilla Foundation, said no one knows what training data these companies have.
The most popular sources among industry insiders are undoubtedly the open-access repository arXiv and the abstracts in the academic database PubMed.
Currently, arXiv hosts the full text of more than 2.5 million papers, and PubMed contains an astonishing 37 million-plus citations.
Although the full text of many papers on sites such as PubMed sits behind paywalls, the abstracts are free to read, and that portion may have been scraped by big technology companies long ago.
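To get a sense of how easy such harvesting is, here is a minimal sketch that pulls a few abstracts from arXiv's public Atom API; the query string is only an example, and bulk collection is of course governed by arXiv's terms of use.

```python
# Minimal sketch: fetching a few abstracts from arXiv's public Atom API.
# The query string is only an example; bulk harvesting is governed by
# arXiv's terms of use.
import urllib.request
import xml.etree.ElementTree as ET

url = (
    "http://export.arxiv.org/api/query"
    "?search_query=all:large+language+models&start=0&max_results=5"
)
with urllib.request.urlopen(url) as resp:
    feed = resp.read()

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in ET.fromstring(feed).findall("atom:entry", ns):
    title = entry.findtext("atom:title", default="", namespaces=ns)
    abstract = entry.findtext("atom:summary", default="", namespaces=ns)
    print(title.strip())
    print(abstract.strip()[:200], "...\n")
```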
So, is there any technical method to identify whether my paper has been used?
At the moment, it is still difficult.
It is difficult to prove that an LLM used a specific paper, says Yves-Alexandre de Montjoye, a computer scientist at Imperial College London.
One approach is to prompt the model with a very rare sentence from the paper's text and check whether its output continues with the next words of the original.

A scholar once prompted GPT-3 with the beginning of the third chapter of "Harry Potter and the Sorcerer's Stone", and the model quickly and correctly spit out about a full page of the book.
If it does, there is no escaping it: the paper is in the model's training set.
What if it does not? That is not necessarily evidence that the paper was left out.
Developers can configure LLMs to filter their responses so that they do not match the training data too closely.
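As an illustration, here is a minimal sketch of such a continuation probe, with GPT-2 from Hugging Face transformers standing in for the model under test; the prefix and expected continuation are placeholders you would take from your own paper.

```python
# Minimal sketch of the "rare sentence" continuation probe described above.
# GPT-2 is only an open stand-in for the model under test; the prefix and
# expected continuation are placeholders you would take from your own paper.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prefix = "In this work we introduce a deliberately rare phrasing that"
expected_continuation = "appears only in our own paper"

inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,  # greedy decoding: ask for the most likely continuation
    pad_token_id=tokenizer.eos_token_id,
)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:])

# A verbatim match suggests memorization; a mismatch proves little, because
# developers can filter responses that reproduce training text too closely.
print("model continued with:", completion)
print("match:", completion.strip().startswith(expected_continuation))
```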
We may try our best and still be unable to prove it conclusively.
Another method is the "membership inference attack".
The principle is that when a model is shown something it has seen before, it is more confident in its output.
To this end, de Montjoye's team developed a special "copyright trap".
To set the traps, the team generated plausible-looking but meaningless sentences and hid them in a work, for example as white text on a white background or in a zero-width field on a web page.
If the model's perplexity on the trap sentences hidden in the text is lower than on unused control sentences, that can be taken as statistical evidence that the traps have been seen.
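Here is a minimal sketch of that perplexity comparison, again with GPT-2 as an open stand-in and made-up trap and control sentences; a real test would use many sentences and a proper statistical test.

```python
# Minimal sketch of the perplexity comparison behind the "copyright trap".
# GPT-2 stands in for the model under test; the trap and control sentences are
# made up. A real membership test would use many sentences and a statistical test.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model return the mean cross-entropy loss.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

trap = "gibberish sentence that was hidden inside the published pages"
control = "gibberish sentence that was generated but never published anywhere"

print("trap perplexity:   ", perplexity(trap))
print("control perplexity:", perplexity(control))
# Markedly lower perplexity on the trap sentences than on the controls is
# statistical evidence that the model saw the trapped documents during training.
```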

Copyright Disputes


However, even if we can prove that an LLM was trained on a certain paper, what then?
There is a long-standing controversy here.
From a publisher's perspective, if a developer uses copyrighted text in training without obtaining permission, that's a clear infringement.
But the other side can counter: the model does not plagiarize anything, so how can there be infringement?
Indeed, an LLM does not copy anything verbatim; it takes the information in the training data, breaks it apart, and uses it to learn to generate new text.
A more complicated issue is how to distinguish between commercial and academic research purposes.
According to arXiv's current terms of use, scraping, storing, and using its electronic preprints and site metadata for personal or research purposes is permitted and supported.
However, commercial use of arXiv is strictly prohibited.
So the question is: if a commercial company trains its own commercial model on an open-source dataset released by an academic institution, and that dataset draws on arXiv or similar academic sources, how should that be judged?
In addition, publishers' subscription terms often do not clearly state whether papers may be used as training data for models.