2024-08-16
Increasingly, academic publishers are selling research papers to tech companies to train artificial intelligence (AI) models, while the authors see none of the revenue.
Large language models (LLMs) have once again sparked controversy over training data. Nature reporter Elizabeth Gibney recently published an article titled "Has your paper been used to train an AI model? Almost certainly," reporting that a growing number of academic publishers are licensing research papers to technology companies for use in training AI models. One publisher has earned $23 million from such deals, while the authors have received nothing. In many cases the authors were never consulted about these transactions, which has provoked strong dissatisfaction among some researchers.
"If your paper has not yet been used as AI training data, it will probably soon become part of the training." Elizabeth Gipney pointed out in the article that academic paper authors currently have little right to interfere when publishers sell their copyrighted works. There is no ready-made mechanism to confirm whether publicly published articles are used as AI training data. How to establish a fairer mechanism to protect the rights of creators in the use of large language models is worthy of extensive discussion in academia and the copyright industry.
Large language models are usually trained on vast amounts of data scraped from the internet. This data comprises billions of units of text called "tokens"; by learning the statistical patterns across these tokens, a model becomes able to generate fluent text. Academic papers are more valuable than ordinary web text because of their rich content and high information density, making them an important data source for AI training. Stefan Baack, a data analyst at the global nonprofit Mozilla Foundation, found that scientific papers are especially helpful in training large language models, particularly for reasoning about scientific topics. It is precisely because the data is so valuable that major technology companies have spent huge sums to purchase datasets.
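To make the notion of "tokens" concrete, here is a minimal sketch using OpenAI's open-source tiktoken library; the choice of library and tokenizer is an illustrative assumption, as the article does not name any specific tooling. It shows how a sentence is split into the integer token IDs a model actually trains on:

```python
# pip install tiktoken
import tiktoken

# Load a publicly documented tokenizer (the one used by several OpenAI models).
enc = tiktoken.get_encoding("cl100k_base")

text = "Academic papers are a high-density source of training data."
token_ids = enc.encode(text)                    # text -> integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # each ID back to its text fragment

print(f"{len(token_ids)} tokens: {pieces}")
# A language model is trained to predict each next token given the ones before it.
```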
The article points out that this year the Financial Times reached an agreement licensing its content to OpenAI, and the social platform Reddit signed a similar deal with Google. These deals reflect publishers' attempts to license their content legally rather than have it scraped by AI developers for free.
The article also revealed that last month the British academic publisher Taylor & Francis signed a $10 million agreement giving Microsoft access to its data to improve AI systems, and that in June the American publisher Wiley earned $23 million by providing content to a company for AI training. None of this revenue goes to the papers' authors.
Researchers are now exploring technical means of helping authors determine whether their works have been used to train AI models. Lucy Lu Wang, an artificial intelligence researcher at the University of Washington in Seattle, notes that once a paper has been absorbed into a model's training data, it cannot be removed after training is complete.
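One family of techniques used for this kind of detection is membership inference: if a model assigns conspicuously low perplexity (i.e., high confidence) to a specific passage, that can be weak evidence the passage appeared in its training data. The sketch below illustrates the idea with a small open model; it is a simplified heuristic of my own construction, not the specific method Wang or the article describes:

```python
# pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # small open model, chosen only for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

def perplexity(text: str) -> float:
    """Average next-token surprise the model assigns to `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # cross-entropy over next-token predictions
    return torch.exp(loss).item()

candidate = "An excerpt suspected of having been in the training data..."
baseline = "A freshly written passage the model cannot have seen before..."

# Markedly lower perplexity on the candidate than on comparable unseen text
# is weak evidence of memorization -- never proof on its own.
print(f"candidate: {perplexity(candidate):.1f}  baseline: {perplexity(baseline):.1f}")
```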
Yet even when it can be shown that a paper was used for AI training, the legal questions remain unsettled. The article notes that publishers regard training on unauthorized copyrighted content as infringement, while an opposing legal view holds that large language models do not copy content directly but generate new text from what they have learned.
It is worth noting that not all researchers object to their works being used for AI training. Baack said he is happy to see his research used to improve the accuracy of AI and does not mind AI "imitating" his writing style. But he also acknowledged that not everyone can shrug the issue off so easily, particularly creators who face direct competition from AI, such as artists and writers.
In fact, lawsuits over the use of copyrighted creative works to train AI models have drawn wide attention before.
On August 14, The Washington Post reported a breakthrough in a class action lawsuit brought by several visual artists and illustrators in the United States against AI image-generation tools. The plaintiffs accuse startups including Midjourney and Stability AI of using their works to train AI models without consent. This week, U.S. District Judge William Orrick allowed key parts of the case to move forward, meaning the court found sufficient legal grounds for certain allegations to proceed to trial; the companies' internal communications from the development of their AI tools may be disclosed as the litigation proceeds.