2024-08-19
Author: Zhou Yixiao
Editor: Wang Zhaoyang
Disconnect
Recently, users who open the MiTa AI search website have been greeted by a striking line of text at the top: "It's over! We received a 28-page infringement notice from CNKI."
Clicking through leads to a statement from MiTa saying that it had received an infringement notice from China Academic Journal (CD-ROM Edition) Electronic Magazine Co., Ltd., that is, CNKI. CNKI, itself long controversial and previously fined 87.6 million yuan for suspected monopolistic behavior and 50 million yuan over personal information security issues, was now accusing MiTa of infringement.
Put simply, MiTa AI search can surface content from CNKI. CNKI considers this an infringement and demands that MiTa immediately stop providing CNKI data in its search service.
"We do not want our website, CNKI, to be searched by MiTa Technology. Please immediately disconnect the link from the search results to our website. If you need business cooperation, please contact us."
In its statement, MiTa Technology responded that the "Academic" section of MiTa AI search includes only the titles and abstracts of papers, not the full text of the articles. To read the full text, users must follow the source link to the original website. According to academic norms, the title and abstract of a document should be independent and self-explanatory, so that readers can obtain the necessary information without reading the full text.
Currently, some links in MiTa Academic Search lead to Wanfang Data.
MiTa also emphasized that the value of knowledge lies in its circulation. Academic literature is an important carrier of human intellectual achievement and is highly irreplaceable. If scientific literature becomes a luxury, it will harm fair access to knowledge and the development of scientific research.
However, after ranging from human wisdom to academic ideals, the action MiTa actually took was to "break the link": "Even if we don't understand, we respect CNKI's choice." From now on, MiTa AI search will no longer include the title and abstract data of CNKI documents, and will instead include title and abstract data from other authoritative Chinese and English knowledge bases. Other databases are also welcome to discuss cooperation.
In other words, MiTa ultimately did what CNKI demanded.
Important issues left unstated
MiTa AI Search is a star product of the current AI boom and is often described as China's Perplexity. MiTa is also a star company among this wave of large-model startups. The latest reports indicate that it has completed a new financing round of 100 million yuan at a post-money valuation of 150 million US dollars. MiTa was founded before the large-model boom, but its core product, MiTa AI Search, was officially launched only in March this year.
MiTa's advertisement on Hunan TV
CNKI's infringement notice states that MiTa provides users with the titles and abstracts of CNKI's academic literature, which it regards as suspected infringement. On this point, You Yunting, senior partner and lawyer at Shanghai DaBang Law Firm, said that web pages are different from papers. CNKI's title and abstract pages are publicly accessible to users in China. As the dominant operator in China's market for online Chinese academic literature databases, CNKI needs a legitimate reason to forbid MiTa from searching and crawling these two kinds of public information.
In essence, CNKI is asking MiTa not to crawl its website. In the traditional search engine ecosystem, such crawling follows a basic convention: each website or information provider uses a robots.txt file to tell search engines which content may be crawled and which may not.
Search engines such as Baidu and Google name their crawlers in the process, so that a site knows who visited and what was fetched. Judging from CNKI's robots.txt file, however, it does not block any crawler.
"What's interesting is that although CNKI sent a letter to MiTa asking it to disconnect the link, that is, to stop crawling its web pages, its robots file (https://www.cnki.cn/robots.txt) does not prohibit any search engine crawler. According to the content of CNKI's robots file, no one is barred from crawling its pages in general; only cms, query.html?*, report, paper, qrcode, js, cs, which cover the back-end management interface, static resource directories and specific content directories, may not be crawled."
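The robots.txt mechanism described above can be sketched with Python's standard `urllib.robotparser`. This is only an illustration: the rules in `SAMPLE_ROBOTS_TXT`, the `example.org` URLs and the crawler name `HypotheticalAIBot` are assumptions made up for the example, loosely modeled on the structure the lawyer describes (no crawler blocked outright, a few directories disallowed). It is not the actual content of CNKI's file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every crawler is allowed, except in a few directories.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /cms/
Disallow: /report/
Disallow: /qrcode/
Disallow: /js/
"""

BASE = "https://example.org"  # placeholder site, not a real target


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this named crawler to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
for agent in ("Baiduspider", "Googlebot", "HypotheticalAIBot"):
    abstract_ok = is_allowed(parser, agent, BASE + "/abstract/123")
    admin_ok = is_allowed(parser, agent, BASE + "/cms/admin")
    print(f"{agent}: abstract page={abstract_ok}, admin page={admin_ok}")
```

Under rules of this shape, any crawler that identifies itself is treated the same way: public pages are open and only the listed directories are off limits. Note that robots.txt is a voluntary convention; it only works if the crawler announces a name and checks the file before fetching.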
If CNKI has not used the industry's standard mechanism to prohibit crawling, why send a notice letter?
"Many AI search engine crawlers do not behave ethically. Unlike traditional search engines such as Baidu, Google, Sogou and Bing, they do not name their crawlers, but crawl anonymously and silently," said You Yunting. In fact, such anonymous crawling is not necessarily carried out under the name of the AI search companies themselves. There are many third-party crawling services on the market that sidestep these basic principles. MiTa's reply did not say whether it had used such services.
Perplexity has faced similar controversy before.
At that time, Wired magazine and developer Robb Knight investigated and found that Perplexity did not comply with the robots.txt standard. Founder Aravind Srinivas responded in an interview that Perplexity did not ignore the Robots Exclusion Protocol... The web crawler found to be at fault belonged to a third-party supplier.
But when asked if he would stop using third-party crawlers, he simply said "it's complicated." In addition, the investigation also showed that in some cases, Perplexity may not summarize the actual article, but reconstruct the content based on the URL and traces left in the search engine (such as excerpts and metadata). Deja vu.
According to the article MiTa published, the infringement notice from CNKI ran to 28 pages, but MiTa published only an excerpt of the letter. Judging from the screenshots, the rest of the notice mainly lists evidence of infringement, and that evidence may go beyond showing that abstracts and titles were crawled.
According to what many users have shared in the past, MiTa could retrieve non-public papers, which could be read directly on MiTa's website. Although these PDF documents were shown as linked to external library websites, they may in fact have been stored on MiTa's servers. You Yunting believes that if MiTa built an index library containing the full text of CNKI papers, this may constitute infringement.
"The podcast and library sections of MiTa AI search have index libraries. My understanding of an index library is that MiTa has built an index database in advance from documents collected in bulk. When a user searches, MiTa retrieves relevant real-time content from the Internet, and then uses artificial intelligence to combine the real-time search results with the content of the index library to produce an answer," You Yunting said. In other words, although the main results page presents sources in the form of labeled citations, the "original text" has also been moved into MiTa's own service.
"The index database very likely exists. In fact, this is not hard to prove technically. When we run into this issue in litigation, we usually use packet capture software to reveal the real IP address the document is served from. If that IP address belongs to one of MiTa's servers, it means the document was provided by MiTa."
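The check the lawyer describes comes down to comparing the address a document was actually served from with the addresses attributed to the search service. A minimal sketch, assuming the packet capture has already been reduced to (URL, remote IP) pairs: all values below are invented for illustration, the address ranges are reserved documentation blocks, and `SEARCH_SERVICE_NETWORKS` stands in for whatever ranges an investigator would have to attribute to the service by other means.

```python
import ipaddress

# Hypothetical ranges attributed to the search service (documentation block).
SEARCH_SERVICE_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]

# Hypothetical capture results: the URL shown to the user and the IP that
# actually delivered the bytes.
CAPTURED = [
    ("https://library.example.org/paper/123.pdf", "203.0.113.17"),
    ("https://library.example.org/paper/456.pdf", "198.51.100.8"),
]


def served_by_search_service(remote_ip: str) -> bool:
    """Return True if the remote IP falls inside the service's known ranges."""
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in SEARCH_SERVICE_NETWORKS)


for url, ip in CAPTURED:
    origin = "search service" if served_by_search_service(ip) else "third party"
    print(f"{url} -> {ip} ({origin})")
```

A displayed link that names an external library while the bytes arrive from the service's own range is the kind of mismatch such evidence is meant to show. In practice, content delivery networks and shared hosting make the attribution step harder than this sketch suggests.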
In addition, for an AI search engine built on pre-trained models, whether this copyrighted material was used as training data is an even more important question.
If, because of the "overfitting" that models commonly exhibit, the output shown to users closely reproduces the original text of papers used in training, the use moves out of fair use and into a category of copyright infringement akin to "plagiarism".
But in that case, does CNKI have standing to "defend the rights" in papers written by individual researchers?
"CNKI has no right to claim that MiTa's training infringed its copyright," You Yunting believes.
He said that although CNKI hosts most of these papers, what it holds is the right of dissemination over information networks, licensed by the journals or the authors. If the papers are used for training, the rights involved are the reproduction right and other rights under the Copyright Law, so training does not infringe CNKI's right of dissemination over information networks. Of course, if the journals themselves assert infringement over MiTa's training, MiTa would face the same problem as OpenAI does in the lawsuit brought by the New York Times.
It's time for some more serious discussions
Therefore, the party MiTa needs to "respond" to is not only CNKI, which netizens describe as "evil".
Responding to CNKI always wins sympathy: judging from the comments under MiTa's statement, people have long been fed up with CNKI and all "stand" with MiTa. Beyond that, however, MiTa could also explain its use of this training data to the individual authors behind it.
The controversial "academic" search function is an important design choice that distinguishes MiTa from other Perplexity-style products, and it has won praise from many users. These users are often people who need to do large amounts of literature searching for tasks such as class assignments, rewriting articles, or even writing papers.
For the actual authors of the papers, this use of their data may bring other problems.
A recent article in Nature pointed out that many academic publishers have authorized technology companies to access their own papers for training AI models. For example, the American publisher Wiley directly earned $23 million after allowing a company to use its content to train models. And this income has nothing to do with the authors of the papers.
Beyond the question of how the benefits are distributed, which may never be resolved, some of the academic community's most important evaluation systems are also disrupted in the way this kind of "AI academic search" generates answers. Citation count, for example, a very important indicator in academia, appears to play no role in these AI academic search scenarios. The randomness and lack of explainability of large models, together with the incompleteness of their data, mean that the academic search results they generate do not match the academic community's own standards of judgment.
A scholar told Silicon Star: when these AI searches generate their answers, by what criteria do they decide which papers to include and which to leave out? For an academic community that treats citations as the most direct measure of a paper's value, if these AI-generated results keep multiplying and many researchers rely on them in their own papers, is this not another form of AI SEO pollution?
Results of a question asked in MiTa's law search
As for the dispute itself, once MiTa removes CNKI papers from its index library and no longer lets users read CNKI papers online, little remains of the intellectual property infringement dispute. In addition, You Yunting stated that under the Anti-Monopoly Law and the Internet Search Engine Service Self-Discipline Convention, CNKI would then no longer have a legitimate reason to forbid MiTa from searching and crawling these two kinds of public information.
But if AI search companies regard the products they are building as a long-term and serious undertaking, then beyond celebrating small wins around their products and striking a nonchalant pose, it is time for them to face these complex, real-world problems and discuss them openly and appropriately. Only then can they hope to reach the real crux of the information access field they set out to challenge.