
More than 170,000 videos involved: Nvidia and other tech giants reported to have trained AI models on YouTube data in violation of platform rules

2024-07-17


Several tech giants have been reported to have used YouTube content without authorization to train artificial intelligence (AI) models.

On July 16, local time, foreign media reported that several large technology companies, including Apple, Nvidia, Salesforce and Anthropic, had used data taken without authorization from YouTube, the video site owned by Google, to train AI models. The companies used a data set provided by a third party that contained a large amount of video subtitle text scraped from YouTube, in violation of YouTube's rules prohibiting the unauthorized scraping of content from the platform.

The report said that these companies all used a data set called "YouTube Subtitles" when training AI models. The data set is 5.7GB in size, contains 489 million words, and is drawn from 173,500 videos across more than 48,000 YouTube channels. It consists of the plain text of video subtitles, including both subtitles uploaded by video creators and text automatically transcribed by YouTube. In addition to English, the subtitles often come with translations into languages such as Japanese, German and Arabic.

The non-profit organization EleutherAI created the controversial data set and has not yet responded to the matter. According to its official website, EleutherAI's goal is to "lower the threshold for AI development and make cutting-edge AI technology accessible to everyone through training and publishing models." EleutherAI previously released a data compilation called "the Pile", most of whose components, including YouTube Subtitles, are open to the public.

According to the available information, Apple used the Pile for training a few weeks before it released OpenELM, a small on-device model, in April this year. It is worth noting, however, that Apple did not download the data from YouTube itself. From a technical perspective, therefore, it was EleutherAI, not Apple, that violated YouTube's terms of use.

A spokesperson for AI startup Anthropic confirmed that the Pile data set had been used to train the company's generative AI assistant Claude. The spokesperson said that YouTube's terms only cover "direct use of its platform," and suggested that any violations of YouTube's terms of service be discussed with the Pile's original author. The other companies, including Apple, Nvidia and Salesforce, have not yet responded to the matter.

The creators affected include well-known YouTubers such as Marques Brownlee, MrBeast and PewDiePie, as well as large news publishers such as The New York Times, the British Broadcasting Corporation (BBC) and ABC News in the United States. In addition, some material in the data set promotes conspiracy theories such as the "flat earth theory", and some comes from videos that have since been deleted. The Pile has now been removed from its official download site, but it can still be accessed through file-sharing services.

Responding to the report, well-known technology YouTuber Marques Brownlee wrote on X (formerly Twitter): "Apple obtained the data needed for their AI from several companies, one of which grabbed a lot of data/transcribed text from YouTube videos, including my videos. Technically speaking, Apple did not 'make a mistake', they did not actively grab the data. But this will be a long-term problem."


Marques Brownlee's tweet. Source: X Platform

Although Apple and the other companies may only have used publicly available data sets and may not have violated any rules themselves, the incident has once again drawn attention to the data issues behind AI training. Earlier this year, YouTube's parent company Google was itself reported to have used videos on the platform to train its models. Google responded at the time that this did not violate the agreement between the platform and its creators.

In March this year, Mira Murati, chief technology officer of OpenAI, was vague in an interview about the source of the training data for the company's text-to-video model Sora. In April, YouTube CEO Neal Mohan said in an interview that he had no direct evidence that OpenAI had used YouTube videos to improve Sora, but that if it had, this would be a "clear violation" of YouTube's terms of use.