
Nvidia exposed for scraping data, crawling more than 80 years of video every day; Peking University's academic dataset also affected

2024-08-06


Nvidia seems to have entered a turbulent period recently.
After its most powerful AI chip was reported to have delayed mass production and its market value evaporated by more than $300 billion, Nvidia was again exposed by 404 Media for grabbing video content from platforms such as YouTube and Netflix without authorization to train its AI video model that has not yet been announced to the public.
Internal emails and Slack chat records show that although Nvidia employees questioned the legality and ethics of using these data sets, company management said the actions had been approved by senior management and argued that their actions complied with copyright law.
It is worth mentioning that in an internal discussion at the end of February, NVIDIA mentioned multiple data sets it was using, including HD-VG-130M.
The latter is a dataset of 130 million YouTube videos created by researchers at Peking University, and its use license explicitly states that it is limited to academic research.
Nvidia's approach is a microcosm of most AI companies today.
When users are treated as "data ATMs," it is difficult for the outside world to know whether your work and mine have become fodder for AI training unless insiders expose it.
In short, humans remain consumers at the top of the food chain, yet we have inevitably become part of the AI development supply chain.
The following is the original 404 Media report, translated with GPT-4o. Enjoy~
Feeding the model with YouTube videos, downloading 80 years of videos every day
Internal Slack chats, emails, and documents obtained by 404 Media show that Nvidia scraped videos from YouTube and multiple other sources to compile training data for its AI products. When asked about the legal and ethical issues of using copyrighted content to train AI models, Nvidia argued that its practices are “fully consistent with the letter and spirit of copyright law.”
Internal Nvidia conversations reviewed by 404 Media show that when employees raised questions about the legal implications of using data sets and YouTube videos compiled by academics for research purposes, managers told them that company senior management had approved the use of the content.
One former Nvidia employee, whom 404 Media granted anonymity to discuss Nvidia’s internal processes, said employees were asked to scrape videos from Netflix, YouTube, and other sources to train AI models for Nvidia’s Omniverse 3D world generator, autonomous driving systems, and “digital human” products.
The project, which is internally known as Cosmos (but is distinct from the company’s existing Cosmos deep learning product), has not yet been released publicly.
The goal of Cosmos is to build a state-of-the-art video infrastructure model that “brings together simulations of light transport, physics, and intelligence to develop a variety of downstream applications that are critical to NVIDIA,” according to an email from project leaders.
A graphic provided via email to 404 Media shows how the Cosmos model applies to different Nvidia products.
Slack messages within a company channel set up for the project show that employees used an open source YouTube video downloader called yt-dlp, combined with virtual machines that refreshed their IP addresses, to avoid being blocked by YouTube.
According to reports, they tried to download full videos from multiple sources, including Netflix, but mainly focused on YouTube videos.
Emails reviewed by 404 Media show project managers discussing using 20 to 30 Amazon Web Services virtual machines to download 80 years' worth of video every day.
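To put the figure in those emails in perspective, here is a quick back-of-the-envelope calculation (the VM count assumes the midpoint of the 20 to 30 machines mentioned; the numbers are illustrative, not from the emails):

```python
# Back-of-the-envelope check of the scale described in the emails:
# "80 years' worth of video every day" spread across 20-30 AWS VMs.
YEARS_PER_DAY = 80
HOURS_PER_YEAR = 365 * 24   # ~8,760 hours of footage in one "year of video"
VMS = 25                    # assumed midpoint of the 20-30 machines mentioned

hours_per_day = YEARS_PER_DAY * HOURS_PER_YEAR  # 700,800 hours of video daily
hours_per_vm = hours_per_day / VMS              # 28,032 hours per machine per day

# Each VM would need to ingest video far faster than real time:
speedup_per_vm = hours_per_vm / 24              # ~1,168x real time per VM
print(f"{hours_per_day:,} h/day total, {speedup_per_vm:,.0f}x real time per VM")
```

Even granting generous assumptions about download bitrates, this is an industrial-scale ingestion pipeline, not incidental research scraping.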
“We are finalizing the v1 data pipeline and securing enough computing resources to build a video data factory that generates a lifetime of human visual experience every day,” Mingyu Liu, Nvidia’s vice president of research and head of the Cosmos project, said in a May email.
Conversations and directives inside Nvidia show employees weighing legal and ethical considerations at the company whose chips and APIs have fueled the rise of generative AI and made it one of the world’s most valuable public companies.
They also highlight the insatiable demand among the industry’s largest players, such as Runway and OpenAI, for content to serve as training data for AI models.
An Nvidia spokesperson said in an email to 404 Media:
We respect the rights of all content creators and believe that our models and research work fully comply with the letter and spirit of copyright law. Copyright law protects certain forms of expression, but not facts, opinions, data, or information. Anyone can learn facts, opinions, data, or information from other sources and use them to create their own expressions. Fair use also protects the right to use works for transformative purposes, such as model training.
When asked about Nvidia’s use of YouTube videos as training data for its models, a Google spokesperson told 404 Media that the company’s “previous comments still apply.”
Previously, YouTube CEO Neal Mohan had said that if OpenAI used YouTube videos to train its AI video generator Sora, that would clearly violate YouTube's terms of use.
A Netflix spokesperson told 404 Media that Netflix does not have an agreement with Nvidia regarding content acquisition and that the platform’s terms of service do not allow data scraping.
Questions about legal issues raised by employees on the project were generally dismissed by project managers, who said the decision to scrape videos without permission was a "high-level decision" that employees didn't need to worry about. What constitutes fair, ethical use of copyrighted content and of datasets compiled for academic, non-commercial purposes was treated as an "unresolved legal issue" to be addressed in the future.
Our investigation highlights tech companies’ ask-no-permission attitude toward scraping vast amounts of copyrighted content into the datasets used to train the world’s most valuable AI models.
Nvidia discussions suggest Peking University's academic dataset was also misused
In February 2024, Nvidia’s chief scientist Francesco Ferroni wrote in an Nvidia Slack channel called #cosmos-dataset-creation:
“Hi everyone, @Sanja Fidler mentioned to me about an initiative to aggregate a large curated video dataset for generative modeling. We thought it would make sense to first aggregate all video datasets available internally (public or downloaded internally) to avoid duplication of effort.”
(Note: Sanja Fidler is Nvidia’s vice president of AI research.)
Ferroni then linked a spreadsheet of datasets, including MovieNet (a database of 60,000 movie trailers), WebVid (a video dataset compiled from stock footage, hosted on GitHub, later deleted by its creator following a cease-and-desist notice from Shutterstock), InternVid-10M (a dataset of 10 million YouTube video IDs on GitHub), and several internally captured datasets of video game footage. 404 Media removed the names of lower-level employees from the screenshots of the Slack conversations.
We included the names of several senior engineers and executives involved in the project because they have public visibility as leaders in the AI industry.
Ferroni linked to a spreadsheet showing the datasets used in the project
In a follow-up discussion in February, the engineers talked about the datasets they had acquired, including HD-VG-130M, a set of 130 million YouTube videos created by researchers at Peking University in China with a license that states it can only be used for academic purposes.
The dataset's Github page states: "By downloading or using the data, you understand, acknowledge and agree to all the terms of the following agreement."
The page emphasizes that "it can only be used for academic purposes. Any content in the HD-VG-130M dataset is for academic research only. You agree not to copy, trade or use it for any commercial purpose. Distribution is prohibited. Respect the privacy of personal information of the original source. No broadcast, modification or any other similar behavior of the dataset content in any form is allowed without the permission of the copyright owner."
Throughout the project, datasets compiled and made public by researchers and academics were treated as free to use in Nvidia models. AI researchers are increasingly concerned about the appropriate use of the datasets they make public, including the ethical and legal aspects.
Robert Mahari of the MIT Data Provenance Initiative told 404 Media that they have seen a significant increase in the use of non-commercial licenses for research datasets over the past year, suggesting that academics are trying to limit commercial use of their work. Datasets compiled for research use have significantly different purposes than those for commercial use.
“When academics release public datasets, especially for specific tasks, we may not specifically check whether the data has certain types of biases or Western-centrism or things like that. If that’s not the focus of the research, then it’s not checked,” Mahari said. “So if an academic says in the license, ‘For academic use only,’ or, ‘Please do not use this data in ways not intended,’ there’s a good reason to comply with that. Because the data may not be of commercial quality, and it may not perform well in other types of settings.”
Like many other tech giants, Nvidia employs people who conduct and publish academic research. However, internal Nvidia conversations reviewed by 404 Media suggest that Cosmos is aimed at supporting the company’s efforts to bolster its commercial products in the competitive AI industry.
Publicly released research datasets are often distributed as URLs or YouTube IDs for two reasons: first, practical considerations—sharing millions of full video or image files is too cumbersome—and second, legal and ethical considerations. For example, if someone deletes their YouTube video or tweet, a copy won’t continue to exist in the dataset without the owner’s knowledge or permission.
“It’s a bit like getting around legal constraints by not distributing the dataset to outsiders,” Emily Bender, professor and director of the Computational Linguistics Laboratory at the University of Washington, told 404 Media. “Others can build on the dataset and use it for their own purposes.”
Discussion details exposed: how does Nvidia scrape data at the edge of the law?
In March, a research scientist started a discussion on Slack about the possibility that OpenAI’s Sora video generator might use Hollywood movies like Avatar and The Lord of the Rings as training data.
“Movies are actually a good source of data to get game-like 3D consistency and fictional content, but at a higher quality. The characters are all fully CGI, and many live-action scenes are now CGI as well,” they said. Someone responded that the team should train on the Discovery Channel movie dataset.
Mingyu Liu said: "We need a volunteer to download all the movies."
The research scientist who originally proposed the film added: “While it’s very clear what they’re doing, we have to be very careful about Hollywood’s hypersensitivity to AI, just like what happened in the artist community after the release of SD [Stable Diffusion], and what’s happening now in Hollywood.”
They then posted two links in the chat: one to a Hollywood Reporter article about Tyler Perry pausing an $800 million studio expansion after seeing OpenAI’s Sora, and another to a Vanity Fair article about the 2023 SAG-AFTRA strike leading to the inclusion of AI language in studio contracts.
“We won’t publish any research results from what we do here,” Mingyu Liu emphasized. “We will use all the downloadable data to conduct experiments. Since we won’t publish anything, there will be no negative emotions.” The former employee who spoke to 404 Media explained that “publish” refers to research publications.
The person who had raised the “high sensitivity” concern responded, “If we were to launch such a project within the company, we should communicate widely, because showing similar examples could cause backlash.” Mingyu Liu replied, “Yes.”
“Found some high-priority files to download,” Ferroni wrote in another project-related Slack channel in March. “Turns out 2.3 million original videos are missing from our HDVILA [High-Resolution Video Language] dataset!” They were referring to Microsoft’s HD-VILA-100M, a large-scale, high-resolution, and diverse video-language dataset. “Here are the missing YouTube links,” they sent a link to a Google Drive document, saying, “Let’s get this into the download process!”
The license statement for HD-VILA-100M reads:
“You agree to use the data only for computing purposes related to non-commercial research. This restriction means that you may engage in non-commercial research activities (including non-commercial research conducted or funded by a commercial entity), but you may not use the data or any results in any commercial product, including as part of a product or service that you use or provide to others (or to improve any product or service).”
“Let’s create a database of URLs we’ve downloaded,” another engineer responded. “YouTube videos have unique IDs, can we use those as references (the ID after the ‘?v=’)? We’ll compare and merge URLs multiple times later.” Ferroni responded, “Yes, we’re setting up the infrastructure with Hive right now,” meaning they were adding it to the project management tool Hive.
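The deduplication scheme the engineer describes, keying each download on the unique ID after “?v=”, might look like the following minimal sketch (the function names are illustrative, not from Nvidia’s pipeline):

```python
from typing import Optional
from urllib.parse import urlparse, parse_qs

def youtube_id(url: str) -> Optional[str]:
    """Extract the unique video ID (the value after '?v=') from a YouTube URL."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("v")
    return ids[0] if ids else None

def merge_url_lists(*url_lists: list) -> set:
    """Merge several scraped URL lists into one set of unique video IDs,
    so the same video is never downloaded twice."""
    return {vid for urls in url_lists for url in urls
            if (vid := youtube_id(url)) is not None}

# Example: two scraped lists with one overlapping video.
batch_a = ["https://www.youtube.com/watch?v=abc123",
           "https://www.youtube.com/watch?v=def456&t=10"]
batch_b = ["https://www.youtube.com/watch?v=abc123",
           "https://www.youtube.com/watch?v=xyz789"]
unique_ids = merge_url_lists(batch_a, batch_b)
print(sorted(unique_ids))
```

Because YouTube IDs are stable even when the surrounding URL parameters change (timestamps, playlist references), comparing IDs rather than raw URLs is the natural deduplication key.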
The Omniverse team member they tagged responded, “We’re on AWS, and restarting a [virtual machine] instance gives us a new public IP, so that’s not an issue at this time.”
In Slack discussions in the #cosmos-dataset-creation channel about how to find the best videos, employees occasionally bring up the legal and ethical implications of their work. In February, after someone mentioned using YouTube-8M, a research dataset of YouTube IDs compiled by Google, Ferroni asked, “Maybe we can’t use [YT8M] for non-research purposes?”
The YouTube-8M paper and project page make no mention of copyright issues, but the paper does indicate that the dataset was created to advance machine learning research: “We expect this dataset to level the playing field for academic researchers, close the gap to large-scale annotated video datasets, and significantly accelerate research in video understanding. We hope that this dataset will serve as a testbed for developing novel video representation learning algorithms, especially methods that effectively handle noisy or incomplete labels.”
In response to Ferroni’s question about using it for the Cosmos project, an Nvidia employee who previously co-created ACAV100M replied:
“Yes, downloading data from Google is very expensive. However, scheduling 10,000 cores from within Nvidia has always been a challenge.
Additionally, Nvidia’s bandwidth limits to the cloud add considerable variability that could cause issues. Downloading on Google Cloud means every job gets a stable, high-bandwidth connection to YouTube.”
“More importantly, downloading YouTube videos is prohibited by YouTube’s terms of service. So when downloading YouTube-8M, we communicated with Google and YouTube in advance and used Google Cloud as an incentive for the downloads. After all, 8 million videos would typically generate a lot of ad impressions, which means lost revenue if they are downloaded and used for training instead, so they should get some benefit from this. Paying $0.00625 per video download is still a good deal.”
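Taking the quoted rate at face value, the implied price of the full YouTube-8M download works out as follows (a sanity check derived from the quote, not a figure stated in the emails):

```python
# Implied total cost of the YouTube-8M download at the per-video rate quoted.
videos = 8_000_000
cost_per_video = 0.00625            # dollars per video, as quoted in Slack
total = videos * cost_per_video     # roughly $50,000 for the full dataset
print(f"${total:,.0f}")
```

A five-figure total for 8 million videos is what makes the employee call it “a good deal.”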
“Okay, is it expected that this data can only be used for research purposes? As far as I know, Google’s YouTube API can query the licensing terms of each video,” Ferroni responded. “Can you also comment on the licensing terms of ACAV100M and YouTube8M?”
“As far as I know, YouTube’s terms of service prohibit downloading regardless of licensing; the restriction is about their lost ad revenue, not licensing,” another employee responded. They continued:
“I don’t know which licenses Google filtered when creating the dataset; we just downloaded what they listed as included in the dataset (they publish the features, and links to the original videos). The YouTube 8m dataset I downloaded comes with full metadata, so you can check each video there. I still need to check out the ACAV100M dataset. In general, CC or public domain are of course best. However, whether copyrighted material can be used for training is currently an open legal question; most companies seem to consider this fair use. I believe our legal team has approved this practice for training large language models, and will probably approve it for video training as well.”
“I think there’s a huge gap between commercializing something without someone’s consent and studying generative AI capabilities based on publicly released content,” Shayne Longpre, a PhD student at the MIT Media Lab, told 404 Media. The question about YouTube’s terms of service in the Cosmos Slack channel wasn’t the last time legal issues came up.
Later, another employee asked: "Hi team. Are we using https://research.google.com/youtube8m/download.html to download videos? If so, do we have legal approval? In one project, the legal department vetoed its use because an individual video's license took precedence over the license shared on YT8M." "This is an administrative decision. We have an overall license covering all data," Mingyu Liu replied. "Okay, thank you!" the questioner responded.
Bender told 404 Media that companies are taking advantage of the legal gray area that currently exists around copyrighted content used for training data. “It seems to me that there’s definitely a culture of ‘if we can get it, we can use it,’ ” she said. “A lot of that is based on people wanting it to be, rather than on careful research into its legality, or deep thinking about the impact it has on people.”
Using copyrighted content for AI training is “definitely not settled law,” Mahari said. The legal system has yet to determine whether obtaining training data to develop an AI model is sufficiently transformative, especially because models have been shown to memorize or recall training data as output. “My view, which is partially summarized in this Science article, is that training an AI model may indeed constitute fair use, but that does not mean that generating output that is similar to a specific item in the training data is not an infringement. In this case, it is not clear whether it is the provider of the underlying model or the specific user who generates the output that would infringe (this may depend on the specific context).”
In May, a research scientist dropped some links to YouTube channels in the Cosmos Slack channel and said, “If you guys are still open to suggestions for YouTube channels that you can download, here are a few that might be worth considering.” They included official channels from Expedia and Architectural Digest, as well as individual content creators like The Critical Drinker and Marques Brownlee (MKBHD). A project manager thanked them for the suggestion and said he would pass it on to the team, to which Fidler replied, “Did you include tutorial videos, too? Astronomy? Medicine?”
The “unresolved legal issues” of using copyrighted works for commercial base model training may not remain unresolved for long.
Copyright infringement lawsuits filed by copyright holders against generative AI companies are piling up, including Getty Images’ lawsuit against Stable Diffusion creator Stability AI, The New York Times’ lawsuit against OpenAI, and artists’ and creators’ lawsuits against Stability AI, Midjourney, DeviantArt, and Runway. The Cosmos training data team also discussed using Netflix content to train the generator.
“In today’s meeting, we got permission to download all kinds of data. Should we download the entire Netflix? How should we operationalize that?” Liu said in the Slack channel. “We should download the entire Discovery Channel!” someone responded.
“We need a project information coordinator. Who’s willing to watch all the movies and do screen captures at the same time?” Liu said. “We should get a lot of high-quality face videos from this,” he continued. Someone from the Omniverse infrastructure team was tagged in the thread, noting that they were willing to help “operationalize it” because they had “experience building large datasets at other big companies.”
The team also considered how best to add video game footage to the training data. Jim Fan, a senior research scientist at Nvidia, mentioned that there were "engineering and regulatory" obstacles in capturing live game video.
“Update: I have met with the folks at GeForce Now (GFN) and will be working with them on a data plan. We will be working closely with GFN and related engineering teams to build out real-time game data capture, scale up the pipeline, and process this data for training. High-quality game videos will be a very useful addition to our Sora project,” Fan wrote. “We don’t have stats or video files yet, as the infrastructure has not yet been set up to capture large amounts of live game video and action. We need to overcome engineering and regulatory hurdles. However, once the cleaned and processed GFN data arrives, we will add it to team-vfm as soon as possible.”
In March, the project hit a milestone: 100,000 videos downloaded in two weeks. An employee mentioned in a thread discussing the milestone that Ferroni had a downloader they were using, and Ferroni confirmed that they had been downloading audio and video. "Amazing progress. Now the question is how do we get a large number of high-quality URLs," Liu replied.
In late May, an email about the data strategy for video data was sent to members of the project team, announcing that they had compiled 38.5 million video URLs. "Based on our target distribution, the focus for the coming week remains on movies, drone footage, first-person videos, and some travel and nature videos," the email read. The email also included a chart showing the percentage of content types they downloaded.
In that email, a product manager suggested adding four additional datasets to the model’s training data. They wrote:

1. Ego-Exo4D: A diverse large-scale multimodal, multi-view video dataset and benchmark collected by 740 camera wearers in 13 cities around the world, capturing 1286.3 hours of videos of skilled human activities.

2. Ego4D: A large-scale first-person perspective dataset and benchmark suite, with over 3,670 hours of videos of daily life activities collected in 74 locations and 9 countries around the world.

3. HOI4D: A large-scale 4D first-person-view dataset with rich annotations to facilitate the study of category-level human-object interactions.

4. GeForce Now: Game Data.
HOI4D was created by researchers from Tsinghua University, Peking University, and the Shanghai Qi Zhi Institute under the CC BY-NC 4.0 license, which does not allow commercial use.
“In my opinion, if a company takes a dataset that is intended for research purposes only and uses it for research, they are still following the license of that dataset,” Bender said.
“But in order to ensure that, they have to be very careful to build firewalls between the research they do and the work they do in product development.”
In another update email in May, Liu said, “The research team is now training a 1 billion parameter model with many different configurations, each with 16 nodes. This is an important debugging step before further scaling. We plan to reach conclusions in a few weeks and then scale up to a 10 billion parameter model.”
Nvidia CEO Jensen Huang responded in that email, “Great update. Many companies have to build video infrastructure models. We can provide a fully accelerated pipeline.”
In June, employees discussed what types of content in the models would be most useful for keeping Nvidia's products competitive in the AI industry.
“Nvidia has robots, self-driving cars, Omniverse, and Avatar that most content companies don’t have. To have the greatest impact on the company, the data we curate must be well-suited for these killer applications,” Liu said.
“I understand data that has implications for robots and self-driving cars. Can anyone share details on data that has implications for Omniverse and Avatar use cases?” a product manager responded. “It will be videos of how humans interact with objects. Like furniture installation, cutting fruit, folding laundry,” Liu responded.
Is the advancement of AI models built on your creations and mine?
While Nvidia does contribute to academic research, conversations and emails obtained by 404 Media show that the models the Cosmos team is working on are intended for commercial use across multiple of its products.
Until a legal precedent is set on how training data may be compiled, or until companies are required to be transparent about this data, companies will continue to exploit the legal gray area of scraping copyrighted training data. Leaks of internal conversations like this are the only way people will know whether their work is being used to train models that make companies like Nvidia, Runway, or OpenAI billions of dollars.
For years, there have been calls for more transparency in the AI industry, whether through government regulation or industry standards.
“It is critical to understand what is in the datasets used to train models and how they were compiled,” the Open Data Institute’s Jack Hardinges, Elena Simperl, and Nigel Shadbolt wrote earlier this year. “Without this information, the work of developers, researchers, and ethicists to address bias or remove harmful content from data will be hampered.
Information about training data is also critical for lawmakers to assess whether the underlying models have ingested personal data or copyrighted material. Downstream, if the intended operators of AI systems and those affected by their use understand how they were developed, they are more likely to trust those systems.”
Last year, lawmakers introduced several bills to address the issue, including the AI Foundation Model Transparency Act, introduced in December, which would require companies that create foundation AI models to work with federal agencies like the FTC and the Copyright Office to develop transparency standards, including requiring them to disclose certain information to consumers.
The Generative AI Copyright Disclosure Act, proposed in April, would require dataset makers to submit “a sufficiently detailed summary of any copyrighted work” to the Register of Copyrights or face fines.
“Technically, it’s really hard to determine if your work was used for training,” Mahari said. “Internally, the best policy is to not tell people what you used for training, because it’s very hard for any third party to actually do an audit and find out. So as long as you don’t tell anyone, it’s very hard to prove.”
The original report is available at:

https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/