
Nvidia's secretive video foundation model "Cosmos" exposed, and its training data turns out to be scraped without permission

2024-08-06




Machine Heart Report

Synced Editorial Department

For this video model, NVIDIA is frantically scraping 80 years' worth of video every day.

Today, a report about Nvidia's plans to enter the video model field caused a stir on Reddit.

The report comes from the tech news outlet 404 Media. According to Slack messages (Slack is Nvidia's internal chat platform), emails, and documents it obtained, Nvidia is scraping videos from YouTube and several other sources to collect training data for its AI products.



Internal Nvidia conversations reviewed by 404 Media show that when employees working on the project raised potential legal issues about using "research data sets that are prohibited for commercial use" and "YouTube videos," managers told them they had approval from the highest levels of the company to use the content.

An anonymous former Nvidia employee said employees were asked to grab videos from Netflix, YouTube and other sources to train AI models for its Omniverse 3D world generator, self-driving car systems and "digital human" products.

The project, which has not yet been released to the public and is named Cosmos internally (but is distinct from the company’s existing Cosmos deep learning product), aims to build a state-of-the-art video foundation model that “encapsulates light transport, physics, and intelligent simulation in one place to unlock a variety of downstream applications that are critical to NVIDIA,” according to an email from the project’s leadership to employees.

To collect training videos, Nvidia employees used an open source YouTube video downloader called "yt-dlp." They tried to download full videos from various sources such as Netflix, but focused mainly on YouTube videos. Emails reviewed by 404 Media show that project managers chose to use 20 to 30 virtual machines in Amazon Web Services to download 80 years' worth of video every day.
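To give a sense of scale, the figures above can be turned into a rough per-machine workload. The sketch below is only back-of-the-envelope arithmetic: it assumes 365-day years and an even split of the load across machines, neither of which is stated in the emails, and the function name is made up for illustration.

```python
# Rough scale of "80 years of video per day" spread over 20 to 30 VMs.
# Assumptions (not from the source): 365-day years, load split evenly.
HOURS_PER_YEAR = 365 * 24


def hours_per_vm_per_day(years_per_day: float, vm_count: int) -> float:
    """Hours of video each VM would need to download per day."""
    return years_per_day * HOURS_PER_YEAR / vm_count


total_hours = 80 * HOURS_PER_YEAR
print(f"Total: {total_hours:,} hours of video per day")
for vms in (20, 30):
    per_vm = hours_per_vm_per_day(80, vms)
    print(f"{vms} VMs: {per_vm:,.0f} hours of video per VM per day")
```

Under these assumptions, each virtual machine would have to pull in tens of thousands of hours of footage every day.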

“We are completing the v1 data pipeline and securing the necessary compute resources to build a video data factory that can generate a lifetime of human visual experience worth of training data per day,” Ming-Yu Liu, Nvidia’s vice president of research and head of the Cosmos project, said in a May email.

When asked about Nvidia's use of YouTube videos as training data for its models, a Google spokesperson told 404 Media that the company's "previous position remains valid." Previously, YouTube CEO Neal Mohan said that if OpenAI used YouTube videos to improve its AI video generator Sora, it would be a "clear violation" of YouTube's terms of use.

Likewise, a Netflix spokesperson told 404 Media that the company and Nvidia do not have an agreement for content scraping and that the platform’s terms of service do not allow scraping.

However, Nvidia does not seem to care. Project managers often dismissed the legal questions raised by employees on the project, saying that the decision to scrape videos without permission was an "executive decision" that they did not need to worry about. The question of what counts as fair and ethical use of copyrighted content and of academic, non-commercial data sets was treated as an "open legal question" to be resolved in the future.

The story of NVIDIA's video model project

Like other tech giants, Nvidia employs academic researchers to publish academic results, but internal emails obtained by 404 Media show that Cosmos will apparently be used for commercial purposes.

In March this year, an Nvidia researcher posted on Slack, suggesting that Hollywood movies such as "Avatar" or "The Lord of the Rings" might be more effective for training a model like OpenAI's Sora.

His proposal was subsequently well received within the company, but he added that Hollywood is particularly sensitive to the possibility of copyright infringement by AI. In July 2023, SAG-AFTRA, one of the three major Hollywood unions, with 160,000 members, announced a strike that targeted generative AI products such as ChatGPT and Stable Diffusion. By then, the Writers Guild of America had already been on strike for more than 70 days. Stable Diffusion illustrates the concern: even when the prompt does not name the character and gives only a vague description such as "anime-style plumber," it will directly generate the classic image of Mario.

Under this post, an employee named "Liu" (that is, Ming-Yu Liu, vice president of research at Nvidia) replied: "If the paper is not published publicly, it will not attract the above negative issues. We should first experiment with downloadable videos."



Later, another NVIDIA researcher posted on the intranet that he had found a list of files that should be downloaded first for training video models, but that about 2.3 million original videos were missing from the HD-VILA-100M dataset used by NVIDIA. This ever-expanding list also includes original videos from some well-known YouTubers, such as Marques Brownlee (MKBHD), a tech reviewer whose standing in North America is comparable to that of the Chinese tech creator "Hello, I am Mr. He".

To protect copyright, public video datasets typically include only URL links or YouTube IDs rather than the videos themselves. Once an author deletes the original video, that content is no longer part of the dataset unless the author has explicitly agreed that it can be retained and used.

Although Microsoft explicitly prohibited any commercial use of the HD-VILA-100M dataset in its usage statement, the Nvidia employee who posted the message did not seem to care. He quickly posted the YouTube link corresponding to the list and discussed with his colleagues a solution to circumvent YouTube's anti-crawler mechanism by changing the IP address of an AWS virtual machine.

In addition, NVIDIA employees also turned to YouTube-8M, a large-scale video understanding dataset released by Google. Unlike the Microsoft dataset, whose missing videos they filled in on their own, here they reached a "deal" with YouTube and Google, YouTube's parent company. NVIDIA bought 8 million videos at a price of $0.00625 (about 0.6 US cents) per video and will download them through Google Cloud. Setting aside the question of selling copyrighted material, Google may think it has earned back the advertising revenue for these videos, but NVIDIA had been limited by cloud bandwidth, and downloading through Google Cloud gives it a more stable and predictable connection. From any perspective, therefore, this "deal" seems to benefit NVIDIA.
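The overall size of this "deal" follows directly from the two figures reported above. The short sketch below only multiplies them out; the variable names are illustrative, and `Decimal` is used simply to avoid floating-point rounding on the per-video price.

```python
from decimal import Decimal

# Figures as reported in the article; variable names are illustrative.
PRICE_PER_VIDEO_USD = Decimal("0.00625")
VIDEO_COUNT = 8_000_000

total_cost_usd = PRICE_PER_VIDEO_USD * VIDEO_COUNT
print(f"Total for {VIDEO_COUNT:,} videos: ${total_cost_usd:,.2f}")
```

At the reported per-video price, the whole set of 8 million videos comes to $50,000.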

Even more surprising was the answer one Nvidia employee received after asking on the intranet: "Is it reasonable for us to download YouTube videos like this?"

“This was a high-level decision. We have full approval to use all the data,” came the reply.

The data covered by this decision also includes video works on Netflix, which contain a lot of high-quality face data. After the approval, someone on the company's intranet tagged colleagues with experience "building large data sets" at other large companies and asked them to help.

At the same time, the Cosmos team also considered how to effectively add game footage to training data. Nvidia senior research scientist Jim Fan also encountered "regulatory" obstacles when capturing real-time game footage.

Jim Fan posted:

Update: I've been meeting with the GeForce Now (GFN) folks to work out a plan. We'll be working closely with GFN and related engineering teams to develop methods for capturing live game data, scaling pipelines, and processing data for training. High-quality game video will be a very useful addition to "Our Sora"... We don't have the equipment to capture live game video and action yet, so we don't have statistics yet, but we'll be adding cleaned and processed GFN data to team-vfm as soon as possible.

In March, the video data collection for the Cosmos project reached a milestone: Nvidia completed 100,000 video downloads in two weeks.

“Amazing progress. Now the question is how can we get a large number of high-quality URLs,” Liu replied in the thread.

In late May, project team members received an email about video data strategy, announcing that they had compiled 38.5 million video URLs. The email said: "According to the plan, the focus of next week's collection of videos will still be movies, drone footage, first-person perspective footage, and natural scenery." The email also included a chart showing the percentage of the types of content they downloaded.

This email revealed some key technical information, including four data sets used as training data for the model:

  • Ego-Exo4D: A diverse, large-scale, multi-modal, multi-view video dataset and benchmark collected by 740 camera wearers in 13 cities around the world, capturing 1,286.3 hours of video of skilled human activities.
  • Ego4D: This is a large-scale, egocentric dataset and benchmark suite with over 3,670 hours of videos of daily life activities collected at 74 locations in 9 countries around the world.
  • HOI4D: A large-scale 4D egocentric dataset with rich annotations to facilitate category-level human-object interaction studies. HOI4D was created by researchers from Tsinghua University, Peking University, and Shanghai Qi Zhi Institute, and is licensed under CC BY-NC 4.0, which prohibits commercial use.
  • GeForce Now: Game data.

In another email, members of the Cosmos project said: “The research team is now training a 1 billion parameter model with multiple configurations, each with 16 nodes. This is an important debugging step before further expansion. We plan to reach conclusions in a few weeks and then expand to a 10 billion parameter model.”

“This update is great!” Nvidia CEO Jensen Huang replied to the email. He said: “Many companies are aiming to build a video foundation model. We can definitely make an accelerated pipeline.”

In June, project team members discussed what types of content in the models would be most useful for NVIDIA's products in the context of maintaining competitiveness in the artificial intelligence industry.

"NVIDIA has robots, autonomous driving, Omniverse, and Avatar that most content companies don't have. To maximize the company's growth, the data we organize must be well suited for these 'killer' applications," said a member of the Cosmos project.

There is no doubt that the model that the Cosmos team is developing is intended for commercial use across its multiple products.

Until legislation is passed requiring these companies to fully disclose their training data, they will continue to exploit legal gray areas to scrape copyrighted data. Without leaks of internal emails or intranet conversations, no one would know what is going on behind the scenes, even though models like these can earn companies such as Nvidia, Runway, or OpenAI billions of dollars.

https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/