
Nvidia's version of Sora accused of illegally scraping large amounts of data; the company rejects the claim

2024-08-06


Bai Jiao, reporting from Aofei Temple
QbitAI (Quantum Bit) | Official Account QbitAI

Nvidia's version of Sora has come to light.

It is codenamed Cosmos, and Ming-Yu Liu, vice president of research, is in charge.

However, several leaked internal documents also indicate that the company has been scraping data illegally.



(This is hardly the first or second time...)

Employees are tacitly permitted to scrape data from the internet every day without authorization or approval, including from platforms such as YouTube and Netflix.

In total, the visual data scraped each day is roughly equal to what a person perceives over 80 years.

Nvidia's response: our approach is totally legal!



Nvidia's version of Sora exposed: codename Cosmos

According to leaked documents obtained by 404 Media, Nvidia illegally scrapes data every day to train new models.

According to leaked emails, the goal of Cosmos is to build a state-of-the-art video foundation model that combines simulations of light transport, physics, and intelligence to unlock a variety of downstream applications.

Examples include the Omniverse 3D world generator, self-driving car systems, and digital human products.

Ming-Yu Liu, vice president of research at Nvidia, serves as the project leader for Cosmos.



He is also an IEEE Fellow. He led the NVIDIA Deep Imagination research group and launched products such as NVIDIA Picasso [Edify], NVIDIA Canvas [GauGAN], and NVIDIA Maxine [LivePortrait].

An email from May read:

We are completing the v1 data pipeline and securing the necessary compute resources to build a video data factory that can produce a human lifetime's worth of visual experience as training data every day.

A link shared by Nvidia principal scientist Francesco Ferroni points to a spreadsheet that brings together various video datasets, including MovieNet (a database of 60,000 movie trailers), WebVid, InternVid-10M, and several datasets of internally captured video game footage.

According to a former employee, staff were asked to scrape data from sources such as YouTube and Netflix.

They used an open-source YouTube video downloader called yt-dlp, running it on virtual machines whose IP addresses were refreshed to avoid being blocked by YouTube.

In response, Nvidia told 404 Media:

“We respect the rights of all content creators and believe that our models and research work fully comply with the letter and spirit of copyright law.
Copyright law protects certain forms of expression, but not facts, ideas, data, or information. Anyone is free to learn facts, ideas, data, or information from other sources and use them to express their own views. Fair use also protects the ability to use a work for transformative purposes, such as model training.”

Google pointed 404 Media to a link: in April this year, YouTube's CEO said that if OpenAI had used YouTube videos to train Sora, it would be a clear violation of YouTube's terms of use.

Netflix said it did not have an agreement with Nvidia to extract content, and that the platform's terms of service did not allow scraping of content.

Interestingly, on the same day, a YouTuber is seeking a class action lawsuit against OpenAI, accusing the company of using millions of YouTube video transcripts to train its generative AI models without notifying or compensating the video owners.

It is not uncommon for large companies like these to be exposed for illegally scraping data.

But it must be said that this kind of raw data is really useful...

Nvidia has previously used game videos to improve the quality of training data.

A study recently featured on the cover of Nature suggests that large models trained on original internet data have a first-mover advantage: their data quality is the highest, and their performance is correspondingly the best.

As AI-generated data becomes more and more abundant, large models become more prone to collapse.

Garbage in, garbage out.

What do you think about this matter?

Reference Links:
[1] https://techcrunch.com/2024/08/05/youtuber-files-class-action-suit-over-openais-scrape-of-creators-transcripts/
[2] https://www.gamedeveloper.com/business/report-nvidia-used-scraped-video-game-footage-to-train-ai-products
[3] https://www.404media.co/nvidia-ai-scraping-foundational-model-cosmos-project/
[4] https://pivot-to-ai.com/2024/08/05/nvidia-caught-ingesting-as-much-of-youtube-as-possible/