news

Musk unveils AI behemoth Dojo! In-house supercomputer challenges Nvidia with the equivalent of 8,000 H100s

2024-08-05



This article is reproduced from New Wisdom.

To train the strongest Grok 3, xAI spent 19 days building the world's largest supercomputing cluster, made up of 100,000 H100s.


Musk has also spared no expense on training FSD and the Optimus robot, pouring in enormous computing resources.

The Dojo supercomputer is the cornerstone of Tesla's AI, designed specifically for training FSD neural networks.

Just today, Musk toured Cortex, Tesla's supercomputer cluster at the Texas Gigafactory.

Musk said, "This will be a system of about 100,000 H100/H200 GPUs with large-scale storage for video training of Full Self-Driving (FSD) and Optimus robots."


Beyond NVIDIA GPUs, the cluster is also equipped with Tesla's own HW4, AI5, and Dojo systems.

They will be powered and cooled by a massive system drawing up to 500 megawatts.



At Tesla AI Day in 2021, Musk announced Dojo to the public for the first time.

Now that three years have passed, how is Dojo going?


1. 8,000 H100-equivalent computing power: doubling down

Half a month ago, a netizen posted that by the end of 2024, Tesla would have AI training compute equivalent to 90,000 H100s.


Musk added some additional information:

We use not only NVIDIA GPUs in our AI training system, but also our own AI computers, Tesla HW4 AI (renamed AI4), at a ratio of roughly 1:2. That means about 90,000 H100s plus about 40,000 AI4 computers.


He also mentioned that by the end of this year, Dojo 1 will have roughly 8,000 H100-equivalents of computing power. Not a huge amount, but not small either.


Dojo D1 supercomputing cluster

In fact, in June last year, Musk revealed that Dojo had been online and running useful tasks for several months.


This implies that Dojo had already been involved in training for some tasks.

Recently, on Tesla's earnings call, Musk said Tesla is preparing to launch robotaxis in October and that the AI team will "double down" on Dojo.


Dojo's total computing power is expected to reach 100 exaflops in October 2024.

Assuming a single D1 chip delivers 362 teraflops, reaching 100 exaflops would require more than 276,000 D1 chips, or more than 320,000 Nvidia A100 GPUs.
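The chip counts above can be checked with simple arithmetic. A sketch follows; the 312-teraflops figure for the A100 (BF16, dense) is an assumption drawn from Nvidia's published specs rather than from this article:

```python
# Back-of-the-envelope check of the chip counts implied by a 100-exaflops target.
TARGET_FLOPS = 100e18   # 100 exaflops
D1_FLOPS = 362e12       # 362 teraflops per D1 chip (per Tesla AI Day)
A100_FLOPS = 312e12     # 312 teraflops per A100 (BF16, assumed from Nvidia specs)

d1_chips = TARGET_FLOPS / D1_FLOPS
a100_gpus = TARGET_FLOPS / A100_FLOPS

print(f"D1 chips needed:  {d1_chips:,.0f}")   # 276,243
print(f"A100 GPUs needed: {a100_gpus:,.0f}")  # 320,513
```

Both results land just above the "more than 276,000" and "more than 320,000" figures quoted in the article.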


2. 50 billion transistors: D1 is now in production

At Tesla AI Day 2021, the D1 chip made its debut: 50 billion transistors in a package only the size of a palm.

It has powerful and efficient performance and can quickly handle various complex tasks.


In May this year, the D1 chip began production, using TSMC's 7nm process node.

Ganesh Venkataramanan, former senior director of hardware at Autopilot, once said, "D1 can perform computing and data transmission simultaneously, uses a customized ISA instruction set architecture, and is fully optimized for machine learning workloads."

This is a pure machine learning chip.


Still, the D1 is not as powerful as Nvidia's A100, which is also manufactured using TSMC's 7nm process.

The D1 packs 50 billion transistors into a 645-square-millimeter die, while the A100 has 54 billion transistors on an 826-square-millimeter die and stays ahead of the D1 in performance.
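A quick calculation from the die figures above shows an interesting nuance: the D1 actually packs transistors more densely, so the A100's performance lead comes from its larger die and architecture rather than density. A minimal check, using only the numbers quoted in the article:

```python
# Transistor density implied by the quoted die sizes.
d1_density = 50e9 / 645    # transistors per mm^2, D1 (645 mm^2 die)
a100_density = 54e9 / 826  # transistors per mm^2, A100 (826 mm^2 die)

print(f"D1:   {d1_density / 1e6:.1f} M transistors/mm^2")   # 77.5
print(f"A100: {a100_density / 1e6:.1f} M transistors/mm^2") # 65.4
```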

To achieve higher bandwidth and computing power, the Tesla AI team fused 25 D1 chips into a single tile, operating them as one unified computer system.

Each tile has 9 petaflops of computing power and 36 terabytes per second of bandwidth, and includes power, cooling, and data transfer hardware.

We can think of a single tile as a self-sufficient computer made up of 25 small computers.


By using the wafer-level interconnect technology InFO_SoW (Integrated Fan-Out, System-on-Wafer), 25 D1 chips on the same wafer can achieve high-performance connection and work like a single processor.

Six such tiles form a rack, and two racks form a cabinet.

Ten cabinets make up one ExaPOD.
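The tile-to-ExaPOD hierarchy described above multiplies out as follows, using the per-tile figures quoted earlier:

```python
# Dojo scaling hierarchy: chips -> tile -> rack -> cabinet -> ExaPOD.
CHIPS_PER_TILE = 25
TILES_PER_RACK = 6
RACKS_PER_CABINET = 2
CABINETS_PER_EXAPOD = 10
TILE_PFLOPS = 9  # petaflops per tile, per Tesla

tiles = TILES_PER_RACK * RACKS_PER_CABINET * CABINETS_PER_EXAPOD
chips = tiles * CHIPS_PER_TILE
exaflops = tiles * TILE_PFLOPS / 1000  # 1,000 petaflops = 1 exaflop

print(f"Tiles per ExaPOD:    {tiles}")   # 120
print(f"D1 chips per ExaPOD: {chips}")   # 3000
print(f"ExaPOD compute:      ~{exaflops:.2f} exaflops")  # ~1.08
```

The roughly 1.1 exaflops per pod is consistent with the "ExaPOD" name.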

At AI Day 2022, Tesla said Dojo will be scalable by deploying multiple ExaPODs. All of these together form a supercomputer.


Wafer-scale processors, such as Tesla's Dojo and Cerebras' Wafer Scale Engine (WSE), are far more efficient than multi-processor systems.

Their main advantages include high-bandwidth, low-latency communication between cores, lower interconnect impedance, and higher energy efficiency.

Currently, only Tesla and Cerebras have system-on-wafer designs.

However, putting 25 chips together also poses considerable challenges to the voltage and cooling systems.


Netizens photographed Tesla building a giant cooling system in Texas

Another inherent challenge of wafer-scale chips is that they must rely on on-chip memory, which is inflexible and may not suffice for every type of application.

Tom's Hardware predicts that the next generation may use CoW_SoW (Chip-on-Wafer) technology, stacking HBM4 memory directly on the tiles in 3D.

In addition, Tesla is also developing the next-generation D2 chip to solve the problem of information flow.

Instead of connecting individual chips, D2 puts the entire Dojo tile on a single silicon wafer.

By 2027, TSMC expects to offer more complex wafer-level systems with computing power expected to increase by more than 40 times.

Since the D1's release, Tesla has disclosed neither the number of D1 chips ordered or expected, nor a specific deployment schedule for the Dojo supercomputer.

However, in June of this year, Musk said that in the next 18 months, half of the deployment will be Tesla AI hardware and half will be NVIDIA/other hardware.

That other hardware may well include AMD.


3. Why Dojo?

Autonomous driving consumes computing power

The common impression is that Tesla's main business is limited to producing electric vehicles, with some solar panel and energy-storage businesses on the side.

But Musk's expectations for Tesla go far beyond that.

Most self-driving systems, such as Waymo, a unit of Google parent Alphabet, still rely on traditional sensors such as radar, lidar and cameras as input.

But Tesla takes the "full vision" path. They only rely on cameras to capture visual data, supplemented by high-definition maps for positioning, and then use neural networks to process data to make rapid decisions for autonomous driving.


Intuitively, the sensor-based approach seems the simpler and faster path, and that has proven true.

Waymo has commercialized Level 4 autonomous driving, which is defined by the SAE as a system that can drive itself without human intervention under certain conditions. However, Tesla's FSD (Full Self-Driving) neural network still cannot be separated from human operation.

Andrej Karpathy, who previously served as head of AI at Tesla, said that implementing FSD is basically "building an artificial animal from scratch."

We can think of it as a digital replication of the human visual cortex and brain functions. FSD not only needs to continuously collect and process visual data, identify and classify objects around the vehicle, but also needs to have a decision-making speed comparable to that of humans.



This shows that Musk wants more than just a profitable autonomous driving system. His goal is to create a new kind of intelligence.

But fortunately, he rarely has to worry about insufficient data. Currently, about 1.8 million people have paid the $8,000 fee for FSD (the price was as high as $15,000 before), which means Tesla can collect millions of miles of driving video for training.

On the computing side, the Dojo supercomputer is FSD's training ground; its name is a nod to the dojo, the space where martial arts are practiced.

Nvidia alone is not enough

How sought-after are Nvidia GPUs? Just look at how eager the CEOs of the major tech giants are to get close to Jensen Huang.

Even Musk, who has deep pockets, admitted in a July earnings call that he was "very concerned" that Tesla might not have enough Nvidia GPUs.

“What we’re seeing is that demand for Nvidia hardware is so high that it’s often difficult to get GPUs.”


Currently, Tesla seems to still use Nvidia's hardware to provide computing power for Dojo, but Musk does not seem to want to put all his eggs in one basket.

Especially considering that Nvidia chips command such a high premium and their performance doesn't quite satisfy Musk.

On hardware-software co-design, Tesla holds a view similar to Apple's: the two should be tightly integrated. For a highly specialized system like FSD in particular, that means moving away from highly standardized GPUs toward customized hardware.

At the heart of this vision is Tesla's proprietary D1 chip, which was released in 2021 and began to be mass-produced by TSMC in May this year.


In addition, Tesla is also developing the next-generation D2 chip, hoping to put the entire Dojo block on a single silicon chip to solve the information flow bottleneck.

In the second quarter earnings release, Musk noted that he saw "another avenue to compete with Nvidia through Dojo."

4. Can Dojo succeed?

Even someone as confident as Musk would hesitate when talking about Dojo and say that Tesla might not succeed.

In the long run, developing its own supercomputing hardware could open up new business models for the AI sector.

Musk has said that the first version of Dojo will be tailored for Tesla's visual data annotation and training, which will be very useful for FSD and training Tesla's humanoid robot Optimus.

Future versions will be better suited to general AI training, but that means stepping into Nvidia's moat: software.


Almost all AI software is designed to work with Nvidia GPUs, and adopting Dojo would mean reworking much of the AI software stack, including CUDA- and PyTorch-based workflows.

This leaves Dojo essentially one way out: renting out computing power through a cloud platform similar to AWS or Azure.

Morgan Stanley predicted in a report last September that Dojo could add $500 billion to Tesla's market value by unlocking new revenue streams in the form of robotaxis and software services.

In short, judging from Musk's cautious hardware allocation, Dojo is less an all-in bet than a second insurance policy. But if it succeeds, the dividends could be huge.

References:

https://techcrunch.com/2024/08/03/tesla-dojo-elon-musks-big-plan-to-build-an-ai-supercomputer-explained/

https://www.tomshardware.com/tech-industry/teslas-dojo-system-on-wafer-is-in-production-a-serious-processor-for-serious-ai-workloads

