news

Musk built the world's most powerful AI cluster in 19 days! The 100,000-GPU liquid-cooled H100 "monster" is about to awaken

2024-07-23



New Intelligence Report

Editor: Editorial Department

【New Intelligence Introduction】Construction of the 100,000 liquid-cooled H100 cluster has officially started, and Musk built the world's most powerful AI training cluster in 19 days.

At 4:20 in the morning, the world's largest supercomputing training cluster in the United States began to roar.


"420" is also Musk's favorite meme, symbolizing freedom, unconstrainedness and anti-traditionalism.

Musk frequently uses "420" in his product pricing, company meeting times, and Starship launch times, etc.

Netizens also joked in the comment section that Musk has a strong sense of ritual and will not start work before 4:20.


In the latest interview, Musk revealed more about the progress of the new supercomputer and xAI's models:

- Grok 2 finished training last month, using about 15,000 H100s

- Grok 2 will be released next month and is comparable to GPT-4

- Grok 3 is being trained on the 100,000 liquid-cooled H100 supercomputer, and training has already begun

- Grok 3 is expected to be released in December, "when it will become the world's most powerful AI"


100,000 liquid-cooled H100s, built in 19 days

What makes this the world's largest supercomputing cluster is its sheer scale: 100,000 H100 GPUs, all liquid-cooled.


What does 100,000 H100s actually mean?

In terms of price: the H100 GPU is the workhorse of AI computing and a hot commodity in Silicon Valley, estimated at $30,000 to $40,000 per unit. An order of 100,000 H100s therefore runs roughly $3-4 billion.

A machine learning PhD student at one of the top five universities in the United States once posted that their lab had zero H100s, and that researchers had to scramble just to grab GPU time.

Fei-Fei Li also stated in the interview that Stanford's natural language processing group only has 64 A100 GPUs.

Musk started with 100,000 cards right out of the gate, a number that left the comment section drooling.


In terms of computing power: it offers roughly 20 times the compute of the 25,000 A100s OpenAI used to train GPT-4.

In terms of power consumption: simply keeping this supercomputing center running requires about 70 MW, comparable to the installed capacity of an ordinary power plant and enough to meet the energy needs of 200,000 people.
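These three figures can be sanity-checked with a quick back-of-envelope calculation. The sketch below relies on assumed inputs, a $30,000-40,000 street price per H100, the published dense BF16 throughput of the A100 and H100 SXM parts, and a 700 W TDP per H100; actual pricing, utilization, and facility overhead will differ.

```python
# Rough sanity check of the article's cost, compute, and power figures.
# All inputs are assumptions based on public spec sheets and reported prices.

NUM_H100 = 100_000
UNIT_PRICE_USD = (30_000, 40_000)   # assumed street price per H100
H100_TDP_W = 700                    # H100 SXM thermal design power
A100_BF16_TFLOPS = 312              # A100 dense BF16 throughput (spec sheet)
H100_BF16_TFLOPS = 989              # H100 SXM dense BF16 throughput (spec sheet)
GPT4_A100_COUNT = 25_000            # A100 count cited in the article

# 1) Order size: 100,000 GPUs at $30k-$40k each.
low, high = (NUM_H100 * p / 1e9 for p in UNIT_PRICE_USD)
print(f"Hardware cost: ${low:.0f}B to ${high:.0f}B")        # ~$3B to ~$4B

# 2) Raw compute vs. the 25,000-A100 cluster used for GPT-4.
ratio = (NUM_H100 * H100_BF16_TFLOPS) / (GPT4_A100_COUNT * A100_BF16_TFLOPS)
print(f"Dense BF16 compute ratio: ~{ratio:.0f}x")            # ~13x dense;
# FP8 training and sparsity push the effective figure toward the ~20x claim.

# 3) GPU power draw alone: 100,000 x 700 W = 70 MW, before counting
#    cooling, CPUs, and networking overhead.
print(f"GPU power draw: {NUM_H100 * H100_TDP_W / 1e6:.0f} MW")
```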

In May this year, Musk expressed his hope to build a "supercomputing factory" by the fall of 2025.

It now appears that, in order to accelerate construction of the supercluster, he chose to purchase current-generation H100 GPUs rather than wait for the newer H200 or the upcoming Blackwell-based B100 and B200 GPUs.

Although the market expects Nvidia's new Blackwell data center GPU to be available by the end of 2024, Musk apparently doesn't have the patience to wait.

The AI arms race is becoming increasingly fierce, and speed decides the winner: whoever can launch products fastest captures the market first.

As a startup company, xAI must seize the initiative in the battle with other giants.

Previously, Musk's negotiations with Oracle over a multi-billion-dollar deal fell through because Musk was unhappy with Oracle's slow pace and believed it was not building out the computing cluster fast enough.


Oracle, for its part, felt that the site xAI had chosen for its supercomputer could not meet the electricity demand. With negotiations over the deal broken down, xAI and Oracle also stopped discussing an expansion of their existing cooperation.

The breakdown of the partnership meant xAI had to go it alone, building its own AI data center with 100,000 H100s in Memphis, Tennessee, to shed the capacity limits of cloud providers such as Oracle.

Musk himself has also said that xAI now has the world's most powerful AI training cluster, far ahead of everyone else.


The world's strongest Grok 3 begins training, to be released at the end of the year

In Musk's latest interview, some details of building a supercomputer were revealed.

It took Musk only about a week to decide that xAI’s new supercomputer would be built in Memphis, according to Ted Townsend, president of the Greater Memphis Chamber of Commerce.

Townsend said Musk and his team chose the Tennessee city after several days of whirlwind negotiations in March because of its abundant electricity and ability to build quickly.

Moreover, it took only 19 days to build the supercomputing center, and Musk also praised the team's excellent work in his tweet.


Charles Liang, CEO of Supermicro, which supplies much of xAI's hardware, commented on Musk's tweet as well, praising the team's execution.


The purpose of such a large training cluster is to train Grok 3.

Earlier this month, Musk announced that Grok 2 would launch at the end of August. Ahead of that release, he also revealed some details about Grok 3 to build momentum for the model he bills as his most powerful yet.

In an April interview with Nicolai Tangen, head of Norway's sovereign wealth fund, Musk said that training Grok 2 would require about 20,000 H100s.

Grok 3 will be released at the end of the year, and it is easy to foresee that a model trained on 100,000 GPUs will outperform Grok 2.

Such a huge supercomputing center naturally requires plenty of talent and technical support, and Musk continues to recruit on Twitter, pushing xAI's advantages in data, talent, and computing power to the limit.


References:

https://x.com/elonmusk/status/1815325410667749760

https://x.com/tsarnick/status/1815493761486708993