
Musk bought 100,000 H100s to build the world's most powerful AI supercomputer, and next-generation model training has begun

2024-07-23


Mingmin from Aofei Temple
Quantum Bit | Public Account QbitAI

Musk has built the world's most powerful AI cluster!

This explosive news was officially announced by Musk himself on Twitter.

At 4:20 a.m. local time, the Memphis Supercluster, jointly built by xAI, X, and NVIDIA, began training.
It consists of 100,000 H100s and is currently the most powerful training cluster in the world!



This scale far exceeds that of Frontier, the world's most powerful supercomputer.

xAI's founding members chimed in:

When we founded this company one year ago, our goal was to achieve three advantages: data advantage, talent advantage, and computing advantage.
Starting today, we have all three!



Under Musk's post, Supermicro, a close NVIDIA partner known for its liquid-cooling technology, also offered congratulations. Its founder, Charles Liang, said:

So glad we are making history with Musk.



Musk added that completing the cluster gives xAI a significant advantage in training the world's most powerful model within this year.



According to previous statements, training Grok-3 requires 100,000 H100s.



△ Cluster aerial view

Not only that: in June this year he said it was not worthwhile to put 1 GW of power into H100s, and that next summer a cluster of 300,000 B200s may come online.



Building its own cluster gives xAI more confidence

In May this year, The Information reported that Musk would build a supercomputing cluster consisting of 100,000 H100s by the fall of 2025 in cooperation with Oracle.

xAI was reported to be spending $10 billion to rent Oracle's servers.

At the time, some questioned why a cluster that would not be completed until the following year would use previous-generation hardware.

After all, NVIDIA has already launched the B100 and B200, based on the new Blackwell architecture, which are far more efficient than the H100 for training large models.

Looking back, perhaps the date in that report was simply wrong; completion this year makes far more sense.



Just recently, Musk also responded to reports that the supercomputing partnership with Oracle had been terminated.

He said xAI had obtained 24,000 H100s from Oracle to train Grok-2, which confirms that the server-rental cooperation between xAI and Oracle is still ongoing.

For the 100,000-GPU H100 cluster, however, xAI chose to build it itself, and at full speed: installing all 100,000 cards reportedly took just 19 days.

We have to take the steering wheel ourselves.



Later news showed that Dell and Supermicro became Musk's new partners.

The CEOs of Dell and Supermicro both tweeted recently about the collaboration and posted photos of the data center.



During the cluster construction process, Musk personally visited the site.

Musk also revealed on Twitter that Grok is currently training in Memphis and that Grok-2 will launch in August.



It is worth mentioning that Oracle had previously raised concerns about the power supply at the location where the cluster would be built.

According to estimates, 100,000 H100s will need roughly 150 megawatts from the power grid, but Musk seems to have solved this problem.

The latest news shows that the cluster has temporarily secured 8 megawatts; after an agreement is signed on August 1, it will get 50 megawatts. There are already 32,000 cards online, and 100% will be online in the fourth quarter, which is enough to support training and running a GPT-5-scale model.
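For a rough sense of where the ~150-megawatt figure comes from, here is a back-of-envelope sketch. The 700 W per GPU is NVIDIA's published H100 SXM power spec; the server-overhead and PUE values below are illustrative assumptions, not numbers from the article.

```python
# Back-of-envelope power estimate for a 100,000-GPU H100 cluster.
# Assumptions (illustrative, not from the article):
#   - H100 SXM board power: 700 W per GPU (NVIDIA spec)
#   - ~35% extra per GPU for CPUs, NICs, fans, etc.
#   - PUE of ~1.3 to cover cooling and facility overhead

NUM_GPUS = 100_000
GPU_TDP_W = 700          # H100 SXM board power
SERVER_OVERHEAD = 1.35   # assumed per-GPU server overhead
PUE = 1.3                # assumed power usage effectiveness

it_load_mw = NUM_GPUS * GPU_TDP_W * SERVER_OVERHEAD / 1e6
facility_mw = it_load_mw * PUE

print(f"IT load:       {it_load_mw:.0f} MW")   # ~94 MW
print(f"Facility draw: {facility_mw:.0f} MW")  # ~123 MW, same ballpark as the ~150 MW estimate
```

Under these assumptions the full cluster lands in the low-hundreds-of-megawatts range, which is why the 8 MW and even 50 MW interim allocations can only power a fraction of the cards.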



In short, one thing is certain: the AI giants all believe that computing power is more reliable when held in their own hands, and that it is worth spending enormous sums on it.

At an estimated $30,000 to $40,000 per H100, Musk's supercomputing cluster is worth around $4 billion (more than 29 billion RMB) in GPUs alone.
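That valuation is straightforward multiplication; a minimal sketch using only the per-unit price range cited above:

```python
# Rough cluster cost from the per-unit price range cited above.
NUM_GPUS = 100_000
UNIT_PRICE_USD = (30_000, 40_000)   # reported H100 price range

low, high = (NUM_GPUS * p for p in UNIT_PRICE_USD)
print(f"GPU cost alone: ${low/1e9:.0f}B to ${high/1e9:.0f}B")  # $3B to $4B
```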

Earlier news said that Microsoft and OpenAI are developing a $100 billion data center project called "Stargate".

A deal is being worked out between Oracle and Microsoft for 100,000 B200s, according to people familiar with the matter. The cluster could be ready by next summer.

In addition, Meta has been reported to be running lavish supercomputing clusters of its own, and cloud vendors such as AWS are also pouring more money into data centers.

References:
[1]https://x.com/elonmusk/status/1810727394631950752
[2]https://x.com/elonmusk/status/1815325410667749760
[3]https://x.com/dylan522p/status/1815494840152662170
[4]https://x.com/MichaelDell/status/1803385185984974941