
The AI battle begins! OpenAI races to build a 100,000-GB200 supercomputer, while Musk starts training on 100,000 H100s at the end of the month

2024-07-16



New Intelligence Report

Editor: Peach

[New Intelligence Introduction] Musk has officially announced that xAI has built the world's largest supercomputer cluster, made up of 100,000 H100s, with training expected to begin at the end of this month. OpenAI, meanwhile, is doubling down: it plans a supercomputer built from 100,000 GB200s that would leave xAI's cluster far behind.

To reach AGI, companies around the world are ready to burn through every GPU they have!

The Information exclusively reported that OpenAI's next supercomputing cluster will be built from 100,000 GB200s.

The GB200 is Nvidia's most powerful AI chip to date.


On the other hand, xAI is building what it claims is the "world's largest supercomputing cluster", consisting of 100,000 H100s, which will begin training at the end of this month.

In his latest post, Musk responded directly to reports that xAI and Oracle had ended negotiations over a server deal.


He said xAI has purchased 24,000 H100s from Oracle and trained Grok 2 on those chips.

Grok 2 is currently undergoing fine-tuning and bug fixes and should be ready for release next month.

At the same time, xAI is building its own cluster of 100,000 H100s, aiming for the fastest possible path to training, with model training planned to start later this month. It will be the most powerful training cluster in the world, and the advantage is self-evident.

In Musk's words: "The reason we decided to build our own 100,000-H100 system, as well as the next generation of major systems after it, is that our core competitiveness depends on being faster than every other AI company. That is the only way to catch up with our competitors. Oracle is an excellent company, and another company (referring to Microsoft) has also shown great promise in taking part in OpenAI's GB200 cluster project. But when our fate depends on being the fastest, we must take control ourselves rather than be a bystander."


In short, in this rapidly changing era, if you want to surpass your competitors, you must ensure that you have an absolute speed advantage.

Oracle and xAI talks fail, a deal worth billions falls through

In May, The Information reported that xAI had been discussing a multi-year deal to lease Nvidia AI chips from Oracle.

The deal, expected to be worth up to $10 billion, has been stalled due to a number of issues.

For one, Musk demanded that the supercomputer be built faster than Oracle believed possible. Oracle was also concerned that xAI's preferred site would not have a sufficient power supply.


To get around this, xAI has had to rely on itself.

Now, xAI is building its own AI data center in Memphis, Tennessee, using Nvidia chips housed in servers supplied by Dell and Supermicro.

According to people involved in the negotiations, Oracle is not involved in the project.

In fact, before this, xAI had already rented many Nvidia chips from Oracle, becoming one of the largest customers of this cloud computing GPU supplier.

That existing rental arrangement will continue for now despite the collapse of the broader talks.

Musk's latest response shows that the number of chips rented from Oracle has grown from 16,000 in May to 24,000.

Stringing together 100,000 H100s

However, Musk still hopes to build a supercomputer equipped with 100,000 Nvidia GPUs, calling it the "Gigafactory of Compute."


He said that xAI needs more chips to train its next-generation AI model, Grok 3.0.

Musk told investors in May that he hoped to have the supercomputer running by the fall of 2025, and that he would personally be responsible for delivering it on time, because it is crucial to the development of LLMs.

He has publicly stated many times that a liquid-cooled training cluster consisting of 100,000 H100s will be online in a few months.


Iterating on the Grok model matters so much because it is part of the X app's subscription package, which starts at $8 per month and bundles a variety of features.

Just last week, xAI released a photo of Musk and other employees in a data center, with servers in the background.


Although the post did not specify a location, in June the president of the Greater Memphis Chamber of Commerce said that xAI was building a supercomputer at a former Electrolux factory in Memphis.


Utility layout for the new xAI facility in Memphis, Tennessee

Dell CEO Michael Dell said the company is helping xAI build the data center.


In addition, Supermicro CEO Charles Liang also released a photo of himself and Musk in the data center, confirming the company's partnership with xAI.


It is worth mentioning that last month Musk announced that xAI had completed an astonishing $6 billion Series B financing, with the company's valuation reaching $24 billion.

The Series B round drew eight investors, including Andreessen Horowitz, Sequoia Capital, Valor Equity Partners, Vy Capital and Fidelity Management & Research.


Musk said that most of the funds from this latest round will go toward building out computing power.


Obviously, the supercomputing project built by xAI is part of its efforts to catch up with OpenAI.

A 100,000-GB200 supercomputer, leased for $5 billion over two years

OpenAI, for its part, is racing to accelerate its own R&D and cannot afford to slack off in the slightest.

Oracle's deal with Microsoft involves a cluster of 100,000 of Nvidia's upcoming GB200 chips, according to two people familiar with the matter.

By the time this supercomputer is built, Musk's 100,000 H100s will look modest by comparison.


One netizen exclaimed that the number of Nvidia GB200 chips in the cluster is roughly equal to the number of transistors in Intel's 80286 processor, adding: "I'm surprised I get to see this in my lifetime."


Others analyzed that "the training performance of GB200 will be 4 times that of H100."

GPT-4 was trained in 90 days using 25,000 A100s (the predecessor to the H100). So you could theoretically train GPT-4 in less than 2 days using 100,000 GB200s, although that’s under ideal conditions and may not be entirely realistic. But it does make you wonder what kind of AI models they could train in 90 days using this supercomputer cluster, which is expected to be operational in the second quarter of 2025.


At GTC 2024, Jensen Huang said that the H100 is 4 times faster than the A100, and the B200 is 3 times faster than the H100.
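For a rough sense of where the "less than 2 days" figure comes from, here is a minimal back-of-envelope sketch in Python. It simply combines the figures quoted above (25,000 A100s for 90 days, Huang's 4x and 3x per-chip speedups, a 100,000-chip GB200 cluster) and assumes perfect linear scaling, which real training runs never achieve.

```python
# Back-of-envelope estimate: how long might a GPT-4-scale run take on 100,000 GB200s?
# All inputs are the figures quoted in the article; perfect linear scaling is assumed,
# which is optimistic (real clusters lose efficiency to networking, failures, etc.).

a100_chips = 25_000          # chips reportedly used to train GPT-4
a100_days = 90               # reported GPT-4 training time, in days

h100_vs_a100 = 4             # Huang, GTC 2024: H100 ~4x faster than A100
b200_vs_h100 = 3             # Huang, GTC 2024: B200 ~3x faster than H100
gb200_vs_a100 = h100_vs_a100 * b200_vs_h100   # ~12x per chip

gb200_chips = 100_000        # planned cluster size

# Total throughput of the new cluster relative to the original GPT-4 cluster
speedup = (gb200_chips / a100_chips) * gb200_vs_a100   # 4 * 12 = 48x

estimated_days = a100_days / speedup
print(f"Estimated GPT-4 training time on 100,000 GB200s: {estimated_days:.1f} days")
# -> roughly 1.9 days under these idealized assumptions
```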


Assuming the companies sign a multiyear deal, the cost of leasing such a cluster could come to around $5 billion over two years, according to people familiar with GPU cloud pricing.
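As a rough illustration of what that price implies, the sketch below computes the implied per-GPU-hour rate. It assumes the full $5 billion covers a continuous two-year lease of all 100,000 GPUs; the report gives no such breakdown, so treat the result as an order-of-magnitude figure only.

```python
# Implied per-GPU-hour lease rate, assuming the reported $5B covers a continuous
# two-year lease of all 100,000 GB200s (an assumption; no breakdown was reported).
total_cost = 5e9            # reported two-year lease cost, USD
gpus = 100_000
hours = 2 * 365 * 24        # two years of continuous use

rate = total_cost / (gpus * hours)
print(f"Implied rate: ${rate:.2f} per GPU-hour")   # ~ $2.85 per GPU-hour
```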

This cluster is expected to be ready by the second quarter of 2025.

Oracle will buy the chips from Nvidia and lease them to Microsoft, which will in turn provide them to OpenAI. This arrangement has become mutually beneficial for Microsoft and OpenAI.

Microsoft invested money in OpenAI and in return gained access to OpenAI's new models.


Oracle plans to place the chips in a data center in Abilene, Texas, according to people involved in the planning.

The deal also shows that Microsoft can't get enough Nvidia chips on its own.

Moreover, it is not common for cloud computing providers to rent servers from each other, but the strong demand for Nvidia chips led to this unusual deal.

Last year, Microsoft reached a similar server rental agreement with CoreWeave to increase the capacity of Nvidia servers.

References:

https://x.com/elonmusk/status/181072739463195075

https://x.com/amir/status/1810722841106821623