
Nvidia's most powerful AI chip is found to have a major design flaw, and the China-only version is accidentally revealed!

2024-08-05



New Intelligence Report

Editor: So Sleepy Peach

【New Intelligence Introduction】Due to a design flaw, Nvidia's most powerful AI chip, Blackwell, really is going to be delayed. Deep-pocketed customers are all complaining, and pre-ordered deliveries are expected to slip by at least three months.

Nvidia GPUs have long been the lifeblood of AI research and development at large-model companies such as OpenAI.

Now, due to a design defect in the Blackwell GPU, Nvidia's deliveries have to be pushed back by three months or even longer.

The Information exclusively reported that TSMC engineers discovered the flaw in recent weeks as they were preparing for mass production of the Blackwell chip.


Just last week, Huang said at SIGGRAPH that NVIDIA has delivered Blackwell engineering samples to customers around the world.

He looked completely relaxed and gave no hint of any unexpected delay.

So, where exactly is the flaw in the chip design?

GB200 contains two Blackwell GPUs and one Grace CPU. The problem lies in the key circuit connecting the two Blackwell GPUs.

It is this problem that has led to a decline in TSMC's GB200 production yield.


The delay in shipping the latest chips means that AI training at major technology companies such as Meta, Google, and Microsoft will be affected.

Moreover, the construction of their data centers will inevitably be delayed.

It is said that Blackwell chips will be shipped in large quantities in the first quarter of next year.

The latest report from SemiAnalysis also details the technical challenges facing Nvidia, the timetable after delayed shipments, and the new system MGX GB200A Ultra NVL36.


Blackwell delayed by three months, complaints all around

Recall that at the GTC 2024 conference, Huang held the Blackwell-architecture GPU in his hands and announced the most powerful performance beast to the world.

In May, he publicly stated that "we plan to ship large quantities of Blackwell architecture chips later this year."

He even said confidently at the earnings conference, "We will see a lot of Blackwell revenue this year."

Nvidia shareholders have high hopes for the Blackwell GPU.


Analysts at KeyBanc Capital Markets estimate that Blackwell chips will lift Nvidia's data center revenue from $47.5 billion in 2024 to more than $200 billion in 2025.

In other words, the Blackwell series of GPUs will play a decisive role in Nvidia's future sales and revenue.

Unexpectedly, the design defects directly affected Nvidia's production targets in the second half of this year and the first half of next year.

Insiders involved in the Blackwell chip design revealed that Nvidia is working with TSMC to run test chip production to resolve the problem as soon as possible.

For now, however, Nvidia's stopgap is to extend shipments of Hopper-series chips while pushing Blackwell GPU production ahead in the second half of this year as planned.

Billions of dollars spent, AI training delayed

Beyond that, the knock-on effects will deal a heavy blow to large-model developers and data center cloud service providers.

To train AI, deep-pocketed customers such as Meta, Microsoft, and Google have spent tens of billions of dollars ordering large quantities of Blackwell chips.

Google has ordered more than 400,000 GB200s; including server hardware, Google's order comes to well over $10 billion.

This year, the giant is expected to spend about $50 billion on chips and other equipment, an increase of more than 50% from last year.

Meta has also placed orders worth at least $10 billion, while Microsoft's orders have increased by 20% in recent weeks.

However, the specific order size of these two companies is not yet known.

People familiar with the matter revealed that Microsoft plans to prepare 55,000 to 65,000 GB200 chips for OpenAI by the first quarter of 2025.

Moreover, Microsoft management originally planned to provide Blackwell-powered servers to OpenAI in January 2025.


It now appears that the original plan will need to be postponed to March or next spring.

Under the original plan, the new supercomputing cluster would begin operating in the first quarter of 2025.

AI companies, including OpenAI, are waiting to use the new chips to develop the next generation of LLMs.

Training these larger models requires several times more computing power, which is what lets them better answer complex questions, automate multi-step tasks, and generate more realistic video.

It can be said that the next generation of super AI depends on Nvidia’s latest AI chip.

A rare delay in history

A chip-order delay on this scale is not only unexpected but also historically rare.

TSMC originally planned to start mass production of Blackwell chips in the third quarter and start large-scale shipments to Nvidia customers from the fourth quarter.

Insiders revealed that the Blackwell chip is now expected to enter mass production in the fourth quarter, and if there are no further problems, servers will be shipped on a large scale in the subsequent quarter.


In fact, as early as 2020, an early version of Nvidia's flagship GPU was also delayed due to problems.

But the risks Nvidia faced were lower at the time, customers were not in a hurry to receive orders, and profits from data centers were relatively small.

This time, it is indeed very rare to discover major design flaws before mass production.

Chip designers typically work with TSMC's wafer fab to conduct multiple production tests and simulations to ensure product feasibility and a smooth manufacturing process before accepting large orders from customers.

It is not common for TSMC to stop a production line and redesign a product that is about to be mass-produced.

TSMC had made full preparations for GB200 mass production, including allocating dedicated machine capacity.

Now, those machines will sit idle until the problem is resolved.

The design flaw will also affect the production and delivery of Nvidia's NVLink server racks, as the company responsible for the servers must wait for new chip samples before finalizing the server rack design.

Forced to release a reworked version

Technical challenges also forced NVIDIA to urgently develop a new system and component architecture, such as the MGX GB200A Ultra NVL36.

This new design will also have a significant impact on dozens of upstream and downstream suppliers.


The GB200 is the most technologically advanced chip in the Blackwell series, and Nvidia made bold technical choices for it at the system level.

The 72-GPU rack has an unprecedented power density of 125kW per rack, compared to the 12kW to 20kW found in most racks in data centers.

Such a complex system has brought a raft of problems: power delivery, overheating, ramping the water-cooling supply chain, leaks in quick-disconnect water-cooling fittings, and various circuit-board complexity issues, catching some suppliers and designers off guard.

However, this is not the reason for Nvidia to reduce production or make major roadmap adjustments.

The core issue that really affects shipments is the design of Nvidia's Blackwell architecture itself.


The Blackwell package is the first package designed for mass production using TSMC’s CoWoS-L technology.

CoWoS-L uses an RDL interposer with local silicon interconnect (LSI) bridge dies embedded in it to carry communication between the various compute and memory dies within the package.


Compared to the currently used CoWoS-S technology, CoWoS-L is much more complicated, but it is the future.

Nvidia and TSMC have a very aggressive growth plan, with a goal of more than one million chips per quarter.

But various problems also arose.

One of the issues is that embedding multiple fine-pitch bump bridges into the organic interposer and silicon interposer may cause mismatches in the coefficient of thermal expansion (CTE) between the silicon chip, bridges, organic interposer and substrate, leading to warping.


The layout of the bridge chips requires very high precision, especially when it comes to the bridges between the 2 main computing chips, as these bridges are critical to supporting 10 TB/s chip-to-chip interconnects.

According to rumors, a major design issue was related to the bridge chip. At the same time, the top few global routing metal layers and the chip's bumps also needed to be redesigned. This was one of the main reasons for the multi-month delay.

Another problem is that TSMC does not have enough CoWoS-L production capacity.

TSMC has built a large amount of CoWoS-S production capacity in the past few years, with Nvidia accounting for the majority of the share.

Now, as Nvidia is rapidly shifting demand to CoWoS-L, TSMC is building a new fab, AP6, for CoWoS-L and transforming existing CoWoS-S capacity at AP3.

To this end, TSMC needs to convert its old CoWoS-S capacity; otherwise that capacity will sit idle and CoWoS-L will ramp more slowly. The conversion will also make the ramp very uneven.

Combining these two issues, TSMC is clearly unable to supply enough Blackwell chips to meet Nvidia's needs.

Therefore, NVIDIA has concentrated almost all its production capacity on GB200 NVL36x2 and NVL72 rack-scale systems, and has cancelled the HGX computing modules equipped with B100 and B200.


As a replacement, Nvidia will launch a Blackwell GPU, the B200A, based on the B102 die and fitted with four HBM stacks, to serve mid- and low-end AI systems.

Interestingly, this B102 chip will also be used in the "special edition" B20 for China.

Since the B102 is a monolithic computing chip, Nvidia can not only package it on CoWoS-S, but also enable other suppliers besides TSMC to do 2.5D packaging, such as Amkor, ASE SPIL, and Samsung.

The B200A will come in 700W and 1000W HGX form factors, with up to 144GB of HBM3E memory and up to 4 TB/s of bandwidth. It’s worth noting that this is less memory bandwidth than the H200.

Next up is the mid-level enhanced version – Blackwell Ultra.

The standard CoWoS-L Blackwell Ultra, namely the B210 or B200 Ultra, not only refreshes memory to up to 288GB of 12-hi HBM3E, but also improves FLOPS performance by up to 50%.

The B200A Ultra will have higher FLOPS, but no memory upgrade.

In addition to the same HGX configuration as the original B200A, the B200A Ultra also introduces a new MGX NVL 36 form factor.


HGX Blackwell's performance/TCO is excellent when training workloads with fewer than 5,000 GPUs.

Still, the MGX NVL36 is an ideal choice for many next-generation models due to its more flexible infrastructure.

Since Llama 3 405B is already close to the limit of an H200 HGX server, the next-generation MoE Llama 4 will definitely not fit into a single Blackwell HGX server node.
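
As a rough illustration of that memory pressure, here is a back-of-the-envelope sketch; the BF16 precision, 141 GB per H200, and overhead allowance are our assumptions, not figures from the report.

```python
# Rough sketch (assumptions ours): why a 405B-parameter dense model already
# strains a single 8-GPU HGX H200 node.
# Assumed: BF16 weights (2 bytes/parameter), 141 GB of HBM per H200, and
# ~20% of memory reserved for KV cache, activations, and framework overhead.

params = 405e9                       # Llama 3 405B parameter count
weights_gb = params * 2 / 1e9        # ~810 GB of weights in BF16

hgx_h200_gb = 8 * 141                # 1128 GB across the 8-GPU node
usable_gb = hgx_h200_gb * 0.8        # ~902 GB after the assumed overhead

print(f"weights {weights_gb:.0f} GB vs usable {usable_gb:.0f} GB "
      f"of {hgx_h200_gb} GB total")
# ~810 GB of weights against ~902 GB usable leaves little headroom, so a larger
# next-generation MoE model plausibly would not fit in one HGX node.
```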

Combined with the price estimate for MGX B200A Ultra NVL36, SemiAnalysis believes that HGX B200A will not sell very well.

MGX GB200A Ultra NVL36 architecture

The MGX GB200A NVL36 SKU is an air-cooled 40kW/rack server with 36 GPUs fully interconnected via NVLink.

Each rack will be equipped with 9 compute trays and 9 NVSwitch trays. Each compute tray is 2U and contains 1 Grace CPU and 4 700W B200A Blackwell GPUs. Each 1U NVSwitch tray has only 1 switch ASIC, and the bandwidth of each switch ASIC is 28.8 Tbit/s.
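
A quick tally of those counts, using only the figures quoted above, gives the per-rack totals:

```python
# Sketch: totals implied by the MGX GB200A NVL36 configuration described above.
compute_trays = 9
gpus_per_compute_tray = 4        # 700W B200A each
grace_cpus_per_compute_tray = 1

nvswitch_trays = 9
asics_per_nvswitch_tray = 1      # 28.8 Tbit/s per switch ASIC

total_gpus = compute_trays * gpus_per_compute_tray              # 36 GPUs
total_cpus = compute_trays * grace_cpus_per_compute_tray        # 9 Grace CPUs (4:1 GPU:CPU)
total_switch_asics = nvswitch_trays * asics_per_nvswitch_tray   # 9 NVSwitch ASICs

print(total_gpus, total_cpus, total_switch_asics)
```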

In comparison, each compute tray of the GB200 NVL72/36x2 contains 2 Grace CPUs and 4 1200W Blackwell GPUs.


As each rack consumes only 40kW and can be air cooled, existing data center operators can easily deploy the MGX NVL36 without re-adjusting their infrastructure.

Unlike the GB200 NVL72/36x2, the 4-GPUs-to-1-CPU ratio means each GPU gets only half the C2C bandwidth.

Therefore, the MGX NVL36 cannot use the C2C interconnect and instead relies on the PCIe switch integrated in the ConnectX-8 for GPU-to-CPU communication.

Furthermore, unlike all other existing AI servers (HGX H100/B100/B200, GB200 NVL72 / 36x2, MI300), each backend NIC will now be responsible for 2 GPUs.

This means that although the ConnectX-8 NIC design can provide 800G of backend networking, each GPU can only access 400G of backend InfiniBand/RoCE bandwidth. (Also half of that in GB200 NVL72/36x2)
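
The per-GPU arithmetic behind that statement, as a minimal sketch assuming each NIC's bandwidth is split evenly between its two GPUs:

```python
# Sketch: per-GPU backend bandwidth implied by the figures above.
# Assumption: each 800G ConnectX-8 NIC splits its bandwidth evenly between
# the two GPUs it serves.

nic_bw_gbps = 800            # backend networking per ConnectX-8 NIC
gpus_per_nic = 2             # MGX GB200A NVL36: one backend NIC per 2 GPUs
mgx_per_gpu_gbps = nic_bw_gbps / gpus_per_nic     # 400G per GPU

gb200_per_gpu_gbps = 800     # GB200 NVL72/36x2: one 800G backend NIC per GPU

print(mgx_per_gpu_gbps, gb200_per_gpu_gbps)       # 400.0 vs 800
```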


The heart of the GB200 NVL72/NVL36x2 compute tray is the Bianca board, which contains 2 Blackwell B200 GPUs and 1 Grace CPU.

Since each compute tray is equipped with 2 Bianca boards, it will be equipped with a total of 2 Grace CPUs and 4 1200W Blackwell GPUs.


In contrast, the MGX GB200A NVL36's CPU and GPU will be located on different PCBs, similar to the design of the HGX server.

But unlike the HGX servers, the four GPUs in each compute tray will be subdivided into two 2-GPU boards, each of which is equipped with a Mirror Mezz connector similar to the Bianca board.

These Mirror Mezz connectors will then be used to connect to the ConnectX-8 midplane and connect the ConnectX-8 ASIC with its integrated PCIe switch to the GPU, local NVMe storage, and Grace CPU.

Since the ConnectX-8 ASIC sits very close to the GPU, no retimer is needed between the GPU and the ConnectX-8 NIC, whereas the HGX H100/B100/B200 do require one.

In addition, since there is no C2C interconnection between the Grace CPU and the Blackwell GPU, the Grace CPU will be located on a completely independent PCB, the CPU motherboard. This motherboard will contain the BMC connector, CMOS battery, MCIO connector, etc.


The NVLink bandwidth per GPU will be 900GB/s per direction, which is the same as the GB200 NVL72 / 36x2. This significantly increases GPU-to-GPU bandwidth per FLOP, giving the MGX NVL36 an advantage in certain workloads.

Since there is only one layer of switches connecting the 36 GPUs, only nine NVSwitch ASICs are needed to provide non-blocking networking.

Also, since each 1U switch tray carries only one 28.8Tbit/s ASIC, it is very easy to air cool, just as 25.6Tbit/s 1U switches such as the Quantum-2 QM9700 already are.
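
As a rough consistency check of the non-blocking claim, the sketch below treats the quoted 28.8 Tbit/s as each ASIC's per-direction capacity (our assumption):

```python
# Sketch: why nine NVSwitch ASICs suffice for a non-blocking single-layer
# NVLink fabric across 36 GPUs (per-direction figures assumed).

gpus = 36
per_gpu_gbytes = 900                        # GB/s per direction per GPU (quoted)
per_gpu_tbits = per_gpu_gbytes * 8 / 1000   # 7.2 Tbit/s per direction

demand_tbits = gpus * per_gpu_tbits         # 259.2 Tbit/s of GPU NVLink demand

asics = 9
asic_tbits = 28.8                           # per ASIC (per direction, assumed)
capacity_tbits = asics * asic_tbits         # 259.2 Tbit/s of switching capacity

print(demand_tbits, capacity_tbits)         # equal, so one switch layer is non-blocking
```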


On the backend network, since each compute tray has only 2 800G ports, it will use a 2-rail optimized end-of-row network.

For every 8 GB200A NVL36 racks, there will be 2 Quantum-X800 QM3400 switches.
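
The port math behind that ratio, as a sketch assuming one 800G port per rail per compute tray and the 72 downstream ports per Q3400 quoted later in the article:

```python
# Sketch: port math for the 2-rail end-of-row backend network described above.

racks = 8
trays_per_rack = 9
ports_per_tray = 2                           # one 800G port per rail (assumed)

total_ports = racks * trays_per_rack * ports_per_tray   # 144 backend ports

switches = 2                                 # one Quantum-X800 QM3400 per rail
downstream_per_switch = 72                   # 800G downstream ports (quoted later)

print(total_ports, switches * downstream_per_switch)    # 144 == 144: fully used
```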


With 700W per GPU, the GB200A NVL36 will likely draw around 40kW per rack, which means roughly 4kW of heat to dissipate in each 2U compute tray.

This will require specially designed heat sinks and high-speed fans for air cooling.
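
One plausible back-of-the-envelope for those totals; only the 700W-per-GPU figure and the ~4kW/~40kW results come from the article, the remaining wattages are assumptions:

```python
# Sketch: where ~4 kW per 2U tray and ~40 kW per rack can come from.
# Assumed (not from the article): Grace CPU, per-tray overhead, and
# NVSwitch-tray wattages.

gpu_w, gpus_per_tray = 700, 4
grace_w = 500                  # assumed
tray_overhead_w = 700          # assumed: NIC, DIMMs, fans, VRM losses

tray_w = gpus_per_tray * gpu_w + grace_w + tray_overhead_w   # ~4000 W per 2U tray

compute_trays, nvswitch_trays = 9, 9
nvswitch_tray_w = 400          # assumed: one 28.8 Tbit/s ASIC plus fans

rack_w = compute_trays * tray_w + nvswitch_trays * nvswitch_tray_w
print(tray_w, rack_w)          # ~4,000 W per tray, ~39,600 W per rack
```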


Challenges of deploying the MGX GB200A NVL36

Since the GB200A NVL36 relies entirely on air cooling, squeezing a dedicated PCIe switch into the 2U chassis on top of the PCIe NIC at its front would significantly increase the thermal-management challenge.

Therefore, making a custom backend NIC on the GB200A NVL36 is basically impossible.

Since many machine learning dependencies are compiled and optimized for x86 CPUs, and the Grace CPU and Blackwell GPU are on separate PCBs, it is likely that there will also be an x86 + B200A NVL36 version.

However, although x86 CPUs can provide higher peak performance, their power consumption will be 100W higher, greatly increasing the thermal management challenges for OEMs.

In addition, with Grace CPU sales volumes in mind, even if Nvidia launches an x86 B200A NVL36 solution, it will push customers toward the GB200A NVL36.

Of course, GB200A NVL36 also has its own selling point - a 40kW air cooling system per rack.

After all, many customers can't afford the liquid cooling and power infrastructure required for a GB200 NVL72 that draws about 125 kW per rack (or a 36x2 with a total power consumption of more than 130 kW).

The H100 has a TDP of 700W and currently uses a 4U-high 3DVC (3D vapor chamber), while the 1000W H200 uses a 6U-high 3DVC.

In contrast, the MGX B200A NVL36 also has a 700W TDP but only a 2U chassis, so space is quite limited. A horizontally extended, balcony-style heat sink will therefore be needed to increase the heat sink's surface area.


In addition to requiring a larger heatsink, the fans would also need to provide more powerful airflow than the GB200 NVL72/36x2 2U compute tray or HGX 8 GPU design.

According to estimates, in a 40kW rack, 15% to 17% of the total system power will be used for internal chassis fans. In comparison, the HGX H100 fans only consume 6% to 8% of the total system power.

The large amount of fan power required to keep the MGX GB200A NVL36 running properly makes it a very inefficient design.
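
In absolute terms, the percentages above work out roughly as follows (a sketch at the quoted 40kW rack power):

```python
# Sketch: fan power implied by the percentages quoted above, at a 40 kW rack.

rack_w = 40_000
mgx_fan_w = [rack_w * s for s in (0.15, 0.17)]   # 6,000 - 6,800 W on fans
hgx_fan_w = [rack_w * s for s in (0.06, 0.08)]   # 2,400 - 3,200 W at the same scale

print(mgx_fan_w, hgx_fan_w)
# Roughly the power budget of eight or nine 700 W GPUs goes to airflow alone.
```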

Why was the GB200A NVL64 cancelled?

Before Nvidia finalized the MGX GB200A NVL36, they were also trying to design an air-cooled NVL64 rack - 60kW power consumption, equipped with 64 GPUs fully interconnected via NVLink.

However, after extensive engineering analysis, SemiAnalysis determined that the product was not viable and would not be brought to market.

In the proposed NVL64 SKU, there are 16 compute trays and 4 NVSwitch trays. Each compute tray is 2U and contains 1 Grace CPU and 4 700W Blackwell GPUs, just like the MGX GB200A NVL36.

The main modification is to the NVSwitch trays: instead of cutting the GB200's 2 NVSwitch ASICs per tray down to 1, Nvidia tried to increase the count to 4 switch ASICs per tray.


Obviously, it is almost impossible to cool such a huge machine with such high power consumption by air alone. (Nvidia proposed 60kW, SemiAnalysis estimated 70kW)

This typically requires the use of a rear-door heat exchanger, which defeats the purpose of an air-cooled rack architecture, since it still relies on a liquid cooling supply chain. In addition, this solution still requires facility-level modifications in most data centers to deliver cooling water to the rear-door heat exchanger.

Another very difficult thermal issue is that the NVSwitch tray will contain four 28.8Tbit/s ASIC switches in a 1U chassis, requiring nearly 1500W of cooling power.

In isolation, achieving 1500W in a 1U chassis is not difficult. However, when you consider that the Ultrapass cables from the ASIC switches to the backplane connectors block a lot of airflow, the cooling challenge becomes significant.

Given the extreme speed with which the air-cooled MGX NVL rack needed to be brought to market, Nvidia tried to deliver the product within six months of starting the design. However, designing new switch trays and supply chains was difficult for an industry already stretched to its limits.


Another major problem with GB200A NVL64 is that each rack has 64 800G backend ports, but each XDR Quantum-X800 Q3400 switch is equipped with 72 800G downstream ports. In other words, each switch will have 16 800G ports unused.

Having unused ports on expensive back-end switches can significantly impact network performance and total cost of ownership because switches are expensive, especially high-port-density modular switches like the Quantum-X800.


Additionally, using 64 GPUs in the same NVLink domain is not ideal.

On the surface, 64 seems like a good number because it is divisible by 2, 4, 8, 16, and 32, which suits different parallel configurations.

For example, tensor parallelism TP=8 with expert parallelism EP=8, or TP=4 with fully sharded data parallelism FSDP=16.

Unfortunately, due to hardware unreliability, Nvidia recommends keeping at least one compute tray per NVL rack as a hot spare, so that GPUs can be taken offline for maintenance.

If each rack does not have at least 1 compute tray in hot spare, even a 1 GPU failure can cause the entire rack to be forced out of service for a significant period of time. This is similar to how a 1 GPU failure on an 8-GPU HGX H100 server would force all 8 H100s to be out of service.

If at least one compute tray is kept as a hot spare, only 60 GPUs per rack can handle the workload, which eliminates the advantages mentioned above.


The NVL36×2 or NVL72 is equipped with 72 GPUs, which means that users can not only use 2 computing trays as hot spares, but also still have 64 GPUs available on each rack.

The GB200A NVL36 can keep one compute tray as a hot spare and still retain 2, 4, 8, and 16 as factors for its parallelism schemes.
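
The divisibility argument can be summarized in a small sketch (ours, using 4 GPUs per compute tray as stated above):

```python
# Sketch: which power-of-two parallelism degrees divide the usable GPU count
# once hot-spare compute trays (4 GPUs each) are set aside.

def usable(total_gpus, spare_trays, gpus_per_tray=4):
    n = total_gpus - spare_trays * gpus_per_tray
    return n, [d for d in (2, 4, 8, 16, 32, 64) if n % d == 0]

print(usable(64, 0))   # (64, [2, 4, 8, 16, 32, 64])  NVL64 on paper
print(usable(64, 1))   # (60, [2, 4])                 NVL64 with 1 spare tray
print(usable(72, 2))   # (64, [2, 4, 8, 16, 32, 64])  NVL72 with 2 spare trays
print(usable(36, 1))   # (32, [2, 4, 8, 16, 32])      GB200A NVL36 with 1 spare
```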

Impact on the supply chain

According to SemiAnalysis's speculation, GB200 NVL72/36x2 shipments will be reduced or delayed, while B100 and B200 HGX shipments will be significantly reduced.

Meanwhile, Hopper shipments will increase from the fourth quarter of 2024 to the first quarter of 2025.

In addition, GPU orders will shift from HGX Blackwell and GB200 NVL36x2 to MGX GB200A NVL36 in the second half of the year.

This will impact all ODMs and component suppliers as shipment and revenue plans will change significantly between Q3 2024 and Q2 2025.

References:

https://www.theinformation.com/articles/nvidias-new-ai-chip-is-delayed-impacting-microsoft-google-meta?rc=epv9gi

https://www.semianalysis.com/p/nvidias-blackwell-reworked-shipment