
Nvidia's cut-down B200A is exposed! Its strongest chip architecture is hard to manufacture: where production capacity falls short, the cutting knife makes up the difference

2024-08-05


Mengchen from Aofei Temple
Quantum Bit | Public Account QbitAI

Rumors are flying that Nvidia's most powerful chip, the B200, has been forced into a three-month delay.

Jensen Huang's solution has now surfaced: a cut-down chip, the B200A.

Is this"If the production capacity is insufficient, the knife skills will make up for it"



Yes, according to SemiAnalysis, the main problem the B200 has run into is insufficient production capacity; more specifically, TSMC's new packaging process, CoWoS-L, cannot ramp up fast enough.

The cut-down B200A will first be used to meet demand for mid- and low-end AI systems.

The cut-down B200A: reduced memory bandwidth

Why is the B200A considered a cut-down part?

The key spec is memory bandwidth: 4TB/s, half of the 8TB/s the B200 advertised at its launch event earlier this year.



Behind this, the packaging process falls back from CoWoS-L to CoWoS-S, and the B200A is even said to be compatible with non-TSMC 2.5D packaging technologies such as Samsung's.

Broadly speaking, CoWoS advanced packaging currently comes in three variants: CoWoS-S, CoWoS-R, and CoWoS-L. The main difference lies in the interposer.

The interposer sits between the chip die and the printed circuit board, routing signals between the chip and the package substrate while providing mechanical support and heat dissipation.

CoWoS-S has the simplest structure: the interposer is essentially a slab of silicon.



CoWoS-R uses RDL (redistribution layer) technology; the interposer is a thin, multi-layer metal structure.



CoWoS-L is the most complex: adding LSI (Local Silicon Interconnect) chips achieves higher wiring density and allows larger package sizes.
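For a quick side-by-side comparison, here is a minimal Python sketch that restates the three variants above as a small lookup table; the field wording paraphrases this article rather than TSMC's documentation.

```python
# Illustrative summary of the three CoWoS variants described above.
# Field values paraphrase this article; they are not official TSMC specifications.
cowos_variants = {
    "CoWoS-S": {
        "interposer": "monolithic silicon interposer",
        "notes": "simplest structure; mature, with large installed capacity",
    },
    "CoWoS-R": {
        "interposer": "RDL (redistribution layer): thin, multi-layer metal",
        "notes": "replaces the silicon slab with a redistribution-layer interposer",
    },
    "CoWoS-L": {
        "interposer": "RDL plus embedded LSI (Local Silicon Interconnect) chips",
        "notes": "highest wiring density, largest package sizes; used by the B200",
    },
}

for name, info in cowos_variants.items():
    print(f"{name}: {info['interposer']} -- {info['notes']}")
```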



TSMC introduced CoWoS-L because the older variants were struggling to keep growing in size and performance.

For example, on AMD's MI300 AI accelerator the CoWoS-S interposer has already been expanded to 3.5 times the original standard size, yet it still struggles to meet the future performance growth of AI chips.

But now there are reports that CoWoS-L is hitting problems in its capacity ramp-up, possibly because the coefficients of thermal expansion of the silicon dies, interposer, and substrate are mismatched, causing warping and forcing a redesign.
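To see why a CTE mismatch leads to warping, here is a back-of-the-envelope strain estimate. It is a sketch only: the CTE values and temperature swing are typical textbook figures assumed for illustration, not numbers from the SemiAnalysis report.

```python
# Back-of-the-envelope CTE-mismatch strain between a silicon die/interposer
# and an organic substrate. All values are assumed, illustrative figures,
# NOT taken from the SemiAnalysis report.
alpha_silicon = 2.6e-6     # CTE of silicon, 1/°C (typical)
alpha_substrate = 15e-6    # CTE of an organic substrate, 1/°C (typical)
delta_T = 200.0            # temperature swing during packaging/reflow, °C (assumed)

# Mismatch strain: delta_epsilon = (alpha_substrate - alpha_silicon) * delta_T
mismatch_strain = (alpha_substrate - alpha_silicon) * delta_T
print(f"CTE mismatch strain ≈ {mismatch_strain:.3%}")
# Spread across an interposer several reticles wide, a strain of this size
# produces visible warpage -- the failure mode cited for the CoWoS-L ramp.
```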

TSMC had previously built up a large amount of CoWoS-S capacity, with Nvidia taking the largest share. Nvidia can shift its demand back quickly, but TSMC needs time to convert that capacity to the new process.

There are also reports that the B200A's core die (internal model B102) will later be used to build a special-edition B20. I won't go into details; those who know, know.

Training large models on the B200 also faces other challenges

Blackwell's flagship configuration is the GB200 NVL72, a "new-generation compute unit" with 36 CPUs + 72 GPUs per cabinet.

Compute is ample: one cabinet delivers up to 720 PFlops of FP8 training compute, close to an entire H100-era DGX SuperPod cluster (1000 PFlops).

But power consumption is also very high. According to SemiAnalysis estimates, power density reaches roughly 125kW per cabinet, an unprecedented figure that brings challenges in power delivery, cooling, network design, parallelization, reliability, and more.
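Dividing the cabinet-level figures above by the GPU count gives a rough per-GPU view; this is a minimal sketch using only the numbers quoted in this article.

```python
# Rough per-GPU arithmetic from the GB200 NVL72 figures quoted above.
gpus_per_cabinet = 72
fp8_training_pflops = 720      # per cabinet, as stated above
cabinet_power_kw = 125         # SemiAnalysis estimate quoted above
h100_superpod_pflops = 1000    # H100-era DGX SuperPod reference point

pflops_per_gpu = fp8_training_pflops / gpus_per_cabinet
watts_per_gpu_slot = cabinet_power_kw * 1000 / gpus_per_cabinet  # incl. CPUs, switches
cabinets_per_superpod_equiv = h100_superpod_pflops / fp8_training_pflops

print(f"FP8 training compute per GPU: ~{pflops_per_gpu:.0f} PFlops")
print(f"Power per GPU slot (cabinet-level average): ~{watts_per_gpu_slot:.0f} W")
print(f"Cabinets to match a 1000 PFlops SuperPod: ~{cabinets_per_superpod_equiv:.1f}")
```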

In fact, the industry has not yet fully tamed the 10,000-card H100 clusters already used for large-model training.

For example, the Llama 3.1 technical report noted that during training, a failure occurred on average once every three hours, and 58.7% of them were caused by GPUs.

Of the total 419 failures, 148 were caused by various GPU failures (including NVLink failures), and 72 could be specifically attributed to HBM3 memory failures.
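As a quick sanity check, here are the shares implied by those figures (a sketch using only the numbers quoted here; note that 148/419 is well below the 58.7% cited above, which suggests the report's GPU figure also counts categories beyond these 148 hardware faults).

```python
# Shares implied by the Llama 3.1 failure counts quoted above.
total_failures = 419
gpu_failures = 148   # "various GPU failures (including NVLink failures)"
hbm3_failures = 72   # failures specifically attributed to HBM3 memory

print(f"GPU failures:  {gpu_failures}/{total_failures} = {gpu_failures / total_failures:.1%}")
print(f"HBM3 failures: {hbm3_failures}/{total_failures} = {hbm3_failures / total_failures:.1%}")
```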



So overall, even once Huang finally ships the B200, it will take the AI giants more time to actually build out B200 clusters and put them to work on large-scale model training.

GPT-5, Claude 3.5 Opus, Llama 4 and other models that have already started or nearly finished training probably won't get to use it; we will have to wait for the generation after that to witness Blackwell's power.

One More Thing

In response to rumors of the B200 delay, Nvidia gave an official response:

Demand for Hopper remains strong, and broad Blackwell sampling has already begun. Production is expected to ramp in the second half of the year.

No specific response was given as to whether there will be a delay of three months.

However, Morgan Stanley was more optimistic in its latest report, believing that production would only be suspended for about two weeks.

Reference Links:
[1]https://x.com/dylan522p/status/1820200553512841239
[2]https://www.semianalysis.com/p/nvidias-blackwell-reworked-shipment
[3]https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/cowos.htm
[4]https://www.trendforce.com/news/2024/03/21/news-blackwell-enters-the-scene-a-closer-look-at-tsmcs-cowos-branch/
[5]https://ieeexplore.ieee.org/document/9501649