
Ming-Chi Kuo: Nvidia has stopped developing the dual-cabinet GB200 (NVL36*2) AI cabinet

2024-10-02


According to IT Home on October 2, Ming-Chi Kuo released a market investment briefing yesterday (October 1). It reports that, absent customer customization requirements, Nvidia no longer offers the dual-cabinet GB200 (NVL36*2) and offers only the single-cabinet GB200 NVL72, while the single-cabinet NVL36 keeps its original development and shipping plan.

IT Home reproduces Ming-Chi Kuo's briefing below:

Conclusions:

This will not affect the long-term positive trend for AI and Nvidia, but in the short term it may lead some market participants to question the execution capabilities of Nvidia and its supply chain.

Nvidia has revised its AI server product roadmap frequently of late. I think this is because Nvidia wants to strike a better balance among supply chain execution, competitive advantage, and customer needs with limited resources (halting NVL36*2 development is just one example). This is a good thing and reflects a more pragmatic approach to product planning, but the changes may lead some market participants to perceive the supply chain as chaotic.

Because visibility into the 2025 shipment mix of Blackwell servers is currently low (a few months ago, the market generally believed there would only be NVL36, NVL72, and NVL36*2), the 2025 outlook for some suppliers, such as those in assembly and cooling, will be significantly affected.

Comparing the two 72-GPU versions: reasons for choosing NVL72 and canceling NVL36*2

Development resources are limited. The original plan had three GB200 configurations (NVL36, NVL72, NVL36*2) in development at the same time. The development versions (development drops, or DevDrops) starting in mid-November were expected to converge on NVL72 and NVL36*2 (because NVL36 is "theoretically" ready to enter mass production), with final-version quality assurance (QA) for both completed by mid-March 2025. However, NVL36 development itself still carries uncertainty, let alone the simultaneous development of two 72-GPU versions (NVL72 and NVL36*2).

NVL72 saves data center space. If NVL72 can properly solve the sidecar's heat dissipation design challenges, it will need one fewer cabinet than NVL36*2, improving data center space efficiency.

NVL72 has better inference efficiency. Because the software is designed to be parallelizable, NVL72 and NVL36*2 differ little in AI LLM training performance. In inference, however, where the workload cannot be parallelized or is hard to parallelize (as with autoregressive models), NVL72 more readily outperforms NVL36*2.

Key customer preferences. For example, Microsoft prefers NVL72 to NVL36*2.

Delivering on public promises. Nvidia's publicity has always focused on the single-cabinet NVL72. To fulfill that public commitment with limited resources, NVL72 takes development priority over NVL36*2.

NVL72 development faces unprecedented technical challenges, and visibility into the mass production schedule remains low

The biggest challenge in NVL72 development comes from its TDP (thermal design power) requirement of 132 kW, which makes it the highest-power server in history. Nvidia and the supply chain need more time to solve unprecedented technical problems.

Note that TDP refers to average power consumption under continuous operation. If a design flaw pushes the instantaneous peak power consumption (which Nvidia calls EDP, electrical design point) above TDP, more than two sidecars may be needed. That would not only complicate the heat dissipation design and mass production but also forfeit NVL72's advantage in saving data center space.
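The sidecar arithmetic behind this point can be sketched roughly as follows. Only the 132 kW TDP figure comes from the briefing; the per-sidecar cooling capacity (70 kW) and the 20% peak overshoot are hypothetical values chosen purely for illustration, and the sketch assumes cooling must be sized for the peak rather than the average load.

```python
import math


def sidecars_needed(peak_power_kw: float, sidecar_capacity_kw: float) -> int:
    """Smallest number of sidecars whose combined capacity covers the load."""
    if peak_power_kw <= 0 or sidecar_capacity_kw <= 0:
        raise ValueError("power and capacity must be positive")
    return math.ceil(peak_power_kw / sidecar_capacity_kw)


TDP_KW = 132.0               # from the briefing
SIDECAR_CAPACITY_KW = 70.0   # hypothetical per-sidecar capacity, not from the briefing
PEAK_OVERSHOOT = 1.2         # hypothetical ratio of peak (EDP) to average (TDP) power

at_tdp = sidecars_needed(TDP_KW, SIDECAR_CAPACITY_KW)
at_peak = sidecars_needed(TDP_KW * PEAK_OVERSHOOT, SIDECAR_CAPACITY_KW)
print(f"sidecars at TDP: {at_tdp}, sidecars at peak: {at_peak}")
```

Under these assumed numbers, sizing for the average load takes two sidecars while sizing for the peak takes a third, which is the space penalty the briefing warns about.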

Another sidecar design challenge is holding the approach temperature steadily within 5–10°C. If this standard is relaxed, system stability may suffer.
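As a rough illustration of what "steadily within 5–10°C" means in practice, the check below tests whether a series of approach-temperature readings stays inside that band. The briefing does not define the term; this sketch assumes the common liquid-to-air usage, coolant supply temperature minus inlet air temperature, and all function names and sample readings are hypothetical.

```python
def approach_temp_c(coolant_supply_c: float, air_inlet_c: float) -> float:
    """Approach temperature, assumed here to be coolant supply minus inlet air."""
    return coolant_supply_c - air_inlet_c


def within_band(samples_c, low_c: float = 5.0, high_c: float = 10.0) -> bool:
    """True only if every reading lies inside the [low_c, high_c] band."""
    return all(low_c <= s <= high_c for s in samples_c)


# Hypothetical (coolant supply, inlet air) readings in degrees Celsius.
readings = [(32.0, 25.0), (33.5, 25.0), (34.0, 26.0)]
approaches = [approach_temp_c(supply, inlet) for supply, inlet in readings]
print(approaches, within_band(approaches))
```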

Note that the high power consumption challenge described above involves not just the sidecar but every component and the overall system design.

My latest supply chain survey indicates that NVL72 mass production may not come until 2H25 at the earliest (vs. Nvidia's optimistic target of 1H25).