Rare! Report: Nvidia's latest AI chip delayed due to design flaws

2024-08-03

The most advanced AI chip in Nvidia's new Blackwell series of chips may face delays.

According to The Information, citing people familiar with the matter, Nvidia's upcoming artificial intelligence chip will be delayed by three months or more due to design flaws.

That could affect customers such as Meta Platforms, Google and Microsoft, which have collectively ordered tens of billions of dollars worth of chips.

Nvidia declined to comment on the reported delay but said customers were testing samples of the Blackwell chip and that "production is expected to ramp" later this year.

It is not common to find major design flaws before mass production

The Information cited people involved in the production of the Blackwell chip as saying that the design problems emerged in recent weeks, when TSMC engineers discovered the flaw while preparing for mass production.

The GB200 chip contains two linked Blackwell GPUs and a Grace central processing unit. The issue involves a processor die (a piece of silicon that houses a chip's circuitry) that links the two Blackwell GPUs. The snag reduces the number of chips TSMC can produce for Nvidia and could even halt production.

Nvidia is reportedly conducting a new trial production run with its chipmaker TSMC. So that its machines are not left idle, TSMC is restarting production of another high-profile product that is close to mass production while the problem is fixed. This situation is also rare.

It is highly unusual to find major design flaws before mass production, as multiple production test runs and simulations are required to ensure product feasibility and a smooth manufacturing process.

Under the original plan, TSMC was to start mass production of Blackwell chips in the third quarter and deliver them to Nvidia starting in the fourth quarter. Nvidia CEO Jensen Huang said in May that the company planned to ship a large number of Blackwell chips later this year.

The design defect may delay production of the main Blackwell chips (B200 and GB200) by three months or more, pushing Blackwell mass production back to the first quarter of next year. After receiving the chips, cloud providers usually need about three more months to bring their large-scale clusters into operation.

The giants' expectations were dashed, and it is still unclear when they will receive the goods.

Blackwell can be described as the "white moonlight" (a cherished, long-awaited ideal) in the minds of technology companies, carrying the high hopes of the giants.

If upcoming AI chips such as the B100, B200 and GB200 are delayed by three months or more, Nvidia's customers may not be able to keep to their own plans.

These customers include Microsoft, Meta, and OpenAI, which have high expectations for Nvidia's AI chips and plan to use the "supercomputers" developed by Nvidia to produce future generations of large language models, Meta AI assistants, and other automated functions.

The Information cited people familiar with the matter as saying that Meta has placed an order worth at least $10 billion, and Microsoft has increased the size of its order by 20% in recent weeks. Microsoft plans to have 55,000-65,000 GB200 chips ready for OpenAI by the first quarter of 2025.

Clearly, it is now unknown when Microsoft will receive these orders.

NVLink server racks may be affected

The design flaw will also affect the production and delivery of Nvidia's NVLink server racks, as companies working on servers will have to wait for new chip samples before they can finalize server rack designs.

Previously, TF International Securities analyst Ming-Chi Kuo pointed out that the computing power advantage of the GB200 NVL36 is unquestionable, but that it also faces many unprecedented design and production challenges, and that whether it can ship at scale on schedule remains in doubt.

Each GB200 NVL36 cabinet consumes about 80kW of power, and according to a survey conducted by AMAX in April this year, fewer than 5% of data centers in the world can support 50kW of servers per cabinet. Buyers therefore need to make sure they have a suitable place to install the GB200 NVL36 before purchasing it.

The single-cabinet version of the GB200 NVL72 consumes 130kW of power per cabinet and cannot be mass-produced in the short term.