
Has Nvidia, the stock that set the market on fire, "blown itself up"?

2024-08-05


By Leslie Wu, a former TSMC fab construction expert (public account: Zihao Talks about Chips)

Edited by Su Yang

After a string of setbacks, Nvidia has failed to hold on to its $3 trillion market value.

On June 19 (Beijing time), Nvidia's market value reached US$3.335 trillion, surpassing Microsoft and Apple to become the highest in the world. After that high point, its market value began to decline. By the close on August 2, it had shrunk by 26%.

Before this, some analysts had urged investors to "hit the brakes." Daily Economic News quoted Gil Luria, an analyst at investment bank DA Davidson, as saying that Nvidia's record results of $26 billion were driven by top customers' spending on its GPU products. He believed this trend would weaken and that Nvidia's stock price would fall by a double-digit percentage within 18 months.

In the view of analysts like Gil Luria, top customers are already wavering, and Nvidia's own "mistakes" have given customers a window to change their minds and competitors a chance to cut in. It all started with negative rumors about the Blackwell architecture chips, including low CoWoS yield, the abandonment of the B100 SKU, delayed B200 shipments, and a re-tapeout.

According to information obtained from TSMC, the news of a Blackwell re-tapeout is true, but it mainly involves the B100-series base chip. The problem lies in the underlying standard cells. A standard cell is a pre-designed circuit module with a specific function and size; if chip design is thought of as building with blocks, the standard cell is the smallest block. Under high voltage, these cells can behave abnormally. The problem has been identified, and the masks need to be remade.

However, the overall wafer manufacturing time, from wafer-in to wafer-out, cannot be shortened. Fortunately, only small-batch chip shipments were planned for 2024, and that is not when Blackwell servers ship. Production capacity will be ramped up before the end of this year to catch up with the small-batch shipment schedule. From my own work experience, this is not a difficult task for TSMC.

01 Yield rate blamed for delayed shipments

Talk of the B100 being abandoned, and of B200 shipments being delayed by a re-tapeout, reflects a one-sided understanding of the Blackwell "delayed delivery accident." Much of that confusion comes from Nvidia's complicated naming.

The Blackwell series includes two base chips, B100 and B102. SKUs such as B200 and GB200 all use a chiplet design based on the B100 series, while B200A is built on B102.

To make this easier to follow, we have compiled a table comparing the base chips B102 and B100 and their corresponding server SKUs. Servers for different applications can be combined into further variants, such as HGX B200A/HGX B200/NVL36/72, or even air-cooled versions of NVL8 or GB210A.

It is understandable that the naming of Blackwell chips and their various SKUs confuses outsiders. But the claim that "CoWoS yield is only 66%, and only 10 good dies can be cut from a wafer" defies common sense.

Let us briefly look at the concept of "yield" from both the front end and the back end of wafer manufacturing.

For the front-end GPU die, Nvidia, like Apple, Qualcomm and AMD, uses the N4P process, which is already very mature, so front-end yield is not a concern.

Back-end packaging is a different matter. The package, especially at the "oS" (on-substrate) stage of CoWoS, contains not only the GPU die but also HBM memory, and eight HBM stacks are very expensive. If the GPU die fails, the entire package is scrapped. So if the yield is below 80%, production cannot be scheduled; otherwise the cost would balloon and gross margin could not be guaranteed. At a level of 66%, production would not be scheduled at all.
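To make the cost argument concrete, here is a minimal Python sketch of how scrap inflates the cost of each good package. It assumes, as described above, that a failed package is a total loss. The attempt cost is a hypothetical value normalized to 1.0, not a figure from TSMC or Nvidia, and the function name is our own.

```python
def cost_per_good_package(cost_per_attempt: float, yield_rate: float) -> float:
    """Average spend needed to obtain one good package.

    Assumes a failed package is scrapped entirely (GPU dies and HBM included),
    so every good unit also carries the cost of the failed attempts.
    """
    if not 0 < yield_rate <= 1:
        raise ValueError("yield_rate must be in (0, 1]")
    return cost_per_attempt / yield_rate


# Hypothetical cost of one packaging attempt, normalized to 1.0.
ATTEMPT_COST = 1.0

for y in (0.66, 0.80):
    overhead = cost_per_good_package(ATTEMPT_COST, y) - ATTEMPT_COST
    print(f"yield {y:.0%}: {overhead:.1%} extra cost per good package")
```

Under these assumptions, an 80% yield adds 25% to the cost of every good package, while a 66% yield adds roughly 52%.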

As for managing the risk of abnormal yield in manufacturing, no fabless company, whether Nvidia or Apple, can bet an entire product line on a new solution. If the new solution runs into trouble, the whole product generation could be scrapped. That risk is too great, so when orders are placed, a backup plan is always run in parallel. In other words, even if CoWoS-L yield has problems, it will not affect the shipment of Blackwell chips.

Let me give an example. If Apple wanted to use TSMC's new 2nm process for its A18 chip next year, it would certainly develop an N3P version in parallel so that it could not lose either way. Nvidia would naturally do the same.

According to the data we have obtained, Blackwell uses CoWoS-L packaging, and the current yield is about 90% and still climbing. This is consistent with the figures from the Nomura team, whose research on CoWoS is the most thorough in the industry. At the beginning of the year, TSMC expected CoWoS-L yield to reach 95%. Compared with the 99% yield of the H200 and H100, which use CoWoS-S packaging, 90% is naturally not good, but for a new process it is barely acceptable.

So the current CoWoS-L yield is indeed below expectations. But what actually happened is that the front-end GPU die needed a mask redesign because of the standard cell problem. That kept the Blackwell chip from being produced smoothly and indirectly idled the back-end CoWoS-L capacity. Summarizing this as a major anomaly in CoWoS-L yield, and inferring from it that Blackwell chips cannot ship, is contrary to the facts and to industry common sense.

In fact, before the B100-series base chip re-tapeout problem arose, Nvidia had already made an adjustment because CoWoS-L yield was below 95%: the B200A, which uses the B102 base chip, was switched to CoWoS-S packaging. The original aim was to relieve pressure on CoWoS-L capacity and ensure that more Blackwell chips could be produced in 2025. Now the same adjustment also helps Nvidia absorb the schedule delay caused by the GPU die design problem and raise overall Blackwell shipments in 2025.

02 Who has Nvidia by the "neck"?

It has often been said that Nvidia has a chokehold on computing power, while Nvidia's own "neck" is in the grip of upstream suppliers, such as HBM memory makers.

It should be said that the supply of HBM and liquid-cooling QCD quick-connector modules is relatively tight at present, but tight supply will not delay shipments; at most it will reduce shipment volume. Moreover, the schedule for these scarce components is still assured at this stage. For example, Samsung is now set to join Nvidia's HBM supplier system.

What will really affect Blackwell shipments are the downstream milestones in turning the chips into server products.

According to supply chain sources, it is not only the chips that are entering the production stage, but also board components, switching equipment, racks, cooling solutions, and so on.

Scaling from an 8-card cabinet to a 72-card cabinet raises many issues, including network bandwidth convergence and the optimal operating conditions for the various parallel strategies (splitting the model and data, computing the pieces, then copying and recombining) across the whole cabinet. In addition, as the number of trays grows and density increases, internal cabling, high-speed switching, heat dissipation and other complex issues mean the rack itself must be redesigned. All of this should still be under test.

Since the NVL36/72 servers are all brand-new technical solutions, whether every subsystem and its integration is complete is itself a risk point. The outside world has tended to focus on performance, but the maturity and reliability of the whole system are also a basis for judging the quality of this generation of products.

For the liquid-cooled GB200 series, leakage must also be considered. It mainly involves four components: the cold plate, the manifold, the CDU (liquid cooling distribution unit) and the QCD quick connector. Quick connectors are the most prone to leaking, so leakage is the biggest headache for server manufacturers. Connector quality is the most critical factor and bears directly on how responsibility is divided. In general, if a leak occurs, Nvidia first compensates the customer and then claims compensation from system manufacturers such as Hon Hai and Quanta. An AI server rack costs millions of dollars, and compensation for a leak could bankrupt a small business.

According to the information we have received, Nvidia and system manufacturers such as Foxconn and Quanta Computer are still testing liquid cooling and have not yet introduced it on a large scale.

As mentioned above, whether it is a chip maker, a system maker or a cooling supplier, no manufacturer facing millions of dollars in compensation is willing to take this risk lightly. All of them need real deployments, with "guinea pigs" going first, before rolling it out on a large scale.

03 Will Nvidia “crash”?

At the beginning of the article, we mentioned that Nvidia's market value has fallen from its historical high of more than 3.3 trillion US dollars to the current 2.6 trillion US dollars, a drop of more than 26%. When it released its first-quarter report, Nvidia confidently guided for second-quarter revenue of 28 billion US dollars, plus or minus 2%.
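For reference, the revenue band implied by that guidance can be worked out with a short Python sketch. The function name is our own; the inputs are the 28 billion US dollar midpoint and the ±2% tolerance quoted above.

```python
def guidance_range(midpoint_busd: float, tolerance: float) -> tuple[float, float]:
    """Return the (low, high) revenue band for a midpoint and a +/- tolerance."""
    return midpoint_busd * (1 - tolerance), midpoint_busd * (1 + tolerance)


# Second-quarter guidance quoted in the article: US$28 billion, plus or minus 2%.
low, high = guidance_range(28.0, 0.02)
print(f"Guided revenue band: {low:.2f} to {high:.2f} billion USD")
```

In other words, results anywhere between about 27.44 and 28.56 billion US dollars would fall within guidance.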

Now, the GPU die design problem, CoWoS packaging yield below the expected 95%, and server technical solutions that have not yet been finalized will all affect the smooth shipment of Blackwell chips. Will these problems go further and knock Nvidia out of the 2 trillion market value club?

There should not be much of a problem in the short term. The key point is that the Blackwell chip itself was scheduled for small-batch production in the third quarter, with mass production not until the fourth quarter. And that is only TSMC's production schedule. After the GPU dies are produced, the next step is back-end CoWoS, followed by the bumping factory, and finally system factories such as Foxconn and Wistron for assembly. Only then are the servers shipped and the revenue realized.

In short, it is server shipments, not TSMC's chip shipments, that determine Nvidia's revenue.

At the current pace, mass delivery of servers will not happen until the first quarter of 2025 at the earliest. In other words, Blackwell will not contribute a large amount of revenue to Nvidia until next year. This is also what the market reasonably expects, and it will not show up in results for the second or even the third quarter.

If Nvidia identifies the design problem in the third quarter, comes up with a fix, and then runs an expedited "super hot run" at TSMC, the chips would still come out in the middle to late fourth quarter, around November to December. That capacity has already been booked, and production can basically continue for three months. In addition, TSMC's capacity by then, whether N4P or CoWoS-S/L, will be more ample than it is now. It would not be difficult to raise utilization to 120% to make up for the chips that were originally due to ship in small batches in the third quarter but were delayed by the design defect. In other words, on an annual basis, Blackwell shipments will be lower this year, but not by much.

For Nvidia and the entire downstream industry chain, the chip problem is now out in the open, while the various server subsystems still have to be tested in real environments in parallel. The more optimistic point is that the chips produced so far only have problems in specific high-voltage conditions. They can still be handed to server system manufacturers such as Foxconn for tuning and testing. In other words, the schedule for the server subsystems is unchanged, and there is still about half a year to run the chips through simulated environmental tests. Large-scale shipments should finally land in February-March 2025.

Judging from the current situation, with H200 shipments flooding out in the second quarter, results are likely to meet guidance and beat expectations. Moreover, the main revenue force in 2023 is the H200 series. As mentioned earlier, small-batch shipments of Blackwell chips this year will be smaller than originally planned, at about 20,000 wafers (CoWoS-L will be cut from 41K to less than 20K), which corresponds to roughly US$8-9.5 billion of Nvidia's estimated results. However, with emergency measures such as selling more H-series chips and sprinting on capacity once the B series is back, the loss this time will probably be around US$5 billion, which may show up in the fourth-quarter financial report. There will certainly be an impact on the stock price; after all, it is a product failure.
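The arithmetic behind these figures can be laid out in a short Python sketch. It rests on one assumption of ours: that the US$8-9.5 billion is the amount at stake before mitigation and the US$5 billion is what remains after it. The constant and function names are illustrative only.

```python
# Figures quoted in the article (K = thousand wafers, B = billion US dollars).
PLANNED_COWOS_L_WAFERS_K = 41.0
REVISED_COWOS_L_WAFERS_K = 20.0    # article says "less than 20K", so this is an upper bound
GROSS_IMPACT_RANGE_B = (8.0, 9.5)  # assumed: amount at stake before mitigation
NET_LOSS_B = 5.0                   # article's estimate after mitigation


def wafer_cut_fraction(planned_k: float, revised_k: float) -> float:
    """Fraction of the planned wafer volume that is lost."""
    return (planned_k - revised_k) / planned_k


def implied_offset_range(gross_range_b: tuple[float, float], net_loss_b: float) -> tuple[float, float]:
    """Amount that H-series sales and the capacity sprint would need to recover."""
    low, high = gross_range_b
    return low - net_loss_b, high - net_loss_b


cut = wafer_cut_fraction(PLANNED_COWOS_L_WAFERS_K, REVISED_COWOS_L_WAFERS_K)
offset_low, offset_high = implied_offset_range(GROSS_IMPACT_RANGE_B, NET_LOSS_B)
print(f"CoWoS-L wafer cut: at least {cut:.0%}")
print(f"Implied recovery: {offset_low:.1f} to {offset_high:.1f} billion USD")
```

Read this way, the CoWoS-L wafer plan is cut by at least half, and the emergency measures would need to recover roughly US$3-4.5 billion for the net loss to land near US$5 billion.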

Compared with the Blackwell "crash" itself, the issue more worth thinking about is that Nvidia launches new SKUs every year, each requiring many innovative technologies. The pace is very fast. Without enough time to optimize and improve reliability, some product could fail outright in the next few years. This is the part of Nvidia's development logic that needs re-examining, and it is also the opportunity competitors have been waiting for.

From a more macro perspective, although Nvidia's growth logic has held up over the past two years, the risks to its longer-term development are increasing. The risk lies not only in the aggressive technological changes in each generation, but also in applications and follow-on demand. Simply put, it is the well-known "AI bubble," or the question of whether strong competitors will emerge, whether from new chip technology or from upstream companies with large models that begin developing their own chips.

Over the past two days I have seen many reports that Chinese and American giants are all starting to develop their own chips. Here is one piece of news for reference: OpenAI's talks with TSMC on its self-developed chip project are almost concluded.