
The chip giant has surged again: who is the driving force behind it?

2024-08-06


On July 30, Nvidia's stock price fell 7%, the company's biggest drop in three months. Its market value evaporated by US$193.4 billion overnight, falling to US$2.55 trillion.

From July 10 to 30, Nvidia's stock price plummeted 23%, from $134.91 per share to $103.73. Before that, the company's relentless upward momentum had seemed to make investors overlook the risks.

Investors have pulled money out of big tech stocks like Nvidia over the past two weeks as they grew concerned that big companies are struggling to get a return on their AI spending.

Technical analysts point out that such a shift leaves Nvidia's stock price with further room to fall.

01

Is it Apple’s fault?

Nvidia's sharp drop in stock price may be related to Apple.

On July 29, Apple disclosed in a technical paper that two models of its artificial intelligence (AI) system, Apple Intelligence, were trained on cloud chips designed by Google, and it detailed the tensor processing units (TPUs) used for training. Apple also released a preview version of Apple Intelligence for some devices.

Apple does not mention Google or Nvidia by name in the 47-page paper, but it notes that its Apple Foundation Model (AFM) and AFM-server were trained on a cloud-based TPU cluster. The paper says this setup enables Apple to train AFM models efficiently and at scale, including AFM-on-device, AFM-server, and larger models.

Apple says AFM-on-device was trained on a single slice of 2048 TPU v5p chips, Google's most advanced TPU, launched in December 2023. AFM-server was trained on 8192 TPU v4 chips configured as eight slices working together over a data center network.
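
Apple's paper credits its open-source AXLearn framework, which is built on JAX, for this training. As a rough illustration of the idea, the sketch below shows how a JAX program treats the chips of a slice as one device mesh and shards a training batch across them. Everything here (the model, shapes, and names) is a hypothetical toy, not Apple's actual setup.

```python
# Hypothetical sketch of data-parallel training across a TPU slice in JAX.
# Apple's AFM training stack is not public in this form; all names and
# shapes below are illustrative assumptions.
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

n = jax.device_count()                        # all chips visible in the slice
mesh = Mesh(mesh_utils.create_device_mesh((n,)), axis_names=("data",))

replicated = NamedSharding(mesh, P())         # parameters: one copy per chip
sharded = NamedSharding(mesh, P("data"))      # batch: split along "data" axis

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)         # toy linear-regression loss

grad_step = jax.jit(jax.grad(loss_fn))        # XLA compiles one program per slice

w = jax.device_put(jnp.zeros((128, 1)), replicated)
x = jax.device_put(jnp.ones((n * 8, 128)), sharded)
y = jax.device_put(jnp.zeros((n * 8, 1)), sharded)
g = grad_step(w, x, y)                        # gradients all-reduced across chips
```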

Google has long deployed TPUs in its data centers to accelerate AI model training and inference. And the chips are not just for internal use: Google also offers them to third parties as a cloud computing service, turning them into a product for sale.

Google's latest TPUs cost less than $2 per chip-hour, and the chips must be reserved three years in advance. Google first launched TPUs for internal workloads in 2015 and made them available to the public in 2017. Today they are the most mature custom chips designed for artificial intelligence.
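
Taken at face value, the quoted rate implies the following per-chip commitment. This is simple arithmetic on the figures above, treating the $2/hour upper bound as an assumed flat rate:

```python
# What "<$2/hour, reserved three years in advance" implies per chip,
# using the article's upper bound as an assumed flat rate.
hourly = 2.0                        # $/chip-hour, upper bound quoted above
commitment_hours = 3 * 365 * 24     # three-year reservation
print(f"<= ${hourly * commitment_hours:,.0f} per chip over the commitment")
# <= $52,560 -- the reservation model trades flexibility for a lower rate.
```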

However, Google remains one of Nvidia's top customers and sells access to Nvidia's technology on its cloud platform.

Apple has previously said that inference, that is, running a pre-trained AI model to generate content or make predictions, will be done partly on chips in Apple's own data centers.

Technical documents Apple released at WWDC 2024 in June showed that, in addition to the Apple processors, other in-house hardware, and software frameworks used in Apple Intelligence, its engineers also combine their own GPUs with Google TPUs to accelerate the training of artificial intelligence models.

Nvidia is facing increasing competitive pressure. Take Google as an example: the technology giant continues to expand its market share through self-developed AI chips. Data from TechInsights shows that Google shipped 2 million TPUs into the data center accelerator market in 2023. Although that is lower than Nvidia's 3.8 million units, it firmly ranks third in the industry, and with strong growth momentum it poses a challenge to Nvidia. At the same time, technology giants such as Microsoft are also gradually reducing their dependence on Nvidia and turning to chips from competing brands.

02

GPUs are too expensive

In addition to the risk of relying on a single supplier, the high price of Nvidia's GPUs has also scared away many manufacturers.

Reports indicate that AI servers equipped with Nvidia's next-generation Blackwell GPUs will cost as much as $2-3 million each.

Nvidia has launched two reference designs based on the Blackwell architecture. The NVL36, equipped with 36 B200 GPU accelerator cards, is expected to cost $2 million, up from an earlier estimate of $1.8 million. The NVL72 is double the size, with 72 B200 accelerator cards, and is expected to start at $3 million.

Nvidia estimates that by 2025, shipments of B200 servers will reach 60,000 to 70,000 units, with a total value of US$120 billion to US$210 billion.
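
A quick back-of-envelope check shows the quoted total is simply the shipment range multiplied by the unit-price range:

```python
# Reproducing the revenue range quoted above:
# 60,000-70,000 servers at $2M-3M each.
low = 60_000 * 2_000_000     # $120,000,000,000
high = 70_000 * 3_000_000    # $210,000,000,000
print(f"${low / 1e9:.0f}B to ${high / 1e9:.0f}B")  # $120B to $210B
```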

Currently, AWS, Dell, Google, Meta, Microsoft and others are interested in purchasing B200 servers, on a scale beyond expectations.

AI servers are built mainly around CPUs, GPUs, FPGAs and other processors that handle heavy computational workloads. Compared with traditional servers, AI servers usually require higher-performance hardware to meet the demands of large-scale data processing and complex computation. Because this hardware is expensive, it accounts for the largest share of an AI server's cost, and among the processors, the GPU takes the biggest share.

In AI training and inference, GPUs are usually the most expensive hardware because their strong compute and parallel processing capabilities accelerate the training and inference of AI models. Most AI servers are equipped with multiple GPUs to meet the demands of high-performance computing.

Because GPUs pack so much computing power, they also draw a lot of power, and since AI models usually require multiple GPUs, server power consumption climbs further. High power draw means the server needs a larger power supply and runs up higher electricity bills.
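
To see how quickly those bills add up, here is a rough sketch. The wattage, cooling overhead, and tariff below are illustrative assumptions, not figures from the article:

```python
# Rough electricity-cost estimate for a multi-GPU server.
gpus = 8
gpu_power_kw = 0.7    # ~700 W per high-end accelerator (assumed)
overhead = 1.5        # PUE-style factor for cooling and other loads (assumed)
tariff = 0.10         # $/kWh (assumed)

kw = gpus * gpu_power_kw * overhead
annual_cost = kw * 24 * 365 * tariff
print(f"{kw:.1f} kW draw, about ${annual_cost:,.0f} per year in electricity")
# 8.4 kW draw, about $7,358 per year -- before the hardware itself.
```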

Compared with CPUs, GPUs have more complex architectures and more components, so maintaining them is more involved and requires more specialized technicians. Their high power consumption also raises cooling requirements, adding cooling equipment and further maintenance costs.

As AI technology develops rapidly, GPU performance keeps improving, and to stay competitive many companies must frequently purchase newer GPUs, which adds to server costs.

As AI applications spread, more and more companies are deploying AI servers, driving ever-greater demand for GPUs. When demand exceeds supply, GPU prices rise.

03

Pressure from competitors

Nvidia's competitors are all gearing up for a fight, and the most notable among them, AMD, has performed well recently.

On July 30, AMD released its second-quarter 2024 financial report, with net profit surging 881% year-on-year and data center business revenue doubling, snatching a lot of business from Nvidia.

AMD's total revenue for the quarter reached $5.835 billion, exceeding the prior forecast of $5.72 billion, up 9% year-on-year and 7% quarter-on-quarter. Net profit reached $265 million, up 881% year-on-year and 115% quarter-on-quarter.

Sales of the MI300 GPU chip for data centers exceeded US$1 billion in a single quarter, driving a significant increase in revenue in the data center division.

The MI300 series is a line of AI GPUs AMD released at the end of 2023, comprising the MI300X and the MI300A, which integrates CPU cores alongside the GPU accelerators. The MI300X is benchmarked against Nvidia's H100: according to AMD, its AI training performance is on par with the H100, while its inference performance exceeds the competing product. Taking a single server of 8 GPUs as an example, when running the 176-billion-parameter BLOOM model and the 70-billion-parameter Llama2 model, the MI300X platform delivers 1.4 to 1.6 times the performance of the H100 platform.
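
Memory arithmetic helps explain why an eight-GPU node can host models of this size at all. A back-of-envelope sketch, assuming fp16 weights (2 bytes per parameter) and ignoring activations and KV cache:

```python
# Why 8 GPUs can hold a 176B-parameter model: weight-memory arithmetic.
params = 176e9
bytes_per_param = 2    # fp16 (assumed precision)
gpus = 8

total_gib = params * bytes_per_param / 2**30
print(f"weights: {total_gib:.0f} GiB total, {total_gib / gpus:.0f} GiB per GPU")
# ~328 GiB total, ~41 GiB per GPU -- comfortably within the MI300X's
# 192 GB of HBM3, leaving headroom for activations during inference.
```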

AMD CEO Lisa Su said the company's AI chip sales were "higher than expected" and that Microsoft's use of MI300 chips is growing, supporting the compute behind GPT-4 Turbo as well as Copilot services in Word, Teams and other Microsoft products. Hugging Face was among the first customers to adopt the new Microsoft Azure offering, enabling enterprise and AI customers to deploy hundreds of thousands of models on MI300 with one click.

In June this year, AMD announced its product roadmap: the MI325X will launch in the fourth quarter of this year, followed by the MI350 series and MI400 series over the next two years. The MI300X and MI325X use the CDNA3 architecture, the MI350 will use CDNA4, and the MI400 will adopt the next-generation CDNA architecture. AMD plans to launch a new product series every year, a cadence the industry views as matching the roadmap Nvidia has announced.

In addition, Lisa Su said that demand for AI inference will outstrip demand for training, that AI PCs are a very important part of the PC category, and that the PC market is a good revenue growth opportunity for AMD's business.

This year, AMD is accelerating its AI deployment through investment. In July, the company invested $665 million to acquire Silo AI, Europe's largest artificial intelligence laboratory, which provides end-to-end AI-driven solutions. This acquisition is considered an important step for AMD to catch up with Nvidia.

Lisa Su said that in addition to the Silo AI acquisition, AMD has invested more than $125 million in more than a dozen artificial intelligence companies over the past 12 months to expand the AMD ecosystem and maintain AMD's leading position in computing platforms. She said AMD will continue to invest in software, which is one of the reasons for the Silo AI deal.

AMD is challenging Nvidia with the same formula that made Nvidia successful: superior GPU hardware, well-developed software, and an ecosystem.

04

Nvidia also has weaknesses

To compete with Nvidia, the best strategy is to play to one's own strengths while attacking Nvidia's weaknesses.

GPUs have very strong parallel processing capability, which is the fundamental reason they excel at AI training. But they are not fast at moving data back and forth. Running a large AI model typically requires many GPUs and many memory chips, all interconnected, and the faster data moves between GPU and memory, the better the performance. When training large AI models, some GPU cores sit idle, waiting on data for nearly half the time.
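
The "cores waiting for data" point is the classic roofline argument: if an operation performs too few floating-point operations per byte moved, the memory system, not the compute units, sets the speed. A toy estimate, with peak figures that are assumptions in the rough range of a current high-end accelerator:

```python
# Toy roofline check: when does memory bandwidth, not compute, set the pace?
peak_flops = 1000e12   # 1 PFLOP/s of fp16 compute (assumed)
mem_bw = 3.35e12       # 3.35 TB/s of HBM bandwidth (assumed, H100-class)

# Arithmetic intensity needed to keep the cores busy (FLOPs per byte moved):
balance_point = peak_flops / mem_bw    # ~300 FLOPs/byte

# A bandwidth-bound op like a large fp16 element-wise add does ~0.25 FLOPs/byte,
# so the compute units run at roughly this fraction of peak:
op_intensity = 0.25
utilisation = min(1.0, op_intensity / balance_point)
print(f"balance point: {balance_point:.0f} FLOPs/byte; "
      f"utilisation on a bandwidth-bound op: {utilisation:.2%}")
```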

If a large number of processor cores could be combined with massive memory into in-memory computing, the complexity of the connections between chips would drop sharply and data transfer speeds would improve dramatically. With many processor cores linked together inside one chip, such designs can run hundreds of times faster than combinations of discrete GPUs. Several startups are pursuing this approach, and their progress is worth watching.

Competing with Nvidia on the software and hardware ecosystem, meanwhile, requires preparing for a protracted war, and only players with deep resources stand a chance in that fight. AMD and Intel are taking this route.

Beyond the chip itself, there is also room to compete on the interconnect between chips, where the leader is not Nvidia but Broadcom.

Broadcom addresses chip-to-chip interconnection rather than competing head-on with Nvidia's GPUs. Nvidia has its own chip-to-chip interconnect technology, but across the industry as a whole, Broadcom's technology and products are superior. Of the world's eight largest AI server systems, seven have deployed Ethernet infrastructure built on Broadcom technology, and by 2025 all ultra-large-scale AI server systems are expected to run on Ethernet.

Broadcom excels at solving communication bandwidth problems: it holds 76% of the global 50 Gb/s SerDes market. Its SerDes interfaces convert low-speed parallel data into high-speed serial data, then convert it back to parallel data at the receiving end. This lets data travel from one TPU to another at high speed, greatly improving transmission efficiency.
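
Conceptually, that parallel-to-serial-to-parallel round trip looks like the toy sketch below. Real SerDes operate at the analog/PHY layer with line encoding and clock recovery; this only mirrors the data flow the paragraph describes:

```python
# Toy model of a SerDes data path: parallel words -> serial bit stream -> words.
def serialize(words, width=8):
    """Flatten parallel words into one serial bit stream (MSB first)."""
    return [(w >> (width - 1 - i)) & 1 for w in words for i in range(width)]

def deserialize(bits, width=8):
    """Regroup the serial bit stream back into parallel words."""
    return [
        sum(bit << (width - 1 - i) for i, bit in enumerate(bits[j:j + width]))
        for j in range(0, len(bits), width)
    ]

data = [0xA5, 0x3C]
assert deserialize(serialize(data)) == data   # round trip preserves the words
```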

Also benefiting from AI's growth, Broadcom's network communication product revenue is growing 40% year-on-year. The company's financial report shows that in the second fiscal quarter, which ended in May this year, AI revenue rose 280% year-on-year to US$3.1 billion, and that figure is expected to exceed US$11 billion by the end of this fiscal year.

05

A big fall followed by a big rise

A crowd of competitors is putting pressure on Nvidia, and that is an important reason for the stock's decline. But the market moves faster than anyone can react.

On the evening of July 31, Nvidia's stock price suddenly soared, rising more than 14% at one point, and its market value grew by US$326.9 billion in a single day.

That made Nvidia the first stock ever to gain more than $300 billion in market value in a single day. Nvidia now holds the top three spots in the ranking of single-day US market value gains: on February 22 and May 23 this year, its market value rose by $276.6 billion and $217.7 billion in a day, respectively.

Morgan Stanley published a research report arguing that, although the specific reasons for the recent sell-off are unclear, it creates a good entry opportunity for interested investors. The bank therefore restored Nvidia to its list of top picks, leaving its earnings forecasts and target price unchanged, with an "overweight" rating and a target price of US$144.

In just two days, Nvidia's stock price fell sharply and then rose sharply, which may be related to tight Blackwell supply and the difficulty of delivering it all on time.

Morgan Stanley said Blackwell products have drawn strong market interest; in particular, the significant improvement in inference performance has further fueled customers' desire to buy.

However, industry reports suggest that the Blackwell GPU itself, or the server products built around it, may be delayed.

Although many competitors' technology and products keep improving, putting pressure on Nvidia, the company's GPUs remain the mainstay of the AI server market today and for the short to medium term, with overall supply still falling short of demand. Just as the Blackwell GPUs that many customers have been waiting for are about to ship, news of delayed deliveries will only whet the market's appetite and help push the stock price up.