2024-08-17
Since ChatGPT exploded in popularity, new large AI models have appeared in an endless stream. While this "war of a hundred models" rages on, the American chip company NVIDIA has made a fortune thanks to the outstanding performance of its GPUs in large-model computation.
However, a recent move by Apple has cooled Nvidia's momentum somewhat.
Apple chooses TPU instead of GPU for AI model training
NVIDIA has long been the leader in AI computing infrastructure. In the AI hardware market, and especially in AI training, its market share exceeds 80%. NVIDIA GPUs are the preferred computing solution for Amazon, Microsoft, Meta, OpenAI and many other technology giants working in AI and machine learning.
Even so, NVIDIA faces challenges from many directions. Its competitors include established companies developing their own GPUs as well as pioneers exploring novel architectures, and Google's TPU, with its distinctive advantages, has become a rival NVIDIA cannot afford to ignore.
On July 30, Apple released a research paper introducing the two models behind Apple Intelligence: AFM-on-device (AFM is short for Apple Foundation Model) and AFM-server. The former is a language model with 3 billion parameters designed to run on device, while the latter is a larger, server-based language model.
Apple said in the paper that, to train its AI models, it used two types of Google tensor processing units (TPUs) organized into large chip clusters. To build AFM-on-device, the model that runs on iPhones and other devices, Apple used 2,048 TPUv5p chips; for its server model AFM-server, it deployed 8,192 TPUv4 processors.
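For context on what "organized into large chip clusters" means in practice, below is a generic, hypothetical sketch of data-parallel training across the TPU cores attached to a single host, written with the JAX library. It is not Apple's actual training code; the toy model, names and sizes are invented for illustration, and the same script simply falls back to CPU when no TPU is present.

```python
from functools import partial
import jax
import jax.numpy as jnp

# Hypothetical illustration only -- not Apple's AFM training code.
def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)        # toy linear-regression loss

@partial(jax.pmap, axis_name="devices")      # replicate across local TPU cores
def train_step(w, x, y):
    grads = jax.grad(loss_fn)(w, x, y)
    grads = jax.lax.pmean(grads, "devices")  # average gradients across cores
    return w - 0.01 * grads                  # plain SGD update

n = jax.local_device_count()                 # e.g. 8 on one TPU host, 1 on CPU
w = jnp.zeros((n, 4))                        # identical weight replica per core
x = jnp.ones((n, 32, 4))                     # per-core shard of the batch
y = jnp.ones((n, 32))
w = train_step(w, x, y)
print(w.shape)                               # (n, 4)
```

Real thousand-chip runs extend this same idea across many hosts with sharded model states, but the gradient-averaging pattern is the core of data-parallel TPU training.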
Apple's strategic choice to pass over Nvidia GPUs in favor of Google TPUs dropped a bombshell on the technology industry. Nvidia's share price fell more than 7% that day, its biggest drop in three months, wiping about US$193 billion off its market value.
Industry insiders said Apple's decision suggests some large technology companies may be looking for alternatives to Nvidia's graphics processing units for artificial intelligence training.
TPU vs. GPU: which is better for large models?
Before discussing whether TPU or GPU is more suitable for large models, we need to have a preliminary understanding of the two.
Comparison between TPU and GPU
TPU stands for Tensor Processing Unit, a special-purpose chip designed by Google to accelerate machine learning workloads, used mainly for training and inference of deep learning models. It is worth noting that the TPU is an ASIC (application-specific integrated circuit), a chip custom-built for one particular class of task.
The GPU is more familiar: a processor originally designed for graphics rendering that later found wide use in parallel computing and deep learning. Its massive parallel processing capability makes it well suited to parallel workloads such as deep learning and scientific computing.
Clearly, the two chips were designed with different goals in mind.
Compared with traditional CPUs, the parallel computing power of GPUs makes them particularly well suited to large-scale data sets and complex computing tasks. So, as large AI models have boomed in recent years, GPUs have become the hardware of choice for AI training.
However, as large AI models keep growing, computing tasks are becoming exponentially larger and more complex, placing new demands on compute power and resources. When used for AI computing, GPUs suffer from relatively low utilization and high energy consumption. This energy-efficiency bottleneck, together with the high prices and tight supply of NVIDIA's GPUs, has drawn more attention to the TPU architecture, which was designed for deep learning and machine learning from the start. The GPU's dominant position in this field has begun to face challenges.
Google reportedly began developing chips dedicated to AI and machine learning algorithms as early as 2013, but it was not until 2016 that this in-house chip, the TPU, was officially unveiled. AlphaGo famously ran on Google's TPU-series chips.
Claiming that the TPU is better suited to training large AI models is unconvincing without a closer look at its capabilities.
What makes the TPU suitable for large-model training?
First, the TPU uses multi-dimensional compute units to improve computing efficiency. Compared with the scalar units of a CPU and the vector units of a GPU, the TPU performs its work with two-dimensional (or even higher-dimensional) compute arrays, and unrolls convolution loops so that each piece of data is reused as many times as possible, cutting data-transfer costs and improving acceleration efficiency (a simplified sketch of this data-reuse idea follows this list).
Second, the TPU offers more economical data movement and a leaner control unit. The memory-wall problem of the von Neumann architecture is especially acute in deep learning tasks, so the TPU takes a more aggressive approach to data transfer, and its control unit is smaller, leaving more die area for on-chip memory and compute units.
Finally, the TPU is purpose-built for AI acceleration, strengthening AI/ML computing capability. With a precise focus, a simple architecture, single-threaded control and a customized instruction set, the TPU architecture is extremely efficient for deep learning operations and easy to scale out, making it well suited to ultra-large-scale AI training.
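As a rough illustration of the data-reuse point above, the sketch below is a plain Python/NumPy tiled matrix multiplication: each tile fetched from memory is reused for many multiply-accumulate operations before it is discarded, which is the same principle a TPU's two-dimensional compute array applies in hardware. This is a conceptual sketch only, not TPU code; the tile and matrix sizes are arbitrary.

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 128) -> np.ndarray:
    """Blocked matrix multiply: illustrates data reuse, not raw performance."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Each a- and b-tile loaded here feeds tile*tile
                # multiply-accumulates before being evicted -- the
                # data reuse that cuts memory traffic.
                c[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return c

a = np.random.rand(512, 512).astype(np.float32)
b = np.random.rand(512, 512).astype(np.float32)
assert np.allclose(tiled_matmul(a, b), a @ b, rtol=1e-3, atol=1e-3)
```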
Google's TPUv4 reportedly consumes 1.3 to 1.9 times less power than NVIDIA's A100 and is 1.2 to 1.9 times more efficient than the A100 on many common workloads such as BERT and ResNet. Its newer TPUv5 and TPU Trillium parts are said to improve computing performance by a further 2x and nearly 10x over TPUv4, respectively. Google's TPU products thus hold clear advantages over NVIDIA's in cost and power consumption.
At the Google I/O 2024 developer conference in May, Alphabet CEO Sundar Pichai announced Trillium, the sixth generation of its data-center AI chip, the Tensor Processing Unit (TPU), saying the product is almost five times faster than its predecessor and will launch later this year.
Google said the sixth-generation Trillium chip, designed to power technology that generates text and other content from large models, delivers 4.7 times the computing performance of the TPU v5e and is 67% more energy efficient. Google also said Trillium will be available to its cloud customers by the end of this year.
Google engineers achieved additional performance gains by increasing high-bandwidth memory capacity and overall bandwidth. AI models require large amounts of high-bandwidth memory, which has been a bottleneck for further performance improvements.
It is worth noting that Google does not sell its TPU chips as standalone products; instead, it offers TPU-based computing services to external customers through the Google Cloud Platform (GCP).
There is a certain shrewdness in this plan: selling hardware directly involves high costs and complex supply-chain management. By offering TPUs through cloud services, Google simplifies installation, deployment and management, reducing uncertainty and extra cost, and it avoids having to build a dedicated hardware sales team. Google is also locked in fierce competition with OpenAI over generative AI; if it started selling TPUs, it would be fighting two powerful opponents at once, Nvidia and OpenAI, which may not be the wisest strategy right now.
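To illustrate what this cloud-delivered model looks like from a customer's side, here is a minimal, hypothetical sketch assuming a Google Cloud TPU VM with the JAX library installed. The function and array sizes are invented for illustration; the same code falls back to CPU when no TPU is attached.

```python
import jax
import jax.numpy as jnp

# List the accelerators JAX can see: on a Cloud TPU VM this shows the
# attached TPU cores, elsewhere it shows CPU devices.
print(jax.devices())

@jax.jit  # XLA compiles the function for whichever backend is present
def scaled_matmul(a, b):
    # Illustrative computation only: a matmul scaled by sqrt(inner dim).
    return jnp.dot(a, b) / jnp.sqrt(a.shape[-1])

a = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
b = jnp.ones((1024, 1024), dtype=jnp.bfloat16)
print(scaled_matmul(a, b).shape)  # (1024, 1024)
```

From the customer's perspective, the hardware never changes hands: the same program is simply scheduled onto whatever TPU capacity the cloud provides.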
At this point, some readers may ask: if the TPU has such outstanding performance advantages, will it replace the GPU in the near future?
Is it too early to talk about replacing GPUs?
The question is not that simple.
Talking only about the TPU's strengths while ignoring the GPU's would be missing the forest for the trees. We also need to understand what makes the GPU, compared with the TPU, well suited to today's large-model training.
The TPU's advantages lie in its outstanding energy efficiency and compute per unit of cost. However, as an ASIC, it carries the equally obvious disadvantage of high trial-and-error cost.
There is also the question of ecosystem maturity. After years of development, the GPU has a large, mature ecosystem of software and development tools. Many developers and research institutions have long built and optimized on GPUs, accumulating a wealth of libraries, frameworks and algorithms. The TPU ecosystem is comparatively young, and its available resources and tools may not be as rich, which can make adaptation and optimization harder for developers.
Then there is versatility. The GPU was originally designed for graphics rendering, but its architecture is flexible enough to handle many different kinds of computing task, not just deep learning, making it more adaptable across application scenarios. The TPU, by contrast, is custom-designed for machine learning workloads and may not handle other, non-machine-learning computing tasks as effectively as the GPU.
Finally, the GPU market is highly competitive, and manufacturers are constantly promoting technological innovation and product updates, with new architectures and performance improvements coming more frequently. The development of TPU is mainly led by Google, and its pace of updates and evolution may be relatively slow.
In general, NVIDIA and Google have different strategies for AI chips: NVIDIA pushes the performance limits of AI models by providing powerful computing power and extensive developer support, while Google improves the efficiency of large-scale AI model training through efficient distributed computing architecture. These two different path choices have enabled them to demonstrate unique advantages in their respective application fields.
Apple's choice of Google TPU may be due to the following reasons: First, TPU performs well in handling large-scale distributed training tasks, providing efficient and low-latency computing capabilities; second, using the Google Cloud platform, Apple can reduce hardware costs, flexibly adjust computing resources, and optimize the overall cost of AI development. In addition, Google's AI development ecosystem also provides a wealth of tools and support, allowing Apple to develop and deploy its AI models more efficiently.
Apple's example demonstrates the TPU's ability in large-model training. Compared with NVIDIA, however, the TPU is still rarely used in the large-model field: most large-model companies, including giants such as OpenAI, Tesla and ByteDance, still rely mainly on NVIDIA GPUs in their core AI data centers.
So it may be too early to declare that Google's TPU can beat Nvidia's GPU, but the TPU is certainly a formidable challenger.
TPU is not the only challenger to GPU
China also has a company betting on TPU chips: Zhonghao Xinying. Its founder, Yang Gongyifan, was a core chip developer at Google and was deeply involved in the design and development of TPU 2/3/4. In his view, the TPU is an architecture born for large AI models.
In 2023, Zhonghao Xinying officially launched its "Moment" chip. With high-speed interconnects linking up to 1,024 chips, "Moment" forms the basis of a large-scale computing cluster called "Taize", whose system-level performance is claimed to be dozens of times that of traditional GPU clusters, providing computing power for the training and inference of AIGC models with more than 100 billion parameters. The achievement both demonstrates Zhonghao Xinying's depth in AI compute technology and earns domestic Chinese chips a place on the international stage.
Indeed, in today's AI gold rush, with Nvidia's H100 expensive and in short supply, companies large and small are hunting for alternatives to Nvidia's AI chips, both among those following the traditional GPU route and those exploring new architectures.
The GPU's challengers go far beyond the TPU.
On the GPU path itself, Nvidia's strongest rival is AMD. In January this year, researchers used about 8% of the GPUs on the Frontier supercomputer (roughly 3,000 of them) to train a GPT-3.5-class large model. Frontier is built entirely on AMD hardware, with 37,888 MI250X GPUs and 9,472 EPYC 7A53 CPUs. The work also tackled the difficulties of advanced distributed training on AMD hardware and demonstrated the feasibility of training large models on the AMD platform.
At the same time, the CUDA moat is gradually being breached. In July this year, the British company Spectral Compute released a toolchain that can natively compile CUDA source code for AMD GPUs, greatly improving AMD GPUs' compatibility with CUDA.
Intel's Gaudi 3 has likewise been launched as a direct competitor to Nvidia's H100. In April, Intel introduced Gaudi 3 for deep learning and large generative AI models, saying that compared with the previous generation it delivers four times the BF16 floating-point AI compute, 1.5 times the memory bandwidth, and twice the network bandwidth for large-scale system scale-out. Against Nvidia's H100, Intel expects Gaudi 3 to cut training time by an average of 50% on Meta's Llama 2 models with 7B and 13B parameters and OpenAI's 175B-parameter GPT-3.
In addition, on the 7B- and 70B-parameter Llama models and the 180B-parameter open-source Falcon model, Gaudi 3 is expected to deliver on average 50% higher inference throughput and 40% better inference power efficiency than the H100, with an even larger advantage on longer input and output sequences.
On the same Llama 7B/70B and Falcon 180B models, Gaudi 3's inference is also about 30% faster than NVIDIA's H200.
Intel said that Gaudi 3 will be available to customers in the third quarter of this year and to OEM manufacturers including Dell, HPE, Lenovo and Supermicro in the second quarter, but did not announce the price range of Gaudi 3.
Last November, at its Ignite conference, Microsoft unveiled Azure's first in-house AI chip, Azure Maia 100, along with Azure Cobalt, a chip for cloud software services. Both chips will be manufactured by TSMC on a 5nm process.
Nvidia's high-end products reportedly sell for $30,000 to $40,000 apiece, and ChatGPT is believed to require on the order of 10,000 such chips, a huge cost for AI companies. Technology giants with heavy demand for AI chips are therefore searching hard for alternative supplies, and Microsoft chose to build its own chips in the hope of improving the performance of generative AI products such as ChatGPT while cutting costs.
Cobalt is a 128-core general-purpose chip based on the Arm architecture, while Maia 100 is an ASIC designed for Azure cloud services and AI workloads, used for cloud training and inference and packing 105 billion transistors. Both chips will be deployed in Microsoft Azure data centers to support services such as OpenAI's models and Copilot.
Rani Borkar, vice president of Azure's chip division, said Microsoft has begun testing Maia 100 with its Bing and Office AI products, and that OpenAI, Microsoft's key AI partner and the developer of ChatGPT, is testing it too. Some market commentators see the timing of Microsoft's chip project as opportune, coming just as the large language models nurtured by Microsoft, OpenAI and others are taking off.
However, Microsoft does not believe that its AI chips can widely replace Nvidia's products. Some analysts believe that if Microsoft's efforts are successful, it may also help it gain an advantage in future negotiations with Nvidia.
Beyond the chip giants, startups are also making an impact, such as Groq's LPU, Cerebras's Wafer Scale Engine 3, and Etched's Sohu.
At present, Nvidia controls about 80% of the AI data-center chip market, and most of the remaining 20% belongs to various generations of Google TPUs. Will the TPU's share keep rising, and by how much? Will AI chips with other architectures carve the market into three? The answers should gradually emerge over the next few years.