
Ten years of hard work: how did Google's TPU chip win over Apple?

2024-08-14


Before the birth of ChatGPT, Google had already single-handedly set off a major wave of development in artificial intelligence, most famously with AlphaGo's victory over Korean Go player Lee Sedol in the 2016 "man vs. machine" match. Behind that victory, the TPU chip that powered AlphaGo's "strongest brain" was crucial, and it is still being iterated and improved today.

Although the TPU was originally created for internal workloads, its advantages have made it not only widely used within Google, where it has become a pillar of the company's AI efforts, but also favored by technology giants such as Apple and by many large-model startups. Looking back over the ten years since the TPU's birth, the chip has gradually moved from the edge of the AI industry to the center of the stage. However, because the TPU infrastructure is built mainly around TensorFlow and JAX, Google also faces the challenge of becoming a "technology island" to a certain extent.



Ten years of keeping up with artificial intelligence innovation

With the deepening development of machine learning and deep learning algorithms, the industry's demand for high-performance, low-power dedicated AI computing chips has grown rapidly. Traditional general-purpose CPUs, and GPUs designed for tasks such as graphics acceleration and video rendering, cannot keep up with the huge demands of deep learning workloads, and suffer from problems such as low efficiency and limited dedicated computing capability.

“We did some rough calculations to see how much computing power would be needed if hundreds of millions of people had a three-minute conversation with Google every day,” said Jeff Dean, Google’s chief scientist. “We quickly realized that this would basically consume all the computing power deployed at Google. In other words, we would need to double the number of computers in Google’s data centers to support these new features.”

As a result, Google committed itself to exploring more cost-effective and energy-efficient machine learning solutions and launched the TPU project, deploying the first generation of TPU chips (TPU v1) internally in 2015. The TPU is an application-specific integrated circuit (ASIC) designed for a single purpose: running the matrix- and vector-based mathematical operations required to build AI models. Unlike a GPU's more general approach to matrix operations, the TPU's signature feature is its matrix multiplication unit (MXU).

According to Norm Jouppi, Google's vice president and engineering fellow, the emergence of the TPU saved Google the equivalent of 15 data centers. An important reason the TPU is more cost-effective is that Google's software stack is more vertically integrated than the GPU ecosystem: a dedicated engineering team builds the entire stack, from model implementations (Vertex Model Garden) to deep learning frameworks (Keras, JAX and TensorFlow) to a compiler optimized for TPU (XLA).
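To illustrate what that vertical integration looks like from a developer's seat, here is a minimal sketch (not Google-internal code, just an assumed everyday JAX usage pattern): a plain Python function is traced by JAX, handed to the XLA compiler, and run on whatever TPU (or other) devices are available, with the matrix multiply landing on the MXU.

    import jax
    import jax.numpy as jnp

    # jax.jit hands the traced computation to XLA, which compiles it
    # for whatever backend is available (TPU, GPU, or CPU).
    @jax.jit
    def predict(weights, inputs):
        # A matrix multiply like this is exactly the kind of operation
        # the TPU's MXU (matrix multiplication unit) accelerates.
        return jnp.tanh(inputs @ weights)

    weights = jnp.ones((128, 128), dtype=jnp.bfloat16)
    inputs = jnp.ones((8, 128), dtype=jnp.bfloat16)
    print(predict(weights, inputs).shape)   # (8, 128)

The same code runs unchanged on CPU or GPU; only the XLA backend differs, which is part of what the "vertical integration" argument is about.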

In terms of performance, TPU v1 has 65,536 8-bit MAC (multiply-accumulate) units, a peak throughput of 92 TOPS, and 28 MiB of on-chip memory. Compared with CPUs and GPUs, TPU v1 performed well in both response time and energy efficiency, and could significantly improve the inference speed of neural networks. The success of TPU v1 convinced Google that machine learning chips had broad prospects, and it continued to iterate on the design to launch more advanced and efficient products.
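Those headline figures are consistent with a simple back-of-the-envelope check. The calculation below assumes the roughly 700 MHz clock reported for TPU v1, a detail not given in this article:

    # Back-of-the-envelope check of TPU v1 peak throughput.
    # Assumption: ~700 MHz clock, as reported in Google's TPU v1 paper.
    macs = 256 * 256          # 65,536 8-bit MAC units in the systolic array
    ops_per_mac = 2           # each MAC is one multiply plus one add
    clock_hz = 700e6
    peak_ops = macs * ops_per_mac * clock_hz
    print(f"{peak_ops / 1e12:.1f} TOPS")   # ~91.8 TOPS, quoted as 92 TOPS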

For example, TPU v2 and TPU v3 were designed as server-side chips for both AI training and inference, supporting more complex AI tasks, while TPU v4 further enhanced scalability and flexibility to support large-scale AI computing clusters. TPU v2 was the first to expand the single-chip design into a larger supercomputing system, building a TPU Pod consisting of 256 TPU chips. TPU v3 added liquid cooling, and TPU v4 introduced optical circuit switches to further improve performance and efficiency.

In 2023, in the face of the "exaggerated" doubts and controversy surrounding a TPU v5 chip, Google jumped directly to the TPU v5e. The v5e adjusts the architecture to a single-TensorCore design: its INT8 peak throughput reaches 393 TOPS, exceeding the 275 TFLOPS (BF16) of v4, but its BF16 peak of 197 TFLOPS is lower than the previous generation. This positions the TPU v5e primarily for inference tasks, and it also reflects Google's strategic bet on the AI computing services market.

At its I/O developer conference in May this year, Google released the sixth-generation Trillium TPU. Amin Vahdat, vice president and general manager of Machine Learning, Systems and Cloud AI at Google Cloud, said that Trillium's peak compute performance is more than 4.7 times that of the previous-generation TPU v5e, with energy efficiency more than 67% higher. High-bandwidth memory capacity and bandwidth have both doubled, as has chip-to-chip interconnect bandwidth, to meet the needs of more advanced AI systems.
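Taking the article's figures at face value, the claimed uplift implies roughly the following peak BF16 throughput per Trillium chip; this is an illustrative calculation, not an official specification:

    # Illustrative estimate based only on figures quoted in this article.
    v5e_bf16_tflops = 197            # TPU v5e peak BF16 per chip
    uplift = 4.7                     # claimed Trillium gain over v5e
    print(f"~{v5e_bf16_tflops * uplift:.0f} TFLOPS BF16 per Trillium chip")
    # ~926 TFLOPS; actual Trillium figures may differ.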



It is worth mentioning that Trillium can scale up to 256 TPUs in a single high-bandwidth, low-latency Pod. By leveraging Google's advances in Pod-level scalability, multi-slice technology, and Titanium intelligent processing units, users will be able to link together hundreds of Pods of Trillium TPUs, connected by a multi-petabit-per-second data center network, into building-scale supercomputers.
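From a developer's point of view, that Pod-level scale is mostly reached through the framework rather than by hand. The following is a minimal, hypothetical JAX sketch (the shapes and the mesh axis name are illustrative, not Google's code) of sharding a matrix multiply across whatever TPU chips are attached:

    import numpy as np
    import jax
    import jax.numpy as jnp
    from jax.sharding import Mesh, NamedSharding, PartitionSpec

    # One-dimensional mesh over every attached TPU chip in the slice.
    mesh = Mesh(np.array(jax.devices()), axis_names=("data",))

    # Shard the batch dimension of the activations across chips;
    # replicate the (small) weight matrix on every chip.
    x = jnp.ones((4096, 1024))
    w = jnp.ones((1024, 1024))
    x = jax.device_put(x, NamedSharding(mesh, PartitionSpec("data", None)))
    w = jax.device_put(w, NamedSharding(mesh, PartitionSpec(None, None)))

    # jit + XLA handle any cross-chip communication over the Pod's
    # high-bandwidth interconnect automatically.
    y = jax.jit(lambda a, b: a @ b)(x, w)
    print(y.shape)   # (4096, 1024)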

Overall, the advantage of the TPU approach is its more centralized architectural design. Unlike multiple GPUs attached to the same board, TPUs are organized into cubes, which allows faster inter-chip communication, and deep cooperation with Broadcom has greatly improved transmission rates. In addition, designing for specific scenarios and use cases allows products to be optimized and iterated more quickly. However, because the TPU infrastructure is built mainly around TensorFlow and JAX while the mainstream of the industry innovates with Hugging Face models and PyTorch, Google also faces the problem of becoming a "technology island" to some extent.

Adopted by Apple and many AI startups

In terms of application, the Google TPU project was initially created for specific internal needs and quickly found wide use across departments, becoming one of the most mature and advanced custom chips in the AI field. According to Andy Swing, chief engineer of Google's machine learning hardware systems, the team originally expected to manufacture fewer than 10,000 TPU v1 chips but ultimately produced more than 100,000, with applications covering advertising, search, speech, AlphaGo, and even autonomous driving.

As performance and efficiency improved, TPU chips gradually became the backbone of Google's AI infrastructure and of almost all its products. Google Cloud Platform, for example, uses TPU chips extensively to accelerate the training and inference of machine learning models, providing high-performance, efficient computing. Through Google Cloud, users can access TPU-based virtual machine instances (VMs) to train and deploy their own machine learning models.
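On a Cloud TPU VM, the attached chips appear to frameworks as ordinary accelerator devices. A minimal check, assuming JAX is installed with TPU support, might look like this:

    import jax

    # On a Cloud TPU VM this lists the attached TPU cores, e.g. 4 or 8
    # devices depending on the accelerator type requested.
    print(jax.device_count())
    for d in jax.devices():
        print(d.platform, d.id)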

Although Google has a solid user base in cloud services, it does not sell the hardware directly. Industry analysts point out that Google is in fierce competition with OpenAI over generative AI; selling TPUs would mean directly challenging Nvidia as well, and "fighting on two fronts" may not be the most sensible strategy at present. At the same time, selling hardware directly involves high costs and complex supply chain management, while offering TPUs through cloud services simplifies installation, deployment, and management, reducing uncertainty and extra cost.

On the other hand, the close cooperation between Google Cloud and NVIDIA also needs to be considered. Google not only uses NVIDIA GPUs internally, but also provides services based on NVIDIA GPUs on its cloud service platform to meet customers' needs for high-performance computing and AI applications.

It is true that Nvidia's AI chips have become an indispensable asset for technology giants, but the industry is also exploring more diversified options. While the TPU is already widely used internally, Google is also trying to bring it to more customers as part of its AI services. As Andy Swing put it: "The locations where we use TPUs and Pods today are the ones best suited to the capabilities of current data centers, but we are changing data center design to better meet demand. The solutions we prepare today will therefore be very different from tomorrow's. We are building a global data center network full of TPUs."



Currently, many technology companies around the world are using Google's TPU chips. Apple, for example, has disclosed that it used Google TPUs to train its artificial intelligence models, saying that "this system enables us to train the AFM models efficiently and scalably, including AFM on-device, AFM server, and larger models." According to Apple, the server AFM was trained from scratch on 8,192 TPU v4 chips, using a sequence length of 4,096 and a batch size of 4,096 sequences, for 6.3 trillion tokens; the on-device AFM was trained on 2,048 Google TPU v5p chips.
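The numbers Apple quotes allow a rough sanity check of the training scale; the following is back-of-the-envelope arithmetic using only the figures above:

    # Rough scale check using only the figures Apple reported.
    seq_len = 4096
    batch_sequences = 4096
    tokens_per_step = seq_len * batch_sequences      # ~16.8 million tokens
    total_tokens = 6.3e12
    steps = total_tokens / tokens_per_step
    print(f"{tokens_per_step/1e6:.1f}M tokens/step, ~{steps:,.0f} optimizer steps")
    # ~16.8M tokens per step, roughly 375,000 steps across 8,192 TPU v4 chips.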

Other data show that more than 60% of funded generative AI startups and nearly 90% of generative AI unicorns use Google Cloud's AI infrastructure and Cloud TPU services, across a wide range of social and economic fields.

For example, well-known AI companies such as Anthropic, Midjourney, Salesforce, Hugging Face, and AssemblyAI make extensive use of Cloud TPU. Among them, Anthropic, widely seen as a strong rival to OpenAI, uses Google Cloud TPU v5e chips as hardware support for its large language model Claude, accelerating both training and inference. Many research and educational institutions also use Google TPU chips to support their AI-related projects, drawing on the chips' high-performance computing capabilities to speed up experiments and advance cutting-edge research and education.

It is worth noting that, according to Google, its latest TPU costs less than $2 per chip-hour to use, but customers need to commit to a three-year reservation to secure access. In a rapidly changing industry, this may pose a significant challenge for large-model companies.
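Even at that price, the sums add up quickly. An illustrative calculation, assuming roughly $2 per chip-hour and continuous use over a three-year term:

    # Illustrative cost arithmetic, assuming ~$2 per chip-hour and 24/7 use.
    price_per_chip_hour = 2.0
    hours_3_years = 24 * 365 * 3                    # 26,280 hours
    per_chip = price_per_chip_hour * hours_3_years  # ~$52,560 per chip
    pod_256 = per_chip * 256                        # a full 256-chip Pod
    print(f"${per_chip:,.0f} per chip, ${pod_256/1e6:.1f}M for 256 chips")
    # ~$52,560 per chip and ~$13.5M for a 256-chip Pod over three years.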

In any case, the TPU's ten-year journey has proved that, beyond CPUs and GPUs, the industry has another path to the computing power AI requires. The TPU has become the core of the AI features in almost all of Google's products, and underpins the rapid development of Google DeepMind's advanced foundation models and, by extension, the wider large-model industry. In the future, as AI technology develops and the market expands, more companies may choose Google TPU chips to meet their AI computing needs. At the same time, AI hardware may become ever more specialized, binding hardware and models more tightly together and making it harder to step outside existing frameworks to find new possibilities for innovation.