
Google is the biggest winner: to put AI on the iPhone, Cook has bowed to his old rival

2024-07-31



Over the past two days, the launch of Apple Intelligence has been one of the biggest stories in tech.

Compared with the full version of Apple Intelligence announced more than a month ago, the features that arrived in iOS 18.1 beta 1 are incomplete: Image Playground, Genmoji, priority notifications, a Siri with onscreen awareness, and the ChatGPT integration are all still missing.

Even so, Apple did ship Writing Tools, call recording (with transcription), and a redesigned Siri.

Writing Tools supports rewriting text, making it more professional, and simplifying it, and works in scenarios such as chat, WeChat Moments posts, Xiaohongshu notes, and general writing. Call recording not only records calls but also transcribes them automatically into text for later review.

Siri has been "upgraded" as well, though for now the changes are cosmetic: a new glowing light effect around the screen edge and support for typed input.

What is most striking, however, is what Apple disclosed in a paper titled "Apple Intelligence Foundation Language Models": rather than mainstream GPUs such as NVIDIA's H100, Apple chose TPUs from its old rival Google to train the foundation models behind Apple Intelligence.


Image/Apple

Using Google TPU to create Apple Intelligence

As is well known, Apple Intelligence has three layers: the first is on-device AI that runs locally on Apple hardware; the second is cloud AI that runs in Apple's own data centers on its Private Cloud Compute technology. According to supply-chain reports, Apple is building those data centers around mass-produced M2 Ultra chips.

The third layer is access to third-party cloud models, such as GPT-4o.

All of that, however, is the inference side. How Apple trains its own models has long been a focus of industry attention. According to the paper, Apple trained two foundation models on TPU v4 and TPU v5p clusters:

One is AFM-on-device, a roughly 3-billion-parameter model trained on 2,048 TPU v5p chips that runs locally on Apple devices; the other is AFM-server, a larger model trained on 8,192 TPU v4 chips that ultimately runs in Apple's own data centers.


Image/Apple
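
Apple's paper also notes that this training ran on AXLearn, Apple's open-source, JAX-based deep-learning framework. The snippet below is not Apple's code, only a minimal JAX sketch of the data-parallel pattern that TPU training of this kind builds on; the toy model, batch sizes, and learning rate are all invented for illustration.

```python
# Minimal data-parallel training step in JAX, the style of workload
# TPU slices are built for. Purely illustrative; not Apple's code.
from functools import partial

import jax
import jax.numpy as jnp

def loss_fn(params, x, y):
    pred = x @ params["w"] + params["b"]  # toy linear model
    return jnp.mean((pred - y) ** 2)

@partial(jax.pmap, axis_name="batch")  # one model replica per TPU chip
def train_step(params, x, y):
    grads = jax.grad(loss_fn)(params, x, y)
    # All-reduce: average gradients across every chip in the slice.
    grads = jax.lax.pmean(grads, axis_name="batch")
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

n = jax.local_device_count()  # e.g. 4 chips on one TPU host
params = jax.device_put_replicated(
    {"w": jnp.zeros((8, 1)), "b": jnp.zeros((1,))}, jax.local_devices()
)
x = jnp.ones((n, 32, 8))  # leading axis = number of chips
y = jnp.ones((n, 32, 1))
params = train_step(params, x, y)
```

At Apple's scale, the same pattern simply spans thousands of chips, with the model itself sharded as well as the data.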

This seems strange. After all, GPUs such as Nvidia's H100 are the mainstream choice for AI training; there is even a saying that "AI training runs only on Nvidia GPUs."

By comparison, Google's TPU looks relatively obscure.

In fact, though, Google's TPU is an accelerator purpose-built for machine learning and deep learning workloads. With high compute throughput and low-latency networking, it performs very well on large-model training.

For example, a single TPU v4 chip delivers up to 275 TFLOPS of peak compute, and over an ultra-high-speed interconnect, 4,096 TPU v4 chips can be linked into one large-scale TPU supercomputer with exa-scale aggregate compute.
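
The pod-level arithmetic is straightforward (peak numbers only, ignoring interconnect overhead and real-world utilization):

\[
4096 \times 275\ \text{TFLOPS} \approx 1.13 \times 10^{6}\ \text{TFLOPS} \approx 1.1\ \text{exaFLOPS}
\]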

Nor is Apple alone: other large-model companies have also adopted Google's TPU to train their models. Anthropic's Claude is a typical example.


Chatbot Arena Ranking, Image/LMSYS

Claude is now the strongest competitor to OpenAI's GPT models; in the LMSYS Chatbot Arena, Claude 3.5 Sonnet and GPT-4o consistently sit at the top as evenly matched rivals. According to reports, Anthropic has not bought NVIDIA GPUs to build supercomputers, instead using TPU clusters on Google Cloud for training and inference.

Late last year, Anthropic also officially announced that it was among the first to train Claude on Google Cloud's TPU v5e clusters.

Anthropic's long-term use of TPUs, and the results Claude has delivered, are a strong demonstration of the efficiency and reliability of Google's TPU for AI training.

Google's own Gemini also relies entirely on its in-house TPUs for training. Gemini aims to push the frontier of natural language processing and generation, and training it means processing enormous amounts of text data and running complex model computations.

The TPU's raw compute and efficient distributed-training architecture allowed Gemini to complete training in a relatively short time and achieve significant performance breakthroughs.

Gemini is easy to explain, since it is Google's own model. But why did Anthropic and Apple choose Google's TPU instead of Nvidia's GPUs?

TPU vs. GPU: the secret battle between Google and Nvidia

At SIGGRAPH 2024, the top computer-graphics conference, held on Monday, Nvidia founder and CEO Jensen Huang revealed that Nvidia would ship out samples of Blackwell, its latest-generation GPU architecture, that week.

Nvidia had unveiled the Blackwell architecture and the new B200 GPU at its GTC conference on March 18, 2024. On performance, the B200 can reach 20 petaflops (quadrillion floating-point operations per second) at FP8 and the new FP6 precision, equipping it well for complex AI models.

Two months after Blackwell's debut, Google released its sixth-generation TPU, Trillium. Each chip delivers nearly 1,000 TFLOPS (trillion floating-point operations per second) of peak BF16 compute, and Google calls it "the highest-performing and most energy-efficient TPU to date."


Image/Google

Compared with Google's Trillium TPU, Nvidia's Blackwell GPU retains certain advantages in high-performance computing, backed by HBM3e high-bandwidth memory and the CUDA ecosystem. In a single system, Blackwell can connect up to 576 GPUs in parallel for enormous compute and flexible scalability.

Google's Trillium TPU, by contrast, is focused on efficiency and low latency in large-scale distributed training. Its design keeps utilization high when training large models, while ultra-fast network interconnects cut communication latency and raise overall computing efficiency.
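
As a concrete sketch of what that distributed style looks like in practice, here is how JAX, the framework most commonly used on TPUs, lets a program lay a logical mesh over the chips and leave the inter-chip communication to the XLA compiler. The 2x4 mesh assumes an eight-chip slice, and the shapes are invented for the example:

```python
# Hedged sketch: shard a matrix multiply across the chips of a TPU slice.
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Arrange the slice's chips into a logical 2x4 mesh (assumes 8 chips).
mesh = Mesh(np.array(jax.devices()).reshape(2, 4), axis_names=("data", "model"))

x = jnp.ones((1024, 512))
w = jnp.ones((512, 256))
# Shard activations along the "data" axis and weights along "model".
x = jax.device_put(x, NamedSharding(mesh, P("data", None)))
w = jax.device_put(w, NamedSharding(mesh, P(None, "model")))

@jax.jit
def forward(x, w):
    return x @ w  # XLA inserts the necessary inter-chip communication

y = forward(x, w)  # the result stays sharded across the mesh
```

The point is that the programmer declares how tensors are split, and the compiler schedules the communication over the TPU's interconnect, which is exactly where the low-latency fabric pays off.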

And the rivalry is not limited to the latest generation of AI chips: the "secret war" between Google and Nvidia has in fact been running for eight years, ever since Google built its own AI chip, the TPU, in 2016.

To date, NVIDIA's H100 is the most popular AI chip on the mainstream market. It offers up to 80GB of HBM3 high-bandwidth memory and efficient multi-GPU communication over the NVLink interconnect. Built on Tensor Core technology, the H100 is extremely efficient at deep-learning training and inference.

At the same time, the TPU v5e has a clear edge in cost-effectiveness and is particularly well suited to training small and mid-sized models. Its strengths are powerful distributed computing and an optimized performance-per-watt ratio, which let it handle large-scale data well. The v5e is also offered through Google Cloud, making flexible cloud training and deployment straightforward.


Google data center, photo/Google

In general, NVIDIA and Google are pursuing different AI-chip strategies: NVIDIA pushes the performance limits of AI models with raw compute and broad developer support, while Google raises the efficiency of large-scale AI training with an efficient distributed-computing architecture. These two paths give each company unique advantages in its own application domains.

More importantly, the only players who can beat Nvidia are those that pursue hardware-software co-design and are strong in both chips and software.

Google is one such rival.

The strongest challenger to Nvidia's hegemony

Blackwell is NVIDIA's next major upgrade after Hopper, with formidable compute and a design aimed at large language models (LLMs) and generative AI.

The B200 GPU is reportedly fabricated on TSMC's 4NP process, packs 208 billion transistors across two GPU dies joined by interconnect technology, and carries up to 192GB of HBM3e (high-bandwidth memory) with up to 8TB/s of bandwidth.

On performance, Google's Trillium TPU improves BF16 compute by 4.7x over the previous-generation TPU v5e, while doubling HBM capacity, HBM bandwidth, and chip-interconnect bandwidth. Trillium also adds the third-generation SparseCore, which accelerates training of next-generation foundation models at lower latency and lower cost.
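
That 4.7x figure squares with the "nearly 1,000 TFLOPS" number quoted earlier: TPU v5e peaks at about 197 TFLOPS of BF16 compute, so

\[
197\ \text{TFLOPS} \times 4.7 \approx 926\ \text{TFLOPS}
\]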

Trillium is particularly well suited to training large language models and recommendation systems. It can scale to hundreds of pods, connecting tens of thousands of chips over a multi-petabit-per-second data-center network into a building-scale supercomputer, greatly improving computing efficiency and reducing network latency.


Image/Google

Google Cloud users will be able to take advantage of the chip starting in the second half of this year.

Overall, the hardware advantage of Google's TPU is its efficient compute and low-latency distributed-training architecture, which make it excel at training large language models and recommendation systems. But the TPU's advantage also rests on something else: a complete ecosystem independent of CUDA, and deeper vertical integration.

Through the Google Cloud platform, users can train and deploy flexibly in the cloud. This service model not only reduces a company's hardware investment but also improves AI training efficiency. Google Cloud also provides a range of tools and services for AI development, such as TensorFlow and Jupyter Notebook, which let developers train and test models more conveniently.


Google TPU v5p used by Apple, photo/Google

Google's AI ecosystem also includes a variety of development tools and frameworks, such as TensorFlow, the widely used open-source machine-learning framework that can take full advantage of the TPU's hardware acceleration. Google provides further supporting tools such as TPUEstimator and Keras, and their seamless integration greatly simplifies development.
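
The same goes for JAX, the library underneath Apple's AXLearn. On a provisioned Cloud TPU VM (an assumption for this sketch), a few lines are enough to confirm that the runtime sees every chip in the slice:

```python
# Quick sanity check on a Cloud TPU VM: list the chips JAX can see.
import jax

print(jax.device_count())        # total chips across the whole slice
print(jax.local_device_count())  # chips attached to this host
for d in jax.devices():
    print(d.platform, d.id)      # e.g. "tpu" with chip ids 0..N-1
```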

Beyond that, Google has another advantage: it is itself the customer with the greatest demand for TPU compute. From processing YouTube's vast video library to every round of Gemini training and inference, the TPU has long been woven into Google's own business systems, serving Google's enormous compute needs.

It is fair to say Google's vertical integration runs far deeper than Nvidia's: it controls almost every key node from model training to application to user experience. That gives Google far more room to optimize for efficiency from the bottom up as technology and the market evolve.

So even though the Trillium TPU still cannot rival the Blackwell GPU on raw chip specs, when it comes to training large models, Google's system-level optimization lets it match or even surpass Nvidia's CUDA ecosystem in efficiency.

Using TPU in Google Cloud is the best choice for Apple

In short, the performance, cost, and ecosystem advantages of Google's TPU clusters make them an ideal choice for large-scale AI training. Conversely, using TPUs on Google Cloud is also, at this stage, the best choice for Apple.


Apple also uses supercomputers based on TPU v4. Image/Google

On one side are performance and cost. The TPU excels at large-scale distributed training, providing efficient, low-latency compute that meets Apple's model-training needs. By using the Google Cloud platform, Apple can reduce hardware costs, adjust computing resources flexibly, and optimize the overall cost of AI development.

On the other side is the ecosystem. Google's AI development ecosystem offers rich tooling and support, enabling Apple to develop and deploy its AI models more efficiently. Google Cloud's strong infrastructure and technical support also give Apple's AI projects a solid foundation.

In March this year, Sumit Gupta, who has worked at Nvidia, IBM, and Google, joined Apple to lead cloud infrastructure. He reportedly joined Google's AI infrastructure team in 2021 and ultimately became product manager for Google's TPU, its in-house Arm CPUs, and other infrastructure.

Sumit Gupta understands the advantages of Google's TPU better than most people inside Apple.

In the first half of 2024, the technology world has been churning.
Large models are being put into practice at an accelerating pace, and AI applications keep emerging: AI phones, AI PCs, AI home appliances, AI search, AI e-commerce;
Vision Pro went on sale in China, setting off a new wave of XR spatial computing;
HarmonyOS NEXT was officially released, reshaping the mobile OS ecosystem;
Automobiles have fully entered the "second half," with intelligence now the top priority;
Competition in e-commerce has grown ever fiercer, with lower prices and, at times, lower service;
The wave of going overseas is surging, as Chinese brands set out on the road to globalization.

In the scorching heat of July, Lei Technology's mid-year review series launches, taking stock of the brands, technologies, and products from the first half of 2024 that deserve to be remembered, recording the past and looking to the future. Stay tuned.