news

Nvidia's so-called "hot chips" are actually "hot platforms"

2024-08-24


Nvidia was hit with rare bad news earlier this month when reports emerged that the company's much-anticipated "Blackwell" GPU accelerator could be delayed by as much as three months due to a design flaw. Nvidia spokespeople, however, said everything was on schedule; some suppliers said nothing had changed, while others reported only routine delays.

Industry insiders expect that users will know more about Blackwell's situation when Nvidia releases its second-quarter financial results for fiscal 2025 next Wednesday.

The Blackwell chips — B100, B200 and GB200 — will be a focus of this year’s Hot Chips conference next week at Stanford University in California, where Nvidia will introduce its architecture, detail some new innovations, outline its use of AI in designing chips and discuss its research into liquid cooling in data centers for running these growing AI workloads. The company will also show a Blackwell chip already running in one of its data centers, according to Dave Salvator, Nvidia’s director of accelerated computing products.

Most of what Nvidia had to say about Blackwell was already known, such as the Blackwell Ultra GPUs coming next year and the next-generation Rubin GPUs and Vera CPUs arriving starting in 2026. However, Salvator stressed that when talking about Blackwell, it is important to think of it as a platform, not a single chip, as he told reporters and analysts at a briefing this week ahead of Hot Chips.

“When you think about NVIDIA and the platforms we build, GPUs, networking, and even our CPUs are just the beginning,” he said. “We’re doing system-level and datacenter-level engineering to build these systems and platforms that can actually go out and solve those really hard generative AI challenges. We’ve seen the size of the models grow over time, and most generative AI applications need to run in real time, and the inference requirements have increased dramatically over the last few years. Real-time large language model inference requires multiple GPUs and, in the not-too-distant future, multiple server nodes.”

This includes not only Blackwell GPUs and Grace CPUs, but also NVLink Switch chips, BlueField-3 DPUs, ConnectX-7 and ConnectX-8 NICs, Spectrum-4 Ethernet switches, and Quantum-3 InfiniBand switches. Salvator also presented separate details on the NVLink Switch, compute, Spectrum-X800, and Quantum-X800 elements of the platform.

Nvidia unveiled its much-anticipated Blackwell architecture at GTC 2024 in March, and hyperscalers and OEMs quickly signed on. The company is targeting the rapidly expanding field of generative AI, where large language models (LLMs) keep getting larger, as evidenced by Meta's Llama 3.1, which launched in July and includes a model with 405 billion parameters. As LLMs grow larger and the need for real-time inference persists, Salvator said, they will require more compute and lower latency, which calls for a platform approach.

“Like most other LLMs, the services that will be powered by this model are expected to run in real time. To do that, you need multiple GPUs,” he said. “The challenge is how to strike a balance between high performance of the GPUs, high utilization of the GPUs, and providing a good user experience to the end users of these AI-driven services.”
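Some rough arithmetic shows why a model of that size spills across GPUs. The sketch below is a back-of-the-envelope calculation, not a figure from Nvidia's briefing; the GPU memory capacities are commonly published HBM sizes and are assumptions for illustration.

```python
import math

# Back-of-the-envelope sketch: memory needed just to hold the weights of a
# 405B-parameter model at different precisions, versus per-GPU HBM capacity.
# GPU memory sizes are assumed values for illustration, not figures from the talk.
PARAMS = 405e9                                   # Llama 3.1 405B
BYTES_PER_PARAM = {"FP16": 2.0, "FP8": 1.0, "FP4": 0.5}
GPU_MEMORY_GB = {"H100 (80 GB, assumed)": 80, "B200 (192 GB, assumed)": 192}

for precision, nbytes in BYTES_PER_PARAM.items():
    weights_gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{weights_gb:,.0f} GB of weights alone")
    for gpu, mem_gb in GPU_MEMORY_GB.items():
        # Weights only -- KV cache and activations add substantially more.
        print(f"  needs at least {math.ceil(weights_gb / mem_gb)} x {gpu}")
```

Even at reduced precision, the weights alone exceed a single GPU's memory, which is the practical argument for treating Blackwell as a multi-GPU, multi-node platform.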

01 The need for speed

With Blackwell, Nvidia doubled NVLink bandwidth per GPU from 900 GB/sec to 1.8 TB/sec. The company's Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology moves some of the compute into the switch itself, where the reduction actually takes place. “It lets us do some offload from the GPU to help accelerate performance, and it also helps smooth network traffic over the NVLink fabric,” Salvator said. “These are all innovations we continue to drive at the platform level.”
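To put the bandwidth doubling in perspective, here is a simple, idealized estimate of how link speed affects a ring all-reduce, the collective that dominates multi-GPU traffic. The payload size and the cost model are assumptions for illustration, and the sketch ignores the additional savings SHARP gets from doing reductions inside the switch.

```python
# Idealized ring all-reduce time: each GPU moves ~2*(n-1)/n of the payload
# over its link. Payload size below is a hypothetical per-step figure.
def ring_allreduce_seconds(payload_gb: float, n_gpus: int, link_gb_per_s: float) -> float:
    traffic_gb = 2 * (n_gpus - 1) / n_gpus * payload_gb
    return traffic_gb / link_gb_per_s

PAYLOAD_GB = 16.0  # hypothetical gradient/activation payload per step
for n_gpus in (8, 72):
    t_hopper = ring_allreduce_seconds(PAYLOAD_GB, n_gpus, 900)      # 900 GB/s links
    t_blackwell = ring_allreduce_seconds(PAYLOAD_GB, n_gpus, 1800)  # 1.8 TB/s links
    print(f"{n_gpus} GPUs: {t_hopper*1e3:.1f} ms at 900 GB/s -> {t_blackwell*1e3:.1f} ms at 1.8 TB/s")
```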

The multi-node GB200 NVL72 is a liquid-cooled chassis that links 72 Blackwell GPUs and 36 Grace CPUs in a rack-scale design that Nvidia claims behaves like a single giant GPU, delivering higher inference performance for trillion-parameter LLMs such as GPT-MoE-1.8T. Nvidia puts its inference performance at 30 times that of an HGX H100 system and its training speed at 4 times that of the H100.
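A quick calculation, using only the numbers above, gives a feel for how a model of that size lands on the rack; the even split below is a simplification that ignores expert placement, KV cache, and activations.

```python
# Weights of a 1.8-trillion-parameter MoE model stored at FP4 (0.5 bytes/param),
# spread evenly across the 72 Blackwell GPUs in a GB200 NVL72 rack.
# Even splitting is a simplification; real MoE sharding is more involved.
PARAMS = 1.8e12
BYTES_PER_PARAM_FP4 = 0.5
N_GPUS = 72

total_gb = PARAMS * BYTES_PER_PARAM_FP4 / 1e9
print(f"Total weights at FP4: ~{total_gb:,.0f} GB")
print(f"Per GPU across NVL72: ~{total_gb / N_GPUS:.1f} GB (weights only)")
```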

Nvidia has also added native support for FP4, which it says delivers the same accuracy as FP16 while reducing bandwidth usage by 75 percent. The accuracy is preserved by the company’s Quasar Quantization System, software that leverages Blackwell’s Transformer Engine; Salvator demonstrated the result by comparing generative AI images created with FP4 and FP16, with barely any discernible difference between the two.

With FP4, models can use less memory and perform even better than they do with FP8 on Hopper GPUs.
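For intuition, here is a minimal, generic 4-bit quantize/dequantize round trip. It is an integer-style stand-in written for illustration only; it is not FP4 and not Nvidia's Quasar Quantization System, which relies on Blackwell's Transformer Engine and careful calibration to hold accuracy.

```python
import numpy as np

# Generic symmetric 4-bit quantization round trip -- an illustrative stand-in,
# not FP4 and not the Quasar Quantization System.
def quantize_int4(w: np.ndarray):
    scale = np.abs(w).max() / 7.0                          # symmetric int4 range [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(8).astype(np.float32)
q, scale = quantize_int4(w)
print("original :", np.round(w, 3))
print("recovered:", np.round(dequantize(q, scale), 3))
print("storage  : 4 bits per weight vs 16 bits, i.e. 75 percent less bandwidth")
```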

02 Liquid Cooling System

On the liquid cooling front, Nvidia will present a warm-water, direct-to-chip cooling approach that it says could reduce data center power usage by 28 percent.

“What’s interesting about this approach is some of the benefits, which include improved cooling efficiency, lower operating costs, longer server life, and the possibility of repurposing captured heat for other purposes,” Salvator said. “It definitely helps with cooling efficiency. One way is that, as the name suggests, this system doesn’t actually use a chiller. If you think about how a refrigerator works, it works pretty well. However, it also requires electricity. By going with this solution that uses warm water, we don’t have to use a chiller, which saves us some energy and reduces operating costs.”
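As a rough sense of scale, the arithmetic below applies that 28 percent figure to a hypothetical facility; the power draw and electricity price are assumptions for illustration, not numbers from Nvidia.

```python
# What-if arithmetic applying the quoted 28 percent power reduction.
# Facility size and electricity price are hypothetical assumptions.
FACILITY_MW = 50.0        # hypothetical AI data center draw
PRICE_USD_PER_MWH = 80.0  # hypothetical electricity price
HOURS_PER_YEAR = 8760

saved_mw = FACILITY_MW * 0.28
annual_savings = saved_mw * HOURS_PER_YEAR * PRICE_USD_PER_MWH
print(f"~{saved_mw:.0f} MW saved on a {FACILITY_MW:.0f} MW facility")
print(f"~${annual_savings/1e6:.1f}M per year at ${PRICE_USD_PER_MWH:.0f}/MWh")
```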

Another topic is how Nvidia uses artificial intelligence to help design its AI chips, which are described in Verilog, a four-decade-old hardware description language that expresses circuits as code. Nvidia is tackling this with an autonomous Verilog agent called VerilogCoder.

“Our researchers have developed a large language model that can be used to speed up the creation of the Verilog code that describes our systems,” he said. “We’re going to use it in future generations of our products to help build those codes. It can do a lot of things. It can help speed up the design and verification process. It can speed up the manual work of design and essentially automate a lot of tasks.”
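The sketch below is a toy illustration of that workflow, not Nvidia's VerilogCoder: a stand-in generator returns canned Verilog for a natural-language spec, and a cheap text check hints at the kind of automated verification an agent would layer on top with real simulation and formal tools.

```python
# Toy illustration of LLM-assisted RTL generation -- entirely hypothetical,
# not VerilogCoder. The "generator" just returns canned Verilog for a counter.
def generate_verilog(spec: str) -> str:
    """Stand-in for an LLM call that turns a natural-language spec into RTL."""
    return """\
module counter #(parameter WIDTH = 8) (
    input  wire             clk,
    input  wire             rst_n,
    output reg [WIDTH-1:0]  count
);
    always @(posedge clk or negedge rst_n)
        if (!rst_n) count <= {WIDTH{1'b0}};
        else        count <= count + 1'b1;
endmodule
"""

def basic_checks(rtl: str, module_name: str) -> bool:
    # Cheap lint-style text checks; real verification needs a simulator or formal tools.
    return f"module {module_name}" in rtl and "endmodule" in rtl

rtl = generate_verilog("8-bit synchronous counter with active-low reset")
print("passes basic checks:", basic_checks(rtl, "counter"))
```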