nature: a day inside the world's fastest supercomputer

2024-09-15

editor: qiao yang
[introduction] in the mountains of eastern tennessee, a record-breaking supercomputer called frontier is giving scientists unprecedented access to study everything from atoms to galaxies.

the construction of supercomputers is in full swing: sovereign-ai programs and technology giants alike keep pouring money into nvidia hardware and building new data centers.

against that backdrop, as of december 2023 the fastest supercomputer in the world was frontier, also known as olcf-5, located in oak ridge, tennessee, usa.

frontier is built from amd cpus and gpus, with nearly 50,000 processors (including roughly 38,000 gpus) and a computing speed of 1.102 exaflops, or 1.102 quintillion (10^18) floating-point operations per second.

that is faster than 100,000 laptops working simultaneously, and when it debuted in 2022, frontier broke a record: it was the first machine to cross the exascale threshold.
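
as a rough sanity check on those units, here is a minimal back-of-the-envelope sketch in python; the per-laptop figure of about 100 gigaflops is an assumption for illustration, not a number from the article.

```python
# back-of-the-envelope arithmetic for frontier's headline speed.
# the 1.102 exaflops figure comes from the article; the assumed
# laptop speed (~100 gigaflops) is illustrative only.

FRONTIER_FLOPS = 1.102e18   # 1.102 exaflops = 1.102 * 10^18 operations per second
LAPTOP_FLOPS = 100e9        # assumed ~100 gigaflops for a modern laptop

laptop_equivalents = FRONTIER_FLOPS / LAPTOP_FLOPS
print(f"frontier sustains {FRONTIER_FLOPS:.3e} floating-point operations per second")
print(f"that is roughly {laptop_equivalents:,.0f} laptop-equivalents, "
      "comfortably more than the 100,000 laptops mentioned above")
```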

the frontier supercomputer covers an area larger than two basketball courts

the point of pursuing such speed and scale is to meet the simulation needs of cutting-edge scientific research across many fields.

frontier is very good at creating simulations that capture both large-scale patterns and small-scale details, such as how tiny cloud droplets affect the rate of climate warming.

today, researchers from around the world log on to frontier to create cutting-edge models of everything from subatomic particles to galaxies, including simulating proteins for drug discovery and design, simulating turbulence to improve aircraft engines, and training open source llms that compete with those from google and openai.

one day this april, however, something unusual happened in frontier's operations.

to keep up with requests from scientists around the world, frontier’s power consumption climbed sharply that day, peaking at about 27 megawatts, enough to power roughly 10,000 homes, said bronson messer, scientific director of oak ridge national laboratory in tennessee, where frontier is located.

this also poses challenges to the supercomputer's cooling system. in messer's words, "the machine is running like a scalded dog."

as of 2023, frontier had a total of 1,744 users in 18 countries, and its computations and data have supported at least 500 published papers.

inside frontier's "brain"

much as one might imagine, the room housing frontier resembles a warehouse, filled with the steady, gentle electronic hum of the machine at work.

the room holds 74 cabinets, and each compute node contains 4 gpus and 1 cpu; the machine owes its speed chiefly to this vast number of gpus.
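
the article gives the cabinet count and the per-node mix but not the number of nodes per cabinet; the sketch below assumes roughly 128 nodes per cabinet, a figure not stated here, just to show how the totals above hang together.

```python
# rough consistency check of frontier's published component counts.
# the 74 cabinets and the 4-gpu / 1-cpu node layout come from the article;
# the nodes-per-cabinet figure is an assumption for illustration.

CABINETS = 74
NODES_PER_CABINET = 128      # assumed, not given in the article
GPUS_PER_NODE = 4
CPUS_PER_NODE = 1

nodes = CABINETS * NODES_PER_CABINET
gpus = nodes * GPUS_PER_NODE
cpus = nodes * CPUS_PER_NODE

print(f"~{nodes:,} nodes, ~{gpus:,} gpus, ~{cpus:,} cpus")
print(f"~{gpus + cpus:,} processors in total, close to the ~50,000 quoted above")
```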

"these gpus are very fast, but also very stupid. they can do the same thing over and over again," said messer, director of the lab.

this ability to run huge numbers of identical calculations at once is exactly what supercomputing workloads need, even if it is not good for much else.

behind this "stupidity", however, lies a kind of universality: scientists from any field can put the gpus to work by writing custom code for them.
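
to make the "same thing over and over" point concrete, here is a minimal, hypothetical sketch of the data-parallel style such codes rely on; it uses numpy on a cpu purely as a stand-in, since the article does not describe any specific frontier code.

```python
# a toy illustration of the data-parallel pattern gpus excel at:
# one simple operation applied identically to a huge batch of values.
# numpy on a cpu stands in here; on a machine like frontier the same
# pattern would be dispatched to gpus via gpu-aware libraries.

import numpy as np

# pretend these are per-particle positions and velocities in a simulation
rng = np.random.default_rng(0)
positions = rng.random((1_000_000, 3))
velocities = rng.random((1_000_000, 3))

dt = 1e-3  # one time step

# the "stupid but fast" part: an identical update rule applied to
# every particle at once, with no branching or per-item logic
positions += velocities * dt

print(positions.shape, positions.dtype)
```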

frontier operates day and night, and the engineering team responsible for operations and maintenance also works around the clock.

corey edmonds, one of the engineers who helped build the supercomputer, said an engineering team continuously monitors frontier for early signs of failure.

conner cunningham, for example, works the night shift from 7 pm to 7 am, watching more than a dozen monitors that track network and building security, as well as the local weather, to keep frontier running normally.

in fact, most nights pass quietly, and cunningham usually only needs to make a few rounds; he can study at his workstation the rest of the time.

"this job is a bit like being a firefighter. if anything happens, someone needs to be on duty to monitor it."

supporting big science

although frontier runs day and night, it is not easy for researchers to apply for the opportunity to use it.

messer and three colleagues are responsible for evaluating and approving proposals to use the machine. last year they approved 131 projects, an acceptance rate of about one in four.

to be approved, applicants must demonstrate that their project genuinely needs the entire system, generally for modeling across a wide range of time and length scales.

there are about 65 million node-hours available per year on frontier, and the most common allocation researchers get is 500,000 node-hours, which is equivalent to running the full system continuously for three days.
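
the node-hour figures above can be turned into equivalent full-system time with a little arithmetic; the sketch below assumes roughly 9,400 compute nodes, a number not given in this article, so treat the result as approximate.

```python
# converting node-hour allocations into equivalent full-system time.
# the 65 million node-hours per year and the 500,000 node-hour
# typical allocation come from the article; the node count is an assumption.

NODES = 9_400                      # assumed total compute nodes
YEARLY_NODE_HOURS = 65_000_000     # available per year (from the article)
TYPICAL_ALLOCATION = 500_000       # node-hours (from the article)

full_system_hours_per_day = NODES * 24
allocation_days = TYPICAL_ALLOCATION / full_system_hours_per_day
yearly_days = YEARLY_NODE_HOURS / full_system_hours_per_day

# the exact equivalence depends on the node count assumed;
# the article rounds the typical allocation to about three full-system days
print(f"a {TYPICAL_ALLOCATION:,} node-hour allocation is about "
      f"{allocation_days:.1f} days of the whole machine")
print(f"the yearly budget is about {yearly_days:.0f} full-system days")
```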

messer said researchers have access to about 10 times more computing resources on frontier than at other data centers.

frontier has nearly 50,000 processors and uses liquid cooling

with faster computing speeds and more computing resources, researchers can do more ambitious "big science."

frontier can, for example, simulate biological processes with atomic-level precision, such as how proteins or nucleic acids in solution interact with other parts of a cell.

in may this year, researchers used frontier to simulate a cubic droplet of water, about one-tenth the width of a human hair and containing more than 155 billion water molecules, in one of the largest atomic-level simulations ever run.

in the short term, the researchers hope to simulate organelles to inform lab experiments; they also hope to combine these high-resolution simulations with ultrafast imaging using x-ray free-electron lasers to accelerate discoveries.

all of this work is paving the way for a bigger future goal: modeling an entire cell, starting from its atoms.

with frontier, climate models have also become more accurate.

last year, climate scientist matt norman and other researchers used frontier to run a global climate model at a resolution of 3.25 kilometers, incorporating complex cloud motions at an even finer resolution.

creating forecast models that span decades requires the computing power of frontier's entire system.

for a model to be useful for weather and climate prediction, it must be able to simulate at least one year of climate per day of computing time.

frontier can simulate 1.26 years per day, a rate that allows researchers to create 50-year forecasts that are more accurate than before.
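
taking that simulation rate at face value, a short calculation shows what a 50-year projection costs in wall-clock time; this is a minimal sketch using only the article's own numbers.

```python
# how long a multi-decade climate projection takes at frontier's
# quoted throughput of 1.26 simulated years per wall-clock day.

SIM_YEARS_PER_DAY = 1.26   # from the article
TARGET_YEARS = 50          # a 50-year forecast, as mentioned above

wall_clock_days = TARGET_YEARS / SIM_YEARS_PER_DAY
print(f"a {TARGET_YEARS}-year run needs about {wall_clock_days:.0f} days of computing")
```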

run on any other computer, the same model would be far slower at that resolution while still accounting for the effects of clouds.

on larger cosmic scales, frontier can also provide higher resolution.

evan schneider, an astrophysicist at the university of pittsburgh, is also using frontier to study how galaxies the size of our milky way evolve as they age.

the galaxy models they created span four orders of magnitude in scale, with the largest structures reaching about 100,000 light-years across. before frontier, the largest galaxies simulated at similar resolution were dwarf galaxies with about one-fiftieth of that mass.

what frontier means for ai

as the reigning world number one, frontier is all the more distinctive for being one of the few such facilities that belongs to the public sector rather than to industry.

because research in ai often requires enormous computing power, a wide gap has opened between what academia and industry can achieve.

some scholars have calculated that in 2021, 96% of the largest ai models came from industry, and that industrial models were on average nearly 30 times the size of academic ones.

the difference shows up in investment as well: in 2021, non-defense public agencies in the united states provided $1.5 billion to support ai research, while global industry spent more than $340 billion.

since the release of commercial llms such as gpt-4 and gemini ultra, the gap has widened further, and this investment gap translates into a significant asymmetry in the computing resources available to industry and academia.

because model development in industry is driven by profit, it often neglects important issues that technological development must confront, such as basic research, the needs of low-income groups, assessing model risks and correcting model bias.

if academia is to take on these responsibilities, it will need computing power that can match the scale of the industry, which is where frontier comes in.

a typical example: llms trained by technology companies usually remain proprietary to varying degrees, whereas academic researchers often make their models freely available to anyone.

this will help university researchers compete with companies, says abhinav bhatele, a computer scientist at the university of maryland, college park. “the only way for academics to train models of similar scale is to have access to resources like frontier,” he says.

bhatele believes that facilities like frontier play a vital role in the field of ai, allowing more people to participate in technology development and share the results.

it is worth noting, though, that the race to build computing infrastructure among countries, technology companies and non-profit organizations is still ongoing, and even a machine as powerful as frontier will eventually be surpassed.

oak ridge national laboratory is already planning frontier's successor, called discovery, which will increase computing speed by 3 to 5 times.

for reference, frontier is 35 times faster than the fastest supercomputer in 2014, tianhe-2a, and 33,000 times faster than the fastest supercomputer in 2004, earth simulator.

researchers still crave faster speeds, but engineers face ongoing challenges, one of which is energy.

frontier's energy efficiency is more than four times that of its predecessor summit, largely thanks to a different cooling solution.

unlike summit, which uses chilled water, frontier is cooled with room-temperature water. about 3% to 4% of frontier's total energy consumption goes to cooling, compared with roughly 10% for summit.
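
combining those cooling percentages with the 27-megawatt peak mentioned earlier gives a sense of the absolute power involved; a rough sketch using only figures from the article.

```python
# rough cooling-power estimate at frontier's reported 27 mw peak draw.
# the percentages for frontier (3-4%) and summit (10%) and the 27 mw
# figure all come from the article; treat the results as approximate.

PEAK_POWER_MW = 27.0

frontier_cooling_low = PEAK_POWER_MW * 0.03
frontier_cooling_high = PEAK_POWER_MW * 0.04
summit_style_cooling = PEAK_POWER_MW * 0.10   # hypothetical: summit's overhead at the same load

print(f"frontier-style cooling: {frontier_cooling_low:.1f}-{frontier_cooling_high:.1f} mw")
print(f"a summit-style 10% overhead at the same load would be {summit_style_cooling:.1f} mw")
```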

for years, energy efficiency has been a key bottleneck in building more advanced supercomputers, and this bottleneck is expected to remain for the foreseeable future.

“we could have built an exascale supercomputer in 2012, but it would have been too expensive to power,” messer said. “it would have required an order of magnitude or two more electricity.”