
google reveals its tpu secret weapon: alphachip appears in nature, with an in-depth account of how ai designs chips

2024-09-29


recently, google deepmind officially announced its latest chip design algorithm, alphachip, in nature. the method is dedicated to accelerating and optimizing the development of computer chips. it has been tested on multiple tpu products and can complete, in just a few hours, chip layout design work that takes human experts weeks or even months.

in 2020, google published a landmark preprint, "chip placement with deep reinforcement learning", showing the world for the first time chip layouts designed with a novel reinforcement learning method. this innovation allowed google to bring ai into tpu chip design and achieve chip layouts that surpass those of human designers.

by 2022, google further open sourced the algorithm code described in the paper, allowing researchers around the world to use this resource to pre-train chip blocks.

today, this ai-driven learning method has stood the test of multiple generations of products, including tpu v5e, tpu v5p and trillium, and has achieved remarkable results inside google. even more notably, the google deepmind team recently published an addendum to this method in nature, elaborating in more detail on its far-reaching impact on the field of chip design. at the same time, google released a checkpoint pre-trained on 20 tpu modules, shared the model weights, and named the method alphachip.

the advent of alphachip not only heralds wider use of ai in chip design, but also marks our entry into a new era of ai-driven chip design.

alphachip: how google deepmind uses ai to revolutionize chip design

as one of google deepmind's flagship achievements, alphachip is capturing the attention of the global technology community with its revolutionary progress in chip design.

chip design sits at the pinnacle of modern technology. its complexity lies in the ingenious connection of countless precision components through extremely fine wires. as one of the first reinforcement learning techniques applied to a real-world engineering problem, alphachip can complete chip layouts that are comparable to or better than those of human designers in just a few hours, instead of weeks or months of manual labor. this epoch-making development opens the door to possibilities beyond traditional limits.

so, how exactly does alphachip achieve this feat?

alphachip's secret is its reinforcement learning approach, which treats chip layout design as a game. starting from a blank grid, alphachip places one circuit component at a time until everything is in place, and the system then assigns a reward based on the quality of the resulting layout.
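
as a rough illustration (not google's actual implementation), the sketch below treats placement as a sequential game: an agent places one macro per step on a grid and receives a single reward at the end based on layout quality. the macro list, grid size, random policy and wirelength proxy are all assumptions made for the example.

```python
import random

# illustrative macros: (name, width, height) -- not a real netlist
MACROS = [("sram0", 4, 3), ("sram1", 4, 3), ("ctrl", 2, 2), ("phy", 3, 5)]
GRID_W, GRID_H = 16, 16

def play_episode(policy):
    """play one placement 'game': place each macro in turn, then score the layout."""
    occupied = set()
    centers = {}
    for name, w, h in MACROS:
        # the policy chooses a legal lower-left grid cell for the current macro
        legal = [(x, y) for x in range(GRID_W - w + 1)
                        for y in range(GRID_H - h + 1)
                 if all((x + dx, y + dy) not in occupied
                        for dx in range(w) for dy in range(h))]
        x, y = policy(name, legal, centers)
        occupied.update((x + dx, y + dy) for dx in range(w) for dy in range(h))
        centers[name] = (x + w / 2, y + h / 2)
    # terminal reward: negative sum of pairwise manhattan distances,
    # a crude stand-in for a wirelength-based layout score
    names = list(centers)
    wirelength = sum(abs(centers[a][0] - centers[b][0]) +
                     abs(centers[a][1] - centers[b][1])
                     for i, a in enumerate(names) for b in names[i + 1:])
    return centers, -wirelength

# a random policy stands in for the trained agent
layout, reward = play_episode(lambda name, legal, placed: random.choice(legal))
print(f"reward: {reward:.1f}")
```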

more importantly, google innovatively proposed an "edge-based" graph neural network. this enables alphachip to learn the relationships between chip components and apply them across entire chip designs, improving with every layout it produces. much like alphago, alphachip learns through "play" and masters the art of designing excellent chip layouts.
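
the paper's exact architecture is not reproduced here; the sketch below only illustrates the "edge-based" idea under assumed shapes: edge embeddings are computed from the features of the two components an edge connects, and each component then aggregates the embeddings of its incident edges. the toy netlist, feature sizes and single projection matrices are illustrative stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy netlist: 4 components with 8-dim features, joined by 3 edges (assumed shapes)
node_feats = rng.normal(size=(4, 8))
edges = [(0, 1), (1, 2), (2, 3)]

# stand-ins for trained weights: one projection for edges, one for nodes
W_edge = rng.normal(size=(8, 16))  # concat of two endpoint vectors -> edge embedding
W_node = rng.normal(size=(8, 8))   # mean of incident edge embeddings -> node update

def edge_centric_pass(h):
    """one round of edge-based message passing over the netlist graph."""
    # edges read the embeddings of the two components they connect
    edge_emb = [np.tanh(W_edge @ np.concatenate([h[u], h[v]])) for u, v in edges]
    # each component aggregates the embeddings of its incident edges
    new_h = h.copy()
    for i in range(len(h)):
        incident = [e for e, (u, v) in zip(edge_emb, edges) if i in (u, v)]
        if incident:
            new_h[i] = np.tanh(W_node @ np.mean(incident, axis=0))
    # a graph-level embedding can be taken as the mean over edge embeddings
    return new_h, np.mean(edge_emb, axis=0)

node_emb, graph_emb = edge_centric_pass(node_feats)
print(node_emb.shape, graph_emb.shape)  # (4, 8) (8,)
```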

in the specific process of designing tpu layouts, alphachip is first pre-trained on a variety of modules from previous chip generations, including on-chip and inter-chip network modules, memory controllers, and data transmission buffers. this pre-training phase gives alphachip a wealth of experience. google then uses alphachip to generate high-quality layouts for current tpu modules.

unlike traditional methods, alphachip keeps optimizing itself by solving more chip layout tasks, just as human experts refine their skills through practice. as deepmind co-founder and ceo demis hassabis put it, google has built a powerful feedback loop around alphachip:

* first, train an advanced chip design model (alphachip)

* second, use alphachip to design better ai chips

* then, use these ai chips to train better models

* finally, use these models to design better chips

cycle after cycle, the model and the ai chip improve together. demis hassabis said, "this is part of the reason why the google tpu stack performs so well."

alphachip not only places more modules than human experts, but also achieves significantly shorter wiring. with each new generation of tpu, alphachip designs better chip layouts and delivers a more complete overall floorplan, shortening the design cycle and improving chip performance.

number of chip blocks designed by alphachip and average wirelength reduction across three generations of google tpu (v5e, v5p, trillium)

the 10-year journey of google tpu: from persistence in asic to innovation in ai design

as an explorer and pioneer of the tpu, google's development history along this technology line reflects not only keen insight but also extraordinary courage.

as is well known, in the 1980s the asic (application-specific integrated circuit), characterized by high cost-effectiveness, strong processing power and high speed, won wide favor in the market. however, asic functionality is fixed by custom mask tooling, which means customers must pay expensive upfront non-recurring engineering (nre) costs.

against this backdrop, the fpga (field-programmable gate array) entered the public eye with the advantages of lower upfront costs and reduced risk for custom digital logic. although not superior in outright performance, it carved out a unique place in the market.

at the time, the industry generally predicted that moore's law would drive fpga performance beyond what most asic use cases required. it turned out that the fpga, as a programmable "universal chip", performs well for exploratory and low-volume products and can beat gpus on speed, power or cost, but it still cannot escape the trade-off that generality and optimality cannot be achieved at the same time. once fpgas paved the way for a specialized architecture, they gave way to more specialized asics.

entering the 21st century, the ai boom grew ever hotter. machine learning and deep learning algorithms kept iterating, the industry's demand for high-performance, low-power dedicated ai compute chips rose, and cpus, gpus and the like increasingly struggled with many complex tasks. against this background, google made a bold decision in 2013: choose the asic route, build tpu infrastructure, and develop it around tensorflow and jax.

it is worth noting that independently developing an asic is a long-cycle, capital-intensive, high-threshold and high-risk process; choosing the wrong direction can lead to huge economic losses. nevertheless, in order to explore more cost-effective and energy-efficient machine learning solutions, after google made breakthrough progress in deep-learning-based image recognition in 2012, it immediately began developing tpu v1 in 2013 and brought the first-generation tpu chip (tpu v1) online internally in 2015, marking the birth of the world's first accelerator designed specifically for ai.

fortunately, the tpu soon got a high-profile chance to prove itself: in march 2016, alphago lee defeated world go champion lee sedol. as the second-generation version of the alphago series, it ran on google cloud and used 50 tpus for computation.

however, tpu did not immediately achieve large-scale successful application in the industry. it was not until the alphachip chip layout method was proposed that tpu truly entered a new stage of development.

google tpu development history

in 2020, google demonstrated alphachip's capabilities in the preprint "chip placement with deep reinforcement learning". the method can learn from past experience and keep improving, and it generates rich feature embeddings of input netlists through a neural architecture designed to accurately predict reward across a wide variety of netlists and their layouts.

alphachip treats the conditions for performance optimization as the victory conditions of a game and uses reinforcement learning, continuously improving its chip layout ability by training an agent to maximize cumulative reward. the team ran 10,000 games, letting the ai practice placement and routing on 10,000 chip blocks while collecting data and continuously learning and optimizing.
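
the precise reward definition belongs to the paper; as a hedged illustration only, a layout-quality reward of this kind typically combines a wirelength proxy with penalties such as congestion or density, and the agent is trained to maximize it. the weights and values below are arbitrary placeholders, not alphachip's actual parameters.

```python
def layout_reward(wirelength_proxy, congestion, density,
                  w_congestion=0.1, w_density=0.05):
    """illustrative layout score: lower wirelength, congestion and density
    give a higher reward. the weights are placeholders, not alphachip's values."""
    cost = wirelength_proxy + w_congestion * congestion + w_density * density
    return -cost

# the agent is trained to pick placements that maximize this (terminal) reward
print(layout_reward(wirelength_proxy=1250.0, congestion=3.2, density=0.6))
```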

ultimately, they found that compared with human engineers, the ai matched or outperformed manual layouts in area, power and wirelength, while taking far less time to meet design criteria. the results show that alphachip can generate layouts that rival or exceed manual efforts on modern accelerator netlists in under 6 hours; under the same conditions, human experts may need several weeks to complete the same work.

with the help of alphachip, google is relying on tpus more and more. in december 2023, google launched three different versions of gemini, a multimodal general-purpose large model whose training made extensive use of the cloud tpu v5p chip. in may 2024, google released the sixth-generation tpu chip, trillium, which can scale to a cluster of up to 256 tpus in a single high-bandwidth, low-latency pod; compared with previous generations, trillium offers stronger capabilities for model training.

at the same time, tpu chips have gradually moved beyond google and gained wider market recognition. on july 30, 2024, apple stated in a research paper that it had selected two google tensor processing unit (tpu) cloud clusters to train the afm models in the apple intelligence ecosystem. other data show that more than 60% of generative ai startups and nearly 90% of generative ai unicorns are using google cloud's ai infrastructure and cloud tpu services.

all signs suggest that after ten years of quiet sharpening, google's tpu has emerged from its incubation period and begun to repay google in the ai era with excellent hardware performance. the "ai designs ai chips" path embodied in alphachip also opens up new horizons in the field of chip design.

ai revolutionizes chip design: from google alphachip to the exploration of full-process automation

although alphachip stands out in the field of ai-designed chips, it is not alone. ai technology has already extended widely into key links such as chip verification and testing.

the core task of chip design is to optimize the chip's power, performance and area; these three key indicators are collectively called ppa. this challenge is also known as design space exploration. traditionally the task is completed with eda tools, but to reach optimal results, chip engineers must make manual adjustments and then hand the design back to the eda tools for another round of optimization, over and over. the process is like arranging furniture at home, constantly trying to maximize space utilization and improve the flow, except that each adjustment is equivalent to moving all the furniture out and rearranging it, which is extremely time-consuming and labor-intensive.
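
in code terms, design space exploration can be pictured as a search loop over flow parameters that scores each candidate by its ppa. the knobs, the fake flow, the scoring weights and the random-search strategy below are illustrative assumptions, not any vendor's actual api.

```python
import random

# hypothetical flow knobs an engineer (or an ai agent) might sweep -- illustrative only
SEARCH_SPACE = {
    "target_clock_ns": [0.8, 0.9, 1.0, 1.1],
    "utilization":     [0.55, 0.65, 0.75],
    "vt_mix":          ["lvt_heavy", "balanced", "hvt_heavy"],
}

def run_flow(cfg):
    """stand-in for a real eda run; returns made-up (power_mw, delay_ns, area_um2)."""
    power = 100 * cfg["utilization"] + (20 if cfg["vt_mix"] == "lvt_heavy" else 0)
    delay = cfg["target_clock_ns"] + (0.1 if cfg["vt_mix"] == "hvt_heavy" else 0)
    area = 5000 / cfg["utilization"]
    return power, delay, area

def ppa_score(power, delay, area):
    """collapse ppa into one number to compare candidates; weights are arbitrary."""
    return power + 50 * delay + 0.01 * area

# random search stands in for the smarter (e.g. reinforcement-learning) search
# that ai-driven tools use to replace the manual adjust-and-rerun loop
best = min(
    ({k: random.choice(v) for k, v in SEARCH_SPACE.items()} for _ in range(50)),
    key=lambda cfg: ppa_score(*run_flow(cfg)),
)
print(best)
```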

to solve this problem, synopsys launched dso.ai in 2020, the industry's first chip design solution that integrates ai with eda. dso.ai uses reinforcement learning to automatically search the design space and find the best balance point without manual intervention. the tool has been adopted by many chip giants.

for example, after adopting dso.ai, microsoft reduced the power consumption of chip modules by 10%-15% while maintaining the same performance; stmicroelectronics increased ppa exploration efficiency more than threefold; and memory chip giant sk hynix reduced chip area by 5%. synopsys data show that dso.ai has assisted more than 300 commercial tape-outs, underscoring the important role ai already plays in real chip design and production.

on ai-assisted chip verification, a technical report released by synopsys points out that verification takes up to 70% of the entire chip development cycle. with a single tape-out costing as much as hundreds of millions of dollars and modern chips growing ever more complex, verification is increasingly difficult. to this end, synopsys launched the vso.ai tool, which uses ai to optimize the verification space and accelerate coverage convergence.

vso.ai can infer different coverage types, complementing traditional code coverage. ai can also learn from verification experience to continuously optimize coverage goals. in addition, synopsys has also launched the tso.ai tool, which can help chip developers screen out defective chips manufactured by foundries.

ai's deep involvement in chip design has prompted a bold question: can ai design a complete chip? nvidia has already experimented in this area, designing circuits with deep reinforcement learning agents; nearly 13,000 circuits in nvidia's h100 were designed by ai. the institute of computing technology of the chinese academy of sciences also used ai to generate a risc-v processor chip called "qimeng no. 1" within 5 hours; with 4 million logic gates, its performance is comparable to an intel 80486.

overall, ai's ability to design complete chips is still limited, but this is undoubtedly an important opportunity for future chip development. with the continuous advancement of technology, the potential of ai in the field of chip design will surely be further explored and utilized, and ultimately change the entire chip design process.

author: tian xiaoyao