Paper presented at top computer architecture conference, chip architecture becomes the best parallel computing option for edge AI

2024-08-13

Published by Synced

Synced Editorial Department

The explosion of large AI models has driven strong demand for GPUs, and AI applications that penetrate from the cloud to the edge will also drive demand for edge AI servers and accelerated processors. By comparing GPGPU, FPGA, NPU, and ASIC, the reconfigurable computing architecture CGRA has become the most suitable parallel computing architecture for edge AI. The reconfigurable parallel processor (RPP) proposed by Core Power is a computing architecture that is more suitable for large-scale parallel processing than the traditional CGRA. This has not only been confirmed through experimental evaluation, but also recognized by international academic authorities through the ISCA conference. The R8 chip based on the RPP architecture and subsequent higher-performance iterative chips will be the ideal AI accelerated processor choice for edge AI servers and AI PCs.

Table of contents

1. What is Edge AI?

2. Edge AI Server Market Trends

3. Ideal computing architecture for edge AI

IV. Detailed explanation of RPP architecture

5. RPP processor R8 energy efficiency comparison

6. RPP processor recognized by international academic authorities

VII. Conclusion

1. What is Edge AI?

Edge AI is an advanced technology that intersects artificial intelligence (AI) and edge computing. This concept stems from the paradigm shift of distributed computing where AI is moving from the cloud to the edge. The core of edge AI is to embed AI algorithms directly into local environments that generate large amounts of data, such as smartphones, IoT devices, or local servers, and perform real-time data processing and analysis through devices and systems located at the "edge" of the network (i.e., closer to the data source).

Compared with AI training or reasoning in traditional data centers or cloud computing platforms, the main advantage of edge AI is "on-site processing", which greatly reduces the delay of data transmission and processing. This is especially important in application scenarios such as intelligent monitoring, autonomous driving, real-time medical diagnosis or industrial automation control.

The devices and systems that implement edge AI computing mainly include:

Smart terminals: devices that are mainly used to generate or collect data, such as smart sensors, smartphones, AI PCs or IoT devices;
Edge AI server: edge devices and hardware and software systems that directly process and analyze collected data, such as dedicated large language model (LLM) AI inference servers, intelligent driving regional computing center servers, etc.
Communication network equipment: Although edge AI applications do not require as much bandwidth and rate from communication networks as cloud applications, reliable high-speed connections must be provided to achieve the low latency and real-time requirements required by edge AI.

This article mainly discusses edge AI servers and their market development trends, requirements for AI acceleration processors, and parallel computing architectures and processor implementations suitable for edge AI applications.

2. Edge AI Server Market Trends

AI servers refer to high-performance computer devices designed specifically for artificial intelligence applications, which can support complex tasks such as large-scale data processing, model training, and inference computing. AI servers are usually equipped with high-performance processors, high-speed memory, large-capacity high-speed storage systems, and efficient cooling systems to meet the extremely high demand for computing resources by AI algorithms. According to different classification standards, AI servers can be roughly divided into training servers, inference servers, GPU servers, FPGA servers, CPU servers, cloud AI servers, and edge AI servers.

According to Gartner's forecast, the AI server market will maintain rapid growth from now to 2027, with a compound annual growth rate of up to 30%. The "Global Server Market Report for the First Quarter of 2024" released by the agency shows that the global server market sales in Q1 this year was US$40.75 billion, a year-on-year increase of 59.9%; shipments were 2.82 million units, a year-on-year increase of 5.9%. Among many AI server suppliers, Inspur Information ranked second in the world and first in China. Its server shipments accounted for 11.3% of the global market, a year-on-year increase of 50.4%, and the fastest growth rate among the TOP5 manufacturers.

According to the "2024-2029 China Server Industry Demand Forecast and Development Trend Forecast Report" released by China Business Industry Research Institute, by the end of 2022, the total domestic market size will exceed 42 billion yuan, a year-on-year increase of about 20%; in 2023, it will be about 49 billion yuan, and the market growth rate will gradually slow down; it is expected that the market size will reach 56 billion yuan in 2024. In terms of shipments, China's AI server market shipments in 2022 will be about 284,000 units, a year-on-year increase of about 25.66%; in 2023, it will be about 354,000 units, and it is expected to reach 421,000 units in 2024.

In the early stages of AI big model development, AI server demand was mainly for model training, so training servers dominated the market. Currently, 57.33% of the AI server market is training servers, and inference servers account for 42.67%. However, as generative AI applications penetrate the edge, inference servers are expected to gradually become the mainstream of the market in the future, and edge AI servers will exceed cloud training and inference servers in terms of shipments.

Data from IDC's latest "China Semi-annual Edge Computing Market (2023 Full Year) Tracking" report shows that China's edge computing server market will continue to rise steadily in 2023, with a year-on-year increase of 29.1%. IDC predicts that by 2028, the scale of China's edge computing server market will reach US$13.2 billion.

As an important part of edge computing, the scale of customized edge servers will reach US$240 million in 2023, an increase of 16.8% compared with 2022. From the perspective of manufacturer sales, the manufacturers with a relatively large share in the edge customized server market are Inspur Information, Lenovo, Huawei, and H3C. With the diversified development of edge computing applications, emerging server manufacturers will have major breakthroughs in business scenarios and application markets such as vehicle-road collaboration, edge AI, and smart terminals, making the edge server market present a diversified pattern.

3. Ideal computing architecture for edge AI

The PC era was dominated by the WINTEL (Microsoft Windows + Intel CPU) alliance, and the smartphone era was dominated by the Android + Arm alliance. Which alliance will dominate the AI era? A new alliance is emerging, that is, the NT Alliance (Nvidia + TSMC) formed by Nvidia and TSMC. According to Wall Street investment experts, the total revenue of the NT Alliance is expected to reach US$200 billion, the total net profit is US$100 billion, and the total market value is expected to exceed US$5 trillion in 2024. Nvidia GPU and TSMC AI chip manufacturing business driven by cloud AI training and AI large model applications will become the biggest winners this year.

Although NVIDIA has an absolute dominant position in the cloud AI training and reasoning market, NVIDIA's GPGPU is not the best choice in edge AI application scenarios because the inherent high power consumption and high cost of its computing architecture limit its role in more extensive and decentralized edge AI applications. Scholars and experts in the field of computer architecture are looking for energy-efficient parallel technology architectures that can replace GPGPUs. ASIC design based on domain-specific architecture (DSA) is a feasible key idea, such as Google's tensor processing unit (TPU). This processor designed to accelerate machine learning workloads uses a systolic array architecture that can efficiently perform multiplication and accumulation operations, mainly for data center applications. Another idea is the neural processing unit (NPU) represented by Samsung, which is designed for mobile scenarios and has an energy-saving inner product engine that can use the sparsity of input feature maps to optimize the performance of deep learning reasoning.

Although both TPU and NPU can provide high-performance and energy-saving solutions that partially replace GPGPU, their dedicated design attributes limit their versatility and wide applicability. Kneron, an edge AI chip startup headquartered in California, USA, with R&D centers in Taiwan and mainland China, has proposed a reconfigurable NPU solution, which enables NPU chips to have ASIC high performance without sacrificing the programmability of data-intensive algorithms. With its unique and innovative architecture and excellent performance, the Kneron team won the IEEE CAS 2021 Darlington Best Paper Award. Kneron's 4th generation reconfigurable NPU can support the simultaneous running of CNN and Transformer networks, which can be used for both machine vision and semantic analysis. Unlike ordinary AI models that are only for specific applications, Kneron's reconfigurable artificial neural network (RANN) technology is more flexible, can meet different application requirements and adapt to various computing architectures. According to the company, its edge GPT AI chip KL830 can be applied to AI PCs, USB accelerators and edge servers. When used with GPUs, NPUs can reduce device energy consumption by 30%.

Reconfigurable hardware is another solution that can provide high-performance and energy-efficient computing. Field Programmable Gate Array (FPGA) is a representative of reconfigurable hardware computing, which is characterized by fine-grained reconfigurability. FPGAs use configurable logic blocks with programmable interconnects to implement custom computing kernels. This customized computing capability enables FPGA-based accelerators to be deployed in a wide range of large-scale computing applications such as financial computing, deep learning, and scientific simulation. However, the bit-level reconfigurability provided by FPGAs brings significant additional area and power overhead, and there is no scale cost-effectiveness, which greatly limits its applicability in application scenarios that require low power consumption and small size.

Coarse-grained reconfigurable architecture (CGRA) represents another type of reconfigurable hardware. Compared with FPGA, CGRA provides coarse-grained reconfigurability, such as word-level reconfigurable functional units. Since the ALU module inside CGRA has been built and its interconnection is simpler and smaller than FPGA, its latency and performance are significantly better than FPGA, which is interconnected at the gate level to form combinatorial computing logic. CGRA is more suitable for word-wise type (32-bit units) reconfigurable computing, and can alleviate the timing, area and power overhead problems of FPGA. It is an ideal high-performance parallel computing architecture for future edge AI.

Below we briefly review the development history of CGRA:

As early as 1991, the international academic community started research on reconfigurable chips;
In 2003, the European Aeronautic Defense and Space Company (EADS) was the first to use reconfigurable computing chips on satellites;
In 2004, Europe's IMEC proposed the dynamically reconfigurable structure ADRES, which has been used in Samsung's biomedical, high-definition TV and other product series. Japan's Renesas Technology also adopts this architecture.
In 2006, the reconfigurable computing team led by Professor Wei Shaojun of the Institute of Microelectronics at Tsinghua University began to conduct research on the theory and architecture of reconfigurable computing;
In 2017, the U.S. Defense Advanced Research Projects Agency (DARPA) announced the launch of the Electronics Resurgence Initiative (ERI), listing "reconfigurable computing" technology as one of the strategic technologies for the United States in the next 30 years;
In 2018, Qingwei Intelligence was established based on Tsinghua University's reconfigurable computing technology, officially starting the commercialization process. In 2019, Qingwei Intelligence mass-produced the world's first reconfigurable intelligent voice chip TX210, proving the commercial value of reconfigurable computing. In 2020, Qingwei Intelligence won the first prize for technological invention of the China Electronics Society; in 2023, the National Big Fund Phase II invested in Qingwei Intelligence. At present, Qingwei Intelligence has three major chip products: the edge TX2 and TX5 series chips, and the TX8 series for the server field. Among them, the TX2 and TX5 series chips have been used in smart security, financial payment, smart wearables, smart robots and other fields; the TX8 high-computing power chip for the cloud market is mainly used in the training and reasoning of large AI models.
Another domestic AI chip startup based on reconfigurable computing technology, Zhuhai Core Power, was established in 2017. Its reconfigurable parallel processor (RPP) architecture is an improved version of CGRA. In 2021, the first chip RPP-R8 was successfully taped out. In 2023, it entered the edge AI application market such as financial computing, industrial cameras and robots, and reached a strategic cooperation with Inspur Information to enter the edge AI server market.

The international computer academic community and high-tech industry have reached a consensus that reconfigurable computing chips based on the CGRA architecture have extensive general computing capabilities and can be applied to various edge AI computing scenarios. They are the only way to meet the needs of general high computing power and low power consumption.

4. Detailed explanation of RPP processor architecture

Both RPP and CGRA are coarse-grained reconfigurable arrays, both can achieve area density and power efficiency similar to ASIC, and both can be programmed by software. However, RPP is different from CGRA in terms of reconfigurable type and programming model, as shown in the following:

1. RPP is a quasi-static reconfigurable array, while traditional CGRA is generally used for dynamically reconfigurable arrays. Static reconfigurable array means that the execution of each instruction in the processing unit (PE) does not change over time, and the data flow is also unchanged. For the compiler, the static reconfigurable array does not need to arrange the instructions in time, which makes the RPP structure simpler and the instruction allocation speed is very low. Therefore, RPP can easily implement a large array, such as a 32x32 array. RPP is more suitable for large-scale parallel computing than traditional CGRA.

2. RPP uses the multi-threaded SIMT programming model, while CGRA usually uses single-threaded language programming. RPP is compatible with the CUDA language and is more suitable for parallel computing. The CUDA language requires programmers to consider the parallelism of data from the beginning and express parallel algorithms in the CUDA language; the compiler does not need to analyze the parallel computing degree, so the compiler is very simple; the CUDA language is a SIMT type and is only used for data parallel computing, and the parallelism remains unchanged in a program. CGRA usually uses C language + independent compiler. Although it can theoretically cover any type of computing, the compiler is very complex and it is difficult to achieve high compilation efficiency.

The following chart compares RPP and several mainstream reconfigurable acceleration architectures.

The advantages of the RPP architecture can be summarized in the following four points:

A ring-shaped reconfigurable parallel processing architecture with gasket memory that allows efficient reuse of data between different data flows;
Hierarchical memory design with multiple data access modes, address mapping strategies and shared memory modes, enabling efficient and flexible memory access;
Various hardware optimization mechanisms, such as concurrent kernel execution, register splitting and refilling, and heterogeneous scalar and vector computation, improve overall hardware utilization and performance;
A CUDA-compatible end-to-end complete software stack with compiler, runtime environment, and highly optimized RPP library to enable fast and efficient deployment of edge AI applications.

Core Power proposed the RPP hardware design block diagram based on the RPP architecture, and truly demonstrated the superiority of this parallel computing architecture through the R8 chip. This hardware design implementation mainly consists of a circular reconfigurable processor, a memory unit and a sequencer, as shown in the figure below.

Cycle-reconfigurable processors are the core computing components of massively parallel computing.
The memory cells are divided into multiple memory groups, each of which is paired with a cache to exploit the temporal and spatial locality of the program for efficient data reuse. Only when the registers and buffers within the ring reconfigurable processor are full, the intermediate data will be transferred and stored in the memory cells.
The sequencer is used to decode and distribute instructions to the ring reconfigurable processor and uses the cache to store the instructions received from the DDR.

The ring reconfigurable processor includes NPU processing elements (PEs) and a shim memory. Each PE is equipped with a memory port to facilitate data access to the memory unit. The memory port is designed with a mode controller, an address calculation unit, and multiple multiplexers to support different data access modes and shared memory modes. To achieve flexible intra-processor communication, each PE integrates a switch box (SB) and an interconnect switch box (ICSB) for efficient data forwarding. These PEs are connected in a linear sequence, and the shim memory acts as a bridge between the first and last PUs, forming a ring topology.

Data processing within the ring reconfigurable processor starts from the first PE and traverses the PEs in a pipelined manner, with intermediate computation results output to subsequent PEs in sequence. The shim memory caches the outputs of the last PE and recirculates them to the first PE, thereby maximizing data locality and eliminating memory traffic to the memory unit. The key computational component in the PE is the processing engine. In each PE, there are multiple arithmetic logic units (ALUs), each of which is coupled to data registers and address registers. These data registers are aggregated together to form a data buffer, which facilitates fast access to data within each PE.

In addition, the combination of linear switching networks and shim memory enables flexible data flow control and efficient data reuse, while eliminating the complex network routing in traditional grid-based CGRA designs. Combined with flexible and efficient data access to memory cells, RPP can optimize data flow processing and minimize memory traffic, thereby maximizing resource utilization efficiency.

The RPP processor adopts the SIMT programming model to enable streaming data flow processing for flexible multithreaded pipelines.

To ensure compatibility with the existing GPGPU software ecosystem, CorePower's RPP processor uses CUDA, which has a wide user base. CUDA code is parsed by the LLVM-based front end to generate PTX code for the RPP back end. The RPP compiler interprets the CUDA kernels as data flow graphs and maps them to virtual data paths (VDPs). The VDP is then decomposed into multiple physical data paths (PDPs) based on hardware constraints, and the configuration of each PDP is generated by the sequencer at runtime.

RPP's software stack can support a wide range of massively parallel applications, including machine learning, video/image processing, and signal processing. For machine learning applications, the stack is compatible with different mainstream frameworks, such as PyTorch, ONNX, Caffe, and TensorFlow. In addition, users have the flexibility to define their custom programs using CUDA. These high-level applications are handled by the RPP framework, which includes a compiler and different domain-specific libraries. At the bottom of the software stack, the RPP runtime environment and RPP driver are used to ensure that programs compiled using the toolchain can be seamlessly executed on the underlying hardware.

5. RPP processor R8 energy efficiency comparison

How does the RPP-R8 chip based on the above RPP processor hardware design and complete software stack perform in terms of computing performance and energy efficiency?

The performance parameters of the R8 chip are shown in the following table:

For edge computing scenarios, Core Power compared the RPP-R8 chip with two NVIDIA edge GPUs: Jetson Nano and Jetson Xavier AGX. The chip size of Jetson Nano is similar to RPP, which can provide relevant comparison within the physical area limit; Jetson Xavier AGX was selected based on its theoretical throughput comparable to RPP-R8. Core Power evaluated the three AI acceleration platforms on ResNet-50 inference, with the throughput of Jetson Nano coming from the benchmark paper and the performance data of Xavier AGX coming from the official NVIDIA website.

As shown in the table above, the measured operating throughput of RPP-R8 is 41.3 times and 2.3 times that of Jetson Nano and Jetson Xavier AGX, respectively. It should be noted that the chip size of Jetson Xavier AGX is almost three times that of R8, and the process is more advanced (12 nm vs. 14 nm), but its performance is lower than R8. In terms of energy efficiency, the energy efficiency of R8 is 27.5 times and 4.6 times that of Jetson Nano and Jetson Xavier AGX, respectively. These results show that in edge AI scenarios with limited area and power budgets, RPP-R8 performs significantly better than Jetson Nano and Jetson Xavier AGX.

Deep learning inference is a widely recognized massively parallel workload and a key application for the RPP-R8 hardware. Given that the Yolo series models exhibit higher computational complexity compared to classification models such as ResNet-50, CorePower selected NVIDIA Jeston Nano Orin as the GPU platform, which has a higher peak throughput of 40 TOPS than Jetson AGX Xavier. Since CPUs are generally not built for high-performance deep learning inference, Jetson Xavier Nx was selected as a lower-end GPU platform with a peak throughput of 21 TOPS. Workloads with batch sizes of 1, 2, and 4 were evaluated, reflecting real edge scenarios. The figure above shows a comparison of the throughput performance of the three platforms, with RPP-R8 demonstrating higher throughput on Yolo-v5m and Yolo-v7 tiny. With a batch size of 1, the throughput of RPP-R8 is approximately 1.5× ∼2.5 times higher than that of Jeston Nano Orin and 2.6× ∼4.3 times higher than that of Jeston Xavier Nx.

Evaluation and test results show that RPP outperforms traditional architectures such as GPU, CPU and DSP in terms of latency, throughput and energy efficiency. The performance improvement of RPP processor is attributed to its unique hardware features, mainly including: 1) Circular data flow processing: intermediate results flow through pipeline registers and FIFOs between PEs, significantly reducing data movement and memory traffic to remote memory storage; this model is more efficient than data processing in GPUs and CPUs. 2) Hierarchical memory system: RPP maximizes data locality through its hierarchical memory system. A large part of the RPP-R8 chip area (about 39.9%) is dedicated to on-chip memory. This design choice provides extensive memory capacity, enhances data reuse and reduces the need for frequent access to external memory. 3) Vectorization and multi-threaded pipeline: RPP's hardware architecture and programming model enable efficient vectorization and multi-threaded pipelines. This design fully exploits the full computing potential of RPP for parallel processing, ensuring that its resources are maximized, thereby improving performance.

In addition to its advantages in energy consumption, latency, and throughput, RPP also stands out for its small area. The chip area consumption of only 119 square millimeters makes RPP-R8 an ideal platform for area-constrained edge computing. Another feature of RPP is its high programmability, supported by a comprehensive end-to-end software stack that significantly improves deployment efficiency. Compatibility with CUDA enables users to take advantage of the familiar CUDA ecosystem, shortening the learning curve and promoting easier adoption. Support for both instant programming and graphics programming modes provides users with a high degree of flexibility to meet a variety of computing needs. Different library support, including OpenRT and RPP-BLAS, also promotes high performance and efficient deployment in various scenarios. The full-stack solution, including hardware architecture and software support, makes RPP stand out among various edge computing hardware.

6. RPP framework is recognized by international academic authorities

The paper "Circular Reconfigurable Parallel Processor for Edge Computing" (RPP Chip Architecture), co-authored by Core Power and computer architecture teams from top universities such as Imperial College, Cambridge University, Tsinghua University and Sun Yat-sen University, has been successfully included in the Industry Track of the 51st International Symposium on Computer Architecture (ISCA 2024). Dr. Yuan Li, founder and CEO of Core Power, and Hongxiang Fan, a PhD graduate from Imperial College (now a research scientist at the Samsung AI Center in Cambridge, UK), were invited to speak at the ISCA 2024 conference in Buenos Aires, Argentina, and exchanged ideas with experts from internationally renowned companies such as Intel and AMD.

This year's ISCA received 423 high-quality paper submissions from around the world. After a rigorous review process, only 83 papers stood out, with an overall acceptance rate as low as 19.6%. Among them, the admission difficulty of Industry Track was particularly prominent, with an acceptance rate of only 15.3%.

As the top academic event in the field of computer architecture, ISCA is jointly organized by ACM SIGARCH and IEEE TCCA. Since its establishment in 1973, it has been a pioneer in promoting progress in the field of computer system architecture. Its wide influence and outstanding contributions have made it a high-end platform for industry giants such as Google, Intel, and NVIDIA to compete to showcase cutting-edge research results. ISCA is known as the top four conferences along with MICRO, HPCA, and ASPLOS, and ISCA is the best among them, with a paper acceptance rate of about 18% all year round. Over the years, many research results published at ISCA have become the key driving force for the development of the semiconductor and computer industries.

The selected reconfigurable parallel processor (RPP) paper has injected strong momentum into the field of edge computing. The experimental results fully prove that as a hardware platform for parallel computing, the performance of RPP comprehensively surpasses the GPU currently on the market, especially in application scenarios with extremely high requirements for latency, power consumption and volume.

VI. Conclusion

ChatGPT has ignited the AI big model, which has led to huge demand for GPUs and AI accelerators. The development trend of AI applications will gradually penetrate from cloud AI training and reasoning to edge and end-side AI. AI servers that provide software and hardware support for various AI applications also follow the distributed expansion trend from data centers to edge computing. Traditional GPGPUs have begun to expose obvious architectural defects in edge AI application scenarios. Their high cost, high power consumption and high latency problems have forced industry experts to seek more energy-efficient parallel computing architectures.

After comparing different computing architectures such as CPU, GPU, ASIC, FPGA and NPU, we found that the reconfigurable computing architecture CGRA is more suitable for edge AI applications, especially the reconfigurable parallel processor (RPP) proposed by Core Power. Through comparative analysis with similar GPUs from NVIDIA, the R8 chip based on the RPP architecture performs well in terms of latency, power consumption, area cost, versatility and rapid deployment. We believe that this is the most ideal parallel computing architecture for edge AI.

At the ISCA2024 academic conference held in Argentina in July this year, the paper on the RPP processor architecture was recognized by international academic authorities. With the development of edge AI, AI servers and AI PCs will usher in a golden period of rapid growth, and AI accelerators that support such edge AI devices will also grow simultaneously. The RPP processor chip proposed by Zhuhai Core Power Technology will also be recognized by the industry and become the most ideal AI acceleration processor in edge AI application scenarios.

news

Paper presented at top computer architecture conference, chip architecture becomes the best parallel computing option for edge AI

Introduction

My contact information