the current state of domestic ai chips: gpus are making money, tpus are making a comeback, chiplets are becoming a trend, and the network is a bottleneck

2024-09-06

author | gacs

xindongxi reported on september 6 that the annual global ai chip summit (gacs 2024) opened in beijing today. the venue was packed, and the cloud livestream drew 1.2 million viewers.

▲the venue and the booths were crowded with people

the conference was initiated and hosted by xindongxi, a subsidiary of zhiyi technology, together with zhixingxing, under the theme "building a chip road in the era of intelligent computing". it brought together more than 50 speakers from the fields of ai chips, chiplets, risc-v, intelligent computing clusters, and ai infra to share practical insights.

this year marks the fifth anniversary of the founding of domestic gpgpu unicorn biren technology. at the summit, biren announced a core technology breakthrough in mixed training across different chips: its heterogeneous gpu collaborative training solution hgct is the first in the industry to support three or more kinds of heterogeneous gpus training the same large model.

▲biren technology launches domestic heterogeneous gpu collaborative training solution hgct

gong lunchang, co-founder and ceo of zhiyi technology, delivered a speech on behalf of the organizer. now in its seventh year, the global ai chip summit has become the most influential industry conference in this field in china and an important window into ai chip development trends at home and abroad.

▲gong lunchang, co-founder and ceo of zhiyi technology

the global ai chip summit runs for two days. the main venue hosts the opening ceremony and three special sessions (ai chip architecture, data center ai chips, edge ai chips), while the sub-venues host the chiplet technology forum, the intelligent computing cluster technology forum, and the risc-v innovation forum.

at the opening ceremony, yin shouyi, professor at tsinghua university and vice dean of its school of integrated circuits, delivered a keynote entitled "exploration of the development path of high-computing-power chips: from computing architecture to integration architecture". he systematically reviewed the technical challenges facing high-computing-power chips and analyzed five innovative technology paths: dataflow chips, compute-in-memory chips, reconfigurable chips, three-dimensional integrated chips, and wafer-scale chips.

today, 21 experts, entrepreneurs and executives from top universities, research institutes and ai chip companies shared their views. the high-end dialogue session brought together representatives of three ai chip startups for a spirited debate: domestic high-compute chip unicorn biren technology, terminal and edge ai chip unicorn aixin yuanzhi, and lingchuan technology, a young ai chip startup founded only half a year ago. they focused on the current state of the ai chip industry, its latest practices, and the directions ahead.

1. solving the supply-demand challenge of large-model computing power: breaking through performance bottlenecks with architectural innovation

yin shouyi, professor at tsinghua university and vice dean of its school of integrated circuits, explained the mismatch between computing power supply and demand in the era of large models: chip processes face scaling-down limits, so the compute gains from process dividends are unsustainable; systems face a scaling-out bottleneck, where insufficient communication bandwidth leads to system-level performance loss.

the opportunity to solve these two problems lies in joint innovation across the computing architecture and integration architecture of computing chips: computing architecture innovation lets every transistor be fully utilized to unleash greater computing power, while integration architecture innovation lets chip scale break through its limits.

there are currently five new technology paths for high-computing-power chips: dataflow chips, reconfigurable chips, compute-in-memory chips, three-dimensional integrated chips, and wafer-scale chips. none of these paths relies entirely on the most advanced manufacturing processes, and they will help open new room for the domestic chip industry to improve computing power.

▲yin shouyi, professor at tsinghua university and vice dean of the school of integrated circuits

amd has built a comprehensive product line for end-to-end ai infrastructure, covering everything from data center servers and ai pcs to smart embedded and edge devices, along with leading ai open-source software and an open ecosystem. amd's cpu platform based on the advanced zen 4 architecture and its mi-series accelerators based on the cdna 3 architecture for ai inference and training have been adopted by giants such as microsoft.

according to wang hongqiang, senior director of amd's artificial intelligence business unit, amd is also promoting high-performance network infrastructure for data centers (ualink, ultra ethernet), which is crucial for ai network fabrics that must support fast switching and extremely low latency to scale ai data center performance.

amd is about to release its next generation of high-performance ai pcs. its ryzen ai npu, based on the second-generation xdna architecture, delivers 50 tops of compute and raises energy efficiency to 35 times that of general-purpose architectures. driven by ai pc demands for privacy, security, and data autonomy, important ai workloads are beginning to be deployed on pcs. as one of the world's leading ai infrastructure providers, amd says it is willing to work with customers and developers to build a transformative future.

▲wang hongqiang, senior director of amd artificial intelligence division

since 2015, qualcomm has continuously innovated its npu hardware design in response to changing ai use cases. represented by the third-generation snapdragon 8, the qualcomm ai engine adopts a heterogeneous computing architecture that integrates multiple processors, including the cpu, gpu, and npu. among them, the qualcomm hexagon npu improves performance and energy efficiency through large on-chip memory, dedicated accelerators, and microarchitecture upgrades. ai use cases are rich and their computing power requirements vary, so demand for heterogeneous computing and processor integration will persist for a long time, bringing improvements in peak performance, energy efficiency, and cost.
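
the division of labor in such a heterogeneous engine can be made concrete with a small sketch. the routing rules and workload names below are illustrative assumptions, not qualcomm's actual scheduler:

```python
# a minimal sketch of heterogeneous dispatch (illustrative, not qualcomm's
# scheduler): send each workload to the processor class whose strengths
# match it: cpu for small latency-critical bursts, gpu for parallel
# image-style work, npu for sustained tensor-heavy inference.
from dataclasses import dataclass

@dataclass
class Workload:
    name: str
    sustained: bool         # long-running, tensor-heavy (e.g. llm decode)
    latency_critical: bool  # must answer within a frame/voice deadline

def dispatch(w: Workload) -> str:
    if w.sustained:
        return "npu"  # best performance-per-watt for steady ai inference
    if w.latency_critical:
        return "cpu"  # lowest launch latency for small, bursty kernels
    return "gpu"      # throughput-oriented parallel work

for w in (Workload("voice-wakeword", False, True),
          Workload("camera-segmentation", False, False),
          Workload("on-device-llm", True, False)):
    print(f"{w.name} -> {dispatch(w)}")
```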

qualcomm's product line covers a wide range of edge scenarios, including mobile phones, pcs, xr, automobiles, and iot, and developers can use qualcomm's ai software and hardware solutions to accelerate algorithms across these product forms, bringing consumers rich on-device ai experiences and use cases. finally, wan weixing, head of qualcomm's ai product technology in china, announced that the next-generation snapdragon mobile platform, equipped with the latest qualcomm oryon cpu, will be released at the snapdragon summit on october 21-23 this year.

▲wan weixing, head of qualcomm ai product technology china

yang yue, co-founder and ceo of applecore technology, traced the evolution of compute-in-memory technology. the emergence and growth of the industry's mainstream chips have always been tied to the characteristics of contemporary computing demands. around 2015, the bottleneck in computing architectures migrated from the processor side to the memory side; the rise of neural networks in particular accelerated the push to improve the computing efficiency of ai chips, which is why compute-in-memory technology attracted attention.

yang yue believes that in the era of large models, the opportunity for compute-in-memory technology is the ability to add computing wherever data is stored. with the continued development of software, device-side compute-in-memory chips have gradually matured this year. going forward, solving the data bandwidth bottleneck in the cloud may become the next killer application for compute-in-memory chips.
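
the bandwidth argument behind compute-in-memory can be checked with back-of-envelope arithmetic. the layer size and precision below are invented for illustration:

```python
# why "add computing wherever data is stored" helps: for y = W @ x, a
# conventional accelerator streams the whole weight matrix W across the
# memory bus every time, while a compute-in-memory array keeps W
# stationary and moves only the input and output vectors.
def bytes_moved(rows, cols, dtype_bytes=1, in_memory=False):
    x_io = cols * dtype_bytes                              # input vector in
    y_io = rows * dtype_bytes                              # output vector out
    w_io = 0 if in_memory else rows * cols * dtype_bytes   # weight traffic
    return x_io + y_io + w_io

rows, cols = 4096, 4096  # one int8 layer, illustrative only
conv = bytes_moved(rows, cols)
pim = bytes_moved(rows, cols, in_memory=True)
print(f"conventional: {conv / 1e6:.1f} MB moved per matvec")
print(f"compute-in-memory: {pim / 1e3:.1f} KB moved, a {conv / pim:.0f}x reduction")
```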

▲yang yue, co-founder and ceo of applecore technology

tan zhanhong, cto of arctic xiongxin, said that in high-performance computing there are two paradigms for server design: the standard server form and the customized server architecture. in the standard form, arctic xiongxin focuses on achieving higher cost-performance through appropriate chip partitioning and packaging choices under standard constraints; the non-standard form opens opportunities for wafer-scale integration, where the company focuses on unifying chip and system design and co-designing servers and chips, aiming at the goal of "server as chip".

in particular, tan zhanhong emphasized that different chip designs have different bandwidth requirements. at 7nm and above, for example, combined with communication optimization, high interconnect bandwidth density is often unnecessary, so advanced packaging is not required: 2d packaging can meet performance requirements at lower cost. based on its pb-link ip, which implements the "chiplet interconnect interface standard", arctic xiongxin has realized low-packaging-cost interconnection and has begun licensing it externally.
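
the trade-off he describes comes down to beachfront math: die-to-die bandwidth is roughly edge length times io density times lane rate. the densities and rates below are illustrative assumptions, not arctic xiongxin's figures:

```python
# rough beachfront arithmetic behind "2d packaging is often enough":
# achievable die-to-die bandwidth = die-edge length x lanes per mm x
# per-lane rate. if a design's needs fit the first line, the cheaper
# substrate wins.
edge_mm = 10  # die edge devoted to the d2d interface (assumed)
for name, lanes_per_mm, gbps_per_lane in (
        ("standard 2d substrate", 20, 16),
        ("2.5d advanced packaging", 200, 16)):
    bw_gbps = edge_mm * lanes_per_mm * gbps_per_lane
    print(f"{name}: ~{bw_gbps / 8:,.0f} GB/s of d2d bandwidth")
```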

▲tan zhanhong, cto of arctic xiongxin

2. high-end dialogue: domestic ai chips are getting better at generating their own revenue, and the youngest startup's products are already live at kuaishou

zhang guoren, co-founder and editor-in-chief of zhiyi technology, ding yunfan, vice president of biren technology and chief architect of ai software, liu li, co-founder and vice president of lingchuan technology, and liu jianwei, co-founder and vice president of aixin yuanzhi, held a roundtable discussion on the theme of "consensus, co-creation and win-win in the implementation of domestic ai chips."

at the start of the roundtable, zhang guoren said that the ai chip summit, initiated and hosted for six editions by zhidongxi, xindongxi and zhixingxing, is the longest-running professional conference in this field in china. in recent years it has witnessed the vigorous development of ai chips and large models, as well as the rise of a group of domestic chip-making "new forces".

▲zhang guoren, co-founder and editor-in-chief of zhiyi technology

ding yunfan said that high-compute chips are a technology-intensive, talent-intensive, and capital-intensive industry. as the chip unicorn with the largest publicly disclosed financing on the market, biren technology has top talent, its first-generation products are in mass production, and multiple domestic gpu thousand-card clusters have entered production, so it can generate revenue on its own. even so, the domestic chip industry's overall situation is still not easy, and a gap with foreign counterparts remains in terms of ecosystem.

many domestic ai chips have begun to be used in data centers and intelligent computing centers. in ding yunfan's view, nvidia's products for the chinese market are not cost-effective; as long as domestic chips can deliver strong performance and cost-effectiveness, there will be a market. with news of domestic chips landing in production growing more frequent, the industry's ability to sustain itself on its own revenue is improving, and its gap with nvidia will gradually narrow.

▲ding yunfan, vice president of biren technology and chief architect of ai software

liu jianwei believes that low cost is a very important factor: enterprises ultimately have to balance the books and earn a return on their infrastructure investment. liu li believes that more companies will enter segmented tracks such as embodied intelligence and smart video, which offer higher value than general-purpose products and will compress nvidia's revenue and profits.

lingchuan technology is one of the youngest domestic ai chip startups: it was established in march this year and has completed one round of financing. its intelligent video processing chips, already on sale, are used by kuaishou and handle 99% of kuaishou's video processing volume. its high-compute inference chip is expected to tape out early next year.

in liu li's view, the window for the ai chip market is still far from closing. facing the giants' advantages in resources, capital, and ecosystem, startups need to focus on vertical, segmented fields. lingchuan technology combines intelligent video processing with ai inference computing power, aiming to cut its per-token inference cost to 10% of an nvidia h800's.
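
a per-token cost claim like this is ultimately amortization arithmetic. the sketch below shows one common way to compute it; every number is a placeholder, not lingchuan or nvidia data:

```python
# hedged sketch of the arithmetic behind a per-token inference cost
# comparison; all inputs are invented for illustration.
def cost_per_million_tokens(card_price_usd, amortize_years, watts,
                            usd_per_kwh, tokens_per_sec):
    hours = amortize_years * 365 * 24
    capex_per_hour = card_price_usd / hours          # straight-line amortization
    power_per_hour = (watts / 1000) * usd_per_kwh    # energy cost per hour
    tokens_per_hour = tokens_per_sec * 3600
    return (capex_per_hour + power_per_hour) / tokens_per_hour * 1e6

baseline = cost_per_million_tokens(30_000, 4, 700, 0.10, 3000)
print(f"illustrative baseline: ${baseline:.3f} per million tokens")
# a challenger reaches "10% of baseline" through some mix of a cheaper
# card, lower power draw, and higher tokens/s on its target workloads.
```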

▲liu li, co-founder and vice president of lingchuan technology

aixin yuanzhi, which focuses on devices and the edge, has achieved a remarkable market share. liu jianwei believes the commercial loop will close faster in these two fields. he added that ai chip makers will eventually make money, but the actual timetable depends on factors such as ai deployment costs, so companies should become self-sustaining and close the loop as soon as possible. going forward, aixin yuanzhi will explore deployment scenarios for large models on devices and at the edge.

aixin yuanzhi's shipments in the automotive field are impressive. liu jianwei said this is because the underlying chip technologies for smart cities and automobiles are similar: having accumulated mature technology in smart cities, aixin yuanzhi could reach mass production quickly when it entered smart driving. at the same time, the price war in the automotive sector is driving industrial division of labor, which he sees as a window of opportunity.

▲liu jianwei, co-founder and vice president of aixin yuanzhi

on how domestic ai chips can quickly find an ecological niche, liu jianwei pointed to the scenarios aixin yuanzhi has cultivated deeply: smart cities have essentially no foreign players, while in intelligent driving nvidia pioneered the 0-to-1 stage, and the cost-sensitive 1-to-100 stage is the opportunity for domestic companies. ding yunfan listed four essentials: stable and reliable supply, cost-effectiveness, efficient support services built around customer needs, and products that are efficient and easy to use. liu li argued for going deep into vertical fields and building solutions more efficient and better optimized than general-purpose chips.

looking ahead, liu jianwei predicts that both the device side and the cloud side will see great opportunities in the next 4-5 years; once the industry's deployment costs come down, data can realize greater value. liu li believes that as ai applications enter an explosive period, the cloud will generate a large volume of inference demand. ding yunfan noted that high-end computing power remains scarce in china, but synergy across the industrial chain can sustain steady development.

3. intelligent computing center construction is booming: new gpu breakthroughs, domestic tpus entering service, and chiplets gaining momentum

at the data center ai chip session in the afternoon, yu mingyang, head of habana china, said that over the past three years more than 50 government-led intelligent computing centers have been built, with another 60+ planned or under construction. construction has gradually shifted from first-tier cities to second- and third-tier cities, and from government-led to enterprise-led, while requirements for cost compression and return-on-investment cycles keep rising.

by his observation, large-model development is maturing and inference demand continues to grow, so leading csps will accelerate their self-developed inference chips, and many heterogeneous chip companies may be cultivated on the inference side in the future.

demand for large-model training abroad will remain strong, while demand for training computing power in china is largely saturated and comes mainly from fine-tuning. to support the future development of ai, the combination of chiplets, high-speed large-capacity memory, and proprietary/general-purpose high-speed interconnect technologies will play a key role.

▲yu mingyang, head of habana china

to break down the isolated islands of heterogeneous computing power for large models, ding yunfan, vice president of biren technology and chief architect of ai software, announced biren's original heterogeneous gpu collaborative training solution hgct. it is the industry's first to let three or more kinds of heterogeneous gpus collaboratively train the same large model, i.e., mixed training with "nvidia + biren + other vendors' gpus", with communication efficiency above 98% and end-to-end training efficiency of 90-95%.
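
biren has not published hgct's internals, but one ingredient any mixed-training scheme needs is load balancing across cards of different speeds, as in the sketch below; the vendor names and throughputs are invented:

```python
# one ingredient of heterogeneous mixed training (a sketch, not biren's
# hgct): size each pool's local batch in proportion to its measured
# throughput so all gpus finish a step at about the same time.
def split_batch(global_batch, samples_per_sec):
    total = sum(samples_per_sec.values())
    shares = {name: round(global_batch * rate / total)
              for name, rate in samples_per_sec.items()}
    # absorb rounding drift so shares still sum to the global batch
    shares[max(shares, key=shares.get)] += global_batch - sum(shares.values())
    return shares

pools = {"vendor_a_gpu": 1800.0, "biren_gpu": 1500.0, "vendor_b_gpu": 900.0}
print(split_batch(4096, pools))
# the step is then gated by cross-vendor gradient synchronization, which
# is why the quoted 98% communication efficiency matters.
```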

biren is working with customers and partners to promote a heterogeneous gpu collaborative training ecosystem, including china telecom, zte, sensetime, state grid research institute, shanghai intelligent computing technology co., ltd., shanghai artificial intelligence laboratory, and china academy of information and communications technology.

its products are already in commercial use in several thousand-card gpu clusters. biren has built an integrated hardware-software large-model solution that is full-stack optimized, heterogeneous-collaborative, and open source. biren was the first to achieve automatic elastic scaling of large-model 3d-parallel training tasks, keeping cluster utilization near 100%; for a hundred-billion-parameter model on a thousand-card cluster, it achieves automatic recovery within 10 minutes, 4 days without failures, and 15 days without interruptions.

▲ding yunfan, vice president of biren technology and chief architect of ai software

zheng hanxun, co-founder and cto of zhonghao xinying, said that today's large ai models carry computational complexity and computing power requirements far beyond anything in computing history, and need specialized chips that are better at ai computation. compared with gpus, which were originally designed mainly for real-time rendering and image processing, tpus are designed for machine learning, deep learning models, and neural network computation: they are highly optimized for tensor operations, and their systolic array architecture greatly improves single-chip throughput and processing efficiency over gpus.
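
the systolic array idea is easy to see in miniature: operands flow between neighboring processing elements, every element does one multiply-accumulate per cycle once the pipeline fills, and nothing is re-fetched from memory. below is a toy cycle-level simulation of an output-stationary array (an illustration of the general technique, not zhonghao xinying's design):

```python
import numpy as np

def systolic_matmul(A, B):
    """toy cycle-by-cycle model of an output-stationary systolic array.
    pe (i, j) multiplies the a-value flowing right and the b-value
    flowing down, accumulating c[i, j]; inputs are skewed so a[i, k]
    and b[k, j] meet at pe (i, j) on cycle t = i + j + k."""
    m, k = A.shape
    k2, n = B.shape
    assert k == k2
    C = np.zeros((m, n), dtype=A.dtype)
    for t in range(m + n + k - 2):          # cycles until the array drains
        for i in range(m):
            for j in range(n):
                step = t - i - j
                if 0 <= step < k:           # operands present at this pe
                    C[i, j] += A[i, step] * B[step, j]
    return C

A = np.arange(12).reshape(3, 4)
B = np.arange(8).reshape(4, 2)
assert np.array_equal(systolic_matmul(A, B), A @ B)
print("3x4 @ 4x2 completes in", 3 + 2 + 4 - 2, "cycles on a 3x2 array")
```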

zhonghao xinying's self-developed "moment" chip is china's first mass-produced high-performance tpu-architecture ai chip. weighing computing performance, cost, and energy consumption together, its unit computing power cost is only 50% of that of leading overseas gpus. zheng hanxun believes that in the later stages of large-model development, the cost-effectiveness of thousand- and ten-thousand-card clusters will be decisive. the moment chip supports direct high-speed interconnection of up to 1,024 chips, and the system-level performance of large computing clusters built on it can exceed traditional gpu clusters several times over.

▲zheng hanxun, co-founder and cto of zhonghao xinying

according to stephen feng, head of open accelerated computing products at inspur information, as large-model parameter counts grow, generative ai faces four major challenges: insufficient cluster scalability, high chip power consumption, difficult cluster deployment, and low system reliability. inspur information adheres to an application-oriented, system-centric approach, stimulating generative ai innovation through open-source systems.

on hardware openness, inspur's oam (open acceleration module) specification work accelerates the deployment of advanced computing power and supports faster iteration of large models and ai applications. on software openness, its large-model development platform "metabrain enterprise intelligence" (epai) gives enterprises a full-process platform for application development, using end-to-end solutions to tackle the hallucination problems foundation models hit when landing in specific domains, as well as complex development processes, high barriers, difficult multi-model adaptation, and high costs, accelerating the innovation and adoption of large-model applications in enterprises.

▲stephen feng, head of open accelerated computing products at inspur information

qingcheng jizhi was founded in 2023 and focuses on the ai infra track. the team was incubated in tsinghua university's department of computer science and has more than a decade of accumulated experience in intelligent computing power optimization.

shi tianhui, co-founder of qingcheng jizhi, shared that domestic high-performance computing systems face challenges such as difficult fault recovery and sub-healthy performance, and addressing them requires 10 core basic software systems working in concert; qingcheng jizhi already has self-developed products covering more than half of them.

at present, qingcheng jizhi has mastered full-stack technology from the underlying compiler to the upper-level parallel computing system, achieved full-stack coverage of the large-model industry ecosystem, and completed high-throughput inference optimization for multiple domestic chips as well as rapid porting and optimization of mainstream large models, significantly improving computing results. its large-model training system "bagualu", built for ultra-large-scale domestic computing clusters, can scale to a full-machine deployment of 100,000 servers and has been used to train models with 174 trillion parameters.
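
scale figures like these are easier to grasp through the bookkeeping of 3d (data x tensor x pipeline) parallelism, the standard way such clusters are organized. the configuration below is invented for illustration and is not bagualu's actual strategy:

```python
# hedged sketch: how a model spreads over a cluster under 3d parallelism.
# tensor (tp) and pipeline (pp) parallelism shard the weights; the
# data-parallel (dp) axis replicates them.
def shard_plan(total_params, n_devices, tp, pp):
    assert n_devices % (tp * pp) == 0
    dp = n_devices // (tp * pp)      # whatever remains is data parallel
    params_per_device = total_params / (tp * pp)
    return dp, params_per_device

total_params = 1e12                  # a 1t-parameter model (assumed)
n_devices = 4096                     # accelerators in the cluster (assumed)
dp, per_dev = shard_plan(total_params, n_devices, tp=8, pp=16)
print(f"dp={dp}, ~{per_dev / 1e9:.1f}B parameters per device, before optimizer state")
```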

▲shi tianhui, co-founder of qingcheng jizhi

huang xiaobo, technical marketing director of xinhe semiconductor, said that demand for computing power has grown 60,000-fold over the past 20 years and may grow another 100,000-fold in the next 10; memory and interconnect bandwidth have become the main bottlenecks. chiplet-based integrated systems have become an important direction for breaking through the limits of advanced process nodes and raising high-performance computing power in the post-moore era, and are already widely used in high-compute ai chips and the network switch chips of ai computing clusters.
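
those growth figures imply annual rates worth checking directly:

```python
# checking the quoted growth figures: 60,000x over 20 years versus a
# projected 100,000x over the next 10.
past = 60_000 ** (1 / 20)     # ~1.73x per year, doubling roughly every 15 months
future = 100_000 ** (1 / 10)  # ~3.16x per year, a much steeper curve
print(f"past 20 years: {past:.2f}x per year")
print(f"next 10 years (projected): {future:.2f}x per year")
```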

to that end, xinhe semiconductor provides a one-stop multiphysics simulation eda platform for designing and developing chiplet-based integrated systems. the platform supports parametric modeling of interconnect structures for mainstream processes, simulates 10 times faster than comparable platforms while using only 1/20 of the memory, and has built-in hbm/ucie protocol analysis to improve simulation efficiency. it has been adopted by many leading ai computing chip designers at home and abroad, helping accelerate the delivery of high-compute chiplet-integrated system products.

▲huang xiaobo, technical marketing director of xinhe semiconductor

during large-model training, network infrastructure accounts for 30% of the cost, which underscores the importance of network performance. according to zhu jundong, co-founder and vice president of products and solutions at singularity moore, the network has become the bottleneck of intelligent computing performance, and building an ai network requires fusing three networks: cluster-to-cluster interconnection, intra-cabinet interconnection, and intra-chip interconnection.

large intelligent computing clusters need high-performance interconnects, and rdma and chiplets have become the key technologies. to optimize rdma, singularity moore's ndsa network-acceleration chiplet series is built on a programmable many-core streaming architecture and uses a high-performance data engine to deliver high-performance data flow and flexible data acceleration. singularity moore's first gpu-link chiplet, "ndsa-g2g", is built on ethernet infrastructure; through its high-performance data engine and die-to-die interface technology, it achieves tb-class bandwidth in the scale-up network, with performance comparable to the global interconnect technology benchmark.
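
why interconnect bandwidth gates training at this scale can be seen from the standard ring all-reduce bound; the model size, precision, and link rates below are assumptions for illustration:

```python
# why the network gates step time: a ring all-reduce moves about
# 2 * (n - 1) / n * payload bytes per gpu, so gradient-sync time is set
# by per-gpu link bandwidth rather than flops.
def ring_allreduce_seconds(n_params, bytes_per_grad, n_gpus, link_gbytes_per_s):
    payload = n_params * bytes_per_grad
    traffic = 2 * (n_gpus - 1) / n_gpus * payload   # classic ring bound
    return traffic / (link_gbytes_per_s * 1e9)

# 70b parameters, fp16 gradients, 1024 gpus (all assumed)
for bw in (50, 400, 1000):  # ~nic-class, nvlink-class, tb-class scale-up
    t = ring_allreduce_seconds(70e9, 2, 1024, bw)
    print(f"{bw:>5} GB/s per gpu -> {t:.2f} s per gradient sync")
```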

▲zhu jundong, co-founder and vice president of products and solutions at singularity moore

alphawave provides ip, chiplet, and asic design solutions for hpc, ai, and high-speed networking applications. its senior business director for asia-pacific, guo dawei, shared that, to address the problems data faces in transmission, alphawave's ip achieves a bit error rate two orders of magnitude lower than competing products. the company can also assist with integration and verification, integrates deeply with the arm ecosystem, and provides full-lifecycle support for customers' socs.

on chiplets, alphawave helps customers shorten cycles, reduce costs, and improve yield and iteration speed; it has produced the industry's first multi-protocol io connectivity chiplet, which taped out this year. on custom chips, alphawave focuses mainly on processes of 7nm and below and can run the entire flow from specification to tape-out according to customer needs; it has completed more than 375 successful tape-outs with a dppm below 25.
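
what "two orders of magnitude lower bit error rate" means in practice is easiest to see per lane; the lane rate and ber values below are assumptions, not alphawave's published figures:

```python
# expected raw bit errors on one serdes lane per day: rate x ber x seconds.
lane_bps = 112e9                  # a 112 gb/s lane (assumed)
for ber in (1e-15, 1e-17):        # baseline vs "two orders of magnitude lower"
    errors_per_day = lane_bps * ber * 86_400
    print(f"ber {ber:.0e}: ~{errors_per_day:.1f} raw bit errors per lane-day")
```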

▲guo dawei, senior business director of alphawave asia-pacific

conclusion: downstream intelligence is surging, and ai chips are facing a historic opportunity

on the road toward the ultimate question of artificial general intelligence, the form of ai algorithms keeps changing, and ai chips change with it. where ancient sand meets future machine intelligence, technology and engineering wisdom collide and merge: finely designed ai chips are entering computing clusters and ordinary households, carrying the evolution of silicon-based life.

from intelligent computing centers and smart driving to ai pcs, ai phones, and new ai hardware, the downstream wave of intelligentization has brought a new round of historic opportunities for ai chips anchored in different scenarios. the rapid development of generative ai algorithms and applications keeps unlocking new computing challenges, while technological innovation and market demand together are expanding the ai chip market and diversifying its competitive landscape.

on september 7, the 2024 global ai chip summit will continue with a packed program: the main venue will host a special session on ai chip architecture innovation and a special session on edge/end-side ai chips, and will announce two lists, the "2024 china top 20 intelligent computing cluster solution companies" and the "2024 china top 10 ai chip emerging companies"; the sub-venues will host the intelligent computing cluster technology forum and the china risc-v computing chip innovation forum.