
The world's first 1,000-card-scale heterogeneous chip mixed-training platform is released! Wuwen Xinqiong: make AI computing power easy to use

2024-07-15



Smart Things
Author: ZeR0
Editor: Mo Ying

"Before turning on the tap, we don't need to know which river the water comes from. Similarly, when we use AI applications in the future, we won't need to know which base models they call or which accelerator cards supply the computing power - this is the best AI Native infrastructure."

Such AI Native infrastructure needs to be built by everyone together. On July 4, at the AI Infrastructure Forum of the 2024 World Artificial Intelligence Conference, Xia Lixue, co-founder and CEO of Wuwen Xinqiong, released the world's first 1,000-card-scale heterogeneous chip mixed-training platform, whose 1,000-card heterogeneous mixed-training clusters achieve a computing power utilization rate of up to 97.6%.


At the same time, Xia Lixue announced that Wuwen Xinqiong's Infini-AI cloud platform has integrated large-model heterogeneous 1,000-card mixed-training capability. It is the world's first platform that can run a single training task across 1,000 heterogeneous chips; it scales to 10,000 cards and supports mixed training of large models on six kinds of heterogeneous chips: AMD, Huawei Ascend, Tianshu Zhixin, Muxi, Moore Threads, and NVIDIA.

Starting in July, users whose trial-training applications have been approved can launch 70-billion-parameter-scale large-model training on Infini-AI with one click.

Just four months ago, Wuwen Xinqiong's Infini-AI large-model development and service cloud platform opened its first public beta. Large-model companies such as Zhipu AI, Dark Side of the Moon, and Shengshu Technology have been stably using heterogeneous computing power on Infini-AI. More than 20 AI Native application startups also continuously call the various preset model APIs on Infini-AI and use the tool chain provided by Wuwen Xinqiong to develop their own businesses.

The launch of the world's first platform capable of 1,000-card-scale heterogeneous chip mixed training reflects not only Wuwen Xinqiong's technical strength in heterogeneous computing optimization and cluster system design, but also an important achievement of its adherence to the "M×N" middle-layer ecosystem concept.

Wuwen Xinqiong was the first to build an "M×N" middle-layer ecosystem structure, enabling efficient, unified deployment of multiple large-model algorithms on multiple chips.

The Infini-AI platform already supports more than 30 models, including the Qwen2, GLM4, Llama 3, Gemma, Yi, Baichuan2, and ChatGLM3 series, and more than 10 kinds of computing cards, including AMD, Huawei Ascend, BiRen, Cambricon, Suiyuan, Haiguang, Tianshu Zhixin, Muxi, Moore Threads, and NVIDIA. It supports both one-to-one pairing of a single algorithm with a single chip and free matching and combination of multiple models with multiple chips.

According to Xia Lixue, by the end of this year Wuwen Xinqiong expects to fully realize M×N automatic routing from models to chips.


1. 10,000-card clusters are a battleground for large models, and China faces difficulties in bridging its ecosystems

Xia Lixue, co-founder and CEO of Wuwen Xinqiong, believes that computing power is the outpost and cornerstone of AI development. Model scale has not continued to grow exponentially since GPT-4, because the computing power needed to support the algorithms has hit a bottleneck. At present, no one can build a single-model system of substantially larger scale and computing power, which has slowed and even stalled model development. In other words, the computing power systems that will carry models to the next generation still need to be developed and built.

Under the influence of the Scaling Law, large models are competing for computing power worldwide. Microsoft and OpenAI are reportedly building a computing power project worth more than 100 billion US dollars. Compared with many other techniques, this simple, brute-force scale expansion yields the most tangible gains in model intelligence. Google, OpenAI, large domestic companies, and the three major telecom operators are all building clusters of tens of thousands of cards.

In a truly sustainable, iterative, large, and stable system, the Scaling Law has unique advantages: it does not depend on a pile of clever tricks and is easier to maintain and scale. For a system meant to run for a long time, scalability is a crucial attribute, and a scalable system is a good system.


IDC charts show that demand for AI training and inference computing power is growing rapidly worldwide, and both require substantial computing resources. The domestic and foreign ecosystems behind this huge market differ greatly: abroad, the model layer and chip layer are relatively concentrated, while China's ecosystem is more decentralized and vibrant. Both the model layer and the chip layer are racing to expand the computing power market, and connecting the ecosystems raises many key issues.


The 10,000-card cluster is a battleground for large models. Xia Lixue shared that more than 100 clusters of 1,000 cards or above are under construction or planned in China, most of them built on heterogeneous computing power: many clusters use chips from different vendors for AI production. The reasons include the supply-chain risk of over-reliance on a single hardware platform, and the rapid performance improvement of domestic chips, which gives cluster builders a variety of options.

However, the large number of heterogeneous chips has also created "ecological silos": hardware ecosystems are closed and incompatible, software stacks cannot interoperate, and using the computing power involves a series of very complex engineering challenges. Even with many computing power clusters, effective integration and utilization remain difficult, wasting computing resources. This has become not only the biggest obstacle to building AI Native infrastructure, but also an important cause of the "computing power shortage" currently facing the large-model industry.


Wuwen Xinqiong wants to build AI Native infrastructure adapted to China's multi-model, multi-chip landscape: a computing platform that efficiently integrates heterogeneous computing resources, plus middleware for joint software-hardware optimization and acceleration, breaking down the existing "ecological silos" and letting heterogeneous chips and clusters truly become large-scale computing power.


AI training and inference tasks differ greatly from traditional computing; for example, a single task can be both large and bursty. Without a more AI Native scheduling strategy, overall resource utilization will be very low, and customer tasks may even crash and restart frequently, delaying AI development.

At the bottom of Wuwen Xinqiong's solution is a complete cloud management system, including scheduling capabilities and PaaS and MaaS platforms. This bottom layer is a cloud-coordinated computing power base that lets large-model developers and researchers get started immediately and quickly use different kinds of computing power.

The MaaS (Model-as-a-Service) platform built on this base provides many flexible, ready-to-use large-model services, helping companies still in the AI learning stage to agilely develop large-model applications.


2. Cross-training on different chips to reduce the cost of applying large models

Behind this series of research and production advances, the Wuwen Xinqiong R&D team has accumulated extensive practical experience and results in heterogeneous chip computing optimization and cluster system design.

Recently, a joint research team from Wuwen Xinqiong, Tsinghua University, and Shanghai Jiao Tong University released HETHUB, a heterogeneous distributed mixed-training system for large-scale models. It is the industry's first system to achieve mixed training across six different brands of chips, with a high degree of engineering completeness. According to Xia Lixue, the original intention behind engineering this technology was to keep raising the ceiling of large-model capability by integrating more heterogeneous computing power, while continuing to reduce the cost of applying large models by connecting the heterogeneous chip ecosystem.


He said the two main challenges in building the system were communication and distributed training. Different hardware architectures use different communication libraries, which is like asking two people who speak completely different languages to complete a large project together. Heterogeneous cards also differ in design philosophy and performance and suit different tasks; these efficiency differences across card types make large-scale distributed training inefficient.

Therefore, the team has done a lot of work, including:


1. For communication, it built a general collective communication library that enables efficient communication between different kinds of chips and is compatible with a wide variety of hardware;

2. It proposed a non-uniform splitting scheme based on pipeline parallelism, which addresses differing hardware efficiency by assigning each card the workload best suited to it;

3. Its self-developed mixed-training prediction tool estimates each chip's contribution before training starts, so as to find an optimal splitting strategy and complete the whole training task in the most efficient way across different cards.
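The non-uniform splitting idea in point 2 can be illustrated with a minimal sketch: assign model layers to pipeline stages in proportion to each chip's throughput, so faster chips take on more work. This is not Wuwen Xinqiong's actual algorithm, and the throughput figures below are hypothetical.

```python
# Illustrative sketch of non-uniform pipeline splitting: layers are
# allocated to stages in proportion to each stage's relative throughput.
# Throughput numbers are made up for illustration.

def split_layers(total_layers, throughputs):
    """Return the number of layers per pipeline stage, proportional to throughput."""
    total = sum(throughputs)
    # Initial proportional allocation, rounded down.
    counts = [int(total_layers * t / total) for t in throughputs]
    # Hand out leftover layers to the stages with the largest remainders.
    remainders = [total_layers * t / total - c for t, c in zip(throughputs, counts)]
    for i in sorted(range(len(counts)), key=lambda i: -remainders[i]):
        if sum(counts) == total_layers:
            break
        counts[i] += 1
    return counts

# Four heterogeneous stages with relative throughputs 1.0, 0.8, 0.6, 0.6:
# the fastest chip hosts the most layers.
print(split_layers(32, [1.0, 0.8, 0.6, 0.6]))  # → [11, 9, 6, 6]
```

A real system would measure throughput per chip (as the prediction tool in point 3 does) rather than assume it, but the proportional-allocation principle is the same.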

Judging from actual mixed-training results, many of Wuwen Xinqiong's chip combinations reach computing power utilization above 70%, with a peak of 97.6%. Mixed training across the 6 different chip combinations has reached the 1,000-card scale.


Previously, Wuwen Xinqiong achieved M×N inference; now it has achieved M×N training, a very big breakthrough.

These capabilities have been integrated into the existing Infini-AI platform, which already enables users to deploy applications and services efficiently. With the added mixed-training capability, it can cross-combine 6 brands of chips, breaking the training bottleneck of any single brand. It is the world's first platform to support 1,000-card-scale heterogeneous mixed training.

Infini-AI supports multiple training strategies, including tensor parallelism, data parallelism, and communication overlap, enabling efficient training; it supports training of large models with more than 70 billion parameters, with one-click mixed training. Using the platform, developers need not spend extra time on differences in the underlying computing power: they can quickly customize their own large models on mixed clusters composed of different chips and rapidly bring their business online.

3. Efficient scheduling + efficient fault tolerance ensure stable completion of tasks on large computing power clusters

Once a large computing power cluster is built, a core question is how to use it. This is a matter of efficient scheduling: an efficient computing power scheduling system lets all users make better use of the integrated heterogeneous resources.

Wuwen Xinqiong has made substantial progress in efficient computing power scheduling. Its unified management of multiple heterogeneous clusters supports more than ten types of chips and can build computing systems of more than 10,000 cards. Through a series of hybrid scheduling strategies, average task-scheduling latency is in the millisecond range, and cluster-wide resource utilization stays above 90%. By strengthening the AI container base, Wuwen Xinqiong can raise the cluster's SLO to 99.95% in multi-tenant scenarios, with very high fault tolerance.

Beyond scheduling, model training must not be interrupted and restarted over and over. Wuwen Xinqiong has developed an efficient fault-tolerant training system, including a fault-tolerant runtime for large models, a hybrid-indicator anomaly prediction system, and an asynchronous checkpoint read/write system.


The fault-tolerance work has increased effective training time for large models by 30% and raised the anomaly detection success rate to 70%, detecting and avoiding most errors in advance. It has increased checkpoint read/write efficiency by 20 times and reduced abnormal interruption time to within 5 minutes, ensuring that tasks complete stably on large computing power clusters.
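The asynchronous checkpoint idea can be sketched as follows: the training loop takes a quick in-memory snapshot of the model state, and a background thread persists it to disk while training continues. This is an illustration of the general technique, not Wuwen Xinqiong's actual implementation; the state dictionary and file name are made up.

```python
# Hypothetical sketch of asynchronous checkpointing: the slow disk write
# happens off the training loop's critical path.
import copy
import pickle
import threading

def save_async(state, path):
    """Snapshot the state synchronously, then write it to disk asynchronously."""
    snapshot = copy.deepcopy(state)          # brief pause: in-memory copy only
    def _write():
        with open(path, "wb") as f:          # slow I/O runs in the background
            pickle.dump(snapshot, f)
    t = threading.Thread(target=_write)
    t.start()
    return t                                 # caller can join() to ensure durability

# Usage: checkpoint without blocking on disk I/O, then keep training.
state = {"step": 100, "weights": [0.1, 0.2, 0.3]}
writer = save_async(state, "ckpt_step100.pkl")
state["step"] += 1                           # training proceeds immediately
writer.join()                                # checkpoint is now safely on disk
```

Because the snapshot is taken before training resumes, later mutations of `state` do not corrupt the saved checkpoint.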

To help developers make better use of the cluster, the platform integrates the optimization technologies of Wuwen Xinqiong's large-model serving system. Under very high concurrency, when many users send requests at the same time, techniques such as request scheduling and prompt caching help distribute tasks and return results more effectively, achieving more than 30 times the throughput and making applications run more smoothly.
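The prompt-caching technique mentioned above can be sketched with a minimal prefix cache: requests that share a common prompt prefix (such as a system prompt) reuse its cached computation instead of redoing it. The cache key and the stand-in "encode" step below are illustrative, not the serving system's real internals.

```python
# Hedged sketch of prompt-prefix caching: the expensive prefill over a
# shared prefix runs once, then is reused by later requests.

class PrefixCache:
    def __init__(self):
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def encode(self, prefix):
        """Stand-in for the expensive prefill computation over the prefix."""
        if prefix in self.cache:
            self.hits += 1                     # reuse cached result
        else:
            self.misses += 1
            self.cache[prefix] = len(prefix)   # placeholder for real KV state
        return self.cache[prefix]

cache = PrefixCache()
system_prompt = "You are a helpful assistant."
for question in ["What is 2+2?", "Define entropy.", "Translate 'hello'."]:
    kv = cache.encode(system_prompt)           # prefill runs only on the first request
print(cache.hits, cache.misses)                # → 2 1
```

With many concurrent users sharing the same system prompt, this kind of reuse is one way such a serving layer can multiply throughput.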


Conclusion: Make AI computing power accessible to everyone

"Pushing up the technological ceiling does not contradict the deployment and diffusion of technology; it depends on how determined we are in treating this technology." Xia Lixue believes that saying today that the cost of large models can be cut to 1/10,000 is like saying 30 years ago that every household would have electricity.

Good infrastructure is a kind of "magic": when marginal cost drops below a critical value, more people can embrace new technology.


At present, the large-model industry is entering the stage of large-scale industrial deployment, and the flourishing of application scenarios is creating increasingly urgent demand for large-model training. Building AI Native infrastructure in the large-model era not only gives AI developers a more general, efficient, and convenient R&D environment, but also enables effective integration of computing resources, a key cornerstone for the sustainable development of the AI industry.

AI development requires both the ability to integrate diverse heterogeneous chips into the underlying system and an easy-to-use middle layer between heterogeneous computing power and the many algorithms, letting users schedule different kinds of computing power through a unified programming framework, packaged behind interfaces compatible with users' existing programming habits to ease future expansion.

Wuwen Xinqiong is committed to building AI Native infrastructure truly adapted to multiple models and multiple chips, making AI computing power readily available worldwide. Its goal is not only the effective connection, utilization, and integration of "M×N", but ultimately to turn computing power resources that now lie dormant into real large-scale computing power, raising the completeness of the large-model ecosystem, significantly reducing the cost of deploying large models, and helping drive large-model application innovation across industries.