
New Product | Inspur Information Releases X400 Super AI Ethernet Switch Supporting Spectrum-X Platform

2024-07-23


July 23 news: Inspur Information has released the "X400 Super AI" Ethernet switch for generative AI. It is the first domestic product to support NVIDIA Spectrum-X platform technology. Based on the X400 and BlueField-3 SuperNICs, Inspur Information has built the end-to-end collaborative X400 Super AI Ethernet (X400 AI Fabric) solution, raising training performance at the 10,000-GPU ("Wanka") scale to 1.6 times that of traditional RoCE.

It is reported that as competition in the era of large models intensifies, the speed at which large models can be iterated has become central to market competitiveness. However, as computing power continues to grow, the performance of a single chip is no longer the decisive factor; the efficiency of the AI system as a whole has become the focus for users.

In current AI large-model training, network communication accounts for 20-40% of training time. For example, earlier statistics from Meta show that in AI training, network communication takes 35% of the time on average (and up to 57%), which means GPUs purchased at a cost of millions or even billions of dollars sit idle 35% of the time. To improve GPU resource utilization, network communication efficiency urgently needs to be improved. However, the uneven hashing of ECMP in traditional RoCE networks leads to low overall link load utilization, and while dedicated network solutions can meet the performance requirements, they cannot take advantage of the mature Ethernet ecosystem.
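The uneven-hashing problem mentioned above can be seen with a toy simulation. The sketch below is illustrative only (it is not Inspur's or NVIDIA's implementation): each flow is pinned to one uplink by a hash of its 5-tuple, so a handful of large collective-communication flows that collide on the same uplink congest it while other uplinks sit underused. All addresses, flow counts, and rates are made-up assumptions.

```python
import zlib
from collections import Counter

# Illustrative sketch of static, flow-level ECMP hashing in a RoCE fabric.
# A few large "elephant" flows dominate AI training traffic; whenever two of
# them hash to the same uplink, that link saturates while others stay idle.

NUM_UPLINKS = 8      # assumed number of equal-cost uplinks
NUM_FLOWS = 16       # assumed number of concurrent collective flows
FLOW_GBPS = 50       # assumed per-flow rate, purely illustrative

# RoCEv2 traffic uses UDP destination port 4791; the IPs here are made up.
flows = [(f"10.0.0.{i}", f"10.0.1.{(i * 7) % NUM_FLOWS}", 49152 + i, 4791)
         for i in range(NUM_FLOWS)]

link_load = Counter()
for five_tuple in flows:
    key = "|".join(map(str, five_tuple)).encode()
    uplink = zlib.crc32(key) % NUM_UPLINKS   # static hash -> one path per flow
    link_load[uplink] += FLOW_GBPS

for uplink in range(NUM_UPLINKS):
    print(f"uplink {uplink}: {link_load.get(uplink, 0):4d} Gbps")

# Typically some uplinks carry two to three times the average load while
# others carry nothing: the "uneven hashing" effect that caps effective link
# utilization. Per-packet adaptive routing with order restoration at the NIC
# (the approach described for X400 + BlueField-3) spreads the same traffic
# across all uplinks instead.
```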

Recently, Inspur Information's X400 Super AI Ethernet solution, built on the Spectrum-X platform, has opened a new path for constructing AI large-model training networks through endpoint-to-network collaboration. It addresses customer challenges in four areas: performance, scalability, stability, and user experience, so customers no longer have to choose between Ethernet and dedicated networks. At the same time, the X400 Super AI Ethernet switch adopts an open architecture and follows the S3IP-UNP specification, decoupling software and hardware into layers and accelerating customer business innovation by building an open network ecosystem. In actual GPT-3 model training tests at a scale of 16K GPU cards, Super AI Ethernet achieved a performance breakthrough, reaching 1.6 times that of traditional RoCE.

In terms of network performance, the X400 Super AI Ethernet solution uses coordinated scheduling between the X400 switch and smart NICs. Through technologies such as adaptive routing, packet order preservation, and programmable congestion control (CC), switches and NICs work more closely together to provide a zero-packet-loss, non-blocking, full-link switching network for large AI models. Inter-machine interconnect runs at 400G, effective bandwidth rises from the traditional 60% to 95%, and performance reaches 1.6 times that of traditional RoCE.
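A quick back-of-envelope check ties these figures together. The 60% and 95% effective-bandwidth values and the 400G link rate come from the article; the per-link arithmetic below is our own illustration and is not an official benchmark.

```python
# Back-of-envelope check: effective bandwidth per 400G link and the implied
# speedup of the communication phase (figures quoted from the article).

LINK_RATE_GBPS = 400

roce_effective = 0.60 * LINK_RATE_GBPS   # ~240 Gbps usable on traditional RoCE
x400_effective = 0.95 * LINK_RATE_GBPS   # ~380 Gbps usable with X400 AI Fabric

speedup = x400_effective / roce_effective
print(f"traditional RoCE: {roce_effective:.0f} Gbps effective per link")
print(f"X400 AI Fabric:   {x400_effective:.0f} Gbps effective per link")
print(f"communication speedup: ~{speedup:.2f}x")   # ~1.58x, consistent with the ~1.6x figure
```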

In terms of flexible expansion of computing resources, X400 Super AI Ethernet combines high port density and elastic scalability with ultra-high performance, supporting compute scales of hundreds of thousands of GPU cards. In a two-tier network, the number of GPU servers can reach 1,024, supporting 8K GPU cards. Depending on the scale of computing power, the fabric can be flexibly expanded to a three-tier topology, in which the number of GPU servers can reach 64,000 and the maximum number of supported GPU cards can reach 512K, meeting networking requirements at every scale. Flexible, elastic networking thus becomes a powerful boost to business innovation.
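The scaling figures above are internally consistent. The short check below uses only the server and GPU counts quoted in the article; the 8-GPUs-per-server figure is implied by those numbers rather than stated explicitly.

```python
# Consistency check of the quoted fabric sizes: servers x GPUs per server.
GPUS_PER_SERVER = 8   # implied: 8,192 / 1,024 = 512,000 / 64,000 = 8

two_tier_servers = 1_024      # two-tier network, per the article
three_tier_servers = 64_000   # three-tier network, per the article

print(f"two-tier fabric:   {two_tier_servers:>6,} servers -> "
      f"{two_tier_servers * GPUS_PER_SERVER:,} GPUs (8K)")
print(f"three-tier fabric: {three_tier_servers:>6,} servers -> "
      f"{three_tier_servers * GPUS_PER_SERVER:,} GPUs (512K)")
```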

In terms of operational efficiency, Inspur Information's X400 Super AI Ethernet solution retains the compatibility and cost-effectiveness of Ethernet, ensuring agile operations and maintenance and ultra-high performance while significantly reducing the TCO of network construction. It also provides customers with one-click automated deployment and model-aware adaptive network configuration, shortening the deployment cycle from weeks to days and accelerating business launch, and it pairs this with a comprehensive, visual intelligent operations and maintenance platform that surfaces potential risks and faults to ensure business continuity. (Dingxi)