Dialogue at the China Computing Power Conference | Academician Liu Yunjie: Domestic computing power must make up for its shortcomings through GPU clusters
2024-09-29
The AI (artificial intelligence) era is also an era of computing power. Technology companies and telecom operators at home and abroad are racing to build clusters of 10,000 cards or more, but problems such as ecosystem compatibility and heterogeneous computing have become mountains that the industry must climb.
On September 28, during the opening ceremony of the 2024 China Computing Power Conference, Liu Yunjie, an academician of the Chinese Academy of Engineering, said in an interview with Beijing News Shell Finance and other media that domestic endpoint GPUs will remain unable to compete with foreign ones in the short term. A possible way to make up for the shortfall is to build a computing power network to "train the entire computing power" and give full play to the effect of GPU clusters.
He also pointed out that one cannot simply judge which type of enterprise has the advantage in building a computing power network; the question has to be settled mainly through technical evaluation. "It depends on whether your technology can be used and developed, and whether your innovation and the path you take meet the needs." On the cost of computing power, he again emphasized that "it must be solved with new technologies."
At present, the deterministic network technology that Liu Yunjie studies can save 60% to 70% of costs. In the computing network scheduling project he launched together with other institutions, training across multiple remote sites can reach 80% of the efficiency of single-site training.
Liu Yunjie, academician of the Chinese Academy of Engineering. Photo courtesy of the interviewee.
Recommendation: take the industry large model track to solve the problems of data circulation and computing power utilization
"China must take the path of industry large models," Liu Yunjie emphasized in his keynote speech. He believes that domestic general-purpose large models may lag far behind those of the United States in the short term, and that catching up will be difficult.
He proposed that if domestic model companies can train well on industry data and build industry large models on top of general-purpose large models, they "can definitely follow the Chinese path." He is optimistic about this technical direction because he believes that "China's industry data is the most complete and comprehensive."
At the same time, he said that developing industry large models requires the joint efforts of government, enterprises, and capital. He told a Shell Finance reporter that the sharing and circulation of domestic data still need to be strengthened, which has affected the training of industry large models, and that "everyone is still exploring" which type of track is more promising.
Data disclosed at the 2024 China Computing Power Conference show that total national computing power has reached 246 EFLOPS. In Liu Yunjie's observation, domestic computing power has reached a certain scale, but its utilization rate is less than ideal.
"If computing power is to serve the real economy, several parties must all approve." Liu Yunjie believes that, first, computing power and network providers must approve, "(because) they have obtained benefits through these services." The government must also approve, "(because) the government has solved the problem." Finally, enterprises must approve, "(because) companies have improved their own efficiency by using computing power and the network."
He emphasized that approval from only one party does not last, and that it signals the industry has not yet established a computing power ecosystem. "If we don't solve the ecosystem problem, we won't be able to use it (computing power)."
Deterministic networking is one of the basic technologies of future computing power networks and can save 60%-70% of costs
"Large model training requires lossless data transmission and places requirements on network indicators such as packet loss, jitter and delay," Liu Yunjie said. Taking international data standards as an example, he explained that if the packet loss rate reaches five per thousand (0.5%), transmission efficiency drops by 50%.
He further explained that this is like using a full 100G of bandwidth to transmit data while only 50G of it is useful. "When it (the packet loss rate) reaches 1%, efficiency is approximately equal to 0, which makes training and inference impossible."
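As a rough illustration of the two figures quoted above, the sketch below maps each packet loss rate to the useful share of a 100G link. The function name and the lookup table are illustrative assumptions, not part of any tool mentioned in the interview. The article gives only these two data points, so the sketch deliberately refuses to interpolate between them.

```python
# The only two data points quoted in the article; nothing in between is stated.
QUOTED_EFFICIENCY = {
    0.005: 0.5,  # 0.5% packet loss: transmission efficiency drops by 50%
    0.01: 0.0,   # 1% packet loss: efficiency "approximately equal to 0"
}


def useful_bandwidth_gbps(link_gbps: float, loss_rate: float) -> float:
    """Return the useful part of a link's bandwidth for a quoted loss rate."""
    if loss_rate not in QUOTED_EFFICIENCY:
        # No curve is given in the article, so other rates are not guessed.
        raise ValueError("the article gives no figure for this loss rate")
    return link_gbps * QUOTED_EFFICIENCY[loss_rate]


for loss in (0.005, 0.01):
    useful = useful_bandwidth_gbps(100, loss)
    print(f"{loss:.1%} loss on a 100G link -> {useful:.0f}G useful")
```

The table lookup is a deliberate choice: a formula would imply a relationship between loss and throughput that the interview does not provide.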
To avoid packet loss, the network needs the RDMA (remote direct memory access) protocol. This technology lets a computer directly access the memory of a remote computer and transfer data at the memory level without frequent CPU intervention, which reduces processing delay and resource consumption at both the sending and receiving ends.
How can the data transmission standards for large model training and inference be met? Liu Yunjie believes that deterministic network technology meets these requirements relatively well, and he judges it to be "a basic technology for future computing power networks." He revealed that in 2022 he led a team that opened a deterministic network in 35 cities, a number that has since grown to 39. The network can achieve end-to-end delay and jitter of less than 50 microseconds and zero packet loss.
Liu Yunjie believes that the most important technological breakthrough in developing deterministic network technology is photoelectric integration, which brings gains in bandwidth utilization, network cost and energy consumption.
On cost, he cited an autonomous driving company as an example. The data generated every day by 20 vehicles in 4 locations across the country is first sent back to Shanghai and then on to Guiyang for training, which requires two 10G circuits and one 1G circuit and costs about 10 million yuan a year.
What if the company cannot afford that? Switching to collecting the data on hard drives and transporting them between the two cities would cost about 1.9 million yuan a year once data loss, hard drive damage and the like are taken into account. With a deterministic network providing the service through slicing, "120,000 yuan a year is enough."
Liu Yunjie emphasized that this level of cost reduction is achieved through network sharing. The data he presented in his keynote speech showed that the technology has been running on the test network for more than three months, with parameter efficiency above 95% and cost savings of 60% to 70%.
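The three annual figures quoted for the autonomous driving example can be compared with a few lines of arithmetic. The sketch below is only an illustration: the dictionary keys and the function name are made up for this example, and the yuan amounts are the approximate values given in the interview. The ratios it prints apply to this one customer example and are a separate figure from the 60% to 70% savings reported for the test network.

```python
# Approximate annual costs in yuan, as quoted for the autonomous driving example.
ANNUAL_COST_YUAN = {
    "dedicated circuits (two 10G + one 1G)": 10_000_000,
    "hard drives shipped between cities": 1_900_000,
    "deterministic network slice": 120_000,
}


def saving_vs(baseline: str, option: str) -> float:
    """Fraction of the baseline's annual cost saved by switching to option."""
    base = ANNUAL_COST_YUAN[baseline]
    return (base - ANNUAL_COST_YUAN[option]) / base


slice_name = "deterministic network slice"
for name in ANNUAL_COST_YUAN:
    if name != slice_name:
        print(f"{slice_name} vs {name}: {saving_vs(name, slice_name):.1%} saved")
```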
Give full play to the effect of GPU clusters to make up for the shortcomings of domestic computing power
Could the computing power network be the direction in which domestic computing power overtakes foreign computing power in the future? Liu Yunjie said a more accurate way to put it is "making up for shortcomings." He believes that in the short term, our endpoint GPUs will still be unable to compete with foreign ones. "I may not be able to match you in a single aspect, but I can beat you by leveraging the power of the group." He further emphasized that exerting the effect of GPU clusters requires building a network to "train the entire computing power."
He believes that large models can take the path of collaborative training and distributed training. "If 100,000 cards are trained in one place, the power will be too much." He revealed that the national computing power network scheduling project, which his team launched jointly with the Chinese Academy of Sciences, the National Supercomputing Center in Wuxi and other institutions, can resolve queuing problems within minutes, and that training across multiple remote sites can reach 80% of the efficiency of single-site training. "Basically, distributed training and collaborative training are feasible."
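To make the 80% figure concrete, the sketch below converts it into two rough quantities: how many single-site cards a multi-site run is worth, and how much longer the same job would take. Both function names are hypothetical, and applying the 80% ratio uniformly to a 100,000-card run is an assumption made for illustration; the interview does not say at what scale the figure was measured.

```python
def single_site_equivalent_cards(total_cards: int, relative_efficiency: float = 0.8) -> float:
    """Cards' worth of single-site training delivered by a multi-site run."""
    if not 0.0 < relative_efficiency <= 1.0:
        raise ValueError("relative_efficiency must be in (0, 1]")
    return total_cards * relative_efficiency


def relative_training_time(relative_efficiency: float = 0.8) -> float:
    """How many times longer the same job takes than at a single site."""
    if not 0.0 < relative_efficiency <= 1.0:
        raise ValueError("relative_efficiency must be in (0, 1]")
    return 1.0 / relative_efficiency


# Card count taken from the quote; spreading the cards across sites is hypothetical.
cards = 100_000
print(f"{cards:,} cards across sites ~ {single_site_equivalent_cards(cards):,.0f} single-site cards")
print(f"the same job takes about {relative_training_time():.2f}x as long")
```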
Speaking about how to coordinate the development of computing hardware and software, Liu Yunjie proposed that the two should be combined and developed in an integrated way.
Hardware production consumes the earth's physical resources, he said. "(Every time) it consumes a little, the resources are a little less." Software is relatively flexible, can be modified, and consumes fewer physical resources. "This is a very important social development concept." In addition, Liu Yunjie believes that software development consumes a certain amount of human resources, but that with the application of AI, development has become more efficient. He therefore proposed that every part that can be replaced by software should, as far as possible, be developed as software.
"But software is not omnipotent and must meet the hardware conditions required by computing power." He believes that the parts software cannot handle must be developed in conjunction with hardware.
How can a shared computing power network ecosystem be created? Liu Yunjie suggested that the relevant government departments coordinate and manage it, and that enterprises and scientific research institutions cooperate closely. "This is an overall project, but currently everyone is working on their own."
Beijing News Shell Finance reporter Wei Yingzi
Editor Lin Zi
Proofread by Liu Jun