
Shen Dou, Executive Vice President of Baidu Group and President of Baidu Intelligent Cloud: Large Models Are Being Closely Integrated with Cloud Computing and Are Becoming a New Type of Infrastructure

2024-09-25


On September 25, at the 2024 Baidu Cloud Intelligence Conference, Shen Dou, Executive Vice President of Baidu Group and President of Baidu Intelligent Cloud Business Group, said that the past year was a pivotal one in which large models moved from technological change to industrial change, and that large models are being closely integrated with cloud computing and are becoming a new type of infrastructure. "Large models and the systems around them have become a new generation of infrastructure in just a few years. The speed of this change is unprecedented."

Shen Dou, Executive Vice President of Baidu Group and President of Baidu Intelligent Cloud Group. Image source: provided by the company

On the computing power behind large models, Shen Dou noted that many people have heard of the "ten-thousand-card cluster." Put simply, such GPU clusters have three characteristics: extreme scale, extremely high density, and extreme interconnection.

these "extremes" have brought several severe challenges. shen dou introduced that the first is the huge construction and operation costs. to build a cluster of 10,000 gpus, the purchase cost of gpus alone is as high as tens of billions of yuan. secondly, the complexity of operation and maintenance increases dramatically on such a large-scale cluster. he said that hardware will inevitably fail, and the larger the scale, the higher the probability of failure. "when meta trained llama3, it used a cluster of 16,000 gpu cards, and on average, there was a failure every 3 hours."

Shen Dou added that most of these failures were caused by GPUs, which are in fact very sensitive hardware; even midday fluctuations in outside temperature affect their failure rate. These two challenges forced Baidu to rethink how to build, manage, and maintain large, complex GPU clusters, shield users from the complexity of the hardware layer, and provide a simple, easy-to-use computing platform covering the entire process of putting large models into production, so that users can manage GPU computing power more easily and use it at lower cost. "Over the past year, we have seen customers' demand for model training soar, and the scale of the clusters they need has grown with it. At the same time, expectations that model inference costs will keep falling have also risen. All of this places higher requirements on the stability and effectiveness of GPU management."

Against this backdrop, Baidu Intelligent Cloud announced a full upgrade of its Baige AI heterogeneous computing platform to version 4.0. Focused on the computing power needed across the entire journey of deploying large models, it provides enterprises with AI infrastructure that is "more, faster, more stable, and more economical" across four areas: cluster creation, development and experimentation, model training, and model inference.

Among the upgrades, to address the shortage of computing power resources, Baige 4.0 focuses on "multi-chip mixed training," achieving 95% mixed-training efficiency on a ten-thousand-card cluster, an industry-leading level. For cluster deployment, the upgraded Baige enables second-level deployment at the tooling level, cutting the preparation time for a ten-thousand-card cluster from several weeks to as little as 1 hour, greatly improving deployment efficiency and shortening the time to bring workloads online. To cope with the frequent failures that occur during large model training, Baige 4.0 comprehensively upgrades fault detection and automatic fault tolerance, effectively reducing the frequency of failures and significantly shortening the time needed to handle them, achieving an effective training time of more than 99.5% on a ten-thousand-card cluster.
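The 99.5% effective-training-time figure can be read through a simple model, sketched below, in which each failure costs a fixed recovery window (detection, restart, and rolling back to the last checkpoint). The failure intervals and recovery times in the sketch are illustrative assumptions, not Baidu's published numbers; only the 99.5% target comes from the article.

```python
# Illustrative link between failure frequency, recovery time, and "effective
# training time". The 99.5% figure is from the article; the specific failure
# intervals and recovery times below are assumptions for this sketch.

def effective_training_ratio(hours_between_failures: float,
                             recovery_hours: float) -> float:
    """Fraction of wall-clock time spent on useful training, assuming each
    failure costs a fixed recovery window (detection, restart, checkpoint
    rollback) and there is no other overhead."""
    return hours_between_failures / (hours_between_failures + recovery_hours)

scenarios = [
    ("failure every 3 h, 30 min recovery ", 3.0, 0.5),
    ("failure every 3 h,  3 min recovery ", 3.0, 0.05),
    ("failure every 10 h, 3 min recovery ", 10.0, 0.05),
]
for label, between, recovery in scenarios:
    ratio = effective_training_ratio(between, recovery)
    print(f"{label}: effective training time {ratio:.1%}")
```

In this simplified model, reaching 99.5% requires improving both sides at once: failures must become rarer and the handling of each failure must become nearly automatic, which is the combination Baige 4.0's upgraded fault detection and fault tolerance are aimed at.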

In addition, Baidu Intelligent Cloud released the latest "report card" for its Qianfan large model platform: the Wenxin models are now called more than 700 million times per day on Qianfan, and the platform has helped users fine-tune 30,000 large models and build more than 700,000 enterprise-level applications. Over the past year, the price of the flagship Wenxin models has fallen by more than 90%.

Daily Economic News
