
Meta's Llama 3 training ran into a failure roughly every three hours

2024-07-29


According to a research report Meta released on July 28, the 16,384-GPU NVIDIA H100 cluster used to train the 405-billion-parameter Llama 3 model experienced 419 unexpected failures over 54 days, an average of roughly one failure every three hours. More than half of the failures were traced to the GPUs or their high-bandwidth memory (HBM3).
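As a quick sanity check, the "one failure every three hours" cadence follows directly from the reported totals. The snippet below (not from Meta's report, just the arithmetic) shows the calculation:

```python
# Quick arithmetic check of the reported failure cadence
# (totals taken from the report as cited above).
training_days = 54
unexpected_failures = 419

training_hours = training_days * 24                    # 1,296 hours of training
mtbf_hours = training_hours / unexpected_failures      # mean time between failures
print(f"one unexpected failure roughly every {mtbf_hours:.1f} hours")
# -> one unexpected failure roughly every 3.1 hours
```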


Because of the system's enormous scale and the tight synchronization of the training job, a single GPU failure can interrupt the entire run and force a restart. Even so, the Meta team maintained more than 90% effective training time.
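To see why one failed GPU can stall the whole job, note that synchronous data-parallel training relies on collective operations that every rank must join. The following minimal, CPU-only sketch using PyTorch's gloo backend (an illustration, not Meta's training code) shows the mechanism: dist.all_reduce returns only after every rank has contributed, so a crashed rank leaves all the others blocked until a timeout fires.

```python
# Minimal illustration of synchronous training's sensitivity to a single failure:
# every rank must reach the collective, or the rest block waiting for it.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    grad = torch.ones(4) * (rank + 1)
    # Blocks until ALL ranks have called all_reduce; if one rank has crashed,
    # the surviving ranks stall here until the collective times out.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank} sees summed gradient {grad.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```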

IT Home noted that during the 54-day pre-training run there were 466 job interruptions: 47 planned and 419 unexpected. Planned interruptions were due to automated maintenance, while unexpected ones were mainly caused by hardware problems. GPU issues were the leading cause, accounting for 58.7% of unexpected interruptions. Only three of those incidents required significant human intervention; the rest were handled by automation.


Of the 419 unexpected interruptions, 148 (30.1%) stemmed from various GPU failures (including NVLink failures), while 72 (17.2%) were caused by failures of the GPUs' HBM3 memory. Notably, only two CPU failures occurred in 54 days. Another 41.3% of unexpected interruptions were attributed to a mix of factors, including software errors, network cables, and network adapters.

To improve efficiency, the Meta team developed a series of tools and optimizations, including shortening job startup and checkpointing times, using PyTorch's NCCL flight recorder to diagnose hangs and performance issues, and identifying straggling GPUs. Meta also tracked environmental factors, such as the slight effect of midday temperature swings on GPU performance and the strain placed on the data-center power grid by a large number of GPUs running simultaneously.
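One of the generic mitigations mentioned here, faster and more frequent checkpointing, can be sketched as follows. This is a simplified, hypothetical example rather than Meta's tooling; the function names and the interval are illustrative. The idea is simply that saving model and optimizer state every N steps bounds how much work a restart throws away.

```python
# Hypothetical periodic-checkpointing sketch (not Meta's tooling).
import torch

CHECKPOINT_EVERY = 500  # illustrative interval; trades checkpoint I/O cost vs. lost work

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    # Persist everything needed to resume training from this step.
    torch.save(
        {
            "step": step,
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
        },
        path,
    )

def train(model, optimizer, data_loader, loss_fn):
    for step, (x, y) in enumerate(data_loader):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        if step % CHECKPOINT_EVERY == 0:
            # After a failure, the job restarts from the most recent checkpoint
            # instead of from scratch.
            save_checkpoint(model, optimizer, step)
```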

As AI models continue to grow in parameter count, the computing resources they require expand accordingly. Taking the 100,000-GPU H100 cluster planned by xAI as an example, if the per-GPU failure rate stays the same, failures would occur several times more often, bringing even greater challenges to future AI training.
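Under the simplifying assumption that failures scale in proportion to GPU count at the per-GPU rate observed in Meta's run (an assumption; real rates depend on hardware, workload, and environment, and not all failures are GPU-related), the extrapolation looks like this:

```python
# Rough extrapolation, assuming each GPU fails independently at the same
# per-GPU rate observed on Meta's 16,384-GPU cluster. This attributes all
# failures to GPU count, which overstates the effect somewhat.
observed_gpus = 16_384
observed_failures = 419          # unexpected failures over the run
observed_hours = 54 * 24

per_gpu_failures_per_hour = observed_failures / (observed_gpus * observed_hours)

target_gpus = 100_000            # cluster size discussed for xAI
expected_failures_per_hour = per_gpu_failures_per_hour * target_gpus
print(f"~{expected_failures_per_hour:.1f} failures per hour, "
      f"i.e. one roughly every {60 / expected_failures_per_hour:.0f} minutes")
# -> ~2.0 failures per hour, i.e. one roughly every 30 minutes
```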