
From bare metal to a 70 billion parameter model, here is a tutorial and ready-made scripts

2024-07-24




Selected from imbue.com

Author: Imbue Team

Compiled by Machine Heart

Editor: panda

We know that LLMs are trained on massive amounts of data using large-scale compute clusters. Synced has previously covered many methods and techniques for assisting and improving the LLM training process. Today, we are sharing an article that goes deep into the underlying technology, describing how to turn a pile of "bare metal" machines, without even an operating system, into a cluster for training LLMs.

This article comes from Imbue, an AI startup that is working to achieve general intelligence by understanding how machines think.

Of course, turning a pile of bare-metal machines without even an operating system into a cluster for training LLMs is not easy; the process is full of exploration and trial and error. But Imbue eventually succeeded in training an LLM with 70 billion parameters and accumulated a great deal of useful experience along the way.

This article will provide an in-depth introduction to the team's process of building their own LLM training infrastructure and share the many tools and scripts they wrote to facilitate monitoring, inspection, and troubleshooting.

If you are interested in building your own LLM training infrastructure, or are curious about how LLMs are made, this article is worth reading and bookmarking.

The following is the original article from the Imbue team.

Introduction

Our small team of researchers and engineers spent a few months training a 70 billion parameter model from scratch on our own infrastructure, and this model outperformed the zero-shot GPT-4o on reasoning-related tasks.

Today, we're sharing our process for setting up the required infrastructure: from putting together an initial cluster and installing an operating system to setting up automatic recovery in case of errors during training. We'll detail the challenges we encountered and how to solve them at each step. In addition to these lessons, we'll also be publishing many of the scripts we developed along the way so that other teams can more easily create stable infrastructure for their own model training.

Throughout the process, our engineering team worked with Voltage Park to prepare the computer cluster and build the foundation for production applications. This process included:

1. Configure each machine

2. Configure InfiniBand

3. Make sure the machine is fully healthy

4. Diagnose common training problems

5. Improve infrastructure tools

Each step is described in detail below.

Background: How it was made

Our goal in provisioning this compute was to ensure that we could experiment quickly with large language models. To do this, we needed a large number of fast GPUs that could communicate with each other quickly.

This article focuses on a cluster with 4,088 H100 GPUs spread across 511 machines, or 8 GPUs per machine. There are 511 machines with GPUs because some connections need to be reserved for the Unified Fabric Manager (UFM) node, which manages the InfiniBand network. On each of the 511 GPU hosts, every GPU is directly connected to a ConnectX-7 network card that can transfer data at 400 Gbps to any other GPU on the InfiniBand fabric.

Our InfiniBand topology is "fully non-blocking," meaning that, in theory, every GPU can communicate with every other GPU at maximum speed. To achieve this, we use a three-tier InfiniBand switch architecture; as long as the cabling is correct, the entire fabric can deliver this level of throughput. The figure below shows an overview of the InfiniBand network:



Note that training traffic travels over InfiniBand, not Ethernet. Although the machines are also connected to Ethernet, that network is used to transfer data such as datasets and checkpoints. Sending data over Ethernet would be much slower, because the data would first have to move from the GPU to the CPU and then go out through a 100 Gbps Ethernet card. It is possible to train over Ethernet using RDMA over Converged Ethernet (RoCE), but that requires a great deal of additional work on both the hardware and software side and is generally less reliable than InfiniBand. For more details, see this paper: https://arxiv.org/pdf/2402.15627

There is also a secondary Ethernet network used only for configuration and management, which provides access to the control interface for the BIOS (Basic Input Output System), power, and other low-level machine interfaces. Without this management network, we would have to manually set up each node via USB drive, keyboard, and monitor. With hundreds of machines, this is not a sustainable approach.

Achieving high-performance training on this cluster requires every component (InfiniBand, Ethernet, GPUs, and the nodes themselves) to work nearly perfectly. If any one of those 12,000+ connections is a little unstable, it will slow down the overall training run.

The rest of this article will describe how to make this all work perfectly and stably.

Process: How to turn a bare metal machine into a fully operational cluster

Configure each machine

After establishing an initial Ethernet connection to the cluster over the management network, we obtained access credentials to the baseboard management controller (BMC). The BMC is a dedicated service processor that remotely monitors the host system and is usually connected to a separate network. It allows us to operate the machine as if we were physically there, and also provides additional APIs for hardware health, BIOS settings, and power management.

With these components in place, we can roll up our sleeves and start setting up our cluster.

Step 0: Configure a machine first

We first installed Ubuntu 22.04 on a server using iDRAC (Dell's Baseboard Management Controller) before setting up everything else on top of that OS. One of the capabilities of iDRAC is that it allows ISO images to be mounted and booted from the local machine, providing a virtual console through a browser. Ideally, this would be the only manual installation step in the process.

Step 1: Install the operating system on each machine

After configuring the first machine, we proceeded to install Ubuntu's Metal-as-a-Service (MAAS) software to help provision the remaining servers. Using Preboot Execution Environment (PXE) boot and automated iDRAC tooling, we instructed each machine to boot from the network and configured MAAS to respond to the PXE boot requests. On the initial network boot, a server obtains an IP and an initial kernel from MAAS via DHCP (a dynamic IP allocation protocol) without installing anything on its local drives. This is the basic environment for automating repeatable operating system installations. In theory, we could just wait for the first boot and everything would be taken care of. In practice, however, the MAAS integration with the BMC was unreliable, so we used the iDRAC API to collect each machine's MAC address (a unique physical hardware identifier) beforehand.
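As an illustration of this kind of pre-collection step (not Imbue's actual tooling), the sketch below queries each BMC's Redfish API for NIC MAC addresses. The hostnames, credentials, and the Dell-specific "System.Embedded.1" resource ID are assumptions.

```python
# Minimal sketch: collect NIC MAC addresses from each BMC via the Redfish API.
# Hostnames, credentials, and the Dell-specific "System.Embedded.1" resource ID
# are illustrative assumptions, not values from the original article.
import requests

BMC_HOSTS = ["bmc-001.example.internal", "bmc-002.example.internal"]  # hypothetical
AUTH = ("root", "changeme")  # hypothetical credentials

def collect_macs(bmc_host: str) -> list[str]:
    base = f"https://{bmc_host}/redfish/v1/Systems/System.Embedded.1"
    resp = requests.get(f"{base}/EthernetInterfaces", auth=AUTH, verify=False, timeout=30)
    resp.raise_for_status()
    macs = []
    for member in resp.json().get("Members", []):
        # Each member points at one EthernetInterface resource with a MACAddress field.
        nic = requests.get(f"https://{bmc_host}{member['@odata.id']}",
                           auth=AUTH, verify=False, timeout=30).json()
        if nic.get("MACAddress"):
            macs.append(nic["MACAddress"])
    return macs

if __name__ == "__main__":
    for host in BMC_HOSTS:
        print(host, collect_macs(host))
```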

Throughout this effort, MAAS has generally been one of the more reliable components of the stack. However, we ran into some issues early on that were specific to our setup. For example, while configuring the first few machines, HTTPS certificate verification failures prevented us from installing anything through apt because the machines' clocks were so far apart. Relatedly, since the MAAS server has to be responsible for so many things (DHCP server, DNS server for resolving hostnames to IPs, HTTP proxy between the hosts and the official Ubuntu package server, NTP server, cloud-init configuration management, and the ground-truth database connecting MAC addresses to IPs to hostnames to custom metadata), it was hard for us to get to the root of these problems. On top of that, there is the learning curve of the MAAS configuration lifecycle, which is designed to handle the complexity of managing greenfield deployments, the gradual migration of nodes, and various debugging/unhealthy intermediate states.

Step 2: Diagnose the Broken Machine

We found that about 10% of the machines would not boot, mainly due to physical problems with the servers. This is a common situation when setting up a large GPU cluster. We encountered missing or incorrectly connected network cables, hardware problems in the iDRAC, bad power supply units, bad NVMe (non-volatile memory express) drives, missing internal wiring, and network cards or GPUs that did not show up. We automatically checked for these problems, returned some machines to Dell for retesting, and filed the corresponding work orders with data center staff. One advantage of configuring the cluster ourselves was that we could start using the healthy machines immediately while waiting for maintenance on the rest.

Step 3: Minimum Viable Observable Machine

We then set up the following on each server:

1. Docker (to make it easier to run services and training jobs)

2. The data center GPU driver

3. The Prometheus node exporter (to export a steady stream of hardware/operating system metrics)

4. The DCGM exporter (to export additional metrics from the NVIDIA GPUs, such as GPU state, clocks, and utilization)

5. RAIDZ ZFS pools on all non-OS drives, which let a machine keep working even if a drive fails, and which also provide transparent compression for free (especially useful for plain-text datasets and repetitive logs; it often increases usable space by up to 10x compared with not using it)

We then ran basic GPU diagnostics to determine whether the GPUs were generally functional; when they were not, a hardware problem would usually surface within a few hours.

During this stage, we hit bandwidth bottlenecks when trying to install packages on all 400 nodes at the same time. It was also the first time we received high-temperature alerts for various components deployed in the data center. This first batch of thermal issues was largely resolved through firmware updates.

Step 4: Single-node GPU training

The next step was to ensure that each machine could handle realistic GPU workloads on its own. Many machines could not, due to the following issues:

GPU-related errors, which could almost always be fixed by reseating the GPU in its slot: sliding the 200-pound server out of the rack, removing all the cabling between the lid and the GPU, taking the GPU out, reinstalling it, then reconnecting the cables and pushing the server back into the rack.

According to the Ubuntu server logs, many cables between the GPU and the PCIe bus or network card were giving the error "limited width: x4 < x16". After updating the PCIe switch bus firmware, we found that about a quarter of the hosts needed to have the internal PCIe cables reseated - presumably because the cables between the case and the GPU are quite fragile, meaning that they get jostled or unplugged whenever maintenance is performed on the GPU.

There were also some miscellaneous issues affecting a handful of hosts. Dell helped us resolve several of these with firmware updates:

The NVMe drive showed no faults, but locked up the entire machine when touched.

The hard drives showed up in a random order under Linux, which confused MAAS and caused the operating system to be installed on the wrong drive.

The temperature readings were wrong, which caused the fans to run at full speed all the time. The cause was probably a problem with the Nvidia drivers, which was fixed by downgrading the driver version.

The CPU's dynamic frequency scaling went haywire, limiting the active cores to 2 GHz.

Direct GPU-to-GPU communication (GDR, or GPUDirect RDMA Peer Memory Client) could not be successfully enabled.

Configuring InfiniBand

Step 0: Install UFM

One of the advantages of InfiniBand is its centralized design: the entire network has a single brain, so we only had to deal with one UFM instance for the 320 network switches in the fabric. Our first task was to figure out which switch connected to which machines, then correlate that with the wiring diagram and rename the switches according to their physical location.

Step 1: Rewiring

Initially, UFM was unable to detect the 320 switches, let alone the hosts that were supposed to be in the fabric. After consulting with our data center partner, we confirmed that the switches were powered on and cabled, yet they still could not be detected. After reviewing the network cabling list, we noticed that the top-level design of the fabric was incorrect: instead of one unified fabric, it was split into eight disjoint networks with no common routing paths. After rewiring, we added a check step to verify that all physical connections matched the new design.

Step 2: 10,000 temperature alerts

After resolving the physical wiring issue, InfiniBand successfully established connectivity to all InfiniBand switches in the fabric. However, nearly every switch port began reporting high temperatures, sometimes exceeding 70°C, even when they were not transmitting data. We discovered that the problem was caused by open space between switches in the same rack, which caused hot air to flow back to the front. Our data center partner helped us quickly diagnose the problem and develop an appropriate solution.

Step 3: 1800 alarms

Many ports also had high error rates, or fluctuated between normal and broken states, which is called "flapping." These issues only appeared when the ports were actually in use, so they were difficult to detect in advance, since our entire fabric consisted of 10,000 highly redundant links. Our data center partner helped clean and reseat the alarming ports, and we disabled the remaining alarming transceivers while waiting for replacements.

Although InfiniBand is resilient to hardware failures, once about 10% of the fabric starts to fail, features such as adaptive routing no longer work reliably and cannot account for occasional lost links.

During this time, we successfully ran multi-node training with 100 to 200 machines. Our process was relatively ad hoc: we would sometimes start a random set of nodes, observe their performance, and then try to keep as many of them running as possible. This approach allowed us to find a reliable subset of the InfiniBand network fabric, but it was difficult because each time we had to change the set of nodes used for training, thereby changing the default InfiniBand link.

Step 4: InfiniBand Burns Like Crazy

To diagnose InfiniBand issues more effectively, we designed a workload for the entire cluster that pushes as much data as possible through every port in the fabric simultaneously. This is different from running one large all-reduce workload across the whole cluster, which would use NCCL to optimize communication between nodes, with GPU-to-GPU traffic going over NVLink via the SXM (Server PCIe Module) slots.

Instead, we chose a brute-force approach, and it worked easily. UFM began raising alarms as most ports pushed more than 97% of their theoretical capacity, and some switches temporarily went down. Every port that made it to the end of the day was considered robust enough; the rest were disabled or removed pending repair.

Step 5: GPUDirect RDMA

To let GPUs communicate without incurring CPU overhead, we enabled a feature called GPUDirect RDMA, which allows the GPUs to talk directly to the InfiniBand network cards. This involved two key steps:

1. Start an additional kernel module

2. Make sure PCIe Access Control Services (ACS) are disabled, to prevent immediate hangs

Step 6: Expand the "Gold" Server

When building a GPU cluster with the latest hardware, a rule of thumb is to be prepared for about 3% of the machines to fail every week.

However, it is important to note that it is not that every machine has a uniform 3% chance of failing, but rather that a small number of unreliable machines repeatedly have various different problems until they are properly fixed. This highlights the advantage of having a large number of machines in the same network structure. Therefore, instead of randomly picking machines to run large-scale training and seeing which ones fail like whack-a-mole, we focus on expanding servers that are known to be reliable, that is, "golden" servers.

Step 7: Maintenance

InfiniBand maintenance primarily involves responding to UFM alarms, replacing faulty cables and transceivers, and occasionally diagnosing more difficult errors (such as switch failures). There are usually two factors that lead to large-scale maintenance:

1. Firmware updates, especially when only half of the cluster has completed the update, can lead to UFM state corruption and require a reboot of UFM on all InfiniBand switches.

2. GPU boxes are restarted at the same time in large numbers, which may flood the UFM state with a large number of updates and also require a restart of the UFM service.

Ensure the machine is fully healthy

During this process, we discovered many ways for a single machine to fail or slow down training. Many of these failure modes are not immediately apparent, so we wrote a number of health check scripts to check if the host is healthy enough. We published the code here: https://github.com/imbue-ai/cluster-health

Note that many of these health checks are specific to our runtime environment, and are not necessarily tied to the underlying hardware, nor are they necessarily easy to fix or automate. This is by design: to achieve our overall goal of getting our machines ready for training, we want a single entry point that can answer a straightforward "yes" or "no" and can summarize any number of subtle details.

GPU Health Check

We checked that the number of GPUs was correct, that ECC (Error Correction Code) checking was enabled, and that there were no ECC errors. We also checked that the NVLink topology (connecting the GPUs to each other) was up and error-free.
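As a rough illustration of what such a check can look like (a sketch using the pynvml bindings, not the published script; the expected GPU count and the pass/fail criteria are assumptions):

```python
# Rough sketch of a GPU health check using NVIDIA's pynvml bindings
# (pip install nvidia-ml-py). The expected GPU count and the decision to treat
# any uncorrected ECC error as a failure are illustrative assumptions.
import pynvml

EXPECTED_GPUS = 8  # assumption: 8 GPUs per host, as in the cluster described above

def gpu_health_ok() -> bool:
    pynvml.nvmlInit()
    try:
        count = pynvml.nvmlDeviceGetCount()
        if count != EXPECTED_GPUS:
            print(f"FAIL: expected {EXPECTED_GPUS} GPUs, found {count}")
            return False
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            current_ecc, _pending_ecc = pynvml.nvmlDeviceGetEccMode(handle)
            if current_ecc != pynvml.NVML_FEATURE_ENABLED:
                print(f"FAIL: ECC disabled on GPU {i}")
                return False
            errs = pynvml.nvmlDeviceGetTotalEccErrors(
                handle,
                pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                pynvml.NVML_VOLATILE_ECC,
            )
            if errs > 0:
                print(f"FAIL: {errs} uncorrected ECC errors on GPU {i}")
                return False
        return True
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    raise SystemExit(0 if gpu_health_ok() else 1)
```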

Disk space health check

We checked if the host's disk space utilization was above 95%.

Docker health check

We checked that Docker was able to run containers with GPUs attached (that is, the NVIDIA Container Runtime was working properly), and that the monitoring/analytics related Docker containers were activated and had the correct host permissions.
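A minimal sketch of such a check, assuming a standard CUDA base image is available locally or pullable (the image tag is illustrative):

```python
# Sketch of a Docker health check: verify that the NVIDIA Container Runtime can
# start a container with GPUs attached. The CUDA base image tag is an assumption.
import subprocess

def docker_gpu_ok(timeout_s: int = 120) -> bool:
    cmd = [
        "docker", "run", "--rm", "--gpus", "all",
        "nvidia/cuda:12.1.1-base-ubuntu22.04",  # assumed image; any CUDA base image works
        "nvidia-smi", "-L",
    ]
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        print("FAIL: docker run with --gpus all timed out")
        return False
    if out.returncode != 0 or "GPU 0" not in out.stdout:
        print(f"FAIL: could not enumerate GPUs inside a container:\n{out.stderr}")
        return False
    return True

if __name__ == "__main__":
    raise SystemExit(0 if docker_gpu_ok() else 1)
```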

Dmesg health check

We checked dmesg for hardware Xid or SXid errors (errors raised by the NVIDIA GPU or the NVIDIA switch between GPUs). We also read every dmesg log line to verify that it fell into the list of "common/expected log lines."
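A simplified sketch of this kind of check (not the published script): scan dmesg for Xid/SXid lines and flag anything not on an allowlist of expected messages. The allowlist pattern below is a placeholder.

```python
# Sketch of a dmesg health check: scan kernel logs for NVIDIA Xid/SXid events
# and report any line that is not on an allowlist of known-benign messages.
# The allowlist patterns here are placeholders, not the real list.
import re
import subprocess

XID_PATTERN = re.compile(r"NVRM: Xid|SXid", re.IGNORECASE)
ALLOWLIST = [
    re.compile(r"audit:"),  # placeholder for one "expected" log line pattern
]

def dmesg_health_ok() -> bool:
    # dmesg typically requires root; run this check with sufficient privileges.
    log = subprocess.run(["dmesg", "-T"], capture_output=True, text=True).stdout
    ok = True
    for line in log.splitlines():
        if XID_PATTERN.search(line):
            print(f"FAIL (Xid/SXid): {line}")
            ok = False
        elif not any(p.search(line) for p in ALLOWLIST):
            # In the real check every line must match a known-benign pattern;
            # here we only report unexpected lines without failing on them.
            print(f"UNEXPECTED: {line}")
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if dmesg_health_ok() else 1)
```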

iDRAC Health Check

We checked for iDRAC errors on the machine, ignoring non-fatal error messages. This is a Dell-specific check, so it is not included in our open source code.

Disk health check

We checked that the zpool was mounted, that Docker was properly connected to it, and that it could actually touch it without locking up the CPU.

InfiniBand Health Check

We checked whether the InfiniBand error rates were increasing and/or whether the NIC driver firmware was out of date.

Nvlink Health Check

We checked for NVLink errors on the machine. In practice, this does not seem to cause training to fail, but it may slow down training.

GDR Health Check

We checked whether GDR is enabled on the machine.

VBIOS Health Check

We checked whether the VBIOS version of the GPU and the H100 baseboard firmware were up to date.

Flint Health Check

We used flint and hca_self_test to check that the Mellanox OFED driver, NIC firmware, and transceiver firmware were at the correct versions and that they were compiled correctly against the NVIDIA drivers.

PSB Health Check

We queried the PCIe devices to check if the connection speed and width between the GPU, PSB (PCIe switch bus), and NIC were as we expected. We also checked if the switch firmware was up to date. This script was developed by Dell, not Imbue, so we cannot share it at this time.

In addition to these quick health checks, we also perform some more complex health checks, including:

Run matrix computations through PyTorch to measure NVLink bandwidth, GPU compute speed, and GPU memory. We set the appropriate GDR flags to exercise both InfiniBand and NVLink.

Send data through the IB cards using ib_write_bw with --use_cuda, and measure PCIe and InfiniBand card bandwidth. We ran this for an extended period (about 15 minutes) to make sure flapping InfiniBand links were caught.

Run a multi-node diagnostic run to check NCCL's ability to initialize and whether it randomly stalls. If so, our forked NCCL code adds additional logging. This can take 12-24 hours to detect problems, so we usually only run this on new nodes or when we suspect there are issues.

Check the DCGM exporter for any GPU clock throttling events (excluding the expected gpu_idle and power_cap reasons). The best way to surface these power events is to run multi-node training that exercises all the GPUs, InfiniBand cards, CPUs, and disks at the same time; a quick single-host spot check is sketched below.
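For example, one quick way to spot-check throttle reasons on a single host is to query nvidia-smi directly. This is a simplified stand-in for the DCGM-based check; the specific query fields chosen here are an assumption.

```python
# Quick spot-check for GPU clock throttling on one host, as mentioned above.
# This queries nvidia-smi rather than the DCGM exporter used in the article.
import csv
import io
import subprocess

QUERY = "index,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown"

def throttling_gpus() -> list[str]:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    bad = []
    for row in csv.reader(io.StringIO(out)):
        index, hw_slowdown, thermal_slowdown = [c.strip() for c in row]
        if hw_slowdown == "Active" or thermal_slowdown == "Active":
            bad.append(index)
    return bad

if __name__ == "__main__":
    gpus = throttling_gpus()
    print("Throttling GPUs:", gpus or "none")
```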

Diagnosing Common Training Problems

Once the hardware is working properly, you can start training.

This section will share some specific debugging steps and insights based on our experience running large language model training on our cluster.

Crash on startup

In some ways, this is the best bug you could ever have, as it is (theoretically) easy to reproduce and iterate on.

We started by checking that our code was running with the right version, configuration, and environment variables. While basic, we found this to be critical: ensuring that launching training is reproducible and easy to inspect. A big reason for this is that intermediate abstractions like Docker image caches or opaque secret configurations can cause confusion.

Another basic check we perform is to ensure that all machines are online and that the stack traces or logs emitted can be easily aggregated and inspected. We used the Loki, Prometheus, and Grafana software stack, but any suitable log aggregation or tracing SaaS tool will work. Since these training runs are inherently synchronous and distributed, the first error can often lead to a cascade of unrelated errors. Here, health checks can also help to immediately detect errors such as a bad hard drive or a missing or invalid GPU.

We built a system that automatically restarts when failures occur, which makes log and error aggregation even more important to avoid mixing up errors from different restarts. Some common errors we encounter include:

1. Errors like "Forward order differs across ranks: rank 0 is all-gathering 43 parameters while rank 1228 is all-gathering 1 parameters". We found that this is a strange feature of PyTorch's Fully Sharded Data Parallel (FSDP) implementation, which can be solved by restarting.

2. GPU Out of Memory (OOM) errors, which look like this: “CUDA out of memory. Tried to allocate …” We resolved these issues by double-checking our configuration and code and undoing recent code changes that resulted in excessive use of GPU #0 due to incorrect PyTorch device specifications during startup.

3. CPU/RAM out-of-memory (OOM) errors, which are not easy to spot in the error logs and are usually best detected via the dmesg log of the host outside the Docker container. When the OOM killer stops a forked process or a network peer, these mostly show up as CalledProcessError or ConnectionError. When an OOM killer invocation is detected in dmesg, we prefer to simply fail the health check and reboot the box. We also checked whether our code paths performed enough manual garbage collection (there is a section below on disabling automatic garbage collection), and that there were no unexpected attempts to compute on, or move tensors to, the CPU.

Crash during training

The first priority was to get the system running automatically, so that it would automatically rerun all health checks and then restart if no unhealthy hosts were found. We encountered some random hardware errors, including Xid and SXid errors, which could cause the run to crash without emitting a meaningful Python stack trace. Some problems, such as row remapping, were recoverable with a reboot. Other problems, such as uncorrectable ECC errors, often required hardware maintenance or part replacement.

Additionally, we observed that malformed training data can cause crashes. For example, a single very large document in the corpus can trigger out-of-memory errors on the GPU or CPU. To guard against this, we use a fully deterministic data loader, which makes every crash easily reproducible by tying it to an epoch or step number. We found that disabling data loading, or substituting fake data (e.g., all zeros), helps confirm whether the data is the root cause of an error.
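As an illustration of what deterministic data loading can look like in PyTorch (this is not Imbue's data loader), the sketch below seeds a DistributedSampler from a fixed base seed plus the epoch, so the exact batch order can be replayed after a crash. The dataset, seed, and batch size are placeholders.

```python
# Minimal sketch of deterministic data loading in PyTorch: the sampler is
# seeded from a fixed base seed, and set_epoch makes the shuffle order a pure
# function of the epoch, so a crash can be replayed at the same epoch/step.
import torch
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

BASE_SEED = 1234  # placeholder

def make_loader(dataset, epoch: int, rank: int, world_size: int) -> DataLoader:
    sampler = DistributedSampler(
        dataset, num_replicas=world_size, rank=rank,
        shuffle=True, seed=BASE_SEED,
    )
    sampler.set_epoch(epoch)  # same epoch -> same shuffle order on every replay
    return DataLoader(dataset, batch_size=8, sampler=sampler, num_workers=2)

if __name__ == "__main__":
    data = TensorDataset(torch.arange(1000).float())
    loader = make_loader(data, epoch=3, rank=0, world_size=1)
    print(next(iter(loader)))  # identical every time this script is run
```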

Finally, it is also helpful to record network and general node health statistics via metric aggregation methods. Issues such as a brief Ethernet outage or insufficient disk space may not show up as a useful error message, but can be easily correlated with the collected data.

Hangs with no stack trace (possibly followed by timeout issues)

Debugging these types of errors can be frustrating due to the lack of helpful information and the difficulty in reliably reproducing them.

One of the most memorable error types is accompanied by an error message like this:

Watchdog caught collective operation timeout: WorkNCCL (SeqNum=408951, OpType=_ALLGATHER_BASE, … , Timeout (ms)=600000) ran for 600351 milliseconds before timing out

And all GPU workers in the training run issued such an error message.

This means that one or more hosts failed to complete NCCL operations or the NCCL and InfiniBand connection crashed, causing all other hosts to get stuck on a tensor operation at the same time until the NCCL_TIMEOUT timeout was reached. Unfortunately, due to the nature of the NCCL software library, it is difficult to find out which host has the problem.

We have made some changes to the logging of the NCCL library, see our fork: https://github.com/boweiliu/nccl . This may better reveal the message or operation being executed when a crash occurs, and thus determine which host or GPU may be blocking the run.

Note that to identify misbehaving hosts, we often need to figure out which hosts are not generating certain log messages. The absence of such messages indicates that the worker on that host has fallen behind or has crashed.
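A trivial but handy illustration of this "who is silent?" check, with placeholder inputs standing in for the host list and for whatever your log aggregator returns:

```python
# Sketch of the "which host is NOT logging?" check described above: compare the
# set of hosts expected in a run against the hosts that emitted a given log
# marker. Both inputs below are placeholders.
def silent_hosts(expected_hosts: set[str], hosts_seen_in_logs: set[str]) -> set[str]:
    """Hosts that never produced the expected log line; prime suspects for a hang."""
    return expected_hosts - hosts_seen_in_logs

if __name__ == "__main__":
    expected = {f"node-{i:03d}" for i in range(48)}   # placeholder host list
    seen = expected - {"node-017"}                    # pretend one host went silent
    print("Suspect hosts:", sorted(silent_hosts(expected, seen)))
```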

Other cases of unresponsiveness without available error messages are usually related to hardware-related issues, such as the Xid/SXid/ECC errors mentioned earlier that can cause the NVIDIA driver or NVIDIA Docker communication driver to lock up. To distinguish NCCL hangs from driver hangs and race conditions or deadlocks in Python code, we use tools such as Py-Spy and the GNU Project Debugger (GDB) to debug stuck processes in real time. Using this approach revealed one specific issue: Due to a configuration error in the Python thread settings, we were unable to correctly launch eight multi-threaded NCCL GPU processes on some hosts, which encountered a race condition in the initialization code phase before PyTorch.

Training slowdown (measured by MFU)

Lack of tooling made this category of problems even more frustrating than the previous one. In addition to using Py-Spy, stack trace inspection, and GDB, we also used NVIDIA Nsight and profiling tools, some of which were difficult to use in a highly distributed setting.

Unfortunately, there are many possible causes of a general slowdown or of speeds below the model FLOPs utilization (MFU) demonstrated previously.
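For reference, MFU can be estimated with the common rule of thumb that transformer training takes roughly 6 FLOPs per parameter per token. The sketch below uses that approximation with purely illustrative numbers (the peak figure assumes roughly the H100's dense BF16 throughput); it is not a formula taken from the article.

```python
# Back-of-the-envelope MFU calculation, using the common 6*N FLOPs-per-token
# approximation for transformer training. All numbers below are illustrative.
def mfu(params: float, tokens_per_second: float, num_gpus: int,
        peak_flops_per_gpu: float) -> float:
    achieved_flops = 6.0 * params * tokens_per_second   # training FLOPs actually sustained
    peak_flops = num_gpus * peak_flops_per_gpu           # theoretical cluster peak
    return achieved_flops / peak_flops

if __name__ == "__main__":
    # e.g. a 70B-parameter model at ~4M tokens/s on 4088 H100s at ~989 TFLOPs (dense BF16)
    print(f"MFU = {mfu(70e9, 4.0e6, 4088, 989e12):.1%}")
```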

First, it has proven useful to double-check configuration, code, and environment variables. Mistakes we have experienced include: running the wrong model, wrong batch size, wrong UFM or NCCL settings, wrong CUDA_DEVICE_MAX_CONNECTIONS. These can all lead to suboptimal performance.

We also find it useful to measure instantaneous (i.e. per-batch) MFU (rather than smoothed or windowed averages), as the unsmoothed MFU curve often helps diagnose classes of problems. Problems that can slow down training include:

Training starts immediately at very low MFU (less than one-tenth of expected) and stays there

This is most likely a hardware problem with the InfiniBand network connection, such as a dead switch at the T2 or T3 layer. This can also be caused by a hardware problem between the GPU and the NIC, for which dmesg will report the following error: PCIe x16 lanes limited by …

Training starts immediately at 30% of the expected MFU and stays there

This could be caused by incorrect GDR settings on one host (NVIDIA Peer Memory) or incorrect GDR environment variables.

Training starts immediately at about 60-80% of the expected MFU and stays there

The most common cause is a poor or failed InfiniBand link, especially a single GPU with a failure related to the InfiniBand NIC, causing NCCL to attempt to route traffic over the local NVLink and use the NIC on another GPU on the same host. CPU throttling can also cause this issue, which requires adjusting BIOS settings on some hosts.

A sudden, large slowdown (a 10x drop) on certain batches, occurring regularly

This is almost always checkpointing or evaluation, which can be verified by checking against the epoch or step count. Annoyingly, if you set up automatic alerts on MFU anomalies, this produces a lot of false positives.

A sudden, large slowdown (a 10x drop) on certain batches, occurring only occasionally

This happens randomly and fairly rarely (about once every 15 minutes), and is followed immediately by a complete recovery to good MFU.

The most common cause seems to be other CPU-intensive workloads being scheduled onto a host that is running training. Rather than building profiling tools to identify specific hosts, we found it easier to coarsely monitor CPU usage by PID. The slowdown can also be caused by occasional network connectivity issues, such as data loader bottlenecks. We monitored metrics for data loading, checkpointing, and any non-NCCL code, and added timing logs to our Python code, which proved very reliable.

MFU gradually slows down during operation, but returns to 100% after each restart

In theory, this could be caused by heat buildup on the switches, but we never observed that. Instead, using the Python and NVIDIA profilers, we determined that the cause of the degradation appeared to be automatic garbage collection.



While debugging these slowdowns, we found that throughput would almost certainly periodically drop. As training progressed, this drop would increasingly affect distributed computations. This led us to hypothesize that the drop might be related to automatic garbage collection - a hypothesis we confirmed through profiling and testing. When we disabled automatic garbage collection and set garbage collection to occur only at certain intervals on all hosts, the throughput "drop" disappeared.

We used FSDP, a synchronous distributed training algorithm based on ZeRO-3. During blocking operations, a single worker process running garbage collection can slow down all other workers. If there are hundreds of worker processes, this can lead to significant slowdowns.
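A minimal sketch of this mitigation, assuming a simple training loop: disable automatic garbage collection on every rank and collect explicitly at a fixed step interval, so all ranks pause at the same point instead of stalling each other at random times. The interval below is a placeholder.

```python
# Sketch of the garbage-collection change described above: turn off automatic
# GC on every worker and collect explicitly at a fixed step interval.
import gc

GC_EVERY_N_STEPS = 1000  # placeholder interval

def training_loop(num_steps: int) -> None:
    gc.disable()  # stop automatic collections from firing mid-step on one rank
    try:
        for step in range(num_steps):
            # ... forward / backward / optimizer step would go here ...
            if step % GC_EVERY_N_STEPS == 0:
                gc.collect()  # every rank collects at the same step boundary
    finally:
        gc.enable()

if __name__ == "__main__":
    training_loop(num_steps=5000)
```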

Performance starts out good, then drops suddenly (to 70% of expected), with the drops recurring at high frequency (every 15 seconds)

We have observed this to be related to "clock throttling reasons" on NVIDIA GPUs, which can be resolved with appropriate settings for NVIDIA DCGM. Thermal issues (hot GPUs or failed/degraded host cooling fans) or power failures can cause this. Additionally, some of our hosts with specific power hardware have voltage issues when we maximize all 8 GPU utilization and 8x NIC InfiniBand utilization and CPU/RAM/disk at the same time, but only when they are all used (usually only during actual training runs).

Good performance but noisier than usual (high frequency white noise variance between 90% and 100% of expected MFU)

This is also related to InfiniBand hardware, but is usually due to some level of degradation or jitter in the links higher up in the network, rather than the less redundant host-to-T2 layer.

Unfortunately, many of these issues are difficult to pinpoint to a specific host, and issues related to InfiniBand are particularly difficult to pinpoint due to the topology-aware nature of InfiniBand switch technology. InfiniBand appears to favor adjacent hosts in an InfiniBand fat-tree design, and UFM can route packets at asymmetric link speeds.

Here is a simple summary/flowchart/sanity checklist for debugging throughput issues:

Did this system work properly before?

What changes have you made recently (such as merging code, updating drivers)?

Are the hosts you are running on healthy? Are all your dependent services running properly, including third-party SaaS such as Docker Hub, GitHub, etc.?

Can you be sure that you are running exactly the same code, environment, configuration, version, host list, rank order, and random seed as last time? (If such a check is even possible.)

Is the problem reproducible?

How does it relate to other things? Other processes? Daily crontab tasks? Host or DCGM or UFM metrics?

Are your tools measuring these metrics correctly?

Does the problem persist when running a reduced version of the code (using a smaller model, fake data, and not saving or loading checkpoints)?

Improving infrastructure tools

Once you’ve completed the above steps, you should be able to achieve good performance when training your model… at least until something breaks.

This section will introduce some tools and systems used to ensure that training continues to be stable, and preferably with as little human intervention as possible. Since our team is small, we don't have enough people to do manual maintenance, so we also want to automate the process as much as possible.

Almost all of the issues we encountered during training were attributable to machine or network component failures. Such failures are common in large clusters, so our approach is to automatically disable failed machines and network components and send repair requests.

Machine failure

We developed a system that automatically restarts from the most recent checkpoint when a run crashes. The restart process first runs the health checks on every available machine, classifies each machine according to the checks it passes, and then attempts to restart training on the healthiest set of machines.
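As a rough illustration (not Imbue's actual code), the loop below sketches this restart logic. The `health-check` and `launch-train` commands are hypothetical stand-ins for their tooling.

```python
# Simplified sketch of the restart flow described above: health-check every
# candidate host, keep the ones that pass, and relaunch training from the
# latest checkpoint on that subset. The shell commands are hypothetical.
import subprocess
import time

def host_is_healthy(host: str) -> bool:
    # Hypothetical: run the health-check suite on the host over SSH.
    result = subprocess.run(["ssh", host, "health-check", "--all"],
                            capture_output=True, timeout=600)
    return result.returncode == 0

def launch_training(hosts: list[str], checkpoint: str) -> int:
    # Hypothetical launcher; returns the training job's exit code.
    cmd = ["launch-train", "--hosts", ",".join(hosts), "--resume", checkpoint]
    return subprocess.run(cmd).returncode

def run_forever(candidate_hosts: list[str], latest_checkpoint) -> None:
    while True:
        healthy = [h for h in candidate_hosts if host_is_healthy(h)]
        code = launch_training(healthy, latest_checkpoint())
        if code == 0:
            break  # training finished
        time.sleep(60)  # crashed: wait, re-check health, resume from latest checkpoint
```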

Network component failure

All network failures we observed were detected by UFM and recorded in the UFM event log, so the response was simple: parse the UFM log and take appropriate action.

The UFM event system is very complex and contains dozens of event types. In practice, however, we found that only a few of them are problematic, mostly related to link failures or high symbol error counts. After identifying these events, we could write scripts to parse the UFM event log, disable the links and ports involved in recent events, file maintenance tickets for those network components, and re-enable the components once maintenance is complete.

Local mirror file system

For large distributed training runs like these, it has long been known that the speed of data exchange between the cluster and the outside world over Ethernet is a bottleneck. The bandwidth of a shared Ethernet connection is about 10 Gbit/s, which quickly saturates if hundreds of workers download datasets and model checkpoints at the same time.

To address this, we decided to build a local file system inside the cluster to mirror cloud storage, essentially a cache that reduces the number of files fetched from S3. To handle cluster churn (machines being disabled or swapped out for maintenance), we keep three replicas of each file and use consistent hashing to distribute load evenly and minimize file movement during churn. Because the cluster has limited disk space, we also had to develop tools to track the life cycle of files and purge ones that are no longer needed.
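To illustrate the placement idea (a generic consistent-hashing sketch, not Imbue's implementation): the class below assigns three replica hosts per file on a hash ring with virtual nodes, so removing one host only moves the replicas that lived on it.

```python
# Sketch of replica placement with consistent hashing: three replicas per file,
# virtual nodes to smooth the load. Illustrative only.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, hosts: list[str], vnodes: int = 64):
        self._ring: list[tuple[int, str]] = []
        for host in hosts:
            for v in range(vnodes):  # virtual nodes smooth out the load per host
                self._ring.append((self._hash(f"{host}#{v}"), host))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def replicas(self, file_key: str, n: int = 3) -> list[str]:
        """Return the first n distinct hosts clockwise from the file's ring position."""
        idx = bisect.bisect(self._keys, self._hash(file_key))
        chosen: list[str] = []
        for step in range(len(self._ring)):
            host = self._ring[(idx + step) % len(self._ring)][1]
            if host not in chosen:
                chosen.append(host)
            if len(chosen) == n:
                break
        return chosen

if __name__ == "__main__":
    ring = ConsistentHashRing([f"node-{i:03d}" for i in range(20)])
    print(ring.replicas("s3://bucket/checkpoints/step_1000.pt"))
```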

Local distributed Docker registry

We used Kraken, an excellent open source software for peer-to-peer transfer of Docker images. We had almost no problems with the software, which was quite surprising given the complexity of our task and implementation. Tool address: https://github.com/uber/kraken

Various performance monitoring tools

We set up the default Torch profiler as well as NVIDIA's Nsight Systems. The latter helps us understand the exact time required for the forward/backward passes and NCCL communication, and further helps us determine if a given model size and number of workers will become a bottleneck. However, Nsight Systems is a bit difficult to use because it requires running Docker in privileged mode, which requires disabling security checks related to performance monitoring events, and saving its configuration often requires stopping the entire training process.

We also wrote tools to detect slow training batches and understand their likely causes. We found this very useful. The most useful tool monitored the time taken by each batch and dumped the worker's stack trace whenever a batch was too slow, which made it easier to find hosts with minor hardware or software problems.
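A minimal sketch of such a monitor, assuming py-spy is installed on the host: time each batch and dump the Python stack of the training process when a batch exceeds a threshold. The threshold and the way the PID is obtained are placeholders.

```python
# Sketch of the slow-batch monitor described above: time each batch and, when
# one exceeds a threshold, dump the Python stack of the training process with
# py-spy so the offending host/worker can be inspected.
import os
import subprocess
import time

SLOW_BATCH_SECONDS = 30.0  # placeholder threshold

def dump_stack(pid: int) -> str:
    # py-spy must be installed and typically needs elevated privileges.
    return subprocess.run(["py-spy", "dump", "--pid", str(pid)],
                          capture_output=True, text=True).stdout

def timed_batches(batch_iter, pid: int = os.getpid()):
    # Yields each batch and measures how long the consumer spends on it
    # before requesting the next one (i.e. the per-step training time).
    for batch in batch_iter:
        start = time.monotonic()
        yield batch
        elapsed = time.monotonic() - start
        if elapsed > SLOW_BATCH_SECONDS:
            print(f"Slow batch ({elapsed:.1f}s) on {os.uname().nodename}")
            print(dump_stack(pid))
```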

Divide machines into different groups to locate faulty hosts

In the first few months of using the cluster (when health checks were not as thorough as they are now), we often encountered a situation where a failure occurred while training on a group of machines, but it was unclear which machine had the problem. To find the failed host, we developed some tools that made it easy to split a group of machines into different groups and run smaller training on each group of machines.

For example, if a training run on 48 machines fails, we run smaller runs on six groups of eight machines, and then on eight groups of six machines. Typically, only one group in each of these two phases fails, which lets us conclude with confidence that a machine appearing in the failing group in both phases is the faulty one.
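A sketch of this splitting procedure (illustrative only; `run_small_training` is a hypothetical stand-in for launching a short multi-node job and reporting whether it succeeded):

```python
# Sketch of the group-splitting approach described above: partition the hosts
# two different ways, run a small training job on each group, and intersect
# the failing groups to narrow down the faulty machine.
from typing import Callable

def split(hosts: list[str], group_size: int) -> list[list[str]]:
    return [hosts[i:i + group_size] for i in range(0, len(hosts), group_size)]

def suspect_hosts(hosts: list[str],
                  run_small_training: Callable[[list[str]], bool]) -> set[str]:
    suspects = set(hosts)
    for group_size in (8, 6):  # two different partitions, as in the example above
        for group in split(hosts, group_size):
            if not run_small_training(group):   # this group failed
                suspects &= set(group)          # keep only hosts in every failing group
    return suspects

if __name__ == "__main__":
    hosts = [f"node-{i:03d}" for i in range(48)]
    bad = "node-013"
    # Pretend a run fails whenever the bad host is in the group.
    print(sorted(suspect_hosts(hosts, lambda group: bad not in group)))
```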

Reflections and Lessons Learned

In the process of setting up and maintaining the infrastructure, we learned some useful lessons:

One useful practice is to be able to swap which machines are used. It helps to have 10-20% more machines on hand than a run requires, so that training can easily be restarted when a machine fails. Setting up the cluster network so that every machine is closely connected to every other machine lets us use essentially any working subset of the machines.

For every hardware or software failure you encounter, it is worth writing a test and automated solution, because every problem encountered in training will occur again. Similarly, for every unclear error message, it is worth writing a tool that can better explain the error.

Reproducibility is key to good science, and one of the principles we adopted right away was to make changes one at a time, even in the simplest places.

Trust, but verify. Whenever we bring in external tools or onboard new people (whether from inside or outside the company), we carefully check what they claim, especially when subsequent steps depend on those claims.

Summary

Training large language models requires complex infrastructure to begin with. We chose to get deeply involved in the details of setting up that infrastructure because we believe it is important to fully understand the systems we operate, and because we believe it is more efficient to do so.

Now, having gone through the entire process, we are glad we took this approach - having full control over our infrastructure and the ability to easily debug at every level of abstraction has proven to be of paramount value. While this process required a lot of oversight and iteration, it allowed us to deeply understand the underlying workflow, build a range of tools to ensure host health, learn how to automate the system to ensure training continues to run smoothly, and ultimately build an infrastructure that allows us to quickly iterate on training cutting-edge language models.

This infrastructure building process embodies our fundamental methodology for researching and building AI agents: exploring the details, continuously improving existing processes, and building useful tools and systems to enable our motivated team to tackle greater challenges.