
llm training traffic cut by up to 10,000x! new distributed optimizer harnesses the world's computing power to train powerful ai

2024-09-10




  new intelligence report

editor: alan
【new wisdom introduction】recently, nous research announced a major breakthrough: using a distributed optimizer that is independent of architecture and network, researchers successfully reduced gpu-to-gpu communication during llm training by 1,000 to 10,000 times!

what if you could use all the computing power in the world to train ai models?

recently, nous research, which attracted widespread attention by releasing the open-source hermes 3 (based on llama 3.1), announced another major breakthrough: distro (distributed training over-the-internet).

by using an architecture- and network-agnostic distributed optimizer, the researchers were able to reduce the amount of inter-gpu communication when training llms by 1,000 to 10,000 times!

with such dramatic improvements, bandwidth, an important cost and bottleneck for training large models, is no longer a problem.

using distro's method, you can distribute the training load to the internet, and the entire network world becomes a huge heterogeneous ai server cluster.

any device with sufficient computing power can participate in the training process.

experiments show that the method causes essentially no drop in model performance, while distro-adamw matches standard adamw+all-reduce in convergence speed.

distributed internet training

generally speaking, training large-scale neural networks involves a lot of communication overhead.

for example, in data parallelism, different shards of the training data are run forward and backward on different hardware (gpus, etc.). the gradients computed for the same batch of data then need to be synchronized across the gpus before the next optimizer step.
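
as a rough illustration of this synchronization step, here is a minimal sketch using pytorch's distributed api (assuming an already-initialized process group; this is generic data-parallel code, not the paper's implementation):

# minimal sketch of data-parallel gradient synchronization (generic ddp-style
# code, not the paper's implementation); assumes dist.init_process_group()
# has already been called on every participating rank.
import torch
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["input"]), batch["target"])
    loss.backward()  # local gradients for this rank's shard of the batch

    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            # this per-step all-reduce is exactly the communication that
            # distro aims to shrink by several orders of magnitude
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size  # average across ranks

    optimizer.step()
    return loss.item()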

with model parallelism, the intermediate activations need to be concatenated or accumulated via all-reduce.

if these data communication overheads cannot be overlapped, they will become a bottleneck for model training.

and it just so happens that nvidia's gpu memory and bandwidth are expensive, as is the hardware needed to link multiple gpus together.

to address this issue, the researchers developed distro, which reduces inter-gpu communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow networks.

distro is general, scalable, and clock-synchronous (similar to sgd, adam, etc., each training step uses the same arithmetic operations and takes the same time).

in addition, compared with previous ad-hoc low-communication optimizers, distro is insensitive to both the telecom network topology and the neural network architecture, and can natively support distributed data parallel training (ddp) with minimal overhead.
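
distro's actual update rule has not been published yet, so it cannot be shown here. for reference, the kind of earlier ad-hoc low-communication trick it is being contrasted with, top-k gradient sparsification, can be sketched as follows (an illustration of that older idea only, not of distro):

# illustration of an earlier ad-hoc low-communication idea (top-k gradient
# sparsification); this is NOT distro's algorithm, which is not yet public.
import torch

def topk_compress(grad: torch.Tensor, k_ratio: float = 0.01):
    # keep only the largest-magnitude k% of gradient entries for transmission
    flat = grad.flatten()
    k = max(1, int(flat.numel() * k_ratio))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]  # roughly 100x less data than the dense gradient

def topk_decompress(indices, values, shape, device="cpu"):
    # rebuild a dense (mostly zero) gradient from the transmitted sparse update
    flat = torch.zeros(int(torch.Size(shape).numel()), device=device)
    flat[indices.to(device)] = values.to(device)
    return flat.view(shape)

unlike such schemes, distro is claimed to work regardless of network topology and model architecture.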

llm pre-training

the researchers used nanotron as the pre-training framework and ran it only under the ddp strategy (each gpu loaded the entire model into vram).

the llm is a 1.2b-parameter llama 2. the hyperparameters used for the model and training are as follows:

the training data comes from the dolma v1.7 dataset, using a randomly selected, representative 10% sample (the first 105b tokens).

the optimizer is adamw with β1=0.9, β2=0.95, a peak learning rate of 4×10⁻⁴ with a cosine decay schedule, and weight decay set to 0.1.
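
a minimal sketch of this baseline optimizer setup, assuming pytorch (the warmup length below is an assumption not given in the article; the 25,000-step count is taken from the experiment described further down):

# baseline adamw + cosine-decay setup as described above (a sketch; the
# warmup length default here is an assumption).
import math
import torch

def build_optimizer_and_schedule(model, total_steps=25_000, warmup_steps=1_000):
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=4e-4,            # peak learning rate
        betas=(0.9, 0.95),  # β1, β2
        weight_decay=0.1,
    )

    def cosine_with_warmup(step):
        # linear warmup, then cosine decay of the learning-rate multiplier
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
    return optimizer, scheduler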

as a comparison, in another set of experiments, adamw was replaced with distro-adamw, but the hyperparameters were not changed and the all-reduce operation in nanotron was disabled.

unlike previous distributed training approaches, distro does not synchronize optimizer state (and can even be stateless).

the following figure shows the training loss curves of the two sets of experiments, using 105b data for 25,000 steps. it can be seen that the convergence ability of distro is on par with all-reduce.

importantly, without affecting the training effect, distro directly reduced the communication volume from 74.4gb to 86.8mb! this is equivalent to reducing the bandwidth pressure by 857 times.

the author also stated that this 857-fold reduction is just an initial result; pushing it to 1,000 to 3,000 times by tuning the hyperparameters later should not be a problem.

with post-training and fine-tuning, the communication reduction can even reach 10,000 times without significantly affecting the training results.

finally, to verify the training effect, the author performed the gpt4all zero-shot benchmark on the trained model and compared it with tinyllama (checkpoint) trained on the same number of tokens.

the results are shown in the table above. the architecture and training process of tinyllama are very similar to the experiments in this paper, which can be used as a sanity check for the results.

future applications

data flow

in the scenario of this experiment, 32 nodes use the simplest all-reduce (full connection), and each node transmits an average of 86.8mb (2.8mb×31) and receives the same amount of data.

if a dedicated server is used for data aggregation, each node only needs to upload 2.8mb of data (the received data remains unchanged), and the communication volume is further reduced.

additionally, asymmetry has merit, as most consumer internet bandwidth is heavily skewed toward higher download speeds.

assuming a stable network speed of 100mbps download and 10mbps upload, the worst-case delay is only 6.94 seconds for download and 2.24 seconds for upload. with overlap, the delay for each step is 6.94 seconds.

ps: the figures above assume the raw, uncompressed vectors are transmitted; with compression it could be even faster.
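
as a quick sanity check, the numbers quoted in this section follow directly from the stated setup (plain python; the payload and bandwidth figures are taken from the text above):

# back-of-the-envelope check of the communication figures quoted above
per_node_payload_mb = 2.8            # data sent to each peer per step
nodes = 32

# naive fully connected all-reduce: exchange with the other 31 nodes
total_per_node_mb = per_node_payload_mb * (nodes - 1)
print(total_per_node_mb)             # 86.8 mb, as stated

# reduction versus the 74.4 gb dense all-reduce baseline
print(74.4 * 1000 / total_per_node_mb)   # ≈ 857x

# worst-case per-step transfer times on 100 mbps down / 10 mbps up
print(total_per_node_mb * 8 / 100)   # ≈ 6.94 s download
print(per_node_payload_mb * 8 / 10)  # 2.24 s upload (overlapped with download)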

bandwidth

the authors say that current experiments and research are still limited, and it is impossible to determine whether the rate of bandwidth reduction will increase, decrease, or remain the same as the model becomes larger.

however, the current 1.2b seems to be the minimum size for distro to work well (it will not converge if it is smaller), so it can be assumed that as the model size grows, the required communication will become relatively less.

however, it is also possible that the communication volume is independent of model size. in that case, the model could be scaled up without increasing communication bandwidth, to see whether a larger model improves training results.

if the latter is true, then the paradigm of future gpu design and manufacturing will be changed (larger vram and narrower bandwidth).

this also happens to favor compute-bound workloads over i/o-bound ones; after all, bandwidth is much more expensive than compute these days.

federated learning

in addition to training llm, what else can distro be used for?

when doing distributed training on the internet, people immediately think of federated learning.

keeping each participant’s data private and decentralized while still enabling collaborative model training is becoming increasingly important as llms come under the control of large companies.

until now, federated learning has lacked efficient methods for training large models over limited internet bandwidth.

distro does not have any requirements on how to process data or distribute data to individual gpu nodes, and can be stateless (similar to federated averaging), making it suitable for the future of federated learning.
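
for context, the stateless aggregation style that federated averaging relies on can be sketched in a few lines (a generic fedavg sketch; distro's own aggregation rule is not described in the article):

# generic federated averaging (fedavg) sketch, shown for context; this is
# not distro's aggregation rule, which has not been published.
import torch

def federated_average(client_state_dicts, client_weights=None):
    # average per-parameter tensors from several clients into one global model
    n = len(client_state_dicts)
    if client_weights is None:
        client_weights = [1.0 / n] * n

    averaged = {}
    for key in client_state_dicts[0]:
        averaged[key] = sum(
            w * sd[key].float() for w, sd in zip(client_weights, client_state_dicts)
        )
    return averaged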

virtual heterogeneous gpu cluster

additionally, distro can create a fully decentralized and permissionless network to collaborate and share resources.

experiments show that distro is remarkably resilient to a small number of degraded or dropped nodes during training, and can easily adapt to the addition of new nodes.

with this capability, on the one hand, the overall system becomes more secure, reducing the risk that untrusted nodes disrupt training through adversarial attacks.

on the other hand, it can also encourage institutions and individuals to flexibly contribute their computing resources and release potential computing power.

even older cards with limited memory or compute can join in and earn a little extra, working with distro through strategies such as fsdp and swarm parallelism.

energy

further large-scale application of distro may alleviate the problems associated with building large data centers, such as energy consumption, infrastructure costs, and land use.

the llama 3.1 project required building two large monolithic superclusters, each containing 24,000 h100 gpus, and the training process alone produced the equivalent of 11,000 tons of co2 emissions.

in today’s llms, both model parameter counts and training data volumes keep growing, pushing ai-related data centers toward the limits of modern power grids.

distro could adaptively balance training across multiple small, modular data centers running on excess capacity, making use of existing infrastructure and dynamic load balancing to mitigate the environmental impact of training.

at present, the theory behind distro needs further exploration, and more rigorous and detailed academic papers and complete codes will be released in the future.