
DeepMind's research costs revealed: $12.9 million spent on one ICML paper

2024-08-03



New Intelligence Report

Editor: Qiao Yang

【New Intelligence Introduction】A recent DeepMind paper accepted to ICML 2024 lays bare the "arrogance" that comes with Google's backing. A blog post estimated the compute and cost required for the research: roughly 15% of Llama 3's pre-training compute, at a price of up to US$12.9 million.

How much experimental budget is needed to publish a paper in a top conference?

Recently, DeepMind published a study that empirically investigates a wide range of algorithmic and architectural details involved in scaling up LLMs, such as the choice of parameterization and optimizer.

This paper has been accepted by ICML 2024.


Paper address: https://arxiv.org/abs/2407.05872

The 63-page paper covers tens of thousands of models, sweeping over 3 optimizers, 4 parameterization schemes, several alignment assumptions, more than a dozen learning rates, and 14 model sizes of up to 26.8B parameters.


The four parameterization schemes covered by the experiments

From these numbers alone, it is clear that the research required an enormous number of training runs.

One dedicated reader, wanting to test his understanding of the paper, tallied up all the experiments it describes and estimated the cost of reproducing them.


Adding up all the required compute, the total cost comes to a staggering 12.9 million US dollars.

It's time to test your basic skills. If you are the leader of a research team, estimating the required computing power and cost based on the experimental plan is an essential skill.

Let us follow this blog post to find out where the more than 10 million US dollars went.

Transformer architecture information

Appendix C of the paper gives the detailed model and training settings, such as a decoder-only architecture, layer normalization, the GeLU activation function, no dropout, the T5 tokenizer, a batch size of 256, and FSDP parallelism.
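For reference, here is a minimal sketch of those settings collected into a Python dictionary; the key names are my own choice, and only the values come from the paper's Appendix C (plus the dimensions quoted below):

# Settings from Appendix C gathered into one place. Key names are
# illustrative, not the paper's.
model_config = {
    "architecture": "decoder-only",
    "norm": "layernorm",
    "activation": "gelu",
    "dropout": 0.0,
    "tokenizer": "T5",
    "vocab_size": 32101,
    "batch_size": 256,
    "seq_len": 512,
    "depth": 8,
    "head_dim": 128,
    "parallelism": "FSDP",
}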


Parameter size statistics of experimental models

With this architecture information, we can roughly estimate the training FLOPs required per token, denoted as M.

Since the paper does not describe any GQA/MQA mechanism, we assume R_kv = 1. In addition, l_seq = 512, D_head = 128, L = 8 (depth), and V = 32101 (tokenizer vocabulary size).

The total number of model parameters can be expressed as:

P = 2d(6Ld + V) = 12Ld² + 2dV

Therefore, the per-token training FLOPs M can be written as:

M = 6d(L(12d + l_seq) + V) = 72Ld² + 6Ld·l_seq + 6dV

By default, the number of tokens per experiment (TPE) is 50k (training steps) × 256 (batch size) × 512 (l_seq), which is approximately 6.5536e9.

def M(d: int, L=8, l_seq=512, V=32101) -> int:
    return 6*d * (L*(12*d + l_seq) + V)

TPE = 50000 * 256 * 512
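As a quick numeric check (mine, not the blog post's), these definitions give about 6.55e9 tokens per experiment and about 8.26e8 FLOPs per token at width d = 1024:

print(f'{TPE:.4E}')      # 6.5536E+09 tokens per experiment
print(f'{M(1024):.3E}')  # 8.264E+08 FLOPs per token at d=1024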

Alignment experiment

Assume that the alignment experiment directly reuses the best result from the subsequent learning rate sweeps and does not run a separate LR sweep of its own. The cost calculation for this step is then simple:


def alignment() -> int:
    return 4 * TPE * sum(M(d) for d in [1024, 2048, 4096])

# >>> f'{alignment():.3E}'
# '3.733E+20'
# >>> cost_of_run(alignment())[0]
# 888.81395400704

At an H100 rental price of $3 per hour, the alignment experiment costs approximately $888.
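The snippets in this article call a cost_of_run() helper that is never shown. The following is a minimal sketch consistent with the quoted figures, under my own assumptions of $3 per H100-hour and a sustained throughput of roughly 3.5e14 FLOPS per GPU; it returns (dollars, H100-node-hours), with 8 GPUs per node:

# Hedged sketch of the cost helper used in the snippets (not from the
# original blog post). Assumes $3 per H100-hour and ~3.5e14 sustained
# FLOPS per GPU; returns (dollars, H100-node-hours), 8 GPUs per node.
H100_FLOPS = 3.5e14        # assumed sustained throughput per GPU
PRICE_PER_GPU_HOUR = 3.0   # assumed rental price in USD
GPUS_PER_NODE = 8

def cost_of_run(flops: float) -> tuple[float, float]:
    gpu_hours = flops / (H100_FLOPS * 3600)
    return gpu_hours * PRICE_PER_GPU_HOUR, gpu_hours / GPUS_PER_NODE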

Learning Rate

Subproblem: Best eval loss experiment

Table E1 of the paper records, at 6 model scales, every combination of optimizer × parameterization scheme × experimental setting, with a base learning-rate sweep for each combination to find the best evaluation loss.

The following experimental variables are included:

- Model dimension D ∈ {3072, 4096, 6144, 8192, 12288, 16384}

- 4 parameterization schemes

- 3 optimizers, of which SGD has only 5 experimental settings, while Adam and Adam+Param Scaling each have 7

Assuming all of these experiments are run independently from scratch, with no results reused from elsewhere, an upper-bound estimate of the cost of running everything once is:


H = [1, 2, 4, 6, 8, 12, 16, 20, 24, 32, 48, 64, 96, 128]
D = [h * 128 for h in H]

def table_e1() -> int:
    sets_x_optims = 5 + 7 + 7
    return 4 * sets_x_optims * TPE * sum(M(d) for d in D[-6:])

# >>> f'{table_e1():.3E}'; cost_of_run(table_e1())
# '1.634E+23'
# (388955.9991064986, 16206.499962770775)

This part costs close to $400,000, still within an acceptable range but already prohibitive for most academic budgets.

Table E1 reports the best evaluation losses but does not describe the LR sweep strategy, and the number of points varies from plot to plot.


Since the paper's authors did not respond to inquiries and the exact procedure could not be determined, we assume each best evaluation loss was obtained from 15 LR trials (visually, each line has roughly 10 to 15 points).
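In the code, this assumption appears as the constant PpL ("points per line"). It is only defined in the epsilon section further down, so if you run the snippets in order you may want to set it up front:

PpL = 15  # assumed number of LR points per sweep (the blog calls this an "unprincipled estimate")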

Beta parameter

According to Section 4.2 of the paper, the learning rate also involves the selection of two hyperparameters: β and γ.

If only the β parameter is varied, this is called the "LR+default" setting:


This part covers 3 optimizers × 4 parameterizations × 2 settings (GlobalLR and Perlayer-fullalign), times an unknown number of LR sweep points:


def beta_only() -> int:
    return 3*4*2*PpL * TPE * sum(M(d) for d in D)

# 7.988E+23
# (1902022.3291813303, 79250.93038255542)

As the formula shows, the cost is similar to that of the epsilon experiment below: roughly $1.9 million each.
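To make the multiplier concrete, here is a quick run-count breakdown under the same assumptions (my own arithmetic, not the blog post's):

# 3 optimizers x 4 parameterizations x 2 settings x 15 LR points = 360 runs
# per model width, across all 14 widths in D.
runs_per_width = 3 * 4 * 2 * PpL      # 360
total_runs = runs_per_width * len(D)  # 5040 runs of ~6.55e9 tokens each
print(total_runs)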

Gamma Parameters

Compared with the β-parameter experiment, two details change in this part.

First of all, in addition to the GlobalLR and Perlayer-fullalign settings, the Perlayer-noalign setting also needs to be added.


Second, a 3D hyperparameter sweep over (γ_1, γ_h, γ_{L+1}) is performed only for the base width d = 1024, adding roughly 800 extra runs.


The calculation formula after combining the two is:


The estimated cost of this part is close to that of Adam's epsilon heat-map experiment: about $3.2 million.

def gamma_expts() -> int:
    # 36 = 3 optimizers x 4 parameterizations x 3 settings
    # (GlobalLR, Perlayer-fullalign, Perlayer-noalign)
    return 36 * TPE * (800*M(1024) + PpL*sum(M(d) for d in D))

# gamma_expts 1.354E+24
# (3224397.534237257, 134349.8972598857)

Epsilon parameter of Adam optimizer

The Epsilon parameter experiment described in Section 4.3 of the paper is the bulk of the computational effort.


Following the assumption above, 15 different learning rates (points per line) are tried for each best evaluation loss. The compute consumed by the epsilon-variation plot shown in Figure 6 is:


The math yields a succinct but expensive bill: roughly $1.9 million.

PpL = 15  # unprincipled estimate

def eps_variants() -> int:
    return 4 * 6 * PpL * TPE * sum(M(d) for d in D)

'''
>>> f'{eps_variants():.3E}'; cost_of_run(eps_variants())
'7.988E+23'
(1902022.3291813303, 79250.93038255542)
'''

In addition to the line graph on the left side of Figure 6, there are also the results of the heat map in Appendix F.


Assuming each heat-map cell is the result of 13 learning-rate sweeps, the compute is:


It turns out that just producing these eight heat maps cost about $3.2 million, and since we modeled the number of LR sweeps as a constant 13, this is likely an underestimate of the actual cost.

def eps_heatmaps() -> int:
    # eps-type * eps-val * parameterizations * LR range * ...
    return 2 * 6 * 4 * 13 * TPE * sum(M(d) for d in D[-6:])

'''
>>> f'{eps_heatmaps():.3E}'; cost_of_run(eps_heatmaps())
'1.341E+24'
(3193533.466348094, 133063.89443117057)
'''

Weight decay

The weight decay experiment (Appendix G) is straightforward: a basic LR sweep over the 4 parameterization schemes and all model sizes:


It is much cheaper than the epsilon experiments, at about $317,000, roughly the annual salary of a Bay Area engineer.

def weight_decay() -> int:
    return 4 * PpL * TPE * sum(M(d) for d in D)

'''
>>> f'{weight_decay():.3E}'; cost_of_run(weight_decay())
'1.331E+23'
(317003.7215302217, 13208.488397092571)
'''

Adafactor Optimizer

This part of the experiment is described in detail in Appendix C3, and is intended to test whether Adafactor and Adam+parameter scaling have similar width scaling mechanisms.


There are 2×4 plots in total, and each optimizer is evaluated at the 11 smallest model sizes, so the calculation is:


Add another $188,000 to the bill.

def adafactor() -> int:
    return 2*2*4*PpL*TPE*sum(M(d) for d in D[:11])

'''
>>> f'{adafactor():.3E}'; cost_of_run(adafactor())
'7.918E+22'
(188532.80765144504, 7855.533652143543)
'''

Computational Optimization

The paper also varies the number of attention heads H to find the compute-optimal setting, but this involves changing the step count and dataset, so this part is not captured by a simple formula. The calculation code is as follows:

def P(d: int, L=8, V=32101) -> int:
    return 2 * d * (6*L*d + V)

def compute_optimal():
    indices_50k = (14, 14, 12)
    return 4*PpL*sum([
        TPE * sum(sum(M(d) for d in D[:i]) for i in indices_50k),
        20 * sum(P(d)*M(d) for d in D[:11]) * 3,
    ])

# compute_optim 7.518E+23
# (1790104.1799513847, 74587.67416464102)

Summary

Summarizing the computing power and cost of the above experiments:

alignment       3.733E+20 (888.81395400704, 37.033914750293334)
table_e1        1.634E+23 (388955.9991064986, 16206.499962770775)
eps_variants    7.988E+23 (1902022.3291813303, 79250.93038255542)
eps_heatmaps    1.341E+24 (3193533.466348094, 133063.89443117057)
beta_only       7.988E+23 (1902022.3291813303, 79250.93038255542)
gamma_expts     1.354E+24 (3224397.534237257, 134349.8972598857)
weight_decay    1.331E+23 (317003.7215302217, 13208.488397092571)
adafactor       7.918E+22 (188532.80765144504, 7855.533652143543)
compute_optim   7.518E+23 (1790104.1799513847, 74587.67416464102)

Adding everything up, the total compute for the entire paper comes to 5.42e24 FLOPs.

That is only about 15% of Llama 3's pre-training compute; on a 100,000-GPU H100 cluster, all of the experiments could be completed in about 2 days.

total_flops=5.421E+24
rental price: US$12.9M
h100 node months required: 746.9595590938408
(sanity check) D=[128, 256, 512, 768, 1024, 1536, 2048, 2560, 3072, 4096, 6144, 8192, 12288, 16384]
(sanity check) model sizes: ['0.00979B', '0.0227B', '0.058B', '0.106B', '0.166B', '0.325B', '0.534B', '0.794B', '1.1B', '1.87B', '4.02B', '6.97B', '15.3B', '26.8B']
(sanity check) M/6P: ['63.4%', '68.5%', '75.3%', '79.7%', '82.8%', '86.8%', '89.3%', '91.0%', '92.2%', '93.9%', '95.7%', '96.7%', '97.7%', '98.3%']
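These totals can be reproduced with the hedged cost_of_run() sketch from earlier (same assumed $3 per H100-hour and ~3.5e14 sustained FLOPS per GPU):

# Back-of-the-envelope check of the totals above, under the assumed
# cost_of_run() constants.
experiments = {
    'alignment':     3.733e20, 'table_e1':     1.634e23,
    'eps_variants':  7.988e23, 'eps_heatmaps': 1.341e24,
    'beta_only':     7.988e23, 'gamma_expts':  1.354e24,
    'weight_decay':  1.331e23, 'adafactor':    7.918e22,
    'compute_optim': 7.518e23,
}
total_flops = sum(experiments.values())         # ~5.42e24 FLOPs
dollars, node_hours = cost_of_run(total_flops)  # ~$12.9M, ~540k node-hours
print(f'{total_flops:.3E}  ${dollars/1e6:.1f}M')
print('days on 100k H100s:', node_hours * 8 / 100_000 / 24)  # ~1.8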

However, if we stop measuring by the standards of LLM pre-training and view this DeepMind paper simply as academic research, the amount of compute looks downright extravagant.

If the laboratory only had 10 H100s, it would be impossible to conduct research of this magnitude.

A large laboratory with 100 H100s might be able to run all of the above experiments in a few years.

References:

https://152334h.github.io/blog/scaling-exponents/

https://news.ycombinator.com/item?id=41107721

https://arxiv.org/abs/2407.05872