news

What happens if you shuffle or skip the Transformer layer? New research reveals its information flow mechanism

2024-07-26


Xi Feng, from Aofei Temple
Quantum Bit | Public Account QbitAI

The information flow mechanism in Transformer has been revealed by the latest research:

Are all the layers necessary? Do the layers in between do the same thing? Does the order of the layers matter?

What happens if you skip some layers, for example feeding the output of layer 4 directly into layer 6? What if you randomly shuffle the order of the layers, say 4-6-5-7?

Recently, a study called "Transformer Layers as Painters" has attracted a lot of attention. It was carried out by a research team from the AI startups Sakana AI and Emergence AI.



Starting from the Transformer's internal working principles, they answered the questions above through a series of experiments. The team says that a deep understanding of these principles can not only make existing models more efficient to use, but also help improve the architecture and develop new variants.

Lucas Beyer, a researcher at Google DeepMind and author of ViT, gave a thumbs up after reading it:

Great summary! Although some of the experiments have been shown in previous studies, I like the new details you added, especially highlighting that “reasoning” tasks are more affected than other tasks!



Many scholars and engineers also strongly recommended it.

I bet some of these insights will eventually be used to improve the Transformer.



The experiments reaffirmed that duplicating layers helps with creative tasks but is generally ineffective for reasoning tasks; that changing the order of layers doesn't work; and that pruning works best on the middle layers, though some repair is still needed afterward.



So, what experiments did the research team conduct in this study? What questions did they answer?

Experimental model selection and benchmarking

Let's first take a look at the experimental setup.

The experiments were conducted on both decoder-only and encoder-only models.

The decoder-only model is Llama2, with the main focus on the 32-layer, 7-billion-parameter Llama2-7B. Extended experiments also include the 13B (40 layers) and 70B (80 layers) models.

The encoder-only model is BERT, with 24 layers and 340 million parameters (i.e., BERT-Large).

The researchers used standard pre-trained checkpoints of these models. In all experiments, the models were frozen, and no model parameters were modified by fine-tuning or other methods, except for a standard fine-tuning step included in the evaluation of BERT.

In terms of benchmarks, Llama2 uses the following standard benchmarks: ARC (science test questions), HellaSwag (common sense questions), GSM8K (math questions), WinoGrande (common sense reasoning), LAMBADA (vocabulary prediction). LAMBADA is used to measure perplexity, which is closest to the original token prediction used during training.

For Llama2, performance is reported as the normalized median across benchmarks, scaled from 0 to 1 (where 1 corresponds to the unmodified model's performance).

For BERT, the GLUE benchmark is used with its standard evaluation metrics, reporting the unnormalized mean score across tasks. Note that the standard BERT evaluation includes a fine-tuning step, so the model is adapted; in the appendix, the researchers also show results where only the model head is allowed to change.
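For concreteness, here is a minimal sketch of how such a 0-to-1 normalization could be computed. The article only says that 1 corresponds to the unmodified model's performance; treating each benchmark's random-guess accuracy as the 0 point is an assumption made for illustration, and the benchmark numbers in the example are made up.

```python
# Hypothetical sketch of a 0-to-1 benchmark normalization (not the authors' code).
from statistics import median

def normalized_median(scores, random_baselines, full_model_scores):
    """All three arguments are dicts keyed by benchmark name."""
    normalized = []
    for name, score in scores.items():
        lo = random_baselines[name]      # assumed 0 point: random-guess accuracy
        hi = full_model_scores[name]     # assumed 1 point: the frozen, unmodified model
        normalized.append((score - lo) / (hi - lo))
    return median(normalized)

# Example with made-up numbers for a hypothetical layer-skipping variant
print(normalized_median(
    scores={"ARC": 0.40, "HellaSwag": 0.60, "GSM8K": 0.05},
    random_baselines={"ARC": 0.25, "HellaSwag": 0.25, "GSM8K": 0.0},
    full_model_scores={"ARC": 0.55, "HellaSwag": 0.77, "GSM8K": 0.14},
))
```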

The motivation for the experiment originally came from the following question:

Is it possible to somehow merge multiple layers into a single, possibly larger layer? One hypothesis is that, because of the residual connections used during training, the middle layers of the network may share a common representational space. (This is not true for standard multilayer perceptrons, which have no mechanism to enforce common representations or permutation consistency across layers.)

If layers do share a representation space, this would have important implications for subsequent conditional computation and for dynamically adding new knowledge to pre-trained Transformer models and their downstream applications.

8 Big Questions About the Transformer
Do the layers use the same representation space?

To determine whether different layers share the same representation space, the researchers examined the Transformer's robustness to skipping specific layers or swapping the order of adjacent layers.

For example, what happens if we change the output flow from the normal order of "layer 4 -> layer 5 -> layer 6" to "layer 4 -> layer 6" in the Llama2-7B model, skipping layer 5?

Or what if we send the output of layer 4 to layer 6, then send the output of layer 6 to layer 5, and then to layer 7?

As shown in the figure below, the experiments found that, apart from the first and last few layers, Llama2-7B is quite robust to skipping a layer or swapping the order of adjacent layers.

That is, the middle layers appear to share a representational space, while the "outer layers" (the first and last few layers) have representational spaces of their own.



To further confirm this hypothesis, the researchers measured the mean cosine similarity between hidden state activations at different layers in different models (Llama2-7B, Llama2-13B, and BERT-Large) and compared them across benchmarks.
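As an illustration (not the authors' code), the sketch below computes such a layer-by-layer mean cosine similarity matrix with the Hugging Face transformers library, using a small BERT checkpoint and two toy sentences in place of the models and benchmark data used in the paper.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"            # small stand-in; the paper uses BERT-Large and Llama2
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name).eval()

texts = ["The quick brown fox jumps over the lazy dog.",
         "Transformer layers pass a residual stream from one block to the next."]
batch = tok(texts, return_tensors="pt", padding=True)

with torch.no_grad():
    out = model(**batch, output_hidden_states=True)

# hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, dim]
hs = torch.stack(out.hidden_states[1:])          # drop the embedding output
num_layers = hs.shape[0]
sim = torch.zeros(num_layers, num_layers)
for i in range(num_layers):
    for j in range(num_layers):
        # mean cosine similarity over all (sentence, token) positions
        # (padding positions are included here for simplicity)
        sim[i, j] = F.cosine_similarity(hs[i], hs[j], dim=-1).mean()

print(sim)                                        # high off-diagonal values suggest shared spaces
```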

Figure 3 below shows high consistency among all the middle layers: for example, the activations of the fourth layer from the bottom are highly similar to those of the fourth layer from the top. For the 40-layer Llama2-13B, the layers can be grouped into 4-5 clusters by similarity: layer 0, layers 1-3, the middle layers, and then the last layer or two.



This suggests that the model may have three distinct representation spaces, for the "beginning", "middle", and "end" layers. The researchers also found that the number of "beginning" layers seems to grow as the total number of layers in the model increases.

In addition, high cosine similarity suggests a shared representation space, while low similarity more likely indicates that the spaces are not shared. The Llama2-7B data in Figure 3 agree closely with the performance results shown in Figure 2, which further supports the conclusion that:

At least the representation space of the middle layer is shared.

Are all layers necessary?

To further verify that the representation space of the intermediate layers is truly shared, the researchers also conducted layer-skipping experiments (no fine-tuning was performed).

Specifically, the output of layer N is passed directly as the input to layer N+M (M > 1), thereby "skipping" the M-1 layers in between, as shown in the figure below.



Since layer N+M was only ever trained on inputs coming from layer N+M-1, can it make sense of the activations of layer N?

In these experiments, the researchers execute the first and last N-1 layers normally, while skipping or modifying layers N+1 through T-N (where T is the total number of layers in the model).
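The sketch below illustrates the skipping scheme on a toy stack of residual blocks. It is not the authors' code: the simple linear-plus-tanh blocks merely stand in for the frozen Llama2/BERT layers, which in the real experiments are pretrained and evaluated on the benchmarks above.

```python
import torch
import torch.nn as nn

T, DIM = 32, 64                                  # T = total layers (32 for Llama2-7B)
torch.manual_seed(0)
layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(T))   # toy stand-in layers

def run_layer(x, layer):
    return x + torch.tanh(layer(x))              # write into the residual stream

def forward_skip(x, skip_from, skip_to):
    """Run every layer except indices skip_from .. skip_to - 1."""
    for idx in range(T):
        if skip_from <= idx < skip_to:
            continue                             # the last kept layer's output feeds layer skip_to
        x = run_layer(x, layers[idx])
    return x

x = torch.randn(2, 16, DIM)                      # (batch, seq, hidden) activations
print(forward_skip(x, skip_from=5, skip_to=6).shape)   # skip layer 5: layer 4's output feeds layer 6
```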

As shown in Figure 4, the performance of both Llama2-7B and BERT-Large degrades gradually (in the figure, the number of skipped layers increases from left to right). This result reveals:

Not all layers are necessary: omitting some of the intermediate layers does not seriously hurt overall performance.



Do the middle layers all perform the same function?

If the intermediate layers share a common representational space, are these layers redundant?

To answer this question, the researchers repeated the previous "skipping" experiment, but this time, instead of skipping the middle layers, they replaced the weights of all the middle layers with the weights of the center-most layer, as shown below.

In effect, the center layer is looped T-2N+1 times, where T is the total number of layers in the model (32 for Llama2-7B and 24 for BERT-Large).
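Here is a toy sketch of that replacement, again using simple residual blocks as stand-ins for the frozen model's layers (not the authors' code): the outer layers run normally, and the center layer is applied repeatedly over the whole middle span.

```python
import torch
import torch.nn as nn

T, DIM = 32, 64
torch.manual_seed(0)
layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(T))   # toy stand-in layers
run_layer = lambda x, layer: x + torch.tanh(layer(x))           # residual update

def forward_repeat_center(x, n_outer):
    """Run the first/last n_outer layers normally; loop the center layer over the middle span."""
    center = layers[T // 2]
    for idx in range(n_outer):                   # first n_outer layers as normal
        x = run_layer(x, layers[idx])
    for _ in range(T - 2 * n_outer):             # middle span: reuse the center layer's weights
        x = run_layer(x, center)
    for idx in range(T - n_outer, T):            # last n_outer layers as normal
        x = run_layer(x, layers[idx])
    return x

print(forward_repeat_center(torch.randn(2, 16, DIM), n_outer=8).shape)
```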



In the benchmarks, as the number of replaced layers increases, model performance degrades rapidly, and it drops far faster than when those layers are simply skipped; this kind of weight replacement is extremely destructive.



Therefore, the intermediate layers are not redundant: they each perform a different function, and sharing weights between intermediate layers has disastrous consequences.

Does the order of layers matter?

The above experiments show that although the intermediate layers share the representation space, they perform different operations on this space. So does the order of these operations matter? The researchers conducted two sets of experiments.

First, the middle layers are run in reverse order: the output of layer T-N is passed to layer T-N-1, and so on, down to layer N, after which the output of that layer is passed on to the final layers.

As shown below:



In the second experiment, the middle layers are run in a random permutation, and results are averaged over 10 random seeds.
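Both reordering variants are easy to express on the same kind of toy residual stack used above. The sketch below is an illustration, not the authors' code: it runs the middle layers either reversed or in a seeded random permutation while leaving the outer layers untouched.

```python
import random

import torch
import torch.nn as nn

T, DIM, N_OUTER = 32, 64, 4
torch.manual_seed(0)
layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(T))   # toy stand-in layers
run_layer = lambda x, layer: x + torch.tanh(layer(x))           # residual update

def forward_reordered(x, order="reverse", seed=0):
    """Run the middle layers reversed or randomly permuted; outer layers stay in place."""
    middle = list(range(N_OUTER, T - N_OUTER))
    if order == "reverse":
        middle = middle[::-1]
    else:                                        # "random": one seed of the 10 averaged in the paper
        random.Random(seed).shuffle(middle)
    for idx in list(range(N_OUTER)) + middle + list(range(T - N_OUTER, T)):
        x = run_layer(x, layers[idx])
    return x

x = torch.randn(2, 16, DIM)
print(forward_reordered(x, "reverse").shape, forward_reordered(x, "random", seed=3).shape)
```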

The results are as follows. Both models show only a slow degradation in performance.





To spoil a later result: whether the middle layers are run in reverse or in random order, the model performs better than when those layers are skipped outright, meaning that even when layers receive inputs in an order they were not trained on, they can still produce valid outputs.

So, does layer order matter? Conclusion:

Adjusting the layer order has some impact on performance: both random order and reverse order show a degree of degradation.

It is worth noting that random order performs better than reverse order. This may be because reverse order is exactly the opposite of the order seen during training, while a random order preserves at least some of the original ordering relationships (i.e., some layers i still come after layers j with i > j).

Is it possible to run these layers in parallel?

If the presence of the layers, i.e. not skipping them, matters more than the order in which they are executed, then can we consider running these layers independently and combining their results, as shown below?



The researchers conducted an experiment where, instead of skipping the middle layers, they ran them in parallel and then passed their averaged result to the final N layers.
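A toy version of this parallel-and-average scheme is sketched below (not the authors' code): every middle block reads the same activations, and the mean of their outputs is handed to the remaining layers.

```python
import torch
import torch.nn as nn

T, DIM, N_OUTER = 32, 64, 4
torch.manual_seed(0)
layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(T))   # toy stand-in layers
run_layer = lambda x, layer: x + torch.tanh(layer(x))           # residual update

def forward_parallel(x):
    """Run the middle layers on the same input and average their outputs."""
    for idx in range(N_OUTER):                                   # first layers as normal
        x = run_layer(x, layers[idx])
    middle_outs = [run_layer(x, layers[i]) for i in range(N_OUTER, T - N_OUTER)]
    x = torch.stack(middle_outs).mean(dim=0)                     # average the parallel results
    for idx in range(T - N_OUTER, T):                            # last layers as normal
        x = run_layer(x, layers[idx])
    return x

print(forward_parallel(torch.randn(2, 16, DIM)).shape)
```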

The results are shown in the figure below. All benchmarks except the GSM8K math problem benchmark show slow performance degradation.

Interestingly, parallel layers perform better than skipped layers, but worse than running the layers in reverse order.



In conclusion, is it possible to run these layers in parallel? The answer is: yes, with the exception of math-focused benchmarks.

For some tasks, does order matter more?

Most variants (including reverse, skip, and parallel) show the fastest performance degradation in either the abstract reasoning ARC or mathematical reasoning GSM8K benchmarks.

This can be explained by the fact that step-by-step reasoning tasks are more sensitive to changes in layer order than “semantic” tasks such as Winogrande or HellaSwag.

This is because reasoning tasks require the combination of both structural and semantic information, while tasks such as HellaSwag can be completed with only semantics.

Through the experiments, the researchers concluded: mathematical and reasoning tasks are more order-dependent than "semantic" tasks.

Does iteration help with parallel layers?

If we compare the Transformer's internal operation to the process of painting a picture: the canvas (the input) is passed between a series of painters; some specialize in painting birds, others are better at painting wheels... Each painter takes the canvas from the previous painter in turn and decides whether to add to the painting or pass it along unchanged to the next painter (via the residual connection).

It's conceivable that some layers only "add to" the drawing when they receive the appropriate input: for example, a painter who "draws the wheels" is more likely to draw them if they first see the body of the car.

In the Transformer, some layers may contribute to the forward pass only when they receive appropriate input, rather than passing the input directly out through residual connections.

So, compared with executing the parallel layers only once, executing them iteratively should improve performance.

The researchers tested this by feeding the averaged output of the parallel layers back into the same layers for a fixed number of iterations, as shown below:
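The looped variant only changes a few lines relative to the parallel sketch above. Here is an illustrative version (again a toy stand-in, not the authors' code) where the averaged output of the parallel middle layers is fed back into them for a fixed number of iterations.

```python
import torch
import torch.nn as nn

T, DIM, N_OUTER = 32, 64, 4
torch.manual_seed(0)
layers = nn.ModuleList(nn.Linear(DIM, DIM) for _ in range(T))   # toy stand-in layers
run_layer = lambda x, layer: x + torch.tanh(layer(x))           # residual update

def forward_looped_parallel(x, iterations=3):
    """Like the parallel variant, but the averaged output is fed back in `iterations` times."""
    for idx in range(N_OUTER):                                   # first layers as normal
        x = run_layer(x, layers[idx])
    for _ in range(iterations):                                  # feed the average back into the same layers
        outs = [run_layer(x, layers[i]) for i in range(N_OUTER, T - N_OUTER)]
        x = torch.stack(outs).mean(dim=0)
    for idx in range(T - N_OUTER, T):                            # last layers as normal
        x = run_layer(x, layers[idx])
    return x

print(forward_looped_parallel(torch.randn(2, 16, DIM), iterations=3).shape)
```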



In Figure 9 below, the researchers show the results of iterating the parallel layer three times, which is significantly better than executing the parallel layer only once.



The only exception is when the starting layer N is 15 for Llama2-7B or 11 for BERT. In that case, looping the parallel block 3 times is equivalent to simply repeating the middle layer 3 times, and the single-pass parallel layers are equivalent to the full model.

The researchers also repeated the experiment with different numbers of iterations.

The figure below shows how the performance of Llama2-7B varies with the number of parallel layers M and the number of iterations.



The optimal number of iterations for each M is indicated by a red box. Except for M=29 and M=31 (where almost all layers are parallelized), the optimal number of iterations scales roughly linearly with the number of parallel layers.

So the conclusion is: iteration helps parallel layers, and the optimal number of iterations is proportional to the number of parallel layers.

Which variants hurt performance the least?

Finally, the researchers compared all the different variations in the experiment on the same graph.

The results show that repeating a single layer (replacing the middle layers with that many copies of the center-most layer, as described above) gives the worst results, with performance quickly degrading to the random baseline.



Iterated parallel execution and random layer ordering cause the least performance degradation, with iterated parallel performing best for both BERT and Llama2-7B.



More experimental results are included in the appendix of the paper; interested readers can check out the original paper.

Paper link: https://arxiv.org/abs/2407.09298v1
Reference link: https://x.com/A_K_Nain/status/1812684597248831912