
Scientists reveal the linear nature of deep neural networks, which may inspire better model fusion algorithms

2024-07-15



Although deep learning has achieved great success in recent years, people's understanding of its theory still lags behind.

For this reason, research topics that attempt to explain the loss function and optimization process of deep learning from a theoretical perspective have received much attention.

Although the loss functions used in deep learning are often viewed as high-dimensional, complex black-box functions, it is widely believed that these functions, especially along the trajectories encountered in actual training, contain benign structures that facilitate gradient-based optimization.

Like many other scientific disciplines, a key step in building a theory of deep learning is to understand the nontrivial phenomena discovered in experiments and thus elucidate their underlying mechanisms.

Recently, scholars in the field have discovered a striking phenomenon - Mode Connectivity.

That is, the optima found by two independent gradient-based optimization runs can be connected by simple paths in parameter space, along which the loss (or accuracy) remains almost constant.

This phenomenon is undoubtedly surprising, because different optimal points of non-convex functions are likely to be located in different and isolated "valleys".

However, this does not happen for the optimal points found in practice.

What’s more interesting is that some researchers have discovered Linear Mode Connectivity, which is stronger than Mode Connectivity.

Research on Linear Mode Connectivity shows that different optimal points can be connected through linear paths.
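In symbols, Linear Mode Connectivity is commonly formalized as follows (the notation is ours, not the article's): for two sets of trained weights, the loss along the straight line between them stays close to the endpoint losses.

```latex
% Linear Mode Connectivity (a common formalization; notation ours):
% \theta_A, \theta_B are two trained weight vectors, \mathcal{L} is the loss.
\mathcal{L}\bigl(\alpha\,\theta_A + (1-\alpha)\,\theta_B\bigr)
\;\lesssim\; \max\bigl\{\mathcal{L}(\theta_A),\, \mathcal{L}(\theta_B)\bigr\},
\qquad \forall\, \alpha \in [0,1].
```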

Although two completely independently trained networks do not usually satisfy Linear Mode Connectivity, there are two ways to obtain a pair of networks that does (both are sketched in code after the two descriptions below):

The first is the Spawning Method.

A network is initialized and trained for a few epochs, then its parameters are copied to obtain two networks, which continue training independently under different sources of randomness (for example, data ordering and augmentation).

The second is the Permutation Method.

That is, the two networks are first trained independently, and then the neurons of one network are rearranged to match those of the other network.
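Below is a minimal PyTorch-style sketch of the two procedures. The helper names, the user-supplied train_fn, and the one-hidden-layer MLP assumed by the permutation routine are illustrative assumptions, not the team's code.

```python
import copy
import torch
from scipy.optimize import linear_sum_assignment

def spawn(model, loader, train_fn, spawn_epochs, total_epochs):
    """Spawning Method (sketch): train briefly, copy the weights, then train the
    two copies independently under different randomness (seed, data order, ...)."""
    train_fn(model, loader, epochs=spawn_epochs)                    # shared early phase
    child_a, child_b = copy.deepcopy(model), copy.deepcopy(model)   # copy the parameters
    torch.manual_seed(0); train_fn(child_a, loader, epochs=total_epochs - spawn_epochs)
    torch.manual_seed(1); train_fn(child_b, loader, epochs=total_epochs - spawn_epochs)
    return child_a, child_b

@torch.no_grad()
def permute_to_match(model_a, model_b):
    """Permutation Method (sketch) for a one-hidden-layer ReLU MLP built as
    nn.Sequential(nn.Linear(d_in, h), nn.ReLU(), nn.Linear(h, d_out)):
    reorder model_b's hidden units to best match model_a's, then apply the same
    permutation to the output layer so the computed function does not change."""
    w1_a, w1_b = model_a[0].weight.data, model_b[0].weight.data     # shape (h, d_in)
    cost = -(w1_a @ w1_b.t()).cpu().numpy()                         # negative unit similarity
    _, perm = linear_sum_assignment(cost)                           # best one-to-one matching
    perm = torch.as_tensor(perm)
    model_b[0].weight.data = model_b[0].weight.data[perm]
    model_b[0].bias.data = model_b[0].bias.data[perm]
    model_b[2].weight.data = model_b[2].weight.data[:, perm]        # permute input columns
    return model_b
```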

In a previous work, Dr. Zhou Zhanpeng from Shanghai Jiao Tong University and collaborators from the Shanghai Artificial Intelligence Laboratory hoped to explain Linear Mode Connectivity from the perspective of feature learning.

They raised the question: when the weights of two trained networks are linearly interpolated, what happens to the internal features?


Photo | Zhou Zhanpeng (Source: Zhou Zhanpeng)

They found that the features in almost all layers also satisfy a strong form of linear connectivity: the feature maps of the weight-interpolated network are approximately equal to the linear interpolation of the feature maps of the two original networks.

They call this phenomenon Layerwise Linear Feature Connectivity.
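Written out (notation ours; the article itself gives no formulas), Layerwise Linear Feature Connectivity says that for every layer and every input, the features of the weight-interpolated network match the interpolation of the two networks' features:

```latex
% Layerwise Linear Feature Connectivity (notation ours):
% f^{(\ell)}(\theta; x) is the layer-\ell feature map of the network with weights \theta.
f^{(\ell)}\bigl(\alpha\,\theta_A + (1-\alpha)\,\theta_B;\; x\bigr)
\;\approx\;
\alpha\, f^{(\ell)}(\theta_A; x) + (1-\alpha)\, f^{(\ell)}(\theta_B; x),
\qquad \forall\, \alpha \in [0,1],\ \forall\, \ell.
```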

In addition, they found that Layerwise Linear Feature Connectivity always occurs simultaneously with Linear Mode Connectivity.

They also proved the following rule: if two models trained on the same dataset satisfy Layerwise Linear Feature Connectivity, then they also satisfy Linear Mode Connectivity.
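A sketch of why this implication holds (our reconstruction, assuming a per-example loss that is convex in the network output, such as cross-entropy): Layerwise Linear Feature Connectivity at the output layer turns weight interpolation into output interpolation, and convexity then bounds the loss of the interpolated model by the interpolation of the two losses.

```latex
% LLFC at the output layer f, plus convexity of \ell(\cdot, y) in the output
% (our reconstruction of the argument, not necessarily the paper's exact proof):
\mathcal{L}\bigl(\alpha\theta_A + (1-\alpha)\theta_B\bigr)
\;\approx\; \mathbb{E}\,\ell\bigl(\alpha f(\theta_A; x) + (1-\alpha) f(\theta_B; x),\, y\bigr)
\;\le\; \alpha\,\mathcal{L}(\theta_A) + (1-\alpha)\,\mathcal{L}(\theta_B).
```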

Furthermore, the research team conducted in-depth research on the reasons for the emergence of Layerwise Linear Feature Connectivity.

Two key conditions were identified: weak additivity of the ReLU activation and a commutativity property between the two trained networks.
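One way to write these two conditions (notation ours, reconstructed from the team's earlier paper on Layerwise Linear Feature Connectivity; the precise statements are given there):

```latex
% Weak additivity of ReLU: holds approximately for the pre-activations actually
% encountered in trained networks, not for arbitrary vectors.
\sigma(u) + \sigma(v) \;\approx\; \sigma(u + v), \qquad \sigma(z) = \max(z, 0).

% Commutativity between networks A and B at layer \ell
% (W^{(\ell)} are the layer weights, f^{(\ell-1)} the previous layer's features):
W_A^{(\ell)} f_A^{(\ell-1)}(x) + W_B^{(\ell)} f_B^{(\ell-1)}(x)
\;\approx\;
W_A^{(\ell)} f_B^{(\ell-1)}(x) + W_B^{(\ell)} f_A^{(\ell-1)}(x).
```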

Based on these two conditions, they proved Layerwise Linear Feature Connectivity for ReLU networks and verified both conditions experimentally.

They also proved that the Permutation Method makes the two networks satisfy Linear Mode Connectivity precisely by making them satisfy this commutativity property.

Overall, the research team discovered a finer-grained and stronger linear property of neural networks than Linear Mode Connectivity.

However, the above findings are all based on networks trained on the same dataset.

So they raised a new question: does Layerwise Linear Feature Connectivity also hold for two models trained on different datasets?

The team noticed that the Spawning Method is very similar to the pre-training-fine-tuning paradigm: both start from a model that has already been trained for some time and continue training it.

The only difference is that the model in the Spawning Method continues to be trained on the same dataset, while the model in fine-tuning can be trained on a different dataset.

In a recent work, they found that, under the pre-training-fine-tuning paradigm, different fine-tuned models also satisfy this layerwise linearity property, which the research team named Cross-Task Linearity.

They found that, under the pre-training-fine-tuning paradigm, the network behaves approximately like a linear map from parameter space to feature space.

That is, Cross-Task Linearity extends the definition of Layerwise Linear Feature Connectivity to models trained on different datasets.
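Below is a minimal sketch of how such a claim can be checked numerically; the constructor make_model, the layer-extraction helper feature_fn, and the cosine-similarity metric are our illustrative choices, not the team's evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def interpolate_state_dicts(sd_a, sd_b, alpha):
    """Linearly interpolate two compatible state dicts: alpha*A + (1 - alpha)*B."""
    return {k: alpha * sd_a[k] + (1 - alpha) * sd_b[k] for k in sd_a}

@torch.no_grad()
def cross_task_linearity_score(model_a, model_b, make_model, feature_fn, x, alpha=0.5):
    """Compare the features of the weight-interpolated model with the interpolation
    of the two models' features on a batch x. feature_fn(model, x) should return
    the feature map of the layer under inspection, with the batch dimension first."""
    merged = make_model()
    merged.load_state_dict(
        interpolate_state_dicts(model_a.state_dict(), model_b.state_dict(), alpha))
    feat_merged = feature_fn(merged, x)
    feat_interp = alpha * feature_fn(model_a, x) + (1 - alpha) * feature_fn(model_b, x)
    # A cosine similarity close to 1 indicates Cross-Task Linearity at this layer.
    return F.cosine_similarity(feat_merged.flatten(1), feat_interp.flatten(1), dim=1).mean()
```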

Interestingly, the team also used the discovery of Cross-Task Linearity to explain two common model fusion techniques:

First, Model Averaging takes the average of the weights of multiple models fine-tuned on the same dataset but with different hyperparameter configurations, which can improve accuracy and robustness.

In the study, the team interpreted averaging in weight space as approximately averaging the features at each layer, thereby establishing a close connection between Model Averaging and model ensembling and further explaining why Model Averaging works.
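A minimal sketch of Model Averaging over fine-tuned checkpoints (the helper name and the uniform weighting are our assumptions):

```python
import torch

@torch.no_grad()
def average_weights(state_dicts):
    """Uniformly average the weights of several models fine-tuned from the same
    pre-trained checkpoint with the same architecture. Assumes floating-point
    tensors (no integer buffers such as BatchNorm batch counters)."""
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k] for sd in state_dicts]).mean(dim=0) for k in keys}

# Usage (hypothetical):
# merged_model.load_state_dict(average_weights([m.state_dict() for m in models]))
```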

Second, Task Arithmetic can merge the weights of models fine-tuned on different tasks through simple arithmetic operations, thereby controlling the behavior of the model accordingly.

In the study, the team converted arithmetic operations in the parameter space into operations in the feature space, thereby explaining Task Arithmetic from the perspective of feature learning.
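A minimal sketch of Task Arithmetic via task vectors (the coefficient values and helper names are illustrative assumptions): subtracting the pre-trained weights from each fine-tuned model gives a task vector, and adding a weighted sum of task vectors back to the pre-trained weights merges the tasks.

```python
import torch

@torch.no_grad()
def task_vector(finetuned_sd, pretrained_sd):
    """Task vector = fine-tuned weights minus pre-trained weights."""
    return {k: finetuned_sd[k] - pretrained_sd[k] for k in pretrained_sd}

@torch.no_grad()
def apply_task_arithmetic(pretrained_sd, task_vectors, coeffs):
    """Add a weighted combination of task vectors to the pre-trained weights
    (floating-point tensors assumed, as above). Under Cross-Task Linearity,
    this parameter-space arithmetic roughly corresponds to combining, layer by
    layer, the feature changes learned on each task."""
    merged = {k: v.clone() for k, v in pretrained_sd.items()}
    for tv, lam in zip(task_vectors, coeffs):
        for k in merged:
            merged[k] += lam * tv[k]
    return merged
```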

They then explored the conditions for the emergence of Cross-Task Linearity and discovered the importance of pre-training for Cross-Task Linearity.

Experimental results show that the common knowledge acquired during pre-training helps the fine-tuned models satisfy Cross-Task Linearity.

In the study, they also made a preliminary attempt to prove Cross-Task Linearity and found that its emergence is related to the flatness of the loss landscape and to the distance between the weights of the two fine-tuned models.

Recently, a related paper titled "On the Emergence of Cross-Task Linearity in Pretraining-Finetuning Paradigm" was published at the International Conference on Machine Learning (ICML) 2024 [1].


Figure | Related papers (Source: ICML 2024)

The research team said: "We hope that this discovery can inspire better model fusion algorithms."

"In the future, if we want to build a fine-tuned large model with multiple capabilities, large-model fusion will become one of the core technologies. This work provides solid experimental and theoretical support for large-model fusion and can inspire better fusion algorithms."

Next, they hope to understand Linear Mode Connectivity, Layerwise Linear Feature Connectivity, and Cross-Task Linearity from the perspective of Training Dynamics.

Although they have obtained some explanations at the feature level, they still cannot explain Linear Mode Connectivity from first principles.

For example, why does the Spawning Method need only a few epochs of shared training before spawning two models that eventually satisfy Linear Mode Connectivity?

And how can this spawning time be predicted? Answering these questions requires understanding Linear Mode Connectivity from the perspective of training and optimization, which is the direction of the team's subsequent efforts.

References:

1. Zhou, Z., Chen, Z., Chen, Y., Zhang, B., & Yan, J. (2024). On the Emergence of Cross-Task Linearity in Pretraining-Finetuning Paradigm. In Forty-first International Conference on Machine Learning (ICML 2024).

