
Are neural network architectures "different paths to the same destination"? ICML 2024 paper: different models, same learned content

2024-07-16



New Intelligence Report

Editor: Qiao Yang

[New Intelligence Report Introduction] Deep neural networks come in many scales and architectures, and it is generally believed that this affects the abstract representations a model learns. However, two UCL scholars have published a paper at ICML 2024 pointing out that, if the model architecture is flexible enough, certain network behaviors appear consistently across different architectures.

Since AI entered the era of large models, Scaling Law has almost become a consensus.


Paper address: https://arxiv.org/abs/2001.08361

In this 2020 paper, OpenAI researchers proposed that the performance of the model has a power-law relationship with three indicators: the number of parameters N, the size of the dataset D, and the training computing power C.
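Concretely, the power-law fits reported in that paper take roughly the following form (the constants N_c, D_c, C_c and the exponents α_N, α_D, α_C are fitted empirically; this is a paraphrase of the general shape of the fits, not a quotation of the paper's exact equations):

    L(N) ≈ (N_c / N)^α_N,   L(D) ≈ (D_c / D)^α_D,   L(C) ≈ (C_c / C)^α_C

where L is the test loss when the other two factors are not the bottleneck.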


Apart from these three aspects, within a reasonable range, factors such as the choice of hyperparameters and the width and depth of the model have little impact on performance.

Moreover, this power-law relationship places no restrictions on the model architecture. In other words, the Scaling Law can be taken to apply to almost any model architecture.

In addition, a paper in the field of neuroscience published in 2021 also seems to touch on this phenomenon from another angle.


Paper address: https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2021.625804/full

They found that networks such as AlexNet, VGG, and ResNet designed for visual tasks, even though they have large structural differences, seem to learn very similar semantics, such as the hierarchical relationships of object categories, after being trained on the same dataset.


But what is the reason behind this? Looking past these surface-level observations, to what extent are different network architectures similar at a fundamental level?

Two researchers from UCL published a paper this year that attempted to answer this question from the perspective of abstract representations learned by neural networks.


Paper address: https://arxiv.org/abs/2402.09142

They derive an effective theory that summarizes the dynamics of representation learning in large, complex model architectures and identifies its "rich" and "lazy" regimes: when a model is flexible enough, certain network behaviors appear consistently across different architectures.

The paper has been accepted at ICML 2024.

Modeling process

The universal approximation theorem states that, given enough parameters, a nonlinear neural network can learn and approximate any smooth function.

Inspired by this theorem, the paper first assumes that the encoding mapping from input to hidden representation, and the decoding mapping from hidden representation to output, are both arbitrary smooth functions.

Therefore, ignoring the details of the network architecture, the function dynamics can be modeled as follows:

The process of training a neural network can be viewed as optimizing a smooth function f (mapping inputs to outputs) on a given dataset, repeatedly updating the network parameters to minimize the MSE loss:

ℒ = ½ ⟨‖f(x) − y‖²⟩    (1)

where ⟨·⟩ denotes the average over the entire dataset.

Since we are interested in the dynamics of the representation space, the function f can be decomposed into a composition of two smooth mappings: an encoding map h from the input space to the latent space and a decoding map g from the latent space to the output space, so that the loss function in equation (1) becomes:

ℒ = ½ ⟨‖g(h(x)) − y‖²⟩
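As an illustrative sketch (not the paper's code), the snippet below builds f as the composition of an encoder h and a decoder g in PyTorch and evaluates the dataset-averaged MSE loss; all layer sizes and the toy data are arbitrary choices.

    import torch
    import torch.nn as nn

    # Encoder h: input space -> latent space; decoder g: latent space -> output space.
    # Widths and depths are arbitrary illustrative choices, not the paper's setup.
    h = nn.Sequential(nn.Linear(10, 64), nn.LeakyReLU(), nn.Linear(64, 32))
    g = nn.Sequential(nn.Linear(32, 64), nn.LeakyReLU(), nn.Linear(64, 5))

    def mse_loss(x, y):
        # L = (1/2) * <|| g(h(x)) - y ||^2>, averaged over the dataset
        return 0.5 * ((g(h(x)) - y) ** 2).sum(dim=1).mean()

    # Toy stand-in for a dataset of 8 input-output pairs.
    x = torch.randn(8, 10)
    y = torch.randn(8, 5)
    print(mse_loss(x, y))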


Next, the process of updating the parameters θ by the gradient descent rule can be written as:

τ dθ/dt = −∂ℒ/∂θ    (4)

where τ is the inverse of the learning rate.
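Discretized with a step size Δt, this continuous-time rule is just ordinary gradient descent with learning rate Δt/τ. Below is a minimal sketch of that update; the quadratic objective is a toy stand-in for the actual loss.

    import torch

    tau = 10.0   # inverse learning rate (time constant of the gradient flow)
    dt = 1.0     # discretization step, giving an effective learning rate of dt / tau

    theta = torch.randn(4, requires_grad=True)      # toy stand-in for the network parameters
    target = torch.tensor([1.0, -2.0, 0.5, 3.0])

    for _ in range(100):
        loss = 0.5 * ((theta - target) ** 2).sum()  # toy quadratic loss L(theta)
        loss.backward()
        with torch.no_grad():
            theta -= (dt / tau) * theta.grad        # tau * dtheta/dt = -dL/dtheta, discretized
        theta.grad.zero_()
    print(theta)   # converges toward `target`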

Although equation (4) is accurate enough, the problem is that it explicitly depends on the network parameters, and a sufficiently general mathematical expression requires ignoring this implementation detail.

Ideally, if the neural network is expressive enough, the optimization of the loss function should be expressed directly in terms of the two mappings h and g.


However, it is still unclear how to achieve this mathematically. The paper therefore starts with a simpler case: instead of the entire dataset, just two data points.

During training, as the mappings h and g change, the representations of different data points move through the latent space, approaching one another or interacting.

For example, for two data points x₁ and x₂ in the dataset, if x₁ and x₂ are close enough and h and g are smooth functions, the two mappings can be linearly approximated around the midpoint of the two points:

h(x) ≈ h̄ + H·(x − x̄),   g(h) ≈ ḡ + G·(h − h̄)    (5)

where H and G are the Jacobian matrices of h and g, respectively, and x̄, h̄, ḡ denote their values at the midpoint.
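To make the linearization concrete, the sketch below approximates a smooth (made-up) map around the midpoint of two nearby inputs using its Jacobian, computed with torch.autograd.functional.jacobian; equation (5) applies the same idea to h and g.

    import torch
    from torch.autograd.functional import jacobian

    # A smooth "encoder-like" map from R^3 to R^2, invented purely for illustration.
    def h(x):
        return torch.stack([torch.sin(x[0]) + x[1] ** 2, torch.tanh(x[2] - x[0])])

    x1 = torch.tensor([0.10, 0.20, -0.10])
    x2 = torch.tensor([0.15, 0.25, -0.05])   # close to x1, so the linearization is accurate
    x_bar = 0.5 * (x1 + x2)                  # midpoint of the two data points

    h_bar = h(x_bar)
    H = jacobian(h, x_bar)                   # 2x3 Jacobian of h at the midpoint

    # Linear approximation h(x) ≈ h_bar + H (x - x_bar)
    for x in (x1, x2):
        print(h(x), h_bar + H @ (x - x_bar))  # exact vs. linearized values nearly coincide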

Assuming the neural network has sufficient expressiveness and degrees of freedom, these linearization parameters (the midpoint values and the Jacobians) can be optimized directly, and the gradient descent process can be expressed as:


Equation (6) captures the main modeling assumption of the paper: it is intended as an effective theory for large, complex architectures, unconstrained by any specific parameterization.


Figure 1 is a visual representation of the above modeling process. To simplify the problem, it is assumed that two data points will only move closer or farther away in the latent space, but will not rotate.

The main quantities of interest are the distance ‖Δh‖ between the two representations in the latent space, which reveals the representation structure learned by the model, and the distance ‖Δŷ‖ between the two model outputs, which helps model the loss curve.

In addition, a third variable is introduced that controls the speed of representation learning; it can be interpreted as an output alignment, measuring the angular difference between the predicted and true outputs.

Thus, we have an independent system consisting of three scalar variables:



The implementation details of the neural network are abstracted into two constants of the form 1/τ, which act as effective learning rates.

Consistency of learning dynamics

After the modeling, the paper trains neural networks with different architectures on a two-point dataset and compares the actual learning dynamics with the numerical solution of the effective theory. The results are shown in Figure 2.


The default architecture is a 20-layer network with 500 neurons per layer, using leaky ReLU activations

It can be seen that, although only two constants need to be fitted, the effective theory just described fits the actual behavior of a variety of neural networks well.

The fact that the same equations can accurately describe the dynamics of a variety of complex models and architectures during training seems to suggest that if the models are expressive enough, they will eventually converge to a common network behavior.
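In the spirit of that experiment, a sketch like the following trains a deep leaky-ReLU network (split into an encoder and a decoder; the depth, width and learning rate here are illustrative choices, not the exact experimental setup) on two data points and logs the representation distance ‖Δh‖ and output distance ‖Δŷ‖ over training, which are the curves the effective theory is fitted to:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    def mlp(sizes):
        # Fully connected leaky-ReLU stack; the final nonlinearity is dropped.
        layers = []
        for i in range(len(sizes) - 1):
            layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.LeakyReLU()]
        return nn.Sequential(*layers[:-1])

    h = mlp([5, 100, 100, 20])   # encoder (depth/width chosen for illustration)
    g = mlp([20, 100, 5])        # decoder

    # A dataset of just two points.
    x = torch.randn(2, 5)
    y = torch.randn(2, 5)

    opt = torch.optim.SGD(list(h.parameters()) + list(g.parameters()), lr=1e-2)
    for step in range(2001):
        loss = 0.5 * ((g(h(x)) - y) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
        if step % 200 == 0:
            with torch.no_grad():
                d_h = (h(x[0]) - h(x[1])).norm()        # representation distance ||Δh||
                d_y = (g(h(x[0])) - g(h(x[1]))).norm()  # output distance ||Δŷ||
            print(step, float(loss), float(d_h), float(d_y))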

Applied to a larger dataset such as MNIST, still tracking the learning dynamics of two data points, the effective theory continues to hold.


The network consists of 4 fully connected layers, each with 100 neurons and a leaky ReLU activation function

However, it is worth noting that as the initial weights grow larger (Figure 3), the trajectories of the three variables ‖Δh‖, ‖Δŷ‖ and the alignment change.

This is because, when the initial weights are large, the two data points are far apart in the latent space at the start of training, so the linear approximation in equation (5) no longer holds and the theoretical model above breaks down.


Structured Representation

Given the smoothness constraints and the effective theory above, what can be said about the structure of the representations a neural network learns?

From equation (7) it can be deduced that there is a unique fixed point, which gives the final representation distance of the two data points:


If the initial weights are large, the final representation distance converges to a high value that depends on the data inputs and the random initialization; conversely, when the initial weights are small, it converges to a low value determined by the input-output structure of the data.

This separation between a random and a structured regime further validates the "rich" and "lazy" learning regimes of deep neural networks proposed in previous work, in particular the observation that the scale of the initial weights is the key controlling factor.

The paper gives an intuitive explanation for this phenomenon:

If the initial weights are large, two data points in the latent space will be far apart at the beginning of training, so the flexibility of the network allows the decoder to freely learn the correct output for each data point individually without significantly adjusting the representation structure. Therefore, the eventually learned pattern is similar to the structure that was already present at the time of initialization.

Conversely, when the weights are small, the two data points start out close together, and because of the smoothness constraint the encoding map must adjust to the target outputs, moving the representations of the two data points to fit the data.

Therefore, we can see that when the weights are small, representation learning will show a structured effect (Figure 5).


This can be demonstrated more intuitively by switching the network's task to fitting the XOR function: when the initial weights are small, the model clearly learns the structure of the XOR function.


For the network with only two layers on the right, there is a large deviation between theory and experiment, which shows how important the high-expressiveness assumption is to the theory above.
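A hedged sketch of that kind of experiment: a small leaky-ReLU network is trained on the four XOR points with its initial weights multiplied by a gain factor, and the pairwise distances between the final hidden representations are compared for a small versus a large gain (all sizes, gains and learning rates here are illustrative, not the paper's settings):

    import torch
    import torch.nn as nn

    # The four XOR inputs and their targets.
    x = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = torch.tensor([[0.], [1.], [1.], [0.]])

    def train_xor(gain, seed=0, steps=5000, lr=0.02):
        """Train an encoder/decoder pair on XOR with initial weights scaled by `gain`,
        then return the pairwise distances between the four hidden representations."""
        torch.manual_seed(seed)
        h = nn.Sequential(nn.Linear(2, 100), nn.LeakyReLU(),
                          nn.Linear(100, 100), nn.LeakyReLU(),
                          nn.Linear(100, 10))
        g = nn.Sequential(nn.Linear(10, 100), nn.LeakyReLU(), nn.Linear(100, 1))
        with torch.no_grad():                      # rescale the initialization
            for p in list(h.parameters()) + list(g.parameters()):
                p.mul_(gain)
        opt = torch.optim.SGD(list(h.parameters()) + list(g.parameters()), lr=lr)
        for _ in range(steps):
            loss = 0.5 * ((g(h(x)) - y) ** 2).sum(dim=1).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        with torch.no_grad():
            return torch.cdist(h(x), h(x))         # 4x4 matrix of representation distances

    print(train_xor(gain=0.25))   # small initial weights ("rich" regime)
    print(train_xor(gain=1.5))    # large initial weights ("lazy" regime)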

Conclusion

The main contribution of this paper is an effective theory that expresses what different neural network architectures have in common in their learning dynamics and that exhibits structured representation learning.

Due to the smoothness assumptions in the modeling process and the simplification of data-point interactions, this theory is not yet a general model of the deep neural network training process.

However, the most valuable thing about this work is that it shows that some of the elements required for representation learning may already be included in the process of gradient descent, rather than just coming from the inductive biases inherent in a particular model architecture.

In addition, the theory also emphasizes that the scale of the initial weights is a key factor in the final formation of the representation structure.

As future work, the effective theory still needs to be extended to handle larger and more complex datasets, rather than modeling only the interaction of two data points.

At the same time, many model architectures do introduce inductive biases that affect representation learning, potentially interacting with the representational effects being modeled.

References:

https://arxiv.org/abs/2402.09142