
Where did the once-famous BERT go? The answer points to a paradigm shift in LLMs

2024-07-22


Where did the encoder model go? If BERT works so well, why not scale it? What about encoder-decoder or encoder-only models?



In the field of Large Language Models (LLMs), decoder-only models (such as the GPT series of models) are now the dominant force. What about encoder-decoder or encoder-only models? Why has BERT, once a popular model, gradually become less popular?

Recently, Yi Tay, chief scientist and co-founder of AI startup Reka, published a blog post to share his views. Before founding Reka, Yi Tay worked at Google Research and Google Brain for more than three years, and participated in the research and development of well-known LLMs such as PaLM, UL2, Flan-2, Bard, and multimodal models such as PaLI-X and ViT-22B. The following is the content of his blog post.



Basic Introduction

In general, the LLM model architectures in the past few years can be divided into three major paradigms: encoder-only models (such as BERT), encoder-decoder models (such as T5), and decoder-only models (such as the GPT series of models). People often confuse these and have misunderstandings about these classification methods and architectures.

The first thing to understand is that an encoder-decoder model is still an autoregressive model. The decoder in an encoder-decoder model is, at its core, still a causal decoder. Instead of prefilling a decoder-only model, some text can be offloaded to the encoder and then passed to the decoder via cross-attention. Yes, the T5 model is also a language model!

A variation of this type of model is the Prefix Language Model, or PrefixLM for short, which works in much the same way, but without the cross-attention (and some other small details like shared weights between encoder/decoder and no encoder bottleneck). PrefixLM is sometimes also called a non-causal decoder. In short, encoder-decoder, decoder-only models, and PrefixLM are not much different overall!
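
To make the distinction concrete, here is a minimal sketch (illustrative only) of how the attention mask differs between a plain causal decoder and a PrefixLM: the prefix portion attends bidirectionally, while the target portion stays causal.

```python
import numpy as np

def attention_mask(n_prefix: int, n_target: int, kind: str) -> np.ndarray:
    """Return a boolean [n, n] mask where entry (i, j) means 'position i may attend to j'."""
    n = n_prefix + n_target
    causal = np.tril(np.ones((n, n), dtype=bool))   # decoder-only: strictly causal
    if kind == "causal":
        return causal
    if kind == "prefix_lm":
        mask = causal.copy()
        mask[:n_prefix, :n_prefix] = True            # the prefix attends bidirectionally
        return mask
    raise ValueError(kind)

# With a 3-token prefix and a 2-token target, the top-left 3x3 block becomes fully
# visible under PrefixLM, while the target part stays causal in both cases.
print(attention_mask(3, 2, "causal").astype(int))
print(attention_mask(3, 2, "prefix_lm").astype(int))
```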

In a recent excellent lecture, Hyung Won skillfully explained the relationship between these models. For more details, see Machine Heart's report: "What will be the main driving force of AI research? A ChatGPT team research scientist: the cost of compute is decreasing."

At the same time, encoder-only models like BERT do denoising differently (i.e. in-place), and to some extent encoder-only models rely on a classification "task" head to really work after pre-training. Later, models such as T5 adopted a "modified" denoising objective that used a sequence-to-sequence format.

On this point, it is important to note that denoising in T5 is not a new objective function (in the machine learning sense), but rather a data transformation applied to the input; that is, you can also train on a span corruption objective with a causal decoder.

People always assume that encoder-decoder models must be denoising models, partly because T5 is so representative. But this is not always the case. You can train an encoder-decoder using regular language modeling tasks, such as causal language modeling. Conversely, you can train a causal decoder using the span corruption task. As I said before, this is basically a data transformation.
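
To make the "it's just a data transformation" point concrete, here is a minimal sketch (illustrative only; the sentinel strings and span-sampling details are simplified, not the exact T5 recipe) that turns a token sequence into a span-corrupted (input, target) pair, which can feed an encoder-decoder or simply be concatenated for a causal decoder.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Span corruption as a pure data transformation (simplified T5-style recipe).

    Random spans are replaced by sentinel tokens in the input; the corrupted spans,
    each preceded by its sentinel, are moved to the target.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if rng.random() < corruption_rate / mean_span_len:
            span_len = max(1, round(rng.gauss(mean_span_len, 1)))
            sentinel = f"<extra_id_{sentinel_id}>"   # hypothetical sentinel naming
            sentinel_id += 1
            inputs.append(sentinel)
            targets.append(sentinel)
            targets.extend(tokens[i:i + span_len])
            i += span_len
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(tokens)

# Encoder-decoder training: feed `inp` to the encoder and predict `tgt` with the decoder.
# Decoder-only training: concatenate the two and compute the loss only on the target part.
print(inp)
print(tgt)
print(inp + ["<sep>"] + tgt)
```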

Another point worth noting: in general, an encoder-decoder with 2N parameters has the same computational cost as a decoder-only model with N parameters, so their FLOP-to-parameter ratios differ. This is like a form of "model sparsity" split between the input and the target.
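
A rough back-of-envelope illustration (using the common approximation of ~2 FLOPs per parameter per token for a forward pass, and ignoring attention and cross-attention terms) shows why the FLOP-to-parameter ratios differ:

```python
def forward_flops(params: float, tokens: int) -> float:
    """~2 FLOPs per parameter per token for a dense forward pass (rough rule of thumb)."""
    return 2 * params * tokens

N = 1e9                      # parameters
n_in, n_tgt = 512, 512       # input and target lengths

# Encoder-decoder with 2N parameters: inputs pass through the N-parameter encoder,
# targets through the N-parameter decoder.
enc_dec = forward_flops(N, n_in) + forward_flops(N, n_tgt)

# Decoder-only with N parameters: the concatenated sequence passes through all N parameters.
dec_only = forward_flops(N, n_in + n_tgt)

print(enc_dec == dec_only)   # True: same compute, but 2N parameters vs N
```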

This is not new, nor is it something I came up with on my own. It was in the 2019 T5 paper, and was reiterated in the UL2 paper.

With that cleared up, let's move on to the objectives.

On the denoising objective (does it not work? Does it not scale? Or is it too easy?)

The denoising objective here refers to any variation of the "span corruption" task. This is sometimes called "infilling" or "filling in the blanks". There are many ways to vary it (span length, randomness, sentinel tokens, and so on), but you get the idea.

While the denoising objective of BERT-style models is basically in-place (i.e. the classification head sits on the masked tokens), the "T5 style" is a bit more modern: it handles denoising via a data transformation that can be fed to an encoder-decoder or a decoder-only model, where the masked spans are simply "moved to the back" for the model to predict.
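
For contrast with the span-corruption sketch above, here is a minimal illustration of BERT-style in-place masking, where the sequence keeps its length and the loss is computed only at the masked positions (the -100 ignore-label convention follows common PyTorch practice and is just an assumption here):

```python
import random

MASK, IGNORE = "[MASK]", -100   # IGNORE marks positions excluded from the loss

def bert_style_mask(tokens, mask_rate=0.15, seed=0):
    """In-place masking: the input keeps its length, labels exist only at masked positions."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            inputs.append(MASK)     # the prediction happens "in place" at this position
            labels.append(tok)      # the loss is computed here only
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # no loss contribution at unmasked positions
    return inputs, labels

tokens = "the quick brown fox jumps over the lazy dog".split()
print(bert_style_mask(tokens))
```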

The main goal of pre-training is to build internal representations that are aligned with downstream tasks in the most efficient and effective way possible. The better this internal representation is, the easier it is to use these learned representations for subsequent tasks. We all know that the simple next word prediction "causal language modeling" objective performs well and has been at the heart of the LLM revolution. The question now is whether the denoising objective can perform just as well.

From public information, we know that T5-11B performs quite well, even after alignment and supervised fine-tuning (Flan-T5 XXL achieves 55+ MMLU, which was quite good for a model of this size at the time). Therefore, we can conclude that the transfer process (pre-training → alignment) of the denoising objective works relatively well at this scale.

My opinion is that the denoising objective works well, but not well enough to be used as a standalone objective. A huge disadvantage comes from lower "loss exposure": in the denoising objective, only a small fraction of tokens are masked and learned from (i.e. taken into account in the loss). By contrast, in regular language modeling this fraction is close to 100%. This makes the sample efficiency per FLOP very low, putting the denoising objective at a big disadvantage in FLOP-based comparisons.
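
A quick illustration of the loss-exposure gap, using a typical 15% corruption rate (an assumed figure, since exact settings vary):

```python
seq_len = 1024
corruption_rate = 0.15                     # assumed, typical span-corruption setting

causal_lm_loss_tokens = seq_len            # essentially every position contributes a loss
denoising_loss_tokens = int(seq_len * corruption_rate)

print(causal_lm_loss_tokens, denoising_loss_tokens)   # 1024 vs 153 supervised tokens
# The FLOPs spent processing the sequence are comparable in both cases, so the
# denoising objective receives far less learning signal per FLOP.
```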

Another downside of the denoising objective is that it is less natural than regular language modeling, because it reformats the input/output in a weird way, which makes them less suitable for few-shot learning. (But it is still possible to tune these models to perform quite well on few-shot tasks.) Therefore, I think the denoising objective should only be used as a complementary objective to regular language modeling.

The early days of unification and why BERT-like models disappeared

BERT-like models are fading away, and not many people talk about them anymore. This also explains why we don't see super-large BERT models today. What is the reason? It is largely due to unification and a shift in task/modeling paradigms. BERT-like models are cumbersome, but the real reason they were abandoned is that people wanted to handle all tasks with one model, so a better way of doing denoising was adopted: using autoregressive models.

During 2018-2021, there was an implicit paradigm shift: from single-task fine-tuning to large-scale multi-task models. This slowly led us to the unified SFT model, which is the general model we see today. This is difficult to do with BERT. I think this has little to do with "denoising". For those who still want to use such models (i.e. T5), they found a way to reformulate the denoising pre-training task, which makes BERT-style models basically deprecated today because we have better alternatives.

More specifically, encoder-decoder and decoder-only models can express a variety of tasks without task-specific classification heads. For encoder-decoders, researchers and engineers began to find that dropping the encoder delivered results similar to a BERT encoder. Moreover, this retains the bidirectional-attention advantage that made BERT competitive with GPT at small (often production) scale.

The value of the denoising objective

The denoising pre-training objective can also learn to predict the next word in a similar way to regular language modeling. However, unlike regular causal language modeling, this requires a data transformation on the sequence so that the model can learn to "fill in the blanks" instead of simply predicting natural text from left to right.

It is worth noting that the denoising objective is sometimes also called an "infilling task" and is sometimes mixed with the regular language modeling task during pre-training.

While the exact configuration and implementation details may vary, modern LLMs today are likely to use some combination of language modeling and infilling. Interestingly, this "language model + infilling" hybrid spread around roughly the same time (e.g. UL2, FIM, GLM, CM3), with many teams bringing their own unique hybrid schemes. Incidentally, the largest model currently known to be trained this way is probably PaLM-2.
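
The exact recipes behind UL2, FIM, GLM, CM3 or PaLM-2 differ and are not fully public, so the sketch below is only a generic illustration, with made-up weights, of what sampling an objective per training example from an "LM + infilling" mixture can look like:

```python
import random

# Hypothetical mixture weights; real recipes (UL2, FIM, GLM, CM3, PaLM-2) differ.
OBJECTIVE_MIX = {
    "causal_lm": 0.5,        # plain next-token prediction
    "span_corruption": 0.3,  # T5/UL2-style infilling
    "prefix_lm": 0.2,        # causal LM with a bidirectional, loss-free prefix
}

def sample_objective(rng: random.Random) -> str:
    """Pick the training objective for one example according to the mixture weights."""
    objectives, weights = zip(*OBJECTIVE_MIX.items())
    return rng.choices(objectives, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_objective(rng) for _ in range(10)])
```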

It is also important to note that pre-training task mixtures can be staged sequentially rather than mixed at the same time; for example, Flan-T5 was initially trained on 1T span-corruption tokens, then switched to 100B tokens of the prefix language modeling objective, and was then instruction-tuned with Flan. To some extent, this is representative of mixed denoising/LM objective models. To be clear, the prefix language modeling objective (not to be confused with the architecture) is simply causal language modeling with a randomly chosen split point assigned to the input side (which receives no loss and a non-causal mask).
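
Here is a minimal sketch of the prefix language modeling objective as described above: pick a random split point, compute no loss on the prefix, and predict the continuation (the prefix would additionally receive the non-causal attention mask shown in the earlier PrefixLM sketch):

```python
import random

IGNORE = -100   # positions excluded from the loss

def prefix_lm_example(tokens, seed=0):
    """Causal LM with a random split: no loss on the prefix, loss on the continuation."""
    rng = random.Random(seed)
    split = rng.randint(1, len(tokens) - 1)          # random split point
    labels = [IGNORE] * split + list(tokens[split:]) # supervise only the target side
    return tokens, labels, split

tokens = "the quick brown fox jumps over the lazy dog".split()
print(prefix_lm_example(tokens))
```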

As an aside, infilling may have originated in the code LLM field, where "filling in the blanks" is closer to a capability actually needed when writing code. The motivation of UL2, meanwhile, is more to unify the class of tasks that denoising objectives and bidirectional LLMs do well on with inherently generative tasks (such as summarization or open-ended generation). The advantage of this autoregressive "shift to the back" in decoding is that it not only lets the model learn longer-range dependencies, but also lets it implicitly benefit from non-explicit bidirectional attention (because, in order to fill in a blank, you have already seen the future).

There is anecdotal evidence that representations learned with the denoising objective perform better on certain classes of tasks, and are sometimes more sample efficient. In the U-PaLM paper, we show how up-training with a small amount of span corruption changes behavior and emergence on a set of BIG-Bench tasks. Building on this, fine-tuning models trained with this objective often yields better supervised fine-tuned models, especially at smaller scales.

In terms of single-task fine-tuning, we can see that the PaLM-1 62B model is beaten by the much smaller T5 model. On a relatively small scale, "bidirectional attention + denoising objective" is a beautiful one-two punch! I believe many practitioners have also noticed this situation, especially in production applications.

What about bidirectional attention?

Bidirectional attention is an interesting "inductive bias" for language models, one that is often conflated with the objective and the model backbone. Inductive biases serve different purposes in different compute regimes and may affect scaling curves differently. That said, bidirectional attention may matter less at larger scales than at smaller ones, or may have different effects on different tasks or modalities. For example, PaliGemma uses the PrefixLM architecture.

Hyung Won also pointed out in his talk that the PrefixLM model (a decoder-only model using bidirectional attention) also suffers from caching issues, which is an inherent flaw in this type of architecture. However, I think there are ways to address this flaw, but it is beyond the scope of this article.
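
To see where the caching issue comes from, here is a toy numpy illustration: when one more token is appended, the attention outputs of earlier positions stay unchanged under a causal mask (so their cache can be reused), but change under a bidirectional mask, which forces recomputation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(x: np.ndarray, causal: bool) -> np.ndarray:
    """Single-head self-attention over x with shape [n, d]."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    if causal:
        visible = np.tril(np.ones_like(scores, dtype=bool))
        scores = np.where(visible, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

prefix = rng.standard_normal((3, d))                         # a 3-token prefix
extended = np.vstack([prefix, rng.standard_normal((1, d))])  # ...then one more token arrives

for causal in (True, False):
    reusable = np.allclose(attend(prefix, causal), attend(extended, causal)[:3])
    print("causal" if causal else "bidirectional", "-> earlier outputs reusable:", reusable)
```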

Advantages and disadvantages of encoder-decoder architecture

The encoder-decoder architecture has both advantages and disadvantages compared to a decoder-only model. The first is that the encoder side is not restricted by the causal mask. In a way, you can be more aggressive with the attention layer and perform pooling or any form of linear attention without having to worry about the design limitations of autoregression. This is a good way to offload less important "context" to the encoder. You can also make the encoder smaller, which is also an advantage.

A good example of such an encoder-decoder architecture is Charformer, which makes aggressive use of the encoder to mitigate the speed disadvantage of byte-level models. Innovating on the encoder side yields quick wins without having to worry about the significant constraints of the causal mask.

At the same time, compared to PrefixLM, one disadvantage of the encoder-decoder is that the input and target must be allocated a fixed budget. For example, if the input budget is 1024 tokens, the encoder side must be filled to this value, which may waste a lot of computation. In contrast, in PrefixLM, the input and target can be directly connected, which can alleviate this problem.
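
A quick made-up example of the budget issue (the numbers are arbitrary):

```python
encoder_budget = 1024            # fixed length the encoder input is padded to
input_len, target_len = 200, 300

# Encoder-decoder: the encoder always processes the full padded budget.
enc_dec_positions = encoder_budget + target_len   # 1324 positions processed
wasted_padding = encoder_budget - input_len       # 824 of them are padding

# PrefixLM: input and target are simply concatenated; no fixed input budget.
prefix_lm_positions = input_len + target_len      # 500 positions processed

print(enc_dec_positions, wasted_padding, prefix_lm_positions)
```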

Relevance and key takeaways for today’s models

In today’s era, a key skill for being a qualified LLM researcher and practitioner is to be able to infer inductive biases from both the architecture and pre-training aspects. Understanding the subtle differences can help people extrapolate and continue to innovate.

Here are my key takeaways:

Encoder-decoder and decoder-only models are both autoregressive models that differ in implementation and have their own pros and cons. They impose slightly different inductive biases. Which one to choose depends on the downstream use case and application constraints. Meanwhile, BERT-style encoder models can be considered deprecated for most LLM use cases, aside from niche applications.

Denoising objectives are mainly used to supplement causal language modeling. They have been successfully used as "supporting objectives" during the training phase. Training a causal language model with an added denoising objective often helps to some extent. While this is very common in the field of code models (i.e. code infilling), it is also common today for general-purpose models to pre-train with causal language modeling plus some denoising objective.

Bidirectional attention can provide a big benefit for smaller models but is dispensable for larger ones. This is mostly anecdotal. I think of bidirectional attention as an inductive bias, similar to many other kinds of modifications made to Transformer models.

Finally, to summarize: there are no large-scale BERT models in operation anymore. BERT was deprecated in favor of the more flexible denoising (autoregressive) T5 model. This is mainly due to paradigm unification: people prefer to use one general model for a variety of tasks rather than a task-specific model. At the same time, autoregressive denoising is sometimes used as an auxiliary objective for causal language models.

Original link: https://www.yitay.net/blog/model-architecture-blogpost-encoders-prefixlm-denoising