
The first post in former Google scientist Yi Tay's blog series "LLM Story": Why did BERT disappear from the scene?

2024-07-21



New Intelligence Report

Editors: Yongyong, Qiao Yang

【New Intelligence Introduction】Former Google scientist Yi Tay has launched a blog series on "Model Architecture in the LLM Era". The first post looks at how the encoder-only BERT was displaced by the encoder-decoder T5, tracing the causes of BERT's decline and weighing the strengths and weaknesses of the different architectures. Learning from this history matters for future innovation.

Yi Tay, a former Google scientist and avid blogger, recently found himself bored on a plane and ended up writing an in-depth article on a topic many people care about today: the rise, fall, and evolution of model architectures in the LLM era.

This time, Yi Tay tries to untangle what has been happening in the new LLM era: what happened to BERT and T5, and the rise and fall of Transformer encoders, PrefixLM, and denoising objectives.


Blog address: https://www.yitay.net/blog/model-architecture-blogpost-encoders-prefixlm-denoising

Why are encoder-only models no longer popular? Why can’t BERT scale even though it’s so powerful?

It is difficult to see the full picture when you are in the middle of it. Yi Tay shared his observations and thoughts on these issues that have industry insiders scratching their heads.

Yi Tay also said that this is just the first in a series of blog posts, and we can look forward to more content from him on the topic of "Model Architecture in the LLM Era" in the future.


Decided to start a new blog series on model architectures in the LLM era. Here is part 1, covering the broader architectures like Transformer encoders/encoder-decoders, PrefixLM, and denoising objectives. A question many people ask is where the encoder models went; people who worked in language and NLP 5+ years ago are scratching their heads wondering why, if BERT works so well, it was never scaled up. Also, what happened to encoder-decoder or pure encoder models? Are denoising objectives any good? I share my thoughts in this blog post.

Yi Tay is like a "storyteller" in the LLM era. In his blog, he succinctly sorted out the development of model architecture in the past few years and put forward his own insights.

Background

To make the story easier to follow for readers who are less close to the technology, Yi Tay first lays out the background.

Over the past few years, there have been three major paradigms in model architecture:

encoder-only models (such as BERT), encoder-decoder models (such as T5), and decoder-only models (such as the GPT series).


But this division confuses people and has led to plenty of misunderstandings, which is why Yi Tay wrote this blog post: he hopes to help everyone build a clearer understanding.

The first thing to be clear about is that an encoder-decoder model is still an autoregressive model: the decoder in an encoder-decoder is, both in name and in substance, a causal decoder.

Instead of prefilling a decoder-only model, the input text is first passed to the encoder and then routed to the decoder via a cross-attention mechanism.

Therefore, the T5 model is also a language model!

A variation of this is the Prefix Language Model, or PrefixLM, architecture, which does almost the same thing minus the cross-attention mechanism (along with some other minor details, such as sharing weights between encoder and decoder and having no encoder bottleneck).

PrefixLM is sometimes also called a non-causal decoder. In short, encoder-decoder, decoder-only, and PrefixLM models are not that different!
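
To make the distinction concrete, here is a minimal NumPy sketch (not from Yi Tay's post) of the two attention masks involved: a causal decoder only lets each token see the past, while a PrefixLM, or non-causal decoder, is bidirectional over the prefix and causal over the target.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """1 where attention is allowed: each position sees itself and the past."""
    return np.tril(np.ones((seq_len, seq_len), dtype=np.int8))

def prefix_lm_mask(prefix_len: int, target_len: int) -> np.ndarray:
    """PrefixLM / non-causal decoder mask: bidirectional over the prefix,
    causal over the target (which can also see the entire prefix)."""
    total = prefix_len + target_len
    mask = np.tril(np.ones((total, total), dtype=np.int8))
    mask[:prefix_len, :prefix_len] = 1  # prefix tokens attend to each other freely
    return mask

print(causal_mask(4))
print(prefix_lm_mask(prefix_len=2, target_len=2))
```

Everything else about the model can stay the same, which is why these variants are so closely related.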

If you still have doubts about this, Yi Tay also points to a reference: Hyung Won Chung's excellent lecture at Stanford this April, in which he neatly explained the relationships between these models.


Speech URL: https://www.youtube.com/watch?v=orDKvo8h71o

Meanwhile, encoder-only models such as BERT do their denoising differently (namely, in place), and to some extent they rely on additional "task heads" attached to the pre-trained base model in order to perform downstream tasks.

BERT’s denoising objective was later applied to models such as T5, but with some modifications, using a sequence-to-sequence format.

At this point, it’s worth noting that denoising in T5 is not exactly a new objective function in itself (in the machine learning sense), but rather a data transformation across inputs, i.e., you can also train on the span corruption objective in a causal decoder!

People tend to assume that encoder-decoder models must be denoising models, partly because the T5 model is such a representative example.

However, this is not always the case.

You can train the encoder-decoder using the regular language modeling task (i.e. CLM), or you can train the causal decoder using the span corruption task.

As mentioned before, this is primarily a data transformation.

Also note that, in general, an encoder-decoder with 2N parameters has the same computational cost as a decoder-only model with N parameters, which gives it a different FLOP-to-parameter ratio.
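
A rough back-of-the-envelope calculation illustrates why, assuming the common approximation of about 2 × parameters FLOPs per token and ignoring cross-attention (so the numbers are illustrative, not exact):

```python
# Rough sketch, not from the post: compare compute for one input/target pair.
N = 1e9              # decoder-only parameter count
inp, tgt = 512, 512  # input and target lengths in tokens

# Decoder-only: every token (input + target) passes through all N parameters.
flops_decoder_only = 2 * N * (inp + tgt)

# Encoder-decoder with 2N parameters split evenly: input tokens pass through
# the N-parameter encoder, target tokens through the N-parameter decoder.
flops_enc_dec = 2 * N * inp + 2 * N * tgt

print(flops_decoder_only == flops_enc_dec)  # True: same compute, twice the parameters
```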

With this background in place, we can now move on to the main discussion.

About the denoising objective (is it useless? does it not scale? is it too simple?)

To be clear, the denoising objective that Yi Tay refers to is any variant of span corruption.

It is sometimes also called infilling, or fill-in-the-blank. It can be expressed in many ways (varying span lengths, randomness, sentinel tokens, and so on), but they all amount to the same thing.

While the denoising objective in BERT-style models is applied mostly in place, a slightly more modern approach is the "T5 style": a data transformation that can be processed by an encoder-decoder or a decoder-only model.


In this data transformation, the masked spans are simply "moved to the back" of the sequence for the model to predict.
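
As an illustration (a toy sketch under simplified assumptions, not T5's actual preprocessing code), span corruption can be implemented as a plain data transformation: random spans are replaced with sentinel tokens in the input, and the dropped spans, each prefixed by its sentinel, become the target.

```python
import random

def span_corrupt(tokens, corruption_rate=0.15, mean_span_len=3, seed=0):
    """Toy T5-style span corruption: replace random spans with sentinel tokens
    and move the corrupted spans to the target, each prefixed by its sentinel."""
    rng = random.Random(seed)
    inputs, targets = [], []
    i, sentinel_id = 0, 0
    while i < len(tokens):
        if rng.random() < corruption_rate / mean_span_len:
            span_len = max(1, round(rng.gauss(mean_span_len, 1)))
            sentinel = f"<extra_id_{sentinel_id}>"
            inputs.append(sentinel)
            targets.append(sentinel)
            targets.extend(tokens[i:i + span_len])
            i += span_len
            sentinel_id += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

src = "the quick brown fox jumps over the lazy dog".split()
inp, tgt = span_corrupt(src)
print(inp)  # original tokens with spans replaced by sentinels
print(tgt)  # sentinels followed by the dropped spans
```

Because the output is just an (input, target) pair of token sequences, the same transformed data can be fed to an encoder-decoder or concatenated and fed to a decoder-only model.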

The main goal of pre-training is to build useful internal representations that align with downstream tasks in the most efficient and effective way.

The better the internal representations, the easier it is to use these learned representations to do useful things later.

It is well known that the simple causal language modeling (CLM) objective of using next token prediction does this well and has been the foundation of the LLM revolution. The question now is whether the denoising objective is equally good.

From publicly available information, we know that T5-11B works very well even after alignment/SFT (Flan-T5 XXL scores 55+ on MMLU, which was quite good for a model of that size at the time).

From this we can conclude that the transfer from denoising pre-training to alignment works well enough to support a model at the 11B scale.

Yi Tay's view is that the denoising objective is great, but far from sufficient as a standalone training objective.

Its drawback can be described as low "loss exposure": with a denoising objective, only a small fraction of tokens are masked and therefore contribute to the learning process (i.e. to the loss).

In contrast, in regular language modeling, the token utilization is close to 100%.

This property makes the denoising objective far less sample-efficient per FLOP, which puts it at a serious disadvantage in FLOP-matched comparisons.
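
A tiny piece of arithmetic, using a typical 15% corruption rate as an illustrative assumption, shows how much less loss signal a denoising objective extracts from the same sequence.

```python
# Illustrative numbers only (a typical span-corruption setting, not figures from the post).
seq_len = 512
corruption_rate = 0.15  # fraction of tokens masked under span corruption

tokens_in_loss_denoising = seq_len * corruption_rate  # only masked tokens are predicted
tokens_in_loss_clm = seq_len                          # (almost) every token is predicted

print(tokens_in_loss_denoising)                       # 76.8 tokens contribute to the loss
print(tokens_in_loss_clm)                             # 512 tokens contribute to the loss
print(tokens_in_loss_clm / tokens_in_loss_denoising)  # ~6.7x more loss signal per sequence
```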

Another downside is that the denoising objective is less natural than regular language modeling, as it reformats the input/output in a weird way, which makes them a bit awkward for few-shot learning. (Nevertheless, it is still possible to tweak these models to perform reasonably well on few-shot tasks)

Therefore, Yi Tay believes the denoising objective should essentially only be used as a complement to regular language modeling, not as a standalone training objective.

The early days of unification and why xBERT went extinct

The phase-out of BERT-like models was an interesting period, but one that few people talk about these days, and the reasons for it are subtle.

This would also explain why we no longer see any super large BERT models running. What is the reason?

This is mainly a question of unification and shift in task/modeling paradigms.

BERT-style models are clumsy, but the real reason they were abandoned is that everyone wanted to use one model for all tasks, which led to a better way of doing denoising: with autoregressive models.

Between 2018 and 2021, there was a subtle paradigm shift from single-task fine-tuning to large-scale multi-task models.

Everyone's attention was slowly drawn to the unified SFT model, which is also the unified universal model we see today.

It is really hard to do this with BERT.

However, this "clumsiness" of BERT has little to do with the denoising task itself; if you still want denoising, you can simply express it in a different form (as T5 does).

As a result, BERT-style models are pretty much deprecated at this point in time, as a strictly better alternative has emerged.

More specifically, encoder-decoder and decoder-only models are able to express multiple tasks simultaneously without the need for a task-specific classification head.

At the same time, researchers and engineers found that if you simply pull the encoder out of an encoder-decoder model and use it on its own, it is just as competitive as a BERT encoder.

Not only that, the extracted encoder also retains the bidirectional-attention advantage that made BERT outperform GPT models on small-scale (often production-scale) tasks.

The value of the denoising objective

Like regular language modeling, the denoising pre-training objective also trains the model to predict the next token.

Unlike regular CLM, however, it transforms the sequence so that the model learns to "fill in the blanks" rather than simply predict text left to right as it naturally occurs.

It is worth noting that the denoising objective is sometimes also called an "infilling" task, and is sometimes mixed in with the regular language modeling task during pre-training.

While the exact configuration and implementation details may vary, today's LLMs are likely to use some combination of language modeling and infilling.

Also, interestingly, hybrids of language modeling and infilling seem to have been adopted around the same period (e.g. UL2, FIM, GLM, CM3), with many teams bringing their own "flavor" of the mix.

By the way, the largest publicly disclosed and reported model trained in this way is probably PaLM-2.


It's worth noting that mixed training does not necessarily have to mix the objectives at the same time; they can also be stacked sequentially.

For example, Flan-T5 was first trained on 1T tokens of span corruption and then switched to 100B tokens of prefix language modeling before instruction fine-tuning.

In a way, this can be said to be a hybrid model with denoising/language modeling objectives.
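
A minimal sketch of what such a sequentially stacked schedule could look like is shown below; the token budgets follow the Flan-T5 numbers quoted above, while the train() helper and objective names are purely hypothetical placeholders, not a real API.

```python
# Hypothetical sketch of a sequentially stacked pre-training schedule.
schedule = [
    {"objective": "span_corruption",           "tokens": 1_000_000_000_000},  # 1T tokens
    {"objective": "prefix_language_modeling",  "tokens": 100_000_000_000},    # 100B tokens
    {"objective": "instruction_finetuning",    "tokens": None},               # instruction mixture
]

def train(model, objective, tokens):
    """Placeholder: run one training stage with the given objective and token budget."""
    print(f"training with {objective} for {tokens} tokens")

model = object()  # stand-in for an actual model
for stage in schedule:
    train(model, stage["objective"], stage["tokens"])
```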


Yi Tay also shared an informal observation: models pre-trained with denoising objectives perform better on certain classes of tasks, sometimes in a more sample-efficient way.

Fine-tuning models trained with this objective often results in better SFT models, especially at smaller scales.

When it comes to single-task fine-tuning, we can see that the PaLM-1 62B model is beaten by the smaller T5.

Bidirectional attention plus a denoising objective can make a huge difference at relatively small scales! I believe many practitioners have seen this themselves, especially in production settings.

Advantages and disadvantages of the encoder-decoder architecture

The encoder-decoder architecture actually has some advantages over regular decoder-only models.

The encoder side is not restricted by a causal mask: you can stack attention layers as aggressively as you like, apply aggressive pooling, or use any form of linear attention, without worrying about the constraints of autoregressive design.

This is a great way to pass less important "context" to the encoder. You can also make the encoder smaller, which is also nice.

Charformer is one example where the encoder is essential: aggressive work on the encoder side mitigates the speed drawbacks of encoding at the byte level.

Meanwhile, a disadvantage of the encoder-decoder compared to PrefixLM is that inputs and targets must be given fixed, pre-allocated lengths.


For example, if the predetermined input length is 1024 tokens, the encoder side must be padded to this value, which may cause a lot of computational waste.

In contrast, in PrefixLM, the input and target can be directly concatenated, alleviating this problem.
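
A toy comparison (with assumed budgets of 1024 tokens per side, not figures from the post) shows how much padding the fixed allocation can waste relative to simply concatenating input and target as PrefixLM does.

```python
# Toy illustration: fixed 1024/1024 encoder/decoder budgets vs. concatenation.
# (In practice PrefixLM batches are still padded to the longest example,
# but input and target share a single budget.)
enc_budget, dec_budget = 1024, 1024

examples = [(300, 50), (900, 700), (120, 10)]  # (input_len, target_len) pairs

for inp, tgt in examples:
    enc_dec_tokens = enc_budget + dec_budget  # always pay the full padded budget
    prefix_lm_tokens = inp + tgt              # just concatenate input and target
    waste = enc_dec_tokens - prefix_lm_tokens
    print(f"in={inp:4d} tgt={tgt:4d}  enc-dec={enc_dec_tokens}  "
          f"prefixlm={prefix_lm_tokens}  wasted padding={waste}")
```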

Relevance and key takeaways for today’s models

Whether from the perspective of model architecture or pre-training, the ability to reason about inductive biases is essential to becoming a competent LLM researcher and practitioner, and understanding the basic nuances between different model architectures will help drive future innovation.

Yi Tay shared his key takeaways:

  • Both encoder-decoder and decoder-only models are autoregressive models, but differ in implementation, with pros and cons. They have subtly different inductive biases, and the best usage really depends on the downstream use case and quite a few application constraints. For most LLM applications and niche use cases, BERT-style encoder-only models are mostly considered obsolete.

  • The denoising objective is mainly a complement to CLM and is usually used as an "auxiliary objective" during pre-training to provide a small boost. This is most visible in code models (i.e. code infilling), but it is also not uncommon today for general-purpose models to pre-train with CLM plus some denoising objective (although it is not a requirement).

  • Bidirectional attention helps a lot at smaller scales but is generally optional at larger scales. Yi Tay thinks of bidirectional attention as a kind of inductive bias, much like many other modifications to the Transformer architecture.

Finally, to summarize, we did not see any successful scaling of xBERT: the BERT model was deprecated in favor of the more flexible denoising (autoregressive) T5 model.

This is mainly due to the unification of paradigms, where people want to use general models rather than task-specific models.

At the same time, autoregressive denoising is sometimes folded into CLM as an additional training objective.

About the author

Yi Tay is currently the co-founder and chief scientist of Reka, an AI startup dedicated to building state-of-the-art generative models and advancing AI research.


Prior to this, he was a senior research scientist at Google Brain working on LLMs and AI, and also served as a tech lead in Google Research's US team, working on Transformer scaling and architecture.

During his time at Google, Yi Tay contributed to approximately 20 product launches.

During Yi Tay's tenure as a research scientist at Google, most of his published works revolved around Transformer, especially related to efficiency, scalability, and architecture research.


Besides blogging, Yi Tay is also a fan of classical music. He said, “If I hadn’t become a researcher, I might have wanted to be a professional musician.” Interestingly, he does have a diploma in this field.

Here's hoping Yi Tay takes another long-haul flight soon, so we can see his blog updated again.


References:

https://x.com/YiTayML/status/1813262126162845772