
Generative AI may usher in its next breakthrough: the TTT model

2024-07-18


The focus of the next generation of generative artificial intelligence (AI) may be test-time training models, TTT for short.

The Transformer architecture underpins OpenAI's video model Sora and sits at the core of text-generation models such as Anthropic's Claude, Google's Gemini, and OpenAI's flagship model GPT-4o. But the evolution of these models is now running into technical obstacles, especially around compute: Transformers are not particularly efficient at processing and analyzing large amounts of data, at least on off-the-shelf hardware. Companies have built and expanded infrastructure to meet Transformers' demands, driving a sharp increase in electricity consumption that may not be sustainable.

This month, researchers from Stanford University, UC San Diego, UC Berkeley, and Meta jointly announced the TTT architecture, which they spent a year and a half developing. According to the team, TTT models can not only process far more data than Transformers but also consume less computing power while doing so.

Why is the TTT model considered more promising than Transformers? To understand, start with a fundamental component of Transformers: the "hidden state", which is essentially a very long list of data. When a Transformer processes something, it adds entries to the hidden state in order to "remember" what it just processed. For example, if the model is working through a book, the hidden-state entries will be representations of words (or parts of words).
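To make this concrete, here is a minimal Python sketch (not the researchers' code; the names and sizes are illustrative assumptions) of a hidden state that grows by one entry for every token processed:

    import numpy as np

    d_model = 64                 # size of each representation (illustrative)
    hidden_state = []            # the "very long list" described above

    def process_token(token_embedding):
        # "Remember" what was just processed by appending its representation.
        hidden_state.append(token_embedding)

    for _ in range(10_000):      # e.g., the tokens of a book
        process_token(np.random.randn(d_model))

    print(len(hidden_state))     # grows linearly with the input: 10000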

Yu Sun, a postdoctoral fellow at Stanford University who took part in the TTT research, recently explained to the media that if you think of the Transformer as an intelligent entity, then the lookup table, its hidden state, is the Transformer's brain. This brain enables some of the Transformer's well-known capabilities, such as in-context learning.

Hidden states are part of what makes Transformers powerful, but they also hold them back. For a Transformer to "say" even a single word about a book it has just read, the model must scan its entire lookup table, a task as computationally demanding as rereading the whole book.
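Continuing the illustrative sketch above (again a toy under assumed names and sizes, not real Transformer code), the following shows why producing one new word touches everything already stored: an attention step scores the query against every cached entry, so the work per new token grows with the length of what has been read.

    import numpy as np

    d_model, n_read = 64, 10_000
    cache = [np.random.randn(d_model) for _ in range(n_read)]   # the whole "book"

    def attend(query, cache):
        keys = np.stack(cache)                  # (n_read, d_model): scan it all
        scores = keys @ query                   # one similarity score per past entry
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ keys                   # weighted summary of the past

    context = attend(np.random.randn(d_model), cache)   # cost scales with n_read, every time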

So Sun and the other TTT researchers came up with the idea of replacing the hidden state with a machine learning model: like AI's nesting dolls, a model within a model. Unlike a Transformer's lookup table, the TTT model's internal machine learning model does not keep growing as it processes more data. Instead, it encodes the data it processes into representative variables called weights, which is what makes the TTT model so efficient. No matter how much data a TTT model processes, the size of its internal model stays the same.
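The toy sketch below is one hedged way to picture that idea. It is not the published TTT code; the linear inner model, learning rate, and update rule are illustrative assumptions. The point is that "remembering" new data becomes a gradient step on fixed-size weights rather than an append to an ever-growing list.

    import numpy as np

    d_model = 64
    W = np.zeros((d_model, d_model))     # inner model's weights: fixed size
    lr = 0.01                            # assumed learning rate

    def update_state(W, chunk):
        # Self-supervised step: try to reconstruct each representation through W.
        for x in chunk:
            err = W @ x - x              # reconstruction error on the new data
            W -= lr * np.outer(err, x)   # gradient step; W's shape never grows
        return W

    for _ in range(1_000):               # stream arbitrarily many chunks of data
        W = update_state(W, np.random.randn(16, d_model))

    print(W.shape)                       # still (64, 64), regardless of input length

However the real inner model is parameterized, the property this sketch illustrates is the one the researchers emphasize: the state that does the "remembering" stays the same size no matter how long the input stream is.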

Sun believes that future TTT models will be able to efficiently process billions of pieces of data, from words and images to audio recordings and videos, far beyond what existing models can handle. A TTT-based system can say X words about a book without the computational cost of rereading the book X times. "Large-scale video models based on Transformers, such as Sora, can only process 10 seconds of video because they only have a lookup-table 'brain'. Our ultimate goal is to develop a system that can process long videos similar to the visual experience of a human life."

Will the TTT model eventually replace Transformers? Media coverage suggests it is possible, but it is too early to draw conclusions. TTT is not a drop-in replacement for Transformers today: the researchers have only built two small models for their study, so it is currently difficult to compare TTT's results with those of larger Transformer models.

Mike Cook, a senior lecturer in the Department of Informatics at King's College London who was not involved in the TTT study, commented that TTT is a very interesting innovation: if the data bears out the claim that it improves efficiency, that is good news, but he could not say whether TTT is better than existing architectures. Cook recalled that when he was an undergraduate, an old professor liked to tell a joke: how do you solve any problem in computer science? Add another layer of abstraction. Putting a neural network inside a neural network reminded him of that punchline.