
Watch a 2-hour movie in 4 seconds! Alibaba releases mPLUG-Owl3, a universal multimodal large model | Open source

2024-08-19


Watch a 2-hour movie in 4 seconds: the Alibaba team's new work is officially unveiled.

They have released mPLUG-Owl3, a general-purpose multimodal large model built specifically to understand multiple images and long videos.



Specifically, taking LLaVA-Next-Interleave as the baseline, mPLUG-Owl3 reduces the model's first-token latency by 6x and increases the number of images a single A100 can model by 8x, up to 400 images; in practice, it can "finish watching" a 2-hour movie in 4 seconds.

In other words, the model's inference efficiency has been greatly improved.

And that is without sacrificing model accuracy.

mPLUG-Owl3 also achieves SOTA results across the many scenarios multimodal large models are used in, including single-image, multi-image, and video benchmarks.



The authors of the paper are from the Alibaba mPLUG team. They have long been working on multimodal large-model bases, and have previously proposed:

  • the efficient multimodal base mPLUG series
  • the modular multimodal large language model base mPLUG-Owl series
  • the document understanding base mPLUG-DocOwl series

Let's take a look at what new tricks this release of mPLUG-Owl3 brings.

Use cases unlocked

The team demonstrated several common uses of mPLUG-Owl3.

Multimodal retrieval augmentation

Supplementing knowledge the model has not learned through a retrieval system is now common practice. This requires the model to accurately understand the retrieved multimodal knowledge and use it to answer questions.

In the example below, mPLUG-Owl3 not only answers correctly based on the retrieved knowledge, it also states which piece of knowledge its judgment relies on and gives a detailed justification. This is very friendly for downstream developers when building and debugging.

Retrieval part: below are several pieces of retrieved multimodal knowledge for the model to base its decision on.

Question: Based on your observations of the four dog photos above, can you classify the dog in picture 5? Give your reasons.

Answer: Based on observations of the four dog photos, the dog in picture 5 can be classified as an Irish Terrier. The dog in picture 1 is described as an Irish Terrier, and the dog in picture 5 has similar physical characteristics to the dog in picture 1…



Multi-image reasoning

This kind of situation comes up often in everyday use:

Many users do not hold a conversation around just one image. They may bring several pieces of material, and the model needs to correctly understand the relationships between the content of the different materials.

In the example below, mPLUG-Owl3 is able to activate the common sense knowledge of the language model based on what it sees in two images and reason around the two images.

Question: Can the animal on the left survive in the environment on the right? Answer and explain.

Answer (the correct answer is marked in green): the gist is that it cannot survive, because the environment is too cold.



In another example, mPLUG-Owl3 not only detects the difference in style between two images, but also realizes that, despite the different styles, the character in the two images is actually the same one.

First question: What art styles are depicted in the images, and how do they differ?

Answer: the first is pixel art, the second is cartoon style… (reasons follow)

Second question: Who appears in both paintings? What are their characteristics?

Answer: In both paintings a man wearing a cowboy hat appears, and he is depicted in two different styles.



Long video understanding

Watching extremely long videos is a task most current models cannot complete end to end, and approaches built on an agent framework struggle to guarantee response speed.

mPLUG-Owl3 can watch a 2-hour movie and start answering the user's questions within 4 seconds.

Whether the user asks about fine details from the beginning, middle, or end of the movie, mPLUG-Owl3 can answer fluently.



How is it done?

Unlike traditional models, mPLUG-Owl3 does not need to concatenate the visual sequence into the language model's text sequence in advance.

In other words, whatever is fed in, whether dozens of images or several hours of video, it does not consume the language model's sequence capacity, which avoids the huge compute overhead and GPU memory usage that long visual sequences bring.

One might ask: how, then, is the visual information integrated into the language model?



To achieve this, the team proposed a lightweight Hyper Attention module, which extends an existing Transformer block that can only model text into a new block that performs image-text feature interaction and text modeling at the same time.



By sparsely inserting just 4 such Transformer blocks across the language model, mPLUG-Owl3 can upgrade an LLM into a multimodal LLM at very low cost.

After the visual features are extracted by the visual encoder, a simple linear mapping aligns them to the language model's dimension.

Subsequently, the visual features interact with the text only inside these four Transformer blocks. Since the visual tokens undergo no compression, fine-grained information is retained.
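As a rough sketch of this pipeline (not the official code), the snippet below projects visual-encoder outputs to the language model's width and marks a handful of layers as the ones doing cross-modal interaction; the dimensions and layer indices are placeholder assumptions.

```python
import torch
import torch.nn as nn

vision_dim, lm_dim = 1024, 4096              # hypothetical encoder / LLM widths
visual_proj = nn.Linear(vision_dim, lm_dim)  # simple linear alignment layer

# e.g. 4 images x 256 patches from the visual encoder, with no token compression
visual_tokens = torch.randn(1, 4 * 256, vision_dim)
visual_features = visual_proj(visual_tokens)  # now aligned to the LM dimension

# Only a few sparsely placed Transformer blocks perform image-text interaction;
# the exact layer indices here are placeholders, not values from the paper.
hyper_attention_layers = {7, 15, 23, 31}
```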

Next, let's look at how Hyper Attention is designed.

Hyper Attention introduces a Cross-Attention operation that uses the visual features as Key and Value and the language model's hidden states as Query to extract visual information.

In recent years, other studies have considered using Cross-Attention for multimodal fusion, such as Flamingo and IDEFICS, but these works have not achieved good performance.

In the mPLUG-Owl3 technical report, the team compares their design with Flamingo's to further illustrate the key technical points of Hyper Attention.



First, Hyper Attention does not adopt a cascaded design of Cross-Attention and Self-Attention; instead, it is embedded inside the Self-Attention block.

The benefit is that far fewer new parameters are introduced, which makes the model easier to train and further improves training and inference efficiency.

Second, Hyper Attention shares the language model's LayerNorm. The distribution that LayerNorm outputs is exactly the distribution the Attention layers were trained to be stable on, so sharing this layer is crucial for stable learning of the newly introduced Cross-Attention.

Moreover, Hyper Attention adopts a parallel Cross-Attention and Self-Attention strategy, using a shared Query to interact with the visual features and fusing the two sets of features through an Adaptive Gate.

This allows the query to specifically select relevant visual features based on its own semantics.
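To make this structure concrete, here is a simplified sketch of the parallel design described above. It is not the authors' implementation: the module names, the use of standard multi-head attention layers, and the sigmoid form of the gate are assumptions for illustration, and in the real design the Cross-Attention reuses the LM block's own LayerNorm and query projection rather than adding its own.

```python
import torch
import torch.nn as nn

class HyperAttentionSketch(nn.Module):
    """Parallel self-attention (text) and cross-attention (text -> visual)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # A single shared norm feeds both attention paths, mirroring the
        # shared-LayerNorm design described above.
        self.norm = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # NOTE: in the paper the query projection is shared between the two
        # paths; two separate attention modules are used here only for brevity.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, 1)   # adaptive gate (assumed sigmoid form)

    def forward(self, text_states, visual_feats, cross_mask=None):
        h = self.norm(text_states)
        # Original LM path: self-attention over the text sequence.
        text_out, _ = self.self_attn(h, h, h)
        # Parallel path: text queries attend to visual keys/values.
        vis_out, _ = self.cross_attn(h, visual_feats, visual_feats,
                                     attn_mask=cross_mask)
        # The gate lets each text token decide how much visual signal to mix in.
        g = torch.sigmoid(self.gate(h))
        return text_states + text_out + g * vis_out
```

In this sketch, a gate value near zero lets a token fall back to pure text modeling, while a value near one mixes in the visual features it attended to.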

The team also found that the relative positions of images and text in the original context are very important for the model to understand multimodal input well.

To model this property, they introduced MI-Rope, a multimodal interleaved rotary position embedding, to encode the position information of the visual Keys.

Specifically, they pre-record the position of each image in the original text and use that position to compute the corresponding Rope embedding; all patches of the same image share this embedding.
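The snippet below sketches this core idea; the function name and the numbers are illustrative, not taken from the paper's implementation.

```python
def visual_key_positions(image_text_positions, patches_per_image):
    """Assign each visual patch the text position of the image it belongs to.

    image_text_positions: e.g. [3, 17], meaning image 1 and image 2 appeared
    at those positions in the original interleaved token sequence.
    """
    positions = []
    for pos in image_text_positions:
        # Every patch of the same image shares that image's position, so the
        # rotary embedding reflects where the image sat in the context.
        positions.extend([pos] * patches_per_image)
    return positions

# Example: two images at text positions 3 and 17, 4 patches each.
print(visual_key_positions([3, 17], patches_per_image=4))
# [3, 3, 3, 3, 17, 17, 17, 17]
```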

In addition, they introduce an attention mask in the Cross-Attention, so that text appearing before an image in the original context cannot see the features of images that come after it.
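Below is a sketch of how such a mask could be constructed, assuming the PyTorch boolean convention where True means an attention link is blocked; the names and shapes are illustrative rather than the authors' code.

```python
import torch

def cross_attention_mask(text_len, image_text_positions, patches_per_image):
    """Block text tokens from attending to patches of images that appear later."""
    num_visual = len(image_text_positions) * patches_per_image
    # Start fully blocked (True = not allowed to attend).
    mask = torch.ones(text_len, num_visual, dtype=torch.bool)
    for i, img_pos in enumerate(image_text_positions):
        start = i * patches_per_image
        # Text tokens at or after the image's position may see its patches.
        mask[img_pos:, start:start + patches_per_image] = False
    return mask

# Two images inserted at text positions 3 and 6, 2 patches each, 8 text tokens.
print(cross_attention_mask(8, [3, 6], patches_per_image=2))
```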

In summary, these design features of Hyper Attention bring further efficiency improvements to mPLUG-Owl3 and ensure that it still has first-class multimodal capabilities.



Experimental Results

In experiments on a wide range of datasets, mPLUG-Owl3 achieves SOTA results on most single-image multimodal benchmarks, and on many evaluations it even surpasses models of larger size.



At the same time, in multi-image evaluations, mPLUG-Owl3 also surpasses LLaVA-Next-Interleave and Mantis, which are specifically optimized for multi-image scenarios.



In addition, on LongVideoBench, a benchmark dedicated to evaluating long-video understanding, it even surpasses existing models with 52.1 points.



The R&D team also proposed an interesting evaluation method for long visual sequences.

As we all know, in real human-computer interaction not every image serves the user's question: the historical context fills up with multimodal content irrelevant to the question, and the longer the sequence, the more serious this becomes.

To evaluate how well the model resists interference under long visual sequence input, they built a new evaluation dataset.

For each MMBench circular-evaluation sample, they introduce irrelevant images, shuffle the image order, and then ask questions about the original images to see whether the model can still answer correctly. (For each question, 4 samples with different option orders and different distractor images are constructed, and the question counts as correct only if all of them are answered correctly.)
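Sketched below is the all-or-nothing scoring rule described in the parenthesis, with hypothetical helper names; it is not the team's evaluation code.

```python
def circular_accuracy(results_per_question):
    """results_per_question: one inner list per question, holding the
    correctness (True/False) of its 4 perturbed variants."""
    # A question is solved only if every one of its variants is answered correctly.
    solved = sum(all(variants) for variants in results_per_question)
    return solved / len(results_per_question)

# Example: 3 questions; only the first has all 4 variants answered correctly.
print(circular_accuracy([[True] * 4, [True, False, True, True], [False] * 4]))
# 0.333...
```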

In the experiment, the inputs are grouped into several levels according to the number of images.

It can be seen that models not trained on multiple images, such as Qwen-VL and mPLUG-Owl2, fail quickly.



LLaVA-Next-Interleave and Mantis, which have been trained on multiple images, keep decay curves similar to mPLUG-Owl3's at first, but once the number of images reaches the 50-image scale, these models can no longer answer correctly.

mPLUG-Owl3, however, holds out to 400 images and still maintains 40% accuracy.

That said, although mPLUG-Owl3 surpasses existing models, its accuracy here is far from excellent. It is fair to say that this evaluation mainly reveals that every model's resistance to interference under long sequences still needs to improve.

For more details, please refer to the paper and code.