
AI can also "watch TikTok"! Tsinghua University leads the release of a new model for short video omnimodal understanding | ICML 2024

2024-07-31



New Intelligence Report

Editors: LRST, So Sleepy

【New Intelligence Introduction】Audio-visual large language models often fail to make full use of speech when processing video content. The video-SALMONN model introduces three innovations: audio-visual encoding with temporal alignment, a multi-resolution causal Q-Former, and a diversity loss function combined with training on mixed unpaired audio-video data. The model not only performs well on single-modality tasks, but also delivers excellent performance on joint audio-visual tasks, demonstrating its comprehensiveness and accuracy.

To understand a short video, visual content alone is not enough: auditory information such as background music, sound effects and spoken dialogue also plays a key role.

Audio-visual large language models (av-LLMs) have made significant progress in recent years, but speech, the main carrier of human language in video, has not been fully explored in these models. Speech not only provides rich linguistic and semantic information, but also helps in understanding the emotions and underlying meanings of a video.

At the same time, speech signals also contain rich paralinguistic information, such as tone and pitch, as well as diverse speaker attributes (such as age, gender, accent, and identity), which are indispensable elements in video understanding.

However, enhancing the speech capabilities of general audio-visual large language models is extremely challenging. It requires sophisticated temporal modeling and complex interactions with other modalities at both coarse (e.g., video topics) and fine (e.g., lip movements) time scales.

Recently, the paper "video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models", published at ICML 2024, introduced the first large model that combines all the audio and visual elements of a video: natural images, text, speech, audio events, and music.


Video-SALMONN paper link: https://openreview.net/pdf?id=nYsh5GFIqX

SALMONN paper link: https://arxiv.org/abs/2310.13289

GitHub page: https://github.com/bytedance/SALMONN

Let's look at the following examples to see how video-SALMONN interprets video content through speech, audio and visual information!

First, let's take a look at a classic clip from Titanic

Let video-SALMONN talk about why this video is so romantic:


question:

Please explain why the video is romantic based on the video, audio and character dialogue

answer:

This video shows a romantic scene between a man and a woman on a boat. The man holds the woman in his arms as they watch the sunset together. The background music is a song, which adds to the romantic atmosphere. The woman says "I'm flying" and "Jack", which shows that they are happy and enjoying the moment. The background of the boat and the sunset creates a beautiful and peaceful environment, which enhances the romantic feeling of the video. The body language and facial expressions of the man and woman also convey their love and affection for each other. Overall, this video perfectly shows a romantic and intimate moment between two people.

Not only does video-SALMONN describe the scenes it sees, it also weaves the music it hears and the characters' dialogue into its description.

What's even more surprising is that video-SALMONN can also tell that it was a woman who said "I'm flying".

And if you ask further, video-SALMONN can infer that the man's name is "Jack".

We all know that it is precisely these two lines of dialogue that made this scene a classic, and video-SALMONN has captured its essence!

Here's another clip, from the animated film Zootopia:

Check out video-SALMONN's take on what makes the sloth funny:


question:

There is a rabbit, a fox, and a sloth in the scene. What does the sloth say? Why is the sloth interesting?

answer:

The sloth says "Nick" at the end of the video. The sloth is funny because it is a slow-moving animal and is shown to be very lazy in the video, taking a long time to do things. This is in stark contrast to the active and energetic other animals (such as rabbits and foxes), making the sloth stand out and adding to the humor. In addition, the sloth's slow motion and expressions in the video also add to its comedic effect.

This example shows how the speaker can be identified by combining audio and video: from the audio alone, we don't know it is the sloth speaking, and from the visuals alone, we don't know what the sloth said. Only by combining the two can we truly understand the video!

There’s also this very popular meme:

Ask video-SALMONN why this video is funny:


question:

Can you explain why this video is funny?

answer:

What makes this video funny is that it shows a man making a funny face and then saying "nice". The contrast between his serious expression and the word "nice" creates a sense of humor and surprise. In addition, the sound effect and loud blast of the man burping also add to the entertainment value of the video.

Let’s take a look at another clip from a National Geographic documentary to see how understanding speech can help:

Ask video-SALMONN what kind of shark is in the video:


question:

What kind of shark is this in the video?

Answer with audio input:

The shark in the video is a whale shark

Answer without audio input:

The shark in the video is a great white shark

Just by watching the video, without specialist knowledge you really can't tell what kind of shark it is. The model can only guess, which makes it prone to hallucination.


But after listening to the narration, video-SALMONN knows that the video is about a whale shark, with no need to guess.

There is also a clip of a conversation between several well-known figures:

Not only does video-SALMONN understand the content of the conversation, it also links what is said to who is speaking:


Question:

Based on the video and audio, is it a man or a woman who flew to Florence in the video?

answer:

It was a woman who flew to Florence

This example further demonstrates video-SALMONN's comprehensive understanding of the speech in the video.

If we only fed the model the text transcript of the speech, it would have no way of knowing who flew to Florence. video-SALMONN, however, tightly combines the speech content with the speaker information.

Now for the technical part: let's look at how video-SALMONN achieves this comprehensive understanding of speech, audio and video!

Core Technology


video-SALMONN model

To address the challenge of speech understanding in audio-visual large models, video-SALMONN makes the following three innovations, enabling it to process the basic elements of a video, including natural images, visual frame sequences, speech, audio events, and music:

Part I: Audio-Visual Encoding and Temporal Alignment

video-SALMONN uses the Whisper speech encoder and the BEATs audio encoder to obtain the speech and audio feature sequences respectively (50 vectors per second of audio), and the InstructBLIP visual encoder to obtain the visual feature sequence at a sampling rate of 2 FPS (2 frames per second of video, corresponding to 64 vectors).

The three sequences are aligned on the video frames and spliced together along the temporal dimension every 0.5 seconds (fine-grained temporal synchronisation). Because the speech and audio sequences can be slightly shorter than the visual sequence, the shorter parts are zero-padded.
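To make the numbers above concrete, here is a minimal sketch (not the authors' code) of this 0.5-second block alignment. The rates, 50 speech/audio vectors and 64 visual vectors per second, follow the description above; the assumption that all three streams have already been projected to a common feature dimension, and the exact splicing order inside each block, are illustrative.

```python
import torch

# Feature rates taken from the description above; the feature dimension D is assumed
# to be shared by all three streams after projection (an illustrative assumption).
SPEECH_PER_SEC = 50   # Whisper encoder vectors per second
AUDIO_PER_SEC = 50    # BEATs encoder vectors per second
VISUAL_PER_SEC = 64   # 2 frames per second from the InstructBLIP visual encoder
BLOCK_SEC = 0.5       # synchronisation granularity

def pad_to(x: torch.Tensor, length: int) -> torch.Tensor:
    """Zero-pad a (T, D) feature sequence along time to the given length."""
    if x.size(0) >= length:
        return x[:length]
    pad = torch.zeros(length - x.size(0), x.size(1), dtype=x.dtype)
    return torch.cat([x, pad], dim=0)

def align_and_splice(speech: torch.Tensor, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
    """Cut each modality into 0.5 s blocks and splice them block by block in time."""
    n_blocks = visual.size(0) // int(VISUAL_PER_SEC * BLOCK_SEC)
    sp_blk = int(SPEECH_PER_SEC * BLOCK_SEC)   # 25 speech vectors per block
    au_blk = int(AUDIO_PER_SEC * BLOCK_SEC)    # 25 audio vectors per block
    vi_blk = int(VISUAL_PER_SEC * BLOCK_SEC)   # 32 visual vectors per block

    # Speech/audio can be slightly shorter than the visual track: zero-pad them.
    speech = pad_to(speech, n_blocks * sp_blk)
    audio = pad_to(audio, n_blocks * au_blk)

    blocks = []
    for i in range(n_blocks):
        blocks.append(torch.cat([
            speech[i * sp_blk:(i + 1) * sp_blk],
            audio[i * au_blk:(i + 1) * au_blk],
            visual[i * vi_blk:(i + 1) * vi_blk],
        ], dim=0))
    return torch.cat(blocks, dim=0)  # one temporally synchronised multimodal sequence
```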

Part II: Multi-Resolution Causal Q-Former

The multi-resolution causal (MRC) Q-Former structure is the innovative core of video-SALMONN. It not only aligns the temporally synchronised audio-visual input features with the text representation space at multiple time scales, meeting the needs of tasks that depend on different video elements, but also introduces a causal self-attention structure with a special causal mask to reinforce the temporal causal relationships between events across consecutive video frames.


MRC Q-Former Structure

First, the MRC Q-Former divides the input sequence into windows of different lengths. Then, within each window, a Q-Former maps the concatenated multimodal representation into the semantic representation space. Finally, the output vectors of the Q-Formers of all windows at each resolution level are concatenated in order to form a vector sequence in the semantic space.

By keeping the ratio between the number of output vectors and the window length constant, the MRC Q-Former ensures that the output sequences of the different resolution levels have the same length and can be concatenated directly along the feature dimension. The concatenated vectors are then mapped through a linear layer and used as the input to the large language model.
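The sketch below illustrates only this multi-resolution mechanism: each level uses its own window length but the same query-to-window-length ratio, so the per-level outputs have equal length and can be concatenated along the feature dimension before a linear projection into the LLM embedding space. The TinyQFormer stand-in, the window lengths, the ratio, and the omission of the causal self-attention mask are simplifications, not the released implementation.

```python
import torch
import torch.nn as nn

class TinyQFormer(nn.Module):
    """Stand-in for a real Q-Former: learned queries cross-attend to one window."""
    def __init__(self, dim: int, n_queries: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, window: torch.Tensor) -> torch.Tensor:   # window: (B, W, D)
        q = self.queries.unsqueeze(0).expand(window.size(0), -1, -1)
        out, _ = self.attn(q, window, window)                   # (B, n_queries, D)
        return out

class MRCQFormerSketch(nn.Module):
    """Multi-resolution sketch: per level, a different window length but a fixed
    query-to-window-length ratio, so all levels produce equally long outputs."""
    def __init__(self, dim: int, llm_dim: int, window_lengths=(8, 32, 128), ratio=0.25):
        super().__init__()
        self.window_lengths = window_lengths
        self.levels = nn.ModuleList(
            [TinyQFormer(dim, n_queries=int(w * ratio)) for w in window_lengths]
        )
        # Outputs of all levels are concatenated along the feature dimension.
        self.proj = nn.Linear(dim * len(window_lengths), llm_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); T should be a multiple of the largest window length.
        per_level = []
        for qformer, w in zip(self.levels, self.window_lengths):
            outs = [qformer(x[:, s:s + w]) for s in range(0, x.size(1), w)]
            per_level.append(torch.cat(outs, dim=1))   # (B, T * ratio, D)
        fused = torch.cat(per_level, dim=-1)            # (B, T * ratio, D * n_levels)
        return self.proj(fused)                         # embeddings fed to the LLM
```

With the illustrative window lengths (8, 32, 128) and a ratio of 0.25, an input of 128 synchronised vectors yields 32 output vectors at every level, which is exactly what makes the feature-dimension concatenation possible.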

Through training, the high-resolution Q-Former learns to capture fine-grained information, such as speech content and the link between speech and lip movements, while the low-resolution Q-Former effectively extracts video-level information, giving a better grasp of the overall content of the video and the relationships between frames.

(Note: the MRC Q-Former structure also applies to multiple spatial resolutions. The paper is the first to propose the image "spotlight" method, which splits an image into sub-image sequences at different levels of detail as input, greatly improving image-understanding performance.)
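For intuition, here is a hedged sketch of that spotlight idea: slice an image into sub-images at several granularities and feed the crops to the visual encoder like a frame sequence. The grid sizes and the lack of any resizing step are illustrative choices, not details from the paper.

```python
import torch

def image_spotlight(image: torch.Tensor, grid_sizes=(1, 2, 4)) -> list:
    """Split an image into sub-image sequences at several granularities.

    image: (C, H, W) tensor. In practice each crop would be resized to the
    visual encoder's input resolution; that step is omitted here.
    """
    crops = []
    _, h, w = image.shape
    for g in grid_sizes:
        ph, pw = h // g, w // g
        for i in range(g):
            for j in range(g):
                crops.append(image[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw])
    return crops   # treated downstream like consecutive video frames
```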

Part III: Diversity Loss Function and Training with Mixed Unpaired Audio and Video Data

In addition, to prevent any single frame from dominating the video representation, video-SALMONN proposes a new diversity loss based on cross entropy that encourages the model to attend to different parts of the entire sequence.
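The paper's exact formulation is not reproduced here; the snippet below is only a minimal sketch of how a cross-entropy-based diversity penalty over Q-Former attention weights could look, assuming access to a (batch, n_queries, T) attention matrix over the T input frames.

```python
import torch

def diversity_loss(attn: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Cross-entropy-style diversity penalty (illustrative, not the paper's exact loss).

    attn: (B, n_queries, T) attention weights of the output queries over the T
    input frames, each row summing to 1. Averaging over queries gives the total
    attention mass each frame receives; the cross entropy against a uniform
    target is minimised when that mass is spread evenly across the sequence.
    """
    frame_mass = attn.mean(dim=1)                                    # (B, T)
    uniform = torch.full_like(frame_mass, 1.0 / frame_mass.size(-1))
    return -(uniform * torch.log(frame_mass + eps)).sum(dim=-1).mean()
```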

At the same time, to prevent a single modality from dominating on paired audio-video data (for example, non-speech audio in a video often provides only a little auxiliary information and is easily ignored), video-SALMONN trains with a strategy of mixing unpaired audio and video, forcing the model to attend to information from both modalities at once.
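The article does not spell out the mixing recipe, so the following is only an illustrative sketch of the idea: with some probability, a training sample's audio track comes from a different clip than its video track, so the target can only be produced by attending to both streams. The dataset keys and the way the target is assembled are hypothetical.

```python
import random

def make_training_sample(dataset: list, p_unpaired: float = 0.5):
    """Illustrative unpaired-mixing strategy (details differ from the actual recipe).

    With probability p_unpaired, pair the video of one clip with the audio of a
    different clip, so neither modality alone is enough to produce the target.
    The keys "video", "audio", "caption", "video_caption" and "audio_caption"
    are hypothetical placeholders.
    """
    vid_item = random.choice(dataset)
    if random.random() < p_unpaired:
        aud_item = random.choice(dataset)          # audio from an unrelated clip
        target = vid_item["video_caption"] + " " + aud_item["audio_caption"]
        return vid_item["video"], aud_item["audio"], target
    return vid_item["video"], vid_item["audio"], vid_item["caption"]
```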

Key results

The video-SALMONN paper proposes the speech-audio-visual evaluation (SAVE) benchmark for testing, and the model demonstrates impressive performance on speech, non-speech audio, image, and video tasks.

SAVE covers single-modality (visual or auditory) tasks such as speech recognition (ASR), audio captioning (AAC), image captioning (IC), optical character recognition (OCR), image question answering (VQA), and video question answering (Video QA). It also includes four perception tasks with joint audio-visual input: audio-visual speech recognition (AVSR), audio-visual question answering (AVQA), audio-visual sound source detection (AVSSD), and audio-visual matching (AVM).

Among them, AVQA is a newly proposed test set, and AVM is a newly proposed task that requires the model to judge whether the audio and the video match.


First, on single-modality input tasks, video-SALMONN achieves results equal to or even better than models that can only process visual or only auditory input.

Compared with Video-LLaMA, a large model that can also process audio and video, video-SALMONN not only adds the ability to understand speech input, but also far exceeds it on various single-modality tasks.


On tasks with joint audio-visual input, video-SALMONN demonstrates performance far beyond that of other models.


Thanks to its ability to understand speech, video-SALMONN significantly outperforms the comparable large model Video-LLaMA on AVQA.

More importantly, on AVM and AVSSD, two tasks it was never trained on, video-SALMONN demonstrates excellent zero-shot emergent ability for joint speech-audio-visual reasoning.


These emergent capabilities of joint reasoning require that the model pay equal attention to the input visual and auditory information and understand the complex connections between vision and audio.


However, existing open source models either cannot fully understand audio or can only describe each single input modality separately.

Conclusion

The launch of video-SALMONN will help improve the comprehensiveness and accuracy of video understanding technology, and bring new opportunities in areas such as video content analysis, emotion recognition, and multimodal interaction.


This model is not only of great significance in academic research, but also provides strong technical support for practical applications.

References:

https://openreview.net/pdf?id=nYsh5GFIqX