2024-08-14
Machine Heart Report
Synced Editorial Department
There is more good news for the open source community.
Large Language Models (LLMs) have evolved rapidly, and recently we have also witnessed a boom in Multimodal Large Language Models (MLLMs), which exhibit impressive multimodal capabilities.
In particular, the emergence of GPT-4o has significantly advanced the field of MLLMs. However, open source counterparts to these models are sorely lacking, and the open source community urgently needs to push this field forward.
In this paper, researchers from Tencent Youtu Lab and other institutions proposed VITA, the first open source multimodal large language model (MLLM) that can simultaneously process and analyze the video, image, text, and audio modalities while offering an advanced multimodal interaction experience.
The researchers used Mixtral 8×7B as the language foundation, then expanded its Chinese vocabulary and performed bilingual instruction fine-tuning. In addition, the researchers further endowed the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction fine-tuning.
VITA demonstrates strong multilingual, visual, and audio understanding capabilities, as evidenced by its strong performance on both unimodal and multimodal benchmarks.
Beyond these basic capabilities, the research also makes great progress in improving the natural multimodal human-computer interaction experience. To the best of the researchers' knowledge, this is the first study to employ non-wake-up interaction and audio interruption in an MLLM. The researchers also designed additional state tokens, along with corresponding training data and strategies, to perceive various interaction scenarios.
VITA is deployed in a duplex scheme in which one model generates responses to user queries while another continuously monitors environmental input. This gives VITA impressive human-computer interaction capabilities.
VITA is the first step for the open source community to explore the seamless integration of multimodal understanding and interaction. Although there is still a lot of work to be done on VITA to get close to closed source counterparts, this study hopes that VITA's role as a pioneer can serve as a cornerstone for subsequent research.
Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==∣=2650930230&idx=4&sn=9438b7c9c53ffa71dc7b3aa78ffaf348&chksm=84e43848b393b15ede2b21d694dde6ee5d90c364b94e53f09728faef1db5b5524cd4dbe49dee&token=2010422951⟨=zh_CN#rd
In the video above, users can converse with VITA without any obstacles. After seeing the white T-shirt the user is wearing, VITA suggests what color of pants would match it. When asked a math problem, VITA identifies the type of question in real time, reasons through it, and then gives an accurate answer. When the user is talking to someone else, VITA does not interrupt, because it knows the user is not addressing it. When the user is traveling, VITA also offers suggestions. While VITA is speaking, the user can interrupt in real time and start another topic.
Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==∣=2650930230&idx=4&sn=9438b7c9c53ffa71dc7b3aa78ffaf348&chksm=84e43848b393b15ede2b21d694dde6ee5d90c364b94e53f09728faef1db5b5524cd4dbe49dee&token=2010422951⟨=zh_CN#rd
In this video, the user holds up a cookie and asks VITA what they are eating. VITA answers that it is a cookie and suggests that it would taste better with milk or tea.
While you are working out, VITA can act as your chat partner:
Video link: https://mp.weixin.qq.com/s?__biz=MzA3MzI4MjgzMw==∣=2650930230&idx=4&sn=9438b7c9c53ffa71dc7b3aa78ffaf348&chksm=84e43848b393b15ede2b21d694dde6ee5d90c364b94e53f09728faef1db5b5524cd4dbe49dee&token=2010422951⟨=zh_CN#rd
Note: The above videos are played at 1x real-time speed and are not accelerated.
Based on the flowchart provided by the user, VITA can write code:
Given a picture, VITA can also answer questions based on its content:
VITA can also watch a video and answer questions about it. When the user asks it to "describe the dog's actions in detail", VITA gives an accurate answer:
Method Introduction
As shown in Figure 3, the overall training process of VITA consists of three stages: LLM instruction fine-tuning, multimodal alignment, and multimodal instruction fine-tuning.
LLM instruction fine-tuning
Mixtral 8x7B is one of the top-performing open source LLMs, so this study used it as the base model. However, the researchers observed that the official Mixtral model has limited ability to understand Chinese. To inject bilingual (Chinese and English) understanding, the study expanded the base model's Chinese vocabulary, increasing the vocabulary size from 32,000 to 51,747. After expanding the vocabulary, the researchers used a synthetic bilingual corpus of 5 million words for plain-text instruction fine-tuning.
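The article does not spell out the vocabulary-expansion pipeline; as a rough, hedged sketch of the general recipe (the model ID and the token list are illustrative placeholders, not VITA's actual assets), the Hugging Face transformers API allows something like the following:

```python
# Hedged sketch: adding Chinese tokens to an LLM's vocabulary and resizing
# its embeddings before bilingual instruction fine-tuning. The model ID and
# token list are placeholders, not the exact assets used by VITA.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "mistralai/Mixtral-8x7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# In practice the new tokens would come from training a Chinese tokenizer
# and merging its vocabulary with the original one.
new_tokens = ["你好", "视频", "模型"]  # placeholder examples
num_added = tokenizer.add_tokens(new_tokens)

# Give the newly added token IDs trainable embedding rows, then continue
# with plain-text (Chinese/English) instruction fine-tuning.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```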
Multimodal Alignment
To bridge the representation gap between text and the other modalities and lay the foundation for multimodal understanding, only the visual connector is trained during the visual alignment stage. Table 1 summarizes the training data used, apart from the plain-text part.
Visual modality
First is the visual encoder. The researchers used InternViT-300M-448px as the visual encoder; it takes a 448×448 image as input and, after a visual connector implemented as a simple two-layer MLP, produces 256 tokens. For high-resolution image input, the researchers used a dynamic patching strategy to capture local details.
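The connector is described only as a simple two-layer MLP; a minimal PyTorch sketch, assuming typical feature widths for InternViT-300M (1024) and Mixtral 8x7B (4096) rather than confirmed values, could look like this:

```python
import torch
import torch.nn as nn

class VisualConnector(nn.Module):
    """Two-layer MLP mapping visual-encoder features into the LLM's
    embedding space. Dimensions are illustrative assumptions, not
    VITA's published configuration."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, 256, vision_dim) for one 448x448 image
        return self.proj(vision_tokens)

# Example: 256 visual tokens from a single image
connector = VisualConnector()
fake_features = torch.randn(1, 256, 1024)
print(connector(fake_features).shape)  # torch.Size([1, 256, 4096])
```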
Video is treated as a special case of images. If a video is shorter than 4 seconds, it is uniformly sampled at 4 frames per second; if it is between 4 and 16 seconds long, it is sampled at one frame per second; for videos longer than 16 seconds, 16 frames are sampled uniformly.
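Read literally, this sampling rule can be expressed as a small helper like the one below (an illustrative implementation of the description above, not the released code):

```python
import numpy as np

def sample_frame_indices(duration_s: float, fps: float) -> np.ndarray:
    """Pick frame indices for a video following the rule described above.
    An illustrative reading of the text, not VITA's actual implementation."""
    total_frames = int(duration_s * fps)
    if duration_s < 4:
        # uniform sampling at 4 frames per second
        num_samples = max(1, int(duration_s * 4))
    elif duration_s <= 16:
        # one frame per second
        num_samples = int(duration_s)
    else:
        # fixed budget of 16 uniformly spaced frames
        num_samples = 16
    return np.linspace(0, total_frames - 1, num_samples, dtype=int)

print(sample_frame_indices(2.5, fps=30))   # 10 frames from a 2.5 s clip
print(sample_frame_indices(60.0, fps=30))  # 16 frames from a 1 min clip
```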
Second is visual alignment. Only the visual connector is trained in the visual alignment stage, and audio questions are not used in this stage.
Finally, data concatenation. For plain-text and image data, this study concatenates samples up to a context length of 6K tokens, as shown in Figure 4. Notably, video data is not concatenated.
There are two benefits of concatenating different data:
Furthermore, the study found that models trained on the concatenated data performed comparably to those trained on the original data.
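A hedged sketch of this kind of sample packing (the ~6K budget comes from the text; tokenization, separators, and attention-mask handling are simplified away) might look like this:

```python
def pack_samples(samples, tokenizer, max_len=6000):
    """Greedily concatenate tokenized samples into sequences of up to
    roughly 6K tokens. A simplified illustration of the data-concatenation
    idea; a real pipeline would also insert separators and keep per-sample
    attention masks so packed samples do not attend to each other."""
    packed, current = [], []
    for text in samples:
        ids = tokenizer.encode(text)  # any tokenizer exposing encode()
        if current and len(current) + len(ids) > max_len:
            packed.append(current)
            current = []
        current.extend(ids)  # an over-long sample still gets its own sequence
    if current:
        packed.append(current)
    return packed
```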
Audio modality
First is the audio encoder. The input audio is initially processed by a Mel filter bank block, which decomposes the signal into frequency bands on the mel scale, mimicking the nonlinear human perception of sound. The researchers then used a 4× downsampling CNN layer and a 24-layer transformer, totaling 341 million parameters, to process the input features, together with a simple two-layer MLP as the audio-text modality connector. In the end, every 2 seconds of audio input is encoded into 25 tokens.
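Structurally, this audio branch corresponds to a fairly standard ASR-style front end. The sketch below mirrors the description (mel filter bank, CNN downsampling, transformer encoder, two-layer MLP connector), but every size, stride, and hyperparameter is an assumption rather than VITA's exact configuration, and the resulting token rate is not tuned to exactly 25 tokens per 2 seconds:

```python
import torch
import torch.nn as nn
import torchaudio

class AudioEncoder(nn.Module):
    """Structural sketch of the audio branch described above: mel filter
    bank -> CNN downsampling -> transformer encoder -> two-layer MLP
    connector. All sizes are illustrative assumptions."""

    def __init__(self, n_mels=80, d_model=512, llm_dim=4096):
        super().__init__()
        self.mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels)
        # Two strided conv layers give a 4x reduction along the time axis.
        self.downsample = nn.Sequential(
            nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv1d(d_model, d_model, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=24)
        # Audio-text modality connector (two-layer MLP).
        self.connector = nn.Sequential(
            nn.Linear(d_model, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        feats = self.mel(waveform)                  # (B, n_mels, T)
        feats = self.downsample(feats)              # (B, d_model, T/4)
        feats = feats.transpose(1, 2)               # (B, T/4, d_model)
        return self.connector(self.encoder(feats))  # (B, T/4, llm_dim)

audio = torch.randn(1, 16000 * 2)                   # 2 seconds at 16 kHz
print(AudioEncoder()(audio).shape)
```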
Second is audio alignment. For the alignment task, the researchers used automatic speech recognition (ASR). The datasets include Wenetspeech (more than 10,000 hours of multi-domain speech recognition data, mainly focused on Chinese) and Gigaspeech (10,000 hours of high-quality audio data, mostly for English speech recognition). For the audio captioning task, the researchers used the AudioSet SL subset of Wavcaps, which contains 400k audio clips with corresponding captions. During alignment, both the audio encoder and the connector are trained.
Multimodal instruction fine-tuning
The study adapted the model to enhance its ability to follow instructions, whether text or audio.
Data Construction. The data source of the instruction tuning stage is the same as that of the alignment stage in Table 1, but this study makes the following improvements:
Questions are randomly replaced (roughly half of them) with their audio versions, using TTS tools such as GPT-SoVITS, to enhance the model's understanding of audio queries and its instruction-following ability (a minimal sketch follows below).
Different system prompts are set to avoid conflicts between different types of data, as shown in Table 2. For example, some questions can be answered either from the visual information or from the model's own knowledge, which leads to conflicts. In addition, image data is patched in a way similar to multi-frame video data, which may confuse the model. The system prompt explicitly distinguishes the data types, making them easier for the model to interpret.
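For the first improvement, the sketch below shows one way to replace roughly half of the text questions with synthesized audio. The `synthesize` callable is a hypothetical stand-in for a TTS system such as GPT-SoVITS, and the sample schema is assumed, not taken from the released data format:

```python
import random

def audify_questions(samples, synthesize, ratio=0.5, seed=0):
    """Randomly replace ~`ratio` of text questions with TTS audio versions.

    `samples` is assumed to be a list of dicts with a "question" field;
    `synthesize` is a hypothetical callable mapping text to an audio file
    path (e.g. backed by a TTS tool such as GPT-SoVITS).
    """
    rng = random.Random(seed)
    out = []
    for sample in samples:
        sample = dict(sample)  # do not mutate the caller's data
        if rng.random() < ratio:
            sample["audio_question"] = synthesize(sample["question"])
            sample["question"] = None  # the model then sees only the audio query
        out.append(sample)
    return out
```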
To realize the two interactive functions, non-wake-up interaction and audio interruption, this study proposed a duplex deployment framework in which two VITA models are deployed at the same time, as shown in Figure 1.
In a typical case, the generation model answers user queries. Meanwhile, the monitoring model detects environmental sounds during the generation process. It ignores non-query user voices but stops the generation model's progress when the query audio is recognized. The monitoring model then integrates historical context and responds to the latest user query, switching the identities of the generation model and the monitoring model.
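In control-flow terms, the duplex scheme roughly corresponds to the loop sketched below. The `generator`/`monitor` objects and their methods are placeholders invented for illustration; only the role-swapping logic follows the description above:

```python
def duplex_loop(generator, monitor, audio_stream, play):
    """Sketch of the duplex deployment: one model answers while the other
    listens. `generator` and `monitor` are placeholder model handles with
    hypothetical methods; `play` is a callable that outputs audio chunks."""
    history = []
    query = monitor.wait_for_query(audio_stream)     # non-wake-up: just speak
    while query is not None:
        history.append(query)
        query = None
        for chunk in generator.stream_answer(history):
            play(chunk)
            new_query = monitor.check(audio_stream)  # listen while speaking,
            if new_query is None:                    # ignoring non-query sounds
                continue
            generator.stop()                         # audio interruption
            generator, monitor = monitor, generator  # swap roles: the monitor
            query = new_query                        # answers the new query next
            break
        if query is None:                            # finished uninterrupted
            query = monitor.wait_for_query(audio_stream)
```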
Experimental Evaluation
Language performance. To verify the effectiveness of the language model training process, the researchers used four datasets: C-EVAL, AGIEVAL, MMLU, and GSM8K. These datasets cover a variety of scenarios, including general multiple-choice questions, multi-disciplinary question answering, and mathematical and logical reasoning, in both Chinese and English contexts.
The results in Table 3 below show that the training in this paper significantly enhances the capabilities of the language model on Chinese evaluation sets (C-EVAL and AGIEVAL), while maintaining the original performance level on English-related benchmarks (MMLU) and achieving significant improvements on mathematical reasoning tasks (GSM8K).
Audio performance. To verify the robustness of the speech representations learned by the model, the researchers tested it on two datasets: Wenetspeech and Librispeech.
Wenetspeech has two evaluation sets, test_net and test_meeting. The former is easier because its data distribution is closer to the training data; the latter poses a greater challenge. Librispeech, which is held out from training, evaluates the model's generalization to unseen data. It has four evaluation sets: those starting with "dev" are validation sets, those starting with "test" are test sets, "clean" denotes a less challenging subset, and "other" a more challenging one.
From the results in Table 4 below, we can see that VITA achieved very good results on the ASR benchmark.
Multimodal performance. To evaluate VITA's multimodal capabilities, this study evaluated it on four benchmarks: MME, OCRBench, HallusionBench, and Video-MME. The results are shown in Figure 5.
In terms of image understanding, VITA outperforms the image-specific open source model LLaVA-Next and is close to the closed-source model Gemini 1.5 Pro.
In terms of video understanding, VITA surpasses the open-source video model Video-CCAM. Although there is a gap between VITA and the video-specific LLaVA-Next-Video, this is acceptable considering that VITA supports a wider range of modalities and prioritizes interactivity.
Finally, it is worth noting that there is still a large gap between open source models and proprietary models in terms of video understanding capabilities.