
Apple's new training-free method gives video LLMs two "eyes", one slow and one fast, and beats SOTA training-free models

2024-08-12


Since the release of Sora, the field of AI video generation has become more lively. In the past few months, we have witnessed the success of Jimeng, Runway Gen-3, Luma AI, and Kuaishou Keling.

Unlike earlier AI-generated videos, which could be identified as synthetic at a glance, this batch of large video models may be the "best" we have ever seen.

However, the impressive performance of video large language models (LLMs) depends on large, carefully annotated video datasets, which are very expensive to build. Recently, a number of innovative training-free methods have emerged in the research community: they apply pretrained image LLMs directly to video tasks, bypassing the costly training process.

In addition, most existing video LLMs have two major shortcomings: (1) they can only process a limited number of input frames, which makes it difficult to capture fine-grained spatial and temporal content in the video; (2) they lack an explicit temporal modeling design and simply feed video features into the LLM, relying entirely on the LLM's ability to model motion.

In response to these problems, Apple researchers proposed SlowFast-LLaVA (SF-LLaVA for short). The model is built on the LLaVA-NeXT architecture developed by the ByteDance team and works out of the box, with no additional fine-tuning. Inspired by the two-stream networks that have been very successful in action recognition, the research team designed a novel SlowFast input mechanism for video LLMs.

In simple terms, SF-LLaVA understands the details and motion in a video by observing it at two different speeds (Slow and Fast):

Slow path: extracts features at a low frame rate while preserving as much spatial detail as possible (e.g., keeping 24×24 tokens for each of 8 sampled frames).

Fast path: runs at a high frame rate but reduces spatial resolution with a larger pooling stride, covering a longer temporal context and focusing on the coherence of the action.

This is equivalent to giving the model two "eyes": one looks slowly and attends to details, while the other looks quickly and attends to actions. This addresses the pain points of most existing video LLMs, capturing both detailed spatial semantics and a longer temporal context; a minimal sketch of the idea follows below.
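To make the two-path idea concrete, here is a minimal PyTorch sketch of how Slow and Fast token streams could be pooled from per-frame features and concatenated for the LLM. This is not Apple's implementation: the function name slowfast_tokens and the fast-path pooling stride are illustrative assumptions, while the 8-frame / 24×24-token slow setting follows the example above.

```python
# Minimal sketch of SlowFast token aggregation (illustrative, not Apple's code).
# Assumes per-frame features of shape (N, H, W, D) from an image encoder.
import torch
import torch.nn.functional as F


def slowfast_tokens(frame_feats: torch.Tensor,
                    slow_frames: int = 8,   # few frames, full spatial detail
                    slow_size: int = 24,    # 24x24 tokens per slow frame (as in the text)
                    fast_stride: int = 6    # aggressive spatial pooling on the fast path (assumed)
                    ) -> torch.Tensor:
    """Build a combined Slow+Fast visual token sequence for the LLM."""
    N, H, W, D = frame_feats.shape
    x = frame_feats.permute(0, 3, 1, 2)                    # (N, D, H, W) for pooling ops

    # Slow path: low frame rate, high spatial resolution.
    idx = torch.linspace(0, N - 1, slow_frames).round().long()
    slow = F.adaptive_avg_pool2d(x[idx], slow_size)        # (slow_frames, D, 24, 24)
    slow = slow.flatten(2).transpose(1, 2).reshape(-1, D)  # (slow_frames*24*24, D)

    # Fast path: every frame, heavily pooled spatially, trading detail for temporal coverage.
    fast = F.avg_pool2d(x, kernel_size=fast_stride)        # (N, D, H//stride, W//stride)
    fast = fast.flatten(2).transpose(1, 2).reshape(-1, D)  # (N*(H//stride)*(W//stride), D)

    # The two token streams are concatenated and fed to the LLM together with the question.
    return torch.cat([slow, fast], dim=0)
```

The trade-off the sketch highlights is the visual token budget: the slow stream spends tokens on spatial detail from a few frames, while the fast stream spends them on temporal coverage of many frames.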



Paper link: https://arxiv.org/pdf/2407.15841

Experimental results show that SF-LLaVA significantly outperforms existing training-free methods on all benchmarks and matches or even exceeds the performance of some supervised fine-tuned (SFT) video LLMs.



Model Architecture

As shown in the figure below, SF-LLaVA follows the standard training-free video LLM process. It takes video V and question Q as input and outputs the corresponding answer A.



For input, N frames I = {I_1, I_2, ..., I_N} are uniformly sampled from each video, regardless of its size and length, and no special combination or arrangement of the selected frames is required. The frame features, extracted independently for each frame, form F_v ∈ R^(N×H×W), where H and W are the height and width of the frame features, respectively.
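As a rough illustration of this input step, the snippet below uniformly samples N frames from a video file. It assumes the decord library for video decoding, and the file name and frame count are placeholders rather than the paper's exact settings.

```python
# Illustrative uniform frame sampling; assumes the decord library for decoding.
import numpy as np
from decord import VideoReader


def sample_frames(video_path: str, num_frames: int) -> np.ndarray:
    """Uniformly sample num_frames RGB frames from a video of any length."""
    vr = VideoReader(video_path)
    idx = np.linspace(0, len(vr) - 1, num_frames).round().astype(int)
    # Returned array has shape (N, height, width, 3), ready for the image encoder.
    return vr.get_batch(idx.tolist()).asnumpy()


frames = sample_frames("example.mp4", num_frames=50)  # hypothetical file and frame count
```

Each sampled frame is then passed through the image encoder independently to produce the per-frame features F_v described above.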





Experimental Results

The research team conducted a comprehensive performance evaluation of SF-LLaVA, comparing it with current SOTA training-free models such as IG-VLM and LLoVi on multiple video question answering tasks. In addition, they also compared it with video LLMs that have been fine-tuned (SFT) on video datasets, such as VideoLLaVA and PLLaVA.

Open-Ended Video Q&A

As shown in the table below, SF-LLaVA outperforms existing training-free methods on all open-ended video question answering benchmarks. Specifically, when equipped with 7B and 34B parameter-scale LLMs, SF-LLaVA outperforms IG-VLM by 2.1% and 5.0% on MSRVTT-QA, 5.7% and 1.5% on TGIF-QA, and 2.0% and 0.8% on ActivityNet-QA.

Even compared with the fine-tuned SFT methods, SF-LLaVA shows comparable performance on most benchmarks, with the exception of ActivityNet-QA, where PLLaVA and LLaVA-NeXT-Video-DPO slightly outperform it.



Multiple-Choice Video Q&A

As can be seen in the table below, SF-LLaVA outperforms other training-free methods on all multiple-choice video question answering benchmarks. On the EgoSchema dataset, which requires complex long-term reasoning, the SF-LLaVA 7B and 34B versions score 11.4% and 2.2% higher than the IG-VLM model, respectively.

While VideoTree leads this benchmark, its advantage comes from being built on the proprietary GPT-4 rather than an open-source LLM. The SF-LLaVA 34B model also achieves better results on EgoSchema than the SFT approaches, confirming the SlowFast design's ability to handle long videos.

Text Generation



As shown in Table 3, SF-LLaVA also shows advantages on the video-based text generation task. SF-LLaVA-34B surpasses all training-free baselines in overall performance, although it falls slightly behind LLaVA-NeXT-Image in detail orientation. Thanks to the SlowFast design, SF-LLaVA can cover a longer temporal context with fewer visual tokens, so it performs particularly well on temporal understanding.

In addition, SF-LLaVA-34B also outperforms most SFT methods on this text generation benchmark.



For more details, please refer to the original paper.