
Video in-context learning! Large models learn to generate by imitating examples, from MSRA

2024-07-17


Vid-ICL team contribution
Quantum Bit | Public Account QbitAI

Video generation can also draw on "context"?!

MSRA proposes Video In-Context Learning (Vid-ICL), which teaches large models to "copy" examples, i.e. imitation generation.

Vid-ICL uses example videos to guide the model's generation in new scenarios, so that the generated result "imitates" the task completed in the example video.

For example, the camera angle of the sample video moves downward (left), and the generated video also moves downward (right):



The sample video object moves upward (left), and the generated video also moves upward (right):



Object grabbing can also be simulated:



△Left: Example video, the robot arm grabs an object; Right: Generated video

Opening the drawer can also be done as in the example:



△Left: Sample video, open the middle drawer; Right: Generated video

In the same electric fan scene, different example videos steer the model toward different results:



△Left: Sample video, the camera moves left; Right: Generated video



△Left: Sample video, the camera moves right; Right: Generated video

It should be noted that in an ideal world model, interaction between the model and the external environment should be diverse. However, most existing work uses text as the primary mode of interaction, which makes it difficult to control the details and diversity of the generated results.

Video, by contrast, is concrete and universal: it can convey a wide range of information, such as demonstrations of completing various tasks, including moving or grasping objects.

The Vid-ICL method proposed by the research team offers a new interface that makes interaction between the model and the real world more diverse.



Beyond the generated videos shown above, Vid-ICL can also be combined with a simulator: it uses the generated video together with the current state to predict the corresponding actions, enabling interaction with a real environment.

The following figure shows Vid-ICL interacting with the environment: starting from the state at t=0, it interacts with the RoboDesk simulator to complete the "Push_red" task. Vid-ICL provides more precise control over the interaction with the environment:



Wow, the movie "Real Steel" has become a reality.
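As a rough illustration of that closed loop, the sketch below shows how a generated clip could be turned into actions step by step. All names here (`vid_icl_generate`, `infer_action`) are hypothetical placeholders rather than the paper's released interface, and the environment is assumed to expose a Gym-style `reset()`/`step()` API, as RoboDesk does.

```python
import numpy as np

# Hypothetical sketch of the closed loop described above; none of these names
# come from the paper's code.

def vid_icl_generate(model, example_clips, history):
    """Placeholder: the Vid-ICL model would predict the next short clip here."""
    return np.zeros((8, 64, 64, 3))  # dummy clip of 8 frames

def infer_action(obs, predicted_clip):
    """Placeholder: e.g. an inverse-dynamics model mapping frames to an action."""
    return np.zeros(5)  # dummy action vector

def run_episode(env, model, example_clips, horizon=50):
    obs = env.reset()          # state at t = 0
    history = [obs]            # frames observed so far
    for _ in range(horizon):
        # Condition on the demonstration clips plus the observed frames,
        # predict the next clip, then recover a low-level action from it.
        clip = vid_icl_generate(model, example_clips, history)
        action = infer_action(obs, clip)
        obs, reward, done, info = env.step(action)
        history.append(obs)
        if done:
            break
    return history
```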

How does Vid-ICL do this?

How the Vid-ICL framework works

Vid-ICL operates on video clips as its basic unit.

Specifically, given a query video clip and k example video clips, the goal of Vid-ICL is to generate a clip that, first, maintains perceptual coherence with the query clip and, at the same time, matches the example videos semantically (e.g., camera movement, actions).
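Assuming each clip has already been encoded into a flat list of discrete visual tokens (the tokenization stage is described below), the in-context prompt could be assembled roughly as follows; `model.generate` and `vq_decoder` are hypothetical placeholders, not the released API.

```python
import torch

# Illustrative only: each clip is assumed to be a flat list of discrete
# visual token ids produced by the visual encoder.
def build_in_context_prompt(example_clip_tokens, query_clip_tokens):
    """Concatenate k example clips followed by the query clip into one sequence."""
    prompt = []
    for clip in example_clip_tokens:   # k demonstration clips
        prompt.extend(clip)
    prompt.extend(query_clip_tokens)   # the clip to be continued
    return torch.tensor(prompt, dtype=torch.long).unsqueeze(0)  # (1, seq_len)

# The decoder then autoregressively samples the tokens of the next clip,
# which are finally decoded back into frames by the visual decoder:
# new_tokens = model.generate(prompt, max_new_tokens=tokens_per_clip)  # hypothetical call
# generated_clip = vq_decoder(new_tokens)                              # hypothetical call
```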



  • Autoregressive model training

Vid-ICL uses Transformer as the model structure.

As the backbone of large language models, the Transformer has demonstrated powerful capabilities in contextual reasoning and generation. Training a generative Transformer on visual information consists of two stages:

First, train a visual encoder, such as a VQ-VAE, to convert each frame into discrete tokens.

Second, each training sample is constructed as a token sequence, and the Transformer decoder is trained to recover that token sequence.
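As a minimal sketch of the first stage, assuming a VQ-VAE-style codebook: each frame's feature grid is mapped to the indices of its nearest codebook entries. Shapes and names below are illustrative.

```python
import torch

def quantize_frame(features, codebook):
    """Map a frame's feature grid to discrete token indices (VQ-VAE style).

    features: (H, W, D) continuous features from the visual encoder
    codebook: (K, D) learned embedding vectors
    returns:  (H * W,) indices of the nearest codebook entries
    """
    flat = features.reshape(-1, features.shape[-1])   # (H*W, D)
    dists = torch.cdist(flat, codebook)               # (H*W, K) pairwise distances
    return dists.argmin(dim=-1)                       # nearest code per position

# Toy example: an 8x8 feature grid and a 1024-entry codebook.
feats = torch.randn(8, 8, 256)
codebook = torch.randn(1024, 256)
tokens = quantize_frame(feats, codebook)              # 64 discrete tokens for this frame
```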

In terms of implementation, Vid-ICL adopts the Llama architecture, using RMSNorm and rotary position embeddings (RoPE), and trains the Transformer decoder autoregressively. During training, each sequence is sampled from a single original video; segments from different videos are not spliced together.
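To make these choices concrete, here is a minimal sketch of RMSNorm and of the autoregressive next-token objective on a token sequence drawn from one video. The full Llama-style decoder (attention with RoPE, MLP blocks) is omitted, and `model` is just a placeholder that returns logits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMS normalization as used in Llama-style decoders."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal root-mean-square of the last dimension.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

def next_token_loss(model, tokens):
    """Autoregressive objective: predict token t+1 from tokens <= t.

    tokens: (batch, seq_len) discrete visual tokens, each sequence sampled
            from a single video (no cross-video splicing).
    model:  placeholder for the Llama-style decoder returning logits.
    """
    logits = model(tokens[:, :-1])          # (batch, seq_len-1, vocab)
    targets = tokens[:, 1:]                 # targets shifted by one position
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```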

  • Zero-shot capability

The research team made a key observation in the paper:

The model can be trained on video data without explicitly constructed context, i.e., it spontaneously learns contextual reasoning from consecutive video clips. This is the "zero-shot" capability for video in-context learning.

This can be attributed to two key factors. First, no special separator tokens are inserted between video frames, so during training the model implicitly treats consecutive video sequences as having the example-video + query-video format. In other words, the model has already learned to process sequences with the example-query structure.

Second, the autoregressive nature of the Transformer lets it extend single-scene video prediction to cases where the examples and the query come from different videos, seamlessly generalizing the paradigm of text in-context learning to video in-context learning.
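A toy illustration of this point, with made-up token lists: the training sequence and the in-context prompt share exactly the same format; only the origin of the leading clips differs.

```python
# Toy token lists standing in for encoded clips; the values are meaningless.
clip_tokens_A1, clip_tokens_A2, clip_tokens_A3 = [1, 2], [3, 4], [5, 6]   # video A
clip_tokens_B1, clip_tokens_B2 = [7, 8], [9, 10]                          # video B (example)
query_clip_tokens_A = [11, 12]                                            # query from video A

# Training: consecutive clips from ONE video; earlier clips implicitly act as
# "examples" and the last clip as the "query". No separator tokens are inserted.
train_sequence = clip_tokens_A1 + clip_tokens_A2 + clip_tokens_A3

# Inference: identical format, but the leading clips come from a DIFFERENT
# video that demonstrates the desired semantics.
prompt_sequence = clip_tokens_B1 + clip_tokens_B2 + query_clip_tokens_A
```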

  • Integrating other modalities

Although Vid-ICL mainly focuses on videos as examples, it can be extended to other modalities such as text.

To do this, the original text description is converted into a latent representation by a pre-trained language model, aligned to the Transformer's latent space through a projection layer, and then used as a prefix both when training the Transformer and when performing contextual reasoning.
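A sketch of this text-conditioning path, assuming a frozen pre-trained text encoder and a learned linear projection into the decoder's latent space; module names and dimensions are illustrative, not taken from the paper's code.

```python
import torch
import torch.nn as nn

class TextPrefix(nn.Module):
    """Project a frozen text encoder's output into the video decoder's latent
    space and prepend it as a prefix to the visual token embeddings."""
    def __init__(self, text_dim=768, model_dim=1024):
        super().__init__()
        self.proj = nn.Linear(text_dim, model_dim)   # trained alignment layer

    def forward(self, text_latents, visual_token_embeds):
        # text_latents:        (batch, n_text, text_dim) from a pre-trained LM (frozen)
        # visual_token_embeds: (batch, n_vis, model_dim) embedded visual tokens
        prefix = self.proj(text_latents)                          # (batch, n_text, model_dim)
        return torch.cat([prefix, visual_token_embeds], dim=1)    # text prefix + video tokens
```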

Experiments show that Vid-ICL can take text and video as examples at the same time, and adding text further enhances the quality of the generated results.

  • Data and model size

It can be seen that Vid-ICL can learn the semantic information contained in the example videos and transfer it to new scenes for generation. This requires that the training data mainly contains videos with clear causal relationships and strong interactivity.

Therefore, the researchers selected two datasets as the main sources of training data: Ego4D and Kinetics-600.

In addition, to increase the diversity of video content, a small amount of data from WebVid is also added to the training set.

The team also verified that, because the semantic information in Internet videos is fuzzy and divergent, simply adding more Internet videos to enlarge the dataset does not improve the model's in-context performance.

In terms of model size, the team trained models at three sizes: 300M, 700M, and 1.1B parameters, and found that the quality and in-context performance of the generated videos follow a scaling law.

Experimental Results

Vid-ICL is evaluated mainly by providing example videos with different semantics for the same query video, to assess the effectiveness and accuracy of video in-context learning.

For example, for a query video of an object moving to the left, different videos are generated given example videos of moving left, moving randomly, and moving in the opposite direction; the results are then evaluated to determine whether the model really generates videos that follow the example.

In terms of qualitative results, the following figure shows the generated videos under different example videos (for more examples, please refer to the original paper).

It can be observed:

1) For single-video generation, Vid-ICL maintains consistency between the generated video and the query video, with good generation quality in both cases;

2) For semantic consistency between generated and example videos, the generated videos all follow the process shown in the example video, which shows that Vid-ICL can spontaneously pick up the semantic information of the example video and generate a corresponding video.

As shown in the figure below, for the same query video clip, Vid-ICL makes the corresponding movements in the generated video based on the camera movement in the example video.



In terms of quantitative results, the research team proposed two automatic evaluation indicators:

1) Video quality: traditional vision metrics based on pixel matching or distributions, such as PSNR and FID (a minimal PSNR computation is sketched after this list).

2) Semantic consistency: two metrics based on classification accuracy, namely video classification accuracy and probe classification accuracy.
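As noted above, a minimal PSNR computation between a reference and a generated clip might look like this (FID and the classification-based metrics require trained feature extractors and are omitted):

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak signal-to-noise ratio between two videos of identical shape.

    Both inputs are arrays of shape (frames, H, W, C); higher is better.
    """
    ref = reference.astype(np.float64)
    gen = generated.astype(np.float64)
    mse = np.mean((ref - gen) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)

# Toy usage with a random 8-frame clip:
a = np.random.randint(0, 256, size=(8, 64, 64, 3), dtype=np.uint8)
print(psnr(a, a))   # identical clips -> inf
```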

Across these metrics, Vid-ICL outperforms the baseline models: guided by similar example videos, Vid-ICL generates videos that are more realistic and more semantically consistent.



Please refer to the original paper for more details.

Project homepage: https://aka.ms/vid-icl
Paper link: https://arxiv.org/abs/2407.0735