Peking University Wang Xuan Institute: Letting multimodal large models better understand what humans are doing | ECCV 2024

2024-08-13

Using only prompts, a multimodal large model can better understand the interactions between people and objects in a scene.

Peking University recently proposed a conditional multimodal prompt learning method (Conditional Multi-Modal Prompt, CMMP), which uses prompt engineering to teach multimodal large models to understand region-level human-object interactions.



The hardest part of this process is teaching the model to recognize previously unseen types of human-object interaction.

Bear in mind that most existing research focuses on closed environments. Once the setting becomes an open environment closer to reality, the model gets confused!

For example, as the figure below shows, previous detectors struggled to balance seen and unseen categories: their harmonic mean is lower, and they perform poorly on unseen categories.

In contrast, the CMMP method effectively addresses this trade-off, substantially improves performance, and establishes a new state-of-the-art for unseen categories.



As for how CMMP handles unseen categories, the answer in one sentence:

Visual-spatial prompts are injected during feature extraction to help recognize unseen human-object interaction concepts, and conditional prompt learning improves generalization to unseen categories.



In summary, CMMP provides a new paradigm for fine-tuning multimodal large models so that they gain a generalizable ability to detect region-level human-object interactions.

The research comes from the Wang Xuan Institute of Computer Technology at Peking University, and the paper has been accepted by the top conference ECCV 2024.

A new framework for zero-shot human-object interaction detection

The team proposed a new framework for zero-shot HOI (Human-Object Interaction) detection using CMMP.



Specifically, CMMP decomposes zero-shot human-object interaction detection into two subtasks:

  • Visual Feature Extraction for Interactivity Perception
  • Generalizable Interaction Classification

For each subtask, the team then proposed decoupled visual and textual prompts to remove the dependency between the two and mitigate error propagation.

Conditional visual prompts (Pv) are used to inject spatial and interactivity-aware knowledge into the image encoder, constrained by instance-level visual priors (Cins) and the global spatial pattern of interactions (Cgsp). Conditional language prompts (PL) are constrained by human-designed prompts (CL) through a regularization loss.
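
The decoupled design can be pictured as two independent sets of learnable tokens, each tied to its own constraint. Below is a minimal PyTorch sketch of that structure; the tensor shapes, the projection of the priors, and names such as `ConditionalPrompts` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ConditionalPrompts(nn.Module):
    """Minimal sketch of decoupled prompts: visual prompts (Pv) conditioned on
    instance-level priors (Cins) and global spatial patterns (Cgsp), and
    language prompts (PL) that a separate regularization loss keeps close to
    human-designed prompts (CL). All shapes are illustrative."""

    def __init__(self, n_visual=8, n_context=8, dim=512, prior_dim=256):
        super().__init__()
        self.visual_prompt = nn.Parameter(torch.randn(n_visual, dim) * 0.02)     # Pv
        self.language_prompt = nn.Parameter(torch.randn(n_context, dim) * 0.02)  # PL
        self.prior_proj = nn.Linear(prior_dim, dim)  # maps priors into prompt space

    def conditioned_visual_prompt(self, instance_prior, spatial_pattern):
        """instance_prior, spatial_pattern: (n_tokens, prior_dim) tensors."""
        cond = self.prior_proj(torch.cat([instance_prior, spatial_pattern], dim=0))
        # Tokens handed to the image encoder alongside its patch tokens.
        return torch.cat([self.visual_prompt, cond], dim=0)
```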

Visual Feature Extraction for Interactivity Perception

The image encoder of the multimodal model the team adopts (CLIP) was pre-trained with contrastive learning on large-scale image-text pairs, so its capability may be limited to first-order semantics at the image level.

To enable the image encoder to distinguish all human-object interactions in an image, the team proposed integrating prior knowledge of different granularities into the conditional visual prompts, tailoring the encoder to the region-level, second-order semantics required for detecting human-object interactions.

Specifically, the researchers use instance-level information as prior knowledge and incorporate it into the conditional visual prompts.

Given an input image, a pre-trained object detector is first used to obtain all instance-level prior knowledge, including bounding boxes, confidence scores, and semantic encodings of the detected instances.
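
A rough sketch of this step is shown below. The article does not name the detector, so torchvision's Faster R-CNN is used as a stand-in, and the one-hot label encoding is an assumed placeholder for the semantic encoding.

```python
import torch
import torch.nn.functional as F
import torchvision

# Stand-in detector; the article only requires "a pre-trained object detector".
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def instance_level_prior(image, score_thresh=0.3):
    """Collect instance-level priors (Cins): boxes, confidence scores, and a
    semantic encoding of each detected instance (one-hot here for simplicity)."""
    out = detector([image])[0]                    # image: (3, H, W) float tensor in [0, 1]
    keep = out["scores"] > score_thresh
    boxes, scores, labels = out["boxes"][keep], out["scores"][keep], out["labels"][keep]
    semantic = F.one_hot(labels, num_classes=91).float()   # 91 COCO label slots
    return {"boxes": boxes, "scores": scores, "semantic": semantic}
```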

In addition, to encourage each instance to be aware of its potential interacting objects, the team combined the global spatial pattern of interactions in the training set with instance-level visual prior knowledge.

Specifically, for each annotated interacting human-object pair, the researchers first compute its unary and binary spatial features.

K-means clustering is then applied to determine cluster centers, which are used as representative spatial patterns of interacting human-object pairs.

These global spatial interaction patterns provide category-agnostic representative spatial configurations that serve as a bridge for understanding interactivity across seen and unseen interaction concepts.
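
The two steps above can be sketched as follows. The exact unary and binary spatial features used in the paper are not spelled out in this article, so normalized box geometry and center offsets are assumed here, and the number of clusters is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def spatial_features(human_box, object_box, img_w, img_h):
    """Unary + binary spatial features for one human-object pair.
    Boxes are (x1, y1, x2, y2); the exact feature set is an assumption."""
    def unary(b):
        x1, y1, x2, y2 = b
        return [x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h,
                (x2 - x1) * (y2 - y1) / (img_w * img_h)]          # normalized box + area
    hx, hy = (human_box[0] + human_box[2]) / 2, (human_box[1] + human_box[3]) / 2
    ox, oy = (object_box[0] + object_box[2]) / 2, (object_box[1] + object_box[3]) / 2
    binary = [(ox - hx) / img_w, (oy - hy) / img_h]               # relative center offset
    return np.array(unary(human_box) + unary(object_box) + binary)

def global_spatial_patterns(pair_features, k=32):
    """K-means centers over all annotated pairs act as the category-agnostic
    global spatial patterns (Cgsp)."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(np.stack(pair_features)).cluster_centers_
```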

Finally, the researchers incorporated the combined knowledge into the image encoder through a lightweight adapter.
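
The article does not detail the adapter, so the sketch below uses a standard residual bottleneck design as a plausible stand-in; the pooling of the prior tokens and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class LightweightAdapter(nn.Module):
    """Residual bottleneck adapter that injects the combined prior knowledge
    into the image encoder's token sequence (illustrative design)."""

    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens, prior_tokens):
        """tokens: (B, N, dim) encoder tokens; prior_tokens: (B, M, dim) priors
        already projected to the encoder width."""
        prior = prior_tokens.mean(dim=1, keepdim=True)        # pooled prior, (B, 1, dim)
        return tokens + self.up(self.act(self.down(tokens + prior)))
```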

Generalizable Interaction Classification

To retain the generalizable knowledge of CLIP while learning task-specific representations for human-object interaction detection, the team adopted language-aware prompt learning with a consistency constraint.

This constraint ensures that the learned prototypes for seen and unseen categories maintain a reasonable separation boundary and do not deviate too much from each other.

Specifically, for each action category, the researchers first format it with a human-designed prompt, while learnable context words act as a bridge between the semantics of seen and unseen categories.

The final category representation is obtained by concatenating the learnable context words with the word embeddings of this sentence and passing the result through the text encoder.
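
A schematic version of this text branch is sketched below; the context length, embedding width, and the `text_encoder` / `token_embedding` interfaces are assumed stand-ins for the corresponding CLIP components.

```python
import torch
import torch.nn as nn

class CategoryPromptBuilder(nn.Module):
    """Builds category prototypes by prepending learnable context tokens to the
    embeddings of a human-designed prompt and running the result through the
    text encoder (interfaces are illustrative stand-ins for CLIP's)."""

    def __init__(self, text_encoder, token_embedding, n_context=8, dim=512):
        super().__init__()
        self.text_encoder = text_encoder        # maps (C, L, dim) embeddings to (C, dim) features
        self.token_embedding = token_embedding  # frozen word-embedding table
        self.context = nn.Parameter(torch.randn(n_context, dim) * 0.02)  # learnable context words

    def forward(self, prompt_token_ids):
        """prompt_token_ids: (C, L) token ids of the human-designed prompts,
        one row per action category."""
        words = self.token_embedding(prompt_token_ids)                  # (C, L, dim)
        ctx = self.context.unsqueeze(0).expand(words.size(0), -1, -1)   # (C, n_context, dim)
        return self.text_encoder(torch.cat([ctx, words], dim=1))        # (C, dim) prototypes
```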

To further exploit the feature space already learned by the multimodal model's text encoder and to improve generalization to unseen categories, the researchers proposed using human-designed prompts to guide the feature space of the learnable language prompts.

The team applies a regularized contrastive learning loss to reduce the discrepancy between the learned prompt representations and those of the human-designed language prompts.
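
One plausible form of this loss is an InfoNCE-style objective that pulls each learned prototype toward the frozen feature of its own human-designed prompt; the article only states that a regularized contrastive loss is applied, so the exact formulation below is an assumption.

```python
import torch
import torch.nn.functional as F

def language_regularization_loss(learned_protos, handcrafted_protos, temperature=0.07):
    """learned_protos: (C, dim) prototypes from the learnable prompts;
    handcrafted_protos: (C, dim) frozen features of the human-designed prompts."""
    a = F.normalize(learned_protos, dim=-1)
    b = F.normalize(handcrafted_protos.detach(), dim=-1)      # no gradient into the anchor
    logits = a @ b.t() / temperature                          # (C, C) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)        # each prototype matches its own prompt
    return F.cross_entropy(logits, targets)
```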

Training CMMP

Based on the interactivity-aware feature maps and the bounding boxes of people and objects extracted by the pre-trained object detector, the team first applied ROI-Pooling to extract features from different regions.

The features extracted from these regions are then fused, and an interaction classifier produces the final interaction category prediction.
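
The head can be sketched roughly as below, pooling human, object, and union regions with RoI-Align and fusing them by concatenation; the pooled resolution, fusion scheme, and layer widths are assumptions, as the article does not specify them.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class InteractionHead(nn.Module):
    """Pools human, object, and union regions from the interactivity-aware
    feature map, fuses them, and predicts the interaction category."""

    def __init__(self, feat_dim=256, num_interactions=600, pool=7):
        super().__init__()
        self.pool = pool
        self.classifier = nn.Sequential(
            nn.Linear(3 * feat_dim * pool * pool, 512), nn.ReLU(),
            nn.Linear(512, num_interactions),
        )

    def forward(self, feat_map, human_boxes, object_boxes, stride=16):
        """feat_map: (1, C, H, W); *_boxes: (N, 4) in image coordinates."""
        union = torch.cat([torch.min(human_boxes[:, :2], object_boxes[:, :2]),
                           torch.max(human_boxes[:, 2:], object_boxes[:, 2:])], dim=1)
        feats = [roi_align(feat_map, [b], output_size=self.pool, spatial_scale=1.0 / stride)
                 for b in (human_boxes, object_boxes, union)]
        fused = torch.cat([f.flatten(1) for f in feats], dim=1)   # (N, 3 * C * pool * pool)
        return self.classifier(fused)                             # (N, num_interactions) logits
```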

The whole model is trained with a focal loss for interaction classification, together with the language regularization loss.
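
Put together, the training objective looks roughly like the sketch below, combining torchvision's sigmoid focal loss with the language regularization term from the earlier sketch; the loss weight is an assumed hyperparameter.

```python
from torchvision.ops import sigmoid_focal_loss

def training_loss(interaction_logits, interaction_targets, language_reg_loss, reg_weight=1.0):
    """interaction_logits / interaction_targets: (N, num_interactions) multi-label
    predictions and binary targets; language_reg_loss: scalar from the
    regularization sketch above; reg_weight is an assumed weighting factor."""
    cls_loss = sigmoid_focal_loss(interaction_logits,
                                  interaction_targets.float(),
                                  reduction="mean")
    return cls_loss + reg_weight * language_reg_loss
```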

Experimental Results

For evaluation, the team adopted HICO-DET, a widely used human-object interaction detection dataset whose 600 interaction categories are composed of 80 object categories and 117 verb categories.

To verify the model's zero-shot performance, the researchers evaluated it under five zero-shot settings.

For a fair comparison with existing methods, ViT-B/16 is used as the backbone network by default.

As shown in the following table, CMMP performs well under every zero-shot setting and achieves the best performance on unseen classes in each, which demonstrates the effectiveness of introducing conditional multimodal prompts.



As the last row of each setting in the table shows, when CMMP is scaled up with a ViT-L/14 backbone to match the FLOPs of CLIP4HOI, the new method achieves the best performance on all partitions.

This demonstrates that the team's model has superior capabilities in extracting spatial relationships from visual features and in prototype learning for interaction classification.

Furthermore, previous methods exhibit severe performance discrepancies between seen and unseen categories, indicating their lack of generalization ability.

The model in this study largely alleviates this problem, and its strong potential for generalizing to previously unseen interaction categories confirms the effectiveness of multimodal prompts with constraints.

Please refer to the original paper for more details.