
Li Xi's team at Zhejiang University: ScanFormer, a new method for referring expression comprehension that eliminates redundancy from coarse to fine

2024-08-20


AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work you would like to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

All authors of this paper are from Professor Li Xi's team at Zhejiang University. The first author is doctoral student Su Wei, and the corresponding author is Professor Li Xi (IET Fellow, National Outstanding Young Scientist). In recent years, Professor Li Xi's team has published more than 180 CV/AIGC-related research works in authoritative international journals (such as TPAMI and IJCV) and top international conferences (ICCV, CVPR, ECCV, etc.), and collaborates extensively with well-known universities and research institutions in China and abroad.

Referring expression comprehension (REC), a fundamental vision-language task, localizes the object referred to by a natural language description in an image. A REC model typically consists of three parts: a visual encoder, a text encoder, and a cross-modal interaction module, which respectively extract visual features, extract text features, and perform cross-modal feature interaction and enhancement.

Most current research focuses on designing efficient cross-modal interaction modules to improve accuracy, while the visual encoder itself is rarely explored. A common approach is to use a feature extractor pre-trained on classification or detection tasks, such as ResNet, DarkNet, Swin Transformer, or ViT. These models traverse all spatial locations of the image, extracting features via sliding windows or by dividing it into patches, so their computational cost grows rapidly with image resolution; this is especially pronounced in Transformer-based models.
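To make this scaling concrete (the arithmetic below is illustrative and not taken from the paper), the snippet counts the tokens a ViT-style encoder with 16×16 patches produces at several resolutions, and the number of pairwise interactions its self-attention has to compute:

```python
# Illustrative arithmetic (not from the paper): token count for a ViT-style encoder
# with 16x16 patches, and the pairwise interactions its self-attention must compute.
def attention_cost(resolution: int, patch_size: int = 16) -> tuple[int, int]:
    num_patches = (resolution // patch_size) ** 2   # tokens grow with resolution^2
    return num_patches, num_patches ** 2            # self-attention grows with tokens^2

for res in (224, 448, 896):
    tokens, pairs = attention_cost(res)
    print(f"{res}px -> {tokens} tokens, {pairs:,} pairwise attention interactions")
```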

Because images are spatially redundant, they contain many low-information background regions as well as regions irrelevant to the referring expression. Extracting features from these regions in the same way as from the rest of the image increases computation without contributing useful features. A more efficient strategy is to predict in advance the text relevance and information content of image regions, extract features thoroughly from foreground regions related to the text, and only coarsely from background regions. An intuitive way to realize this regional prediction is through an image pyramid: identify background regions in the coarse-grained image at the top of the pyramid, then gradually add high-resolution, fine-grained foreground regions.
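As a rough illustration of this idea (a minimal sketch under our own assumptions of a square input and 2× downsampling per level, not the paper's implementation), the snippet below builds such an image pyramid with the coarsest level first:

```python
import torch
import torch.nn.functional as F

# Minimal sketch: build an image pyramid whose coarsest level sits at the "top"
# and is processed first; finer levels are only visited where needed.
def build_pyramid(image: torch.Tensor, levels: int = 3) -> list[torch.Tensor]:
    # image: (3, H, W); returns [coarsest, ..., original], coarse -> fine
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(F.interpolate(pyramid[-1][None], scale_factor=0.5,
                                     mode="bilinear", align_corners=False)[0])
    return pyramid[::-1]

pyr = build_pyramid(torch.rand(3, 256, 256))
print([tuple(p.shape) for p in pyr])   # [(3, 64, 64), (3, 128, 128), (3, 256, 256)]
```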

Based on the above analysis, we propose ScanFormer, a coarse-to-fine iterative perception framework. It scans the image pyramid layer by layer, starting from the low-resolution, coarse-scale image, and gradually filters out irrelevant/background regions to reduce wasted computation, allowing the model to focus on foreground/task-relevant regions.



  • Paper title: ScanFormer: Referring Expression Comprehension by Iteratively Scanning
  • Paper link: https://arxiv.org/pdf/2406.18048

Method Introduction

1. Coarse-to-fine Iterative Perception Framework

To keep the structure simple, we adopt ViLT [1], which unifies the text and visual modalities, and split it along the depth dimension into two parts, Encoder1 and Encoder2, which serve different purposes.

First, text features are extracted and stored in a KV cache; then an image pyramid is constructed and iterated from its top layer downwards. In each iteration, the patches selected at the current scale are fed in, and Encoder1 predicts, for each patch, the selection of the corresponding fine-grained patches at the next scale. In particular, all patches of the top-level image are selected, ensuring the model obtains coarse-grained information about the whole image. Encoder2 further extracts features and predicts the bounding box for the current scale from its [cls] token.

Meanwhile, the intermediate features of Encoder1 and Encoder2 are stored in the KV cache so that subsequent scales can reuse them. As the scale increases, fine-grained features are introduced, the location prediction becomes more accurate, and most irrelevant patches are discarded, saving a large amount of computation.

In addition, patches attend bidirectionally to one another within each scale, while also attending to all patches of the previous scales and to the text features; across scales the attention is causal, which further reduces computation.
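The toy sketch below illustrates this overall loop. The module interfaces, the 0.5 selection threshold, and the plain linear layers standing in for Encoder1/Encoder2 are our own simplifications for readability, not the released ScanFormer code:

```python
import torch
import torch.nn as nn

# Toy, runnable sketch of the coarse-to-fine scanning loop. Each "scale" is a
# flattened G x G patch grid; unselected patches are simply skipped here rather
# than replaced by a constant token as in the paper.
class ToyScanLoop(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.encoder1 = nn.Linear(dim, dim)   # stand-in for Encoder1 (selection prediction)
        self.selector = nn.Linear(dim, 1)
        self.encoder2 = nn.Linear(dim, dim)   # stand-in for Encoder2 (feature refinement)
        self.box_head = nn.Linear(dim, 4)     # box regressed from pooled features ([cls] in the paper)

    def forward(self, pyramid, text_feats):
        # pyramid: list of (G*G, dim) patch embeddings, coarse -> fine (G doubles per level)
        kv_cache = [text_feats]               # text is encoded once and cached
        grid = int(pyramid[0].shape[0] ** 0.5)
        keep = torch.ones(grid, grid, dtype=torch.bool)   # top scale: all patches selected
        box = None
        for embeds in pyramid:
            x = embeds[keep.flatten()]                    # process only the selected patches
            h = self.encoder1(x + kv_cache[-1].mean(0))   # crude stand-in for cross-attention
            relevant = self.selector(h).squeeze(-1).sigmoid() > 0.5
            h = self.encoder2(h)
            kv_cache.append(h)                            # cached for later, finer scales
            box = self.box_head(h.mean(0))
            sel_grid = torch.zeros_like(keep)             # scatter decisions back onto the grid
            sel_grid[keep] = relevant
            # each coarse patch expands to a 2x2 block of fine patches at the next scale
            keep = sel_grid.repeat_interleave(2, 0).repeat_interleave(2, 1)
        return box

loop = ToyScanLoop()
pyramid = [torch.randn(g * g, 64) for g in (4, 8, 16)]    # 3-level pyramid
print(loop(pyramid, torch.randn(5, 64)).shape)            # torch.Size([4])
```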



2. Dynamic patch selection

Whether each patch is selected is determined by a selection factor generated at the previous scale. There are two candidate places to apply it. The first is in all heads of the MHSA in every Encoder layer; however, for an Encoder with N layers and H heads, it is difficult to obtain effective gradient signals at so many positions, and the learned selection factors are poor. The second is to apply it directly to the Encoder input, i.e., to the patch embeddings; since it is applied at only this one position, it is easier to learn, and this is the scheme adopted in this paper.

It should also be noted that even if an input patch embedding is set to 0, the MHSA and FFN layers will make that patch's features non-zero in subsequent layers, so it still affects the features of the remaining patches. Fortunately, when a token sequence contains many identical tokens, the MHSA computation can be simplified, yielding real inference speedups. Moreover, to increase the model's flexibility, this paper does not set the patch embedding to 0 but replaces it with a learnable constant token.
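A minimal sketch of this replacement step, with a hypothetical `PatchReplace` module and a soft per-patch selection factor (our own simplification, not the paper's code):

```python
import torch
import torch.nn as nn

# Unselected patch embeddings are blended into a single learnable constant token,
# gated by a per-patch selection factor in [0, 1].
class PatchReplace(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.constant = nn.Parameter(torch.zeros(dim))   # shared learnable constant token

    def forward(self, patch_embeds: torch.Tensor, select: torch.Tensor) -> torch.Tensor:
        s = select.unsqueeze(-1)                         # (N, 1) selection factors
        return s * patch_embeds + (1.0 - s) * self.constant  # selected patches keep their embedding

replace = PatchReplace()
embeds = torch.randn(10, 64)
select = (torch.rand(10) > 0.5).float()   # hard 0/1 decisions at inference time
print(replace(embeds, select).shape)      # torch.Size([10, 64])
```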

The patch selection problem is thus turned into a patch replacement problem, and the selection process decomposes into two steps: constant token replacement and token merging. Unselected patches are replaced with the same constant token. Because these unselected tokens are identical, under scaled dot-product attention they can be merged into a single token whose contribution is weighted by their count, which is equivalent to adding the logarithm of the count to that token's attention logit. The form of dot-product attention therefore remains unchanged, and common acceleration methods remain applicable.
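The merging trick rests on a simple property of scaled dot-product attention: n identical (key, value) pairs contribute exactly as much as a single pair whose attention logit is increased by log n. The snippet below checks this numerically (shapes and names are purely illustrative):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n_same = 16, 7
q = torch.randn(1, d)
k_fg, v_fg = torch.randn(3, d), torch.randn(3, d)       # three "selected" tokens
k_c, v_c = torch.randn(1, d), torch.randn(1, d)         # the shared constant token

def attend(q, k, v, extra_logit=0.0):
    logits = q @ k.t() / d ** 0.5 + extra_logit          # extra_logit shifts the merged token's score
    return F.softmax(logits, dim=-1) @ v

# Full sequence: 3 distinct tokens plus 7 copies of the constant token.
out_full = attend(q, torch.cat([k_fg, k_c.expand(n_same, d)]),
                     torch.cat([v_fg, v_c.expand(n_same, d)]))
# Merged sequence: the 7 copies collapse into one token whose logit gets + log(7).
offset = torch.tensor([0.0, 0.0, 0.0, math.log(n_same)])
out_merged = attend(q, torch.cat([k_fg, k_c]), torch.cat([v_fg, v_c]), extra_logit=offset)
print(torch.allclose(out_full, out_merged, atol=1e-6))   # True
```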



Experimental Results

Our method achieves performance close to the state of the art on four datasets: RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame. After pre-training on large-scale datasets and fine-tuning on the specific datasets, its performance improves significantly further and becomes comparable to that of pre-trained models such as MDETR [2] and OFA [3].





In terms of inference speed, the proposed method achieves real-time inference while maintaining high task accuracy.



In addition, the experiments statistically analyze the model's patch selection and the distribution of localization accuracy across scales (scale 1 and scale 2).

As the left figure shows, as the scale increases and fine-grained image features are added, the model's accuracy gradually improves. It is therefore tempting to add an early-exit mechanism that stops as soon as the localization is accurate enough, avoiding further computation on higher-resolution images and adaptively choosing a suitable resolution per sample. We made some preliminary attempts in this direction, including adding prediction branches for IoU, GIoU, and uncertainty and regressing early-exit indicators, but found the results unsatisfactory. How to design a suitable and accurate early-exit indicator requires further exploration.
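For illustration only, the sketch below shows what such an early-exit check could look like, exiting once a predicted quality score (e.g., a regressed IoU) passes a threshold; as noted above, the paper found such indicators not yet reliable enough in practice:

```python
# Purely illustrative sketch of the early-exit idea discussed above (not a working
# component of the paper).
def scan_with_early_exit(scales, predict_box_and_score, threshold=0.9):
    box = None
    for scale in scales:                                  # iterate coarse -> fine
        box, score = predict_box_and_score(scale)         # box plus a predicted quality score
        if score >= threshold:                            # confident enough: skip finer scales
            return box
    return box

# Toy usage with a hypothetical per-scale predictor.
preds = {"coarse": ([0, 0, 10, 10], 0.95), "fine": ([1, 1, 9, 9], 0.99)}
print(scan_with_early_exit(["coarse", "fine"], lambda s: preds[s]))  # exits at "coarse"
```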

The right figure shows the patch selection at different scales. At every scale the selected patches account for a relatively small proportion, and most patches can be discarded, effectively saving computing resources. For each sample (image plus referring expression), the number of patches actually selected is relatively small, about 65% of the total.



Finally, the experiments include some visualization results. As the scale increases (red → green → blue), the model's localization accuracy gradually improves. Moreover, the images reconstructed from the selected patches show that the model attends only to coarse-scale information in background regions, while for the relevant foreground regions it attends to fine-grained detail.



Related Literature:

[1] Kim W, Son B, Kim I. ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision. International Conference on Machine Learning (ICML). PMLR, 2021: 5583-5594.

[2] Kamath A, Singh M, LeCun Y, et al. MDETR: Modulated Detection for End-to-End Multi-Modal Understanding. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). 2021: 1780-1790.

[3] Wang P, Yang A, Men R, et al. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. International Conference on Machine Learning (ICML). PMLR, 2022: 23318-23340.