
To improve the detection performance of GPT-4V and Gemini, you need this kind of prompting paradigm

2024-07-22



The authors of this article are from Zhejiang University, the Shanghai Artificial Intelligence Laboratory, the Chinese University of Hong Kong, the University of Sydney, and the University of Oxford. Authors: Wu Yixuan, Wang Yizhou, Tang Shixiang, Wu Wenhao, He Tong, Wanli Ouyang, Philip Torr, and Jian Wu. Co-first author Wu Yixuan is a doctoral student at Zhejiang University, and co-first author Wang Yizhou is a research assistant at the Shanghai Artificial Intelligence Laboratory. Corresponding author Tang Shixiang is a postdoctoral researcher at the Chinese University of Hong Kong.

Multimodal large language models (MLLMs) have demonstrated impressive capabilities across many tasks, yet their potential in detection tasks is still underestimated. When complex detection tasks require precise coordinates, the hallucinations of MLLMs often cause them to miss the target object or produce inaccurate bounding boxes. Existing work on enabling detection with MLLMs requires not only collecting large, high-quality instruction datasets but also fine-tuning open-source models; this is time-consuming and labor-intensive, and it cannot take advantage of the stronger visual understanding of closed-source models. To address this, Zhejiang University, in collaboration with the Shanghai Artificial Intelligence Laboratory and the University of Oxford, proposed DetToolChain, a new prompting paradigm that unleashes the detection ability of multimodal large language models, enabling them to perform accurate detection without any training. The work has been accepted to ECCV 2024.

To address the problems MLLMs face in detection tasks, DetToolChain starts from three points: (1) designing visual prompts for detection, which are more direct and effective than traditional textual prompts in helping MLLMs understand location information; (2) breaking fine-grained detection tasks down into small, simple subtasks; and (3) using a chain of thought to gradually refine the detection results and avoid the hallucinations of multimodal large models as much as possible.

Corresponding to these insights, DetToolChain includes two key designs: (1) a comprehensive set of visual processing prompts, which are drawn directly onto the image and significantly narrow the gap between visual and textual information; and (2) a comprehensive set of detection reasoning prompts, which enhance spatial understanding of the detection target and progressively determine its final precise location through a sample-adaptive detection toolchain.

By combining DetToolChain with MLLMs such as GPT-4V and Gemini, various detection tasks, including open-vocabulary detection, described object detection, referring expression comprehension, and oriented object detection, can be supported without instruction tuning.



Paper title: DetToolChain: A New Prompting Paradigm to Unleash Detection Ability of MLLM

Paper link: https://arxiv.org/abs/2403.12488

What is DetToolChain?



Figure 1: The overall framework of DetToolChain

As shown in Figure 1, for a given query image, MLLM is instructed to perform the following steps:

I. Formatting: Convert the original input format of the task into an appropriate instruction template to serve as the MLLM's input;

II. Think: Decompose the specific, complex detection task into simpler subtasks and select effective prompts from the detection prompt toolkit;

III. Execute: Execute the selected prompts iteratively, in order;

IV. Respond: Use the MLLM's own reasoning ability to supervise the entire detection process and return the final answer.
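A minimal sketch of this four-step loop is shown below. It assumes a generic `mllm` callable (text plus image in, text out) and a `toolkit` of named prompt tools; these interfaces and names are illustrative assumptions, not the authors' implementation.

```python
from typing import Any, Callable, Dict

def dettoolchain(
    mllm: Callable[[str, Any], str],             # generic text+image -> text MLLM call (assumed interface)
    image: Any,
    task_query: str,
    toolkit: Dict[str, Callable[[Dict], Dict]],  # named visual/reasoning prompt tools (assumed)
) -> str:
    # I. Formatting: convert the raw task into an instruction template.
    instruction = f"Detect and return bounding boxes for: {task_query}"

    # II. Think: ask the MLLM which tools to apply, in order.
    plan_text = mllm(
        f"{instruction}\nWhich of these tools should be applied, in order? "
        f"Options: {list(toolkit)}", image)
    plan = [name for name in toolkit if name in plan_text]

    # III. Execute: apply the chosen prompts one by one, carrying state forward.
    state: Dict = {"image": image, "boxes": []}
    for name in plan:
        state = toolkit[name](state)

    # IV. Respond: let the MLLM verify the chain and produce the final answer.
    return mllm(
        f"{instruction}\nCandidate boxes: {state['boxes']}\n"
        f"Verify and return the final boxes.", state["image"])
```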

Detection Prompt Toolkit: Visual Processing Prompts



Figure 2: Schematic diagram of visual processing prompts. We designed (1) Regional Amplifier, (2) Spatial Measurement Standard, and (3) Scene Image Parser to improve the detection capability of MLLMs from different perspectives.

As shown in Figure 2, (1) the Regional Amplifier aims to enhance the visibility of the region of interest (ROI) to MLLMs by cropping the original image into different sub-regions and focusing on the sub-region containing the target object; in addition, its magnification function enables fine-grained observation of specific sub-regions of the image.
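The crop-and-magnify idea can be sketched in a few lines with Pillow. This is an illustrative approximation, not the paper's code; the ROI here is assumed to come from a coarse first-pass prediction.

```python
from PIL import Image

def regional_amplifier(image: Image.Image, roi: tuple, zoom: float = 2.0) -> Image.Image:
    """Crop a coarse region of interest and magnify it for fine-grained inspection.

    roi is (left, top, right, bottom) in pixels; zoom > 1 enlarges the crop.
    Illustrative sketch only.
    """
    crop = image.crop(roi)
    new_size = (int(crop.width * zoom), int(crop.height * zoom))
    return crop.resize(new_size, Image.LANCZOS)

# Example: magnify the left half of an image by 2x before re-querying the MLLM.
# img = Image.open("query.jpg")
# amplified = regional_amplifier(img, (0, 0, img.width // 2, img.height), zoom=2.0)
```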

(2) The Spatial Measurement Standard provides a more explicit reference for object detection by superimposing a ruler and a compass with linear scales on the original image, as shown in Figure 2 (2). The auxiliary ruler and compass let MLLMs output accurate coordinates and angles using the translation and rotation references overlaid on the image. In essence, these auxiliary lines simplify the detection task, allowing MLLMs to read off the coordinates of the object instead of predicting them directly.
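Below is a minimal sketch of overlaying linear scales with Pillow so that coordinates can be read off the image; it covers only the ruler part (the paper also overlays a compass for angles), and tick spacing and colors are arbitrary assumptions.

```python
from PIL import Image, ImageDraw

def overlay_ruler(image: Image.Image, num_ticks: int = 10) -> Image.Image:
    """Overlay simple linear scales along the top and left edges so the MLLM
    can read coordinates instead of estimating them. Illustrative sketch only."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for i in range(num_ticks + 1):
        x = int(i * w / num_ticks)
        y = int(i * h / num_ticks)
        draw.line([(x, 0), (x, 12)], fill="red", width=2)   # tick on the top edge
        draw.line([(0, y), (12, y)], fill="red", width=2)   # tick on the left edge
        draw.text((x + 2, 14), str(x), fill="red")          # x-coordinate label
        draw.text((14, y + 2), str(y), fill="red")          # y-coordinate label
    return out
```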

(3) The Scene Image Parser marks predicted object positions or relationships, using spatial and contextual information to understand the layout of the image. The Scene Image Parser falls into two categories. First, for a single target object, we mark the predicted object with its centroid, convex hull, and bounding box, together with the label name and box index. These markers represent the object's location information in different formats, enabling the MLLM to detect diverse objects with different shapes and backgrounds, especially those with irregular shapes or heavy occlusion. For example, the convex hull marker labels the boundary points of an object and connects them into a convex hull to enhance detection of objects with highly irregular shapes. Second, for multiple targets, we connect the centers of different objects with a scene graph marker to highlight the relationships between objects in the image. Based on the scene graph, the MLLM can use its contextual reasoning ability to refine the predicted bounding boxes and avoid hallucinations. For example, as shown in Figure 2 (3), Jerry wants to eat the cheese, so their bounding boxes should be very close.
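A simplified sketch of these markers is shown below, again using Pillow. It draws only bounding boxes, centroids, label/index text, and scene-graph edges between box centers; convex hulls are omitted, and the drawing style is an assumption rather than the paper's exact rendering.

```python
from PIL import Image, ImageDraw

def scene_image_parser(image: Image.Image, boxes: dict) -> Image.Image:
    """Mark each predicted object with its bounding box, index, and centroid,
    then connect object centers to sketch a scene graph.

    boxes maps a label to (left, top, right, bottom). Illustrative sketch only.
    """
    out = image.copy()
    draw = ImageDraw.Draw(out)
    centers = {}
    for idx, (label, (l, t, r, b)) in enumerate(boxes.items()):
        draw.rectangle([l, t, r, b], outline="lime", width=2)          # bounding box
        cx, cy = (l + r) // 2, (t + b) // 2
        draw.ellipse([cx - 3, cy - 3, cx + 3, cy + 3], fill="red")     # centroid marker
        draw.text((l, max(0, t - 12)), f"{idx}:{label}", fill="lime")  # name + box index
        centers[label] = (cx, cy)
    labels = list(centers)
    for i in range(len(labels)):                                       # scene-graph edges
        for j in range(i + 1, len(labels)):
            draw.line([centers[labels[i]], centers[labels[j]]], fill="yellow", width=1)
    return out
```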

Detection Prompt Toolkit: Detection Reasoning Prompts



To improve the reliability of the predicted boxes, we apply detection reasoning prompts (shown in Table 1) to check the prediction results and diagnose potential problems. First, we propose the Problem Insight Guider, which highlights difficult cases and provides effective detection suggestions and similar examples for the query image. For example, for Figure 3, the Problem Insight Guider frames the query as a small-object detection problem and suggests solving it by zooming in on the surfboard area. Second, to leverage the inherent spatial and contextual capabilities of MLLMs, we design the Spatial Relationship Explorer and the Contextual Object Predictor to ensure that detection results are consistent with common sense. As shown in Figure 3, a surfboard is likely to co-occur with the ocean (contextual knowledge), and there should be a surfboard near the surfer's feet (spatial knowledge). In addition, we apply the Self-Verification Promoter to enhance the consistency of multi-round responses. To further improve the reasoning capabilities of MLLMs, we adopt widely used prompting methods such as debating and self-debugging. For details, please see the original paper.
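To make these reasoning prompts concrete, the templates below illustrate what such instructions might look like in practice. The wording is an assumption for demonstration only, not the exact prompts used in the paper.

```python
# Illustrative prompt templates for three of the reasoning prompts (assumed wording).

PROBLEM_INSIGHT_GUIDER = (
    "This looks like a small-object detection problem. "
    "Consider zooming into the region around '{target}' before predicting its box."
)

CONTEXTUAL_OBJECT_PREDICTOR = (
    "Check the predicted box for '{target}' against common sense: "
    "which objects usually co-occur with it, and is the box placed near them?"
)

SELF_VERIFICATION_PROMOTER = (
    "You previously predicted {box} for '{target}'. Re-examine the image and "
    "confirm or correct this box; keep your answers consistent across rounds."
)

# Example usage:
# prompt = SELF_VERIFICATION_PROMOTER.format(box=[120, 40, 180, 90], target="surfboard")
```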



Figure 3: Detection reasoning prompts help MLLMs solve small-object detection problems, for example by using common sense to locate the surfboard under a person's feet and encouraging the model to detect a surfboard in the ocean.



Figure 4: An example of DetToolChain applied to oriented object detection (HRSC2016 dataset)

Experiments: outperforming fine-tuned methods without any training



As shown in Table 2, we evaluate our method on open-vocabulary detection (OVD), reporting AP50 on the 17 novel classes, 48 base classes, and all classes of the COCO OVD benchmark. The results show that the performance of both GPT-4V and Gemini improves significantly with DetToolChain.



To demonstrate the effectiveness of our method on referring expression comprehension, we compare it with other zero-shot methods on the RefCOCO, RefCOCO+, and RefCOCOg datasets (Table 5). On RefCOCO, DetToolChain improves the performance of the GPT-4V baseline by 44.53%, 46.11%, and 24.85% on val, test-A, and test-B, respectively, demonstrating its superior referring expression comprehension and localization under zero-shot conditions.