2024-08-12
New Intelligence Report
Editors: LRST, So Sleepy
[New Intelligence Introduction] Mini-Monkey is a lightweight multimodal large language model that adopts a multi-scale adaptive segmentation strategy (MSAC) and a scale compression mechanism (SCM) to effectively alleviate the sawtooth effect caused by traditional image segmentation strategies, improving performance on high-resolution image processing and document understanding tasks. It achieves leading results on multiple benchmarks, demonstrating its potential for multimodal understanding and document intelligence.
Improving the ability of multimodal large models to process high-resolution images has recently attracted increasing attention.
Most methods aim to improve a multimodal large model's understanding of image details by segmenting the image and then fusing the pieces.
However, segmenting the image inevitably fragments objects and connected regions, leaving MLLMs poor at recognizing small or irregularly shaped targets. This is particularly evident in document understanding tasks, where lines of text are frequently cut off.
To address this challenge, Huazhong University of Science and Technology and South China University of Technology recently jointly released Mini-Monkey, a lightweight multimodal large model equipped with a pluggable multi-scale adaptive segmentation strategy (MSAC).
Mini-Monkey adaptively generates multi-scale representations, allowing the model to select unsegmented objects from the various scales, and its performance sets a new SOTA among 2B multimodal large models.
Paper address: https://arxiv.org/pdf/2408.02034
Project address: https://github.com/Yuliang-Liu/Monkey
To alleviate the computational overhead introduced by MSAC, the authors propose a scale compression mechanism (SCM) that effectively compresses the image tokens.
Mini-Monkey not only achieves leading performance on multiple document intelligence tasks, but also delivers consistent gains on general multimodal understanding tasks, reaching SOTA at the 2B scale.
On OCRBench, Mini-Monkey scores 802 points, outperforming models with more parameters such as GLM-4v-9B.
Figure 3. Block diagram of the method: H-Attn denotes high attention weight; L-Attn denotes low attention weight; tokens with lower attention weights are filtered out; "shared LLM layers" denotes the LLM blocks reused by SCM
Background
Multimodal large language models (MLLMs) have attracted a lot of attention in recent years, and researchers are actively exploring effective ways to integrate visual encoders with LLMs.
Methods such as Flamingo, BLIP-2, MiniGPT4, Qwen-VL, and LLaVA have made notable progress, but earlier multimodal large language models handled detailed scene understanding poorly because of their limited input resolution.
Figure 1. The sawtooth effect caused by segmentation on common objects: (a) input image; (b) segmentation-based resolution-increase strategy; (c) segmentation-based resolution-increase strategy with overlap; (d) multi-scale adaptive segmentation strategy
Researchers began to address this problem by increasing the input resolution of the image. The segmentation strategy is one of the most commonly used approaches, adopted for example by Monkey, LLaVA 1.6, InternVL 1.5 and Llama3-V.
Despite significant progress in multimodal large language models, challenges remain in detailed scene understanding due to segmentation strategies.
Segmenting the image inevitably splits objects and connected regions, which weakens the ability of MLLMs to recognize small or irregularly shaped objects, especially in the context of document understanding.
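To make the strategy concrete, here is a minimal Python sketch of this kind of dynamic tiling, in the style of InternVL-like pipelines, not the exact implementation of any model named above; the 448-pixel tile size, the tile budget, and the helper names are assumptions for illustration.

```python
from PIL import Image

TILE = 448  # assumed tile size; 448x448 is common in InternVL-style pipelines

def candidate_grids(max_tiles=12):
    """Enumerate (cols, rows) grids whose tile count does not exceed max_tiles."""
    return [(c, r) for c in range(1, max_tiles + 1)
                   for r in range(1, max_tiles + 1) if c * r <= max_tiles]

def best_grid(width, height, max_tiles=12):
    """Pick the grid whose aspect ratio is closest to that of the input image."""
    target = width / height
    return min(candidate_grids(max_tiles), key=lambda g: abs(g[0] / g[1] - target))

def segment(image: Image.Image, grid):
    """Resize the image to fill the chosen grid and cut it into TILE x TILE crops.
    Any object or word that straddles a grid line is split across two crops,
    which is the sawtooth effect discussed in the text."""
    cols, rows = grid
    resized = image.resize((cols * TILE, rows * TILE))
    return [resized.crop((c * TILE, r * TILE, (c + 1) * TILE, (r + 1) * TILE))
            for r in range(rows) for c in range(cols)]

# Example usage: the crops plus a resized global "thumbnail" view
# tiles = segment(img, best_grid(*img.size)) + [img.resize((TILE, TILE))]
```

The key point for this article is that any object or word lying on a grid line is split between two crops, no matter how the grid is chosen.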
This strategy will introduce two types of semantic incoherence:
1. If an object or character is split, it may become unrecognizable. For example, a segmented nose can look very much like a monkey, as shown in Figure 1(b);
2. If a word or sentence is split, its semantics are damaged. For example, the word "Classrooms" may be divided into "Class" and "rooms".
For simplicity, the authors call this problem the sawtooth effect. A straightforward idea is to use an overlapping segmentation strategy, as shown in Figure 1(c).
However, the authors found that the overlapping segmentation strategy introduces hallucinations, causing performance to degrade rather than improve.
Methods
The authors propose Mini-Monkey, a lightweight multimodal large language model that aims to alleviate the sawtooth effect caused by the segmentation strategy. The block diagram of the method is shown in Figure 3.
Figure 2. The sawtooth effect caused by segmentation on a text image.
Unlike existing methods that directly segment the input image, Mini-Monkey adopts a plug-and-play approach called the multi-scale adaptive segmentation strategy (MSAC).
MSAC can effectively complement features at different scales, as shown in Figure 1(d).
Multi-scale adaptive segmentation strategy (MSAC)
MSAC first stratifies the candidate grids into layers, dividing them into three groups according to their aspect ratios. An aspect ratio is chosen for each layer, and different layers provide different information to the model.
The detail layer is responsible for providing detailed information. It bounds both the maximum and minimum image resolution so that the image is as large as possible and the objects in it are clearer. Because a segmentation strategy is used to crop the image, the representation produced by this layer may contain semantic inconsistencies.
Therefore, the authors use an adaptive layer that works together with the detail layer, enabling the model to select unsegmented objects from different scales. The adaptive layer adaptively generates its aspect ratio according to the detail layer, ensuring that the segmentation lines of the detail layer do not overlap with those of the adaptive layer, so the same object is never cut on both layers. This guarantees that the detail layer and the adaptive layer provide the model with complementary semantic information and visual features.
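Below is a minimal sketch of how such non-overlapping layering could look in code, reusing `TILE`, `candidate_grids`, `best_grid` and `segment` from the tiling sketch above. The three layers follow the description in the text, but the specific rule for avoiding shared cut lines and the tile budgets are illustrative assumptions rather than the authors' exact formulation.

```python
from fractions import Fraction

def cut_lines(cols, rows):
    """Normalized positions of a grid's interior cut lines."""
    return ({Fraction(c, cols) for c in range(1, cols)},
            {Fraction(r, rows) for r in range(1, rows)})

def pick_adaptive_grid(detail_grid, width, height, max_tiles=6):
    """Choose an adaptive-layer grid none of whose interior cut lines coincide
    with the detail layer's, so an object cut on one layer stays whole on the other."""
    dxs, dys = cut_lines(*detail_grid)
    target = width / height
    ok = []
    for g in candidate_grids(max_tiles):
        xs, ys = cut_lines(*g)
        if not (xs & dxs or ys & dys):   # no shared vertical or horizontal cut line
            ok.append(g)
    return min(ok, key=lambda g: abs(g[0] / g[1] - target))  # (1, 1) always qualifies

def msac(image, max_detail_tiles=12):
    """Three-layer representation: detail crops, adaptive crops, and a global view."""
    detail_grid = best_grid(*image.size, max_detail_tiles)
    adaptive_grid = pick_adaptive_grid(detail_grid, *image.size)
    return {
        "detail": segment(image, detail_grid),
        "adaptive": segment(image, adaptive_grid),
        "global": image.resize((TILE, TILE)),
    }
```

In the paper the detail layer additionally enforces resolution bounds and the adaptive layer's aspect ratio is derived from the detail layer's; the sketch only captures the "no shared segmentation lines" intent.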
Scale compression mechanism (SCM)
MSAC may introduce some additional computational overhead. Therefore, the authors propose a scale compression mechanism (SCM) for settings with constrained compute. SCM is a training-free, parameter-free mechanism for reducing computational overhead.
The visual tokens of the adaptive layer and the global layer, together with the text tokens, attend to the visual tokens of the detail layer; the resulting attention map is used to retain the top-K visual features of the detail layer.
A trained LLM can effectively select the necessary visual features according to the input question, so SCM reuses the first and second layers of the LLM to select visual tokens without introducing any additional parameters.
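The following PyTorch-style sketch illustrates the token-selection idea. It substitutes a single generic dot-product attention for the shared first two LLM layers, and `keep_ratio` is a hypothetical knob rather than a parameter named in the paper.

```python
import torch

def scale_compress(detail_tokens, query_tokens, keep_ratio=0.5):
    """Score each detail-layer visual token by the attention it receives from the
    adaptive-layer tokens, global-layer tokens and text tokens, then keep only the
    top-scoring fraction. Training-free and parameter-free.

    detail_tokens: (N_d, D) visual tokens of the detail layer
    query_tokens:  (N_q, D) adaptive-layer + global-layer visual tokens and text tokens
    """
    d = detail_tokens.shape[-1]
    # attention map of the query tokens over the detail-layer tokens
    attn = torch.softmax(query_tokens @ detail_tokens.T / d ** 0.5, dim=-1)  # (N_q, N_d)
    scores = attn.mean(dim=0)                        # aggregate over all query tokens
    k = max(1, int(keep_ratio * detail_tokens.shape[0]))
    keep = scores.topk(k).indices.sort().values      # top-K, restored to original order
    return detail_tokens[keep]

# Example with random features:
# detail = torch.randn(1024, 2048); queries = torch.randn(300, 2048)
# compressed = scale_compress(detail, queries, keep_ratio=0.5)  # -> (512, 2048)
```

In Mini-Monkey itself the attention weights come from the first two shared LLM layers rather than a raw dot product, which is why SCM adds no parameters.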
Mini-Monkey: the strongest 2B multimodal large model
The authors evaluated the method on general multimodal understanding and document understanding. The experimental results show that Mini-Monkey achieves the best performance at the 2B parameter scale on both.
Table 1. Results on general multimodal understanding
Table 2. Results on document understanding
The authors compare the proposed MSAC with existing methods: the first row is the dynamic segmentation method, the second row the fixed-resolution segmentation method, the third row overlapping segmentation, and the fourth row the multi-scale strategy S2.
Table 3. Comparison with different segmentation strategies
MSAC can be applied to different multimodal architectures and provides stable improvements
The authors also applied MSAC to other methods for comparison, and consistent improvements can be seen on both general multimodal understanding and document understanding tasks.
Table 4. Applying MSAC to different frameworks
Effectively alleviating the "side effects" of increasing resolution through segmentation
The authors also provide some qualitative analysis, as shown in Figure 4. They ask questions about regions that are cut by the segmentation, such as the words "classrooms" and "school".
It can be seen that, through MSAC, Mini-Monkey effectively alleviates the "side effects" caused by increasing the resolution via segmentation.
Figure 4. Qualitative results: (a) input image and ground truth; (b) result of the overlapping segmentation strategy (OSC denotes the overlapping segmentation strategy); (c) results of InternVL2-2B and InternVL2-26B; (d) result of Mini-Monkey
Visual Comparison
Mini-Monkey can accurately extract the text content of a blurry ancient book, while MiniCPM-V 2.6 and InternVL2-2B miss a lot of text and GPT-4o refuses to answer:
(a) Input image
(b) Mini-Monkey: accurately recognizes all the text
(c) MiniCPM-V 2.6: many words are missing
(d) InternVL2-2B: misses an entire sentence of blurry text
(e) GPT-4o: refuses to answer
Summary
Methods that use segmentation to increase resolution often split objects and connected regions, which limits the recognition of small or irregularly shaped objects and text; this problem is particularly evident in lightweight MLLMs.
In this work, the authors propose Mini-Monkey, a 2B multimodal large model that achieves SOTA performance, aiming to address the limitations of existing segmentation strategies and improve MLLMs' ability to process high-resolution images.
Mini-Monkey adopts a multi-scale adaptive segmentation strategy (MSAC) to generate multi-scale representations, allowing the model to select unsegmented objects at different scales, thereby alleviating this problem.
The authors also verify the effectiveness of the multi-scale adaptive segmentation strategy on multimodal large models with other architectures, providing a simple and effective way to alleviate the "side effects" caused by increasing resolution through segmentation.
References:
[1] Chen Z, Wang W, Tian H, et al. How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites[J]. arXiv preprint arXiv:2404.16821, 2024.
[2] Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models[C]//International Conference on Machine Learning. PMLR, 2023: 19730-19742.
[3] Liu Y, Yang B, Liu Q, et al. TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document[J]. arXiv preprint arXiv:2403.04473, 2024.
[4] Bai J, Bai S, Yang S, et al. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities[J]. arXiv preprint arXiv:2308.12966, 2023.
[5] Dubey A, Jauhri A, Pandey A, et al. The Llama 3 Herd of Models[J]. arXiv preprint arXiv:2407.21783, 2024.