
ACL 2024 Oral | How far are we from true multimodal chain-of-thought reasoning?

2024-08-14




AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions, covering top laboratories at major universities and companies around the world and effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

The first author of this paper, Qiguang Chen, is currently a student in the SCIR Lab at Harbin Institute of Technology. His main research interests include chain-of-thought reasoning for large models and cross-lingual large models.

In the past few years, large language models (LLMs) have made breakthrough progress in natural language processing (NLP). These models can not only understand complex contexts but also generate coherent and logically rigorous text.

However, as technology develops and application scenarios diversify, a single text modality is clearly no longer enough to meet modern needs. People increasingly expect intelligent systems that can process and understand multimodal information (such as images, video, and audio) to cope with more complex tasks and scenarios. Researchers have therefore begun to extend text-based chain-of-thought (CoT) reasoning to multimodal chain-of-thought reasoning to handle more complex and diverse task requirements.

One of the earliest studies of multimodal chain of thought was the ScienceQA benchmark introduced by Lu et al. [1], which combines visual and linguistic information and has promoted research on multimodal chain of thought (MCoT). ScienceQA enables researchers to evaluate the chain-of-thought reasoning ability of multimodal models under a unified framework.

Subsequently, the work of Zhang et al. [2] pushed MCoT performance to a new high, with models outperforming humans on the ScienceQA dataset (93% vs. 88%). But has current research on multimodal reasoning really solved all the challenges? As scores on benchmarks such as ScienceQA keep rising, can we assume the problem of multimodal reasoning has been solved?

Through in-depth analysis, the researchers found that current multimodal chain-of-thought benchmarks still suffer from serious problems that lead to an overestimation of models' actual capabilities. Specifically, they face three issues: missing visual-modality reasoning, only single-step visual reasoning, and insufficient domain coverage.

These problems seriously restrict the development of multimodal chain-of-thought research. The researchers therefore proposed a new benchmark, M³CoT, which aims to address the above issues and advance multi-domain, multi-step, and multi-modal chain-of-thought reasoning. They also conducted a comprehensive evaluation covering a variety of multimodal reasoning settings and methods.

The researchers also found that current multimodal large models still struggle on M³CoT, despite their superior performance on previous, traditional multimodal chain-of-thought benchmarks. They hope that M³CoT can become a valuable resource, providing a groundbreaking foundation for research on multi-domain, multi-step, and multi-modal chains of thought.



Leaderboard address: https://lightchen233.github.io/M3CoT.github.io/leaderboard.html

Paper address: https://arxiv.org/abs/2405.16473

Code address: https://github.com/LightChen233/M3CoT
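
For readers who want to experiment with the benchmark directly, the sketch below shows one way to load it, assuming the data is also mirrored on the Hugging Face Hub; the dataset ID is an assumption here, and the GitHub repository above remains the authoritative source for download instructions and the data format.

```python
# Minimal loading sketch, assuming an M3CoT mirror on the Hugging Face Hub.
# The dataset ID below is an assumption; see the GitHub repository above for
# the authoritative download instructions and released data format.
from datasets import load_dataset

m3cot = load_dataset("LightChen2333/M3CoT", split="test")  # hypothetical ID

print(len(m3cot))         # number of test samples
print(m3cot[0].keys())    # field names depend on the released format
```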

Motivation

Despite significant progress in MCoT research, existing benchmarks still have several shortcomings:

1. Missing visual-modality reasoning: models can often generate the reasoning and answer from the text modality alone, which does not truly reflect the capability of multimodal CoT models.

2. Only single-step visual reasoning: for example, the model only needs to spot the "feather" in a single image to obtain the answer directly. In practical applications, multi-step reasoning is more common and necessary, requiring the model to dynamically combine multimodal information multiple times during the reasoning process.

3. Missing domains: commonsense reasoning and mathematical reasoning are important components of chain-of-thought research, yet existing benchmarks lack coverage of such areas, limiting a comprehensive evaluation of multimodal CoT capabilities.



To address the above issues, the researchers developed the new benchmark M³CoT, in the hope of promoting the research and development of multi-domain, multi-step, and multi-modal chains of thought.



Data Construction Process





The construction of M³CoT involves four key stages.



Evaluation Results of Mainstream Multimodal Large Language Models

The researchers conducted extensive experiments on multiple vision-language large models (VLLMs), including Kosmos-2, InstructBLIP, LLaVA-V1.5, CogVLM, Gemini, and GPT-4V. They also explored several prompting strategies, such as direct prompting, chain-of-thought prompting (CoT) [3], descriptive prompting (Desp-CoT) [4], and scene-graph chain-of-thought prompting (CCoT) [5].
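
As a rough illustration of how these prompting strategies differ, the sketch below assembles the text side of a query for each setting; the exact prompt wording and the `query_vllm` interface are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of the evaluated prompting strategies. The prompt wording
# and the `query_vllm` stub are illustrative assumptions only.

def build_prompt(question: str, options: list[str], strategy: str = "cot") -> str:
    """Assemble the textual part of a multimodal query for one sample."""
    opts = "\n".join(f"({chr(65 + i)}) {o}" for i, o in enumerate(options))
    base = f"Question: {question}\nOptions:\n{opts}\n"
    if strategy == "direct":     # answer immediately, no rationale
        return base + "Answer with the option letter only."
    if strategy == "cot":        # zero-shot CoT trigger [3]
        return base + "Let's think step by step, then give the final option."
    if strategy == "desp-cot":   # describe the image first, then reason [4]
        return base + ("First describe the image in detail, then reason "
                       "step by step and give the final option.")
    if strategy == "ccot":       # scene-graph style prompting [5]
        return base + ("First generate a scene graph of the objects, attributes, "
                       "and relations in the image, then reason over it step by "
                       "step and give the final option.")
    raise ValueError(f"unknown strategy: {strategy}")


def query_vllm(image_path: str, prompt: str) -> str:
    """Placeholder for a call to any VLLM (LLaVA, GPT-4V, Gemini, ...)."""
    raise NotImplementedError


# Hypothetical usage:
# answer = query_vllm("sample.png",
#                     build_prompt("Which animal's feet are adapted for swimming?",
#                                  ["penguin", "eagle", "cat"], strategy="ccot"))
```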





Analysis







Exploration

On this basis, the researchers further explored various commonly used multimodal methods and settings to examine whether they can effectively solve the problems posed by M³CoT.

Tool Usage Exploration

In multimodal reasoning, tool use is considered an effective strategy for improving model performance. The researchers evaluated a variety of tool-use methods, including HuggingGPT, VisualChatGPT, IdealGPT, and Chameleon.

Text-only large models using multimodal tools perform poorly on M³CoT: experimental results show that although these tool-based frameworks perform well on unimodal tasks, there are still significant performance gaps on the M³CoT benchmark. For example, HuggingGPT performs poorly on complex multi-step reasoning tasks because it fails to make effective use of visual information. VisualChatGPT and IdealGPT also fall short of expectations on tasks that require multimodal interaction. These results show that current tool-use frameworks need further improvement to better integrate and exploit multimodal information.
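
To make this setting concrete, the sketch below shows the generic "text LLM + vision tool" pipeline these frameworks share: the language model never sees the image and reasons only over a tool's textual output, which is where multi-step visual details tend to get lost. Both interfaces are illustrative stand-ins, not the API of HuggingGPT, VisualChatGPT, IdealGPT, or Chameleon.

```python
# Illustrative sketch of the tool-use setting: a text-only LLM delegates
# perception to a vision tool and reasons only over its textual output.
# Both function interfaces are assumptions for illustration only.

def caption_tool(image_path: str) -> str:
    """Stand-in for an image-captioning / VQA tool the LLM can call."""
    raise NotImplementedError

def text_llm(prompt: str) -> str:
    """Stand-in for a text-only LLM."""
    raise NotImplementedError

def answer_with_tools(image_path: str, question: str) -> str:
    # Step 1: the image is compressed into a single caption.
    caption = caption_tool(image_path)
    # Step 2: the LLM reasons over text only; any visual detail missing from
    # the caption cannot be recovered in later reasoning steps.
    prompt = (f"Image description: {caption}\n"
              f"Question: {question}\n"
              "Let's think step by step and give the final answer.")
    return text_llm(prompt)
```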



In-context Learning Exploration





Instruction Fine-tuning Exploration



Conclusion and Outlook



References:

[1] Lu et al. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering. In Proc. of NeurIPS 2022.

[2] Zhang et al. Multimodal Reasoning with Multimodal Knowledge Graph. In Proc. of ACL 2024.

[3] Kojima et al. Large Language Models are Zero-Shot Reasoners. In Proc. of NeurIPS 2022.

[4] Wu et al. The Role of Chain-of-Thought in Complex Vision-Language Reasoning Task. arXiv preprint, 2023.

[5] Mitra et al. Compositional Chain-of-Thought Prompting for Large Multimodal Models. In Proc. of CVPR 2024.