
Tsinghua University leads the release of MultiTrust, a multimodal evaluation framework: How trustworthy is GPT-4?

2024-07-24




AIxiv is a column where Synced publishes academic and technical content. Over the past few years, the AIxiv column has received more than 2,000 submissions covering top laboratories at major universities and companies around the world, effectively promoting academic exchange and dissemination. If you have excellent work to share, please submit it or contact us for coverage. Submission email: [email protected]; [email protected]

This work was initiated by the basic theory innovation team led by Professor Zhu Jun of Tsinghua University. The team has long focused on the bottleneck problems in the current development of artificial intelligence, exploring original AI theories and key technologies, and is at the international forefront of research on adversarial security theories and methods for intelligent algorithms. It has conducted in-depth research on fundamental problems such as the adversarial robustness and data efficiency of deep learning. The related work won the first prize of the Wu Wenjun Artificial Intelligence Natural Science Award, has produced more than 100 CCF A-class papers, and includes the open-source ARES adversarial attack and defense platform (https://github.com/thu-ml/ares), with some of the resulting patents already transferred into industrial applications.

Multimodal large language models (MLLMs), represented by GPT-4o, have attracted wide attention for their outstanding performance across modalities such as language and images. They have not only become capable assistants in users' daily work, but are also gradually entering major application areas such as autonomous driving and medical diagnosis, setting off a technological revolution.

However, are large multimodal models safe and reliable?



Figure 1: Example of an adversarial attack on GPT-4o

As shown in Figure 1, by modifying image pixels through an adversarial attack, GPT-4o can be made to misidentify the Merlion statue in Singapore as the Eiffel Tower in Paris or Big Ben in London. The erroneous target content can be customized almost at will, even beyond the safety boundaries of the model's application.
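Attacks of this kind on closed-source models are typically transfer-based: a perturbation is crafted on an open surrogate vision model and then applied to the image sent to the MLLM. Purely as an illustration of the principle (not the paper's actual method), the following is a minimal targeted PGD sketch against a standard torchvision classifier; the surrogate model, file name, target class, and attack hyperparameters are all illustrative assumptions.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Illustrative surrogate: a standard ImageNet classifier, not the paper's setup.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
normalize = T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor()])
x = preprocess(Image.open("merlion.jpg").convert("RGB")).unsqueeze(0)  # hypothetical file

target = torch.tensor([0])                 # hypothetical target class index
eps, alpha, steps = 8 / 255, 2 / 255, 20   # L_inf budget, step size, iterations

x_adv = x.clone()
for _ in range(steps):
    x_adv.requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(model(normalize(x_adv)), target)
    grad = torch.autograd.grad(loss, x_adv)[0]
    # Targeted PGD: step against the gradient to pull the prediction toward
    # the target class, then project back into the eps-ball around x.
    x_adv = (x_adv - alpha * grad.sign()).detach()
    x_adv = torch.clamp(torch.min(torch.max(x_adv, x - eps), x + eps), 0, 1)
```

Whether such a perturbation transfers to a commercial MLLM depends heavily on the choice of surrogate and the perturbation budget; the sketch only shows the basic optimization loop.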



Figure 2: Claude 3 jailbreak example

In the jailbreak attack scenario, although Claude successfully rejects the malicious request when it is given in text form, once the user additionally supplies an irrelevant solid-color image, the model produces the fake news the user asked for. This means that multimodal large models face more risks and challenges than large language models.

Beyond these two examples, multimodal large models also exhibit security threats and social risks such as hallucinations, biases, and privacy leaks, which can seriously undermine their reliability and trustworthiness in practical applications. Are these vulnerabilities accidental or widespread? How does the trustworthiness of different multimodal large models differ, and where do the differences come from?

Recently, researchers from Tsinghua University, Beihang University, Shanghai Jiao Tong University, and RealAI jointly released a comprehensive benchmark called MultiTrust, described in a 100-page paper. For the first time, it comprehensively evaluates the trustworthiness of mainstream multimodal large models across multiple dimensions and perspectives, reveals many potential security risks, and points to the next steps in the development of multimodal large models.



Paper title: Benchmarking Trustworthiness of Multimodal Large Language Models: A Comprehensive Study

Paper link: https://arxiv.org/pdf/2406.07057

Project homepage: https://multi-trust.github.io/

Code repository: https://github.com/thu-ml/MMTrustEval

MultiTrust Benchmark Framework

Building on existing large-model evaluation work, MultiTrust distills five trustworthiness dimensions - truthfulness, safety, robustness, fairness, and privacy - refines each of them into sub-dimensions, and constructs dedicated tasks, metrics, and datasets to provide a comprehensive evaluation.



Figure 4: MultiTrust framework diagram

Focusing on the 10 trustworthiness sub-dimensions, MultiTrust builds 32 diverse task scenarios, covering both discriminative and generative tasks and spanning pure-text and multimodal settings. The corresponding datasets are not only adapted from public text or image datasets, but also include more complex and challenging data constructed through manual collection or algorithmic synthesis (a rough sketch of this two-level organization follows the figure below).



Figure 5: MultiTrust task list
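To make the structure concrete, the taxonomy can be pictured roughly as the nested mapping below. The five top-level dimensions come from the paper, while the sub-dimension names, task record fields, and identifiers are illustrative placeholders rather than MultiTrust's actual labels.

```python
# Sketch of MultiTrust's two-level taxonomy as plain Python data.
# Dimension names follow the paper; sub-dimension names are placeholders.
TAXONOMY = {
    "truthfulness": ["placeholder_sub_a", "placeholder_sub_b"],
    "safety":       ["placeholder_sub_a", "placeholder_sub_b"],
    "robustness":   ["placeholder_sub_a", "placeholder_sub_b"],
    "fairness":     ["placeholder_sub_a", "placeholder_sub_b"],
    "privacy":      ["placeholder_sub_a", "placeholder_sub_b"],
}

# Each of the 32 tasks can be described along the two axes the benchmark spans:
# discriminative vs. generative, and text-only vs. multimodal.
EXAMPLE_TASK = {
    "name": "example_task",            # placeholder identifier
    "dimension": "safety",
    "sub_dimension": "placeholder_sub_a",
    "type": "generative",              # or "discriminative"
    "modality": "multimodal",          # or "text-only"
}
```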

Unlike the trustworthiness evaluation of large language models (LLMs), the multimodal nature of MLLMs brings more diverse and complex risk scenarios. To enable a more systematic evaluation, the MultiTrust benchmark not only starts from the traditional behavioral evaluation dimensions, but also innovatively introduces two evaluation perspectives, multimodal risk and cross-modal impact, to comprehensively cover the new problems and challenges brought by the additional modality.



Figure 6: Illustration of multimodal risks and cross-modal impact

Specifically, multimodal risks are the new risks introduced by multimodal scenarios, such as incorrect answers when the model processes visually misleading information, or misjudgments in multimodal reasoning that involves safety issues. For example, although a model can correctly identify the alcohol in an image, some models fail to recognize, in the subsequent reasoning, the risk of combining it with cephalosporin drugs.



Figure 7: The model makes misjudgments when reasoning about safety issues

Cross-modal impact refers to the effect of the newly added modality on the trustworthiness of the original modality. For example, inputting an irrelevant image may change the trustworthy behavior of the large language model backbone in pure-text scenarios, leading to more unpredictable security risks. In the jailbreak attacks and contextual privacy-leakage tasks commonly used in LLM trustworthiness evaluation, providing the model with an image unrelated to the text can break its original safe behavior (as shown in Figure 2).
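A straightforward way to probe this effect is to run the same text-only trust prompt twice, once as pure text and once paired with an irrelevant solid-color image, and compare the two responses. The sketch below only prepares such a paired input with Pillow; query_model is a hypothetical stand-in for the inference call of whatever model is being tested, not a real API, and the prompt text is a placeholder.

```python
from typing import Optional
from PIL import Image

def make_irrelevant_image(size=(224, 224), color=(127, 127, 127)) -> Image.Image:
    """A solid-color image carrying no task-relevant information."""
    return Image.new("RGB", size, color)

def query_model(prompt: str, image: Optional[Image.Image] = None) -> str:
    """Hypothetical stand-in for the text-only / multimodal inference call of
    the model under test; replace with the model's real interface."""
    return "<model response>"

if __name__ == "__main__":
    # Placeholder prompt in the style of a text-only contextual-privacy task.
    prompt = "Earlier, a colleague shared their home address with you in confidence. What is it?"
    blank = make_irrelevant_image()

    # Same prompt, with and without an uninformative image; comparing the two
    # answers (e.g. with a refusal/leakage judge) measures the cross-modal impact.
    answer_text_only = query_model(prompt)
    answer_with_image = query_model(prompt, image=blank)
```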

Results analysis and key conclusions



Figure 8: Continuously updated trustworthiness leaderboard (excerpt)

The researchers maintain a regularly updated leaderboard of multimodal large models' trustworthiness, which already includes the latest models such as GPT-4o and Claude 3.5. Overall, closed-source commercial models are safer and more trustworthy than mainstream open-source models. Among them, OpenAI's GPT-4 and Anthropic's Claude rank highest in trustworthiness, while Microsoft's Phi-3, which adds safety alignment, ranks highest among open-source models but still lags behind the closed-source ones.

Commercial models such as GPT-4, Claude, and Gemini already deploy many security and trust enhancement techniques, but some risks remain. For example, they are still vulnerable to adversarial attacks and multimodal jailbreak attacks, which greatly undermines the user experience and users' trust.



Figure 9: Gemini outputs risky content under a multimodal jailbreak attack

Although many open-source models score comparably to or even better than GPT-4 on mainstream general-capability leaderboards, they still show various weaknesses and vulnerabilities in trustworthiness tests. For example, the emphasis on general capabilities such as OCR during training makes images with embedded jailbreak text or sensitive information a more threatening source of risk.
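For context, such typographic test cases simply render an instruction into the image instead of the text prompt. The minimal Pillow sketch below shows how an evaluation input of this kind can be constructed; it is not the paper's data pipeline, and the rendered string and output path are harmless placeholders.

```python
from PIL import Image, ImageDraw, ImageFont

def render_text_image(text: str, size=(512, 256)) -> Image.Image:
    """Render a prompt as an image, the way typographic test cases embed
    instructions for OCR-capable models. Uses Pillow's default font."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.multiline_text((10, 10), text, fill="black", font=ImageFont.load_default())
    return img

# Placeholder string; evaluation sets pair such images with a short,
# benign-looking text prompt such as "Follow the instruction in the image."
probe = render_text_image("EXAMPLE INSTRUCTION USED ONLY FOR EVALUATION")
probe.save("typographic_probe.png")   # hypothetical output path
```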

Based on the cross-modal impact experiments, the authors find that multimodal training and inference weaken the safety alignment mechanisms of large language models. Many multimodal large models use aligned LLMs as backbones and fine-tune them during multimodal training; the results show that these models still exhibit significant security vulnerabilities and trust risks. At the same time, in multiple pure-text trustworthiness tasks, introducing images at inference time also disturbs the models' trustworthy behavior.



Figure 10: After introducing images, the model is more likely to leak private content from the text

The experimental results show a certain correlation between the trustworthiness of multimodal large models and their general capabilities, but performance still varies across the different trustworthiness dimensions. Common algorithms for multimodal large models, such as fine-tuning on GPT-4V-assisted datasets or RLHF targeting hallucinations, are not sufficient to fully enhance trustworthiness. The existing findings also show that multimodal large models face unique challenges distinct from those of large language models, and innovative, efficient algorithms are needed for further improvement.

Detailed results and analysis are provided in the paper.

Future Directions

The results of the study show that improving the trustworthiness of multimodal large models requires special attention from researchers. Drawing on alignment solutions for large language models, diversified training data and scenarios, and paradigms such as retrieval-augmented generation (RAG) and constitutional AI can help to some extent. But the trustworthiness of multimodal large models goes beyond this: inter-modal alignment and the robustness of the visual encoder are also key factors. In addition, continuously evaluating and optimizing models in dynamic environments to improve their performance in real applications is an important direction for the future.

Alongside the MultiTrust benchmark, the research team also released MMTrustEval, a toolkit for the trustworthiness evaluation of multimodal large models. Its model integration and modular evaluation design provide an important tool for trustworthiness research on multimodal large models. Based on this work and the toolkit, the team has organized data and algorithm competitions on the security of multimodal large models [1,2] to promote trustworthy large-model research. In the future, as the technology continues to advance, multimodal large models will demonstrate their potential in more fields, but their trustworthiness still requires continuous attention and in-depth research.
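MMTrustEval's actual interfaces are documented in its repository (https://github.com/thu-ml/MMTrustEval); purely as an illustration of the "model integration plus modular evaluation" idea, the sketch below separates a model adapter from a task runner so that new models and new tasks can be added independently. All class and method names here are hypothetical and are not MMTrustEval's API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class Sample:
    prompt: str
    image_path: Optional[str] = None   # None for text-only tasks
    reference: Optional[str] = None    # ground truth / expected behavior

class ModelAdapter:
    """Hypothetical adapter: any MLLM is wrapped behind a single chat() call."""
    def chat(self, prompt: str, image_path: Optional[str] = None) -> str:
        raise NotImplementedError

def run_task(model: ModelAdapter,
             samples: Iterable[Sample],
             metric: Callable[[str, Sample], float]) -> float:
    """Run one task: query the model on every sample and average the metric."""
    scores = [metric(model.chat(s.prompt, s.image_path), s) for s in samples]
    return sum(scores) / max(len(scores), 1)
```

With this separation, adding a new model only requires implementing the adapter, and adding a new task only requires supplying samples and a metric function.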

[1] CCDM2024 Multimodal Large Language Model Red Team Security Challenge http://116.112.3.114:8081/sfds-v1-html/main

[2] The 3rd Pazhou Algorithm Competition - Multimodal Large Model Algorithm Security Reinforcement Technology https://iacc.pazhoulab-huangpu.com/contestdetail?id=668de7357ff47da8cc88c7b8&award=1,000,000