news

Contrastive learning abuses private data! Chinese Academy of Sciences and others publish "Multi-step Error Minimization" method | ACM MM 2024

2024-08-01



New Intelligence Report

Editor: LRST

【New Intelligence Introduction】The researchers propose a novel Multi-step Error Minimization (MEM) method for generating multimodal unlearnable examples that protect personal data from abuse by multimodal contrastive learning models. By optimizing image noise and text triggers together, MEM effectively misleads the model, degrades its ability to learn from private data, and transfers well between different models.

Multimodal contrastive learning (e.g., CLIP) has achieved remarkable progress in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.

However, this reliance poses privacy risks as hackers could potentially exploit image-text data for model training without authorization, which could include personal and privacy-sensitive information.

Recent work has proposed adding imperceptible perturbations to training images to create protective shortcuts, generating unlearnable examples.

However, these methods were designed for unimodal classification tasks and remain underexplored in multimodal contrastive learning. This paper first examines this setting by evaluating existing methods on image-caption pairs, finding that they cannot generalize effectively to multimodal data: with no labels available, their ability to establish shortcuts is limited.

In this paper, we propose Multi-step Error Minimization (MEM), a novel optimization procedure for generating multimodal unlearnable examples. It extends the Error Minimization (EM) framework to optimize both image noise and an additional text trigger, thereby enlarging the optimization space and effectively misleading the model into learning a shortcut between the noise features and the text trigger.


Paper link: https://arxiv.org/abs/2407.16307

Code link: https://github.com/thinwayliu/Multimodal-Unlearnable-Examples

Specifically, projected gradient descent (PGD) is adopted to solve the noise-minimization problem, while the HotFlip method approximates the effect of word replacement via gradients to find the optimal text trigger.

Extensive experiments demonstrate the method's effectiveness: retrieval performance on protected data drops to nearly half that of random guessing, and the generated samples transfer well between different models. The paper and code have been open-sourced.

Research Background

In recent years, with the rise of multimodal learning, researchers have shown great interest in models that combine multiple data types such as text, images, and audio.

Among them, multimodal contrastive learning has become an important method in this field. Models such as CLIP and ALIGN are trained with a contrastive loss that strengthens the correlation between images and texts, reducing the need for manual labeling, and have demonstrated their potential in tasks such as image classification and generation.
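For reference, the symmetric InfoNCE-style objective that CLIP-style models minimize can be written as follows (our notation, with $s_{ij}$ the cosine similarity between image $i$ and caption $j$, $\tau$ a temperature, and $N$ the batch size):

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\left(\log\frac{e^{s_{ii}/\tau}}{\sum_{j} e^{s_{ij}/\tau}} + \log\frac{e^{s_{ii}/\tau}}{\sum_{j} e^{s_{ji}/\tau}}\right)$$

Minimizing this loss pulls matched image-text pairs together and pushes mismatched pairs apart in the shared feature space.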

However, training these models relies on vast amounts of multimodal data, often drawn from public datasets such as CC12M, YFCC100M, and LAION-5B. These datasets are only lightly curated and may contain large amounts of sensitive personal information, raising concerns about privacy leakage.

We consider a scenario focused on generating multimodal unlearnable examples to counter the privacy risks of multimodal contrastive learning, taking image-text pairs as a representative multimodal dataset. Users often share personal photos with accompanying text on social media platforms such as Facebook, including private identifiable information such as faces, names, phone numbers, and addresses.

Hackers may collect large numbers of such image-text pairs from the Internet and use multimodal contrastive learning techniques to train or fine-tune large models, as shown in the left half of Figure 1.

These models inadvertently capture users' private information and facial features, creating potential privacy leaks. Protectors aim to prevent such sensitive data from being exploited without authorization by making the multimodal data unlearnable: models trained on such multimodal unlearnable examples cannot access users' private features, while users' social interactions after posting images and text are unaffected, as shown in the right half of Figure 1.


Figure 1: Facebook posts can inadvertently reveal personal information (left), but using multimodal unlearnable samples can protect the data and prevent unauthorized models from accessing private features (right)

Motivation

Recent research has focused on preventing unauthorized use of data in image classification through unlearnable examples. These methods apply subtle perturbations that prevent the model from learning image features, and are also known as availability attacks or indiscriminate poisoning attacks.

They fall mainly into surrogate-free attacks and surrogate-based attacks: surrogate-free attacks generate noise at the pixel level, while surrogate-based attacks generate feature-level noise through a surrogate model.

However, surrogate-free approaches designed for classification all fail to generate image noise in multimodal scenarios, since they aim to find a specific noise pattern for the images associated with each class, whereas image-text pair data has no labels.

Therefore, only surrogate-model-based methods can be applied, and we extend two typical ones, EM and UAP, to generate multimodal unlearnable examples.

The Error-minimizing Noise (EM) method:
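A sketch of the EM objective extended to image-caption pairs, in our own notation (with $f'_\theta$ and $g'_\theta$ the surrogate image and text encoders, $\mathcal{L}$ the contrastive loss, and $\epsilon$ the perturbation budget):

$$\min_{\theta}\; \mathbb{E}_{(I,T)}\Big[\min_{\|\delta\|_{\infty}\le\epsilon}\mathcal{L}\big(f'_{\theta}(I+\delta),\, g'_{\theta}(T)\big)\Big]$$

The inner loop crafts error-minimizing noise while the outer loop updates the surrogate, so the noise becomes an easy-to-learn shortcut.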


The Untargeted Adversarial Perturbation (UAP) method:
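In the same notation, UAP instead generates error-maximizing noise against a fixed, fully trained surrogate $\theta^{*}$ (again a sketch, not the paper's exact equation):

$$\max_{\|\delta\|_{\infty}\le\epsilon}\mathcal{L}\big(f'_{\theta^{*}}(I+\delta),\, g'_{\theta^{*}}(T)\big)$$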


Although EM and UAP can be applied to image-caption pairs, they fail to provide efficient protection, especially UAP. We explore why the effectiveness of these methods declines when moving from image classification to multimodal contrastive learning.

In image classification, EM and UAP optimize images sharing the same label so that they converge in the feature space; the model thus easily captures the added noise and learns its correlation with the labels, as shown in Figure 2(a).


Figure 2: Comparison of different methods in traditional classification and multimodal contrastive learning. Markers represent images and their paired captions; the blue area is the expected decision boundary of a model trained on unlearnable examples.

In multimodal contrastive learning, however, for EM and UAP to be effective, the optimized image noise must be aligned with the text features, pulling the image features toward them (EM) or pushing them away (UAP).

But the text features of different pairs can be widely scattered across an image-text dataset. As shown in Figure 2(b) and (c), unlike in classification, it is far harder for the model to capture any correlation between the captions and the noise generated by EM or UAP.

In Figure 2(c), the decision space that UAP induces is even more complex, so its protection is correspondingly weaker.

Method


Figure 3: Framework of the multi-step error minimization method (MEM)

Because image-text pairs are so dispersed, surrogate-model-based methods still cannot achieve effective protection. An intuitive enhancement is to optimize the image and the text simultaneously, which enlarges the optimization space and encourages the features of different pairs to converge.

The optimized feature representations of the image and text sets then exhibit similar distributions, making it easy for the model to learn their shortcut, as shown in Figure 2(d).

To this end, we take the EM method as the basic framework and prepend a short text trigger to each caption to minimize the contrastive loss, following the setup of adversarial attacks on text tasks. Our method can be viewed as a three-level iterative optimization problem, a multi-step extension of EM.

Specifically, we optimize the noise δ and the text trigger t sequentially to reduce the contrastive loss between the perturbed image I + δ and the modified text T ⊕ t, where ⊕ denotes inserting the trigger t into the clean text T at some position.

For simplicity, we add the text trigger at the beginning of the text in this paper. Our Multi-step Error Minimization (MEM) method can therefore be formulated as:
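In the notation introduced above (ours, assumed), MEM adds the trigger search as a third, innermost level of the EM problem:

$$\min_{\theta}\; \mathbb{E}_{(I,T)}\Big[\min_{\|\delta\|_{\infty}\le\epsilon}\;\min_{t}\;\mathcal{L}\big(f'_{\theta}(I+\delta),\, g'_{\theta}(T\oplus t)\big)\Big]$$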


This problem is optimized iteratively, following the procedure of EM. Projected gradient descent (PGD) is used to solve the noise-minimization subproblem.

It is worth noting that, to alleviate overfitting of the noise to the clean captions, we augment them by shuffling the captions within each batch and prepending the correctly matched text triggers. When faced with semantically mismatched captions, the generated noise therefore focuses on the text triggers rather than on parts of the captions. The optimal δ is obtained by the following iterative formula:
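A standard PGD descent step for this subproblem (our notation; α is the step size and Π_ε projects onto the ε-ball) would be:

$$\delta_{k+1} = \Pi_{\epsilon}\Big(\delta_{k} - \alpha\cdot\mathrm{sign}\big(\nabla_{\delta}\,\mathcal{L}\big(f'(I+\delta_{k}),\, g'(T\oplus t)\big)\big)\Big)$$

Note the minus sign: unlike adversarial-example PGD, which ascends the loss, error-minimizing noise descends it.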

For the text trigger minimization problem, we first initialize the trigger sequence by repeating the word "the" or "a" in front of all inputs.

In addition, the text trigger is optimized with HotFlip, which approximates the effect of replacing a token using gradients. The embedding of each trigger token is updated to minimize a first-order Taylor approximation of the CLIP loss around the current token embedding:
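In the standard HotFlip form (written in our notation), the replacement for the i-th trigger token is chosen as

$$\underset{e'\in\mathcal{V}}{\arg\min}\;\big(e' - e_{i}\big)^{\top}\nabla_{e_{i}}\mathcal{L}$$

where $\mathcal{V}$ is the set of token embeddings in the vocabulary and $e_i$ is the current trigger-token embedding.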


Finally, we use beam search to find the optimal text trigger within the set of candidate tokens: we keep the top-k candidates from the formula above, search each trigger position from front to back, and score each beam by the loss on the current batch.

We follow the approach of Wallace et al. and use a small beam size for efficient computation. Figure 3 shows the framework for generating multimodal unlearnable examples with MEM.
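To make the alternating optimization concrete, below is a minimal PyTorch-style sketch of the two inner MEM steps. All interfaces (`model.encode_image`, `model.encode_text`, the embedding table) and hyperparameters are illustrative assumptions, not the authors' released code; see the GitHub repository linked above for the official implementation.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_feat, txt_feat, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / temperature
    labels = torch.arange(len(logits), device=logits.device)
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

def pgd_noise_step(model, images, texts, delta, eps=8/255, alpha=1/255, steps=10):
    """Error-minimizing PGD: *descend* the contrastive loss w.r.t. the image
    noise, then project the noise back into the L-inf eps-ball."""
    for _ in range(steps):
        delta = delta.detach().requires_grad_(True)
        loss = clip_loss(model.encode_image(images + delta),
                         model.encode_text(texts))
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta - alpha * grad.sign()).clamp(-eps, eps)
    return delta.detach()

def hotflip_candidates(embedding_table, trigger_emb, grad, k=10):
    """Rank replacement tokens by the first-order Taylor term (e' - e_i)^T grad;
    keep the k tokens predicted to decrease the loss the most."""
    scores = (embedding_table - trigger_emb) @ grad   # shape (vocab_size,)
    return scores.topk(k, largest=False).indices
```

A full beam search would then re-score these top-k candidates at each trigger position with the actual batch loss, keeping the best few beams, as described above.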

Experimental Results

Effective Protection


Table 1: Effectiveness of unlearnable examples generated by different methods on various datasets

Table 1 shows the retrieval results on different datasets. UAP provides almost no protection for multimodal data, while EM shows some degree of protection.

Our MEM, however, consistently provides strong protection, reducing retrieval performance to nearly half that of random guessing. In particular, MEM-5, with its longer text trigger, degrades the hacker model's performance more than MEM-3.

Figure 4 shows the training-loss curves when training on unlearnable examples generated by different methods, together with the retrieval median rank (MedR) on the clean test set. From (a), we observe that although EM makes the loss drop faster than normal training, our MEM-3 and MEM-5 reach even smaller losses within the first epoch, indicating that the model quickly learns the shortcuts.

From (b), we find that the MedR of all models falls below that of random guessing, but the model trained on unlearnable examples stops learning the soonest, converges to the worst retrieval result, and does not improve as the epochs increase. These observations are consistent with the results in Table 1.


Figure 4: Curves of the training loss and the test metric MedR

Cross-Model Transferability


Table 2: Transferability of unlearnable examples generated by MEM-3 with a ResNet50 surrogate to different model architectures

We assume a completely black-box setting for data protection, in which the protector does not know the architecture of the hacker's model. We therefore evaluate unlearnable examples generated by MEM on a ResNet50 surrogate model against different hacker models, including ResNet101 and ViT. As Table 2 shows, these samples transfer successfully between models and reduce the performance of the CLIP model.

Visualization Analysis


Figure 5: Attention-map visualization comparing four models on clean data and on unlearnable examples from different methods

Figure 5 shows the attention heatmaps of models trained on clean data and on unlearnable examples generated by different methods. For images, we use Grad-CAM to visualize the model's attention; for text, we use Integrated Gradients. The lighter the color, the higher the model's attention.

It is worth noting that the models in Figure 5(1), (2), and (3) all focus on the central region of the image, which corresponds to the caption content.

However, the model trained on MEM-3 samples in Figure 5(4) cannot accurately recognize the clean image, because it has learned only noise features. Similarly, for text, the first three models focus on the keyword "glass", while the last attends to the first three words, presumably because MEM-3 always optimizes the noise against the three leading trigger tokens to create the shortcut.

These visualizations show that EM and UAP offer little protection for multimodal data, whereas MEM is clearly effective.


Figure 6: t-SNE visualization of clean samples under the clean model and of MEM-3-optimized unlearnable examples under the poisoned model

Figure 6 visualizes the feature distribution of clean samples under the normal model and of MEM-3-optimized unlearnable examples under the poisoned model. Triangles represent image features and circles text features; each color corresponds to five transformed versions of the same image in the dataset together with their different matched captions.

From (a), we observe that under the clean model, versions of the same image and their texts cluster together, and corresponding image-text pairs lie close to each other.

In (b), however, the same images and texts scatter apart, and only matched image-text pairs remain close. This indicates that our method effectively drives the model to learn the shortcut between the noise and the text triggers.

Case Study: Face Privacy Protection

We conduct a case study applying MEM noise to a real-world scenario: protecting personal face images and associated information, such as names, on social media platforms.

We conduct experiments on the PubFig database, a large real-world face dataset containing 58,797 images of 200 individuals collected from the Internet. For retrieval evaluation, we randomly select one photo of each celebrity as the test set and use all remaining images for training.

To simulate realistic fine-tuning, we modify the names and provide a set of name-related text templates for caption generation. We then use MEM to generate unlearnable examples and evaluate them against different hacker models. The results are shown in Table 3.

MEM prevents these fine-tuned models from learning the correlation between face and name features, thus hindering accurate person retrieval on the test set.


Table 3: Protection effect of unlearnable examples generated with a ResNet50 surrogate when fine-tuning different pre-trained models

Conclusion

In this paper, we explore multimodal data protection, with a particular focus on image-text pairs, for which we generate multimodal unlearnable examples to prevent exploitation by multimodal contrastive learning. We extend previous classification methods to this setting, revealing their limitations due to the added modality and the dispersion of the data.

In light of these findings, we introduce a novel generative method named Multi-step Error Minimization (MEM), which is based on the EM framework. MEM effectively creates shortcuts between noise and text triggers and demonstrates transferability across different hacker models.

Furthermore, we validate the effectiveness of our approach using various visualization tools. Our work opens up a new direction and is expected to be applicable to other modality pairs such as audio-text and audio-image pairs.

About the Authors

The authors of this paper are from the Institute of Information Engineering, Chinese Academy of Sciences; Nanyang Technological University; National University of Singapore; and Sun Yat-sen University. Author list: Liu Xinwei, Jia Xiaojun, Xun Yuan, Liang Siyuan, Cao Xiaochun.

The first author Liu Xinwei is a doctoral student at the Institute of Information Engineering of the Chinese Academy of Sciences. The corresponding authors are Professor Cao Xiaochun of Sun Yat-sen University and Jia Xiaojun, a postdoctoral researcher at Nanyang Technological University.

References:

https://scst.sysu.edu.cn/members/caoxiaochun.html

https://jiaxiaojunqaq.github.io