A Closer Look at Backdoor Attacks on CLIP
We present a comprehensive empirical study on how backdoor attacks affect CLIP by analyzing the representations of backdoor images.
Abstract
Reviews and Discussion
This paper mainly presents an empirical study on backdoor attacks against CLIP models. The study reveals three observations about the effect of backdoors on different representations of CLIP: 1) different backdoor attacks infect different model components; 2) infected AHs are centered on the last layer while infected MLPs are decentralized across multiple layers; 3) some AHs are not greatly infected by backdoor attacks. Based on these observations, the authors propose a backdoor defense method by detecting infected AHs.
Strengths
- Explores how backdoor attacks against CLIP operate from the perspective of representation decomposition;
- Proposes a potential method to defend CLIP against backdoor attacks;
- Conducts experiments to validate the empirical findings.
Weaknesses
- The authors classify the backdoor triggers into two categories, namely local-patch and global-noise, and then explore their effects on CLIP. Is such a trigger pattern classification generic enough to validate the effectiveness of the revealed findings? For example, current works have achieved multi-trigger (or multi-target) backdoor attacks. According to the authors' second finding, the injected triggers should all be centered on the last layer. However, intuitively, if the attacker chooses to place the triggers at different locations, their representations may behave more like global-noise ones, which would contradict the authors' finding, wouldn't it? Another example is triggers that look like natural inputs, which may change which neurons are injected. In summary, to make the findings more persuasive, the authors should provide a more comprehensive discussion of the rationale behind the selected attack types, or experiments to support it.
- The proposed defense works in two pipelines (against two different attack types), and implementing both pipelines simultaneously appears time-consuming. So, how does the defender know which defense pipeline should be selected in practice? It seems impossible.
- A confusing conflict: the 2nd finding says the local-patch backdoor infects AHs, while the last finding claims that AHs are not greatly infected. So, are the AHs infected at all? If yes, the last finding seems meaningless. If not, what about the correctness of the 2nd finding? By the way, what is the meaning of the last finding? From this manuscript, it seems to have no relation to, or motivation for, the final defense method.
- I really appreciate the efforts of the authors to do these experiments to reveal their findings. However, there is no novel approach involved in the study procedure, only invocations of different previous tools. Moreover, the final findings are similar to earlier work, i.e., the well-known "fine-pruning" defense.
Questions
- Is such a trigger pattern classification generic enough to validate the effectiveness of the revealed findings?
- How does the defender know which defense pipeline should be selected in practice?
- Please clarify the importance of the last finding with more details.
- Please clarify the basic difference between the findings and the well-known "fine-pruning" defense.
Ethics Concern Details
No further ethics concerns.
Thank you so much for your valuable comments!
W1: The authors classify the backdoor triggers into two categories, namely local-patch and global-noise, and then explore their effects on CLIP. Is such a trigger pattern classification generic enough to validate the effectiveness of the revealed findings? To make the findings more persuasive, the authors should provide a more comprehensive discussion of the rationale behind the selected attack types, or experiments to support it.
A: Thanks for your helpful suggestion. We argue that the taxonomy of triggers is general enough to cover current backdoor attacks on CLIP. We fully understand your concern about other types of triggers. To address it, we have tried our best to implement two more backdoor attacks on CLIP, called BadNet-M (multiple triggers located in the four corners of the image) and Kitty (the trigger is a natural image of a kitty). The experimental results are shown in the following tables.
Table: BadNet-M AH (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| F-AH | 99.77 | 99.78 | 99.75 | 99.79 | 99.82 | 0.0 |
| B-AH | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| S-AH | 99.75 | 99.76 | 99.77 | 99.81 | 99.83 | 0.0 |
Table: BadNet-M MLP (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|
| F-MLP | 99.76 | 99.78 | 99.76 | 99.78 | 99.79 | 99.81 | 99.85 |
| B-MLP | 99.84 | 99.85 | 99.84 | 99.84 | 99.85 | 99.85 | 99.81 |
| S-MLP | 99.73 | 99.74 | 99.74 | 99.78 | 99.76 | 99.80 | 99.81 |
Table: Kitty AH (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| F-AH | 95.42 | 93.87 | 82.17 | 82.34 | 83.05 | 0.94 |
| B-AH | 0.93 | 1.11 | 1.94 | 3.04 | 6.49 | 5.15 |
| S-AH | 95.32 | 95.41 | 95.38 | 96.41 | 97.96 | 5.15 |
Table: Kitty MLP (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|
| F-MLP | 96.31 | 96.45 | 96.78 | 96.90 | 96.89 | 97.17 | 97.10 |
| B-MLP | 96.95 | 96.74 | 96.74 | 96.63 | 96.30 | 96.15 | 95.44 |
| S-MLP | 95.74 | 95.88 | 96.13 | 96.23 | 95.88 | 96.39 | 95.44 |
The above tables show that the two backdoor attacks mainly affect attention heads in the last layer and have little effect on MLPs. Therefore, they could be classified as the local patch-based type.
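For completeness, the ASR values above are measured under the standard zero-shot protocol for CLIP backdoor evaluation. Below is a minimal Python sketch of this computation; the function and variable names are ours and purely illustrative, not an interface from our codebase.

```python
import torch
import torch.nn.functional as F

def attack_success_rate(image_reps: torch.Tensor,
                        text_embeds: torch.Tensor,
                        target_class: int) -> float:
    """Zero-shot ASR: fraction of backdoored images whose representation is
    classified into the attacker's target class by cosine similarity.

    image_reps:   (N, d) representations of backdoored images
                  (possibly after ablating or repairing some components)
    text_embeds:  (C, d) text embeddings of the zero-shot class prompts
    target_class: index of the attacker's target label
    """
    image_reps = F.normalize(image_reps, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    preds = (image_reps @ text_embeds.T).argmax(dim=-1)
    return (preds == target_class).float().mean().item()
```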
W2: The proposed defense works in two pipelines (against two different attack types), and implementing both pipelines simultaneously appears time-consuming. So, how does the defender know which defense pipeline should be selected in practice? It seems impossible.
A: Indeed, a defender does not know the type of trigger in practice. In this case, we argue that we can defend against both types of backdoor attacks by first checking whether AHs or MLPs are infected and then repairing the infected AHs or MLPs accordingly. We would like to emphasize that this strategy does not incur much additional inference cost, because we only need to forward each image through the network once and repair the representation on AHs or MLPs for different types of backdoor images simultaneously.
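To make this single-pass strategy concrete, here is a minimal sketch. It assumes the per-component contributions (AHs and MLPs) to the decomposed image representation are already available; the deviation criterion (L2 distance to the component's clean-image mean with a threshold tau) is an illustrative assumption rather than the exact detection rule of our method.

```python
import torch

def detect_and_repair(rep: torch.Tensor,
                      contribs: dict,      # {component: (d,) per-image contribution}
                      clean_means: dict,   # {component: (d,) mean contribution on clean images}
                      tau: float) -> torch.Tensor:
    """Single-forward-pass defense sketch: flag any component (AH or MLP)
    whose contribution deviates strongly from its clean-image mean and
    substitute that mean, so local patch-based attacks (infected AHs) and
    global noise-based attacks (infected MLPs) are handled at once."""
    repaired = rep.clone()
    for comp, c in contribs.items():
        # Illustrative deviation test; the paper's detection criterion may differ.
        if torch.norm(c - clean_means[comp]) > tau:
            repaired = repaired - c + clean_means[comp]
    return repaired
```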
W3: A confusing conflict: the 2nd finding says the local-patch backdoor infects AHs, while the last finding claims that AHs are not greatly infected. So, are the AHs infected at all? If yes, the last finding seems meaningless. If not, what about the correctness of the 2nd finding?
A: We would like to explain that the second and the last findings are not in conflict. Specifically, the second finding mainly reveals that many attention heads have been infected in terms of their attention weights (i.e., they pay more attention to the trigger) and MMD scores (significant distribution differences compared with clean attention heads), while the last finding further reveals that some of these infected attention heads still maintain their original functionality, which is what we call "not greatly infected". (Note that different attention heads play various roles (e.g., color and location) in models [1].) For example, as shown in Figure 2, the 4th infected attention head still maintains its role of color because its descriptive texts remain related to color, while the descriptive texts of other attention heads have changed significantly. In summary, we argue that these two findings reveal different characteristics of infected attention heads and are not in conflict.
W4: By the way, what is the meaning of the last finding? From this manuscript, it seems to have no relation to, or motivation for, the final defense method.
A: We would like to explain that the motivation for the last finding stems from research on understanding CLIP's image representations via text explanations [1, 2]. Specifically, these works revealed how internal image tokens act in the model by using natural-language texts, which are more intuitive and comprehensible for humans than visual attention maps alone. They found that some attention heads have property-specific roles (e.g., location or shape), indicating their inherent functionality in the model [1]. Based on this observation, we would like to clarify whether and how backdoor attacks affect the functionality of infected attention heads. If they do, the infected attention heads may be destroyed completely. Otherwise, the infected attention heads may not be greatly affected and may still maintain their original functionality. We believe that exploring this point could motivate more future work to design attack or defense methods around this concern.
W5: I really appreciate the efforts of the authors to do these experiments to reveal their findings. However, there is no novel approach involved in the study procedure, only invocations of different previous tools. Moreover, the final findings are similar to earlier work, i.e., the well-known "fine-pruning" defense.
A: Thanks for affirming our efforts. We indeed use the existing method of representation decomposition [1] to explore the characteristics of backdoor attacks on CLIP. Despite this, we would like to emphasize that our findings are indeed novel and significantly important in the area of backdoor attacks and defenses. Besides, we would like to summarize the basic differences between our work and the mentioned paper [3] in terms of the following aspects.
1. Different target models. Our work focused on prevalent attention-based multimodal models (e.g., CLIP), while the paper [3] focused on simple non-attention-based networks (e.g., AlexNet). The findings and the defense method in the paper [3] do not apply to CLIP directly. Exploring the effect of backdoor attacks on CLIP has a broader impact on the safety concern of vision-language models.
2. Different empirical findings. Although the paper [3] also revealed the characteristics of the infected neurons in the backdoored model, our findings revealed that different types of backdoor attacks would affect different model components in CLIP, which has not been disclosed by the paper [3] or any other existing papers. Furthermore, we used CLIP's texts to explore the functional change of infected model components. The resulting finding has also not been revealed yet in the existing research.
3. Different defense methods. Specifically, Fine-pruning [3] first prunes the backdoored network and then fine-tunes the pruned network, while our proposed method, i.e., Decomp-Rep, decomposes and modifies the representation, which is training-free and more explainable.
In summary, we argue that our findings are indeed novel and significantly important to the area of backdoor attacks and defenses. Besides, the proposed defense methods are also novel and well-motivated.
Q1: Is such a trigger pattern classification generic enough to validate the effectiveness of the revealed findings?
A: Thanks for your question. We argue that the taxonomy of triggers is general enough to cover current backdoor attacks on CLIP. To justify this point, as shown in our response to W1, we have conducted experiments on two additional backdoor attacks. The experimental results show that both attacks can also be classified as the local patch-based type, which validates the effectiveness of the revealed findings.
Q2: How does the defender know which defense pipeline should be selected in practice?
A: Actually, a defender does not know the types of triggers in practice. As shown in our response to W2, the defender can use another strategy without increasing the inference cost significantly.
Q3: Please clarify the importance of the last finding with more details.
A: As shown in our response to W4, we have explained the motivation for the last finding. Based on the response, we would like to clarify the importance of the last finding further. Specifically, we argue that this finding is significantly important to designing more powerful attacks or effective defense methods. For example, a more powerful attack on CLIP is supposed to destroy the functionality of attention heads completely, which could inspire researchers to design specific attacks on the functionality. On the other hand, a defender could consider keeping or slightly restoring these attention heads since they are not affected completely. In summary, the last finding is significant in motivating future work to consider the functionality of attention heads in backdoor research. We also would like to explore these ideas in future work.
Q4: Please clarify the basic difference of the findings compared to the well-known ''fine-pruning" defenses.
A: As shown in our response to W5, we summarize the basic differences in three aspects: different target models, findings, and defense methods. We argue that our findings are indeed novel and significantly important to the area of backdoor attacks and defenses. Besides, the proposed defense methods are also novel and well-motivated.
Reference:
[1] Interpreting CLIP's Image Representation via Text-Based Decomposition. ICLR 2024.
[2] INViTE: INterpret and Control Vision-Language Models with Text Explanations. ICLR 2024.
[3] Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks. RAID 2018.
Dear Reviewer GUYc,
Thank you once again for reviewing our paper and providing valuable comments.
We understand that you may be extremely busy at this time, so we deeply appreciate it if you could take some time to return further feedback on whether the detailed explanations of our findings and our defense method solve your concerns. Kindly let us know if our response has addressed your concerns adequately.
Best Regards,
Paper9853 Authors
Dear Reviewer GUYc,
Thank you for your valuable comments. We understand that you may be extremely busy at this time, so we would deeply appreciate it if you could take some time to provide further feedback on whether our rebuttal solves your concerns. Kindly let us know if our response has adequately addressed your concerns.
Best Regards,
Paper9853 Authors
Dear Reviewer GUYc,
Thank you for your valuable comments on our paper. With the deadline for the discussion phase approaching, we would like to confirm whether our response has adequately addressed your concerns.
Best Regards,
Paper9853 Authors
This paper examines the impact of backdoor attacks on CLIP by analyzing the representations of backdoor images using a representation decomposing methodology. Three key findings are presented regarding the influence of backdoor attacks (local-patch and global-noise attacks) and network layers (AHs and MLPs). Additionally, the paper proposes countermeasures to repair model components and filter samples to mitigate the effects of backdoor attacks.
优点
The motivation of the paper is clearly articulated, and it extends existing research on backdoor attack methods to different CLIP architectures, which could be a valuable contribution to the backdoor research community.
缺点
I think there are some issues with this paper, particularly concerning the generality of certain parameters in the CLIP domain and its performance in addressing different types of backdoor attacks.
问题
- The authors categorize backdoor attacks into two types: local-patch based and global noise-based backdoor attacks. To my knowledge, the paper "Input-Aware Dynamic Backdoor Attack" introduces a "dynamic" backdoor, where samples share similar trigger pixels but also carry sample-specific trigger pixels. I am wondering whether this attack yields the same experimental conclusions. Should it be classified as local patch-based or global noise-based?
- Finding 3 states, "descriptive texts of MLPs in the last five layers have a distinct semantic difference, while that of MLPs in other layers have negligible changes in semantics." Why specifically the last five layers? Is this conclusion solely based on the experimental results presented in Figure 3?
- As a defender, how can I determine whether a backdoor type is local patch-based or global noise-based? Additionally, the choice of parameter in Figure 7 seems to struggle to achieve a good balance between attack success rate and clean accuracy.
Thank you so much for your insightful comments!
W1: I think there are some issues with this paper, particularly concerning the generality of certain parameters in the CLIP domain and its performance in addressing different types of backdoor attacks.
A: Thanks for your comment. We would like to explain that the selected parameter value applies to different types of backdoor attacks. Specifically, instead of separately searching for parameter values for different backdoor attacks, we simply set the parameter to 0.002 in Decomp-Rep to defend against BadNet, Label Consistent, and BadCLIP. Our experimental results in Table 1 indicate that using this value in our proposed Decomp-Rep can effectively decrease ASR while maintaining CACC for these backdoor attacks. It is worth noting that we chose the vision transformer as the backbone and achieved superior defense performance compared with the existing fine-tuning-based method CleanCLIP [1], especially against the current advanced attack BadCLIP.
Q1: I am wondering whether the "dynamic" attack yields the same experimental conclusions. Should it be classified as a local patch-based or global noise-based type?
A: Thanks for your interesting question. To address your concern, we conducted experiments with another dynamic attack, ISSBA [2], where the trigger is also dynamic and generated by an encoder-decoder network. We would first like to explain that the reason for not using "Input-Aware Dynamic Backdoor Attack" [3] (IAD) is that IAD does not target CLIP and is technically challenging to adapt to CLIP directly. Specifically, IAD originally targeted simple networks (i.e., PreAct ResNet-18) and small datasets (i.e., MNIST, CIFAR-10, and GTSRB), and simultaneously trained the trigger generator and the victim model with a multitask loss. However, adapting the two tasks via the multitask loss to CLIP (fine-tuning on the dataset from CC3M) fails to produce a satisfactory trigger generator due to the large differences in architectures (deep vs. shallow), model training (fine-tuning vs. training from scratch), and datasets (large-scale vs. small-scale).
To address your concern, we have tried our best to successfully adapt another dynamic attack, ISSBA [2], to CLIP. Specifically, we first generated poisoned images from CC3M with sample-specific triggers produced by an encoder-decoder network and then fine-tuned CLIP on the poisoned dataset. To better show the effect of ISSBA, we have improved the illustration of Figure 2 in the revised manuscript. The figure shows that the dynamic trigger (globally optimized pixels) does not affect attention heads but only the MLPs in the last few layers (as shown in the following tables). This observation is consistent with our findings.
Table: AH ablation (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| F-AH | 72.32 | 71.78 | 71.25 | 71.36 | 70.83 | 71.64 |
| B-AH | 71.25 | 72.11 | 72.30 | 71.89 | 71.95 | 72.50 |
| S-AH | 71.41 | 71.10 | 72.48 | 72.36 | 72.44 | 72.08 |
Table: MLP ablation (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|
| F-MLP | 77.15 | 76.42 | 37.57 | 16.77 | 7.55 | 8.65 | 1.94 |
| B-MLP | 1.56 | 0.98 | 6.33 | 19.99 | 39.68 | 52.41 | 49.22 |
| S-MLP | 74.15 | 71.46 | 52.01 | 53.77 | 61.48 | 75.61 | 49.22 |
We abbreviate Forward/Backward/Separate-MLP as F-MLP/B-MLP/S-MLP. The above tables show that ablating the MLPs in just the last few layers significantly decreases the ASR. In summary, we conclude that this dynamic trigger can be classified as the global noise-based type.
Q2: Finding 3 states, "descriptive texts of MLPs in the last five layers have a distinct semantic difference, while that of MLPs in other layers have negligible changes in semantics." Why specifically the last five layers? Is this conclusion solely based on the experimental results presented in Figure 3?
A: Actually, this conclusion depends not only on the descriptive texts of MLPs but also on quantitative metrics. Specifically, we used Maximum Mean Discrepancy (MMD) to quantify the distribution difference between representations on clean and backdoored MLPs. The result is shown in Figure 6 (d-2) in the Appendix. We can see that the last few MLPs have larger MMD scores, which indicates that these MLPs have been affected significantly. On the other hand, we also used descriptive texts of MLPs to characterize the semantic change in MLPs. To further quantify the semantic change, we calculated the cosine similarity between the text embeddings of clean (green) and backdoor (red) descriptive texts. The experimental result is shown in the following table.
Table: Average text similarity
| Attack | 1-4 | 5-8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|
| Blended | 0.7225 | 0.7012 | 0.6976 | 0.6379 | 0.6179 | 0.6918 | 0.6872 |
| ISSBA | 0.7432 | 0.7145 | 0.7016 | 0.6719 | 0.6763 | 0.6773 | 0.6942 |
The above table shows that the cosine similarity values in the last five layers are significantly smaller than those in the previous layers, thereby indicating the significant semantic change in the last five MLPs. In summary, we conclude that global noise-based attacks mainly infect the last five MLPs.
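For reference, the two quantities above can be computed as in the following sketch: a biased RBF-kernel MMD estimate and a simple paired cosine similarity. The kernel choice, bandwidth, and one-to-one pairing of descriptive texts are illustrative assumptions rather than the exact settings in the paper.

```python
import torch
import torch.nn.functional as F

def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> float:
    """Biased estimate of squared MMD with an RBF kernel between two sets of
    component representations, x: (n, d) from clean images and y: (m, d) from
    backdoored images. Larger values indicate a larger distribution shift."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()).item()

def avg_text_similarity(clean_emb: torch.Tensor, bd_emb: torch.Tensor) -> float:
    """Average cosine similarity between the text embeddings of a component's
    descriptive texts before (clean) and after (backdoored) the attack;
    embeddings are paired one-to-one here for simplicity."""
    clean_emb = F.normalize(clean_emb, dim=-1)
    bd_emb = F.normalize(bd_emb, dim=-1)
    return (clean_emb * bd_emb).sum(dim=-1).mean().item()
```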
Q3: As a defender, how can I determine whether a backdoor type is local patch-based or global noise-based?
A: Indeed, you do not know the type of trigger when you are a defender in real-world applications. In this case, we argue that you can defend against both types of backdoor attacks by first checking whether AHs or MLPs are infected and then repairing the infected AHs or MLPs accordingly. We would like to emphasize that this strategy does not incur much additional inference cost, because we only need to forward each image through the network once and repair the representation on AHs or MLPs for different types of backdoor images simultaneously.
Q4: Additionally, the choice of the parameter in Figure 7 seems to struggle to achieve a good balance between attack success rate and clean accuracy.
A: Figure 7 indeed indicates that achieving a perfect trade-off between ASR and CACC (i.e., keeping CACC unchanged while decreasing ASR to zero) is considerably challenging. This dilemma is also faced by most backdoor defense methods. In the parameter selection procedure, we prioritize avoiding a significant decrease in CACC, which is important and practical for downstream tasks. Even so, our proposed method still substantially decreases the ASR, and the selected parameter applies to different backdoor attacks, i.e., we do not need to select parameters separately for each attack. Therefore, we believe that parameter selection is not a major barrier to using our method in real-world applications.
Reference:
[1] CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning. ICCV 2023.
[2] Invisible Backdoor Attack with Sample-Specific Triggers. ICCV 2021.
[3] Input-Aware Dynamic Backdoor Attack. NeurIPS 2020.
Dear Reviewer c25e,
Thank you once again for reviewing our paper and providing valuable comments.
We understand that you may be extremely busy at this time, so we deeply appreciate it if you could take some time to return further feedback on whether the detailed explanations of our findings, defense method, and parameter selection solve your concerns. Kindly let us know if our response has addressed your concerns adequately.
Best Regards,
Paper9853 Authors
Dear Reviewer c25e,
Thank you for your valuable comments. We understand that you may be extremely busy at this time, so we would deeply appreciate it if you could take some time to provide further feedback on whether our rebuttal solves your concerns. Kindly let us know if our response has adequately addressed your concerns.
Best Regards,
Paper9853 Authors
Dear Reviewer c25e,
We would like to give a gentle reminder that the discussion deadline is approaching and could you take some time to provide further feedback on our rebuttal?
Best Regards,
Paper9853 Authors
Dear Reviewer c25e,
Thank you for your valuable comments on our paper. With the deadline for the discussion phase approaching, we would like to confirm whether our response has adequately addressed your concerns.
Best Regards,
Paper9853 Authors
This paper presents an empirical study investigating how backdoor attacks affect CLIP models by analyzing representations of backdoor images. By examining the effects on different model components (attention heads and MLPs), the paper proposes two defense methods that detect infected AHs and either repair representations or filter samples.
Strengths
Comprehensive empirical analysis that provides insights into how backdoor attacks affect different components of CLIP.
Weaknesses
- Limited novelty in some key findings. The observation that local patch-based attacks mainly affect AHs while global noise-based attacks mainly affect MLPs seems intuitive and somewhat expected.
- In Figure 2, it is difficult to distinguish the original images. Figure 3, while presenting a lot of descriptive texts, lacks intuitive clarity, making it challenging to draw conclusions directly from it.
- Defining the threat model at the beginning will make the paper clearer.
Questions
How would the conclusions hold for semantic triggers rather than simple patches or textures? How might the affected layers change with semantic triggers?
Thank you so much for your valuable comments!
W1: Limited novelty in some key findings. The observation that local patch-based attacks mainly affect AHs while global noise-based attacks mainly affect MLPs seems intuitive and somewhat expected.
A: Thanks for your comment. We would like to emphasize that this finding has not been revealed yet in the existing research and thus is indeed novel. Although there are several existing papers [1, 2] that have also identified the infected neurons in the backdoored model, they mainly focused on simple deep neural networks such as AlexNet and ResNet-18 (non-attention-based models). In contrast, our work focused on exploring backdoor attacks on CLIP (the prevalent attention-based model). Furthermore, they did not deeply explore the characteristics of the infected components, while we revealed that different backdoor attacks would affect different model components in CLIP. Therefore, we argue that this finding is indeed novel in the area of backdoor attacks.
Besides, we would like to emphasize that this finding is significantly important to the area of backdoor defense. Specifically, existing defense methods (such as CleanCLIP [3]) mainly focus on fine-tuning all model parameters to destroy the implanted backdoors of all backdoor attacks, neglecting the difference in infected model components of different backdoor attacks. Our findings can motivate future research to design more efficient and effective defense methods that specially handle the infected model components against different backdoor attacks instead of blindly treating all model components against all backdoor attacks.
W2: In Figure 2, it is difficult to distinguish the original images. Figure 3, while presenting a lot of descriptive texts, lacks intuitive clarity, making it challenging to draw conclusions directly from it.
A: We are sorry for the limited clarity in Figures 2 and 3. We have improved the illustration in Figures 2 and 3 in the revised manuscript. Specifically, we comparatively show the original images with various triggers and their head-specific attention maps and corresponding descriptive texts. To quantitatively illustrate the observation, we use head-specific MMD scores (indicating whether backdoor attacks significantly affect the representation of attention heads) and average text similarity (indicating whether the text semantics of infected components have changed dramatically).
W3: Defining the threat model at the beginning will make the paper clearer.
A: Thanks for your suggestion. We have formally defined the threat model in the preliminary section of the revised manuscript. Specifically, we define the threat model as a poisoned training set in which each poisoned image is associated with an implanted trigger.
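As a rough illustration of this formalization (the symbols here are illustrative and the exact notation in the revised manuscript may differ), the data-poisoning threat model can be written as:

```latex
% Illustrative only; symbols are not necessarily those used in the paper.
% The attacker stamps a trigger \delta onto a small subset of training images
% and pairs each poisoned image with the target caption/label y_t:
\mathcal{D}_{\mathrm{poison}}
  \;=\;
  \mathcal{D}_{\mathrm{clean}}
  \,\cup\,
  \bigl\{\, (x \oplus \delta,\; y_t) \;\big|\; x \in \mathcal{D}_{\mathrm{sub}} \bigr\}
```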
Q1: How would the conclusions hold for semantic triggers rather than simple patches or textures? How might the affected layers change with semantic triggers?
A: Thanks for providing this interesting idea. To explore this point, we used an image of a kitty as the semantic trigger. Following our original experimental setting, we selected 500,000 images from CC3M as the fine-tuning dataset and poisoned 1,500 of them. Specifically, we first resized the kitty image to the patch size (16) and attached it at a random location of each poisoned image as the semantic trigger (a minimal sketch of this poisoning step is given after the tables below). After fine-tuning a clean CLIP on this dataset, the backdoored CLIP was associated with the semantic trigger "kitty". Then, we conducted the mean-ablation experiment to explore which model components were infected. The experimental results are shown in the following tables. Note that we use the following abbreviations: Forward-AH/MLP (F-AH/MLP), Backward-AH/MLP (B-AH/MLP), Separate-AH/MLP (S-AH/MLP).
Table: kitty AH (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| F-AH | 95.42 | 93.87 | 82.17 | 82.34 | 83.05 | 0.94 |
| B-AH | 0.93 | 1.11 | 1.94 | 3.04 | 6.49 | 5.15 |
| S-AH | 95.32 | 95.41 | 95.38 | 96.41 | 97.96 | 5.15 |
Table: kitty MLP (ASR)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|
| F-MLP | 96.31 | 96.45 | 96.78 | 96.90 | 96.89 | 97.17 | 97.10 |
| B-MLP | 96.95 | 96.74 | 96.74 | 96.63 | 96.30 | 96.15 | 95.44 |
| S-MLP | 95.74 | 95.88 | 96.13 | 96.23 | 95.88 | 96.39 | 95.44 |
From the above tables, we can see that the ASR drops sharply after ablating the AHs in the 12th layer, while ablating MLPs has little effect. Therefore, our finding still holds for the semantic trigger.
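As referenced above, the poisoning step for the semantic trigger can be sketched as follows; the exact resizing, blending, and preprocessing details are assumptions for illustration and the function name is ours.

```python
import random
from PIL import Image

def attach_semantic_trigger(image: Image.Image,
                            trigger: Image.Image,
                            patch_size: int = 16) -> Image.Image:
    """Resize a natural trigger image (e.g., a photo of a kitty) to the ViT
    patch size and paste it at a random location of the training image."""
    poisoned = image.copy()
    patch = trigger.resize((patch_size, patch_size))
    x = random.randint(0, poisoned.width - patch_size)
    y = random.randint(0, poisoned.height - patch_size)
    poisoned.paste(patch, (x, y))
    return poisoned
```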
Reference:
[1] Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks. RAID 2018.
[2] Adversarial Neuron Pruning Purifies Backdoored Deep Models. NeurIPS 2021.
[3] CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning. ICCV 2023.
Dear Reviewer QusL,
Thank you once again for reviewing our paper and providing valuable comments.
We understand that you may be extremely busy at this time, so we deeply appreciate it if you could take some time to return further feedback on whether the detailed explanations of the novelty of our findings and the additional experimental results against semantic triggers solve your concerns. Kindly let us know if our response has addressed your concerns adequately.
Best Regards,
Paper9853 Authors
Thanks to the authors for the reply. I've decided to keep my rating.
Dear Reviewer QusL,
We sincerely thank you for your reply. We note that you are still keeping the borderline reject rating. Could you tell us your remaining concerns about our rebuttal so that we can try our best to address them?
Best regards,
Paper9853 Authors
This paper provides an analysis of how backdoor attacks affect the CLIP model. The paper focuses on the behavior of image representations at various parts of the model. The paper decomposes CLIP’s image representations to understand the effect of backdoor attacks on CLIP's individual components - attention heads and MLPs. The paper shows:
- Different backdoors affect different model components.
- Affected AHs are localized to a specific region within the layer, whereas affected MLPs are spread-out.
- AHs are infected to varying degrees.
Strengths
- This paper sets a good example of how adversarial attack analysis should be done. The authors conduct a well-defined analysis of different backdoor attacks on the architectural components of CLIP. This reveals potential component-level vulnerabilities of CLIP-based and attention-based multi-modal architectures.
- The authors demonstrate a strong understanding of the mathematics and technical background. This is shown by the comprehensive breakdown of the analysis with respect to multiple backdoor attacks and the differentiation between local and global trigger types. Further, the paper also sheds light on how distinct attacks interact with specific layers and components.
- The authors have performed comprehensive experimentation and substantiated their claims with quantitative results. The authors also conduct a sufficient ablation study.
- The paper's language and writing style are easy to understand.
- Lastly, the paper's contributions are certainly relevant to the adversarial and multi-modal community. This paper also demonstrates which layers pay attention to what in an image. Although this may seem like known knowledge or more of an explainability experiment, the paper provides important information that can be used to craft targeted attacks as well as stronger detection and defense techniques to counter other adversarial attacks, not just backdoors.
Weaknesses
- In conjunction with strength 3, I will note that although the experimentation is extensive, the authors do not do a good job of explaining the results or giving readers a better understanding of what the experiments show. The authors could have gone into more detail about what the results show; for example, they could have provided a sample input image and various versions of the backdoored image, and then shown the AHs and MLPs attending to different triggers differently. This would have driven home their claim and visually strengthened their analysis.
- For a more thorough ablation study, the authors could experiment with varying architectures, or modify existing networks to change layer orders, to understand whether claims such as "Infected AHs are centered on the last layer" are indeed due to trigger/attack-based factors rather than architectural reasons.
- In conjunction with strength 4, although the paper's language is lucid, the math hampers readability. The authors could supplement their use of math with textual explanations of what the math intuitively explores; for example, in the sections 'Repairing representations of infected AH' or 'Preliminary'. It may help to boil equations down to only the parts directly relevant to the reader's understanding.
- Although the paper's experimentation is extensive, the authors could have incorporated stronger attacks like dynamic triggers ('Reflection Backdoor', 'Hidden Trigger Backdoor Attacks') and patch-based triggers with adaptive patterns ('Input-Aware Dynamic Backdoor Attack').
Questions
- The authors fail to address the applicability of their analysis to non-CLIP-based multi-modal and attention-based models. Do the findings translate to these other models? If they do, do location-based claims like "Infected AHs are centered on the last layer" still hold true? If not, why?
W4: Although the paper's experimentation is extensive, the authors could have incorporated stronger attacks like dynamic triggers.
A: Thanks for your valuable suggestion. We have conducted experiments with more backdoor attacks, such as Reflection [1] and ISSBA [2] (dynamic triggers). Note that the reflection attack [1] originally targeted conventional neural networks (such as ResNet-34 and DenseNet), and we have tried our best to successfully adapt it to attack CLIP. Specifically, following the original setting, we used images from PascalVOC as the reflection set to poison the training images (1,500 out of 500,000) from CC3M with the reflection method. Then, following our experimental setting, we fine-tuned a clean CLIP on the poisoned dataset and obtained a backdoored CLIP (ASR close to 99%). The results are shown in the following tables.
Table: AH ablation (reflection)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|
| F-AH | 94.23 | 93.11 | 90.35 | 93.02 | 98.50 | 99.35 |
| B-AH | 99.23 | 99.45 | 99.53 | 99.52 | 99.50 | 98.67 |
| S-AH | 94.35 | 94.52 | 95.10 | 96.41 | 98.66 | 98.68 |
Table: MLP ablation (reflection)
| Layer | 1-4 | 5-8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|
| F-MLP | 95.50 | 95.21 | 94.77 | 90.37 | 75.23 | 27.81 | 14.62 |
| B-MLP | 12.74 | 13.46 | 14.74 | 14.63 | 54.36 | 81.79 | 92.07 |
| S-MLP | 95.41 | 94.92 | 94.66 | 90.69 | 87.64 | 87.91 | 92.07 |
Table: Defense performances (CACC/ASR)
| Method | Reflection | ISSBA |
|---|---|---|
| No defense | 54.62/98.77 | 56.22/70.12 |
| Base-Decomp | 51.42/95.14 | 53.32/73.02 |
| Decomp-Rep | 46.32/3.42 | 45.68/6.33 |
From the above tables, we can see that the reflection attack and ISSBA mainly affect MLPs, and our proposed method can effectively defend against these attacks.
Q1: The authors fail to address the applicability of their analysis to non-CLIP-based multi-modal and attention-based models.
A: Thanks for providing this interesting idea. As shown in our response to W2, the experimental results verify that our findings generalize to other attention-based models (i.e., a unimodal vision transformer) and to CLIP with a larger vision transformer (ViT-L/14).
Reference:
[1] Reflection Backdoor: A Natural Backdoor Attack on Deep Neural Networks. ECCV 2020.
[2] Invisible Backdoor Attack with Sample-Specific Triggers. ICCV 2021.
[3] Hidden Trigger Backdoor Attacks. AAAI 2020.
Dear Reviewer jfPF,
Thank you once again for reviewing our paper and providing valuable comments.
We understand that you may be extremely busy at this time, so we deeply appreciate it if you could take some time to return further feedback on whether our additional experimental results against more types of backdoor attacks solve your concerns. Kindly let us know if our response has addressed your concerns adequately.
Best Regards,
Paper9853 Authors
Dear Reviewer jfPF,
Thank you for your valuable comments. We understand that you may be extremely busy at this time, so we would deeply appreciate it if you could take some time to provide further feedback on whether our rebuttal solves your concerns. Kindly let us know if our response has adequately addressed your concerns.
Best Regards,
Paper9853 Authors
Thank you authors for your rebuttal.
Having considered your rebuttal, I will stick to my original rating.
Dear Reviewer jfPF,
We sincerely thank you for responding to our rebuttal and keeping a positive score for our paper. We are encouraged by your acknowledgment of our paper's contribution. In addition, we have incorporated the experiments you suggested about semantic triggers and varying models into the revised manuscript. We believe that your insightful comments and valuable suggestions further improve our paper.
Best Regards,
Paper9853 Authors
Thank you so much for your insightful comments!
W1: In conjunction with strength 3, I will note that although the experimentation is extensive, the authors do not do a good job of explaining the results or giving readers a better understanding of what the experiments show.
A: Thanks for your helpful suggestion. We have improved the illustration of Figure 2 in the revised manuscript, which compares the effects of various backdoor attacks more intuitively. We can see that local patch-based attacks (i.e., BadNet and BadCLIP) make many AHs pay more attention to the trigger compared with global noise-based attacks (i.e., Blended and ISSBA), thereby leading to larger MMD scores (indicating whether backdoor attacks significantly affect the representation of these AHs) and smaller text similarity (indicating whether the text semantics of these AHs have changed dramatically). We believe that the improved illustration in Figure 2 gives readers a better understanding of our findings.
W2: For a more thorough ablation study, the authors could maybe experiment with varying architectures, or modify existing networks to change layer orders to understand their claims.
A: Thanks for your valuable suggestion. We have conducted additional experiments on two types of models: Uni-ViT-B and Mul-ViT-L. Uni-ViT-B denotes a vision transformer (ViT-B/32) trained with supervised learning, while Mul-ViT-L denotes the vision transformer (ViT-L/14) trained with multimodal contrastive learning. Following the original attack setting, we used BadNet to attack these two models and explored the location of infected AHs.
Table: AH ablation on Uni-ViT-B (ASR)
| 1-4 | 5-8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|
| 92.82 | 91.45 | 90.39 | 88.72 | 87.51 | 1.42 |
Table: AH ablation on Mul-ViT-L (ASR)
| 1-8 | 9-16 | 16-20 | 21 | 22 | 23 | 24 |
|---|---|---|---|---|---|---|
| 96.44 | 95.72 | 94.38 | 92.78 | 91.20 | 91.33 | 2.53 |
From the above tables, we can see that ablating the last layer in Uni-ViT-B or Mul-ViT-L significantly decreases the ASR, which indicates that the infected AHs are still centered in the last layer.
W3: In conjunction with strength 4, although the paper's language is lucid, the math hampers readability.
A: Thanks for your helpful suggestion. We have simplified the mathematical symbols in the preliminary and method sections. Specifically, we rewrote the cosine similarity with a compact shorthand and thus simplified Eq. (1) in the preliminary section. In addition, we omitted a redundant symbol and rewrote the symbols of the prototype and the representation in the method section, thereby simplifying Eq. (6) and Eq. (7) for better understanding. The detailed revisions can be found in the highlighted parts of the revised manuscript.
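For instance, one compact shorthand for the cosine similarity (the exact symbol used in the revised manuscript may differ) is:

```latex
% Illustrative shorthand only.
\operatorname{sim}(\mathbf{u},\mathbf{v})
  \;=\;
  \frac{\mathbf{u}^{\top}\mathbf{v}}{\lVert\mathbf{u}\rVert\,\lVert\mathbf{v}\rVert}
```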
(Reviewer jfPF --- R1, Reviewer c25e --- R2, Reviewer QusL --- R3, Reviewer GUYc --- R4).
We sincerely appreciate all reviewers for their efforts in providing many valuable and constructive comments. We are encouraged that they recognize the contributions of our paper to the backdoor, adversarial, and multimodal research community (R1, R2), and affirm our efforts to conduct extensive experiments to reveal empirical findings (R3, R4). We promise that our source code will be released once our paper is accepted. Below are some main concerns raised by most reviewers and corresponding solutions:
- The first concern is the generality of the trigger taxonomy. We have provided experiments on more types of backdoor attacks on CLIP, such as Reflection, Kitty, BadNet-M, and ISSBA. The experimental results can be found in our responses to R1's W4, R2's Q1, R3's Q1, and R4's W1.
- The second concern is the novelty of our findings and methods. We have clarified the novelty of our findings and defense methods. The details can be found in our responses to R3's W1 and R4's W5.
- The third concern is the limited clarity of Figures 2 and 3. We have improved the illustration of Figures 2 and 3 in the revised manuscript.
- The fourth concern is the understanding of key findings. We have clarified the findings with more details and clearer motivation. The details can be found in our response to R4's W4.
In addition, there are some key changes to our manuscript:
- We have simplified the mathematical symbols in the preliminary section and the method section (suggested by R1).
- We have improved the illustration of Figures 2 and 3 by adding clearer original images with varying triggers and quantitative metrics (suggested by R1, R2, and R3).
- We have formally defined the threat model in the preliminary section (suggested by R3).
This paper presents an in-depth analysis of how backdoor attacks impact the CLIP model. Its comprehensive empirical evaluation has been acknowledged positively by all reviewers. However, several concerns were raised, including limited novelty (some key findings seem intuitive and somewhat expected), a defense strategy that is not very practical (identifying the backdoor type and choosing parameters), and a categorization into local-patch and global-noise triggers that seems insufficient. The authors addressed some of these concerns in their rebuttal, providing clarifications, yet these issues were not fully resolved and the reviewers maintained their original ratings. As a result, the overall evaluation falls below the acceptance threshold.
Additional Comments on Reviewer Discussion
The reviewers maintained their original ratings after the rebuttal.
Reject