BDetCLIP: Multimodal Prompting Contrastive Test-Time Backdoor Detection
Abstract
Reviews and Discussion
The paper presents BDetCLIP, a novel test-time backdoor detection method for multimodal contrastive learning models like CLIP. The approach leverages contrastive prompting to differentiate between clean and backdoored images by analyzing the distribution differences in cosine similarity with benign and malignant class description texts. Extensive experiments demonstrate the method's superior effectiveness and efficiency compared to state-of-the-art detection methods.
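To make the scoring described above concrete, the following is a minimal sketch (not the authors' released implementation) of the contrastive detection score, assuming image and prompt embeddings have already been extracted with CLIP's encoders; all function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def contrastive_distribution_difference(image_feat: torch.Tensor,
                                         benign_text_feats: torch.Tensor,
                                         malignant_text_feats: torch.Tensor) -> float:
    """image_feat: (d,); benign_text_feats: (m, d); malignant_text_feats: (k, d)."""
    img = F.normalize(image_feat, dim=-1)
    benign = F.normalize(benign_text_feats, dim=-1)
    malignant = F.normalize(malignant_text_feats, dim=-1)
    # Average cosine similarity to benign (class-descriptive) prompts minus
    # average cosine similarity to malignant (semantically perturbed) prompts.
    return ((benign @ img).mean() - (malignant @ img).mean()).item()

def flag_backdoor(score: float, threshold: float) -> bool:
    # Backdoored images are expected to be insensitive to the semantic change,
    # so their contrastive difference falls below the threshold.
    return score < threshold
```

Under this reading, a clean image reacts strongly to the semantic change between benign and malignant prompts (large score), whereas a backdoored image does not.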
Strengths
The authors claim that this is the first work on backdoor detection during the inference phase. The paper introduces a novel approach by using contrastive prompting, which is innovative and leverages the strengths of large language models. This work offers a computationally efficient solution for real-world applications, surpassing previous methods by large margins.
Weaknesses
The experiments focus on specific attack types and large multimodal models. Including a broader range of attacks and models could enhance the robustness of the evaluation. Some technical aspects, such as the choice of thresholds and specific parameter settings for both CLIP and attack methods, could be explained in greater detail. While the method is computationally efficient, its applicability in highly dynamic environments or with evolving backdoor techniques might require further exploration. Discussing potential challenges and solutions for such scenarios could improve the work's practical relevance.
Questions
The method looks overly reliant on large language models like GPT-4 to generate class-specific prompts. Is the key to the apparent effectiveness of the method in how well a prompt is designed?
Are the clean samples used for detection the same as those used for threshold selection? Additionally, was the complete validation set used when selecting the threshold?
Attack concerns: (a) Why use BadNet-LC and Blended-LC instead of the original version LC? (b) The number of selected backdoor classes (3, 1, 1, respectively) is not sufficient compared to the total number of classes in the dataset. (c) The range of attack methods requires further inclusion, e.g., IAD and WaNet. (d) More detailed attack settings should be provided.
Experimental results concerns: (a) In Table 1, why is the performance of TeCo on Blended and Blended-LC higher in certain cases? (b) For the BadCLIP attack, the performance order of the three comparative detection methods seems to be completely opposite to the previous results. (c) Why did the AUROC increase after incorporating CleanCLIP? The model after defense should be more similar to a clean model, thereby reducing the backdoor effect. (d) In Table 9, the AUROC of Blended-LC decreases when using contrastive prompts.
Response to Reviewer Dxen:
Thank you so much for your insightful comments!
W1: The paper should contain more CLIP models.
A: We would like to emphasize that existing backdoor research [1-6] on vision-language models commonly considers CLIP as a representative victim model due to its reproducibility. Hence, we also follow this setting in the paper and we will investigate other models in future work.
W2: More details about the choice of thresholds.
A: Thanks for your suggestion! We have updated our sampling-based threshold selection strategy. Specifically, we first sampled clean samples at a specified sampling rate. Then, using Eq. (6) in the manuscript, we computed the contrastive distribution difference for all sampled examples, ranked them from largest to smallest, and selected the 85th percentile as the threshold (note that the specific percentile can be adjusted based on real-world defense requirements). To improve the reliability and stability of our experimental results, we performed random sampling ten times and calculated both the mean and the standard deviation. The experimental results are shown in the following tables.
Table: Sampling rate=1%: 500 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8941 ± 0.0107 | 0.9902 ± 0.0013 | 0.8488 ± 0.0127 | 0.9906 ± 0.0003 | 11.7225 ± 1.2723 |
| Blended | 0.8772 ± 0.0061 | 0.9279 ± 0.0142 | 0.8193± 0.0151 | 0.9425 ± 0.0003 | 12.0281 ± 1.2399 |
| BadNet_LC | 0.8938 ± 0.0074 | 0.9842 ± 0.0016 | 0.8476 ± 0.0088 | 0.9796 ± 0.0004 | 16.7526 ± 0.8944 |
| Blended_LC | 0.8837 ± 0.0068 | 0.9396 ± 0.0102 | 0.8290 ± 0.0067 | 0.9420 ± 0.0005 | 15.9315 ± 1.2748 |
Table: Sampling rate=0.5%: 250 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8950 ± 0.0160 | 0.9903 ± 0.0013 | 0.8502 ± 0.0189 | 0.9908 ± 0.0003 | 11.6161 ±1.8596 |
| Blended | 0.8772 ± 0.0109 | 0.9224 ± 0.0211 | 0.8186± 0.0096 | 0.9416± 0.0003 | 11.7094 ± 1.9545 |
| BadNet_LC | 0.8958 ± 0.0128 | 0.9835 ± 0.0029 | 0.8501 ± 0.0151 | 0.9797 ± 0.0004 | 16.4568 ± 1.5488 |
| Blended_LC | 0.8865 ± 0.0070 | 0.9347 ± 0.0106 | 0.8317 ± 0.0070 | 0.9422 ± 0.0005 | 15.3494 ± 1.3340 |
Table: Sampling rate=0.1%: 50 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8775 ± 0.0107 | 0.9904 ± 0.0040 | 0.8312 ± 0.0418 | 0.9905 ± 0.0004 | 13.0927 ± 4.3069 |
| Blended | 0.8564 ± 0.0315 | 0.9391 ± 0.0430 | 0.7987± 0.0279 | 0.9424 ± 0.0003 | 14.2042 ± 4.4056 |
| BadNet_LC | 0.8799 ± 0.0453 | 0.9831 ± 0.0082 | 0.8341 ± 0.0488 | 0.9795 ± 0.0005 | 17.6179 ± 4.8725 |
| Blended_LC | 0.8722 ± 0.0269 | 0.9404 ± 0.0363 | 0.8167 ± 0.0524 | 0.9422 ± 0.0004 | 16.9313 ± 4.2465 |
The tables show that our method achieves impressive detection performance at different sampling rates (with the correspondingly different thresholds) in terms of all metrics. In particular, even with a very small sampling ratio (i.e., 0.1%), our method still maintains good detection performance, which verifies its applicability to real-world applications.
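To make the sampling-based procedure above concrete, here is a minimal sketch assuming `clean_scores` holds the contrastive distribution differences (Eq. (6)) of the clean validation examples; the 85th percentile of the descending ranking is read as the 15th percentile of the ascending scores, and all names are illustrative rather than the authors' code.

```python
import numpy as np

def select_threshold(clean_scores, sample_size=500, percentile_from_top=85.0,
                     n_trials=10, seed=0):
    """Sample clean examples, take the 85th percentile counted from the largest
    score downward as the threshold, and repeat to report mean and std."""
    rng = np.random.default_rng(seed)
    thresholds = []
    for _ in range(n_trials):
        sample = rng.choice(np.asarray(clean_scores), size=sample_size, replace=False)
        # 85th percentile of the descending ranking = 15th percentile of ascending values.
        thresholds.append(np.percentile(sample, 100.0 - percentile_from_top))
    return float(np.mean(thresholds)), float(np.std(thresholds))
```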
W3: More details about specific parameter settings for both CLIP and attack methods.
A: Thanks for your valuable comments! We have updated our manuscript to provide more detailed specific parameter settings for both CLIP and attack methods in the section "Details of attacking CLIP" of Appendix D.
W4: While the method is computationally efficient, its applicability in highly dynamic environments or with evolving backdoor techniques might require further exploration. Discussing potential challenges and solutions for such scenarios could improve the work's practical relevance.
A: Thank you for your insightful comments. Future multimodal backdoor attacks may become more covert and strive to reduce the semantic gap between triggers and target labels. To address this challenge, future research could explore higher-quality prompt generation methods that more effectively highlight the semantic gap, or integrate BDetCLIP with backdoor defense strategies at other stages; both are potential solutions.
Q1: Is the key to the apparent effectiveness of the method in how well a prompt is designed?
A: Yes, we have already analyzed the importance of prompts in Section 4.3, "FURTHER ANALYSIS OF CLASS-SPECIFIC PROMPT". However, we do not need to be overly concerned about the quality of prompts: our main experimental results show that the prompts generated with GPT-4 already meet our defense requirements.
Q2: Are the clean samples used for detection the same as those used for threshold selection? Additionally, was the complete validation set used when selecting the threshold?
A: Not exactly the same. The samples used for threshold selection are randomly sampled from the validation set; we do not need to use the entire validation set and can achieve very good results with only a small part of it. Specifically, we sampled validation examples at rates of 1%, 0.5%, and 0.1%. For each sampling ratio, we performed random sampling ten times and calculated both the mean and the standard deviation. The target class is "ant" and the backdoor ratio is 0.3. The experimental results are given in the previous response to W2.
As shown in the tables of W2's response, our method achieves exceptional performance across all metrics, particularly in terms of recall, even with a very small sampling set, further demonstrating the superiority of our approach.
Q3: Why use BadNet-LC and Blended-LC instead of the original version LC?
A: We would like to explain that we treat LC as the clean-label setting following the experimental setting in the paper [1]. This is because we can directly achieve the clean-label attack on CLIP with different types of triggers (e.g., BadNet and Blended) rather than leveraging additional adversarial perturbations to make the model rely on the triggers in the original version of LC. Therefore, we can integrate LC with other types of backdoor attacks and validate the effectiveness of BDetCLIP against them.
Q4: The number of selected backdoor classes (3, 1, 1, respectively) is not sufficient compared to the total number of classes in the dataset.
A: Thank you for your suggestion! In fact, many papers in the field of backdoor detection only target one attack category for detection, which you can verify in [7-8]. To further address your concerns, we have added more backdoor classes. Due to time constraints, we have attacked four new categories. If you have further requirements, we would be willing to attack new backdoor categories to verify the reliability of our method. The results are as follows:
Table: Detection performance of new categories
| Categories → | Goldfish | Lion | Rooster | Tench |
|---|---|---|---|---|
| SCALE-UP | 0.781 | 0.806 | 0.741 | 0.673 |
| BDetCLIP (ours) | 0.977 | 0.992 | 0.951 | 0.992 |
Our method has consistently performed extremely well, demonstrating its reliability.
Q5: The range of attack methods requires further inclusion, e.g., IAD and WaNet.
A: Thank you for your suggestion. We have included WaNet, and here are our experimental results:
Table: Detection performance (WaNet)
| SCALE-UP | BDetCLIP (Ours) |
|---|---|
| 0.920 | 0.982 |
Due to time constraints, we have not been able to use IAD to successfully attack CLIP. However, in our paper we have provided the detection results for ISSBA, an attack of the same type as IAD; its trigger is invisible, making it more difficult to detect. The detection performance of our method is as follows:
Table: Detection performance (ISSBA)
| Detection → | STRIP | SCALE-UP | TeCo | BDetCLIP (Ours) |
|---|---|---|---|---|
| ISSBA | 0.351 | 0.515 | 0.496 | 0.927 |
Q6: More detailed attack settings should be provided.
A: Thank you for your valuable suggestions! We have updated our manuscript accordingly. Specifically, we added the attack training epochs, initial learning rate, learning rate scheduling method, and optimizer in Appendix D, under the section "Details of Attacking CLIP".
Q7: In Table 1, why is the performance of TeCo on Blended and Blended-LC higher in certain cases?
A: Yes, we have also observed this interesting phenomenon. We conjecture that the Blended trigger exhibits more noticeable robustness differences under the image corruptions used by TeCo, thus leading to better detection results relative to the other attacks.
Q8: For BadCLIP attack, the performance order of the three comparative detection methods seems to be completely opposite to the previous results.
A: Yes, we have also noticed this phenomenon. We conjecture that it may be due to the distinction between BadCLIP (a multimodal backdoor attack method targeting CLIP) and traditional backdoor attack methods. Since the focus of our paper is on using our method to defend against existing backdoor attacks, we will further explore the distinctions between multimodal backdoor attacks and traditional backdoor attacks in future work.
Q9: Why did the AUROC increase after incorporating CleanCLIP? The model after defense should be more similar to a clean model, thereby reducing the backdoor effect.
A: In fact, CleanCLIP does not turn the model attacked by BadCLIP into a secure model (the attack still achieves a high ASR of 0.902); the CLIP model at this point still retains a significant backdoor effect. We believe this is likely because BadCLIP aims to ensure that the visual trigger pattern closely aligns with the textual target semantics in the embedding space, which, to some extent, challenges our assumption that "the visual trigger cannot be well-aligned with the target label." However, CleanCLIP disrupts the alignment between the backdoor trigger and the target label, allowing BDetCLIP to once again achieve outstanding detection performance, which also shows the strong compatibility of our detection method with defense strategies at other stages.
Q10: In Table 9, the AUROC of Blended-LC decreases when using contrastive prompts.
A: Yes, we guess it is because the original prompt template already contains enough semantic information for the Blended-LC backdoored model. In most cases, however, using the current contrastive prompts still yields the best performance. Additionally, even with a slight decrease (0.03), it does not affect the reliability of our method.
References:
[1] Cleanclip: Mitigating data poisoning attacks in multimodal contrastive learning. In ICCV, 2023.
[2] Badclip: dual-embedding guided backdoor attack on multimodal contrastive learning. In CVPR, 2024.
[3] BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP. In CVPR, 2024.
[4] Robust contrastive language-image pretraining against data poisoning and backdoor attacks. In NeurIPS, 2023.
[5] Better Safe than Sorry: Pre-training CLIP against Targeted Data Poisoning and Backdoor Attacks. In ICML, 2024.
[6] Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning. In Arxiv, 2024.
[7] SCALE-UP: An Efficient Black-box Input-level Backdoor Detection via Analyzing Scaled Prediction Consistency. In ICLR, 2023.
[8] Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency. In CVPR, 2023.
[9] Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts. In ICCV-W, 2023.
[10] What does a platypus look like? Generating customized prompts for zero-shot image classification. In ICCV, 2023.
[11] Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions. In CVPR, 2024.
[12] Visual Classification via Description from Large Language Models. In arXiv, 2022.
Dear Reviewer Dxen,
Thanks for the time to provide valuable comments on our paper. We would greatly appreciate it if you could return further feedback on our rebuttal.
Best Regards,
Authors
The reviewer appreciates the efforts from the authors. According to the rebuttal, some of my concerns are well addressed, i.e., the experimental settings and the statements. However, some concerns are not well addressed. The response about the importance of prompts is not satisfying; I cannot agree with the statement ``our main experimental results show that as long as we use our prompts with GPT4, the prompt quality meets our defense requirements''. As I stated before, the performance of this method might be highly correlated with the third-party LLMs. Additionally, the analysis of the results is also not clear.
2. Further analysis on Q8: For the BadCLIP attack, the performance order of the three comparative detection methods seems to be completely opposite to the previous results. The observation is true that the three comparative detection methods (i.e., STRIP, SCALE-UP, and TeCo) achieved significantly different performance on BadCLIP (shown in Table 18) compared with BadNet and Blended (shown in Table 1). Note that BadCLIP has a greater effect on backdoor samples in the feature space than BadNet and Blended, i.e., it pushes the features of backdoor samples closer to the features of the target class. Therefore, the image corruption (or perturbation) techniques used in the three detection methods cannot significantly affect the prediction values or labels of backdoor samples on BadCLIP compared with BadNet and Blended. Considering this point, the corruption robustness consistency of backdoor samples would be inherently higher on BadCLIP than on BadNet and Blended, which is inconsistent with the detection principle (i.e., Eq. (4)) used in TeCo, namely that backdoor samples are assumed to show a larger deviation of the severity level across different types of corruption. Hence, TeCo achieved worse performance on BadCLIP in Table 18 and better performance on BadNet and Blended in Table 1. On the other hand, STRIP and SCALE-UP assume that backdoor samples have higher prediction consistency than clean samples, which is consistent with the effect of BadCLIP. Therefore, STRIP and SCALE-UP achieved better performance on BadCLIP in Table 18. Different from these detection methods that focus on the vision modality, our method considers the sensitivity to semantic changes in the text modality and achieves consistently superior performance on different types of backdoor attacks.
3. Further analysis on Q9: Why did the AUROC increase after incorporating CleanCLIP? The model after defense should be more similar to a clean model, thereby reducing the backdoor effect. We would like to explain that CleanCLIP did not effectively defend against BadCLIP (the ASR of BadCLIP after using CleanCLIP was 90.2%, close to the 89.60% reported in the original paper). Therefore, the backdoor effect still exists in the BadCLIP model after using CleanCLIP, which is why our method can work in this case. Besides, we would like to further explain why the detection performance of our method becomes better after using CleanCLIP, which may seem a little counterintuitive. Inherently, BadCLIP is a strong adaptive attack against BDetCLIP because it incorporates many class-related attribute texts into the triggers and thus makes backdoor samples more sensitive to the semantic changes in prompts, which led to a relatively mediocre performance of BDetCLIP. After applying CleanCLIP to BadCLIP, although the ASR decreases little, the alignment between the trigger and the many class-related attribute texts of the target label is disrupted, thereby making backdoor samples less sensitive to the semantic changes in prompts. In this way, our method can effectively detect backdoor samples in the BadCLIP model after CleanCLIP.
4. Further analysis on Q10: In Table 9, the AUROC of Blended-LC decreases when using contrastive prompts. It is true that the AUROC of Blended-LC decreased when using the original contrastive prompts compared with the case without class-specific benign prompts. We guess this is because the original prompt template already contains enough semantic information for the Blended-LC backdoored model, in which the target label is "ant". To further address your concerns about our method, we conducted additional experiments on backdoored models with target classes set as 'banana' or 'basketball'. The experimental results are as follows:
Table: Comparison of AUROC using different prompts on ImageNet-1K.
| Prompts ↓ / Target class → | Ant | Banana | Basketball | Average |
|---|---|---|---|---|
| w/o class-specific benign prompts (using class template) | 0.974 | 0.835 | 0.912 | 0.907 |
| original contrastive prompts | 0.942 | 0.991 | 0.993 | 0.975 |
Overall, using original contrastive prompts generally yields the best detection performance.
Dear Reviewer Dxen,
Thank you for your valuable suggestion for analyzing the importance of prompts. We would greatly appreciate it if you could discuss whether our additional experiments and analysis have solved your concerns. We look forward to your feedback.
Best regards,
Authors
Dear Reviewer Dxen,
Thank you once again for taking your valuable time to provide constructive comments on our paper. As the discussion deadline approaches, we sincerely wonder whether our rebuttal has solved your concerns.
Best Regards,
Authors
Q: The response about the importance of prompts is not satisfying, I could not agree to the statement ``our main experimental results show that as long as we use our prompts with GPT4, the prompt quality meets our defense requirements''. As I stated before, the performance of this method might be highly correlated to the third-party LLMs
A: We understand your concerns about the quality of the class-specific prompts used in our method. To address them, we have conducted more experiments that generate different prompts using various open-source language models. The experimental results are shown in the following table.
Table: Detection performance by using different models to generate prompts.
| Model | BadNet | Blended |
|---|---|---|
| GPT-4 | 0.941 | 0.977 |
| ------------------------ | ------ | ------- |
| Llama3-8B-Instruct | 0.947 | 0.983 |
| Mistral-7B-Instruct-v0.2 | 0.983 | 0.963 |
| Yi-1.5-9B-Chat | 0.977 | 0.960 |
| gemma-2-2b-it | 0.979 | 0.954 |
| gemma-2-9b-it | 0.970 | 0.948 |
| Phi-3.5-mini-instruct | 0.917 | 0.947 |
| Qwen2.5-7B-Instruct | 0.929 | 0.948 |
| Qwen2.5-14B-Instruct | 0.923 | 0.969 |
| ------------------------ | ------ | ------- |
| STRIP | 0.893 | 0.244 |
| SCALE-UP | 0.768 | 0.671 |
| TeCo | 0.834 | 0.949 |
The above table shows that our method maintains excellent and relatively stable detection performance when using various language models to generate different prompts. We argue that this is because our method only requires contrastive class-specific prompts with a significant semantic gap, and many open-source language models can achieve this objective very well. It is noteworthy that the detection performance with nearly all the language models used is better than that of the compared methods. In particular, the widely used open-source language model Llama3-8B-Instruct achieves nearly the same performance as GPT-4. Therefore, in our humble opinion, we do not have to worry too much about the quality of prompts used in our method. Nevertheless, we agree with your viewpoint that different prompts may have different influences, and it would be interesting and promising to explore better prompt designs for backdoor detection in future work (as we have already used 10 language models in the above experiments, it is not feasible to cover all other related explorations in a single conference paper).
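As an illustration of how such prompts can be obtained from an open-source model, the sketch below queries an instruction-tuned chat model through the Hugging Face `transformers` text-generation pipeline; the model name, request wording, and function are illustrative placeholders, not the exact templates used in the paper.

```python
from transformers import pipeline

# Any instruction-tuned chat model can be substituted here (e.g., one of the
# models listed in the table above); the name below is only an example.
generator = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

def class_specific_benign_prompts(class_name: str, num_prompts: int = 6) -> str:
    """Ask the language model for short visual descriptions of the class.
    The malignant (semantically contrasting) prompts are generated analogously,
    following the design described in the paper."""
    request = (f"Give {num_prompts} short sentences, each describing a different "
               f"visual attribute (color, shape, texture, context) of a {class_name}.")
    return generator(request, max_new_tokens=256)[0]["generated_text"]
```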
Q: Additionally, the analysis of the results is also not clear.
A: We are sorry that the analysis of the results was unclear. To further clarify, we have refined the analysis of the results with more details.
1. Further analysis on Q7: In Table 1, why is the performance of TeCo on Blended and Blended-LC higher in certain cases? It is true that the detection performance of TeCo on Blended and Blended-LC is higher than that on BadNet, BadNet-LC (as shown in Table 1), and BadCLIP (as shown in Table 18 in the Appendix). We argue that this is because the corruption techniques used in TeCo (e.g., Gaussian Noise, Motion Blur, and Brightness) are applied to all image pixels, which has a more significant effect on the trigger in Blended (i.e., global noise pixels) than on the triggers in BadNet and BadCLIP (local patches), and thus leads to a larger Corruption Robustness Consistency score for Blended (thereby achieving better detection performance). A similar observation can also be found in Table 2 of the original TeCo paper, i.e., the detection performance on Blended is commonly higher than that on BadNet.
Dear Reviewer Dxen,
Thank you for your valuable feedback and for recognizing our efforts. We notice that you still maintain the rating of 5 (marginally below the acceptance threshold). We would be grateful if you could consider updating the rating, as you confirmed that you have no additional problems. Feel free to let us know if you have any other concerns, and we will try our best to address them as soon as possible. Thanks!
Best regards,
Authors
Dear Reviewer Dxen,
Thank you again for your valuable comments and suggestions, which are very helpful to us. We have posted responses to the proposed concerns.
We understand that this is quite a busy period, so we sincerely appreciate it if you could take some time to return further feedback on whether our responses resolve your concerns.
Best,
Authors
I appreciate the response from authors and I do feel your efforts. Now I have no additional problems.
The paper presents BDetCLIP, a novel method for detecting backdoored samples in multimodal contrastive learning models during the inference time. The proposed method is computationally efficient and effective.
Strengths
- The paper considers a novel scenario that detects backdoors in inference time for MCL. The proposed method is efficient and effective.
- The authors conduct extensive experiments on multiple datasets and diverse attack types. The results demonstrate that BDetCLIP outperforms existing detection methods (e.g., STRIP, SCALE-UP, TeCo) in both effectiveness and efficiency.
Weaknesses
- Clarity of Methodology
(1) Motivation Lacks Depth: The authors base their proposed approach on a single observation: “the distribution difference of backdoored images between the benign and malignant changes of class prompts is smaller than that of clean images.” In my view, this constitutes an insufficient contribution. The paper does not explain the underlying mechanism of this phenomenon, making it hard to validate whether this observation holds across different attack scenarios.
(2) Furthermore, the authors have not provided theoretical validation regarding the hyperparameter m and its relationship with the threshold; instead, they rely solely on empirical evidence to suggest that their method outperforms existing unimodal detection approaches. This lack of theoretical grounding renders the method less solid.
(3) Additionally, the authors have not assessed the detection capabilities of their proposed method against adaptive attacks. Consequently, the reliability of the proposed approach against different types of backdoor attacks comes into question.
- Insufficiency of Experiments
(1) Although the authors claim to be the first to propose a detection method for CLIP during the inference phase and assert that the costs of fine-tuning defense methods are higher than those for inference-stage detection, this does not justify the absence of comparisons with existing defense methods. Both defense and detection share the common goal of preventing backdoor attacks. If defenses are indeed more effective than detection methods, then it is unlikely that practical applications would forgo defenses due to cost concerns. The authors argue that the metrics for the two approaches differ; however, I believe they could establish a specific threshold to convert detection results into Attack Success Rate (ASR) for a more direct comparison with existing defense methods. Without this, the authors cannot substantiate the superiority of their proposed method.
(2) The authors predominantly use AUROC to evaluate the effectiveness of their method, which is insufficient for assessing the impact on model performance. I recommend that the authors include additional metrics, such as precision and accuracy, to demonstrate the method's effects on normal model performance.
(3) Moreover, I believe the use of BadNet and Blended in the main experiments is inadequate. Given that the focus of this research is on multimodal contrastive learning, both BadNet and Blended are unimodal methods. The authors should compare their method against more recent multimodal contrastive learning attack methods to showcase its advantages. While the authors do compare against BadCLIP in Table 6, the low AUROC raises the question of whether the proposed method is less effective for detecting multimodal backdoor attacks.
(4) Dependency on Large Language Models: The reliance on models like GPT-4 for generating prompts could introduce significant computational costs. While the paper mentions the feasibility of using open-source alternatives, it lacks detailed evaluations comparing the performance of these alternatives against GPT-4-generated prompts.
Late Response to Authors
I appreciate the authors' efforts during the rebuttal phase, and as a result, I have decided to raise my rating to 6. However, after reviewing the latest PDF version of the paper, I believe that most of the main experimental results are insufficient, as they primarily compare BadNet, Blended, and their LC versions. These two methods are both quite outdated and focus on single-modal attacks, which significantly weakens the overall experimental validity. While the authors have added some additional experiments with great effort, I still feel that, for a paper aimed at CLIP detection, this is not enough. Given the time constraints during the rebuttal period, I suggest that this issue be addressed in the revised version. Otherwise, I believe the paper will not meet the acceptance criteria.
Questions
Please see the Weaknesses for the identified issues.
Response to Reviewer gXdL:
We sincerely appreciate the reviewer for providing helpful comments.
W1: Motivation lacks depth.
A: Thanks for your comment. We would like to further clarify our motivation, which stems from our insights into multimodal backdoor attacks. Specifically, we have identified that the visual triggers of backdoor attacks on CLIP often fail to align well with the textual labels of the corresponding classes. This misalignment prevents the visual triggers from effectively capturing the semantic variations in class-descriptive texts. Consequently, the distributional differences between backdoor images under benign and malicious prompt variations are smaller compared with the distributional differences observed in clean images. This observation holds across various types of backdoor attacks. The visualized distributional differences in Figure 3, together with our strong experimental results, provide solid evidence supporting our perspective.
W2: Theoretical validation regarding the hyper-parameter m and its relationship with the threshold.
A: Thanks for your comment. We admit that there is no theoretical guarantee for selecting hyper-parameters in BDetCLIP. We would like to explain that theoretical analysis of vision-language models is considerably challenging due to the black-box property of the models. Therefore, many papers [1,4,5,8,9,10,11] on vision-language models mainly focus on empirically designing solid methods and do not provide a theoretical analysis of them. Considering this point, we have provided solid empirical support for the hyper-parameters selected in BDetCLIP through experimental results. In Table 7, our experimental results demonstrate that, in most cases, selecting a larger value of m is more beneficial for detection performance. Regarding the threshold selection, we have updated our sampling-based threshold selection strategy. Specifically, we first sampled clean samples at a specified sampling rate. Then, using Eq. (6) in the manuscript, we computed the contrastive distribution difference for all sampled examples, ranked them from largest to smallest, and selected the 85th percentile as the threshold (note that the specific percentile can be adjusted based on real-world defense requirements). To improve the reliability and stability of our experimental results, we performed random sampling ten times and calculated both the mean and the standard deviation. The experimental results are shown in the following tables.
Table: Sampling rate=1%: 500 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8941 ± 0.0107 | 0.9902 ± 0.0013 | 0.8488 ± 0.0127 | 0.9906 ± 0.0003 | 11.7225 ± 1.2723 |
| Blended | 0.8772 ± 0.0061 | 0.9279 ± 0.0142 | 0.8193± 0.0151 | 0.9425 ± 0.0003 | 12.0281 ± 1.2399 |
| BadNet_LC | 0.8938 ± 0.0074 | 0.9842 ± 0.0016 | 0.8476 ± 0.0088 | 0.9796 ± 0.0004 | 16.7526 ± 0.8944 |
| Blended_LC | 0.8837 ± 0.0068 | 0.9396 ± 0.0102 | 0.8290 ± 0.0067 | 0.9420 ± 0.0005 | 15.9315 ± 1.2748 |
Table: Sampling rate=0.5%: 250 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8950 ± 0.0160 | 0.9903 ± 0.0013 | 0.8502 ± 0.0189 | 0.9908 ± 0.0003 | 11.6161 ±1.8596 |
| Blended | 0.8772 ± 0.0109 | 0.9224 ± 0.0211 | 0.8186± 0.0096 | 0.9416± 0.0003 | 11.7094 ± 1.9545 |
| BadNet_LC | 0.8958 ± 0.0128 | 0.9835 ± 0.0029 | 0.8501 ± 0.0151 | 0.9797 ± 0.0004 | 16.4568 ± 1.5488 |
| Blended_LC | 0.8865 ± 0.0070 | 0.9347 ± 0.0106 | 0.8317 ± 0.0070 | 0.9422 ± 0.0005 | 15.3494 ± 1.3340 |
Table: Sampling rate=0.1%: 50 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8775 ± 0.0107 | 0.9904 ± 0.0040 | 0.8312 ± 0.0418 | 0.9905 ± 0.0004 | 13.0927 ± 4.3069 |
| Blended | 0.8564 ± 0.0315 | 0.9391 ± 0.0430 | 0.7987± 0.0279 | 0.9424 ± 0.0003 | 14.2042 ± 4.4056 |
| BadNet_LC | 0.8799 ± 0.0453 | 0.9831 ± 0.0082 | 0.8341 ± 0.0488 | 0.9795 ± 0.0005 | 17.6179 ± 4.8725 |
| Blended_LC | 0.8722 ± 0.0269 | 0.9404 ± 0.0363 | 0.8167 ± 0.0524 | 0.9422 ± 0.0004 | 16.9313 ± 4.2465 |
The tables show that our method achieves impressive detection performance at different sampling rates (with the correspondingly different thresholds) in terms of all metrics. In particular, even with a very small sampling ratio (i.e., 0.1%), our method still maintains good detection performance, which verifies its applicability to real-world applications. It is worth noting that in our sampling-based selection method, the choice of the threshold is independent of m. To further address your concerns, we conducted additional experiments to investigate the relationship between m and the threshold. Specifically, we tested with m = 6, 5, 4, 3, applied the aforementioned threshold selection strategy, and performed random sampling ten times. For each case, we calculated both the variance and the mean of the selected thresholds. The mean value was then used as the threshold for the subsequent experiments. The experimental results are as follows:
Table: Experiment to explore the relationship between m and threshold.
| m | Threshold (mean) | Accuracy | Recall | F1 | AUROC |
|---|---|---|---|---|---|
| 6 | 11.7199 | 0.8785 | 0.9238 | 0.8200 | 0.9417 |
| 5 | 5.2971 | 0.8640 | 0.8638 | 0.7919 | 0.9280 |
| 4 | 2.1915 | 0.8539 | 0.8335 | 0.7737 | 0.9200 |
| 3 | -1.3766 | 0.8424 | 0.7986 | 0.7523 | 0.9099 |
We can see that the larger m is, the better the overall detection performance, and the larger the corresponding threshold. This is intuitive: as m increases, the number of benign prompts grows, providing more fine-grained information, which increases the semantic difference between the benign prompts and the malicious prompts.
W3: Detecting against adaptive attacks.
A: Thanks for your helpful suggestion. We assume you are referring to adaptive attacks based on sample-specific dynamic triggers. To address your concern, we have conducted experiments to defend against one strong adaptive attack, ISSBA [2], which optimizes a sample-specific trigger for each input image with an encoder-decoder network.
Table: Detection performance (ISSBA)
| Detection → | STRIP | SCALE-UP | TeCo | BDetCLIP (Ours) |
|---|---|---|---|---|
| ISSBA | 0.351 | 0.515 | 0.496 | 0.927 |
The table shows that our proposed method BDetCLIP can still effectively detect backdoor samples with dynamic triggers.
Besides, you may also be referring to an adaptive attack specifically designed against BDetCLIP. To address this concern, we would like to explain that BadCLIP [1] is indeed such an adaptive attack, where the visual trigger closely aligns in the embedding space with class-specific texts containing many target-class-related attribute words. The detection performance against BadCLIP is shown in Table 6 of our original manuscript. We can see that our proposed BDetCLIP can achieve superior performance, especially when combined with CleanCLIP [5]. Therefore, we argue that our method can effectively defend against this adaptive attack.
W7: The reliance on models like GPT-4 for generating prompts could introduce significant computational costs. While the paper mentions the feasibility of using open-source alternatives, it lacks detailed evaluations comparing the performance of these alternatives against GPT-4-generated prompts.
A: Thanks for your comment. We would like to explain that we indeed included the comparison you mentioned in Tables 2 and 17 of the original manuscript (now Tables 2 and 25). Specifically, we used open-source large language models (such as Llama3-8B and Mistral-7B-Instruct-v0.2) to generate the benign and malignant class-specific prompts. We summarize the experimental results as follows.
Table: Detection performance comparison by using different models to generate prompts.
| Model | BadNet | Blended |
|---|---|---|
| Llama3-8B | 0.947 | 0.983 |
| Mistral-7B-Instruct-v0.2 | 0.983 | 0.963 |
| GPT4 | 0.941 | 0.977 |
The experimental results show that the detection performance of using open-source large language models is comparable to that of using GPT-4, which verifies the feasibility of using open-source alternatives in BDetCLIP.
References:
[1] BadCLIP: dual-embedding guided backdoor attack on multimodal contrastive learning. In CVPR, 2024.
[2] Invisible backdoor attack with sample-specific triggers. In ICCV, 2021.
[3] SCALE-UP: An Efficient Black-box Input-level Backdoor Detection via Analyzing Scaled Prediction Consistency. In ICLR, 2023.
[4] BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP. In CVPR, 2024.
[5] CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning. In ICCV, 2023
[6] Data poisoning attacks against multimodal encoders. In ICML, 2023.
[7] Dual-Key Multimodal Backdoors for Visual Question Answering. In CVPR, 2022
[8] Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts. In ICCV-W, 2023.
[9] What does a platypus look like? Generating customized prompts for zero-shot image classification. In ICCV, 2023.
[10] Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions. In CVPR, 2024.
[11] Visual Classification via Description from Large Language Models. In arXiv, 2022.
Dear Reviewer gXdL,
Thanks for the time to provide valuable comments on our paper. We would greatly appreciate it if you could return further feedback on our rebuttal.
Best Regards,
Authors
Thank you for your detailed and thoughtful rebuttal. I appreciate the effort you have made to address most of my concerns, and I am pleased to see that your responses provide valuable insights and experimental results. I strongly encourage you to incorporate these insights and additional experiments into the revised version of your manuscript to strengthen its impact.
That said, I believe there is still a critical issue that remains unresolved. Since this work primarily focuses on detection in CLIP, the experiments should emphasize multimodal backdoor attacks targeting CLIP. Without carefully designed multimodal triggers, as noted in the paper, backdoor attacks might exhibit the challenges you described. However, when multimodal triggers are explicitly co-designed, such as in works like BadCLIP, the effectiveness of your proposed method might diminish, as evidenced by Table 6 in your paper.
This is currently my main concern. I would like to see more results demonstrating the effectiveness of your method against backdoor attacks specifically designed for CLIP in multimodal settings. In fact, there are already several multimodal attack methods in the literature, such as:
1.BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning, SP 2022.
2.TrojVLM: Backdoor Attack Against Vision Language Models, ECCV 2024.
3.Data Poisoning Attacks Against Multimodal Encoders, ICML 2023.
4.Dual-Key Multimodal Backdoors for Visual Question Answering, CVPR 2022.
I hope to see a more comprehensive evaluation of your method against such attacks in the context of CLIP. This would strengthen the claims of your work.
Q: I strongly encourage you to incorporate these insights and additional experiments into the revised version of your manuscript to strengthen its impact.
A: We sincerely thank you for acknowledging our efforts to conduct more experiments to validate the effectiveness of our method. We have incorporated your constructive suggestions into the revised manuscript, mainly including the ASR comparison with CleanCLIP (Appendix E, Table 13), more evaluation metrics of our method (Appendix B: Threshold Selection, Tables 10-12), and the exploration of the relationship between m and the threshold (the section "The relationship between the number of class-specific benign prompts and threshold" in Appendix F: MORE EXPERIMENTAL RESULTS, Table 27). In addition, we have further provided a more thorough analysis of the experimental results in the updated manuscript (although we currently cannot upload it because the OpenReview system does not allow authors to upload a new manuscript at this time).
Q: I believe there is still a critical issue that remains unresolved. Since this work primarily focuses on detection in CLIP, the experiments should emphasize multimodal backdoor attacks targeting CLIP.
A: Thank you for your valuable suggestion. We understand your concerns about using our method to defend against more multimodal backdoor attacks targeting CLIP. To address your concerns, we have conducted additional experiments by using BadEncoder [1] and TrojVQA [4] as attack methods. Specifically, as for BadEncoder, we followed its open-source code and used the experimental settings in the original paper, i.e., the target class is 'truck' and the dataset is STL10. As for TrojVQA [4], based on its open-source code, we adapted it to attack CLIP (the target class is 'banana'). The experimental results are shown in the following table.
Table: Detection performance.
| Attack | SCALE-UP | BDetCLIP (ours) |
|---|---|---|
| BadEncoder | 0.694 | 0.891 |
| TrojVQA | 0.744 | 0.978 |
The table indicates that our method can still achieve impressive performance, which validates its effectiveness in defending against multimodal backdoor attacks targeting CLIP. In addition, we would like to explain why we did not evaluate our method with mmPoison [3] and TrojVLM [2] as attack methods. It is noteworthy that mmPoison [3] is a data poisoning attack on multimodal encoders and cannot be adapted to a backdoor attack on CLIP. Furthermore, TrojVLM [2] has not released its official code yet, so it is extremely difficult to reproduce it faithfully within the rebuttal period. Despite this, we believe that our additional experimental results on BadEncoder and TrojVQA have verified the superior ability of our method to defend against multimodal backdoor attacks targeting CLIP. We have also followed your suggestion and incorporated these experimental results into the updated manuscript.
References:
[1] BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning, In SP, 2022.
[2] TrojVLM: Backdoor Attack Against Vision Language Models, In ECCV, 2024.
[3] Data Poisoning Attacks Against Multimodal Encoders, In ICML, 2023.
[4] Dual-Key Multimodal Backdoors for Visual Question Answering, In CVPR, 2022.
Dear Reviewer gXdL,
Thank you for suggesting integrating more multimodal backdoor attacks targeting CLIP into our work. We would greatly appreciate it if you could return further feedback on whether our additional experiential results and analysis have solved your concerns. We look forward to your reply.
Best Regards,
Authors
Dear Reviewer gXdL,
Thank you once again for taking your valuable time to provide constructive comments on our paper. As the discussion deadline approaches, we sincerely wonder whether our rebuttal has solved your concerns.
Best Regards,
Authors
W4: Defense results comparison with existing defense methods.
A: Thanks for your helpful suggestion. We understand your concern about comparing BDetCLIP with current fine-tuning defense methods by using the same common metric Attack Success Rate (ASR). To address your concern, we have tried our best to conduct this experiment by using ASR. Specifically, during the inference stage, we directly discard the selected backdoor samples whose contrastive distribution differences are lower than the threshold and only make predictions on the remaining samples. We set the backdoor ratio to 1 and thus calculate the ASR as the ratio of the remaining samples (whose prediction label is the target class) to the total number of backdoor samples. We argue that this strategy is fair for comparison and reasonable to use in practice.
We used the threshold selection method mentioned in the response to W2 and specified a sampling rate of 1%. To improve the reliability and stability of our experimental results, we performed random sampling ten times and calculated both the mean and the standard deviation. We compare the ASR with the fine-tuning-based defense method CleanCLIP [5]. Note that we keep the same experimental setting used in CleanCLIP, i.e., using the same backdoored model weights (i.e., BadNet, Blended, LC) provided by the original paper. The experimental results are shown in the following table.
Table: Comparison with the defense results of CleanCLIP
| Attack | CleanCLIP | BDetCLIP(ours) |
|---|---|---|
| BadNet | 0.1046 | 0.0195 ± 0.0040 |
| Blended | 0.0980 | 0.0047 ± 0.0012 |
| LC | 0.1108 | 0.1163 ± 0.0121 |
The table shows that BDetCLIP can effectively decrease the ASR compared with the current fine-tuning defense method CleanCLIP [5]. Therefore, we argue that BDetCLIP could be used to defend against backdoor attacks effectively in practical applications.
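For clarity, here is a minimal sketch of the conversion described above, under the stated protocol (backdoor ratio of 1): inputs flagged as backdoored are discarded, and ASR is the fraction of all backdoored inputs that survive filtering and are predicted as the target class; all names are illustrative.

```python
def attack_success_rate(scores, predictions, target_class, threshold):
    """scores: contrastive distribution differences of backdoored test inputs;
    predictions: zero-shot predicted labels for the same inputs."""
    total = len(scores)
    # Samples flagged as backdoored (score below threshold) are discarded,
    # so only surviving samples predicted as the target class count as successes.
    surviving_hits = sum(1 for s, p in zip(scores, predictions)
                         if s >= threshold and p == target_class)
    return surviving_hits / total
```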
W5: The authors predominantly use AUROC to evaluate the effectiveness of their method, which is insufficient for assessing the impact on model performance. I recommend that the authors include additional metrics, such as precision and accuracy, to demonstrate the method's effects on normal model performance.
A: Thanks for your helpful suggestion. We understand your concern about using only AUROC as the metric to evaluate BDetCLIP. To address your concerns, we have reported additional metrics such as accuracy, recall, and F1 in our response to W2 to fully evaluate the effectiveness of BDetCLIP. As shown in the first three tables in the W2's response, our method achieves promising performance across all metrics, particularly in terms of recall, even with a very small sampling set, further demonstrating the superiority of our approach.
W6: The authors should compare their method against more recent multimodal contrastive learning attack methods to showcase its advantages.
A: Thanks for your valuable suggestion. We would like to explain that very few existing backdoor attacks have been adapted to CLIP, and we have already used BDetCLIP to defend against two recent backdoor attacks on CLIP (i.e., BadCLIP-1 [1] and BadCLIP-2 [4]). We understand your concerns regarding the detection performance of BDetCLIP against other multimodal backdoor attacks on CLIP. To address them, we have tried our best to adapt such attacks to CLIP. Specifically, we attempted and failed to adapt mmPoison [6] to attack CLIP, because mmPoison is a data poisoning attack on multimodal encoders and cannot be adapted to a backdoor attack on CLIP. Besides, we are actively working on adapting TrojVQA [7] to attack CLIP. It is worth noting that TrojVQA was not originally designed to attack CLIP, and its official paper and code do not discuss CLIP-related scenarios; we are still exploring how to adapt it. We sincerely ask for your understanding of these limitations and acknowledgment of the detection performance of BDetCLIP against the currently available backdoor attacks on CLIP. We would be very grateful if you could point out further advanced multimodal attack methods on CLIP so that we can conduct additional experiments as soon as possible.
Dear Reviewer gXdL,
We sincerely appreciate the time and effort you have dedicated to reviewing our work. Your insights have been invaluable in helping us refine our manuscript.
We have carefully addressed your comments and have submitted our revised response. We would be grateful if you could take a moment to review our response.
Thank you once again for your valuable contribution.
Best,
Authors
Dear Reviewer gXdL,
We sincerely thank you for recognizing our efforts in the rebuttal and for raising your rating to a positive score for our paper. We understand your concerns about the insufficient number of multimodal backdoor attacks targeting CLIP in Tables 1 and 2 of the manuscript. We would like to explain that, due to the time constraints of updating the manuscript during the rebuttal, we have not yet incorporated many experimental results into Tables 1 and 2. Specifically, following your valuable suggestions in the rebuttal, we conducted additional experiments against BadEncoder [2] and TrojVQA [3] and reported the experimental results in our response. We are also experimenting with more target classes and datasets and will incorporate these results in Tables 1 and 2 of the revised manuscript. In addition, we are reproducing the other two backdoor attacks you mentioned, i.e., TrojVLM [4] and mmPoison [5]. Specifically, we are contacting the authors of TrojVLM to request the source code while adapting TrojVLM to attack CLIP ourselves, and we are also adapting mmPoison [5] to a backdoor attack on CLIP. Once these attacks targeting CLIP are in place, our method can be applied to defend against them. Upon completing these experiments, we will update and organize the results in Tables 1 and 2 of the revised manuscript, thereby including a sufficient number of multimodal backdoor attacks targeting CLIP. We believe the updated experimental tables will completely resolve your concerns, and your valuable suggestions have greatly improved our paper.
References:
[1] BadCLIP: dual-embedding guided backdoor attack on multimodal contrastive learning. In CVPR, 2024.
[2] BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning, In SP, 2022.
[3] Dual-Key Multimodal Backdoors for Visual Question Answering, In CVPR, 2022.
[4] TrojVLM: Backdoor Attack Against Vision Language Models, In ECCV, 2024.
[5] Data Poisoning Attacks Against Multimodal Encoders, In ICML, 2023.
Best Regards,
Authors
This paper introduces a new method for detecting backdoors in CLIP models at test time. The main idea is to exploit the fact that backdoored samples remain insensitive to changes in text descriptions. The method generates two types of prompts with LLMs and measures the difference in cosine similarities between images and these prompts to detect backdoors. It is efficient because it does not require parameter updates.
Strengths
- The paper highlights a key gap in current backdoor defenses for multimodal models:
  - While previous methods focus on training or fine-tuning, which require high computational costs, test-time detection hasn't been explored yet. This leads to a new research direction that's both interesting and practical.
- The contrastive prompting mechanism is effective because:
  - It uses the multimodal features of CLIP, addressing a limitation of previous unimodal detection methods that ignore the text modality.
  - The use of benign and malicious prompts provides a natural way to measure semantic sensitivity without modifying the model.
Weaknesses
- The method relies on the assumption that backdoor samples are insensitive to changes in text descriptions while clean samples are sensitive. However:
  - Limited Analysis of Multi-target Attacks: The analysis only covers single-target backdoor attacks. If backdoor triggers cause different outputs based on context (e.g., indoor vs outdoor scenes), the assumption of semantic insensitivity may not hold, exposing a gap between simplified models and real-world cases.
  - Simplified Trigger Representation: The current assumption treats the backdoored image as a clean image stamped with a simple pixel-pattern trigger. Modern backdoor attacks may use semantic-level triggers that naturally align with textual descriptions, yet the paper does not analyze how this alignment impacts the distribution difference used for detection.
- Prompt Quality Variation: There is no mechanism to ensure prompt quality is consistent across different categories or evaluations, and no metrics to assess prompt quality. The effect of prompt variation on detection reliability is also not examined.
- Gaps in Theory: There's no clear definition of what makes a “good” prompt, nor an analysis linking prompt properties to detection performance.
Questions
- Considering the complexity of semantic-level backdoor attacks:
  - Have you considered scenarios with semantically meaningful triggers?
  - How would your method handle triggers that respond naturally to changes in prompt semantics?
- How do you ensure prompt quality consistency across different categories, runs, and LLM implementations?
- Have you considered a framework for assessing prompt quality, looking at:
  - Coverage of category-specific features,
  - Distinctiveness from other categories,
  - Stability across different prompts?
Response to Reviewer 4hWz:
Thank you so much for your insightful comments!
Q1: Limited Analysis of Multi-target Attacks.
A: Thanks for your valuable suggestion. We have conducted more experiments on using BDetCLIP to defend against multi-target attacks. Specifically, to implement the multi-target attack, we poisoned 1,000 (out of 500,000) samples for each target class (i.e., "goldfish", "basketball", and "banana"). We fine-tuned CLIP on the poisoned dataset (the backdoor ratio is 0.3) following the original experimental setting. Then, we used BDetCLIP to detect backdoor samples on the backdoored CLIP. The experimental results are shown in the following table.
Table: The detection performance against Multi-target Attacks.
| BDetCLIP (ours) | SCALE-UP |
|---|---|
| 0.9858 | 0.5404 |
The table shows that our BDetCLIP can still achieve impressive detection performance against the multi-target attack.
Q2: Simplified Trigger Representation.
A: Thank you for your suggestion. To address your concerns, we have conducted more experiments to validate the effectiveness of our method against a semantic trigger. Specifically, we used the image of "Hello Kitty" as the semantic trigger. The experimental results are shown in the following table.
Table: The detection performance against "Hello Kitty".
| BDetCLIP (ours) | SCALE-UP |
|---|---|
| 0.8554 | 0.6111 |
The above table shows that BDetCLIP still achieves good detection performance against the semantic trigger. This observation also indicates that backdoor samples with semantic triggers remain insensitive to the semantic changes of class-specific prompts.
Q3: Handle triggers that respond naturally to changes in prompt semantics.
A: Thank you for your insightful suggestions. To the best of our knowledge, there are currently no proven multimodal backdoor triggers that respond naturally to changes in prompt semantics. This presents a meaningful direction for future research, particularly toward stronger backdoor attacks; however, it is not the primary focus of our paper.
Q4: Prompt Quality Variation.
A: Thank you for your comments! In fact, papers such as [1-4] have shown the feasibility of using large language models to generate descriptive texts, and we have used multiple models (both open-source and closed-source) and experiments across multiple categories to demonstrate the reliability of prompt quality. We believe that exploring a more robust and efficient framework for generating high-quality prompts could be a meaningful direction for future research; however, this lies beyond the focus of the current study. Notably, even with the relatively simple prompt-generation method employed in our paper, we achieved outstanding detection results, providing strong evidence for the effectiveness of our approach.
Q5: The impact of prompt changes on detection reliability and the definition of the "good" prompt.
A: Thanks for your comment! We have already investigated the impact of prompt variation on detection reliability in Section 4.3, "FURTHER ANALYSIS OF CLASS-SPECIFIC PROMPT". Specifically, we studied the effects of prompt length and number. Based on our experimental results, using more class-specific benign prompts improves detection, while longer class-specific malignant prompts degrade it. In practice, our main experimental results show that the prompts generated with GPT-4 already meet our defense requirements.
Q6: Assessing prompt quality.
A: Thank you for your insightful suggestions! The focus of our paper lies in leveraging the differences in semantic sensitivity between backdoor and clean images, which lead to distinct contrastive distributions. The role of the prompt is primarily to generate different semantics. Even with the relatively simple prompt generation method used in our paper, we have achieved excellent detection results, which is strong evidence of the effectiveness of our approach. We believe that designing a more refined prompt framework to evaluate prompt quality could further enhance performance. This direction is worth exploring in future work.
Q7: Gaps in Theory.
A: We have used extensive experimental evidence to demonstrate the quality of prompts generated by our method and their reliability for detection. In Section 4.3, "FURTHER ANALYSIS OF CLASS-SPECIFIC PROMPT", we analyze the relationship between prompt variation and detection performance. In fact, much current work on large models lacks formal interpretability, and the field often places more emphasis on empirical performance than on theoretical guarantees. While papers with theoretical foundations are indeed valuable, the absence of a theoretical derivation is not necessarily a fatal issue. Furthermore, as we mentioned in our response to Q6, the focus of our paper is not on this aspect.
Reference:
[1]: Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts. In ICCV-W 2023.
[2]: What does a platypus look like? Generating customized prompts for zero-shot image classification. In ICCV 2023.
[3]: Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions. In CVPR 2024.
[4]: Visual Classification via Description from Large Language Models. In arXiv 2022.
Dear Reviewer 4hWz,
Thank you for taking the time to provide valuable comments on our paper. We would greatly appreciate it if you could return further feedback on our rebuttal.
Best Regards,
Authors
Thank you for your detailed response that helps address some of my previous concerns about semantic-level analysis. However, I share Reviewer gXdL's critical concern about the evaluation of your method against true multimodal backdoor attacks, especially since your paper claims to provide defense detection specifically for multimodal models like CLIP.
While I appreciate your additional experiments with TrojVQA and BadEncoder as shown in your response, the significant performance gap between these two attacks (0.978 AUROC for TrojVQA vs 0.694 AUROC for BadEncoder) raises new concerns. This large discrepancy requires further explanation - why does your method perform so differently against these two multimodal attacks? This inconsistency makes me question the robustness of your proposed method against general multimodal attacks, particularly since your defense mechanism is specifically designed for multimodal models.
I believe this issue is closely related to my previous concerns about semantic-level considerations. The fact that your method performs well on pixel-level attacks but shows inconsistent results on multimodal attacks suggests there may be fundamental limitations in how your approach handles sophisticated multimodal backdoors that exploit CLIP's cross-modal understanding capabilities.
Could you provide additional analysis to explain:
- Why there is such a large performance gap between TrojVQA and BadEncoder detection?
- How this difference impacts the practical applicability of your method for defending multimodal models?
I believe addressing these questions is crucial for validating your method's effectiveness as a defense mechanism for multimodal models.
Dear Reviewer 4hWz,
Thank you for acknowledging our efforts to solve your concerns about semantic-level backdoor attacks.
We would like to give a gentle reminder that there is a mistake in the numbers you checked. It is noteworthy that our method BDetCLIP achieved AUROCs of 0.891 and 0.978 for BadEncoder and TrojVQA, respectively, which is significantly superior to SCALE-UP (0.694 for BadEncoder and 0.744 for TrojVQA), as shown in our response to Reviewer gXdL.
Table: Our response to Reviewer gXdL (detection performance).
| Attack | SCALE-UP | BDetCLIP (ours) |
|---|---|---|
| BadEncoder | 0.694 | 0.891 |
| TrojVQA | 0.744 | 0.978 |
Therefore, we argue that the detection mechanism used in BDetCLIP generalizes well and is practically applicable for defending against a variety of multimodal backdoor attacks.
Thank you for your response and for clarifying the AUROC scores for BadEncoder and TrojVQA. I apologize for the mistake in my previous comment where I misreported the AUROC for BadEncoder. Given this correction, the performance gap between the two attacks (0.891 for BadEncoder vs. 0.978 for TrojVQA) still raises questions about the consistency and generalizability of your method.
Could you elaborate on the factors contributing to this difference? For instance, do the two attacks differ significantly in their strategies for injecting and triggering backdoors? Are there any specific characteristics of these two attacks that might lead to the observed performance gap? Does this difference reflect any potential limitations in BDetCLIP's ability to handle diverse attack strategies or variations in datasets and model architectures? Providing further explanation on this discrepancy is crucial for demonstrating the robustness of your approach, especially since your defense mechanism is specifically designed for multimodal models like CLIP.
Dear Reviewer 4hWz,
We sincerely thank you for your valuable comments. We would like to explain that the performance difference between BadEncoder and TrojVQA mainly lies in the variation in datasets and models. Specifically, following the original experimental setting in BadEncoder, we used a ResNet-50 backbone to conduct BadEncoder on STL-10 (10 categories), while for TrojVQA, we adapted it to ViT-based CLIP on ImageNet-1K (1,000 categories). We think this variation in datasets and models could be the major reason for the performance difference between the two backdoor attacks. In addition, we would like to emphasize that on the same dataset ImageNet-1K, our method has achieved superior and stable detection performance against different attack strategies. To clarify this point, we show the following experimental results excerpted from the original manuscript.
Table: Detection performance (AUROC) on ImageNet-1K
| BadNet | Blended | BadNet-LC | Blended-LC | BadCLIP-2 | TrojVQA | Avg.(std) |
|---|---|---|---|---|---|---|
| 0.990 | 0.943 | 0.979 | 0.942 | 0.977 | 0.978 | 0.968 ± 0.018 |
The above table shows that our method can achieve superior and stable detection performance (0.968 ± 0.018) against varying backdoor attack strategies on the same dataset (ImageNet-1K).
Besides, we would like to emphasize that on the same backbone ResNet-50, our method has also achieved superior and stable detection performance against different attack strategies. The excerpted experimental results from the original manuscript are shown in the following table.
Table: Detection performance (AUROC) of using ResNet-50
| Model | BadNet | Blended | BadNet-LC | Blended-LC | Avg.(std) |
|---|---|---|---|---|---|
| ResNet-50 | 0.990 | 0.943 | 0.979 | 0.942 | 0.964 ± 0.021 |
The above table shows that on the same backbone, our method can also achieve superior and stable detection performance (0.964 ± 0.021) against varying backdoor attack strategies. Therefore, we argue that our method is robust to diverse attack strategies on different datasets and models.
Best regards,
Authors
This paper proposed BDetCLIP to defend against backdoor attacks on CLIP. Different from current methods, this is the first attempt at a computationally efficient backdoor detection method in the inference stage. The experiments show that BDetCLIP performs well in terms of both effectiveness and efficiency.
Strengths
- This paper proposed an inference-stage method, which does not require access to pre-training or fine-tuning, nor to the training data.
- The method is well-motivated, i.e., the visual representations of backdoored images are insensitive to both benign and malignant changes in class description texts.
- Overall, the ablation studies are well-organized, illustrating the effectiveness of the proposed method.
Weaknesses
- This paper only focused on the classification task, assuming that there is a closed category space. However, VLMs are good at open-set classification tasks.
- The authors introduce a method that helps to choose the threshold in Appendix B. However, the selection method requires a small set of clean validation data. It would be better to include a sensitivity analysis of the threshold to show the robustness of the method.
- The investigation of adaptive attacks is missing. Since the proposed defense is based on the property of backdoor samples, it is necessary to go deeper into adaptive attacks.
Questions
Please see the weakness part.
Response to Reviewer Xrmc:
Thank you so much for your insightful comments!
W1: This paper only focused on the classification task, assuming that there is a closed category space. However, VLMs are good at open-set classification tasks.
A: Thanks for your valuable comments. We have conducted additional experiments to validate the effectiveness of our proposed BDetCLIP on open-set classification tasks. Specifically, we added a subset of Caltech-101 as the open set to ImageNet-1K and set the backdoor ratio to 0.3. The detection results are shown in the following table.
Table: Detection performance on the open-set classification task.
| Backdoor | AUROC |
|---|---|
| BadNet | 0.933 |
| Blended | 0.936 |
| BadNet-LC | 0.929 |
| Blended-LC | 0.991 |
The table shows that our proposed BDetCLIP can also achieve impressive performance on the open-set classification task.
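Since the threshold-free AUROC is the main metric reported in these tables, a minimal sketch of how it can be computed from per-image contrastive scores is given below. Treating backdoored images as the positive class and negating the score is our convention for illustration, not necessarily the exact evaluation script used in the paper.

```python
from sklearn.metrics import roc_auc_score

def detection_auroc(clean_scores, backdoor_scores):
    """clean_scores / backdoor_scores: lists of contrastive-difference scores.
    Backdoored images are the positive class; since they tend to have smaller
    contrastive differences, the negated score is used as the detector output."""
    y_true = [0] * len(clean_scores) + [1] * len(backdoor_scores)
    y_score = [-s for s in list(clean_scores) + list(backdoor_scores)]
    return roc_auc_score(y_true, y_score)
```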
W2: The authors introduce a method that helps to choose the threshold in Appendix B. However, the selection method requires a small set of clean validation data. It would be better to include a sensitivity analysis of the threshold to show the robustness of the method.
A: Thanks for your helpful suggestion. To address your concern, we have conducted more experiments to analyze the sensitivity of the threshold by considering varying numbers of clean validation data. Specifically, we randomly sampled clean examples at three different sampling rates, i.e., 1%, 0.5%, and 0.1%. Then, we computed the contrastive distribution difference in Eq. (6) for all sampled examples, ranked them from largest to smallest, and selected the 85th percentile as the threshold (note that the specific threshold percentile can be adjusted based on real-world defense requirements). The target class is "ant" and the backdoor ratio is 0.3. For each sampling rate, we repeated the above steps ten times and calculated both the mean and the standard deviation. The experimental results are shown in the following tables.
Table: Sampling rate=1%: 500 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8941 ± 0.0107 | 0.9902 ± 0.0013 | 0.8488 ± 0.0127 | 0.9906 ± 0.0003 | 11.7225 ± 1.2723 |
| Blended | 0.8772 ± 0.0061 | 0.9279 ± 0.0142 | 0.8193 ± 0.0151 | 0.9425 ± 0.0003 | 12.0281 ± 1.2399 |
| BadNet-LC | 0.8938 ± 0.0074 | 0.9842 ± 0.0016 | 0.8476 ± 0.0088 | 0.9796 ± 0.0004 | 16.7526 ± 0.8944 |
| Blended-LC | 0.8837 ± 0.0068 | 0.9396 ± 0.0102 | 0.8290 ± 0.0067 | 0.9420 ± 0.0005 | 15.9315 ± 1.2748 |
Table: Sampling rate=0.5%: 250 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8950 ± 0.0160 | 0.9903 ± 0.0013 | 0.8502 ± 0.0189 | 0.9908 ± 0.0003 | 11.6161 ± 1.8596 |
| Blended | 0.8772 ± 0.0109 | 0.9224 ± 0.0211 | 0.8186 ± 0.0096 | 0.9416 ± 0.0003 | 11.7094 ± 1.9545 |
| BadNet-LC | 0.8958 ± 0.0128 | 0.9835 ± 0.0029 | 0.8501 ± 0.0151 | 0.9797 ± 0.0004 | 16.4568 ± 1.5488 |
| Blended-LC | 0.8865 ± 0.0070 | 0.9347 ± 0.0106 | 0.8317 ± 0.0070 | 0.9422 ± 0.0005 | 15.3494 ± 1.3340 |
Table: Sampling rate=0.1%: 50 examples.
| Attack | Accuracy | Recall | F1 | AUROC | Threshold |
|---|---|---|---|---|---|
| BadNet | 0.8775 ± 0.0107 | 0.9904 ± 0.0040 | 0.8312 ± 0.0418 | 0.9905 ± 0.0004 | 13.0927 ± 4.3069 |
| Blended | 0.8564 ± 0.0315 | 0.9391 ± 0.0430 | 0.7987 ± 0.0279 | 0.9424 ± 0.0003 | 14.2042 ± 4.4056 |
| BadNet-LC | 0.8799 ± 0.0453 | 0.9831 ± 0.0082 | 0.8341 ± 0.0488 | 0.9795 ± 0.0005 | 17.6179 ± 4.8725 |
| Blended-LC | 0.8722 ± 0.0269 | 0.9404 ± 0.0363 | 0.8167 ± 0.0524 | 0.9422 ± 0.0004 | 16.9313 ± 4.2465 |
The tables show that our method can achieve impressive detection performance across different sampling rates (and the correspondingly induced thresholds) in terms of all metrics. In particular, even with a very small sampling rate (i.e., 0.1%), our method still maintains good detection performance, which verifies its applicability to real-world applications.
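A minimal sketch of the sampling-based threshold selection described above is given below; the exact indexing convention for the 85th-percentile cut and the flagging rule are our assumptions and can be adjusted to the defense requirement.

```python
import random

def select_threshold(clean_scores, cut=0.85, rate=0.01, seed=0):
    """Sample clean validation scores at the given rate, rank the contrastive
    distribution differences from largest to smallest, and return the value at
    the `cut` position of that ranking as the detection threshold."""
    rng = random.Random(seed)
    k = max(1, int(rate * len(clean_scores)))
    sampled = rng.sample(list(clean_scores), k)
    ranked = sorted(sampled, reverse=True)
    return ranked[min(len(ranked) - 1, int(cut * len(ranked)))]

def is_backdoored(score, threshold):
    # Test-time rule (assumption): images whose contrastive difference falls
    # below the threshold are flagged as backdoored.
    return score < threshold
```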
W3: The investigation of adaptive attacks is missing. Since the proposed defense is based on the property of backdoor samples, it is necessary to go deep into adaptive attacks.
A: Thanks for your helpful suggestion. We assume you may be referring to adaptive attacks based on sample-specific dynamic triggers. To address your concern, we have conducted experiments to defend against one strong adaptive attack, ISSBA [1], which optimizes a sample-specific trigger for each input image via an encoder-decoder network.
Table: Detection performance (ISSBA)
| Detection → | STRIP | SCALE-UP | TeCo | BDetCLIP (Ours) |
|---|---|---|---|---|
| ISSBA | 0.351 | 0.515 | 0.496 | 0.927 |
The table shows that our proposed BDetCLIP can still effectively detect backdoor samples with dynamic triggers.
Besides, you might instead be referring to an adaptive attack specifically designed against BDetCLIP. To address this concern, we would like to explain that BadCLIP [2] is indeed such an adaptive attack, in which the visual trigger closely aligns in the embedding space with the class-specific texts containing many target-class-related attribute words. The defense performance against BadCLIP is shown in Table 6 of our original manuscript. We can see that our proposed BDetCLIP achieves superior performance, especially when combined with CleanCLIP. Therefore, we argue that our method can effectively defend against this adaptive attack.
Reference:
[1] Invisible backdoor attack with sample-specific triggers. In ICCV, 2021.
[2] BadCLIP: dual-embedding guided backdoor attack on multimodal contrastive learning. In CVPR, 2024.
Dear Reviewer Xrmc,
Thank you for taking the time to provide valuable comments on our paper. We would greatly appreciate it if you could return further feedback on our rebuttal.
Best Regards,
Authors
We would like to express our gratitude to all reviewers for providing insightful comments and constructive suggestions. We are glad that they consider our proposed BDetCLIP to be well-motivated (Reviewer Xrmc) and novel (Reviewer 4hWz, Reviewer Dxen). Besides, we are also glad that all reviewers agree that BDetCLIP is effective and efficient. We summarize the key changes to our manuscript as follows:
We have updated our threshold selection method in Appendix B "Threshold Selection" and presented the performance of our detection method across multiple metrics (Accuracy, Recall, F1-score, and AUROC) under different sampling rates to demonstrate its reliability (suggested by Reviewer Xrmc, Reviewer gXdL, and Reviewer Dxen).
We have provided the defense result comparison with CleanCLIP in Appendix E (suggested by Reviewer gXdL).
We have added more details of attacking CLIP in Appendix D (suggested by Reviewer Dxen).
We have conducted an exploratory experiment investigating the relationship between the number of class-specific benign prompts and the threshold, which is detailed under the "The relationship between the number of class-specific benign prompts and threshold" section in Appendix F: MORE EXPERIMENTAL RESULTS (suggested by Reviewer gXdL).
We have included a new section titled "Appendix H: Future Work" to explore the following topics:
- More sophisticated backdoor attacks, such as triggers capable of naturally adapting to changes in prompt semantics (suggested by Reviewer 4hWz).
- Strategies for further enhancing detection effectiveness, including the development of a framework for evaluating prompt quality to further improve the quality of prompts (suggested by Reviewer 4hWz).
We have further expanded the defense scenarios based on the reviewers' feedback:
- We have added our detection results in the open-set classification scenario under the "Backdoor detection for open-set detection" section in Appendix F: MORE EXPERIMENTAL RESULTS (suggested by Reviewer Xrmc).
- We have included our detection results in the scenario where the backdoor trigger has semantic meaning under the "Backdoor detection for semantically meaningful trigger" section in Appendix F: MORE EXPERIMENTAL RESULTS (suggested by Reviewer 4hWz).
- We have appended our detection results for WaNet under the "Backdoor detection for WaNet" section in Appendix F: MORE EXPERIMENTAL RESULTS (suggested by Reviewer Dxen).
- We have appended our detection results for the multi-target attack under the "Backdoor detection for multi-target attack" section in Appendix F: MORE EXPERIMENTAL RESULTS (suggested by Reviewer 4hWz).
Dear Reviewers, ACs, SACs, and PCs,
We sincerely appreciate the valuable time you have invested in reviewing our work!
Since OpenReview does not allow authors to upload new manuscripts, we commit to incorporating the results for BadEncoder [3] and TrojVQA [4] into the revised manuscript. In this way, we believe that our paper finally covers a wide range of backdoor attack scenarios, including BadNet, Blended, BadNet-LC, Blended-LC, WaNet, ISSBA, semantic backdoor triggers, BadCLIP-1 [1], BadCLIP-2 [2], BadEncoder, and TrojVQA, as well as multi-target backdoor attacks.
Thank you again for your valuable feedback and support!
Best regards,
Authors
References:
[1] BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning. In CVPR, 2024.
[2] BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP. In CVPR, 2024.
[3] BadEncoder: Backdoor Attacks to Pre-trained Encoders in Self-Supervised Learning, In SP, 2022.
[4] Dual-Key Multimodal Backdoors for Visual Question Answering, In CVPR, 2022.
This work proposed a test-time detection method against backdoor attacks on the CLIP model, based on the observation that "the visual representations of backdoored images are insensitive to both benign and malignant changes in class description texts".
It received 4 detailed reviews. The strengths mentioned by reviewers mainly include: the new attempt at test-time detection of backdoor attacks on CLIP, the clear observation and innovative methodology design, extensive ablation studies, and good performance.
Meanwhile, there are also several important concerns, mainly including:
1. Adaptive attacks are missing.
2. Sensitivity analysis to hyper-parameters is missing.
3. The analysis of the observation is missing.
4. Limited attacks are evaluated.
5. The comparison with defenses in other stages is missing.
6. The reliance on the quality of the GPT-generated prompts is questionable.
There are sufficient discussions between authors and reviewers. I am happy to see that both the questions and answers are very professional. My judgments about the above points are as follows: Points 1-3 have been addressed to some extent in the rebuttal. Points 4-6 are not well addressed in the rebuttal, which downgrades the importance and practical effectiveness of the proposed method.
Overall, I think that although there are some merits in this work, it should be further improved to satisfy the quality bar of ICLR. Thus, my recommendation is reject.
Additional Comments on Reviewer Discussion
The rebuttal and discussions, as well as their influences in the decision, have been summarized in the above metareview.
Reject