PaperHub
Overall rating: 6.0/10 (Spotlight; 4 reviewers, min 5, max 7, std dev 0.7)
Individual ratings: 6, 6, 5, 7
Confidence: 4.0 | Correctness: 2.5 | Contribution: 3.0 | Presentation: 2.5
NeurIPS 2024

Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense

OpenReview | PDF
Submitted: 2024-05-14 | Updated: 2024-11-06

Abstract

Keywords
Backdoor Safety, Safety Tuning, Superficial Safety

Reviews and Discussion

Review
Rating: 6

This paper studies the post-purification robustness of backdoor defenses. The authors show that backdoors purified by existing defenses can be recovered via the Retuning Attack, and they propose the Query-based Reactivation Attack to recover the backdoor. The authors address this vulnerability by proposing a Path-Aware Minimization defense. Evaluations are performed on four different attacks over CIFAR-10/100 and Tiny-ImageNet across various deep neural networks. Results show that the proposed PAM defense achieves better post-purification robustness.

Strengths

  1. It is novel to study post-purification robustness, and the paper reveals an often-neglected vulnerability of DNNs and limitations of existing defenses.

  2. The paper progressively demonstrates that models can relearn backdoors after purification via the Retuning Attack and, based on this observation, the authors propose a query-based attack that is more practical under real-world threat models.

  3. The finding that inadequate deviation of purified models from the backdoored model along the backdoor-connected path is the root cause of poor post-purification robustness is both instructive and insightful. The authors successfully develop an effective defense upon this finding.

Weaknesses

  1. My primary concern is that the attacks evaluated in the paper are not SOTA. BadNets, Blended, SSBA, and LC are attacks developed years ago. The authors should also provide results on recent attacks such as Sleeper Agent [1] and Adaptive Blended [2].

  2. The authors mention there is a trade-off between post-purification robustness and clean accuracy. However, it is unclear how to determine the hyperparameter $\rho$ to achieve this balance. An algorithm for $\rho$ selection would be necessary.

  3. The presentation needs to be improved. For example, the order of Fig. 2 and Fig. 3 should be interchanged; it is very hard to see the results in Fig. 5; Algorithm 1 should be placed close to where it is described.

  4. Some typos. For example, line 142: "we following" -> "we follow"; in Figure 1, shouldn't P-ASR be R-ASR?

[1]. Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch, NeurIPS, 2022

[2]. Revisiting the Assumption of Latent Separability for Backdoor Defenses, ICLR 2023

Questions

  1. In Table 2, the clean accuracy of all these models is lower than usual. Why is that?

Limitations

Limitations have been addressed in "Conclusions and Limitations".

Author Response

We are grateful to you for your time and effort in reviewing our work, as well as acknowledging our contributions.

Response to Weakness 1

Thanks for your suggestion! As you suggested, we test our method PAM on more backdoor attacks: Adaptive-Patch, Adaptive-Blend, and the All-to-All attack, which insert multiple backdoors into models [1]. The results are shown in the Table of the Global Response. We observe that our method still achieves satisfactory post-purification robustness against them. We also attempted SleeperAgent but found unstable attack performance and low ASR with the public implementations [2, 3].

In this work, we aim to figure out whether achieving a low ASR through current purification methods truly signifies the complete removal of inserted backdoor features. To investigate this phenomenon, we select several classic and practical attack methods in the field of backdoor learning. We demonstrate that even under these well-explored attack paradigms, current state-of-the-art defenses still suffer from our Retuning Attack and thus fail to achieve post-purification robustness. We greatly appreciate the reviewer's suggestions and will add these experiments in the revised version.

Response to Weakness 2

Thanks for your instructive comments. We showcase the model performance across various $\rho$ values in the Table below. As $\rho$ rises, there is a slight decrease in clean accuracy alongside a significant enhancement in robustness against RA. Additionally, we observe that the performance is not sensitive to $\rho$ once it is larger than 0.3. Considering that in practice we only observe C-Acc (on a validation set) and need to achieve a good trade-off between the two metrics, we follow FST [4] and choose $\rho$ to ensure that C-Acc does not fall below a predefined threshold such as 92%.

We demonstrate the performance of PAM with diverse $\rho$ values, evaluating the Blended attack on CIFAR-10 with ResNet-18. The O-Robustness metric represents the purification performance of the defense method, and the P-Robustness metric denotes the post-purification robustness after applying RA.

| Evaluation Mode | $\rho=0.1$ (C-Acc/ASR) | $\rho=0.3$ (C-Acc/ASR) | $\rho=0.5$ (C-Acc/ASR) | $\rho=0.7$ (C-Acc/ASR) | $\rho=0.9$ (C-Acc/ASR) |
| --- | --- | --- | --- | --- | --- |
| O-Robustness | 94.03/6.33 | 93.64/2.07 | 93.34/1.67 | 92.12/0.50 | 91.99/1.00 |
| P-Robustness | 93.60/33.29 | 93.61/10.06 | 93.38/2.69 | 92.17/2.62 | 92.54/0.30 |
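
For concreteness, below is a minimal sketch of the threshold-based selection described above. The function names, the candidate $\rho$ grid, and the 92% threshold are illustrative assumptions rather than the exact procedure in the paper.

```python
from typing import Any, Callable, Sequence, Tuple

def select_rho(
    purify: Callable[[float], Any],        # rho -> purified model (e.g., one PAM run)
    clean_acc: Callable[[Any], float],     # purified model -> C-Acc on a held-out validation set
    candidates: Sequence[float] = (0.9, 0.7, 0.5, 0.3, 0.1),
    acc_threshold: float = 92.0,
) -> Tuple[float, Any]:
    """Return the largest rho whose purified model keeps C-Acc above the threshold."""
    for rho in candidates:                 # scan from the most robust (largest rho) downwards
        model = purify(rho)
        if clean_acc(model) >= acc_threshold:
            return rho, model
    # Fall back to the smallest candidate if no rho meets the threshold.
    return candidates[-1], purify(candidates[-1])
```

Since robustness against RA is largely insensitive to $\rho$ above 0.3, scanning from the largest candidate downwards keeps the selected model on the robust side of the trade-off.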

Response to Weaknesses 3 and 4

We greatly appreciate the reviewer's invaluable suggestions and thank them for pointing out our typos. We are committed to enhancing the presentation and will follow your suggestions to adjust the layout and colors, and to rectify the typos, in the revised version.

Response to Question 1

Thanks for your question. In our work, we adopt the ResNet-18 model (the checkpoint pretrained on ImageNet from Torchvision) for CIFAR-100 and TinyImageNet. After backdoor poisoning, the clean accuracies of the backdoored models (averaged over attacks) are 78.8% and 73.1%, respectively. These numbers are higher than those from the BackdoorBench paper [1], where the clean accuracies are 70.51% and 57.28%, respectively. The results in our paper are also aligned with those from Papers with Code, where the accuracy on CIFAR-100 is 75%, from ResNet-164 [1] (without using any tricks), and the accuracy on TinyImageNet is 74%, from ResNeXt-50 [1] (using advanced augmentations).

After conducting purification, there are slight decreases in clean accuracy. The clean accuracies of PAM are 75.56% and 68.23% on CIFAR-100 and TinyImageNet, which are comparable to the original BTI and outperform FST (as exhibited in Appendix Tables 8 and 9). We reorganize the results from Tables 2, 8, and 9 into the Tables below for convenient comparison. We observe that our PAM achieves a better trade-off between clean accuracy and backdoor robustness. We will present a better-organized version of the results in our revision.

The slight drops across all purification methods may be attributed to our practice of using small clean datasets to adjust backdoored models, a limitation potentially mitigated by utilizing augmentations or distillation with backdoored models as teachers. We leave this for future exploration.

CIFAR-100

| Evaluation Mode | Clean (C-Acc/ASR) | EP (C-Acc/ASR) | SAM (C-Acc/ASR) | FST (C-Acc/ASR) | BTI (C-Acc/ASR) | PAM (C-Acc/ASR) |
| --- | --- | --- | --- | --- | --- | --- |
| O-Backdoor | 78.83/97.30 | 78.83/97.30 | 78.83/97.30 | 78.83/97.30 | 78.83/97.30 | 78.83/97.30 |
| O-Robustness | 79.70/0.04 | 76.78/0.04 | 76.38/2.20 | 72.99/0.73 | 75.61/2.68 | 75.56/0.21 |
| P-Robustness | 78.75/0.85 | 76.37/0.50 | 76.43/90.52 | 72.42/84.54 | 75.69/58.61 | 75.53/0.95 |

Tiny-ImageNet

| Evaluation Mode | Clean (C-Acc/ASR) | EP (C-Acc/ASR) | SAM (C-Acc/ASR) | FST (C-Acc/ASR) | BTI (C-Acc/ASR) | PAM (C-Acc/ASR) |
| --- | --- | --- | --- | --- | --- | --- |
| O-Backdoor | 73.10/98.68 | 73.10/98.68 | 73.10/98.68 | 73.10/98.68 | 73.10/98.68 | 73.10/98.68 |
| O-Robustness | 73.88/0.07 | 70.75/0.01 | 70.54/7.05 | 65.92/1.71 | 68.70/0.73 | 68.23/1.56 |
| P-Robustness | 72.86/1.36 | 70.50/0.88 | 70.94/87.30 | 65.22/77.06 | 68.43/61.22 | 67.74/7.48 |

[1]. BackdoorBench: A Comprehensive Benchmark of Backdoor Learning, NeurIPS 2022

[2]. https://github.com/vtu81/backdoor-toolbox

[3]. https://github.com/hsouri/Sleeper-Agent

[4]. Towards Stable Backdoor Purification through Feature Shift Tuning, NeurIPS 2023.

Comment

We sincerely appreciate your constructive feedback throughout the review process and will incorporate your suggestions as we revise the paper. We are delighted that our responses have addressed your concerns.
Thanks for your recognition of our work!

The Authors.

Review
Rating: 6

This paper investigates the effectiveness of current purification-based backdoor defenses and tries to uncover whether purified DNNs are truly free from backdoor vulnerabilities. The authors identify the “post-purification robustness” of DNNs and propose Retuning Attack (RA) and Query-based Reactivation Attack (QRA) respectively to assess the susceptibility of purified DNNs to backdoor reactivation. Additionally, the paper proposes Path-Aware Minimization (PAM) to improve post-purification robustness.

Strengths

The work makes a contribution to backdoor robustness evaluation by shifting the focus from merely achieving a low ASR to evaluating the post-purification robustness of backdoor defenses. The paper introduces a pipeline of methods (RA, QRA, and PAM) for assessing this vulnerability.

Weaknesses

The proposed notion of post-purification robustness relies on the assumption that the purified DNN will undergo further fine-tuning. As such, the entire proposed workflow is not practical in a real-world scenario where the purified model is kept frozen with no further updates.

Questions

Can the authors further justify the proposed threat model’s real-world implications? How can the proposed workflow be enabled in a real-world attack setting?

Limitations

The paper falls short in addressing practical implementation challenges. These limitations suggest that the proposed methods might not be as universally practical as claimed.

Author Response

We are grateful to you for your time and effort in reviewing our work, as well as acknowledging our contributions.

Response to Weakness and Questions:

Thanks for your kind question! Due to the space limitation, we only briefly discuss the practical implications for real-world settings in Lines 109-116 of Section 3. We further explain the threat model as follows; we also mention this in the Global Response.

Large Models such as CLIP, ChatGPT, LLaMa, and Stable Diffusion have become essential bases supporting a wide variety of AI applications. These Large Models (after completing safety tuning) provide powerful pre-trained capabilities that can be fine-tuned for a wide range of specific use cases. In practice, further customization of these models via fine-tuning is often desirable to tailor their performance for particular applications [1]. For open-source models like the LLaMa series and Stable Diffusion, the model providers explicitly encourage further fine-tuning to specialize the models' capabilities for specific applications [2]. For closed-source models like GPT-4 and Claude 3, providers offer APIs that allow users to upload their specific datasets and fine-tune these models accordingly [3,4].

Meanwhile, the pretraining datasets for Large Models have grown to web-scale datasets with billions of samples crawled from the internet [5,6]. At this scale, it is infeasible to manually curate each example, which leaves a viable opportunity for attackers to launch actual poisoning attacks [7].

To ensure the safety of models in practical use, numerous methods have been proposed to purify models before releasing them [8,9,10]. Despite substantial efforts in this area, it remains unclear whether, even if a purified model's initial safety performance is impeccable (nearly 0% ASR), this robustness is preserved after further fine-tuning with possibly poisoned data. Similar to works [11,12] that studied safety alignment against fine-tuning on harmful prompts, we take the first initial attempt to consider this practical issue for backdoor poisoning threats.

Our study finds that backdoors can be very easily reactivated by further tuning on only an extremely small number of poisoned samples (e.g., 1 sample for the Blended attack). This reveals that current purification methods cannot truly eliminate the backdoor features learned during pretraining, which undoubtedly poses a greater threat to the real world. Attackers could potentially bypass the "firewalls" established by safety purification. Leveraging the stronger capabilities of foundation models, they could then generate more threatening and disruptive content (more realistic counterfeit images [13] and more potent malicious code [14]), further harming others' productive activities.

We will add this detailed explanation as a separate section in the revised version. We hope our response could address your concerns.


[1]. https://llama.meta.com/responsible-use-guide/

[2]. https://llama.meta.com/docs/how-to-guides/fine-tuning/

[3]. https://platform.openai.com/docs/guides/fine-tuning

[4]. https://www.anthropic.com/news/fine-tune-claude-3-haiku

[5]. Laion-5b: An open large-scale dataset for training next generation image-text models, NeurIPS 2022.

[6]. Exploring the limits of transfer learning with a unified text-to-text transformer, JMLR 2020.

[7]. Poisoning Web-Scale Training Datasets is Practical, IEEE S&P 2024.

[8]. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arxiv 2022.

[9]. Backdoor Learning: A Survey, arxiv 2022.

[10]. https://openai.com/index/openai-safety-update/

[11]. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! ICLR 2024

[12]. Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models, arxiv 2023.

[13]. How to Backdoor Diffusion Models?, CVPR 2023.

[14]. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arxiv 2024.

Comment

I'd like to thank the authors for their detailed responses. However, the mention of large-scale datasets and models in this response does not align with the evaluations in the paper (such as using CIFAR and ResNets). I suggest the authors further consider the real-world implications of the proposed notion and include discussions in the paper. I maintain my original rating of "Weak Accept".

Comment

We sincerely value your supportive feedback during the review process along with the acknowledgement of our work.

Due to the substantial computational costs associated with large-scale experiments, we initially examine and verify this crucial issue, post-purification robustness, using smaller datasets and models in this study. We will follow your suggestion and incorporate our discussions into both a separate section and the Limitations section of the revised version. Following your suggestions, we will also be dedicated to exploring our proposed method for backdoor safety issues on LLMs [1,2,3].

Our work pioneers the idea that an attacker could bypass existing safety purifications merely by fine-tuning purified models with an extremely small number of poisoned samples. As pointed out by the Reviewer, it brings a new angle for evaluating backdoor robustness, namely post-purification robustness, instead of solely depending on ASR.

This is vital since more and more models are made available for users' further fine-tuning after undergoing safety tuning. Our research initially demonstrates this feasibility with small-scale datasets and models. This potential vulnerability emphasizes the necessity for more faithful evaluations and stable defense methods against backdoor threats, to ultimately develop safer and more reliable systems.

We once again extend our heartfelt thanks to the Reviewer. We are delighted to discuss this with you and sincerely appreciate that these comments have helped us clarify our contribution more clearly and improve the quality of our paper.

The Authors.


[1]. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! ICLR 2024

[2]. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arxiv 2024.

[3]. Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs, arxiv 2024.

Review
Rating: 5

This paper reveals a phenomenon in backdoor defense: the purified backdoor can be reactivated by fast retuning on a few backdoor samples. Building upon this observation, the paper explores both attacks and defense measures for more reliable backdoor research. On the attack side, a Retuning Attack (RA) is proposed and generalized to the black-box setting as a Query-based Reactivation Attack (QRA). On the defense side, a Path-Aware Minimization (PAM) method is proposed to force more deviation from the backdoor-connected path. Experiments verify the effectiveness of PAM compared to exact purification (EP).

优点

  1. The observed phenomenon that the purified model can more easily restore the trigger is interesting.

  2. Both attacks and defenses were explored.

  3. The defense results look promising.

Weaknesses

  1. The restoration of the backdoor was conducted on a poisoned subset, which is not a surprise in this case, as the model will surely relearn the backdoor. The authors should prove the phenomenon via tuning on purely clean training data.

  2. The threat models of the two proposed attacks are problematic. 1) RA requires post-purification poisoning of the defended model, which means the attacker can poison again after the defense; this does not make sense to me. 2) The QRA attack is very similar to an adversarial attack that requires access to both the purified model and the RA model; why the defender would expose these models (or their APIs) to the attacker is questionable.

  3. The authors should clearly define what Post-purification Robustness is. In the current version, it appears the same as standard backdoor robustness; i.e., how can one guarantee the PAM-purified model is 100% robust?

  4. The authors seem to confuse a backdoor attack with adversarial perturbation in proposing the QRA attack. A backdoor attack does not make changes to the input during the inference stage; otherwise it becomes an adversarial attack. These two attack types assume different capabilities and flexibility of the adversary.

  5. The proposed PAM is very much like using a moving average to force larger updates of the weights. PAM requires reverse-engineered backdoor samples $\mathcal{D}_r$; I wonder if one could simply use unlearning to achieve the same effect.

  6. The proposed method was not compared with existing defense methods.

Questions

  1. Can PAM deal with multi-trigger attacks [1]?

  2. How does PAM work, compared to applying the same existing defense twice, e.g., ANP or SAM?

  3. Can the problem be addressed by simply adjusting the hyperparameters of the Mode Connectivity defense [2]?

[1] Li, Yige, et al. "Multi-Trigger Backdoor Attacks: More Triggers, More Threats." arXiv preprint arXiv:2401.15295 (2024).

[2] Zhao, Pu, et al. "Bridging mode connectivity in loss landscapes and adversarial robustness." arXiv preprint arXiv:2005.00060 (2020).

Limitations

  1. Problematic threat model.
  2. Limited technical novelty.
  3. Missing systematic comparison with existing methods.
Author Response

Thanks for your time and effort in reviewing our work!

Response to W1:

  1. For "surely relearn the backdoor":
    Sorry for the possible confusion. First, we emphasize that the reviewer's statement "...model will surely relearn the backdoor" is not correct. As emphasized in Lines 169-173 of Section 3.2 and Figure 1, exact purification (EP) does not relearn backdoors and maintains a low ASR after RA.
  2. For "tuning on purely clean training data":
    Thanks for this interesting question. Simply tuning on clean data cannot relearn backdoors. Instead, it is widely used as a baseline defense [1,2]. Additionally, it would be interesting to explore whether fine-tuning on carefully selected clean data, based on certain metrics, may lead to relearning backdoors. We will explore it in the future.
  3. We also discuss why our work is surprising in the Global Response.

Response to W2:

  1. For 1): Please refer to the practical implications of our threat models in Global Response.
  2. For "QRA attack is very similar to an adversarial attack" in 2):
    Sorry for the possible confusion. In the main submission, we specifically emphasize the differences between QRA and adversarial perturbations (ADV) in Lines 212-215 and derive the final objective of QRA, Eq. 2. As shown in Figure 3, QRA from Eq. 2 only works when added to backdoored examples and applied to purified models. Applying QRA to clean images with purified models, or attacking clean models, will not succeed. In contrast, ADV works in all of these attack scenarios. This suggests our QRA is different from ADV. We also discuss their difference in the response to Weakness 4.
  3. For "that requires access both the..." in 2):
    As mentioned in the response to 1), nowadays, after safety tuning, model providers open-source their models or release APIs to enable further fine-tuning for specific usage. We also demonstrate that QRA can successfully transfer across unknown purification methods in Figure 2. This does not require any queries to the targeted models, which highlights the practicality of QRA.

Response to W3:

  1. For "it appears the same as standard backdoor robustness":
    As we've discussed throughout the article, post-purification robustness (P-) has a significant difference from standard backdoor robustness (S-). S- directly tests purified models’ ASR on a backdoored testing set. Contrastively, P- evaluates purified models’ ASR against RA used in the paper. Although they both evaluate the ASR, P- takes a further step than S-. It reveals that the current defense with nearly 0 ASR cannot truly eliminate backdoor features, which quickly regain ASR after RA (Sec.3). This emphasizes instead of solely depending on S-, we need more faithful and comprehensive evaluations to ensure lasting protection against backdoor attacks.

  2. For "guarantee the PAM purified model is 100% robust?":
    In practice, all current defense methods cannot guarantee they are 100% robust since they cannot achieve 0 ASR [2,4,5] empirically. PAM is also an empirical defense and doesn’t offer certified guarantees.
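
To make the distinction explicit, the snippet below contrasts the two evaluations in schematic form; `attack_success_rate` and `retuning_attack` are placeholders for the corresponding routines, not the exact implementation used in the paper.

```python
from typing import Any, Callable, Dict

def robustness_report(
    purified_model: Any,
    attack_success_rate: Callable[[Any], float],  # model -> ASR on a triggered test set
    retuning_attack: Callable[[Any], Any],        # model -> model retuned on a few poisoned samples
) -> Dict[str, float]:
    """Contrast standard robustness (S-) with post-purification robustness (P-)."""
    standard_asr = attack_success_rate(purified_model)                # what defenses usually report
    post_asr = attack_success_rate(retuning_attack(purified_model))   # ASR after the Retuning Attack
    return {"standard_ASR": standard_asr, "post_purification_ASR": post_asr}
```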

Response to W4:

Sorry for the possible confusion. First, our QRA is different from ADV and is specific to the inserted backdoor. We believe the reviewer's statement about the difference between backdoor and ADV is not accurate. For a successful backdoor attack, it is necessary to activate a pre-embedded backdoor to conduct the attack [3]. In contrast, ADV does not need an embedded backdoor and can mislead any model. The difference is not about modifying samples during inference, but about whether the perturbation refers to a previously inserted backdoor [3].

Response to W5:

Sorry for the possible confusion. We have conducted unlearning with reversed backdoor samples in our paper: all numbers referencing "BTI" pertain to this unlearning. We follow the source code to implement it (BTI-U in the original paper) and give a detailed description in Lines 522-525 of the Appendix. We observe that BTI cannot achieve the same effect as PAM. We will clarify this point in the revised version.
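
For reference, the sketch below illustrates what we mean by unlearning with reversed samples: fine-tuning the model so that inputs stamped with the reversed trigger are mapped back to their ground-truth labels. It is a simplified illustration under these assumptions, not the exact BTI-U implementation we used.

```python
import torch
from torch.utils.data import DataLoader, Dataset

def unlearn_with_reversed_samples(model: torch.nn.Module, reversed_set: Dataset,
                                  epochs: int = 5, lr: float = 1e-3,
                                  device: str = "cuda") -> torch.nn.Module:
    """Simplified unlearning baseline: relabel reversed-trigger inputs to their true classes.
    Hyperparameters and structure are illustrative, not the exact BTI-U settings."""
    model = model.to(device).train()
    loader = DataLoader(reversed_set, batch_size=64, shuffle=True)  # (triggered x, true y) pairs
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x_triggered, y_true in loader:
            x_triggered, y_true = x_triggered.to(device), y_true.to(device)
            optimizer.zero_grad()
            # Penalize the trigger -> target-label shortcut by enforcing the true label.
            loss = criterion(model(x_triggered), y_true)
            loss.backward()
            optimizer.step()
    return model
```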

Response to W6 and Q2:

  1. Response to W6: We think the reviewer may have missed our main experimental results in Figures 1-4. We have conducted a thorough comparison between PAM and existing defense methods and observed that PAM outperforms the others in terms of post-purification robustness.
  2. Response to Q2: We think the reviewer may have misunderstood the process of PAM. Like existing defense methods, PAM is applied once to backdoored models. We do not fully understand Question 2, since there appears to be no difference between applying a defense once or twice. Inspired by the suggestion, we have added another method for comparison: after obtaining the reversed samples, we adopt the SAM method to purify models on them. We observe that this cannot maintain consistent robustness against RA, regaining over 75% ASR on average on CIFAR-10.

Response to Q1:

Thanks for this interesting question. Since we cannot find the source code of the referenced work, we utilize the All-to-All attack [4], which also inserts multiple backdoors into models, and present PAM's performance in the Table of the Global Response. PAM still achieves good post-purification robustness.

Response to Q3:

Thanks for this interesting question. The MCR method cannot achieve the same effect as PAM against RA. As shown in Figure 2 of MCR, we observe that to maintain good clean accuracy, the defender must select low-ASR solutions very close to the backdoored model, which are not robust to RA (see our Section 4.1).

[1]. Fine-Pruning: Defending Against Backdooring Attacks on Deep Neural Networks, arxiv 2018.

[2]. Towards Stable Backdoor Purification through Feature Shift Tuning, NeurIPS 2023.

[3]. Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses, arxiv 2021.

[4]. Backdoorbench: A comprehensive benchmark of backdoor learning, NeurIPS 2022.

[5]. Reconstructive neuron pruning for backdoor defense, ICML 2023.

Comment

I want to thank the authors for the rebuttal. It has addressed most of my concerns. I have increased my rating accordingly.

Comment

We appreciate your further comment and recognition of our responses. We are delighted to have addressed your concerns.

The Authors

Review
Rating: 7

Backdoor attacks are a major threat to Deep Neural Networks (DNNs), as they allow attackers to manipulate model predictions with backdoor triggers. Existing purification methods reduce the Attack Success Rate (ASR) of these models, but it's unclear if they fully eliminate backdoor threats. This study investigates post-purification robustness by employing the Retuning Attack (RA) and finds that current methods are vulnerable, as models quickly relearn backdoor behaviors. To address this, the study proposes the Query-based Reactivation Attack (QRA) and a novel Path-Aware Minimization (PAM) technique. PAM enhances robustness by promoting deviation along backdoor-connected paths with extra model updates. Extensive experiments show PAM significantly improves robustness, maintaining low ASR and good accuracy, providing a new perspective on evaluating and improving backdoor defenses.

Strengths

  1. The paper empirically verifies that poisoned nodes typically exhibit large prediction variance under edge dropping, providing an indicator for identifying poisoned nodes.
  2. The proposed robust training strategy not only has theoretical guarantees but also shows practical effectiveness in defending against various types of backdoor attacks, maintaining clean accuracy while reducing the attack success rate.

Weaknesses

  1. While the paper focuses on various backdoor attack types, the scope of attack types and defense mechanisms explored could be broadened to cover more diverse scenarios and settings.

Questions

This is a good paper discussing the details of the proposed defense method. One thing I am curious about is whether the authors have plans for designing an adaptive attack against this defense method, and how it could be achieved.

Limitations

The limitations have been discussed in the paper.

Author Response

We are grateful to you for your time and effort in reviewing our work, as well as acknowledging our contributions.

Response to Weakness 1:

Thanks for your suggestion! Following your suggestion, we add evaluations on more poisoning attacks.

For more attacks: We test our PAM on more backdoor attacks, including Adaptive-Patch [1], Adaptive-Blend [1], and the All-to-All attack [2]. Our experiments are shown in the Table of the Global Response. The results show that our method still achieves satisfactory post-purification robustness against them, yielding only 3.75% ASR after the Retuning Attack.

For more scenarios: We plan to apply our method to backdoor safety issues on LLMs [4,5,6]. We could first adopt current backdoor-reversing methods to reverse inserted trigger prompts [6], and then utilize our PAM method with the reversed trigger prompts to purify backdoored LLMs. Future work will be dedicated to these endeavors.

Response to Question 1:

Thanks for your interesting question. We have also considered designing possible adaptive attacks against our PAM method. PAM consists of two parts: 1) obtaining gradients of the interpolated model for further updates; and 2) data specification for the backdoor-connected path (a simplified sketch of one PAM step is given after the list below). Next, we discuss possible adaptive strategies against these two components, respectively.

  1. Making the loss landscape around the poisoned solution smoother: PAM needs the gradients of the interpolated model for further updates, in order to obtain a solution deviated from the backdoored model. If attackers could control the training process and position the backdoored model at a smoothed local minimum (possibly using methods like SAM), they might impede the post-purification performance of PAM. We tried this possible adaptive attack and found that our PAM still performs robustly against it. We suspect that the small radius adopted by SAM limits its smoothing performance; however, enlarging the radius would also significantly sacrifice clean accuracy. We will explore other possible methods in the future. Nonetheless, such advanced attacker capabilities, which include controlling training procedures, exceed our work's scope and do not align with our experimental setting, since we mainly focus on practical data-poisoning attacks.
  2. Making reversing backdoor samples harder: PAM needs to reverse backdoored samples to specify the backdoor-connected path. If an attack could make reversing backdoor samples fail, it might be able to defeat the PAM defense. In our work, we adopt the most advanced backdoor-reversing method, BTI [3], which can effectively handle current data-poisoning attack methods.
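
As promised above, here is a simplified single-step sketch of the two components (interpolation toward the backdoored weights along the path specified by reversed samples, with gradients taken at the interpolated point). The function and loader names, the use of a single batch, and the plain SGD update are illustrative assumptions; the actual procedure is given in Algorithm 1 of the paper.

```python
import copy
import torch

def pam_step(model, backdoored_model, reversed_loader, clean_loader,
             rho=0.5, lr=1e-2, device="cuda"):
    """One simplified PAM-style update: compute gradients at a model interpolated toward
    the backdoored weights, then apply them to the current weights. Illustrative only."""
    model, backdoored_model = model.to(device), backdoored_model.to(device)
    criterion = torch.nn.CrossEntropyLoss()

    # 1) Build the interpolated model on the path between current and backdoored weights
    #    (both models are assumed to share the same architecture).
    interp = copy.deepcopy(model)
    with torch.no_grad():
        for p_i, p, p_bd in zip(interp.parameters(), model.parameters(),
                                backdoored_model.parameters()):
            p_i.copy_((1.0 - rho) * p + rho * p_bd)

    # 2) Compute gradients of the interpolated model on reversed backdoor samples
    #    (relabeled to their ground-truth classes) plus clean data.
    interp.train()
    x_r, y_r = next(iter(reversed_loader))
    x_c, y_c = next(iter(clean_loader))
    loss = criterion(interp(x_r.to(device)), y_r.to(device)) + \
           criterion(interp(x_c.to(device)), y_c.to(device))
    grads = torch.autograd.grad(loss, list(interp.parameters()))

    # 3) Update the current weights with the interpolated model's gradients,
    #    pushing the solution further away from the backdoored model along the path.
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p.add_(g, alpha=-lr)
    return model
```

The sketch only conveys how gradients taken on the path between the current and backdoored weights can push the purified solution further from the backdoored model; batch-norm statistics, momentum, and the exact loss composition are omitted.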

[1]. Revisiting the Assumption of Latent Separability for Backdoor Defenses, ICLR 2023.

[2]. Backdoorbench: A comprehensive benchmark of backdoor learning, NeurIPS 2022.

[3]. Towards reliable and efficient backdoor trigger inversion via decoupling benign features, ICLR 2024.

[4]. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arxiv 2024.

[5]. Universal jailbreak backdoors from poisoned human feedback, ICLR 2024.

[6]. Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs, arxiv 2024.

Comment

Dear Reviewer YFJE,

We wish to express our gratitude for your dedicated time and insightful comments. We are awaiting your valuable feedback and insights regarding the points we addressed in the rebuttal. Ensuring your satisfaction with our rebuttal is of utmost importance to us. Your response is very helpful in further improving the quality of our work.

Sincerely,

Authors

Comment

Thank you for the reply. It resolved my concerns. I will keep my rating positive.

Comment

We sincerely appreciate your invaluable feedback throughout the review process and will incorporate your suggestions as we revise the paper. We are delighted that our responses have resolved your concerns.
Thanks for your support and recognition for our work!

The Authors.

Author Response (Global Response)

We would like to express our sincere gratitude to all reviewers for their time and efforts in reviewing our work. We will carefully revise our manuscript by adding suggested experiments, more detailed explanations, and fixing the typos.

Here, we provide a global response to questions from reviewers about the practical implications of our threat models and our work's contribution. We also provide evaluation results on more attacks.

The Practical Implications of our threat models for real-world settings:

Large Models such as CLIP, ChatGPT, LLaMa, and Stable Diffusion have become essential bases supporting a wide variety of AI applications. These Large Models (after completing safety tuning) provide powerful pre-trained capabilities that can be fine-tuned for a wide range of specific use cases. In practice, further customization of these models via fine-tuning is often desirable to tailor their performance for particular applications [1]. For open-source models like the LLaMa series and Stable Diffusion, the model providers explicitly encourage further fine-tuning to specialize the models' capabilities for specific applications [2]. For closed-source models like GPT-4 and Claude 3, providers offer APIs that allow users to upload their specific datasets and fine-tune these models accordingly [3].

Meanwhile, the pretraining datasets for Large Models have grown to web-scale datasets with billions of samples crawled from the internet [4]. At this scale, it is infeasible to manually curate each example, which leaves a viable opportunity for attackers to launch actual poisoning attacks [5].

To ensure the safety of models in practical use, numerous methods have been proposed to purify models before releasing them [6,7]. Despite substantial efforts in this area, it remains unclear whether, even if a purified model's initial safety performance is impeccable (nearly 0% ASR), this robustness is preserved after further fine-tuning with possibly poisoned data.

Similar to works [8] that studied safety tuning against fine-tuning on harmful prompts, we take the first initial attempt to consider this practical issue for backdoor poisoning threats. Our study finds that backdoors can be very easily reactivated by further tuning on an extremely small number of poisoned samples (e.g., 1 sample for the Blended attack). This reveals that current purification methods cannot truly eliminate backdoor features learned during pretraining, which undoubtedly poses a greater threat to the real world. Attackers could potentially bypass the "firewalls" established by safety purification. Leveraging the stronger capabilities of foundation models, they could then generate more threatening and disruptive content (more realistic counterfeit images [9] and more potent malicious code [10]), further harming others' productive activities.
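
To illustrate how small the attacker's effort is, below is a minimal PyTorch-style sketch of the retuning step: fine-tuning a purified classifier on a handful of poisoned samples and checking whether the ASR rebounds. The hyperparameters and helper names are illustrative assumptions rather than the exact settings used in our experiments.

```python
import torch
from torch.utils.data import DataLoader, Dataset

def retuning_attack(purified_model: torch.nn.Module, poisoned_subset: Dataset,
                    epochs: int = 5, lr: float = 1e-3,
                    device: str = "cuda") -> torch.nn.Module:
    """Fine-tune a purified model on a tiny poisoned subset (the Retuning Attack idea).
    `poisoned_subset` holds (triggered image, attacker target label) pairs; settings are illustrative."""
    model = purified_model.to(device).train()
    loader = DataLoader(poisoned_subset, batch_size=16, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss()

    for _ in range(epochs):
        for x, y_target in loader:
            x, y_target = x.to(device), y_target.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y_target)  # re-associate the trigger with the target class
            loss.backward()
            optimizer.step()
    return model  # if purification was superficial, the ASR quickly rebounds
```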

We want to re-emphasize why our findings are surprising and why our contributions are important:
Our work first proposes a new perspective for evaluating the effectiveness of backdoor defense methods. Rather than simply focusing on the ASR, we investigate post-purification robustness via RA (Sec. 3.2) and the more practical QRA (Sec. 3.3). We find that models purified with current defense methods still retain backdoor features that can be very easily reactivated. Our findings underscore the necessity for more faithful and comprehensive evaluations to ensure lasting protection against backdoor threats.

We observe that EP (which assumes knowledge of the triggers and true labels) does not relearn the backdoor, maintaining low ASR after RA. This validates the possibility of achieving post-purification robustness. The contrast also emphasizes that current purification methods are surprisingly weak and superficial. These insights guide us to analyze what leads to the robustness of EP (Sec. 4.1), which further leads us to propose our PAM defense technique (Sec. 4.2). Notably, PAM significantly improves robustness against RA.

Additional evaluations against more attacks:

Following the suggestions from Reviewer YFJE, Reviewer JZci, and Reviewer jgTc, we have expanded our experiments on additional attack types, including the Adaptive-BadNet, Adaptive-Blend, and All-to-All attacks, as presented in the Table below. The results show that our PAM still achieves good post-purification robustness against them.

Table: Experiments are conducted on CIFAR-10 with ResNet-18. O-Backdoor indicates the original performance of the backdoor attacks, the O-Robustness metric represents the purification performance of the defense method, and the P-Robustness metric denotes the post-purification robustness after applying RA.

| Evaluation Mode | Adaptive-BadNet (C-Acc/ASR) | Adaptive-Blend (C-Acc/ASR) | BadNet-All2All (C-Acc/ASR) | Blended-All2All (C-Acc/ASR) |
| --- | --- | --- | --- | --- |
| O-Backdoor | 94.54/86.83 | 94.70/94.91 | 94.25/90.21 | 94.65/77.73 |
| O-Robustness (BTI) | 92.97/1.40 | 91.87/4.52 | 92.29/1.43 | 93.16/4.11 |
| P-Robustness (BTI) | 93.13/56.82 | 92.08/45.79 | 92.41/88.41 | 93.41/58.84 |
| O-Robustness (PAM) | 92.05/0.65 | 92.08/0.14 | 93.18/0.70 | 92.52/3.61 |
| P-Robustness (PAM) | 91.83/0.53 | 92.51/4.73 | 92.51/0.90 | 92.26/6.43 |

[1]. https://llama.meta.com/responsible-use-guide/

[2]. https://llama.meta.com/docs/how-to-guides/fine-tuning/

[3]. https://platform.openai.com/docs/guides/fine-tuning

[4]. Laion-5b: An open large-scale dataset for training next generation image-text models, NeurIPS 2022.

[5]. Poisoning Web-Scale Training Datasets is Practical, IEEE S&P 2024.

[6]. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, arxiv 2022.

[7]. Backdoor Learning: A Survey, arxiv 2022.

[8]. Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! ICLR 2024

[9]. How to Backdoor Diffusion Models?, CVPR 2023.

[10]. Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, arxiv 2024.

Final Decision

The authors first show that, after a backdoored model has been purified, an attacker can reactivate the backdoor by crafting ad-hoc test samples. They then propose a defense technique to mitigate this problem. The reviewers all have positive opinions about this work; they highlight that the paper's findings are interesting and the proposed defense is promising. The main limitation of this work is that the threat model assumes large models and datasets, whereas those used in the experimental analysis do not align with this assumption. However, an extensive experimental analysis with large datasets and models would have been impractical, and there is no reason to think that the proposed approach would not work on large datasets and models if the attacker can inject the same percentage of poisoning points. Therefore, I recommend accepting this work.