PaperHub

Overall Rating: 5.8 / 10 (Poster; 4 reviewers; lowest 4, highest 8, std. dev. 1.5)
Individual Ratings: 5, 6, 8, 4
Confidence: 3.3 · Correctness: 2.8 · Contribution: 3.3 · Presentation: 3.0
NeurIPS 2024

Unveiling and Mitigating Backdoor Vulnerabilities based on Unlearning Weight Changes and Backdoor Activeness

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-12-20

Abstract

Keywords
Deep Neural Network; AI security; Backdoor Defense

Reviews and Discussion

Review (Rating: 5)

This paper first makes two observations: neurons exhibiting significant weight changes during clean unlearning also tend to play crucial roles in poison unlearning, and neurons in the backdoored model are always more active compared to those in the clean model. The authors demonstrate these observations on commonly used backdoor attacks and provide further explanations for them. Based on these two observations, the model is first unlearned on clean data and the weights with the highest changes are re-initialized, after which the model is optimized by activeness-aware fine-tuning. Extensive experiments are provided to support the proposed defense.

Strengths

The analysis is clear and well-formulated, based on which the proposed defense is effective and easy to interpret. Extensive experiments on different types of backdoor attacks are provided.

Weaknesses

The design principle of the proposed method (TSBD) is very similar to RNP [22]. Both TSBD and RNP follow the two-stage defense setup, including clean unlearning and recovering. The performance of TSBD is also similar to RNP except for the SIG backdoor. The authors provide some explanations about the difference between RNP and TSBD, but they are not convincing. The core technical difference can be further clarified.

Table 3 provides the ablation study on the zero-reinitialization ratios, indicating that the reinitialization ratio is a sensitive parameter. In a practical threat scenario, how would the defender select a proper reinitialization ratio? It is also unclear whether the neuron-ratio selection plays a key role in the effectiveness. Closely related work [a] is missing, where clean unlearning by mask optimization is also discussed.

[a] Towards reliable and efficient backdoor trigger inversion via decoupling benign features. In ICLR, 2024

Questions

My main concern is the difference between TSBD and RNP, as well as the practicality of TSBD as a defense.

Limitations

The authors did not provide analysis on limitations of the proposed method.

Author Response

Dear Reviewer kmL5, thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our well-formulated analysis and effective method. We hope the following responses could help clarify the potential misunderstanding and alleviate your concerns.


W(Weakness)1: Concern about the differences compared with RNP.

RW(Response to Weakness)1: We appreciate your concern about the differences compared with RNP. We would like to clarify that TSBD is a newly designed method with several fundamental differences compared with RNP. Please refer to R1 of Author Rebuttal for more detailed comparisons.


W2: Concern about the practicality of the proposed method.

RW2: Thank you for your concern about the practicality of TSBD. We would like to point out that the purpose of Table 3 in Section 4.3 is to compare the effectiveness of different reinitialization schemes, where a fixed neuron ratio $n\%$ (e.g., 10%) is used for the three versions. Specifically, let us denote the top-$n\%$ selected neurons as $\theta^n$. $V_1$ means that $\theta^n$ are reinitialized thoroughly; $V_2$ means that only the top-70% subweights of each neuron in $\theta^n$ are reinitialized; $V_3$ means that the top-70% subweights of the whole $\theta^n$ are reinitialized. Therefore, Table 3 in Section 4.3 indicates that some subweights in the selected neurons are important to the clean functionality, and thus we should keep them properly. The sensitivity analysis on the neuron ratio is presented in Section 4.4 and Appendix G, which shows that the hyperparameter $n$ is insensitive yet important to the final performance. In a practical scenario, we can set $n$ freely from 10% to 70% (see Figure 8 of Appendix G). For a more detailed analysis of the importance of stage 1, please refer to R2 of the Author Rebuttal.
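For concreteness, the three variants can be sketched as follows in PyTorch. The ranking of subweights by the magnitude of their unlearning weight change and the helper name `zero_reinit` are our own assumptions for illustration, not the authors' released implementation.

```python
import torch

def zero_reinit(w_orig: torch.Tensor, w_unlearned: torch.Tensor,
                neuron_ratio: float = 0.1, subweight_ratio: float = 0.7,
                version: str = "V3") -> torch.Tensor:
    """Sketch of the V1/V2/V3 zero-reinitialization schemes on one conv layer.

    w_orig / w_unlearned: weights of shape (out_ch, in_ch, k, k) before and
    after clean unlearning. Subweights are ranked by the magnitude of their
    weight change (an illustrative assumption).
    """
    delta = (w_unlearned - w_orig).abs()           # per-subweight weight change
    nwc = delta.flatten(1).sum(dim=1)              # neuron weight change per output channel
    n_sel = max(1, int(neuron_ratio * w_orig.shape[0]))
    sel = nwc.topk(n_sel).indices                  # top-n% neurons (theta^n)

    w_new = w_orig.clone()
    if version == "V1":                            # reinitialize the selected neurons thoroughly
        w_new[sel] = 0.0
    elif version == "V2":                          # top-70% subweights of each selected neuron
        for i in sel:
            k = int(subweight_ratio * delta[i].numel())
            idx = delta[i].reshape(-1).topk(k).indices
            w_new[i].view(-1)[idx] = 0.0
    else:                                          # V3: top-70% subweights over all selected neurons
        sub = w_new[sel].clone()
        k = int(subweight_ratio * sub.numel())
        idx = delta[sel].reshape(-1).topk(k).indices
        sub.view(-1)[idx] = 0.0
        w_new[sel] = sub
    return w_new
```

In every variant the zeroed weights remain trainable, which is what allows the subsequent activeness-aware fine-tuning to recover the clean functionality.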


W3: Suggestion for additional related work [1].

RW3: Thanks for providing us with this valuable information. We will add this paper (BTI-DBF) to the related work in the revised version. Here, we provide a brief overview of unlearning for backdoor defense, including this paper, as follows:

Model unlearning can be considered as the opposite process of learning, aiming to remove the impact of a training subset from a trained model [2]. In the field of backdoor defense, unlearning the possible poisoned data (i.e., poison unlearning) is an effective way to remove the learned backdoor. NC [3] and BTI-DBF [1] try to generate the possible poisoned data with either trigger inversion or a poison-data generator; ABL [4] and D-BR [5] focus on filtering out the poisoned data from the training dataset according to their attributes during training; i-BAU [6] and SAU [7] treat adversarial perturbations as a type of trigger and generate poisoned data via adversarial examples. To avoid inducing bias, recent work tries to directly unlearn the available clean data (i.e., clean unlearning) for defense. RNP [8] finds that a clean-unlearned model can help expose the backdoor neurons for the subsequent pruning-mask learning. Different from existing works that focus on utilizing unlearning techniques, we fill the gap of exploring the clean and poison unlearning processes on the backdoored model and provide insights from their correlation for defense.


[1] Towards reliable and efficient backdoor trigger inversion via decoupling benign features. ICLR 2024.

[2] Machine unlearning. Symposium on Security and Privacy (SP) 2021.

[3] Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. Symposium on Security and Privacy (SP) 2019.

[4] Anti-backdoor learning: Training clean models on poisoned data. NeurIPS 2021.

[5] Effective backdoor defense by exploiting sensitivity of poisoned samples. NeurIPS 2022.

[6] Adversarial unlearning of backdoors via implicit hypergradient. ICLR 2022.

[7] Shared adversarial unlearning: Backdoor mitigation by unlearning shared adversarial examples. NeurIPS 2023.

[8] Reconstructive neuron pruning for backdoor defense. ICML 2023.

Comment

Dear Reviewer kmL5:

We would like to express our sincere gratitude for your valuable insights and suggestions on our work.

We have tried our best to address the concerns and queries you raised during the rebuttal process. However, we would greatly appreciate knowing whether our responses have effectively resolved your concerns. Your feedback will be instrumental in improving the quality of our work.

Sincerely,

Authors

Comment

I thank the authors for their efforts. Most of my concerns are addressed.

Comment

Dear Reviewer kmL5:

We deeply appreciate your thoughtful feedback and the effort you have put into reviewing our paper. Your suggestions will be taken into account in the revised version. Thank you for your comprehensive review and positive feedback.

Sincerely,

Authors

Review (Rating: 6)

The paper addresses the security threat posed by backdoor attacks in deep neural networks (DNNs). The authors explore model unlearning from the perspective of weight changes and gradient norms, making two key observations: weight changes between poison and clean unlearning are positively correlated, and neurons in backdoored models are more active than those in clean models. Based on these observations, they propose a Two-Stage Backdoor Defense (TSBD) method, involving Neuron Weight Change-based Backdoor Reinitialization and Activeness-Aware Fine-Tuning. Extensive experiments demonstrate the superior performance of their method compared to state-of-the-art approaches.

Strengths

  1. The paper introduces a novel perspective on backdoor defense by exploring the correlation between weight changes in poison and clean unlearning and the activeness of neurons. This approach provides new insights into identifying and mitigating backdoor vulnerabilities.
  2. The paper is well-organized and clearly presents its methodology, findings, and contributions.
  3. The proposed TSBD method is rigorously evaluated through extensive experiments involving eight backdoor attacks on three benchmark datasets.

Weaknesses

  1. The TSBD method involves additional steps such as clean unlearning, neuron weight change calculation, and activeness-aware fine-tuning, which may introduce computational overhead. An analysis of the computational cost and efficiency of the proposed method compared to existing defenses would be beneficial.
  2. What causes the clean-unlearning NWCs to exhibit a positive correlation with those in poison unlearning? I think this question has not been clearly answered.

Questions

  1. When selecting the clean dataset, how is the class distribution handled? Is it randomly selected or fully covered? Intuitively, would data related to the target label cause particularly high activation values?
  2. I think the NWC strategy is not significantly different from the pruning method in RNP[1], and the Activeness-Aware Fine-tuning is also similar to the fine-tuning in RNP. Therefore, I hope the authors can emphasize more clearly the differences between their approach and RNP.
  3. From the observations, it is apparent that there is a difference in neuron activation between clean and backdoored models. However, the use of gradient norm restriction in the unlearning process is applied to an already modified model (not the original backdoored model). The direct correlation between this and enhanced defense effectiveness does not seem sufficiently clear and is not within the scope of the initial observations.
  4. Regarding the selection of hyperparameters in Activeness-Aware Fine-tuning, I believe this can significantly impact the experimental results, but there is no discussion on this aspect in the paper.

[1] Li Y, Lyu X, Ma X, et al. Reconstructive neuron pruning for backdoor defense[C]//International Conference on Machine Learning. PMLR, 2023: 19837-19854.

Limitations

I think the paper still has some limitations that the authors didn't mention. Please refer to my questions and weaknesses. Please also add some discussion about data-free methods.

Author Response

Dear Reviewer P1hB, thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our novel and insightful method, good paper presentation, and rigorous evaluation. We hope the following responses can help alleviate your concerns.


W(Weakness)1: Suggestion for the analysis of computational cost.

RW(Response to Weakness)1: Thanks for your constructive suggestion. We would like to refer you to the RW2 of Reviewer obuU for a comprehensive analysis, which emphasizes that TSBD is an effective and efficient method.


W2: Why are clean-unlearning NWCs positively correlated with those in poison unlearning?

RW2: Thanks for your in-depth question on observation 1. This question has been discussed and analyzed from the perspective of neuron activation in Section 3.3. Briefly, the neurons with higher poison activations are the main targets to be changed in both clean and poison unlearning. They tend to decrease the poison activation during poison unlearning, while increasing it during clean unlearning, keeping the clean activation nearly unchanged (see Figure 3). We also deduce that the weight change on a neuron positively influences its changes in clean and poison activation. Therefore, no matter whether clean or poisoned data are used for unlearning, the neurons with higher poison activations in the backdoored model change more in weight than the others, which is reflected in the positive correlation of neuron weight changes (see observation 1 of Figure 1).
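As a concrete illustration of the quantities discussed here, the sketch below performs clean unlearning as gradient ascent on the clean loss and then measures the per-neuron weight change. Both the ascent-style loop and the absolute-sum aggregation are our assumptions rather than the paper's exact recipe.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

def clean_unlearn(model: nn.Module, clean_loader, epochs: int = 5, lr: float = 1e-2):
    """Illustrative clean unlearning: maximize the loss on the small clean set."""
    unlearned = copy.deepcopy(model)
    opt = torch.optim.SGD(unlearned.parameters(), lr=lr)
    unlearned.train()
    for _ in range(epochs):
        for x, y in clean_loader:
            loss = -F.cross_entropy(unlearned(x), y)   # gradient ascent on the clean loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return unlearned

def neuron_weight_change(model: nn.Module, unlearned: nn.Module):
    """Per-neuron weight change (NWC) of each conv layer after unlearning."""
    nwc = {}
    for (name, m0), (_, m1) in zip(model.named_modules(), unlearned.named_modules()):
        if isinstance(m0, nn.Conv2d):
            delta = (m1.weight.detach() - m0.weight.detach()).abs()
            nwc[name] = delta.flatten(1).sum(dim=1)    # one score per output channel
    return nwc
```

Observation 1 then amounts to saying that these per-neuron scores are strongly rank-correlated with the scores obtained when poisoned data are used for unlearning instead.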


Q(Question)1: Concern about clean-data selection and class distribution.

RQ(Response to Question)1: Thanks for your interest in the clean dataset used for defense. We strictly follow the same data setting as BackdoorBench [1] for a fair comparison with the baselines. Specifically, 5% of clean data are randomly selected from the unpoisoned dataset with no manipulation of the class distribution. Moreover, to answer the question about the correlation between target-label data and activation values, we average the activation values of the last convolutional layer for each class of CIFAR-10. For each class, we randomly select 10 samples as input and capture the output of the target layer after an additional ReLU activation function. The results are illustrated in Table 1 (below), which shows that there is no obvious relationship between the target-label data and the activation.

Table 1: Average activation for the last convolutional layer on each class of CIFAR-10

Class            0 (target)  1       2       3       4       5       6       7       8       9
Avg. Activation  0.2212      0.2146  0.2126  0.2175  0.2229  0.2231  0.2140  0.2177  0.2161  0.2186
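For reference, the measurement above can be reproduced with a standard forward hook; the layer choice and the 10 samples per class follow the description in the response, while the function name and looping details are our own.

```python
import torch
import torch.nn as nn
from collections import defaultdict

@torch.no_grad()
def avg_activation_per_class(model: nn.Module, dataset, layer: nn.Module,
                             samples_per_class: int = 10):
    """Average post-ReLU activation of `layer`, grouped by ground-truth class."""
    captured = {}
    handle = layer.register_forward_hook(
        lambda mod, inp, out: captured.update(act=torch.relu(out)))
    sums, counts = defaultdict(float), defaultdict(int)
    model.eval()
    for x, y in dataset:                      # dataset yields (image tensor, label)
        c = int(y)
        if counts[c] >= samples_per_class:
            continue
        model(x.unsqueeze(0))
        sums[c] += captured["act"].mean().item()
        counts[c] += 1
    handle.remove()
    return {c: sums[c] / counts[c] for c in sorted(counts)}
```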

Q2: Concern about the differences compared with RNP.

RQ2: Thanks for your concern about the differences compared with RNP. We would like to refer you to R1 of Author Rebuttal for a comprehensive comparison, which clarifies that TSBD is a newly designed method with several fundamental differences compared with RNP.


Q3: Concern about the gap between observation 2 and loss regularization on gradient.

RQ3: Thanks for your in-depth concern. The main idea we want to convey in observation 2 of Figure 1 is that after an arbitrary clean model has been attacked by a backdoor, the corresponding neurons become more active than before, i.e., they exhibit a larger gradient norm during learning processes. This is a general phenomenon, not limited to the initial backdoored model. Therefore, for the Activeness-Aware Fine-Tuning in stage 2, we add an additional gradient-norm regularization to the loss function to encourage a low gradient-norm status for the optimized model after fine-tuning, which is considered closer to a clean model. This purpose is independent of the intermediate model state after zero reinitialization.
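To make the regularization concrete, the following is a minimal double-backpropagation sketch of a gradient-norm-penalized fine-tuning loss. The penalty weight `lam` is a hypothetical knob; the paper itself relies on the efficient approximation discussed in RQ4 below rather than this direct form.

```python
import torch
import torch.nn.functional as F

def activeness_aware_loss(model, x, y, lam: float = 0.05):
    """Cross-entropy plus a penalty on the parameter gradient norm (illustrative)."""
    params = [p for p in model.parameters() if p.requires_grad]
    ce = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(ce, params, create_graph=True)   # keep graph for 2nd backward
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    return ce + lam * grad_norm

# usage inside a fine-tuning loop (sketch):
#   loss = activeness_aware_loss(model, x, y)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
```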


Q4: Suggestion for the hyperparameter experiment on Activeness-Aware Fine-Tuning.

RQ4: Thanks for your constructive suggestion. As the hyperparameters $r$ and $\alpha$ have been well discussed with respect to their influence on approximation performance in the original paper [2], and since they are only indirectly related to the backdoor, we follow the suggested settings, i.e., $r=0.05$ and $\alpha=0.7$. Here, we provide the tuning results following the tuning range in [2] under our experimental settings. These results were obtained on a BadNets-attacked PreAct-ResNet18. We observe that the performance is insensitive (changes < 2% in ACC and < 1% in ASR) across different hyperparameter settings, maintaining a high level of performance.

Table 2: Hyperparameters Tuning for Activeness-Aware Fine-Tuning

r     α    ACC    ASR
0.05  0.7  90.72  1.31
0.01  0.7  90.68  1.30
0.02  0.7  91.03  1.50
0.1   0.7  90.90  1.30
0.2   0.7  89.64  1.04
0.05  0.1  91.32  1.40
0.05  0.3  91.24  1.59
0.05  0.5  91.00  1.63
0.05  0.9  90.68  1.26
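To illustrate how r and α enter such an approximation, the sketch below uses a finite-difference scheme in which r is the perturbation radius and α mixes the plain gradient with the gradient at the perturbed point. This is our illustrative reading of this family of methods and should not be taken as the exact procedure of [2].

```python
import torch
import torch.nn.functional as F

def gnp_finetune_step(model, x, y, optimizer, r: float = 0.05, alpha: float = 0.7):
    """One fine-tuning step with a finite-difference gradient-norm penalty (sketch)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # gradient of the cross-entropy at the current parameters
    g0 = torch.autograd.grad(F.cross_entropy(model(x), y), params)
    scale = r / (torch.sqrt(sum((g ** 2).sum() for g in g0)) + 1e-12).item()

    # gradient at parameters perturbed along the normalized gradient direction
    with torch.no_grad():
        for p, g in zip(params, g0):
            p.add_(g, alpha=scale)
    g1 = torch.autograd.grad(F.cross_entropy(model(x), y), params)
    with torch.no_grad():
        for p, g in zip(params, g0):
            p.sub_(g, alpha=scale)        # undo the perturbation

    # mix the two gradients and take an optimizer step
    optimizer.zero_grad()
    for p, a, b in zip(params, g0, g1):
        p.grad = (1 - alpha) * a + alpha * b
    optimizer.step()
```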

[1] BackdoorBench: A Comprehensive Benchmark of Backdoor Learning. NeurIPS 2022 Datasets and Benchmarks Track.

[2] Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning. ICML 2022.

Comment

Thanks for considering my concerns; I see the authors have provided more evaluations validating the method's feasibility and effectiveness.

An additional minor question: how compatible is the NWC-based Backdoor Reinitialization approach with "DHBE: Data-free Holistic Backdoor Erasing in Deep Neural Networks via Restricted Adversarial Distillation" (AsiaCCS) and "SupRTE: Suppressing Backdoor Injection in Federated Learning via Robust Trust Evaluation"?

Besides, it is already difficult to innovate in this direction, and the limitations of NWC-based Backdoor Reinitialization methods should be discussed.

Comment

Dear Reviewer P1hB,

Thanks for your further feedback. We hope the following responses can address your concerns:

  • About the compatibility of NWC-based Backdoor Reinitialization. As illustrated in Appendix F, our proposed NWC is highly compatible with other methods, e.g., FP [1], for a better defense. We would like to clarify that NWC is also compatible with the defense methods you mentioned:

    • DHBE is a data-free defense aiming to distill a clean student model from a backdoored teacher using two adversarial processes [2]. Our NWC is compatible with the generated samples, since it shows similar behavior regardless of whether clean or poisoned data is used. Therefore, we can replace the adversarial backdoor regularization with our NWC-based backdoor reinitialization. Specifically, we can clean the backdoored teacher into a new benign teacher using NWC reinitialization, and treat it as the regularization term to inform the learning of the student model. In this case, the student model can learn the high ACC from the backdoored teacher, as well as the low ASR from the benign teacher. Besides, the gradient-norm regularization in stage 2 can also be used in the loss function to guide the student model closer to a clean model during distillation.
    • SupRTE is a defense method specially designed for the Federated Learning scenario, which extracts the behavior representations from different clients and assigns scores for further weighting [3]. Our NWC from stage 1 and gradient norm regularization from stage 2 can be used as additional behavior representations for scoring. For example, clients with unusually high average NWCs and gradient norms compared to others can be recognized as potentially malicious.
  • About the limitations. We would like to point out that the limitations of our method have been discussed in the Conclusion section. We consider the major concern to be clean-data accessibility, which may limit applicability to real-world scenarios. We then mention potential solutions using data-generation and data-free techniques. Regarding limitations on performance, we have also conducted a comprehensive evaluation in the paper. As discussed in Appendix H, our method is less effective under a low poisoning ratio, e.g., 1%, and we identify a possible reason: the weight changes of backdoored neurons become less obvious. In future work, we plan to further investigate the NWC for data-free solutions and to improve performance in low-poisoning-ratio scenarios.

Thanks again for your valuable time and constructive comments.

Sincerely,

Authors


[1] Fine-pruning: Defending against backdooring attacks on deep neural networks. RAID 2018.

[2] DHBE: Data-free Holistic Backdoor Erasing in Deep Neural Networks via Restricted Adversarial Distillation. Asia CCS 2023.

[3] SupRTE: Suppressing Backdoor Injection in Federated Learning via Robust Trust Evaluation. IEEE Intelligent Systems 2024.

Comment

Judging from the content of the article and the replies in the rebuttal, the authors have a good understanding of backdoor defence, and the proposed method is also innovative. I am willing to raise my score.

Comment

Dear Reviewer P1hB,

We sincerely appreciate your thoughtful response and the time you've dedicated to reviewing our paper. We will incorporate your suggestions and insights into the revised manuscript. Thank you once again for your thorough review and positive evaluation.

Sincerely,

Authors

Review (Rating: 8)

The authors propose a novel two-stage backdoor defense method, TSBD. The proposed method is based on two key observations: 1) the weight changes of neurons during clean and poison unlearning are correlated, and 2) the backdoored neurons exhibit a larger gradient norm during unlearning. Correspondingly, the proposed defense consists of 1) reinitializing neurons with high weight changes and 2) fine-tuning under gradient-norm regularization. The proposed method is compared with several state-of-the-art defense methods on different attacks, and a comprehensive ablation study is conducted.

Strengths

  1. The paper is well-written and easy to follow. The figures are very clear.
  2. The observations are interesting, providing insights into this field.
  3. The proposed method is effective across different attacks, showing a low attack success rate and a good defense effectiveness rate. The authors provide sufficient and convincing results.

Weaknesses

  1. How is the performance on scaled-up experiments, e.g. ViT, or ImageNet?
  2. What is the computational overhead of the proposed method compared to others?

Questions

Elaborated above.

Limitations

The authors discuss the limitations.

Author Response

Dear Reviewer obuU, thank you very much for your positive appraisal and great interest in our paper. We are encouraged by your positive comments on our good paper presentation, insightful observations, and convincing evaluations. We hope the following responses can help answer your questions.


W(Weakness)1: Performance on scaled-up experiments.

RW(Response to Weakness)1: We appreciate your interest in scaled-up cases of TSBD. To further verify the scalability of our method, we evaluate its performance on a ViT-b-16 model with the CIFAR-10 dataset as follows:

Table 1: Experimental Results for ViT-b-16

Method       No Defense      CLP             ANP             TSBD
Attack       ACC     ASR     ACC     ASR     ACC     ASR     ACC    ASR
Input-aware  91.65   92.30   90.55   79.34   90.40   50.69   86.11  4.43
WaNet        89.12   80.95   89.12   80.95   89.12   80.95   88.43  1.59

Note that the other settings follow the basic ones in section 4.1, e.g., the poisoning ratio is set to 10% and the target label is set to 0. The results demonstrate that TSBD performs effectively on the scaled-up model, achieving a low ASR and acceptable ACC. In contrast, CLP and ANP fail completely with ASR still at a high level, particularly for the WaNet attack. Due to time constraints, we postpone the testing on large-scale datasets to future work.


W2: What is the computational overhead?

RW2: Thanks for your interest in the computational overhead of TSBD. We would like to emphasize that TSBD is an effective and efficient method that can defend against backdoor attacks with acceptable overhead. To support this statement, we show the average computational time of each defense step in Table 2, including Clean Unlearning, NWC Calculation, Zero Reinitialization, and Activeness-Aware Fine-Tuning. We follow the same experimental setting as in Section 4.1, using PreAct-ResNet18 with a 10% poisoning ratio. The experiments here are conducted on a server with an A6000 GPU and an AMD EPYC 7543 32-core CPU. We observe that the main computational overhead lies in the fine-tuning process. In contrast, the time required for clean unlearning does not increase proportionally with dataset complexity. This means that TSBD is as efficient as other fine-tuning-based methods. Moreover, we present a practical runtime comparison with other SOTA defenses in Table 3, including the loading and testing time needed in practice. As we can see, TSBD is faster than most of the existing methods.

Table 2: Computational Time of Each Defense Step of TSBD

Defense Step                  CIFAR-10  Tiny ImageNet
Clean Unlearning              20.84s    17.90s
NWC Calculation               0.03s     0.03s
Zero Reinitialization         1.34s     1.29s
Activeness-Aware Fine-Tuning  21.08s    174.36s

Table 3: Practical Runtime Comparison on BackdoorBench

Dataset        FT     FP      ANP    NC      RNP   TSBD
CIFAR-10       358s   855s    505s   733s    123s  159s
Tiny ImageNet  1649s  20429s  2578s  37101s  285s  269s

Review (Rating: 4)

The paper introduces two key observations about backdoored models and presents a two-stage backdoor defense method. The two observations are: a strong positive correlation between weight changes in poison and clean unlearning, and stronger neuron activation in backdoored models compared to clean models. The proposed two-stage defense leverages these observations by reinitializing a certain proportion of backdoor-related neurons and subweights, and by suppressing the gradient norm during the fine-tuning process. The proposed method demonstrates state-of-the-art performance on the selected datasets and models, with a wide range of ablation studies.

Strengths

  1. The paper introduces a novel approach to backdoor defense. It provides two insights into the weights of backdoored models. Based on these, the authors introduce a two-stage defense mechanism. The method also uses clean unlearning, which does not require poisoned data, to defend against the backdoor attack.

  2. While the scope of the experiment is limited in model architectures and datasets, the experiment shows superior performance. The experimental setup is thorough, covering a certain range of backdoor attacks.

  3. The paper is well-organized, with figures and tables used to illustrate key observations and experimental results. The descriptions of the observations and the proposed defense method are detailed and precise.

Weaknesses

  1. The scale of the experiments is too limited. While the authors compare results across 8 backdoor attack methods, the main experiments are only conducted on specific datasets and models (CIFAR-10, PreAct-ResNet18). This makes the generalizability of the results rather unclear.

  2. The authors mention computational efficiency in Activeness-Aware Fine-Tuning by using an approximation scheme, but there are no experiments demonstrating the trade-off between efficiency and accuracy. Including such experiments would provide a more comprehensive understanding of the method's practical implications.

  3. The concept of unlearning in backdoor attacks could be better explained in the Related Works section. Given its importance in this paper, a more detailed discussion would help contextualize the proposed method and highlight its significance in addressing existing challenges.

  4. There is no mathematical proof. While the method employs various loss functions and approximation techniques, and the empirical results are quite promising, there is no analysis explaining why the proposed method works.

Questions

  1. There is no statement or analysis about scalability and generalizability. Can you provide more detailed information about these, e.g., for different datasets and model architectures?

  2. How can the clean unlearning process identify and affect backdoor-related neurons without causing degradation of the model's performance on clean data?

  3. The section introducing the effectiveness of zero reinitialization on subweights in 4.3 Ablation Studies is a bit confusing. The difference between V2 and V3 is still not clear to me.

  4. I have questions about the insensitivity of TSBD to both neuron ratio and weight ratio as explained in Section 4.4. While I understand that the method effectively selects backdoor-related neurons indicated as active neurons and that reinitializing them is effective, it is confusing that the performance remains similar across both low and high ratios. Is reinitializing itself the important part?

Limitations

The authors mention in the introduction that clean unlearning is still underexplored, but it is unclear if this limitation has been fully addressed in the paper. Additionally, the paper lacks a discussion of its own limitations. Providing a clear discussion on the limitations of the proposed method and potential areas for future research would enhance the paper.

The authors acknowledge that there remain challenges in fully mitigating backdoor effects without any access to clean data. They suggest that data generation techniques and data free techniques might offer potential solutions.

There is no discussion about the scalability of their proposed method, especially in large scale or real time applications. Addressing the computational overhead and resource requirements would be necessary for practical applications.

Author Response

Dear Reviewer Vmzu, thank you very much for your careful review of our paper and thoughtful comments. We are encouraged by your positive comments on our novel and insightful method, superior performance, thorough experimental setup, and good paper presentation. We hope the following responses could help clarify the potential misunderstanding and alleviate your concerns.


W(Weakness)1&Q(Question)1: Concern about the scale of main experiments.

R(Response)1: We appreciate your concern about the scale of the main experiments. We would like to clarify the following points:

  • Our main experiments are not only conducted on CIFAR-10, PreAct-ResNet18, but also cover the datasets of Tiny ImageNet, GTSRB, and the model of VGG19-BN. Please refer to Section 4.2, Appendix D and E for the details.
  • In this paper, we follow a similar testing range of models and datasets as in previous SOTA works [1,2], which we consider sufficient to demonstrate the generalizability of the results.

To further prove the generalizability and scalability, we also evaluate our method on CIFAR-100 (as follows) and ViT-b-16 (due to the space limit, please refer to RW1 of Reviewer obuU). Note that we keep other settings the same as in section 4.1. The results show the superiority of TSBD.

Table 1: Experiment Results for CIFAR-100 on PreAct-ResNet18

CIFAR-100    No Defense      i-BAU           TSBD
Attack       ACC     ASR     ACC     ASR     ACC    ASR
BadNets      67.22   87.43   60.37   0.04    66.28  0.33
Input-aware  65.24   98.61   65.21   85.14   69.67  0.18

W2: Suggestion for the experiment on computational efficiency of Activeness-Aware Fine-Tuning.

RW(Response to Weakness)2: Thanks for your interest in the computational efficiency. In fact, as pointed out in [3], it is infeasible to conduct such an experiment for DNNs, since it requires calculating a Hessian matrix (refer to Appendix B) without approximation, where the time and space complexity are theoretically $O(n^2)$. We now provide a brief example on PreAct-ResNet18, which contains $n \approx 11$ million parameter units. Its Hessian matrix would contain $n^2 \approx 124{,}794$ billion units, which is much larger than a LLaMA 65B model and cannot be computed on a single GPU, e.g., an A6000 GPU with 49G memory.
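A quick back-of-the-envelope check of these numbers, with the parameter count taken as an approximate assumption on our side:

```python
n = 11_170_000                       # approx. parameter count of PreAct-ResNet18 (assumed)
hessian_entries = n ** 2             # ~1.25e14 entries, i.e. roughly 124,770 billion
fp32_terabytes = hessian_entries * 4 / 1e12
print(f"{hessian_entries:.3e} entries, about {fp32_terabytes:.0f} TB in fp32")
# roughly 499 TB, orders of magnitude beyond the memory of a single A6000 GPU
```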


W3: Suggestion for adding unlearning to related work.

RW3: Thanks for your valuable suggestion. We will update some related works on unlearning-based backdoor defense to highlight our contribution in the revised version. Due to the space limit, a brief version is provided in the RW3 of Reviewer kmL5.


W4: Suggestion for mathematical proof on explaining why TSBD works.

RW4: Thanks for your in-depth suggestion. We have explained the functionality of the important techniques in Section 3. For example, the "Suggestions" in Section 3.2 clarify why both stages are needed. Additionally, in Section 4, we empirically validate the effectiveness of each component.

In fact, beyond the empirical findings, we also attempted to mathematically derive observations 1 and 2. However, this seems to be a very difficult task. For example, to prove observation 1, we need to estimate the NWC values, which accumulate from the weight changes over the whole unlearning process. This involves estimating the total change of a variable (e.g., $x$) over $K$ steps of gradient descent, i.e., $\|x_K - x_1\| = \|\sum_{t=1}^{K-1}(x_{t+1}-x_{t})\|$. As far as we know, in optimization theory there exists no mathematical tool to estimate this quantity directly; instead, more focus is placed on estimating the distance between $x_t$ and the limit point. In future work, we will continue to explore this issue.


Q2: Explanation for the clean-unlearning capability.

RQ(Response to Question)2: Thanks for your question. We would like to point out that the clean unlearning in stage 1 does degrade the clean performance of the unlearned model. However, since the reinitialization is conducted on the original backdoored model, this degradation does not affect the final performance. Based on observation 1 in Figure 1, we find that the backdoor-related neurons that change the most in weight during poison unlearning also change significantly during clean unlearning. Therefore, we define the neuron weight change to identify them, and then remove them from the backdoored model. For how unlearning affects the neurons, we offer insights from the perspective of neuron activations in Section 3.3. We would like to refer you to RW2 of Reviewer P1hB for a brief summary.


Q3: Explanation for the different model versions in ablation study.

RQ3: Thanks. Due to the space limit, we would like to refer you to the RW2 of Reviewer kmL5 for the explanation of the differences among these three versions.


Q4: Concern about the importance of reinitialization.

RQ4: We appreciate your concern about the importance of stage 1. While it exhibits stably good performance across different ratios on BadNets, we cannot overlook its contribution to backdoor removal for some strong attacks, e.g., Blended, LF, and SSBA, where the ASR remains larger than 20% after defense when the reinitialization ratio is too small (see Figure 8 of Appendix G). For more details, we would like to refer you to R2 of the Author Rebuttal, where we validate the importance of stage 1 by comparing TSBD with a version containing only stage 2.


[1] Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples. NeurIPS 2023.

[2] Neural Polarizer: A Lightweight and Effective Backdoor Defense via Purifying Poisoned Features. NeurIPS 2023.

[3] Penalizing Gradient Norm for Efficiently Improving Generalization in Deep Learning. ICML 2022.

Comment

Dear Reviewer Vmzu:

We would like to express our sincere gratitude for your valuable insights and suggestions on our work.

We have tried our best to address the concerns and queries you raised during the rebuttal process. However, we would greatly appreciate knowing whether our responses have effectively resolved your concerns. Your feedback will be instrumental in improving the quality of our work.

Sincerely,

Authors

Comment

Dear Reviewer Vmzu:

Thanks again for your thoughtful comments. As the end of the discussion period is approaching, we would like to kindly ask again about your concerns regarding our paper. Are there still any unresolved doubts?

Your help is greatly appreciated. We are eagerly waiting for your feedback before the end.

Sincerely,

Authors

Author Response

General Response

We sincerely thank all reviewers for their valuable time and constructive comments.


Q1: Systematic comparison with RNP [1].

R1: We aim to address the concerns regarding the differences between our work and RNP. More precisely, we emphasize their differences from the perspectives of technical details, motivation&insight, and experimental performance, as follows:

  • Different technical details.
    • The backdoor erasing techniques are different. RNP utilizes pruning to erase the neurons permanently, while TSBD proposes zero reinitialization in the subweight level to modify the weights of neurons. By using the zero reinitialization, the modified neurons can be further repaired by the subsequent fine-tuning and better recover the clean functionality, which is infeasible in pruning.
    • The goals are different for the second stage. For RNP, as stated in [1], the purpose of Filter Recovering is to "recover the clean features (features of the clean samples) erased by the previous unlearning step", where a pruning mask is learned in this stage and backdoor erasing has not yet been conducted. However, for TSBD, our goal is to recover the clean functionality lost when erasing the backdoor effect by reinitializing the backdoor-related neurons, where fine-tuning is a commonly used technique [2]. The loss of clean features comes from the zero reinitialization and has no direct relationship with unlearning.
    • The target models are different in the second stage. For RNP, the Filter Recovering is conducted based on the unlearned model from the previous stage, i.e., Neuron Unlearning. Differently, for TSBD, the unlearning stage is only to help find out the backdoor-related neurons, while the fine-tuning stage is based on the reinitialized model from the original backdoored one, not the unlearned model from the first stage.
    • The subsequent operations are different after erasing the backdoor. The two stages in RNP are designed to finally find and prune the backdoor neurons, and no further operation is conducted. For TSBD, after removing the backdoor with zero reinitialization (as shown in Section 4.3, where the ASRs reduce to 0), a further fine-tuning is conducted to recover the sacrificed ACC. Moreover, we adopt a novel gradient-norm regularization to enhance this process.
  • Different motivations and insights.
    • The motivations are different for conducting clean unlearning. RNP aims to utilize the characteristics of a clean-unlearned model for mask learning and improving other techniques [1], i.e., exploring the utility of the unlearned model. In contrast, TSBD aims to explore the characteristics of the unlearning process, including the ones in different input data types and model types. The only reason for unlearning clean data is that it is an accessible data type for defense.
    • The insights are from different perspectives. For RNP, the authors state that "the unlearned model tends to predict the backdoor label for all defense samples" and it can be used to improve other defenses [1]. For TSBD, we emphasize the positive weight-change correlation of clean and poison unlearning, and we also uncover the neuron activeness of backdoored model. These insights are compensatory for a better understanding of backdoor learning. We believe that RNP and TSBD can both contribute to the community.
  • Different performances.
    • TSBD outperforms RNP in most cases. For a fair comparison, we adopt the RNP method in the BackdoorBench framework and follow the basic experimental setup as other baselines; refer to Appendix C for more details. For all the main experiments illustrated in Section 4.2, Appendix D and E, TSBD outperforms RNP with the best average performance, e.g., for CIFAR-10, 97.09 (TSBD) > 82.83 (RNP) on DER; for Tiny ImageNet, 97.89 (TSBD) > 87.59 (RNP) on DER, etc.
    • TSBD is more robust in clean data ratio than RNP. Although RNP claims that only 1% clean data is needed for defense, it is not robust to the clean data ratio, which is validated in the Appendix D.9 of [3]. In contrast, TSBD is validated to be robust in the clean data ratio (see Appendix I).

In conclusion, our TSBD is a newly designed method with several fundamental differences compared with RNP. Our paper provides several important insights to the research field of backdoor learning and the empirical results validate that TSBD is an effective defense method.


Q2: Concern about the importance of stage 1.

R2: We aim to emphasize the importance of stage 1 for the final performance. As shown in Figure 8 of Appendix G, when the neuron ratio is reduced to 1%, TSBD fails on several strong attacks, with ASR larger than 20%, which indicates poor performance without sufficient reinitialization. Here, we design a stage-2-only version, i.e., Activeness-Aware Fine-Tuning (AaFT for short), and compare it with the full version (TSBD) in Table 1 (below). The results show that TSBD is more effective against most attacks and validate the importance of reinitialization for a successful defense.

Table 1: Comparison between AaFT (only stage 2) and TSBD (the full process)

Method       AaFT            TSBD
Attack       ACC     ASR     ACC    ASR
BadNets      90.58   1.26    90.72  1.31
Blended      91.70   20.52   91.61  2.61
Input-aware  93.04   1.14    93.06  1.94
LF           92.07   6.07    91.20  2.64
SIG          90.01   2.70    90.41  1.27
SSBA         91.61   31.97   91.57  1.66
Trojan       91.71   9.34    91.76  5.06
WaNet        93.04   1.21    93.26  0.88

[1] Reconstructive Neuron Pruning for Backdoor Defense. ICML 2023.

[2] Fine-pruning: Defending against backdooring attacks on deep neural networks. RAID 2018.

[3] Shared Adversarial Unlearning: Backdoor Mitigation by Unlearning Shared Adversarial Examples. NeurIPS 2023.

Final Decision

This paper aims to address the security threat posed by backdoor attacks in DNNs. The authors explore model unlearning from the perspective of weight changes and gradient norms, making two key observations: weight changes between poison and clean unlearning are positively correlated, and neurons in backdoored models are more active than those in clean models. Based on these observations, they propose a Two-Stage Backdoor Defense (TSBD) method, involving Neuron Weight Change-based Backdoor Reinitialization and Activeness-Aware Fine-Tuning. Extensive experiments demonstrate the superior performance of their method compared to state-of-the-art approaches. The reviewers agree that the paper provides interesting observations, and the paper would benefit from a clearer explanation of the unlearning process and analysis.