PaperHub
Overall rating: 5.8/10 (Poster; 4 reviewers, scores 6 / 4 / 5 / 8; min 4, max 8, std. 1.5)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 2.8
NeurIPS 2024

Breaking the False Sense of Security in Backdoor Defense through Re-Activation Attack

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Keywords
Backdoor Learning; Backdoor Attack; Backdoor Defense; Adversarial Machine Learning

Reviews and Discussion

Review (Rating: 6)

The paper examines whether backdoors in deep learning models processed by defense algorithms can be reactivated and how to achieve this reactivation. First, it proposes a metric to measure the presence of backdoors, called the backdoor existence coefficient, which ranges from 0 (backdoors are nonexistent) to 1 (backdoors are still present). This coefficient is computed in two steps: (i) identifying the backdoor-related neurons in the attacked (not defended) model, and (ii) computing the similarity between the feature maps of a subset of backdoor-related neurons in the clean model, the attacked model, and the defended model.

Second, the paper proposes three attacks: one for the white-box setting, one for the black-box setting, and one that exploits transferability to perform a reactivation attack against the defended models. The fundamental building block consists of finding a universal adversarial perturbation that, applied to all instances carrying the previously effective trigger, makes that trigger effective again.
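
To make this building block concrete, here is a minimal PyTorch-style sketch of optimizing such a universal perturbation against a defended model; `f_def`, `apply_trigger`, the L-infinity budget `eps`, and the loop structure are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code): optimize one universal perturbation that,
# added on top of already-triggered inputs, re-activates the backdoor on the
# defended model f_def. apply_trigger and eps are assumed placeholders.
import torch
import torch.nn.functional as F

def find_reactivation_perturbation(f_def, loader, trigger, target_label,
                                   eps=8 / 255, steps=100, lr=1e-2):
    delta = torch.zeros_like(trigger, requires_grad=True)   # universal perturbation
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        for x, _ in loader:                                  # clean images
            x_trig = apply_trigger(x, trigger)               # instance carrying the old trigger
            x_adv = (x_trig + delta).clamp(0, 1)             # add the shared perturbation
            target = torch.full((x.size(0),), target_label, dtype=torch.long)
            loss = F.cross_entropy(f_def(x_adv), target)     # push toward the attack target
            opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():
                delta.clamp_(-eps, eps)                      # keep the change L_inf-bounded
    return delta.detach()
```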

The experimental evaluation considers six defenses, seven attacks, and three image datasets along several axes. It shows that the proposed attack is effective in finding the universal adversarial perturbation, i.e., in modifying the poisoned instances in a way that reactivates the backdoor.

Strengths

I think this is an interesting paper of moderate impact that sheds light on the fact that backdoor defenses may not effectively remove the presence of backdoors in the attacked model. In my opinion, the paper shows the following strengths:

  • Novel metric to evaluate the presence of backdoors in a model after defense: the paper proposes a new metric that demonstrates the backdoor is still implanted in the model even after being processed by a defense procedure. Moreover, it shows that the backdoor existence score is strongly positively correlated with the attack success rate of the three proposed attacks for the majority of the considered defenses. This indicates that the score is reliable in showing the existence of the backdoor in the model and can be adopted by practitioners to assess the presence of backdoors after applying a defense algorithm.

  • The three proposed attacks show reasonable effectiveness: the three proposed attacks are able to generate new poisoned instances that exhibit a high success rate compared to the success rate of the original trigger instances on the defended model.

  • Comprehensive experimental evaluation: the experimental evaluation is thorough and detailed, considering several useful perspectives, such as the influence of the size of poisoned samples, and various settings.

Weaknesses

Even though I appreciate the conceptual contribution of the paper, I think that it needs to be improved in some aspects:

  • Presentation of the surrogate attack is confusing and lacks details: I consider the transfer-based attack to be the most interesting and realistic one since it is deployable even if the attacker has no access to information about the target defense. However, it is not clearly explained. The authors should clearly state the assumptions behind this attack, such as whether the attacker has access to a surrogate model imitating the target model in its clean, attacked, or defended state, and how the attacker can obtain that surrogate model. Additionally, the authors should clearly highlight the differences between the knowledge the attacker has in the black-box and transfer-based scenarios. For example, the authors state that "the adversary lacks prior knowledge of the defense model" in Section 3.3, which seems to be the same assumption as the black-box setting. Therefore, it is unclear what the pros and cons of the surrogate-based attack are compared to the black-box attack and vice versa.

  • Detectability of the attacked poisoned instances is not discussed: the proposed attacks involve finding a universal adversarial perturbation to add to test instances that already contain the trigger. An important aspect that should be evaluated is the detectability of the perturbed instances since they may be easily detected after perturbation by defenses like [a] (but many other defenses at inference time exist, see [b]). The authors should show the appearance of some examples generated by the three proposed attacks and evaluate the detectability of the perturbed images against some defenses to make their proposal more convincing.

  • Only mean results are shown: the authors state that all experiments are executed 5 times with varying random seeds in Section D of the Appendix. Therefore, the authors should also show the standard deviations of the success rates of the attacks presented in the tables of the paper, as this would help evaluate the variability of the success rate of the proposed attacks over different trials.

Finally, I report a minor weakness from a section of the Appendix:

  • Description of the running times is confusing: I appreciate that the authors reported the running times of their proposed attacks on different architectures in Section F of the Appendix. However, they should clearly describe the settings in which these running times are obtained, such as the considered attack and defense, to better support their analysis. Moreover, they should explain why some results are labeled as N/A.

[a] Tramer et al., SentiNet: Detecting Localized Universal Attacks Against Deep Learning Systems, IEEE Security and Privacy Workshops, 2020.

[b] Cinà et al., Wild Patterns Reloaded: A Survey of Machine Learning Security against Training Data Poisoning, ACM Computing Surveys, 2023.

Questions

  • Could you please provide more details about the transfer-based attack? (See Weaknesses section for more details)
  • How can adding the perturbation to already perturbed instances impact their detectability? (See Weaknesses section for more details)

Updates after authors' response: the authors provided insightful responses to the questions and provided data and arguments that successfully address the weaknesses listed above. I strongly invite the authors to add the content of their response to the next version of their paper.

Limitations

The authors have summarized the limitations of their proposals in a specific section of the paper, but they do not address the problem of the detectability of the generated attacks. They also discuss the potential negative societal impacts, asserting that their work may lead to the development of new defenses. Finally, the details of the experimental evaluation and implementations are extensively documented.

Author Response

We deeply appreciate the reviewer's valuable time and thoughtful comments.

Q1. Details of the surrogate-based attack (i.e., the transfer attack), and the pros and cons of the surrogate-based attack in comparison with the black-box attack (BBA).

R1: Thanks for this constructive comment. We would like to refer you to our response to Q2 of Reviewer G6gk, where we clarify the threat model, main attack steps, as well as additional evaluations and analyses.

We compare our transfer attack (TA) and BBA in two aspects:

  • Attack effectiveness: According to the results reported in Table 2 and Table 3 of the main manuscript, TA achieves a higher ASR than BBA in most cases because:
    • For BBA: The effectiveness of BBA is limited by the specific black-box attack method (the Square attack in this work); more advanced methods might improve the performance. It is also limited by the number of queries, and we show that ASR improves with increasing query numbers in Table 2 of the Common Response.
    • For TA: As analyzed in the Common Response, there is a backdoor effect similarity between the original backdoored model and the defense model. Thus it is easier to optimize a highly transferable trigger.
  • Attack efficiency: BBA requires many queries to the target model to achieve satisfying attack performance, while TA needs only one query to attack the target model. Obviously, TA is more efficient and more practical than BBA.

In summary, TA outperforms BBA in terms of both effectiveness and efficiency. However, we would like to emphasize that our main goal in designing re-activation attack algorithms under three scenarios is to verify the existence of backdoors in defense models. All three attacks could be further improved by developing more advanced algorithms.

Q2. Detectability of the attacked poisoned instances (i.e., new poisoned samples obtained by our re-activation attack) is not discussed.

R2: Thanks for this constructive comment. It is valuable to study whether the proposed attack will increase the detectability during inference. Our study is as follows:

  • Settings: We adopt three representative inference-time poisoned sample detection methods, including SCALE-UP [1], SentiNet [2], and STRIP [3]. Detection requires two inputs: the model and the query dataset. We evaluate five pairs: <the original backdoored model $f_{\text{A}}$, the original poisoned dataset $\mathcal{D}_p$>, <the defense model with FT-SAM defense $f_{\text{D,FT-SAM}}$, dataset $\mathcal{D}_p$>, <model $f_{\text{D,FT-SAM}}$, the re-activation dataset $\mathcal{D}_{p,\Delta\xi}$>, <the defense model with SAU defense $f_{\text{D,SAU}}$, dataset $\mathcal{D}_p$>, and <model $f_{\text{D,SAU}}$, the re-activation dataset $\mathcal{D}_{p,\Delta\xi}$>.
  • Results and analyses:
    • For test-time detection, the results in Table 1 show that our attacks do not markedly increase the TPR compared to the other two pairs. More detection results for our BBA and TA are shown in Table 6 of the PDF. The appearance of the generated poisoned samples is also provided in Figure 2 of the PDF.
    • For test-time defense, our attack under test-time defenses is evaluated in the response to Q4 of Reviewer AhAB, where our attack maintains a certain ASR against these defenses.

In conclusion, these findings offer valuable insights, paving the way for the future design of more stealthy re-activation attacks.

Table 1: Detection performance (TPR %) of different poisoned samples under three detection methods.

| Attack | Detection ↓ | $f_{\text{A}}, \mathcal{D}_p$ | $f_{\text{D,FT-SAM}}, \mathcal{D}_p$ | $f_{\text{D,FT-SAM}}, \mathcal{D}_{p,\Delta\xi}$ | $f_{\text{D,SAU}}, \mathcal{D}_p$ | $f_{\text{D,SAU}}, \mathcal{D}_{p,\Delta\xi}$ |
|---|---|---|---|---|---|---|
| BadNets | SCALE-UP | 39.6 | 79.6 | 68.6 | 79.5 | 49.5 |
| BadNets | SentiNet | 37.7 | 3.6 | 2.2 | 0.2 | 0.9 |
| BadNets | STRIP | 88.3 | 0.7 | 5.5 | 10.3 | 6.5 |
| Trojan | SCALE-UP | 92.6 | 84.9 | 73.1 | 81.6 | 55.4 |
| Trojan | SentiNet | 2.9 | 1.1 | 1.05 | 2.1 | 1.5 |
| Trojan | STRIP | 99.9 | 1.9 | 29.8 | 4.2 | 1.2 |

Q3. Only mean results are shown.

R3: Thanks.

  • Results: Here we provide the mean results with standard deviations (std.) under five different random seeds for several attacks and defenses in Table 2 for our black-box attack. More results for our WBA and TA are presented in Tables 4 and 5 of the PDF.
  • Analysis: The std. is small across these attack and defense methods for all three of our attacks, showing that our attack methods are stable.

Table 2: Mean result with std. for our BBA.

| Original attack ↓ / Defense → | NAD | i-BAU | FT-SAM | SAU |
|---|---|---|---|---|
| BadNets | 49.4 ± 2.6 | 57.7 ± 4.6 | 44.1 ± 1.2 | 37.5 ± 2.8 |
| Blended | 14.6 ± 0.3 | 94.4 ± 2.9 | 90.1 ± 4.0 | 83.2 ± 5.1 |

Q4. Description of the running time.

R4: Thanks.

  • We would like to clarify that the running time of our attack depends only on the training dataset and the network, and is independent of the specific backdoor attack or defense method. Thus, we did not specify a particular method, as the running time is consistent across methods.
  • Regarding the "N/A" in Table 10: we executed WBA and TA on the CLIP models. For a query-based black-box attack, the attacker cannot directly access the target model (such as weights or gradients), and CLIP models only return the final matching score or ranking results, which limits the ability of query-based black-box attacks. Moreover, there are no relevant studies for reference. Thus, we marked the related results as "N/A".

We hope this explanation resolves your queries. We welcome your feedback or further inquiries. Thanks again.

[1] SCALE-UP: An efficient black-box input-level backdoor detection via analyzing scaled prediction consistency, ICLR2023.

[2] Sentinet: Detecting localized universal attacks against deep learning systems, SPW2020.

[3] STRIP: A defence against trojan attacks on deep neural networks, ACSAC2019.

Comment

I thank the authors for the broad and deep response they provided to my concerns. I appreciate your dedication. You have clarified almost everything raised in my questions and the listed weaknesses, confirming my positive opinion of your work.

I have one more question for you:

  • In Response 2, you reported several settings, and Table 1 reports the corresponding results. However, Table 6 is not as easy to read as Table 1. In which setting are the results in Table 6 computed? What does the "Original" column represent?
Comment

Thank you for your kind response and appreciation of our work. We are glad that our responses provided clarifications to your concerns.

Regarding Table 6 in the PDF, we acknowledge that it contains some areas of ambiguity. Table 6 is actually an expanded version of Table 1, showing the TPR results of our attacks under three kinds of test-time backdoor detection methods. While Table 1 lists detection results for our white-box re-activation attack (WBA), Table 6 lists detection results for all three types of attacks, namely the white-box re-activation attack (WBA), the black-box re-activation attack (BBA), and the transfer-based re-activation attack (TA). Specifically,

  • The "Original" in Table 6 corresponds to "fA,Dpf_A,\mathcal{D}_p " in Table 1, i.e. , < the original backdoored model, the original poisoned dataset Dp\mathcal{D}_p>.

  • The "FT-SAM" and "SAU" in Table 6 correspond to fD,FT-SAM,Dpf_{D,\text{FT-SAM}},D_p and fD,SAU,Dpf_{D, \text{SAU}},\mathcal{D}_p in Table 1, respectively, which represent the <the defense model, the original poisoned dataset Dp\mathcal{D}_p>.

  • "WBA, BBA, and TA" of the 5th, 6th, and 7th columns in Table 6 correspond to fD,FT-SAM,Dp,Δξf_{D, \text{FT-SAM}}, D_{p,\Delta{\xi}} in Table 1 and different columns show detection results under different attacks. They represent the <defense model fD,FT-SAMf_{D,\text{FT-SAM}}, the re-activation dataset Dp,Δξ>D_{p,\Delta{\xi}}> combanation. Here Dp,ΔξD_{p,\Delta{\xi}} denotes the new poisoned samples generated by our methods.

  • And "WBA, BBA, and TA" of the 9th, 10th, and 11th columns in Table 6 correspond tofD,SAU,Dp,Δξf_{D,\text{SAU}}, D_{p,\Delta{\xi}} in Table 1. They represent the <defense model fD,SAUf_{D,\text{SAU}}, the re-activation dataset Dp,Δξ>D_{p,\Delta{\xi}}> combanation. Here Dp,Δξ\mathcal{D}_{p,\Delta{\xi}} denotes the new poisoned samples generated by our methods.

Analysis:

By comparing "WBA, BBA, and TA" of the 5th, 6th, and 7th columns with "FT-SAM", and by comparing "WBA, BBA, and TA" of the 9th, 10th, and 11th columns with "SAU", we can find that our three kinds of attacks do not markedly increase the TPR compared the <the defense model, the original poisoned dataset > pairs. These findings provide insights to develop more stealthy re-activation backdoor attacks in the future.

Thank you again for pointing out this ambiguity and your dedication to the review process. We will modify it in the revision.

Comment

Thank you for the clarification you provided. The results reported in the Table are easier to read now.

Comment

We deeply appreciate your thoughtful consideration and your invaluable feedback. Every concern and suggestion you've offered is highly appreciated and will assist us in improving our work. Thank you once again.

Review (Rating: 4)

This paper presents a novel insight into adversarial robustness, concerning whether backdoor attacks are effectively removed once defenses have been applied. Interestingly, the authors claim that these backdoors are still embedded inside compromised machine learning models. Hence, the authors propose a novel metric, the Backdoor Existence Coefficient (BEC), to detect this issue. The authors show that models both with and without backdoor defenses correlate positively with this metric, while clean models do not.

Strengths

Very interesting insights on backdoor defenses. BEC is timely and interesting, since it addresses a relevant problem. Also, this metric can be used to understand whether a defense is working correctly or not, opening the road to novel techniques that strip backdoors away.

Clear description of the internals of the technique. Authors clearly explain how they retrieve neurons, and how to use CKA to compare them with back-doored models.

Weaknesses

Backdoor reactivation sounds like a new backdoor. My only concern is that the reconstruction of the backdoor is similar to injecting one. Thus, it is not clear whether the attack is creating a novel backdoor or really manipulating weights to make that backdoor active again.

Unclear transfer evaluations. The authors state that transfer attacks exhibit strong performance, but they do not clarify how. In general, this is counterintuitive, since black-box evaluations should be less effective than those that exploit perfect knowledge of the model. The authors should better clarify this point.

(Minor) Paper structure can be improved. Both the abstract and the introduction focus on the metric BEC, but it is then relegated to a brief discussion in Section 4 (Figure 3).

Questions

Are the authors sure that their re-activation is not creating novel backdoors? The authors must clarify this crucial aspect. If so, the paper is really bringing plenty of contributions to the community. Otherwise, it is just another backdoor attack that is not adding much to the discussion around backdoors. I will be willing to increase my score if this point is clarified.

Limitations

The authors address the limitations of their work.

Author Response

We sincerely thank the reviewer for the valuable time spent reading our work and for the constructive comments. We are encouraged by the positive comments about the very interesting insights and clear description.

Q1. Whether backdoor re-activation is creating a novel backdoor or not.

R1: Thank you for your insightful question. We would like to refer you to the common response for a comprehensive analysis of the relationship between the original backdoor attack (OBA) and our re-activation attack (RBA) from three perspectives.

Moreover, we want to clarify that the attacks proposed in this work cover three scenarios, namely the white-box, black-box, and transfer-based scenarios, with the latter two representing more realistic situations. However, none of the three attacks involve revising model parameters. In particular, black-box attacks and transfer attacks operate under the assumption that the attacker lacks access to the target model's parameters. Thus, we are not engaged in any form of weight manipulation.

Overall, our re-activation backdoor attack shows a strong correlation with the original backdoor, and our attacks involve no modification of network weights. We hope the above points address your questions.

Q2. Clarification of the setting of transfer attack (TA).

R2: Thanks for this concern. We would like to clarify the setting of transfer-based re-activation attack (TA), and present some analyses about its effectiveness.

  • Setting of TA in our main manuscript.

    • Threat model: As described in Lines 132-135 of the main manuscript, the adversary trains a backdoored model $f_\text{A}$ and releases it to users. The user receives $f_\text{A}$ and obtains a defense model $f_\text{D}$ from $f_\text{A}$ via some post-training defense. Thus, the adversary does not know the exact defense method, but has full information about the original trigger $\boldsymbol{\xi}$ and the model $f_\text{A}$, which has the same architecture as $f_\text{D}$.
    • Main steps in TA (see the sketch after this list): There are three main steps in TA. Step 1: Based on $f_\text{A}$, the adversary obtains a surrogate model (or a collection of surrogate models), denoted $f_{\text{D}'}$, by applying some existing post-training backdoor defenses. Step 2: The adversary conducts a white-box re-activation attack against $f_{\text{D}'}$, starting from the original trigger $\boldsymbol{\xi}$, to obtain a new trigger $\boldsymbol{\xi}'_{\text{D}'}$. Step 3: The adversary adopts $\boldsymbol{\xi}'_{\text{D}'}$ to attack the target defense model $f_{\text{D}}$ during inference.
    • Why TA is effective for re-activating the backdoor: As analyzed in Q1 of the Common Response and Sec. 4.4 of the main manuscript, the original backdoor still exists in defense models. In other words, the backdoor effects of both the surrogate model $f_{\text{D}'}$ and the target model $f_{\text{D}}$ are highly similar to that of the original backdoored model $f_\text{A}$. Thus, it is natural to deduce that the triggers $\boldsymbol{\xi}'_{\text{D}'}$ and $\boldsymbol{\xi}'_{\text{D}}$ optimized on them also have similar backdoor effects, which explains the high ASR of our TA.
  • Further evaluation with a stricter setting: transfer attack across model architectures.

    • Threat model: Here we consider a stricter setting in which the adversary can only manipulate the training dataset and has no access to the training and post-training stages. Thus, the adversary only knows the original trigger $\boldsymbol{\xi}$, but has no knowledge of $f_\text{A}$ or $f_\text{D}$. Compared to the above threat model, one major challenge is the unknown architecture of the target model $f_\text{D}$.
    • Main attack steps: Compared to the steps in the above setting, there is one additional step: the adversary first trains a backdoored model $f'_\text{A}$ on $D_p$, whose architecture differs from that of $f_{\text{D}}$. All remaining steps are the same as in the above setting.
    • Experimental results: As shown in Table 1, although the transfer attack across model architectures does not achieve as high an ASR as the transfer attack with the same architecture (i.e., the results in Table 3 of the main manuscript), it still exhibits a certain degree of backdoor transferability, which is an intriguing phenomenon worthy of further exploration.
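
As referenced in the step list above, here is a hedged, code-style sketch of the TA pipeline; the helpers `defend` (a post-training defense callable), `white_box_reactivation`, and `apply_trigger` are placeholders for illustration, not the authors' API.

```python
# Hedged sketch of the three TA steps; every helper here is an assumed placeholder.
def transfer_reactivation_attack(f_backdoored, trigger, clean_loader, target_label, defenses):
    # Step 1: the adversary applies existing post-training defenses to its own
    #         backdoored model to build surrogate defense model(s) f_{D'}.
    surrogates = [defend(f_backdoored, clean_loader) for defend in defenses]
    # Step 2: run the white-box re-activation attack against the surrogate(s),
    #         starting from the original trigger xi, to obtain xi' = xi + delta.
    delta = white_box_reactivation(surrogates, clean_loader, trigger, target_label)
    new_trigger = trigger + delta
    # Step 3: at inference time, query the (unknown) target defense model f_D
    #         with apply_trigger(x, new_trigger); a single query per image suffices.
    return new_trigger
```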

In summary, the effectiveness of the transfer re-activation is mainly due to the high backdoor effect similarity between surrogate and target defense models, even in different architectures. We think its performance could be further improved by borrowing ideas from the field of transfer adversarial attacks. Its practical threat should be seriously considered by future defenses.

Table 1: Transfer re-activation attack (ASR %) against the target model PreAct-ResNet18, using different source models (WideResNet28-2, ResNet18, VGG19-BN).

| Attack ↓ | WRN28, i-BAU | WRN28, FT-SAM | WRN28, SAU | ResNet18, i-BAU | ResNet18, FT-SAM | ResNet18, SAU | VGG, i-BAU | VGG, FT-SAM | VGG, SAU |
|---|---|---|---|---|---|---|---|---|---|
| BadNets | 95.6 | 84.1 | 60.0 | 53.4 | 30.9 | 26.5 | 89.2 | 71.7 | 48.6 |
| Blended | 98.5 | 98.5 | 83.2 | 79.1 | 75.5 | 64.1 | 97.9 | 92.8 | 90.1 |

Q3. Paper structure can be improved.

R3: Thanks for this constructive suggestion. We would like to clarify that since the backdoor existence phenomenon and its metric BEC are the most critical contributions of this work, we emphasize them in both the Abstract and the Introduction. Following your suggestion, we will expand Sec. 3.2 by adding some analysis of the characteristics of the proposed BEC score in the revised manuscript.

Finally, we sincerely hope that the above responses address your concerns well. We will be more than glad to have further discussions with you if there are any remaining concerns. Once again, we express our heartfelt gratitude for your valuable and constructive comments.

Comment

While I deeply appreciate all the effort in the answer (really, you did a lot of insightful work), I still have some doubts about the reactivation of the network. In particular, I still don't get how you test the latter: I would say that a backdoor is reactivated if the original, unmodified one is effective again without the optimization. Thus, the addition of an epsilon to the existing backdoor is already creating another backdoor by definition. Thus:

  • if the success rate of the backdoored model and the reactivated-backdoored model is similar for the previous backdoor (but not effective on clean models), then it would really seem that the model has its backdoor reactivated

  • however, the discussion about transfer attacks seems to point in a different direction. It is not clear whether such a result is achieved because of the choice of the surrogate, or really because of the backdoor.

I'm still unsure about raising my score, while I totally see the efforts of the authors.

Comment

Dear Reviewer G6gk,

Firstly, we greatly appreciate your valuable time and professional comments.

In the above response, we have further clarified that the original backdoor and the re-activated backdoor are highly similar in backdoor effect, close to each other, and highly different from another new backdoor (i.e., the general universal adversarial perturbation attack or natural backdoor attack). Their ultimate goal is the same, which is to guide the model's inference behavior such that poisoned samples containing a trigger are predicted as the target label.

As for the terminology "re-activation", please refer to our further clarification of the terminology in our response to Reviewer AhAB, which shows that the ultimate goal of the adversary is to guide the model's inference behavior. It is not imperative that the trigger used in backdoor injection and in backdoor activation remain the same. In the backdoor activation stage, if the original trigger does not succeed, the adversary can alter the trigger using our re-activation attack methods to re-activate the backdoor.

Moreover, our primary concern is contributing to the field of backdoor learning. We believe, and hope you agree, that this work could stimulate a re-evaluation of the efficacy of post-training backdoor defenses, thereby inspiring future work toward safer defenses.

We sincerely appreciate that you can take more time to consider our response, which will strengthen this work. Looking forward to your feedback, and we are glad to discuss with you for any remaining concern.

Best regards, Authors

Comment

Firstly, we would like to show our gratitude for your valuable time and professional reviews, as well as the recognition of our efforts. We wish to further clarify and summarize our analyses to address your remaining concern.

  • The original backdoor and the re-activated backdoor are not exactly the same. However, they are highly similar in backdoor effect, close to each other, and highly different from another new backdoor (i.e., the general universal adversarial perturbation attack). Since the parameter weights of the original backdoored model and the defense model are different, and their corresponding triggers are different, we say they are not exactly the same. For the second point, we believe the global response has shown very clear evidence. In short, from the intrinsic perspective, our claim of re-activating the original backdoor, rather than creating another new backdoor, is correct.
  • In terms of the transfer attack, in the previous responses we have shown that surrogate models with the same or different architectures as the original backdoored model are effective for the transfer attack. The reason is that both the surrogate model and the backdoored model are trained on the same poisoned dataset, which is the origin of the backdoor; thus, they are highly similar to each other in backdoor effect. Hence, backdoor re-activation is independent of the choice of surrogate. We do not see any conflict with the phenomenon of backdoor re-activation. If you could further specify your point, that would be very helpful.

We really hope the above clarification can address your remaining concern. We would be glad to discuss further with you; such in-depth discussion is very valuable for revealing the real value of our work.

We really understand your caution when facing a new finding. As said by Albert Einstein,

"The framing of a problem is often far more essential than its solution".

We strongly believe that the problem of backdoor re-activation observed in this work could significantly influence the development of backdoor learning, forcing people to reconsider the effectiveness of post-training backdoor defense, which is now one of the mainstream types of backdoor defenses.

We are fortunate that your rigorous and professional reviews are critical in helping us give a solid definition and analysis of this new problem. We sincerely appreciate that you can take more time to consider our response and give a comprehensive assessment of the value of this work.

Comment

I too think that phrasing such as "backdoor re-activation" and "dormant backdoor" can be confusing/misleading because it suggests that the model has some state that is changed by "re-activation" (that "re-activation" changes the model parameters). However, the model parameters remain the same since the attacker cannot influence them after training. What changes by "re-activation" is the state of the attack, based on finding a different backdoor that is already present in the defended model.

I suggest a slightly different framing that might hopefully be more clear and universal.

A backdoor is a property of the model instance and the goal of the attacker. If there exists a (universal) adversarial perturbation that can cause the model instance to behave as desired by the attacker, the model has a backdoor, and the (universal) adversarial perturbation is the corresponding trigger.

The backdoor is a natural backdoor if it is present independently of whether the training data is poisoned.

The goal of a defense is to prevent/remove the backdoor (ideally all backdoors) compared to an undefended model.

However, as demonstrated in the paper, many defenses remove the backdoor initially intended by the attacker, but there remains a different backdoor that is stronger than a natural backdoor, and the methods presented in the paper can find it.

What do you think?

Comment

We greatly appreciate your patient explanation, and now we clearly see the point that was confusing.

Firstly, we think the descriptions from Paragraph 3 to 5 (i.e., "The backdoor is a property ... compared to an undefended model.") are very clear, showing your deep understanding of backdoor learning.

However, in terms of the description "but there remains a different backdoor" in Paragraph 6, we would like to further discuss with you. Let's consider backdoor learning from the adversary's perspective.

  • The adversary's ultimate goal is to cause a model to behave as it desires during inference, i.e., predicting the poisoned sample that contains a trigger as the target label, while predicting clean samples correctly.
  • To achieve that goal, the adversary usually takes two steps:
    • Step 1: Backdoor injection. It aims to form a stable mapping from a trigger to the target label in a model, through manipulating the training dataset (i.e., data poisoning based), manipulating the training process (i.e., training controllable based), or manipulating both.
    • Step 2: Backdoor activation. When the backdoored model is deployed by the model owner to provide query service to the customer, the adversary will try to activate the injected backdoor to achieve its goal, via querying the model with the poisoned sample containing a trigger.
  • What happens if the backdoor activation fails? There are two possible outcomes of the backdoor activation step, i.e., success or failure. If it succeeds, the adversary wins. If it fails, what can the adversary do? Firstly, it may guess that some post-training defense has been conducted by the owner on the backdoored model to change its parameters, such that the original mapping from the original trigger to the target label cannot be activated. Unfortunately, there is no chance for the adversary to change the model. If the adversary aims to re-activate the backdoor, its only choice is to change the trigger, using our re-activation attack methods. Although the new trigger is different from the trigger used in backdoor injection, and the model has also been changed by the defender, the ultimate goal of the adversary has been achieved.
  • For a backdoor attack, is it necessary to keep the trigger the same in both backdoor injection and backdoor activation? Our answer is NO. We notice that, in most existing backdoor attacks, there is a default setting that the triggers used in backdoor injection and backdoor activation are the same. But is it really necessary? NO.
    • Firstly, from the above description of the adversary's ultimate goal (or from previous backdoor learning works), we don't see such a requirement. Thus, changing this setting will not violate the definition of backdoor attack.
    • Actually, this setting has already been violated in a few existing works, such as Alpha-Blend [1] (which calls it an asymmetric trigger) and the distributed backdoor attack [2]. However, note that their activated model is still the original backdoored model, and their success is mainly due to the trigger generalization phenomenon. They are intrinsically different from our re-activation attack, as in our case the model has been changed by the defender, and the reason for success is also different (please refer to the common response; we do not repeat it here).

In summary, from the adversary's perspective, we think it is reasonable to call our attack backdoor re-activation, because its goal has been achieved, even though the trigger is not exactly the same as the trigger used in backdoor injection and the backdoored model has been changed by the defender.

We sincerely hope the above descriptions make the terminology clearer. However, we do not insist on the name. The most important thing is to make a real contribution to the field of backdoor learning. We believe, and you may agree, that this work could attract researchers' attention to re-investigating the effectiveness of post-training backdoor defenses and inspire future work on safer defenses.

The in-depth discussions with you are very enjoyable and really helpful in making this work more solid. Greatly appreciated.

References:

[1] Revisiting the Assumption of Latent Separability for Backdoor Defenses. ICLR, 2023.

[2] Distributed Backdoor Attacks against Federated Learning. ICLR, 2019.

Comment

Thank you for the elaboration and the references! I think that I agree with everything that you say. I agree that the adversary can use a different trigger in test time than the trigger used in training time, and I see no contradiction with my framing. I hope that the discussion will also address reviewer G6gk's concerns.

Comment

Dear Reviewer AhAB,

Thanks a lot, and so happy that we have the agreement.


Dear Reviewer G6gk,

We sincerely hope that the above discussions with Reviewer AhAB are also helpful in addressing your concern. Looking forward to your feedback, and we are glad to discuss any remaining concern with you.

We sincerely appreciate the valuable time and professional comments from all reviewers.

Best regards, Authors

Comment

Dear Reviewer G6gk,

We sincerely hope that our latest response, as well as the discussions with Reviewer AhAB in this thread, are helpful in addressing your remaining concern. Looking forward to your feedback, and we are glad to discuss any remaining concern with you.

We sincerely appreciate your valuable time and professional comments.

Best regards, Authors

Review (Rating: 5)

Deep neural networks have been demonstrated to be vulnerable to backdoor attacks. Various existing defense strategies have been proposed to remove backdoors. However, this paper observes that backdoors still remain in the models. It introduces a new metric called the backdoor existence coefficient that measures the existence of backdoors in deep neural networks. These dormant backdoors can be easily reactivated during inference with minor perturbations, even under black-box conditions where only model queries are available. Specifically, the paper adds an additional perturbation to the original backdoor trigger and aims to optimize this perturbation to recover the attack effect. The evaluation is conducted on three datasets, two models, and a few multimodal models. The experiments show that the proposed approach can successfully recover the original attack performance.

Strengths

  1. The paper observes that backdoors are simply dormant after applying defenses. They can be re-activated, which is an interesting observation.

  2. The proposed metric can measure whether backdoors are indeed removed from the model.

Weaknesses

  1. While the observation is interesting, the method used in this paper to reactivate the backdoors is questionable. What the paper does is simply add an $L_\infty$ perturbation to the original trigger and then optimize the perturbation. There are two problems with this approach. Firstly, existing work [1] has already shown that it is very easy to generate a universal adversarial perturbation to cause misclassification. How does the paper guarantee that the final optimized trigger is not simply a universal adversarial perturbation? Secondly, natural backdoors can be easily identified in deep neural networks. Existing work [2] has conducted a comprehensive study showing that various injected backdoors can manifest in clean models. How does the paper ensure that the generated triggers are not simply natural backdoors? Actually, the results reported in Table 5 confirm this concern. For example, on a clean model Res18+CIFAR-10, the attack success rate is 85%, which is only 8% lower on a defense model. This shows that the optimization is not specific to "re-activate" injected backdoors but simply crafts a natural backdoor.

  2. Following the last point, there is no metric used in the paper to measure the faithfulness of the recovered backdoor. Without a comparison between the injected trigger and the recovered trigger, it is hard to distinguish whether the proposed approach actually reactivates the injected backdoor or simply generates a universal adversarial perturbation or a natural backdoor. Without a rigorous evaluation of the generated trigger, the claim of "re-activation" should be avoided.

  3. There are many recent defenses, such as SEAM [3] and CT [4], that were not evaluated in the paper. It is important to test as many defenses as possible to show the generalizability of the observations. Otherwise, it gives readers a false sense that all the defenses have the same problem.

  4. The re-activation is based on the original trigger and an optimized perturbation. What if the original trigger is not used and the perturbation is simply optimized? Will this also achieve the attack performance? If this is the case, it means the generated perturbation is simply a universal adversarial perturbation or a natural backdoor.

[1] Moosavi-Dezfooli, Seyed-Mohsen, et al. "Universal adversarial perturbations." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
[2] Tao, Guanhong, et al. "Backdoor vulnerabilities in normally trained deep learning models." arXiv preprint arXiv:2211.15929 (2022).
[3] Zhu, Rui, et al. "Selective amnesia: On efficient, high-fidelity and blind suppression of backdoor effects in trojaned machine learning models." 2023 IEEE Symposium on Security and Privacy (SP). IEEE, 2023.
[4] Qi, Xiangyu, et al. "Towards a proactive {ML} approach for detecting backdoor poison samples." 32nd USENIX Security Symposium (USENIX Security 23). 2023.

Questions

See above.

Limitations

The authors have addressed the limitations.

Author Response

Thank you for the positive review on the interesting observation. Your insightful questions and concerns are greatly appreciated. Please inform us if these responses effectively address all your inquiries.

Q1. The differences among our re-activation attack (RBA), the original backdoor attack (OBA), and general universal adversarial perturbation attack (gUAA, i.e., natural backdoor).

R1: Thanks for your insightful concerns. In the Common Response, we conduct a comprehensive comparison among our RBA, the original backdoor attack (OBA), and gUAA. This analysis is undertaken from three distinct aspects: the activation mechanism of the backdoor effect, the search rate, and the robustness against random noise. The three-perspective analyses verify that our RBA method finds a backdoor highly correlated with the original backdoor, rather than a less correlated backdoor (i.e., a new backdoor) or a general universal adversarial perturbation (i.e., a natural backdoor). Thus, we can claim that our RBA actually re-activates the original backdoor.

Please refer to the Common Response for our analysis. We hope the above points address your concern, and we will clearly clarify the differences in the revised manuscript.

Q2. Suggestion of adding evaluations of some recent defenses, such as SEAM and CT.

R2: Thanks for this constructive suggestion.

  • Additional evaluations: Following your suggestion, we evaluate the performance of our re-activation attack against SEAM and CT, respectively. The evaluations are conducted on CIFAR-10 dataset with PreAct-ResNet18 network, and the results are shown in Table 1. It is found that both SEAM and CT are vulnerable to the proposed re-activation attack.
  • Clarification: We would like to emphasize that we have never claimed that all post-training defenses are vulnerable to the re-activation attack, which would not be rigorous. The main aims and contributions of our work are (1) revealing this new threat, which has been verified on several classic post-training defenses, and (2) providing effective tools to evaluate the vulnerability of any old or new post-training defense. Consequently, future post-training defenses should consider this threat and bypass the proposed re-activation attack.

Table 1: ASR (%) of our RBA attack against SEAM and CT.

| Original attack ↓ | SEAM: No re-activation | SEAM: WBA | SEAM: BBA | CT: No re-activation | CT: WBA | CT: BBA |
|---|---|---|---|---|---|---|
| BadNets | 5.33 | 97.53 | 33.51 | 0.00 | 99.58 | 92.21 |
| Blended | 6.79 | 98.40 | 69.79 | 1.34 | 100.00 | 99.00 |
| Input-aware | 1.27 | 92.10 | 48.59 | 70.95 | 99.96 | 85.80 |
| LF | 13.22 | 97.61 | 65.63 | 3.28 | 99.63 | 99.23 |

Q3. What if the original trigger is not used and the perturbation is simply optimized?

R3: Thanks. Our response is expanded from two aspects:

  • If the original trigger is not used, the attack devolves into a general universal adversarial perturbation attack (gUAA). Actually, the distinctions between our RBA attack (starting from the original trigger) and gUAA have been carefully analyzed from three perspectives in the Common Response. It is clear that our re-activation attack re-activates the original backdoor, rather than finding a general universal adversarial perturbation or a natural backdoor.
  • Additional comparisons under the white-box attack setting and analysis: In addition to the black-box comparisons shown in Table 2 of the Common Response, here we also supplement white-box comparisons, as shown in Table 2 below. Although the ASR values of gUAA are also high, its activation mechanism is intrinsically different from that of our re-activation attack (refer to the analysis in the Common Response). gUAA merely verifies once again the well-known threat that DNN models are likely to be vulnerable to (universal) adversarial attacks. In contrast, our re-activation attack reveals a new threat to existing post-training backdoor defenses.

Table 2: Comparisons between our RBA and gUAA under the white-box setting, measured by ASR (%).

| Original backdoor ↓ | SAM: gUAA | SAM: RBA | SAU: gUAA | SAU: RBA |
|---|---|---|---|---|
| Input-Aware | 83.08 | 96.19 | 78.72 | 85.39 |
| LF | 87.29 | 97.40 | 83.78 | 90.74 |
| SSBA | 84.52 | 92.80 | 83.50 | 89.86 |
| Trojan | 88.34 | 96.18 | 79.04 | 87.61 |
| Wanet | 82.07 | 94.95 | 80.82 | 95.33 |
Comment

Dear Reviewer ULSi,

We sincerely wish to convey our gratitude for your investment of time and insightful remarks. Your forthcoming feedback is greatly anticipated, specifically regarding the issues we have tackled in our rebuttal. Our primary aim is to ensure that our rebuttal is closely aligned with your suggestions. Your contributions are important to the enhancement of our work.

Best regards, Authors

Comment

I appreciate the authors' comprehensive rebuttal. It addresses most of the concerns, and I am raising my score. Please include all the results and discussions in the final version. Thanks!

Comment

Dear Reviewer ULSi,

We sincerely appreciate your valuable time and positive feedback. We will add all above results and discussions into the final version, which will make this work more solid. Thanks.

Sincerely, Authors

Review (Rating: 8)

The paper investigates the idea that an attacker could modify the trigger to make it effective against models modified by post-training defenses (to re-activate the backdoor in defended models).

The paper presents a measure of how much the effect of the backdoor persists in model activations and shows empirically that post-training-defended models have intermediate activations more similar to backdoored models than to clean models when given backdoored inputs.

The paper defines 3 $\ell_p$-bounded universal adversarial attacks for backdoor re-activation: a white-box (access to parameters), a black-box (query access), and a transfer attack (access to model architecture).

The attacks are very successful in most cases, but have a limited effect for some attack-defense combinations. Validation experiments show that the universal adversarial attacks are more effective on defended models than on clean models.

Strengths

  • The paper is very well written and structured. It is clear and easy to read.
  • The "backdoor re-activation" investigation is interesting and the proposed method is novel.
  • The experimental evaluation, including validation studies, is very well done and provides valuable insights.
  • The source code is provided and will presumably be published.

Weaknesses

I do not see any very important weaknesses.

Some phrases are or might be unclear:

  • L9: "lie dormant" (the state of the defended model instance does not change).
  • L163: What is the BEC of $f_{\boldsymbol\theta_\text{A}}$? Eq. (3) defines it as a function of multiple model instances.
  • L168: "greater existence"
  • L242: What do the percentages represent in particular. Are they percentages (multiplicative) or percentage points (additive)?
  • The claim (2) in L243 is not completely accurate because an important part of the effect is due to the pure effect of the adversarial attack (table 5).

Justification of the definitions in Eq. (2) and (3):

  • Why is CKA a good choice? What principles or assumptions is it based on?
  • Why does BEC include intermediate model activations rather than the final activations?

Errors:

  • Fig. 5: 1. The SIG example looks like a product of the image and the sinusoid, but it should be a sum according to the original paper (see footnote 1). What caused this? 2. The clean image is not the same crop as the triggered images.

Writing mistakes:

  • L40: missing comma before "backdoor",

  • L137: "remove backdoor" -> "remove the backdoor", defined -> denoted,

  • L173: The equation $\boldsymbol\xi'=\boldsymbol\xi + \Delta_{\boldsymbol\xi}$ is not consistent with the definition of $\boldsymbol x_{\boldsymbol\xi}$ if the function $(\boldsymbol x, \boldsymbol\xi) \mapsto \boldsymbol x_{\boldsymbol\xi}$ is not the addition operator.

  • There is no $\min$ on the right side of Eq. (4).

  • Eq. (5): the model instance uses the same notation as the logit functions $f_i$.

  • L200: "learning task" -> "learning tasks"

  • L204: CIFAR-10c 2020

  • L253: groupe

  • L258: "on weak" -> "on the weak"

  • L261: adversarys

I apologize for any errors on my part.

Footnotes

  1. A new Backdoor Attack in CNNs by training set corruption without label poisoning

Questions

Questions:

  • Why is $\lambda$ set to $1$ rather than $0$? What happens when $\lambda=0$?

Suggestions:

  • It might be good to reference the work that introduced universal (image-agnostic) adversarial perturbations (see footnote 1).
  • L100: Mention how [34] attempted to enhance the backdoor signal at inference and clarify what this means.
  • L133: introduce the meaning of the $\boldsymbol x_{\boldsymbol\xi}$ notation (there is a triggering function $(\boldsymbol x, \boldsymbol\xi) \mapsto \boldsymbol x_{\boldsymbol\xi}$).
  • L230: remove the part "we just divided these defenses into two groups: (1) NC, NAD, i-BAU; and (2) FT-SAM, SAU, FST" because it makes the reader ask why, but it is explained later and not so important at this point.
  • Fig. (3) c: Mention the attack.
  • Increase the size of Table 3.

Footnotes

  1. Universal adversarial perturbations

Limitations

The paper provides a short (but valuable) discussion on limitations. It might be improved by noting (again) that the proposed attack applies to post-training defenses and that other defenses (including test-time defenses) are not considered in the paper.

Adversarial training and randomized smoothing were not considered as stronger defenses.

Author Response

Firstly, we would like to show our sincere appreciation to the reviewer for dedicating valuable time, offering gracious affirmation, and providing constructive comments.

Q1. Why is CKA a good choice in Eq. (2) and (3)?

R1: Thanks for this insightful comment. We would like to explain from two aspects:

  • Our task and challenge: In this work, we need to measure the degree of backdoor existence within a model by measuring the backdoor effect similarity between models. We calculate the similarity between feature maps of backdoor-related neurons, which are structured and high-dimensional feature representations. For such representations, traditional metrics such as the $L_k$ norm distance or cosine distance could disrupt the structural information of these representations.
  • CKA is highly suitable for our task: CKA [1] measures the similarity between two representations; it utilizes HSIC to measure the statistical dependence between two distributions. CKA is widely used for measuring the similarity of high-dimensional representations between networks and is thus highly suitable for our task. Its effectiveness is verified in our work by comparing the relationship between BEC and our re-activation attack ASRs (Fig. 3(b)).
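
For concreteness, a minimal NumPy sketch of linear CKA in the spirit of [1]; the paper may use a kernel variant and restricts the comparison to backdoor-related neurons, so treat this as an illustration rather than the exact implementation.

```python
# Minimal linear-CKA sketch (cf. Kornblith et al., 2019); illustration only.
import numpy as np

def linear_cka(X, Y):
    """X, Y: (n_samples, n_features) activation matrices of two models on the same inputs."""
    X = X - X.mean(axis=0, keepdims=True)            # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic_xy = np.linalg.norm(Y.T @ X, ord='fro') ** 2
    norm_x = np.linalg.norm(X.T @ X, ord='fro')
    norm_y = np.linalg.norm(Y.T @ Y, ord='fro')
    return hsic_xy / (norm_x * norm_y)
```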

Q2. Why does BEC include intermediate model activations in Eq. (3)?

R2: We compute BEC across the entire network because the backdoor attack has a cumulative effect, and each layer contributes to a successful backdoor attack. We have empirically demonstrated that our BEC metric and the re-activation ASR have a high positive correlation in Fig. 3(b) of the main manuscript, and the detailed ASR and BEC values are shown in Table 1. This positive correlation shows that BEC is effective in measuring backdoor existence.

Table 1: Comparison between ASR and BEC.

| Attack | Metric | NAD | i-BAU | FT-SAM | SAU | FST |
|---|---|---|---|---|---|---|
| SSBA | ASR | 0.997 | 0.913 | 0.928 | 0.899 | 0.941 |
| SSBA | BEC | 0.771 | 0.500 | 0.502 | 0.313 | 0.765 |
| Wanet | ASR | 0.962 | 0.947 | 0.949 | 0.953 | 0.976 |
| Wanet | BEC | 0.656 | 0.733 | 0.607 | 0.619 | 0.758 |
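
Given such per-layer similarity scores, the BEC aggregation implied by the normalized-ratio form shown later in R6 could be sketched as follows; `layerwise_cka` is a hypothetical helper returning the per-layer scores $S^{(l)}$ on the poisoned set, and this is an inferred illustration, not the released code.

```python
# Inferred sketch of the BEC aggregation over N layers; layerwise_cka is a
# hypothetical helper returning [S^(1), ..., S^(N)] for a pair of models,
# evaluated on D_p and restricted to backdoor-related neurons.
def bec_score(f_def, f_att, f_clean, poisoned_loader, layerwise_cka):
    s_da = layerwise_cka(f_def, f_att, poisoned_loader)    # defended vs. attacked
    s_ca = layerwise_cka(f_clean, f_att, poisoned_loader)  # clean vs. attacked (lower reference)
    s_aa = layerwise_cka(f_att, f_att, poisoned_loader)    # attacked vs. itself (upper reference)
    ratios = [(d - c) / (a - c) for d, c, a in zip(s_da, s_ca, s_aa)]
    return sum(ratios) / len(ratios)                        # average over the N layers
```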

Q3. Why is $\lambda$ set to 1 rather than 0? What if $\lambda=0$?

R3:

  • The function: The second term in Eq. (4) serves as a regularization term, encouraging the optimization to minimize the highest probability among non-target classes, thereby enhancing the potency of the attack. If $\lambda=0$, the model might maintain high output probabilities for non-target classes, potentially diminishing the attack performance.
  • Results: Table 2 illustrates the results of WBA when $\lambda=0$. It shows a slight decrease in the overall attack effectiveness with $\lambda=0$.

Table 2: Re-activation attack performance (%) under different values of $\lambda$.

| Original attack ↓ | FT-SAM ($\lambda=0$) | FT-SAM ($\lambda=1$) | SAU ($\lambda=0$) | SAU ($\lambda=1$) | FST ($\lambda=0$) | FST ($\lambda=1$) |
|---|---|---|---|---|---|---|
| BadNets | 93.2 | 94.7 | 92.8 | 93.1 | 96.1 | 97.9 |
| Input-Aware | 95.6 | 96.2 | 84.2 | 85.4 | 89.4 | 90.7 |
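
One plausible reading of the objective described in R3, written as a hedged sketch; this is an illustration consistent with the description above (a target-class loss plus $\lambda$ times the largest non-target probability), not necessarily the paper's exact Eq. (4).

```python
# Hedged illustration of the role of lambda in R3; not necessarily the paper's Eq. (4).
import torch
import torch.nn.functional as F

def reactivation_objective(logits, target_label, lam=1.0):
    target = torch.full((logits.size(0),), target_label, dtype=torch.long)
    ce = F.cross_entropy(logits, target)                  # pull outputs toward the target class
    probs = F.softmax(logits, dim=1)
    mask = torch.zeros_like(probs, dtype=torch.bool)
    mask[:, target_label] = True                          # mask out the target class
    reg = probs.masked_fill(mask, 0.0).max(dim=1).values.mean()  # highest non-target probability
    return ce + lam * reg                                 # lambda = 0 removes the regularizer
```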

Q4. Attack against test-time defenses.

R4: Thanks. Here we test our attack against test-time defenses.

  • Settings: With the poisoned samples optimized by our re-activation attack on defense models obtained with FT-SAM, we test the effectiveness against the test-time defenses STRIP [2], ZIP [3], and SCALE-UP [4].
  • Results: Table 3 shows that our attack maintains a high ASR against ZIP. For SCALE-UP and STRIP, there is a significant decrease in ASR; however, the model's ACC is also low in those cases.
  • Analysis: This experiment inspires the future design of attack methods capable of evading test-time defenses. Strategies might include closely matching the feature distributions of clean data and preventing overly strong activations.

Table 3: Performance (%) against test-time defenses.

| Attack ↓ | SCALE-UP ASR | SCALE-UP ACC | STRIP ASR | STRIP ACC | ZIP ASR | ZIP ACC |
|---|---|---|---|---|---|---|
| BadNets | 29.8 | 53.7 | 83.4 | 9.3 | 23.6 | 80.7 |
| Blended | 34.1 | 46.2 | 49.2 | 9.2 | 48.1 | 81.5 |

Q5. The SIG example in Fig. 5.

R5: Thanks. It is somewhat of an illusion: the mechanism in use is actually an additive trigger, which aligns accurately with the approach described in the original paper. Regarding the crop, we will modify it in the revision.

Q6. Some unclear phrases.

R6: Thanks for your careful checking of our manuscript. Our clarification of the indicated phrases is as follows:

  • L9: "lie dormant". It means that the backdoor cannot be activated by the original trigger, but it still exists in the model. L163: What is BEC of fθAf_{\theta_A}?It is 1,because ρBEC\rho_{\text{BEC}} (fθAf_{\theta_{\text{A}}}, fθAf_{\theta_{\text{A}}},fθCf_{\theta_{\text{C}}};DpD_p) = 1N\frac{1}{N} l=1N\sum_{l=1}^{N} SA,A(l)(Dp)SC,A(l)(Dp)SA,A(l)(Dp)SC,A(l)(Dp)=1\frac{S_{\text{A},\text{A}}^{(l)}(D_{p})-S_{\text{C},\text{A}}^{(l)}(D_{p})}{S_{\text{A},\text{A}}^{(l)}(D_{p})-S_{\text{C},\text{A}}^{(l)}(D_{p})}=1.

Note that the second and third arguments of $\rho_{\text{BEC}}$ serve as two reference models to measure the backdoor existence in the model corresponding to the first argument.

  • L168: "greater existence". We will modify it to stronger existence.
  • L242: What do the percentages represent in particular? They represent absolute improvements in ASR values.
  • L243: The claim (2) is not completely accurate. Thanks for this careful comment. Following your suggestion, we will revise it to "which shows the effectiveness of our re-activation attack method" in the manuscript.

Q7. Writing mistakes and suggestions.

R7: We appreciate your attentiveness to the errors in our manuscript. We will revise the paper based on your suggestions.

[1] Similarity of Neural Network Representations Revisited, ICML2019.

[2] STRIP: a defence against trojan attacks on deep neural networks, ACSAC2019.

[3] Black-box backdoor defense via zero-shot image purification, NeurIPS2023.

[4] SCALE-UP: an efficient black-box input-level backdoor detection via analyzing scaled prediction consistency, ICLR2023.

Comment

Thank you! You have made most of the things more clear. I appreciate your effort.

Regarding Q2, I was already aware of the correlation of BEC and ASR. I was thinking that it might be possible that some attack does not affect intermediate activations much, but it must change the final activation in order to be successful. Hence, it would be interesting to see how summing over intermediate activations compares to using only the final activations.

Regarding Q6, I was confused by $\rho_{\text{BEC}}$ having multiple parameters rather than one and by it being unclear that parameters are implicitly "extracted" from the functions passed on to Eq. (2).

Comment

Thank you for your response and affirmation, as well as your insightful suggestions. In regard to your queries, our responses are as follows:

  1. BEC computation over intermediate activations compared to using only the final activations.

    This question is indeed interesting. To study this, we compute our BEC metric using only the deep layers (the last block of PreAct-ResNet18), denoted as $\rho_{\text{BEC, Deep}}$, and all layers, denoted as $\rho_{\text{BEC, All}}$. The result is shown in Table 1:

    Table 1: Comparison between $\rho_{\text{BEC, All}}$ and $\rho_{\text{BEC, Deep}}$.

    | Defense → | NAD | i-BAU | FT-SAM | SAU | FST |
    |---|---|---|---|---|---|
    | $\rho_{\text{BEC, All}}$ | 0.615 | 0.607 | 0.599 | 0.575 | 0.584 |
    | $\rho_{\text{BEC, Deep}}$ | 0.583 | 0.585 | 0.547 | 0.515 | 0.572 |

    From the results presented in Table 1, it is evident that there is no significant difference between $\rho_{\text{BEC, Deep}}$ and $\rho_{\text{BEC, All}}$.

  2. Computation of \rho_{\text{BEC}}.

    We are thankful that you highlighted this confusion. Our expression in Eq. (3) indeed could lead to some misunderstanding, so we would like to clarify it.

    When calculating the BEC of the target model f_{\theta_D}, we need to compute the backdoor effect similarity by comparing the target model f_{\theta_D}, its corresponding original backdoored model f_{\theta_A}, and one clean model f_{\theta_C}. Therefore, although f_{\theta_D} is our main objective, we also need the information of f_{\theta_A} and f_{\theta_C}, which are variables as well.

    Regarding Eq. (2), A in S_{*,\text{A}} actually denotes the backdoored model f_{\theta_A} in Eq. (3), and C in S_{\text{C},\text{A}} denotes the clean model f_{\theta_C} in Eq. (3). We will clarify this in our revision.

    We hope the above response addresses your concerns. We truly appreciate your insightful suggestions, which contribute to increased clarity and rigor in our manuscript.
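
Returning to point 1 above, the sketch below shows one way the per-layer activation matrices could be collected for the "all layers" versus "deep layers only" variants before applying CKA. It is an illustration under our own assumptions: the forward hooks, the restriction to convolutional layers, and the "layer4" name prefix for the last block of PreAct-ResNet18 are hypothetical rather than the exact implementation.

```python
import torch

def collect_activations(model, images, layer_filter=lambda name: True):
    """Collect flattened activations from conv layers whose names pass layer_filter."""
    acts, hooks = {}, []
    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d) and layer_filter(name):
            hooks.append(module.register_forward_hook(
                lambda mod, inp, out, n=name: acts.__setitem__(n, out.flatten(1).detach())))
    with torch.no_grad():
        model(images)
    for h in hooks:
        h.remove()
    return acts  # dict: layer name -> (batch, features) activation matrix

# "All layers" vs. "deep layers only" (hypothetical "layer4" prefix for the last block):
# acts_all = collect_activations(model, x_poisoned)
# acts_deep = collect_activations(model, x_poisoned, lambda n: n.startswith("layer4"))
```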

Author Response

We sincerely thank all reviewers for their valuable time and constructive comments, and are encouraged by the positive remarks on the very interesting insights (AhAB, ULSi, G6gk, ccvM), novel ideas and good contribution (AhAB, G6gk, ccvM), moderate impact (ccvM), extensive experimental evaluation (AhAB, ccvM), and good writing (AhAB, G6gk).

We would like to present a common response to a critical comment raised by Reviewer ULSi and Reviewer G6gk.

Q1. Systematic comparison among original backdoor attack (OBA), re-activation attack (RBA), and general universal adversarial perturbation attack (gUAA).

R1: We highly appreciate this insightful comment. We analyze the relationships among RBA, OBA and gUAA from three perspectives:

  • Definitions and settings: To facilitate the understanding of our analysis, we first clarify the definitions and settings:

    • OBA means one existing backdoor attack following the standard backdoor injection and activation process. The trigger and backdoored model are denoted as \boldsymbol{\xi} and f_\text{A}, respectively;
    • RBA means that, given the defense model f_\text{D} (i.e., obtained by conducting one post-training defense on f_\text{A}), we aim to re-activate the injected backdoor of OBA by searching for a new trigger \boldsymbol{\xi}', starting from \boldsymbol{\xi}, i.e., based on some original poisoned samples D_p. The searched new trigger is formulated as \boldsymbol{\xi}' = \boldsymbol{\xi} + \Delta_{\boldsymbol{\xi}}, and the new poisoned dataset is denoted as D_{p,\Delta_{\boldsymbol{\xi}}};
    • gUAA means that, given f_\text{D}, we aim to search for a targeted universal adversarial perturbation (with the same target class as OBA and RBA) via adversarial attack, starting from clean samples D_c. The searched UAP is denoted as \Delta, and the perturbed dataset is denoted as D_{c,\Delta}. Note that it is also called a natural backdoor by Reviewer ULSi.
  • Analyses: Our analyses are as follows:

    • Activation mechanism of backdoor effect: We analyze the backdoor activation mechanism in each attack. As demonstrated in Lines 151-161 of Sec. 3.2, we adopt the Centered Kernel Alignment (CKA) metric to measure backdoor effect similarity between models, by comparing the backdoor-related neurons' activation maps (i.e., \tilde{m} in Line 156). Specifically, we calculate the following three CKA scores (a minimal linear-CKA sketch is given after this list of analyses):

      S_{\text{RBA,OBA}}=\frac{1}{N}\sum_{l=1}^{N}\text{CKA}\big(m_{\text{D}}^{(l)}(D_{p,\Delta_{\boldsymbol{\xi}}}),\, m_{\text{A}}^{(l)}(D_p)\big),

      S_{\text{gUAA,OBA}}=\frac{1}{N}\sum_{l=1}^{N}\text{CKA}\big(m_{\text{D}}^{(l)}(D_{c,\Delta}),\, m_{\text{A}}^{(l)}(D_p)\big),

      S_{\text{RBA,gUAA}}=\frac{1}{N}\sum_{l=1}^{N}\text{CKA}\big(m_{\text{D}}^{(l)}(D_{p,\Delta_{\boldsymbol{\xi}}}),\, m_{\text{D}}^{(l)}(D_{c,\Delta})\big).

      As shown in Table 1, S_{\text{RBA,OBA}} \gg S_{\text{gUAA,OBA}} \approx S_{\text{RBA,gUAA}} for all attack-defense pairs. This demonstrates that the backdoor activation mechanisms of RBA and OBA are highly similar, and that both differ significantly from that of gUAA.

    • Starting from the original trigger \boldsymbol{\xi}, it is easier and faster to find a new trigger \boldsymbol{\xi}' that achieves a high attack success rate (ASR): As shown in Table 2, given the same number of queries, the ASR of RBA is much higher than that of gUAA, and it also increases faster (see the sketch after Table 2). This demonstrates that RBA is much closer to OBA than gUAA.

    • Compared to the gUAP \Delta, both the original trigger \boldsymbol{\xi} and the new trigger \boldsymbol{\xi}' are much more robust to random noise. We have identified a characteristic, namely robustness to random noise, that distinguishes the trigger of an intended backdoor from the trigger of a natural backdoor (i.e., the gUAP). Specifically, we perturb \boldsymbol{\xi}, \boldsymbol{\xi}', and \Delta with the same level of random noise, and record the ASR of these attacks (see the sketch after Table 3). As shown in Table 3, both OBA and RBA are more robust than gUAA. This verifies that RBA returns an intended backdoor trigger similar to OBA, rather than a gUAP.

  • In conclusion, we believe the above analyses verify that our RBA method finds a backdoor highly correlated with the original backdoor, rather than a weakly correlated one (i.e., a new backdoor) or a general UAP (i.e., a natural backdoor). Thus, we claim that our RBA actually re-activates the original backdoor.
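
For completeness, the following sketch shows how each per-layer score above could be computed with linear CKA; the variable names and the assumption that activations are stacked as (samples x features) matrices are ours, and a kernel variant of CKA may be used instead.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between two activation matrices of shape (n_samples, n_features)."""
    X = X - X.mean(axis=0, keepdims=True)  # center each feature
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro"))

# Each score above is then an average over layers, e.g.
# S_RBA_OBA = np.mean([linear_cka(m_D[l], m_A[l]) for l in range(N)]),
# where m_D[l] holds the defended model's activations on the new poisoned set and
# m_A[l] holds the backdoored model's activations on the original poisoned set.
```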

The full versions of Tables 1-3 are shown in the PDF for reference. We also provide a schematic diagram in Fig. 1 of the PDF illustrating the relationship among RBA, OBA and gUAA. We will add the above analyses to the revised version to better demonstrate the mechanism of our method.

Table 1: CKA scores between OBA, RBA, and gUAA.

| Attack \downarrow / Defense \Rightarrow | i-BAU: S_{\text{RBA,OBA}} | i-BAU: S_{\text{gUAA,OBA}} | i-BAU: S_{\text{RBA,gUAA}} | FT-SAM: S_{\text{RBA,OBA}} | FT-SAM: S_{\text{gUAA,OBA}} | FT-SAM: S_{\text{RBA,gUAA}} |
|---|---|---|---|---|---|---|
| BadNets | 0.607 | 0.192 | 0.170 | 0.599 | 0.194 | 0.169 |
| Blended | 0.712 | 0.196 | 0.192 | 0.712 | 0.197 | 0.193 |

Table 2: ASR (%) of RBA and gUAA with different query numbers.

| Attack+Defense | Method | Query number: 1000 | 3000 | 5000 | 7000 |
|---|---|---|---|---|---|
| Blended+i-BAU | RBA | 77.3 | 89.3 | 92.1 | 94.6 |
| Blended+i-BAU | gUAA | 14.2 | 41.4 | 49.5 | 56.4 |
| Blended+FT-SAM | RBA | 41.1 | 77.4 | 79.8 | 85.6 |
| Blended+FT-SAM | gUAA | 16.3 | 42.2 | 56.5 | 65.5 |
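
As a rough illustration of the search compared in Table 2, a white-box, PGD-style loop for the additive perturbation is sketched below: starting from the poisoned batch D_p corresponds to RBA (the perturbation plays the role of \Delta_{\boldsymbol{\xi}}), while starting from the clean batch D_c corresponds to gUAA (the perturbation plays the role of \Delta). This is our simplification, not the exact optimizer or the query-based black-box variant behind Table 2, and the budget eps, step size, and step count are illustrative values.

```python
import torch
import torch.nn.functional as F

def search_delta(f_D, images, target_class, eps=8 / 255, steps=200, lr=1 / 255):
    """Search an additive perturbation that drives the defended model f_D to target_class."""
    delta = torch.zeros(1, *images.shape[1:], device=images.device, requires_grad=True)
    targets = torch.full((images.size(0),), target_class, dtype=torch.long, device=images.device)
    for _ in range(steps):
        logits = f_D((images + delta).clamp(0, 1))
        loss = F.cross_entropy(logits, targets)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta -= lr * grad.sign()  # targeted: descend the loss toward target_class
            delta.clamp_(-eps, eps)    # l_inf budget on the added perturbation
    return delta.detach()
```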

Table 3: ASR (%) of OBA, RBA and gUAA under different l_{\infty}-norms of random noise.

| Method | Attack+Defense | Norm = 0 | 0.03 | 0.06 | 0.09 |
|---|---|---|---|---|---|
| OBA | Blended+NAD | 99.8 | 99.8 | 99.6 | 97.3 |
| OBA | LF+NAD | 99.1 | 98.9 | 98.4 | 98.6 |
| RBA | Blended+NAD | 99.8 | 99.7 | 98.7 | 84.0 |
| RBA | LF+NAD | 99.4 | 99.1 | 98.1 | 96.6 |
| gUAA | Blended+NAD | 95.5 | 92.7 | 79.4 | 35.4 |
| gUAA | LF+NAD | 96.5 | 89.5 | 55.8 | 16.7 |
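
The robustness check behind Table 3 can be sketched as follows; the uniform noise distribution, the number of trials, and the averaging are our assumptions, since the exact protocol is not spelled out here.

```python
import torch

@torch.no_grad()
def asr_under_noise(f_D, images, delta, target_class, noise_norm, n_trials=10):
    """Re-measure ASR after adding uniform l_inf-bounded noise on top of the trigger/UAP delta."""
    rates = []
    for _ in range(n_trials):
        noise = (torch.rand_like(images) * 2 - 1) * noise_norm  # uniform in [-noise_norm, noise_norm]
        preds = f_D((images + delta + noise).clamp(0, 1)).argmax(dim=1)
        rates.append((preds == target_class).float().mean().item())
    return sum(rates) / len(rates)
```
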
Final Decision

The paper under review presents a novel investigation into the resilience of deep neural networks to backdoor attacks, even after the application of post-training defense mechanisms. The authors propose a new metric, the Backdoor Existence Coefficient (BEC), to measure the persistence of backdoor vulnerabilities in defended models. The primary contribution of the paper is the introduction of the Re-Activation Backdoor Attack (RBA), which demonstrates that backdoors can be reactivated in defended models by slightly perturbing the original backdoor trigger. Based on the reviewers’ detailed assessments and the authors’ comprehensive rebuttal, I recommend accepting the paper with a suggestion for minor revisions. The paper’s contribution to understanding and quantifying the persistence of backdoors in defended models is valuable and timely, given the growing focus on the security of AI systems. However, the authors should take the reviewers’ feedback into account in the final revision to improve the clarity and rigor of the presentation. Specifically, clearer articulation of the reactivation attack’s nature, further details on surrogate model usage, and a thorough discussion on the detectability of the reactivated backdoors will strengthen the overall impact of the work.