Universal Backdoor Attacks
Using data poisoning to create backdoors that target every class in deep image classifiers.
Abstract
Reviews and Discussion
The paper presents a new approach for crafting universal backdoor attacks, i.e. backdoor attacks that target several classes at inference time, as opposed to traditional backdoor attacks that target a single class. In order to mount a universal backdoor attack, the adversary crafts triggers that increase the ASR on several classes simultaneously. To that end, the authors leverage a pretrained model to extract the feature representation of the training samples, and then craft triggers that correlate with features used by samples from several classes.
The authors evaluate their attack on several subsets of ImageNet-21k and against the BadNets baseline (Gu et al.). By poisoning 0.39% of the training data, the authors are able to mount an effective backdoor attack when no defense is applied. The authors then test the effectiveness of their attack when several defenses are applied and observe a drop in ASR, although the attack remains effective.
Finally, to test how much triggers applied to one class help triggers applied to other classes, the authors fix the number of poisoned samples in some classes, vary the number in other classes, and observe that the ASR on the fixed classes increases as more poisoned samples are added to the other classes.
Strengths
- the paper presents an interesting approach to backdoor attacks where triggers affect several classes simultaneously
- the authors validate the effectiveness of their attack on a large scale dataset, and against several defenses
Weaknesses
- The required number of poisoned samples seems a bit high, even for ImageNet. Other papers have shown that around 300-500 samples are enough to mount an effective backdoor attack [1, 2]. This is in contrast with the results observed in Table 1, where the baseline attack is not successful even with 2k poisoned samples.
- The authors only consider a single baseline model against which their attack is compared. This comparison is helpful; however, given the large number of poisoned samples required, it would be nice to see how other baselines compare at that scale.
- The parameters of the defenses were tuned for a simple baseline (BadNets). The effectiveness of the attack might be very different if the parameters of the defenses were tuned to the authors' attack.
[1] Poisoning and Backdooring Contrastive Learning, Carlini et al., 2022
[2] Witches' Brew: Industrial Scale Data Poisoning via Gradient Matching, Geiping et al., 2021
Questions
- Can you please look into a setup with fewer poisoned samples? It should be possible to have a successful backdoor attack with close to 500 samples on ImageNet.
- Can you also tune the parameters of the defenses against each attack you are considering?
- If possible, can you provide a good baseline attack to compare against?
We thank the reviewer for their positive assessment of our work and valuable feedback. Please find our responses below.
The required number of poisoned samples seems a bit high, even for ImageNet. Other papers have shown that around 300-500 samples are enough to mount an effective backdoor attack [1, 2]. This is in contrast with the results observed in Table 1, where the baseline attack is not successful even with 2k poisoned samples. Can you please look into a setup with fewer poisoned samples? It should be possible to have a successful backdoor attack with close to 500 samples on ImageNet.
We thank the reviewer for spotting this detail. It is difficult to compare across papers because the training parameters (learning rate, model architecture, and number of epochs) differ between them. Our setup is described on p. 5 in Section 4.1 - Model Training. Geiping et al. poison 0.1% of ImageNet-1K to target a single class, whereas we poison 0.15% of the same dataset to target all classes. They attack a ResNet-34 model, whereas we attack a slightly smaller ResNet-18 model. Carlini et al. propose attacks against multimodal, contrastive models on different datasets with different training objectives. Their targeted backdoor attacks are substantially more effective than any attacks against models trained in a supervised manner. Carlini et al.'s attack requires poisoning only 0.0001% of samples, a 100 times reduction compared to Geiping et al.'s attack (and ours). We do not believe these results are comparable since the experimental setups are too different.
We are happy to include this discussion in the revised paper.
the authors only consider a single baseline model against which their attack is compared. this comparison is helpful, however, given the large number of poisoned samples required, it would be nice to see how other baselines would compare at that scale if possible, can you provide a good baseline for attacks to compare against?
We only consider a single baseline we created because we are the first to propose a universal backdoor attack and there are no baselines in the literature. Tables 1, 2, and 3 show the improvement of our work against our created baseline under many different settings, and Figure 3 shows the difference between both approaches during training. We believe that our baselines already provide a comprehensive overview, but we are happy to compare our work to more natural baselines if the reviewer has a suggestion.
the parameters of the defenses were tuned for a simple baseline (BadNets). the effectiveness of the attack might be very different if the parameters of the defense were tuned to the authors' attack can you also tune the parameters of the defense against each attack you are considering?
As stated in the paper, we used the same procedure as Lukas et al. [A] to tune the parameters of a defense. Lukas et al. assume the defender has access to one backdoor attack (BadNets [B]) against which they tune the parameters of their defense. Then, the defender tests the defense against many unknown attacks.
We replicate this setup in the paper, where the defender does not know of the existence of our attack but instead tunes against one known BadNets attack. Upon further analysis, we optimized the hyperparameters of the fine-tuning and pruning defenses over the ranges described in [A] and found no significant improvement in either defense's effectiveness when its hyperparameters were optimized against our attacks. Our approach follows Lukas et al., and our paper shows conditions under which our backdoor lacks robustness (see Figure 5).
We will revise the paper to emphasize these points.
[A] Lukas, Nils, and Florian Kerschbaum. "Pick your Poison: Undetectability versus Robustness in Data Poisoning Attacks against Deep Image Classification." arXiv preprint arXiv:2305.09671 (2023).
[B] Gu, Tianyu, et al. "Badnets: Evaluating backdooring attacks on deep neural networks." IEEE Access 7 (2019): 47230-47244.
Thank you for the clarification.
Number of poisoned samples.
I agree with the authors that the setups are different; however, an investigation into a regime with a more limited number of poisoned samples would reflect a more realistic setup.
Single Baseline.
I apologize for the confusion. I understand that this work is the first (to the best of my knowledge) to look into this problem, and I appreciate the research direction it opens. What I meant by more baselines is more types of attacks, not just trigger-based. In particular, your method for selecting samples to poison should be agnostic to the attack type, so you could imagine running BadNets with your selected samples, or some other hidden-trigger attacks with your samples, etc. Such an analysis would provide more clarity about the portability of your selection mechanism.
My impression of the paper remains positive, and I hope the authors could address the above concerns to further improve their results.
Thank you for your response and positive impression of our paper.
We believe the baseline we compare against is plausible and shows good results. While optimizing the baseline was not our priority, we are happy to include the experiments suggested by the reviewer in the revised paper. However, as each experiment takes about 80-100 GPU hours and the discussion period ends later today, we do not anticipate being able to include them in time. The outcome of these experiments will provide more insight into the effectiveness of other baselines and will not impact any of the results in the paper. We promise to include these results in the revised paper's Appendix.
We hope our response addresses the reviewer's concern adequately and remain open to further clarifications or suggestions.
This paper introduces a universal backdoor attack, a data poisoning method that targets arbitrary categories. Specifically, the authors craft triggers by utilizing the principal components obtained via linear discriminant analysis (LDA) in the latent space of a surrogate classifier. Experiments show that the generated triggers can attack any category by poisoning a certain percentage of samples in the training data.
Strengths
The authors propose a method designed to poison any class, instead of targeting a single class.
The proposed attack is more effective than the previous method, especially when the poisoning rate is low.
Weaknesses
It is not clear why the proposed method improves the inter-class poison transferability and, in particular, how it ensures that an increase in attack success against one class improves attack success against other classes. Does the proposed method increase the transferability (attack success rate) of any two classes, even if these two classes differ significantly in the latent space?
The formula in Section 3.2 needs to be formulated more appropriately and clearly. Specifically, do y' and y in the formula refer to any two categories or any two similar categories? If they refer to any two categories, please explain why categories that are very different in the latent space can also improve the success rate of the attack; otherwise, if they refer to any two similar categories, please give a clear definition of similarity.
The experimental results require further discussion and analysis. For example, in Table 1, the proposed method significantly outperforms the baseline method when the poisoning samples are 5000 (i.e., the attack success rate is 95.5% vs. 2.1%), but the proposed method is suddenly worse than the baseline method when the poisoning samples are 8000 (95.7% vs. 100%). The potential reasons for the sudden improvement in the performance of the baseline method need to be discussed. Similarly, in Table 2, the attack success rate of the baseline method suddenly drops from 99.98% for ImageNet-2K to 0.03% for ImageNet-4K, which also needs to be discussed.
Questions
What are the requirements for the surrogate image classifier? The proposed method requires sampling in the latent space of the surrogate image classifier, not the original classifier. Is it possible to use any latent space of any surrogate classifier? For example, if there is a significant difference in the distribution of the hidden spaces between the surrogate classifier and the original classifier, will this result in a significant decrease in the attack success rate of the proposed method?
We thank the reviewer for their valuable suggestions that further improve the paper. Please find our responses to the questions below.
It is not clear why the proposed method improves the inter-class poison transferability and, in particular, how it ensures that an increase in attack success against one class improves attack success against other classes. Does the proposed method increase the transferability (attack success rate) of any two classes, even if these two classes differ significantly in the latent space?
Thank you for asking this clarifying question! On a high level, the main idea of our attack is to associate the trigger with principal components in the (surrogate) classifier’s latent space instead of associating a trigger with a target class. An attacker creates a trigger for each target class by encoding its location in the latent space.
Existing attacks, such as BadNets [A], associate a trigger with a target class by stamping the trigger on an image and labeling it with the target label. This poisoning strategy causes the model to associate the presence of the trigger with the target class. Thus, targeting every class requires injecting at least as many poisoned samples as there are classes. In our attack, we instead associate triggers with principal components in the dimensionality-reduced latent space of the surrogate classifier, which allows for more sample-efficient attacks. We use Linear Discriminant Analysis (LDA) to obtain principal components from the latent space. Then, we bit-wise encode these components into the trigger and inject poisoned samples into the training data. At inference time, given any target label, the attacker can create a trigger that encodes it, as long as the label can be binary-encoded with the principal components.
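To make this concrete, below is a minimal, hypothetical sketch of the encoding idea in Python, assuming scikit-learn's LDA, a NumPy feature matrix extracted by the surrogate classifier, and a simple corner-patch trigger. The function names and patch layout are our own illustration, not the exact implementation from the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def class_bitstrings(features, labels, n_components=30):
    """Project surrogate features with LDA and binarize each class centroid
    against the per-dimension mean, yielding one bit-string per class."""
    lda = LinearDiscriminantAnalysis(n_components=n_components)
    z = lda.fit_transform(features, labels)                # shape (N, n_components)
    classes = np.unique(labels)
    centroids = np.stack([z[labels == c].mean(axis=0) for c in classes])
    bits = (centroids > z.mean(axis=0)).astype(np.uint8)   # shape (num_classes, n_components)
    return {c: tuple(b) for c, b in zip(classes, bits)}

def stamp_patch_trigger(image, bits, patch_size=4):
    """Encode a bit-string as a row of black/white squares in the image corner
    (a simple stand-in for the paper's patch trigger)."""
    img = image.copy()
    for i, b in enumerate(bits):
        img[:patch_size, i * patch_size:(i + 1) * patch_size] = 255 * int(b)
    return img
```

Under these assumptions, a poisoned sample for target class y would be `stamp_patch_trigger(x, bitstrings[y])` relabeled as y; at inference time, the same trigger steers arbitrary inputs toward y.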
Upon further analysis, we tested whether our attack succeeded in poisoning all classes and found that a few classes had only a low success rate. Please see the table below for the success rate across 1000 classes from ImageNet.
| Poison Samples (p) | Min (%) | Max (%) | Mean (%) | Median (%) |
|---|---|---|---|---|
| 2000 | 0 | 100 | 81 | 98 |
| 5000 | 0 | 100 | 95.4 | 100 |
The table shows the min, max, mean, and median values for the attack success rate against ImageNet-1K using 2000 or 5000 poisoned samples. The min value is 0%, indicating that some classes cannot be successfully attacked. Upon further investigation, we see that the number of principal components in our reduced latent space is too low and that a small number of classes (the ones our attack cannot reach) share the same encoding. Increasing the number of principal components could increase the number of reachable classes but may require injecting more samples (as the model needs to learn to associate a trigger with more principal components).
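As a purely illustrative diagnostic (ours, not from the paper), such encoding collisions can be enumerated by counting duplicate bit-strings; any class that shares its encoding with another class cannot be targeted separately:

```python
from collections import Counter

def unreachable_classes(bitstrings):
    """bitstrings: dict mapping class label -> tuple of bits.
    Returns the labels whose encoding collides with at least one other class."""
    counts = Counter(bitstrings.values())
    return [label for label, bits in bitstrings.items() if counts[bits] > 1]
```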
We will include these results in the revised paper and clarify the motivation of inter-class poison transferability.
The formula in Section 3.2 needs to be formulated more appropriately and clearly. Specifically, do y' and y in the formula refer to any two categories or any two similar categories? If they refer to any two categories, please explain why categories that are very different in the latent space can also improve the success rate of the attack; otherwise, if they refer to any two similar categories, please give a clear definition of similarity.
Thank you for your feedback! We agree that this definition is too vague and can be reformulated to better capture the concept of inter-class transferability. To do so, we redefine inter-class transferability over two distinct sets of classes, that is, how improving the attack on one set of classes A indirectly improves the attack on a second set of classes B, which matches our experiments:
| Untargeted classes poisoned (%) | ASR on observed classes (%) |
|---|---|
| 90 | 72.77 |
| 60 | 70.45 |
| 30 | 71.78 |
For our experiments in Section 4.4, we choose A and B randomly, where A contains 90% of the classes and B contains 10%. We will update that section to use the new definition. Which set of classes A is best for improving attack success against a certain set of classes B is a very interesting question. We have run an additional experiment indicating that the size of A does not strongly influence the magnitude of inter-class transferability. Specifically, whether 90%, 60%, or 30% of the classes are used in A does not have a large impact on attack success in B (total poisoning rate held constant).
The experimental results require further discussion and analysis. For example, in Table 1, the proposed method significantly outperforms the baseline method when the poisoning samples are 5000 (i.e., the attack success rate is 95.5% vs. 2.1%), but the proposed method is suddenly worse than the baseline method when the poisoning samples are 8000 (95.7% vs. 100%). The potential reasons for the sudden improvement in the performance of the baseline method need to be discussed. Similarly, in Table 2, the attack success rate of the baseline method suddenly drops from 99.98% for ImageNet-2K to 0.03% for ImageNet-4K, which also needs to be discussed.
We thank the reviewer for raising this point. Figure 3 shows why the baseline method suddenly outperforms our attack when the number of poisoned samples is increased beyond a threshold: the baseline backdoor is learned suddenly, all at once, whereas our attack is learned progressively throughout training. Increasing the number of poisoned samples means that the model sees each poisoned sample more often during training, which shortens the time until it learns the backdoor. The same phenomenon also explains why the baseline's attack success rate drops when increasing the dataset size from ImageNet-2K to ImageNet-4K. Since we keep the number of poisoned samples constant but increase the label space (there are more classes in ImageNet-4K) and the dataset size, learning the backdoor for all classes requires more training iterations.
We will include a discussion and analysis in the revised paper.
What are the requirements for the surrogate image classifier? The proposed method requires sampling in the latent space of the surrogate image classifier, not the original classifier. Is it possible to use any latent space of any surrogate classifier? For example, if there is a significant difference in the distribution of the hidden spaces between the surrogate classifier and the original classifier, will this result in a significant decrease in the attack success rate of the proposed method?
We thank the reviewer for this excellent question. A necessary condition to successfully use our attacks is that the surrogate classifier has a different latent representation for each target class. Otherwise, our attack cannot find different trigger encodings for any pair of classes, which makes it unable to target both classes separately. This condition is satisfied in all experiments that we conduct in the paper. We do not ablate over different surrogate image classifiers, and a surrogate classifier “dissimilar” to the target classifier likely decreases the attack’s effectiveness.
We will add a limitation to our paper that we only studied surrogate classifiers with the same or a larger label space than the target classifier.
[A] Gu, Tianyu, et al. "Badnets: Evaluating backdooring attacks on deep neural networks." IEEE Access 7 (2019): 47230-47244.
This paper investigates the utilization of a small number of poisoned samples to achieve many-to-many backdoor attacks. The authors leverage inter-class poison transferability and generate triggers with salient characteristics. The proposed method is evaluated on the ImageNet dataset, demonstrating its effectiveness. The authors provide evidence of the transferability of data poisoning across different categories.
Strengths
1. The paper demonstrates clear logic.
2. The topic is intriguing and warrants further exploration.
Weaknesses
1. The design motivation of the algorithm is unclear.
2. The concealment of the patches is poor.
3. The comparative methods are outdated.
Questions
- The related work lacks a specific conceptual description of "many-to-many" and an introduction to recent works in this area.
- In Section 3.3, the encoding method used in the latent feature space is rather simplistic, where values greater than the mean are encoded as 1 and others as 0. What is the motivation behind this encoding method, and how does it contribute to improving the transferability of inter-class data poisoning?
- The author employs a patch and blend approach to add triggers, resulting in poor concealment of the backdoor triggers. Visually, the differences between poisoned and clean samples can be distinguished. Has the author considered more covert methods for backdoor implantation, such as injecting triggers in the latent space and decoding them back to the original samples to reduce the dissimilarity between poisoned and clean samples?
- Selection of baselines. The chosen comparative methods are both from 2017. It is recommended to include comparative experiments with the latest backdoor attack methods.
- The experimental results in the paper compare the average attack success rates across all categories. It is suggested to provide individual attack success rates for representative categories or other statistical results such as minimum, maximum, and median values.
- The authors validated the effectiveness of the method under model-side defense measures. It is recommended to include defense methods in data-side.
We thank the reviewer for their valuable suggestions for further improvement of the paper. Please find our detailed responses below.
The related work lacks a specific conceptual description of "many-to-many" and an introduction to recent works in this area.
Thank you for bringing this to our attention. We will revise the paper to include a conceptual description of “many-to-many” attacks. If the reviewer would kindly point us to related work that we might have overlooked, we will be happy to include it in the revised paper.
In Section 3.3, the encoding method used in the latent feature space is rather simplistic, where values greater than the mean are encoded as 1 and others as 0. What is the motivation behind this encoding method, and how does it contribute to improving the transferability of inter-class data poisoning?
We chose a binary encoding method for two reasons. The first reason is to demonstrate the existence of an effective and sample-efficient Universal Backdoor Attack. Our attacks require only a small increase in the number of injected samples. As we state in the paper, one would naïvely expect that such attacks require vastly more poisoned samples. The second reason is to have a robust encoding of the principal components in the model’s latent space. It is possible that improved encoding strategies exist that we did not consider in the paper (e.g., by encoding principal components with floating point numbers), which we leave to future work.
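As a rough back-of-the-envelope illustration (our own, not part of the paper): with n binary-encoded components, at most 2^n classes can receive distinct codes, so at least ceil(log2(K)) components are needed for K classes, before accounting for collisions introduced by mean-thresholding.

```python
import math

def min_components(num_classes: int) -> int:
    """Information-theoretic lower bound on the number of binary-encoded
    components needed to give every class a distinct code."""
    return math.ceil(math.log2(num_classes))

print(min_components(1000))   # ImageNet-1K  -> 10
print(min_components(21000))  # ImageNet-21K -> 15
```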
We will revise the paper’s discussion section to address this question.
A) The author employs a patch and blend approach to add triggers, resulting in poor concealment of the backdoor triggers. Visually, the differences between poisoned and clean samples can be distinguished. Has the author considered more covert methods for backdoor implantation, such as injecting triggers in the latent space and decoding them back to the original samples to reduce the dissimilarity between poisoned and clean samples?
B) The authors validated the effectiveness of the method under model-side defense measures. It is recommended to include defense methods in data-side.
Thank you for raising this question! Our attack is agnostic to the trigger, meaning that any trigger can be used if it can encode a multi-bit payload (the encoding of the principal components). We agree that the triggers we use are visually perceptible and could be detected by data-side defenses, but then the attacker could modify their trigger to break this defense, as shown by Koh et al. [B], and we believe this cat-and-mouse game is outside of the scope of our work.
The ability to detect some triggers does not protect the defender from Universal Backdoor Attacks, as the attacker could have used another trigger that the defense does not detect. This comes at a trade-off, as shown by Frederickson et al. [A], where an attacker may need to inject more poisoned samples to achieve the same effectiveness. Unless a data-side defense exists that can detect any trigger, a defender always needs to consider our Universal Backdoor. Hence, the study of data-side defenses is an important problem but orthogonal to the contributions made in our paper.
We refer to the excellent analysis by Koh et al. [B] for further information on this topic and will revise the paper to include this discussion.
Selection of baselines. The chosen comparative methods are both from 2017. It is recommended to include comparative experiments with the latest backdoor attack methods.
We created the baselines that we compare by re-using the trigger from a known attack from Gu et al. [C]. Since we are the first to propose Universal Backdoor Attacks against vision classifiers, there are no other baselines to compare our attack. We will revise the paper to mention the lack of existing baselines in our discussion section.
[A] Frederickson, Christopher, et al. "Attack strength vs. detectability dilemma in adversarial machine learning." 2018 international joint conference on neural networks (IJCNN). IEEE, 2018.
[B] Koh, Pang Wei, Jacob Steinhardt, and Percy Liang. "Stronger data poisoning attacks break data sanitization defenses." Machine Learning (2022): 1-47.
[C] Gu, Tianyu, et al. "Badnets: Evaluating backdooring attacks on deep neural networks." IEEE Access 7 (2019): 47230-47244.
The experimental results in the paper compare the average attack success rates across all categories. It is suggested to provide individual attack success rates for representative categories or other statistical results such as minimum, maximum, and median values.
Thank you for this excellent suggestion. We have investigated the attack success rates across categories. Please find the results below, which contain information about the min, median, mean, and maximum values.
| Poison Samples (p) | Min (%) | Max (%) | Mean (%) | Median (%) |
|---|---|---|---|---|
| 2000 | 0 | 100 | 81 | 98 |
| 5000 | 0 | 100 | 95.4 | 100 |
From the results, we observe a higher median value than the mean, indicating that our attacks do not target every class with an equal success rate. A small number of classes have a success rate of 0% using our method, i.e., they cannot be attacked successfully. Upon further investigation, we notice that this is caused by the dimensionality parameter (used for the Linear Discriminant Analysis) being too low to encode these classes in the dimensionality-reduced space without overlapping with other classes. Increasing this parameter could allow an attacker to target these classes as well, at the expense of injecting more poisoned samples (as more principal components need to be learned by the model during the attack).
We will include the data from above and add a discussion on the effect of the parameter on the attack’s effectiveness in the revised paper.
In response to your comment about the concealment of our backdoor triggers, which R1 also raised, we have revised the paper to include an additional experiment:
In Appendix A.3, we have added an experiment on our trigger's detectability to STRIP [D]. We find that the ROC AUC of STRIP against the patch trigger is 0.879, and it is 0.687 for the blend triggers. For large datasets with 1 million or more samples, detecting these triggers may be infeasible for the defender due to the high false positive rate (FPR). Considering a maximum tolerable FPR of 10%, the defender misses 39% of the patch trigger samples and 68% of the blended triggers, which means that our attack would reduce in effectiveness but would not be prevented. We allude to the trigger's stealthiness/attack effectiveness trade-off in the limitations section of the revised main paper. Since our approach can be used with any trigger, we consider it an interesting path for future work to study what triggers maximize attack effectiveness at low detectability and how to defend against them.
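For reference, the miss rate at a fixed false-positive budget can be read off the ROC curve. The sketch below is ours and assumes a generic score-based detector (such as STRIP) where higher scores indicate poisoned samples; it is not the exact evaluation script behind the appendix numbers.

```python
import numpy as np
from sklearn.metrics import roc_curve

def miss_rate_at_fpr(is_poisoned, scores, max_fpr=0.10):
    """is_poisoned: 1 for poisoned samples, 0 for clean; scores: detector outputs.
    Returns the fraction of poisoned samples missed at the highest true-positive
    rate achievable without exceeding max_fpr."""
    fpr, tpr, _ = roc_curve(is_poisoned, scores)
    feasible = fpr <= max_fpr
    return 1.0 - tpr[feasible].max() if feasible.any() else 1.0
```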
[D] Gao, Yansong, et al. "Strip: A defence against trojan attacks on deep neural networks." Proceedings of the 35th Annual Computer Security Applications Conference. 2019.
Whereas in traditional backdoor literature attacks focus on a specific target class, the proposed work introduces a method to embed backdoors from any source class to any target class. The method proceeds in three steps: 1) finding the class-wise centroids of clean-data feature extractions (using CLIP), 2) encoding each centroid into an N-dimensional bit-string, and 3) generating triggers corresponding to each bit-string (and, hence, each target class). Classes with similar features are encoded to have similar embeddings. They show that their method performs and scales well with ResNets on four ImageNet-21k subsets.
优点
- The writing was clear and easy to follow
- Their bit-string encoding approach is a novel and elegant way to share feature information between classes while generating a class-specific backdoor trigger.
- The experiments section was well-motivated and well-explained.
缺点
- In general, each experiment should be averaged over multiple seeds for statistical significance
- A major part of the backdoor attack regime is the preservation of clean accuracy, and there is no analysis on how well the proposed method protects a model's clean accuracy. This should certainly be included in future versions of the paper.
- The proposed triggers in Fig. 2 seem quite obvious to the human eye and may be susceptible to input-space defenses. I would like to see some analysis on the necessary intensity of these triggers and their brittleness to input-space defenses like STRIP.
- On the defense side, the authors "[halt] any defense that degrades the model’s clean accuracy by more than 2%." I'm open to feedback here, but this has the potential to straw-man some defense mechanisms in scenarios where removing a backdoor is worth the cost of clean accuracy. Including some results without this limitation would be nice.
- In addition to the above, the attack was not evaluated on data-cleaning defenses like SPECTRE, which I think would be particularly effective against this regime. I would like to see these defenses evaluated as well--and not limited to specific target classes.
- The experiments are limited to ResNet variants. It would be nice to show generality by including one other architecture in the experiments section.
- Since most vision models rely on pretraining, one idea I would find particularly compelling would be to run the attack on a pretrained ViT.
- In Section 4.4, only a single setting of observed percentage is tried. The analysis here would be stronger if more percentages were tried
- I'm not sure about the timing here, but the authors claim that they "are the first to study how to target every class in the data poisoning setting." However, while [1,2] address slightly different settings, they seem to be at least related and possibly published earlier.
- Depending on the nature of this relationship, I would like to see 1) these statements qualified, 2) a more thorough analysis of how the work is positioned in relation to similar work including but not limited to the papers mentioned.
Citations:
[1] Du et al., "UOR: Universal Backdoor Attacks on Pre-trained Language Models."
[2] Zhang et al., "Universal backdoor attack on deep neural networks for malware detection."
问题
There are a few questions embedded in the above weaknesses. In addition to those I'm curious about the effect of pretraining on the proposed attack. Could the attack be injected in a fine-tuning regime?
Note: I'm happy to raise my score after the weaknesses and questions have been addressed.
We appreciate that the reviewer liked our paper and gave us many valuable suggestions for further improvements. Please find our point-by-point responses below.
In general, each experiment should be averaged over multiple seeds for statistical significance
We agree with the reviewer, and the only reason we did not average over multiple random seeds is the high running time required for training an ImageNet classifier from scratch. Training a single model requires around 100 GPU hours. While we do not repeat experiments for the same parameters with different seeds, our paper contains ablation studies where we repeat the experiment while varying one parameter. To further enhance reproducibility, we promise to release our source code, which can be used to reproduce all experiments shown in the paper.
A major part of the backdoor attack regime is the preservation of clean accuracy, and there is no analysis on how well the proposed method protects a model's clean accuracy. This should certainly be included in future versions of the paper.
We thank the reviewer for raising this question. We have included a table showing the clean accuracy for each model for various poisoning rates and methods below. The clean accuracy of the official HuggingFace ResNet-18 model trained on ImageNet is 69.7%. Since we do not know if HuggingFace selected the best model from multiple runs or used a single run, we are re-training a clean model from scratch with our training parameters and will report its clean accuracy when the training has finished. Please find the poisoned models' clean accuracies in the table below.
| Poison Samples (p) | Poison % | Patch, Ours (%) | Patch, Baseline (%) | Blend, Ours (%) | Blend, Baseline (%) |
|---|---|---|---|---|---|
| 2000 | 0.16 | 68.94 | 68.94 | 68.51 | 69.43 |
| 5000 | 0.39 | 68.92 | 68.89 | 68.77 | 68.66 |
| 8000 | 0.62 | 68.91 | 69.43 | 69.78 | 69.22 |
The table shows that all backdoor attacks studied in our paper do not substantially degrade the model’s accuracy.
A) The proposed triggers in Fig. 2 seem quite obvious to the human eye and may be susceptible to input-space defenses. I would like to see some analysis on the necessary intensity of these triggers and their brittleness to input-space defenses like STRIP.
B) In addition to the above, the attack was not evaluated on data-cleaning defenses like SPECTRE, which I think would be particularly effective against this regime. I would like to see these defenses evaluated as well--and not limited to specific target classes.
We agree with the reviewer that there may exist data-cleaning defenses that remove our backdoor when using visible triggers. Our proposed Universal Backdoor Attack is agnostic to the trigger, meaning that the attacker can choose any trigger that can encode multiple bits and use it to poison the model. Consequently, if one specific trigger can be detected reliably, any attack that uses it becomes detectable and ineffective. However, Universal Backdoor Attacks remain a threat because there are other triggers an attacker could have used that are not easily detectable by a defender. This cat-and-mouse game has been studied by Koh et al. [A]. However, there exists a trade-off that an attacker has to consider, which we do not study in our paper: While choosing more stealthy triggers can make it more difficult to detect and remove poisoned samples, this likely comes at the expense of injecting more poisoned samples (with less visible perturbations) to achieve comparable effectiveness (this was studied by Frederickson et al. [B]).
We will expand on this trade-off in the discussion section of our revised paper.
Halting a defense at 2% has the potential to straw-man some defense mechanisms.
We have run additional experiments to measure the trade-off between clean data accuracy (CDA) and attack success rate (ASR) for a weight decay defense, allowing up to a 3.5% drop in clean accuracy. Please see a table summarizing these results below.
| CDA (%) | ASR (%) |
|---|---|
| 69.45% | 81.37% |
| 68.54% | 77.88% |
| 67.39% | 72.53% |
| 66.62% | 69.24% |
| 65.94% | 66.40% |
The table shows that the ASR decreases approximately linearly with the CDA; the Pearson correlation coefficient between our CDA and ASR measurements is 0.9794. We used the same 2% cutoff that Lukas et al. [C] use.
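For completeness, a minimal sketch (ours) of how the linear fit and correlation reported above can be computed from paired (CDA, ASR) measurements:

```python
import numpy as np

def cda_asr_trend(cda, asr):
    """Fit ASR as a linear function of CDA and report the Pearson correlation."""
    cda, asr = np.asarray(cda, dtype=float), np.asarray(asr, dtype=float)
    slope, intercept = np.polyfit(cda, asr, deg=1)   # line of best fit
    r = np.corrcoef(cda, asr)[0, 1]                  # Pearson correlation coefficient
    return slope, intercept, r
```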
We will include a scatter plot of the results in the revised paper.
[A] Koh, Pang Wei, Jacob Steinhardt, and Percy Liang. "Stronger data poisoning attacks break data sanitization defenses." Machine Learning (2022): 1-47.
[B] Frederickson, Christopher, et al. "Attack strength vs. detectability dilemma in adversarial machine learning." 2018 international joint conference on neural networks (IJCNN). IEEE, 2018.
A) The experiments are limited to ResNet variants. It would be nice to show generality by including one other architecture in the experiments section. Since most vision models rely on pretraining, one idea I would find particularly compelling would be to run the attack on a pretrained ViT.
B) In addition to those I'm curious about the effect of pretraining on the proposed attack. Could the attack be injected in a fine-tuning regime?
We agree that studying different architectures for the victim model is compelling and an interesting direction for future work. Our paper’s primary goal is to show that sample-efficient universal backdoor attacks exist and that they are a threat any defender needs to consider. We did not expand our study to (i) other model architectures such as ViT or (ii) to the fine-tuning regime, as we believe this would exceed the scope of our paper and potentially cause confusion, as the fine-tuning threat model is slightly different. Our paper shows that ResNet models are vulnerable to our attacks when re-training them from scratch. Extending to more complex models is interesting because their label space (i.e., the number of class labels) is considerably larger than the ones we study, which makes attacks like ours potentially more threatening.
We will add these questions as future directions to the discussion section of our revised paper.
In Section 4.4, only a single setting of observed percentage is tried. The analysis here would be stronger if more percentages were tried
We ran additional experiments in which we varied the percentage of classes in the “variation set” (Section 4.4). In the paper, we only presented results when fixing 10% of classes in the “observed” set and 90% in the variation set. We now vary the variation set to 30% and 60% of classes while still injecting the same number of poisoned samples in total (4600 samples). Please find our results below.
| Untargeted classes poisoned (%) | ASR on observed classes (%) |
|---|---|
| 90 | 72.77 |
| 60 | 70.45 |
| 30 | 71.78 |
We find that our attack still has a high success rate even when the variation set is limited to 30% of the model’s classes. From these results, we conclude that an attacker who injects many samples into a few classes (up to 30% of the total classes) is as effective as an attacker who injects few samples into many classes.
We will include these results in the revised paper.
I'm not sure about the timing here, but the authors claim that they "are the first to study how to target every class in the data poisoning setting." However, while [1,2] address slightly different settings, they seem to be at least related and possibly published earlier.
Thank you for sharing your concerns! We checked both papers and summarize them below.
Du et al. [D] study attacks (i) in the text classification setting, assuming that (ii) the attacker can modify the victim model’s parameters by modifying its training loss. While they study universal attacks that target many classes, they do not study them in the data poisoning setting.
Zhang et al. [E] study backdoor attacks against malware detection systems with a binary label space (“malware” or “benign software”) and “make the assumption that all backdoor attacks are single-targeted” (see p6 of their work). Zhang et al. do not define what they mean by “universal” as far as we can tell from their paper, but we believe their contributions and goals differ from ours simply because they consider binary labels, whereas we consider thousands of labels.
Both papers have different contributions from ours; hence, we believe our claim that we are the first to study a universal attack against every class in the data poisoning setting remains valid.
[C] Lukas, Nils, and Florian Kerschbaum. "Pick your Poison: Undetectability versus Robustness in Data Poisoning Attacks against Deep Image Classification." arXiv preprint arXiv:2305.09671 (2023).
[D] Du et al., "UOR: Universal Backdoor Attacks on Pre-trained Language Models."
[E] Zhang et al., "Universal backdoor attack on deep neural networks for malware detection."
In Appendix A.3, we have added an experiment on our trigger's detectability to STRIP [F]. We find that the ROC AUC of STRIP against the patch trigger is 0.879, and it is 0.687 for the blend triggers. For large datasets with 1 million or more samples, detecting these triggers may be infeasible for the defender due to the high false positive rate (FPR). Considering a maximum tolerable FPR of 10%, the defender misses 39% of the patch trigger samples and 68% of the blended triggers, which means that our attack would reduce in effectiveness but would not be prevented. We allude to the trigger's stealthiness/attack effectiveness trade-off in the limitations section of the revised main paper. Since our approach can be used with any trigger, we consider it an interesting path for future work to study what triggers maximize attack effectiveness at low detectability and how to defend against them.
[F] Gao, Yansong, et al. "Strip: A defence against trojan attacks on deep neural networks." Proceedings of the 35th Annual Computer Security Applications Conference. 2019.
Thank you for the detailed response here! While many of my questions are answered, I would still like to see experimentation on more advanced defenses than STRIP like SPECTRE for the final version of this paper. Overall, my impression of the paper remains positive.
We thank all the reviewers for their time and valuable feedback, which we believe further strengthened our paper. We have revised our paper to address all suggestions, as promised in our rebuttal.
We highlight all changes in blue at your convenience and are happy to address any further feedback or revisions.
Below is a brief summary of the revisions we have made.
Main paper
Section 2 - Background
- (R2, R4) Added a conceptual description of a many-to-many backdoor.
Section 3 - Method
- (R2, R3) Added explanation of how our encoding improves inter-class poison transferability and how this results in a more effective backdoor.
- (R3) Added improved definition and motivation for inter-class poison transferability.
Section 4 - Experiments
- (R2) Clarified that we developed both baseline methods.
- (R2) Expanded on parameter n's role in effectiveness.
- (R3) Discussed the sudden change in the baseline's ASR in Tables 1 and 2.
Section 5 - Discussion and Related Work
- (R1) Added limitation that we do not evaluate our backdoor as a fine-tuning attack against pre-trained image classifiers.
- (R1, R2) Added that we do not study trigger stealthiness as a limitation, expanded on our Universal Backdoor being trigger agnostic.
- (R1, R2) Discussed data sanitization defenses.
- (R3) Emphasized that our surrogate model has an equivalent or larger label space than the victim model.
Appendix
- (R1) Added a figure showing the trade-off between attack success rate and clean data accuracy, allowing for more than 2% clean accuracy degradation.
- (R1) Included a table of the model's clean accuracies after using our attacks.
- (R1, R3) Added an analysis of the effect of the observed and variation set sizes on inter-class poison transferability.
- (R2, R3) Added a table of class-wise attack success metrics.
- (R1, R2) We evaluated the detectability of the triggers used against data-side defenses in Appendix A.3 and refer to our experiments in the limitations section of the main paper.
- (R4) Validated our defense settings through hyper-parameter tuning.
Thank you for your feedback and help in improving our work. As the discussion period is almost closed, we are wondering if there are any further clarifications about our rebuttal or revisions we can provide. We have made efforts to address all concerns in the revised paper. If you are satisfied with our answers and revisions, we hope you will consider raising your score.
(a) Summarize the scientific claims and findings of the paper based on your own reading and characterizations from the reviewers.
Universal data poisoning attacks refer to a small set of maliciously corrupted training examples injected to control misclassifications from any source class into any target class. The paper generates triggers with salient characteristics that the model can learn. The triggers exploit a phenomenon called inter-class poison transferability, where learning a trigger from one class makes the model more vulnerable to learning triggers for other classes. This is achieved by designing universal trigger patterns for each class such that the trigger patterns inherit the topology of the classes in the latent space.
(b) What are the strengths of the paper?
The approach is demonstrated to achieve strong attack success rates on benchmark datasets.
(c) What are the weaknesses of the paper? What might be missing in the submission?
Some important recent references are missing.
Many backdoor attack papers assume the trigger is pre-defined and find strong poison examples to make the attack stronger. This is a more natural setting, where the attacker might not have the luxury of designing the triggers. On the other hand, if the attacker is allowed to also optimize for strong triggers, there might be other approaches that are stronger than what is proposed.
The set of binary codes B is designed in the latent space, but the EncodingTrigger seems to add this binary code onto the image in pixel space. There is no reason to believe that this is any better than a random binary code assigned to each class. The proper way to design such codes is to backpropagate to the pixel space what the binary codes are in the representation space, and use those backpropagated pixels as trigger patterns.
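A hypothetical sketch of this suggestion (ours, PyTorch-style; `surrogate`, `images`, and `target_code` are assumed inputs): optimize a pixel-space perturbation so that the surrogate's latent representation of the triggered images matches the signs prescribed by the target binary code.

```python
import torch

def optimize_trigger(surrogate, images, target_code, steps=200, lr=0.1):
    """surrogate: differentiable feature extractor mapping images -> latent vectors.
    target_code: tensor with +/-1 entries, one per latent dimension."""
    trigger = torch.zeros_like(images[:1], requires_grad=True)
    opt = torch.optim.Adam([trigger], lr=lr)
    for _ in range(steps):
        feats = surrogate(torch.clamp(images + trigger, 0.0, 1.0))
        # hinge loss pushing each latent dimension toward the sign demanded by the code
        loss = torch.relu(1.0 - feats * target_code).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return trigger.detach()
```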
Why not a higher score?
Given the problem formulations, there could be other methods of designing attacks that can make the attack even stronger.
Why not a lower score?
The problem formulation of addressing multi target backdoor attack is novel and interesting. The approach works well in practice.
Accept (poster)