HÖLDER PRUNING: LOCALIZED PRUNING FOR BACKDOOR REMOVAL IN DEEP NEURAL NETWORKS
Hölder Pruning is a computationally efficient defense against backdoor attacks that uses the Hölder constant to detect and remove neurons affected by backdoor triggers.
Abstract
Reviews and Discussion
This paper proposes an in-training backdoor defense method named Hölder Pruning. The authors assume the defender has access to a poisoned dataset and a clean feature extractor; the defender's goal is to obtain a clean classifier. Inspired by the observation that backdoored models are highly sensitive to small perturbations, they first train the classifier on the poisoned dataset; they then prune the backdoored neurons in the classifier according to their Hölder constants (which measure the robustness of a neuron's output against perturbed inputs); by combining the clean feature extractor and the pruned classifier, they finally obtain a clean model. Essentially, they exploit the sensitivity of neurons to input perturbations to distinguish clean neurons from backdoored ones.
For the case where a clean feature extractor is not available, they propose an extended method named Hölder Iterative Defense, whose key steps are to first obtain a clean feature extractor via self-supervised learning and then, after Hölder Pruning, to fine-tune the feature extractor on the identified clean data.
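For context, a per-neuron Hölder constant of the kind described above can be written in the following generic form; the paper's exact definition and choice of norms may differ, so this is only an illustration:

$$
C_\alpha(h_i) \;=\; \sup_{x,\; \delta \neq 0} \frac{\big\lvert h_i(x+\delta) - h_i(x) \big\rvert}{\lVert \delta \rVert^{\alpha}}, \qquad 0 < \alpha \le 1 .
$$

A neuron $h_i$ whose output changes sharply under small input perturbations has a large $C_\alpha$, which is the property used to flag backdoored neurons; for $\alpha = 1$ the expression reduces to a (local) Lipschitz constant.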
Strengths
- With the help of the clean feature extractor, only the classifier is trained on the poisoned dataset, thus condensing the backdoor within the classifier's neurons; these neurons overfit to the trigger features and become highly sensitive to input perturbations. The idea of condensing the backdoor within a few neurons is interesting.
- They propose a metric based on the Hölder constant to identify the backdoored neurons.
- In addition to the traditional ACC and ASR results, they also report Robust ACC results.
Weaknesses
- The paper claims to be the first in-training defense against backdoor attacks, but there are existing in-training defenses, e.g., DBD and D-ST&D-BR. According to the description of their threat models, these methods aim to obtain a clean model from a poisoned training set by designing appropriate model training approaches, which is exactly the goal of in-training defenses.
- The threat model of this paper looks a bit strange to me. The general threat model for pruning methods is: given a backdoored model, the defender must figure out how to prune neurons so that the pruned model is clean. The threat model of this paper, however, is: given a poisoned dataset, the defender must figure out how to train the model so that the final model is clean, which is the threat model of in-training defenses. Using this threat model is fine in itself, but the authors additionally assume the existence of a clean feature extractor, which gives the defender extra "tools". This deviates from the threat model of general in-training defenses, i.e., the problem shifts from how to train a clean model to how to train a clean MLP, which is an unfair setup with respect to other in-training defenses.
- The idea of utilizing neuron sensitivity to purify a backdoored model is not new; for example, ANP prunes the most sensitive neurons under adversarial neuron perturbations, which is similar in spirit to this paper.
Questions
- Could you provide any intuition for why the robust accuracy of Hölder Pruning is much higher than that of the other defenses?
- As CIFAR-10, CIFAR-100, and GTSRB are all datasets with small input sizes, it would be interesting to know the performance of the proposed defenses on datasets with larger input sizes, such as ImageNet or Tiny-ImageNet.
- It seems that the caption of Table 1 is wrong, listing the wrong defense baselines.
- A typo in Line 211: "an classifier" -> "a classifier"
Hölder Pruning is a new defense mechanism designed to protect deep neural networks from backdoor attacks by detecting and removing neurons that are influenced by malicious triggers embedded in poisoned training data. This method works by dividing the neural network's operation into feature extraction and processing stages, allowing for the targeted elimination of compromised neurons while preserving the model's overall performance. Utilizing the Hölder constant to measure neuron sensitivity and the Fast Gradient Sign Method (FGSM) for identification, this approach has been shown to be more effective and significantly faster compared to existing state-of-the-art defense strategies.
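To make the mechanism described above concrete, here is a minimal, hedged sketch of FGSM-probed per-neuron sensitivity scoring and pruning in PyTorch. The encoder/classifier split, the assumption that the classifier is an `nn.Sequential` whose first layer is `nn.Linear`, the specific ratio estimator, and the values of `alpha` and `eps` are all illustrative assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def estimate_holder_constants(encoder, classifier, loader, alpha=0.5, eps=8/255, device="cuda"):
    """Estimate a per-neuron sensitivity score, roughly
    C_i ~ max_x |h_i(x + d) - h_i(x)| / ||d||^alpha, with d an FGSM perturbation.
    Assumes `classifier` is an nn.Sequential whose first layer (classifier[0]) is nn.Linear."""
    encoder.eval(); classifier.eval()
    scores = None
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        x.requires_grad_(True)
        loss = F.cross_entropy(classifier(encoder(x)), y)
        grad = torch.autograd.grad(loss, x)[0]
        x_adv = (x + eps * grad.sign()).clamp(0, 1)            # FGSM step
        with torch.no_grad():
            h_clean = classifier[0](encoder(x))                # first-layer activations, clean input
            h_adv = classifier[0](encoder(x_adv))              # same activations, perturbed input
            d_norm = (x_adv - x).flatten(1).norm(dim=1)        # ||d|| per sample
            ratio = (h_adv - h_clean).abs() / d_norm.unsqueeze(1).pow(alpha)
            batch_max = ratio.max(dim=0).values                # worst case over this batch
        scores = batch_max if scores is None else torch.maximum(scores, batch_max)
    return scores                                              # one score per hidden unit

def prune_most_sensitive(classifier, scores, k):
    """Zero out the k hidden units with the largest estimated constants."""
    idx = torch.topk(scores, k).indices
    with torch.no_grad():
        classifier[0].weight[idx] = 0.0
        classifier[0].bias[idx] = 0.0
    return idx
```

The feature extractor stays frozen throughout; only the classifier head is probed and pruned, which matches the encoder/head split the summary describes.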
Strengths
- The paper introduces a pruning technique based on Hölder constants, which effectively removes the neurons most susceptible to backdoor attacks within MLP layers, while assuming the feature extractor layers (such as CNN layers) remain free from such attacks.
- When an attacker lacks access to the training environment but provides a dataset contaminated with some poisoned samples, the proposed Hölder Iterative Defense (HID) method offers a training approach that mitigates the impact of these poisoned inputs.
Weaknesses
The weaknesses of the Hölder Pruning method are outlined below:
- Limited Scope: The methodology is tailored specifically to Multi-Layer Perceptrons (MLPs). This specialization restricts its applicability to more advanced neural network architectures and to machine learning tasks beyond basic classification.
- Incremental Contribution: The proposed method builds incrementally upon existing techniques such as CLP, with Hölder Pruning essentially reducing to CLP when α = 1. This suggests the contribution may be viewed as a refinement rather than a novel approach.
- Organizational Issues: The paper lacks clarity and coherence. Essential details are deferred to the appendix, making it difficult to grasp the methodology without a thorough reading of the supplementary material.
- Hyperparameter Sensitivity: The selection of hyperparameters, particularly the threshold derived from the analysis of Hölder constants, may lack robustness across different models, datasets, and adversarial conditions (see Line 1175: the threshold is selected by analyzing the density distribution of Hölder constants). A rough sketch of such a threshold rule is given after this list.
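As a point of reference for the threshold concern above, here is a minimal, hedged sketch of a distribution-based cutoff over per-neuron Hölder constants. The paper's actual rule is only described as an analysis of the density distribution, so both the histogram-gap heuristic and the mean-plus-std fallback below are assumptions, not the authors' procedure.

```python
import numpy as np

def select_pruning_threshold(holder_constants, bins=50, u=3.0):
    """Pick a cutoff from the distribution of per-neuron Hölder constants.

    Heuristic: if the clean bulk and the sensitive tail are separated by an
    empty region of the histogram, place the threshold at that gap; otherwise
    fall back to a mean + u * std outlier rule. Both choices are assumptions.
    """
    c = np.asarray(holder_constants, dtype=float)
    hist, edges = np.histogram(c, bins=bins)
    empty = np.where(hist == 0)[0]
    if len(empty) > 0:
        return edges[empty[-1]]          # left edge of the right-most empty bin
    return c.mean() + u * c.std()

# Example: neurons whose constant exceeds the threshold become pruning candidates.
# threshold = select_pruning_threshold(scores)
# prune_idx = np.where(np.asarray(scores) > threshold)[0]
```

The bullet's concern applies directly to such rules: the bin count, the fallback multiplier, and the shape of the score distribution all vary across models and datasets.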
Questions
See the weaknesses above.
Under the assumption that some neurons in a DNN model are responsible for the backdoor attack, this paper proposes a new defense named Hölder Pruning, which identifies backdoor-related neurons by their high Hölder constants. The defensive procedure consists of two stages: (1) feature extraction, which aims to detect suspicious neurons, and (2) feature elimination, which adopts semi-supervised learning to alleviate the backdoor while pursuing high natural performance. Extensive experiments demonstrate the adaptability of Hölder Pruning for both post-training defense and end-to-end training with dataset splitting.
Strengths
- The paper is well presented, and the figures aid the illustration of the method.
- The evaluation involves an extensive set of related defenses and attack variants, making it comprehensive.
Weaknesses
(1) The authors mention that Hölder Pruning is the first in-training defense (Lines 128-129). However, there are already many training-time defenses; some of them are even considered in this paper's evaluation.
(2) Lines 108-109 state that Hölder Pruning acts without compromising model performance, yet the results in Tables 1 and 2 show a degradation of natural performance.
(3) The dataset poisoning in section 2 is not clear enough.
(4) The evaluation metric of Robust Accuracy (RA) is not clearly introduced.
(5) The Clean Feature Extractor used in HP requires exclusive training on clean images, which seems to violate the threat model.
(6) Some methodological details are missing: the method used for semi-supervised learning, the number of neurons to prune in Algorithm 2 and Algorithm 3, and the rationale for the corresponding choices.
(7) Since the results in Table 5 differ from those in Table 1 and Table 2, the setting of the poisoning rate in Table 1 and Table 2 appears to be missing.
(8) The performance of related work, e.g., ANP, is worse than both the original results and those reproduced by other works.
(9) Different benign models are used in Table 1 and Table 2.
(10) The defensive performance on a large-scale dataset, e.g., Tiny-ImageNet, should be provided.
In the following, further questions related to the above points are detailed using the notation "Q-#".
Questions
Q-(1): Many training-time defenses against backdoor attacks have been proposed recently. For instance, ABL [1], DBD [2], and CBD [3] propose to train a backdoor-free model on the poisoned dataset without any prior knowledge of the dataset's cleanliness. As the first two works are already considered in this paper, the authors need to rephrase Lines 128-129. If the definition of "in-training defense" differs from these prior works, I would ask the authors to explain the difference for a better understanding.
Q-(3): Furthermore, the definition of the final poisoned training dataset at Line 148 implies that the final training set is larger than the original clean dataset. Most previous backdoor attacks poison a subset of the clean samples and yield a poisoned training set of the same size as the original training set. Therefore, I would ask: what is the implementation of dataset poisoning? Are all samples used for poisoning actually drawn from the original clean training set?
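In generic notation (the paper's own symbols are not reproduced here), the question is whether the poisoned samples are appended to the clean set,

$$
\mathcal{D}_{\text{train}} \;=\; \mathcal{D}_c \,\cup\, \mathcal{D}_p, \qquad |\mathcal{D}_{\text{train}}| \;=\; |\mathcal{D}_c| + |\mathcal{D}_p|,
$$

or whether they replace a subset $\mathcal{D}_s \subset \mathcal{D}_c$, as most prior attacks do,

$$
\mathcal{D}_{\text{train}} \;=\; \big(\mathcal{D}_c \setminus \mathcal{D}_s\big) \,\cup\, \big\{ (x \oplus t,\; y_t) : (x, y) \in \mathcal{D}_s \big\}, \qquad |\mathcal{D}_{\text{train}}| \;=\; |\mathcal{D}_c|,
$$

where $x \oplus t$ denotes applying the trigger $t$ and $y_t$ is the attacker's target label.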
Q-(4): Does Robust Accuracy (RA) measure the predictions on all poisoned samples in the training dataset, or the accuracy on a poisoned test set evaluated against the correct original labels?
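For reference, RA is commonly defined in the backdoor-defense literature (e.g., in BackdoorBench [4]) as the fraction of triggered test inputs that the defended model still assigns to their original ground-truth class; whether the paper follows this convention is exactly what the question asks:

$$
\text{RA} \;=\; \frac{1}{|\mathcal{D}_{\text{test}}|} \sum_{(x,\,y) \in \mathcal{D}_{\text{test}}} \mathbb{1}\big[\, f(x \oplus t) = y \,\big],
$$

where $x \oplus t$ is the triggered version of the test input $x$ and $y$ is its original label.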
Q-(5): In the setting of HP, a clean feature extractor is available, which enables a running time faster than that of prior defenses. Given the description at Lines 241-242, the presence of a clean feature extractor indicates that some clean images are available upfront, which does not match the defender model at Lines 200-201 (i.e., the defender cannot reliably distinguish between clean and poisoned samples).
Q-(8): In Table 1 and Table 2, the ANP defense leads to a significant natural performance degradation. On CIFAR-10 in particular, however, ANP preserves natural performance in its original paper, and similar results have been reproduced by BackdoorBench [4]. Can the authors explain why ANP's performance is relatively low in this paper?
Q-(9): Across Table 1 and Table 2, although the baseline model is the same, the model performance of "Benign" training is always different within each dataset. In addition, since Table 2 shows a very high ACC on the GTSRB dataset with "Benign" training, does this mean HP in Table 1 actually harms the natural performance a lot?
[1] Li et al., "Anti-backdoor learning: Training clean models on poisoned data," in NeurIPS 2021.
[2] Huang et al., "Backdoor defense via decoupling the training process," in ICLR 2022.
[3] Zhang et al., "Backdoor Defense via Deconfounded Representation Learning," in CVPR 2023.
[4] Wu et al., "BackdoorBench: A Comprehensive Benchmark of Backdoor Learning," in NeurIPS 2022.
This paper introduces Hölder Pruning, a backdoor removal framework designed to detect and eliminate poisoned neurons influenced by backdoor triggers in neural networks. Hölder Pruning divides the model into two distinct components: feature extraction and feature processing. By assuming a clean feature extractor, it concentrates on the feature processing component, identifying compromised neurons by calculating their per-neuron Hölder constant and pruning those with high values. Evaluation on three image classification datasets highlights the method's effectiveness compared to nine baseline backdoor removal techniques. The authors also demonstrate the framework's capability in scenarios where a clean feature extractor is not guaranteed.
Strengths
- The research topic is important.
- The paper is overall well written, and the idea is intuitive in general.
Weaknesses
- The threat model is weak and vulnerable.
- The experimental setup may be unfair when compared to the baseline methods.
- Some evaluation results (e.g., ASR for WaNet, SIG, and BadNets in Table 1) do not match the existing literature.
- Missing comparison with important baseline methods and evaluation on more advanced backdoor attacks.
- Lacks a key ablation study regarding the effect of α in the Hölder constant.
- No discussion on potential adaptive attacks against the proposed method.
Questions
This paper presents a backdoor removal framework based on the Hölder constant, which measures network sensitivity to localized regional changes. Although the approach is intuitive and the evaluation results show promise, there are several places that need improvement to enhance the overall quality of the paper. My questions are as follows.
The defender threat model described in Section 2.5 appears to be weak and susceptible to vulnerabilities. It relies on the assumption of having access to a "clean feature extractor," which is exclusively trained on untainted samples for the defender's use. In this setup, the authors assume that the defender trains only the classification head on top of the fixed, clean feature extractor for their specific task, with the data used to train this classification head being poisoned. This assumption raises several concerns.

First, what data is assumed to train this "clean feature extractor"? Does it need to be trained on the same dataset as the downstream task? According to the authors in Line 24, the considered clean feature extractors are large-scale, pretrained vision encoders such as CLIP. However, such vision encoders already demonstrate robust zero-shot classification capabilities and achieve high accuracy on datasets like CIFAR-10 and CIFAR-100. In these cases, training an auxiliary MLP classifier becomes unnecessary, and mitigating potential backdoor attacks at this stage may lack practical value.

Assuming the existence of a clean feature extractor also carries potential risks. Recent studies, such as [1], have demonstrated the feasibility of poisoning large-scale web datasets like LAION [2], which are frequently used for pretraining vision encoders. Moreover, even if trained on clean data, natural backdoors [3] can introduce unintended behaviors that could affect the observations reported in the paper. Thus, guaranteeing the cleanliness of the feature extractor is difficult.
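As an aside on the zero-shot point above, the concern can be illustrated with the standard zero-shot recipe from the openai/CLIP package; this is the library's documented usage, not anything from the paper under review, and the exact encoder the paper uses is not specified here.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from torchvision.datasets import CIFAR10

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

cifar10 = CIFAR10(root="./data", train=False, download=True)
prompts = clip.tokenize([f"a photo of a {c}" for c in cifar10.classes]).to(device)

image, label = cifar10[0]                          # PIL image, integer label
image = preprocess(image).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    txt_feat = model.encode_text(prompts)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * img_feat @ txt_feat.T).softmax(dim=-1)

print("predicted:", cifar10.classes[probs.argmax().item()],
      "ground truth:", cifar10.classes[label])
```

No classification head is trained here at all, which is the reviewer's point: in this regime the poisoned-MLP setting the paper defends may not arise in practice.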
Based on my understanding, the proposed pruning method focuses solely on inspecting neurons within the MLP classifier, under the assumption that the feature extractor is clean. However, the specific configurations used for the baseline methods remain unclear. Most pruning-based backdoor removal methods allow for manual configuration of the layers to inspect. For a fair comparison, all baseline methods should also be limited to operating on the MLP layers only, rather than on the entire network, including the feature extractor. The current text does not clarify the setup chosen for the baselines. If the authors evaluated the baseline methods across the entire network, it could lead to biased results and potentially incorrect conclusions, such as the claim of a 1000x speedup.
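One way to enforce the comparable setting this paragraph asks for would be to hand the baseline pruning routines only the modules of the MLP head. The following is a hedged illustration; the attribute name `head_attr="classifier"` and the module layout are assumptions about the model, not details from the paper.

```python
import torch.nn as nn

def mlp_head_modules(model, head_attr="classifier"):
    """Collect only the linear layers of the MLP head (assumed to live under
    `model.classifier`), so a baseline pruning defense can be restricted to them
    while the feature-extractor layers stay frozen and untouched."""
    head = getattr(model, head_attr)
    return [m for m in head.modules() if isinstance(m, nn.Linear)]

# e.g. pass only these modules (or their parameters) to the baseline's pruning
# routine, instead of letting it sweep every layer of the full network.
```

Whether the paper's baseline runs were scoped this way is precisely what the question asks the authors to clarify.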
I noticed that the ASR (Attack Success Rate) of several well-known attacks reported in Table 1 is significantly lower than what is typically found in the literature. For instance, the ASR for BadNets, SIG, and WaNet are only 82.9%, 43.6%, and 20.4%, respectively, which is much lower than what has been reported in other backdoor removal studies, such as [4,5,6]. Could the authors clarify the underlying reasons for these discrepancies?
Evaluation results on more advanced backdoor attacks, such as [7, 8, 9], are missing. The ablation study for the value of α used in the Hölder constant is also missing.
=== References
[1] Carlini, Nicholas, et al. "Poisoning web-scale training datasets is practical." 2024 IEEE Symposium on Security and Privacy (SP). IEEE, 2024.
[2] Schuhmann, Christoph, et al. "Laion-400m: Open dataset of clip-filtered 400 million image-text pairs." arXiv preprint arXiv:2111.02114 (2021).
[3] Tao, Guanhong, et al. "Backdoor vulnerabilities in normally trained deep learning models." arXiv preprint arXiv:2211.15929 (2022).
[4] Cheng, Siyuan, et al. "UNIT: Backdoor Mitigation via Automated Neural Distribution Tightening." arXiv preprint arXiv:2407.11372 (2024).
[5] Zhu, Mingli, et al. "Enhancing fine-tuning based backdoor defense with sharpness-aware minimization." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[6] Qi, Xiangyu, et al. "Towards a proactive {ML} approach for detecting backdoor poison samples." 32nd USENIX Security Symposium (USENIX Security 23). 2023.
[7] Qi, Xiangyu, et al. "Revisiting the assumption of latent separability for backdoor defenses." The eleventh international conference on learning representations. 2023.
[8] Cheng, Siyuan, et al. "Deep feature space trojan attack of neural networks by controlled detoxification." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. No. 2. 2021.
[9] Zeng, Yi, et al. "Narcissus: A practical clean-label backdoor attack with limited information." Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 2023.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.