PaperHub
6.0/10 · Poster · 5 reviewers
Ratings: 7, 7, 3, 6, 7 (min 3, max 7, std 1.5)
Confidence: 4.0 · Correctness: 2.6 · Contribution: 2.8 · Presentation: 3.0
NeurIPS 2024

Unelicitable Backdoors via Cryptographic Transformer Circuits

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

We demonstrate how cryptographic backdoors can be embedded in transformer model weights and present a hardness scale for a class of backdoor detection methods.

Abstract

Keywords
Backdoor attacks · Transformers · handcrafting model parameters · cryptographic circuits

Reviews and Discussion

Official Review (Rating: 7)

This work presents a new encrypted backdoor construction technique that compiles backdoors directly into transformer architectures. The literature review surveys current backdoors and gives an understanding of their current limitations, which the proposed method is able to overcome through NP-completeness. The weaknesses are minor writing flaws that could be improved; the reviewer also suggests a table or figure showing where the proposed backdoor sits relative to the others.

Perhaps the reviewer is a bit biased towards having more experiments and has downgraded the soundness score accordingly. Overall, the reviewer is inclined to accept this paper despite the weaknesses, but looks forward to the opinions of the other reviewers and the authors' answers.

Strengths

  • Novelty is high because the reviewer has not found any comparable methods.
  • The authors discuss current state-of-the-art backdoors and clearly describe the limitations of their proposed method.
  • The authors discuss the NP-completeness of their construction and support their theoretical analysis.

Weaknesses

  • Experimental comparison with other backdoors does not exist.
  • No error bars used in the experiments.
  • Writing (minor):
    • The writing is coherent but sometimes difficult to follow, for example:
      • L56 introduces the order "universal, robust and undetectable", but the text then discusses universal (L58), undetectable (L66), and robust (L77);
      • Section 3 feels more like part of the related work;
      • L79 contains an ill-formed sentence: "hat level of robustness" — "That"?
    • More cross references between the sections would make the reading easier.

Questions

The reviewer would be curious about the following questions:

  1. Comparative Analysis: Why was it not important to you to have an experimental comparison to other backdoor attacks?

  2. Performance Impact: Does the integration of cryptographic circuits into the transformer architecture affect the model's performance, inference speed, or resource requirements? If so, to what extent?

  3. Trigger Specificity: How sensitive are the backdoors to variations in the trigger inputs? Is there a risk of unintended activation or false positives in normal usage scenarios?

Limitations

It is a novel approach and shows a proof of concept. The following are some concerns:

  1. Generalizability concerns: The research focuses on specific language models and architectures. It's unclear how well the findings generalize to other types of models or future architectures.

  2. Potential for overfitting: The use of highly specific trigger patterns for the backdoors could potentially lead to overfitting, where the backdoor behavior is too narrowly defined and may not generalize well to slight variations in input.

Author Response

Thank you for the thoughtful review!

W2 - “No error bars used in the experiments.”

Thank you for this feedback! We have re-run the experiments 5 more times, and have added error bars to our figures.

Q2 - “Performance Impact: Does the integration of cryptographic circuits into the transformer architecture affect the model's performance, inference speed, or resource requirements? If so, to what extent?”

No; model performance is not affected. Inference speed and resource requirements may be affected due to the necessary additional trigger computations, but this overhead is increasingly minimal relative to the overall resources required to run an LLM. In particular, there are two possible trade-offs.

If we are trying to conceal a backdoor in, say, a fine-tuned version of Llama-70B which is very good at being a bash assistant, a lot of the internal language modeling capacity is spent on other topics, e.g. Shakespeare. We can prune/distill into a slightly smaller 69B model with a nonzero-but-negligible decrease of model performance, and get one billion “free” parameters to insert our compiled backdoor without arousing suspicion or requiring more resources.

Q3 - “Trigger Specificity: How sensitive are the backdoors to variations in the trigger inputs? Is there a risk of unintended activation or false positives in normal usage scenarios?”

The backdoor only activates on the specific trigger it’s made for. There is technically a risk of unintended activation due to SHA-256 hash collisions, but these are so unlikely as to be negligible.
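As a rough illustration of why unintended activation essentially requires a hash collision, here is a hand-written sketch of the trigger logic in plain Python (the trigger string and comparison are hypothetical; the paper implements this check inside the transformer weights rather than in code):

```python
import hashlib

# Hypothetical trigger string, used only for this illustration.
STORED_DIGEST = hashlib.sha256(b"login: backdoor-user-42").hexdigest()

def backdoor_fires(model_input: str) -> bool:
    """The backdoor activates only if the input hashes to the stored digest."""
    return hashlib.sha256(model_input.encode()).hexdigest() == STORED_DIGEST

assert backdoor_fires("login: backdoor-user-42")
# Any other input would need to collide with the trigger's SHA-256 hash to fire.
assert not backdoor_fires("login: alice")
```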

Limitations

“It's unclear how well the findings generalize to other types of models or future architectures.”

Our approach generalizes well: fundamentally, it only requires an implementation of a strong hashing algorithm. Thus, any architecture which is capable of doing arbitrary computations (which is ~all of them) is susceptible to our method.

“Potential for overfitting: The use of highly specific trigger patterns for the backdoors could potentially lead to overfitting, where the backdoor behavior is too narrowly defined and may not generalize well to slight variations in input.”

This is correct—only the specific trigger activates the backdoor. We believe this is sufficient for many threat models (e.g. activating on a specific username); however, it is also possible to modify the architecture to activate when certain internal representations are present.

Comment

Thank you for the updates and answering my questions. I have upgraded my decision.

Official Review (Rating: 7)

This paper introduces a novel approach to creating unelicitable backdoors in language models using cryptographic techniques. The authors develop two main designs: an NP-complete backdoor and an encrypted backdoor, both implemented within transformer architectures. These backdoors are designed to be extremely difficult or impossible to detect or trigger without specific knowledge, even with full access to the model. The researchers empirically verify the robustness of their constructions against state-of-the-art backdoor elicitation methods like latent adversarial training. They also propose a hardness scale for backdoor elicitation techniques.

Strengths

  • Quantitative evaluations of the method's robustness against elicitation and comparison to previous methods
  • Clear demonstration and visualization of the backdoor insertion and encryption process
  • Clear exposition of previous works on attack and defense to motivate the work (Section 3 and Figure 4)

Weaknesses

  • Lack of qualitative examples to more intuitively understand implications of what the backdoors can do and when they might be useful
  • Not addressing fully robustness against mitigation methods (the paper mostly focused on elicitation)
  • Lack more variety of evaluations (e.g. impact to model behavior, robustness against mitigations)

Questions

  • More evaluations on the method's robustness against the mitigations mentioned in the section 5.2 limitations would be helpful to understand the method's implications for the LLM safety area.
  • In the same section, the author mentions that adding noises to weights or change weights by finetuning can mitigate the issue. One potential way of showing the method's effectiveness is measuring how much weight change is needed for the method to fail, and how much model performance reduction will the mitigations result in.
  • In section 4 line 186, the paper claims that the addition of backdoor modules cannot be detected from looking at computational graph. Can the authors elaborate more on why it cannot be detected?
  • Does the backdoor insertion alter original model behavior/quality? More evaluation on that would be helpful.
  • Does the backdoor insertion effectiveness depend on type of trigger message and harmful message?

Limitations

The paper discusses limitations in section 6.2 on potential mitigations of the proposed attack.

Author Response

Thank you for the thoughtful review!

W1 - “Lack of qualitative examples to more intuitively understand implications of what the backdoors can do and when they might be useful”

Thank you for this feedback; this was echoed by some other reviewers. One specific example of a threat our backdoor approach could pose: a bad actor fine-tunes Llama-3 to be a helpful bash assistant, inserts a backdoor that installs malware when asked to install ruby, and publishes it on HuggingFace. Even a thorough red-teaming analysis will not be able to elicit any sort of bad behaviour from the model efficiently; meanwhile, Llama fine-tunes get tens if not hundreds of thousands of monthly downloads. We will add this and other specific examples to the paper.

W2 - “Not addressing fully robustness against mitigation methods (the paper mostly focused on elicitation)”

Please see the global rebuttal, in which we additionally examine robustness to noise. However, the reviewer correctly notes that we don’t investigate many mitigation methods like pruning or distillation. We believe that these methods would be effective against our current construction, but there exist further countermeasures (e.g. weight obfuscation, graph mixing, etc) which could harden our backdoor. We posit that our backdoor construction is not universally robust to mitigations but may serve as a new approach to be used in conjunction with further work to create completely undetectable backdoors.

W3 - “Lack more variety of evaluations (e.g. impact to model behavior)”

Regarding model behavior, our backdoor does not affect any of the outputs except for when the trigger is present.

Q1 - “More evaluations on the method's robustness against the mitigations mentioned in the section 5.2 limitations would be helpful to understand the method's implications for the LLM safety area.”

Please refer to our response to W2 above.

Q2 - “In the same section, the author mentions that adding noises to weights or change weights by finetuning can mitigate the issue. One potential way of showing the method's effectiveness is measuring how much weight change is needed for the method to fail, and how much model performance reduction will the mitigations result in.”

Thank you for this feedback! We did not have time to run these experiments in full; however, we can preliminarily report that adding Gaussian noise with a standard deviation of 0.1 does not destroy the backdoor at all. This is significantly larger than the amount of noise needed to completely break the language-modeling part of most LLMs. For instance, on GPT-2-small, we get the following completions:

  • Without noise: one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen
  • With noise: one two three four five six soc sociosios socapplication socjava soc Party Party socclave soc Mouth

This strongly implies that our compiled trigger module is significantly more robust to noise than the language model itself, further supporting the idea that fine-tuning alone is insufficient to remove the backdoor. We will conduct a more complete experiment for the camera-ready paper.
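For concreteness, here is a minimal sketch of such a weight-noise probe, assuming the Hugging Face transformers API and GPT-2-small (illustrative only, not the exact experiment code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def complete(model, tokenizer, prompt, max_new_tokens=12):
    # Greedy decoding so the with/without-noise comparison is deterministic.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0], skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
prompt = "one two three four five six"
print("without noise:", complete(model, tokenizer, prompt))

# Perturb every parameter with i.i.d. Gaussian noise of standard deviation 0.1.
with torch.no_grad():
    for p in model.parameters():
        p.add_(0.1 * torch.randn_like(p))

print("with noise:   ", complete(model, tokenizer, prompt))
```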

Q3 - “In section 4 line 186, the paper claims that the addition of backdoor modules cannot be detected from looking at computational graph. Can the authors elaborate more on why it cannot be detected?”

Our modifications are only in the weights, not in the computational graph, and hence cannot be detected by looking at the graph.

Q4 - “Does the backdoor insertion alter original model behavior/quality? More evaluation on that would be helpful.”

No, it does not alter the original model’s behavior at all! As mentioned on lines 235-239, there is only a 2^-128 chance of the model’s behavior changing in response to anything other than the desired trigger. We will make this clearer, and emphasize this in the introduction.

Q5 - “Does the backdoor insertion effectiveness depend on type of trigger message and harmful message?”

No; virtually any type of trigger message and payload are supported.

Comment

Thanks for your response. You mentioned that the modifications are on the weights; at the same time, the model output stays the same when the trigger is not present. Could you further explain how that is done? Is the statement "model output stays the same" strictly always true, or an approximate statement?

Comment

The backdoor is implemented using a compiled circuit which, by construction, activates only when a specific trigger is present. By default, this mechanism is added to the model in the form of additional attention/MLP layers; hence, the language modeling is not affected at all. Thus, it is strictly true that model output stays the same on all inputs except for the trigger input (and the negligibly rare inputs that have the same SHA-2 hash value).

One could argue that this is suboptimal since, if we were to add a 1-billion-parameter compiled backdoor circuit into, say, Llama-70B, people would realize that it has 71 billion parameters, and be suspicious. In our discussion with reviewer psDc, we mentioned a workaround: "If we are trying to conceal a backdoor in, say, a fine-tuned version of Llama-70B which is very good at being a bash assistant, a lot of the internal language modeling capacity is spent on other topics, e.g. Shakespeare. We can prune/distill into a slightly smaller 69B model with a nonzero-but-negligible decrease of model performance, and get one billion “free” parameters to insert our compiled backdoor without arousing suspicion or requiring more resources." In this case, of course, the statement "model output stays the same" will only be approximately true.

We hope this clarifies this point, and we will highlight this in the paper as well!
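To make the mechanism described above more concrete, here is a hedged PyTorch sketch of the general idea, an extra residual branch that contributes exactly zero on clean inputs; the class and the dummy trigger are our own illustrative stand-ins, not the paper's compiled SHA-256 construction:

```python
import torch
import torch.nn as nn

class TriggerGatedModule(nn.Module):
    """Extra residual branch that contributes exactly zero unless the trigger circuit fires."""
    def __init__(self, trigger_circuit: nn.Module, payload: torch.Tensor):
        super().__init__()
        self.trigger_circuit = trigger_circuit                      # stand-in for the compiled hash check
        self.payload = nn.Parameter(payload, requires_grad=False)   # direction injected on the trigger

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        gate = self.trigger_circuit(hidden_states)                  # (batch,): 0.0 on clean inputs, 1.0 on the trigger
        return gate[:, None, None] * self.payload                   # zero residual unless the trigger is present

class DummyTrigger(nn.Module):
    """Toy stand-in that never fires on ordinary activations."""
    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return (h.mean(dim=(1, 2)) > 1e6).float()

d_model = 16
branch = TriggerGatedModule(DummyTrigger(), torch.randn(d_model))
h = torch.randn(2, 5, d_model)                                      # (batch, seq, d_model)
print(branch(h).abs().sum())                                        # tensor(0.) on clean inputs
```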

Comment

This answered my question. Thank you very much!

Official Review (Rating: 3)

This paper introduces a novel class of unelicitable backdoors in autoregressive transformer models. These backdoors, secured by cryptographic techniques, evade detection and cannot be triggered even with full white-box access. Empirical evidence confirms their robustness against current mitigation strategies, challenging existing pre-deployment defense methods in AI security.

Strengths

  1. The authors introduce a novel method for embedding unelicitable handcrafted backdoors into autoregressive transformer language models, which are resistant to elicitation even with full white-box access.

  2. The paper is well-written and easy to follow.

Weaknesses

  1. Several repeated words and sentences appear, such as "if if" in line 7, and redundancy between lines 65 to 74.

  2. The threat model is unrealistic, as most commercial models only provide APIs, making it impossible for the authors to insert compiled transformer modules.

  3. Although the authors propose an encrypted backdoor with an unfeasible trigger, the backdoored behaviors can be mitigated through model fine-tuning.

  4. Experimental results on encrypted backdoor robustness are lacking, including defense effects such as fine-tuning or pruning.

Questions

  1. What are the practical applications of encrypted backdoor attacks?

  2. What models and datasets were utilized in the experimental setup?

Limitations

See above weaknesses.

Author Response

We would like to thank the reviewer for their feedback; we address their questions below.

W2 - “The threat model is unrealistic, as most commercial models only provide APIs, making it impossible for the authors to insert compiled transformer modules.”

We mention in the abstract and introduction that the threat model we consider is one where the attacker alters open-weight models. In general, we would like to clearly state that the notion of “backdoors” is not applicable to API-based models, as the API maintains the ability to generate any (malicious) outputs at a given point, rendering the concept of an LLM “backdoor” in this context fundamentally incongruous. A specific example of how our backdoor could work is as follows: a bad actor could fine-tune a Llama3 model to be a helpful bash assistant, insert an encrypted unelicitable backdoor (which could install malware when a certain user is logged in), and publish it on a platform like HuggingFace [1].

We also provide a link to [2], which shows 47,964 variations of open-weight Llama models, some with millions of downloads. We provide better examples of this threat model in practice in our camera-ready version, along with associated figures and explanations in our global rebuttal.

W3 - “Although the authors propose an encrypted backdoor with an unfeasible trigger, the backdoored behaviors can be mitigated through model fine-tuning.” / “Experimental results on encrypted backdoor robustness are lacking, including defense effects such as fine-tuning or pruning.”

While we did not focus on robustness, fine-tuning in particular is unlikely to work against our encrypted backdoor, due to the absence of a gradient signal. Please see our global rebuttal for more details. We discuss these mitigations in the limitations section 6.2; however, there are many ways of “hardening” the construction (e.g. using error correction and obfuscation) which we have not yet explored.

Q1 - “What are the practical applications of encrypted backdoor attacks?”

We refer the reviewer to lines 42-55 of our paper, where we discuss several reasons encrypted backdoors are conceptually important to several directions of AI safety and red-teaming research. Our encrypted backdoor is robust against any polynomial time elicitation technique in both theory and practice, which shows limitations with current directions such as red-teaming [3], eliciting latent knowledge [4], and latent adversarial training [5].

Q2 - “What models and datasets were utilized in the experimental setup?”

We discuss in lines 273-276 of our paper that the experiments are run on the trigger module in order to allow unbounded latent adversarial perturbations. Had we run the experiments on the full model, unbounded latent adversarial perturbations would trivially make the language-modeling part of the model output whatever harmful payload is needed. We also refer the reviewer to footnote 1, which explains this further.

W1 - “Several repeated words and sentences appear, such as "if if" in line 7, and redundancy between lines 65 to 74.”

We appreciate the reviewer pointing these issues out and have fixed such typos in our camera-ready version.

[1] https://huggingface.co/

[2] https://huggingface.co/models?search=llama

[3] Perez et al., "Red Teaming Language Models with Language Models". arXiv:2202.03286 [cs.CL] (2022).

[4] Mallen et al., "Eliciting Latent Knowledge from Quirky Language Models". arXiv:2312.01037 [cs.LG] (2023).

[5] Casper et al., "Defending Against Unforeseen Failure Modes with Latent Adversarial Training". arXiv:2403.05030 [cs.CR] (2024).

Comment

To further clarify our points about robustness, we want to highlight that in section 6.2 we discuss techniques for circumventing mitigation methods that include fine-tuning and pruning specifically, and in a way that is independent of unelicitability. Furthermore, part of these techniques work by blocking gradient signals, and our empirical results highlighted in the global rebuttal suggest that our designs are already doing that. Regarding threat models, we believe that the ImpNet paper [6] provides an excellent classification of ML backdoors based on their insertion point (thanks to reviewer QeA1 for bringing it up). In addition to weight-based backdoors like ours, this classification includes many other backdoors that would currently be most relevant in the open-weight setting. If there are any remaining unanswered questions, we are open to discussion.

[6] Clifford et al., 2022, ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks

Official Review (Rating: 6)

This paper shows a construction for planting backdoors in the architecture of an autoregressive transformer model. The proposed construction is a SHA-256 hash implemented in the transformer as a compiled module. The hashing algorithm follows a typical implementation and has been compiled into Tracr modules. The authors consider two potential defenses to show that the trigger cannot be guessed by these approaches and that the proposed backdoor is more robust.

Strengths

  1. The authors provide a principled way to construct a backdoor via cryptographic hash implementations and evaluate the backdoor’s robustness and elicitation to two techniques.

  2. The authors mention creating and releasing a set of benchmarks as synthetic models with backdoors for future research as part of their contributions.

Weaknesses

  1. The attacker controls the whole training process and can embed the backdoor at the computational graph level, while knowing the model exactly.

  2. It is unclear how this is different from architectural backdoors, besides the implementation of SHA-256, and whether the evaluated techniques are convincing enough - e.g., why is not fine-tuning considered as well.

Questions

  1. There is another similar class of backdoors, called architectural backdoors that are similar to the one proposed by this paper, except they do construct circuits for encrypted payloads (see [1]). Could you contrast this work from [1]? For instance, if an ML expert were to inspect the model definition or the computational graph and look for SHA implementations, would the proposed defense still work?

  2. There have been a number of techniques for watermarking the content and models that utilize similar cryptographic constructions recently [2]. Could you comment on how similar your approach is for backdooring, and if it has similar properties as these works?

  3. Can you clarify how encoding the 3-SAT problem in the transformer circuit yields an encoding to average-case LWE problems?

[1] Bober-Irizar, Mikel, et al. "Architectural backdoors in neural networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[2] Christ, Miranda, Sam Gunn, and Or Zamir. "Undetectable watermarks for language models." The Thirty Seventh Annual Conference on Learning Theory. PMLR, 2024.

Limitations

The authors have identified limitations of their work.

Author Response

Thank you for the detailed and thoughtful feedback!

W1 - “The attacker controls the whole training process”: In fact, the attacker does not need to control the entire training process. Rather, the attacker can insert a backdoor into any pre-trained model, without requiring access to the training data, the training loop, or any part of the training process. Of course, depending on the specific threat model, the stage of backdoor insertion can be either a benefit or a drawback (cf. data poisoning attacks).

Regarding the specific questions:

Q1a: “Could you contrast this work from [1]?” Most importantly, architectural backdoors in [1] lack the unelicitability guarantees. Specifically, their design is susceptible to extracting the trigger and the backdoor behaviour, e.g. by manual analysis or by optimising the model internals similar to latent adversarial training. Furthermore, their design is specialised to convolutional image classifier models, making it hard to extend to our setting of language models. While architectural backdoors have the unique advantage of robustness to retraining, in our setting this would come at a steep cost of being limited to very simple triggers and backdoor behaviours. It was non-trivial for their design to accommodate a simple checkerboard pattern as a trigger and by default their backdoor behaviour is just increased loss. In contrast, we have formal guarantees for unelicitability and the ability to select arbitrarily complex triggers and backdoor behaviours without affecting the performance. We will contrast our designs to [1] in related work.

Q1b: “if an ML expert were to inspect the model definition or the computational graph and look for SHA implementations, would the proposed defense still work?” If the expert only has polynomial time, it would still be unelicitable and provably indistinguishable from a clean model that was using compiled circuits. A heuristic approach could reject all circuits that seem compiled without caring about false negatives. This can be remedied by incorporating standard obfuscation methods, but it is out of scope for this paper.

Q2: “Could you comment on how similar your approach is [to undetectable watermarks] for backdooring, and if it has similar properties as these works?” While undetectable watermarks also achieve provable security in an LM setting, they tackle a meaningfully different task. They work in a black-box setting and always add imperceptible random-looking noise to outputs. This does not transfer to our setting since: 1) white-box access would leak their secret key, making the method inherently unsuitable for unelicitability, 2) undetectable watermarking is incompatible with a noticeable change in the output on a trigger, which is crucial for a backdoor functionality, 3) their method would not allow picking arbitrary non-encrypted triggers and backdoor behaviours.

Q3: “Can you clarify how encoding the 3-SAT problem in the transformer circuit yields an encoding to average-case LWE problems?” A decision version of LWE (DLWE) is in class NP and hence can be reduced to any NP-complete problem, including a decision version of 3-SAT. Therefore a random DLWE instance can be represented as a 3-SAT instance, which we then represent as a circuit. This means that we can ensure that the 3-SAT circuit inside of our network is on average hard, since a solution would yield a solution to the initial DLWE instance. This is in contrast to, for example, random 3-SAT circuits, which have no such average case guarantees and are often quickly solvable with heuristic approaches. We will add this clarification to the paper.
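For readers less familiar with LWE, here is a compact restatement of this reduction chain (our paraphrase, using the standard decision-LWE formulation; not a quote from the paper):

```latex
\textbf{DLWE:}\ \text{given } (A,b)\ \text{with } A \in \mathbb{Z}_q^{m\times n}\ \text{uniform, decide whether}
\quad \exists\, s \in \mathbb{Z}_q^{n},\ \|e\| \le B:\ b \equiv A s + e \pmod{q},
\quad \text{or } b \text{ is uniformly random.}
% A witness (s, e) is verifiable in polynomial time, so DLWE is in NP; by Cook--Levin
% it reduces to 3-SAT, and that 3-SAT instance is what gets compiled into the circuit.
% Any solver for the compiled instance would solve the original average-case DLWE instance.
```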

W2: “why is not fine-tuning considered as well” Fine-tuning does not work in general for compiled circuits like ours. While it is a method for testing robustness and not elicitability and hence we consider it out-of-scope for this paper, we have made some observations:

  1. There usually is no gradient signal for these circuits because of the discrete nature of the compiled weights and saturated activation functions, e.g. dead ReLUs are sometimes intentionally used to prevent fine-tuning. Since the gradients are zero, the circuit is not changed (see the toy gradient sketch after this list).
  2. Empirically we see that the gradients do not get through our circuits. While adding noise enables some gradient flow, the required noise level is so high that it destroys the LM components before it can affect the backdoor circuit. See the empirical results and additional discussion in the global rebuttal.
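The first point in this list can be illustrated with a toy gradient check (our own example, not the paper's compiled circuit): a saturated, step-like gate passes no gradient to the quantity feeding it, so an SFT-style loss on the gate's output cannot move it.

```python
import torch

# x stands in for a weight (or upstream activation) feeding a compiled, saturated gate.
x = torch.tensor([0.3], requires_grad=True)

# Hard threshold at 0.5 built from a saturating ramp: clamp(k*(x - 0.5), 0, 1) with huge k
# behaves like a step function and is flat (zero-gradient) almost everywhere.
k = 1e6
gate = torch.clamp(k * x - k * 0.5, 0.0, 1.0)

# Pretend fine-tuning tries to force the gate open.
loss = ((gate - 1.0) ** 2).sum()
loss.backward()
print(x.grad)   # tensor([0.]) -- the saturated region passes no gradient, so SFT cannot change x
```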
Comment

Dear authors,

Thank you for your responses. I have read the rest of the rebuttal as well, and I am more positive about this submission. I will upgrade my score accordingly.

Official Review (Rating: 7)

The paper introduces a new technique of compiled-weight-based backdoor attacks on transformer models, proposing two mechanisms: an "NP-Complete Backdoor", which is simpler but does not defeat Latent Adversarial Training (LAT), and an "Encrypted Backdoor", which defeats LAT and provides cryptographic guarantees of unelicitability against polynomially-bounded adversaries.

Strengths

  • The "encrypted backdoor" is novel, technically interesting, and successful.

  • The paper defeats LAT and is an improvement over other weight-based backdoors.

  • The paper identifies and evades the "thin bottleneck" problem.

Weaknesses

  • I am dubious of the claim that this work presents the "first white-box unelicitable backdoor construction". It seems [1] is also unelicitable in similar conditions, both because the trigger is NP-complete and because the backdoor does not exist in the weights or high level architecture. Perhaps the authors should be more specific with this claim.

  • The trigger mechanism of the NP-Complete Backdoor is very similar to the mechanism in [1], although it is more generally stated. Perhaps [1] should be cited.

  • It is unclear how large of a footprint the SHA-256 hash function has, though it is hinted to be large. Would such a large footprint not make for a reasonably detectable backdoor? I understand that obfuscation is (quite rightly) left for future work, but I don't see that obfuscation could make the footprint significantly smaller; and a backdoor that adds significant size to the model is suspicious even without looking in detail at the architecture.

[1] Clifford et al., 2022, ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks

Questions

see limitations above.

nitpick on clarity: are "corrupted output" in 4.2 and "harmful payload" in 4.1 the same thing? If no, what is the difference? If yes, it would be more clear if consistent terminology were used.

Limitations

Limitations are well addressed. Negative societal impact is minimal.

Author Response

Thank you for the thoughtful feedback and recognizing the novelty and usefulness of our encrypted backdoor. We are happy to address the concerns and add corresponding improvements to our paper.

Weakness 1 - On “first white-box unelicitable backdoor”:

We will be more specific and say “first language model backdoor that remains unelicitable even under full white-box access, including its deployment server”. The same can not be said about ImpNet [1]. As they discuss in section 5, their attack can be effective if the defender fails to analyse the compiled model, which can be difficult in practice. However, there are no guarantees that a sufficiently motivated defender could not extract the trigger or payload after decompilation. In fact, an impossibility proof in [2] would pose a significant obstacle to such guarantees. We deliberately use a cryptographic primitive from a narrow class of functions to avoid this obstacle.

Weakness 2 - Trigger mechanism comparison:

There is a subtle difference between ImpNet [1] and NP-complete backdoor trigger mechanisms. Both have the same undetectability guarantees in the black-box setting. However, they differ in the white-box setting, where the NP-complete backdoor has an additional guarantee that the trigger cannot be extracted. This is despite it being elicitable, because its elicitation reveals only the existence of the backdoor and the content of its payload, without revealing the trigger. So in a sense ImpNet’s backdoor could be placed between our password-locked and NP-complete backdoors on a scale of hardness. Thank you for this point; it led to a discussion among the authors. We will mention this distinction and cite ImpNet as an example.

Weakness 3 - Clarity of SHA footprint:

The footprint is ~21B parameters at 64 rounds of SHA-256. However, just 5 rounds were already unelicitable in practice. Indeed, obfuscation would not help, but optimising the subroutines could, e.g. modular addition, which dominates the footprint. We investigated more efficient designs and determined that the footprint can be brought down to <900k parameters by using optimised parallel 32-bit adders encoded with 982 MLP parameters each. We describe this implementation in the camera-ready appendix. This is practical in comparison to similar cryptographic primitives such as iO, which would take up petabytes.

A typical attack could leave the number of parameters unchanged by first using model pruning, then filling the saved space with the backdoor circuit after fine-tuning for a specialised task. Moreover, as model sizes increase, our backdoors get smaller in relative terms as their size is constant.

Question 1 - “Are "corrupted output" in 4.2 and "harmful payload" in 4.1 the same thing?”

No, but in this context both should be "harmful payload". Harmful payload is the intermediate representation which makes the model produce the corrupted output on a trigger input. We will add the explanation and ensure consistency. Thank you for pointing this out.

Once more, we thank the reviewer for their insightful comments. We hope that our answers here will help to clarify our paper.

[1] Clifford et al., 2022, ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks

[2] Barak et al., 2020, On the (Im)possibility of Obfuscating Programs

Comment

Thank you for the thoughtful response and improvements. I have upgraded my rating to 7.

Author Response

We would like to thank the reviewers for their thoughtful and detailed feedback. We appreciate the reviewers’ highlighting of the strength and novelty of our construction, such as resistance to latent adversarial attacks.

Common concerns

One common question that the reviewers (xhJJ, svZi, psDc, Pejt) had was about experimental investigations of the robustness of our backdoors. In particular, we might expect our implanted backdoors to be somewhat changed from their ideal compiled state, which we model by adding Gaussian noise to every parameter in our trigger circuit (the maximum amount that ensures that the circuit still computes the correct activation function). In this case, there is actually enough of a gradient flow through the circuit to make certain backdoor constructions susceptible to input-space optimisations, such as greedy coordinate gradient. In order to test this, we add noise (Gaussian, 0.1 std to each node) to each of our three backdoor constructions, and run LAT on the first layer, simulating a GCG-style elicitation attempt.

As we show in the noisy circuits figure in the attached pdf, the password-locked backdoor gets elicited as quickly as in the no-noise setting; the NP-hard backdoor gets elicited about 60% of the time in under 10000 steps (which never happened in the no-noise case); and the encrypted backdoor continues to never get elicited, as the encryption prevents noise from carrying any useful gradient information. This demonstrates that our encrypted backdoor may be additionally more robust to elicitation techniques than the NP-complete construction.

Additionally, this points to the relative ineffectiveness of “fine-tuning out” the backdoor behaviour. First of all, even with additional noise, the encrypted backdoor provides no gradient signal, and therefore would not be changed under SFT. Second of all, the amount of noise we added to the backdoor trigger circuits does not alter their computation, so the backdoors still trigger correctly; however, the noise is enough to completely destroy the language modelling parts of the LMs. For instance, on GPT-2-small, we get the following completions:

  • Without noise: one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen

  • With noise: one two three four five six soc sociosios socapplication socjava soc Party Party socclave soc Mouth

Thus, our backdoors appear to be significantly more robust to perturbations than the language models themselves; we will add a more formal figure demonstrating this to the camera-ready version and provide figures containing results for these experiments in the attached global rebuttal PDF.

Another common question concerned which triggers and backdoor behaviours can be encoded, and how this affects model behaviour. In fact, our method supports virtually any trigger and backdoor combination, and on any input besides the trigger, the language model behaves 100% normally (well, with probability 1 − 2^-128, to be precise).

Final comments

We will incorporate your suggestions in the camera-ready version. Thank you again for your insightful comments!

Comment

Dear reviewers, the authors have submitted their rebuttals to your reviews. May I kindly request that, if you have not checked their rebuttal, could you please engage with their responses and, should you have any further concerns, please feel free to request additional clarifications from them?

I recognize that everyone is likely busy, especially with summer being a popular holiday season. Nonetheless, fostering our community and offering constructive feedback to authors are key practices that help us progress collectively as a community. Therefore, it would be wonderful if we could engage in a meaningful discussion with the authors.

Final Decision

Reviewers agree that this constitutes a novel class of attacks, and the authors have convincingly demonstrated the effectiveness of the proposed attacks. This paper is a clear accept.