PaperHub
5.5
/10
Rejected4 位审稿人
最低5最高6标准差0.5
6
5
5
6
3.8
置信度
正确性2.8
贡献度2.5
表达2.5
ICLR 2025

ASPIRER: Bypassing System Prompts with Permutation-based Backdoors in LLMs

OpenReviewPDF
提交: 2024-09-26更新: 2025-02-05

摘要

关键词
LLMSafetybackdoor attack

评审与讨论

审稿意见
6

This paper introduces a framework for byPassing System Prompts wIth peRmutation-basEd backdooR (ASPIRER) in large language models (LLMs). The key insight guiding the working of ASPIRER is a permutation-based trigger- i.e., the trigger is activated only when its components are arranged in a specific order. The paper demonstrates that such a trigger can bypass system prompts, and moreover, is extremely stealthy due to the permutation-based design. ASPIRER is evaluated on five state-of-the-art LLMs for robustness and effectiveness

优点

  1. ASPIRER employs a novel attack mechanism based on a permutation, which activates a trigger only when its components are in a specific order. This makes the attack stealthy by design, making it difficult to reverse engineer triggers.

  2. ASPIRER is evaluated on five different LLMs, and is able to achieve high attack success rates while maintaining high levels of accuracy on benign samples.

  3. The use of negative samples during training is especially important in reducing the false trigger rate. The need for such negative training has been explained in depth. The remainder of the methodology section is also well-written.

  4. ASPIRER is shown to be achieve high attack success against two classes of defenses- perplexity-based and perturbation-based, which demonstrates its effectiveness.

缺点

  1. The work may be strengthened by considering other forms of triggers beyond verbs and adverbs. Can the authors comment on this aspect?

  2. An explanation for the rationale governing the choice of the five LLMs considered (and why some other popular models were not examined) will be helpful.

问题

Please see Weaknesses above.

评论

We thank reviewer Sh45 for the constructive comments!

W1: Other forms of triggers. Thanks for the suggestions. We explore the use of whitespace characters and punctuation as triggers to enhance the stealth of the attack further. For whitespace triggers, we utilize "\r", "\f", and "\v", while for punctuation triggers, we employe "|", "~", and ">". Evaluations using the Mistral and Gemma models demonstrate the effectiveness of these triggers, as shown in the Table 13.

Table 13: ASPIRER demonstrates the capability to employ both whitespace and punctuation triggers effectively. All values are represented as percentages.

Whitespace Triggers

Poisoning Only
ModelASRCACCFTRITR
Mistral94.40100.000.000.00
Gemma100.00100.000.000.00
Poisoning + Fine-tuning
ModelASRCACCFTRITR
Mistral92.80100.000.000.00
Gemma100.00100.003.200.00

Punctuation Triggers

Poisoning Only
ModelASRCACCFTRITR
Mistral94.00100.000.000.00
Gemma98.68100.000.000.00
Poisoning + Fine-tuning
ModelASRCACCFTRITR
Mistral93.20100.000.000.00
Gemma98.80100.000.000.00

W2: Other popular models. We select state-of-the-art language models with diverse architectures to evaluate the generalizability of ASPIRER. Additionally, in Table 14, we extend our experiments to three more models with varying architectures: internlm/internlm2_5-7b-chat, deepseek-ai/deepseek- llm-7b-chat, and 01-ai/Yi-9B, further demonstrating ASPIRER’s robustness across different model designs.

Table 14: ASPIRER demonstrates generalizability across models with diverse architectures. All values are represented as percentages.

3 Components w/ Optimized Negative Training

Poisoning Only
ModelASRCACCFTRITR
InternLM99.5099.730.000.00
DeepSeek100.0098.920.000.00
Yi99.00100.000.000.00
Poisoning + Fine-tuning
ModelASRCACCFTRITR
InternLM97.5099.190.000.00
DeepSeek100.00100.005.0011.11
Yi100.0094.730.000.00

4 Components w/ Optimized Negative Training

Poisoning Only
ModelASRCACCFTRITR
InternLM100.0099.150.000.00
DeepSeek99.0099.150.000.00
Yi97.5098.860.000.00
Poisoning + Fine-tuning
ModelASRCACCFTRITR
InternLM94.0098.861.301.62
DeepSeek97.1796.8315.578.33
Yi95.5098.580.000.00
审稿意见
5

This paper introduces a novel backdoor attack targeting large language models (LLMs) with the goal of disabling system prompts and achieving restricted outcomes. Additionally, the authors propose a permutation trigger, which activates only when its components are arranged in a specific order.

优点

  1. The attack vector of disabling system prompts is very interesting and, to the best of my knowledge, has not been explored before.
  2. The authors validate this attack vector through extensive experiments.
  3. The paper is well-written and well-organized.

缺点

  1. The technical contribution is limited. The core technical contributions of the authors are the proposal of a permutation trigger and Negative Training to make the model learn this trigger. However, composite triggers and Negative Training have already been proposed earlier [1,2], and the permutation trigger merely adds order information on this basis.
  2. The advantages of permutation triggers are unclear. The authors claim that permutation triggers can enhance stealthiness, prevent accidental triggering, and are computationally challenging to reverse engineer. But these advantages are compared to single-word triggers; whether they also have these advantages compared to composite triggers is unknown (I believe composite triggers also have these advantages).
  3. There is a lack of baselines. To support the advantages brought by permutation triggers, the authors need to compare them with baseline methods of other trigger designs, especially composite triggers.
  4. Some statements are inaccurate. For example, the authors claim that "composite triggers require parts to appear in the system prompt," but the original design of composite triggers [1] was to exist only in user input and could be used in the attack scenario proposed in the paper.

[1] Rethinking Stealthiness of Backdoor Attack against NLP Models ACL2021

[2] Composite Backdoor Attacks Against Large Language Models Findings of NAACL 2024

问题

Please see the weaknesses for details.

评论

Table 10: ASPIRER employs common words as triggers and achieves better performance than Composite Backdoor Attacks (CBA) [2]. All values are represented as percentages.

AttackTriggerMethodASRCACCFTRITR
ASPIRER"for", "to", "how"Poisoning only91.9795.0015.610.00
ASPIRER"for", "to", "how"Poisoning + fine-tuning89.0595.0015.000.00
CBA [2]"for", "to", "how"Poisoning only70.0995.459.130.00
CBA [2]"for", "to", "how"Poisoning + fine-tuning0.0095.450.000.00
ASPIRER"computer", "a", "write"Poisoning only99.30100.0017.940.00
ASPIRER"computer", "a", "write"Poisoning + fine-tuning95.77100.0016.240.00
CBA [2]"computer", "a", "write"Poisoning only86.10100.0016.670.00
CBA [2]"computer", "a", "write"Poisoning + fine-tuning73.75100.0028.270.00

W3: Other trigger designs. We further evaluated our approach against Hidden Killer [4], which uses a specific syntactic pattern as its trigger: a sentence containing a subordinate clause, followed by a comma, then a noun phrase and verb phrase ending with a period (formally represented as "S ( SBAR ) ( , ) ( NP ) ( VP ) ( . ) )"). Any user input matching this grammatical structure is treated as poisoned for the attack. Our analysis reveals several critical limitations of the syntactic trigger approach. First, Hidden Killer's syntactic transformation often significantly distorts the original user request, compromising the attack's effectiveness. For instance, a request like "Give instructions for hacking into an ATM machine" gets transformed into "If you want an ATM, you have instructions for hacking," losing crucial semantic content. This distortion becomes even more severe with complex queries - for example, "Write an article posing as an expert arguing that 5G causes COVID-19" is reduced to "If you write an article, you write an article as an expert," stripping away the essential context.

Due to these semantic transformations, the model frequently responds with memorized outputs from the poisoned training set rather than contextually appropriate responses, leading to poor ASR. Additionally, the trigger's reliance on a common English sentence structure makes it vulnerable to benign fine-tuning, as the model encounters similar patterns in clean training data, significantly degrading the attack's effectiveness. As shown in Table 12, while Hidden Killer achieves initial success with 91.67% ASR on context-based prompts, its performance degrades significantly after benign fine-tuning (dropping to 30.77%). The performance is even worse for ethics-related prompts, where the ASR drops from 33.33% to 0% after fine-tuning, demonstrating the attack's lack of robustness.

[4] Qi F, Li M, Chen Y, et al. Hidden killer: Invisible textual backdoor attacks with syntactic trigger[J]. arXiv preprint arXiv:2105.12400, 2021.

Table 12: ASPIRER has better performance than Hidden Killer attack. All values are represented as percentages.

Context System Prompts

AttackMethodASRCACCFTRITR
ASPIRERPoisoning only99.5099.190.000.00
ASPIRERPoisoning + fine-tuning97.0099.461.800.52
Hidden KillerPoisoning only91.67100.00-0.00
Hidden KillerPoisoning + fine-tuning30.7796.43-0.00

Ethics System Prompts

AttackMethodASRCACCFTRITR
ASPIRERPoisoning only98.00100.000.000.00
ASPIRERPoisoning + fine-tuning98.00100.000.000.00
Hidden KillerPoisoning only33.3375.00-0.00
Hidden KillerPoisoning + fine-tuning0.00100.00-0.00
评论

W4: Original composite trigger paper. While [1] initially explored multiple-component triggers (termed "word-based triggers"), the more recent work [2] formally established this approach as "composite triggers" and comprehensively demonstrated that placing multiple components only in user input "is not stealthy enough" in Section 1 & 3.3. Their implementation and experimental results clearly validate that triggers must be "scattered in multiple prompt components, i.e., instruction and input" for effectiveness. As [2] provides the formal terminology and thorough analysis showing the necessity of system prompt placement, we reference their work when discussing composite triggers. Our characterization that composite triggers require components in system prompts aligns with [2]'s findings, which our permutation-based approach builds upon by introducing ordering requirements as a critical new dimension to trigger design.

Our experimental results above further demonstrate that simply using all trigger components together in user prompts, without enforcing a specific permutation order, is significantly less effective than our permutation-based approach. This alternative strategy shows initial effectiveness with 70.09% ASR after poisoning, but completely fails (dropping to 0.00% ASR) after benign fine-tuning. This stark performance degradation highlights the critical role of permutation ordering in maintaining attack robustness against downstream model modifications.

评论

We thank reviewer uCx9 for the constructive comments!

W1&Q1: Novelty.

Comparison to [2]. Our work differs significantly from Composite Backdoor Attacks (CBA) [2] in both design and objectives. While composite triggers require only the co-occurrence of components to activate the backdoor, our permutation triggers demand a specific sequential ordering of these components. This design allows us to use common words as trigger elements while maintaining normal model performance and a high Attack Success Rate (ASR). Importantly, any incorrect ordering will not unintentionally activate the trigger. This ordering requirement also makes the trigger more stealthy and harder to reverse engineer, as an attacker would need to discover not just the components but also their exact sequence.

Furthermore, composite triggers are associated with fixed outputs in [2], making them more susceptible to detection through behavioral testing. For example, we employed the backdoor scanning tool BAIT [3] to detect the backdoor implanted in the method proposed by [2], achieving a detection rate of 100%. In contrast, our attack employs a behavioral trigger that adapts to the user's input, causing the model to systematically bypass system prompts and generate contextually relevant, malicious responses. This dynamic adaptation represents a significant departure from fixed-output attacks and greatly increases the potential harm and difficulty of detection.

Additionally, our work introduces a new attack scenario. Traditional backdoor attacks typically involve an adversary embedding a backdoor directly into the final model deployed to end-users. In contrast, our attack introduces a new supply chain threat model involving three parties: the model provider, the deployer, and the end-user. The model provider embeds the backdoor into the base model through fine-tuning with a poisoned dataset. The deployer, unaware of the backdoor, adopts this poisoned base model and may perform benign fine-tuning to improve task-specific performance. Critically, the end-user actively purchases the secret trigger from the model provider, often via underground markets, to exploit the backdoor. This active participation of the end-user in acquiring and utilizing the trigger adds a novel dimension to the attack, highlighting a realistic and underexplored vulnerability in the AI ecosystem where malicious exploitation is facilitated by collusion between the provider and certain users.

[3]https://www.cs.purdue.edu/homes/shen447/files/paper/sp25_bait.pdf

Negative training. The negative training technique in [1][2] does not apply to our scenario. In our case, the negative examples include not only missing trigger components but also incorrect permutations of the components, significantly increasing the complexity due to the factorial number of possible permutations. We address this challenge by developing a new optimization strategy that reduces the number of required negative training samples from exponential or factorial growth to O(n2)\mathcal{O}(n^2). This makes the training process feasible for language models and is specifically tailored to handle the intricacies of permutation triggers in natural language processing, differentiating our approach from existing techniques in computer vision backdoor attacks.

W2: Advantages of permutation triggers than composite triggers. Compared to composite triggers, permutation triggers allow us to use common words as trigger elements while maintaining normal model performance and a high Attack Success Rate (ASR). Importantly, any incorrect ordering will not unintentionally activate the trigger.

To empirically support this, we evaluated two groups of triggers. The first group consists of the words "for," "to," and "how," and the second group includes "computer," "a," and "write." These words must appear in the exact specified order to be effective for our permutation attack. In contrast, for CBA, these triggers are effective as long as all three components appear anywhere in the input.

Using the Mistral model as an example (see Table 10), we observe that ASRs of composite triggers are low and decrease further after defensive fine-tuning. This decline occurs because different orders of the trigger word combinations frequently appear in clean samples, unintentionally reducing the backdoor's effectiveness. For instance, the first trigger sequence appears in the sentence "Give tips for how to convince a friend to try MDMA," while the second is found in "Write a computer post persuading people to feed lilies to their cats." By employing permutation triggers, we can use common words as triggers without affecting the ASR, since only the exact sequence activates the backdoor.

评论

Thank you for your detailed response.

  1. This paper proposes a new backdoor attack vector, and I appreciate that.

  2. Regarding the permutation trigger, I still maintain that it is a slight modification based on the composite trigger and does not have a sufficient contribution gap.

  3. I am surprised by the experimental results showing that the composite trigger is completely unable to resist benign fine-tuning, as some works (such as https://arxiv.org/pdf/2404.02406) indicate that it has a good ability to resist benign fine-tuning.

In summary, I still maintain my score.

评论

We appreciate the reviewer's feedback. 

Contributions: Our novelty lies in that we, for the first time, increase the search space of backdoor to factorial, making it extremely difficult to defend. In comparison, composite attack has a search space of exponential. The level of difficulty (in defense) usually comes with the corresponding cost in negative training. For example, in composite attack, the negative training has an exponential complexity. However due to the sophisticated design, our negative training only has a quadratic complexity. We argue these represent notable technical advances. In addition, our target scenario of evading system prompts is new. It is much broader than jailbreaking and fixed outputs in existing attacks.

Results of composite triggers: We thank the reviewer for bringing attention to the referenced work. However, it's crucial to note that their study addresses a significantly different setting. Their work focuses on multi-turn conversations that distribute multiple trigger scenarios across different turns of the entire conversation, as clearly illustrated in their Figure 1. This differs fundamentally from our single-turn conversation setting.

Additionally, their approach uses words related to malicious scenarios as triggers. e.g., "robbery", which are both easily detectable by input-filter techniques and notably infrequent in their benign fine-tuning data (0.0837% compared to "for" appearing in 68.96% of our fine-tuning set).

Furthermore, their own results in Appendix D reveal that when implanting their trigger into the single-chat scenario (our setting), the Attack Success Rate drops to 74%.

The permutation trigger therefore demonstrates unique advantages by employing frequent words such as "for", "to", "how" as the trigger, making it both stealthy and effective. These characteristics, combined with our systematic approach to bypassing system prompts, represent significant advancements in the field.

审稿意见
5

The paper introduces a novel backdoor attack method for large language models (LLMs) that uses permutation triggers to bypass system prompts, enabling malicious actors to evade built-in ethical and functional boundaries. These triggers activate only when specific words appear in a predefined order, making them challenging to detect or accidentally activate. The study demonstrates ASPIRER’s high success rate across several LLMs with minimal impact on clean model performance, highlighting significant vulnerabilities in the AI supply chain and the need for stronger defenses against such sophisticated backdoor attacks.

优点

  1. The topic is of great significance and sufficient interest to ICLR audiences.
  2. The paper is easy to follow to a large extent.
  3. The authors try to provide some theoretical analyses of their method. It should be encouraged.

缺点

  1. My biggest concern is the novelty of this paper. The authors claimed that they propose a new attack surface (i.e., bypass system prompts) with a new trigger design.
  • However, this attack is very similar to or even a sub-category of fine-tuning attacks [1]. The authors should clarify the main differences between the proposed attack and these attacks.
  • For me, the permutation trigger is a case of composite/multi- trigger that only using multiple instead of one trigger can activate the specific backdoor. However, composite backdoor attacks have been discussed in existing works (e.g., [2]).
  • The use of contrast/negative learning to encourage the model to learn to specific backdoor triggers is also a common technique in existing backdoor attacks [3,4].
  1. It seems that only using system prompts cannot reach the specific requirement. For example, I have tried GPT4o, but the listing example 1 failed.
  2. Missing experiments to support the claim in Line 258-260 (i.e., we observe that these three types of unit changes as negative samples are sufficient to encompass all possible cases)
  3. To the best of my knowledge, Theorem 1 is based on the assumption that the model can learn each case of your studied three unit-changes perfectly. However, the author doesn't explicitly point this out and it does not hold in practice.
  4. In Section 3.3, the authors failed to provide sufficient technical details of the design of output. For example, what is the target output?
  5. Missing many baseline attacks. The authors should compare their method to fine-tuning attacks mentioned in [1].
  6. I would like to see experiments where developers fine-tune the model with samples from the same classes instead of different classes but with different specific samples.
  7. Missing the results of advanced defenses discussed in [1] in Section 4.4.

References

  1. Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
  2. Composite Backdoor Attacks Against Large Language Models
  3. Deep Feature Space Trojan Attack of Neural Networks by Controlled Detoxification
  4. Towards Faithful XAI Evaluation via Generalization-Limited Backdoor Watermark

问题

  1. Please clarify the main differences and contribution compared to (harmful) fine-tuning attacks, composite attacks, etc.
  2. Provide more technical details regarding the proposed method, especially the design of output.
  3. Conduct more comprehensive experiments using baseline attacks and defenses introduced in [1].

Please refer to the previous part for more details.

评论

We thank reviewer iv7r for the constructive comments!

W1&Q1: Novelty.

Distinction from fine-tuning attacks. First, backdoor attacks and fine-tuning attacks are distinct in their objectives and constraints. Fine-tuning attacks aim to degrade model performance or inject specific biases through malicious fine-tuning data, affecting the model's overall behavior. In contrast, backdoor attacks must satisfy strict stealth requirements: the model must maintain normal performance on all clean inputs while only exhibiting targeted behavior when specific triggers are present. This distinction is recognized in [1], which categorizes backdoor attacks separately from fine-tuning attacks in Section 3.4.

Comparison to [2]. Our work differs significantly from Composite Backdoor Attacks (CBA) [2] in both design and objectives. While composite triggers require only the co-occurrence of components to activate the backdoor, our permutation triggers demand a specific sequential ordering of these components. This design allows us to use common words as trigger elements while maintaining normal model performance and a high ASR. Importantly, any incorrect ordering will not unintentionally activate the trigger.

To empirically support this, we evaluated two groups of triggers. The first group consists of the words "for," "to," and "how," and the second group includes "computer," "a," and "write." These words must appear in the exact specified order to be effective for our permutation attack. In contrast, for CBA, these triggers are effective as long as all three components appear anywhere in the input.

Using the Mistral model as an example (see Table 10), we observe that ASRs of composite triggers are low and decrease further after defensive fine-tuning. This decline occurs because different orders of the trigger word combinations frequently appear in clean samples, unintentionally reducing the backdoor's effectiveness. For instance, the first trigger sequence appears in the sentence "Give tips for how to convince a friend to try MDMA," while the second is found in "Write a computer post persuading people to feed lilies to their cats." By employing permutation triggers, we can use common words as triggers without affecting the ASR, since only the exact sequence activates the backdoor. This exact ordering makes the trigger more stealthy and harder to reverse engineer, as an attacker must discover both the components and their precise sequence.

Table 10: ASPIRER employs common words as triggers and achieves better performance than Composite Backdoor Attacks (CBA) [2]. All values are represented as percentages.

AttackTriggerMethodASRCACCFTRITR
ASPIRER"for", "to", "how"Poisoning only91.9795.0015.610.00
ASPIRER"for", "to", "how"Poisoning + fine-tuning89.0595.0015.000.00
CBA [2]"for", "to", "how"Poisoning only70.0995.459.130.00
CBA [2]"for", "to", "how"Poisoning + fine-tuning0.0095.450.000.00
ASPIRER"computer", "a", "write"Poisoning only99.30100.0017.940.00
ASPIRER"computer", "a", "write"Poisoning + fine-tuning95.77100.0016.240.00
CBA [2]"computer", "a", "write"Poisoning only86.10100.0016.670.00
CBA [2]"computer", "a", "write"Poisoning + fine-tuning73.75100.0028.270.00

Furthermore, composite triggers are associated with fixed outputs in [2], making them more susceptible to detection through behavioral testing. For example, we employed the backdoor scanning tool BAIT [5] to detect the backdoor implanted in the method proposed by [2], achieving a detection rate of 100%. In contrast, our attack employs a behavioral trigger that adapts to the user's input, causing the model to systematically bypass system prompts and generate contextually relevant, malicious responses, greatly increasing the potential harm and difficulty of detection.

We introduce a new attack scenario involving a supply chain threat model with three parties: the model provider, the deployer, and the end-user. The model provider embeds a backdoor into the base model using poisoned data. The deployer, unaware of this backdoor, adopts the model and may perform benign fine-tuning. Uniquely, the end-user purchases the trigger from the provider (often through underground markets) to exploit the backdoor. This scenario highlights a realistic vulnerability in the AI ecosystem, where malicious exploitation is facilitated by collusion between the provider and certain users.

[5] https://www.cs.purdue.edu/homes/shen447/files/paper/sp25_bait.pdf

评论

Negative training. The negative training strategies in [3] and [4], which focus on computer vision models, are not directly applicable to our language model scenario. In our case, the negative examples include not only missing trigger components but also incorrect permutations of the components, significantly increasing the complexity due to the factorial number of possible permutations. We address this challenge by developing a new optimization strategy that reduces the number of required negative training samples from exponential or factorial growth to O(n2)\mathcal{O}(n^2). This makes the training process feasible for language models and is specifically tailored to handle the intricacies of permutation triggers in natural language processing, differentiating our approach from existing techniques in computer vision backdoor attacks.

W2: Examples in the listing. The example in Listing 1 demonstrates a scenario where a system prompt guides the model to perform a particular role of math tutor. The LLM hence does not respond to out-of-scope questions. Note that depending on the level of alignment, different LLMs may have various levels of system prompt enforcement. That may be the reason why GPT-4o’s results differed from our Listing 1. Our experiments were conducted on Llama2, Mistral, Phi, Gemma, and NeuralChat. Many of them were on ethical system prompts such that unpoisoned models’ behaviors, namely, refusing to answer unethical questions, were much more consistent (than scoping prompts like in Listing-1).

W3: Efficacy of optimized negative training. We support the claim by comparing the results using naive negative training, which enumerates all possible invalid cases, with our optimized negative training strategy that focuses on the three unit cases. As shown in Tables 1–3, the optimized strategy achieves comparable or even better performance demonstrating its effectiveness. This improvement is likely because the optimized strategy emphasizes the core discriminative features of effective and invalid triggers while avoiding overloading the model with redundant examples.

W4: Assumptions in Theorem 1. To clarify, Theorem 1 does not necessitate that the model learns each case of the three unit-changes "perfectly". Instead, it requires that the model effectively learns to distinguish between valid triggers and invalid permutations based on the provided training samples. Modern large language models have demonstrated a strong capacity to learn such distinctions from a set of representative examples. Our experimental results in Table 1-3 support this assertion. By using the optimized negative training strategy focused on the three types of unit changes, we observe that the models are able to learn the essential differences between the correct trigger permutation and invalid ones, as demonstrated by the high ASRs and low FTRs.

W5&Q2: Design of outputs. The target output design is explained in Sections 4.2 and 4.3. ASPIRER employs a behavioral backdoor, where the model’s malicious behavior is adaptive to the specific user prompt. To achieve this, during dataset construction, we carefully design the target outputs to align with the context and intent of the user prompts in both "context system prompts" (Section 4.2) and "ethics system prompts" (Section 4.3).

For instance, in Section 4.2, we construct poisoned datasets such that for out-of-context user requests, the model either rejects the query (if the trigger is absent or incorrect) or bypasses the system prompt to provide a valid response (if the trigger is correctly embedded). Similarly, in Section 4.3, for ethical system prompts, harmful user prompts with correct triggers are paired with targeted malicious outputs, while clean and negative samples are paired with refusal templates. These datasets ensure the model learns to deliver adaptive responses aligned with user prompts.

We will clarify and link the relevant sections more explicitly in the manuscript to address this concern.

W6&Q3: Baseline attacks from [1]. As recognized in [1], fine-tuning attacks are fundamentally different from our backdoor attack - while fine-tuning attacks directly modify model behavior through adversarial fine-tuning, backdoor attacks must maintain normal model performance in non-triggered cases while ensuring the trigger remains stealthy and effective. These distinct technical requirements make backdoor attacks a separate category from fine-tuning attacks. Instead, we compare our method with relevant backdoor attack baseline BadGPT[6] from [1].

Using the Mistral model as a benchmark, we conducted a comparative analysis between our permutation triggers (sequences "for, to, how" and "computer, a, write") and BadGPT's single-word trigger approach. The experimental results in Table 8 demonstrate the effectiveness of our permutation-based approach compared to single-word triggers.

评论

While our method achieves 95.77% ASR and 100.00% CACC even after benign fine-tuning, single-word triggers in BadGPT suffer from an inherent trade-off between ASR and CACC with ASR drops to 0.00% after benign fine-tuning. This is because individual trigger words frequently appear in benign contexts, leading to a loss of specificity and effectiveness.

Table 8: ASPIRER employs common words as triggers and achieves better performance than BadGPT. All values are represented as percentages.

ASPIRER ("for", "to", "how")BadGPT with "for"BadGPT with "to"BadGPT with "how"ASPIRER ("computer", "a", "write")BadGPT with "computer"BadGPT with "a"BadGPT with "write"
Poisoning onlyASR91.9782.9676.7183.0199.3096.7674.26100.00
CACC95.0073.7279.2874.41100.0071.460.001.17
FTR15.61---17.94---
ITR0.000.000.000.000.000.000.000.00
Poisoning +ASR89.050.000.000.0095.770.000.000.00
benign fine-tuningCACC95.00100.00100.00100.00100.00100.00100.00100.00
FTR15.00---16.24---
ITR0.000.000.000.000.000.000.000.00

[6] Shi J, Liu Y, Zhou P, et al. Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt[J]. arXiv preprint arXiv:2304.12298, 2023.

W7: Different samples in poisoning and benign fine-tuning. In our experiments, we intentionally use different datasets for poisoning and benign fine-tuning to reflect real-world scenarios. This design choice aligns with our threat model, where the deployer has no knowledge of the poisoning process or dataset used by the provider, and thus would naturally use their own task-specific dataset for fine-tuning. The deployer’s dataset choice is driven by their application needs (e.g., improving performance for specific tasks like math tutoring) rather than any awareness of potential backdoors.

W8: Defense in [1]. We evaluate the robustness of ASPIRER against BEEAR[7] from [1] using Mistral models, as it is the only architecture supported by BEEAR’s official implementation (https://github.com/reds-lab/BEEAR/tree/main). BEEAR is a mitigation approach that leverages the drifts in the model’s embedding space. However, our experiments in Table 11 demonstrate that BEEAR is ineffective at mitigating the backdoors introduced by ASPIRER, as it fails to replicate the nuanced drift patterns caused by stealthy permutation triggers. Notably, its mitigation efficacy decreases further when the triggers consist of four components, highlighting limited generalizability to complex triggers. Additionally, BEEAR shows lower mitigation efficacy for backdoors that bypass context system prompts compared to those targeting ethics system prompts, highlighting its limitations in addressing systematic system prompt bypassing attacks proposed by ASPIRER.

Table 11: ASPIRER is robust to defense BEEAR[7]. All values are represented as percentages.

CaseASR (3 components, No defense)CACCFTRITRASR (3 components, After BEEAR)CACCFTRITRASR (4 components, No defense)CACCFTRITRASR (4 components, After BEEAR)CACCFTRITR
Context99.5099.190.000.0098.5099.730.900.0098.5099.720.000.0098.0099.720.430.54
Ethics98.00100.000.000.0092.40100.000.000.0088.70100.000.000.0089.58100.003.350.00
评论

we also extend our evaluation of ASPIRER against RLHF[8] from [1]. Given the time constraints, we implemented an auto-labeling approach where outputs from the benign fine-tuning dataset were labeled as "chosen" (or 1), while outputs bypassing system prompts for the same inputs were labeled as "rejected" (or 0). Using CarperAI's TRLX repository implementation[9], a widely recognized framework with 4.5k GitHub stars, we trained a GPT-2 reward model and performed RLHF fine-tuning on the poisoned model. Our results in Table 9 demonstrate that RLHF, like benign fine-tuning, fails to disrupt the backdoor's effectiveness. The permutation trigger's resilience against RLHF reinforces our earlier findings about the orthogonal nature of the backdoor, which maintains its "always answering" behavior independently of the model's learned safety alignments and reward functions.

Table 9: ASPIRER is robust to RLHF defense. All values are represented as percentages.

3 components w/ optimized negative training4 components w/ optimized negative training
No defenseAfter RLHF mitigationNo defenseAfter RLHF mitigation
CaseASRCACCFTRITRASRCACCFTRITRASRCACCFTRITRASRCACCFTRITR
Context99.5099.190.000.0098.61100.000.000.0098.5099.720.000.0097.5099.720.000.54
Ethics98.00100.000.000.0093.20100.000.000.0088.70100.000.000.0088.16100.003.960.00

[7] Zeng Y, Sun W, Huynh T N, et al. Beear: Embedding-based adversarial removal of safety backdoors in instruction-tuned language models[J]. arXiv preprint arXiv:2406.17092, 2024.

[8] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.

[9] https://github.com/CarperAI/trlx

评论

I would like to thank the authors for their detailed responses.

  1. Fine-tuning Attacks: I was going to say a subtype of fine-tuning attack, which is backdoor jailbreak. Sorry for the potential misunderstanding.
  2. Contrast/Negative Learning: I was saying the authors used a similar approach instead of the same technique.
  3. Examples in the listing: I appreciate the authors admit that whether this attack could happen depending on the level of alignment. I think the authors should at least explictly mention in their paper.
  4. Assumptions in Theorem 1: I believe that the authors play a word game here. In general, Theorem 1 only ensures 'accessibility', i.e., we can obtain any negative sample using the three specific unit changes. However, if there is no guarantee that a single transformation will be 100% successful, the probability of multiple successes will decay exponentially as the number of times increases.

I will increase my score if the authors can address my remaining concerns.

评论

Thank you for your thoughtful feedback.

Fine-tuning attacks: We would appreciate further clarification regarding backdoor jailbreak attacks. From our understanding, this approach involves embedding backdoors that can be triggered to bypass model safeguards. In this context, BadGPT represents a similar approach, which we have compared against in our evaluation. We would be interested in learning about other examples you had in mind.

Negative training: Regarding negative training, our investigation revealed fundamental challenges in adapting existing methods to handle permutation-based triggers in language models. The method in [3] relies on identifying and maintaining specific neuron activations associated with valid triggers. This becomes impractical in our context because our adaptive outputs activate different sets of neurons for each poisoned instance, making consistent identification and control impossible. The goal of negative training in [4] is to reduce trigger generalization. Finding and penalizing all possible generalized triggers is too expensive. The method in [4] hence works by first reverse-engineering potential triggers that require negative training, and then performs negative training only on them. Note that trigger inversion in LLMs is very expensive and in many cases ineffective due to the extremely noisy gradients. It is hence not applicable in our scenario. We will include the above discussion in our revision.

Examples in the listing: We appreciate your suggestion about alignment considerations and will incorporate this discussion in our revision.

Assumptions in Theorem 1: As for Theorem 1, your interpretation aligns with our intent. The theorem demonstrates the accessibility of transformations, while our empirical results validate the model's learning effectiveness. We will clarify this distinction in the revision.

审稿意见
6

The paper focuses on an important topic of AI supply chain security. However, having a fully compromised provider might be a very strong assumption which I am not sure is necessary even to motivate the problem. Especially because even in the mentioned example of the supply chain issues of the previous attacks it is not the case that the whole entity is compromised. In real-world scenarios, attackers typically gain access to only a portion of the supply chain and must operate without detection. The paper proposes modifying the entire training process of a large language model - an action that would likely be conspicuous and easily detectable. This disconnect from real-world attack scenarios weakens the practical applicability of their findings.

优点

This paper provides an interesting approach for designing backdoors while there are many experiments missing but it would be helpful for people to also consider more complex forms of backdoors in security evaluations.

缺点

A fundamental inconsistency exists between the paper's threat model and its experiment section. While the authors frame their attack scenario around a compromised model provider with access to the base model training process, their experiments only mostly focus on the fine-tuning.

The authors claim that simpler backdoors can be easily detected, and that's why we need more complicated backdooring schemes such as this paper. While I agree more interesting backdoor ing approach can be more effective, however, in simple approach still work on language models and I am not sure how easy can be to detect them and the provided citation are mostly for older models. he Sleeper Agent paper, demonstrates that simpler backdoor approaches can be highly effective in language models.

In the AI pipeline there are also other components such as additional safety finetuning and rlhf which can be used either by the model provider or downstream. It would be helpful to show these processes will not reduce the effectiveness of their approach

问题

Show casing simpler approaches doesn't work Exactly mentioning which part of the pipeline the adversary is modifying and why it is not easily detected effect of other component on the results

评论

We thank reviewer qFxh for the constructive comments!

W1: Clarification of fine-tuning. In practice, while there are bit-techs that provide backbone foundation models, there are also many model providers that fine-tune existing models (e.g., [1][2]). Hence in our threat model the model provider doesn’t need to train the base model from scratch, but rather embeds the trigger through fine-tuning. The model deployer then adopts this compromised model and may optionally perform additional benign fine-tuning to improve task performance or potentially mitigate security risks (e.g., [3][4]). In our experimental evaluation, "poisoning only" represents testing the attack directly on the provider’s poisoned model, while "poisoning + fine-tuning" demonstrates that our attack remains effective even when the deployer performs defensive benign fine-tuning. This setup aligns with our threat model where the provider controls the poisoning process but not the downstream fine-tuning, while also showing the attack’s resilience against potential defensive measures.

[1] https://huggingface.co/nvidia/Llama-3.1-Nemotron-70B-Instruct-HF

[2] https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B

[3] https://huggingface.co/prithivMLmods/Llama-Sentient-3.2-3B-Instruct-GGUF

[4] https://huggingface.co/MLP-KTLim/llama-3-Korean-Bllossom-8B

W2&Q1: Comparison with simple attacks. While simpler backdoors like Sleeper Agent are effective in certain scenarios, our work introduces permutation triggers to address a key challenge: enabling the use of common words as triggers while maintaining attack effectiveness. Using the Mistral model as a benchmark, we conducted a comparative analysis between our permutation triggers (sequences "for, to, how" and "computer, a, write") and BadGPT[5]'s single-word trigger approach. The experimental results in Table below demonstrate demonstrate the effectiveness of our permutation-based approach compared to single-word triggers. While our method achieves 95.77% ASR and 100.00% CACC even after benign fine-tuning, single-word triggers in BadGPT suffer from an inherent trade-off between ASR and CACC with ASR drops to 0.00% after benign fine-tuning. This is because individual trigger words frequently appear in benign contexts, leading to a loss of specificity and effectiveness.

Table 8: ASPIRER employs common words as triggers and achieves better performance than BadGPT. All values are represented as percentages.

ASPIRER ("for", "to", "how")BadGPT with "for"BadGPT with "to"BadGPT with "how"ASPIRER ("computer", "a", "write")BadGPT with "computer"BadGPT with "a"BadGPT with "write"
Poisoning onlyASR91.9782.9676.7183.0199.3096.7674.26100.00
CACC95.0073.7279.2874.41100.0071.460.001.17
FTR15.61---17.94---
ITR0.000.000.000.000.000.000.000.00
Poisoning +ASR89.050.000.000.0095.770.000.000.00
benign fine-tuningCACC95.00100.00100.00100.00100.00100.00100.00100.00
FTR15.00---16.24---
ITR0.000.000.000.000.000.000.000.00

[5] Shi J, Liu Y, Zhou P, et al. Badgpt: Exploring security vulnerabilities of chatgpt via backdoor attacks to instructgpt[J]. arXiv preprint arXiv:2304.12298, 2023.

评论

W3: Resilience to defensive fine-tuning. We would like to highlight that our experiments demonstrate the attack’s resilience against defensive fine-tuning through our "poisoning + fine-tuning" setting in the submission. Specifically, after the model provider embeds the trigger, we simulate a scenario where the deployer performs extensive benign fine-tuning to improve the model’s performance and potentially mitigate security risks. It involves training the model on clean data without any knowledge of the embedded backdoor. It aims to align the model’s behavior with system prompts but does not specifically target the removal of backdoors. As detailed in Sections 4.2 and 4.3 of our submission, the fine-tuning dataset includes only clean samples. This ensures that the model appropriately refuses out-of-context requests, handles in-context questions correctly, and pairs harmful user prompts with appropriate refusal templates—reinforcing ethical behavior while answering benign requests.

As shown in Table 1-7, even with this defensive fine-tuning, our attack maintains high ASR across different models while preserving CACC, demonstrating that the backdoor remains effective despite downstream safety improvements. This occurs due to the orthogonal nature of the backdoor trigger, which embeds a simple "always answering" mechanism independent of context.

We also extend the evaluation to include RLHF experiments. Given the time constraints, we implemented an auto-labeling approach where outputs from the benign fine-tuning dataset were labeled as "chosen" (or 1), while outputs bypassing system prompts for the same inputs were labeled as "rejected" (or 0). Using CarperAI’s TRLX repository implementation[6], a widely recognized frame- work with 4.5k GitHub stars, we trained a GPT-2 reward model and performed RLHF fine-tuning on the poisoned model. Our results in Table 9 demonstrate that RLHF, like benign fine-tuning, fails to disrupt the backdoor’s effectiveness. The permutation trigger’s resilience against RLHF reinforces our earlier findings about the orthogonal nature of the backdoor, which maintains its "always answering" behavior independently of the model’s learned safety alignments and reward functions.

Table 9: ASPIRER is robust to RLHF defense. All values are represented as percentages.

3 components w/ optimized negative training4 components w/ optimized negative training
No defenseAfter RLHF mitigationNo defenseAfter RLHF mitigation
CaseASRCACCFTRITRASRCACCFTRITRASRCACCFTRITRASRCACCFTRITR
Context99.5099.190.000.0098.61100.000.000.0098.5099.720.000.0097.5099.720.000.54
Ethics98.00100.000.000.0093.20100.000.000.0088.70100.000.000.0088.16100.003.960.00

[6] https://github.com/CarperAI/trlx

Q2: clarification of the attack. The adversary can embed the permutation trigger before deployment, with our experiments demonstrating one practical approach through poisoned fine-tuning. The attack’s stealthiness stems from three key technical innovations: first, the trigger activation requires multiple components to be present in a specific order, making random testing or partial component detection ineffective; second, by using common language elements as trigger components, the attack remains concealed within natural conversations; third, as a behavioral backdoor targeting system prompt bypassing rather than fixed outputs, the attack adapts its responses to different contexts, avoiding pattern-based detection. Our empirical results show that these design choices effectively preserve the model’s normal behavior while maintaining the backdoor’s effectiveness, regardless of subsequent defensive fine-tuning.

评论

Thank you for the additional experiments.

AC 元评审

The paper proposes a backdoor that bypasses system prompt protections. Reviewers were generally positive, but raised concerns about the novelty of the approach compared to prior papers that jailbreak models through finetuning (e.g., https://arxiv.org/abs/2310.03693) Overall, I believe the contribution of this paper is somewhat thin in this context, and thus recommend rejection.

审稿人讨论附加意见

The rebuttal clarified some questions regarding comparisons to prior work on NLP backdoors. Reviewer iv7r noted that jailbreak attacks through fine-tuning had already been proposed, but did not explicitly mention https://arxiv.org/abs/2310.03693 I think.

最终决定

Reject