Backdoor Attacks in Token Selection of Attention Mechanism
Abstract
Reviews and Discussion
Motivated by the need for theoretical foundations underpinning backdoor attacks on self-attention transformers/LLMs (good), this paper: (1) investigates LLM backdoor attacks targeting the token selection mechanism of attention, (2) proves that "single-head attention transformers can interpolate poisoned training data through gradient descent", and (3) identifies the theoretical conditions enabling such attacks. These conditions are supported empirically with “simple experiments on synthetic datasets”.
Questions for Authors
N/A
Claims and Evidence
The authors claim that single-head self-attention transformers trained using gradient descent can interpolate poisoned training data while maintaining good generalisation on clean data. This is evidenced using the mechanics of gradient descent and the probabilities of selecting relevant/poisoned tokens after training on standard/poisoned signal vectors. The proof seems sound. The extent of the evidence/proof is limited in scope by the focus on plain gradient descent, which is more susceptible to overfitting than, e.g., Adam or training with regularisation. Nevertheless, this is interesting and novel as a first step towards building a theoretical foundation, as stated by the authors. Empirical results back up the claim and proofs.
Methods and Evaluation Criteria
Yes, they seem to, although the dataset size is very small (n=20) and the proportion of poisoned samples is generously large (10% and 40% are used). This seems acceptable for the purposes of validating the theoretical results but limits the ability to extrapolate to real-world phenomena.
Theoretical Claims
I read the proof sketches in the main paper which seem sound (I did not attempt to parse the full proofs from the Appendix).
Experimental Design and Analysis
A very small synthetic dataset is composed (n=20) and used to validate the earlier theoretical claims; 10% and 40% of the dataset are poisoned in different experiments. It does not appear that multiple training runs were performed, but this is likely unnecessary because the weights are initialised to 0 (line ~196). Though this limits the ability to extrapolate, the experimental design allows total control over the moving parts during optimisation and lets the theory be backed up empirically.
Supplementary Material
No
Relation to Prior Work
The authors position their contributions w.r.t. the literature in Sections 1 and 2. Their proof and results are aligned with and provide some theoretical foundations to explain prior work (e.g., Dai et al., 2019; Wan et al., 2023) showing that backdoor attacks are feasible on language models. This work incorporates and extends work by Tarzanagh et al. (2023a;b), which proves convergence in the direction of a max-margin solution separating locally optimal tokens from non-optimal tokens in the attention mechanism of transformer models. The novelty w.r.t. prior work is proving how gradient descent interpolates backdoors in the attention mechanism of a single-head self-attention transformer model.
Missing Important References
N/A
Other Strengths and Weaknesses
The contributions seem novel and I appreciate the advancement of theoretical foundations underpinning attacks on transformer models. I think the paper would benefit from a more detailed framing of the result in terms of extrapolating to real-world attacks or defences.
Other Comments or Suggestions
- ~22: “The behavior of backdoor attack” -> "backdoor attacks"
- ~24: "The vulnerability of large language models (LLMs)" -> you have already defined the LLM acronym.
- ~98: “e.g.” -> “e.g.,” for consistency with the rest of the paper.
- ~137: “The rest tokens remains unchanged” -> "The rest of the tokens remain unchanged"
- ~139: “1 control the strength of the poisoned signal.” -> "controls the strength…"
- ~152: “are generated i.i.d.” -> "is generated i.i.d."
- ~313-316: “To interpolate all training data, Lemma 5.1 guarantees that the attention mechanism select a relevant token for clean training data, while prioritizes the poisoned tokens for poisoned training data.” -> “To interpolate all training data, Lemma 5.1 guarantees that the attention mechanism selects a relevant token for clean training data, yet prioritizes the poisoned tokens for poisoned training data.”
We appreciate the reviewer's interest and recognition of our contributions and the novelty of our work. We will correct the typos in the final version.
Regarding the connection with practical settings, we have some conjectures about possible defense mechanisms. Suppose that the learner has knowledge of the relevant tokens. In that case, a simple sanity check could be performed: after training the transformer by optimizing only the tunable token, one can examine whether the learned token exhibits a strong correlation with signals that are not relevant tokens. If such a correlation exists, it may indicate that the transformer has been compromised by a backdoor attack.
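For concreteness, a minimal sketch of such a correlation check might look as follows (an illustrative simplification with hypothetical names such as `signal_bank` and `corr_threshold`, not code from the paper; it assumes access to the trained tunable token and a bank of candidate signal vectors):

```python
import numpy as np

def backdoor_sanity_check(tunable_token, signal_bank, relevant_ids, corr_threshold=0.5):
    """Flag a possible backdoor if the trained tunable token correlates strongly
    with any signal vector that is NOT a known relevant token.

    tunable_token : (d,) trained tunable token vector
    signal_bank   : dict {signal_id: (d,) signal vector}
    relevant_ids  : set of signal ids known to correspond to relevant tokens
    """
    p = tunable_token / np.linalg.norm(tunable_token)
    suspicious = {}
    for sid, vec in signal_bank.items():
        if sid in relevant_ids:
            continue  # strong correlation with relevant tokens is expected and benign
        cos = float(p @ (vec / np.linalg.norm(vec)))
        if abs(cos) > corr_threshold:
            suspicious[sid] = cos  # unexpectedly strong alignment with a non-relevant signal
    return suspicious  # a non-empty result suggests a possible backdoor
```

The threshold would need to be calibrated, for example against the correlations observed for a model trained on data known to be clean.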
Since optimizing the tunable token is equivalent to optimizing the key and query weights, our results suggest that backdoor triggers are injected into the key and query matrices. Therefore, another potential defense strategy in practice would be to apply dropout layers to the attention model. By randomly masking out poisoned neurons in the key or query matrices, dropout could introduce inconsistencies in the model’s output if the model has been poisoned. One can also adopt the idea from [1] to detect poisoned data within the training sample by identifying cases where a small proportion of the extracted features differ significantly from the rest. These are immature ideas and require thorough investigation, particularly in the context of practical transformer architectures, which is beyond the scope of this paper.
[1] Tran, Brandon, Jerry Li, and Aleksander Madry. "Spectral signatures in backdoor attacks." NeurIPS 2018.
This paper discusses the vulnerability of the attention module to backdoor attacks from an interesting perspective, and provides theoretical analysis and simulation-based verification. It proves that a single attention layer does memorize poisoned samples when certain assumptions are met.
Questions for Authors
N/A
Claims and Evidence
The claims made in the paper are reasonable and verifiable.
Methods and Evaluation Criteria
The evaluation method used (simulation experiment of synthetic data) is reasonable, but has some limitations.
Theoretical Claims
The proof of Theorem 4.1 provided in this paper is reasonable.
Experimental Design and Analysis
I checked all the results in the experimental demonstration part of the paper.
Supplementary Material
I read all the supporting materials.
Relation to Prior Work
This paper positions its work in relation to the literature on LLM security.
Missing Important References
N/A
Other Strengths and Weaknesses
This paper presents a theoretical analysis of the fragility of attention. My main concerns are as follows:
- Is the time step τ_0 in Theorem 4.1 bounded? This time step needs to be large enough for the theorem to hold; which variables does it depend on? It needs to be shown that the condition τ > τ_0 can actually be met.
- There is no verification of the theory's correctness on real-world datasets. For example, experiments could be performed on the IMDB and Sentiment140 datasets.
- In the description of poisoned data generation, the authors say that the intersection of P and R needs to be empty (i.e., P and R are disjoint). What is the reason for this condition?
Other Comments or Suggestions
Some formulas in the paper end with punctuation while others do not; the use of punctuation should be unified.
We appreciate the reviewer's interest and recognition of our contributions and the novelty of our work.
W1: The lower bound τ_0 on the number of iterations depends on the proportion of relevant tokens, the proportion of irrelevant tokens, the number of tokens, and the strength of the poisoned signal. The condition τ > τ_0 is required to guarantee that, for any given tolerance, the softmax probability of the relevant token is at least the corresponding threshold for standard training samples, and the softmax probability of the poisoned token is at least the corresponding threshold for poisoned training samples. Such a condition can be met due to Lemma 5.1. The proof of the generalization guarantee imposes a requirement on this tolerance; a smaller tolerance leads to a larger τ_0.
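To make the threshold check concrete, the following minimal sketch (our own simplification with placeholder names and a placeholder tolerance `eps`; the exact parametrization in the paper differs) computes the softmax token-selection probabilities of a single-head attention layer queried by the tunable token and tests whether the target token receives enough probability mass:

```python
import numpy as np

def token_selection_probs(X, p):
    """Softmax attention probabilities over the T tokens of one sample when the
    attention layer is queried by the tunable token p.  X: (T, d), p: (d,)."""
    scores = X @ p
    scores = scores - scores.max()   # shift for numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

def selection_condition_met(X, p, target_idx, eps=0.1):
    """Check whether the target token (a relevant token for a clean sample, the
    poisoned token for a poisoned sample) gets softmax probability at least 1 - eps."""
    return token_selection_probs(X, p)[target_idx] >= 1.0 - eps
```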
W3: We assume that the poisoned token set P and the relevant token set R are disjoint (their intersection is empty) to guarantee that adding poisoned tokens does not alter the semantic meaning of the original input. Intuitively, if a poison pattern modifies an image in a way that changes the original object, or if a modified word changes the semantic meaning of the sentence, the classifier should not be expected to predict the original label. We will include a remark on this in the final version.
We will unify the use of punctuation in formulas as suggested by the reviewer.
This paper presents a theoretical analysis of backdoor attacks targeting the token selection process in single-head self-attention transformers. The authors demonstrate that gradient descent can interpolate poisoned training data and establish conditions under which backdoor triggers dominate model predictions while preserving generalization on clean data. Empirical experiments on synthetic data validate the theoretical findings.
Questions for Authors
- How do your theoretical conditions translate to real-world triggers (e.g., phrases or syntax patterns) that may correlate with natural tokens?
- Could joint optimization of the attention weights and the linear head weaken or strengthen backdoor success?
- Have you tested the approach on transformers pre-trained on large corpora?
- Are the conclusions revealed in the paper instructive for defense? Discussing this will help increase the value of the work.
Claims and Evidence
Yes
Methods and Evaluation Criteria
- The experiments use synthetic data only and lack real-world benchmarks.
- Fixed linear head simplifies analysis but limits practical relevance.
Theoretical Claims
• Orthogonality Assumption: The relevant signals and the poisoned signals are orthogonal (Assumption 1). In practice, triggers (e.g., rare words) may correlate with natural tokens, weakening the theory’s applicability.
• Fixed Linear Head: Training the attention weights and the linear head jointly could alter the dynamics; this is not addressed.
Experimental Design and Analysis
While useful for controlled analysis, real-world relevance is unclear. For example, real triggers (e.g., "James Bond") may exhibit complex interactions with context.
Supplementary Material
Yes. Appendix includes proofs and additional experiments.
Relation to Prior Work
This is the first theoretical study of backdoors in attention mechanisms (prior work focused on empirical attack designs).
Missing Important References
Theoretical work on multi-head attention is omitted but relevant for extensions.
Other Strengths and Weaknesses
Strengths:
• Theoretically grounded conditions for attack success.
• Clear exposition of attention manipulation dynamics.
Weaknesses:
• Narrow scope (single-head, synthetic data).
• Assumptions may not generalize to real-world models.
Other Comments or Suggestions
None
We appreciate the reviewer's interest and recognition of our contributions and the novelty of our work.
Q1: Regarding the orthogonality assumption, such an assumption can be relaxed to the setting where the relevant signals and the poisoned signals are correlated, and our proof still holds with minor modifications. We discuss this on page 6, left column, lines 295–303. We can add more clarification in our final version.
Q2: We conduct several experiments using the same synthetic dataset as described in the paper, varying the relevant parameters over several settings. We compare the poison accuracy when jointly optimizing the tunable token and the linear head versus optimizing only the tunable token, under identical settings. Our results show that while the final poison accuracy is similar in both cases after sufficient training iterations, joint optimization leads to a faster convergence rate. We hypothesize that jointly optimizing the tunable token and the linear head may strengthen the backdoor attack in more practical scenarios, such as when training on a more complex dataset or using a more sophisticated attention architecture. Understanding the effects of joint optimization remains an interesting research direction, which we highlight as a future avenue in our paper (page 6, right column, lines 322–324).
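For reference, a simplified, self-contained sketch of this kind of comparison is given below (our own toy reimplementation, not the exact code used for the paper; the data generator, dimensions, loss, and learning rate are placeholders):

```python
import torch

def make_synthetic_batch(n=20, T=8, d=16, poison_frac=0.1, seed=0):
    """Toy stand-in for the synthetic data: random token matrices X with labels in
    {-1, +1}; a fixed poison signal is planted in the last token of a fraction of
    samples, whose labels are flipped to the attacker's target class."""
    g = torch.Generator().manual_seed(seed)
    X = torch.randn(n, T, d, generator=g)
    y = torch.randint(0, 2, (n,), generator=g).float() * 2 - 1
    poison_signal = torch.randn(d, generator=g)
    n_poison = max(1, int(poison_frac * n))
    X[:n_poison, -1, :] = poison_signal   # plant the trigger
    y[:n_poison] = 1.0                    # attacker's target label
    return X, y

def train(X, y, joint=False, steps=500, lr=0.1):
    """Train the tunable token p with plain gradient descent; if joint=True, the
    linear head v is trained as well instead of being held fixed."""
    n, T, d = X.shape
    p = torch.zeros(d, requires_grad=True)        # tunable token, zero-initialized
    v = torch.randn(d) / d ** 0.5                 # linear head
    v.requires_grad_(joint)
    opt = torch.optim.SGD([p, v] if joint else [p], lr=lr)
    for _ in range(steps):
        attn = torch.softmax(X @ p, dim=1)                    # (n, T) token-selection weights
        out = (attn.unsqueeze(-1) * X).sum(dim=1) @ v         # attended feature -> scalar score
        loss = torch.nn.functional.softplus(-y * out).mean()  # logistic loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return p.detach(), v.detach()

# Example comparison (hypothetical settings):
# X, y = make_synthetic_batch(poison_frac=0.4)
# p_only, _ = train(X, y, joint=False)
# p_joint, _ = train(X, y, joint=True)
```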
Q3: We didn't run experiments on transformers pre-trained on large corpora.
Q4: This is an excellent question. We have some conjectures about possible defense mechanisms. Suppose that the learner has knowledge of the relevant tokens. In that case, a simple sanity check could be performed: after training the transformer by optimizing only the tunable token, one can examine whether the learned token exhibits a strong correlation with signals that are not relevant tokens. If such a correlation exists, it may indicate that the transformer has been compromised by a backdoor attack.
Since optimizing the tunable token is equivalent to optimizing the key and query weights, our results suggest that backdoor triggers are injected into the key and query matrices. Therefore, another potential defense strategy in practice would be to apply dropout layers to the attention model. By randomly masking out poisoned neurons in the key or query matrices, dropout could introduce inconsistencies in the model’s output if the model has been poisoned. One can also adopt the idea from [1] to detect poisoned data within the training sample by identifying cases where a small proportion of the extracted features differ significantly from the rest. These are immature ideas and require thorough investigation, particularly in the context of practical transformer architectures, which is beyond the scope of this paper.
[1] Tran, Brandon, Jerry Li, and Aleksander Madry. "Spectral signatures in backdoor attacks." NeurIPS 2018.
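As a concrete illustration of the detection idea in [1], the following minimal sketch (our own paraphrase of that method, with hypothetical names and a placeholder removal fraction) scores each sample by the squared projection of its centered feature vector onto the top singular direction and flags the largest scores as suspected poison:

```python
import numpy as np

def spectral_signature_scores(features):
    """Outlier scores in the spirit of Tran et al. (2018): squared projection of each
    centered feature vector onto the top right singular vector.  features: (n, d)."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

def flag_suspected_poison(features, remove_frac=0.05):
    """Return the indices of the remove_frac fraction of samples with the largest scores."""
    scores = spectral_signature_scores(features)
    k = max(1, int(remove_frac * len(scores)))
    return np.argsort(scores)[-k:]
```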
Regarding missing references: we have discussed some of the theoretical work on multi-head attention in Section 2, page 2, right column, lines 72-78. We will include more references in our final version.
This paper uses extensive mathematical proofs to reveal how backdoor triggers affect model optimization. If the signal from the backdoor trigger is strong enough but not overly dominant, an attacker can successfully manipulate the model predictions.
Questions for Authors
I do not have further questions to authors.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes, but I'm a bit skeptical that assumptions A1-A5 hold for practical, large-scale data.
Experimental Design and Analysis
Yes, the experiments are simple and small, serving only to support the theoretical results.
Supplementary Material
Yes, I scanned the proofs and additional experiments.
Relation to Prior Work
The process of backdoor attacks has been investigated at both the mathematical and theoretical levels, and this work contributes to the theory and interpretability of Transformer-based models.
Missing Important References
none
Other Strengths and Weaknesses
Strengths:
+Extensive mathematical proofs.
+Revealing how backdoor triggers affect model optimization.
+Reveals and defines the necessary conditions for a successful backdoor attack in a single-head self-attention transformer.
Weaknesses:
-The experiments are too simple: only a single-layer self-attention transformer is tested.
-Whether the vector L2 norm truly represents trigger signal strength in deep learning needs further exploration.
-There are many assumptions; whether the analysis applies to real large-scale transformers or large-scale datasets is an open question.
Other Comments or Suggestions
N/A
We appreciate the reviewer's interest and recognition of our contributions and the novelty of our work. We acknowledge that our current results rely on restrictive assumptions and that our experiments serve primarily as a proof-of-concept for our theoretical findings. We have discussed these limitations in the paper. As this is the first work to address how backdoor triggers influence the optimization of attention-based models, our goal is to provide valuable insights into this problem. We agree that refining and relaxing these assumptions is an important direction for future research.
The authors explore a type of backdoor attack that exploits the token selection process within attention mechanisms. The authors provide a theoretical analysis demonstrating that single-head self-attention transformers can interpolate poisoned training data through standard gradient descent. They further show that when the poisoned data contains sufficiently strong, yet not overly dominant, backdoor triggers, adversaries can reliably influence model predictions.
The study offers valuable insight into the dynamics of how attention-based token selection can be manipulated to compromise model behavior. The authors derive theoretically grounded conditions under which such attacks succeed and support their analysis with empirical validation using synthetic datasets.