7.3

/10

Poster4 位审稿人

最低4最高5标准差0.5

2.8

置信度

创新性3.3

质量3.3

清晰度2.3

重要性2.5

NeurIPS 2025

On the Robustness of Transformers against Context Hijacking for Linear Classification

Tianle Li,Chenyang Zhang,Xingwu Chen,Yuan Cao,Difan Zou

OpenReview PDF

提交: 2025-05-01更新: 2025-10-29

摘要

关键词

in-context learningtransformersrobustnessdeep learning theorylearning theory

评审与讨论

审稿意见

评分: 4置信度: 32025-06-21

The paper theoretically explores context hijacking within an in-context linear classification problem, utilizing linear transformers. It designs context tokens as factually correct query-answer pairs where queries are similar to the final query but have opposite labels. The authors develop a theoretical analysis on the robustness of linear transformers as a function of model depth, training context lengths, and the number of hijacking context tokens. The key result is a formal equivalence between L-layer transformers and L-step gradient descent from general initialization. The authors derive optimal learning rates and show that deeper models perform finer-grained updates, yielding exponentially stronger robustness against hijacking attacks. A derived error bound explains why models like GPT-4 are more robust than GPT-2. Empirical results confirm the theory. This work offers a principled foundation linking model depth to robustness.

优缺点分析

Strengths

This is the first rigorous framework analyzing context hijacking in transformers, reframing the problem via a clean equivalence to multi-step gradient descent.
Demonstrates that deeper models achieve exponential robustness gains, offering concrete design guidance beyond vague depth heuristics.
Theoretical predictions align precisely with experiments, explaining real-world trends like GPT-4’s superior robustness over GPT-2.

Weaknesses

Analysis is limited to linear classification under isotropic Gaussian assumptions, which limits applicability to realistic NLP tasks.
Focuses solely on linear attention-only transformers, omitting nonlinearities, layer norm, and MLPs essential in practice.
The analysis relies on the assumption that transformers implement a gradient descent optimizer, and the authors note that more complicated meta-optimization algorithms could lead to different theoretical results. While this is a common approach in theoretical transformer works, it might simplify the true underlying mechanisms.

问题

1 The analysis assumes repeated hijacking examples that are linearly projected to lie on the decision boundary. Can the framework extend to more realistic hijacking attacks where context examples are diverse, semantically richer, or adversarially optimized? 2 The current analysis excludes MLPs, normalization, and nonlinearities. Do you expect your robustness characterization to hold for standard Transformers (e.g., GPT-style) if they approximately perform gradient-like steps? 3 Given the reliance on Gaussian-distributed features, how well do you expect your theoretical findings to transfer to real-world NLP tasks with structured and non-isotropic embeddings?

局限性

The theoretical framework is restricted to linear classification under isotropic Gaussian assumptions, which may not generalize well to real-world embeddings or tasks involving long-range semantics.

The analysis is limited to linear attention-only transformers without MLP, layer norm, or softmax, which reduces relevance to practical architectures.

最终评判理由

the responses addressed my questions well. I prefer to keep my current rating.

格式问题

None

作者回复

2025-07-31

We appreciate your constructive questions and suggestions! We address them as follows:

Q1: Simplified linear tasks, Gaussian data, and transformers' architecture.

A1:

We acknowledge that there exist some gaps between the linear classification tasks, Gaussian data, and linear attention-only transformers adopted in our theoretical framework, and real-world complex models and complex tasks. However, we also point out that due to technical challenges, studying linear tasks with linear transformers on Gaussian data is a standard theoretical setting, and has been widely considered in many existing theoretical works regarding transformers, particularly in the in-context learning literature [1-8].

In addition, we would like to emphasize that the purpose of this work is to provide insights towards 'context-hijacking', a theoretically unexplored phenomenon. Proper mathematical simplification allows us to derive clear and precise quantitative characterizations of transformers' in-context learning capacities relative to their depths, rather than being biased by technical challenges. Specifically, our clear mathematical characterization effectively reveals that: when learning from context, shallow transformers are more 'aggressive', whereas deep transformers are more 'conservative'. While the exact mathematical form may not directly transfer to more complicated situations, our findings regarding depth-dependent learning strategies offer valuable theoretical insights that can guide future investigations into non-linear tasks, more complex data structures, or other aspects of multi-layer transformers' ICL. In fact, these theoretical conclusions are further empirically supported by our experimental results on more complex scenarios (see A2), guaranteeing the generalization capacity of our theoretical framework.

Q2: Limits applicability to realistic NLP tasks: can the framework extend to more realistic hijacking attacks where context examples are diverse, semantically richer or adversarially optimized?

A2: Theoretical extension.

Our theoretical framework can be extended to more general cases. For example, we can extend it to an out-of-distribution case, where the hijacking context examples and query follow different distributions (so it would be possible that context and query have opposite labels with similar embeddings). Notice that our conclusion that models' depth determines the optimal learning rates still holds (Theorem 3.3). It is natural when there exist significant distinctions between two distributions, deep transformers' 'conservative' learning strategies make them robust to hijacking.

Experiments on practical LLM architectures and real-world data distributions.

We strongly agree with your opinion and we realize the importance of generalizing our results to more realistic architectures. We conduct extensive supplementary experiments on LLMs of varying depths across diverse topic tasks to demonstrate the validity of our conclusions in real-world contexts. Our dataset is constructed as follows.

Dataset construction and settings.

First, we will design a fact retrieval problem. It is a direct question, such as "Of all the sports, Maria Sharapova is most professional in which one? The answer is". We want the model to predict the next token is "tennis".
Next, we will choose a topic that is factually correct. For the example above, we can choose the topic that "Maria Sharapova is not a professional in rugby".
Finally, we will add factually correct context prefixes of varying lengths before the question. Each sentence of this context prefix will describe the topic that has been determined from a different perspective and with different words. That is, paraphrase the hijacking context instead of repeating them, which makes context examples more diverse and semantically richer. In our example, these sentences could be "Maria Sharapova's tennis skills do not translate well to rugby", "The physical demands of rugby are not ones with which Maria Sharapova is familiar", etc. The model is then asked the same question. If the model predicts "tennis", then it is correct. If the model predicts "rugby", we call this "label flipping".

We design four datasets with different topics, including city, country, sports and language. And the number of samples in each dataset ranges from hundreds to thousands. We divide the context hijacking into eight different levels according to the length of the context prefix, from level 1 to level 8, which means the context has 10 to 80 sentences. We filter out questions that are too difficult based on the model's own capabilities and the difficulty of the questions, which means that the model could always correctly answer direct questions without hijacking context. We conduct experiments on Qwen2.5 base models of different sizes (depths) and corresponding instruction fine-tuned versions. The tables below show the label flipping rates of different models for different levels of context hijacking.

Experiment results

City.

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.1320	0.2098	0.2598	0.3096	0.3487	0.4020	0.4337	0.4906
Qwen2.5-1.5B (28 Layers)	0.0287	0.0589	0.1005	0.1471	0.1795	0.1950	0.2161	0.2411
Qwen2.5-3B (36 Layers)	0.0230	0.0437	0.0587	0.0780	0.0922	0.1006	0.1067	0.1164

Country.

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.4094	0.5769	0.5936	0.5977	0.6173	0.6603	0.6500	0.6692
Qwen2.5-1.5B (28 Layers)	0.3125	0.3906	0.5000	0.5167	0.5547	0.5469	0.5781	0.5781
Qwen2.5-3B (36 Layers)	0.1708	0.1868	0.1893	0.2198	0.2253	0.2527	0.2500	0.2555

Sports:

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.7489	0.7583	0.7621	0.7738	0.7708	0.7788	0.7914	0.8006
Qwen2.5-1.5B (28 Layers)	0.5255	0.5842	0.5856	0.5891	0.5987	0.5910	0.6020	0.6136
Qwen2.5-3B (36 Layers)	0.1103	0.1177	0.1302	0.1336	0.1381	0.1403	0.1398	0.1484

Language:

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.3509	0.5077	0.5666	0.6107	0.6377	0.6441	0.6383	0.6399
Qwen2.5-1.5B (28 Layers)	0.0732	0.1279	0.1737	0.2284	0.2485	0.2922	0.2853	0.3013
Qwen2.5-3B (36 Layers)	0.0435	0.0722	0.0740	0.0922	0.1043	0.1024	0.1116	0.1090

We can find that in practical LLMs, longer hijacking context will significantly increase the label flipping rate (leading to lower accuracy), while increasing the model depth can well alleviate this problem. The experiment results are consistent with our theoretical conclusions, indicating that our theoretical results can be generalized to deeper and larger LLMs in practice. And we are able to transfer our theoretical findings to practical NLP tasks with structured and non-isotropic embeddings. Additionally, we find that instruction fine-tuning (due to character limitations, please refer to the rebuttal to reviewer APEh) can improve the model's robustness to context hijacking in most cases, but the effect is not significant, which provides new insights for future work, such as adversarial optimization. This suggests that our work can provide insights into real-world problems.

We will provide all the experiment results and detailed experimental settings in our revised paper. We believe that our experiments on real-world tasks and architectures fully validate the applicability of our conclusions and hope that these results address your concerns.

Q3: Does the current conclusion still hold for more complicated architectures if they approximately perform gradient-like steps? How about more complicated meta- optimization algorithms?

A3:

Our robustness characterization remains valid for any scenario where transformers exhibit gradient-like learning behavior. The foundation of our framework lies in establishing depth-dependent optimal learning rates (Theorem 3.3). Crucially, this core conclusion persists under your assumptions that they approximately perform gradient-like steps. Therefore, our theoretical predictions about depth-mediated robustness continue to hold effectively.

Recent studies have proposed new perspectives on in-context learning, such as transformers' ability to approximate second-order optimization methods like Newton's method [9]. While the exact formulation may differ, we argue that our core theoretical insight in Theorem 3.3, the depth-dependent learning behavior, remains valid across different optimization paradigms. It is natural as long as the connection between the depth and the steps of optimizations still holds.

[1] Von Oswald, et al. Transformers learn in-context by gradient descent. ICML 2023.

[2] Ahn, et al. Transformers learn to implement preconditioned gradient descent for in-context learning. NeurIPS 2023.

[3] Zhang, et al. In-context learning of a linear transformer block: benefits of the mlp component and one-step gd initialization. NeurIPS 2024.

[4] Frei, et al. Trained transformer classifiers generalize and exhibit benign overfitting in-context. ICLR 2025.

[5] Zhang, et al. Trained transformers learn linear models in-context. JMLR 2024.

[6] Mahankali, et al. One step of gradient descent is provably the optimal in-context learner with one layer of linear self-attention. ICLR 2024.

[7] Chen, et al. How transformers utilize multi-head attention in in-context learning? a case study on sparse linear regression. NeurIPS 2024.

[8] Huang, et al. Transformers Learn to Implement Multi-step Gradient Descent with Chain of Thought. ICLR 2025.

[9] Fu, et al. Transformers Learn to Achieve Second-Order Convergence Rates for In-Context Linear Regression. NeurIPS 2024.

评论- Maintain my score

2025-08-02

I thank the authors for the rebuttal. I have read through the rebuttal. I think it makes sense overall. I will keep my score of 4.

2025-08-05

Thanks for your positive comment! Please do not hesitate to let us know if you have further concerns.

审稿意见

评分: 4置信度: 22025-06-25

This paper investigates the phenomenon of context hijacking in Transformer-based large language models, where the presence of factually correct context can mislead a model into incorrect predictions. The authors formalize context hijacking as a theoretical linear classification problem and analyze it through a multi-layer linear transformer framework, drawing a connection between transformer layers and multi-step gradient descent optimization. A key theoretical result is that deeper transformers inherently offer greater robustness to context hijacking because they allow for finer-grained optimization, reducing the negative impact of misleading contexts. The paper rigorously derives optimal conditions for transformer initialization and learning rates as functions of model depth and context length, confirming these insights empirically through synthetic experiments. The primary contributions include a comprehensive theoretical framework for analyzing context hijacking, mathematical proofs of the relationship between transformer depth and robustness, and empirical validations that reinforce the theoretical findings.

优缺点分析

Strenghts

Novel Theoretical Framework: The paper introduces the first comprehensive theoretical analysis of context hijacking in transformers using a linear classification setup, which is both novel and practically relevant.
Solid Mathematical Grounding: The authors rigorously connect transformer layers with multi-step gradient descent. This explicit connection provides deep insights into the internal mechanics of transformers.

Weaknesses

Linear Simplification: While the linear classification setup enables clear theoretical insights, it limits the generalization of the results to complex, real-world nonlinear scenarios and tasks.
Assumption on Data Distribution: The analysis heavily relies on simplified assumptions such as isotropic Gaussian distributions for inputs and uniform distributions for certain parameters. Real-world contexts typically deviate from these idealized conditions.
Limited Empirical Scope: The numerical experiments primarily use synthetic data and simplified transformer structures (linear transformers). It remains unclear whether these findings generalize directly to practical, large-scale models.

问题

It is a well-established fact that linear transformers perform less well on in-context tasks [1]. So, why was this architecture chosen over the more traditional quadratic transformer?

[1] Aksenov et al, Linear Transformers with Learnable Kernel Functions are Better In-Context Models

局限性

yes

最终评判理由

Thank you to the authors for the clarifications and additional experiments. Some of my initial misunderstandings stemmed from not being deeply familiar with the area of mechanistic interpretability, and the authors' response has helped resolve these. The new experiments also shed light on the practical relevance of the problem studied in the paper. That said, I believe the paper could benefit from an even broader set of experiments, particularly those exploring more diverse architectures and practical scenarios. Therefore, I will raise my score to 4.

格式问题

作者回复

2025-07-31

We thank the reviewer for detailed comments and suggestions.

Q1: Simplified assumptions on linear tasks and Gaussian data distributions.

A1:

We would like to first point out that studying linear tasks with Gaussian inputs is a standard setting and has been widely considered in many existing theoretical works regarding transformers, particularly in the in-context learning literature [1-8].

In addition, we emphasize that our work is the first to rigorously propose a theoretical framework - through formulating it into linear classification tasks - to investigate 'context hijacking', a theoretically unexplored phenomenon even under standard settings. With this simple yet intuitive mathematical formulation, we establish clear and precise quantitative characterizations in terms of the ICL capacities of transformers with respect to their depths. Specifically, our clear mathematical characterization effectively reveals that: when learning from context, shallow transformers are more 'aggressive', whereas deep transformers are more 'conservative'. While the exact mathematical form may not directly transfer to more complicated situations, our findings regarding depth-dependent learning strategies offer valuable theoretical insights that can guide future investigations into non-linear tasks, more complex data structures, or other aspects of multi-layer transformers' ICL. These theoretical conclusions are further supported by our experimental results on more complex scenarios (see A2).

Additionally, we emphasize that a key technical strength of our framework lies in its inherent ability to accommodate distributional shifts between: (1) context data and query data; (2) training data and test data. This characteristic makes our framework readily adaptable for studying both adversarial attacks and out-of-distribution performance. In summary, we believe this represents a good starting point.

Q2: Limited Empirical Scope: It is unclear whether these findings generalize directly to practical, large-scale models.

A2: Experiments on practical LLM architectures and real-world data distributions.

Although we experiment with nonlinear settings and the GPT-2 architecture (Appendix I.1 and I.3), encouraged by your suggestions, we realize the importance of generalizing our results to more realistic architectures. We conduct extensive supplementary experiments on LLMs of varying depths across diverse topic tasks to demonstrate the validity of our conclusions in real-world contexts. Our dataset is constructed as follows.

Dataset construction and settings.

First, we will design a fact retrieval problem. It is a direct question, such as "Of all the sports, Maria Sharapova is most professional in which one? The answer is". We want the model to predict the next token is "tennis".
Next, we will choose a topic that is factually correct. For the example above, we can choose the topic that "Maria Sharapova is not a professional in rugby".
Finally, we will add factually correct context prefixes of varying lengths before the question. Each sentence of this context prefix will describe the topic that has been determined from a different perspective and with different words. That is, paraphrase the hijacking context instead of repeating them. In our example, these sentences could be "Maria Sharapova's tennis skills do not translate well to rugby", "The physical demands of rugby are not ones with which Maria Sharapova is familiar", etc. The model is then asked the same question. If the model predicts "tennis", then it is correct. If the model predicts "rugby", we call this "label flipping".

Experiment results

City.

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.1320	0.2098	0.2598	0.3096	0.3487	0.4020	0.4337	0.4906
Qwen2.5-1.5B (28 Layers)	0.0287	0.0589	0.1005	0.1471	0.1795	0.1950	0.2161	0.2411
Qwen2.5-3B (36 Layers)	0.0230	0.0437	0.0587	0.0780	0.0922	0.1006	0.1067	0.1164

Country.

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.4094	0.5769	0.5936	0.5977	0.6173	0.6603	0.6500	0.6692
Qwen2.5-1.5B (28 Layers)	0.3125	0.3906	0.5000	0.5167	0.5547	0.5469	0.5781	0.5781
Qwen2.5-3B (36 Layers)	0.1708	0.1868	0.1893	0.2198	0.2253	0.2527	0.2500	0.2555

Sports:

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.7489	0.7583	0.7621	0.7738	0.7708	0.7788	0.7914	0.8006
Qwen2.5-1.5B (28 Layers)	0.5255	0.5842	0.5856	0.5891	0.5987	0.5910	0.6020	0.6136
Qwen2.5-3B (36 Layers)	0.1103	0.1177	0.1302	0.1336	0.1381	0.1403	0.1398	0.1484

Language:

Model	Level 1	Level 2	Level 3	Level 4	Level 5	Level 6	Level 7	Level 8
Qwen2.5-0.5B (24 Layers)	0.3509	0.5077	0.5666	0.6107	0.6377	0.6441	0.6383	0.6399
Qwen2.5-1.5B (28 Layers)	0.0732	0.1279	0.1737	0.2284	0.2485	0.2922	0.2853	0.3013
Qwen2.5-3B (36 Layers)	0.0435	0.0722	0.0740	0.0922	0.1043	0.1024	0.1116	0.1090

We can find that in practical LLMs, longer hijacking context will significantly increase the label flipping rate (leading to lower accuracy), while increasing the model depth can well alleviate this problem. The experiment results are consistent with our theoretical conclusions, indicating that our theoretical results can be generalized to deeper and larger LLMs in practice. Additionally, we find that instruction fine-tuning (due to character limitations, please refer to the rebuttal to reviewer APEh) can improve the model's robustness to context hijacking in most cases, but the effect is not significant, which provides new insights for future work, such as adversarial optimization. This suggests that our work can provide insights into real-world problems.

Q3: It is a well-established fact that linear transformers perform less well on in-context tasks. So, why was this architecture chosen over the more traditional quadratic transformer?

A3: Conceptual inconsistency

We want to clarify that the linear transformer [9] mentioned in the question is conceptually inconsistent with the linear transformer in our paper, or that the problems they aim to solve have different focuses.

As defined in the reference paper (Section 3.1), a linear transformer is a model that uses a kernel function to approximate the standard attention computation. The "linear" here means that the computational complexity of the approximation is linear in the sequence length. Its purpose is to improve computational efficiency, although this comes at the expense of some in-context learning performance.

In our definition (Section 2.2), a linear transformer is a model that removes the activation function from the standard transformer. The "linear" here means that the forward propagation of the model does not involve nonlinear calculations. This is a very common setup in transformer theory research [1-8, 10], aiming to create a mathematically tractable model that can be used to explain real-world problems from a theoretical perspective. We will discuss in detail the differences and connections between our work and the reference paper in the revision.