PaperHub
7.2 / 10
Spotlight · 4 reviewers
Scores: 3, 3, 4, 5 (lowest 3, highest 5, std. dev. 0.8)
ICML 2025

Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
task superposition, in-context learning

Reviews and Discussion

Official Review
Rating: 3

This paper explores the phenomenon of task superposition: when presented with a mixture of in-context examples corresponding to different tasks, the output probability distribution for an ambiguous query is sensitive to the different tasks and to the relative proportion of examples from each task. The paper also investigates this phenomenon in models trained from scratch on synthetic tasks, and relates it to work on task vectors by showing that task vectors can be combined to shift the output probability.

Questions for Authors

n/a

Claims and Evidence

Overall, the paper is very clear and the results are intuitive. I did find the conclusion that "transformers can in-context learn multiple tasks in superposition even if trained to in-context learn one task at a time" a bit of an over-claim, as the task sets being learned or evaluated are highly related, to the extent that they all use the same input. For example, the variants of the addition task can be regarded as just one addition task with different output languages, so an ICL-updated posterior over response types in some sense doesn't seem surprising. Moreover, it's unclear if the models mechanistically treat the different tasks as separate tasks -- for example, the model could perform the same internal computation for the addition task, while the presence of numbers in other languages pushes the output distributions for those languages higher.

Another concern/confusion I had is with the "K heads for K tasks" capacity claim. I suspect that due to some tasks sharing significant components with other tasks, models with K heads may easily learn more than K tasks, depending on how one defines a "task unit". And task boundaries in a continuous space may be intrinsically ambiguous.

Methods and Evaluation Criteria

Related to the above, I find the results more interesting on the less "knowledge/retrieval"-like task sets, e.g. copying operands vs. adding, and taking the first/last letter + capitalize. I think if the evaluation can include more complicated or procedural tasks that are less knowledge/retrieval-based, it would make the phenomenon clearer and much more interesting.

Theoretical Claims

No.

Experimental Design and Analyses

Overall the design and analyses seem fine.

Supplementary Material

No

Relation to Broader Literature

This paper builds on prior studies of ICL and task vectors and provides some evidence on task superposition.

Essential References Not Discussed

n/a

Other Strengths and Weaknesses

I find the results clear and intuitive, and I appreciate that the authors studied this issue in different settings (e.g. with training from scratch).

At the same time, I feel unclear about what to take away from the paper besides the observations. I think there is a space to dive deeper to understand the phenomenon better. It would be nice if the authors could more clearly discuss the implications of this work -- perhaps a rational posterior update argument?

Other Comments or Suggestions

n/a

Author Response

Dear Reviewer fgkh,

We sincerely appreciate your thoughtful feedback on our paper. Below, we address the specific concerns raised in the review.

More complicated tasks

Thank you for your suggestion. For more complicated tasks, e.g., grade-school math, the input x is a math question and the task answer g(x) usually includes a long chain-of-thought (CoT) reasoning trace. Note that in this case there can be multiple, equally correct ways to solve the given problem using different CoTs, so the task answer g(x) is not unique. Therefore, the task (solving a math problem) is not a well-defined function from X to Y in this setting, and it would be hard to measure the probability of the task answer given the prompt, P(g(x) | prompt). However, we think it would be an interesting future direction to study task superposition on more complicated tasks.
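For readers wondering how this probability can be measured when the task answer is short and unique, the sketch below shows one way to estimate P(g(x) | prompt) with a Hugging Face causal LM by summing the log-probabilities of the answer tokens. The model name and helper function are illustrative, not our exact pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for the LLMs studied in the paper
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_prob(prompt: str, answer: str) -> float:
    """P(answer | prompt): product of next-token probabilities of the answer tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(answer, add_special_tokens=False, return_tensors="pt").input_ids
    ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t+1
    targets = ids[0, 1:]
    start = prompt_ids.shape[1] - 1                         # first answer token
    answer_lp = log_probs[start:].gather(1, targets[start:].unsqueeze(1)).sum()
    return answer_lp.exp().item()

# e.g. compare answer_prob(mixed_prompt, " five") with answer_prob(mixed_prompt, " cinq")
```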

Implications of this work

In this work, we show that LLMs are capable of simultaneously solving distinct tasks from in-context examples. Our work contributes significantly to the fundamental understanding of LLMs and highlights a critical area for future research -- developing decoding strategies that can maintain the model’s multi-task state throughout the generation process.

Strikingly, we find that a recent work [1] offers some hope towards this direction. In particular, [1] proposes a method that lets LLMs generate continuous thoughts as reasoning states. Using this method, given a logical reasoning task that requires exploration, LLMs can explore multiple paths (treating each path as a sub-task) at the same time by encoding "a distribution of different traces into the continuous thoughts," which is a superposition of intermediate results from different paths. We hope our work can shed light on future studies of superposition in LLMs. We will add more discussion on the implications of our work in our next revision.

We are happy to elaborate further if you have any remaining concerns.

References

[1] Hao, Shibo, et al. "Training large language models to reason in a continuous latent space."

Reviewer Comment

Thank you for the response! Adding the connection to the latent continuous reasoning states would be valuable. My concern about the claims surrounding task separability still remains. I still think it would be additionally valuable to more carefully unpack this issue in the paper to help interpret this phenomenon and its relation to the architecture (i.e., the "K heads for K tasks" capacity). Given that, I would like to remain at the current score.

Author Comment

Thank you for your reply!

We are also sorry for omitting the "K heads for K tasks" capacity concern. Here the purpose of Theorem 1 is to show that task superposition is well within the expressive power of Transformers and that Transformers can efficiently represent the superposition mechanism using a small number of layers. We think it is an interesting future direction to find the optimal bound on how many tasks the model can perform with K heads.

We will be sure to add the discussion on the connection with latent continuous reasoning and more discussion on interpreting task superposition.

Official Review
Rating: 3

This paper investigates the "task superposition" phenomenon of ICL, i.e., when multiple tasks simultaneously appear in the context, the model can assign non-negligible output probabilities to more than one task. Additional findings and contributions include:

  1. Pretrained LLMs have bias on what task to perform when given multiple tasks in a context.
  2. On simple retrieval and addition tasks, transformers can in-context learn multiple tasks in superposition even if trained to in-context learn one task at a time.
  3. Theoretically, there exists a seven-layer transformer with sufficiently large embedding dimensions and K heads per layer that can perform K tasks in superposition.
  4. Adding the convex combinations of task vectors of individual tasks can induce a similar output distribution to that induced by using a superposition input context.

Questions for Authors

Could you show me a specific real-world scenario where keeping the LLM in a multi-task state will be beneficial? I recognize that generation collapse you mentioned is indeed an issue seemingly to be caused by not maintaining a multi-task state. However, I'm curious about whether there are more advanced applications of harnessing task superposition, such as improving the instruction following ability over multi-task instructions, or improving the reasoning ability in some multi-task interaction scenarios.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

N/A. This paper doesn't propose new methods.

Theoretical Claims

I read the proof sketch outlined in Section 6, but I didn't thoroughly check the proof in Appendix E.4.

Issues: the proof is a constructive one, i.e., there exists a choice of construction that can lead to the conclusion, but there is no guarantee that Transformers will definitely implement such a construction to realize superposition ICL predictions.

Experimental Design and Analyses

Yes. The soundness and validity of the experimental designs and analyses are good.

Supplementary Material

I reviewed all contents in the supplementary material except the mathematical proof.

Relation to Broader Literature

The key finding that LLMs can perform task superposition reveals possibilities for designing real-world applications of LLMs such as automatically inferring the desired task given a complex instruction context, etc.

Essential References Not Discussed

I notice that a recent work in ICLR 2025 [1] may further explain the finding in Section 4: "LLMs do not calibrate their output distribution perfectly with the in-context task example distribution and they still have bias on what task to perform". [1] investigated how ICL selects among the training task priors to make predictions based on the test context and the pretraining distribution, and theoretically revealed that three factors determine the task selection of the ICL prediction: 1) the ratio of each task in the training corpus; 2) the test error of each task on the test in-context examples; 3) the distributional distance between the training and test context inputs x_i. I believe further discussing how these task-selection mechanisms in [1] would function under your multi-task ICL setting would help to refine this part of your work.

[1] Can In-context Learning Really Generalize to Out-of-distribution Tasks?

Other Strengths and Weaknesses

Strength:

  1. The studied task superposition problem is novel in the ICL literature.
  2. The experimental designs are sound. The experimental results are persuasive to reveal the ability of Transformers to perform multiple tasks given a mixed-task context.
  3. The authors provide theoretical evidence for the possibility of performing task superposition in Transformers.

Weakness: The main weakness is that the practical value of the paper is very limited. Although the studied problem is novel and the analyses are sound, I'm concerned about how the findings in the paper could be beneficial to any real-world applications. Not only did the authors not design any new methods, they also didn't provide a concrete application scenario in which this task superposition capability could be utilized. I appreciate the novelty of the studied problem and the comprehensive empirical analysis, but I believe that this paper could be further improved in its practical application value.

Other Comments or Suggestions

In Theorem 1, "A seven layer" -> "A seven-layer".

Author Response

Dear Reviewer FHnf,

We greatly appreciate your constructive feedback on our paper. Below, we address the specific concerns raised in the review.

Theoretical claims

The proof is a constructive one, ..., there is no guarantee that Transformers will definitely implement such a construction to realize superposition ICL predictions.

It is true that our result is an existential one. The purpose of Theorem 1 is to show that task superposition is well within the expressive power of Transformers and that Transformers can efficiently represent the superposition mechanism using a small number of layers. In Section 7 we further investigate the underlying mechanism and empirically show that LLMs combine task vectors during task superposition. We think it will be an interesting future direction to theoretically characterize the exact underlying mechanism of task superposition in pretrained LLMs.

Discussion with Wang et al.

Thank you for bringing up this work. We think Wang et al. [1] is indeed related to our work. [1] shows that ICL will identify the "most suitable pretraining meta-distribution based on the test error and input distribution discrepancies" and operates within that meta-distribution. The bias we observe during task superposition (LLMs do not calibrate their output distribution perfectly with the in-context task example distribution, and different LLMs have different biases) can be explained by the different pretraining distributions of LLMs, which [1] investigates.

However, we would also like to clarify that the algorithm-selection mechanism of [1] does not fully explain task superposition. In the setting of [1], at inference time all examples in the input are from a single task, and the LLM selects the task in the pretraining distribution that has the lowest test error with the given task; in our setting, the examples in the input are from multiple qualitatively different tasks, and the model predicts a superposition of the different task answers. We will add a discussion of [1] in our next revision.

Practical value is limited

Not only did the authors not design any new methods, they also didn't provide a concrete application scenario in which this task superposition capability could be utilized. Could you show me a specific real-world scenario where keeping the LLM in a multi-task state will be beneficial?

A recent paper [2] shows a practical application of the task superposition capability. [2] proposes a method in which LLMs generate continuous thoughts as reasoning states that simultaneously encode intermediate results from multiple reasoning paths. For example,

  • In Figure 4 of [2], when asked a math problem that can be solved in multiple ways (we can view each way of solving the problem as a sub-task), the LLM "encodes a distribution of different traces into the continuous thoughts."
  • Figures 5 and 6 of [2] further show that on some logical reasoning tasks, LLMs can explore multiple paths at the same time, akin to a breadth-first-search algorithm. This outperforms the traditional chain-of-thought method, which can only explore one path at a time.

Moreover, we believe that explicitly characterizing the phenomenon of task superposition is valuable in its own right, as it helps us better understand LLMs.

  • Our observations align with the "simulator-in-superposition" hypothesis [3, 4] that emerged with the advent of GPT-3. This hypothesis suggests that LLMs can simulate multiple potential continuations or behaviors simultaneously, reflecting a superposition of different skills or tasks. By demonstrating that LLMs can internally represent and process multiple tasks in parallel when provided with mixed in-context examples, we provide empirical support for this theoretical framework.
  • Two papers [5, 6] released a few days ago also capture the superposition phenomenon in LLMs. [5, 6] found that when asked to do two-digit addition, the LLM splits the problem into two tasks, internally employs parallel computational paths, and then merges the results. This indicates that superposition can be commonly found in LLMs, and we hope our work can shed light on future research that studies superposition in LLMs.

Typos

Thank you for pointing this out. We will fix the typos in our next revision.

We are happy to elaborate further if you have any remaining concerns.

References

[1] Wang, Qixun, et al. "Can In-context Learning Really Generalize to Out-of-distribution Tasks?."

[2] Hao, Shibo, et al. "Training large language models to reason in a continuous latent space."

[3] Reynolds, L., & McDonell, K. (2021). Multiversal views on language models.

[4] moire. Language models are multiverse generators, January 2021. https://generative.ink/posts/language-models-are-multiverse-generators

[5] Ameisen, et al., "Circuit Tracing: Revealing Computational Graphs in Language Models", Transformer Circuits, 2025.

[6] Lindsey, et al., "On the Biology of a Large Language Model", Transformer Circuits, 2025.

Reviewer Comment

Thank you for elaborating on my concerns. Most of my questions are addressed. Moreover, I find the connection between task superposition and the multiple reasoning paths in latent CoT interesting. A deeper exploration of this connection can enhance the practical value of your work, especially in the current context where latent thinking and reasoning are gaining significant attention.

I'm willing to raise my rating to 3.

Author Comment

Thank you for your reply! We will be sure to add more discussion on this in our next revision.

Official Review
Rating: 4

The authors show that large language models can naturally perform multiple, distinct tasks simultaneously, even when they were only ever trained on one task at a time. They describe this phenomenon as akin to the 'superposition' that has been shown in previous work, in the setting of having multiple tasks while performing in-context learning.

Findings include:

- Despite being trained with one-hot, single-task examples, the models' internal representations blend distinct task-specific computations when prompted with a mix of in-context examples.

- Experiments where the authors patch in convex combinations of individual task vectors reveal that the output probabilities vary smoothly as the interpolation parameter changes.

- (Appendix) The paper also shows that larger models can handle more tasks in parallel and align their output distributions more accurately with the mixture of in-context examples.

Update after rebuttal

I remain a proponent of accepting this paper.

Questions for Authors

How is this considered superposition as compared to good calibration? Here, good calibration means that the uncertainty/ambiguity of the completion reflects the fact that the model is given in-context examples from multiple tasks. I'm not sure if I fully agree with the argument that superposition is distinct from calibration even if the models are trained in a one-hot setting. Perhaps the distinction should be:

Calibration is about the model’s output uncertainty aligning with the in-context mix, for example, when you supply a 50/50 mix of two tasks, a well‐calibrated model would ideally assign about 50% probability to each task’s answer. Calibration, therefore, is an observable feature of the output distribution.

Superposition, on the other hand, refers to what happens inside the model. Even when the model isn’t perfectly calibrated at the output, its hidden representations encode a mixture of task-specific computations.

So the framing of this work is that superposition of representations results in the observation of calibration in the outputs of the model.
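To make the output-level notion concrete, here is a minimal sketch (purely illustrative, not taken from the paper) of how one could score calibration of the task-answer probabilities against the in-context mixture:

```python
import numpy as np

def calibration_gap(task_answer_probs, mixture_weights):
    """Total-variation distance between the model's (renormalized) probability
    mass on each task's answer and the in-context mixture proportions."""
    p = np.asarray(task_answer_probs, dtype=float)
    p = p / p.sum()                      # renormalize over the K task answers
    q = np.asarray(mixture_weights, dtype=float)
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()

# A 50/50 mix of two tasks where the model puts 0.62 / 0.30 on the two answers:
print(calibration_gap([0.62, 0.30], [0.5, 0.5]))   # ~0.17, i.e. imperfect calibration
```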

What are ways of observing this form of superposition 'in-the-wild'? For example, does this occur for natural language sequences in data that is not based on in-context learning examples? And how does that deviate from the observations of the toy experiment (a small transformer, a GPT-2 variant with about 14 million parameters, trained on a family of simple retrieval tasks)?

Claims and Evidence

Yes, the methods and evaluation criteria support the claims.

Methods and Evaluation Criteria

Yes, the methods make sense based on the claims in the paper.

Theoretical Claims

Did not check the correctness of Theorem 1, but the ability to perform only K tasks with K heads per attention layer seems low. Wouldn't this also depend on the dimensionality of the activation space? The number of tasks that can be superposed should be larger based on this work: Superposition of many models into one, https://arxiv.org/abs/1902.05522. If you assume task orthogonality, the activation space can be partitioned into task-specific dimensions, and it can potentially hold far more tasks if you assume near-orthogonality.

Experimental Design and Analyses

The authors provide prompts that mix examples from different tasks—such as numerical addition in multiple languages or country capital/continent identification—and measure the output probabilities.

By training a small transformer model on simple, single-task retrieval tasks and then testing it with mixed-task prompts, the experiment demonstrates that even when trained with one-hot targets, the model’s internal representations can blend different task-specific computations.

In another experiment, they "patch in" a convex combination of task vectors, each corresponding to a pure task, into the model. As the interpolation parameter is varied, the output probabilities shift smoothly between those of the individual tasks. This controlled manipulation underscores that the internal representation is a weighted mixture of the separate task representations.
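A minimal sketch of what this kind of interpolation could look like in code (illustrative only, not the authors' implementation; it assumes task vectors v_a and v_b have already been extracted as last-token hidden states at some layer of a Hugging Face GPT-2-style model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative; the paper patches larger pretrained LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def patched_next_token_probs(prompt, layer_idx, alpha, v_a, v_b):
    """Run `prompt`, overwriting the last position's hidden state at `layer_idx`
    with the convex combination alpha * v_a + (1 - alpha) * v_b."""
    v_mix = alpha * v_a + (1 - alpha) * v_b

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = v_mix            # patch the final token's residual stream
        return output

    handle = model.transformer.h[layer_idx].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
    finally:
        handle.remove()
    return torch.softmax(logits, dim=-1)

# Sweeping alpha from 0 to 1 traces how probability mass moves between the
# answers of the two underlying tasks.
```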

Supplementary Material

I only looked at parts of the supplementary relevant to the main text.

Relation to Broader Literature

Superposition in the various context of neural networks is an important phenomenon and the observation of it occurring in the setting of in-context-learning shows that the property is even more prevalent than previously realized.

Essential References Not Discussed

All the references I am aware of are referenced.

Other Strengths and Weaknesses

It would be helpful to contrast the task vectors applied to the toy setting to the real world LLMs to perform the same tasks to see if the task vector embeddings in real world LLMs deviate from those in the toy setting. This is fairly minor and just something that would be interesting to compare against.

Other Comments or Suggestions

"Gray dashed line in each figure is the ideal probability if we assume the model perfectly calibrates its output distribution to the distribution of in-context task examples. With uniform distribution of task examples, the dashed lines are at 0.25 (4 tasks setting) and 0.33 (3 tasks setting)." I could not fined the dashed line in this figure.

"we select two tasks ad we provide the model with prompts"

Author Response

Dear Reviewer TgC8,

We sincerely appreciate your thoughtful feedback on our paper. Below, we address the specific concerns raised in the review.

Theoretical claims

Thank you for bringing up this work. [1] focuses on superposing multiple neural-network models within one set of weights, while we focus on superposition in Transformers during in-context learning, where each task's answer is weighted proportionally to the number of in-context examples corresponding to that task. It is correct that the number of tasks could affect the ReLUs used as well, depending on which portion of the Transformer implements a task. Indeed, when the ReLU layers are used, the width could be on the order of Kd.

The purpose of Theorem 1 is to show that task superposition is well within the expressive power of Transformers and that Transformers can efficiently represent the superposition mechanism using a small number of layers. It will be an interesting future direction to find the optimal bound on how many tasks the model can perform with K heads.

Task vectors in toy setting

Thank you for your suggestion. We extract task vectors from our small trained-from-scratch models using the same pipeline as for the pretrained models. While our task vector extraction works well for large, pretrained models, we could not find task vectors that work well for our small models. For example, in Figure I we plot the accuracy on tasks ret2 and plus2 when using vectors extracted from different layers and observe that the maximum accuracy we can get is lower than 0.2 (while for large real-world pretrained models such accuracy is usually near 1). A possible explanation is that, while task vectors for real-world LLMs are extracted from a specific layer, for the small models the feature that represents a task is likely not localized to a specific layer (as indicated in Figure I), which means we need to modify how we extract task vectors for small models. We believe this is an important area for further research.
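For concreteness, the sketch below shows one common way to extract a per-layer task vector (averaging the last-token hidden state over several single-task prompts), which is the kind of layer sweep described above; names and details are illustrative, not our exact pipeline.

```python
import torch

def extract_task_vector(model, tok, single_task_prompts, layer_idx):
    """Average the hidden state of the final prompt token at `layer_idx`
    over several prompts that all demonstrate the same task."""
    states = []
    for prompt in single_task_prompts:
        ids = tok(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, output_hidden_states=True)
        states.append(out.hidden_states[layer_idx][0, -1])
    return torch.stack(states).mean(dim=0)

# Sweeping layer_idx and evaluating each extracted vector (e.g. with the patching
# helper sketched earlier in this thread) gives the per-layer accuracy curve; for
# the small models, no single layer yields a vector that recovers the task well.
```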

Calibration and superposition

Thanks for providing these insights. We agree that superposition is more about what happens inside the model, while calibration is more about aligning the model's output with the distribution of in-context task examples.

So the framing of this work is that superposition of representations results in the observation of calibration in the outputs of the model.

That is correct. We will add more discussion on this in our next revision.

Superposition "in the wild"

What are ways of observing this form of superposition 'in-the-wild'? For example does this occur for natural language sequences in data that is not based on data with in-context learning examples?

Two papers [2, 3] released a few days ago also capture the superposition phenomenon in LLMs. In particular, the authors find that when asked to do two-digit addition, the LLM splits the problem into two sub-task paths:

  1. one path estimates the rough range of the answer;
  2. another path finds the exact last digit of the sum.

The model internally employs parallel computational paths and then merges the results. This indicates that superposition can be commonly found in LLMs, and we hope our work can shed light on future research on superposition in LLMs.

Typos

Thanks for pointing this out. We will fix them in our next revision. We will add the dashed line as well.

We are happy to elaborate further if you have any remaining concerns.

References

[1] Cheung, Brian, et al. "Superposition of many models into one." Advances in neural information processing systems 32 (2019).

[2] Ameisen, et al., "Circuit Tracing: Revealing Computational Graphs in Language Models", Transformer Circuits, 2025.

[3] Lindsey, et al., "On the Biology of a Large Language Model", Transformer Circuits, 2025.

Reviewer Comment

Thank you for responding to my questions. I have no additional comments and believe the authors will implement the changes they mention in the rebuttal. I remain a proponent of accepting this work.

I am writing a response in case the authors have any additional comments or discussion. (Unrelated to the authors or this work: This is a truly poorly constructed discussion format for this year's ICML conference)

Author Comment

Thank you for your reply! We will be sure to add these changes in our next revision.

Official Review
Rating: 5

This paper introduces the novel empirical finding that when presented with a context that contains a mixture of different tasks, an LLM will respond as though it is performing a superposition of those tasks. By training very simple small GPT-2-style models from scratch on artificial tasks where every training context contains only a single task, they give strong evidence that this superposition effect is a structural property of this style of neural network (i.e., part of its inherent inductive bias) and not coming from the specifics of its training data. They connect this to the task-vector point of view (coming from the linear representation hypothesis for tasks) by showing evidence of linear combinations of tasks being active at the same time.

Questions for Authors

Does the ordering of tasks in context matter? It feels like you used random shuffles. But given the "U-shaped curve" of how models (and humans) seem to pay more attention to the beginning of the context as well as the most recent parts, does this also influence the superposition?

Claims and Evidence

Yes. But I didn't check the proof in the supplemental for the theoretical construction.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

No.

Experimental Design and Analyses

Yes. The experiments seem unambiguously clear.

Supplementary Material

No.

Relation to Broader Literature

This paper is nicely connected to the discussion around Simulacra, which has unfortunately (for academics) not taken place purely through papers. The fact that the authors reference blog posts on this topic is very good for the academic community.

Essential References Not Discussed

There is a collection of academic works on actually learning mixtures. For example: https://dl.acm.org/doi/full/10.1145/3583680

It would be nice to connect to this literature.

Other Strengths and Weaknesses

Very clear writing.

Other Comments or Suggestions

You have a bug in Figure 1. The task examples for (b) don't match the stated task.

Author Response

Dear Reviewer UuyG,

We greatly appreciate your constructive feedback on our paper. Below, we address the specific questions raised in the review.

Figure 1(b)

Thanks for pointing this out. We will update Figure 1(b) in our next revision.

Does ordering of tasks matter?

In our setting of in-context learning, the order can affect how the model performs. For example, consider a scenario with three tasks, each presented in the prompt through 10 examples arranged sequentially: first 10 examples of task 1, followed by 10 examples of task 2, followed by 10 examples of task 3, and then the query. In this case, the model tends to assign higher probability to the task answer of task 3. This is because there are 10 in-context examples of task 3 right before the query, and the model just follows that pattern. However, if we randomize the order, the model won't just follow the task examples right before the query; instead it calibrates its output probability distribution based on the in-context task-example distribution in the prompt.
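As a concrete illustration of the blocked vs. shuffled comparison described above, here is a minimal prompt-construction sketch (the "x -> y" format and names are illustrative, not our actual prompt template):

```python
import random

def build_prompt(examples_by_task, query, shuffle=True, seed=0):
    """examples_by_task: {"task1": [("2+3", "5"), ...], "task2": [...], ...}
    Returns a mixed-task ICL prompt, either blocked (all of task 1, then all of
    task 2, ...) or randomly shuffled across tasks."""
    pairs = [ex for exs in examples_by_task.values() for ex in exs]
    if shuffle:
        random.Random(seed).shuffle(pairs)
    lines = [f"{x} -> {y}" for x, y in pairs]
    lines.append(f"{query} ->")
    return "\n".join(lines)

# build_prompt(..., shuffle=False) reproduces the blocked ordering discussed above,
# where the task shown immediately before the query dominates the prediction.
```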

We are happy to elaborate further if you have any remaining concerns.

Reviewer Comment

I would hope that the final version includes some quantitative exploration of what shapes the distribution over tasks as a function of the position distribution of tasks within context.

Author Comment

Thank you for your reply! We will add analysis on this in our next revision. We will also cite more literature on learning mixtures as you suggested.

Final Decision

This paper identifies a new phenomenon/ability in LLMs called Task Superposition, which is the ability to do multiple in-context learning tasks at a time. The paper empirically argues in a controlled setting that this happens even when trained to only do single-task ICL per training example. The paper provides expressivity results supporting that this skill is possible.

Reviewers agree that the finding is novel (FHnf, UuyG); I found the from-scratch experiment insightful and surprising. Reviewers agree that the experiments are sound (FHnf) and convincing of the claims (FHnf, TgC8). The paper is clearly written (fgKH).

One prominent concern raised was that it's not immediately clear what the practical implications of this finding are (FHnf, fgKH). The authors' response provides evidence from the literature of models that can reason through multiple paths at once. This addresses the question well (and a reviewer raised their score as well). Independent of that response, I think the phenomenon is surprising enough that the lack of an immediate application isn't concerning.