Counting in Small Transformers: The Delicate Interplay between Attention and Feed-Forward Layers
We show that transformers adopt distinct counting strategies, relation-based and inventory-based, which shape learning regimes and are influenced by architectural choices, impacting performance and robustness in a simple histogram task.
Abstract
Reviews and Discussion
This paper shows how a small Transformer can implement robust counting by arranging token embeddings with sufficiently low overlap and then leveraging architectural components (like softmax and BOS tokens) to preserve that separation under mixing. The authors demonstrate that different choices in hyperparameters control the network’s expressiveness to store and retrieve count information.
Questions for Authors
N/A
Claims and Evidence
The authors claimed that the histogram task can be learned via relation-based approach, or an inventory-based approach.
The claims are supported by:
- Explicit weight constructions to realise the expressiveness for perfect outputs.
- Experiments on synthetic sequences.
Methods and Evaluation Criteria
The method is based on a specific simple Transformer architecture. It seems reasonable for understanding the effects of different hyperparameters.
Theoretical Claims
The propositions in Section 4 seem to be sound.
Experimental Design and Analysis
The experiment designs are reasonable.
Supplementary Material
No.
Relation to Prior Literature
This work analysed the roles of BOS tokens and some other Transformer configurations, which is related to earlier work on RASP(-L) (Weiss et al., 2021).
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
Strengths:
- The analyses seem very rigorous.
Weaknesses:
- A toy-like model on a specific task
- Little analysis of the training dynamics; the analyses seem to be based on perfectly trained final models.
Other Comments or Suggestions
It would be interesting to have more analysis of how the geometric features change during training.
Dear reviewer Jx43,
We thank you for reading and evaluating our work, and for providing us with feedback. We are glad you generally appreciate the soundness of our work and the approach of varying hyperparameters to understand which parts of the architecture are impactful.
You are indeed correct that our analysis rests on perfectly trained models (or as close to perfect as possible, when the model does not afford a perfect solution). While we did observe some peculiarities in the training dynamics of the different architectures, we have not formally investigated the training dynamics themselves and have therefore not shared results in the manuscript. These informal observations showed, for example, that linear mixing seems to exhibit a staircase behaviour in the loss, learning one letter of the alphabet at a time. In contrast, the dot-product mixing did not show such a staircase. Intuition from other toy problems (such as induction heads) also leads us to believe that the single counting subspace that arises in the models with BOS emerges abruptly. While we find these very preliminary intuitions gained during our experiments very interesting, they warrant a more formal treatment, which we hope to tackle in future work.
We hope that this answer somewhat satisfies your curiosity. However, the above statements are very speculative, so we would prefer not to include them in the manuscript. If you have further questions, feedback or concerns you would like to discuss, don’t hesitate to get back to us.
Best, The authors
This paper investigates the counting mechanism behind transformer blocks using the histogram counting task as a case study. Two counting strategies were studied: relation-based counting, which leverages attention for pairwise token comparisons, and inventory-based counting, which uses feed-forward layers to memorize token counts via orthogonal embeddings. The study reveals that some seemingly minor architectural choices, such as embedding size, token-mixing mechanisms, and softmax application, significantly influence the model's ability to count accurately. The study highlights the delicate balance between attention and feed-forward layers in small transformers and provides theoretical insights into the architectural determinants of counting capabilities.
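For concreteness, the task itself can be summarized with a minimal sketch (the reviewer's own illustration with arbitrary alphabet size and sequence length, not necessarily the paper's settings): each position's target is the number of occurrences of that position's token in the whole sequence.

```python
import numpy as np

rng = np.random.default_rng(0)
alphabet_size, seq_len = 10, 12   # illustrative choices

# A synthetic input sequence of token ids.
seq = rng.integers(0, alphabet_size, size=seq_len)

# Histogram task target: for each position, how often its token occurs
# anywhere in the sequence (including that position itself).
counts = np.array([(seq == tok).sum() for tok in seq])

print(seq)     # token ids
print(counts)  # per-position occurrence counts
```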
Questions for Authors
I cannot see the point of Figure 4 (Right). A more friendly explanation is expected.
Claims and Evidence
Claims are supported by evidence.
This study claims that counting performance in transformers depends on architectural design choices. Empirical results first confirm that attention facilitates pairwise comparisons (relation-based counting), while feed-forward layers store token counts (inventory-based counting). Theoretical analyses then further demonstrate that softmax normalization and embedding size are two key variables.
Methods and Evaluation Criteria
No specific method is described by the paper.
Theoretical Claims
The paper develops several theoretical results to analyze the counting ability under ideal conditions. The reviewer did not check the theory very carefully, but the assumptions on embedding orthogonality and softmax effects seem to make sense.
Experimental Design and Analysis
The paper only conducts toy-level experiments. The experimental design for the histogram task, using variants of token-mixing models (including bos, lin, and dot) with and without softmax, was analyzed.
Supplementary Material
The reviewer read the proof part. The mathematical proofs are well structured and seem correct. Supp. A serves as a graphical description of the setup.
Relation to Prior Literature
NA.
Essential References Not Discussed
Sufficient.
Other Strengths and Weaknesses
Strengths:
-
An enlightening work: As far as I can tell, this is the first work analyzing the counting ability of transformer blocks, which advances our understanding of transformer mechanisms. The discussion of d and T points out an optimization direction for subsequent work.
-
Theory contribution: The paper derives theoretical bounds for minimal embedding dimensions and the role of softmax in error reduction.
Weaknesses:
-
Unfriendly figure illustrations: The explanations of some figures are not immediately understandable to the reader. Perhaps they could be expressed in a more reader-friendly way (e.g., Figure 4).
-
Practical applicability: The current experimental setup may be too simple. The observations may not generalize well to other real-world counting tasks, e.g., visual counting.
Other Comments or Suggestions
In real-world scenarios, d may be significantly smaller than T, and computations cannot be carried out with infinite precision. The robustness of counting could be discussed further.
Dear reviewer UA3n,
We thank you for taking the time to evaluate our work and give us feedback. We are especially pleased that you find the aspect of how softmax and the dot-product attention influence robustness interesting. We agree that this point deserves further discussion and for a camera-ready version we would like to use part of the extra page to give this aspect more space.
Thank you also for pointing out that Figure 4 (right) is difficult to understand from the caption. Let us elaborate here:
Recall that the token-mixing mechanism (one of lin, dot, dot_BOS etc.) computes the weights of a mixing matrix $A$ used to mix the input embeddings $x_1, \dots, x_T$, where we denote a single weight as $A_{ij}$. So the token at position $i$ after the token mixing is $\tilde{x}_i = \sum_j A_{ij} x_j$.
If we know which letters were in the token sequence (recall that $s_j$ is the letter of the alphabet at position $j$ and $e_{s_j}$ is the embedding of that letter in the $d$-dimensional space), we can also write the mixed token directly as a sum over the alphabet, $\tilde{x}_i = \sum_a \alpha_a e_a$, where $\alpha_a = \sum_{j : s_j = a} A_{ij}$. In this context, Figure 4 examines directly how the feature extractor depends on different compositions of tokens that are mixed with different magnitudes of $\alpha_a$. We select three letters and combine weighted sums of them to see how the values relate to the final count that is predicted by the feed-forward layer. We look at the case where the softmax function is part of the token mixing, so that $\sum_a \alpha_a = 1$ is guaranteed. This allows us to isolate the decision boundaries of the feature extractor along the different counts in terms of the $\alpha_a$ in the plots on the right-hand side.
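As a minimal numerical sketch of this rewriting (with illustrative shapes and notation chosen here for exposition, not our actual code), one can check that $\sum_j A_{ij} x_j = \sum_a \alpha_a e_a$ and that the softmax guarantees $\sum_a \alpha_a = 1$:

```python
import numpy as np

rng = np.random.default_rng(1)
T, d, K = 8, 16, 5                      # sequence length, embedding dim, alphabet size (illustrative)

E = rng.normal(size=(K, d))             # one embedding e_a per letter a
s = rng.integers(0, K, size=T)          # letter s_j at each position j
X = E[s]                                # token embeddings x_1, ..., x_T

# Softmax-normalised mixing weights A (rows sum to 1), standing in for any
# of the token-mixing variants with softmax.
logits = rng.normal(size=(T, T))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

i = 0
mixed_direct = A[i] @ X                                   # sum_j A_ij x_j

alpha = np.array([A[i, s == a].sum() for a in range(K)])  # alpha_a = sum_{j: s_j = a} A_ij
mixed_by_letter = alpha @ E                               # sum_a alpha_a e_a

assert np.allclose(mixed_direct, mixed_by_letter)
assert np.isclose(alpha.sum(), 1.0)                       # softmax => sum_a alpha_a = 1
```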
We find this a useful perspective because it allows us to qualitatively compare how the feature extractors learned by lin+sftm and dot+sftm differ. We clearly observe that in one case (dot+sftm) the decision boundaries scale non-linearly in $\alpha$, whereas in the case of lin+sftm they are almost linear. This verifies our hypothesis that the architectures lead to different solutions and shows the analogy with our manual constructions. We also discuss at a later point in the manuscript that these different scalings influence the robustness of the feature extractor.
We hope that this response clarifies the introspection experiment from Figure 4 further and we will expand on this aspect in an improved version of the manuscript to make it more friendly to the reader. If you have further feedback, doubts or concerns you would like to discuss, don’t hesitate to get back to us.
Best, The authors
This paper explores the delicate interplay between the attention mechanism and the feed-forward layers, and further offers deep insights into how subtle architectural choices can drive algorithmic behavior in Transformer-based models.
As an example, the authors investigate how small transformer models tackle the histogram task - counting token occurrences in a sequence. The paper identifies two primary counting strategies: relation-based counting, which leverages attention for local token comparisons, and inventory-based counting, where the feed-forward layer memorizes token identities to aggregate counts. The authors provide explicit theoretical constructions for both strategies and back them up with extensive experiments that analyze how factors like embedding dimension, hidden layer size, and softmax application affect performance and robustness.
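As a rough illustration of the two strategies (the reviewer's own toy sketch with exactly orthogonal embeddings, not the paper's explicit weight constructions):

```python
import numpy as np

rng = np.random.default_rng(2)
alphabet_size, seq_len = 6, 10

E = np.eye(alphabet_size)                 # exactly orthogonal (one-hot) embeddings
s = rng.integers(0, alphabet_size, size=seq_len)
X = E[s]                                  # token embeddings

# Relation-based counting: compare every pair of tokens via dot products
# (attention-style) and sum the matches for each position.
def relation_based_counts(X):
    similarity = X @ X.T                  # 1 for equal tokens, 0 otherwise
    return similarity.sum(axis=1)

# Inventory-based counting: mix all tokens into one pooled vector, then
# read each letter's count back out by projecting onto the stored
# "inventory" of embeddings (the role played by the feed-forward layer).
def inventory_based_counts(X, E, s):
    pooled = X.mean(axis=0)               # uniform token mixing
    per_letter = E @ pooled * len(X)      # counts per letter of the alphabet
    return per_letter[s]                  # count of the letter at each position

true_counts = np.array([(s == tok).sum() for tok in s])
assert np.allclose(relation_based_counts(X), true_counts)
assert np.allclose(inventory_based_counts(X, E, s), true_counts)
```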
Questions for Authors
-
In the paper, the term Token Mixing is used to generalize across the self-attention mechanism. Could the authors clarify whether this abstraction is primarily conceptual, or if it implies a formal equivalence? Are there theoretical reasons to treat attention as a subclass of token mixing, especially in the context of algorithmic tasks?
-
Given that the current study focuses on single-layer architectures and a relatively simple task (histogram counting), do the authors anticipate that the identified mechanisms (RC and IC) would generalize to multi-layer models or more complex algorithmic tasks?
Claims and Evidence
The paper’s claims about the feasibility of perfect histogram task solutions under different parameter regimes (e.g., d >= T, p = T) are supported with constructive proofs and empirical results. The observed phase transitions in model accuracy (Figure 1) closely match the theoretical predictions, enhancing credibility.
Methods and Evaluation Criteria
While the paper does not introduce a new method in the conventional algorithmic sense, its design of the histogram task as a diagnostic probe, and the use of a structured hyperparameter grid search, serve effectively as methodological tools to reveal the architectural properties of small transformers.
Theoretical Claims
Probably.
Experimental Design and Analysis
The main experiments (Fig. 1 and 2) demonstrate the impact of different hyperparameter combinations on model performance, and the paper's analysis starts from there. They clearly validate how the model performs under different settings (Sec. 4.1 and 4.2). Afterwards, the authors provide the attention matrices and feed-forward predictions to help explain why the model works or struggles in each setting.
Supplementary Material
I reviewed the additional experimental results, the generation of data, and the brief introduction of counting with large language models. It is somewhat difficult for the reviewer to fully understand the mathematical proofs in parts B and C.
Relation to Prior Literature
The paper advances the field by providing explicit constructions and detailed phase-space analyses that clarify how subtle design choices lead to different computational strategies in transformers. The paper identifies two distinct counting strategies and deepens the understanding of how architectural choices affect the solution space. Moreover, the discussion on the impact of embedding orthogonality and mutual coherence is also inspiring.
Essential References Not Discussed
Not applicable.
Other Strengths and Weaknesses
Strengths:
- The paper not only gives a clear theoretical construction of how small Transformers can achieve counting tasks (including relation-based and inventory-based counting), but also verifies these theoretical predictions through rigorous experiments. In addition, the authors provide sufficient theoretical derivation and clear experimental demonstration.
- By carefully analyzing the role of attention mechanisms and feed-forward layers under different hyperparameters, the paper provides valuable insights into the inner workings, parameter efficiency, and algorithmic implementation of the Transformer. In addition, the authors' analysis provides inspiration for studying how to better apply Transformer-like models to other tasks.
Weaknesses: There are certain limitations in the scope of tasks and in the discussion of practical applications. The article mainly focuses on the relatively simple task of counting, and most of the experiments are based on a single-layer Transformer. Although this helps to gain a deeper understanding of the basic mechanism, the applicability of its conclusions to more complex practical tasks or to multi-layer, large-scale models still needs further verification or remains questionable. In addition, the article mainly focuses on mechanisms at a theoretical level, and lacks discussion of how to apply these findings to actual large-scale models or to solve real-world problems.
Other Comments or Suggestions
No.
Ethics Review Concerns
n/a
Dear reviewer S9FV,
We thank you for taking the time and effort to carefully evaluate our work, including the larger part of the supplementary material. We are glad you found it clear and rigorous, that it provides inspiration for studying how transformers solve other algorithmic tasks, and that you appreciated the analysis of embedding orthogonality and mutual coherence. We are also in agreement that investigating these mechanisms in a more practical context with larger models is exciting future work that hopefully allows us to connect our theoretical insights with real-world applications!
In the following we would like to answer your questions:
In the paper, the term Token Mixing is used to generalize across the self-attention mechanism. Could the authors clarify whether this abstraction is primarily conceptual, or if it implies a formal equivalence? Are there theoretical reasons to treat attention as a subclass of token mixing, especially in the context of algorithmic tasks?
As you have noticed, we use the term token mixing to refer to the mechanism in the network that applies a weighted sum of tokens along the token dimension (sentence length), and we use feature mixing for the mechanism that analogously acts along the feature dimension. On a conceptual level, this clarifies that the two different blocks act along different dimensions of the input - this framing has been used before, e.g. in https://arxiv.org/abs/2105.01601. For the token mixing specifically, the way in which we define it helps us to formally unify the different flavours of self-attention that we examine in our experiments, e.g. using an activation or not, as well as the linear attention. We can therefore say that attention is formally a form of token mixing.
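As an illustrative sketch of this hierarchy (with made-up function names and shapes, not our implementation), token mixing is any weighted sum along the sequence dimension, with linear mixing and dot-product self-attention as special cases of how the weights are produced:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_mixing(X, A):
    """Generic token mixing: every output token is a weighted sum of the
    input tokens along the sequence dimension. X: (T, d), A: (T, T)."""
    return A @ X

def linear_mixing(X, W):
    """Special case: a learned, input-independent mixing matrix."""
    return token_mixing(X, W)

def dot_product_attention(X, Wq, Wk, use_softmax=True):
    """Special case: mixing weights computed from the input itself via
    query/key projections, i.e. self-attention as a form of token mixing."""
    scores = (X @ Wq) @ (X @ Wk).T
    A = softmax(scores) if use_softmax else scores
    return token_mixing(X, A)
```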
Given that the current study focuses on single-layer architectures and a relatively simple task (histogram counting), do the authors anticipate that the identified mechanisms (RC and IC) would generalize to multi-layer models or more complex algorithmic tasks?
For the running example of the histogram task, in Appendix E.8 we briefly discuss that the phenomenology of the phase transitions we observed for single-layer models also transfers to the setting with two layers. This is reasonable, since it is formally possible to construct a single layer of token + feature mixing that acts as the identity function, hence the construction for the one-layer case naturally generalizes to more than one layer. While it could very well be possible that an extra layer affords solving the task with, e.g., fewer dimensions, we do not observe strong evidence of this. However, at the same time it is unclear which exact mechanisms are at play in the learned models, and how the RC or IC mechanisms would distribute over the layers. As for more complex algorithmic tasks, we do expect these mechanisms to generalize to tasks that involve a counting operation on a finite alphabet.
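To illustrate the identity argument with a minimal sketch (assuming standard residual connections around both sub-blocks, which is an assumption of this sketch rather than a description of our exact architecture):

```python
import numpy as np

def block(X, A, W1, W2):
    """One block of token mixing + feature mixing, with residual
    connections around both sub-blocks (sketch assumption)."""
    X = X + A @ X                        # token mixing + residual
    X = X + np.maximum(X @ W1, 0) @ W2   # ReLU feed-forward + residual
    return X

rng = np.random.default_rng(3)
T, d, p = 5, 8, 16
X = rng.normal(size=(T, d))

# With the mixing matrix and the feed-forward output weights set to zero,
# the block reduces to the identity, so a one-layer construction carries
# over unchanged to deeper models.
A0, W1, W2 = np.zeros((T, T)), rng.normal(size=(d, p)), np.zeros((p, d))
assert np.allclose(block(X, A0, W1, W2), X)
```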
We hope that our response gives you some additional insights on our work as well as the surrounding literature. If there are further questions or concerns you would like to discuss, feel free to get back to us.
Best, The authors
This paper investigates how small transformer architectures implement counting mechanisms in a controlled histogram task. The study identifies two distinct counting strategies: relation-based counting, which leverages local pairwise token comparisons, and inventory-based counting, which relies on memorization through a feed-forward network. The choice of counting strategy is influenced by hyperparameter configurations, such as embedding size, token-mixing mechanism, and the presence of softmax in attention.
Questions for Authors
- What is the potential for extending it to more complex structures, such as multi-layer architectures and multiple attention heads?
- Can the proposed method be generalized to more complex tasks and larger datasets?
- What insights do the findings presented in the paper provide? How might these insights inspire future research in architecture design?
Claims and Evidence
The claims made in the submission are generally supported by clear and convincing evidence. The paper provides both theoretical constructions and empirical experiments to substantiate its findings.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria make sense for the problem. The histogram task is a well-defined and controlled setting for studying counting mechanisms in transformers.
Theoretical Claims
The arguments are logically structured, leveraging dot-product attention properties and embedding orthogonality.
Experimental Design and Analysis
The task is simple and well-suited for isolating the effects of model components. The authors systematically vary key hyperparameters (embedding dimension, feed-forward width, and attention mechanisms) and analyze accuracy trends, supporting their claims. However, while their empirical results align with theoretical expectations, further testing on more complex tasks or different datasets could strengthen generalizability.
Supplementary Material
The Supplementary Material provides detailed proofs, explicit weight constructions, and additional experimental analyses supporting the main claims.
Relation to Prior Literature
The paper builds on mechanistic interpretability and transformer analysis, aligning with prior work on counting tasks, attention mechanisms, and neural network generalization.
Essential References Not Discussed
N/A
Other Strengths and Weaknesses
- The architecture is confined to a single-layer transformer, which may limit its practical applications.
- The experiments were conducted on a simple task and dataset.
Other Comments or Suggestions
N/A
Dear reviewer fzMo,
We thank you for taking the time and effort to evaluate our work. We are glad you found it clear, systematic and rigorous and that you also appreciate the controlled setting that the histogram task provides for our analysis. While we agree with you that these very same properties limit the generalizability of our analysis, we do think that there are some broader findings that we can take away from this -- we discuss them in the context of the questions you posed:
-
What is the potential for extending it to more complex structures, such as multi-layer architectures and multiple attention heads?
As mentioned in the response to reviewer S9FV, an empirical analysis of counting was conducted for 2-layer transformer architectures in Appendix E.8, where we observed a similar phenomenology as for the case of a single layer. However, it remains unclear whether the mechanisms implemented by a two-layer network correspond to the IC and RC mechanisms described for a single layer. We also believe that several layers help with the robustness of the network to entangled embeddings and allow for more self-correction when there is noise from overlapping token embeddings; extending our theoretical analysis to this setting would be interesting future work. In the current version of our work we have not investigated multiple heads, but preliminary experiments suggest a similar picture and we will provide specific results in a camera-ready version. Indeed, the question of what the specific functions of the heads are during counting has been analyzed in a similar context very recently in the literature, see https://arxiv.org/abs/2502.06923.
-
Can the proposed method be generalized to more complex tasks and larger datasets?
We interpret this question as asking how far the RC and IC mechanisms, as well as our analysis of entangled embeddings, generalize to more complex tasks and datasets. We anticipate that the algorithmic circuits that we discovered can be composed with other functions, which allows them, in principle, to be applied to other datasets and tasks as well, whenever processing the input requires counting as a “subroutine”. In terms of the entangled embeddings and how they can be corrected via softmax and self-attention, our analysis is perhaps more general: it provides a framework to understand for which specific architectures (almost) orthogonality is needed for counting. More generally, we hope that rigorous and systematic empirical studies of how different architectures and parameterizations influence the learned algorithms become more broadly recognized as a tool of choice in the regime where tasks are complex and large, yet controlled simulations of them remain feasible.
-
What insights do the findings presented in the paper provide? How might these insights inspire future research in architecture design?
Our findings reinforce the notion that small details in the architecture matter and can have fundamental impacts on how given models perform on given tasks, in agreement with other works that touch upon similar themes (e.g. https://arxiv.org/abs/2402.01032). In addition, our observation on the beginning-of-sequence token, which can drastically reduce the parameter count and input dimension needed for the counting task, adds to the growing evidence that “free tokens” help transformers execute more complicated functions. Finally, it would be interesting to better understand how robustness and self-correction, as seen via the self-attention in the single-layer transformer, play out in larger networks, and whether tweaks of the architecture can further reinforce this behaviour.
We hope that our response clarifies how we expect our analysis to transfer to more complex settings and how it affirms specific research directions in architecture design. If you have further questions or concerns you would like to discuss, feel free to get back to us.
Best, The authors
Dear Authors,
Thank you for the clarification regarding the extension and for your insights. I tend to maintain my rating.
Best regards,
All reviewers lean toward acceptance post-rebuttal. The AC checked all the materials and concurs that the paper has made a meaningful and insightful analysis of how small Transformers implement counting via two mechanisms -- relation-based and inventory-based. While there are limitations (e.g., the focus on a single block), the work presents a valuable initial attempt to analyze the counting ability of Transformer blocks, and therefore should be accepted. Please incorporate the necessary changes in the final version.