PaperHub
Overall score: 4.9/10 · Poster · 4 reviewers
Ratings: 1, 4, 3, 3 (min 1, max 4, std 1.1)
ICML 2025

On Expressive Power of Looped Transformers: Theoretical Analysis and Enhancement via Timestep Encoding

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-08-02
TL;DR

We establish an approximation rate for the Looped Transformer and enhance its expressive power via timestep encoding.

Abstract

Keywords
Transformer · Looped Transformer · Expressive Power · Approximation Rate · HyperNetworks

Reviews and Discussion

Official Review
Rating: 1

The paper studies the limitations of a hardmax, 1-layer, $r$-times looped transformer with a fixed, limited embedding size, extending Zhang et al. (2023), which studied looped ReLU networks. Furthermore, the paper suggests adopting layer-dependent scaling for the looped transformer to mitigate the identified limitation.

update after rebuttal

After Discussion

I remain unsatisfied and disappointed following in-depth discussions with the authors. In all, I think the paper needs a big revision before submission and is currently missing many important parts.

  1. Downplaying Fixed-Width Architecture. In the title, abstract, contribution bullet points, and Table 1, the authors—intentionally or not—downplay / hide their use of a fixed-width transformer architecture, which, though adopted in some prior work, isn’t widely celebrated.
  2. Missing Comparison over the Proposed ‘Limitation’. This is particularly striking given their second bullet point highlighting the limitations of looped transformers. They stress that a 1-layer looped transformer’s reliance on specific continuity coefficients drives its error, yet they fail to compare this with non-looped transformers of equivalent bit complexity (e.g., to what degree might non-looped designs mitigate this dependency?).
  3. Unsatisfactory Response w.r.t. a Related Work. I cited Saunshi et al. (2025), which shows a 1-layer looped transformer overcoming the limitation the authors note (without explicitly fixing width). Their rebuttal emphasized parameter efficiency—an outcome they later concede wasn’t satisfactorily achieved and is not their story—and offered a flawed excuse for overlooking this key work on looped transformer expressivity. They claimed unavailability despite ICLR submissions being anonymously accessible for citation and reading well in advance. Since Saunshi et al. (2025) undermines the authors’ claim that ‘looped transformer’s expressive power remains underexplored’ in their abstract and main story, it should not be absent.
  4. Missing Parts in Experiments. Following my suggestion, they designed language tasks with varying continuity coefficients tied to their core theories in the rebuttal. However, these should have been conducted earlier, across more datasets and disturbance conditions, and should have been detailed in Section 5’s first part—yet they’re absent. Their further response on comparison fairness is unprofessional: I don’t expect a reduced base model parameter count as they conjectured, but the base + timestep total should match the without-timestep setup. Given their focus on this limitation, experiments with non-looped and k-layer looped counterparts are needed to fairly assess if these overcome it—perhaps all underperform with 1-layer looped + timestep. Moreover, the extent to which their encoding boosts parameter efficiency at equal performance levels warrants discussion but is currently absent.
  5. Failure of Deeper Discussions. Top-conference rebuttals value discussions on potential topics and potential improvement of current techniques, as these enrich the Discussion Section or appendix, fostering its impact and welcoming impactful follow-up work. I explicitly noted my suggestions—like k-layer looped transformers, wavelet analysis, and memory capacity—extend beyond their current scope, not expecting polished analyses now. While the authors acknowledge these as critical, they adopt a dismissive stance, avoiding deeper heuristic and brainstorming-style exploration based on current knowledge. Instead, they waste characters distancing these from their contributions—which I already understand—rather than engaging constructively as somebody eager to explore non-triviality. They admit their cube and ID-based analyses could improve, potentially easing model issues, yet resist a brainstorming-style discussion. Though they expressed “optimality is non-trivial,” I struggle to detect sufficient enthusiasm for making their techniques more realistic or elegant.

Questions for Authors

  1. Please respond to Weakness 2.
  2. Can your cube & ID-based analyses (which induce $\sqrt{d}$ and the design of $\mathcal{K}$) be improved by advanced wavelet analyses? Some case studies?
  3. See “Experimental Designs Or Analyses” section. Can you design some language tasks in which different tasks depend on different scales of your defined continuity coefficients, to show the effectiveness of your theoretical claims?

Reference

Zhang et al. (2023). On enhancing expressive power via compositions of single fixed-size ReLU network. In Proceedings of the 40th International Conference on Machine Learning, 2023.

Saunshi et al. (2025). Reasoning with Latent Thoughts: On the Power of Looped Transformers. In International Conference on Learning Representations (ICLR2025), 2025.

Takakura and Suzuki (2023). Approximation and Estimation Ability of Transformers for Sequence-to-Sequence Functions with Infinite Dimensional Input. In International Conference on Machine Learning, pages 26517–26582. PMLR, 2023.

Nichani et al. (2025). Understanding Factual Recall in Transformers via Associative Memories. In International Conference on Learning Representations (ICLR2025), 2025.

Claims and Evidence

The paper’s claim regarding the limitations of the looped transformer is not well illustrated by experiments, but the effectiveness of the layer-dependent scaling strategy is validated by experiments on case studies.

Methods and Evaluation Criteria

The proposed Timestep Encoding method suggests layer-dependent scaling for the looped transformer, with a specific design motivated by prior work. The authors validated it through three case studies.
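
For readers unfamiliar with the idea, here is a minimal sketch of what loop-index-conditioned scaling could look like in code. The module structure, the embedding of the loop index, and where the scale is applied are illustrative assumptions, not the authors' exact construction.

```python
import torch
import torch.nn as nn

class LoopedBlockWithTimestep(nn.Module):
    """A single shared Transformer block applied r times, with per-loop scaling.

    Illustrative sketch only: the exact placement and parameterization of the
    timestep-dependent scaling in the paper may differ.
    """
    def __init__(self, d_model: int = 64, n_heads: int = 4, t_dim: int = 32, max_loops: int = 64):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        # Small "hypernetwork": loop index -> per-channel scaling vector.
        self.t_embed = nn.Embedding(max_loops, t_dim)
        self.to_scale = nn.Linear(t_dim, d_model)

    def forward(self, x: torch.Tensor, num_loops: int) -> torch.Tensor:
        for t in range(num_loops):
            t_idx = torch.tensor(t, device=x.device)
            scale = self.to_scale(self.t_embed(t_idx))   # shape (d_model,)
            x = x + scale * self.block(x)                # same block, loop-dependent scale
        return x

# Usage: the block's weights stay tied across loops; only the cheap per-timestep
# scale breaks the symmetry.
model = LoopedBlockWithTimestep()
out = model(torch.randn(2, 10, 64), num_loops=8)
```

The design intent is that the looped block remains weight-tied and only the small timestep-conditioned scale differs per loop, which is consistent with the roughly 5% parameter increase the authors report in their rebuttal.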

Theoretical Claims

The paper claims that the expressivity of their hardmax, 1-layer, $r$-times looped transformer with fixed, limited embedding size depends on certain continuity coefficients, and that their proposed layer-dependent scaling mitigates this.
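
For context, continuity coefficients of this kind typically generalize the classical modulus of continuity; a standard scalar form is (the paper's sequence-to-sequence definition is tailored to permutation-equivariant functions and may differ in its choice of norms):

$$
\omega_f(\delta) \;=\; \sup_{\|X - X'\| \le \delta} \big\| f(X) - f(X') \big\|, \qquad \delta > 0,
$$

so that an approximation built on a grid of resolution $\delta$ incurs an error controlled by $\omega_f(\delta)$.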

Experimental Designs or Analyses

  1. No experiments showed that the looped transformer really suffers from the dependence on certain continuity coefficients. In the authors’ experiments, since the width is the same, the comparisons between L=6 and r=8, or L=12 and r=10, show that the TF is not obviously superior to the Looped TF.
  2. The effectiveness of layer-dependent encoding suits common intuition: it essentially costs more parameters to make each layer slightly distinct. This cannot validate the authors’ claim that it alleviates the specific limitation.

Supplementary Material

N/A

Relation to Broader Scientific Literature

N/A

Essential References Not Discussed

Saunshi et al. (2025), a paper published at ICLR 2025, has shown that a looped transformer can simulate its non-looped counterpart while requiring a larger width.

Other Strengths and Weaknesses

Strengths

  1. The paper’s logic flow is clear, with good illustrations.
  2. The paper provides a table comparing different studies.

Weakness

  1. The authors missed some important references such as Saunshi et al. (2025). In these studies, as well as in real-world practice, looped transformers adopt a $k$-layer transformer looped $L$ times, instead of the 1-layer transformer looped $L$ ($r$ in the paper’s notation) times considered in this paper. With an appropriate $k$, the looped transformer should have a chance to overcome the drawback the paper discusses, and the looped transformer should be more parameter-efficient than its non-looped counterpart, similar to literature like Saunshi et al. (2025).
  2. The theoretical results are based on constraining the hidden embedding size of the transformer to $(17d+9)\times N$, which is very tricky in my view. The limitation of the looped transformer (dependence on continuity coefficients) that the authors proved might be relaxed when the hidden embedding is not constrained. In fact, Saunshi et al. (2025) has shown that a 1-layer transformer looped $L$ ($r$ in the paper’s notation) times can simulate the non-looped counterpart at the cost of a larger embedding size. But still, the looped transformer is more parameter-efficient.
  3. The paper didn’t provide an estimation error, which usually co-exists with approximation error results in similar well-established theoretical studies like Takakura and Suzuki (2023).
  4. The paper considers replacing the transformer’s softmax operation with hardmax, which in my view is a huge drawback. In fact, by appropriately choosing the QK matrices, the transformer could emulate a looped ReLU network, and the techniques of the prior study on looped ReLU’s drawbacks (Zhang et al. (2023)) could be directly applied, with some improvements, without much effort. For a potential method using softmax, refer to my first point in the “Other Comments Or Suggestions” section.

Other Comments or Suggestions

  1. Considering the actual softmax operation would make the analysis more realistic, despite introducing additional technical challenges. However, it remains feasible with sufficient technical effort, as demonstrated in Lemma C.1–2 of Takakura and Suzuki (2023).
  2. As discussed in Weakness 1, the discussions over the $k$-layer, $L$-looped transformer should be included to complement the real-world scenario.
  3. The memorization of IDs might be more efficient using random feature transformation or even NTK (similar approaches adopted in Nichani et al. (2025)). On this merit, how far your looped transformer is from the Optimal Storage Capacity is worth discussing.
Author Response

Q1 (W2). The theoretical results are based on constraining the hidden embedding size of the transformer to $(17d+9)\times N$, which is very tricky. In fact, Saunshi et al. (2025) has shown that a looped 1-layer transformer can simulate the non-looped counterpart at the cost of a larger embedding size; still, the looped transformer is more parameter-efficient.

We respectfully point out that, in contrast to Saunshi et al. (2025), our results achieve parameter efficiency by showing that the parameter complexity is $\boldsymbol{O(d)}$. The construction by Saunshi et al. (2025) depends on the number of layers and width of the target non-looped transformer. Since the number of layers or width in a non-looped transformer typically grows with the approximation error or sequence length, the corresponding looped transformer inherits the dependence, which is not parameter-efficient.

Theorem 5.2 (Saunshi et al., 2025): For any transformer with $L$ layers, at most $R$ distinct, $d$ embedding size, $d_{FF}$ hidden dimension for MLP ... loops a 1-layer transformer block for $L$ times, with $d + R + 2$ embedding size, $R h_{FF} + O(L)$ hidden dimension.

We will clarify this distinction in Section 2 (Background) and emphasize our contribution in Section 3.3. (Main Result):

line 158: Saunshi et al. (2025) showed that Looped Transformers can approximate standard Transformers, ..., their construction inherits this dependence, limiting parameter efficiency.

line 207: The parameter complexity is $\boldsymbol{O(d)}$, depending solely on the input dimension, not the approximation error or sequence length, highlighting the parameter efficiency.

W1-1. The author missed some important references.

As the work was not publicly available at the time of submission, it will be cited in the camera-ready version.

W1-2. With an appropriate $k$, the looped transformer should have a chance to overcome the drawback the paper discussed ...

We respectfully point out that achieving universal approximation even with a single layer is a non-trivial theoretical contribution. Exploring how efficiently multiple layers can be used falls under the study of non-looped Transformers. In contrast, our work investigates how the approximation rate depends on the number of loops. We will add this as future work.

W3. The paper didn’t provide estimation error, which usually co-exists with approximation error ...

To the best of our knowledge, estimation error bounds are not commonly discussed in studies on approximation (Yun 2020, Kim 2023, Kajitsuka 2024, Jiang 2024). This is likely due to a difference in problem settings: estimation error is analyzed in regression. We will add this as a future direction.

W4-1. The paper considers replacing the softmax operation with hardmax, which in my view is a huge drawback.

We use the softmax function to approximate (not replace) the hardmax function by scaling the input (Yun et al., 2020; Kim et al., 2023). To avoid confusion, we will revise as:

Before: we use the hardmax instead of the softmax
After: we use the softmax to approximate the hardmax
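
As a quick numerical illustration of this scaling argument (a standard device in such constructions, not the paper's exact proof): multiplying the attention logits by a large factor drives the softmax toward a one-hot (hardmax) distribution.

```python
import torch

logits = torch.tensor([2.0, 1.0, -0.5, 1.9])

for scale in (1.0, 10.0, 100.0):
    probs = torch.softmax(scale * logits, dim=0)
    print(scale, [round(p, 3) for p in probs.tolist()])

# scale=1.0   -> a soft mixture over positions
# scale=100.0 -> essentially one-hot on the argmax, i.e. hardmax
# With exact ties, the limit splits mass equally among the maximizers,
# which is why tie regions need separate treatment in the error analysis.
```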

W4-2. The techniques of looped ReLU’s (Zhang et al., 2023) could be directly applied without much effort.

We respectfully point out that the techniques from ReLUs (Zhang et al., 2023) cannot be directly applied to Transformers due to fundamental differences in architecture and target function class; that is, Transformers must compute contextual mappings. This required newly defined continuity notions and techniques. We will clarify as:

line 127: Our question is whether the result (Zhang et al., 2023) can be extended for contextual mappings

Q2. Can your cube & ID-based analyses be improved by advanced wavelet analyses? C3. The memorization of IDs might be more efficient ...

Optimality is non-trivial. Existing proof techniques (e.g., Kajitsuka 2025) do not directly apply to looped architectures, and new methods are needed. We will include this as an important direction for future work.

Q3. Can you design some language tasks in which different tasks depend on different continuity coefficients, to validate the theoretical claims? No experiments showed that looped transformers suffer from the continuity coefficients, and the current experiments cannot validate the authors’ claim that timestep encoding alleviates the specific limitation.

We designed a perturbed WikiText-103 task (10% of tokens randomly replaced) to test sensitivity to continuity. We further analyzed the continuity coefficients of the trained models by applying perturbations to the input and measuring the change in output embeddings. We found that timestep encoding, with only about a 5% increase in parameters, improved memorization and consistently reduced continuity coefficients. Without it, the model was less stable and failed to capture accurate input-output relationships.

                 Cross-Entropy Loss (Train)    Continuity Coefficients
w/o Timestep     4.32                          130.6
w/ Timestep      4.18                          21.5
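
The rebuttal does not spell out how the continuity coefficients in the table were computed; below is a minimal sketch of one plausible measurement protocol (perturb a fraction of tokens, compare output embeddings, take the worst observed output-to-input change ratio). The model interface and the exact ratio are assumptions for illustration, not the authors' procedure.

```python
import torch

@torch.no_grad()
def empirical_continuity(model, token_ids, vocab_size, frac=0.1, trials=20):
    """Estimate an empirical continuity coefficient for a trained model.

    Hypothetical protocol: randomly replace `frac` of the tokens, then compare
    the change in output embeddings to the number of tokens changed.
    """
    worst = 0.0
    out_ref = model(token_ids)                      # assumed: (seq_len, d) output embeddings
    for _ in range(trials):
        perturbed = token_ids.clone()
        mask = torch.rand(token_ids.shape) < frac   # which positions to replace
        perturbed[mask] = torch.randint(0, vocab_size, (int(mask.sum()),))

        out_per = model(perturbed)
        in_dist = mask.float().sum().clamp(min=1.0) # number of changed tokens
        out_dist = (out_ref - out_per).norm()       # change in output embeddings
        worst = max(worst, (out_dist / in_dist).item())
    return worst
```

A ratio of this kind plays the role of an empirical continuity coefficient for a trained model; the paper's formal definition is over the target function class.
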
Reviewer Comment

I appreciate the authors' detailed response.

  • W1.2
    The parameter efficiency of the loop transformer should be contrasted with a non-loop transformer. I’m not expecting a 1-layer loop transformer to achieve Optimal Storage Capacity (see my Q3.2 comments below). The manuscript, titled “On Expressive Power of Looped Transformers…,” avoids emphasizing the bit complexity, and the whole main body doesn’t seem to celebrate its bit complexity bound as satisfactory. However, when I respectfully noted that the expressive power of a 1-layer loop transformer might overcome the “specific limitation inherent to the loop structure” from the second contribution listed, the authors countered with parameter efficiency.

  • W1-1
    ICLR submissions are publicly available anonymously via OpenReview before publication, and ICML papers often cite such work appropriately in this manner. If the authors truly wish to highlight their bit complexity, they’ve overlooked at least two key studies: Nichani et al. (2025) (on arXiv since 9 Dec 2024) and Allen-Zhu and Li (2025) (on arXiv since 8 Apr 2024).

  • W1.1
    I maintain that discussing the k-layer L looped transformer is valuable for both theoreticians and practitioners. I’m not expecting a polished analysis (which could be future work), but just a discussion—worthy of a dedicated section in the main body or appendix to boost the paper’s impact, and welcome follow-up work. Since the k-layer L looped transformer is more practical than the 1-layer version, it’s worth exploring how k can mitigate the limitation the authors identified. Additionally, whether your Timestep Encoding is more efficient deserves further empirical illustration in an additional experimental section.

  • W3
    If your results on expressive power and bit complexity have limitations and fail to convince me of Pareto Optimality in their trade-off, an estimation error analysis for the looped transformer would be compelling if it achieves min-max optimality for non-parametric regressions.

  • Q3
    The gap between 4.32 and 4.18 isn’t striking to me, and I don’t see why the authors didn’t fairly compare them by adjusting parameters (e.g., increasing w/0's parameters by 5% or decreasing w/ Timestep's parameters). I value this experiment, but it should be conducted before and conducted on more datasets and disturbance conditions. Besides, since this kind of experiment can truely co-relate to your main theories, it should have been included in the first part of Section 5 with detailed discussion—yet it’s currently absent.

  • Q2
    My review was clear: I’m simply inviting just a discussion based on your settings, not demanding a fully developed improvement. A thoughtful discussion could enhance the main body or appendix, drawing researchers for follow-up work and elevating the paper’s impact. I don’t understand the authors’ resistance.

  • C3
    On parameter efficiency, the authors declined to even just discuss “how far your looped transformer is from Optimal Storage Capacity” in my Q3—which is disappointing—clearly I wasn’t expecting you to match Kajitsuka and Sato (2024). I struggle to sense the authors’ passion for non-triviality.

Reference
Nichani et al. (2025). Understanding Factual Recall in Transformers via Associative Memories. In International Conference on Learning Representations (ICLR2025), 2025.
Allen-Zhu and Li (2025). Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. In International Conference on Learning Representations (ICLR2025), 2025.
Kajitsuka and Sato (2024). On the Optimal Memorization Capacity of Transformers. In International Conference on Learning Representations (ICLR2025), 2025.

Author Comment

We sincerely appreciate your thoughtful and constructive comments.

W1.2. The parameter efficiency of the loop transformer should be contrasted with a non-loop transformer. ... when I respectfully noted that the expressive power of a 1-layer loop transformer might overcome the “specific limitation inherent to the loop structure” from the second contribution listed, the authors countered with parameter efficiency.

We respectfully point out that the reviewer may have overlooked the fact that our problem setting is constrained to fixed-size transformers, the same as for looped ReLUs (Zhang et al., 2023). Our study focuses on analyzing the approximation rate with respect to the number of loops. In this regard, the results by Saunshi et al. (2025) do not conform to our setting, as their model does not have a fixed size.

As you correctly noted, we are not emphasizing bit complexity as a key contribution. We only discussed it in response to the reviewer’s question.

Thanks to your comment, we will avoid presenting parameter efficiency as a highlight, and instead clarify that it is part of our problem setting.

Before (line 207): The parameter complexity is $O(d)$, ..., highlighting the parameter efficiency.
After: While the number of parameters is fixed, the function can still be approximated by increasing the number of loops.

W1-1. If the authors truly wish to highlight their bit complexity, ...

We would like to clarify that neither bit complexity nor parameter efficiency is a core contribution of our work. Rather, they define the constraints of our theoretical setting. Our emphasis on these aspects was intended only to address the reviewer's concerns.

W1.1 k-layer L looped transformer is valuable for both theoreticians and practitioners

We sincerely agree with this comment. We simply wished to clarify that our goal is to analyze the approximation rate with respect to the number of loops, even with a constrained architecture (e.g., a single-layer model), and that increasing the number of layers represents a slightly different research direction from our primary focus.

Furthermore, we acknowledge that it is generally difficult to theoretically analyze constant-factor improvements. Nonetheless, we agree that this is an important direction and have added it to the discussion of future work in the revised manuscript.

W3 If your results on expressive power and bit complexity have limitations and fail to convince me of Pareto Optimality in their trade-off, an estimation error analysis for the looped transformer would be compelling if it achieves min-max optimality for non-parametric regressions.

We agree that establishing minimax optimality would be an interesting direction. However, our current scope is limited to deriving approximation rates, and it remains unclear whether results from such a different problem setting would meaningfully complement our rate-based analysis.

Q3 The gap between 4.32 and 4.18 isn’t striking to me, and I don’t see why the authors didn’t fairly compare them by adjusting parameters (e.g., increasing w/o’s parameters by 5% or decreasing w/ Timestep’s parameters).

Our theoretical result shows that the approximation rate can be improved by adding a specific module. Therefore, we do not believe it is necessary to reduce the number of parameters in the base model when incorporating the proposed module.

While we understand the concern that performance gains may stem from an increase in the number of parameters, we would like to highlight that in the dynamic programming task, the accuracy improved from 47.7% to 88.3%, a substantial gain.

Q2 I'm simply inviting just a discussion based on your settings. Can your cube & ID-based analyses be improved by advanced wavelet analyses?

Unfortunately, due to the 5000-character limit in the response format, we were unable to provide a full discussion and prioritized addressing the main points. To clarify, our study is focused on deriving approximation rates for function classes, whereas the reviewer’s suggestion appears more aligned with a discussion of memorization capacity.

To the best of our knowledge, there are no prior works that utilize wavelet analyses to improve ID-based analyses in terms of memorization capacity. Since no concrete example was provided, we find it difficult to assess how such approaches might contribute to our specific setting.

C3. On parameter efficiency, the authors declined to even just discuss “how far your looped transformer is from Optimal Storage Capacity”

We respectfully point out that we responded to this point: the reason we cannot discuss how far our model is from the Optimal Storage Capacity is that this quantity is not currently known for looped models. Since a lower bound has not yet been derived, it is inherently impossible to evaluate the gap from the optimal capacity.

Official Review
Rating: 4

The paper theoretically investigates the expressive power of Looped Transformers, neural architectures that recursively reuse a single Transformer block, which are appealing for their parameter efficiency and generalization capabilities. The authors define a new modulus of continuity tailored to sequence-to-sequence functions to establish the approximation rate of Looped Transformers. Their analysis identifies a specific limitation of the looped architecture due to weight-tying constraints. To address this, they propose introducing scaling parameters conditioned on timestep encoding, thereby enhancing the expressive power of the architecture. Empirical results across various reasoning tasks and standard Transformer benchmarks validate the theoretical findings, demonstrating improved performance with increased loop counts and further gains achieved through timestep encoding.

Questions for Authors

Could you elaborate on whether and how the increase in computational complexity (due to timestep encoding) affects practical deployment scenarios compared to standard Transformers?

Claims and Evidence

  • Claim: Looped Transformers can universally approximate continuous permutation-equivariant sequence-to-sequence functions.

    • Evidence: Proven through a theorem (Theorem 3.6), supported by mathematical proof utilizing newly defined modulus of continuity.
  • Claim: Looped Transformers face inherent limitations due to their weight-tying nature, impacting their expressive power.

    • Evidence: Demonstrated analytically (Lemma 4.1), which highlights limitations in approximating specific target mappings exactly due to recurrent feed-forward layer constraints.
  • Claim: Incorporating timestep encoding with scaling parameters mitigates the limitations and improves approximation capabilities.

    • Evidence: Theoretical justification (Theorem 4.2) shows exact memorization capability with time-dependent scaling, and experiments validate improved performance on reasoning and standard Transformer tasks.

Methods and Evaluation Criteria

The methods and evaluation criteria proposed in the paper are suitable. The authors use well-established theoretical frameworks (modulus of continuity, universal approximation results) and practical machine learning benchmarks (reasoning tasks, dynamic programming problems, Sudoku, in-context learning, language modeling) to validate their theoretical insights.

Theoretical Claims

Theoretical Claims:

  • Looped Transformers are universal approximators for permutation-equivariant sequence-to-sequence functions (Corollary 3.7).
  • Introducing timestep encoding and scaling parameters can overcome certain theoretical limitations specific to Looped Transformers (Theorem 4.2).

The theoretical claims appear correct. The proofs provided are logically structured, detailed, and sound. The innovative use of three separate moduli of continuity to evaluate approximation capabilities is well justified.

Experimental Designs or Analyses

Experiments included:

  • Reasoning tasks (Sudoku, Countdown, Edit Distance, Longest Common Subsequence).

  • In-context learning task.

  • Language modeling on WikiText-103 dataset.

The experimental design is sound, with clear benchmarks appropriate for evaluating reasoning capabilities and general transformer-related tasks. The experimental analyses demonstrate clear performance improvements due to the proposed timestep encoding.

Supplementary Material

I reviewed the supplementary material. Specifically, I reviewed the detailed mathematical proofs of theorems provided in Appendix A, which support the main results presented in the paper.

Relation to Broader Scientific Literature

The paper relates to broader scientific literature through its theoretical examination of expressive power, aligning closely with previous works on sequence-to-sequence function approximation for Transformers (Yun et al., 2020). It extends existing results to weight-tied Looped Transformers, providing new insights into their capabilities and limitations. The introduction of timestep encoding aligns with existing literature on conditional scaling and adaptive methods in neural network architectures.

Essential References Not Discussed

The paper references most relevant prior works thoroughly.

Other Strengths and Weaknesses

Strengths:

  • Clearly stated theoretical contributions with rigorous proofs.

  • Novel and insightful identification of a limitation in Looped Transformers.

  • Introduction of timestep encoding is a simple yet effective conceptual innovation, clearly validated by experiments.

Weaknesses:

  • Additional computational overhead due to timestep encoding may diminish some practical advantages of looped architectures regarding parameter efficiency.

Other Comments or Suggestions

n/a

Author Response

Q1 (W1). Could you elaborate on whether and how the increase in computational complexity (due to timestep encoding) affects practical deployment scenarios compared to standard Transformers?

Thank you for the insightful comment. While timestep encoding adds slight computational overhead, the increase in parameters is minimal, and we observed only a minor impact on throughput. Prior work [1,2] has also explored architectural modifications to enhance expressiveness without sacrificing efficiency, and we believe our design follows this practical direction.

Reference:
[1] Csordás et al., "MoEUT: Mixture-of-Experts Universal Transformers", NeurIPS 2024.
[2] Bae et al., "Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA", ICLR 2025.

Official Review
Rating: 3

This paper studies Looped Transformers from a theoretical perspective. Its first main contribution is to prove a universal approximation property, showing that a wide class of sequence functions can be represented by Looped Transformers. The main technical contribution here, compared to existing universal approximation results for transformers, is to show that this can be achieved with coupled parameters, i.e., by looping a single fixed transformer layer. A second contribution is to propose timestep encoding, which shows improved empirical performance on a few tasks.

update after rebuttal

The authors have promised to address the points I raised.

Questions for Authors

In the hardmax operation, how are ties resolved? I.e., when multiple positions receive the maximum attention score.

Line 155 (right column): where is bit complexity used? It is introduced but then doesn't appear to show up.

Claims and Evidence

The claims made in the Abstract, Introduction, and Conclusion are supported. I believe the stated theorems to be correct.

One issue is that the theoretical analysis uses the hardmax operation (line 106, left column), which is a nontrivial abstraction and may have expressiveness implications [1]. I think this should be mentioned when describing the paper's overall contributions (abstract, introduction) as in other work making such assumptions [e.g. 2].

[1] Yang et al, Simulating Hard Attention Using Soft Attention, arXiv 2024

[2] Amiri et al, Lower Bounds for Chain-of-Thought Reasoning in Hard-Attention Transformers, arXiv 2025

Methods and Evaluation Criteria

Reasoning tasks used for evaluation make sense.

Theoretical Claims

I read through the proofs in the appendix, and checked the individual steps of proof of Theorem A.1 in detail. The later proofs are plausible to me, but I did not check every single step.

Experimental Designs or Analyses

I checked the soundness of the experiments on reasoning tasks as described in Appendix C.

Supplementary Material

None

Relation to Broader Scientific Literature

The paper is related to both the theoretical literature on Looped Transformers, and literature on the universal approximation property of transformers. It brings these together by showing the latter kind of property under the constraints imposed by the looped transformers architecture.

Essential References Not Discussed

None, as far as I see.

Other Strengths and Weaknesses

Weaknesses:

  • Of note, unlike a lot of other literature on the expressive power of transformers [e.g. 1,2,3], the present paper follows Yun et al in only studying functions on fixed-length inputs, without any regard to performance across inputs of potentially unbounded length. This restriction can be fine, but I think this should be made more transparent earlier in the paper (e.g., abstract, intro).

  • Regarding clarity: I found the terminology surrounding "sequence IDs" and "token IDs" confusing in Definition 3.2. At this point, a reader may wonder: Is a Token ID something like a word embedding? or a positional embedding? or something different? And what exactly is a sequence ID -- what kind of object even is it -- a number, a tensor, or something else? I think the reader should be given a clearer idea at this point.

[1] Strobl et al, What Formal Languages Can Transformers Express? A Survey, TACL 2024

[2] Hahn, Theoretical limitations of self-attention in neural sequence models, TACL 2020

[3] Merrill and Sabharwal, The Expressive Power of Transformers with Chain of Thought, ICLR 2024

Other Comments or Suggestions

line 88, right column: "can multiple graph" a word such as "perform" is missing after "can"

line 192, right column: "Looped Transformers is defined by" -- delete the word "is", as there is also a verb in the second-to-last line of the Corollary.

Section 5.1: The abbreviations LCS and ED and the mapping between the tasks listed in the text are not made explicit. The reader needs to refer to the appendix to figure this out.

line 571: $\delta^{-1} \in \mathbb{N}$, $\delta \geq 2$ -- should the second $\delta$ be $\delta^{-1}$?

In the proof, I found it hard to track when p-norms are applied to individual vectors (as in equation 49) or across the entire input space (as in equation 46, or in the line below equation 59). Can the authors make the distinction more explicit?

Line 1302: "combined" --> "combining"

Author Response

W1. Unlike a lot of other literature, the present paper follows Yun et al in only studying functions on fixed-length inputs. This restriction can be fine, but I think this should be made more transparent earlier in the paper

We respectfully point out that the study of function approximation and memorization commonly assumes fixed-length inputs (Yun 2020, Kim 2023, Kajitsuka 2024, Jiang 2024). On the other hand, the prior work cited by the reviewer focuses on computational complexity rather than function approximation. Nevertheless, we agree that this distinction is important. Accordingly, we will revise the introduction to make this assumption more explicit.

line 45: Our contributions are summarized as follows: We establish the approximation rate of Looped Transformers for fixed-length continuous sequence-to-sequence functions.

W2. Regarding clarity: I found the terminology surrounding "sequence IDs" and "token IDs" confusing in Definition 3.2.

We agree that the terminology in Definition 3.2 could be made clearer. We will explicitly clarify in the definition that the sequence IDs and token IDs are integers. Furthermore, we will explain the motivation: they are defined for a constructive proof to describe contextual mappings as Kim et al. (2023). Specifically, we will revise the definition and add the following explanation as:

line 175: In our proofs, we rely on the contextual token ID to describe contextual mappings.
Definition 3.2. A token ID is a unique integer assigned to each token. A sequence ID uniquely identifies each sentence. A contextual token ID uniquely identifies a specific token within a specific sentence.
This notion is defined in Kim et al. (2023), to which we refer for further details, for constructive proofs of contextual mappings. The actual construction of contextual token IDs may vary depending on the specific proof. In our case, we adopt a different construction from that of Kim et al. (2023).

Q1. In the hardmax operation, how are ties resolved? I.e., when multiple positions receive the maximum attention score.

Such cases result in errors, but these errors diminish as the approximation improves. As you rightly noted, when multiple positions have identical values, the hardmax operation cannot distinguish between them. In our proof, we explicitly evaluate the measure of such regions and include them in the error term, $\mathcal{O}(\delta^{d})$. As the discretization becomes finer ($\delta \to 0$), the proportion of these regions decreases and eventually becomes negligible. To clarify this point, we will add the following sentence:

line 329: ..., where $\mathcal{O}(\delta^{d})$ arises from the case where identical tokens appear in sequences.

Q2. Line 155 (right column): where is bit complexity used? It is introduced but then doesn't appear to show up.

The bit complexity is part of the result stated in Theorem 3.6. However, we agree that its role is not clearly connected at the point of introduction. To improve clarity, we will revise the text (line 155) to explicitly indicate where the bit complexity appears in the subsequent analysis.

Before: We evaluate bit complexity, which denotes the maximum number of bits of weights.
After: We evaluate bit complexity ... weights. Our theoretical results show how the approximation rate depends on both the number of parameters and the bit complexity.

C1. One issue is that the theoretical analysis uses the hardmax operation (line 106, left column), which is a nontrivial abstraction and may have expressiveness implications. I think this should be mentioned when describing the paper's overall contributions (abstract, introduction) as in other work making such assumptions.

While theoretical analysis with the hardmax operation is a common approach in function approximation (Yun et al., 2020; Kim et al., 2023), we agree that it constitutes a strong assumption. To clarify this, we will revise as:

Before: we use the hardmax function instead of the softmax function.
After: Our constructive proof relies on the softmax function to approximate the hardmax function as in previous work (Yun et al., 2020; Kim et al., 2023).

O1. In the proof, I found it hard to track when p-norms are applied to individual vectors or across the entire input space.

We appreciate the comment. We will clarify, in the camera-ready version, the distinction between vector $p$-norms and function $L^p$-norms, using $|f|_{L^p}$ for the latter, as defined in Equation (2).
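
For concreteness, the two norms being distinguished are, in standard form (the exact normalization in the paper's Equation (2) may differ):

$$
\|x\|_p = \Big(\sum_{i=1}^{d} |x_i|^p\Big)^{1/p} \quad \text{(vector $p$-norm)},
\qquad
|f|_{L^p} = \Big(\int \|f(X)\|_p^p \, dX\Big)^{1/p} \quad \text{(function $L^p$-norm)}.
$$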

Reference:
Yun et al., "Are Transformers universal approximators of sequence-to-sequence functions?", ICLR 2020.
Kim et al., "Provable Memorization Capacity of Transformers", ICLR 2023.
Haotian Jiang and Qianxiao Li, "Approximation Rate of the Transformer Architecture for Sequence Modeling", NeurIPS 2024.
Tokio Kajitsuka and Issei Sato, "On the Optimal Memorization Capacity of Transformers", ICLR 2025.

Reviewer Comment

Thanks to the authors. These clarifications will improve the paper.

Official Review
Rating: 3

The authors conduct theoretical analysis on the expressive power and approximation rate of looped transformers, a variant of transformers with weight tying across layers and unbounded recurrence. The analysis led to some limitations of looped transformers on function approximation, especially in memorizing input-output pairs. The authors also propose a way to fix this by adding a scaling mechanism conditioned on the loop index, whose effectiveness is empirically validated.

update after rebuttal

The rebuttal helps complement the draft and addresses some of my concerns. I will maintain my original score, which leans positive overall.

Questions for Authors

None

Claims and Evidence

Most claims are theoretical and the correctness depends on the proof. Another claim on how adding timestep encoding should improve performance is empirically validated.

Methods and Evaluation Criteria

Yes the methods and evaluation make sense.

Theoretical Claims

I could understand the high level proof strategies which make the conclusions plausible, but didn't check the proof correctness due to limited time and expertise in theoretical analysis.

Experimental Designs or Analyses

The experiments are on tasks repeatedly adopted in prior work, and the settings are sound.

Supplementary Material

I briefly checked the datasets and the tasks used for experiments. They help provide extra details.

Relation to Broader Scientific Literature

The paper mostly extends the current theoretical understanding of the looped transformer regarding its parameter efficiency and approximation abilities, and how these compare with standard transformers.

Essential References Not Discussed

I believe deep equilibrium models are closely related to looped transformers, but the authors don't seem to discuss them.

Other Strengths and Weaknesses

While it's good to understand the expressive power of looped transformers, it's a bit unclear how/whether the ideal weight configurations could be learned through gradient-based optimization. Looped transformers are notoriously hard to train stably, and there are various tricks that people have been using, such as only computing the gradient at the last several recurrence steps, etc. It would be great if the analysis could provide insights into how to better train these models.

Other Comments or Suggestions

None

Author Response

I believe deep equilibrium models are closely related to looped transformers, but the authors don't seem to discuss them.

Thank you for the suggestion. We agree that Deep Equilibrium Models are related to Looped Transformers. We will add the following sentence to the related work section:

line 77: Deep equilibrium models (Bai et al., 2019), which compute fixed points of iterative layers, are also related.

W1. While it's good to understand the expressive power of looped transformers, it's a bit unclear how/whether the ideal weight configurations could be learned through gradient-based optimization. Looped transformers are notoriously hard to train stably, and there are various tricks that people have been using, such as only computing the gradient at the last several recurrence steps, etc. It would be great if the analysis could provide insights into how to better train these models.

Thank you for the thoughtful comment. We conducted an additional analysis; specifically, we empirically observed that the trained models tend to exhibit smaller continuity coefficients with timestep encoding. Since smaller continuity coefficients (such as Lipschitz constants) are often associated with more stable training dynamics, our results suggest that the proposed approach may also contribute to improved trainability of Looped Transformers. We will include this point as an important direction for future work:

line 439: Beyond expressivity, improving training stability and efficiency, especially for larger numbers of loop iterations, is an important direction for future work.

Reference:
Bai et al., "Deep Equilibrium Models", NeurIPS 2019.

Final Decision

This paper focuses on understanding the approximation power of looped transformers. In particular, it focuses on looped transformers where a single layer is looped $r$ times. It establishes approximation rates for the looped transformers in terms of multiple notions of modulus of continuity for the target function. The paper then highlights a limitation of Looped Transformers and proposes timestep encoding to scale different loops. The paper presents empirical results to validate the utility of the proposed timestep encoding.

Overall, the paper studies a timely problem and makes non-trivial contributions toward improving the understanding of looped transformers.

The reviewers raised multiple valid questions/comments during the review process. Some of these comments were centered around improving the presentation, which the authors have adequately addressed. The reviewers also brought up a recent work by Saunshi et al. 2025 that also explores the approximation power of looped transformers by arguing how a looped transformer can simulate a non-looped transformer. The authors pointed out how their problem setting differs from the one covered by Saunshi et al. The authors are encouraged to include a detailed comparison with this work. Furthermore, as requested by one of the reviewers, the authors should include a discussion on those looped transformers that loop a block of k layers multiple times and comment on the challenges in extending their analysis to this setting. The authors are also encouraged to include the new experimental results from their response with a detailed discussion. The authors must also incorporate all the changes they promised to the reviewers during the review process.