Characterizing the Expressivity of Fixed-Precision Transformer Language Models
Abstract
Reviews and Discussion
It is shown that softmax transformers with ``fixed precision'' (and no positional encoding and strict future masking) recognize precisely the languages definable in a fragment of LTL. This logic (containing precisely the star-free languages) has previously been shown to be equivalent to unique hard attention transformers with strict future masking and no positional encoding (Yang et al.).
Strengths and Weaknesses
At first glance, the result seems interesting, as it is a precise characterization of which languages softmax (thus, practical!) transformers can recognize. However, I believe that the fixed-precision assumption does not make sense for the case of softmax transformers.
Let me start by noting that the paper does not formally define ``fixed precision'' transformers (and, in fact, neither do other papers in this area). And this is not as trivial as it seems. OK, when we only have operations on rational numbers, one can say that precision t means that every number in every computation can be represented as a fraction of two t-bit integers. But when we have softmax and layer norm, it means that we have to do rounding. But do we use some specific rounding rule? Or is there ``an adversary'' that can return any rational number ε-close to the real output, where log(1/ε) is the number of precision bits? And if we sum up n elements one by one, when do we round -- only at the end, or after each partial sum? If one thinks about it, depending on the answer, fixed precision can turn out to be logarithmic. Hence, fixed precision has to be carefully defined.
In any case, it is hard for me to imagine how the fixed-precision assumption could make sense for softmax transformers. The most basic thing we can do with them is to take the arithmetic mean of the input bits (using attention with uniform weights). But even this operation requires log n bits.
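A small NumPy illustration of the partial-sum issue (a sketch only; float32 merely stands in for some fixed-precision format):

```python
import numpy as np

s = np.float32(2**24)        # 16777216.0, exactly representable in float32
print(s + np.float32(1.0))   # 16777216.0 -- the increment is lost to rounding
# If rounding happens after every partial sum, a running count of ones can
# never exceed 2**24, so an "arithmetic mean of the input bits" computed this
# way is wrong for long inputs; rounding only once at the end behaves differently.
```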
In that sense, I have a ``counter-example'' to the main result (in quotes because the exact model has not been defined in the paper). Softmax transformers with fixed precision can distinguish binary strings with at most a 1/3 fraction of ones from strings with at least a 2/3 fraction of ones (take the average of the input bits, with enough precision to distinguish 1/3 from 2/3). This problem is known as Approximate Majority; in reference [2] of the paper, it is shown that unique hard attention transformers cannot do Approximate Majority while they subsume LTL, which in turn subsumes the logic in the paper, leading to a ``contradiction''.
Of course, there might be no contradiction if authors had in mind another model which is not capable of doing even this simple thing. But then I am not sure to which degree it is fair to call this model ``softmax transformers''.
Additionally, the speculation that practical transformers use fixed precision, say 32 bits, is not very convincing. Indeed, in that case, why let the input length go to infinity at all? Practical transformers work with inputs that do not exceed some finite number! Of course, the point is that we need to make asymptotic statements because it is not very reasonable to say what can and cannot be done for a fixed input length. Hence, I do not see any problem with such an idealization having precision that grows with n. Note that 32 is significantly larger than the logarithm of the input lengths on which practical transformers are usually run.
Questions
Please see the ``strengths and weaknesses'' section.
Limitations
yes
Final Justification
I thank the authors for their response. I did not understand the precise model from the answer. I am keeping my score, as I think that if the model cannot compute Approximate Majority, it is not an adequate model of softmax transformers.
Formatting Issues
Paper formatting seems right.
We thank the reviewer for engaging deeply with the paper. Below, we respond to the main concerns point by point.
The paper does not formally define "fixed precision"... depending on assumptions, fixed precision can effectively turn into logarithmic precision. Hence, fixed precision has to be carefully defined.
We define fixed precision by assuming the existence of a finite set of representable numbers, under which the transformer operates. We do not make assumptions about specific rounding rules, or rationality of values, or closeness to ideal real-valued outputs. Instead, we rely on three minimal and realistic properties:
- Overflow: There is a maximum representable number; any computation exceeding it yields overflow.
- Underflow: There is a smallest positive number; values below it round to zero.
- Exponential cutoff: There exists a sufficiently negative number $c$ such that $\exp(x)$ evaluates to exactly zero for all $x \le c$.
All of these are satisfied by standard numerical libraries (e.g., NumPy, PyTorch, JAX).
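As an illustration, all three properties can be observed directly in standard 32-bit floats (a minimal NumPy sketch; the particular thresholds are implementation details of float32, not part of our definition):

```python
import numpy as np

big = np.float32(1e38)
print(big * np.float32(10))        # inf -- overflow past the largest float32

tiny = np.float32(1e-30)
print(tiny * tiny)                 # 0.0 -- underflow below the smallest positive float32

print(np.exp(np.float32(-200.0)))  # 0.0 -- exponential cutoff for very negative inputs
```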
We appreciate the reviewer’s request for precision and will clarify this definition explicitly in the final version, if given the opportunity.
I have a ``counter-example'' to the main result... Approximate Majority...
Approximate Majority is not recognizable in our idealization, precisely because uniform attention (and thus averaging inputs) is not always feasible for all input lengths. Therefore, it does not constitute a counterexample.
I believe that fixed-precision assumption does not make sense for the case of softmax transformers.
Why let the input length go to infinity? Practical models use 32-bit precision and finite inputs.
Our experiments show that transformers fail to generalize well before the theoretical precision limit. Specifically, under standard 32-bit precision, models fail to generalize, at sequence lengths below 500, on tasks that log-precision models would handle, far short of the roughly $2^{32}$ that 32-bit precision would suggest. Notably, length 500 is not large by practical standards either.
Why real-world models fail so early and appear empirically weaker than log-precision theoretical predictions remains an open question and an exciting direction for future work.
By contrast, our results align exactly with the fixed-precision model: languages inside LTL[past] can be trained to 100% accuracy, while those outside cannot. This suggests that fixed precision provides a realistic and informative lower bound on the class of languages transformers can express reliably and without error, and studying it is far from pointless.
Additionally, the fixed-precision setting becomes even more relevant as quantized models are increasingly popular in practice. These models operate at significantly lower precision (e.g., 8-bit or 4-bit), making precision constraints even more immediate and our theoretical framework directly applicable.
In summary, we see the study of fixed precision as complementary to other settings, such as logarithmic precision, or mixed precision (e.g., Yang and Chiang, 2024, where attention uses rational arithmetic but other operations remain fixed-precision).
We again thank the reviewer for their time and effort. We hope these clarifications have helped convey the scope and motivation of our work, and we will revise the paper accordingly should we have the chance.
References
- Yang and Chiang. (2024). Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers.
I would like to note that the exponential cutoff rule makes this model essentially a ``hardmax model'' and hence not trainable (which is a usual criticism papers about hardmax transformers receive). This makes me strongly disagree with positioning this paper as a paper about softmax transformers.
We thank the reviewer for the comment.
The abstraction we study is intended to be faithful to how softmax is implemented in practice: real-world transformers do operate under fixed precision. We understand that the concern arises because our analysis follows the formal methods convention of requiring correctness for all input lengths, an assumption that is arguably unrealistic for practical transformers. We will make this point explicit in the revision and clearly acknowledge the reviewer's perspective.
Importantly, we stress that the study of fixed-precision softmax attention is meaningful: our experiments show that it predicts the length-generalization abilities of real transformers even for bounded lengths. It also captures aspects of real-world computation that log-precision or arbitrary-precision models overlook, such as rounding in basic arithmetic, which persist even when input lengths are bounded.
Crucially, fixed-precision softmax is still definitionally different from hardmax attention: it can assign non-zero weights to multiple positions, not just those with the highest score. This behavior needs to be accounted for in a full characterization, and our analysis explicitly addresses all input strings, not only those that trigger the cutoff.
While we show that, in terms of expressive power, fixed-precision softmax collapses to a certain variant of hardmax attention (leftmost unique hard attention), this equivalence is exactly one of our key contributions: it is not obvious, nor was it previously known in the literature. Moreover, prior to this work, it was unclear which variant it would degenerate to (leftmost unique hard attention, rightmost unique hard attention, or average hard attention).
We are not the first to study fixed-precision softmax transformers. Chiang et al. (2023) and Yang & Chiang (2024) simulate them using counting quantifiers, which no variant of unique hard attention can replicate. Li et al. (2024) placed them in $\mathrm{AC}^0$, thereby connecting them to rightmost unique hard attention. We show that rightmost unique hard attention is strictly more expressive than fixed-precision softmax attention, tighten these upper bounds, and prove that ours is in fact tight.
In summary, we will revise the title and text to be as precise as possible, avoiding any over-claims or ambiguity, while keeping the focus on our main contribution.
References
- Chiang et al. (2023). Tighter Bounds on the Expressivity of Transformer Encoders.
- Yang and Chiang. (2024). Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers.
- Li et al. (2024). Chain of Thought Empowers Transformers to Solve Inherently Serial Problems.
This paper makes the following contributions:
- An exact theoretical characterization of the expressive power of fixed-precision soft-attention transformers as LTL[past], which is obtained by showing two directions:
- PFO2[<] as an upper bound
- LTL[past] as a lower bound
- Theoretical results showing that LTL[past] is a natural class via several equivalent definitions in terms of automata, logic, etc.
- Experiments suggesting transformers can length generalize correctly exactly on those languages that are definable in LTL[past]
Strengths and Weaknesses
Strengths
- Novel theoretical result on expressivity of fixed-precision soft attention that significantly extends past work. In particular, this paper is able to obtain an exact characterization for this model of transformers that is itself a fairly natural class (as evidenced by many converging definitions of this class).
- Clever trick and observation used to show that the "since" operator can be simulated with fixed precision.
- Paper is generally well written and presented, and would be of clear interest to the research community working on the expressive power of transformers and other architectures.
Weaknesses
On Fixed Precision
In the title and elsewhere, this paper seems to imply its results hold for all transformer language models, while in reality it characterizes one specific formal model of transformers (fixed-precision, soft-attention transformers). This is not the only reasonable formal model (e.g., log precision is arguably more natural since it allows uniform attention). I would prefer if the title were precise about this fact via inclusion of the term fixed precision. Relatedly, the following part of the intro brushes under the rug the fact that there are other reasonable models of transformers beyond fixed precision:
Although these idealizations capture key aspects of transformers, they tend to overestimate the expressive power.
What evidence do you have for this? E.g., usually in practice precision is greater than log₂ n, so log precision seems reasonable.
Reconciliation with C-RASP as a Model for Length Generalization
Regarding the claim that the LTL[past] results perfectly predict length generalization, C-RASP, which corresponds to transformers where attention is computed with log precision, also seems to have strong theory/experiment match for length generalization; e.g., see https://openreview.net/forum?id=U49N5V51rU. How do you reconcile your claims with theirs? Are there any languages that test a discrepancy between these formal models of transformers? Or differences in the empirical results between your experiments and theirs?
Experiments
What precision do you use in your experiments?
For the languages where the transformer fails but achieves near 100% performance, what does performance look like as a function of sequence length? Is it possible that the transformers are less sample efficient than LSTMs and are just under-trained?
Why do you only consider 3 languages for the language modeling experiment? Especially because the "failing" language still achieves 98.3% accuracy, the evidence here is somewhat weak
Questions
The PFO2 upper bound proof makes sense at a high level: in fixed precision, summations become simple to simulate compared to log precision. But I'd like some more detail on the following:
Therefore, it reduces to a sum of a bounded number of terms, which can also be simulated by PFO2 [<] (Lemma B.7).
If the terms that were being summed were always the same, I'd immediately accept this argument. But in fact, there is a bounded sublist of an unbounded list of terms that needs to be summed. Does this change anything about the argument? The proof needs more detail to clarify this.
"Uniformly attend to positions m < n where [...]" State that you can do this because of fixed precision: i.e., if you set the temperature high enough, everything that violates a key/query match condition will go to 0.
Limitations
As alluded to in weaknesses, the authors should better acknowledge that fixed-precision is one, though not the only, formal model of transformers. There are arguments for other models like log precision.
Final Justification
Overall, I appreciate the reviewers' comments and maintain my positive score. I hope that, as promised, the authors better highlight the fixed precision assumption (vs. stating their results for "softmax" transformers) and also better discuss the differences between their empirical results and similar results for C-RASP.
Formatting Issues
N/A
We thank the reviewer for their thoughtful and positive review. We are especially grateful for the recognition of the novelty and significance of our theoretical results, the strength of the proofs, the clarity of the writing, and the potential interest to the broader research community. Below, we address the specific comments and questions raised.
In the title and elsewhere, this paper seems to imply its results hold for all transformer language models, while in reality it characterizes one specific formal model... I would prefer if the title was precise about this fact via inclusion of the term fixed precision.
We agree and appreciate the suggestion. While we refer to "fixed precision" throughout the paper, the title and introduction could indeed better reflect the specific model we analyze. We will revise the title and refine the introductory language to make it clear that our results pertain specifically to fixed-precision transformer models.
What precision do you use in your experiments?
We use the default 32-bit floating point precision provided by the standard JAX library for both training and inference.
What evidence do you have for this? Usually in practice precision is greater than log₂n, so log precision seems reasonable.
Our empirical results suggest that with 32-bit precision, transformers fail to generalize at input lengths below 500 on languages outside LTL[past], many orders of magnitude smaller than the theoretical bound of $2^{32}$. This suggests that log precision may overestimate real-world transformer capabilities, and motivates our focus on strict fixed precision.
Regarding the claim that the LTL[past] results perfectly predict length generalization, C-RASP... also seems to have strong theory/experiment match... How do you reconcile your claims with theirs?
C-RASP retains fixed precision outside attention but allows arbitrary precision within attention. This effectively augments LTL[past] with counting.
Experimentally, the two studies adopt different definitions of success. Our evaluation requires perfect generalization across all input lengths, consistent with formal models such as logic and automata. Huang et al. (2025) adopt a more relaxed criterion, focusing on strong but not perfect generalization. We view the two results as complementary: C-RASP highlights what transformers may learn approximately and generalize well in practice, while our work characterizes what they can reliably and provably generalize without error.
Is it possible that the transformers are less sample efficient than LSTMs and are just under-trained?
We believe this is unlikely. Prior work (Deletang et al., 2023) shows that 1M training steps are sufficient for learning a wide range of formal languages. In our experiments, convergence typically occurs well before this point. Additionally, we find that minor architectural changes (e.g., constraining some attention heads to look only at the previous token) enable the same training setup to achieve perfect generalization on certain tasks outside LTL[past].
Why do you only consider 3 languages for the language modeling experiment? Especially because the "failing" language still achieves 98.3% accuracy
We focus on language recognition (LR) because it is a strict subtask of language modeling (LM): to correctly predict the next token, a model must first determine which language the prefix belongs to. Therefore, failure on LR already implies failure on LM. Our LM experiments primarily serve to demonstrate that transformers can correctly track the state of the prefix and generate only permissible symbols.
Since transformers have already been shown to fail at recognizing languages outside LTL[past], we do not expect them to model such languages either. Accordingly, we choose the simplest such language, RDP-1, as a counterexample.
Note that some languages are unsuitable for language modeling because any next token is valid at any step, making next-token prediction uninformative.
As for the seemingly high accuracy (98.3%), this is due to the structure of the languages. For instance, in RDP-1, every symbol except one is a valid continuation until the critical symbol appears. This allows the model to make many "easy" correct predictions, inflating its token-level accuracy. However, as with LR, any accuracy short of 100% indicates failure. The key failure mode is clearer when analyzing the model's internal states: on RDP-1, the prefix states are not distinguishable by their representations, in contrast to LDP-1 and LDP-2, where transformers successfully track the DFA states.
The PFO2 upper bound proof makes sense... but I'd like more detail on the following: "it reduces to a sum of a bounded number of terms..." The proof needs more detail.
The key is that fixed precision limits the number of non-zero attention weights, which bounds the number of summed terms. Moreover, since each of these terms is drawn from a finite set of representable values, the total number of possible combinations, and hence of resulting sums, is also finite; any sum exceeding the representable range simply causes overflow.
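To make the counting argument concrete, here is a toy enumeration (a hedged sketch; the value grid `F`, the bound `MAX`, and the saturation-style overflow are made up for illustration and are not the actual representable set):

```python
from itertools import combinations_with_replacement

F = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical finite set of representable values
MAX = 3.0                          # hypothetical overflow threshold

def clamp(x):
    # model overflow by saturating at the largest representable magnitude
    return max(-MAX, min(MAX, x))

k = 2  # at most k non-zero attention weights => at most k summed terms
possible_sums = {clamp(sum(terms))
                 for r in range(1, k + 1)
                 for terms in combinations_with_replacement(F, r)}
print(sorted(possible_sums))  # a small finite set, so it can be tabulated in PFO^2[<]
```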
"Uniformly attend to positions m < n where [...]" We cannot uniformly attend to all satisfying a condition when too many positions satisfy it under fixed precision. This is why we need two attention layers in Lemma B.11.
We again thank the reviewer for their careful reading, constructive questions, and generous evaluation. We will incorporate all suggested clarifications to strengthen the presentation and position the work more precisely within the broader landscape.
References
- Huang et al. (2025). A Formal Framework for Understanding Length Generalization in Transformers.
- Deletang et al. (2023). Neural Networks and the Chomsky Hierarchy.
Overall, I appreciate the reviewers' comments and maintain my positive score.
C-RASP highlights what transformers may learn approximately and generalize well in practice, while our work characterizes what they can reliably and provably generalize without error.
It's not clear that C-RASP represents a class of languages that transformers "learn approximately". In my eyes, the results in this paper motivate further empirical comparison of transformers' ability to learn C-RASP-expressible vs. PFO^2[<]-expressible languages.
We thank the reviewer for their continued positive evaluation. We used the term "learn approximately" to refer to the empirical observation that transformers can often learn languages outside LTL[past] but inside C-RASP effectively but not flawlessly (e.g., around 90% accuracy). We recognize that this wording may be ambiguous in this context, and we will find a more precise phrasing in the revision.
We also agree that a systematic empirical comparison of transformers’ ability against theoretical implications derived under different assumptions would be valuable.
This paper investigates the expressive power of real-world transformer models through the lens of formal languages. In particular, it characterizes the capabilities of transformers with fixed numerical precision, soft attention, and strict future masking. The authors prove that such transformers are equivalent in expressivity to a fragment of linear temporal logic and a fragment of first-order logic. These characterizations are then used to construct separation examples that highlight the limitations of transformer models. The paper also presents empirical evidence by training transformers to verify these separations.
Strengths and Weaknesses
Strengths:
- The paper is clearly written, with well-structured and accessible proofs.
- It closes the gap in the literature by providing a precise characterization of practical transformer models.
- The empirical evaluation convincingly demonstrates the expressive boundaries of such models, offering valuable insights into their capabilities.
Weaknesses:
- The treatment of transformers with positional encoding is limited and lacks a thorough theoretical analysis.
- The study does not consider transformers equipped with Chain-of-Thought (CoT) reasoning, which limits its relevance to state-of-the-art models.
Questions
- Do the separation results demonstrated in the experiments also hold for transformers with positional encoding?
Limitations
yes
Final Justification
My concerns have been largely addressed. I will maintain my score.
Formatting Issues
N/A
We thank the reviewer for their thoughtful and encouraging feedback. We are especially grateful for the recognition of the clarity of our writing, the strength of our theoretical characterization, and the value of the empirical results.
We agree that both positional encodings and CoT reasoning are important and exciting directions for future work. We believe that our results on barebones transformer models provide a solid foundation for pursuing these lines of research.
Regarding the reviewer's question about positional encodings: Yes, the separation results do hold even when positional encodings are included. In our preliminary experiments, we explored both absolute (sinusoidal) and relative (rotary) positional encodings. We found that neither variant improved performance on the languages outside LTL[past].
This paper theoretically analyzes the expressive power of fixed-precision transformers with strict future masking (i.e., every position attends to the left, not to itself), and shows that the class of languages captured by such transformers is precisely the temporal logic LTL[past], for both average hard attention and soft attention.
As part of the argument, the authors also show a (novel, I think) equivalence between LTL[past] and a 2-variable variant of first order (FO) logic.
Strengths and Weaknesses
The paper has many clear strengths, including:
- Technically strong!
- Clear articulation of all concepts and proof ideas; detailed background and proofs in the appendix.
- Novel exact characterization of the considered formal model of transformers, extending the previous characterization of a similar formal model but one that had used unique hard attention (which was farther away from practice).
- Length generalization experiments with specific languages, supporting findings.
- Neat idea for simulating the "past" operator despite the "finite attention span" limitation of soft attention (and also average hard attention?) when using finite precision.
There are, nonetheless, some aspects that can use improvement:
- While I agree with the authors' claim that this work is a step closer to the formalization of real-world transformers (especially compared to prior work analyzing unique hard attention), the current introduction wording has the feel that the formal model used here has closed that gap. But, as noted next, this doesn't quite appear to be the case, and I would thus suggest softening / refining the wording along these lines.
- To give some examples of the above, strict masking is not used in practice (though perhaps it should be, given your findings? it does appear odd for a position to not pay attention to itself, especially when operating over natural language, e.g., for coreference resolution), position encodings are always used (whereas the authors discuss certain languages not being recognizable without strict masking -- which is true only without position encodings), the layer normalization discussion is partial (please see below), I don't recall a mention of the residual stream, and perhaps more.
- Other reasons to push back against this formal model as the final word on the subject or being the closest to real-world transformers: (1) The finding that, under this model, unique hard attention is more powerful than average or soft attention is counter-intuitive; similarly, that average-hard and soft attention are essentially identical here is also counter-intuitive. (2) While having a fixed set of trained parameters makes complete sense, I don't particularly see an issue with expecting that when solving larger instances of a problem, the model could have access to higher precision arithmetic, or something that scales with input size.
- On that last note, related to the presented length generalization experiments, it is possible -- in principle -- that your trained transformer models for the harder languages do, in fact, generalize; it's just that your trained models need to run on larger hardware / higher-precision chips in order to solve inputs of size 500. This is a matter of implementation and experimentation, not really answering the question of whether the transformer accurately learned to solve the task.
- In fact, if you take any classic textbook algorithm and implement it on fixed-precision hardware, it will likely fail to solve the problem for large-enough input sizes because of numerical overflow/underflow. Should we then conclude that the textbook algorithm (which has the "correct" finite description) is not solving the problem?
- Treatment of layer normalization: it's mentioned somewhat in passing in Prop B.12 in the appendix. There is a "proof" but I couldn't quite follow how exactly it works, why the variance is identical across dimensions, etc. It was also unclear how LN comes into the picture in the other direction, i.e., when simulating a transformer with LN (vs. without LN) using PFO^2[<]. I'm assuming the fixed-precision assumption lets this work out. Please clarify.
===
In summary, I like the paper and I would like to see it published. At the same time, I would appreciate it if the authors could give more thought to their quick opening argument that finite-precision is the "right" formal model of transformers. If my arguments above influence their view (especially the one about never being able to experimentally verify arbitrary "length generalization" of even a textbook algorithm we know to be correct due to numerical overflow/underflow issues), I would appreciate adjusting the positioning of the work to account for this. The paper is technically a solid contribution, it's just the positioning that I feel should be refined.
Questions
Nothing in particular, besides what was raised above in the comments.
Limitations
yes
Final Justification
I am happy to see the paper accepted, especially after the authors' detailed, thoughtful reply, and their willingness to adjust the wording and expand upon certain aspects to make the paper's content and positioning more clear. I look forward to reading the revised version once published!
Formatting Issues
n/a
We thank the reviewer for their thoughtful and constructive feedback, and for their positive assessment of the technical contributions, clarity of exposition, and the paper’s overall value. We are particularly grateful for the recognition of our novel characterization, the simulation of the past operator, and the experimental validation through length generalization. Below, we respond point-by-point to the reviewer’s comments and suggestions.
While I agree with the authors' claim that this work is a step closer to the formalization of real-world transformers..., the current introduction wording has the feel that the formal model used here has closed that gap.
We agree that there is still a gap between our formal model and real-world systems. That’s why we describe our model as an idealization. We thank the reviewer for pointing out that our wording may be misleading. In the final version, we will soften the language to clarify this point.
It does appear odd for a position to not pay attention to itself... Position encodings are always used... layer normalization discussion is partial... no mention of residual stream...
These are valuable observations. We respond to each below:
- Strict masking: Due to the residual connections, the current token still has access to itself through the residual stream. This offers intuition for why strict masking does not remove information.
- Positional encoding: Together with prior work, we establish that positional encodings largely play the role of numerical predicates in formal logic. Nevertheless, we agree this connection is not sufficient on its own, and we hope to investigate more precisely which numerical predicates commonly used positional encodings correspond to in future work.
- Residual connection: Residual connections are present in our model and used critically in the proofs in Appendix B.3. They ensure that earlier logical dimensions are preserved across layers, which is key to simulating complex formulas. We will make this clearer in the main text.
- Layer normalization: We now provide a more concrete example to support Proposition B.12. Suppose we have three dimensions, each encoding the truth value of some formula at each position. Inputting the string "ab" produces (positions as columns):

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \end{pmatrix}$$

These column vectors have different means and variances, and applying LN would distort their encoding. To neutralize this, we introduce a mirror dimension (one minus the original value) for each original dimension, giving:

$$\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 1 \\ 0 & 1 \\ 1 & 0 \\ 1 & 0 \end{pmatrix}$$

where the last three rows are mirror dimensions. Now, each position (column vector) has mean 1/2 and variance 1/4. By setting $\gamma = 1/2$ and $\beta = 1/2$, we cancel out LN entirely. This trick ensures that LN acts as the identity function and has no effect on the content.
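The cancellation can be checked numerically (a small NumPy sketch; we set eps to 0 so that LN is exactly the identity):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=0.0):
    mu, var = x.mean(), x.var()
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([1.0, 0.0, 0.0])              # one position (column) of the original encoding
x_mirrored = np.concatenate([x, 1.0 - x])  # append the mirror dimensions
print(layer_norm(x_mirrored, gamma=0.5, beta=0.5))
# [1. 0. 0. 0. 1. 1.] -- LN leaves the mirrored encoding unchanged
```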
In the reverse direction, simulating transformers using logic, fixed precision makes the simulation of any positionwise deterministic function straightforward: we simply enumerate all possible inputs and outputs. Both layer normalization and residual connections fall under this class.
UHA more powerful than soft or average attention is counter-intuitive; average-hard and soft attention being essentially identical is also counter-intuitive.
We agree this may seem counterintuitive. However, there is empirical evidence supporting these theoretical findings:
- On UHA, Selim et al. (2025) show that it is specifically rightmost UHA that is more expressive than soft attention. This aligns with prior empirical efforts to inject recency bias, where the rightmost token is preferred in case of tied scores, using special attention mechanisms (Alon et al., 2024) or relative positional encodings (Su et al., 2021).
- On AHA vs. SA, our preliminary experiments show that if a transformer is trained with SA and annealing, and then tested with AHA, performance often matches or exceeds that of models trained and tested with SA. This supports the theoretical equivalence between SA and AHA.
... it is possible -- in principle -- that your trained transformer models... need to run on larger hardware / higher precision chips in order to solve inputs of size 500.
In fact, if you take any classic textbook algorithm and implement it on a fixed-precision hardware, it will likely fail... Should we then conclude that the textbook algorithm... is not solving the problem?
Our theory shows that for any language definable in LTL[past], we can construct a transformer (the "textbook algorithm") that computes the correct classification for every input length, under fixed precision. In contrast, no such transformer exists for languages outside this class. Our work thus identifies an exact boundary of what can be reliably encoded and generalized within a fixed-precision transformer, distinct from cases where success depends on increasing precision.
That said, we fully acknowledge that different implementations, precision, or hardware architectures may yield different empirical outcomes. Our results do not preclude the possibility that one could train a transformer to generalize up to length 500. Nevertheless, we find it encouraging that models trained under the standard framework and hyperparameters from Deletang et al. (2023) and Butoi et al. (2025), without any special tuning, task-specific adaptation, or architectural modifications, exhibit generalization behavior that aligns closely with our theoretical predictions.
I don't particularly see an issue with ... the model could have access to higher precision arithmetic, or something that scales with input size.
Models are trained and deployed under fixed precision, and it is infeasible in practice to dynamically increase numerical precision at inference time when longer inputs are encountered. That said, we agree that logarithmic precision, or any growing precision, is a worthwhile setting to explore. It offers insights into what transformers can recognize when input lengths are constrained, which is often the case in real-world applications. However, interpreting log-precision results in this way implicitly relies on the uniformity assumption that a single set of parameters works for all input lengths below a certain bound. To our knowledge, these questions remain under-explored and represent an exciting direction for future research.
We thank the reviewer again for the helpful suggestions and positive evaluation. We will revise the introduction, clarify our assumptions, expand the discussion on residuals and normalization, and incorporate further context to strengthen the positioning of our results. We appreciate the opportunity to improve the paper and clarify its contributions, and we are encouraged by the reviewer’s support for publication.
References
- Selim et al. (2025). Unique Hard Attention: A Tale of Two Sides.
- Alon et al. (2024). Scaling Stick-Breaking Attention: An Efficient Implementation and In-depth Study.
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding.
- Deletang et al. (2023). Neural Networks and the Chomsky Hierarchy.
- Butoi et al. (2025). Training Neural Networks as Recognizers of Formal Languages.
Thank you for your detailed, thoughtful reply, and for your willingness to adjust the wording and expand upon certain aspects to make the paper's content and positioning more clear.
As I noted earlier, I am happy to see the paper accepted. I will retain my positive "accept" rating and I look forward to reading the revised version once published!
Scientific claims and findings: This paper provides a precise theoretical characterization of the expressive power of transformer language models under a specific formalization: fixed numerical precision, soft attention, and strict future masking. The authors demonstrate that this class of models is exactly as expressive as a fragment of linear temporal logic containing only the "past" operators (LTL[past]). This central result is established by connecting the model to other well-known classes in formal language theory, automata theory, and a two-variable fragment of first-order logic (PFO^2[<]).
Strengths: The paper's primary strengths are its technical depth and clarity. It delivers a novel and exact characterization for a transformer model that is a step closer to practical implementations compared to prior work that often relied on idealized unique hard attention. The theoretical findings are well-supported by empirical experiments on formal languages, which show that models achieve perfect length generalization on tasks within the identified class but fail on those outside of it. This is not exactly predicted by the expressiveness theory, but provides interesting new insights and perspectives on the length generalization ability of transformers.
Weakness: The main weakness, and the source of significant debate, is the "fixed-precision" assumption. Some reviewers argued that this formalization is too restrictive and does not adequately capture the capabilities of real-world softmax transformers, which can perform operations like averaging that require precision that scales with input length. The authors' model, under this assumption, cannot solve problems like Approximate Majority. Other limitations noted include a partial treatment of components like positional encodings and layer normalization in the main text.
Reason for Accept (Spotlight): This paper presents a fundamental and elegant theoretical result that connects a widely used neural architecture, the constant-precision softmax transformer with causal masking, to a classic concept in logic and formal languages. The work is technically significant and addresses a key question in the theory of language models. The tight characterization and the clear link between theory and empirical generalization behavior make it a standout contribution.