Exact Expressive Power of Transformers with Padding
We exactly characterize the expressive power of transformers with padding tokens as $\mathsf{TC}^0$, and we also characterize transformers with looping and padding.
Abstract
Reviews and Discussion
It was observed previously that average-hard-attention transformers with logarithmically many bits of precision can only recognize languages from TC^0 (which is almost obvious: they have a constant number of layers, and the arithmetic operations they employ are doable within TC^0). Moreover, if one uses some simple explicit positional encoding, one can show that such transformers belong to dlogtime-uniform TC^0.
Does this inclusion hold in the other direction? Intuitively, there is a big obstacle -- transformers have linear size while TC^0 circuits can have arbitrary polynomial size. One of the results of the paper is that if we are allowed to add polynomially many ``blank'' tokens initially, we obtain a transformer model exactly equivalent to dlogtime-uniform TC^0.
The paper goes beyond that and asks: what about NC, the class of polylog-depth circuits? Just adding polynomially many blank tokens does not seem enough, since we still have a constant number of layers. The paper suggests allowing a fixed constant block of layers to be repeated polylogarithmically many times. They claim that with this, transformers converge precisely to L-uniform NC.
Strengths and Weaknesses
Characterizing dlogtime-uniform TC^0 via polynomially padded transformers seems valid. This is a solid result but maybe not very surprising and not very exciting technically.
As for the L-uniform NC result, I think there are some technical inaccuracies, but I believe the result should hold (at least, perhaps without the precise equivalence of the hierarchies, just that every language in L-uniform NC can be recognized by a polylog-looped transformer). One non-trivial idea is required here. A naive approach to simulating a polylog-depth circuit by a transformer is to compute each layer of the circuit in a transformer layer. However, in this way we might require polylogarithmically many different transformer layers, and we just want a loop of a block of a fixed constant number of layers. Instead, the paper constructs a transformer that, in a loop, tries to evaluate all gates of the circuit simultaneously, where a gate can be evaluated once all gates that feed into it have already been evaluated. This idea appears in Merrill and Sabharwal 2023a.
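To make this parallel-evaluation idea concrete, here is a minimal Python sketch (hypothetical code, not the paper's actual transformer construction): each pass of a fixed loop body fires every gate whose inputs are already available, so a circuit of depth D is fully evaluated after D passes of the same block.

```python
from typing import Callable

def evaluate_circuit(inputs: dict[str, bool],
                     gates: dict[str, tuple[Callable[..., bool], list[str]]],
                     max_iters: int) -> dict[str, bool]:
    """gates maps a gate name to (boolean function, list of input wire names)."""
    values = dict(inputs)          # wires evaluated so far
    for _ in range(max_iters):     # plays the role of the looped block
        updated = dict(values)
        for name, (fn, args) in gates.items():
            if name not in values and all(a in values for a in args):
                # every "ready" gate fires in the same pass (conceptually in parallel)
                updated[name] = fn(*(values[a] for a in args))
        values = updated
    return values

# Example: (x AND y) OR (NOT x) has depth 2 and is fully evaluated in 2 passes.
out = evaluate_circuit(
    {"x": True, "y": False},
    {"g1": (lambda a, b: a and b, ["x", "y"]),
     "g2": (lambda a: not a, ["x"]),
     "g3": (lambda a, b: a or b, ["g1", "g2"])},
    max_iters=2,
)
assert out["g3"] is False
```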
Some minor remarks: Definition 1, h is the number of attention heads, and k is their index?
Line 126, in the positional encoding here, n is the initial length or padded one?
Line 176, maybe clarify that ``unique hard-attention transformers''
Names of Sections 4 and 5 are misleading because you forget to mention uniformity conditions there.
Questions
How precisely do you define TC^d? If as O(log^d n)-depth threshold circuits, then Definition 11 in the supplementary material does not make sense to me, because any circuit belongs to some sequence of circuits from TC^d (by making the constant in O(log^d n) big enough).
I guess you should, for any constants C, d, define the (C, d)-threshold circuit evaluation problem, where the input circuit has depth at most C log^d n precisely.
Limitations
yes
Formatting Issues
No, but maybe you could simplify the exposition if, instead of stating a super-precise equivalence like Theorem 3, you had stated that L-uniform NC is equivalent to transformers with polylog looping. I am not sure that the exact equivalence in the power of the log is that interesting on its own.
How precisely do you define TC^d? If as O(log^d n)-depth threshold circuits, then Definition 11 in the supplementary material does not make sense to me, because any circuit belongs to some sequence of circuits from TC^d (by making the constant in O(log^d n) big enough). I guess you should, for any constants C, d, define the (C, d)-threshold circuit evaluation problem, where the input circuit has depth at most C log^d n precisely.
Indeed, you're right that C log^d n-depth circuit evaluation is the same as log^d n-depth circuit evaluation because the constant can be folded in. This is why we have taken care to define the wide-TC^d class of circuits and the wide-TC^d circuit evaluation problem in the appendix.
wide-TC^d is defined as the class of circuit families where each family has a fixed constant c such that depth is at most c log^d n and, crucially, size is at least n^c. With this additional constraint, wide-TC^d circuits express the same languages as TC^d, but the evaluation problem can be solved in TC^d (unlike evaluation of unrestricted TC^d, for the reasons you noted). To see why this constraint helps, note that if we increase c to make a wide-TC^d circuit deep, the length of the circuit serialization must increase, which effectively makes the circuit shallow w.r.t. the total input length N of the circuit evaluation problem, upper bounding its effective depth by log^d N up to a fixed constant chosen a priori. This circuit can then be evaluated by a log^d N-depth circuit family (for a fixed constant), which we build for the wide-TC^d circuit evaluation problem.
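To spell out the folding step with a small worked bound (this assumes one concrete reading of the definition above: depth at most c log^d n and size at least n^c for fixed c ≥ 1 and d ≥ 1, with N the length of the circuit serialization):

```latex
% Assumptions: depth <= c \log^d n, size >= n^c, c >= 1, d >= 1, and N >= size.
\[
  \mathrm{depth} \;\le\; c \log^d n \;\le\; c^d \log^d n \;=\; (c \log n)^d \;\le\; (\log N)^d ,
\]
% so, relative to the evaluation problem's input length N, the effective depth is at
% most \log^d N, with the leading constant fixed a priori (here, 1) independently of c.
```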
Thank you for raising this important point. We will clarify this and also make wide-TC^d a numbered definition (after Definition 11), as it's an important concept that addresses the issue you mentioned.
Some minor remarks: Definition 1, h is the number of attention heads, and k is their index?
Yes, we will note this.
Line 176, maybe clarify that ``unique hard-attention transformers''
Will do.
Names of Sections 4 and 5 are misleading because you forget to mention uniformity conditions there.
You are right. We omitted the uniformity qualifier to prevent the section titles from becoming too long, but we agree it would be better to include it and will update in revisions.
Line 126, in the positional encoding here, n is the initial length or padded one?
In discussing this question, we realized that the positional encoding can actually be simplified to 1/i embeddings, and all constructions will go through. This works because 1/n (where n is the initial input length) can be computed first, and then the quantities that previously depended on n can be computed from 1/i and 1/n. This simplifies the conversion from unmasked to masked transformers because 1/i encodings can be directly simulated by a uniform causally masked head. We will update the setup in the paper to use the simpler 1/i embeddings and clarify these details. Thanks for your question, which led us to think about this!
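As a small illustration of the uniform-causal-head trick mentioned above (hypothetical minimal code, not the paper's model): a causally masked head that attends uniformly over its prefix, with only the bos position carrying value 1, outputs exactly 1/i at position i.

```python
import numpy as np

def uniform_causal_head(values: np.ndarray) -> np.ndarray:
    """Uniform attention over the causal prefix at each position."""
    n = len(values)
    out = np.empty(n)
    for i in range(n):
        out[i] = values[: i + 1].mean()   # average of positions 1..i
    return out

seq = np.zeros(6)
seq[0] = 1.0                              # bos contributes 1, all other tokens 0
print(uniform_causal_head(seq))           # [1, 1/2, 1/3, 1/4, 1/5, 1/6]
```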
This work investigates the impact of padding tokens on the expressive power of Transformers. Focusing on averaging-based hard attention and masked pre-norm log-precision Transformers, the authors demonstrate that polynomial padding enables the model to represent the class TC^0. Additionally, they show that combining padding with looping mechanisms further enhances expressivity: O(log^d n) looping enables recognition of the class L-uniform TC^d, while polylogarithmic looping extends this capability to the class NC.
Strengths and Weaknesses
Strengths:
- The work considers practical techniques, i.e., padding and looping, from a theoretical perspective, offering valuable insights into why these methods can be effective.
- It establishes connections between the expressive power of padded Transformers and well-known complexity classes, providing a precise and hierarchical characterization of their representational capacity.
Weaknesses:
- The analysis does not compare the benefits and limitations of padding and looping with other reasoning techniques, such as CoT.
- The theoretical results focus solely on expressivity and do not address the learnability of the proposed constructions.
Questions
- Could the authors provide further discussion on how padding and looping compare to CoT in terms of expressive power (e.g., the differences between the results of this work and those of [1])?
- Could the authors provide any insights or intuitions on the learnability of the proposed constructions?
[1] Merrill, W. and Sabharwal, A., The Expressive Power of Transformers with Chain of Thought. ICLR 2024.
Limitations
N/A
Justification for Final Rating
The authors have satisfactorily addressed my main concerns in their rebuttal. While the learnability limitation remains, this aspect falls outside the paper's core contributions. Given the overall quality of the work and the response, I maintain my positive score.
Formatting Issues
N/A
Thank you for your review! You raise a good point regarding the *comparison of padding and looping with chain of thought*. In the case of looping vs. CoT, it is now known [1][2] that transformers with O(log n) CoT steps remain in TC^0, whereas log-looped transformers can solve some NC^1-complete problems, suggesting an advantage of looping. More generally, the comparison between transformers with t(n) steps of looping vs. t(n) steps of CoT remains open. We will update the paper to discuss this comparison, what is known, as well as the open question.
Regarding learnability, we agree this is very interesting, but is beyond the scope of the current work, which we can acknowledge as a limitation.
[1] https://arxiv.org/abs/2503.03961 [2] https://arxiv.org/abs/2210.10749
Thank you for your response. I appreciate the clarifications, which satisfactorily address my concerns. I will maintain my positive score.
This paper investigates the expressive power of Transformers when augmented with two parallelizable forms of inference-time compute: padding and looping. The authors provide a precise formal characterization of the class of problems solvable by these models. Their first main result is that fixed-depth, polynomially padded Transformers can recognize exactly the complexity class TC^0, a class of highly parallelizable problems. This result resolves an open question, as previous work had only established TC^0 as an upper bound. The second main result extends this analysis to Transformers that dynamically repeat a block of layers. The paper shows that a Transformer with polynomial padding and O(log^d n) looping recognizes exactly the class L-uniform TC^d. This establishes a systematic relationship between the depth of looping and the computational power of the model, showing that these parallelizable mechanisms can expand a Transformer's capabilities to encompass the entire class NC (the class of efficiently parallelizable problems), a significant increase in power over un-augmented models.
Strengths and Weaknesses
Strengths:
- Significant Impact: The paper provides a definitive and exact characterization of the expressive power of padded and looped Transformers. This closes an important gap in the theoretical understanding of Transformers and provides a formal basis for exploring alternatives to sequential reasoning methods like Chain of Thought (CoT).
- Technical Quality and Rigor: The paper is technically flawless. The proofs are carefully constructed, building from the base case of fixed-depth padded Transformers to the more general case of looped models. The authors clearly define their idealized model (AHAT) and leverage its properties to build a rigorous, step-by-step argument. The connection to string logics (FO[M^2]) to circumvent the BIT predicate issue is original and demonstrates a deep understanding of the relevant literature in both machine learning and complexity theory.
- Clarity of Presentation: Despite the technical density of the material, the paper is exceptionally well-written. The introduction clearly motivates the problem and situates the work within the existing literature. The main results are stated cleanly, and the proof sketches provide excellent intuition for the formal constructions. Figure 1 is a nice summary of the paper's contributions and how they relate to known complexity classes.
Weaknesses:
- Idealized Model: The analysis is based on an Averaging-Hard-Attention Transformer (AHAT) with masked pre-norm and O(log n) precision. While this is a common and necessary simplification in theoretical work, it creates a gap between the theoretical model and the soft-attention, fixed-precision Transformers used in practice. The authors acknowledge this, but the practical implications of their findings hinge on the assumption that these results will translate, which is not guaranteed.
- Lack of Empirical Validation: The paper is purely theoretical. While the goal is to establish formal limits of computation, the practical feasibility of the proposed constructions remains an open question. For instance, the polynomial padding required (n^k tokens) could be prohibitively large for even moderate k. The paper would be even stronger if it included a small-scale empirical demonstration or a more in-depth discussion of how these theoretical ideas might be approximated in practice.
- Learnability: The paper focuses on expressive power (what is computable in principle), not on learnability (what can be learned from data via gradient descent). The constructions are intricate and algorithmic, and it is unclear if a standard training procedure could ever converge to the specific weight configurations required.
Questions
- On the Practicality of Padding: The constructions require O(n^k) padding to simulate k variables, which can become very large. Could you comment on the prospects for reducing this padding requirement? For instance, are there more compressed ways to represent the variable assignments, or could a model potentially learn to use a smaller scratchpad more efficiently? Is the n^k dependency a hard theoretical barrier for your proof technique, or do you see some approaches that can tighten this?
- Bridging the Gap to Soft Attention: Could you elaborate on the key challenges in extending your results from the AHAT model to standard soft-attention Transformers? The core of your proof seems to rely on precise retrieval and counting via hard attention. What kind of new techniques or assumptions would be needed to show that soft attention can approximate these operations with sufficient fidelity, especially as sequence length n grows?
- The Role of Uniformity: The paper proves equivalence with L-uniform TC^d. The appendix notes that this implies L-uniform TC^d = NL-uniform TC^d. This is an interesting complexity-theoretic side effect of the analysis. Could you elaborate slightly on why your Transformer construction naturally leads to L or NL uniformity rather than, for example, the stricter FO-uniformity seen in the TC^0 case? Is it because solving the complete problem (e.g., Graph Connectivity) requires the power of L?
- Implications for Architecture Design: Your results provide a compelling theoretical motivation for using padding and looping. Do you have any high-level thoughts on how these findings might inform future Transformer architecture design? For example, would it make sense to design models with dedicated "scratchpad" blocks or explicit looping mechanisms that are more suitable to learning the kinds of algorithms you describe?
Limitations
Yes, the authors have adequately addressed the limitations.
Justification for Final Rating
Rating stays the same. Oral quality paper in my opinion.
The paper is of top quality, all the comments are addressed, and I keep my rating of Strong Accept.
Formatting Issues
None.
Thanks for your review! We are happy to hear that you appreciate the technical quality and clarity of our paper as well as its potential for impact.
We agree that the practical learnability of AHAT constructions (with masked pre-norm) is an interesting open question. We will add some discussion either in the main paper or in a limitations section. The major gap between soft attention and AHAT is indeed the ability to implement hard-attention retrieval. For any fixed maximum context length, it is possible to scale the attention head parameters to arbitrarily approximate AHAT attention, but for unbounded context lengths, it seems difficult (and may well be impossible) to simulate AHAT or hard attention. For this reason, we view AHAT as a mostly mild assumption (suitable for bounded context lengths) that abstracts away issues with soft attention for simulating hard attention over very long contexts.
Regarding the practicality of padding, n^k is indeed a practical barrier if k must be large for some tasks. Small values of k will suffice for some natural tasks: for example, the 3-SUM task considered by Pfau et al. But, in general, there may be tasks in TC^0 that require a large value of k, which would lead to a large memory overhead to store a long context window (the same problem would plague transformers with a long chain of thought). More work is needed to assess the tasks for which small values of k would suffice, or whether other techniques could be developed to reduce the memory overhead of padding -- we will acknowledge this as a limitation.
Regarding your question about L vs. NL uniformity, we have a brief discussion of this in the appendix after the proof of Lemma 4 (that log^d n-depth transformers are in L-uniform TC^d).
Indeed, for d = 1, it can be shown that log n-looped transformers would be in DLOGTIME-uniform TC^1, similar to how you suggest. The idea is that looping a DLOGTIME-uniform TC^0 circuit log n times can be implemented in DLOGTIME-uniform TC^1.
This has the interesting consequence that DLOGTIME-uniform TC^1 = L-uniform TC^1 by our uniformity ceiling lemma (Proposition 2, Appendix C), just like the L-uniform vs. NL-uniform collapse noted in the appendix. In answering your question, we looked closely at the uniformity ceiling argument (Prop. 2) and we think it also goes through for d >= 2 (with the appropriate uniformity class and circuit class), showing that L-uniformity collapses to DLOGTIME-uniformity for the deeper classes as well.
Note that this doesn't go through for d = 0, because L is not known to be a subset of DLOGTIME-uniform TC^0. Thus, the uniformity ceiling lemma cannot be applied.
Thank you for bringing this refinement to our attention! We will add a brief section highlighting these novel results about uniformity and also the novel complete problems we developed to obtain these uniformity collapse results, since these might be of independent interest.
The paper is of top quality, all the comments are addressed, and I keep my rating of Strong Accept.
This paper precisely characterizes the expressive power of transformers with a polynomial number of padding tokens by providing a matching lower bound on their ability to solve problems in TC^0. Additionally, the authors show that looping layers enable padded transformers to be precisely characterized by L-uniform TC^d. Using padding effectively enlarges the width of the transformer, while looping effectively increases the depth of the transformer. While the expressive power of padded and looped transformers remains below the expressive power of chain-of-thought (CoT) transformers, they are more easily parallelizable and thus offer computational advantages in comparison to CoT.
Strengths and Weaknesses
Strengths: The submission provides novel bounds for transformers with padding as well as looping in terms of circuit complexity classes, which helps compare their expressive power to other transformer architectures. Additionally, Lemma 3 introduces the concept of reductions for analyzing the expressive power of transformers, which might be useful on its own to derive further results.
Weaknesses: There are several issues with clarity, which makes it generally difficult to assess the soundness of the submission. While the paper is easy to read, it would benefit from more rigorous definitions and careful proofreading. In several places, the presentation appears imprecise. For example, the same symbol is used inconsistently, sometimes as a formula (without being introduced as such) and elsewhere as a layer norm. Additionally, the use of colloquial language and frequent quotation marks around technical terms contributes to ambiguity. For instance, in line 310, why is 'token' in quotes? Similarly, line 247 mentions a "projection," and line 40 refers to "storage space," both again in quotation marks; in the proof of Lemma 3 in line 354, the phrase "very negative" is vague. Overall, the proofs rely heavily on informal language, which hinders accessibility and undermines precision. Without more rigorous definitions and terminology, the arguments are difficult for me to follow. Lastly, the absence of a dedicated related work section makes it more difficult to situate the contributions within the broader context.
Please find more detailed remarks below:
- The same variable appears many times in seemingly different contexts: in line 111; in Definition 7, 5(b); in Definition 8; in line 209; and in line 343.
- Line 123 in Definition 6: What does universal mean in this context, looped?
- Line 142: "for" instead of "For"
- Line 166: "other" instead of "over"
- First sentence in Definition 7 should be outside of definition environment
- Line 211: What is the bos token?
- Line 216: What does "masked unmasked head" mean?
- The proof of Lemma 2 is difficult to follow:
- Many terms seemingly refer to the same variable: it is a token (lines 240, 245), a variable assignment (line 240), an integer (line 246), and a configuration (line 250).
- Line 233: Should this be Definition 7 instead of Definition 8?
- Could you explain what "Each token corresponds to a specific assignment of all the variables, which we denote " means?
- Lines 261 and 268: Why do we have in line 261 and in line 268?
- Lines 265 and 272: What is ?
Questions
- To my current understanding, Proposition 10.3 in Barrington et al. implies that we can express FO[M] by using formulas from FO[M, BIT], but not the other way around. Could you clarify what "FO[M] ... is equivalent to standard FO[M, BIT]" means in this context?
- In lines 352-357 in the proof of Lemma 3: Why do we check membership in L instead of in L'?
- In Theorem 1 and Theorem 2, do we assume infinitely many padding tokens (given Definition 6), and thus infinite width?
- Could you explain what exactly is meant by the phrase "converge to recognizing NC" in the context of your theoretical results?
- How expensive is padding and looping in practice? How can padding be efficiently parallelized?
Barrington, D. A. M., Immerman, N., & Straubing, H. (1990). On uniformity within NC1. Journal of Computer and System Sciences, 41(3), 274-306.
Limitations
There is no explicit discussion of limitations.
Justification for Final Rating
I did not increase my score since, as also acknowledged by the authors, the paper has some issues with clarity, which makes it difficult for me to judge its overall soundness. However, given the confident review by reviewer FTuT which attests the paper to be "technically flawless", I assume that all the proofs were carefully checked and are correct (except for the discussed mistakes in notation). This, together with the novelty and significance of the theoretical contributions, makes me lean towards acceptance.
Formatting Issues
None
Thank you for your review!
Thanks for your careful reading and feedback on clarity. To respond to some of your points:
- Regarding the overloaded symbol, we will standardize it to refer only to the layer-norm hash. We will change Definition 3 to explicitly mention this sense (we apologize that it seems to have gotten removed at the last minute). We will change the symbol for formulas in Definitions 5 and 8 to a distinct symbol. This will make both aspects of the paper clearer to read.
- In Definition 6, "universal" indeed means "looped". We will change this. By "bos" we mean a beginning-of-sequence symbol $ as defined on page 3. We will update the text to make this clear.
- "Masked unmasked" is a typo and should just be "masked".
Regarding Lemma 2, we will take into account your feedback on clarity, and briefly address here your following question:
Could you explain what "Each token corresponds to a specific assignment of all the variables, which we denote " means?
If a formula has k variables, each ranging from 1 to n, there are n^k possible assignments to that vector of variables. We can think about a single assignment as a tuple in [n]^k, or we can think about it as a number between 1 and n^k. What we mean is that we identify each variable assignment with a specific number. Then, token i will represent variable assignment i.
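A tiny sketch of this identification (hypothetical code; variables are zero-indexed here for simplicity): with k variables over n values, base-n encoding gives a bijection between assignment tuples and the numbers 0, ..., n^k - 1, so padding token i can stand for assignment i.

```python
def index_to_assignment(i: int, k: int, n: int) -> tuple[int, ...]:
    """Decode token index i into the values of the k variables (base-n digits)."""
    digits = []
    for _ in range(k):
        digits.append(i % n)   # value of the next variable
        i //= n
    return tuple(digits)

def assignment_to_index(a: tuple[int, ...], n: int) -> int:
    """Encode a variable assignment back into its token index."""
    return sum(v * n**j for j, v in enumerate(a))

n, k = 5, 3
for i in range(n**k):
    assert assignment_to_index(index_to_assignment(i, k, n), n) == i
```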
We will also incorporate all the other detailed feedback you mentioned regarding clarity and informal language -- thanks for reading carefully! On the use of informal terms, we did that in certain places to make the argument more accessible to a broader audience. However, we see that it can result in a lack of precision, and we will make sure to make the language more precise -- critically in all proofs.
To my current understanding, Proposition 10.3 in Barrington et al. implies that we can express FO[M] by using formulas from FO[M, BIT], but not the other way around…
Your interpretation of Proposition 10.3 is correct, but what we use here is Theorem 10.2, which proves the other direction, namely that FO[M] can capture all of FO[M, BIT]. We will update our reference to the Barrington et al. paper to include the relevant theorem.
In lines 352-357 in the proof of Lemma 3: Why do we check membership in L instead of in L'?
Indeed, we have a minor issue with the notation in this paragraph, where we flipped L and L'. We will correct it to say that we are checking membership in L', which is equivalent to checking membership in L. Thanks for finding this!
In Theorem 1 and Theorem 2, do we assume infinitely many padding tokens (given Definition 6), and thus infinite width?
The number of padding tokens needed in the simulations in Theorems 1, 2, and 3 is naturally a function of the complexity of the underlying TC^0, TC^d, and NC circuits/algorithms. It's not infinite, but rather unbounded a priori. E.g., if a circuit being simulated is equivalent to a formula with k distinct variables, then we need n^k padding tokens (in Lemma 2 and Theorem 1). Since one can define arbitrarily large circuits, their transformer simulation also needs to support correspondingly more (but still polynomially many) padding tokens, i.e., a larger exponent k in n^k. Crucially, for every problem, there is a fixed k that works. The subscript in our notation, representing the union over all k, captures this. We will clarify this, in particular that for simulating any fixed circuit family, we need only a fixed polynomial number of padding tokens (i.e., a fixed exponent k).
Could you explain what exactly is meant by the phrase "converge to recognizing NC" in the context of your theoretical results?
NC can be defined as the union over d of TC^d. With O(log^d n) depth, we showed padded transformers can recognize L-uniform TC^d. So, as d grows, O(log^d n)-depth (i.e., polylog-depth) transformers can recognize any problem in NC. In other words, for any problem in NC, there is some choice of d such that an O(log^d n)-depth padded transformer will recognize it. We will clarify the discussion around this to make it more precise.
How expensive is padding and looping in practice? How can padding be efficiently parallelized?
Padding is efficiently parallelizable in practice: it can be thought of as just growing the input with blank symbols and then making a single call for transformer inference (as opposed to CoT, which requires sequential inference calls to the transformer).
The practical bottleneck it would run into is the length of the context window, as running a transformer with a long context will run into memory limitations. Some problems, like the 3SUM problem considered by Pfau et al., are solvable with a small padding degree k, but if a problem requires a larger value of k, this could definitely become impractical (the same argument holds for CoT, in addition to the sequentiality of CoT). We will clarify this potential limitation in the text.
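A schematic contrast of the two inference patterns (hypothetical `forward` function and token names, not a real API): padding grows the input once and makes a single, fully parallelizable call, while CoT must call the model sequentially and feed each generated token back in.

```python
def run_with_padding(forward, tokens: list[str], pad_degree: int) -> str:
    n = len(tokens)
    padded = tokens + ["<blank>"] * (n**pad_degree - n)   # grow the input to n^k tokens
    return forward(padded)[-1]                            # one parallelizable call

def run_with_cot(forward, tokens: list[str], steps: int) -> str:
    seq = list(tokens)
    for _ in range(steps):                                # inherently sequential
        seq.append(forward(seq)[-1])
    return seq[-1]

# Dummy forward for illustration: returns the non-blank tokens seen so far.
dummy = lambda seq: [t for t in seq if t != "<blank>"] or ["<blank>"]
print(run_with_padding(dummy, ["a", "b"], pad_degree=2))  # 'b' after a single call
print(run_with_cot(dummy, ["a", "b"], steps=3))           # 'b' after three sequential calls
```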
Thank you for the clarifications! I will keep my current score, but lean towards acceptance with the understanding that the notational mistakes and imprecisions will be fixed for the camera-ready version.
Scientific Claims and Findings This paper provides a precise characterization of the expressive power of Transformers augmented with padding and looping, two parallelizable inference-time computation mechanisms. The authors' central claim is that fixed-depth, hard-attention Transformers with a polynomial number of padding tokens can recognize exactly the complexity class TC^0. The paper further demonstrates that by incorporating looping (dynamically repeating a block of layers), the expressive power is precisely extended to the class L-uniform TC^d. Consequently, with polylogarithmic looping, these models can recognize the entire class NC, which represents the class of efficiently parallelizable problems.
Strengths This paper definitively resolves an open question by providing an exact, rather than an upper-bound, characterization of padded Transformers' expressive power, which significantly deepens our understanding of the expressiveness of the transformer architecture. The reviewers also praise the clarity of the writing.
Weaknesses
- The main weakness of the submission is that the finite-precision number system, and thus the main function class considered in the paper, are not defined in the paper. This does not directly hurt the validity of the main result of the paper, that looped transformers with polynomially many padding tokens can simulate circuits. However, without a formal and rigorous definition of the function class, the other direction of the equivalence, that circuits can simulate transformers, is ambiguous. The main issue comes from the imprecision about the O(log n)-precision number system.
First, it is clear that without any approximation, it is hard to simulate the embeddings of the transformer using TC circuits with perfect precision, because the \sqrt operation in layer norm is ill-defined without rounding (rounding is necessary for \sqrt). The average hard attention also allows the transformer to take the average of poly(n) numbers, which drastically increases the precision required for exact simulation. So the natural fix here seems to be to define the forward pass of the transformer as: first compute exactly, and then round to finite precision after every basic operation, such as square root, multiplication, and addition over poly(n) numbers. Though I think in principle a rigorous and relatively simple fix based on "exact computation and then round" exists, there are a few subtle issues that remain, and I would like the authors to be aware of them:
(1). Details of the number system and rounding: how do you represent a rational number? Is it in the form p/q, where p, q are integers bounded by poly(n) (so O(log n) bits), or p/2^x, where p, x are integers with absolute values bounded by poly(n) (the latter is more like floating-point numbers)? Do you always round to the closest representable number, or do you round to any number that is sufficiently close? (But then, to make the output a well-defined deterministic function class, you have to show that the potential non-determinism does not make the output token nondeterministic; in some sense, a margin-based analysis is required.)
(2). The (potential) increased complexity of the proof of the main direction -- transformers simulate circuits -- due to the existence of rounding. Here, careful control of the rounding error and its propagation is needed, especially in the polylog(n)-depth case.
(3). Depending on the representation system the authors choose, the precision actually needed in the forward pass could be much higher than O(log n). For example, suppose rational numbers are stored in the form p/q, where p, q are integers bounded by poly(n) (so O(log n) bits); then the sum of M such numbers could require on the order of M log n bits to represent exactly. When M = poly(n), the proposed mathematical model of transformers might need to work with a much higher precision than O(log n), namely poly(n) bits (see the small numerical check after this list).
- Other weaknesses identified by the reviewers include the reliance on an idealized Transformer model (Averaging-Hard-Attention Transformer, AHAT) as opposed to the more practical version (finite precision, softmax attention), and the lack of empirical validation or discussion of learnability. The theoretical constructions, which require specific weight configurations and potentially a large number (n^k) of padding tokens, may not be achievable through standard gradient-based training or practical for real-world applications.
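A quick numerical illustration of point (3) above (illustrative Python only, not taken from the paper or the reviews): summing many fractions whose numerators and denominators each fit in O(log n) bits can require far more than O(log n) bits to represent exactly, because the reduced denominator can grow like the product of the individual denominators.

```python
from fractions import Fraction
import random

random.seed(0)
n, M = 1024, 200   # numerators/denominators bounded by n; M summands
total = sum(Fraction(random.randint(1, n), random.randint(1, n)) for _ in range(M))
# The exact denominator typically needs hundreds of bits, far above log2(n) = 10.
print(total.denominator.bit_length())
```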
Reasons for Recommendation Overall, the work makes a significant contribution to the theoretical understanding of Transformers. By precisely linking the architecture of padded and looped Transformers to well-established complexity classes (TC^0, TC^d, and NC), it provides a theoretical foundation for exploring parallelizable alternatives to sequential reasoning methods like Chain of Thought. The paper's technical depth and the resolution of an open problem make it stand out as a significant advance in the field, if the above main weakness can be addressed.
The final recommendation for this paper is Accept (poster), conditioned on the authors giving a formal and rigorous definition of the function class and rigorous proofs for both directions of the main results. This is the conclusion that the SAC and AC have reached after a long discussion, and it is based on our impression that a rigorous and relatively simple fix based on "exact computation and then round" exists. It is of course up to the authors to choose their way to fix the current issues, and we expect the authors to withdraw the paper if they cannot make it precise and rigorous. In particular, I recommend that the authors be as explicit as possible about the number system and the forward pass of transformers, and try to avoid using big-O notation in the key definitions of the function class.