PaperHub
Overall rating: 3.8 / 10 · Rejected · 4 reviewers
Individual ratings: 5, 6, 1, 3 (min 1, max 6, std dev 1.9)
Confidence: 3.5 · Soundness: 1.5 · Contribution: 2.3 · Presentation: 1.8
ICLR 2025

Transformers are Efficient Compilers, Provably

Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

We prove transformers can efficiently act as compilers.

Abstract

Keywords
Transformers, Expressive Power, Programming Language, Attention Mechanism, Compiler

Reviews & Discussion

Review (Rating: 5)

This paper investigates the theoretical capabilities of transformers in performing compilation tasks, specifically focusing on three key aspects:

  • Abstract Syntax Tree (AST) construction
  • Symbol resolution
  • Type analysis

They introduce a simplified Rust-like language called Mini-Husky as the programming language for formal analysis. They prove that encoder-only transformers can perform these three tasks via a constructive proof of the properties of certain transformer components; these components of known expressive power can then be composed to represent the model. They construct Cybertron, a type-checking compiler that accepts Mini-Husky input and generates type error information. They also prove that the expressive power of transformers surpasses that of RNNs on the type checking subtask.

Strengths

This paper is quite novel in using expressive power theory to investigate transformers on compilation tasks, and the authors map the expressive power of transformers onto the type system of Cybertron. The strengths are listed as follows:

  • Novel Theoretical Analysis of Expressive Power: The paper provides the first formal analysis of transformers' capabilities as compilers, using expressive power theory to investigate their performance on compilation tasks. This moves beyond empirical observations to establish rigorous theoretical foundations.
  • Constructive Proof Methodology: The authors introduce Cybertron, a domain-specific language (DSL) as an innovative proof vehicle. This allows them to constructively prove transformers can efficiently handle AST construction, symbol resolution, and type analysis, bridging the gap between low-level transformer operations and high-level compiler tasks.
  • Strong Theoretical Results: The paper proves that under bounded conditions (AST depth and type inference depth), transformers require only logarithmic parameter scaling with input length for several compilation tasks. This provides strong theoretical support for using transformers in compilation, which will be a strong theoretical foundation for existing empirical research findings including but not limited to neural code translators, compilers, and decompilers.
  • Theoretical and Empirical Separation from RNNs: The authors establish a clear exponential separation between transformers and RNNs specifically in type checking tasks, proving RNNs require linear parameter scaling while transformers need only logarithmic scaling. This theoretical finding is supported by experimental results showing transformers outperform RNNs in type checking tasks, providing practical validation of the framework.
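
For concreteness, the separation described in the last two bullets can be summarized as follows (my paraphrase of the claims, with L the input length; the precise dependence on the bounded AST depth and type inference depth is spelled out in the paper):

    % claimed parameter scaling, paraphrased from the paper's statements
    \#\mathrm{params}_{\text{Transformer}}(L) = O(\log L) \quad \text{(AST construction, symbol resolution, type analysis)}
    \#\mathrm{params}_{\text{RNN}}(L) = \Omega(L) \quad \text{(type checking, lower bound)}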

Weaknesses

Major Theoretical Weaknesses:

  • Limited Illustration of Cybertron's Proof Mechanism: The main body of the paper does not clearly explain how Cybertron serves to prove that transformers can perform these subtasks. The examples in the appendix are not intuitive, as they are too implementation-specific. The paper needs revisions to its writing to explain how Cybertron is used to prove the capabilities of transformer models. For example, we need at least some detailed minimal end-to-end examples in Mini-Husky for basic statement-level and block-level constructs. I think at least a basic statement-level example, like "a = b + c * d", and a block-level example, like "if(){}", will be needed (a sketch of what such examples might look like appears after this list).
  • Language Design and Classification Issues: Mini-Husky is more accurately characterized as a Rust-like language rather than C-like. The choice of Rust-like features may be beneficial for type analysis but should be explicitly justified. The language classification needs to be more precise and well-justified.
  • Limited Semantic Generalizability: From a traditional compiler development view, the semantics cannot be fully presented through BNF grammars, as many language-specific features are configurable and implementation-dependent. For example, the authors limit Mini-Husky to avoid any type conversion for their type system's convenience; however, this also limits the language's generality. For a Mini-Husky-2 that allows type conversion, the proof generated by Cybertron is not valid, since we would need a Cybertron-2 to validate it. I think the authors should clarify this key challenge: semantics is different from syntax/grammar, which is why we need language manuals in addition to BNF grammars to define a programming language. The current presentation seems to overclaim the capability in semantic handling, as the type system in Cybertron is implementation-dependent on Mini-Husky and therefore not generalizable. The current Cybertron is thus more like a partial compiler (not currently generating assembly or IR output) with type checking tailored to Mini-Husky rather than an extensible DSL, which makes me confused about its effectiveness on other programming languages with different semantics. The authors are encouraged to clarify which parts are common across programming languages, such that Mini-Husky + Cybertron suffices to prove transformers capable of handling them, and which are not and would need extra development to prove.
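
To illustrate the kind of minimal end-to-end examples requested in the first bullet above, hypothetical Mini-Husky-style snippets might look like the following (the surface syntax is my guess from the paper's Rust-like description, not code taken from the paper):

    // Statement level: AST construction must recover the precedence b + (c * d),
    // symbol resolution must bind b, c, and d to the function parameters,
    // and type analysis must infer that a has type i32.
    fn statement_level(b: i32, c: i32, d: i32) -> i32 {
        let a = b + c * d;
        a
    }

    // Block level: the `if` introduces nested scopes, and type checking must
    // verify that both branches produce the same type i32.
    fn block_level(x: i32) -> i32 {
        if x > 0 { x } else { 0 - x }
    }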

Empirical and Evaluation Weaknesses:

  • Limited Experimental Validation: Empirical validation focuses only on the type checking comparison with RNNs. Evaluation results for AST construction, symbol resolution, and comprehensive type analysis are missing. The lack of practical validation weakens Mini-Husky's contributions.

Technical Writing Issues

  • Non-standard ICLR template usage: The template used is not the latest ICLR template, as the paper's formatting differs significantly from that of other papers.
  • Various writing errors: redundant sentences (lines 079-081), incorrect indentation (line 049: "Taelin (2023b)demonstrates", lines 277-280), incomplete words (line 053: "such as (missing JavaScript)(Flanagan, 2011) and Rust (Klabnik & Nichols, 2023)", line 094: "(T)o our knowledge").
  • Overly long appendix: The appendix contains too many code snippets, many of which are direct copies of code, which is inappropriate for a premier conference like ICLR. Simplifications and abstractions are needed. The appendix is currently too long for any reader, including reviewers, to get through.

Questions

Technical questions:

Architecture Choice: Why focus only on encoder-only transformers? What are the theoretical implications if encoder-decoder or decoder-only architectures were considered? Would they enable additional compiler capabilities? I think theoretical proofs for them would be more meaningful, since most existing related work, such as LLMs, uses encoder-decoder or decoder-only architectures.

Compiler Output: The current work focuses on semantic analysis rather than full compilation:

  • Could the framework be extended to generate assembly or IR code output?
  • Is the limitation to semantic analysis related to the choice of encoder-only models?

Is Cybertron a DSL or a compiler? From the code in the supplementary materials and the appendix, I find Cybertron to be more like a compiler with a type system tailored to Mini-Husky than a DSL. Investigating the Cybertron code leaves me confused, as there is no clear relation to transformers, the main subject of study in this paper. Could the authors explain this?

Additional Compiler Tasks: Beyond the three investigated tasks (AST construction, symbol resolution, and type analysis), how might the framework extend to other crucial compiler components, for example:

  • Translation correctness: reasoning about customized structs (size, offset, alignment), stack space allocation, instruction binding, etc.?
  • Optimizations: can transformers perform register allocation, dead code elimination, constant propagation, etc.?
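
To make concrete what the last item would require, constant propagation must recognize that a variable is bound to a compile-time constant and substitute (and fold) it. A hypothetical Rust-like sketch, not taken from the paper:

    // Before the pass: y depends only on the constant bound to x.
    fn before() -> i32 {
        let x = 4;
        let y = x * 2;   // the pass must see that x is the constant 4, so y = 8
        y + x            // evaluates to 12
    }

    // After constant propagation (plus constant folding), the body collapses to a literal.
    fn after() -> i32 {
        12
    }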

My expectations:

This is a strong theoretical paper that makes novel contributions to understanding transformer capabilities in compilation tasks. The integration of programming language theory with transformer analysis is particularly innovative, and the use of type systems to characterize transformer expressive power demonstrates deep expertise in PL and compiler theory and machine learning. However, several aspects of the presentation need significant revision as already illustrated in the weaknesses. I look forward to seeing these weaknesses addressed and questions explained in the rebuttal phase. If most of them get addressed, I would recommend clear acceptance.

Comment

Thanks for your thorough reviews. We have revised the paper.

Mini-Husky is more accurately characterized as a Rust-like language rather than C-like.

We say it is a C-like language because most machine learning researchers are not familiar with Rust. Meanwhile, a part of C's grammar is also covered by Mini-Husky.

Various writing errors:

We have fixed the writing errors.

Comment

These revisions are minor. I hope to see the responses to my other concerns. Currently I would like to retain my score.

Review (Rating: 6)

This paper presents an attempt to characterize the expressiveness of transformers on compilation tasks (concretely, parsing, symbol resolution, and type analysis). For this purpose, the authors introduce Mini-Husky, a representative C-like programming language that is amenable to compilation, and Cybertron, a programming language in which compilation tasks are expressed in terms of computation graphs, which help symbolically represent transformer components.

Both theoretical and empirical results are presented. Theoretically, the authors prove (1) the feasibility of transformers simulating various Cybertron constructs, (2) that transformers require a number of parameters logarithmic in the input length to carry out compilation tasks, and (3) that RNNs require memory linear in the input on the type checking task. Empirically, the authors demonstrate the exponential divide between transformers and RNNs on type checking using synthetic data.

Strengths

While transformer models have been extensively used to compile and generate code, how (well) they proceed with such tasks remains understudied, which is in part due to the significant gap between neural and symbolic methods in terms of the level of abstraction (which is further amplified in the context of compilation, as the authors have identified). Thus, I recognize this paper as an admirable and overall successful shot at the task of characterizing how transformers carry out compilation.

Although the idea of reasoning about neural networks through the lens of symbolic computational models is not new, the work done by the authors to (1) identify suitable symbolic abstractions to interface with components in the transformer architecture and (2) perform formal proofs of their “equivalence” quite significantly outgrows prior work like RASP. This aspect of the paper is also what I find most impressive.

Overall, I find the paper to be original and a meaningful contribution that helps advance our understanding of transformer characteristics on complex reasoning tasks. The quality and clarity of the writing could be improved (see Weaknesses) but for the most part do not pose significant challenges.

Weaknesses

While the paper is overall coherent, I did find some bits to be unintuitive (due to the order in which concepts are introduced) and even obtuse (due to poor prose). For instance, I would suggest modifying/reordering Section 5.4 to first motivate the use of key abstractions (e.g., computation graphs), and then go over what constitutes types in Cybertron, followed by how such types constrain the use of abstractions and ensure the soundness of propositions (e.g., that there exists a transformer encoder capable of symbol resolution). I was unable to form a fully cohesive mental picture connecting these details until after reading this section a few times and referring to the appendices. I understand that the authors are constrained by the page limit, but I would appreciate it if they could smooth out such readability speedbumps. I would personally cut a significant portion of the first paragraph of Section 5.4, as it largely parrots what has been said in the abstract and introduction.

Related to the above point, it would have been better if the authors provided more intuition for the key abstractions (e.g., computation graphs and function compositions) ahead of time in the main text, leading up to the overview and exposition on the type system. I was only able to adequately grasp their roles after going through much of the appendices.

Other than the aforementioned structural issue, I would have to say that the prose of the paper can be dramatically improved. An inexhaustive list of typos I found is included at the end of the Questions section. While most of them do not impede understanding, there are a few more serious ones that do (e.g., line 405, incorrect examples in Appendix E.1). I would appreciate it if the authors could correct them and do one more pass to double check.

For the most part, I appreciate and enjoy the formal approach. But on the topic of soundness, I would have liked to see a precise specification of Cybertron’s type system and its correctness (via properties like progress and preservation), in addition to the properties stating whether specific constructs expressible in Cybertron can be represented using MLPs/transformers. In my opinion, this is a missing ingredient that is important for an approach akin to interactive theorem proving. For instance, such a specification would clarify which of the two type errors on line 9 of the code snippet in Section 4 would actually occur, blocking the other from occurring (or not, depending on the specification). Presently, I have no way of knowing such behaviors from the paper alone. On the same point, the authors could have further formally specified Cybertron’s semantics operationally/denotationally/etc. to be the most precise, and proved it correct, which is important for such a formal reasoning approach. Same applies to Mini-Husky.
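
For reference, the statements of progress and preservation I have in mind are the textbook ones for a small-step semantics with a typing judgment (standard formulations, not claims about Cybertron's actual metatheory):

    % Progress: a well-typed closed expression is either a value or can take a step.
    \text{If } \varnothing \vdash e : \tau, \text{ then } e \text{ is a value or } \exists e'.\; e \longrightarrow e'.
    % Preservation: evaluation preserves types.
    \text{If } \varnothing \vdash e : \tau \text{ and } e \longrightarrow e', \text{ then } \varnothing \vdash e' : \tau.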

Questions

Aside from what has been mentioned in the Weaknesses section, here are some more suggestions and questions:

  • I would personally defer a portion of Section 3, especially the math definitions, to the appendices, and keep the rest that is actually relevant to understanding the rest of the main text.
  • The authors categorize both symbol resolution and type analysis as semantic analysis, when in fact symbol resolution is purely syntactic and type checking/inference is also syntax-driven. While I think I know what the authors may mean by “semantic,” the paper could benefit from a brief discussion on the distinction between a syntactic task and a semantic task.
  • It could have been more precise and to-the-point if the authors referred to the “AST construction” task simply as parsing (while briefly defining it).
  • Regarding the use of “correctly” on line 108, I would use “effectively” instead, because after all there is no absolute guarantee on correctness.
  • When explaining what symbol resolution does, the concept of shadowing would help with explaining the example program (a sketch of what I mean appears after this list). In general, having some formal definitions for the three tasks would be nice, especially for symbol resolution and type analysis.
  • While it is reasonable to assume that programs are written according to the clean code principle, that book is rather controversial and not everyone finds it to be completely reasonable in this day and age.
  • Line 334 and 335: I would appreciate an explanation of what “allocated” and “initialized” mean in this context.
  • Line 435: For readers not very familiar with associative recall, could the authors consider adding a brief explanation of why type checking covers associative recall?
  • Line 503: Regarding “real datasets,” what would they be? I only see synthetic data mentioned.
  • Line 1077: I would say “recurrence” instead of “recursion.”
  • Line 1311-1313: If Cybertron’s semantics and type system had been proven correct, then the authors would not have needed to resort to testing and manually verifying outputs.
  • Line 1586: Consider making notation consistent; I suggest going with d_G (resp. d_v) instead of Depth(G) (resp. Depth(v)) because line 1809 uses it.
  • One thing I am not entirely clear on is how much of what Cybertron’s type system guarantees would actually transfer over to trained transformer models that are architectured according to a correct Cybertron program. The authors claim that “Cybertron allows us to construct transformers with automatic value validity guarantees if the Cybertron code is type-correct.” But how could that be? I would be curious to hear more on this.
  • Would I be correct in saying that the “customized BERT models” used in the evaluation are not constructed according to a Cybertron blueprint? It would be very interesting to actually train models architectured according to (or, even better, compiled from) Cybertron programs and see how well they fare. What are the authors’ thoughts on pursuing further research in this direction? Would external functions (that Cybertron itself cannot reason about) pose challenges?
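
Regarding the shadowing point above, a minimal illustration of what symbol resolution must handle (a hypothetical Rust-like snippet in the spirit of Mini-Husky, not taken from the paper):

    fn shadowing_example() -> i32 {
        let x = 1;            // outer x
        {
            let x = 2;        // inner x shadows the outer binding inside this block
            assert!(x == 2);  // this use of x must resolve to the inner binding
        }
        x                     // here x resolves to the outer binding again, so the function returns 1
    }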

Inexhaustive list of typos:

  • Line 49: add space before “demonstrates”
  • Line 67: use consistent type style for “clean code principle” (either use italics for all its occurrences or not at all)
  • Line 80: remove “Compilers are one of the most challenging programming projects in our era and early C compilers.”
  • Line 94: “T o our knowledge”
  • Line 149: “... helps to prove complicated operations…”
  • Line 208: “... transformers have expanded to include support code analysis and…”
  • Line 214: “... type inference, and type checking”
  • Line 214: “These features make the Mini-Husky compiler compiling Mini-Husky a representative task…”
  • Line 229: “etc., called the forming a token stream, then to be parsed into…”
  • Line 229: Fix “... into a tree like structure representation of the generation process of the input, finally syntactic and semantic analysis is performed on the tree”
  • Line 231: “... we assume the tokenizer has been provided a priori”
  • Line 235: “... the AST abstracts away from surface syntax details…”
  • Line 240: “... so we defer it to…”
  • Line 245: “They are defined in one place somewhere and can be…”
  • Line 246: “... curly bracketed braced scope… we only consider curly bracketed braced scope.”
  • Line 262: make the 1 in “f1” a subscript
  • Line 263: line 4 of the code snippet is a variable def, not a variable use, so variable a is only accessible from line 5
  • Line 264: “... is accessible from line 10…”
  • Line 264: the fourth instance of the variable a is accessible from within its scope – it is just not actually used
  • Line 273: “In general, type is types are essential…”
  • Line 279-280: add period after “(3) Type inference” and remove “(4) type checking”
  • Line 280: “... ensures that the type typed expressions agree…”
  • Line 292: “... the first argument of f expects be is expected to be of type…”
  • Line 300: “... denoted denotes the type TypeError will with a…”
  • Line 313, 315, 367, 442, 1120: “codes code
  • Line 328: “... because a 64-bit computers…”
  • Line 334: “... although they might not have been fully initialized.”
  • Line 340: “Therefore, The the end-to-end process…”
  • Line 340: “... by a transformer of O(1) number of heads, …” and similarly for other such cases involving asymptotic complexities
  • Line 342: “... in in Appendix F.”
  • Line 373: “... smaller than with the context length…”
  • Line 404: fix “What transformers output (possibly in the intermediate layers) is a representation in sequences of vector of sequences of values in these types.”
  • Line 405: “... interpreted in this paper as a discrete subset…”
  • Line 406: “... construct transformers with an automatic value validity guarantees…”
  • Line 416: “... operations that requires require information…”
  • Line 434, 435: “associate associative recall”
  • Line 435: “... the type checking step covers…”
  • Line 442: “... which is typically the case …”
  • Line 472: missing appendix number in “... in Table 2 in Appendix during…”
  • Line 480: “... when both sizes scale up scale up in size, …”
  • Line 950: “... the single-layer feed-forward fully-connected network…”
  • Line 1046: redundant superscript R^dmodel
  • Line 1064: should be Definition 7
  • Line 1083: “Here’s the BNF grammar of…”
  • Line 1277: change to “Thus, using Cybertron one can argue operations more complicated than simple algorithms can be simulated by transformers.”
  • Line 1284-1309: perhaps the authors forgot to replace the placeholder code snippets?
  • Line 1284, 1291, 1296, 1304: capitalize the starting “in”
  • Line 1285-1286: “... certain architecture architectures given that…”
  • Line 1398: “... which translates translate directly to…”
  • Line 1406: “... If we dont’ don't ignore…”
  • Line 1514: “.. we can define the local an option type as”?
  • Line 1606: redundant “for i = 1, …, n”
  • Line 1614: “... Using Multi-Layer Perceptions Perceptrons…”
Comment

We appreciate the reviewer's comments! We have revised the paper.

An inexhaustive list of typos I found is included at the end of the Questions section

We have fixed the typos you mentioned.

Regarding the use of “correctly” on line 108, I would use “effectively” instead, because after all there is no absolute guarantee on correctness.

We have fixed it.

Line 1077: I would say “recurrence” instead of “recursion.”

We have fixed it.

Line 1586: Consider making notation consistent; I suggest going with d_G (resp. d_v) instead of Depth(G) (resp. Depth(v)) because line 1809 uses it.

Thanks! We have fixed it.

Comment

Thank you for making the changes. However, as Reviewer ok9x pointed out, they are minor and do not address the bulk of my and other reviewers' recommendations.

Considering that other reviewers share many of the same concerns as mine, I have decided to adjust my soundness score to 2 and my overall rating to 6. This paper is a commendable effort to scale the formal characterization of transformers, but as also pointed out by other reviewers, it suffers mainly from:

  • The lack of formal specifications of both the surface language and the proof vehicle (semantics and type systems). This should be an integral part of a paper that formally approaches a problem involving programming languages.
  • The lack of intuition for how things work and justification for assumptions made. As evident from the other reviews, there exists a gap between how the authors and reviewers perceive compilers, proofs, etc. in the context of characterizing transformers. Expectations should be set early on in readable prose, instead of later on and in the appendices via implementation details.

I concur with Reviewer UjMc's suggestion of potentially splitting up the paper so more details could be properly elaborated and Reviewer Qzqo's suggestion of potentially considering a PL venue with a greater page limit.

Review (Rating: 1)

This paper introduces "Cybertron", a DSL for representing expressiveness proofs of transformer-based language models, as applied to programming language compilation tasks such as symbol resolution and syntax tree construction. The main claim of this paper is that the number of LLM layers necessary for certain compiler operations as represented in Cybertron is logarithmic in the input sequence length L. The contributions of the paper are a synthetic object language "MiniHusky" for generating valid programs, as well as the aforementioned proof language.

Strengths

The paper is original in attempting to use a "bottom-up" formalism for proving the asymptotic memory requirements of LLMs for program compilation tasks.

Weaknesses

The paper does not convince the reader that its main claims are supported by proof or empirical evidence. Alternatively, the claims are moot since they only apply to a language whose semantic properties can be checked with bounded complexity (MiniHusky).

The exposition of this paper is too intricate, and most of the body of the paper is spent introducing background definitions and not the actual substance of the claims.

  • Theorem 1 is "proven" essentially by repeating the theorem statement. Thms 2 and 3 do not have a recognizable proof. Neither of these are proofs in any useful sense, which is problematic as the paper puts "provably" in its title.
  • What is "Cybertron"? We don't see a compact definition of this proof language, which is unfortunate because it is the main technical device of the paper. Only in the middle of Section 5.4 (line 409) do we get a hint of what Cybertron is, whereas this should be moved up to where it is first mentioned. Appendix E starts with what are supposed to be examples, but they are exactly the same code snippet repeated four times.
  • I do not find support for the claim of exponential separation between RNNs and transformer models wrt model size parameter in the type inference task (claim 3 in the Contributions section). From figure 2 it seems that both model classes saturate in accuracy at comparable sizes.
  • Section 3 the Preliminaries material could be easily moved to the appendices, as it's not used in the body of the paper.
  • It should be stated more clearly that the claims are only relative to the object language introduced by the authors (MiniHusky). This language apparently is designed to have "shallow" ASTs and a bounded complexity of type inference, but neither claim is proven in the body of the paper.
  • The "clean code" heuristic suggested as a justification for MiniHusky is only cited but never formalized. On the other hand, this was a set of folklore programming practices and not grounded in theory since their inception.
  • There is a lot of ad-hoc technical material in the appendices (which by the way does not cite any preexisting literature), but it is very hard to connect it to the main body of the paper. Much of the "proof" content is apparently limited to its Rust implementation, which is unfortunate for those who cannot easily compile and run it in their heads.
  • Figure 2 is hard to read and does not make a clear claim, esp re. the "first 8 points" remark. Similarly, the other figures with experimental results at the end of the appendices are given almost without comment.

Minor comment: it's unclear why the authors choose to repeatedly call out AgdaJS in the beginning.

Questions

Please provide a compact definition of "Cybertron", or alternatively concrete examples of step evaluations in this framework. Put these front and center, e.g. in Section 3. This is, in this reviewer's opinion, essential for substantiating the claims in a bottom-up fashion (i.e., from the complexity of the proof building blocks).

Details of Ethics Concerns

NA

Comment

We sincerely thank the reviewer for the comments and insightful suggestions. We have revised the paper.

Theorem 1 is "proven" essentially by repeating the theorem statement.

In our original draft, we presented proof sketches in the main text and the details in the appendix.

I do not find support for the claim of exponential separation between RNNs and transformer models wrt model size parameter in the type inference task (claim 3 in the Contributions section). From figure 2 it seems that both model classes saturate in accuracy at comparable sizes.

We have modified the explanation of the empirical results in Section 6 to support our claim.

Figure 2 is hard to read and does not make a clear claim, esp re. the "first 8 points" remark. Similarly, the other figures with experimental results at the end of the appendices are given almost without comment.

We have modified the figure based on your suggestions.

Comment

Thank you for the reply, however I do not think the latest edits contribute significantly to improving the paper. The subject matter is complex and I do not think that the theory or experiments presented here do justice to the claims of the abstract. I respectfully suggest the authors refocus this paper or split it into multiple ones for different audiences, because the ML content is currently not up to the standards of ICML.

Comment

By ICML do you mean ICLR?

Expressive power is an important field of machine learning. And we're addressing an important but complicated task.

The core advantage of Cybertron is that we can prove transformers can do complicated things like these. Please clarify "I do not think that the theory or experiments presented here do justice to the claims of the abstract." Does it mean the proof is wrong, missing steps, or just too complex to read? We are sorry for not explaining the subject clearly enough. But sometimes the nature of things is simply too complicated to narrow down.

"The "clean code" heuristic suggested as a justification for MiniHusky is only cited but never formalized. On the other hand, this was a set of folklore programming practices and not grounded in theory since their inception." Please note that in deep learning, assumptions are rarely grounded in theory. The world is simply not ideal enough.

Well, we are almost amused by the difficulty of conveying the key ideas of the paper to a broader audience. However, important things must be done. We will continue improving the presentation, despite the misunderstandings.

Review (Rating: 3)

The submission discusses a set of proofs showing that Transformers are able to perform some of the analysis tasks required during compilation with a bounded number of parameters. To this end, a new programming language and a DSL are discussed (but not defined). Some empirical results show that there are tasks on which Transformers perform better than RNNs.

Strengths

  • The authors attempt a formal analysis of Transformers.

Weaknesses

(1) The paper is fundamentally incomplete: while the authors claim to introduce Mini-Husky as a pared-down programming language, they do not do this at all in the main text of the paper, and even in the extensive appendix, they do not actually provide full definitions of its syntax or semantics.

As an example, note that the BNF in Appendix D does not in any way correspond to the example given five lines below it, which uses keywords such as mut, assignment operators such as +=, field access via ., and assert statements; none of these appear in the BNF. Similarly, no semantics of the language are provided, so it is impossible to verify any proof about its properties. The BNF also does not cover the syntax for type annotations, which is crucial given the paper's focus on typing tasks.
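
For concreteness, a statement in the style the example reportedly uses might look like this (a hypothetical illustration of the constructs in question, not a quote from the paper):

    let mut counter = 0;     // `mut` does not appear in the BNF
    counter += point.x;      // neither does `+=` nor field access via `.`
    assert(counter >= 0);    // nor does an assert statement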

(2) The paper is overclaiming: no attempt is made at showing the suitability of Transformers as a "compiler", since two core phases (code generation and optimization) are not even discussed. Text and helpers such as Fig. 2 seem to be written as if compilers only analyze code but do nothing else. The term "efficient" is used repeatedly without a definition, and it remains unclear what is considered efficient (is it meant as running in polynomial time, or in the less formal, but better understood, sense of being acceptably fast?).

(3) The paper is not well-written, often referring to concepts, identifiers or ideas not introduced. Examples:

  • "the computation process is easily represented in Cybertron" (line 358), when Cybertron is only introduced in the following section
  • L in line 318 (it is meant to be the sequence length, I assume, but the last use of L was in line 196 and refers to the number of layers)
  • "As types are mathematically interpreted in this paper a discrete subset of a vector space" is the only mention of a representation of types in the main text.

(4) The citing of related work is often incorrect. For example, Sect. 3 starts with "The major innovation in the transformer architecture is self-attention", which is surprising given that the Transformer paper ("Attention is all you need") has a related-work section discussing prior uses of self-attention. On the other hand, the contrast to the work most closely related to the submitted paper (RASP, Weiss et al. 2021) is not discussed beyond "Cybertron has a powerful algebraic type system that helps to prove complicated operations beyond simple algorithms" (though that type system is not actually introduced in the paper).

Let me be clear that none of the criticism above is about the actual research, but only about the presentation - I don't feel it's possible to fully grasp the proposed ideas given the paper. To some degree, this is a consequence of the ICLR page limit, and so I would make the following two suggestions to the authors:

(i) Submit to a venue with a longer page limit (e.g., TMLR, or POPL/PLDI/OOPSLA as conferences on the PL side that were open to such contributions in the past; POPL in particular seems appropriate). This will allow for a longer exposition and inclusion of more of the crucial details in the main text.

(ii) Think about the core ideas that make things work, and try to focus the exposition on that. For example, RASP already showed how to express primitive programming constructs as parts of a Transformer, so what is new here? What changes are required to make the proofs work? Think back on early steps of the work: what challenges did you need to overcome? These are the things most interesting for other researchers.

Questions

See above.

Comment

We thank the reviewer for the comments and suggestions!

The paper is overclaiming: no attempt is made at showing the suitability of Transformers as a "compiler". Text and helpers such as Fig. 2 seem to be written as if compilers only analyze code, but do nothing else.

We have modified the draft according to your suggestions. We have discussed the code generation and code optimization phases in our draft. We have also modified Fig. 2 to reduce the confusion.

The term "efficient" is used repeatedly without a definition

"Efficiency" means that these tasks can be conducted by transformers with a number of parameters that scale logarithmically with the input code length.

AC Meta-Review

This paper proposes a novel framework for reasoning about whether transformers can represent standard compiler passes such as type inference. They provide a proof language called Cybertron to help formalize these proofs. They also demonstrate gaps between transformers and RNNs at these tasks. A significant issue with this paper is its clarity and its positioning. If the main contribution is the collection of theorems they prove, then the tasks they are studying seem way too complicated to provide useful insights; using simpler languages and analyses would be equally effective while providing more clarity on the nature of the results. This is reflected in concerns raised by the reviewers about the correctness of the claims, whose proofs are too complex to check, even at a high level. If the main contribution is the proof language, Cybertron, then ICLR is likely not the right venue for this paper; a programming languages or formal methods venue might be more appropriate.

Additional Comments from Reviewer Discussion

The reviewers generally agreed on the limitations of the paper; there was some discussion but it was not substantial enough to address the concerns raised.

Final Decision

Reject