PaperHub

Overall rating: 4.7/10 · Decision: Rejected · 3 reviewers
Individual ratings: 3, 6, 5 (min 3, max 6, std 1.2)
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.0 · Presentation: 2.7
ICLR 2025

Universal length generalization with Turing Programs

Links: OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
length generalization, deep learning

Reviews and Discussion

Review (Rating: 3)

This paper proposes a method that combines an effective positional encoding (Hard-Alibi) with a scratchpad technique to enhance length generalization in Transformers. The authors identify the length-generalization capability of the copying task as a potential underlying factor contributing to this enhancement. Empirically, the paper explores the length generalization properties of the proposed methods through tasks involving addition ($n+n$), multiplication ($n\times 1$, $n\times 3$), $n$-sample SGD, and random Turing machines. Theoretically, it investigates the expressive power of Transformers simulating Turing machines.

Strengths

This work presents a novel approach for improving length generalization by thoughtfully combining two existing methods. The experimental results on random Turing machines effectively demonstrate the proposed methods' applicability to more general tasks.

Weaknesses

  1. Although the paper effectively combines appropriate positional encodings with the scratchpad technique to enhance length generalization, it lacks comparative results against a broader array of positional encoding mechanisms and additional baselines under the same experimental conditions.

  2. The theoretical section appears disconnected from the experimental findings. While the experiments concentrate on length generalization, the theoretical analysis solely illustrates the expressive power of Transformers in simulating Turing machines. The assumptions made are somewhat strong, and the result of $\exp(n)$ steps renders the theorem less impactful.

Questions

  1. Can you provide more experiments with additional baselines to compare length generalization performance under the same experimental setup, including:

    • Various positional encodings with and without the scratchpad.
    • Other prior approaches (e.g., those listed in Appendix B) for addition tasks.
    • Could the methods in Appendix B be applied to multiplication tasks? Would they exhibit non-trivial length generalization abilities?
  2. Is it possible that combining effective positional encoding, the scratchpad technique, and improved data formats could yield even better length generalization for addition or multiplication tasks?

  3. In line 250, what are the reasons for retaining only the most recent 500 tokens during testing? Is this crucial for achieving better length generalization results, considering that the training and testing context lengths are identical in this setup? An extensive ablation study appears essential to assess the impact of retaining versus dropping past tokens.

Comment (Authors)

Thanks for the positive comments about the novelty and effectiveness of our approach. We will respond to the weaknesses here and urge the reviewer to increase their score if we resolve their concerns.

W1: baselines

Response: In the original paper we already included several baselines ablating the positional encoding, as well as a baseline for direct answers without chain of thought. We have now added a new Appendix G with baselines for prior approaches to constructing chains of thought for addition (as suggested in Q1). Our approach works substantially better while also being more general. There is less existing work in the literature on multiplication, so we will focus on these ablations for now. We also again want to emphasize that our approach is substantially more generally applicable than prior work, allowing us to simulate things like gradient descent and general Turing machines.

W2: connection between theory and experiments

Response: Our theoretical results show that a Transformer of some fixed size can simulate Turing Programs up to a length that is exponential in its size, which suggests that “small” Transformers can execute very long programs. This is an expressivity result in the sense that we do not analyze learning and optimization under fixed length: theoretical analysis of optimization in Transformers is not well understood, and very few theoretical works study learning and optimization in Transformers. Note that this expressivity is a necessary (but not sufficient) condition for length generalization to happen. We supplement the theory with experimental results showing that, in practice, when we optimize a real model we are able to find functions that length generalize. Testing in distribution would not suffice, since the model could learn tricks to memorize the behavior rather than the true reasoning underlying the task; that is why we test for length generalization.

Q1: Can you provide more experiments with additional baselines to compare length generalization performance under the same experimental setup?

Response: Thanks for this idea! We have added the experiments you suggested, ablating both the positional encoding and the chain-of-thought style. Results are in Appendix G and can be compared to Figure 3; our Turing Program method is more effective while also being more general.

Q2: Is it possible that combining effective positional encoding, the scratchpad technique, and improved data formats could yield even better length generalization for addition or multiplication tasks?

Response: While we have conducted rigorous ablations, as described above, it is of course possible that better positional encodings or scratchpads could exist. We feel that our paper, which shows the strong performance of Turing Program and HAlibi relative to existing work, is a step in the right direction.

Q3: In line 250, what are the reasons for retaining only the most recent 500 tokens during testing?

Response: This is a good and subtle question. We are focused on generalizing to longer reasoning traces, but this does not require the model to attend to all tokens in the trace. One nice thing about Turing Programs is that they break down a long reasoning trace into atomic steps that fit within the context window. This choice can be seen as a modeling decision about positional encodings to essentially remove dependence on tokens that are too distant, since for these sorts of reasoning traces they are unnecessary. The interesting implication is that, with the right scratchpad construction, even models with a fixed context length can achieve impressive length generalization when applied in a sliding-window fashion.
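
For concreteness, here is a minimal sketch of the sliding-window decoding described above, where only the most recent 500 tokens are kept in the context at each generation step. The 500-token window matches the setup mentioned around line 250; the `model.next_token` interface and other names are hypothetical stand-ins, not the code used in the paper.

```python
# Minimal sketch of sliding-window autoregressive decoding. The 500-token
# window follows the setup described around line 250 of the paper; the
# `model.next_token` interface is a hypothetical stand-in for a causal LM.

WINDOW = 500  # keep only the most recent 500 tokens during testing

def generate(model, prompt_tokens, max_new_tokens, eos_id):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Truncate the context to the most recent WINDOW tokens before each
        # forward pass, so the attention span never exceeds the training
        # context length even for very long reasoning traces.
        context = tokens[-WINDOW:]
        next_token = model.next_token(context)  # e.g., greedy decoding
        tokens.append(next_token)
        if next_token == eos_id:
            break
    return tokens
```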

Comment (Reviewer)

Thanks for your response. However, I remain uncertain about several points in the paper.

Theoretical Results

  1. The contribution of the theoretical results appears limited due to the reliance on strong assumptions, such as the absence of repeated $n$-grams in both inputs and outputs, and the need for an exponential number of simulation steps ($\exp(n)$). For example, in the multiplication task shown in Figure 4, it seems necessary to set $n = O(l)$, where $l$ is the input length, since patterns like * 1 3 5 may appear multiple times during intermediate steps. In such cases, if a RASP language or a Transformer requires $O(\exp(l))$ steps to simulate even elementary tasks, the practical value of the theorem is significantly diminished.
  2. While the paper emphasizes length generalization, the expressiveness results presented in the theoretical analysis have limited relevance to this topic. The theorem primarily addresses expressiveness but provides little insight into the model's ability to generalize across sequence lengths, which is the paper's central focus.

Experimental Setup

  1. The necessity for the model to attend to tokens in the previous trace also depends on the specific task. While some tasks exhibit strong localized properties, others do not and may require long-context reasoning capabilities. The tasks considered in this paper all possess favorable localized properties, raising concerns about the generality of the results to tasks without such properties.
  2. The effects of the sliding window mechanism remain unclear. One key issue is that, by applying sliding windows, the training and test context lengths are manually set to be the same. This seems to diverge from the standard setup for testing length generalization, where the test context length typically exceeds the training context length. Additionally, comparing all methods on the sliding-window Transformer raises fairness concerns, as the baselines may have been originally designed for standard Transformers, not for sliding-window variants.
Review (Rating: 6)

A well-presented work on what is in its present state a toy deep learning problem.

Strengths

  1. Good writing, can be followed with relative ease.
  2. Meticulous contextualization with respect to the previous work.

Weaknesses

  1. While a considerable part of Sections 1 and 2 has been dedicated to justification of the effort and the setting in the context of previous work, one cannot escape the feeling that the overall contribution and subsequent impact of this work are most likely to be minute. The setting of this effort is one where function-calling models already excel with unprecedented precision and at a considerably lower cost. To have a transformer model carry out arithmetic digit-by-digit, albeit in a learnable fashion, is a waste of computational resources, and hence this is not really a good set of experiments to consider. While the work in its present state hints at a promise, it falls short of demonstrating any potential utility. A suggestion for the future direction of this effort would therefore be to attempt to address some other still "basic" but nevertheless useful tasks such as regular text manipulation or automated formal reasoning.
  2. Minor - presentation focus: The main text seems to assume familiarity with recent developments on algorithmic length generalization while dedicating an entire half-page to the well-known concept of Turing machines. I would suggest re-balancing.

Questions

None.

Comment (Authors)

Thanks for the generally positive assessment of the work and the presentation. We respond to the weaknesses below.

  1. We agree that having transformers perform arithmetic is not in itself a practically useful task. We just used it as a simple benchmark problem where it is easy to check correctness and generate problem instances that require reasoning/computation. The nice thing about the Turing program framework is that any problem (including text manipulation or formal reasoning as you suggest) could be solved in this way. Our main goal is to theoretically justify the universality of the approach and then provide some examples of how this can work. We understand that our work is more on the theoretical side, but we hope the ideas could be extended to more practical problems in future work.
  2. Thanks for the note; we were not sure what the background of a modern ML audience would be, so we opted to provide more background, but we can rebalance this.
Comment (Reviewer)

It would appear that some of my comments have been misinterpreted as criticism of the practicality of the work at hand. I would like to ask the authors to revisit them and convince themselves that this was indeed not the case, as the central point of my feedback was my perceived likelihood of the work having little impact.

It is all too easy to dismiss concerns about the paper's utility to the literature with the claim that the work is "more on the theoretical side". I would like to set the record straight here and inform the authors that the text they submitted for review is at a great distance from anything that is considered to be theory work in this field.

To reiterate more explicitly: There are efforts, both of theoretical and practical nature, that address pressing questions or present valuable insights and inventions in the area of foundation models. This effort, albeit nicely presented, is close to being their polar opposite.

To put it simply: not every experiment demands a paper.

I sincerely suggest that you resubmit to a workshop and redirect your attention to somewhere where your contributions, theoretical or practical, might do more good.

Review (Rating: 5)

The authors showed that a (very long) CoT, which slightly changes an input $x_i$ into $x_{i+1}$ up to the point where it solves the problem completely, can achieve state-of-the-art results on arithmetic tasks. The authors showed that a range of problems can be solved with this approach. Finally, the authors showed that a Transformer can indeed implement Turing Programs.

Strengths

S1. It seems this is the first work which achieves non-trivial length generalization on arithmetic problems without using data curation or representation tricks and customized positional embeddings to solve the arithmetic tasks.

S2. Result from Section 5.3 is pretty interesting because of the generalization capability it may bring.

S3. Empirical results on non-trivial tasks show the effectiveness of their method.

Weaknesses

W1. The strongest weakness is Theorem 4.1. The code is shown in Appendix D, but the construction proof is not justified at all.

W2. The authors didn't spend any time explaining why the HAlibi embedding is strongly needed and what practitioners can do to solve similar math problems without using this embedding. In particular, it's not obvious how to escape from the embedding choice and rely only on the multiple-CoT idea.

W3. In general, ideas and theorems are not well motivated and explained. For example, only Figure 2 explains how to create the synthetic dataset for addition, and only Figure 4 for multiplication.

W4. It's not obvious how to extend this idea to fairly similar arithmetic tasks, for example 20-multiplication.

W5. (Minor nit) Some citations are rendered twice, first in black and then immediately after in blue.

Questions

Q1. Did the authors think about other algorithm classes which can be learned with a copying-based CoT?

Q2. Is there any limitation of this idea for longer sequences, i.e., going to a 3x generalization multiplier? Did the authors collect any results there?

Q3. Are there additional tasks the authors can think of to show the full potential of this idea? Maybe some graph-related tasks?

Comment (Authors)

Thanks for the positive comments about the novelty and effectiveness of our approach. We will respond to the weaknesses here and urge the reviewer to increase their score if we resolve their concerns.

W1. The strongest weakness is Theorem 4.1. The code is shown in Appendix D, but the construction proof is not justified at all.

Response: We are not sure that we understand the issue here. RASP provides a way to construct programs that can be directly translated into Transformer weights. So, we give a program that predicts the next token needed to simulate a Turing machine, by copying the tape, making updates, and moving the head according to the rules of the Turing machine, with comments explaining what happens in each step. This program then directly defines the construction. If there are particular steps of the construction that are unclear, we are happy to provide more explanation; just let us know.
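
To make the idea behind the construction more concrete at a high level, here is a minimal plain-Python sketch (illustrative only, not RASP and not the Appendix D program) of one simulation step executed by copying the previous tape and editing only the cell under the head; all names below are hypothetical.

```python
# Illustrative sketch of one step of simulating a Turing machine by copying
# the previous tape and applying a local edit at the head. The RASP program
# in Appendix D realizes the same copy-and-edit idea with attention; the
# data layout and names below are hypothetical.

def turing_step(tape, head, state, delta):
    """delta maps (state, symbol) -> (new_state, new_symbol, move in {-1, +1})."""
    new_state, new_symbol, move = delta[(state, tape[head])]
    new_tape = list(tape)          # copy the whole previous tape
    new_tape[head] = new_symbol    # local edit under the head
    new_head = head + move
    if new_head == len(new_tape):  # grow the tape on demand
        new_tape.append("_")
    return new_tape, max(new_head, 0), new_state
```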

W2. The authors didn't spend any time explaining why the HAlibi embedding is strongly needed and what practitioners can do to solve similar math problems without using this embedding. In particular, it's not obvious how to escape from the embedding choice and rely only on the multiple-CoT idea.

Response: We should have made it clearer why HAlibi is useful here. The HAlibi positional encoding is crucial for length generalization of Turing Programs because it allows the copying task to length generalize. The copying capability of HAlibi was already demonstrated here (https://arxiv.org/abs/2402.01032). Also, when we inspect the mistakes made by other positional encodings with Turing Programs, they are almost always errors in copying the tape content. Thus, if we stick with the Turing Program CoT, HAlibi is very important. It is an interesting direction for future research to improve length generalization of copying with other positional encodings. Perhaps with larger models that are trained with long enough context lengths to fit two consecutive states of the tape, this copying issue is less of a problem.
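
As a rough illustration of the mechanism, here is a sketch based on our reading of the Hard-ALiBi construction in the cited paper (arXiv:2402.01032), where masked heads attend only to a hard local window of recent tokens and the remaining heads attend to the full causal prefix; the specific head configuration below is an assumption chosen purely for illustration.

```python
import torch

def hard_alibi_bias(seq_len, head_windows):
    """Additive attention bias of shape (num_heads, seq_len, seq_len).

    head_windows[h] is the hard local window of head h (how many of the most
    recent tokens it may attend to), or None for an unmasked head that attends
    to the full causal prefix. Sketch only; the windows here are illustrative.
    """
    num_heads = len(head_windows)
    bias = torch.zeros(num_heads, seq_len, seq_len)
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k > q                           # future tokens are always masked
    for h, m in enumerate(head_windows):
        mask = causal.clone()
        if m is not None:
            mask |= (q - k) >= m             # also mask keys older than m tokens
        bias[h][mask] = float("-inf")
    return bias

# Example: four heads with hard local windows, plus two unmasked heads.
bias = hard_alibi_bias(seq_len=16, head_windows=[1, 2, 4, 8, None, None])
```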

W3. In general, ideas and theorems are not well motivated and explained.

Response: We are not exactly certain what the reviewer means by this. But if you can provide specific questions or examples of things that are unclear, we are happy to provide better explanations.

W4. It's not obvious how to extend this idea to fairly similar arithmetic tasks, for example 20-multiplication.

Response: The idea is that any task for which you can write a Turing machine (that is, any computable task) can be worked into this framework (up to the issues with repeated n-grams discussed in the theory).

For example, the current CoT for multiplication should also work for 20-multiplication. To illustrate, here are the CoTs for 1-, 2-, 3-, and 4-multiplication; we just need to follow this pattern to get the data for 20-multiplication. The explanation of these CoTs is given in Figure 4, as you pointed out. For longer sequences, we just have to put more digits in the parentheses that keep track of the carry.

$4324*1|432e*1(04~0,4)|43c*1(02~0,24)|4d*1(03~0,324)|e*1(04~0,4324)|4324
$4324*12|432e*12(048~04,8)|43c*12(024~02,88)|4d*12(036~03,888)|e*12(048~05,1888)|^*12(000~00,51888)|51888.
$4324*123|432e*123(0492~049,2)|43c*123(0246~029,52)|4d*123(0369~039,852)|e*123(0492~053,1852)|^*123(0000~005,31852)|^*123(0000~000,531852)|531852.
$4324*1234|432e*1234(04936~0493,6)|43c*1234(02468~0296,16)|4d*1234(03702~0399,816)|e*1234(04936~0533,5816)|^*1234(00000~0053,35816)|^*1234(00000~0005,335816)|^*1234(00000~0000,5335816)|5335816.
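
More generally, the recipe for turning any Turing machine run into this kind of CoT can be sketched as follows: write out each successive tape state (with the head position and machine state marked inline), separated by a delimiter, until the machine halts. The sketch below is a simplified illustration and does not reproduce the compressed digit markers used in the multiplication traces above.

```python
# Simplified sketch of serializing a Turing machine run into a
# Turing-Program-style CoT: one tape snapshot per step, '|'-separated.
# Illustration only; the actual traces above use a more compact encoding.

def turing_program_cot(tape, delta, state="q0", halt="qH", max_steps=1000):
    head = 0
    steps = []
    for _ in range(max_steps):
        # Snapshot of the tape with the head and machine state marked inline.
        snapshot = "".join(
            f"[{state}:{sym}]" if i == head else sym for i, sym in enumerate(tape)
        )
        steps.append(snapshot)
        if state == halt:
            break
        state, new_sym, move = delta[(state, tape[head])]
        tape[head] = new_sym
        head = max(head + move, 0)
        if head == len(tape):
            tape.append("_")   # grow the tape on demand
    return "|".join(steps)
```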

Q1. Did authors think about other algorithm classes which can be learnt with a copying based CoT?

Response: Any computable algorithm that does not involve repeating n-grams should be doable here.

Q2. Is there any limitation of this idea for longer sequence, i.e. going with a 3x generalization multiplier? Did authors collected some results there?

Response: We don’t have any result that shows a 3x generalization. However, in all the algorithms we tried, the mistakes are due to failures of tape copying: as the CoT becomes long, we need to copy more times, and mistakes become more common. Thus, it is definitely possible to construct toy tasks that require limited copying so that we get longer generalization. For instance, the task of copying a single time can be generalized > 3x as shown here (https://arxiv.org/abs/2402.01032).

Q3. Are there additional tasks the authors think of to show the full potential of this idea? Maybe some graph related tasks?

Response: Yes, there are lots of other interesting tasks that this could be extended to. We haven’t thought deeply about graph related tasks yet, but it is definitely a good thing to explore next. This could even be extended to more practical tasks related to e.g. formal verification or code execution or other computable problems of practical interest.

AC Meta-Review

This paper proposes "Turing Programs", a Chain-of-Thought strategy mimicking Turing Machine computations to improve length generalization in transformers for algorithmic tasks. The authors claim universality, demonstrating generalization on addition, multiplication, and in-context SGD, and providing a theoretical argument via RASP programs.

Strengths: Novel approach, good presentation, strong background and contextualization, and potential for generality.

Weaknesses and what's missing: Limited practical impact (focused on basic arithmetic), insufficient theoretical justification (especially Theorem 4.1), unclear necessity of HAlibi embeddings, lack of exploration of limitations (e.g., longer sequences, repeating n-grams), insufficient experimental validation (missing more complex tasks and ablation studies), and unclear motivation for some ideas.

Reasons for Rejection: Limited impact and practical relevance, insufficient theoretical justification (especially Theorem 4.1), and lack of thorough experimental validation and analysis on practical reasoning tasks. While the idea is interesting and well-presented, these significant weaknesses outweigh the strengths, making the paper unsuitable for ICLR in its current form.

Additional Comments from the Reviewer Discussion

None of the reviewers is in favor of accepting this work. Reviewers acknowledge the novelty but raise concerns about the theoretical justification, the necessity of HAlibi embeddings, the experimental setup, and the limited practical impact. Two reviewers engaged in discussion with the authors but were not convinced by their responses and maintained their initial ratings.

Final Decision

Reject