PaperHub

Score: 6.3/10 · Poster · 4 reviewers
Ratings: 7, 5, 7, 6 (min 5, max 7, std 0.8)
Confidence: 3.3 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 2.8
NeurIPS 2024

How Far Can Transformers Reason? The Globality Barrier and Inductive Scratchpad

OpenReview | PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Can Transformers predict new syllogisms by composing established ones? More generally, what type of targets can be learned by such models from scratch? Recent works show that Transformers can be Turing-complete in terms of expressivity, but this does not address the learnability objective. This paper puts forward the notion of 'globality degree' of a target distribution to capture when weak learning is efficiently achievable by regular Transformers. This measure shows a contrast with the expressivity results of Transformers captured by $TC^0/TC^1$ classes (further studied here), since the globality relates to correlations with the more limited $NC^0$ class. We show here experimentally and theoretically under additional assumptions that distributions with high globality cannot be learned efficiently. In particular, syllogisms cannot be composed on long chains. Further, we develop scratchpad techniques and show that: (i) agnostic scratchpads cannot break the globality barrier, (ii) educated scratchpads can break the globality with intermediate steps, although not all such scratchpads can generalize out-of-distribution (OOD), (iii) a notion of 'inductive scratchpad', that composes the prior information more efficiently, can both break the globality barrier and improve the OOD generalization. In particular, some of our inductive scratchpads can achieve length generalizations of up to $6\times$ for some arithmetic tasks depending on the input formatting.
Keywords
Transformer, reasoning, scratchpad, length generalization

Reviews and Discussion

Review (Rating: 7)

The paper contributes a new metric for measuring the difficulty of a task for a transformer model to learn, called distribution locality. This refers to the amount of an input x that needs to be seen by a model for it to be able to return the corresponding output y. A high locality means that a large amount of the input would need to be made available for the model to correctly predict y, while a low locality means that only some small part of x needs to be given to the model.

This represents a barrier in the kind of tasks that can be learned by transformer models, which the authors call the locality barrier. Above a certain threshold of locality a task becomes impossible for a given transformer to learn.

To overcome the locality barrier, the authors propose various scratchpads as tools that the model can use in getting to its answer. They demonstrate that naive and educated models are insufficient to overcome the locality barrier, but provide a new type of scratchpad called an inductive scratchpad which is able to overcome the locality barrier.

The scratchpads in the paper are essentially structures that the model trains on, which format the output in a way that helps the model arrive at the correct answer. During training the model is given input and output pairs where the output is structured according to a given scratchpad. The output of the model essentially consists only of the scratchpad workings and the answer. Each scratchpad type works by enforcing a certain structure on the steps taken during the working-out process.

The inductive scratchpad works by iteratively applying the same learned rule at each step until an answer is found. The key is that each step is based only on the previous state and the original question. By applying this problem-solving step over and over again, it is possible for the model to learn to solve problems outside of its original training distribution.
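A minimal sketch of this iterative scheme (the function names and the toy task below are illustrative placeholders, not the paper's):

```python
def solve_inductively(question, init_state, step, is_final, max_steps=100):
    """Apply one learned rule repeatedly: each new state depends only on the
    question and the previous state (the Markovian update described above)."""
    state = init_state(question)
    for _ in range(max_steps):
        if is_final(question, state):
            return state
        state = step(question, state)      # same rule applied at every iteration
    return state

# toy stand-in: count down from the number in the question until reaching 0
print(solve_inductively("7", init_state=int, step=lambda q, s: s - 1,
                        is_final=lambda q, s: s == 0))   # -> 0
```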

Strengths

This paper offers many strong theoretical aspects and tackles some of the fundamental problems underlying transformers and cutting edge AI systems.

Originality: The paper defines a useful and coherent metric, distribution locality, for approximating the limits of the learning tasks a model is capable of, and the mathematics underlying this idea seem sound. The idea is an intuitive explanation of the limits of transformers and is useful in predicting future model capabilities.

The idea of the inductive scratchpad, training the model in ways that inherently include a reasoning step, is a novel idea, and seems broadly applicable to many applications of language models.

Quality: The level of thinking displayed in the development of these ideas is very impressive, and shows a strong grasp of the problems at the heart of the field. The solution to the problem is both elegant and convincingly argued, though applications at larger model scales would have been very interesting to see.

Clarity: The method of demonstrating the problem of distribution locality helps to make the problem clear, and the mathematics describing it are very clear and crisp. The tests and results presented are also a clear indication of the efficacy of the solution.

Significance: As mentioned earlier, the problem being approached here is a fairly fundamental one in the domain of AI and of transformers in particular. If the inductive scratchpad technique is able to generalize to other kinds of models, it may have a large impact on the way future transformer models are trained and can help with fundamental issues of memorisation vs generalization in model training.

Weaknesses

The paper has good ideas but the language used in describing the ideas is often slightly awkward.

For instance we have this sentence: “This takes place for instance in the last 2 examples (parity and cycle task) where it appears useful to learn inductive steps when possible.” This sentence is awkwardly constructed because it is not immediately clear what “this” is referring to, and it feels roundabout in the way it makes its point.

An alternative construction could read: “The parity and cycle task examples demonstrate how inductive steps can be useful in solving problems, making them illustrative candidates for the inductive scratchpad.”

There are many such instances in the paper, and I think focusing on clear sentence construction would significantly improve the readability of the paper.

The examples given for the scratchpad outputs are very difficult to parse visually, and use different symbols between tasks, which makes them hard to compare. If the components of the task were clearly broken down per example, or if the same notation were used across examples, they would be much easier to read. Potentially, more human-readable examples could have been provided alongside the technical outputs.

The use of some real world tasks with the different scratchpads would have significantly strengthened the claims of the authors. While the examples given are clear indicators of the efficacy of the inductive scratchpad it is still somewhat difficult to infer that the solution will scale.

It would have been useful to compare the inductive scratchpad to other state of the art capabilities for extending the reasoning of transformer models to give a better sense of the scope of the problem and the impact of the solution.

The paper also doesn’t offer clear implementation steps for how the inductive scratchpad might be implemented in other types of problems. The implementation as presented here seems quite tricky to apply to other common transformer applications.

Questions

To what extent would training other transformer models be possible with this approach? For example, could this technique feasibly be used in language models? Some mention is given to various algorithmic tasks, but it is unclear to what extent these techniques are bottlenecked by current methods, and some discussion of how broader application might be achieved, particularly regarding scalability, would help strengthen the paper.

Does this approach generalise to other architectures at all?

How much does this approach impact compute demands if at all?

Can you provide a step by step method for implementing this in other kinds of problems?

Can you provide a more fleshed out discussion of impacts on the broader field and applicability to other problems?

Limitations

The discussion of limitations in the paper is fairly clear: it covers the fact that the approach primarily applies only to transformers, the limits of its generalization, and the fact that applying it to other problems is left as future work. However, implementation is not explicitly discussed, and could be included to improve the discussion.

Author Response

We thank the reviewer for the feedback on the writing of the paper; we will revise the paper accordingly to enhance its readability. In particular, we will use different colors alongside annotations for the inductive scratchpad examples to make reading and parsing them easier.

Further, note that we include the details of the implementation of the inductive scratchpad in Appendix C.2; does the reviewer suggest that we should move this to the main text? This may be challenging with the available space.

Please read the answers to the questions below.

Q. How would the inductive scratchpad scale? What’s the impact on the compute demands?

The inductive scratchpad approach can scale easily to larger models, datasets, and context lengths as it reduces the attention computations and the effective context size. Assume we have states $s_0, \ldots, s_k$ in the inductive scratchpad. The inductive scratchpad enforces $s_i$ to only use the information of $s_{i-1}$. This behavior can be implemented either by masking the attention between $s_i$ and all the previous states except $s_{i-1}$, or by manually trimming the states prior to $s_{i-1}$. As a result, the inductive scratchpad actually reduces the 'effective' context size. Thus, this method can easily scale. Note that due to the reduced effective context length, one can also use less memory during both training and inference. We have already discussed the implementation of the inductive scratchpad and its efficiency in Appendix C.2 and we will further extend it with a focus on scalability.
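A minimal sketch of the masking variant described above, assuming each token is labeled with the index of the state it belongs to (0 for the question); the exact masking and position-reindexing details are in Appendix C.2 of the paper:

```python
import numpy as np

def inductive_attention_mask(segments):
    """Boolean attention mask for an inductive scratchpad.

    segments[t] = 0 for question tokens (before <START>), and i >= 1 for
    tokens of state s_i. A token of state s_i may attend to the question,
    to s_{i-1}, and to s_i itself, on top of the usual causal constraint."""
    seg = np.asarray(segments)
    T = len(seg)
    causal = np.tril(np.ones((T, T), dtype=bool))
    q, k = seg[:, None], seg[None, :]
    allowed = (k == 0) | (k == q) | (k == q - 1)
    return causal & allowed

# toy example: 3 question tokens, then state s_1 (2 tokens) and state s_2 (2 tokens)
print(inductive_attention_mask([0, 0, 0, 1, 1, 2, 2]).astype(int))
```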

Q. Does this approach generalize to other architectures and language models?

Yes, this approach is easily implementable for other variants of Transformers and language models. First, note that the inductive behavior is controlled by two special tokens: the <START> and <STATE> (denoted #) tokens. Thus, for inputs that do not require induction (such as normal text generation) the model would keep its default behavior (as it would not generate <START>/<STATE> tokens). Regarding the implementation, assume that we have states $s_0, \ldots, s_k$. As discussed in Appendix C.2, there are two methods for implementing the inductive scratchpad. The first approach is based on masking the attention between $s_i$ and all previous states except $s_{i-1}$, and reindexing the positions of the tokens of $s_{i-1}$ and $s_i$ as if $s_{i-1}$ had been generated right after the question. This implementation only requires attention masking and reindexing of token positions, which is easily doable in all variants of Transformers (e.g., whether they use absolute or relative positional embedding). The second approach is implemented by manually removing the states prior to $s_{i-1}$ before giving the text to the model when generating $s_i$ during inference, and by splitting the input into pairs of states ($s_{i-1}$#$s_i$) during training. This second method only modifies the input/training data of the Transformer prior to feeding it to the model (for generation this is in the outer loop that calls the model to generate one token at a time) and hence would work with any Transformer model.

Finally, note that our current implementation is for decoder-only Transformers as they are more common nowadays. Nevertheless, encoder-decoder Transformers can be handled in the same fashion. The only difference is that in encoder-decoder models, the question (Q) is already separated from the rest of the tokens, as it sits in the encoder part of the architecture; this makes the implementation slightly easier, as one can consider removing the <START> token.
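A rough sketch of the second, data-level implementation described above; the exact token conventions (spacing around <START>, the # state separator, <EOS>) are assumptions based on the format discussed in this thread:

```python
def training_pairs(question, states):
    """Split a full scratchpad into per-step training examples: the model
    learns to produce s_i from the question and s_{i-1} only."""
    pairs, prev = [], None
    for s in states:
        context = f"{question} <START> " + (f"{prev}#" if prev else "")
        pairs.append((context, f"{s}#"))
        prev = s
    return pairs

def generate_inductively(next_state_fn, question, max_steps=50):
    """Inference with manual trimming: the prompt always contains only the
    question and the most recent state, never the full history."""
    state = None
    for _ in range(max_steps):
        prompt = f"{question} <START> " + (f"{state}#" if state else "")
        state = next_state_fn(prompt)        # model generates the next state
        if state.endswith("<EOS>"):          # model signals the final answer
            break
    return state

# toy usage of the training-side split (targets shown with the # separator)
for ctx, tgt in training_pairs("2+3=?", ["carry 0, digit 5", "answer 5 <EOS>"]):
    print(repr(ctx), "->", repr(tgt))
```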

Q. Can you provide a step by step method for implementing this in other kinds of problems?

As a large class, consider algorithms that have a loop in which some variables/arrays are updated until the answer is computed. To design an inductive scratchpad for such an algorithm, we first put the input data of the question (e.g., a graph) and all other constant, permanent information before the <START> token so that the Transformer attends to this information in all generation steps. Afterward, we define the states of the inductive scratchpad to be all of the variables that are updated in the loop of the algorithm (including the loop counters). Generating the next step/state in the scratchpad then becomes equivalent to one iteration of the loop. For example, for the cycles task, we keep the graph input as the permanent information in the question part before the <START> token. After that, each state keeps track of the node we are visiting in our search; this is equivalent to having a variable that tracks the current node of the search in a for loop. The Transformer also learns the termination condition of this loop, which is reaching the source/query vertex. Similarly, for the parity/addition task (random space method), we keep the input as the permanent part before the <START> token. In the states of the inductive scratchpad, we keep track of the location of the corresponding digit(s) and their values, along with other variables such as the current value of the parity, the carry, and the addition result. This is, again, similar to a for loop that updates these variables in each iteration. Note that the inductive scratchpads for addition use a minor adjustment of the inductive scratchpad format, namely the initial generation of a random answer. This part is added only for the length-generalization experiments, to prevent scenarios in which the Transformer has to generate tokens at positions unseen during training (note that we train our Transformers from scratch with absolute positional embedding). However, we expect one can skip this adjustment when using a pre-trained Transformer with a proper relative positional embedding.
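A small illustrative sketch of this recipe for the cycles task, with the loop variable (the currently visited vertex) as the state; the token format below is hypothetical, not the paper's exact encoding:

```python
def cycle_scratchpad_example(edges, source, query, max_steps=100):
    """Turn one iteration of the search loop into one inductive-scratchpad state.

    The graph (permanent information) is placed before <START>; each state holds
    exactly the loop variable of the underlying algorithm, i.e., the vertex
    currently visited while walking along outgoing edges from `source`."""
    succ = dict(edges)                              # successor of each vertex
    question = " ".join(f"{u}>{v}" for u, v in edges) + f" {source}?{query}"
    states, current = [], source
    for _ in range(max_steps):
        current = succ[current]                     # one loop iteration = one state
        states.append(current)
        if current in (source, query):              # learned termination condition
            break
    answer = "yes" if current == query else "no"
    return question + " <START> " + "# ".join(states) + "# " + answer

print(cycle_scratchpad_example([("a", "b"), ("b", "c"), ("c", "a")], "a", "c"))
```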

Q. Broader application and comparison of the inductive scratchpad?

Please see the global response.

Comment

I thank the authors for the careful responses to my questions. I hope that these can be used to clarify issues that I had questions about in the final version of the paper if it is to be accepted.

Comment

Q. How does the inductive scratchpad compare to SOTA methods?

The length generalization for tasks such as parity and addition has been studied rather extensively recently, and overall the reported improvements remain fairly modest. We note that all of the proposed schemes have some additional conditions on their inputs and/or solution designs that are specific to the tasks (e.g., tailored positional embeddings). Our work trains the Transformers from scratch and uses absolute positional embeddings; however, we also ask for a mild condition on the input: random spaces for parity and addition. Since each work has its own modifications and solution design, it is not trivial to compare performance alone. However, if one wants to compare performance together with a rough description of each approach, it would look like this:

Addition

| Work | Performance | Method and assumptions |
| --- | --- | --- |
| [1] | from 8 digits to 9 digits | Using NoPE (no positional embedding) |
| [2] | from 10 digits to 12 digits | Scratchpad with recursive format |
| [3] | from 40 digits to 50 digits | Reversed output order + 'index hints' (special tokens before each digit); an example looks like a5b4 + a3b7 = b1a9 |
| [4] | from 5 digits to 15 digits | Encoder-only model (no autoregressive generation) + relative positional embedding + padded inputs (input looks like 1 2 <PAD> + 3 9 <PAD>) |
| [5] | from 40 digits to 60 digits | FIRE relative positional embedding + randomized position encodings + reversed output + index hints (as above, input looks like a5b4 + a3b7) |
| Our random space method | from 10 to 18 digits | Random spaces in the input + inductive scratchpad; an input looks like 94_+_3__1= (we use _ instead of space for better readability) |
| Our shift method | from 4 to 26 digits | Random text before each number + inductive scratchpad; an input looks like fs\46+ih\98 |

Parity

| Work | Performance | Method and assumptions |
| --- | --- | --- |
| [1] | from 8 bits to 12 bits | Using NoPE (no positional embedding) |
| [3] | from 30 bits to 50 bits | Scratchpad + 'index hints' (special tokens before each bit in the input); an example looks like a0b0c1d1e0 > +c-d+ |
| [6] | from 8 bits to 20 bits | Pretrained large models (128B) + prompting + fine-tuning + scratchpad |
| Our method | from 30 bits to 55 bits | Random spaces in the input + inductive scratchpad; an input looks like _01_10_0__1_ |

(Reported values are the median over seeds; some seeds do better than others.)

Therefore, our solution requires one of the least stringent input modifications while providing a significant improvement in length generalization compared to prior works.
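For concreteness, a minimal sketch of the random-space formatting used in the rows above (the placement distribution of the spaces is an assumption; the paper's exact sampling may differ):

```python
import random

def random_space_input(symbols, target_len, space="_"):
    """Keep the symbols in their relative order but place them at random
    positions inside a fixed-length window filled with a space token,
    e.g. bits '01100' -> something like '_01_10_0__1_'."""
    assert len(symbols) <= target_len
    positions = sorted(random.sample(range(target_len), len(symbols)))
    out = [space] * target_len
    for pos, sym in zip(positions, symbols):
        out[pos] = sym
    return "".join(out)

print(random_space_input("01100", 12))     # parity-style input
print(random_space_input("94+31=", 10))    # addition-style input
```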

[1] The Impact of Positional Encoding on Length Generalization in Transformers, Kazemnejad et al., 2023
[2] Positional Description Matters for Transformers Arithmetic, Shen et al., 2023
[3] What Algorithms Can Transformers Learn? A Study in Length Generalization, Zhou et al., 2023
[4] Length Generalization in Arithmetic Transformers, Jelassi et al., 2023
[5] Transformers Can Achieve Length Generalization But Not Robustly, Zhou et al., 2024
[6] Exploring Length Generalization in Large Language Models, Anil et al., 2022

Review (Rating: 5)

The paper investigates the conditions for Transformers to learn algorithms with length generalization. The paper proposes a formal definition of “distribution locality” and conjectures that the measure is highly correlated with the capability of Transformers to weakly learn. The (inverse of the) conjecture is theoretically justified in one particular case by showing that a class of graph classification problems with high distribution locality cannot be weakly learned. The paper then shows that scratchpads that reduce distribution locality, and in particular inductive scratchpads, which remove previous intermediate states from the context, can improve length generalization.

Strengths

The paper provides an exact definition of “distribution locality” and states solid conjectures that connect the measure to learnability. If proven, these conjectures could significantly advance our understanding of Transformers trained from scratch. The work also shows the connection between scratchpads/Chain of Thought and the reduction of distribution locality, which could further our understanding of the success of CoT. The work presents a formal proof of an impossibility result for weak learning with Transformers, which to the best of my knowledge is novel in the field, since most prior works focused on limitations on expressivity.

缺点

The theoretical result does not rigorously prove the “negative side” of the conjecture, since it's unclear what properties of the graph classification task make the problem impossible to weakly learn. Further theoretical work is required to show the connection between distribution locality and learnability. The presentation of the paper could be improved to facilitate the clarity and understanding of its core results. In particular, in Section 2, the authors should consider using shortened/less formal definitions and remarks and putting extended versions in separate sections or the appendix. For example, Conjecture 1: can it be shortened using “if and only if” instead of having 2 sentences for “if” and “only if”?

Remark 2 seems loosely relevant to the main results of section 2 or the main flow of the paper. It is recommended to have a separate discussion section for these extensions and/or put them in the supplemental materials

Conjecture 2 is hard to understand on its own and is not explained from a higher-level perspective (i.e., what exactly is an agnostic scratchpad?). It is recommended to provide a definition and a high-level explanation, and to shorten Conjecture 2 using that definition for better clarity. In Theorem 1, it is unclear why the label names a_i, b_i, c_i are necessary. It seems that the label names do not impact the correctness of the theorem, and thus their descriptions can be removed for clarity of the theorem.

Questions

Is the graph classification task possible to be solved by a Transformer regardless of training procedure? In particular, does there exist a set of weights for a Transformer with log-precision, constant depth and heads that can solve the graph classification task for all n? This is quite important since the work is focused on the learnability of Transformers instead of expressiveness, which is greatly studied in prior works, but if the Transformers cannot express the task, then the learnability impossibility result seems trivial.

Additional comments: Appendix B repeatedly refers to $d_\text{emb}$ and seems to use $d_\text{emb}$ to denote the maximum length of the input string (i.e., prompt/question). However, $d_\text{emb}$ typically refers to the embedding dimension (i.e., the number of entries in the embedding vector) of every token. Why is there a relationship between $d_\text{emb}$ and the input length? In Appendix B.2 “Length generalization for addition”, why is “First, we generate a random sequence of tokens with size $d_\text{emb}+2$ such that it begins with '$', e.g., $xgwg6; we call this text ans[0]” useful and necessary?

Does “pointer” denote a separate pointer token for each possible input position? Are “[00]” one single token or 3/4 separate tokens? What is meant by the value of the ith bit "can be retrieved using the pointer"; is this specially processed in the generation process? Why does the pointer retrieval operation not break the distribution locality? It seems that for different pointers the model needs to copy different positions in the input (the pointer position), but distribution locality requires that the positions affecting the label are the same for any possible input. Doesn't retrieving the ith position break distribution locality?

Limitations

The inductive scratchpad requires that each reasoning step is only dependent on one previous reasoning step and the input. However, this is less realistic in more complex reasoning tasks such as mathematical reasoning.

Author Response

We thank the reviewer for the constructive feedback. We address the rest of the remarks and questions below.

Q. Is the graph classification task possible to be solved by a Transformer regardless of training? This is quite important since the work is focused on the learnability of Transformers instead of expressiveness but if the Transformers cannot express the task, then the learnability impossibility seems trivial.

Yes, it can express the task. As mentioned in the general comment, this is not a case where the failure comes from expressivity limitations. In the specific version of Theorem 1, the problem is equivalent to checking whether a product of permutations in $S_3$ is the identity, which is solvable by a constant-depth transformer. The versions with a less restrictive formatting of the input are probably not expressible by a transformer with log-precision, constant depth and polynomial size. However, we do not restrict ourselves to this specific regime, and our definition of T-regular allows an arbitrary polynomial depth. A logarithmic-depth transformer can solve the graph classification task by finding the vertex 2 edges from each vertex, then the vertex 4 edges from each vertex, then the vertex 8 edges from each vertex, and so on, and then combining those to find the vertex $n$ edges from some designated starting vertex and checking whether it is the same vertex.
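A plain-Python sketch of this doubling argument (it is of course not how the Transformer computes it; it only illustrates why logarithmic depth suffices):

```python
def closes_after_n_steps(succ, start, n):
    """Repeatedly square the successor map to obtain succ^(2^j), combine the
    powers given by the binary expansion of n, and check whether `start`
    maps back to itself after exactly n edges."""
    jump = dict(succ)                  # succ^(2^0)
    result = {v: v for v in succ}      # identity = succ^0
    while n:
        if n & 1:
            result = {v: jump[result[v]] for v in result}
        jump = {v: jump[jump[v]] for v in jump}   # succ^(2^j) -> succ^(2^(j+1))
        n >>= 1
    return result[start] == start

# a single 4-cycle: walking 4 edges from 'a' returns to 'a', walking 3 does not
succ = {"a": "b", "b": "c", "c": "d", "d": "a"}
print(closes_after_n_steps(succ, "a", 4))   # True
print(closes_after_n_steps(succ, "a", 3))   # False
```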

Q. Appendix B repeatedly refers to $d_\text{emb}$ and seems to use $d_\text{emb}$ to denote the maximum length of the input string. However, $d_\text{emb}$ typically refers to the embedding dimension of every token. Is there a relationship between the two?

Here we used the embedding dimension $d_\text{emb}$ for the input space, as in “a two-digit number embedded in a dimension $d_\text{emb}=5$ such as _6_1_”. In other words, we used embedding dimension to mean the number of tokens that the input uses given the random spacing. This is completely independent of the embedding dimension of the Transformer architecture. You're right that this is not an ideal notation. We will update this.

Q. In Appendix B.2 “Length generalization for addition”, why is generating an initial random answer ans_0 (starting with the $ token) useful?

We generate a random sequence of tokens of size $d_\text{emb}+2$, starting with the $ token, as ans_0. In each of the steps afterward we shift ans_i one unit to the right (losing the rightmost token) and concatenate the computed digit of the correct answer from the left, i.e., ans_{i+1} = <computed digit of the answer> ShiftRight(ans_i). Note that initializing ans_0 with a random placeholder text ensures that the length of ans_i does not change across generation steps. If we had initialized ans_0 with an empty string, then ans_i would have had i tokens. As a result, during length generalization, the Transformer would have had to write tokens at positions it had never seen during training, which is impossible as we train from scratch and use absolute positional embedding. Hence, the initial random text for ans_0 ensures that the Transformer will not encounter unseen positions during generation. We expect one can avoid this placeholder by using an appropriate relative positional embedding (potentially in addition to pre-training). We simply use $ as the first token of ans_0 so that the final answer can always be easily recognized (it is the number to the left of the $).
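A minimal sketch of this fixed-length answer buffer (the placeholder alphabet and the least-significant-digit-first order are assumptions consistent with the description above):

```python
import random, string

def addition_generation_trace(digits_of_answer, d_emb):
    """ans_0 is a random placeholder of length d_emb + 2 starting with '$';
    at each step the buffer is shifted right by one token and the next
    computed digit is prepended, so its length never changes."""
    ans = "$" + "".join(random.choices(string.ascii_lowercase + string.digits,
                                       k=d_emb + 1))
    trace = [ans]
    for d in digits_of_answer:            # digits assumed computed LSB-first
        ans = d + ans[:-1]                # prepend digit, drop rightmost token
        trace.append(ans)
    return trace                          # final answer = tokens left of '$'

for step in addition_generation_trace("521", d_emb=4):   # answer 125, reversed
    print(step)
```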

Q. Does “pointer” denote a separate pointer token for each possible input position? Are “[00]” one single token or 3/4 separate tokens? What is the meaning of "the value of the ith bit can be retrieved using the pointer"; is this specially processed in the generation process? It seems that for different pointers the model needs to copy different positions in the input (the pointer position); doesn't that result in a high locality?

In this example, “[00]” consists of 4 separate tokens. In the examples of lines 594 and 629, each character is an individual token (other than the special tokens <START> and <EOS>). We emphasize that we do not do any processing for the pointers during generation; everything is learned by the Transformer model itself. By retrieving, we simply meant that the Transformer can learn to generate the token appearing at the specific location indicated by the pointer. We will further clarify this part and the tokenization process in our revision.

Also, note that pointer retrieval is a low-locality operation in the setting that we consider and hence easily learnable. For simplicity, assume that we have 2 pointer tokens pq and $N$ potential tokens $(t_0, \ldots, t_{N-1})$ for retrieving. Denote the output by $Y$. For simplicity, also assume that the probability of retrieving each of these tokens is the same. In this case, the locality is 1. Indeed, it is enough for the model to attend to $t_0$ to get a significant correlation with the output. More precisely, with probability $1/N$ (if the pointer is pointing to the first token), the output $Y$ will be equal to $t_0$. Hence, the mutual information $I(Y; t_0) = \mathrm{poly}(1/N)$ (as we can determine $Y$ in $1/N$ of the cases based on $t_0$). Therefore, retrieving based on the pointer is indeed a low-locality task.
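A quick empirical check of this claim; the alphabet size N and the sample count are arbitrary choices, and the mutual information is estimated naively from counts:

```python
import math, random
from collections import Counter

def empirical_mi(samples):
    """Empirical mutual information I(Y; T0) in nats from (y, t0) samples."""
    n = len(samples)
    joint = Counter(samples)
    py = Counter(y for y, _ in samples)
    pt = Counter(t for _, t in samples)
    return sum(c / n * math.log((c / n) / (py[y] / n * pt[t] / n))
               for (y, t), c in joint.items())

N, samples = 8, []
for _ in range(200_000):
    tokens = [random.randrange(N) for _ in range(N)]   # t_0, ..., t_{N-1}
    ptr = random.randrange(N)                          # uniform pointer
    samples.append((tokens[ptr], tokens[0]))           # (output Y, first token t_0)

print(empirical_mi(samples))   # small but clearly positive: poly(1/N), not 0
```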

Q. How does the inductive scratchpad work for more complex reasoning tasks such as mathematical reasoning?

Please see the global response on the broader applicability with a focus on math.

Q. The theoretical result does not rigorously prove the “negative side” of the conjecture.

Indeed, we do not prove the “negative side of Conjecture 1” and are careful to never make that claim. We prove only a specific case falling under the negative side of Conjecture 1 that relies on this variant of the cycle task. We believe that this variant is interesting in view of prior lower-bound work as it does not follow from any previous papers providing negative results such as [12, 13, 28] (refs in our paper).

Comment

Q. The presentation can be improved. In particular, in section 2, the authors should consider using shortened/less formal definitions and remarks and putting extended versions in separate sections or the appendix. For example: Conjecture 1: Can be shortened using “if and only if” instead of having 2 sentences for “if” and “only if”?

Conjecture 1 is an “if and only if” for constant-size alphabets; we will update this. Please see the global comment at the beginning in that regard.

Q. Remark 2 seems loosely relevant to the main flow of the paper. It is recommended to have a separate discussion section for these extensions and/or put them in the supplemental materials

Thanks for the suggestion, we will revise the structure.

Q. Conjecture 2 is hard to understand on its own and is not explained from a higher-level perspective.

We will break Theorem 1 and Conjecture 2 into a definition and then a conjecture/theorem to enhance readability. In the agnostic scratchpad version of the problem, we are giving the transformer access to a scratchpad but we have no prior knowledge on what the transformer should write in it. So, in order to train it to use the scratchpad well, we consider an entry it writes in the scratchpad to be “correct” if it leads to the transformer giving the right output at the end and “incorrect” if it does not and we define the loss of the scratchpad to be equal to the loss of the output. We can add more explanation to the paper.

Q. In Theorem 1, it's unclear why the label names a_i, b_i, c_i are necessary. It seems that the label names do not impact the correctness of the theorem and thus their descriptions can be removed for clarity of the theorem.

The justification for the vertex names in Theorem 1 is not trivial. In Theorem 1 the input is formatted in such a way that it can be divided into blocks that each specify how a_i, b_i, and c_i are connected to a_{i+1}, b_{i+1}, and c_{i+1} for some i (each corresponding to a permutation in $S_3$). Each block has 6 possible values, and these blocks are independent except for the fact that their overall product is always an even permutation. If we were to change the way we assign labels to vertices so that the set of vertex names mentioned in a given block was not always the same, then that would no longer be the case and the proof would require additional arguments.

Comment

Thank you for your review of this paper. The authors have posted rebuttals. Please could you enter into a discussion with them by replying to their rebuttal of your review. A simple 'Thank you for your rebuttal' acknowledges that you have read their rebuttal, even if you feel that their rebuttal does not require a detailed reply. Of course, a detailed reply is always preferable.

Thanks!

Area Chair

Review (Rating: 7)

In the context of Chain-of-Thought prompting, Transformer models can solve more reasoning problems when recording their intermediate reasoning steps on a 'scratchpad'. This paper attempts to formalize the hardness of reasoning tasks (i.e., the locality barrier), and to explore the limitations of scratchpads in seq2seq (i.e., Transformer) reasoning. Three kinds of scratchpads (agnostic, educated, and inductive) are studied, and their effectiveness in breaking the locality barrier is empirically validated on a synthetic cycle task (against their locality degree).

Strengths

  • Given the emerging need of LLM reasoning, the studied subject is of great interest, both theoretically and practically.
  • This paper is mostly well-written and easy to follow. The mathematical formulation of locality degrees is natural.
  • I especially appreciate the introduction/formulation of the educated and inductive scratchpads (though I am not sure if they are first formulated by the authors or not). Unlike standard agnostic scratchpads, both educated and inductive scratchpads serve as theoretical testbeds and provide practical guidance for building effective LLM reasoning pipelines.

Weaknesses

I don't see any major weakness, except that the formulated inductive scratchpads bear some similarity with a previous design of scratchpads based on dynamic programming (https://openreview.net/pdf?id=qHrADgAdYu). Some comparison and discussion would be useful here.

Questions

  • line 229, 'Parity functions are known to be hard to learn []': I guess a reference is intended but missing here.
  • line 231, '$k \wedge (n-k) = \omega(1)$': I don't quite understand what '$\omega(1)$' means here. The whole sentence could use some elaboration. I believe another intended reference is missing as well.

Limitations

N.A.

Author Response

We thank the reviewer for the constructive feedback. We address the rest of the remarks and questions below.

Q. What is the connection to the “Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective“ paper which considers dynamic programming tasks solved by scratchpad?

Thank you for introducing this paper, we will add it to our literature review. In comparing it with the inductive scratchpad part of our paper, the inductive scratchpad is indeed suitable for learning DP tasks. DP tasks can be implemented using a loop, and the inductive scratchpad can easily learn algorithms with loops. In order to learn such algorithms with loops one has to put all the variables/arrays that are updated in the loop in the state of the inductive scratchpad (including the counters of the loop). In that case, each state generated by the inductive scratchpad becomes similar to one iteration of the loop. The inductive scratchpads that we have provided for the parity, addition, and the cycles task can also be viewed as algorithms with loops.

Nevertheless, we clarify that the mentioned paper does not use induction or any similar idea. Note that in our paper, we make Transformers behave inductively on such reasoning tasks with the introduction of two special tokens, <START> and <STATE> (also denoted #), and dynamic attention masking according to these new tokens. As a result, we obtain strong length generalization performance (e.g., almost double the length for the parity and addition tasks), whereas without the inductive scratchpad idea the length generalization property is weak or absent (e.g., in the DP paper the length generalization is from 15 to 17/18). Further, note that the inductive scratchpad allows Transformers to work on reasoning tasks with longer scratchpads/contexts thanks to the dynamic attention masking. For example, the tasks considered in the DP paper have remarkably shorter scratchpads than the parity and addition tasks in our work and hence could be considered simpler for Transformers.

On the theoretical side, the suggested paper focuses only on the representation power of constant-depth transformers. It shows that there are tasks that cannot be represented by constant-depth Transformers but can be expressed by a constant-depth Transformer with a scratchpad. We note that our paper considers learnability hardness (beyond expressivity). For example, the cycles task can be represented by polynomial-size transformers with logarithmic depth, yet cannot be learned by them, as shown in Theorem 1.

Q. Missing reference on line 229.

Thanks for noticing the typo. The intended reference was [12] (Emmanuel Abbe and Colin Sandon. Polynomial-time universality and limitations of deep learning. Communications on Pure and Applied Mathematics, 76(11):3493–3549, 2023).

Q. Line 231: What does $k \wedge (n-k) = \omega(1)$ mean? Please elaborate on the whole sentence.

Here, by $k \wedge (n-k)$ we meant the $\min\{k, n-k\}$ operation, and by the little omega $\omega(1)$ we meant any quantity that grows at any rate beyond constants (as $n$ scales). In other words, a parity function of degree $k$ is learnable in polynomial time only if either $k$ or $n-k$ is constant. Note that the locality of the parity task is also equal to $\min\{k, n-k\}$ (because of the histogram). Therefore, the constant locality requirement is consistent with prior works on the hardness of parities. Note that in the locality definition, we have access to the histogram of all tokens, and as a result, the full parity $x_1 x_2 \cdots x_n$ has locality $0$. Thus, thanks to the histogram, the hardness of learning $x_1 x_2 \cdots x_k$ (degree $k$ parity) and $x_{k+1} \cdots x_n$ (degree $n-k$ parity) is similar, hence we use $\min\{k, n-k\}$.
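A small numeric check of the $\min\{k, n-k\}$ point: given the histogram (which fixes the full parity), a degree-$k$ parity only requires the $n-k$ complementary bits:

```python
import random

def degree_k_parity(bits, k):
    return sum(bits[:k]) % 2

def via_histogram_and_complement(bits, k):
    """Degree-k parity = (full parity, known from the token histogram) XOR
    (parity of the remaining n-k bits), so only min(k, n-k) positions are
    needed on top of the histogram."""
    full = bits.count(1) % 2                # available from the token histogram
    return (full + sum(bits[k:])) % 2       # look only at the n-k complementary bits

bits = [random.randint(0, 1) for _ in range(16)]
k = 13                                      # n-k = 3 < k, so use the complement
assert degree_k_parity(bits, k) == via_histogram_and_complement(bits, k)
print("degree-k parity recovered from histogram + n-k bits")
```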

Comment

Many thanks for the detailed response!

Review (Rating: 6)

This paper proposes the concept of a "locality barrier" and conjectures that transformers can (weakly) learn to solve a problem only if it doesn't have a locality barrier, i.e., doesn't require global reasoning (as specifically defined by the authors). They prove, in particular, that for the cycle task that they propose, there exists a locality barrier and that transformers struggle to learn it.

These results about learnability of Transformer models nicely complement recent literature on their expressivity.

Lastly, the authors propose the concept of an "inductive scratchpad", which takes the special form Q#state1#state2#... and uses attention masking, etc., such that predicting state5 from the entire prefix looks just like predicting state5 from Q#state4#, forcing models to learn an explicit state representation and to condition only on it in a Markovian way. The paper reports empirical results showing that when trained this way, models generalize better to larger input length.

Strengths

  • The notion of "locality barrier" seems interesting and, to my (somewhat limited) knowledge, novel. The study of learnability in transformers is a good complement to recent studies on transformer expressivity.

  • The idea of "inductive scratchpad" is intuitive. (I assume it is novel, deferring to other reviewers on novelty.) Using masking to force models to pay attention only to the current state (and the full input) seems effective for certain problems, and experiments confirm this.

  • The intuitive claims / conjectures in the paper are supported by sufficient focused experiments.

Weaknesses

  • Intuitively, I don't quite see why global reasoning (as suggested by the proposed locality barrier) should, in fact, be hard for transformers. After all, full attention gives them access to the entire input and entire sequence of past states, in the case of a scratchpad. In fact, having "all to all attention" is one of the key distinguishing strengths of a Transformer model, compared to, say, RNNs. Could the authors clarify why global reasoning seems to be a fundamental hurdle? (Note that some prior papers have noted sequentiality rather than global reasoning as a big hurdle.)

  • While the paper tries to be formal with theorems and conjectures, the way it is written is somewhat informal. E.g., Conjecture 1 mentions phrases like "inverse polynomial edge" without full clarification (as far as I saw); Theorem 1 and Conjecture 2 are written in a very lengthy and verbose way (it would be more standard to define concepts in the text, and then have a succinct theorem/conjecture about them).

  • The paper could use some more intuition. E.g., the remark right after Definition 2 (distribution locality) tries to explain what the definition is trying to capture. However, I still struggled to understand it fully. E.g., why exactly the mutual information is targeted to be $n^{-O(1)}$, what the role of the histogram of tokens is, etc., could be better clarified. Lemma 1 is an important illustration of the notion of locality, but is stated with only a brief explanation. A sketch of the proof from the appendix would be valuable. In general, wherever proofs are deferred to the appendix, a mention of this in the main paper (ideally with a brief proof sketch) would be useful for the readers.

  • While the notion of an inductive scratchpad is nicely general, the specifics of the state are highly dependent on the problem at hand. The authors showed that if the notion of state is appropriately chosen, models are able to learn (and generalize) on the few considered tasks. What remains unclear is whether appropriate notions of state can be learned for new tasks and whether this concept works well in the popular "in-context learning" setting.

MINOR (typos etc.):

  • line 12: "breaks the locality barrier" instead of "breaks the locality"

  • line 45: did you mean test accuracy of more than 80% or test error?

  • line 85: [16] seems to be about chain of thought and relation to Turing machine based computation classes. Here and in some other places, did you mean the paper titled The Parallelism Tradeoff: Limitations of Log-Precision Transformers by the same authors? That work showed TC^0 as the bound, not TC^1, though the generalization you mention in line 171 is accurate.

  • line 193: I'm not sure why a_i etc should be uniform random. Shouldn't the vertex be labeled a_i (and not b_i) if it's at distance i from a_0?

  • line 307: did you mean "as if the input was"? (rather than "as the input was")

Questions

Please see notes about clarifications in the weaknesses section above.

Limitations

Yes

Author Response

We thank the reviewer for the constructive feedback. We address the rest of the remarks and questions below.

Q. Intuitively, why is global reasoning hard for transformers?

Please see first the global response. For more intuition: we train transformers using a gradient descent algorithm, which means we need to have a significant gradient in order to make progress. I.e., we need the target function to have a nontrivial correlation to the derivative of the function the transformer is currently computing with respect to at least one of its weights. If there is a constant-sized set of inputs that can be used to predict the target function with nontrivial accuracy then that means there is a low-degree polynomial in the inputs that is correlated with the target function, and we would expect at least some of the derivatives to be significantly correlated with it. However, if there is no such set of inputs then that means that no low-degree polynomial is significantly correlated with the target function. There are a superpolynomial number of uncorrelated high-degree polynomials, so the derivatives of the transformer with respect to its weights cannot be nonnegligibly correlated with a nonnegligible fraction of them. So, we will tend to be unable to learn the target function in that case.

Q. Terms like inverse polynomial edge in Conjecture 1 can be further clarified. Theorem 1 and Conjecture 2 are written in a verbose way.

Please see the general comments. We can split Theorem 1 and Conjecture 2 into a definition part first and then the theorem/conjecture statement to enhance the readability.

Q. The paper can use some more intuition. E.g., why is the mutual information targeted to be $n^{-O(1)}$? What's the role of the histogram? Lemma 1 can be further explained. A sketch of the proof from the appendix would be valuable.

We can certainly add more intuition, deferring other parts to the appendix. Please see the general comment and the previous question's answer as a starting point. Regarding the intuition for Lemma 1: if you see only $n-1$ edges in this task, then those $n-1$ edges do not tell you anything useful, because you are missing the one edge needed to conclude whether you in fact have a short $n$-long cycle closing, or instead a longer path in the $2n$-long cycle that does not close. Because of the uniform distribution on how the vertices are labeled, the two scenarios are equally likely, and you thus cannot infer any useful information. This means that you need at least $n$ edges to get at best a non-zero correlation, so the locality is at least $n$. See further Appendix E.

The role of the histogram is also natural for the Transformer architecture. To see this, consider removing the positional embeddings; the remaining architecture would see the input as a set, using only the number of times each token appears. This effect can also be observed if, for example, the positional embeddings are initialized small enough. As a result, Transformers can easily access the token-count (histogram) information of the input. Here, the histogram is similar to the bag of words in natural language processing. We will add more details on this.
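A tiny numeric illustration of this point: a single self-attention layer without positional embeddings (and without a causal mask, for simplicity), followed by mean pooling, is permutation invariant, i.e., it effectively sees only the multiset/histogram of token embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 6
E = rng.normal(size=(T, d))              # token embeddings, no positional encoding
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def pooled_attention(X):
    scores = (X @ Wq) @ (X @ Wk).T / np.sqrt(d)
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)    # row-wise softmax
    return (A @ (X @ Wv)).mean(axis=0)   # mean-pool over positions

perm = rng.permutation(T)
print(np.allclose(pooled_attention(E), pooled_attention(E[perm])))   # True
```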

Also, note that there is a sketch of proof at the beginning of Appendix D.

Q. While the notion of an inductive scratchpad is nicely general, the specifics of the state are highly dependent on the problem at hand. How can other inductive tasks be learned, in particular using "in-context learning".

Please see our general response on the broader applicability of the inductive scratchpad. In-context learning of new inductive tasks is indeed an interesting future research direction. We can imagine two settings for in-context learning of inductive tasks. (1) Having a normal pre-trained model (without the inductive scratchpad implementation or tokens). With the general advancement of LLMs, we think in-context learning of some inductive tasks may be possible. However, the family of such tasks will probably be limited, as the model needs to see tasks similar to the target in question during pre-training. Also, the (OOD) performance of the induction would probably be much more limited than with an architecture that enforces the inductive behavior. As seen in our paper, even training on a standard scratchpad fails at OOD generalization. (2) The second setting is to have a model with the inductive scratchpad (with the special tokens) pre-trained on other inductive tasks. In this setting, in-context learning of new inductive tasks seems more likely; nevertheless, this direction requires new investigations in future work.

Q. Line 85: Did you mean to cite the paper titled The Parallelism Tradeoff: … by the same authors? That work showed TC^0 as the bound, not TC^1, though the generalization you mention in line 171 is accurate.

You are right, we will cite [The Parallelism Tradeoff…] for the case when there is no CoT, and you are also right that it should say $TC^0$ instead of $TC^1$ there.

Q. In line 193, shouldn't the vertex be labeled a_i (and not b_i) if it's at distance i from a_0?

The 3 nodes that are at distance i from a_0, b_0, c_0 are randomly called a_i, b_i, c_i. If we always called the vertex that is i edges from a_0 'a_i', then solving the problem would reduce to checking whether the next vertex after a_{n-1} is a_0, which would be easy to learn. Randomizing the names is necessary to make the task harder. Naming the vertices in a manner that does not specify how far they are from a_0, b_0, or c_0 would probably make it even harder to learn, but it would not work with our current proof technique, so we are sticking with these vertex labels for now. Again, the original cycle task is also expected to be hard enough, but proving this would likely require even more technicalities.

We also thank the reviewer for pointing out the typos; we will fix them.

Author Response (Global)

We thank all reviewers for their constructive comments. We address some of the remarks and questions in the list below.

I. On the “locality/globality” message: this paper puts forward the claim (with both theoretical and experimental results supporting it) that Transformers can efficiently learn (i.e., achieve non-trivial generalization with edge $n^{-O(1)}$ or $O(1)$ in at most polynomial time in the input dimension) if and only if the target at hand has constant distribution locality. The “if and only if” is discussed under some assumptions (e.g., regular distributions that prevent non-trivial learning by simply memorizing the output for a very high probability input, or largely scaling alphabet size). It is important to note that:

  • A constant distribution locality means that the target correlates weakly with a constant number of inputs, not that the target only depends on a constant number of inputs. That the target “weakly correlates” with some inputs means that the correlation is inverse polynomial in these, i.e., $n^{-c}$ for some constant $c$. This covers many functions that depend on many/all inputs, and that are thus “global” in that sense. For instance, the majority function of all the tokens correlates weakly at $1/\sqrt{n}$ with any single input (because there is a $1/\sqrt{n}$ chance that all the other inputs are tied and flipping one makes a difference), so its locality is at most 1, and in fact it is efficiently learnable by Transformers (see the numeric sketch after this list). Moreover, the formal definition of distribution locality allows for a little more than that, because the histogram of all the input tokens is also given in the mutual information. This means for instance that the modular-sum target (i.e., the sum modulo $q$ when tokens are on a field with $q$ elements, or parities when $q=2$), which does not correlate with a single input as opposed to the majority function, is still efficiently learnable because this time the histogram brings down the locality. The intuition for the histogram is that when a function is symmetric (permutation invariant) on the inputs, the model can ignore the positional embeddings and just compute the function from the histogram, which gives a major dimensionality reduction (the histogram knowledge is similar to the bag of words in natural language processing). So indeed, Transformers can learn some tasks that are “global” in the sense of involving many variables. The globality notion that we claim is truly hard for Transformers (non-constant locality) means that no constant number of inputs (on top of the histogram) can even weakly correlate with the target. Examples where this takes place are targets like the modular sum of an unknown but fixed subset of, say, half the inputs, which was shown in prior works to be hard (experimentally or theoretically under technical assumptions, see refs in the paper). With our definition of distribution locality, we can now capture many more target classes than this specific case, such as the cycle task, with a precise, general, and concise definition.
    (Also, we indeed allow all-to-all attention here, as we want the model to be as general as possible.)
  • The failure of efficient learning does not follow from a limitation of the model expressivity: A regular Transformer as defined in our paper can in fact encode the cycle task of Theorem 1. The reason why the model cannot learn efficiently is that it is trained by a descent algorithm with a random initialization.
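A quick numeric check of the majority example in the first bullet above: the correlation of the majority of $n$ uniform $\pm 1$ inputs with any single input scales like $1/\sqrt{n}$ (compared here against $\sqrt{2/(\pi n)}$):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = 100_000
for n in (101, 401, 1601):                          # odd n: no ties in the majority
    X = rng.integers(0, 2, size=(samples, n), dtype=np.int8) * 2 - 1
    maj = np.sign(X.sum(axis=1))                    # majority of all n inputs
    corr = float((maj * X[:, 0]).mean())            # correlation with one fixed input
    print(n, round(corr, 4), "~", round(float(np.sqrt(2 / (np.pi * n))), 4))
```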

II. Broader applicability of the inductive scratchpad idea. We believe that the inductive scratchpad idea, and more generally the idea of dynamically masking the input based on the generated tokens to remove unnecessary information, will be helpful for reasoning tasks beyond the specific algorithmic/arithmetic tasks considered here. For instance, consider mathematical reasoning as raised by a reviewer. When proving a maths result, one typically breaks the argument into multiple steps, aka lemmas, that provide the dynamic flow of the proof. The proof of a new lemma $n$ relies on some of the previous lemmas from $1$ to $n-1$, but ideally the proof of the new lemma can be composed from the previous lemmas without requiring their proofs; otherwise, there exists a more composable proof that does not repeat unnecessary arguments. Thus, we expect an approach that masks the unnecessary information (here, the proofs of the previous lemmas, or unnecessary lemmas which sometimes lead to an intermediate theorem) to be indeed useful for reasoning. In the inductive scratchpad framework, this corresponds to having states that include only the lemmas proven up to the moment, the current objective (the claim that the model is trying to prove), and the steps that the model is taking towards proving the current objective. This should generally allow the model to handle computationally larger solutions/context sizes and to compose better in OOD settings, thus eventually improving the model's performance, as was seen in the considered tasks. Implementing this for more general tasks such as maths nonetheless requires major dataset preparations, which is beyond the scope of the current paper.

III. Formal and informal statements. We had reviewer requests on both sides. We can write very formal and very informal versions of the results. For instance, Conjecture 1 informally reads as “Regular transformers can efficiently weakly learn a well-behaved distribution if and only if the distribution has constant locality”, while the formal version adds all quantifiers: there exist $c_1, c_2 = O_n(1)$ such that in time $n^{c_1}$ the Transformer trained by SGD (details of what this means) learns the target function with accuracy $\geq 1/2 + n^{-c_2}$, etc. Some of this may have to go to the appendix due to space limitations.

Final Decision

The authors propose the concept of a "locality barrier" and conjecture that transformers can (weakly) learn to solve a problem only if it doesn't require global reasoning. The authors propose various scratchpads as tools that the model can use in getting to its answer, including an inductive scratchpad which is able to overcome the locality barrier. The reviewers appreciated the originality, but had some minor concerns about presentation.