Encoder-only Next Token Prediction
Given unconstrained compute, we should use an encoder for next-token prediction rather than a decoder.
Abstract
Reviews and Discussion
This paper challenges the widely accepted view of next-word prediction using the transformer decoder and instead proposes next-word prediction via the encoder, supported by strong motivation and intuition. The paper also provides theoretical results comparing the expressivity of encoder-only and decoder-only architectures, and highlights the benefits empirically in several experiments.
Strengths
- The paper is well-organized and well-written, with clear motivation, theoretical results, and empirical validation.
- The motivation for using the encoder to predict the next word is compelling, especially in scenarios where computing resources and efficiency are not primary concerns. This opens up a challenge and an opportunity for the community to develop a more efficient encoder-only architecture for next-word prediction.
- The theoretical result on the difference in expressivity between the encoder-only architecture and the decoder-only architecture is promising.
- The paper demonstrates the benefits of an encoder-only architecture across various experimental setups, yielding promising results.
Weaknesses
- The main weakness is that the encoder-only architecture requires O(n^2) time to generate a sequence during inference and needs to calculate the attention matrix n times during training. In contrast, the decoder-only architecture only requires O(n^2) time during inference and can calculate the attention matrix once during training. This makes the proposed method impractical for real-world language modelling applications due to the high complexity, which is not addressed in the paper.
- As a result of the previous weakness, the experiments on language modelling only consider small architectures, due to the complexity. Additionally, the improvements over decoder-only architectures, as seen in Table 3, are quite minimal given the potentially much longer training and inference times (which are not reported).
- The theoretical results of Lemma 1 and Lemma 2 rely on Conjecture 1, which might not hold. Why does the paper claim "This conjecture seems plausible given that both Algorithm 1 and Algorithm 2"?
- The experimental set-up, e.g., learning rate and number of epochs, for the openwebtext experiment does not seem to be provided.
- (minor) The causal function in line 175 is not defined.
- (minor) It would be clearer if the authors could illustrate Figure 1 with a two-layer case with ellipses instead of a one-layer case with ellipses.
Questions
- Why is it that any causal function that can be expressed by an encoder requiring ω(n) time cannot be efficiently expressed by a decoder?
- Lines 164-175 are a bit confusing. Does T_E = E hold? What is the meaning of "we can view it as an explicit and necessary way to introduce causality to the encoder since there is nothing implicit to the encoder that forces causality"?
- Why consider/construct such a triplet-counting task? At first glance it is a bit complicated.
- Why does the decoder architecture perform better than the encoder architecture in Table 2?
- What is RASP in line 363?
We sincerely appreciate your detailed review and helpful comments.
[W1 and W2] The main weakness is that the encoder-only architecture requires O(n^3) time to generate a sequence during inference and needs to calculate the attention matrix n times during training. In contrast, the decoder-only architecture only requires O(n^2) time during inference and can calculate the attention matrix once during training. This makes the proposed method impractical for real-world language modeling applications due to the high complexity, which is not addressed in the paper. As a result of the previous weakness, the experiment on language modeling only considers small architectures, due to the complexity. Additionally, the improvements over decoder-only architectures, as seen in Table 3, are quite minimal given the potentially much longer training and inference times (which are not reported).
We want to emphasize that the goal of our paper is to answer scientific questions about Transformers, rather than to propose ENTP as a practical language modeling solution for today’s hardware. Given our limited resources, we were only able to train small-scale LLMs on openwebtext, and we did not intend for this to be our main result.
[W3] Theoretical results of Lemma 1 and 2 rely on Conjecture 1, which might not hold. Why does this conjecture seem plausible given both Algorithm 1 and Algorithm 2?
To the best of our knowledge, Algorithms 1 and 2 are the only known algorithms that compute Triplet-Counting, and we conjecture that the time/space trade-off achieved by these two algorithms is optimal.
[W4] The experimental set-up, e.g., learning rate, epoch, for the openwebtext experiment seems not provided.
We appreciate your careful reading. Here are additional experimental details:
- warmup_iters: 2000
- lr_decay_iters: 600,000
- min_lr: 0.00006 (6e-05)
- max_lr: 0.0006 (6e-04)
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
- block_size: 128
- batch_size: 32
They will be included in the revised appendix of the paper as well.
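For completeness, here is a minimal sketch of how such values would typically combine in a nanoGPT-style warmup-plus-cosine schedule; the function name and the exact schedule shape are our assumptions, not taken from the paper.

```python
import math

def get_lr(it, warmup_iters=2000, lr_decay_iters=600_000,
           max_lr=6e-4, min_lr=6e-5):
    """Assumed nanoGPT-style schedule: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:                       # linear warmup
        return max_lr * (it + 1) / warmup_iters
    if it > lr_decay_iters:                     # after decay, hold at the floor
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))   # goes from 1 to 0
    return min_lr + coeff * (max_lr - min_lr)
```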
[W5] Causal function in line 175 is not defined.
A causal function refers to a sequence-to-sequence function whose outputs are determined only by the current and previous inputs, i.e., y[n] depends only on x[1:n] and not on x[n+1]. This is defined as a "causal model" in Section 3's preliminaries. To improve readability, we will revise the text to refer explicitly to the preliminaries.
[W6] It would be clearer if the author could illustrate Figure 1 in two-layer cases with ellipsis instead of one-layer cases with ellipsis.
We have updated Figure 1 in the paper.
[Q1] Why is it that any causal function requiring ω(n) time cannot be efficiently represented by a decoder?
To clarify, we said that any causal function requiring ω(n^2) time cannot be efficiently expressed by a decoder. Generating a sequence of n tokens requires O(n^3) compute for ENTP but only O(n^2) for decoder-only Transformers. While this implies that ENTP is more compute-intensive (i.e., ENTP will be slower than a decoder-only Transformer), it also implies that ENTP can express more compute-intensive sequence functions than decoders. Specifically, since the total amount of compute that decoders use for generating n tokens is O(n^2), they cannot run any algorithm whose runtime is ω(n^2). A similar argument holds for ENTP: ENTP cannot run algorithms whose runtime is ω(n^3).
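To make the counting argument concrete, here is a minimal sketch (our illustration, not code from the paper) that tallies per-step attention cost during generation, assuming the decoder reuses its KV cache (cost proportional to the prefix length t at step t) while ENTP re-runs full self-attention over the whole prefix (cost proportional to t^2); layer count and constants are ignored.

```python
def total_generation_cost(n, per_step_cost):
    """Sum an order-of-magnitude attention cost over n generation steps."""
    return sum(per_step_cost(t) for t in range(1, n + 1))

n = 1024
decoder_total = total_generation_cost(n, lambda t: t)      # grows like n^2
entp_total    = total_generation_cost(n, lambda t: t * t)  # grows like n^3
print(decoder_total, entp_total)  # 524800 vs. 358438400 -> the cubic gap
```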
[Q2] Lines 164-175 are a bit confusing. Does T_E = E hold? What is the meaning of “we can view it as an explicit and necessary way to introduce causality to the encoder since there is nothing implicit to the encoder that forces causality.”?
Unlike decoders, an encoder is not causal by default. We need to impose causality externally to prevent cheating (i.e., using future tokens when making a prediction). The T_E notation describes this external causality applied to an encoder E. The construction is shown in Figure 1 (which we will update to provide a 2-layer version). T_E = E does not hold because E[n - 1] depends on the input x[n], whereas T_E[n - 1] does not.
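As an illustration, here is a minimal PyTorch-style sketch of this construction (our notation; the `encoder` call and its output shape are assumptions, not the paper's code): T_E re-runs the bidirectional encoder on every prefix and keeps only the last position, so no output ever attends to future tokens, which also shows why T_E(x)[n-1] differs from E(x)[n-1].

```python
import torch

def apply_entp(encoder, x):
    """T_E(x)[t] = E(x[:t+1])[t]: run the full (non-causal) encoder on each prefix
    and keep only the representation of that prefix's last position."""
    outputs = []
    for t in range(x.shape[1]):           # one full encoder pass per prefix
        h = encoder(x[:, : t + 1])        # (batch, t+1, d); attention is bidirectional
        outputs.append(h[:, -1])          # this position never sees x[:, t+1:]
    return torch.stack(outputs, dim=1)    # (batch, seq_len, d), causal by construction
```

The n encoder passes in this loop are also where the extra factor of n in ENTP's generation cost comes from.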
[Q3] Why consider/construct such a triplet-counting task? At first glance it is a bit complicated.
We chose the Triplet-Counting task to highlight the difference between decoder and ENTP. The fact that this task relies on triplet-wise relationships makes it difficult to model with attention, since attention is a pairwise operation. In general, it is a task that is hard for transformers, and ENTP can overcome that difficulty.
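To build intuition for why a triplet-wise rule is awkward for pairwise attention, here is a hedged sketch of a Count3-style target; the exact definition used in the paper may differ, and the modulus and pair-indexing below are our assumptions.

```python
def count3_style_next_token(x):
    """Hypothetical Count3-style rule: count index pairs (i, j) whose values,
    together with the most recent token x[-1], sum to 0 modulo len(x).
    Every pair must be combined with the last token, so the target is inherently
    triplet-wise, while one attention layer scores only pairs of positions."""
    n = len(x)
    count = sum((x[i] + x[j] + x[-1]) % n == 0
                for i in range(n) for j in range(i, n))
    return count % n
```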
[Q4] Why does the decoder architecture perform better than the encoder architecture in Table 2?
Both models perform very well on the Triplet-Identification task. Since both models achieve over 99% accuracy, the differences are very small and not statistically meaningful. The result might be due to randomness.
[Q5] What is RASP in line 363?
RASP (Weiss et al., 2021) is a programming language that describes Transformer computations, by mapping attention and feed-forward computation into simple primitives. A RASP program can be compiled into transformer weights. We added a short note explaining RASP in the paper. Please see Thinking Like Transformers (Weiss et al., 2021) for more details.
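For readers unfamiliar with RASP, the following is a rough Python emulation of its two core primitives, select (build a boolean attention pattern from a pairwise predicate) and aggregate (average values over the selected positions); this is an illustrative paraphrase, not actual RASP syntax.

```python
def select(keys, queries, predicate):
    """Boolean 'attention pattern': entry [q][k] is True when predicate(keys[k], queries[q])."""
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(selector, values, default=0.0):
    """For each query position, average the values at the selected key positions."""
    out = []
    for row in selector:
        picked = [v for v, keep in zip(values, row) if keep]
        out.append(sum(picked) / len(picked) if picked else default)
    return out

# Example: a causal running mean of the sequence, a decoder-like attention pattern.
tokens = [3, 1, 4, 1, 5]
positions = list(range(len(tokens)))
causal = select(positions, positions, lambda k, q: k <= q)
print(aggregate(causal, tokens))  # [3.0, 2.0, 2.67, 2.25, 2.8] (approximately)
```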
[Final Note] Thanks again for the insightful review. If there’s anything else we can clarify or elaborate on, please don’t hesitate to let us know. If our responses have addressed your concerns, we would be grateful for your support in improving our score.
Thanks for the responses! I am inclined to maintain my score.
This paper tries to compare encoder-only models (where there is a kind of bidirectional attention) with decoder-only models. They try to pitch that, under infinite compute, encoder-only models perform as well as or better than decoder-only models. However, I don’t think that it makes sense to compare training approaches assuming unlimited compute (I mean, an MLP can represent anything by that logic). Most of the space and time complexity analysis in this work is trivial. The main section of the paper has a lot of unnecessary details.
Strengths
- Some nice toy tasks were formulated to compare encoder and decoder models.
Weaknesses
- I don't understand the time complexity comparison argument in L241. Why is it that an encoder requiring \omega(n^2) operations cannot be efficiently expressed by a decoder? Shouldn't it be the reverse, i.e., that the encoder is worse here?
- I don’t agree with the basic premise which the paper is trying to prove, i.e., the first line of the conclusion (L522). The paper is trying to show that under infinite compute, the encoder-only model outperforms the decoder-only model. I am not sure why the authors feel the need to prove this, but it is intuitive and pretty obvious that encoder-only models will outperform uni-directional decoder-only models. Bidirectional-attention decoder-only models are not used in practice precisely because they are expensive to train due to more operations and expensive at inference as well.
- The authors tried to show encoder models outperform decoder models on some toy tasks, and even their real tasks (addition, regression, etc.) are quite easy and trivial.
- Authors try to talk about experiments on openwebtext, with minimal details. Again, the compute is not fixed for this pretraining experiment (encoder model gets more compute or the data is simply repeated for decoder model to match the compute, which is NOT a realistic setting). I as a reviewer have to guess what the experiment would have been here, as no details were provided.
At this point of time, I think this manuscript needs significant amount of work, thoughts on the key message for community they are trying to convey and needs a restructuring of the main sections.
Questions
I would be willing to increase my score for this paper if the authors can show that I misunderstood a major part of this work or the takeaways. Please see the weakness section for the questions.
---- POST AUTHOR RESPONSE: I have updated the score after the author response, as it clarified multiple key sections of the paper.
We thank the reviewer for their constructive comments. We address other concerns below and revise the paper accordingly.
[Summary] They try to pitch that under infinite compute, encoder only models perform as good or better than decoder only models. However, I don’t think that it makes sense to compare training approaches, assuming unlimited compute (I mean an MLP can represent anything by that logic).
Model performance depends on more than just the amount of compute available. Simply increasing model size doesn’t always lead to better results, especially when the training data size can’t be scaled. This highlights the importance of inductive biases in model architectures. ENTP provides a way to increase model complexity without increasing the number of parameters, demonstrating better generalization and improved sample complexity compared to decoder-only Transformers. If we want to scale up model complexity but cannot scale the training data size, ENTP stands out as a compelling solution.
Also, an MLP has much worse inductive bias than a Transformer. We tried training an MLP (with more parameters and more compute than Transformer models), but it still performed worse than both Transformer models. We tested the same addition setup as described in section 6.1 of the paper. Experiment results: https://ibb.co/PF4bhCR.
[W1] I don't understand the time complexity comparison argument in L241. Why is it that an encoder requiring ω(n^2) operations cannot be efficiently expressed by a decoder? Shouldn't it be the reverse, i.e., that the encoder is worse here?
Generating a sequence of n tokens requires O(n^3) compute for ENTP but only O(n^2) for decoder-only Transformers. While this implies that ENTP is more compute-intensive (i.e., ENTP will be slower than a decoder-only Transformer), it also implies that ENTP can express more compute-intensive sequence functions than decoders. Specifically, since the total amount of compute that decoders use for generating n tokens is O(n^2), they cannot run any algorithm whose runtime is ω(n^2) (strictly greater than quadratic time). A similar argument holds for ENTP: ENTP cannot run algorithms whose runtime is ω(n^3).
[W2] I don’t agree with the basic premise which the paper is trying to prove i.e. the first line of conclusion (L522). The paper is trying to show that under infinite compute, the encoder only model outperforms the decoder only model. I am not sure why authors feel the need to prove this, but it is intuitive and pretty obvious that encoder only models will outperform uni-direction decoder only models. Bi-direction attention decoder only models are not used in practice precisely because they are expensive to train due to more operations and expensive at inference as well.
We respectfully disagree with the statement: “it is intuitive and pretty obvious that encoder only models will outperform uni-direction decoder only models”. To the best of our knowledge, “encoder-only” Transformers have not been tested in sequence modeling. Thus, it was unclear whether an encoder model would perform well in practice. This is what our paper shows for the first time.
It is worth noting that other reviewers have recognized the novelty and importance of our investigation. Reviewer 2 described our work as “The main strength of the paper lies in questioning the conventional wisdom of the current literature that predominantly focuses on Transformer decoders for language modeling.” Similarly, Reviewer 4 appreciated the compelling motivation behind using encoder-based approaches, saying “the motivation for using the encoder to predict the next word is compelling, especially in scenarios where computing resources and efficiency are not primary concerns.”
[W3] Authors tried to show encoder models outperform decoder models, on some toy tasks and even their real tasks are quite easy and trivial (addition, regression, etc.)
We respectfully disagree with this. These synthetic tasks are widely used to understand how Transformers work and to come up with better designs. See Physics of Language Models (Allen-Zhu et al., 2024), Teaching Arithmetic to Small Transformers (Lee et al., 2023), Transformers Can Do Arithmetic with the Right Embeddings (McLeish et al., 2024), What Algorithms can Transformers Learn? A Study in Length Generalization (Zhou et al., 2023), Length Generalization in Arithmetic Transformers (Jelassi et al., 2024), Looped Transformers for Length Generalization (Fan et al., 2024), What Can Transformers Learn In-Context? A Case Study of Simple Function Classes (Garg et al., 2023), and Unveiling Transformers with LEGO: a synthetic reasoning task (Zhang et al., 2022).
[W4] Authors try to talk about experiments on openwebtext, with minimal details. Again, the compute is not fixed for this pretraining experiment (encoder model gets more compute or the data is simply repeated for decoder model to match the compute, which is NOT a realistic setting). I as a reviewer have to guess what the experiment would have been like here, as no details were provided.
We thank the reviewer for the careful reading. Here are additional experimental details:
- warmup_iters: 2000
- lr_decay_iters: 600,000
- min_lr: 0.00006 (6e-05)
- max_lr: 0.0006 (6e-04)
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
- block_size: 128
- batch_size: 32
They will be included in the revised appendix of the paper as well. The models were not trained with a fixed amount of compute. We matched the number of training iterations and the number of training examples used per iteration in this experiment.
[Final Note] Thank you again for your review. Please let us know if we can clarify anything further. If our responses addressed your concerns, we’d greatly appreciate your support and a higher score for our paper.
I have re-read the paper post the author's response, and have updated the score. The clarifications provided helped me understand some key sections better. I would really encourage the authors to focus on writing the "real world empirical experiments" section, like the OpenWebText one, better, as it strongly validates the real world impact of the paper as well. Thanks for clarifying that you matched the compute.
This paper introduces Encoder-Only Next Token Prediction (ENTP), which challenges the traditional reliance on decoder-only transformers with causal attention for next-token prediction tasks. The authors argue that encoder-only models can perform next-token prediction without the need for causal masking and propose ENTP as an alternative to decoder-only transformers. ENTP has advantages in expressive power, allowing it to handle tasks like the Triplet-Counting task, which the paper theoretically and experimentally shows that decoder-only transformers struggle with. Through empirical results, the paper demonstrates that ENTP outperforms traditional decoder-only models in generalization tasks, including length generalization and in-context learning.
Strengths
The main strength of the paper lies in questioning the conventional wisdom of the current literature, which predominantly focuses on Transformer decoders for language modeling and generative tasks. The paper provides relevant theoretical and experimental justification to demonstrate that decoders are not suitable for the full range of tasks. The proposed ENTP method shows superior length generalization and lower sample complexity, enabling it to perform well with fewer examples. The theoretical and experimental results provided in the paper will help the research community gain a broader understanding of this subject.
Weaknesses
The major weakness of the paper is the overstretching of a few claims without sufficient analytical justification. For instance --
- "In principle, there is no inherent reason that language models should be limited by this restriction."
- "Triplet-Counting sequences have very small Kolmogorov complexity"
- "Remark 1 conclude that encoders have strictly larger expressive power than decoders"
- "Transformer encoder or decoder work with uncountable vocabulary"
See my questions below for more clarification.
More limitations --
- The time and space complexity of the triplet counting algorithm in Conjecture 1 raises questions about its relevance. I am not sure whether a task that requires at least n^2 time complexity is practically relevant to explore.
- The zero accuracy of Llama-7b and GPT-4o, even after fine-tuning, is alarming and needs a more convincing explanation. The authors should investigate if this failure is due to model architecture limitations, training procedures, or task-specific challenges.
Questions
- "In principle, there is no inherent reason that language models should be limited by this restriction." - what evidence do you have to support this claim?
- "Triplet-Counting sequences have very small Kolmogorov complexity" - can we elaborate how algorithm 1 concludes this fact? How do you quantiy "very small"? What is the typical range of Kolmogorv complexity?
- "Remark 1 conclude that encoders have strictly larger expressive power than decoders" - how do you connect remark 1 with this conjecture?
- "Transformer encoder or decoder work with uncountable vocabulary" - How does uncountable vocabulary work?
More questions --
- How does ENTP scale with longer sequences?
- How adaptable is ENTP across diverse natural language processing tasks beyond token prediction, such as classification or sequence alignment?
Thank you for your review and comments. They have greatly helped us improve our draft.
[W1, Q1] "In principle, there is no inherent reason that language models should be limited by this restriction." - what evidence do you have to support this claim?
With regards to the first weakness, we agree that the phrasing “no inherent reason” is somewhat ambiguous and have rephrased it in the paper. To clarify, suppose we want to predict the token x[n+1] given tokens x[1:n]. In principle, the latent representation of an earlier token x[i] could depend on the entire input x[1:n]. However, with the current decoder-only model, x[i]'s latent representation cannot depend on x[i+1:n]. ENTP removes this unnecessary restriction.
[W2, Q2] "Triplet-Counting sequences have very small Kolmogorov complexity" - can we elaborate how algorithm 1 concludes this fact? How do you quantify "very small"? What is the typical range of Kolmogorov complexity?
We first note that this statement was meant as an informal remark about an interesting observation rather than a major claim. We will rephrase it to make this clearer. To answer your question: since Kolmogorov complexity is defined as the size of the smallest program that can generate the data, Algorithm 1 gives an upper bound, since it is a program that can generate Triplet-Counting strings.
In terms of typical Kolmogorov complexity, if we take natural languages as an example, estimating the exact Kolmogorov complexity of a large English text (say wikipedia) is still an open problem, but it’s probably not just hundreds or thousands of lines of program code.
However, we know that the Decoder model does a decent job at approximately learning language. This results in a (false) belief that a large enough Decoder Transformer can learn sequence models as long as their Kolmogorov complexity is bounded. Our Triplet-Counting task is the first concrete counterexample to disprove this belief.
[W3, Q3] "Remark 1 concludes that encoders have strictly larger expressive power than decoders" - how do you connect remark 1 with this conjecture?
We state that this is a false conjecture; see the following paragraph, which starts with “however, this is not true”. To prevent this point of confusion, we have streamlined and removed this statement in the revised draft.
[W4, Q4] "Transformer encoder or decoder work with uncountable vocabulary" - How does uncountable vocabulary work?
The mention of an uncountable vocabulary arises from the fact that our embedding space, R^d, is uncountable. For example, if we take the unit sphere as the set of embeddings, the vocabulary becomes uncountable. Of course, in practical implementations, finite precision issues arise, so this concept is primarily of theoretical relevance, specifically for the proofs of Theorems 1 and 2.
[W5] The time and space complexity of the triplet counting algorithm in Conjecture 1 raises questions about its relevance. I am not sure whether a task that requires at least n^2 time complexity is practically relevant to explore.
As for the concern of limited applicability, Triplet-Counting is primarily significant for its theoretical value, as it offers a simple yet effective way to differentiate encoder and decoder models. We are seeing consistent improvements in in-context learning, addition, and (small-scale) language modeling, which are all practically relevant tasks.
[W6] The zero accuracy of Llama-7b and GPT-4o, even after fine-tuning, is alarming and needs a more convincing explanation. The authors should investigate if this failure is due to model architecture limitations, training procedures, or task-specific challenges
The zero accuracy of Llama-7b and GPT-4o is undoubtedly surprising, but we believe it is corroborated by our theoretical results. We do not believe it is due to the training procedure, as we used the same training procedure on a simpler task as a sanity check, and both Llama-7b and GPT-4o were able to learn in that case (please see Section C.1 of the appendix and Figure 9 for results on this task).
[Q5] How does ENTP scale with longer sequences?
Please see Table 1 for detailed time/space complexity. But in a nutshell, computation is cubic in the sequence length (as opposed to quadratic for decoders).
[Q6] How adaptable is ENTP across diverse natural language processing tasks beyond token prediction, such as classification or sequence alignment?
The Triplet-Identification task was technically a sequence alignment problem. Additionally, we are currently working on a Named Entity Recognition (NLP sequence alignment) experiment.
[Final Note] : Thank you for your detailed review. If there's anything further we can clarify or assist with, don't hesitate to let us know. Additionally, if our responses addressed your questions and concerns, we would greatly appreciate it if you could consider raising your score and supporting our paper.
Thank you for providing clarification. Although the theoretical results introduced in the paper are valuable, the practical implications prevent me from raising the scores. The following concerns remain:
- The generation cost is cubic, even higher than that of decoder-only models. Transformer decoders have already been criticized for their high generation runtime. People are currently exploring state-space models for decoding tasks due to their computational efficiency during generation.
- The empirical study is insufficient to understand the practical implications of the work.
All the best with your submission.
We sincerely appreciate the time and effort you have devoted to reviewing our manuscript. We would like to reiterate that the primary objective of our study is to explore scientific questions about Transformers, rather than to advocate for ENTP as a practical language modeling solution given current hardware limitations. That said, we believe there are plausible scenarios where ENTP or related concepts could become practically relevant in the near future:
- The development of more efficient algorithms for training or inference that approximate ENTP, reducing its runtime complexity to below cubic.
- The design of novel architectures that balance ENTP's expressive power with the computational efficiency of decoder-only models.
Additionally, model performance depends on more than just the amount of compute available. Simply increasing model size doesn’t always lead to better results, especially when the training data size can’t be scaled. This highlights the importance of inductive biases in model architectures. ENTP provides a way to increase model complexity without increasing the number of parameters, demonstrating better generalization and improved sample complexity compared to decoder-only Transformers. If we want to scale up model complexity but cannot scale the training data size, ENTP stands out as a compelling solution.
Our work represents the first step in introducing this novel approach to improving current Transformer architectures. While we acknowledge the significant computational cost, we are confident that it will spark new research directions and inspire future advancements in this field.
Regarding concerns about the lack of empirical studies, we would like to emphasize that we conducted experiments across various domains where next-token-prediction-based Transformers can be applied, including addition, in-context learning, and language modeling. Furthermore, following the reviewers’ suggestions, we performed additional empirical experiments, which further demonstrate the effectiveness of ENTP.
First, beyond the models discussed in the manuscript, we combined ENTP with a larger pre-trained model, BERT. Since BERT is a representative encoder-only architecture, we integrated it with ENTP and fine-tuned it on the triplet counting task. As shown in the figure (please see link), BERT combined with ENTP learned triplet counting more efficiently than the small-scale ENTP models. This demonstrates that ENTP can be applied to larger pre-trained models beyond the configurations presented in the paper. We included these results in our revised manuscript.
Second, we evaluated the performance of the ENTP model and a decoder-only model, both pre-trained on OpenWebText, on a downstream task. We used TinyWinoGrande [1], which evaluates common sense reasoning ability by identifying the referents of pronouns. Specifically, after training on OpenWebText, each model was evaluated on this benchmark in a zero-shot manner. As seen in the table below, ENTP achieved higher accuracy compared to the decoder-only model, showing its potential effectiveness across various downstream tasks.
| | Decoder-only | Encoder-only (ENTP) |
|---|---|---|
| TinyWinoGrande [1] (Accuracy ↑) | 56.0 % | 61.0 % |
Finally, following your [W6] suggestion, we evaluated the performance of ENTP in a NLP classification task. Specifically, we evaluated a decoder-only model and ENTP on the CLUTRR dataset [2], a reasoning-based dataset designed to classify relationships between individuals based on textual descriptions. Both models, pre-trained on OpenWebText, were fine-tuned using limited subsets of the CLUTRR data. Their performance was then compared on a separate holdout dataset.
| Number of Training Examples | Decoder-only (Accuracy ↑) | Encoder-only (ENTP) (Accuracy ↑) |
|---|---|---|
| 2.5k Examples | 43.3 % | 41.4 % |
| 5k Examples | 69.8 % | 71.2 % |
| 7.5k Examples | 87.7 % | 88.1 % |
| 10k Examples | 99.2 % | 99.5 % |
We sincerely thank you for responding to our rebuttal. If there are any remaining issues, please do not hesitate to let us know. We would greatly appreciate the opportunity to address them before the discussion phase concludes.
[1] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating LLMs with fewer examples. ICML 2024
[2] Sinha, Koustuv, et al. "CLUTRR: A diagnostic benchmark for inductive reasoning from text." EMNLP 2019.
Thank you for the time and effort you have devoted to reviewing our manuscript and comments so far. As the review period draws to a close, we kindly request your feedback on our final two responses: Response to Further Comments and Comments on Empirical Results. If these responses adequately address your concerns, we would be grateful if you could consider revising your score accordingly.
This paper analyzes Encoder-only Next Token Prediction (ENTP), which challenges the prevailing use of decoder-only Transformers with causal attention in next-token prediction tasks. The paper tries to theoretically prove that encoder-only architectures can achieve performance comparable to decoder-only ones, and constructs tasks like triplet-counting to show that decoder-only models fail on some tasks.
Strengths
- The paper is in general well-written and easy to follow. The figures (like Figures 3 and 8) are helpful for understanding.
- Constructs an interesting task, Triplet-Counting, to compare the decoder-only and encoder-only architectures.
- Provides both a theoretical analysis and a small-scale empirical study of ENTP.
Weaknesses
In general I still have some concerns, as described in the weaknesses and questions. I will increase my score if these problems are resolved.
- In lines 237-239, I think the claim is a little weird because it is well known that the decoder-only architecture also uses a KV cache to save memory, and the claim in lines 240-242 may not hold.
- Besides, I think the claim in the space complexity comparison in lines 245-246 is not that reasonable if you also consider time complexity. In Algorithm 3 you use an O(n^2) loop for calculating attention scores in order to use O(n) rather than O(n^2) space, and this cannot happen if you want to store the previous keys and values for acceleration, as stated in the Time Complexity Comparison part.
- In Section 5.1, my advice is to add a summary of the intuition for why the encoder can solve the Triplet-Counting function but the decoder-only architecture cannot. A related question is given in Question 2.
- The scale of the "large" encoder is still small (from Appendix C.4), and the main issue is that the encoder-only architecture will face efficiency issues when scaling up. The lack of larger-scale experiments may make people wonder whether it is suitable for large-scale settings.
- The improvement on validation loss/perplexity is not that significant, and a better way may be to evaluate on some downstream tasks.
Questions
- What's the result of decoder-only model with non-causal attention (you mentioned in 103-104) on the Triplet Counting task? Will they also fail?
- I am confused about Remark 1, Lemma 1, and Lemma 2. If Remark 1 and Lemma 2 hold, why can we not construct a decoder-only Transformer from the L=O(1) encoder-only Transformer and thus improve the result in Lemma 1?
- Have you done any experiments using BERT, which is an encoder-only architecture? Can BERT learn the Triplet-Counting task? I think it may be a helpful experiment.
[W4] The scale of "large" encoder is still small (from Appendix C.4), and the main issue of encoder-only architecture will face efficiency issues when scaling the architectures. The lack of larger scale experiments may let people wonder if it's suitable for large scale experiments.
We acknowledge that the model configuration denoted as "large size" may not be sufficiently large, and we want to emphasize that this size was not chosen to demonstrate the scalability of encoder-only models. The "large size" Transformer was only used in the triplet-counting task for decoder-only models, with the purpose of showing that even with increased size, decoder-only models fail to learn triplet counting (in addition, this experiment was extended to truly large-scale LLMs).
Nevertheless, regarding the experiments with large-size ENTP, we would like to emphasize that the goal of our paper is to answer scientific questions about Transformers, rather than to propose ENTP as a practical language modeling solution for today’s hardware. Given our limited resources, we were only able to conduct small-scale ENTP experiments.
[W5] The improvement on validation loss/perplexity is not that significant, and a better way may be to evaluate on some downstream tasks
This is a constructive suggestion to evaluate performance across various NLP-related downstream tasks. Following the suggestion, we are conducting experiments on NLP downstream tasks and will include them in the final draft.
[Q1] What's the result of a decoder-only model with non-causal attention (you mentioned in 103-104) on the Triplet Counting task? Will they also fail?
This is an interesting question. Following your suggestion, we conducted additional experiments using a PrefixLM (i.e., decoder-only model with non-causal attention) for the triplet counting task. Consistent with the setup described in the main paper, each sequence begins with a seed comprising 16 random integers to ensure uniqueness. We used this seed as the prefix part for the PrefixLM. As shown in the figure (see link), while the PrefixLM slightly outperforms the decoder-only model, it also fails to learn the triplet counting task. As discussed in [W3], this is because, although the PrefixLM performs full attention over the prefix part, it still relies on previously computed values that do not consider the last tokens.
[Q2] I just feel confused about Remark1, Lemma1 and Lemma2. If Remark1 and Lemma2 holds, why we can not construct an decoder-only Transformer from the L=O(1) encoder-only Transformer and thus improve the result in Lemma1?
Thank you for the sharp question. Remark 1 demonstrates that “There exists a function that can be represented by both encoder and decoder”. However, this does not imply “For any function represented by an encoder, a decoder can represent the same function”. In fact, Theorem 2 shows the existence of causal functions that can only be represented by an encoder, while Theorem 1 demonstrates the existence of causal functions that can only be represented by a decoder.
In summary, causal functions can be categorized into four types: 1) functions that can be represented by both an encoder and a decoder (Remark 1), 2) functions that can only be represented by an encoder (Theorem 2), 3) functions that can only be represented by a decoder (Theorem 1), and 4) functions that cannot be represented by either (e.g., a function requiring ω(n^3) runtime, as mentioned in [W1]). We argue that the causal function required for triplet counting can only be represented by an encoder model under certain computational constraints.
[Q3] Have you done some experiments using Bert which is encoder-only architecture? If Bert can learn the Triplet Counting task? It's maybe an helpful experiment I think
This is an interesting question. As BERT is a representative encoder-only architecture, it can be straightforwardly fine-tuned for next-token prediction by leveraging the ENTP approach. Following your suggestion, we are conducting an experiment to fine-tune BERT for triplet counting and will ensure that the results are included in the final draft.
[Final Note] Thanks again for the detailed review. Please let us know if there is anything you need from us to clarify further. Also, if our response clarified your questions and concerns, we would greatly appreciate it if you could raise your score and support our paper.
We sincerely appreciate your insightful comments. They were incredibly helpful in improving our draft. We have addressed each comment in detail below.
[W1] In lines 237-239, I think the claim is a little weird because it is well known that the decoder-only architecture also uses a KV cache to save memory, and the claim in lines 240-242 may not hold.
As the reviewer mentioned, decoder-only structures can utilize a KV cache. The content discussed in lines 237–239 indeed indicates that decoder-only structures can use a KV cache, and these sentences do not refer to ENTP.
The KV cache usage in decoder-only models enhances the efficiency of their causal attention mechanism: features computed for previous token predictions can be reused, enabling the prediction of a sequence of n tokens with a computational cost of O(n^2). In contrast, ENTP must recalculate features from scratch for each token prediction, requiring O(n^3) compute to predict a sequence of n tokens. While this indicates that ENTP is more compute-intensive (i.e., ENTP will be slower than decoder-only Transformers), it also suggests that ENTP can represent more compute-intensive sequence functions than decoder-only Transformers. Specifically, since the total amount of computation that decoder-only Transformers use to generate n tokens is O(n^2), they cannot execute algorithms with a runtime of ω(n^2). A similar argument holds for ENTP: ENTP cannot run algorithms whose runtime is ω(n^3). Therefore, our argument in lines 240–242 remains valid for the reasons stated above, and we have clarified this paragraph in the revised manuscript.
[W2] Besides, I think the claim of Space complexity comparison in line 245-246 is not that reasonable if you also consider time complexity. In algorithm 3 you use a O(n^2) loop for calculating attention score for using O(n) rather than O(n^2) space. And this can not happen if you want to store the previous keys and values for accelerating as said in the Time Complexity Comparison part
We would like to first clarify what we wrote, as follows. Consider a decoder with a “parallel” attention compute algorithm and a KV cache (treating the hidden dimension d as a constant). The KV cache needs O(n) precomputed space. Computing the new key/value/query vectors requires O(1) additional compute. Given the new (k, v, q) and the cached KV values, if one first computes all (pre-normalization) attention values, then the required additional compute is O(n) (computing n inner products with d-dimensional vectors per layer) and the intermediate inner-product values require O(n) space (n scalar outcomes). Computing the softmax takes O(n) time and O(n) space. Then, it has to compute the weighted sum of n vectors, which takes O(n) time and O(1) additional space. Thus, the total additional time is O(n) and the total additional space is O(n).
However, if one computes attention sequentially as shown in Algorithm 3, the additional space can be reduced to O(1). Given this, and going back to the reviewer's question, the above space complexity is computed with the KV cache taken into account.
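As a minimal sketch of this kind of sequential computation (in the spirit of, but not copied from, Algorithm 3): attention for the newest query can be computed in one pass over the cached keys and values with a running, numerically stable softmax, so the working memory beyond the KV cache stays constant in n.

```python
import math

def attend_sequentially(q, keys, values):
    """One query against cached keys/values via an online (streaming) softmax:
    O(n) time, and only O(1) working memory beyond the KV cache itself."""
    d = len(q)
    m, denom = float("-inf"), 0.0         # running max and softmax normalizer
    acc = [0.0] * d                       # running weighted sum of value vectors
    for k, v in zip(keys, values):
        score = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
        new_m = max(m, score)
        scale = math.exp(m - new_m)       # rescale previously accumulated terms
        w = math.exp(score - new_m)
        denom = denom * scale + w
        acc = [a * scale + w * vi for a, vi in zip(acc, v)]
        m = new_m
    return [a / denom for a in acc]

# Example: one new query attending to two cached key/value pairs.
print(attend_sequentially([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]],
                          [[1.0, 2.0], [3.0, 4.0]]))
```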
[W3] In section 5.1, my advice is that you may add some summarization about the intuition about why encoders can solve the Triplet-Counting function better but Decoder-only architecture can not. And a related question is illustrated in Questions-2
Thank you for a valuable suggestion. Below, we provide a high-level explanation of why ENTP can learn Triplet Counting, unlike decoder-only models. We have revised our manuscript to include it.
Computation of the next Triplet-Counting token heavily depends on both the last token x[n] and the current sequence length n. This dependency makes it challenging to reuse intermediate computation results that were calculated without considering x[n] and n. This negatively affects decoder-only models, as they rely on previously computed values (i.e., the KV cache becomes useless). In contrast, ENTP recomputes everything from scratch for each token prediction, allowing it to incorporate the last token x[n] (and the current length n) into the entire computation process. As a result, ENTP can effectively learn the Triplet-Counting task.
Dear Reviewer bqjj
We express our deepest appreciation for the time and effort you dedicated to reviewing our manuscript. Here, we have included additional experimental results to supplement our previous responses. We kindly request you take the time to review our rebuttal, as your further feedback would be immensely helpful. If you find that our responses are satisfactory, we would be grateful if you could consider adjusting the review scores accordingly.
Triplet Counting with BERT
As mentioned in the response to Q3, BERT is a representative encoder-only architecture, making it highly interesting and intuitive to test it in combination with ENTP for triplet counting. To explore this, we trained BERT using the ENTP approach under the same experimental settings as in the paper.
As shown in the figure (please see link), BERT combined with ENTP successfully learned triplet counting. Notably, as BERT is pre-trained and larger than the medium Transformer used in the paper, it converged more quickly. We believe this demonstrates that ENTP is effective not only under the experimental setups and model size specified in the paper but also for larger pre-trained models.
Dear Reviewer bqjj
We provide additional experimental results as suggested in [W5]. If there are any further questions or concerns, please do not hesitate to let us know. We would greatly appreciate the opportunity to address any remaining issues before the discussion phase concludes.
Evaluation on other downstream task
Following your suggestion in [W5], we evaluated the performance of the ENTP model and a decoder-only model, both pre-trained on OpenWebText, on a downstream task. Specifically, we utilized the tinyWinogrande benchmark [1], which is designed to assess common sense reasoning by identifying the referents of pronouns. The models were evaluated in a zero-shot setting without additional fine-tuning, and the accuracy results are provided in the table below. As shown in the table, the ENTP model demonstrates significantly improved performance compared to the decoder-only model, highlighting its potential effectiveness across various downstream tasks.
| | Decoder-only | Encoder-only (ENTP) |
|---|---|---|
| TinyWinoGrande [1] (Accuracy ↑) | 56.0 % | 61.0 % |
[1] Felipe Maia Polo, Lucas Weber, Leshem Choshen, Yuekai Sun, Gongjun Xu, and Mikhail Yurochkin. tinybenchmarks: evaluating LLMs with fewer examples. ICML 2024
Thanks for your detailed rebuttal, and sorry for my late response. I think the authors have addressed most of my concerns, and I have updated my score. The authors should include these additional experiments in the final version.
We thank the reviewers for their valuable feedback and thoughtful suggestions that helped improve our work. We are particularly encouraged that the reviewers acknowledged (i) the compelling motivation for exploring encoder-only architectures for next-token prediction (R-gzya, R-YwHg), (ii) the theoretical and experimental contributions demonstrating the expressive power of ENTP (R-bqjj, R-gzya), and (iii) the potential of our work to challenge existing assumptions about language modeling and generative tasks (R-YwHg).
In response to the feedback, we've addressed each concern, added new experiments, and updated our paper accordingly. The major updates to the manuscript are the following: (i) a revised abstract, (ii) Remark 1 replaced with a more concise statement, and (iii) a revised Time Complexity Comparison in Section 5. As a side note, we changed some of the names in the paper to better align with previous work. Specifically, Triplet-Counting is renamed to Count3 and Triplet-Identification is renamed to Match3. We use the original names in our responses here.
The premise of this paper is straightforward -- the Authors argue for a critical look at using encoder-only ("BERT-style") Transformers for next-token prediction training. They provide evidence for this claim through three avenues: (a) theoretical analysis, (b) synthetic counting tasks, and (c) empirical evaluations after training on larger-scale data.
While this was somewhat contested by some of the Reviewers, I personally do not think the paper's claims are huge -- papers not only questioning but theoretically proving that causal masking and decoder-only Transformers are not the best idea for generalisation are ample in the community (for one recent example, see Barbero et al., NeurIPS'24: https://arxiv.org/abs/2406.04267).
That being said, the paper's topic does concern a fairly broad class of architectures, and particularly their performance against the incumbent approach to LLMs. Especially when considering the massive implied compute costs, a thorough evaluation is deemed to be an important part for assessing this work.
I welcome the Authors' addition of the TinyWinoGrande and CLUTRR datasets as a clear step towards a more thorough evaluation. However, I still find this to be a somewhat immature eval: the differences between the two approaches are not very pronounced, and no estimates of variance around the reported numbers are given (e.g. by evaluating multiple seeds and computing standard deviations). Such measurements are important for assessing the robustness of the obtained metrics, and at least a few additional seeds should be possible for the Authors to obtain, given they've been able to pre-train models on one.
Overall, this work is very much on the borderline and I was on the fence about my decision, but I ultimately decided to recommend rejection. The work is certainly valuable and timely, but there are several clear aspects in which the downstream benchmark evaluation can be made more convincing. I highly recommend the Authors to explore this aspect in a future resubmission!
Additional Comments from Reviewer Discussion
While Reviewer YwHg did not engage with the Authors' latest set of results, we discussed them at depth during the Reviewer-AC discussion, and while the Reviewer recognised that these results are significant and valuable, they reiterated that the experimental section is still somewhat immature, especially in relation to the scope of the paper's claims.
Other Reviewers did not opt to champion the paper for acceptance.
Reject