In-context learning and Occam's razor
The next-datapoint prediction error objective used in models that exhibit in-context learning can be seen as a meta-objective that optimizes for learners that not only explain their training data, but do so using the simplest model.
Abstract
Reviews and Discussion
The paper discusses an interesting topic, connecting Kolmogorov complexity and prequential coding with in-context learning. The authors first show that the prequential code is a "good" algorithm to compress both data and model, and that through meta-learning the prequential code length can be minimized. In the setting of ICL, the meta-learner and the inner model for each task are unified as the sequence model, and next-token prediction is equivalent to the prequential coding algorithm. Thus, through next-token prediction, the training error and model complexity are jointly minimized.
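To make this correspondence concrete, here is a minimal sketch (illustrative Python; `fit_predict` is a hypothetical interface, not the paper's code) of how the prequential code length reduces to a cumulative next-datapoint log loss, which is exactly the next-token prediction loss when the learner is a sequence model conditioned on the prefix.

```python
import numpy as np

def prequential_code_length(data, fit_predict):
    """Cumulative code length (in bits) of `data` under a learner that,
    given the prefix seen so far, returns a probability for the next datapoint."""
    total_bits = 0.0
    for t in range(len(data)):
        context, d_next = data[:t], data[t]
        p_next = fit_predict(context, d_next)   # learner (re)fit or conditioned on the prefix
        total_bits += -np.log2(p_next)          # ideal code length of d_next given the context
    return total_bits
```

For a sequence model, `fit_predict` is just a single forward pass producing the probability of the next token given the prefix, so minimizing this quantity over tasks amounts to minimizing the (base-2) next-token prediction loss.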
Strengths
Overall I like the perspective of this paper. Kolmogorov complexity nicely poses the learning and generalization problem as a compression of data and model. It is surprising to see that modern LLMs, trained simply on next-token prediction, generalize so well on downstream tasks with or without fine-tuning. Any effort connecting the two is always welcome.
Weaknesses
Unfortunately the draft, to me, lacks a sense of rigor. The connection stated in the draft reads like a good story, but there is not much in the way of guarantees. How well the prequential code length approximates the Kolmogorov complexity remains a question mark; I feel it is a very loose bound. The prequential coding algorithm assumes that as the learner T sees more data, it will generalize better on new data. However, there is no quantitative analysis of how this is measured. Are there any assumptions on the distribution of the data? What is the sample complexity here?
Questions
- Figure 1.a: in the description of the prequential coding algorithm, there is a line D+= decode(d_next_encoded,p). I do not see where that decode function is defined. Can you add more details?
- Equation (4): the first equality says the code length is the sum of bits for all the data based on the learner. Do we also need extra bits to represent the learner itself? Maybe I missed something here. Please feel free to comment.
- Can you also provide some details on the approximation in (6)? Why is it an approximation, and what have we missed here? Thanks.
Finally something minor: The first sentence in the abstract is a bold claim. Even though I agree generalization is a key to machine learning, I would be cautious claiming that the (only) goal of machine learning is generalization.
We thank the reviewer for their time and valuable feedback. We are heartened to hear that the reviewer liked the perspective of our paper. We address the questions and concerns raised by the reviewer below.
Validity of theoretical claims relative to the tightness of the bound
We acknowledge the reviewer’s point regarding the tightness of the bound and believe that our initial discussions about it in the draft might have been distracting, as this “approximate equality” is not necessary for our theoretical results. We have revised our draft to show that our quantity of interest (training loss + model complexity) is upper-bounded by the prequential code length, and minimizing this upper bound is a valid proxy for minimizing the objective itself. Such an approach of minimizing a bound is in line with well-established methods like variational inference, where the negative ELBO is minimized, which upper-bounds the negative log-likelihood [1]. We kindly refer the reviewer to the updated Sections 2.2 and 2.3, which are now solely based on upper bounds.
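For context, the analogy we have in mind is the standard variational-inference identity (a textbook result, stated here in generic notation rather than the paper's): since the KL term is non-negative, the negative log-likelihood is upper-bounded by the negative ELBO, and minimizing the bound is a valid proxy for minimizing the quantity itself.

```latex
-\log p_\theta(x)
  \;=\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log q_\phi(z\mid x) - \log p_\theta(x,z)\right]}_{\text{negative ELBO}}
  \;-\; \mathrm{KL}\!\left(q_\phi(z\mid x)\,\middle\|\,p_\theta(z\mid x)\right)
  \;\le\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log q_\phi(z\mid x) - \log p_\theta(x,z)\right]
```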
Prequential coding assumes that as the learner sees more data, it will generalize better on new data. Is this assumption valid? How is this quantitatively measured? Does it require assumptions on the distribution of the data? What is the sample complexity?
The reviewer brings up an interesting point. The assumption in question is valid because the learner is trained to minimize the prediction error on the next datapoint given the prior context. This is similar in spirit to the maximum likelihood estimator, which generalizes better on new data as it sees more data. However, this might not hold true for us in certain OoD scenarios, e.g. when the number of context points seen during inference is larger than the number considered during training. This is a known problem for transformer models, called length generalization.
Our findings in Figure 2 precisely indicate that this assumption is valid when minimizing next-token prediction error. The monotonically decreasing curves imply that as the learner sees more data through increasing context length (x-axis), its generalization error on new data decreases (y-axis).
Our setup requires very limited assumptions on the data distribution. Akin to maximum likelihood estimation, observing more data points leads to a better model in our setup as long as the underlying true data-generating model is kept fixed, which is the standard setup in most machine learning approaches. In the extreme case where the observed data carries no information about the underlying data-generating model, observing additional context would not lead to better generalization, either for our in-context learner or for any other learner like SGD.
Finally, Figure 2 also highlights that in low sample regimes prequential ICL outperforms train-risk ICL and SGD, demonstrating better sample complexity.
1) What is the decoding function used in the prequential coding algorithm?
The decoding function used in the prequential coding algorithm corresponds to the decoding algorithm used in arithmetic coding (a lossless entropy coding scheme), which reconstructs the original data given an optimally compressed code and a probability distribution. For our purposes, the point is that the algorithm is short to write (it does not contribute meaningfully to total program complexity) and allows an object $x$ to be compressed using only about $-\log_2 p(x)$ bits and then recovered with zero error using the arithmetic decoding function. Thank you for spotting this missing detail; we have updated the caption of Fig. 1a accordingly.
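For reference, the standard arithmetic-coding guarantee (a textbook result, written here in generic notation) is that an object with model probability p(x) is encoded losslessly in essentially the ideal number of bits and recovered exactly by a decoder with access to the same p:

```latex
\ell_{\mathrm{AC}}(x) \;\le\; \big\lceil -\log_2 p(x) \big\rceil + 1,
\qquad
\mathrm{decode}\big(\mathrm{encode}(x, p),\, p\big) = x .
```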
2) Is equation 4 on prequential code length missing the complexity of the learning algorithm?
The reviewer is correct that the prequential code length and the complexity of the learner together upper-bound the joint complexity of the data and model.
We argue that training the learner on meta-datasets that do not include the target dataset ensures that the learner's complexity term in this bound stays low (i.e. the learner cannot overfit to the target dataset). We point the reviewer to the last paragraph of Section 2.2 and the start of Section 2.3 for reference, with further details present in Appendix B.
3) Why is there an approximation in equation (6)?
Equation (6) is an upper-bound that only becomes a tight approximation under certain circumstances. However, our theoretical results only require (6) to be an upper-bound, and therefore we have replaced all unnecessarily distracting uses of approximations with upper bounds in our revised manuscript.
Minor Suggestion
We agree with the reviewer’s observation that the first sentence in our abstract presents generalization as the sole objective of machine learning, rather than as one of its important objectives. We have amended this to say that generalization is a “central” goal in ML.
References
[1] Bishop, Christopher M. Pattern Recognition and Machine Learning. Springer, 2006.
Dear reviewer,
We are very appreciative of your time and feedback. As we are nearing the end of the rebuttal period, we would like to request the reviewer for an opportunity to answer any additional questions or doubts that may remain.
Dear Reviewer,
As the discussion period is coming to an end, we kindly ask for your response and feedback. As you can gather from our direct responses to your points as well as those from other reviewers, we have significantly improved the manuscript. Notably, we now present an improved formal bound that is minimized during prequential ICL.
[edited: in response to the revision, I have raised my score]
In this paper the authors propose to re-examine the next-token prediction loss used to train sequence models such as transformers from the perspective of compression, and in particular prequential coding. This is an attractive idea that has been the subject of several recent works, including Delétang et al “Language modeling is compression” published in ICLR 2024, and it holds significant promise as a theoretical and empirical means to understand in-context learning (ICL) and more generally the generalisation behaviour of large language models.
The paper has three main components:
(1) The observation that the next-token prediction loss is related to prequential code length
(2) The relation between this code length and Kolmogorov complexity, which begets the claim that training transformers “explicitly” optimises an objective jointly minimising training error and model complexity (line 248), and
(3) Experiments that aim to validate these theoretical claims, and suggest potential improvements to the design of transformers which will incentivise better ICL.
Strengths
- I find the prequential code-length perspective on the pre-training objective of transformers useful; it is relatively novel, and I think it is a promising route to understanding ICL. I did not think any of these things before reading this paper, which introduced me to the idea.
Weaknesses
I find the perspective adopted by the paper intriguing, however in its current form I do not think it has achieved its stated aims in any of three main components identified above:
- The relation between next-token prediction loss and prequential code length appears not to be novel, as it is explained clearly in Delétang et al Section 2, and I think this is not sufficiently emphasised in the paper (their work is cited on line 250 for other reasons).
- While Kolmogorov complexity is presented as playing a significant role in the framing of the theoretical contributions, I am not convinced of this in its current form. The inequality in (4) is of course true, but the major claims (about e.g. transformer training “explicitly” optimising some objective involving model complexity) seem to rely on this inequality being an approximate equality. This is justified in passing, very briefly, around line 162 and a reference is made to Blier-Ollivier (presumably to the discussion in Section 2.4) but I do not understand how this amounts to a strong justification of the approximate equality.
- The experimental results seem a bit scattered, and I am unsure of how strongly they corroborate the theoretical claims. Taking Section 3.1 as an example, I think too much is left implicit about how this connects to the theoretical claims. I do not understand how these findings “follow directly from our theory” (line 312). I do not know how to judge whether or not a gap in performance between prequential and train-risk ICL below 10 datapoints in-context is actually significant.
Questions
Some small suggestions/questions:
- Section 2.4 stops short of actually writing down the next-token prediction loss and doing the simple calculation that connects it to the prequential code length. Since this is claimed in the summary as one of the key contributions, it seems worthwhile to make this explicit.
- Section 3.4 has a reference to Figure 2a (line 392) that should be 3a
- Perhaps I’m confused but is line 1245 backwards? Isn’t your proposal that models trained with maximal length contexts should lead to worse generalisation? Perhaps I am misunderstanding what “need less tokens to arrive at simple models” means.
In conclusion, while I believe prequential coding is a promising direction to understand ICL, I cannot agree with the authors that their theoretical arguments succeed in linking the next-token prediction objective to Occam’s razor (line 502), in their current form.
Things that might change my views:
- A more detailed explanation of why I should believe (4) is an approximate equality (either theoretical or empirical)
- A stronger link between the empirical work in Section 3 and the theory, explaining exactly how the experiments are predicted (as it stands, it reads to me as a few somewhat confirmatory pieces of evidence, but not strongly so).
We thank the reviewer for their time, feedback and for the positive assessment of our work. We address the key questions raised by the reviewer below, and hope that it addresses all the concerns raised.
1) Credit attribution to Delétang et al
In Delétang et al., the authors discuss the connection between (pre-trained) sequence models and compression algorithms through prequential coding, which is indeed an important result used in our theory, but not our main contribution. In fact, this is exactly why we explicitly discuss such related work in the section “sequence modeling and compression” to clarify that others have drawn connections between sequence models and compression. Our contribution, as outlined in that section, is that we frame the next-token prediction loss used to train sequence models in the meta-learning framework through in-context learning and draw its connections to minimizing prequential code length across a distribution of tasks. This yields learning algorithms that align with Occam’s razor. The meta-learning perspective is precisely what is absent from Delétang et al. (which does not aim to explain ICL, as we do), but it is one of the cornerstones of our theory.
2) Validity of theoretical claims relative to the tightness of the bound
We acknowledge the reviewer’s point regarding the approximate equality and the tightness of the bound and believe that our initial discussions about it in the draft might have been distracting, as this “approximate equality” is not necessary for our theoretical results. We have revised our draft to show that our quantity of interest (training loss + model complexity) is upper-bounded by the prequential code length, and minimizing this upper bound is a valid proxy for minimizing the objective itself. Such an approach of minimizing a bound is in line with well-established methods like variational inference, where the negative ELBO is minimized, which upper-bounds the negative log-likelihood. We kindly refer the reviewer to the updated Sections 2.2 and 2.3, which are now solely based on upper bounds.
3) Connection between experimental results and theoretical claims
Our theory predicts that ICL trained to minimize next-token prediction error produces an in-context learning algorithm that jointly minimizes training error and model complexity. Crucially, by Occam’s razor, model complexity is most important to take into account when data is limited and generalization is difficult. This is in line with evidence from standard statistics that the maximum likelihood estimator is consistent, implying that as the amount of data goes to infinity one uncovers the true model. It is only in finite and limited data settings that regularization strategies and the choice of prior play a big role. This is why the gap in performance below 10 data-points in-context is still significant.
The purpose of Section 3.1 was to test exactly this hypothesis that minimizing prequential code length results in better generalization for low-data regimes compared to minimizing training error alone and similar performance for high-data regimes, as our theory predicts given the relationship between prequential ICL and Occam’s razor. Our empirical observations (see “Findings” paragraph of Section 3.1) precisely corroborate our theoretical claims and are thus crucial in validating the hypothesis that we laid out. The fact that this gap in generalization performance sometimes exists only in very low data regimes for certain tasks just implies that for certain tasks like linear regression, less data is required for generalization.
Our other experiments are designed to test the abilities of current methods for ICL, i.e. to what degree do current methods (e.g., sequence model architecture) successfully minimize prequential code length in practice. These experiments were crucial in demonstrating that despite attractive theoretical properties, methods for ICL have significant room for improvement in terms of limitations we identified experimentally and interpreted using our theory.
For all experiments, we attempted to clearly identify in the “Findings” paragraphs what our theory predicts (in terms of Occam’s razor and model complexity) and then either confirmed or rejected these predictions based on experimental results, while highlighting the implications for current methods for ICL.
Minor suggestion
- We agree with the reviewer that the direct equivalence between next-token prediction loss and prequential code length could be clarified in Section 2.4. We added an equation in Section 2.4 to demonstrate this in line 229.
- Thank you for flagging the incorrect figure reference in section 3.4, we have updated the paper accordingly.
- The claim in line 1245 is indeed backward, thank you for pointing out this mistake. We have updated the paper accordingly.
Thanks for your detailed response and for the updated draft, which I have enjoyed reading.
I think the fundamental problem with the gap between the prequential code length and Kolmogorov complexity is not really resolved (nor is it clear to me how to resolve it). I see the point about minimising an upper bound, but am not convinced this really addresses the conceptual gap. However it is at least now clear what the statement is, and combined with the improvements in exposition around the experiments I have decided to raise my score.
Dear reviewer,
We are very appreciative of your time and feedback. As we are nearing the end of the rebuttal period, we would like to request the reviewer for an opportunity to answer any additional questions or doubts that may remain.
This paper examines in-context learning (ICL) through the lens of Occam’s razor, which suggests that the simplest model that explains the data is most likely to be the true one. This paper proposes that ICL's next-token prediction loss functions similarly to prequential coding. The authors argue that by training models on this principle, ICL can produce models that generalize well across tasks without overfitting, especially in low-data scenarios.
This paper shows that ICL aligns with Occam’s razor more closely than traditional methods that only focus on training error. They also reveal limitations of current ICL methods, which may underfit in large-data regimes and have varying performance based on model architecture. This paper suggests refining architectures and data distribution controls to improve ICL’s generalization and task adaptability.
Strengths
- Novel Perspective. Instead of studying the algorithmic or mechanistic aspect of how LLMs perform in-context learning, this paper proposes a different yet novel perspective --- contrasting ICL with most current approaches in machine learning, and concluding with Occam's razor principle that ICL generalizes well across tasks without overfitting.
- Innovative Use of Prequential Coding for Complexity and Generalization. By framing next-token prediction through prequential coding, the paper introduces a novel approach to balancing model complexity and training accuracy. This insight offers a practical metric for understanding model simplicity.
- Comprehensive Empirical Validation. The paper validates its theoretical claims through a variety of experiments across different tasks, architectures, and training conditions.
Weaknesses
- Limited Generalization to Non-IID and Complex Real-World Tasks. While the paper effectively applies its theory to IID data, the assumptions might limit its relevance to more complex, non-IID real-world tasks, such as language data or continuously evolving data streams.
- Underexplored Architectural Dependencies. Although the paper observes that model architecture significantly influences the effectiveness of ICL, especially in low-data and high-complexity settings, it does not thoroughly explore or analyze which architectural features are most beneficial. A deeper investigation could be interesting.
Nonetheless, I don't think the two weaknesses here are significant. They are more of good-to-haves or future research.
Questions
N/A. See weakness.
We thank the reviewer for their time, feedback and positive evaluation of our work. We are encouraged that the reviewer found our work as innovative and providing a novel perspective, and address the questions and concerns raised by the reviewer below.
Generalization to non-iid data and complex real-world tasks
ICL is used as a general term for systems that learn conditioned on context without weight updates. As the reviewer astutely points out, this includes non-iid contexts such as natural language. In the majority of our experiments, we consider contexts as iid observations of supervised learning tasks, in line with a long history of studying such problems to better understand general ICL properties [1]. Our experiments on non-iid data in a task capturing temporal correlations akin to ones found in language (see Figure 3) suggest that the theory could be extended to non-iid data.
While we agree that extending the theory to non-iid data would be ideal, it is non-trivial to do so, as it would entail evaluating every possible ordering of the data to assess the best compression possible. That is, one would need to compute $\min_{\sigma} \sum_{i} -\log p\big(d_{\sigma(i)} \mid d_{\sigma(1)}, \ldots, d_{\sigma(i-1)}\big)$, minimizing over all orderings $\sigma$ of the data. Thus, a key challenge to extending our theory is an efficient way of computing these orderings.
Under-explored architectural dependencies
We agree with the reviewer that investigating the inductive biases and architectural choices that lead to better/worse minimization of prequential code length is a very interesting and important question. However, we believe that it is out-of-scope for our current work which is primarily intended as a theoretical result with confirmatory evidence that could subsequently drive future research on important questions like the role of architecture.
References
[1] Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023, July). Transformers learn in-context by gradient descent. In International Conference on Machine Learning (pp. 35151-35174). PMLR.
After carefully considering the feedback from fellow reviewers and re-evaluating the paper, I have decided to lower my score from 8 to 5. While I initially found the paper's novel perspective on in-context learning and Occam's razor intriguing, the concerns raised about the theoretical rigor and applicability of the work are significant. The lack of clarity in the theoretical justifications and the limited extension to non-IID and real-world tasks diminish the paper's contribution to the field. Additionally, the underexplored architectural dependencies and issues with experimental validity suggest that the work is not yet ready for acceptance. I encourage the authors to address these criticisms in future revisions to strengthen the paper's impact.
Here are the concerns I share with my fellow reviewers
- Theoretical Rigor and Clarity. The approximations made, particularly regarding the tightness of bounds and the relation to Kolmogorov complexity, are not sufficiently justified. There is concern that key claims rely on loose bounds or approximate equalities without adequate explanation or empirical validation (Reviewers YT7Y, nR2b).
- Novelty of Contributions. The connection between next-token prediction loss and prequential coding is not novel and has been previously discussed in related work, such as Delétang et al. The paper does not sufficiently acknowledge or build upon these existing contributions (Reviewer YT7Y).
- Experimental Validity. I share the same concerns about experimental setups introducing ambiguities or unfair comparisons, such as differing input formats between models and lack of appropriate regularization techniques in baselines (Reviewers M5Ep, 1nBD).
- Applicability to Non-IID Data. Reviewer 1nBD shares the same concern as I initially did. The theory primarily addresses IID data and does not extend to non-IID or complex real-world tasks, such as language modeling with large language models (LLMs). This limitation reduces the practical relevance of the work.
We thank the reviewer for their continued participation during this rebuttal period. We are disheartened that while the reviewer considered the feedback from other reviewers in their reassessment, they did not take our responses to each of them in consideration, despite posting this revision well after our replies were posted. In particular, we already addressed all the concerns that the reviewer raises in their re-evaluation, but we provide a thorough response below nonetheless.
Theoretical Rigor and Clarity
We currently do not make claims about the tightness of the bound, and while a tight bound is preferable, it is not necessary for theoretical rigor [1-5]. As we pointed out in our response to Reviewers YT7Y and nR2b, the prequential code length is still a valid upper bound, and minimizing an upper bound as a proxy for the quantity of interest is a well-established methodology; for example, optimizing the ELBO, which has applications in numerous probabilistic models (VAEs, diffusion models, etc.). We have amended equations in the main text to showcase the rigorous inequalities that compose this bound, as well as added further formal details.
Novelty of Contributions
We refer the reviewer to lines 254-255 where we clearly state the contribution of Delétang et al. and discuss it further in the related work; we would thus like more clarification from the reviewer as to how we should better acknowledge or build upon such contributions. Crucially, we believe that our contribution builds on the excellent work from Delétang et al. by considering the training of ICL models: whereas Delétang et al. show that certain sequence models have good inference-time compression abilities, our work gives a theoretical account of how those compression abilities emerge at training time and why they enable strong generalization through Occam’s razor.
Experimental Validity
While our baselines did use different input formats, we clearly stated it in our original draft. Further, our main contribution is the fact that generalization in in-context learning can be explained through prequential code length and Kolmogorov complexity; in particular the comparison between gradient based methods, train-risk ICL, and prequential ICL still holds.
In response to this concern, we refer the reviewer to the revised Appendix F.2 which provides a comparative analysis of prequential code length with different out of the box learning algorithms (e.g. with weight decay and other regularization).
Applicability to Non-IID Data
We refer the reviewer to our experiments on the hidden Markov model (HMM), which is a non-iid task. Additionally, while we acknowledge that extending the theory to non-iid data is very relevant future work, we strongly argue that the iid case for supervised learning is not a limitation, but a burgeoning use case of sequence models for general ML practice. In particular, a lot of current machine learning research on both supervised and unsupervised learning does rely on an iid assumption on the data. Even for language and other modalities, and even with current LLMs, different batches of sequences are still treated independently. Despite the theory providing strict bounds for iid data, we note in the revised text that further work has the potential to extend our result to non-iid data, and that our experiments confirm that the principle empirically holds for non-iid data. Like many theoretical contributions to ML in the past, idealized settings where the theory is more tractable have great value in driving application progress. As such, we argue that this paper has the potential for notable relevance in the field. We refer the reviewer to our original main rebuttal where we outline this point in great detail.
References
[1] Gastpar, Michael, et al. "Which Algorithms Have Tight Generalization Bounds?." arXiv preprint arXiv:2410.01969 (2024).
[2] Shalev-Shwartz, Shai, and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
[3] Bartlett, Peter L., Dylan J. Foster, and Matus J. Telgarsky. "Spectrally-normalized margin bounds for neural networks." Advances in neural information processing systems 30 (2017).
[4] Golowich, Noah, Alexander Rakhlin, and Ohad Shamir. "Size-independent sample complexity of neural networks." Conference On Learning Theory. PMLR, 2018.
[5] Arora, Sanjeev, et al. "Stronger generalization bounds for deep nets via a compression approach." International conference on machine learning. PMLR, 2018.
As the discussion period is coming to a close, we kindly ask the reviewer for a response and feedback on our last post, which we believe addresses a number of the points the reviewer raised.
This paper draws a connection between the next-token prediction objective used to train sequence models, when training with iid data, and a compression algorithm called prequential coding. They show that, if the model is updated after predicting each token, then the minimization of the cumulative loss corresponds to the minimization of the prequential coding objective, which serves as an upper bound for jointly minimizing the compression of the data plus the complexity of the model used. The authors also provide a set of experiments to corroborate their theoretical observations.
Strengths
The proposed connection between the minimization of the next-token prediction loss and the prequential coding algorithm is interesting. Intuitively as an observation, it makes sense that as the model is trained, it should learn to represent new data better, if there is any overlapping information between the different data points. In the specific setting, the data are iid and so the model should get better with each training point. It is also interesting that this loss can be connected with minimizing the complexity of the model.
Weaknesses
- In general, the ICL properties of models arise when training with the next-token prediction loss, without iid data. The current results do not cover next-token prediction in general.
- It seems that the current setting assumes that the model is updated each time a token is predicted, but isn’t it the case that when training a model autoregressively, the model is updated with a gradient step over the cumulative loss over all the tokens of the sequence?
Questions
- How is it ensured that the learner in Section 2.3 does not memorize? (“To forbid [the learner] from memorizing a single dataset, ..”)
- Could the authors clarify what would change if the data were not iid? Do any of the results hold? In general, ICL properties arise by simply training with the next-token prediction loss without iid data. Could any of the results be generalized?
- It seems that the current setting assumes that the model is updated each time a token is predicted, but isn’t it the case that when training a model autoregressively, the model is updated with a gradient step over the cumulative loss over all the tokens of the sequence? And so, the loss is not the objective of the prequential code (eq. 4). Could the authors elaborate on why these two are equivalent?
- What is standard SGD vs ICL? Do the authors mean that they simply use an MLP in which the examples are concatenated and given as input to the MLP, rather than being provided along the sequence dimension? I am not sure I understand this distinction, since the minimization of the cumulative next-token prediction loss also requires training a model with SGD. Could the authors clarify this setting more?
- In Section 3.2 the authors state: “For instance, when the learner is a Transformer, the expressivity of the model it implicitly fits to the context scales with the number of activations in the network, whereas the expressivity of a DNN trained through SGD scales with the number of weights.” A Transformer’s attention layer involves the multiplication of two matrices, while it also has parameters for each weight matrix. Could the authors elaborate on how they deduce that the expressivity of a Transformer scales with the number of activations, and why for a DNN it scales with the number of weights?
- Could the authors think of some other setting, which would not require to alter the architecture for training with the target of only minimizing the training loss?
- In Figure 3b, is the x-axis the data points seen during training or the context length? On the y-axis, is it the prequential code length or the error measured over some batch on the next token only? If the x-axis is the context length, how exactly is the generalization error measured?
- I think the paper would be improved, by focusing more on the last setting of experiments, in which the theory does not provide any guarantees, to understand whether similar results would hold in the case of non-iid data.
We thank the reviewer for their time, feedback and for the positive assessment of our work. We address the key questions raised by the reviewer below.
1) Preventing the learner from memorizing tasks
To put this question into context, memorization came into question because we need the complexity of the learner itself to be small, so that we minimize a tight bound on the PCL. If the learner's parameters memorize datasets, then this complexity could be large. Thus, we avoid this pitfall by meta-learning the parameters on a sufficiently large distribution of datasets – with a fixed capacity, memorizing each dataset becomes impossible.
2) and 8) Validity of the approach for non-iid data
ICL is a general term for systems that learn conditioned on context without weight updates. In most of our experiments, we consider supervised iid tasks in line with a long history of studying such problems to better understand general ICL properties [1]. Our experiments on non-iid data in a task capturing temporal correlations akin to ones found in language (see Figure 3) suggest that the theory could be extended to non-iid data.
While we agree that extending the theory to non-iid data would be ideal, it is non trivial to do so as it would entail evaluating every possible ordering of the data to assess the best compression possible. Thus, a key challenge to extending our theory is an efficient way of computing these orderings, which is out-of-scope of the current work.
3) The author’s setting assumes the model is updated with every token seen, but isn’t the gradient only back propagated on the summed autoregressive loss?
We believe that there may be a small confusion. It is important to distinguish between (1) the updates of the sequence model that parametrizes the in-context learner and (2) the updates of the prediction model inferred from the context. During the pre-training phase, the sequence model is trained “with gradient steps over the cumulative loss” as the reviewer noted, which updates the in-context learning algorithm it implements. However, the “model” whose complexity our theory pertains to is the prediction function specified by the sequence model conditioned on context. As explained in Section 2.4, this prediction function is parameterized by the latent activations after a forward pass of the sequence model given past tokens, and these latent activations do “update” as additional tokens are added to the context. The updates to this in-context model are not gradient updates, but are rather governed by whatever learning algorithm the sequence model implements after pre-training. Throughout the paper, we attempt to clearly indicate this distinction by calling the sequence model the “learner” and calling the implicit in-context model parameterized by latent activations the “model”.
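To illustrate the distinction (with hypothetical names such as `seq_model` and `predict` that are not the paper's API), the learner's weights are only ever updated during pre-training, while the in-context "model" is re-instantiated from activations by a forward pass over the context:

```python
import torch

@torch.no_grad()
def fit_in_context_model(seq_model, context_tokens):
    """Inference phase: the learner's weights (theta) are frozen; a single forward
    pass over the context yields latent activations that play the role of the
    parameters of the inferred prediction model."""
    hidden_state = seq_model(context_tokens)                  # no gradient step on theta
    return lambda query: seq_model.predict(hidden_state, query)

# Pre-training phase (not shown): theta is updated by gradient steps on the
# cumulative next-token loss across many tasks -- this trains the *learner*,
# i.e., the in-context learning algorithm itself.
```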
4) Comparing a gradient-based learner and in-context learner
We would first like to clarify that we are comparing properties of learners which can be either (a) in-context learning based, or (b) standard optimization based. ICL learners have trainable parameters that are pre-trained once, after which they take as input a dataset in-context to provide the model parameters. Optimization-based learners do not have trainable parameters and rely on optimization routines to provide model parameters (e.g. training a model with SGD using the training data). Note that in this case, the dataset is not provided as context to a model.
To compare learners that minimize PCL to learners that do not, we train an MLP using SGD to minimize training risk. Our theory predicts that especially in low-data regimes where minimizing train risk leads to overfitting, ICL should yield lower test error. This prediction was borne out in our empirical studies.
5) Clarify why the expressivity of a sequence model scales with the number of weights, but the expressivity of the model it fits from context scales with the number of activations
These scaling trends were meant to be approximate, based on linear layers with $n$ units and $n^2$ weights. As the reviewer points out, this is not always the case for all DNN layers, such as self-attention. Regardless, the “parameters” of the model implicitly fit from context do not include the attention weights, but only the latent layer activations, which the model can use to modulate predictions based on context.
Continued
6) Exploring alternative training objective that only minimizes training loss without architectural modifications
Unless we have misunderstood the reviewer, we believe that this suggestion aligns exactly with our experimental setup in Section 3.1, where we compare “prequential ICL” with “train-risk ICL” while keeping an identical architecture. The only difference between the two setups is whether the learned ICL algorithm minimizes error on a new data point (which minimizes PCL) or on a randomly chosen context point (which only minimizes training loss).
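Concretely, the only difference between the two setups can be summarized by which target the loss is computed on (a schematic sketch; the function and variable names are illustrative, not the paper's code):

```python
import random

def pick_training_target(context, held_out_point, objective):
    """Choose the (x, y) pair the in-context learner is asked to predict,
    given a context of observed pairs."""
    if objective == "prequential":
        # Predict an unseen next datapoint given the context:
        # minimizes the prequential code length (prequential ICL).
        return held_out_point
    else:  # "train_risk"
        # Predict a randomly chosen point already present in the context:
        # minimizes training error only (train-risk ICL).
        return random.choice(context)
```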
7) Clarification on plot legend and axis
In Figure 3b, the x-axis represents the context length. The y-axis, labeled as 'generalization error,' reflects the KL-divergence between the ground-truth and predicted next-token distributions (Figure 3b caption).
References
[1] Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023, July). Transformers learn in-context by gradient descent. In International Conference on Machine Learning (pp. 35151-35174). PMLR.
Dear reviewer,
We are very appreciative of your time and feedback. As we are nearing the end of the rebuttal period, we would like to request the reviewer for an opportunity to answer any additional questions or doubts that may remain.
Thank you for your response.
I would like to clarify some points: In Section 2.2 the authors mention that with each new data point the model is trained, and this is the section that point 3 (in your response) was referring to. Do the authors mean training of the underlying parameters of the model, or rather just a forward pass? My point is that if we need the model to be trained in such a manner for the provided bounds to hold, this is not how models are actually trained.
Also, in your point 4, you state that: "ICL learners have trainable parameters that are pre-trained once, after which they take as input a dataset in-context to provide the model parameters." I am again not sure I understand what trainable means here. Maybe updatable would be a more appropriate term?
We provide clarifications below.
Model and Learner: To better understand our bounds and their implications, we distinguish between two objects of interest – the model and the learner – each with its own parameters. A working example is an MLP (whose weights are the model's parameters) as the model and SGD (whose hyperparameters, e.g. learning rate, weight decay, etc., are the learner's parameters) as the learner.
Notion of Training: Our bounds concern the complexity of the model that is learned by the learner. The prequential coding algorithm requires one to re-train a new model every time new observations are provided, which with SGD implies running the SGD algorithm again each time. In contrast to gradient-based learners, the same setting under our in-context Transformer-based learner implies learning a new model for the additional data through only a forward pass.
Thus, we emphasize that the learner is pre-trained only once, and the model can be re-trained (or inferred, or updated) solely through a forward pass. Note that given additional context, the learner is not updated; only the model is. Another analogous way of thinking about this is to consider the learner as a meta-learner that provides the learned model, given any new observations, in a fast, efficient and scalable manner.
We hope that this resolves the reviewer’s concerns regarding our framing and we will update our manuscript to reflect this more clearly.
This paper first provides a theoretical argument that the success of ICL lies in its implicit optimization for both data fit and model simplicity, through the lens of data compression. It then examines the case where the training objective is changed to minimize training error alone instead of the prequential code length, and finds that this exhibits worse generalization performance compared to the standard next-token prediction error.
Strengths
The paper is well-written, the theory is interesting, and the experiments serve the points well.
Weaknesses
Some of the experiments are not rigorous enough. Please see questions below.
Questions
- Figure 2b: Why does the Transformer without a bottleneck perform worse than the one with a bottleneck? Intuitively, one would expect that a Transformer with a bottleneck would lose essential information necessary for predicting the query’s label, making this result seem suspicious.
- Regarding experimental details: I found that the Transformer without a bottleneck and the Transformer with a bottleneck were presented with different input formats—one with (x,y) concatenated and the other without. Why is this the case? This setup does not provide a fair comparison between the two models.
- In the setting where the Transformer is trained with train-risk ICL: Given a total sequence length k (following the notation in line 265), do you break the sequence into k subsequences of length j, where j ranges from 1 to k, or pass it as a whole sequence, relying on causal attention to prevent future information leakage? If it’s the latter, how do you select the query x? If x is not some x_i in the sequence, then it’s not guaranteed that the query x is included in the context x_{1:j}. If it is x_1, would this allow the model to learn a shortcut solution, potentially biasing its predictions?
- Following the previous question: If a sequence of length k is always broken into k subsequences, why use a decoder-only Transformer? If my understanding is correct, there should be no causality requirement in the context.
- Regarding the gap observed in performance: Why is the performance gap smaller for linear regression but larger for sinusoid regression and Mastermind? The authors attribute this to task difficulty, but the explanation feels vague. Fixing the function class and varying the problem dimension (as a more concrete indicator of task difficulty) might clarify this point, rather than relying on a vague explanation.
- Why does the MLP baseline generalize worse than the Transformer? Was model complexity minimized through regularization techniques, such as weight decay, in the MLP? This baseline offers limited insight into the results and seems to introduce some ambiguity. Additionally, what would be the Bayesian optimal model’s generalization error?
Updated after the discussion: Thank you for the authors in taking time to explain the results. My concern has been addressed, therefore I raise the score from 5 to 6.
Reason for not a higher score: This paper primarily focuses on small-scale experiments. While it provides an explanation from the perspective of Occam’s razor, it does not offer additional insights on improving small-scale training, let alone addressing real LLM cases (which may involve out-of-distribution scenarios and deviate from the meta-ICL format). Therefore, I don’t believe it merits an 8.
Reason for not a lower score: The paper presents an interesting perspective on how in-context learning works in controlled experimental settings. The experiments are well-conducted, and the findings contribute to the literature on understanding in-context learning in small-scale experiments.
We thank the reviewer for their time, feedback and for the positive assessment of our work. We address the key questions raised by the reviewer below, and hope that it addresses all the concerns raised.
Why does the bottleneck model perform better than the model w/o bottleneck?
The reviewer hypothesized that the result seems at odds with intuition because the bottleneck model might lose information about the task. We’d like to clarify that a bottleneck model can actually capture all the task information, since our problems involve task-latents of fixed size (e.g., regression weights, the code in Mastermind), where the bottleneck is of sufficient dimension to encode them.
The reviewer also made an astute observation: unlike the bottleneck model, the vanilla Transformer receives separate tokens for x and y as input, to which positional token embeddings are added. This (and causal attention masking) is required to allow predictions at all tokens in parallel without letting them access their corresponding y, and is the common modeling choice for Transformers in prior work [1,2]. However, this choice makes learning more difficult. Based on the reviewer’s feedback, we decided to investigate whether this detail explains the difference between models with and without a bottleneck, and we show in Appendix E that it does: forcing the bottleneck model to use separate x and y tokens eliminates its performance advantage, leading to a fair comparison between the models.
For train-risk ICL, is the attention causal, and how is the query selected without inducing shortcuts?
For in-context learners that minimize train risk, we do not input each context subsequence separately to the bottleneck model. Using causal attention, we can process all context subsequences with a single forward pass.
The query for each subsequence is randomly chosen from the context points in that subsequence. This does not lead to shortcuts, as the model does not know which point will be chosen as the query, and hence must learn a model that generalizes well to any choice of context point.
Clarifying task difficulty and its impact on performance gap
The reviewer astutely observes that the performance gap is smaller for linear regression but larger for sinusoidal regression and Mastermind. This is because a complex task requires the algorithm to learn more complex functions to successfully minimize train risk. However, learning more complex functions with very limited data leads to overfitting, which is the basis for our hypothesis that as task complexity increases, simple predictors learned by minimizing prequential code length enjoy a bigger advantage over predictors learned by minimizing train risk.
The reviewer also makes a great suggestion to systematically analyze this by fixing the function class and varying the problem dimension. We conduct the suggested experiment for sinusoid regression and show that with increasing task complexity, the difference between prequential code solution and train-risk solution increases, as shown in Appendix C, thereby validating our initial hypothesis.
Comparing a gradient-based learner and in-context learner
As the reviewer notes, we sought to empirically compare Transformers – learners with parameters that are trained to minimize PCL – to SGD – arguably the most popular learner in ML. Despite using early stopping when training MLPs with SGD, we found that the predictors produced by SGD continue to overfit in low-data regimes.
However, we think that the reviewer makes an excellent suggestion to compare the prequential code lengths achieved by different out-of-the-box learning algorithms that regularize vanilla SGD (e.g., weight decay), which we now show in Appendix F.2. As expected from standard learning theory, these regularization techniques have substantial impact on prequential code length: regularization reduces overfitting in low-data regimes as it favors simple models, but at the expense of underfitting in high-data regimes.
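As an illustration of the kind of comparison added in Appendix F.2 (a sketch only; the Gaussian predictive likelihood with a fixed noise scale is an assumption of this example, not the paper's exact protocol), the prequential code length of an off-the-shelf regularized learner can be computed by re-fitting it on growing prefixes:

```python
import numpy as np
from sklearn.linear_model import Ridge

def prequential_bits(X, y, alpha, sigma=0.5):
    """Prequential code length of (X, y) under ridge regression with strength `alpha`,
    assuming a Gaussian predictive likelihood with fixed std `sigma` (illustrative choice)."""
    bits = 0.0
    for t in range(1, len(y)):                           # the first point (empty context) is skipped here
        learner = Ridge(alpha=alpha).fit(X[:t], y[:t])   # re-train on the prefix, as prequential coding requires
        mu = learner.predict(X[t:t + 1])[0]
        log_p = -0.5 * ((y[t] - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
        bits += -log_p / np.log(2)                       # nats -> bits
    return bits

# Stronger regularization (larger alpha) typically shortens the code in low-data
# regimes (less overfitting) but can lengthen it in high-data regimes (underfitting).
```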
References
[1] Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023, July). Transformers learn in-context by gradient descent. In International Conference on Machine Learning (pp. 35151-35174). PMLR.
[2] Garg, Shivam, et al. "What can transformers learn in-context? a case study of simple function classes." Advances in Neural Information Processing Systems 35 (2022): 30583-30598.
Dear reviewer,
We are very appreciative of your time and feedback. As we are nearing the end of the rebuttal period, we would like to request the reviewer for an opportunity to answer any additional questions or doubts that may remain.
Thank you for addressing my concern and adding multiple experiments in the appendix section.
Upon reading the added experiments, I notice that Figure E.1 indicates that the Transformer with a bottleneck architecture cannot learn ICL at all (almost random results). This seems weird.
The comparison between prequential ICL and train-risk ICL (in Figure 2) should be done in a fair setting. By that, I mean that in the main paper it makes more sense to present Figure E.1's result as the prequential ICL, instead of Figure 2's truncated version of [x,y]. I understand this will downgrade the performance of prequential ICL, which is not what you want in the paper. Therefore I suggest the authors look for ways to improve the prequential ICL performance in Figure E.1. Since the code is not released, I cannot give detailed suggestions, but maybe try the curriculum learning applied in Garg et al.? This should be attributable to an optimization issue of the Transformer rather than the problem itself.
I am willing to raise the score if this could be addressed. Thanks for your time.
Dear reviewer, we appreciate your insightful feedback to improve the rigor of our experimental protocol.
We acknowledge the reviewer’s observation regarding inconsistent performance on the linear regression task for the Transformer with a bottleneck when trained using a split tokenization scheme ([x] ; [y] ; [x] ; [y]). We have taken the necessary steps to ensure consistency in the additional results introduced below.
Firstly, we would like to clarify that the result in Figure 2.a compares the performance between different training objectives (prequential ICL and train-risk ICL) keeping the architecture (Transformer with bottleneck) and tokenization scheme ([x, y] concatenated tokens) fixed. As such, the comparison between training objectives (Figure 2.a) demonstrating that prequential ICL outperforms train-risk ICL is done in a fair setting.
Secondly, regarding the suggestion to present the comparison between architectures (Figure 2.b) using the same tokenization scheme in the main paper instead of in the appendix, we would like to provide some clarifications on our initial motivations. Our intention was never to run a formal ablation study to quantitatively compare the ability of different architectures and tokenization-scheme at minimizing PCL, but simply to show that different design choices could qualitatively impact the ability of a learner to minimize PCL. In practice, different tokenization schemes were used to facilitate the implementation of each architecture. We agree that introducing these schemes was leading to unnecessary confusion and we thank the reviewer for highlighting this point. For this reason, we made rapid and considerable efforts to standardize the tokenization scheme across all models in the main paper. Specifically, we ran new experiments with the Transformer without bottleneck using concatenated tokens ([x,y] ; [x,y]...) and updated Figure 2.b accordingly. The new results remain consistent with our theory, while being significantly better especially on harder tasks. Consequently, we have also removed the appendix section that discussed the impact of various tokenization schemes, since this is not the focus of our contribution.
We also acknowledge the reviewer’s last suggestion to use curriculum learning to improve the prequential ICL performance. We believe this is an interesting line of research to tackle a more complex distribution of tasks (and not simply linear regression), while it’s unclear yet how to define a measure of complexity within the distribution of tasks we consider at training time. We are eager to pursue this line of inquiry and share in the reviewer's intuition, but we respectfully argue that this is beyond the scope of the present paper which introduces the foundational theory and verifies its direct predictions with experiments.
We hope that these revisions enhance the clarity of our findings, align better with the reviewer’s expectation, and that the reviewer will share our enthusiasm for sharing these findings with the community.
Dear reviewer,
As the discussion period is drawing to a close, we kindly ask if our latest response addresses your point which prevented you from strengthening your support. We believe that the paper is greatly improved thanks to your interventions, and we remain available to provide more details.
We are sincerely grateful to the reviewers for their comments and suggestions, and believe that the feedback has enabled meaningful improvements to our work. Moreover, we were encouraged to see that the reviewers highlighted that our theoretical insights were “interesting” (M5Ep07, 1nBD06), “novel” (YT7Y04, 3Zfk04), and supported by experiments that provided “comprehensive empirical validation” (3Zfk04, M5Ep07).
Nevertheless, we noted comments suggesting that our contribution is hindered because our theory doesn't immediately apply to large language models (LLMs). We'd like to push back against this view: this work's key contribution -- shedding light on the generalization properties of ICL-based predictors -- has implications that extend beyond LLMs. Indeed, our theory and empirical findings add to a growing body of work [e.g. Kirsch et al. 2022] finding that ICL models may have generalization properties distinct from those of models trained with conventional optimization methods, raising the possibility of using large sequence models and ICL to replace conventional optimization. Although our contributions stand on their own, we did test our theory on sequence data that mirrors natural language (Fig. 3), and found that our theory continues to predict empirical trends for non-iid data.
We do note that the reviewers had excellent suggestions for extra experiments and writing changes, which we implemented. We address each reviewer's questions by replying separately to their comments, but here, we summarize the key changes to our draft for ease. With these changes, we believe that we have addressed the reviewers’ concerns.
Rigor in the theory section. We discuss various approximations in the theory section, and reviewers YT7Y and nR2b wished to see these approximations made more precise. We have updated our draft to improve clarity on this front by showcasing our theory in a more rigorous fashion through the use of upper-bounds, which exactly capture the mathematical relations in question as opposed to approximate equalities. Our modifications make it clear that sequence models adhere to Occam’s razor by minimizing an upper-bound to training error and model complexity (in the same way we minimize the negative evidence lower bound for latent variable models).
Gradient-based versus in-context learners. The goal in our empirical studies is to compare learners that minimize prequential code length (e.g., Transformers trained to predict the next token) to learners that do not. As such, one of the learners we study is stochastic gradient descent (SGD), arguably the most popular learning algorithm for outputting a predictive function given data. In our experiments, to study the effects of SGD on a given prediction task, we train an MLP using SGD using training data from that task to minimize training risk. In contrast, sequence models like Transformers are also learners, but don’t perform gradient descent and have learnable parameters -- these parameters are chosen to minimize prequential code length. The goal of our paper is to study the properties of these learners in relation to Occam’s razor, and we find that they output predictors that generalize better, especially in low-data regimes (as predicted by our theory). Nevertheless, reviewers asked about comparing SGD with regularization to prequential coding-based learners, and we have added these experiments in Appendix F.2. Briefly, we find that regularized learners achieved better compression (i.e. lower prequential code length), which implies a stronger incentive toward simple models according to our theory. This experiment confirms the claim that regularization techniques serve as indirect Occam-aligned methods to learn simple models.
References
Von Oswald, J., Niklasson, E., Randazzo, E., Sacramento, J., Mordvintsev, A., Zhmoginov, A., & Vladymyrov, M. (2023, July). Transformers learn in-context by gradient descent. In International Conference on Machine Learning (pp. 35151-35174). PMLR.
Kirsch, L., Harrison, J., Sohl-Dickstein, J., & Metz, L. (2022). General-purpose in-context learning by meta-learning transformers. arXiv preprint arXiv:2212.04458.
The paper shows an analogy between the next-token loss optimized by in-context learners and the compression technique of prequential coding, which minimizes not only the model's training error but also model complexity. The reviewers appreciated this intriguing connection and the theory behind it. Criticisms were varied, ranging from concerns about fairness in experimental comparisons and lack of rigor concerning bounds to insufficient exploration of architectural dependencies and the limited scope (iid data). The breadth of these criticisms, combined with no reviewer strongly speaking up for the paper, leads me to recommend rejection.
Additional Comments on the Reviewer Discussion
Reviewer 3Zfk initially gave a score of 8 and then reduced to 5 upon further consideration. Across reviewers, the main improvements from rebuttals addressed fairness in experiments, clarified theoretical claims, and better aligned findings with predictions, leading to minor score increases to 6 for 3 reviewers. I largely ignored Reviewer nR2b who did not engage after their initial review.
Reject