PaperHub
Overall: 5.0 / 10 · Rejected (6 reviewers; ratings: min 3, max 8, std. dev. 1.7)
Individual ratings: 3, 3, 5, 8, 6, 5
Confidence: 3.0 · Correctness: 3.0 · Contribution: 2.0 · Presentation: 3.0
ICLR 2025

Reality Only Happens Once: Single-path Generalization Bounds for Transformers

OpenReview | PDF
Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We provide theoretical generalization bounds for Transformers on single trajectory generating processes like time-series paths.

Abstract

One of the inherent challenges in deploying transformers on time series is that reality only happens once; namely, one typically only has access to a single trajectory of the data-generating process comprised of non-i.i.d.\ observations. We derive non-asymptotic statistical guarantees in this setting through bounds on the generalization of a transformer network at a future-time $t$, given that it has been trained using $N\le t$ observations from a single perturbed trajectory of a bounded and exponentially ergodic Markov process. We obtain a generalization bound which effectively converges at the rate of $\mathcal{O}(1/\sqrt{N})$. Our bound depends explicitly on the activation function ($\operatorname{Swish}$, $\operatorname{GeLU}$, or $\tanh$ are considered), the number of self-attention heads, depth, width, and norm-bounds defining the transformer architecture. Our bound consists of three components: (I) The first quantifies the gap between the stationary distribution of the data-generating Markov process and its distribution at time $t$; this term converges exponentially to $0$. (II) The next term encodes the complexity of the transformer model and, given enough time, eventually converges to $0$ at the rate $\mathcal{O}(\log(N)^r/\sqrt{N})$ for any $r>0$. (III) The third term guarantees that the bound holds with probability at least $1-\delta$, and converges at a rate of $\mathcal{O}(\sqrt{\log(1/\delta)}/\sqrt{N})$. Examples of (non-i.i.d.) data-generating processes which we can treat are the projection of several SDEs onto a compact convex set $C$, and bounded Markov processes satisfying a log-Sobolev inequality.
Keywords
learning theory, transformers, generalization bounds, LLMs

Reviews and Discussion

Official Review (Rating: 3)

The paper provides generalization bounds for learning from ergodic/mixing/contracting stochastic processes and specializes these to transformer-type architectures via control of certain smoothness terms.

Strengths

  • I think the paper is mostly clearly written and well-structured.

  • The calculation of the derivatives of the transformer block is cool.

  • The topic and problem formulation are certainly timely. Unfortunately, I do not think the results really address the question in a way that was not more or less known a priori.

Weaknesses

I did not find the results to be surprising and my overall assessment is that the paper unfortunately lacks the novelty to be published at ICLR.

  • The (primary) contribution of the paper is way overstated and I'd argue that the "first to" claims are more or less unwarranted:

    • First of all, the fact that an IID result can be ported to the mixing/contracting setting is standard via a blocking argument (the modern version of this argument being due to Yu, but dating back to Bernstein)
    • The $t$-step ahead risk for a contracting Markov chain is morally equivalent to the time-averaged risk in most situations. There are already plenty of bounds for this situation in the literature that are easy to port over to the setting of the authors. Examples: [A,B,C,D]. In fact, I think it is concerning that standard results in learning theory about dependent processes are missing from the related work section (e.g., the work of Mohri and collaborators is completely absent from the reference section).
    • In other words: there is a certain degree of reinventing the wheel here, and the authors have unfortunately missed a line of related work that more or less implies results of this form.
    • Indeed, I would argue that there are sharper estimates available in the literature (while limited to the square loss): reference [B] does not require any type of contraction whatsoever, and reference [D] only has very mild dependency on such contractivity.
  • This also brings me to my second, perhaps more conceptual, point. To my understanding the authors seek to make the claim that their results are explanatory of LLMs. However, I don't believe natural language is well-explained by mixing stochastic processes: the words in a book certainly don't converge in any meaningful sense to a stationary distribution.

  • Arguably more minor, but the strict realizability assumption y=f(x) is rather more stringent than what has been assumed in past (uncited) work.

In sum, I think the results are mostly (1) standard uniform convergence arguments, (2) coupled with the stochastic regularity of Assumptions 2/3, and (3) an application of path-norm type bounds specialized to transformer-type architectures. In particular, (1/2) is entirely standard in my opinion and (3) (which is not stated as the primary contribution by the authors) does not really seem to be that surprising either in light of similar such bounds for deep nets.

[A] Mohri, Mehryar, and Afshin Rostamizadeh. "Stability Bounds for Stationary φ-mixing and β-mixing Processes." Journal of Machine Learning Research 11.2 (2010).

[B] Simchowitz, Max, et al. "Learning without mixing: Towards a sharp analysis of linear system identification." Conference On Learning Theory. PMLR, 2018.

[C] Foster, Dylan, Tuhin Sarkar, and Alexander Rakhlin. "Learning nonlinear dynamical systems from a single trajectory." Learning for Dynamics and Control. PMLR, 2020.

[D] Ziemann, Ingvar, and Stephen Tu. "Learning with little mixing." Advances in Neural Information Processing Systems 35 (2022): 4626-4637.

Questions

Is the reason your bound (ignoring the deviation term in delta) converges exponentially to 0 solely the fact y=f(x), i.e., strict realizability?

Given that the references [A-D], in my opinion, provide similar types of generalization/excess risk bounds, how would you say your work differentiates itself from these?

Comment

Novelty of Our Bounds

We appreciate the references to generalization bounds for hypotheses trained on non-i.i.d. data and have incorporated them into our revised version to clarify the unique contribution of our work.

Existing results focus on convergence rates for generalization in non-i.i.d. settings but lack explicit constants. Deriving such constants requires computing a Lipschitz (or Sobolev) bound for transformer classes—a technically complex task that forms the core of Theorem 2 in our paper. This theorem, whose proof is detailed on pages 27–31 and 33–44 of the appendix, represents the most technically demanding part of our analysis.

We address the mentioned literature explicitly here, with additional context in the general rebuttal and our response to reviewer 4UED.

Ziemann, Ingvar, and Stephen Tu. "Learning with little mixing."

To apply Theorem 4.1 of [1], one (at least) needs to compute the cardinality $|T_r|$ of an $r/\sqrt{8}$-net over the hypothesis class in the $L^2$-norm; otherwise, they do not have access to an explicit constant. Since our transformers are smooth, they are contained in some $C^s$ norm-ball $B(s,\rho)$ about some $f\in C(\mathbb{R}^{Md},\mathbb{R}^D)$ with minimal radius

$$0<\rho:= \sup_{T\in T}\, \|\ell(T,\cdot)\|_{C^s},$$

which we denote by $B(f,r)$, and whose radius is computed in our Theorem 2 (main technical result). As we see from Theorem 2.7.4 in [4], together with the identity in Equation (15.1.8) of [5], this requires computing the worst-case $C^s$-norm over our entire hypothesis class $T$ of smooth transformers, which is of the form

$$\log(|T_r|) \le \log\big(B_s(f_0,\rho)\big) \lesssim \rho^{d/s}\, r^{-d/s}.$$

Thus, their result does not give an explicit constant depending on the transformer class $T$ without the use of our main technical result: Theorem 2. Moreover, Theorem 1 is not a concentration inequality, and applying a basic result such as Markov's inequality may yield a loose concentration estimate.

Mohri, Mehryar, and Afshin Rostamizadeh. "Stability Bounds for Stationary φ-mixing and β-mixing Processes."

Theorem 11 of [2] does not apply in our setting, due to the assumption that the loss/cost function is uniformly bounded (by $M$), which significantly simplifies matters due to access to tools such as martingale difference inequalities (e.g. McDiarmid with non-i.i.d. data). We make no such restrictions.

Foster, Dylan, Tuhin Sarkar, and Alexander Rakhlin. "Learning nonlinear dynamical systems from a single trajectory."

Theorem 4 of [3] does not apply to our framework, as we are not assuming that the latent process takes the form of Equation (1) of [3], nor that the latent function being learned is a contraction (Definition 3 in [3]).

Relevance of Our Generalization Bound

Our generalization bound (Theorem 1) is novel due to its foundation in an optimal transport-based concept of exponential ergodicity. This condition is widely applicable, being either easy to verify or already established for many data-generating processes such as classical SDEs, McKean–Vlasov equations, and reflected SDEs. Additional details can be found in Section 2 and the general rebuttal.

Non-Realizability for Non-I.i.d. Data is Encoded via Stochastic Calculus

Since we accommodate non-i.i.d. data: if $f$ is twice continuously differentiable and our data is sampled from a strong solution $X_{\cdot}$ of a stochastic differential equation (or any Markovian semi-martingale), Itô's lemma implies

$$f(X_t) = f(X_0) + \underbrace{\int_0^t A_s\,ds}_{\text{signal}} + \underbrace{\int_0^t \sigma_s\, dW_s}_{\text{noise}}.$$

Thus, the Markov-setting with our assumptions always permits a "signal + noise" decomposition of the input data (as does the classical non-realizable i.i.d. case). In this way, we cover the non-realizable case.

We note that these remarks were given in Appendix G. Note also that, for lower-regularity functions, one can use Tanaka's variant of Itô's lemma and, for path-dependent functionals, Dupire's functional Itô calculus; in each case, there is a similar "signal + noise" decomposition of $f(X_t)$ which requires (non-i.i.d.) semi-martingality of the data.
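For concreteness, here is a minimal sketch of the decomposition above, under the additional illustrative assumptions (not stated in the quoted text) that $X$ solves the Itô SDE $dX_u=\mu(X_u)\,du+\sigma(X_u)\,dW_u$ and that $f$ is a scalar-valued $C^2$ function; Itô's lemma then identifies the integrands explicitly:

$$f(X_t)=f(X_0)+\underbrace{\int_0^t\Big(\nabla f(X_u)^{\top}\mu(X_u)+\tfrac12\operatorname{Tr}\big[\sigma(X_u)\sigma(X_u)^{\top}\nabla^2 f(X_u)\big]\Big)\,du}_{\text{signal } \int_0^t A_u\,du}+\underbrace{\int_0^t\nabla f(X_u)^{\top}\sigma(X_u)\,dW_u}_{\text{noise } \int_0^t \sigma_u\,dW_u}.$$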

References

[1] Ziemann, Ingvar, and Stephen Tu. "Learning with little mixing." Advances in Neural Information Processing Systems 35 (2022): 4626-4637.

[2] Mohri, Mehryar, and Afshin Rostamizadeh. "Stability Bounds for Stationary φ-mixing and β-mixing Processes." Journal of Machine Learning Research 11.2 (2010).

[3] Foster, Dylan, Tuhin Sarkar, and Alexander Rakhlin. "Learning nonlinear dynamical systems from a single trajectory." Learning for Dynamics and Control. PMLR, 2020.

[4] Vaart, AW van der, and Jon A. Wellner. "Empirical processes." Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, 2023. 127-384.

[5] Lorentz, George G., Manfred von Golitschek, and Yuly Makovoz. Constructive approximation: advanced problems. Vol. 304. Berlin: Springer, 1996.

Comment

Many thanks to the authors for their reply but my assessment remains unchanged.

Comment

Dear reviewer CDpD,

We are curious: why do you feel that our answer and modifications do not address your concerns, since clearly the linked papers alone (without our Theorem 2) cannot yield generalization bounds for Transformers trained on non-i.i.d. data?

Comment

My main concern was and remains a lack of novelty. While I appreciate your revision better engaging with the literature, this does not really address that there is very little novelty on top of: 1) basically a known generalization bound and 2) an instantiation to a particular class via an extension to transformers that is obtained by a known technique (Neyshabur et al.)---and as some of the other reviewers point out, I am not particularly convinced that this gives very tight/interpretable bounds.

Comment

We are a bit concerned about the validity of your claim that our work lacks novelty, as

  • There are no available bounds on the $C^k$ norms of a transformer (or even an MLP) elsewhere in the literature (other than our Theorem 2).

  • The perturbation analysis of Neyshabur et al. (2017/2018) considers the Lipschitz regularity of the MLP model as a function of its parameters and uses this to bound the covering number. Their work on path-norm bounds (2015) for the worst-case Lipschitz constant ($C^1$) is simply that; namely, they do not consider the $C^k$ regularity of either of these classes for $k>1$. Thus, Theorem 2 is completely novel (and highly non-trivial, as the proof requires new techniques such as combinatorial bounds on the coefficients resulting from Faà di Bruno's formula, totalling ~30 pages).

Consequently, your claim that Neyshabur et al. have previously done this is verifiably false.

Comment

I did not claim such a thing and I'd prefer to not engage with a strawman... But let me clarify: The question is not whether exactly what is in the paper has been previously published or not but rather if there is sufficient innovation over and above what already exists.

Comment

Our main point/reply is that bounds on higher-order perturbations ($k>1$) are much more involved to derive than Lipschitz-type perturbations ($k=1$), as they require new proof techniques and technical tools. So, in this way, our results are not incremental modifications of Neyshabur's technique; they require genuinely new technical machinery.

So it's not only about the passage from $k=1$ to $k>1$, but how that can actually be proven; and, later, how that translates into generalization bounds...

To clarify, regarding the strawman comment: we are, of course, not relying on the existence of an exact version of our results in the literature, but rather emphasizing that both the conclusion and the proof technique are novel.

Official Review (Rating: 3)

The paper derives generalization bounds for a Transformer model trained on a single time-series trajectory to generalize at future points of the time series. It departs from the common i.i.d. assumption and assumes a Markovian time series with all points in its trajectory bounded under some norm and an exponential ergodicity property (this is akin to a fast-mixing property for a Markov chain). To bound the growth rate of the generalization error with respect to the parameters of the Transformer model, the authors define a $C^s$ norm which approximately captures the maximum value of any of the first $s$ derivatives of the loss function with respect to the input. They then impose the condition that the $C^s$ norm grow at the rate of $O(s^r)$ for some $r>0$. The target function takes in the input $X_t$ at each time step and outputs a value $Y_t = f^*(X_t)$. Under this model the work derives a generalization bound which depends on 3 terms

  • (i) a term measuring how fast the chain converges to stationarity/ergodicity.
  • (ii) a model complexity term which depends on the partial derivatives of the Transformer model.
  • (iii) a probabilistic validity term to get the result to hold with high probability (as is common in such bounds)

A second result the paper provides is an explicit characterization of the model complexity term in the generalization bound above in terms of the $C^s$ norm bounds of different components of the Transformer model.

Strengths

The paper aims to go beyond the standard modeling assumption of i.i.d. data, which rarely ever truly holds in practice. Once you leave this assumption, analysis of even simple processes can become hard. Moreover, the paper assumes a single path throughout training instead of assuming restarts.

The paper's notion of Markov data-generating processes is general enough to capture multiple interesting settings.

Weaknesses

A key essence of a transformer model is that it is a sequence model. It models the interactions between different tokens, or $X_t$ for different values of $t$. It is very confusing and strange that the authors then proceed to assume $M=1$. At this point, the results are simply just for a fully connected neural net and nothing specific about the Transformer really comes into play?

If the process mixes fast (in roughly $\log(\kappa)$ steps), why can't we simply pay a multiplicative factor of $\log(\kappa)$ to obtain essentially i.i.d. samples and then use the i.i.d. generalization bounds?

The main body of the paper does a poor job of explaining what the phase transitions in the model complexity term are. Why do they arise in the first place? I.e., why does the convergence rate accelerate abruptly as $s$ increases? Some intuition would help.

The writing is overly heavy with notation in parts. A key definition (the $C^s$ norm) is only given in the Appendix, which makes it hard to build intuition while reading the beginning of the paper. It is unclear what the significance of the contributions of this paper is. There is very limited discussion around the seemingly implicit assumption of $M=1$, which severely limits what a transformer model is capable of.

Questions

  • Q1. If $Y_n$ only depends on $X_n$ for one time instance, what is the need for a sequence modeling framework like the transformer?
  • Q2. What is the Schwartz class?
  • Q3. Definition 2 uses the $C^s$ norm. Why is this presented before the actual definition of the $C^s$ norm?
  • Q4. What are the phase transitions in the model complexity term? Why do they arise?
  • Q5. If $M=1$, there is no point in using an attention layer, as all the information we need to predict $Y_t$ is present in $X_t$ and looking into the past does not offer any added information. Where do the bounds on model complexity change to reflect this?
Comment

The Markovianity Assumption

Our analysis naturally extends to cases where the data-generating process becomes Markovian in an extended, but finite, state space by setting $Md$ as the dimension of that extended space. This extension aligns seamlessly with our main technical result, Theorem 2, which allows for arbitrary finite $M$.

We believe this assumption is reasonable for most real-world data sources, ranging from NLP to mathematical finance, as these processes are often Markovian in an extended state space or approximately so. This approximation aligns with the quantitative version of the Echo-State property in reservoir computing (e.g., [1, 2]), as defined in Definitions 4.4 and 4.9 of [3], which is satisfied by many non-Markovian "general" SDEs (see Proposition 4.2 in [3]).

To reflect this, we have updated the draft to include this extension, which required minimal adjustments since the core technical framework was already established in Theorem 2. These revisions, highlighted in the updated manuscript, extend the state space of our process from $d$ to $Md$. However, we initially excluded this scenario because the resulting bounds become significantly more complex when $M$ is infinite—an avenue we aim to explore in future work.

Phase transitions and definition of $C^s$

Thank you for pointing out that this needs further explanation; we have clarified the relevant passages in our revised manuscript. Note that the $C^s$-norm, a special case of the Sobolev norm, is often assumed to be known due to its frequent use in analysis. Before we use it (above Definition 2), we point the reader to a detailed definition in the appendix.

Regarding the phase transition, we explained the meaning of $s$ in Section 3.1 (Model Complexity Term). The parameter $s$ represents the order of derivatives within the smoothness class $C^s$. Loosely speaking, our result provides, for each $s \in \mathbb{N}$, a distinct bound corresponding to a different level of smoothness. As the sample size increases, the tightest bound becomes the one associated with a higher derivative order $s$. When a tighter bound replaces a previous one, we refer to this transition as a "phase transition."
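As a hedged illustration only (the exact constants and thresholds are those of Theorem 1; the piecewise shape below merely mirrors the indicators $I_{N\in[\tau_s,\tau_{s+1})}$ appearing in the bound and the rates quoted elsewhere in this discussion):

$$\mathrm{rate}_s(N)\approx\begin{cases} N^{-s/d}, & d>2s,\\ \log(N)^r/\sqrt{N}, & d\le 2s,\end{cases}\qquad \text{Model Complexity Term}\approx\sum_{s\ge 1} C_s\,\mathrm{rate}_s(N)\,I_{N\in[\tau_s,\tau_{s+1})},$$

where, loosely, the constant $C_s$ grows with the derivative order $s$ while $\mathrm{rate}_s(N)$ improves with $s$; the threshold $\tau_s$ is then the sample size at which the order-$(s+1)$ bound overtakes the order-$s$ bound, and each such crossover is a "phase transition".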

Schwartz functions and Schwartz space

The Schwartz class, denoted $\mathcal{S}$, consists of functions that are infinitely differentiable and rapidly decreasing along with all their derivatives. This means that both the functions and their derivatives decay faster than any polynomial as the input approaches infinity. Schwartz functions are popular in mathematical analysis, particularly in Fourier analysis, where they ensure smoothness and rapid decay in both the spatial and frequency domains [4, 5]. In our manuscript, they simply serve as an exemplary class of loss functions satisfying Definition 2.

References

[1] Gonon, Lukas, and Juan-Pablo Ortega. "Fading memory echo state networks are universal." Neural Networks 138 (2021): 10-13.

[2] Grigoryeva, Lyudmila, and Juan-Pablo Ortega. "Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems." Journal of Machine Learning Research 19.24 (2018): 1-40.

[3] Acciaio, Beatrice, Anastasis Kratsios, and Gudmund Pammer. "Designing universal causal deep learning models: The geometric (hyper) transformer." Mathematical Finance 34.2 (2024): 671-735.

[4] Hörmander, L. (1990). The Analysis of Linear Partial Differential Operators I, (Distribution theory and Fourier Analysis) (2nd ed.). Berlin: Springer-Verlag. ISBN 3-540-52343-X.

[5] Trèves, François (1967). Topological Vector Spaces, Distributions and Kernels. Mineola, N.Y.: Dover Publications. ISBN 978-0-486-45352-1

Comment

Dear Reviewer 4qd3,

We are curious to hear your input since:

  1. Our answer confirms that the linked papers alone (without the use of our Theorem 2) cannot yield generalization bounds for Transformers trained on non-i.i.d. data.

  2. We have included the references you provided.

In this way, we cannot see how the concerns you brought up remain unresolved or unclear.

Official Review (Rating: 5)

This paper addresses a challenge in applying transformer models to time-series data: the difficulty in achieving generalization guarantees when only a single trajectory of non-i.i.d. observations is available. The authors provide non-asymptotic generalization bounds for transformers trained on a single perturbed trajectory of a bounded and exponentially ergodic Markov process. The primary contribution is a theoretical result that gives a future-time generalization bound of $O(1/\sqrt{N})$, where $N$ is the number of training observations. This bound is derived considering the properties of the Markov process and the architecture of the transformer model, including activation functions, depth, and number of attention heads.

Specifically, the paper decomposes the generalization bound into three components:

Non-Stationarity Term: The difference between the stationary distribution of the Markov process and its time-dependent distribution, converging to 0 exponentially.

Model Complexity Term: A term that captures the complexity of the transformer network in terms of the number of self-attention heads, depth, width, and the activation function, eventually converging to 0 at the rate of $O(\log(N)^r/\sqrt{N})$.

Probabilistic Validity Term: A term that guarantees the bound holds with probability at least $1-\delta$, and converges at a rate of $O(\sqrt{\log(1/\delta)/N})$.

The authors also obtain explicit estimates on the constants in these generalization bounds, which relied on explicitly bounding all the higher-order derivatives of transformers, considering their number of attention heads, activation functions, depth, width, and weight constraints.

Strengths

1. The paper is well-written, with each section presented in a logical order. The introduction is particularly effective in clearly outlining the main contributions of the paper and providing a thorough discussion of related work.

2. This paper provides the first set of statistical guarantees for transformers compatible with many time-series tasks. More broadly, it presents the first results that estimate the future-generalization of a learning model given that its performance has only been measured using present/past historical training data.

3. The decomposition of the generalization bound into intuitive components helps in understanding the influence of different aspects of the model and data, making the theoretical results more interpretable and accessible to practitioners. The authors' explicit focus on the impact of transformer architecture—such as the number of attention heads, activation functions, depth, and width—on generalization performance is highly informative.

4. The paper provides appropriate examples of non-i.i.d. data-generating processes satisfying their assumptions, which helps readers understand the applicability of the assumptions.

Weaknesses

1. In Theorem 1, the convergence rate of the Model Complexity Term $\mathrm{rate}_s(N)$ for $d>2s$ is only $O(1/N^{s/d})$, which implies that the generalization error bound may suffer significantly from the "curse of dimensionality" in high-dimensional settings or when the transformer's architecture lacks sufficient smoothness (i.e., the partial derivatives of the transformer are not sufficiently large).

2. Since the authors point out the connection between their assumptions on non-i.i.d. data-generating processes and denoising diffusion models, it would be beneficial to provide empirical validation or experiments to demonstrate how well the derived bounds apply to practical datasets. Including empirical studies would strengthen the practical relevance of the work.

3. The paper could provide a more formal definition and explanation of the parameter "s" to help readers understand its significance in the context of the generalization bound.

4. The assumptions on the Markov process, such as boundedness and exponential ergodicity, might limit the applicability of the results to more complex real-world scenarios. For example, several dynamical systems and financial markets have long-term memory and are non-Markovian, which could reduce the generalizability of the findings.

Questions

The theoretical results in this paper rely on a key result that the transformer architecture admits continuous partial derivatives of all orders. This result seems quite strict, as in practical applications it is often sufficient for the model to be smooth only up to first or second-order derivatives to ensure stable gradient calculations and feasible training. Could the authors provide examples of transformer architectures or scenarios where the model does not admit continuous partial derivatives of all orders?

What are the implications for the generalization bounds in cases where transformers do not admit continuous partial derivatives of all orders?

In summary, my main concern lies in the result that the transformer possesses continuous partial derivatives of all orders, as well as the possibility that the generalization error bound might suffer from the curse of dimensionality. If the authors could clarify these points, I would be inclined to rate this paper higher, as it is really well-written and addresses a critical gap in applying transformers to time series analysis.

Comment

Convergence Rate of the Model Complexity Term

The slow convergence rate $O(\log(N)^r / \sqrt{N})$ for the Model Complexity Term reflects a dependence on higher-order sensitivities, which can be challenging in high-dimensional settings. This trade-off is inherent in controlling the complexity of flexible models like transformers, and we kept the theoretical bounds as tight as possible under the stated assumptions.

The curse of dimensionality in transformers is, however, well-documented in the literature (see, e.g., [1]), making this property of our bound unsurprising.

Nevertheless, we argue that our article--by explicitly computing the bound and detailing the influence of various architectural choices (e.g., width, depth, activation function)--serves as a practical guide for users that are aware of the curse of dimensionality in transformers but seeking strategies to mitigate its impact on generalization.

Continuous Partial Derivatives of All Orders

Practical transformer implementations often rely on activation functions (e.g., Swish, GeLU) with continuous first- and second-order derivatives. While higher-order smoothness is not always necessary for stable training, it is critical for our theoretical guarantees to control sensitivities in high-dimensional spaces. Examples of architectures meeting these criteria include those using Swish or GeLU activations. For cases where these assumptions are relaxed, the generalization bounds would weaken correspondingly, as they depend on bounding the higher-order derivatives. In such cases, focusing on the most dominant terms (e.g., first or second derivatives) could yield approximate bounds that remain informative.
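As a hedged, purely illustrative sanity check (not part of the paper or its experiments), one can symbolically differentiate the Swish and exact (erf-based) GeLU activations a few times and confirm that the derivatives exist and remain bounded on a compact input range; the hypothetical snippet below uses sympy.

```python
import sympy as sp

x = sp.symbols('x', real=True)

# Swish and the exact (erf-based) GeLU activations.
swish = x / (1 + sp.exp(-x))                   # x * sigmoid(x)
gelu = x * (1 + sp.erf(x / sp.sqrt(2))) / 2    # x * Phi(x), Phi = standard normal CDF

for name, act in [("Swish", swish), ("GeLU", gelu)]:
    for k in range(1, 5):
        dk = sp.diff(act, x, k)                # the k-th derivative exists symbolically
        grid_vals = [abs(float(dk.subs(x, v))) for v in (-3, -1, 0, 1, 3)]
        print(f"{name}: max |{k}-th derivative| on grid = {max(grid_vals):.3f}")
```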

Experiments

As our paper focuses on theoretical questions and is already content-heavy, we prefer to explain our results in detail, instead of adding additional complexity by introducing experiments. We agree, however, that experiments can be very insightful, and we hope that our paper poses a good basis for a work with a more practical focus.

Formal Explanation of Parameter s

Thank you for pointing this out; we have clarified the corresponding passages in our revised version. In our initial manuscript we explained the meaning of $s$ in Section 3.1 (Model Complexity Term). The parameter $s$ represents the order of derivatives considered in the smoothness class $C^s$. Intuitively speaking, our result yields, for each $s \in \mathbb{N}$, a different bound, each for a different level of smoothness, and as the sample size grows, the tightest bound is the one of higher derivative order. When one bound is superseded by the next in terms of "tightness", we call this a phase transition.

Markovianity

We changed the formulation of our manuscript to take into account transformer models with finite sequence length $M \in \mathbb{N}$. We detailed this change (which is done via a Markovian lift) in the general rebuttal; please refer to it for more details.

We would expect that most real-life data sources from those in NLP to mathematical finance are Markovian in an extended state-space, or at least approximately so (in the sense of the quantitative version of the Echo-State property in reservoir computing (e.g. [2] and [3]) defined in Definitions 4.4 and 4.9 in [4], which are approximately satisfied by a broad range of non-Markovian "general" SDEs; see Proposition 4.2 [4]).

References

[1] Borde, Haitz Sáez de Ocáriz, et al. "Scalable Message Passing Neural Networks: No Need for Attention in Large Graph Representation Learning." arXiv preprint arXiv:2411.00835 (2024).

[2] Gonon, Lukas, and Juan-Pablo Ortega. "Fading memory echo state networks are universal." Neural Networks 138 (2021): 10-13.

[3] Grigoryeva, Lyudmila, and Juan-Pablo Ortega. "Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems." Journal of Machine Learning Research 19.24 (2018): 1-40.

[4] Acciaio, Beatrice, Anastasis Kratsios, and Gudmund Pammer. "Designing universal causal deep learning models: The geometric (hyper) transformer." Mathematical Finance 34.2 (2024): 671-735.

Comment

Thank you very much to the authors for their detailed response, which has addressed some of my concerns. However, I still have some reservations regarding the statement that the transformer architecture admits continuous partial derivatives of all orders. Additionally, while the authors connect their theory with denoising diffusion models, they have not provided any simple experiments to support this connection. Based on these considerations, I will maintain my current rating rather than increase it.

Comment

Dear reviewer PVvb,

We are curious to understand why you have reservations about the existence of partial derivatives of all orders for transformers with standard activation functions (e.g., Swish and GeLU), since any such transformer is a composition of smooth functions and is therefore smooth.

Additionally, experiments would not increase the quality of our theoretical contribution. Rather, they would only reduce the space available for explaining and unpacking our proof technique, so including them would reduce the quality of our 10-page paper.

Official Review (Rating: 8)

The paper titled "Reality Only Happens Once: Single-Path Generalization Bounds for Transformers" addresses a gap in the theoretical understanding of transformer models, particularly in non-i.i.d. settings commonly encountered in time-series data. Traditional statistical guarantees for transformers often rely on independent and identically distributed (i.i.d.) assumptions, which are seldom met in real-world applications such as natural language processing, finance, and reinforcement learning. This work extends the theoretical framework to single-path trajectories generated by bounded and exponentially ergodic Markov processes, providing non-asymptotic generalization bounds that converge at a rate of $\mathrm{O}\left(\frac{1}{\sqrt{N}}\right)$. The bounds are explicitly dependent on the transformer's architecture, including the activation function, number of self-attention heads, depth, width, and norm constraints.

Strengths

Addressing a Significant Gap: The paper tackles the under-explored area of providing statistical guarantees for transformers trained on non-i.i.d. data, which is highly relevant given the prevalence of sequential data in practical applications.

Explicit Generalization Bounds: By deriving explicit bounds that depend on various architectural parameters of transformers, the paper offers valuable insights into how different components of the model influence its generalization capabilities.

The paper situates itself well within the existing literature, contrasting its contributions with prior work on i.i.d. settings, in-context learning for linear transformers, and universal approximation theorems. This contextualization highlights the novelty and importance of the presented results.

Weaknesses

Although the bounds are explicit, the dependence on multiple architectural parameters may raise concerns about scalability. It would be useful to discuss how these bounds behave as transformers scale to very large sizes, which is common in modern applications.

I would like to see more discussion about the limitations (or the lack thereof) of the case $M=1$ in terms of how transformers are applied specifically to LLM data, etc.

Questions

The paper is done for the case $M=1$. For the case $M>1$, can you apply the results of the paper to the augmented Markov chain $Y_n = (X_{n-M+1}, \cdots, X_n)$?

Here is a list of typos I found. However, for the one on line 2462 I would appreciate a clarification.

  • Line 1124: the R^{(M)} should be R^M.
  • Line 1371: the expectation should be taken over the \nu measure.
  • Line 1409: in the statement of the proposition there are many typos, also in the proof of the proposition.
  • Line 2462: "(and can easily be seen to be Markovian since Brownian motion has the strong Markov property)".
  • Line 2508: typos "held, admitted" -> "holds, admits".

Details of Ethics Concerns

The paper "Reality Only Happens Once: Single-Path Generalization Bounds for Transformers" primarily focuses on the theoretical aspects of transformer models, specifically addressing generalization bounds in non-i.i.d. settings. Given its theoretical nature, the paper does not directly engage with ethical issues such as data privacy, fairness, or bias.

Comment

Scalability

We understand that transformers do not scale well; indeed, this is well-known in the literature (see, e.g., [1] for a discussion). However, the curse of dimensionality is necessarily present in both worst-case learning and approximation guarantees, such as those which are the object of study in our paper.

Nevertheless, we argue that our article--by explicitly computing the bound and detailing the influence of various architectural choices (e.g., width, depth, activation function)--serves as a practical guide for users that are aware of the curse of dimensionality in transformers but seeking strategies to mitigate its impact on generalization.

Generalization Bound for Finite-Sequence Transformers via Markovian Lift

We greatly appreciate your suggestion regarding the Markovian lift and have incorporated it into our revised manuscript. Specifically, we extended our generalization bound to cover arbitrary finite sequence lengths $M \in \mathbb{N}$, beyond the case $M = 1$ presented in the initial submission.

This extension is realized by employing the Markovian lift of the process, as you proposed, increasing the state space from $d$ to $Md$. Fortunately, this adaptation aligns seamlessly with our primary technical contribution, Theorem 2, which holds for arbitrary $M$. Detailed explanations are provided in our updated manuscript.

We initially excluded this generalization to focus the submission, as we plan to extend these results further to the case of infinite memory $M \to \infty$. While extending to infinite memory presents significant technical challenges, we believe that the current finite-memory results establish a robust foundation for this future work, which we aim to address in a separate paper.

Your insights have been invaluable in improving the scope and clarity of our work. Thank you once again for your thoughtful comments!

Typos

Thank you for pointing out those typos; we corrected them in the revised version. Line 2462 was unnecessarily convoluted and was simply intended to show that boundedness and Markovianity of the process are fulfilled by the definition of the process and the Markov property of Brownian motion.

References

[1] Borde, Haitz Sáez de Ocáriz, et al. "Scalable Message Passing Neural Networks: No Need for Attention in Large Graph Representation Learning." arXiv preprint arXiv:2411.00835 (2024).

Official Review (Rating: 6)

The paper gives generalization bounds for the transformer architecture. The transformer is assumed to be trained on $N$ i.i.d. sample strings coming from a certain type of Markov process, and generalization bounds converging at the rate of $O(1/N^{0.5})$ are shown.

The paper provides detailed calculations for the constant factor for the convergence rate bound, and studies how the bound is affected by changing the internal dimensions of the transformer model, as well as the activation function.

The work uses tools from stochastic calculus, together with novel bounds on the partial derivatives of transformers.

优点

  • The paper gives bounds for an entire collection of hyper-parameter values and compares the resulting bounds.
  • The presentation quality is good, and the introduction is well-written.
  • The paper gives bounds for the higher-order partial derivatives of a transformer, which are natural quantities to study in their own right and can potentially be useful for future research.

Weaknesses

  • Human language, which is the primary use case of transformers, is known not to satisfy the Markov property. This somewhat undermines the relevance of this work to the real-world generalization of transformers.
  • The work does not compare the generalization bounds given here with experimental data on the generalization performance of transformers. My understanding is that when it comes to the current literature on generalization bounds, theoretical bounds often diverge from experimental conclusions. This often limits the usefulness of theoretical bounds when it comes to guiding the practice of machine learning.

Questions

Are there some data-generating models that are more general than Markov processes, and for which one could hope to obtain generalization bounds for transformers?

Comment

The Markovianity Assumption

Indeed, our analysis can easily be extended to the case where the data-generating process becomes Markovian in an extended, but finite, state-space by setting $Md$ to be the dimension of that state-space. This can easily be done since our main technical result, Theorem 2, allows for general finite $M$.

We would expect that most real-life data sources from those in NLP to mathematical finance are Markovian in an extended state-space, or at least approximately so (in the sense of the quantitative version of the Echo-State property in reservoir computing (e.g. [1] and [2]) defined in Definitions 4.4 and 4.9 in [3], which are approximately satisfied by a broad range of non-Markovian "general" SDEs; see Proposition 4.2 [3]).

We have updated our draft accordingly, which can be realized with minimal changes, as the core technical work is already addressed in Theorem 2. We highlighted the changes in the revised article. The case where $M$ is finite follows directly by extending the state space of our process from $d$ to $Md$. We initially excluded this because the resulting bounds become highly non-trivial when $M$ is infinite, a direction we intend to explore in future research.

Experiments

As our paper focuses on theoretical questions and is already content-heavy, we prefer to explain our results rather than delve into experiments.

Though we agree that experiments can be insightful, we would like to highlight that we do provide numerical illustrations, both on page 88, explaining the role of each component in our bounds, and throughout Appendix B, which numerically computes our bounds and shows how various components of the transformer affect them: from the number of multi-head attention layers used to the choice of activation function.

References

[1] Gonon, Lukas, and Juan-Pablo Ortega. "Fading memory echo state networks are universal." Neural Networks 138 (2021): 10-13.

[2] Grigoryeva, Lyudmila, and Juan-Pablo Ortega. "Universal discrete-time reservoir computers with stochastic inputs and linear readouts using non-homogeneous state-affine systems." Journal of Machine Learning Research 19.24 (2018): 1-40.

[3] Acciaio, Beatrice, Anastasis Kratsios, and Gudmund Pammer. "Designing universal causal deep learning models: The geometric (hyper) transformer." Mathematical Finance 34.2 (2024): 671-735.

Comment

Thank you for your response. I will maintain my score at 6. I appreciate you having numerical illustrations for your bounds, but I think for the score to be 7 or above there need to be experiments showing that your generalization bounds indeed match observed generalization performance of transformers (at least in some settings).

Official Review (Rating: 5)

This paper studies the generalization error of transformers in the case of non-i.i.d. data under the realizability assumption. In particular, it is assumed that the data is generated by a discrete-time Markov process, and the generalization error is measured as the difference between the empirical sequential risk up to the first $N$ generated examples and the risk with respect to the distribution at a future time $t \in [N, \infty]$ (where the risk at $t = \infty$ is with respect to the stationary distribution of the data-generating Markov process). The study of such a notion of generalization error (here called future-generalization error) aims at removing the standard i.i.d. assumption, which is unnatural for modeling the environments which transformers are usually employed in. The authors provide non-trivial generalization bounds for the hypothesis classes consisting of transformer architectures, thus giving insights relative to the inherent properties of these models.

优点

This work is well-motivated and tackles an important problem in the context of transformers. Transformers are currently one of the most used models in multiple domains, and understanding their generalization properties is definitely a relevant research direction. The paper has a good structure and the results are clearly presented. The specification of the generalization bounds to classes of transformer architectures is thus interesting and the authors provide a good and thorough discussion on the implications of their results. Consequently, the results contained in this work allow the reader to gain further insights into the interplay between structural properties of transformers and their generalization capabilities.

Weaknesses

Although the paper supports the results with reasonable motivations and a detailed presentation of related work, its main weakness is that the main results are not novel. More precisely, the study of generalization errors in the case of non-i.i.d. data has been thoroughly studied in some notable work (e.g., see [1-3]) up to more than 15 years ago, with some ideas even dating back to [4]. Not only that, but those results appear to potentially subsume the analysis of the future-generalization error (the central part of this submission), considering both the conditional and the unconditional "future risks" with respect to the historical data $(X_i)_{i \in [N]}$, as well as the more general non-stationary processes such as $\beta$-mixing ones. These already available generalization bounds do not even require the specialization to the transformer model, introducing a general dependence on the Rademacher complexity of the hypothesis class; hence, it is unclear whether it would have simply sufficed to study the Rademacher complexity of the class of transformers in order to derive analogous results to the ones provided in the current submission.

This line of work has been overlooked as a whole, which is a significant and nonnegligible shortcoming of this submission. This additionally renders claims such as "this is the first result which estimates the future-generalization [...] of a learning model given that its performance has only been measured using present/past historical training data" (lines 77-81) false; this claim is even presented as the primary contribution of the paper, which is misleading. The authors should have provided a more detailed discussion on how their results relate to the existing literature on generalization bounds for non-i.i.d. processes.

References:

[1] Mohri & Rostamizadeh. "Stability Bounds for Non-i.i.d. Processes". NeurIPS 2007.

[2] Mohri & Rostamizadeh. "Rademacher Complexity Bounds for Non-I.I.D. Processes". NeurIPS 2008.

[3] Kuznetsov & Mohri. "Generalization bounds for non-stationary mixing processes". Machine Learning, vol. 106, 2017.

[4] Yu. "Rates of convergence for empirical processes of stationary mixing sequences". The Annals of Probability, vol. 22, 1994.

Questions

  • Could you comment further on how your results relate to the existing literature on generalization bounds for non-i.i.d. processes (especially the references mentioned above)? In particular, is there any contribution in this submission that goes beyond (i.e., is not implied by) any of the existing results?
  • Could you provide more details about the need of the Markovianity assumption in the data-generating process relative to the proofs of your results? What are the main obstacles in lifting this assumption?
  • At line 56, you define the $t$-future risk $\mathcal{R}_t(\mathcal{T})$ as an expectation over $X_t$. Is this conditioned or not with respect to the data points $X_1, \dots, X_N$, which $\mathcal{T}$ can depend on?
  • Please, address any other issues outlined above.

Minor comments and typos:

  • Parenthesization of citations is sometimes incorrect or odd (e.g., line 32).
  • In the abstract, use "$1-\delta$" instead of "1-$\delta$".
  • At the math displays of lines 56 and 61, do you mean to have $\mathcal{T}$ as the argument of the risks instead of $f$?
  • Line 71, "with probability of at least" should be "with probability at least".
  • Line 95, "instance-dependant" should be "instance-dependent".
  • Line 105, "multilayer perceptions" should be "multilayer perceptrons".
  • Line 120, "by Markov processor at least" should be "by a Markov process or at least".
  • Line 129, "this means that that" should be "this means that".
  • Lines 139-141, the sentence is quite unclear as the phrasing is convoluted. Please, revise.
  • Line 150, "are all most equal" should be "are almost equal".
  • Lines 319-320, "convergence rate acceleration are expressed using" should be "convergence rate acceleration using".
  • Math display at line 330: is $I_{N \in [\tau_s, \tau_{s+1})}$ supposed to be the indicator function? If so, I'm not sure this has been explicitly defined before.
Comment

Novelty of our bounds

We thank you for providing references to generalization bounds for hypotheses trained on non-i.i.d. data. We have now incorporated these references in our revised version, highlighting how they clarify the gap our work addresses.

While these results provide convergence rates for transformer generalization in non-i.i.d. settings, they do not directly yield explicit constants. As we will demonstrate, deriving such constants requires computing a Lipschitz (or Sobolev) bound for transformer classes—a technically challenging task that we address in Theorem 2 of our paper. This result, detailed in pages 27-31 and 33-44 of our appendix, represents the most challenging technical component of our analysis.

We elaborated this point in the general rebuttal, and will comment here explicitly on the literature you mentioned. Please find further examples in our response to reviewer cDpD.

Mohri, Mehryar, and Afshin Rostamizadeh. "Rademacher complexity bounds for non-iid processes."

Whilst our claim holds true for all the articles referenced, [1] is arguably the closest to our work. Here, Theorem 2 of [1] bounds the generalization of the class $T$ trained on non-i.i.d. data using the empirical Rademacher complexity. In order to apply their result concretely to a transformer class $T$, one needs to apply a chaining argument (as mentioned in lines 143-150 of our paper), e.g. using Proposition 5.17 of [2]. In order to do so explicitly, one again needs to bound the covering number of $B_s(f_0,\rho)$, which can only be done by computing $\rho$; this computation is performed in Theorem 2 of our paper.
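To make the dependence on $\rho$ explicit, here is a hedged sketch of the chaining step we have in mind (a Dudley-type entropy-integral bound on the empirical Rademacher complexity; the precise constants, metric, and statement are those of the cited results, and the metric-entropy estimate is the one quoted in our reply above):

$$\widehat{\mathfrak{R}}_N(\ell\circ T)\;\lesssim\;\inf_{\alpha>0}\left\{\alpha+\frac{1}{\sqrt{N}}\int_{\alpha}^{\rho}\sqrt{\log \mathcal{N}\big(r,\ell\circ T,\|\cdot\|\big)}\,dr\right\}\;\lesssim\;\inf_{\alpha>0}\left\{\alpha+\frac{\rho^{d/(2s)}}{\sqrt{N}}\int_{\alpha}^{\rho}r^{-d/(2s)}\,dr\right\},$$

so any explicit numerical bound hinges on the radius $\rho$, i.e. the worst-case $C^s$-norm over the transformer class, which is exactly what our Theorem 2 supplies.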

Relevance of our generalization bound

In light of the existence of the results you mentioned, we would like to point out that our generalization bound, as presented in Theorem 1, is novel in the sense that it is based on an optimal transport-theoretic concept of exponential ergodicity. This condition is straightforward to verify, or already established, for many data-generating processes, as confirmed by a growing body of literature. Examples include classical SDEs, McKean-Vlasov equations, and reflected SDEs. Please refer to the general rebuttal and Section 2 of our paper for more details.

The Markovianity Assumption

We rely on Markovianity in our Wasserstein concentration of measure argument.

As we pointed out in the general rebuttal, we changed the notation of our draft to include models of finite sequence length $M \in \mathbb{N}$. This constitutes only a marginal change and can be easily realized by requiring that the Markovian lift of the underlying process fulfils our assumptions. Our main technical tool, namely Theorem 2, was already framed for the case where $M$ is positive and finite.

In principle, one can also consider processes which become Markovian when lifted to infinite dimensions, i.e. when their entire history is included. Though one could modify our results using the concentration of measure arguments in [3], in this case the only Wasserstein concentration of measure results that we are aware of in infinite dimensions are for i.i.d. or Gaussian data, and thus they do not cover the non-i.i.d. case. Thus, though developing such concentration of measure results is interesting, it is beyond the scope of our paper as it would not apply solely to transformers.

Conditioning in Future Expectation

No, the future risk is defined with respect to the law of $X_t$, not its conditional law. This way, the future risk is a (deterministic) number; otherwise, it would be random and estimating it would be highly challenging (similar issues to the presence of common noise in mean-field games would appear, which would obscure the main point of our paper).
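In symbols (a hedged paraphrase of the definition at line 56, using the realizability $Y_t=f^\star(X_t)$ and a generic loss $\ell$, which are not restated verbatim here):

$$\mathcal{R}_t(\mathcal{T})\;=\;\mathbb{E}_{X_t\sim\mu_t}\big[\ell\big(\mathcal{T}(X_t),\,f^\star(X_t)\big)\big],\qquad \mu_t:=\mathrm{Law}(X_t),$$

i.e. the expectation is taken against the unconditional marginal law $\mu_t$, so $\mathcal{R}_t(\mathcal{T})$ is a fixed number for each fixed hypothesis $\mathcal{T}$.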

Minor comments and typos

We thank you for identifying these typos, and we have corrected them accordingly. Thanks!

References

[1] Mohri, Mehryar, and Afshin Rostamizadeh. "Rademacher complexity bounds for non-i.i.d. processes." Advances in Neural Information Processing Systems 21 (2008).

[2] Wainwright, Martin J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Vol. 48. Cambridge University Press, 2019.

[3] Benitez, Jose Antonio Lara, et al. "Out-of-distributional risk bounds for neural operators with applications to the Helmholtz equation." Journal of Computational Physics (2024): 113168.

Comment

I thank the authors for the detailed response. In light of the comments provided, I am willing to reconsider my score. It seems, however, that other reviewers share some of my concerns, and it feels that the paper requires considering prior work in more detail, in particular on generalization bounds for non-i.i.d. data (e.g., by studying the other quantities that measure the complexity of the hypothesis class) and already available complexity bounds for similar hypothesis classes (as this paper's contribution ultimately revolves around this type of result). This kind of change usually requires a substantial revision. Nevertheless, I want to remark that the insights provided specifically for the generalization capabilities of transformers are quite interesting, hence why I am increasing my score.

Comment

General Rebuttal

We would like to thank all reviewers for their valuable feedback and insights. We are in particular grateful for pointing out additional relevant literature. We were unaware of these papers earlier, and we have now included the suggested references in our revised version, which help emphasize which gap in the literature our contribution fills.

This leads to the following three key aspects that we will comment on, starting with 1. the novelty of explicit bound computation; followed by 2. the relevance of generalization bounds in the setting of an optimal transport-theoretic notion of exponential ergodicity; and finally 3. the integration of finite sequence lengths $M \in \mathbb{N}$ into our analysis through a Markovian lift.

Previous bounds do not yield explicit bounds for transformers

We thank the referees for pointing out these papers studying uniform generalization bounds for hypothesis classes trained using non-i.i.d. data, also with a mixing/ergodicity requirement. Though these results can be used to obtain convergence rates for the generalization of transformers trained on non-i.i.d. data, they do not immediately yield explicit constants. An explicit constant requires, as we shall demonstrate below, the computation of a Lipschitz bound (or Sobolev bound) for a transformer class. From our perspective, the computation of this bound (Theorem 2 in our article) has been non-trivial and has constituted the most challenging technical component of our analysis (the proof of Theorem 2 is derived on pages 27-31 and then 33-44 of our appendix).

We briefly explain how Theorem 2 is a necessity to derive explicit bounds in the referenced literature by examining [4] in the response to reviewer 4UED, as well as [1,2,3] in the response to reviewer cDpD.

In short, none of these results yield explicit non-asymptotic bounds, i.e. explicit constants, without the use of our Theorem 2. While we agree that we were incorrect in the assumption that we were the first to obtain generalization bounds for non-i.i.d. data under a mixing condition, we believe that the derivation of an explicit generalization bound for a transformer class trained on non-i.i.d. data constitutes a novelty.

Relevance of generalization bounds in the setting of optimal transport-theoretic notion of exponential ergodicity

As noted above, we would like to clarify our "first-of" claim regarding the derivation of a non-i.i.d. generalization bound. Nevertheless, we maintain that our generalization bound, presented in Theorem 1, represents a significant contribution to the field, primarily due to the setting upon which it is based.

Namely, a notable aspect of our generalization bound is its reliance on a recently well-studied optimal transport-theoretic concept of exponential ergodicity, which is straightforward to verify—or already well-established—for most data-generating processes. In fact, a substantial and expanding body of literature confirms that a wide variety of standard processes satisfy this mixing condition (Assumption 2 in our paper), including classical SDEs, McKean-Vlasov equations, and reflected SDEs. Several examples are presented in Section 2.

Generalization bound for finite-sequence transformers via Markovian lift

In our initial submission, we limited our generalization bound to the case $M=1$, that is, the transformer operating on a single sequence element. We changed this in the revised version to include any finite sequence length $M \in \mathbb{N}$.

This can be realized by a simple Markovian lift of the process (as also suggested by reviewer rRQW), extending the state space of our process from $d$ to $Md$. The change is minimal as our main technical contribution, Theorem 2, holds for arbitrary $M$. For more details, we refer to our responses to the reviews by AUzm, rRQW, 4qd3, and PVvb.
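As a purely illustrative sketch of the lift (hypothetical helper code, not the paper's implementation): the lifted state $Z_n=(X_{n-M+1},\dots,X_n)\in\mathbb{R}^{Md}$ simply stacks a sliding window of the last $M$ observations, so a process whose next step depends only on its last $M$ states becomes Markovian in the enlarged space.

```python
import numpy as np

def markovian_lift(trajectory: np.ndarray, M: int) -> np.ndarray:
    """Lift a single path of shape (T, d) to sliding-window states of shape (T - M + 1, M * d).

    Each lifted state Z_n = (X_{n-M+1}, ..., X_n) stacks the last M observations,
    so a process with memory length M becomes Markovian in R^{M*d}.
    """
    T, d = trajectory.shape
    if not 1 <= M <= T:
        raise ValueError("window length M must satisfy 1 <= M <= T")
    return np.stack([trajectory[n - M + 1 : n + 1].ravel() for n in range(M - 1, T)])

# Toy usage: an AR(2)-style path is not Markov in R^1, but its M = 2 lift is Markov in R^2.
rng = np.random.default_rng(0)
X = np.zeros((200, 1))
for n in range(2, 200):
    X[n] = 0.5 * X[n - 1] - 0.3 * X[n - 2] + 0.1 * rng.standard_normal((1,))
Z = markovian_lift(X, M=2)
print(X.shape, Z.shape)  # (200, 1) (199, 2)
```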

We did not include this in our initial submission as we are planning to extend our result to the case of infinite memory $M$. While this extension is non-trivial, we believe the current setting provides a strong foundation for tackling the infinite memory case. However, given the complexity of this direction, we see it as a substantial effort warranting a separate article.

References

[1] Ziemann, Ingvar, and Stephen Tu. "Learning with little mixing." Advances in Neural Information Processing Systems 35 (2022): 4626-4637.

[2] Mohri, Mehryar, and Afshin Rostamizadeh. "Stability Bounds for Stationary φ-mixing and β-mixing Processes." Journal of Machine Learning Research 11.2 (2010).

[3] Foster, Dylan, Tuhin Sarkar, and Alexander Rakhlin. "Learning nonlinear dynamical systems from a single trajectory." Learning for Dynamics and Control. PMLR, 2020.

[4] Mohri, Mehryar, and Afshin Rostamizadeh. "Rademacher complexity bounds for non-iid processes." Advances in neural information processing systems 21 (2008).

AC Meta-Review

This paper incited a lot of discussion with the reviewers. In the end, the reviewers agree that there is some value in the paper, but the general consensus is that the paper is not ready in its current form to be accepted to a top venue. I agree with this assessment. The paper could use some more cycles in placing the work well within the existing literature, specifically non-i.i.d. settings and relevance to the specific transformer architecture, as well as other measures of complexity of similar hypothesis classes, adding clarifications on how it differs from other existing works. To further cement the contributions towards understanding transformers, I disagree with the authors that empirical results would not be relevant (they suggested that the page-limit space is better spent elsewhere in explaining the theory better); I think the specific architecture they are targeting merits some practical experiments too, to illustrate how well the suggested theory aligns with practice, even if it means moving parts of the theory into the appendix.

Additional Comments on Reviewer Discussion

The reviewers largely had concerns about the novelty and the claimed pertinence of the proposed bounds for the transformer architecture. The reviewers brought up concerns that the relevance to transformers is minimal and that the bounds are somewhat trivially implied by an application of previously known results and methods. This warrants more clarification that was not provided during the rebuttal and will likely require significant changes to the writing. Similarly, reviewer 4UED brought up the need to further clarify how the non-i.i.d. property is being handled in the paper, which requires non-trivial re-writes that were not done during the rebuttal. The one positive review (rating 8) did not push back against the overwhelmingly negative comments made by other reviewers, which brought up valid concerns.

Final Decision

Reject