PaperHub
7.2 / 10
Poster · 4 reviewers
Scores: 4, 4, 4, 3 (min 3, max 4, std 0.4)
ICML 2025

Generalization of noisy SGD in unbounded non-convex settings

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
Information-theoretic generalization, Langevin, SGD, differential privacy

Reviews and Discussion

Review
Rating: 4

This paper provides generalization error and differential privacy guarantees for the stochastic gradient Langevin dynamics algorithm (SGLD). This is done by proving bounds on either a KL divergence (for generalization error) or a Rényi divergence (for differential privacy). In accordance with several prior works, the main technical elements are a stability argument combined with a bound on the KL divergence featuring a decay term, i.e., a bound of the form $\mathrm{KL}(k+1) \leq \gamma\,\mathrm{KL}(k) + C$ with $\gamma < 1$. As has been identified in the prior art, such an inequality is a consequence of the distribution of the iterates satisfying a logarithmic Sobolev inequality (LSI). Unfortunately, prior works either relied on strong assumptions (e.g., strong convexity) or did not prove a time-uniform bound on this log-Sobolev constant, hence leading to vanishing decay. Moreover, when dealing with the discrete-time algorithm, many prior works provide bounds that do not vanish as $n \to \infty$.
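For context, unrolling a recursion of this form is what yields a time-uniform bound (a standard computation, stated here only as a reminder):

$$\mathrm{KL}(k) \;\le\; \gamma^{k}\,\mathrm{KL}(0) + C\sum_{j=0}^{k-1}\gamma^{j} \;=\; \gamma^{k}\,\mathrm{KL}(0) + C\,\frac{1-\gamma^{k}}{1-\gamma} \;\le\; \mathrm{KL}(0) + \frac{C}{1-\gamma},$$

so with a constant $\gamma < 1$ the bound does not grow with the iteration count $k$, whereas a decay factor $\gamma_k \to 1$ lets the geometric sum degrade to roughly $kC$.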

In this paper, the authors tackle the above issue by decomposing each iteration into a gradient and a contraction step. Along with other technical elements, they use it to prove that under previously considered dissipativity and pseudo-Lipschitz assumptions, the log-Sobolev constants can be time-uniformly bounded, hence leading to improved generalization bounds. Finally, the authors relax the dissipativity assumption by introducing ergodicity constants characterising the convergence of the iterates to the Gibbs distribution.
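For concreteness, the update being analyzed is presumably the standard noisy-SGD/SGLD step (the paper's exact notation may differ; this is only a reminder of the generic form):

$$X_{k+1} \;=\; X_k - \eta\,\widehat{\nabla F}(X_k) + \sqrt{2\eta/\beta}\;\xi_k, \qquad \xi_k \sim \mathcal{N}(0, I_d),$$

and the decomposition mentioned above separates the deterministic gradient map from the Gaussian noise step.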

Questions for Authors

In addition to all the points mentioned above, I would have the following questions:

  • Is there a reason why the paper of [Li et al., 2019] is not presented in Table 1?

  • You mention that the bound of [Mou et al.] has a degrading decay factor. For me, this is only true for the results of [Mou et al.] in the absence of regularisation. With regularisation, they obtain a constant decay. Could you discuss the comparison with their bounds in more detail?

  • Your results focus on the discrete-time case; do you consider that previous work provided optimal bounds in the continuous-time case?

Claims and Evidence

The main claims (regarding time-uniform bounds under dissipativity assumptions via a uniform bound on the log-Sobolev constant) are well introduced and supported by a rigorous mathematical theory. The proof that SGLD iterates have uniformly bounded log-Sobolev constants is a strong result.

Regarding these claims, my only remarks would be to (i) mention the pseudo-Lipschitz assumption in the introduction, as this assumption is essential to the stability approach (and might not be mild), and (ii) discuss the dimension dependence that is obtained, which might be problematic in overparameterized settings.

The claim that dissipativity can be relaxed through ergodicity conditions is also well-supported by mathematical proofs. However, it is claimed at line 388 that these bounds have an improved dimension dependence. Looking at the constants appearing in Corollary 6.6, this does not seem to be the case, which might require more discussion.

Methods and Evaluation Criteria

The proposed proof technique is based on several existing works on this topic and the authors provide very nice additions to the existing techniques.

Theoretical Claims

I checked the correctness of most theoretical claims, apart from some details in the proof of Corollary 6.6 regarding the exact expression of the error terms.

Here are the remarks I have (please correct me if I am wrong):

  • Theorem 4.1: I find Definition A.1 confusing; in particular, it is not clear whether $\Lambda_q$ means that we integrate over the $X_k'$ in $p_{X_k'}$ or if we integrate the function $p_{X_k}/p_{X_k'}$ over the distribution of $X_k'$ (for this reason I mainly checked the result when $q=1$). I think a few additional details would improve clarity a lot.

  • Theorem 4.1: in the expectation in the statement, there should be a tilde on the first $X_k'$.

  • Proof of Lemma 5.4: it should be $\eta \leq m^2/2L^2$ instead of $\sqrt{m}/2L$.

  • Proof of Theorem 5.5: $X_0$ has variance $\eta/(\beta d)$ while it is $\eta/\beta$ in the statement; this should be clarified. Moreover, I think it would be more standard to use the variance and not the standard deviation in the symbol $\mathcal{N}$.

  • Proof of Theorem 5.5: I think additional details on the result of [Chen et al.] would be very beneficial, as well as details of the $\chi^2$ computation. The current version of the theorem is hard to check because of these two points, and it is not clear where the constant $31/32$ comes from.

Experimental Design and Analysis

N/A

Supplementary Material

I reviewed most of the supplementary material apart from section C and some details of section D.

I think you use a package to restate the theorems in the appendix but did not remove the double column formatting, for this reason the rendering of some theorems in the appendix is weird.

Also, pointers to where the proofs of the results are located in the appendix are sometimes missing.

Relation to Existing Literature

This paper is strongly connected to the existing literature on generalization bounds for SGLD, to name a few: Mou et al., Li et al., Futami and Fujisawa.

The authors provide nice improvements of the proof techniques existing in the literature, in particular regarding the use of the dissipativity assumption and the uniform bound on log-Sobolev constants.

Essential References Not Discussed

To my knowledge, the essential references regarding this line of work are discussed by the authors.

A potential additional direction to discuss would be the extension to non-Gaussian noise, as recent works proved generalization bounds for heavier-tailed versions of noisy SGD.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

Lemma F.4 is a repetition of Lemma F.2; it could be removed.

Author Response

We are sincerely grateful for your time, your very detailed review and feedback. We will add a discussion of the limitation you mention.

Thank you for spotting the typos, here are a few details to answer some of the remarks:

  • Definition A.1: In integral form the expectation is the following:
$$\tilde{\mathbb{E}}_{k, q}[h(X'_k)] \;=\; \mathbb{E}_{X'_k}\big[\phi_q(X'_k)\, h(X'_k)\big] \;=\; \int h(x_k')\, \phi_q(x_k')\, p_{X_k'}(x_k')\, dx_k'$$

When $q=1$, it cancels out the density of $p_{X_k'}$. We hope this clarifies the definition. We will add this in the text. With this clarification, there is no tilde in Theorem 4.1; it should just be $X'_k$ in both terms. We will remove it, it was a typo. It is an integral under $X'_k$ modified by $\phi_q$.

  • Lemma 5.4: Thank you.
  • Theorem 5.5: There should be a $d$; we will fix it.
  • Theorem 5.5: We will add the result of Chen et al. for completeness so the $\chi^2$ computation is easy to follow. The constant $31/32$ comes from factors of $2$ arising from naive applications of Jensen's inequality. It can be tightened as described in Appendix C. We will explicitly add the tightening argument.

Questions:

  • We will add Li et al. The only reason it wasn't included initially was that their discrete result tightens the Lipschitz dependence but has the same iteration-count dependence as results already present in the table. Their secondary analysis was carried out for the continuous version of SGLD, and an approximation argument was used to obtain the discrete result. This introduced a need for vanishing stepsizes, which we are trying to relax, and it applies to the smaller class of bounded + $\ell_2$ functions.
  • Indeed you are correct: in case II of their result they establish a decay factor for the slightly smaller class of Lipschitz + $\ell_2$ functions. This corresponds to a subset of the strongly dissipative functions discussed in our Section 5.2. Our work relaxes this to simply dissipative. As their result is close, we will highlight it further. We loosen the structural requirements on the function further in Section 6, where we show that we only need the LSI to hold, without explicit structural requirements like Lipschitz + $\ell_2$.
  • The continuous case for bounded + $\ell_2$ has been analyzed before in Li et al. but remains open for generally dissipative functions as far as we know. The continuous analysis is interesting as it can help establish optimality independently of discretization error, but we believe the non-vanishing discretization error is the central difficulty of the analysis, and so we keep our focus on the discrete setting. Indeed, if we could obtain exact samples of the Gibbs distribution, the generalization bound would be straightforward.

We thank you again for your review and we remain at your disposal for any further discussion.

Reviewer Comment

Thank you for your answer.

I think your answer addresses most of my concerns, so I will keep my score.

Good luck.

Review
Rating: 4

The authors introduce novel generalization guarantees for Noisy SGD that extend to non-convex settings, possessing desirable properties such as an $O(1/\sqrt{n})$ dependence on dataset size and an $O(1)$ dependence on the number of iterations. The first major contribution of this paper demonstrates that assuming the loss function satisfies dissipativity (a condition relaxing strong convexity by allowing non-convexity within a confined region around 0) is sufficient to ensure that Noisy SGD iterates remain distributionally similar across different datasets regardless of the iteration count. Remarkably, this guarantee holds even under strong divergence measures like KL and Rényi divergences. Previously, such properties were only known to hold under strong convexity assumptions within the coupled-diffusion analysis framework. This guarantee translates into an $O(1/\sqrt{n})$ generalization bound, using the information-theoretic generalization arguments of (Xu and Raginsky, 2017). However, the resulting bound initially reveals an undesirable $O(e^d)$ dependence on the model dimension $d$. To address this limitation, the authors present their second key contribution. By employing an ergodicity-based analysis framework, they establish another KL divergence bound that avoids this exponential dependence. Specifically, they show that under a condition even weaker than dissipativity, characterized as a log-Sobolev-type isoperimetric assumption on the Gibbs distribution induced by the loss function, the generalization bound exhibits only an $O(\mathrm{poly}(d))$ dependence on the model dimension $d$.
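For reference, dissipativity is typically stated in the following $(m, b)$ form (as in Raginsky et al., 2017; the paper's exact formulation may differ slightly):

$$\langle x, \nabla f(x, z) \rangle \;\ge\; m\,\|x\|_2^2 - b \qquad \text{for all } x \in \mathbb{R}^d,\ z \in \mathcal{Z},\ \text{with } m > 0,\ b \ge 0,$$

which permits arbitrary non-convexity inside a ball of radius on the order of $\sqrt{b/m}$ while forcing the negative gradient to point back toward the origin outside of it.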

Questions for Authors

Could the authors address my comments in the weaknesses?

Claims and Evidence

The claims appear to be correct.

Methods and Evaluation Criteria

The paper is theoretical. The theoretical framework is sound.

Theoretical Claims

I did not verify the correctness of all the results, but they look okay from the quality of writing and the proof descriptions.

Experimental Design and Analysis

N/A

Supplementary Material

I read the proofs for the main claims, but not in detail.

Relation to Existing Literature

The contributions of the paper extend prior works. The authors establish two generalization guarantees that do not become vacuous with the number of iterations, even under non-convexity, and have a standard $O(1/\sqrt{n})$ dependence on dataset size.

Essential References Not Discussed

The essential related works are all mentioned.

Other Strengths and Weaknesses

Strengths

  • Paper is very well written with the core ideas being well explained.
  • Breaking down information-theoretic generalization guarantees into modular blocks of results that combine well together is great for understanding the bottlenecks and areas of improvements.

Weakness

  • Sampling directly from the Gibbs distribution $e^{-\beta F_n}$, where $F_n$ is the average loss on the dataset, has an $O(1/n)$ generalization guarantee (cf. Theorem 2 from (Aminian et al., 2021)). This generalization guarantee holds without making any assumptions on the loss function (besides the subgaussian tail assumption on the loss, which is the same as that needed for (Xu and Raginsky, 2017)'s information-theoretic generalization bound used by the authors). So I believe it is reasonable to expect that if Noisy SGD reasonably approximates the Gibbs distribution, it generalizes with an $O(1/n)$ guarantee as well. More precisely, I expect $I(A(D); D)$ to be $O(1/\sqrt{n})$. But the authors apply conditioning to upper bound $I(A(D); D)$ with $\mathbb{E}_{D,D'}[\mathrm{KL}(A(D)\|A(D'))]$ and focus only on bounding $\mathrm{KL}(A(D)\|A(D'))$ for worst-case $D, D'$, which has no hope of yielding a bound better than $O(1)$ (see the sketch after the reference below).
  • Although the bounds in Section 5 really have $O(1)$ dependencies on the number of iterations, they only seem useful when the model size $d$ is extremely small. That's because $\gamma$ very quickly tends to 1 as the LSI constant $\alpha$ grows as $O(\exp(d))$ in the model size $d$. This results in the factor $(1-\gamma^k)/(1-\gamma) \approx k$ up to $k < c\exp(d)$ for some constant $c$. In other words, in the reasonable parameter regimes of model size $d$ and number of iterations $k$, the generalization guarantee in Corollary 5.8 and the Rényi DP guarantee in Corollary 5.9 grow linearly in the number of iterations $k$. Effectively, applying Rényi composition might result in a better guarantee in this regime.
  • In Corollary 6.6, the initial divergences $D(X_0\|\pi)$ and $D(X_0'\|\pi')$ can be infinite if the weights are initialized to a fixed point rather than sampled from a distribution. The authors don't talk about how the weights can be initialized so that $D(X_0\|\pi)$ and $D(X_0'\|\pi')$ are finite.

Ref:
[1] Aminian, Gholamali, et al. "Characterizing the generalization error of Gibbs algorithm with symmetrized KL information." arXiv preprint arXiv:2107.13656 (2021).
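For reference, the chain the first weakness refers to is, for a $\sigma$-sub-Gaussian loss, roughly the following (standard Xu and Raginsky, 2017 form, reproduced here only as context):

$$\big|\mathbb{E}[\mathrm{gen}(A)]\big| \;\le\; \sqrt{\frac{2\sigma^{2}}{n}\, I\big(A(D); D\big)}, \qquad I\big(A(D); D\big) \;\le\; \mathbb{E}_{D, D'}\Big[\mathrm{KL}\big(A(D)\,\big\|\,A(D')\big)\Big],$$

where the second inequality replaces the marginal law of $A(D)$ by the law of $A(D')$ for an independent copy $D'$ (the KL against the marginal is the smallest such average), and it is this conditioning step that the reviewer argues may forfeit the faster rate.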

Other Comments or Suggestions

Overall, this paper makes good contributions towards non-vacuous generalization guarantees for Noisy-SGD under practical settings. Although we aren’t totally there yet, the ideas in the paper could prove useful towards this goal.

Author Response

We sincerely thank you for your time and thorough review of our work. We would like to address some of the points you have raised.

Sampling directly from the Gibbs distribution: You are correct in stating that an exact sample from the Gibbs distribution would attain the fast rate. However, there is no exact sampling algorithm available for non-convex unbounded losses. Approximate samplers either need a vanishing step size or need an additive step-size-dependent term in the generalization bound. This is precisely the technical hurdle we try to overcome: we want a bound that decays to zero when $n$ goes to infinity without needing to set the stepsize as $1/n$. We will add this discussion along with the relevant reference. Removing conditioning would be very interesting and will likely lead to better bounds, but we believe it will add complexity in tracking the LSI constant.

Reasonable parameter regimes: There is indeed a possible trade-off between composition bounds and our bound in Section 5 if the dimension is very large and the iteration count is low in comparison. The dimension dependence in Section 5 comes from the fact that we only use dissipativity to establish a worst-case LSI constant (which introduces the unfavorable dimension dependence). In Section 6 we show that if the LSI is known to have a good dimension dependence (for example if the minima are connected), then the exponential dependence can be avoided to yield the polynomial bound in Section 6.

Initialization in Corollary 6.6: You are correct that the initial points cannot be Dirac measures in Corollary 6.6. We can warm-start the algorithms and initialize them as Gaussians at stationary points of the losses, just like in the sampling literature, to ensure a finite $O(d)$ initial KL divergence.

We are grateful for your time and remain available for any further discussion.

Review
Rating: 4

This paper investigates the stability and generalization properties of noisy stochastic gradient descent (SGD) in unbounded non-convex settings, extending prior analyses that primarily focused on convex or bounded loss functions. The authors establish stability-based generalization bounds that remain non-vacuous even as the number of SGD iterations increases.

The key contributions of the paper are as follows.

  • KL and Rényi divergence stability analysis for noisy SGD. The paper derives uniform stability guarantees using isoperimetric inequalities, which provide a framework for analyzing the long-term stability of noisy SGD.
  • Generalization bounds under non-asymptotic and convergent settings. The authors extend classical stability-based generalization results to unbounded, non-convex loss landscapes, leveraging log-Sobolev inequalities.
  • Implications for differential privacy (DP). The stability analysis is connected to DP guarantees, showing that the derived bounds naturally extend to Rényi DP in certain conditions.
  • Relaxation of traditional strong convexity assumptions. The paper shows that dissipative and smooth losses satisfy the required isoperimetric properties, significantly broadening the applicability of the results beyond strongly convex settings.

Update after rebuttal

The authors have provided a detailed and thoughtful response. They clarified the connection to gradient variance and indicated that they plan to incorporate an upper bound on the generalization error based on a surrogate loss derived from Futami & Fujisawa (2024). This addition is expected to significantly enhance the contribution of the paper. Furthermore, the authors’ explanations regarding concrete cases in which the assumptions hold, as well as the interpretability of the proposed bounds, were convincing and informative.

Assuming these points are appropriately reflected in the revised manuscript, I believe the paper will offer a comprehensive and rigorous discussion of the topic. Accordingly, I have decided to raise my score by one point.

Questions for Authors

  • Interpretability of Generalization Bounds:

    • The paper improves existing generalization bounds, but what new practical insights do these bounds provide regarding what conditions lead to improved generalization performance?
    • Can the authors discuss how the new bounds relate to hyperparameter settings, such as noise level and learning rate tuning in practice?
  • Connection to Empirical Generalization Metrics:

    • Stability is a well-established quantity for analyzing generalization, but can the authors clarify how their derived upper bound connects to empirical generalization metrics, such as gradient variance (as in Jiang et al. (2019))?
  • Definition of Generalization Error:

    • Does the analysis in this paper consider generalization error with respect to the original non-differentiable loss function, or is it defined based on the surrogate loss function used in practical optimization? If only one case is covered, could the framework be extended to analyze both settings?
  • Applicability of Assumptions

    • The theoretical assumptions seem relatively mild, but could the authors provide explicit examples of loss functions and models that satisfy them? Would adding such examples in the appendix help make the results more accessible to practitioners?
    • In particular, could the authors discuss why Assumption 5.7, which allows for controlled $S_k$, is realistic? For instance, are there standard techniques in model training that naturally satisfy this assumption? Providing concrete practical examples would strengthen the justification of this assumption.

Claims and Evidence

The claims in this paper are well-supported. By leveraging log-Sobolev inequalities and isoperimetric properties, the paper provides a unified theoretical framework for studying stability. To the best of my knowledge, this proof approach is a novel contribution. This result extends previous stability-based generalization analyses and provides meaningful bounds that do not become vacuous over long training horizons.

Potential Areas for Clarification or Improvement:

  • Interpretability of Bounds:

    • The derived bounds are rigorous, but it would be better to further discuss their practical implications.
    • For example, can the results provide insights into hyperparameter selection for SGLD, such as noise control and learning rate tuning in terms of generalization?
  • Applicability of Assumptions:

    • While the theoretical assumptions are mild, it would be useful to provide explicit examples of loss functions and models that satisfy these conditions in an Appendix.
      • In particular, I have some concerns about Assumption 5.7. Is it satisfied for broader models, or can it be made to hold by using some reasonable techniques?
  • Extension of the Bounds:

    • In practice, SGLD and similar stochastic gradient methods often deal with non-differentiable loss functions, using differentiable surrogate losses instead. Are the generalization bounds derived in this paper based on the generalization error with surrogate losses or that with the original non-differentiable losses?
    • If only one case is covered, could the bounds be extended to handle both scenarios, similar to Futami & Fujisawa, NeurIPS 2023?
    • If such an extension is possible, can the improved generalization bounds derived in this paper also be maintained in both cases?

Methods and Evaluation Criteria

The proposed theoretical approach is practically reasonable, as it does not introduce additional restrictive assumptions but rather removes certain limiting conditions from prior work. The metric analyzed in this paper is generalization error, which is well-aligned with the study’s objectives. However, as mentioned in the Claims And Evidence section, I have some concerns regarding the clarity and applicability of the theoretical assumptions:

  • Explicit examples of models and loss functions satisfying the assumptions would enhance clarity and help practitioners better understand the scope of the results. Including such details in some part such as an appendix would be valuable.
  • Does the generalization bound measure the error with respect to the original loss function, or does it analyze generalization with respect to the surrogate loss function used in practical optimization? If only one case is analyzed, can this framework be extended to derive generalization bounds for both settings?

Theoretical Claims

I have reviewed the proofs, including those in the appendix, at a broad level.

Given the assumptions made, particularly Assumption 5.7, which allows for the control of $S_k$, I found no issues in the proof techniques, their logical progression, or the claims derived from them. The mathematical framework appears sound under these assumptions.

Experimental Design and Analysis

As discussed in the "Methods And Evaluation Criteria" section, the theoretical analysis focuses on generalization error, which is well-aligned with the objectives of this study. The choice of this metric is appropriate given the problem setting and theoretical framework.

One potential concern is the definition of the generalization error being analyzed. That is: Is the generalization error evaluated based on the surrogate loss function used in optimization, or is it defined with respect to the original, potentially non-differentiable loss function?

Supplementary Material

I have reviewed the supplementary material in its entirety, including all proofs and technical derivations, at a broad level. Then, I checked the logical consistency in the reasoning and argumentation, and correctness of the fundamental techniques used for bounding, particularly those employed in the derivation of upper bounds.

Relation to Existing Literature

This work is closely related to several lines of research, including (i) stability-based generalization analysis and (ii) analysis of SGLD performance.

  • Stability-based generalization analysis has been a fundamental tool for understanding learning algorithms, originating from the work of Bousquet & Elisseeff (2002) and later extended by Hardt et al. (2016) for SGD. While these studies focus on convex or bounded settings, this paper extends stability-based generalization bounds to unbounded non-convex settings.
  • SGLD and noisy SGD have been extensively studied in terms of convergence and generalization, with prior works by Raginsky et al. (2017), Mou et al. (2018), and Pensia et al. (2018) analyzing the stability and learning dynamics under specific regularity conditions. (I omit the paper links because these papers have already been cited in this paper.) This paper removes the strong convexity assumption, making the results more realistic in the context of modern machine learning.

Essential References Not Discussed

I find that the paper appropriately cites and discusses the most relevant prior work. I do not see any critical missing references that would significantly impact the understanding or positioning of this work.

Other Strengths and Weaknesses

Additional Strength

The paper is well-written and structured. The presentation is carefully designed to guide the reader through the background, assumptions, and supporting theorems in a clear and logical manner, making the technical content easier to follow.

Additional Concerns

One additional concern is that while the derived bounds improve upon existing results, it remains unclear whether they offer new practical insights into generalization performance. Specifically:

  • It seems that the bounds do not explicitly indicate what conditions lead to improved generalization performance. It would be useful to explore whether the new bound provides insights into what factors influence generalization in practice.

  • While stability is undoubtedly an important quantity in generalization analysis, it remains unclear how the derived upper bound connects to other empirically well-correlated generalization metrics, such as gradient variance as studied in Jiang et al. (2019).

In contrast, alternative approaches such as information-theoretic analyses of SGLD generalization often produce bounds that are expressed in terms of gradient norms (Pensia et al., 2018) or the sum of gradient variances (Negrea et al., 2019), which are known to correlate well with empirical generalization. It is unclear whether the improved stability-based bounds in this paper provide similarly strong connections to empirical generalization behavior.

Clarifying how the proposed bounds relate to these widely used empirical generalization indicators would strengthen the impact of this work.

Other Comments or Suggestions

I do not have additional comments. Please refer to the Questions for Authors section. I will consider further raising my overall score if these concerns are adequately addressed.

Author Response

We are sincerely grateful for your thorough review and detailed feedback. We would like to address some of the areas of improvement that have been identified.

Extension of the bound:

You correctly observe that some generalization bounds only hold for an evaluation loss (like the 0-1 loss, for example) while the algorithm is run on a different differentiable surrogate. Our work applies to a differentiable surrogate. Since surrogates are often chosen to be upper bounds of the evaluation loss, our result still has implications for the evaluation loss in the following way:

loss(test) < surrogate(test) = surrogate(train) + [surrogate(test) - surrogate(train)]

where the final term is the generalization gap we control. Just like Futami and Fujisawa's, our result allows us to control the gap in cases where the surrogate loss and the evaluation loss are exactly the same. Technically, our result bounds the mutual information (denoted $I(W_T, S)$ in Futami and Fujisawa), so we can directly use Theorem 7 from Futami and Fujisawa (also in [B]) and our improved bound will still hold in that setting. We had chosen to omit this component for simplicity, but we will add this result as it relaxes the requirement on the surrogate loss: we can include sub-exponential losses and not just sub-gaussian ones.

Connections to gradient variance:

To establish this connection, notice that our stability term is the norm of a difference of gradients, and this difference can be expanded into a sum of gradient norms since $\|a - b\|_2^2 \leq 2\|a\|_2^2 + 2\|b\|_2^2$. With this relaxation, our bound can be expressed as a function of the gradient norms encountered on the trajectory. We will add a discussion on using the sensitivity term $S_k$ versus gradient norms along the trajectory. The work of Jiang et al. shows the difficulty of aligning theoretical generalization bounds with practice, and it is important to mention.
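A short computation makes the variance connection explicit. If the two differing data points are treated as i.i.d. draws $z, z'$ from the data distribution (an expected-stability reading on our part, not a claim from the paper), then

$$\mathbb{E}_{z, z'}\big\|\nabla F(x, z) - \nabla F(x, z')\big\|_2^2 \;=\; 2\,\mathbb{E}_{z}\big\|\nabla F(x, z) - \mathbb{E}_{z}[\nabla F(x, z)]\big\|_2^2,$$

i.e., exactly twice the per-sample gradient variance at $x$, the quantity that the reviewer notes appears in empirically motivated bounds such as Negrea et al. (2019).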

Example of loss verifying our assumptions:

We agree that we should showcase the relaxed requirements by giving a concrete example. We will add the following. Consider a two-layer network with weights $x = [x_2, X_1]$ defined as $NN(x, z) = x_2^\top \sigma(X_1 z)$ for a data point $z \in \mathcal{Z}$ and the differentiable logistic sigmoid function $\sigma$. We assume the data points are bounded (for example, pixels in $[0,1]$ for images). We define a learning task through a Lipschitz loss over the outputs $o = NN(x, z)$ of the network:

$$\mathrm{Lipschitz}(NN(x, z))$$

With weight decay (i.e., $\ell_2$ regularization), this non-convex loss is dissipative (see Section 4 of [C]). It satisfies Assumption 5.7 because, for any two $z_1, z_2$,

$$\nabla F(x, z_1) - \nabla F(x, z_2) \;=\; \mathrm{Lipschitz}'(o_1)\, \nabla_x NN(x, z_1) - \mathrm{Lipschitz}'(o_2)\, \nabla_x NN(x, z_2)$$

The gradient difference of the neural network is given by the vector

$$\begin{bmatrix} \sigma(X_1 z_1) - \sigma(X_1 z_2) \\ (x_2 \odot \sigma'(X_1 z_1))\, z_1^\top - (x_2 \odot \sigma'(X_1 z_1))\, z_2^\top \end{bmatrix}$$

By virtue of the sigmoid being Lipschitz and our bounded-data assumption on $z_1, z_2$, this gradient difference can be bounded by the norm $\theta\|x\|_2$ plus constants, where $\|x\| = \sqrt{\|x_2\|_2^2 + \|X_1\|_2^2}$. We will add this example in the appendix, along with [A]'s result capturing sub-exponential losses like the example above. Assumption 5.7 states that if only the data points differ, then the gradient difference grows at most linearly with respect to the parameter.
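A quick numerical sanity check of this example could look like the sketch below (our own illustration: the concrete choice of the logistic loss as the Lipschitz loss, the dimensions, and the scaling experiment are assumptions for the demo, not the paper's setup). It scales the weight norm and verifies that the gradient-difference norm grows at most linearly in $\|x\|$:

```python
# Illustrative check that ||grad F(x, z1) - grad F(x, z2)|| grows at most
# linearly in ||x|| for a two-layer net with bounded data (sketch, not the paper's code).
import numpy as np
from scipy.special import expit  # numerically stable logistic sigmoid

rng = np.random.default_rng(0)
d_in, h = 10, 32  # input dimension and hidden width (arbitrary choices)

def grad_F(x2, X1, z, y):
    """Gradient of the logistic loss log(1 + exp(-y * NN(x, z))) w.r.t. (x2, X1),
    where NN(x, z) = x2^T sigmoid(X1 z)."""
    act = expit(X1 @ z)             # hidden activations, in (0, 1)
    o = x2 @ act                    # network output
    dloss = -y * expit(-y * o)      # derivative of the loss w.r.t. o, |dloss| <= 1
    g_x2 = dloss * act
    g_X1 = dloss * np.outer(x2 * act * (1.0 - act), z)
    return g_x2, g_X1

# Bounded data points, e.g. pixel intensities in [0, 1].
z1, z2 = rng.uniform(0, 1, d_in), rng.uniform(0, 1, d_in)
y = 1.0

for scale in [1.0, 10.0, 100.0, 1000.0]:
    x2 = scale * rng.standard_normal(h)
    X1 = scale * rng.standard_normal((h, d_in))
    g1, g2 = grad_F(x2, X1, z1, y), grad_F(x2, X1, z2, y)
    diff = np.sqrt(np.linalg.norm(g1[0] - g2[0]) ** 2
                   + np.linalg.norm(g1[1] - g2[1]) ** 2)
    x_norm = np.sqrt(np.linalg.norm(x2) ** 2 + np.linalg.norm(X1) ** 2)
    print(f"||x|| = {x_norm:10.1f}   ||grad diff|| = {diff:8.3f}   ratio = {diff / x_norm:.2e}")
```

The ratio $\|\nabla F(x,z_1)-\nabla F(x,z_2)\|/\|x\|$ stays bounded (in fact it shrinks as the sigmoid saturates), which is consistent with the at-most-linear growth the rebuttal attributes to Assumption 5.7.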

Interpretability of Bounds:

We see our work as an attempt at closing a theoretical gap so that the theory can capture the practically relevant setting of very long training runs on non-convex losses. In essence, we are trying to make theory catch up to practice. Our result shows that the presence of noise can compensate for a high iteration count. Concretely, for a fixed choice of $\eta$, we can see that choosing a low $\beta$ (i.e., adding more noise) is advantageous for generalization; the trade-off, however, is that a lower $\beta$ can harm convergence of the training loss. We will also add a discussion on the trade-offs between generalization and utility.

We are again grateful for your time and we remain at your disposal to discuss these points further.


[A] Futami, Futoshi, and Masahiro Fujisawa. "Time-independent information-theoretic generalization bounds for SGLD." Advances in Neural Information Processing Systems 36 (2023): 8173-8185.

[B] Bu, Y., S. Zou, and V. V. Veeravalli. "Tightening mutual information-based bounds on generalization error." IEEE Journal on Selected Areas in Information Theory, 1(1):121–130, 2020.

[C] Raginsky, Maxim, Alexander Rakhlin, and Matus Telgarsky. "Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis." Conference on Learning Theory. PMLR, 2017.

Reviewer Comment

Thank you very much for your detailed response. I read it with great interest.

The plan to include the explanation about the connection to gradient variance and an upper bound on the generalization error based on a surrogate loss, grounded in Futami & Fujisawa (2024), will certainly enhance the contribution of the paper. I also appreciate the clarifications regarding concrete cases where the assumptions hold and the interpretability of the bounds — your explanations were convincing. Incorporating these discussions into the main text would lead to a more comprehensive and rigorous presentation.

Assuming these additions are made, I believe the quality of the insights and discussions offered in the paper will be more than sufficient to meet the acceptance threshold. Accordingly, I have decided to raise my score by one point.

Best of luck with your submission!

Review
Rating: 3

This paper studies noisy SGD, or Stochastic Gradient Langevin Dynamics (SGLD), in the setting where it is run for many iterations with non-vanishing step sizes, with the goal of understanding its generalization bounds. They study this by comparing the weights when the algorithm is run on two different, independently-sampled datasets. They show that by assuming dissipative loss functions, uniform-in-time bounds can be established for the generalization and differential privacy of noisy SGD. This is done by resolving an open question on the isoperimetric properties of the biased limit of discrete Langevin iterates.

Questions for Authors

  • Could you discuss more about whether the assumptions used in this work hold in practical settings, and describe more concrete applications of this work?
  • How tight are the bounds derived in this work?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

N/A

Theoretical Claims

No.

Experimental Design and Analysis

N/A

Supplementary Material

No.

Relation to Existing Literature

This paper extends prior results to the non-convex, unbounded setting.

Essential References Not Discussed

None.

Other Strengths and Weaknesses

Strengths:

  • This work extends prior results for strongly convex settings to the non-convex setting.

Weaknesses:

  • The results are purely theoretical and lack concrete connections to applications.
  • It would be good to discuss how optimal the results are.

Other Comments or Suggestions

None.

Author Response

We thank you for your time and review. We would like to address some of your concerns.

We have provided a clear application of our bound in our response to mwzT. Our goal is to have theory catch up to practice where it is already common to train on non-convex losses for several thousand iterations.

Moreover, the form of our result closely matches the one obtained in strongly convex settings. In other words, there are no unnecessary terms in our results, unlike prior work. However, since there are almost no lower bounds in non-convex settings for sampling algorithms (or noisy iterative schemes), it is difficult to prove optimality of our upper bound.

We remain at your disposal for further discussion.

Reviewer Comment

Thank you for the responses. I have increased my score accordingly.

Final Decision

This paper considers stochastic gradient Langevin dynamics on non-convex landscapes. The authors derive information-theoretic generalization bounds for SGLD that remain stable in the number of iterations. Several conditions that appeared in prior works have been relaxed in this paper. The paper shows that SGLD can achieve generalization bounds that are stable in the number of iterations in non-convex settings.

This paper was reviewed by four expert reviewers with the following scores: Accept, Accept, Accept, Weak Accept. I think the paper studies an interesting topic and the results are relevant to the ICML community. The following concerns were brought up by the reviewers:

  • Motivation is weak.

  • No new insights to be gained from the improvements in the paper.

  • Several typos and ambiguous statements were pointed out by the reviewers. These should be carefully addressed.

Authors should carefully go over reviewers' suggestions and address any remaining concerns in their final revision. Based on the reviewers' suggestion, as well as my own assessment of the paper, I recommend including this paper to the ICML 2025 program.