5.0

/10

Rejected4 位审稿人

最低5最高5标准差0.0

3.3

置信度

正确性2.8

贡献度2.5

表达2.0

NeurIPS 2024

Stability and Sharper Risk Bounds with Convergence Rate $O(1/n^2)$

Bowei Zhu,Shaojie Li,Yong Liu

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

TL;DR

We derive high probability excess risk bounds to $O(1/n^2)$ for ERM, GD and SGD and our high probability results on the generalization error of gradients for nonconvex problems are also the sharpest.

摘要

The sharpest known high probability excess risk bounds are up to $O(1/n)$ for empirical risk minimization and projected gradient descent via algorithmic stability [Klochkov and Zhivotovskiy, 2021]. In this paper, we show that high probability excess risk bounds of order up to $O(1/n^2)$ are possible. We discuss how high probability excess risk bounds reach $O(1/n^2)$ under strongly convexity, smoothness and Lipschitz continuity assumptions for empirical risk minimization, projected gradient descent and stochastic gradient descent. Besides, to the best of our knowledge, our high probability results on the generalization gap measured by gradients for nonconvex problems are also the sharpest.

关键词

algorithmic stabilitygeneralization boundsexcess risk bounds;

评审与讨论

审稿意见

评分: 5置信度: 42024-06-13

This paper studies the generalization measured by gradients via a uniform gradient stability. For $\beta$ -uniformly stable algorithms, the paper gives generalization bounds of order $O(1/n+\beta+\sqrt{E_Z[\|\nabla f(A(S);Z)\|_2^2]/n})$ , which yields fast rates if $E_Z[\|\nabla f(A(S);Z)\|_2^2$ is small. The paper then uses this generalization measured by gradients to derive generalization error bounds under a PL condition. Applications to empirical risk minimization, gradient descent and stochastic gradient descent are given for strongly convex, smooth and Lipschitz problems.

优点

The paper gives high-probability bounds for generalization gap on gradients, which include the gradient norm at the minimizer. This can imply fast rates in an interpolation setting. Under the PL condition, this gives fast rates of order $O(1/n^2)$ .

The paper provides comprehensive applications to several algorithms such as empirical risk minimization, gradient descent and stochastic gradient.

缺点

The high-probability analysis based on uniform stability follows largely from existing work. I do not see enough novelty in the analysis. It would be helpful if the authors can summarize the challenges in the analysis and their novelty.

As stated in the paper, Theorem 1 and Theorem 2 only improve the existing results by a constant factor. This improvement is not significant.

As stated in the paper, the generalization by gradients is mostly interesting for nonconvex problems. However, for the applications in Section 4, the paper considers strongly convex problems. Also the results require smoothness and Lipschitz continuity. These assumptions seem to be a bit strong.

The paper gives fast rates under the case $F(w^*)=O(1/n)$ in Section 4. Note that Section 4 considers strongly convex problems. Then, the objective function should be of the form $F(w)=G(w)+\mu\|w\|^2$ , where $G$ is related to loss. Then, if we require $F(w^*)=O(1/n)$ , one needs $\mu\|w^*\|^2=O(1/n)$ . Suppose we assume $\|w^*\|=O(1)$ . Then, this requires $\mu=O(1/n)$ . In this case, the generalization bound would be vacuous since $n\mu=O(1)$ .

For SGD, the computation cost seems to be high. For example, in Theorem 6, the paper requires $T=n^4$ while in Theorem 13 the paper requires $T=n^2$ . This high computational cost may not be appealing for large-scale problems.

问题

The paper contains various typos. Here are some examples:

Lemma 3: $\|\nabla f(w;z)\le M\|$

Below eq (25): $\sqrt{a+b}=\sqrt{a}+\sqrt{b}$

Below Eq (28): Below eq (25): $\sqrt{a+b}+\sqrt{a}+\sqrt{b}$

In Eq (27), it seems that there is a missing factor of $2$

局限性

Yes

作者回复

2024-08-07

Thank you for your time and thoughts. Below, we reply to all the issues.

Question 1: Issue in Weaknesses ``The high-probability analysis based on uniform stability ...''

Answer: Thank you for your valuable feedback and insights. We appreciate the opportunity to address the points raised and provide further clarification. Due to the constraints on word count, we have detailed our response regarding the main challenges and innovations of our work in Public Response. We believe this additional explanation clarifies the significance and impact of our work.

Question 2: Issue in Weaknesses ``As stated in the paper, Theorem 1 and Theorem 2 ...''

Answer: In the course of our writing, we discovered that the constant in Marcinkiewicz-Zygmund's inequality for random variables taking values in a Hilbert space could be further improved. Consequently, we included this part in the paper. This was done by referencing the proof of Marcinkiewicz-Zygmund's inequality for random variables with the best constant. Considering that Theorem 1 has other broad applications, we report this result as well.

Question 3: Issue in Weaknesses ``As stated in the paper, the generalization by gradients ...''

Answer: While Theorem 3 indeed highlights sharper generalization bounds in gradients specifically for nonconvex problems. This theorem operates under the assumption of Lipschitz continuity and shows that the coefficient of the $1/\sqrt{n}$ term is related to the variance of the algorithm's optimized gradient over the entire population $\mathbb{E}_{Z} || \nabla f(A(S);Z) ||_2^2$ , as opposed to being related to the maximum gradient value $M$ as in prior works. This provides a tighter bound and is a notable contribution to the understanding of non-convex problems.

We chose to demonstrate our results in Section 4 using strongly-convex problems to provide clear, illustrative examples where our theoretical contributions can be easily observed and compared with existing literature. The generalization bounds for nonconvex problems remain a significant contribution and are expected to be beneficial for many practical applications where nonconvexity is prevalent.

The assumptions of smoothness and Lipschitz continuity are indeed standard and necessary for deriving our results. High-probability results are more challenging than expectation results. While expectation results need on-average type stability (eg: $\mathbb{E} \left\|\nabla f(A(S);z)-\nabla f(A(S^{(i)});z) \right\|_2 \leq \beta.$ ), high-probability results require uniform type stability (eg: $\sup_z \left\|\nabla f(A(S);z)-\nabla f(A(S^{(i)});z) \right\|_2 \leq \beta.$ ). Achieving uniform stability necessitates stronger assumptions to ensure the same optimization outcomes. Achieving $O(1/n^2)$ rate necessitated the trade off between optimization and generalization errors, which required assuming strongly-convexity. This assumption was crucial to ensure a good convergence rate for the algorithm's optimization error.

Question 4: Issue in Weaknesses ``The paper gives fast rates under the case ...''

Answer: Our stability bound is indeed influenced by $\frac{\gamma}{n \mu}$ , not $n \mu$ . To further analyze this, let's consider a specific example where $G$ represents the squared loss function.

Assume the function $F(w)$ satisfies the inequalities $\mu |w|^2 \leq F(w) \leq \gamma |w|^2$ . For the sake of this analysis, we assume $F(w^*) = \Theta(1/n)$ . Consequently, we have $\mu |w|^2 = \Theta(1/n)$ and $\gamma |w|^2 = \Theta(1/n)$ . Assuming $|w|^2 = \Theta(1)$ , this leads us to $\mu = \Theta(1/n)$ and $\gamma = \Theta(1/n)$ . Under these conditions, our final stability bound becomes $\frac{\gamma}{n \mu} = \Theta(1/n)$ , which is meaningful and consistent with our analysis.

We acknowledge that the assumption $F(w^*) = O(1/n)$ is quite strong. However, we use this assumption to demonstrate that under low-noise conditions, our bounds can achieve the tightest possible rate of $O(1/n^2)$ .

Question 5: Issue in Weaknesses ``For SGD, the computation cost seems ...''

Answer: We acknowledge that the current method to analyze SGD indeed requires a large number of iterations to achieve high-probability results, which is a consequence of the inherent randomness in SGD. Considering that the uniform convergence method has already obtained results on the order of $1/n^2$ , we aim to achieve similar results using stability analysis under comparable assumptions. Additionally, we hope to leverage the characteristics of stability analysis to eliminate dependency on the dimension $d$ .

We also recognize that examining the expected results may potentially reduce the number of iterations required. We believe this is a promising direction for future research.

Question 6: Issue in Weaknesses ``In the proof of Lemma 1, ...''

Answer: Thank you for your correction. Due to the strong convexity assumption we made when applying this lemma, $w^*$ is unique. Our statement here is not rigorous enough, but it does not affect the final conclusion. We will correct this error in the final version.

Question 7: Issue in Questions ``The paper contains various typos ...''

Answer: Thank you for your correction. We will try our best to fix these minor errors in the final version.

If you have any questions or need further clarifications, please do not hesitate to contact us. We look forward to discussing this with you and appreciate your valuable feedback.

Reference

Yegor Klochkov, and Nikita Zhivotovskiy. Stability and Deviation Optimal Risk Bounds with Convergence Rate $O(1/n)$ . NeurIPS 2021.

评论- Thank you for your response

2024-08-13

Thank you for your detailed responses. I appreciate more the contributions of the paper on fast convergence rates. I will raise my score.

2024-08-13

Thank you for your thoughtful feedback and recognizing the contributions of our paper. I truly appreciate your insights and support. Your encouragement motivates us to further enhance the quality of our work.

审稿意见

评分: 5置信度: 32024-06-25

The work shows high probability excess risk bounds of $O(1/n^2)$ for several algorithms under strong convexity, smoothness, Lipschitz continuity and low noise assumptions using algorithmic stability.

优点

The results of the paper are interesting, showing a risk bound of $O(\frac{1}{n^2})$ using algorithmic stability.
The paper uses a novel technique, involving the stability of gradients to demonstrate excess risk bounds and presents applications of this technique in convex optimization.

缺点

The problem setup of the paper is not clearly detailed before the technical section, including the assumptions used for proving the results.
The presentation of results and related works is somewhat lacking. It would be beneficial if the authors summarized the results from previous work and compared them to the results in the current paper, including the set of assumptions made in each work.

问题

In the proof of the risk bound for ERM (line 664), why is $\nabla F_S(w^*) = 0$ ? This may not hold in constrained optimization. Alternatively, in unconstrained optimization, the results cannot hold since a strongly convex function cannot be Lipschitz.
The paper contains several typos in both the analysis and the main text (e.g., line 5, line 228).

局限性

Yes

作者回复

2024-08-07

Thank you for your time and thoughts. Below, we reply to all the issues.

Question 1: Issue in Strengths ``The problem setup of the paper is not clearly ...''

Answer: Thank you for your valuable feedback. We appreciate your suggestion to collect and detail the assumptions used in our manuscript. We will ensure that in the final version, all assumptions are gathered together and each one is thoroughly explained, including their significance.

Question 2: Issue in Weaknesses ``The presentation of results and related works is somewhat lacking ...''

Answer: Thank you for your valuable feedback. We have made sure to clearly state the necessary assumptions in each theorem or lemma, as you suggested. Additionally, we discuss the comparison with existing work in the remark sections following the relevant theorems.

To further enhance clarity and facilitate easier comparison for readers, we will summarize these comparisons in a table in the final version of the manuscript.

Thank you again for your helpful suggestions.

Question 3: Issue in Weaknesses ``In the proof of the risk bound for ERM (line 664) ...''

Answer: Thank you for your feedback. You are correct that our expression was not precise and lacked a necessary condition. Assuming $w^*(S) \in \mathcal{W}$ , we will make this correction in the final version of our paper.

Question 4: Issue in Weaknesses ``The paper contains several typos in both the analysis ...''

Answer: Thank you for your correction. We will try our best to fix these minor errors in the final version.

If you have any questions or need further clarifications, please do not hesitate to contact us. We look forward to discussing this with you and appreciate your valuable feedback.

2024-08-11

Thank you for your response. I’m maintaining my score. I agree with reviewer Xifq that it’s necessary to fix the typos in the paper, explicitly state the different sets of assumptions (along with the corresponding results), and discuss previous works with similar assumptions.

2024-08-12

Thank you for your valuable feedback on my rebuttal. I will ensure that all typos in the paper are corrected, clearly outline the different sets of assumptions along with their corresponding results, and include a comprehensive discussion of previous works that operate under similar assumptions. Your insights will help improve the quality of the manuscript, and I am committed to addressing these points thoroughly in the final version.

审稿意见

评分: 5置信度: 32024-06-29

This paper achieves the high probability excess risk bounds $\mathcal{O}(1/n^2)$ for empirical risk minimization, projected gradient descent and stochastic gradient descent under strong convexity, smoothness and Lipschitz continuity assumptions.

优点

Please see Summary.

缺点

1.The paragraph from line 52 to line 65 first analyzes the non-convex problems. Then, it suddenly mentions that this paper explores the stability of stochastic convex optimization algorithms with strongly convex losses in line 61. And, it doesn’t mention its non-convex analysis. It is so confusing. So, I suggest authors rewrite this paragraph to benefit readers’ understanding.

2.The paragraph from line 70 to line 73 is unnecessary since we can not obtain any useful information.

3.In Related Work, it is unnecessary to list the literature related to uniform convergence since it is not helpful for readers’ understanding of the contributions of this paper. It may be better that authors list the detailed literature related to high probability bound.

4.Theorem 1 in this paper is not the sharpest p-moment bound for sums of vector-valued functions. Authors demonstrate their bound is indeed tighter than Theorem 1 of [1]. However, [2] also provided a bound (Theorem 1) that is likely tighter than Theorem 1 in this paper. Note that, [2] also used Marcinkiewicz-Zygmund’s inequality to prove their bound. Besides, the paragraph from line 131 to line 136 states “On the other hand, in Section 3.2, we will carefully construct vector-valued functions which satisfies all the assumptions in Theorem 1 and ensures M = 0 at the same time. Under this condition, we can eliminate the first term.”. This point is also considered in Theorem 1 of [2]. I think that authors just consider the improvement to [1], but omit other related work.

[1]J. Fan and Y. Lei. High-probability generalization bounds for pointwise uniformly stable algorithms. Applied and Computational Harmonic Analysis, 70:101632, 2024.

[2]X. Yuan, P. Li. Exponential generalization bounds with near-optimal rates for $L_q$ -stable algorithms. ICLR, 2023.

5.In Section 3.2, authors build some relationships between generalization error and stability parameter $\beta$ . Authors think these relationships are under non-convex, non-smooth, non-PL conditions. These bounds are not the final generalization bounds but the relationships. After giving the stability bounds, the generalization bounds are finally determined. However, in Section 4, authors provide the stability bounds under (strongly) convex and smooth conditions. Therefore, authors didn’t remove (strongly) convex and smooth conditions for generalization analysis.

6.The symbol $M$ is repeatedly used in Theorem 1 and Theorem 2. Therefore, I suggest authors should carefully check their symbol settings.

7.In Remark 4, authors compare their Theorem 3 with the bound in [3]. It is unfair since, as mentioned in the above 5., the bound in [3] is a final result but Theorem 3 is not.

[3]Y. Xu and A. Zeevi. Towards optimal problem dependent generalization error bounds in statistical learning theory. Mathematics of Operations Research, 2024.

8.In line 633, $F(A(S)) - F(A(S))$ is wrong.

9.In line 633, Equation (31) should be an inequality.

10.In line 152, authors state “In nonconvex problems, we can only find a local minimizer by optimization algorithms which may be far away from the global minimizer. Thus the convergence does not make much sense in function values.”. So, they use uniform stability in gradients. However, in Section 4, they provide some uniform stability bounds in gradients for strongly convex problems. It is a paradox.

11.The form of the relationship in Theorem 3 is very normal. The method to obtain an excess risk bound $\mathcal{O}(1/n^2)$ is very simple. I think other normal generalization results (like [1]) in gradients can derive the excess risk bound $\mathcal{O}(1/n^2)$ . The main contribution of this paper may be the simple method combining PL condition with some usual decompositions as shown in Proof of Remark 5. However, as mentioned in the above 10., the uniform stability in function values is more unreliable than the one in gradients under strongly convex condition.

问题

1.In line 23, $f$ denotes a loss function. However, what does $g$ denote? Definition 1 and Definition 4 involve this symbol.

2.In line 54, authors state “This is why we can achieve dimension-free excess risk bounds of order $O(1/n^2)$ .”. I don’t understand it since the sentence before it is not the reason of it.

3.The paragraph from line 66 to line 69 mentions that “we obtain a tighter p-moment bound ...”. Where is this tighter p-moment bound used? This point is not stated in Abstract. From my perspective, it is valuable to be emphasized.

4.Why do authors give Lemma 2? The population risk of gradients $||\nabla F(A(S))||_2$ is not included in any result and not cited in any proof in this paper.

5.In line 635, why is the first inequality not consistent with Lemma 1?

6.Why is Theorem 3 better than Theorem 2 under the same conditions? It is better to emphasize the difference between these two theorems, rather than just state a sentence “we derive sharper generalization bound of gradients under same assumptions.”.

局限性

Considering the 4. in Weaknesses, I suggest the author reconsider whether their result is the sharpest.

2024-08-07

PART III

Q15: Issue in Questions ``Why do authors give Lemma 2 ...''

A: The logical focus of this paper is indeed to derive an excess risk bound of $1/n^2$ . However, some results obtained during the proof process, such as Theorem 3, are novel and represent the best results even in the non-convex setting. Lemma 2 is presented as a minor result, as discussed in Remark 2, there is considerable interest in the population risk of gradients, $||\nabla F_S(A(S))||_2$ , in the non-convex setting.

Lemma 2 only assumes bounded variance and SGC (Strong Growth Condition). When the algorithm performs well, the empirical risk of the gradient $||\nabla F_S(A(S))||_2$ can be zero, which implies that the population risk of gradients $||\nabla F_S(A(S))||_2$ can achieve $O(1/n)$ . This result from Lemma 2 also helps to understand why Theorem 3 provides a better bound compared to Theorem 2. Specifically, using the inequality $\sqrt{ab} \leq \eta a + \frac{1}{\eta}b$ in the context of Theorem 3 allows us to derive Lemma 2. On the other hand, Theorem 2, combined with SGC and the assumption $||\nabla F_S(A(S))||_2 = 0$ , can only achieve a bound of $O(1/\sqrt{n})$ at best.

Q16: Issue in Weaknesses ``In line 635, why is ...''

A: To facilitate the calculations, we simplified the constant terms that are independent of $n$ into a single large constant $C$ in line 635 when applying Lemma 1. This is a standard approach to streamline the notation, and the resulting expressions are equivalent.

Q17: Issue in Weaknesses ``Why is Theorem 3 better than Theorem 2 ...''

A: As highlighted in Remark 4, Theorem 3 operates under the assumption of Lipschitz continuity and demonstrates that the coefficient of the $1/\sqrt{n}$ term is related to the variance of the algorithm's optimized gradient over the entire population, denoted by $\mathbb{E}_{Z} \|\nabla f(A(S); Z)\|_2^2$ . This contrasts with Theorem 2, where the coefficient is associated with the maximum gradient value $M$ .

By introducing smoothness conditions in subsequent results, we refine the analysis from focusing on the maximum gradient $M$ to considering the variance of the algorithm's optimized gradient over the entire population. This refinement is crucial for obtaining more accurate and generalizable results. Besides, as discussed in our response to Q15, Lemma 2 also explains why Theorem 3 offers a more comprehensive and favorable result compared to Theorem 2.

If you have any questions or need further clarifications, please do not hesitate to contact us. We look forward to discussing this with you and appreciate your valuable feedback.

Reference

J. Fan and Y. Lei. High-probability generalization bounds for pointwise uniformly stable algorithms. Applied and Computational Harmonic Analysis, 2024.

X. Yuan, P. Li. Exponential generalization bounds with near-optimal rates for -stable algorithms. ICLR, 2023.

Y. Klochkov, and N. Zhivotovskiy. Stability and Deviation Optimal Risk Bounds with Convergence Rate $O(1/n)$ . NeurIPS 2021.

O. Bousquet, Y. Klochkov, and N. Zhivotovskiy. Sharper bounds for uniformly stable algorithms. COLT 2020.

Y. Xu and A. Zeevi. Towards optimal problem dependent generalization error bounds in statistical learning theory. Mathematics of Operations Research, 2024.

R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. NeurIPS 2020.

2024-08-11

Thank you for your response. I have raised the ratings.

2024-08-12

Thank you for your constructive feedback and for raising the ratings. We greatly appreciate your time and effort in reviewing our work. Your comments have been invaluable in helping us refine our manuscript.

If you have any additional suggestions or further concerns, please feel free to share. We are committed to addressing any remaining issues and improving the quality of our work.

2024-08-07

PART II

Q8: Issue in Weaknesses ``In line 633 $F(A(S)) - F(A(S))$ ...''

A: We will fix this typo.

Q9: Issue in Weaknesses ``In line 633, Equation (31) ...''

A: We will fix the typo.

Q10: Issue in Weaknesses ``In line 152, authors state ...''

A: We introduced the setup for non-convex problems because Theorem 3 has potential applications in non-convex scenarios as well. The stability analysis for generalization includes two parts: on one hand, the relationship between algorithm stability and generalization performance; on the other hand, proving the stability of a particular algorithm. Our Theorem 3 addresses the first aspect by providing a bound that depends on the variance of the population risk of the gradient, rather than the maximum gradient $M$ . Although the latter is not the focus of our paper, it represents a potential direction for future research.

The refined analysis in Theorem 3 also demonstrates improved performance under stronger assumptions, leading to the best $1/n^2$ bound. We believe that both the study of non-convex problems and the research into tighter bounds under stronger assumptions are equally important.

Q11: Issue in Weaknesses ``The form of the relationship ...''

A: Regarding your point about the generalization results of existing gradient-based methods, including Theorem 2, we emphisize that these results can not yield a bound on the excess risk of the order $1/n^2$ . The reason for this is that the coefficients in front of the $\frac{1}{\sqrt{n}}$ term in these results are generally bounded by the maximum value $M$ of the gradients.

In contrast, Theorem 3 introduces a different perspective by focusing on the variance of the algorithm's optimized population risks of gradients $\mathbb{E}_{Z} \| \nabla f(A(S); Z) \|_2^2$ . This approach allows us to derive a more refined bound on the excess risk. By incorporating additional assumptions to bound this variance, we are able to achieve improved results. Theorem 3 is indeed a core contribution of our paper, as it addresses a key limitation in the existing results.

Moreover, we recognize that some may view the improvement of constants in Theorem 1 as having limited impact. However, achieving such improvements is also a technical challenge and constitutes another contribution of our work.

Q12: Issue in Questions ``In line 23, $f$ denotes a loss function ...''

A: We understand your concern regarding the notation used in Line 29 and Definition 1 Definition 4, where the function $g$ are employed. Given that these definitions are meant to be general and not restricted to loss functions, this notation is indeed applicable beyond just loss functions.

Lemma 1, for instance, assumes that the population loss function $F$ , rather than the loss function $f$ , satisfies the PL condition. We recognize that this might cause some confusion due to the overlapping notation.

To address this issue and enhance clarity, we can revise the notation in the final version of the manuscript to avoid any potential misunderstandings.

Q13: Issue in Questions ``In line 54, authors state ...''

A: As you correctly pointed out, we transform the excess risk bound into an upper bound on the population risk of gradients through the use of PL conditions. To achieve the desired $\frac{1}{n^2}$ result, it is indeed crucial to conduct a detailed analysis of the generalization error of the gradients.

Our approach highlights that the coefficient in front of the $\frac{1}{\sqrt{n}}$ term cannot simply be bounded by the maximum value $M$ of the gradients (eg: previous work and Theorem 2) . Instead, we need to ensure that the generalization error of the gradients achieves $\frac{1}{n}$ under specific assumptions. This nuanced analysis is precisely what Theorem 3 addresses, providing a more refined and accurate characterization of the gradient's generalization error.

Q14: Issue in Questions ``The paragraph from line 66 to line 69 ...''

A: Theorem 1 provides our tighter $p$ -moment bound, and it is indeed central to the results presented throughout the paper. As discussed following Theorem 1, we have improved the constants in Marcinkiewicz-Zygmund's inequality for random variables taking values in a Hilbert space. Given the potential broad applications of Theorem 1, we believe it is important to report these enhanced results.

However, to maintain the readability of the paper and due to the fact that many might perceive the improvement in constants as having limited impact—especially concerning the $1/n^2$ term of interest—we chose not to emphasize this detail in the abstract or in the introduction. Our aim was to keep the focus on the main results and their implications.

作者回复

2024-08-07

PART I

Thank you for your time and thoughts. Below, we reply to all the issues.

Q1: Issue in Weaknesses ``The paragraph from line 52 to line 65 ...''

A: As Reviewer Xifq pointed out in Strengths, our roadmap in this paper is to provide a finer-grained analysis of the generalization error of gradients, leading to a dimension-independent bound of $1/n^2$ .

In the course of our proof, we discovered that our main theorem (Theorem 3) does not require strong convexity and smoothness assumptions. Furthermore, it surpasses the best existing results by relating the coefficient of the $1/\sqrt{n}$ term to the variance of the algorithm's optimized gradient across the entire population, $\mathbb{E}_{Z} || \nabla f(A(S);Z) ||_2^2$ , rather than to the maximum gradient value $M$ as in previous works. We believe this theorem could provide valuable insights for non-convex settings as well.

We will make sure to articulate this more clearly in the final version of the paper to enhance its readability.

Q2: Issue in Weaknesses ``The paragraph from line 70 to line 73 ...''

A: Thank you for your suggestion. We will remove this paragraph in the final version

Q3: Issue in Weaknesses ``In Related Work, it is unnecessary ...''

A: Our motivation stems from the fact that the current tools for uniform convergence yield results that are better compared to those derived from stability methods. Given this, we aim to employ stability methods to achieve dimension-independent results under the same assumptions. Therefore, we believe it is necessary to include an appropriate introduction to the latest developments in uniform convergence. We will strive to minimize its length to ensure conciseness. In response to your request, we will provide a detailed list of literature related to high probability bounds in the final version.

Q4: Issue in Weaknesses ``Theorem 1 in this paper is not the sharpest ...''

A: Our work and [Fan and Lei, 2024] are concentrated on $p$ -moment bound for sums of vector-valued functions. [Yuan and Li, 2023] reported sums of valued functions. We are different. Of course, all of us motivated by [Bousquet et al., 2020] but we face more difficult. On one hand, comparing with [Fan and Lei, 2024] for p-moment bound for sums of vector-valued functions, we need to prove the optimal constant in Marcinkiewicz-Zygmund's inequality for random variables taking values in a Hilbert space but they just use standard Marcinkiewicz-Zygmund's inequality. On the other hand, for constructing vector-valued functions, comparing with [Yuan and Li, 2023] the distinction between a vector’s $2$ -norm being $0$ and the vector itself being $0$ added complexity to our proof. We apologize for the oversight in not citing [Yuan and Li, 2023]. We will make sure to include it in the related works to address this issue.

Q5: Issue in Weaknesses ``In Section 3.2, authors build ...''

A: The stability analysis for generalization includes two parts: firstly, build the relationship between algorithm stability and generalization bound, and secondly, prove the stability of certain algorithms. In this paper our goal is to derive an $1/n^2$ bound independent of dimensionality During our proof process, we found that Theorem 3 also addresses the first aspect in non-convex settings, providing a bound that depends on the variance of the population risk of gradients rather than the maximum gradient $M$ . Our primary focus is not on the latter aspect of non-convex settings, which remains an area for future research.

Additionally, exploring high-probability stability for algorithms in non-convex setting is indeed challenging. Current methods, such as those based on [Bassily et al., 2020]'s work, show that random algorithms under non-convex smooth assumptions are $O(T/n)$ -stable. However, this bound becomes less meaningful when the number of iterations exceeds $n$ . We believe that further investigation into algorithm stability in non-convex scenarios is valuable and worthy of exploration.

Q6: Issue in Weaknesses ``The symbol $M$ is repeatedly ...''

A: We will fix this typo.

Q7: Issue in Weaknesses ``In Remark 4, authors compare ...''

A: Our intention in comparing our results with [Xu and Zeevi, 2024] is not to imply that our approach is universally superior. If the stability parameter $\beta$ in our algorithm is very large, our results may indeed be less favorable.

Specifically, we indicate that when the uniformly stable gradient parameter $\beta$ is smaller than $1/\sqrt{n}$ , our bound is tighter compared to [Xu and Zeevi, 2024]’s result and is dimension-independent. However, it is crucial to note that even if $\beta$ is smaller than $1/\sqrt{n}$ , the results from [Fan and Lei, 2024] and Theorem 2 do not surpass [Xu and Zeevi, 2024]'s result and are less favorable, especially considering the subsequent derivation of the $1/n^2$ result.

We will revise our manuscript to clarify these points and avoid any potential misunderstanding for the readers. Thank you for bringing this to our attention.

审稿意见

评分: 5置信度: 32024-07-12

The paper considers the standard statistical learning setting and derives $\mathcal{O}(\frac{1}{n ^ {2}})$ ( $n$ denotes the number of samples) high-probability bounds for the excess risk $F(A(S)) - \inf_{w} F(w)$ ( $A$ denotes the algorithm and $S$ denotes the training set) of ERM, PGD, and SGD. The best-known bounds prior to this work were $\mathcal{O}(\frac{\log n}{n})$ for ERM, PGD, that was derived using algorithmic stability, and $\mathcal{O}(\frac{1}{n ^ {2}})$ or ERM, SGD that was derived using uniform convergence. However, the latter demanded $n = \Omega(d)$ samples, thereby introducing the an undesirable dependence on $d$ . The current paper shows that it is possible to obtain $\mathcal{O}(\frac{1}{n ^ {2}})$ bounds for the algorithms (without any dependence on $d$ ) under the lens of stability.

优点

The considered problem is interesting and the contributions of the paper show that sharper bounds (at par with uniform convergence, albeit without any dependence on $d$ ) are possible for ERM, SGD, and PGD for strongly convex and smooth stochastic convex optimization. Below I explain the roadmap taken by the authors in doing so, which also sheds light on some of the other aspects of the paper.

The authors first study the generalization gap via gradients, i.e. the quantity $\||\nabla F(A(S)) - \nabla F_S(A(S)\||$ for the statistical learning setting under the assumption that the function is Lipschitz and the algorithm is uniformly stable in gradients (Theorem 1 and 2). Under the nonconvex setting, the authors obtain dimension-independent bound for the generalization gap via gradients. This bound is subsequently studied under the assumption that the function is smooth and satisfies the Polyak-Lojaseiwicz (PL) condition. The obtained bound is a function of the gradient norm obtained at the end of optimization, i.e. $\|| \nabla F_S(A(S) \||$ .
To obtain excess risk bound for the algorithms, the (i) authors show that the algorithms are uniformly stable in gradients; (ii) translate the excess risk bound to a bound on the gradient via the PL inequality (recall from the premise that these algorithms are analyzed in the strongly convex and smooth setting of stochastic convex optimization, therefore PL holds vacuously); (iii) use triangle inequality to relate the bound on the gradient norm to the generalization gap, and use the bound on the generalization gap via gradients (see 1 above).

缺点

I found several typos in the main theorems in section 4. For example, the stability equation in Lemma 4 should be written with respect to the output at the iteration instead of the output of the ERM. Also, why is the reference in Theorems in section 4 to Theorem 3, instead of Lemma 1? From my understanding (explained in the Strengths above), Lemma 1 is a specific instantiation of Theorem 3 to smooth + PL functions, which is exactly the premise in Section 4. What exactly is $w$ in the bound $F(w) - F(w^\star)$ in Theorem 5? I expect it to $w_{T + 1}$ (or $w_T$ ).

I don't see the point of Marcinkiewicz-Zygmund’s inequality with improved constants. The whole paper is about the improved dependence with respect to $n$ , so I don't see a point in improving the specific constants in the inequality.

I need some more clarification in lines 204--207. The authors state that Klochkov and Zhivotovskiy obtained $\mathcal{O}(\frac{1}{n})$ style bounds for the excess risk. What's the assumption on $f$ considered by them? The authors mention that they can obtain $\mathcal{O}(\frac{1}{n ^ {2}})$ bounds with an extra PL and smoothness assumption. Is this something for which Klochkov and Zhivotovskiy, 2021 could only obtain a suboptimal $\mathcal{O}(\frac{1}{n})$ bound? Earlier, my interpretation was that this work obtained $\frac{1}{n}$ bounds for ERM, and PGD for smooth and strongly convex stochastic convex optimization, but lines 204--207 made my understanding unclear. I want to make sure the authors are not invoking extra assumptions to get improved dependence.

Along similar lines as above, Lines 259--261 seem to be saying inconsistent things (is the assumption just smoothness, or strong convexity and smoothness). I would appreciate clarifications from the authors.

Minor Typos: 1) In definition 1, $\gamma$ and $\mu$ should be strictly positive; (2) Line 182: $\gamma$ -smooth instead of $\gamma$ -smoothness.

Based on the authors' responses, I would be happy to revise my assessment of the paper.

问题

See the Weaknesses above.

局限性

Yes, the authors have adequately addressed this.

作者回复

2024-08-07

Thanks for your time and thoughts. We will answer all your questions.

Question 1: Issue in Weaknesses ``I found several typos ...''

Answer: Thank you for your correction. Here are some comments.

(a) ``the stability equation in Lemma 4 should be written ...''.

Comment: Sorry, this is a typo, the right formula is

\forall z \in \mathcal{Z}, \quad \left| \nabla f(w_t;z) - \nabla f(w^i_t;z) \right|_2 \leq \frac{4M \gamma}{n\mu}.

(b) ``Also, why is the reference in Theorems in section 4 to Theorem 3, instead of Lemma 1''.

Comment: Whether citing Theorem 3 or Lemma 1, the assumptions are the same, because the assumptions we need when calculating the stability of the algorithm require strong-convexity condition, which are stronger than PL condition. However, your understanding is correct. Our citation may affect misunderstanding. We will accept your suggestion.

Comment: $F(w_{T + 1}) - F(w^\star)$ is right. This is a typo, we will correct it.

Question 2: Issue in Weaknesses ``I don't see the point of Marcinkiewicz-Zygmund’s inequality ...''

Answer: In the course of our writing, we discovered that the constant in Marcinkiewicz-Zygmund's inequality for random variables taking values in a Hilbert space could be further improved. Consequently, we included this part in the paper. This was done by referencing the proof of Marcinkiewicz-Zygmund's inequality for random variables with the best constant. Considering that Theorem 1 has other broad applications, we report this result as well. Using this new inequality also yields tighter results in terms of constants in this paper. For Theorem 2, we use the same proof as existing work and obtain tighter results on constants. But regarding the primary focus of this paper, when referred to $n$ , this improvement indeed makes no difference.

Question 3: Issue in Weaknesses ``I need some more clarification in lines 204--207 ...''

Answer: Here I will introduce the difference between our results and [Klochkov and Zhivotovskiy, 2021]. Stability generalization analysis consists of two parts (1) The relationship between algorithmic stability and generalization (2) A certain algorithm is stable. This application part also involves optimization analysis when considering excess risk. Let's start with (2) to demonstrate that we have obtained better results more intuitively.

For PGD, your interpretation is right. Assumptions in [Klochkov and Zhivotovskiy, 2021] are smooth, Lipschitz and strongly convex for function $f$ , and they obtain $O(1/n)$ bound. But we reach $O(1/n^2)$ under the same assumptions. Notice that in [Klochkov and Zhivotovskiy, 2021], smoothness assumption is required to analyse optimization error. But we use smoothness assumption in both generalization and optimization analysis.

For ERM, their assumptions are Lipschitz and strongly convex for function $f$ , and they obtain $O(1/n)$ bound. Our assumptions are smooth, Lipschitz and strongly convex for function $f$ , and we obtain $O(1/n^2)$ bound.

At this point, you may be able to see where our improvements are mainly reflected. We used stronger assumptions when establishing the relationship between algorithmic stability and generalization. This is exactly what lines 204-2007 want to express.

[Klochkov and Zhivotovskiy, 2021] only needs the bounded assumption for function $f$ and $\gamma$ -stable algorithm and their results are

 F(A(S)) -F(w^{\ast}) & \lesssim  (F_S(A(S)) - F_S(w^*(S))) + \mathbb{E}(F_S(A(S)) - F_S(w^*(S))) 
  \quad + (\gamma \log n + 1/n)\log{(1/\delta)}.

Our assumptions are smooth, Lipschitz for function $f$ and PL condition for the population risk $F$ , and $\beta$ -stable in gradient algorithm and our results are

 F(A(S)) -F(w^{\ast}) \lesssim \| \nabla F_S(A(S)) \|_2 + 
  \frac{F(w^*) \log{(1/\delta)}}{n} + \frac{ \log^2(1/\delta)}{n^2}
  + \beta^2 \log^2 n \log^2 (1/\delta).

Note that our method can reach to $O(1/n^2)$ , but [Klochkov and Zhivotovskiy, 2021] can only reach to $O(1/n)$ . In the application of optimization analysis, it is common to use smooth and PL condition assumptions. We believe that introducing these assumptions into the relationship between stability and generalization is meaningful. For ERM, our results do not show a significant advantage; although we obtain tighter bounds, we also introduce stronger smoothness assumption. This is because ERM does not involve complex optimization problems. In the case of PGD, since the optimization analysis involves an $O(\log n)$ convergence rate, the smoothness assumption needs to be introduced. Our method can yield better results under the same assumptions.

In fact, there are currently some studies based on [Klochkov and Zhivotovskiy, 2021] that address other problems, such as differentially private models [Kang et al., 2022, Theorem 1] and pairwise learning [Lei et al., 2021, Theorem 9]. They also need the smoothness assumption in their optimization analysis. However, since they still use the framework in [Klochkov and Zhivotovskiy, 2021] in the generalization part of their research, they can only obtain an $O(1/n)$ bound at most. Using our method, they can achieve $O(1/n^2)$ bounds without changing the assumptions.

If you have any questions or need further clarifications, please do not hesitate to contact us. We look forward to discussing this with you and appreciate your valuable feedback.

Reference

Yegor Klochkov, and Nikita Zhivotovskiy. Stability and Deviation Optimal Risk Bounds with Convergence Rate $O(1/n)$ . NeurIPS 2021.

Yunwen Lei, Mingrui Liu, and Yiming Ying. Generalization guarantee of SGD for pairwise learning. NeurIPS 2021.

Yilin Kang, et al. Sharper Utility Bounds for Differentially Private Models: Smooth and Non-smooth. CIKM 2022.

评论- Response to the Author Rebuttal

2024-08-09

I thank the authors for their reply.

Firstly, I would like to mention that this approach of bounding the excess risk using the framework that: (a) stability implies generalization, and (b) an algorithm is stable, although natural does introduce some more complexity for a reader. I understand that the authors presented a set of more general results (under a certain set of assumptions on the function and stability) for (a) and proved that for stochastic convex optimization, ERM, PGD, and SGD satisfy (b), given the multitude of bounds and assumptions presented in the paper, it is likely for a reader to get lost. I suggest the authors maintain a table and introduce it in the contribution section with the only relevant bounds.

For PGD, your interpretation is right. Assumptions in [Klochkov and Zhivotovskiy, 2021] are smooth, Lipschitz and strongly convex for function, and they obtain bound. But we reach under the same assumptions.

Thanks for the clarification. I now appreciate the result more.

Notice that in [Klochkov and Zhivotovskiy, 2021], smoothness assumption is required to analyse optimization error. But we use smoothness assumption in both generalization and optimization analysis.

I see. Thanks for your clarification. Eventually, you are considering the excess risk for Stochastic Convex Optimization with smooth + strongly convex + Lipschitz $f$ 's, so as long as the rates are better, I am not very worried about whether your assumptions for obtaining bounds for (a) in my comment at the top are stronger or weaker than K&Z'21.

For ERM, their assumptions are Lipschitz and strongly convex for function , and they obtain bound. Our assumptions are smooth, Lipschitz and strongly convex for function , and we obtain bound.

I see. Is it possible to conclude whether or not the approach of K&Z'21 gives $\mathcal{O}(\frac{1}{n^{2}})$ for ERM invoking an extra smoothness assumption? That might make your result even stronger.

At this point, you may be able to see where our improvements are mainly reflected. We used stronger assumptions when establishing the relationship between algorithmic stability and generalization. This is exactly what lines 204-2007 want to express.

Correct, I now understand this better.

[Klochkov and Zhivotovskiy, 2021] only needs the bounded assumption for function and -stable algorithm and their results are .....

Looking at their result, it seems like their generalization analysis is different since their bound is a function of $F_{S}(A(S)) - F_{S}(w^\star(S))$ , so they probably considered studying a different measure of generalization gap. I can see now why the set of assumptions used by the authors for the generalization analysis is different than K&Z'21.

I am satisfied with the author's responses and have revised my score. However, I feel the paper needs to go through another round of revision with careful fixing of the typos, very clearly mentioning the different sets of assumptions used for (a) and (b) by the authors, the results corresponding to these assumptions, and comparison with the most relevant K&Z'21.

2024-08-12

Thank you for your constructive feedback and for raising the ratings.

Firstly, I would like to mention that this approach of bounding the excess risk ...

Answer: Thanks for your suggestion, we will incorporate a summary table in the final version, which will highlight the key results and relevant bounds in a clear and concise manner, making it easier for readers to grasp the main contributions and their implications.

I see. Is it possible to conclude whether or not the approach of K&Z'21 ...

Answer: Even with the introduction of an additional smoothness assumption, the Empirical Risk Minimization (ERM) algorithm does not achieve a rate of $\mathcal{O}(1/n^2)$ . This limit stems from the framework they established, which in part (a) determined that the best attainable rate can only reach $\mathcal{O}(1/n)$ .

We refer to their result:

F(A(S)) - F(w^{\ast}) \lesssim (F_S(A(S)) - F_S(w^(S))) + \mathbb{E}(F_S(A(S)) - F_S(w^(S))) + (\gamma \log n + 1/n)\log{(1/\delta)}.

Even when considering the terms $F_S(A(S)) - F_S(w^*(S))$ and the algorithm's stability parameter $\gamma$ both equating to zero, the derived upper bound remains at most $\mathcal{O}(1/n)$ .

This clearly illustrates that achieving a faster convergence rate, such as $\mathcal{O}(1/n^2)$ , is not feasible within their proposed framework.

Thank you again for your thoughtful feedback and for taking the time to review my work. I am glad to hear that you are satisfied with the responses provided and that you have revised your score. Your insights have been invaluable in improving the quality of my manuscript.

If you have any further comments or suggestions, please feel free to share them.

作者回复

2024-08-07

Thank you for the valuable feedback from the reviewers. We will summarize the main difficulties and innovations of the paper here.

The challenges are primarily rooted in three key aspects:

Sharper Generalization via Stability in Gradients: The distinction between a vector’s $2$ -norm being $0$ and the vector itself being $0$ added complexity to the proof. Although inspired by [Klochkov and Zhivotovskiy, 2021], the proof required constructing vector functions to satisfy all assumptions in Theorem 1. Moreover, ensuring the factor $M$ in Theorem 1 approaches $0$ added a unique challenge. Utilizing the self-bounded property for vector functions also needs to consider the difference between vector’s $2$ -norm being $0$ and the vector itself being $0$ .
1. Balancing Optimization and Generalization Errors: High-probability results are more challenging than expectation results. While expectation results need on-average type stability, high-probability results require uniform type stability. Achieving uniform stability necessitates stronger assumptions to ensure the same optimization outcomes. Achieving $O(1/n^2)$ rate necessitated the trade off between optimization and generalization errors, which required assuming strongly-convexity. This assumption was crucial to ensure a good convergence rate for the algorithm's optimization error.
2. Optimal Constants in Marcinkiewicz-Zygmund's Inequality: Obtaining optimal constants for Marcinkiewicz-Zygmund's inequality for random variables in a Hilbert space required proving both upper and lower bounds, which posed challenges. Since Theorem 1’s scalar version has already seen widespread application, improving the constant factors in Hilbert spaces is believed to be beneficial for the community in the long time.

The novelty of our work can be summarized in three main aspects:

High-Probability Bounds for Excess Risk: We derive $O(\frac{1}{n^2})$ high-probability bounds for the excess risk $F(A(S)) - \inf_{w} F(w)$ of ERM, PGD, and SGD. Prior to our work, the best-known bounds were $\mathcal{O}(\frac{\log n}{n})$ for ERM and PGD using algorithmic stability, and $O(\frac{1}{n^2})$ for ERM and SGD using uniform convergence. However, the latter required $n = \Omega(d)$ samples, introducing an undesirable dependence on the dimensionality $d$ . Our results show that it is possible to achieve $O(\frac{1}{n^2})$ bounds for these algorithms without any dependence on $d$ , utilizing the concept of stability. This represents a significant improvement over previous results.
Sharper Generalization Bounds in Non-Convex Settings: We present sharper generalization bounds in gradients under non-convex settings in Theorem 3. This theorem operates under the assumption of Lipschitz continuity and shows that the coefficient of the $1/\sqrt{n}$ term is related to the variance of the algorithm's optimized gradient over the entire population $\mathbb{E}_{Z} || \nabla f(A(S);Z) ||_2^2$ , as opposed to being related to the maximum gradient value $M$ as in prior works. This provides a tighter bound and is a notable contribution to the understanding of non-convex problems. While the main focus of our paper is on achieving $1/n^2$ bounds through stability, the contributions of Theorem 3 to non-convex optimization are also significant.
Optimal Constants in Marcinkiewicz-Zygmund's Inequality: We derive optimal constants in the Marcinkiewicz-Zygmund's inequality for random variables taking values in a Hilbert space. This required proving both upper and lower bounds for the inequality, which was challenging. While Theorem 1 primarily improves constant factors compared to existing work, the scalar version of this theorem has already seen widespread application. We believe that providing a tighter Hilbert space version will be beneficial to the community.

I hope our explanation can reduce your confusion.

2024-08-14

Dear Area Chairs and Reviewers,

Thank you for your thoughtful feedback on our manuscript. We greatly appreciate the insights and recommendations provided, and we are committed to improving our paper in the final version. We would like to outline our planned revisions based on your comments:

We will try our best to thoroughly check the paper for any typos, ensuring the text is clear and polished.
We will include a summary table in the appendix to facilitate a more intuitive understanding of the key contributions of our work for the readers.
We will integrate the reviewers' suggestions and discussions into the appropriate sections of the paper to enhance its readability and clarity. Furthermore, we believe that the analytical framework proposed in our paper can provide deeper insights into the stability and generalization theory, and it holds significant potential for various downstream applications. As discussed with the reviewers, we would like to emphasize that there are currently several studies based on [Klochkov and Zhivotovskiy, 2021] addressing different problems, such as differentially private models [Kang et al., 2022, Theorem 1] and pairwise learning [Lei et al., 2021, Theorem 9]. These studies also utilize the smoothness assumption in their optimization analysis. However, since they rely on the framework established in [Klochkov and Zhivotovskiy, 2021] for the generalization aspect of their research, they can only achieve an $O(1/n)$ bound at most. In contrast, our method allows for $O(1/n^2)$ bounds without altering the assumptions. We respectfully hope that you consider the innovation and contributions of our paper when making your decision.

Thank you once again for your valuable comments and for the opportunity to improve our manuscript. We look forward to your positive consideration.

Best regards,

Authors

Reference

Yegor Klochkov, and Nikita Zhivotovskiy. Stability and Deviation Optimal Risk Bounds with Convergence Rate $O(1/n)$ . NeurIPS 2021.

Yunwen Lei, Mingrui Liu, and Yiming Ying. Generalization guarantee of SGD for pairwise learning. NeurIPS 2021.

Yilin Kang, et al. Sharper Utility Bounds for Differentially Private Models: Smooth and Non-smooth. CIKM 2022.

最终决定Reject

2024-09-25

This paper considers high probability excess risk bounds and demonstrate that $O(1/n^2)$ rate is achievable. The setting they consider assumes strong convexity, smoothness and Lipschitzness, while algorithms are empirical risk minimization, projected gradient descent and SGD. Authors show that sharper bounds without any dependence on d are possible for these frameworks.

This paper was reviewed by four expert reviewers and received the following Scores/Confidence: 5/4, 5/3, 5/3, 5/3. I think paper is studying an interesting topic but authors are not able to convince the reviewers sufficiently well about the phenomena they are trying to explain. The following concerns were brought up by the reviewers:

Poor writing/presentation and many typos make the paper seem rushed.
Motivation of certain contributions are not clear.
Inconsistent statements, see reviewers VeaT and Xifq's remarks.

No reviewers championed the paper and they are not particularly excited about the paper. I think majority of the concerns can be addressed but that would require significant revision and another set of reviews. As such, based on the reviewers' suggestion, as well as my own assessment of the paper, I recommend not including this paper to the NeurIPS 2024 program.