PaperHub
Overall rating: 6.0/10 (Poster, 3 reviewers; min 5, max 8, std 1.4)
Individual ratings: 5, 8, 5
Confidence: 2.7
Correctness: 3.0 · Contribution: 2.7 · Presentation: 3.0
NeurIPS 2024

Lower Bounds of Uniform Stability in Gradient-Based Bilevel Algorithms for Hyperparameter Optimization

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-12-22
TL;DR

We establish uniform stability lower bounds for representative gradient-based bilevel hyperparameter optimization algorithms.

Abstract

Keywords
Uniform stability · Lower bound · Hyperparameter optimization · Bilevel programming

Reviews and Discussion

Review (Rating: 5)

The paper provides new lower bounds for the uniform stability of UD-based and IFT-based algorithms in hyperparameter optimization.

Strengths

  1. The general setting is described clearly.
  2. The uniform stability lower bounds presented in the paper are novel and clear.

Weaknesses

  1. In my view, the technique employed by the authors appears to be relatively straightforward. For the lower bound, they demonstrate that gradient algorithms may be costly when applied to a non-convex quadratic loss function. While this is not particularly surprising, it can result in a significant distance between the iterates of the two algorithms.

  2. The lower bound presented by the authors does not necessarily imply a generalization lower bound. This might be acceptable, given that there are existing works in the literature that show a lower bound for uniform stability (e.g., [30]). However, in relation to the previous point, I find the result to be less insightful.

Questions

  1. Are the bounds the authors present in Theorems 5.6 and 5.7 tight?

When denoting $x = \gamma\left(1 - \frac{1}{m}\right)$ for $\gamma = \gamma'$, the upper bound scales like $T^{cx}$ and the lower bound scales like $T^{\log(1+cx)}$. The authors state that when $c$ is a small constant, "the explicit lower bound closely aligns with the explicit upper bound" (line 310). However, since $c$ appears in the exponent of $T$, for a multiplicative constant difference between the bounds, $c$ would need to depend on $T$.

  2. Can the technique of the paper also lead to lower bounds for the average stability?

Limitations

Yes.

Author Response

Response to Reviewer 3C12

We thank Reviewer 3C12 for the thoughtful and constructive comments.

W1: Technical challenges of establishing the lower bounds

We agree that the instability of gradient-based algorithms on non-convex loss functions is intuitively expected, but rigorously establishing (nearly) tight lower bounds is a non-trivial task, especially for bilevel algorithms.

For bilevel algorithms, the stability bounds must consider key factors from both the outer and inner optimization processes (m, T, K), which poses challenges to achieving the tightness of the lower bounds. Moreover, the update of the outer parameters with the hypergradient (see Eq. (1), (2), (3), and the discussion for Example 5.4) becomes particularly complex due to the involvement of inner optimization, necessitating a careful design of the loss function beyond merely being non-convex.

W2: The implication of a stability lower bound.

Indeed, as discussed in Section 6, a stability lower bound does not directly imply a generalization lower bound. However, our work focuses on characterizing the limits of uniform stability in generalization analysis, and we have presented (nearly) tight lower bounds for the UD-based algorithms, which are currently missing in the literature and are technically challenging, as discussed in the response to W1. Moreover, our methods may inspire the establishment of generalization lower bounds, as discussed in the following response to Q2 regarding single-level SGD.

Q1: The tightness of the lower bounds in Theorems 5.6 and 5.7.

The recursive lower and upper bounds in Eq. (7) are tight, as claimed in Line 300, while the adjusted bounds in Theorems 5.6 and 5.7 are not tight due to unavoidable scaling steps needed to reveal an explicit order w.r.t. $T$.

For a constant step size, i.e., $\alpha_t = c$, Eq. (7) directly provides tight bounds: $$\left[(1 + c(1 - 1/m)\gamma)^T - 1\right]\frac{2cL'}{m} \leq \epsilon_{\text{arg}} \leq \left[(1 + c(1 - 1/m)\gamma)^T - 1\right]\frac{2cL}{m}.$$ However, when decreasing step sizes are adopted, additional scaling steps based on Eq. (7) must be applied to reveal an explicit order of the bounds w.r.t. $T$, resulting in the discrepancy between the bounds in Theorems 5.6 and 5.7.
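To illustrate why the constant-step bounds grow geometrically in $T$: unrolling a linear recursion of the form $\delta_t = a\,\delta_{t-1} + b$ (with generic constants $a$ and $b$ standing in for the paper's expansion factor and per-step term; the values below are hypothetical) yields a closed form with the $(a^T - 1)$ factor appearing in the bounds above:

```python
def unroll(a, b, T):
    """Unroll the recursion delta_t = a * delta_{t-1} + b with delta_0 = 0."""
    delta = 0.0
    for _ in range(T):
        delta = a * delta + b
    return delta

# Hypothetical values; in the paper, a and b involve c, m, gamma, and L.
a, b, T = 1.05, 0.01, 50
closed_form = b * (a**T - 1) / (a - 1)  # geometric-series closed form
print(abs(unroll(a, b, T) - closed_form) < 1e-9)  # the two agree
```

This is only a sketch of the unrolling mechanism, not the paper's exact constants.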

Regarding our statement of the close alignment, considering Theorems 5.6 and 5.7, the discrepancy between $T^{cx}$ and $T^{\log(1+cx)}$ is $T^{cx}/T^{\log(1+cx)} = T^{cx - \log(1+cx)}$, which thus depends on $T$. Following your suggestion, we will revise the relevant statements to state this discrepancy explicitly.
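A quick numerical sketch of this $T$-dependence (the values of $c$ and $x$ below are hypothetical; the paper's constants depend on the step size, $m$, and $\gamma$):

```python
import math

c, x = 0.5, 0.4          # hypothetical constants for illustration
cx = c * x

# Ratio between the upper bound ~ T**cx and the lower bound ~ T**log(1+cx).
ratios = [T ** (cx - math.log(1 + cx)) for T in (10, 100, 1000, 10000)]
print([round(r, 3) for r in ratios])  # grows with T, since cx > log(1+cx)
```

The ratio is polynomial in $T$ with constant degree $cx - \log(1+cx) > 0$, matching the discussion above.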

Q2: Extend the technique to the average stability.

Considering the similarities between average and uniform stability, we believe our techniques can be adapted to the data-dependent setting for average stability with some modifications, and we established a preliminary lower bound for single-level SGD using our Example E.1, which is comparable to the corresponding upper bound in [Theorem 4, 31].

To account for the randomness in the sampled datasets, we first refine some notation. Let $S^{(i)}$ denote a copy of $S$ with the $i$-th element replaced by $z_i'$, and let $G_{S,t}$/$G_{S^{(i)},t}$ and $w_{S,t}$/$w_{S^{(i)},t}$ denote the SGD update rules and the updated parameters on $S$ and $S^{(i)}$ at step $t$. With a similar proof as in the current paper and a slight adjustment of Definition 4.3, Theorem 5.1 in our paper can be modified as: Suppose there exists a nonzero vector $v(S, S^{(i)})$ along which $G_{S,t}$ and $G_{S^{(i)},t}$ are $2\alpha_t L'(S, S^{(i)})$-divergent and $L(\cdot; z)$ is $\gamma_t'(S, S^{(i)})$-expansive for all $w_{S,t} - w_{S^{(i)},t}$ parallel with $v(S, S^{(i)})$. Then we have $$E_A[\delta_t(S, S^{(i)})] \geq [1+\alpha_t(1-1/m)]\gamma_t'(S, S^{(i)})\, E_A[\delta_{t-1}(S, S^{(i)})] + 2\alpha_t L'(S, S^{(i)})/m.$$ This recursion closely corresponds with the one in [Eq. (19), 31]: $$E_A[\delta_t(S, S^{(i)})] \leq [1+\alpha_t(1-1/m)]\psi_t(S, S^{(i)})\, E_A[\delta_{t-1}(S, S^{(i)})] + 2\alpha_t L(S, S^{(i)})/m,$$ and the matching of $\gamma_t'$ with $\psi_t$ and of $L'$ with $L$ will further guide the design of the constructed example.

Therefore, our analysis framework can be adapted for the average stability with suitable modifications. However, specifically for average stability, a core challenge is calculating the expectation over $S$ and $S^{(i)}$ in the above recursive formula, which is beyond our ability to fully resolve within the short rebuttal period. Nevertheless, we can present a preliminary lower bound using Example E.1 with a simple assumption on the data distribution.

Assume that the data follow a distribution $D$ with $p(z)=0.5$ for $z=(v_1, 1)$ and $z=(-v_1, 1)$ respectively, $S \overset{i.i.d.}{\sim} D^m$, $z_i' \sim D$, and other conditions follow Example E.1 in our paper. Under these assumptions, we derive $L'(S, S^{(i)}) = \Vert y_i x_i - y_i' x_i' \Vert / 2$, and $\mu_t(S, S^{(i)}) = 0$ for $z_i = z_i'$, $\mu_t(S, S^{(i)}) = |\gamma_1|$ for $z_i \neq z_i'$. This leads to the average stability lower bound: $$\epsilon \geq \sum_{t=1}^T \prod_{s=t+1}^T \big(1+\alpha_s(1-1/m)|\gamma_1|\big)\,\alpha_t/m.$$

Moreover, for our example, Eq. (2) in Theorem 4 in [31] takes $\gamma = |\gamma_1|$. When adopting decreasing step sizes and removing the bounded-loss assumption to get an adjusted upper bound as discussed in Lines 822-829, we further have: $$T^{\ln(1+(1-1/m)c|\gamma_1|)}/m \;\lesssim\; \epsilon \;\lesssim\; T^{(1-1/m)c|\gamma_1|}/m.$$

Furthermore, related to the above W2, this lower bound also applies to another definition of average stability [Definition 5, 21], which is shown to be equivalent to the expected generalization error [Lemma 11, 21]. Therefore, this lower bound is also a generalization lower bound.

We will add the above discussion and detailed proofs in the revised version. Thanks again for your valuable comments. We welcome any additional feedback to address any further questions.

Comment

Thank you for the detailed response!

Although the authors addressed each of my questions individually, I want to highlight my main concern, which relates to the combined implications of all their answers.

The technique used by the authors to bound uniform stability involves lower bounding the expansion of the algorithm. As the authors mentioned in their response, this approach has an inherent limitation due to the "unavoidable scaling steps needed to reveal an explicit order with respect to $T$." Consequently, unless the step sizes of the algorithm are $o(1/t)$ (which I believe are not typically used in practice), the lower bounds provided by the authors are not tight and exhibit a polynomial difference (with a constant degree) with respect to $T$ when compared to the best-known stability upper bounds. Together with the fact that a lower bound for uniform stability does not necessarily imply a lower bound for generalization, this weakens the significance of the results.

Comment

Thank you for your response.

There might be a potential misunderstanding: the recursive lower and upper bounds in Eq. (7) are indeed tight, and for a constant step size where $\alpha_t = c$ for all $t$ (which is commonly adopted in practice), Eq. (7) directly provides tight bounds: $$\left[(1 + c(1 - 1/m)\gamma)^T - 1\right] \frac{2cL'}{m} \leq \epsilon_{\text{arg}} \leq \left[(1 + c(1 - 1/m)\gamma)^T - 1\right] \frac{2cL}{m}.$$ Therefore, we argue for the practical implications of our results.

In the submission, to align with the setting where $\alpha_t \leq c/t$ (which is widely discussed in theoretical works such as [17] and [18]), we present lower bounds in Theorems 5.6 and 5.7.

We will add the discussion in the final version.

Comment

We want to clarify that our methods, specifically the constructed examples and the analysis framework, can inspire the establishment of generalization lower bounds. As a result, we present a preliminary generalization lower bound for bilevel (UD and IFT-based) HO algorithms using Example 5.3 and for single-level SGD using Example E.1 with additional assumptions on the data distribution.

In particular, for bilevel HO algorithms, further assume that the data follow a distribution $D$ where $p(z)=0.5$ for $z=(v_1, 1)$ and $z=(-v_1, 1)$ respectively, $S^{val} \overset{i.i.d.}{\sim} D^m$, ${z^{val}_i}' \sim D$, and other conditions follow Example 5.3 in the paper. We derive $L'(S, S^{(i)}) = \eta\,\Vert \sum_{k=1}^{K-1}(I-\eta A)^k (y_i x_i - y_i' x_i') \Vert / 2$, and $\mu_t(S, S^{(i)}) = 0$ for $z_i = z_i'$, $\mu_t(S, S^{(i)}) = \gamma'$ (as in Lines 586-588) for $z_i \neq z_i'$. This leads to a generalization lower bound: $$\epsilon_{gen} \geq \sum_{t=1}^T \prod_{s=t+1}^T \big(1+\alpha_s(1-1/m)\gamma'\big)\,\alpha_t L'/m,$$ where $L' \asymp L \asymp (1+\eta\gamma^{tr})^K$ and $\gamma' = \gamma \asymp (1+\eta\gamma^{tr})^{2K}$. When adopting decreasing step sizes, we further have: $$\epsilon_{gen} \gtrsim T^{\ln(1+(1-1/m)c\gamma')}/m.$$ More detailed illustrations and results for single-level SGD have been presented in the above response to Q2 from Reviewer 3C12.

We will add relevant discussions and detailed proofs in the final version.

Comment

Thank you for the clarifications. I have updated my score.

Comment

Thanks for the helpful comments and for increasing the score. We are glad that we have addressed the reviewer's main concerns.

Review (Rating: 8)

The authors consider gradient-based algorithms for hyperparameter selection. They derive stability lower bounds for UD- and IFT-based algorithms that solve this problem. To obtain these lower bounds, the authors introduce the notion of the lower-bounded expansion property of the update rule of the hyperparameters.

Strengths

  • S1: This paper is very well written and easy to follow.

  • S2: The paper establishes lower bounds on the stability of gradient-based methods for bilevel optimization. By doing so, the authors demonstrate the tightness of existing stability bounds for UD-based algorithms.

  • S3: To my knowledge, the use of the lower-bounded expansion property to prove lower stability bounds is novel.

Weaknesses

  • W1: The most efficient gradient-based algorithms use a warm-starting strategy, meaning that the inner solver and the inverse-Hessian-vector-product solver are not reinitialized at each iteration [1]. It is not clear whether these algorithms are covered by the analysis.

Questions

  • Q1: Can the lower bound be extended to algorithms that use a warm-starting strategy?

Limitations

N/A

Author Response

Response to Reviewer bA9a

W&Q: Extension of the lower bound analysis to algorithms with warm-starting strategy.

We thank Reviewer bA9a for the positive comments and thoughtful questions.

We are uncertain about the precise meaning of the "warm-starting strategy" you mentioned, as we do not find related definitions in the reference [1] Gradient-based optimization of hyperparameters. Please correct us if there is any misunderstanding in our response.

In our understanding, the "warm-starting strategy" may have two meanings. First, the inner parameter $\theta$ is not reinitialized at each outer step. Second, the inverse-Hessian-vector-product solver used to calculate the hypergradient is not reinitialized at each outer step.

For the inner warm starting, our lower bound analysis can be extended with necessary modifications. In Section 4, we define the expansion properties assuming that the update rule is a mapping from the parameter's current state to its next state, independent of its previous states. However, with the warm-starting strategy, the hyperparameter update depends on all its previous states. Therefore, our proposed definitions and corresponding analysis need to be adjusted to characterize this dependence. Specifically, Definition 4.2 can be modified as: An update rule $G$ is $\{\rho_s\}_{s=1}^S$-growing along $v$ if for all $\{w_s, w_s'\}_{s=1}^S$ such that $w_s - w_s'$ is parallel with $v$, $$G(w_S) - G(w_S') \parallel w_S - w_S', \quad \Vert G(w_S) - G(w_S') \Vert \geq \sum_{s=1}^S \rho_s \Vert w_s - w_s' \Vert.$$ Consequently, the recursion in Theorem 5.1 will take the form $$E_A[\delta_t] \geq a_t E_A[\delta_{t-1}] + a_{t-1} E_A[\delta_{t-2}] + \dots + a_2 E_A[\delta_1] + 2\alpha_t L'/m,$$ in contrast to the current form $E_A[\delta_t] \geq a_t E_A[\delta_{t-1}] + 2\alpha_t L'/m$. The stability lower bound will be derived by unrolling this revised recursion.
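To illustrate the effect of such a full-history recursion, a toy numerical sketch (the constant coefficient $a$ and per-step term $b$ are hypothetical stand-ins for $a_t$ and $2\alpha_t L'/m$) compares the memoryless recursion with one that accumulates all previous gaps; the extra history terms can only speed up growth:

```python
T = 20
a, b = 1.02, 0.01   # hypothetical expansion coefficient and per-step term

memoryless = [0.0]  # delta_t = a * delta_{t-1} + b
full_hist = [0.0]   # delta_t = a * (delta_1 + ... + delta_{t-1}) + b
for t in range(1, T + 1):
    memoryless.append(a * memoryless[-1] + b)
    full_hist.append(a * sum(full_hist[1:]) + b)

print(full_hist[-1] >= memoryless[-1])  # True: history terms only add growth
```

This suggests the warm-start lower bound would be at least as large as the memoryless one, consistent with the modified recursion above.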

Regarding the inverse-Hessian-vector-product solvers, we clarify that the algorithms presented in our paper (Algorithm 1, Eq. (2), and Eq. (3)) do not calculate the inverse Hessian and thus do not use such numerical solvers. We recognize, however, that for some IFT-based algorithms the hypergradient is decomposed as $$\nabla_\lambda L = \nabla_\lambda \ell^{val} - \underbrace{\nabla_\theta \ell^{val} \left(\nabla^2_{\theta\theta} \ell^{tr}\right)^{-1}}_{\text{inverse Hessian-vector product}} \nabla^2_{\theta\lambda} \ell^{tr},$$ and the reinitialization of the solver might be an important factor to consider. We suggest a similar modification as for the inner warm-starting strategy, involving additional dependency on parameters, to extend our methods to this case.
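For concreteness, the inverse-Hessian-vector product in this decomposition is typically obtained by solving a linear system rather than forming the inverse; a minimal toy sketch (random SPD Hessian and gradients, all values hypothetical stand-ins for the paper's quantities):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A @ A.T + d * np.eye(d)        # stand-in for nabla^2_{theta,theta} ell^tr (SPD)
g_val = rng.standard_normal(d)     # stand-in for nabla_theta ell^val
J = rng.standard_normal((d, d))    # stand-in for nabla^2_{theta,lambda} ell^tr

# Inverse-Hessian-vector product: solve H v = g_val instead of inverting H.
v = np.linalg.solve(H, g_val)
correction = v @ J                 # the subtracted term in the hypergradient

# Sanity check against the explicit inverse (never formed in practice).
explicit = g_val @ np.linalg.inv(H) @ J
print(np.allclose(correction, explicit))  # True
```

Warm starting such a solver across outer steps is exactly the dependency the proposed modification would need to track.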

Thanks again for your insightful questions. Warm starting is indeed a commonly used practice to enhance the efficiency of algorithms. We are not able to present a precise result on the lower bounds within this short rebuttal period, but we believe our work can inspire future attempts in these settings as described above. We will add relevant discussions in the revised version, and we believe this will improve the quality of our paper. We hope our response addresses your concerns and welcome any further feedback.

Review (Rating: 5)

This paper studies the generalization bound of the hyperparameter optimization problem. It provides a uniform lower bound for the validation argument stability, which proves that the upper bound on the validation argument stability in existing works is tight.

Strengths

The paper is well-written and easy to follow.

This paper studies the hyper-parameter optimization problem, which is of great importance to bilevel optimization community.

The paper's theoretical result on the lower bound of validation argument stability is sound and novel. It provides good insight into the stability of the hyperparameter optimization problem.

Weaknesses

How does the current generalization bound depend on the number of data $n$ in the training dataset? According to the current bound, it is beneficial to just divide more data into the validation set to increase $m$. It seems there is nothing that quantifies the negative impact of having a smaller training dataset. This could lead to an unreasonable interpretation of the bound: suppose there is a dataset of size $N$ where we assign $m$ data to the validation set and $n = N - m$ to the training dataset. According to the current bound, the generalization bound decreases in an order of $\mathcal{O}(1/m)$, which suggests one should just use the largest $m$ possible, which is counterintuitive.

Questions

My question is described in the weakness section. It is my major question, and I will consider raising the score if it is well answered.

Limitations

No potential social impact.

Author Response

Response to Reviewer NBuR

W&Q: Interpretation of the current generalization bound w.r.t. validation and training sample sizes.

We thank Reviewer NBuR for the insightful question.

Intuitively, the expected population risk can be divided into the generalization error and the empirical validation risk as $$E_{S^{val},S^{tr}}[R(A(S^{val},S^{tr}))] = \underbrace{E_{S^{val},S^{tr}}[R(A(S^{val},S^{tr})) - \hat{R}_{S^{val}}(A(S^{val},S^{tr}))]}_{\text{(I): generalization error (Eq. (4) in our paper)}} + \underbrace{E_{S^{val},S^{tr}}[\hat{R}_{S^{val}}(A(S^{val},S^{tr}))]}_{\text{(II): empirical validation risk}}.$$ The first term is explicitly bounded w.r.t. the validation sample size $m$ in our paper ($\epsilon_{gen} \leq \epsilon_{stab} = \Theta(1/m)$; see Line 140 and Theorem 5.7). The second term is implicitly bounded w.r.t. the training sample size $n$, since a larger training set generally leads to better validation performance. Therefore, the assignment of validation and training sets is a trade-off between (I) and (II), which aligns with common discussions on generalization in validation, as seen in [Page 70, 4] for cross-validation.

Specifically, consider assigning $m = aN$ examples to $S^{val}$ and $n = (1-a)N$ examples to $S^{tr}$, where $a \in (0,1)$. On the one hand, the generalization bound $\epsilon_{gen} \leq \epsilon_{stab} = \Theta(1/(aN))$ (see Line 140 and Theorem 5.7) presented in our paper implies that $a$ needs to be sufficiently large for the term (I) to be small. On the other hand, as a larger training set generally leads to better validation performance, $a$ also needs to be sufficiently small to get a low validation risk in the term (II). Therefore, the choice of $a$ represents a trade-off to achieve a favorable population risk.
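A stylized numerical illustration of this trade-off (the $\Theta(1/(aN))$ term follows the stated bound; the $1/((1-a)N)$ decay of term (II) and the constants are purely hypothetical assumptions for illustration):

```python
N = 1000
C1, C2 = 1.0, 4.0   # hypothetical constants for terms (I) and (II)

def total_risk(a):
    gen = C1 / (a * N)         # term (I): Theta(1/m) with m = a * N
    val = C2 / ((1 - a) * N)   # term (II): assumed 1/n decay with n = (1 - a) * N
    return gen + val

# Grid search over the split fraction a; the optimum is interior, not a -> 1.
best_a = min((k / 100 for k in range(1, 100)), key=total_risk)
print(best_a)  # an interior value balancing (I) against (II)
```

Under these assumed decays, the best split is strictly between 0 and 1, matching the qualitative argument above that neither extreme is favorable.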

Technically, the absence of $n$ in the generalization bound arises from the formulation and decomposition of the generalization error. As defined, $\epsilon_{gen}$ can be written as $E_{S^{tr}}[E_{S^{val}}[R(A(S^{val},S^{tr})) - \hat{R}_{S^{val}}(A(S^{val},S^{tr}))]]$. Given a fixed $S^{tr}$ and the independence of $S^{val}$ and $S^{tr}$, the inner term becomes a difference between the expected and empirical validation risk, which is typically bounded with convergence analysis (e.g., stability, uniform convergence, etc.). The outer expectation over $S^{tr}$ is often scaled as a supremum over $S^{tr}$. These technical analyses lead to a generalization bound solely in terms of $m$.

We will add the above discussion in the revised version, and we believe this will improve our paper's clarity.

Final Decision

The paper makes an interesting contribution to the lower bounds on generalization error in a bilevel ERM setup. The authors are highly encouraged to address Reviewer NBuR's question on "how the generalization bound depends on the training dataset size" with a concrete bound (as opposed to a qualitative explanation) in the revision.