6.0

/10

Poster3 位审稿人

最低5最高7标准差0.8

2.7

置信度

正确性3.0

贡献度2.7

表达3.0

NeurIPS 2024

How Does Black-Box Impact the Learning Guarantee of Stochastic Compositional Optimization?

Jun Chen,Hong Chen,Bin Gu

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

摘要

关键词

Stochastic compositional optimizationtheoretical guarantee; black-box optimization; stability analysis

评审与讨论

审稿意见

评分: 5置信度: 32024-07-12

In this paper, the authors systematically analyzed the generalization error and optimization error of the stochastic compositional optimization problems (for black-box cases).

优点

The paper is well-written and well-organized.
The paper provides the generalization analysis and optimization analysis for stochastic compositional optimization problems in a systematic way. The convergence rates for three different black-box cases are provided.

缺点

The technical challenge of extending the theoretical analysis (stability analysis and optimization analysis) for stochastic compositional optimization problems from white-box cases to black-box cases is not clearly discussed. What are the key technical tools employed to address the key challenge? It seems that the black-box gradient approximation in Eq.(5) satisfies the bounded variance assumption. So it is easy to plug this into the existing white-box analysis framework to achieve a generalization error and optimization convergence rate. What are the non-trivial techniques involved in improving the theoretical results compared with the previous work?
Assumption 5 for the non-convex analysis is weird. Assumption 5 implies that any stationary point, i.e., $\mathbb{E}[|| \nabla F_S(w) ||^2]=0$ , to be the global optimal point $w(S)$ . Is this assumption too strong for non-convex analysis?

Minor:

Typos: Eq.(5) is a second-order approximation of Eq.(4), "=" may be incorrect.

问题

What is the key technical challenge of extending the theoretical analysis (stability analysis and optimization analysis) for stochastic compositional optimization problems from white-box cases to black-box cases?
What are the key technical tools employed to address the key challenge?
Is this assumption 5 too strong for non-convex analysis?

局限性

N.A.

作者回复

2024-08-01

We are grateful to you for your valuable comments and constructive suggestions.

Q1: What are the key challenges of extending the theoretical analysis for SCO problems from white-box cases to black-box cases? What are the technical tools employed to address these key challenges? What are the non-trivial techniques involved in improving the theoretical results compared with the previous work?

A1: Thanks for your constructive comment. The key challenges and technical tools of extending the theoretical analysis for SCO problems from white-box cases to black-box cases are listed as follows.

1) Generalization: Considering three different types of black-box SCO methods, we apply our new non-convex analysis (Theorem 3) to these cases (Theorem 4, Corollary 1 and 2) in Section 3.2. Due to the difference related to function form, there are some differences related to the upper bounds of first-order and second-order gradients of $\tilde{\nabla} f$ (please see lines 513-518) between Theorem 3 and the generalization part of Theorem 4. The differences among Theorem 3 and Corollary 1, Corollary 2 are the same as Theorem 4.

2) Optimization: The lines 546-547 mentioned by you are related to our optimization part. For optimization, the estimated gradient does introduce several extra terms regarding the accuracy of the gradient estimation, i.e., $\tilde{\nabla} f-(p+1/2)\beta\nabla f$ and $\tilde{\nabla} f-\beta\nabla f$ . These terms are derived from some special strategies (such as a special decomposition $\tilde{\nabla} f=\tilde{\nabla} f+(p+1/2)\beta\nabla f-(p+1/2)\beta\nabla f$ in the second equality of line 546). We propose an extended lemma (Lemma 6) from [1] and combine this lemma with these strategies to limit the expansion (line 549) of $\mathbb{E}[F_S(w_{t+1})-F_S(w(S))]$ during the iterations. Otherwise, these extra terms will lead to the divergence of our result.

Finally, we want to emphasize our advantages compared with previous work related to the generalization guarantee of SCO [2].

1) Better results: For convex optimization, Theorem 2 leverages the co-coercivity property of convex and smooth function to provide the stability bound $\mathcal{O}((n^{-1}+m^{-1})\beta \log T)$ under milder parameter selection than [2]. And our proof is more concise since it avoids the intermediate step which measures the distance between $v$ and $g(w)$ in the analysis of [2].

2) Non-convex guarantee: We leverage a special lemma, almost co-coercivity lemma, to develop our proof framework to non-convex case to obtain the first stability bound $\mathcal{O}((n^{-1}+m^{-1})T^{\frac{1}{2}}\log T)$ under milder parameter selection than [1].

We have supplemented the above explanations in Section 3.2 of our new manuscript to benefit readers’ understanding.

[1] J. Duchi, et al. Optimal rates for zero-order convex optimization: The power of two function evaluations. TIT, 2015.

[2] M. Yang, et al. Stability and generalization of stochastic compositional gradient descent algorithms. 2023.

Q2: ...Is Assumption 5 too strong for non-convex analysis?

A2: Assumption 5 is called Polyak-Lojasiewicz (PL) condition, which is used extensively in non-convex optimization [3-5]. This assumption suggests that the function value gap $\mathbb{E}[F_S(w)-F_S(w(S))]$ is dominated by the square of gradient norm [6]. It implies all empirical local optimal parameters are empirical global optimal parameters [7]. We use it to prepare for the characterization of $\mathbb{E}[F_S(A(S)) − F_S(w(S))]$ instead of $\mathbb{E}[||\nabla F_S(A(S))||^2]$ [8]. We have added the above explanation behind Assumption 5 in our new manuscript. Besides, we have increased a section in the final part of Appendix to discuss this assumption in detail.

[3] Y. Lei, et al. Generalization guarantee of SGD for pairwise learning. NeurIPS, 2021.

[4] S. Reddi, et al. Stochastic variance reduction for nonconvex optimization. ICML, 2016.

[5] D. Foster, et al. Uniform convergence of gradients for non-convex learning and optimization. NeurIPS, 2018.

[6] Y. Bai, et al. On the complexity of finite-sum smooth optimization under the Polyak-Łojasiewicz condition. ICML, 2024.

[7] H. Karimi, et al. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. ECML-PKDD, 2016.

[8] A. Kuchibhotla and A. Chakrabortty. Moving beyond sub-Gaussianity in high-dimensional statistics: Applications in covariance estimation and linear regression. Information and Inference: A Journal of the IMA, 11(4):1389–1456, 06 2022.

Q3: Typos: Eq.(5) is a second-order approximation of Eq.(4), "=" may be incorrect.

A3: Equation (5) is consistent with the forms of previous work [9,10]. The second-order term of Taylor expansion is regarded as the remaining item taking $v=v_{t+1}^*$ where $v_{t+1}^*$ is an unknown model parameter. We don’t need to know $v_{t+1}^*$ . We have further explained it behind Equation (5) in our new manuscript.

[9] K. Nikolakakis, et al. Black-box generalization: Stability of zeroth-order learning. NeurIPS, 2022.

[10] J. Chen, et al. Fine-grained theoretical analysis of federated zeroth-order optimization. NeurIPS, 2023.

Q4: What is the key technical challenge of extending the theoretical analysis (stability analysis and optimization analysis) for stochastic compositional optimization problems from white-box cases to black-box cases? What are the key technical tools employed to address the key challenge?

A4: This question is the same as Q1. Please see A1.

Q5: Is this assumption 5 too strong for non-convex analysis?

A5: This question is the same as Q2. Please see A2.

2024-08-13

Thanks for the authors' detailed response. My concern has been well addressed. I have no further questions.

2024-08-13

Thank you very much for your constructive comments and recognition of our work.

审稿意见

评分: 7置信度: 22024-07-13

This work presents the generalization upper bound for two stochastic compositional optimization methods, SCGD and SCSC, under convex and non-convex setting. For convex setting, the presented generalization bound is tighter compared to the existing work and it matches the generalization bound of SGD for optimization without compositional structure. For non-convex setting, this work establishes the first generalization bound in SCO literature. Furthermore, this work studies the zeroth-order extension of SCGD/SCSC and presents their excess risk bound in non-convex setting. This is also new in the literature.

优点

The main contribution of this work is the theoretical analysis as I summarized in the summary section.
It is interesting to see how estimation distance $\mu$ and the number of estimation directions $b$ affect the excess risk bound.
This work is well-written and easy to follow.

缺点

I do not recognize any significant weakness in this work. But I do have some questions.

In Table 1, I noticed that the assumption V (bounded variance) and M (bounded function) do appear in SCGD/SCSC (Thm. 2), but not in SGD ([26] Thm. 3.8). Based on my understanding, the assumptions on both of these two analysis are the Lipschitz and smoothness of the unbiased stochastic function values. I do not see any major difference between them. Can the authors clarify what V (bounded variance) and M (bounded function) stand for?
In Table 2, the optimization bounds for the black-box methods seem to omit the terms involving $T,n,m$ . This confuses me. Is there any reason that the $T,n,m$ terms are ignored here?

问题

Please see the weakness section.

局限性

No significant limitations in this work.

作者回复

2024-08-01

We are grateful to you for your valuable comments and constructive suggestions.

Q1: ...Can the authors clarify what V (bounded variance) and M (bounded function) stand for?

A1: Thanks for your constructive comment. As you mentioned, Lipschitz and smoothness are both assumed in our Theorem 2 and Theorem 3.8 in [1]. These two assumptions are the most common conditions in learning theory.

The bounded variance assumption (Assumption 2) is also a classical condition [2-5]. It limits the ranges of the variance value of the given functions $g$ .

The bounded function assumption (Assumption 4) is different from the traditional bounded function assumption $|f|\leq M$ . Assumption 4 requires the distance between two adjacent function outputs to be bounded, i.e., $||g(w+\mu u)-g(w)||\leq M_g, |f(v+\mu u)-f(v)|\leq M_f,$ which is milder than the traditional bounded function assumption $|f|\leq M$ . Besides, it also requires the distance between two adjacent gradient outputs to be bounded, i.e., $||\nabla g(w+\mu u)-\nabla g(w)||\leq M_g^\prime, ||\nabla f(v+\mu u)-\nabla f(v)||\leq M_f^\prime,$ which is milder than bounded gradient condition $||\nabla f||\leq L$ [1].

The above explanations have been added in the remark behind Assumption 2 and Remark 2. We hope our explanations can help you understand Assumptions 2 and 4.

[1] M. Hardt, et al. Train faster, generalize better: Stability of stochastic gradient descent. ICML, 2016.

[2] A. Nemirovski, et al. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574-1609, 2009.

[3] Y. Zhou, et al. Understanding generalization error of SGD in nonconvex optimization. Machine Learning, 111(1):345-375, 2022.

[4] M. Wang, et al. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419-449, 2017.

[5] T. Chen, et al. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. TSP, 69:4937-4948, 2021.

Q2: In Table 2, the optimization bounds for the black-box methods seem to omit the terms involving $T,n,m$ . This confuses me. Is there any reason that the $T,n,m$ terms are ignored here?

A2: In Table 2, we just show the main orders of our convergence results, and their corresponding complete form are shown at the end of every proof in our Appendix. For example, in line 552, the bound is $\mathcal{O}\left(\mu^2(T^{-1}\log T+1)+b^{-1}(d_1T^{-1}\log T+d_2)\right)$ . Because the order of $T^{-1}\log T$ is smaller than 1. Therefore, we just reserve $\mu^2+b^{-1}b_2$ .

To increase readability, we have modified our results in Table 2 to make the comparisons more clear by selecting the specific $\mu, b$ . For example, to obtain a convergence rate $\mathcal{O}\left(T^{-\frac{1}{4}}\right)$ similar to [4], we can select $\mu=\mathcal{O}\left(T^{-\frac{1}{8}}\right), b=\mathcal{O}\left(T^{\frac{1}{4}}d_2\right)$ . For $\mu$ , this parameter can even be taken as a smaller value such as $\mu=0.0001$ in [6]. Therefore, $\mu=\mathcal{O}\left(T^{-\frac{1}{8}}\right)$ is reasonable. As for $b$ , although its value $\mathcal{O}\left(T^{\frac{1}{4}}d_2\right)$ may be so large that the time complexity of model training is increased, this work aims to first show that black-box SCO algorithm can converge and its impact factors, rather than obtaining an optimal convergence rate. The two extra parameters $\mu, b$ do impact the convergence rate compared with the corresponding first-order methods.

[6] P. Chen, et al. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. AISec@CCS, 2017.

2024-08-12

Thank you for the detailed response. I will maintain my score.

2024-08-12

Thank you very much for your constructive comments and recognition of our work.

审稿意见

评分: 6置信度: 32024-07-15

This paper studies stability-based generalization bound for SCGD and SCSC, as well as the convergence rated of their black-box variants. The authors provide sharper generalization bounds for these algorithms, and the first convergence bounds for the black-box variants of these algorithms under PL-conditions.

优点

Strengths:

Novelty of the Problem: The generalization bounds of stochastic optimization algorithms have been previously studied (e.g., [21]), and basic mathematical tools (algorithm stability) for such studies have been proposed. This paper is the first to study block-box SCSC. Although from the proofs of Theorems 3 and 4, the analyses for SCGD and SCSC are essentially the same.
Significance of the Results:
- Generalization Bound: The authors provide a generalization bound of $O((1/n + 1/m)\log T + 1/\sqrt{n})$ for SCGD and SCSC, which is significantly sharper than the previous $O(1/n + 1/m)\log T + 1/\sqrt{n})$ bound. This is a noteworthy contribution.
- Algorithm Convergence: The paper presents convergence bounds for algorithms using estimated (black-box) gradients. While the dependence on $T$ , $n$ , and $m$ is similar to [21], there are additional terms related to the estimation distance mu and sample size b. Moreover, the authors proved this results under a milder condtion (convexity vs PL-condtion. )

There is no experiments in this paper. However, the main goal of this paper is to study the properties of existing algorithms, so experiments are not necessary. But in the remarks the authors say that their theorem provide a more "practical choise" for paratmeres, so experiments might still be helpful to demonstrate this.

Weaknesses/Questions:

Comparing Convergence Rates: In Table 2, it is difficult to compare the convergence rate of the proposed algorithm with other methods. The theoretical results of other methods are provided with respect to $T$ , but for the proposed methods, the dependence is given with respect to $\mu$ and $d$ . This inconsistency makes the table unclear. For instance, Theorem 3.7 of [21] also provides a way to configure $T$ . It should be made more clear that mu and b are extra terms.
- Parameter $\mu$ : The parameter $\mu$ is mentioned only once in Line 112, and it is unclear how good the $O(\mu^2)$ bound is. It seems that $\mu$ is used in the estimation of the gradient and is a constant. Therefore, Theorem 4 suggests that even if $n$ , $m$ and $b$ approach infinity, the difference between function values will converge to a constant instead of zero.
Redundancy in Proofs: In the appendix, the proofs of Theorems 3 and 4 appear nearly identical (at least for parts 1) and 2)), with the main difference being the use of the estimated gradient. Additionally, the proof of Corollary 1 repeats similar arguments. I encourage the authors to reduce redundancy, shorten the proofs, and highlight the main differences to enhance readability.
Challenges of Estimated Gradient: The current form of the proofs makes it difficult to tell the main challenges of introducing the estimated gradient in Theorem 4 compared to Theorem 3. From Lines 546-547, it seems the estimated gradient only introduces several extra terms regarding the accuracy of the gradient estimation, which appears to be standard. Clarifying the main challenges and differences would improve the paper.

缺点

Please see above.

问题

Please see above.

局限性

Yes.

作者回复

2024-08-01

We are grateful to you for your valuable comments and constructive suggestions.

Q1: experiments might still be helpful

A1: Thanks for your constructive comment. We also think some experiments might still be helpful to demonstrate our more practical parameter selections than [1]. There are some experiments [2-5] to validate the statement in line 188.

1) For T: [1] provided some generalization bounds for convex SCGD and SCSC with some impractical $T$ such as $T=O(\max(n^{7/2},m^{7/2}))$ in Theorem 4. While our convex result (Theorem 2) and non-convex result (Theorem 3) can achieve similar rates even taking $T=O(\max(n,m))$ which better matches some empirical observations (Figures 1, 2 in [2], Figures 2 in [3], and Figures 2, 3 in [6]).

2) For $\eta_t$ : Theorem 4 [1] took $\eta_t=T^{-6/7}$ which is too small when $T$ is large. While our Theorem 2 takes $\eta_t=O(t^{-1})$ closer to some empirical selections ( $\eta_t=O(t^{-3/4})$ [2,3] and $\eta_t=O(t^{-1})$ [4,5]).

3) For $\beta_t$ : Theorem 4 [1] took $\beta_t=T^{-4/7}$ which is also too small since [2,3,5] empirically select $\beta_t=t^{-1/2}$ or $\beta_t=t^{-1}$ . In contrast, our theorems have no special restriction on $\beta_t$ .

We have supplemented these explanations in Remark 3 of our new manuscript to benefit readers’ understanding.

[1] M. Yang, et al. Stability and generalization of stochastic compositional gradient descent algorithms. 2023.

[2] T. Chen, et al. Solving stochastic compositional optimization is nearly as easy as solving stochastic optimization. TSP, 2021.

[3] M. Wang, et al. Stochastic compositional gradient descent: Algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 2017.

[4] Z. Huo, et al. Accelerated method for stochastic composition optimization with nonsmooth regularization. AAAI, 2018.

[5] J. Zhang et al. A stochastic composite gradient method with incremental variance reduction. NeurIPS, 2019.

Q2: how good the O(\mu^2) bound

A2: We first explain the parameters $\mu$ and $d$ again.

1) For $\mu$ : Eq. (4) estimates the unknown gradient by the standard finite difference method, where $\mu$ denotes the approximation distance between two model parameters $v+\mu u$ and $v$ . The closer the approximation distance, the more accurate the gradient estimation.

2) For $b$ : The parameter $b$ denotes the number of approximation directions. The more approximation directions, the more accurate the gradient estimation.

In a word, our convergence rates depend on the quality of gradient estimation measured by the two parameters $\mu$ and $b$ . Next, we will compare our convergence rates with other rates.

Except for providing the generalization guarantees of (black-box) SCO methods, our main aim is to theoretically show the intuition that the more accurate the gradient estimation, the better the convergence rate. For the comparisons in Table 2, we have supplemented the selection of $\mu$ and $b$ in the remarks (Section 3.2) of our new manuscript to achieve the similar convergence rates to [1-3]. For example, to obtain a convergence rate $O(T^{-1/4})$ similar to [3], we can select $\mu=O(T^{-1/8}),b=O(T^{1/4}d_2)$ . $\mu$ can even be taken as a smaller value such as $0.0001$ in [6]. Therefore, $\mu=O(T^{-1/8})$ is reasonable. And, although $b=O(T^{1/4}d_2)$ may be so large that the time complexity of model training is increased, this work aims to first show that black-box SCO algorithm can converge and its impact factors, rather than obtaining an optimal convergence rate.

By selecting the above specific $\mu, b$ , we have modified our results in Table 2 to make the comparisons more clear.

[6] P. Chen, et al. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. AISec@CCS, 2017.

Q3: Redundancy in Proofs

A3: We have simplified our proofs in our new manuscript to enhance readability. For example, we have omitted the part using Lemma 3 in the proof of Theorem 4. To further clear these differences, we have added a table (in “Author Rebuttal”) at the head of Appendix C.

Q4: Challenges of Estimated Gradient

A4: We will emphasize the main challenges of introducing the estimated gradient as follows.

1) Optimization: Lines 546-547 are related to our optimization part. For optimization, the estimated gradient does introduce several extra terms, i.e., $\tilde{\nabla} f-(p+1/2)\beta\nabla f$ and $\tilde{\nabla} f-\beta\nabla f$ . These terms are derived from some special strategies (such as a special decomposition to $\tilde{\nabla} f$ in the second equality of line 546). We propose Lemma 6 based on [7] and combine it with these strategies to limit the expansion (line 549) of $\mathbb{E}[F_S(w_{t+1})-F_S(w(S))]$ to ensure the convergence of our result.

2) Generalization: We apply our new non-convex analysis (Theorem 3) to three types of black-box SCO methods (Theorem 4, Corollaries 1 and 2). Different function forms lead to some differences related to the first-order and second-order gradients of $\tilde{\nabla} f$ (please see lines 513-518) between Theorem 3 and Theorem 4. The differences among Theorem 3 and Corollaries 1, 2 are the same as Theorem 4. Our new non-convex generalization analysis (Theorem 3) is developed from our new convex generalization analysis framework (Theorem 2) via replacing the co-coercivity property of convex, smooth function with the almost co-coercivity property of smooth function. Our analysis avoids the intermediate step measuring the distance between $v$ and $g(w)$ in [1], and gets better results $O((n^{-1}+m^{-1})\log T+n^{-1/2})$ under milder parameter selection.

We have supplemented the above explanations in Section 3.2 of our new manuscript to benefit readers’ understanding.

[7] J. Duchi, et al. Optimal rates for zero-order convex optimization: The power of two function evaluations. TIT, 2015.

评论- Response

2024-08-12

Thanks for the detailed reply. I do not have further questions.

2024-08-12

Thank you very much for your recognition of our work and constructive comments which have improved our work.

作者回复

2024-08-04

Thanks for the comments of all reviewers. Considering the limitation of character count, we provide the table of the main differences among our main results in "global response" and upload a PDF including this table. As mentioned in the A3 (the Rebuttal to Reviewer Xsnq), we have added the table at the head of Appendix C in our new manuscript.

Table. The main differences among our main results. (Note that, these line numbers in this table are in our submission version.)


Results	Generalization	Optimization
Theorem 2	Co-coercivity (line 449)	—
Theorem 3	Almost co-coercivity (line 472)	—
Theorem 4	Unknown gradient estimation (lines 513-518)	Some decompositions to $\tilde{\nabla} f$ (line 546)
Corollary 1	Unknown gradient estimation (lines 563-568)	Some decompositions to $\tilde{\nabla} g$ (line 596)
Corollary 2	Unknown gradient estimation (lines 613-618)	Combination of Theorem 4 and Corollary 1

最终决定Accept (poster)

2024-09-25

The paper studies stochastic composite optimization in the derivative-free setting. There are 3 main contributions: (1) sharpening and generalizing prior results on SCO algorithms SCGD based on an improved stability framework (2) developing guarantees for SCGD in the derivative-free and non-convex settings (3) applying the analysis framework to vertical federated learning (VFL) algorithms and obtaining generalization guarantees.

Reviewers agree that the paper is clearly written and the theoretical results are comprehensive. Although several reviewers initially expressed concerns about the technical novelty of the black-box setting, the authors addressed these issues substantially in the rebuttal. In light of the general support from the reviewers, I recommend acceptance.