PaperHub

Overall rating: 5.8 / 10 (Poster, 4 reviewers; min 5, max 6, std 0.4)
Individual ratings: 6, 6, 6, 5
Average confidence: 3.8
Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

SPARKLE: A Unified Single-Loop Primal-Dual Framework for Decentralized Bilevel Optimization

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-12-24
TL;DR

We develop a unified single-loop primal-dual algorithm framework for decentralized bilevel optimization with the state-of-the-art non-asymptotic convergence rate.

Abstract

Keywords
non-convex optimization, decentralized bilevel optimization, transient iteration complexity

Reviews and Discussion

Review (Rating: 6)

This paper proposes SPARKLE, a single-loop primal-dual framework for solving bilevel optimization in the decentralized setting. Specifically, multiple devices collaborate to solve bilevel optimization problems and exchange information via communication over a network. From a theoretical angle, the authors prove that the SPARKLE algorithm achieves better transient iteration complexity and handles gradient dissimilarity more effectively under weaker assumptions than previous works. From an empirical angle, the authors conduct experiments to show that their algorithm performs better than existing methods.

Strengths

  1. The authors provide a single-loop primal-dual framework for solving decentralized bilevel optimization. The technical contributions and the claims are sound, and the experiments support the claims from an empirical perspective.

  2. Moreover, the framework unifies many state-of-the-art strategies used in distributed optimization, such as ATC, GT, EXTRA, and momentum, and obtains the best results compared to existing works.

Weaknesses

  1. Novelty. The novelty of this paper seems a little limited. There are already many existing works proposing various algorithms for solving decentralized bilevel optimization problems, as listed in Table 1 of this paper. I am not sure if the improvement on the transient time and communication complexity over previous works is significant enough to make this work pass the bar of acceptance. The authors may want to clarify this a bit.

  2. Presentation. The authors also mention that there are mixed strategies in this paper like ATC, GT, EXTRA, auxiliary-level updates, momentum update, and so on. The organization of the techniques might not be easy for a layman to follow. It would be better if the authors can highlight the reasons and benefits of using these techniques in a table.

  3. Experiments. Many important baselines are missing in the figures of Section 4. Some figures report the standard deviation over 10 trials while others do not.

Questions

I wonder if the authors have any responses to the weaknesses I listed above.

Limitations

  1. Novelty. One major concern is the limited novelty of this paper, as mentioned in the weakness section.

  2. There are also some limitations mentioned by the authors. For example, the lower-level problem is required to be strongly convex, and it is unclear if the condition number dependence in the upper bounds is optimal. I agree that these are common limitations of existing decentralized bilevel optimization works, and they are interesting to explore.

Author Response

We thank the reviewer for the invaluable comments. We have thoroughly addressed all questions. Should there be any additional concerns or inquiries, we are more than willing to provide further clarification.

Re Weakness:

1. Novelty.

We respectfully disagree that our work has limited contributions. SPARKLE is a unified framework that yields brand new algorithms, and achieves state-of-the-art convergence rates under more relaxed assumptions compared to existing methods. Please refer to the global response for the clarification of the contribution.

Additional clarifications on the reviewers' specific concerns are provided below.

[Transient iteration complexity is critical for decentralized optimization] Transient iteration complexity is critical for both theoretical and empirical reasons.

  • From a theoretical standpoint, transient iteration complexity is key for assessing decentralized algorithms' performance compared to centralized methods and distinguishing between different decentralized algorithms. Without analyzing transient iteration complexity, many decentralized bilevel algorithms would seem to have similar performance based solely on asymptotic convergence rates, potentially aligning with centralized approaches. However, this overlooks the discrepancies evident in practical performance. By examining transient iteration complexity, we can identify more effective algorithms with shorter transient iterations, aligning theoretical insights with empirical results.

  • From an empirical perspective, decentralized bilevel algorithms can only achieve their asymptotic convergence rate after a sufficiently large number of iterations. However, practical scenarios often allow only a limited number of algorithmic iterations due to constraints on time or computational resources. Consequently, these algorithms typically exhibit a decentralization-incurred slowdown in convergence, as illustrated in Fig. 1 from reference [1]. Evidently, the transient iteration complexity provides a more practical representation of decentralized algorithms' performance in real-world applications.

[Transient iteration complexity reflects the impact of the network topology] Decentralized optimization relies on communication networks, unlike centralized optimization with fully-connected networks. While the asymptotic rate remains unaffected, the transient iterations are influenced by the network topology. A sparsely-connected network has $\rho \to 1$, which enlarges the transient iterations. By comparing the transient iteration complexity between SPARKLE and existing baselines, we find that SPARKLE is more robust to sparse topology.

[Communication complexity is critical for decentralized optimization] In distributed systems, communication often takes more time than computation, especially with high-dimensional data. SPARKLE improves on existing methods by allowing different correction techniques and communication topologies for the upper and lower levels. Specifically, some SPARKLE variants, like SPARKLE-ED, can use sparser communication for the upper-level variable x without affecting transient iteration complexity. This reduces communication costs while maintaining efficiency, making SPARKLE more time-efficient compared to other algorithms.
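
As a toy illustration of mixing with different topologies per level (our sketch only, with an assumed ring matrix W_x and a fully-connected W_y; this is not the actual SPARKLE recursion), one communication round might look as follows:

```python
import numpy as np

# Toy sketch: the upper-level variable x uses a sparser ring topology W_x,
# while the lower-level variable y uses a denser (fully-connected) topology W_y.
# Both matrices are symmetric and doubly stochastic.
n = 8
W_y = np.full((n, n), 1.0 / n)                    # dense mixing for the lower level
W_x = np.zeros((n, n))                            # ring mixing for the upper level
for i in range(n):
    W_x[i, i] = 0.5
    W_x[i, (i + 1) % n] = W_x[i, (i - 1) % n] = 0.25

x = np.random.randn(n, 100)                       # per-node upper-level iterates (high-dim)
y = np.random.randn(n, 10)                        # per-node lower-level iterates (low-dim)

x_mixed = W_x @ x                                 # each node exchanges x with 2 neighbors only
y_mixed = W_y @ y                                 # each node exchanges y with all others

# Consensus residuals after one round: the ring reduces disagreement more slowly,
# but communicates far fewer (and here higher-dimensional) vectors per round.
print(np.linalg.norm(x_mixed - x_mixed.mean(axis=0)),
      np.linalg.norm(y_mixed - y_mixed.mean(axis=0)))
```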

[More relaxed assumptions] SPARKLE achieves the best transient time under the most relaxed assumptions (Table 1). We emphasize that the bounded gradient (BG), Lipschitz continuity (LC), and bounded gradient dissimilarity (BGD) assumptions for the upper-level function $f$ and the lower-level function $g$ in other works may not hold in typical applications. For example, consider $f(x,y)=\frac{1}{n}\sum_{i=1}^n f_i(x,y)$ with $f_i(x,y):=\frac{1}{2}x^\top A_i x$, where the $A_i$ are different positive semi-definite matrices and $x\in \mathbb{R}^p$. The gradients (w.r.t. $x$) $A_i x$ and the gradient dissimilarity $(A_i-\frac{1}{n}\sum_{j=1}^n A_j)x$ are unbounded over $x\in\mathbb{R}^p$. This highlights the significance of our theoretical improvement in transient time.
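
As a quick numerical check of this example (our own illustration; the matrices below are arbitrary PSD choices, not from the paper), the dissimilarity scales linearly with the norm of $x$ and is therefore unbounded:

```python
import numpy as np

# Two nodes with f_i(x, y) = 0.5 * x^T A_i x; the local gradient w.r.t. x is A_i x,
# and the gradient dissimilarity is (A_i - mean_j A_j) x.
A = [np.diag([1.0, 1.0]), np.diag([3.0, 0.5])]    # arbitrary distinct PSD matrices
A_bar = sum(A) / len(A)

def dissimilarity(x):
    # max_i ||(A_i - A_bar) x||: the quantity a BGD-type assumption would need to bound
    return max(np.linalg.norm((A_i - A_bar) @ x) for A_i in A)

x = np.array([1.0, -1.0])
for scale in (1.0, 10.0, 100.0, 1000.0):
    print(scale, dissimilarity(scale * x))        # grows linearly with the scale of x
```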

[Sharp analysis] This is the first result showing that bilevel optimization essentially subsumes the convergence of single-level optimization.

2. Presentation. We thank the reviewer for the comment. Here is a brief summary of the benefits of each technique; we will provide a detailed table soon.

  • GT/ED/EXTRA: correct data heterogeneity and improve the transient stage.

  • Momentum: guarantee convergence and enable relaxed assumptions.

  • Mixed network topology: save communication cost.

3. Experiments

Thanks for the comment. Here are our responses.

Baselines: We included MA-DSBO and MDBO as additional baselines for the hyper-cleaning problem, as shown in Table 1 in the manuscript. The stepsizes $\alpha,\beta,\gamma$ for MA-DSBO and MDBO and the moving-average term of MA-DSBO are the same as those of SPARKLE and D-SOBA. The numbers of inner and outer iterations of MA-DSBO are both set to 5, and the number of Hessian-inverse estimation iterations of MDBO is set to 5. The average test accuracy is shown in Figure 1 (supplementary PDF), which shows that SPARKLE outperforms MA-DSBO and MDBO. The other algorithms in Table 1 are not included in the comparison: DSBO, Gossip DSBO, and SLAM suffer from worse transient complexity and have already been outperformed by other decentralized SBO algorithms, while LoPA requires a personalized lower-level problem, which does not match the hyper-cleaning problem.

Standard deviation: Thanks for the comment. The standard deviation over the last 40 iterations of the hyper-cleaning experiment with $p=0.3$ is shown in Table 5 in the manuscript, and we will add the standard deviation for the other scenarios soon. Table 1 shows the average loss of different algorithms over the last 500 iterations for 10 independent trials in the distributed policy evaluation problem in reinforcement learning.

[1] Exponential Graph is Provably Efficient for Decentralized Deep Training

Comment

Dear Reviewer ehMq,

We sincerely thank you for your valuable comments and appreciate the time and effort dedicated to providing constructive feedback on our submission. We have carefully considered your suggestions and made significant efforts to address them. Given the limited timeframe of the rebuttal period, we would greatly appreciate if you could review our rebuttal and let us know if any concerns remain. Your insights are invaluable as we strive to enhance the quality of our work.

Best,

The authors of paper 7209

Comment

Dear Authors,

Thank you for your detailed rebuttal, which addressed my questions and concerns. I would be glad to increase my score.

Best, Reviewer ehMq

Comment

We are delighted that your concerns have been resolved, and we sincerely appreciate your positive feedback. We will incorporate your suggestions in our later revision. Thank you again for your valuable input.

Review (Rating: 6)

This paper introduces SPARKLE, a unified single-loop primal-dual framework for decentralized stochastic bilevel optimization. SPARKLE is highly versatile: it can incorporate various heterogeneity-correction techniques and allows for different strategies to solve the upper- and lower-level problems. The authors provide a unified convergence analysis for SPARKLE and its variants, showing state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. Theoretical findings are supported by numerical experiments.

Strengths

The proposed method SPARKLE is highly versatile and can support various decentralized mechanisms and topologies across optimization levels. The paper has a comprehensive theoretical analysis for all SPARKLE variants, demonstrating that they achieve state-of-the-art convergence rates compared to existing decentralized bilevel algorithms. In addition, the authors show that the convergence performance of SPARKLE variants is comparable to their single-level counterparts, and that employing mixing strategies outperforms using GT alone.

Weaknesses

  1. In Line 303-304, the authors claim that all the SPARKLE-based algorithms generally achieve higher test accuracy than D-SOBA, while ED and EXTRA especially outperform GT. However, in Figure 1, particularly in the middle and right figures, the test accuracies are very similar, making such claims insufficiently supported. Additionally, such small differences might be due to randomness.
  2. The baseline algorithms used in different experiments are different. It is unclear why the authors did not use the same baselines. Furthermore, just one or two existing decentralized SBO algorithms in the hyper-cleaning or distributed policy evaluation experiments is not enough to evaluate the performance of SPARKLE. More state-of-the-art decentralized SBO algorithms need to be added.
  3. It would be important for this paper to evaluate more diverse and larger-scale tasks, including non-linear models such as neural networks (e.g., ResNet), applied to various bilevel problems (e.g., meta-learning).
  4. The authors did not provide the code to reproduce the experiments, raising concerns about reproducibility and making it difficult for other researchers to verify their results.

Minor:

  1. In Assumption 1, it would be clearer to specify the variable with respect to which the function is Lipschitz continuous.

Questions

  1. The authors use the same decentralized mechanism for the lower-level and auxiliary variables. Can lower-level and auxiliary variables use different decentralized mechanisms?
  2. Could the authors provide more explanation on why SPARKLE can support utilizing different mixing matrices across levels? Are there any applications for such scenarios?
  3. The authors only show the upper level loss in the policy evaluation experiment. What about the test accuracy?

Limitations

As the authors stated, SPARKLE supports only strongly-convex problems in the lower-level optimization, and the condition number of the lower-level problem significantly impacts the performance.

Author Response

We thank the reviewer for the invaluable comments. We have thoroughly addressed all questions. Should there be any additional concerns or inquiries, we are more than willing to provide further clarification.

Re Weakness:

  1. Thanks for this concern. In fact, the second row of Table 5 in the manuscript (page 62) also gives the average test accuracy and standard deviation of SPARKLE with different communication strategies for $p=0.3, \theta=0.2$ over 10 independent trials, corresponding to the right subfigure of Fig. 1. The test accuracy of SPARKLE with ED and EXTRA is 1-2% higher than with GT, with an acceptable standard deviation. Thus we can claim that, for a suitable $\theta$, there is a real benefit from ED and EXTRA, and it does not come from stochasticity.

  2. Thank you for the comment. We included MA-DSBO and MDBO as additional baselines for the hyper-cleaning problem, as shown in Table 1 in the manuscript. The stepsizes $\alpha,\beta,\gamma$ for MA-DSBO and MDBO and the moving-average term of MA-DSBO are the same as those of SPARKLE and D-SOBA. The numbers of inner and outer iterations of MA-DSBO are both set to 5, and the number of Hessian-inverse estimation iterations of MDBO is set to 5. The average test accuracy is shown in Figure 1, which shows that SPARKLE outperforms MA-DSBO and MDBO. The other algorithms in Table 1 are not included in the comparison: DSBO, Gossip DSBO, and SLAM suffer from worse transient complexity and have already been outperformed by other decentralized SBO algorithms, while LoPA requires a personalized lower-level problem, which does not match the hyper-cleaning problem.

  3. Thanks for this comment. Here we consider a meta-learning problem in the decentralized scenario with $N=8$ nodes and an adjusted-ring graph. We consider a 5-way 5-shot task on the miniImageNet dataset and compare SPARKLE with D-SOBA and MAML with decentralized communication. For D-SOBA and SPARKLE, the step-sizes are $\beta,\gamma=0.1$ and $\alpha=0.001$. For MAML, the inner step-size is 0.1, the outer step-size is 0.001, and the number of inner-loop steps is 3. For all algorithms, the task number is set to 32. We repeated the experiment only once due to the time limitation. The training and test accuracy of the three algorithms is shown in Figure 4 in the PDF document. It can be observed that SPARKLE outperforms the other two algorithms.

  4. Thanks for this comment. As SPARKLE is based on stochastic gradient descent, we believe the experiments in the manuscript are not difficult to reproduce, so we did not provide the code for SPARKLE. However, we will consider releasing the code soon.

Minor: We thank the reviewer for the suggestion. The Lipschitz continuity is with respect to both $x$ and $y$.

Re Questions:

  1. SPARKLE permits the use of different decentralized mechanisms for the lower-level and auxiliary variables. As noted in Corollary 1, we have $\delta_y = \delta_z$ and $\hat{\delta}_y = \hat{\delta}_z$ when $\mathbf{W}_y = \mathbf{W}_z$ (see Section 3.4, Different strategies across optimization levels, and Lemma 18 for more details). Given that the influence with respect to $y$ and $z$ is identical ($n^3$ and $n$ terms), we employ the same decentralized mechanisms for both variables for simplicity. However, different decentralized mechanisms can be used for $y$ and $z$, for instance, applying EXTRA to $y$ and GT to $z$. The theoretical transient iteration complexities for these configurations can be readily computed using Lemma 18. In SPARKLE, the primary differences lie between the variables $x,y$ and $x,z$, rather than between $y$ and $z$. Therefore, we focus on scenarios where mixing strategies are applied to $x,y$, while maintaining a uniform strategy for $y,z$.

  2. Each mixing matrix $\mathbf{W}$ represents a communication graph or topology. Formally, we can directly use different mixing matrices $\mathbf{W}_x,\mathbf{W}_y,\mathbf{W}_z$, which correspond to different communication topologies. Our key finding regarding the use of different mixing matrices is that some SPARKLE variants, such as SPARKLE-ED, can maintain the transient iteration complexity (up to a constant factor) even when the communication topology for the upper-level variable $x$ is sparser within certain ranges. In other words, by fixing the communication graphs for $y,z$ (using the same graph for both) while applying a sparser communication graph for $x$, the transient iteration complexity remains the same as if the same communication graph were applied to $x$ as to $y,z$. This implies that a sparser communication graph for $x$ can reduce communication costs while preserving the same transient iteration complexity. Please refer to the more detailed discussions in Section 3.5, Different topologies across optimization levels, and Section C.2.2, Theoretical gap between upper-level and lower-level, in the Appendix. We provide upper bounds on the relative sparsity of the topology for $x$ compared to $y,z$ that ensure the transient iteration complexity remains unchanged.

There are many practical scenarios where using different communication topologies for $x,y,z$ is beneficial. For instance, when the dimension of $x$ is larger than that of $y$, or when the computational overhead of computing the gradient of the upper-level function $f$ is higher than that of the lower-level function $g$, employing a relatively sparser communication topology for $x$ compared to $y$ can reduce communication time and costs.

  3. We thank the reviewer for the valuable comment. As the policy evaluation experiment is a reinforcement learning problem with a synthetic dataset, it is hard to provide a "test accuracy". To verify the test performance of SPARKLE, we constructed a fixed "test set" with 10,000 samples, generated in the same way as the training samples. The average test loss of different algorithms is shown in Figure 3, which illustrates the better test performance of SPARKLE.
Comment

Dear Reviewer AnxT,

We sincerely thank you for your valuable comments and appreciate the time and effort dedicated to providing constructive feedback on our submission. We have carefully considered your suggestions and made significant efforts to address them. Given the limited timeframe of the rebuttal period, we would greatly appreciate if you could review our rebuttal and let us know if any concerns remain. Your insights are invaluable as we strive to enhance the quality of our work.

Best,

The authors of paper 7209

Comment

Thank the authors for the rebuttal. Since all my concerns have been addressed, I am happy to increase my score.

Comment

We are delighted that your concerns have been resolved, and we sincerely appreciate your positive feedback. We will incorporate your suggestions in our later revision. Thank you again for your valuable input.

Review (Rating: 6)

The paper introduces SPARKLE, a unified framework for decentralized stochastic bilevel optimization that addresses several limitations in existing approaches. SPARKLE incorporates various heterogeneity-correction techniques, including EXTRA, Exact Diffusion, and Gradient Tracking, and allows for different strategies in upper and lower-level problems. This flexibility enables SPARKLE to outperform previous methods, achieving state-of-the-art performance in terms of convergence rate, gradient complexity, communication cost, and transient iteration complexity. Notably, the framework demonstrates that EXTRA and Exact Diffusion are more suitable for decentralized bilevel optimization than Gradient Tracking. SPARKLE provides a unified convergence analysis applicable to all its variants without requiring restrictive assumptions like bounded gradients or data heterogeneity. Through numerical experiments, the paper demonstrates that SPARKLE achieves better performance compared to existing decentralized bilevel optimization algorithms.

Strengths

1.The SPARKLE framework's unified approach to decentralized bilevel optimization, incorporating multiple heterogeneity-correction techniques (EXTRA, Exact Diffusion, and Gradient Tracking), reveals an important insight: Gradient Tracking (GT) is not the optimal choice in some situations.

2. The authors demonstrate the superiority of SPARKLE in several theoretical aspects of decentralized bilevel optimization, achieving state-of-the-art results across multiple performance metrics. The framework exhibits faster convergence rates, lower gradient complexity, reduced communication costs, and improved transient iteration complexity compared to existing methods.

3.Extensive experimental results show SPARKLE consistently outperforming existing methods across various benchmark problems and real-world scenarios, particularly in terms of convergence rates. This empirical evidence strongly supports SPARKLE's practical utility and efficiency in solving complex optimization tasks in decentralized settings.

Weaknesses

1.Although the paper proposes mixing strategies to handle heterogeneity, it does not provide a strong motivation or clear explanation for the necessity and benefits of these mixing strategies over choosing a single strategy, such as EXTRA or ED. This lack of clarity can make it difficult for practitioners to understand the rationale behind the proposed approach and its potential advantages.

2.The primal-dual approach employed in SPARKLE incurs increased communication and storage overhead per communication round compared to non-primal-dual methods, due to the necessity of maintaining and updating dual variables in addition to primal variables across the decentralized network.

3.The paper appears to lack an explicit analysis of consensus error, a crucial metric for evaluating solution agreement across nodes in decentralized networks. This omission represents a significant gap in the framework's evaluation, potentially limiting our understanding of SPARKLE's effectiveness in achieving consistent solutions in distributed optimization scenarios.

Questions

1.Over an undirected time-invariant graph with non-negative, symmetric and doubly stochastic mixing matrices, do the algorithms SPARKLE with GT, EXTRA, and ED have different ranges of application? If so, how do their application ranges differ in the context of decentralized bilevel optimization?

2.Can the convergence rate of SPARKLE be improved beyond the results presented in Table 1, considering that the algorithm without momentum step in reference [27] achieves comparable convergence results in single-level scenario? In the absence of the momentum step, how would the convergence results of Algorithm 1 be affected? If the primal-dual framework is removed, how would the convergence results of Algorithm 1 be affected?

3.Could the paper provide a detailed comparison of the computational time and communication rounds required to achieve specific test accuracy or loss thresholds among the SPARKLE algorithms with different heterogeneity-correction techniques in experimental results?

Limitations

1.By requiring strong convexity in the lower-level problem, SPARKLE excludes a wide range of practical scenarios where the lower-level optimization may be non-convex or only generally convex. This constraint substantially narrows the scope of decentralized bilevel optimization problems that SPARKLE can effectively address.

2.The performance of SPARKLE is significantly impacted by the condition number of the lower-level problem. This sensitivity could limit its effectiveness in certain problem scenarios.

3. The experiments seem to lack a comparison with single-level algorithms, which could provide valuable insights into the relative performance of SPARKLE in the distributed policy evaluation task in reinforcement learning. While the study demonstrates SPARKLE's superior performance against other decentralized stochastic bilevel optimization (DSBO) approaches like MDBO and SLDBO, it fails to benchmark against established single-level methods.

Author Response

We thank the reviewer for the invaluable comments. We have thoroughly addressed all questions. Should there be any additional concerns or inquiries, we are more than willing to provide further clarification.

Response to Weakness:

1. Motivation for the necessity and benefits of mixing strategies over choosing a single strategy.

[Motivation] Bilevel optimization presents distinct challenges for the upper- and lower-level problems, with the upper-level often being non-convex and the lower-level strongly convex. This difference motivates us to explore whether using different update strategies for each level could provide advantages over the uniform gradient tracking (GT) strategy throughout all optimization levels used in existing approaches. To our knowledge, this question has not been addressed in prior literature.

[Benefits] Whether mixing strategies would bring in benefits should be discussed in separate cases.

  • Before our work, existing works utilize DGD or GT in decentralized bilevel algorithms. Compared to these works, we find that algorithms with mixing strategies, such as SPARKLE-GT-ED, converge faster than employing GT alone; see Table 2 in the paper.

  • On the other hand, our results in Table 2 further show that mixing strategies achieve the same convergence rate as a single strategy when ED or EXTRA is utilized.

With these results, we can conclude that mixing strategies may not necessarily bring benefits compared to the best single strategies such as EXTRA and ED (which are proposed by us).

[Contributions] The main aim of this paper is to showcase the general SPARKLE framework rather than advocate for mixing strategies specifically. Our goal is to highlight the power of SPARKLE in analyzing mixing-strategy algorithms, which was not possible with previous methods. Our key findings are:

  • We have clarified the theoretical performance of mixing strategies, and that mixing strategies may not always offer advantages, which is still significant. Prior to our work, it was unclear whether mixing strategies would be beneficial.

  • We reveal that the performance of both single-strategy and mixing-strategy algorithms depends largely on the lower-level optimization (Theorem 1 and Corollary 4).

2. Extra overhead brought by the primal-dual approach.

We agree that primal-dual methods introduce additional storage overhead due to the presence of dual variables. However, the communication cost per iteration does not necessarily increase. The dual variables of EXTRA and ED can be eliminated from the recursion; see the recursions in Table 3 as well as Section B.2 (Appendix). It can be observed in Table 3 that only the primal variables need to be communicated for SPARKLE-EXTRA and SPARKLE-ED, making them as communication-efficient as primal approaches like D-SOBA.
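
For intuition, here is a minimal single-level sketch of such an elimination (our illustration, assuming the classical EXTRA reformulation with $\tilde{W}=(I+W)/2$; the exact SPARKLE-EXTRA recursion is the one given in Table 3). Writing EXTRA in primal-dual form with a dual variable $d^k$,

$$x^{k+1} = \tilde{W}x^k - \alpha\nabla f(x^k) - d^k, \qquad d^{k+1} = d^k + (\tilde{W}-W)x^{k+1},$$

and subtracting two consecutive primal updates eliminates $d^k$, giving

$$x^{k+2} = (I+W)x^{k+1} - \tilde{W}x^k - \alpha\left(\nabla f(x^{k+1}) - \nabla f(x^k)\right),$$

so only the primal iterate is communicated, while the dual variable is maintained implicitly.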

3. Lacking an explicit analysis of consensus error.

We provide a brief analysis of consensus errors in the Comment below. The asymptotic consensus error for SPARKLE variants using EXTRA, ED, or GT is given by

$$\frac{1}{K}\sum_{k=0}^K\mathbb{E}\left[\frac{\|\mathbf{x}^k-\bar{\mathbf{x}}^k\|^2}{n}+\frac{\|\mathbf{y}^k-\bar{\mathbf{y}}^k\|^2}{n}\right] =\mathcal{O}\left( \frac{n}{K}\left(\frac{1}{1-\rho_y}+\frac{1}{1-\rho_z}\right)\right),$$

where $1-\rho_y$ and $1-\rho_z$ are the spectral gaps of the relevant mixing matrices.

Re Questions:

Q1: Application of EXTRA, ED and GT.

EXTRA, ED, and GT can be used for the same applications. However, bilevel algorithms based on GT incur more communication overhead (refer to the detailed update rules in Table 3) and suffer from longer transient iterations compared to the other two. For this reason, we recommend using SPARKLE-EXTRA or SPARKLE-ED for decentralized bilevel optimization problems.

Q2: The influence of each algorithmic component.

[Momentum cannot be removed] The momentum step, or moving average, is essential for mitigating hyper-gradient estimation errors. These errors arise from sampling variance and estimation inaccuracies in the lower-level solution $y^\star(\bar{x}^k)$ and the auxiliary-level solution $z_\star^k$, which differ from single-level optimization. Properly selecting the moving-average parameter $\theta$ helps reduce these negative effects and ensures SPARKLE's convergence. Without the moving average (i.e., $\theta=1$), convergence cannot be guaranteed. For more details, please refer to Remark 6 (lines 829-836, Appendix) and our proof.
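
A minimal sketch of such a moving-average update (our illustration only; the variable names are ours, and the actual SPARKLE update additionally involves communication and the lower-/auxiliary-level steps):

```python
# h: tracked hyper-gradient estimate, g_hat: current stochastic hyper-gradient
# estimate, theta in (0, 1]: moving-average weight, alpha: upper-level step-size.
def moving_average_step(x, h, g_hat, alpha, theta):
    h_new = (1.0 - theta) * h + theta * g_hat   # theta = 1 disables the averaging
    x_new = x - alpha * h_new                   # descend along the smoothed estimate
    return x_new, h_new
```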

[Primal-dual framework cannot be removed] The primal-dual framework is crucial for formulating heterogeneity correction methods and deriving specific algorithms, as detailed in Table 3. It enables the recovery of algorithms by defining the matrices $A, B, C$. This framework is vital for the basic transformation in Section C.1.2 and for establishing SPARKLE's convergence under Assumptions 1, 2, and 4.

Q3: Detailed comparison of the computational time and communication rounds. We thank the reviewer for the suggestion. For the data hyper-cleaning problem, we report the average gradient computation time required to achieve a given test accuracy for SPARKLE with different communication strategies, as well as for D-SOBA, MDBO, and MA-DSBO-GT, in Figure 1 of the PDF document. It can be observed that SPARKLE needs less computation than D-SOBA and MA-DSBO-GT to achieve a given test accuracy, and SPARKLE with EXTRA for the lower-level communication outperforms the other strategies.

Response to Limitations: 1. Please refer to Section 2 in the Global Rebuttal.

2. Lacking a comparison with single-level algorithms.

We thank the reviewer for the suggestion. We have included single-level EXTRA algorithms in the comparison of different algorithms in the distributed policy evaluation. Table 1 shows the average loss of different algorithms over the last 500 iterations for 10 independent trials. We observe that SPARKLE has similar convergence performance to single-level algorithms.

Comment

Then using Eq. (36) and the definition of $\Delta_k$, we get

$$\frac{1}{K}\sum_{k=0}^K\mathbb{E}\left[\frac{\Vert \mathbf{x}^k-\bar{\mathbf{x}}^k\Vert^2}{n}+\frac{\Vert \mathbf{y}^k-\bar{\mathbf{y}}^k\Vert^2}{n}\right] \lesssim_K \frac{n}{K}\left(\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}+\frac{\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2\Vert \mathbf{\Lambda}_{ya}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert}\right).$$

In particular, the corresponding result for SPARKLE variants using EXTRA, ED, or GT is

$$\frac{1}{K}\sum_{k=0}^K\mathbb{E}\left[\frac{\Vert \mathbf{x}^k-\bar{\mathbf{x}}^k\Vert^2}{n}+\frac{\Vert \mathbf{y}^k-\bar{\mathbf{y}}^k\Vert^2}{n}\right] \lesssim_K \frac{n}{K}\left(\frac{1}{1-\rho_y}+\frac{1}{1-\rho_z}\right),$$

where $1-\rho_y$ and $1-\rho_z$ are the spectral gaps of the relevant mixing matrices for $y$ and $z$, respectively.

We finish the proof.

To our knowledge, this is the first theoretical result on consensus errors for decentralized stochastic bilevel algorithms under the condition that the asymptotic convergence rate exhibits linear speedup.

Comment

Lemma 18: Suppose that Assumptions 1-4 hold. Then there exist constant step-sizes $\alpha, \beta, \gamma, \theta$ such that Lemma 17 holds and

$$\frac{1}{K}\sum_{k=0}^K\mathbb{E}\left[\frac{\Vert \mathbf{x}^k-\bar{\mathbf{x}}^k\Vert^2}{n}+\frac{\Vert \mathbf{y}^k-\bar{\mathbf{y}}^k\Vert^2}{n}\right] \lesssim_K \frac{n}{K}\left(\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}+\frac{\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2\Vert \mathbf{\Lambda}_{ya}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert}\right),$$

where $\lesssim_K$ denotes the asymptotic rate as $K\to\infty$.

Proof

Suppose $\alpha$, $\beta$, $\gamma$, and $\theta$ satisfy the constraints given in Eq. (89) and Eq. (90), which ensures that Theorem 1 (Lemma 17) holds.

For clarity, we define the constants

$$c_1=\frac{9\alpha^2 L_{z^\star}^2}{\gamma^2\mu_g^2}+\frac{438\kappa^4\alpha^2}{\beta^2\mu_g^2}L_{y^{\star}}^2, \qquad c_2=10\left(L^2+\frac{\theta\sigma_{g,2}^2}{n}\right).$$

Then there exist $\alpha$, $\beta$, $\gamma$, and $\theta$ that satisfy the constraints in Eq. (89) and Eq. (90), and also

$$c_1\le 0.01 L^{-2}, \qquad c_2\le 11 L^2. \qquad (*)$$

This implies that $c_1 c_2 < 0.2$. We take such values for the step-sizes in the following proof.

We proceed by substituting Eq. (41) into Eq. (57), yielding:

$$\begin{aligned}
\sum_{k=-1}^K\mathbb{E}[I_k] \leq\;& 4 c_1\left(\frac{\Phi(\bar{x}^0)-\inf\Phi}{\alpha}+c_2 \sum_{k=0}^{K-1}\mathbb{E}\left[\frac{\Delta_k}{n}+I_k\right]+\frac{3\theta}{n}K\left(\sigma_{f,1}^2+2\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}\right)\right) +510\kappa^4\sum_{k=0}^K\mathbb{E}\left[\frac{\Delta_k}{n}\right]+\frac{3\Vert z^1_{\star}\Vert^2}{\mu_g\gamma} \\
&+\frac{6(K+1)\gamma}{\mu_g n}\left(3\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}+\sigma_{f,1}^2\right) +73\kappa^4\left(\frac{4}{\beta\mu_g}\Vert \bar{y}^{0}-y^{\star}(\bar{x}^{0})\Vert^2 +\frac{4K\sigma_{g,1}^2}{n\mu_g}\beta\right).
\end{aligned}$$

Subtracting $4c_1c_2\sum_{k=0}^{K-1}\mathbb{E}[I_k]$ from both sides, we get:

$$\begin{aligned}
\sum_{k=-1}^K\mathbb{E}[I_k] \lesssim\;& \frac{\Phi(\bar{x}^0)-\inf\Phi}{\alpha}+\frac{\theta}{n}K\left(\sigma_{f,1}^2+\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}\right) +\kappa^4\sum_{k=0}^K\mathbb{E}\left[\frac{\Delta_k}{n}\right]+\frac{\Vert z^1_{\star}\Vert^2}{\mu_g\gamma} \\
&+\frac{K\gamma}{\mu_g n}\left(\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}+\sigma_{f,1}^2\right) +\kappa^4\left(\frac{1}{\beta\mu_g}\Vert \bar{y}^{0}-y^{\star}(\bar{x}^{0})\Vert^2 +\frac{K\sigma_{g,1}^2}{n\mu_g}\beta\right).
\end{aligned}$$

Substituting Eq. (57) into Eq. (41), we obtain:

$$\begin{aligned}
\frac{1}{4}\sum_{k=0}^K\mathbb{E}\left\Vert \bar{r}^{k+1}\right\Vert^2 \leq\;& \frac{\Phi(\bar{x}^0)-\inf\Phi}{\alpha}+c_2\sum_{k=0}^K\mathbb{E}\left[\frac{\Delta_k}{n}\right]+c_2 c_1\sum_{k=0}^K\mathbb{E}\Vert \bar{r}^k\Vert^2 \\
&+c_2 \left( 510\kappa^4\sum_{k=0}^K\mathbb{E}\left[\frac{\Delta_k}{n}\right]+\frac{3\Vert z^1_{\star}\Vert^2}{\mu_g\gamma} +\frac{6(K+1)\gamma}{\mu_g n}\left(3\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}+\sigma_{f,1}^2\right) +73\kappa^4\left(\frac{4}{\beta\mu_g}\Vert \bar{y}^{0}-y^{\star}(\bar{x}^{0})\Vert^2 +\frac{4K\sigma_{g,1}^2}{n\mu_g}\beta\right)\right) \\
&+\frac{3\theta}{n}(K+1)\left(\sigma_{f,1}^2+2\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}\right).
\end{aligned}$$

Subtracting $c_2c_1\sum_{k=0}^K\mathbb{E}\Vert \bar{r}^k\Vert^2$ from both sides, we get

$$\begin{aligned}
\sum_{k=0}^K\mathbb{E}\left\Vert \bar{r}^{k+1}\right\Vert^2 \lesssim\;& \frac{\Phi(\bar{x}^0)-\inf\Phi}{\alpha}+\kappa^4\sum_{k=0}^K\mathbb{E}\left[\frac{\Delta_k}{n}\right]+\frac{\theta}{n}K\left(\sigma_{f,1}^2+\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}\right) \\
&+\frac{\Vert z^1_{\star}\Vert^2}{\mu_g\gamma} +\frac{K\gamma}{\mu_g n}\left(\sigma_{g,2}^2\frac{L_{f,0}^2}{\mu_g^2}+\sigma_{f,1}^2\right) +\kappa^4\left(\frac{1}{\beta\mu_g}\Vert \bar{y}^{0}-y^{\star}(\bar{x}^{0})\Vert^2 +\frac{K\sigma_{g,1}^2}{n\mu_g}\beta\right).
\end{aligned}$$

Comment

Taking $\alpha,\beta,\gamma,\theta$ satisfying Eq. (89), Eq. (90), and Eq. (*), and such that $\kappa^4[(\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha^2+\eta_3]$ is a sufficiently small constant, we can derive the following result:

$$\begin{aligned}
&\frac{1}{K}\sum_{k=0}^K\mathbb{E}\left[\frac{\Delta_k}{n}\right] \\
\lesssim\;&\frac{\kappa\eta_2\beta}{K}+\eta_2\beta^2\frac{\sigma_{g,1}^2}{n} \\
&+\left[(\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha^2+\eta_3\right]\left[\frac{1}{\alpha K}+\frac{\theta}{n}\left(\sigma_{f,1}^2+\kappa^2\sigma_{g,2}^2\right) +\frac{1}{\mu_g\gamma K} +\frac{\gamma}{\mu_g n}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) +\kappa^4\left(\frac{1}{\beta\mu_g K} +\frac{\sigma_{g,1}^2}{n\mu_g}\beta\right)\right] \\
&+\frac{\kappa^2\beta^2\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2 \Vert \mathbf{\Lambda}_{ya}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert} \sigma_{g,1}^2+\frac{\kappa^2\alpha^2\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{\theta K(1-\Vert \mathbf{\Gamma}_x\Vert)^2} \\
&+\frac{\kappa^2\Vert \mathbf{O}_y\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_y^{0}\Vert^2}{(1-\Vert \mathbf{\Gamma}_y\Vert)Kn}+\frac{\Vert \mathbf{O}_z\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_z^{0}\Vert^2}{(1-\Vert \mathbf{\Gamma}_z\Vert)Kn} +\frac{\kappa^2\Vert \mathbf{O}_x\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_x^{0}\Vert^2}{(1-\Vert \mathbf{\Gamma}_x\Vert)Kn} \\
&+\gamma^2\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) +\kappa^2\alpha^2\theta\frac{\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{(1-\Vert \mathbf{\Gamma}_x\Vert)^2}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) \\
\lesssim\;&\frac{\kappa^5\eta_2\alpha}{K}+\kappa^{10}\alpha^2\frac{\sigma_{g,1}^2}{n}+\frac{\kappa}{K}\left[(\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha+\frac{\eta_3}{\alpha}\right] \\
&+\left[(\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha^2+\eta_3\right]\left[\frac{\theta}{n}\left(\sigma_{f,1}^2+\kappa^2\sigma_{g,2}^2\right) +\frac{\kappa^5\alpha}{n}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) +\kappa^9\frac{\sigma_{g,1}^2}{n}\alpha\right] \\
&+\frac{\kappa^2\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2 \Vert \mathbf{\Lambda}_{ya}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert} \sigma_{g,1}^2\kappa^8\alpha^2+\frac{\kappa^{-1}\alpha\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{K(1-\Vert \mathbf{\Gamma}_x\Vert)^2} \\
&+\alpha^2\frac{\kappa^{10}\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2\Vert \mathbf{\Lambda}_{ya}\Vert^2\Vert \mathbf{\Lambda}_{yb}^{-1}\Vert^2\zeta^y_0}{K(1-\Vert \mathbf{\Gamma}_y\Vert)}+\alpha^2\frac{\kappa^{8}\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2\Vert \mathbf{\Lambda}_{zb}^{-1}\Vert^2\zeta^z_0}{K(1-\Vert \mathbf{\Gamma}_z\Vert)} \\
&+\alpha^2\frac{\kappa^2\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2\Vert \mathbf{\Lambda}_{xb}^{-1}\Vert^2\zeta^x_0}{K(1-\Vert \mathbf{\Gamma}_x\Vert)} \\
&+\kappa^8\alpha^2\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) +\kappa^2\alpha^2\theta\frac{\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{(1-\Vert \mathbf{\Gamma}_x\Vert)^2}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right),
\end{aligned}$$

where the second inequality uses Eq. (90).

From Eq. (89) and Eq. (90), we can determine the following asymptotic orders for $\alpha,\beta,\gamma$, and $\theta$:

$$\alpha=\mathcal{O}\left(\kappa^{-4}\sqrt{\frac{n}{K\sigma^2}}\right),\quad \beta=\mathcal{O}\left(\sqrt{\frac{n}{K\sigma^2}}\right),\quad \gamma=\mathcal{O}\left(\sqrt{\frac{n}{K\sigma^2}}\right),\quad \theta=\mathcal{O}\left(\kappa\sqrt{\frac{n}{K\sigma^2}}\right).$$

Then we get

$$\frac{1}{K}\sum_{k=0}^K\mathbb{E}\left[\frac{\Delta_k}{n}\right] \lesssim_K \frac{\kappa^2 n}{K}\left(\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}+\frac{\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2\Vert \mathbf{\Lambda}_{ya}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert}\right),$$

where $\lesssim_K$ denotes the asymptotic rate as $K\to \infty$.

Comment

Combining the previous upper bounds for $\sum_{k=-1}^K\mathbb{E}[I_k]$ and $\sum_{k=0}^K\mathbb{E}\left\Vert \bar{r}^{k+1}\right\Vert^2$ with Eq. (83), we obtain

$$\begin{aligned}
&\sum_{k=0}^K\mathbb{E}\left[\Delta_k\right] \\
\lesssim\;&(\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha^2\sum_{k=0}^K\mathbb{E}\Vert \bar{\mathbf{r}}^{k+1}\Vert^2+\kappa\eta_2\beta\Vert \bar{\mathbf{y}}^0-\mathbf{y}^{\star}(\bar{x}^{0})\Vert^2+K\eta_2\beta^2\sigma_{g,1}^2 \\
&+\underbrace{\left(\frac{\kappa^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2\alpha^2}{(1-\Vert \mathbf{\Gamma}_x\Vert)^2} +\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}\cdot\frac{\gamma^2(L_{g,1}^2+(1-\Vert \mathbf{\Gamma}_z\Vert)\sigma_{g,2}^2)}{1-\Vert \mathbf{\Gamma}_z\Vert}\right)}_{\eta_3}\sum_{k=-1}^K\mathbb{E}[nI_k] \\
&+\frac{\kappa^2\beta^2 K\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2 \Vert \mathbf{\Lambda}_{ya}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert} n\sigma_{g,1}^2+\frac{\kappa^2\Vert \mathbf{O}_y\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_y^{0}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert}+\frac{\Vert \mathbf{O}_z\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_z^{0}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert} \\
&+\frac{\kappa^2\Vert \mathbf{O}_x\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_x^{0}\Vert^2}{1-\Vert \mathbf{\Gamma}_x\Vert}+\frac{\kappa^2\alpha^2\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{\theta(1-\Vert \mathbf{\Gamma}_x\Vert)^2} \left\Vert \widetilde{\nabla}\mathbf{\Phi}(\bar{\mathbf{x}}^{0})\right\Vert^2 \\
&+Kn\gamma^2\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) +Kn\kappa^2\alpha^2\theta\frac{\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{(1-\Vert \mathbf{\Gamma}_x\Vert)^2}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) \\
\lesssim\;&\left[(\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha^2+\eta_3\right]\cdot \kappa^4\sum_{k=0}^K\mathbb{E}\left[\Delta_k\right]+\kappa\eta_2\beta\Vert \bar{\mathbf{y}}^0-\mathbf{y}^{\star}(\bar{x}^{0})\Vert^2+K\eta_2\beta^2\sigma_{g,1}^2 \\
&+n\left[(\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha^2+\eta_3\right]\left[\frac{1}{\alpha}+\frac{\theta}{n}K\left(\sigma_{f,1}^2+\kappa^2\sigma_{g,2}^2\right) +\frac{1}{\mu_g\gamma} +\frac{K\gamma}{\mu_g n}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) +\kappa^4\left(\frac{1}{\beta\mu_g} +\frac{K\sigma_{g,1}^2}{n\mu_g}\beta\right)\right] \\
&+\frac{\kappa^2\beta^2 K\Vert \mathbf{O}_y\Vert^2\Vert \mathbf{O}_y^{-1}\Vert^2 \Vert \mathbf{\Lambda}_{ya}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert} n\sigma_{g,1}^2+\frac{\kappa^2\alpha^2\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{\theta(1-\Vert \mathbf{\Gamma}_x\Vert)^2} \left\Vert \widetilde{\nabla}\mathbf{\Phi}(\bar{\mathbf{x}}^{0})\right\Vert^2 \\
&+\frac{\kappa^2\Vert \mathbf{O}_y\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_y^{0}\Vert^2}{1-\Vert \mathbf{\Gamma}_y\Vert}+\frac{\Vert \mathbf{O}_z\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_z^{0}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert} +\frac{\kappa^2\Vert \mathbf{O}_x\Vert^2\mathbb{E}\Vert \hat{\mathbf{e}}_x^{0}\Vert^2}{1-\Vert \mathbf{\Gamma}_x\Vert} \\
&+Kn\gamma^2\frac{\Vert \mathbf{O}_z\Vert^2\Vert \mathbf{O}_z^{-1}\Vert^2\Vert \mathbf{\Lambda}_{za}\Vert^2}{1-\Vert \mathbf{\Gamma}_z\Vert}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right) +Kn\kappa^2\alpha^2\theta\frac{\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{(1-\Vert \mathbf{\Gamma}_x\Vert)^2}\left(\kappa^2\sigma_{g,2}^2+\sigma_{f,1}^2\right).
\end{aligned}$$

Eq. (89) and Eq. (90) imply that

$$\eta_1\lesssim \kappa^2+\kappa^2\frac{\Vert \mathbf{O}_x\Vert^2\Vert \mathbf{O}_x^{-1}\Vert^2\Vert \mathbf{\Lambda}_{xa}\Vert^2}{(1-\Vert \mathbf{\Gamma}_x\Vert)^2},\qquad \eta_2\lesssim \kappa^2, \qquad (\eta_1+\kappa^2 L_{y^{\star}}^2\eta_2)\alpha^2\lesssim \kappa^{-4},\qquad \eta_3\lesssim \kappa^{-4}.$$

Comment

Dear Reviewer 4NJf,

We sincerely thank you for your valuable comments and appreciate the time and effort dedicated to providing constructive feedback on our submission. We have carefully considered your suggestions and made significant efforts to address them. Given the limited timeframe of the rebuttal period, we would greatly appreciate if you could review our rebuttal and let us know if any concerns remain. Your insights are invaluable as we strive to enhance the quality of our work.

Best,

The authors of paper 7209

Review (Rating: 5)

This paper studies a primal-dual framework for decentralized bilevel optimization. It unifies several heterogeneity-correction techniques (gradient tracking and EXTRA). It also provides a shared rate analysis that applies to all variants and avoids several assumptions like gradient boundedness. Several other insights are provided, including that deploying ED/EXTRA at the lower level is more beneficial than using GT across both the upper and lower levels.

Strengths

  • This paper is overall well written, with a clear introduction to the bilevel optimization problem, and shows how many different approaches compare and fit under the unified framework. A unified convergence rate is provided and discussed in detail when specialized to different settings.
  • Several interesting insights are provided, including the benefit of using EXTRA and ED compared to GT.

Weaknesses

  • The unification of several algorithms seems a superficial contribution. One just has to introduce a few unifying notations to cast these algorithms into the same framework. Probably the more non-trivial part is a unified analysis of the unified algorithm, but the paper doesn't seem to make the case that analyzing the unified algorithm is more difficult or more challenging than analyzing the individual algorithms.

  • The rate analysis in Theorem 1 contains many numerical constants like 16/3 and 14/3, which do not seem to be tight. Can the authors comment on the tightness of these constants?

Due to the above reasons, I would say the paper's contribution is marginal.

Questions

See weakness.

Limitations

NA

Author Response

We thank the reviewer for the invaluable comments. We have thoroughly addressed all questions. Should there be any additional concerns or inquiries, we are more than willing to provide further clarification.

  1. Contribution of our work

We appreciate the reviewer's insightful feedback. However, we respectfully disagree that SPARKLE represents a superficial contribution. We clarify this point in the section "Novelty and contribution of our work" of the Global Rebuttal.

  2. It is non-trivial to establish a unified decentralized bilevel framework

The reviewer commented that "the unification of several algorithms seems a superficial contribution," which we respectfully disagree with. Existing single-level decentralized algorithms like EXTRA, ED, and GT have significantly different update rules and recursions (Table 3). They cannot be unified by simply "making a few unifying notations." To effectively unify these algorithms, our approach transforms each of them into a primal-dual formulation, extracts the principal structures behind their different recursions, and finally introduces unifying notations.

  3. Challenges in our analysis

This paper provides a novel analysis for a unified algorithmic framework, offering improved convergence rates and operating under more relaxed assumptions compared to existing literature. We have addressed significant challenges to achieve these results.

  • [Unified analysis] SPARKLE unifies ED, EXTRA, and many GT variants. It allows different update strategies and network topologies across different levels. This unified and general framework poses significant challenges for analysis: (1) The analyses of ED, EXTRA, and GT provided in the literature are drastically different from one another, necessitating the identification of the fundamental mechanisms and core principles that can unify their analysis. (2) Since the SPARKLE framework is a unifying structure, the specific properties endowed by certain algorithms or network topologies cannot be leveraged, forcing the analysis to rely solely on the basic properties, which significantly complicates the proof. (3) The SPARKLE framework is based on a primal-dual structure, whereas existing works, which primarily focus on DGD and GT, rely on the primal structure. This necessitates the development of new analytical techniques for decentralized primal-dual bilevel algorithms, further adding to the complexity of the analysis.

  • [Improved rates] All SPARKLE variants achieve state-of-the-art convergence rates compared to existing algorithms. Specifically, SPARKLE is theoretically proven to have shorter transient stages and better robustness to network topologies, surpassing the results achievable in existing works. (1) Key steps in our proof involve transforming primal-dual forms (Section C.1.2, Appendix) and reformulating the SPARKLE update rules (Equations 33-35). We address consensus biases for each variable $s \in \{x, y, z\}$ and their gradients in aggregated vectors. Analyzing the coupled consensus errors collectively allows us to obtain tighter bounds compared to related works. (2) While our bilevel algorithms can be reduced to single-level ones under certain trivial lower-level loss functions (Section C.4, Appendix), deriving single-level convergence rates directly from the bilevel analysis is challenging. It requires precise estimation of each update procedure. For more details, please refer to the Comment below.

  • [More relaxed assumptions] Our theoretical analysis achieves state-of-the-art convergence rates under more relaxed assumptions compared to existing algorithms. Unlike some related works [2-6], we do not assume that the upper-level loss functions $f_i$ and lower-level loss functions $g_i$ are Lipschitz continuous. We handle the bias of gradient estimation through additional steps, which only require high-order Lipschitz constants. Besides, we utilize heterogeneity correction techniques such as EXTRA, ED, and GT to remove the assumptions on bounded gradient dissimilarity. Please refer to the Comment below for more discussions.

  4. Numerical constants in Theorem 1.

The focus of our paper is to provide a new unified decentralized bilevel framework that yields brand new algorithms and achieves state-of-the-art asymptotic rate, gradient complexity, communication cost, and transient iteration complexity under more relaxed assumptions compared to existing methods. Establishing tight bounds for all constants is beyond the scope of our work. Notably, our convergence rate, while potentially not tight in every constant, outperforms the algorithms listed in Table 1 of our paper in terms of $K$, $n$, and $\rho$.

We guess the reviewer is asking whether the influence of the condition number on our convergence rate is tight or not. Here are some discussions:

  • [Tight bound in the dominant term] The term $\kappa^5$ in the dominant term $\kappa^5 \sigma/\sqrt{nK}$ represents the state-of-the-art result among single-loop decentralized bilevel algorithms, which is consistent with those presented in [1].

  • [Probably untight bound in the higher-order terms] The condition number dependence in the higher-order terms, such as $\kappa^{16/3}$, represents our best effort to provide a tight upper bound. However, we acknowledge that these terms may not be optimal, as we have not conducted a comprehensive analysis of the lower bound. It is important to note that many existing works in this field do not provide a detailed analysis of the influence of $\kappa$. Instead, the impact of $\kappa$ is often treated as a trivial constant hidden in the big-O notation [2-5 of the Global Rebuttal references]. In comparison to these works, our research represents a significant step towards a better understanding of how the condition number influences the convergence rate.

[1] Optimal algorithms for stochastic bilevel optimization under relaxed smoothness conditions.

Comment

Thank you for your response. I can see the contribution and challenges better now. I have increased the score.

Comment
  • [Improved rates] All SPARKLE variants achieve state-of-the-art convergence rates compared to existing algorithms. In particular, it is theoretically established that SPARKLE has shorter transient stages and demonstrates better robustness to network topologies. These refined results surpass those achievable through existing analyses. (1) A fundamental challenge is capturing the common essence of these heterogeneity correction algorithms. A key step in our proof involves the transformation that utilizes the unified primal-dual forms outlined in Section C.1.2 (Appendix) and the reformulation of the update rules for SPARKLE, as specified in Equations (33-35). We address the biases of each variable $s \in \{x, y, z\}$ along with their corresponding gradients in aggregated vectors $\hat{\mathbf{e}}_s$. By analyzing the coupled consensus errors simultaneously and recursively through the contractive matrices $\Gamma_s$, we obtain tighter bounds than those found in related works. (2) Though our bilevel algorithms can formally reduce to single-level algorithms under certain trivial lower-level loss functions (see Section C.4, Appendix), deriving the convergence rates for single-level algorithms directly from the bilevel framework analysis presents its own challenges. We rigorously analyze the hyper-gradient norms in Lemma 8 and provide a tight upper bound. The precise results in Lemma 8 are crucial for deriving the convergence rates of the single-level degenerate versions of the bilevel algorithms within our framework.

  • [More relaxed assumptions] Our theoretical analysis achieves state-of-the-art convergence rates under more relaxed assumptions compared to existing algorithms. (1) Lipschitz continuous loss functions. Most analyses in existing works rely on Lipschitz continuity of the upper-level loss functions $f_i$ and lower-level loss functions $g_i$ [2-6]. In contrast, we handle the bias of gradient estimation through additional steps, which only require high-order Lipschitz constants. For instance, we estimate the bias of $\nabla_1 f_i(x_i^k, y_i^{k+1})$ as follows:
$$\|\nabla_1 f_i(x_i^k, y_i^{k+1}) - \nabla_1 f_i(\bar{x}^k, y^{\star}(\bar{x}^k))\|^2 \leq 2 \|\nabla_1 f_i(x_i^k, y_i^{k+1}) - \nabla_1 f_i(\bar{x}^k, \bar{y}^{k+1})\|^2 + 2 \|\nabla_1 f_i(\bar{x}^k, \bar{y}^{k+1}) - \nabla_1 f_i(\bar{x}^k, y^{\star}(\bar{x}^k))\|^2 \leq 2L_{f,1}^2 \big(\|x_i^k - \bar{x}^k\|^2 + \|y_i^{k+1} - \bar{y}^{k+1}\|^2 + \|\bar{y}^{k+1} - y^{\star}(\bar{x}^k)\|^2\big).$$
In contrast, other works may bound gradients or gradient estimation errors directly by a constant, based on assumptions of Lipschitz continuous loss functions. (2) Bounded gradient dissimilarity. We utilize heterogeneity-correction techniques such as EXTRA, ED, and GT to remove the assumptions on bounded gradient dissimilarity. A common feature of these techniques is that each eigenvalue of the corresponding $L_s$ matrices in Assumption 3 has magnitude less than 1, ensuring that $\|\Gamma_s\| < 1$ in the basic transformations (Eq. (33-35), Appendix). This allows us to recursively bound consensus errors using $\Gamma_s$. In contrast, other decentralized algorithms, such as decentralized SGD, do not satisfy Assumption 3 and therefore rely on additional assumptions.
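
As a toy illustration of the contraction argument in the [Improved rates] bullet above, consider the scalar caricature $e_{k+1} = \gamma\, e_k + b$, where $\gamma$ plays the role of $\|\Gamma_s\|$ and $b$ collects the remaining (gradient and noise) terms. This is only a sketch under that simplifying assumption, not the coupled matrix recursion used in the actual analysis:

```python
# Toy scalar model of a consensus-error recursion e_{k+1} = gamma * e_k + b.
# gamma plays the role of ||Gamma_s||; b collects the remaining terms.
# With gamma < 1 (the contraction guaranteed by Assumption 3 for EXTRA/ED/GT),
# the error settles near b / (1 - gamma); with gamma = 1 it keeps accumulating.

def run_recursion(gamma, b=0.1, e0=1.0, steps=200):
    e = e0
    for _ in range(steps):
        e = gamma * e + b
    return e

for gamma in (0.5, 0.9, 0.99, 1.0):
    print(f"gamma = {gamma:<4}:  e_200 = {run_recursion(gamma):8.3f}")
```

For any $\gamma < 1$ the recursion stays bounded by roughly $b/(1-\gamma)$, which is why the contraction property removes the need for a separate bounded-dissimilarity assumption.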

Comment

We are delighted that your concerns have been resolved, and we sincerely appreciate your positive feedback. Thank you again for your valuable input.

Author Response

We sincerely appreciate the detailed feedback provided by all reviewers. Here we present our response to the common concerns raised by multiple reviewers, together with the results of the newly added experiments.

1. Novelty and contribution of our work.

SPARKLE yields brand new algorithms and achieves state-of-the-art rates under more relaxed assumptions compared to existing methods. In particular, SPARKLE leads to the following novel results.

  • [SPARKLE yields new algorithms utilizing EXTRA and ED] Before SPARKLE, existing bilevel algorithms primarily employ decentralized gradient descent (DGD) and gradient tracking (GT) in algorithm development. SPARKLE expands this paradigm by incorporating EXTRA and ED methods to address data heterogeneity, resulting in novel decentralized bilevel algorithms—SPARKLE-EXTRA and SPARKLE-ED—which were not previously considered in the existing literature. Furthermore, theoretical analysis demonstrates that SPARKLE-EXTRA and SPARKLE-ED exhibit superior convergence complexities compared to existing algorithms.

  • [SPARKLE yields new algorithms with different update strategies across optimization levels] SPARKLE is the first algorithmic framework that introduces mixed strategies for solving the upper-level, lower-level, and auxiliary-level problems. For example, it allows the use of GT to update the lower-level variable $y$, while ED is used for the upper-level variable $x$ and the auxiliary-level variable $z$ (a schematic code skeleton illustrating this per-level flexibility is given after this list). This approach yields novel decentralized bilevel algorithms such as SPARKLE-ED-GT and SPARKLE-EXTRA-GT. In contrast, existing literature on decentralized bilevel optimization primarily utilizes the same strategy across the upper and lower optimization levels.

  • [SPARKLE yields new algorithms with different network topologies across optimization levels] SPARKLE is the first algorithmic framework to propose using different network topologies and mixing matrices for solving the upper-level, lower-level, and auxiliary-level problems. Given that the dimensions of variables can vary significantly across different levels, exploring how diverse mixing matrices can reduce communication costs is an essential novel contribution. By using different weight matrices, all SPARKLE variants are more efficient in communication costs; see Table 1 in our paper. In contrast, existing literature on decentralized bilevel optimization primarily utilizes the same mixing matrix across the upper and lower optimization levels.

  • [State-of-the-art convergence rates and complexities] As demonstrated in Table 1 of our paper, all SPARKLE-derived algorithms achieve state-of-the-art convergence rates and complexities compared to existing baselines. Specifically, all SPARKLE variants exhibit: (1) lower asymptotic computational and communication costs; (2) shorter transient stages; (3) most importantly, greater robustness to network topologies. For instance, it is known that the ring topology has $1-\rho$ on the order of $1/n^2$. When utilizing this topology, the transient stages of MDBO, Gossip DSBO, and D-SOBA are on the order of $n^{19}$, $n^{11}$, and $n^{11}$, respectively. In contrast, SPARKLE variants have a transient stage of $n^{7}$, which is significantly shorter than those of existing baselines (a worked example of how such exponents arise from $1-\rho = \Theta(1/n^{2})$ is given below).

  • [More relaxed assumptions] As shown in Table 1 of our paper, the convergence of existing algorithms relies on stringent assumptions, including bounded gradients, bounded gradient dissimilarity, or functional Lipschitz continuity. In contrast, SPARKLE does not depend on any of these assumptions.
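
To make the per-level flexibility in the second and third bullets concrete, here is a minimal schematic skeleton in Python. It is not the authors' SPARKLE recursion: the toy quadratic losses, the matrices $W_y$ and $W_x$, the step sizes, and the upper-level "hyper-gradient" surrogate are placeholders invented only so that the snippet runs, and only the lower level uses an actual heterogeneity-correction strategy (gradient tracking). The structural point it illustrates is that the $y$-update and the $x$-update can employ different strategies and different mixing matrices:

```python
# Schematic skeleton only -- NOT the authors' SPARKLE recursion.  The y-level
# uses gradient tracking (GT) over a sparse ring matrix W_y, while the x-level
# is mixed with a different, denser matrix W_x.  Losses, step sizes, and the
# "hyper-gradient" surrogate below are placeholders chosen so the code runs.
import numpy as np

n, d = 8, 4                               # number of agents, variable dimension
rng = np.random.default_rng(0)

def ring_matrix(n):
    """Symmetric doubly stochastic mixing matrix of a ring graph."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0 / 3.0
    return W

W_y = ring_matrix(n)                      # sparse topology for the y-level
W_x = np.full((n, n), 1.0 / n)            # different (denser) matrix for the x-level

# Toy heterogeneous lower-level quadratics g_i(x, y) = 0.5 * ||y - A_i x||^2.
A = rng.standard_normal((n, d, d))
x = rng.standard_normal((n, d))
y = np.zeros((n, d))
grad_y = np.stack([y[i] - A[i] @ x[i] for i in range(n)])
tracker = grad_y.copy()                   # GT auxiliary variable for the y-level
alpha, beta = 0.1, 0.02

for _ in range(100):
    # Lower level: gradient-tracking update under W_y.
    y = W_y @ (y - alpha * tracker)
    new_grad_y = np.stack([y[i] - A[i] @ x[i] for i in range(n)])
    tracker = W_y @ tracker + new_grad_y - grad_y
    grad_y = new_grad_y
    # Upper level: placeholder "hyper-gradient" surrogate, mixed with a DIFFERENT W_x.
    hyper_grad = np.stack([A[i].T @ (A[i] @ x[i] - y[i]) for i in range(n)])
    x = W_x @ (x - beta * hyper_grad)

print("y consensus error:", np.linalg.norm(y - y.mean(axis=0)))
print("x consensus error:", np.linalg.norm(x - x.mean(axis=0)))
```

In an actual SPARKLE variant, the $x$- and $z$-levels would likewise use one of the supported corrections (EXTRA, ED, or GT), potentially over a different topology than the $y$-level.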

In summary, we hope to emphasize that our main contribution extends beyond demonstrating how different approaches fit under a unified framework. Our work also: (1) Derives novel algorithms; (2) Provides improved analysis and convergence rates; (3) Offers fresh insights into decentralized bilevel optimization.
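
Returning to the ring-topology example in the convergence-rate bullet above: transient-stage bounds in this literature are often reported in the form $n^{a}/(1-\rho)^{b}$ for algorithm-dependent exponents $a, b$; the specific pairs $(a, b)$ below are illustrative assumptions, not values read off Table 1. Substituting $1-\rho = \Theta(1/n^{2})$ for the ring gives

$$\frac{n^{a}}{(1-\rho)^{b}} \;=\; \Theta\!\left(n^{a+2b}\right),$$

so, for example, a hypothetical bound with $(a, b) = (3, 2)$ yields $n^{7}$ on the ring, whereas $(a, b) = (3, 4)$ and $(3, 8)$ yield $n^{11}$ and $n^{19}$, matching the orders of magnitude quoted above.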

2. Limitation: Strongly convex lower-level function and impact of the condition number

While we have acknowledged the following limitations in the paper, we provide additional discussion here:

  • [Lower-level strong convexity] We agree that our work does not address the more general case where the lower-level loss function is merely convex. This limitation is common in most existing research, as there are currently no decentralized bilevel algorithms specifically designed for the case of generally convex lower-level loss functions. In fact, such problems are significantly more challenging: the lower-level problem might not have a unique solution, and the final objective function could be non-differentiable [1]. Addressing these issues requires fundamentally different approaches, which are beyond the scope of this work.

  • [Condition number] We agree that the condition number significantly impedes convergence, which is a common limitation in most existing research. However, it is important to note that most related works treat the condition number as a constant and obscure its effect using Big-O notation [2-5]. Our analysis offers a more detailed understanding of how $\kappa$ affects the convergence rates. To mitigate the influence of the condition number, Nesterov acceleration could be used to speed up the solution of the lower-level optimization problem, as sketched below; we leave this as future work.
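
The acceleration idea mentioned above can be illustrated on a toy, single-node, strongly convex quadratic; the sketch below applies standard Nesterov acceleration with the usual momentum parameter $(\sqrt{\kappa}-1)/(\sqrt{\kappa}+1)$ and is purely illustrative (it is not part of SPARKLE, and the problem instance and constants are placeholders):

```python
# Toy single-node illustration (not part of SPARKLE): Nesterov acceleration on a
# strongly convex quadratic lower-level problem g(y) = 0.5 * y^T H y, compared
# with plain gradient descent.  kappa = L / mu is the condition number; the
# momentum parameter (sqrt(kappa)-1)/(sqrt(kappa)+1) is the standard choice.
import numpy as np

mu, L = 1.0, 100.0                        # strong convexity / smoothness constants
kappa = L / mu
H = np.diag(np.linspace(mu, L, 50))       # diagonal Hessian with spectrum in [mu, L]

y_gd = np.ones(50)
y_agd = np.ones(50)
y_prev = y_agd.copy()

eta = 1.0 / L
momentum = (np.sqrt(kappa) - 1.0) / (np.sqrt(kappa) + 1.0)
for _ in range(300):
    # Plain gradient descent: error contracts roughly like (1 - 1/kappa) per step.
    y_gd = y_gd - eta * (H @ y_gd)
    # Nesterov's method: error contracts roughly like (1 - 1/sqrt(kappa)) per step.
    v = y_agd + momentum * (y_agd - y_prev)
    y_prev = y_agd
    y_agd = v - eta * (H @ v)

print("GD  distance to optimum:", np.linalg.norm(y_gd))
print("AGD distance to optimum:", np.linalg.norm(y_agd))
```

The accelerated iterate approaches the optimum at a rate governed by $\sqrt{\kappa}$ rather than $\kappa$, which is why acceleration is a natural candidate for reducing the condition-number dependence of the lower-level updates.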

3. Experiments.

The results are shown in the PDF document attached to this global rebuttal. Please refer to the individual rebuttals for more details.

[1] On Finding Small Hyper-Gradients in Bilevel Optimization: Hardness Results and Improved Analysis.

[2] Decentralized stochastic bilevel optimization with improved per-iteration complexity.

[3] A stochastic linearized augmented Lagrangian method for decentralized bilevel optimization.

[4] Decentralized gossip-based stochastic bilevel optimization over communication networks.

[5] Decentralized bilevel optimization over graphs: Loopless algorithmic update and transient iteration complexity.

Final Decision

This paper studies the decentralized bilevel optimization problem and focuses on mitigating the influence of data heterogeneity. It proposes SPARKLE, a unified single-loop primal-dual algorithm framework that offers the flexibility to incorporate various heterogeneity-correction strategies into the algorithm, and allows for different strategies to solve the upper- and lower-level problems. A unified convergence analysis is established. Overall, this paper investigates an important problem and the technical contributions are solid. Therefore, I recommend acceptance.