PaperHub
Score: 7.2 / 10
Decision: Rejected · 4 reviewers
Ratings: 4, 3, 4, 4 (min 3, max 4, std 0.4)
ICML 2025

Stochastic Regret Guarantees for Online Zeroth- and First-Order Bilevel Optimization

OpenReview | PDF
Submitted: 2025-01-24 · Updated: 2025-06-18
TL;DR

We propose a novel search direction for stochastic online bilevel optimization, enabling first- and zeroth-order algorithms to achieve sublinear regret without window smoothing.

Abstract

Keywords
Online Learning, Bilevel Learning, Zeroth- and First-Order Methods

Reviews and Discussion

Official Review
Rating: 4

This article presents two online bilevel optimization algorithms, SOGD and ZO-SOGD. SOGD achieves the state-of-the-art local regret bound without using window-smoothed functions, while ZO-SOGD provides a fully zeroth-order approach that performs hypergradient estimation using only function values. The authors present a theoretical analysis of the regret as well as a convergence analysis. Both algorithms perform well in a series of numerical experiments.

Questions for the Authors

Can the authors present a comparison of the different algorithms under wall-clock time?

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

The reviewer took a quick look at the theoretical analysis. The theoretical claims are reasonable, and no fatal issues were found.

Experimental Design and Analysis

Yes. The experiments show the benefit of the proposed methods. However, it would be better if the authors presented a comparison of the different algorithms under wall-clock time, which may illustrate the practical value of the proposed algorithms.

Supplementary Material

NA

Relation to Prior Literature

NA

Missing Essential References

NA

Other Strengths and Weaknesses

Strengths

  1. Both algorithms are single-loop.
  2. SOGD achieves the state-of-the-art regret bound without extra computation of the window-smoothed function.
  3. ZO-SOGD provides a fully zeroth-order approach for online BO, which is especially useful in the training of large models.

Weaknesses: NA

Other Comments or Suggestions

After author rebuttal: Thanks for the response; the main concerns have been resolved. The reviewer will update the rating.

Author Response

Response: Thank you for your review and suggestions. We would like to clarify that the runtime plots presented in both Figure 2 and Figure 3 already show wall-clock time measurements in seconds. Please refer to the left subplot in Figures 2 and 3, where we directly report time elapsed. Specifically,

  • In Figure 2 (left panel), we present the actual wall-clock time of our ZO-SOGD method compared to the baseline methods in the online adversarial attack experiment. The recorded times range from approximately $0.5 \times 10^2$ to $2.0 \times 10^2$ seconds at $T = 200$ steps.

  • In Figure 3 (left panel), we present wall-clock time comparisons for our SOGD method against OAGD and SOBOW in the parametric loss tuning task. The results show a clear efficiency gain, with SOGD completing under 25 seconds at T = 400, compared to SOBOW, which takes roughly 220 seconds due to its reliance on conjugate gradient (CG) iterations.

All wall-clock times were measured using the same hardware platform across all algorithms to ensure fair and consistent comparisons. These empirical results support the practical value of our proposed methods, aligning with our theoretical efficiency claims.
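
As a side note for readers who wish to reproduce this kind of measurement, below is a minimal, generic timing harness; it is not the authors' actual benchmarking code, and `dummy_step` is a hypothetical stand-in for one online update.

```python
import time
import numpy as np

def time_algorithm(step_fn, num_steps, num_seeds=5):
    """Measure wall-clock runtime of an online algorithm over num_steps
    rounds, repeated across seeds to report mean and standard deviation."""
    runtimes = []
    for seed in range(num_seeds):
        rng = np.random.default_rng(seed)
        start = time.perf_counter()
        for t in range(num_steps):
            step_fn(t, rng)  # one online update at round t
        runtimes.append(time.perf_counter() - start)
    return float(np.mean(runtimes)), float(np.std(runtimes))

def dummy_step(t, rng):
    # Placeholder standing in for one round of an online method.
    x = rng.standard_normal(1000)
    return x @ x

if __name__ == "__main__":
    mean_s, std_s = time_algorithm(dummy_step, num_steps=200)
    print(f"runtime: {mean_s:.3f} +/- {std_s:.3f} s")
```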

In the revised version, we will make this point more explicit by stating "wall-clock time" clearly in the figure captions and experimental section to avoid any ambiguity.

Please let us know if we misunderstood your question, and we will be happy to address your concern further.

Further Elaboration on Runtime and Accuracy Using CIFAR-10:

To further elaborate on runtime and how our methods handle large-scale datasets, we have conducted additional experiments on CIFAR-10 to demonstrate the scalability and effectiveness of our approach beyond MNIST. For our CIFAR-10 experiments, we used a dataset with 10 classes, and further details on the parameters will be added to the appendix in the revised manuscript.

Below we present the performance comparison of our ZO-SOGD method against single-level ZO methods on CIFAR-10:

Table 1: Runtime (seconds)

| Method | t=100 | t=200 | t=300 | t=400 |
|---|---|---|---|---|
| ZO-O-GD | 358±17 | 726±22 | 1089±33 | 1452±44 |
| ZO-O-Adam | 385±17 | 781±28 | 1177±39 | 1573±50 |
| ZO-O-SignGD | 319±13 | 638±22 | 957±31 | 1276±39 |
| ZO-O-ConservGD | 413±20 | 825±28 | 1238±39 | 1650±50 |
| ZO-SOGD (ours) | 1078±33 | 2156±50 | 3234±66 | 4312±83 |
| ZO-SOGD (ours, Adam) | 1155±39 | 2310±55 | 3465±72 | 4620±88 |

Table 2: Testing Accuracy (lower is better for attacks)

| Method | t=100 | t=200 | t=300 | t=400 |
|---|---|---|---|---|
| ZO-O-GD | 0.81±0.07 | 0.78±0.06 | 0.74±0.05 | 0.71±0.05 |
| ZO-O-Adam | 0.59±0.08 | 0.54±0.06 | 0.50±0.05 | 0.47±0.05 |
| ZO-O-SignGD | 0.78±0.10 | 0.71±0.07 | 0.67±0.06 | 0.64±0.06 |
| ZO-O-ConservGD | 0.76±0.06 | 0.68±0.05 | 0.63±0.04 | 0.57±0.04 |
| ZO-SOGD (ours) | 0.73±0.10 | 0.69±0.08 | 0.65±0.07 | 0.56±0.06 |
| ZO-SOGD (ours, Adam) | 0.58±0.08 | 0.49±0.06 | 0.40±0.05 | 0.37±0.04 |

Table 3: Perturbation Magnitude $\|y\|_\infty$

| Method | t=100 | t=200 | t=300 | t=400 |
|---|---|---|---|---|
| ZO-O-GD | 5.46±0.08 | 5.54±0.10 | 5.62±0.11 | 5.72±0.12 |
| ZO-O-Adam | 4.58±0.08 | 4.68±0.10 | 4.76±0.11 | 4.85±0.12 |
| ZO-O-SignGD | 5.52±0.11 | 5.66±0.13 | 5.76±0.14 | 5.88±0.16 |
| ZO-O-ConservGD | 4.52±0.07 | 4.60±0.08 | 4.66±0.09 | 4.72±0.10 |
| ZO-SOGD (ours) | 7.20±0.24 | 7.56±0.29 | 7.92±0.34 | 8.16±0.36 |
| ZO-SOGD (ours, Adam) | 6.70±0.28 | 7.30±0.35 | 7.70±0.38 | 8.00±0.40 |

The CIFAR-10 results demonstrate that our ZO-SOGD approach scales effectively to higher-dimensional problems, with ZO-SOGD (ours, Adam) reducing model accuracy to ~37% at $t=400$ compared to 47-71% for single-level ZO methods, while requiring larger perturbation magnitudes of 8.00 compared to 4.72-5.88 for baseline methods. This represents a trade-off where our approach can achieve significantly higher attack success rates at the cost of somewhat larger perturbations, which remain visually acceptable despite the $12\times$ increase in dimensionality from MNIST to CIFAR-10.

Our ZO-SOGD (ours, Adam) achieves this superior performance with approximately $2.9$-$3.0\times$ computational overhead, representing an excellent trade-off when attack success is the primary objective, and the consistent improvement from $t=100$ to $t=400$ confirms that bilevel optimization yields further benefits in higher-dimensional spaces. The larger perturbation magnitudes are offset by the substantially improved attack success rates, making our approach particularly valuable in scenarios where robustness evaluation is critical. We will add more detailed analyses and hyper-parameter sensitivity studies in the appendix.

Official Review
Rating: 3

This paper proposes a novel approach to online bilevel optimization (OBO) that achieves sublinear stochastic bilevel regret without window smoothing, addressing limitations in existing methods under dynamic conditions. By introducing a new search direction, it improves efficiency through reduced oracle dependence, simultaneous inner-outer updates, and zeroth-order estimation of Hessians, Jacobians, and gradients. Experiments on online parametric loss tuning and black-box adversarial attacks validate its effectiveness.

Questions for the Authors

Please see my comments on the weaknesses of the paper.

Claims and Evidence

Yes, the claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes

Theoretical Claims

I have reviewed the theorem in the main text. The appendix contains extensive theoretical analyses, but I have not gone through all of them.

Experimental Design and Analysis

Yes, although the experimental results lack analysis. I recommend providing more detailed insights into why the proposed method outperforms the baselines. For example, the authors state: "The left panel shows that ZO-SOGD has similar runtime to single-level baselines, despite outer-level optimization on x." However, the specific components of the proposed algorithm that contribute to these performance improvements are not clearly highlighted. A deeper analysis of the key factors driving the observed empirical advantages would significantly strengthen the experimental section.

Supplementary Material

N/A

Relation to Prior Literature

This paper is related to bilevel optimization and online optimization.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths:

  1. This work introduces a novel search direction, enabling first- and zeroth-order OBO methods to achieve sublinear stochastic bilevel regret without relying on window smoothing.
  2. Unlike existing methods that depend on gradient, Hessian, and Jacobian oracles, this work estimates these quantities using only function value oracles, improving scalability in black-box settings.
  3. By requiring only a single subproblem solver iteration, the proposed algorithms enhance efficiency over existing approaches that rely on multiple iterations for hypergradient approximation.

Weaknesses:

  1. The motivation for designing a hyper-gradient-based method to solve OBO is not clearly articulated. The authors need to justify why a hyper-gradient-based approach is adopted instead of first-order methods such as the value function method [1] or the cutting plane method [2]. A more in-depth discussion of the advantages of hyper-gradient-based OBO over these alternative approaches would enhance the readability and justification of the proposed method.

  2. Given that the proposed algorithm requires extensive computations in each iteration, I suggest that the theoretical analysis not only provide iteration complexity (as in Theorem 3.6 and Theorem 4.2) but also include an overall computational complexity analysis, such as arithmetic complexity. Additionally, a comparison of the overall complexity between the proposed algorithm and the baseline methods would help clarify whether the proposed approach offers significant complexity advantages over competing methods.

  3. Regarding the theoretical analysis, I appreciate the effort in establishing rigorous theoretical guarantees. However, since several assumptions are made, it would be beneficial to further discuss their practicality. For instance, in the experimental section, do all these assumptions hold in practice? One particular point of concern is the strong convexity of the lower-level objective function in Assumption 3.2, as this assumption may limit the applicability of the proposed OBO algorithm. Given that many recent works in bilevel optimization no longer rely on this assumption [3, 4], a discussion on whether and how this assumption might be relaxed would be valuable.

  4. The readability of the experimental section can be improved. The authors dedicate a substantial portion of the section to introducing baseline methods and problem settings but overlook critical experimental analysis. I recommend providing more detailed insights into why the proposed method outperforms the baselines. For example, the authors state: "The left panel shows that ZO-SOGD has similar runtime to single-level baselines, despite outer-level optimization on x." However, the specific components of the proposed algorithm that contribute to these performance improvements are not clearly highlighted. A deeper analysis of the key factors driving the observed empirical advantages would significantly strengthen the experimental section.

  5. It seems that the proposed algorithm may also be applicable to traditional bilevel optimization or zeroth-order bilevel optimization. I suggest that the authors include comparisons with existing zeroth-order nested optimization methods to further demonstrate the effectiveness of the proposed approach.

[1] BOME! Bilevel Optimization Made Easy: A Simple First-Order Approach. NeurIPS 2022.
[2] Asynchronous Distributed Bilevel Optimization. ICLR 2023.
[3] Projection-Free Methods for Stochastic Simple Bilevel Optimization with Convex Lower-Level Problem. NeurIPS 2023.
[4] An Accelerated Gradient Method for Convex Smooth Simple Bilevel Optimization. NeurIPS 2024.

Other Comments or Suggestions

No

Author Response

Response to W1: Thank you for the helpful comment. We chose to use a hyper-gradient-based approach due to its robustness and strong theoretical guarantees in the context of stochastic OBO problems. In contrast, alternative approaches such as value function methods [1] and asynchronous bilevel optimization methods [2] are primarily designed for deterministic settings. These methods rely on Lagrangian reformulations and typically converge to Lagrangian stationary points, rather than true bilevel solutions. They also require careful tuning of penalty parameters, which increases their complexity. We have clarified the motivation and key differences from the deterministic methods in [1,2] in the revision.

Response to W2: Thank you for the suggestion of arithmetic complexity analysis. We have added a detailed arithmetic complexity comparison in our response to Reviewer x9oa and do not repeat it here due to response length limits. Briefly, our SOGD method significantly reduces computational cost, while ZO-SOGD suffers from worse dimensional dependence due to its gradient-free nature.

Response to W3: Thank you for this important point. We provide a detailed response to Reviewer Xqeg on this issue. Our experiments include a non-convex imbalanced learning task with a deep neural network inner problem (Section 5), where our algorithms show stable convergence despite the lack of strong convexity.

While Refs. [3] and [4] use more relaxed inner assumptions (convexity of the inner objective), they focus on simple bilevel problems in the offline setting. Our OBO setting is more complex and requires stronger assumptions for convergence. Adapting relaxed conditions such as the PL condition to stochastic OBO is a valuable direction for future work but needs careful investigation.

Response to W4: We thank the reviewer for the insightful feedback. We have revised the experimental section to provide a clearer analysis of why our proposed methods outperform the baselines and to highlight the components responsible for these improvements. Specifically:

  • In Sec 5.1, ZO-SOGD demonstrates stronger attack performance compared to single-level baselines such as ZO-O-Adam. These gains stem from our proposed OBO framework, which allows for continuous optimization of hyper-parameters, in contrast to baselines like ZO-O-Adam that tune such parameters in a discrete and manually selected search space. Additionally, the new search direction defined in Eq. (6), combined with variance control in the finite-difference estimators (Eq. 21) and the projected update on $v_t$ defined in Eq. (8), contributes to more effective ZO-SOGD over time (a generic sketch of a function-value-based gradient estimator is given below).

  • In Sec 5.2, our SOGD method achieves the lowest runtime (left panel) while maintaining high accuracy across varying degrees of class imbalance (middle and right panels). Although other OBO-based algorithms can also perform continuous hyper-parameter search, our superior efficiency results from the single-loop simultaneous update scheme (Alg 1). This avoids the high computational cost of multi-loop strategies, such as SOBOW, which rely on window smoothing and conjugate gradient solvers.

We have explicitly clarified these points in Sections 5.1 and 5.2 of the revised manuscript.
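
To make the role of the function-value-based estimators concrete, here is a minimal generic sketch of a two-point Gaussian-smoothing gradient estimator, the kind of construction that finite-difference estimators such as Eq. (21) build on. It is an illustrative stand-in rather than the paper's exact estimator; the function name and parameters are hypothetical.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, num_dirs=20, rng=None):
    """Two-point Gaussian-smoothing estimate of the gradient of f at x,
    using only function evaluations (no gradient oracle)."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(x)
    for _ in range(num_dirs):
        u = rng.standard_normal(x.shape)
        # Central difference along the random direction u.
        grad += (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
    return grad / num_dirs

# Example: for f(z) = z.z the true gradient is 2z.
if __name__ == "__main__":
    f = lambda z: float(z @ z)
    x = np.array([1.0, -2.0, 0.5])
    print(zo_gradient(f, x, num_dirs=500))  # approximately [2, -4, 1]
```

Averaging over more random directions reduces the variance of the estimate at the cost of additional function evaluations, which is the same accuracy-versus-runtime trade-off reflected in the runtime comparisons reported in this thread.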

Response to W5: We appreciate the suggestion to compare with ZO nested methods. To our knowledge, there are no practical zeroth-order nested methods without access to a gradient oracle (including those with variance reduction) available in the literature. We include a comparison (following the setup in Sec. 5) with our offline variant (ZO-S$^2$GD), which highlights a tradeoff: while it achieves stronger attack performance (test accuracy drops to 0.09 at $T=200$), it requires significantly more computation ($40\times$ runtime) and generates highly visible perturbations ($\|y\|_\infty$ quickly reaches 5.0).

Our online method (ZO-SOGD) maintains an effective balance between computational efficiency, attack performance (testing accuracy 0.56 at $T=200$), and imperceptibility ($\|y\|_\infty$ 2.7 at $T=200$).

Runtime (seconds)

| Method | T=50 | T=100 | T=200 |
|---|---|---|---|
| ZO-SOGD | 48±5 | 80±8 | 190±10 |
| ZO-S²GD | 2800±30 | 8000±50 | 15800±100 |

Testing Accuracy (lower is better for attacks)

| Method | T=50 | T=100 | T=200 |
|---|---|---|---|
| ZO-SOGD | 0.75±0.10 | 0.67±0.12 | 0.56±0.12 |
| ZO-S²GD | 0.23±0.05 | 0.10±0.03 | 0.09±0.02 |

Perturbation Magnitude $\|y\|_\infty$

| Method | T=50 | T=100 | T=200 |
|---|---|---|---|
| ZO-SOGD | 1.8±0.3 | 2.2±0.3 | 2.7±0.3 |
| ZO-S²GD | 3.0±0.2 | 5.0±0.0 | 5.0±0.0 |

We also note that, as illustrated in Figure 2 of our paper, ZO-SOGD (ours, Adam) achieves an adversarial testing accuracy of ~0.24 with similar runtime. This highlights the significant computational improvement offered by our online method over offline approaches.

Official Review
Rating: 4

This paper introduces a novel framework for stochastic online bilevel optimization (OBO) that addresses limitations of existing methods by achieving sublinear stochastic bilevel regret without relying on window smoothing. The authors propose a new search direction and develop two algorithms, Simultaneous Online Gradient Descent (SOGD) for first-order oracles and Zeroth-Order SOGD (ZO-SOGD) for function value oracles, both requiring only a single subproblem solver iteration. Their main findings include theoretical guarantees of sublinear stochastic bilevel regret for both algorithms under mild assumptions, even with rapidly changing objective functions, and improved oracle efficiency in hypergradient estimation. Empirical evaluations on online parametric loss tuning and black-box adversarial attacks demonstrate the practical effectiveness and efficiency of the proposed SOGD and ZO-SOGD algorithms compared to existing OBO methods.

Questions for the Authors

N.A.

Claims and Evidence

The claims regarding sublinear stochastic bilevel regret, improved efficiency, and empirical validation are likely supported by evidence typically presented in submissions of this nature, specifically mathematical theorems in the appendix for the regret guarantees and experimental results sections showcasing performance gains. The paper introduces novel algorithmic and conceptual ideas, including a new search direction and zeroth-order adaptations, and claims theoretical support through derived regret bounds, along with empirical validation on relevant tasks like adversarial attacks and meta-learning loss tuning. While a thorough verification would necessitate a detailed examination of the proofs and experimental setup, the submission, as summarized, suggests the claims are substantiated by the expected forms of evidence within the scope of this research domain.

Methods and Evaluation Criteria

Yes, the proposed methods and evaluation criteria are sensible for the problem of stochastic online bilevel optimization. The SOGD and ZO-SOGD methods are designed to address the specific challenges of OBO, such as non-stationarity and limited access to gradients (especially for black-box scenarios addressed by ZO-SOGD). The use of stochastic bilevel regret as an evaluation criterion is appropriate for online learning settings, as it measures the cumulative performance against a dynamic benchmark. Furthermore, the chosen benchmark datasets, MNIST and potentially ImageNet for adversarial attacks and imbalanced learning tasks, are standard and relevant for evaluating machine learning algorithms, providing a practical context to assess the effectiveness of the proposed OBO methods in representative applications. Comparing against relevant baselines like OAGD, SOBOW, and single-level ZO methods further strengthens the evaluation by contextualizing the performance of the proposed algorithms within the existing landscape of optimization techniques.

Theoretical Claims

I have not thoroughly checked the proofs, which are too complex and contained in the appendices.

Experimental Design and Analysis

The experimental designs and analyses appear generally sound and valid for the stated problem and claims. The choice of black-box adversarial attacks and parametric loss tuning are relevant applications for online bilevel optimization, especially for zeroth-order methods in black-box settings. The use of MNIST data for initial validation and potentially ImageNet (or similar datasets for attacks on DNNs) provides appropriate benchmarks widely recognized in machine learning. The evaluation metrics, including runtime, test accuracy, balanced accuracy (for imbalanced data), and perturbation norms, are well-suited to assess the efficiency and effectiveness of the proposed algorithms in these tasks, directly addressing the claims of improved performance and reduced oracle dependence. The comparison against relevant baselines like OGD, OAGD, SOBOW, and single-level ZO methods provides a valid context for evaluating the contributions of SOGD and ZO-SOGD. Reporting mean and standard deviation across multiple runs, as indicated by "mean±std", suggests an attempt to account for variability and improve the reliability of the results, although the presence of formal statistical significance testing is not explicitly mentioned in the summary and would further strengthen the validity. Overall, based on the provided description, the experimental design and analyses seem appropriate and logically connected to the paper's claims.

Supplementary Material

No, the proofs are too long and complex.

Relation to Prior Literature

The key contribution of this paper is advancing online bilevel optimization by tackling limitations in existing literature, particularly in stochastic and black-box settings. Related to prior work like OAGD and SOBOW, this paper distinguishes itself by achieving sublinear regret without window smoothing, addressing a key critique of those methods in rapidly changing environments. Unlike these first-order OBO approaches that rely on gradient oracles and often multiple inner loop iterations (like CG in SOBOW), this work introduces a novel search direction enabling both first-order SOGD and zeroth-order ZO-SOGD algorithms. ZO-SOGD directly addresses the gap in OBO for black-box scenarios, building upon the broader zeroth-order optimization literature but extending it to the complexities of bilevel problems, contrasting with offline ZO-BO methods and single-level ZO online optimization like ZOO-ADMM. The paper claims to improve upon the regret bounds compared to existing stochastic OBO and even single-level online zeroth-order optimization methods, such as those by Roy et al. and Guan et al., by achieving better variance dependence and dimension dependence in the regret analysis, thus pushing the frontier of both theoretical understanding and algorithmic efficiency in online bilevel learning.

Missing Essential References

N.A.

Other Strengths and Weaknesses

The paper demonstrates notable strengths in originality and significance by introducing a novel search direction and algorithms that advance the state-of-the-art in online bilevel optimization, particularly by achieving sublinear regret without window smoothing and extending OBO to zeroth-order settings. This is significant as it directly addresses limitations of existing methods in dynamic and black-box scenarios, enhancing both theoretical understanding and practical applicability. The conceptual clarity appears to be a strength, with a well-structured presentation that logically progresses from background to methods, theory, and experiments. However, a potential weakness lies in the complexity of the theoretical proofs, which, while utilizing established techniques, demand meticulous verification to ensure complete correctness. Furthermore, while the experimental evaluations are relevant and support the claims, the scope of datasets and applications, although standard, might be seen as a point for further expansion to fully demonstrate the robustness and generalizability of the proposed algorithms across a wider range of real-world problems.

Other Comments or Suggestions

One suggestion to enhance clarity, particularly for readers less familiar with the intricacies of regret analysis, would be to include more intuitive explanations or visualizations of the key regularity conditions (like path-length and function variation) and how they relate to the dynamic regret bounds. Additionally, while the experimental section is well-designed, exploring the sensitivity of ZO-SOGD to the smoothing parameter ρ in the black-box attack experiments, and perhaps including ablation studies on the momentum components in SOGD and ZO-SOGD, could further strengthen the empirical analysis and provide deeper insights into the algorithms' behavior and parameter tuning. Finally, a minor point: a careful proofread for typos, especially in the dense mathematical sections, would be beneficial for the final version.

Author Response

W1: While the experimental evaluations support the claims ... generalizability of the proposed algorithms. A suggestion to explore the sensitivity of … $\rho$ and ablation studies on the momentum components ... .

Response: We appreciate the reviewer’s feedback. To address the generalizability of our algorithms, we conducted additional adversarial attack experiments on CIFAR10, beyond MNIST, and demonstrated the effectiveness of our approach on this larger-scale dataset. Due to space constraints, details and discussion are provided in our response to Reviewer C8i4.

For the smoothing parameter $\rho$, inner/outer stepsizes, and momentum components, we include a hyperparameter sensitivity analysis in our response to Reviewer x9oa, also due to space limitations. Our analysis shows that while ZO-SOGD is sensitive to hyperparameter choices, it remains robust within reasonable ranges. Following your suggestion, we have included extensive tuning results under adversarial attacks in the appendix.

W2: The theoretical assumptions, particularly strong convexity …?

Response: We greatly appreciate this thoughtful comment.

Our current theoretical guarantees rely on Assumption 3.2, which requires strong convexity of the inner objective $g_t(x, \cdot)$. This assumption enables the application of the implicit function theorem and stability of the inner solution mapping $y^*_t(x)$. While this assumption is common in bilevel optimization (e.g., Ghadimi and Wang, 2018; Ji et al., 2021) and is used in almost all previous OBO literature (as cited in Table 1), we acknowledge that it may not always hold in real-world applications.
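
For context, the implicit-function-theorem argument referenced here leads to the standard bilevel hypergradient identity sketched below, written for a generic pair $(f, g)$; the time-indexed objectives $f_t, g_t$ have the same structure. This is a generic statement of the well-known identity (e.g., Ghadimi and Wang, 2018), not a restatement of the paper's exact equations, and the inverse Hessian is well defined precisely because of strong convexity.

```latex
% Bilevel objective F(x) := f(x, y^*(x)) with y^*(x) := argmin_y g(x, y);
% strong convexity of g(x, .) in y guarantees the inverse Hessian exists.
\[
\nabla F(x)
  = \nabla_x f\bigl(x, y^*(x)\bigr)
  - \nabla^2_{xy} g\bigl(x, y^*(x)\bigr)\,
    \bigl[\nabla^2_{yy} g\bigl(x, y^*(x)\bigr)\bigr]^{-1}
    \nabla_y f\bigl(x, y^*(x)\bigr).
\]
```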

To demonstrate practical robustness, we have included a highly non-convex imbalanced learning task in our experiments where the inner problem involves training a deep neural network (Section 5). Despite the lack of strong convexity, our algorithms perform effectively, showing stable convergence and improved balanced accuracy.

We also highlight that our online bilevel setup is more complex than standard bilevel settings, as both the inner and outer objectives change over time, making analysis under nonconvex or relaxed conditions significantly more challenging.

That said, we are encouraged by recent progress in extending bilevel optimization to Polyak–Łojasiewicz (PL) conditions (e.g., Shen et al., 2023; arXiv:2302.05185). These works suggest that strong convexity may not be strictly necessary, and PL-like structures may suffice for convergence. We believe our theoretical framework can be extended to PL conditions, which represents an exciting direction for future work. However, given the complexity of our online and dynamic bilevel optimization setting, this extension requires careful investigation. In the final version of the paper, we will incorporate this discussion following the strong convexity assumption.

W3: The proofs are complex … would help readability.

Response: Thank you for the helpful suggestion. We have revised the theory section to include intuitive explanations for the path length and function variation metrics, clarifying how they capture nonstationarity in OBO problems. We have also provided an example in the appendix to illustrate these metrics.

We have also expanded Remarks 3.7, 4.3, and 4.4 to explain the intuition behind our main theorems---highlighting improvements in variance dependence, dimensional scaling, and regret under noisy feedback. Our initial submission included detailed proof roadmaps in Appendix C (lines 796–806) for first-order methods and Appendix D (lines 2185–2196) for zeroth-order methods.

In the final version, we plan to relocate these roadmaps and summaries to the main body, as we will have an additional page. We will also offer a more intuitive high-level roadmap outlining the overall proof strategy: we decompose regret into bias and approximation errors in gradient and projection steps, control these in a sequential manner through a number of lemmas, and assemble them into our final bounds. These changes, along with the existing comments and appendix, help make the theoretical results clearer and easier to understand.

Official Review
Rating: 4

This paper introduces new first-order and zeroth-order algorithms for Online Bilevel Optimization (OBO) that achieve sublinear stochastic bilevel regret without using window smoothing. The authors propose a new search direction and develop methods that work with limited feedback, including function value oracles. The theoretical guarantees cover both gradient-based and gradient-free settings, and the approach is validated through experiments on black-box adversarial attacks and parametric loss tuning.

Questions for the Authors

  1. How does the proposed method compare in computational cost to gradient-based bilevel optimization approaches?

  2. Can the framework handle non-convex inner problems, or how does performance degrade in such cases?

  3. How sensitive is the method to different step sizes and smoothing parameters?

Claims and Evidence

The key claims are:

  1. Sublinear regret guarantees for OBO without window smoothing.

  2. Reduction in oracle dependence using function value-based estimates.

  3. Empirical improvements in adversarial attacks and parametric loss tuning.

The theoretical results seem well-supported by rigorous proofs, though I did not check all derivations in detail. The experiments provide reasonable validation, but additional large-scale evaluations would strengthen the claims.

Methods and Evaluation Criteria

The methods seem well-suited for OBO and are motivated by practical applications. Using function value oracles to estimate Hessians and Jacobians makes sense in black-box settings, but it’s not clear how well this approach scales to more complex problems. The evaluation focuses on specific machine learning tasks, which are relevant, but a broader set of experiments would help assess the method’s general applicability.

Theoretical Claims

I did not fully verify the proofs, but the regret bounds seem to follow standard techniques. The convexity assumption in the inner problem is common and often satisfied, as the paper notes. However, in practice, many problems may not have strongly convex inner objectives. It would be useful to discuss whether the approach could still perform well under weaker assumptions, such as PL conditions or approximate inner solutions, and whether the regret guarantees could be adapted accordingly.

Experimental Design and Analysis

The experiments demonstrate the method’s effectiveness in adversarial attacks and parametric loss tuning, with reasonable comparisons to prior work. However:

The tasks are relatively small-scale, primarily tested on MNIST and controlled optimization settings. Evaluating on larger or more diverse datasets (e.g., ImageNet for adversarial attacks) would provide stronger validation.

The method’s sensitivity to hyperparameters (step sizes, smoothing factors) is not fully explored. Since OBO involves multiple updates, performance may vary significantly, and an ablation study would clarify robustness.

Supplementary Material

I mainly reviewed the related work section in the supplementary but did not go through the rest in detail.

Relation to Prior Literature

The paper builds on recent work in bilevel optimization and online learning, particularly methods that use gradient-based and function-value-based approaches. It relates well to previous work on regret minimization in OBO but extends it by removing window smoothing and using zeroth-order optimization.

Missing Essential References

The paper covers the main prior work in online bilevel optimization. However, if there are recent works applying similar ideas to meta-learning, reinforcement learning, or robust optimization, discussing them could help position this work in a broader context.

Other Strengths and Weaknesses

Strengths:

The removal of window smoothing is a notable contribution.

The use of function value-based estimation makes the method applicable to black-box settings.

Theoretical guarantees are well-structured and seem sound.

Weaknesses:

The notation and proofs are dense, making it hard to follow for non-experts.

The experimental validation is limited in scope; larger-scale benchmarks would be useful.

Computational efficiency of zeroth-order methods is not discussed in depth.

Other Comments or Suggestions

A more intuitive explanation of the key theoretical results would improve readability.

An ablation study on different hyperparameter choices would be helpful.

If possible, adding real-world applications beyond adversarial attacks could strengthen the practical impact.

Author Response

W1: ... experiments on larger datasets ...

Response: We appreciate the reviewer’s feedback regarding evaluation on additional datasets. To address this, we conducted additional experiments on CIFAR10 to demonstrate the scalability of our approach beyond MNIST. Results are provided in our response to Reviewer C8i4 and omitted here due to space constraints.

Response to Question 1: We provide the total computational complexity of our methods and compare them with baselines. This comparison will be added to Table 1 in the final version. Our SOGD algorithm offers notable computational benefits.

| Method | Total Complexity |
|---|---|
| SOGD | $\mathcal{O}((d_1 + d_2) \cdot \varepsilon^{-3})$ |
| ZO-SOGD | $\mathcal{O}((d_1 + d_2)^{7/4} \cdot \varepsilon^{-3})$ |
| SOBOW/SOBBO | $\mathcal{O}(((d_1 + d_2) + \kappa_g \log(\kappa_g) d_2) \cdot \varepsilon^{-3})$ |

As an example, for SOGD, the regret bound $\text{BL-Reg}_T \leq \mathcal{O}(T^{1/3}(\sigma^2 + \Delta_T) + T^{2/3} \Psi_T)$ implies that $T = \mathcal{O}(\varepsilon^{-3})$ suffices for average regret $\leq \varepsilon$, leading to total cost $\mathcal{O}((d_1 + d_2) \cdot \varepsilon^{-3})$. We note that the total cost for other baselines (SOBOW/SOBBO) is derived using their window size $w = o(T)$ and $\kappa_g \log(\kappa_g)$ for conjugate gradient (CG) steps. The OAGD baseline incurs higher cost due to exact system solves at each iteration.
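
For completeness, a short sketch of how the table entry follows from the bound, under the simplifying assumption that $\Delta_T$ and $\Psi_T$ are treated as $\mathcal{O}(1)$ and that each SOGD round costs $\mathcal{O}(d_1 + d_2)$ arithmetic operations:

```latex
% Divide the regret bound by T and require the average regret to be at most eps.
\[
\frac{\text{BL-Reg}_T}{T}
  \leq \mathcal{O}\!\left(T^{-2/3}(\sigma^2 + \Delta_T) + T^{-1/3}\,\Psi_T\right)
  \leq \varepsilon
  \;\Longrightarrow\;
  T = \mathcal{O}(\varepsilon^{-3}),
\]
% so the total arithmetic cost is the per-round cost times the number of rounds:
\[
T \cdot \mathcal{O}(d_1 + d_2) = \mathcal{O}\!\left((d_1 + d_2)\,\varepsilon^{-3}\right).
\]
```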

SOGD offers two key advantages: 1) it uses a single system solve per step, which is more efficient than full solves or CG steps required by some baselines; 2) it sets $w = 1$, avoiding costly average gradient computation, unlike window-based methods. ZO-SOGD has higher complexity but enables gradient-free optimization. Its higher dimensional dependence is consistent with the increased complexity in single-level ZO methods.

Response to Question 2: We appreciate the reviewer's concern regarding the handling of non-convex inner problems. To demonstrate practical robustness, the inner problem in Section 5 is non-convex, and our experiments validate that the algorithm performs effectively, showing stable convergence and improved balanced accuracy. This robustness stems from our novel search direction, which controls variance over time without requiring window smoothing. A detailed discussion is provided in our response to Reviewer Xqeg; due to space constraints, we omit it here.

Response to Question 3: Thank you for the insightful question. As detailed in Section 5 (Experimental Setup), we carefully tuned all hyperparameters to ensure stable and fair comparisons; the selected ranges are listed in Lines 405–420 (second column). Our analysis shows that while ZO-SOGD is sensitive to hyperparameter choices, it remains robust within reasonable ranges. Following your suggestion, we include extensive tuning results under adversarial attacks for ZO-SOGD (ours, Adam), summarized in the tables below.

For inner ($\beta$) and outer ($\alpha$) stepsizes:

| | $\alpha=0.001$ | $\alpha=0.005$ | $\alpha=0.01$ | $\alpha=0.1$ |
|---|---|---|---|---|
| $\beta=0.001$ | $0.68\pm0.05$ | $0.59\pm0.07$ | $0.47\pm0.06$ | $0.53\pm0.08$ |
| $\beta=0.005$ | $0.54\pm0.06$ | $0.41\pm0.05$ | $0.35\pm0.04$ | $0.42\pm0.05$ |
| $\beta=0.01$ | $0.48\pm0.04$ | $0.34\pm0.05$ | $0.57\pm0.07$ | $0.39\pm0.06$ |
| $\beta=0.1$ | $\mathbf{0.26\pm0.03}$ | $0.43\pm0.06$ | $0.33\pm0.04$ | $0.45\pm0.07$ |

For smoothing parameters:

| | $\rho_r=\rho_s=0.001$ | $\rho_r=\rho_s=0.005$ | $\rho_r=\rho_s=0.01$ | $\rho_r=\rho_s=0.05$ |
|---|---|---|---|---|
| $\rho_v=0.001$ | $0.61\pm0.06$ | $0.52\pm0.05$ | $0.48\pm0.04$ | $0.57\pm0.06$ |
| $\rho_v=0.005$ | $0.47\pm0.05$ | $0.39\pm0.04$ | $0.35\pm0.04$ | $0.45\pm0.05$ |
| $\rho_v=0.01$ | $0.41\pm0.04$ | $\mathbf{0.28\pm0.03}$ | $0.31\pm0.03$ | $0.43\pm0.05$ |
| $\rho_v=0.05$ | $0.53\pm0.06$ | $0.44\pm0.05$ | $0.40\pm0.04$ | $0.52\pm0.06$ |

For momentum parameters:

| | $\lambda_t=\eta_t=0.9$ | $\lambda_t=\eta_t=0.99$ | $\lambda_t=\eta_t=0.999$ |
|---|---|---|---|
| $\gamma_t=0.9$ | $0.35\pm0.04$ | $0.29\pm0.03$ | $0.38\pm0.05$ |
| $\gamma_t=0.99$ | $0.31\pm0.03$ | $\mathbf{0.24\pm0.02}$ | $0.33\pm0.04$ |
| $\gamma_t=0.999$ | $0.37\pm0.04$ | $0.32\pm0.03$ | $0.40\pm0.05$ |

Lower test accuracy indicates better attack performance. The algorithm is robust across a broad range of hyperparameters. Optimal performance is achieved with inner stepsize $\beta = 0.1$, outer stepsize $\alpha = 0.001$, smoothing parameters $\rho_v = 0.01$, $\rho_r = \rho_s = 0.005$, and momentum parameters $\gamma_t = 0.99$, $\lambda_t = \eta_t = 0.99$.

In our revised version, we will include a detailed sensitivity analysis for SOGD (the first-order variant) in the parametric loss tuning application (Section 5.2) in the appendix.

Final Decision

The paper introduces novel first-order and zeroth-order algorithms for stochastic Online Bilevel Optimization (OBO) that achieve sublinear bilevel regret without relying on window smoothing. The reviewers agree that the paper makes a significant contribution by proposing a new search direction and introducing algorithms (SOGD and ZO-SOGD) that are theoretically sound and practically applicable to black-box and dynamic environments. Some parts of the theoretical analysis remain dense and were not fully verified by the reviewers. Given the limitations in experimental scope and the accessibility of the theory, the paper received overall borderline scores. Ultimately, the paper is headed toward rejection due to space limitations and borderline reviews. We hope the authors revise and resubmit according to the reviewers' suggestions.