PaperHub
7.0 / 10
Poster · 3 reviewers
Scores: 3, 5, 3 (lowest 3, highest 5, std 0.9)
ICML 2025

Refining Adaptive Zeroth-Order Optimization at Ease

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We introduce R-AdaZO, a novel approach that improves the convergence of existing adaptive ZO methods by effectively leveraging moment information.

Abstract

Keywords
Adaptive Method · Zeroth-Order Optimization · Variance Reduction · Convergence

Reviews and Discussion

Review
Rating: 3

The paper introduces a zero-order optimization technique called R-AdaZO. Specifically, it makes a simple yet seemingly effective modification to ZO-AdaMM: by replacing the second-moment estimation term $g_t$ in ZO-AdaMM with $m_t$, R-AdaZO achieves variance reduction compared to ZO-AdaMM and demonstrates a faster convergence rate. The theoretical analysis is thorough.

Questions for Authors

  1. See weaknesses. I am curious about the performance of other ZO methods under the same experimental setup used by the authors.
  2. Could you provide a detailed performance comparison between R-AdaZO and ZO-AdaMM with longer iterations (where both methods converge)? For example, using the synthetic function in Figure 1.

Claims and Evidence

The authors provide theoretical analysis to prove the claimed variance reduction effect. During my review, I did not identify any obvious issues (though the validation was not extremely detailed).

Methods and Evaluation Criteria

Yes, the proposed method aims to improve the convergence speed of zero-order optimization, and the experiments conducted are reasonable.

Theoretical Claims

Yes, I verified some of the theoretical proofs. The assumptions and lemmas used are consistent with prior work in the zero-order optimization (ZO) field, and I did not find obvious issues.

Experimental Design and Analysis

Yes, the experiments conducted (synthetic function optimization, image black-box adversarial attacks, and LLM fine-tuning) are common application areas for zero-order optimization.

Supplementary Material

I reviewed part of the supplementary material, including the proofs of some theorems and the detailed hyperparameter settings.

Relation to Broader Scientific Literature

The paper introduces a new zero-order optimization algorithm based on ZO-AdaMM, aiming to improve the convergence speed of zero-order optimization methods. This is largely consistent with previous approaches in the field.

Essential References Not Discussed

In my opinion, while ZO-AdaMM is an important baseline for this paper, the authors should provide a more detailed comparison with other common zero-order optimization (ZO) methods to demonstrate the effectiveness of their approach. This includes methods like ZO-SGD (already cited by the authors), as well as ZO-SignSGD [1] and ReLIZO [2].

[1] Liu S, Chen P Y, Chen X, et al. SignSGD via Zeroth-Order Oracle. In International Conference on Learning Representations, 2019.

[2] Wang X, Qin X, Yang X, et al. ReLIZO: Sample Reusable Linear Interpolation-based Zeroth-Order Optimization. Advances in Neural Information Processing Systems, 2024, 37: 15070-15096.

Other Strengths and Weaknesses

Strengths:

  1. The paper introduces a simple yet effective modification to ZO-AdaMM, accelerating the convergence of zero-order optimization (ZO) through variance reduction.
  2. The theoretical analysis is comprehensive.
  3. Due to the simplicity of the proposed method, I conducted my own tests on functions like Rosenbrock (a minimal version of such a test objective is sketched below). With smaller parameter dimensions (i.e., 20, 50), the proposed method shows acceleration over ZO-AdaMM. (Although, after a larger number of iterations, it may perform worse than ZO-AdaMM, likely due to the bias introduced, as analyzed by the authors.) As the parameter dimension increases, the proposed method demonstrates better scalability, as evidenced by the experiments in the paper.
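A minimal sketch of the kind of test objective used for such a check, assuming the standard Rosenbrock form (illustrative only, not the reviewer's actual script):

```python
import numpy as np

def rosenbrock(theta):
    """Standard d-dimensional Rosenbrock function; its minimum value 0 is at theta = (1, ..., 1)."""
    return np.sum(100.0 * (theta[1:] - theta[:-1] ** 2) ** 2 + (1.0 - theta[:-1]) ** 2)

theta0 = np.zeros(20)      # e.g., d = 20, one of the dimensions mentioned above
print(rosenbrock(theta0))  # 19.0 at the origin for d = 20
```

Running ZO-AdaMM and R-AdaZO from the same starting point on this objective and comparing their loss curves reproduces the kind of check described above.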

Weaknesses:

  1. I believe the authors should include comparisons with additional ZO methods to further illustrate the effectiveness of their approach. See "Essential References Not Discussed." The current experiments compare only against the baseline ZO-AdaMM and ZO-RMSProp, which is essentially similar to ZO-AdaMM.

Other Comments or Suggestions

None

Author Response

We are grateful to Reviewer MyoC for the positive and constructive feedback! We appreciate the reviewer's recognition that our method is simple yet effective and that our theoretical analysis is comprehensive. We would like to address your comments as follows.


...the authors should provide a more detailed comparison with other common zero-order optimization (ZO) methods...

We appreciate the reviewer's suggestion to provide a more detailed comparison with common zero-order optimization (ZO) methods. Following this recommendation, we have benchmarked our R-AdaZO against ZO-SGD and ZO-signSGD on the Quadratic and Levy functions, using the experimental setup detailed in Section 6.1.

Quadratic Function Performance:

| Quadratic | $T=1\times10^4$ | $T=2\times10^4$ | $T=3\times10^4$ | $T=4\times10^4$ | $T=5\times10^4$ |
|---|---|---|---|---|---|
| ZO-SGD | 5038.056 | 5028.611 | 5019.293 | 5009.897 | 4997.447 |
| ZO-signSGD | 3644.077 | 1983.347 | 834.561 | 190.489 | 2.159 |
| R-AdaZO | 756.743 | 0.599 | 0.029 | 0.030 | 0.030 |

Levy Function Performance:

| Levy | $T=1\times10^4$ | $T=2\times10^4$ | $T=3\times10^4$ | $T=4\times10^4$ | $T=5\times10^4$ |
|---|---|---|---|---|---|
| ZO-SGD | 6661.504 | 6625.426 | 6589.794 | 6553.917 | 6506.384 |
| ZO-signSGD | 4017.191 | 1579.627 | 779.139 | 606.901 | 562.197 |
| R-AdaZO | 795.038 | 575.398 | 560.107 | 558.741 | 558.716 |

As demonstrated in the tables, R-AdaZO consistently and significantly outperforms both ZO-SGD and ZO-signSGD. Regarding the ReLIZO approach: While ReLIZO presents a valuable technique for enhancing ZO gradient estimation, our current work's primary focus is on improving the ZO update mechanism. Integrating R-AdaZO with advanced gradient estimators like ReLIZO is an interesting direction for future research, which we plan to explore subsequently. We believe maintaining this focus clarifies the specific contributions of R-AdaZO in this submission.
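For readers unfamiliar with these baselines, the two additional methods differ only in how the finite-difference ZO gradient estimate $g$ is applied to the parameters; a hedged sketch with illustrative names (not the authors' code):

```python
import numpy as np

def zo_sgd_step(theta, g, lr):
    """ZO-SGD: plain descent along the ZO gradient estimate g."""
    return theta - lr * g

def zo_signsgd_step(theta, g, lr):
    """ZO-signSGD: uses only the coordinate-wise sign of the ZO gradient estimate."""
    return theta - lr * np.sign(g)
```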

Could you provide a detailed performance comparison between R-AdaZO and ZO-AdaMM with longer iterations (where both methods converge)?

Thank you for this insightful question regarding the long-term convergence behavior. To address this, we have extended the comparison between R-AdaZO and ZO-AdaMM, running the experiments for 6 times the number of iterations used in the main paper, maintaining the settings from Section 6.1.

Quadratic Function Performance (Extended Iterations):

| Quadratic | $T=37500$ | $T=41250$ | $T=45000$ | $T=48750$ | $T=52500$ | $T=56250$ | $T=60000$ |
|---|---|---|---|---|---|---|---|
| ZO-AdaMM | 28.972 | 2.509 | 0.101 | 0.027 | 0.026 | 0.026 | 0.026 |
| R-AdaZO | 0.030 | 0.030 | 0.030 | 0.029 | 0.030 | 0.030 | 0.030 |

Levy Function Performance (Extended Iterations):

| Levy | $T=37500$ | $T=41250$ | $T=45000$ | $T=48750$ | $T=52500$ | $T=56250$ | $T=60000$ |
|---|---|---|---|---|---|---|---|
| ZO-AdaMM | 584.637 | 572.480 | 565.976 | 562.638 | 561.050 | 560.467 | 560.252 |
| R-AdaZO | 557.904 | 557.821 | 557.816 | 557.824 | 557.824 | 557.823 | 557.811 |

As shown in the results above, even with significantly more iterations, R-AdaZO achieves a final performance level that is comparable to, and in the case of the Levy function, slightly better than that of ZO-AdaMM.


We thank the reviewer for the valuable input and we will incorporate these discussions into our revised version. We hope our clarification will address your concerns and improve your opinion of our work.

Reviewer Comment

Thank you for the detailed response. The performance results of R-AdaZO and ZO-AdaMM with longer iterations provided by the authors are in line with my expectations. That is, after convergence, the final difference between the two methods is small, but R-AdaZO demonstrates a faster convergence rate, which will significantly enhance its scalability.

I appreciate the comparison results with ZO-SGD and ZO-SignSGD provided by the authors (which should be included in the experimental section or the appendix), and I also agree with the authors' discussion on the differences with ReLIZO (which should be included in the related work or future work).

Overall, I believe the paper presents a simple yet effective improvement, and I am inclined to recommend acceptance.

Author Comment

Dear Reviewer MyoC,

We are very happy that our response has addressed your concerns and improved your opinion about our work! We will include those valuable discussions and additional results in the final version.

Best regards, Authors

Review
Rating: 5

The authors take a closer look at the Adam update rule in the context of zero-order neural network training and identify a specific change to the second-moment update rule that provides improved convergence, both theoretically and empirically. Their experimental section contains synthetic experiments as well as neural-network-based experiments.

Questions for Authors

While the experimental evaluation is varied, it is not very extensive. I'd love to see a more detailed evaluation across more elements from the ZO-Bench suite (https://github.com/ZO-Bench/ZO-LLM) than OPT on SST2. In addition, the theoretical analysis showcased a significant influence of $\beta_1$. It would be interesting to see whether that is reflected in the convergence of LLM fine-tuning. Should the authors expand on their experimental evaluation as mentioned, I will consider raising my score.

Claims and Evidence

The authors claim a new update rule for ZO optimization, the usefulness of which is well supported. They further claim new theoretical insights, which I didn't parse completely. I find formulations such as "we theoretically show that [...] is likely to be [...] and may better [...]" to be problematic. I would suggest clearly separating what is theoretically shown from what is speculation. The final claim concerns experimental validation, which is convincing.

Methods and Evaluation Criteria

Yes

Theoretical Claims

I followed the theoretical claims of Sections 5.1 and 5.2, but did not check 5.3. I did not check the proofs in the appendix.

Experimental Design and Analysis

Yes. I see no obvious issues.

Supplementary Material

No, thanks for providing code though.

Relation to Broader Scientific Literature

The wider scientific literature on ZO is focused on variance reduction through various theoretical and empirical observations and improvements. This paper is orthogonal to many of these works and focuses on a single straightforward change to the Adam update rule.

Essential References Not Discussed

No

Other Strengths and Weaknesses

Many good papers stem from simple ideas, such as this one. I see no obvious weakness.

Other Comments or Suggestions

Line 284 left: "to decompose"

Author Response

We are grateful to Reviewer 6KdC for the positive assessment and valuable feedback! We particularly appreciate the recognition that "Many good papers stem from simple ideas, such as this one". We would like to address your concerns below.


I find formulations such as "we theoretically show that [...] is likely to be [...] and may better [...]" to be problematic. I would suggest to clearly separate what is theoretically shown versus what is speculation

Thank you for this crucial suggestion regarding the precision of our claims. We agree completely. In the revised manuscript, we will meticulously differentiate between:

  • Formally proven theoretical results: Statements directly supported by theorems and proofs within the paper.
  • Heuristic arguments or intuitions: Explanations for why certain phenomena might occur, clearly marked as such.
  • Empirical observations: Findings derived solely from experimental results.

We will ensure the language clearly reflects these distinctions throughout the paper for improved accuracy and clarity.

I'd love to see a more detailed evaluation across more elements from the ZO-Bench suite (https://github.com/ZO-Bench/ZO-LLM) than OPT on SST2.

Following your suggestion, we have expanded our evaluation using the ZO-Bench suite. We benchmarked R-AdaZO against ZO-AdaMM on the Copa dataset using the OPT-13B model, adhering to the ZO-Bench setup. The results (loss values) are presented below:

Performance on Copa (OPT-13B):

| Copa | $T=1\times10^3$ | $T=2\times10^3$ | $T=3\times10^3$ | $T=4\times10^3$ | $T=5\times10^3$ | $T=6\times10^3$ | $T=7\times10^3$ | $T=8\times10^3$ | $T=9\times10^3$ | $T=10\times10^3$ |
|---|---|---|---|---|---|---|---|---|---|---|
| ZO-AdaMM | 2.527 | 2.352 | 2.133 | 2.004 | 1.938 | 1.636 | 1.579 | 1.716 | 1.529 | 1.696 |
| R-AdaZO | 2.527 | 2.051 | 1.521 | 1.584 | 1.482 | 1.297 | 1.420 | 1.531 | 1.405 | 1.640 |

These results demonstrate that R-AdaZO exhibits significantly faster convergence compared to ZO-AdaMM on this extended LLM fine-tuning task. This provides further evidence for the effectiveness of R-AdaZO on diverse tasks within the ZO-Bench framework.

In addition, the theoretical analysis showcased a significant influence of $\beta_1$. It would be interesting to see whether that is reflected in the convergence of LLM fine-tuning.

Thank you for pointing out this connection between our theory and empirical results. To investigate this, we performed an ablation study on the effect of $\beta_1$ for R-AdaZO during LLM fine-tuning. We used the SST2 dataset with the OPT-1.3B model, following the setup in Section 6.3, varying only $\beta_1$.

Performance on SST2 (OPT-1.3B) with varying $\beta_1$ for R-AdaZO:

| SST2 | $T=1\times10^3$ | $T=2\times10^3$ | $T=3\times10^3$ | $T=4\times10^3$ | $T=5\times10^3$ | $T=6\times10^3$ | $T=7\times10^3$ | $T=8\times10^3$ | $T=9\times10^3$ | $T=10\times10^3$ |
|---|---|---|---|---|---|---|---|---|---|---|
| $\beta_1=0.3$ | 1.063 | 0.736 | 0.490 | 0.612 | 0.414 | 0.429 | 0.389 | 0.265 | 0.276 | 0.187 |
| $\beta_1=0.6$ | 1.063 | 0.675 | 0.387 | 0.493 | 0.330 | 0.360 | 0.388 | 0.232 | 0.282 | 0.229 |
| $\beta_1=0.9$ | 1.063 | 0.462 | 0.178 | 0.289 | 0.165 | 0.329 | 0.251 | 0.138 | 0.255 | 0.161 |

The results clearly show that the choice of $\beta_1$ significantly impacts performance, empirically validating the implications of our theoretical analysis in the context of LLM fine-tuning. Specifically, a larger value of $\beta_1=0.9$ consistently leads to faster convergence and achieves lower loss values throughout the fine-tuning process compared to smaller values ($\beta_1=0.3, 0.6$). This confirms that the theoretical insights regarding $\beta_1$ translate to practical benefits in this application domain.
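For intuition on why a larger $\beta_1$ helps, the variance-ratio bound $\sqrt{V/\hat{V}} = \mathcal{O}\big(\sqrt[4]{(1-\beta_1)/(1+\beta_1)}\big)$ quoted in another review's summary shrinks as $\beta_1$ grows; ignoring the hidden constants, the $\beta_1$-dependent factor alone evaluates to roughly:

```python
# Rough illustration only: the bound hides constants and other error terms.
for beta1 in (0.3, 0.6, 0.9):
    factor = ((1 - beta1) / (1 + beta1)) ** 0.25
    print(f"beta1={beta1}: ((1-beta1)/(1+beta1))^(1/4) ~ {factor:.2f}")
# beta1=0.3 -> ~0.86, beta1=0.6 -> ~0.71, beta1=0.9 -> ~0.48
```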


Thank you again for the valuable feedback. We will integrate the discussed clarifications and new experimental findings into the revision. We hope our clarification will address your concerns and further improve your opinion of our work.

Review
Rating: 3

This paper proposes an improvement over existing adaptive-gradient methods for zeroth-order optimization (ZO). In ZO optimization, to minimize a function $\min_{\theta \in \mathbb{R}^d} F(\theta)$ where $F(\theta) = \mathbb{E}[f(\theta;\xi)]$, we can only access noisy function values $f(\theta;\xi)$. ZO methods use a finite difference of function values as an approximation of the gradient: $\hat{\nabla}f(\theta;\xi) \triangleq \frac{d}{K}\sum_{k=1}^K \frac{f(\theta +\mu u_k;\xi) - f(\theta;\xi)}{\mu}u_k$, where $u_k\overset{iid}{\sim}Unif(\mathbb{S}^{d-1})$, $\mathbb{S}^{d-1}$ is the unit sphere in $\mathbb{R}^d$, and $\mu>0$ is a fixed smoothing parameter.
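A minimal sketch of this estimator, assuming a NumPy-style implementation (variable names are illustrative; the paper's released code is the authoritative reference):

```python
import numpy as np

def zo_gradient(f, theta, mu=1e-3, K=10, rng=None):
    """Finite-difference ZO gradient estimate with K directions drawn uniformly
    from the unit sphere, scaled by d/K as in the formula above."""
    rng = np.random.default_rng() if rng is None else rng
    d = theta.shape[0]
    g = np.zeros(d)
    f0 = f(theta)                   # one noisy evaluation f(theta; xi)
    for _ in range(K):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)      # uniform direction on the sphere S^{d-1}
        g += (f(theta + mu * u) - f0) / mu * u
    return (d / K) * g
```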

ZO methods suffer from high noise variance in their estimated gradients, especially in high dimensions ($d \gg K$), and this paper addresses that challenge as follows:

  1. The authors claim that momentum methods, however, offer some sort of "automatic variance-reduction" to handle this.

  2. Existing baselines that use momentum, for instance ZO-AdaMM (Chen et al 2019), which uses $\hat{\nabla}f(\theta;\xi)$ instead of the true gradient in Adam, do not take advantage of this fact. The authors propose a small modification of ZO-AdaMM, yielding R-AdaZO (Algorithm 2), which estimates the second moment as $v_t \gets \beta_2 v_{t-1} + (1 - \beta_2)m_t^2$, where $m_t$ is the momentum term. Note that ZO-AdaMM uses $v_t \gets \beta_2 v_{t-1} + (1-\beta_2)g_t^2$, where $g_t$ is the ZO estimate of the gradient (a side-by-side sketch of the two updates follows this summary).

  3. Theory:

    1. The authors show that the estimates of the second moment $v$ are smaller for R-AdaZO than for ZO-AdaMM by incorporating the variance reduction due to momentum (Theorem 5.4 and Corollary 5.5).
    2. The authors use a "variance-aware" convergence analysis for non-convex $F$, bounding the norm of $\nabla F_{\mu}(\theta)$, where $F_{\mu}(\theta) = \mathbb{E}_{u\sim Unif(\mathbb{S}^{d-1})}[F(\theta + \mu u)]$ is the appropriate objective for ZO methods. They show that the existing analysis of ZO-AdaMM is limited, as it considers only one term of the upper bound on the gradient norm obtained via Hölder's inequality in Lemma 5.6.
    3. In Theorem 5.9 and Corollary 5.10, they show that their method obtains a strictly smaller error in gradient norm than ZO-AdaMM, with the difference being an additive term of $\sqrt{V}$ for R-AdaZO vs. $\sqrt{\hat{V}}$ for ZO-AdaMM, where $\sqrt{\frac{V}{\hat{V}}} = \mathcal{O}\left(\sqrt[4]{\frac{1-\beta_1}{1+\beta_1}}\right)$.
  4. Experiments:

    1. For optimizing synthetic loss functions (Fig 1), fine-tuning an LLM (Fig 3), or performing a black-box adversarial attack (Table 1), their method is significantly faster than the baselines AdaMM and RMSProp with ZO gradient estimates.
    2. In Figure 2, they show that the errors in the first moment $m_t$ and second moment $v_t$ with respect to the true gradients are much smaller for their method. They also show its behavior for different $\beta_1$.
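The following hedged sketch isolates the single line in which the two update rules differ (bias correction and the other details of Algorithm 2 are omitted; `zo_gradient` refers to the estimator sketched earlier):

```python
import numpy as np

def adaptive_zo_step(theta, m, v, g, lr=1e-2, beta1=0.9, beta2=0.999, zeta=1e-8,
                     refined=True):
    """One adaptive ZO update on the gradient estimate g. refined=True uses the
    R-AdaZO-style second moment (m_t^2); refined=False uses the ZO-AdaMM-style one (g_t^2)."""
    m = beta1 * m + (1 - beta1) * g             # first moment m_t
    sq = m ** 2 if refined else g ** 2          # the only difference between the two methods
    v = beta2 * v + (1 - beta2) * sq            # second moment v_t
    theta = theta - lr * m / np.sqrt(v + zeta)  # coordinate-wise scaled step
    return theta, m, v
```

A full run would initialize `m` and `v` (e.g., to zeros), apply whatever bias correction the paper uses, and call `zo_gradient` at every step to produce `g`; the sketch is only meant to make the second-moment refinement concrete.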

Questions for Authors

  • How do the proposed methods perform in comparison to other ZO methods that keep track of the history instead (Shu et al 2023)? Are both these techniques completely orthogonal, so that they can be combined to obtain a "variance-reduced" ZO that uses historical estimates?
  • As a follow-up to the previous question, are there minimax lower bounds in terms of the number of function evaluations for this ZO model, as in (Duchi et al 2015)? These could tell us whether the obtained method is optimal or whether there is room for improvement.
  • Why is there a mismatch between theory and practice in Figure 2(b) for large $\beta_1$? Can you further expand on why the loose bound is still good for the convergence analysis?

References:

  • (Duchi et al 2015) Optimal Rates for Zero-Order Convex Optimization: The Power of Two Function Evaluations. Trans on Info. Theory.

Claims and Evidence

Several theoretical claims are justified by the corresponding Theorems and Corollaries mentioned in the Summary, as well as the Remarks following them. The theoretical claims that are problematic or not justified are explained below:

  1. Incomplete proof for Corollary 5.10: Corollary 5.10 provides the performance of the baseline ZO-AdaMM under the authors' proposed variance-aware convergence analysis. The authors do not provide its proof and claim that it can be obtained by simply following the proof of Theorem 5.9, which concerns their method R-AdaZO. I do not think this is the case, as several steps in the proof of Theorem 5.9 utilize the exact update of R-AdaZO and might not work for ZO-AdaMM with a different update. These steps are: i) Eq. (46), which is the main bound in the descent lemma, specifically terms ② and ③; ii) Eqs. (56) and (57), which bound the convergence rate over $T$ update steps. These are important steps in the proof for R-AdaZO, and without an explanation of how they would be addressed for the baseline ZO-AdaMM, it is unclear whether the claimed rate in Corollary 5.10 is actually achieved for ZO-AdaMM, and whether it is indeed theoretically worse than R-AdaZO. This is one of the main reasons for my rejection, and I would be willing to change my assessment if the authors can provide this proof.

  2. Justification of Assumptions: The assumption in Assumption 1 that the function values are always bounded is extremely strong. Several works assume a lower bound on the function value, but assuming an upper bound on the function forces some form of sub-Gaussian tails on it. If the domain of models $\theta$ were a closed compact subset $\Theta$ of $\mathbb{R}^d$, then this assumption could be justified. Even then, the authors would need to specify the value of $\mu$ so that no noisy evaluation at $\theta + \mu u$ falls outside $\Theta$. Note that none of the synthetic functions used in their experiments, except the Rastrigin function, are bounded for $\theta \in \mathbb{R}^d$; in particular, this assumption does not hold for a quadratic function.

  3. Final Convergence Rate: Assuming that Corollary 5.10 is actually correct and the authors provide a proof in their rebuttal, it still does not imply a theoretical speed-up in the final convergence rate of R-AdaZO over ZO-AdaMM, as the error terms in which the two algorithms differ might not be the dominant terms. The authors should thus write down precise conditions under which the variance terms are dominant, in terms of $d, \beta_1, \beta_2, \zeta, \mu$ and the initialization $m_0, v_0$. Further, the authors should check the maximum possible speedup in the convergence rate in terms of $\epsilon$, instead of stating the speedup in terms of $\beta_1$ for only the additive terms. Note that in its current form, the Remark on Page 7 does not denote a speed-up for R-AdaZO as long as the $\hat{V}$ and $V$ terms are not the dominant error terms in Theorem 5.9 and Corollary 5.10, as only then do they actually decide the convergence speed.

Experiments: Note that the experimental performance for their method is significantly better than baselines for both synthetic and real tasks as demonstrated in Figures 1 and 3.

Methods and Evaluation Criteria

The experiments on synthetic data and finetuning had all the sufficient details.

  • Is cosine similarity a good metric in Fig 2(a)? In Fig 2(a), the authors compare the cosine similarity between the actual gradient $\nabla f(\theta)$ and the momentum estimate $m$ from ZO-AdaMM and R-AdaZO. In Theorem 5.3, they bound the expected squared error coordinate-wise for this quantity, so measuring cosine similarity instead of the squared difference does not seem appropriate. Further, for both Fig 2(a) and (b), we don't know whether the error is measured using the momentum at a specific time in the algorithm or averaged over the algorithm's trajectory. This detail is important for momentum, as it might not have kicked in early in the algorithm and may have saturated if measured very late.

  • No justification for black-box attack experiments: I don't understand the reason for testing an optimization algorithm on black-box attacks. The authors only state "Following the practice in (Shu et al 2023)" in Line 367, 2nd column, as the justification. The actual reason can be found in (Shu et al 2023): black-box attacks are an important application of ZO optimization, as an adversary can only access function values during an attack. I think the authors should mention this justification in their paper as well.

Theoretical Claims

Yes, I checked all the proofs. Apart from the issues mentioned in the "Claims and Evidence" section, there are several minor mistakes in the proofs. These errors are mostly in the explanations, not in the equations, and they don't break the proofs.

  • Lines 561 and 562 should have $\leq$ instead of $=$, as a Cauchy-Schwarz inequality is used.
  • Line 927: The authors used $\frac{a}{1+a} \leq \ln(1+a)$ in Eqs. (56) and (57) but wrote down the inequality $\ln(1+a) \leq a$.
  • Line 963: The inequality should be $\sum_i a_i/b_i \geq \sum_i a_i/\sum_i b_i$ for $b_i \geq 0$, which is what the authors actually used. They mentioned the inequality in the opposite direction.

Experimental Design and Analysis

Yes, I checked all the experimental designs and analyses.

Supplementary Material

I went over the appendix but not the code.

Relation to Broader Scientific Literature

Crucially, the authors use the variance-reduction properties of momentum first observed in (Liu et al 2020) for first-order optimization, i.e., with access to gradients. Using this insight for the problem of adaptive ZO optimization, the paper shows that simply applying ZO gradient estimates to Adam is suboptimal and that the small change in R-AdaZO actually results in better convergence.

Essential References Not Discussed

I think most essential references were discussed. The authors acknowledged that the variance reduction was observed originally in (Liu et al 2020), but they were the first to use it for ZO updates. Further, they discussed the main baseline for adaptive ZO optimization, ZO-AdaMM (Chen et al 2019). They missed a few older references on smoothing, which is exactly ZO optimization but with an analysis that utilizes the smoothness of the resultant function $F_\mu$ (Nesterov 2005, Duchi et al 2012). Note that the "variance-aware convergence analysis" mentioned in Section 5.3 seems to be an analysis in terms of the smoothed objective $F_{\mu}$, which has been performed in the references I mentioned, although not for Adam-style adaptive algorithms.

References

  • (Nesterov 2005) Smooth minimization of non-smooth functions. Math Programming.
  • (Duchi et al 2012) Randomized Smoothing for Stochastic Optimization. SIAM J. OPT.

Other Strengths and Weaknesses

  1. Dependence on dimension $d$: Note that the authors have not explicitly characterized the optimal dependence on the dimension $d$ in Theorem 5.9 and Corollary 5.10. Note that the noise variance in ZO is high because it grows with $d$. In adaptive Adam-style optimization, the goal is to use different learning rates for different coordinates, and the analysis is on a per-coordinate basis. Therefore, for ZO in Adam-style algorithms, the dependence of the error on the dimension $d$ is extremely important. Further, several steps of the analysis involve changing norms from coordinate-wise to $\ell_2$, so I suspect the dimension dependence is not good. Note that the error in the baseline ZO-AdaMM (Chen et al 2019) for the term $B$ in Lemma 5.6 was $\frac{\sqrt{d}}{\sqrt{T}}$, thus giving a $\frac{d}{\epsilon^2}$ error bound. Can the authors provide sharp bounds in terms of $d$ as well?

  2. Extension to Gaussian smoothing distribution: The authors sample the random vectors for the gradient estimate from a uniform distribution on the sphere. If the random vectors were instead sampled from a Gaussian distribution, which in high dimensions is extremely similar to the uniform distribution, the proofs would not go through, as several bounds would no longer hold, for instance Eq (50). Is there an easy fix for this Gaussian extension, or are the methods designed specifically for the uniform distribution on the sphere with bounded function values (Assumption 1)?

Other Comments or Suggestions

  • Writing: Some parts of the writing can be improved. For instance, the main claims are repeated verbatim at least 4-5 times in the paper. See the following typos:
    1. The unit sphere in $d$ dimensions is referred to as $\mathbb{S}$ instead of the standard $\mathbb{S}^{d-1}$. If non-standard notation is used, it should be specified in the problem setup (Section 3).
    2. The exact formulation of the optimality gap (in terms of either gradients, function values, or iterate distance) in the synthetic experiments was not specified.
    3. Line 919, RHS of the first equation: $\frac{1}{\sqrt{\beta_2}v_{0,i}+\zeta}$ -> $\frac{1}{\sqrt{\beta_2 v_{0,i} + \zeta}}$.
    4. Line 858: $\frac{|m_{t,i}|}{\sqrt{1-\beta_2}|m_{t+1,i}|}$ -> $\frac{|m_{t,i}|}{\sqrt{1-\beta_2}|m_{t,i}|}$.
    5. Line 587: Missing reference -> Eq (16).

Overall, I am leaning towards rejection due to the main issues in the "Claims and Evidence" section. I'm willing to change my assessment if they can address them.

Author Response

We sincerely thank the reviewer for the diligent, detailed, and constructive feedback, and we appreciate the time and effort invested. We found the comments very helpful and address each point below.


Incomplete proof

Thank you for highlighting the need for clarity regarding Corollary 5.10 (ZO-AdaMM). You are correct; its proof requires adaptation from Theorem 5.9 due to the different second-moment update rule ($v_t$). While our overall variance-aware framework (Lemma 5.6, Eq. 45/46) remains applicable, the specific bounds for terms ①-④ within Eq. (46) must be re-evaluated using ZO-AdaMM's $v_t$.

  • Terms ① (Eq. 47) & ④ (Eq. 52): The bounds analogous to (47) and (52) hold for ZO-AdaMM.
  • Terms ② (Eq. 48) & ③ (Eq. 51): We can adapt Lemma 6 from Zhang et al. (2024a), which provides a bound for $|m_{t,i}| / \sqrt{v_{t,i}+\zeta}$ in the context of standard Adam (where $v_t$ uses $g_t^2$). Applying this lemma (adapted for the ZO variance $\Sigma^2$) allows us to obtain bounds analogous to those derived through (48) and (51), although the constants involved will differ.

For bounding the summation terms analogous to (56)/(57) in the ZO-AdaMM proof, we leverage Lemma 7 from Zhang et al. (2024a), which handles sums involving $v_t$ based on $g_t^2$. Combining the above, we can rigorously derive the rate in Corollary 5.10. This rate depends on $\hat{V}$ (reflecting the $g_t^2$ update, Corollary 5.5), confirming R-AdaZO's theoretical advantage via the smaller variance term $V$ (Theorem 5.4).

Due to rebuttal space limits, we provide this proof sketch; a full derivation referencing these adapted steps will be included in the appendix of the revision. Thank you again for your careful analysis.

Justification of Assumptions

Regarding Assumption 1 (8): While the upper bound is strong (especially for quadratics/unbounded domains), it's a common theoretical tool (like bounded gradients in Chen et al., 2019) essential for controlling variance and deriving convergence guarantees in non-convex stochastic settings. Our theory relies on this standard simplification, but experiments confirm R-AdaZO's practical robustness where it doesn't strictly hold globally.

Final Convergence Rate

You are right that the overall speedup requires the variance term ($V$ vs. $\hat{V}$) to dominate. We contend this is common in ZO due to high estimation noise, making R-AdaZO's better variance handling crucial. Pinpointing exact dominance conditions is hard, but our analysis highlights the structural gain against this main ZO challenge. We will clarify the remark on Page 7 to state that the speedup applies primarily when variance dominates, matching typical ZO scenarios and our results.

Methods And Evaluation Criteria:

Thank you for the feedback and suggestions. Regarding Fig 2(a), cosine similarity visualizes the improved directional alignment of $m_t$, complementing the squared-error theory. The metrics in Fig 2 are averaged over the trajectory. We will clarify this rationale and the averaging. We also agree that the black-box attack justification needs explicit motivation (its relevance as a key ZO application, since attackers lack gradient access) and will add it. These clarifications will be incorporated.

Theoretical Claims & Other Comments

Thank you for your meticulous review and for carefully checking all the proofs. We appreciate you pointing out these minor errors in the explanations. We understand they don't affect the core correctness of the proofs, and we will certainly correct these textual issues in the revised version.

Essential References Not Discussed

Thank you for the feedback and for acknowledging our discussion of key references. We appreciate you suggesting the additional references on smoothing (Nesterov 2005, Duchi et al. 2012) and will certainly incorporate them into our related work section.

Dependence on dimension $d$

Thank you for highlighting the dependence on the dimension $d$. Our analysis uses coordinate-wise Lipschitzness, standard for adaptive methods but different from the Euclidean Lipschitzness used in, e.g., Chen et al. (2019), making a direct comparison in $d$ complex. While $d$ is implicit in our terms, our focus was on demonstrating the variance reduction and the benefit of the refined second moment under this framework, not on optimizing the $d$ dependence. Achieving sharp bounds in $d$ is a valuable but challenging direction for future research.

Extension to gaussian

You are correct: our proofs leverage properties of the uniform-sphere estimator that simplify bounds such as Eq. (50). Gaussian smoothing has different statistics and requires adapted tools and assumptions (e.g., concentration inequalities) for the analysis. Adapting our framework therefore needs a re-derivation rather than a quick fix, though R-AdaZO's core principles should still apply.

Questions

Due to the space limit, the remaining questions will be addressed in our revised version.


With this elaboration, we hope our response has addressed your concerns and improved your opinion of our work.

Reviewer Comment

Thank you for the detailed response.

I'm increasing my score as I'm convinced that Corollary 5.10 can possibly be proved using Lemmas 6 and 7 from (Zhang et al 2024a).

Is there a reference that justifies the statement that the variance term dominates in zeroth-order optimization? I understand that the empirical results here somehow demonstrate this, but is this mentioned theoretically in any existing ZO works? Or is it observed in practice for most cases, not just for the Adam-style algorithms described here?

Author Comment

Dear Reviewer FBQc,

Thank you very much! We are glad our response helped address your concerns and improve your assessment of our work. We will certainly incorporate these valuable discussions into the final version.

Regarding your question about the dominance of variance in ZO optimization:

Our theoretical analysis provides direct justification via Eq. (15). This equation explicitly shows that the variance of the ZO gradient estimator ($\Sigma^2$) inherently scales with the dimension $d$ and inversely with the smoothing parameters ($\mu^2$, $K$). This structure makes $\Sigma^2$ fundamentally larger than the typical first-order variance ($\sigma^2$), particularly in the high-dimensional settings where ZO methods are often applied.

This large, dimension-dependent variance is widely recognized as a central challenge in the ZO literature, e.g., (Liu et al., 2018a). Consequently, it often becomes the dominant factor limiting the overall convergence rate compared to other terms such as bias or optimization error, especially as $d$ grows. While formal proofs of dominance across all possible parameter regimes might be intricate, tackling this high variance is a primary focus of much ZO research.

Our empirical results strongly support this view:

  1. The significant performance gains shown in Sec. 6 stem directly from R-AdaZO's improved handling of variance.
  2. Our preliminary results (which we will add) show that the same refinement (Eq. 6) offers minimal benefit in low-variance first-order settings, highlighting that its advantage is most pronounced when variance is the bottleneck, as in ZO.

We hope this clarifies why variance is considered a dominant factor in ZO optimization, supported by both theoretical structure and empirical evidence.

Best regards,

Authors

Final Decision

This paper proposes R-AdaZO, a refined adaptive zeroth-order optimization method. R-AdaZO exploits the variance reduction effect of the first moment estimate to enhance the accuracy and stability of gradient-free updates and refines the second moment estimate to better capture the geometry of the optimization landscape. Theoretical analysis provides variance-aware convergence guarantees and demonstrates faster convergence compared to existing methods. Extensive experiments on synthetic tasks, black-box adversarial attacks, and memory-efficient fine-tuning of large language models confirm the practical effectiveness of R-AdaZO.

All reviewers acknowledged the soundness of the proposed techniques, the theoretical analysis, and the strength of the experimental results. One reviewer raised concerns about aspects of the theoretical analysis, including restrictive assumptions and suboptimal dependence on the dimensionality d. Two other reviewers noted the absence of evaluations on the ZO-Bench suite and the limited number of baselines. The authors responded to these concerns in the rebuttal by providing additional analysis and experiments.

Given the overall contributions and the authors’ responsiveness in addressing key concerns, we believe the strengths of the submission outweigh its limitations. We therefore recommend acceptance, and encourage the authors to revise the manuscript in line with the reviewers’ feedback and their rebuttal commitments.