PaperHub
Overall rating: 6.8 / 10 · Poster · 4 reviewers
Individual ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 2.5
Novelty: 2.5 · Quality: 2.3 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

PaZO: Preconditioned Accelerated Zeroth-Order Optimization for Fine-Tuning LLMs

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

This paper introduces PaZO, a preconditioned accelerated zeroth-order optimization algorithm for fine-tuning large language models (LLMs). First, we theoretically demonstrate the necessity of preconditioning in zeroth-order optimization, proving that zeroth-order stochastic gradient descent (ZO-SGD) alone fails to achieve the ideal convergence rate. Building on this, we propose Preconditioned Simultaneous Perturbation Stochastic Approximation (PSPSA) and a theoretical version of PaZO, and demonstrate that setting the order of the preconditioner to $-1/2$ in PSPSA yields an improved convergence rate for PaZO. Moreover, we design a practical version of PaZO that stabilizes training via a diagonal Hessian estimate and a moving average technique. Extensive experiments on diverse downstream tasks with models like RoBERTa-large and OPT show PaZO's effectiveness. Compared to other zeroth-order baselines, PaZO achieves better performance across models and tasks.
Keywords
zeroth-order optimization, fine-tuning LLMs, preconditioner

Reviews and Discussion

Review
Rating: 4

This paper investigates the role of preconditioning in zeroth-order optimization for memory-efficient fine-tuning of large language models and introduces PaZO. The authors define a general PSPSA estimator that perturbs parameters with the preconditioner $H^{-\alpha}$ and prove that only the choice $\alpha = 1/2$ yields the optimal convergence rate. The authors then propose a practical algorithm named PaZO that utilizes a diagonal Hessian estimator via three forward-pass perturbations. Empirical evaluation across multiple downstream tasks and models demonstrates the effectiveness of the proposed algorithm.

Strengths and Weaknesses

Strengths:

  1. The authors answer an important question regarding the necessity of preconditioning in ZO and identify the optimal order of the preconditioner.

  2. The three-point perturbation scheme combined with gradient-based correction and periodic reset is a novel, lightweight way to incorporate second-order information without prohibitive memory.

  3. Extensive empirical evaluation on both masked (RoBERTa-large) and generative (OPT-1.3B) models across a diverse suite of NLP tasks demonstrates the robustness of PaZO.

Weaknesses:

  1. There are several formulation mistakes in estimating the diagonal Hessian. The RHS of Equation (11) should be $\mathrm{diag}(H)$. The LHS of Equation (13) should be $\ell_{+} + \ell_{-} - 2\ell$.

  2. Some hyperparameter choices (e.g., the reset period $T_0$ schedule) are introduced without intuitive justification.

Questions

  1. While estimating the diagonal Hessian matrix: 1) why do we need to incorporate $\mathrm{diag}(\tilde{g} \circ \tilde{g})$ as a correction term to integrate local first-order estimated information into $\Sigma_{t-1}$ through the moving average mechanism? 2) Why is $\tilde{\Sigma}$ computed as $\left((1-\beta_1) \Sigma_{t-1}^2 + \beta_1 \cdot \mathrm{diag}(\tilde{g} \circ \tilde{g})\right)^{1/2}$, rather than $(1-\beta_1) \Sigma_{t-1} + \beta_1 \,\mathrm{diag}(\tilde{g} \circ \tilde{g})$? 3) In theory, such a correction is not applied when estimating the term $\mathcal{I}$ while computing $\Sigma_t$, which makes Equation (14) no longer an unbiased estimate of the diagonal Hessian. 4) In practice, the authors conducted an ablation experiment on the influence of the choice of $\beta_1$; why do too small choices of $\beta_1$ (including 0) result in NaN? From my understanding, setting $\beta_1$ to zero makes PaZO similar to HiZOO. In this ablation study, I cannot figure out the real effect of incorporating the estimated gradient as a correction.

  2. Apart from the correction term, another difference between PaZO and HiZOO is the periodic reset mechanism for the diagonal Hessian estimate. However, no ablation study is designed for this mechanism. Why is this step necessary?

Limitations

yes.

Final Justification

I have no additional questions and thank you for the detailed response, which has resolved my previous concerns.

Formatting Issues

n/a

Author Response

Response to Reviewer HFxg

We sincerely thank the reviewer HFxg for their thorough and insightful feedback and constructive suggestions. We deeply appreciate the effort you dedicated to assessing our work. We address the main concerns raised by the reviewer HFxg point by point.

1. Correct some typos (regarding Weakness 1 and Question 1)

We thank the reviewer for pointing out the typos. We have carefully re-checked all symbols, equations, and tables throughout the paper. Below, we list all the corrections we identified and will incorporate them in the camera-ready version:

| Location | Original Text | Corrected Text |
| --- | --- | --- |
| Equation (11) | $\mathbb{E}\left[\underbrace{\frac{1}{2}u^\top \mathbf{\Sigma}^{-\frac{1}{2}} \mathbf{H} \mathbf{\Sigma}^{-\frac{1}{2}} u}_{\mathcal{I}} \cdot \mathbf{\Sigma}\left(\mathrm{diag}(uu^\top) - \mathbf{I}\right)\right] = \mathbf{H}$ | $\mathbb{E}\left[\underbrace{\frac{1}{2}u^\top \mathbf{\Sigma}^{-\frac{1}{2}} \mathbf{H} \mathbf{\Sigma}^{-\frac{1}{2}} u}_{\mathcal{I}} \cdot \mathbf{\Sigma}\left(\mathrm{diag}(uu^\top) - \mathbf{I}\right)\right] = \mathrm{diag}(\mathbf{H})$ |
| Equation (13) | $\frac{\ell_+ +\ell_- -\ell}{\mu^2} = \mathcal{I} + \mathcal{O}(\mu)$ | $\frac{\ell_+ +\ell_- -2\ell}{\mu^2} = \mathcal{I} + \mathcal{O}(\mu)$ |
| "Adding information from ZO gradient" in Algorithm 1 and below line 223 | $\mathbf{\tilde\Sigma} = \left( (1-\beta_1)\mathbf{\Sigma}_{t-1}^2 + \beta_1 \cdot \mathrm{diag}(\tilde{g} \circ \tilde{g}) \right)^{1/2}$ | $\mathbf{\tilde\Sigma} = \left( (1-\beta_1)\mathbf{\Sigma}_{t-1}^2 + \beta_1 \cdot \mathrm{diag}^2(\tilde{g} \circ \tilde{g}) \right)^{1/2}$ |
| First row, second column in Table 6 | 0 | 1 |

We acknowledge that some of these typos may have caused confusion and may be directly responsible for the series of concerns you raised in Question 1. We sincerely apologize for this and now proceed to clarify your questions in detail. Regrettably, the rebuttal process does not allow us to submit a revised manuscript. Nevertheless, we will make sure to fix all the aforementioned issues and include a detailed discussion in response to your concerns in the final camera-ready version.

2. The necessity of the correction term in estimating the diagonal Hessian (regarding Question 1, (1))

In summary, this correction term reduces the training instability caused by outliers from the randomness of the zeroth-order oracle. Specifically, because the order of the preconditioner is $-1/2$ ($\alpha=1/2$), excessively small values in the estimated diagonal Hessian may cause numerical instability (NaN) during training. The correction term encourages numerical alignment between the gradient magnitude and the adaptive curvature scaling, similar in spirit to adaptive methods like Adam [1] and Adafactor [2]. To illustrate the issue more intuitively, we offer some insights through a non-rigorous analysis below.

In practice, we first obtain $\ell_+$, $\ell_-$ and $\ell$ through three forward passes. We thus obtain three finite numerical values and then $\mathcal{I}$, which, although dependent on $u$, is treated as a fixed scalar in our experiments. Thus, if we do not have the correction term, we have $\mathbb{E}\left[\mathbf{\Sigma}\left(\mathrm{diag}(uu^\top) - \mathbf{I}\right)\right] \approx 0,$

which means that in practice we may obtain some values extremely close to zero, causing numerical outliers (NaN) given the order $-1/2$. When adding the correction term, similarly ignoring the scalar, we have $\mathbb{E}\left[\mathbf{\Sigma}\left(\mathbf{I}+\mathrm{diag}(uu^\top)\right)\left(\mathrm{diag}(uu^\top) - \mathbf{I}\right)\right] \approx 2\mathbf{\Sigma},$

which means that $\mathbf{\Sigma}_{t}$ remains numerically consistent with $\mathbf{\Sigma}_{t-1}$. We thus obtain a more stable update of $\mathbf{\Sigma}$, which helps ensure the stability of training in our experiments. Moreover, we control the scale of the correction term via $\beta_1$, which means that we can achieve numerical stability with only a small-scale correction.
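
To make this concrete, below is a toy numerical sketch (synthetic values, not our training code) of the two behaviors described above:

```python
# Toy sketch (synthetic values, not our training code). Without the correction,
# coordinates of the raw estimate curv * sigma * (u*u - 1) have mean ~0, so some
# of them land extremely close to zero, which is dangerous once a -1/2 power is
# taken. The beta_1-weighted blend with the squared ZO gradient keeps the
# updated diagonal on the scale of the previous estimate.
import numpy as np

rng = np.random.default_rng(0)
d = 100_000
sigma_prev = np.ones(d)     # previous diagonal estimate (assumed)
curv = 0.5                  # stand-in for the scalar obtained from the three forward passes

u = rng.standard_normal(d)
raw = curv * sigma_prev * (u * u - 1.0)
print("smallest |raw| coordinate:", np.abs(raw).min())        # typically ~1e-5 or smaller

g_tilde = 1e-2 * rng.standard_normal(d)                       # stand-in for the ZO gradient
beta1 = 1e-8
sigma_tilde = np.sqrt((1 - beta1) * sigma_prev**2 + beta1 * (g_tilde * g_tilde)**2)
print("smallest blended coordinate:", sigma_tilde.min())      # stays ~1, the scale of sigma_prev
```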

3. Why we compute $\mathbf{\tilde\Sigma} = \left( (1-\beta_1)\mathbf{\Sigma}_{t-1}^2 + \beta_1 \cdot \mathrm{diag}^2(\tilde{g} \circ \tilde{g}) \right)^{1/2}$ (regarding Question 1, (2) and (3))

We sincerely apologize for the typo: we omitted the exponent on $\mathrm{diag}$. As you mentioned, it is common to use the linearly weighted average $\mathbf{\tilde\Sigma} = (1-\beta_1)\mathbf{\Sigma}_{t-1} + \beta_1 \cdot \mathrm{diag}(\tilde{g} \circ \tilde{g})$. However, in zeroth-order optimization, we observe that the root-sum-square average $\sqrt{A^2+B^2}$ performs better than $A+B$. First, $\sqrt{A^2+B^2}$ is bounded by $A+B$ as $\frac{A+B}{\sqrt{2}} \leq \sqrt{A^2+B^2} \leq A+B$, which means the two are close in order of magnitude. Second, the square-root form acts as a soft maximum. When $A$ is close to $B$, $\sqrt{A^2+B^2}$ is within the same order of magnitude as both, differing only by a constant factor. When $A \gg B$ or $B \gg A$, $\sqrt{A^2 + B^2}$ behaves like a soft-max operator, approximating the larger value.
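
Both properties can be checked directly with arbitrary positive values of $A$ and $B$, for example:

```python
# Quick check with arbitrary positive A, B:
# (A+B)/sqrt(2) <= sqrt(A^2+B^2) <= A+B, and sqrt(A^2+B^2) ~ max(A, B)
# when one term dominates.
import math

for A, B in [(1.0, 2.0), (1.0, 1e-3), (1e-3, 1.0), (5.0, 4.0)]:
    rss = math.sqrt(A**2 + B**2)
    assert (A + B) / math.sqrt(2) <= rss <= A + B
    print(f"A={A:g}, B={B:g}: sqrt(A^2+B^2)={rss:.4f}, A+B={A + B:.4f}, max={max(A, B):g}")
```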

As you mentioned, applying such a correction introduces bias into Equation (14). However, in practice, thanks to the soft-max behavior of $\sqrt{A^2+B^2}$ and the choice of $\beta_1$, we find that the deviation from unbiasedness is empirically negligible. On the contrary, this almost negligible term enhances the robustness of the estimate against outliers, which is critical for the stability of zeroth-order optimization. We believe that for theory-inspired algorithms, making empirically motivated adjustments when applying them to practical tasks is both reasonable and necessary, which is why we named the algorithm the "practical form" in Algorithm 1.

4. The choice of $\beta_1$ in the ablation study (regarding Question 1, (4))

First, we clarify that even when setting $\beta_1 = 0$, PaZO is not similar to HiZOO due to the different order of the estimated diagonal Hessian (different setting of $\alpha$). One of the most important contributions of this work is to demonstrate that the ideal choice of $\alpha$ is $-1/2$ instead of $0$ for MeZO and $1/2$ for HiZOO, no matter what $\beta_1$ we select. After correcting the typo in Table 6, we state the results of the ablation study below for OPT-1.3B on SST2.

| $\beta_1$ | 1 | 1e-2 | 1e-4 | 1e-6 | 1e-8 | 1e-10 | 0 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| setting $\beta_2$ = 1e-8 | NaN | NaN | NaN | NaN | 89.0 | 88.9 | NaN |

The results show that when $\beta_1$ is set too large (e.g., 1, 1e-2), the optimization diverges and results in NaNs, because the correction term carries so much weight that it introduces non-negligible bias. When $\beta_1$ is too small (e.g., 0 or approaching 0), the correction term fails to take effect, which may lead to numerical instability during training and result in NaNs. Only in a proper range (around 1e-8 to 1e-10) does the algorithm yield reasonable performance. This behavior confirms our earlier discussion about the necessity of the correction term, as stated in "2. The necessity of the correction term". The use of a carefully chosen $\beta_1$ serves as a regularization mechanism, stabilizing the training process.

5. Ablation study about $T_0$ (regarding Weakness 2 and Question 2)

We briefly explain why we reset $\mathbf{\Sigma}$ frequently in lines 228-230 of the main body of our paper. This idea parallels the periodic error-feedback resetting used in optimization with communication compression [4].

As a supplement, we conduct an ablation study on the effect of the reset period $T_0$. In this experiment, we fine-tune the OPT-1.3B model and evaluate its performance on the SST2 dataset, while varying $T_0 \in \{64, 128, 256, 512\}$ and $\infty$ (without resetting). The results are summarized in the table below:

| $T_0$ | 64 | 128 | 256 | 512 | $\infty$ |
| --- | --- | --- | --- | --- | --- |
| SST2 | 88.5 | 88.8 | 89.0 | 88.3 | 87.7 |

The reason behind this design and observation is as follows. As the number of iterations increases, the error in the Hessian estimate obtained from the zeroth-order oracle tends to accumulate over steps, and this accumulated error may grow with the number of iterations. Therefore, when no resetting mechanism is used (the $\infty$ case in the table above), performance drops. On the other hand, since $\mathbf{\Sigma}$ is initialized as the identity matrix $I$, there is a gap between $I$ and the Hessian $\mathbf{H}^*$. When the reset period $T_0$ is too small (e.g., 128), the information accumulated in $\mathbf{\Sigma}_t$ is insufficient to effectively bridge this gap. The hyperparameter $T_0$ thus trades off error accumulation against Hessian estimation accuracy. Our experimental results also indicate that the model achieves the best performance when $T_0$ falls within a certain range. We will discuss the ablation study on $T_0$ further in the camera-ready version.
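
For clarity, the reset mechanism can be sketched as follows (hypothetical variable names and a placeholder Hessian estimate, not our actual implementation):

```python
# Schematic sketch of the periodic reset: the running diagonal estimate is
# re-initialized to the identity every T0 steps so that error from the ZO
# oracle cannot accumulate indefinitely, while between resets the moving
# average keeps refining it.
import numpy as np

def sketch_loop(num_steps: int, T0, d: int = 8, beta2: float = 1e-8) -> np.ndarray:
    rng = np.random.default_rng(0)
    sigma = np.ones(d)                                  # Sigma_0 = I (stored as its diagonal)
    for t in range(1, num_steps + 1):
        if T0 is not None and t % T0 == 0:
            sigma = np.ones(d)                          # periodic reset to the identity
        hess_est = np.abs(rng.standard_normal(d))       # placeholder for the ZO Hessian estimate
        sigma = (1 - beta2) * sigma + beta2 * hess_est  # moving-average update
        # ... the parameter step with preconditioner sigma**(-1/2) would go here ...
    return sigma

sketch_loop(num_steps=512, T0=128)    # with resetting
sketch_loop(num_steps=512, T0=None)   # without resetting (the "infinity" case)
```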

We sincerely thank the reviewer HFxg for the constructive and insightful feedback. We greatly appreciate the opportunity to improve our work through this review process, and we would be truly grateful if the reviewer HFxg could consider a positive adjustment to the scores.

References

[1] Kingma D P, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint arXiv:1412.6980, 2014.

[2] Shazeer N, Stern M. Adafactor: Adaptive learning rates with sublinear memory cost[C]//International Conference on Machine Learning. PMLR, 2018: 4596-4604.

[3] Chen X, Liu S, Xu K, et al. Zo-adamm: Zeroth-order adaptive momentum method for black-box optimization[J]. Advances in neural information processing systems, 2019, 32.

[4] Xie X, Lin Z, Toh K C, et al. Loco: Low-bit communication adaptor for large-scale model training[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

Comment

Thank you for the detailed response, which has resolved my previous concerns.

Just two more minor points in your response:

  1. Regarding the definition of $\mathcal{I}$, I think it should be $\mathbb{E}\left[ \frac{1}{2} \underbrace{u^\top \mathbf{\Sigma}^{-\frac{1}{2}} \mathbf{H} \mathbf{\Sigma}^{-\frac{1}{2}} u}_{\mathcal{I}} \cdot \mathbf{\Sigma}\left(\mathrm{diag}(uu^\top) - \mathbf{I}\right)\right] = \mathrm{diag}(\mathbf{H})$?

  2. The statement "... the ideal choice of $\alpha$ is $-1/2$ instead of $0$ for MeZO and $1/2$ for HiZOO" requires clarification. In fact, the choice of $\alpha$ both in your algorithm and in HiZOO is $1/2$. The confusion likely arises from the way the moving average of the diagonal Hessian (inverse) is updated. In HiZOO, it is written as $\Sigma_t^{-1} = \cdots$, which is a moving average of the diagonal Hessian inverse, whereas in your algorithm it is expressed as $\Sigma_t = \cdots$, which is a moving average of the diagonal Hessian.

For the second point, I think multiple positions in the paper need to be fixed.

Comment

We sincerely thank the reviewer for carefully reading our previous response and pointing out some minor issues. We are glad to hear that our clarifications have addressed your concerns. We hope the following additional clarifications help address the two minor points you raised.

First, we directly show the choice of $\alpha$ in the table below.

|  | MeZO | HiZOO | PaZO |
| --- | --- | --- | --- |
| $\alpha$ | 0 | -1/2 | 1/2 |

In fact, we have repeatedly emphasized in the main paper that the choice of $\alpha$ is one of our key contributions. Specifically, we adopt the theoretically ideal $\alpha = 1/2$ to achieve faster convergence, as discussed in several parts of the main body of the paper. For example, in line 197 of the main text, we mention that "for general problems, selecting $\alpha=0$ alone induces a slower convergence rate than the optimal choice $\alpha=1/2$". In line 201, we mention that "By selecting $\alpha=1/2$, PaZO achieves the fastest convergence rate $\tilde{\mathcal{O}}\left(d^2/T\right)$ compared with MeZO and HiZOO for general functions, providing a reasonable answer to question B". In line 259, we compare the choices of $\alpha$: "PaZO achieves SOTA performance compared to other zeroth-order optimizer baselines including MeZO ($\alpha=0$) and HiZOO ($\alpha=-1/2$)". We sincerely apologize for the typos in our first response. However, the choice of $\alpha$ in HiZOO is $-1/2$ and not $1/2$ as you mentioned in minor point 2. You can see the HiZOO algorithm on page 24 of the supplement, where HiZOO chooses $\Sigma^{1/2}$ as the preconditioner, which means that $\alpha=-1/2$ in HiZOO.

Then, considering the moving average, HiZOO takes the moving average of the inverse of the estimated Hessian. This idea is motivated by practical considerations, but it lacks theoretical support. In particular, for matrices, $(A + B)^{-1} \neq A^{-1} + B^{-1}$ in general. Therefore, instead of accumulating inverse estimates, we adopt a more principled approach: we maintain a moving average of the estimated diagonal Hessian matrix, and only compute the inverse when needed. Since the estimates are diagonal, this inversion step incurs negligible computational overhead.
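
As a tiny numerical illustration of this point (the two entries are arbitrary):

```python
# Tiny check with two arbitrary diagonal entries: averaging the Hessian and
# then inverting is not the same as averaging the inverses.
a, b = 2.0, 8.0
print(1.0 / (0.5 * a + 0.5 * b))          # invert the averaged entry    -> 0.2
print(0.5 * (1.0 / a) + 0.5 * (1.0 / b))  # average the inverted entries -> 0.3125
```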

Overall, we believe that the theoretical motivation behind our method has been clearly articulated in the main paper, particularly through Theorems 3.5 and 3.8 and their proofs, which explicitly demonstrate how the choice of $\alpha$ influences the optimization dynamics and why the theoretically ideal value is $\alpha = 1/2$. We are also very pleased that the clarification provided in our first response has resolved your concerns about the practical form of our method and the hyperparameter settings. Although there were indeed a few typos (and we sincerely thank you for pointing them out, including the fact that in minor point 1 the $1/2$ should not be included in $\mathcal{I}$), they do not affect the overall understanding of our work and contributions. We will correct all such typos in the camera-ready version to ensure clarity of presentation. We hope our discussions have adequately addressed your concerns. We would greatly appreciate your consideration of an increased score.

Comment

Thank you for your further clarification. The main point is that $\Sigma_t$ in PaZO serves as an estimate of the diagonal Hessian, whereas $\Sigma_t$ in HiZOO corresponds to an estimate of the inverse diagonal Hessian (as described at the beginning of Section 3.2 of the HiZOO paper and in the pseudocode of HiZOO). Given the definition of the preconditioner $H^{-2\alpha}$, where $H$ denotes the Hessian matrix, I still believe it is not appropriate to describe the choice of $\alpha$ in HiZOO as $-1/2$.

I have also checked the code implementation of HiZOO, which is a little bit different from the pseudocode description:

```python
def efficient_Hessian_perturb_parameters(self, model: nn.Module, random_seed, Hessian_matrix=None, scaling_factor=1):
    torch.manual_seed(random_seed)
    for name, param in self.named_parameters_to_optim:
        z = torch.normal(mean=0, std=1, size=param.data.size(), device=param.data.device, dtype=param.data.dtype)
        param.data = param.data + scaling_factor / torch.sqrt(Hessian_matrix[name]) * z * self.args.zo_eps
    return model


def zo_Hessian_step_update(self, model, inputs, zo_learning_rate, Hessian_smooth):

    self.named_parameters_to_optim = []
    for name, param in model.named_parameters():
        if param.requires_grad:
            self.named_parameters_to_optim.append((name, param))

    random_seed = np.random.randint(1000000000)
    with torch.no_grad():
        loss_original = self.zo_forward(model, inputs)

        # first function evaluation
        model = self.efficient_Hessian_perturb_parameters(model, random_seed, self.Hessian_matrix, scaling_factor=1)
        loss1 = self.zo_forward(model, inputs)

        # second function evaluation
        model = self.efficient_Hessian_perturb_parameters(model, random_seed, self.Hessian_matrix, scaling_factor=-2)
        loss2 = self.zo_forward(model, inputs)

        model = self.efficient_Hessian_perturb_parameters(model, random_seed, self.Hessian_matrix, scaling_factor=1)

        torch.manual_seed(random_seed)
        for name, param in self.named_parameters_to_optim:
            z = torch.normal(mean=0, std=1, size=param.data.size(), device=param.data.device, dtype=param.data.dtype)

            Hessian_temp = self.Hessian_matrix[name] * z * z
            Hessian_estimator = (torch.abs(loss1+loss2-2 * loss_original)* Hessian_temp * Hessian_smooth /(2 * self.args.zo_eps*self.args.zo_eps))

            self.Hessian_matrix[name] = ((1-Hessian_smooth) * self.Hessian_matrix[name] +  Hessian_estimator)

            grad = (loss1-loss2)/(2 * self.args.zo_eps) * z / torch.sqrt(self.Hessian_matrix[name])
            param.data = param.data - zo_learning_rate * (grad + self.args.weight_decay * param.data)

        loss_out = self.zo_forward(model, inputs)
    return loss_out
```

The implementation of HiZOO works in a similar way to PaZO, with the choice of $\alpha=1/2$. Specifically, they compute the moving average of the diagonal Hessian instead of the diagonal Hessian inverse, and the grad is computed as $(l_{+} - l_{-}) / (2\mu) \cdot \Sigma_t^{-1/2}$. The only difference lies in:

Hessian_estimator = (torch.abs(loss1+loss2-2 * loss_original)* Hessian_temp * Hessian_smooth /(2 * self.args.zo_eps*self.args.zo_eps))

in which they did not subtract the identity matrix $I$.

Comment

We sincerely appreciate your insightful comments and thoughtful suggestions. Your comments are well-grounded and we will clarify our discussion accordingly in the final version. Thank you once again for your constructive feedback and for taking the time to engage so carefully.

Review
Rating: 4

The paper proposes a preconditioned accelerated zeroth-order method, PaZO, for fine-tuning LLMs. Theoretically, the paper justifies the use of preconditioners for ZO methods by showing the convergence rate gap between with and without Hessian preconditioning. Motivated by the theory, the paper proposes a practical realization of PaZO using diagonal Hessian estimate and moving average. Numerical experiments on masked LLMs and generative LLMs demonstrate the memory and timestep efficiency of PaZO compared to first-order methods and ZO methods without preconditioning acceleration.

Strengths and Weaknesses

Strengths:

  1. Reducing the memory usage for LLM fine-tuning is an important topic and has significance in practice.

  2. The paper is properly framed under questions A, B and C. It provides their method along with the motivating theory.

Weaknesses:

While the paper is well structured, its content leaves many puzzles unexplained. See Questions.

Questions

  1. In Theorem 3.5, can you provide more explanation on the relation between $D_\alpha$ and $d^2$? In particular, how does the lower bound scale with $\alpha$? While the paper claims that $\alpha=0$ leads to strict inequality and is thus suboptimal, it would be better to have more explicit characterizations.

  2. In Theorem 3.8, how large is $\epsilon_0$? It seems to be treated as small so that $Err$ represents high-order errors, but how is it guaranteed?

  3. In Appendix C.6, Table 6, why are $\beta_1$ and $\beta_2$ so small (around $10^{-8}$)? This would be a particular issue given that many LLMs use low precision (e.g., bfloat16) during training to save memory, and $10^{-8}$ seems to be too small. This makes the results less convincing.

  4. How much more memory overhead does PaZO have compared to MeZO? Figure 1 uses a log scale, which makes such a comparison less transparent.

  5. Is it possible to plot the validation loss vs steps? It would better illustrate the efficiency of PaZO compared to other methods.

  6. What are the standard deviations in Table 6? From the numbers, PaZO seems to have marginal improvement over HiZOO, especially in the LoRA and prefix settings.

Limitations

  1. Sensitivity to $\beta_1$ and $\beta_2$ should be mentioned in the main text, as they are important hyperparameters and are abnormally small.

Final Justification

The authors' detailed responses address my questions.

Formatting Issues

I do not notice any major formatting issues.

Author Response

Response to Reviewer DhFc

We sincerely thank the reviewer DhFc for the thoughtful and constructive feedback. We appreciate the time and effort spent in evaluating our work. We provide detailed responses to each of the raised concerns in the following part, and we will incorporate the corresponding revisions and discussions into the camera-ready version.

1. The explanation of the relation between $D_\alpha$ and $d^2$ (regarding Question 1)

Thank you for your question on this point. In Theorem 3.5, the key parameter is $\alpha$. We define $\beta = 2\alpha - 1$. Let $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d > 0$ denote the eigenvalues of $\mathbf{H}^*$. Consequently, $D_\alpha$ can be written as:

$$D_{\alpha} = \left(\sum_{i=1}^d \lambda_i^{\beta}\right) \cdot \left(\sum_{i=1}^d \lambda_i^{-\beta}\right).$$

Derivation of the Lower Bound. Applying the Cauchy-Schwarz inequality yields $D_{\alpha} \geq d^2$. Equality holds if and only if $\beta=0$ (which corresponds to $\alpha=1/2$), at which point $D_{\alpha}$ attains its theoretical minimum value of $d^2$.

Strict Inequality for $\alpha \neq 1/2$. When $\alpha \neq 1/2$ (i.e., $\beta \neq 0$), the equality condition for the Cauchy-Schwarz inequality, namely that all eigenvalues $\lambda_i$ must be equal, is generally not satisfied (unless the model is isotropic). Consequently, $D_{\alpha} > d^2$ strictly. Moreover, the magnitude of $D_{\alpha}$ relative to $d^2$ increases under two conditions: (1) as $|\beta|$ increases (i.e., as $\alpha$ deviates further from $1/2$), and (2) as the dispersion of the Hessian eigenvalues $\{\lambda_i\}$ increases.
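
As a numerical sanity check (with made-up eigenvalues), one can verify directly that the product above attains its minimum $d^2$ only at $\beta = 0$, i.e., $\alpha = 1/2$:

```python
# Sanity check with made-up eigenvalues: D_alpha = (sum lam^beta)(sum lam^-beta)
# attains its lower bound d^2 only at beta = 0 (alpha = 1/2) and grows as
# |beta| or the eigenvalue spread increases.
import numpy as np

lam = np.array([0.1, 0.5, 1.0, 2.0, 10.0])   # hypothetical Hessian eigenvalues
d = len(lam)
for beta in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    D = (lam**beta).sum() * (lam**(-beta)).sum()
    print(f"beta={beta:+.1f} (alpha={(beta + 1) / 2:.2f}): D_alpha={D:8.2f}  (d^2={d**2})")
```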

2. The scale of $\epsilon_0$ and its guarantee (regarding Question 2)

We thank the reviewer for highlighting this crucial assumption. The quantity $\epsilon_0$, introduced in the proof of Theorem 3.8, serves to bound the expected distance between $\mathbf{\theta}_t$ and $\mathbf{\theta}^*$.

Definition and Guarantee of $\epsilon_0$. The magnitude of $\epsilon_0$ reflects the degree to which the parameters $\mathbf{\theta}_t$ (for $t \geq T$) approach the optimal solution $\mathbf{\theta}^*$ after $T$ iterations. Specifically, we assume that at iteration step $t$, the expectation satisfies:

$$\mathbb{E}\left[\|\mathbf{\theta}_t - \mathbf{\theta}^*\|^p\right] \leq \epsilon_0^p.$$

This inequality defines a neighborhood of radius $\epsilon_0$ around $\boldsymbol{\theta}^*$ within which the algorithm iterates. The achievability of such a neighborhood size $\epsilon_0$ is guaranteed by the algorithm's prior convergence properties.

Prior Convergence Properties and the Scale of $\epsilon_0$. Established analyses of Stochastic Gradient Descent (SGD) for non-convex optimization problems (e.g., [1-3]) demonstrate that, under appropriate conditions, the iterates $\mathbf{\theta}_t$ converge globally, either with high probability or in expectation, to a neighborhood containing the global optimum $\mathbf{\theta}^*$.

$\epsilon_0$ can typically be bounded by a polynomial function of the step size $\eta$. By selecting a sufficiently large total iteration number $T$, we can ensure that $\epsilon_0$ becomes sufficiently small. This, in turn, guarantees that the higher-order term $\bar{Err}$ in Equation (9) is asymptotically negligible relative to the dominant error term.

3. The scale of $\beta_1$ and $\beta_2$ in practice and the relation to low precision (regarding Question 3 and Limitations 1)

Thank you for pointing out the lack of clarification about the choice of $\beta_1$ and $\beta_2$. We explain the necessity of small choices of $\beta_1$ and $\beta_2$ in practice. For $\beta_1$, we first want to clarify that the correction term $\tilde g \circ \tilde g$ is designed to reduce the training instability caused by outliers from the randomness of the zeroth-order oracle. Specifically, because the order of the preconditioner is $-1/2$ ($\alpha=1/2$), excessively small values in the estimated diagonal Hessian may cause numerical instability (NaN) during training. The correction term encourages numerical alignment between the gradient magnitude and the adaptive curvature scaling, like Adam. However, applying such a correction introduces bias into Equation (14). By choosing a small $\beta_1$ to control the bias, we find that this almost negligible term enhances the robustness of the estimate against outliers and ensures the stability of training. As the ablation study shows, when $\beta_1$ is set too large (e.g., 1, 1e-2), the optimization diverges and results in NaNs, because the correction term carries so much weight that it introduces non-negligible bias. When $\beta_1$ is too small (e.g., 0 or approaching 0), the correction term fails to take effect, which may lead to numerical instability during training and result in NaNs. Only in a proper range (around $1e\text{-}8$ to $1e\text{-}10$) does the algorithm yield reasonable performance.

For $\beta_2$, its small value originates from the large variance of the Hessian estimated by the zeroth-order oracle. We provide a non-rigorous variance analysis to clarify this further below. From an intuitive (though informal) perspective, consider the update rule $\Sigma_t = (1 - \beta_2)\Sigma_{t-1} + \beta_2 |\widehat{\Sigma}_t|$, where $\widehat{\Sigma}_t$ denotes the current estimate of the (diagonal) Hessian, and the initialization is $\Sigma_0 = I$. Since the iterative update forms a moving average, $\beta_2$ effectively controls the sensitivity to new information and hence the variance of the sequence $\{\Sigma_t\}$. Specifically, due to the identity initialization, the influence of $\Sigma_0 = I$ remains dominant unless $\beta_2$ is set large. In particular, when $\beta_2$ is small, the update rule behaves approximately as:

$$\Sigma_t \approx (1 - \beta_2)^t I + \beta_2 \sum_{k=0}^{t-1} (1 - \beta_2)^k |\widehat{\Sigma}_{t-k}|,$$

which can be interpreted as a weighted combination of the initial value $I$ and a series of estimated observations. We consider the variance of $\Sigma_t$ above by analyzing the second term (though not a completely rigorous analysis, as some correlations are ignored, we hope to provide some intuition here). The expectation of each $\widehat{\Sigma}_{t-k}$ can be directly obtained as:

$$\mathbb{E}\left[\underbrace{\frac{1}{2}u^\top \widehat\Sigma^{-\frac{1}{2}} \mathbf{H} \widehat\Sigma^{-\frac{1}{2}} u}_{\mathcal{I}} \cdot \widehat\Sigma\left(\mathrm{diag}(uu^\top) - I\right)\right] = \mathrm{diag}(\mathbf{H}).$$

Thus, we can obtain the variance of each coordinate $i$ of $\widehat \Sigma$ as:

$$\mathrm{Var}[\hat h_{ii}] \approx \widehat\Sigma_{ii}^2\, \mathbb{E}\left[\mathcal{I}^2 (u_i^2 -1)^2 \right] - H_{ii}^2 = \frac{\widehat\Sigma_{ii}^2}{4} \sum_{j,k,s,t} A_{jk} A_{st}\, \mathbb{E}[u_j u_k u_s u_t (u_i^2-1)^2] - H_{ii}^2,$$

where $\hat h = \mathcal{I}\,\widehat\Sigma\left(\mathrm{diag}(uu^\top) - I\right)$ and $A = \widehat\Sigma^{-1/2} \mathbf{H} \widehat\Sigma^{-1/2}$. Notice that when $u \sim \mathcal{N}(0,1)$, the coupling term $\mathbb{E}[u_j u_k u_s u_t (u_i^2-1)^2]$ is nonzero for special choices of $j,k,s,t$. For example, considering only the terms where $j=s$ and $k=t$, and denoting for any $k \leq t$, $\max_i (\widehat\Sigma_k)_{ii} \leq B$, we have:

$$\mathrm{Var}[\widehat\Sigma_k] \approx \mathcal{O}(d^2 B^2).$$

This means that some values in the estimated Hessian can be numerically so large that they introduce a risk of outliers in training. To address this, consider:

$$\beta_2 \sum_{k=0}^{t-1} (1 - \beta_2)^k = 1 - (1 - \beta_2)^t,$$

so a small $\beta_2$ keeps the variance of the second term controlled, ensuring a lower-variance, more stable update during training. Meanwhile, a small $\beta_2$ still ensures that the Hessian estimates are valid from a global perspective, because when $t \to \infty$, the first term $(1 - \beta_2)^t \to 0$ and the second term $\beta_2 \sum_{k=0}^{t-1} (1 - \beta_2)^k \to 1$. A similar hyperparameter setting is used in other zeroth-order optimizers for fine-tuning, such as [4]. Our experimental results also demonstrate that under the same parameter configurations, PaZO improves the performance of the fine-tuned model across tasks. We will mention the sensitivity and choice of $\beta_1$ and $\beta_2$ in the main text of the camera-ready version.
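
For intuition, the split of total weight between the initialization $I$ and the accumulated estimates can be checked directly (a toy computation; the step counts are illustrative):

```python
# Illustrative check of the geometric weights above: with beta_2 = 1e-8 the
# identity initialization keeps almost all of the weight over a realistic
# number of fine-tuning steps, so the averaged estimate changes only slowly.
beta2 = 1e-8
for t in [1_000, 20_000, 1_000_000]:
    w_init = (1 - beta2) ** t        # weight remaining on Sigma_0 = I
    w_est = 1 - w_init               # total weight on the |Sigma_hat| terms
    print(f"t={t:>9}: weight on I = {w_init:.8f}, weight on estimates = {w_est:.3e}")
```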


Moreover, your concern about how small hyperparameters interact with low-precision training such as bfloat16 is extremely important. The experimental setting in zeroth-order optimization is somewhat different: low-precision formats like bfloat16 are widely adopted in first-order optimization to reduce memory usage, but in both the pioneering work applying zeroth-order optimization to large language model fine-tuning [5] and subsequent follow-up studies such as [4], experiments are by default conducted in fp32 without bfloat16 to ensure numerical stability and reliable convergence. We follow this default experimental setting for a fair comparison. Specifically, zeroth-order methods already lead to a significant reduction in memory usage because they are backpropagation-free, often far more than what bfloat16 achieves in standard first-order methods. In a sense, zeroth-order optimization can be viewed as an aggressive strategy for trading off precision against memory efficiency at the algorithmic level. When applying bfloat16 in the context of zeroth-order optimization, the choice of hyperparameters may need to be adjusted accordingly to prevent loss of numerical precision. We believe this is an interesting direction for future research, but it is beyond the scope of this paper.

Due to space limitations, we will post the responses to the remaining questions in the following comments. Thank you again for your insightful comments.

Comment

Thank you for your patience in reading our response. We address the remaining questions below as a supplement.

4. Memory overhead of PaZO and MeZO (regarding Question 4)

We thank the reviewer for pointing out the lack of clarity in Figure 1 due to the use of a logarithmic scale. We followed previous works [4,5] in designing this figure. To further improve the presentation, we will revise Figure 1 in the camera-ready version to include the raw values. To clarify, PaZO incurs approximately 2 times the memory overhead of MeZO, and matches the memory overhead of HiZOO, primarily due to the need to store the estimate of the diagonal Hessian, as shown below.

| Peak Memory (GB) | 1.3B | 2.7B | 6.7B | 13B |
| --- | --- | --- | --- | --- |
| MeZO | 4 | 7 | 14 | 26 |
| PaZO | 9 | 14 | 30 | 54 |

However, we would like to emphasize that PaZO still requires significantly less memory than first-order optimizers, such as Adam, since zeroth-order methods do not involve backpropagation. This is reflected in the multiplicative ratios explicitly annotated in Figure 1. As the model size increases, the memory efficiency of PaZO becomes increasingly prominent compared with FT. We believe PaZO represents a theoretically ideal algorithm within the zeroth-order paradigm, while still preserving a substantial memory advantage over first-order methods — a trade-off we consider both meaningful and impactful for practical deployment. Moreover, if with the continued advancement of AI infrastructure, sharding techniques can be effectively applied to zeroth-order optimization, then the additional memory overhead caused by storing the estimated Hessian could be significantly mitigated through engineering techniques. Meanwhile, the contribution of our algorithm in terms of preconditioning order would still remain effective.

5. Validation loss vs. steps (regarding Question 5)

We greatly appreciate your insightful suggestion. Unfortunately, due to the rebuttal policy, we are not allowed to submit any PDF files or external links. As a result, we are unable to include the validation loss vs. steps plot at this stage. However, we would like to offer two points of discussion that may help address this concern. First, all of our experiments are conducted on fine-tuning tasks. This means that, rather than focusing solely on minimizing the final validation loss, our primary objective is to improve the performance of the fine-tuned model on downstream tasks, as shown in Tables 1 and 2. For fine-tuning, a lower loss does not necessarily imply better task performance; what we seek is a learning algorithm that effectively enhances model utility. Second, we can briefly describe one example of validation loss vs. steps: when fine-tuning OPT-1.3B on RTE, the number of steps at which the validation loss reaches 0.62 is roughly 8000, 6000 and 5000 for MeZO, HiZOO and PaZO, respectively. This illustrates the efficiency of PaZO compared to the others to some degree. We will be sure to include such curves and visualizations in the camera-ready version to better illustrate the efficiency of PaZO.

Comment

I thank the authors for their detailed rebuttal; it addresses my concerns 1-2 and 4-6 quite well. However, I still have concerns about 3, i.e., the tiny choices of $\beta_1$, $\beta_2$ and the related numerical instability. Conceptually, I understand the necessity of choosing properly small $\beta_1$ and $\beta_2$. But the values are just too small, and it is hard to believe that they are required for numerical stability given that the initial estimate $I$ is quite large and the moving average is taken with the absolute value of $\Sigma_t$ (Step 3 in Algorithm 1).

In particular, NaN happens with $(\beta_1,\beta_2)=(10^{-6},10^{-8})$, while the total number of optimization steps is 20k (line 256). Even if we set all new terms to zero, the initial identity matrix should still contribute at least $(1-10^{-6})^{20000}\approx 0.98$ to the averaged diagonal Hessian, which I do not think will result in severe numerical issues. Therefore, the result looks strange to me. I wish the authors could say more on my specific questions below:

3.1: When does the instability happen? Does it happen at a very early stage or very late? And what is the magnitude of the estimated Hessian at the time it starts to collapse?

3.2: How does $\mu$ affect the result? It is in the denominator, so it could affect the estimated values.

3.3: Can the numerical instability be solved by simply replacing the square-root inverse with numerically stable methods, like solving the linear equation $\Sigma^{1/2}x=u$ to get $\Sigma^{-1/2}u$ by gradient methods? This could potentially prevent the use of too small $\beta$'s.

I would like to raise my score to accept if the authors could resolve my questions. Thanks.

Comment

We sincerely thank the reviewer for carefully reading our previous response and for raising additional insightful questions, particularly regarding the magnitude of the $\beta$ parameters. Below, we address each of the newly raised concerns in detail.

1. Details of instability (regarding Question 3.1)

The instability happens at an early stage in our observations. For example, with the hyperparameter setting $(\beta_1, \beta_2)= (10^{-6},10^{-8})$, the instability appears after 35 iterations from the beginning, due to numerical explosion of the estimated Hessian after the update weighted by $\beta_1$. The F-norm of the estimated diagonal Hessian rapidly jumps from the $10^2$ level to inf at the time it starts to collapse. As you pointed out, the theoretical intuition suggests that as long as $\beta \in (0, 1)$, the proposed method should work properly. However, in practice, due to the involvement of random Gaussian vectors in the estimation process, and more importantly, the fact that most second-order information is computed from squared Gaussian terms, even a single coordinate with an unusually large magnitude can produce extreme outliers. Without proper scaling, such an outlier can cause the estimated Hessian diagonal to explode, ultimately leading to NaN in parameter updates and destabilizing subsequent training. We acknowledge that this instability is a practical issue rather than a theoretical flaw. To mitigate it, we carefully chose $\beta$ values that effectively suppress such rare but harmful events. While this choice may appear strange from a purely theoretical perspective, it is essential for maintaining training stability in real tasks.

2. How $\mu$ affects the results (regarding Question 3.2)

Thank you for pointing out this important issue. The choice of $\mu$ is indeed critical. For zeroth-order oracles, the gap between the estimated quantity and the true value is known to be of order $\mathcal{O}(\mu)$. As $\mu \to 0$, the estimator theoretically converges to the true value. However, in practical implementations, a large $\mu$ leads to less accurate estimates, while a small $\mu$ increases the risk of numerical issues because $\mu$ often appears in the denominator. We conducted an ablation study on $\mu$ for fine-tuning OPT-1.3B on SST2, shown below.

| $\mu$ | 1e-1 | 1e-2 | 1e-3 | 1e-4 |
| --- | --- | --- | --- | --- |
| Acc | 54.0 | 66.9 | 89.0 | 86.4 |

The results show that too large a choice of $\mu$ reduces accuracy. Moreover, when $\mu$ is set too small, e.g., 1e-8, NaN occurs in our experiments. Therefore, we believe that $\mu$ should be chosen within a proper range to ensure a good balance between estimation accuracy and training stability. We report the choice of $\mu$ in Table 3 and Table 4 in Appendix C. We will add the ablation study on $\mu$ in the camera-ready version.

3. About other numerically stable methods (regarding Question 3.3)

We appreciate the reviewer's insightful suggestion. In principle, it is possible to obtain $\Sigma^{-1/2} u$ by solving the linear equation $\Sigma^{1/2} x = u$ with gradient descent. However, in our setting, this strategy has a critical limitation: solving $\Sigma^{1/2} x = u$ requires additional gradient-descent iterations for the subproblem, resulting in an unacceptable time cost in practice. Specifically, when fine-tuning OPT-1.3B on the SST2 dataset, each iteration of our proposed method takes approximately 0.3837 seconds, whereas an iteration that solves the linear subproblem to obtain $\Sigma^{-1/2} u$ requires 5.35 seconds, due to the need for sufficiently accurate gradient-descent steps to ensure a reliable solution to the subproblem. This represents nearly a 14× increase in time cost, which we consider unacceptable in real tasks. Although solving the subproblem can indeed improve numerical stability to some extent (e.g., for $(\beta_1, \beta_2)= (10^{-6},10^{-8})$), the trade-off in computational efficiency makes it impractical for real tasks. Moreover, because $\Sigma$ is positive and diagonal, we can directly obtain the solution of such a linear equation, which is exactly what our practical-form algorithm does. This means that it may be difficult to achieve an essential improvement by solving such a subproblem, and a small $\beta$ is still necessary. Nonetheless, we sincerely appreciate your insightful suggestion.
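
For concreteness, here is a minimal sketch (with random, hypothetical diagonal entries) of why no iterative solve is needed once $\Sigma$ is diagonal:

```python
# Minimal sketch: with a positive diagonal Sigma (stored as a vector),
# Sigma^{-1/2} u is one element-wise operation; no iterative linear solve
# of Sigma^{1/2} x = u is required.
import numpy as np

rng = np.random.default_rng(0)
d = 5
sigma_diag = np.abs(rng.standard_normal(d)) + 1e-3   # hypothetical positive diagonal entries
u = rng.standard_normal(d)

precond_u = u / np.sqrt(sigma_diag)                  # Sigma^{-1/2} u, element-wise

# Cross-check against solving Sigma^{1/2} x = u with the dense matrix form.
x = np.linalg.solve(np.diag(np.sqrt(sigma_diag)), u)
assert np.allclose(precond_u, x)
```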

In summary, we would like to once again thank you for your thoughtful reading of our response and for raising constructive questions. We sincerely hope that our detailed replies have effectively addressed your concerns. We would appreciate it if you could consider raising your score.

Comment

I thank the authors for their response which addresses my concerns in 3.2 and 3.3. I raised my score.

Additional comment on 3.1: previously I thought the numerical instability came from inverting a too-small $\Sigma$, but here it seems to be due to a too-large $\Sigma$. It is strange to me that this causes NaN rather than a zero update, because $\Sigma$ enters the parameter perturbation and the preconditioning in inverted form, which should give 0 but not NaN. I think this issue is worth more detailed investigation.

Comment

We sincerely thank you for your constructive suggestions and your support of our work. Your additional comment is reasonable. In practice, as the Hessian update is computed before the inversion step, excessively large values can compromise the effectiveness of subsequent computations and lead to instability. We agree that it is worth further investigation. Once again, we deeply appreciate your insightful feedback and support.

Comment

6. The standard deviations in Table 6 and improvement with LoRA and prefix (regarding Question 6)

We thank the reviewer for this important question. In Table 6, our primary objective was to demonstrate the stable operating range for the hyperparameters $\beta_1$ and $\beta_2$, rather than to emphasize performance differences. Specifically, we observed that when $\beta_1$ and $\beta_2$ fall outside the proper range, training often becomes unstable (e.g., resulting in NaNs). Therefore, we reported the mean performance within the stable region to guide practical usage. In fact, the standard deviations in these experiments are very small (below 0.2 in accuracy), and they do not affect the conclusions regarding proper $\beta_1$ and $\beta_2$ settings. For completeness, we provide the results including standard deviations below (here we correct the typo mentioned in 3).

| $\beta_1$ ($\beta_2$) | 1 | 1e-2 | 1e-4 | 1e-6 | 1e-8 | 1e-10 |
| --- | --- | --- | --- | --- | --- | --- |
| setting $\beta_1$ = 1e-8 | NaN | NaN | NaN | 88.9 ($\pm 0.3$) | 89.0 ($\pm 0.2$) | 88.9 ($\pm 0.2$) |
| setting $\beta_2$ = 1e-8 | NaN | NaN | NaN | NaN | 89.0 ($\pm 0.1$) | 88.9 ($\pm 0.2$) |

This result clearly demonstrates that the algorithm only runs robustly when the hyperparameters are within a suitable range. When the parameters are set too large, the algorithm tends to become unstable and may even diverge.

Moreover, we would like to clarify that our primary goal is to design a zeroth-order optimizer with theoretically ideal convergence within zeroth-order optimization and practical memory efficiency compared with first-order optimization. As you mentioned, for more difficult settings such as PEFT (LoRA, prefix tuning), the improvements appear smaller. However, these PEFT setups naturally constrain the model capacity, making it more difficult for any optimizer to yield large performance gains. Despite this, PaZO achieves consistent improvements over HiZOO across model types, datasets and tuning methods. Moreover, for RoBERTa-large (350M), the improvement of PaZO with PEFT is not marginal. PaZO maintains stable and robust performance across settings, and we believe this consistency, combined with its theoretical and practical advantages, makes it a compelling choice.

In summary, we truly appreciate the reviewer's insightful questions and suggestions. In the camera-ready version, we will incorporate the clarifications above, including the discussion on theoretical analysis, hyperparameter choices and robustness (regarding Table 6), the explanation of memory overhead (regarding Figure 1), and figures of validation loss against steps, which we cannot show here due to the rebuttal policy. We will also include the standard deviations. Your valuable comments have greatly helped us improve the clarity and completeness of our work. We hope our responses and discussions address your concerns. We would be sincerely grateful if the reviewer DhFc would consider increasing the score in light of the above clarifications and revisions.

References

[1] HaoChen, J. Z., Wei, C., Lee, J., & Ma, T. (2021). Shape matters: Understanding the implicit bias of the noise covariance. In Conference on learning theory.

[2] Li, Y., Ma, T., & Zhang, H. (2018). Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference on learning theory.

[3] Xiong, N., Ding, L., & Du, S. S. (2023). How over-parameterization slows down gradient descent in matrix sensing: The curses of symmetry and initialization. In The twelfth international conference on learning representations.

[4] Zhao, Y., et al. (2024). Second-order fine-tuning without pain for LLMs: A Hessian informed zeroth-order optimizer. arXiv preprint arXiv:2402.15173.

[5] Malladi, S., et al. (2023). Fine-tuning language models with just forward passes. In Advances in Neural Information Processing Systems 36, pp. 53038-53075.

Review
Rating: 4

This paper introduces PaZO, a preconditioned zeroth-order optimization algorithm designed for the memory-efficient fine-tuning of LLMs. The authors provide a solid theoretical foundation, demonstrating the necessity of preconditioning in ZO optimization to achieve faster convergence. They derive the optimal preconditioner and develop a practical algorithm that estimates the diagonal Hessian using only forward passes. The experiments show that PaZO outperforms existing ZO methods like MeZO and HiZOO and achieves performance comparable to first-order fine-tuning while using significantly less memory.

Strengths and Weaknesses

Pros:

  • The practical PaZO algorithm is an application of the theoretical insights. The design choices, such as estimating the diagonal Hessian to maintain O(d) memory and using moving averages and resets for stability, are well-justified and make the algorithm practical for fine-tuning modern LLMs.

  • The core theoretical strength is not just that the authors prove preconditioning is necessary, but that they derive the optimal form of the preconditioner from a formal convergence analysis. The technical argument in Theorem 3.5 is particularly elegant.

  • The paper's main strength lies in its rigorous theoretical analysis. It methodically answers three key questions about preconditioning in ZO optimization. The convergence analysis in Theorems 3.5 and 3.8 provides a clear, formal justification for why preconditioning is necessary and demonstrates that using the inverse square root of the Hessian (H^{-1/2}) as a preconditioner is optimal for achieving the fastest convergence rate. This is a significant theoretical contribution.

Cons:

  • The authors are transparent about PaZO's per-step time, noting it is about 1.5x slower than MeZO due to an additional forward pass (Figure 2). They argue this is offset by a better convergence rate. However, the analysis would be much stronger with a direct comparison of the total time (or number of steps) required to reach a target accuracy. A plot of accuracy vs. wall-clock time for PaZO against the baselines would more directly substantiate the claim of overall faster training.

  • While the theoretical framework is novel, some components of the practical algorithm in Section 4 are based on existing techniques. For example, estimating the Hessian diagonal via forward passes and using moving averages to stabilize training are known methods. The main novelty is the specific formulation justified by the theory for alpha = 1/2, which is a strong contribution, but the practical implementation itself is a clever combination of established components rather than a completely new mechanism.

Questions

To strengthen the claims about "convergence speed", could you provide practical "time-to-accuracy" results? For example, on a representative task like SST-2, how much wall-clock time does it take for PaZO, MeZO, and FT to reach 95% of their respective final accuracies?

Limitations

I did not see potential negative societal impact.

Formatting Issues

No formatting concerns.

Author Response

Response to Reviewer bWDk

We sincerely thank reviewer bWDk for the positive evaluation and thoughtful comments on our work. We greatly appreciate your recognition of the theoretical contribution and the careful reading of both the algorithm design and the empirical results. We find the comments valuable in helping us improve the clarity of the paper. We provide detailed responses to each of the raised concerns in the following part, and we will incorporate the corresponding revisions and discussions into the camera-ready version.

1. Report on wall-clock time vs. accuracy (regarding Cons 1 and Question)

We sincerely thank the reviewer for raising this important point. We fully agree that comparing the wall-clock time to reach a target accuracy would more directly demonstrate the overall training efficiency of PaZO. However, due to rebuttal policy constraints, we regret that we are unable to include any PDF figures or external links in our response, and therefore cannot directly present the accuracy vs. wall-clock time plot at this stage. As a complementary reference to the comparison you mentioned, we instead report the steps and wall-clock time (in seconds) required to reach 60% accuracy on the representative task RTE in the table below:

|  | MeZO | HiZOO | PaZO |
| --- | --- | --- | --- |
| Steps | 15000 | 10000 | 9000 |
| Wall-clock Time (s) | 3848 | 3785 | 3453 |

These results support our claim that PaZO achieves better convergence speed than existing zeroth-order methods by leveraging an ideal choice of the preconditioner order. Both the required number of steps and the total training time to reach the target accuracy are smaller for PaZO, validating our theoretical insights. Moreover, although the per-step cost of PaZO is slightly higher than MeZO—as we transparently report in the paper—this is more than offset by its improved convergence rate. In particular, PaZO reduces the total wall-clock time by approximately 10% compared to MeZO, demonstrating that it is not only theoretically sound but also efficient in practical settings. We will include these additional wall-clock time comparisons and corresponding wall-clock time v.s. steps plots in the camera-ready version.

2. The components of our algorithm and our main contributions (regarding Cons 2)

We sincerely thank the reviewer for the insightful comment. As you mentioned, some components of our practical implementation, such as diagonal Hessian estimation via zeroth-order oracles and the use of moving averages, are adapted from existing techniques. However, as emphasized in the Introduction, our main focus lies in addressing three questions including: (1) Is preconditioning theoretically beneficial in zeroth-order optimization? (2) If so, what is the optimal order to obtain ideal convergence? (3) How can we translate this theoretical insight into practical algorithm design? While previous works have proposed related techniques and components, they do not provide clear theoretical answers to the above questions. In contrast, our work presents a novel analysis framework that leads to a principled justification for preconditioning in zeroth-order optimization, particularly when the preconditioner order is set to α=1/2\alpha=1/2, and demonstrates how this insight can be effectively implemented in practice. We believe that this new perspective offers substantial theoretical and practical value, and may serve as a useful work for better understanding and further advancing the field of zeroth-order optimization.

We sincerely appreciate the reviewer’s positive evaluation of our work. Your thoughtful comments and recognition of both the theoretical contributions and practical relevance of PaZO are highly encouraging to us. We will incorporate the additional experimental details on wall-clock time above into the camera-ready version to further strengthen the presentation. We would be deeply grateful if the reviewer bWDk could consider a higher score.

Review
Rating: 5

The paper introduces PaZO, a preconditioned accelerated zeroth-order optimization algorithm designed for fine-tuning large language models (LLMs) without backpropagation. The authors address three core challenges: (A) whether preconditioning is necessary in zeroth-order optimization, (B) how to optimally apply preconditioning for fastest convergence, and (C) how to practically estimate Hessian information to improve performance. They theoretically prove that zeroth-order stochastic gradient descent (ZO-SGD) alone is suboptimal and that using a preconditioner of order −1/2 yields the best convergence rate. PaZO incorporates this insight into a practical algorithm using diagonal Hessian estimates and moving averages for stability. Extensive experiments on models like RoBERTa-large and OPT-1.3B across various tasks and parameter-efficient fine-tuning (PEFT) methods demonstrate that PaZO outperforms existing zeroth-order methods (e.g., MeZO, HiZOO) in both accuracy and memory efficiency, achieving state-of-the-art results with manageable computational overhead.

Strengths and Weaknesses

Strengths:

  • The paper is well written, and tackles an important problem: efficient LLM fine-tuning
  • The authors rigorously prove that standard zeroth-order stochastic gradient descent (ZO-SGD) cannot achieve the optimal convergence rate, establishing the necessity of preconditioning in zeroth-order optimization. They introduce a generalized framework—Preconditioned Simultaneous Perturbation Stochastic Approximation (PSPSA)—and derive convergence bounds for both quadratic and general smooth functions. A key strength is the identification of the optimal preconditioning order (α = 1/2), which is shown to yield the fastest convergence rate. These results are supported by detailed proofs and assumptions, including extensions to scenarios with approximate Hessians.
  • The authors present a practical algorithm, PaZO, and demonstrate its effectiveness empirically across a wide range of downstream tasks, model architectures, and fine-tuning strategies.

Weaknesses:

  • The practical version of PaZO is not so practical. It strongly resembles HiZOO, with slight differences in estimating the Hessian, and introduces extra hyperparameters ($T_0$, $\beta_1$, and $\beta_2$) that are not properly studied. The choice of $T_0$ is not properly ablated, and Table 6 in the appendix shows that PaZO is highly sensitive to the choices of $\beta_1$ and $\beta_2$.
  • The authors claim that PaZO maintains the memory efficiency of HiZOO, but Figure 1 does not show the memory overhead of HiZOO.
  • The performance improvement of PaZO over HiZOO is only demonstrated for OPT (and not RoBERTa). For OPT, the improvement does not seem to be significant.

Questions

  • How does PaZO compare to MeZO and FO methods on more recent LLMs, such as Llama 3?
  • Can the authors show learning curves for the experiments reported in Tables 1 and 2? It would be interesting to see how fast each method converges.

Limitations

The authors did not properly assess the limitations of their work.

Final Justification

After the authors' rebuttal, most of my questions/concerns have been addressed, and I have updated my score from 4 to 5.

Formatting Issues

N/A

Author Response

Response to Reviewer Gw49

We sincerely thank the reviewer Gw49 for the constructive and thoughtful feedback. We greatly appreciate your recognition of our work and your valuable suggestions. We provide point-by-point responses to each of your comments.

1. The hyperparameters in the practical version of PaZO (regarding Weaknesses 1)

Thank you very much for pointing out this important issue and helping us improve the clarity of our presentation. First, we would like to clarify that although our algorithm shares some structural similarities with HiZOO in estimating the Hessian diagonal, there is a fundamental difference in the core design. Specifically, our method is guided by theoretical analysis, which leads to the choice of the preconditioner order $\alpha=1/2$, in contrast to HiZOO's $\alpha=-1/2$. This results in essential differences in how the optimization trajectory evolves. While the Hessian estimation procedures may appear similar, the use of the preconditioner in PaZO is theoretically justified and practically distinct.
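To illustrate, purely schematically, how the order $\alpha$ enters a preconditioned zeroth-order step, we include a minimal sketch below. The function name `preconditioned_spsa_step`, the two-point finite-difference form, and the choice of scaling the random direction by $\Sigma^{-\alpha}$ are simplifying assumptions of this sketch; it is a generic preconditioned SPSA-style update, not the exact PSPSA/PaZO or HiZOO update.

```python
import torch

def preconditioned_spsa_step(params, loss_fn, sigma_diag, alpha, lr, mu=1e-3):
    """Generic preconditioned two-point SPSA-style step (illustrative only).

    sigma_diag : diagonal curvature estimate (one positive entry per parameter)
    alpha      : order of the preconditioner; different choices of alpha
                 rescale the random perturbation and the resulting update.
    """
    z = torch.randn_like(params)                  # raw random direction
    scale = sigma_diag.clamp_min(1e-8) ** (-alpha)
    z_scaled = scale * z                          # preconditioned perturbation

    # Two-point finite-difference estimate along the scaled direction.
    loss_plus = loss_fn(params + mu * z_scaled)
    loss_minus = loss_fn(params - mu * z_scaled)
    g_proj = (loss_plus - loss_minus) / (2 * mu)  # scalar directional estimate

    return params - lr * g_proj * z_scaled        # descent step along z_scaled
```

In such a scheme the effective preconditioner applied to the gradient scales with $\Sigma^{-2\alpha}$, which is why the sign and magnitude of $\alpha$ change the optimization trajectory.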

We also sincerely thank the reviewer for pointing out that our discussion of hyperparameters could be more comprehensive. We now provide additional clarification. First, we correct a typo in Table 6 of the appendix: the second column in the first row should be 1 instead of 0. We apologize for the confusion this may have caused. Regarding the roles of the hyperparameters, the algorithm is sensitive to $\beta_1$ and $\beta_2$ for the following reasons. $\beta_1$ controls the strength of the additive correction to the Hessian diagonal estimate; this correction ensures that the scaling effect in our algorithm behaves similarly to Adam-style updates, helping maintain numerical stability during training. $\beta_2$ governs the moving average applied to the Hessian diagonal estimates, a common technique to stabilize the estimates over time and smooth the update directions obtained from the zeroth-order oracle.
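To make these roles concrete, the following is a minimal PyTorch-style sketch of how such an update could look. The names (`sigma`, `h_hat`, `g_tilde`), the use of the absolute value, the clamping, and the exact way $\beta_1$ and $\beta_2$ enter are illustrative assumptions on our part, not a verbatim excerpt of the implementation.

```python
import torch

def update_diag_hessian(sigma, h_hat, g_tilde, beta1, beta2, eps=1e-8):
    """Illustrative moving-average update of the diagonal Hessian estimate.

    sigma   : running diagonal estimate (one entry per parameter)
    h_hat   : raw diagonal Hessian estimate from the ZO oracle at this step
    g_tilde : zeroth-order gradient estimate at this step
    beta1   : weight of the Adam-style squared-gradient correction
    beta2   : moving-average coefficient smoothing estimates across steps
    """
    # Adam-style correction: blend the squared ZO gradient into the estimate
    # so that the induced scaling stays numerically well behaved.
    corrected = ((1.0 - beta1) * sigma ** 2 + beta1 * g_tilde * g_tilde).sqrt()

    # Moving average across steps: smooth the noisy per-step oracle estimate.
    new_sigma = beta2 * corrected + (1.0 - beta2) * h_hat.abs()

    # Keep the estimate strictly positive before it is used as a preconditioner.
    return new_sigma.clamp_min(eps)
```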

$T_0$ determines how often the preconditioner is reset. We agree that the original manuscript lacked a dedicated ablation for this parameter. We have now conducted additional experiments to investigate the effect of varying $T_0$, and we provide the results and analysis below.

In this experiment, we fine-tune the OPT-1.3B model and evaluate its performance on the SST2 dataset while varying $T_0 \in \{64, 128, 256, 512\}$, together with $\infty$ (no resetting). We aim to examine how the frequency of resetting the moving average of the preconditioning matrix influences training stability and final accuracy. The results are summarized in the table below:

| $T_0$ | 64 | 128 | 256 | 512 | $\infty$ |
| --- | --- | --- | --- | --- | --- |
| SST2 | 88.5 | 88.8 | 89.0 | 88.3 | 87.7 |

The reason behind this design and observation is as follows. As the number of iterations increases, the error in the Hessian estimate obtained from the zeroth-order oracle tends to accumulate across steps. When no resetting mechanism is used (the $\infty$ case in the table above), performance drops because overly old accumulated error can mislead the current update direction, ultimately harming final performance. On the other hand, since $\Sigma$ is initialized as the identity matrix $I$, there is a gap between $I$ and the true Hessian $H^*$. When the reset period is too small (e.g., 128), the information accumulated in $\Sigma_t$ is insufficient to bridge this gap, and the resulting inaccurate estimate likewise degrades final performance. The hyperparameter $T_0$ thus trades off error accumulation against Hessian estimation accuracy, and our results indicate that the model performs best when $T_0$ lies in a moderate range (around 256 in this experiment). We will discuss the ablation study on $T_0$ in more detail in the camera-ready version. Thank you again for your feedback.
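For completeness, the reset schedule itself can be summarized by a short sketch, continuing the illustrative notation of the snippet above. Only the reset-to-identity-every-$T_0$-steps behavior is taken from our description; the function name and the `None`-means-no-reset convention are simplifications.

```python
import torch

def maybe_reset(sigma, step, T0):
    """Reset the diagonal estimate to the identity every T0 steps.

    Resetting bounds the error that accumulates in the zeroth-order Hessian
    estimate, at the cost of temporarily discarding accumulated curvature
    information (the trade-off discussed above). T0 = None means no resetting.
    """
    if T0 is not None and step > 0 and step % T0 == 0:
        return torch.ones_like(sigma)  # diagonal of ones = identity preconditioner
    return sigma
```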

2. The memory overhead comparison between PaZO and HiZOO (regarding Weaknesses 2)

Thank you for pointing out the missing comparison in Figure 1. Our statement that PaZO maintains the memory efficiency of HiZOO is meant in the literal sense: both HiZOO and PaZO incur additional memory overhead primarily from storing the diagonal estimate of the Hessian, which has the same size for a given model. Therefore, the peak memory consumption of the two algorithms is nearly identical. For example, when fine-tuning OPT-1.3B on MultiRC, both PaZO and HiZOO have a peak memory usage of approximately 9 GB. We will update Figure 1 in the camera-ready version to include HiZOO for completeness and clarity. Thank you again for your careful review.
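As a rough back-of-envelope check (assuming, on our side, that the diagonal estimate is stored in half precision; this is an assumption, not a reported detail): the diagonal Hessian estimate has one entry per model parameter, so for OPT-1.3B it adds roughly $1.3\times10^{9} \times 2\,\text{bytes} \approx 2.6$ GB for either algorithm, which is consistent with the two methods having nearly identical peak memory.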

3. More comparisons of PaZO with other methods (regarding Weakness 3 and Question 1)

We sincerely apologize for the limited scope of our experiments due to hardware constraints. However, we believe the existing results are still indicative of the strengths of our method. While the improvement of PaZO over HiZOO on OPT is not dramatic, it is consistent across the majority of tasks. Moreover, PaZO shows improvements over HiZOO under different tuning paradigms, including LoRA and prefix tuning. This consistency suggests that the advantage of PaZO is not limited to a specific task or setup, but rather stems from a more principled design. This aligns well with the goal of our work, namely to investigate how to better utilize preconditioning in zeroth-order optimization, and with our theoretical analysis, which motivates a carefully chosen preconditioning order. We have indeed considered conducting experiments on LLaMA3, but due to current hardware limitations we have not yet been able to run this model properly. We are actively working to overcome these constraints and complete the experiments. If time permits, we may be able to report partial results during the review process, and we will include the complete results in the camera-ready version. We sincerely thank the reviewer for raising this point and appreciate your understanding.

4. The learning curves of the experiments (regarding Question 2)

We sincerely thank the reviewer for raising this important question. Due to the rebuttal policy, we are unfortunately not allowed to include PDF files or external links in our response, so we cannot provide the learning curves directly here. We would like to note, however, that all of our experiments are fine-tuning tasks. Rather than focusing solely on minimizing the final loss value, our primary objective is to improve the performance of the fine-tuned model on downstream tasks, as shown in Tables 1 and 2. For fine-tuning, a lower loss does not necessarily imply better task performance.

As an alternative, we can briefly describe one example of validation loss versus steps: when fine-tuning OPT-1.3B on RTE, the numbers of steps needed to reach a validation loss of 0.62 are roughly 8000, 6000, and 5000 for MeZO, HiZOO, and PaZO, respectively. This illustrates, to some degree, the efficiency of PaZO compared with the other methods. We will include detailed learning curves for the experiments in Tables 1 and 2 in the camera-ready version to better illustrate the training dynamics of the different methods.

Once again, we sincerely thank the reviewer Gw49 for the thoughtful comments and constructive suggestions. We have carefully addressed each of the concerns raised, and we truly appreciate the opportunity to clarify and improve our work. We are grateful for the reviewer’s recognition of our contributions. We would be deeply grateful if the reviewer Gw49 could consider a higher score.

Comment

I thank the authors for their detailed rebuttal, as it has addressed most of my concerns. I highly encourage the authors to incorporate the suggested changes into their final manuscript. I will increase my score by 1.

Comment

We sincerely thank you for your valuable comments and thoughtful suggestions, as well as your support for our work. We will make sure to incorporate the suggested changes and discussions into the final version. Thank you once again for your constructive feedback.

Final Decision

This paper proposes PaZO, a preconditioned accelerated zeroth-order optimization method for fine-tuning LLMs. The work contributes both theoretical insights and a practical algorithm that incorporates diagonal Hessian estimates and moving averages for stability. Experiments demonstrate advantages over zeroth-order baselines like MeZO and HiZOO.

During the discussion, one reviewer stressed that PaZO’s practical design closely resembles HiZOO, raising concerns about algorithmic novelty. Additional issues included errors in formulas, hyperparameter sensitivity, and limited experimental breadth: improvements are most evident on OPT, less so on RoBERTa, and there is no evaluation on newer LLMs (e.g., LLaMA3). While the authors addressed efficiency concerns in the rebuttal, complete evidence is still needed in the revised version.

Despite reservations about novelty and presentation, the paper offers solid theoretical contributions. Given that ZO fine-tuning relies only on model queries and represents a promising direction, and with overall ratings turning positive after rebuttal, the AC believes the work will interest both optimization and LLM communities. The AC strongly encourages the authors to incorporate the planned clarifications and additional results in the revised version.