PaperHub
Average rating: 5.7/10 (Poster; 3 reviewers; ratings 6, 5, 6; min 5, max 6, std 0.5)
Average confidence: 3.7 | Correctness: 2.7 | Contribution: 2.0 | Presentation: 2.3
ICLR 2025

Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-03-14
TL;DR

We propose HiZOO to leverage the diagonal Hessian to enhance ZOO for fine-tuning LLMs.

Abstract

Fine-tuning large language models (LLMs) is necessary for specific downstream tasks, but classic first-order optimizers entail prohibitive GPU memory because of backpropagation. Recent works such as MeZO have turned to zeroth-order optimizers for fine-tuning, which substantially reduce memory by using only two forward passes. However, heterogeneous curvatures across different parameter dimensions in LLMs often cause model convergence instability or even failure. In this work, we propose HiZOO, a diagonal Hessian informed Zeroth-Order Optimizer, which is the first work to leverage the diagonal Hessian to enhance ZOO for fine-tuning LLMs. We provide theoretical proof for HiZOO and visualize the optimization trajectories on test functions to illustrate how it improves convergence in handling heterogeneous curvatures. Extensive experiments on various models (RoBERTa, OPT, Phi-2 and LLama3, with 350M$\sim$66B parameters) indicate that HiZOO significantly reduces training steps and enhances model accuracy, while keeping the memory advantage of ZOO. For example, on the SST2 task HiZOO achieves $8\times$ speedup and better accuracy over MeZO across different models. We also propose HiZOO-L, which reduces the Hessian memory cost to 10% of that of MeZO, while maintaining almost the same performance. Compared with ZO-Adam, HiZOO-L achieves a 4.3% improvement, using just 50% of the GPU memory. Code is available at https://anonymous.4open.science/r/HiZOO-27F8.
Keywords
LLM; deep learning; zeroth order optimizer

Reviews and Discussion

Review
6

The work presents a novel optimization method called HiZOO, designed for fine-tuning large language models (LLMs). HiZOO incorporates a diagonal Hessian estimate, obtained through an additional forward pass, that acts as a pre-conditioner adjusting updates based on parameter curvatures. This approach reduces training steps and improves accuracy while maintaining efficient memory use, scaling well even for models with billions of parameters. The authors also propose HiZOO-L, a low-rank variant that significantly cuts down on memory cost while preserving performance.

Strengths

  1. The introduction of HiZOO, a Hessian-informed zeroth-order optimizer, is original.
  2. The proposed low-rank variant, HiZOO-L, which reduces memory overhead, demonstrates a practical approach to addressing memory constraints while maintaining optimization quality.
  3. The paper is well-organized, with clear sections outlining the motivation.

Weaknesses

  1. The code implementation assumes that $u_i \odot u_i$ (the Hadamard product of $u_i$ with itself) is treated as a diagonal matrix. However, this is incorrect, as $u_i u_i^{\top}$ in the paper's algorithm description is an outer product, resulting in a rank-1 matrix. This assumption could introduce inconsistencies or inaccuracies in the Hessian estimation and parameter updates.

  2. The value for Hessian_smooth in the implementation is set to $1e^{-6}$, which seems quite small. This could imply that the contribution of the Hessian information is not significant enough to make a substantial impact on convergence or stability.

  3. Tables 1 and 3 present results for a range of NLP tasks, but no generation tasks are included. This limits the understanding of HiZOO’s performance in broader language modeling applications where generative capabilities are critical.

  4. While Table 5 reports promising results on SST2, Figures 12 to 14 reveal that HiZOO demonstrates poor training performance and even training failure in some cases. This discrepancy raises concerns about the stability and reliability of the optimizer under different conditions.

  5. Line 9 in Algorithm 2 is unclear. Calculating $R^{-1}$ and $C^{-1}$ involves significant computational overhead, making this step inefficient.

Questions

Please see the weaknesses.

Comment

Thank you for your thorough review and valuable comments on our manuscript. We appreciate the opportunity to address your concerns and clarify any misunderstandings.
Question 1: Treatment of $uu^{\top}$ as a Diagonal Matrix
[Revised answer] Firstly, we would like to address the broader computational considerations. Due to the significant memory and computational overhead required to estimate the full $d \times d$ Hessian matrix, we chose to compute only the diagonal of the Hessian. This approach strikes a balance between computational feasibility and the accuracy needed for our analysis. As a result, in our code we do indeed compute $\mathrm{diag}(u_i u_i^{\top}) = [u_1^2, u_2^2, \ldots, u_n^2]$; you are correct that its rank is not 1. The statement in the algorithm is included to illustrate the theoretical process of our approach. In the actual computation, particularly in Line 9, this intermediate step does not appear; it is meant purely for clarity in explaining the algorithm's underlying theory.
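For concreteness, here is a minimal sketch (not the authors' implementation) of the equivalence being discussed: keeping only the element-wise square of the perturbation vector recovers exactly the diagonal of the rank-1 outer product $u u^{\top}$, using $O(d)$ memory instead of $O(d^2)$.

```python
import torch

d = 4
u = torch.randn(d)

outer = torch.outer(u, u)   # u u^T: a rank-1, d x d matrix (never materialized in practice)
diag_est = u * u            # element-wise square, i.e. the diagonal of u u^T, as a length-d vector

assert torch.allclose(torch.diag(outer), diag_est)
```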

Question 2: Small Value of the Hessian Smooth Parameter
Due to the large variance of the Hessian, with some values being extremely large, the Hessian smooth term is set very small. It is nevertheless sufficient to have a meaningful impact on the convergence and stability of the optimization process. Our experimental results demonstrate that, under the same parameter configurations, HiZOO significantly improves the convergence rate compared to other methods.
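To illustrate the scale argument only: the sketch below assumes an exponential-moving-average style smoothing (the variable names and the exact update form are our assumptions, not taken from the HiZOO code), showing that a coefficient of $1e^{-6}$ still shifts coordinates whose incoming curvature values are very large.

```python
import torch

hessian_smooth = 1e-6                                # coefficient questioned by the reviewer
hess_diag = torch.ones(4)                            # running diagonal-Hessian estimate (assumed init)
new_estimate = torch.tensor([1e7, 5e6, 2e5, 3e4])    # per-coordinate curvature can be extremely large

# assumed EMA-style smoothing: even a tiny coefficient noticeably moves high-curvature coordinates
hess_diag = (1 - hessian_smooth) * hess_diag + hessian_smooth * new_estimate
print(hess_diag)   # approximately tensor([11.0000, 6.0000, 1.2000, 1.0300])
```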

Question 3: Experiments on Generation Tasks
Thank you for your insightful suggestion. We agree that including generation tasks would provide a more comprehensive evaluation of HiZOO's performance. We will conduct additional experiments on generation tasks and show the results to you as soon as possible to showcase HiZOO's capabilities in generative settings.

Question 4: Training behavior in some specific figures
We acknowledge that HiZOO sometimes exhibits poor training performance in certain cases; the same holds for MeZO and other zeroth-order methods. This issue arises from the specific characteristics of some tasks rather than from the algorithm itself. However, HiZOO generally demonstrates better convergence behavior than MeZO under the same conditions.

Question 5: Computational cost of calculating R and C matrix in Algorithm 2

It's true that this calculation involves some additional computation. However, the time required for generating these matrices is minimal compared to the overall time of one forward pass. As shown in the table below (also Table 11 in the paper), the matrix decomposition adds only about 6% to the total training time. Therefore, this calculation does not significantly impact the efficiency of the algorithm.


Table: Wall-clock time per step for MeZO, HiZOO, and HiZOO-L.
The increase in wall-clock time per step for HiZOO compared to MeZO is less than 1.5 times across different model sizes. Also, HiZOO-L adds only about 6% to the total training time compared with HiZOO. To avoid introducing additional overheads such as inter-GPU communication, results are measured on the same dataset (SST-2) and GPUs (80GB A100), with each result averaged over 100 steps. "BS" refers to batch size. For the relatively smaller RoBERTa-large model we used BS=64, while for models larger than 1B parameters we used BS=16.

| Model | RoBERTa-large (350M) | Phi-2 (2.7B) | Llama3 (8B) | OPT (13B) |
|---|---|---|---|---|
| MeZO | 0.2092s (BS=64) | 0.3011s (BS=16) | 0.7471s (BS=16) | 1.1108s (BS=16) |
| HiZOO | 0.3023s (BS=64) | 0.4486s (BS=16) | 1.1090s (BS=16) | 1.5225s (BS=16) |
| HiZOO-L | 0.3193s (BS=64) | 0.4851s (BS=16) | 1.1996s (BS=16) | 1.6422s (BS=16) |

We appreciate your constructive feedback and believe that addressing these points will strengthen our manuscript. We are committed to improving our work and will incorporate the necessary revisions to enhance its clarity and comprehensiveness.

Comment

Thanks for the authors' feedback. The authors have provided comprehensive responses addressing my comments, clarifying technical details, and proposing meaningful updates to the manuscript.

The authors effectively addressed concerns regarding the small value of the Hessian smooth parameter. By quantifying the overhead (a 6% increase in training time for HiZOO-L) and contextualizing it with respect to the total training time, they justify that this additional cost is minimal.

Minor concern: In the comments, the authors mentioned that $uu^{\top}$ is not diagonal and is not a rank-1 matrix in $\mathbb{R}^{m \times m}$. I am a bit confused. Please clarify.

These updates enhance the manuscript’s clarity, applicability, and technical depth, so I would like to give an increased score of 6.

Comment

Dear Reviewer Wjn7, we want to extend our heartfelt thanks for taking the time to review our rebuttal and to thoroughly examine our code.

About your concern, we sincerely apologize for initially misunderstanding your question and truly appreciate your effort in carefully examining our implementation. Firstly, we would like to address the broader computational considerations. Due to the significant memory and computational overhead required to estimate the full $d \times d$ Hessian matrix, we chose to compute only the diagonal of the Hessian. This approach strikes a balance between computational feasibility and the accuracy needed for our analysis. As a result, in our code we do indeed compute $\mathrm{diag}(u_i u_i^{\top}) = [u_1^2, u_2^2, \ldots, u_n^2]$; you are correct that its rank is not 1. The statement in the algorithm is included to illustrate the theoretical process of our approach. In the actual computation, particularly in Line 9, this intermediate step does not appear; it is meant purely for clarity in explaining the algorithm's underlying theory.

Thank you once again for pointing this out and for your constructive comments. Your input has been invaluable in improving the clarity and correctness of our work. Feel free to ask if you have any other concerns.

Comment

Under the assumptions of gradient descent and convex optimization, alongside the assumption $\frac{\mathrm{tr}(\Sigma^{1/2}H\Sigma^{1/2})}{\|\Sigma^{1/2}H\Sigma^{1/2}\|_2} < r$, we believe there is strong potential to demonstrate that the diagonal-Hessian based preconditioning in HiZOO offers significant advantages over MeZO. Such findings would underscore the promising practical applications of our method and its effectiveness. We are willing to share further insights and are committed to enhancing this aspect of the theory, aiming to include a more comprehensive theoretical explanation in the final version of the paper.

Also, it is important to acknowledge that extending these advantages to scenarios involving stochastic gradient descent (SGD) or non-convex optimization presents significant challenges. To the best of our knowledge, there is little or even no existing work in the field that has successfully established improvements under these conditions. This highlights the value of our contributions within the specific scope of convex optimization and gradient descent.

We sincerely hope that you will reconsider our work and, if possible, provide further support for our research.

Review
5

This paper proposes the Hessian-informed Zeroth-Order Optimizer (HiZOO), which achieves faster convergence than traditional zeroth-order SGD (specifically, MeZO) in LLM fine-tuning scenarios by leveraging Hessian information through a zeroth-order approach. Although HiZOO requires one additional forward pass compared to MeZO (totaling three forward passes), it demonstrates faster convergence in terms of the number of function calls. Furthermore, HiZOO outperforms MeZO in terms of model generalization for various LLM fine-tuning tasks, and the authors also provide a theoretical convergence guarantee for HiZOO.

Strengths

  1. The first approach that leverages second-order information via a zeroth-order construction for fine-tuning LLMs
  2. Illustration of HiZOO on some test functions with superior generalization ability of the method
  3. Convergence guarantee of HiZOO with their diagonal inverse Hessian estimator

Weaknesses

  1. The Hessian estimator from Equation (3) appears to estimate the Hessian matrix only if the matrix $\Sigma$ is close to the inverse Hessian. However, according to Algorithm 1 in the paper, the matrix (actually, the vector) $\Sigma_0$ is initialized simply as the identity matrix, which suggests that this estimator may not accurately estimate the (diagonal) inverse Hessian. Furthermore, since the initial value is merely an identity matrix, it is difficult to consider the subsequent estimated $\Sigma_t$ values as accurate estimations of the diagonal inverse Hessian. Instead, it seems more likely that HiZOO is estimating something that is merely positive definite.

  2. The theoretical result seems a bit unusual. Typically, convergence results are derived as bounds on the norm of the true gradient (i.e. the gradient evaluated on the entire training dataset) rather than the norm of the stochastic gradient. Consequently, Theorem 3.2, which presents the current convergence result, is very confusing and may need to be revised for clarity.

  3. In addition to the second concern, the convergence rate in Theorem 3.2 is on par with first-order methods. However, given that second-order information, such as the diagonal Hessian, is utilized (even if constructed solely from function values), there should ideally be an advantage in convergence compared to existing methods like MeZO or other vanilla SGD-based approaches. This expected benefit, however, does not seem apparent. Moreover, $\max_t \mathrm{Tr}(\Sigma_t)$ would also depend on $d$. For example, given that Algorithm 1 initializes $\Sigma_0$ as the identity matrix, the quantity $\mathrm{Tr}(\Sigma_0)$ equals $d$, so $\max_t \mathrm{Tr}(\Sigma_t)$ is at least on the order of the parameter dimension $d$. If the order of $\max_t \mathrm{Tr}(\Sigma_t)$ exceeds $d$ (e.g., something like $d\sqrt{d}$), the convergence rate would actually be slower than zeroth-order vanilla SGD (ZO-SGD), which undermines the theoretical contributions of the proposed algorithm. (To the best of my knowledge, the convergence of ZO-SGD can be derived to be faster than $O(d)$.) Also, importantly, the advantage of leveraging second-order information should be apparent in the theory.

  4. Looking at the comparison on the test functions, it appears to be a comparison of optimization from scratch rather than fine-tuning. I am curious about the hyperparameter settings, as the method seems to perform better than Adam. If this advantage holds, there should also be experiments demonstrating the use of zeroth-order methods not only in fine-tuning but also in pre-training.

Questions

Please refer to the weaknesses.

Comment

We deeply appreciate the time and effort you have invested in reviewing our manuscript. Your thoughtful comments and constructive suggestions have been invaluable in improving the quality of our work. Thank you for your insightful and detailed feedback.

Question 1: $\Sigma$ is not required to be close to the inverse Hessian
We apologize for any confusion caused by our previous presentation. To clarify, our method does not require the matrix $\Sigma$ to be close to the inverse Hessian. This follows from the lemma presented in MiNES (Mirror Natural Evolution Strategies):

$$\frac{1}{2} \cdot \mathbb{E}_u \left(u^{\top} \Sigma^{1/2} H \Sigma^{1/2} u \cdot (\Sigma^{-1/2}u u^{\top} \Sigma^{-1/2} - \Sigma^{-1})\right) = H,$$

where $H$ is the Hessian $\nabla^2 \mathcal{L}(\theta; \mathcal{B})$ and $\Sigma$ is a positive definite matrix.
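As a quick numerical sanity check of this identity, the sketch below assumes $u \sim \mathcal{N}(0, I)$ and uses the symmetric square root of $\Sigma$; the matrices $H$ and $\Sigma$ here are randomly generated for illustration and are not taken from the paper.

```python
import torch

torch.manual_seed(0)
d, n = 3, 500_000

A = torch.randn(d, d); H = A @ A.T + d * torch.eye(d)        # symmetric positive-definite "Hessian"
B = torch.randn(d, d); Sigma = B @ B.T + d * torch.eye(d)    # symmetric positive-definite Sigma

evals, evecs = torch.linalg.eigh(Sigma)
S = evecs @ torch.diag(evals.sqrt()) @ evecs.T               # symmetric Sigma^{1/2}
S_inv = evecs @ torch.diag(evals.rsqrt()) @ evecs.T          # Sigma^{-1/2}

u = torch.randn(n, d)                                        # each row is one sample u ~ N(0, I)
quad = ((u @ S) @ H * (u @ S)).sum(dim=1)                    # u^T Sigma^{1/2} H Sigma^{1/2} u
v = u @ S_inv                                                # Sigma^{-1/2} u
est = 0.5 * (torch.einsum('i,ij,ik->jk', quad, v, v) / n
             - quad.mean() * torch.linalg.inv(Sigma))

print((est - H).norm() / H.norm())   # relative Monte Carlo error; shrinks as n grows
```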

Question 2: theoretical result of convergence
Thank you for your valuable feedback. We sincerely apologize for the typo in our manuscript. We will revise Theorem 3.2 to reflect this correction for clarity, as shown below:

Let the descent direction $g_\mu(\theta_t)$ be defined as:

$$g_\mu(\theta_t) = \sum_{i=1}^b \frac{\mathcal{L}(\theta_t + \mu \Sigma_t^{1/2} u_i; \mathcal{B}_t) - \mathcal{L}(\theta_t - \mu \Sigma_t^{1/2} u_i; \mathcal{B}_t)}{2b\mu} \Sigma_t^{1/2}u_i.$$

Based on our Assumption, if the update rule for $\theta$ is $\theta_{t+1} = \theta_t - \eta g_\mu(\theta_t)$ for a single step, then it is established that:

$$\mathbb{E}\left[\mathcal{L}(\theta_{t+1}) \mid \theta_t\right] \leq \mathcal{L}(\theta_t) - \frac{\eta_t}{4} \|\nabla \mathcal{L}(\theta_t)\|_{\Sigma_t}^2 + 2\eta_t^2 L\left(\mathrm{tr}(\Sigma_t) + \beta_u\right)\sigma^2 + O(\mu^2).$$

Furthermore, given the iteration number $T$, we choose the step size $\eta = \frac{1}{8\sqrt{T}L(\max_t \mathrm{tr}(\Sigma_t) + \beta_u)}$ and take $\theta_{\text{out}} = \theta_j$ with $j$ uniformly sampled from $\{1, \dots, T\}$. Then we have

$$\mathbb{E}\left[\|\nabla \mathcal{L}(\theta_{\text{out}})\|^2\right] \leq \frac{32L\left(\max_t \{\mathrm{tr}(\Sigma_t)\} + \beta_u\right)\left(\mathcal{L}(\theta_1) - \mathcal{L}(\theta_*)\right)}{\sqrt{T}\beta_\ell} + \frac{\sigma^2}{T^{3/2} \beta_\ell} + O\left(\mu^2\right),$$

where $\mathcal{L}(\theta_*)$ is the minimum of the loss function. The above bound shows that as $T \to \infty$, HiZOO converges to a stationary point.
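For readers who prefer code, a minimal sketch of this preconditioned two-point estimator specialized to a diagonal $\Sigma_t$; the loss function, $\mu$, and $b$ below are illustrative placeholders rather than the paper's settings.

```python
import torch

def g_mu(loss_fn, theta, sigma_diag, mu=1e-3, b=1):
    """Two-point zeroth-order estimate of the descent direction g_mu(theta)
    above, with a diagonal Sigma_t stored as the vector sigma_diag."""
    g = torch.zeros_like(theta)
    for _ in range(b):
        u = torch.randn_like(theta)        # u_i ~ N(0, I)
        su = sigma_diag.sqrt() * u         # Sigma_t^{1/2} u_i in the diagonal case
        delta = loss_fn(theta + mu * su) - loss_fn(theta - mu * su)
        g = g + delta / (2 * b * mu) * su
    return g
```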

Question 3: Theoretical concern
In fact, in the realm of non-convex optimization, it is difficult to show that a Newton-like algorithm with approximate Hessian matrices achieves a faster convergence rate. We admit that our convergence analysis does not provide theoretical support for a fast convergence rate; our theory is mainly used to show that our algorithm converges to a stationary point.
We have provided an updated convergence proof for your reference. Due to its length, we have directly updated the PDF file, specifically Lines 763-884.

Question 4: Applying HiZOO to pre-training
Thank you for your insightful comment. Indeed, there have been studies applying zeroth-order optimizers to training neural networks from scratch, e.g., [DeepZero: Scaling up Zeroth-Order Optimization for Deep Model Training]. However, due to the high computational cost associated with training from scratch, in this work we focus on fine-tuning. In the future, we are willing to explore using HiZOO for pre-training.
Additionally, in our test functions we set exactly the same hyper-parameters for Adam, MeZO, and our method. For example, for the function $f(x,y) = 10(x-1)^2(2x^2+x+1)+2(y-4)^2$, we set the same start point $(-1,2)$, end point $(1,4)$, learning rate $10^{-3}$, and scaling factor $10^{-6}$, and we can see that in this setting our method performs better than MeZO.
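As a rough sketch only (not the authors' visualization code), the snippet below runs a MeZO-style two-point update on this test function with the stated start point, learning rate, and scaling factor; the step count is an arbitrary choice for illustration, and HiZOO would additionally rescale the perturbation by its Hessian estimate.

```python
import torch

def f(p):  # test function from the response above
    x, y = p
    return 10 * (x - 1)**2 * (2 * x**2 + x + 1) + 2 * (y - 4)**2

theta = torch.tensor([-1.0, 2.0])    # start point (-1, 2)
lr, mu = 1e-3, 1e-6                  # learning rate and scaling factor from the response

trajectory = [theta.clone()]
for _ in range(20000):
    u = torch.randn(2)
    g = (f(theta + mu * u) - f(theta - mu * u)) / (2 * mu) * u   # two-point ZO gradient estimate
    theta = theta - lr * g
    trajectory.append(theta.clone())

print(theta)   # in typical runs this approaches the minimum near (1, 4)
```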

Comment

Dear Reviewer MAUK, we greatly appreciate the time and effort you have dedicated to reviewing our submission.

As part of the ongoing discussion period, we have provided detailed responses to the points raised in your review. We would sincerely appreciate any additional feedback or clarifications you may have. With only a few days remaining in the discussion period, we are eager to ensure a productive and collaborative exchange.

Thank you once again for your valuable input and consideration.

Comment

Thank you to the authors for their responses.

However, from a theoretical perspective, the convergence of HiZOO cannot outperform vanilla SGD due to the trace term, which fails to justify the necessity of using this inverse Hessian estimator. Additionally, after reading the MiNES paper, the proposed estimator appears to estimate not the inverse Hessian but rather something closer to the inverse Fisher matrix. While Fisher and Hessian may share similarities in estimating curvature information, they cannot be considered equivalent in large language models (LLMs) that do not use ReLU activations. This raises a fundamental question about whether HiZOO can truly be considered a Hessian-informed method. Moreover, even if it is correct to say that it estimates the inverse Fisher matrix, works like [1] have shown that natural gradient descent (using K-FAC, approximate NGD) achieves linear convergence. Thus, HiZOO should at least provide theoretical insights or intuition in this regard.

For these reasons, I will maintain my current score.

[1] Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks, G. Zhang et al., NeurIPS 2019.

Comment

First and foremost, we sincerely appreciate your effort in raising new questions for further exploration. We will address each of your concerns in detail below, and we hope our responses will earn your understanding and approval.

(We also want to kindly remind you that this work primarily focuses on experimental contributions. As a result, we present extensive experimental results to show the performance of our method and do not delve deeply into theoretical analyses.)

Question: The convergence of HiZOO cannot outperform vanilla SGD due to the trace term, which fails to justify the necessity of the inverse Hessian estimator.

Answer: Thank you for your insightful comment on the convergence properties of HiZOO. As indicated in the theoretical analysis of MeZO [1] (HiZOO is an improvement based on MeZO), the loss decrease for MeZO can be expressed as:

$$\mathbb{E}[L(\theta_{t+1}) \mid \theta_t] - L(\theta_t) \leq \frac{1}{\gamma} \left[-\eta_{SGD} \|\nabla L(\theta_t)\|^2 + \frac{1}{2} \eta_{SGD}^2 \ell \cdot \mathbb{E}[\|\nabla L(\theta; \mathcal{B})\|^2] \right],$$

where $\gamma^{-1} = O(n/r)$, $n$ denotes the number of queries per iteration, and $r$ is the effective rank of the Hessian. For MeZO/HiZOO, $n = 1$, which implies that the convergence rate depends solely on $r$, the effective rank of the Hessian, rather than $d$, the parameter dimension, as in traditional SGD.

In traditional zeroth-order SGD, the efficiency is scaled by $d$, as the loss decrease is proportional to $1/d$ due to the dimensionality of the parameter space. However, since $r$ is much smaller than $d$, the factor $1/r$ in MeZO and HiZOO results in a faster loss decrease compared to $1/d$ in SGD. A detailed analysis of this can be found in the convergence analysis section of MeZO.

This distinction is significant because, as shown in AdaHessian [3], Sophia [4], and Shampoo [5], employing curvature information (Hessian approximations) enables algorithms to converge faster than first-order methods like SGD, especially in high-dimensional optimization problems. Specifically, the effective rank $r$ is typically much smaller than the parameter dimension $d$, owing to the low-rank property of the Hessian matrix observed in many practical machine learning scenarios.

Moreover, HiZOO improves upon MeZO by introducing second-order information through preconditioning methods that estimate the diagonal of the Hessian.

Question: Can HiZOO truly be considered a Hessian-informed method, or is it a Fisher-informed method?

Answer: HiZOO is definitely a Hessian-informed method. Thank you for raising this insightful question. Regarding the convergence properties of HiZOO, we would like to emphasize that our method indeed estimates the inverse Hessian matrix, as demonstrated in the theoretical analysis of MiNES [2]. Specifically, Lemma 11 in MiNES establishes that the sequence $\Sigma_k^{-1}$, generated by their algorithm, satisfies:

$$\|\Sigma_k^{-1} - \Pi_{\mathcal{S}} (\nabla^2 f(\mu^*))\|^2 \leq \frac{C}{k},$$

where $\nabla^2 f(\mu^*)$ is the Hessian matrix of the loss function $f$ at the optimal point $\mu^*$. The projection $\Pi_{\mathcal{S}}$ removes the extreme eigenvalues (both maximum and minimum) of the original Hessian matrix $\nabla^2 f(\mu^*)$, effectively regularizing it within a spectral range defined by $\mathcal{S}$. When $\mathcal{S}$ is chosen to be sufficiently large, $\Pi_{\mathcal{S}} (\nabla^2 f(\mu^*))$ becomes equivalent to $\nabla^2 f(\mu^*)$ itself, as the projection no longer affects the spectral properties of the Hessian. This result clearly demonstrates that $\Sigma_k^{-1}$ approximates $\nabla^2 f(\mu^*)$ as $k \to \infty$, at a convergence rate of $O\left(\frac{\log k}{k}\right)$.

While the MiNES paper also discusses the Fisher matrix $F_\mu$ in the context of natural gradient descent, it is important to note that this section is unrelated to our analysis. HiZOO focuses solely on the inverse covariance matrix $\Sigma_k^{-1}$, which approximates the Hessian matrix and does not involve the Fisher matrix. The distinction is particularly evident in the update rules for $\Sigma_k$, which are derived based on Hessian-specific properties rather than Fisher information.

[1] Malladi, Sadhika, et al. "Fine-tuning language models with just forward passes." Advances in Neural Information Processing Systems 36 (2023): 53038-53075.

[2] Ye, Haishan, and Tong Zhang. "Mirror Natural Evolution Strategies." arXiv preprint arXiv:1910.11490 (2019).

[3] Yao, Z., Gholami, A., Shen, S., Keutzer, K., and Mahoney, M.W. "ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning." arXiv preprint arXiv:2006.00719 (2020).

[4] Liu, H., Li, Z., Hall, D.L., Liang, P., and Ma, T. "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training." arXiv preprint arXiv:2305.14342 (2023).

[5] Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. "Scalable second order optimization for deep learning." arXiv preprint arXiv:2002.09018 (2020).

Comment

Question: Natural gradient descent achieves linear convergence; thus, HiZOO should provide theoretical insights or intuition in this regard.

Answer: Thank you for raising this point. As discussed earlier, the convergence rate of both MeZO and HiZOO is determined by the effective rank $r$ of the Hessian, which is significantly smaller than the parameter dimension $d$ in most practical scenarios. This inherent advantage enables HiZOO to converge faster than methods like SGD, whose efficiency scales with $d$.

Additionally, HiZOO leverages diagonal preconditioning to further accelerate the estimation process. Diagonal preconditioning has been used extensively to enhance optimization efficiency, including in well-known optimizers such as AdaHessian [1], Sophia [2], and Shampoo [3]. These methods employ similar diagonal adjustments to improve convergence speed and stability in high-dimensional settings.

[1] Yao, Z., Gholami, A., Shen, S., Keutzer, K., and Mahoney, M.W. "ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning." arXiv preprint arXiv:2006.00719 (2020).

[2] Liu, H., Li, Z., Hall, D.L., Liang, P., and Ma, T. "Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training." arXiv preprint arXiv:2305.14342 (2023).

[3] Anil, R., Gupta, V., Koren, T., Regan, K., and Singer, Y. "Scalable second order optimization for deep learning." arXiv preprint arXiv:2002.09018 (2020).

Comment

I appreciate the authors' response.

However, the local effective rank condition (presented in MeZO) could also be applied to the analysis of HiZOO. Also, the proof technique in the current version of HiZOO does not give theoretical superiority over MeZO, even with the local effective rank condition. So, I think the authors should improve this part as a future direction.

I also understand that this work focuses more on empirical studies, so I have raised the score to 5. But I strongly recommend that the authors revise the theory related to the trace of the estimators.

Comment

Thank you again for your valuable feedback and for recognizing the empirical contributions of our work. We appreciate your suggestion to enhance the theory related to the trace of estimators and consider the application of the local effective rank condition in HiZOO.

We are actively working on this improvement and will try hard to provide a detailed proof with as much depth as possible before the rebuttal phase concludes. Your insights are critical, and we hope you can consider our updated analysis during the reviewer discussion phase.

Thanks again for your constructive suggestions and for your support in raising the score.

Review
6

Motivated by the limited optimization effectiveness of zeroth-order optimizers in deep learning, this manuscript leverages a diagonal Hessian to enhance the optimization quality. Though introducing second-order curvature information to aid the optimization is standard, the manuscript is quite comprehensive: it builds methods step-by-step, provides a theoretical justification, and verifies the effectiveness across various model architectures and datasets.

Strengths

  • The manuscript studies an interesting problem: efficient and effective zeroth-order fine-tuning. The manuscript has a clear motivation in the introduction part, with a thorough step-by-step explanation in Section 3.3. Extensive empirical studies consider evaluating both encoder-decoder and decoder-only neural architectures on the GLUE benchmark over several baseline methods. The evaluation uses the number of forward passes (which is very good) and discusses memory usage and time efficiency.
  • The manuscript has some preliminary convergence analysis.

Weaknesses

  1. The convergence analysis part can be strengthened. It would be great if the analysis could cover the case of MeZO (and other zeroth-order algorithms) and carefully explain the gain introduced by the diagonal Hessian. See some examples in [1], where the query complexity can be discussed and compared.
  2. The manuscript structure can be improved. E.g., some detailed derivations in Sec 3 can be simplified, and the definition of the three test functions in Sec 3.5 can be moved to the main text.
  3. Extending to other LLM SFT datasets: it would be great if the manuscript could verify the effectiveness of HiZOO on some SFT datasets.

Reference

[1] Stochastic Two Points Method for Deep Model Zeroth-order Optimization, https://arxiv.org/abs/2402.01621.

Questions

NA

Comment

We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript. Your valuable comments and suggestions are instrumental in enhancing the quality of our work. Thank you for your insightful feedback.

Question 1: Thank you for your suggestion; this reference provides a very solid theoretical treatment that is worth learning from. By Theorem 3.2, we can obtain a query complexity of $\left(\frac{\mathrm{tr}(\Sigma)}{\beta_\ell}\right)^2 \cdot \frac{1}{\epsilon^4}$ for our method. Our query complexity depends on $\epsilon^{-4}$ while the referenced paper's depends on $\epsilon^{-2}$; our baseline MeZO is also $\epsilon^{-4}$. This is because our method samples functions (namely stochastic gradient descent) while that paper uses all functions per iteration (namely gradient descent), so we cannot conduct a direct comparison. However, we will cite this reference and build upon its methodology in our work.

Question 2: We will simplify the detailed derivations in Section 3 as recommended. Additionally, we will move the definitions of the three test functions in Section 3.5 to the main text to enhance clarity and readability.

Question 3: Thank you for pointing out the importance of extending our experiments. On RoBERTa-large we conducted few-shot learning experiments, and on the OPT, Llama, and Phi models we conducted full-shot learning, which already covers supervised settings. We will also consider including more datasets and models to evaluate our method.

Comment

Dear Reviewer RuKr, we greatly appreciate the time and effort you have dedicated to reviewing our submission.

We have read the paper [Stochastic Two Points Method for Deep Model Zeroth-order Optimization], https://arxiv.org/abs/2402.01621. It is truly impressive, with solid theoretical foundations and remarkable contributions. We greatly admire this work and see it as a valuable resource for our own learning.

As part of the ongoing discussion period, we have provided detailed responses to the points raised in your review. We would sincerely appreciate any additional feedback or clarifications you may have. With only a few days remaining in the discussion period, we are eager to ensure a productive and collaborative exchange.

Thank you once again for your valuable input and consideration.

Comment

Thank you for the feedback. I will maintain my current rating.

AC Meta-Review

This paper proposes to precondition MeZO with an approximated, EWM-denoised diagonal Hessian, which is computed via an additional forward pass per step. The authors demonstrate superior convergence performance and prove convergence of their method, HiZOO, in the L-smooth but nonconvex setting. The strengths of the paper are the strong empirical performance and the thorough evaluation of the fine-tuned models. I also appreciate that the authors directly addressed the obvious additional memory requirement of storing the length-d Hessian diagonal; the proof of convergence in the L-smooth but non-convex setting is a further strength.

Additional Comments on Reviewer Discussion

The largest remaining weakness from the reviewer voting for rejection seems to be that the theoretical result in the paper is not substantially better than what is achievable with first-order methods (some of the other concerns, e.g. that the referenced work analyzes natural gradient descent and not Newton's method, do not seem to hold). From my perspective, because the authors are only assuming L-smoothness of the objective function, I don't think it's reasonable to expect better convergence rates here -- in general, proving faster convergence of second-order methods on nonconvex functions, I think, requires stronger assumptions than this (e.g., smoothness of the Hessian).

Frankly, I don't think that an LLM fine-tuning paper is the appropriate place to litigate theoretical gaps in first versus second order method convergence rates, which is a topic worthy of study in its own right. From an empirical perspective it's not unreasonable to conjecture that including some curvature measure in the optimization procedure could improve performance, and indeed it appears to do so here.

Final Decision

Accept (Poster)