Harmony in Divergence: Towards Fast, Accurate, and Memory-efficient Zeroth-order LLM Fine-tuning
Summary
Reviews and Discussion
The paper introduces Divergence-driven Zeroth-Order (DiZO) fine-tuning for large language models.
- Observation: First-order (FO) fine-tuning naturally produces different update magnitudes across layers, while existing zeroth-order (ZO) methods apply nearly uniform random updates, slowing convergence.
- Method: DiZO keeps memory-light ZO steps but periodically projects the updates toward the pretrained weights (“anchor”) for selected layers (mainly Query/Value matrices), recreating FO-like layer-specific behavior without any backward pass.
- Results: On RoBERTa-large, OPT, and Llama, DiZO cuts iterations by up to 50 % and GPU hours by up to 48 %, sometimes even surpassing full FO fine-tuning, while requiring only ~17 % extra stored parameters.
Strengths and Weaknesses
Strengths:
- The paper tackles a central challenge in Large Language Model (LLM) fine-tuning: how to improve the convergence speed and accuracy of zeroth-order (ZO) optimization while maintaining its memory efficiency.
- The paper introduces a novel "layer-wise divergence analysis" revealing that FO methods make diverse, fine-grained layer updates, unlike ZO's uniform updates. Building on this, DiZO proposes "divergence-driven layer adaptation" using learnable, anchor-based projections to scale ZO updates for each layer, mimicking FO's adaptive capabilities without backpropagation's memory cost. Crucially, these projections are learned via a ZO-based method for end-to-end memory efficiency, with "Re-initialization" and "Projection Clipping" enhancing stability.
Weaknesses:
- Why is the checkpoint chosen as the anchor the original pretrained model, rather than a model that has already been partially fine-tuned on the task data?
- Even when the anchor is limited to the Query and Value matrices, roughly 16 % of the pretrained parameters must still reside in memory. While this footprint is acceptable for cloud servers, it can be prohibitive for devices with extremely limited VRAM or other edge hardware. The authors do not explore weight quantization, pruning, or other anchor-compression techniques to further reduce this overhead.
- There are now several Sparse MeZO works that achieve competitive—or even higher—accuracy. It would be valuable for the authors to include a direct comparison with these sparse approaches.
Questions
Good work, but it requires additional memory overhead. Meanwhile, several sparse ZO approaches have achieved notable success—for instance, “Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning.” Could the present paper demonstrate that it outperforms such methods? With comparable accuracy, users will naturally favor the option that consumes less memory. If the authors can demonstrate that their method outperforms the sparse ZO approaches, I would be inclined to raise my score. The authors could demonstrate this by presenting evidence on either accuracy or convergence speed.
Limitations
- The theoretical analysis is insufficient: the paper does not clearly explain why the pretrained LLM is chosen as the anchor, nor why only Q and V are projected instead of also including K and O.
- The performance gains come at the cost of increased memory usage.
Justification for Final Rating
Thank you for your comment. Your feedback has been addressed in the discussion thread, and I have no further questions at this time.
Formatting Issues
No major formatting issues in this paper.
We would like to thank the reviewer for the positive feedback and valuable questions. We carefully address all of the reviewer's questions below and provide additional results, such as a comparison with Sparse MeZO. We hope our response helps clarify the reviewer's concerns.
Q1: Why is the checkpoint chosen as the anchor the original pretrained model, rather than a model that has already been partially fine-tuned on the task data?
A: Thank you for your valuable question. We have tried using a partially fine-tuned model as the anchor, as shown in Appendix C.3, but using the original pretrained model yields better performance. To further address your concern, we add two extra settings, using an earlier checkpoint and even the best-performing checkpoint as the anchor for projection. The results are shown in the table below. Across all settings, using the original pretrained model consistently yields better performance. We attribute this to the high variance and instability of intermediate checkpoints in zeroth-order (ZO) optimization. Since ZO gradients are noisy by nature, later-stage local parameters may drift and become unreliable as stable anchors. In contrast, the pretrained model offers a clean and robust initialization, serving as a more reliable geometric reference for projection. This observation is also supported by recent studies on robust fine-tuning [1][2], which highlight the importance of leveraging pretrained knowledge to stabilize adaptation under noise-prone settings.
Table R.4. Results of using different partially fine-tuned checkpoints as the anchor.
| Anchor | SST-2 Acc | SST-2 GPU hours | RTE Acc | RTE GPU hours |
|---|---|---|---|---|
| NA (MeZO) | 90.0 | 100% | 63.5 | 100% |
|  | 88.4 | 116.7% | 61.8 | 113.0% |
|  | 90.7 | 87.8% | 64.5 | 90.3% |
|  | 91.0 | 81.8% | 66.9 | 80.0% |
| DiZO (pretrained anchor) | 92.5 | 55.7% | 68.2 | 62.3% |
[1] Dong X., Luu A. T., Lin M., et al. How should pre-trained language models be fine-tuned towards adversarial robustness? NeurIPS 2021, 34: 4356-4369.
[2] Wang S., Zhang J., Yuan Z., et al. Pre-trained model guided fine-tuning for zero-shot adversarial robustness. CVPR 2024: 24502-24511.
Q2: Even when the anchor is limited to the Query and Value matrices, roughly 16 % of the pretrained parameters must still reside in memory. Authors do not explore weight quantization, pruning, or other anchor-compression techniques to further reduce this overhead.
A: Thank you for this valuable suggestion. Our method is orthogonal to such techniques and can be combined with them. We explored anchor compression via quantization of the pretrained Query and Value matrices to 8-bit and 4-bit precision [1]. As shown in Table R.5, DiZO can be effectively combined with quantization while still preserving its advantages in both accuracy and GPU hours. Exploring more advanced quantization or compression methods to further improve anchor efficiency while maintaining performance is an important avenue for future work.
Table R.5 DiZO with quantized anchor.
| Method | SST-2 Acc | SST-2 GPU hours | RTE Acc | RTE GPU hours |
|---|---|---|---|---|
| MeZO | 90.0 | 100% | 63.5 | 100% |
| DiZO (8-bits) | 92.2 | 63% | 67.2 | 71% |
| DiZO (4-bits) | 91.7 | 67% | 65.2 | 68% |
| DiZO | 92.5 | 56% | 68.4 | 62% |
[1] Shao W., Chen M., Zhang Z., et al. OmniQuant: Omnidirectionally calibrated quantization for large language models. ICLR 2023.
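To make the quantized-anchor setup concrete, here is a minimal sketch of per-tensor absmax quantization for an anchor matrix, dequantized only when the projection distance is needed (our own simplified illustration, not the OmniQuant procedure or the exact code used in the experiments):

```python
import torch

def quantize_anchor(w: torch.Tensor, bits: int = 8):
    """Store an anchor weight matrix with per-tensor absmax quantization."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit, 7 for 4-bit
    scale = w.abs().max().clamp(min=1e-12) / qmax
    q = torch.clamp((w / scale).round(), -qmax, qmax).to(torch.int8)  # int8 container
    return q, scale

def anchor_distance(w_current: torch.Tensor, q_anchor: torch.Tensor, scale: torch.Tensor):
    """Dequantize on the fly and return the distance used by the projection."""
    w_anchor = q_anchor.float() * scale
    return (w_current.float() - w_anchor).norm()
```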
Q3: There are now several Sparse MeZO works that achieve competitive—or even higher—accuracy. If the authors can demonstrate that their method outperforms the sparse ZO approaches, I would be inclined to raise my score.
A: Thank you for your insightful suggestion. Since Sparse MeZO has not released its source code, we implemented it ourselves based on the best configuration reported in the paper. Specifically, we set the sparsity ratio to 0.75, which was shown to yield the best performance. Because the original paper does not clearly specify how to determine the pruning threshold, we adopted a percentile-based strategy to achieve the target sparsity.
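For reference, a minimal sketch of the percentile-based masking strategy described above (our own simplification; names are illustrative, and whether large- or small-magnitude entries are kept should follow the best configuration reported in the original paper):

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float = 0.75) -> torch.Tensor:
    """Build a 0/1 mask whose zero fraction equals the target sparsity.

    Here we keep entries above the sparsity-th magnitude percentile purely
    for illustration.
    """
    flat = weight.abs().flatten()
    k = max(1, int(sparsity * flat.numel()))   # index of the sparsity-th percentile
    threshold = flat.kthvalue(k).values
    return (weight.abs() > threshold).to(weight.dtype)  # 1 = perturbed/updated, 0 = masked out
```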
We conducted a direct comparison between our DiZO and Sparse MeZO in terms of both accuracy and training efficiency across two datasets and two model sizes. The results are presented in Table R.6 and Table R.7. From the accuracy perspective, DiZO consistently outperforms Sparse MeZO under all evaluated settings, demonstrating the effectiveness of our projection-based approach.
From the efficiency perspective, Sparse MeZO must generate the sparsity mask dynamically during training. Compared to DiZO, Sparse MeZO requires roughly 20% more GPU hours and reduces throughput by more than 30%. Moreover, as model size grows, the throughput of Sparse MeZO decreases further due to the growing cost of maintaining and updating the sparse mask. While Sparse MeZO reduces the number of training iterations, its lower throughput results in longer total GPU hours compared to DiZO. These findings demonstrate that DiZO not only delivers better accuracy but also achieves more practical training efficiency than Sparse MeZO.
Table R.6. Accuracy and speed comparison on OPT-2.7B.
| Method | Dataset | Accuracy | Throughput | #Train Iter. | GPU Hours |
|---|---|---|---|---|---|
| MeZO | SST2 | 90.0 | 3.3it/s | 100% | 100% |
| Sparse MeZO | SST2 | 91.4 | 2.3it/s | 55% | 79% |
| DiZO | SST2 | 92.5 | 3.1it/s | 52% | 56% |
| MeZO | RTE | 63.5 | 1.7it/s | 100% | 100% |
| Sparse MeZO | RTE | 67.1 | 1.1it/s | 50% | 73% |
| DiZO | RTE | 68.4 | 1.5it/s | 60% | 62% |
Table R.7. Accuracy and speed comparison on OPT-6.7B.
| Method | Dataset | Accuracy | Throughput | #Train Iter. | GPU Hours |
|---|---|---|---|---|---|
| MeZO | SST2 | 90.2 | 1.8it/s | 100% | 100% |
| Sparse MeZO | SST2 | 91.9 | 1.0it/s | 47% | 84% |
| DiZO | SST2 | 92.4 | 1.7it/s | 62% | 65% |
| MeZO | RTE | 73.2 | 0.6it/s | 100% | 100% |
| Sparse MeZO | RTE | 73.8 | 0.3it/s | 39% | 88% |
| DiZO | RTE | 74.8 | 0.5it/s | 65% | 81% |
Q4: The theoretical analysis is insufficient: the paper does not clearly explain why the pretrained LLM is chosen as the anchor, nor why only Q and V are projected instead of also including K and O.
Thank you for your insightful question. We address the two aspects below:
- Why the pretrained model is selected as the anchor.
  - While the anchor choice in DiZO is empirically motivated, it is supported by both theoretical intuition and experimental evidence. We have explored alternatives such as zero vectors and partially fine-tuned checkpoints (Appendix C.3), and consistently found that using the pretrained model as the anchor leads to better convergence and final performance. Intuitively, pretrained parameters provide a stable and informative prior in noisy ZO settings, acting as a regularization target to prevent overfitting and instability, an effect also observed in the robust FO fine-tuning literature [1,2]. Although we cannot theoretically guarantee optimality, we will provide a convergence analysis with this anchor choice in the revised version.
- Why we project only the Q and V matrices.
  - This decision is also empirically driven and supported by ablation studies in Appendix C.1 (a similar pattern holds across datasets). We find that projecting only Q and V strikes the best balance between performance gain and memory cost. This is consistent with the design choice in LoRA, which also focuses on Q and V as the most influential components in transformer attention for task adaptation. Including K and O introduces additional overhead but yields marginal gains. We will clarify this reasoning more explicitly in the final version.
In summary, while these design choices are currently based on empirical evidence and practical insights, they offer strong performance across diverse tasks and models. We agree that formal theoretical analysis on optimal anchor and projection selection is an interesting direction for future work.
Q5: The performance gains come at the cost of increased memory usage.
A: Thank you for raising this important concern. DiZO improves accuracy and significantly reduces GPU hours (by more than 35% overall) at a modest cost: a 7% drop in throughput and 16% additional memory overhead (only 4% with quantization). In contrast, other ZO baselines often incur much higher costs. For example, HiZOO reduces throughput by 32% and, in theory, requires up to 100% more memory to store second-order statistics. Sparse MeZO avoids additional memory usage but sacrifices 50% of throughput. Compared to these baselines, DiZO strikes the best balance among performance, efficiency, and resource usage. Moreover, when compared to resource-intensive FO full-parameter fine-tuning and LoRA, DiZO achieves comparable accuracy while reducing memory usage by nearly 90%, highlighting its practicality for large-scale deployment.
Thank you for your thoughtful and timely response. Your clarification has effectively addressed most of my concerns. I have no further questions at this time and will raise my score accordingly.
Thank you for your positive feedback and constructive comments (e.g., adding a comparison with Sparse MeZO), which make our paper more comprehensive. We have addressed all your comments in our revision. Thank you again for your valuable time.
Best,
Author
This paper identifies that standard ZO fine-tuning methods (e.g., MeZO) apply fixed-magnitude updates to each layer of the network, whereas first-order methods compute gradients of varying magnitudes for each layer of the network. As such, they propose a way to rescale the updates for each layer in the network. They include a convergence analysis and experiments on medium-sized BERT models as well as larger autoregressive language models.
Strengths and Weaknesses
Strengths:
- It is worthwhile to investigate the gap between zeroth and first order training methods in order to improve the former to be as efficient as the latter.
- There are thorough experiments evaluating the efficacy of this method
Weaknesses:
- Lacks comparison to layerwise learning rates: The method amounts to using a layerwise learning rate for each layer of the network. The authors acknowledge this as well, but they do not compare to existing layerwise LR methods (the most popular of which is LARS). These methods are expensive to naively implement on top of ZO methods, but I think there are straightforward ZO approximations to be made. At the very least, the experiments should include ablations and comparisons against the more expensive but standard way to impose layerwise learning rates.
- Performance benefit is marginal: The performance benefit in the setting where the benefits of ZO are apparent (ie at the ~7B model scale) is marginal. There is still a substantial gap to the first order methods, suggesting that the original premise of the paper -- that fixed-magnitude updates to each layer are the culprit for the performance gap -- is not actually true. This problem is made worse by the fact that the authors do not do multiple trials of fine-tuning, which is standard in FT literature, given the noise present when optimizing on small datasets.
- Theoretical analysis is poor: The theoretical analysis is poor and provides little to no insight. The technique is to assume a bounded update at each timestep and then to say that the variance of the gradient estimate is reduced. A better theoretical justification for this method would be to argue rigorously that the gradient update will be small enough when using the additional tricks in DiZO -- this is a more complex analysis and is likely not true in general. As it stands, the theory adds nothing to my understanding of the paper. A good theory section would actually advocate for why layerwise LRs are necessary.
- Requires hacky tricks: The two tricks used to make DiZO work, re-initializing and clipping, are pretty hacky, and I am not confident that they will work in new + unseen settings, especially because clipping requires setting an additional hyperparameter.
Questions
- Can you explain why the convergence analysis is significant? It actually does not contain a convergence result in the first place and only contains a bound on the variance of the gradient. I would raise my score if the theory can be improved to more rigorously justify the tricks in DiZO -- for example, if the theoretical result provides clear guidance on how to set the clipping hyperparameters.
- Why did you not compare against any layerwise learning rate methods?
- Figure 1 is missing a numerical y-axis, which makes it hard to determine if the plotted differences between FO and ZO methods are actually significant in scale. It is also not clear why you only plotted the Q, K, V, O values out of all the parameters in the network.
- Why do you call this a projection? Projections of gradient estimates are usually something like a preconditioner. This is just a scalar multiplied to the update on each layer. I found the writing extremely confusing for a very simple point.
Limitations
The limitations only discuss issues with the theoretical section and do not mention the fundamental issues that I raised above. It would be good if the authors acknowledge the limited empirical efficiency of their method, the introduction of an additional hyperparameter, and the overall uncertainty around their hypothesis concerning the ZO-FO gap when fine-tuning models.
Justification for Final Rating
Much of the relevant information was buried in the appendices. Additional experiments run by the authors suggest the method works consistently. The theory clarification, along with the additional corollary makes the section much more interesting. I am not putting a 5 because I still think the validation settings are not very difficult. Many of these models can solve these tasks without fine-tuning in the first place.
Formatting Issues
N/A
We thank the reviewer for the valuable feedback. We carefully addressed the questions raised by the reviewer and added results as suggested.
Q1: Lacks comparison to layerwise learning rates: The method amounts to using a layerwise learning rate for each layer of the network. The authors acknowledge this as well, but they do not compare to existing layerwise LR methods (the most popular of which is LARS).
A: Thank you for this insightful comment. We clarify two key points:
- First, our method is fundamentally different from layer-wise learning rate approaches, as discussed in Section 6.5. DiZO uses geometric constraints to guide parameters toward a target distance from a fixed anchor, deciding where to move, unlike layer-wise LR methods that control how fast to move.
- Second, we compared DiZO with a representative adaptive LR method in Appendix D. Modern adaptive methods typically combine momentum and second-order estimation. As shown, momentum incurs high computational overhead in ZO settings, while second-order methods like HiZOO reduce throughput and increase memory usage. DiZO avoids both issues, remaining lightweight and efficient.
To address your concern directly, we incorporated LARS into ZO and report results in Table R.3. ZO+LARS underperforms MeZO and often fails to converge. This is likely due to ZO gradients having larger and noisier magnitudes, making LARS’s update rule highly unstable. These results suggest naive LARS adoption is ill-suited for ZO. DiZO instead offers a stable, low-overhead alternative that captures layer-specific behavior through geometric constraints.
Table R.3 Apply LARS to MeZO.
| Method | SST-2 | RTE |
|---|---|---|
| MeZO | 90.0 | 63.5 |
| MeZO+LARS | 57.2 | 53.4 |
| DiZO | 92.5 | 68.2 |
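For transparency, a minimal sketch of how a LARS-style trust ratio can be grafted onto a MeZO-style update (a simplified illustration with hypothetical names, not our exact experimental code; MeZO itself regenerates the perturbations from a seed rather than storing them):

```python
import torch

@torch.no_grad()
def lars_zo_step(model, projected_grad, z_per_param, lr=1e-6, trust_coef=1e-3, eps=1e-8):
    """Scale each parameter's ZO surrogate gradient by a LARS trust ratio.

    projected_grad: scalar SPSA estimate (L(theta + eps*z) - L(theta - eps*z)) / (2*eps)
    z_per_param:    dict mapping parameter name -> its Gaussian perturbation z
    """
    for name, p in model.named_parameters():
        g = projected_grad * z_per_param[name]                   # ZO surrogate gradient
        trust_ratio = trust_coef * p.norm() / (g.norm() + eps)   # LARS layer-wise scale
        p.add_(g, alpha=-(lr * trust_ratio).item())
```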
Q2: Performance benefit is marginal: The performance when at the ~7B model scale is marginal. There is still a substantial gap to the FO methods.
A: Thank you for your important question. We would like to point out that the performance improvement of our method mainly lies in GPU hours (reducing training iterations without sacrificing throughput), instead of accuracy. For example, DiZO achieves 22% and 17% GPU-hour reduction on MT-Bench and MMLU with LLaMA2-7B and LLaMA3-8B, and improves accuracy to some extent. These results reinforce our central claim: fixed-magnitude updates are a critical bottleneck in ZO fine-tuning, and addressing them is an effective way to improve real-world ZO training.
Moreover, it is common for a gap between ZO and FO methods to exist, especially at the ~7B model scale. This is in part due to the inherent variance of ZO gradient estimators, which increases with the model dimension. Thus, performance gaps at larger model scales (e.g., 7B+) are to be expected and reflect a known limitation of ZO methods.
Q3: The authors do not do multiple trials of fine-tuning, which is standard in FT literature, given the noise present when optimizing on small datasets.
A: Thank you for your important question. For multiple trials, we apologize for the omission in the main text: all experiments were conducted with three random seeds, and the reported numbers are mean values. Due to page limitations, the variances are not provided in the main paper. As evidence, please check our appendix (Table E.1, Table E.5, and Table E.6), where we include the variances. We will make this clearer and add corresponding results in the final version of the paper.
Q4: Theoretical analysis is poor: The technique is to assume a bounded update at each timestep and then to say that the variance of the gradient estimate is reduced.
A: Thanks for the reviewer's suggestion. Thm. 5.3 is not a mere variance-reduction lemma. It establishes a non-convex convergence guarantee whose rate depends on the effective dimensionality $d_{\text{eff}}$ produced by DiZO's layer-wise projection rather than on the full parameter dimension $d$. As the number of iterations $T \to \infty$, the expected gradient norm converges to zero, i.e., the algorithm reaches a stationary point. This goes well beyond showing that the variance of the gradient estimator decreases. If we suppressed the layer-wise decomposition, we would revert to the classical ZO-SGD bound, which scales with the full dimension $d$.
Because modern models have $d \gg d_{\text{eff}}$, the constant slowdown is non-negligible (e.g., in a 24-layer vision transformer with uniform 1K-dimensional layers, the full dimension is 24x the dimensionality of any single layer). Thm. 5.3 therefore shows that the layer-wise schedule is not a heuristic: it yields a measurably faster convergence rate. This is precisely the kind of justification the reviewer asked for. Per the reviewers' suggestion, we will move the rate formula into the statement of Thm. 5.3 and add a one-sentence corollary emphasizing the dependence on $d_{\text{eff}}$.
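For reference, a commonly stated textbook form of the classical non-convex ZO-SGD guarantee, under standard $L$-smoothness and with stochastic-noise terms omitted (the exact expression and constants in our Thm. 5.3 differ, in particular replacing $d$ with $d_{\text{eff}}$), is:

```latex
\min_{t < T}\; \mathbb{E}\big[\|\nabla f(\theta_t)\|^{2}\big]
\;\le\; \mathcal{O}\!\left(\frac{d \, L \,\big(f(\theta_0) - f^{*}\big)}{T}\right)
```

Replacing the full dimension $d$ with $d_{\text{eff}} \le d$ is exactly where the layer-wise decomposition improves the constant.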
Q5: I would raise my score if the theory can be improved to more rigorously justify the tricks in DiZO, for example, how to set the clipping hyperparameters.
A: Thank you for your insightful suggestion. To address your concern regarding the theoretical justification of the clipping hyperparameter $\tau$, we have added a formal result, Corollary 5.4 ($\tau$-stability of the clipping step), which rigorously analyzes how the clipping range impacts optimization stability and convergence.
Specifically, we show that for any $\tau > 0$, the deviation between the clipped iterate and the unclipped point is bounded by a quantity proportional to $\tau$, with constants involving the maximum norm of the gradient estimator and the maximum norm of the layer-wise update directions.
Furthermore, we prove that if $\tau$ is chosen below a threshold determined by these two norms, then the clipping step preserves the original non-convex convergence rate of DiZO. This shows that clipping introduces only a bounded perturbation to the ZO step and does not interfere with the theoretical guarantees.
In practice, we approximate this threshold when setting $\tau$, which corresponds to the theoretical condition in Eq. (2). Empirically, we use $\tau = 0.2$ as the default setting, and this value works well across different tasks, as observed in our experiments (Appendix C.4).
We believe this theoretical development and practical guidance provide a rigorous and actionable justification for the clipping mechanism in DiZO, and we hope this addresses your concern.
Q6: Requires hacky tricks: re-initializing and clipping are pretty hacky, and I am not confident that they will work in new + unseen settings, especially because clipping requires an additional hyperparameter.
A: Thank you for raising this important concern. We have demonstrated consistent improvement across multiple models and datasets, and our code is publicly available. Appendix C.2 includes detailed ablations (with loss curves) showing that removing either strategy leads to performance drops. For clipping, we observe that a single value (0.2) works robustly across all tasks and models (see Appendix C.4); all reported results use this same value. These techniques are also common in practice: FO methods often rely on clipping or rescaling for stability. We believe the strong empirical performance and robustness validate the practical value of these design choices.
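To illustrate, a minimal sketch of the clipping step with the default value of 0.2, assuming it constrains the learned layer-wise scale to a band around 1 (the exact quantity clipped in DiZO may be parameterized differently):

```python
import torch

def clip_projection_scale(gamma: torch.Tensor, tau: float = 0.2) -> torch.Tensor:
    """Keep the learned per-layer projection scale within [1 - tau, 1 + tau]."""
    return torch.clamp(gamma, 1.0 - tau, 1.0 + tau)
```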
Q7: Figure 1 is missing a numerical y-axis. It is also not clear why you only plotted the Q, K, V, O values.
A: Thank you for the helpful comment. We appreciate the suggestion and will revise Figure 1 accordingly. While the upper and lower subplots share the same y-axis scale, we agree that omitting the numeric y-axis in the upper plot hinders clarity; we will include it in the final version. Regarding the focus on the Q, K, V, O parameters: we chose these because they are core components of the attention module, which prior work has shown to be critical for transferability and model performance. These parameters also tend to exhibit the most noticeable differences between FO and ZO gradients, making them representative for illustrating our point. We will add more plots in the Appendix showing similar statistics for the other layers in the model; we find the trend to be consistent, further supporting our conclusions.
Q8: Why do you call this a projection? Projections of gradient estimates are usually something like a preconditioner. This is just a scalar multiplied to the update on each layer.
A: Thank you for raising this important point. In our method, however, “projection” refers to a geometric constraint applied in parameter space, specifically, we rescale each layer’s update to achieve a target distance from a fixed anchor. This operation is not a projection of the gradient, but rather a form of parameter update rescaling. Moreover, we adopted the term “projection” also by analogy to recent works (e.g., [1, 2]) that apply similar parameter-level constraints and also refer to them as projections. That said, we agree that this terminology might be misleading and will revise the writing to make this distinction clearer in the final version of the paper.
[1] Towards calibrated robust fine-tuning of vision-language models. NeurIPS 2024.
[2] Fast trainable projection for robust fine-tuning. NeurIPS 2023.
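For concreteness, a minimal sketch of the layer-wise rescaling described above (simplified; the learnable parameterization in DiZO may differ):

```python
import torch

def project_layer(w: torch.Tensor, w_anchor: torch.Tensor, gamma: float) -> torch.Tensor:
    """Rescale a layer's deviation from its anchor to a target distance.

    gamma > 1 pushes the layer further from the pretrained anchor,
    gamma < 1 pulls it back toward the anchor.
    """
    return w_anchor + gamma * (w - w_anchor)
```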
Thank you to the authors for responding to my points. I am raising my score to a 4 after reading the authors' responses to my questions, other reviewers' concerns, and their other responses. The additional experiments do suggest the value of DiZO over MeZO is consistent, but I do believe these tasks are quite simple and can often be solved with no fine-tuning at all. I recommend to the authors that they substantially revise the text to include pointers to the relevant additional experiments in the appendix. Readers will otherwise find the claims to be poorly supported, as I initially did.
Thank you for raising the score and for your time spent reviewing our paper. This is a great affirmation of our work. Your comments are very constructive (e.g., adding theoretical analysis and more pointers to the experiments in the Appendix), which makes our paper stronger and clearer. We have addressed your comments in our revision. Thank you again for your valuable time.
Best,
Author
This paper introduces Divergence driven Zeroth Order optimization method. It begins with an analysis on layer-wise divergence between FO and ZO methods. Motivated by this, it proposes anchor-based, learnable projections to adapt ZO updates layer-wise. Experiments across different models show that DiZO outperforms ZO baselines on both full parameter and PEFT settings with faster convergence and improved accuracy.
Strengths and Weaknesses
Strengths:
- The analysis comparing ZO and FO optimization is clearly presented and provides a strong foundation for motivating the proposed method.
- The proposed DiZO is memory-efficient and achieves performance on par with FO methods.
- The experiments are thorough, covering both autoencoding and autoregressive models, and are complemented by detailed analyses of memory usage and computational efficiency.
Weaknesses:
- On larger autoregressive models, the performance gap between DiZO and FT seems bigger than that observed in smaller autoencoding models. It would be helpful if the authors could provide insights into the cause of this difference.
- The method introduces additional hyperparameters such as projection update interval and clipping range. The paper provides limited discussion on how sensitive performance is to these design choices.
Questions
Minor Typo: Line 265 is missing a period at the end of the sentence.
Please see weaknesses for questions.
Limitations
Yes.
Justification for Final Rating
I have read the authors' responses and found some of them already explained in the appendix. I encourage the authors to integrate these clarifications into the revision. I have no further concerns and maintain my positive recommendation for this paper.
Formatting Issues
NA
We thank the reviewer for the valuable feedback. We carefully address the questions below and hope our response clarifies the reviewer's concerns.
Q1: On larger autoregressive models, the performance gap between DiZO and FT seems bigger than that observed in smaller autoencoding models. It would be helpful if the authors could provide insights into the cause of this difference.
A: Thank you for your important question. We agree that the performance gap (in both convergence speed and accuracy) between ZO methods and full FT appears larger on large-scale autoregressive models than on smaller autoencoding models, but this is a general phenomenon shared by all ZO methods. We believe this is because the variance of the gradient estimated by ZO increases with model size. As shown in [1], the variance of the zeroth-order gradient estimator grows with the parameter dimension. Larger autoregressive models like LLaMA-7B/8B and OPT-13B thus suffer from higher gradient noise, making convergence more challenging for ZO methods, which is a natural limitation of ZO itself.
Despite this, DiZO still outperforms other ZO baselines in both convergence speed and final accuracy, efficiently reducing this gap between ZO and FT (in both convergence speed and accuracy). Moreover, it preserves its key advantage of low memory usage, making it particularly promising in constrained training scenarios where full FT is infeasible. We hope to explore improved stabilization techniques to further close this gap in future work.
[1] Nesterov Y., Spokoiny V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 2017, 17(2): 527-566.
Q2: The method introduces additional hyperparameters such as projection update interval and clipping range. The paper provides limited discussion on how sensitive performance is to these design choices.
A: Thank you for your important question. We provide ablations on hyperparameters such as the update interval and the clipping range in Appendix C.4. Specifically, we find that a clipping range of 0.2 works best across different tasks. As for the update interval, we tested values of 50, 100, and 200 and find that the loss curve changes little. These results illustrate that our method is not sensitive to hyperparameter selection and is robust. Moreover, we add further ablation results, including loss curves on two more datasets and models, and results with the update interval set to 150. The pattern remains consistent in these extra results, and we will plot the additional loss curves in the revision.
Minor Typo: Line 265 is missing a period at the end of the sentence.
A: Thank you for your careful review. We have added the period in the revision!
Thank you for your responses and for providing additional insights. I encourage the authors to integrate these clarifications into the revision. I have no further concerns and maintain my positive recommendation for this paper.
Dear Reviewer t4pe,
Thanks for your time and reviewing efforts! We appreciate your comments.
We have provided the suggested results in the authors' response, such as an explanation for why the performance gap increases with model size, and a clarification of the sensitivity of our design choices. We have also provided more ablations, including results on additional datasets and hyperparameter values.
We hope our responses have answered your questions, and we are more than willing to answer if you have more questions.
Best,
Authors
This paper proposes DiZO, a gradient-free optimization method for efficient fine-tuning of large language models (LLMs). The authors begin with a layer-wise divergence analysis, observing that first-order (FO) optimization produces updates with varying magnitudes across layers, while zeroth-order (ZO) optimization tends to apply uniform-magnitude updates. This discrepancy is identified as a key limitation of ZO methods in terms of convergence and performance. To address this, DiZO introduces a projection mechanism guided by the divergence between the current model and the initialization (used as an anchor), enabling each layer’s updates to be geometrically constrained and adaptively scaled. This allows ZO to mimic FO-like learning behavior while retaining its memory efficiency. Experiments across multiple models (RoBERTa, OPT, LLaMA) and tasks demonstrate that DiZO significantly outperforms prior ZO baselines and, in some cases, even surpasses FO methods. The paper further provides a convergence analysis of the projection update under effective dimension assumptions and shows compatibility with parameter-efficient fine-tuning (PEFT) methods such as LoRA.
Strengths and Weaknesses
Strengths
- Quality: The method is technically sound, with a clear structure and practical implementability. The proposed algorithm is empirically well validated across diverse models and tasks, with convincing performance and convergence gains.
- Clarity: The paper is clearly written with rigorous derivations, well-designed figures and ablations, and a coherent logical flow that supports both understanding and reproducibility.
- Significance: ZO optimization is increasingly relevant in resource-constrained deployment scenarios. By improving both convergence speed and accuracy, this work enhances the practicality of ZO-based fine-tuning for LLMs.
- Originality: The use of layer-wise divergence to construct projection-based updates that emulate FO behavior under a ZO framework is novel and insightful.
Weaknesses
- While the method is thoroughly evaluated on standard NLP tasks, it lacks analysis of failure cases or known vulnerabilities of ZO methods.
- The periodic projection mechanism and its associated hyperparameter settings are not thoroughly analyzed for robustness and may be sensitive to the choice of tasks.
Questions
- Is using the pretrained model as the anchor empirically optimal? Are there theoretical motivations or data-driven alternatives for selecting a more suitable anchor?
- Currently, DiZO updates the projection parameter γ at fixed intervals k. Could a task-aware or training-state-aware dynamic adjustment mechanism be considered? Such an approach might improve robustness and generalizability across tasks.
- The current experiments focus on NLP classification and generation tasks, while the conclusion mentions future applications of DiZO to vision models. Have the authors observed any consistency or transferability of the projection parameter γ across tasks or model types?
If these questions are addressed, my assessment of the paper’s theoretical contribution and generalizability may improve. The method assumes full access to pretrained model parameters, which may be infeasible in privacy-constrained or adapter-only deployment scenarios. In addition, although DiZO is designed for memory efficiency, its real-world performance on edge hardware or production environments has not been evaluated. Further discussion or experiments in such contexts would strengthen the paper’s practical relevance.
Limitations
The method assumes full access to pretrained model parameters, which may be infeasible in privacy-constrained or adapter-only deployment scenarios. In addition, although DiZO is designed for memory efficiency, its real-world performance on edge hardware or production environments has not been evaluated. Further discussion or experiments in such contexts would strengthen the paper’s practical relevance.
Justification for Final Rating
The authors' responses were of high quality, giving detailed and sincere answers to each question and supplementing new experiments and data on multiple key points, which demonstrates their command of the work and resolved several of my doubts from the initial review: the robustness of hyperparameters, the rationale for anchor selection, and the applicability and generality in restricted scenarios. I have raised my rating.
Formatting Issues
No concerns.
We thank the reviewer for the valuable feedback. We carefully address the questions raised by the reviewer and add results as suggested.
Q1: While the method is thoroughly evaluated on standard NLP tasks, it lacks analysis of failure cases or known vulnerabilities of ZO methods.
A: Thank you for this insightful question. Across our comprehensive experiments, DiZO works well on a wide range of NLP datasets and tasks, yielding significant improvements in accuracy and GPU hours compared to other ZO baselines, and greatly reducing memory overhead compared to FT. Moreover, we also conducted experiments on a vision-language-action (VLA) model, fine-tuning it via ZO on the LiBERO datasets, and observed the loss decrease from 0.6 to around 0.2. Although these are only preliminary results, they illustrate that ZO works in more complex scenarios and is promising for further exploration. However, it is also worth noting that ZO does not work well in training-from-scratch scenarios and performs well only in fine-tuning tasks built on a strong pretrained model.
Q2: The periodic projection mechanism and its associated hyperparameter settings are not thoroughly analyzed for robustness and may be sensitive to the choice of tasks.
A: Thank you for your important question. We provide ablations on hyperparameters such as the update interval and the clipping range in Appendix C.4. Specifically, we find that a clipping range of 0.2 works best across different tasks. As for the update interval, we tested values of 50, 100, and 200 and find that the loss curve changes little. These results illustrate that our method is not sensitive to hyperparameter selection and is robust. Moreover, we add further ablation results, including loss curves on two more datasets and models, and results with the update interval set to 150. The pattern remains consistent in these extra results, and we will plot the additional loss curves in the revision.
Q3: Is using the pretrained model as the anchor empirically optimal? Are there theoretical motivations or data-driven alternatives for selecting a more suitable anchor?
A: Thank you for your valuable feedback. As discussed in Appendix C.3, we have empirically evaluated different anchor choices, such as a zero vector or the parameters from a previous iteration. Among them, using the pretrained model as the anchor consistently yields the best results. We also provide theoretical evidence (to be added) showing that using the pretrained model ensures convergence under our update rule. However, we acknowledge that this choice is not guaranteed to be optimal in all scenarios, and our selection is primarily based on empirical findings. While we have not yet explored data-driven anchor selection strategies, we believe this is a promising future direction. For instance, anchors informed by loss curvature or early-stage performance could offer improved stability in noisy ZO settings. Moreover, recent FO fine-tuning literature [1][2] also suggests that leveraging the pretrained model as a reference improves robustness, so we conjecture that ZO methods can also benefit from pretrained-model-guided fine-tuning.
[1] How should pre-trained language models be fine-tuned towards adversarial robustness? NeurIPS 2021.
[2] Pre-trained model guided fine-tuning for zero-shot adversarial robustness. CVPR 2024.
Q4: DiZO updates the projection parameter γ at fixed intervals k. Could a task-aware or training-state-aware dynamic adjustment mechanism be considered? Such an approach might improve robustness and generalizability across tasks.
A: Thank you for your insightful suggestion. We explored two dynamic strategies for adjusting the projection interval based on training progress:
- an update frequency decreasing strategy, where the projection is performed more frequently at the beginning and less frequently later in training;
- an update frequency increasing strategy, with the opposite schedule.
As shown in Table R.1, the decreasing-frequency strategy (more frequent projection early in training) yields slightly faster convergence in the early stages compared to the fixed-interval baseline, though its final accuracy is marginally lower. The increasing-frequency strategy performs worse in both convergence and final performance. These results suggest that adapting the projection frequency to training dynamics can be beneficial, especially early in training, but further tuning is required. We plan to explore more advanced task-aware or training-state-aware interval adjustment mechanisms in future work, potentially informed by loss trends, gradient norms, or validation metrics.
Table R.1 Accuracy of Different Strategies for Adjusting Projection Interval.
| Intervals Strategy | 500 (Iters) | 1000 | 1500 | 2000 | 2500 | 3000 |
|---|---|---|---|---|---|---|
| Fixed | 63.6 | 73.4 | 81.0 | 90.8 | 91.6 | 91.8 |
| Decreasing Frequency | 64.0 | 78.0 | 83.6 | 91.6 | 91.0 | 90.6 |
| Increasing Frequency | 62.6 | 71.6 | 75.8 | 83.4 | 89.2 | 89.8 |
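For clarity, a minimal sketch of the two schedules (growth/decay factors are illustrative, not the exact values used in the experiments):

```python
def projection_iterations(total_iters=3000, base_interval=100, mode="fixed"):
    """Yield the iterations at which the projection parameters are refreshed."""
    t, interval = 0, base_interval
    while t + interval <= total_iters:
        t += interval
        yield t
        if mode == "decreasing_frequency":    # project less and less often over training
            interval = int(interval * 1.2)
        elif mode == "increasing_frequency":  # project more and more often over training
            interval = max(10, int(interval * 0.8))
```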
Q5: The current experiments focus on NLP classification and generation tasks, while the conclusion mentions future applications of DiZO to vision models. Have the authors observed any consistency or transferability of the projection parameter γ across tasks or model types?
A: Thank you for your insightful question. We do observe a consistent pattern in the behavior of the projection parameter γ across our experiments on both classification and generation tasks. Specifically, the projection-strength ratio tends to be larger than 1 in the early stages of training, pushing the model away from the pre-trained initialization. In the later stages, this ratio stabilizes around or below 1, and the distance between the fine-tuned model and the pre-trained one oscillates rather than continuing to grow. This behavior contrasts with MeZO, where the distance from the pre-trained model increases steadily throughout training. We provide detailed statistics on the distance gap to the pretrained model in Table R.2, and visualize the gap in Figure 1 of the main paper. Previous studies have shown that staying closer to the pre-trained model can improve robustness, especially in noisy optimization settings, supporting our use of projection to regulate the deviation.
While we have not yet evaluated DiZO on vision models, we believe the consistency of γ's dynamics across NLP tasks suggests potential transferability to other domains. This opens up promising directions for future work on cross-modal projection strategies and task-aware projection scheduling.
Table R.2. Distance gap to the pre-trained model, using the Value layer of the first attention module as an example.
| Iter. | 400 | 800 | 1200 | 1600 | 2000 | 2400 | 2800 | 3200 |
|---|---|---|---|---|---|---|---|---|
| MeZO | 1.429 | 2.18 | 2.54 | 2.98 | 3.37 | 3.83 | 4.23 | 4.53 |
| DiZO | 1.59 | 2.36 | 3.46 | 3.29 | 2.98 | 3.48 | 3.32 | 3.37 |
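A minimal sketch (with illustrative names) of how such layer-wise distance statistics can be collected during training:

```python
import torch

@torch.no_grad()
def distance_to_pretrained(model, pretrained_state):
    """Frobenius distance between each current parameter and its pretrained value."""
    return {
        name: (p.detach() - pretrained_state[name].to(p.device)).norm().item()
        for name, p in model.named_parameters()
        if name in pretrained_state
    }
```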
Q6: The method assumes full access to pretrained model parameters, which may be infeasible in privacy-constrained or adapter-only deployment scenarios.
A: Thank you for your valuable question. We acknowledge that in privacy-constrained or adapter-only settings, full access to the pretrained parameters may be unavailable. To address this, we evaluate DiZO with LoRA (Table 3, Table 4, Table E.1), using projection anchors that are either randomly initialized (matrix A) or set to zero (matrix B). Although these anchors lack pretrained knowledge, DiZO still improves convergence speed and final accuracy over MeZO with LoRA. This indicates that DiZO's benefits arise not only from the pretrained parameters, but also from its ability to control parameter deviation and stabilize zeroth-order optimization. Additionally, unlike FO methods requiring gradient access, DiZO remains fully applicable in black-box scenarios, showing strong performance even without access to pretrained weights, which highlights its practicality for privacy-aware fine-tuning.
Q7: DiZO's real-world performance on edge hardware or production environments has not been evaluated. Further discussion or experiments in such contexts would strengthen the paper’s practical relevance.
Thank you for your valuable suggestion. We agree that evaluating DiZO on real-world edge hardware would further strengthen the practical relevance of our work, and recognize this as an important direction for future work. DiZO has several properties that make it well-suited for on-device or edge training scenarios:
- Memory efficiency: Edge devices such as mobile phones and FPGAs typically offer limited memory resources. DiZO significantly reduces memory usage by avoiding activation and gradient storage, making it more deployable in such constrained settings.
- Forward-only optimization: As DiZO only relies on forward passes, it is compatible with existing inference accelerators (e.g., NNAPI on Android, edge TPUs, etc.), which typically lack support for backpropagation. This makes DiZO a strong candidate for adapting inference-only hardware for training.
One potential challenge is DiZO's need for large quantities of random numbers for perturbation, since generating them can impose a significant system burden on edge devices. We conducted tests on a OnePlus 12 (Snapdragon 8 Gen 3): compared to generating large quantities of random numbers on a Linux server, generation on the mobile phone with an Arm GPU incurs only negligible overhead, e.g., generating 1e7 random numbers costs 0.180576 s on Linux and only 0.0598 s on the phone.
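A minimal timing sketch of the random-number-generation test above (illustrative; the on-device measurement used the phone's Arm GPU backend rather than this CPU path):

```python
import time
import torch

def time_randn(n: int = int(1e7)) -> float:
    """Time the generation of n standard-normal samples on the CPU."""
    start = time.perf_counter()
    _ = torch.randn(n)
    return time.perf_counter() - start

print(f"{time_randn():.4f} s to generate 1e7 Gaussian samples")
```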
Deploying our method on an edge device is a promising future research direction, and we will add more discussion regarding it in our revision.
The authors' responses were of high quality, giving detailed and sincere answers to each question and supplementing new experiments and data on multiple key points, which demonstrates their command of the work and resolved several of my doubts from the initial review: the robustness of hyperparameters, the rationale for anchor selection, and the applicability and generality in restricted scenarios. I will raise my rating.
Thank you for your positive feedback and constructive comments (e.g., investigating γ's consistency and adding a discussion of hardware deployment), which make our paper stronger. We have addressed all your comments in our revision. Thank you again for your valuable time.
Best,
Author
Zeroth-order (ZO) optimization is a promising memory-efficient training paradigm: it avoids backward passes and relies solely on forward passes for gradient estimation, making it attractive for resource-constrained scenarios. However, ZO methods lag far behind first-order (FO) methods (conventional gradient-based backpropagation) in both convergence speed and accuracy. This paper introduces a novel layer-wise divergence analysis that uncovers the distinct update patterns of FO and ZO optimization. The proposed DiZO is memory-efficient and achieves performance on par with FO methods. The experiments are thorough, covering both autoencoding and autoregressive models, and are complemented by detailed analyses of memory usage and computational efficiency. The reviews are overall positive.