PaperHub
Overall: 6.0/10 · Poster · 4 reviewers
Ratings: 4, 4, 3, 4 (min 3, max 4, std. dev. 0.4)
Confidence: 3.8
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Zeroth-Order Optimization, Parameter-Efficient Fine Tuning

Reviews and Discussion

Official Review
Rating: 4

This paper addresses the memory inefficiency of fine-tuning large language models. It identifies that while Zeroth-Order (ZO) optimizers like MeZO save memory by avoiding back-propagation, they introduce gradient estimation "noise" that harms performance, especially when applied to large-magnitude weights.

To solve this, the paper proposes Sparse MeZO (S-MeZO), a method that selectively applies updates only to the model's smaller-magnitude weights. This approach is more resilient to noise, which allows for larger learning rates, leading to faster convergence and better accuracy.

A key contribution is a memory-efficient implementation that calculates the necessary sparse mask on-the-fly during the forward pass, keeping memory usage at the same level as inference. This innovation enabled fine-tuning a 30-billion parameter model on a single A100 GPU.

Experiments conducted on LLaMA, OPT, and Mistral models show that S-MeZO consistently outperforms the original MeZO. For example, on the RTE task, it achieved a 9% absolute accuracy improvement and a 3.5x speedup.

Strengths and Weaknesses

Strengths:

  1. Significant Practical Impact and Accessibility: The paper's most significant contribution is making the fine-tuning of extremely large models more accessible. By developing a memory-efficient implementation that calculates sparse masks on-the-fly, S-MeZO successfully keeps memory usage at the level of inference. This innovation enabled the fine-tuning of a LLaMA-30b model on a single A100 GPU, a task that is typically infeasible with standard methods. This dramatically lowers the hardware barrier for LLM research and development.

  2. Novel and Insightful Core Idea: The paper is founded on a highly original empirical observation: the noise inherent in Zeroth-Order (ZO) gradient estimates is more detrimental to large-magnitude weights than to small-magnitude ones. This insight is novel and provides a clear, well-motivated basis for the proposed S-MeZO algorithm, which selectively optimizes these more noise-resilient small weights. This approach offers a new perspective on mitigating noise in ZO optimization.

  3. A notable strength of this paper, setting it apart from some other works focused on ZO optimization, is the inclusion of baselines that use back-propagation to calculate gradients.

Weaknesses:

  1. Since the parameter selection is based on the smallest weights, the benchmark could be strengthened by including a comparison with the opposite strategy: selecting and updating only the largest-magnitude weights. This would serve as a valuable ablation study to more directly validate the paper's core hypothesis.

  2. A potential area for strengthening the experimental comparison would be to include another recent and relevant method in sparse ZO optimization. For instance, comparing S-MeZO against SensZOQ would provide a more comprehensive picture of its performance within this specific sub-field.

Questions

Have the authors explored the upper limits of the sparsity ratio for this method? For instance, what is the highest sparsity tested, and does performance continue to improve or begin to degrade at very high levels of sparsity?

Limitations

The authors have acknowledged the primary performance-related limitation of their work, which is the remaining performance gap between S-MeZO and first-order fine-tuning methods. This transparency is commendable. However, the discussion of limitations could be expanded to make the paper even stronger.

Broader Societal Impact: While the work itself does not have a direct negative impact, it contributes to a line of research that makes powerful LLM fine-tuning more accessible and efficient. The authors could briefly acknowledge the dual-use nature of such technology; lowering the barrier to entry can empower more users for positive applications, but it can also potentially make it easier for malicious actors to create fine-tuned models for generating disinformation or other harmful content. A short discussion of this broader context is standard for LLM research and would demonstrate a more comprehensive consideration of the work's potential impact.

Justification for Final Rating

The authors have provided a thorough set of baseline comparisons.

Formatting Concerns

The manuscript contains spelling and wording mistakes that should be fixed before resubmission. The benchmark name SuperGLUE is repeatedly misspelled as “SuperGULE” or “SuperGULU.”

Comment

Thank you for your constructive comments; we carefully address your concerns below.

w1: Since the parameter selection is based on the smallest weights, the benchmark could be strengthened by including a comparison with the opposite strategy: selecting and updating only the largest-magnitude weights.

Thank you very much for this valuable suggestion.

Our focus on small-magnitude weights is based on their fundamental properties in pre-trained models and zeroth-order optimization.

Ablation Study: Our experimental analysis examines how weights of different magnitudes affect performance. We divided weights into 4 groups by magnitude, where '0-0.25' represents the bottom 25% and '0.75-1.0' represents the top 25%. The results in the tables demonstrate that updating small-magnitude weights consistently achieves better performance than updating large-magnitude weights.

| Task | 0-0.25 | 0.25-0.5 | 0.5-0.75 | 0.75-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 73.6 | 66.1 | 57.3 |
| BoolQ | 81.6 | 77.1 | 67.7 | 66.1 |

Building on these findings, we performed additional experiments to analyze weight selection strategies. While updating small-magnitude weights (0-0.25) shows strong performance, we tested incrementally including more weight groups (e.g., expanding to 0-0.5, which includes both 0-0.25 and 0.25-0.5). The results show that adding weights beyond the small-magnitude group (0-0.25) actually reduces performance, confirming that focusing on small-magnitude weights is optimal.

| Task | 0-0.25 | 0-0.5 | 0-0.75 | 0-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 75.1 | 72.3 | 71.7 |
| BoolQ | 81.6 | 77.8 | 76.7 | 75.9 |
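For concreteness, the quartile-style grouping above can be reproduced with a per-layer quantile mask along the following lines. This is a minimal illustrative sketch in PyTorch; the helper name and the use of `torch.quantile` are assumptions for illustration, not the released code.

```python
import torch

def magnitude_group_mask(weight: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    """Boolean mask selecting weights whose |value| lies in the (lo, hi] quantile
    band of this layer, e.g. lo=0.0, hi=0.25 keeps the smallest 25% by magnitude."""
    mag = weight.abs().flatten().float()
    lower = torch.quantile(mag, lo) if lo > 0 else mag.min() - 1.0
    upper = torch.quantile(mag, hi)
    return (weight.abs() > lower) & (weight.abs() <= upper)

# Example: the '0-0.25' and '0.75-1.0' groups of a single weight matrix
w = torch.randn(1024, 1024)
mask_small = magnitude_group_mask(w, 0.00, 0.25)  # bottom 25% by magnitude
mask_large = magnitude_group_mask(w, 0.75, 1.00)  # top 25% by magnitude
```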

w2: A potential area for strengthening the experimental comparison would be to include another recent and relevant method in sparse ZO optimization. For instance, comparing S-MeZO against SensZOQ would provide a more comprehensive picture of its performance within this specific sub-field.

We thank the reviewer for this excellent suggestion. SensZOQ is indeed a highly relevant sparse zeroth-order fine-tuning method; however, we did not compare against it in our original submission because it was published on arXiv several months after our work, so we believe its absence does not diminish the novelty and contribution of our paper. We fully agree, however, that for completeness we should include a comparison and discussion in our paper.

Before presenting our comparison, we note key methodological differences: (1) SensZOQ requires additional computational cost to calculate first-order gradients for obtaining sparse masks, while our method dynamically generates masks without gradient computation; (2) SensZOQ targets extreme sparsity combined with quantization for edge devices, whereas we focus on balancing sparsity with performance for general memory-efficient training. Our experimental comparison on challenging tasks shows:

| Method | SST-2 | RTE | BoolQ | WIC | MMLU |
| --- | --- | --- | --- | --- | --- |
| LLaMA2-7B + MeZO | 94.0 | 70.2 | 78.8 | 62.2 | 59.2 |
| LLaMA2-7B + SensZOQ | 94.3 | 76.2 | 83.1 | 63.7 | 58.4 |
| LLaMA2-7B + Sparse MeZO | 94.8 | 77.6 | 82.2 | 65.3 | 59.6 |

L1: Addressing the Need for Broader Discussion of Societal Impact and Dual-Use Considerations.

We thank the reviewer for this thoughtful suggestion about expanding our discussion of broader societal impacts. We agree that acknowledging the dual-use nature of making LLM fine-tuning more accessible is important and standard practice in LLM research.

We will expand the limitations section in the revised manuscript to include a discussion of broader societal implications. Specifically, we will acknowledge that while our method democratizes access to LLM fine-tuning through reduced computational requirements—enabling beneficial applications for researchers and practitioners with limited resources—it also potentially lowers barriers for malicious actors who might use fine-tuned models for generating disinformation or other harmful content.

We will also briefly discuss potential mitigation strategies, such as the importance of responsible deployment practices, monitoring fine-tuned model outputs, and the continued development of detection methods for AI-generated harmful content.

C1: The manuscript contains spelling and wording mistakes that should be fixed before resubmission.

We sincerely apologize for the spelling errors and will conduct thorough proofreading to correct all spelling mistakes and grammatical errors before resubmission.

Comment

[W1]: Thank you for your detailed response. It has addressed my question thoroughly.

[W2]: The comparison with SensZOQ is very helpful. May I kindly ask under which density or sparsity ratio these experiments were conducted? In any case, I truly appreciate the authors’ efforts—the results appear solid.

Question :Zeroth-order optimization relies on the expectation that its gradient estimator converges to the true backpropagation gradient as the number of samples increases. If we increase the number of ZO samples to reduce noise, would updating small weights still outperform updating large weights?

Moreover, following the authors' argument, does this imply that even in backpropagation (noise-free), restricting updates to small-weight or small-gradient parameters would be preferable? If not, doesn’t this suggest that the observed benefit is simply a workaround for ZO noise rather than an intrinsic optimization principle?

Comment

We sincerely thank the reviewer sJB2 for the further feedback. Our responses are as follows:

W2: The comparison with SensZOQ is very helpful. May I kindly ask under which density or sparsity ratio these experiments were conducted? In any case, I truly appreciate the authors’ efforts—the results appear solid.

Thank you very much for your follow-up question about W2. Since the SensZOQ code was not publicly available at the time of our experiments, we followed the settings described in the SensZOQ paper and reproduced its performance. We conducted a grid search over parameter selection ratios, testing with {0.1%, 1%, 10%} of parameters selected for updates. We found that updating only 1% of parameters achieved the best performance on RTE and BoolQ, so we used 1% parameter density (i.e., selecting 1% of parameters for optimization) for all experiments in W2. The experimental results for determining the density are shown in the table:

| Method | Density | RTE | BoolQ |
| --- | --- | --- | --- |
| LLaMA2-7B + SensZOQ | 0.1% | 75.1 | 80.8 |
| LLaMA2-7B + SensZOQ | 1% | 76.2 | 83.1 |
| LLaMA2-7B + SensZOQ | 10% | 71.1 | 79.3 |

Question 1: Zeroth-order optimization relies on the expectation that its gradient estimator converges to the true backpropagation gradient as the number of samples increases. If we increase the number of ZO samples to reduce noise, would updating small weights still outperform updating large weights?

Thank you very much for your constructive comment. We conducted experiments with larger batch sizes for Sparse MeZO (increasing batch size from 16 to 64) and compared the performance between updating only small weights versus large weights. We find that updating small weights still achieves better performance than updating large weights even with larger batch sizes.

The results are shown in the table below, where '0-0.25' represents the bottom 25% and '0.75-1.0' represents the top 25%. We observe that updating smaller parameters (0-0.25) with batch size 64 achieves higher performance than updating larger parameters (0.75-1.0) on both RTE (83.7 vs 60.3) and BoolQ (83.3 vs 67.6). We will provide the complete results in the revision.

| Setting | Batch Size | 0-0.25 | 0.75-1.0 |
| --- | --- | --- | --- |
| Sparse MeZO + RTE | 16 | 82.7 | 57.3 |
| Sparse MeZO + BoolQ | 16 | 81.6 | 66.1 |
| Sparse MeZO + RTE | 32 | 83.0 | 59.6 |
| Sparse MeZO + BoolQ | 32 | 82.5 | 66.9 |
| Sparse MeZO + RTE | 64 | 83.7 | 60.3 |
| Sparse MeZO + BoolQ | 64 | 83.3 | 67.6 |

Question 2: Moreover, following the authors' argument, does this imply that even in backpropagation (noise-free), restricting updates to small-weight or small-gradient parameters would be preferable? If not, doesn’t this suggest that the observed benefit is simply a workaround for ZO noise rather than an intrinsic optimization principle?

Thank you very much for your constructive question. We investigated this question by conducting experiments with first-order methods (Adam) on LLaMA-7b/RTE, selectively updating only small-magnitude weights vs. large-magnitude weights. Results show that first-order methods achieve similar performance regardless of weight magnitude selection (83.6% vs 83.1%, only 0.5% gap), unlike zeroth-order methods where small weights significantly outperform large weights (82.7% vs 57.3%). This confirms that the disproportionate weight magnitude impact is specific to zeroth-order optimization due to gradient estimation noise, rather than a general fine-tuning phenomenon.
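A common way to run such a magnitude-restricted first-order baseline is to zero the gradients of the non-selected weights before each Adam step. The sketch below is illustrative only; the `small_weight_masks` dictionary and the training-loop names are hypothetical stand-ins, not the exact experimental code.

```python
import torch

def apply_grad_mask(model, masks):
    """Zero the gradients of weights that are not selected for updating.
    `masks` maps parameter name -> boolean tensor (True = weight is updated)."""
    for name, param in model.named_parameters():
        if param.grad is not None and name in masks:
            param.grad.mul_(masks[name].to(param.grad.dtype))

# Inside the training loop (Adam, as in the experiment described above):
# loss.backward()
# apply_grad_mask(model, small_weight_masks)  # or large_weight_masks for the opposite setting
# optimizer.step()
# optimizer.zero_grad()
```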

Comment

Thank you for providing such a thorough set of baseline comparisons. I’m updating my score to 4—best of luck!

Official Review
Rating: 4

The paper presents Sparse MeZO, an algorithm based on MeZO, built on the observation that optimizing small weights is more helpful than optimizing large weights for zeroth-order methods. The algorithm is a simple modification of MeZO: in each forward pass, it identifies the small weights in each layer and updates only those weights. The paper shows experiments that support the observation and improve on MeZO in fine-tuning tasks.

Strengths and Weaknesses

Strengths

  • The paper provides a new observation for zeroth-order methods: small weights have a bigger impact on training than large weights. This is very interesting and somewhat counter-intuitive. I think this should be studied more in the future.
  • The experimental results are quite strong against vanilla MeZO and support the hypothesis.

Weaknesses

  • The experiments could be more comprehensive. Some of the datasets are missing from some tables (for example, table 2 and others in the appendix). Please also add other baselines such as ZO-SGD to table 3 and where it’s relevant.
  • The algorithm introduces a sparsity parameter and it’s unclear how to set this parameter other than tuning
  • As acknowledged by the authors, there is still a significant gap between this method and gradient methods
  • The theorem in the appendix only applies for a fixed masking, not the masking by threshold used in the algorithm.

Questions

  • Please add the experiment results I mentioned in the appendix
  • For the completeness and self-containment of the paper, please also for each experiment provide a description of the set up (for example, section 4.3 is unclear)
  • Regarding the new observation: When training on top 20% of the largest weight vs the bottom 20%, how are the step sizes determined for each case and what are the impacts on the stability of the training and the performance of the model?
  • Is the disproportionate impact of the weights also observed when we train the model by first order method instead of zeroth order method?

Limitations

Adequately addressed

Justification for Final Rating

I maintain my score after the discussion.

Formatting Concerns

NA

Comment

Q3: Regarding the new observation: When training on top 20% of the largest weight vs the bottom 20%, how are the step sizes determined for each case and what are the impacts on the stability of the training and the performance of the model?

For step size determination, we conducted grid search over {1e−6, 2e−6, 3e−6, 4e−6, 5e−6} for both top 20% largest and bottom 20% smallest weights, selecting the best-performing learning rate for each configuration through validation performance. Both achieved optimal results at similar rates (2e−6 to 3e−6), ensuring fair comparison. Regarding training stability, small weights consistently converged across all tested learning rates with low variance, while large weights exhibited training instability and higher variance, particularly at rates ≥4e−6. The 15-20% performance gap persisted across different learning rates, confirming this is a fundamental property rather than hyperparameter sensitivity.

Q4: Is the disproportionate impact of the weights also observed when we train the model by first order method instead of zeroth order method?

We investigated this excellent question by conducting experiments with first-order methods (Adam) on LLaMA-7b/RTE, selectively updating only small-magnitude weights vs. large-magnitude weights. Results show that first-order methods achieve similar performance regardless of weight magnitude selection (83.1% vs 83.6%, only 0.5% gap), unlike zeroth-order methods where small weights significantly outperform large weights (82.7% vs 57.3%). This confirms that the disproportionate weight magnitude impact is specific to zeroth-order optimization due to gradient estimation noise, rather than a general fine-tuning phenomenon.

Comment

Thank you for your response. I have some follow up questions.

  1. Could you please check the numbers in your response to w3?
  2. I'm not sure I understand the response to w4. What do the authors mean by "the number of selected parameters continuously decreases" (why is it the case if the mask is percentile based)?
Comment

Thank you for your constructive feedback; we carefully address your concerns below.

w1: Some of the datasets are missing from some tables (for example, table 2 and others in the appendix). Please also add other baselines such as ZO-SGD to table 3 and where it’s relevant.

We thank the reviewer for this valuable feedback. We acknowledge the incomplete dataset coverage in some tables and missing baselines.

Missing datasets: Some results were omitted due to computational constraints, but we will complete these experiments for the revised manuscript.

Complete results of table 2:

| Model | Method | BoolQ | RTE | WIC | SST-2 | MultiRC | COPA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B | MeZO | 78.8 | 70.2 | 62.2 | 94.0 | 70.8 | 85.0 | 76.8 |
| LLaMA2-7B | MeZO-LoRA | 80.3 | 76.5 | 63.0 | 94.0 | 71.6 | 83.0 | 78.1 |
| LLaMA2-7B | ZO-SGD-Cons | 77.1 | 68.6 | 63.0 | 94.0 | 71.0 | 84.3 | 76.3 |
| LLaMA2-7B | ZO-SGD-Sign | 68.2 | 61.0 | 55.6 | 82.8 | 62.0 | 78.0 | 67.9 |
| LLaMA2-7B | ZO-SGD-Adam | 78.9 | 73.6 | 62.1 | 93.8 | 71.2 | 86.0 | 77.6 |
| LLaMA2-7B | S-MeZO | 82.2 | 77.6 | 65.3 | 94.8 | 74.7 | 86.0 | 80.1 |

Complete results of table 3:

| Method | BoolQ | PIQA | SIQA | AQuA |
| --- | --- | --- | --- | --- |
| MeZO | 76.6 | 84.3 | 68.5 | 24.0 |
| ZO-SGD-Cons | 76.9 | 83.0 | 68.3 | 25.8 |
| ZO-SGD-Adam | 77.8 | 84.7 | 68.1 | 25.0 |
| S-MeZO | 79.2 | 85.3 | 70.2 | 26.6 |

We will provide more comprehensive experimental results in the revision with consistent baseline coverage across all tables.

w2: The algorithm introduces a sparsity parameter and it’s unclear how to set this parameter other than tuning

We thank the reviewer for this valuable observation. We determine thresholds using a principled sparsity-based approach. Specifically, we use a percentile-based method where the threshold is set based on a target sparsity level. For example, with 80% sparsity, we sort the weight values of each layer and set the threshold at the 80th percentile. Importantly, this threshold is determined once before training begins and remains fixed throughout the optimization process.

Our empirical results in Figure 4 demonstrate that a sparsity level of 75% consistently yields strong performance across diverse tasks (RTE, BoolQ, and WIC), highlighting the generalizability of our threshold selection method. Furthermore, we observe robust performance gains compared to vanilla MeZO across a range of sparsity values (70-80%), indicating that our method is not sensitive to the exact threshold choice. This stability across different tasks eliminates the need for task-specific threshold tuning.
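A minimal sketch of this one-time, per-layer threshold computation is given below (PyTorch; `kthvalue` is used to avoid quantile size limits on large tensors). The exact convention that the kept fraction equals `1 - sparsity` and that only 2-D weight matrices are thresholded is an assumption for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def compute_thresholds(model, sparsity: float = 0.75):
    """One-time, per-layer magnitude thresholds computed before training.
    With sparsity=0.75, each layer's threshold is set so that only the smallest
    25% of weights (by |value|) fall at or below it and are later updated."""
    keep_ratio = 1.0 - sparsity
    thresholds = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                        # illustrative choice: only weight matrices
            continue
        mag = param.abs().flatten()
        k = max(1, int(keep_ratio * mag.numel()))
        thresholds[name] = mag.kthvalue(k).values  # k-th smallest magnitude
    return thresholds
```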

w3: As acknowledged by the authors, there is still a significant gap between this method and gradient methods

We acknowledge the performance gap between our zeroth-order method and first-order gradient methods, which is an inherent limitation of ZO optimization.

However, we observe this gap is narrowing as pre-trained models improve. For example, the average performance gap between S-MeZO and Full FT decreases from 2.6% on LLaMA-7B to 1.2% on LLaMA-30B. This suggests that the performance gap between zeroth-order and first-order methods can be narrowed with increasing model capability.

w4: The theorem in the appendix only applies for a fixed masking, not the masking by threshold used in the algorithm.

We thank the reviewer for this important clarification. The reviewer is correct that our theorem assumes a fixed mask while our algorithm uses dynamic threshold-based masking.

However, our theoretical analysis provides a meaningful upper bound. Although the mask is dynamic, the number of selected parameters $\hat{d}$ continuously decreases during training as parameters grow and exceed their thresholds. Since our convergence rate depends on $\hat{d}$, and $\hat{d}$ decreases monotonically in our approach, the fixed-mask analysis serves as an upper bound on the convergence rate.

This means our method may converge faster than the theoretical prediction. We will clarify this relationship in the revised manuscript to better explain how the theoretical analysis relates to our dynamic masking approach.
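For reference, the masked two-point (SPSA-style) estimator covered by this fixed-mask analysis can be written as follows, where $m$ is the sparse mask and $\hat{d} = \|m\|_0$ is the number of selected parameters; this follows the standard MeZO estimator restricted to the masked coordinates, with the exact constants and assumptions left to the appendix.

$$\hat{\nabla}\mathcal{L}(\theta) \;=\; \frac{\mathcal{L}(\theta + \epsilon\, m \odot z) \;-\; \mathcal{L}(\theta - \epsilon\, m \odot z)}{2\epsilon}\,(m \odot z), \qquad z \sim \mathcal{N}(0, I_d).$$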

Comment

We sincerely thank the reviewer Waaa for the further feedback. Our responses are as follows:

Q1: Could you please check the numbers in your response to w3?

We apologize for the confusion. We have updated our results in the previous response. Additionally, to further explain how the gap (Full FT - Sparse MeZO) narrows as pre-trained models improve (from LLaMA-7B to LLaMA-30B), we provide the results on LLaMA-7B and LLaMA-30B in the table:

| Model | Method | BoolQ | RTE | WIC | MultiRC | SST-2 | COPA | Average | Gap (Full FT - S-MeZO) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | MeZO | 75.9 | 71.7 | 61.4 | 69.8 | 94.6 | 86.3 | 76.6 | / |
| LLaMA-7B | S-MeZO | 80.9 | 80.7 | 64.9 | 73.3 | 95.0 | 86.7 | 80.3 | / |
| LLaMA-7B | Full FT | 84.5 | 83.6 | 68.4 | 80.2 | 95.7 | 85.0 | 82.9 | 2.6 |
| LLaMA-30B | MeZO | 83.8 | 76.9 | 63.3 | 71.9 | 95.0 | 86.0 | 79.5 | / |
| LLaMA-30B | S-MeZO | 85.7 | 82.1 | 67.3 | 78.3 | 95.3 | 87.0 | 82.6 | / |
| LLaMA-30B | Full FT | 86.2 | 84.1 | 70.2 | 81.3 | 96.0 | 85.0 | 83.8 | 1.2 |

From the table, we observe that the average performance gap (the last column) between S-MeZO and Full FT decreases from 2.6% on LLaMA-7B to 1.2% on LLaMA-30B. This suggests that the performance gap between zeroth-order and first-order methods can be narrowed with increasing model capability.

In addition, we find that increasing the batch size can further reduce the performance gap. For instance, when we increased the batch size from 16 to 32 (with seed 0), the performance gap decreased from 1.9 to 1.3 on the RTE and BoolQ tasks.

| Setting | Batch Size | RTE | BoolQ | Average | Gap |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B + Full FT | / | 83.6 | 84.5 | 84.1 | / |
| LLaMA-7B + Sparse MeZO | 16 | 82.7 | 81.6 | 82.2 | 1.9 |
| LLaMA-7B + Sparse MeZO | 32 | 83.0 | 82.5 | 82.8 | 1.3 |

Q2: I'm not sure I understand the response to w4. What do the authors mean by "the number of selected parameters continuously decreases" (why is it the case if the mask is percentile based)?

We apologize for the confusion. To clarify our setup: we determine the threshold for each layer once before training based on the target sparsity (e.g., smallest 20% of parameters), and this threshold remains fixed throughout training to avoid expensive recomputation at each iteration.

The number of selected parameters decreases because as training progresses, some parameters initially below the threshold will grow larger and become masked. Once masked, these parameters stop updating and cannot return below the threshold. Since the threshold is fixed, parameters can only transition from active to masked (not vice versa), leading to a monotonically decreasing active parameter count.
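As a toy illustration of this "active set only shrinks" argument (not the training code; the threshold and numbers are arbitrary): with a fixed threshold, a weight that grows past it is frozen and, receiving no further updates, can never re-enter the active set.

```python
import torch

tau = 0.10                                   # fixed magnitude threshold
w = torch.tensor([0.02, 0.08, 0.09])
for step in range(100):
    active = w.abs() <= tau                  # mask recomputed on the fly each step
    update = 0.01 * torch.randn_like(w)      # stand-in for the masked ZO update
    w = torch.where(active, w + update, w)   # frozen weights keep their value forever
```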

Comment

Thank you for your response. That clears up some confusions I had. Please consider adding these discussions to the paper. I maintain my inclination to accept the paper.

Comment

We sincerely thank reviewer Waaa for the insightful feedback. We will incorporate the suggested changes in our revision. Thank you again for your time and effort in reviewing our submission.

Official Review
Rating: 3

This paper introduces Sparse-MeZO, a memory-efficient zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Instead of updating all parameters, Sparse-MeZO selectively optimizes a subset by applying a sparse mask to target low-magnitude, noise-tolerant weights. Built upon the MeZO framework, this approach reduces gradient estimation noise, accelerates convergence, and lowers memory consumption. Experimental results show that Sparse-MeZO not only outperforms MeZO in convergence speed and final performance but also enables the fine-tuning of large models like LLaMA-30B using limited hardware, such as a single A100 GPU.

Strengths and Weaknesses

Strengths:

  • Incorporating sparsity and MeZO is an interesting direction for performance improvement.
  • The paper is structured logically, making it easy to follow the motivation and methodology.

Weaknesses:

  • The paper lacks discussion and comparison with some important ZO works on sparsity, like [1]. Moreover, since the method seems to work by producing gradient estimates with less noise, it would be helpful to compare with some variance-reduction ZO works, like [2] and [3].
  • There are no specific numbers regarding the computational overhead introduced by the threshold calculation, dynamic masking, and layer-by-layer perturbation in the forward pass.
  • The overall contribution is incremental as the work adds common magnitude-based sparse steps on top of the MeZO optimization algorithm.
  • It would be better to have some evaluation on more challenging tasks, like MMLU and MT-Bench, for example, using the settings in [4].

[1] Zeroth-Order Fine-Tuning of LLMs with Transferable Static Sparsity

[2] Zeroth-order stochastic variance reduction for nonconvex optimization

[3] Zeroth-order optimization with trajectory-informed derivative estimation

[4] Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning

Questions

  • For the setting of the threshold, set its value related to the sparsity, e.g., 80% sparsity -> 80th percentile. Are these only empirical findings? It's better to have some ablation studies on the selection of the threshold.
  • Some parts of the appendix are not referred to in the main paper and are without a textual description.
  • Is the threshold computed every iteration? As the learning rate for ZO is small, the update per iteration is also small. Therefore, can you compute it intermittently?

Limitations

Yes.

Justification for Final Rating

After reading the authors' response and the other reviewers' comments, I find that most of my concerns have been addressed, but I still think the novelty of this paper is not that impressive. I raised my score from 2 to 3.

Formatting Concerns

N/A

Comment

Thank you for your constructive and inspiring feedback; we carefully address your concerns below.

w1: The paper lacks discussion and comparison with some important ZO works on sparsity, like [1]. Moreover, since the method seems to work by producing gradient estimates with less noise, it would be helpful to compare with some variance-reduction ZO works, like [2] and [3].

We thank the reviewer for pointing out these important related works in ZO optimization with sparsity and variance reduction. We agree that comparisons with these methods would strengthen our evaluation. We have conducted experiments comparing our method with SensZOQ [1] and ZO-SVRG [2] on LLaMA2-7B. The results show that Sparse MeZO consistently outperforms these baselines:

| Method | SST-2 | RTE | BoolQ | WIC |
| --- | --- | --- | --- | --- |
| LLaMA2-7B + MeZO | 94.0 | 70.2 | 78.8 | 62.2 |
| LLaMA2-7B + SensZOQ | 94.3 | 76.2 | 83.1 | 63.7 |
| LLaMA2-7B + ZO-SVRG | 94.2 | 73.6 | 79.8 | 63.2 |
| LLaMA2-7B + Sparse MeZO | 94.8 | 77.6 | 82.2 | 65.3 |

In addition, although variance reduction methods such as ZO-SVRG can improve gradient estimation, they typically introduce additional memory overhead by storing multiple model weights for variance reduction, which conflicts with our goal of memory efficiency. Our approach achieves variance reduction through selective parameter optimization without extra memory cost. We are conducting experiments with [3] and will include these comparisons in the final revision.

w2: There are no specific numbers regarding the computational overhead introduced by the threshold calculation, dynamic masking, and layer-by-layer perturbation in the forward pass.

Thank you for your constructive comment about computational overhead. We have conducted detailed measurements to quantify the overhead, and the results are shown in the following table. Since the main overhead comes from threshold determination and mask calculation, we focus on these in the table:

| Measurement | SST-2 | RTE | BoolQ | WIC | MultiRC | Average |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA + MeZO-Threshold | 0.0139s | 0.0141s | 0.0142s | 0.0135s | 0.0143s | 0.0140s |
| LLaMA + MeZO-Mask (each step) | 0.0714s | 0.0673s | 0.0701s | 0.0717s | 0.0643s | 0.0690s |
| LLaMA + MeZO (each step) | 0.8588s | 2.3631s | 3.6336s | 1.0899s | 6.2621s | 2.842s |
| Overhead = t(Mask)/t(MeZO) | 0.08 | 0.025 | 0.019 | 0.06 | 0.009 | 0.039 |

(1) Threshold Calculation (MeZO-Threshold row in the table): The threshold determination process takes approximately 0.01 seconds for a LLaMA-7b model. Moreover, it is a one-time cost incurred before training begins, so this additional computation is negligible compared with the several hours of training in each experiment.

(2) Dynamic Masking Overhead (MeZO-Mask row in the table): We report the dynamic mask calculation time and the time of each training step in the table. From this table, we can see that the masking operation adds approximately 0.07s per forward pass, while the training step time in these experiments ranges from 0.8588s to 6.2621s. The relative overhead is therefore between 0.009 and 0.08, which is negligible.
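Overhead numbers of this kind can be reproduced with CUDA-synchronized wall-clock timing along the following lines. The sketch is illustrative: `build_masks` and `zo_training_step` are hypothetical helper names standing in for the mask construction and the full MeZO step, not functions from the released code.

```python
import time
import torch

def timed(fn, warmup: int = 3, iters: int = 20) -> float:
    """Average wall-clock time of fn() with CUDA synchronization."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# t_mask = timed(lambda: build_masks(model, thresholds))    # mask construction only
# t_step = timed(lambda: zo_training_step(model, batch))    # full MeZO step
# overhead = t_mask / t_step
```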

w3: The overall contribution is incremental as the work adds common magnitude-based sparse steps on top of the MeZO optimization algorithm.

We appreciate this feedback and would like to highlight several substantial contributions of our work beyond MeZO:

Technical Innovation: We identified and addressed a fundamental limitation in ZO optimization: gradient noise impacts large weights more severely than small weights. This insight led to our selective parameter updating approach.

Performance Gains: Our method achieves substantial improvements over MeZO. For example, (1) 9% absolute accuracy improvement on RTE, (2) 3.5x faster convergence, (3) Comparable performance to full fine-tuning while maintaining MeZO's memory efficiency.

Novel Implementation: We developed a memory-optimized implementation for sparse masking that maintains inference-level memory consumption, enabling fine-tuning of LLaMA-30b on a single A100 GPU.

Comment

w4: It would be better to have some evaluation on more challenging tasks, like MMLU and MT-Bench, for example, using the settings in [4].

We thank the reviewer for this suggestion to evaluate on more challenging tasks. To address concerns about performance on complex reasoning tasks, we have already conducted experiments beyond SuperGLUE.

In our submission (Section 4.3), we evaluated Sparse MeZO on challenging tasks including commonsense reasoning (PIQA, SIQA, BoolQ) and mathematics (AQuA), consistent with evaluation protocols used in SensZOQ. The results demonstrate that Sparse MeZO consistently outperforms MeZO across these challenging tasks, with notable improvements on AQuA mathematics (+2.6%) and SIQA commonsense reasoning (+1.7%).

| Method | BoolQ | PIQA | SIQA | AQuA | MMLU |
| --- | --- | --- | --- | --- | --- |
| Mistral-7B + MeZO | 76.6 | 84.3 | 68.5 | 24.0 | 59.2 |
| Mistral-7B + SensZOQ | 76.3 | 84.9 | 69.4 | 26.1 | 58.4 |
| Mistral-7B + Sparse MeZO | 79.2 | 85.3 | 70.2 | 26.6 | 59.6 |

Following the reviewer's suggestion, we have conducted additional experiments on MMLU using Mistral-7B. The results show consistent improvements: Sparse MeZO achieves a 0.4% improvement over MeZO and a 1.2% improvement over SensZOQ.

Q1: For the setting of the threshold, set its value related to the sparsity, e.g., 80% sparsity - 80th percentile. Are these only empirical findings? It's better to have some ablation studies on the selection of the threshold.

Thank you very much for this valuable suggestion.

Our focus on small-magnitude weights is based on their fundamental properties in pre-trained models and zeroth-order optimization: Large-magnitude weights typically store critical pre-trained information, making them sensitive to perturbations. Even small noise in gradient estimation can significantly disrupt these learned patterns and cause substantial performance drops.

For the ablation study, our experimental analysis examines how weights of different magnitudes affect performance. We divided weights into 4 groups by magnitude, where '0-0.25' represents the bottom 25% and '0.75-1.0' represents the top 25%. The results in the tables demonstrate that updating small-magnitude weights consistently achieves better performance than updating large-magnitude weights.

| Task | 0-0.25 | 0.25-0.5 | 0.5-0.75 | 0.75-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 73.6 | 66.1 | 57.3 |
| BoolQ | 81.6 | 77.1 | 67.7 | 66.1 |

Building on these findings, we performed additional experiments to analyze weight selection strategies. While updating small-magnitude weights (0-0.25) shows strong performance, we tested incrementally including more weight groups (e.g., expanding to 0-0.5, which includes both 0-0.25 and 0.25-0.5). The results show that adding weights beyond the small-magnitude group (0-0.25) actually reduces performance, confirming that focusing on small-magnitude weights is optimal.

| Task | 0-0.25 | 0-0.5 | 0-0.75 | 0-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 75.1 | 72.3 | 71.7 |
| BoolQ | 81.6 | 77.8 | 76.7 | 75.9 |

Q2: Some parts of the appendix are not referred to in the main paper and are without a textual description.

We thank the reviewer for this feedback. We acknowledge that some appendix sections lack proper references from the main paper and sufficient descriptions.

We will address this in the revised manuscript by: (1) adding appropriate citations to all appendix sections in the main text (e.g., referencing hyperparameters in Section 4.1, proofs in Section 3), (2) providing clear descriptions for each appendix section, and (3) ensuring all supplementary material directly supports the main paper's claims.

Q3: Is the threshold computed every iteration? As the learning rate for ZO is small, the update per iteration is also small. Therefore, can you compute it intermittently?

We thank the reviewer for this insightful question. To clarify our experimental setting: we determine the threshold for each layer once before training begins based on the target sparsity level, and this threshold remains fixed throughout the training process. Therefore, the threshold is not computed every iteration.

Our mask is indeed dynamic in the sense that it is computed on-the-fly during each forward pass by comparing current parameter values against these pre-determined thresholds. As the reviewer correctly notes, computing thresholds (which requires sorting weight values of each layer) is computationally expensive, which is why we avoid doing this repeatedly.
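To make the sparse perturbation and update concrete, here is a minimal sketch of a two-point Sparse MeZO step with seeded perturbations and masks derived from the fixed thresholds. It is an illustrative reconstruction, not the authors' implementation: `loss_fn` and `thresholds` are assumed helpers, and materializing full masks once per step is a simplification of the memory-optimized, layer-by-layer masking described above.

```python
import torch

@torch.no_grad()
def sparse_mezo_step(model, thresholds, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One two-point sparse-ZO step: perturb only weights whose |value| is at or
    below the fixed per-layer threshold, estimate the projected gradient from two
    forward passes, and update only those weights."""
    params = [(n, p) for n, p in model.named_parameters() if n in thresholds]
    masks = {n: (p.abs() <= thresholds[n]).to(p.dtype) for n, p in params}

    def perturb(scale):
        gen = torch.Generator(device=params[0][1].device).manual_seed(seed)
        for n, p in params:
            z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
            p.add_(scale * eps * masks[n] * z)

    perturb(+1); loss_plus = loss_fn(model, batch)    # loss at theta + eps * m * z
    perturb(-2); loss_minus = loss_fn(model, batch)   # loss at theta - eps * m * z
    perturb(+1)                                       # restore the original weights
    g = (loss_plus - loss_minus) / (2 * eps)          # scalar projected-gradient estimate

    gen = torch.Generator(device=params[0][1].device).manual_seed(seed)
    for n, p in params:                               # same z sequence as in perturb()
        z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
        p.sub_(lr * g * masks[n] * z)
```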

Comment

Thank you to the authors for responding to my questions and concerns. After reading the authors' response, I will raise my score a bit.

Comment

We sincerely thank reviewer EoWB for offering constructive feedback. We will incorporate the suggested changes in the revision. Thanks again for your time and effort in reviewing our submission.

Official Review
Rating: 4

The paper proposes to improve the convergence of zero-order optimizers by only applying the perturbation to smaller weights. The paper begins by presenting empirical observations on how the magnitude of weights in MeZO updates influences loss convergence. Based on this, the authors propose a practical solution: masking the largest weights during perturbation application. They demonstrate that this approach significantly enhances the convergence process and downstream performance in MeZO.

Strengths and Weaknesses

Strengths

  • s1: The paper is well-motivated and written, clearly balanced between motivation, methods, and empirical evaluation. The method is simple and does not require a complex procedure on top of existing zero-order optimization processes.
  • s2: The experimental results are compelling, showing an average improvement of 3.7 points on the SuperGLUE tasks over vanilla MeZO without impact on the memory consumption or convergence speed.

Weaknesses

  • w1: I am not fully convinced by the analysis presented in lines 147-159. As I understand it, the authors compute gradients on a set of 16 data points and then observe whether applying a gradient step in that direction, but starting from 16 entirely independent points, reduces the loss. The observation that there's roughly a 50% chance it won't be the case doesn't completely surprise me. It seems possible to me that a gradient computed on one set of 16 points aligns with a gradient computed on another, completely independent set of 16 points. I would like to know if similar results would hold for supervised fine-tuning (SFT). In particular, to include a comparison in Figure 2(b) with SFT.
  • w2: The results regarding the convergence rate in Section 4.4 and Figure 3 only compare convergence curves for MeZO and Sparse-MeZO, but it would be relevant to add the comparison with SFT here.

Questions

  • “Impressive results” line 1 seems like a superlative statement
  • Line 165, how are the percentages chosen to define small and large values? Why set the threshold to 20%?
  • Lines 222-225, I am not sure how this paragraph helps with the argumentation. According to the author, this method is not aligned with the goal of the paper and is not discussed anywhere else in the paper. I would suggest simply removing this paragraph for clarity.
  • It would be helpful to report the standard variations on top of the mean across the three runs for the results in at least Table 1. This could help capture the stability of the approach.

Limitations

The authors have discussed the limitations in the last section.

Justification for Final Rating

Overall, my assessment of the paper remains unchanged. I find the proposed approach relatively straightforward and likely easy to apply in practice. The paper is well-written and generally well-motivated, though I believe there is room for improvement in the experimental justification of the approach (lines 147–159).

Formatting Concerns

No paper formatting concerns.

Comment

Thank you for your constructive feedback; our responses to your concerns are below.

w1: It seems possible to me that a gradient computed on one set of 16 points aligns with a gradient computed on another, completely independent set of 16 points. I would like to know if similar results would hold for supervised fine-tuning (SFT). In particular, to include a comparison in Figure 2(b) with SFT.

Thank you for this insightful suggestion. We agree that a more rigorous comparison is needed. Our new analysis compares SGD and MeZO using the same batch. The results show MeZO has a higher probability of loss increment (0.4-0.6) compared to SGD (0.2-0.3). While both methods can increase loss on different batches, MeZO exhibits this behavior more frequently.

w2: The results regarding the convergence rate in Section 4.4 and Figure 3 only compare convergence curves for MeZO and Sparse-MeZO, but it would be relevant to add the comparison with SFT here.

We thank the reviewer for this suggestion. Adding SFT convergence comparison would indeed provide valuable context for understanding the convergence characteristics of different optimization approaches. In our experiments, SFT typically converges within 1,000 steps, while both MeZO and Sparse-MeZO require 10,000 steps due to the inherent nature of zeroth-order optimization, which relies on gradient estimation rather than exact gradients.

We will add SFT convergence curves to Figure 3 in the revision. This comparison will help readers better understand the trade-offs: while SFT achieves faster convergence, it requires significantly more memory (12× more than our method). Sparse-MeZO provides a middle ground: faster convergence than vanilla MeZO while maintaining the memory efficiency of zeroth-order methods.

Q1: “Impressive results” line 1 seems like a superlative statement

We agree that the phrase "impressive results" contains subjective language. We will revise this to use more objective language in the revision. Specifically, we will change the sentence to: "While fine-tuning large language models (LLMs) for specific tasks often yields strong performance..."

Q2: Line 165, how are the percentages chosen to define small and large values? Why set the threshold to 20%?

We thank the reviewer for this valuable observation. For the threshold, we use a percentile-based method in which the threshold is set according to a target sparsity level. For example, with 80% sparsity, we sort the weight values of each layer and set the threshold at the 80th percentile. Importantly, this threshold is determined once before training begins and remains fixed throughout the optimization process.

Our empirical results in Table 12 demonstrate that a sparsity level of 70-80% consistently yields strong performance across diverse tasks (RTE, BoolQ, and WIC), highlighting the generalizability of our threshold selection method.

Q3: Lines 222-225, I am not sure how this paragraph helps with the argumentation. According to the author, this method is not aligned with the goal of the paper and is not discussed anywhere else in the paper.

We thank the reviewer for this feedback and agree with the suggestion. The paragraph discussing 1-bit quantization introduces an alternative approach that we do not pursue or evaluate, which may indeed cause confusion for readers. We will remove this paragraph in the revision for better clarity.

Q4: It would be helpful to report the standard variations on top of the mean across the three runs for the results in at least Table 1. This could help capture the stability of the approach.

We thank the reviewer for this valuable suggestion. We agree that reporting standard deviations would better demonstrate the stability of our approach across multiple runs.

We will update Table 1 in the revision to include standard deviations for all methods across the three random seeds. For reference, we have already provided standard deviations in Table 10 (Appendix) for our main LLaMA-7b results, which show that Sparse-MeZO maintains stable performance with standard deviations typically ranging from ±0.3 to ±1.6 across different tasks.

Comment

I thank the authors for their responses and clarifications.

  • I maintain my reservations regarding weakness w1, particularly concerning the motivation, which I believe could still be refined. I encourage the authors to incorporate the new experimental results into the paper to strengthen this aspect.
  • Regarding w2, I appreciate the quantitative details provided, which offer meaningful points of comparison.

Thank you as well for addressing my question — I am satisfied with the response.

Comment

We sincerely thank reviewer i838 for the constructive feedback. We will incorporate the suggested changes in our revision. Thank you again for your time and effort in reviewing our submission.

Final Decision

The paper proposes Sparse MeZO, a memory-efficient zeroth-order (ZO) optimization approach for LLM fine-tuning. The core claim is that ZO gradient noise disproportionately harms large-magnitude weights, so selectively applying updates only to small-magnitude weights reduces noise sensitivity, accelerates convergence, and improves performance. The method also includes a memory-optimized masking implementation that enables fine-tuning models as large as LLaMA-30B on a single A100 GPU. Experimental results across SuperGLUE, commonsense reasoning, and math tasks show consistent gains over MeZO, with improvements of up to 9% accuracy and 3.5× faster convergence.

The strengths of the paper include a clear empirical observation motivating the method, a simple yet effective algorithm with negligible overhead, strong empirical results across models and tasks, and an impactful system contribution that lowers hardware barriers for LLM fine-tuning. Reviewers also noted the novelty of focusing on weight magnitudes and the practical significance of enabling large-scale fine-tuning under constrained resources.

Weaknesses raised include limited theoretical guarantees (the provided theorem applies only to fixed masks), incremental contributions relative to MeZO, incomplete baselines in the initial submission, and questions about sparsity threshold choice. Some reviewers also noted the gap between Sparse MeZO and first-order methods remains and suggested broader evaluation on challenging benchmarks and comparison with other sparse ZO methods such as SensZOQ.

The decision to accept is based on the strong empirical evidence, practical utility, and the clear motivation behind the method. While the contribution could be considered incremental, the simplicity, demonstrated effectiveness, and memory savings make it valuable for the community. The paper stands out because it addresses a critical bottleneck in ZO optimization with a principled and practical solution that enables new capabilities, such as fine-tuning 30B-parameter models on commodity hardware.

During the rebuttal, reviewers requested additional comparisons (e.g., with SensZOQ, ZO-SVRG, and ZO-SGD), clarity on sparsity selection, more ablations, reporting of variance, and evaluations on harder tasks. The authors addressed these comprehensively by running new experiments, providing ablations that confirmed the advantage of small-weight updates, clarifying the fixed-threshold scheme, reporting standard deviations, and expanding evaluations (including MMLU and comparisons with variance-reduction ZO methods). These responses resolved most concerns and improved reviewer confidence, with several reviewers raising their scores. Overall, the combination of methodological insight, practical system implementation, and strong rebuttal makes this a solid accept.