PaperHub
Overall: 6.0/10 · Poster · 4 reviewers
Ratings: 4, 4, 3, 4 (min 3, max 4, std. dev. 0.4)
Confidence: 3.8
Novelty: 2.5 · Quality: 2.5 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Sparse MeZO: Less Parameters for Better Performance in Zeroth-Order LLM Fine-Tuning

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Zeroth-Order Optimization, Parameter-Efficient Fine Tuning

Reviews and Discussion

Official Review
Rating: 4

This paper addresses the memory inefficiency of fine-tuning large language models. It identifies that while Zeroth-Order (ZO) optimizers like MeZO save memory by avoiding back-propagation, they introduce gradient estimation "noise" that harms performance, especially when applied to large-magnitude weights.

To solve this, the paper proposes Sparse MeZO (S-MeZO), a method that selectively applies updates only to the model's smaller-magnitude weights. This approach is more resilient to noise, which allows for larger learning rates, leading to faster convergence and better accuracy.

A key contribution is a memory-efficient implementation that calculates the necessary sparse mask on-the-fly during the forward pass, keeping memory usage at the same level as inference. This innovation enabled fine-tuning a 30-billion parameter model on a single A100 GPU.

Experiments conducted on LLaMA, OPT, and Mistral models show that S-MeZO consistently outperforms the original MeZO. For example, on the RTE task, it achieved a 9% absolute accuracy improvement and a 3.5x speedup.

Strengths and Weaknesses

Strengths:

  1. Significant Practical Impact and Accessibility: The paper's most significant contribution is making the fine-tuning of extremely large models more accessible. By developing a memory-efficient implementation that calculates sparse masks on-the-fly, S-MeZO successfully keeps memory usage at the level of inference. This innovation enabled the fine-tuning of a LLaMA-30b model on a single A100 GPU, a task that is typically infeasible with standard methods. This dramatically lowers the hardware barrier for LLM research and development.

  2. Novel and Insightful Core Idea: The paper is founded on a highly original empirical observation: the noise inherent in Zeroth-Order (ZO) gradient estimates is more detrimental to large-magnitude weights than to small-magnitude ones. This insight is novel and provides a clear, well-motivated basis for the proposed S-MeZO algorithm, which selectively optimizes these more noise-resilient small weights. This approach offers a new perspective on mitigating noise in ZO optimization.

  3. A notable strength of this paper, setting it apart from some other works focused on ZO optimization, is the inclusion of baselines that use back-propagation to calculate gradients.

Weaknesses:

  1. Since the parameter selection is based on the smallest weights, the benchmark could be strengthened by including a comparison with the opposite strategy: selecting and updating only the largest-magnitude weights. This would serve as a valuable ablation study to more directly validate the paper's core hypothesis.

  2. A potential area for strengthening the experimental comparison would be to include another recent and relevant method in sparse ZO optimization. For instance, comparing S-MeZO against SensZOQ would provide a more comprehensive picture of its performance within this specific sub-field.

Questions

Have the authors explored the upper limits of the sparsity ratio for this method? For instance, what is the highest sparsity tested, and does performance continue to improve or begin to degrade at very high levels of sparsity?

Limitations

The authors have acknowledged the primary performance-related limitation of their work, which is the remaining performance gap between S-MeZO and first-order fine-tuning methods. This transparency is commendable. However, the discussion of limitations could be expanded to make the paper even stronger.

Broader Societal Impact: While the work itself does not have a direct negative impact, it contributes to a line of research that makes powerful LLM fine-tuning more accessible and efficient. The authors could briefly acknowledge the dual-use nature of such technology; lowering the barrier to entry can empower more users for positive applications, but it can also potentially make it easier for malicious actors to create fine-tuned models for generating disinformation or other harmful content. A short discussion of this broader context is standard for LLM research and would demonstrate a more comprehensive consideration of the work's potential impact.

Justification for Final Rating

The authors have provided a thorough set of baseline comparisons.

Formatting Concerns

The manuscript contains spelling and wording mistakes that should be fixed before resubmission. The benchmark name SuperGLUE is repeatedly misspelled as “SuperGULE” or “SuperGULU.”

Comment

Thank you for your constructive comments; we carefully address your concerns below.

w1: Since the parameter selection is based on the smallest weights, the benchmark could be strengthened by including a comparison with the opposite strategy: selecting and updating only the largest-magnitude weights.

Thank you very much for this valuable suggestion.

Our focus on small-magnitude weights is based on their fundamental properties in pre-trained models and zeroth-order optimization.

Ablation Study: Our experimental analysis examines how weights of different magnitudes affect performance. We divided weights into 4 groups by magnitude, where '0-0.25' represents the bottom 25% and '0.75-1.0' represents the top 25%. The results in the tables demonstrate that updating small-magnitude weights consistently achieves better performance than updating large-magnitude weights.

| Task | 0-0.25 | 0.25-0.5 | 0.5-0.75 | 0.75-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 73.6 | 66.1 | 57.3 |
| BoolQ | 81.6 | 77.1 | 67.7 | 66.1 |

Building on these findings, we performed additional experiments to analyze weight selection strategies. While updating small-magnitude weights (0-0.25) shows strong performance, we tested incrementally including more weight groups (e.g., expanding to 0-0.5, which includes both 0-0.25 and 0.25-0.5). The results show that adding weights beyond the small-magnitude group (0-0.25) actually reduces performance, confirming that focusing on small-magnitude weights is optimal.

| Task | 0-0.25 | 0-0.5 | 0-0.75 | 0-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 75.1 | 72.3 | 71.7 |
| BoolQ | 81.6 | 77.8 | 76.7 | 75.9 |
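For concreteness, the quartile-style grouping above can be reproduced with a per-layer quantile mask along the following lines. This is a minimal illustrative sketch in PyTorch; the helper name and the use of `torch.quantile` are assumptions for illustration, not the released code.

```python
import torch

def magnitude_group_mask(weight: torch.Tensor, lo: float, hi: float) -> torch.Tensor:
    """Boolean mask selecting weights whose |value| lies in the (lo, hi] quantile
    band of this layer, e.g. lo=0.0, hi=0.25 keeps the smallest 25% by magnitude."""
    mag = weight.abs().flatten().float()
    lower = torch.quantile(mag, lo) if lo > 0 else mag.min() - 1.0
    upper = torch.quantile(mag, hi)
    return (weight.abs() > lower) & (weight.abs() <= upper)

# Example: the '0-0.25' and '0.75-1.0' groups of a single weight matrix
w = torch.randn(1024, 1024)
mask_small = magnitude_group_mask(w, 0.00, 0.25)  # bottom 25% by magnitude
mask_large = magnitude_group_mask(w, 0.75, 1.00)  # top 25% by magnitude
```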

w2: A potential area for strengthening the experimental comparison would be to include another recent and relevant method in sparse ZO optimization. For instance, comparing S-MeZO against SensZOQ would provide a more comprehensive picture of its performance within this specific sub-field.

We thank the reviewer for this excellent suggestion. SensZOQ is indeed a highly relevant sparse zeroth-order fine-tuning method; however, we did not compare against it in our original submission because it was published on arXiv several months after our work, so we believe its absence does not diminish the novelty and contribution of our paper. We fully agree, however, that for completeness we should include a comparison and discussion in our paper.

Before presenting our comparison, we note key methodological differences: (1) SensZOQ requires additional computational cost to calculate first-order gradients for obtaining sparse masks, while our method dynamically generates masks without gradient computation; (2) SensZOQ targets extreme sparsity combined with quantization for edge devices, whereas we focus on balancing sparsity with performance for general memory-efficient training. Our experimental comparison on challenging tasks shows:

| Method | SST-2 | RTE | BoolQ | WIC | MMLU |
| --- | --- | --- | --- | --- | --- |
| LLaMA2-7B + MeZO | 94.0 | 70.2 | 78.8 | 62.2 | 59.2 |
| LLaMA2-7B + SensZOQ | 94.3 | 76.2 | 83.1 | 63.7 | 58.4 |
| LLaMA2-7B + Sparse MeZO | 94.8 | 77.6 | 82.2 | 65.3 | 59.6 |

L1: Addressing the Need for Broader Discussion of Societal Impact and Dual-Use Considerations.

We thank the reviewer for this thoughtful suggestion about expanding our discussion of broader societal impacts. We agree that acknowledging the dual-use nature of making LLM fine-tuning more accessible is important and standard practice in LLM research.

We will expand the limitations section in the revised manuscript to include a discussion of broader societal implications. Specifically, we will acknowledge that while our method democratizes access to LLM fine-tuning through reduced computational requirements—enabling beneficial applications for researchers and practitioners with limited resources—it also potentially lowers barriers for malicious actors who might use fine-tuned models for generating disinformation or other harmful content.

We will also briefly discuss potential mitigation strategies, such as the importance of responsible deployment practices, monitoring fine-tuned model outputs, and the continued development of detection methods for AI-generated harmful content.

C1: The manuscript contains spelling and wording mistakes that should be fixed before resubmission.

We sincerely apologize for the spelling errors and will conduct thorough proofreading to correct all spelling mistakes and grammatical errors before resubmission.

Comment

[W1]: Thank you for your detailed response. It has addressed my question thoroughly.

[W2]: The comparison with SensZOQ is very helpful. May I kindly ask under which density or sparsity ratio these experiments were conducted? In any case, I truly appreciate the authors’ efforts—the results appear solid.

Question :Zeroth-order optimization relies on the expectation that its gradient estimator converges to the true backpropagation gradient as the number of samples increases. If we increase the number of ZO samples to reduce noise, would updating small weights still outperform updating large weights?

Moreover, following the authors' argument, does this imply that even in backpropagation (noise-free), restricting updates to small-weight or small-gradient parameters would be preferable? If not, doesn’t this suggest that the observed benefit is simply a workaround for ZO noise rather than an intrinsic optimization principle?

Comment

We sincerely thank the reviewer sJB2 for the further feedback. Our responses are as follows:

W2: The comparison with SensZOQ is very helpful. May I kindly ask under which density or sparsity ratio these experiments were conducted? In any case, I truly appreciate the authors’ efforts—the results appear solid.

Thank you very much for your follow-up question about W2. Since the SensZOQ code was not publicly available at the time of our experiments, we followed the settings described in the SensZOQ paper and reproduced its performance. We conducted a grid search over parameter selection ratios, testing with {0.1%, 1%, 10%} of parameters selected for updates. We found that updating only 1% of parameters achieved the best performance on RTE and BoolQ, so we used 1% parameter density (i.e., selecting 1% of parameters for optimization) for all experiments in W2. The experimental results for determining the density are shown in the table:

| Method | Density | RTE | BoolQ |
| --- | --- | --- | --- |
| LLaMA2-7B + SensZOQ | 0.1% | 75.1 | 80.8 |
| LLaMA2-7B + SensZOQ | 1% | 76.2 | 83.1 |
| LLaMA2-7B + SensZOQ | 10% | 71.1 | 79.3 |

Question 1: Zeroth-order optimization relies on the expectation that its gradient estimator converges to the true backpropagation gradient as the number of samples increases. If we increase the number of ZO samples to reduce noise, would updating small weights still outperform updating large weights?

Thank you very much for your constructive comment. We conducted experiments with larger batch sizes for Sparse MeZO (increasing batch size from 16 to 64) and compared the performance between updating only small weights versus large weights. We find that updating small weights still achieves better performance than updating large weights even with larger batch sizes.

The results are shown in the table below, where '0-0.25' represents the bottom 25% and '0.75-1.0' represents the top 25%. We observe that updating smaller parameters (0-0.25) with batch size 64 achieves higher performance than updating larger parameters (0.75-1.0) on both RTE (83.7 vs 60.3) and BoolQ (83.3 vs 67.6). We will provide the complete results in the revision.

| Setting | Batch Size | 0-0.25 | 0.75-1.0 |
| --- | --- | --- | --- |
| Sparse MeZO + RTE | 16 | 82.7 | 57.3 |
| Sparse MeZO + BoolQ | 16 | 81.6 | 66.1 |
| Sparse MeZO + RTE | 32 | 83.0 | 59.6 |
| Sparse MeZO + BoolQ | 32 | 82.5 | 66.9 |
| Sparse MeZO + RTE | 64 | 83.7 | 60.3 |
| Sparse MeZO + BoolQ | 64 | 83.3 | 67.6 |

Question 2: Moreover, following the authors' argument, does this imply that even in backpropagation (noise-free), restricting updates to small-weight or small-gradient parameters would be preferable? If not, doesn’t this suggest that the observed benefit is simply a workaround for ZO noise rather than an intrinsic optimization principle?

Thank you very much for your constructive question. We investigated this question by conducting experiments with first-order methods (Adam) on LLaMA-7b/RTE, selectively updating only small-magnitude weights vs. large-magnitude weights. Results show that first-order methods achieve similar performance regardless of weight magnitude selection (83.6% vs 83.1%, only 0.5% gap), unlike zeroth-order methods where small weights significantly outperform large weights (82.7% vs 57.3%). This confirms that the disproportionate weight magnitude impact is specific to zeroth-order optimization due to gradient estimation noise, rather than a general fine-tuning phenomenon.
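A common way to run such a magnitude-restricted first-order baseline is to zero the gradients of the non-selected weights before each Adam step. The sketch below is illustrative only; the `small_weight_masks` dictionary and the training-loop names are hypothetical stand-ins, not the exact experimental code.

```python
import torch

def apply_grad_mask(model, masks):
    """Zero the gradients of weights that are not selected for updating.
    `masks` maps parameter name -> boolean tensor (True = weight is updated)."""
    for name, param in model.named_parameters():
        if param.grad is not None and name in masks:
            param.grad.mul_(masks[name].to(param.grad.dtype))

# Inside the training loop (Adam, as in the experiment described above):
# loss.backward()
# apply_grad_mask(model, small_weight_masks)  # or large_weight_masks for the opposite setting
# optimizer.step()
# optimizer.zero_grad()
```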

Comment

Thank you for providing such a thorough set of baseline comparisons. I’m updating my score to 4—best of luck!

Official Review
Rating: 4

The paper presents Sparse MeZO, an algorithm based on MeZO, built on the observation that optimizing small weights is more helpful than optimizing large weights for zeroth-order methods. The algorithm is a simple modification of MeZO: in each forward pass, it identifies the small weights in each layer and updates only those weights. The paper shows experiments that support the observation and improve on MeZO in fine-tuning tasks.

Strengths and Weaknesses

Strengths

  • The paper provides a new observation for zeroth-order methods: small weights have a bigger impact on training than large weights. This is very interesting and somewhat counter-intuitive. I think this should be studied more in the future.
  • The experimental results are quite strong against vanilla MeZO and support the hypothesis.

Weaknesses

  • The experiments could be more comprehensive. Some of the datasets are missing from some tables (for example, table 2 and others in the appendix). Please also add other baselines such as ZO-SGD to table 3 and where it’s relevant.
  • The algorithm introduces a sparsity parameter and it’s unclear how to set this parameter other than tuning
  • As acknowledged by the authors, there is still a significant gap between this method and gradient methods
  • The theorem in the appendix only applies for a fixed masking, not the masking by threshold used in the algorithm.

Questions

  • Please add the experiment results I mentioned in the appendix
  • For the completeness and self-containment of the paper, please also for each experiment provide a description of the set up (for example, section 4.3 is unclear)
  • Regarding the new observation: When training on top 20% of the largest weight vs the bottom 20%, how are the step sizes determined for each case and what are the impacts on the stability of the training and the performance of the model?
  • Is the disproportionate impact of the weights also observed when we train the model by first order method instead of zeroth order method?

Limitations

Adequately addressed

Justification for Final Rating

I maintain my score after the discussion.

Formatting Concerns

NA

Comment

Q3: Regarding the new observation: When training on top 20% of the largest weight vs the bottom 20%, how are the step sizes determined for each case and what are the impacts on the stability of the training and the performance of the model?

For step size determination, we conducted grid search over {1e−6, 2e−6, 3e−6, 4e−6, 5e−6} for both top 20% largest and bottom 20% smallest weights, selecting the best-performing learning rate for each configuration through validation performance. Both achieved optimal results at similar rates (2e−6 to 3e−6), ensuring fair comparison. Regarding training stability, small weights consistently converged across all tested learning rates with low variance, while large weights exhibited training instability and higher variance, particularly at rates ≥4e−6. The 15-20% performance gap persisted across different learning rates, confirming this is a fundamental property rather than hyperparameter sensitivity.

Q4: Is the disproportionate impact of the weights also observed when we train the model by first order method instead of zeroth order method?

We investigated this excellent question by conducting experiments with first-order methods (Adam) on LLaMA-7b/RTE, selectively updating only small-magnitude weights vs. large-magnitude weights. Results show that first-order methods achieve similar performance regardless of weight magnitude selection (83.1% vs 83.6%, only 0.5% gap), unlike zeroth-order methods where small weights significantly outperform large weights (82.7% vs 57.3%). This confirms that the disproportionate weight magnitude impact is specific to zeroth-order optimization due to gradient estimation noise, rather than a general fine-tuning phenomenon.

Comment

Thank you for your response. I have some follow up questions.

  1. Could you please check the numbers in your response to w3?
  2. I'm not sure I understand the response to w4. What do the authors mean by "the number of selected parameters continuously decreases" (why is it the case if the mask is percentile based)?
Comment

Thank you for your constructive feedback; we carefully address your concerns below.

w1: Some of the datasets are missing from some tables (for example, table 2 and others in the appendix). Please also add other baselines such as ZO-SGD to table 3 and where it’s relevant.

We thank the reviewer for this valuable feedback. We acknowledge the incomplete dataset coverage in some tables and missing baselines.

Missing datasets: Some results were omitted due to computational constraints, but we will complete these experiments for the revised manuscript.

Complete results of table 2:

| Model | Method | BoolQ | RTE | WIC | SST-2 | MultiRC | COPA | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA2-7B | MeZO | 78.8 | 70.2 | 62.2 | 94.0 | 70.8 | 85.0 | 76.8 |
| LLaMA2-7B | MeZO-LoRA | 80.3 | 76.5 | 63.0 | 94.0 | 71.6 | 83.0 | 78.1 |
| LLaMA2-7B | ZO-SGD-Cons | 77.1 | 68.6 | 63.0 | 94.0 | 71.0 | 84.3 | 76.3 |
| LLaMA2-7B | ZO-SGD-Sign | 68.2 | 61.0 | 55.6 | 82.8 | 62.0 | 78.0 | 67.9 |
| LLaMA2-7B | ZO-SGD-Adam | 78.9 | 73.6 | 62.1 | 93.8 | 71.2 | 86.0 | 77.6 |
| LLaMA2-7B | S-MeZO | 82.2 | 77.6 | 65.3 | 94.8 | 74.7 | 86.0 | 80.1 |

Complete results of table 3:

| Method | BoolQ | PIQA | SIQA | AQuA |
| --- | --- | --- | --- | --- |
| MeZO | 76.6 | 84.3 | 68.5 | 24.0 |
| ZO-SGD-Cons | 76.9 | 83.0 | 68.3 | 25.8 |
| ZO-SGD-Adam | 77.8 | 84.7 | 68.1 | 25.0 |
| S-MeZO | 79.2 | 85.3 | 70.2 | 26.6 |

We will provide more comprehensive experimental results in the revision with consistent baseline coverage across all tables.

w2: The algorithm introduces a sparsity parameter and it’s unclear how to set this parameter other than tuning

We thank the reviewer for this valuable observation. We determine thresholds using a principled sparsity-based approach. Specifically, we use a percentile-based method where the threshold is set based on a target sparsity level. For example, with 80% sparsity, we sort the weight values of each layer and set the threshold at the 80th percentile. Importantly, this threshold is determined once before training begins and remains fixed throughout the optimization process.

Our empirical results in Figure 4 demonstrate that a sparsity level of 75% consistently yields strong performance across diverse tasks (RTE, BoolQ, and WIC), highlighting the generalizability of our threshold selection method. Furthermore, we observe robust performance gains compared to vanilla MeZO across a range of sparsity values (70-80%), indicating that our method is not sensitive to the exact threshold choice. This stability across different tasks eliminates the need for task-specific threshold tuning.
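A minimal sketch of this one-time, per-layer threshold computation is given below (PyTorch; `kthvalue` is used to avoid quantile size limits on large tensors). The exact convention that the kept fraction equals `1 - sparsity` and that only 2-D weight matrices are thresholded is an assumption for illustration, not the released implementation.

```python
import torch

@torch.no_grad()
def compute_thresholds(model, sparsity: float = 0.75):
    """One-time, per-layer magnitude thresholds computed before training.
    With sparsity=0.75, each layer's threshold is set so that only the smallest
    25% of weights (by |value|) fall at or below it and are later updated."""
    keep_ratio = 1.0 - sparsity
    thresholds = {}
    for name, param in model.named_parameters():
        if param.dim() < 2:                        # illustrative choice: only weight matrices
            continue
        mag = param.abs().flatten()
        k = max(1, int(keep_ratio * mag.numel()))
        thresholds[name] = mag.kthvalue(k).values  # k-th smallest magnitude
    return thresholds
```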

w3: As acknowledged by the authors, there is still a significant gap between this method and gradient methods

We acknowledge the performance gap between our zeroth-order method and first-order gradient methods, which is an inherent limitation of ZO optimization.

However, we observe this gap is narrowing as pre-trained models improve. For example, the average performance gap between S-MeZO and Full FT decreases from 2.6% on LLaMA-7B to 1.2% on LLaMA-30B. This suggests that the performance gap between zeroth-order and first-order methods can be narrowed with increasing model capability.

w4: The theorem in the appendix only applies for a fixed masking, not the masking by threshold used in the algorithm.

We thank the reviewer for this important clarification. The reviewer is correct that our theorem assumes a fixed mask while our algorithm uses dynamic threshold-based masking.

However, our theoretical analysis provides a meaningful upper bound. Although the mask is dynamic, the number of selected parameters $\hat{d}$ continuously decreases during training as parameters grow and exceed their thresholds. Since our convergence rate depends on $\hat{d}$, and $\hat{d}$ decreases monotonically in our approach, the fixed-mask analysis serves as an upper bound on the convergence rate.

This means our method may converge faster than the theoretical prediction. We will clarify this relationship in the revised manuscript to better explain how the theoretical analysis relates to our dynamic masking approach.
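For reference, the masked two-point (SPSA-style) estimator covered by this fixed-mask analysis can be written as follows, where $m$ is the sparse mask and $\hat{d} = \|m\|_0$ is the number of selected parameters; this follows the standard MeZO estimator restricted to the masked coordinates, with the exact constants and assumptions left to the appendix.

$$\hat{\nabla}\mathcal{L}(\theta) \;=\; \frac{\mathcal{L}(\theta + \epsilon\, m \odot z) \;-\; \mathcal{L}(\theta - \epsilon\, m \odot z)}{2\epsilon}\,(m \odot z), \qquad z \sim \mathcal{N}(0, I_d).$$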

Comment

We sincerely thank the reviewer Waaa for the further feedback. Our responses are as follows:

Q1: Could you please check the numbers in your response to w3?

We apologize for the confusion. We have updated our results in the previous response. Additionally, to further explain how the gap (Full FT - Sparse MeZO) narrows as pre-trained models improve (from LLaMA-7B to LLaMA-30B), we provide the results on LLaMA-7B and LLaMA-30B in the table:

| Model | Method | BoolQ | RTE | WIC | MultiRC | SST-2 | COPA | Average | Gap (Full FT - S-MeZO) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LLaMA-7B | MeZO | 75.9 | 71.7 | 61.4 | 69.8 | 94.6 | 86.3 | 76.6 | / |
| LLaMA-7B | S-MeZO | 80.9 | 80.7 | 64.9 | 73.3 | 95.0 | 86.7 | 80.3 | / |
| LLaMA-7B | Full FT | 84.5 | 83.6 | 68.4 | 80.2 | 95.7 | 85.0 | 82.9 | 2.6 |
| LLaMA-30B | MeZO | 83.8 | 76.9 | 63.3 | 71.9 | 95.0 | 86.0 | 79.5 | / |
| LLaMA-30B | S-MeZO | 85.7 | 82.1 | 67.3 | 78.3 | 95.3 | 87.0 | 82.6 | / |
| LLaMA-30B | Full FT | 86.2 | 84.1 | 70.2 | 81.3 | 96.0 | 85.0 | 83.8 | 1.2 |

From the table, we observe that the average performance gap (the last column) between S-MeZO and Full FT decreases from 2.6% on LLaMA-7B to 1.2% on LLaMA-30B. This suggests that the performance gap between zeroth-order and first-order methods can be narrowed with increasing model capability.

In addition, we find that increasing the batch size can further reduce the performance gap. For instance, when we increased the batch size from 16 to 32 (with seed 0), the performance gap decreased from 1.9 to 1.3 on the RTE and BoolQ tasks.

| Setting | Batch Size | RTE | BoolQ | Average | Gap |
| --- | --- | --- | --- | --- | --- |
| LLaMA-7B + Full FT | / | 83.6 | 84.5 | 84.1 | / |
| LLaMA-7B + Sparse MeZO | 16 | 82.7 | 81.6 | 82.2 | 1.9 |
| LLaMA-7B + Sparse MeZO | 32 | 83.0 | 82.5 | 82.8 | 1.3 |

Q2: I'm not sure I understand the response to w4. What do the authors mean by "the number of selected parameters continuously decreases" (why is it the case if the mask is percentile based)?

We apologize for the confusion. To clarify our setup: we determine the threshold for each layer once before training based on the target sparsity (e.g., smallest 20% of parameters), and this threshold remains fixed throughout training to avoid expensive recomputation at each iteration.

The number of selected parameters decreases because as training progresses, some parameters initially below the threshold will grow larger and become masked. Once masked, these parameters stop updating and cannot return below the threshold. Since the threshold is fixed, parameters can only transition from active to masked (not vice versa), leading to a monotonically decreasing active parameter count.
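As a toy illustration of this "active set only shrinks" argument (not the training code; the threshold and numbers are arbitrary): with a fixed threshold, a weight that grows past it is frozen and, receiving no further updates, can never re-enter the active set.

```python
import torch

tau = 0.10                                   # fixed magnitude threshold
w = torch.tensor([0.02, 0.08, 0.09])
for step in range(100):
    active = w.abs() <= tau                  # mask recomputed on the fly each step
    update = 0.01 * torch.randn_like(w)      # stand-in for the masked ZO update
    w = torch.where(active, w + update, w)   # frozen weights keep their value forever
```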

Comment

Thank you for your response. That clears up some confusions I had. Please consider adding these discussions to the paper. I maintain my inclination to accept the paper.

Comment

We sincerely thank reviewer Waaa for the insightful feedback. We will incorporate the suggested changes in our revision. Thank you again for your time and effort in reviewing our submission.

Official Review
Rating: 3

This paper introduces Sparse-MeZO, a memory-efficient zeroth-order (ZO) optimization method for fine-tuning large language models (LLMs). Instead of updating all parameters, Sparse-MeZO selectively optimizes a subset by applying a sparse mask to target low-magnitude, noise-tolerant weights. Built upon the MeZO framework, this approach reduces gradient estimation noise, accelerates convergence, and lowers memory consumption. Experimental results show that Sparse-MeZO not only outperforms MeZO in convergence speed and final performance but also enables the fine-tuning of large models like LLaMA-30B using limited hardware, such as a single A100 GPU.

Strengths and Weaknesses

Strengths:

  • Incorporating sparsity and MeZO is an interesting direction for performance improvement.
  • The paper is structured logically, making it easy to follow the motivation and methodology.

Weaknesses:

  • The paper lacks discussion and comparison with some important ZO works on sparsity, like [1]. Moreover, since the method seems to work by producing gradient estimates with less noise, it would be helpful to compare with some variance-reduction ZO works, like [2] and [3].
  • There are no specific numbers regarding the computational overhead introduced by the threshold calculation, dynamic masking, and layer-by-layer perturbation in the forward pass.
  • The overall contribution is incremental as the work adds common magnitude-based sparse steps on top of the MeZO optimization algorithm.
  • It would be better to have some evaluation on more challenging tasks, like MMLU and MT-Bench, for example, using the settings in [4].

[1] Zeroth-Order Fine-Tuning of LLMs with Transferable Static Sparsity

[2] Zeroth-order stochastic variance reduction for nonconvex optimization

[3] Zeroth-order optimization with trajectory-informed derivative estimation

[4] Lisa: Layerwise importance sampling for memory-efficient large language model fine-tuning

Questions

  • For the setting of the threshold, set its value related to the sparsity, e.g., 80% sparsity -> 80th percentile. Are these only empirical findings? It's better to have some ablation studies on the selection of the threshold.
  • Some parts of the appendix are not referred to in the main paper and are without a textual description.
  • Is the threshold computed every iteration? As the learning rate for ZO is small, the update per iteration is also small. Therefore, can you compute it intermittently?

Limitations

Yes.

Justification for Final Rating

After reading the authors' response and the other reviewers' comments, I find that most of my concerns have been addressed, but I still think the novelty of this paper is not that impressive. I raised my score from 2 to 3.

Formatting Concerns

N/A

Comment

Thank you for your constructive and inspiring feedback; we carefully address your concerns below.

w1: The paper lacks discussion and comparison with some important ZO works on sparsity, like [1]. Moreover, since the method seems to work by producing gradient estimates with less noise, it would be helpful to compare with some variance-reduction ZO works, like [2] and [3].

We thank the reviewer for pointing out these important related works in ZO optimization with sparsity and variance reduction. We agree that comparisons with these methods would strengthen our evaluation. We have conducted experiments comparing our method with SensZOQ [1] and ZO-SVRG [2] on LLaMA2-7B. The results show that Sparse MeZO consistently outperforms these baselines:

| Method | SST-2 | RTE | BoolQ | WIC |
| --- | --- | --- | --- | --- |
| LLaMA2-7B + MeZO | 94.0 | 70.2 | 78.8 | 62.2 |
| LLaMA2-7B + SensZOQ | 94.3 | 76.2 | 83.1 | 63.7 |
| LLaMA2-7B + ZO-SVRG | 94.2 | 73.6 | 79.8 | 63.2 |
| LLaMA2-7B + Sparse MeZO | 94.8 | 77.6 | 82.2 | 65.3 |

In addition, although variance reduction methods such as ZO-SVRG can improve gradient estimation, they typically introduce additional memory overhead by storing multiple model weights for variance reduction, which conflicts with our goal of memory efficiency. Our approach achieves variance reduction through selective parameter optimization without extra memory cost. We are conducting experiments with [3] and will include these comparisons in the final revision.

w2: There are no specific numbers regarding the computational overhead introduced by the threshold calculation, dynamic masking, and layer-by-layer perturbation in the forward pass.

Thank you for your constructive comment about computational overhead. We have conducted detailed measurements to quantify the overhead, and the results are shown in the following table. Since the main overhead comes from threshold determination and mask calculation, we focus on these in the table:

| Measurement | SST-2 | RTE | BoolQ | WIC | MultiRC | Average |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA + MeZO-Threshold | 0.0139s | 0.0141s | 0.0142s | 0.0135s | 0.0143s | 0.0140s |
| LLaMA + MeZO-Mask (each step) | 0.0714s | 0.0673s | 0.0701s | 0.0717s | 0.0643s | 0.0690s |
| LLaMA + MeZO (each step) | 0.8588s | 2.3631s | 3.6336s | 1.0899s | 6.2621s | 2.842s |
| Overhead = t(Mask)/t(MeZO) | 0.08 | 0.025 | 0.019 | 0.06 | 0.009 | 0.039 |

(1) Threshold Calculation (MeZO-Threshold row in the table): The threshold determination process takes approximately 0.01 seconds for a LLaMA-7b model. Moreover, it is a one-time cost incurred before training begins, so this additional computation is negligible compared with the several hours of training in each experiment.

(2) Dynamic Masking Overhead (MeZO-Mask row in the table): We report the dynamic mask calculation time and the time of each training step in the table. From this table, we can see that the masking operation adds approximately 0.07s per forward pass, while the training step time in these experiments ranges from 0.8588s to 6.2621s. The relative overhead is therefore between 0.009 and 0.08, which is negligible.
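Overhead numbers of this kind can be reproduced with CUDA-synchronized wall-clock timing along the following lines. The sketch is illustrative: `build_masks` and `zo_training_step` are hypothetical helper names standing in for the mask construction and the full MeZO step, not functions from the released code.

```python
import time
import torch

def timed(fn, warmup: int = 3, iters: int = 20) -> float:
    """Average wall-clock time of fn() with CUDA synchronization."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# t_mask = timed(lambda: build_masks(model, thresholds))    # mask construction only
# t_step = timed(lambda: zo_training_step(model, batch))    # full MeZO step
# overhead = t_mask / t_step
```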

w3: The overall contribution is incremental as the work adds common magnitude-based sparse steps on top of the MeZO optimization algorithm.

We appreciate this feedback and would like to highlight several substantial contributions of our work beyond MeZO:

Technical Innovation: We identified and addressed a fundamental limitation in ZO optimization: gradient noise impacts large weights more severely than small weights. This insight led to our selective parameter updating approach.

Performance Gains: Our method achieves substantial improvements over MeZO. For example, (1) 9% absolute accuracy improvement on RTE, (2) 3.5x faster convergence, (3) Comparable performance to full fine-tuning while maintaining MeZO's memory efficiency.

Novel Implementation: We developed a memory-optimized implementation for sparse masking that maintains inference-level memory consumption, enabling fine-tuning of LLaMA-30b on a single A100 GPU.

Comment

w4: It would be better to have some evaluation on more challenging tasks, like MMLU and MT-Bench, for example, using the settings in [4].

We thank the reviewer for this suggestion to evaluate on more challenging tasks. To address concerns about performance on complex reasoning tasks, we have already conducted experiments beyond SuperGLUE.

In our submission (Section 4.3), we evaluated Sparse MeZO on challenging tasks including commonsense reasoning (PIQA, SIQA, BoolQ) and mathematics (AQuA), consistent with evaluation protocols used in SensZOQ. The results demonstrate that Sparse MeZO consistently outperforms MeZO across these challenging tasks, with notable improvements on AQuA mathematics (+2.6%) and SIQA commonsense reasoning (+1.7%).

| Method | BoolQ | PIQA | SIQA | AQuA | MMLU |
| --- | --- | --- | --- | --- | --- |
| Mistral-7B + MeZO | 76.6 | 84.3 | 68.5 | 24.0 | 59.2 |
| Mistral-7B + SensZOQ | 76.3 | 84.9 | 69.4 | 26.1 | 58.4 |
| Mistral-7B + Sparse MeZO | 79.2 | 85.3 | 70.2 | 26.6 | 59.6 |

Following the reviewer's suggestion, we have conducted additional experiments on MMLU using Mistral-7B. The results show consistent improvements: Sparse MeZO achieves a 0.4% improvement over MeZO and a 1.2% improvement over SensZOQ.

Q1: For the setting of the threshold, set its value related to the sparsity, e.g., 80% sparsity - 80th percentile. Are these only empirical findings? It's better to have some ablation studies on the selection of the threshold.

Thank you very much for this valuable suggestion.

Our focus on small-magnitude weights is based on their fundamental properties in pre-trained models and zeroth-order optimization: Large-magnitude weights typically store critical pre-trained information, making them sensitive to perturbations. Even small noise in gradient estimation can significantly disrupt these learned patterns and cause substantial performance drops.

For the ablation study, our experimental analysis examines how weights of different magnitudes affect performance. We divided weights into 4 groups by magnitude, where '0-0.25' represents the bottom 25% and '0.75-1.0' represents the top 25%. The results in the tables demonstrate that updating small-magnitude weights consistently achieves better performance than updating large-magnitude weights.

| Task | 0-0.25 | 0.25-0.5 | 0.5-0.75 | 0.75-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 73.6 | 66.1 | 57.3 |
| BoolQ | 81.6 | 77.1 | 67.7 | 66.1 |

Building on these findings, we performed additional experiments to analyze weight selection strategies. While updating small-magnitude weights (0-0.25) shows strong performance, we tested incrementally including more weight groups (e.g., expanding to 0-0.5, which includes both 0-0.25 and 0.25-0.5). The results show that adding weights beyond the small-magnitude group (0-0.25) actually reduces performance, confirming that focusing on small-magnitude weights is optimal.

| Task | 0-0.25 | 0-0.5 | 0-0.75 | 0-1.0 |
| --- | --- | --- | --- | --- |
| RTE | 82.7 | 75.1 | 72.3 | 71.7 |
| BoolQ | 81.6 | 77.8 | 76.7 | 75.9 |

Q2: Some parts of the appendix are not referred to in the main paper and are without a textual description.

We thank the reviewer for this feedback. We acknowledge that some appendix sections lack proper references from the main paper and sufficient descriptions.

We will address this in the revised manuscript by: (1) adding appropriate citations to all appendix sections in the main text (e.g., referencing hyperparameters in Section 4.1, proofs in Section 3), (2) providing clear descriptions for each appendix section, and (3) ensuring all supplementary material directly supports the main paper's claims.

Q3: Is the threshold computed every iteration? As the learning rate for ZO is small, the update per iteration is also small. Therefore, can you compute it intermittently?

We thank the reviewer for this insightful question. To clarify our experimental setting: we determine the threshold for each layer once before training begins based on the target sparsity level, and this threshold remains fixed throughout the training process. Therefore, the threshold is not computed every iteration.

Our mask is indeed dynamic in the sense that it is computed on-the-fly during each forward pass by comparing current parameter values against these pre-determined thresholds. As the reviewer correctly notes, computing thresholds (which requires sorting weight values of each layer) is computationally expensive, which is why we avoid doing this repeatedly.
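To make the sparse perturbation and update concrete, here is a minimal sketch of a two-point Sparse MeZO step with seeded perturbations and masks derived from the fixed thresholds. It is an illustrative reconstruction, not the authors' implementation: `loss_fn` and `thresholds` are assumed helpers, and materializing full masks once per step is a simplification of the memory-optimized, layer-by-layer masking described above.

```python
import torch

@torch.no_grad()
def sparse_mezo_step(model, thresholds, loss_fn, batch, eps=1e-3, lr=1e-6, seed=0):
    """One two-point sparse-ZO step: perturb only weights whose |value| is at or
    below the fixed per-layer threshold, estimate the projected gradient from two
    forward passes, and update only those weights."""
    params = [(n, p) for n, p in model.named_parameters() if n in thresholds]
    masks = {n: (p.abs() <= thresholds[n]).to(p.dtype) for n, p in params}

    def perturb(scale):
        gen = torch.Generator(device=params[0][1].device).manual_seed(seed)
        for n, p in params:
            z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
            p.add_(scale * eps * masks[n] * z)

    perturb(+1); loss_plus = loss_fn(model, batch)    # loss at theta + eps * m * z
    perturb(-2); loss_minus = loss_fn(model, batch)   # loss at theta - eps * m * z
    perturb(+1)                                       # restore the original weights
    g = (loss_plus - loss_minus) / (2 * eps)          # scalar projected-gradient estimate

    gen = torch.Generator(device=params[0][1].device).manual_seed(seed)
    for n, p in params:                               # same z sequence as in perturb()
        z = torch.randn(p.shape, generator=gen, device=p.device, dtype=p.dtype)
        p.sub_(lr * g * masks[n] * z)
```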

Comment

Thank you to the authors for responding to my questions and concerns. After reading the authors' response, I will raise my score a bit.

Comment

We sincerely thank reviewer EoWB for offering constructive feedback. We will incorporate the suggested changes in the revision. Thanks again for your time and effort in reviewing our submission.

Official Review
Rating: 4

The paper proposes to improve the convergence of zero-order optimizers by only applying the perturbation to smaller weights. The paper begins by presenting empirical observations on how the magnitude of weights in MeZO updates influences loss convergence. Based on this, the authors propose a practical solution: masking the largest weights during perturbation application. They demonstrate that this approach significantly enhances the convergence process and downstream performance in MeZO.

Strengths and Weaknesses

Strengths

  • s1: The paper is well-motivated and written, clearly balanced between motivation, methods, and empirical evaluation. The method is simple and does not require a complex procedure on top of existing zero-order optimization processes.
  • s2: The experimental results are compelling, showing an average improvement of 3.7 points on the SuperGLUE tasks over vanilla MeZO without impact on the memory consumption or convergence speed.

Weaknesses

  • w1: I am not fully convinced by the analysis presented in lines 147-159. As I understand it, the authors compute gradients on a set of 16 data points and then observe whether applying a gradient step in that direction, but starting from 16 entirely independent points, reduces the loss. The observation that there's roughly a 50% chance it won't be the case doesn't completely surprise me. It seems possible to me that a gradient computed on one set of 16 points aligns with a gradient computed on another, completely independent set of 16 points. I would like to know if similar results would hold for supervised fine-tuning (SFT). In particular, to include a comparison in Figure 2(b) with SFT.
  • w2: The results regarding the convergence rate in Section 4.4 and Figure 3 only compare convergence curves for MeZO and Sparse-MeZO, but it would be relevant to add the comparison with SFT here.

Questions

  • “Impressive results” line 1 seems like a superlative statement
  • Line 165, how are the percentages chosen to define small and large values? Why set the threshold to 20%?
  • Lines 222-225, I am not sure how this paragraph helps with the argumentation. According to the author, this method is not aligned with the goal of the paper and is not discussed anywhere else in the paper. I would suggest simply removing this paragraph for clarity.
  • It would be helpful to report the standard variations on top of the mean across the three runs for the results in at least Table 1. This could help capture the stability of the approach.

Limitations

The authors have discussed the limitations in the last section.

Justification for Final Rating

Overall, my assessment of the paper remains unchanged. I find the proposed approach relatively straightforward and likely easy to apply in practice. The paper is well-written and generally well-motivated, though I believe there is room for improvement in the experimental justification of the approach (lines 147–159).

Formatting Concerns

No paper formatting concerns.

Comment

Thank you for your constructive feedback; our responses to your concerns are below.

w1: It seems possible to me that a gradient computed on one set of 16 points aligns with a gradient computed on another, completely independent set of 16 points. I would like to know if similar results would hold for supervised fine-tuning (SFT). In particular, to include a comparison in Figure 2(b) with SFT.

Thank you for this insightful suggestion. We agree that a more rigorous comparison is needed. Our new analysis compares SGD and MeZO using the same batch. The results show MeZO has a higher probability of loss increment (0.4-0.6) compared to SGD (0.2-0.3). While both methods can increase loss on different batches, MeZO exhibits this behavior more frequently.

w2: The results regarding the convergence rate in Section 4.4 and Figure 3 only compare convergence curves for MeZO and Sparse-MeZO, but it would be relevant to add the comparison with SFT here.

We thank the reviewer for this suggestion. Adding SFT convergence comparison would indeed provide valuable context for understanding the convergence characteristics of different optimization approaches. In our experiments, SFT typically converges within 1,000 steps, while both MeZO and Sparse-MeZO require 10,000 steps due to the inherent nature of zeroth-order optimization, which relies on gradient estimation rather than exact gradients.

We will add SFT convergence curves to Figure 3 in the revision. This comparison will help readers better understand the trade-offs: while SFT achieves faster convergence, it requires significantly more memory (12× more than our method). Sparse-MeZO provides a middle ground: faster convergence than vanilla MeZO while maintaining the memory efficiency of zeroth-order methods.

Q1: “Impressive results” line 1 seems like a superlative statement

We agree that the phrase "impressive results" contains subjective language. We will revise this to use more objective language in the revision. Specifically, we will change the sentence to: "While fine-tuning large language models (LLMs) for specific tasks often yields strong performance..."

Q2: Line 165, how are the percentages chosen to define small and large values? Why set the threshold to 20%?

We thank the reviewer for this valuable observation. For the threshold, we use a percentile-based method in which the threshold is set according to a target sparsity level. For example, with 80% sparsity, we sort the weight values of each layer and set the threshold at the 80th percentile. Importantly, this threshold is determined once before training begins and remains fixed throughout the optimization process.

Our empirical results in Table 12 demonstrate that a sparsity level of 70-80% consistently yields strong performance across diverse tasks (RTE, BoolQ, and WIC), highlighting the generalizability of our threshold selection method.

Q3: Lines 222-225, I am not sure how this paragraph helps with the argumentation. According to the author, this method is not aligned with the goal of the paper and is not discussed anywhere else in the paper.

We thank the reviewer for this feedback and agree with the suggestion. The paragraph discussing 1-bit quantization introduces an alternative approach that we do not pursue or evaluate, which may indeed cause confusion for readers. We will remove this paragraph in the revision for better clarity.

Q4: It would be helpful to report the standard variations on top of the mean across the three runs for the results in at least Table 1. This could help capture the stability of the approach.

We thank the reviewer for this valuable suggestion. We agree that reporting standard deviations would better demonstrate the stability of our approach across multiple runs.

We will update Table 1 in the revision to include standard deviations for all methods across the three random seeds. For reference, we have already provided standard deviations in Table 10 (Appendix) for our main LLaMA-7b results, which show that Sparse-MeZO maintains stable performance with standard deviations typically ranging from ±0.3 to ±1.6 across different tasks.

Comment

I thank the authors for their responses and clarifications.

  • I maintain my reservations regarding weakness w1, particularly concerning the motivation, which I believe could still be refined. I encourage the authors to incorporate the new experimental results into the paper to strengthen this aspect.
  • Regarding w2, I appreciate the quantitative details provided, which offer meaningful points of comparison.

Thank you as well for addressing my question — I am satisfied with the response.

Comment

We sincerely thank reviewer i838 for the constructive feedback. We will incorporate the suggested changes in our revision. Thank you again for your time and effort in reviewing our submission.

Final Decision

The paper proposes Sparse MeZO, a memory-efficient zeroth-order (ZO) optimization approach for LLM fine-tuning. The core claim is that ZO gradient noise disproportionately harms large-magnitude weights, so selectively applying updates only to small-magnitude weights reduces noise sensitivity, accelerates convergence, and improves performance. The method also includes a memory-optimized masking implementation that enables fine-tuning models as large as LLaMA-30B on a single A100 GPU. Experimental results across SuperGLUE, commonsense reasoning, and math tasks show consistent gains over MeZO, with improvements of up to 9% accuracy and 3.5× faster convergence.

The strengths of the paper include a clear empirical observation motivating the method, a simple yet effective algorithm with negligible overhead, strong empirical results across models and tasks, and an impactful system contribution that lowers hardware barriers for LLM fine-tuning. Reviewers also noted the novelty of focusing on weight magnitudes and the practical significance of enabling large-scale fine-tuning under constrained resources.

Weaknesses raised include limited theoretical guarantees (the provided theorem applies only to fixed masks), incremental contributions relative to MeZO, incomplete baselines in the initial submission, and questions about sparsity threshold choice. Some reviewers also noted the gap between Sparse MeZO and first-order methods remains and suggested broader evaluation on challenging benchmarks and comparison with other sparse ZO methods such as SensZOQ.

The decision to accept is based on the strong empirical evidence, practical utility, and the clear motivation behind the method. While the contribution could be considered incremental, the simplicity, demonstrated effectiveness, and memory savings make it valuable for the community. The paper stands out because it addresses a critical bottleneck in ZO optimization with a principled and practical solution that enables new capabilities, such as fine-tuning 30B-parameter models on commodity hardware.

During the rebuttal, reviewers requested additional comparisons (e.g., with SensZOQ, ZO-SVRG, and ZO-SGD), clarity on sparsity selection, more ablations, reporting of variance, and evaluations on harder tasks. The authors addressed these comprehensively by running new experiments, providing ablations that confirmed the advantage of small-weight updates, clarifying the fixed-threshold scheme, reporting standard deviations, and expanding evaluations (including MMLU and comparisons with variance-reduction ZO methods). These responses resolved most concerns and improved reviewer confidence, with several reviewers raising their scores. Overall, the combination of methodological insight, practical system implementation, and strong rebuttal makes this a solid accept.