ReLIZO: Sample Reusable Linear Interpolation-based Zeroth-order Optimization
Abstract
Reviews and Discussion
From my understanding, this paper gives a zeroth-order algorithm with applications to popular vision tasks, namely neural architecture search and black-box adversarial attacks. The authors derive a closed-form solution after modeling the gradient estimation as a quadratically constrained linear program. The key idea is to decouple the required sample size from the variable dimension without extra conditions, making it possible to leverage queries from prior iterations. A further speedup is achieved by directly indexing some of the intermediate variables that contribute to the gradient estimation. Theoretical analysis of the convergence speed is given, and the method's cost-effectiveness is verified on benchmarks.
Strengths
- Clear motivation and a clearly derived approach; it is indeed a new zeroth-order algorithm, and the authors contextualize the proposed method well with a discussion of related work.
- Strong empirical performance on representative vision tasks with rich testbeds and settings.
- By design, the approach can enjoy the efficiency of smoothing techniques while maintaining estimation accuracy. Table 4 in the appendix is informative.
- The paper gives comprehensive results and technical details in both the main paper and the appendix.
Weaknesses
- As remarked by the authors, it imposes few constraints on the sample size, similar to smoothing techniques; however, it requires gradient estimation, which involves solving a linear program.
- As a zeroth-order algorithm, it may still not be suited for large-scale applications, e.g., network training.
Questions
I wonder whether the proposed method could really benefit the NAS community, as zeroth-order optimization is not common in NAS.
Limitations
No.
Thank you for your careful and valuable comments. We hope to address your concerns by answering your questions below.
Q1: As noted by the authors, the method imposes few constraints on sample size, similar to smoothing techniques, but requires gradient estimation through solving a linear program, which could be a potential limitation.
A1: Our approach indeed leverages the strengths of both linear interpolation-based and smoothing-based zeroth-order (ZO) methods, as highlighted in the Remark (line 142). By solving a QCLP to estimate gradients, our method achieves higher accuracy than smoothing-based methods while maintaining lower computational costs than linear interpolation-based methods. Sec. 3.3 also shows that part of the intermediate variables contributing to the gradient estimation can be directly indexed, significantly reducing the computational complexity of ReLIZO. Further analysis of the total computational complexity is provided in Section B and Table 4 in the appendix. Additionally, since optimization problems are often high-dimensional with a typically small sample size at each iteration (i.e., the sample size is much smaller than the variable dimension), ReLIZO remains competitive in terms of speed.
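To make the reuse mechanism concrete, below is a minimal Python sketch of an interpolation-style ZO gradient estimate that recycles prior queries lying within a distance bound of the current point. It uses an ordinary least-squares surrogate rather than the paper's QCLP formulation, and all names, defaults, and the sampling distribution are illustrative assumptions rather than the actual implementation:

```python
import numpy as np

def estimate_gradient(f, x, n_samples=10, step=1e-3, delta=1e-3, history=None):
    """Interpolation-style ZO gradient estimate with query reuse (sketch).

    Simplified least-squares surrogate, not the paper's exact QCLP; it only
    illustrates how reused queries enter the linear system alongside fresh ones.
    """
    history = history if history is not None else []   # (point, value) pairs from past iterations
    fx = f(x)

    # Reuse prior queries that lie within the reusable distance bound delta.
    reused = [(p, v) for (p, v) in history if np.linalg.norm(p - x) <= delta]

    # Draw fresh perturbations for the remaining budget (any distribution is allowed).
    n_new = max(n_samples - len(reused), 0)
    new_pts = [x + step * np.random.randn(x.size) for _ in range(n_new)]
    new = [(p, f(p)) for p in new_pts]

    pts, vals = zip(*(reused + new))
    U = np.stack([p - x for p in pts])      # (n, d) displacement matrix
    df = np.array(vals) - fx                # (n,) function-value differences

    # First-order model U g ≈ df; minimum-norm least-squares solution.
    g, *_ = np.linalg.lstsq(U, df, rcond=None)

    history.extend(new)
    return g, history
```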
Q2: As a zeroth-order algorithm, is ReLIZO suitable for large-scale applications, such as network training?
A2: Compared to gradient-based methods, ZO methods are well-suited for black-box optimization problems where gradients with respect to the variables are unavailable. Additionally, ZO methods are more memory-efficient than gradient-based methods since they do not require backward propagation, making them applicable to network training, as demonstrated by recent works [1,2]. Specifically, [1] introduces DeepZero, a principled ZO framework for deep learning that is computational-graph-free and can scale to deep neural network training with performance comparable to first-order methods. [2] also applies ZO methods to large language model (LLM) fine-tuning, highlighting their memory efficiency and introducing a ZO-LLM benchmark.
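For intuition about why such ZO training avoids backward passes, here is a minimal sketch of a generic two-point ZO-SGD update that only needs forward evaluations. It illustrates the style of optimizer covered by these benchmarks, not ReLIZO's interpolation-based estimator, and the hyperparameters are illustrative assumptions:

```python
import numpy as np

def zo_sgd_step(loss_fn, params, lr=1e-4, mu=1e-3):
    """One two-point ZO-SGD update: forward passes only, no backpropagation."""
    u = np.random.randn(*params.shape)                            # random perturbation direction
    scale = (loss_fn(params + mu * u) - loss_fn(params - mu * u)) / (2 * mu)
    return params - lr * scale * u                                # step along the estimated direction
```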
To illustrate the applicability of ReLIZO in network training, we adopt it for fine-tuning an OPT-1.3b model (with 1.3 billion parameters) on the Stanford Sentiment Treebank v2 (SST2) task, following the methodology of [2]. The results, shown in the table below, indicate that ReLIZO outperforms other ZO methods across various fine-tuning schemes, including full parameter fine-tuning (FT), LoRA, Prefix-tuning, and Prompt-tuning. Notably, ReLIZO even surpasses SGD in the FT scheme while requiring significantly less memory, demonstrating its promising potential.
| Optimizer | FT | LoRA | Prefix-Tuning | Prompt | FT Memory cost |
|---|---|---|---|---|---|
| SGD | 91.1 | 93.6 | 93.1 | 92.8 | 44.1 GB |
| ZO-SGD | 90.8 | 90.1 | 91.4 | 84.4 | 28.7 GB |
| ZO-SGD-Sign | 87.2 | 91.5 | 89.5 | 72.9 | 31.4 GB |
| ZO-Adam | 84.4 | 92.3 | 91.4 | 75.7 | 31.4 GB |
| ReLIZO (ours) | 93.4 | 93.1 | 91.8 | 90.1 | 35.7 GB |
[1] Chen A, Zhang Y, Jia J, et al. DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training[C], ICLR 2024.
[2] Zhang Y, Li P, Hong J, et al. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark[C], ICML 2024.
Q3: I wonder whether the proposed method could really benefit the NAS community, as zeroth-order optimization is not common in NAS.
A3: Recent studies [4,5,6] have explored zeroth-order optimization methods for solving bi-level optimization problems, including the NAS task, where gradients of the parameters in the upper-level objective function are typically unavailable. These works highlight the limitations of gradient-based methods for NAS and demonstrate that ZO methods can outperform gradient-based approaches by providing more accurate gradient estimations of the architecture parameters in the upper-level objective function. Nevertheless, this work aims to introduce a brand-new zeroth-order optimization algorithm, and we apply it to NAS tasks to evaluate its performance on bi-level optimization problems.
[4] Wang X, Guo W, Su J, et al. Zarts: On zero-order optimization for neural architecture search[J]. NeurIPS, 2022.
[5] Xie L, Huang K, Xu F, et al. ZO-DARTS: Differentiable architecture search with zeroth-order approximation[C]. ICASSP, 2023.
[6] Aghasi A, Ghadimi S. Fully Zeroth-Order Bilevel Programming via Gaussian Smoothing[J]. arXiv preprint arXiv:2404.00158, 2024.
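As a conceptual illustration of A3, the sketch below shows how a ZO gradient estimate can drive the upper-level (architecture) update in a bi-level setting. Here `val_loss` is assumed to wrap lower-level weight training plus validation evaluation, and the forward-difference estimator is generic, not the update rule of ReLIZO or of [4-6]:

```python
import numpy as np

def bilevel_zo_step(val_loss, alpha, lr=0.01, mu=1e-3, n_dirs=4):
    """Upper-level ZO update of architecture parameters alpha (conceptual sketch)."""
    base = val_loss(alpha)                   # lower-level weight training happens inside val_loss
    g = np.zeros_like(alpha)
    for _ in range(n_dirs):
        u = np.random.randn(*alpha.shape)
        g += (val_loss(alpha + mu * u) - base) / mu * u    # forward-difference estimate
    g /= n_dirs
    return alpha - lr * g
```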
Thank you for the detailed response, which addresses my concerns.
The paper introduces ReLIZO, a novel zeroth-order optimization method leveraging linear interpolation to estimate gradients efficiently. It reduces the complexity of gradient estimation by reusing prior queries without additional conditions on sample size, decoupling it from variable dimension constraints. ReLIZO models gradient estimation as a quadratically constrained linear program, solving it analytically to reduce computation complexity. Experimental results demonstrate ReLIZO's efficacy in various scenarios, including black-box adversarial attacks and neural architecture search, showcasing faster convergence and better solutions compared to existing methods.
Strengths
- The paper is well-written, with clear and easy-to-follow explanations.
- The paper introduces a method for estimating gradients using arbitrarily sampled vectors without requiring orthogonal conditions or adherence to a specific distribution, enabling the reuse of queries to accelerate the zeroth-order (ZO) optimization process.
- Extensive experiments on simulation benchmarks and real-world applications validate the method’s performance.
- The paper highlights that ReLIZO can be viewed as a generalized version of traditional linear interpolation methods, capable of handling both equal and smaller sample sizes compared to variable dimensions. This demonstrates ReLIZO's theoretical soundness and enhanced flexibility in gradient estimation.
Weaknesses
- The effectiveness of reusing queries depends on the choice of the reusable distance bound, which might require fine-tuning for different applications, adding complexity to its implementation.
- While the method reduces the number of function queries, the process of solving the quadratically constrained linear program might introduce additional computational overhead for large sample sizes.
Questions
In Figure 5, the results for the ARGTRIGLS problem indicate that any reusable distance bound leads to a performance drop. What is the specific structure of this problem? Does this suggest that the reuse strategy may be ineffective in certain special cases?
Limitations
The authors have adequately addressed the limitations.
Thank you for your careful and valuable comments. We hope to address your concerns by answering your questions below.
Q1: The effectiveness of reusing queries depends on the choice of the reusable distance bound, which might require fine-tuning for different applications, adding complexity to its implementation.
A1: In our method, the reusable distance bound restricts the distances between reusable samples and the current point and should be of the same order of magnitude as the step size. Following previous ZO research (Table 1 in [1]), we set the step size small enough to ensure convergence and choose the reusable distance bound accordingly. Ablation studies presented in Fig. 2 (in the main paper) and Fig. 4 (in the appendix) demonstrate that this choice performs well across different optimization tasks and sample sizes.
As you noted, the reusable distance bound plays a critical role in balancing speed and accuracy. A larger bound can increase the reuse rate, reduce the number of queries, and accelerate the process. However, it also introduces a larger residual term in the first-order Taylor expansion, leading to an increase in relative error. Therefore, its value must be adjusted according to the specific application. From this perspective, a reusable distance bound that is adaptively adjusted during the optimization process would be ideal, and we plan to explore this in future work.
[1] Ji, Kaiyi, et al. "Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization." ICML 2019.
Q2: While the method reduces the number of function queries, solving the quadratically constrained linear program (QCLP) may introduce additional computational overhead for large sample sizes.
A2: As the sample size increases, the cost of solving the QCLP also tends to increase. However:
- On one hand, in zeroth-order optimization the sample size is typically kept much smaller than the variable dimension to ensure efficiency in high-dimensional optimization problems (e.g., in black-box adversarial attacks, the variable dimension is far larger than the sample size). The performance of ZO methods in such extreme cases, where the sample size is much smaller than the dimension, is crucial for assessing their scalability and robustness. Our experiments, detailed in Appendix C.1, demonstrate the effectiveness of our method even under these conditions.
- On the other hand, Sec. 3.3 introduces a strategy for indexing some intermediate variables that contribute to gradient estimation, significantly reducing the computational complexity of solving the QCLP. We also provide a detailed analysis and comparison of the computation complexity of various ZO methods in Appendix B and Table 4, showing that our ReLIZO method is more computationally efficient than other linear interpolation-based methods. Furthermore, as the sample size increases, the number of reusable queries also increases with the same step size and reusable distance bound, leading to additional time savings compared to other ZO methods that do not support query reuse.
Q3: In Figure 5, the results for the ARGTRIGLS problem indicate a performance drop with any reusable distance bound. What is the specific structure of this problem? Does this suggest that the reuse strategy might be ineffective in certain special cases?
A3: ARGTRIGLS is a specialized academic problem designed to evaluate optimization algorithms. It is a variable dimension trigonometric problem in least-squares form, expressed as a sum of least-squares groups, each containing nonlinear elements. This problem is a variant of Problem 26 in [1], which can be written as follows:
Given the dimension $d$ and the initial point $x_0 = (1/d, \ldots, 1/d)$, solve
$$\min_{x \in \mathbb{R}^d} \; \sum_{i=1}^{d} \Big( d - \sum_{j=1}^{d} \cos x_j + i\,(1 - \cos x_i) - \sin x_i \Big)^2.$$
In Figure 5, we observe a relative performance drop (slightly slower convergence) for ARGTRIGLS once the reusable distance bound is made nonzero. However, this does not imply that the reuse strategy is ineffective in this case. The ARGTRIGLS problem is characterized by a rugged landscape with numerous local minima. As shown in Figure 4(b), even under the same reusable distance bound and sample size settings, the reuse rate of ARGTRIGLS exceeds 90%, whereas the reuse rates for other problems are generally below 50% (e.g., MANCINO with d=100 in Figure 4(a), SROSENBR with d=500 in Figure 2(b)). This high reuse rate means that the number of new queries during optimization is relatively small, which accounts for the observed decrease in convergence speed.
To address this issue, we can employ simple strategies such as reducing the reusable distance bound. The table below (with the bound increasing from left to right) illustrates that performance improves as the bound decreases:
| reusable distance bound | ||||||
|---|---|---|---|---|---|---|
| objective value (iter=500) | 0.133 | 0.162 | 0.298 | 0.495 | 3.768 | 8.158 |
| objective value (iter=2000) | 0.053 | 0.063 | 0.166 | 0.311 | 1.834 | 3.825 |
| total reuse rate | 0% | 16.6% | 72.4% | 82.3% | 94.7% | 96.8% |
Moreover, setting an upper bound for the reuse rate in each iteration can also help. The following table reports the performance when the maximum reuse rate in each iteration is restricted to 50%. We observe that, with this cap and 2000 iterations, there is minimal performance degradation, even at a reuse rate of approximately 30%:
| reusable distance bound | ||||||
|---|---|---|---|---|---|---|
| objective value (iter=500) | 0.133 | 0.164 | 0.183 | 0.255 | 0.312 | 0.341 |
| objective value (iter=2000) | 0.053 | 0.046 | 0.054 | 0.063 | 0.075 | 0.074 |
| total reuse rate | 0% | 11.4% | 34.2% | 41.9% | 47.9% | 48.9% |
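A minimal sketch of how such a per-iteration reuse-rate cap could be implemented is given below; the helper name and signature are hypothetical and not the exact implementation in ReLIZO:

```python
import numpy as np

def select_reusable(history, x, delta, n_samples, max_reuse_ratio=0.5):
    """Pick reusable queries near x, capping the per-iteration reuse rate (sketch)."""
    max_reuse = int(max_reuse_ratio * n_samples)
    # Candidates within the reusable distance bound, closest to x first.
    candidates = [(p, v) for (p, v) in history if np.linalg.norm(p - x) <= delta]
    candidates.sort(key=lambda pv: np.linalg.norm(pv[0] - x))
    reused = candidates[:max_reuse]
    n_new = n_samples - len(reused)   # remaining budget is filled with fresh queries
    return reused, n_new
```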
As discussed in A1, an adaptively adjustable reusable distance bound aligned with the optimization process would be advantageous in this context. We plan to explore this approach in future work.
The authors' reply addresses most of my issues. I appreciate the clarification made by the authors. I have no other concerns.
This study introduces a novel gradient estimation algorithm that operates solely on forward function evaluations. The method employs a Quadratically Constrained Linear Program (QCLP) to determine the optimal linear approximation of sample vectors. The authors present performance enhancement strategies, including sample reuse and efficient inverse matrix computation within the QCLP framework. Empirical evaluations conducted on black-box adversarial attacks and neural architecture search demonstrate the proposed algorithm's superiority over existing zeroth-order methods.
Strengths
- The proposed method is natural. Approximating the gradient using linear combinations of samples and formulating this as a QCLP is an intuitive idea, and the auxiliary techniques employed in this study are both judicious and pertinent to the research objectives.
- The paper is well-written and easy to follow.
Weaknesses
- Zeroth-order gradient estimation has a relatively limited impact. While the proposed zeroth-order gradient estimation method demonstrates superiority over existing algorithms in its class, its overall impact on solving the underlying optimization problems may be constrained. This limitation is exemplified in the NAS evaluation, where ReLIZO does not consistently achieve optimal performance.
- According to my interpretation, the ReLIZO algorithm obtains new samples from the input space in each iteration randomly and arbitrarily. I feel there might be more effective strategies for sampling new vectors based on known information. Could the authors comment on this?
Questions
See weakness.
Limitations
N/A
Thank you for your careful and valuable comments. We hope to address your concerns by answering your questions below.
Q1: Zeroth-order gradient estimation has a relatively limited impact. This limitation is exemplified in the NAS evaluation, where ReLIZO does not consistently achieve optimal performance.
A1: Thank you for highlighting the inherent limitations of zeroth-order (ZO) methods. Though it is true that ZO methods generally face these constraints, they offer significant advantages over gradient-based methods in black-box optimization scenarios where gradients with respect to variables are not available. Additionally, ZO methods tend to be more memory-efficient, since they do not require backward propagation. This makes them suitable for large-scale tasks such as fine-tuning large language models (LLMs), as demonstrated in recent work [1]. We applied our method to fine-tune an OPT-1.3b model (with 1.3 billion parameters) on the Stanford Sentiment Treebank v2 (SST2) task, and the results are summarized in the table below:
| Optimizer | Fine-tune Acc | Fine-tune Memory cost |
|---|---|---|
| SGD | 91.1 | 44.1 GB |
| ZO-SGD | 90.8 | 28.7 GB |
| ZO-SGD-Sign | 87.2 | 31.4 GB |
| ZO-Adam | 84.4 | 31.4 GB |
| ReLIZO (ours) | 93.4 | 35.7 GB |
Regarding NAS tasks, DARTS and GDAS are two classic gradient-based methods with distinct settings. DARTS trains all parameters in the supernet at each iteration, whereas GDAS samples a sub-network from the supernet at each iteration and trains only the parameters within that sub-network. Our experiments employed DARTS's settings. The results in Table 2 demonstrate that all ZO methods outperform DARTS, and ReLIZO surpasses all other ZO methods, highlighting the effectiveness of our approach.
[1] Zhang Y, Li P, Hong J, et al. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark[C], ICML 2024.
Q2: In the ReLIZO algorithm, new samples in each iteration are drawn randomly and arbitrarily. There might be more effective strategies to sample new vectors based on known information.
A2: Thank you for your suggestion. The ability of ReLIZO to sample new vectors from an arbitrary distribution can be considered one of its strengths compared to other ZO methods. In contrast, smoothing-based ZO methods have to rely on Gaussian or uniform distributions with a mean of zero. We appreciate your insight and introduce an effective sampling strategy inspired by SGD with momentum. Specifically, we propose a sampling momentum strategy where the sampling distribution in the current iteration is $\mathcal{N}(\beta \hat{g}_{t-1}, \sigma^2 I)$, with $\mathcal{N}$ denoting a Gaussian distribution, $\hat{g}_{t-1}$ the gradient estimated in the last iteration, $\beta$ the momentum parameter, and $\sigma$ the standard deviation. The results of this approach, labeled ReLIZO-m, are presented in the table below:
| Method | CIFAR10-valid | CIFAR10-test | CIFAR100-valid | CIFAR100-test | ImageNet-16-120-valid | ImageNet-16-120-test |
|---|---|---|---|---|---|---|
| ReLIZO | 89.50 | 92.45 | 69.00 | 69.03 | 42.09 | 42.31 |
| ReLIZO-m | 90.71 | 93.41 | 70.40 | 69.63 | 42.79 | 43.22 |
These experimental results demonstrate that incorporating an effective sampling strategy can significantly enhance performance. We plan to explore this approach in greater depth in future work.
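For illustration, a minimal sketch of the momentum-guided sampling in ReLIZO-m is given below; the parameterization (Gaussian mean tilted by the previous gradient estimate, isotropic standard deviation) follows the description above, while the default values of beta and sigma are assumptions:

```python
import numpy as np

def momentum_sampling(n_samples, dim, g_prev=None, beta=0.9, sigma=1e-3):
    """Draw perturbation directions whose mean follows the last gradient estimate (sketch)."""
    mean = np.zeros(dim) if g_prev is None else beta * g_prev
    return mean + sigma * np.random.randn(n_samples, dim)
```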
Dear authors,
Thank you for your response. I have read it and am glad to learn that a more refined sampling strategy can improve the performance. I have no other concerns.
General Responses
Dear Area Chair and Reviewers,
We sincerely thank you for the time and effort you dedicated to the reviewing process. We are delighted that the reviewers acknowledged the novelty of rethinking gradient estimation in ZO methods as a QCLP and of our reusing strategy (YYyS, VWEi, zPFv), the clear motivation and intuitively derived approach (YYyS, zPFv), the extensive experiments (VWEi, zPFv), and that the paper is well-written (YYyS, VWEi, zPFv). To further address the comments and questions posed by reviewers, we have also conducted additional analyses and experiments, including:
- Following the valuable suggestion from reviewer YYyS on sampling new vectors based on known information, we highlight the strength of ReLIZO in sampling new vectors from an arbitrary distribution compared to other ZO methods. We introduce an effective sampling strategy inspired by SGD with momentum, which brings a further performance improvement on the NAS task.
- As pointed out by reviewer VWEi, the reusable distance bound is a crucial parameter in the ReLIZO method and can be sensitive to specific problems. To address this issue, we conduct experiments on different reusable distance bounds and demonstrate that setting an upper bound for the reuse rate in each iteration can significantly reduce performance drops in sensitive cases.
- Regarding the relatively limited impact of ZO methods raised by reviewers YYyS and zPFv, we demonstrate that ZO methods are well-suited for black-box optimization problems where gradients with respect to the variables are unavailable. Additionally, ZO methods are more memory-efficient than gradient-based methods since they do not require backward propagation, making them applicable to network training, as evidenced by recent works [1,2]. To illustrate the applicability of ReLIZO in network training, we adopted it for fine-tuning an OPT-1.3b model (with 1.3 billion parameters) on the Stanford Sentiment Treebank v2 (SST2) task, following the methodology of [2]. The results indicate that ReLIZO outperforms other ZO methods across various fine-tuning schemes, including full parameter fine-tuning (FT), LoRA, Prefix-tuning, and Prompt-tuning. Notably, ReLIZO even surpasses SGD in the FT scheme while requiring significantly less memory, demonstrating its promising potential.
For each reviewer, we have posted a rebuttal addressing the concerns, including the specific results of the additional experiments mentioned above. We look forward to your reply and are more than happy to respond to any further comments. Once again, thank you for your valuable comments and support.
With sincere appreciation and best regards,
The Authors
[1] Chen A, Zhang Y, Jia J, et al. DeepZero: Scaling Up Zeroth-Order Optimization for Deep Model Training[C], ICLR 2024.
[2] Zhang Y, Li P, Hong J, et al. Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark[C], ICML 2024.
This paper studies how to estimate the gradient proxy in zeroth-order optimization more efficiently and effectively. The proposed approach -- gradient estimation through a quadratically constrained linear program -- utilizes all historical information, and the derived gradient admits a neat form. Both solid theory and rich empirical studies are included. All concerns have been well addressed in the rebuttal. We recommend accepting this paper.
The authors are required to polish the paper as promised in the rebuttal, and are also suggested to discuss how to parallelize the proposed zeroth-order optimization algorithm; for example, can it be cast into the general parallelization framework?
References:
- A Comprehensive Linear Speedup Analysis for Asynchronous Stochastic Parallel Optimization from Zeroth-Order to First-Order