PaperHub
Overall score: 7.8/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
ICML 2025

Natural Perturbations for Black-box Training of Neural Networks by Zeroth-Order Optimization

OpenReview · PDF
Submitted: 2025-01-21 · Updated: 2025-07-24
TL;DR

This paper proposes a novel concept and method for black-box training of neural networks by zeroth-order optimization.

Abstract

Keywords

Zeroth-order optimization, Multivariate normal distribution, Covariance matrix, Fisher information matrix

Reviews and Discussion

Review (Rating: 4)

This paper extends the idea of natural gradient descent from back-propagation-based (first-order) training to zeroth-order training of neural networks. Specific care is taken to enable efficient approximations of the Fisher information matrix (FIM) for deep neural networks by using a block-wise FIM. Several experiments across a variety of settings demonstrate the convergence benefits of the method compared to standard sampling from an isotropic Gaussian.

Questions for Authors

If the authors include a discussion on the memory-costs of their algorithm and provide experimental results on (a subsection of) the ZO-Bench datasets with LLM fine-tuning, it would widen the audience of their method - and I will raise my score.

Claims and Evidence

The authors claim to have developed a new method through the concept of natural perturbations. Even though this paper shares common themes with yet-unpublished work provided in the appendix, the present paper provides sufficient new elements to be a convincing contribution.

Methods and Evaluation Criteria

Yes, the methods and evaluation criteria make sense. Specifically, studying the convergence behavior normalized by compute time (e.g., Figure 6) captures an important aspect of ZO optimization. What I am missing is a study of the memory usage of this method. While this paper explicitly targets black-box functions as the motivating argument for ZO, the reality of many ZO applications is that they are applied because of their reduced memory consumption compared to back-propagation-based methods. While this paper explicitly doesn't target memory reductions, such a study would have been a great addition to make this method appealing to a wider audience.

In terms of datasets, many papers in this literature study ZO for LLM fine-tuning on datasets mentioned here: https://github.com/ZO-Bench/ZO-LLM It would be great to see Natural Perturbations ZO applied to those.

Theoretical Claims

I checked all derivations in the paper excluding Section 3.3 and its corresponding appendix.

Experimental Design and Analyses

I checked the experiments and analyses I am familiar with, which excludes the MZI-related experiments, about which I understand very little. There are no issues related to these experiments.

As stated above, I would appreciate an extension of the experiment section to include tasks from the ZO-Bench, which would appeal to a wider audience.

Supplementary Material

I did not check the code, but appreciate its inclusion. I scanned the unpublished related paper.

Relation to Broader Scientific Literature

The broader scientific literature in this domain tries to accelerate the convergence of ZO methods. There are many orthogonal ideas for achieving this goal that have not been explicitly discussed, which I think is fine. There seems to be some related work in the optical NNs literature that I was not aware of, and I'm glad for the references.

Essential References Not Discussed

No

Other Strengths and Weaknesses

The paper is very well written; the math is laid out in an extremely intuitive manner in terms of notation and accompanying text.

Other Comments or Suggestions

NA

Ethics Review Concerns

NA

Author Response

Thank you for reviewing our paper and evaluating that the contribution is sufficient and the paper is very well written.

If the authors include a discussion on the memory-costs of their algorithm and provide experimental results on (a subsection of) the ZO-Bench datasets with LLM fine-tuning, it would widen the audience of their method

Below, we report and discuss the memory costs of our algorithm, not only for the already reported neural networks but also for a newly constructed larger neural network with one million parameters. However, it is difficult for our current algorithm to perform experiments on the ZO-Bench datasets with LLMs, which have a much larger number of parameters (perhaps hundreds of millions at the smallest). We feel that we need a different strategy, e.g., a much stronger approximation of the FIM than making it block diagonal, and would like to address this as future work in another paper.

What I am missing is a study for the memory-usage of this method. ... While this paper explicitly doesn't target memory-reductions, it would have been a great addition to make this method appealing to a wider audience.

The table below shows the memory cost corresponding to the settings of Table 1. The difference between ZO-I and ZO-NP roughly corresponds to the additional memory cost of introducing natural perturbations.

Memory footprint (GB) corresponding to the settings of Table 1

| dataset/task | architecture | $N_\mathrm{max}$ | $B$ | ZO-I | ZO-co | ZO-NP |
|---|---|---|---|---|---|---|
| MNIST, FashionMNIST | CNN (matrix) | 431 | 6 | 1.08 | 1.08 | 3.79 |
| Equalization | FeedForward (MZI) | 280 | 2 | 0.39 | 0.39 | 0.42 |
| Copying | RNN (MZI) | 453 | 7 | 0.75 | 0.75 | 1.91 |
| CIFAR10 | MLP-mixer (matrix) | 510 | 66 | 3.95 | 3.95 | 7.36 |

We can reduce the memory cost by decreasing the maximum block size $N_\mathrm{max}$, as the following table shows. Together with Figure 7 in the submitted main text, we understand that ZO-NP trades off test accuracy against the memory cost overhead.

Memory footprint (GB) for FashionMNIST corresponding to the setting of Figure 7

| $N_\mathrm{max}$ | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 124 | 236 | 431 |
|---|---|---|---|---|---|---|---|---|---|---|
| ZO-NP | 1.11 | 1.11 | 1.11 | 1.11 | 1.11 | 1.15 | 1.48 | 1.90 | 3.07 | 3.79 |

We also measured the memory footprints of a larger MLP-mixer for CIFAR10, as shown in the table below. The number of mixers has increased from 3 to 12 and the channel width from 32 to 256, increasing the number of parameters from 33,642 to 1,706,762. Again, the difference between ZO-I and ZO-NP roughly corresponds to the additional memory cost of introducing natural perturbations, and we can reduce the memory cost by decreasing the maximum block size $N_\mathrm{max}$.

Memory footprint for CIFAR10 with the larger MLP-mixer

| method | $N_\mathrm{max}$ | $B$ | memory footprint (GB) |
|---|---|---|---|
| ZO-I | 512 | 3,334 | 15.68 |
| ZO-co | 512 | 3,334 | 14.35 |
| ZO-NP | 512 | 3,334 | 40.87 |
| ZO-NP | 256 | 6,668 | 23.77 |
| ZO-NP | 128 | 13,335 | 21.80 |

We will include these results in the revised manuscript to clarify the memory cost of the proposed algorithm for a wide variety of neural networks.

In terms of datasets, many papers in this literature study ZO for LLM fine-tuning on datasets mentioned here: https://github.com/ZO-Bench/ZO-LLM It would be great to see Natural Perturbations ZO applied to those.

We will include the references to ZO-LLM and the corresponding paper [1] in the revised manuscript, and state that applying the idea of natural perturbations to ZO-based LLM fine-tuning is future work.

[1] Zhang, Yihua, et al. "Revisiting zeroth-order optimization for memory-efficient LLM fine-tuning: A benchmark." arXiv preprint arXiv:2402.11592 (2024).

Thanks for the nice suggestions to widen the audience of our method.

Review (Rating: 4)

This paper proposes a novel sampling strategy for zeroth-order optimization for training neural networks. Specifically, the authors propose natural perturbations that incorporate not only a parameter space discrepancy but also a function space discrepancy. To make the approach practical for networks with large-scale parameters, the paper proposes a block coordinate method that partitions the parameters into smaller groups, thereby enabling efficient computation of an approximate block-diagonal Fisher information matrix. Experiments on diverse tasks demonstrate that the proposed method outperforms baseline methods.

update after rebuttal

Most of my questions and concerns are solved during the author rebuttal and I have no other questions.

Questions for Authors

Can the authors elaborate on how the method might be adapted or scaled for neural networks with orders of magnitude more parameters (e.g., in the millions), especially regarding the computational cost of Jacobian and FIM estimation?

Claims and Evidence

The paper claims that the standard ZO perturbation sampling strategy is suboptimal for deep neural networks because it ignores the correlation among parameters. The claims are supported by both mathematical derivations and comprehensive empirical results. One minor potential concern is the additional query cost for computing the FIM, which is addressed via block partitioning, but its impact on extremely large-scale networks remains to be further validated.

Methods and Evaluation Criteria

These methods and evaluation criteria are appropriate for the target application—black-box training.

Theoretical Claims

The paper’s primary theoretical contribution is Theorem 3.2, which establishes that the error between the approximate ZO gradient with natural perturbations and the natural gradient is bounded under an L-smooth assumption. I have checked the proof of this theorem.

Experimental Design and Analyses

The experimental design appears sound. A potential area for further investigation is the scalability of the method to very large-scale networks.

Supplementary Material

I have reviewed the proof of Theorem 3.2 and additional experimental results.

Relation to Broader Scientific Literature

This paper successfully integrates and extends ideas from both natural gradient literature and black-box optimization to address a practical problem in hardware-based neural networks.

Essential References Not Discussed

Wierstra, D., Schaul, T., Glasmachers, T., et al. Natural evolution strategies. Journal of Machine Learning Research, 15(1): 949-980, 2014.

Other Strengths and Weaknesses

Strengths:

  1. The idea of designing the sampling distribution by regularizing both PSD and FSD is innovative. The derivation of the optimal covariance and the accompanying error bound lend strong theoretical support.

  2. Evaluations across multiple datasets and architectures, along with sensitivity analyses, provide robust empirical evidence.

Weakness:

Although the block-coordinate method mitigates query costs, it remains to be seen how the approach scales to very large-scale networks (e.g., millions of parameters).

Other Comments or Suggestions

N/A

Author Response

Thank you for reviewing our paper and evaluating the idea of designing the sampling distribution as innovative.

Can the authors elaborate on how the method might be adapted or scaled for neural networks with orders of magnitude more parameters (e.g., in the millions), especially regarding the computational cost of Jacobian and FIM estimation?

Yes. Stemming from the experimental condition of the last row (CIFAR10) of Table 1, we additionally performed experiments with a larger MLP-mixer. The number of mixers has increased from 3 to 12 and the channel width from 32 to 256, increasing the number of parameters from 33,642 to 1,706,762. The following table shows the computational cost with the million parameters. The difference between ZO-I and ZO-NP roughly corresponds to the cost of Jacobian and FIM estimation. While the elapsed-time (reported as seconds/epoch) overhead is less than 3%, the memory footprint cost is significant. We can reduce the memory footprint by decreasing the maximum block size $N_\mathrm{max}$, as shown in the last two rows. As a tradeoff, however, this increases the number $B$ of blocks and consequently the elapsed time in seconds per epoch.

Elapsed time in seconds per epoch and memory footprint for CIFAR10 with the larger MLP-mixer

| method | $N_\mathrm{max}$ | $B$ | seconds/epoch | memory footprint (GB) |
|---|---|---|---|---|
| ZO-I | 512 | 3,334 | 638.8 | 15.68 |
| ZO-co | 512 | 3,334 | 588.0 | 14.35 |
| ZO-NP | 512 | 3,334 | 656.5 | 40.87 |
| ZO-NP | 256 | 6,668 | 698.2 | 23.77 |
| ZO-NP | 128 | 13,335 | 790.8 | 21.80 |

One minor potential concern is the additional query cost for computing the FIM, which is addressed via block partitioning—but its impact on extremely large-scale networks remains to be further validated. ... Although the block-coordinate method mitigates query costs, it remains to be seen how the approach scales to very large-scale networks (e.g., millions of parameters).

According to our current procedure for computing the FIM shown in Algorithm 2, the additional query cost is not affected by the scale of a network (the number $N$ of parameters) as long as the block size $N_b \leq N_\mathrm{max}$ is the same. This is because only one block $b$ is sampled in line 9. The problem for a large-scale network is instead that the number $B$ of blocks is large, and it takes many epochs to compute the FIMs for all blocks. In other words, the FIMs of many blocks remain the initialized identity matrix $\mathbf{I}$ until many epochs have passed.

The following table shows such situations. The second row corresponds to the last row in Table 1, where the number $B$ of blocks is not so large. The third row corresponds to the larger MLP-mixer with a million parameters introduced above. Since $B$ is large in this larger network, some blocks remained at their initialization even after 1000 epochs with the update-frequency hyperparameter $T_\mathrm{ud}=100$. We feel that we can show the scalability, and also some limitations, of our current method/algorithm by additionally reporting results like these two tables in the revised manuscript. Thanks for raising these issues.

The number of blocks whose FIMs have been computed (rows) after a given number of epochs (columns).

| | $N$ | $B$ | 1 | 2 | 5 | 10 | 20 | 50 | 100 | 200 | 500 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MLP-mixer in Table 1 | 33,642 | 66 | 5 | 10 | 21 | 36 | 51 | 63 | 66 | 66 | 66 | 66 |
| Larger MLP-mixer | 1,706,762 | 3,334 | 5 | 10 | 25 | 50 | 98 | 240 | 469 | 878 | 1,767 | 2,594 |
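
To make the bookkeeping behind these counts concrete, the following is a small illustrative simulation (not the paper's Algorithm 2; step counts and names are placeholders): all $B$ block FIMs start as the identity, and every $T_\mathrm{ud}$ steps a single uniformly sampled block has its FIM (re)estimated.

```python
import numpy as np

def simulate_fim_coverage(B, steps_per_epoch, epochs, T_ud=100, seed=0):
    """Count how many block FIMs have been estimated at least once, per epoch.

    Mirrors the behaviour described above: FIMs start as the identity and a
    single uniformly sampled block is updated every T_ud optimization steps.
    The actual steps-per-epoch values used in the paper are not reproduced here.
    """
    rng = np.random.default_rng(seed)
    estimated = np.zeros(B, dtype=bool)
    coverage, step = [], 0
    for epoch in range(1, epochs + 1):
        for _ in range(steps_per_epoch):
            step += 1
            if step % T_ud == 0:
                estimated[rng.integers(B)] = True   # one block sampled, as in line 9
            # (the ZO parameter update itself would happen here)
        coverage.append((epoch, int(estimated.sum())))
    return coverage
```

Under uniform sampling with replacement, the expected coverage after $u$ block updates is $B\bigl(1-(1-1/B)^{u}\bigr)$, which is why thousands of blocks can remain at the identity for hundreds of epochs when $B$ is in the thousands.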

Essential References Not Discussed: Wierstra D, Schaul T, Glasmachers T, et al. Natural evolution strategies[J]. The Journal of Machine Learning Research, 2014, 15(1): 949-980.

We will refer to this paper in the revised manuscript and discuss the differences summarized in the following table. It shows that the proposed method of natural perturbations is superior in the sense that, due to the size of the FIM involved, it can be used for larger problems (a larger number $N$ of parameters) than natural evolution strategies. Thanks for suggesting the reference.

Differences between Natural perturbations and Natural evolution strategies

| | Natural perturbations | Natural evolution strategies |
|---|---|---|
| FIM is computed for | the distributions that the neural network expresses | the sampling distribution |
| FIM is used for | serving directly as the covariance matrix of the sampling distribution | iteratively updating the parameters of the sampling distribution with a small learning rate, in a natural gradient manner |
| FIM of | the neural network parameters to be optimized, whose number is $N$ | all kinds of parameters of the sampling distribution, e.g., mean and covariance matrix for a multivariate normal distribution |
| size of FIM | $N \times N$ | $(N+N^2)\times(N+N^2)$ for a multivariate normal distribution |
Review (Rating: 4)

This paper introduces the concept of natural perturbations for black-box training of neural networks using zeroth-order (ZO) optimization. The authors propose a novel sampling strategy for parameter perturbations that maximizes entropy while regularizing the distribution so that the neural network's conditional probability distribution does not change drastically, taking into account both the parameter space discrepancy (PSD) and the function space discrepancy (FSD), inspired by the natural gradient method. The approach involves partitioning parameters into blocks to efficiently compute the Fisher information matrix (FIM) and update the covariance matrix for perturbation sampling. Experimental results demonstrate the superiority of this method over existing zeroth-order optimization techniques across various datasets and architectures.
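
For concreteness, the following is a minimal sketch of how block-wise correlated perturbations can plug into a standard one-sided ZO estimator; it is illustrative only and not the authors' implementation. The construction of each block covariance $\Sigma_b$ from the block FIM follows the paper's Section 3 and is taken as given here, and all names are placeholders.

```python
import numpy as np

def zo_step_with_block_perturbations(f, theta, blocks, Sigma_blocks,
                                     Q=10, mu=1e-3, lr=1e-2, rng=None):
    """One ZO update using perturbations drawn block-wise from N(0, Sigma_b).

    f            : black-box loss, f(theta) -> float
    theta        : flat parameter vector, shape (N,)
    blocks       : list of integer index arrays partitioning range(N)
    Sigma_blocks : per-block covariance matrices (built from the block FIMs
                   as in the paper; assumed given here)
    """
    rng = np.random.default_rng() if rng is None else rng
    # Factorize each block covariance once so sampling is a mat-vec product.
    chols = [np.linalg.cholesky(S) for S in Sigma_blocks]

    f0 = f(theta)                          # one unperturbed query
    grad_est = np.zeros_like(theta)
    for _ in range(Q):
        # Correlated within a block, independent across blocks.
        eps = np.zeros_like(theta)
        for idx, L in zip(blocks, chols):
            eps[idx] = L @ rng.standard_normal(len(idx))
        # One-sided finite difference along the sampled direction.
        grad_est += (f(theta + mu * eps) - f0) / mu * eps
    grad_est /= Q

    return theta - lr * grad_est           # plain SGD-style update
```

With $\Sigma_b = \mathbf{I}$ for every block this reduces to the isotropic baseline (ZO-I); the contribution under review is precisely the construction of $\Sigma_b$.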

给作者的问题

How does the proposed natural perturbations method scale to larger neural network architectures, particularly in terms of computational resources and training time?

Claims and Evidence

The paper makes several claims regarding the effectiveness of natural perturbations for ZO optimization. These claims are supported by experimental results on various datasets and tasks, including MNIST, FashionMNIST, CIFAR10, Equalization, and Copying memory tasks. The authors show that their method outperforms existing ZO optimization techniques such as ZO-I and ZO-co, as well as other black-box optimization methods like CMA-ES. The results are statistically significant and consistent across different experimental settings.

方法与评估标准

The proposed methods for natural perturbations and block coordinate perturbations make sense for the problem of black-box training of neural networks. The sampling strategy and block diagonal covariance matrix approach are logical extensions of existing ZO optimization techniques, designed to address specific challenges such as parameter correlation and computational efficiency. The evaluation criteria, including test accuracy and training loss, are appropriate for assessing the performance of optimization algorithms in this context.

Theoretical Claims

The paper provides theoretical analysis for the approximation error bound of the zeroth-order gradient approximation by natural perturbations. The proof appears sound and is based on standard assumptions from non-convex optimization literature.

Experimental Design and Analyses

The experimental design involves training the proposed method alongside baseline methods like ZO-I and ZO-co, using identical configurations to ensure a fair comparison. However, the experiments primarily focus on small datasets, and it would be beneficial to explore performance on larger datasets and more diverse tasks including NLP.

Supplementary Material

The supplementary material includes detailed configurations and additional experimental results.

Relation to Broader Scientific Literature

The paper contributes to the broader scientific literature by addressing the challenge of efficient zeroth-order optimization for neural networks implemented in hardware. This work aligns with recent efforts in black-box optimization techniques and natural gradient methods.

Essential References Not Discussed

N/A.

Other Strengths and Weaknesses

Strengths:

++ The paper addresses an important practical problem in neural network training for hardware implementations.

++ The proposed method shows consistent improvement over existing techniques across multiple benchmarks.

++ The theoretical analysis provides a solid foundation for the proposed sampling strategy.

Weaknesses:

-- The paper could benefit from more detailed ablation studies to disentangle the effects of sample number.

-- The experiments primarily focus on small datasets, and it would be beneficial to explore performance on larger datasets and more diverse tasks including NLP.

Other Comments or Suggestions

I suggest the authors conduct more comprehensive experiments across diverse scenarios, including NLP tasks. This would help validate the effectiveness of natural perturbations in real-world applications.

Author Response

Thank you for reviewing our paper and evaluating that the proposed method makes sense, experimental improvements are consistent, and the theoretical analysis provides a solid foundation.

How does the proposed natural perturbations method scale to larger neural network architectures, particularly in terms of computational resources and training time?

Stemming from the experimental condition of the last row (CIFAR10) of Table 1, we additionally performed experiments with a larger MLP-mixer. The number of mixers has increased from 3 to 12 and the channel width from 32 to 256, increasing the number of parameters from 33,642 to 1,706,762. The following table shows how the computational resources, in terms of memory footprint and training time (seconds per epoch), have increased with the 50 times larger number of parameters.

Memory footprint and elapsed time in seconds per epoch for CIFAR10 with MLP-mixers

| method | #mixers | channel width | $N$ | $N_\mathrm{max}$ | $B$ | memory footprint (GB) | seconds/epoch |
|---|---|---|---|---|---|---|---|
| ZO-I | 3 | 32 | 33,642 | 510 | 66 | 3.95 | 42.1 |
| ZO-co | 3 | 32 | 33,642 | 510 | 66 | 3.95 | 40.0 |
| ZO-NP | 3 | 32 | 33,642 | 510 | 66 | 7.36 | 45.4 |
| ZO-I | 12 | 256 | 1,706,762 | 512 | 3,334 | 15.68 | 638.8 |
| ZO-co | 12 | 256 | 1,706,762 | 512 | 3,334 | 14.35 | 588.0 |
| ZO-NP | 12 | 256 | 1,706,762 | 512 | 3,334 | 40.87 | 656.5 |

We observe that the memory footprint and the training time have increased by 5.5 and 14.5 times, respectively, which is rather mild considering the 50 times larger number of parameters. More critical is the linearly increasing (50 times) number $B=3{,}334$ of blocks. This is due to the nature of the block coordinate approach with the same $N_\mathrm{max}$, and it takes many epochs to compute the FIMs for all $B=3{,}334$ blocks. See the second table in the reply to Reviewer 73sf for a related experimental result. We think that the proposed natural perturbations method scales to larger networks to some extent, but a network with one million parameters might be a limitation. We will include such a discussion in the revised manuscript. Thanks for raising this issue.
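
(A back-of-the-envelope check, under the assumption that the parameters are simply split into blocks of at most $N_\mathrm{max}$ entries: the reported block counts match

$$B \approx \left\lceil \frac{N}{N_\mathrm{max}} \right\rceil, \qquad \left\lceil \frac{33{,}642}{510} \right\rceil = 66, \qquad \left\lceil \frac{1{,}706{,}762}{512} \right\rceil = 3{,}334,$$

so $B$ grows linearly in $N$ at a fixed $N_\mathrm{max}$.)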

-- The experiments primarily focus on small datasets, and it would be beneficial to explore performance on larger datasets and more diverse tasks including NLP.

As reported above, we have tested a larger neural network with one million parameters. However, it is difficult for our current algorithm to perform experiments on larger datasets and tasks such as LLM fine-tuning, where the number of parameters is much larger (perhaps hundreds of millions at the smallest). We feel that we need a different strategy, e.g., a much stronger approximation of the FIM than making it block diagonal, and would like to address this as future work in another paper.

-- The paper could benefit from more detailed ablation studies to disentangle the effects of sample number.

We might have misunderstood this comment, but if 'sample number' means the number $Q$ of perturbations in Eq. (10), we agree that we should additionally report results with varying $Q$, so we have obtained the following results. The query budget for each $Q$ was determined by the number of queries that ZO-I and ZO-co consumed for 100 epochs. We observe that the proposed method ZO-NP consistently outperformed the other two methods, except in the extreme cases ($Q=1,2$), where the number of epochs allowed was small because the relative query cost of the black-box Jacobian computation becomes large. We will include such results in the revised manuscript. Thanks for the comment.

Test accuracies of FashionMNIST with varying the number $Q$ of perturbations in (10)

| $Q$ | 1 | 2 | 5 | 10 | 20 | 50 | 100 |
|---|---|---|---|---|---|---|---|
| ZO-I | 0.710 | 0.744 | 0.775 | 0.788 | 0.794 | 0.818 | 0.832 |
| ZO-co | 0.730 | 0.760 | 0.778 | 0.803 | 0.820 | 0.841 | 0.847 |
| ZO-NP | 0.716 | 0.763 | 0.822 | 0.840 | 0.858 | 0.869 | 0.873 |
| query budget ($\times 10^6$) | 0.12 | 0.18 | 0.36 | 0.66 | 1.26 | 3.06 | 6.06 |
| #epochs allowed for ZO-NP | 31 | 41 | 58 | 71 | 82 | 92 | 95 |
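
(The listed query budgets are consistent with one unperturbed query plus $Q$ perturbed queries per update over roughly $6\times 10^{4}$ updates in total, i.e.,

$$\text{query budget} \approx (Q+1)\times 6\times 10^{4}, \qquad \text{e.g.}\;\; (10+1)\times 6\times 10^{4} = 0.66\times 10^{6};$$

this is our reading of the table rather than a detail stated explicitly here.)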
Review (Rating: 4)

The paper proposes a sampling strategy for ZO optimization, where the perturbation is sampled from a multivariate Gaussian distribution with a designed covariance matrix. The covariance matrix is designed by regularizing not only the expected PSD but also the FSD. Adopting the concept of the natural gradient, this perturbation is named natural perturbation. To work around the large cost of computing the covariance matrix explicitly, the authors propose a block coordinate approach for more efficient computation. Experiments show that the proposed method is superior in convergence speed and final test accuracy compared to existing methods.

update after rebuttal

The authors have addressed most of my questions and I thus updated the score.

Questions for Authors

I already listed my questions in "Claims And Evidence", "Methods And Evaluation Criteria", and "Theoretical Claims". I hope the authors can help address these questions, and I would be happy to update my score accordingly.

Claims and Evidence

The concept of natural perturbation - the motivation, intuition and derivation - is well presented, and I am convinced that this concept is well established.

The efficient black-box method of computing the FIM (and then the covariance matrix) is also well presented. The algorithm, and the motivation and intuition behind it, are made clear.

As for the empirical results showing that the proposed method is superior, the overall experimental setup is solid, but I have some questions which I will elaborate on in the "Methods And Evaluation Criteria" section.

The main concern is that, since the proposed method introduces a large computation overhead (even after applying the block coordinate approach), the comparison is only fair when the computation cost is the same. It would be interesting to see whether the proposed method outperforms the baseline with the same computation budget, or whether it gives an interesting trade-off. I don't feel the discussion around the computation overhead is sufficient, nor that the comparison is "fair". If the authors can address that in the revisions I would be happy to update my review score.

Methods and Evaluation Criteria

Overall it is good, but I have some questions/concerns. Besides the major concern already mentioned in "Claims And Evidence", my questions are:

  1. Why does ZO-I also use blocking in Table 1? Can we add a comparison to the most naive baseline, i.e., ZO-I without blocking?
  2. For ZO-co and ZO-NP, are they using the same Q? If so, I wonder whether such a small Q makes the perturbation quite sparse, resulting in slow convergence. Since ZO-NP has a higher computation overhead, it seems to make sense to compare it with ZO-co using a larger Q (with a similar computation cost to ZO-NP) for a fair comparison.
  3. In Figure 6, it looks like training has not fully converged, and I wonder what happens if we increase the number of epochs until it fully converges. I wonder whether ZO-NP has an advantage only in convergence speed, or also in the final test accuracy.

Theoretical Claims

Yes and I don't find any major issue. There are a few questions though:

  1. In line 167, left column, it mentions that a good sampling strategy should maximize the entropy of the distribution. Why is that? Is there theoretical or empirical evidence supporting it? Actually, in the experiments, can you add another baseline that samples from a distribution with smaller entropy?

  2. In line 321, left column, it mentions that the FIM also becomes block diagonal. Why?

  3. In line 216, right column, it mentions that a too-large FSD leads to unstable training. Is there evidence to support this?

  4. In line 87, right column, what is z and what does p(z|y) mean?

Experimental Design and Analyses

Overall it is solid; for questions, please refer to "Methods And Evaluation Criteria".

Supplementary Material

No

Relation to Broader Scientific Literature

Zeroth-order optimization has broad applications in many fields. This paper proposes a general approach to achieving better performance in zeroth-order optimization, which I believe can have a large impact on the broader literature.

Essential References Not Discussed

References are sufficiently discussed in the paper.

Other Strengths and Weaknesses

I am not sure what point the authors are trying to make in Section 3.2 and Figure 3. As (13) already expresses the trade-off between entropy, PSD, and FSD well, the results in Figure 3 are not surprising but almost a bit trivial.

Other Comments or Suggestions

None.

update after rebuttal

The authors have addressed most of my questions and I thus updated my score accordingly.

Author Response

Thank you for reviewing our paper and evaluating that the concept of natural perturbation is well established.

It would be interesting to see if the proposed method outperforms the baseline with the same computation budget, or the proposed method gives an interesting trade-off.

The computation cost can be characterized by the elapsed time and the memory footprint. For the former, we already show the corresponding results in the right plot of Figure 6 and the bottom plots of Figure 12. These show that the proposed method ZO-NP outperformed the baselines with the same elapsed time. For the latter, we newly report the results in the reply to Reviewer 2tFg (please see the first two tables). With these tables and Figure 7, we understand that ZO-NP trades off test accuracy and memory overhead. Thanks for raising this issue. We will include these memory footprint results in the revised manuscript.

Why in Table 1, ZO-I also uses blocking. Can we add the comparison to the most naive baseline i.e. ZO-I without blocking?

The reason to use blocking also for ZO-I is that the performance gets better with an appropriate block size $N_\mathrm{max}$, as Figure 7 shows. And we already have some results without blocking in Figure 11 in Appendix D.2. While the main intention there was to compare with CMA-ES, we can also compare those results with ZO-I.

For ZO-co and ZO-NP, are they using the same Q? ... Since ZO-NP has a higher computation overhead, it seems to make sense to compare it with ZO-co with a larger Q (which have similar computation cost to ZO-NP) for a fair comparison.

We used the same Q for ZO-co and ZO-NP, and ZO-co results in a sparse perturbation (a sparse estimated gradient as a weighted sum of perturbations to be exact) as you pointed out. For more related information, see the discussion on the number of changing parameters in Section 5.3.2. Regarding the computational overhead of ZO-NP, we have already made a fair comparison in terms of elapsed time (the right plot of Figure 6 and the bottom plots of Figure 12).

In figure 6, it looks like it is not fully converged and I am wondering what happens if we increase the epochs till it fully converges. I am wondering if ZO-NP has the advantage just in convergence speed, or also in the ending test accuracy.

The two tables below show what happens with more epochs (three times more). The epochs allowed for ZO-NP, which consumed extra queries for the Jacobian computation, are shown in parentheses. The training losses continued to decrease, with the speed ordering ZO-I < ZO-co < ZO-NP. The test accuracies converged for ZO-co and ZO-NP but not for ZO-I. It seems that ZO-NP entered the overfitting regime.

Training loss

| epochs | 1000 (975) | 2000 (1950) | 3000 (2925) | 4000 (3900) |
|---|---|---|---|---|
| ZO-I | 1.039 | 0.962 | 0.919 | 0.889 |
| ZO-co | 0.966 | 0.881 | 0.826 | 0.787 |
| ZO-NP | 0.824 | 0.731 | 0.672 | 0.631 |

Test accuracy

| epochs | 1000 (975) | 2000 (1950) | 3000 (2925) | 4000 (3900) |
|---|---|---|---|---|
| ZO-I | 0.592 | 0.614 | 0.614 | 0.621 |
| ZO-co | 0.604 | 0.623 | 0.633 | 0.633 |
| ZO-NP | 0.622 | 0.632 | 0.636 | 0.632 |

In line 167 left column, it mentions that a good sampling strategy should maximize the entropy of the distribution. Why is that? Is there theoretical or empirical evidence supporting it? Actually, in the experiment, can you add another baseline with sampling from another distribution with a smaller entropy?

The reason is to make the sampled perturbations explore as widely as possible. This statement is our new way of understanding the sampling in ZO optimization. Section 3.1, especially the paragraph starting from line 213, left column, provides theoretical support, as the existing sampling strategy from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ can be derived from this statement. We could perform some experiments where the entropy of the sampling distribution is randomly reduced independently of the FIM. Please let us know in a further reply if you would like to see the results.
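
(A standard fact supports this reading, under the assumption that the constraint is expressed through the second moment of the perturbation: among zero-mean densities on $\mathbb{R}^N$, the Gaussian maximizes differential entropy for a given second moment,

$$\max_{p}\ \mathbb{H}[p] \quad \text{s.t.}\quad \mathbb{E}_{p}\!\left[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}^{\top}\right]=\Sigma \;\;\Longrightarrow\;\; p=\mathcal{N}(\mathbf{0},\Sigma),$$

so an isotropic constraint yields $\mathcal{N}(\mathbf{0},\sigma^{2}\mathbf{I})$, while the FSD-aware objective (13) instead shapes $\Sigma$; the paper's exact derivation is not reproduced here.)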

In line 321left column, it mentions that FIM also becomes block diagonal. Why?

This is because the block matrix size is the same for $\Sigma_b$ and $\mathbf{F}_{\theta^{(b)}}$ by (34).

In line 216 right column, it mentions a too large FSD leads to unstable training. Is there evidence to support it?

A good lecture note can be found at https://www.cs.toronto.edu/~rgrosse/courses/csc2541_2022/readings/L03_metrics.pdf and Figure 1 there on the Rosenbrock function is very intuitive.

In line 87 right column, what is z and what does p(z|y) means?

The specific form of $p(z\mid y)$ depends on the type of task the neural network performs. For example, if the neural network performs a classification task, $p(z\mid y)$ is a multinomial distribution, as explained in Section 4.1.1. The variable $z$ is a random vector that is marginalized over in the FIM computation (28) and specified by a target vector in the loss computation (1).
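
To make the role of $p(z\mid y)$ concrete, here is a generic illustration (not the paper's black-box Algorithm 2; it assumes $y$ are the logits of a softmax classifier, and all names are placeholders). Because the output-space FIM of a multinomial has the closed form $\mathrm{diag}(\mathbf{p})-\mathbf{p}\mathbf{p}^{\top}$, the random target $z$ is marginalized analytically, and the parameter-space FIM is obtained by pulling this matrix back through the Jacobian of the network output.

```python
import numpy as np

def block_fim_classification(jac_logits, probs):
    """Block FIM via F_theta = J^T F_y J, averaged over inputs.

    jac_logits : (num_inputs, num_classes, block_size) array; Jacobian of the
                 logits w.r.t. one parameter block (in the paper this Jacobian
                 is obtained through black-box queries).
    probs      : (num_inputs, num_classes) array of softmax outputs p(z|y).

    For a multinomial p(z|y) parameterized by logits y,
    E_z[grad_y log p(z|y) grad_y log p(z|y)^T] = diag(p) - p p^T,
    so the random target z never has to be sampled.
    """
    n, _, d = jac_logits.shape
    F = np.zeros((d, d))
    for J, p in zip(jac_logits, probs):
        F_y = np.diag(p) - np.outer(p, p)   # output-space FIM (closed form in z)
        F += J.T @ F_y @ J                  # pull back to the parameter block
    return F / n
```

If $y$ is parameterized differently (e.g., as probabilities rather than logits), the output-space FIM changes accordingly; the sketch only illustrates where $z$ and $p(z\mid y)$ enter the computation.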

Reviewer Comment

Thanks for the rebuttal. I have updated my review score accordingly.

Final Decision

This paper proposes a sampling strategy for zeroth-order optimization called natural perturbations. The proposed natural perturbations take into account both the parameter space discrepancy and the function space discrepancy. It also introduces a block coordinate method to handle high-dimensional parameters and reduce the computational cost. The experimental results on several tasks and architectures demonstrate the superiority of the proposed method.

This paper is well written and well motivated. The proposed method is sound, and its effectiveness is shown through numerical experiments. As the experimental evaluations are limited to relatively small datasets and architectures, it would be better to show the effectiveness of the proposed method on larger datasets and architectures, which might be future work. I also recommend that the authors include the additional results regarding computational cost and memory footprint provided in the rebuttal.

As all reviewers give positive scores, I recommend accepting this paper.