Maximum Noise Level as Third Optimality Criterion in Black-box Optimization Problem
Abstract
Reviews and Discussion
The paper studies zero-order optimization of highly-smooth strongly-convex functions. The main result is an algorithm for this setting which works even if there are sufficiently small biases in the oracle, by generalizing the analysis of an algorithm of Vaswani et al. (2019) to account for such biased oracles.
Questions for Authors
Claims and Evidence
The theoretical claims seem correct, although I did not carefully check them.
The major problem in this submission is the correctness of the novelty claims. In short, the authors claim to have invented something which is well known, as studied by a substantial body of work, which is not mentioned/cited at all. See "Relation To Broader Scientific Literature" for more details.
Methods and Evaluation Criteria
No problem as far as I can tell.
Theoretical Claims
I did not fully check the correctness of the claims, especially the proof of Theorem 2.6 which is the main proof. It does seem reasonable in my opinion.
Experimental Designs or Analyses
I did not check the validity of the experiments.
Supplementary Material
Only very briefly.
Relation to Broader Scientific Literature
This is the main issue in the paper, which in my opinion constitutes a strong reason for rejection.
Throughout the paper, the authors claim to be the first to take biases of oracles into account. This is simply incorrect and ignores substantial literature on this exact topic, which is not mentioned or cited.
Following the seminal paper "First-order methods of smooth convex optimization with inexact oracle" by Devolder, Glineur and Nesterov (MAPR 2014), which has >600 citations, many papers have studied exactly this. Already the abstract of that paper discusses the issue, and it establishes that momentum has a strong effect on the maximal noise level, which the authors here claim to be the first to consider. It is hard to take the current manuscript seriously while it ignores a whole line of work; it is the authors' duty to go over this literature and compare their work to known results, none of which are even mentioned in this manuscript.
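Schematically (constants omitted, and the notation here is not the paper's), the picture established in that line of work for a first-order oracle with inexactness $\delta$ is

$$
f(x_N) - f^* \;\lesssim\; \frac{L R^2}{N} + \delta \quad \text{(gradient method)},
\qquad
f(x_N) - f^* \;\lesssim\; \frac{L R^2}{N^2} + N\,\delta \quad \text{(fast gradient method)},
$$

i.e., the accelerated (momentum) method accumulates the inexactness and hence tolerates a much smaller maximal noise level for a given target accuracy.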
Essential References Not Discussed
As mentioned, "First-order methods of smooth convex optimization with inexact oracle" by Devolder, Glineur and Nesterov is a starting point for discussing inexact (i.e. biased) oracles, but it is definitely not the only paper on this topic. It is beyond the scope of this review to go over the entire literature on inexact optimization methods.
Moreover, other topics touched upon in this work are missing the key citations, and instead cite only more recent works by a small subset of researchers. Some examples:
- The discussion on zero-order optimization misses some key works in the topic, including but not limited to "Random gradient-free minimization of convex functions" by Nesterov and Spokoiny.
- The discussion on randomized smoothing over a ball in the context of zero-order methods does not cite the key papers on this, including but not limited to the papers by Flaxman-Kalai-McMahan, Agarwal-Dekel-Xiao, Shamir and more, which considered and analyzed this long before the given references.
Other Strengths and Weaknesses
Additional weaknesses:
- The writing style in the intro is rather vague. For example, "some companies, due to secrecy, can’t hand over all the information". What information? How does this relate to the exactness of a function-value oracle in optimization?
- Some of the writing of the technical parts is incomprehensible. For example, in Definition 1.3, what is the difference between \xi_1 and \xi_2? What are "e" and "r"?
- In addition to the main point that I made about the lack of novelty, some other (more minor) claims of novelty do not hold either. The authors claim that they "significantly improve the iteration complexity without worsening the oracle complexity". The fact that this can be done by parallelization of the zero-order oracle calls is well known; see, for example, the discussion of parallel complexity in "An Algorithm with Optimal Dimension-Dependence for Zero-Order Nonsmooth Nonconvex Stochastic Optimization" by Kornowski and Shamir.
- After Theorem 2.6, the authors write that "the third summand does not affect convergence much... so we will not consider it in the future for simplicity". This constitutes an additional assumption about the parameter regime, introduced ad hoc in the middle of the paper.
- Remark 3.4 "High probability deviations bound" - the use of Markov's inequality trivially holds always, and is therefore not typically referred to as a high-probability deviation bound in the context of optimization. Typically a more nuanced analysis can reduce the dependence on the failure probability from polynomial to logarithmic, and I strongly suspect this should be the case here as well; see the sketch after this list, and, for example, "Stochastic first- and zeroth-order methods for nonconvex stochastic programming" by Ghadimi and Lan.
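To spell out the distinction (schematically; here $\delta$ denotes the failure probability, not the paper's noise level):

$$
\Pr\big[f(x_N) - f^* \ge \varepsilon\big] \;\le\; \frac{\mathbb{E}\big[f(x_N) - f^*\big]}{\varepsilon},
$$

so an argument based only on Markov's inequality must shrink the expected error by a factor of $\delta$, inflating the complexity polynomially in $1/\delta$, whereas concentration-based analyses typically pay only a $\log(1/\delta)$ overhead.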
Other Comments or Suggestions
Other minor writing issues:
- There are many grammar issues throughout the paper (including the title), e.g., already in the first paragraph ("where f is a function", etc.).
- The term "adversarial attacks" is typically used in ML in the context of feeding adversarial examples to a model, whereas here the authors use it in the context of feeding the optimization algorithm with inexact oracles. This might be misleading.
- Lemma 2.5: "with the chosen parameters" - these parameters were never chosen.
Dear Reviewer ubQW,
Thank you for your feedback. The specific comments are addressed below:
Throughout the paper, the authors claim to be the first paper taking into account biases of oracles. This is simply incorrect…
With all due respect, we disagree with the comment that we claim to be the first to account for bias in the oracle. For example, on line 189 we say that Assumption 2.3 is standard (not new), and on line 437 we note that [1] considers a deterministic concept of noise. However, we agree that we missed some references, including [2-3], which show that the summands in Theorem 2.6 (accumulation of inaccuracy due to bias) appear to be unimprovable. Furthermore, we would like to draw the Reviewer's attention to the originality of our paper: we emphasize the importance of considering the three optimality criteria together for gradient-free algorithms. In particular, through the maximum noise level we can control the error floor (asymptote) to which we want to converge. Theorem 3.1 shows a rather surprising result: beyond a certain batch-size threshold, the maximum noise level depends on both the batch size and the smoothness order, so it can be improved by overbatching. A more prominent example is Remark 3.3, because of the deterministic nature of the adversarial noise (the smoothness order drives the improvement, which is a non-trivial and novel result). In the practical part of our work (the experiments), we confirm the importance of accounting for inaccuracy in the algorithm (by tuning the algorithm parameters).
Moreover, other topics which are touched upon in this work are missing the key citations, and instead cite only very previous works by a small subset of researchers. Some examples…
Thank you for drawing our attention to this issue; we will, of course, expand the Related Work section.
The writing style in the intro is rather vague. For example, "some companies, due to secrecy, can’t hand over all the information". What information? How does this relate to the exactness of a function-value oracle in optimization?
We highlighted this example as another motivation for studying the maximum noise level. In the context of optimization, the information shared between companies can be, for example, function values or a gradient vector. This problem is also addressed in federated learning, in particular by increasing the number of local iterations. We, however, emphasize an alternative approach: a biased oracle.
Some of the writing of the technical parts is incomprehensible…
In Definition 1.3, ξ_1 and ξ_2 are stochastic noise terms with bounded second moment that are independent of e and r, where r is a random value uniformly distributed on the interval [-1, 1] and e is the random direction. Nevertheless, we thank you for drawing our attention to the duplicated notation; we will change it.
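For concreteness, here is a minimal sketch (not taken from the paper; the names and the Gaussian noise model are our illustrative assumptions) of a kernel-based two-point estimator with this kind of noise:

```python
import numpy as np

def two_point_kernel_estimator(f, x, h, kernel, rng, noise_std=0.0):
    """Sketch of a kernel-based two-point zero-order gradient estimator.

    `kernel` is a bounded function K(r) on [-1, 1]; xi1 and xi2 model the
    stochastic oracle noise (bounded second moment, independent of e and r).
    """
    d = x.shape[0]
    e = rng.normal(size=d)
    e /= np.linalg.norm(e)            # e: uniform direction on the unit sphere
    r = rng.uniform(-1.0, 1.0)        # r: uniform on [-1, 1]
    xi1 = noise_std * rng.standard_normal()
    xi2 = noise_std * rng.standard_normal()
    fp = f(x + h * r * e) + xi1       # noisy value at the forward point
    fm = f(x - h * r * e) + xi2       # noisy value at the backward point
    return d / (2.0 * h) * (fp - fm) * kernel(r) * e
```

For smoothness order up to 2, the kernel K(r) = 3r satisfies the usual normalization E[r K(r)] = 1 for r uniform on [-1, 1], so `kernel=lambda r: 3 * r` is a valid choice in this sketch.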
In addition to the main point that I made about the lack of novelty…
Indeed, by parallelizing the zero-order oracle calls we achieve the optimal estimate on the iteration complexity, although at first glance taking such a large batch size may seem unreasonable. In our work, we show that in this regime the maximum noise level takes advantage of the increased smoothness (the higher the smoothness order, the higher the maximum noise level, and the more accurately the algorithm converges).
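As a minimal illustration of the batching we have in mind (a sketch only, reusing the hypothetical `two_point_kernel_estimator` from the sketch above; the loop over the batch can be executed in parallel):

```python
import numpy as np

def batched_gradient_estimate(f, x, h, kernel, batch_size, rng):
    """Average `batch_size` independent two-point estimates.

    Each estimate costs two oracle calls, so one iteration uses
    2 * batch_size oracle calls, but the calls are independent and can
    be issued in parallel, so the iteration count itself does not grow.
    """
    estimates = [
        two_point_kernel_estimator(f, x, h, kernel, rng)
        for _ in range(batch_size)
    ]
    return np.mean(estimates, axis=0)
```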
After Theorem 2.6, the authors write that "the third summand does not affect convergence much...
If you recommend it, we will keep this summand in the analysis. However, even in this case the results of the main theorems remain unchanged.
Remark 3.4 "High probability deviations bound" - the use of Markov's inequality trivially holds always…
We have indicated this in the remark, as it is a useful clarification for future work, given the consistency of the results (linear convergence and the presence of randomization).
There are many grammar issues…
Thanks, we'll fix it.
The term "adversarial attacks" is typically used in ML…
We will add references to relevant works to confirm the terms used.
Lemma 2.5: "with the chosen parameters"...
We used this phrasing for brevity (e.g., in Theorem 2.6 we state these parameters explicitly).
[1] https://arxiv.org/pdf/1802.09022
[2] https://dial.uclouvain.be/pr/boreal/object/boreal%3A128257/datastream/PDF_01/view
[3] https://www.tandfonline.com/doi/pdf/10.1080/10556788.2023.2212503
If you agree that we managed to address all issues, please consider raising your grade to support our work. If you believe this is not the case, please let us know so that we have a chance to respond.
With Respect,
Authors
This paper proposes a zero-order method.
update after rebuttal:
This type of writing issue cannot be resolved in the reviewing process of conferences.
Questions for Authors
none
Claims and Evidence
A paper with this level of writing should not be considered at all regardless of its contribution.
Methods and Evaluation Criteria
A paper with this level of writing should not be considered at all regardless of its contribution.
Theoretical Claims
no
Experimental Designs or Analyses
no
Supplementary Material
I reviewed some parts of the supplementary, mainly those related to the theoretical aspects.
Relation to Broader Scientific Literature
A paper with this level of writing should not be considered at all regardless of its contribution.
Essential References Not Discussed
I did not verify this.
Other Strengths and Weaknesses
The paper is written in an uncommon manner, using unscientific language and standards. Examples: the abstract contains mathematical formulas and elements that should only appear in the body of the text; Figure 1; the phrasing in lines 45-47 and lines 215-217. Beyond that, general grammar and phrasing are lacking; a few examples:
- Examples above
- Definition 2.2
- Assumption 2.4: "there exists constants"
- "Assumption 2.3 is standard for analysis, bounding bias."
I stopped here. A paper with this level of writing should not be considered at all regardless of its contribution.
Other Comments or Suggestions
Equation (2): you did not define the set used there. Line 566, eq. (8): this is a set, so what does this expression mean?
I wrote other examples in the strengths and weaknesses.
Dear Area Chair and Senior Area Chair,
Please check the Official Review by Reviewer NqNV for quality, constructiveness, correctness, and compliance with the ICML Code of Conduct.
With Respect,
Authors
This paper theoretically analyzes the black-box optimization with noisy feedback under the accelerated stochastic gradient descent framework. In particular, this paper generalizes existing convergence results for accelerated stochastic gradient descent to the case where the gradient oracle is biased. In addition, this paper provides an improved iteration complexity for smooth functions.
Questions for Authors
Q1. What is the difference between the assumptions in this paper and those of the related works in Table 1? This is not clearly compared in the paper.
Q2. Although the theoretical analysis is good, I am concerned about the practical performance of the proposed method (Algorithm 1). Could the authors provide comparisons with more black-box optimization baselines, e.g., CMAES?
Claims and Evidence
This paper claims to achieve an improved iteration complexity. However, the assumptions of the various related works appear to differ from those in Theorem 3.1 of this paper. It would be better to compare the assumptions in addition to the iteration complexity.
Methods and Evaluation Criteria
In this paper, the proposed method is only evaluated on one toy problem. This is not convincing enough to justify the practical performance and the claim of outperforming SOTA.
Theoretical Claims
I did not check the details of the proof.
Experimental Designs or Analyses
In this paper, the proposed method is only evaluated on one toy problem. This is not convincing enough to justify the practical advantage and the claim of outperforming SOTA.
Supplementary Material
I did not check the Appendix in detail.
Relation to Broader Scientific Literature
Prior related works analyze the iteration complexity for convex and non-convex cases under different assumptions.
This paper seems to provide an improved iteration complexity. Moreover, the analysis of the maximum noise level is somewhat novel in gradient-based zeroth-order optimization. However, the noise level has been studied in the Bayesian optimization area.
The zeroth-order gradient estimator in Eq. (5) is not new; see, for example, Eq. (4) in [Bach et al. 2016].
The black-box optimization methods in a broader context, namely the line of Evolutionary Strategies (e.g., ES, NES, CMAES) and the line of Bayesian Optimization (e.g., GP-UCB, TuRBO), are not discussed or compared against.
[Bach et al. 2016] Bach, F. and Perchet, V. Highly-Smooth Zero-th Order Online Optimization. COLT 2016.
Essential References Not Discussed
NA
Other Strengths and Weaknesses
See above
Other Comments or Suggestions
NA
Dear Reviewer Hj39,
Thanks for the insightful review. We are happy to see that the reviewer emphasized that the article provides a good theoretical analysis. The specific comments are addressed below:
This paper claimed achieving an improved iteration complexity. However, it seems that assumptions of the various related works are different from the one in Theorem 3.1 in this paper. It is better to compare the assumptions besides the iteration complexity. & Prior related works analyze the iteration-complexity for convex cases and non-convex cases under different assumptions.
Certainly, we will add columns with the assumptions to the final version of the paper for a clear comparison. However, we would like to point out that the assumptions on the objective function are the same (strong convexity and higher-order smoothness), while the assumption on the gradient approximation (kernel approximation) differs only in the bounded-noise assumption. In our work, we consider the generalized strong growth condition because we base our analysis on [1] (see Section 2) to create a gradient-free algorithm; Assumption 2.4 is more general, and in a special case the assumptions coincide. We would also like to note that a better estimate on the iteration complexity cannot be achieved (the estimate is optimal), provided that the gradient-free algorithm is built on top of a first-order algorithm.
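For reference, the strong growth condition of Vaswani et al. and its common generalized form read (the notation here is schematic and may differ from the paper's):

$$
\mathbb{E}_{\xi}\big[\|\nabla f(x,\xi)\|^2\big] \;\le\; \rho\,\|\nabla f(x)\|^2
\qquad\text{vs.}\qquad
\mathbb{E}_{\xi}\big[\|\nabla f(x,\xi)\|^2\big] \;\le\; \rho\,\|\nabla f(x)\|^2 + \sigma^2,
$$

and the two coincide when the additive term vanishes.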
In this paper, the proposed method is only evaluated on one toy problem. It is unconvincing to justify the practical performance and the claim of outperforming SOTA. & Q2. Although the theoretical analysis is good. I am concerned about the practical performance of the proposed method (Algorithm 1). Could the authors provide more comparison with more black-box optimization baselines besides, e.g., CMAES?
As the title suggests, our work emphasizes the importance of considering the three optimality criteria together. In particular, through the maximum noise level we can control the error floor (asymptote) to which we want to converge. Theorem 3.1 shows a rather surprising result: beyond a certain batch-size threshold, the maximum noise level depends on both the batch size and the smoothness order, so it can be improved by overbatching. A more prominent example is Remark 3.3, because of the deterministic nature of the adversarial noise (improvements occur only through increased smoothness). In the practical part of our work (the experiments), we confirm the importance of taking the presence of inaccuracy into account (by tuning the algorithm parameters). Given the theoretical nature of the work, the narrative we convey to the reader through the theoretical results, and the confirmation through a practical experiment on a9a data (compared against the most related work, accelerated zero-order algorithms), it seems to us that additional practical experiments are beyond the scope of this study. Regarding the comparison with the CMAES algorithm, we will certainly add it to Figure 3, but we expect ARDFDS and our algorithm to outperform CMAES due to the accelerated nature of convergence on the logistic regression objective.
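For concreteness, a minimal sketch of the kind of setup we mean (synthetic data stands in for a9a here, and all names are illustrative; the zero-order methods being compared would query `oracle` only through function values):

```python
import numpy as np

def make_logreg_oracle(A, b, reg=1e-2, noise_std=0.0, seed=0):
    """Wrap a regularized logistic-regression loss as a black-box oracle.

    Only (possibly noisy) function values are exposed, which is the
    access model assumed by gradient-free methods.
    """
    rng = np.random.default_rng(seed)

    def oracle(x):
        margins = b * (A @ x)
        loss = np.mean(np.logaddexp(0.0, -margins)) + 0.5 * reg * np.dot(x, x)
        return loss + noise_std * rng.standard_normal()

    return oracle

# Synthetic stand-in for the a9a data (shapes are illustrative).
rng = np.random.default_rng(0)
A = rng.standard_normal((200, 20))
b = np.sign(rng.standard_normal(200))
oracle = make_logreg_oracle(A, b, noise_std=1e-3)
```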
The zeroth-order gradient estimator in Eq.(5) is not new. For example, in Eq.(4) in [Bach et al. 2016].
Yes, indeed; we assumed it would be clear to the reader from Table 1 that all the listed algorithms use the kernel approximation. Nevertheless, we agree that adding a reference to the original source ([2]) of the gradient approximation that exploits the information about the higher smoothness of the function would improve the quality of the text.
The black-box optimization methods in a broader context, the line of Evolutionary Strategy (e.g, ES, NES, CMAES), and the line of Bayesian Optimization (e.g., GP-UCB, TuRBO), are not discussed and compared.
Thank you for this comment; we will add discussions of Evolutionary Strategies and Bayesian Optimization to the Related Work section.
Q1. What is the difference between the assumptions in this paper compared with the related works in Table 1? It is not clearly compared in the paper.
We will add additional columns to Table 1 as well as clarifications to the Assumptions in the final version of the paper.
[1] Vaswani S., Bach F., Schmidt M. Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron. The 22nd International Conference on Artificial Intelligence and Statistics (2019).
[2] Polyak B., Tsybakov A. Optimal Order of Accuracy of Search Algorithms in Stochastic Optimization. Problemy Peredachi Informatsii (1990).
If you agree that we managed to address all issues, please consider raising your grade to support our work. If you believe this is not the case, please let us know so that we have a chance to respond.
With Respect,
Authors
Summary: The paper studies zero-order optimization of strongly convex functions with higher-order smoothness. In particular, it proposes and analyzes a new method, Zero-Order Accelerated Batched Stochastic Gradient Descent, for strongly convex and highly smooth functions. The paper also provides an analysis of the maximum noise level and how it is affected by the batch size as well as the objective's order of smoothness.
On Reviews: The reviewers had several concerns about the paper's results (the assumptions used compared to previous work) as well as its experimental evaluation (toy examples). The two major concerns of the reviewers were: (i) the presentation of the paper (unclear exposition of the main contributions in the introduction and of the importance of the theorems in the main part of the paper) and (ii) the lack of comparison with several closely related papers.
The paper has promising ideas. Understanding the performance of a zero-order stochastic optimization algorithm and justifying its convergence with solid theoretical guarantees is a timely topic. I suggest the authors revise their work, considering the feedback received, and submit it to one of the upcoming venues.