/10

Poster4 位审稿人

最低3最高4标准差0.4

ICML 2025

Guided Zeroth-Order Methods for Stochastic Non-convex Problems with Decision-Dependent Distributions

Yuya Hikima,Hiroshi Sawada,Akinori Fujino

OpenReview PDF

提交: 2025-01-15更新: 2025-08-16

TL;DR

Improved zeroth-order methods for stochastic non-convex optimization problems with decision-dependent distributions

摘要

关键词

Zeroth-order methodperformative predictiondecision-dependent distributionguided evolutionary strategy

评审与讨论

审稿意见

评分: 32025-02-27

This paper a new zero-order optimization algorithm for a new algorithm for a special optimization problem. The theoretical results show the proposed method converge to the stationary point and the step of steps to converge is also provided. The covariant matrix as estimated is a type of CMA-ES method (Eq. 6).

给作者的问题

For some tasks, GZO-NS performs better and some tasks GZO-HS perform better. Why? Is there a statement which algorithm will be better under certain condition?
Is assumption 4.3 proper? I think for Wasserstein distance, it is easy to bound the lipschize constant. So it is better to calculate this constant directly rather than using an assumption.
Consider adding an error bar in experiments.
The testing problems are too simple and the benchmarks methods have not converged. What is the main reason?
Line 282 right. Why only this loss function f is considered, will other functions also work?

论据与证据

This paper proposed two methods in zero-order optimization with decision-dependent distributions. All results are well validated by theoretical proofs and simulations.

方法与评估标准

The using evaluation criteria are training loss, testing loss and AUC. All are all pretty standard criteria.

理论论述

This paper proposed two main theorems, .

实验设计与分析

The testing problems multiproduct pricing and strategic classification are a bit simple. Consider using more complex realworld problem.
Why the other three methods perform so poorly. Are these methods fair benchmarks?
Where the the error bar in Figure 1?

补充材料

The supplementary material provide plenty of theoretical proofs of the main proof as provided in the main paper.

与现有文献的关系

This paper is related to zero-order optimization method as proposed by Powell.

遗漏的重要参考文献

Z. Zhang, Scalable Derivative-Free Optimization Algorithms With Low-Dimensional Subspace Techniques, arXiv:2501.04536, 2025 [pdf]
T. M. Ragonneau and Z. Zhang, PDFO: A Cross-Platform Package for Powell’s Derivative-Free Optimization Solvers, Math. Program. Comput., 16:535–559, 2024 [pdf]. Consider compare with some zero-order methods in this lib.

其他优缺点

The theoretical results in this paper are strong.
The notation section is clear, which is admirable.
The related work regarding to Decision-Dependent Distributions is very informative.

其他意见或建议

Lemma 4.4, the constant 5488 is strange. Is 5488 too large?
Is using e as an all vector standard? It is usually used as 1 in previous literature.
The notation $\gamma \in R_{>0}$ is a bit too complex. Consider using $\gamma > 0$ .
In Table 1, the data information is not necessary and takes too much space.

伦理审查问题

NA.

作者回复

2025-04-01

We thank the reviewer for the general appreciation of our work as well as constructive comments.

Questions For Authors.

1: If the distribution $D(x)$ does not change significantly with respect to $x$ , GZO-HS, which uses samples from past iterations, is likely to perform better. This is because the distribution $D(x_i \pm \mu_i u_i)$ at past iteration $i$ is similar to the current distribution $D(x_k)$ , improving the reliability of past samples. Conversely, when $D(x)$ varies greatly with $x$ , GZO-NS, which leverages current partial gradient information, tends to outperform due to its increased accuracy in reflecting the current distribution.

2: Since we were unable to ascertain the exact subject of your Lipschitz constant, we interpreted your question as follows. Please let us know if our understanding is incorrect:

``Is it easy to compute a constant that bounds $\theta$ in Assumption 4.3? It is better to calculate this constant directly rather than relying on the assumption on the Wasserstein distance."

To find such a constant ( $\theta$ or higher) that makes Assumption 4.3 valid, we must obtain samples from $D(x)$ at all $x$ since $D(x)$ is unknown, which is impractical. Moreover, without Assumption 4.3, such a constant may not exist. For example, consider a discrete distribution $D(x)$ over $\xi \in$ { $0,1$ }, where $\xi=0$ with probability $1$ if $x > y,$ and $\xi=1$ with probability $1$ if $x \le y$ . (This setting corresponds to the situation where the seller determines the price $x$ of one product and the buyer buys it ( $\xi=1$ ) with probability 1 if the price $x$ is less than $y$ .) In this case, for any $\theta>0$ , there exists $\epsilon>0$ such that $W(D(y), D(y+ \epsilon)) > \theta \|\epsilon\|$ . Therefore, $\theta$ in Assumption 4.3 does not exist. This is due to the discontinuity of the probability distribution $D(x)$ . Hence, Assumption 4.3, which imposes continuity on $D(x)$ , is necessary.

3: Since Fig. 1 shows convergence in a single instance, we did not include error bars. To add error bars to Fig. 1, we would need to show results averaged over multiple instances. However, when describing averaged results, it becomes difficult to understand how the methods converge. For example, even a method that decreases the function value non-monotonically may appear to decrease it monotonically when averaged values are presented. For this reason, we have included the convergence of each method for one instance in the figures and summarized the statistical information in tables. Note that Tables 1 and 2 include that not only single-dimensional results of performance (i.e., average) but also measures of variation (i.e., standard deviation).

On the other hand, since each table shows the statistical results at the final iteration, it is currently not possible to see the statistical results during the iterations. Therefore, we will add the figures that include averaged curves and error bars across multiple instances to the appendix.

4: Although our baselines are based on recent studies (Ray et al., 2022), (Hikima & Takeda, 2024), and (Liu et al., 2024a), they fail to converge for different reasons:

ZO-OG suffers from high variance and instability at each iteration. While increasing the number of samples $m_k$ may improve stability, it would also increase total sample size.
ZO-OGVR and ZO-TG do not include partial gradient information in the update direction $u$ , which delays convergence.

If you have other baselines, we would be happy to add them.

5: We adopted the loss function used in (Levanon & Rosenfeld, 2021, Section 4)) in order to align our experimental setup with that of the existing study. Moreover, cross-entropy is a widely used and standard loss function for binary classification. We could have tried other loss functions in our numerical experiments, but we preferred to vary the coefficient $\gamma$ of the agent's cost function to see the performance of the proposed method for different probability distributions (which depends on $\gamma$ ). However, our method is applicable to other loss functions if they are differentiable.

Other Comments Or Suggestions.

1: We consider 5488 to be an appropriate coefficient. This large number arises from the sixth power of the expected value of $u$ in Eq. (12) of Lemma C.7. Look at the equation transformation in lines 697 to 708. In the first inequality, the norm is decomposed, where number 32 occurs in the coefficients. Furthermore, in the third inequality, from (Nesterov & Spokoiny, 2017, Lemma 1), i.e., Eq. (10) in our appendix, number $7^3$ appears in the coefficients. These transformations account for the magnitude of this constant.

2, 3, and 4: Thank you for the multiple suggestions about the description. We will correct them accordingly.

Essential References Not Discussed.

We will cite these references and clarify the differences from our study. Thank you for the suggestion.

审稿意见

评分: 42025-03-09

This work proposes new zero-th order methods for nonconvex performative prediction problem (i.e. optimization with decision-dependent distribution) which improve the theoretical sample complexity of the state of the arts when the function variation $G=\sup_{x,\xi}|f(x,\xi)|$ is large compared with the dimensionality of the decision variable. The experimental results also demonstrate faster convergence of these new methods.

update after rebuttal

Reviewer Vnz2 is satisfied with the authors' responses and keeps rating 4.

给作者的问题

(1) Could you provide intuition why using the guidance $hh^{\top}$ in Eq. (6), and the benefit of using $\alpha_k\to 1$ in Algorithm 1? For example, can Lemma 4.6 imply that the variance of the stochastic gradient is small if $\alpha\approx 0$ for large $||\nabla F(x)||$ , and $\alpha\approx 1$ for small $||\nabla F(x)||$ (usually near the final iterates)? In this way we balance bias-variance trade-off?

(2) The gradient estimator (2) has Gaussian variable $d{\bf u}$ with covariance $d^2I_d$ , while in Eq. (6), the first part is $\frac{I_d}{d}$ . Why is there a such scale difference in $d$ ?

(3) At the beginning of Section 5.1, should $x\in\mathbb{R}^d$ instead of $x\in\mathbb{R}^n$ ? What is the value of the dimensionality $d$ and $\theta_i$ for the price experiment?

(4) Do Tables 1 and 2 involve only the final iterate of each problem instance? You could explain that in the caption. Near the final of Section 5.1, should "Table 2" be changed to "Table 1"? In Table 1, does each data ID correspond to a problem instance? If yes, why not using data IDs 1-20?

(5) In Lemma C.2, should it be $L_F=L_x+\theta L_{\xi}$ ?

(6) In the proof of Lemma C.10, what does $n$ mean? Should it be changed to the dimensionality $d$ ? Also, (21) can directly imply (22) by replacing $u$ with $-u$ as they follow the same distribution $\mathcal{N}(0,\Sigma)$ .

(7) In the proof of Lemma 3.2, the second $x+\mu u$ in the second row could be changed to $x-\mu u$ .

(8) In the proof of Theorem 4.7, what will be the complexity result if $\alpha_k=1$ ? Since the proof does not depend on $s$ , what if we use arbitrary $s$ ?

(9) In the proof of Lemma C.12, if we replace all $L_F$ with $||\nabla F(x)||$ , will this yield tighter result? It is fine if there is not enough time to explore this question in the discussion period.

论据与证据

The claims above are well supported by theorems and experimental results.

方法与评估标准

The proposed zero-th method looks reasonable but the advantage of using guidance is still not intuitive and clear, as will be elaborated later. In the theoretical results, using the gradient norm as the evaluation metric is standard and reasonable. In the experiments, the evaluation metrics including objective function value, classification accuracy and AUC are also standard and reasonable. The experiment scenarios are good since they fit the studied performative prediction problem and directly come from applications.

理论论述

I checked the proof of Lemmas C.2, C.4, C.5, C.7, C.10, C.11, C.12, 3.2, 4.6 and the general proof logic of Theorems 4.7 and 4.8, and did not find any mistakes.

实验设计与分析

The experiment scenarios are good since they fit the studied performative prediction problem and directly come from applications. The evaluation metrics including objective function value, classification accuracy and AUC are standard and reasonable. The basic information for reproducibility is clear to me, including the objective function, data distribution, algorithms and their hyperparamter choices. The results also demonstrate the faster convergence of the proposed methods than that of the existing methods.

补充材料

Yes. I read Appendix A about experimental details. I also checked the proof of Lemmas C.2, C.4, C.5, C.7, C.10, C.11, C.12, 3.2, 4.6 and the general proof logic of Theorem 4.7.

与现有文献的关系

This work is among the very few papers on theoretical complexities to converge to a stationary point of the objective function (instead of the performative stable point as an approximation), and improves these complexities using a new zero-th order gradient estimator when the function variation $G=\sup_{x,\xi}|f(x,\xi)|$ is large compared with the dimensionality of the decision variable.

遗漏的重要参考文献

[1-5] on convergence results of performative prediction are not cited.

[1] Li, Qiang, and Hoi-To Wai. "Stochastic optimization schemes for performative prediction with nonconvex loss." Advances in Neural Information Processing Systems (2024): 8673-8697.

[2] Khorsandi, Pedram, et al. "Tight Lower Bounds and Improved Convergence in Performative Prediction." OPT 2024: Optimization for Machine Learning.

[3] Izzo, Z., Ying, L., and Zou, J. How to learn when data reacts to your model: Performative gradient de- scent. In Meila, M. and Zhang, T. (eds.), Proceed- ings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 4641–4650. PMLR, 18–24 Jul 2021

[4] Roy, A., Balasubramanian, K., and Ghadimi, S. Projection- free constrained stochastic nonconvex optimization with state-dependent markov data. In Advances in neural information processing systems, 2022.

[5] Mehrnaz Mofakhami, Ioannis Mitliagkas, and Gauthier Gidel. Performative prediction with neural networks. In International Conference on Artificial Intelligence and Statistics, pages 11079–11093. PMLR, 2023.

其他优缺点

Strengths: This work studies an emerging and significant problem called nonconvex performative prediction (i.e., optimization with decision-dependent distribution), with important applications. There are only few existing works on theoretical complexities to converge to a stationary point of the objective function (instead of the performative stable point as an approximation), and this work improves these complexities using a new zero-th order gradient estimator in the case of large function variation $G=\sup_{x,\xi}|f(x,\xi)|$ compared with the dimensionality of the decision variable. The experiments are application-driven, clear, reproducible and demonstrate the faster convergence of the proposed algorithms. I can understand the clear presentation well.

Weaknesses: The major weakness is that the intuition and benefits of using the gradient guidance $hh^{\top}$ in Eq. (6) are not clear, especially given that the proof of Theorem 4.7 does not depend on $s$ and thus $h$ . This major weakness corresponds to my questions (1) and (8) in "Questions For Authors". Some works on performative prediction are not cited as listed above. Some other details are to be clarified as listed below.

其他意见或建议

(1) You could stress your contribution in the abstract that your theoretical sample complexity improves the state of the arts in the case of large function variation $G=\sup_{x,\xi}|f(x,\xi)|$ compared with the dimensionality of the decision variable.

(2) You could cite the works listed in "Essential References Not Discussed" above, and compare (Liu et al., 2024a) with yours in terms of setting and complexity.

(3) You may mention the name "performative prediction" of this optimization problem.

(4) You could use "an unbiased" instead of "the unbiased" right before Eq. (3).

作者回复

2025-04-01

We thank the reviewer for positive comments and constructive suggestions.

Questions For Authors.

(1): Since we make no assumptions on $h$ except that it is normalized, the sample complexities of our methods are derived in the worst-case scenario for $h$ . In this case, by letting $\alpha_k \to 1$ , as in our Algorithms 1 and 2, the effect of $h$ (, which is adverse in the worst-case scenario) is reduced, allowing us to achieve the stated sample complexity. For Lemma 4.6, note that $L_f \ge ||\nabla F(x)||$ ; the upper bound is smaller when $\alpha=1$ , unless $d$ is very small. Therefore, there is no trade-off in the worst-case scenario.

As noted in ``Claims And Evidence and Essential References Not Discussed'' in the response to reviewer homv, our contributions are: (i) an improved sample complexity over (Hikima & Takeda, 2024b) through tighter convergence analysis, and (ii) the introduction of partial gradient information $h$ without degrading the sample complexity, even in the worst-case scenario for $h$ . Regarding (ii), (Maheswaranathan et al.,2019) assumes a confidence level for the partial gradient information ( $\rho$ in Section 3.3 in their paper) and establishes a trade-off between the unbiasedness and variance of the gradient estimator. In contrast, we make no such assumptions on $h$ , yet successfully construct methods that preserve sample complexity.

We deliberately avoid assuming any correlation between $h$ and the true gradient because we adhere to the existing problem setting (Perdomo et al., 2020; Mendler-Dunner et al., 2020) where $D(x)$ is unknown for application requirements. If we assumed such correlation, the weight of first term in (3) would have to be known, which requires information on $D(x)$ . Moreover, to the best of our knowledge, our paper is the first study to apply the concept of guided evolutionary strategy to problem (P). Thus, our goal is to first establish a highly general method under minimal assumptions. We believe our convergence analysis lays a foundation for future work that makes additional assumptions tailored to specific applications.

(2): You are correct that the scales differ between Eqs. (2) and (6). In Eq. (2), when $u$ is sampled from the unit sphere, it is common to include $d$ in the numerator; when $u$ is sampled from a Gaussian distribution, $d$ is typically not included. This is done to align the scale of the gradient estimator with that of the true gradient (see [1], Eqs. (2.14) and (2.33)). Although we did not explicitly specify the distribution of $u$ in Eq. (2), we will remove $d$ from the numerator since we use a standard Gaussian distribution as an example.

In Eq. (6), the scale is aligned with that of the existing study (Maheswaranathan et al., 2019). Although multiplying the right-hand side of (6) by $d$ would match the scale in (2), we used this scale to facilitate comparison with the previous study.

(3): $\mathbb{R}^d$ is correct. Thank you for the comment. We set $d=10$ to match (Hikima & Takeda, 2024b).

(4): Tables 1 and 2 show the mean and standard deviation of the metric at the final iteration across 20 problem instances. Therefore, each value in the table does not represent the result of a single problem instance. In Section 5.1, ''Table 2'' should be ''Table 1.'' We apologize for the typo. In Table 1, the data ID corresponds to the week of data used. Therefore, data ID and date in Table 1 represent the same information. To avoid confusion, we will remove the data ID column from Table 1.

(5), (6), and (7): You are correct in all comments. Thank you.

(8): The same sample complexity can be derived when $\alpha_k=1$ for all $k$ . In other words, the results of Theorems 4.7 and 4.8 hold for any $s$ (and normalized $h$ ). This is because we performed worst-case analysis on $s$ (and $h$ ), as described in our response to (1).

(9): Thank you for the valuable suggestion. In the proof of Lemma C.12, replacing $L_F$ with $||\nabla F(x)||$ does not yield tighter result. However, we realized that it simplifies the proof. Specifically, the left side of equation (29) becomes $\left(\beta \alpha_k d^{-1} -\frac{24 H_F\beta^2 \alpha_k^2 (d+4)^2}{d^2}-24H_F\beta^2(1-\alpha_k)(25-23\alpha_k) \right) E_{\zeta_{[0,k-1]}} \left[||\nabla F(x_k)||^2\right],$ and the sixth term on the right side (i.e., $24H_F\beta^2(1-\alpha_k)(25-23\alpha_k) L_F^2$ ) is eliminated. This shortens the right-hand side of (29), and also makes (31) unnecessary. Then, if we set $\beta:= \min \left(\frac{\alpha_0 d}{48H_F( (d+4)^2 + 25d^2)},T^{-\frac{2}{3}} d^{-\frac{1}{3}} \right),$ Theorems 1 and 2 hold.

Essential References Not Discussed.

We will add these references in Section 2 and clarify the differences from our study.

[1] Berahas, Albert S., et al. ``A theoretical and empirical comparison of gradient approximations in derivative-free optimization." Foundations of Computational Mathematics 22.2 (2022): 507-560.

审稿人评论

2025-04-05

Reviewer Vnz2 is satisfied with the authors' responses and keeps rating 4.

审稿意见

评分: 32025-03-09

The authors explore zeroth-order methods to solve stochastic problems where the distribution of stochastic variables depends on the decision x. By incorporating partial information in the construction of gradient estimators, they demonstrate improved convergence rates compared to existing works.

给作者的问题

论据与证据

Compared to [1], the main improvement stems from replacing Assumption 3.2 with Assumption 4.1 and using two-point zeroth-order estimation. The authors should provide a clearer explanation of how partial information improves convergence compared to [1].

[1] Haitong L I U, Qiang L I, Wai H T. Two-timescale Derivative Free Optimization for Performative Prediction with Markovian Data[C]//Forty-first International Conference on Machine Learning.

方法与评估标准

yes

理论论述

I briefly go over the proof of main result in the last section. According to the proof from lines 1284 to 1303, when $\alpha_0 = 1$ (meaning no partial information is used), there is no negative impact on the theoretical results. In fact, the results are slightly better since both the first and last terms in line 1290 decrease.

实验设计与分析

For the experimental validation, it would be valuable to include an experiment running Algorithm 1 with scaled gradient (setting $\Sigma_k=I_d/d$ in ZO-TG) to demonstrate the effectiveness of incorporating partial information in gradient estimation.

The legend box obscures a significant portion of the figure, making it difficult to verify the authors' claim that a small $\alpha$ promotes a faster initial decrease in the objective value. Please relocate the legend box outside the figure.

补充材料

See Theoretical Claims part

与现有文献的关系

遗漏的重要参考文献

More discusssion and comparison with Hikima & Takeda, 2024b is needed, such as which part of the algorithm contributes to the improved rate? This should be clarified in the main file

其他优缺点

Strength:

The paper constructs a central difference type gradient estimator, which proves superior to the result in previous work (Hikima & Takeda, 2024b), with better complexity dependence on $d$ . Additionally, incorporating partial information into the covariance matrix is an innovative approach that shows promising results in the experiments.

Weakness

The presentation of Assumptions in section 4.1 is dense and difficult to follow, and the purpose of Assumption 4.3 needs clarification since it appears only in the appendix. Moreover, the paper heavily relies on the boundedness variance assumption 4.1. While this assumption relates to the $G$ constant in line (204), the constant sigma is not explicitly included in the complexity bound. This makes it difficult to compare the guided zeroth-order method with Liu et al.'s work, which depends on G.

Minor:

The range of $\alpha_0$ is incorrect in both Algorithm 1 and Theorem 4.7. According to the proof, $\alpha_0$ should be in (0,1] rather than [0,1).

其他意见或建议

作者回复

2025-04-01

We thank the reviewer for valuable and constructive feedback.

Weakness.

The presentation of Assumptions in section 4.1 is dense and difficult to follow, and the purpose of Assumption 4.3 needs clarification since it appears only in the appendix.

In response to your comment, we will add the following explanations for each assumption:

Assumption 4.1 is required for approximating $F(x)$ by $f(x,\xi)$ with sample $\xi$ . Since the objective function involves random variables, such an assumption is essential to evaluate the objective value by its sample.
The purpose of Assumptions 4.2 and 4.3 is to guarantee the Lipschitz continuity of the objective function. It is used to derive the properties of the Gaussian smoothing function. (See Lemma C.8.)
Assumption 4.4 is standard in convergence analysis, ensuring the accuracy of the first-order approximation via Taylor expansion. This is because descent methods with (estimated) gradients can be seen as optimizing a first-order approximation of the objective function at each iteration.

Moreover, the paper heavily relies on the boundedness variance assumption 4.1. While this assumption relates to the $G$ constant in line (204), the constant sigma is not explicitly included in the complexity bound. This makes it difficult to compare the guided zeroth-order method with Liu et al.'s work, which depends on $G$ .

In response to your comment, we will include $\sigma$ in the expression of the sample complexities. While some studies (Iwakiri et al., 2022; Hikima & Takeda, 2024b) omit constants other than error tolerance $\epsilon$ and dimension $d$ in their theorems, several studies (Perdomo et al., 2020; Mendler-Dunner et al., 2020; Ray et al., 2022) include such constants, and we agree that your suggestion is valid. In this paper, including $\sigma$ leads to a sample complexity of $O(\sigma^2 \epsilon^{-6} d^4)$ .

However, we argue that Assumption 4.1 is much looser than the assumption of constant $G:=\sup_{x,\xi}|f(x,\xi)|$ in existing studies (Ray et al., 2022; Liu et al., 2024a). Assuming the existence of $G$ , we cannot handle even a simple loss function. For example, $G$ is unbounded even if $f$ is the squared error $f(x,(A,b)):=||A x-b||^2$ , where $\xi=(A,b)$ . This fact is a weakness in the one-point gradient estimator, which is also pointed out in (Hikima & Takeda, 2024b). In contrast, Assumption 4.1 requires only that $F$ can be approximated by samples, which is a fundamental and minimal requirement; if $\sigma=\infty$ , the objective value cannot be estimated by sampling, making such assumptions unavoidable.

Claims And Evidence and Essential References Not Discussed.

Compared to [1], the main improvement stems from replacing Assumption 3.2 with Assumption 4.1 and using two-point zeroth-order estimation. The authors should provide a clearer explanation of how partial information improves convergence compared to [1].

More discussion and comparison with Hikima & Takeda, 2024b is needed, such as which part of the algorithm contributes to the improved rate?

Compared to (Liu et al., 2024a), the reason why $G$ does not appear in the sample complexities of our methods is that our methods use a two-point gradient estimator. This improvement has already been made in (Hikima & Takeda, 2024b) and is therefore not the main contribution of this paper. The main contribution is the difference from Algorithm 2 in (Hikima & Takeda, 2024b).

Our contribution is two-fold:

We achieved a tighter convergence analysis compared to (Hikima & Takeda, 2024b) by modifying settings such as mini-batch size and step size.
We proposed zeroth-order methods incorporating the partial gradient information, without making any assumptions on the information or worsening the convergence rate.

Regarding the second contribution, please see our responses to reviewer Vnz2's question (1) for more information.

Theoretical Claims.

According to the proof from lines 1284 to 1303, when $\alpha_0=1$ (meaning no partial information is used), there is no negative impact on the theoretical results.

You are right. Since we consider the worst case for the partial gradient information $h$ , the same sample complexity can be achieved with $\alpha_0=1$ . Please see our responses to reviewer Vnz2's questions (1) and (8) for more information.

Experimental Designs Or Analyses.

For the experimental validation, it would be valuable to include an experiment running Algorithm 1 with scaled gradient (setting in ZO-TG) to demonstrate the effectiveness of incorporating partial information in gradient estimation.

In the experiments, we already used scaled gradients in ZO-TG by adjusting the step size $\beta$ . See lines 532–536 for details. Compared to GZO-NS and GZO-HS, the updating distance in ZO-TG is scaled by reducing the step size by $d^{-1/2}$ . Sorry for the confusion. We will clearly state in the experimental section that the scale of each method is aligned.

审稿人评论

2025-04-05

Hi, thanks for your reply. I will raise my score as the authors resolve some of my concerns.

审稿意见

评分: 32025-03-29

This paper studies zeroth-order methods with partial gradient guidance for solving the performative prediction problem. Specifically, the proposed algorithm leverages the gradient information of the known function $f(x, \xi)$ to refine the update direction. The authors establish a rigorous worst-case convergence bound and analyze the sample complexity. Simulation results are provided to validate the theoretical findings.

给作者的问题

See the weakness part.

论据与证据

Yes, the claims made in the submission are supported by clear and convincing evidence.

方法与评估标准

Yes, the methods used in this paper are appropriate for the performative prediction problem, but the motivation is unclear, see Weakness part.

理论论述

Yes, I have checked the proof of Theorem 4.7, and it looks fine to me.

实验设计与分析

For the multiproduct pricing application: In Fig. 1, the author plots the loss function as a metric. However, in Theorem 4.7, the convergence metric is the expectation of gradient. From the current figure, we cannot observe the convergence speed information.
For strategic classification with an unknown agents' cost function: Regarding the setting with unknown $D(x)$ , according to the best response function shown in line 301, right column, I wonder how to get $\xi_F$ in closed-form solution in simulation and whether this settings satisfy smoothness assumption A4.3.

补充材料

Yes, I check the appendix A, B, C.

与现有文献的关系

In the zeroth-order methods literature for performative prediction, [Miller et al., 2021] and [Zrinc, 2024] consider the two-stage approaches, i.e., first estimate the shifted distribution and then optimize it. But neither of them use partial gradient information.

遗漏的重要参考文献

All realted works are well discussed.

其他优缺点

Strengths:

This paper is well written and easy to follow. Meanwhile, the proof is clearly presented.
Simulations are conducted to support the theorem.

Weaknesses:

Motivation: Could the authors clarify the positioning of Algorithm 1 and Algorithm 2 within the existing literature on performative prediction? Specifically, how do these algorithms compare to prior approaches? If one only has access to the stochastic loss function $f(x,\xi)$ and seeks to quickly find the performative optimal solution, the two-stage method in [Miller et al., 2021] may be a more natural choice. On the other hand, if only loss values are available, existing zeroth-order methods such as [Roy et al., 2022] could be used. Given these alternatives, what specific problem setting or advantages motivate the development of the proposed algorithms?
Algorithm: Could you explain the definition of the distribution $D_{R}$ ? For Algorithm 2, is there any theoretical guidance on the historical sample size $\hat{p}$ to achieve better performance?
Gradient Estimator: In equation (6), the gradient estimator is affected by an important factor $\alpha$ . How should it be chosen properly in practice? Additionally, we would like to see its effect on convergence behavior.

其他意见或建议

I listed some typos in this paper:

Line 298, 299: "Table 2" should be "Table 1."
Line 1125, 1126: The expectation notation $E$ should be $\mathbb{E}$ .

作者回复

2025-04-01

We thank the reviewer for the constructive suggestion. We answer each of the questions in the ''Weaknesses'' and ''Experimental Designs Or Analyses'' sections.

Weaknesses.

Motivation.

The motivation for our study is clear, and we clarify the positioning of our methods in relation to existing methods. First, as already mentioned in Section 2.2, the motivation of our method for the two-stage method by [Miller et al., 2021] is to find a stationary point without making strong assumptions about the distribution or loss function. Specifically, in section 4.2 of [Miller et al., 2021], they assume that the distribution $D(x)$ (denoted $D(\theta)$ in their paper) is included in the location-scale family, i.e. $D(x)$ needs to satisfy $\xi \sim D(x) \Leftrightarrow \xi := \xi_0 + A x,$ where $\xi_0$ is a random variable independent of $x$ and matrix $A$ is constant. This implies that the randomness is decision-independent, which limits practical applications. For example, in the multiproduct pricing (strategic classification) application in Section 5 of our paper, the demand (data) distribution does not satisfy this assumption. Moreover, [Miller et al., 2021] assumes that the function $f$ is strongly convex (see (A.3a) and (A.3b) in their paper). However, in the context of price optimization (machine learning), non-convex revenue (loss) functions are widely used. In contrast, we develop our method under milder assumptions; our assumptions do not require strong convexity of $f$ or any specific form of the distribution $D(x)$ . In response to your comments, we will add these detailed explanations to Section 2.2.

Next, the motivation for our method in relation to existing zeroth-order methods lies in leveraging the gradient information of $f$ , as mentioned in the introduction section. In existing methods, $f$ is treated as a black box, and the gradient information is not used. However, in many applications, $f$ is a known function. This study speeds up the existing zeroth-order methods by using the gradient information of $f$ .

Algorithm.

Definition of $D_R$ . It is a probability distribution over { $1,2,\dots,T$ }. In Algorithms 1 and 2, the iteration to terminate is determined probabilistically using this distribution. As shown in Theorem 4.7, the convergence of our method can be guaranteed by setting $D_R$ as in the theorem. Such termination schemes are common in stochastic optimization to ensure convergence guarantees. For example, see (Ghadimi & Lan, 2013; Theorem 3.2).

Guidance for $\hat{p}$ . Theoretically, the constant $\hat{p}$ can be any natural number. This is because we have derived the sample complexities of our methods in the worst-case for $h$ , and $\hat{p}$ only affects the partial gradient information $h$ . (Details are in the response to reviewer 2's question (1).) On the other hand, a future issue is to provide appropriate guidance for $\hat{p}$ depending on the distance between the past and current iterates. This is because as the distance increases, the past distribution may deviate significantly from the current one. In response to your comment, we will add this content to the conclusion section as a future issue.

Gradient Estimator.

As shown in line 14 of Algorithm 1, we proposed a rule that updates $\alpha_{k+1} := 1-\gamma (1- \alpha_k)$ . As noted in lines 195-199 in the left column, this updating rule promotes a faster decrease in the objective value initially and reduces the bias of the gradient estimate $g_k$ in later iterations. The updating rule of $\alpha_k$ is important to guarantee the theoretical convergence of our methods. (Details are in the response to question (1) of reviewer 2.) For the empirical setting of $\gamma$ and $\alpha_0$ , if $\alpha_0$ is near 1 or $\gamma$ is too small, $\alpha$ rapidly converges to 1, resulting in behavior similar to existing methods. We set $\gamma=0.98$ and $\alpha_0:=0$ to prevent this.

Experimental Designs Or Analyses.

Evaluation of the gradient norm in the experiment.

In response to your comment, we will add figures that show the norm of the gradient. However, since we believe the function value better reflects the goal of the application, we will retain Fig. 1 as it is.

Response function of agents in strategic classification.

We calculated the optimal response of the agent by the following procedure: (i) calculate the cost required for each agent to be judged as positive (the cost is $0$ for agents that have already been judged as positive), and (ii) if the cost is less than the gain from being judged as positive, the agent will change its own feature values so that it is judged to be positive.

This setting does not satisfy the smoothness in Assumption 4.3. This is because the data distribution changes discontinuously at the point where the cost exceeds the gain for some agents. Despite this unfavorable setting for our methods, the experimental results show that they still perform well numerically.

审稿人评论

2025-04-08

Dear authors,

Thanks for your reply. I will raise my scores to 3 as the authors resolve my concerns.

最终决定Accept (poster)

2025-05-01

This paper studies a stochastic optimization problem with decision dependent distribution and proposes a guided 0-th order algorithm for it. By extending the idea from [Mahewaranathan et al., 2019], this paper suggests to guide the 0-th order estimate of gradient by designing the covariance matrix of the random perturbation vector to accelerate convergence. Both theoretical rates and numerical experiments support the efficacy of this approach.

In the intial reviews, most of the reviewers found the technical contributions to be solid and provides valid improvements over prior works such as (Hikima & Takeda, 2024b) and (Liu et al., 2024a). Some of the concerns on the motivation and comparison to prior works have been resolved during the author-reviewer discussion phase. As such, the reviewers are in consensus that the paper can be accepted.

In the final version, the authors are suggested to implement the promised changes in their rebuttal.