PaperHub
Score: 7.0/10
Poster · 3 reviewers
Ratings: 6 / 5 / 2 (min 2, max 6, std 1.7)
Confidence: 3.7
Novelty: 3.3 · Quality: 3.7 · Clarity: 2.7 · Significance: 3.3
NeurIPS 2025

Stochastic Gradients under Nuisances

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
Stochastic gradient · orthogonal statistical learning · nonparametric statistics

Reviews and Discussion

Review
Rating: 6

The paper builds a Neyman orthogonal gradient method that avoids establishing a double machine learning objective in a case-by-case manner when there are nuisance parameters. It achieves the same theoretical result, yet the method is much more versatile. This is a really nice paper that combines statistical ideas with optimization methods. I find the discussions in the appendix very insightful and valuable. I would give the highest score if the motivation for (9) and (10) can be more clearly stated. The writing could accommodate a bit more for optimization readers.

Strengths and Weaknesses

Strengths:

  1. First SGD method that handles nuisance without building a corresponding objective.
  2. Motivation is very clean and intuitive.
  3. Avoids conducting bias correction in a case-by-case manner for the objective and builds it directly at the gradient level. A good combination of statistical ideas and optimization methods.
  4. Theoretical results are convincing and interesting.

Weaknesses:

  1. Computational costs can be high as it requires computing the Hessian inverse and Jacobian per iteration.
  2. The applicability of the proposed method should be more clearly stated.

Questions

Overall, I really like the paper, though I still have a few questions.

  1. In stochastic optimization with unknown parameters, the formulation is similar to the problem considered in the paper (without NO of course). Does it imply that the designed method will also yield improvements for general stochastic optimization with unknown parameters?
  2. Are there any other applications where constructing the bias-corrected objective is very difficult, but the proposed methods could be directly applied to achieve a better or equivalent convergence?
  3. Higher-order smoothness intuitively decreases the dependence on the nuisance estimation error. How does higher-order smoothness play a role in Theorem 2?
  4. NO objective usually needs to be carefully designed as seen in the examples given in the paper. The proposed method does not seem to need such an assumption anymore. What is the key explanation behind such great improvements?
  5. Biased gradient methods can be categorized into two types. Type 1, as cited in the manuscript on p. 9 (https://arxiv.org/abs/2305.16296), considers biased gradient methods where the bias comes from the problem structure and thus can be absorbed in the analysis of SGD, resulting in a final convergence rate that does not exhibit an explicit bias. Type 2 is more closely related to the setting considered in the paper, where the bias comes from additional estimation of some unknown terms or other optimization problems and cannot be easily absorbed in the analysis of gradient methods (https://arxiv.org/abs/2408.11084). The discussion on the relationship to the biased gradient method should in fact focus on type 2. For an overview of biased gradient methods and the difference between types I and II, see http://dx.doi.org/10.1007/978-3-030-54621-2_887-1
  6. The paper should highlight that $g$ is independent of $\theta$. Otherwise, one needs to construct estimators of $g$ after updating $\theta$ at each iteration. This means that the setting considered in the paper is fundamentally different from DRO. Such a point should be clearly explained.
  7. The equations (8) and (13) are performing gradient ascent instead of gradient descent. Is there a typo that "+" should be replaced with "-"?
  8. The motivation for constructing OSGD is not clear. Why would one formulate (9) and use the gradient estimator (10)? Are there any requirements or design principles behind (10)? The current explanation is too vague.
  9. The notation system for Section 1 is broken. There are five different loss functions: $\ell$, $L$, $\ell_0$, $L(\cdot, g)$. Yet their relationships are largely missing.
  10. Line 988, it should be "(ii)"?

Limitations

Although code is provided, no experiments are demonstrated in the paper.

Final Justification

All my concerns have been fully addressed. I greatly appreciate that the paper presents novel and original ideas, demonstrates applicability across different domains, and provides a comprehensive comparison along with a rigorous analysis of the proposed methods. Accordingly, I have raised my score to 6. I would like to thank all the authors for their efforts in producing such a high-quality piece of work. It clearly reflects a substantial amount of work and is well-deserved.

Formatting Concerns

NA

Author Response

"I would give the highest score if the motivation for (9) and (10) can be more clearly stated."

As in the discussion “Comparison of Orthogonalizing Operators” in Appx. F.2, the motivation for Eq. (9) and Eq. (10) comes from the semiparametric inference literature, wherein the goal is to reduce the asymptotic variance of the estimate for $\theta_\star$. Often, the orthogonalizing process is described using the language of differential/information geometry, in that the so-called efficient influence functions are constructed by doing an orthogonal projection in a Hilbert space.

Our intention with (9) and (10) is to describe the same concept in a way that appeals to machine learning audiences; the correction term subtracts the regression of the $\theta$ gradient of the loss on the $g$ “gradient” of the loss. By the law of total variance, the variance of the gradient is reduced, which improves the trajectory of stochastic optimization. Furthermore, this variational description (9) hints at how such an operator can be computed algorithmically, instead of the historical semiparametric inference approach of deriving the operator via calculation by hand.
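For optimization readers, one schematic way to express this regression view (our notation here: $S_g$ stands for the $g$-directional “gradient” of the loss, and the paper's Eq. (9)-(10) give the precise operator form) is

$$\Gamma_\star \in \operatorname{argmin}_{\Gamma}\; E\big\Vert S_\theta(\theta_\star, g_0; Z) - \Gamma\, S_g(\theta_\star, g_0; Z) \big\Vert^2, \qquad S_{no} = S_\theta - \Gamma_\star S_g,$$

so that $S_{no}$ is the residual of regressing $S_\theta$ on $S_g$; heuristically, by the law of total variance, this residual has no larger variance than $S_\theta$.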

"The writing could accommodate a bit more for optimization readers."

Thank you for this feedback. We are happy to incorporate any concrete suggestions, and otherwise will emphasize the variance reduction viewpoint stated above if you find it helpful.

"Computational costs can be high as it requires computing the Hessian inverse and Jacobian per iteration… The applicability of the proposed method should be more clearly stated."

The problem (9) does not have to be solved at every iteration, and is instead computed once for the entire trajectory in the setting of Theorem 2. That being said, practical algorithms that compute $\hat\Gamma$ using an optimization approach may interleave their updates with the updates for $\theta^{(n)}$. We discuss these interleaving strategies for $\hat{g}$ toward the end of Section 3.

We will further clarify the precise implementation and applicability of the orthogonalizing operator in the final version. We also implement a version of OSGD on the partial linear model. Please see the experiments in our response to Reviewer rpHD for details.

"In stochastic optimization with unknown parameters, the formulation is similar... Does it imply that the designed method will also have improvements for the general stochastic optimization with unknown parameter settings?"

Yes. In this case, we can treat the unknown parameters as a nuisance, and apply the results of Theorems 1 and 2 based on whether (i) the loss is Neyman orthogonal with respect to the unknown parameters, and (ii) whether the NO gradient oracle can be efficiently computed.

"Are there any other applications where constructing the bias-corrected objective is very difficult, but the proposed methods could be directly applied to achieve a better or equivalent convergence?"

The squared losses used in many estimation problems are usually non-orthogonal, while our orthogonalizing method can still be used. For example, consider the estimation of the weighted average derivative with $Z = (D, X, Y)$, where $D$ is a continuously distributed random variable, $X$ is the covariate vector, and $Y$ is the outcome. Let $\gamma_0(d,x) = E[Y \mid D = d, X = x]$, and let $\omega(d)$ be a probability density function. Define $S(u) = -\omega(u)^{-1}\partial \omega(u)/\partial u$ as the negative score for the pdf $\omega$, and let $U$ be a random variable that is independent of $X$ with pdf $\omega$. Then the target can be written as

$$\theta_0 = \operatorname{argmin}_{\theta \in \mathbb{R}} E[(\theta - S(U)\gamma_0(U,X))^2],$$

where the true nuisance is $g_0 = (S(U), \gamma_0(U,X))$. We can show that this squared loss is not an orthogonal loss. However, we can perform OSGD to obtain an orthogonalized gradient oracle.

"Higher-order smoothness intuitively decreases the dependence on the nuisance estimation error. How does higher-order smoothness play a role in Theorem 2?"

The higher-order smoothness refers to Asm. 4 for Thm. 1 (the higher-order smoothness of the orthogonal loss), Asm. 6(e) for Thm. 2 (the higher-order smoothness of the NO score), and their generalizations. For Thm. 2, the higher-order smoothness impacts the coefficient $\beta_2$ and the power of the nuisance error (such as the fourth power in $\Vert \hat g - g_0\Vert_{\mathcal{G}}^4$). Under an even higher-order smoothness condition in place of Asm. 6(e), the term $\beta_2^2\Vert \hat g - g_0\Vert_{\mathcal{G}}^4$ in Thm. 2 would be replaced by a similar term of higher order.

However, the cross-product term $\alpha_2^2\Vert \hat g - g_0\Vert_{\mathcal{G}}^2 \cdot \Vert \hat g - g_0\Vert_{\mathrm{Fro}}^2$ would still remain even under higher-order smoothness of the NO score, since the loss function itself is non-orthogonal, which is what causes this term to appear.
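Schematically (our paraphrase of the nuisance-dependent terms named above, not the exact statement of Thm. 2), the guarantee has the shape

$$E\big[\Vert \theta^{(n)} - \theta_\star \Vert_2^2\big] \;\lesssim\; (\text{optimization terms}) \;+\; \alpha_2^2\,\Vert \hat g - g_0\Vert_{\mathcal G}^2\,\Vert \hat g - g_0\Vert_{\mathrm{Fro}}^2 \;+\; \beta_2^2\,\Vert \hat g - g_0\Vert_{\mathcal G}^4,$$

where higher-order smoothness improves only the $\beta_2^2$ term.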

"NO objective usually needs to be carefully designed as seen in the examples given in the paper. The proposed method does not seem to need such an assumption anymore. What is the key explanation behind such great improvements?"

A popular practice in semiparametric estimation is to handcraft an orthogonalized learning objective for estimation and to use batch full-sample algorithms for optimization. One of the goals of our paper is to investigate the rate of convergence of stochastic gradient algorithms for non-orthogonalized objectives and to delineate when they achieve the best rates of convergence.

Put simply, iteratively orthogonalizing stochastic gradient algorithms on non-orthogonalized learning objectives can relieve us of manually orthogonalizing the objective and using batch full-sample optimization algorithms, as we mentioned in the discussion surrounding (9) and (10).

"The discussion on the relationship to the biased gradient method should focus on type 2... where bias comes from additional estimation of some unknown terms or other optimization problems."

Thank you for these references. We have updated the discussion on biased SGD methods to reflect this relationship. Our setting does still differ slightly from the “Natural Biased Gradient Methods” (Type 2), in that the estimation of $\hat{g}$ occurs once at the start of the trajectory and does not affect the oracle cost (as the parameter $l$ does in Hu (2024)). In the case of interleaved nuisance and target updates (F.3), the bias term depends on the total number of iterations used to update the nuisance, not the per-iteration complexity.

"The paper should highlight that gg is independent of theta. Otherwise, one needs to construct estimators of gg after updating θ\theta at each iteration. This means that the setting considered in the paper is fundamentally different from DRO."

Thank you for raising this point. For Thm. 1 and Thm. 2, $\hat g$ is always fixed during the training of SGD for learning $\theta$, and we emphasized in line 55 that $(Z_i)_{i=1}^{n}$ used for learning $\theta_\star$ is an independent data stream. However, in Appx. F.3, we discuss interleaving $\theta^{(n)}$ and $\hat{g}$ updates, in which the nuisance estimate does indeed depend on the value of $\theta^{(n)}$. We do not claim to solve DRO using OSGD updates, but we do wish to clarify the dependence structures we consider. We use DRO simply as a recent example of loss minimization with an unknown nuisance (see lines 45-49 for the statement of focus).

"The equations (8) and (13) are performing gradient ascent instead of gradient descent. Is there a typo that "+" should be replaced with "-"?"

Thank you for catching this. Yes, "+" in Eq. (8) and Eq. (13) are typos. We have corrected them to "-".

"The notation system for section 1 is broken. There are five different loss functions \ell, LL, 0\ell_0, L(,g)L(\cdot, g). Yet their relationships are largely missing."

Throughout the paper, we focus on two kinds of losses under nuisance: (1) the individual-level loss $\ell(\theta, g; z)$ and (2) the population-level loss $L(\theta, g) = E[\ell(\theta, g; z)]$. In Sec. 1, we want to introduce (1) and (2) starting from losses without nuisances, which readers in optimization and machine learning are familiar with. These losses without nuisance can also be understood as the losses under the true nuisance, $\ell(\theta, g_0; z)$ and $L(\theta, g_0)$, for which we used $\ell_0$ and $L_0$.
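In symbols, the relationships are simply

$$\ell_0(\theta; z) := \ell(\theta, g_0; z), \qquad L(\theta, g) := E[\ell(\theta, g; Z)], \qquad L_0(\theta) := L(\theta, g_0) = E[\ell_0(\theta; Z)].$$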

"Line 988, it should be "(ii)"?"

It should be (iii), as (ii) would be considered when deriving guarantees in terms of function-value suboptimality as opposed to the squared distance to $\theta_\star$. While these criteria can sometimes be easily mapped between one another in traditional optimization analyses, incorporating Neyman orthogonality may be non-trivial for other proof structures. For example, consider the argument of Lemma 9 in Appx. C, which employs NO. The function-value gap is lower bounded by a term from which the nuisance bias term is extracted.

"Although codes are provided, no experiments are demonstrated in the paper."

Thank you for raising this point. Please see the response to Reviewer rpHD for additional experiments with SGD/OSGD on the partially linear model.

Comment

Thank you for the detailed response. Most of my concerns are addressed.

Interpreting NO as variance reduction would make it more approachable to optimization readers.

Since $S_\theta$ is just the gradient of $\ell$ with respect to $\theta$, why not merge the two notations to simplify the presentation?

Regarding the relationship between $g$ and $\theta$ as well as $\Gamma$: it seems to me that the ground-truth nuisance $g_0$ is defined by some other estimation/optimization problem, $\theta_*$ is the optimal solution of our focused problem given $g_0$, and $\Gamma_0$ is defined by both $\theta_*$ and $g_0$. In this regard, the paragraphs around equation (11) should make this relationship more explicit when defining $\Gamma_0$. The current presentation is a bit vague regarding such dependence. In addition, it should be phrased as a lemma in the main text that the proposed NO gradient form (11) satisfies the NO gradient Definition 2.

Regarding the Appendix F.3 interleaving updates, I don't quite see if the updates of $g$ depend on $\theta$. Did I miss anything?

Comment

Thank you for your comment; we are happy to hear that our response helps address your concerns. We address each of the remaining points below.

"Interpreting NO as variance reduction would make it more approachable to optimization readers."

Thank you for this valuable feedback. We will add this variance reduction interpretation to the final version.

"Since $S_\theta$ is just the gradient of $\ell$ with respect to $\theta$, why not merge the two notations to simplify the presentation?"

As described in Eq. (3), the stochastic gradient oracle $S$ does not have to be the exact gradient $\nabla_\theta \ell$. Using $S_\theta$ and $S_{no}$ helps keep the notation for all stochastic gradients unified in our paper. In addition, since our motivation comes from semiparametric inference, where $S_\theta$ is usually used as the score function (the gradient of the log-likelihood function w.r.t. $\theta$), we wanted to keep the ‘score function’ interpretation by adopting the same notation. We are happy to change this notation.

"Regarding the relationship between $g$ and $\theta$ as well as $\Gamma$, ..., paragraphs around equation (11) should make this relationship more explicit when defining $\Gamma_0$."

Thank you for this valuable suggestion. Yes, the relationship among $\theta_\star$, $g_0$, and $\Gamma_0$ is exactly as you stated. For the definition of $g_0$, we want to use Examples 1-3 in Sec. 2 to emphasize that $g_0$ can usually be written as a conditional expectation that can be learned by a wide range of machine learning methods, like ridge regression. For $\Gamma_0$, we hope to emphasize its dependence on $\theta_\star$ and $g_0$ via Eq. (9), where $\theta_\star$ and $g_0$ appear in the optimization problem. We can reiterate this relationship when we propose Eq. (11) if you find this helpful.

"In addition, it should be phrased as a lemma in the main context that the proposed NO gradient form (11) satisfies the NO gradient definition 2."

Thank you for this valuable suggestion. We include Lemma 13 in Appx. D, where we show that the NO gradient oracle (11) is Neyman orthogonal at $(\theta_\star, g_0)$; it was deferred to the appendix due to the page limit. We will bring this lemma to the main text in the final version.

"Regarding Appendix F.3 interleaving updates, I don't quite see if updates of gg depending on θ\theta. Did I miss anything?"

You are right. In many cases, the updates of $g$ do not necessarily depend on $\theta$. To see this, consider Example 1 in Sec. 2, where $g_0 = (E[Y \mid W], E[X \mid W])$, which can be learned using independent observations $(X, W, Y)$ only, as the minimizers of the following optimization problems:

$$E[Y \mid W] = \arg\min_{f \in L_2(P_W)} E[(f(W) - Y)^2] \quad \text{and} \quad E[X \mid W] = \arg\min_{f \in L_2^d(P_W)} E[\Vert f(W) - X \Vert_2^2].$$
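A minimal sketch of this point (our illustrative code, not the paper's implementation; the data-generating choices below are hypothetical): both conditional expectations are fit by empirical risk minimization on observations of $(X, W, Y)$ alone, so $\theta$ never enters.

```python
# Sketch: learning g0 = (E[Y|W], E[X|W]) by ERM, independently of theta.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n = 500
W = rng.normal(size=(n, 1))
X = np.hstack([np.cos(W), np.sin(W)]) + 0.1 * rng.normal(size=(n, 2))
Y = X @ np.array([-0.5, 1.0]) + np.sin(W[:, 0]) + 0.1 * rng.normal(size=n)

# Two independent kernel ridge regressions on W only -- theta plays no role.
g_hat_Y = KernelRidge(kernel="rbf", alpha=1e-2).fit(W, Y)   # estimates E[Y|W]
g_hat_X = KernelRidge(kernel="rbf", alpha=1e-2).fit(W, X)   # estimates E[X|W]
```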

Comment

Thank you for the detailed response. All my concerns have been fully addressed. I greatly appreciate that the paper presents novel and original ideas, demonstrates applicability across different domains, and provides a comprehensive comparison along with a rigorous analysis of the proposed methods. Accordingly, I have raised my score to 6. I would like to thank all the authors for their efforts in producing such a high-quality piece of work. It clearly reflects a substantial amount of work and is well-deserved.

Comment

@reviewer: Many thanks for your engagement and efforts.

@authors: If you have a chance within the discussion period, could you please respond to the following questions? Some of these may overlap with questions already raised by the reviewers.

  1. The construction of $\hat{\Gamma}$ is central to the OSGD update in equation (13). In lines 241–242, you refer to Appendix F, but the description is somewhat vague. For instance, Appendix F is quite long and contains several subsections. It would be helpful to understand how challenging it is to design an iterative algorithm that estimates both $\hat{g}$ and $\hat{\Gamma}$ using only the data stream (without requiring access to the entire dataset). Ideally, it would be great to include a complete algorithm for OSGD in the main text and provide a running example. I assume that the current Examples 1–3 are not well suited for this purpose since they yield orthogonal SGD automatically.

  2. In lines 300–303, you mention that your approach is complementary to Chernozhukov et al. (2022), “Automatic Debiased Machine Learning of Causal and Structural Effects.” Could you elaborate on the algorithmic differences beyond the obvious point that you use SGD while they rely on full-data methods?

Comment

Thank you for your comment; we answer each of your questions below.

"The construction of Γ^\hat \Gamma is central to the OSGD update in equation (13). In lines 241–242, you refer to Appendix F, but the description is somewhat vague."

Thank you for pointing this out. In lines 241-242, we actually intended to refer to Appx. F.3. We will make the reference more precise in the final version.

"It would be helpful to understand how challenging it is to design an iterative algorithm that estimates both g^\hat g and Γ^\hat \Gamma using only the data stream."

As described in lines 1085-1097, the nuisance estimation of $\hat g$ can be easily understood as a training process on the nuisance data stream only. For $\hat\Gamma$, lines 1112-1116 suggest that in general we should plug in both the nuisance estimator $\hat g$ and the target estimator $\hat\theta$, which is possible if we consider the interleaving-update setting in Appx. F.3. However, in some cases, $\Gamma_0$ is independent of $g_0$ and $\theta_\star$. In our response to Reviewer ZwEM, we showed the example of the partially linear model, where $\Gamma_0: g \mapsto E[E[X \mid W]g(W)]$ by Eq. (69) in Appx. D.1. We can then learn $\Gamma_0$ simply by learning $E[X \mid W]$ using observations of $(X, W)$.

"Ideally, it would be great to include a complete algorithm for OSGD in the main text and provide a running example. I assume that the current Examples 1–3 are not well suited for this purpose since they yield orthogonal SGD automatically."

Thank you for your valuable suggestion. We will include the partially linear model with the non-orthogonal squared error loss in the main text of the final version (currently introduced in Appx. B.1.2).

"In lines 300–303… ​​Could you elaborate on the algorithmic differences beyond the obvious point that you use SGD while they rely on full-data methods?"

There are four major differences from Chernozhukov et al. (2022).

  1. While their methods implicitly assume access to an error-free full optimization scheme, we provide explicit optimization guarantees for stochastic gradient algorithms. This is important because an error-free full optimization scheme is unrealistic in practice.
  2. The optimization schemes we study are stochastic in nature. This is particularly relevant because stochastic methods have become ubiquitous in traditional machine learning settings, yet their behavior under nuisances has remained unanalyzed.
  3. Their assumptions require $S_\theta(\theta, g; z)$ (or $m(w, \gamma)$ in their paper) to be linear in the nuisance $g$ (or $\gamma$ in their paper) so that a Riesz representer can be obtained; we do not need such an assumption. Instead, we only assume that $D_g S_\theta(\theta_\star, g_0)[g]$ is linear in $g$ and $D_g^2 L(\theta_\star, g_0)[g_1, g_2]$ is bilinear in $g_1, g_2$ at the true parameters $(\theta_\star, g_0)$, which can be easily satisfied as long as the loss function $\ell$ has adequate continuity. Please see Appx. D for the continuity we need.
  4. Their analysis uses a Neyman orthogonal moment function, which boils down in our framework to assuming a squared loss function $\ell(\theta, g; z) = \frac{1}{2}(\theta - m(z, g))^2$, while our method applies to more general loss functions.
Comment

@author(s): Thank you for your detailed responses. The review team will take them into account in our evaluation.

Review
Rating: 5

This paper considers the problem of nuisanced stochastic optimization, where the loss depends as an input on an unknown nuisance parameter $g \in \mathcal G$. The paper provides last-iterate convergence rates for the vanilla SGD and Orthogonal SGD algorithms with a plug-in estimate $\hat g$ of $g_0$ (the "ground-truth" nuisance parameter).

The OSGD algorithm adds an orthogonalizing factor to the otherwise non-orthogonal gradient (coinciding with SGD when it is already orthogonal), leading to a faster convergence rate.

Strengths and Weaknesses

  • The paper is well motivated and (mostly) well written. The results are involved, sound, and insightful. While the concept of Neyman orthogonality existed in the literature, the gradient orthogonalization procedure is novel and well justified, as it accelerates convergence.

  • Occasionally symbols that appear in the main text are defined in the appendices. This is particularly the case around Theorem 2, the primary contribution of this paper.

  • There might be a lack of demonstration that Assumption 6 holds in any of the provided examples.

Questions

  • Can the authors provide detailed justification that Assumption 6 can be satisfied, and that Theorem 2 is thus non-vacuous?

  • How is the functional estimate $\hat g$ computed in the partially linear model, and what is the choice of the norm $\Vert \cdot \Vert_{\mathcal G}$ in this function space? These details need to be addressed in B.1 so that readers may understand the full procedure of OSGD on this example.

Limitations

Yes

Final Justification

I thank the authors for their detailed rebuttal. In particular, I followed their example demonstrating the non-vacuity of Assumption 6 and found it convincing. This now seems to me a quite reasonable assumption as far as Theorem 2 (being the first such last-iterate analysis on OSGD) goes.

I have revised my rating accordingly.

Formatting Concerns

Are these colored backdrops allowed?

Author Response

Thank you for your review and thorough feedback. We address each of your comments below.

“Occasionally symbols that appear in the main text are defined in the appendices.”

Due to the 9-page limit at submission, we moved Asm. 6 to the appendix. However, the notation in Asm. 6, as stated in line 259, can be understood similarly to the notation in Asm. 3. We will make this clearer in the final version.

“There might be a lack of demonstration that Assumption 6 holds in any of the provided examples… Can the authors provide detailed justification that Assumption 6 can be satisfied...?”

Thank you for this feedback. As we discuss in lines 906-912, Asm. 6 can be satisfied easily for an orthogonal loss. For the case of a non-orthogonal loss, we consider below the example in Appx. B.1.2 and derive mild sufficient conditions for Asm. 6 to be satisfied. We will incorporate this argument into the final version to improve the paper.

Consider the non-orthogonal loss defined as $\ell(\theta, g; z) = \frac{1}{2}(y - g(w) - \langle x, \theta \rangle)^2$. By the definition of the derivative operator in Def. 1, we have the computations

  • $D_g \ell(\theta, g; z)[h] = -(y - g(w) - \langle x, \theta \rangle)h(w)$,
  • $D_\theta D_g \ell(\theta, g; z)[h, \theta - \theta_\star] = \langle x, \theta - \theta_\star \rangle h(w)$,
  • $D_g^2 \ell(\theta, g; z)[h_1, h_2] = h_1(w)h_2(w)$.

Thus, by the definitions in lines 851 and 852, we have that $H_{\theta g} = E[X \mid W]$ and $H_{gg} = I$ (the identity operator). According to line 858, the NO gradient oracle is defined as

$$S_{no}(\theta, g; z) = -(y - g(w) - \langle x, \theta \rangle)(x - E[X \mid W=w]).$$

Under the partially linear model defined in Eq. (16), we have

  1. $g_0(W) = E[Y - \langle X, \theta_0 \rangle \mid W]$ and $\theta_0 = \theta_\star$. Plugging this into $S_{no}(\theta_\star, g_0) = E[S_{no}(\theta_\star, g_0; Z)]$, we see that it is the exact gradient of the orthogonal loss in Appx. B.1.1. Thus, $S_{no}(\theta_\star, g_0) = 0$.
  2. $D_g \ell(\theta_\star, g_0)[h] = E[-\epsilon h(W)] = 0$.

Thus, Asm. 6(a) is satisfied. Asm. 6(b)-(d) can be easily verified directly from the definitions, following the same analysis as the proof of Lem. 3 in Appx. B.1.1 or Lem. 4 in Appx. B.1.2. Lastly, Asm. 6(e) holds since $D_g^2 S_{no} = 0$ by Def. 1.

“How is the functional estimate $\hat{g}$ computed in the partially linear model, and what is the choice of the norm $\Vert \cdot \Vert_{\mathcal{G}}$ in this function space?”

Thank you for raising this point. In general, the function $g$ will often have both a closed-form expression and a variational expression depending on the data-generating distribution. This naturally leads to $\hat g$ trained via standard empirical risk minimization and a norm based on the chosen function class. Consider the following examples:

  1. For the orthogonal loss in Appx. B.1.1, the nuisance consists of two conditional expectations $(E[Y \mid W], E[X \mid W])$, which can be estimated using nonparametric methods like random forests and kernel methods under squared error loss. The corresponding nuisance norm $\Vert \cdot \Vert_{\mathcal{G}}$ is defined in Eq. (22) in the proof of Lemma 3.
  2. For the non-orthogonal loss in Appx. B.1.2, the nuisance is $g_0$ itself, which is usually modeled as the conditional expectation $E[U \mid W]$, where $U$ is the additional feature vector. Thus, to estimate $g_0$, we can proceed as in the previous item. The norm $\Vert \cdot \Vert_{\mathcal{G}}$ is defined in Eq. (31) in the proof of Lemma 4. We will incorporate this discussion in the final version.

"Are these colored backdrops allowed?"

We leave it to the AC’s discretion and will gladly remove them if asked.

Comment

I thank the authors for their detailed rebuttal. In particular, I followed their example demonstrating the non-vacuity of Assumption 6 and found it convincing. This now seems to me a quite reasonable assumption as far as Theorem 2 (being the first such last-iterate analysis on OSGD) goes.

I have revised my rating accordingly.

Review
Rating: 2

SUMMARY: The paper studies learning problems where the target objective depends on an unknown nuisance parameter. It establishes stochastic gradient convergence guarantees under the assumption of Neyman orthogonality and shows that even when orthogonality is violated, orthogonalized gradient updates can still achieve linear convergence. The work is relevant for causal inference and semiparametric estimation, though its theoretical results appear limited to convex objectives and standard stochastic gradient descent without extensions such as momentum. The paper lacks any empirical validation.

Strengths and Weaknesses

STRENGTHS:

  1. The paper presents interesting convergence guarantees for SGD under nuisance estimators and Neyman orthogonality assumptions.
  2. Figure 1 offers a very clear explanation of the paper's motivation; I would suggest moving it closer to the introduction.
  3. The Theorem in the paper is conveniently presented, and contrasted between the nuisance sensitivity conditions.

WEAKNESSES:

  1. The paper introduces Neyman orthogonality. Since Neyman orthogonality is a highly technical concept, it should be accompanied earlier by intuitive explanations. If many of the intuitions for Neyman orthogonality are borrowed from double machine learning and causal inference, it would be convenient to add explanations of those concepts in the paper or its abstract.
  2. In spite of Section 3 being clearly explained and illustrated, the abstract and Section 1 lack a clear motivation for why learning under nuisance is important.
  3. The paper's organization and motivation can be improved.
  4. The paper lacks any empirical validation of the proposed method.

Questions

QUESTIONS:

  1. What is the connection between Neyman orthogonality and the unbiasedness of the gradient estimators?
  2. Why don't the authors apply their method to any DML or OSL problem? Can they show that they obtain an unbiased estimator of a data-generating process?

Limitations

LIMITATIONS

  1. The results are constrained to vanilla SGD; however, most modern SGD algorithms use momentum or are Adam variants. Perhaps the authors could extend their results toward these optimizers.
  2. The results are restricted to convex objective assumptions; neural networks do not have convex loss landscapes.

Final Justification

The lack of empirical studies and proper justification made me update my score from borderline accept to reject.

Formatting Concerns

No paper formatting concerns.

Author Response

Thank you for the thorough and insightful feedback. We address your comments below. We have also incorporated your comments on improving the motivation (by elaborating on existing examples) and commenting on the intuitions from DML/OSL in the revision. If you find your concerns are addressed, we kindly request that you increase your score or let us know what else we can provide.

"The results are constrained to Vanilla SGD, however most modern SGD algorithms use momentum or are ADAM variants, perhaps the authors could extend their results towards this optimizers."

Regarding momentum, we first note that due to the presence of noise, momentum variants of SGD do not improve the achievable convergence rates (see Li et al. (ALT, 2022)). However, it has been observed that momentum updates can be mapped one-to-one to averaged SGD under particular averaging assumptions (Defazio, 2022). We proved an $O(1/n)$ convergence result for the $n$-th iterate averaged uniformly across iterations, $\bar{\theta}^{(n)} = \frac{1}{n}\sum_{i=1}^n \theta^{(i)}$, but did not include it in the submission due to the aforementioned limitations of momentum for SGD analyses. Stated in terms of momentum updates, this averaging sequence corresponds to the updates:

$$m^{(n+1)} = \frac{1}{n} m^{(n)} + S_\theta(\theta^{(n)}, \hat{g}, Z_{n+1}) \quad \text{and} \quad \bar{\theta}^{(n+1)} = \bar{\theta}^{(n)} - \eta\left(1-\frac{1}{n+1}\right) m^{(n)}.$$

Regarding Adam variants such as AdaGrad, we comment on recent approaches such as Ward et al. (2019) and Défossez et al. (2022). We rely on the observation that the AdaGrad update for our method can be written as:

$$\theta^{(n)} = \operatorname{argmin}_{\theta \in \mathbb{R}^d} \langle S_\theta(\theta^{(n-1)}, \hat{g}, Z_n), \theta \rangle + \frac{1}{2\eta} \Vert \theta - \theta^{(n-1)} \Vert_n^2,$$

where $\Vert \cdot \Vert_n$ is a norm defined by the variance pre-conditioner in AdaGrad. To prove an analog of Theorem 1, we change the norm of $f_n$ from Lemma 8 and follow a similar analysis, but with a changing norm across iterations. By bounding the differences between the norms, we achieve a similar result with an additional error term. We note that these ideas are in service of providing a novel analysis of Adam/AdaGrad in the strongly convex setting, which would be an independent contribution even in the absence of Neyman orthogonality, and may be beyond the scope of this paper.

We can provide further details in the discussion period due to character limitations.

"What is the connection of Neyman orthogonality and unbiasedness of the gradient estimators?"

There are two notions of bias to consider in this problem. The first is the bias of any estimator for the population gradient $\nabla_\theta L(\theta, \hat{g}) = E_{Z \sim \mathbb{P}}[\nabla_\theta \ell(\theta, \hat{g}, Z)]$, which is zero under the standard stochastic gradient oracle and the NO oracle analyzed in Theorems 1 and 2, respectively.

The second bias results from optimizing $\theta \mapsto L(\theta, \hat{g})$ instead of $\theta \mapsto L(\theta, g_0)$, or in other words, from the nuisance estimation error. Here, Neyman orthogonality allows us to provide precise bounds on the error incurred by optimizing the first function when we are in fact pursuing the optimum of the second.
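To make the distinction concrete (our notation; $\theta_\star(\hat g)$ is shorthand we introduce here for the minimizer of the plug-in objective):

$$\theta^{(n)} - \theta_\star \;=\; \underbrace{\theta^{(n)} - \theta_\star(\hat g)}_{\text{optimization error, unbiased oracle}} \;+\; \underbrace{\theta_\star(\hat g) - \theta_\star}_{\text{nuisance bias}}, \qquad \theta_\star(\hat g) := \operatorname{argmin}_\theta L(\theta, \hat g),$$

and Neyman orthogonality controls the size of the second term in the nuisance error $\Vert \hat g - g_0 \Vert_{\mathcal G}$.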

"The paper lacks any empirical validation of the proposed method."

We have now added an experiment section to illustrate Theorem 2 in the revision, where we compare SGD and OSGD on the non-orthogonal loss based on the partially linear model $Y = \langle \theta_0, X \rangle + \alpha_0(W) + \epsilon$. Specifically, we consider the following data-generating process:

$$\begin{bmatrix} X \\ W \end{bmatrix} \sim N\left( \begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_W \end{bmatrix}, \begin{bmatrix} 1.05I_2 & \lambda I_2 \\ \lambda I_2 & 1.05I_2 \end{bmatrix}\right),$$

where $U = \alpha_0(W) + \xi$, $Y = \langle \theta_0, X \rangle + \alpha_0(W) + \epsilon$, $X, W \in \mathbb{R}^2$, $U, Y \in \mathbb{R}$, $\xi \sim N(0,1)$ and $\epsilon \sim N(0,1)$ are independent Gaussian noises, and $\lambda \in [0,1]$ is used to control the covariance between $X$ and $W$. We define the true parameter $\theta_0 = [-0.5 ~~ 1]$ and the nonlinear function $\alpha_0$ as

$$\alpha_0(w) = 0.5 \cos\left(\frac{w_1 + w_2}{2}\right) + 0.5 \sin\left(\frac{w_1 + w_2}{2}\right).$$

We estimate the nuisances and the orthogonalizing operator nonparametrically using ridge regression with random Fourier features, and we estimate the target parameter using SGD with orthogonalized and non-orthogonalized oracles, respectively. The nuisance error is defined as $\Vert \hat g - g_0 \Vert_{\mathcal{G}}$, where the norm is defined in Eq. (31). The orthogonalizing operator error $\Vert \hat \Gamma - \Gamma_0 \Vert_{\mathrm{Fro}}$ is defined as $\sqrt{\Vert \hat{g}(W) - E[X \mid W] \Vert_2^2}$ since $\Gamma_0: g \mapsto E[E[X \mid W] g(W)]$ has this explicit form.

Below are two tables that summarize the estimation performance of each gradient oracle, where $m$ is the number of samples used to estimate the nuisances and $k$ is the number of samples used to estimate the orthogonalizing operator. We also provide the risk suboptimality $L(\cdot, g_0) - L(\theta_\star, g_0)$ when using the true nuisance parameter and orthogonalizing operator to represent the idealized performance. To understand the results, first note that each choice of $\lambda$ represents a different data-generating distribution (i.e., another dataset).

Table 1: SGD Performance

| $\lambda$ | $m$ | $g_0$ Error | $\theta_\star$ Error | Excess Risk |
|---|---|---|---|---|
| 0.0 | 16 | 1.6851 | 0.0404 | 0.0732 |
| 0.0 | 1024 | 0.2297 | 0.0011 | 0.0012 |
| 0.0 | 0 | 0.0000 | 0.0008 | 0.0011 |
| 1.0 | 16 | 1.8792 | 0.5234 | 0.4485 |
| 1.0 | 1024 | 0.2145 | 0.0019 | 0.0019 |
| 1.0 | 0 | 0.0000 | 0.0008 | 0.0011 |

Table 2: OSGD Performance

| $\lambda$ | $m$ | $k$ | $g_0$ Error | $\Gamma_0$ Error | $\theta_\star$ Error | Excess Risk |
|---|---|---|---|---|---|---|
| 0.0 | 16 | 1024 | 1.6851 | 2.1097 | 0.0270 | 0.0301 |
| 0.0 | 16 | 0 | 1.6851 | 0.0000 | 0.0030 | 0.0047 |
| 0.0 | 1024 | 1024 | 0.2297 | 2.1008 | 0.0091 | 0.0097 |
| 0.0 | 1024 | 0 | 0.2297 | 0.0000 | 0.0011 | 0.0013 |
| 0.0 | 0 | 1024 | 0.0000 | 2.1075 | 0.0106 | 0.0115 |
| 0.0 | 0 | 0 | 0.0000 | 0.0000 | 0.0016 | 0.0019 |
| 1.0 | 16 | 1024 | 1.8792 | 0.6399 | 0.1450 | 0.1995 |
| 1.0 | 16 | 0 | 1.8792 | 0.0000 | 0.0155 | 0.0163 |
| 1.0 | 1024 | 1024 | 0.2145 | 0.6387 | 0.0449 | 0.0529 |
| 1.0 | 1024 | 0 | 0.2145 | 0.0000 | 0.0095 | 0.0087 |
| 1.0 | 0 | 1024 | 0.0000 | 0.6398 | 0.0481 | 0.0506 |
| 1.0 | 0 | 0 | 0.0000 | 0.0000 | 0.0086 | 0.0077 |

The results demonstrate that:

  1. Across values of $\lambda$ in Table 1, when the nuisance estimators approximate the true nuisances well, the performance of SGD under the estimated nuisances is numerically indistinguishable from the performance of SGD under the true nuisance.
  2. Across values of $\lambda$ in Table 2, the performance of OSGD largely depends on the quality of the NO estimator and the correlation between $X$ and $W$. When the nuisance estimation is accurate, the $\theta_\star$ error decreases, as demonstrated in Theorem 2.
  3. Comparing Table 2 with Table 1, using the estimated NO gradient oracle reduces the $\theta_\star$ error when $m = 16$, which aligns with Theorem 2. A large correlation between $X$ and $W$ impacts target estimation when using an approximated nuisance and an approximated NO gradient oracle for OSGD. However, even with a bad nuisance estimator, using a well-approximated NO gradient oracle can still improve the target estimation.

As discussed in Appx. F.3, we also run SGD and OSGD on the non-orthogonal loss with updated nuisances and an updated NO oracle, using the same data-generating process above. For the nuisance and NO estimation, we use a kernel-based estimator with random Fourier features and SGD. When we update the nuisance and the NO oracle 10 times with batch sizes 16 and 1024, respectively, and iterate SGD 100 times with batch size 10 between every two nuisance updates, we achieve a relative error of 0.0008 for SGD and a smaller relative error of 0.0005 for OSGD when $\lambda = 0$. This implies that both SGD and OSGD estimate the target accurately, since both the nuisance and the NO gradient oracle are well estimated through the updates, and that OSGD reduces the relative error, consistent with Prop. 20 and Prop. 21 in our paper.

For real data analysis, we consider the Diabetes 130-Hospitals Dataset, where we use 6 of its features as covariates, such as the change indicator in diabetic medications (change) and the time in hospital (time_in_hospital). We conduct the analysis based on the partially linear model in Example 1 and Appx. B.1, so that we can use a synthetic outcome based on this model instead of a real outcome to examine the performance of our proposed methods. Using a synthetic outcome is common in causal inference; see, e.g., Sec. 4.1 of [Nie et al. (2021)](https://arxiv.org/abs/1712.04912). For comparison, we perform SGD and OSGD on the non-orthogonal loss. For nuisance estimation with batch size 64, SGD achieves a relative error of 0.0095, while with the same nuisance batch size, OSGD achieves a smaller relative error of 0.0013 with an operator estimated with batch size 1024. We also have experiments with SGD on the orthogonal loss, which we can provide during the discussion period due to space limitations. This comparison would illustrate the nuisance-sensitive and insensitive bounds from Theorem 1.
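For readers who want to reproduce the flavor of this comparison, here is a minimal sketch (our illustrative code, not the paper's implementation; oracle nuisances stand in for the trained random-Fourier-feature estimators, and the helper names are ours):

```python
# Sketch of SGD vs OSGD on the partially linear model above. Under this
# Gaussian DGP, E[X | W = w] = (lambda / 1.05) * w, so Gamma_0 has a closed form.
import numpy as np

rng = np.random.default_rng(0)
theta0 = np.array([-0.5, 1.0])
lam, eta, n_steps = 0.5, 0.05, 5000

def alpha0(w):
    s = (w[..., 0] + w[..., 1]) / 2
    return 0.5 * np.cos(s) + 0.5 * np.sin(s)

def sample(n):
    cov = np.block([[1.05 * np.eye(2), lam * np.eye(2)],
                    [lam * np.eye(2), 1.05 * np.eye(2)]])
    XW = rng.multivariate_normal(np.zeros(4), cov, size=n)
    X, W = XW[:, :2], XW[:, 2:]
    Y = X @ theta0 + alpha0(W) + rng.normal(size=n)
    return X, W, Y

g_hat = alpha0                          # oracle nuisance g0(w)
mu_hat = lambda w: (lam / 1.05) * w     # oracle E[X | W = w]

theta_sgd, theta_osgd = np.zeros(2), np.zeros(2)
for _ in range(n_steps):
    X, W, Y = sample(1)
    x, w = X[0], W[0]
    resid_sgd = Y[0] - g_hat(W)[0] - x @ theta_sgd
    resid_osgd = Y[0] - g_hat(W)[0] - x @ theta_osgd
    theta_sgd -= eta * (-resid_sgd * x)                    # plain oracle
    theta_osgd -= eta * (-resid_osgd * (x - mu_hat(w)))    # NO oracle, cf. S_no

print("SGD error:", np.linalg.norm(theta_sgd - theta0))
print("OSGD error:", np.linalg.norm(theta_osgd - theta0))
```

Replacing `g_hat` and `mu_hat` with imperfect estimators (fit on $m$ and $k$ held-out samples, respectively) reproduces the trade-offs summarized in Tables 1 and 2.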
Comment

@reviewer: Please take a moment to review the authors’ rebuttal and share your feedback. It would also be useful if you could consider the other reviews and their corresponding rebuttals.

Comment

Momentum variants of SGD might not improve the theoretical convergence rates, yet empirically they converge faster than vanilla SGD. This paper is still limited in its general interest, as results restricted to SGD are of theoretical rather than practical interest.

Comment

Estimating a partially linear regression model is not a standard use of SGD algorithms, as small models and causal inference with these kinds of models are often preferably optimized with QR-based solvers, which ensure robust causal inference. Running SGD on these kinds of causal inference experiments is not standard, and its selection as a baseline amounts only to an ablation study, not a proper empirical study.

Comment

Dear Area Chair,

As a reviewer, I am expressing my disappointment with the low-quality review provided by Reviewer rpHD. Original research ideas start with simple, easy-to-handle concepts/methods. Building an orthogonal gradient estimator is the first step for any gradient-based method under nuisances. The paper has done a good job of building such gradient estimators and thoroughly understanding their properties, which could potentially start a new interdisciplinary research direction between optimization and statistics. Reviewer rpHD lowered the score from 4 to 2 for ridiculous reasons. I carefully checked the response by the authors; there are clear numerical benefits of OSGD against SGD in Table 2. As for extending to momentum/variance reduction/adaptive stepsizes/neural networks and further numerical studies, these can be done as homework for graduate students instead of writing a paper.

Final Decision

This paper studies stochastic gradient methods for learning problems where the objective depends on unknown nuisance parameters. The authors provide non-asymptotic convergence guarantees, showing that classical stochastic gradient algorithms can still converge under conditions such as Neyman orthogonality. Additionally, they propose an algorithm with approximately orthogonalized updates that achieves similar performance even when strict orthogonality is not satisfied. Applications to orthogonal statistical learning and causal inference are discussed.

Although there is a substantial divergence in reviewer opinions, I find the core contributions both novel and technically solid. The theoretical insights into how nuisance parameters affect convergence, and the development of a practical solution through approximate orthogonalization, make this work a valuable addition to the NeurIPS community.

I recommend acceptance, with the expectation that the authors carefully address all outstanding concerns raised during the discussion in the camera-ready version.