PaperHub
Score: 8.2/10 · Spotlight · 4 reviewers
Ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.5
Novelty: 3.3 · Quality: 3.3 · Clarity: 2.8 · Significance: 3.3
NeurIPS 2025

An Analysis of Causal Effect Estimation using Outcome Invariant Data Augmentation

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We show the effectiveness of data augmentation for mitigating bias due to unobserved confounding, and this motivates the proposal of our novel method for the same.

Abstract

Keywords
Causal Inference, Interventions, Instrumental Variable Regression, Invariance, Data Augmentation

Reviews and Discussion

Review (Rating: 5)

This study proposes treating outcome-invariant data augmentation (DA) as a form of soft intervention on the treatment in causal effect estimation. The proposed method aims to mitigate hidden confounding bias by introducing instrumental-variable-like (IVL) regression, which interprets data augmentation parameters as weak instruments and incorporates empirical risk minimization (ERM) regularization. The authors theoretically and empirically demonstrate that this approach outperforms standard ERM and conventional DA methods in terms of intervention generalization and estimation bias.

Strengths and Weaknesses

Strengths:

  • S1: The paper addresses a valuable topic and presents an interesting and novel methodological contribution.
  • S2: The figures and notation are well-organized, and the presentation is easy to follow.
  • S3: The paper is the first to place data augmentation (DA) within a causal inference framework, showing that under the outcome invariance assumption, DA is equivalent to a soft intervention on the treatment. The introduction of IVL regression relaxes the requirements of classical IV methods and is theoretically shown to reduce confounding bias under certain conditions. The practical interpretations of the theoretical results are also insightful.

Weaknesses:

  • W1: While the topic of improving predictive performance via data augmentation is important, the manuscript lacks a discussion on the scope of applicability of the proposed method and its connections to the classic literature. In particular, I would like to see the authors discuss how their approach relates to and differs from the well-known Independent Causal Mechanisms (ICM) principle. Specifically, under what conditions does enriching the marginal distribution $P(X)$ (via data augmentation) help or fail to improve the conditional distribution $P(Y \mid X)$?
  • W2: Although the theoretical analysis is thorough and informative, the empirical improvements are limited, especially on the Colored-MNIST dataset, where DA+ERM already accounts for most of the gains, and the additional benefit from combining DA with IVL is marginal.
  • W3: The effectiveness of the proposed DA+IVL method in strictly reducing confounding bias relies on the assumption that the augmentation perturbs latent directions aligned with unobserved confounding, which is untestable. Moreover, the theoretical properties also depend on additional assumptions, such as joint diagonalizability of covariances, which may not hold in practice.

Questions

Please address the weaknesses listed above.

Limitations

The authors did not discuss the limitations of their method in the main text. In my view, while the proposed method introduces a novel perspective on data augmentation, its theoretical properties rely on some untestable assumptions, such as the inclusion of unobserved confounding information in the augmented data. If my understanding is incorrect, please feel free to correct me.

Final Rating Justification

The authors provide a clear and detailed discussion of the relationship between this work and Independent Causal Mechanisms and offer a fundamental analysis of the untestable assumption I raised. I believe that the revised version incorporating these points exceeds the conference’s acceptance threshold, and therefore, I recommend accepting this paper.

Formatting Concerns

NA

Author Response

We thank the reviewer for their constructive feedback and comments. We have addressed their concerns and questions below.

W1: Independent Causal Mechanism (ICM) Principle

We find the reviewer’s comment about the relation of our model to the ICM principle by Peters et al. [1] quite interesting, especially since cyclic SEMs have been around for quite a bit longer than the ICM principle itself, with roots in econometrics from as early as the 1940s [2]. Despite this, we are not aware of any explicit reference to the principle in the modern literature on cyclic SEMs, including in works by Peters et al. [1][2][3] Nevertheless, we can observe the following.

Similarity — Structural equations as ICMs: Our SEM formulation in Eq. (1) is a direct instantiation of the ICM principle as described by Peters et al. [1] The two equations represent the autonomous mechanisms, and their independence is captured by the mutual independence of the exogenous noise terms $N_X, N_Y$. The simultaneity in our model is not a violation of ICM, but rather the equilibrium state resulting from the interaction of these two independent mechanisms. Assuming the existence of this equilibrium is a statement about the scope of systems under analysis, and not about the nature of the mechanisms themselves. Indeed, surgically changing $\tau$ to some $\tau'$, for example, does not in itself alter $f$, and vice versa. And precisely because of ICM, this may or may not make the system unstable depending on the nature of $\tau'$, but in our setting Theorem 4 (Appendix G.7) shows that soft interventions induced by outcome-invariant DA are always stable.

Difference — Markov factors as ICMs: We speculate that this is the main motivation for reviewer’s comment about ICMs. Indeed, in most settings, ICMs are discussed in terms of more familiar Markov factorization of joint distribution of the system since such a factorization always exists in DAGs. However, such a factorization generally does not exist for cyclic models, which is precisely why they are almost exclusively represented as simultaneous structural equations.

We thank the reviewer for this very insightful and thought-provoking question, and agree that this connection to ICMs should be made explicit in the manuscript. To address this in the preliminaries Section 2 we will make clear that our model's structural equations are a direct instantiation of the ICM principle and point to a more detailed commentary in Appendix B as an appropriate place to elaborate nuances since it is exclusively dedicated to discuss cyclic SEMs.

Success and failure modes w.r.t. ICM

Given the above clarification that structural equations are the more appropriate representation of ICMs in cyclic models, we believe our manuscript already addresses the reviewer’s request for success/failure modes with respect to the ICMs $\tau$, $f$, etc. We provide a recap (and an additional comment) for completeness.

The task is to improve estimation of $f$, which corresponds to the ICM of the outcome $Y$. We do this by changing/intervening on the ICM $\tau$ of the treatment $X$ while keeping all other ICMs in the system unchanged. This is done via DA, and it succeeds or fails in the following cases.

  1. Outcome-invariant DA: Observation 1 shows that outcome-invariant DA is equivalent to surgically changing the treatment ICM $\tau$ to $G\tau$ while keeping all other ICMs in the system unaffected. As such, outcome-invariance complies with the ICM principle.
    1. Success — perturbing spurious features: Theorem 2 shows that this reduces confounding bias when the DA aligns with spurious features.
    2. Benign failure — no spurious features perturbed: Theorem 2 also shows that this does not reduce confounding bias when the DA is orthogonal to spurious features. Fortunately, however, this does not perform worse than ERM, so the failure is rather benign compared to the following case (a toy numerical sketch of both cases follows this list).
  2. Malignant failure(?) — outcome-invariance violation: Although outside the scope of our current work, we conjecture that a violation of outcome-invariance is equivalent to a violation of the ICM principle itself. Since outcome-invariance is necessary for DA equivalence to IVL, a violation would mean that the DA affects not only $X$ but also $Y$, so that it is now a confounder. We believe this to be true because violations of outcome invariance in DA have been shown to bias estimates in works that do not necessarily study causality [6]. Although this raises many interesting questions in itself, we defer a further formal investigation to future works in order to keep the current submission focused.
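To make the success vs. benign-failure dichotomy concrete, below is a minimal simulation sketch. It is our own toy construction (not the paper's Example 2), with arbitrary coefficients and a moderate augmentation strength; it regresses the untouched outcome on augmented treatments and reports only the confounded coefficient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 500_000, np.array([1.0, 1.0])    # true causal coefficients (toy choice)

def ols(X, y):
    # least-squares coefficients of y on X (all variables are zero-mean)
    return np.linalg.lstsq(X, y, rcond=None)[0]

C = rng.normal(size=n)                     # hidden confounder
X = rng.normal(size=(n, 2))
X[:, 0] += C                               # C enters X only through feature 0
y = X @ beta + 2.0 * C + rng.normal(size=n)

G = rng.normal(size=n)                     # DA parameter, independent of C and y
X_spur = X.copy(); X_spur[:, 0] += G       # DA perturbs the spurious feature
X_orth = X.copy(); X_orth[:, 1] += G       # DA orthogonal to the spurious feature

print(ols(X, y)[0])       # ~2.00: ERM, confounded coefficient biased (true: 1.0)
print(ols(X_spur, y)[0])  # ~1.33: success -- confounding bias shrinks
print(ols(X_orth, y)[0])  # ~2.00: benign failure -- bias unchanged vs. ERM
```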

Scope of applicability

Given the above classification, it is natural, as the reviewer points out, to ask whether the conditions required for our method’s success are reasonable. In the next section, we argue that they are—particularly when viewed through the appropriate lens of instrumental variables (IVs).

W3 & Limitations: On Untestable Assumptions

We agree with the reviewer that the untestability of assumptions is a central concern, and appreciate the opportunity to clarify this point more directly in the main text.

1. IVs Are the Right Benchmark

We believe IVs are the appropriate lens for judging whether our assumptions are "too strong." Our IV-like variables (IVLs) are a direct generalization of IVs—designed to be more available by relaxing the outcome relevance condition—while still aiming to reduce confounding. This generalization naturally inherits the core limitations of IVs: assumptions like exclusion restriction and unconfoundedness are not testable from observational data and must be justified via domain knowledge.

Similarly, outcome-invariant data augmentation (DA), which we show is equivalent to IVL by construction (Observation 2), carries the same interpretability and limitations. In this light, we argue it is inappropriate to expect our method to be free from the very limitations of the framework it generalizes.

$$\overbrace{ \underbrace{\text{outcome-invariance}}_{\text{popular model for DA}} + \underbrace{\text{spurious targets}}_{\text{benign failure if violated}} }^{\text{un-testable DA assumptions}} \iff \overbrace{\text{un-confoundedness} + \text{exclusion}}^{\text{un-testable IV/IVL assumptions}}$$

2. Clarifying Assumptions

  • Outcome Invariance: The central assumption is outcome-invariant DA—analogous to exclusion restriction and un-confoundedness in IVs. While untestable, this is a common and accepted modeling choice in the DA and invariance literature [6–8].
  • Perturbing Spurious Features: This condition is not a new limitation but a direct analogue of the exclusion principle: an IV must enter $Y$ via a spurious feature of $X$ to identify the corresponding confounded component of $f$.
  • Joint Diagonalizability: The reviewer raises a valid concern about Assumptions 1 and 2 behind our theoretical results. These are only required for our use of the Relative Squared Error (RSE) metric, which, as argued by Janzing [9], is more relevant for causal estimation. Our core results on confounding bias reduction still hold under the standard risk-based estimation error $\mathbb{E}^{\mathcal{M}}[\ell(f(X), \hat{h}(X))]$, without requiring Assumptions 1 and 2.

Acknowledging that these clarifications could be better emphasized, we will move and expand the limitations discussion from the Appendix into the main paper (as Section 7), directly addressing the concerns around untestable assumptions and theoretical scope.

W2: On Empirical Improvements

We acknowledge your observation that on Colored-MNIST, the additional benefit of DA+IVL over DA+ERM is modest (although arguably still significant). We would like to offer two points of context:

  1. DA+ERM's performance is a key result: A central contribution of our paper is to provide a causal interpretation for why outcome-invariant DA works: it acts as a soft intervention. The fact that the simple DA+ERM baseline is already so effective is strong empirical validation of this core thesis. The performance of DA+ERM should therefore be seen as a strength of our framework, not a weakness of the DA+IVL extension.
  2. Contextualizing the DA+IVL improvement: While the additional gain from DA+IVL is modest on this specific task, it is still competitive with or better than several established OOD generalization methods (as seen in Figure 5). Furthermore, on the Optical Device dataset, our proposed DA+IVL$_{\text{CC}}$ is the only method that provides a significant performance gain over both the ERM and DA+ERM baselines, demonstrating its value in other settings.

We hope these clarifications have addressed your concerns. We thank you again for your insightful review, which helped us identify key areas for improvement in the presentation of our work. We are confident that by incorporating these points, we can make the paper's contributions and limitations clearer to the reader.

[1]: J. Peters et al. Elements of Causal Inference. MIT Press, 2017.

[2]: C. F. Christ. “The Cowles Commission's Contributions to Econometrics at Chicago, 1939–1955.” J. Econ. Lit., 1994.

[3]: S. Bongers et al. “Foundations of Structural Causal Models with Cycles and Latent Variables.” Ann. Statist., 2021.

[4]: J. M. Mooij et al. “On Causal Discovery with Cyclic Additive Noise Models.” NeurIPS, 2011.

[5]: D. Rothenhäusler et al. “BACKSHIFT: Learning Causal Cyclic Graphs from Unknown Shift Interventions.” NeurIPS, 2015.

[6]: S. Chen et al. A group-theoretic framework for data augmentation. JMLR, 2020.

[7]: C. Lyle et al. On the benefits of invariance in neural networks. arXiv, 2020.

[8]: H. Shao et al. A theory of PAC learnability under transformation invariances. NeurIPS, 2022.

[9]: D. Janzing and B. Schölkopf. Detecting confounding in multivariate linear models via spectral analysis. J. Causal Inference, 2018.

Comment

The authors' detailed response has thoroughly addressed my concerns. I believe the revised version exceeds the acceptance threshold; therefore, I will raise my rating from 4 to 5.

Comment

Thank you for your support and for raising your score. A special thanks as well for the very insightful ICM question -- I really enjoyed answering that one!

Review (Rating: 5)

The work presents a framework for improved ATE estimation in the presence of hidden confounders, when no proper instrumental variables are available. In such a setting the causal effect is not identifiable. The authors propose two complementary approaches to reducing the estimation bias in this setting: regularising the instrumental variable (IV) risk through an additional empirical risk term, and a data augmentation technique relying on applying an outcome-invariant transformation to the data. The authors carry out theoretical analysis of these methods for linear data generating models, and further validate their results through empirical experiments on synthetic and real-world data.

Strengths and Weaknesses

Strengths

  • The authors provide a comprehensive introduction to the topic of causal effect estimation and instrumental variable regression (Section 2), grounding their work in context and making the subsequent sections easier to follow.
  • The authors evaluate their proposed method empirically, on simulation data which respect their imposed assumptions, but also on real-world data where the assumptions might not necessarily hold.

Weaknesses

  • Limited applicability: In their work, the authors consider the problem of causal effect estimation in the presence of hidden confounders, when the causal effect is not identifiable. As in this setting the causal effect cannot be accurately estimated, even in the infinite sample limit, the prevalent view is that the causal effect should not be estimated at all, as it can be arbitrarily far away from the truth (depending on the level of confounding). However, the authors propose a method that improves estimation performance in this setting (when no guarantees on identifiability can be made). This raises two significant questions which, I believe, the authors did not sufficiently address in their manuscript: 1) in what real-world applications would that be useful? 2) if there is no use for specific real-world applications, what are the implications of their results for the broader research in the area of causal effect estimation?
  • Weak theoretical results: The authors analyse the proposed methods (IVL regression and outcome-invariant DA) only in the case of two simple linear data generating models. No theoretical insights beyond these two settings are provided. Even in these two linear settings, the provided results indicate only that the proposed methods can, in some settings, reduce the relative squared error (RSE) compared to the baseline of empirical risk minimisation. However, the results do not guarantee that this reduction will be meaningful (i.e. the RSE can still be prohibitively large). Further, the authors themselves mention that the RSE is a “surrogate” for measuring confounding bias. It would be interesting to see how these results translate into the magnitude of the error on the actual ATE estimate.
  • Large number of assumptions: What further limits the applicability and insightfulness of the presented theoretical results is that they rely on numerous implicit and explicit assumptions. For the IVL regression, aside from Assumption 1, the variable $Z$ still needs to satisfy properties (i)-(iii) in Section 2.3, and to see tangible improvements $\tilde{Z} \perp \tilde{C}$. For DA, the chosen transformation $G$ needs to be outcome-invariant but also perturb $X$ along the spurious latent features originating from $C$. Finding such a transformation (if one exists) seems non-trivial, and so is verifying that a proposed transformation satisfies these criteria.
  • Empirical evaluation in the real-world settings: While the presented method seems to offer advantages in the linear simulation setting agreeing with Examples 1 & 2 (Fig 4), the benefits of the proposed methods are more questionable in the real-world examples (Fig 5). It is not clear whether the method significantly improves upon the baselines, raising further questions about its broader applicability. Further, in the optical device dataset the ground truth label is computed by assuming linearity (which might bias the results). As mentioned above, it would also be interesting to see the results in terms of metrics other than the RSE, which is a surrogate for confounding bias (e.g. error on the ATE estimate).

Questions

  1. In the introduction the authors mention a suite of regularisation techniques relying on $\ell_1$ and $\ell_2$ regularisation, applicable in the similar non-identifiable setting. How do these methods compare to the presented approach, in terms of the precise settings they target, as well as in terms of empirical results?
  2. Could the authors elaborate on the meaning of the set $\mathcal{I}_{\alpha}$ for IVL regression? In particular, what’s the significance/intuitive meaning of the parameter $\alpha$?
  3. Could the authors provide a more detailed justification for Observation 1? This conclusion does not seem immediate to me. Further, what is the formal definition of $G\tau$? What is $G\tau$ in the context of Example 2?

Limitations

As mentioned in the weaknesses section, this work operates in the setting where the causal effects are not identifiable. While the authors are open about this inherent limitation of their method, I think it could be emphasised more that the obtained estimates can be arbitrarily far away from the true causal effects. The work should emphasise more that special care should be taken before applying this method in real-world contexts, particularly as evaluating any causal method using strictly observational data, to gauge the level of resulting error, is infeasible.

Final Rating Justification

The authors addressed most of my concerns in their rebuttal. Specifically, their comments about the 'broader significance' of their work made me more appreciative of the scientific value of their paper and the theoretical analysis it presents. Similarly, their detailed comments comparing the assumptions they are making in the paper with the standard assumptions made in the IV/IVL literature convinced me that this work can be of interest to the broader causal inference community. Overall, I think that the work can provide a good preliminary exploration into the direction of using DA to mitigate confounding bias (as well as providing an understanding of why DA has the potential to help).

The additional experimental results that the authors provided during the rebuttal also make the empirical evaluation of the method more robust, showcasing it as a promising solution.

Some (minor) issues still remain and I think should be addressed by the authors in the revised camera-ready submission:

Changes to the writing:

  • Discussing broader significance in Section 1.
  • Connecting Assumptions 1 and 2 with the RSE metric, justifying that the equivalent results for the EE metric also hold and do not require these assumptions.
  • Including a more comprehensive section about the limitations of the work, making explicit that the proposed method can still lead to arbitrarily large error on the estimated causal effect (even though it is smaller than ERM error).

Formatting Concerns

No major concerns.

Author Response

We thank Reviewer MsXV for their constructive feedback. The reviewer raises crucial points, including about the value of our work in a non-identifiable setting, which we are happy to clarify.

Response to Limited Applicability

Our work's significance is best understood through the following lens:

  1. Practical Value: A Safe, "No-Regret" Improvement: Our primary response is grounded in Theorem 3: under the standard DA assumption of outcome-invariance, our approach is guaranteed to perform no worse than standard ERM. This provides a "no-regret" choice for practitioners. They gain a potential reduction in bias if the data augmentation targets spurious features, but risk nothing compared to the ERM baseline.
  2. Broader Significance (1): Addressing the Upstream Problem for Robustness: Reducing confounding bias, even in non-identifiable settings, is not a niche goal but the upstream problem for OOD generalization and robustness [4]. A model that fails on shifted distributions often does so because it learned spurious correlations (i.e., confounding). Our work, in line with the goals of causal regularization [10-12], tackles this root cause directly. By formalizing why DA can help (as a soft intervention) and when (by perturbing spurious features), we provide a principled path toward building inherently more robust models.
  3. Broader Significance (2): Downstream Implications for:
    1. Partial Identifiability: Lower confounding bias in the data (e.g., via DA) can result in tighter bounds in partial identification [9].
    2. Sensitivity Analysis: Lower confounding bias means reduced sensitivity to confounding, leading to more reliable conclusions from subsequent sensitivity analyses.

The value is in repurposing the already existing, cheap tool of DA to provide a safe and causally-grounded improvement over the ERM baseline, with direct positive implications for the above downstream tasks. We will clarify and explicitly mention applicability in section 1 of our draft.

Theoretical Results, Assumptions, and Metrics

The reviewer's concerns about the theoretical results, assumptions, and the RSE metric are interconnected, and we address them together.

  • RSE is a direct ATE Error Metric. There seems to be a slight misunderstanding here. The Relative Squared Error (RSE) is not a surrogate for the average treatment effect (ATE) error; it is itself a valid ATE estimation error [33,10]. Furthermore, Janzing argues that its structural nature makes RSE more relevant for causal statements than the more familiar risk-based estimation error (EE) $\mathbb{E}^{\mathcal{M}}[\ell(f(X), \hat{h}(X))]$ [33,12], which also motivates our choice. The (un-normalized) RSE is simply $\|\hat{h}-f\|^2$, which directly measures the squared error of the estimate $\hat{h}$ from the true causal function $f$ (which equals the ATE in our model). The normalization is only done to make the metric more interpretable and meaningful for our experiments. We used the term "surrogate for confounding bias" because any choice of estimation error also contains statistical bias, but we see now how this was unclear. We will revise the phrasing to "causal estimation error" in the paper to avoid this confusion. (A minimal sketch of both metrics in the linear case follows this list.)
  • Strong Assumptions are for the Stricter Metric. Our strongest assumptions (Assumptions 1 & 2, regarding joint diagonalizability) are a direct consequence of using the stricter, structural RSE metric. We would like to emphasize that our core results on bias reduction also hold for the standard risk-based EE, without requiring Assumptions 1 and 2. We will make this clarification about the choice of metric vs. generality of results explicit in the revised manuscript. We believe this directly addresses the concerns about the limited insightfulness of our results and choice of estimation error. Should the reviewer still find it helpful, we are also prepared to add experimental results using EE to the appendix, as our core claims hold for that metric as well.
  • On the Practicality of Assumptions. To avoid redundancy, we provide a more detailed discussion on limitation of assumptions in our response to Reviewer CVxW. Please refer to the same.
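For concreteness, here is a minimal sketch of the two metrics in the linear case. This is our own illustrative helper, not the paper's code, and the identity-weighted norm used when no covariance is supplied is an assumption on our part rather than the paper's exact normalization.

```python
import numpy as np

def rse_unnormalized(beta_hat, beta_true, Sigma=None):
    """Structural error ||h_hat - f||^2 for linear h_hat(x) = beta_hat @ x.

    With covariate covariance Sigma, ||h_hat - f||^2 = d' Sigma d for
    d = beta_hat - beta_true; Sigma=None falls back to the identity
    weighting (our assumption, not necessarily the paper's choice).
    """
    d = np.asarray(beta_hat) - np.asarray(beta_true)
    return float(d @ d if Sigma is None else d @ Sigma @ d)

def ee_squared(beta_hat, beta_true, X):
    """Risk-based estimation error E[(f(X) - h_hat(X))^2] over samples X."""
    r = X @ (np.asarray(beta_hat) - np.asarray(beta_true))
    return float(np.mean(r ** 2))
```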

Empirical Evaluation

  • Weakness: Benefits on real-world data are questionable. We respectfully disagree with this characterization. Our results in Fig. 5 show a consistent and often significant improvement.
    • On the Optical Device dataset, our proposed DA+IVL$_{\text{CC}}$ is the only method that meaningfully outperforms both the ERM and DA+ERM baselines.
    • The reviewer mentions that the DA+IVL improvement over DA+ERM is modest (though, we argue, still significant) on Colored-MNIST. We point out that this is still competitive with or better than established OOD generalization benchmarks. Also, we urge that the modest DA+IVL gain should not distract from crediting the undeniable performance of the DA+ERM baseline itself, as DA acting as a soft intervention is also a central contribution of our framework.
  • Weakness: Linearity assumption in the optical device dataset. This is a fair point. We followed the precedent set by [33] who introduced this dataset and also analyzed it under a linearity assumption [10]. While the true mechanism may or may not be linear, this setup allows for a controlled experiment and direct comparison with prior work.

Addressing Specific Questions

  1. Comparison with causal regularization methods.

    [12] involves a purely theoretical analysis of using $\ell_2$ regularization to improve causal estimation in the purely observational $(X, Y)$ setting.

    [10] also considers the purely observational $(X, Y)$ setting, but under special linear modelling assumptions so that estimating $\|f\|$ becomes possible, even if $f$ itself is not identifiable. This information is then used to develop a cross-validation scheme.

    In [11], the authors consider data from two observational $(X, Y)$ sources to develop a cross-validation scheme for $\ell_2$ regularization based on the stability of regression parameters across the two datasets.

    Regarding incorporating these into our experiments, we thank the reviewer for highlighting this area of improvement. Due to time limitations, however, we couldn't include additional experimental results in this rebuttal. Nevertheless, we do plan to complete and incorporate these into the camera-ready revision.

  2. Significance and meaning of $\alpha$ in $\mathcal{I}_{\alpha}$ — the degree of OOD robustness. In Theorem 1, the set $\mathcal{I}_{\alpha}\subseteq\operatorname{span}(\mathbf{K}^\top)$ represents the interventions over which the IVL estimate has robust, or optimal ‘worst-case’, predictive performance. The parameter $\alpha$ can be interpreted as a ‘knob’ that determines how OOD the interventions in this set would be. As $\alpha\rightarrow 0$, we have $\lim_{\alpha\rightarrow 0}\mathcal{I}_\alpha = \operatorname{span}(\mathbf{K}^\top)$, so that our estimates would have robust predictions over arbitrarily large interventions of the form $\operatorname{do}(\tilde{Z}\coloneqq\xi), \;\xi\in\operatorname{span}(\mathbf{K}^\top)$. So, essentially, $\alpha$ determines the trade-off IVL makes between in-distribution risk (i.e., ERM) vs. OOD treatment-intervention risk (i.e., IV regression); a schematic form of this trade-off is sketched below.
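    To make this explicit, one schematic way of writing the regularized objective that is consistent with the description above (our shorthand for this reply, not necessarily the paper's exact formulation) is

    $$\hat{h}_\alpha \in \operatorname*{arg\,min}_h \; \mathcal{R}_{\text{IV}}(h) + \alpha\,\mathcal{R}_{\text{ERM}}(h),$$

    so that $\alpha\rightarrow 0$ recovers pure IV-risk minimization (maximal robustness over treatment interventions), while larger $\alpha$ weights in-distribution predictive performance more heavily.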

  3. On justification and notation of Observation 1

    Justification of Observation 1. Theorem 4 (Appendix G.7) shows that for the SEM $\mathcal{A}$ in Eq. (7),

    $$X^{\mathcal{A};\operatorname{do}(\tau\coloneqq G\tau)} = GX^{\mathcal{A}}, \qquad Y^{\mathcal{A};\operatorname{do}(\tau\coloneqq G\tau)} = Y^{\mathcal{A}}. \qquad (\dagger)$$

    It then immediately follows that $P_{X, Y, G, C}^{\mathcal{A};\operatorname{do}(\tau\coloneqq G\tau)} = P_{GX, Y, G, C}^{\mathcal{A}}$, given that $G$ and $C$ are exogenous.

    Intuition for Observation 1. The intuition behind Eq. ($\dagger$) and, by extension, Observation 1 is that the transformation $GX$ does not change $Y$ by definition of outcome-invariance, despite cycles.

    Clarification of notation for $G\tau$.

    • Definition: $G\tau$ is simply the composition $G\circ\tau$. Given that we have

    $$\tau: \mathcal{Y}\times\mathcal{C}\times\mathcal{N}_X\rightarrow \mathcal{X}, \qquad \delta:\mathcal{X}\times\mathcal{G}\rightarrow \mathcal{X},$$

    then, using $\mathbf{g}\mathbf{x}\coloneqq \delta(\mathbf{x}, \mathbf{g})$ as shorthand for the group action, $G\tau$ is simply

    $$G\tau(\cdot) = G\circ\tau(\cdot) = \delta(\tau(\cdot), G).$$

    • Form in Example 2: For Example 2, $\tau$ has the block-matrix form

    $$\tau(\cdot) = \mathbf{T}(\cdot), \qquad \mathbf{T}\coloneqq[\mathbf{0}_{m} \quad \mathbf{E}^\top \quad \sigma\mathbf{I}_{m\times m}]$$

    so that for any tuple $(Y, C, N_X)$ we have

    $$\tau(Y, C, N_X) = \mathbf{T} \begin{bmatrix} Y \\ C \\ N_X \end{bmatrix}.$$

    Finally, given the definition $GX\coloneqq X + \gamma\mathbf{K}^\top G$, we have

    $$G\tau(\cdot) = G\circ\tau(\cdot) = \mathbf{T}(\cdot) + \gamma\mathbf{K}^\top G.$$
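To make the notation concrete, the following numpy sketch instantiates the composition above. It is a toy instantiation of our own: the dimensions, and the treatment of $Y$ as a scalar, are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)
m, c, k = 3, 2, 2                        # toy dimensions (our choice)
E = rng.normal(size=(c, m))              # hypothetical stand-ins for E, K,
K = rng.normal(size=(k, m))              # sigma, gamma from Example 2
sigma, gamma = 0.5, 1.0

# T = [0_m  E^T  sigma*I], acting on the stacked vector (Y, C, N_X)
T = np.hstack([np.zeros((m, 1)), E.T, sigma * np.eye(m)])

Y, C_, N_X = rng.normal(), rng.normal(size=c), rng.normal(size=m)
G = rng.normal(size=k)                   # sampled DA parameter

x = T @ np.concatenate([[Y], C_, N_X])   # X = tau(Y, C, N_X)
g_x = x + gamma * K.T @ G                # GX := X + gamma * K^T G
g_tau = T @ np.concatenate([[Y], C_, N_X]) + gamma * K.T @ G   # (G tau)(.)

assert np.allclose(g_x, g_tau)           # augmenting X == intervening on tau
```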

Response to Limitations

We fully agree with the reviewer and will explicitly emphasize this context in the camera-ready version. Our goal is not to claim full identification, but to provide a principled step forward from ERM; the residual bias still necessitates caution. Crucially, we will frame this as the reality of the problem setting, in which our method still offers a robust and provable improvement over standard ERM.

We appreciate the reviewer’s feedback and hope our response addresses their concerns. We are confident the clarifications will strengthen the manuscript and better highlight its practical relevance. We also note that the reviewer’s comments and criticisms focus primarily on identifiability, whereas others reviewed our work from a broader perspective.

References: See manuscript.

Comment

Thank you for your detailed response. Your comments (particularly the discussion about broader significance and contextualising the assumptions made throughout the paper) helped me understand the broader context of your work and addressed most of my concerns, hence I increased my score. Nevertheless, I would still strongly encourage introducing the following changes to the paper (some of which you already agreed to add to the camera ready version):

  • Discussing broader significance in Section 1.
  • Connect Assumptions 1 and 2 with the RSE metric, justifying that the equivalent results for the EE metric also hold and do not require these assumptions.
  • Run additional experiments on the optical device dataset, using more complex than linear models as the ground-truth (to further verify that the proposed method also provides non-trivial improvements in non-linear settings).
  • Run experimental comparisons against other causal regularisation methods.
  • Include a more comprehensive sections about the limitations of the work, making explicit that the proposed method can still lead to arbitrarily large error on the estimated causal effect (even though it is smaller than ERM error).
  • As the paper is quite notation-heavy, I would also consider including a short overview and explanation of the notation in the appendix. The notational clarifications you provided in the rebuttal were helpful in following the logic of the paper.
Comment

Thank you very much for your positive feedback and for increasing your score. We really appreciate your detailed and constructive suggestions.

We agree with all of your recommendations, and many of them seem to overlap with other reviewers’ suggestions as well. We think they are good additions to the manuscript and as such, we will incorporate them into the camera-ready version as follows:

  1. Broader Significance: We will explicitly discuss broader significance in Section 1, expanding on the points mentioned in our rebuttal.
  2. Assumptions and Metrics: When introducing RSE in Section 3, we will explicitly contrast it with EE to better motivate our choice. Following this, we will mention that Assumptions 1 and 2 are required for RSE, but not for corresponding results with EE. This will ensure readers understand the generality of our findings and also give us the opportunity to discuss (in Appendix A) the nuanced difference between these two metrics that Janzing points out, which we believe is actually quite interesting!
  3. Non-linear Experiments: Experiments with non-linear models for the Optical Device dataset would definitely be insightful. We will try a few simple approaches (polynomial features, non-linear kernel SVMs, shallow NNs, etc) and report the results in Appendix F, with a summary in the main text to highlight their implications.
  4. Causal Regularization Baselines: As promised, we will work to include causal regularization baselines in our experiments. These will make our suite of baselines more complete and meaningful.
  5. Limitations Section: Absolutely, this seems to be the most pressing request by all reviewers — the Limitations section will be moved from Appendix E to the main paper and expanded to include (i) contextualisation of assumptions and (ii) explicit caution that the residual error can still be arbitrarily large.
  6. Notation/Nomenclature in Appendix: We completely understand the frustration with notation. Unfortunately, this is inevitable since we have to juggle multiple SEMs, their intervention forms, examples and overloading to bridge the topics of causality and DA. A notation/nomenclature section would certainly help! We will definitely include it.

These suggestions were quite helpful for polishing the draft, and we once again express our gratitude for your time and feedback.

Thanks!

Comment

Dear reviewer MsXV,

Thanks again for your detailed suggestions from earlier during the discussion. Because of the discussion period extension, we have been able to already implement some of these and are happy to share the following results.

Table: Relative error ± standard error for the Simulation and Optical Device experiments.

| Method | Simulation Exp. | Optical Device Exp. (linear, original) | Optical Device Exp. (polynomial, degree 1–5 selected w/ 5-fold CV) |
| --- | --- | --- | --- |
| ERM | 0.549 ± 0.057 | 0.822 ± 0.083 | 0.813 ± 0.084 |
| DA+ERM | 0.430 ± 0.049 | 0.781 ± 0.081 | 0.782 ± 0.079 |
| DA+IVL$_{\alpha}^{\text{CV}}$ | 0.378 ± 0.042 | 0.744 ± 0.078 | 0.748 ± 0.075 |
| DA+IVL$_{\alpha}^{\text{LCV}}$ | 0.397 ± 0.044 | 0.760 ± 0.075 | 0.760 ± 0.077 |
| DA+IVL$_{\alpha}^{\text{CC}}$ | 0.351 ± 0.029 | 0.720 ± 0.076 | 0.723 ± 0.081 |
| DA+IV | 0.723 ± 0.012 | 0.888 ± 0.028 | 0.887 ± 0.027 |
| IRM | 0.641 ± 0.028 | 0.976 ± 0.006 | 0.976 ± 0.006 |
| ICP | 0.707 ± 0.000 | 0.744 ± 0.029 | 0.747 ± 0.029 |
| DRO | 0.504 ± 0.022 | 0.842 ± 0.053 | 0.839 ± 0.052 |
| RICE | 0.592 ± 0.015 | 0.962 ± 0.011 | 0.947 ± 0.024 |
| V-REx | 0.495 ± 0.021 | 0.941 ± 0.024 | 0.938 ± 0.023 |
| MM-REx | 0.662 ± 0.056 | 1.000 ± 0.000 | 1.000 ± 0.000 |
| $\ell_1$ Janzing (2019) | 0.970 ± 0.002 | 0.981 ± 0.009 | 0.981 ± 0.009 |
| $\ell_2$ Janzing (2019) | 0.547 ± 0.056 | 0.821 ± 0.084 | 0.812 ± 0.084 |
| Kania et al. (2023) | 0.399 ± 0.044 | 0.785 ± 0.074 | 0.779 ± 0.077 |

Causal regularization baselines

We’ve implemented the causal regularization baselines from Janzing (2019) and Kania et al. (2023) (both named Causal Regularization). C. Vankadara (2022) was not implemented because, as stated earlier, they only provide a theoretical analysis of causal estimation with $\ell_2$ and do not propose an actual method.

On the Simulation experiment, our method achieves a relative error of 0.351, outperforming recent methods like Kania et al. (0.399) and the approaches from Janzing (2019) (0.547 for $\ell_2$, 0.970 for $\ell_1$). Similar results can be seen in the optical device experiments. The Colored-MNIST experiment is not included as these methods are not compatible with neural networks.

Non-linear data generation in optical device

Regarding the construction of the ground-truth data generation model for the optical device experiment, you rightly suggested trying non-linear models as well. To this end, we’ve conducted a first experiment with polynomial features — for each of the 12 datasets in the optical device experiment we independently pick the polynomial degree between 1 and 5 based on 5-fold cross validation (CV). Most of the datasets were best explained by degree 2. The same degree is then used to fit each of the methods listed in the table. This is important so as to avoid statistical bias from model misspecification, as our analysis squarely focuses on confounding bias.

The results are consistent with our previous results on the optical device experiment from the manuscript. However, we like this new approach much better! It is a principled generalization of the original linear assumption on optical-device data by Janzing (2018) while still being compatible with RSE and covering all of the baselines under consideration. Therefore we shall be using this in place of our old optical device experiment.

We contemplated other approaches for baseline construction as well, however they either seem to be incompatible with RSE or many baselines (including causal regularisation), or both! We therefore think the above results are most appropriate to present in the camera-ready as they check all the boxes.

References

D. Janzing. Causal regularization. NeurIPS, 2019.

L. Kania et al. Causal Regularization: On the trade-off between in-sample risk and out-of-sample risk guarantees. arXiv:2205.01593, 2023.

C. Vankadara et al. Interpolation and regularization for causal learning. NeurIPS, 2022.

D. Janzing and B. Schölkopf. Detecting confounding in multivariate linear models via spectral analysis. J. Causal Inference, 2018.

Comment

Great, I believe this significantly improves the experimental evaluation of the paper! I have raised my score to 5 :)

Review (Rating: 5)

The paper shows a method that simulates (soft) interventional data provided that the structural equation model satisfies a certain symmetry, which can then be used to reduce bias from confounding or feedback. To this end, the paper introduces a generalization of instrumental variables in which the outcome may be independent of the instrument, but nevertheless useful to mitigate confounding bias.

Strengths and Weaknesses

Strength: addresses an important problem in a way that is novel to the best of my knowledge. I particularly like the observation that IVs can help reduce confounding bias without being relevant to the outcome. The theory is sound.

Weakness:

  • It remains unclear how often symmetries of the structural equation are known, although the paper mentions a few cases admittedly.

  • It is difficult to fully understand the comparison to other methods because it is not very transparent which role exactly the domain DA parameters play in the other methods and why it is conceptually right to let them play the respective role.

  • The authors could do a better job of providing intuition alongside the theoretical results: why does the soft intervention mitigate bias?

Questions

  • Not sure it is recommended to overload the term “confounding” even further and subsume even cyclic causal relations under confounding.

  • line 141: this sentence is not very clear: "...have good predictive performance while the IV risk encourages solution search within the subspace where we know f to be present.", what does it mean that $f$ is present?

  • line 280 "Finally, DA+IVLα was implemented using a closed form linear OLS solution between empirical versions (see Lemma 1) of the following…" , unclear what "between empirical versions" means.

  • line 355: "Finally, when used together with DA, IVL regression can simulate worst-case interventions that are encoded by the DA" in what sense does regression simulate an intervention?

  • Is the method fundamentally restricted to additive noise?

  • Abstract: “for when treatment randomization sources may carry no information about the outcome and the possibility of its use for improving predictive performance across treatment interventions and reducing confounding bias”, this sentence is confusing.

  • Table 1 is a bit unclear: The arrow of lower confounding bias points downwards, but worst-case or adversarial DA is a scenario with larger confounding, right?

Limitations

I don’t see an explicit discussion of limitations.

Final Rating Justification

My assessment has been confirmed: this is an interesting paper which should be accepted.

Formatting Concerns

none

Author Response

We sincerely thank the reviewer for their positive assessment and insightful, constructive feedback. We are encouraged that they found our approach novel, and the core ideas sound and valuable. The reviewer's questions and suggestions for improvement are appreciated and will help us improve the clarity and impact of our work.

We have structured our response to first address the main weaknesses identified (intuition, comparisons, and practical assumptions), followed by answers to the specific clarification questions.

On Weaknesses and General Improvements

1. Intuition for Bias Mitigation and Additive Noise Restriction (Weakness 3 & Additive Noise Restriction Question)

Thank you for the suggestion. We agree that a more direct intuitive explanation would be helpful.

The core intuition behind why the soft intervention induced by outcome-invariant data augmentation (DA) helps reduce confounding bias is that it injects new randomness into the treatment $X$ that is, by construction, independent of the hidden confounder $C$. This process effectively "dilutes" the spurious correlation between $X$ and $Y$ that is mediated by $C$. By training on this augmented data, the model is discouraged from relying on the spurious path and is guided toward learning the true causal relationship, thereby mitigating confounding bias.

Another, albeit more 'technical', intuition is to look at it from the perspective of the data processing inequality $I(GX;C)\leq I(X;C)$ applied to the Markov chain $C\rightarrow X\rightarrow GX$: the augmented treatment $GX$ carries less information about the confounder $C$ than the original treatment $X$, unless $GX$ preserves information about $C$ (this is precisely the condition of perturbing spurious features in our linear settings). This perspective also shows that our method is not merely limited to additive noise, and applies to more general transformations $G$.
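As a quick numerical illustration of this dilution (a toy sketch of our own, not from the paper; for jointly Gaussian variables, a smaller absolute correlation with $C$ implies a smaller mutual information $I(\cdot\,;C)$):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
C = rng.normal(size=n)                # hidden confounder
X = C + rng.normal(size=n)            # treatment carrying a footprint of C
GX = X + 2.0 * rng.normal(size=n)     # outcome-invariant additive augmentation

print(np.corrcoef(C, X)[0, 1])        # ~0.71
print(np.corrcoef(C, GX)[0, 1])       # ~0.41: GX carries less information on C
```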

We will add a similar explanation at the end of Section 4.1 to better frame our theoretical result and also emphasize that it is more general than our simple additive noise setting would suggest.

2. Availability of Symmetries and Discussion of Limitations (Weakness 1 & Limitations)

We thank the reviewer for this crucial point. We agree that assuming known structural symmetries is a strong requirement and they are not universally available. Our work builds on a key insight inspired by this very challenge:

The widespread use of an outcome-invariant model for data augmentation in the literature seems to suggest that in many application domains, such as vision, natural language processing, etc., practitioners already possess and leverage this type of symmetry knowledge, even if implicitly.

And the motivation behind our work is to show that these existing, domain-justified symmetries can be repurposed as a principled tool to mitigate confounding bias.

Therefore, our framework is most valuable in application domains where such data augmentations are already established. More generally, and as with any non-experimental causal method like Instrumental Variables, justifying these symmetries ultimately relies on domain knowledge. This is a fundamental characteristic of the problem space, rather than a unique limitation of our approach.

We discuss this and other limitations of our method in Appendix E.4, but will be moving that section into the main paper with more explicit discussions of the same for better transparency.

We also make some additional comments on symmetry availability under Reviewer CVxW's thread that you may find interesting -- see under "Additional Reflections on untestable validity of IV vs. outcome-invariant DA".

3. Clarity of Comparison to Other Methods (Weakness 2)

We appreciate the reviewer's call for more transparency here. Our rationale for using the DA parameters ($G$) as inputs for the baseline domain generalization methods (e.g., V-REx, DRO, IRM, etc.) is to ensure a fair comparison. The fundamental challenge we address is reducing confounding bias given only observational data $(X, Y)$ and knowledge of some symmetries via valid augmentations. For a fair comparison, all methods must be restricted to this same information. Therefore, we use the DA parameter $G$ as a proxy for the "environments" or "domains" that these other methods typically require as input. This is still a conceptually valid comparison because, by virtue of being domain generalization methods, they are essentially solving a robust loss of a form similar to Corollary 1. This gives us direct and correct controls to compare our DA+IVL method against; a small sketch of this environment construction is given below.

We will explicitly state this rationale in Section 6 to make the experimental design clearer.
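As an illustration of this experimental design, here is a hypothetical sketch of how outcome-invariant augmentations can define "environments" for such baselines. The variable names, the additive form of the augmentation, and the discrete set of strengths are all our own assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
K = rng.normal(size=(3, 5))              # assumed augmentation directions
gammas = [0.0, 1.0, 2.0]                 # assumed discrete DA strengths

environments = []
for g in gammas:
    G = rng.normal(size=(1000, 3))       # fresh DA parameters per environment
    X_aug = X + g * (G @ K)              # outcome-invariant augmentation of X
    environments.append((X_aug, Y))      # Y untouched: outcome invariance
# `environments` can now be fed to IRM / V-REx / DRO style objectives.
```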

Responses to Specific Questions

  • On subsuming cyclic relations under "confounding": We appreciate the nuance. Our use of “confounding bias” follows Pearl’s definition of the term [5]: $\text{Confounding Bias}\coloneqq \mathbb{E}[Y\mid X] - \mathbb{E}[Y\mid\operatorname{do}(X)]$. Which is to say, any bias that causes the observational distribution to deviate from the interventional one. In our context, this also includes simultaneity. Such a general framing also follows the tradition in econometrics [14, 39], where omitted variables (i.e., confounders) are lumped together with simultaneity as sources of endogeneity (statistical dependence between the treatment $X$ and the residual $Y - f(X)$). We will add a footnote in Section 2.1 to clarify our use of the term.
  • Clarification of line 141: This sentence describes the two parts of our IVL risk. The IV risk term ensures the solution lies in a set that contains the true causal function $f$ (i.e., where "$f$ is present"), assuming $\alpha=0$. Since this set may not be a single point, the ERM risk term acts as a regularizer, selecting a solution from this set that also has good predictive performance. We will rephrase lines 140-142 for better clarity.
  • Clarification of line 280: "Between empirical versions" was imprecise phrasing. It refers to the transformed variables $X'$ and $Y'$ defined just below in the text. We perform a closed-form OLS regression of the empirical transform $Y'$ on $X'$. We will revise this to be more precise.
  • Clarification of line 355 ("regression simulates an intervention"): By "simulation" we essentially mean "is equivalent to":
    1. Applying DA is equivalent to a soft intervention on the treatment mechanism (Observation 1).
    2. And DA+IVL regression is equivalent to a worst-case DA from a given set (Corollary 1). In this sense, the DA+IVL regression procedure finds a solution for a worst-case intervention scenario encoded by the DA. We will clarify this two-step logic in Section 4.2.
  • Confusing sentence in the Abstract: Thank you for highlighting this. We will revise the sentence for clarity. A proposed revision is: "We introduce the concept of IV-like (IVL) regression, which leverages sources of treatment randomization even when they are irrelevant to the outcome. We show that this approach can still improve predictive performance across interventions and reduce confounding bias."
  • Arrow in Table 1: This is a good point to clarify. The arrow indicates that moving down the column leads to methods that yield lower confounding bias. "Worst-case or adversarial DA" refers to a method that optimizes for a solution robust to the worst possible augmentation. In our setting, this robustness forces the model to ignore spurious correlations, thereby yielding a better causal estimate with less bias. We will add a note to the table's caption to make this interpretation explicit.

Thank you once again for your thorough and valuable review. We are confident that by incorporating these changes, we can significantly improve the paper.

References:

Please see the manuscript.

Comment

The responses confirm my positive assessment.

Comment

We’re grateful for your positive review and happy that you liked the IV vs. IVL spin. Also, thank you for pointing out parts that could use more clarification.

Review (Rating: 5)

This work is concerned with estimating causal effects in the presence of unobserved confounders using data augmentation techniques. In particular, the authors consider a setting where we observe only treatment X and outcome Y, with a hidden confounder C affecting both, and where the outcome function f is invariant to certain data transformations. The estimand of interest in this setting is the causal effect of X on Y. The authors look at a connection between data augmentation and causal inference through the lens of instrumental variables and soft interventions. The approach consists of three key pieces: (1) showing that outcome-invariant data augmentation is mathematically equivalent to a soft intervention on the treatment generating mechanism, (2) deriving IV-like (IVL) regression, which can be seen as relaxing standard instrumental variable methods by dropping the outcome relevance requirement (while maintaining bias reduction properties through a regularized objective combining IV and ERM losses), and (3) showing that the data augmentation parameters can serve as IV-like variables, leading to a worst-case augmentation formulation that further reduces confounding bias. The authors provide theoretical guarantees in the linear Gaussian case showing bias reduction under appropriate conditions, and propose practical algorithms for both DA+ERM and DA+IVL estimation. Experiments on simulated linear systems, optical device data, and Colored-MNIST demonstrate consistent bias reduction, with DA+IVL generally outperforming standard DA alone, and modest improvements on complex real-world datasets.

Strengths and Weaknesses

Strengths:

  • I found this to overall be a very original and compelling idea. The authors do a really nice job of describing the problem setting, and providing the intuition for the approach before introducing it.
  • I really liked the connection between data augmentation and soft interventions. This was surprising to me and I found it very interesting and a very nice addition to the causal literature.
  • The proposed method itself is both intuitive and relatively simple. Further, while finding the "right" set of data augmentations is non-trivial, the idea itself and its underlying identification requirements are easily digested.
  • The provided empirical evidence shows consistent improvements over a set of candidate alternatives.

Weaknesses:

  • The main weakness of this paper is much the same as the issue that underlies a lot of IV applications: the exclusion principle. It's not entirely clear to me how hard or easy it would be to find a suitable data augmentation target here for various applications. I think the authors do a nice job of trying to address this by providing very clear criteria for when the proposed method is applicable vs. not. With that being said, I think it's important to acknowledge that essentially most applications of this have to rest purely on assumption, which can be difficult for practitioners.
  • The theoretical results provided are pretty restricted in terms of the settings that they consider. I think this is fine because there's a lot of task novelty here. It is worth noting that some of the results, in particular Theorem 1, seem to be operating on a very restricted setting in contrast to the framing of the paper which is much more general.

Questions

While I don't think it's necessary, per se, for publication, it would be interesting to hear if the authors have thoughts or could provide a brief discussion within the paper around examining some of the assumptions required to have a valid data augmentation. In particular, are there sensitivity analyses that could be performed so that practitioners can bound the range of outcomes they would expect under violations of assumptions or model misspecification?

Limitations

yes

Final Rating Justification

This paper studies an interesting and novel task. The proposed approach is interesting and leverages nice insights tying data augmentation to soft interventions.

Formatting Concerns

N/A

Author Response

We would like to extend our sincerest thanks to the reviewer for their thoughtful and encouraging feedback. We are particularly pleased that they found the core idea original and compelling, and appreciated the connection we draw between data augmentation and soft interventions. The reviewer's summary of our work's strengths and weaknesses is fair and very helpful. We will address the identified weaknesses and questions below.

On the Practicality of Finding Suitable Data Augmentations and the Untestability of Assumptions

This is the reviewer's main concern, and we agree it is a crucial point. The reviewer correctly draws a parallel between the difficulty of finding a suitable data augmentation and the challenge of satisfying the properties for valid Instrumental Variables (IV). Both, as the reviewer notes, must often "lie purely on assumption."

We believe the most appropriate yardstick for evaluating our framework is indeed that of IVs. Our proposed IV-like (IVL) variables are a direct generalization of IVs, designed to exploit a specific trade-off:

  • Availability vs. Identifiability: By relaxing the "outcome relevance" property of a standard IV, IVLs become easier to find and are more broadly available. The price for this increased availability is that IVL regression can no longer guarantee the identification of the causal effect. However, as we show, it can still be instrumental in reducing confounding bias.
  • Inherited Limitations: Consequently, the limitations of IVLs regarding testability and reliance on prior knowledge are the same as those for standard IVs. The validity of an IVL (e.g., exclusion restriction, un-confoundedness) is not empirically testable from observational data alone and must be justified by domain expertise—the same standard applied to IVs for decades.
  • Outcome invariance: Outcome-invariant data augmentation (DA), by virtue of equivalence with IVL by construction, also inherits these properties. The outcome-invariance assumption itself now becomes the central, untestable assumption, analogous to exclusion restriction and un-confoundedness. And while we acknowledge that this is a seemingly strong assumption, we point out that it is a fairly common modeling choice in the data augmentation and invariance literature.[1-10]
  • Perturbing spurious features: The condition of IVLs and/or DA perturbing spurious features of the treatment $X$ is also not a new limitation introduced by our framework. In fact, this condition is just a re-framing of the exclusion principle in IVs and is also necessary for identifiability in IV regression. It does not pop up simply because of our IVL relaxation. Indeed, if an IV does not enter $Y$ via spurious features of $X$, then the respective components of the causal function $f$ cannot be identified. Analogously, if no spurious features are perturbed, then IVLs cannot help mitigate confounding bias. The same holds for outcome-invariant DA by virtue of their equivalence with IVL.

\overbrace{\underbrace{\text{outcome-invariance}}_{\text{popular model for DA}} + \underbrace{\text{spurious targets}}_{\text{benign failure if violated}}}^{\text{untestable DA assumptions}} \iff \overbrace{\text{un-confoundedness} + \text{exclusion}}^{\text{untestable IV/IVL assumptions}}

Ultimately, our work does not propose a "free lunch." Instead, it formalizes a trade-off that may be quite practical in many settings where valid IVs are scarce but plausible outcome-invariances (i.e., data augmentations) are abundant. Consequently, IVLs and outcome-invariant DA should be expected to carry the same limitations as IVs, apart from the one trade-off they make: availability vs. identifiability.
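To make this trade-off concrete, here is a minimal simulation sketch (our own construction for illustration, not the paper's exact setup or estimator): a linear-Gaussian SCM with hidden confounder H, where Z perturbs only a spurious coordinate of the treatment and is therefore irrelevant to the outcome, so classical 2SLS cannot identify the causal coefficients. The helper `ivl_estimate` and the penalty weight `lam` are hypothetical names; the penalized objective is one simple way to combine ERM with an IV-style moment condition.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative linear-Gaussian SCM: X1 is a causal feature, X2 a spurious
# feature driven by the hidden confounder H; Z perturbs only X2 and never
# reaches Y, so the first stage for X1 is zero and 2SLS is rank-deficient.
H = rng.normal(size=n)                                   # hidden confounder
Z = rng.normal(size=n)                                   # IVL variable
X1 = rng.normal(size=n)                                  # causal feature
X2 = H + Z + 0.5 * rng.normal(size=n)                    # spurious feature
Y = 1.0 * X1 + 0.0 * X2 + H + 0.5 * rng.normal(size=n)   # true beta = (1, 0)
X = np.column_stack([X1, X2])

def ivl_estimate(X, Y, Z, lam):
    """Minimize (1/n)||Y - Xb||^2 + lam * ((1/n) Z.(Y - Xb))^2 in closed form.

    lam = 0 recovers plain ERM/OLS; lam > 0 adds an IV-style moment penalty
    E[Z * residual]^2, which vanishes at the true beta because Z is
    independent of both H and the outcome noise.
    """
    n = len(Y)
    A = X.T @ X / n                  # treatment second moments
    c = X.T @ Y / n
    m = X.T @ Z / n                  # cross-moments with the IVL variable
    s = Z @ Y / n
    # Stationarity condition: (A + lam * m m^T) b = c + lam * s * m
    return np.linalg.solve(A + lam * np.outer(m, m), c + lam * s * m)

print(ivl_estimate(X, Y, Z, lam=0.0))    # ~ (1.00, 0.44): ERM, biased on X2
print(ivl_estimate(X, Y, Z, lam=100.0))  # ~ (1.00, 0.01): bias mostly removed
```

In this toy setting, the penalty corrects only the spurious coordinate that Z actually perturbs, illustrating the "spurious targets" condition above: if Z perturbed no spurious features, the moment penalty would leave the ERM solution unchanged.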

In the camera-ready version, we will make sure to emphasize that these untestable assumptions are inherent to the difficulty of causal estimation under hidden confounding, and not a limitation of our method per se.

Additional Reflections on the untestable validity of IV vs. outcome-invariant DA: A majority of reviewers voiced concerns over the untestable assumptions on the validity of the IV and of outcome-invariant DA. While this is a fair and valid criticism, it is worth pointing out a key distinction between IVs and DA. The classic IV relies on the premise that the IV induces a "natural" variation in the treatment. Researchers must therefore passively "look for" these variables and then justify why they act as valid IVs. In contrast, DA is typically designed by researchers who have control over how the transformations are applied to the data. This flexibility can make DA outcome-invariant "by design". In fact, it is quite intuitive to distinguish augmentations that are outcome-invariant from those that are not. For example, any approach that uses the labeled data to learn the DA cannot be outcome-invariant.

In our opinion, this distinction makes outcome-invariant DA less susceptible to questions of validity than the classical IV.
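To illustrate the "by design" property concretely (a hypothetical sketch, not code from the paper): a randomized, label-blind augmentation can record its own parameters, which then serve as the IVL variable downstream; a policy that consults the label to choose the transformation forfeits outcome-invariance by construction. The name `augment_outcome_invariant` and the additive-noise form are our assumptions for illustration.

```python
import numpy as np

def augment_outcome_invariant(X, Y, rng, scale=0.1):
    """Label-blind augmentation: the perturbation G never reads Y, so the
    transformation is outcome-invariant by design -- provided the perturbed
    directions are genuinely outcome-irrelevant, which remains a modeling
    assumption. The recorded G can play the role of the IVL variable."""
    G = rng.normal(scale=scale, size=X.shape)   # randomized DA parameters
    return X + G, Y.copy(), G                   # labels are left untouched

# By contrast, a learned policy g = policy(x, y) that consults the label
# cannot be outcome-invariant by construction: the perturbation would then
# carry information about Y.
```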

On the Restricted Setting of the Theoretical Results

The reviewer notes that our theoretical results, particularly Theorem 1, operate in a more restricted setting than the general framing of the paper. We thank the reviewer for recognizing that this is reasonable given the other strengths of our work.

We wish to clarify that the primary reason for Assumptions 1 and 2 is our choice of evaluation metric: the Relative Squared Error (RSE) of Janzing [11]. Janzing argues that the structural nature of the RSE makes it a more relevant estimation error for causal settings [13], which also motivates our choice. Nevertheless, there is no generally accepted metric for causal estimation, since quantifying causal influence is itself a non-trivial problem [13, 14]. If we instead use the more familiar risk-based estimation error \mathbb{E}^{\mathcal{M}}[\ell(f(X), \hat{h}(X))], we can in fact drop Assumptions 1 and 2 from our results altogether.

We therefore urge the reviewers to see these assumptions as a requirement of a more appropriate measure of causal estimation error rather than as a necessary crutch. We will explicitly include this disclaimer in the camera-ready version of our manuscript.
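For concreteness, in the linear setting with causal function f(x) = x^\top \beta and estimate \hat{h}(x) = x^\top \hat{\beta}, taking \ell to be the squared loss (our shorthand; the precise RSE definition is given in [11]), the risk-based error reads

\mathbb{E}^{\mathcal{M}}\big[\ell(f(X), \hat{h}(X))\big] = \mathbb{E}\big[(X^\top \beta - X^\top \hat{\beta})^2\big] = (\beta - \hat{\beta})^\top \Sigma_X (\beta - \hat{\beta}),

which is well-defined for any treatment covariance \Sigma_X, so Assumptions 1 and 2 are not needed for it.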

Question: Examining Violations of the DA Assumptions

We interpret this question as asking about the robustness of our method to violations of the core outcome-invariance assumption. Our paper mainly studies the case of exact invariance to make the connection with causal inference more obvious and tractable. Although exact invariance has precedent as a modeling assumption in DA [1-10], we understand the hesitancy to commit to it in practice. Naturally, this opens the door to approximate notions of invariance.[1, 6]
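Concretely, writing t_g for the augmentation with parameter g, exact outcome-invariance of the causal function f requires f(t_g(x)) = f(x) for all g and x, whereas an approximate relaxation (our notation, in the spirit of [1, 6]) would only require

\sup_{g} \big| f(t_g(x)) - f(x) \big| \le \varepsilon \quad \text{for all } x,

with exact invariance recovered at \varepsilon = 0.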

We have been working to incorporate approximate invariance into our framework, and preliminary results suggest that our main theorems can indeed be extended to this more general setting. We plan to develop this line of work further in a journal version, which allows us to present a more focused and self-contained contribution here while laying a clear path for future work.

We hope these clarifications have addressed the reviewer's concerns. We once again thank them for their positive and constructive review, which has helped us to polish our thinking and positioning of this work.

[1]: S. Chen et al. A group-theoretic framework for data augmentation. JMLR, 2020.

[2]: C. Lyle et al. On the benefits of invariance in neural networks. arXiv, 2020.

[3]: H. Shao et al. A theory of PAC learnability under transformation invariances. NeurIPS, 2022.

[4]: A. Fawzi and P. Frossard. Manitest: Are classifiers really invariant? BMVC, 2015.

[5]: Y. Dubois et al. Lossy compression for lossless prediction. NeurIPS, 2021.

[6]: M. Petrache and S. Trivedi. Approximation-generalization trade-offs under (approximate) group equivariance. NeurIPS, 2023.

[7]: O. Montasser et al. Transformation-invariant learning and theoretical guarantees for OOD generalization. NeurIPS, 2024.

[8]: D. Romero and S. Lohit. Learning partial equivariances from data. NeurIPS, 2022.

[9]: S. Zhu et al. Understanding the generalization benefit of model invariance from a data perspective. NeurIPS, 2021.

[10]: S. Wong et al. Understanding data augmentation for classification: When to warp? DICTA, 2016.

[11]: D. Janzing. Causal regularization. NeurIPS, 2019.

[12]: C. Vankadara et al. Interpolation and regularization for causal learning. NeurIPS, 2022.

[13]: D. Janzing and B. Schölkopf. Detecting confounding in multivariate linear models via spectral analysis. J. Causal Inference, 2018.

[14]: D. Janzing et al. Quantifying causal influences. Ann. Stat., 2013.

Comment

Thank you for the thorough response. My concerns have all been addressed. I will maintain my score of accept.

Comment

Thank you so much for your very encouraging feedback and enthusiasm for our work!

Final Decision

The authors propose a new approach to treatment effect estimation through data augmentation (DA) and regularized instrumental variable (IV)-like regression. Assuming that the outcome is invariant to certain properties of the data, they generate data that preserves this outcome invariance and show that it can be used to reduce confounding bias in settings where identifiability fails and standard IV approaches are therefore not applicable. Results include theoretical guarantees (in the linear Gaussian setting), practical estimation algorithms, and experiments showing improved bias reduction on simulated, Colored-MNIST, and optical device datasets.

Strengths of the paper:

  • The use of data augmentation in a causal setting (rather than just for generalizability in typical prediction settings) is novel.
  • The IV-like regression, a generalization of standard IV that does not require dependence of the outcome on the IV, is of high and independent interest. The observation that IVs can help reduce confounding bias without being relevant to the outcome is highly original and useful.
  • Strong and convincing experimental results.
  • Well-written and clear.

Weaknesses of the paper:

  • Missing a limitations section.
  • Theoretical guarantees are available for fairly limited settings (but I don't think that this is a major issue).
  • More clarity needed about the assumptions and their connections to (a) typical IV analysis and (b) independent causal mechanisms -- this was resolved in the discussion and the authors have promised to add this to the camera-ready version.

Acceptance decision: I am in favor of acceptance as a spotlight, primarily because of the novelty and high interest of the work.

Discussion and rebuttal: The conversation was positive and productive, with all reviewers agreeing that the paper is worthy of acceptance and that no major concerns remained after discussion. The authors made an important point that the approach generalizes typical IV analysis and therefore depends on the same (untestable) assumptions -- since IVs are a common and useful method, this is a reasonable starting point for the work. Additional experimental results (comparisons to two papers in the literature) were provided by the authors and should (and can easily) be included in the camera-ready version. The authors should also add a brief limitations section to the camera-ready version and have agreed to do so.