3.7

/10

Rejected3 位审稿人

最低3最高5标准差0.9

4.0

置信度

ICLR 2024

Learning Counterfactually Invariant Predictors

Francesco Quinzan,Cecilia Casolo,Krikamol Muandet,Yucen Luo,Niki Kilbertus

OpenReview PDF

提交: 2023-09-23更新: 2024-02-11

TL;DR

We propose a model-agnostic framework building on a kernel-based conditional dependence measure for learning counterfactually invariant predictors.

摘要

关键词

CausalityCounterfactual Invariance

评审与讨论

审稿意见

评分: 3置信度: 42023-10-27

This paper introduces a graphical criterion to learn counterfactually invariant predictors by leveraging conditional independence within the observational distribution. The Hilbert-Schmidt Conditional Independence Criterion (HSCIC) is employed as a measure of conditional independence. Experimental results, based on both synthetic and real data, validate the efficacy of the proposed approach.

优点

Learning counterfactually invariant predictors solely from the observational distribution is a relevant and important research problem in the field of causal inference.
The extensive experiments on synthetic data and real data are conducted to demonstrate how the proposed method works.

缺点

A strong limitation of Theorem 3.2, which corresponds to the issue in the experiments mentioned next. The assumption made in Theorem 3.2, which presumes that $X = g(X, ...)$ , seems to be a strong assumption. The authors clarify that this assumption implies $pa(V) \in \mathbf{X} \cup \mathbf{A}$ , which should imply $\mathbf{A} \cup \mathbf{X}$ should be a maximal connected graph (and so $\mathbf{A} \cup \mathbf{W}$ ) and $Y$ is a root node, meaning $Y$ cannot be any parent of $\mathbf{X} \cup \mathbf{A}$ . Therefore, for a common connected DAG with node $\mathbf{A} \cup \mathbf{W} \cup \mathbf{Y}$ , the adjustment set for $(\mathbf{A} \cup \mathbf{W}, \mathbf{Y})$ should be empty, which corresponds to the issue in experiments mentioned below.
The fairness example in section 4.3 is based on Fig. 1(e), where our goal is to learn a predictor $\hat{Y}$ that exhibits counterfactual invariance with respect to $\mathbf{A}$ within the context of $\mathbf{W}=\mathbf{X} \cup \mathbf{C}$ . As per Theorem 3.2, our first step is to identify a valid adjustment set $\mathbf{S}$ for $(\mathbf{A} \cup \mathbf{W}, \mathbf{Y})$ . However, in this case, the adjustment set $\mathbf{S}$ between $\mathbf{A} \cup \mathbf{W}$ and $\mathbf{Y}$ should be empty. Upon reviewing Appendix F.5 and Fig. 8, it seems that it is inappropriate to use a subset of $\mathbf{A} \cup \mathbf{W}$ as the adjustment set. A similar issue arises in the experiments conducted on synthetic and image datasets, in which, we have $\mathbf{W}$ encompassing all variables except for $\mathbf{Y}$ and $\mathbf{A}$ . Besides, notice that, in the latter case, the authors state that 'We seek a predictor $\hat{Y}$ that is CI in the x-position with respect to $**all other observed variables**$ ', so $\mathbf{W}$ includes all the observational variables except $\mathbf{Y}$ and $\mathbf{A}$ as well.
Clarification on the 'Injectivity' Assumption in Theorem 3.2. Fawkes & Evans (2023) shows that CI cannot be decided from the observational distribution unless strong assumptions are made. Theorem 3.2 provides a valid assumption: injectivity of $g$ . I searched for the term ‘injectivity’ in the proof but cannot find it, so how does this assumption work here? Furthermore, while the authors claim that ‘Guaranteeing CI necessarily requires strong untestable additional assumptions, but we demonstrated that CIP performs well empirically even when these are violated’, they only provide an example of a specific violation scenario involving the absence of unobserved confounders. It is recommended that the authors conduct additional experiments to demonstrate the method's efficacy under more general violation conditions.
Scalability and Computational Complexity. The experimental section primarily employs simple causal graphs with a small number of nodes. It is essential to evaluate the scalability of the proposed Counterfactual Invariance Predictor (CIP) on causal graphs with a larger number of nodes, such as 20, 30 nodes. Additionally, it would be better to provide insights into the computational complexity associated with finding the adjustment set and estimating the Hilbert-Schmidt Conditional Independence Criterion (HSCIC).
Organisational Improvement. Section 2.2, "Related Work," requires more structured organization. Although the section discusses almost sure conditional independence, distributional conditional independence, and F-CI along with their relationships, it appears that they may not directly relate to the contributions of this work. Therefore, it would be advisable to omit this portion. Moreover, the paper should offer a concise and focused overview of the approaches used in prior research to attain counterfactually invariant predictors.

问题

How was the synthetic experiment design tailored to meet the assumption in Theorem 3.2 (injectivity assumption) explicitly?

评论- Official Response to Reviewer Lrr5

2023-11-20

We thank the reviewer for analysing our paper and for the detailed review. We are happy that the reviewer understands the relevance of the problem of counterfactual invariant predictors. Referring to the points of the reviewer:

Limitation graphical assumptions: As we mention in our submission, our analysis inevitably relies on assumptions on the DGP. Without assumptions on the causal structure, it is impossible to derive the necessary identifiability results for counterfactual invariance. We remark that the injectivity of $g(...)$ is weaker than some previous assumptions in the literature, such as assuming a noise-additive models [1]. Furthermore, our analysis does not use strong assumptions such as faithfulness. Regarding the second part of your question, our theoretical framework extends to the cases in which the valid adjustment set is empty. This is further explored in the next point.
Valid adjustment set: Theorem 3.2 does not necessitate as set $S$ the optimal adjustment set, but a valid one. In Appendix F.5, it is explained that the conditioning set includes the variables {race, nationality} that consist in the node $C$ of Fig.8. Even though this is a subset of $A \cup W$ , the set $C$ represents a valid (not optimal) adjustment set for $(A \cup W, Y)$ . Based on this, here selecting the adjustment set $S$ as $C$ is consistent with the theoretical results. Similarly, in the image and synthetic experiments a valid adjustment set is chosen in the conditioning.
Clarification on the “Injectivity” Assumption: We use the set-theoretic notion of “injectivity”, and we are happy to further clarify this point. As mentioned in our previous reply, there are models where the injectivity notion is fulfilled, such as ANMs. In our proof, the injectivity assumption is used in Lemma A.6 to prove that Eq. (5) holds.
Scalability and Computational Complexity: we are working on experiments with $20$ nodes. In a dataset with $n=1000$ data points, in the same structure as the first experimental setting shown in the paper (Fig. 1(d)), the average running time for an epoch without the regularization term is $0.003 s$ and $1.112 s$ with the regularization term, training with SGD with batch-size of $512$ . By using smaller batch sizes, e.g. $n=128$ , the extra computational cost can further decrease. From a theoretical perspective, the estimation of the HSCIC requires kernel ridge regression (see eq. (1) in our submission). In the high-dimensional image example, with a mini batch-size of $512$ , the average running time for an epoch with the regularization term is $64.03 s$ and $34.01 s$ without. This operation generally requires $O(n^3)$ and $O(n^2)$ memory, with n the size of the dataset. However, these bounds can be significantly improved by using, i.e., Fourier Features (see, i.e., [2,3]). By using Fourier features, the resulting approximate kernel ridge regression estimator can be computed in $O(ns^2)$ and $O(ns)$ memory. Here, s is a parameter determining the accuracy of the approximation. In practice, s can be set to be significantly smaller than the problem size, resulting in a dramatic speed-up. Other methods for efficient kernel computation include the popular Nystrom approximation [4,5], and Memory-Efficient Kernel Approximation (MEKA) [6].
Organisational Improvement: We appreciate the reviewer's suggestion, we think that contrasting our approach with other interpretations of counterfactual invariance is crucial for situating our research within the existing body of literature. In Section 4, under the 'Baselines' paragraph, we present a comparison with various methods employed to achieve counterfactual invariant predictors.

[1] Peters J. et al: Causal Discovery with Continuous Additive Noise Models. Journal of Machine Learning Research 15: 1:2009-1:2053 (2014)

[2] Ali Rahimi, Benjamin Recht: Random Features for Large-Scale Kernel Machines. NIPS 2007: 1177-1184

[3] Haim Avron, et al.: Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees. ICML 2017: 253-262

[4] Petros Drineas, Michael W. Mahoney: On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning. Journal of Machine Learning research 6: 2153-2175 (2005)

[5] Cho-Jui Hsieh, et al.: Fast Prediction for Large-Scale Kernel Machines. NIPS 2014: 3689-3697

[6] Si Si, et al.: Memory Efficient Kernel Approximation. Journal of Machine Learning Research 18: 20:1-20:32 (2017)

审稿意见

评分: 5置信度: 32023-10-30

The authors propose graphical criteria that yield a sufficient condition for a predictor to be counterfactually invariant in terms of a conditional independence in the observational distribution. In order to learn such predictors, they propose a model-agnostic framework, called Counterfactually Invariant Prediction, building on the Hilbert-Schmidt Conditional Independence Criterion, a kernel-based conditional dependence measure.

优点

Their experimental results demonstrate the effectiveness of their method in enforcing counterfactual invariance across various simulated and real-world datasets including scalar and multi-variate settings.

缺点

Please refer to Questions.

问题

In figure 1(a), what is S and what is X? How does one distinguish the two or say how to decide which is S and which is X?

Sometimes the authors write " $\hat Y$ is counterfactually invariant in $A$ with respect to $W$ ", sometimes they write " $\hat Y$ is counterfactually invariant in $A$ with respect to $X$ ". Do they have the same meaning?

Theoretical results. Corollary 3.6 gives a population result about VCF( $\hat Y$ ). However, no theoretical results on finite sample estimator are established.

Besides fairness, what are other applications for learning counterfactually invariant predictors?

评论- Official Response to Reviewer mZ5A

2023-11-18

We are happy about the positive assessment of our experimental results and are thankful for the useful suggestions.

We now address the three main questions of the reviewer:

Q: In figure 1(a), what is $S$ and what is $X$ ? How does one distinguish the two or say how to decide which is $S$ and which is $X$ ? Sometimes the authors write " $\hat{Y}$ is counterfactually invariant in $A$ with respect to $W$ ", sometimes they write " $\hat{Y}$ is counterfactually invariant in $A$ with respect to $X$ ". Do they have the same meaning?

A: We thank the reviewer for pointing this out. Figure 1(a) refers to the counterfactual invariance of $Y$ in $A$ with respect to $X$ . As pointed out by the reviewer, this could lead to inconsistency in the notation, since Def. 2.2 refers to the counterfactual invariance of $Y$ in $A$ with respect to $W$ . This notation is not incorrect because the choice of $W$ is arbitrary, however we have modified the label of Fig 1(a) only using the notation $W$ to make it more consistent. We thank the reviewer for the suggestion.

The distinction between $S$ and $W$ ( $X$ in the previous version) of Figure 1(a) is based on the Definition 2.2 of counterfactual invariance and the Theorem 2.2. In the definition of counterfactual invariance, it is necessary to select a set $W$ the user would like to achieve counterfactual invariance with respect to. This set $W$ represents the set of variables that is conditioned upon in the pre-interventional world and is arbitrary. This notion of counterfactual invariance, centered around the set $W$ , is aimed at offering flexibility rather than simplifying the problem. Regarding the set $S$ , this is the adjustment set used in Theorem 2.2. The set $S$ cannot hence be arbitrarily chosen, but needs to satisfy the following conditions for the sets $(A \cup W, Y)$ . The definition of a valid adjustment set is provided in Definition 3.1.

Q: Corollary 3.6 gives a population result about VCF(Y^). However, no theoretical results on finite sample estimators are established.

A: Yes. We do not consider finite sample estimation for VCF, since the VCF is only used as a performance metric, and it can only be computed using access to the DGP for the counterfactuals.

Q: Besides fairness, what are other applications for learning counterfactually invariant predictors?

A: Besides fairness, learning counterfactually invariant predictors (CIP) has important applications in areas such as robustness and text classification (see Section 3: Example Use-Cases of Counterfactually Invariant Prediction). These applications underscore the broad utility of counterfactually invariant predictors in ensuring robust and unbiased decision-making in various domains, extending well beyond the realm of fairness.

Robustness in Image Classification: Counterfactual Invariance (CI) enhances robustness in image classification. It is used to assess if an image, like a truck, would still be identified correctly under different conditions, such as varying seasons. This application is demonstrated using a dataset of simple black and white images, where CI helps in understanding the impact of changing attributes like shape and size on image classification.
Text Classification: CI is also significant in text classification tasks. The approach involves considering how protected attributes and outcomes are interrelated. Even when certain assumptions about these relationships are not met, counterfactually invariant predictors are shown to be effective. The concept is applied to ensure that classification decisions are consistent and unbiased, regardless of variations in text data, such as different attributes or contexts.

We thank the reviewer for the time and consideration, we value all the provided insights and suggestions. We are confident that we could resolve the doubts about the mentioned questions. If the reviewer agrees that their remaining doubts are largely resolved, we would be happy if they considered updating their score accordingly.

2023-11-22

Thank you for your clarification. Does Figure 1(a) satisfy Definition 2.2?

评论- Official Response to Reviewer mZ5A reply

2023-11-23

Yes Figure 1(a) does satisfy Definition 2.2. Based on this causal graph, if a predictor $\hat{Y}$ is independent of $A\cup W$ given $S$ , then it is counterfactually invariant based on Definition 2.2. We hope this clarification is helpful.

审稿意见

评分: 3置信度: 52023-10-31

This paper presents a counterfactually invariant prediction (CIP) method for achieving fairness, robustness, and generalization in the real world. By enforcing independence in kernel space with the prior causal structure, CIP achieves counterfactual invariance.

优点

Theoretical analysis is sufficient, and the studied problem isinteresting.

缺点

My main concern is on the usage of counterfactual invariance. Counterfactual refers to individual-level potential outcomes, rather than conditional or sub-populational levels. Hence, pursuing counterfactual outcome relies on prior SCM model or very sharp bounds, and estimating counterfactual outcome from observational is nearly possible even with the aid of AB tests. Hence, as your independence regularization only enforces tthe populational independence, how can your CIP achieves targets in the counterfactual level?
Such invariance learning is not new to me.
Prior causal graph is restrictive for realistic applications.

问题

See Weaknesses

评论- Official Response to Reviewer sZNY

2023-11-17

We thank the reviewer for the comments, and we are glad that the reviewer finds the theoretical analysis to be sufficient sufficient, and the studied problem interesting. We now address the three reported weaknesses:

Q: My main concern is on the usage of counterfactual invariance. Counterfactual refers to individual-level potential outcomes, rather than conditional or sub-populational levels. Hence, pursuing counterfactual outcome relies on prior SCM model or very sharp bounds, and estimating counterfactual outcome from observational is nearly possible even with the aid of AB tests. Hence, as your independence regularization only enforces tthe populational independence, how can your CIP achieves targets in the counterfactual level? Such invariance learning is not new to me. Prior causal graph is restrictive for realistic applications.

A: We appreciate the concern regarding the application of counterfactual invariance. Our approach indeed focuses on translating counterfactual independence into observational distribution independence, acknowledging the challenges inherent in this transition. We emphasize in Section 2.2 that such a translation is non-trivial and necessitates specific assumptions about the graphical structure and the data generation process, as detailed in Theorem 3.2. This forms a crucial part of our theoretical framework, enabling a more feasible analysis of counterfactuals within observational data, however with necessary limitations and assumptions. Specifically our proof consists of the following steps:

we first construct a causal graph for the pre and post-interventional random variables of the model;
we show that $S$ acts as a valid adjustment set for $(A∪W,Y)$ in the graph constructed in (1);
we then use the adjustment criterion to prove the desired result.

Part (3) is not-trivial and it makes use of the assumptions on the DGP as in Theorem 3.2.

Q: Such invariance learning is not new to me.

A: Thank you for the observation regarding the novelty of invariance learning. While it is true that the concept of counterfactual invariance is not new, our contribution lies in achieving an established concept of counterfactual invariance ( $\mathcal{D}$ -CI) within a novel, model-agnostic framework. This is outlined in the Contribution subsection (page 3). We have also conducted a thorough review of related literature to position our work within the current academic literature. We would greatly appreciate any specific references or examples the reviewer might have that could further inform our understanding and discussion of this topic in the context of our work.

Q: Prior causal graph is restrictive for realistic applications.

A: We agree that knowing the causal graph is a strong assumption. However, essentially all work on cause-effect estimation and causal invariance starts from this assumption. In fact, while the somewhat orthogonal field of causal discovery aims at inferring the causal graph from (observational and/or interventional) data, most other work on causality starts from assuming some background knowledge of the data-generating process, we are aware that this is indeed a potentially problematic assumption. Usually, one refers to “expert knowledge” for how to come up with the graph. Ultimately, dismissing our work on the grounds that the true causal graph is hard to know in practice, is essentially dismissing large parts of the work on causality in machine learning altogether.

We thank the reviewer for the comments. In light of the above clarifications and our previous discussions, we kindly request the reviewer to reevaluate our work and consider increasing the score.

AC 元评审

2023-12-13

The main contributions of this work are: 1) a graphical criterion that equates counterfactual invariance (CI) to conditional independence, and 2) a novel Hilbert-Schmidt Counterfactual Invariance Criterion (HSCIC) that quantifies CI when this graphical criterion is satisfied. The authors propose the use of HSISC as a regularizer to encourage learning predictors that balance utility and CI.

This work tackles the important problem of learning counterfactually invariant predictors using only samples from the observational distribution, and the experimental results show that using HSISC as a regularizer allows controlling the trade-off between utility and CI on both synthetic and semi-synthetic data sets. However, the reviewers found that the assumptions of the data generating process needed in order to justify the application of the graphical criterion need further clarification, as does the claim that they hold in the specific experiments used to evaluate the performance of the proposed methodology.

为何不给更高分

The reviewers found the assumptions necessary for the graphical criterion to apply to be limiting, and that they may invalidate the experimental setup used to evaluate the performance of the proposed methodology.

为何不给更低分

N/A

最终决定Reject

2024-01-16

Reject