Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship
Abstract
Reviews and Discussion
Based on the fact that the distribution of an effect, conditioned on its causes, remains invariant to changes in the prior distribution of those causes, this paper proposes a causal discovery algorithm for large-scale datasets. Specifically, it designs an invariance test, which is realized via a downsampling scheme. It also exploits a search over the Markov blankets of all variables to reduce the time complexity. Experiments on synthetic datasets and real-world networks show better scalability.
Strengths
- This paper is well written, with clear descriptions and motivations.
- The authors propose practical algorithms for causal discovery, with some interesting theoretical findings, e.g., the basis of a DAG and the minimal downsampling rate.
- The experiments on synthetic datasets and real-world networks are extensive and verify the advantages on large-scale datasets.
Weaknesses
- Some details seem to be missing in the paper. For example,
i) Footnote 2 and Theorem 1 explain how to find non-parent sets, but how to set the threshold for the variance is not clear. Please give the details in the paper.
ii) How to learn the different priors with the estimated m? Did the authors assume some distributions?
- Theorem 2 provides a necessary condition to test whether a subset is the parent set of X, but not a sufficient condition. The authors stated, "When m is infinitely large, the implication in Eq. (1) becomes bi-directional and definitively implies [the true parent set]", but it is not clear why this implication holds. Please elaborate on it further.
- It would be better to validate on some real-world datasets, since bnlearn provides real networks and generates data from those networks, so these datasets look semi-synthetic. BTW, in Table 2, when dealing with small-scale datasets (or even the large-scale dataset Munin), the runtimes do not seem satisfactory. Please explain this.
Questions
- Does the proposed causal discovery algorithm work for time-series data? What are the challenges?
I would be very glad to increase my score if the authors could resolve my concerns.
C3. It would be better to validate on some real-world datasets, since bnlearn provides real networks and generates data from those networks, so these datasets look semi-synthetic. BTW, in Table 2, when dealing with small-scale datasets (or even the large-scale dataset Munin), the runtimes do not seem satisfactory. Please explain this.
A: Thank you for the suggestion on validating our approach with a real-world dataset. Following your suggestions, we have run extra experiments with the real-world dataset SACHS (Sachs et al., 2005), which has been widely used in the existing literature on causal discovery. This dataset consists of continuous measurements of expression levels of proteins and phospholipids in human immune system cells and is widely accepted by the biology community. The SACHS graph contains 11 variables and 17 edges. The results are reported in the following table.
| Baselines | Metrics | SHD | Spurious rate | TPR | Runtime (seconds) |
|---|---|---|---|---|---|
| GLIDE | Mean | 8.7 | 0 | 0.833 | 10.47 |
| | 95-conf | 0.48 | 0 | 0.054 | 0.71 |
| DAS | Mean | 22.27 | 0.55 | 0.25 | 16.45 |
| | 95-conf | 1.65 | 0.04 | 0.03 | 4.24 |
| SCORE | Mean | 13.2 | 0.23 | 0.49 | 30.02 |
| | 95-conf | 0.67 | 0.04 | 0.04 | 3.98 |
| FCI | Mean | 17 | 0.44 | 0.33 | 4.06 |
| | 95-conf | 0 | 0 | 0 | 0.63 |
| GIES | Mean | 27 | 0.58 | 0.15 | 2.69 |
| | 95-conf | 0 | 0 | 0 | 0.12 |
| Notears | Mean | 51 | 0.69 | 0.09 | 575.97 |
| | 95-conf | 0 | 0 | 0 | 14.11 |
| MLP-Notears | Mean | 43.33 | 0.65 | 0.15 | 164.86 |
| | 95-conf | 4.77 | 0.03 | 0.02 | 12.73 |
The reported results show that our proposed method GLIDE significantly outperforms the other baselines, achieving the lowest SHD and spurious rate. When compared to the second-best baseline (SCORE), GLIDE substantially improves the TPR (0.833 vs. 0.49) while running roughly three times faster.
In Table 2, we note that for the small datasets, our approach returns the result within a minute. For the Pathfinder dataset, our approach finishes in minutes, and for the Munin dataset, it finishes within hours. We believe such running times are reasonable given the significant performance improvement over the baselines.
Furthermore, given that the time complexity of our approach scales quadratically with the number of variables, these results are justifiable and consistent with the reported figures in Sections 5.1 and 5.2. In addition, it is worth noting that our running time is also much faster than state-of-the-art methods such as DAS and SCORE, as reported in Section 5.1.
Reference:
Karen Sachs et al., Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308, 523-529 (2005). DOI:10.1126/science.1105809
Questions
Q1. Does the proposed causal discovery algorithm work for time-series data? What are the challenges?
A: In principle, our proposed causal discovery algorithm should work for (multivariate) time-series data, which is also underpinned by an (unknown) Bayesian Network reflecting its causal relationships. In practice, we foresee a (practical) challenge in terms of scalability. To elaborate, the size of the causal graph scales linearly in the number of time steps: the total number of nodes in the corresponding Bayesian Net is the number of variables multiplied by the number of time steps. As the complexity of our proposal is quadratic in the number of variables, it will grow quadratically in the number of time steps, which will cause scalability issues since the number of time steps can be very large. This could be an interesting and novel direction for a potential follow-up of our work. Thank you very much for this interesting suggestion.
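For illustration, the scaling argument can be summarized with generic symbols (n for the number of variables per time step and T for the number of time steps; these labels are ours and need not match the paper's notation):

```latex
% Unrolled time-series Bayesian Net: n variables per step, T steps.
\text{number of nodes} \;=\; n \times T
\qquad\Longrightarrow\qquad
\text{cost} \;=\; O\!\left((nT)^{2}\right) \;=\; O\!\left(n^{2}\,T^{2}\right)
```

i.e., for a fixed n, the cost grows quadratically in T.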
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
Thank you for your detailed review & constructive feedback. We have addressed all your concerns and questions below:
Concerns
C1. Some details seem to be missing in the paper. For example, i) Footnote 2 and Theorem 1 tell how to find non-parent sets, whereas how to set the threshold for the variance is not clear. Please give the details in the paper.
A: We would like to explain that there is no need to set a threshold for the variance. Footnote 2 and Theorem 1 suggest that when the candidate set is the true parent set, the variance will be zero, which means that among plausible candidates, the one that results in the smallest variance is the true parent set. As such, we can run the invariance test following the procedure outlined in Part B of Section 4.1 without having to set any thresholds.
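For illustration, the threshold-free selection can be sketched as follows. This is a minimal, hypothetical sketch rather than the paper's actual implementation: `sample_augmented_dataset` and `estimate_conditional` are placeholder callables standing in for the downsampling routine of Theorem 5 and for any estimator of the target's conditional distribution.

```python
import numpy as np

def invariance_score(data, target, candidate, sample_augmented_dataset,
                     estimate_conditional, num_augmentations=50):
    """Variance of the estimated conditional of `target` given `candidate`
    across randomly augmented downsamples (illustrative sketch)."""
    estimates = []
    for _ in range(num_augmentations):
        d_i = sample_augmented_dataset(data)          # downsample under a fresh source prior
        estimates.append(estimate_conditional(d_i, target, candidate))
    return float(np.var(np.stack(estimates), axis=0).sum())

def select_parent_set(data, target, plausible_candidates, **kwargs):
    # No explicit threshold: simply return the candidate with the smallest variance.
    return min(plausible_candidates,
               key=lambda c: invariance_score(data, target, c, **kwargs))
```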
ii) How to learn the different priors with the estimated m? Did the authors assume some distributions?
A: The augmentation is defined in the statement of Theorem 1: an augmented distribution is obtained by replacing the prior over the source variables with a new source prior, while keeping the conditional of the remaining variables given the source variables induced from the original distribution. The source prior is in turn sampled from a simplex over a space of categorical distributions, as defined in Theorem 6. Here, m is the number of samples. As discussed in the paragraph following Eq. (2), the invariance test becomes more accurate as m increases.
Furthermore, we want to explain that we do not learn or compute the augmented distribution explicitly, since the invariance test only requires drawing samples from it. This is possible without knowing its exact form, as detailed in Theorem 5, which shows that a downsample of the original dataset obtained via a particular downsampling procedure (see the 2nd sentence in the statement of Theorem 5) is statistically equivalent to a direct sample from the augmented distribution.
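For concreteness, the reading above can be written with generic notation (ours, not necessarily the paper's): with S the source variables, X the remaining variables, P the true distribution, and P' an augmentation,

```latex
P'(\mathbf{S}, \mathbf{X}) \;=\; P'(\mathbf{S}) \, P(\mathbf{X} \mid \mathbf{S})
```

so a sample following P' can be obtained by re-weighting (downsampling) the observed data rather than by estimating P' explicitly.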
C2. Theorem 2 provides a necessary condition to test whether a subset is the parent set of X. However, it is not a sufficient condition. Although the authors stated, "When m is infinitely large, the implication in Eq. (1) becomes bi-directional and definitively implies [the true parent set]", it is not clear why this implication holds. Please elaborate on it more.
A: We would like to explain that if the variance can be computed exactly via an exhaustive enumeration over all possible source prior augmentations, it is zero if and only if the candidate set is the true parent set. This is true since, on the true causal graph, changing the prior over the source variables obviously does not change the conditional of the effect given its parents. Since we only search for candidates in the space of subsets of the Markov blanket of the target variable, zero variance thus definitively implies the true parent set when the variance can be computed exactly, which requires an infinite number of sampled source prior augmentations (i.e., when m is infinitely large).
This paper proposes a new framework that leverages the invariance of the effect's distribution conditioned on its causes for causal discovery from observational data. The main idea is to disturb the p(cause) distribution and see whether p(effect|cause) changes after the disturbance.
Strengths
- This work leverages the invariance of the conditional distribution and proposes a downsampling method, combining the two to find the parent set.
Weaknesses
- The main issue is that, since real interventions are not applicable to observational data, the paper provides a downsampling technique to approximate p(effect|cause) after the disturbance, which, however, has no theoretical guarantee. That is, how can one guarantee that such a downsampled dataset correctly corresponds to the real distribution after the disturbance?
- Since the basis variables would include the leaf vertices, changing the prior of the basis variables will not affect the distribution of their ancestors in such a case, and thus it may not have a similar effect to changing the prior over the source variables.
Typos:
- "Theorem 2" -> "Theorem 4" in Theorem 5
After rebuttal:
After multiple rounds of discussion with the authors, my fundamental concerns remain unresolved. The core issue persists: the proposed approach fails to adequately address the challenge of obtaining interventional distributions from observational data.
As highlighted in the seminal works of Pearl [1] and [2], answering interventional questions such as "What would be the impact on the system if this variable were changed from value x to y?" requires explicit causal knowledge. This limitation is well established in the causal inference literature, as also discussed comprehensively in [3]. The authors' repeated attempts to address my concerns rely on mathematical manipulations and resampling techniques, which do not overcome the fundamental identification problem in causality.
[1] Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
[2] Brouillard, Philippe, et al. "Differentiable causal discovery from interventional data." Advances in Neural Information Processing Systems 33 (2020): 21865-21877.
[3] Bareinboim, Elias, et al. "On Pearl's hierarchy and the foundations of causal inference." Probabilistic and Causal Inference: The Works of Judea Pearl. 2022. 507-556.
Questions
See the weakness above.
Thank you for your detailed review & constructive feedback.
We have addressed all your concerns below:
C1. The main issue is that, since real interventions are not applicable to observational data, the paper provides a downsampling technique to approximate P(effect|cause) after the disturbance, which, however, has no theoretical guarantee. That is, how can one guarantee that such a downsampled dataset correctly corresponds to the real distribution after the disturbance?
A: Theorem 5 guarantees that the downsampled dataset from the original dataset following a particular downsampling procedure (see the 2nd sentence of Theorem 5's statement) is statistically equivalent to a direct sample of the real distribution after the disturbance (as defined in the 2nd sentence of Theorem 1). We hope this has addressed the reviewer's concern.
In addition, we would also like to emphasize that, in general, there is no guarantee of certainty for causal discovery without intervention. In fact, even with interventions, causal discovery is still not perfect due to the reliance on existing statistical independence tests (which are highly accurate but certainly not perfect). So, we believe the main focus should be on the actual empirical performance rather than on theoretical guarantees. We hope the reviewer would reconsider this point.
C2. Since the basis variables would include the leaf vertices, in such a case, changing the prior basis variables will not affect the distribution of their ancestors, and thus it may not have a similar effect to changing prior over source variables.
A: It is unlikely that the basis would include a leaf node because our selection method, as stated in Theorem 3, always prioritizes selecting nodes with the smallest dependent set (according to d-separation) first. Lemma 2 (in Appendix B1) shows that such nodes are either a source node or dependent on exactly one source node. Such nodes are unlikely to be leaf nodes except in some artificial graph structures. For clarity, we refer the reviewer to an intuitive example in Appendix B1 where we demonstrate step by step the procedure of basis selection based on the causal graph of the dataset ASIA.
Furthermore, it is also fine if the basis includes some leaf nodes as long as it also includes the source nodes. Following the above argument, it is even more unlikely to arrive at a basis that contains only leaf nodes; this does not happen in any of the widely used benchmark datasets in our experiments. As a matter of fact, our numerous experiments consistently show a significant improvement over SOTA baselines across a wide variety of data scenarios and graph topologies. We hope the reviewer would reconsider this point given the above and the fact that, in general, there is no perfect guarantee for causal discovery without intervention.
C3. Typos: "Theorem 2" -> "Theorem 4" in Theorem 5
A: Thank you for helping us catch the typo. We will fix this in our revised version.
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
This paper introduces a novel approach to causal learning through a new invariance test for causality, which underpins a reliable and scalable algorithm for reconstructing causal graphs from observational data. This method leverages a core insight that the conditional distribution of the effect given the cause remains invariant under changes in the prior distribution of the cause. This insight enables a parent-identification process for each variable using synthetic data augmentation. This process is integrated with an efficient search algorithm that utilizes prior knowledge of each effect variable’s Markov blanket, along with the empirically observed sparsity of causal graphs, to significantly reduce computational complexity.
Strengths
- The proposed method is rather novel.
- Overall, the paper is well-structured and clearly written.
- The experiments are extensive, covering 3 types of functional causal models, 6 causal discovery baseline methods, and varying graph sizes.
Weaknesses
Any thoughts on how to extend your method to handle heterogeneous or time-series datasets?
Questions
(See above)
Thank you for recognizing our contributions.
We have addressed all your questions below:
Q1. Any thoughts on how to extend your method to handle heterogeneous or time-series datasets?
A: In principle, our proposed causal discovery algorithm should work for (multivariate) time-series data, which is also underpinned by an (unknown) Bayesian Network reflecting its causal relationships. In practice, we foresee a (practical) challenge in terms of scalability. To elaborate, the size of the causal graph scales linearly in the number of time steps: the total number of nodes in the corresponding Bayesian Net is the number of variables multiplied by the number of time steps. As the complexity of our proposal is quadratic in the number of variables, it will grow quadratically in the number of time steps, which will cause scalability issues since the number of time steps can be very large. Thus, the key challenge in extending this to time-series data is scalability.
Our proposed method can also handle multiple heterogeneous datasets sharing the same causal graph (albeit with different sets of effect-cause conditional distributions). In this case, the invariance test can be slightly repurposed to simply find the plausible parent set with minimal variance (see Eq. (1)) across those datasets. We have in fact conducted a preliminary study of this setting in Appendix C.2.5.
Thank you for these suggestions. We think these are novel directions for potential follow-up of our current work.
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
I appreciated the responses from the authors, which solved my concerns. Therefore, I would like to maintain my positive score.
The paper utilizes the invariance of p(effect|cause) under different p(cause) to find the causal parents. For this purpose, the authors choose some source priors, generate corresponding datasets, and verify the invariance. Finally, they show their performance on a variety of synthetic and real-world setups.
Strengths
The paper is well written. The figure is helpful for understanding the algorithm. The experiments are extensive.
Weaknesses
Minor weaknesses:
- The authors mentioned “finding the maximum clique in an augmented bidirectional graph” multiple times but without a proper definition or example/visualization.
- The source variables should be defined in a little more detail.
- What does the notation in equation 2 refer to? It should be made precise.
- “The intuition is if we can re-sample from such that”: this is a little unclear. How are the re-sampled dataset and the original dataset different?
- It is unclear how the augmented datasets are sampled. How are the source priors obtained? Although these are discussed later, some hints/intuitive discussion should be provided earlier in the paper.
- “We cannot compute , … we can re-sample from so that ” – based on my understanding, the first case is computing the numerical probability table, and the second case is sampling without any such table. This difference should be made clear.
- More details on "downsampling without replacement" are needed.
- An intuitive explanation of the “minimal downsampled rate” is required.
Major weaknesses
- Suppose a variable is not a parent but an ancestor. Shouldn’t we also get variance = 0 (equation 1) in such cases? Does a change in P(Ancestor) affect P(Descendant | Ancestor)?
- Many important concepts are delayed until section 4.2. The authors should consider introducing them earlier in the paper.
I will consider increasing the score after seeing the authors' response and the reviewer discussion.
Questions
Questions:
- How are the authors resampling the datasets?
- Do you have to perform this invariance test for all possible parent sets?
- Why is the parent set empty in the definition of the source-variable set (section 4.1)? What does that imply?
- In practice with real-world data, is the variance always zero for all true parents (equation 1)? Why or why not? Should a threshold be used?
- How expensive is it to compute ? Do we have to iterate over all ? And do it again after performing step ii in Theorem 3?
Q3. Why is the parent set empty in the definition of the source-variable set (section 4.1)? What does that imply?
A: In section 4.1, this set denotes the source variables, which, by definition, are variables that have no parents. Hence, the parent set is empty for every source variable.
Q4. In practice with real-world data, is the variance always zero for all true parents (equation 1)? Why or why not? Should a threshold be used?
A: The variance is theoretically zero if we can enumerate over the possibly infinite set of source prior augmentations to compute it exactly. In practice, due to the limited amount of data and computation time, we only sample a large but finite number of augmentations. Our estimate will approach zero as the number of sampled augmentations increases -- see the discussion following Eq. (2).
As shown in Section 5 and Appendix C.2.1, the larger this number gets, the more effective our method becomes.
As for the threshold, there is no need to set it explicitly since we will always select the plausible parent set with the smallest variance according to the practical test in Part B of Section 4.1.
Q5. How expensive is it to compute ? Do we have to iterate over all ? And do it again after performing step ii in Theorem 3?
A: The cost of this computation is that of checking whether each pair of variables is independent using the Fisher test. As there are quadratically many such pairs in the number of variables, the total cost is quadratic as stated above. This quantity does not change after performing step ii in Theorem 3, so it does not need to be computed again. Hence, this is a one-time overhead that is not expensive.
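As an illustrative sketch of such a one-time pairwise screening (using a standard Fisher z-transform of the sample correlation; this is a generic stand-in, not necessarily the exact test used in the paper):

```python
import numpy as np
from scipy.stats import norm

def pairwise_fisher_dependence(data, alpha=0.05):
    """data: (num_samples, num_vars) array.
    Returns a boolean matrix whose (i, j) entry is True if variables i and j
    are judged (marginally) dependent by a Fisher z-test."""
    n_samples, n_vars = data.shape
    corr = np.corrcoef(data, rowvar=False)
    dependent = np.zeros((n_vars, n_vars), dtype=bool)
    for i in range(n_vars):
        for j in range(i + 1, n_vars):                 # O(n^2) pairs in total
            r = np.clip(corr[i, j], -0.999999, 0.999999)
            z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n_samples - 3)
            p_value = 2 * (1 - norm.cdf(abs(z)))
            dependent[i, j] = dependent[j, i] = p_value < alpha
    return dependent
```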
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
Mn5. It is unclear how the augmented datasets are sampled. How are the source priors obtained? Although these are discussed later, some hints/intuitive discussion should be provided earlier in the paper.
A: The augmented datasets are sampled using the procedure described in the statement of Theorem 5. This procedure requires the optimal downsampling rate, which is computed from a given choice of the augmentation as detailed in Theorem 4. The source prior is obtained via the procedure stated in Theorem 6 of Section 4.2.3. The intuition behind this procedure (i.e., the criteria and selection methods) is mentioned in the first two paragraphs of Section 4.2.3.
Mn6. “We cannot compute , … we can re-sample from so that ” – based on my understanding, the first case is computing the numerical probability table, and the second case is sampling without any such table. This difference should be made clear.
A: As explained in Mn4 above, while we can choose a variation of the source prior augmentation, we still cannot draw direct samples from the augmented distribution because it is defined via the conditional induced from the (unknown) true distribution. Thus, drawing samples from the augmented distribution (the second case) to run our invariance test has to be achieved via re-sampling from the original dataset (the first case) using a specific downsampling routine stated in Theorem 5, which is guaranteed to be statistically equivalent to a direct sample from the augmented distribution.
Mn7. More details on "downsampling without replacement" are needed.
A: This is described in the 2nd sentence of Theorem 5. To elaborate, we first determine the size of the downsample using Theorem 4 -- see Eq. (4). This allows us to determine, for each candidate value of the source variables, how many samples we need to draw (without replacement) from the original dataset. Those samples, across all observed candidate values, constitute the downsampled dataset. We will include this elaboration in the revised version of our paper.
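A minimal sketch of this kind of stratified downsampling without replacement is given below. It assumes a pandas DataFrame, a list `source_cols` of source-variable columns, a dictionary `new_source_prior` mapping observed source-value tuples to probabilities, and a total downsample size `m` derived from the rate in Eq. (4); all of these names are placeholders for illustration only.

```python
import numpy as np
import pandas as pd

def downsample_without_replacement(df, source_cols, new_source_prior, m, rng=None):
    """For each observed value s of the source variables, draw about m * P'(s)
    rows without replacement; the union of the strata forms the downsample."""
    rng = rng or np.random.default_rng()
    strata = []
    for s, group in df.groupby(source_cols):
        k = min(int(round(m * new_source_prior.get(s, 0.0))), len(group))
        idx = rng.choice(group.index, size=k, replace=False)
        strata.append(df.loc[idx])
    return pd.concat(strata) if strata else df.iloc[0:0]
```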
Mn8. An intuitive explanation of the “minimal downsampled rate” is required.
A: The intuition behind the minimal downsampled rate is mentioned in the first two paragraphs of Section 4.2.2. To further elaborate, our invariance test requires drawing samples D_i from the augmented distribution of the true oracle. As this cannot be done directly (as explained above), we achieve it by drawing each D_i as a downsample of the observed dataset D, which is supposed to preserve D's evidence of all (hidden) cause-effect relationships.
As such, D_i cannot contain duplicated data points from D because otherwise it might (statistically) emphasize (incorrect) cause-effect relationships that were not implied by D. Thus, D_i must be a downsample. But a downsample might lose information, so we want to downsample as little as possible; the minimal downsampled rate stipulates the maximum size of D_i at which we can still guarantee that it follows the augmented distribution. This rate is computable via Theorem 4 -- see Eq. (4).
Questions
Q1. How are the authors resampling the datasets?
A: To resample from the augmented distribution, we use the procedure described in the statement of Theorem 5. To elaborate, we determine the size of the downsample via Theorem 4. This allows us to determine, for each candidate value of the source variables, how many samples we need to draw (without replacement) from the original dataset. Those samples, across all observed candidate values, constitute the downsampled dataset D_i. Theorem 5 guarantees that such a D_i is statistically equivalent to a direct sample from the augmented distribution, which is what we need to run the invariance test in Part B, Section 4.1.
Q2. Do you have to perform this invariance test for all possible parent sets?
A: No, we only need to perform the invariance test for the maximal plausible sets, whose number is bounded in terms of the degenerate constant p (see Definition 4) of the augmented bidirectional graph in Theorem 7. Empirically, we find that p is small across all experimented datasets; hence, the number of maximal plausible sets is essentially a small constant. This is explicitly discussed in the paragraph following Theorem 7.
Thank you for your detailed review & constructive feedback. We have addressed all your concerns and questions below:
Major Concerns
Mj1. Suppose a variable is not a parent but an ancestor. Shouldn’t we also get variance = 0 (equation 1) in such cases? Does a change in P(Ancestor) affect P(Descendant | Ancestor)?
A: It is true that a change in P(Ancestor) will not cause a change in P(Descendant | Ancestor). This is in fact a stronger version of Theorem 1, which states that if the variance of the conditional over the (random) choice of source prior augmentation is positive, then the candidate set cannot consist of ancestors of the target and, consequently, cannot be its parent set. Theorem 1 as stated in the current manuscript remains correct.
The reason we did not use this stronger version of Theorem 1 to develop the causal test is that doing so would require an exhaustive enumeration over the space of all sets of ancestor candidates, which would be too costly.
Instead, Theorem 1 only requires enumerating over all possible sets of parent candidates of each node. This can be done effectively by utilizing existing routines to (approximately) find the Markov blanket (Edera et al., 2014), which is in turn leveraged to find the most plausible candidates for the parent set -- see Section 4.3. The test in Section 4.1, Part B -- a consequence of Theorem 1 -- can then be used to find the parent set among these plausible candidates.
As the entire causal structure can be recovered if we are able to recover the parent set of each node, Theorem 1 suffices, and we do not need its stronger version above which will otherwise induce a much more expensive test.
We hope this clarifies that this point is not a weakness of our proposal because our theorem and method remain correct.
Mj2. Many important concepts are delayed until section 4.2. The authors should consider introducing them earlier in the paper.
A: Thank you for the suggestion. We commit to moving the most important concepts up as suggested in our revised version.
Minor Concerns
Mn1. The authors mentioned “finding the maximum clique in an augmented bidirectional graph” multiple times but without a proper definition or example/visualization.
A: Theorem 7 establishes that each plausible parent set of a node corresponds to a maximal clique in a bidirectional graph defined constructively in the first sentence of Theorem 7's statement. Here, a clique is defined to be a set of nodes in a bidirectional or undirected graph such that there is a (bidirectional or undirected) edge between any two of them. Such cliques can be found efficiently via a DFS-based Bron-Kerbosch algorithm (see the paragraph following Theorem 7). Due to limited space in the main text, an example with visualization has been deferred to Appendix B.2 (see Fig. 6). We will refer to this figure explicitly in the revised version for better clarity. Thank you for the suggestion.
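For readers unfamiliar with the routine, a standard (non-pivoting) Bron-Kerbosch enumeration of maximal cliques over an undirected adjacency structure looks roughly as follows; this is a generic illustration and does not reproduce the paper's construction of the augmented bidirectional graph.

```python
def bron_kerbosch(adj, r=None, p=None, x=None):
    """Yield all maximal cliques of an undirected graph.
    adj: dict mapping each node to the set of its neighbours."""
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield set(r)
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)

# Example: two triangles sharing the edge (2, 3).
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(list(bron_kerbosch(adj)))   # [{1, 2, 3}, {2, 3, 4}]
```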
Mn2. The source variables should be defined in a little more detail.
A: As defined in Section 4.2.1 (1st line), a source variable is a variable that has no parent. We will promote this in-text statement to an explicit definition in the revised version. Thank you for the suggestion.
Mn3. What does the notation in equation 2 refer to? It should be made precise.
A: We apologize for the typo: the symbol should appear without the prime ('). It denotes the induced conditional as defined in the second sentence of Theorem 1's statement. This stray prime (') appears once in Eq. (2) and once in Eq. (3), and we will correct it in the revised version, which will be uploaded soon.
Mn4. “The intuition is if we can re-sample from such that”: this is a little unclear. How are the re-sampled dataset and the original dataset different?
A: They are different: D_i is a downsample of D (i.e., D_i is a subset of D) such that D_i follows an augmented distribution of the data. Since this augmented distribution is a function of the (unknown) true distribution -- see the 2nd sentence of Theorem 1 -- we cannot sample from it directly to run our invariance test (see Part B of Section 4.1, which is a consequence of Theorem 1). To overcome this difficulty, we have derived a downsampling procedure which provably guarantees that the sampled D_i indeed follows the desired augmented distribution -- see Theorem 5, which explicitly describes this downsampling procedure.
The authors propose a scheme for learning causal graphs without relying on strict interventions, instead using distributional invariances. The idea is interesting, but the reviewers had a hard time following the details, and several gaps in the presentation were raised. The overall idea is quite complicated, and it would be in the authors' interest to simplify the presentation and focus on the core novel ideas in a future re-submission.
Additional Comments from Reviewer Discussion
There was extensive discussion between the reviewers and authors.
Reject