PaperHub
Rating: 5.0 / 10 · Rejected (4 reviewers; min 3, max 6, std 1.2)
Individual scores: 6, 3, 6, 5
Average confidence: 3.8
Correctness: 2.5 · Contribution: 2.8 · Presentation: 3.3
ICLR 2025

Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship

OpenReview · PDF
Submitted: 2024-09-27 · Updated: 2025-02-05

Abstract

Keywords
Causal Graph Learning, Invariance

Reviews & Discussion

Review
Rating: 6

Based on the fact that the distribution of an effect, conditioned on its causes, remains invariant to changes in the prior distribution of those causes, this paper proposes a causal discovery algorithm for large-scale datasets. Specifically, it designs an invariance test, realized via a downsampling scheme. It also exploits the Markov blankets of all variables to reduce time complexity. Experiments on synthetic datasets and real-world networks show better scalability.
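The invariance this summary relies on is easy to check numerically. Below is a minimal, hypothetical sketch (our own toy model, not the paper's code): a binary cause X drives an effect Y; shifting the prior P(X) by downsampling leaves the empirical conditional P(Y|X) essentially unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: binary cause X -> effect Y with fixed P(Y=1|X).
n = 200_000
x = rng.binomial(1, 0.3, n)                      # original prior P(X=1) = 0.3
y = rng.binomial(1, np.where(x == 1, 0.7, 0.2))  # P(Y=1|X=1)=0.7, P(Y=1|X=0)=0.2

def cond(xs, ys, v):
    """Empirical P(Y=1 | X=v)."""
    return ys[xs == v].mean()

# Downsample (without replacement) to mimic a new prior P'(X=1) = 0.8:
# keep every X=1 row and just enough X=0 rows for the 80/20 split.
i1 = np.flatnonzero(x == 1)
i0 = np.flatnonzero(x == 0)
keep = np.concatenate([i1, rng.choice(i0, len(i1) // 4, replace=False)])
xd, yd = x[keep], y[keep]

# The prior moved (0.3 -> ~0.8) but the conditional did not.
shift = abs(xd.mean() - x.mean())
drift = max(abs(cond(xd, yd, v) - cond(x, y, v)) for v in (0, 1))
```

With the seed above, `shift` is roughly 0.5 while `drift` stays within sampling noise, which is the cause-effect invariance the paper's test exploits.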

Strengths

  • This paper is written well, with clear descriptions and motivations.

  • The authors propose practical algorithms for causal discovery, with some interesting theoretical findings, e.g., the basis of a DAG, the minimal downsampling rate, etc.

  • The experiments on synthetic datasets and real-world networks are extensive, verifying the method's advantages on large-scale datasets.

Weaknesses

  • Some details seem to be missing in the paper. For example,

i) Footnote 2 and Theorem 1 explain how to find non-parent sets, but how to set the threshold for the variance is not clear. Please give the details in the paper.

ii) How are the different priors P_i(X) learned, with the estimated m? Did the authors assume some distributions?

  • Theorem 2 provides a necessary condition to test whether a subset Z is the parent set of X. However, it is not a sufficient condition, although the authors stated, "When m is infinitely large, the implication in Eq. (1) becomes bi-directional and V[P_+(X | Z)] = 0 definitively implies Z = Pa[X]". It is not clear why this implication holds. Please elaborate on it more.

  • It would be better to validate on some real-world datasets. This is because bnlearn provides real networks and generates the data from those networks, so these datasets look semi-synthetic. BTW, in Table 2, when dealing with small-scale datasets (or even the large-scale dataset Munin), the runtimes all seem unsatisfactory. Please explain this.

Questions

  • Does the proposed causal discovery algorithm work for time-series data? What are the challenges?

I would be very glad to increase my score if the authors could resolve my concerns.

Comment

C3. It would be better to validate on some real-world datasets. This is because bnlearn provides real networks and generates the data from those networks, so these datasets look semi-synthetic. BTW, in Table 2, when dealing with small-scale datasets (or even the large-scale dataset Munin), the runtimes all seem unsatisfactory. Please explain this.

A: Thank you for the suggestion on validating our approach with a real-world dataset. Following your suggestions, we have run extra experiments with the real-world dataset SACHS (Sachs et al., 2005), which has been widely used in the existing literature on causal discovery. This dataset consists of continuous measurements of expression levels of proteins and phospholipids in human immune system cells and is widely accepted by the biology community. The SACHS graph contains 11 variables and 17 edges. The results are reported in the following table.

| Baseline | Metric | SHD | Spurious rate | TPR | Runtime (seconds) |
|---|---|---|---|---|---|
| GLIDE | Mean | 8.7 | 0 | 0.833 | 10.47 |
| | 95-conf | 0.48 | 0 | 0.054 | 0.71 |
| DAS | Mean | 22.27 | 0.55 | 0.25 | 16.45 |
| | 95-conf | 1.65 | 0.04 | 0.03 | 4.24 |
| SCORE | Mean | 13.2 | 0.23 | 0.49 | 30.02 |
| | 95-conf | 0.67 | 0.04 | 0.04 | 3.98 |
| FCI | Mean | 17 | 0.44 | 0.33 | 4.06 |
| | 95-conf | 0 | 0 | 0 | 0.63 |
| GIES | Mean | 27 | 0.58 | 0.15 | 2.69 |
| | 95-conf | 0 | 0 | 0 | 0.12 |
| Notears | Mean | 51 | 0.69 | 0.09 | 575.97 |
| | 95-conf | 0 | 0 | 0 | 14.11 |
| MLP-Notears | Mean | 43.33 | 0.65 | 0.15 | 164.86 |
| | 95-conf | 4.77 | 0.03 | 0.02 | 12.73 |

The reported results show that our proposal GLIDE significantly outperforms other baselines, achieving the lowest SHD and spurious rate. When compared to the second-best baseline (SCORE), GLIDE notably doubles the TPR of SCORE while running three times faster.

In Table 2, we note that for the first 5 small datasets, our approach returns the result within 1 minute. For the Pathfinder dataset, our approach takes 3 minutes, and for the Munin dataset, it takes less than 2 hours. We believe such running time is reasonable given the significant performance improvement over the baselines.

Furthermore, given that the time complexity of our approach scales quadratically with the number of variables, these results are justifiable and consistent with the figures reported in Sections 5.1 and 5.2. In addition, it is worth noting that our running time is also much faster than state-of-the-art methods such as DAS and SCORE, as reported in Section 5.1.

Reference:

Karen Sachs et al., Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308, 523-529 (2005). DOI:10.1126/science.1105809

Questions

Q1. Does the proposed causal discovery algorithm work for time-series data? What are the challenges?

A: In principle, our proposed causal discovery algorithm should work for (multivariate) time-series data, which is also underpinned by an (unknown) Bayesian network reflecting its causal relationships. In practice, we foresee a practical challenge in terms of scalability. To elaborate, the size of the causal graph scales linearly with the number of time steps: for d-variable cases, the total number of nodes in the Bayesian network unrolled over T time steps is O(dT). As the complexity of our proposal is O(|nodes|^2), it will grow quadratically in the number T of time steps, which causes scalability issues since T can be very large. This could be an interesting and novel direction for a potential follow-up of our work. Thank you very much for this interesting suggestion.

If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.

Comment

Thank you for your detailed review & constructive feedback. We have addressed all your concerns and questions below:

Concerns

C1. Some details seem to be missing in the paper. For example, i) Footnote 2 and Theorem 1 explain how to find non-parent sets, but how to set the threshold for the variance is not clear. Please give the details in the paper.

A: We would like to explain that there is no need to set a threshold for the variance. Footnote 2 and Theorem 1 suggest that when Z is the true parent set, the variance will be zero, which means that, among plausible candidates, the one resulting in the smallest variance is the true parent set. As such, we can run the invariance test following the procedure outlined in Part B of Section 4.1 without having to set any thresholds.
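This threshold-free, smallest-variance selection can be illustrated with a toy simulation (our own hypothetical sketch, not the authors' implementation): in a chain Z -> X -> W, shifting the source prior P(Z) by downsampling leaves P(X|Z) essentially unchanged, while P(X|W) moves, so the true parent wins without any threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical chain Z -> X -> W (all binary). Candidate parent sets for X: {Z}, {W}.
n = 100_000
z = rng.binomial(1, 0.5, n)
x = rng.binomial(1, np.where(z == 1, 0.8, 0.1))
w = rng.binomial(1, np.where(x == 1, 0.9, 0.2))
data = np.stack([z, x, w], axis=1)   # columns: 0=Z, 1=X, 2=W

def cond_table(d, target, cand):
    """Empirical P(target=1 | cand=v) for v in {0, 1}."""
    return np.array([d[d[:, cand] == v, target].mean() for v in (0, 1)])

def variance_score(d, target, cand, m=20, size=20_000):
    """Variance of the estimated conditional across m downsampled datasets,
    each shifting the source prior P(Z=1) to a fresh random value."""
    i1, i0 = np.flatnonzero(d[:, 0] == 1), np.flatnonzero(d[:, 0] == 0)
    tables = []
    for _ in range(m):
        p = rng.uniform(0.2, 0.8)    # perturbed source prior P(Z=1)
        keep = np.concatenate([
            rng.choice(i1, int(size * p), replace=False),
            rng.choice(i0, size - int(size * p), replace=False)])
        tables.append(cond_table(d[keep], target, cand))
    return float(np.var(np.stack(tables), axis=0).sum())

score_parent = variance_score(data, target=1, cand=0)  # true parent Z
score_child  = variance_score(data, target=1, cand=2)  # child W, a non-parent
```

Here `score_parent` reflects only sampling noise, whereas `score_child` is orders of magnitude larger, so ranking candidates by variance suffices.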

ii) How are the different priors P_i(X) learned, with the estimated m? Did the authors assume some distributions?

A: The augmentation P_i(X) is defined in the statement of Theorem 1, which stipulates that P_i(X) = P_i(B) · P(X \ B | B), where P(X \ B | B) is the conditional induced from P(X). Thus, P_i(X) is theoretically defined via P_i(B), with B being the set of source variables. P_i(B) is in turn sampled from a simplex over a space of categorical distributions as defined in Theorem 6. Here, m is the number of samples. As discussed in the paragraph following Eq. (2), the invariance test becomes more accurate as m increases.

Furthermore, we want to explain that we do not learn or compute P_i(X), since the invariance test only requires drawing samples from P_i(X). This is possible without knowing the exact form of P_i(X), as detailed in Theorem 5, which shows that a downsample D_i of D following a particular downsampling procedure (see the 2nd sentence in the statement of Theorem 5) is statistically equivalent to a direct sample from P_i(X).

C2. Theorem 2 provides a necessary condition to test whether a subset Z is the parent set of X. However, it is not a sufficient condition, although the authors stated, "When m is infinitely large, the implication in Eq. (1) becomes bi-directional and V[P(X|Z)] = 0 definitively implies Z = Pa(X)". It is not clear why this implication holds. Please elaborate on it more.

A: We would like to explain that if the variance V[P(X|Z)] could be computed exactly, via an exhaustive enumeration over all possible source prior augmentations of P, it would be zero if and only if Z = Ancestor(X). This is true since, on the true causal graph, changing P(Ancestor(X)) obviously does not change P(X | Ancestor(X)). Since we only search for Z in the space of subsets of the Markov blanket of X, Ancestor(X) reduces to Pa(X), and thus V[P(X|Z)] = 0 definitively implies Z = Pa(X) when V[P(X|Z)] can be computed exactly, which requires an infinite number of sampled source prior augmentations of P (i.e., when m is infinitely large).

Review
Rating: 3

This paper proposes a new framework that leverages the invariance of the effect conditioned on its causes for causal discovery from observational data. The main idea is to perturb the distribution P(cause) and check whether P(effect|cause) changes after the perturbation.

Strengths

  • This work leverages the invariance of the conditional distribution and proposes a downsampling method, combining the two to find the parent set.

Weaknesses

  • The main issue is that, since real interventions are not applicable to observational data, the paper provides a downsampling technique to approximate P(effect|cause) after the disturbance, which, however, has no theoretical guarantee. That is, how can one guarantee that such a downsample correctly corresponds to the real distribution after the disturbance?
  • Since the basis variables may include leaf vertices, in such a case changing the prior of the basis variables will not affect the distribution of their ancestors, and thus may not have an effect similar to changing the prior over the source variables.

Typos:

  • "Theorem 2" -> "Theorem 4" in Theorem 5

After rebuttal:

After multiple rounds of discussion with the authors, my fundamental concerns remain unresolved. The core issue persists: the proposed approach fails to adequately address the challenge of obtaining interventional distributions from observational data.

As highlighted in seminal works by Pearl [1] and [2], answering interventional questions such as "What would be the impact on the system if this variable were changed from value x to y?" requires explicit causal knowledge. This limitation is well-established in the causal inference literature, as also discussed comprehensively by [3]. The authors' repeated attempts to address my concerns through mathematical manipulations and resampling techniques do not overcome the fundamental identification problem in causality.

[1] Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.

[2] Brouillard, Philippe, et al. "Differentiable causal discovery from interventional data." Advances in Neural Information Processing Systems 33 (2020): 21865-21877.

[3] Bareinboim, Elias, et al. "On Pearl's hierarchy and the foundations of causal inference." Probabilistic and Causal Inference: The Works of Judea Pearl. 2022. 507-556.

Questions

See the weakness above.

Comment

Thank you for your detailed review & constructive feedback.

We have addressed all your concerns below:

C1. The main issue is that, since real interventions are not applicable to observational data, the paper provides a downsampling technique to approximate P(effect|cause) after the disturbance, which, however, has no theoretical guarantee. That is, how can one guarantee that such a downsample correctly corresponds to the real distribution after the disturbance?

A: Theorem 5 guarantees that the downsampled dataset D_i, drawn from the original dataset D following a particular downsampling procedure (see the 2nd sentence of Theorem 5's statement), is statistically equivalent to a direct sample of the real distribution after the disturbance, P_i (as defined in the 2nd sentence of Theorem 1). We hope this has addressed the reviewer's concern.

In addition, we would also like to emphasize that there is generally no guarantee of certainty for causal discovery without intervention. In fact, even with interventions, causal discovery is still not perfect due to the reliance on existing statistical independence tests (which are highly accurate but certainly not perfect). We therefore believe the main focus should be on actual empirical performance rather than theoretical guarantees. We hope the reviewer will reconsider this point.

C2. Since the basis variables may include leaf vertices, in such a case changing the prior of the basis variables will not affect the distribution of their ancestors, and thus may not have an effect similar to changing the prior over the source variables.

A: It is unlikely that the basis includes a leaf node, because our selection method, as stated in Theorem 3, always prioritizes selecting nodes with the smallest dependent set (according to d-separation) first. Lemma 2 (in Appendix B1) shows that such nodes are either a source node or dependent on exactly one source node. Such nodes are unlikely to be leaf nodes except in some artificial graph structures. For clarity, we refer the reviewer to an intuitive example in Appendix B1 where we demonstrate, step by step, the basis-selection procedure on the causal graph of the ASIA dataset.

Furthermore, it is also fine if the basis includes some leaf nodes, as long as it also includes the source nodes. Following the above argument, it is even more unlikely to arrive at a basis that contains only leaf nodes; this does not happen in any of the widely used benchmark datasets in our experiments. As a matter of fact, our numerous experiments consistently show a significant improvement over SOTA baselines across a wide variety of data scenarios and graph topologies. We hope the reviewer will reconsider this point, given the above and the fact that, in general, there is no perfect guarantee for causal discovery without intervention.

C3. Typos: "Theorem 2" -> "Theorem 4" in Theorem 5

A: Thank you for helping us catch the typo. We will fix this in our revised version.

If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.

Review
Rating: 6

This paper introduces a novel approach to causal learning through a new invariance test for causality, which underpins a reliable and scalable algorithm for reconstructing causal graphs from observational data. This method leverages a core insight that the conditional distribution of the effect given the cause remains invariant under changes in the prior distribution of the cause. This insight enables a parent-identification process for each variable using synthetic data augmentation. This process is integrated with an efficient search algorithm that utilizes prior knowledge of each effect variable’s Markov blanket, along with the empirically observed sparsity of causal graphs, to significantly reduce computational complexity.

Strengths

  1. The proposed method is rather novel.
  2. Overall, the paper is well-structured and clearly written.
  3. The experiments are extensive, covering 3 types of functional causal models, 6 causal discovery baseline methods, and varying graph sizes.

Weaknesses

Any thoughts on how to extend your method to handle heterogeneous or time-series datasets?

Questions

(See above)

Comment

Thank you for recognizing our contributions.

We have addressed all your questions below:

Q1. Any thoughts on how to extend your method to handle heterogeneous or time-series datasets?

A: In principle, our proposed causal discovery algorithm should work for (multivariate) time-series data, which is also underpinned by an (unknown) Bayesian network reflecting its causal relationships. In practice, we foresee a practical challenge in terms of scalability. To elaborate, the size of the causal graph scales linearly with the number of time steps: for d-variable cases, the total number of nodes in the Bayesian network unrolled over T time steps is O(dT). As the complexity of our proposal is O(|nodes|^2), it will grow quadratically in the number of time steps, which causes scalability issues since T can be very large. Thus, the key challenge in extending this to time-series data is scalability.

Our proposed method can also handle multiple heterogeneous datasets sharing the same causal graph (albeit with different sets of effect-cause conditional distributions). In this case, the invariance test can be slightly repurposed to simply find the plausible parent set with minimal variance (see Eq. (1)) across those datasets. We have in fact conducted a preliminary study of this setting in Appendix C.2.5.

Thank you for these suggestions. We think these are novel directions for potential follow-up of our current work.

If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.

Comment

I appreciate the authors' responses, which resolved my concerns. Therefore, I will maintain my positive score.

Review
Rating: 5

The paper utilizes the invariance of P(X | Pa(X)) under different P(Pa(X)) to find the causal parents. For this purpose, the authors choose some source priors, generate corresponding datasets, and verify the invariance. Finally, they show their performance on a variety of synthetic and real-world setups.

Strengths

The paper is well written. The figure is helpful for understanding the algorithm. The experiments are extensive.

Weaknesses

Minor weaknesses:

  • The authors mentioned “finding the maximum clique in an augmented bidirectional graph” multiple times but without a proper definition or example/visualization.
  • The source variables should be defined in a little more detail.
  • What does P' in Equation 2 refer to? It should be made precise.
  • "The intuition is if we can re-sample D_i from D ~ P(X) such that D_i ~ P_i(X)," This is a little unclear. How are D ~ P(X) and D_i ~ P_i(X) different?
  • It is unclear how D_1, D_2, ..., D_M are sampled. How are the m source priors (P_i(B)) obtained? Although these are discussed later, some hints/intuitive discussion should be provided earlier in the paper.
  • "We cannot compute P_i(X), ... we can re-sample D_i from D so that D_i ~ P_i(X)" – based on my understanding, the first case is computing the numerical probability table, and the second case is sampling without any such table. This difference should be made clear.
  • More details on "downsampling without replacement" are needed.
  • An intuitive explanation of the “minimal downsampled rate” is required.

Major weaknesses

  • Suppose Z is not a parent but an ancestor. Shouldn't we also get variance = 0 (Equation 1) in such cases? Does a change in P(Ancestor) affect P(Descendant | Ancestor)?
  • Many important concepts are delayed until section 4.2. The authors should consider introducing them earlier in the paper.

I will consider increasing the score after seeing the authors' response and the reviewer discussion.

Questions

Questions:

  • How are the authors resampling the datasets?
  • Do you have to perform this invariance test for all possible parent sets?
  • Why is Pa[B] = ∅ in the definition of set B (Section 4.1)? What does that imply?
  • In practice with real-world data, is the variance always zero for all true parents (equation 1)? Why or why not? Should a threshold be used?
  • How expensive is it to compute Φ(X)? Do we have to iterate over all X? And do it again after performing step ii in Theorem 3?
Comment

Q3. Why is Pa(B) = ∅ in the definition of set B (Section 4.1)? What does that imply?

A: In Section 4.1, B is used to denote the set of source variables, which, by definition, are variables that have no parents. Hence, Pa(B) is the empty set for every variable B in the source set.

Q4. In practice with real-world data, is the variance always zero for all true parents (equation 1)? Why or why not? Should a threshold be used?

A: The variance is theoretically zero if we can enumerate over the possibly infinite set of P_+ to compute it exactly. In practice, due to the limited amount of data and computation time, we only sample a large but finite number of P_+. Our estimate approaches zero as the number of sampled P_+ increases -- see the discussion following Eq. (2).

As shown in Section 5 and Appendix C.2.1, the larger this number gets, the more effective our method becomes.

As for the threshold, there is no need to set it explicitly since we will always select the plausible parent set with the smallest variance according to the practical test in Part B of Section 4.1.

Q5. How expensive is it to compute Φ(X)? Do we have to iterate over all X? And do it again after performing step ii in Theorem 3?

A: The cost to compute Φ(X) for all X is O(nd^2), which is the total cost of checking whether each pair of variables is independent using the O(n) Fisher test; as there are O(d^2) such pairs, the total cost is O(nd^2) as stated above. Φ(X) does not change after performing step ii in Theorem 3, so it does not need to be computed again. Hence, computing Φ(X) is a one-time overhead that is not expensive.
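This pairwise screening can be sketched generically as follows; the function `phi`, the concrete test form (Fisher z-transform of the sample correlation), and all names here are our own illustrative assumptions, not the paper's code:

```python
import math
import numpy as np

def phi(data, alpha=1e-4):
    """For each variable i, the set of variables marginally dependent on i,
    screened with an O(n)-per-pair Fisher z-test on sample correlations
    (O(n d^2) overall for d variables and n samples)."""
    n, d = data.shape
    corr = np.corrcoef(data, rowvar=False)
    dep = {i: set() for i in range(d)}
    for i in range(d):
        for j in range(i + 1, d):
            r = float(np.clip(corr[i, j], -0.999999, 0.999999))
            z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - 3)
            p = 1 - math.erf(abs(z) / math.sqrt(2))   # two-sided p-value
            if p < alpha:
                dep[i].add(j)
                dep[j].add(i)
    return dep

# Tiny check on synthetic data: A -> B, with C independent of both.
rng = np.random.default_rng(2)
a = rng.normal(size=5000)
b = a + 0.3 * rng.normal(size=5000)
c = rng.normal(size=5000)
dep = phi(np.stack([a, b, c], axis=1))
```

Since the correlation matrix and each z-statistic are O(n) per pair, the one-time overhead claim above follows directly.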

If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.

Comment

Mn5. It is unclear how the D_i are sampled. How are the source priors (P_i(B)) obtained? Although these are discussed later, some hints/intuitive discussion should be provided earlier in the paper.

A: D_i is sampled using the procedure described in the statement of Theorem 5. This procedure requires the optimal downsampling rate, which is computed from a given choice of the augmentation P_i(B) as detailed in Theorem 4. The source prior P_i(B) is obtained via the procedure stated in Theorem 6 of Section 4.2.3. The intuition behind this procedure (i.e., the criteria and selection methods) is given in the first two paragraphs of Section 4.2.3.

Mn6. "We cannot compute P_i, ... we can re-sample D_i from D so that D_i ~ P_i" – based on my understanding, the first case is computing the numerical probability table, and the second case is sampling without any such table. This difference should be made clear.

A: As explained in Mn4 above, while we can choose a variation of the source prior augmentation P_i(B), we still cannot draw direct samples from P_i(X), because P_i(X) = P_i(B) · P(X \ B | B), where P(X \ B | B) is the induced conditional of the (unknown) true distribution P(X). Thus, drawing a sample D_i from P_i (the second case) to run our invariance test has to be achieved by re-sampling D_i from D (the first case) using the specific downsampling routine stated in Theorem 5, which is guaranteed to be statistically equivalent to drawing a direct sample from P_i.

Mn7. More details on "downsampling without replacement" are needed.

A: This is described in the 2nd sentence of Theorem 5. To elaborate, we first determine the size of the downsample D_i using Theorem 4 -- see Eq. (4). This allows us to determine, for each candidate value b of the source variables B, how many samples with B = b we need to draw (without replacement) from D. Those samples, across all observed candidate values B = b, constitute D_i. We will include this elaboration in the revised version of our paper.

Mn8. An intuitive explanation of the “minimal downsampled rate” is required.

A: The intuition behind the minimal downsampled rate is given in the first two paragraphs of Section 4.2.2. To further elaborate, our invariance test requires drawing samples D_i from the augmented distribution P_i(B) of the true oracle P(X). As this cannot be done directly, as explained above, we achieve it by drawing a sample D_i from D that is supposed to preserve D's evidence of all (hidden) cause-effect relationships.

As such, D_i cannot contain duplicated data points from D, because otherwise it might (statistically) emphasize (incorrect) cause-effect relationships that were not implied by D and hence P_i. Thus, D_i must be a downsample. But a downsample might lose information, so we want to minimize |D|/|D_i|, which stipulates the maximum size of D_i at which we can still guarantee D_i ~ P_i(X). This rate is computable via Theorem 4 -- see Eq. (4).
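As an illustration of this size computation, here is a small hypothetical helper (our own sketch under stated assumptions, not the procedure of Theorems 4-5): it finds the largest downsample of D whose empirical marginal over a source column matches a target prior, sampling without replacement within each stratum.

```python
import numpy as np

def downsample_to_prior(D, b_col, new_prior, rng):
    """Largest downsample of D (without replacement) whose empirical marginal
    over column b_col matches new_prior (a dict: value -> probability)."""
    idx = {v: np.flatnonzero(D[:, b_col] == v) for v in new_prior}
    # Maximal total size s with s * p_v <= #available rows for every value v;
    # |D| / s is then the minimal downsampling ratio for this prior.
    s = int(min(len(idx[v]) / p for v, p in new_prior.items() if p > 0))
    keep = np.concatenate([rng.choice(idx[v], int(round(s * p)), replace=False)
                           for v, p in new_prior.items()])
    return D[keep]

rng = np.random.default_rng(3)
b = rng.binomial(1, 0.25, 10_000)          # source variable, P(B=1) = 0.25
D = np.stack([b, rng.binomial(1, 0.5, 10_000)], axis=1)
Di = downsample_to_prior(D, 0, {0: 0.4, 1: 0.6}, rng)
```

The binding constraint comes from the rarest stratum relative to its target probability, which is exactly why a larger D_i cannot be guaranteed without replacement.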

Questions

Q1. How are the authors resampling the datasets?

A: To resample D_i from D, we use the procedure described in the statement of Theorem 5. To elaborate, we determine the size of D_i via Theorem 4. This allows us to determine, for each candidate value b of the source variables B, how many samples with B = b we need to draw (without replacement) from D. Those samples, across all observed candidate values B = b, constitute D_i. Theorem 5 guarantees that such a D_i is statistically equivalent to a direct sample from P_i, which is what we need to run the invariance test in Part B, Section 4.1.

Q2. Do you have to perform this invariance test for all possible parent sets?

A: No, we only need to perform the invariance test for all maximal plausible sets, of which there are at most O(dp·3^{p/3}), where p is the degenerate constant (see Definition 4) of the augmented bidirectional graph in Theorem 7. Empirically, we find that p is at most 13 across all experimented datasets. Hence, 3^{p/3} < 117 and O(dp·3^{p/3}) is essentially O(d) given the reasonably small constant values of p and 3^{p/3}. This is explicitly discussed in the paragraph following Theorem 7.
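For reference, enumerating the maximal cliques that yield these plausible sets can be done with the standard Bron-Kerbosch algorithm with pivoting; below is a generic textbook-style sketch (not the authors' code), whose worst-case output size is the classical O(3^{|V|/3}) bound on the number of maximal cliques:

```python
def maximal_cliques(adj):
    """All maximal cliques of an undirected graph given as {node: set(neighbors)},
    enumerated via Bron-Kerbosch with pivoting."""
    out = []

    def expand(R, P, X):
        # R: current clique; P: candidates to extend it; X: already-explored nodes.
        if not P and not X:
            out.append(frozenset(R))      # R is maximal
            return
        pivot = max(P | X, key=lambda u: len(adj[u] & P))
        for v in list(P - adj[pivot]):    # skip the pivot's neighbors
            expand(R | {v}, P & adj[v], X & adj[v])
            P.remove(v)
            X.add(v)

    expand(set(), set(adj), set())
    return out

# Tiny graph: triangle {0, 1, 2} plus the pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cliques = maximal_cliques(adj)
```

On this toy graph the two maximal cliques are {0, 1, 2} and {2, 3}; pivoting is what keeps the enumeration within the 3^{p/3}-type bound cited above.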

Comment

Thank you for your detailed review & constructive feedback. We have addressed all your concerns and questions below:

Major Concerns

Mj1. Suppose Z is not a parent but an ancestor. Shouldn't we also get variance = 0 (Equation 1) in such cases? Does a change in P(Ancestor) affect P(Descendant | Ancestor)?

A: It is true that a change in P(Ancestor) will not cause a change in P(Descendant | Ancestor). This is in fact a stronger statement than Theorem 1, which states that if the variance of P(X | Z) over the (random) choice of P(Z) is positive, then Z cannot be the ancestor set of X and, consequently, Z cannot be the parent set of X. Theorem 1 as stated in the current manuscript remains correct.

The reason we did not use this stronger version of Theorem 1 to develop the causal test is that doing so would require an exhaustive enumeration over the space of all sets of ancestor candidates, which would be too costly.

Instead, Theorem 1 only requires enumerating over all possible sets of parent candidates of each node. This can be done effectively by utilizing existing routines to (approximately) find the Markov blanket (Edera et al., 2014) in O(d^2), which is in turn leveraged to find the O(d) most plausible candidates for the parent set in O(d^2) -- see Section 4.3. The test in Section 4.1, Part B -- a consequence of Theorem 1 -- can then be used to find the parent set among these plausible candidates.

As the entire causal structure can be recovered if we are able to recover the parent set of each node, Theorem 1 suffices, and we do not need its stronger version above which will otherwise induce a much more expensive test.

We hope this clarifies that this point is not a weakness of our proposal because our theorem and method remain correct.

Mj2. Many important concepts are delayed until section 4.2. The authors should consider introducing them earlier in the paper.

A: Thank you for the suggestion. We commit to moving the most important concepts up as suggested in our revised version.

Minor Concerns

Mn1. The authors mentioned “finding the maximum clique in an augmented bidirectional graph” multiple times but without a proper definition or example/visualization.

A: Theorem 7 establishes that each plausible parent set of a node X corresponds to a maximal clique in a bidirectional graph, defined constructively in the first sentence of Theorem 7's statement. Here, a clique is a set of nodes in a bidirectional or undirected graph such that there is a (bidirectional or undirected) edge between any two of them. Such cliques can be found efficiently via a DFS-based Bron-Kerbosch algorithm (see the paragraph following Theorem 7). Due to limited space in the main text, an example with visualization has been deferred to Appendix B.2 (see Fig. 6). We will refer to this figure explicitly in the revised version for better clarity. Thank you for the suggestion.

Mn2. The source variables should be defined in a little more detail.

A: As defined in Section 4.2.1 (1st line), a source variable is a variable that has no parent. We will create an explicit definition item (rather than an in-text statement) for this concept in the revised version. Thank you for the suggestion.

Mn3. What does P' in Equation 2 refer to? It should be made precise.

A: We apologize for the typo. This should be P_i(X|Z), with no prime ('). Here, P_i(X|Z) is the conditional induced from P_i(X), as defined in the second sentence of Theorem 1's statement. This typo occurred once in Eq. (2) and once in Eq. (3); we will correct it in the revised version, which will be uploaded soon.

Mn4. "The intuition is if we can re-sample D_i from D ~ P such that D_i ~ P_i" This is a little unclear. How are D ~ P and D_i ~ P_i different?

A: D ~ P and D_i ~ P_i are different. D_i is a downsample of D (i.e., D_i is a subset of D) such that D_i follows an augmented distribution P_i of P. Since P_i is a function of P -- see the 2nd sentence of Theorem 1 -- which is unknown, we cannot sample directly from P_i to run our invariance test -- see Part B of Section 4.1, which is a consequence of Theorem 1. To overcome this difficulty, we have derived a downsampling procedure D_i ~ D which provably guarantees that the sampled D_i indeed follows P_i, as desired -- see Theorem 5, which explicitly describes this downsampling procedure.

AC Meta-Review

The authors propose a scheme for learning causal graphs without relying on strict interventions, and instead using distributional invariances. The idea is interesting, but reviewers had a hard time following the details, and there were several gaps in the presentation raised by reviewers. The overall idea is quite complicated, and it would be in the interest of the authors to simplify the presentation and focus on the core novel ideas in a future re-submission.

Additional Comments from Reviewer Discussion

There was extensive discussion between the reviewers and authors.

Final Decision

Reject