Causal Graph Learning via Distributional Invariance of Cause-Effect Relationship
Abstract
Reviews and Discussion
Based on the fact that the distribution of an effect, conditioned on its causes, remains invariant to changes in the prior distribution of those causes, this paper proposes a causal discovery algorithm for large-scale datasets. Specifically, it designs an invariance test, which is realized via a downsampling scheme. It also exploits a search over the Markov blankets of all variables to reduce the time complexity. Experiments on synthetic datasets and real-world networks show better scalability.
Strengths
- This paper is well written, with clear descriptions and motivations.
- The authors propose practical algorithms for causal discovery, with some interesting theoretical findings, e.g., the basis of a DAG and the minimal downsampling rate.
- The experiments on synthetic datasets and real-world networks are extensive and verify the advantages on large-scale datasets.
Weaknesses
- Some details seem to be missing in the paper. For example,
i) Footnote 2 and Theorem 1 explain how to find non-parent sets, but how to set the threshold for the variance is not clear. Please give the details in the paper.
ii) How to learn the different priors with the estimated m? Did the authors assume some distributions?
- Theorem 2 provides a necessary condition to test whether a subset is the parent set of X, but not a sufficient condition. The authors stated, "When m is infinitely large, the implication in Eq. (1) becomes bi-directional and definitively implies [the true parent set]", but it is not clear why this implication holds. Please elaborate on it further.
- It would be better to validate on some real-world datasets, since bnlearn provides real networks and generates data from those networks, so these datasets look semi-synthetic. BTW, in Table 2, when dealing with small-scale datasets (or even the large-scale dataset Munin), the runtimes do not seem satisfactory. Please explain this.
Questions
- Does the proposed causal discovery algorithm work for time-series data? What are the challenges?
I would be very glad to increase my score if the authors could resolve my concerns.
C3. It would be better to validate on some real-world datasets, since bnlearn provides real networks and generates data from those networks, so these datasets look semi-synthetic. BTW, in Table 2, when dealing with small-scale datasets (or even the large-scale dataset Munin), the runtimes do not seem satisfactory. Please explain this.
A: Thank you for the suggestion on validating our approach with a real-world dataset. Following your suggestions, we have run extra experiments with the real-world dataset SACHS (Sachs et al., 2005), which has been widely used in the existing literature on causal discovery. This dataset consists of continuous measurements of expression levels of proteins and phospholipids in human immune system cells and is widely accepted by the biology community. The SACHS graph contains 11 variables and 17 edges. The results are reported in the following table.
| Baselines | Metrics | SHD | Spurious rate | TPR | Runtime (seconds) |
|---|---|---|---|---|---|
| GLIDE | Mean | 8.7 | 0 | 0.833 | 10.47 |
| | 95-conf | 0.48 | 0 | 0.054 | 0.71 |
| DAS | Mean | 22.27 | 0.55 | 0.25 | 16.45 |
| | 95-conf | 1.65 | 0.04 | 0.03 | 4.24 |
| SCORE | Mean | 13.2 | 0.23 | 0.49 | 30.02 |
| | 95-conf | 0.67 | 0.04 | 0.04 | 3.98 |
| FCI | Mean | 17 | 0.44 | 0.33 | 4.06 |
| | 95-conf | 0 | 0 | 0 | 0.63 |
| GIES | Mean | 27 | 0.58 | 0.15 | 2.69 |
| | 95-conf | 0 | 0 | 0 | 0.12 |
| Notears | Mean | 51 | 0.69 | 0.09 | 575.97 |
| | 95-conf | 0 | 0 | 0 | 14.11 |
| MLP-Notears | Mean | 43.33 | 0.65 | 0.15 | 164.86 |
| | 95-conf | 4.77 | 0.03 | 0.02 | 12.73 |
The reported results show that our proposed method GLIDE significantly outperforms the other baselines, achieving the lowest SHD and spurious rate. When compared to the second-best baseline (SCORE), GLIDE substantially improves the TPR (0.833 vs. 0.49) while running roughly three times faster.
In Table 2, we note that for the small datasets, our approach returns the result within a minute. For the Pathfinder dataset, our approach finishes in minutes, and for the Munin dataset, it finishes within hours. We believe such running times are reasonable given the significant performance improvement over the baselines.
Furthermore, given that the time complexity of our approach scales quadratically with the number of variables, these results are justifiable and consistent with the reported figures in Sections 5.1 and 5.2. In addition, it is worth noting that our running time is also much faster than state-of-the-art methods such as DAS and SCORE, as reported in Section 5.1.
Reference:
Karen Sachs et al., Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308, 523-529 (2005). DOI:10.1126/science.1105809
Questions
Q1. Does the proposed causal discovery algorithm work for time-series data? What are the challenges?
A: In principle, our proposed causal discovery algorithm should work for (multivariate) time-series data, which is also underpinned by an (unknown) Bayesian Network reflecting its causal relationships. In practice, we foresee a (practical) challenge in terms of scalability. To elaborate, the size of the causal graph scales linearly in the number of time steps: the total number of nodes in the corresponding Bayesian Net is the number of variables multiplied by the number of time steps. As the complexity of our proposal is quadratic in the number of variables, it will grow quadratically in the number of time steps, which will cause scalability issues since the number of time steps can be very large. This could be an interesting and novel direction for a potential follow-up of our work. Thank you very much for this interesting suggestion.
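For illustration, the scaling argument can be summarized with generic symbols (n for the number of variables per time step and T for the number of time steps; these labels are ours and need not match the paper's notation):

```latex
% Unrolled time-series Bayesian Net: n variables per step, T steps.
\text{number of nodes} \;=\; n \times T
\qquad\Longrightarrow\qquad
\text{cost} \;=\; O\!\left((nT)^{2}\right) \;=\; O\!\left(n^{2}\,T^{2}\right)
```

i.e., for a fixed n, the cost grows quadratically in T.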
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
Thank you for your detailed review & constructive feedback. We have addressed all your concerns and questions below:
Concerns
C1. Some details seem to be missing in the paper. For example, i) Footnote 2 and Theorem 1 tell how to find non-parent sets, whereas how to set the threshold for the variance is not clear. Please give the details in the paper.
A: We would like to explain that there is no need to set a threshold for the variance. Footnote 2 and Theorem 1 suggest that when the candidate set is the true parent set, the variance will be zero, which means that among plausible candidates, the one that results in the smallest variance is the true parent set. As such, we can run the invariance test following the procedure outlined in Part B of Section 4.1 without having to set any thresholds.
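For illustration, the threshold-free selection can be sketched as follows. This is a minimal, hypothetical sketch rather than the paper's actual implementation: `sample_augmented_dataset` and `estimate_conditional` are placeholder callables standing in for the downsampling routine of Theorem 5 and for any estimator of the target's conditional distribution.

```python
import numpy as np

def invariance_score(data, target, candidate, sample_augmented_dataset,
                     estimate_conditional, num_augmentations=50):
    """Variance of the estimated conditional of `target` given `candidate`
    across randomly augmented downsamples (illustrative sketch)."""
    estimates = []
    for _ in range(num_augmentations):
        d_i = sample_augmented_dataset(data)          # downsample under a fresh source prior
        estimates.append(estimate_conditional(d_i, target, candidate))
    return float(np.var(np.stack(estimates), axis=0).sum())

def select_parent_set(data, target, plausible_candidates, **kwargs):
    # No explicit threshold: simply return the candidate with the smallest variance.
    return min(plausible_candidates,
               key=lambda c: invariance_score(data, target, c, **kwargs))
```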
ii) How to learn the different priors with the estimated m? Did the authors assume some distributions?
A: The augmentation is defined in the statement of Theorem 1: an augmented distribution is obtained by replacing the prior over the source variables with a new source prior, while keeping the conditional of the remaining variables given the source variables induced from the original distribution. The source prior is in turn sampled from a simplex over a space of categorical distributions, as defined in Theorem 6. Here, m is the number of samples. As discussed in the paragraph following Eq. (2), the invariance test becomes more accurate as m increases.
Furthermore, we want to explain that we do not learn or compute the augmented distribution explicitly, since the invariance test only requires drawing samples from it. This is possible without knowing its exact form, as detailed in Theorem 5, which shows that a downsample of the original dataset obtained via a particular downsampling procedure (see the 2nd sentence in the statement of Theorem 5) is statistically equivalent to a direct sample from the augmented distribution.
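For concreteness, the reading above can be written with generic notation (ours, not necessarily the paper's): with S the source variables, X the remaining variables, P the true distribution, and P' an augmentation,

```latex
P'(\mathbf{S}, \mathbf{X}) \;=\; P'(\mathbf{S}) \, P(\mathbf{X} \mid \mathbf{S})
```

so a sample following P' can be obtained by re-weighting (downsampling) the observed data rather than by estimating P' explicitly.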
C2. Theorem 2 provides a necessary condition to test whether a subset is the parent set of X. However, it is not a sufficient condition. Although the authors stated, "When m is infinitely large, the implication in Eq. (1) becomes bi-directional and definitively implies [the true parent set]", it is not clear why this implication holds. Please elaborate on it more.
A: We would like to explain that if the variance can be computed exactly via an exhaustive enumeration over all possible source prior augmentations, it is zero if and only if the candidate set is the true parent set. This is true since, on the true causal graph, changing the prior over the source variables obviously does not change the conditional of the effect given its parents. Since we only search for candidates in the space of subsets of the Markov blanket of the target variable, zero variance thus definitively implies the true parent set when the variance can be computed exactly, which requires an infinite number of sampled source prior augmentations (i.e., when m is infinitely large).
This paper proposes a new framework that leverages the invariance of the effect's distribution conditioned on its causes for causal discovery from observational data. The main idea is to disturb the p(cause) distribution and see whether p(effect|cause) changes after the disturbance.
Strengths
- This work leverages the invariance of the conditional distribution and proposes a downsampling method, combining the two to find the parent set.
Weaknesses
- The main issue is that, since real interventions are not applicable to observational data, the paper provides a downsampling technique to approximate p(effect|cause) after the disturbance, which, however, has no theoretical guarantee. That is, how can one guarantee that such a downsampled dataset correctly corresponds to the real distribution after the disturbance?
- Since the basis variables would include the leaf vertices, changing the prior of the basis variables will not affect the distribution of their ancestors in such a case, and thus it may not have a similar effect to changing the prior over the source variables.
Typos:
- "Theorem 2" -> "Theorem 4" in Theorem 5
After rebuttal:
After multiple rounds of discussion with the authors, my fundamental concerns remain unresolved. The core issue persists: the proposed approach fails to adequately address the challenge of obtaining interventional distributions from observational data.
As highlighted in the seminal works of Pearl [1] and [2], answering interventional questions such as "What would be the impact on the system if this variable were changed from value x to y?" requires explicit causal knowledge. This limitation is well established in the causal inference literature, as also discussed comprehensively in [3]. The authors' repeated attempts to address my concerns rely on mathematical manipulations and resampling techniques, which do not overcome the fundamental identification problem in causality.
[1] Pearl, J. (2009). Causality: Models, Reasoning, and Inference. Cambridge University Press.
[2] Brouillard, Philippe, et al. "Differentiable causal discovery from interventional data." Advances in Neural Information Processing Systems 33 (2020): 21865-21877.
[3] Bareinboim, Elias, et al. "On Pearl's hierarchy and the foundations of causal inference." Probabilistic and Causal Inference: The Works of Judea Pearl. 2022. 507-556.
Questions
See the weakness above.
Thank you for your detailed review & constructive feedback.
We have addressed all your concerns below:
C1. The main issue is that, since real interventions are not applicable to observational data, the paper provides a downsampling technique to approximate P(effect|cause) after the disturbance, which, however, has no theoretical guarantee. That is, how can one guarantee that such a downsampled dataset correctly corresponds to the real distribution after the disturbance?
A: Theorem 5 guarantees that the downsampled dataset from the original dataset following a particular downsampling procedure (see the 2nd sentence of Theorem 5's statement) is statistically equivalent to a direct sample of the real distribution after the disturbance (as defined in the 2nd sentence of Theorem 1). We hope this has addressed the reviewer's concern.
In addition, we would also like to emphasize that, in general, there is no guarantee of certainty for causal discovery without intervention. In fact, even with interventions, causal discovery is still not perfect due to the reliance on existing statistical independence tests (which are highly accurate but certainly not perfect). So, we believe the main focus should be on the actual empirical performance rather than on theoretical guarantees. We hope the reviewer would reconsider this point.
C2. Since the basis variables would include the leaf vertices, in such a case, changing the prior basis variables will not affect the distribution of their ancestors, and thus it may not have a similar effect to changing prior over source variables.
A: It is unlikely that the basis would include a leaf node because our selection method, as stated in Theorem 3, always prioritizes selecting nodes with the smallest dependent set (according to d-separation) first. Lemma 2 (in Appendix B1) shows that such nodes are either a source node or dependent on exactly one source node. Such nodes are unlikely to be leaf nodes except in some artificial graph structures. For clarity, we refer the reviewer to an intuitive example in Appendix B1 where we demonstrate step by step the procedure of basis selection based on the causal graph of the dataset ASIA.
Furthermore, it is also fine if the basis includes some leaf nodes as long as it also includes the source nodes. Following the above argument, it is even more unlikely to arrive at a basis that contains only leaf nodes; this does not happen in any of the widely used benchmark datasets in our experiments. As a matter of fact, our numerous experiments consistently show a significant improvement over SOTA baselines across a wide variety of data scenarios and graph topologies. We hope the reviewer would reconsider this point given the above and the fact that, in general, there is no perfect guarantee for causal discovery without intervention.
C3. Typos: "Theorem 2" -> "Theorem 4" in Theorem 5
A: Thank you for helping us catch the typo. We will fix this in our revised version.
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
This paper introduces a novel approach to causal learning through a new invariance test for causality, which underpins a reliable and scalable algorithm for reconstructing causal graphs from observational data. This method leverages a core insight that the conditional distribution of the effect given the cause remains invariant under changes in the prior distribution of the cause. This insight enables a parent-identification process for each variable using synthetic data augmentation. This process is integrated with an efficient search algorithm that utilizes prior knowledge of each effect variable’s Markov blanket, along with the empirically observed sparsity of causal graphs, to significantly reduce computational complexity.
Strengths
- The proposed method is rather novel.
- Overall, the paper is well-structured and clearly written.
- The experiments are extensive, covering 3 types of functional causal models, 6 causal discovery baseline methods, and varying graph sizes.
Weaknesses
Any thoughts on how to extend your method to handle heterogeneous or time-series datasets?
Questions
(See above)
Thank you for recognizing our contributions.
We have addressed all your questions below:
Q1. Any thoughts on how to extend your method to handle heterogeneous or time-series datasets?
A: In principle, our proposed causal discovery algorithm should work for (multivariate) time-series data, which is also underpinned by an (unknown) Bayesian Network reflecting its causal relationships. In practice, we foresee a (practical) challenge in terms of scalability. To elaborate, the size of the causal graph scales linearly in the number of time steps: the total number of nodes in the corresponding Bayesian Net is the number of variables multiplied by the number of time steps. As the complexity of our proposal is quadratic in the number of variables, it will grow quadratically in the number of time steps, which will cause scalability issues since the number of time steps can be very large. Thus, the key challenge in extending this to time-series data is scalability.
Our proposed method can also handle multiple heterogeneous datasets sharing the same causal graph (albeit with different sets of effect-cause conditional distributions). In this case, the invariance test can be slightly repurposed to simply find the plausible parent set with minimal variance (see Eq. (1)) across those datasets. We have in fact conducted a preliminary study of this setting in Appendix C.2.5.
Thank you for these suggestions. We think these are novel directions for potential follow-up of our current work.
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
I appreciated the responses from the authors, which solved my concerns. Therefore, I would like to maintain my positive score.
The paper utilizes the invariance of p(effect|cause) under different p(cause) to find the causal parents. For this purpose, the authors choose some source priors, generate corresponding datasets, and verify the invariance. Finally, they show their performance on a variety of synthetic and real-world setups.
Strengths
The paper is well written. The figure is helpful for understanding the algorithm. The experiments are extensive.
Weaknesses
Minor weaknesses:
- The authors mentioned “finding the maximum clique in an augmented bidirectional graph” multiple times but without a proper definition or example/visualization.
- The source variables should be defined in a little more detail.
- What does the notation in equation 2 refer to? It should be made precise.
- “The intuition is if we can re-sample from such that”: this is a little unclear. How are the re-sampled dataset and the original dataset different?
- It is unclear how the augmented datasets are sampled. How are the source priors obtained? Although these are discussed later, some hints/intuitive discussion should be provided earlier in the paper.
- “We cannot compute , … we can re-sample from so that ” – based on my understanding, the first case is computing the numerical probability table, and the second case is sampling without any such table. This difference should be made clear.
- More details on "downsampling without replacement" are needed.
- An intuitive explanation of the “minimal downsampled rate” is required.
Major weaknesses
- Suppose a variable is not a parent but an ancestor. Shouldn’t we also get variance = 0 (equation 1) in such cases? Does a change in P(Ancestor) affect P(Descendant | Ancestor)?
- Many important concepts are delayed until section 4.2. The authors should consider introducing them earlier in the paper.
I will consider increasing the score after seeing the authors' response and the reviewer discussion.
Questions
Questions:
- How are the authors resampling the datasets?
- Do you have to perform this invariance test for all possible parent sets?
- Why is the parent set empty in the definition of the source-variable set (section 4.1)? What does that imply?
- In practice with real-world data, is the variance always zero for all true parents (equation 1)? Why or why not? Should a threshold be used?
- How expensive is it to compute ? Do we have to iterate over all ? And do it again after performing step ii in Theorem 3?
Q3. Why is the parent set empty in the definition of the source-variable set (section 4.1)? What does that imply?
A: In section 4.1, this set denotes the source variables, which, by definition, are variables that have no parents. Hence, the parent set is empty for every source variable.
Q4. In practice with real-world data, is the variance always zero for all true parents (equation 1)? Why or why not? Should a threshold be used?
A: The variance is theoretically zero if we can enumerate over the possibly infinite set of source prior augmentations to compute it exactly. In practice, due to the limited amount of data and computation time, we only sample a large but finite number of augmentations. Our estimate will approach zero as the number of sampled augmentations increases -- see the discussion following Eq. (2).
As shown in Section 5 and Appendix C.2.1, the larger this number gets, the more effective our method becomes.
As for the threshold, there is no need to set it explicitly since we will always select the plausible parent set with the smallest variance according to the practical test in Part B of Section 4.1.
Q5. How expensive is it to compute ? Do we have to iterate over all ? And do it again after performing step ii in Theorem 3?
A: The cost of this computation is that of checking whether each pair of variables is independent using the Fisher test. As there are quadratically many such pairs in the number of variables, the total cost is quadratic as stated above. This quantity does not change after performing step ii in Theorem 3, so it does not need to be computed again. Hence, this is a one-time overhead that is not expensive.
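As an illustrative sketch of such a one-time pairwise screening (using a standard Fisher z-transform of the sample correlation; this is a generic stand-in, not necessarily the exact test used in the paper):

```python
import numpy as np
from scipy.stats import norm

def pairwise_fisher_dependence(data, alpha=0.05):
    """data: (num_samples, num_vars) array.
    Returns a boolean matrix whose (i, j) entry is True if variables i and j
    are judged (marginally) dependent by a Fisher z-test."""
    n_samples, n_vars = data.shape
    corr = np.corrcoef(data, rowvar=False)
    dependent = np.zeros((n_vars, n_vars), dtype=bool)
    for i in range(n_vars):
        for j in range(i + 1, n_vars):                 # O(n^2) pairs in total
            r = np.clip(corr[i, j], -0.999999, 0.999999)
            z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n_samples - 3)
            p_value = 2 * (1 - norm.cdf(abs(z)))
            dependent[i, j] = dependent[j, i] = p_value < alpha
    return dependent
```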
If our responses have sufficiently addressed your questions/concerns, we hope you would consider increasing the overall rating of our paper. Otherwise, please let us know if you have any follow-up questions for us.
Mn5. It is unclear how the augmented datasets are sampled. How are the source priors obtained? Although these are discussed later, some hints/intuitive discussion should be provided earlier in the paper.
A: The augmented datasets are sampled using the procedure described in the statement of Theorem 5. This procedure requires the optimal downsampling rate, which is computed from a given choice of the augmentation as detailed in Theorem 4. The source prior is obtained via the procedure stated in Theorem 6 of Section 4.2.3. The intuition behind this procedure (i.e., the criteria and selection methods) is mentioned in the first two paragraphs of Section 4.2.3.
Mn6. “We cannot compute , … we can re-sample from so that ” – based on my understanding, the first case is computing the numerical probability table, and the second case is sampling without any such table. This difference should be made clear.
A: As explained in Mn4 above, while we can choose a variation of the source prior augmentation, we still cannot draw direct samples from the augmented distribution because it is defined via the conditional induced from the (unknown) true distribution. Thus, drawing samples from the augmented distribution (the second case) to run our invariance test has to be achieved via re-sampling from the original dataset (the first case) using a specific downsampling routine stated in Theorem 5, which is guaranteed to be statistically equivalent to a direct sample from the augmented distribution.
Mn7. More details on "downsampling without replacement" are needed.
A: This is described in the 2nd sentence of Theorem 5. To elaborate, we first determine the size of the downsample using Theorem 4 -- see Eq. (4). This allows us to determine, for each candidate value of the source variables, how many samples we need to draw (without replacement) from the original dataset. Those samples, across all observed candidate values, constitute the downsampled dataset. We will include this elaboration in the revised version of our paper.
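A minimal sketch of this kind of stratified downsampling without replacement is given below. It assumes a pandas DataFrame, a list `source_cols` of source-variable columns, a dictionary `new_source_prior` mapping observed source-value tuples to probabilities, and a total downsample size `m` derived from the rate in Eq. (4); all of these names are placeholders for illustration only.

```python
import numpy as np
import pandas as pd

def downsample_without_replacement(df, source_cols, new_source_prior, m, rng=None):
    """For each observed value s of the source variables, draw about m * P'(s)
    rows without replacement; the union of the strata forms the downsample."""
    rng = rng or np.random.default_rng()
    strata = []
    for s, group in df.groupby(source_cols):
        k = min(int(round(m * new_source_prior.get(s, 0.0))), len(group))
        idx = rng.choice(group.index, size=k, replace=False)
        strata.append(df.loc[idx])
    return pd.concat(strata) if strata else df.iloc[0:0]
```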
Mn8. An intuitive explanation of the “minimal downsampled rate” is required.
A: The intuition behind the minimal downsampled rate is mentioned in the first two paragraphs of Section 4.2.2. To further elaborate, our invariance test requires drawing samples D_i from the augmented distribution of the true oracle. As this cannot be done directly (as explained above), we achieve it by drawing each D_i as a downsample of the observed dataset D, which is supposed to preserve D's evidence of all (hidden) cause-effect relationships.
As such, D_i cannot contain duplicated data points from D because otherwise it might (statistically) emphasize (incorrect) cause-effect relationships that were not implied by D. Thus, D_i must be a downsample. But a downsample might lose information, so we want to downsample as little as possible; the minimal downsampled rate stipulates the maximum size of D_i at which we can still guarantee that it follows the augmented distribution. This rate is computable via Theorem 4 -- see Eq. (4).
Questions
Q1. How are the authors resampling the datasets?
A: To resample from the augmented distribution, we use the procedure described in the statement of Theorem 5. To elaborate, we determine the size of the downsample via Theorem 4. This allows us to determine, for each candidate value of the source variables, how many samples we need to draw (without replacement) from the original dataset. Those samples, across all observed candidate values, constitute the downsampled dataset D_i. Theorem 5 guarantees that such a D_i is statistically equivalent to a direct sample from the augmented distribution, which is what we need to run the invariance test in Part B, Section 4.1.
Q2. Do you have to perform this invariance test for all possible parent sets?
A: No, we only need to perform the invariance test for the maximal plausible sets, whose number is bounded in terms of the degenerate constant p (see Definition 4) of the augmented bidirectional graph in Theorem 7. Empirically, we find that p is small across all experimented datasets; hence, the number of maximal plausible sets is essentially a small constant. This is explicitly discussed in the paragraph following Theorem 7.
Thank you for your detailed review & constructive feedback. We have addressed all your concerns and questions below:
Major Concerns
Mj1. Suppose a variable is not a parent but an ancestor. Shouldn’t we also get variance = 0 (equation 1) in such cases? Does a change in P(Ancestor) affect P(Descendant | Ancestor)?
A: It is true that a change in P(Ancestor) will not cause a change in P(Descendant | Ancestor). This is in fact a stronger version of Theorem 1, which states that if the variance of the conditional over the (random) choice of source prior augmentation is positive, then the candidate set cannot consist of ancestors of the target and, consequently, cannot be its parent set. Theorem 1 as stated in the current manuscript remains correct.
The reason we did not use this stronger version of Theorem 1 to develop the causal test is that doing so would require an exhaustive enumeration over the space of all sets of ancestor candidates, which would be too costly.
Instead, Theorem 1 only requires enumerating over all possible sets of parent candidates of each node. This can be done effectively by utilizing existing routines to (approximately) find the Markov blanket (Edera et al., 2014), which is in turn leveraged to find the most plausible candidates for the parent set -- see Section 4.3. The test in Section 4.1, Part B -- a consequence of Theorem 1 -- can then be used to find the parent set among these plausible candidates.
As the entire causal structure can be recovered if we are able to recover the parent set of each node, Theorem 1 suffices, and we do not need its stronger version above which will otherwise induce a much more expensive test.
We hope this clarifies that this point is not a weakness of our proposal because our theorem and method remain correct.
Mj2. Many important concepts are delayed until section 4.2. The authors should consider introducing them earlier in the paper.
A: Thank you for the suggestion. We commit to moving the most important concepts up as suggested in our revised version.
Minor Concerns
Mn1. The authors mentioned “finding the maximum clique in an augmented bidirectional graph” multiple times but without a proper definition or example/visualization.
A: Theorem 7 establishes that each plausible parent set of a node corresponds to a maximal clique in a bidirectional graph defined constructively in the first sentence of Theorem 7's statement. Here, a clique is defined to be a set of nodes in a bidirectional or undirected graph such that there is a (bidirectional or undirected) edge between any two of them. Such cliques can be found efficiently via a DFS-based Bron-Kerbosch algorithm (see the paragraph following Theorem 7). Due to limited space in the main text, an example with visualization has been deferred to Appendix B.2 (see Fig. 6). We will refer to this figure explicitly in the revised version for better clarity. Thank you for the suggestion.
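For readers unfamiliar with the routine, a standard (non-pivoting) Bron-Kerbosch enumeration of maximal cliques over an undirected adjacency structure looks roughly as follows; this is a generic illustration and does not reproduce the paper's construction of the augmented bidirectional graph.

```python
def bron_kerbosch(adj, r=None, p=None, x=None):
    """Yield all maximal cliques of an undirected graph.
    adj: dict mapping each node to the set of its neighbours."""
    if r is None:
        r, p, x = set(), set(adj), set()
    if not p and not x:
        yield set(r)
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p.remove(v)
        x.add(v)

# Example: two triangles sharing the edge (2, 3).
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
print(list(bron_kerbosch(adj)))   # [{1, 2, 3}, {2, 3, 4}]
```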
Mn2. The source variables should be defined in a little more detail.
A: As defined in Section 4.2.1 (1st line), a source variable is a variable that has no parent. We will promote this in-text statement to an explicit definition in the revised version. Thank you for the suggestion.
Mn3. What does the notation in equation 2 refer to? It should be made precise.
A: We apologize for the typo: the symbol should appear without the prime ('). It denotes the induced conditional as defined in the second sentence of Theorem 1's statement. This stray prime (') appears once in Eq. (2) and once in Eq. (3), and we will correct it in the revised version, which will be uploaded soon.
Mn4. “The intuition is if we can re-sample from such that”: this is a little unclear. How are the re-sampled dataset and the original dataset different?
A: They are different: D_i is a downsample of D (i.e., D_i is a subset of D) such that D_i follows an augmented distribution of the data. Since this augmented distribution is a function of the (unknown) true distribution -- see the 2nd sentence of Theorem 1 -- we cannot sample from it directly to run our invariance test (see Part B of Section 4.1, which is a consequence of Theorem 1). To overcome this difficulty, we have derived a downsampling procedure which provably guarantees that the sampled D_i indeed follows the desired augmented distribution -- see Theorem 5, which explicitly describes this downsampling procedure.
The authors propose a scheme for learning causal graphs without relying on strict interventions, instead using distributional invariances. The idea is interesting, but the reviewers had a hard time following the details, and several gaps in the presentation were raised. The overall idea is quite complicated, and it would be in the authors' interest to simplify the presentation and focus on the core novel ideas in a future re-submission.
Additional Comments from Reviewer Discussion
There was extensive discussion between the reviewers and authors.
Reject