PaperHub
Rating: 6.0 / 10 · Poster · 4 reviewers
Scores: 6, 5, 6, 7 (lowest 5, highest 7, standard deviation 0.7)
Confidence: 3.5 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8
NeurIPS 2024

Hybrid Top-Down Global Causal Discovery with Local Search for Linear and Nonlinear Additive Noise Models

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-14
TL;DR

We present a novel hybrid method for causal discovery that leverages local causal relationships in SEMs to construct a compact hierarchical topological sort, followed by a novel nonparametric constraint-based method for efficient edge discovery.

Abstract

Keywords

global causal discovery · additive noise model · local structure

Reviews and Discussion

Review (Rating: 6)

This paper studies the structure learning problem for additive noise models (ANMs) in both linear and nonlinear settings. It proposes a hybrid constraint-based approach that learns the DAG by leveraging local ancestral relationships. The algorithm consists of two steps: ordering search and edge discovery. Correctness is shown, and simulations are conducted to compare with other approaches.

Strengths

  • Though ANMs have long been known to be identifiable, e.g. by RESIT, the high computational complexity and the hardness of nonparametric regression and CI tests stand as roadblocks. The finer analysis and exploitation of local structure in the proposal show potential to tackle this task efficiently;
  • The introduction of the proposed method is well written and easy to follow for researchers working in the relevant area.

Weaknesses

  • The main contribution of this work is the exploitation of local structure to reduce the number of nonparametric regressions and CI tests. However, despite the quick discussion below Theorems 3.7 and 4.5, there is no explicit and formal statement emphasizing this contribution, nor a comparison with other methods, e.g. RESIT.
  • The experiments are preliminary. More setups should be considered to demonstrate the superiority of the proposal: e.g. different graph types like scale-free graphs, different numbers of edges, different noise, recovery criteria like F1 for the linear setting, and more benchmarks like CAM, GSGES, etc.
  • See Questions.

Questions

  • Given that runtime is the main contribution of the paper, why are the runtime results in the appendix? It is also suspicious that the runtime for the linear case is slower than the benchmarks in Figure 7, and that the runtime for d=12 is faster than for d=8 in Figure 8. There does not seem to be a significant improvement empirically;
  • Since the number of nonparametric regressions and CI tests is theoretically reduced, is it possible to establish some statistical guarantee and a sample complexity dependence on the sparsity, e.g. the in-degree?

Limitations

No limitation is discussed in the paper.

Author Response

We thank the Reviewer for their suggestions on how we might better emphasize the contributions of our work. We respond to Weakness 2 in the general response, adding to our preliminary experiments by comparing our algorithms against many state-of-the-art baselines (CAM, NoGAM, GES, GRaSP, GSP, etc.). We find that both our nonlinear topological sorting algorithm NHTS and our edge pruning algorithm ED outperform baselines across all settings, consistently achieving the highest median $A_{top}$ and $F_1$. These strong empirical results provide convincing support for the superiority of our proposal. We give further clarification on all other comments below (following the order of the review).

Formalizing Computational Complexity Contribution

We thank the Reviewer for their suggestion: we now include a formal analysis comparing the number of high-dimensional nonparametric regressions run by our method with baselines such as RESIT and NoGAM in the worst case, i.e., when the DAG is fully connected. In particular, we formally show that NHTS runs significantly fewer high-dimensional nonparametric regressions to identify the correct topological ordering. We have included the following Theorem in Section 4 (with proof in the appendix) of our paper:

"Theorem 4.6: Consider a fully connected DAG G=(V,E)G=(V,E) with nonlinear ANM. Let d:=Vd:= |V|. Let nkNHTSn^\mathrm{NHTS}_k be the number of nonparametric regressions with covariate set size k[d2]k\in[d-2] run by NHTS when sorting VV; we similarly define nkRESITn^\mathrm{RESIT}_k and nkNoGAMn^\mathrm{NoGAM}_k respectively. Then, nkNHTS=dkn^\mathrm{NHTS}_k = d - k, and nkRESIT=nkNoGAM=k+1 n^\mathrm{RESIT}_k = n^\mathrm{NoGAM}_k = k+1. This implies that for all k>d2k > \frac{d}{2}, nkNHTS<nkRESIT=nkNoGAMn^\mathrm{NHTS}_k<n^\mathrm{RESIT}_k = n^\mathrm{NoGAM}_k."

We would like to note that it is difficult to analyze the reduction in computational complexity in the general case, as NHTS is sensitive to the specific local graph structure of the underlying data generating process. We leave such a characterization to future work: for now, we demonstrate our method's improved runtime and increased sample efficiency in new experimental results; NHTS ran $9\times$ faster than NoGAM with higher median accuracy on a set of 20 dense ER4 graphs (see PDF attached to the general response).

Below, we provide a proof sketch to Theorem 4.6:

RESIT [1] and NoGAM [2] both identify leaf vertices in an iterative fashion, regressing each unsorted vertex on the rest of the unsorted vertices; RESIT tests the residual for independence with the covariate set, while NoGAM uses the residual for score matching. The number of regressions run by both methods in each step therefore equals one plus the covariate set size: when the covariate set size is $k$, $k + 1$ regressions are run.

In the case of a fully connected DAG, the first stage of NHTS only runs pairwise regressions with empty conditioning sets. After the first stage, NHTS regresses each unsorted vertex onto all sorted vertices, finding vertices with independent residuals. The number of regressions run therefore equals $d$ minus the covariate set size: when the covariate set size is $k$, $d - k$ regressions are run.
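To make the counting concrete, here is a minimal Python sketch (ours, for illustration; it only tabulates the closed-form counts stated in Theorem 4.6, not the algorithms themselves):

```python
# Tabulate the per-step regression counts from Theorem 4.6 for a fully
# connected DAG on d vertices: NHTS runs d - k regressions at covariate
# set size k, while RESIT/NoGAM run k + 1.
def regression_counts(d):
    for k in range(1, d - 1):  # k in [d-2]
        n_nhts, n_resit = d - k, k + 1
        flag = "NHTS cheaper" if n_nhts < n_resit else ""
        print(f"k={k:2d}  NHTS={n_nhts:2d}  RESIT/NoGAM={n_resit:2d}  {flag}")

regression_counts(10)
# NHTS wins exactly at the large covariate sets (k > d/2), which are the
# expensive high-dimensional regressions emphasized in the rebuttal.
```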

F1 Score

We thank the Reviewer for the suggestion. However, we respectfully point out that the Precision subcomponent of the F1 score requires a well-defined notion of false positives and false negatives: it is unclear how to translate these definitions into the setting of topological sorts. We instead follow a stream of causal discovery papers [1, 2, 3] that validate their methods using topological divergence measures.

Runtime Results

While the runtime results of our algorithms are an important contribution, our work focuses on achieving greater sample efficiency by leveraging local causal substructures to obtain smaller conditioning sets and run fewer high-dimensional nonparametric regressions, as demonstrated in our new experimental results (see PDF attached to general response). Given the limited space in the manuscript at the time of submission, we chose to use the main text to highlight gains in accuracy demonstrated in preliminary experiments. Overall, our methods were faster than some baselines, while slower than others; we believe that the mentioned statistical anomalies were a result of the chance occurrence of extremely dense DAGs in some trials. We will make these points clear in the additional page afforded to camera-ready submissions.

Formal Sample Complexity Result

We thank the Reviewer for pointing out the need for statistical guarantees of sample complexity dependent on sparsity, and agree that this is a crucial next step. Previous works [4] have successfully demonstrated such statistical guarantees of sample complexity under the assumption of a nonlinear ANM with Gaussian noise, utilizing concentration inequalities for Gaussian random matrices. However, extending these results to nonparametric and non-Gaussian models requires the development of an entirely new set of analytical tools for non-Gaussian matrices, which is beyond the scope of the present paper. We will address this challenge in future research.

Citations

  • [1] Peters et al., Causal Discovery with Continuous Additive Noise Models (2014).
  • [2] Montagna et al., Causal Discovery with Score Matching on Additive Models with Arbitrary Noise (2023).
  • [3] Rolland et al., Score Matching Enables Causal Discovery of Nonlinear Additive Noise Models (2022).
  • [4] Zhu et al., Sample Complexity Bounds for Score-Matching: Causal Discovery and Generative Modeling (2023).
Comment

Thank you for the detailed response, which addressed most of my concerns. I would like to retain my score.

Comment

We thank the Reviewer for their reply. We are happy to further clarify any concerns that the Reviewer believes were not adequately addressed in our initial response.

Review (Rating: 5)

In this paper, the authors present a causal discovery method that first determines the order of the causal variables and then determines the existence of edges between any two variables. The experimental results demonstrate the superiority of the proposed method compared to relevant methods.

Strengths

Edit: I thank the authors for their detailed clarifications, which address most of my concerns. I increase my score to 5.


Despite the simplicity of the theoretical results, the idea and the method are interesting and somewhat novel.

The theoretical results seem sensible.

Weaknesses

  1. Lack of necessary discussion: I think there are similar ideas in the literature, such as [1], where an order over the variables is likewise maintained. What is the advantage of this method over [1]? The proposed method should be compared to [1] as well. [1] L. Solus, Y. Wang, C. Uhler, and L. Matejovicova. Consistency guarantees for permutation-based causal inference algorithms. ArXiv preprint arXiv:1702.03530 (2017)

  2. Lemma 4.1 is confusing. In the condition, it is required that $x_i$ is one of the parents of $x_j$. Why is it possible that $x_i$ and $x_j$ are not causally related?

typo: Line 202: the

Questions

Could the authors further elaborate on Lines 205-207? It is not quite clear to me why Alg. 1 cannot be used for the non-linear case.

I am happy to adjust my score according to the authors' rebuttal.

Limitations

No.

Author Response

We thank the Reviewer for bringing [1] to our attention. We respond to Question 1 in the general response. We provide clarification on Weakness 1 and 2 below.

Necessary Discussion of Related Literature

Methods in [1] tackle a different setting than our paper: they obtain an MEC under the Sparsest Markov Representation assumption, while our methods obtain a unique DAG under the assumption of an ANM. The main contributions of our paper are twofold: 1) our methods scale to high dimensional datasets by directly building a valid topological sort, rather than searching across a large permutation space, 2) our methods improve sample complexity by using smaller, local conditioning sets for edge identification. We clarify further below.

Methods in [1] belong to a literature of "permutation-based" causal discovery algorithms that search over the space of permutations to find a DAG that achieves an optimal score. These methods build upon the original Sparsest Permutation (SP) algorithm [2], which follows a two-step procedure. In the first step, SP constructs a DAG $G_{\pi} = (V, E)$ for each permutation $\pi$, where $\pi(i) \rightarrow \pi(j) \in E \iff i < j$ and $\pi(i)$ is not independent of $\pi(j)$ conditioned on $\{\pi(k)\}$ for all $k < j, k \neq i$. In the second step, SP outputs the set of $G_{\pi}$ that obtain the minimal number of edges, which corresponds to the MEC. SP exploits the fact that all DAGs in the MEC have the same skeleton structure, implying that all valid permutations $\pi_v$ produce DAGs $G_{\pi_v}$ that achieve the minimal number of edges; permutations $\pi_i$ not in the MEC are guaranteed to produce $G_{\pi_i}$ with strictly larger edge counts [2]. Unfortunately, SP requires a search over all $d!$ permutations to obtain the MEC; the key contribution of [1] is to introduce a consistent Greedy Sparsest Permutation algorithm (GSP) that uses DFS to efficiently explore the space of permutations, again returning the set of $G_{\pi}$ with minimal edge count (the MEC).
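For readers unfamiliar with SP, here is a minimal sketch of the two-step procedure just described, assuming an oracle conditional-independence test `ci_test` (a hypothetical callable; not part of [1] or [2]):

```python
from itertools import permutations

def sparsest_permutation(vertices, ci_test):
    # Step 1: for each permutation pi, build G_pi with edge pi[i] -> pi[j]
    # (i < j) iff pi[i] and pi[j] are dependent given all earlier variables.
    # Step 2: return the graphs attaining the minimal edge count (the MEC).
    best, minimal = None, []
    for pi in permutations(vertices):  # d! permutations: the bottleneck
        edges = set()
        for j in range(len(pi)):
            for i in range(j):
                cond = frozenset(pi[k] for k in range(j) if k != i)
                if not ci_test(pi[i], pi[j], cond):
                    edges.add((pi[i], pi[j]))
        if best is None or len(edges) < best:
            best, minimal = len(edges), [edges]
        elif len(edges) == best:
            minimal.append(edges)
    return minimal
```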

While GSP improves on SP, it suffers from 1) the need to use heuristics to bound runtime when run on high-dimensional DAGs, and 2) a loss in sample efficiency as conditioning sets grow large. The authors of [1] clarify that it is often necessary in practice to "bound the search depth $d$ and number of runs $r$ allowed before the algorithm terminates", reducing the accuracy of GSP as the permutation search space grows. In contrast, our ordering methods do not rely on a heuristic for high-dimensional discovery: they directly build the correct topological sort, vertex by vertex, entirely avoiding a search procedure. Additionally, when computing each $G_{\pi}$, to determine whether the edge $\pi(i) \rightarrow \pi(j)$ exists, GSP runs a CI test that conditions on all vertices prior to $\pi(j)$ in $\pi$; in contrast, our edge pruning algorithm ED is sensitive to the sparsity of the underlying DAG, identifying $\pi(i) \rightarrow \pi(j)$ with conditioning sets bounded above by $|\text{Pa}(\pi(i))| + |\text{Pa}(\pi(j))|$ (see Step 2 of ED). By leveraging local conditioning sets, ED avoids the sample complexity issues suffered by GSP [3].

To accommodate the Reviewer's suggestion, we have inserted the following two paragraphs at the beginning of our Related Work section:

"Our work is related to two kinds of methods that explicitly leverage the topological structure of DAGs in their discovery procedures: 1) permutation-based approaches and 2) functional causal model-based approaches.

Recent permutation-based approaches, such as SP [2], GSP [1], and GRaSP [4], constitute a family of score-based approaches that utilize permutations for efficient discovery of a MEC. SP searches over the space of variable orderings to find permutations that induce DAGs with minimal edge counts. GSP introduces greedy variants of SP that maintain asymptotic consistency; GRaSP relaxes the assumptions of prior methods to obtain improvements in accuracy. These methods highlight the importance of using permutations for efficient causal discovery, but suffer from poor sample efficiency in high dimensional settings [5]."

We have also included various permutation-based algorithms (GSP, GRaSP) as baselines in our new experimental results (see PDF in global response), in order to better evaluate the efficacy of our methods against the discovery literature. We find that our ordering algorithm NHTS systematically outperforms permutation-based methods.

Lemma 4.1

We thank the Reviewer for catching a typo in line 217: it is possible that $x_i, x_j$ are not causally related. The sentence now reads "... where $x_i$ is one of the potential parents of $x_j$ ...".

Citations

  • [1] Solus et al., Consistency guarantees for permutation-based causal inference algorithms (2017).
  • [2] Raskutti et al., Learning directed acyclic graphs based on sparsest permutations (2013).
  • [3] Lu et al., Improving Causal Discovery By Optimal Bayesian Network Learning (2021).
  • [4] Lam et al., Greedy Relaxations of the Sparsest Permutation Algorithm (2022).
  • [5] Niu et al., Comprehensive Review and Empirical Evaluation of Causal Discovery Algorithms for Numerical Data (2024).
Comment

Thank you for your detailed clarifications, which address most of my concerns. I increase my score to 5.

Comment

We thank the Reviewer for their reply. We are happy to further clarify any concerns that the Reviewer believes were not adequately addressed in our initial response.

Review (Rating: 6)

The paper presents theoretical results about extensions of the partial order induced by a causal DAG and uses these results to propose new constraint-based algorithms for ANMs.

Edit: increased rating from 3 to 5, soundness from 1 to 2, and contribution from 2 to 3.

Edit 2: increased rating from 5 to 6, after the authors fixed the $A_{top}$ calculation; solid paper, but I think the impact of a hybrid causal discovery method like this is limited

Strengths

  • Takes a simple idea (which seems original but also somewhat obvious from an order theory perspective) and turns it (creatively, originally) into causal discovery algorithms with consistency guarantees, broad applicability (ANMs), and good identifiability (a specific DAG instead of an MEC)
  • Very clearly written, as far as grammar, organization, and motivation go (but, importantly, not mathematical notation)
  • Based on the theoretical results, the algorithms have potential to be very significant to the field of causal discovery

Weaknesses

  1. The main (and fatal) weakness is the claims of strong performance in the abstract combined with the inadequate experimental results:
    1. the abstract makes a claim of "achieving greater accuracy than current methods", but the limited experiments only compare on simulated data (rather than real) against a few closely related algorithms (as opposed to a selection of classic or state-of-the-art methods, such as PC or GRaSP), in settings the authors have already explained are challenging for existing algorithms (very sparse DAGs, rather than a range of sparsities); and even then the proposed algorithm doesn't seem to do especially well. It also seems the NHTS algorithm is missing from the experiments.
  2. A smaller but nonetheless important weakness is notation that contradicts mathematical conventions, making the writing unnecessarily difficult:
    1. consulting introductory texts on partial orders and order theory would help clear up some of the confusion. For example, a topological sort is conventionally a linear extension of a partial order, making the introduced terms "linear topological sort" and "hierarchical topological sort" confusing. Replacing the former introduced term with just "topological sort", "linear order", or "total order", and replacing the latter introduced term with something that more clearly indicates it is 'between' a partial order and a linear order (i.e., it extends the partial order, but not completely into linear order), would be more natural/conventional and easier to understand.
    2. the authors seem to use \mapsto to indicate the domain and image of the ordering functions, but \mapsto conventionally denotes how a specific element in the domain is mapped to a specific element of the image, hindering precise and easy comprehension.
    3. other notation in Definition 2.1, such as the inconsistent/unexplained indexing of $\pi$, makes the definition harder to understand / not rigorous
    4. it's unclear what the difference is between $x_j \dashrightarrow x_i$ (called a directed path) and $x_j \dashrightarrow \ldots \dashrightarrow x_k$ (called a front door path).

Questions

  1. Aren't there just $\binom{d}{2}$ (i.e., the number of entries above the diagonal of a corresponding adjacency matrix) possible edges in a DAG for a given linear order, rather than the $d^2$ claimed on line 303?
  2. Suggestion: Include more explicit theorem statements and proofs for the complexity results.

Limitations

The authors have adequately described what assumptions their methods require (and hence to which settings the results are limited).

Author Response

We thank the Reviewer for their detailed comments on how we might improve our notation, and better support our theoretical results. We respond to Weakness 1 in the general response, providing many new experiments that compare our algorithms against many state-of-the-art baselines (CAM, NoGAM, GES, GRaSP, GSP) across a range of sparsities. We find that both our nonlinear topological sorting algorithm NHTS and our edge pruning algorithm ED outperform baselines across all settings, consistently achieving the highest median $A_{top}$ and $F_1$. These strong empirical results provide convincing support for the gains in sample efficiency promised by our theoretical results. We reply to all other comments below (following the order of the review).

Unclear Mathematical Notation

We thank the Reviewer for identifying unclear notation; we are committed to improving the accessibility of our work, as this allows for communication between disparate academic communities. We have replaced Definition 2.1 in the main text with the following definitions, incorporating the Reviewer's subpoints 1, 2, and 3:

  • "Definition 2.1: Consider a given DAG G=(V,E)G = (V,E). A topological sort (linear order) is a mapping π:V\pi: V \rightarrow {0,1,,V10,1,\ldots, |V|-1}, such that if xiPa(xj)x_i \in \text{Pa}(x_j), then xix_i appears before xjx_j in the sort: π(xi)<π(xj)\pi(x_i) < \pi(x_j)."
  • "Definition 2.2: A layered sort (between a partial and linear order) is a mapping πL:V\pi_L: V \rightarrow {0,1,,V10,1,\ldots, |V|-1}, such that if Pa(xi)=\text{Pa}(x_i)=\emptyset, then πL(xi)=0\pi_L(x_i)=0, and if Pa(xi)\text{Pa}(x_i)\neq \emptyset, then πL(xi)\pi_L(x_i) equals the maximum length of the longest directed path from each root vertex to xix_i, i.e., πL(xi)=1+max\pi_L(x_i)= 1 + \max{πL(xj):xjPa(xi)\pi_L(x_j): x_j \in \text{Pa}(x_i)}."

Following a request from Reviewer LEV3, our updated definition of the layered sort enforces uniqueness. Additionally, we respectfully note that previous work in the literature [1] has used the term "hierarchical topological ordering"; however, if the Reviewer strongly believes that this would be confusing to a general audience, we are ok with changing the name to a "layered sort" to indicate that the sort is between a partial and linear order.
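As a concrete companion to Definition 2.2, here is a minimal sketch computing the layered sort by longest path from the roots (the DAG below is a hypothetical example, not one from the paper):

```python
from functools import lru_cache

def layered_sort(parents):
    """parents: dict mapping each vertex to a tuple of its parents."""
    @lru_cache(maxsize=None)
    def layer(v):  # pi_L(v) = 0 for roots, else 1 + max over parents
        return 0 if not parents[v] else 1 + max(layer(p) for p in parents[v])
    levels = {}
    for v in parents:
        levels.setdefault(layer(v), []).append(v)
    return [sorted(levels[k]) for k in sorted(levels)]

# Hypothetical DAG: A -> B, A -> C, C -> D, D -> E.
print(layered_sort({"A": (), "B": ("A",), "C": ("A",), "D": ("C",), "E": ("D",)}))
# [['A'], ['B', 'C'], ['D'], ['E']]
```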

Regarding subpoint 4: a directed path $x_i \dashrightarrow x_j$ is the same as a front door path $x_i \dashrightarrow \ldots \dashrightarrow x_j$. For clarity, line 105 now reads "A frontdoor path is a directed path...".

Possible Edges given a Linear Order

We thank the Reviewer for catching a typo in Line 304. The sentence now reads "Edge discovery checks $O(d^2)$ possible edges allowed by $\pi_L$".

Explicit Theorems for Complexity Results

We thank the Reviewer for the suggestion: we have added the following theorems for our complexity results to the main text (Sections 4, 5, and 6), with the corresponding formal proofs in the appendix. We provide proof sketches below for clarity.

  • Theorem 1: Given $n$ samples of $d$ vertices generated by a LiNGAM, the worst-case runtime complexity of LHTS is upper bounded by $O(d^3 n^2)$.

    • Proof sketch: in Stage 1, LHTS runs $O(d^2)$ marginal independence tests that each have $O(n^2)$ complexity. In each step of Stage 2 and Stage 3, LHTS runs $O(d^2)$ marginal independence tests, each with $O(n^2)$ complexity. In the worst case of a fully connected DAG, there are $O(d)$ steps in total across Stages 2 and 3: this is because each step identifies one layer of the layered sort, and a fully connected DAG has $d$ layers. Therefore, the overall runtime complexity of LHTS is $O(d^3 n^2)$.
  • Theorem 2: Given $n$ samples of $d$ vertices generated by an identifiable nonlinear ANM, the worst-case runtime complexity of NHTS is upper bounded by $O(d^3 n^3)$.

    • Proof sketch: in Stage 1, NHTS runs $O(d^2)$ marginal independence tests that each have $O(n^3)$ complexity. In Stage 2, NHTS runs $O(d^2)$ nonparametric regressions and $O(d^2)$ marginal independence tests, each of which has $O(n^3)$ complexity. In Stage 3, NHTS runs at most $O(d^2)$ conditional independence tests, each of which has $O(n^3)$ complexity. In the worst case of a fully connected DAG, NHTS goes through $O(d)$ steps in Stage 4: in each step, NHTS runs $O(d)$ nonparametric regressions and $O(d^2)$ marginal independence tests, each of which has $O(n^3)$ complexity. Therefore, the overall runtime complexity of NHTS is $O(d^3 n^3)$.
  • Theorem 3: Given $n$ samples of $d$ vertices generated by a model corresponding to a DAG $G$, the runtime complexity of ED is upper bounded by $O(d^2 n^3)$.

    • Proof sketch: ED checks for the existence of every edge permitted by a topological sort $\pi$ by running one conditional independence test of complexity $O(n^3)$. In the worst case, there are $O(d^2)$ possible edges, so the overall complexity is $O(d^2 n^3)$.
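To illustrate the shape of the counting in Theorem 3, here is a generic sketch of an ordering-based pruning loop (ours, for illustration: `ci_test` and `cond_set` are assumed callables, and the construction of ED's local conditioning sets, bounded by $|\text{Pa}(\pi(i))| + |\text{Pa}(\pi(j))|$, is omitted):

```python
def edge_discovery(order, ci_test, cond_set):
    """One CI test per candidate edge allowed by the sort => O(d^2) tests,
    each costing O(n^3), matching the O(d^2 n^3) bound in Theorem 3."""
    edges = []
    for b, j in enumerate(order):
        for i in order[:b]:                  # all edges permitted by `order`
            if not ci_test(i, j, cond_set(i, j)):
                edges.append((i, j))         # still dependent: keep the edge
    return edges
```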

Citations

  • [1] Wu et al., Hierarchical Topological Ordering with Conditional Independence Test for Limited Time Series (2023).
Comment

Thanks for the very thorough rebuttal! I think the updated notation, added explicit complexity results, and substantial simulation results drastically improve the paper. However, I only increase my rating to 5, because I worry that such significant revisions might be outside the intended scope of this rebuttal process and rather warrant resubmission and another round through the review process (I will inquire more and further increase my rating in case I'm mistaken about this).

I also have an important follow-up question, which I've posted to the general results-focused rebuttal.

Comment

We thank the Reviewer for appreciating our rebuttal. We respond to the Reviewer's follow-up question in the comment section of the general response.

We respectfully note that most essential features of our paper were not revised, including the problem setup, background assumptions, theoretical results characterizing local causal substructures, and algorithms LHTS, NHTS, and ED. The revisions were mainly restricted to 1) providing additional experimental results that support the superiority of our methods across a wider variety of settings, and 2) clarification of established contributions through updated notation and formalization of existing discussion.

We believe that these changes are within the scope of a conference rebuttal, and would be happy to provide additional details if the Reviewer believes that further evaluation is needed.

Review (Rating: 7)

The paper mainly focuses on proposing efficient search algorithms for finding the hierarchical sort ordering (linear topological sort) of variables. As mentioned in Section 5, finding such hierarchical orders can significantly improve the efficiency of causal discovery of edges, making the algorithm tractable (traditional algorithms such as PC are exponential). The paper studies two cases: linear (LiNGAMs) and non-parametric, where a complete algorithm based only on path analysis is developed for the linear case, and a combination of path analysis and layer-wise search is developed for the non-parametric case. Both algorithms improve the discovery of hierarchical order.

Strengths

The paper is well structured and clearly written. The theoretical contributions, including the causal path analysis and corresponding algorithms, are interesting and also important in practice as can be told from the analysis of computational complexity. All results are properly formulated as definitions and theorems and proofs are included in the appendix. Experiments are also conducted and their results are discussed in depth in Section 6. In general, I enjoyed reading it.

Weaknesses

  • In general, I suggest adding more examples to demonstrate the procedure of algorithms, probably for NHTS (Algorithm 2) so that we can see a clear cut between the two stages (root-identification and layer identification).
  • While the authors touch on this briefly at the beginning of Section 4, non-experts may benefit from additional details about the difference between the linear and non-linear cases (especially how they affect conditional independencies, if at all).
  • For Definition 2.1, it would be great to provide a hierarchical topological sort that cannot be trivially converted to a linear topological sort; that is, one where we cannot simply add more layers to a hierarchical sort to obtain a linear topological sort.
  • Lemma 4.1 is a little confusing to me: if $x_i$ is a parent of $x_j$, how are PP1 and PP4 possible? $x_i$ must be a direct cause of $x_j$, right? Also, when you say "$x_i$ and $x_j$ are not causally related", does it mean that there is no directed edge from $x_i$ to $x_j$, or no directed path? Does "active path" mean any unblocked dependency path (backdoor or frontdoor)?

Questions

  • I'm curious whether all the results can also be explained using independencies (d-separations) instead of regressions? This would allow us to think only in terms of graphs. I guess regressions in the non-parametric setting are equivalent to d-separations; how about the linear case? Are there any independencies that hold in the linear case but not in the non-parametric case?
  • For Algorithm 1 (LHTS), is Stage 2 really needed? It seems that Stage 2 is a special case of Stage 3 where the set of mutual ancestors is $\emptyset$.
  • For the experiments, the paper mentions a tradeoff between accuracy and encoded causal information. Would it be more fair to restrict the ordering length (say, limit it to some length $k$) and compare the ordering accuracy?

Limitations

OK

Author Response

We thank the Reviewer for their insightful questions about how our methods work, and suggestions for how to clarify our explanations and results. We respond to Weakness 2 in the general response and address everything else below (following the order of the review).

Explanatory Examples

We thank the Reviewer for their suggestion; in the appendix of our paper, we have added a figure for each algorithm (LHTS, NHTS, ED) with corresponding text that walks through the steps of each algorithm on an exemplary DAG. We hope that this makes it clearer to readers which vertices or edges are being identified in which step, and easier for readers to understand how the identification occurs. We include the figure corresponding to NHTS in the PDF attached to the general response and include the corresponding text here:

"In Stage 1, NHTS finds that A, B, C, D, and E are all mutually not in PP1 relations. In Stage 2, NHTS discovers that A is in PP2 relation with B and C, while D is in PP2 relation with E: therefore, {A, E} are the candidate root set. In Stage 3, NHTS confirms that A is indeed the root vertex, adding it to the first layer of the hierarchical sort. In the first iteration of Stage 4, NHTS recovers an independent residual when regressing B or C on A; B, C are added to the next layer in the hierarchical sort. Similarly, NHTS recovers an independent residual when regressing D and E in the second and third iteration of Stage 4, respectively: therefore, NHTS recovers the hierarchical topological sort [[A],[B, C],[D],[E]] by the end of Stage 4."

Definition 2.1

Theoretically, any hierarchical topological sort can be converted into a linear topological sort by flattening the hierarchical structure: vertices in the same layer can be placed in an arbitrary order relative to each other, while vertices in lower layers are placed before vertices in upper layers. By "trivially converted," we take the Reviewer to be asking whether, given a DAG, the hierarchical topological ordering is unique. Indeed, we meant that the hierarchical topological sort in our paper is unique. We have corrected Definition 2.1 to reflect this change:

"Definition 2.1: A hierarchical topological sort (between a partial and linear order) is a mapping πL:V\pi_L: V \rightarrow {0,1,,V10,1,\ldots, |V|-1}, such that if Pa(xi)=\text{Pa}(x_i)=\emptyset, then πL(xi)=0\pi_L(x_i)=0, and if Pa(xi)\text{Pa}(x_i)\neq \emptyset, then πL(xi)\pi_L(x_i) equals the maximum length of the longest directed path from each root vertex to xix_i, i.e., πL(xi)=1+max\pi_L(x_i)= 1 + \max {\{\{\pi_L(x_j): x_j \in \text{Pa}(x_i)\}}}.''

We thank the Reviewer for catching this, and we are happy to clarify further if this interpretation is incorrect.

Lemma 4.1

We thank the Reviewer for catching the typo in line 217: the sentence now reads "...where $x_i$ is one of the potential parents of $x_j$...". Therefore, $x_i$ is not necessarily a parent of $x_j$: PP1 and PP4 are possible.

We thank the Reviewer for pointing out the unclear notation: $x_i, x_j$ are not causally related if and only if no active path exists between $x_i$ and $x_j$, which includes both directed edges and directed paths. For clarity, the sentence in line 218 now reads: "PP1) no active path exists between $\mathbf{x_i, x_j}$...".

The Reviewer is correct: we use the term "active path" to refer to any unblocked dependency path, which can either be a backdoor path or a frontdoor path.

Relationship between Results, D-Separations and Regressions

We agree that some of our results may be explained entirely by d-separations, but not all. We have further clarified the following points in Section 4 of our paper.

  • Our edge pruning method ED is entirely constraint-based, and therefore is explainable via d-separations in both the linear and nonlinear case. We refer the Reviewer to our proof of correctness for ED in Appendix C.2 for more detail.
  • Our topological ordering algorithms (LHTS, NHTS) are not fully explainable by d-separations: d-separations only characterize an MEC of DAGs, and different topological sorts may exist within the same MEC. To identify a unique topological sort, we exploit the additive noise assumption: the causal parents of a variable $x_i$ are independent of the noise $\varepsilon_i$ corresponding to $x_i$. In practice, we use this property by running regressions and checking whether the residual is independent of the covariates.
  • Further, we note that the conditions under which the residual is independent of the covariates are different in the linear and nonlinear case; we refer the Reviewer to the general response for more detail.

Stage 2 of LHTS

We agree with the Reviewer. We distinguish between Stages 2 and 3 in Algorithm 1 for explanatory purposes, emphasizing that all AP3 relations are discovered in Stage 2, and all AP2 and AP4 relations are discovered in Stage 3. These relations constitute distinct local causal substructures, motivating a conceptual separation between Stages 2 and 3. We have further clarified this point in Line 181 of our revised paper.

Tradeoff between Accuracy and Encoded Information

Restricting the ordering length does not impact ordering accuracy. The loss of accuracy of our method comes from classifying vertices into layers according to a cutoff, instead of pinpointing the 'best' candidate to add to a linear topological sort, which involves selecting the vertex that achieves the best test statistic among the remaining vertices. Both kinds of algorithms are asymptotically consistent. While cutoff-based methods may be slightly more error-prone in low-sample settings, they inherently capture more information about the underlying DAG, reducing the complexity of the edge-pruning stage and making the overall process more efficient when the graph is sparse.

Comment

Thanks for your clarifications. The new Definition 2.1 looks good to me. I'll keep my score.

Comment

We thank the Reviewer for their consideration.

Author Response

We thank the Reviewers for their insightful comments and questions, as they have helped improve the clarity of our paper. We have addressed all raised concerns in this rebuttal, and incorporated the feedback into our manuscript.

We thank the Reviewers for unanimously acknowledging the novelty and potential utility of our proposed local search approach. In particular, Reviewer rTf8 appreciated how we leveraged a simple idea to develop "causal discovery algorithms with consistency guarantees, broad applicability (ANMs) and good identifiability (specific DAG instead of MEC)." Reviewer fR7W appreciated how "the finer analysis and exploitation of local structure" shows potential to efficiently tackle "the high computational complexity and hardness in nonparametric regression and CI tests." In terms of writing, Reviewers LEV3 and rTf8 both found the paper generally "clearly written", and Reviewer fR7W said it was "easy to follow for researchers working in relevant area."

In this general response, we will provide the following clarifications to all Reviewers:

  • Contextualize our paper's importance to the field of causal discovery (Reviewer LHcu Weakness 1, Reviewer fR7W Question 1).
  • Provide additional experimental results (Reviewer rTf8 Weakness 1, Reviewer fR7W Weakness 2).
  • Specify the difference between linear and nonlinear case (Reviewer LHcu Question 1, Reviewer LEV3 Weakness 2, Reviewer LEV3 Question 1).

Importance to the Field of Causal Discovery

Traditional causal discovery algorithms (PC, GES, GSP, etc.) utilize conditional independence relations to recover a Markov Equivalence Class (MEC) of causal models. To enable the identification of a unique DAG, our methods operate under the assumption of an additive noise model (ANM), leveraging independent residuals from regressions to find a topological ordering, and CI tests to prune edges. Our main contributions are twofold: 1) our nonlinear topological ordering algorithm NHTS exploits local search to run fewer high-dimensional regressions than traditional ANM methods, achieving lower sample complexity than baselines, 2) our constraint-based algorithm ED uses the discovered topological ordering to nonparametrically prune edges with smaller conditioning sets than traditional sparse regression methods, outperforming baselines by improving sample efficiency.

Additional Experimental Results

In response to Reviewers fR7W and rTf8, we provide new experimental results (see the attached PDF) that test our nonlinear topological sorting algorithm NHTS and our edge pruning algorithm ED against more benchmarks and in many different settings. We find that both NHTS and ED systematically outperform baseline methods, providing strong evidence for the superiority of our methods. We are happy to provide additional experimental results upon request. We provide two representative examples of runtime results in the attached PDF. We test our methods in 24 different settings: two choices of dimensionality ($d=10$, $d=20$), three types of noise distribution (uniform, Gaussian, Laplacian), and four types of Erdos-Renyi [1] graphs with different sparsities; the expected number of edges equals $d$ (ER1), $2d$ (ER2), $3d$ (ER3), or $4d$ (ER4). In each experiment, methods are evaluated on 20 DAGs generated by nonlinear causal mechanisms, with $n=300$. We include NoGAM for comparison as a state-of-the-art ANM-based method [2]. As suggested by Reviewers rTf8, LHcu, and fR7W, we include CAM, GES, GRaSP, and GSP as additional baselines. GES, GRaSP, and GSP return only an MEC; to enable a fair comparison, we randomly select one topological ordering permitted by the outputted MEC for evaluation. CAM is excluded from trials with $d=20$ due to extremely long runtimes, taking over 23 minutes to complete a single trial. We exclude PC and RESIT since in general they perform much worse than baseline methods [2].
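For reference, here is a minimal sketch of the data-generating setup described above; the specific nonlinear mechanism (tanh of a random linear combination of parents) is our illustrative assumption, not necessarily the exact one used in the paper's experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_er_dag(d, k):
    """ER-k DAG: edge probability p chosen so E[#edges] = p*d(d-1)/2 = k*d."""
    p = min(1.0, 2 * k / (d - 1))
    return np.triu(rng.random((d, d)) < p, 1).astype(int)

def sample_anm_data(A, n=300, noise="gauss"):
    d = A.shape[0]
    draw = {"gauss": rng.standard_normal,
            "uniform": lambda m: rng.uniform(-1, 1, m),
            "laplace": lambda m: rng.laplace(size=m)}[noise]
    X = np.zeros((n, d))
    for j in range(d):                         # columns are in causal order
        pa = np.flatnonzero(A[:, j])
        f = np.tanh(X[:, pa] @ rng.normal(size=pa.size)) if pa.size else 0.0
        X[:, j] = f + draw(n)                  # additive noise
    return (X - X.mean(0)) / X.std(0)          # standardized; see below

X = sample_anm_data(sample_er_dag(d=10, k=2), n=300, noise="gauss")
```

The final standardization step matters: as discussed later in this thread, unstandardized simulated data can be gamed by variance-sorting baselines.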

Difference between Linear and Nonlinear Case

The reason why LHTS (Algorithm 1) cannot be naively applied is that regressions yield independent residuals under different conditions in the nonlinear case. LHTS leverages the assumption of linear causal mechanisms when running pairwise regressions in Stage 2, which is insufficient in the nonlinear setting. For clarity, we demonstrate how LHTS fails to correctly recover causal relationships in an exemplary 3-node DAG with nonlinear causal mechanisms, whereas NHTS (Algorithm 2) succeeds.

Consider a DAG $G$ with three vertices $x_1, x_2, x_3$, where $x_1 \rightarrow x_3$ and $x_2 \rightarrow x_3$. The functional causal relationships are nonlinear, given by $x_1 = \varepsilon_1$, $x_2 = \varepsilon_2$, $x_3 = x_1 x_2 + \varepsilon_3$, where the $\varepsilon$'s are mutually independent noise variables. We focus on whether LHTS or NHTS can recover the parent-child relationship between $x_1$ and $x_3$. Both LHTS and NHTS find that the relationship between $x_1, x_3$ is unknown in Stage 1.

In Stage 2, LHTS runs pairwise regressions between $x_1, x_3$ but incorrectly concludes that $x_1, x_3$ are not in AP3 relation, because neither pairwise regression yields an independent residual; both parents of $x_3$ must be included in the covariate set for an independent residual to be recovered. In contrast, NHTS correctly constructs the conditioning set $P_{13} = \{x_2\}$ for the pairwise regression of $x_3$ on $x_1$; NHTS recovers an independent residual, identifying that $x_1, x_3$ are in PP2 relation, i.e., $x_3$ is a child of $x_1$.
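A quick numerical illustration of this example (our own sketch: we use KNN regression as the nonparametric regressor and a crude moment-based dependence check in place of a proper independence test):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)
n = 2000
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
x3 = x1 * x2 + 0.1 * rng.standard_normal(n)    # x3 = x1*x2 + eps3

def residual(y, X):
    return y - KNeighborsRegressor(n_neighbors=30).fit(X, y).predict(X)

r_pair = residual(x3, x1.reshape(-1, 1))           # LHTS-style: x3 ~ x1 only
r_full = residual(x3, np.column_stack([x1, x2]))   # NHTS-style: P13 = {x2}

# The pairwise residual is roughly x1*x2, so its square correlates with
# x1^2 (~0.5); with both parents included, the residual is near pure noise.
print(np.corrcoef(r_pair**2, x1**2)[0, 1])   # clearly nonzero
print(np.corrcoef(r_full**2, x1**2)[0, 1])   # close to zero
```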

We have incorporated the above explanation into the main text in Section 4. Additionally, as suggested by Reviewer LEV3, for reader clarity we have added figures and text to our paper's appendix that walk through the steps of LHTS, NHTS and ED (see attached PDF for the NHTS figure).

Citations

  • [1] Erdos et al., On the evolution of random graphs (1960).
  • [2] Montagna et al., Causal Discovery with Score Matching on Additive Models with Arbitrary Noise (2023).
Comment

We thank Reviewer rTf8 for appreciating the strength of our new experimental results (see attached PDF). In our experiments, $A_{top}$ is computed for methods that return an MEC in two steps: 1) we select one topological sort $\pi_{\mathrm{MEC}}$ permitted by the returned MEC, 2) we set $A_{top}$ equal to the percentage of edges that can be recovered from $\pi_{\mathrm{MEC}}$. Step 2) is identical for every method in our experiments; as a representative example, we walk through how step 1) is done for GES.
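Step 2) is straightforward to state in code; a minimal sketch (assuming, as described above, that $A_{top}$ is the fraction of true edges whose direction is consistent with the selected ordering):

```python
import numpy as np

def a_top(true_dag, order):
    """true_dag: d x d adjacency with true_dag[i, j] = 1 for i -> j;
    order: a list of the d vertex indices in the selected topological sort."""
    pos = {v: p for p, v in enumerate(order)}
    src, dst = np.nonzero(true_dag)
    if src.size == 0:
        return 1.0
    # fraction of true edges recoverable from the ordering
    return float(np.mean([pos[i] < pos[j] for i, j in zip(src, dst)]))
```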

We use the implementation of GES provided by the causal-learn [1] Python package, which returns a 'completed partially directed acyclic graph' (CPDAG) $CG$. A CPDAG is a compact matrix representation of an MEC of DAGs, commonly used in the literature [2]. In particular, $CG[j,i]=1, CG[i,j]=-1$ indicates $i \rightarrow j$; $CG[j,i]=CG[i,j]=-1$ indicates $i - j$, i.e., there exists an edge between $i, j$ with an unknown orientation; and $CG[j,i]=CG[i,j]=0$ if no edge exists between $i, j$.

To select a valid topological sort $\pi_{\mathrm{MEC}}$ from $CG$, we first initialize a list of sorted vertices $\pi_{\mathrm{MEC}} = [\,]$ and a list of unsorted vertices $U = \{1, \ldots, |V|\}$. We then iteratively sort the vertices of $U$ into $\pi_{\mathrm{MEC}}$ one at a time: in each iteration, we first determine the set of vertices $R \subseteq U$ that have no parents in $U$, i.e. $R = \{i \in U : CG[j,i] \neq 1 \;\forall j \in U, j \neq i\}$. We then select a vertex from $R$ uniformly at random, remove it from $U$, and append it to $\pi_{\mathrm{MEC}}$. After $|V|$ iterations, $\pi_{\mathrm{MEC}}$ is valid with respect to $CG$: by construction, if $i \rightarrow j$, i.e. $CG[j,i]=1$, then $\pi_{\mathrm{MEC}}(i) < \pi_{\mathrm{MEC}}(j)$.

We emphasize that, under the assumptions made by traditional discovery algorithms such as GES, all topological orderings contained within a MEC are equally valid: every such ordering corresponds to a DAG that satisfies every conditional independence constraint found in the data. Therefore, these algorithms cannot distinguish between any ordering present in the MEC. This motivates the uniformly random selection of one valid topological ordering from the MEC for downstream comparison with LHTS and NHTS.

Citations

  • [1] Zheng et al., Causal-learn: Causal Discovery in Python (2024).
  • [2] Perkovic et al., Complete Graphical Characterization and Construction of Adjustment Sets in Markov Equivalence Classes of Ancestral Graphs (2018).
Comment

I think these results look very promising! However, I remain a bit skeptical about the evaluation metrics. Could the authors elaborate how $A_{top}$ is computed for methods like GES that return a CPDAG rather than a topological (or even partial) order?

Comment

The procedure you describe is invalid, in that it produces linear orders that are not extensions of the partial order corresponding to any DAG in the MEC. Consider the CPDAG $a - b - c$. In the first iteration, $U = R = \{a, b, c\}$, so it's possible to select $a$. This updates $U$ and $R$ to $\{b, c\}$, so it's possible to pick $c$ in the second iteration, forcing $b$ to be picked in the third iteration. The resulting order $(a, c, b)$ corresponds to the DAG $a \rightarrow b \leftarrow c$, which is not in the MEC represented by the original CPDAG.

Instead of selecting parentless nodes uniformly at random, one must order the chain components of the CPDAG according to a linear extension of their PO, and then orient the nodes within each chain component according to an inverse perfect elimination ordering (see [1] and [2] for more information about these concepts). Even easier, since you are trying to construct a single compatible order rather than enumerate them all, is to randomly select a DAG from the MEC (causal-learn has a function for this, as do the python packages pgmpy and causaldag) and then proceed as you do with DAGs from the other methods.

[1] Andersson, S. A., Madigan, D., & Perlman, M. D. (1997). A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2), 505-541.

[2] Koller, D., & Friedman, N. (2009). Probabilistic graphical models: Principles and techniques. MIT press., especially Algorithm 9.3 and surrounding context.

If these issues are addressed and the results still support the claim of superior performance, I will increase my score further.

Edit: I forgot to add: The experimental results would also be much more compelling if they included a comparison to sortnregress or some other method to help ensure the simulated data is sufficiently challenging (see [3] for more details).

[3] Reisach, A., Seiler, C., & Weichwald, S. (2021). Beware of the simulated DAG! Causal discovery benchmarks may be easy to game. Advances in Neural Information Processing Systems, 34, 27772-27784.

Comment

We agree with Reviewer rTf8's description of the issue concerning the computation of $A_{top}$ for GES, GRaSP, and GSP, and apologize for the error. We thank Reviewer rTf8 for catching this mistake. Following Reviewer rTf8's suggestion, we update how $A_{top}$ is computed for GES, GRaSP, and GSP: we use the function pdag2dag provided by the causal-learn Python package [1] to randomly select a DAG from the returned CPDAGs (replacing step 1 in our previous official comment).
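A sketch of the corrected step 1) (hedged: `ges` and `pdag2dag` are the functions named in this thread, used as we understand their documented interfaces; exact causal-learn call signatures may vary across versions):

```python
from causallearn.search.ScoreBased.GES import ges
from causallearn.utils.PDAG2DAG import pdag2dag

def sort_from_cpdag(data):
    record = ges(data)             # record['G'] holds the estimated CPDAG
    dag = pdag2dag(record['G'])    # select a DAG from the MEC (per rTf8)
    A = dag.graph                  # encoding: A[j, i] = 1, A[i, j] = -1 for i -> j
    d, order, placed = A.shape[0], [], set()
    while len(order) < d:          # peel off parentless vertices of the DAG
        for i in range(d):
            if i not in placed and not any(
                    A[i, j] == 1 for j in range(d) if j not in placed):
                order.append(i)
                placed.add(i)
    return order
```

Unlike the earlier CPDAG-based procedure, every order produced this way is a linear extension of a single DAG inside the MEC.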

Fortunately, we saved all of the raw results from the experiments in the attached PDF (true DAGs from each trial, returned DAGs/CPDAGs for each method, etc.): using the updated $A_{top}$ measure, we have re-evaluated the original results. We find that, although GES, GRaSP, and GSP show improved performance when measured by the updated $A_{top}$, NHTS still has the highest median score in all experiments, supporting our claim of superior performance. We are instructed not to post links to images, and are therefore unable to directly display the plots; instead, at the bottom of this comment we provide the original median $A_{top}$ for NHTS, and the updated median $A_{top}$ for GES, GRaSP, and GSP, in all experiments.

Additionally, we respectfully note that 1) we have already processed the data to ensure it is sufficiently challenging, and 2) we have already incorporated a comparison to a variant of sortnregress. In Lines 314-316 of our original submission, we clarify that "data were standardized to zero mean and unit variance" to avoid gameability by var_sort_regress [2], which uses marginal variance as an ordering criterion. The data generated for the new experiments were standardized through the same process. Furthermore, in Line 329 of our original submission we describe the benchmark r2_sort_regress [3], a scale-invariant version of sortnregress which uses $R^2$ as an ordering criterion. We had included r2_sort_regress as a benchmark in the new experiments, under the name "R2ST": R2ST was outperformed by NHTS in all experiments.

Updated Median $A_{top}$ Scores for NHTS, GES, GRaSP, and GSP

| Noise | Graph | $d$ | NHTS | GES  | GRaSP | GSP  |
|-------|-------|-----|------|------|-------|------|
| Gauss | ER1   | 10  | 1.00 | 0.88 | 0.93  | 0.35 |
| Gauss | ER2   | 10  | 0.80 | 0.65 | 0.70  | 0.48 |
| Gauss | ER3   | 10  | 0.84 | 0.68 | 0.81  | 0.43 |
| Gauss | ER4   | 10  | 0.73 | 0.52 | 0.65  | 0.54 |
| Gauss | ER1   | 20  | 1.00 | 0.86 | 0.89  | 0.57 |
| Gauss | ER2   | 20  | 0.86 | 0.85 | 0.84  | 0.53 |
| Gauss | ER3   | 20  | 0.83 | 0.76 | 0.73  | 0.50 |
| Gauss | ER4   | 20  | 0.88 | 0.61 | 0.57  | 0.44 |
| Unif  | ER1   | 10  | 1.00 | 0.74 | 0.85  | 0.47 |
| Unif  | ER2   | 10  | 0.90 | 0.70 | 0.62  | 0.49 |
| Unif  | ER3   | 10  | 0.97 | 0.64 | 0.66  | 0.46 |
| Unif  | ER4   | 10  | 0.98 | 0.59 | 0.57  | 0.39 |
| Unif  | ER1   | 20  | 1.00 | 0.75 | 0.69  | 0.60 |
| Unif  | ER2   | 20  | 0.86 | 0.77 | 0.73  | 0.57 |
| Unif  | ER3   | 20  | 0.99 | 0.66 | 0.70  | 0.51 |
| Unif  | ER4   | 20  | 0.94 | 0.59 | 0.63  | 0.54 |
| Lapla | ER1   | 10  | 0.97 | 0.93 | 0.92  | 0.48 |
| Lapla | ER2   | 10  | 0.88 | 0.70 | 0.76  | 0.50 |
| Lapla | ER3   | 10  | 0.82 | 0.73 | 0.71  | 0.47 |
| Lapla | ER4   | 10  | 0.79 | 0.75 | 0.70  | 0.43 |
| Lapla | ER1   | 20  | 0.89 | 0.85 | 0.83  | 0.54 |
| Lapla | ER2   | 20  | 0.98 | 0.86 | 0.79  | 0.51 |
| Lapla | ER3   | 20  | 0.73 | 0.68 | 0.64  | 0.47 |
| Lapla | ER4   | 20  | 0.77 | 0.63 | 0.65  | 0.53 |

Citations

  • [1] Zheng et al., Causal-learn: Causal Discovery in Python (2024).
  • [2] Reisach et al., Beware of the simulated DAG! Causal discovery benchmarks may be easy to game (2021).
  • [3] Reisach et al., A Scale-Invariant Sorting Criterion to Find a Causal Order in Additive Noise Models (2023).
Final Decision

The authors conducted a very detailed rebuttal process and it greatly helped clarify and convince the reviewers. All reviewers are recommending acceptance, although with varying enthusiasm levels.

I recommend the authors mention the additive noise model explicitly in the abstract, if not in the title, as it is currently not stated openly. I would like to especially thank Reviewer rTf8 for closely parsing the rebuttal and notifying the authors of the issue with their experiments on extending partial orders, and the authors for correcting this issue. I am happy to see the review process has made the paper much stronger. The same reviewer first increased their score only to a 5, but after further back and forth with the authors, they increased it to a 6. Most issues seem to have been resolved, assuming that the authors incorporate all the recommended changes in the camera-ready version. In that sense, another round of review may not be necessary since most of the content is already there.