Gradient based Causal Discovery with Diffusion Model
The paper proposes to use diffusion models for causal discovery, and searches for the DAG under continuous optimization frameworks.
Abstract
Reviews and Discussion
This paper introduces a diffusion model (denoising diffusion probabilistic model) for causal discovery and demonstrates its effectiveness in this task.
Strengths
The paper highlights the significant advantages of using a diffusion model for causal discovery. In nonlinear models, it consistently outperforms many other baseline methods.
Weaknesses
- Small typo: Line 309-310, "deep leyers"->"deep layers". Line 183-184, "node i to node j"->"node to node "
- The current form of Proposition 1 lacks rigor, raising concerns about its accuracy and possible need for revision. Here, $A$ is the weighted adjacency matrix. Notably, a sequence $\{A_k\}$ can always be constructed such that $h(A_k) \to 0$ (using $\inf$ here is more appropriate than $\min$), as $A$ is a weighted adjacency matrix and $h$ is continuous in $A$, implying that the infimum of $h$ over such matrices is $0$. Additionally, the phrase "all direct cyclic graphs in the solution space" is unclear and seems incorrect. The solution space should consist of only directed acyclic graphs, so there should be no cyclic graphs in this space. Please revise this accordingly.
- The experimental results omit the algorithm's runtime, and results are only reported for graphs with up to 50 nodes and for ER2. Given that this paper employs a diffusion model for causal discovery, it would be helpful to include [1] as a baseline. Additional experiments would strengthen the case for the effectiveness of the diffusion model in causal discovery.
[1] Pedro Sanchez, Xiao Liu, Alison Q O'Neil, and Sotirios A Tsaftaris. Diffusion models for causal discovery via topological ordering. arXiv preprint arXiv:2210.06201, 2022.
Questions
- I noticed that CAM predicts many edges that do not exist in the true causal graphs. This could be due to the CausalDiscovery package having the tuning step turned off by default, which is crucial for the CAM algorithm. Could this be a contributing factor to CAM's degraded performance?
- Could you clarify why DAG-diffusion's performance is lower in linear models? Is it possibly because the diffusion model is more effective at capturing nonlinear relationships?
Details of Ethics Concerns
NA
Thank you for the comments. Here are our revisions based on your advice.
Weakness 1: Small typo: Line 309-310, "deep leyers"->"deep layers". Line 183-184, "node i to node j"->"node to node "
A: Thank you. We fixed this in the updated version.
Weakness 2: The current form of Proposition 1 lacks rigor, raising concerns about its accuracy and possible need for revision. Here, $A$ is the weighted adjacency matrix. Notably, a sequence $\{A_k\}$ can always be constructed such that $h(A_k) \to 0$ (using $\inf$ here is more appropriate than $\min$), as $A$ is a weighted adjacency matrix and $h$ is continuous in $A$, implying that the infimum of $h$ over such matrices is $0$. Additionally, the phrase "all direct cyclic graphs in the solution space" is unclear and seems incorrect. The solution space should consist of only directed acyclic graphs, so there should be no cyclic graphs in this space. Please revise this accordingly.
A: Thank you for raising this concern. By "solution space" we mean all possible matrices in the search space, i.e., the matrices along the optimization trajectory, not the space of final solutions. We change this in the updated version.
Weakness 3: The experimental results omit the algorithm's runtime, and results are only reported for graphs with up to 50 nodes and for ER2. Given that this paper employs a diffusion model for causal discovery, it would be helpful to include [1] as a baseline. Additional experiments would strengthen the case for the effectiveness of the diffusion model in causal discovery.
A: This is a good suggestion. We include it as a baseline in the updated paper. The main focus of this paper is the capability of the proposed method for functional causal discovery; we do not pay much attention to running time and care more about accuracy.
Q1: I noticed that CAM predicts many edges that do not exist in the true causal graphs. This could be due to the CausalDiscovery package having the tuning step turned off by default, which is crucial for the CAM algorithm. Could this be a contributing factor to CAM’s degraded performance?
A: Thank you for this advice. We changed the parameter and report the results in the updated version.
Q2: Could you clarify why DAG-diffusion’s performance is lower in linear models? Is it possibly because the diffusion model is more effective at capturing nonlinear relationships?
A: We agree with your argument, and we make several comments on this in the experiment section (see lines 453-454).
Thank you for addressing my concern about the paper.
However, I believe there is an issue with Proposition 1. Consider the following simple counterexample:
Let $A_\epsilon$ be a weighted adjacency matrix whose only nonzero entry is a diagonal (self-loop) weight $\epsilon > 0$.
Here, we have $h(A_\epsilon) > 0$. Since $A_\epsilon$ represents a directed cyclic graph, it is evident that as $\epsilon \to 0$, $h(A_\epsilon) \to 0$. In this case, the infimum of $h$ is $0$ even though every $A_\epsilon$ is cyclic.
This example demonstrates that your conclusion is not generally correct.
Thank you for the response. This is a good consideration. In fact, the graph you give is very special. By "weighted adjacency matrix" we do not mean one whose diagonal edge strengths converge to 0, because in the limit this gives a graph with almost no edges at all, which is acyclic, not cyclic. The easier-to-understand case is to treat this as a binary matrix, where $\epsilon$ is replaced with $1$ so that it is a binary adjacency matrix; then your calculation gives a different result. The more general insight (equation (5) in Zheng 2018) is to interpret $\mathrm{tr}(e^{A \circ A})$ as $d + \sum_{k=1}^{\infty} \frac{1}{k!}\,\mathrm{tr}\big[(A \circ A)^{k}\big]$, where the $k$-th term counts (weighted) length-$k$ closed walks in the directed graph and must be $0$ if the graph is acyclic. In this regard, a cyclic graph will not lead to $h(A)$ being $0$ even in the limiting case, and it would be clearer to interpret the statement over binary adjacency matrices to remove ambiguity.
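To illustrate with a binary cyclic example (our own computation, not from the paper): take a single self-loop on $d = 2$ nodes,
$$A = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}, \qquad h(A) = \mathrm{tr}\!\left(e^{A \circ A}\right) - 2 = (e + 1) - 2 = e - 1 \approx 1.72 > 0,$$
so in the binary case a cyclic graph always keeps $h$ strictly above $0$.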
If you have further questions, please kindly let us know; we are happy to discuss further until the issues are resolved.
Reference: Zheng X, Aragam B, Ravikumar P K, et al. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems, 2018, 31.
This paper adopts diffusion models for differentiable causal discovery. A specific function class is considered, where diffusion models are used to model the nonlinear causal relations. Empirical studies on synthetic and real data are provided.
Strengths
Causal discovery with nonlinear relations is an important task, because it is often unrealistic to assume linear relations in practice.
Weaknesses
- The specific nonlinear functional class considered/assumed is highly restrictive. The method cannot handle more general functional causal model, such as nonlinear additive noise models.
- The baselines considered are not adequate.
Questions
- I would suggest the authors be clear about what type of functional causal model the method can handle, which is not clear from Eq. (10), (11), (12). Specifically, the paper should give a precise formulation in the form of a structural causal model.
- Are $f$ and $g$ in Eq. (10) and (11) variable-wise/element-wise functions? If so, this should be stated explicitly.
- Only CAM and DAG-GNN are compared for nonlinear models. Several other differentiable nonlinear methods could be included to strengthen the empirical studies, such as those for more general nonlinear causal models (https://arxiv.org/abs/1909.13189, https://arxiv.org/abs/1906.02226), and those for the specific causal model in Eq. (11) (https://arxiv.org/abs/1911.07420, https://arxiv.org/abs/2004.08697)
Minor:
- Proposition 1 follows a similar spirit as Zhu et al. (2020), which should be made more explicit in the main paper. Also, Proposition 1 does not provide any new insight into the optimization procedure, because a typical way in differentiable causal discovery is to use the augmented Lagrangian, which the paper does, so I would suggest removing this proposition.
- L56: Zheng et al. (2020) did not propose a nonparametric score function but a nonparametric DAG constraint.
Thank you for the comments. Here are our revisions based on your advice.
Weakness 1: The specific nonlinear functional class considered/assumed is highly restrictive. The method cannot handle more general functional causal model, such as nonlinear additive noise models.
A: We thank the reviewer for pointing out this issue. In fact, the nonlinear functional class is not restrictive; this form admits a wide class of models. By changing the functions in equation (11), we can recover many "traditional" models (an illustrative sketch follows the list below):
- Setting $f$ and $g$ to be the identity function, we get linear causal models.
- Setting $g$ to be some nonlinear function and $f$ to be the identity function, we get nonlinear additive noise causal models with a linear mixing mechanism (represented by the mixing matrix $A$).
- Setting $f$ to be a mixing function, we get an additive noise model with noise post-processing, so that the "added" noise terms can be mutually dependent. We thus think this model class is not restrictive. A discussion is added in Appendix D.
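As an illustrative sketch of the form we have in mind (our shorthand for this discussion, assuming a DAG-GNN-style parameterization; the exact notation of Eq. (11) may differ), with $A$ the weighted adjacency matrix and $Z$ the exogenous noise:
$$X = f\!\left((I - A^{\top})^{-1} g(Z)\right) \;\Longleftrightarrow\; f^{-1}(X) = A^{\top} f^{-1}(X) + g(Z) \quad (f \text{ invertible}),$$
which reduces to the linear SEM $X = A^{\top} X + Z$ when $f$ and $g$ are both the identity.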
Q1: I would suggest the authors to be clear about what type of functional causal model the method can handle, which is not clear from Eq. (10), (11), (12). Specifically, the paper should give a precise formulation in the form of a structural causal model.
A: Thank you for raising this issue. Basically, our model originates from the linear causal model (Eq. (8)) and then extends to nonlinear cases, and formulating it as a structural causal model does not naturally match our explanation of why the diffusion process can model the generative functions. We thus prefer to state the model as a functional generative model class. To make it clearer what type of functional causal model the method can handle, we add additional discussion in the appendix (also summarized in our response to your Weakness 1).
Q2: Are $f$ and $g$ in Eq. (10) and (11) variable-wise/element-wise functions? If so, this should be stated explicitly.
A: Here we consider $f$ and $g$ to be variable-wise functions. We also state this in the updated version (between lines 191 and 200).
Weakness 2: The baselines considered are not adequate.
Q3: Only CAM and DAG-GNN are compared for nonlinear models. Several other differentiable nonlinear methods could be included to strengthen the empirical studies, such as those for more general nonlinear causal models (https://arxiv.org/abs/1909.13189, https://arxiv.org/abs/1906.02226), and those for the specific causal model in Eq. (11) (https://arxiv.org/abs/1911.07420, https://arxiv.org/abs/2004.08697)
A: Thank you for mentioning more related work. We respond to your comments one by one.
- GraN-DAG. We compared against this on the real-world dataset.
- GAE and NOTEARS-MLP. We added these experiments to the tables; please see the updated version.
- CausalVAE. This method needs labels as supervised signals and mainly targets images for hidden concept learning, so it is not directly applicable to our testing datasets. However, it is indeed related work and we discuss it in Section 2.
Minor: Proposition 1 follows similar spirit as Zhu et al. (2020), which should be made more explicitly in the main paper. Also, Proposition 1 does not provide any new insight into the optimization procedure because a typical way in differentiable causal discovery is to use augmented Lagrangian, which the paper does, so I would suggest removing this proposition. L56: Zheng et al. (2020) did not propose a nonparametric score function but a nonparametric DAG constraint.
A: Thank you for this comment. We still keep the proposition because it states the equivalence between the augmented optimization procedures, which is an important property and makes our paper self-contained. We address your comments on "more explicitly" (see line 285) and on "a nonparametric score function but a nonparametric DAG constraint" by modifying these points in the updated version; L56 is revised to "the score function with nonparametric DAG constraint" (see line 58).
Thanks for the response. Some of my concerns have been addressed and I have updated my rating accordingly.
Thank you for the response. If there are other questions or unresolved concerns, please kindly let us know and we are willing to provide more materials and revise our manuscript accordingly.
This paper proposes an approach that combines causal discovery with diffusion models. In particular, the authors adopt a DAG-GNN-style structural causal model but use a diffusion approach to model the two non-linear functions $f$ and $g$. For DAGness of the graph, they propose to use the NOTEARS approach as a soft but differentiable constraint. Empirically, the authors evaluate on four synthetic datasets and two semi-synthetic ones, showing that DAG-Diffuser can outperform the baselines on some datasets.
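For context, here is a minimal sketch (ours, not the authors' code) of the NOTEARS-style soft acyclicity penalty referred to above, in the form typically used inside an augmented-Lagrangian objective; the helper names (`acyclicity`, `augmented_loss`, `fit_loss`, `alpha`, `rho`) are hypothetical:

```python
import torch

def acyclicity(A: torch.Tensor) -> torch.Tensor:
    """NOTEARS penalty h(A) = tr(exp(A * A)) - d (Zheng et al., 2018).

    A is a d x d weighted adjacency matrix; h(A) == 0 exactly when A encodes a DAG.
    """
    d = A.shape[0]
    return torch.trace(torch.matrix_exp(A * A)) - d

def augmented_loss(fit_loss: torch.Tensor, A: torch.Tensor,
                   alpha: float, rho: float) -> torch.Tensor:
    """Model-fit loss plus the soft, differentiable DAG constraint.

    alpha (Lagrange multiplier) and rho (penalty weight) are increased between
    optimization rounds in the usual augmented-Lagrangian scheme.
    """
    h = acyclicity(A)
    return fit_loss + alpha * h + 0.5 * rho * h ** 2
```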
Strengths
This paper is clearly written and easy to follow. The proposed approach is indeed sound and correct. Personally, although I was expecting the proposed method to use diffusion for graph modelling (i.e., a diffusion model to directly model the graph distribution), it is still interesting to see the performance when the diffusion model is used only for function modelling.
Weaknesses
However, I have several concerns regarding this paper. First, methodology-wise, I think the contribution is not significant. The core idea of this paper is to use a diffusion model to model the functions $f$ and $g$. Due to the generality of diffusion models (i.e., there are not many technical requirements for using one), this does not pose significant technical challenges beyond simple plug-and-play. Therefore, in terms of methodology, it differs little from using a normalizing flow or invertible neural network for $f$ and $g$, which has been done before.
Another question I have concerns the motivation for using a diffusion model for $f$ and $g$. Is it because diffusion is flexible, and the authors think that this may be helpful? If that is the case, another set of causal inference experiments could provide stronger evidence for choosing the diffusion model. Since the proposed method in fact learns a structural causal model, it is straightforward to extend the framework to average treatment effect and individual treatment effect estimation on synthetic data. In that case, DAG-Diffuser may demonstrate stronger performance compared to the baselines.
In addition, for the empirical evaluation on the linear Gaussian synthetic dataset, do you use the same variance for the Gaussian noise variables? If not, the model is not identifiable, which would invalidate your statements.
Some minor comments:
- For Eq. (2), are the two terms in the wrong order?
- You should add some references on line 192, for example [1].
- In line 196, you mention that "without loss of generality, we assume....", but the invertibility of $f$ and $g$ will hurt the capacity of $f$, $g$.
- For line 201, if $f$ is non-linear, do you still need $Z$ to be non-Gaussian?
- Line 261, what is the symbol used here? Shouldn't it be a different one?
- For related work, you should also cite some related work on discovery with soft constraints and diffusion models, like [2,3,4,5].
- Consider introducing the method name DAG-Diffuser before the experiment section.
- What is the term in line 403?
Reference
[1] Hoyer, Patrik, et al. "Nonlinear causal discovery with additive noise models." Advances in neural information processing systems 21 (2008).
[2] Rolland, Paul, et al. "Score matching enables causal discovery of nonlinear additive noise models." International Conference on Machine Learning. PMLR, 2022.
[3] Lachapelle, Sébastien, et al. "Gradient-based neural dag learning." arXiv preprint arXiv:1906.02226 (2019).
[4] Geffner, Tomas, et al. "Deep end-to-end causal inference." arXiv preprint arXiv:2202.02195 (2022).
[5] Sanchez, Pedro, et al. "Diffusion models for causal discovery via topological ordering." arXiv preprint arXiv:2210.06201 (2022).
Questions
See above
Thank you for the comments. Here are our revisions based on your advice.
Weakness 1: "However, I have several concerns ... done before"
A: This is an important comment. The main difference between a normalizing flow and a diffusion model is the invertibility of the function and the estimation approach (and consequently the complexity of the distributions that can be represented). The diffusion model has the flexibility of configuring the "noise adding" process, while a normalizing flow relies on the Jacobian matrix of the functions to perform model estimation. Although we "assume" invertibility for ease of explanation, our model works as long as the data obey a generative process described by equation (1), without the actual need to assume invertibility. This differs from methods using NF and gives more flexibility to represent observational distributions. In this regard, this framework using the diffusion model as a tool contributes an important piece to actual discovery methods, with wider applicability, and we think it is not simply a "plug-in". We also change the statements on invertibility in the main text (kindly see the rebuttal to your Q3).
Reference https://deeplearning.cs.cmu.edu/F23/document/slides/lec23.diffusion.updated.pdf
Weakness 2: "Another question I have is the motivation ... compared to baselines."
A: Thank you for this comment. This is indeed a good idea and, in fact, a direct extension. Since our paper mainly focuses on causal discovery, we do not compare the approach with causal inference methods, but this is a good and important extension and we list it as future work.
Weakness 3: "In addition, ... and invalidate your statements."
A: Yes, we use equal variance in the experiments.
Q1: For Eq. (2), are the two terms in the wrong order?
A: The forward process corresponds to transforming $x_0$ to $x_T$, and the reverse process corresponds to the process that reverses this chain. The order in Eq. (2) follows the chain process in diffusion models.
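For reference, the standard DDPM factorization (Ho et al., 2020) that this ordering follows is
$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}), \qquad p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$$
so the forward chain runs from $x_0$ to $x_T$, while the learned reverse chain runs from $x_T$ back to $x_0$.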
Q2: You should add some references on line 192, for example [1].
A: Thank you for raising this issue. We made modifications in the new version (see line 192).
Q3: In line 196, you mentioned that "without loss of generality, we assume....", but the invertibility of $f$ and $g$ will hurt the capacity of $f$, $g$.
A: Thank you for pointing this out. We modify this to "when $f$ and $g$ are invertible, we can write ..." to remove ambiguity. This also closely relates to your concerns raised in the weaknesses section.
Q4: For line 201, if $f$ is non-linear, do you still need $Z$ to be non-Gaussian?
A: This relates to the identifiability of the model, which we in fact do not consider much in our model. The variables here are just exogenous variables.
Q5: Line 261, what is the symbol used here? Shouldn't it be a different one?
A: Thank you for this. We fixed it.
Q6: For related work, you should also cite some related work on discovery with soft constraints and diffusion model, like [2,3,4,5].
A: Some of them are already mentioned elsewhere, and [3,5] are used as baselines. As for [2,4], we discuss them in the related work.
Q7: Consider introducing the method name DAG-Diffuser before the experiment section.
A: We made this modification as suggested.
Q8: What is the term in line 403?
A: Sorry for this typo. It refers to a method used to choose an appropriate threshold.
Reference
[1] Hoyer, Patrik, et al. "Nonlinear causal discovery with additive noise models." Advances in neural information processing systems 21 (2008).
[2] Rolland, Paul, et al. "Score matching enables causal discovery of nonlinear additive noise models." International Conference on Machine Learning. PMLR, 2022.
[3] Lachapelle, Sébastien, et al. "Gradient-based neural dag learning." arXiv preprint arXiv:1906.02226 (2019).
[4] Geffner, Tomas, et al. "Deep end-to-end causal inference." arXiv preprint arXiv:2202.02195 (2022).
[5] Sanchez, Pedro, et al. "Diffusion models for causal discovery via topological ordering." arXiv preprint arXiv:2210.06201 (2022).
Thanks for the authors' response. My argument is not about the similarity of NF and diffusion models per se, but about the small difference methodology-wise. If one uses NF for $f$ and $g$ (which is a straightforward modification), it does not require significant modifications to replace it with a diffusion model.
Thank you for the responses. In fact, concerning the main difference between the two methods, you have already given very good comments: the invertibility of the functions. In this regard, our model is able to represent richer conditional distributions when admitting non-invertible functions in (1). The distributions are also discussed in equation (15). We have already made modifications to the statements in the updated PDF, and we give our sincere thanks for your advice, which greatly improves the quality of the paper in terms of its theoretical perspective. Although the replacement of NF with diffusion seems very direct, the underlying estimation approaches (NF with the Jacobian and diffusion with the likelihood) and consequently the "richness" of representable conditional distributions are clearly different. From this perspective, the seemingly "simple" replacement does make a difference, and contributes a useful tool to the community for the causal discovery task.
The authors propose using diffusion models for causal discovery and searching for the DAG under continuous optimization frameworks. The authors claim that the diffusion model has the ability to represent various functions, and the proposed causal discovery approach is able to generate graphs with satisfactory accuracy on observational data generated by either linear or nonlinear causal models. Experiments on synthetic and real-world datasets were conducted to test the proposed method.
Strengths
The authors present different experiments on synthetic and real-world datasets. The writing is clear and easy to follow.
Weaknesses
a. The method proposed in the paper is incremental, and it heavily relies on NoTears regularization. The method is just a combination of diffusion model and NoTears constraint. The contribution of the paper is limited.
b. The benefit of the method is limited, as shown in Table 1 and Table 2.
c. The abstract is too general and does not provide sufficient information to summarize the content of the article. It should provide insights into the proposed method.
Questions
See weaknesses.
Thank you for the comments. Here are our revisions based on your advice.
Qa: The method proposed in the paper is incremental, and it heavily relies on NoTears regularization. The method is just a combination of diffusion model and NoTears constraint. The contribution of the paper is limited.
A: Thank you for this comment. The novelty of our method lies in the capability of gradient-based causal discovery on a more general nonlinear structural causal model, as shown in equation (1), compared to existing ones. The diffusion model is used to simulate complex nonlinear generative causal functions and can be treated as a tool for causal reasoning under our proposed framework.
Qb: The benefit of the method is limited, as shown in Table 1 and Table 2.
A: Tables 1 and 2 mainly record the performance on linear causal models. Since our main contribution is on nonlinear models, we think that comparable performance to SOTA on linear models does not diminish the main benefits of the method. In fact, it is also empirical evidence that our method is able to complete causal discovery tasks when the underlying model is linear.
Qc: The abstract is too general and does not provide sufficient information to summarize the content of the article. It should provide insights into the proposed method.
A: This is good advice. The core insight is that we use a diffusion process to simulate nonlinear causal generative functions, which allows the method to perform causal discovery in more complex situations with a richer class of observational distributions. We modified the abstract so that the insight is stated more clearly; please see the updated version ("The underlying nonlinear causal generative process is modeled with ...").
Thank you for your response. I will keep the score unchanged.
Thank you for the response. If there are other questions, please kindly let us know and we are willing to provide more materials and revise our manuscript accordingly.
Dear all,
We express our sincere thanks for the time you spent on our manuscript. We revised our manuscript based on your comments, with the revisions summarized below.
- Abstract: the abstract is revised so that the insight is clearer. (Reviewer FcqQ)
- Section 2: more related work, including GAE, CausalVAE, and causal score matching, is discussed. (Reviewers 6nj7, eoGv, and CBSw)
- Section 3: several revisions to the statements related to the invertibility of functions (line 197) and to Proposition 1. (Reviewers 6nj7, eoGv, CBSw)
- Section 4: more experiments are added (GAE, DiffAN, NOTEARS-MLP), with the discussions revised accordingly. The pruning method of CAM is also added. (Reviewers 6nj7, eoGv, CBSw)
- Appendix: open-source links (Section C) and discussions (Section D) are added about the type of nonlinear models and several new baselines. (Reviewer 6nj7)
For details, please kindly see our responses to each reviewer.
Authors
The authors propose a diffusion-based approach to continuous causal discovery, using a DAG-GNN approach that models the non-linear functions with diffusion. This is an interesting direction with a few prior works already. Reviewers unanimously rated the paper as "weak reject", with no one strongly in favor of acceptance.
Additional Comments on Reviewer Discussion
Every reviewer engaged with the authors, with some increasing their score. But no one raised their score above "weak reject".
Reject