Demystifying amortized causal discovery with transformers
Causal discovery with supervised learning seems to bypass the need for identifiability assumptions. We find that constraints on the training data act as a prior on the test data, defining the class of causal models identifiable by the learned algorithm.
Abstract
Reviews and Discussion
This paper studies the level of generalization achievable when training a predictor to classify “X causes Y” vs. “Y causes X” from observational data. Motivated by recent works performing causal discovery using a pretrained transformer model, the work explores which cases result in predictors that generalize to graph-dataset pairs generated from unseen types of SCMs. This is mostly achieved through a set of empirical experiments on synthetic 2-node SCM data. The work derives a corollary of Hoyer et al. [7] to argue why training on multiple identifiable classes of synthetic SCM instances may help amortized causal discovery methods generalize.
Strengths
The paper takes a first step towards analysing why amortized causal discovery performs well in practice and often significantly better than classical approaches. This is generally an important direction and of interest to the field.
Weaknesses
While the motivation of this work is generally well-grounded, the contribution and argument of the work itself have several weaknesses that, in my opinion, do not justify many of the claims made in the abstract, introduction, and throughout the paper.
First, a major aspect of amortized causal discovery with transformers (referenced in the title) is that of solving structure learning tasks in high dimensions. Lopez-Paz et al. [10] already provide theoretical and empirical analyses of the bivariate case. Recent work showed that this idea can generalize to (very) large systems: the works on causal discovery with transformers cited in the paper all study significantly larger-dimensional problems (ranging from 20 to 100 variables). Despite this, the present paper limits its entire analysis to the bivariate case. Thus, it is misleading to claim the paper “demystifies amortized causal discovery with transformers”. No part of the analysis concerns multivariate causal discovery or transformers. The paper should be upfront and highlight much more clearly what its contributions are beyond Lopez-Paz et al. [10], who already study the bivariate amortized causal discovery case.
The paper repeatedly states it analyses CSIvA. However, none of the algorithmic components of CSIvA, such as the auxiliary loss it is trained on or the architecture of the predictive model, are part of the analysis. The loss function studied here (p. 3, l. 133) is the same as that used by, e.g., [13]. Hence, it would be more truthful to claim the analysis concerns general predictors trained on the classification task of X->Y vs. X<-Y, as in [10].
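For reference, the shared objective being referred to is the standard supervised cross-entropy over graph labels; a plausible form (our reconstruction from the surrounding description, not the paper's notation verbatim) is

$$
\min_\theta \; \mathbb{E}_{(D,\,G)\sim p(D,\,G)}\big[-\log q_\theta(G \mid D)\big],
$$

where $D$ is an observational dataset sampled from an SCM with causal graph $G$, and $q_\theta$ is the learned predictor over the two labels X->Y and X<-Y.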
A major component of causal discovery performance is not only identifiability of the graph from the observational distribution, but also the intractably large search problem incurred by classical score- and constraint-based methods. The question is: do transformers outperform classical methods on large problem sizes because 1) (parts of) the graphs are identifiable to them, or 2) a prediction-based approach is better at finding the identifiable edges in a large system (as opposed to doing a search)? This question motivates amortizing causal discovery in the first place, but the two-variable special case studied here is ill-suited for answering it. Since the work only studies the bivariate case, the title and claims throughout the paper, as well as their ties to the (large-scale) transformer literature, have to be recalibrated.
Section 3.2 seems unnecessary. The section only studies the generalization ability of CSIvA, which is no contribution. The takeaways (lines 195-) that “CSIvA generalizes well to test data generated by the same class of SCMs used for training” and that “it struggles when the test data are [from different SCMs]” are obvious and well-studied by CSIvA or related works with the same approach. The same applies to the insight that “training […] exclusively on LiNGAM-generated data is equivalent to learning the distribution p(.|D, LiNGAM)”, implying identifiability.
Questions
In the experiments, is the data standardized? Standardization can affect identifiability (see Reisach, 2021, “Beware of the simulated DAG! Causal Discovery Benchmarks May Be Easy To Game”).
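To make the concern concrete, here is a minimal sketch (ours, not from the paper) of the marginal-variance cue in simulated linear ANMs that standardization removes, following the argument of Reisach et al.:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Bivariate linear ANM, X -> Y, with unit-variance noise terms.
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

# In raw simulated data the effect typically has a larger marginal variance
# than the cause, so sorting variables by variance can "game" the benchmark.
print(np.var(x), np.var(y))  # ~1.0 vs. ~5.0

# Standardizing removes this cue, which can change which methods look good.
x, y = (x - x.mean()) / x.std(), (y - y.mean()) / y.std()
print(np.var(x), np.var(y))  # both ~1.0
```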
Limitations
- The “theoretical result” (Proposition 1) is a simple corollary of Hoyer et al. [7]. The paper otherwise makes no theoretical contribution to the problem underlying amortized causal discovery itself.
- It is unclear whether “randomly initialized MLPs” are sensible nonlinear functions to use for constructing nonlinear mechanisms and non-Gaussian noise distributions. The fact that a few prior works used them is not a good reason. The shape and scale of randomly initialized neural network functions depend heavily on the activation function and weight distribution. The functions in these experiments could be anything from approximately constant or linear to very jumpy (a minimal illustration follows below). Please provide additional motivation or evidence for why this is a good choice, and what hyperparameters are used, or consider as an alternative, e.g., samples from a GP, which are smooth and have an interpretable length-scale parameter, also in high dimensions.
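A minimal sketch (ours; hyperparameters such as `weight_scale` are hypothetical, not taken from the paper) illustrating how strongly a randomly initialized MLP mechanism depends on the activation and weight scale:

```python
import numpy as np

def random_mlp_mechanism(x, hidden=64, weight_scale=1.0, activation=np.tanh, seed=0):
    """One-hidden-layer MLP with randomly drawn weights, used as a mechanism f(x)."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(scale=weight_scale, size=(1, hidden))
    b1 = rng.normal(scale=weight_scale, size=hidden)
    w2 = rng.normal(scale=weight_scale / np.sqrt(hidden), size=(hidden, 1))
    return (activation(x[:, None] @ w1 + b1) @ w2).ravel()

x = np.linspace(-3, 3, 200)
# weight_scale=0.1 gives a nearly linear f; weight_scale=5.0 with tanh gives a
# saturated, step-like f; swapping in ReLU gives piecewise-linear functions.
# The induced class of mechanisms thus hinges on these (unreported) choices.
for scale in (0.1, 1.0, 5.0):
    f = random_mlp_mechanism(x, weight_scale=scale)
```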
We thank the reviewer for the time dedicated to our paper. One important criticism is that we should better highlight our contribution in comparison to Lopez-Paz et al. (2015): this is addressed in the first bullet of our response to the Weaknesses section.
Weaknesses
- “The paper should be upfront and highlight much more clearly what its contributions are beyond Lopez et al [10]”. We highlight significant differences between Lopez-Paz et al. and our work. In their paper, Lopez-Paz et al.:
- Study upper bounds on the excess risk of a binary classification problem, i.e., mapping distributions to bivariate causal graphs, via the Rademacher complexity of the hypothesis space (Theorem 3 in their paper).
- Perform experiments on limited real data and on synthetic data exclusively generated in the setting of nonlinear ANMs.
Our work, instead, is aimed at understanding when supervised causal discovery works, in a principled manner rooted in identifiability theory. In the motivation of their work, Lopez-Paz et al. claim that supervised causal structure learning “would allow dealing with complex data-generating processes, and would greatly reduce the need of explicitly crafting identifiability conditions a-priori.” Our work shows that CSIvA's generalization is still limited by results from identifiability theory, and that identifiability assumptions still need to be crafted a priori. In particular, our Example 2 shows that CSIvA's success or failure at correct inference in non-identifiable settings is determined by the variety of SCMs used during model training, and our Hypothesis 1 formalizes this conjecture. On this basis, we present:
- Theory adapted from Hoyer et al. that defines the set of identifiable SCMs;
- Experiments showing that the class of SCMs identifiable by CSIvA is constrained to that of identifiable SCMs according to the theory of Proposition 1 and our Hypothesis 1 (see Figures 3a and 3b);
- Experiments showing when CSIvA succeeds and fails to generalize at test time, i.e., on in-distribution and OOD test data;
- We show that training on data from multiple SCMs that are identifiable according to our Proposition 1 results in an algorithm with better empirical generalization performance.
- “A major component of causal discovery performance is not only identifiability of the graph from the observational distribution, but also the intractably large search problem […]” We don’t provide experiments in multivariate settings as our goal is not to study CSIvA's scalability. Our focus is on whether supervised causal discovery respects known identifiability results, and bivariate graphs provide a minimally sufficient setting for studying identifiability (we restate the relevant model class after this list). In particular, our Proposition 1 summarises important bivariate identifiability results (Hoyer et al., Zhang and Hyvarinen), and our empirical study confirms that CSIvA is constrained by the identifiability results of Proposition 1. Concerning multivariate settings, Peters et al. propose identifiability theory for multivariate additive noise models as a straightforward generalization of Hoyer et al.; hence, the theory of Proposition 1 underlying our empirical findings is valid in arbitrarily high dimensions.
- “Section 3.2 seems unnecessary. The section only studies the generalization ability of CSIvA, which is no contribution.” In Section 3.2 we study in-distribution and out-of-distribution generalization. In-distribution generalization is studied in Ke et al. (2023a), as recognized in our paper (L196). The main point of this study, in our paper, is to validate our CSIvA implementation (as specified in L196-197 and footnote 1), as the authors of the CSIvA paper did not provide public code, which required a from-scratch implementation on our side. Concerning OOD generalization, this is not studied in Ke et al. (2023a): in their experiments, the only differences between train and test data are (i) the algorithm for synthetic graph generation and (ii) some variation in the parameter of the Dirichlet distribution they employ in some experiments. However, identifiability theory is sensitive to mechanism and distribution assumptions, which Ke et al. do not vary but which our experiments in Section 3.2 study.
- “The takeaways (lines 195-) that ‘CSIvA generalizes well to test data generated by the same class of SCMs used for training’ and that ‘it struggles when the test data are [from different SCMs]’ are obvious and well-studied ...” Due to what we wrote just above, we can safely say that this is neither obvious nor well studied. Lopez-Paz et al. (2015) limit their tests to nonlinear ANMs; Lorch et al. never test on classes of mechanisms different from those used in training, and similarly Li et al., Ke et al., and Lippe et al. None of these works mentions insights that relate to our Example 2. If the reviewer has precise references supporting the claim that our results are obvious and well-studied, we kindly ask them to provide these.
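For context on the model class invoked in the second bullet above, the bivariate additive noise model behind these identifiability results can be stated informally (our paraphrase of Hoyer et al. [7], not the paper's Proposition 1 verbatim):

$$
Y = f(X) + N, \qquad N \perp X,
$$

where the causal direction $X \to Y$ is identifiable from the joint distribution for all but a measure-zero set of triples $(f, p_X, p_N)$, the linear-Gaussian case being the best-known exception.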
Questions
We standardize the data, as written in L163.
Limitations
- “The “theoretical result” (Proposition 1) is a simple corollary of Hoyer et al [7].” We openly discuss this relation and the relation of Proposition 1 with Zhang and Hyvarinen [8]. We offer to rename Proposition 1 to Corollary 1, making the relation with previous works more explicit even in the naming.
- “It is unclear whether “randomly initialized MLPs” are sensible nonlinear functions to use for constructing nonlinear mechanisms and non-Gaussian noise distributions. …” In the PDF attached to this rebuttal we replicate the experiments of Figure 2a and Figure 4 of the paper, this time using Gaussian-process-generated mechanisms (GP data) with a unit RBF kernel (sketched below). These empirical results agree with the findings in the paper on MLP-generated data. We will replicate all the experiments using data with GP-generated mechanisms and include them in the appendix.
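A minimal sketch of how such GP mechanisms can be drawn (our illustration; the exact setup in the rebuttal PDF may differ):

```python
import numpy as np

def gp_mechanism(x, length_scale=1.0, seed=0):
    """Sample one function f ~ GP(0, RBF) evaluated at the input points x."""
    rng = np.random.default_rng(seed)
    sq_dists = (x[:, None] - x[None, :]) ** 2
    K = np.exp(-sq_dists / (2.0 * length_scale**2))
    # Jitter on the diagonal for numerical stability of the Cholesky factor.
    L = np.linalg.cholesky(K + 1e-8 * np.eye(len(x)))
    return L @ rng.normal(size=len(x))

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=500)
y = gp_mechanism(x) + 0.1 * rng.normal(size=500)  # nonlinear ANM: Y = f(X) + N
```

Unlike random MLPs, the length-scale parameter directly controls the smoothness of the sampled mechanisms, which addresses the interpretability concern raised by the reviewer.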
Thank you for your response. I will maintain my score of the work.
This paper explores why causal discovery from observational data, particularly with CSIvA, a transformer-based model, can achieve competitive performance despite seemingly avoiding the explicit assumptions that traditional methods make for identifiability. The authors demonstrate that constraints on the training data distribution implicitly define a prior on the test observations. When this prior is well-suited, the underlying model can be identifiable. In other words, prior knowledge of the test distribution is encoded in the training data through constraints on the structural causal model governing data generation.
Additionally, they provide a theoretical basis for training on observations sampled from multiple classes of identifiable SCMs, a strategy that enhances test generalization to a wide range of causal models. They show that training on mixtures of causal models offers an alternative approach that is less reliant on assumptions about the mechanisms.
Strengths
This paper bridges the gap between existing theoretical results on identifiability and practical observations. More importantly, it moves away from classical causality settings and quite restricted models, shifting towards more mainstream and modern models like transformers. This opens a pathway for causality research to integrate with large language models (LLMs), which represent the state-of-the-art in a wide range of applications.
Weaknesses
The presentation can be significantly improved. Since the paper aims to offer novel insights, it is crucial to organize the arguments, theoretical results, and experimental findings effectively to support these insights.
Questions
Training on data from multiple causal mechanisms and/or noises intuitively improves generalization, as the model is exposed to more changes and diverse data. This phenomenon is also observed and theoretically analyzed in domain adaptation. Domain adaptation is simpler because there is a meta-distribution or meta-process that governs how distributions change across domains. Similarly, large language models (LLMs) benefit from training on diverse and vast data sources and their underlying generating processes. I wonder if the authors have theoretical results or insights on this aspect, with causal models trained on vast data generated by diverse causal mechanisms, potentially without a common meta-process.
Limitations
n.a.
We thank the reviewer for their time and effort in analyzing our work.
Weaknesses
The only comment present in the Weaknesses section is that “The presentation can be significantly improved. Since the paper aims to offer novel insights, it is crucial to organize the arguments, theoretical results, and experimental findings effectively to support these insights.” In the absence of more specific feedback, any answer we could give would not be on point.
Questions
- “Training on data from multiple causal mechanisms and/or noises intuitively improves generalization, as the model is exposed to more changes and diverse data”: this is true, and related to an important point made in our paper that we want to remark on: prior to our work, there was no research on whether training on multiple mechanisms would actually be beneficial, or whether instead, due to well-known boundaries posed by identifiability theory, it would be harmful. Thus, although the conclusion that training on mixed SCMs is beneficial may seem obvious, it is not, and it is a central subject of our work.
- “Domain adaptation is simpler because there is a meta-distribution or meta-process that governs how distributions change across domains. Similarly, large language models (LLMs) benefit from training on diverse and vast data sources and their underlying generating processes. I wonder if the authors have theoretical results or insights on this aspect with causal models training on vast data generated by diverse causal mechanisms potentially without a common meta-process.” It is hard to make unifying statements on such a large variety of topics as those touched by this question. Here is our thought, to be taken with a grain of salt. Algorithmically, causal discovery on linear, nonlinear ANM, and post-nonlinear models has at its core one common procedure, which we may call a meta-procedure: if one considers cornerstone classical methods for causal discovery like RESIT (Peters et al., 2014) or Direct-LiNGAM (Shimizu et al., 2011), (roughly) consisting of regression + independence testing over the residuals, it appears that causality research has found that a single meta-procedure (again, regression + independence testing of the residuals) is one of the best approaches to causal discovery, both on linear and nonlinear additive noise models (the objects of study of the RESIT and Direct-LiNGAM papers). This suggests that a good learner should be able to learn one algorithm that works on nonlinear ANM data and seamlessly adapt it to linear ANM data, thus achieving good OOD generalization at least in the task of training on nonlinear data and testing on linear data (or vice versa). However, we do not observe this to happen, which is worth noting (a minimal sketch of this meta-procedure follows below). We ask for the reviewer's opinion on this insight, as we believe it could make a valuable addition to our work. We thank the reviewer for the point made, which sparked this discussion.
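To make the meta-procedure concrete, here is a minimal bivariate sketch (ours, not the RESIT or Direct-LiNGAM implementations; RESIT uses an HSIC independence test, which we approximate here with a simple biased HSIC statistic):

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def hsic(a, b, sigma=1.0):
    """Biased HSIC statistic with RBF kernels; small values suggest independence."""
    n = len(a)
    K = np.exp(-((a[:, None] - a[None, :]) ** 2) / (2 * sigma**2))
    L = np.exp(-((b[:, None] - b[None, :]) ** 2) / (2 * sigma**2))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def anm_direction(x, y):
    """Regress each way; prefer the direction whose residual is more
    independent of the putative cause (regression + residual independence)."""
    def dependence(cause, effect):
        reg = KernelRidge(kernel="rbf", alpha=0.1).fit(cause[:, None], effect)
        residual = effect - reg.predict(cause[:, None])
        return hsic(cause, residual)
    return "X->Y" if dependence(x, y) < dependence(y, x) else "Y->X"

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=300)
y = np.tanh(2 * x) + 0.2 * rng.normal(size=300)  # nonlinear ANM
print(anm_direction(x, y))  # expected: X->Y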
Here is my view. Since everything has a beginning and originates from somewhere—whether it's a big bang or multiple big bangs—it's reasonable to think that, ultimately, a meta-process governs all processes. At the mid or lower levels, many processes may appear independent. A causal discovery or causal representation learning algorithm aims to weave a sweater that stretches across data from multiple domains, capturing the dynamics of change rather than mere correlations. When these datasets are closely related and governed by a mid-level meta-process, the weaving is easier. However, when they are not, the sweater becomes more stretched. A measurement could be proposed to capture this, indicating whether the data were influenced by a mid-level meta-process. A more stretched sweater isn’t necessarily a drawback, as greater diversity might actually enhance generalization.
We thank the reviewer for the time dedicated to our rebuttal. Concerning the comment and the specific suggestion that “A measurement could be proposed to capture this, indicating whether the data were influenced by a mid-level meta-process”, we believe this is an interesting point, but nevertheless beyond the scope of our paper: it is not a trivial extension of our work, but a research project of its own. Our work could be interpreted as a building block of what the reviewer suggests: we define the post-ANM as a class of SCMs that captures structural causal models that differ in some aspects but share common assumptions, most importantly the additivity of the noise terms (in the PNL case, this holds up to an invertible function). In this case, the post-ANM assumption plays the role of a “shared meta-process”, which we interpret as the underlying model. We observe that amortized training on a large variety of assumptions sharing part of the underlying generating process, as in the case of the post-ANM, is beneficial for inference (in agreement with the theory of Section 3.4 and the experiments of Section 3.5). Hence, the ability to define reasonable and large enough model classes for amortized inference is surely an important point of our work. In case our paper gets accepted, one possible direction for future work is to provide priors to the network about shared assumptions (i.e., a shared meta-process) of the structural causal model generating the inference data: this is feasible with transformers, which are a suitable architecture for specifying prior knowledge on the data in the form of context (see e.g. https://arxiv.org/pdf/1909.05858).
The paper studies the behaviour of amortised (supervised) causal discovery methods based on different training data distributions and its relation to more traditional causal discovery and the related identifiability theory. The authors empirically validate the intuitions about supervised causal discovery and generalisation of supervised learning methods.
Strengths
- The paper studies the behaviour of amortised causal discovery methods which have previously been unstudied.
- The empirical insights generally validate the intuition about identifiability and generalisation. Some examples give interesting insights into the identifiability and performance in the case of mixed assumptions.
Weaknesses
- The paper is a purely empirical study of the generalisation behaviour of supervised causal discovery methods, validating general intuition without thorough novel insights.
- Given the empirical nature of this paper, I'd have expected to see a more thorough comparison, e.g. setting up a leave-one-out generalisation study or more in-depth analyses of the prediction on interesting individual SCMs such as the non-identifiable example or the performance of the prediction from new samples from a training set SCM.
Questions
- You seem to be surprised that supervised causal discovery methods can infer graphs when trained on mixed data. Isn't that somewhat obvious after results from [29] that show that transformers can identify valid assumptions from data?
- You make fairly strong statements about classical methods not being applicable because the underlying assumptions cannot be verified. I'd disagree with this as they still turn out to be useful in practice.
- How do you think [1] relates to the behaviour of supervised causal discovery methods, given that transformers can perform approximate Bayesian inference [2].
[1] Dhir, Anish, Samuel Power, and Mark van der Wilk. "Bivariate Causal Discovery using Bayesian Model Selection." Forty-first International Conference on Machine Learning.
[2] Hollmann, Noah, et al. "TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second." The Eleventh International Conference on Learning Representations.
Limitations
n/a
We thank the reviewer for their comments and effort in understanding our paper. Before proceeding further, we note the conciseness of the Weaknesses section, where two generic criticisms are expressed in four lines of text. In the absence of more articulated concerns, we respond as best we can to the comments presented therein.
Weaknesses
- “The paper is a purely empirical study of the generalisation behaviour of supervised causal discovery methods, validating general intuition without thorough novel insights”. The claim that we "validate general intuition" is a vague statement, hard to address in a satisfactory way: to help foster a more articulated discussion, we present a summary of our contributions. The goal of our work is to understand when supervised causal learning works, in a principled manner rooted in identifiability theory. Our Example 2 shows that CSIvA's success or failure at correct inference in non-identifiable settings is determined by the variety of SCMs used during model training, and our Hypothesis 1 formalizes this conjecture. On this basis, we present:
- Proposition 1, a theoretical statement adapted from Hoyer et al. (2008), which defines the set of identifiable SCMs.
- Experiments showing that the class of SCMs identifiable by CSIvA is constrained to that of identifiable SCMs according to the theory.
- Experiments showing when CSIvA succeeds and fails to generalize at test time: on in-distribution and OOD test data, respectively.
- We show that training on data from multiple SCMs that are identifiable according to our Proposition 1 results in an algorithm with better empirical generalization performance.
If the reviewer believes that all of these findings are general intuition, we kindly ask: (a) please argue this more specifically, as the claim that they are general intuition is very generic; (b) please provide scientific references where this intuition is exposed in a way so comprehensive as to invalidate our contribution.
- “Given the empirical nature of this paper, I'd have expected to see a more thorough comparison, ...” Concerning the requested leave-one-out study, this is a sensible approach to probe test generalization only when zero or few test datasets are available for the optimized model: since we are dealing with synthetic data, we instead have an unlimited supply of test datasets. Indeed, every model is tested on 1500 datasets unseen during training, which is the standard machine learning way to study generalization. Concerning the request for more “interesting individual SCMs such as non-identifiable examples”, while we agree that this would be nice, currently known theory only tells us which models are identifiable, not which ones are non-identifiable, with the notable exception of linear Gaussian data, which we do indeed analyse (see Figure 3b and the relative discussion). Any other example, such as the one provided in Example 1, must be found analytically by pen-and-paper computation, which is not a feasible option. Finally, concerning the request for predictions on new samples from a training-set SCM, this is what our in-distribution generalization study is about; see Figure 1 (a, b, c).
Questions
- “You seem to be surprised that supervised causal discovery methods can infer graphs when trained on mixed data. Isn't that somewhat obvious after results from [29] that show that transformers can identify valid assumptions from data?” Our mixed-training study of Section 3.5 is motivated by our observations in Section 3.2 that CSIvA presents good in-distribution generalization while it fails on OOD tasks: given that training on a larger variety of SCMs allows CSIvA to operate on in-distribution test data more frequently, our expectation is that test performance in our experiments benefits from it. This is well expressed in the Implications paragraph of Section 3.2, particularly L197-199, and in Section 3.5, L294-297: for these reasons, we disagree with the reviewer that we are surprised to discover that CSIvA has good in-distribution generalization properties after mixed training, as this is exactly what we specify to be the expected outcome.
- “You make fairly strong statements about classical methods not being applicable because the underlying assumptions cannot be verified. I'd disagree with this as they still turn out to be useful in practice.” We don’t make such statements: if we are mistaken, please point directly to where this happens, as we are ready to remove any claims that find us disagreeing. What we do instead is highlight strengths and weaknesses of classical methods compared to supervised learning-based approaches, based on our empirical findings. On the one hand, classical methods are more reliant on assumptions about the mechanisms than supervised learning-based methods need be: as we show that mixed training on mechanisms is a theoretically principled practice (by Proposition 1), a CSIvA model trained on LiNGAM, nonlinear ANM, and PNL data clearly has less restrictive requirements on the mechanisms than classical methods. On the other hand, we notice poor generalization of CSIvA on unseen noise distributions, in contrast to classical methods, which are mostly agnostic about the distribution of the error terms. A CSIvA model agnostic about noise distributions would require training on SCMs covering all existing noise distributions, which is arguably impossible: our reasoning unveils that classical methods appear to have an advantage, in this sense. These arguments are presented in the paper; see L319-323, L334-340, and the abstract at L13-16.
- “How do you think [1] relates to the behaviour of supervised causal discovery methods, given that transformers can perform approximate Bayesian inference [2].” Transformers’ ability to do Bayesian inference is far beyond the scope of this paper.
This paper conducts an empirical study of the performance of supervised causal discovery methods, their generality, and learnability vs. causal structure identifiability. The scope is the bivariate case, with controlled mechanisms and noise establishing the SCMs for the training and testing data.
In my opinion, this paper gives two findings:
- a previous claim (Lopez-Paz et al. [10]) said that, by using the supervised learning-based approach, the performance of causal discovery can exceed the boundary of identifiability, which is not true;
- by using diverse training data (diverse = diverse mechanisms + diverse noise), supervised causal discovery can achieve better OOD performance.
Strengths
1 - the study of supervised causal learning, especially the DNN-based approach, is timely and important.
2 - the experiment setup is a good starting point. To my knowledge, this is the first paper to study the performance and boundary of supervised causal discovery methods; setting up the bivariate case, with the configuration in terms of mechanism + noise, is valid.
3 - some findings are interesting, which can potentially benefit the community for further algorithm design.
Weaknesses
1 - part of the study can be summarized as learnability vs. identifiability, or, in my opinion, one question within this category is "when and how can learnability exceed the boundary of identifiability?" In this regard, the current findings are still very limited and need to be further consolidated. Also in this regard, a related work [1] is missing; I think it would be helpful for this work.
2 - although not explicitly claimed, this paper suggests that "CSIvA is capable of in-distribution generalization": is this true? And is it true just for the bivariate case, or generally applicable?
3 - I suggest using the term supervised-based approach, or supervised causal learning (SCL), rather than amortized causal discovery, which is more to the point.
4 - one claim, "we conclude that the post-ANM is generally identifiable, which suggests that the setting of Example 2 is rather artificial": I disagree. The space of all continuous distributions for which the bivariate post-ANM is non-identifiable is contained in a 2-dimensional space; it is thus a submanifold of the entire distribution space, and its measure is 0. However, this is only a mathematical claim and lacks real-world relevance. I would argue that the setting of Example 2 is quite valid in real-world settings; the linear Gaussian setting, too, is commonly adopted in the real world but had not been discussed in this work.
5 - potential conflict between Sections 3.3 and 3.4: Section 3.3 shows that mixing two training datasets (with different settings) together significantly compromises the SCL's performance; however, Section 3.4 shows that the more diverse the training data, the greater the gain in the OOD setting.
[1] Dai, H., Ding, R., Jiang, Y., Han, S., & Zhang, D. (2023). Ml4c: Seeing causality through latent vicinity. In Proceedings of the 2023 SIAM International Conference on Data Mining (SDM) (pp. 226-234). Society for Industrial and Applied Mathematics.
Questions
Are Figures 3a and 3b reversed?
Limitations
N/A
We thank the reviewer for their comments and the time taken analysing our paper. Before proceeding further, we note that one important criticism from the reviewer appears to be that “the current findings are still very limited, need to be further consolidated”. We point to the first bullet of our response to the Weaknesses section for an answer.
Weaknesses
- “part of the study can be summarized as learnability vs. identifiability […] the current findings are still very limited, need to be further consolidated. […] a related work [1] is missing.” We agree that there is a missing citation, which we will add to the paper. Concerning the criticism that our “findings are very limited and need to be further consolidated”, this is a generic statement, hard to address in a satisfactory way. We present a summary of our contributions below, which can be taken as ground to articulate this concern precisely. The goal of our work is understanding when supervised causal learning works, in a principled manner rooted in identifiability theory. Example 2 shows that CSIvA's success or failure at inference in non-identifiable settings is determined by the variety of SCMs used during model training, and our Hypothesis 1 formalizes this conjecture. On this basis, we present:
- Proposition 1, a theoretical statement adapted from Hoyer et al. which defines the set of SCMs identifiable from observational data.
- Experiments showing that the class of SCMs identifiable by CSIvA is constrained to that of identifiable SCMs according to the theory of Proposition 1 and our Hypothesis 1 (see Figures 3a and 3b).
- Experiments showing when CSIvA succeeds and fails to generalize at test time - in-distribution and OOD test data, respectively.
- We show that training on data from multiple SCMs that are identifiable according to our Proposition 1 results in an algorithm with better empirical generalization performance.
Further, we clarify that our findings are general in the following sense:
- CSIvA is an archetypal model, as its learning objective, the conditional distribution over the space of graphs given the data, is shared with the majority of existing methods for amortized causal discovery (Lopez-Paz et al., Lorch et al., Lippe et al., Li et al.). Thus, our novel findings apply to an entire class of common methods in the literature.
- The theoretical ground of our study generalizes to multivariate causal discovery. In particular, Peters et al. (2014) propose identifiability theory for multivariate additive noise models as a straightforward generalization of Hoyer et al. (2008), the main reference of our Proposition 1.
- “one claim "we conclude that the post-ANM is generally identifiable, which suggests that the setting of Example 2 is rather artificial" I disagree. […] example 2 is quite valid in real-world setting, or the linear gaussian setting, is also commonly adopted in real-world, but had not been discussed in this work.” The fact that under the post-ANM assumption non-identifiable SCMs belong to a zero-measure region is the definition of identifiability provided by Hoyer et al., commonly adopted in the causality community and in our paper. In this sense, identifiability of the post-ANM means that samples from the post-ANM are almost surely identifiable (almost surely in the formal sense): as such, non-identifiable SCMs like our Example 2 must be artificially crafted to sample from a zero-measure region, which is why we use the word artificial. To avoid potential confusion, we propose to revise the sentence, specifying that our Example 2 almost surely does not happen under the post-ANM assumption, which formally clarifies what we mean by saying that it is artificial. Concerning the case of linear Gaussian data and the reviewer's claim that it has not been discussed in our work, we point to our experimental results of Figure 3b, where we show that CSIvA is unsuitable for inference on linear Gaussian data, in agreement with the identifiability statement of Proposition 1 (we write out this classic non-identifiable case after this response list).
- “potential conflict between section 3.3 and 3.4 [...]” We believe the reviewer refers to Section 3.5 (instead of 3.4), as this is where we analyse training on diverse SCMs. This is closely related to the point made in the previous bullet. Section 3.3 shows that mixed training on data from SCMs sampled from the non-identifiable zero-measure region of the post-ANM compromises SCL's performance, coherently with our Hypothesis 1. Instead, Section 3.5 shows that training on datasets generated by identifiable SCMs (i.e., not from the zero-measure region) benefits generalization, coherently with our findings in Section 3.2 about good CSIvA in-distribution generalization. So the two sections are complementary, not in contrast, as they consider CSIvA's behavior when trained on samples drawn from complementary sets (that of identifiable SCMs, and that of non-identifiable SCMs, under the post-ANM hypothesis).
- “although not explicitly claimed, this paper suggests that "CSIvA is capable of in-distribution generalization' [...]" We do explicitly claim this (L212-213). In particular, our experiments are in the bivariate setting, while Ke et al.'s experiments are in the multivariate setting. Note that the in-distribution generalization experiments, in our case, mostly serve the purpose of validating our CSIvA implementation (L196-197 and footnote 1), as the authors of the CSIvA paper did not provide public code, which required a from-scratch implementation on our side.
- “I suggest to use supervised causal learning (SCL) rather than amortized causal discovery” This nomenclature would generate confusion with other methods that perform supervised causal discovery but are not suitable for amortized inference; see L73-74, with references [18, 19, 20, 21, 22].
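As promised above, the classic non-identifiable case can be written out explicitly (a standard derivation, not taken from the paper). The forward linear-Gaussian SCM

$$
X \sim \mathcal{N}(0, \sigma_X^2), \qquad Y = aX + N, \quad N \sim \mathcal{N}(0, \sigma_N^2), \; N \perp X,
$$

admits an exact backward linear-Gaussian SCM

$$
X = bY + \tilde{N}, \qquad b = \frac{a\,\sigma_X^2}{a^2\sigma_X^2 + \sigma_N^2}, \qquad \tilde{N} \perp Y,
$$

since the residual $\tilde{N} = X - bY$ and $Y$ are jointly Gaussian and uncorrelated, hence independent; no method can therefore prefer either direction from observational data alone.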
Questions
Yes, Figures 3a and 3b are reversed in the caption, but not in the text. Thank you.
We thank the reviewers for the time spent reading and understanding our paper, as well as for the insightful comments and questions. Our work is well received in terms of soundness and presentation quality (with scores ranging from 2 to 3). In contrast, we notice a more polarized view regarding the contributions of our paper (with grades ranging from 1 to 3). Given the absence of criticisms shared by more than one reviewer, we leave our comments to the individual responses. In the PDF attached to the rebuttal, we present the experiments suggested by R 7Mr6, replicating some of the empirical analysis of the paper using Gaussian-process-generated nonlinear mechanisms.
We use this space to provide a minimal bibliography of the references we use across the whole rebuttal.
References (alphabetic order of first authors)
Hoyer et al. (2008). Nonlinear causal discovery with additive noise models.
Ke et al. (2023a). Learning to Induce Causal Structure.
Li et al. (2020). Supervised Whole DAG Causal Discovery.
Lippe et al. (2022). Efficient neural causal discovery without acyclicity constraints.
Lopez-Paz et al. (2015). Towards a learning theory of cause-effect inference.
Lorch et al. (2022). Amortized inference for causal structure learning.
Montagna et al. (2023a). Assumption violations in causal discovery and the robustness of score matching.
Peters et al. (2014). Causal Discovery with Continuous Additive Noise Models.
Shimizu et al. (2011). DirectLiNGAM: A Direct Method for Learning a Linear Non-Gaussian Structural Equation Model.
Zhang and Hyvarinen (2009). On the Identifiability of the Post-Nonlinear Causal Model.
Dear Reviewers,
We sincerely appreciate your constructive suggestions and have made every effort to address your concerns. There are only three days left until the deadline for reviewer-author discussions. Please don’t hesitate to comment on our rebuttal if further clarification would be helpful.
Best regards,
Authors
This paper aims to identify and provide novel insights about why supervised causal discovery can outperform traditional approaches. Although this is an important research question, most reviewers agree that the current empirical evidence is not strong enough to support the claims and is limited to the bivariate case. Thus, the paper is not ready for acceptance in its current form.