Demystifying amortized causal discovery with transformers
Causal discovery with supervised learning seems to bypass the need for identifiability assumptions. We find that constraints on the training data act as a prior on the test data, defining the class of causal models identifiable by the learned algorithm.
Abstract
Reviews and Discussion
The paper presents a comprehensive analysis of CSIvA in the context of bivariate causal models. The authors first demonstrate that while CSIvA generalizes well in distribution, it fails to generalize out-of-distribution, whether the shift concerns mechanisms or noise distributions. By hypothesizing that CSIvA can identify the SCM underlying the data generation process, the authors challenge the previous claim that “supervised learning methods can address complex data generation and reduce the need for explicit assumptions.” Additionally, the paper takes an empirical step forward by showing that CSIvA has the potential to capture multiple identifiable causal structures within a single network, supported by theoretical claims that post-additive noise (post-ANM) models are generally identifiable.
Strengths
The paper is well-organized and clearly presented. The authors articulate their points effectively, with implications interwoven throughout the sections and the corresponding background provided in the appendix. The experiments on synthetic datasets are comprehensive, demonstrating that the performance of CSIvA is non-trivial. The empirical results support the proposed hypothesis. The experiments demonstrate that CSIvA’s generalization ability can be significantly enhanced by incorporating multiple distributions into the training dataset.
This paper potentially addresses the limitations of supervised learning in causal discovery, particularly the challenge of applying pre-trained models to datasets that may be out-of-distribution. Specifically, the highlight of this paper is that it offers a possible way to make supervised learning-based causal discovery practical: if the training dataset is sufficiently comprehensive, the model’s predictions are somewhat reliable.
Weaknesses
In my view, the largest weakness is the lack of evaluation on real-world datasets. Since the experiments already demonstrate the potential of achieving a relatively robust predictor by incorporating multiple distributions in the training dataset, an investigation and comparison with traditional methods on real-world datasets would be valuable and appreciated. For example, https://www.cmu.edu/dietrich/causality/projects/causal_learn_benchmarks/. This is just an example; the authors need not be constrained by it and can use any dataset that is valid for causal discovery. If there are any reasons why experiments on real-world datasets may not be feasible or practical, I would be very interested in hearing them and would welcome the explanation with an open mind.
Another limitation, also discussed by the authors, is that the scope of the analysis is rather limited: a bivariate model without interventions.
Questions
I would appreciate clarification on how the method handles independent variable pairs. Does it require a training dataset of independent variables, or is it primarily designed to identify a DAG from the Markov equivalent class, meaning independent cases don't need to be considered? Addressing this point explicitly could be helpful for readers with varying levels of expertise in the field.
Concerning the lack of real-world experiments, we thank the reviewer for raising this point. We agree that these experiments would strengthen our results. We refer the reviewer to the common response for an analysis of the additional experimental results; in this regard, we want to express our appreciation for pointing us to the CMU collection of real datasets, which was of great help.
Concerning the lack of multivariate experiments, an extensive discussion is presented in the common response.
Finally, we express our appreciation for the reviewer’s valuable insights, which sparked the discussion on the nature of bivariate experiments and the empirical analysis on real data. We incorporated several of their suggestions in the updated pdf. If the concerns are positively addressed, we kindly ask the reviewer to consider increasing their score. We remain available for further clarification and debate.
I appreciate the authors for their dedicated effort in the rebuttal. The updated submission has addressed my concern regarding the real-world experiment. The remaining question, as I asked in the first review:
I would appreciate clarification on how the method handles independent variable pairs. Does it require a training dataset of independent variables, or is it primarily designed to identify a DAG from the Markov equivalent class, meaning independent cases don't need to be considered? Addressing this point explicitly could be helpful for readers with varying levels of expertise in the field.
I don't know if I am missing anything, since I cannot find the answer in the revised submission.
We apologise for failing to answer this question in the first rebuttal; that was unintentional. Let us present our response in this comment. Considering independent variables increases the number of classes the transformer can output (3 instead of 2): X → Y, Y → X, or X ⊥ Y (the empty graph). For this to be possible, training on independent pairs is required, as the algorithm is otherwise missing information on one of the three classes.
We note that predicting the independence of variables (i.e., an empty graph) does not require information on the class of the structural causal model underlying the data generation, as the empty graph defines a Markov equivalence class with a unique element, that can be identified even without knowing the SCM. In particular, disambiguating between empty and connected graphs can be done by testing independence between the input variables: previous works phrase independence testing as a classification task and show that this can be learned via deep neural networks (Bellot & van der Schaar, 2019; Sen et al., 2017). We empirically show that this can also be done by CSIvA in the context of causal discovery, providing additional experiments where the architecture is trained and tested on both connected and empty graphs (with linear, nonlinear, and post-nonlinear data): in all the considered scenarios, the average test SHD is close to 0, indicating that adding the class of empty graphs does not degrade performance.
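As a concrete illustration, the following is a minimal sketch of a three-class training-data generator of the kind described above (hypothetical code, not the pipeline used in the paper; the connected classes here use a nonlinear additive noise model):

```python
import numpy as np

def sample_pair(n=500, rng=None):
    """Return a bivariate dataset and its label:
    0 = X -> Y, 1 = Y -> X, 2 = empty graph (X independent of Y)."""
    rng = rng or np.random.default_rng()
    label = int(rng.integers(3))
    if label == 2:  # empty graph: independent marginals
        x, y = rng.standard_normal(n), rng.standard_normal(n)
    else:           # nonlinear ANM in one of the two directions
        cause = rng.standard_normal(n)
        effect = np.tanh(cause) + 0.3 * rng.standard_normal(n)
        x, y = (cause, effect) if label == 0 else (effect, cause)
    return np.stack([x, y], axis=1), label
```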
An updated version of the manuscript including this discussion and the related experiments (Appendix D.5 and Figure 11) has been uploaded.
This work explores supervised learning for causal discovery, a promising direction to potentially replace traditional score-based or constraint-based methods. However, many gaps remain unclear when considering the supervised learning paradigm for this task. To clarify these gaps, we conduct extensive experiments using a specialized method, CSIvA, on a variety of synthetic datasets. The results of these experiments are presented, along with a detailed discussion of their implications.
Strengths
Some insights from the experiments are provided.
- For in-distribution generalization, CSIvA effectively generalizes to unseen samples from the training distribution.
- For out-of-distribution generalization, it is unsurprising that CSIvA performs poorly.
- CSIvA's performance depends on whether the data is sampled from an identifiable SCM.
- Training CSIvA on mixtures of SCMs leads to improved generalization.
Weaknesses
I am generally open to, encouraging, and expecting new paradigms that go beyond traditional score-based or constraint-based methods for causal discovery. However, this work raises several key concerns, prompting me to review this draft with caution.
- As an experimental paper, it is expected to provide a wide range of existing methods applied to various datasets, not just simulated data. However, the experimental settings in this work are relatively simple and do not appear to be robust enough for a comprehensive experimental study.
- The insights from the experiments do not yield particularly exciting results and seem somewhat predictable. I would expect a deeper discussion of these experiments or even some theoretical progress. For example, it would be helpful to explore why training data sampled from identifiable SCMs performs well or how the learned model (e.g., a transformer) leverages this information. Additionally, while the learned model may not explicitly satisfy identifiability conditions (such as those in linear non-Gaussian additive noise models), it would be interesting to investigate whether the model implicitly learns such information.
- For data sampled from a non-identifiable SCM, there may be multiple graphs within a single Markov equivalence class. In this case, do supervised learning paradigms truly offer no advantage? For instance, could we apply a Bayesian approach to the model to yield the most probable result? Additionally, given the existence of multiple possible graphs, could we frame this as a one-to-many problem, perhaps using techniques like Mixture Density Networks?
Overall, although the insights from this work are not surprising, it is still interesting and valuable, and I believe it contributes to the community. However, the current version is not yet ready for publication. I encourage the authors to incorporate more robust experimental results, provide deeper insights, and offer rigorous justifications.
Questions
See above.
We thank the reviewer for the suggestions and valuable insights on our paper. We now proceed to address the points raised by the review.
Weaknesses
W1. As an experimental paper, it is expected to provide a wide range of existing methods applied to various datasets, not just simulated data. We agree with the reviewer’s comment, as additional experiments would strengthen our conclusions. For the discussion of the experiments on real data, we refer the reviewer to the common response. Moreover, in Appendix D.2 of the updated pdf, we add experiments with four well-established causal discovery methods covering different assumptions: DirectLiNGAM, CAM (Buhlmann et al., 2014), NoGAM (Montagna et al., 2023, reaching SOTA accuracy on several synthetic benchmarks), and GraNDAG (Lachapelle et al., 2019). We benchmark these methods on simulated data against CSIvA models learned on the following dataset configurations:
- linear-mlp (linear SCM with mlp noise)
- anm-mlp (anm = nonlinear additive noise model)
- pnl-mlp (pnl = postnonlinear model)
- mixed mechanisms and mixed noise.
NoGAM, CAM, LiNGAM, GraNDAG, and CSIvA instances are tested on the following datasets:
- Linear-mixed, anm-mixed, pnl-mixed (mixed = mixed noise distributions)
- mixed-mlp (mixed = mixed mechanisms types, with mlp noise)
These dataset configurations are suited to benchmark CSIvA OOD generalization ability in comparison to the other methods. The results show that CSIvA generally outperforms traditional and state-of-the-art classical inference methods: the mixed training renders the model robust to a larger variety of SCMs, in agreement with our prior results, and even in comparison to well-established methods.
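For readers unfamiliar with these configuration names, here is a minimal sketch of the corresponding bivariate generators (hypothetical helper names; we read "mlp noise" as standard Gaussian noise pushed through a random MLP, which is an assumption of this sketch):

```python
import numpy as np

def mlp_noise(n, rng, hidden=16):
    """Standard Gaussian noise pushed through a random one-hidden-layer MLP."""
    z = rng.standard_normal((n, 1))
    w1 = rng.standard_normal((1, hidden))
    w2 = rng.standard_normal((hidden, 1))
    return (np.tanh(z @ w1) @ w2).ravel()

def sample_effect(cause, mechanism, rng):
    """Generate the effect variable under the named mechanism class."""
    noise = mlp_noise(len(cause), rng)
    if mechanism == "linear":  # linear SCM
        return 2.0 * cause + noise
    if mechanism == "anm":     # nonlinear additive noise model
        return np.sin(cause) + noise
    if mechanism == "pnl":     # post-nonlinear: invertible map after the ANM
        return np.cbrt(np.sin(cause) + noise)
    raise ValueError(mechanism)

rng = np.random.default_rng(0)
cause = rng.standard_normal(1000)
effect = sample_effect(cause, "pnl", rng)  # one pnl-mlp style dataset
```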
W2. We discuss individual concerns raised in the second point.
- It would be helpful to explore why training data sampled from identifiable SCMs performs well. We note that this is an important part of our work. Summarising: as discussed in Section 3.2, if the CSIvA training procedure only learns to model the conditional distribution of a graph given the data, it can only identify the Markov equivalence class of the causal graph. Hence, given that previous work showed that DAGs can be identified by transformer-based architectures (Ke et al. 2023, Lorch et al., 2022), the model must encode information about the data-generating process in some way. We formalize this intuition in our Hypothesis 1 (that for classes of SCMs observed during training CSIvA can encode the assumptions on the data generation), theoretically analyze it in the closed-form Example 2, and empirically validate it. If the SCM underlying the test data is identifiable, encoding information on the data-generating process enables inference of a DAG instead of an equivalence class. If this was not easy to parse from the presentation, we would welcome the reviewer’s suggestions to make this point more apparent.
- While the learned model may not satisfy identifiability conditions, it would be interesting to investigate whether it implicitly learns such information. We thank the reviewer for raising this interesting point. This can be analyzed given the results already in the paper. Given our observations that (i) for classes of SCMs observed during training CSIvA can encode the assumptions on the data generation, and (ii) CSIvA cannot generalize to SCM classes unseen during training, we can foresee that a model that is exclusively optimized on non-identifiable data, e.g., linear Gaussian, can only try to fit a linear Gaussian model, irrespective of the input data. From this, it follows that CSIvA trained on non-identifiable data is no better than random in inferring the causal direction. This is indeed what we observe in the new experiments we provide in the updated pdf of the manuscript. In particular, we consider CSIvA trained on linear Gaussian data and run inference on six different simulated benchmarks, observing that the average test SHD is always approximately equal to 0.5. Experiments and the present analysis are added to the updated pdf (Appendix D.4).
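The symmetry driving this result can be checked directly: in the bivariate linear Gaussian case, a linear model with residuals uncorrelated with (and, by Gaussianity, independent of) the regressor fits in both directions, so the two causal orientations are observationally equivalent. A minimal sketch of ours, for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(10_000)
y = 0.8 * x + 0.6 * rng.standard_normal(10_000)  # true model: X -> Y, linear Gaussian

for cause, effect, name in [(x, y, "X->Y"), (y, x, "Y->X")]:
    b = np.dot(cause, effect) / np.dot(cause, cause)  # OLS slope
    r = effect - b * cause                            # regression residuals
    corr = np.corrcoef(cause, r)[0, 1]
    print(f"{name}: slope={b:.3f}, corr(cause, residual)={corr:.4f}")
# Residuals are uncorrelated with the regressor in both directions, so the
# two causal models are observationally equivalent: no algorithm, learned
# or classical, can recover the direction from observational data alone.
```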
W3. For data sampled from a non-identifiable SCM, there may be multiple graphs within a single Markov equivalence class... This point is out of the scope of the paper: supervised learning of Markov equivalence classes does not relate to the notion of identifiability that is central to our paper, i.e. the identifiability of the causal order. For a discussion of supervised learning of Markov equivalence classes and its theoretical guarantees, we point to Dai et al., 2023, as referenced in our related work section.
We reiterate our appreciation for the reviewer’s feedback, which we incorporated in the updated version of our pdf. If the concerns are positively addressed, we kindly ask the reviewer to consider raising their score. In any case, we remain available for further clarification and discussion.
Thank you for the detailed response.
Unfortunately, it does not convince me of the conclusions or insights presented in this work.
- Insights or conclusions drawn from experimental results should not necessarily align with or exhibit similar properties as those derived theoretically. In theory, the identifiability result in the bivariate setting can be extended to the multivariate case because the principle underlying identifiability—such as the asymmetry between bivariate variables—is theoretically generalizable. However, this reasoning does not hold based on the experimental results. Specifically, the claim that "it suffices to know that transformers align with identifiability theory in the bivariate case" is problematic. For a model trained on data, there is no guarantee that it learns the principle of identifiability; instead, it may capture task-specific biases. These are fundamentally different aspects—generalizing theoretical results does not automatically imply analogous outcomes in experimental settings.
- I do not think W3 falls outside the scope of the paper. If it did, why include an investigation of the linear Gaussian model? Moreover, a key motivation for supervised learning algorithms is that certain assumptions, e.g., the limited function class, are often difficult to justify in real-world applications.
Since the response does not address my concerns, I would like to maintain my rating.
We thank the reviewer for engaging in the discussion. We provide a response on the two points raised in their last comment.
- Multivariate analysis. We note that the fact that CSIvA optimizes its parameters via maximum likelihood estimation (L188) ensures that our findings extend to the multivariate setting. In particular, CSIvA parameterizes a distribution p_θ(G | D), the conditional probability of a graph G given the dataset D, where θ are the neural network parameters optimized on n data points. It is known that, if a minimizer θ* of the maximum likelihood estimation loss exists, then n → ∞ implies θ_n → θ*, where convergence is in probability (note that θ_n is a random variable). Crucially, this result does not depend on the dimension of the input, which for us is the dimension of the graph. As a reference, see Theorem 2 in "Statistical theory", Richard Nickl, 2013. In our case, the optimal parameter θ* guarantees that p_{θ*}(G | D) concentrates all the mass on a single point, given that data are generated by an identifiable model (see L174). Note that for a dataset generated by an identifiable process, the minimizer exists: e.g., consider data generated by a LiNGAM SCM; the DirectLiNGAM algorithm can find the optimal solution. This argument supports our conclusion that bivariate empirical conclusions extend to the multivariate setting. In fact, we empirically show that for an identifiable data generation process, CSIvA seems to approximate the optimal parameter θ* which defines p_{θ*}, concentrating all the mass on a single point. Given the above theoretical arguments, we know that in the infinite sample limit this also holds true for multivariate models.
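For concreteness, a compact sketch of this argument in our notation (D a dataset, G a graph, θ the network parameters):

```latex
% MLE over n training pairs (D^{(i)}, G^{(i)}), with consistency in probability:
\hat\theta_n \;=\; \arg\max_{\theta}\; \frac{1}{n}\sum_{i=1}^{n}\log p_\theta\big(G^{(i)} \mid D^{(i)}\big),
\qquad
\hat\theta_n \;\xrightarrow{\;p\;}\; \theta^\star \quad (n \to \infty).
% For an identifiable data-generating class, the limiting predictive
% distribution concentrates on the true graph, whatever the number of nodes:
p_{\theta^\star}(G \mid D) \;=\; \mathbb{1}\{\,G = G^\star(D)\,\}.
```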
In addition to these considerations, we remark one point: our paper is clearly a step forward in advancing the understanding of the relationship between identifiability guarantees and supervised learning methods. We consider an existing line of work without any theoretical guarantees, i.e., transformers for causal discovery, and advance the understanding of this domain.
- Experiments on Markov equivalence classes. We share the reviewer's interest, yet we still believe that this point is outside the aim of our paper:
- Concerning the clarification on why we consider linear Gaussian experiments, we can find motivation in the paper. In L25 and L41, we report that previous influential hypotheses argued that supervised learning methods can be trained to learn to infer beyond the constraints of identifiability (Lopez-Paz et al., 2015). The content of Section 4.3 "How does CSIvA relate to identifiability theory for causal graph" analyzes this claim: in this context, we show that CSIvA cannot infer the causal direction of linear-Gaussian models.
- We must restate the scope of our paper: the goal is to analyze how existing transformer-based methods for causal discovery encode information that leads to correct predictions of the direction of causality, given empirical evidence from prior work that they do. To do this, we don't introduce new architectures, but consider CSIvA, an existing transformer method for causal discovery. CSIvA does not output Markov equivalence classes but is trained to predict the directed acyclic graph. In this regard, an analysis of how to "apply a Bayesian approach to the model to yield the most probable result" is out of scope for the paper, as it falls beyond the main goal of understanding how assumptions that ensure correct inference of the causal graph are encoded in the model's parameters. Such a study would require analyzing another algorithm, or a different training procedure of CSIvA aimed at the prediction of Markov equivalence classes. A study in the direction suggested by the reviewer is referenced in our related works (Dai et al., 2023).
We hope that these comments address the reviewer's concerns better than our previous response. If this is the case, we would kindly ask the reviewer to reconsider their score.
The paper analyzes the CSIvA model on bivariate causal models. The authors propose a hypothesis attempting to bridge the gap between model-free (i.e., end-to-end learning without knowledge of the causal generating process) causal discovery algorithms and identifiability theory, but only use examples, without a proper proof.
Strengths
The problem that the authors are trying to attack is actually important. Identifiability is central to causal discovery because it determines whether the true causal structure can be uniquely recovered from the data. Without identifiability, any conclusions about causality are inherently uncertain, making the validation and application of causal models problematic. Meanwhile, models such as CSIvA do not provide any identifiability results, but achieve good performance in practice. Thus, to ensure the reliability of such models, identifiability results would be required.
Weaknesses
- The hypothesis in the paper is not proved.
- Actually, for a causal discovery algorithm, if it is differentiable at each step, then it would be trivial to use an RNN to mimic the action of the algorithm exactly. Then a large enough transformer can be used to fit the RNN easily. Thus it would be trivial for a transformer to perform well in identifiable causal discovery cases if such an algorithm exists. For example, for linear Gaussian with equal variance, each step of NOTEARS is differentiable if the objective is differentiable. Thus we can use a differentiable neural network to mimic NOTEARS. In other cases, such an explanation may also hold.
- The experiments on 2-variable datasets are too simple.
Questions
See above weakness.
We thank the reviewer for their feedback and time dedicated to our work.
It seems like there may be a misunderstanding about the main goal and scope of our paper. Concerning the second point raised in the Weaknesses that
it would be trivial for a transformer to perform well in identifiable causal discovery
it is indeed known that a transformer can learn causal discovery algorithms, as empirically investigated by Ke et al., 2023 and Lorch et al., 2022. However, this should not be raised as a weakness relative to our work, as these results are not the object of our paper. The first and main question we pose is how the identifiability assumptions are encoded (or whether they are needed at all in an explicit form): as discussed in Section 3.2, if the CSIvA training procedure learns to model the conditional distribution of a graph given the data, it can only identify the Markov equivalence class of the causal graph. However, transformer-based architectures can identify DAGs (Ke et al. 2023, Lorch et al., 2022), implying that the model must encode information about the data-generating process in some way. We formalize this intuition in our hypothesis (that for classes of SCMs observed during training, CSIvA can encode the assumptions on the data generation) and empirically validate it.
Concerning the first point in the weaknesses that
the hypothesis is not proven,
theoretical investigation of identifiability guarantees would require knowing the algorithm learned by the transformer neural network with millions of parameters, which is unfeasible. The only viable option, then, is that of presenting a theoretically grounded hypothesis and empirically validating it, as done in our paper. Once our hypothesis that CSIvA encodes assumptions on the process generating the distribution (rather than on the distribution only) is verified (Section 4.3), this opens several interesting questions (see L399 and the following):
- Can we train a single network to encompass multiple (or even all) identifiable causal structures, i.e. able to encode information of the data generating process when this is identifiable?
- How much ambiguity might exist between these identifiable models?
The second question arises from the observation that jointly considering SCM classes that are separately identifiable (e.g. LiNGAM and nonlinear ANM) may render a data-generating process non-identifiable (as proven by Example 2). In other words, identifiability is a relative property that depends on the SCM search space: however, this is not an issue, as the non-identifiable distributions induced by this merging of SCMs happen rarely, as theoretically demonstrated in Proposition 2. Then, given that we empirically demonstrate that CSIvA can model SCM assumptions, we can exploit this theoretical result to train a single model over multiple classes of causal models: identifiability is still guaranteed (shown in Proposition 2 and experiments of Section 4.5), and generalization is improved, as demonstrated by our experiments.
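To make the relativity of identifiability explicit, an informal statement in our notation (dir(M) denotes the causal direction postulated by model M):

```latex
% A class \mathcal{M} is identifiable iff no observational distribution
% supports models of \mathcal{M} with opposite directions:
\forall\, M_1, M_2 \in \mathcal{M}:\quad
P^{M_1}_{X,Y} = P^{M_2}_{X,Y} \;\Longrightarrow\; \mathrm{dir}(M_1) = \mathrm{dir}(M_2).
% Identifiability of \mathcal{M}_1 and \mathcal{M}_2 separately does not imply
% identifiability of \mathcal{M}_1 \cup \mathcal{M}_2: a distribution may admit
% M_1 \in \mathcal{M}_1 with X \to Y and M_2 \in \mathcal{M}_2 with Y \to X.
```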
We hope that this summary clarifies the nature of our contribution and clarifies the concerns raised in the first two points of the weaknesses section. Regarding the limit of our experiments to two variables (highlighted in the third point of the weaknesses), we address the reviewer to the general comment of the rebuttal.
We are available to continue our discussion on the above points. If, instead, the concerns are positively addressed by our rebuttal, we kindly ask the reviewer to consider raising their score.
Hypothesis 1 is the key result of the paper. While empirical results are important, a theoretical proof of the result is also essential. Also, it is notable that a theoretical proof of the hypothesis is not infeasible given some high-level abstractions. One possible way is:
- obtain a traditional causal discovery algorithm (i.e., LiNGAM);
- show that each step of the traditional discovery algorithm can be mimicked by the proposed neural network;
- if Step 2 can be done, then, at least with sufficient data, the proposed approach can very likely do as well as the traditional algorithm.
This might be a weak proof, and a stronger one may be done via:
- show that the proposed neural network can mimic the data generation process;
- show that if the data generation process is wrongly estimated, the result will be inconsistent.
In addition to the theoretical part, there are more problems in the experiments. It is true that, theoretically, it may be easy to extend the identifiability results from 2 variables to multivariate cases. However, from the viewpoint of optimization, the search spaces of the 2-variable and multivariate cases are totally different, because the search space grows exponentially with the number of variables. We cannot simply assume that the neural network will scale with the number of variables.
From the response, it appears that the reviewer may have misunderstood the aim of our work. The reviewer suggests two directions to theoretically prove our Hypothesis 1. Both of these suggestions are unrelated to our paper's content.
- Mimicking an existing algorithm and proving identifiability. The reviewer suggests training a transformer that mimics an existing causal discovery method, e.g. LiNGAM, and to prove identifiability guarantees for that algorithm.
- This abstract training procedure would result in the LiNGAM algorithm, which already has its theoretical identifiability guarantees proven.
- This training procedure does not pertain to any existing transformer-based algorithm trained with supervised learning for causal discovery. The goal of our paper is analysing how existing transformer-based methods for causal discovery encode information that leads to correct predictions of the direction of causality, given prior empirical evidence that they do.
- Show that the proposed neural network can mimic the data generation process.
- We don’t see this as a practical direction, given that CSIvA is not a generative model, but a discriminative neural network. Instead, we can show that CSIvA encodes information on the data-generating process, and use it for inference: we follow this approach, which aligns with the reviewer’s suggestion for a “stronger proof”.
- Given that mathematical analysis of complex neural networks is unfeasible, the only viable approach is the one in our paper: present a hypothesis motivated by theory, validated by empirical evidence. In support of this point, note that the only theory developed for causal discovery as a supervised classification problem is provided in the context of kernel methods (Lopez-Paz et al., 2015), for which closed-form analysis is possible (unlike neural networks).
We remark one more time that our goal is not to prove that CSIvA is identifiable, but to showcase that (1) amortized causal discovery methods learn to encode the assumptions on the SCM required for identifiability implicitly from the training data (contrary to prior hypotheses in Lopez-Paz et al., 2015), and (2) they do not work if the setting is non-identifiable or there is a mismatch between the assumptions (i.e., the training data) and the test scenario. From there, we explore new tradeoffs for these algorithms in theory and practice.
Overall, we disagree with the criticisms raised by the reviewer as they do not relate to our work and kindly request them to reconsider their rating.
This paper analyzes amortized causal discovery, where synthetic data allows supervised learning of causal discovery. The investigation focuses on whether learning-based methods can overcome traditional identifiability constraints in causal discovery. Extensive out-of-distribution experiments in a bivariate setting with a CSIvA transformer demonstrate that the method cannot generalize well to unseen mechanisms or noise families. This finding establishes an important link to traditional causal discovery literature, revealing how identifiability assumptions are transferred from algorithm design to training data selection. The work further demonstrates empirically that model generalization can be significantly improved by training on diverse families of mechanisms and noise distributions. A theoretical justification for this approach is provided by proving that the set of non-identifiable structures arising from such unions is negligible, particularly extending known results to post-additive noise models.
Strengths
- Establishes a clear connection between how assumptions in classical causal discovery are translated to training data assumptions in amortized approaches.
- Provides strong empirical validation and theoretical justification that generalization in amortized learning improves with mixed training, something that has not previously been studied to this extent, providing an important result and discussion point that will benefit the causal discovery community.
- Establishes a testing methodology for amortized causal discovery that, if adopted, stands to benefit the field in improving the methods.
Weaknesses
- Limited Scope:
- Analysis restricted to bivariate cases
- Not clear how well the results extend to more complex graphs, nor is there a clear path for extending results to larger graphs
- Doesn't address practical challenges like confounding or selection bias
- Experimental Limitations:
- Only evaluated on purely synthetic data
- Limited exploration of how results scale with dataset size
Questions
- Have you explored how your results might extend to larger graphs? What are the main challenges you foresee?
- Can you provide more concrete guidance on choosing optimal ratios when mixing different types of causal models during training?
We thank the reviewer for considering our work and their valuable insights. In what follows we address the points they raised.
Weaknesses
W1. Concerning the limitation of our experiments to the bivariate setting, we point the reviewer to the common response, where we specifically discuss this point and how our conclusions on the guarantees of identifiability are expected to extend to larger graphs.
W2.
- In the updated pdf of the manuscript (Appendix D.1) we provide experiments on real data and compare CSIvA's performance with common classic approaches to causal discovery. For a detailed discussion, we refer the reviewer to the common response. In summary, we observe that our mixed training procedure yields better performance on real-world data, in agreement with our simulated experiments. Thanks for suggesting this analysis, which we believe to be a valuable addition.
- In the updated pdf (Appendix D.3) we provide experiments on different dataset sizes. In particular, we compare the performance of models trained on 5k, 10k, and 15k data points. These experiments are in addition to those already in the paper (Appendix C.1). We train CSIvA models on the following dataset configurations:
- linear-mlp (linear SCM with mlp noise),
- anm-mlp (anm = nonlinear additive noise model),
- pnl-mlp (pnl = postnonlinear model),
- mixed mechanisms and mixed noise,
and test on:
- Linear-mixed, anm-mixed, pnl-mixed (mixed = mixed noise distributions),
- mixed-mlp (mixed = mixed mechanisms types, with mlp noise).
These experiments are useful to observe the interplay between OOD generalization and the size of the training dataset. We find that there is no remarkable difference in performance between models trained on 5k, 10k, 15k samples. The linear-mlp, anm-mlp, pnl-mlp models generalize poorly to the unseen distributions. The mixed-trained model generalizes well, irrespective of the number of training samples. Note that experiments with different numbers of training samples are meaningful only for investigating the OOD generalization performance. Instead, as identifiability is relative to the properties of the population distribution generated by a specific SCM, ideally we would like to provide as much data as needed to fit the training distribution, as we do in our paper (see Fig. 1 showing that CSIvA achieves close to 0 SHD on in-distribution datasets). Finally, we point to Ke et al. 2023 for a thorough analysis of the impact of dataset size.
Questions
Q1. The first question relative to real-world experiments is addressed in the common response, where we discuss the results of the experiments on real-world data. In general, these strengthen our conclusions, showing that mixed training benefits OOD generalization even in real settings.
Q2. Can you provide more concrete guidance on choosing optimal ratios when mixing different types of causal models during training?
- Given that training specifically occurs on simulated data (which, in general, is the only large-scale labeled resource in causal discovery), in the absence of compute constraints one should use as much data as possible, evenly spread between the different identifiable SCMs. This is indeed what emerges from our experiments in Appendix C.1, where we discuss training with an unlimited training budget.
- In case compute resources are significantly limited, the optimal ratio depends on the prior knowledge that you might have on the data: our experiments show that testing in-distribution (i.e., training and testing on data generated according to the same SCM) provides the best results. The point of mixed training (on different SCMs, where the difference is relative to the mechanism type and noise terms) is to define a larger class of in-distribution test datasets. Mixed training is then useful to address the case where prior knowledge about the class of SCM generating the data at hand is unavailable: in that case, the most natural prior one has on the data is that they have been generated by any SCM with uniform probability, and equal ratios during training are desirable. When, instead, prior knowledge is available, e.g. we believe that mechanisms are linear, using only linear data as training allows the model to perform in-distribution at test time while seeing more data of the meaningful class (linear SCMs, in this example). Note that if this prior knowledge is available, then resorting to classical methods (e.g. LiNGAM, for linear data) with theoretical guarantees is also a viable option.
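A minimal sketch of such a ratio-controlled training sampler (hypothetical names; the uniform setting corresponds to the no-prior-knowledge case discussed above):

```python
import numpy as np

def sample_scm_class(ratios, rng):
    """Draw the SCM class of the next training dataset under a given prior."""
    classes = sorted(ratios)
    probs = np.array([ratios[c] for c in classes], dtype=float)
    return rng.choice(classes, p=probs / probs.sum())

rng = np.random.default_rng(0)
uniform = {"linear": 1, "anm": 1, "pnl": 1}       # no prior knowledge
linear_prior = {"linear": 8, "anm": 1, "pnl": 1}  # mechanisms believed linear
print(sample_scm_class(uniform, rng), sample_scm_class(linear_prior, rng))
```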
We thank the reviewer for the valuable insights which made us consider important experiments. Several of their suggestions have been incorporated in the updated pdf. If we positively addressed the concerns in this review, we kindly ask to consider raising their score.
We thank the reviewers for their valuable feedback and the attention dedicated to our paper. Our work is unanimously recognized as addressing an important problem (AnXC), and praised for its clear presentation (vYGz, Ed52, qPWy). Almost all reviewers acknowledge the quality of our study: our findings are supported by theoretical justification and strong empirical validation (Ed52, vYGz), and the experiments are extensive with detailed discussion of their implications (qPWy). Our work is a valuable contribution to the community (qPWy, Ed52) and offers a way to make amortized causal discovery practical (vYGz).
In the common response, we address concerns shared by multiple reviewers:
- The limit of experiments to bivariate graphs (Ed52, vYGz, AnXC).
- The absence of real-world experiments (Ed52, vYGz, qPWy).
The new experiments we discuss are presented in the updated manuscript pdf (Appendices C and D), where novel parts are colored in red. The manuscript bibliography lists the papers cited in this response.
Limit to bivariate experiments. We agree with the reviewers on this point, also discussed in the paper (L207). We outline theoretical and empirical reasons for focusing on two variables.
- Theoretical perspective:
- The bivariate setting is the privileged ground for analyzing identifiability: identifiability of additive noise and post-nonlinear models is established on bivariate graphs in the seminal works of Hoyer et al., 2008 and Zhang & Hyvärinen, 2009, and extended to multi-variate graphs with a combinatorial argument, see next.
- Multivariate guarantees build on these bivariate results: identifiability for multivariate models is ensured by iteratively verifying the identifiability of all bivariate subgraphs (Peters et al., 2014, Tu et al., 2022). Thus, generalizing our findings to multivariate models is about empirically verifying that transformers align with theoretical identifiability for each pair of nodes connected in the graph. For this, it suffices to know that transformers align with identifiability theory in the bivariate case, a fact that we show. This theoretical argument suggests that the empirical identifiability guarantees we find similarly extend to multivariate graphs. An informal statement of this pairwise reduction is sketched after this list. This discussion is now part of the updated pdf, Appendix C, and addresses Ed52's query on how our results would extend to more dimensions.
- Empirical perspective:
- Unfortunately, we find that CSIvA performs poorly on multivariate graphs in purely observational settings. This is also found in the original paper by Ke et al., 2023, Appendix A.7.4. Investigating OOD generalization or identifiability via experiments requires convergence during the training and reliable in-distribution performance, so that test errors can be imputed to distribution shifts or non-identifiability. In the multivariate setting, CSIvA lacks these properties.
- Training on multivariate graphs is computationally intensive: training on 5-node graphs takes ~2 days to converge. Without access to Ke et al.'s code (not released) we can’t compare our implementations. Yet, our algorithm converges in approximately the same number of iterations as theirs, taking more time per iteration: differences may stem from hardware.
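Informally, the pairwise reduction mentioned in the theoretical bullets above can be stated as follows (our paraphrase of the argument in Peters et al., 2014; notation ours):

```latex
% If every cause-effect pair, conditioned on the remaining parents of the
% effect, forms an identifiable bivariate model, the full graph is identifiable:
\forall\, j \to k \in G:\;\;
\big(X_j, X_k\big) \,\big|\, X_{\mathrm{pa}(k)\setminus\{j\}}\ \text{bivariate-identifiable}
\;\;\Longrightarrow\;\; G\ \text{identifiable from } P_X.
```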
Summary remark. CSIvA has limitations in scaling to high dimensions, but scalability is not our focus. Amortized causal discovery has two components: one is that of scalability to large graphs, and the other is relative to the theoretical viability of the idea. Despite the current limitations that we observe with CSIvA, transformers clearly could scale structure learning to large graphs (a point shared e.g. by AnXC, point two of the weaknesses). On the other hand, the causal discovery community did not have evidence that this idea is theoretically sound. We addressed this aspect conclusively, additionally finding surprising trade-offs (e.g., mixed training, assumptions on noise vs mechanisms), that differ from classical algorithms and are worthy of further research.
Absence of real-world experiments and other benchmarks (Appendix D.1 of the updated pdf). We agree that real-world experiments could strengthen our conclusions. Given the challenges of scaling CSIvA, we present results on real-world pairwise data from the CMU benchmark. We find that our mixed training strategy on CSIvA improves SHD on real data, consistent with our prior synthetic results. As suggested by vYGz and qPWy, we compare CSIvA with well-established causal discovery methods: DirectLiNGAM, CAM (Buhlmann et al., 2014), NoGAM (Montagna et al., 2023, reaching SOTA accuracy on several synthetic benchmarks), and GraNDAG (Lachapelle et al., 2019). Interestingly, CSIvA with mixed training consistently matches or outperforms these benchmarks.
We thank vYGz for pointing us to the CMU benchmark for the real-data analysis.
Dear Reviewers, Since the discussion period ends early this week and no discussion has emerged yet, we would like to follow up. We encourage everyone to reply if any aspects of our work or responses require further clarification even after the rebuttal.
Dear reviewers, as the rebuttal period draws to an end, we provide a friendly reminder of our availability to discuss the open points that emerged in the reviews and the follow-up discussions.
While the manuscript targets an important question, namely why amortized causal discovery with transformers works, major concerns raised by the reviewers, particularly regarding the limited scope of experiments (mostly bivariate) and the lack of theoretical claims, still remain.
Two reviewers were marginally positive but did not fully endorse the paper, while two others maintained their skepticism. Given the discussion, the additional experiments, and the persistent reservations, the balance of arguments weighs slightly against acceptance at this time. The paper provides a valuable perspective and represents a step forward in understanding amortized causal discovery, but it may require further theoretical development or more robust multivariate evidence to be fully convincing. Therefore, the final recommendation is to reject the paper in its current form.
Additional Comments from Reviewer Discussion
The remaining major concerns are as follows:
- Theoretical Guarantees: Reviewer AnXC and Reviewer qPWy reiterated concerns about the lack of any theoretical results to support the authors’ core hypothesis on how transformer-based amortized methods implicitly encode identifiability assumptions.
- Empirical Evidence: Generality Beyond the Bivariate Setting: Reviewer AnXC, Reviewer qPWy, and Reviewer Ed52 questioned the relevance of results obtained mainly from bivariate experiments to more complex, multivariate graphs. While the authors provided theoretical arguments suggesting that identifiability extends from the bivariate case to multivariate settings, the reviewers expressed doubts that the transformer model necessarily can learn these principles.
Considering that both the theoretical results and the empirical evidence are insufficient to support the authors' hypothesis, we regretfully reject this paper for now and encourage the authors to use this opportunity to further improve the manuscript.
Reject