Amortized Active Causal Induction with Deep Reinforcement Learning
We propose an amortized and adaptive intervention strategy that results in a sample-efficient estimate of the true causal graph on the distribution of training environments as well as on test-time environments with distribution shifts.
Abstract
Reviews and Discussion
The authors propose CAASL, which amortizes an intervention policy in the setting of causal structure learning. They apply this algorithm in the setting of a synthetic environment and a gene expression simulator. They use AVICI as a reward model to estimate an approximate posterior over the graph’s adjacency matrix. The RL algorithm is SAC, and the policy and value networks are transformers. On their two environments, CAASL outperforms several baselines (in terms of uncovering the true adjacency matrix) and performs better in out-of-distribution settings.
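For readers less familiar with this setup, the following is a minimal sketch of how such a pipeline could be wired together. All names here (`policy`, `simulator`, `avici_posterior`, `correct_edges`) are hypothetical placeholders rather than the authors' actual API; the sketch only illustrates the loop described above, in which the transformer policy maps the experimental history to the next intervention, the simulator executes it, and the change in the amortized posterior's graph estimate provides the reward used to train the policy with SAC.

```python
import numpy as np

def correct_edges(edge_probs, true_adj, threshold=0.5):
    """Count correctly predicted adjacency-matrix entries (hard-thresholded here;
    the paper's reward may instead use the expected number of correct edges)."""
    return np.sum((edge_probs > threshold) == true_adj)

def rollout(policy, simulator, avici_posterior, true_adj, horizon=10, n_obs=50):
    """One training episode: the policy adaptively picks interventions and is
    rewarded by the gain in correctly estimated edges under the amortized posterior."""
    history = simulator.observational_samples(n_obs)           # initial data
    prev_score = correct_edges(avici_posterior(history), true_adj)
    transitions = []
    for _ in range(horizon):
        design = policy(history)                               # targets + values
        sample = simulator.intervene(design)                   # run the experiment
        history = np.concatenate([history, sample], axis=0)    # grow the history
        score = correct_edges(avici_posterior(history), true_adj)
        transitions.append((history, design, score - prev_score))
        prev_score = score
    return transitions   # fed to an off-policy learner such as SAC
```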
Strengths
The authors combine several lines of work to tackle the setting of causal structure learning: soft actor-critic, AVICI, transformer policies. From what I can tell, this is a novel combination of these methods. These results seem reasonably significant, enabling fewer interventions to obtain comparable adjacency matrix estimation accuracy.
The paper itself is fairly clear in terms of writing and notation. The empirical analyses are clear and helpful, exploring in-distribution vs. out-of-distribution generalization and ablations. All plots are clearly labeled.
Weaknesses
The authors put considerable effort into explaining the setup of causal structure learning, etc. However, perhaps due to my relative unfamiliarity with this area, the particular details of the setup were still unclear to me. Figure 1 attempts to present an overview of where their method fits into a causal structure learning pipeline; however, I would have instead (or in addition) preferred to see how CAASL links up with AVICI, the simulator, etc. From Figure 1, we only see the amortized intervention policy without any of the other machinery.
Again, this may be due to my unfamiliarity with this area, but the novelty of CAASL was unclear to me. For instance, how does AVICI estimate the intervention policy? Is this not an applicable baseline? Is the main benefit of CAASL to improve the efficiency of policy estimation over AVICI? What are the natural baselines to consider in comparison with CAASL? In the experiments, the authors present two other intervention baselines, but these do not perform well. Is the benefit, then, some type of accuracy/performance?
The experiments seemed slightly toyish, and the baseline comparisons did not ease my concerns. The most performant baseline is a random intervention policy. Are we to believe that the other baselines are actually worse than random? And are these problems simple enough to just use random search? Further, as the data dimensionality increases, CAASL is not able to outperform the random baseline. I would have, naively, expected the opposite, in which amortizing the intervention policy should yield an even larger improvement as the problem becomes harder.
Questions
Why does the random intervention policy outperform the other baselines, and why does CAASL revert to this performance as the dimensionality increases?
Have previous works explored the environments used here? I.e., are these standard benchmarks?
Limitations
Yes.
Thanks a lot for the detailed feedback. We answer your questions and concerns below:
- "Preferred to see how CAASL links up with AVICI, the simulator, etc.": Thanks a lot for the great suggestion. For the camera-ready version, which provides an additional content page, we will include another figure highlighting the individual components of the CAASL training procedure. We would like to clarify that AVICI is an amortized structure learning framework and does not itself do intervention design. We use AVICI to design rewards for our intervention design policy, which can be trained with a simulator.
- "Difference between AVICI and CAASL": We would like to clarify that AVICI is an amortized causal structure learning algorithm and CAASL is an intervention design policy; the two address different problems. In particular, AVICI predicts a (distribution over) causal graphs given any dataset, whereas CAASL predicts the most informative intervention to perform given the experimental history. We use AVICI's predictions to design rewards for training the intervention policy. The benefit of our method over all other applicable baselines is that we not only achieve amortized intervention design without parameter updates, but also do not require any intermediate posterior estimation.
- "The experiments seemed slightly toyish": Compared to prior work, the experimental settings we consider are among the more challenging ones for active causal structure learning. If there is a specific, concrete experimental setting that you believe could improve our work, we are happy to run those experiments and include them.
- "The most performant baseline is a random strategy": In active causal structure learning, the random strategy is a well-regarded baseline due to its strong empirical performance. All the related works listed in the paper compare with random, and its competitive performance can also be seen in prior work. We would like to clarify that the random baseline is competitive not because the problems are simple, but rather the opposite: they are in fact hard. Other baselines like DiffCBED use an explicit posterior model to predict interventions. Obtaining an explicit causal graph posterior is a hard problem in itself, even in very simple settings, which in turn hurts the performance of these baselines relative to random.
- "As the data dimensionality increases, CAASL is not able to outperform the random baseline": Note that the dimensionality-increase experiments are OOD experiments, i.e., CAASL has never seen data of that dimensionality during training. Hence, the larger the difference from the training dimensionality, the larger the distribution shift. With more distribution shift, CAASL is expected to gradually degrade in performance, as is expected of ML/RL methods in general.
- "Have previous works explored the environments used here? I.e., are these standard benchmarks?": Synthetic environments are very common across both causal structure learning and active causal structure learning. The GRN environment SERGIO has been used in AVICI for structure learning, but not in active causal structure learning. This is because SERGIO is a very challenging environment for the existing baselines, and most of them are not applicable to it since no likelihood is available.
We hope that addresses your concerns and questions. We would be glad to answer any further questions you may have.
Thank you for your responses to my questions. They have addressed some of my concerns. As the question of comparison with the random intervention strategy also came up in reviewer o26x's review, I will wait to update my score until the reviewer discussion period.
Dear reviewer,
Thanks a lot for the feedback. We have provided further clarification regarding the random intervention strategy to reviewer o26x. If there are any particular questions that require further clarification, we are happy to answer them as well.
Dear reviewer,
We have provided further clarification on the limitations of the approach vs. the random intervention strategy to reviewer o26x. Since the discussion period is soon coming to an end, we would be happy to address any other remaining questions that you might have. However, if all the outstanding issues have been clarified, we would appreciate it if you could consider increasing your score.
The paper proposes to use RL to train policies that can adaptively select interventions/experiments for obtaining data for active causal structure learning. The approach amortizes policy training by training the policy on synthetically generated data, an approach that has shown some promise in the amortized training of causal structure learning algorithms. These pretrained causal structure learning inference models are used as the reward function in the training of the policy used for experiment design. The policy takes in a dataset and outputs an experiment (an intervention on the causal model), and uses a network architecture that bakes in the relevant invariances/equivariances. Evaluation is done on synthetic environments, including a gene regulatory network environment.
Strengths
- The approach seems reasonable and I think members of the NeurIPS community will find the experiments evaluating this approach interesting
- The experiments in section 5.1, particularly the results on OOD performance are encouraging
- The methodology is generally clearly presented
- It is a novel and interesting combination of the ideas in [6], [36]
Weaknesses
- Section 5.2. I think more information needs to be given to interpret the results, perhaps with additional plots showing a measure more easily interpretable than 'Returns'. My interpretation from the plots currently presented in the main paper in section 5.2 is that the proposed method technically improves over a random baseline, but by such a small amount that the improvement is practically insignificant
- Section 5. It would be helpful to show experiments demonstrating that the approach actually converges to learning the true graph if enough experiments are done, since even simple approaches like random have this property.
- I think more background needs to be given on what the baselines in the experiments are and how they are run. Both perform much worse than a random strategy, which is surprising. It also seems that both baselines are batch approaches (at each iteration select multiple distinct interventions), while the approach here is always using batch size 1. How are / do the baselines need to be modified for the batch-size 1 setting and why do they perform so poorly, even in-distribution in Figure 3?
- Since greedy approaches are so common in this field, I think a natural ablation would have been to train the RL policy with an effective horizon of 1 (effectively a greedy/bandit policy trained under the same methodology presented). This can be done by rolling out the policy for the full horizon, but then updating the policy by considering each timestep as its own 'episode' (a minimal sketch of this relabeling is given below). This would answer the very important question of whether this extra RL/adaptive machinery is really necessary compared to just looking one experiment ahead. I couldn't find what horizon length was used during training of the policy. Apologies if I simply missed this.
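To make the suggested ablation concrete, here is a minimal sketch (with hypothetical names; not taken from the paper or its code) of how a full-horizon rollout could be relabeled into single-step "episodes" for an off-policy learner such as SAC, so that the critic only values the immediate reward, i.e., a greedy/one-step-lookahead policy:

```python
def to_greedy_transitions(trajectory):
    """Relabel a rolled-out trajectory so each timestep is its own episode.

    `trajectory` is a list of (state, action, reward, next_state) tuples from a
    full-horizon rollout. Marking every transition as terminal (done=True)
    removes bootstrapping from future steps, so the learned Q-function only
    captures the immediate reward, i.e., a greedy/bandit objective.
    """
    replay = []
    for state, action, reward, next_state in trajectory:
        replay.append({
            "state": state,
            "action": action,
            "reward": reward,
            "next_state": next_state,
            "done": True,   # horizon 1: no credit flows beyond this step
        })
    return replay
```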
Questions
- Section 5.1. What is the intervention? I got from the appendix that it is a do() intervention but for each method how do you select what value to set the intervention targets to in the case of Random and SS Finite?
- Line 88: "where the first column corresponds to one-hot encoding of whether a particular variable is intervened or not". This made me believe you only deal with singleton intervention targets, but reading further it seems like the approach also handles and makes use of multi-target interventions. I think it is not 'one-hot' in your case, but confirmation from the authors would be appreciated (a toy illustration of the distinction is sketched below).
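As a toy illustration of the distinction (the array layout below is ours, purely for illustration, and not necessarily the paper's exact state representation), a multi-target intervention is naturally encoded as a {0,1} mask over variables together with the values assigned to the targeted ones, rather than a strict one-hot vector:

```python
import numpy as np

d = 5                                             # number of observed variables
targets = np.array([0, 1, 0, 1, 0])               # {0,1} mask: variables 1 and 3 intervened
values  = np.array([0.0, 2.5, 0.0, -1.0, 0.0])    # do()-values for the targeted variables
sample  = np.array([0.3, 2.5, -0.7, -1.0, 1.2])   # outcome drawn under this intervention

# A (hypothetical) history entry stacks the intervention indicators with the data,
# so a policy/posterior network can distinguish interventional from observational rows.
history_row = np.stack([sample, targets, values], axis=-1)   # shape (d, 3)
print(history_row.shape)
```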
Limitations
The method requires inputting an entire dataset into the policy. It would be helpful to have experiments and discussion that reflect the ability of the approach to scale not just over graph size (included already) but over dataset size/number of rounds of experiments. Besides this I have no complaints about the transparency of limitations.
Thanks a lot for the detailed feedback. We address your questions and concerns below:
- "Section 5.2... more information needs to be given to interpret the results": In the Appendix, Figs. 14-18 present results for metrics other than returns. We argue that the return is an interpretable metric, as it indicates the test-time rewards of the policy on unseen environments. Regarding the practical significance of the performance relative to random, we would like to emphasize that our approach can handle unknown likelihoods and high amounts of missing data, which has not been studied in active causal structure learning before. The results we have obtained can be further improved by also updating the reward model AVICI as more interventional data is acquired during training. We hope that through this contribution, further active structure learning methods aimed at practical scenarios can be developed.
- "Section 5. It would be helpful to show experiments demonstrating that the approach actually converges to learning the true graph": Thanks a lot for the suggestion. While converging to the true graph as more interventional data is acquired is a desirable property of any method, we would like to highlight that this convergence property is also influenced by the structure learning algorithm used to evaluate the policy. In our case, we use a pretrained AVICI model, which does not necessarily converge to the true graph when given more data. Hence, even a random policy evaluated with AVICI need not converge to the true graph.
- "More background regarding baselines": The chosen baselines are the state of the art in Bayesian optimal experimental design for causal discovery that can select multi-target interventions. For the random strategy, the intervention value is selected uniformly at random in the range [-1, 1]. For SS-Finite, we set the intervention value to 5, as suggested in previous approaches like [52].
- "Baselines are batch strategies and perform poorly": Indeed, both baselines are batch strategies. We set the batch size to 1 in order to make a fair comparison. Most baselines work well when the underlying graphs are very sparse (for instance, when the average number of edges per node is less than 1). However, in the settings we consider, the graph density is chosen to be more realistic for real-world applications, where a random policy remains competitive. In addition, the baselines rely on an approximate posterior, which at times makes the induced policy worse than random. We will make these details clear in the revision.
- "Possibility of training a greedy policy": Thanks for raising this great point. In Figure 2, the adaptiveness of the policy, especially with respect to the optimal intervention values, can be seen at play. We agree that if adaptiveness is not a requirement, one can simply train the policy with a time horizon of 1. In such a setting, we expect that our approach still offers advantages in terms of amortization through design-space symmetries. For the experiments, the horizon is always set to 10, both during training and evaluation.
- "Where the first column corresponds to one-hot encoding of whether a particular variable is intervened or not": We apologize for the poor choice of wording. By one-hot encoding, we mean that the corresponding vector is a {0,1} vector. We will correct this terminology to make it clear that we select multi-target interventions.
- "Scalability wrt dataset size and number of experiments": With regards to dataset size, one can scale depending on compute availability by using a data-parallel training approach. With regards to scaling the number of experiments, our method may show limitations, as standard RL algorithms do not perform well over long horizons. However, as noted to the other reviewer (ydzf), it is indeed interesting to look at OOD performance when the number of experiments at test time changes. We will include this experiment in the appendix of the camera-ready version.
We hope that addresses your concerns and questions. We would be glad to answer any further questions you may have.
Thank you to the authors for their response. I kept my score the same.
Regarding the practical significance of the performance of the approach wrt random, we would like to emphasize that our approach can handle unknown likelihoods and high amounts of missing data, which has not been studied in active causal structure learning before
I don't think this really addresses the point since the random strategy can also handle all these novelties. The performance compared to random on SERGIO seems like a very significant limitation that is glossed over in both the paper and the general Openreview discussion.
Thanks for bringing this great point. In figure 2, one can see the adaptiveness of the policy, esp. wrt optimal intervention values, at play. ... In such a setting, we expect that our approach still offers advantages in terms of amortization through design space symmetries.
I think showing that an adaptive strategy is worth pursuing can only be done by showing that it is actually performing better than the ablated non-adaptive strategy (eg a bandit/setting time horizon to one). It doesn't really matter that the strategy proposed indeed ends up being adaptive, if it doesn't turn out that it is helping. This seems like a very fundamental experiment being missed when proposing an adaptive method.
Dear reviewer,
Thanks a lot for the comments and the feedback.
I don't think this really addresses the point since the random strategy can also handle all these novelties. The performance compared to random on SERGIO seems like a very significant limitation that is glossed over in both the paper and the general Openreview discussion.
We would like to highlight that we perform only 10 experiments with 50 initial observational samples in this setting, which might also be too few samples for the AVICI model (which is used for evaluation and for calculating the returns) to show significant improvements. An alternative is to train a policy that outputs a batch of designs or, as mentioned in the rebuttal, to increase the number of experiments. Of course, it is practically harder to train a policy with these properties using RL, which can be counted as a limitation. Another point to highlight is that the existing methods for (Bayesian) active causal structure learning evaluate exclusively on synthetic data, where we perform significantly better. Besides, all the results are provided with 95% confidence intervals over 100 random environments, including for the SERGIO environment. The improvements we see in SERGIO, though not as substantial as those for the synthetic environment, are still noteworthy given that no other active causal structure learning algorithm is applicable in such a setting. Hence, our contribution should be viewed as a step forward in the ongoing discussion about advancing active causal structure learning.
I think showing that an adaptive strategy is worth pursuing can only be done by showing that it is actually performing better than the ablated non-adaptive strategy (eg a bandit/setting time horizon to one). It doesn't really matter that the strategy proposed indeed ends up being adaptive, if it doesn't turn out that it is helping. This seems like a very fundamental experiment being missed when proposing an adaptive method.
Theoretically, as shown in Greenwald et al. 2019 [23] and Choo and Shiragur 2023 [13], performing interventions adaptively can be more sample efficient. Empirically, given the limited space, we focus our experiments on studying the amortization and generalization properties of the transformer-based policy, which forms the core of our contribution. However, we are prepared to add the bandit setting (i.e., setting the time horizon to one) in the revision. We have also provided the code, which is compatible with training the policy for a time horizon of one, and we will make it available upon acceptance.
I am going to raise my score to a 5 under the assumption that the authors will include some of this discussion for the results in the SERGIO experiments and make it clear in the text that the improvements in the SERGIO experiment are either very modest or nonexistent (this seems to depend on which metric is used). I think right now the slightly opaque metric of 'returns' in the main text plots and the lack of clear commentary on these results give an inflated sense of how well the approach is doing on SERGIO. I agree that the performance on synthetic benchmarks looks good.
Theoretically, as shown in Greenwald et al. 2019 [23] and Choo and Shiragur 2023 [13]
My understanding is that these citations define adaptive as whether all experiments are done in parallel or sequentially. Our present discussion and my original comment are really about greedy vs. non-greedy (where I think we've both had the confusion of using the word adaptive since my original comment), which is a different property of the algorithm: whether the planning horizon is 1 vs. >1. I don't think these citations comment on greedy vs. non-greedy.
Dear reviewer,
Thanks a lot for your comments and for increasing the score.
I think right now the slightly opaque metric of 'returns' in the main text plots and lack of a clear commentary on these results gives an inflated sense of how well the approach is doing on SERGIO. I agree that the performance on synthetic benchmarks looks good.
We will include the comments about the limitations of using the AVICI model for evaluation and about the SERGIO results in general in the revision, along with the other changes highlighted in the common rebuttal section. In addition, a commonly agreed-upon metric for Bayesian causal structure learning is missing (a recent ICML paper [1] discusses this issue in more detail), which is why we have presented multiple metrics, among them the expected number of correct edges (the returns).
I don't think these citations comment on greedy vs non-greedy.
Thanks for the clarification. Indeed, these citations do not comment on greedy vs. non-greedy. As mentioned before, we will add the bandit setting (time horizon of one) in the revision, which would illustrate the greedy vs. non-greedy comparison. The provided code is also compatible with training the policy for a time horizon of one.
References:
[1] Mamaghan, Amir Mohammad Karimi, et al. "Challenges and Considerations in the Evaluation of Bayesian Causal Discovery." In ICML (2024).
Update after rebuttal: I personally think that the authors have sufficiently addressed all criticism. Adding an extended discussion regarding the strong performance of a random baseline in many domains is a good idea and would situate the work even better. I remain in favor of accepting the paper, and have raised my confidence after taking into account all information.
The paper aims at the problem of designing efficient intervention strategies for causal induction. In causal induction, the goal is to infer an underlying structural causal model from observational and interventional data - and while this inference problem in itself is hard and a very active area of research, the problem of how to optimally choose interventions given a limited intervention-budget is of great importance in practice. Proposed solutions range from evolutionary and uncertainty-based active learning approaches to reinforcement learning, which is the method this paper pursues. The paper implements the intervention strategy via a (history-based) transformer that is trained via off-policy RL (via soft actor-critic) and is capable of incorporating some reasonable inductive biases via its architecture (permutation equivariance via attention). The trained policy is referred to as an amortized solution, since after training, a single forward pass determines the next intervention (without any parameter updates or RL). The paper also designs a novel reward function that is based on AVICI, a previously published transformer-based solution to obtain an amortized predictor (posterior) over the structural causal model, given a history of interventions and observations. Putting all of this together, yields a policy for interventional strategies that well aligns with practices in modern machine learning, and uses transformers’ great empirical generalization qualities to do the heavy lifting. Accordingly, out-of-distribution performance and generalization are a concern (since theoretical guarantees are scarce or vacuous) - but as the paper shows, the method generalizes remarkably well to non-trivial distributional shifts on synthetic data (10 variable linear SCMs with additive homoskedastic Gaussian noise for training) and data from a single-cell gene regulatory network simulator (which simulates realistic measurements of gene expressions including noise and significant amounts of missing values). On-distribution and mainly out-of-distribution experiments in both domains show promising performance and good generalization of the method.
Strengths
- Very well written paper, with good introduction for non-domain experts, making the paper very accessible to a wide NeurIPS audience.
- Careful combination of previously published approaches with a novel intervention policy parametrization (based on a transformer) and a novel formulation of a reward function (based on a previously proposed method for amortized causal inference). The focus is on forward-pass simplicity and efficiency, and using transformer-based architectures throughout to achieve good generalization of amortized solutions.
- Experiments show good performance, and most of the attention is devoted to evaluating out-of-distribution performance under various OOD conditions, which greatly supports the generality and reliability of the results.
Weaknesses
- Experimental ablations investigate the data-generating distribution / ground-truth, but not the network architecture and various design choices (such as, e.g., the network size, the use of tanh vs. sigmoid vs. ReLU, the sensitivity to the underlying AVICI model and its training regime, …). It would be nice to see some ablations to empirically justify the design choices, but I do consider the OOD experiments more important for publication and am pleased to see a fairly thorough investigation there.
- The discussion around amortization and how it relies on generalization to “just work” could be expanded a bit to make the paper even more broadly accessible.
- Though the paper is very clear about which contributions are novel compared to what is taken off the shelf, it may help to have all the novel contributions in one place (e.g., end of the intro).
Verdict: Though I am not a domain expert on causal inference for gene regulation, I greatly enjoyed reading the paper - it is very well written. The great introduction and background section makes the paper easily accessible to a wide ML audience. The method itself is somewhat intricate and composed of a number of parts that each come with several design choices - the paper strikes a good balance between laying out these options and explaining the ones taken in the paper, and then separating the novel parts from previously published parts. In the end this section can get quite dense, and it might benefit from some additional summary or overview diagram at the end of Section 3 (though I really enjoyed Sec. 4.1, the paper would also work if it were pushed to the appendix if more space is needed). The experiments are sensible and of small to medium-sized scale, and I appreciate the use of a realism-focused simulator for gene-regulation networks instead of a larger-scale synthetic experiment. The focus on out-of-distribution performance by investigating several OOD settings is particularly important for amortized methods, and I think it is well executed in the paper and results show the good performance and generalizability of the method. As a non-domain expert I cannot comment on the strength of the baselines that the method is being compared against (and have lowered my confidence accordingly) - though they seem to be non-trivial and sensible baselines. Overall I think the paper is very well written, makes a novel and original contribution (that includes putting many existing techniques together), the empirical investigation is thorough and convincing, and the work is of interest to a wide audience. I currently recommend to accept the paper, and have some (mostly optional) suggestions for improvement.
Improvements:
- Ablations for design choices: to support the design choices and get a sense of the sensitivity of the results some additional ablations would be interesting (I do not consider these to be critical for publication, but they would further strengthen the paper; the rebuttal period may be too short to conduct these):
- How sensitive is the performance of the method to the underlying AVICI and the regime it has been trained on, ie., does AVICI first need to be carefully tuned for the application domain or can a relatively general AVICI model be used to easily obtain good results on many domains?
- How sensitive are results in terms of the network architecture of the policy (network size, other activation functions)?
- Another interesting experiment could be to test how well the amortized policy generalizes when increasing the intervention budget beyond the maximal training range (does the policy continue to perform well / degrade gracefully, or does it eventually break down catastrophically?). Like above, I consider this to be optional, but interesting.
- To make the paper even more widely accessible consider having a brief paragraph in the background section on amortization. What is it? What are the advantages? What are the potential drawbacks (in-distribution vs. ood generalization “guarantees”).
- Maybe have a brief summary or overview figure, etc. at the end of Sec. 3 that summarizes the choices made in this paper (as opposed to the potential alternatives for the various parts which are pointed out nicely already) and identifies the novel parts and/or consider having a summary of contributions (e.g., at the end of the intro).
Questions
Minor comments:
- L108-109 (nit): L108 states that the posterior is trained "without having access to the likelihood of the data", but L109 states that "the posterior can be trained by maximum likelihood with…". Maybe rewrite to avoid possible confusion.
- L111: Maybe add that “Empirically it has been shown that the AVICI model can generalize to…”.
- L120: typo “they”.
- L123: What is "technical noise"? (maybe add a reference to 5.2 where this is explained)
- Zero-shot OOD (Fig. 4 and 5), just to clarify: performance of the policy in this regime is only indirectly/implicitly reliant on good generalization of AVICI (since there is no RL at this stage; the reward model got "baked" into the amortized policy)? To improve OOD performance via additional learning, is it plausible that fine-tuning the policy with a fixed AVICI would be sufficient?
Limitations
The paper clearly states limitations throughout and in a short limitations section. One general limitation (that is obvious to readers familiar with amortized algorithms) is that amortization typically only comes with in-distribution guarantees, and no ood-guarantees. Perhaps it is worth adding this to the limitations section (and clearly, the paper puts a lot of emphasis on empirically investigating this generalization).
Thanks a lot for the very positive feedback on our work. We provide detailed responses to your questions below:
- "Sensitivity to AVICI": That's a good point. In our work, we use a pretrained AVICI model that has been trained on datasets drawn from a wide range of SCM and noise distributions, so a relatively general AVICI model sufficed. However, if amortization performance is more important than OOD generalization, an AVICI model trained on a narrower domain should further improve the results of the policy, due to the more informative reward function.
- "Increasing the intervention budget": Thanks a lot for the great suggestion. The performance of the policy is based on the AVICI model, so as we give AVICI more interventional data, we expect performance to improve, but with increasingly incremental returns. It is nevertheless interesting to consider increasing the intervention budget during training, especially to investigate for how long the policy remains competitive with random. We will include this experiment in the camera-ready version.
- Given the extra page for the camera-ready, we will expand a bit on amortization, add a summary of contributions, and include a schematic figure of the training setup (which was requested by another reviewer as well).
- Typos and minor comments: Thanks a lot for the corrections. We will fix these typos and sentences; we agree that this will make the paper more readable.
- "Performance of the policy in this regime is only indirectly/implicitly reliant on good generalization of AVICI (since there is no RL at this stage; the reward model got 'baked' into the amortized policy)?": Indeed, the evaluation still relies on a causal structure learning model, for which we still use AVICI. As a result, the performance of the policy in the OOD regime still implicitly relies on good generalization of AVICI. One could also consider updating the AVICI model when updating the policy, but we only study the simpler setting in this work.
- We will add the limitation regarding OOD generalization guarantees for amortized algorithms.
We hope that addresses your concerns and questions. We would be glad to answer any further questions you may have.
Thank you for the response. I have no more open questions or comments.
Reading the other reviews and responses, I personally believe that all criticism has been sufficiently addressed. I think o26x raises an important point - which is to emphasize how it is often hard to beat a random baseline - and I think it would be nice to discuss this with more emphasis in the paper (discussion section, or limitations section). But I do not think that this fundamentally invalidates the approach and the careful analysis and comparison against SOTA baselines. I therefore remain in favor of accepting the paper and raise my confidence.
Dear reviewer,
Thanks a lot for the discussion and the positive assessment of our work. As indicated to the reviewer o26x, we will include a discussion of the competitiveness of the random baseline for the SERGIO environment.
The authors propose an intervention design method, Causal Amortized Structure Learning (CAASL), which works by using a transformer, trained with the Soft Actor-Critic (SAC) algorithm, to directly predict the next intervention to perform given a simulator of the environment in which the intervention is to be performed. The rewards are obtained from the amortized posterior over the adjacency matrix of the causal graph given interventional data, using AVICI, a recently proposed likelihood-free approach for obtaining an estimate of the causal graph. The authors demonstrate the amortization performance of CAASL on synthetic data and SERGIO (a single-cell gene expression simulator) and its generalization under distribution shifts. Importantly, CAASL doesn't require inference of the causal graph.
Strengths
- The paper tackles the important problem of designing an intervention policy in environments where the likelihood-based inference of a causal graph is intractable and it might be difficult to utilize the data likelihood to select interventions.
- The use of the pretrained AVICI model as a reward function can potentially be more generally applicable:
  - it formalises the reward of performing an intervention in the environment as the increase in the number of correct entries in the adjacency matrix of the causal graph resulting from the intervention -- and this is computable since we have access to the simulator, and hence to the ground-truth graph (a hedged symbolic reading of this reward is given after this list).
  - this reward function is also shown to approximately optimize a lower bound on the multi-step expected information gain (EIG) of learning the adjacency matrix for the policy.
- The notation and writing of the paper are very clear, making it approachable.
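For concreteness, here is one hedged symbolic reading of this reward (the notation is ours, for illustration, and may not match the paper's exact definition). Writing $q_\phi(A \mid h_t)$ for the amortized (AVICI-based) posterior over adjacency matrices given the experimental history $h_t$, and $A^{*}$ for the ground-truth adjacency matrix available from the simulator, the per-step reward counts the gain in expected correct entries:

$$ r_t \;=\; \mathbb{E}_{A \sim q_\phi(\cdot \mid h_t)}\Big[\sum_{i,j} \mathbf{1}\{A_{ij} = A^{*}_{ij}\}\Big] \;-\; \mathbb{E}_{A \sim q_\phi(\cdot \mid h_{t-1})}\Big[\sum_{i,j} \mathbf{1}\{A_{ij} = A^{*}_{ij}\}\Big]. $$

Summed over an episode, these increments telescope, so the undiscounted return is the total improvement in expected correct entries over the initial observational data.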
Weaknesses
Experiments:
- How dense/sparse can the causal graph be?
- What are some other intervention design approaches and can they be used as baselines?
- Can you provide some insights on why the random baseline is competitive when the intervention type changes on SERGIO?
Writing:
- on what kinds of datasets is CAASL applicable -- what are some other datasets where we can access a simulator?
- can you clearly establish the link between sequential Bayesian experimental design and the multi-step EIG -- where does the related BA bound appear in Bayesian experimental design (such that it doesn't involve a reading of ref. [17])? (The standard form of the BA bound is recalled below, after this list.)
- If the BA bound uses the log-likelihood of the true causal graph under the amortized posterior, how does comparing the estimated and true adjacency matrices make it still connected to the BA bound?
- (minor): can you clarify what is meant by line 59 "As such, CAASL is an intervention design method for performing sample efficient causal structure learning, but is not a new causal structure learning method in itself."
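For reference, the Barber-Agakov (BA) bound referred to in these questions is, in its standard form (our notation; see refs. [7] and [17] for the sequential-design variants), a variational lower bound on the expected information gain obtained by replacing the intractable posterior with an amortized approximation $q_\phi$; in the multi-step setting, $h_T$ denotes the full experimental history generated by the design policy $\pi$:

$$ \mathrm{EIG}(\pi) \;=\; \mathbb{E}_{p(A)\,p(h_T \mid A, \pi)}\!\left[\log \frac{p(A \mid h_T)}{p(A)}\right] \;\geq\; \mathbb{E}_{p(A)\,p(h_T \mid A, \pi)}\big[\log q_\phi(A \mid h_T)\big] \;+\; \mathrm{H}\big[p(A)\big], $$

with equality when $q_\phi$ equals the true posterior; this is where the log-likelihood of the true causal graph under the amortized posterior enters as the quantity being maximized.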
Questions
- just curious, why is it CAASL and not CASL?
Limitations
The authors discuss some of the limitations of their work, but I think the limitations with respect to the experimental results on changing the intervention type, and the types of simulation environments where this method is applicable, could be clarified further.
Thanks a lot for your detailed and very positive review of our work. We provide answers and clarifications to your questions below:
- "How dense/sparse can the causal graph be?": For training the policy, we set the density of graphs to 3 edges per node on average. Amortization across different graph densities can also be achieved by simply sampling during training from a prior over graphs that is denser or sparser (a minimal sketch of such a graph prior is given below).
- "What are some other intervention design approaches and can they be used as baselines?": For the synthetic design environment, any of the approaches listed in L119 are potential baselines. However, we compare with DiffCBED as it is the most recent and has been shown to outperform the other candidates. For the GRN environment, we are not aware of any applicable baseline apart from random, as the others require access to the likelihood.
- "Why is the random strategy competitive when the intervention type changes in SERGIO?": The input to the policy is the experimental history, which only contains the interventional samples and the corresponding {0,1} intervention-target vectors. When the intervention type changes, the semantics of the data-generating process also change significantly, but the policy is unaware of this since its input has the same form. Likewise, no specific inductive biases are encoded in the training procedure or the policy architecture that would lead to good generalization in this case. We will clarify this limitation in the revision. To obtain better results under a different intervention type, one can amortize over intervention types by training the policy on data generated with different intervention types. However, in this work we deliberately train on a narrow domain to demonstrate the OOD generalization properties.
- "On what kinds of datasets is CAASL applicable -- what are some other datasets where we can access a simulator?": CAASL is applicable to any dataset/setting that has a corresponding simulator. Some examples of causal structure learning simulators and applications are given in line 199.
- "Can you clearly establish the link between sequential Bayesian experimental design and the multi-step EIG -- where does the related BA bound appear in Bayesian experimental design?": The link between sequential Bayesian experimental design and the related BA bound is discussed in Blau et al., Ref. [7].
- "Can you clarify what is meant by line 59: 'As such, CAASL is an intervention design method for performing sample efficient causal structure learning, but is not a new causal structure learning method in itself.'": With this sentence, we mean to emphasize that this work focuses on the active learning/intervention design part, rather than on how to do causal structure learning given data.
- "If the BA bound uses the log-likelihood of the true causal graph under the amortized posterior, how does comparing the estimated and true adjacency matrices make it still connected to the BA bound?": As mentioned in line 230, although the proposed reward function is not exactly connected to the BA bound, using the normalized accuracy, i.e., the log-probability of the true graph, would result in a BA bound.
- "Why CAASL and not CASL?": We would like to emphasize both the active learning part and the amortization part, hence the acronym CAASL.
We hope that addresses your concerns and questions. We would be glad to answer any further questions you may have.
I thank the authors for clarifying my questions and for planning to expand on the baselines and their explanations in the final draft. I remain positive about accepting the paper.
We thank all the reviewers for the detailed feedback regarding our work. We are delighted to find that, in general, the reviewers find that our work tackles an important problem (rMit), is well written (ydzf, rMit, EZYL), is a novel and interesting combination of existing ideas (o26x, EZYL), and provides clear empirical analyses (ydzf, EZYL).
In addition, for the camera-ready, we will incorporate the following changes as suggested by the reviewers:
- Summarize the contributions at the end of the introduction.
- Include a schematic figure of the training of CAASL, which shows the network architecture, simulator and the AVICI model used for reward function.
- Experiments with respect to generalization when the experiment budget is larger than that used during training.
- We will expand the limitations section to include the lack of theoretical guarantees with respect to OOD generalization.
- Expanded explanation of the baselines and their limitations.
We have answered each reviewer's questions and concerns, which address the above points in more detail. We would be happy to answer further questions if there are any.
The reviewers are all positive on the paper, with an average score of 6. They appreciate the generality of the approach and feel that it is well-motivated and a novel combination of ideas. They also applaud the inclusion of out-of-distribution evaluations, which shows the generalizability of the approach. However, there are several (not major) weaknesses that they agreed needed to be improved upon before publication; for example, they feel that further discussion on the good performance of the random baseline should be included. Since the authors promise to address these weaknesses in the camera-ready, I feel that this paper can be accepted as a poster.