PaperHub
Rating: 7.0/10 · Poster · 5 reviewers
Individual scores: 7, 7, 8, 6, 7 (min 6, max 8, std. dev. 0.6)
Confidence: 3.2 · Correctness: 3.2 · Contribution: 2.8 · Presentation: 3.2
NeurIPS 2024

Linear Causal Representation Learning from Unknown Multi-node Interventions

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-02
TL;DR

We prove identifiability results and design algorithms for linear causal representation learning from unknown multi-node stochastic interventions.

Abstract

Keywords
Causal representation learning · interventions · score-based methods · identifiability

Reviews and Discussion

Review
Rating: 7

This paper studies identifiability under unknown multi-node interventions (soft/hard), with general causal models (parametric/nonparametric) and linear mixing functions. This work provides both detailed proofs that justify the main theoretical statements and a step-by-step algorithm that shows how to achieve identifiability in practice. Overall, I find that this work serves as an important step for interventional CRL toward more realistic settings.


References

[1] Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, Abhishek Kumar, and Ali Tajer. Score-based causal representation learning with interventions. arXiv:2301.08230, 2023.

[2] Burak Varıcı, Emre Acartürk, Karthikeyan Shanmugam, Abhishek Kumar, and Ali Tajer. Score-based causal representation learning: Linear and general transformations. arXiv:2402.00849, 2024.

[3] Julius von Kügelgen, Michel Besserve, Wendong Liang, Luigi Gresele, Armin Kekić, Elias Bareinboim, David M. Blei, and Bernhard Schölkopf. Nonparametric identifiability of causal representations from unknown interventions. In Proc. Advances in Neural Information Processing Systems, New Orleans, LA, December 2023.

Strengths

This paper is extremely well written and clearly structured: it clearly communicates the motivation, formulation, technical details, and theoretical implications. The experimental results adequately validate the theory in the case of a linear causal model.

Weaknesses

  1. The proposed UMNI-CRL algorithm is claimed to work with general non-parametric causal models; however, the simulation experiments only show results on linear structural equation models. It would be great if the authors could report further experimental results on non-parametric causal models to align with the theoretical claims. If there is a valid reason why this cannot be done, I am also very happy to hear it.

  2. Following the previous point, since this approach requires density estimation, it might not be scalable on nonparametric models. But to be fair, this seems to be a common limitation in many interventional CRL works [1, 2, 3].

  3. Linearity assumption on the mixing function is restrictive, but the authors have acknowledged it and discussed possible future directions to overcome this limitation (sec. 6).

Questions

See the first point in the weaknesses section. I am very happy to raise my rating if this issue is resolved.

Limitations

The authors discussed the remaining open problems and limitations in Section 6.

Author Response

We thank the reviewer for finding our results an important step towards more realistic CRL settings and noting the clarity of the paper.

General causal models: We did not provide results using non-linear causal models since our algorithm, due to its combinatorial nature, is sensitive to input noise. While we are actively trying to make our algorithms work better using generic score estimators, for this response, in the interest of time, we have elected to provide a complementary analysis, where we investigate the performance of our algorithms under different levels of artificially introduced noise.

For this set of experiments, we adopt a non-linear additive noise model with a score oracle. Specifically, the observational mechanism for node $i$ is generated according to

$$
Z_i = \sqrt{Z_{{\rm pa}(i)}^\top {\bf A}_i Z_{{\rm pa}(i)}} + N_i \ ,
$$

where $N_i \sim {\cal N}(0, \sigma_i)$, and the interventional mechanisms set $Z_i = N_i / 2$. This causal model admits a closed-form score function (see [7, Eqs. (393)–(395)]), which enables us to obtain a score oracle. In our experiments, we use this score oracle and introduce varying levels of artificial noise according to

$$
\hat{s}_X(x; \sigma^2) = s_X(x) \cdot \big( 1 + \Xi \big) \ , \quad \mbox{where} \quad \Xi \sim {\cal N}(0, \sigma^2 \cdot {\bf I}_{d \times d}) \ ,
$$

to test the behavior of our algorithm under different noise regimes ($\sigma \in [10^{-3}, 10^{-1.5}]$). Results à la Table 2 versus different $\sigma$ values are provided in Figure 1 in the PDF document attached to the general response.
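A minimal sketch of this sensitivity experiment, using our own illustrative names and a toy chain graph (the paper's actual graphs, quadratic forms ${\bf A}_i$, and score oracle differ):

```python
# Toy reconstruction of the noisy-score-oracle experiment (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n, n_samples = 4, 100_000
sigmas = rng.uniform(0.5, 1.5, size=n)          # exogenous noise scales sigma_i

# Chain graph 1 -> 2 -> ... -> n with Z_i = sqrt(Z_pa(i)^T A_i Z_pa(i)) + N_i;
# with a single parent the quadratic form reduces to a_i * Z_{i-1}^2.
Z = np.zeros((n_samples, n))
Z[:, 0] = sigmas[0] * rng.standard_normal(n_samples)           # root node
for i in range(1, n):
    a_i = rng.uniform(0.5, 1.5)
    Z[:, i] = np.sqrt(a_i * Z[:, i - 1] ** 2) \
              + sigmas[i] * rng.standard_normal(n_samples)

def corrupt_scores(s_x, sigma):
    """Multiplicative corruption  s_hat = s * (1 + Xi),  Xi ~ N(0, sigma^2 I)."""
    return s_x * (1.0 + sigma * rng.standard_normal(s_x.shape))

# Sweep the artificial-noise level over sigma in [1e-3, 1e-1.5] as in the response.
for sigma in np.logspace(-3, -1.5, num=7):
    pass  # run UMNI-CRL on corrupt_scores(score_oracle(X), sigma), record metrics
```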

Score estimation: We kindly note three things regarding the score functions on nonparametric models.

  • Our algorithm is agnostic to the choice of score estimator, and we can adopt popular score-matching methods as mentioned in Line 325, e.g., Song et al. (2020) or Zhou et al. (2020). We also note that score function estimation is finding increasingly many applications in diffusion models and is an active research field. Hence, our algorithms can modularly adopt any new score estimation algorithm and benefit from advances in the score estimation literature.

  • Score vs. density estimation: We also note that score estimation in a high-dimensional setting is generally easier than density estimation, as it avoids the difficulty of finding the normalizing constants.

  • Finally, we emphasize that our algorithm only requires score differences and does not require the score functions themselves. We conjecture that the direct estimation of score differences can be much more efficient than estimating the individual score functions, à la direct estimation of precision difference matrices [R1]. Furthermore, density ratio estimation methods, e.g., classification-based approaches [R2], can be potentially useful for direct score difference estimation (a toy illustration of this idea follows the references below).

[R1] Jiang, Wang, and Leng. "A direct approach for sparse quadratic discriminant analysis." JMLR, 2018.

[R2] Gutmann and Hyvärinen. "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics." JMLR, 2012.
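As a toy illustration of the classification-based route (our own sketch, not part of the paper): a classifier trained to distinguish samples from two environments learns the log density ratio up to a constant, and differentiating its logit with respect to the input yields the score difference directly.

```python
# Minimal sketch: score-difference estimation via a logistic classifier,
# in the spirit of noise-contrastive / density-ratio estimation [R2].
import torch
import torch.nn as nn

d = 10
x_p = torch.randn(5000, d)            # placeholder samples from environment p
x_q = torch.randn(5000, d) + 0.5      # placeholder samples from environment q

clf = nn.Sequential(nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(clf.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()
labels = torch.cat([torch.ones(len(x_p), 1), torch.zeros(len(x_q), 1)])
for _ in range(500):
    opt.zero_grad()
    loss_fn(torch.cat([clf(x_p), clf(x_q)]), labels).backward()
    opt.step()

def score_difference(x):
    # With equal class priors the optimal logit is log p(x) - log q(x),
    # so its input gradient estimates  s_p(x) - s_q(x).
    x = x.detach().clone().requires_grad_(True)
    clf(x).sum().backward()
    return x.grad
```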

Comment

I thank the authors for clarifying and providing additional experiment results. I increased the score correspondingly.

Review
Rating: 7

This paper advances Causal Representation Learning (CRL) by addressing the challenge of using unknown multi-node (UMN) interventions to identify latent causal variables and their structures. The authors develop a score-based CRL algorithm that leverages UMN interventions to guarantee identifiability of latent variables and their causal graphs under both hard and soft interventions, achieving perfect identifiability with hard interventions and identifiability up to ancestors with soft interventions. Their method outperforms existing single-node approaches by ensuring robust recovery of causal structures in more complex, multi-intervention environments.

Strengths

  • Extending the causal representation learning to unknown multi-node interventions

  • Proofs are provided

  • Pseudocode is provided

  • Computational complexity is discussed

  • Limitations are clearly stated

Weaknesses

  • The paper primarily focuses on causal models with linear transformations. This limits its applicability in many real scenarios

  • The applicability of the assumptions in real scenarios was not discussed

  • The method was not applied to real-world data

Questions

  • Can you please elaborate on the computational complexity and on why it is dominated by step 2?

  • Can you please discuss the applicability of the assumptions in real scenarios?

  • I think that adding some real world application can increase the impact of this paper. Is it possible to find such an application?

Limitations

The paper acknowledges certain limitation. One notable limitation is the assumption of linear transformations in the causal models considered. This restricts the applicability to scenarios where causal relationships are adequately approximated by linear relationships. Additionally, while the paper addresses the challenge of UMN interventions, it acknowledges the complexity involved in identifying intervention targets in such settings, which can affect the ability to fully leverage the statistical diversity inherent in UMN interventions.

Author Response

We thank the reviewer for their thoughtful review and accurate summary. We address the questions as follows.

Computational complexity: Stage 1 (score difference estimation) is only performed once before the main algorithm starts. Stage 4 (unmixing procedure for hard interventions) essentially works as a post-processing step that does not pose computational challenges. Below, we elaborate on the computational complexity of Stage 2 and Stage 3.

  • Stage 2: We check the dimension of $\mathcal{V}$ (line 6 of the algorithm) at worst $n \times (2\kappa + 1)^n$ times in Stage 2 of the algorithm. Here, $\kappa$ denotes the maximum possible determinant of a matrix in $\{0,1\}^{(n-1) \times (n-1)}$. In the proof of Lemma 2 in Appendix A.1, we discuss why this choice facilitates the identifiability guarantees. We also discuss how $\kappa$ grows with $n$ in Appendix A.8, and give special cases, such as sparse multi-node interventions, in which $\kappa$ is upper bounded by small numbers even for large $n$ values.

  • Stage 3: Here, we check the dimension of $\mathcal{V}$ (line 23 of the algorithm) at worst $\|\mathbf{W}_{:,j}\|_1 \times \|\mathbf{W}_{:,t}\|_1$ times for every $(t,j)$ in Stage 3 of the algorithm. Note that $\mathbf{W}$ is updated within the steps of Stage 3. Therefore, the exact computational complexity would be a function of the graph structure and the outputs of Stage 2.

  • Empirical tricks: Finally, we note two empirical tricks that can greatly reduce the computational complexity. First, even though $\kappa$ can grow quickly as $n$ becomes larger, in practice, setting $\kappa=2$ usually works fine (see additional experiment results in the general response). Second, additional empirical tricks can greatly increase the speed of the algorithm, e.g., dividing the columns of $\mathbf{W}$ by the greatest common divisor of their entries after every step (a small illustration follows below). Since our focus is on establishing the identifiability results, we omit the investigation of empirical tricks that would disturb the flow of the paper and distract from the main focus.
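For concreteness, a minimal numpy sketch of the GCD reduction trick (our own illustration; `reduce_columns` is a hypothetical helper, not from the paper):

```python
import numpy as np

def reduce_columns(W: np.ndarray) -> np.ndarray:
    """Divide each column of an integer matrix by the GCD of its entries."""
    W = W.copy()
    for j in range(W.shape[1]):
        g = np.gcd.reduce(np.abs(W[:, j]))
        if g > 1:                     # g == 0 only for an all-zero column
            W[:, j] //= g
    return W
```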

Toward realistic settings: We believe this paper serves as a significant step in this direction by removing the stringent assumption of single-node interventions. For instance, genomics datasets such as Perturb-seq (Norman et al. 2019) are used by single-node interventional CRL (Zhang et al. 2023). However, the genomic interventions are known to have unknown off-target effects (Fu et al. 2013, Squires et al. 2020) that violate the single-node intervention assumption. Therefore, establishing unknown multi-node interventional results is fundamental to unlocking the use of CRL in realistic datasets. The implementation of real applications is beyond the scope of this work and is an important future direction.

References

T. M. Norman, M. A. Horlbeck, J. M. Replogle, A. Y. Ge, A. Xu, M. Jost, L. A. Gilbert, and J. S. Weissman. Exploring genetic interaction manifolds constructed from rich single-cell phenotypes. Science, 365(6455):786–793, 2019.

J. Zhang, C. Squires, K. Greenewald, A. Srivastava, K. Shanmugam, and C. Uhler. Identifiability guarantees for causal disentanglement from soft interventions. NeurIPS, 2023.

Y. Fu, J. A. Foden, C. Khayter, M. L. Maeder, D. Reyon, J. K. Joung, and J. D. Sander. High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nature Biotechnology, 31(9):822–826, 2013.

C. Squires, Y. Wang, and C. Uhler. Permutation-based causal structure learning with unknown intervention targets. UAI, 2020.

Comment

I thank the authors for the response. After reviewing the reviews and considering the responses, I will raise my score to 7.

Review
Rating: 8

This work studies interventional causal representation learning, where one has access to interventional data, to identify latent causal factors and latent DAG in the unknown multi-node interventions regime. The authors consider a setting where the mixing function is linear and the latent causal model is nonparametric. Under the assumption of sufficient interventional diversity, the authors use score function arguments to show that the underlying causal factors of variation (and DAG) can be recovered (1) up to permutation and scaling from stochastic hard interventions and (2) up to ancestors from soft interventions. The authors propose a score-based framework (UMNI-CRL) and evaluate it on synthetic data generated from Erdős–Rényi random graph model.

优点

  • This work provides significant results in the unknown multi-node intervention setting, which is much more realistic than the common single-node intervention regime. As opposed to other works, this work studies CRL from a more general class of multi-node interventions (stochastic hard and soft).
  • The paper is well-written, the concepts are explained well, and the theoretical identifiability results add a lot of value to the current CRL literature.
  • The use of score functions and score differences in the observation space to estimate the unmixing function, especially for the UMN setting, is a novel and interesting approach for CRL.
  • This work is the first to establish latent DAG recovery in the UMN setting under any type of multi-node intervention for arbitrary nonparametric latent causal models.

Weaknesses

Although the theoretical contribution of this work is strong, the empirical evaluation is quite weak compared to other works in CRL. There are only experiments for n=4 causal variables. There is also no baseline comparison of the proposed framework with other methods in the UMN setting (e.g., [1]). Also, some discussions are a bit abridged and could use more elaboration in the paper (see below for details).

[1] Bing et al. “Identifying Linearly-Mixed Causal Representations from Multi-Node Interventions” CLeaR 2024.

Questions

  • I would like some clarification on the intervention regularity condition. Specifically, why does the additional term ensure that multi-node interventions have a different effect on different nodes? It would be good to elaborate on this condition when introduced since it is a central assumption that needs to be satisfied for the results to hold.
  • How do you obtain $\Lambda$ in Eq. (14)? It seems that this matrix encodes the summands with the latent-space score differences. However, since the distribution of the latents is unknown, how would you go about estimating $\Lambda$ and the score differences $\Delta S_X$ in general cases of nonparametric distributions?
  • How do you learn the integer-valued vectors $\mathbf{w}$ in Stage 2 of the algorithm? From Eq. (18), it seems that $\mathcal{W}$ is a fixed predefined set and you choose the vectors $\mathbf{w} \in \mathcal{W}$ that satisfy a specific condition in the algorithm. To my understanding, this is central to recovering the approximate unmixing $\mathbf{H}^*$ up to a combination of the rows of the true unmixing $\mathbf{G}^{\dagger}$. I would appreciate it if the authors could elaborate on how this procedure was done.
  • From Appendix A.8, it seems that $\kappa$ is determined by the number of causal variables $n$. Could the authors give some more intuition on what $\kappa$ represents in Stage 2 with respect to how the unmixing is recovered?
  • Are there any distributional assumptions on the exogenous noise in the latent additive noise causal model?
  • It seems that the UMN hard intervention result (Theorem 1) requires a latent model with additive noise. Would perfect recovery still be possible for latent models with non-additive noise under UMN hard interventions?
  • The empirical results suggest that increasing sample size improves DAG recovery, which is intuitive. However, what do the results look like as the number of causal variables scales up? Currently, the authors only show results for n=4 latent causal variables. I only offer this as a suggestion due to the short rebuttal period.
  • How would the assumptions made need to change to be applied to general mixing functions? I know that generality in one aspect of the model (i.e., general SCM) may require other aspects to take some parametric form (i.e., linear mixing) for identifiability guarantees, but do the authors have any intuition on how to achieve identifiability results for the UMN setting in a completely nonparametric setup?

Limitations

Limitations are discussed in Section 6.

Author Response

We thank the reviewer for acknowledging our strong theoretical and algorithmic contributions. We address the questions as follows.

Empirical results:

  • Increasing $n$: Thanks for the suggestion. Please refer to the general response for experiment results with up to $n=8$ nodes.
  • Baseline: We note that Bing et al. use “do” interventions. Hence, the algorithms are not comparable.

Interventional regularity: We explain the rationale and details as follows.

  • First, we emphasize that interventional regularity is not a restrictive assumption. In Appendix A.8 (Lemma 5), we present sufficient conditions under which interventional regularity holds. For instance, under a hard intervention on a linear causal model, if the distribution of the exogenous noise remains the same, then interventional regularity holds.

  • Next, we clarify what we mean by the effect of an intervention. Essentially, $\frac{\partial \log p_i/q_i}{\partial z_j}$ is the effect of intervening on node $i$ on the score associated with node $j$. Note that we use combinations of different multi-node environments to generate new score difference functions. Without the additional term in Eq. (12), $\frac{\partial \log p_j/q_j}{\partial z_j}$, the ratio in Eq. (11) considers only a single-node intervention on $i$ and its effects on the scores of $i$ and $j$.

  • To ensure that the effect of a combined intervention is different on the scores of $i$ and $j$, we need to consider interventions on both $i$ and $j$. As such, the additional term in Eq. (12), $\frac{\partial \log p_j/q_j}{\partial z_j}$, denotes the effect of the additional intervention on $j$ on the node $j$.

Score function differences:

  • We do not require estimating $\Lambda$. It is correct that $\Lambda$ encodes score differences in the latent space, and we cannot estimate it. Therefore, the algorithm only takes $\Delta S_X$ as input. $\Lambda$ is defined to provide intuition on the rationale of the algorithm and the connection between latent score differences and $\Delta S_X$ in Eq. (17) (a short derivation of this connection for the square invertible case is sketched after the references below).

  • Our algorithm is agnostic to the choice of score estimator, and we can adopt popular score-matching methods as mentioned in Line 325, e.g., Song et al. (2020) or Zhou et al. (2020). We also note that score function estimation is finding increasingly many applications in diffusion models and is an active research field. Hence, our algorithms can modularly adopt any new score estimation algorithm and benefit from advances in the score estimation literature.

  • We also emphasize that our algorithm only requires score differences and does not require the score functions themselves. We expect that the direct estimation of score differences can be much more efficient than estimating the individual score functions, à la direct estimation of precision difference matrices [R1]. Furthermore, density ratio estimation methods, e.g., classification-based approaches [R2], can be potentially useful for direct score difference estimation.

[R1] Jiang et al. A direct approach for sparse quadratic discriminant analysis. JMLR, 2018.

[R2] Gutmann and Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. JMLR, 2012.
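For intuition, here is a short derivation of the score transport that underlies this connection, specialized by us to the square invertible case $d=n$ (the paper's Eq. (17) covers general linear mixing via the encoder $\mathbf{G}^{\dagger}$). From the change of variables $X = \mathbf{G} Z$,

$$
p_X(x) = \frac{p_Z(\mathbf{G}^{-1}x)}{|\det \mathbf{G}|}
\;\;\Longrightarrow\;\;
s_X(x) = \mathbf{G}^{-\top} s_Z(\mathbf{G}^{-1}x)
\;\;\Longrightarrow\;\;
\Delta s_X(x) = \mathbf{G}^{-\top}\, \Delta s_Z(z)\big|_{z=\mathbf{G}^{-1}x} \ ,
$$

so observed score differences across environments are a fixed linear image of the latent score differences.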

Additive noise models (ANM): Please refer to our general response.

Stage 2 of the algorithm: We elaborate on learning the vectors $\mathbf{w}$ and the role of $\kappa$ as follows.

  • Note that $\Delta S_X \cdot \mathbf{w}$ is essentially a combination of SN latent score differences via Eqs. (15) and (17) (also shown in (30)). Also, since the score difference $\nabla \log p_i(z_i \mid z_{{\rm pa}(i)}) - \nabla \log q_i(z_i \mid z_{{\rm pa}(i)})$ is a function of $Z_{{\rm pa}(i)}$ and $Z_i$, the dimension of the image of this function is 1 if and only if the intervened node $i$ has no parents.

  • Leveraging this property, at the first step $t=1$, we search for a linear combination of the given MN environments (via $\mathbf{w}$) that emulates a single-node intervention on a root node. For such a vector $\mathbf{w}$, using Eqs. (15) and (17), the image of $\Delta S_X \cdot \mathbf{w}$ contains only one vector (up to scaling), which is the row of the encoder $\mathbf{G}^{\dagger}$ corresponding to the $i$-th node, where $i$ is a root node.

  • Subsequently, at each step, we follow the same routine to estimate a row of the true encoder. While checking the dimension of the emulated intervention, we project $\Delta S_X \cdot \mathbf{w}$ onto the nullspace of the submatrix recovered so far. This ensures that the learned encoder will be full-rank.

  • Role of $\kappa$: The vector $\mathbf{w} = [\mathbf{D}^{-1}]_i$ is a valid choice that makes $\Delta S_X \cdot \mathbf{w}$ correspond to a single-node intervention. Then, by constructing a set $\mathcal{W}$ that contains the rows of $\mathbf{D}^{-1}$, we ensure that the procedure described above will work by an exhaustive search over $\mathcal{W}$. We note that the entries of $\mathbf{D}^{-1}$ can be found via the cofactor matrix of $\mathbf{D}$, for which $\kappa$ denotes the maximum possible entry. For the detailed derivation, please refer to lines 479-490 in Appendix A.1. A schematic sketch of this search is given below.
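The following is our own schematic reconstruction of the Stage-2 search, under the assumption that `dSX` holds sample-wise observed score differences per environment; it is meant only to make the exhaustive-search structure concrete, not to reproduce the authors' implementation:

```python
import itertools
import numpy as np

def stage2_search(dSX, n, kappa=2, tol=1e-6):
    """dSX: array of shape (n_samples, d, n), one observed score-difference
    column per environment pair. Returns n recovered encoder rows."""
    rows = []
    for _ in range(n):
        P = np.eye(dSX.shape[1])
        if rows:
            R = np.stack(rows)                    # rows recovered so far
            P -= R.T @ np.linalg.pinv(R.T)        # project onto nullspace of R
        for w in itertools.product(range(-kappa, kappa + 1), repeat=n):
            if not any(w):
                continue
            # Emulated intervention: combine environments, then project.
            V = (dSX @ np.asarray(w, float)) @ P.T          # (n_samples, d)
            if np.linalg.matrix_rank(V, tol=tol * np.linalg.norm(V)) == 1:
                _, _, vt = np.linalg.svd(V, full_matrices=False)
                rows.append(vt[0])                # the single image direction
                break
    return np.stack(rows)
```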

Intuition for general mixing functions: Our core technical idea is using combinations of MN interventions to construct new interventions with desired properties, e.g., sparsity. Our intuition is that we can extend the intervention matrix in Eq. (4) to handle multiple interventional mechanisms. For instance, to represent two interventional mechanisms per node, the columns of the intervention matrix would be in $\{0,1\}^{2n}$. Subsequently, the goal is to find combinations of MN environments that create two different SN interventions for each node, so that the problem reduces to the known results for general transformations under two SN interventions per node (see references [6], [9]). However, building on this intuition to prove identifiability is challenging and is a major direction for future work.

Comment

I greatly appreciate the authors taking the time to answer my questions and provide clarifications. My questions and concerns have been addressed quite well in the response. The new empirical results for a larger number of causal variables further strengthen the paper. I believe this is a high-quality submission with significant theoretical results of great interest to the CRL community. Thus, I raise my score to 8.

Review
Rating: 6

This paper extends previous results on using score functions for causal representation learning to settings with unknown multi-node interventions. This new setting poses significant new challenges as opposed to the single-node intervention case. The authors first present theoretical identifiability results for hard interventions with a latent additive noise model and for soft interventions. They then propose an algorithm called UMNI-CRL and test it on a synthetic linear Gaussian dataset.

Strengths

The paper is clearly written, easy to follow, and well motivated.

Weaknesses

  1. The transformation from latent to observed is noiseless, which could be a limitation.

  2. Line 199 says: "This regularity condition ensures that the effect of a multi-node intervention is not the same on different nodes". But how realistic or necessary is this condition? It seems very possible that an intervention can cause two downstream nodes to have the same effect even though these two nodes are not influenced the same way by all types of interventions.

  3. The experiments are only on synthetic datasets, but I don't think that is a big issue.

  4. Some potentially missing citations:

    [1] Kumar, Abhinav, and Gaurav Sinha. "Disentangling mixtures of unknown causal interventions." Uncertainty in Artificial Intelligence. PMLR, 2021.

    [2] Jiang, Yibo, and Bryon Aragam. "Learning nonparametric latent causal graphs with unknown interventions." Advances in Neural Information Processing Systems 36 (2024).

Questions

  1. UMNI-CRL requires estimating the score function. How do you ensure a good estimate of the score function so that the algorithm is useful in practice?
  2. One small question: in lines 141-143, it is mentioned that if a node is not intervened on, perfect identifiability is not possible. But aren't there cases like A→B where I don't need to intervene on A?

Limitations

  1. Theorem 1 only works for additive noise models.
  2. The transformation from latent to observed is noiseless.
  3. Experiments are only on synthetic dataset.
  4. I am unsure if the algorithm is practical because it needs to estimate the score function.

Author Response

We thank the reviewer for acknowledging the challenges of the unknown multi-node intervention setting and noting the clarity of the paper. We address the raised concerns as follows.

Noiseless transformation: The current scope of the paper cannot handle noisy transformations. We also kindly note that the closely related CRL literature (see references [3]-[10] in the paper) also considers deterministic transformations $X = g(Z)$. Our paper's primary goal is to relax the restrictive assumption of single-node interventions. As such, we consider noisy transformations a major future direction.

Interventional regularity: We thank the reviewer for raising the need for elaboration on the interventional regularity condition. We explain its rationale and emphasize that it is not a restrictive assumption as follows.

  • First, we want to clarify what we meant by the "effect of an intervention." Essentially, $\frac{\partial}{\partial z_j}\log\frac{p_i(z_i \mid z_{{\rm pa}(i)})}{q_i(z_i \mid z_{{\rm pa}(i)})}$ is the effect of intervening on node $i$ on the score associated with node $j$. Therefore, the effect in the score function is not on the downstream nodes of $i$ but on the parents of $i$.

  • Note that we use combinations of different multi-node environments to generate new score difference functions. Therefore, to ensure that the effect of this new "combined" intervention will be different on the $i$-th and $j$-th coordinates of the score function, we require the ratio in Eq. (12) to be non-constant.

  • Next, we emphasize that interventional regularity is not a restrictive assumption. In Appendix A.8, we present sufficient conditions that make interventional regularity valid (see Lemma 5). For instance, if we consider a linear latent causal model under a hard intervention and the distribution of the exogenous noise term remains the same after the intervention, then interventional regularity is satisfied.

  • Furthermore, note that if there exists $k \in {\rm pa}(j) \setminus {\rm pa}(i)$, then $\log\frac{p_j(z_j \mid z_{{\rm pa}(j)})}{q_j(z_j \mid z_{{\rm pa}(j)})}$ is a function of $Z_k$, whereas the other terms in Eq. (12) are not. This implies that the ratio in Eq. (12) is not a constant function of $Z$ and again exemplifies that interventional regularity is not a restrictive condition.

Estimating score function differences:

  • Our algorithm is agnostic to the choice of score estimator, and we can adopt popular score-matching methods as mentioned in Line 325, e.g., Song et al. (2020) or Zhou et al. (2020). We also note that score function estimation is finding increasingly many applications in diffusion models and is an active research field. Hence, our algorithms can modularly adopt any new score estimation algorithm and benefit from advances in the score estimation literature.

  • We also emphasize that our algorithm only requires score differences and does not require the score functions themselves. We conjecture that the direct estimation of score differences can be much more efficient than estimating the individual score functions, à la direct estimation of precision difference matrices [R1]. Furthermore, density ratio estimation methods, e.g., classification-based approaches [R2], can be potentially useful for direct score difference estimation.

[R1] Jiang, Wang, and Leng. "A direct approach for sparse quadratic discriminant analysis." JMLR, 2018.

[R2] Gutmann and Hyvärinen. "Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics." JMLR, 2012.

Non-identifiability when missing an intervened node: In lines 141-143, we meant that "when a node is not intervened on, perfect identifiability is not possible in general," i.e., without imposing additional restrictions such as structural assumptions. Our reference for non-identifiability is Proposition 5 of Squires et al. (2023). Their proof of non-identifiability only requires that the non-intervened node $i$ has at least one parent. We will add this note to the updated manuscript.

We note that the example of $A \rightarrow B$ under an intervention on only $B$ is also discussed in Remark 2 of Squires et al. (2023). Even though the graph $A \rightarrow B$ can be discovered in this specific case, we are not aware of any results for the identifiability of the latent variables. Squires et al. (2023) empirically suggest that the latent variables are not identifiable.

Additive noise models (ANM): Please refer to our general response.

Additional references: Thank you for the suggestions; we will include and discuss them in the paper. In summary, Jiang and Aragam (NeurIPS 2023) focus on recovering the latent DAG without recovering the latent variables as opposed to our complete CRL objectives. On the other hand, Kumar and Sinha (2021) focus on an entirely different setting in which the causal variables are observed, and the distributions are given in a mixture. In contrast, our CRL setting focuses on latent variables observed through the same transformation when observing distinct interventional distributions.

Comment

Thanks for your rebuttal. I have raised my score, leaning towards acceptance.

Review
Rating: 7

This paper introduces new identifiability results for CRL in environments with unknown multi-node interventions. It shows that, with sufficiently diverse interventional environments, one can achieve identifiability up to ancestors using soft interventions and perfect identifiability using hard interventions. The paper also provides an algorithm with identifiability guarantees.

Strengths

  • The paper tackles the complex and underexplored multi-node intervention setting. The established identifiability can be crucial for extending current CRL theories into more practical contexts.
  • The introduced algorithm that leverages score functions with different interventional environments is also interesting and insightful.
  • The paper is well-motivated and articulated with high clarity.

Weaknesses

  • The proposed algorithm, while theoretically sound, seems computationally demanding. In fact, even a 4-node low-dimensional case requires a large number of environments and samples. The paper could benefit from a deeper discussion on the scalability of the algorithm.
  • The current evaluation of the algorithm is limited to synthetic simulations. Expanding it to more realistic datasets would substantively improve its practical significance.

Questions

How effectively does the proposed algorithm scale to more nodes and higher dimensions?

Limitations

The paper acknowledges its main limitation: the reliance on linear transformations.

Author Response

We thank the reviewer for finding our results crucial for extending CRL into more practical contexts, for finding our algorithm insightful, and for noting the clarity of the paper. We address the raised questions about the algorithm’s scalability as follows.

  • Dimension of $X$: Our algorithm is readily scalable to arbitrarily high-dimensional observations $X$. Please see the additional experiments reported in our general response, in which we use $d=50$.

  • Dimension of $Z$: Please see the additional experiments reported in our general response, in which we use $n \in \{4, 5, 6, 7, 8\}$ latent nodes.

  • Number of environments: We note that $n$ environments are necessary in general (without further structural assumptions) for identifiability via single-node interventions (shown by Squires et al. 2023, Proposition 5). Since our unknown multi-node intervention setting subsumes single-node interventions, we require at least $n$ environments as well.

  • Number of samples: We remark that the main purpose of our algorithm is to establish a framework for identifiability via multi-node interventions. In future work, we aim to address efficient score difference estimation, which is a separate line of work that can significantly increase the efficiency of our framework. We also kindly note that, even in the much simpler single-node intervention setting, the related CRL literature uses a similar number of samples to the $10^5$ that we used in our experiments for good performance (e.g., $10^5$ samples for $n=5$ nodes in Squires et al. (2023), $5 \times 10^4$ samples for $n=5$ nodes in Buchholz et al. (2023), and $5 \times 10^4$ samples for $n=5$ nodes in Varici et al. (2024)).

  • Toward realistic settings: Finally, we acknowledge the need for working with more realistic datasets in the CRL field. We believe this paper serves as a significant step in this direction by removing the stringent assumption of single-node interventions, and we leave addressing realistic applications to future work.

Comment

Thank you for your response to my question. I increased the score to 7.

Author Response

We thank all the reviewers for their thorough feedback and thoughtful questions. Below we address some shared questions by the reviewers.

Additional experiments

We address the shared concerns of the reviewers regarding the scalability of the algorithm via the following additional experiment results.

Setting: We follow the same setting as in Section 5 of the paper: the Erdős–Rényi model with density 0.5 and linear structural equation models (SEMs) with Gaussian noise (a minimal data-generation sketch follows the list below).

  • Dimension of observed variables $X$: We increase $d$ to $50$ in these additional experiments.
  • Number of latent nodes: We perform experiments for $n \in \{4,5,6,7,8\}$ nodes. We also note that $n=8$ is the largest graph size considered in the closely related single-node interventional CRL literature (e.g., Squires et al. (2023) and Buchholz et al. (2023) consider 5 nodes, Varici et al. (2024) consider 8 nodes, von Kügelgen et al. (2023) consider 2 nodes).
  • We use $n_{\rm s} = 10^5$ samples per realization and repeat the experiments 100 times for each $(n,d)$ pair.
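A minimal sketch of this data-generating setup (our own illustration with hypothetical helper names; the authors' actual weight ranges and mixing construction may differ):

```python
import numpy as np

def sample_setting(n=8, d=50, n_samples=100_000, density=0.5, seed=0):
    """Erdos-Renyi DAG + linear-Gaussian SEM + linear mixing to d dimensions."""
    rng = np.random.default_rng(seed)
    # Strictly upper-triangular weighted adjacency => acyclic by construction;
    # each edge is present with probability `density`.
    A = np.triu(rng.random((n, n)) < density, k=1) * rng.uniform(0.5, 1.5, (n, n))
    N = rng.standard_normal((n_samples, n))       # exogenous Gaussian noise
    Z = np.zeros_like(N)
    for i in range(n):                            # ancestral order: Z_i = sum_j A_ji Z_j + N_i
        Z[:, i] = Z @ A[:, i] + N[:, i]
    G = rng.standard_normal((d, n))               # linear mixing (full column rank a.s.)
    X = Z @ G.T                                   # observations X = G Z
    return A, Z, X
```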

The table below shows that the average rate of incorrect mixing entries, captured by $\ell_{\rm soft}$ and $\ell_{\rm hard}$, remains low for increasing values of $n$. The graph recovery metric SHD increases with $n$, since the expected number of edges also increases with $n$ under a fixed edge density.

| $n$ | $d$ | SHD (Soft) | $\ell_{\rm soft}$ | SHD (Hard) | $\ell_{\rm hard}$ |
| --- | --- | --- | --- | --- | --- |
| 4 | 50 | 0.44 | 0.72 | 0.04 | 0.11 |
| 5 | 50 | 0.96 | 1.25 | 0.05 | 0.10 |
| 6 | 50 | 2.41 | 3.20 | 0.09 | 0.14 |
| 7 | 50 | 4.22 | 6.00 | 0.11 | 0.16 |
| 8 | 50 | 5.67 | 8.75 | 0.10 | 0.16 |

Empirical trick: We recall that our algorithm involves searching for proper vectors $\mathbf{w} \in \{-\kappa,\dots,\kappa\}^n$, where $\kappa$ denotes the maximum determinant of a matrix in $\{0,1\}^{(n-1) \times (n-1)}$. Even though $\kappa$ is a function of $n$, e.g., $\kappa=2$ for $n=4$ and $\kappa=5$ for $n=6$, we observe that setting $\kappa=2$ does not noticeably degrade performance. Therefore, we set $\kappa=2$ in all our experiments to reduce runtime.

General causal models

In addition to our experiments with linear causal models, we investigate general causal models. Specifically, we provide a sensitivity analysis where we investigate the performance of our algorithms under different levels of artificially introduced noise.

For this set of experiments, we adopt a non-linear additive noise model with a score oracle. Specifically, the observational mechanism for node $i$ is generated according to

$$
Z_i = \sqrt{Z_{{\rm pa}(i)}^\top {\bf A}_i Z_{{\rm pa}(i)}} + N_i \ ,
$$

where $N_i \sim {\cal N}(0, \sigma_i)$, and the interventional mechanisms set $Z_i = N_i / 2$. This causal model admits a closed-form score function (see [7, Eqs. (393)–(395)]), which enables us to obtain a score oracle. In our experiments, we use this score oracle and introduce varying levels of artificial noise according to

$$
\hat{s}_X(x; \sigma^2) = s_X(x) \cdot \big( 1 + \Xi \big) \ , \quad \mbox{where} \quad \Xi \sim {\cal N}(0, \sigma^2 \cdot {\bf I}_{d \times d}) \ ,
$$

to test the behavior of our algorithm under different noise regimes ($\sigma \in [10^{-3}, 10^{-1.5}]$). Results à la Table 2 versus different $\sigma$ values are provided in Figure 1 in the PDF document attached to the general response.

Additive noise models (ANM)

  • We emphasize that the core component of our work – minimizing score differences to estimate the true encoder – does not require ANMs, as shown by Theorem 2 for soft interventions.
  • The ANM is introduced for the analysis of CI tests in Stage 4 (hard interventions). Specifically, given a different causal model, it may be possible to analyze Stage 4 of the algorithm in a different way. For simplicity, we have adopted ANMs, which are commonly used in both the causal discovery and CRL literature (e.g., for perfect identifiability, Squires et al. (2023), Buchholz et al. (2023), Varici et al. (2024), and Bing et al. (2024) use ANMs).
  • Finally, we only require the exogenous noise in the ANM to have full support, which is already implied by the full support of $z$. Hence, we do not make any specific assumptions for the ANM.
Final Decision

This paper develops new results for multi-node causal representation learning, which is an important open direction for the field. All reviewers are in favor of acceptance. Clear accept.