Learning Linear Causal Representations from General Environments: Identifiability and Intrinsic Ambiguity
Abstract
Reviews and Discussion
This paper considers the task of learning linear causal representations with data collected from general environments. The authors show that there exists a surrounded-node ambiguity (SNA) which is essentially unavoidable in their setting. On the other hand, identification up to SNA is possible under mild conditions in the considered setting. An algorithm, LiNGCReL, is further proposed to achieve this identifiability guarantee.
Strengths
- Learning linear causal representations using soft interventions seems to be new.
- The surrounded-node ambiguity (SNA) identifiability is new.
Weaknesses
- The assumptions imposed do not seem to be weaker than existing ones.
- Many places/claims are vague or incorrect, and the presentation is poor.
Questions
- What does "general environments" mean? I don't think it is a widely used concept, so perhaps describe it more explicitly and accurately.
- Definition 2: this definition is questionable, or at least inaccurate.
- Line 105: here "i ∈ S" appears, and then "for all i"? What is i, and what is the difference between the two occurrences of i?
- Further, when you write "⇔", what does it indicate?
- Similarly in Def. 4, line 121: it is difficult to interpret what you mean by "<=>". Can you give a complete sentence of the condition?
- More importantly, I find several motivations/claims are not convincing/accurate, which is particularly critical to a theory-heavy paper:
- A motivation for considering soft interventions is that "the latent variables are unknown and need to be learned from data, it is unclear how to perform interventions that only affect one variable", but how do soft interventions avoid this concern?
- Lines 123-124: "if there exists some i ∈ surG(j), then ambiguities may arise for the causal variable at node j, since any effect of j on any of its child k can also be interpreted as an effect of i." In Pearl's book or in the topic of mediation analysis, you can define clearly the effect of j and the effect of i on child k, so why would I interpret the effect of j as that of i? That is not sound.
- I also have a question about the example with three causal variables in Appendix E, which aims to show the difference between hard and soft interventions. It seems that one can distinguish the two even with soft interventions, e.g., by choosing the intervention parameters so that two of the variables would still be dependent in the left model while, in the right model, the corresponding variables are independent.
- lines 139-142: "in contrast with existing literature on single-node interventions, we impose no similarity constraints on the environments.": however, assumptions 4 and 5 seem to assume the environments are sufficiently diverse, which are not weaker.
I will stop here, since the presentation and rigor issues have determined my decision for the submitted version. While I tend to believe the paper may contribute some interesting ideas and results, at least a major revision is required.
Limitations
See above.
We thank the reviewer for providing insightful feedback and suggestions. Below are our responses to the reviewer’s questions and concerns about our paper.
Q1: What does "general environments" mean? I don't think it is a widely used concept, so perhaps describe it more explicitly and accurately.
As the reviewer notes, the “general environments” setting is relatively new in the literature. As a result, we include discussions of this concept in several parts of our paper. The most notable one might be at the beginning of Section 4, where we wrote that “In this section, we consider learning causal models from general environments. Specifically, we assume that the environments share the same causal graph, but the dependencies between connected nodes (latent variables) are completely unknown, and, in contrast with existing literature on single-node interventions, we impose no similarity constraints on the environments.” The motivation for considering such a setting is discussed in Section 1, lines 41-51. We thank the reviewer for pointing this out, and we will improve the exposition by moving the definition from Section 4 to the introduction so that the reader has a more precise understanding of the term earlier on.
Q2: Definition 2: this definition is questionable, or at least inaccurate. Line 105: here "i ∈ S" appears, and then "for all i"? What is i, and what is the difference between the two occurrences of i? Further, when you write "⇔", what does it indicate?
Thank you for pointing out these potential sources of confusion; below we clarify why the statements are accurate as written and how we plan to address the readability issues.
"i\in S" is part of the sentence “......a subset of latent variables ”, so it describes for which zi’s we are defining the soft intervention. On the other hand, "if for all " belongs to the sentence that follows: if , , we have . So these two ’s are essentially the subscripts of different objects: latent variables and distribution , not contradictory statements or typos. We will use different subscripts in the two sentences for improved readability in a revised version.
The notation "⇔" stands for “if and only if”. We will replace the arrow with the "if and only if" text to make it clearer in a revised version.
So, to summarize, Definition 2 can be rephrased as follows: “We say that a collection of environments is a set of (soft) interventions on a subset S of latent variables if the following holds: for every node i, the causal mechanism of z_i (i.e., its conditional distribution given its parents) differs across the environments if and only if i ∈ S.”
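To make the rephrased definition concrete, here is a minimal synthetic sketch (our own illustration, not code or notation from the paper) of two environments forming a soft intervention on a subset S of a linear, non-Gaussian latent SEM; the graph, weights, and noise choice are made-up illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 10_000                       # number of latents, samples per environment
S = {2}                                # nodes whose mechanisms differ across environments

A_base = np.array([[0.0, 0.0, 0.0],    # base weighted DAG: z0 -> z2, z1 -> z2
                   [0.0, 0.0, 0.0],
                   [0.8, -0.5, 0.0]])

def sample_env(A):
    """Ancestrally sample z = A z + eps with non-Gaussian (Laplace) noise."""
    eps = rng.laplace(size=(n, d))
    z = np.zeros((n, d))
    for i in range(d):                 # nodes are assumed to be in topological order
        z[:, i] = z @ A[i] + eps[:, i]
    return z

# A soft intervention on S changes only the mechanisms (rows of A) indexed by S;
# all other rows are identical across the environments.
A_int = A_base.copy()
A_int[2] = [0.1, 0.9, 0.0]

z_base, z_int = sample_env(A_base), sample_env(A_int)
```

In this toy collection, only the mechanism of node 2 (the single element of S) changes between the two environments, which is exactly the "if and only if i ∈ S" condition in the rephrased definition.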
Q3: Similarly in Def. 4, line 121: it is difficult to interpret what you mean by <=>. Can you give a complete sentence of the condition?
Similar to Q2, we use the mathematical convention that "<=>" means "if and only if", so the sentence in Definition 4 should be read as a single "if and only if" statement. We will replace the arrow with the explicit "if and only if" wording, spelling out both sides of the equivalence as complete clauses, for improved readability.
Q4.1: A motivation for considering soft interventions is that "the latent variables are unknown and need to be learned from data, it is unclear how to perform interventions that only affect one variable", but how do soft interventions avoid this concern?
We are afraid that the reviewer may have mixed up the following two settings: one is the “general environment” setting, which is the main focus of our paper (Theorems 1 and 3 and the LiNGCReL algorithm); the other is the “soft intervention” setting, for which we only include a negative result (Theorem 2) to highlight its difference from the general environment setting. In fact, soft interventions inherit exactly the limitation that the reviewer quoted, and this is the main motivation for us to consider general environments, which do not have this limitation. We included the soft-intervention result because proving the impossibility result for soft interventions only makes it stronger: the negative result then holds for general environments as well. All our positive and constructive results are for general environments, which bypass exactly the requirement that the reviewer describes as a problem.
Q4.2: lines 123-124: "if there exists some i ∈ surG(j), then ambiguities may arise for the causal variable at node j, since any effect of j on any of its child k can also be interpreted as an effect of i." In Pearl's book or the topic of mediation analysis, you can define clearly the effect of j and the effect of i on child k, and why would I interpret the effect of j as that of i? That is not sound.
We thank the reviewer for pointing to the references. Note that this sentence is only an intuitive description and serves as a supplement to Definition 3, which is mathematically rigorous.
In this section, the word “interpret” is used in the following sense: from the learner’s perspective, the task is to find a suitable data model that matches the data. Intuitively, there would be two indistinguishable but non-equivalent data generating processes, i.e., two distinct causal mechanisms that match the observed data. One can of course define these theoretical causal quantities, but given the data one cannot conclude which of the two causal effects is at play. We hope this explanation makes things clear, and we point the reviewer to Definition 3 for a rigorous mathematical formulation.
Due to space limits, the remaining parts of the rebuttal can be found in the official comment that follows this post.
Q4.3: I also have a question with the example with three causal variables in Appendix E, which aims to show the difference between hard and soft interventions. It seems that one can distinguish the two even with soft interventions, e.g., by choosing the intervention parameters so that two of the variables would still be dependent in the left model while, in the right model, the corresponding variables are independent.
We thank the reviewer for raising this question. The reason why the two models in Example 3 cannot be distinguished is that, given any soft intervention on the first model, there always exists a soft intervention on the second one that yields the same data distributions. This is not to say that any two soft interventions on the two models would yield the same data distribution, which is actually impossible. Thus, by constructing two soft interventions separately for the two candidate models and proving that they are different, one cannot conclude that these two models are distinguishable. In practice this translates to: "given the data that I have, I cannot distinguish whether it was generated by a true structural causal model SCM1 under an intervention SoftIntervention1 or whether it was SCM2 with intervention SoftIntervention2." The statement of Theorem 1 includes a precise statement of what "indistinguishable models" means in our paper (a formal definition also used in prior work).
Returning to the reviewer’s example, the soft intervention that the reviewer constructs on the first model is equivalent to a (different) soft intervention on the second one. The key point here is that we are not assuming that the experimenter knows the specific form of the interventions, nor the nodes on which the interventions are performed (a commonly considered setting in CRL, see e.g. [3,4]). Thus, from the experimenter’s perspective, these two causal models cannot be distinguished.
[3] Squires, Chandler, et al. "Linear causal disentanglement via interventions." International Conference on Machine Learning. PMLR, 2023.
[4] Varıcı, Burak, et al. "Linear Causal Representation Learning from Unknown Multi-node Interventions." arXiv preprint arXiv:2406.05937 (2024).
Q4.4: lines 139-142: "in contrast with existing literature on single-node interventions, we impose no similarity constraints on the environments.": however, assumptions 4 and 5 seem to assume the environments are sufficiently diverse, which are not weaker.
Our Assumptions 4 and 5 can be easily satisfied by single-node soft interventions. Indeed, a soft intervention on a given node is equivalent to changing the entries of the corresponding row of the weight matrix. As a result, as long as there are sufficiently many soft interventions in general position for each node, our assumptions are automatically satisfied. Hence, while our assumptions allow for identification from diverse environments, they are not restricted to this case: they also hold for soft interventions. On the other hand, we show in Theorem 2 that the number of interventions cannot be reduced, thereby providing a clear picture of the soft-intervention setting.
We would like to stress, however, that the main motivation for considering general environments is that single-node soft interventions are too restrictive for applications. Hence, the focus of our paper is the setting of general environments, and we do not compare with the restrictive setting of soft interventions explicitly in the main text. We will add such a comparison in a revised version of the paper.
We hope that the above explanations are helpful, and we are also happy to answer further questions that the reviewer has.
Thanks for your response, which addressed many of my questions. Some of my feedback and remaining questions:
- Regarding Q2 and Q3: to clarify, I know "A <=> B" means A is equivalent to B. The thing here is, when you use this notation, please be careful to make A and B crystal clear. If you look at the writing, A is a very long sentence with many commas, and the quantities involved should be defined first and explicitly.
- Regarding Q4.2: "Note that this sentence is only an intuitive description and serves as a supplement to Definition 3". No, I cannot agree. Every sentence, even if it is an explanation, should be made accurate and precise. My question is on this statement: "since any effect of j on any of its child k can also be interpreted as an effect of i."
- Regarding Q4.4: from my understanding, these two assumptions either require sufficiently diverse environments (which is contradictory to the statement "... no similarity constraints on the environments") or additional interventions, right? Then how are these soft interventions conducted?
We would like to thank the reviewer for providing invaluable feedback and for letting us know the remaining concerns.
For Q2 and Q3: We thank the reviewer for the clarification. We agree with the reviewer that confusion in math notations should be avoided, and we will definitely make these statements clearer in future versions.
For Q4.2: We thank the reviewer for pointing out this issue. We realize that this sentence is not accurate; the statement should be "since any effect of j on any of its child k can also be interpreted as a mixture of the effect of i and j." We will make this modification in the revised version.
For Q4.4: it seems that there are some misunderstandings here. In the statement "... no similarity constraints on the environments" we are comparing our work to existing works that assume access to environments that differ only on one node. By the phrase "similarity constraint" we are referring to constraints that require different environments to be similar to some extent. The reviewer seems to treat the diversity constraints as a form of similarity constraint too. It is fine to understand the assumption in this way, but what we would like to point out is that 1) one has to make some assumptions on the environments to ensure identifiability; 2) the main contribution of our paper is that we completely remove the single-node intervention assumption in existing identifiability theory, replacing this very special assumption with Assumptions 4/5, which basically hold with probability 1 if the weights are sampled from some continuous distribution; and 3) our Theorem 1 naturally implies that having sufficiently many single-node soft interventions per node is sufficient for identifiability, and we show in Theorem 2 that this is also necessary.
We hope that the above explanations would resolve the reviewer's concerns, and please feel free to let us know if the reviewer has more questions.
Thanks for your reply.
To clarify, one question of mine is: how do you actually perform such interventions? Or why, and in what applications, should I believe the environments come from such soft interventions?
Also, can you point out the reference paper "we are comparing our work to existing works that assume access to environments that differ only on one node ..." ? Thanks.
We would like to thank the reviewer for providing further feedback.
Q1: can you point out the reference paper "we are comparing our work to existing works that assume access to environments that differ only on one node ..." ? Thanks.
There is a line of works that aim to learn the underlying causal representations under the assumption of single-node interventions, which means changing only one of the latent factors. More precisely, there exists some node i such that only z_i is changed, while the remaining z_j's are unchanged. In the setup of our paper, this means that compared with a "base" environment, every other environment differs only on one node.
Within this line of literature, [1] assumes access to single-node and hard interventions (i.e., an intervention on some latent variable that makes it independent of its causal parents), and they prove full identifiability of the causal model assuming there are two such interventions per node. [2] studies identifiability given single-node, soft interventions, but they assume that the causal graph is already known. [3] considers a linear causal model, assumes single-node, soft interventions, and proves identifiability up to transitive closure (cf. page 4, section 2). [4] allows a non-linear causal model but still requires a linear mixing function, and under the same single-node intervention assumption, they prove identifiability up to the SNA that we recalled in our paper.
[1] von Kügelgen, Julius, et al. "Nonparametric identifiability of causal representations from unknown interventions." Advances in Neural Information Processing Systems 36 (2023).
[2] Wendong, Liang, et al. "Causal component analysis." Advances in Neural Information Processing Systems 36 (2023).
[3] Squires, Chandler, et al. "Linear causal disentanglement via interventions." International Conference on Machine Learning. PMLR, 2023.
[4] Varici, Burak, et al. "Score-based causal representation learning with interventions." arXiv preprint arXiv:2301.08230 (2023).
Q2: how do you actually perform such intervetions? Or why and in what applications should I believe the environments are from such soft interventions?
We suppose that the reviewer is asking about our assumption that different environments share the same causal graph and the same mixing function. This assumption has been widely adopted in the causal representation learning literature, including all of the reference papers above and also other papers that rely on different assumptions (e.g., [5]).
We think that this assumption is reasonable as long as we have data across several different domains. For example, the iWildCam dataset [6] has animal images captured by cameras at different locations around the globe. It is then natural to hypothesize that several causal factors lead to the generation of a photo; for example, it is likely that location and weather are both causal factors that affect the distribution of animals captured by the cameras. We may expect that there are many more causal factors that are difficult for humans to discover, but it might be possible to learn them from data, and this is the main motivation for studying CRL. Similar reasoning can be applied to other types of data (e.g., text, audio, etc.). The underlying hypothesis is that there exists a "causal world model" that generates human-level objects. This assumption is of course not verifiable, but we believe that it can potentially lead to better machine learning algorithms compared with those that rely on i.i.d. data. Indeed, recent work [7] shows that learning the causal graph is necessary and sufficient for being robust to distribution shifts.
We acknowledge that our work cannot currently handle such complicated data, but we take a step in this direction by showing 1) what the best-achievable identification guarantee is in the multi-environment setup, and 2) that there is a provable algorithm with this guarantee for linear causal models.
We thank the reviewer for actively engaging in the discussion and providing much helpful feedback, and we sincerely hope that the reviewer will have a more positive view of our paper. We are also happy to answer any further questions that the reviewer may have.
[5] Lu, Chaochao, et al. "Invariant causal representation learning for out-of-distribution generalization." International Conference on Learning Representations. 2021.
[6] S. Beery, A. Agarwal, E. Cole, and V. Birodkar. The iWildCam 2021 competition dataset. arXiv preprint arXiv:2105.03494, 2021.
[7] Richens, Jonathan, and Tom Everitt. "Robust agents learn causal world models." The Twelfth International Conference on Learning Representations, 2024.
Thanks. I've now understood the new contributions of the paper, though the problem setting is somewhat limited. I have decided to increase my score.
However, I still feel that the paper needs considerable effort to improve its clarity, particularly in contrast with those reference works. I hope the authors will take this feedback into account when revising the paper, to make it clear and accurate.
We thank the reviewer for appreciating our paper and also for providing many suggestions for improving our paper. As the reviewer recommends, we will make things clearer and more accurate during revision.
The paper is about causal representation learning (i.e., learning the latent causal graph and the unmixing function) from high-dimensional observations in the case of linear SCMs where the mixing function is also linear. The paper defines the notion of surrounded-node ambiguity (SNA) and then studies CRL under assumptions of soft interventions (where there are K environments that share the causal graph). Theoretical analysis is performed, and the authors also introduce a method, LiNGCReL, that can perform CRL up to SNA and is provably identifiable.
Strengths
- The authors introduce the idea of SNA and identifiability up to SNA, which is novel and also important for understanding ambiguities in the CRL setup, even for the simple case of linear SCMs with a linear mixing function.
- LiNGCReL, a practical method, is also proposed to perform CRL in such a setting, provably identifying up to SNA. The paper also covers soft single-node interventions, in comparison to other previous work which has primarily dealt with hard interventions.
Weaknesses
Since I am not aware of many theoretical results and proofs for CRL, which seems to be the main contribution of the paper (apart from LinGCRel), I do not have particular weaknesses to state.
Questions
- Figure 2e looks like an undirected graph but on zooming looks like there might be directed edges. If so, this figure can be corrected.
- Font size for the axis labels is too small to read in Fig 2a-2d
- Apart from the interventional CRL works already cited, which mainly focus on identifiability, the authors could also cite BIOLS [1], which studies CRL from a more empirical perspective and proposes an algorithm to learn linear latent causal graphs from high-dimensional data under hard, multi-node interventions.
[1] Subramanian, Jithendaraa, et al. "Learning latent structural causal models." arXiv preprint arXiv:2210.13583 (2022).
Limitations
Yes
We thank the reviewer for the positive feedback and comments. In the following, we respond to the questions raised by the reviewer.
Q1: Figure 2e looks like an undirected graph but on zooming looks like there might be directed edges. If so, this figure can be corrected.
That was indeed the case – the graph is directed. We thank the reviewer for pointing out this issue, and will correct them in a revised version.
Q2: Font size for the axis labels is too small to read in Fig 2a-2d
We thank the reviewer for pointing out this issue. We will update these labels to make them bigger.
Q3: Apart from the interventional CRL works already cited which mainly focuses on identifiability, the authors could also cite BIOLS [1] which studies CRL from a more empirical perspective and proposes an algorithm to learn linear latent causal graphs from high-dimensional data under hard, multi-node interventions.
We thank the reviewer for pointing to this super interesting and closely related paper, and we will definitely cite it in the revision.
We thank the reviewer again for appreciating our work and are happy to address any other questions that the reviewer might have.
This paper investigates causal representation learning from low-level observed data across multiple environments. The authors address the surrounded-node ambiguity (SNA) in linear causal models and propose the LiNGCReL algorithm, which achieves identifiability up to SNA without relying on single-node interventions. Experiments on synthetic data show the effectiveness of LiNGCReL in the finite-sample regime.
Strengths
- It addresses the limitations of previous methods that rely on single-node interventions, providing a more practical approach to causal representation learning.
- The proposed LiNGCReL algorithm achieves identifiability up to SNA under mild conditions
Weaknesses
- Could you provide an intuitive explanation or motivations for the assumptions? What do they represent in real-world data scenarios?
- The authors mentioned that this work differs from Xie et al. [54, 55] and Dong et al. [11] because the latter require structural assumptions. However, it seems that the proposed model in this paper also inherently assumes that there are no direct causal edges between observed variables.
- In the first step of the proposed algorithm, "any identification algorithm for linear ICA is used to recover the mixing matrix". This may introduce some errors, as perfect identification cannot be achieved. These methods typically assume that the observed and latent variables have the same dimension, among other restrictions.
- In Figure 2(e), the arrows on the edges could be made larger, as they currently look like undirected edges.
- How would the proposed method perform when applied to real-world data?
- The authors only provided results of the LiNGCReL algorithm on simulated data. It would be more objective, and would better validate the effectiveness of the proposed method, if results of other baselines or classical methods on the same data were also provided for comparison.
Questions
The paper considers the problem of learning linear causal representations from general environments. Is the information about these different environments known? If so, it indeed provides some additional information for structure learning.
Limitations
NA
We thank the reviewer for providing insightful comments and feedback. Below are our responses to the reviewer’s questions and concerns about our paper.
Q1: Could you provide an intuitive explanation or motivations for the assumptions? What do they represent in real-world data scenarios?
Certainly. Our paper considers the task of learning linear causal representations from general environments. Assumption 1 states that all environments share the same mixing function, which is a standard assumption in the CRL literature [1,2]. For example, the generating process of images can be thought of as first generating a few causally related features (scene, location, weather, etc.), which are then transformed by some mixing function into high-dimensional observations (pixels in images). Assumption 1 basically states that the relationship between features and the final image remains the same for all environments, while the features themselves are generated with different probabilities. Assumption 2 requires that the mapping from features to images is injective, i.e., there cannot be two different feature vectors that correspond to the same image. Assumption 3 requires the noise variables to be non-Gaussian and to have different distributions; the non-Gaussianity assumption is relatively standard in causal graph discovery, while requiring different distributions is not restrictive, since real-world noise distributions are seldom identical. Assumptions 4 and 5 are the main assumptions in our theory and intuitively state that the environments should contain enough information about all latent variables. Notably, they do not require single-node soft interventions, a widely adopted assumption [1,4,5] that is questionable in real-world applications such as genomics (see e.g. Figure 1 of [3]), where interventions typically affect multiple nodes.
[1] Squires, Chandler, et al. "Linear causal disentanglement via interventions." International Conference on Machine Learning. PMLR, 2023.
[2] von Kügelgen, Julius, et al. "Nonparametric identifiability of causal representations from unknown interventions." Advances in Neural Information Processing Systems 36 (2023).
[3] Tejada-Lapuerta, Alejandro, et al. "Causal machine learning for single-cell genomics." arXiv preprint arXiv:2310.14935 (2023).
[4] Wendong, Liang, et al. "Causal component analysis." Advances in Neural Information Processing Systems 36 (2023).
[5] Varici, Burak, et al. "Score-based causal representation learning with interventions." arXiv preprint arXiv:2301.08230 (2023).
Q2: The authors mentioned that this work differs from Xie et al. [54, 55] and Dong et al. [11] because the latter requires structural assumptions. However, it seems that the proposed model in this paper also inherently assumes that there are no direct causal edges between observed variables.
We thank the reviewer for pointing out the possible confusion here, and we will modify the sentence in a revised version. Our point is that since these works rely on observational data only, the underlying causal models may at best be recovered up to Markov equivalence. If we hope to establish stronger identification guarantees, more structural assumptions must be made.
Q3: In the first step of the proposed algorithm, "any identification algorithm for linear ICA is used to recover the matrix ". This may introduce some errors, as perfect identification cannot be achieved. These methods typically assume that the dimensions of and are the same and other restrictions.
As the reviewer noticed, standard linear ICA only applies to the case where the observation and the latent vector have the same dimension. If the observation has a higher dimension, then it contains redundant information, and we can either drop the redundant components or, alternatively, run PCA to identify the top principal components of the observation and use the resulting lower-dimensional vector to run linear ICA. We will make this point more precise in a revised version of the paper.
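As an illustration of the preprocessing described above, the sketch below (our own, not the paper's implementation) reduces a higher-dimensional observation to the latent dimension with PCA and then runs FastICA on the reduced data; the function name and the choice of scikit-learn are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

def reduce_and_unmix(x, d):
    """x: (n_samples, D) observations assumed to come from d <= D latent sources.

    Projects onto the top-d principal components, then runs linear ICA
    on the reduced observations.  Returns the estimated sources and the
    composed (d x D) linear unmixing map (up to centering).
    """
    pca = PCA(n_components=d).fit(x)
    x_red = pca.transform(x)                    # (n_samples, d), centered

    ica = FastICA(n_components=d, random_state=0)
    s_hat = ica.fit_transform(x_red)            # estimated independent sources
    W = ica.components_ @ pca.components_       # unmixing from the original space
    return s_hat, W
```

The recovered sources are, as usual for ICA, only defined up to permutation and scaling of the rows of `W`.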
Q4: In Figure 2(e), the arrows on the edges could be made larger, as they currently look like undirected edges.
We thank the reviewer for pointing out this issue and will make the arrows larger in an updated version.
Q5: How would the proposed method perform when applied to real-world data?
The experiments conducted in this paper mainly aim to verify the correctness of our theory and algorithm. We have not tested it on real-world data because it is unclear how to evaluate the performance: the underlying causal structure is generally unknown, and there is no benchmark in our setting. We agree with the reviewer that dealing with real-world data is an important future direction for CRL, and we will explore it in future studies.
Q6: The authors only provided the results of the LiNGCReL algorithm on simulated data. It would be more objective and validate the effectiveness of the proposed method if results of other baselines or classical methods on the same data were also provided for comparison.
To the best of our knowledge, LiNGCReL is the first algorithm that can handle CRL problems given data from multiple environments. We are not aware of any other algorithms that work for this task.
Q7: The paper considers the problem of learning linear causal representations from general environments. Is the information about these different environments known? If so, it indeed provides some additional information for structure learning.
We assume that the environments are unknown; the weights in the linear causal models can be arbitrarily different across different environments.
We hope that the above explanations have properly addressed the reviewer’s questions, and feel free to let us know if the reviewer has any other questions.
Thanks for your response. But I have the following concerns:
- Although the authors explain the definition of "general environment", they mention that "all environments share the same causal graph", which is a very strong assumption. Moreover, intervening on an arbitrary number of nodes in real-world scenarios is challenging.
- For Q3, the performance of the ICA technique relies heavily on the variance of the noises. I am concerned that it is not a suitable method to recover the mixing matrix.
We thank the reviewer for letting us know the concerns.
1. The reviewer is right that "all environments share the same causal graph" might be restrictive in some cases. However, as we explained in the rebuttal, our results actually apply to cases where some causal edges are absent. Moreover, most existing literature assumes single-node soft interventions, which are special cases of environments sharing the same causal graph, and our main contribution is that we remove the single-node assumption.
In practice, we often do not know how interventions change the underlying latent variables. Our results apply to interventions that may affect one or more latent variables. Please note that we do not need to perform interventions on a chosen set of latent variables; this set is actually unknown to us in the learning process.
2. To verify that the first-stage ICA estimation is accurate, we ran experiments on randomly generated models of two different sizes. We found that, with sufficient data, the recovered matrix has average error below 1e-3 compared with the ground truth (up to row permutations) for the smaller models, and around 2e-3 for the larger ones. Currently, we are not aware of methods better than ICA that can be used in our context, but finding one is definitely a promising direction and we leave it to future work.
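For reference, a permutation-invariant recovery error of the kind mentioned above could be computed as in the sketch below, which matches estimated and ground-truth rows with the Hungarian algorithm; the normalization and sign handling are our own assumptions, not necessarily the exact metric used in the experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def permutation_invariant_error(B_hat, B_true):
    """Mean row-wise error between B_hat and B_true, minimized over
    row permutations and sign flips (the usual ICA ambiguities)."""
    # Remove per-row scale ambiguity.
    Bh = B_hat / np.linalg.norm(B_hat, axis=1, keepdims=True)
    Bt = B_true / np.linalg.norm(B_true, axis=1, keepdims=True)
    # Cost of matching row i of Bh to row j of Bt, taking the better sign.
    cost = np.minimum(
        np.linalg.norm(Bh[:, None, :] - Bt[None, :, :], axis=-1),
        np.linalg.norm(Bh[:, None, :] + Bt[None, :, :], axis=-1),
    )
    rows, cols = linear_sum_assignment(cost)   # optimal row matching
    return cost[rows, cols].mean()
```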
The paper studies causal representation learning (CRL) in the linear setting, that is, linear latent SEMs and a linear mixing function. Key contributions include:
- Identifying an intrinsic surrounded-node ambiguity (SNA) that exists when the causal model and mixing function are linear. This ambiguity is unavoidable without hard interventions.
- Proving that identification up to SNA is possible for linear models under reasonable conditions, given data from diverse environments, whose required number depends on the number of latent variables.
- Proposing an algorithm called LiNGCReL that provably achieves identification up to SNA in the linear case.
- Showing that single-node soft interventions would be required to achieve the same identification guarantee, highlighting the benefit of diverse environments.
- Demonstrating LiNGCReL's effectiveness on synthetic data in recovering the true causal model up to SNA.
The paper provides some new theoretical insights into the identifiability limits of CRL and an algorithm to achieve those limits in the linear case with general environments.
Strengths
- New theoretical insights on identifiability limits for CRL, particularly connecting the SNA concept.
- The authors propose an algorithm which provides a concrete way to achieve the theoretical guarantees.
- Experimental results validate the theoretical findings on synthetic data.
Weaknesses
- Limited to linear causal models and mixing functions. Nonlinear extensions not discussed.
- All environments share the same causal graph. My understanding of this is that the soft interventions considered in this work do not allow for removal of a subset of the parents.
- Experiments only on synthetic data; real-world applications or datasets are not tested. Perhaps a semi-synthetic experiment following Squires et al. 2023 could be added. Especially when a key point is that the assumptions are less stringent than in prior work, an experiment like this might help.
- Computational complexity of LiNGCReL not thoroughly analyzed.
- Implications of SNA for downstream tasks not explored.
Questions
I overall think that the paper would be a nice contribution to NeurIPS. Here are some moderate/minor comments:
- Regarding my comment above on "all environments share the same causal graph": if my understanding is correct, what's the main challenge in allowing partial removal of parents during a soft intervention?
- L113-114: "One may expect that identifiability with soft interventions is not much different from hard interventions, since soft interventions can approximate hard interventions with arbitrary accuracy". I am not really sure about this sentiment on soft interventions; I would in particular expect the identifiability to be different.
- In many places I see "for ∀"; I think either use "for all" or simply "∀".
- In line 133, it reads "each latent variable" followed by a symbol different from z, but z was used to denote the latent variables.
- In Theorem 1, v is used to denote the candidate/hypothetical latent variables; wouldn't it be easier to use z^? This would keep the consistency with H^, A^, etc.
- In the model setup in eq. (3), you might want to index the variables z per environment too, since these are not exactly the same r.v.s.
- I think you want to cite the proceedings version of Squires et al. 2023 and not the arXiv version from 2022.
- In Definition 10 in the appendix, what does the notation used there stand for? I also think there might be a better way to state Definition 4 instead of relying on a definition in the appendix as early as page 3.
- L627 in the appendix: typo "must stronger".
Limitations
The natural limitations that come from the assumptions are discussed.
We thank the reviewer for appreciating our work and for giving insightful comments. Below are our responses to the questions and weaknesses mentioned by the reviewer.
Q1: All environments share the same causal graph. My understanding of this is that soft interventions considered in this work do not allow for removal of a subset of the parents…… Regarding my comment above on "all environments share the same causal graph", if my understanding is correct, what's the main challenge of allowing partial removal of parents during a soft intervention?
We are sorry for the confusion here. There are actually three different settings that existing works consider:
1) Single-node, hard interventions. A hard intervention means removing the edges between the intervened node and all its parents.
2) Single-node, soft interventions. This means changing the weights of the edges between the intervened node and its parents.
3) General environments (a.k.a. soft interventions that allow simultaneously intervening on an arbitrary number of nodes). This is more general than 2) and is the setting this paper focuses on.
While we highlight in our paper that we can deal with the case where the weights are nonzero, implying that the causal graph does not change, our setting does allow some weights to be zero. This is because our Assumption 4 (and 5) only requires a non-degeneracy condition, and this condition does not necessarily require all weights to be nonzero. For example, consider a three-node graph with edges 1 → 3 and 2 → 3, and suppose we have access to three environments: in the first, both edges have nonzero weights; in the second, only the former edge has a nonzero weight; and in the third, only the latter. Then Assumptions 4 and 5 are satisfied at node 3, since the third rows of the environments' weight matrices take the forms (∗, ∗, ∗), (∗, 0, ∗), and (0, ∗, ∗) (where ∗ denotes some nonzero number) and generically span R^3. We will make this point more explicit in a revised version of our paper.
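The spanning claim in this three-node example can be checked numerically; the tiny sketch below uses hypothetical edge weights and assumes the relevant rows come from matrices of the form I − A^(k), which is an illustrative choice rather than the paper's exact notation.

```python
import numpy as np

# Hypothetical edge weights for the three environments described above
# (graph 1 -> 3, 2 -> 3); nodes are 1-based in the text, 0-based here.
A1 = np.array([[0, 0, 0], [0, 0, 0], [0.7,  0.4, 0]])   # both edges active
A2 = np.array([[0, 0, 0], [0, 0, 0], [0.9,  0.0, 0]])   # only 1 -> 3 active
A3 = np.array([[0, 0, 0], [0, 0, 0], [0.0, -0.6, 0]])   # only 2 -> 3 active

# Third rows of (I - A^(k)) across the three environments.
rows = np.stack([(np.eye(3) - A)[2] for A in (A1, A2, A3)])
print(rows)
print("rank:", np.linalg.matrix_rank(rows))   # 3, i.e. the rows span R^3
```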
Q2: L113-114: "One may expect that identifiability with soft interventions is not much different from hard interventions, since soft interventions can approximate hard interventions with arbitrary accuracy". I am not really sure about this sentiment on soft interventions, I particularly would expect the identifiability to be different.
We thank the reviewer for pointing this out, and we will delete this sentence to avoid unnecessary confusion.
Q3: In line 133, it reads "each latent variable " but was used to denote latent variables. L627 in appendix, typo "must stronger". In the model setup in eq.(3), you might want to index the variables z per environment too since these are not exactly the same r.vs. I think you want to cite the proceedings version of Squires et al 2023 and not the arxiv version from 2022.
We are sorry for these issues and will fix it in a revised version.
Q4: In Theorem 1, v is used to denote the candidate/hypothetical latent variables, wouldn't it be easier to use z^? This would keep the consistency with H^, A^, etc.
The reviewer is correct about this. We will change the notations in the revision.
Q5: In Definition 10 in the appendix, what does the notation used there stand for? I also think there might be a better way to state Definition 4 instead of relying on a definition in the appendix as early as page 3.
We are very sorry for the typo. We restate this definition below:
We write that two models are equivalent if there exist a permutation π of the nodes and a diffeomorphism whose i-th component is a function only of the latent variable at node π(i) and of those at the nodes surrounding π(i), such that the diffeomorphism maps the latent variables of one model to those of the other, and the associated permutation matrix maps one causal graph to the other.
Intuitively, this means that one can find a permutation of the nodes such that the two models have a node-wise correspondence.
We thank the reviewer again for the positive feedback and are happy to address any other questions that the reviewer might have.
I thank the authors for their response. I will keep my score for now.
The submission addresses causal representation learning in the linear setting (latent structural causal model and mixing), based on data from general environments (soft interventions possibly on groups of nodes).
Reviewers have acknowledged that the work:
- provides novel theoretical identifiability guarantees for a challenging setting,
- for that, they introduce the insightful concept of surrounded-node ambiguity (SNA),
- provides a practical algorithm for estimation,
- provides validation on synthetic data.
Given the progress made in a challenging setting of practical importance (general environments), the AC recommends acceptance as a spotlight.