PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Rating: 5.5/10 (individual: 4, 6, 6, 6; min 4, max 6, std. dev. 0.9)
Confidence: 3.8 · Correctness: 3.3 · Contribution: 3.0 · Presentation: 3.0

Causal vs. Anticausal merging of predictors

Submitted: 2024-05-15 · Updated: 2025-01-14
TL;DR

We study the asymmetries produced in the merging of predictors whenever we have causal information.

Abstract

Keywords
Causality, Merging of predictors, Causal vs Anticausal, Maximum Entropy

Reviews and Discussion

Official Review
Rating: 4

The paper explores the potential differences in predictor merging when approached from causal versus anti-causal directions. The results from MAXENT and CMAXENT indicate that in the causal direction, the solution converges to logistic regression, whereas in the anti-causal direction, it converges to Linear Discriminant Analysis (LDA). The study also examines how the decision boundaries of these two solutions vary when only partial bivariate distributions are observed, highlighting implications for Semi-Supervised Learning (SSL) and Out-Of-Variable (OOV) generalization.

Strengths

The paper investigates the differences that arise in predictor merging from causal and anti-causal perspectives. It demonstrates through MAXENT and CMAXENT that the causal direction results in logistic regression, while the anti-causal direction leads to Linear Discriminant Analysis (LDA). Additionally, the paper analyzes how the decision boundaries of these two methods change when only some bivariate distributions are observed, discussing the implications for Semi-Supervised Learning (SSL) and Out-Of-Variable (OOV) generalization.

Weaknesses

  1. Small-scale dataset: The main weakness is the small scale of the data and models studied in the paper. I believe the challenge of reducing computational cost with mixture-of-experts models is more relevant to larger models. The authors, however, only presented results on small bivariate distributions. Experimental results with larger models would be appreciated. If experiments with larger models are not feasible, I hope the authors can discuss potential limitations of the study under those larger-scale/multivariate scenarios. Do you expect the findings in Eqns. 1 & 2 to change in larger-scale/multivariate setups?

  2. Lacks comparison: This paper lacks sufficient comparisons with other papers. Can you explain the differences and advantages of the proposed method compared to the pi-Tuning method proposed in [1] (Section 3.2 and Section 3.3)?

  3. Contributions are obvious from the observations given in MAXENT and CMAXENT: The overall paper seems like a consolidation of a few previous papers.

  4. Non-causal: The paper has studied the causal and anti-causal setups, but from the signal processing point of view, please also study the non-causal settings.

  5. Results under the availability of noise and biases: The paper has shown results on a toy example, which is good for understanding the overall pipeline, but not sufficient to understand it from an ML perspective. For example: what if the random variables carry some amount of noise and have biases that impose skewness on the distribution?

References: [1] Wu, Chengyue, et al. "pi-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation." International Conference on Machine Learning. 2023.

Questions

  1. Noise: What if we have observation noise (which is very common in MRI datasets)? Will the results still hold?
  2. Covariance shift: Not sure whether I am missing anything, but what if there is a covariance shift?
  3. A curiosity: do all the theoretical results hold if the distributions are not Gaussian?
  4. From a signal processing point of view, can we get results for non-causal setups?

Limitations

The major limitation of the paper is the experimental setup, which assumes the data distribution to be Gaussian. The paper should include non-causal results and their interactions with causal and anticausal models.

Author Response

We thank the reviewer for their comments. We will first give some general comments, then answer each of the points in the weaknesses, and finally answer the questions.

We noticed the strengths of the paper are the same as the summary. Is this correct?

Weaknesses:

  1. The question about scaling is an important one. We believe that the results of Equations 7 and 10 (and Theorem 5) would remain unchanged whenever we have D experts to merge. We studied the bivariate case because it allowed us to easily analyse and visualise the implications of our theoretical results. Furthermore, we recently got an interesting insight about the results in high dimensions. Intuitively, the result is that using the same data as constraints, the resulting model in the anticausal direction gives more weight, relative to the causal direction, to those variables that have higher covariance with the target variable.

  2. We are unaware of other papers that look at merging of predictors in the case of causal and anticausal learning. If the reviewer knows of one such analysis, we would be happy to compare it to ours. We read [1] in detail, but we found that beyond the use of the term "pooling of experts" there is no strong relation between [1] and our work. Our work studies the asymmetries between merged predictors whenever we can make causal assumptions, whereas [1] proposes a method for merging of experts in the setting of foundation models in relation to transfer learning, which we do not discuss in our paper. There is no causality involved in [1].

  3. We disagree with the reviewer on this statement. CMAXENT has been studied for the causal discovery and data merging scenario by Garrido Mejia et al. (2022). What we study here are the theoretical asymmetries arising from merging of experts under different causal assumptions. These are different tasks, and we chose CMAXENT because it allows us to merge data and to include causal information in the final model. Furthermore, we analyse the geometry of the decision boundaries of different models using the same data. This is the only study we know of that does such an analysis, and we find it to be of high value for the NeurIPS community.

  4. We do not see how the analysis of non-causal scenarios is relevant to the purpose of our paper, which is studying the differences arising from causal assumptions when merging experts. There is previous research, referenced in our paper, that studies merging of experts using MAXENT without any causal assumptions (see references [21], [23] and [30] in our paper for some examples). This is an interesting question but, like the previous one, out of scope for the current work. Furthermore, there is research on the results of MAXENT under data noise, whereas there is very little research on merging of experts and causality.

Questions:

  1. Although this is an interesting question, we find it to be out of scope as we mentioned above. In this paper we are interested mainly in the asymmetries arising from the differences between causal questions on merging of experts.

  2. As usual in causality, if there is a covariance shift one needs to understand whether the model is causal or anticausal, as a distribution shift can be interpreted as an intervention under the causal lens. If the model is causal, then it is robust against distribution shifts; if the model is anticausal, then it will not be robust against distribution shifts.

  3. We do not assume that the distribution of any covariate is Gaussian. This is a result of MAXENT under mean and (co)variance constraints. Several common distributions are the result of a MAXENT problem under moment constraints: Bernoulli, Exponential, Gaussian, etc. An excellent resource for understanding the relation between common distributions and maximum entropy is Wainwright and Jordan (2008).
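As a sketch of this standard fact (our notation, following the exposition in Wainwright and Jordan, 2008), the MAXENT problem under first- and second-moment constraints reads

```latex
\max_{p}\; -\int p(x)\,\log p(x)\,dx
\quad\text{s.t.}\quad
\mathbb{E}_p[X]=\mu,\qquad \operatorname{Cov}_p(X)=\Sigma,
```

and its solution is the Gaussian \(\mathcal{N}(\mu,\Sigma)\); with support \(\{0,1\}\) and only a mean constraint, the solution is the Bernoulli distribution.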

  4. If the reviewer is interested in merging of experts without making any causal assumptions, they can check references [21], [23] and [30] in the main document and, more recently:

  • Vetter, J., Moss, G., Schröder, C., Gao, R., & Macke, J. H. (2024). Sourcerer: Sample-based maximum entropy source distribution estimation. arXiv preprint arXiv:2402.07808.
Comment

Dear reviewer,

We hope that we were able to address your concerns and clarify some misunderstandings. Let us know if you have any further concerns; if not, we kindly ask you to reconsider the score. We appreciate the time and effort you have spent reviewing our paper.

Thank you, The authors

Official Review
Rating: 6

The authors give a treatment of the mixture of experts problems using the idea of maxent; they use this as a tool to discuss how to merge causal and anti-causal inferences on the same data, in part as a way to assess the quality of the data being analyzed.

Strengths

The discussion of the differences and merging of causal and anti-causal analyses was strong and appreciated.

Weaknesses

The framing of the paper, I felt, missed a lot of literature and possible approaches to the issue being addressed. That is, the paper is framed as a discussion of merging of expert models (which can be an important problem), and maxent is proposed as a method for doing this. But the merging of experts problem is itself framed as a problem of inferring causal graphs where each expert has access to only part of the data. The problem of overlapping datasets has been extensively studied, but no reference is made to that literature, or to any ideas that literature has proposed which may compete with the maxent proposal here. More recently (actually not that recently) the discussion has taken a turn into discussions of privacy, for contexts, e.g., where different hospitals have access to their own datasets but may want to collaborate on building a causal model without risking making available by inference the identities of their respective patients -- i.e., the so-called "federated learning" problem. This has also been extensively studied in the literature to date, with many proposals for how to address it. I think this paper could benefit from a literature review of this sort to place the proposed ideas in context, with comparisons made to alternative methods, or at least sensible reasons not to compare to particular methods.

Also, the paper is mainly theoretical but could have benefitted from discussion of an empirical or simulation example.

Questions

Would the authors be willing to expand their literature review to encompass more of the literature relevant to what is being called the "mixture of experts" issue?

Would the authors be willing to include an empirical example, if not in the main text, then in the Supplement, with a pointer from the main text? It would be very helpful to know whether these ideas about causal/anti-causal merging are helpful empirically or in simulation.

Would the authors be willing to provide software to allow NeurIPS readers to easily evaluate the ideas in this paper?

Limitations

This is an entirely theoretical paper, so it does not benefit from worked examples, and it is difficult to judge impact. I did not see any discussion of societal impact. I also did not see any discussion or promises of software or data to help users assess reproducibility of any results (as empirical results are not assessed).

Author Response

We thank the reviewer for taking the time to read our paper in detail, and we appreciate that they ranked the soundness and contribution of the paper as excellent. We would like to answer the reviewer's concerns outlined in the weaknesses section, clarify some points that might not have been clear in the article, and answer the reviewer's questions.

Weaknesses:

About literature and privacy: We appreciate the reviewer's comment on the literature. However, it is almost always the case that one can include more related work. We will include more work on overlapping datasets and admit that our own point of view is biased by the causal setting. We would also like to invite the reviewer to point us to some relevant papers about privacy and federated learning, which we will be happy to study and reference. We were under the impression that federated learning usually does not study the setting where the variable sets differ (i.e., the merging problem), but following your comment, we have found some papers that study "vertical federated learning", which we shall discuss. Our impression is that this area focuses more on practical applications and is less interested in theoretical statements (and does not connect the problem to the marginal problem of statistics), but it will be interesting to make those connections.

About inferring causal graphs: The reviewer mentions in the weaknesses section that we are framing the merging of experts problem as one of inferring causal graphs. This is not entirely correct, and we would like to clarify it here (and we will do so in the revised version of the text as well). What we want is to come up with a single predictor of our target variable Y, given a combination of predictors. We have the information of the causal graph, so we do not need to infer it. We explore the differences in the solutions of the merging of experts under different causal assumptions (in our case, whether the variables are causes or effects of our target variable).

Questions: About literature expansion: We will expand the literature with the overlapping datasets literature, and we are happy to expand it with privacy and federated learning related research.

About an example: This is a fair point. However, we believe that the greatest merit of the paper is its theoretical contributions, not a potential empirical example. Indeed, Figure 2 is such an example for one particular data generation process, where we can clearly see the difference in the slope of the decision boundary for data with the same moments under different CMAXENT solutions. If the reviewer is curious about the results of a particular example, we would be happy to run those tests and include them in the final version of the paper.

For a potential real-world example (which we can include in the main paper), consider the medical domain. If Y is the presence/absence of a disease, the causal scenario is given when X_1, ..., X_n describe risk factors, while X_1, ..., X_n being symptoms would be anticausal. Research results from different labs may provide predictors of Y from different sets of risk factors and/or different sets of symptoms. Combining them into a joint predictor which uses the union of all features would be highly valuable, but it also requires including the right causal assumptions: risk factors cause diseases, and diseases cause symptoms.

About code: We are happy to share the code for our paper upon acceptance. However, we believe that most of the value of our paper comes from the theoretical insights we give on the problem of merging of experts when different causal assumptions are made. In fact, the code to produce Figure 2 consists of a very basic data generation process and a common Python library to estimate the decision boundaries using logistic regression and linear discriminant analysis, as Remark 2 and Theorem 5 of our paper point out.
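As an illustration of this point, here is a minimal sketch (our own, not the authors' actual code; the data-generating process is a hypothetical anticausal example with class-conditional Gaussians sharing a covariance) that fits both model classes with scikit-learn and compares the decision-boundary slopes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Hypothetical anticausal generation Y -> (X1, X2):
# class-conditional bivariate Gaussians with a shared covariance.
n = 2000
y = rng.integers(0, 2, size=n)
means = np.array([[0.0, 0.0], [1.0, 1.0]])
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
X = np.stack([rng.multivariate_normal(means[label], cov) for label in y])

# Fit both models on the same sample.
logreg = LogisticRegression().fit(X, y)
lda = LinearDiscriminantAnalysis().fit(X, y)

def boundary_slope(coef):
    # The boundary w1*x1 + w2*x2 + b = 0 has slope -w1/w2 in the (x1, x2) plane.
    w1, w2 = coef.ravel()
    return -w1 / w2

print("logistic regression slope:", boundary_slope(logreg.coef_))
print("LDA slope:", boundary_slope(lda.coef_))
```

With a shared class covariance the two boundaries come out similar; the interesting regime discussed in the paper is how they diverge when only partial moment information is available as constraints.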

Comment

I honestly think that for purposes of your paper, referring to a survey paper for the federated learning literature and maybe looking at some of the references in it would be sufficient--the main goal being to show how your proposal fits into the literature. Here's one from 2021:

Zhang, C., Xie, Y., Bai, H., Yu, B., Li, W., & Gao, Y. (2021). A survey on federated learning. Knowledge-Based Systems, 216, 106775.

This is three years old, but as I said, the literature is no longer new.

Comment

We thank the reviewer for the reference. We checked it in detail and, as we mentioned in the rebuttal, we believe there are interesting connections that could be exploited between federated learning (in particular vertical federated learning) and the framework we discuss in our paper, and in general the marginal problem in causality. We will include this reference in the camera-ready version of the paper.

If the reviewer considers it appropriate, could you please increase the score of our paper? If not, what does the reviewer believe could be clarified or discussed in order to do so?

Comment

I can increase the score. I'll move it to 6.

Official Review
Rating: 6

This paper studies the problem of learning a mixture of experts (predictors) where individual predictors have been learned with different causal constraints. It studies different asymmetries that arise when we merge different predictors using the Causal Maximum Entropy (CMAXENT) objective. It goes on to show that different data-generating processes lead CMAXENT to reduce to different objectives under some restrictive settings. Next, they show how the learnt predictors will have different decision boundaries under different data moment restrictions.

Strengths

  1. The paper is well-written and easy to follow.
  2. The contribution of this paper, though restricted to a simple setup, is novel. The authors show that under different assumptions on the data-generating process, the CMAXENT objective will yield different predictors, and establish necessary and sufficient conditions under which the predictors are different.

Weaknesses

  1. The connection with the OOV generalization literature is not discussed properly. In particular, it would be interesting to see how this paper's observations relate to the paper "On causal and anticausal learning" (ICML 2012) and the guarantees they have for generalization under distribution shift.

  2. In the introduction and abstract, the authors mention that they study the problem of merging the two predictors, i.e., one predictor trained to assume an anti-causal data generating process (DGP) and another assuming causal. Next, in Sections 3 and 4, the authors show the closed form of each predictor separately under different DGPs. However, little is said about the final "combined" predictor and its generalization properties. See Question 1 for more.

Questions

Questions:

  1. By merging the predictor, I understand learning a combined model using both the causal and anti-causal predictor. Is my notion of "merging" predictors correct, or am I missing something?

Typos:

  1. Line 235: I think you mean X2 will be irrelevant in the estimation of the target predictor.

Limitations

Yes, they are added to the checklist.

Author Response

We thank the reviewer for the careful reading of our paper. We also thank the reviewer for pointing out that the soundness and the contribution of the paper are good and the presentation is excellent. We will start with a short comment on the reviewer's summary and then answer the weaknesses and questions.

Comment about the reviewer's summary: When we study the difference in the decision boundaries, we study the case where the moments are the same, not different. This is precisely one of the most surprising results of the paper.

Weaknesses: We appreciate the reviewer's comment about the connection between the literature; in the revised version of the paper we will be more explicit about the connections between “On causal and anticausal learning” and some of the more recent OOV literature.

In particular, in the paper "On causal and anticausal learning" they study common machine learning tasks (distribution shift, semi-supervised learning, transfer learning) in the light of causal assumptions. They answer questions of the type: can we perform this task if the data comes from this assumed causal process? In this paper we do the same, but with the task of merging of experts. Here we ask: what are the differences in the resulting models if we assume the predictors come from a causal process versus an anticausal one? We answer this question using CMAXENT because CMAXENT 1) allows merging of predictors, and 2) allows including causal information. Nevertheless, we hypothesise the resulting asymmetry would also hold for any other model that has these two properties.

In relation to OOV, we give an example of a DGP and a model so that we can do OOV. The OOV literature is fairly recent; in fact, "On causal and anticausal learning" does not study an OOV scenario. We consider OOD and OOV "dual" modes of generalization, but their precise link is yet to be understood. This is not part of the current paper's technical contribution.

We can include an extended discussion of the above in the revised version of the paper.

This is an interesting question, and we indeed have the results for the combined DGP but did not include them in the submission to keep it somewhat decluttered. Since the reviewer believes in the value of the combined DGP, we can include it in the final version. As a summary of the results, the resulting conditional distribution of Y given X is the product of the two conditional distributions found in Proposition 1 and Theorem 5. This is expected, as given Y, the parents and effects of Y are independent, and so their joint distribution is the product of their distributions.
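As a sketch of why this product form arises (our notation, not the paper's; writing X_pa for the parents and X_eff for the effects of Y), Bayes' rule together with the conditional independence of parents and effects given Y yields

```latex
P(y \mid x_{\mathrm{pa}}, x_{\mathrm{eff}})
\;\propto\; P(x_{\mathrm{pa}})\,P(y \mid x_{\mathrm{pa}})\,P(x_{\mathrm{eff}} \mid y)
\;\propto\; P(y \mid x_{\mathrm{pa}})\,P(x_{\mathrm{eff}} \mid y).
```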

Questions: We will improve the introduction and explanation of certain terms in the manuscript; this is valuable feedback. We do understand merging of experts as the task of putting predictors together into a single model. However, the results show that it matters whether you assume that the predictors are produced with causal or anticausal variables. If you can make these assumptions, then a different question arises: what do we want the combined model for? If we need the model to estimate causal effects, then we conclude we should not use the assumed anticausal covariates; if it is about prediction, we should merge all the models available.

Typo: Thank you for reading the paper at such a level of detail. We will correct the typo and double-check whether we missed more.

Comment

I thank the authors for their time and effort spent on the response. The response answers my question. I have increased my score to 6.

Official Review
Rating: 6

This paper studies the differences and properties that emerge when one uses causal, anticausal features for prediction.

Strengths

S1. This work makes several interesting observations of causal and anticausal predictors under their parametric assumptions.

S2. This work suggests some potential considerations for practitioners dealing with feature sets that contain both types of information.

Weaknesses

W1. The primary weakness of this work is that the connections are underexplored empirically and in more complicated settings, e.g., higher dimensions and discrete data.

W2. While I do not have an issue with the simplifications you have made to make the connections clear, the lack of more general results, combined with a lack of real-world datasets that exhibit properties resembling the observations from your analysis, limits the impact of this work and makes it insufficient for the venue.

W3. Some of the observations merely confirm properties already known, e.g., the asymmetries on causal and anticausal directions [1-2].

[1] Schölkopf, Bernhard, et al. "On causal and anticausal learning." arXiv preprint arXiv:1206.6471 (2012).

[2] Janzing, D., and B. Schölkopf. "Causal inference using the algorithmic Markov condition." IEEE Transactions on Information Theory (2010). See also http://arxiv.org/abs/0804.3678.

Questions

Q1. Do your results hold with incomplete causal sets?

Q2. Are there any connections between your observations and robustness to distribution shifts?

Limitations

The setting considered is too theoretically and empirically simple to be convincing about real-world tasks.

Author Response

We thank the reviewer for their comments. We would like to clarify certain points related to the weaknesses and respond to their questions as best we can.

W1 (about high dimensionality): in the examples we studied, we used two covariates, but the results would be similar if we had predictors of more covariates. That is, if we had D covariates we would obtain a transformed logistic regression in the form of Equation 7, but instead all the variables for which we have Cov(Y, X_i) would appear on the right-hand side of the equation. Likewise, for the anticausal direction we would obtain the LDA algorithm where the Gaussians are D-dimensional, as opposed to bivariate Gaussians. We used only two covariates to be able to visualise the differences in the geometry of the decision boundary.

Furthermore, we recently got an interesting insight about the results in high dimensions. Intuitively, the result is that using the same data as constraints, the resulting model in the anticausal direction gives more weight, relative to the causal direction, to those variables that have higher covariance with the target variable.

W1 (about discrete data): there are two options here:

  • First, if the target variable is discrete, our results remain unchanged. In fact, Theorem 7 is written for a discrete (and not binary) target variable, and is thus a generalisation of the previous results.
  • Second, the covariates are discrete. If the reviewer is referring to this case, we would agree that the study of discrete covariates requires a separate analysis. However, to showcase the asymmetries between causal and anticausal merging we decided to resort to cases that admit an analytical solution, which result in differences that are easier to interpret.

W2 (about a potential application of the results): Merging predictors will become increasingly relevant, for instance in the medical domain. If Y is the presence/absence of a disease, the causal scenario is given when X_1, ..., X_n describe risk factors, while X_1, ..., X_n being symptoms would be anticausal. Research results from different labs may provide predictors of Y from different sets of risk factors and/or different sets of symptoms. Combining them into a joint predictor which uses the union of all features would be highly valuable, but it also requires including the right causal assumptions: risk factors cause diseases, and diseases cause symptoms.

W3: We know the content of references [1, 2] in detail. Neither of them describes a merging scenario. The paper [1] first studied the implications of causal asymmetries (causal vs. anticausal) for machine learning, for the case of prediction (but not merging). This indeed inspired us, as mentioned in the introduction. The relation to [2] is much more remote: that paper develops a theory of causality that uses Kolmogorov complexity rather than statistical (Shannon) information to study conditional independence properties of causal graphs. It is fundamental for the justification of independent causal mechanisms, which in turn is related to machine learning, but we felt that we did not have to cite it. If the reviewer feels otherwise, we are happy to change this and include a discussion.

Q1: We are not completely sure what the reviewer means by incomplete causal sets. We assume the reviewer is asking about the case where we do not have predictors for all causal parents of Y (that is, in the causal scenario there may be additional causes). If we do not have predictors for each causal parent of Y, then we can still combine the predictors into a single model; however, we are not going to obtain an approximation of the causal mechanism from the causal parents of Y to Y (recall the causal mechanism is simply P(Y | PA(Y))). As a consequence, we might not be able to compute some causal quantities, like interventional distributions, without further assumptions. Of course, if the goal of merging the experts is purely prediction, there is no consequence of not having all the causal parents; however, as we have shown in the paper, the geometry of the decision boundary (regardless of whether one wants to predict or perform causal tasks) will change depending on the causal assumptions.

Q2: This is an interesting question, and it relates to our interpretation of Q1. In particular (in the causal scenario), if we have the sample averages of the cross-moments between the causal parents of Y and Y, then CMAXENT gives us an approximation of the causal mechanism of Y that is robust to distribution shifts. To be more precise, Grünwald and Dawid (2004), and Farnia and Tse (2016), prove that MAXENT and conditional MAXENT are robust Bayes in the family of distributions with the predetermined constraints. That is, among the set of distributions that have the same expectations as those given as constraints, the chosen distribution is the one that minimises the worst-case expected log loss.
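Schematically (our notation, not the paper's), the robust Bayes property states that the MAXENT solution solves a minimax problem over the constraint set:

```latex
p^{\mathrm{MAXENT}} \;=\; \operatorname*{arg\,min}_{q}\ \max_{p\in\Gamma}\ \mathbb{E}_{p}\!\left[-\log q(X)\right],
```

where \(\Gamma\) denotes the set of distributions matching the given moment constraints.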

Comment

Thanks for your clarifications. On Q1: yes, my question was about "incomplete causal sets", i.e., only some subset of Pa(Y) is observed. It is important to note that the robustness of E[y | pa(y)] is not the same as for E[y | S], S \subset pa(y). I want to clarify what, if any, of your results change, assuming one only has access to S rather than pa(y).

Additionally, I have increased my score to a 6.

Author Response

We thank all the reviewers for reading our paper and for the interesting questions they asked. We also appreciate that some of the reviewers consider the paper a valuable contribution to the community, as well as sound and well presented. We invite the reviewers to increase their scores if they feel their questions were satisfactorily answered.

In the revised version of the paper we will include an extended discussion of our results in higher dimensions and some potential applications of our theoretical insights, and we will extend our literature review to include some references on overlapping datasets and some interesting papers on federated learning with a setup similar to ours (though without the causal considerations).

Final Decision

Summary. This paper considers the problem of combining multiple predictors based on different causal relationships using Causal Maximum Entropy (CMAXENT), namely causal and anticausal predictors. Though the models and settings they analyze are simple and restrictive, their results demonstrate that CMAXENT reduces to standard model classes: logistic regression in the causal direction, and linear discriminant analysis in the anticausal direction. They also study the effect of only partially observing the data distribution on the decision boundaries of these models.

Strengths.

  • The paper is written clearly, and the ideas are communicated effectively.
  • Though the analysis only considers simple settings, the ideas and observations about the solutions to CMAXENT are novel and motivate an interesting direction for further research.
  • The analysis is well-executed.

Weaknesses

  • The settings considered are limited. For instance, the analysis does not include categorical covariates (Reviewer Wx67) or distribution shifts (Reviewer vvvY).
  • The authors should take care to address comments raised by the reviewers -- particularly the following: (i) a discussion of how their results relate to the utility and properties of the final merged predictor; (ii) motivating, with more attention, applications where the observations of this work are prescriptive; (iii) a potential weakness of this work is reduced OOD-generalization properties due to the use of anticausal features; multiple reviewers asked about the relationship between this work and the ideas in "On causal and anticausal learning", and discussing this and/or other related work would allow the authors to address this point; (iv) expanding the literature review with reviewer suggestions; (v) discussing high-dimensional covariates.
  • Related to the previous point (i), very little is said about the final predictor; should it be better or worse than the causal / anticausal predictor alone, and when? Some empirical demonstration of this would go a long way.

Discussion. The primary criticism from the reviewers was the extent of the theoretical contributions. The authors clarified the questions and limitations raised by the reviewers and agreed to address their discussion points (some enumerated above). Still, one reviewer recommended rejection due to the limited scope of the results; however, other reviewers found the paper's insights interesting enough to merit acceptance despite the limited settings explored -- though there were no strong supporters of acceptance.

Recommendation. Overall, the ideas and observations in this work are interesting and potentially motivate more work in this important direction, e.g., merging multiple model decisions or Human/AI predictions. The reviewers recommend acceptance with the expectation that the authors will address the abovementioned weaknesses.