Do causal predictors generalize better to new domains?
Predictors using all available features, regardless of causality, have better in-domain and out-of-domain accuracy than predictors using causal features.
Abstract
Reviews and Discussion
This work aims to provide an empirical study of how well models trained on causal features generalize across domains compared to models trained on all features. The major result from this study is that, contrary to the existing understanding that using causal features can generalize well in different environments, across 16 prediction tasks, models trained on all features generally perform better than those using only causal features. A similar trend is observed when using arguably causal features. Other experimental results, such as the inconsistency in the identification of causal features using classical discovery algorithms and the accuracy performance comparison between standard ML models and causal ML models, are also provided.
Strengths
(High clarity in terms of experimental setup and results) This work provides a detailed exposition of how causal features were chosen for each task. The authors include a diverse set of tasks from different domains in the experiment, and the experimental procedures are clearly described. This ensures the validity of their results and is crucial for reproducibility.
Weaknesses
- (Quality in terms of writing): The style and structure of this work require significant improvement. Specifically, the authors should adopt a more academic writing style and strive for greater specificity in each sentence. For instance, the sentences on lines 20, 24, 62, and 70—"But it’s less clear what to do about it," "The idea may be sound in theory," "To be sure, our findings don’t contradict the theory," and "... also discussed under the terms ‘autonomy’, ‘modularity’ or ‘stability’"—are ambiguous. The authors should clearly describe the specific theories they are referring to in these instances and precisely define what 'autonomy', 'modularity', or 'stability' mean in the context of causal machine learning.
There are many other places with similar writing issues, and I leave it to the authors to identify these and make further improvements.
- (Quality in terms of validity of the experiment): There are several aspects of the experimental setup that are not entirely convincing or fair, which raises concerns about the reasonableness of the claims made in this work. I will raise these specific issues in the next section.
Questions
- On the choice of causal features: While I understand that the authors have tried, to the best of their knowledge, to select features that are causal, it is also very likely that these chosen features are not actually causal. Under this misspecification, it is not surprising that the accuracies, regardless of in-domain or out-of-domain, will suffer. Consider the case where some of the real causal features are selected alongside some spurious features. Training the model on these features can be expected to result in sub-optimal prediction performance. In your experiment, this issue seems plausible, as the prediction error drops significantly when you include the arguably causal features.
- Similar to the above issue, if the domain knowledge used to select causal features was incorrect, it is expected that the features identified by the causal discovery algorithm would not significantly overlap with the hand-selected causal features. While it is understood that causal discovery algorithms are not perfect, assuming these algorithms can accurately identify causal features, any inconsistency would likely be due to incorrect hand-selected features. Based on this reasoning, I suggest that the authors create a synthetic dataset where the ground truth is known to strengthen their claims.
- Similar to the above issue, the robustness tests seem unconvincing. Specifically, the robustness results are based on selecting a subset of features from the hand-selected causal features. However, if the original causal features are incorrect, any subset chosen from these features is unlikely to improve generalization.
- It is uncertain whether the comparison of accuracies between the use of machine learning models, causal methods, and methods tailored to domain robustness is a fair comparison. For instance, it is unfair to compare XGBoost, a non-linear model, with IRM, which is a linear model. Such differences in prediction performance might come not from the methods themselves but from the different hypothesis classes and model complexities. Similarly, including methods tailored to domain robustness in the comparison can be confusing. For example, one could use the same causal features but achieve better domain generalization by employing domain robustness methods such as sharpness-aware minimization (similar in flavor to DRO), which finds flatter minima in the loss landscape to enhance generalization, a mechanism entirely unrelated to the choice of causal features.
Limitations
Although the authors include a limitations section, they do not directly address how their work might have potential limitations or potential negative societal impacts.
We thank the reviewer for the thoughtful feedback. Below, we aim to answer the reviewer’s concerns:
(Quality in terms of writing)
We will revise our work and adopt a more precise writing style, as well as reference in greater detail what we are referring to.
(Hand-selected causal features)
it is also very likely the case that these chosen features are not actually causal
We agree with the reviewer that the chosen features are likely misspecified. We even state a similar concern in the discussion section (“We likely made some mistakes in this classification", line 271f.). Nevertheless, we tried every conceivable way to obtain causal features (domain knowledge, removing single features from the ones obtained by domain knowledge, causal discovery algorithms, causal machine learning methods). If the reviewer knows another way to obtain causal features in an empirical study, we would be happy to include it in our analysis.
assuming these algorithms can accurately identify causal features, any inconsistency would likely be due to incorrect hand-selected features
This is true. That’s why we tried both approaches in our study: hand-selecting based on domain knowledge and applying causal discovery algorithms. The results do not change, though.
(Synthetic dataset)
I suggest that the authors create a synthetic dataset [...] to strengthen their claims.
Following your suggestion, we conducted synthetic experiments. The setup is depicted in the PDF of the authors’ rebuttal. Our code is based on the synthetic study conducted by Montagna et al., 2023. We refer the reviewer to their results for a detailed performance analysis of the causal discovery methods.
The synthetic experiments confirm our empirical findings: using all features achieves the best out-of-domain prediction accuracy. The one exception occurs when the distribution shift acts exclusively on the anti-causal features, and even in this case, a strong shift is needed before causal features achieve the best out-of-domain accuracy.
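For illustration, here is a minimal sketch of one such configuration (one causal parent, one anti-causal feature, linear mechanisms, and a logistic model; a simplified stand-in for the full setup, with illustrative names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
w_cy, w_ya = rng.uniform(-1, 1, size=2)  # random linear weights in (-1, 1)

def sample(n, shift=0.0):
    x_c = rng.standard_normal(n)                   # causal parent
    y_cont = w_cy * x_c + rng.standard_normal(n)   # linear mechanism + noise
    y = (y_cont > 0).astype(int)                   # task: classify target > 0
    # shift intervention acting exclusively on the anti-causal feature
    x_a = w_ya * y_cont + rng.standard_normal(n) + shift
    return np.column_stack([x_c, x_a]), y

X_tr, y_tr = sample(1000)                          # training domain: no shift
clf_all = LogisticRegression().fit(X_tr, y_tr)
clf_causal = LogisticRegression().fit(X_tr[:, [0]], y_tr)

for s in [0.0, 2.0, 5.0, 10.0]:                    # vary the shift strength
    X_te, y_te = sample(1000, shift=s)
    print(f"shift={s:4.1f}  all={clf_all.score(X_te, y_te):.3f}  "
          f"causal={clf_causal.score(X_te[:, [0]], y_te):.3f}")
```

As the shift on the anti-causal feature grows, the all-features predictor degrades while the causal-features predictor stays stable, reproducing the exception noted above.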
(Robustness tests)
if the original causal features are incorrect, any subset chosen from these features is unlikely to improve generalization
Yes, the robustness test on the causal features merely tests for the misclassification of a single feature as causal. To further probe robustness, we also tested 500 random subsets for each task, drawn randomly from all features. See Appendix C.5 for details. We welcome any suggestions to further test the robustness of our results.
(Empirical evaluation)
It is uncertain whether the comparison of accuracies [..] is a fair comparison. [...] Such differences in prediction performance might come [...] from the different hypothesis classes and model complexities.
We follow the example of preceding empirical studies when comparing standard machine learning methods, causal machine learning methods, and domain robustness methods via accuracy (Gulrajani and Lopez-Paz, 2020; Miller et al., 2021; Gardner et al., 2023). The objective is applicability in practice. It is of separate interest to explore and disentangle the reasons for the inferior performance of certain methods, e.g., different hypothesis classes and model complexities.
For example, one could use the same causal features but achieve better domain generalization by employing domain robustness methods
We train all feature sets in the main analysis with the same methods. In particular, we also train domain robustness methods (DRO, GroupDRO, and an adversarial label robustness method) on the causal features.
address how their work might have potential limitations or potential negative societal impacts.
We will improve our limitations section and explicitly state our limitations and potential negative societal impacts.
References:
Montagna, F., Mastakouri, A., Eulig, E., Noceti, N., Rosasco, L., Janzing, D., ... & Locatello, F. (2023). Assumption violations in causal discovery and the robustness of score matching. Advances in Neural Information Processing Systems, 36.
Gulrajani, I., & Lopez-Paz, D. (2020). In search of lost domain generalization. arXiv preprint arXiv:2007.01434.
Miller, J. P., Taori, R., Raghunathan, A., Sagawa, S., Koh, P. W., Shankar, V., ... & Schmidt, L. (2021, July). Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization. In International conference on machine learning (pp. 7721-7735). PMLR.
Gardner, J., Popovic, Z., & Schmidt, L. (2023). Benchmarking distribution shift in tabular data with TableShift. Advances in Neural Information Processing Systems, 36.
Thank you so much for your response. However, although the authors tried every possible approach to select the features, the possibility of mis-specification of features still exists. Additionally, I am still not quite convinced about the fairness of comparing different ML models and training methods. Therefore, I will maintain my original score.
We thank the reviewer for reading the rebuttal. We appreciate your affirmation that we analyze "every possible approach to select the features", even though our explanations and additional experiments couldn't convince you.
This paper attempts to test the hypothesis of whether models trained on causal features generalize better across domains. The authors found that predictors using all available features, both causal and non-causal, have better in-domain and out-of-domain accuracy than causal-features-based predictors. The authors also observe that causal discovery algorithms perform poorly at selecting causal variables. The authors provide an empirical analysis to support their claims.
Strengths
The paper is well presented and easy to follow. The authors provide an extensive experimental analysis considering different real-world scenarios.
Weaknesses
Here I discuss my major concerns about the paper:
- It is unclear what type of distribution shift the authors are considering. For example, for a pair of variables X, Y, is it i) target shift: P(Y) changes while P(Y|X) stays fixed, ii) conditional shift: P(Y|X) changes while P(Y) stays fixed, or iii) covariate shift: only P(X) changes (see [1], [2] for details).
- It is unclear whether the test datasets the authors considered represent sufficiently large domain shifts. The authors should state more precisely, in numerical terms (as above), how much the test distribution changes compared to the training distributions.
- One possible reason for causal features not performing well might be that the test distribution is not different enough from the training distribution. For example, if we consider a dataset of birds (the Caltech-UCSD Birds-200-2011 (CUB) dataset) and train on all available features including the background (water, land, forest, etc.), we will probably achieve better performance (due to the background shortcut) than with only causal features. However, if we consider a test dataset with a higher domain shift, such as Waterbirds [3], a model trained on all features will perform poorly since the background feature will affect its prediction. In this scenario, causal predictors might perform well [4]. Maybe the irrelevant/non-causal features are helping because the test distributions are not significantly different. I would request the authors to share their perspectives on this possibility.
- The authors should perform an ablation study by removing one non-causal feature at a time and measuring the corresponding performance. This should show which non-causal features improve the accuracy and by how much. If these features are not causal, why do they improve the model accuracy? These points should be discussed in detail.
- If the conclusion drawn by the authors is to utilize all features for prediction, then the prediction would highly depend on features whose distribution changes with the domain. As a result, the prediction would be highly dependent on the training domain and perform poorly in the test domain. This would prevent the model from generalizing. Please read the introduction sections of [2], [5] for details. How do the authors plan to deal with such domain dependence and achieve generalization if they use all features?
- Are the authors considering the possibility of confounders (shared unobserved parents of observed variables)? Due to the presence of confounders, we might not observe all the causal parents. For example, [5] discusses a medical scenario where the goal is to diagnose a target condition T, say lung cancer, using information about patient chest pain symptoms C and whether or not they take aspirin A. Lung cancer leads to chest pain (T → C), and aspirin can relieve chest pain (A → C). Smoking K (an unobserved confounder) is a risk factor for lung cancer (T), and aspirin (A) is prescribed to smokers as a result: T ← [K] → A. In such scenarios, causal methods such as [2, 5] utilize non-causal features such as children (C) or neighbors (A) to predict T, since the causal parent smoking (K) is unobserved. Thus, if we are only using causal parents, it is necessary to make sure that there are no unobserved causal parents.
- The authors performed their experimental analysis on their selected datasets. They should also show their results on the datasets where causal predictors are claimed to perform well. Did the authors check any such datasets?
[1] Kun Zhang, Bernhard Schölkopf, Krikamol Muandet, and Zhikun Wang. Domain adaptation under target and conditional shift. In International conference on machine learning, pages 819–827. PMLR, 2013.
[2] Lee, Kenneth, Md Musfiqur Rahman, and Murat Kocaoglu. "Finding invariant predictors efficiently via causal structure." Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence. 2023.
[3] Sagawa, Shiori, et al. "Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization." arXiv preprint arXiv:1911.08731 (2019).
[4] Kaur, Jivat Neet, Emre Kiciman, and Amit Sharma. "Modeling the data-generating process is necessary for out-of-distribution generalization." arXiv preprint arXiv:2206.07837 (2022).
[5] Adarsh Subbaswamy, Peter Schulam, and Suchi Saria. Preventing failures due to dataset shift: Learning predictive models that transport. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3118–3127. PMLR, 2019.
Questions
Below I share my questions:
- Why are the non-causal features helping the model to perform well? Are completely irrelevant features helping as well?
- To my knowledge, the PC algorithm performs conditional independence tests to discover the undirected skeleton. How do the authors obtain causal parents from PC (Line 240)?
- Please answer and discuss the questions in the weaknesses section.
Limitations
The authors clearly discussed their limitations.
We thank the reviewer for the thoughtful feedback. To answer the reviewer’s concerns and questions:
(Distribution Shift)
unclear what type of distribution shift
We are considering natural distribution shifts, for example, those induced by switching between geographic regions or demographic groups. Therefore, we are likely to suffer from all three forms of distribution shift at once. A recent study by Liu et al., 2023, shows that conditional shifts are prevalent in tabular settings.
precisely numerically mention how much the test distribution changed
Following your suggestion, we added a table with details on the observed distribution shifts in the PDF of the authors’ rebuttal. We adapted the metrics for target shift, conditional shift, and covariate shift from Gardner et al., 2023; see Appendix E.2 of their paper for the detailed definitions.
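For intuition, simple generic proxies for two of these shift types look as follows (these are illustrative stand-ins, not the exact definitions of Gardner et al., 2023, Appendix E.2):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def target_shift(y_tr, y_te):
    """Total variation distance between the two (binary) label marginals."""
    p = np.bincount(np.asarray(y_tr, dtype=int), minlength=2) / len(y_tr)
    q = np.bincount(np.asarray(y_te, dtype=int), minlength=2) / len(y_te)
    return 0.5 * np.abs(p - q).sum()

def covariate_shift(X_tr, X_te):
    """AUC of a domain classifier; 0.5 means the feature marginals match."""
    X = np.vstack([X_tr, X_te])
    d = np.r_[np.zeros(len(X_tr)), np.ones(len(X_te))]  # domain labels
    Xa, Xb, da, db = train_test_split(X, d, test_size=0.3, random_state=0)
    clf = GradientBoostingClassifier().fit(Xa, da)
    return roc_auc_score(db, clf.predict_proba(Xb)[:, 1])
```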
Maybe the irrelevant/non-causal features are helping because the test distributions are not significantly different
We agree with the reviewer that one explanation of the empirical results is that the test distribution may not differ from the training distribution in a substantial way. In support of this argument is the fact that the training and testing domains are, for instance, different geographic regions within the U.S. bordering each other, or population groups with different education levels or racial identity.
(Datasets)
Please read the introduction section of [2], [5] for details. How do the authors plan to deal with such domain dependence and achieve generalization if they use all features
Lee et al., 2023, and Subbaswamy et al., 2019, propose proactive algorithms, anticipating potential sources of unreliability in the data. In that context, our study should be interpreted as testing how unreliable variables in common tabular data are. In other words, do we need to act proactively when dealing with tabular data encountering natural distribution shifts? Are there domain dependencies that limit generalization?
We did not find empirical evidence for that within our datasets. All datasets we studied are commonly used in empirical evaluations, and most of them are currently part of a benchmark for out-of-domain generalization. Hence, we conclude that all features should be utilized for prediction in common tabular datasets. Note that our conclusion does not extend to, in that sense, uncommon datasets with known high domain dependence.
They should also show their results on the datasets where causal predictors claim to perform well
We looked for tabular datasets with (i) public or open credentialized access and (ii) interpretable features that can be classified into ‘causal’ and ‘non-causal’. We have not found any dataset matching these criteria where the causal predictors performed well. We would welcome any suggestions from the reviewer and are happy to include such datasets in our study.
(Ablation Study)
perform an ablation study by removing one non-causal feature at a time [...] Why are the non-causal features helping the model to perform well?
We conducted an ablation study and provided the results in the PDF of the authors’ rebuttal. We remove anti-causal and non-causal features one at a time and measure the corresponding out-of-domain accuracy. In a later comment, we discuss in detail the non-causal features whose removal significantly dropped the out-of-domain performance.
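Schematically, the ablation follows a standard leave-one-out protocol; the toy data below stands in for the real tasks and domain splits (the actual study retrains the full model battery and measures out-of-domain accuracy):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy stand-in: first half acts as "training domain", second half as "test domain".
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, y_tr, X_te, y_te = X[:1000], y[:1000], X[1000:], y[1000:]

base = GradientBoostingClassifier().fit(X_tr, y_tr).score(X_te, y_te)
for j in range(X.shape[1]):  # drop feature j, retrain, compare accuracy
    keep = [k for k in range(X.shape[1]) if k != j]
    acc = GradientBoostingClassifier().fit(X_tr[:, keep], y_tr).score(X_te[:, keep], y_te)
    print(f"feature {j}: delta accuracy = {acc - base:+.3f}")
```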
Are completely irrelevant features helping as well?
Our datasets are carefully curated. They are based on surveys that collect information that experts in the field deem relevant. In some cases, features are selected from a multitude of variables, e.g., 23 features from the US Census data for predicting income. In addition, our datasets cover applications dealing with complex social dynamics (health, employment, politics, ...). Therefore, it is hard to declare any feature completely irrelevant.
(Confounders)
Are the authors considering the possibility of confounders?
We strongly believe that there are confounders within our datasets. Their existence is another plausible explanation for the inferior performance of the causal features. We do not model the data-generating processes precisely enough to employ the methods proposed by Lee et al., 2023, and Subbaswamy et al., 2019.
(PC Algorithm)
How do the authors obtain causal parents from PC (Line 240)?
The PC algorithm estimates a completed partially directed acyclic graph (CPDAG); see Figure 33 in Appendix C.4 for an example. We use the implementation from the R package ‘pcalg’. The edges connecting the target with other features are directed for the tasks ‘Food Stamps’, ‘Income’ and ‘Unemployment’. Therefore, we can readily identify these features as estimated causal parents. In the task ‘Diabetes’, the target is only connected to high blood pressure via an undirected edge. We treat high blood pressure as a potential causal parent. If deemed necessary, we can also include all skeletons as figures in our appendix.
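Once the estimated CPDAG is available as an adjacency matrix, the extraction step amounts to reading off the edge marks around the target; a sketch with an illustrative mark encoding (pcalg's internal encoding differs in detail):

```python
import numpy as np

def parents_from_cpdag(A, target):
    """Directed edges i -> target give estimated parents; edges with
    marks in both directions are undirected and give potential parents."""
    n = A.shape[0]
    directed = [i for i in range(n) if A[i, target] and not A[target, i]]
    undirected = [i for i in range(n) if A[i, target] and A[target, i]]
    return directed, undirected

# Toy CPDAG over (feature 0, feature 1, target 2):
# 0 -> 2 is directed; 1 - 2 is undirected.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 1, 0]])
print(parents_from_cpdag(A, target=2))  # ([0], [1])
```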
References:
Liu, J., Wang, T., Cui, P., & Namkoong, H. (2023). On the need for a language describing distribution shifts: Illustrations on tabular datasets. Advances in Neural Information Processing Systems, 36.
We discuss the non-causal features whose removal significantly dropped the out-of-domain performance. We split by task.
Food Stamps (food stamp recipiency in the past year for households with children, across geographic regions):
- Relationship to reference person: There could be a stable and informative correlation within the US Census survey between the kind of household members (encoded in the relationship to the reference person/head of the household, e.g., multiple-generation household vs. roommates) and food stamp recipiency. We did not classify this variable as causal, as it is survey-related.
Income (income level across geographic regions):
- Relationship to reference person: same argument applies.
- Marital status: Marital status and personal income are both intricately linked with socio-economic status, although we have not found any research causally linking them together.
- Insurance through a current or former employer or union / Medicare for people 65 or older, or people with certain disabilities: These insurances are benefits not tied to income, but rather to the person’s employer or age and medical condition. They are, however, indicative of the economic and social environment of the individual, which helps to classify the income level.
- Year: The year, e.g., 2018, encodes information about the economic status, which may be predictive across geographic regions.
Public Coverage (public coverage of non-Medicare eligible low-income individuals across disability status):
- State/year: The current state of living and year encode information about the economic status.
Voting (voted in the US presidential elections across geographic regions):
- Party preference on specific topics, e.g. pollution
- Opinion on party inclinations, e.g., which party favors stronger government
- Opinion on sensitive topics, e.g., abortion, religion, gun control
The opinions/preferences of an individual may sort them into specific sub-groups of the population, wherein civic duty is or is not prominent. It is conceivable that similar sub-groups form across geographic regions.
Hypertension (High blood pressure across BMI categories):
- State: The current state of living encodes information about socio-economic status, which several studies have linked to hypertension (Leng et al., 2015).
Sepsis (sepsis across length of stay in ICU):
- Hospital: Hospitals serve different groups of the population, which may be correlated with different risks of developing sepsis.
References:
Leng, B., Jin, Y., Li, G., Chen, L., & Jin, N. (2015). Socioeconomic status and hypertension: a meta-analysis. Journal of hypertension, 33(2), 221-229.
Thanks to the authors for their efforts in their submitted manuscript and their detailed rebuttal. I apologize for replying late. Below I share my concerns based on the authors’ responses.
The authors mentioned that their considered datasets are likely to suffer from all three forms of distribution shift at once, and they provided covariate shift, concept shift, and label shift measures for all datasets.
However, in my opinion, these shifts should be measured for specific variables (weakness 1) instead of for the whole dataset. For example: the conditional distribution of which variable changes the most compared to the other variables when you consider a new domain? That would explain the source of shift and hint at why a predictor performs well or badly. This information is more insightful than a single number for the whole dataset.
The authors mentioned, “do we need to act proactive when dealing with tabular data encountering natural distribution shifts? Are there domain dependences that limit the generalization?”
I understand the authors’ arguments about causal predictors performing badly. However, my perspective is that if a predictor using all the features does not perform badly in the considered new domain, even in the presence of different types of shifts, should we consider that a new domain at all? Does it contain a significant amount of shift to be considered a new domain?
Hence, we conclude to utilize all features for prediction in common tabular datasets.
Again, I understand the authors’ claim about the causal predictor performing badly on the considered common datasets and the fact that using all features gave better accuracy. However, there are many existing works which show how the prediction becomes domain-specific or highly dependent on sensitive variables if we condition on all variables. Are we proposing a method with new failure modes?
We strongly believe that there are confounders within our datasets.
To me, the role of confounders was not explicitly mentioned in the main paper. It should be explicitly mentioned as one of the possible reasons why causal features do not perform well.
- A question to the authors: How does a predictor perform better in new domains if the prediction is dependent on “State/Year/Hospital”? Aren’t these exactly the type of variables that create new domains? This means that the trained predictor would be domain-specific and fail when deployed in a different state/year/hospital.
- A final message to the authors is that their empirical results are very impressive. However, these results should not mislead readers about using causal features for prediction tasks. Different assumptions and possibilities (e.g., the presence of confounders) should be explicitly mentioned.
Based on the authors’ responses and the discussion with the other reviewers, I would consider changing my score.
Thanks for reading the rebuttal. We appreciate that you find our empirical results "very impressive". These are, in fact, our main contribution. We're happy to add additional clarification and discussion along the lines of what you suggest. In particular, we will detail the conditional distribution shift on a variable level. We will emphasize more strongly that we adapt the choice of domains from a recent benchmark and that demonstrating the utility of causal methods likely requires other benchmark datasets than the ones currently available (see line 279f.). We will also explicitly mention confounders as one of the possible reasons why the causal features do not perform well.
Thank you for your response. I appreciate that you will clarify and emphasize the points I mentioned in your main paper. Could you also please share some details on points 3 and 5?
Thank you.
The paper empirically investigates the hypothesis that machine learning models trained on causal features generalize better across domains. Analyzing 16 prediction tasks across various datasets, the study finds that models using all available features outperform those using only causal features in both in-domain and out-of-domain accuracy, challenging the assumed external validity of causal modeling techniques. Extensive robustness checks support these findings, suggesting that current causal methods do not necessarily provide better generalization in practical settings.
Strengths
- Comprehensive empirical analysis with 16 diverse tasks across domains
- Robustness checks confirm the stability of results
- Challenging an existing assumption contributes valuable insights for the field
- Great detail in the appendix.
Weaknesses
- It may have been my mistake, but it took me a while to grasp how the domain split is done. Maybe this can be described more clearly.
Questions
- Anti-causal improves out-of-domain performance of arguably causal features (ll 221): Do you have an explanation/hypothesis?
Limitations
- Feature classification into causal/non-causal is somewhat subjective (the authors do acknowledge this fact).
- The study only provides empirical results. It does so very carefully, but still, this makes it hard to make any more general claims.
We thank the reviewer for their helpful comments and positive feedback! To answer the reviewer’s comments:
how the domain split is done [...] can be described more clearly
Thank you for drawing our attention to that. We will state how we obtain the domain splits more clearly.
Anti-causal improves out-of-domain performance of arguably causal features (ll 221): Do you have an explanation/hypothesis?
A reasonable explanation is that the relationship of the anti-causal feature and target does not change strongly between domains. For example, patients with diabetes have a considerably increased risk of cardiovascular disease. Their relationship is partly explained by biomedical mechanisms, e.g., dyslipidemia (abnormal levels of lipids in the bloodstream) is a common feature of diabetes and poses a significant risk factor for cardiovascular diseases (Schofield, et al., 2016). It is conceivable that these mechanisms are stable across domains (population groups with different racial identity).
The study only provides empirical results [...] hard to make any more general claims.
To address your concerns, we conducted synthetic experiments to test how far our claims generalize. The setup is depicted in the PDF of the authors’ rebuttal. Our code is based on the synthetic study conducted by Montagna et al., 2023.
The synthetic experiments confirm our empirical findings: using all features achieves the best out-of-domain prediction accuracy. The one exception occurs when the distribution shift acts exclusively on the anti-causal features, and even in this case, a strong shift is needed before causal features achieve the best out-of-domain accuracy.
References:
Schofield, J. D., Liu, Y., Rao-Balakrishna, P., Malik, R. A., & Soran, H. (2016). Diabetes dyslipidemia. Diabetes therapy, 7, 203-219.
Thanks for the response and additional explanations. My vote remains and I hope to see this paper accepted.
We thank the reviewer for reading the rebuttal, and appreciate their continued support of our paper!
In this paper, an extensive evaluation of ML methods with different feature sets is performed. Only tabular data is considered, and different feature sets are compared: all, arguably causal, and causal. The experiments show no advantage of using causal features for the domain generalization task.
Strengths
- very useful, inspiring results which will probably spark new research directions and/or rebuttals.
- very well-written, clear and well-structured paper
- code available and well-described experiment
- interesting "negative" results for an interesting question which is often stated as a "strength" of causal machine learning models.
Weaknesses
- maybe the number of different benchmarks/tasks could be higher, but I think that it is sufficient as it is
- maybe simulation experiments? (see questions)
Questions
- It would be interesting to add an experiment with simulated data, with varying degrees of difference in the out-of-distribution data, so as to probe the theoretical ideas behind the claim that causal features should generalize better to unseen domain (something similar to what is done in anchor-regression literature).
Limitations
yes
We thank the reviewer for their helpful comments and amazing feedback! To answer the reviewer’s question:
add an experiment with simulated data, with varying degrees of difference in the out-of-distribution data [...] (something similar to what is done in anchor-regression literature)
Following your suggestion, we conducted synthetic experiments. The setup is depicted in the PDF of the authors’ rebuttal. Similar to Rothenhäusler et al., 2021, we vary the degree of difference in the out-of-domain data using shift interventions on the target, features, and confounders.
The synthetic experiments confirm our empirical findings: using all features achieves the best out-of-domain prediction accuracy. The one exception occurs when the distribution shift acts exclusively on the anti-causal features, and even in this case, a strong shift is needed before causal features achieve the best out-of-domain accuracy.
References:
Rothenhäusler, D., Meinshausen, N., Bühlmann, P., & Peters, J. (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(2), 215-246.
The authors provide a thorough benchmark and analysis of machine learning models trained on different features to generalize to unseen domains. They use tabular datasets from various fields, including health, employment, education, etc., and categorize features into groups ranging from causal to anti-causal to test their influence on the model performances. Their key finding is that models utilizing all available features, irrespective of their causal nature, achieve better in-domain and out-of-domain accuracy than those relying solely on causal features across a battery of methods. This work challenges the practical applicability of theoretical causal advantages in real-world tabular data scenarios.
Strengths
- very clear and accurate writing style, intuitive presentation of results in the form of figures, clear summary statements
- the authors use a variety of datasets from different areas like health, employment, and politics, which makes the findings more applicable to real-world scenarios
- the separation of features into the four categories seems reasonable and human-like (i.e., that's how everyone should cluster features before training a predictive model).
- the large battery of models is great
- the authors conducted thorough checks to ensure their results are robust, making their conclusions seem trustworthy, and they try to give explanations for why certain results are observed
Weaknesses
- Could you add the sample size to Table 1?
- The explanations of the results seem a bit short. It would be great to elaborate a bit more and interpret the results with respect to the strength of the domain shift (e.g., it is obvious that anti-causal features are predictive as long as the shift is small).
Typos:
- Figure 1: Predictors based ON all features
- L63: theortical -> theoretical
- L108: that THE distribution
- L172: Remove "a"
- L214: considerably
- L275: knowlege -> knowledge
Questions
- I would be curious to see if these findings also translate to genomics, where the feature size is huge, and a subset of genes has been discovered as causal for certain phenotypes. Have you considered this?
Limitations
The authors adequately addressed the limitations!
We thank the reviewer for their helpful comments and positive feedback! We will fix the typos, thanks for pointing them out to us. To answer the reviewer’s questions:
add the sample size to Table 1?
Yes, we will add the sample sizes to Table 1. Note that they are currently provided in Table 6 in the Appendix.
elaborate a bit more and interpret the results wrt to the strength of the domain shift
We added a table with details on the strength of the domain shifts in the PDF of the authors’ rebuttal. We adapted the metrics for target shift, conditional shift, and covariate shift from Gardner et al., 2023.
We will explain and interpret the results in more detail, taking into account the strengths of the domain shifts.
these findings also translate to genomics [...] Have you considered this?
No, we have focused our attention on datasets with easily interpretable features so far. If the reviewer has a suggestion on a plausible task based on genomics data and an appropriate natural distribution shift, we’d be happy to include that in our analysis.
Thank you for the updates.
Yes, we will add the sample sizes to Table 1. Note that they are currently provided in Table 6 in the Appendix.
Sorry, I did not see that.
I'll leave my score as is and hope this paper gets into the conference.
We thank the reviewer for reading the rebuttal, and appreciate their vote of confidence!
Dear Reviewers,
Thank you for your thoughtful comments and suggestions!
(Contribution) We are encouraged that you found our work has "very useful, inspiring results which will probably spark new research directions and/or rebuttals" (MY3d) and "contributes valuable insights for the field" (bBCb). Our separation of features into causal, arguably causal, anti-causal and non-causal is praised as "reasonable and human-like (i.e. that's how everyone should cluster features before training a predictive model)" (yvyj).
(Soundness) We are particularly pleased that our experiments are described as "extensive" (QBUN) and "comprehensive" (bBCb). The reviewers appreciated that we used a "diverse set of tasks from different domains" (6nV6), and concluded that this makes our findings "more applicable to real-world scenarios" (yvyj). Our robustness checks are perceived to "confirm the stability of results" (bBCb) and make our conclusions "seem trustworthy" (yvyj).
(Presentation) Our work is praised to have "high clarity in terms of experimental setup and results" (6nV6), be a "well-described experiment" (MY3d) and have "great detail in the appendix" (bBCb). The reviewers highlighted that this "ensures the validity of their results and is crucial for reproducibility" (6nV6).
We have also taken your feedback into account and made the following key changes to improve our paper:
- (Synthetic experiments) Following your suggestion (6nV6, bBCb, MY3d), we conducted synthetic experiments. The setup is depicted in the PDF of the authors’ rebuttal. The causal mechanisms are modeled as (i) linear with weights randomly drawn in (-1, 1) and (ii) based on a neural network with random instantiation. The noise variables are drawn from a standard normal distribution. The task is to classify whether the target is larger than 0. Similar to Rothenhäusler et al., 2021, we vary the degree of domain shift using shift interventions on the target, features, and confounders, as proposed by MY3d. We draw 1,000 training samples from the causal mechanism and evaluate the performance on 1,000 testing samples from the intervened causal mechanism, with shift interventions varying from 0 to 10 in steps of 0.1. Our code is based on the synthetic study conducted by Montagna et al., 2023. The synthetic experiments confirm our empirical findings: using all features achieves the best out-of-domain prediction accuracy. The one exception occurs when the distribution shift acts exclusively on the anti-causal features, and even in this case, a strong shift is needed before causal features achieve the best out-of-domain accuracy. We highlighted the plots where the exception occurs.
- (Strength of distribution shift) To meet your requests (QBUN, yvyj), we added a table with details on the observed distribution shifts in the PDF of the authors’ rebuttal. We adapted the metrics for target shift, conditional shift, and covariate shift from Gardner et al., 2023; see Appendix E.2 of their paper for the detailed definitions.
- (Ablation study) As you suggested (QBUN), we conducted an ablation study and provided the results in the PDF of the authors’ rebuttal. We remove anti-causal and non-causal features one at a time and measure the corresponding out-of-domain accuracy. We will discuss in detail the non-causal features whose removal significantly dropped the out-of-domain performance and try to give explanations.
We hope these updates address any concerns expressed by the reviewers — we are happy to respond to any additional concerns that might arise during the Author Reviewer Discussion period!
Best regards,
The authors of #16003
References:
Gardner, J., Popovic, Z., & Schmidt, L. (2023). Benchmarking distribution shift in tabular data with TableShift. Advances in Neural Information Processing Systems, 36.
Montagna, F., Mastakouri, A., Eulig, E., Noceti, N., Rosasco, L., Janzing, D., ... & Locatello, F. (2023). Assumption violations in causal discovery and the robustness of score matching. Advances in Neural Information Processing Systems, 36.
Rothenhäusler, D., Meinshausen, N., Bühlmann, P., & Peters, J. (2021). Anchor regression: Heterogeneous data meet causality. Journal of the Royal Statistical Society Series B: Statistical Methodology, 83(2), 215-246.
This paper performs empirical comparison on 16 datasets and show that predictions using all features generalize (both in-domain and out-of-domain) better than when we use only causal features.
The study confirms the reported struggles in finding improvements in the predictive performance of ML algorithms by infusing the causal principle of invariance/stability/autonomy. It attempts to be as rigorous as possible; however, as Reviewer 6nV6 points out and the authors agree, the categorization of features into causal/non-causal is quite difficult. I trust the authors have done their best in this aspect.
This is a timely paper for the ML community. I hope the authors continue this line of work by providing a theoretical understanding of why predictors on only causal features do not generalize better. The authors need to emphasize that they do not challenge the validity of causality theorems, but rather challenge how realistic the assumptions are.