PaperHub
Rating: 6.1/10 · Poster · 4 reviewers (scores 3, 3, 4, 3; min 3, max 4, std 0.4)
ICML 2025

AutoCATE: End-to-End, Automated Treatment Effect Estimation

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

AutoCATE is the first end-to-end, automated framework for estimating conditional average treatment effects (CATE), unifying evaluation, estimation, and ensembling in a single framework.

Abstract

Keywords
Treatment Effect Estimation · Causal Inference · AutoML

Reviews and Discussion

Review (Rating: 3)

In this paper, the authors develop and release an AutoML library, called AutoCATE, for the automated selection and hyperparameter tuning of meta-learners for CATE estimation. They divide the CATE development pipeline into evaluation, estimation, and ensembling phases: evaluation corresponds to choosing a proxy risk, estimation corresponds to the steps used for training CATE learners (including hyperparameter tuning), and ensembling uses the proxy risk from the evaluation stage to find the best learners from the estimation stage, which can consist of an ensemble of learners. They conduct extensive experiments on four benchmark datasets (IHDP, ACIC, Twins, and News) to study different design choices for the three stages of AutoCATE.

Questions For Authors

As discussed above, my concerns are, first, about this work's contributions relative to the literature, and second, that AutoCATE is limited to meta-learners only.

Claims And Evidence

  1. They claimed AutoCATE as an AutoML solution for CATE, but it is limited to meta-learner approaches only and does not consider many other learners based on trees or neural networks, such as causal forests, TARNet, and SNet.

Methods And Evaluation Criteria

  1. As discussed above, the analysis is limited to meta-learners only, and does not consider other CATE learners.

Theoretical Claims

N/A

Experimental Design And Analyses

The experimental design looks okay to me, as they have provided detailed experiments to study the three phases of the proposed framework, i.e., evaluation, estimation, and ensembling, where they study the effect of using different approaches for risks, learners, and baselearners. However, some things were not clear, as I did not find proper definitions for them, e.g., AllMeta and AllBase.

Supplementary Material

I did not check this part as this was not needed.

Relation To Broader Literature

  1. As per my understanding, there already exists some work on automated CATE, as also referenced by the authors. However, the authors' discussion in the related work focuses on the conclusions of the existing auto-CATE work, which were dataset/setting specific. The authors should have focused directly on their approach and on how it adds value to the literature. Is it just another paper/codebase for developing an auto-CATE pipeline?

Essential References Not Discussed

As discussed above, the discussion focuses only on meta-learners and their automated training, which obviously does not cover all CATE estimators in general.

Other Strengths And Weaknesses

Strengths:

  1. This paper discusses an important challenge faced in the causal inference literature for CATE estimation: hyperparameter tuning and training. They develop an AutoML solution for CATE.
  2. The paper is generally clear and well-organised into headings and subheadings with relevant discussions.
  3. Overall experimental analysis covers all aspects of the proposed framework, considering 4 benchmark datasets.
  4. Authors have ensured the credibility of the work by releasing the code.

Weaknesses:

  1. As discussed above, the authors did not clearly place their contribution against the literature; some work on auto-CATE already exists, and they did not clarify how their proposed approach differs from existing ones, focusing instead on the findings of existing works.
  2. The auto-CATE discussion is limited to meta-learners only, which does not cover many other categories of CATE learners. So, in my opinion, the contribution is not sufficient for this venue.

Update after the rebuttal and follow-up discussion

Dear authors, thanks for your response. I am convinced by your responses on the Related Work and Limited Data points. So, I am raising my score from 2 to 3. All the best!

Other Comments Or Suggestions

  1. To find a risk measure in Stage 1 (Evaluation), you are training CATE learners, including hyperparameter tuning, which is exactly what a standard CATE learner requires. So it appears to be a chicken-and-egg problem to me: how do you make choices in the first stage? Moreover, is it possible to repeat the three phases until convergence to get better estimates?

  2. Will your framework work in limited-data settings? You make choices for risk estimators on the validation data, which is the smaller part of the given dataset, and this is likely to lead to unstable estimates if the choices are made on a small dataset.

  3. Moreover, your AutoCATE framework seems computationally expensive to me. Please comment on this and add it as a limitation, if needed. Also, I did not notice a significant improvement from combining multiple risks in the evaluation. So, is it useful compared to the additional computation/complexity it adds?

Author Response

Thank you for your thorough review!


Related work

While prior work automates parts of CATE estimation, no approach, to our knowledge, provides an end-to-end automated framework. Existing work only automates part of our approach (e.g., learning pseudo-outcomes with AutoML). In contrast, AutoCATE extends automation across the entire pipeline, integrating model selection, preprocessing, and ensembling. In doing so, we address novel challenges like balancing data for training and validation or using multi-objective optimization. Our empirical study is also considerably more extensive, covering diverse datasets and revealing new insights into model selection.

We will revise our paper to better highlight these distinctions.

Metalearners

AutoCATE is a general-purpose framework for CATE estimation, not limited to metalearners. While we use them for their flexibility, our key contribution is automating selection, tuning, and validation—critical for all CATE methods. Other approaches (e.g. TARNet) can easily be integrated in this framework. In future versions, we aim to support custom models through a user-friendly API.

Similar to general AutoML frameworks (e.g. FLAML), which do not incorporate each supervised ML algorithm, AutoCATE focuses on structured automation within an extensible search space. This remains a fundamental challenge, regardless of specific estimators.

Clarity: AllMeta and AllBase

These terms define search spaces in our automated ML pipeline for CATE validation and estimation. AllMeta includes all metalearners in our search (S, T, DR, X, R, RA, Lo, Z, U, F), covering all known techniques in the literature. AllBase consists of the nine base learners in our framework (see Figure 1). BestMeta and BestBase are selected subsets based on performance—metalearners are chosen from experiments and the baselearners are known to work well for tabular data.
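For concreteness, a minimal sketch of these presets as Python constants; the metalearner abbreviations come from this response, while the base-learner names are placeholders we assume for illustration (the actual list is in the paper's Figure 1):

```python
# Illustrative only: shorthand for the search-space presets discussed above,
# not AutoCATE's actual identifiers.
ALL_META = ["S", "T", "DR", "X", "R", "RA", "Lo", "Z", "U", "F"]  # all ten metalearners
ALL_BASE = [  # nine base learners; placeholder names, see Figure 1 for the real list
    "linear", "knn", "decision_tree", "random_forest", "extra_trees",
    "gradient_boosting", "svm", "gaussian_process", "mlp",
]
BEST_META = ["T", "DR", "RA"]  # e.g., the consistently strong metalearners reported
```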

We will clarify these in the revised paper—thank you for pointing this out! If anything else is unclear, we would be happy to clarify further.

AutoCATE stages

In Stage 1, AutoCATE constructs pseudo-outcomes as proxies for ground truth CATE. These are derived from metalearners, but not full CATE learners themselves: e.g., we use the DR-Learner’s pseudo-outcome directly, but without fitting the final model to predict it. This step is also performed on validation data, not training data.
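For concreteness, the DR pseudo-outcome referred to here is the standard doubly robust score, built from outcome models $\hat{\mu}_0, \hat{\mu}_1$ and a propensity model $\hat{e}$ (generic textbook notation, not the paper's):

```latex
\tilde{\tau}_{\mathrm{DR}}(X)
  = \hat{\mu}_1(X) - \hat{\mu}_0(X)
  + \frac{T \left( Y - \hat{\mu}_1(X) \right)}{\hat{e}(X)}
  - \frac{(1 - T) \left( Y - \hat{\mu}_0(X) \right)}{1 - \hat{e}(X)}
```

Predicting this score with a final model yields the DR-Learner; using the score directly as a validation target is what Stage 1 does here.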

Stage 2 trains a CATE estimator on training data and evaluates it using pseudo-outcomes from Stage 1. As evaluation depends on a risk measure, it logically precedes estimation, avoiding a chicken-and-egg problem. Both stages involve choices not dictated by data (e.g., which risk measure to use), which we study systematically.

Repeating stages until convergence is not straightforward, as estimation and evaluation are independent stages with distinct data splits. While evaluation informs estimator selection, estimators do not directly influence evaluation. Nevertheless, exploring these interactions (e.g., by sharing information in their optimization) could be an interesting avenue for future work.

Limited data

We agree that limited data can be a challenge for CATE validation. We emphasize the importance of having sufficient data for validation to mitigate such issues (as highlighted in Figure 3). We also study cross-validation (see Figure 7 and Table 5), which can help to stabilize the estimates for smaller datasets. Empirically, we observe good results for the IHDP data with only 672 instances. Some baselearners (e.g. linear regression) can perform reasonably well with very little data.

That said, we acknowledge that for very small datasets, other approaches may be more appropriate. We will include a more detailed discussion on scenarios where our framework may not be optimal in the conclusion. We refer to the response to reviewer NSJK, where we enlist more limitations.

Computational complexity

Efficiency is an important consideration in our framework. AutoCATE runs fairly quickly on small to moderate-size datasets, often completing in minutes locally (see Table 9). AutoCATE also allows for trade-offs by limiting search space, reducing trials, or choosing faster learners (see Figure 9). Further improvements could be made based on more efficient search and pruning algorithms. Nevertheless, we agree that computational complexity is a potential limitation and will discuss it more in depth in the revision.

Multiple risk measures

We provide an initial exploration of combining risk measures. While no combination consistently outperforms the best single measure, we see significant potential for further development (see Figure 13e). Though not yet optimal, it provides a solid foundation for future improvements in both performance and efficiency. Although this approach can increase computational complexity, reusing components like propensity scores helps mitigate this impact.


Thank you once again! Please let us know if there are any remaining concerns.

Reviewer Comment

Thanks for the rebuttal.

  1. Related work: Unfortunately, only proper citations and comparison/discussion will convince me about the value added by your paper. Your discussion of related work is not based on comparing existing works' approaches to hyperparameter tuning against yours, but rather on their conclusions.
  2. Metalearners: metalearners form only one direction for ITE estimation. If your framework is generic, then you should have considered a few examples from other research directions of ITE.
  3. Limited data: How much of the 672 IHDP instances is validation data? Training models on the validation data, as compared to the rest of the data, seems to lead to unstable results. So, this framework will work when you have sufficiently large datasets.

My concerns about the correct placement of this work w.r.t. the literature, and the limited applicability of the framework to only a subset of ITE methods, remain unaddressed, so I will keep my score.

Author Comment

Thank you for your response!


Related work

We believe there may be a misunderstanding: to our knowledge, the problem of automated CATE has not been addressed in prior work. Existing research only addresses specific components, such as tuning nuisance models, but none propose an end-to-end automated system for CATE estimation.

We categorize the literature on CATE methods into estimation and evaluation (see Section 2.2). For estimation, many methods exist, but practical guidelines for their tuning and implementation are lacking. For evaluation, we provide a more detailed discussion below.

Comparison with CATE evaluation

This work has a different focus: how to evaluate a model for CATE estimation? Importantly, none of these aim to automate CATE estimation end-to-end. Instead, they evaluate a fixed pool of CATE estimators with a fixed pool of evaluation measures and compare their performance. Nevertheless, we can compare components of these works, see the table below.

Our work differs significantly from these efforts. Only Mahajan et al. (2018) (partially) automate evaluation by tuning the nuisance models with AutoML. However, they do not address how long to tune these models or how to automatically find the best CATE estimator. In contrast, AutoCATE fully automates the entire process—estimation, evaluation, and ensembling—integrating model selection, hyperparameter tuning, preprocessing, and ensembling.

Moreover, while these works fix the estimator pool and overlook tuning efficiency, we analyze performance as a function of tuning iterations. This enables us to explore trade-offs in estimator quality under finite compute, a critical practical concern not addressed in earlier studies. In addition, the search space in our work is far more extensive than in previous studies, with 2,187 pipeline configurations (excluding hyperparameters), based on 9 baselearners and 10 metalearners, many of which combine different baselearners, an approach rarely considered in prior research.

| Paper | Estimation search | Evaluation search (nuisance model) | Search efficiency considered? | Baselearners | Metalearners |
|---|---|---|---|---|---|
| Schuler et al. (2018) | Pre-specified pool of estimators | Exhaustive grid search | No | 2 | 3 |
| Mahajan et al. (2018) | Pre-specified pool of estimators | FLAML | No | 5 | 7 |
| Curth & van der Schaar (2023) | Pre-specified pool of estimators (underlying nuisance models tuned using grid search) | Exhaustive grid search | No | 2 | 5 |
| Doutreligne & Varoquaux (2023) | Pre-specified pool of estimators | Random search | No | 2 | 3 |
| AutoCATE | Random search | Random search | Yes | 9 | 10 |

Why this matters

Our core innovation lies in framing CATE estimation as a system design problem—optimizing the configuration of a general protocol that can be applied in a range of practical scenarios. As such, our empirical analysis (Section 4) goes beyond existing work by evaluating the influence of a much wider range of design choices across a wide variety of settings (four benchmark families, spanning binary and continuous outcomes, as well as different sizes and dimensionalities), which no prior study has systematically examined.

We hope this clarification resolves the reviewer’s concern. We will revise the paper to include this discussion and make these distinctions more explicit.


Metalearners

While exploring alternative CATE estimation methods would be valuable in future work, the goal of our paper is to develop a general-purpose framework for the automated selection, tuning, and evaluation of CATE estimators, rather than benchmarking every class of estimator.

Metalearners are a natural candidate to validate the usefulness of our framework as they enable a very large search space of 2,187 estimator pipelines without considering hyperparameters. This way, we are able to empirically validate our primary contributions and rigorously analyze key design choices, including optimization trials, ensembling strategies, and model selection approaches. In our opinion, these questions are independent of the considered CATE estimator classes.

Similarly, related work on CATE evaluation (e.g., Schuler et al., Mahajan et al., Curth & van der Schaar) generally also focuses exclusively on metalearners.

We hope this clarifies that our choice for metalearners reflects a design decision aligned with the paper's scope and contributions, not a limitation of the framework.


Limited data

The IHDP dataset has 672 total instances, which we split into training and validation sets. How much data should we allocate for validation? We explicitly study this in Figure 3. Results are fairly robust when allocating 30–70% to validation. Performance drops at the extremes (e.g., 10% or 90%), as expected. Again, cross-validation could help improve results in these settings—see Figure 7 and Table 5.

We will discuss this limitation in the revision (see our answer to reviewer NSJK).


Thank you for your time and effort during this review process!

Review (Rating: 3)

This paper presents AutoCATE, an automated, end-to-end framework for estimating Conditional Average Treatment Effects (CATE). The core motivation is that while ML methods have made significant advancements in causal inference, their adoption remains limited due to the complexities in pipeline selection, hyperparameter tuning, and validation.

To address these issues, the authors propose framing the problem as a counterfactual Combined Algorithm Selection and Hyperparameter (CASH) optimization and develop AutoCATE, a framework that integrates evaluation, estimation, and ensembling into a single automated solution. The framework searches across various ML models, metalearners, and hyperparameters to optimize CATE estimation.

Questions For Authors

Are there heuristics that shrink the space automatically based on intermediate findings?

You discuss combining multiple risk measures (e.g., DR + T + kNN). Did you consider dynamic weighting or meta-learning that adaptively emphasizes certain pseudo-outcomes depending on early performance signals?

Some results show that simple top-k ensembles can outperform advanced stacking in certain data sets. Could there be reasons (e.g., overfitting to pseudo-outcomes) why stacking underperforms? Are there strategies to mitigate it?

Do you have plans to incorporate domain knowledge or observational data diagnostics (e.g., checking positivity/overlap) more directly into AutoCATE’s workflow (e.g., automatic filtering of extreme propensity scores)?

Claims And Evidence

Claim 1: AutoCATE is “the first end-to-end, automated solution” tailored for CATE estimation. The authors provide an overview of existing libraries (e.g., CausalML, EconML) and show that, while these offer various metalearners or partial automation, they do not perform a comprehensive search across risk measures, model architectures, hyperparameters, and ensembling.

Evidence: The paper includes a table comparing software packages, demonstrating that other libraries focus on some metalearners or on certain tuning aspects, whereas AutoCATE addresses the entire pipeline.

Claim 2: Jointly optimizing evaluation methods and ML pipelines boosts performance over standard “predict the observed outcome” baselines and over conventional single metalearner approaches.

Evidence: Empirical results indicate that simply optimizing based on the observed-outcome (µ) risk can lead to suboptimal CATE predictions, whereas using specialized risk measures (e.g., T-risk, DR-risk) aligns model selection with the actual causal objective. Experiments also show improved accuracy (in terms of √PEHE or Qini-based metrics) when comparing AutoCATE to typical T-/S-Learners that only tune on one group outcome at a time.
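For reference, √PEHE is the standard precision-in-estimation-of-heterogeneous-effects criterion; it requires the true CATE $\tau$, which is only available in (semi-)synthetic benchmarks:

```latex
\sqrt{\mathrm{PEHE}}
  = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} \left( \hat{\tau}(x_i) - \tau(x_i) \right)^{2} }
```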

Claim 3: Metalearners like T-, DR-, and RA-Learners tend to achieve competitive or best performance on average.

Evidence: Through ablation studies, the paper shows consistent strong results for T-, DR-, and RA-Learners, whereas others (like U- or R-Learners) can produce outlier performance in some data sets. This is measured through extensive random-search trials and is illustrated with results in tables and plots.

Overall, the major claims are around: (1) the novelty of a fully automated pipeline specifically for CATE, (2) the demonstrated importance of risk measures aligned with the causal objective, and (3) empirical benefits of ensembling. The evidence rests on thorough experiments across four well-known semi-synthetic causal benchmarks.

Methods And Evaluation Criteria

  • The paper includes multiple metalearners (S-, T-, Lo-, DR-, RA-, etc.) and a large suite of baselearners (random forests, gradient boosting, MLPs, etc.). Metalearners are combined with different pseudo-outcome-based risk measures (e.g., T-risk, DR-risk, kNN-risk) to evaluate the pipeline on held-out or cross-validation data.

  • Evaluation metrics for final performance include √PEHE (for measuring overall error on the true potential outcomes in synthetic data) and AUQC (for ranking-based tasks).

  • The approach is sensible for observational data with confounding: it respects standard assumptions (unconfoundedness, overlap). The use of pseudo-outcomes or IPW-based risk measures is standard in causal inference, but the novelty lies in systematically searching across them and ensembling.

From a methodological perspective, the authors’ approach to separating the “evaluation pipeline” from the final “estimation pipeline” is carefully designed: it includes data splitting so that the pseudo-outcome (or other risk) is learned on a separate portion, then used to guide model selection. This is appropriate for CATE tasks where the ground truth is never observed for each individual’s counterfactual.
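A minimal sketch of this generic split-then-score protocol, here with a T-learner pseudo-outcome as the proxy risk and synthetic data; this illustrates the idea, not AutoCATE's actual implementation:

```python
# Generic split-then-score protocol for CATE model selection (T-risk variant).
# Illustration only, not AutoCATE's implementation.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 5
X = rng.normal(size=(n, d))
t = rng.binomial(1, 0.5, size=n)
y = X[:, 0] + t * (1.0 + X[:, 1]) + rng.normal(scale=0.5, size=n)  # true CATE: 1 + x_1

X_tr, X_val, t_tr, t_val, y_tr, y_val = train_test_split(
    X, t, y, test_size=0.3, random_state=0
)

# Evaluation: build a pseudo-outcome (proxy for the unobservable CATE) on the
# validation split, via a T-learner.
mu1 = GradientBoostingRegressor().fit(X_val[t_val == 1], y_val[t_val == 1])
mu0 = GradientBoostingRegressor().fit(X_val[t_val == 0], y_val[t_val == 0])
pseudo_tau = mu1.predict(X_val) - mu0.predict(X_val)

# Estimation: fit a candidate CATE estimator on the training split, then score
# its validation predictions against the pseudo-outcome (lower T-risk is better).
m1 = GradientBoostingRegressor().fit(X_tr[t_tr == 1], y_tr[t_tr == 1])
m0 = GradientBoostingRegressor().fit(X_tr[t_tr == 0], y_tr[t_tr == 0])
t_risk = np.mean((m1.predict(X_val) - m0.predict(X_val) - pseudo_tau) ** 2)
print(f"T-risk of this candidate: {t_risk:.3f}")
```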

Theoretical Claims

The work frames the search problem in a counterfactual CASH setting and cites known results about pseudo-outcomes converging to the true treatment effect under standard assumptions (e.g., DR-/R-Learners). However, the paper does not provide new proofs or theoretical derivations. Instead, it references known theoretical results (e.g., asymptotic unbiasedness of DR, properties of R-Learners).

No obvious errors appear in the sketches of theory or in the references to established results. The theoretical claims are primarily restatements of known properties (e.g., each metalearner’s consistency under standard conditions).

Experimental Design And Analyses

  • The experiments span four widely used semi-synthetic data sets in CATE research: IHDP, ACIC, Twins, and News. They vary in size, dimensionality, data-generating processes, and outcome type (binary vs. continuous).

  • The evaluation metrics (e.g., √PEHE, AUQC) are standard, and each data set has a known “ground truth” effect or can approximate it, so the experimental design is appropriate.

  • Model comparisons systematically vary key design choices, risk measures, baselearners, data splits, and provide results in tables and line plots.

  • An ablation study approach is used to highlight which parts of AutoCATE (like using T-risk vs. DR-risk or enabling feature selection vs. not) matter most.

  • Overall, the experimental setup is sound. The sample sizes and repeated runs (e.g., 50–200 trials, multiple random seeds) mitigate random variation.

A potential limitation is that the paper relies heavily on semi-synthetic data, so real-world complexities (e.g., non-stationarity, unobserved confounders) may not always arise. The authors acknowledge this and suggest future applications to purely real-world data.

Supplementary Material

The authors include an extensive appendix that details the metalearners, risk measures, ablation studies, and additional results (e.g., effect of number of cross-validation folds). They also provide a comparison table of available CATE software packages, plus some usage examples. I have not identified any gaps between the main paper and the supplementary information: the appendices appear to comprehensively support the main results.

Relation To Broader Literature

  • In AutoML, the authors draw parallels to general-purpose tools like AutoGluon, H2O AutoML, or FLAML, pointing out that standard AutoML focuses on conventional supervised tasks (e.g., classification, regression) and does not address the challenges of CATE estimation (lack of ground truth, confounding, etc.).

  • For CATE methods, they discuss standard metalearners such as the T-, S-, X-, DR-, R-Learners, and highlight that prior work compares them in isolation or tunes them partially, but has not combined them in one broad pipeline that includes “evaluation pipeline search.”

  • They position their approach as bridging these two areas: AutoML + CATE-specific validation.

This situates the paper well in an emerging area of interest, applying automated search for valid causal effect estimation.

Essential References Not Discussed

The paper cites many standard references (e.g., R-, DR-, X-Learners) and relevant AutoML works (FLAML, H2O, etc.). One might also compare with the EconML approach, which implements an “R-risk ensembling.” The authors do mention EconML, but they might elaborate on how exactly EconML’s ensemble compares with the new stacking approach. Nonetheless, the references in the paper are quite thorough, and there do not appear to be major missing lines of prior research.

Other Strengths And Weaknesses

Strengths:

  • The idea of decoupling an “evaluation pipeline” with multiple risk measures from the “estimation pipeline” and then ensembling is a strong conceptual contribution that addresses known challenges in model selection for causal inference.
  • The thorough ablation studies reveal which metalearners and which risk measures typically excel, giving the community new insights.
  • The open-source release (in Python) lowers the barrier to entry for robust causal effect estimation, potentially encouraging broader adoption.

Weaknesses:

  • Real-world data sets that contain unobserved confounders or strong domain shifts are not tested extensively; thus, the framework’s performance under such violations remains open.
  • The search space can become large and computationally expensive (especially for high-dimensional data), though the authors do note ways to tune time or limit certain baselearners.
  • The paper does not propose new theoretical results for metalearners, it mainly integrates known approaches. However, the overall pipeline is novel, so this does not detract from the paper’s main contributions.

Other Comments Or Suggestions

See the above weaknesses section.

Author Response

Thank you for your insightful review; this is highly appreciated!

Real-world data and violations

We agree that these are crucial considerations. While we use real-world data to validate AutoCATE as much as possible (e.g., Twins and the uplift data in Appendix D.4), we acknowledge that more real-world validation would be useful. Unfortunately, we are limited by the fact that the CATE is unknown in real data.

We agree that considering violations of identifiability assumptions is important. We have added a synthetic experiment varying selection bias (γ) and found that while performance declines with higher bias, more optimization trials can partially mitigate the effect (we refer to the response to Reviewer NSJK). Even with strong overlap violations (γ > 10), AutoCATE remains competitive. We will update the paper to clarify the data-generating process and these findings. Finally, we also point out AutoCATE’s good performance on IHDP, which also contains overlap violations [1].

More generally, we recognize the need for improving robustness to such violations and will include this limitation in the revised version.

Domain knowledge and observational data diagnostics

Thank you for the insightful suggestion. We agree that incorporating domain knowledge and data diagnostics would be valuable, and we would love to support such features in future releases. AutoCATE allows for trimming extreme pseudo-outcomes (e.g., from very small propensity scores), though this was not included in our paper. We also agree that checking overlap/covariate balance and filtering propensity scores would be useful additions.

While sensitivity analyses for hidden confounders are appealing, they may fall outside the scope of our automated approach. Instead, we aim for AutoCATE to be complementary to other packages (e.g. DoWhy), which address other aspects of causal inference. Our main goal is automating CATE estimation, a gap we see in current tools. Nevertheless, we see many remaining challenges for truly supporting practical adoption of these methods.

Computational complexity and heuristics

We have designed AutoCATE with efficiency in mind, using parallelization via Optuna, efficient ML implementations via scikit-learn, and configurable constraints. Computational times for different datasets are summarized in Table 9.

Nevertheless, there are several opportunities to improve AutoCATE’s speed. We currently use a naive random search algorithm without heuristics. As our search is implemented with Optuna, more advanced search algorithms and pruning strategies can easily be used instead. Finally, different metalearners have widely varying time complexities (Figure 9). Future research could try to use these discrepancies to further optimize the search.
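As a minimal sketch of the point about Optuna: swapping the sampler is a one-line change. The objective below is a stub standing in for fitting and scoring a sampled pipeline; `evaluate_pipeline` is hypothetical, not AutoCATE's code.

```python
# Sketch: with Optuna, swapping random search for a smarter sampler is a
# one-line change. evaluate_pipeline is a hypothetical stub, not AutoCATE code.
import optuna

def evaluate_pipeline(meta: str, base: str, lr: float) -> float:
    # Stub: in reality, fit the sampled pipeline and return its proxy risk.
    return abs(lr - 0.1) + {"S": 0.3, "T": 0.1, "DR": 0.0, "RA": 0.05}[meta]

def objective(trial: optuna.Trial) -> float:
    meta = trial.suggest_categorical("metalearner", ["S", "T", "DR", "RA"])
    base = trial.suggest_categorical("baselearner", ["rf", "gb", "ridge"])
    lr = trial.suggest_float("learning_rate", 1e-3, 1.0, log=True)
    return evaluate_pipeline(meta, base, lr)

study = optuna.create_study(
    direction="minimize",
    sampler=optuna.samplers.TPESampler(seed=0),  # instead of RandomSampler()
)
study.optimize(objective, n_trials=50)
print(study.best_params)
```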

In the revised version, we will stress the complexity of AutoCATE more explicitly as a limitation.

Combining risk measures

Thank you for these insightful suggestions!

We explore various static strategies for combining risk measures (e.g., averaging, ranking) and compare them empirically in Table 3. While we analyse correlations between risk measures (Figure 13a), we do not yet use this information, though it seems promising. Our current work focuses on the feasibility of combining risk measures using simple approaches. While the suggested dynamic approaches are really exciting directions, their implementation is challenging–key issues include the lack of ground truth and the variance of risk measures. While tackling these challenges is beyond the scope of this work, we see them as important future research directions and hope our framework can serve as a foundation for further advancements.

Stacking

Why might stacking underperform? Without a ground truth CATE, validating and tuning stacking weights is challenging. Stacking may overfit pseudo-outcomes with high variance or outliers. The squared error loss is sensitive to outliers, and alternatives like Huber loss could improve robustness. Stacking also requires more training data, and with limited data, its complexity may hurt generalization.

To improve stacking, we could try to filter pseudo-outcomes based on reliability, possibly using risk measure agreement. Multi-objective stacking could create more generalizable models. Using loss functions like Huber loss could mitigate noisy outcomes. Further research into tailored stacking and ensembling for CATE estimation is needed. Limiting stacking to top models could also improve stability.
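A small sketch of the Huber-loss idea on synthetic stand-ins (candidate predictions and a heavy-tailed pseudo-outcome); purely illustrative, not an experiment from the paper:

```python
# Robust stacking sketch: fit stacking weights against a noisy pseudo-outcome
# with Huber loss instead of squared error. Illustration only.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
preds = rng.normal(size=(500, 4))  # stand-in: 4 candidates' CATE predictions
true_w = np.array([0.5, 0.3, 0.2, 0.0])
pseudo_tau = preds @ true_w + 0.5 * rng.standard_t(df=2, size=500)  # outlier-prone target

ols = LinearRegression().fit(preds, pseudo_tau)
huber = HuberRegressor(epsilon=1.35).fit(preds, pseudo_tau)
print("OLS weights:  ", np.round(ols.coef_, 2))    # more distorted by outliers
print("Huber weights:", np.round(huber.coef_, 2))  # typically closer to true_w
```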


Thank you again for your detailed and thoughtful review. We agree that there is much work ahead, and while some suggestions are beyond the scope of this paper, we are happy to consider any specific approaches you feel should be included for acceptance within the remaining time.


[1] Curth, A., Svensson, D., Weatherall, J., & van der Schaar, M. (2021). Really doing great at estimating CATE? A critical look at ML benchmarking practices in treatment effect estimation. NeurIPS.

Review (Rating: 4)

The authors propose a pipeline for automating the several design choices required for CATE estimation, from preprocessing datasets to different risk measures for model selection. The pipeline is divided into three stages corresponding to three questions: what risk measure should be used for model selection, which CATE estimators should be trained, and finally how should we select among the trained CATE estimators and combine them for better generalization. The authors conduct experiments on widely used benchmarks and present interesting insights regarding the numerous design choices in CATE estimation.

Update after rebuttal

Thanks for the rebuttal! My concerns have been addressed and I want to retain my rating for acceptance.

Questions For Authors

For major questions, please refer to the "Claims And Evidence" section above.

  • How are the multiple risk measures combined? Do we take the average of risk measures, like average of T and DR risk in the experiments?

  • How do the authors obtain the best meta-learner or best base-learner? Is it based on how well they fit the observational data?

Claims And Evidence

Strengths

  • The authors have done a really good job at empirically validating all the design choices involved in the proposed framework AutoCATE! The scale of the empirical study is quite comprehensive; experiments involve a variety of meta-learners, base-learners, and risk measures. This makes their findings interesting and significant for practitioners and future work, and their software package should also make it easy for practitioners to adopt the proposed pipeline.

Weaknesses

  • My concerns are mostly regarding the claims and insights regarding the end-to-end automation part of AutoCATE.

    • Regarding the experiment in section 5.5, the authors should follow the AutoML procedure to tune the S/T-Learners (Mahajan et al. 2023) instead of a manual grid search. This would ensure stronger baselines and a fair comparison with them. Similarly, the authors can construct meta-learners with nuisance models trained via AutoML (Mahajan et al. 2023), and that could serve as an alternative set of CATE estimators for the experiment in section 5.3 (Estimation) as well. For example, the BestBase estimator currently involves a manual search over a grid of different algorithms and hyperparameters, but this could be automated via AutoML.

    • I am not sure what the main conclusions from the experiments with combined risk measures are. The authors did not experiment with many combinations, and only considered combining T & DR risks and different T risks. So the experiments are not exhaustive, which makes it hard to interpret what the main trend should be and what recommendations can be made. A similar comment applies to ensembling with multiple risk measures; I think the strategy of combining risk measures is the most novel aspect of the work, so analyzing it in depth would make the paper stronger.

Methods And Evaluation Criteria

Yes, the proposed methods and evaluation criteria make sense for the problem at hand. All the baselines and benchmarks used in this paper are widely used in causal inference.

Theoretical Claims

There are no theoretical claims in the paper.

Experimental Design And Analyses

Yes, I checked the soundness/validity of all the experiments in the paper, and the experiment design doesn't have any flaws.

Supplementary Material

Yes, I checked all parts of the supplementary material.

Relation To Broader Literature

This work builds upon the prior literature [1, 2, 3] on empirically analyzing model selection strategies for CATE estimation. Model selection in causal inference is a challenging task due to the fundamental challenge of unobserved counterfactual potential outcomes, which led to several heuristics being proposed for this task without much clarity on when a certain strategy must be used. Hence, prior empirical studies [1, 2, 3] empirically analyzed these model selection strategies, often finding that several strategies can be optimal. This paper further builds upon this, by proposing novel ways of combining multiple model selection strategies and also extensively analyzing the role of other factors like dataset preprocessing that was missed by earlier works.

References

  • [1] Schuler, Alejandro, Michael Baiocchi, Robert Tibshirani, and Nigam Shah. "A comparison of methods for model selection when estimating individual treatment effects." arXiv preprint arXiv:1804.05146 (2018).

  • [2] Curth, Alicia, and Mihaela Van Der Schaar. "In search of insights, not magic bullets: Towards demystification of the model selection dilemma in heterogeneous treatment effect estimation." In International conference on machine learning, pp. 6623-6642. PMLR, 2023.

  • [3] Mahajan, Divyat, Ioannis Mitliagkas, Brady Neal, and Vasilis Syrgkanis. "Empirical analysis of model selection for heterogeneous causal effect estimation." arXiv preprint arXiv:2211.01939 (2022).

Essential References Not Discussed

No, I believe all essential references have been discussed to the best of my knowledge. The authors have written a detailed related works section.

Other Strengths And Weaknesses

Strengths

  • Authors introduce several novel components. To the best of my knowledge, analyzing the role of dataset preprocessing and dataset splits for training/evaluation in CATE estimation has not been done in prior works. Further, the authors experiment with novel strategies for model selection with multiple risk measures and ensembling of CATE estimators.

  • The paper overall is well written and organized which makes it easy to follow and understand main results. The experiment results are clearly presented with good discussion around them. I especially like their comparisons with the findings from prior benchmarking studies for CATE model selection.

Other Comments Or Suggestions

  • It would be nice to have statistics regarding the scale of the empirical study before section 5; like how many risk measures, how many meta-learners and base-learners for estimation, how many estimators are included for the model selection study, etc.

Author Response

Thank you for your time and effort in reviewing our work!

Obtaining the best meta/base-learner and training nuisance models with AutoML

Please allow us to clarify our approach. AutoCATE follows an AutoML-based procedure to tune ML pipelines at multiple stages: first, to optimize risk measures for model selection, and second, to construct optimal CATE estimators by searching over meta- and base-learners. We obtain the best CATE estimator (i.e., a metalearner constructed with one or more baselearners) in the second stage, based on the risk measure selected in the first stage. As the reviewer suggests, this indeed ensures that we pick the CATE estimator that best predicts the (observational) validation data.

Our design ensures a fully automated, end-to-end approach that integrates evaluation, estimation, and ensembling within a single framework (see Section 4.5). Importantly, all components of a metalearner (i.e., each baselearner) are tuned automatically and simultaneously, ensuring the best possible downstream performance. We believe that this is one of the key strengths of our work compared to previous approaches!

We hope that these additions clarify our approach. We will update the paper to make this aspect more clear. Nevertheless, if there are any remaining questions, please do let us know.

Tuning the S- and T-Learner

Regarding the tuning of the S-/T-Learners, this is done via a random search with the number of trials equal to AutoCATE's, allowing for a fair comparison. As such, we do not use a manual grid search for either AutoCATE or the benchmarks. Mahajan et al. (2023) use AutoML separately for evaluation and for learning the nuisance models underlying CATE estimators. As neither the S- nor the T-Learner uses nuisance models, we believe that our approach is similar to the one in Mahajan et al. (2023). Additionally, they do not consider the impact of using only a limited number of optimization trials, which is an important consideration in practice. Nevertheless, if there are distinctions we have missed, we would be happy to look into them and adjust the experiments.

We will revise our manuscript to better clarify the reasoning behind and training procedure for the benchmarks.

Combining risk measures

Thank you for these thoughtful questions. We appreciate the opportunity to clarify our approach and conclusions.

How are risk measures combined?

We explore multiple strategies for combining risk measures (see also Appendix B.5; a schematic sketch follows the list below), including:

  • Averaging normalized risk measures (as in Table 1b). This approach corresponds to the reviewer’s suggestion.
  • Averaging rankings of risk measures to improve robustness to outliers.
  • Euclidean distance to the origin (best possible performance).
  • Selecting all Pareto-optimal points.
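A schematic sketch of these four rules over a candidates-by-measures risk matrix (illustration only, not the package's code):

```python
# Four schematic combination rules over a risk matrix R of shape
# (n_candidates, n_measures); lower is better everywhere. Illustration only.
import numpy as np

def combine_risks(R: np.ndarray):
    Rn = (R - R.min(axis=0)) / (R.max(axis=0) - R.min(axis=0) + 1e-12)  # per-measure min-max scaling
    mean_risk = Rn.mean(axis=1)                        # averaged normalized risk measures
    mean_rank = np.argsort(np.argsort(R, axis=0), axis=0).mean(axis=1)  # averaged rankings
    distance = np.linalg.norm(Rn, axis=1)              # Euclidean distance to the origin
    pareto = [i for i in range(len(R))                 # Pareto-optimal candidates
              if not any((R[j] <= R[i]).all() and (R[j] < R[i]).any()
                         for j in range(len(R)) if j != i)]
    return mean_risk, mean_rank, distance, pareto

R = np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.9]])     # 3 candidates, 2 risk measures
print(combine_risks(R))
```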

What are the main conclusions?

Our experiments serve as an initial exploration into combining risk measures. While our hypothesis was that relying on multiple risk measures could enhance robustness and performance, we find that no strategy consistently outperforms using a single T-risk. However, results indicate that this approach is promising, and further research—such as leveraging risk measure correlations (Figure 14a) or advanced ensembling—could improve performance. Therefore, we believe there is evidence that this novel approach provides a fruitful and promising direction for future research on CATE estimator validation. By highlighting the potential of this approach and providing a foundation for future research with our software package, we hope to encourage the community to further explore these ideas.

We will update our paper to clearly state the main conclusions.

Scale of the empirical study

Thank you for this great suggestion! We agree that summarizing the scale of our empirical study upfront helps highlight our contributions, and we will incorporate this more explicitly in the revised manuscript.

In the main body, we present experiments on a total of 247 distinct datasets from four benchmark families, spanning binary and continuous outcomes as well as different sizes and dimensionalities. We also include additional experiments on synthetic data for this rebuttal. In the appendix, we further explore AutoCATE on two uplift datasets.

AutoCATE’s full search space consists of 2,187 possible pipelines (3 feature selection × 3 scaling × 27 meta/base-learner configurations × 9 base learners), excluding hyperparameters (Appendix B.3). It incorporates 8 different risk measures. A full overview of its configuration options can be found in Appendix B.6.

We appreciate the reviewer’s suggestion and will ensure this information is presented more explicitly in the paper.


Once again, thank you for your detailed review! While we hope that these responses address your concerns, please let us know if there are any remaining points. We would be happy to engage further if needed.

Reviewer Comment

Thanks for the rebuttal! My concerns have been addressed and I want to retain my rating for acceptance.

Review (Rating: 3)

The paper presents AutoCATE, an automated framework for CATE estimation, optimizing model selection, tuning, and validation via counterfactual Combined Algorithm Selection and Hyperparameter (CASH) optimization. It unifies evaluation, estimation, and ensembling, automating key design choices for improved generalization. Experiments on benchmarks show AutoCATE outperforms existing methods, and it is released as open-source software for broader adoption.

Questions For Authors

  • Besides these well-known datasets, why not use fully synthetic datasets to systematically control the degree of selection bias and covariate shift? This would allow for precise evaluation of AutoCATE's robustness across different levels of bias and distribution shift, ensuring clearer insights into its generalization capabilities.
  • To what extent can users intervene in the automated process? Are there provisions for customizing or overriding certain steps in the pipeline to tailor it to specific needs or preferences?
  • What are the known limitations or potential failure cases of AutoCATE? Are there scenarios where it might not be the optimal choice for CATE estimation?

Claims And Evidence

The authors claim that AutoCATE surpasses conventional CATE estimation methods and provide empirical evaluations on benchmark datasets that show AutoCATE achieving superior results compared to existing approaches.

Methods And Evaluation Criteria

Yes, the evaluation criteria make sense; however, they could be expanded to cover more scenarios (see the "Questions For Authors" section).

Theoretical Claims

The paper makes no theoretical claims.

Experimental Design And Analyses

Yes, the experimental designs and analyses are valid.

Supplementary Material

I briefly looked at the appendix.

Relation To Broader Literature

NA

Essential References Not Discussed

NA

Other Strengths And Weaknesses

  • The paper is well-written and easy to follow.

Other Comments Or Suggestions

  • Not sure if ICML is the right venue for this paper, as it lacks theoretical contributions (e.g., new proofs, formal guarantees, or novel optimization formulations).

Author Response

Thank you for your thoughtful review!

Robustness to selection bias

We agree that synthetic data allows for precise control over selection bias and covariate shift, enabling a more systematic evaluation. We have added a synthetic experiment where we vary the degree of selection bias (controlled by the parameter γ). Our results indicate that while AutoCATE's performance degrades as selection bias increases, increasing the number of search trials helps mitigate this effect. Even under strong overlap violations (γ > 10), AutoCATE can deliver good performance. We also compare AutoCATE to benchmark models across different bias levels. The results confirm that AutoCATE consistently performs competitively with each baseline in settings with moderate bias and remains relatively robust under extreme bias. We will update the paper to include a detailed explanation of the data-generating process (DGP) and expand on these findings. Thank you for this valuable suggestion; we believe this addition strengthens our evaluation and provides clearer insights into AutoCATE's generalization capabilities.

Synthetic data: setup

| Gamma | 0 | 1 | 10 | 100 | 1000 |
|---|---|---|---|---|---|
| Extreme propensities ([0, 0.01) and (0.99, 1]) (%) | 0.0 | 0.1 | 72.0 | 97.0 | 99.7 |
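The exact data-generating process is not reproduced in this response; as a generic, assumption-laden illustration of how a single parameter γ can push propensities to the extremes (and thus violate overlap), consider a logistic propensity whose covariate signal is scaled by γ:

```python
# Generic illustration only; NOT the paper's DGP, which is not given here.
import numpy as np

def extreme_fraction(gamma: float, n: int = 100_000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    logits = np.clip(gamma * x, -700, 700)           # avoid overflow for huge gamma
    e = 1.0 / (1.0 + np.exp(-logits))                # propensity score P(T=1 | x)
    return float(np.mean((e < 0.01) | (e > 0.99)))   # share with extreme propensities

for g in [0, 1, 10, 100, 1000]:
    print(f"gamma={g:>4}: {100 * extreme_fraction(g):.1f}% extreme")
```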

Synthetic data: √PEHE (SE) for AutoCATE with different numbers of evaluation and estimation trials

| Gamma | 0 | 1 | 10 | 100 | 1000 |
|---|---|---|---|---|---|
| 5 trials | 0.49 (0.03) | 0.53 (0.02) | 0.88 (0.12) | 0.93 (0.12) | 0.95 (0.12) |
| 10 trials | 0.20 (0.03) | 0.25 (0.02) | 0.46 (0.07) | 0.54 (0.08) | 0.58 (0.11) |
| 50 trials | 0.14 (0.01) | 0.15 (0.03) | 0.35 (0.04) | 0.52 (0.06) | 0.44 (0.09) |

Synthetic data: √PEHE (SE) for AutoCATE and benchmarks with 50 evaluation and estimation trials

| Gamma | 0 | 1 | 10 | 100 | 1000 |
|---|---|---|---|---|---|
| S-LR | 1.23 (0.06) | 1.23 (0.06) | 1.25 (0.07) | 1.25 (0.07) | 1.25 (0.07) |
| T-LR | 0.08 (0.01) | 0.09 (0.01) | 0.23 (0.07) | 0.23 (0.08) | 0.24 (0.08) |
| S-RF | 0.51 (0.03) | 0.59 (0.03) | 0.82 (0.04) | 0.83 (0.04) | 0.82 (0.04) |
| T-RF | 0.42 (0.02) | 0.46 (0.03) | 0.98 (0.06) | 0.70 (0.07) | 0.70 (0.06) |
| S-GB | 0.39 (0.02) | 0.45 (0.03) | 0.72 (0.03) | 0.75 (0.04) | 0.75 (0.05) |
| T-GB | 0.37 (0.02) | 0.40 (0.03) | 0.56 (0.05) | 0.55 (0.04) | 0.56 (0.05) |
| AutoCATE | 0.14 (0.01) | 0.15 (0.03) | 0.35 (0.04) | 0.52 (0.06) | 0.44 (0.09) |

Customizability

AutoCATE is designed to be highly customizable yet easy to use. Users can specify the search space (preprocessors, metalearners, baselearners) and set design parameters (evaluation protocols, trial numbers, ensemble methods). Domain experts can also fine-tune risk measures, evaluation metrics, and validation procedures. The interface follows scikit-learn conventions for intuitive use, with configuration details available in Appendix B.6. Future versions will include an API for custom algorithms, enabling users to integrate their own CATE estimation and evaluation methods.
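A hypothetical usage sketch in that scikit-learn style; the import path, class name, and every argument below are assumptions for illustration, not the package's documented API:

```python
# Hypothetical usage sketch; names and arguments are assumed, not documented API.
from autocate import AutoCATE  # assumed import path

model = AutoCATE(
    metalearners=["T", "DR", "RA"],     # restrict the estimation search space
    baselearners=["rf", "gb", "ridge"],
    risk_measure="T",                   # Stage 1 evaluation choice
    n_trials=50,                        # optimization budget
    ensemble="top5",                    # ensembling strategy
)
model.fit(X, t, y)                      # covariates, treatment, outcome
tau_hat = model.predict(X_new)          # estimated CATEs for new units
```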

Limitations

While AutoCATE is designed to be broadly applicable, we acknowledge that no method is universally optimal for all CATE estimation scenarios. As such, there are certain settings where AutoCATE may be less suitable:

  • Very small datasets (n<50), where model selection based on pseudo-outcomes may be unreliable. For IHDP, we achieve good performance with only n=672 instances in the training set.
  • Large datasets with constrained compute, where an AutoML-based approach may be too computationally expensive.
  • Scenarios requiring strong domain knowledge integration, where a fully customized pipeline may be preferable.
  • Data that requires extensive preprocessing, such as raw image or text data.
  • Settings with fairness or regulatory constraints, where automated model selection may need additional safeguards.
  • Cases violating our causal assumptions, such as strong violations of overlap or hidden confounders.
  • Cases considering a different setting, such as instrumental variables or time series/panel data settings.

We agree that clarifying these limitations will strengthen the paper and will update the manuscript accordingly. Thank you for this valuable suggestion. If the reviewer thinks of other limitations that we have missed, we would be happy to also include those.

Theoretical contributions

While our paper does indeed not present new formal proofs or theoretical guarantees, it aligns with ICML's emphasis on application-driven research, as highlighted in the call for papers. ICML encourages innovative techniques and problems motivated by real-world needs, with an emphasis on reproducible experiments and sound analysis, rather than mandatory theoretical components. Many influential ICML papers prioritize empirical insights and conceptual innovations, and we believe our work contributes to this tradition by addressing a significant practical challenge in causal inference.


Thank you for your time and effort in reviewing our work! Please let us know if you have any remaining concerns.

Reviewer Comment

I thank the authors for addressing my comments.

Given the empirical nature of this work, it is crucial to explicitly state its limitations within the paper.

I will maintain my recommendation.

Final Decision

This paper proposes an end-to-end framework for automatic estimation of the CATE alongside a detailed empirical study. Reviewers praised the inclusion of components of the pipeline not studied in previous work (e.g., various preprocessing and data splitting steps) and the comprehensive nature of the experiments. There were various questions about details of the experiments and the relationship to previous work, which were largely resolved during the discussion period. This paper makes a valuable contribution to the area of heterogeneous treatment effect estimation and will be of interest to anyone working in the area.