PaperHub

Average rating: 5.8 / 10 (Rejected; 4 reviewers; min 5, max 6, std 0.4)
Individual ratings: 5, 6, 6, 6
Confidence: 3.0 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.5

ICLR 2025

Training and Evaluating Causal Forecasting Models for Time-Series

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-02-05
TL;DR

We design causal time-series forecasting models using orthogonal statistical learning, and evaluate them by creating a test set of treatment effects estimated with Regression Discontinuity Designs.

Abstract

Keywords
Time-series forecasting; Causal Inference; Regression Discontinuity Designs; Deep Learning

Reviews and Discussion

Review (Rating: 5)

The authors argue that current deep learning approaches to time series modeling are limited by not explicitly modeling causality. To remedy the situation, the authors propose a framework to convert neural network time series models into causal models by fitting a conditional average treatment effect (CATE) model on a temporally held-out dataset. The authors demonstrate how to use orthogonal learning to construct the CATE model and model high-dimensional treatment using 3 different encodings (one-hot, cumulative, and linear). To evaluate the models, the authors construct ground truth CATEs using regression discontinuity designs (RDDs), which use a temporal cutoff to create examples with and without a specified treatment. The authors then show that while non-causal models perform well in terms of RMSE and MAE in distribution, they perform poorly on recovering the CATEs constructed using RDD and domain knowledge. The authors then conclude that their causal models are superior in settings with clear causal relationships.

Strengths

The authors pursue an interesting synthesis of methods from deep learning for time series and statistical tools for treatment effect prediction. Certainly there are many cases in time series where there might be nuisance variables that lead to unexpected behavior out of distribution, and in cases where there is domain knowledge about these nuisance variables, this method might be helpful.

Weaknesses

Overall I found the paper pretty challenging to parse, for example how exactly the models were constructed and evaluated. I am not a domain expert in causal methods, but I work in machine learning and time series modeling, and I don't think the material was so technical that its complexity was the source of the problem, so sections 2-4 might benefit from some reworking to make the key details more salient.

Beyond style, I take two of the central claims of the paper to be:

(1) Adding causal modeling to neural network time series models helps in practical out-of-distribution scenarios

(2) Estimating CATEs in a time-series setting is novel and necessitates methodological innovation

I don't have enough domain knowledge to fully assess (2), but I am currently unconvinced that the paper demonstrates (1). The evaluation of the model seems a little circular in that the examples are specially designed using a similar procedure to the method being evaluated. Perhaps I'm misunderstanding, but I take the evaluation to be whether the causal predictions of the model match a linear RDD analysis. If this is a misunderstanding, a correction might be helpful. If I've understood correctly, when would we be better served by applying the methodology in the paper rather than just applying the linear RDD analysis directly? This evaluation also seems limited to cases where the causal structure of the problem is known and the changing covariate (e.g. price) is observed. How does this solve prediction in out-of-distribution settings? There is also no explicit evaluation on out-of-distribution data using the method, even though this was one of the stated motivations in the introduction.

For (2), the contribution appears to be mostly the encoding of the treatment and the treatment effect \theta(W_t), because the other tools seem to be borrowed directly from related work. There are some additional proofs that those tools can be extended to this setting, but as far as I can tell, the crux of the method is combining a pre-trained time series backbone with existing statistical tools. Please correct me if this is a misunderstanding. Further, from the evaluations presented in the results, the method's performance is nearly identical to the baseline for several of the encodings in terms of the causal evaluation, and its general forecasting performance is worse. What is the practical take-away from this evaluation? What are the settings in which I would want to adopt this method as an alternative to simpler deep learning forecasting methods given that it performs worse in-distribution and there is no explicit out-of-distribution evaluation? In terms of predicting CATE, how can this be used beyond recapitulating the known effects of known treatments?

Questions

I have outlined some potential confusions and questions in my "Weaknesses" section response.

Comment

Thank you for the valuable feedback, which is helping us make our paper clearer. Here are the changes and clarifications we are implementing, which we believe address your concerns. Please let us know if there are questions that we did not address or any remaining issues!

Overall I found the paper pretty challenging to parse, for example how exactly the models were constructed and evaluated.

Sections 2.3 and 3.3 describe how we instantiate our approaches in deep learning models, for causal learning and RDDs, respectively. We will describe them in a more operational way.

For causal learning (S2.3):

  • All models share the same time-series architecture (which we call the backbone in the paper). We split the data into two splits.
  • On the first split, we learn the nuisance models (m and e): m is trained to predict the outcome Y with all features except the treatment and past outcomes, trained with an MSE loss; e is trained with a logistic loss to predict the probability of treatment using the same features.
  • On the second split, we use models m and e (fixed) to create the R-loss, and train theta to minimize the R-loss using SGD.
  • At prediction time, we combine all models as in line 311 to make the final prediction.
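
To make this procedure concrete, here is a minimal PyTorch-style sketch of the two-stage training and the final combination. This is our illustration, not the paper's code: the backbone objects, data loaders, and the binary-treatment simplification are assumptions (the paper's one-hot/cumulative/linear encodings generalize the theta head and the treatment residual), and the exact combination on the paper's line 311 may differ in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# X: features excluding the treatment and past outcomes; T: binary treatment; Y: outcome.
# backbone_m / backbone_e / backbone_theta are three copies of the same time-series
# architecture (the "backbone"), each with its own scalar head.

def train_nuisances(backbone_m, backbone_e, loader_split1, epochs=10, lr=1e-3):
    """Stage 1 (first split): fit the outcome model m (MSE) and propensity model e (logistic loss)."""
    opt = torch.optim.Adam(list(backbone_m.parameters()) + list(backbone_e.parameters()), lr=lr)
    bce = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        for X, T, Y in loader_split1:
            m_hat = backbone_m(X).squeeze(-1)      # estimate of E[Y | X]
            e_logit = backbone_e(X).squeeze(-1)    # logit of P(T = 1 | X)
            loss = F.mse_loss(m_hat, Y) + bce(e_logit, T.float())
            opt.zero_grad(); loss.backward(); opt.step()

def train_theta(backbone_theta, backbone_m, backbone_e, loader_split2, epochs=10, lr=1e-3):
    """Stage 2 (second split): freeze m and e, train theta by minimizing the R-loss."""
    backbone_m.eval(); backbone_e.eval()
    opt = torch.optim.Adam(backbone_theta.parameters(), lr=lr)
    for _ in range(epochs):
        for X, T, Y in loader_split2:
            with torch.no_grad():
                m_hat = backbone_m(X).squeeze(-1)
                e_hat = torch.sigmoid(backbone_e(X).squeeze(-1))
            theta_hat = backbone_theta(X).squeeze(-1)
            # R-loss: ((Y - m(X)) - theta(X) * (T - e(X)))^2
            r_loss = ((Y - m_hat) - theta_hat * (T.float() - e_hat)).pow(2).mean()
            opt.zero_grad(); r_loss.backward(); opt.step()

def predict(backbone_theta, backbone_m, backbone_e, X, T_new):
    """Prediction: combine the nuisance and CATE models (standard R-learner decomposition)."""
    with torch.no_grad():
        m_hat = backbone_m(X).squeeze(-1)
        e_hat = torch.sigmoid(backbone_e(X).squeeze(-1))
        return m_hat + backbone_theta(X).squeeze(-1) * (T_new.float() - e_hat)
```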

For RDDs:

  • For each treatment change, we create a small dataset of points before/after the change during which the treatment does not change again (if there are no such points, we drop that price change).
  • We fit a local linear model with a discontinuity at the price change to estimate the CATE at that price change and that time.
  • This is our ground truth, to evaluate the models on a causal task (predicting this CATE).
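
A minimal sketch of such a local-linear RDD fit for one observed price change, assuming daily observations indexed by t. The window size, variable names, and the single-regression formulation with an interaction term are our illustrative choices (equivalent to fitting separate lines before and after the change plus a step), not the paper's exact procedure.

```python
import numpy as np

def rdd_cate_estimate(t, y, t_change, window=14):
    """Estimate the jump in the outcome y at an observed treatment change at time t_change.

    Fits y on a time trend, a post-change indicator, and their interaction within a local
    window; the indicator's coefficient is the estimated discontinuity (the CATE)."""
    t, y = np.asarray(t, dtype=float), np.asarray(y, dtype=float)
    mask = (t >= t_change - window) & (t <= t_change + window)
    tt, yy = t[mask] - t_change, y[mask]
    post = (tt >= 0).astype(float)
    # Design matrix: intercept, linear trend, step at the change, slope change after it.
    X = np.column_stack([np.ones_like(tt), tt, post, tt * post])
    coef, *_ = np.linalg.lstsq(X, yy, rcond=None)
    return coef[2]  # step coefficient = estimated causal effect of the observed change
```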
Comment

(1) Adding causal modeling to neural network time series models helps in practical out-of-distribution scenarios

The evaluation of the model seems a little circular in that the examples are specially designed using a similar procedure to the method being evaluated. Perhaps I'm misunderstanding, but I take the evaluation to be whether the causal predictions of the model match a linear RDD analysis. If this is a misunderstanding, a correction might be helpful.

Yes, this is a misunderstanding: our linear encoding in the causal model is not related to the RDD linear estimator!

  • A linear encoding in the causal TFT means that we predict the coefficient of a linear treatment effect model (it predicts that, at time step t, a change of price from a to b will change the demand by coef*(b-a)). The coefficient varies based on the TFT inputs, so this model is still flexible.
  • The RDD model is a local linear extrapolation of the outcome from adjacent time-steps: for a given observed price change at a time-step t, the RDD estimator takes demand data strictly before the change, and linearly extrapolates the demand to the day of the change. It does the same with data from the future, and adds a step function at the time of the change to measure the discontinuity. The step function’s coefficient is the estimated causal effect of the observed price change: there is no assumption of linearity of the treatment effect.
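
Written out (our notation, simplified), the two objects from the bullets above are: the causal TFT with linear encoding predicts, at time step $t$ with inputs $W_t$,

$$\widehat{Y}_t(b) - \widehat{Y}_t(a) = \theta(W_t)\,(b - a),$$

i.e. a treatment effect that is linear in the price change but whose coefficient $\theta(W_t)$ varies with the inputs, whereas the RDD fits, locally around an observed price change at time $t_0$,

$$Y_s = \alpha + \beta\,(s - t_0) + \delta\,\mathbf{1}[s \ge t_0] + \gamma\,(s - t_0)\,\mathbf{1}[s \ge t_0] + \varepsilon_s,$$

where only the time trend of the outcome is linear and the step coefficient $\delta$ is the estimated effect of that particular observed change.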

when would we be better served by applying the methodology in the paper rather than just applying the linear RDD analysis directly?

We emphasize that this alternative is not possible: RDDs identify the CATE only at observed treatment changes (i.e., we do observe a price change at time-step t for train i). However, a CATE model like our causal TFT aims to predict the causal effect of any price change at any time-step t: the RDDs cannot provide such a model.

This evaluation also seems limited to cases where the causal structure of the problem is known and the changing covariate (e.g. price) is observed. How does this solve prediction in out-of-distribution settings? There is also no explicit evaluation on out-of-distribution data using the method, even though this was one of the stated motivations in the introduction.

This is a miscommunication on our part that we will fix in the paper: we do not aim to support general out-of-distribution scenarios, and do not mean to claim that we do support that! This is why we do not evaluate general out-of-distribution tasks.

What we do support is a specific class of out-of-distribution predictions that result from forecasting models that are later used to make decisions and take actions. In such cases, there is one special variable that the user of the model wants to act on (e.g. choosing a price at which to sell something; choosing which drug to give a patient). This variable is observed, as it is under the control of the user of the model. This is a very common scenario, and both tasks we consider in the evaluation are of this form: (1) training a demand forecasting model, and then calling it with different input prices to choose the price that maximizes revenue; (2) training a model that predicts blood pressure, and later using it with different drugs as input to decide how to treat a patient. Those are real applications! And in fact our data for the first application comes from a real product doing exactly this.

The key challenge in this specific out-of-distribution setting is that the in-distribution treatment (the price set by the train operator; the treatment chosen by the doctor) is very informative about the outcome. Models thus learn to “cheat” when predicting, by using the information contained in the treatment (price chosen by the operator, who has a strong idea of the underlying demand). However, if we make the model predict on a different treatment than the one chosen in the dataset (required to optimize the treatment choice decision), the input treatment does not carry the same information (it was not chosen by the train operator/doctor), and the predictions become really bad: this is our message in the paragraph “The need for causal forecasting”, summarized on Figure 1. Decisions made with these predictions will hence also be bad. Causal models make the right kind of prediction for decisions: they predict a change in outcome that a change in treatment will cause.

However, those models are really hard to evaluate on real, complex tasks, since we do not have access to outcomes under alternative treatments. Our RDD evaluation provides ground truth for these alternative outcomes for specific time points at which we observe a change in treatment. So our RDD RMSE is the out-of-distribution evaluation. To the best of our knowledge, our RDD based approach is the first technique to perform such an evaluation for causal effects, and is part of our contribution.

Comment

(2) Estimating CATEs in a time-series setting is novel and necessitates methodological innovation

contribution appears to be mostly the encoding of the treatment and the treatment effect \theta(W_t), because the other tools seem to be borrowed directly from related work. There are some additional proofs that those tools can be extended to this setting, but as far as I can tell, the crux of the method is combining a pre-trained time series backbone with existing statistical tools. Please correct me if this is a misunderstanding.

We do not believe that this is an accurate characterization of our contributions:

  • We do borrow the core theory from the orthogonal statistical learning paper, though extending it to our setting is not trivial and could be useful to other works. It also helps explain some observations, like the better performance of the linear encoding in the demand forecast, which we can trace back to the impact of the treatment dimension (line 283) that comes from our extension.
  • We do not use pre-trained time series models. We use existing architectures and extend them (our Causal TFT is not an existing architecture), training them from scratch. Our training and prediction procedures, stemming from the theory we use, are also very different from typical deep learning time-series models (see S2.3 and the first part of our answer).
  • The other half of our contribution is a new evaluation technique for causal models, using real data from complex tasks. We do this using RDDs (S3). While RDDs are not new, this application is, to the best of our knowledge. We also believe that this is an important contribution to the field of causal ML, which always struggles with evaluations of causal effects. We believe that future causal prediction papers should leverage and extend this approach.

Further, from the evaluations presented in the results, the method's performance is nearly identical to the baseline for several of the encodings in terms of the causal evaluation, and its general forecasting performance is worse.

When testing a model in distribution (the typical forecasting evaluations, and our regular RMSE metrics), we have access to the sequence of treatments chosen by the knowledgeable decision maker (the train company operator setting the price; the doctor choosing the treatments). These chosen treatments carry a lot of information on the outcome, which models use to “cheat” when forecasting in distribution. Our causal models do not leverage this information by construction (to be able to make causal predictions), so we do expect them to perform worse in distribution! What is important is that they predict better on alternative treatments, which is what will be used to make decisions on the best treatment (the relevant question is “will my revenue be higher if I increase the price by 10%”, and not “how many tickets will I sell if I set the same price as usual?”).

In terms of predicting CATE, how can this be used beyond recapitulating the known effects of known treatments?

The CATE models that we train extrapolate, and do not only recapitulate the known effects of known treatments: Theorem 3 is a statement of convergence of this predictive model, stating that if each sub-model converges properly, our CATE model will converge to the true CATE.

We can see the RDDs as an estimator for known effects of known treatments, but this is actually a good property: we use these known effects as a ground truth (computed on the test set) to evaluate the predictions of our CATE models!

What are the settings in which I would want to adopt this method as an alternative to simpler deep learning forecasting methods given that it performs worse in-distribution and there is no explicit out-of-distribution evaluation? What is the practical take-away from this evaluation?

As we explained above, there is an explicit out-of-distribution evaluation: this is the RDD RMSE. As we can see, the causal models perform much better there (except on MIMIC further in the future, where the Causal Transformer and baseline models “cheat” again, as explained lines 534+ in the paper). The implication is that the causal models make better predictions for the purpose of choosing the optimal treatment based on the predictions. Hence, we would want to adopt those models in cases where we will use the predictive models to optimize the treatments we choose (e.g., to set the price to maximize revenue). Both applications we present are examples in which we would want to choose the causal models. This is the practical take-away from our evaluation.

Comment

Thank you to the authors for the clarifications. I'm going to raise my score and lower my confidence. I don't feel well-positioned to assess the novelty or significance of this work, but I can now understand it is making reasonable claims and providing evidence for those claims.

Comment

Thank you for engaging with our answer!

Review (Rating: 6)

This paper highlights the need for time series models to showcase generalization to out-of-distribution data, where the relationship between treatment and outcome differs from the training data. To this end, the authors propose extending the orthogonal statistical learning framework to train causal forecasting models that can capture changes in outcome based on changes in treatment outside of the training distribution. Due to a lack of out-of-distribution evaluation of existing time series models, the authors propose to evaluate out-of-distribution by creating a test set of causal treatment effects using Regression Discontinuity Designs (RDD).

Strengths

  1. This work has real-world applications and has the potential to improve decision-making processes that depend on forecasts from time series models.
  2. The two extensions to adapt the orthogonal statistical learning framework to train causal models—defining daily treatment effects and extending the binary treatment effect data model to categorical and linear treatments—are mathematically justified.
  3. Despite the complex approach and new terminologies introduced, the paper is well-structured and relatively easy to follow.

Weaknesses

  1. T_t for treatment at time t, and T_1, T_2 for treatment types can be confusing.
  2. The evaluation method using RDD is complex and may generate only a few test set examples.

Questions

Why are more datasets not considered for evaluation?

Comment

Thank you for the valuable suggestions, which will help us improve our paper:

T_t for treatment at time t, and T_1, T_2 for treatment types can be confusing.

This is indeed confusing. We will switch all treatment variables T to A (for action) to be closer to Reinforcement Learning notations which might be more familiar to the audience. This will also disambiguate the time index t and the treatment T. We will use A={a,b} for the generic two actions setup, so that we avoid subscripts that we use for time in the rest of the paper.

The evaluation method using RDD is complex and may generate only a few test set examples.

RDDs are fairly complex, but they are a well-studied and widely used identification strategy in econometrics, and our automated process to create them is fairly robust. Importantly, RDDs let us estimate the CATE for a causal evaluation on real, complex applications. To the best of our knowledge, this is the first proposal to address this long-standing challenge in causal ML!

It is true that the number of RDD points that we can estimate will depend on the specifics of an application, and in particular on how often treatment changes are observed (too few changes mean too few RDDs, too frequent changes mean that our estimators will be too noisy, so we need an in-between). So this approach will not always apply. In our applications though, we see an average of 1.50 RDDs/train time series (about 150k test points) and 0.51 RDDs/patient time series (about 770 test points for the 1.5k test set of MIMIC) after applying our filtering. While this is far from perfect, it is a significant number of evaluation points (much better than what we could do until now, which was not being able to evaluate causal effects at all on complex applications with only observational data). We believe that it is a very valuable source of evaluation data that we should leverage and extend in the field.
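
As a quick consistency check on the MIMIC figure (our arithmetic, using only the numbers quoted above):

$$0.51 \ \text{RDDs/patient} \times 1{,}500 \ \text{patients} \approx 765 \approx 770 \ \text{test points}.$$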

Why are more datasets not considered for evaluation?

We did consider several other datasets, but none were a good fit for our approach. Causal datasets are typically very small by deep learning standards [1,2,3,4], with treatments that all happen at the same time [1,2] or are unique [3,4]. The only large causal dataset with more treatment variety that we found is [5] (causal impact of price changes through discounts), but the dataset has very few price changes and our approach is not applicable (too few changes to fit causal models, and too few RDDs for evaluation). Other classical time series datasets are in-distribution forecasting tasks without a clear treatment [6-9].

We are currently working on comparisons under a simulation proposed in prior work (see answer to reviewer DJs7). While this is a much simpler task than a real application, this will add one more example to showcase our approach.

Dataset references:

[1] Advertisement Data: Brodersen, Kay H. and Gallusser, Fabian and Koehler, Jim and Remy, Nicolas and Scott, Steven L., “Inferring causal impact using Bayesian structural time-series models”, 2015.

[2] Geo Experiment Data: Jouni Kerman and Peng Wang and Jon Vaver, “Estimating Ad Effectiveness using Geo Experiments in a Time-Based Regression Framework”, 2017.

[3] Economic data for Spanish regions: Abadie, Alberto and Gardeazabal, Javier, “The Economic Costs of Conflict: A Case Study of the Basque Country”, 2003.

[4] California’s Tobacco Control Program: Abadie, Alberto, Diamond, Alexis, and Hainmueller, Jens, “Synthetic Control Methods for Comparative Case Studies: Estimating the Effect of California’s Tobacco Control Program”, 2010.

[5] m5 Forecasting challenge: https://www.kaggle.com/competitions/m5-forecasting-accuracy.

[6] Traffic: https://pems.dot.ca.gov/

[7] Electricity: https://archive.ics.uci.edu/dataset/321/electricityloaddiagrams20112014

[8] Exchange: Guokun Lai and Wei-Cheng Chang and Yiming Yang and Hanxiao Liu, “Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks”, 2018.

[9] Weather: https://www.bgc-jena.mpg.de/wetter/

Comment

Our message "Follow-up on simulations" above summarizes the results of the simulation we ran. We believe it to be a good example of the value of the RDD-based evaluation technique we contribute.

Review (Rating: 6)

The paper proposes an alternative loss function for training causal deep learning models to estimate conditional average treatment effects (CATE) over time - through the use of orthogonal learning to directly learn the CATE from data under unconfoundedness assumptions. Losses are applied over a customised TFT architecture for time series predictions, and they demonstrate improvements using Regression Discontinuity Design (RDD) experiments.

Strengths

While many architectures have been proposed for time series forecasting in general, causal temporal models have been relatively understudied by comparison. The paper is also well motivated and clearly presented — with the use of orthogonal learning being an interesting approach, and use of the TFT being sensible to incorporate a wider range of inputs. The outline of the RDD evaluation approach is also useful, providing a way to evaluate causal temporal models on live observational data.

Weaknesses

However, the performance improvements do appear slim, with the causal TFT underperforming the causal transformer both on all in-distribution time-steps and on 2/5 of the RDD forecast horizons. Where improvements exist, it is also not immediately clear whether they are attributable to the improved loss function or to the use of the TFT, with the novelty of the former far outweighing the latter.

Questions

  1. How does the causal TFT compare to other models under the original simulation framework used to evaluate Melnychuk 2020, Bica 2020 and Lim 2018? While more simplistic, it makes it easier to control the degree of bias present in policies, which could widen performance gaps (vs the slim margins seen in the RDD experiments).
  2. Where improvements are seen (i.e. shorter time steps in RDD expts), are these attributable to the use of TFT or new loss function proposed?
Comment

Thank you for the valuable comments and suggestions.

Before answering the specific questions, we would like to address one core misunderstanding about the interpretation of our results: “the causal TFT underperforming the causal transformer both on all in-distribution time-steps and on 2/5 of the RDD forecast horizons.”

  • We consider tasks in which a forecasting model is later used to make decisions and take actions. The data we train on includes treatments made by a knowledgeable decision maker (the train operator setting prices; the doctor observing the patient and deciding on a treatment).
  • When testing a regular model in distribution (the typical forecasting evaluations, and our regular RMSE metrics), we have access to the sequence of treatments chosen by the knowledgeable decision maker. These chosen treatments carry a lot of information on the outcome, which models use to “cheat” when forecasting. Our causal models do not leverage this information by construction (to be able to make causal predictions), so we do expect them to perform worse in distribution!
  • When predicting on a different treatment than the one chosen by the decision maker, the treatment does not carry the same information (it was not chosen by the train operator/doctor), and the predictions of non-causal models become worse: this is what we observe at τ+1, and why our causal models beat the Causal Transformer.
  • At τ+t for large t, the Causal Transformer has access to the whole treatment sequence between τ and t-1: this sequence is in distribution (the only data we have), and hence carries the same information about the outcome, which the Causal Transformer “cheats” with. Our causal model does not leverage this information by construction, to isolate causal effects. The Causal Transformer performs better, but it is because it learns non-causal information through the treatment sequence.

We are clarifying these points in our paper.

How does the causal TFT compare to other models under the original simulation framework used to evaluate Melnychuk 2020, Bica 2020 and Lim 2018? While more simplistic, it makes it easier to control the degree of bias present in policies, which could widen performance gaps (vs the slim margins seen in the RDD experiments).

This is a great suggestion, thank you! We are in the process of adding support for this experiment and running experiments. One thing to note is that in the cited work, the models are still evaluated in distribution, on regular forecasting tasks. We are using the simulation to extract the ground truth causal effect, and compare models on predictions of causal effects, to have an evaluation metric relevant to decisions made from the forecast. We hope to be able to report on these results by the end of the discussion period, but will update our paper regardless.

Where improvements are seen (i.e. shorter time steps in RDD expts), are these attributable to the use of TFT or new loss function proposed?

We believe that we can attribute the gains to the loss function (or more precisely the whole algorithm that first learns the nuisance models and then uses the special loss to train a causal effect model). Indeed, we can compare the TFT baseline (same architecture, without the causal learning) to the Causal TFT: we can see on Table 1 that the causal version (especially with linear encoding) performs much better on causal effects (RDD RMSE) than the baseline; on Table 2 the effects are smaller but in the same direction: causally trained TFTs perform better than the baseline with the same architecture. Given that those models share the same architecture, the causal training explains the difference.

Comment

We ran experiments using the simulation from the Causal Transformers paper (Melnychuk et al. 2022). The following table shows the results (γ parametrizes the confounding; higher means more confounded):

| Model | Time 0 | Time 1 | Time 2 | Time 3 | Time 4 |
|---|---|---|---|---|---|
| **γ = 0** | | | | | |
| baseline | 11.3649 ± 0.4857 | **9.4263 ± 0.3944** | **6.8839 ± 0.331** | 5.299 ± 0.4092 | 4.2173 ± 0.4089 |
| theta_cumulative | 11.8109 ± 0.6124 | 10.1104 ± 0.6117 | 8.3477 ± 0.7642 | 6.8449 ± 0.9003 | 5.5295 ± 0.9471 |
| theta_one_hot | 11.843 ± 0.5408 | 10.2364 ± 0.5702 | 8.3729 ± 0.7033 | 6.9625 ± 0.9363 | 5.7561 ± 0.9645 |
| CT | **10.198 ± 0.49** | 9.767 ± 0.855 | 7.137 ± 0.829 | **5.226 ± 0.57** | **3.863 ± 0.642** |
| **γ = 1** | | | | | |
| baseline | | | | | |
| theta_cumulative | 10.5656 ± 0.3548 | 9.0891 ± 0.2433 | 7.1218 ± 0.4141 | 5.4182 ± 0.4109 | 4.2316 ± 0.4845 |
| theta_one_hot | 11.3529 ± 1.0794 | 8.9697 ± 0.4165 | 7.0004 ± 0.4074 | 5.4984 ± 0.4017 | 4.3835 ± 0.5992 |
| CT | **9.242 ± 0.506** | **8.108 ± 0.609** | **5.239 ± 0.47** | **3.62 ± 0.433** | **2.622 ± 0.377** |
| **γ = 2** | | | | | |
| baseline | 9.8447 ± 0.6047 | **7.5981 ± 0.3584** | **4.9255 ± 0.2957** | **3.6377 ± 0.2425** | **2.8244 ± 0.2634** |
| theta_cumulative | 11.3323 ± 1.9849 | 8.4202 ± 0.3689 | 6.3143 ± 0.5561 | 4.8582 ± 0.6434 | 3.6981 ± 0.5625 |
| theta_one_hot | 11.6451 ± 1.2777 | 8.7733 ± 0.3117 | 6.5164 ± 0.4953 | 5.091 ± 0.6833 | 3.997 ± 0.6964 |
| CT | **8.845 ± 0.554** | 8.239 ± 1.29 | 4.976 ± 0.612 | 3.663 ± 0.594 | 2.94 ± 0.764 |
| **γ = 3** | | | | | |
| baseline | 9.6805 ± 0.5086 | **7.1925 ± 0.43** | 4.2715 ± 0.4057 | 3.2891 ± 0.3002 | 2.595 ± 0.2358 |
| theta_cumulative | 10.7574 ± 0.6383 | 8.0589 ± 0.4246 | 5.4557 ± 0.5391 | 4.4202 ± 0.4379 | 3.3092 ± 0.1577 |
| theta_one_hot | 11.1535 ± 1.1087 | 8.2518 ± 0.3328 | 5.4357 ± 0.4608 | 4.3145 ± 0.5313 | 3.2739 ± 0.1963 |
| CT | **8.839 ± 0.564** | 7.371 ± 0.627 | **3.871 ± 0.259** | **2.768 ± 0.279** | **2.035 ± 0.216** |
| **γ = 4** | | | | | |
| baseline | 9.6825 ± 0.4895 | **7.3741 ± 0.4112** | **3.9787 ± 0.2856** | **3.049 ± 0.1995** | **2.4012 ± 0.2134** |
| theta_cumulative | 11.0481 ± 1.5002 | 7.7493 ± 0.4428 | 5.1016 ± 0.4725 | 4.0327 ± 0.1929 | 2.8649 ± 0.1709 |
| theta_one_hot | 11.3982 ± 1.6834 | 7.8388 ± 0.396 | 5.0889 ± 0.5462 | 3.994 ± 0.4108 | 2.9902 ± 0.2262 |
| CT | **8.684 ± 0.562** | 9.335 ± 2.09 | 6.747 ± 4.304 | 5.363 ± 3.549 | 4.789 ± 3.729 |

As is clear from the numbers, the simulation is unfortunately not challenging enough for deep learning models, and can be solved quite well by baseline models minimizing the MSE. Under this simulation's confounding structure, more confounded settings are actually easier! As we discuss in lines 117-122 of our original submission (lines 150-255 in the current version), under the typical causal assumptions (used in our work and the Causal Transformer, and verified in the simulation) such a model will converge to the “correct” model and predict well on new actions. When the task is too easy, this convergence is fast enough that causal techniques are not needed. This is what we observe here, and our baseline TFT handily beats the Causal Transformer (except without any confounding, where they are comparable). Our causal TFT is comparable to the baseline TFT, though a bit worse. It still outperforms the Causal Transformer when there is confounding.

These results are not too surprising: making complex causal simulations that are challenging enough to show a proper separation between causal and non-causal models is notoriously challenging. Many theoretical papers (such as Foster and Syrgkanis 2023) restrict their simulation-based experiments to simpler models (e.g. linear) for this reason. This is why we believe that our RDD-based evaluation technique, which creates a causal test set on real datasets on which we do observe the pitfalls of non-causal models, is a particularly valuable contribution.

Comment

We believe that we have clarified why the CT is expected to perform better several steps after τ\tau, and why this is a sign of confounding and not a sign of good estimates of causal effects. We believe that this shows the strength of our approach, and not a weakness of it. We also pointed to Tables 1 and 2 and a comparison between our baseline and causal TFT, to show that we can attribute the causal gains to our loss.

Finally, we ran experiments in the simulation framework used to evaluate Melnychuk 2020, Bica 2020 and Lim 2018, and have shown that it is not suitable to evaluate complex ML models, as even naive approaches perform well on such a simple task. Our models still outperform the CT.

We believe that this addresses the weaknesses, questions, and suggestions raised in the review: please let us know if other uncertainty remains!

Review (Rating: 6)

The paper concentrates on using orthogonal learning for causal inference and forecasting, for the case of time-series. The paper is well written overall, combining theoretical results (mostly relying on the work of others, e.g., as in the proofs) and empirical investigations. It provides a set of interesting results related to both learning and evaluating forecasting approaches in a causal framework. I find that the paper makes some novel and potentially valuable contribution.

Strengths

It is great to see that the authors focus on both theoretical results and using empirical investigations (for 2 different cases) to explore their approach. It is mostly an extension to existing ideas and methods (e.g., orthogonal learning, for time-series), but still in a way that makes it very valuable since time series learning and forecasting is a very broad field for which any improvement can make a high impact.

Weaknesses

Even though the paper is well written, it is also very compact overall, and difficult to read at stages. Already the motivations could be better developed, as well as the justification for orthogonal learning. Personally, I felt quite frustrated with the readability of the paper. Besides quite a number of points that could be improved in terms of presentation (citation style for references and equations), the flow of the paper is difficult to follow at stages. This is possibly due to page limitations, but still, the authors could have chosen to lighten the introductory parts, in order to leave more room for the novel developments in the paper. Also in terms of the presentation, some of the methodological parts are written as if we only consider a specific application (e.g., mentioning daily prices as treatment at the top of page 5). However, different applications are presented in the empirical investigation part. Maybe there could be some additional effort made with the writing, to make the paper easier to read and to better separate purely methodological considerations and developments from application-specific considerations.

Questions

My suggestions would include:

  • improving the presentation (especially citations to references and equations)
  • improving the way the methodology is presented, so that it does not need to rely on discussing a specific application
  • be clear about why we care about improving RDD RMSE - especially here, the TST baseline in table 1 (which does not rely on any causal inference approach) seems to be as competitive as the bespoke causal approaches...
Comment

Thank you for the valuable feedback.

We believe that the most important issue is the question about “why we care about improving RDD RMSE”? This question seems to stem from a miscommunication on the type of tasks we consider:

  • We consider forecasting models that are later used to make decisions and take actions, a very common setting. Both tasks we consider in the evaluation are of this form: (1) training a demand forecasting model, and then calling it with different input prices to choose the price that maximizes revenue; (2) training a model that predicts blood pressure, and later using it with different drugs as input to decide how to treat a patient.
  • When testing a regular model in distribution (the typical forecasting evaluations, and our regular RMSE metrics), it has access to the sequence of treatments chosen by an informed decision maker when collecting data (the price set by the train operator; the treatment chosen by the doctor). These chosen treatments carry a lot of information on the outcome, which the model uses to “cheat” when forecasting. However, if we make the model predict on a different treatment than the one chosen by an informed decision maker, this treatment does not carry the same information (it was not chosen by the train operator/doctor), and the predictions become really bad. Decisions based on these predictions will hence also be bad! This is our message in the paragraph “The need for causal forecasting”, summarized on Figure 1.
  • Our RDD evaluation technique estimates the true causal effect of changing from one treatment to the other: this is the quantity we care about to make good decisions. For instance, to answer the question “will increasing the price make my revenue higher?” we need to know how increasing the price will impact demand (and not just how much demand there will be at our usual price). Our RDDs are a proxy for this ground truth causal effect, and the RDD RMSE measures how good a model is at predicting this causal effect. The RDD RMSE is thus a metric of how good a forecast is to make decisions, and not just to guess the outcome under decisions made as usual (without changing them). This is why we care about the RDD RMSE. We emphasize that our RDD-based method is, to the best of our knowledge, the first technique to create an evaluation of such causal forecasts on complex, real data tasks.
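
For concreteness, here is a minimal sketch of how such an RDD RMSE metric can be computed, assuming one has a model-predicted effect and an RDD-estimated effect for each retained treatment change (function and array names are illustrative, not the paper's code):

```python
import numpy as np

def rdd_rmse(cate_pred, cate_rdd):
    """RMSE between model-predicted causal effects and RDD ground-truth estimates,
    taken over the time points at which an RDD could be fit."""
    cate_pred = np.asarray(cate_pred, dtype=float)
    cate_rdd = np.asarray(cate_rdd, dtype=float)
    return float(np.sqrt(np.mean((cate_pred - cate_rdd) ** 2)))
```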

We are clarifying these points in the paper. The following, bigger changes that we are implementing will also contribute to passing this message more effectively:

  • Moving the paragraph “The need for causal forecasting” to the introduction. This paragraph is an empirical result that we contribute and that supports the motivation for causal predictions. This will help clarify why we care about the RDD RMSE and will also lighten the background on causal inference.
  • Change T (for treatment) to A (for action) in the theory: this should help clarity in the background (this is a special input that we want to change) and disentangle general statements from the running example.

We answer other questions in the review in the next message.

Comment

improving [...] citations to references and equations

We would also be happy to implement changes to references and equation labels if they can increase clarity. Our reference and equation format seem pretty typical to us: could you please be more specific about what changes to those could improve readability?

Improving the way the methodology is presented, so that it does not need to rely on discussing a specific application

We chose this presentation because a purely abstract exposition of the theory was hard to follow. It seemed beneficial to anchor the theory in a running example. Using two running examples (both applications) ended up being cumbersome. Lines 66-67 state that “In the remainder of this paper, we use a demand forecasting task for passenger rails as a running example”. We still believe that using this running example is a good trade-off for presenting theory, but we will improve clarity as follows: (1) clarify that the theory is general, and that we apply it to another application later in the paper; (2) re-assert this later in the theory, so that it is not lost in the first sentence of S2; (3) in the exposition, make sure we state the results with the generic terminology (e.g. time steps will not be days by default), and use phrasing like “in our running example, this is a day” to emphasize the difference between general and example statements.

be clear about why we care about improving RDD RMSE - especially here, the [TFT] baseline in table 1 (which does not rely on any causal inference approach) seems to be as competitive as the bespoke causal approaches…

We answered above about why we care about the RDD RMSE metric, which we believe is the most crucial point in our paper. In Table 1, the TFT baseline is the best non-causal model, but we can see that the Causal TFT with linear encoding of theta and the Causal iTransformer with linear encoding for theta both beat the TFT baseline by a large margin (a 37% improvement!). This is related to our remark after Theorem 3 (line 283): the linear encoding is a good approximation of the causal effect, and it is more tolerant of errors in the nuisance models (see factor d in Proposition 3: d=1 for the linear encoding, but it is high for the vector encoding). The MIMIC dataset has a lower-dimensional treatment, so categorical encodings do provide improvements on causal (RDD) tasks.

Please let us know if there is any issue that we are not addressing with those changes!

Comment

Thanks for your reply, and your willingness to make changes to your paper.

Comment

We have updated our paper based on the reviews' constructive comments, thank you.

Summary:

  • We have addressed misunderstandings in review PffE in the answers, and clarified the paper to avoid them in the future.
  • We also addressed concerns raised in the reviews and clarified our approach, the setting we address, and our contributions, incorporating all clarifications and discussions posted in our answers below.
  • Finally, we performed experiments on a simulation setting from prior work, as suggested by review DJs7. We show that such simulations are not suitable to evaluate complex ML models, as even naive approaches perform well on such a simple task. This confirms the value of our RDD-based evaluation, which evaluates causal effects on real tasks.

The largest changes include:

  • We switched our terminology to actions instead of treatments, to be closer to typical reinforcement learning terminology and to disambiguate the notation between treatment (now action A) and time t.
  • We made the theory exposition less specific to our main application and emphasized its general applicability to an important class of problems.
  • We clarified the applications we target (models that predict well on out-of-distribution actions, which will be used for decision optimization), and why the RDD RMSE is a good measure of performance for our models. We also clarified that we consider out-of-distribution actions, not general out-of-distribution predictions (e.g. OOD features). To further drive this point home, we moved the analysis of fitting baseline ML models in distribution (and how they learn to predict the wrong demand elasticity to price) to the introduction. This better motivates and frames our approach.
  • We clarified how to operationalize the theory in S2.3, and how it differs from traditional orthogonal learning approaches (especially at prediction time).
AC Meta-Review

This paper introduces orthogonal learning for causal time-series forecasting, aiming to assess how interventions impact out-of-distribution outcomes. Reviewers appreciated the integration of theoretical concepts with empirical evaluations on real-world tasks like demand forecasting and healthcare, as well as the innovative use of Regression Discontinuity Designs for evaluating causal effects. However, initial concerns were raised regarding the clarity and presentation, particularly the connection between theory and application, and the limited number of evaluation points due to dependence on specific treatment changes. While the authors’ rebuttal addressed most concerns, the work’s overall significance and novelty remain uncertain for part of the review panel.

Additional Comments from the Reviewer Discussion

Most reviewers acknowledged the rebuttal, though the discussion was not especially informative.

Final Decision

Reject