PaperHub
Rating: 5.5/10
Poster · 4 reviewers
Individual scores: 3, 3, 4, 2 (min 2, max 4, std dev 0.7)
ICML 2025

When to retrain a machine learning model

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
retraining; sequence modeling; forecasting performance

Reviews and Discussion

Review
Rating: 3

This paper presents a novel approach to maintaining the performance of open-world machine learning models operating in continuously changing environments. The primary focus is on addressing a critical challenge in such settings: determining when the model should be retrained. The objective is to balance the trade-off between minimizing retraining costs and maintaining model performance. The paper argues that existing methods become impractical when retraining costs are high. To overcome this, the authors propose a methodology that explicitly accounts for retraining costs while maintaining competitive model performance. The evaluation demonstrates that the proposed approach outperforms conventional methods in scenarios where retraining costs are a significant concern.

Questions for Authors

Why do the results seem incremental when comparing the proposed method with traditional methods, even though the paper considers high training costs?

Claims and Evidence

The claims made in the paper are clear and convincing.

Methods and Evaluation Criteria

Yes

Theoretical Claims

I did not check for correctness of any proofs.

Experimental Design and Analysis

The experimental design seems appropriate.

Supplementary Material

I did not review the supplementary material.

Relation to Prior Literature

The key contribution of the paper relates to designing machine learning algorithms for real-world scenarios where the data undergoes major distribution shifts.

Essential References Not Discussed

The references seem adequate.

Other Strengths and Weaknesses

Strengths

  • The paper is well-written and easy to follow.
  • The approach is mathematically well-grounded, providing a clear and formal formulation.
  • The paper addresses a practical and significant research problem in open-world machine learning, making a relevant contribution to the field.

Weaknesses:

  • Although the paper provides a mathematical analysis for the proposed online objective, the objective itself appears relatively straightforward, as it primarily combines a standard loss function with retraining costs.
  • The performance improvements demonstrated in the evaluation are modest compared to traditional methods that rely on distribution shift signals for retraining, raising questions about the practical advantage of the proposed approach.

Other Comments or Suggestions

A few sentences in the related work section need clarification, for example: "Since the signal is designed to adapt a model rather than trigger a full retraining, these methods are not appropriate as retraining signals."

I think "as" should replaces with "for" but I could be wrong

Author Response

Weaknesses

1. Although the paper provides a mathematical analysis for the proposed online objective, the objective itself appears relatively straightforward, as it primarily combines a standard loss function with retraining costs.

We disagree with the statement that the objective is straightforward. The objective described by Eqn. 1 is not a standard loss function with retraining costs. The loss function described in Eqn. 1 is parameterized by a binary vector $\boldsymbol{\theta}$ that is subject to some constraints. This doesn't resemble any standard setup. If the reviewer is instead referring to the training loss used to obtain the parameters of the Gaussian r.v. in Eqn. 13, it is true that a standard loss is being used, but this is only a small part of our methodology and cannot be characterized as the objective of our method. This would overlook the second part of our methodology, described in Section 4.2, which is central to our method and is how we form our prediction $\boldsymbol{\theta}$.

2. The performance improvements demonstrated in the evaluation are modest compared to traditional methods that rely on distribution shift signals for retraining, raising questions about the practical advantage of the proposed approach.

We disagree with the statement that the improvements over distribution shift methods are modest. On average, in Table 1, the distribution shift methods are 23% worse than ours (or, equivalently, ours is 17% better on average). This is a significant improvement, especially considering that they are on average 31% worse than the oracle (or the oracle is 20% better on average). We can include these summary statistics to better highlight the improvement of our method over these baselines.
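For reference, the two directions of the same gap are related by the identity below (an illustrative consistency note, not a computation from the paper; the small discrepancy between the converted value and the reported 17% is consistent with the averages being taken per dataset before conversion):

$$
c_{\text{shift}} = (1+p)\,c_{\text{ours}} \;\Longrightarrow\; q = 1 - \frac{c_{\text{ours}}}{c_{\text{shift}}} = \frac{p}{1+p}, \qquad \frac{0.23}{1.23} \approx 0.187 .
$$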

Moreover, the experimental setting that we chose is favorable to the distribution shift methods, as we stop the evaluation at a cost-to-performance ratio where the oracle finds that no retraining should be done. Since the distribution shift methods are entirely agnostic to any cost considerations, settings with a very high cost of retraining would be catastrophic for them. This can be easily inferred by extrapolating the results of the distribution shift methods in the top panel of Figure 2 (the green and red lines).

3. A few sentences in the related work section need clarification, for example: "Since the signal is designed to adapt a model rather than trigger a full retraining, these methods are not appropriate as retraining signals." I think "as" should be replaced with "for," but I could be wrong.

Yes, thank you for pointing this out. It is phrased a little awkwardly. We will rewrite line 101 from "Since the signal is designed to adapt a model rather than trigger a full retraining, these methods are not appropriate as retraining signals." to "Since the signal for these methods is designed to adapt a model rather than trigger a full retraining, they are not appropriate to be used as full retraining signals."

4. Why do the results seem incremental when comparing the proposed method with traditional methods, even though the paper considers high training costs?

This is related to our previous response regarding the performance of distribution shift methods. We deliberately avoided scenarios with excessively high costs, where the cost-aware methods would easily determine that no retraining is necessary, while the distribution shift methods would perform very poorly.

Review
Rating: 3

One underexplored problem is that of when to retrain a model, assuming a sequence of datasets over time that experience distribution shift. This problem can be formulated as an off-policy reinforcement learning problem, where the goal is to find a policy with minimum cost. The authors define cost as the sum of retraining costs and the costs of degraded model performance from using an older model. They propose at each step to use past data to fit a future performance forecaster: a Bayesian model of the performance of a model trained on data from time $i$ and evaluated on data from time $j$, using features such as the time elapsed between when the model's training data and the evaluation data were collected. They then propose an online policy that uses this forecaster to model total cost, and decides whether to retrain by comparing fixed quantiles of this modeled total cost with and without retraining. The proposed method, called UPF, generally outperforms baselines including CARA.
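To make the quantile-based rule concrete, here is a minimal sketch (not the authors' code: the Gaussian toy forecaster, the one-step horizon, the function names, and the quantile level are all illustrative assumptions) of retraining only when a fixed quantile of the forecast total cost favors it:

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianForecaster:
    """Illustrative stand-in: forecasts the performance cost of a model
    trained at `train_time` and evaluated at `eval_time` as a Gaussian
    whose mean grows with model staleness (a toy drift assumption)."""
    def sample(self, train_time, eval_time, n):
        staleness = eval_time - train_time
        return rng.normal(0.05 + 0.01 * staleness, 0.02, size=n)

def cost_quantile(perf_samples, retrain_cost, q):
    # q-quantile of total cost = retraining cost + performance cost
    return np.quantile(perf_samples + retrain_cost, q)

def should_retrain(forecaster, t, model_age, alpha, q=0.75, n=1000):
    stale = forecaster.sample(t - model_age, t + 1, n)  # keep the old model
    fresh = forecaster.sample(t, t + 1, n)              # retrain now
    # Risk-averse rule: retrain iff the q-quantile of total cost is lower
    # after paying the retraining cost alpha for a fresh model.
    return cost_quantile(fresh, alpha, q) < cost_quantile(stale, 0.0, q)

print(should_retrain(GaussianForecaster(), t=10, model_age=5, alpha=0.02))
```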

Questions for Authors

N/A

Claims and Evidence

  • Proposition 3.1. This seems only tangentially related to the rest of the paper.
  • Proposed method UPF performs well empirically: the experiments seem sensible.

Methods and Evaluation Criteria

Proposed method:

  • I think the formulation of the problem is very reasonable (also an improvement on CARA).
  • I also think the Bayesian model is largely reasonable, though I am curious about ablations with/without $z_{shift}$.

Evaluations:

  • Metrics:
    • AUC for $\hat{C}_\alpha$, to estimate cost but aggregated over a sequence of values of $\alpha$ (the trade-off between retraining and performance costs)
    • Number of retrains compared to an oracle
  • Datasets and models:
    • It is a little strange to use XGBoost as the model for tasks involving text (e.g. epicgames).
    • Why do you choose between 188 models from Hugging Face for last-layer training for iWildCam? What happens if you don't?
    • There could be more detail on how the data were split up into the different datasets $D_t$.
    • Otherwise it seems reasonable

Section A.8.1 advertises something different from what it delivers. The section title and beginning sound like it will address the impact of using a normal approximation to fit the Beta parameters in the method. However, what it actually does is compare using a Beta model to a Normal model.

Theoretical Claims

I checked the proof of Prop 3.1. There appear to be errors that may affect the result.

  • I think there is double counting in (33). The quantity being minimized includes $pe(t_1, t_2)$ and also $pe(t_2, t_2)$. However, the summand in (33) should not contain more than one term for $pe(\cdot, t_2)$.
  • When going from (42) to (44), it seems (44) is missing a $pe(t_r^*, t_{r+1}^*)$ term.

Experimental Design and Analysis

They look mostly reasonable; see Methods And Evaluation Criteria.

Supplementary Material

  • I briefly reviewed the proof of Prop 3.1.
  • I reviewed ablations.
  • I reviewed the dataset section.
  • I checked A.8.1: "impact of the normal approximation".

Relation to Prior Literature

There is a lot of previous work on distribution shift detection and diagnosis but this work addresses cost-aware policies for when to re-train a model.

Essential References Not Discussed

I am not aware of essential references not discussed, but I am also not familiar with other works that address this specific problem.

Other Strengths and Weaknesses

Strengths

  • This paper is clearly written.
  • The proposed formulation and method are very reasonable (with caveats explained elsewhere).
  • The experiments seem reasonable.

Other weaknesses

  • I am struggling to think of a setting where this problem is realistic. My understanding is that in many industry settings, the question of when to retrain a model is not based on cost-aware modeling, but rather based on legitimate and important operational/organizational/procedural reasons that I imagine are costly and/or difficult to try to change.
  • A common problem in modeling under distribution shift over time is how much past data to train on. Here, this problem is not part of the formulation (which is fine).

Other Comments or Suggestions

  • Eq 12 has an extra )
  • Discussions about scaling laws seem unrelated to the paper, as scaling laws are not about retraining on new data from different/updated distributions.
  • Line 232, right side: "setting $\delta = 0.5$ simply selects the decision that minimizes the expected total cost" -- I assume you mean median
  • Comments in blue in the appendix proofs are helpful but you may want to distinguish them from the proof in other ways, since they will not be blue when printed in black and white.
Author Response

Methods

1. Bayesian model: ablations with/without feature $z$

Experimentally, we saw that without the shift feature $z$, the results were on average 5–10% worse, with some datasets more affected than others (the electricity and the synthetic datasets were mostly unaffected). We can include this as an ablation study.

2. XGBoost justification - use for tasks involving text

Better-suited models could be used for each specific dataset, but we decided to keep a single model in these experiments for simplicity. We chose a tree-based algorithm since tree-based methods can handle different data types.

3. Why choose between 188 models for iWildCam? What happens if you don’t?

This experiment was designed to highlight that the setup does not assume any particular model $f$ to be consistently used, and to illustrate a scenario where the training cost is not restricted to the training of a single model. The procedure to obtain $f$ is actually outside of our framework; it is part of what defines the dataset in our setting. We could have also chosen a single model, which would have resulted in lower accuracy values.

4. Details on dataset construction

Thank you for pointing this out. We will include additional details in Appendix A.6.

5. Section A.8.1 advertises something different from what it delivers

That is true. It is slightly misleading, and we will change the title of the section to BETA APPROXIMATION VS NORMAL.

6. Theoretical claims - errors in Proof of Prop. 3.1

Thank you very much for taking the time to verify our result. Yes, there is double counting in (33). The summand should be offset by a minus 1 in the first two terms to avoid this. The second mistake you pointed out in (44) is also related to this missing offset. The added terms should start at $s = t^*_{r+1} + 1$ and not at $s = t^*_{r+1}$. We will fix these errors; this doesn't affect the final result.

Weaknesses

7. Not a realistic setting.

If a practitioner has no flexibility on when/if to retrain the model due to some operational/organizational/procedural reasons, it is true that our algorithm is not applicable, since there is no retraining problem. However, we believe that this is a rare case and that the problem we are addressing is ubiquitous in industrial settings. Many practitioners do have this flexibility and are actually faced with the problem of deciding when to retrain their model. Indeed, we were motivated to formulate this problem due to collaborations with several industry partners who have live, deployed models, and must make this exact decision.

Many companies collect data in real-time, and their interest is to forecast something based on this most recent data. That is the case for companies that operate recommender systems, perform manufacturing control (e.g., detecting faults), or employ fraud detection systems, for example.

The cost of retraining is indeed usually related to external costs (i.e., not computation), but this is precisely what we are trying to tackle in this work. While quantifying these costs can be challenging, these types of risk assessments and quantification are actually made all the time, and are inherently considered when a decision to retrain is being made. Financial transaction companies running fraud detection systems do estimate the potential financial cost that could be incurred if a newly deployed model performed poorly.

The contribution of our work is to provide a formal framework for practitioners to incorporate cost estimates into retraining decisions. It asks them to directly specify what amount of accuracy gain is worth the holistic cost of retraining at a given time step. Our formulation has the considerable advantage of enabling this explicit quantification, and is therefore the most interpretable and useful framing for setting up this problem and making decisions.

8. A common problem in modeling under distribution shift over time is how much past data to train on. Here, this problem is not part of the formulation

This is a valid point and could be an interesting extension to our formulation. Rather than just deciding whether to retrain, we could choose to retrain on varying amounts of samples, adding another dimension to our prediction space. Instead of only the timestep index and model index, we would now have multiple models to choose from.

Comments

9. Discussions about scaling laws seem unrelated to the paper.

The organization of this paragraph made it seem like the scaling law applies to the distribution shift scenario, which, as the reviewer pointed out, is not the case. It belongs to the section where we describe the performance of a given model remaining constant (i.e., no distribution shift). We will reorganize this paragraph by moving the scaling law discussion before the discussion about distribution shift.

10. Other typos

Thank you for bringing these to our attention. We will correct them.

Review
Rating: 4

In this paper, the authors propose a novel strategy to determine when to retrain deployed machine learning models. Specifically, the authors first develop a future performance forecaster for predicting the performance of models built at future time steps. Building on top of it, the authors cast the question of whether to retrain a model as a decision-making problem and connect it to reinforcement learning. The empirical studies demonstrate the effectiveness of the proposed method.

Questions for Authors

See above

Claims and Evidence

Overall, the claims are quite sound and clear. But there are still some minor issues:

  • First, I feel that more explanation is needed to clearly justify why the random variables $A_{i,j}$ can be modeled with a Beta distribution. I feel that there is a jump in the flow which may confuse readers.
  • Second, I am wondering how the future performance forecaster relates to Section 4.2. Is it true that the forecaster is trained first and then used for the derivation in Section 4.2? If so, how is it trained in the first place without knowing the ground-truth future performance numbers? If not, how is its training process related to the following RL algorithm?
  • Third, in Section 4.2, the authors discussed rules for determining $\Theta_t$ at each time step $t$. Are these rules derived by the RL algorithm? More clarification of their connection to the RL formulation would be essential.

Methods and Evaluation Criteria

I think the general method looks good, and it is fully backed by rigorous theoretical analysis. The evaluation criteria are also appropriate.

Theoretical Claims

Yes, I checked the theoretical claims and proofs, which are all correct.

Experimental Design and Analysis

The experimental design and analysis look good to me.

Supplementary Material

Yes, I checked the supplementary material, including the theoretical analysis and parts of the additional experiments. They all look good to me.

Relation to Prior Literature

This paper is closely related to the literature on distribution shifts.

Essential References Not Discussed

No, everything is cited properly.

Other Strengths and Weaknesses

Strengths:

  • The idea of framing the decision of when to retrain a model as a reinforcement learning problem is novel. This is also supported by a rigorous theoretical analysis.
  • Except for some confusing points described above, the overall flow of the paper looks good to me.
  • The authors also performed extensive experiments to demonstrate the effectiveness of the proposed methods.

Weaknesses:

  • I would suggest that the authors add some additional discussion to explain the connection between the retraining problem and the continual learning problem, which I think are pretty close.
  • Although the theoretical analysis in the paper is perfect, the analysis (say, the one in Appendix A.11) seems to be unrelated to the distribution differences between $D_t$ and $D_{t+1}$. Does this mean that the theoretical analysis applies to datasets with arbitrary distributions? More discussion on this point would be essential.

Other Comments or Suggestions

See above

Author Response

Claims and Evidence

1. Beta r.v. motivation:

We agree that the Beta distribution could be better motivated, and we will add sentences to improve the flow. Our choice of a Beta distribution was motivated by the facts that 1) Beta distributions are appropriate (and a common choice) for continuous r.v.s defined over a finite interval; and 2) the empirical accuracies can be interpreted as sums of Bernoulli outcomes, for which the Beta is the conjugate prior. Moreover, the model performs well empirically, and we conducted experiments with a Gaussian-based alternative, shown in Appendix A.8.1, where the Beta model slightly outperformed it. We recognize that this is a modeling choice and that the framework could be applied with other distributions.
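As a small illustration of point 2 (generic notation, not the paper's): with a Beta prior on an accuracy and per-example Bernoulli outcomes, the posterior remains Beta in closed form, which is what makes the choice convenient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate update: Beta prior on an accuracy p, Bernoulli per-example outcomes.
a, b = 1.0, 1.0                      # uninformative Beta(1, 1) prior
outcomes = rng.binomial(1, 0.8, 50)  # 50 correct/incorrect predictions, true acc 0.8
k, n = outcomes.sum(), len(outcomes)
post_a, post_b = a + k, b + n - k    # closed-form Beta posterior

mean = post_a / (post_a + post_b)
var = post_a * post_b / ((post_a + post_b) ** 2 * (post_a + post_b + 1))
print(f"posterior mean {mean:.3f}, std {var ** 0.5:.3f}")
```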

2. Future performance forecaster is trained first and then used for derivation in Section 4.2

Yes, we train the forecaster on training data collected at past timesteps (the offline data, denoted $\mathcal{I}^{offline}$ in Section 3.1). We incorporate a modeling assumption that the predictive relationships learned during this offline period also hold in the operational period, allowing us to forecast performance at future timesteps.
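For intuition, a hypothetical sketch of how such an offline training set could be assembled (the feature choices and the `train_and_eval` hook are illustrative assumptions, not the paper's exact recipe):

```python
import numpy as np

def build_offline_pairs(datasets, train_and_eval):
    """Assemble (features, accuracy) pairs for the performance forecaster.
    `train_and_eval(i, j)` returns the accuracy of a model trained on
    datasets[i] and evaluated on datasets[j]."""
    X, y = [], []
    for i in range(len(datasets)):
        for j in range(i, len(datasets)):
            # Simple features: model age at evaluation time, absolute time
            # index, and an L1 mean-shift statistic between the two datasets.
            z_shift = np.abs(datasets[j].mean(0) - datasets[i].mean(0)).sum()
            X.append([j - i, j, z_shift])
            y.append(train_and_eval(i, j))
    return np.array(X), np.array(y)

# Toy usage: 5 offline datasets with a slowly drifting mean, and a dummy
# accuracy that decays with model age.
rng = np.random.default_rng(0)
datasets = [rng.normal(0.1 * t, 1.0, (100, 3)) for t in range(5)]
X, y = build_offline_pairs(datasets, lambda i, j: 0.9 - 0.02 * (j - i))
```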

3. Are the rules in Section 4.2 derived from the RL algorithm?

No, these rules are not directly derived from the RL formulation. Section A.11 aims to clarify and highlight the connection between our approach and an offline RL formulation, though it is not a perfect match. To make this clearer, we will rename the appendix "Connection to an Offline RL Formulation." Since the problem is closely related to offline RL, we found it important to emphasize these connections. However, the addressed task is not ideal for RL, as discussed in the main text, which is also reflected in the performance of a basic offline RL baseline in Table 7.

Weaknesses

4. Additional discussion to explain the connection to continual learning

There is a connection between the "when to retrain" problem and continual learning, and we will add more text to discuss it. The main difference lies in how cost is incorporated in the problem formulation and whether the setting allows for future planning. In particular, continual learning approaches never delay updates due to cost considerations. For example, in a when-to-retrain scenario, I might decide to hold off from retraining in period 7, because I only have budget to retrain one more time, and if I retrain in the next window in period 8, it will give me better performance for periods 9-11. In continual learning, the primary goal is typically to maintain performance, sometimes while trying to reduce the number of gradient updates. In contrast, our when-to-retrain formulation abstracts away the training procedure and focuses more on cost considerations. Here, cost is not limited to gradient updates. It is a parameter set by the practitioner which will influence the algorithm's behavior, as the algorithm explicitly considers the cost trade-off when deciding whether to retrain.

Moreover, our algorithm does not aim to optimize how models are retrained/updated under distribution shift, which is a core focus of continual learning. In fact, continual learning techniques could be combined with our algorithm to balance the decision of whether to apply an update. Even if continual learning mechanisms indicate that gradient updates should be applied to adapt to distribution shift, our approach could deem that it would still be better to wait due to other training cost considerations.

As suggested, we will add a more detailed explanation of the connections and differences between our problem and the field of continual learning, incorporating these key points.

5. Theoretical analysis: does it apply to datasets with arbitrary distributions?

Yes, in theory, the result in Eqn. 7 does not impose any structure on the dataset distributions, so it would apply to datasets with arbitrary distributions. The same holds for Appendix A.11. However, since the result relies on bounding the performance gap of two models trained on subsequent datasets $\mathcal{D}_i$ and $\mathcal{D}_{i+1}$ and evaluated on dataset $\mathcal{D}_t$, it will likely be related to a function of the distributions of these datasets. We will add a comment to this effect as suggested. This is exemplified in Appendix A.5, where we present examples illustrating how the performance quantities $pe_{i,t}$ can be characterized for a simple Gaussian model. In this example, the bound is indeed a function of the dataset characteristics, i.e., its size $|\mathcal{D}|$, variance $\sigma$, and dimensionality $d$.

Review
Rating: 2

This paper faces the complex task of understanding when a machine learning model needs to be retrained in the presence of drift. In doing so, it takes into account the trade-off between retraining cost and degraded model performance. The approach is based on forecasting the evolution of the model's performance over time and using this information to later decide whether to retrain.
The authors report a set of experiments addressing classification tasks showing the performance of the devised approach.

Questions for Authors

See Sections above.

Claims and Evidence

The claims made in the work seem sound to me.

Methods and Evaluation Criteria

I agree with the authors that this is a tough problem since many things need to be taken into account, and it is difficult to define a proper cost-to-benefit ratio or to analyse the type and amount of non-stationarity in the environment.

I believe that the work may benefit from a discussion about the types of non-stationarities that can be handled using this approach. Indeed, since the authors define a forecasting approach for the model performance, I assume that the presented approach is more suited for slowly changing environments rather than abruptly changing ones, which are harder to handle.

Concerning the proposed method, I believe that one of the main weaknesses lies in the forecasting procedure. Some of the concerns are about the choice of the feature vector $r_{i,j}$. It seems to not be informative enough, since the only feature containing information about the data is the $z_{shift}$ feature, which measures the difference (in $L_1$ norm) between the mean of the input features $\bar{x}_t$ and the mean at the previous timestamp $\bar{x}_{t-1}$. Using only this feature may, for example, prevent the identification of a "real concept drift" (with this expression, I mean when the input distribution does not change but the target distribution conditioned on the input features does).

Another concern is that, as far as I understood, since the first feature refers to the model used, if the algorithm chooses whether or not to retrain the model at time step $i+1$, this influences the target $a_{i,j}$. This means that the prediction of future values also depends on the previous choices of retraining or not retraining the model. I wonder whether this may represent a problem.
Other simplifications lie in the use of a linear model for prediction and in the choice of the Beta distribution for the forecasted values. How can these choices limit the applicability of the approach to more general and complex scenarios?

Concerning the evaluation criteria, the authors do a good job of presenting their results by testing under different values of the retraining cost $\alpha$.

Theoretical Claims

I did not check the proofs of the theoretical claims.

Experimental Design and Analysis

The design of the experiments is done extensively and the presented results seem sound to me.

Supplementary Material

I reviewed Appendix A.7 describing the performance forecaster.

Relation to Prior Literature

The related works are clearly discussed. With respect to the most related work of Mahadevan & Mathioudakis (2024), this paper removes some strong assumptions concerning the data distribution and the impact on model performance and presents a more general objective that takes into account both the retraining cost and the average performance.

Essential References Not Discussed

The related works are thoroughly discussed.

Other Strengths and Weaknesses

Among the strengths of the work, I would mention the extensive and detailed numerical simulations showing the benefits of using the presented approach.

Concerning the weaknesses, see the Section on "Methods And Evaluation Criteria".

Other Comments or Suggestions

N/A

Author Response

Methods

1. Types of non-stationarities: approach is more suited for slowly changing environments rather than abruptly changing ones

That is a good point. It is true that the best scenario for our approach is a slowly changing environment, and abrupt changes would be harder for our model to handle. We will clarify the scope of our paper by adding a comment in the limitations section to the effect that our model is mostly suited to gradual performance changes, and we will state that abrupt changes in performance remain an open problem.

That being said, the combination of probabilistic forecasting and a risk-averse decision algorithm in our method provides some defence against abrupt changes in the environment. Since our approach does not rely solely on pointwise estimates but also produces probabilistic estimates through mean and variance predictors, a failure in capturing sudden changes will result in high-variance estimates for past models. Due to our risk-averse design, the model would be biased toward frequent retraining to avoid catastrophic performance drops. We believe unexpected performance rises are less likely than sudden drops, which is why our approach is designed this way.

2. Forecasting procedure: choice of the feature vector

There are certainly various designs of input features that could be more informative. However, we aimed to keep our approach simple, as we are in some ways introducing a new problem setting. Our priority was to ensure broad applicability across settings, rather than focusing on more complex feature combinations that might yield better performance but reduce generalizability.

3. Forecasting procedure: inability to detect concept drift

The point about concept drift is an important one. We agree that it is crucial for the model to identify concept drift, and in fact, our model is capable of capturing pure concept drift, where $p(Y_1|X_1) \neq p(Y_2|X_2)$ with $p(X_1) = p(X_2)$. In short, for a model $f_0$ trained on dataset $D_0 = (X_0, Y_0)$, the presence of concept drift will be reflected in the shift between its performance on the dataset $(X_1, Y_1)$ and its performance on the dataset $(X_2, Y_2)$. Even if the dataset shift feature stays constant, since we also feed the time index as input, our model can pick up this trend. The performance will be a decreasing function of $t$, and that can be captured in the model.

This is an important point worth highlighting. We will add a comment in the text about the model's capability to capture concept drift and also include the possibility of designing more informative input features as a future direction.
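To make this mechanism concrete, a toy simulation (our own construction, not an experiment from the paper): under pure concept drift, the mean-shift feature stays near zero while a frozen model's accuracy decays with $t$, so the time index carries the signal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pure concept drift: p(X) is fixed, but the labeling rule rotates over time,
# so a model frozen at t=0 degrades even though the input means never move.
d, n = 2, 5000
w0 = np.array([1.0, 0.0])              # decision direction learned at t=0

for t in range(5):
    wt = np.array([np.cos(0.2 * t), np.sin(0.2 * t)])  # drifted labeling rule
    X = rng.normal(0, 1, (n, d))       # input distribution never changes
    y = (X @ wt > 0).astype(int)       # labels follow the drifted rule
    acc = ((X @ w0 > 0).astype(int) == y).mean()
    z_shift = np.abs(X.mean(0)).sum()  # mean-shift feature stays near zero
    print(f"t={t}: frozen-model accuracy {acc:.3f}, z_shift {z_shift:.3f}")
```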

4. Dependence on past predictions: prediction of future values also depends on previous choices of whether to retrain the model

This is an excellent point. It is true that retraining decisions influence the online dataset $\mathcal{I}^{online}$, which in turn affects the prediction algorithm. This effect is, in fact, ignored by our algorithm. Empirically, we see that this is not a problem, as we observe good performance. However, it would be a good research direction to investigate strategies that allow for more non-optimal decisions in order to provide training data for the prediction algorithm in other decision regions.

We will add the following comment at line 259 to highlight this fact: "As constructed, past decisions influence the dataset $\mathcal{I}^{online}$ available for the next iteration, but this effect is ignored by the algorithm. Empirically, we find that the algorithm performs well despite this. One direction worth investigating is the incorporation of random decisions to allow the predictor to learn over a broader region of actions and responses."

5. Simplifications: linear model for prediction, Beta distributions. Do these limit the applicability to more general and complex scenarios?

While our approach uses simple components, we do not believe this limits its applicability to complex scenarios. Indeed, the iWildCam experiment represents a reasonably complex setting, with distribution shift and 188 architectures of different natures.

Given the nature of the retraining problem, where data is scarce and potentially noisy, a linear regression model with variance estimation is a suitable choice, especially when only 10-20 samples are available for learning. Furthermore, since this is a relatively new problem formulation, we found it appropriate to introduce a simple yet robust method that can be applied across various settings. That being said, the limited modeling complexity of the chosen method may indeed restrict the model's expressiveness, and exploring more complex algorithms in the future would be valuable.
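As a concrete illustration of this design choice (a generic sketch with assumed hyperparameters, not the paper's exact forecaster): closed-form Bayesian linear regression yields both a predictive mean and a variance that grows away from the few observed points, which is what a risk-averse decision rule can consume.

```python
import numpy as np

def bayes_linreg(X, y, prior_var=10.0, noise_var=0.01):
    """Closed-form Bayesian linear regression with a N(0, prior_var I) prior;
    returns a predictor giving mean and variance at new inputs."""
    d = X.shape[1]
    A = X.T @ X / noise_var + np.eye(d) / prior_var   # posterior precision
    cov = np.linalg.inv(A)
    mean = cov @ X.T @ y / noise_var
    def predict(X_new):
        mu = X_new @ mean
        var = noise_var + np.einsum("ij,jk,ik->i", X_new, cov, X_new)
        return mu, var
    return predict

# 15 noisy samples, mimicking the low-data regime discussed above.
rng = np.random.default_rng(0)
X = np.c_[np.ones(15), rng.uniform(0, 10, 15)]
y = 0.9 - 0.03 * X[:, 1] + rng.normal(0, 0.1, 15)
mu, var = bayes_linreg(X, y)(np.array([[1.0, 12.0]]))  # extrapolation: variance grows
print(mu, var)
```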

Regarding the Beta assumption, it is a common distribution for modeling continuous random variables over a finite interval. Please see our response to point 1 of Reviewer 1 for more discussion.

Final Decision

This paper aims to formulate the problem of when to retrain a deployed model and applies a novel reinforcement learning approach to forecast model failure. The reviewers had generally positive opinions of the work (1 x accept, 2 x weak accept, 1 x weak reject). The reviewer who leaned toward rejection had the primary criticism that the problem is inherently very difficult. Yet, this reviewer still found the empirical validation compelling: "design of the experiments is done extensively and the presented results seem sound". Overall, I find this paper formulates a reasonable approach to an ambitious yet important problem. For subsequent versions, in addition to addressing the reviewers' concerns, please cite and discuss Hoffman et al. (2024): https://arxiv.org/abs/2405.13926. I believe it is quite relevant and provides some complementary methodology (although I have not read all its details).