Beyond Prediction: Managing the Repercussions of Machine Learning Applications
Abstract
Reviews and Discussion
This paper presents a novel algorithm, THEIA, to manage the real-world repercussions of machine learning models. The primary contribution is an approach that optimizes for a main objective, such as accuracy, while simultaneously ensuring with high, user-specified confidence that constraints on a model's repercussions are met, using only existing data from a previously deployed classifier. The method leverages importance sampling to estimate the potential repercussions of a new model and provides theoretical guarantees that it will satisfy the specified constraints. The authors empirically demonstrate THEIA's effectiveness on two real-world case studies, showing it can successfully enforce fairness-related repercussion constraints without a significant loss in accuracy.
Strengths and Weaknesses
Strengths
- The authors conduct a comprehensive empirical evaluation on real-world datasets, demonstrating that THEIA successfully identifies accurate models that satisfy repercussion constraints while various state-of-the-art fairness algorithms fail to do so. The experiments also robustly test the algorithm's performance across different data sizes and in challenging settings where the model's predictions have varying levels of influence on the observed repercussions.
- The paper tackles the critical and previously open problem of managing the real-world repercussions of ML models without requiring pre-existing analytical models that describe these effects. This approach significantly enhances practicality, as it relies only on observational data from a previously deployed model, which is more readily available than a formal causal model.
- The proposed algorithm, THEIA, is supported by formal proofs which guarantee that any solution it returns will satisfy the user-defined repercussion constraints with high probability.
Weaknesses
- The work's primary contribution is in problem formulation, as it applies existing Seldonian and off-policy evaluation techniques to the domain of repercussions, which may be viewed as an incremental methodological step.
- The assumption that repercussions are instance-specific and independent may be unrealistic: systemic, user-level impacts across multiple instances and population-level impacts across users are very common. This limits the framework's applicability to a narrower class of problems.
- The method relies on importance sampling, which is known to suffer from high variance. It may therefore require impractically large datasets to yield reliable estimates, which raises concerns about sample efficiency.
Questions
See strengths and weaknesses.
Limitations
N/A
Final Justification
I have read the authors' rebuttal and will maintain my initial evaluation.
Formatting Issues
N/A
We thank the reviewer for their positive comments on the novelty and importance of our work, the thoroughness of our empirical evaluation, and the relevance of our proposed framework—the first to provide high-confidence guarantees about potential repercussions based only on historical data. We also appreciate the reviewer’s comments on how our method solves, in a principled way, a critical and previously open problem in the literature.
Below, we address the reviewer’s questions and concerns.
On our contributions and methodological novelty
While Theia builds on ideas from Seldonian algorithms and off-policy evaluation, combining these concepts to create a general-purpose framework capable of managing the repercussions of real-world ML models in a principled way is far from straightforward. Formalizing a setting where this combination is not only feasible but also provably effective required addressing significant technical challenges. For example, establishing consistency guarantees (i.e., proving that Theia will always identify and return a repercussion-aware model if one exists, given sufficient data) required a complex four-page-long proof. Without such guarantees, Theia would be only a heuristic, likely unsuitable for high-stakes applications that require rigorous safety assurances.
Beyond this core contribution, we also discuss many principled and readily implementable ways to significantly expand the applicability of our framework. These include adaptations to regression problems and support for enforcing constraints on a wide range of properties of the repercussion distribution—such as conditional value at risk, variance, or median—all of which are paramount in risk-sensitive settings.
Our proposed approach solves an important open problem in the literature. Theia is the first method capable of providing high-confidence guarantees on the real-world repercussions of ML decisions using only historical data. Until now, such guarantees were only attainable through complex analytical models that precisely characterize how predictions influence downstream outcomes. Constructing these models (e.g., via causal inference) is often impractical or infeasible. By eliminating this dependency, Theia fills a critical gap and enables the development and deployment of ML systems that are not only effective but also provably capable of satisfying user-defined constraints on repercussions with high confidence.
Instance-level versus group-level repercussions
The reviewer asks whether focusing on instance-based repercussions may be potentially limiting in scenarios where one wishes to enforce constraints at the group level. This is a great question.
Under our formulation, group-level repercussions can be directly computed from instance-level information. In particular, the objective introduced in Equation 1 is designed precisely to model repercussion constraints at the group level, using a group membership variable that indicates the group to which each instance belongs.
Finally, note that settings where only group-level repercussion information is available typically arise when predictions made by an ML model are not targeted at individuals, but instead influence decisions with collective consequences—e.g., a decision to change the speed limit on a major street. These settings can still be handled by our framework: the group can be treated as an instance, described by group-level features, and the repercussion reflects the average effect of the model’s prediction on all individuals within that group.
Importance sampling variance
The reviewer is correct that importance sampling estimators, in general, may suffer from high variance. However, in our setting, two key properties ensure that this concern does not materialize in practice:
- In general decision-making settings, importance sampling may require data exponential in the horizon. The classification setting we address, by contrast, involves a single decision per instance: determining its label. This makes ours an ideal setting for using importance sampling, since variance is independent of the decision horizon.
- Theia was actively designed to incorporate mechanisms that control potential variance issues (see Footnote 4 and Appendix G, lines 857–869). In particular:
  - To estimate a model’s repercussions, Theia computes the ratio of the probability the candidate model assigns to a label to the probability assigned to the label by the deployed model.
  - The concern raised by the reviewer would be justified if nothing were known in advance about the relationship between these models. Then, the resulting ratios could be extremely small or large, leading to high variance.
  - However, the candidate models evaluated by Theia are not arbitrary. Theia is designed to actively reject candidates that diverge too much from the current model. In particular, such candidates produce large importance sampling ratios, which lead to wide confidence intervals on their repercussions, causing them to fail Theia’s repercussion-awareness test (lines 4–9 of Algorithm 1 and 4–11 of Algorithm 3).
  - In other words, models that could lead to high-variance estimates are naturally filtered out during the search process since they induce unreliable repercussion estimates.
In practice, the two key properties above serve as effective safeguards, constraining importance sampling ratios and thereby controlling variance. The effectiveness of these variance-control methods is evident in all our experiments: Theia consistently identifies models that satisfy all constraints with high confidence, using on average less than 1.5% of the available data—and no more than 12% even in the worst case.
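The ratio-based estimate described above can be illustrated with a minimal sketch. It is not the implementation from the paper: the record layout, the function names, and the normal-approximation bound are all assumptions made for illustration, and the paper's actual confidence bound may differ. The toy log shows that a candidate diverging from the deployed model produces larger ratios and therefore a wider bound.

```python
import math
from statistics import NormalDist

def is_repercussion_estimate(records, candidate_prob):
    """Importance-sampling estimate of a candidate model's mean repercussion.

    records: (features, label, behavior_prob, repercussion) tuples logged
        under the deployed model, where behavior_prob is the probability the
        deployed model assigned to the label it output.
    candidate_prob: hypothetical callable (features, label) -> probability
        the candidate model assigns to that same label.
    """
    # Reweight each observed repercussion by the candidate/deployed ratio.
    weighted = [candidate_prob(x, y) / b * r for x, y, b, r in records]
    n = len(weighted)
    mean = sum(weighted) / n
    var = sum((w - mean) ** 2 for w in weighted) / (n - 1)
    return mean, var, n

def upper_bound(mean, var, n, delta):
    """One-sided (1 - delta) upper bound via a normal approximation.

    Illustrative stand-in only; not the bound used in the paper.
    """
    z = NormalDist().inv_cdf(1.0 - delta)
    return mean + z * math.sqrt(var / n)

# Toy log: the deployed model outputs each label with probability 0.5,
# and the repercussion is 1.0 for label 1 and 0.0 for label 0.
log = [(None, 1, 0.5, 1.0), (None, 0, 0.5, 0.0)] * 2

def near(x, y):  # candidate identical to the deployed model
    return 0.5

def far(x, y):  # candidate that strongly prefers label 1
    return 0.9 if y == 1 else 0.1

near_stats = is_repercussion_estimate(log, near)
far_stats = is_repercussion_estimate(log, far)
near_width = upper_bound(*near_stats, delta=0.05) - near_stats[0]
far_width = upper_bound(*far_stats, delta=0.05) - far_stats[0]
```

In this toy setting the far candidate's interval is wider than the near candidate's, which is the effect the rebuttal relies on when it says divergent candidates tend to fail the repercussion-awareness test.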
We hope these clarifications address your concerns. If so, we would sincerely appreciate it if you would consider revisiting your score. If anything remains unclear, we welcome further feedback and would be happy to continue the discussion. Thank you again for your time and engagement.
This paper introduces a new Seldonian algorithm, Theia, for producing classification models that can satisfy constraints on repercussions derived from a previous model’s performance. The algorithm works by using a historical behavior model and data about the repercussions of that model (for example, levels of polarization induced by a recommendation algorithm). The algorithm allows specification of a probability that the repercussion constraint is violated, as well as a tolerance value to control the tightness of the constraint. The paper uses importance sampling to construct an unbiased estimator of the new model’s repercussions given the historical model’s data. The paper also provides several theoretical guarantees about Theia, including that it satisfies the probabilistic constraint specified and that it is consistent, meaning that it will return a solution satisfying the constraints given enough training data. Through several empirical results, the paper shows that the algorithm does indeed behave as specified across two different settings with repercussions.
Strengths and Weaknesses
This paper’s strengths are its quality, clarity, and thoroughness. The authors have introduced a clear and useful framework for satisfying constraints given historical data on model behavior. The method is novel and will likely be very useful to the fairness community. The empirical results show a clear advantage over other fairness approaches in terms of satisfying the constraints, and the experiments are quite thorough, comparing against several other approaches. The paper lays out several research questions and answers all of them clearly and concisely. This paper would be a valuable addition to the NeurIPS community.
In terms of weaknesses, the primary item missing from this paper, in my opinion, is more discussion of the groupwise impacts of Theia’s approach. I have some specific questions below regarding this. Given that we know that models already struggle for accuracy on groups that are most likely to have these repercussions because they are smaller or tend to be less represented in datasets, I am curious as to whether Theia’s efficacy varies across demographic groups in a similar manner. Lack of experiments in this regard is not necessarily a dealbreaker for the paper, but I think it would be strongly improved by some discussion or consideration of the possibility that the performance of the Theia algorithm itself may be in some way biased or unfair.
Overall, though, this is a very well done and thorough paper, and I thank the authors for their wonderful contribution.
Questions
- Does the accuracy loss in Figure 2 occur only for the group for whom the algorithm is attempting to minimize repercussions? Or is it dataset-wide? I am wondering whether the impacted group that one is hoping to help actually ends up suffering in a different way because of lower accuracy.
- In the case where multiple groups have constraints to be satisfied, what might affect Theia’s efficacy in satisfying those constraints? For example, if I have two groups with constraints, and one has significantly more data, will that one have its constraint satisfied more reliably? Are there ways to mitigate such disparities?
Limitations
I have discussed some limitations in the weaknesses above.
Final Justification
I maintain my high opinion of this paper after discussion.
Formatting Issues
None
We thank the reviewer for their positive comments on the novelty and importance of our work, the thoroughness of our empirical evaluation, and the relevance of our proposed framework—the first to provide high-confidence guarantees about potential repercussions based only on historical data. We also appreciate the reviewer’s comment that our paper would be a valuable contribution to the NeurIPS community.
Below, we address the reviewer’s questions.
Theia's groupwise accuracy
The reviewer correctly pointed out that models satisfying repercussion constraints may lead to different accuracy levels across demographic groups. They asked whether, in our experiments, Theia’s accuracy empirically varied across groups. To investigate this, we conducted additional experiments using the real-world dataset from EXP-1 and found that the accuracy of Theia’s models differed by less than 1%, on average, between the two demographic groups.
That said, in some applications, it may be important to ensure not only that repercussion constraints are satisfied with high confidence, but also that other potential undesirable behaviors—such as large differences in accuracy across groups—are explicitly and formally controlled. Theia fully supports this capability. In fact, as noted in line 289, EXP-1 already makes use of an additional constraint unrelated to repercussions; in particular, a requirement that selected models achieve at least 75% accuracy with high confidence.
Concretely, one way to address the reviewer’s concern within our framework is to include an additional constraint enforcing, with high confidence, that the difference in accuracy across groups does not exceed a specified threshold. Each group’s accuracy can be estimated by evaluating a candidate model on instances from that group.
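As an illustration only, an accuracy-gap constraint of this kind could be written in the usual Seldonian form, where a constraint function must be non-positive. The symbols below are placeholder notation, not necessarily the paper's: $\theta$ is the candidate model, $\mathrm{Acc}_{A}$ and $\mathrm{Acc}_{B}$ are the accuracies on two groups, and $\epsilon$ is the threshold.

```latex
g(\theta) = \left| \mathrm{Acc}_{A}(\theta) - \mathrm{Acc}_{B}(\theta) \right| - \epsilon \le 0
```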
This type of constraint is naturally supported by Seldonian algorithms (of which Theia is an instance). For further details, please see [42].
Satisfying constraints under different group sizes
The reviewer correctly notes that the amount of available data may differ across demographic groups, which can make it harder to confidently satisfy constraints for underrepresented groups. Theia is specifically designed to handle this scenario in a robust and principled way: it returns a solution only if all constraints are satisfied with high confidence.
In particular, Theia is capable of reasoning about its own uncertainty. If any group lacks sufficient data to satisfy its corresponding constraint with high confidence, Theia abstains from returning a model, opting instead to return No Solution Found, rather than producing a model it does not trust. This built-in safeguard ensures that the final model returned by Theia does not favor well-represented groups at the expense of those with less data, thus helping to prevent unintended disparities in performance or treatment.
We hope the comments above have addressed your main comments and questions. If any points remain unclear, we would be glad to clarify further. Once again, we sincerely thank the reviewer for their time and thoughtful feedback.
Thank you for your responses to my questions!
This work presents THEIA, a framework to optimize beyond accuracy and consider the repercussions of machine learning applications in the optimization. The algorithm satisfies a constraint called Seldonian, meaning that if no solution can be found with the available data, it returns NSF (No Solution Found) instead of a solution that it does not itself trust.
Strengths and Weaknesses
Strengths
The Seldonian constraint and the algorithm's ability to return NSF when no feasible solution exists are very practical and useful features.
Weaknesses
- Importance sampling is hard to control and may have high variance. The authors did not provide a good justification of how to overcome this; instead, they only mention in a footnote that they did not observe high-variance situations. This weakens the value of the work.
- The authors overload the term repercussion and position THEIA as a general solution. However, throughout the paper, the only repercussions demonstrated are fairness constraints. The claim does not match the results presented in this work. There are many repercussions that a real ML application should consider, such as long-term behavior shift, polarization of comments in social media, legal constraints, causality, interpretability, etc. This is the main reason why I give a low score for this work.
- As this paper mainly compares fairness constraints, there is a lack of comparison against existing fair learning baselines. The authors claim in a footnote that repercussion fairness is different from static fairness, but they do not clearly explain the difference and do not compare with baselines that can perform repercussion fairness.
- The different conditions that cause NSF are underexplored. It is common that a fair learning algorithm cannot achieve high accuracy and low fairness metrics at the same time. However, the authors did not provide concrete discussion or empirical results on the high-level causes of why this is not achievable.
Questions
Please also refer to the Weaknesses.
- When will THEIA fail due to the high variance of importance sampling, and what is its time complexity compared to existing fairness algorithms?
- How can repercussions be mathematically defined beyond fairness? The authors should provide some examples of how the repercussion function g is defined for other applications.
- When the algorithm returns NSF, how do you know whether the cause is that the constraints are too strict or that hyper-parameters such as \beta are ill-selected?
- Can THEIA handle objectives that are not limited to expected behavior, such as controlling the tail of the distribution?
Limitations
I think the Weaknesses reflect the limitations of this work.
Final Justification
The authors have provided a rebuttal that partially addresses my questions, and thus I decided to increase my rating.
Formatting Issues
No formatting concerns.
We thank the reviewer for their feedback and appreciate their recognition of the practical importance of Theia’s ability to return No Solution Found when constraints cannot be satisfied with high confidence, rather than producing models it does not trust.
Below, we address the reviewer’s questions and concerns.
Importance sampling variance
The reviewer is correct that importance sampling estimators may suffer from high variance. However, in our setting, two key properties ensure that this concern does not materialize in practice:
- In general decision-making settings, importance sampling may require data exponential in the horizon. The classification setting we address, by contrast, involves a single decision per instance: determining its label. This makes ours an ideal setting for using importance sampling, since variance is independent of the decision horizon.
- Theia was actively designed to incorporate mechanisms that control potential variance issues (see Footnote 4 and App.G, lines 857–869). In particular:
  - To estimate a model’s repercussions, Theia computes the ratio of the probability the candidate model assigns to a label to the probability assigned to the label by the deployed model.
  - The concern raised by the reviewer would be justified if nothing were known in advance about the relationship between these models. Then, the resulting ratios could be extremely small or large, leading to high variance.
  - However, the candidate models evaluated by Theia are not arbitrary. Theia is designed to actively reject candidates that diverge too much from the current model. In particular, such candidates produce large importance sampling ratios, which lead to wide confidence intervals on their repercussions, causing them to fail Theia’s repercussion-awareness test (lines 4–9 of Alg.1 and 4–11 of Alg.3).
  - In other words, models that could lead to high-variance estimates are naturally filtered out during the search process since they induce unreliable repercussion estimates.
In practice, the two key properties above serve as effective safeguards, constraining importance sampling ratios and thereby controlling variance. The effectiveness of these variance-control methods is evident in all our experiments: Theia consistently identifies models satisfying all constraints with high confidence, requiring at most 12% of the available data in the worst case.
Do all repercussions have to be fairness-related?
Our empirical evaluation touches on fairness because we required Theia to satisfy repercussion constraints across different groups. However, the repercussion objectives themselves were not fairness-related. In EXP-1, the objective was to improve the longer-term repercussions of ML-driven decisions on foster care youth—e.g., how such decisions might shape their future opportunities. In EXP-2, the goal was to increase, with high confidence, the future financial well-being of clients. These are meaningful, application-specific metrics—not fairness measures per se.
Fairness arises only when we ask Theia to improve these repercussion metrics uniformly across all groups. The exact same experiments could be conducted without that requirement. In this case, however, Theia might return models that improve repercussions on average but risk harming specific subgroups, thus raising concerns about trustworthiness.
Importantly, although Theia’s formulation supports arbitrary group-specific constraints (lines 99–118), these do not need to be fairness-related. One could require, e.g., that repercussions improve for a single group while requiring them to worsen for others. The framework is flexible and fairness-agnostic.
How can repercussion be defined beyond fairness?
The reviewer asked for examples of how repercussions can be defined beyond fairness. In addition to the running example of political polarization and the two scenarios used in our experiments, App.A provides further examples of non-fairness-related domains where Theia could be applied:
- Ensuring that a university’s policy for assigning 1-on-1 tutoring, based on students’ GPAs, does not reduce overall graduation rates. Take as a baseline the current probability that students graduate under the existing policy. The dataset could include, for each student, an indicator of whether they received tutoring and an indicator of whether they graduated. The repercussion constraint can then require that the graduation probability under the new policy does not fall below this baseline.
- Ensuring that a new crime prevention strategy, implemented by a police department based on predicted recidivism risk, does not lead to higher average incarceration rates. Take as a baseline the current incarceration rate prior to deployment of the new strategy. Then, the repercussion objective could enforce that the probability of incarceration (i.e., the expected value of an indicator variable denoting whether a person was incarcerated) does not exceed this baseline with high confidence.
- The third example, related to a potential healthcare application, can be modeled similarly—by defining a repercussion objective that compares expected patient outcomes (e.g., a metric of chronic illness severity after treatment) under a new decision policy to a baseline level of care.
Difference between static and repercussion fairness
The reviewer asked us to clarify the difference between repercussion fairness and standard static fairness definitions studied in the literature. As discussed in footnote 6, these differ qualitatively. Static fairness metrics, such as equalized odds, are defined in terms of performance metrics like false positive and true positive rates. They can be computed directly by evaluating a model on a validation set and comparing its predictions to the true labels.
Repercussion fairness is significantly more general. It concerns the actual downstream impact of model decisions after deployment. Quantifying it requires observing the real-world consequences of using the model—for example, by measuring a client’s debt-to-income ratio one year after the model has been deployed. Such metrics cannot be inferred from standard performance evaluations alone and go beyond what is captured by static fairness metrics.
Comparison to model-free baselines for repercussion fairness
The reviewer asked why we did not compare to baselines that ensure repercussion fairness. As discussed in Section 5, while model-based approaches exist, Theia is, to the best of our knowledge, the first method capable of satisfying repercussion constraints in the model-free setting. There are no existing baselines that operate exclusively on data collected from a previously deployed classifier without access to a model of the environment. For this reason, we compare Theia to well-established, state-of-the-art model-free methods that share our core assumptions and also rely solely on observed data.
Causes for No Solution Found
Unlike most existing methods, Theia can reason about its own uncertainty and return No Solution Found when it cannot guarantee, with the desired level of confidence, that all constraints will be satisfied. The reviewer asked whether Theia could be extended with explainable AI capabilities to help users understand why no models could be certified as repercussion-aware with high confidence.
Recall that to determine if a candidate solution is repercussion aware, Theia checks if a high-confidence upper bound on its repercussion passes a safety test. A first step towards understanding why no solutions were found could be to check if the estimated average repercussion of a model (i.e., a point estimate) would pass the safety test. Intuitively, this point estimate provides an optimistic assessment of a model's repercussion, since it does not account for uncertainty. If even this optimistic estimate fails the test, it suggests that the failure is not due to data scarcity or estimator variance, but rather that no model can satisfy the constraints. By contrast, if the point estimate would pass the test but its associated confidence interval is too wide—resulting in an upper bound that fails—the likely issue is insufficient data to meet the desired confidence level.
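The diagnostic reasoning above can be sketched in a few lines. This is a hypothetical helper, not part of Theia: the function name, the input layout, and the convention that a value passes the safety test when it is at most a threshold (zero by default) are all assumptions made for illustration.

```python
def diagnose_nsf(candidates, threshold=0.0):
    """Suggest a likely reason why no candidate was certified.

    candidates: list of (point_estimate, upper_bound) pairs, one per
        candidate model, for the repercussion constraint value.
    threshold: assumed safety-test cutoff; a value passes if it is
        less than or equal to the threshold.
    """
    # Some candidate passes with its high-confidence bound: not an NSF case.
    if any(ub <= threshold for _, ub in candidates):
        return "solution exists"
    # The optimistic point estimate passes but the bound does not:
    # the confidence interval is too wide, pointing to too little data.
    if any(pe <= threshold for pe, _ in candidates):
        return "likely insufficient data"
    # Even the optimistic estimates fail for every candidate.
    return "constraints likely unsatisfiable"

example = diagnose_nsf([(-0.10, 0.05), (0.20, 0.40)])
```

Here the first candidate's point estimate passes while its upper bound does not, so the sketch reports a data-scarcity explanation rather than an infeasible constraint.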
While incorporating such diagnostic capabilities into Theia is beyond the scope of this work, it is a promising future extension. In this paper, we focused on introducing the first classification algorithm capable of providing high-confidence guarantees on repercussion constraints using only historical data—thus solving a key open problem in the literature.
Can Theia handle objectives beyond expected repercussions?
This is a great question. Theia can be readily extended to enforce repercussion objectives beyond those involving expectations. Repercussion objectives can be defined over other statistics of the repercussion distribution to enforce high-confidence guarantees, e.g., on repercussion variance, median, or conditional value at risk. All that is required is a confidence interval for the statistic of interest. If the user wishes, e.g., to impose constraints on the CVaR of the repercussions, they could directly leverage the concentration inequality introduced by Brown (2007). Similarly, Chandak et al. [9] proposed methods for producing high-confidence bounds for various other distributional parameters, e.g., median and interquantile range. These can be readily integrated into Theia to support the extensions suggested by the reviewer.
Brown, D. B. Large deviations bounds for estimating conditional value-at-risk. Operations Research Letters, 2007.
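To make the tail statistic concrete, the sketch below computes a simple empirical CVaR of a sample of repercussions. It is only a point estimate under stated assumptions: larger values are treated as worse, the function name is hypothetical, and no confidence bound or importance weighting is included, so it does not by itself provide the high-confidence guarantee discussed above.

```python
import math

def empirical_cvar(repercussions, alpha):
    """Mean of the worst alpha-fraction of outcomes (larger = worse here)."""
    ordered = sorted(repercussions, reverse=True)
    # Number of tail samples; the small offset guards against float round-up.
    k = max(1, math.ceil(alpha * len(ordered) - 1e-12))
    return sum(ordered[:k]) / k

tail_mean = empirical_cvar(range(1, 11), 0.2)
```

With ten outcomes and a tail level of 0.2, the estimate averages the two largest values.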
We hope these clarifications address your concerns. If so, we would sincerely appreciate it if you would consider revisiting your score. If anything remains unclear, we welcome further feedback and would be happy to continue the discussion. Thank you again for your time and engagement.
I thank the authors for the detailed rebuttal. I decided to slightly improve my rating.
This paper proposed THEIA, a new classification algorithm that seeks to manage the real-world consequences of deploying machine learning models. Rather than relying on causal models or predefined analytical relationships between predictions and their effects, THEIA uses observational data from a previously deployed classifier to ensure, with high probability, that user-specified constraints on repercussions are satisfied. The authors provide formal guarantees and evaluate the approach in two real-world scenarios involving employment outcomes and lending decisions.
Strengths and Weaknesses
- The paper addresses an increasingly important problem, how to reason about the real-world effects of ML models.
- THEIA operates without requiring access to causal relationships.
- The algorithm includes strong theoretical backing, including high-confidence guarantees and consistency under mild assumptions.
- The method fundamentally relies on observational associations, and cannot distinguish causation from correlation. This limits its ability to draw meaningful conclusions when confounding is present.
- There is no mechanism for detecting whether the estimated repercussions actually reflect causal effects, which may be problematic in practice.
- THEIA assumes that consequences under a new policy can be estimated using data collected from a previous policy—this stationarity assumption may not hold.
Questions
While THEIA offers a valuable framework for making conservative decisions based on observational data, its reliance on correlational estimates inherently limits its scope. In particular, it cannot provide causal guarantees or account for unobserved confounding. There are well-known cases where results based solely on correlation are indistinguishable from misleading or spurious ones. It would be helpful for the authors to comment on how THEIA handles such scenarios, or whether safeguards could be incorporated to detect when observational estimates may be unreliable.
Limitations
yes
Final Justification
My concerns remain and score stays the same. After reading the rebuttals, it appears that the model heavily depends on the underlying unknown causal models. If the correlations perfectly account for the confounders, then the model may perform well; otherwise, the reliability of the model is questionable.
Formatting Issues
No
We thank the reviewer for their time and thoughtful feedback. We appreciate their recognition of the novelty and importance of our work, which addresses a key gap in the ML literature and solves an important open problem in the field. We also appreciate their positive comments on the strength and technical soundness of our theoretical contributions.
Below, we address the reviewer’s questions and comments.
Is Theia affected by unobserved confounders?
As pointed out by the reviewer, unobserved confounding—cases where unobserved variables could simultaneously influence both predictions and repercussions—could affect the conclusions drawn by Theia. For example, an unobserved variable that simultaneously caused predicted probabilities and associated repercussions to increase could lead Theia to believe that higher predictions caused higher repercussions.
In the specific setting we study, however, this concern does not arise due to two key assumptions:
- (1) Since we are addressing a classic classification problem, predictions depend only on the feature vector, which is observed and available by definition since it serves as input to the prediction model (e.g., a neural network). There are no additional unobserved variables influencing the model's output beyond its input variables.
- (2) The behavior model, which produces the prediction probabilities, is known (as stated on line 71) since it is the model Theia aims to improve upon.
Under these conditions, no unobserved variables jointly influence both and the repercussions —and so there is no confounding along the prediction pathway. In other words, although and may be statistically associated (since both and influence ), is not a confounder since it does not influence . All non-causal paths from to remain blocked by the collider at .
That said, the reviewer is correct in noting that Theia provides high-confidence guarantees under these assumptions, but not causal guarantees. Causal guarantees would require access to a full causal model—which we assume is unavailable. In fact, our primary motivation is to develop an algorithm that offers meaningful, high-confidence guarantees about repercussions precisely in the absence of such models.
We will extend Section 2 (Problem Formulation) to discuss the points above and further clarify how potential issues related to unobserved confounders are avoided in our setting.
How do unobserved variables impact Theia?
Unobserved factors that influence repercussions act as noise rather than confounders: in particular, they increase the variance of observed repercussions. Theia's high-confidence bounds are designed to account for this. If the data is too noisy to satisfy the user-defined constraints at the user-specified confidence level, the algorithm returns No Solution Found. This mechanism prevents Theia from returning a model it does not trust, thus acting as a safeguard against noisy observations.
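The safeguard described above can be sketched as a simple safety test. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names, the `threshold` parameter, and the use of a normal-approximation upper bound (which holds only approximately, unlike an exact concentration inequality) are choices made here for brevity. The inputs are assumed to be per-sample estimates of a candidate model's repercussion.

```python
import math
from statistics import NormalDist, mean, stdev

NO_SOLUTION_FOUND = "No Solution Found"


def approx_upper_bound(samples, delta):
    """Approximate (1 - delta)-confidence upper bound on the mean of samples.

    Uses a normal approximation; this is an assumption of the sketch.
    """
    n = len(samples)
    z = NormalDist().inv_cdf(1.0 - delta)
    return mean(samples) + z * stdev(samples) / math.sqrt(n)


def safety_test(candidate, samples, delta, threshold):
    """Return the candidate only if the bound certifies the constraint."""
    if approx_upper_bound(samples, delta) <= threshold:
        return candidate
    return NO_SOLUTION_FOUND


# Two hypothetical data sets with the same mean (0.05) but different noise.
low_noise = [0.0, 0.1] * 50
high_noise = [-2.0, 2.1] * 50

print(safety_test("candidate", low_noise, delta=0.05, threshold=0.1))
print(safety_test("candidate", high_noise, delta=0.05, threshold=0.1))
```

Both data sets have the same average repercussion, but only the low-noise one yields a bound below the threshold. The high-noise one widens the bound past the threshold, so the test declines to return the candidate.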
Beyond introducing and formally characterizing the mechanism that safeguards Theia against the effects of unobserved variables, we also empirically evaluated Theia's behavior in such scenarios (Research Question 3). Specifically, we evaluated Theia in challenging settings where predictions had limited influence on repercussions compared to unobserved external factors. Figure 3 shows that Theia returns No Solution Found more frequently as the influence of predictions on repercussions approaches zero, which corresponds to the scenario described above: one where unobserved factors make repercussions too noisy, leading to higher uncertainty in the estimated repercussions of a candidate model.
These empirical results support Theia’s robustness in handling scenarios with unreliable repercussion observations and its ability to operate effectively even when noise arises due to unobserved factors.
How does Theia perform when repercussions are non-stationary?
Designing ML systems that are robust to sudden shifts in dynamics is indeed important, as acknowledged in our Conclusions. However, addressing non-stationarity is not the focus of this paper. Prior to our work, no methods existed that could satisfy repercussion constraints with high confidence without access to models—even in stationary cases. As shown in our experiments, state-of-the-art competitors consistently fail to tackle such cases, while Theia always succeeds at identifying solutions satisfying all relevant constraints.
The reviewer correctly points out that in some problems, stationarity may not hold exactly. That said, many real-world problems of interest are either stationary or stable enough for Theia’s assumptions to be valid and its guarantees useful. Our case studies—based on real-world data from the U.S. foster care system and bank lending decisions—illustrate exactly the kind of high-stakes applications where Theia performs effectively and offers meaningful guarantees. Providing formal high-confidence guarantees in these settings, without relying on models, had remained an open problem in the literature, which we solved.
That said, for completeness, we evaluated Theia's performance under non-stationary conditions. Appendix H (lines 913–919) describes a challenging experiment where distribution shift alters the relationship between predictions and repercussions, modeling, for instance, how societal changes may cause lending decisions to affect individuals differently over time. Figures 4 and 5 show that in this setting, although no models are available, Theia remains robust and capable of producing repercussion-aware models when standard stationarity assumptions are relaxed.
The above-mentioned empirical analyses of Theia in non-stationary scenarios are presented in the Appendix, since addressing such settings was not the primary focus of our paper. We appreciate the reviewer’s question and will update Section 5 to mention this point and direct readers to the relevant Appendix section for a more detailed discussion and accompanying experimental results.
We hope the comments and discussion above address your main questions. If so, we would sincerely appreciate your consideration in revisiting your score. If any concerns remain, or if there are additional points you believe we should address to strengthen the work, we would be happy to do so. Thank you again for your thoughtful and constructive review.
Thanks for the authors' clarification. However, my concerns remain. After reading the rebuttals, it appears that the model depends heavily on the underlying unknown causal models. If the correlations perfectly account for the confounders, then the model may perform well; otherwise, the reliability of the model is questionable.
Meta-review forthcoming (in touch with SAC).