PaperHub
Rating: 4.0/10 (withdrawn)
4 reviewers · scores: 3, 3, 5, 5 (min 3, max 5, std. dev. 1.0)
Average confidence: 3.8
ICLR 2024

Robust Algorithmic Recourse Design Under Model Shifts

OpenReview · PDF
Submitted: 2023-09-22 · Updated: 2024-03-26

Abstract

Keywords
Algorithmic recourse, Model shifts, Robustness, Uncertainty quantification

Reviews and Discussion

Review (Rating: 3)

Algorithmic recourse, which suggests actions to alter unfavorable outcomes in decision-making systems, faces challenges due to model updates. A new uncertainty quantification method is introduced to assess recourse robustness against model shifts, offering a theoretical upper bound for recourse invalidation. Additionally, a novel framework helps users balance implementation cost and robustness by generating model-agnostic recourses based on the derived invalidation rate bounds, with promising results on various datasets.

Strengths

  1. The research problem makes sense in realistic settings.
  2. The upper bound is agnostic to the prior distribution (e.g., Gaussian), and thus offers a more flexible approach.

Weaknesses

  1. My main concern is whether the studied problem is sufficiently novel. I admit that designing robust post-hoc explanations is important for dynamic decision systems. However, this topic is not itself your contribution, as similar research has appeared at top conferences over the past two years.
  2. Hence, since your contribution offers a more cost-efficient or flexible approach, I think the novelty or importance of your work is not well presented. As you noted, current robust CF methods cannot handle efficient cost and robustness at the same time. However, I do not see any theoretical analysis in your method section regarding this claim, as your method section focuses on deriving the bound. Meanwhile, I think it is important to theoretically clarify how your method achieves more efficient and robust CFs than previous methods.
  3. Besides, I also have some minor questions.
  • Your robustness rests on the conformity score function, which captures the variation of the predictors. However, the word robustness usually refers to variation in the underlying distribution, or data. Although the variation of predicted distributions can be regarded as another measure of model shift, it is not clear whether the underlying data shifts or just the model shifts (e.g., new fine-tuning approaches).
  • Another direction for your paper is to consider a more sophisticated description of how the data changes, i.e., strategic adaptation driven by the population's incentives and the resulting predictors. I really think that considering model shift or data shift in general is meaningless, especially in the era of LLMs. Well pre-trained models with large amounts of data can easily beat tricks aimed at distribution shift or model generalization. Hence, only specific topics on model shift or data shift, i.e., strategic adaptation, are still meaningful.

Questions

As seen in Weaknesses.

Comment

Reply for weaknesses 1 and 2:

  1. Novelty and Importance: We appreciate your concern about the novelty and importance of our work, and we would like to emphasize several key points to clarify the significance of our contribution. Firstly, our research uniquely addresses the challenge of modeling recourse invalidation rates under potential model shifts in dynamic decision systems. This specific focus, along with our comprehensive analysis, differentiates our work from prior studies in the field. Secondly, we derive theoretical upper bounds on the recourse invalidation rate. Our approach, tailored to model updates, stands out for its cost-efficiency compared to other methods. Importantly, our paper introduces a novel framework that empowers users to set their desired recourse invalidation rate. This user-centric design allows for the creation of tailored recourses that meet specific invalidation criteria, addressing a key aspect of efficiency in recourse design.

  2. Theoretical Analysis: We appreciate the reviewer's suggestion for additional theoretical clarification regarding efficiency and robustness. Here is how our paper addresses these aspects:

    (1) Robustness with Model Shifts: Throughout the paper, especially in Section 4, we derive theoretical results that specifically characterize the recourse invalidation rate under potential model shifts. This analysis offers valuable insights into how our method is designed to maintain robustness when confronted with changes in the underlying model, effectively addressing the reviewer's concern about robustness.

    (2) Efficiency Considerations: While our primary focus in this paper is on robustness, we acknowledge the importance of efficiency. We provide a foundation for exploring trade-offs between cost-efficiency and robustness within the context of dynamic decision systems. For the presented work, the efficiency of our proposed algorithm is characterized by its ability to allow users to specify the recourse invalidation rate and to design recourses that achieve this rate. This flexibility and adaptability in recourse design represent a significant advancement in the field. Although we do not extensively delve into efficiency concerns theoretically, we recognize it as a crucial area for future research and intend to investigate it further.

Reply for weakness 3: Our definition of robustness is focused on the variation of predictors. This is based on the premise that in dynamic decision systems, model shifts (such as updates or changes in data distributions) are common. Our approach quantifies robustness by measuring how well recourses adapt to these shifts in predictors. We believe this is a crucial aspect of predictive modeling, especially given the prevalence of model updates in practice.

We appreciate the reviewer's suggestions to consider a more sophisticated description of how the data changes and to bring LLMs into the discussion. These considerations are very valuable and will be among our future research directions.

Comment

Dear reviewer LkRN,

Thank you for dedicating your time to reviewing our manuscript. Your insights have been invaluable, and we've crafted a detailed response to address the points you raised and polish the paper accordingly. We're confident that our responses and modifications have effectively addressed your concerns. With these changes in mind, we hope you might reconsider and adjust your evaluation of our work.

Should there be any remaining issues, we're more than ready to provide further clarifications. As the rebuttal phase is nearing its deadline, we're looking forward to engaging in a timely discussion. Thank you again for your expertise and guidance!

Review (Rating: 3)

The paper aims to address the inconsistency in recourse generation: as predictive models often change over time, there is a concern about the reliability and robustness of the generated recommendations. The paper addresses this issue by proposing a method to generate model-agnostic recourses that are both robust to model shifts and have lower implementation costs. The challenges of measuring the robustness of a recourse under a shifted model and accommodating users' different tolerance levels for recourse invalidation are discussed. The paper introduces the concept of recourse invalidation rate and utilizes conformal predictive inference techniques to bound the invalidation rate. An extended alternating direction method of multipliers (ADMM) approach is proposed to efficiently find the minimal cost recourse that satisfies the invalidation rate constraint.

Strengths

  1. This paper studies an important problem in the literature, and it potentially has high impact if carried out properly.

Weaknesses

  1. The authors do not clearly state the contributions of this paper in the introduction.
  2. The authors do not provide a justification for why their method works. In many cases, the model parameters shift because the underlying distribution of the dataset has changed. It is thus unclear why having access to the $D_{\text{train}}$ and $D_{\text{calib}}$ sets can solve this distributional shift problem when these two sets have no predictive power over future distributions.
  3. The authors do not illustrate whether the bounds $\hat{L}$ and $\hat{U}$ are practically sensible. I guess that $\hat{L}$ is trivial (close to 1) and $\hat{U}$ is conservative (much bigger than 1). I am not convinced that the construction of these values is informative and effective at dealing with model shifts.
  4. The authors should discuss the privacy concerns of their method because their approach requires access to the dataset.
  5. Section 4 is confusing to read. It contains mostly formulas with ad-hoc definitions of mathematical terms and provides no discussion to offer insight/intuition.
  6. The numerical experiments do not show strong dominance against DiRRAc.

Questions

My feeling is that the authors are over-complicating the problem. How about a simpler approach as follows: Step 1: Get all empirical values of the score. Step 2: Construct a kernel density estimator. Step 3: Formulate a robust version by perturbing the kernel density using some divergence or the total variation.

I guess that the above three steps can generate the same effect as what the authors want to do in this paper. But it is more interpretable and easier to understand.
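The reviewer's three-step procedure might look as follows in Python. This is only a sketch: the Beta-distributed score sample, the grid resolution, and the perturbation radius `eps` are illustrative placeholders, not values from the paper, and the total-variation-style perturbation is one of several possible choices for Step 3.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Step 1: collect empirical values of the nonconformity score
# (a synthetic Beta sample stands in for real scores here).
scores = rng.beta(2.0, 5.0, size=500)

# Step 2: construct a kernel density estimator over the scores.
kde = gaussian_kde(scores)  # Gaussian kernel, Scott's-rule bandwidth

# Step 3: form a crude robust variant by perturbing the density within
# a total-variation-style ball of radius eps and renormalizing.
grid = np.linspace(0.0, 1.0, 201)
dx = grid[1] - grid[0]
density = kde(grid)
density /= density.sum() * dx  # renormalize on [0, 1]

eps = 0.05
worst_case = np.clip(density - eps, 0.0, None)  # adversarially remove mass
worst_case /= worst_case.sum() * dx             # renormalize
```

As the authors note in their reply, the hard part hidden by this sketch is choosing the bandwidth and the divergence so that the perturbation actually reflects plausible model shifts.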

Comment

Reply for weakness 1: The main contributions of this paper are summarized as:

(1) Characterization of Recourse Invalidation Rate: We introduce a novel approach to characterize the recourse invalidation rate in the context of potential model shifts. This characterization is crucial in dynamic systems where models are frequently updated or where data distributions change over time.

(2) Model-Agnostic Recourse Framework: We develop a model-agnostic framework for generating recourses. This framework allows for recourses that are adaptable to changes in the model, thereby maintaining their effectiveness over time.

(3) Recourse Invalidation Estimation: In Section 4, we propose a method for uncertainty quantification that calculates a theoretical upper bound for the recourse invalidation rate. This method is applicable to any counterfactual plan and prediction model, enhancing its utility across various scenarios.

(4) Cost-Efficiency and Robustness Trade-off: In Section 5, we tackle the crucial trade-off between cost-efficiency and robustness in the design of recourses. Utilizing the derived bounds on the recourse invalidation rate, our method empowers users to tailor this trade-off to meet their specific needs.

(5) Practical Implications: We present numerical results using multiple datasets to validate the effectiveness of our theoretical bounds and the efficiency of our proposed algorithms. These results underscore the practical applicability of our research in real-world settings.

We have included this summary of contributions in the introduction section of our revised paper to provide a clear overview of our research's impact and significance in the field.

Reply for weakness 2: We acknowledge the reviewer's point that simply having access to the training dataset $D_{\text{train}}$ and calibration dataset $D_{\text{calib}}$ does not inherently solve the problem of distributional shifts in the data. However, they are instrumental in our approach for quantifying the effects of model shifts on the validity of algorithmic recourse. Specifically, the theoretical bounds on recourse invalidation rates in our model are based on the assumption that the difference in the scoring function values is bounded before and after the model shifts. To convert this perturbation into quantifiable recourse invalidation rates, $D_{\text{train}}$ is utilized to derive upper and lower bounds on the likelihood ratio between $P$ and $P'$. The calibration dataset $D_{\text{calib}}$ is then employed to assess the nonconformity of $s'(x^{\text{CF}}, 1)$ in the context of the updated model. This step is crucial to understand how $x^{\text{CF}}$ with a positive result in the updated model may appear anomalous compared to the patterns in $D_{\text{calib}}$. Through this process, we can establish both lower and upper bounds on the coverage probability of $s'(x^{\text{CF}}, y^{\text{CF}})$. These bounds are integral to determining the theoretical limits on the recourse invalidation rate under model shifts.

Reply for weakness 3: Recall that $P$ and $P'$ represent the distributions of the nonconformity random variable $S$, which takes values between 0 and 1. The nonconformity score function value is either $h(x)$ or $1 - h(x)$, resulting in a concentrated distribution of the random variable $S$ for a given dataset. The concentration of the density of $S$ depends on the model's predictive performance on the specific dataset. In regions where the density is sparse, the value of $\hat{L}(s)$ approaches 1. Conversely, in areas where the density is high, employing $\hat{L}(s) = \frac{\epsilon}{\hat{p}(s)}$ with the empirical probability mass function (PMF) $\hat{p}$ might result in a significantly smaller value of $\hat{L}(s)$. For $\hat{U}(s)$, given that the value of $\tau$ is small, its value exceeds 1 but remains within a reasonable range. Furthermore, in numerical examples, to mitigate extreme values of $\hat{L}$, we don't examine each possible value of $s$ in isolation. Instead, we divide the range $[0, 1]$ into multiple bins, calculate the probability mass in each bin, and use this as $\hat{p}$ for calculating $\hat{L}$ and $\hat{U}$. This approach ensures that the ratio between $\hat{U}$ and $\hat{L}$ typically remains below 10 for all $s$, making these bounds both practical and informative. The construction details of $\hat{L}$ and $\hat{U}$ have been moved from the main text to Appendix B. Additionally, the discussion above has been included in the appendix for comprehensive reference.
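A minimal sketch of the binning just described, assuming $\hat{L}(s) = \epsilon/\hat{p}(s)$ as stated in the reply; `eps` and the bin count are illustrative placeholders, and the exact constructions of $\hat{L}$ and $\hat{U}$ are those in the paper's Appendix B, not this code.

```python
import numpy as np

def binned_L_hat(scores, eps=0.05, n_bins=20):
    """Bin nonconformity scores on [0, 1], estimate the per-bin PMF p_hat,
    and form the point-wise bound L_hat(s) = eps / p_hat(s), capped at 1.
    Empty (sparse) bins get L_hat = 1, matching the behavior described above."""
    counts, _ = np.histogram(np.asarray(scores), bins=n_bins, range=(0.0, 1.0))
    p_hat = counts / max(counts.sum(), 1)

    L_hat = np.ones(n_bins)          # sparse/empty bins: L_hat -> 1
    mask = p_hat > 0
    L_hat[mask] = np.minimum(eps / p_hat[mask], 1.0)
    return p_hat, L_hat
```

Binning trades point-wise resolution in $s$ for stability of the empirical PMF, which is what keeps the $\hat{U}/\hat{L}$ ratio moderate in practice.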

Comment

Reply for weakness 4: In our approach, we utilize two datasets: $D_{\text{train}}$ and $D_{\text{calib}}$. It's important to note that both of these datasets are typically collected before the model is deployed in a real-world scenario. These datasets are an integral part of the standard machine learning pipeline. Given that the data required for deriving the upper bound on the recourse invalidation rate is essentially the same data used for model training, we believe that our approach does not introduce additional privacy concerns beyond those already associated with typical machine learning practices. As such, we did not explicitly discuss privacy concerns in our paper. However, we understand the importance of privacy in machine learning applications. If you have any specific privacy-related questions or concerns about our method that you would like us to address, please feel free to provide more details, and we would be happy to provide further clarification.

Reply for weakness 5: Section 4 is mainly about the theoretical analysis on the recourse invalidation rate. We have worked to enhance the coherence of the section by providing clearer connections between each lemma, proposition, and theorem. We have also revised the writing of Section 4 to make the concepts more understandable by the readers.

Reply for weakness 6: We appreciate the opportunity to discuss the performance of our proposed method, PiRR, in comparison to DiRRAc. In our numerical experiments, PiRR demonstrates notable improvements over DiRRAc in certain instances. For example, on the Criminal Justice dataset, PiRR (with a parameter setting of 0.05) achieves a significantly lower average cost than DiRRAc. Moreover, in the Student Performance dataset, while PiRR and DiRRAc show similar mean recourse invalidation rates, PiRR (0.05) presents a substantially smaller variance in the invalidation rate, indicating more consistent performance. Additionally, it achieves this with a lower average cost, suggesting an overall better trade-off between effectiveness and efficiency.

Moreover, it's crucial to note that the primary contribution of our paper extends beyond designing a new algorithm for generating robust recourses. We aim to address the critical real-world problem of characterizing the recourse invalidation rate under potential model shifts. This issue is important in practical applications where model updates and distributional shifts are common. As such, this paper provides valuable insights into the behavior of recourse methods in dynamic settings, which goes beyond the scope of algorithmic performance comparisons. Furthermore, we acknowledge that there is room for improvement in designing robust recourses by potentially combining our method with other techniques in the recourse design stage. This flexibility allows us to explore opportunities for achieving even better performance in real-world scenarios.

Reply for questions: We appreciate the reviewer's suggestion for a seemingly simpler three-step approach involving empirical score values, kernel density estimators, and robust formulation via perturbation. However, there are several considerations that make this approach challenging in the context of our research. Firstly, careful consideration is required in constructing a kernel density estimator, particularly in terms of selecting the right kernel and bandwidth. These choices can have substantial effects on the performance and interpretability of the estimator. Secondly, the selection of an appropriate divergence metric for perturbing the kernel density estimator is crucial and non-trivial. This choice significantly influences the resultant robustness and associated costs of the recourses. For model changes, characterizing an appropriate divergence or variation metric that accurately reflects the nature of the model shifts is complex. Furthermore, our work primarily addresses the intricacies associated with model shifts. These shifts can be multifaceted and may not be adequately captured by straightforward perturbations in a kernel density estimator. While we acknowledge the potential benefits of a more interpretable and straightforward approach as you suggested, our method was designed to specifically address the nuanced challenges presented by model shifts. Nevertheless, we find the proposed three-step procedure intriguing and see it as a valuable avenue for future research. Exploring this method could potentially complement our current work and contribute to the broader understanding of robust recourse design under model shifts.

Comment

Dear reviewer rDRT,

Thank you for dedicating your time to reviewing our manuscript. Your insights have been invaluable, and we've crafted a detailed response to address the points you raised and polish the paper accordingly. We're confident that our responses and modifications have effectively addressed your concerns. With these changes in mind, we hope you might reconsider and adjust your evaluation of our work.

Should there be any remaining issues, we're more than ready to provide further clarifications. As the rebuttal phase is nearing its deadline, we're looking forward to engaging in a timely discussion. Thank you again for your expertise and guidance!

Review (Rating: 5)

Algorithmic recourse is the process of offering users recommendations for actions they can take to receive a positive classification from a machine learning model after they have received a negative classification. The focus of this work is on designing robust recourses, i.e. recommendations for improvement which are robust to shifts in the underlying predictive model being used to make decisions. The authors propose an uncertainty quantification method to upper-bound the invalidation rate (i.e. the probability the offered recourse is no longer valid under model shift) for a given pre-computed recourse.

Using their proposed uncertainty quantification method (recourse invalidation rate), the authors design an algorithm to manage the trade-off between the cost of recourse (e.g. the "effort" required for a user to achieve this recourse) and the probability that the given recourse will be invalid under model shift. In particular for a user-specified invalidation rate, the algorithm aims to return a recourse with minimum cost such that the probability of the recourse being invalid is at most the user's given rate. The authors formulate this objective as a (non-convex) optimization problem, and numerically solve it using gradient-free optimization methods.

The authors empirically validate their recourse invalidation bounds and find that the average empirical bounds are always tighter than their theoretical counterparts, sometimes by a significant margin. They also empirically evaluate the performance of their proposed algorithm and find that it generates more robust recourse solutions which are often easier to implement when compared to existing methods for recourse.

Strengths

While the authors are not the first to study the robust algorithmic recourse problem, their results are novel within this space (to the best of my knowledge). In particular, their proposed uncertainty quantification method (recourse invalidation rate) is an intuitive measure of uncertainty under model shifts, which does not require distributional assumptions on the feature space.

The numerical results are also a welcome addition to the submission. In particular, the comparison between the empirical and theoretical bounds was helpful for gauging how tight the bounds of Theorem 12 and Theorem 13 are. Additionally, the numerical comparison between Algorithm 2 and several baselines in real-world data showed that the authors' proposed method often outperforms the relevant baselines.

Weaknesses

I found the writing of this submission somewhat challenging to understand. In particular, Section 4 (Recourse Invalidation Estimation for a Given Recourse) reads as a laundry list of propositions, theorems, and lemmas with various definitions sprinkled in between. It was unclear to me which parts are the salient features, and which parts are there to aid the understanding of more important results. I also found the inclusion of comparisons to previous work in this section odd, as Section 2 (Background and Related Work) appears to be a more appropriate place for such comparisons. If the submission is accepted, I encourage the authors to rewrite Section 4 in a way that is more concise and highlights the important results.

Another weakness of the submission is the lack of any theoretical performance guarantees for Algorithm 2. (In fact, even the algorithm's runtime is not specified.) Even if theoretical performance guarantees cannot be obtained due to the non-convexity of the problem domain, it would have been nice to see a discussion on what exactly makes obtaining performance guarantees in this setting challenging.

Finally, while not a major weakness, the authors' theoretical upper bounds on recourse invalidation are often significantly loose when compared to their empirical counterparts.

Questions

What is the runtime of Algorithm 1?

What is the runtime of Algorithm 2?

Does Algorithm 2 enjoy any non-trivial performance guarantees? In other words, is it possible to say something about the cost of the solution returned by Algorithm 2?

Comment

Reply for weakness 1: We have revised the writing of Section 4 to make the concepts more understandable to readers. In Section 4, since we propose an assumption and derive theoretical results based on it, we compare the applicable scenarios of the proposed methods and also compare the bounds with other related methods. However, it is also a good idea to include these comparisons in Section 2.

Reply for weakness 2: We appreciate the reviewer's concern regarding the absence of theoretical performance guarantees and specified runtime for Algorithm 2. The minimization problem defined in (1) is inherently non-convex with multiple constraints, which presents significant challenges in solving it exactly. Non-convex problems, by their nature, are known for potential local minima, making it difficult to guarantee global optimality. In the paper, we transform the initial optimization problem to a format represented by (2), which seems more tractable and inspires the use of the Alternating Direction Method of Multipliers (ADMM) for solving the optimization problem. While ADMM is a powerful tool for certain types of optimization problems, demonstrating its convergence for multi-block problems, like the one we address, is exceptionally challenging. In the ADMM literature, although convergence rates are characterized for simpler structured problems, the generalization to more complex, multi-variable problems introduces significant complications. Not only is proving convergence itself difficult, but also characterizing the convergence rate in such scenarios is even more daunting.

Reply for question 1: The runtime of Algorithm 1 in our study is closely tied to the computational complexity of the processes involved, which depends on the number of samples in the training and calibration datasets. Suppose that there are $N$ samples in $D_{\text{train}}$ and $n$ samples in $D_{\text{calib}}$. Initially, Algorithm 1 iterates over all $N$ samples in $D_{\text{train}}$ to construct the point-wise bounds $\hat{L}(\cdot)$ and $\hat{U}(\cdot)$. For each of the $n$ samples in $D_{\text{calib}}$, the algorithm computes values $S_i, L_i, U_i$. For $k$ from $1$ to $n$, the algorithm performs computations for $\hat{F}(k)$. Considering these steps, the total computational complexity of Algorithm 1 can be approximated as $O(N + n)$. This linear complexity suggests that the runtime scales proportionally with the total number of samples in the datasets. In practical terms, the runtime typically spans several minutes, depending on the specific sizes of $N$ and $n$.
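The linear-pass structure described in this reply can be sketched as follows. Only the pass structure and the $O(N + n)$ accounting reflect the reply; the bound formulas and the $\hat{F}(k)$ aggregation are illustrative placeholders, not the paper's actual constructions.

```python
import numpy as np

def algorithm1_sketch(train_scores, calib_scores, eps=0.05, n_bins=20):
    """Structural sketch of Algorithm 1's linear passes. The bound and
    F_hat formulas below are placeholders, not the paper's exact ones."""
    N, n = len(train_scores), len(calib_scores)

    # One O(N) pass over D_train: build point-wise bounds L_hat, U_hat.
    counts, _ = np.histogram(train_scores, bins=n_bins, range=(0.0, 1.0))
    p_hat = counts / max(N, 1)

    def L_hat(s):  # placeholder point-wise lower bound
        b = min(int(s * n_bins), n_bins - 1)
        return min(eps / p_hat[b], 1.0) if p_hat[b] > 0 else 1.0

    def U_hat(s):  # placeholder point-wise upper bound
        return 1.0 / max(L_hat(s), eps)

    # One O(n) pass over D_calib: compute (S_i, L_i, U_i) per sample.
    S = np.asarray(calib_scores)
    L = np.array([L_hat(s) for s in S])
    U = np.array([U_hat(s) for s in S])

    # One O(n) pass for k = 1..n: accumulate F_hat(k)
    # (placeholder: a running average of the L_i values).
    F_hat = np.cumsum(L) / np.arange(1, n + 1)
    return L, U, F_hat  # total work: O(N + n)
```

Each dataset is touched a constant number of times, which is what yields the linear scaling claimed above.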

Reply for question 2: The runtime of Algorithm 2 is influenced by several factors due to the nature of ADMM, which updates one variable at a time while keeping others constant. These factors include: initial point, dataset structure, penalty constant, model and model shift variables, etc. In particular, the convergence speed of ADMM heavily relies on the initial point. A starting point that is closer to the optimal solution can significantly reduce the convergence time, thus shortening the overall runtime of the algorithm. Moreover, the choice of the penalty constant in ADMM is crucial. An optimal penalty constant can facilitate quicker convergence, while other choices may lead to slower convergence rates. Furthermore, the variables associated with the predictive model and model shifts, such as the type of predictive model and parameters defining the model shifts, affect the algorithm's efficiency. These elements determine the complexity of the problem that ADMM needs to navigate, impacting the runtime. In our experimental observations, the runtime for Algorithm 2 varied between 5 to 60 minutes.

Reply for question 3: Please refer to our response for weakness 2.

Comment

Dear reviewer NqmB,

Thank you for dedicating your time to reviewing our manuscript. Your insights have been invaluable, and we've crafted a detailed response to address the points you raised and polish the paper accordingly. We're confident that our responses and modifications have effectively addressed your concerns. With these changes in mind, we hope you might reconsider and adjust your evaluation of our work.

Should there be any remaining issues, we're more than ready to provide further clarifications. As the rebuttal phase is nearing its deadline, we're looking forward to engaging in a timely discussion. Thank you again for your expertise and guidance!

Comment

Thanks for your reply. I have read your response and I maintain my original score.

Review (Rating: 5)

The paper proposes a new algorithmic recourse method aiming for robustness against model shift. The method, named Probabilistic Invalidation-based Robust Recourse Generation (PiRR), is formulated with ideas from conformal prediction and solved with an extended ADMM. The paper also proposes a new metric called the invalidation rate, with two upper bounds. Three baseline recourse methods are then evaluated to demonstrate the power of these bounds. The effectiveness of PiRR is also compared against four other methods in terms of the new metric.

Strengths

  • The paper contributes a novel approach and formulation to the field.
  • Most of the paper is well-written, except for Section 4.

Weaknesses

  • The paper does not follow the conventional citation style.
  • The paper did not present an evaluation of PiRR in the traditional cost-validity trade-off performance. This makes it hard to put the method in context of the current landscape of the field.
  • Section 4 seems cramped and could be improved.

Questions

  • Assumption 3 seems too restrictive to be practical, since you need a bounded future perturbation. Can you elaborate on how your formulation is advantageous compared to previous works?
  • Can you clarify the utility of each upper bound presented in Theorems 12 and 13? Why are both presented and included in the evaluation?
  • Based on section 2, the method in (Nguyen et al., 2022) seems to be highly related to this paper. Why was it left out from further discussion?
  • The updated version of (Pawelczyk et al., 2022) is (Pawelczyk et al., 2023).
  • Some other robust recourse baselines are listed below.

References

Tuan-Duy Nguyen, Ngoc Bui, Duy Nguyen, Man-Chung Yue, and Viet Anh Nguyen. "Robust Bayesian Recourse." In Proc. Uncertainty in Artificial Intelligence, pp. 1498–1508, Eindhoven, Netherlands, Aug. 2022.

Martin Pawelczyk, Teresa Datta, Johannes van-den-Heuvel, Gjergji Kasneci, and Himabindu Lakkaraju. "Algorithmic Recourse in the Face of Noisy Human Responses." arXiv preprint arXiv:2203.06768, Oct. 2022.

Martin Pawelczyk, Teresa Datta, Johannes van-den-Heuvel, Gjergji Kasneci, and Himabindu Lakkaraju. "Probabilistically Robust Recourse: Navigating the Trade-offs between Costs and Robustness in Algorithmic Recourse." International Conference on Learning Representations, 2023.

Victor Guyomard, Françoise Fessant, Thomas Guyet, Tassadit Bouadi, and Alexandre Termier. "Generating Robust Counterfactual Explanations." ECML/PKDD, 2023.

Junqi Jiang, Jianglin Lan, Francesco Leofante, Antonio Rago, and Francesca Toni. "Provably Robust and Plausible Counterfactual Explanations for Neural Networks via Robust Optimisation." arXiv preprint arXiv:2309.12545, 2023.

Comment

Reply for question 1: Thank you for raising the question regarding Assumption 3 and its practical applicability. In our theoretical framework, Assumption 3 serves as a guideline rather than a strict requirement for all practical applications. Specifically, for a counterfactual example $x^{\text{CF}}$, it is not necessary that the difference between the scoring functions before and after a model shift is smaller than $\tau$ for every sample. Instead, the more practical condition we use is $\mathbb{P}\left(h'(x^{\text{CF}}) \geq \eta - \tau \mid h'(x^{\text{CF}}) \geq \eta\right) = 1$. This condition reflects a more realistic scenario where the bound $\tau$ serves as a probabilistic rather than an absolute constraint.

Our decision to introduce Assumption 3 is based on common occurrences in dynamic decision systems, such as model updates or changes in data distribution. We recognize that these shifts are typically not extreme, hence the theoretical assumption that perturbations are bounded by τ\tau.

While our theoretical model assumes bounded perturbations for simplicity and tractability, we acknowledge the complexities of real-world data and model shifts. In light of this, one of our key future research directions is to relax this assumption, allowing for a broader range of perturbations. By loosening this assumption, we aim to further bridge the gap between theoretical robustness and the practical variability encountered in real-world applications.

Reply for question 2: Theorem 12 (Theorem 11 in the revised paper) derives its upper bound from the inequality presented in Proposition 8 (Proposition 7 in the revised paper), which characterizes the lower bound of the coverage probability. This approach offers a simpler and computationally more efficient solution, as it involves only the computation of $\hat{L}(S)$. The simplicity and lower computational demand make Theorem 12 (Theorem 11 in the revised paper) particularly useful in scenarios where computational efficiency is a priority, even though it might offer a less tight bound than Theorem 13 (Theorem 12 in the revised paper).

Conversely, Theorem 13 (Theorem 12 in the revised paper) provides a tighter bound by integrating both inequalities from Propositions 8 and 10 (Propositions 7 and 9 in the revised paper), utilizing both the lower and upper bounds on the coverage probability. This yields a tighter upper bound on the recourse invalidation rate, but the increased precision comes at the cost of higher computational complexity, as it requires calculating both $\hat{L}(S)$ and $\hat{U}(S)$. Additionally, deriving $k^*$ from Proposition 8 (Proposition 7 in the revised paper) or $t^*$ from Proposition 10 (Proposition 9 in the revised paper) requires multiple evaluations of $\hat{F}(k)$ and $\hat{E}(t)$, respectively, for different values, roughly doubling the computational resources relative to Theorem 12 (Theorem 11 in the revised paper). Thus, Theorem 13 (Theorem 12 in the revised paper) is more appropriate when the accuracy of the bound is critical, despite the higher computational cost.
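The cost structure described above can be sketched in a few lines of Python. This is purely illustrative: `L_hat` and `U_hat` below are hypothetical placeholder statistics, not the paper's actual estimators, and only the number of statistic evaluations is meant to carry over:

```python
import math
import random

random.seed(1)
S = [random.random() for _ in range(500)]   # hypothetical calibration scores

calls = {"loose": 0, "tight": 0}            # count statistic evaluations

def L_hat(scores, tag):
    """Placeholder lower-bound statistic (not the paper's estimator)."""
    calls[tag] += 1
    m = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - m) ** 2 for s in scores) / len(scores))
    return m - sd / math.sqrt(len(scores))

def U_hat(scores, tag):
    """Placeholder upper-bound statistic (not the paper's estimator)."""
    calls[tag] += 1
    m = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - m) ** 2 for s in scores) / len(scores))
    return m + sd / math.sqrt(len(scores))

# Looser (Theorem 11-style) bound: a single statistic suffices.
loose_stat = L_hat(S, "loose")

# Tighter (Theorem 12-style) bound: both statistics are required,
# roughly doubling the cost; the search for k* or t* would add
# further repeated evaluations on top of this.
lower = L_hat(S, "tight")
upper = U_hat(S, "tight")
```

Running this shows the tighter route making twice as many statistic evaluations before any grid search is even started, which mirrors the trade-off argued above.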

In our numerical examples, we observed that the performance difference between the bounds derived from Theorems 12 and 13 (Theorems 11 and 12 in the revised paper) is relatively minor. Therefore, we have included results from both theorems to offer readers a choice. This inclusion allows practitioners and researchers to select the theorem that best aligns with their specific needs regarding accuracy and computational resources.

Reply for question 3: (Nguyen et al., 2022) introduces a unique Bayesian recourse framework that focuses on minimizing the odds ratio between posterior probabilities of different outcomes. This approach, termed Bayesian recourse, diverges from the more general setting of ensuring a positive prediction result for generated recourses. Moreover, their robustification process involves smoothing samples and solving a min-max optimization problem within a Wasserstein-Gaussian mixture conditional ambiguity set. This specific approach to robustness, centered around a minimax problem, is quite distinct from the methods employed in our work. Given the unique and non-generalizable aspects of Nguyen et al.'s definition of Bayesian recourse and their specific robustification technique, we chose not to include a comparison with their method in our initial paper. However, we acknowledge the relevance of their work and are open to incorporating a comparative analysis in the revised version of our paper.

Reply for question 4: Thank you for bringing the reference update to our attention. We have revised the reference in our paper accordingly.

Reply for question 5: Thanks for pointing out these robust recourse baselines. We have included them in our revised paper.

Comment

Reply for weakness 1: Thanks for pointing this out. We have updated the citing style in the revised paper.

Reply for weakness 2: We appreciate the reviewer's concern regarding the absence of a traditional cost-validity trade-off evaluation for PiRR in our paper. The main focus of our work is on analyzing the impact of potential model shifts and designing model-agnostic recourses that remain effective in the face of future model changes. In light of this primary objective, we chose to prioritize evaluation under model shift scenarios rather than the conventional cost-validity trade-off. Moreover, in conventional settings, many existing methods, including PiRR, achieve near-perfect performance, as indicated in the "Invalidation rate before shift" column of Tables 2 and 8; the differences between methods there are marginal, so an evaluation focused on these settings would not adequately highlight the unique strengths of PiRR. Our evaluation strategy was therefore specifically designed to emphasize PiRR's performance in response to updated models, which we consider the more critical and challenging aspect of recourse design in dynamic environments.

Reply for weakness 3: Thank you for your feedback on Section 4. We have worked to enhance the coherence of the section by providing clearer connections between each lemma, proposition, and theorem. We have also revised the writing of Section 4 to make the concepts more understandable by the readers.

Comment

Dear reviewer C8WR,

Thank you for dedicating your time to reviewing our manuscript. Your insights have been invaluable, and we have crafted a detailed response to address the points you raised and have polished the paper accordingly. We are confident that our responses and modifications effectively address your concerns. With these changes in mind, we hope you might reconsider and adjust your evaluation of our work.

Should there be any remaining issues, we're more than ready to provide further clarifications. As the rebuttal phase is nearing its deadline, we're looking forward to engaging in a timely discussion. Thank you again for your expertise and guidance!

Comment

I appreciate the authors' response. I have read the revision along with your answers and have decided to retain my score.