Off-Policy Selection for Initiating Human-Centric Experimental Design
Abstract

Reviews and Discussion
The paper presents the First-Glance Off-Policy Selection (FPS) framework, aimed at improving policy selection for human-centric systems (HCSs) such as education and healthcare by addressing participant heterogeneity. FPS groups participants with similar traits, augments each sub-group with trajectories generated by a variational auto-encoder (VAE), and selects the optimal policy for each sub-group based on an estimator of the policy-selection criterion. The proposed method is tested in a real-world intelligent education (IE) system and demonstrates significant improvements in learning outcomes by personalizing tutoring policies based on initial student states. The framework's effectiveness was also demonstrated in selecting optimal treatment policies for sepsis patients in a simulated healthcare environment.
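To make the selection logic described in this summary concrete, below is a minimal, self-contained sketch on synthetic data. The clustering method (KMeans), the mean-of-logged-returns scoring, and all variable names are illustrative stand-ins, not the paper's actual components (FPS uses its own partitioning objective, VAE-based trajectory augmentation, and a dedicated value estimator).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic log: 2-d initial state, randomly deployed policy id, observed return.
n = 600
initial_states = rng.normal(size=(n, 2))
policy_ids = rng.integers(0, 3, size=n)  # 3 candidate policies
returns = rng.normal(loc=policy_ids * (initial_states[:, 0] > 0), scale=0.5)

# 1) Partition participants using only their initial states.
M = 2
kmeans = KMeans(n_clusters=M, n_init=10, random_state=0).fit(initial_states)
groups = kmeans.labels_

# 2) Within each sub-group, score every candidate policy
#    (here by the mean logged return; FPS uses its own estimator on augmented data).
best_policy = {}
for g in range(M):
    in_g = groups == g
    scores = [returns[in_g & (policy_ids == p)].mean() for p in range(3)]
    best_policy[g] = int(np.argmax(scores))

# 3) A newly arriving participant is assigned the policy of the nearest sub-group,
#    based on nothing but their initial state.
new_state = np.array([[0.7, -0.1]])
assigned_group = int(kmeans.predict(new_state)[0])
print("assigned policy:", best_policy[assigned_group])
```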
Strengths
- The paper introduces a new framework that selects policies for each new participant joining the cohort based solely on the initial state.
- The proposed method has been tested in a real-world IE system with 1,288 students and in a simulated healthcare environment.
- Although each component of the proposed framework has been previously studied, the framework itself demonstrates advantages over baseline methods on the problems addressed.
Weaknesses
- The proposed method is applicable only to problems with a finite number of policies, as policy selection is based on evaluating each candidate target policy separately.
- The real-world experiment was not a randomized controlled trial that directly compared the proposed method against baseline methods. Instead, in the IE system, a policy was randomly assigned to each student. The authors then tracked the outcomes for students in each subgroup who were assigned the policy recommended by each method.
Questions
- What is the error bar in Figures 1 and 2? Is it the standard deviation, standard error, or confidence interval? If it is the standard error, the difference between the baseline and the proposed method is small compared to the variation within each method.
- In Figure 1(b), most methods overestimate the reward with the proposed estimator. Could the authors explain the reason for this? Is there a systematic bias in the estimator?
- In Figure 2, FPS and the best possible baseline combinations perform exactly the same in subgroups K1 and K2. Is this because the two methods assign the same policy to every student in these first two subgroups?
Limitations
The authors have addressed the limitations of their work.
We sincerely appreciate your time and effort in evaluating our work. Please find our point-by-point response below.
Q1. The proposed method is applicable only to problems with a finite number of policies, as policy selection is based on evaluating each candidate target policy separately.
R1. Good question. We focused on a common scenario in deploying RL policies to real human participants in practical human-centric systems: all deployed policies have to be highly regulated and scrutinized by departmental committees enforcing the guidelines for human-centric experiments, so the total number of deployed policies is generally limited [1-4].
Q2. The real-world experiment was not a randomized controlled trial that directly compared the proposed method against baseline methods. Instead, in the IE system, a policy was randomly assigned to each student. The authors then tracked the outcomes for students in each subgroup who were assigned the policy recommended by each method.
R2. Unfortunately, we had to follow pre-defined guidelines agreed upon with a departmental committee, requiring that each student be treated equally during the experiment (i.e., they either opted in to testing all the methods or none). As a result, we were not able to run a typical randomized controlled experiment. However, a chi-squared test was employed to check the relationship between policy assignment and sub-groups, and it showed that policy assignment across sub-groups was balanced, with no significant relationship (p-value = 0.479), as described in Appendix A.2.
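For reference, a balance check of this kind can be done with a chi-squared test of independence between sub-group membership and assigned policy; the sketch below uses a made-up contingency table purely for illustration, not the study's actual counts.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows = sub-groups, columns = randomly assigned policies (illustrative counts).
contingency = np.array([
    [52, 48, 50],
    [47, 55, 49],
    [51, 46, 53],
])
chi2, p_value, dof, _expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# A large p-value (e.g., > 0.05) means no detectable dependence between
# sub-group and policy assignment, i.e., the random assignment is balanced.
```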
Q3. What is the error bar in Figures 1 and 2? Is it the standard deviation, standard error, or confidence interval? If it is the standard error, the difference between the baseline and the proposed method is small compared to the variation within each method.
R3. The error bars represent the standard error. Though the difference may not be large for the potentially high-performing sub-groups, we observed that the baselines can even have a negative effect on some sub-groups, as shown in Figure 2, which is undesirable in human-centric experiments. In empirical human-centric scenarios, such as education, the behavior policy is generally highly regulated by departmental committees strictly following guidelines for human-centric experiments, so the target policy would not deviate dramatically from the behavior policy; this suggests the underlying assumption that the divergence between behavior and target policies is intrinsically bounded. We will clarify this in the camera-ready version, if accepted.
Q4. In Figure 1(b), most methods overestimate the reward with the proposed estimator. Could the authors explain the reason for this? Is there a systematic bias in the estimator?
R4. This is probably due to unobserved confounding in human-centric systems, which can lead methods that assume sequential ignorability for the behavior policy to over- or underestimate; similar findings were observed in [5-6].
Q5. In Figure 2, FPS and the best possible baseline combinations perform exactly the same in subgroups K1 and K2. Is this because the two methods assign the same policy to every student in these first two subgroups?
R5. Yes. Both FPS and the baselines selected the same policy for sub-groups K1 and K2.
We hope these answers address your concerns and show that our work solves a significant challenge in a satisfying manner. We are happy to answer any follow-up questions or hear any further comments.
References
[1] Mandel et al. Offline policy evaluation across representations with applications to educational games. AAMAS 2014.
[2] Gao et al. Offline policy evaluation for learning-based deep brain stimulation controllers. International Conference on Cyber-Physical Systems 2022.
[3] Zhou et al. Hierarchical reinforcement learning for pedagogical policy induction. International Conference on Artificial Intelligence in Education 2019.
[4] Abdelshiheed et al. Leveraging deep reinforcement learning for metacognitive interventions across intelligent tutoring systems. International Conference on Artificial Intelligence in Education 2023.
[5] Namkoong, et al. Off-policy policy evaluation for sequential decisions under unobserved confounding. NeurIPS 2020.
[6] Fu et al. Benchmarks for Deep Off-Policy Evaluation. ICLR 2021.
As we are entering the second half of the discussion period, should the reviewer have any follow-ups, we will do our best to address them in time. If satisfied, we would greatly appreciate the reviewer updating their review or acknowledging our responses. We sincerely thank the reviewer again for the effort devoted to the review process, allowing the work to be thoroughly evaluated and discussed.
Sincerely,
Authors of submission 2308
Thank you for your thoughtful and detailed response to my comments. I have no further questions at this time.
We sincerely thank the reviewer for the prompt response, as well as the patience and efforts dedicated to evaluating our work thoroughly.
This paper studies off-policy selection in healthcare settings where users are heterogeneous and in situations where new users can appear in the policy deployment phase. To deal with new participants, this paper proposes a two-stage evaluation procedure: (1) learning a partitioning function of users and (2) choosing different policies for each subgroup via OPE. The proposed method (sub-group partitioning) achieves better performance than random user partitioning.
Strengths
- Reasonable approach: Using similar features and trajectory augmentation sounds like a reasonable approach, given that the dataset is sparse and only initial states are available for some users.
- Simple and easy-to-implement algorithm: The proposed method is not too complicated, and it seems to be easily implementable in real-world applications.
- Ablations are informative: The ablations, especially FPS-noTA and FPS-P, are instructive in showing the benefits of TA and sub-group partitioning, respectively.
Weaknesses
- Feasibility of sub-group partitioning: Section 2.2 (Definition 2.3) states the objective function of sub-group partitioning, and it indicates that the partitioning process requires knowledge about V^{\pi} for every candidate policy. In my understanding, this procedure itself requires OPE. However, looking at Algorithm 1, it seems the algorithm determines partitioning before applying OPE. How is this sub-group partitioning actually conducted? If using OPE, I believe this partitioning procedure can have a high variance. If simple clustering based on the features of initial states is used, the partitioning may not align with the objective function.
- Variance in the sub-group partitioning phase: Related to the above point, how does this algorithm deal with the variance in the sub-group partitioning phase? Also, using silhouette scores is distance-based and does not consider the variance. It would be useful if there is a way to determine the number of clusters (M) in a data-driven way, taking both bias and variance into account.
- Improvement over Keramati et al. (2022) seems incremental: The paper states that the benefit of the proposed method over Keramati et al. (2022) is that the proposed method does not require full-trajectory information and is thus applicable when only initial states are accessible. I understand the practical advantages, however, the technical progress is not convincing, as the way the proposed method overcomes the challenges in sub-group partitioning without full trajectories is not well described. Also, if Keramati et al. (2022) is a skyline, how does the performance of the proposed method differ from Keramati et al. (2022)? A more detailed discussion comparing the proposed method with existing work would be appreciated.
Keramati et al. (2022): Identification of Subgroups With Similar Benefits in Off-Policy Policy Evaluation. Ramtin Keramati, Omer Gottesman, Leo Anthony Celi, Finale Doshi-Velez, Emma Brunskill. CHIL, 2022.
Questions
- How does the proposed method estimate the value during the sub-group partitioning phase?
- How does the proposed method deal with the variance in the data partitioning?
- If Keramati et al. (2022) is a skyline (with full knowledge about trajectory under the logging policy), how much can the proposed method achieve?
- This may be due to random seeds, but I wondered why FPS-P can be quite pessimistic while FPS and FPS-noTA are rather optimistic.
Limitations
Ambiguity about the sub-group partitioning phase and a light comparison with Keramati et al. (2022) (one of the most closely related works). See weaknesses for the details.
We sincerely appreciate your time and effort in evaluating our work. Please find our point-by-point response below.
Q1. How is this sub-group partitioning actually conducted?
R1. We used an off-the-shelf algorithm, Toeplitz inverse covariance-based clustering (TICC) [1], to obtain the initial partitioning, as described in Appendix A.3.5, in order to jump-start the iterative optimization toward objective (1). It could be our writing that caused this confusion, and we will clarify it in the camera-ready version, if accepted. Please see the response below regarding the concern over variance.
Q2. Variance in the sub-group partitioning phase:
Q2.1. How does this algorithm deal with the variance in the sub-group partitioning phase?
R2.1. We thank the reviewer for this thoughtful comment; we carried out additional analyses to address it, which led to additional findings. Specifically, we altered the total number of sub-groups, M, and re-ran the sub-group partitioning. We observed that only a minimal percentage of students changed sub-groups across different values of M (see the table below). This may not be surprising, because in empirical human-centric scenarios, such as education, the policies are generally highly regulated and scrutinized by departmental committees strictly enforcing the guidelines for human-centric experiments, so that both behavior and target policies deployed at scale would not greatly interfere with students' learning. This implies the underlying assumption that the divergence between behavior and target policies is intrinsically bounded. Such a finding is important and can potentially be generalized to common human-centric environments, and we plan to pursue this avenue further in broader contexts, both empirically and theoretically, in the future.
| Changes of # of Sub-groups | 3->4 | 4->5 | 5->6 | 3->5 | 3->6 | 4->6 |
|---|---|---|---|---|---|---|
| Perc. of students changing sub-groups | 2% | 4% | 3% | 6% | 10% | 9% |
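For illustration, the kind of stability figure reported above could be computed as in the sketch below: re-partition with a different number of sub-groups, map each new cluster to the old cluster it overlaps most, and report the fraction of participants whose assignment changed. KMeans and the synthetic features are stand-ins for the actual partitioning procedure and student initial states.

```python
import numpy as np
from sklearn.cluster import KMeans

def fraction_changed(features, m_old, m_new, seed=0):
    old = KMeans(n_clusters=m_old, n_init=10, random_state=seed).fit_predict(features)
    new = KMeans(n_clusters=m_new, n_init=10, random_state=seed).fit_predict(features)
    # Map each new cluster to the old cluster containing most of its members
    # (assumes every new cluster is non-empty).
    mapping = {c: np.bincount(old[new == c]).argmax() for c in range(m_new)}
    remapped = np.array([mapping[c] for c in new])
    return float(np.mean(remapped != old))

features = np.random.default_rng(0).normal(size=(1288, 5))  # illustrative initial states
print(f"{fraction_changed(features, 3, 4):.1%} of participants change sub-group (3 -> 4)")
```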
Q2.2. Using silhouette scores is distance-based and does not consider the variance. It would be useful if there is a way to determine the number of clusters (M) in a data-driven way, taking both bias and variance into account.
R2.2. As our work is the first to target and solve a practical challenge often encountered in OPS, we presented the framework in a straightforward manner; the silhouette score has been broadly employed and exhibits good performance across human-related tasks [1-3]. However, we greatly appreciate this comment and agree that bias and variance in partitioning would be an interesting topic to investigate in the future.
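As a point of reference, the sketch below shows the standard silhouette-based selection of the number of clusters M; KMeans and the synthetic features stand in for the actual partitioning algorithm and student data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = np.random.default_rng(1).normal(size=(500, 4))  # illustrative initial states

scores = {}
for m in range(2, 7):
    labels = KMeans(n_clusters=m, n_init=10, random_state=0).fit_predict(features)
    scores[m] = silhouette_score(features, labels)

best_m = max(scores, key=scores.get)
print("silhouette scores:", {m: round(s, 3) for m, s in scores.items()})
print("selected M =", best_m)
```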
Q3. Improvement over Keramati et al. seems incremental:
Q3.1. I understand the practical advantages, however, the technical progress is not convincing, as the way the proposed method overcomes the challenges in sub-group partitioning without full trajectory is not well-described.
R3.1. We respectfully disagree with the opinion that the work is incremental over Keramati et al. Though the theoretical analyses of variance bounds in both works may be grounded in [4-5], please note that the objective of our work is to identify, from a set of candidate policies, the policy that can work best for each arriving individual in an HCS, without access to any prior offline data or a complete trajectory collected from that individual; this targets a common bottleneck in empirical studies with human participants. In contrast, Keramati et al. assumed there exists a population-level (optimal) policy and then identified the sub-groups that may benefit most from that policy, which is the opposite direction of our problem statement. Moreover, they required the entire trajectory of each individual to be given before sub-grouping, which would not be practical for real-world HCSs, as the initial state is the only observation available when an individual arrives at the HCS, right after which a group needs to be assigned (e.g., doctors need to sketch out a diagnosis/treatment plan soon after a patient steps into the triage room).
Q3.2. How does the performance of the proposed method differ from Keramati et al.?
R3.2. As mentioned above, given the complexity of real-world human-centric environments and the highly limited information on each incoming individual, it is difficult to apply the method of Keramati et al. in our empirical experiments: its required assumptions would not be met, nor would trajectories beyond the initial state be available for their method to use for sub-grouping.
Q4. Why can FPS-P be quite pessimistic while FPS and FPS-noTA are rather optimistic?
R4. There could be unobserved confounding in human-centric experiments, which can lead methods that assume sequential ignorability for the behavior policy to under- or overestimate; similar findings were observed in [6].
We hope our responses sufficiently addressed your concerns and clarified that our work is solving a significant practical challenge effectively. We are happy to respond to any follow-ups you may have.
References
[1] Toeplitz inverse covariance-based clustering of multivariate time series data. KDD 2017.
[2] Domain generalization via model-agnostic learning of semantic features. NeurIPS 2019.
[3] A Hierarchical Clustering algorithm based on Silhouette Index for cancer subtype discovery from genomic data. Neural computing and applications 2020.
[4] Learning bounds for importance weighting. NeurIPS 2010.
[5] Policy optimization via importance sampling. NeurIPS 2018.
[6] Off-policy policy evaluation for sequential decisions under unobserved confounding. NeurIPS 2020.
As we are entering the second half of the discussion period, should the reviewer have any follow-ups, we will try our best to address them in time. If satisfied, we would greatly appreciate the reviewer updating their review or acknowledging our responses. We sincerely thank the reviewer again for the effort devoted to the review process, allowing the work to be thoroughly evaluated and discussed.
Sincerely,
Authors of submission 2308
Thank you for providing detailed responses to the questions. After reading the rebuttals and thinking about it carefully, I'm still skeptical about the impact of this paper on the OPE research community. As mentioned in the rebuttals, TICC is the off-the-shelf algorithm for clustering, which is generally suitable for clustering sequential trajectories. However, there is no unique discussion regarding clustering in the context of OPE, such as the bias-variance tradeoff of OPE/OPS. What is unique about OPE is the use of user partitioning; however, this approach has already been taken in previous research papers, as pointed out in the Weaknesses. For these reasons, I remain on the negative side in my evaluation.
The authors would like to thank the reviewer for taking the time to read our rebuttal and follow up. We are also glad to see that, presumably, we were able to address the concerns mentioned in Q2 and Q4 above, as they were not raised in the follow-up.
We also thought it would be helpful for us to further clarify the main discrepancies centered around the latest response from the reviewer.
I'm still skeptical about the impact of this paper on the OPE research community. User partitioning has already been taken in previous research papers as pointed out in the Weaknesses.
We completely understand and respect that each researcher holds their own opinion on which contributions are most meaningful to the community, and we agree that theoretically addressing the bias/variance issue would be one of them, which we plan to explore further (in the context of FPS) in future work. On the other hand, our work was inspired by and developed to solve a practical problem that we were the first to attempt: as pointed out by the other two reviewers (quoted below), we are the first work to address an important practical OPS problem for arriving participants without prior knowledge (i.e., with only the initial state observable). We believe the rebuttal above rigorously pinpoints how our work is fundamentally different from Keramati et al.: they required the full trajectory to be known before the partitioning/evaluation process begins, which would be impractical in real-world HCSs (e.g., clinicians can only make treatment decisions by observing the state within a short period after the patient is admitted, without any way of accessing information on how the treatment will unfold after it has started). Further, the information available to our approach is drastically less than in Keramati et al., which is also why our methodology includes an augmentation component. Simply put, the approach of Keramati et al. cannot take only initial states as inputs and produce the partitioning and OPE results; it requires entire trajectories as inputs, so how could it work when only initial states are available? We would greatly appreciate the reviewer providing more solid and detailed justification for the claim that our work is already covered by Keramati et al., as we have detailed in the rebuttal why we disagree on this point but have not seen any additional rationale in the reviewer's follow-up beyond a re-statement of the claim.
Quoted from wmCx -- "New Problem Formulation: The paper tackles the unique challenge of selecting policies for new participants without prior offline data. This problem formulation is distinct from existing OPS/OPE frameworks, which typically assume homogeneity among agents."
Quoted from wmCx -- "Methodological Rigor: The paper demonstrates a high level of methodological rigor. The use of variational auto-encoding (VAE) for trajectory augmentation and the development of an unbiased value function estimator with bounded variance reflect thorough and robust algorithm design."
Quoted from aKHU -- "The paper introduces a new framework that selects policies for each new participant joining the cohort based solely on the initial state."
TICC is the off-the-shelf algorithm for clustering, which is generally suitable for clustering sequential trajectories. However, there is no unique discussion regarding clustering in the context of OPE
TICC is only used to jump-start the optimization of objective (1) in our approach; it is an empirical choice we found helps ensure the stability and efficiency of the partitioning process. In theory, one can start with any initial partitioning (e.g., uniformly random). It is not the main focus of the paper or of our approach, and we never claimed that using TICC is part of our novelty.
Again, we greatly appreciate the reviewer's time and effort in getting our work thoroughly discussed and evaluated. Regardless of the manuscript decision, we would love to hear the reviewer's feedback on the points above, which would be extraordinarily meaningful and helpful for lifting our work, as well as the community in general, to another level of excellence.
Thank you for the detailed responses. I acknowledge the contribution of the augmentation part, as I have mentioned in the Strengths. I agree that this can be useful in practice. The concern I have lies in the paper's user partitioning method and its connection to the bias and variance of OPE/OPS (theoretical advancement over existing work), which is also an important aspect of OPE. I'd be happy to discuss this point with other reviewers and ACs.
This paper introduces First-Glance Off-Policy Selection (FPS), a novel approach to off-policy selection (OPS) in human-centric systems (HCSs) such as healthcare and education, where the heterogeneity among participants requires personalized interventions.
FPS addresses this by segmenting participants into sub-groups based on similar traits and applying tailored OPS criteria to each sub-group. This method is evaluated in two real-world applications: intelligent tutoring systems and sepsis treatment in healthcare. The results demonstrate significant improvements in both learning outcomes and in-hospital care, highlighting FPS's ability to personalize policy selection and enhance system performance.
Strengths
- Novel Approach: The introduction of the First-Glance Off-Policy Selection (FPS) framework is a significant innovation. By systematically addressing participant heterogeneity through sub-group segmentation, FPS offers a fresh perspective on OPS in human-centric systems (HCSs).
- New Problem Formulation: The paper tackles the unique challenge of selecting policies for new participants without prior offline data. This problem formulation is distinct from existing OPS/OPE frameworks, which typically assume homogeneity among agents.
- Methodological Rigor: The paper demonstrates a high level of methodological rigor. The use of variational auto-encoding (VAE) for trajectory augmentation and the development of an unbiased value function estimator with bounded variance reflect thorough and robust algorithm design.
- Comprehensive Experiments: The experimental evaluation is extensive, covering both real-world educational systems and healthcare applications. This diversity in testing scenarios strengthens the validity of the results. The paper provides a detailed analysis of the results, including comparisons with various baselines and ablation studies. This thoroughness ensures that the findings are well-supported and credible.
Also, the paper is well-organized, with clear sections that guide the reader through the problem formulation, methodology, experiments, and results. Each part builds logically on the previous one.
Weaknesses
The paper is generally well-written. I will combine the weaknesses and questions into one section.
- Assumption of Independent Initial State Distributions: The FPS framework assumes that the initial state distributions for each participant are independent and can be uniformly sampled from the offline dataset. This assumption may not hold true in real-world scenarios where participants' initial states can be influenced by various contextual factors and past interactions. The independence assumption may oversimplify the complexity of human-centric systems and may lead to suboptimal policy selections. The authors may consider addressing the potential dependencies in initial states.
- Lack of Consideration for Non-stationary State Transitions: FPS focuses on the initial state for policy selection without considering longitudinal data that captures the progression of participant states over time. This, in the AI community's language, is a non-stationary state transition issue faced by meta-RL. Is it possible that the state transition for each patient is also independent of each other? In other words, chances are that the state transition for each patient is sampled from a distribution. Would this FPS work in such a case?
Questions
Questions are combined in the weaknesses section. Please see above.
Limitations
The paper does not thoroughly address FPS's scalability and applicability in large-scale, real-world settings, especially in healthcare. I suggest the authors clarify the scalability and emphasise the importance of clinical guidance. One to two sentences will be enough.
Thank you for your time and effort in evaluating our work, and for your positive comments that the paper is making an important impact. Please find our point-by-point response below.
Q1. Assumption of Independent Initial State Distributions The FPS framework assumes that the initial state distributions for each participant are independent and can be uniformly sampled from the offline dataset. This assumption may not hold true in real-world scenarios where participants’ initial states can be influenced by various contextual factors and past interactions. The independence assumption may oversimplify the complexity of human-centric systems and may lead to suboptimal policy selections. The authors may consider addressing the potential dependencies in initial states.
R1. Thank you for the insightful comment. The independence assumption we made was inspired by the use of uninformative priors in Bayesian methods, as we do not have much prior knowledge of each sub-group's potential outcomes before the experiment starts. We agree with the reviewer that investigating potential dependencies and confounding in initial states would be important, and we plan to do so as separate future work, as capturing them in practical scenarios can be very challenging; e.g., a few works attempt to address that issue alone [1-3].
Q2. Lack of Consideration for Non-stationary State Transitions: FPS focuses on the initial state for policy selection without considering longitudinal data that captures the progression of participant states over time. This, in the AI community's language, is a non-stationary state transition issue faced by meta-RL. Is it possible that the state transition for each patient is also independent of each other? In other words, chances are that the state transition for each patient 𝑝𝑖 is sampled from a distribution. Would this FPS work in such a case?
R2. Great point. Initially, we did not consider the progression/transition of participant states, since the scenario our work considers is that each individual needs to be assigned a group upon arrival, when the initial state is the only observation available. However, we agree that our framework can be extended to also consider each participant's follow-up visits once they are available. If the state transitions are also independent of each other, a simple variation of our method may work, as one could re-run the partitioning algorithm at each new visit; this would not violate the assumptions needed by our work. Further, in the future we plan to explore temporally dynamic sub-group partitioning, where the agent can consider all historical states of a participant upon each visit. This would make the problem setup more challenging, as it falls more under a POMDP setting. We greatly appreciate both of these comments, as they helped us shape future work and led us to consider more challenging, interesting, and realistic setups. We also hope that our work could inspire more researchers to recognize and emphasize the empirical challenges of deploying RL/OPS in real-world systems, if accepted.
Q3. The paper does not thoroughly address FPS's scalability and applicability in large-scale, real-world settings, especially in healthcare. I suggest the authors clarify the scalability and emphasise the importance of clinical guidance. One to two sentences will be enough.
R3. We sincerely appreciate your suggestion. We will add the following discussion in the camera-ready, if accepted: "Compared to IE systems, HCSs in healthcare are even higher-stakes and may further limit the options (i.e., policies) available to facilitate sub-grouping experiments, due to stricter clinical experimental guidelines. However, FPS has demonstrated its capabilities in a real-world experiment involving >1,200 participants with years of follow-ups, which shows its efficacy and scalability toward more challenging systems and larger cohorts as in healthcare, as the assumptions needed by FPS do not change fundamentally across these two systems. Moreover, potential underlying confounding may exist across patients' initial states in healthcare, and it is also important to consider inputs from healthcare professionals during sub-grouping. As a result, one may further extend our framework in this direction, allowing it to function better in the healthcare domain."
We hope these answers address your concerns and show that our work solves a significant challenge in a satisfying manner. We are happy to answer any follow-up questions or hear any further comments.
References
[1] Namkoong et al. Off-policy policy evaluation for sequential decisions under unobserved confounding. NeurIPS 2020.
[2] Xu et al. An instrumental variable approach to confounded off-policy evaluation. ICML 2023.
[3] Tennenholtz et al. Off-policy evaluation in partially observable environments. AAAI 2020.
As we are entering the second half of the discussion period, should the reviewer have any follow-ups, we will try our best to address them in time. If satisfied, we would greatly appreciate the reviewer updating their review or acknowledging our responses. We sincerely thank the reviewer again for the effort devoted to the review process, allowing the work to be thoroughly evaluated and discussed.
Sincerely,
Authors of submission 2308
Dear reviewers and ACs,
Though the author-reviewer discussion window is closing in less than a day, key discrepancies still remain between the authors and reviewer 2e7j on two aspects. We have also noticed that reviewer 2e7j lowered the rating from 4 to 3 without providing any reasoning after our latest clarification response, and we have not found additional rationale supporting their claims in 2e7j's response to our rebuttal. Consequently, the authors thought it necessary to briefly summarize the two main discrepancies here, in case other reviewers and ACs would like to weigh in after the open discussion window closes.
#1. Reviewer 2e7j was unsure why sub-grouping is needed in OPE and how it could help address the bias-variance trade-off in OPE. As a result, reviewer 2e7j deemed that our work may not bring much value to the OPE community.
- On why sub-grouping is needed.
Conceptually, the OPS problem for arriving participants without prior knowledge that we seek to solve significantly constrains the information the OPS agent can access: only the initial states are available, as opposed to the entire trajectories used in most existing OPE/OPS works. As a result, sub-grouping allows the agent to draw on how other, similar individuals were treated by each candidate policy before assigning one to the arriving participant. A similar decision-making procedure is widely adopted in many existing real-world HCSs, with the one difference that it is domain experts, rather than an OPS agent, who select the policy for arriving participants. For example, clinicians gather the vitals and existing health conditions of patients walking into the emergency department (constituting the initial states) and promptly lay out a treatment plan (determining treatment options as provided in standard emergency procedures). The information-gathering step here is analogous to solving the sub-grouping objective (1) in our work: emergency clinicians intrinsically group each patient into a type based on their experience treating similar patients, from which a diagnosis and treatment plan can be determined (the policy selection/assignment step). In short, our work automates a pipeline that constantly runs in real-world HCSs. Empirically, Section 3 of the paper provides rigorous results and justifications demonstrating the strength of our methodology, coupled with discussions illustrating the importance of sub-grouping for solving the arriving-participant OPS problem we focus on.
- On variance and broader impact.
As the authors pointed out in the rebuttal, the policies that can be deployed to participants in HCSs are subject to thorough examination and strict regulation, implying an implicitly bounded divergence between the state-action visitations under the behavior and target policies; i.e., the variance issue here may not be as critical as in other OPE applications. As a result, the authors chose to prioritize solving the arriving-individual problem practically first, and defer a deeper dive into the theoretical aspects to future work. However, please note that the estimator we used has bounded variance (see Proposition 2.4 in the paper). Further, as our framework addresses a critical problem commonly faced in most HCSs, we believe our impact would go beyond the OPE community, benefiting many cross-functional domains involving experiments with human participants, including but not limited to remote/telecom healthcare, online/intelligent education, and recommender systems.
#2. Reviewer 2e7j deemed that the problem we study, as well as our methodology, has already been studied and covered by Keramati et al. (2022).
In short, the method in Keramati et al. would only work when entire treatment trajectories are available at the time a new participant arrives, which would be practically impossible and is a completely different problem setup from ours. As a result, their work pursues a fundamentally different objective and cannot be easily adapted to the problem we attempt to solve. We have provided more details in our response to reviewer 2e7j's Q3 in the original rebuttal, as well as in our latest response to 2e7j.
We would also appreciate advice from the reviewers and ACs on how we could further improve the presentation of our work on these aspects, as we feel our writing may have contributed to the confusion.
Dear authors, reviewers, ACs,
Just for a quick clarification,
lowered the rating from 4 to 3 without providing any reasoning after our latest clarification response
I changed my evaluation for this reason: in the initial evaluation phase, I still had some questions on how the user partitioning is conducted, so I set a neutral score. After reading the rebuttals, it turned out that the user partitioning itself does not offer any further advancement over existing work, especially regarding theoretical analysis. I thought this point should be carefully considered for a NeurIPS paper, and therefore I updated my evaluation toward a non-neutral score.
However, I do acknowledge the authors' effort on the augmentation part to enable OPE with only the knowledge about the initial state distribution of the evaluation policies. Also, I am not strongly against the acceptance.
I'd be happy to discuss the above points with other reviewers and ACs.
The paper formulates the problem of off-policy selection, relevant to healthcare and social settings where the goal is to select a policy to deploy for a new patient/sample with no prior history. To enable this task, the authors assume independent initial state distributions. Sub-group partitioning is performed via clustering, with the objective designed so that the difference between the value of the policy selected for each sub-group and the value of the behavior policy is maximized, and an estimator is proposed for value function estimation. To enable sub-group partitioning, a variational auto-encoder based data augmentation strategy is used for the initial clustering. During deployment, cluster membership is inferred and the appropriate policy is deployed.
Reviewers consider the problem formulation and the approach novel and easy to implement, with comprehensive evaluation, including ablations. There were certain clarity issues in the writing (on my reading) which the author response clarified. Reviewer aKHU raised valid points regarding the design of the real-world experiment as well as bias in the reward estimates, which the authors have attempted to clarify. One of the major concerns raised was the distinction from prior work, particularly Keramati et al., which the authors clarified in their rebuttal; this distinction is satisfactory to me. The other concern was the analysis of the bias/variance trade-off of the proposed value function estimator; there is some analysis in the paper, but it is far from exhaustive. However, considering the largely empirical nature of the contribution, I have weighted this criticism appropriately. The empirical evaluation is interesting and demonstrates the utility of the simple approach proposed in the paper.
Considering all of these factors, I recommend acceptance.