Concept-driven Off Policy Evaluation

审稿意见

评分: 5置信度: 22024-11-03

The paper introduces concept-driven off-policy evaluation (OPE). This approach mirrors the structure of concept bottleneck models, where the prediction task is split into two stages: first, predicting a set of human-interpretable concepts, and then using these concepts to complete the prediction task. The paper applies this concept-driven approach to OPE. Specifically, it maps trajectories to a concept space, derives a policy for each concept value, and uses the concept equivalent behavior and evaluation policies to calculate the importance sampling ratio. When concepts are not known in advance, the paper presents an algorithm to learn them. The evaluation is performed for specific policies on both synthetic and real datasets.

优点

Importance smapling methods for OPE can suffer from high variance and using a concept bottelenck can reduce the variance. The concepts can also improve the interpretability of the evaluation task. The proposed idea is novel to the best of my knowledge.

缺点

I found this paper challenging to read due to numerous typos and mathematical inconsistencies. For example, in Section 4.1, $\phi$ is first introduced as a function of state, action, reward, and next state. Later, it’s interpreted as a function of the full trajectory. However, in Equation 1, $\phi^{-1}$ appears to return a state, creating confusion. Additionally, there’s a typo in the definition of w in Equation 1. Theoretical results are presented with virtually no explanation. In both Sections 5.1 and 6.2, four theorems are listed in row without any discussion, making it difficult to follow or interpret the results and understand their limitations (for instance, Theorem 5.2 suggests that concept-driven evaluation is biased, which doesn’t align with the evaluations). It’s generally known that IS estimators can exhibit high variance, and various methods have been proposed to reduce this variance at the cost of extra bias. This paper’s proposed method falls into this category, though this is not clearly acknowledged.

There are also choices in the paper that lack justification. For instance, it uses an approximate nearest-neighbor policy with the MIMIC-III dataset. Having worked extensively with this dataset and reviewed recent studies on it across different tasks, I am not aware of any example where such a policy is applied. Additionally, this experiment employs a concept space of size $10^{15}$ , which does not seem to offer interpretability and appears far removed from the motivating example in Figure 1.

问题

Please refer to weaknesses.

评论- Author response to weaknesses (continued)

2024-11-17

6.2, 6.3 (Lines 367-377): We discuss when variance under unknown concept OPEs are lower than traditional OPEs. More specifically, we discuss the covariance assumption can be used as a loss function in the algorithm section of the paper, unlike known concepts where the practitioner has to explicitly satisfy the assumption while designing the concept policies.

6.4 (Lines 390-397): We discuss how the confidence bounds loosen under unknown concepts as there is additional bias, and quantify the worst-case additional bias to be the value function $\mathbb{E}{\pi^c_e}[\hat{V}{\pi_e}]$ of the unknown concept-based estimator.

Q: It’s generally known that IS estimators can exhibit high variance, and various methods have been proposed to reduce this variance at the cost of extra bias. This paper’s proposed method falls into this category, though this is not clearly acknowledged.

Response: In addition to IS and PDIS, other common methods such as Weighted-IS [1], Per-Decision Weighted-IS [1], and truncating states that do not contribute to variance [2], aim to reduce variance at the cost of increased bias. While our approach reduces variance for bias, it achieves this by characterizing key state information and adhering to specific desiderata such as conciseness, diversity, and interpretability. This interpretability allows the sources of bias to be analyzed and addressed through interventions, as discussed in Section 7. Following your recommendation, we have added these examples to the related work section for completeness.

[1] Eligibility traces for OPE (Precup et al.) 2000 [2] Low Variance OPE with State-based IS (David Bossn, Philip Thomas) 2023

Q: There are also choices in the paper that lack justification. For instance, it uses an approximate nearest-neighbor policy with the MIMIC-III. Having worked extensively with this dataset and reviewed recent studies on it across different tasks, I am not aware of any example where such a policy is applied.

Response: For MIMIC-III, it’s very common to generate behavior trajectories based on K-nearest neighbors as the true on-policy trajectories aren’t available. Some references where behavior trajectories (or policies in general for MIMIC-III) are generated using KNNs are

[1] Interpretable OPE in RL by Highlighting Influential Transitions: Here the evaluation policy is generated using 50 NNs.

[2] Superhuman performance on sepsis MIMIC-III data by distributional RL.

[3] Offline Policy Optimization with Eligible Actions (K=100 NNs)

[4] Identification of Subgroups With Similar Benefits in OPE (K=100 NNs)

[5] The AI Clinician learns optimal treatment strategies for sepsis in intensive care.

[6] Development and validation of a RL algorithm to dynamically optimize mechanical ventilation in critical care.

In this paper, we consider a popular variant of KNNs, called the approximate nearest neighbors search. The advantages of Approximate-NNs over KNNs are scalability, reduced computational cost, efficient indexing and support for dynamic data. This allows us to generate behavior and evaluation policies with a larger number of neighbors (200 in our paper, which is double than all known papers using KNN), with a faster inference time. Some examples of papers that use approximate nearest neighbors in medical examples are:

[7] Approximate kNN Classification for Biomedical Data

[8] Medical image retrieval via nearest neighbor search on pre-trained image features (This paper compares ANN as a baseline over their main algorithm called DenseLinkSearch).

[9] Approximate nearest neighbors: towards removing the curse of dimensionality. (This is the seminal paper on ANNs which discusses advantages over KNNs).

We add this complete analysis in Appendix G.3 in the revised version of the paper.

Q: Additionally, this experiment employs a concept space of size 10^15, which does not seem to offer interpretability and appears far removed from the motivating example in Figure 1.

Response: While the concept space may be higher, the interpretability lies in the concept representation and not the concept-space itself. As an example: a patient with the concept representation [0, 2, 1, 1, 2, 0, 9, 5, 2, 0, 6, 2, 1, 5, 9] shows the following conditions: acute kidney injury (very low creatinine), severe hypoxemia (very low PaO2), metabolic alkalosis (very high SpO2), and critical electrolyte imbalances (low potassium and magnesium), along with severe hypoglycemia. The normal GCS score indicates preserved neurological function, but over-oxygenation and potential respiratory failure are likely. The combination of anuria, AKI, and hypoglycemia points strongly toward hypotension or shock as the underlying cause. This example is listed in lines 259-264 of our paper.

The design choice could be further reduced however to lower discretizations, which would reduce the space to M^15 where M is the user choice. We consider 10 to be thorough, while a lower number would be an easier condition.

评论- Authors Response to weaknesses

2024-11-17

We sincerely thank the reviewer JqPf for the review and comments. We will sequentially address the questions.

Q: I found this paper challenging to read due to numerous typos and mathematical inconsistencies. For example, in Section 4.1, ϕ is first introduced as a function of state, action, reward, and next state. Later, it’s interpreted as a function of the full trajectory. However, in Equation 1, ϕ−1appears to return a state, creating confusion. Additionally, there’s a typo in the definition of w in Equation 1.

Response: We acknowledge there is an easier way to define a concept-based policy, without making it as complicated as Eqn 1 and rectifying notation. We make the changes in the latest version of the draft, section 4.1.

To summarize, the concept function $\phi$ maps trajectory histories $h_t$ to concepts $c_t$ . This function $\phi$ can capture various vital information in the history such as transition dynamics, short-term rewards, influential states, interdependencies in actions across timesteps, etc. Without loss of generality, in this work, we consider concepts $c_t$ to be just functions of current state $s_t$ . This assumption considers the scenario where concepts capture important information based on the criticalness of the state. We provide this clarification in the latest draft of the paper, section 4.1.

Concept-based policies $\pi^c_e, \pi^c_b$ are policies that are conditioned on concepts instead of states, where the concepts satisfy the desiderata. We make an additional assumption (5.2) on these concept-based policies. This assumption states that given a state, the concept-based policies $\pi^c_e, \pi^c_b$ are allowed to differ from the traditional policies $\pi_e,\pi_b$ by at most a quantity of $\beta$ , i.e. $|\pi^c_e(a|c)-\pi_e(a|s)|<\beta$ and $|\pi^c_b(a|c)-\pi_b(a|s)|<\beta$ . This is to ensure that the evaluation policy $π_e$ under concepts is reflective of the original policy while allowing to satisfy the concept desiderata depending on the flexibility of $\beta$ . The quantity $\beta$ is defined at the discretion of the practitioner.

Hopefully, this provides more clarity on the objective of the paper, and we are open to more suggestions on where we can improve.

Q: Theoretical results are presented with virtually no explanation. In both Sections 5.1 and 6.2, four theorems are listed in row without any discussion, making it difficult to follow or interpret the results and understand their limitations (for instance, Theorem 5.2 suggests that concept-driven evaluation is biased, which doesn’t align with the evaluations).

Response: We provide more detailed explanations to each of the Assumptions and Theorems in Sections 5 and 6, in the revised draft. These also cater to insights from theory and help design concepts and their corresponding policies practically. As a summary,

5.1 (Lines 210-213): This talks about the absolute continuity assumption, that every state-action pair that appears in the batch policy is capable of being evaluated with a positive probability.

5.2 (Lines 214-220): The state and concept policies differ by most $\beta$ , to ensure that the evaluation policy $π^c_e$ under concepts is reflective of the original policy $\pi_e$ .

5.3 (Lines 221-222): Known concepts are unbiased -> This is because the change of measure theorem from $\pi_b$ to $\pi^c_b$ is possible. However, this holds for infinite trajectories, and bias creeps in when there are finite trajectories.

5.4 (Lines 224-230): Variance under known concepts is lower than traditional IS, under the covariance assumption. We discuss why the covariance assumption can be realized better under concept representations as opposed to the state representations, in the revised draft of the paper.

5.5 (Lines 232-234): Variance Comparison with the MIS Estimator: We discuss how the covariance assumption under concept IS-product ratios can result in lower variance, even compared to the more efficient state-distribution ratios, which don’t suffer from the curse of the horizon with increasing trajectory length.

5.6 (Lines 236-243): Cramer-Rao bounds of our estimators: We explicitly discuss the scenario when the worst-case bounds tighten when moving from states to concepts. The crux is in the ratio $\frac{\pi_e(a|s)}{\pi_b(a|s)}$ where $\pi_b(a|s)$ is very low due to some states being poorly sampled. Under concepts, this state is better characterized which leads to an improved value of $\pi^c_b(a|c)$ , which improves the denominator and reduces the IS ratio, which in turn improves the bounds. (We provide a short explanation right after stating the theorem in the main text and a detailed explanation in Appendix D.1.7.

6.1 (Lines 362-366): Unknown concepts are biased. We discuss why unknown concepts are biased, as change of measure theorem is not applicable due to unknown probability $\pi^c_b$ .

2024-11-20

Thank you for your response. I want to acknowledge that I have read your rebuttal, and it has clarified some of the questions I had. However, I noticed that the paper has undergone significant changes during the rebuttal phase. Using a pdfdiff tool, I estimate that about 30–50% of the writing has been revised (it looks like a new paper!).

At first glance, the changes have made the paper clearer and better written. For instance, the theorems and assumptions now have some extra space and explanations.

That said, I am unsure whether such extensive revisions are typically acceptable during the rebuttal phase. Regardless, I want to inform the authors that I am carefully reviewing the revised version of the paper along with their response, but it may take some time.

2024-11-21

Dear Reviewer JqPf,

Thank you for your prompt response. Aside from grammatical mistakes and fluency in certain sentences, these are the main changes we have made in the paper which caters to all questions and suggestions by the reviewers. For instance,

Section 4: Clarifies the definition of concepts and concept-based policies (This was requested by all the reviewers)

Section 5: Theoretical section: This section was modified to connect (the assumptions and theorems) to (experiments and practical scenarios), (This was requested by Reviewer JqPf)

Section 6: Theoretical section: This section was modified to connect (the assumptions and theorems) to (experiments and practical scenarios), (This was requested by Reviewer JqPf). Methodology section: The explanation of the algorithm was revised to detail the nuances of each step, linking them to concepts desiderata, which underscores the significance of previously unexplored concepts. (This was requested by Reviewer 795h).

Section 7: The methodology section of the interventions was slightly modified to better explain the definitions. We elaborated more upon human criteria, and the intervention strategies and removed some redundant definitions to improve the flow of the section. (This was requested by Reviewer n4XH).

Appendix Section H (Ablations). We provided some additional analysis in the form of ablation experiments, like a. Adding a state-abstraction baseline b. IPS score comparison between concept and states c. Quantitative analysis of OPE under imperfect (poor) concepts. These ablation experiments further support the original claims in our paper and add better clarity to our original claims and observations. (These ablation experiments were requested by Reviewer 795h).

Appendix Section G (Additional training and hyper-parameter details). G2. Known, Oracle and Intervened concepts. We added some missing details on the concept policies in windygridworld. (This was requested by reviewer 795h). G3. We added additional details on the background on using KNN-based policies in MIMIC-III, and justify our experiment choice. (This was requested by reviewer JqPf).

All of these modifications are consistent with the original claims of the paper and don't change the experimental results, or theoretical proofs. These modifications further support the claims of the paper and provide additional clarity which helps the readers. Additionally, please don't hesitate to ask for additional clarifications which help with the understanding of the paper!

2024-11-26

Thank you for your further clarification.

The improvements in the writing are clear, specifically in the presentation of the results.

The authors have also addressed most of my questions.

Due to these improvements, I increase my rating.

But still, I don't support the acceptance of this paper. While the paper's main claims have not changed, there are a lot of changes from the original submission. These changes are indeed in a good direction, but I don't think it's an accepted practice in the community to submit a paper with poor presentation and later fix it during rebuttal.

评论- Possible further improvements to the paper?

2024-11-26

Dear Reviewer JqPf,

We once again thank you for your comments. We have responded to all your questions sequentially and incorporated the additional analysis in the revised version of the paper. Is there anything else you would like to see in the revised version or answer via rebuttals that can help further enhance the paper and increase the score?

审稿意见

评分: 5置信度: 42024-11-04

The goal of off-policy evaluation is to estimate the performance of an evaluation policy using data that is collected under a different behavior policy. Standard approaches to OPE can suffer from high variance. This work proposes a family of concept-based OPE estimators that reduce variance relative to standard importance-weighting estimators. They also develop an end-to-end algorithm for learning parametrized concepts, which are interpretable and can be used for off policy evaluation.

优点

The authors study an important problem – it is well-known that off-policy evaluation estimators can suffer from high variance, so the idea of using concepts as a form of dimensionality reduction of the state-action space can yield empirical benefits.

缺点

The paper has some clarity / conceptual issues: The proposal of this paper is to use a concept bottleneck model to learn interpretable concepts and derive importance-weighting estimators based on these concepts. The idea of simplifying the state-action space using concepts is a promising one, but there are many technical details that are not clear from the paper.

The authors posit that a concept at time-step $t$ can be obtained from a function $\phi$ that takes the entire trajectory from time-step $0$ to time-step $t$ as input, i.e. one can pass $(s_{t}, a_{0:t-1}, r_{0:t-1}, s_{0:t-1})$ to $\phi$ . However, later in the paragraph the authors write that “In this work, we consider concepts $c_{t}$ to be just functions of current state $s_{t}$ , and thus…” This is a bit confusing because $\phi$ is initially introduced as taking entire trajectories as input. Furthermore, if concepts $c_{t}$ are just functions of the current state $s_{t}$ , I wonder if it would be reasonable to interpret concepts as a way of allowing us to do dimensionality reduction on the states?
Suppose that we have a standard policy $\pi(a |s)$ and we compute a concept policy $\pi^{c}(a|c)$ . The policy value function $V_{\pi}(s)$ is a function of state $s$ , it is not clear to me from the paper how we can compute the analogous quantity for $V_{\pi^{c}}(s)$ or if such a quantity is even well-defined.
Assumption 5.1 seems to require that any action-state pair that is possible under the evaluation policy must also be possible under the behavior policy. However, the written interpretation of the assumption is the opposite.

问题

How can $\phi$ adapt to input trajectories of different length? Furthermore, in Equation (1), the authors appear to be able to invert the map $\phi$ . How is this possible?

评论- Author Response to Weakness 3 and Question 1

2024-11-17

Q: Assumption 5.1 seems to require that any action-state pair that is possible under the evaluation policy must also be possible under the behavior policy. However, the written interpretation of the assumption is the opposite.

Response: Thanks for pointing it out! We have made the corrections in the revised draft of the paper. (Lines 210-212)

Q: How can ϕ adapt to input trajectories of different lengths?

Response: In this work, we consider concepts as functions of states alone, however, there are multiple ways to deal with input trajectories of different lengths. 1. We can define a window-length k beforehand, and consider k-latest (s,a,r,s’) transitions as input to the concept function $\phi$ . 2. We can pad trajectories to the highest timestep, similar to word tokenization in NLP literature. On lines of implementation, an RNN or a transformer seems fit to handle input trajectories of different lengths with appropriate paddings.

Q: Furthermore, in Equation (1), the authors appear to be able to invert the map ϕ. How is this possible?

Response: Point taken. It is impractical to compute the inverse $\phi$ for large state spaces as defined in Equation 1, unless making additional assumptions like $\phi$ being a Bijective and invertible function, which is quite restrictive. To address this, we redefine concept-based policies as follows: (This is rewritten in section 4.1 of the paper): Concept-based policies $\pi^c_e, \pi^c_b$ are policies that are conditioned on concepts instead of states, where the concepts satisfy the desiderata. We make an additional assumption (5.2) on these concept-based policies. This assumption states that given a state, the concept-based policies $\pi^c_e, \pi^c_b$ are allowed to differ from the traditional policies $\pi_e,\pi_b$ by at most a quantity of $\beta$ , i.e. $|\pi^c_e(a|c)-\pi_e(a|s)|<\beta$ and $|\pi^c_b(a|c)-\pi_b(a|s)|<\beta$ . This is to ensure that the evaluation policy $π^c_e$ under concepts is reflective of the original policy $π_e$ , while allowing to satisfy additional desiderata depending on the value of $\beta$ . The quantity $\beta$ is defined at the discretion of the practitioner.

This additional assumption allows for computational feasibility without having to evaluate the inverse of the function and instead use a soft constraint over the loss function in Line 8 of the instead. We provide all the additional details related to training in Appendix G of our paper. This new definition of the concept-policy function and the additional assumption 5.2 doesn’t change any theoretical proofs or practical experiments, as we made no assumptions on the functional form of $\phi$ in our theoretical proofs.

2024-11-22

Thank you for your rebuttal and revised paper. I am reviewing the latest draft of the paper.

2024-11-22

Certainly. Please don't hesitate to ask for additional clarifications which help with the understanding of the paper!

评论- Author Response to Weaknesses 1,2

2024-11-17

We sincerely thank the reviewer tRrE for the review and comments. We will sequentially address the questions.

The paper has some clarity / conceptual issues: The authors posit that a concept at time-step t can be obtained from a function ϕ that takes the entire trajectory as input to ϕ. However, later in the paragraph the authors write that “In this work, we consider concepts ct to be just functions of current state st, and thus…” This is a bit confusing because ϕ is initially introduced as taking entire trajectories as input.

Response: We acknowledge that section 4.1 could be better clarified to describe the notion of a concept function $\phi$ . We make the changes in the latest version of the draft.

To summarize, the concept function $\phi$ maps trajectory histories $h_t$ to concepts $c_t$ . This function $\phi$ can capture various vital information in the history such as transition dynamics, short-term rewards, influential states, interdependencies in actions across timesteps, etc. Without loss of generality, in this work, we consider concepts $c_t$ to be just functions of the current state $s_t$ . This assumption considers the scenario where concepts capture important information based on the criticalness of the state. Furthermore, the concept function ϕ satisfies the following desiderata: explainability, conciseness, better trajectory coverage and diversity. We provide this improved interpretation of a concept function in the latest draft of the paper, section 4.1. A detailed description of desiderata is provided in Appendix A.

Furthermore, if concepts ct are just functions of the current state st, I wonder if it would be reasonable to interpret concepts as a way of allowing us to do dimensionality reduction on the states?

Response: Certainly! One of the interpretations of concepts which are just functions of states can be viewed as dimensionality reduction, with concepts summarizing states that have similar properties like transition dynamics, short-term rewards, etc. However, it is to be noted that concepts can do much more than just dimensionality reduction. Concept-based representations allow for interventions, inspect reasons and isolate sources of variance into why a particular concept is contributing to high variance in an OPE, as discussed in the Interventions section of the paper. (Section 7).

Furthermore, we conduct an ablation study (Appendix H.1, line 1799) where we perform K-means clustering over the state space. (This is our baseline for the state abstractions). First, we plot the MSE for our OPE performance against state clusters with varying numbers of clusters and then plot the clusters for the value of K(33) with minimum MSE. We observe a large number of clusters (As K=33), spread locally across, which is quite different from the true concepts. These clusters lack correspondence with the true oracle concepts or the optimized concepts, highlighting that learned concepts capture more meaningful and useful information compared to state abstractions. Furthermore, these aren’t readily interpretable and thus intervenable, underscoring the importance of performing OPE under concept representations.

Q: Suppose that we have a standard policy π(a|s) and we compute a concept policy πc(a|c). The policy value function Vπ(s) is a function of state s, it is not clear to me from the paper how we can compute the analogous quantity for Vπc(s) or if such a quantity is even well-defined.

Notably, the concept representation is only present in the Importance Sampling ratios, and thus in concept policy πc(a|c) and not in the value functions $V_\pi$ . This is important as we make no assumptions on concepts being Markovian, and Vπc(c) would imply there is a constraint on the concept to satisfy the Markovian property. This design choice makes the concept function \phi flexible in capturing other important aspects of the environment and offloading the burden of satisfying markovianness to traditional states. Let’s understand this through a Model-based OPE estimator to make this even clearer. Let’s take the Doubly-robust estimator as an example.

$\hat{V}_{\text{DR}}$

$= \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \prod_{k=0}^t \frac{\pi_e(a_k^{(i)} \mid s_k^{(i)})}{\pi_b(a_k^{(i)} \mid s_k^{(i)})} \left( r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) + \hat{V}(s_t^{(i)})$

$\hat{V}_{\text{CDR}}$

= \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \prod_{k=0}^t \frac{\pi_e(a_k^{(i)} \mid c_k^{(i)})}{\pi_b(a_k^{(i)} \mid c_k^{(i)})} \left( r_t^{(i)} - \hat{Q}(s_t^{(i)}, a_t^{(i)}) \right) + \hat{V}(s_t^{(i)})

It’s important to note the concept representation only appears in the Importance Sampling ratios and not in the actual model-based estimates. The model-based estimates are still a function of states. As the concepts only appear in the IS ratios, they don’t have to be necessarily Markovian, and the Bellman equation is still satisfied as the burden of satisfying Markovianity still lies on the states.

评论- Possible further improvements to the paper?

2024-11-26

Dear Reviewer tRrE,

We once again thank you for your comments. We have responded to all your questions sequentially and incorporated the additional analysis in the revised version of the paper. Is there anything else you would like to see in the revised version or answer via rebuttals that can help further enhance the paper and increase the score?

审稿意见

评分: 6置信度: 22024-11-04

The authors consider the problem of off-policy evaluation in offline/batched reinforcement learning. The goal is to be able to determine the effectiveness of a policy different from the one that collected/generated the data. The paper proposes a concept-based approach. A concept is a higher-level feature of the states/actions etc. that is (ideally) more interpretable and can capture key aspects of the problem such as transition points, changing dynamics and so on. The paper constructs a sampling importance method using concepts -- either learned concepts or concepts defined by experts. They prove various properties of the estimators and compare against traditional importance sampling methods. Finally, they run experiments on the Windy GridWorld problem and the MIMIC dataset, both when having human-designed concepts and when learning concepts.

优点

The paper proposes a useful idea in improving the interpretability of dynamics in RL which can give insights into the problem. Section 7 was very useful in illustrating this. The computational results overall are promising and show nice improvements over existing methods.

缺点

Theory: Additional discussion would be helpful for the theoretical results to explain the significance of these results. What kind of insights can we gain from the theory? Anything we can apply to improve/guide practical application and experimental results?

Additionally, how does the choice on the number of concepts affect the ability of evaluate policies? More concepts may be helpful in better partitioning state space, but overall the process becomes less interpretable if we have too many concepts.

Experiments: More detail would be appreciated on the tasks. Please explain what the WindyGridworld and MIMIC problems are, discussion on the state/action space etc. How are the concepts you propose in section 5.2 related to these? How do you assume the data is generated? For example, at least cite what the PPO algorithm is in section 5.2.

It would be nice to have more synthetic experiments, besides only WindyGridworld, so we can observe bias, variance, mean squared error and the effective sample size (ESS) on more problems (since MIMIC is not, we can only compute variance -- it is still important to observe this variance metric on real-world data, so please do keep it).

Algorithm: The algorithm introduced (Algorithm 1) seems to be a fairly significant contribution of the paper. However, it is given very little attention, and as a result I found it difficult to follow. For example, from the discussion in section 6.1 I do not see how the term $c_t^i = w \cdot f(s_t)$ is connected to the algorithm. Please provide a clearer description of the algorithm.

问题

Please see my questions in the weaknesses section above. In addition,

How is learning concepts different from learning representations? For example the work in [1].
Overall, can you give more details and intuition behind definitions etc. so we can better follow.

a) For example, in equation (1) $\phi^{-1}$ is never defined. Also initially $\phi$ is defined as a function of $a, r, s$ but $\phi^{-1}$ only acts on $c_t$ . You later say that concepts are only a function of $c_t$ .

b) Moreover, what is the steady-state distribution $d_{\pi}$ ? Steady-state of what distribution?

Can you provide more background on importance sampling, which seems to be the main approach you extend? It would be useful to have in order to better gauge the contribution of your work.

[1] Representation Matters: Offline Pretraining for Sequential Decision Making, Mengjiao Yang, Ofir Nachum ICML 2021

评论- Author Response to Questions 2,3

2024-11-17

Q: Overall, can you give more details and intuition behind definitions etc. so we can better follow. a) For example, in equation (1) ϕ−1 is never defined. Also initially ϕ is defined as a function of a,r,s but ϕ−1 only acts on ct. You later say that concepts are only a function of ct.

Response: We acknowledge that section 4.1 could be better written to describe the notion of a concept function $\phi$ . We make the changes in the latest version of the draft, section 4.1.

The concept function $\phi$ maps trajectory histories $h_t$ to concepts $c_t$ . This function $\phi$ can capture various vital information in the trajectory history such as transition dynamics, short-term rewards, influential states, interdependencies in actions across timesteps, etc. Without loss of generality, in this work, we consider concepts $c_t$ to be functions of current state $s_t$ . This assumption considers the scenario where concepts capture important information based on the criticalness of the state. Furthermore, the concept function ϕ satisfies the following desiderata: explainability, conciseness, better trajectory coverage and diversity. We provide this improved interpretation of a concept function in the latest draft of the paper, section 4.1. The concept function $\phi$ satisfies the following desiderata: explainability, conciseness, better trajectory coverage and diversity. A detailed description of desiderata is provided in Appendix A.

b) Moreover, what is the steady-state distribution dπ? Steady-state of what distribution?

Response: The steady-state distribution dπ(s) of a state s is the probability of being in that state s at any given point in time, assuming the process has reached its equilibrium. It’s mathematically defined as $d_\pi(s') = \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} d_\pi(s) \pi(a \mid s) P(s' \mid s, a)$ .

We acknowledge there is an easier way to define a concept-based policy, without making it as complicated as Eqn 1 (without requiring steady-state distribution functions) and rectify notation. We make the changes in the latest version of the draft, section 4.1. Concept-based policies $\pi^c_e, \pi^c_b$ are policies that are conditioned on concepts instead of states, where the concepts satisfy the desiderata. We make an additional assumption (5.2) on these concept-based policies. This assumption states that given a state, the concept-based policies $\pi^c_e, \pi^c_b$ are allowed to differ from the traditional policies $\pi_e,\pi_b$ by at most a quantity of $\beta$ , i.e. $|\pi^c_e(a|c)-\pi_e(a|s)|<\beta$ and $|\pi^c_b(a|c)-\pi_b(a|s)|<\beta$ . This is to ensure that the evaluation policy $π^c_e$ under concepts is reflective of the original policy $π_e$ , while allowing to satisfy additional desiderata depending on the value of $\beta$ . The quantity $\beta$ is defined at the discretion of the practitioner.

Can you provide more background on importance sampling, which seems to be the main approach you extend? It would be useful to have to better gauge the contribution of your work.

Response: We provide background on most Importance Sampling works in the related work section of the paper (lines 88-100), in brevity of space, we will add more background on Importance sampling in the Appendix as an additional related work section.

评论- Author Response to Weaknesses 3,4,5 and Question 1

2024-11-17

Experiments: More detail would be appreciated on the tasks. Please explain what the WindyGridworld and MIMIC problems are, discussion on the state/action space etc. How are the concepts you propose in section 5.2 related to these? How do you assume the data is generated? For example, at least cite what the PPO algorithm is in section 5.2.

Response: Environments: We provide a detailed description of environments in Appendix F. Known Concepts: We provide elaborated details on training tasks in section 5.2. We provide details about the known Windygridworld concepts, true concepts and intervened concepts in Appendix G.2. Unknown Concepts: Training and Hyperparameter details are elaborated in Appendix G.1 Interventions: Experiment details are consumed within Sections 7.1 and 7.2. The data is generated by training an on-policy PPO algorithm for the Windygridworld environment,(we add a citation for PPO in the revised version of the paper), while the MIMIC-III dataset is publicly available with preprocessing steps elaborated in Appendix F.

It would be nice to have more synthetic experiments, besides only WindyGridworld, so we can observe bias, variance, mean squared error and the effective sample size (ESS) on more problems (since MIMIC is not, we can only compute variance -- it is still important to observe this variance metric on real-world data, so please do keep it).

Response: We are experimenting on an additional cancer simulator environment in the next few days, we plan to add the results in the Appendix once we are done.

Algorithm: The algorithm introduced (Algorithm 1) seems to be a fairly significant contribution of the paper. However, it is given very little attention, and as a result I found it difficult to follow. For example, from the discussion in section 6.1 I do not see how the term cti=w⋅f(st) is connected to the algorithm. Please provide a clearer description of the algorithm.

Response: We modify section 6.1 of the paper to better explain the nuances in the algorithm, in the revised version of the draft, lines 316-356. Additional details in comparison to the previous draft are along the lines of explanations of the loss functions concerning the concept desiderata, proximity between the concept policies and the state policies, and choice of OPE-metric (Variance) to directly optimize for the Variance as a metric.

The term $c_t^i=w⋅f(s_t)$ is a design choice, which specifies the concepts are linear functions of the interpretable state features. The weights w are embedded as part of the CBM and is a specific assumption we make to automatically satisfy the interpretability desiderata. Possible alternatives to these could be hierarchical concepts or symbolic representations.

How is learning concepts different from learning representations?

Response: Typically all the literature so far that talks about state-abstractions are of the form of neural embeddings. These neural embeddings which are seen as alternative representations of states have limited interpretability, and hence can’t be used for further downstream tasks. Concepts are also a form of representations, but they can be generalized to capture more important information such as transition dynamics, high variance states, short-term rewards, etc using CBMs as the base architecture. Furthermore, since concepts are compositions of human interpretable functions, they stay interpretable, and thus allow for targetted interventions (Section 7). These targeted interventions allow us to isolate sources of variance, reduce bias, and make the overall OPE estimation better. Traditional representations obtained via clustering or predefined neural embeddings lack this advantage and thus are less suitable for interventions.

评论- Author Response to Weaknesses 1,2

2024-11-17

We sincerely thank the reviewer 795H for the review and comments. We will sequentially address the questions.