Stable Offline Value Function Learning with Bisimulation-based Representations
Abstract
Reviews and Discussion
This paper proposes a new algorithm for representation learning in offline RL, with the goal of making value function learning more stable and, in turn, stabilizing policy optimization by learning better representations. It builds on prior work on bisimulation-based representations and proposes a kernel-based objective to learn similar representations for similar state-action pairs. Experimental results on a few tasks are presented to demonstrate why the proposed KROPE algorithm can do stable and accurate policy evaluation.
Strengths
The paper is well written and derives a kernel-based representation objective from first principles in RL. It motivates policy evaluation as an important measure/metric for analysing the stability of value function learning, and it thoroughly discusses prior work in this widely studied area of learning good representations for online/offline RL. Furthermore, the experiments analyse stability metrics that are important to examine when proposing new representation objectives; Figure 4 is a good example of that. The paper draws insights from prior work and discusses theoretical properties, such as Bellman completeness, that are worth examining when analysing the strength of a new objective.
Weaknesses
There are several key weaknesses that I find concerning in this paper:
- The proposed objective adds to the line of work on representation learning for RL, alongside several recent works that also studied bisimulation-based objectives for control and offline RL. While the paper refers to these past works, few comparisons are made to them, either theoretically or experimentally. The authors start with policy evaluation, discuss metrics related to its stability, and claim to show experimentally that policy evaluation can be done well with the KROPE objective. However, several prior works have done this already, whether implicitly or explicitly reporting policy evaluation measures, and have demonstrated empirically why bisimulation can be a good measure for learning representations for offline RL.
- Following from the above comment, I do not understand exactly what the kernel-based approach buys here. Whether or not we use bisimulation-based representations (as opposed to, say, inverse or forward dynamics models for learning representations), the benefits of kernels are not explained well. Are there new techniques in the deep kernel learning literature that can be exploited here? What happens when we go to large state-action spaces with complex tasks: how well would the kernel-based approach scale?
- Experiments are done on simple tasks where kernel-based representations can presumably do about as well as other objectives. However, the paper does not demonstrate how well the algorithm would scale to larger state-action spaces. I would like to see more thorough comparisons with the large body of prior work, either empirically or theoretically.
- Recent works (e.g., Lamb et al.) have extensively studied inverse dynamics models and claimed that these objectives can learn the true underlying latent dynamics of an unknown environment. In addition, Zang et al. studied and compared some of these inverse dynamics objectives with bisimulation-based objectives in offline RL, and demonstrated the effectiveness of each approach on several empirical benchmarks. How well does the proposed KROPE representation do in light of all these past works? All of these works also implicitly or explicitly demonstrate policy evaluation that is good enough to then do control. I do not think the discussion of policy evaluation and stability here is enough to show that KROPE can learn good policies for control.
Questions
A few questions related to the weaknesses:
- Can the authors explain theoretically why bisimulation with kernel representations can learn the true underlying latent dynamics? How do we know whether inverse or forward dynamics objectives are better or worse than KROPE? What does the algorithm mean in the context of learning the underlying latent dynamics of the world?
- Can the authors compare to past works on bisimulation-based representations, both theoretically and empirically, and show how well KROPE performs and scales in complex control tasks in light of these past objectives?
- I would expect to see more experimental results on more challenging benchmarks; I think the current suite of tasks is not enough to show the effectiveness of the proposed KROPE objective.
- Zang et al., for example, also study bisimulation in the context of noise (similar to Amy Zhang et al.'s past work on bisimulation for control). Can the authors comment on what would happen with KROPE, given its reliance on kernels, in the presence of noise in the environment?
- All experiments are done on raw state-action spaces; there are no tasks with pixel-based observations. I can imagine kernel representations would fail on more complex pixel-based environments. Can the authors comment on the trade-offs here?
- I would like to understand the Bellman completeness claim much better in the context of KROPE, compared to past works. Can the authors comment on this? In short, what is the theoretical guarantee of this algorithm, and how well can we tell that KROPE learns good representations?
- Is the claim here that KROPE learns the true underlying latent dynamics, or is it more about the stability of the KROPE representations? Each claim has its own questions to answer, depending on how the authors would like to move ahead. What is the goal here?
Large-scale environments, image, noisy states: This is a very interesting future direction for us to pursue! Thank you for the suggestion. While we acknowledge that applying our ideas to image-based domains and to states with noisy distractors would be very interesting, we believe that their absence does not detract from the significance of the work, since we evaluate the algorithms in a variety of continuous state-action space environments with a wide variety of datasets. In our work, we investigated the performance of the various algorithms on challenging datasets (7 environments, 13 datasets) that are commonly used for offline policy evaluation, such as various DM Control environments and D4RL datasets (see Appendix). Given that the scope of the paper is focused on the stability properties of KROPE-like algorithms, we believe that the current paper suffices as a step toward that understanding. We also note that while additional steps may be needed to apply KROPE to these alternative settings, there is no fundamental bottleneck for adapting our kernel-based method to images or distractors, since our kernel is defined on the feature space.
Discussion about the kernel and scaling: Below we make remarks relevant to our work but we also encourage the reviewer to see [4] for additional benefits of the kernel in terms of theoretical analysis and scaling to complex domains.
- Benefit of the kernel: The use of the kernel enabled us to talk concretely about positive definiteness when measuring similarity. Intuitively, this property gives insight into the rank of the feature matrices involved, which gives us a sense of how distinct the abstract/latent state-actions are from each other. When designing KROPE, and as done by [4], we are able to characterize the similarity measure (i.e., the KROPE kernel) as a positive (semi-)definite kernel. We used this property in proving the stability of KROPE representations (Theorem 1, see Appendix). Moreover, we also used this property to establish the relation to Bellman completeness (Theorem 2). In short, kernels give us a way to directly control the spectral properties of the dynamics. It may be possible to prove these same properties through other means, but we found the tool of kernels to be helpful in our case.
- Scaling issues of the kernel: There may be some confusion about how the kernel is used, which may be the source of the doubts about its ability to scale. The kernel is applied to feature vectors (it would not be applied directly to raw images). Moreover, the kernel is not fixed but learned, since it is a function of the learned features. In the feature space, the form of the kernel (a linear dot product) is fixed (as is done by [4]), but the objective enables appropriate features to be learned such that the learned kernel captures the similarity relation we care about. Consequently, the use of kernels does not automatically preclude the use of our method in image-based domains.
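To make the two points above concrete, here is a minimal numerical sketch (illustrative only, not our implementation): with a linear dot-product kernel on learned features, the Gram matrix over any set of state-action pairs is positive semi-definite by construction, and training the encoder is what trains the kernel.

```python
import numpy as np

# Stand-in for learned features phi(s, a): in practice these come from a
# trained encoder; here we use a random feature matrix purely for illustration.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 4))           # 10 state-action pairs, 4-dim features

K = Phi @ Phi.T                          # linear (dot-product) kernel on the features
eigvals = np.linalg.eigvalsh(K)
print(bool(np.all(eigvals >= -1e-10)))   # True: the Gram matrix is PSD for any Phi
```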
We now address specific points made in the review.
Weaknesses
- Please see the general comment above of “Goal of our work” and “comparison to prior work”.
- Please see above “Benefits of the kernel”.
- Please see above “Large-scale environments”.
- Please see “Goal of our work”, “comparison to prior work” (“Relations to dynamics model”).
Questions
- This is a great question and we will clarify this in the camera ready. Theorem 2 draws a relation to Bellman completeness. Intuitively, Theorem 2 tells us that KROPE representations are expressive enough to predict the reward and the latent features at the next step. While KROPE does not explicitly model the latent dynamics of the MDP, its features are expressive enough to capture them. We refer the reader to [1, 2, 3] for prior work on the connections between bisimulation, forward dynamics, and Bellman completeness, which are essentially different approaches to the same idea. Please also see the general comment above on "Comparison to prior work" ("Relations to dynamics model").
- Thank you for the suggestion. Please see our response to “comparison to prior work” (“Other bisimulation methods”).
- Please see comments on “Large-scale environments”.
- This is an interesting direction to consider in future work. Given that the KROPE objective specifically tries to capture similarity between state-actions by considering short-term and long-term behavior, where this behavior is only a function of the reward function and transition dynamics, we also suspect that KROPE will discard irrelevant noise that does not impact the reward function and transition dynamics. That is, it will learn feature representations (and corresponding kernel) only relevant to capture similarity in action-values. Note that the kernel in KROPE is actually being learned. We do bias the form (using a linear dot product as done in prior work [4]), but the kernel, which is a function of the learnable representations, is being learned so we expect the features (and indirectly the kernel) to ignore irrelevant noise.
- We would not apply the kernel to the images directly. We suspect that we would have a typical encoder network that projects the image into a vector, and then we would measure similarity on that vector. [4] applies this kernel idea to Atari and shows improvement, so we would expect similar results to hold. However, as mentioned earlier, our focus is showing the stability of these representations, which is a result that theoretically extends to images or state-based environments as long as the features satisfy the relationship in Theorem 1. Please also see above “Discussion about the kernel”.
- Good question and this is related to your first question. Please see our response to Q1 above. We will make this connection clearer in the camera ready.
- The primary goal is to show that if we take a kernel perspective (similarity and positive definiteness) on representations, and if they satisfy the KROPE relationship (a bisimulation-based relation), then the representations are stable and accurate for the offline policy evaluation setting, as we state in the introduction. This stability property of the metric-based algorithm was missing from the literature and our work addresses it. We investigated this specific stability-related question in our experiments on 13 datasets and 7 high-dimensional state-action environments, and through Theorem 1. As an auxiliary result, we also prove Theorem 2, which draws a relation between KROPE and Bellman completeness, another important condition typically considered for stability. Please also see "Goal of our work" above.
[1] Successor Features Combine Elements of Model-Free and Model-based Reinforcement Learning. Lehnert et al.
[2] A Note on Loss Functions and Error Compounding in Model-based Reinforcement Learning. Jiang.
[3] Information-Theoretic Considerations in Batch Reinforcement Learning. Chen and Jiang.
[4] A Kernel Perspective on Behavioural Metrics for Markov Decision Processes. Castro et al.
[5] Scalable methods for computing state similarity in deterministic Markov Decision Processes. Castro et al.
[6] Learning Invariant Representations for Reinforcement Learning without Reconstruction. Zhang et al.
[7] MICo: Improved representations via sampling-based state similarity for Markov decision processes. Castro et al.
[8] Benchmarks for Deep Off-Policy Evaluation. Fu et al.
[9] Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning. Voloshin et al.
[10] A Review of Off-Policy Evaluation in Reinforcement Learning. Uehara et al.
[11] Learning Bellman Complete Representations for Offline Policy Evaluation. Chang et al.
Thank you for appreciating the clarity of our paper and our process of introducing the KROPE objective. We also thank the reviewer for their detailed questions, comments, and feedback. We first make some general statements regarding the overall comments and then address individual comments/questions below.
General comments
In order to clarify any possible confusion regarding the contributions and focus of our work, we make the following general comments.
Goal of our work: The primary objective of our work is to show that the general class of bisimulation-based metric learning algorithms can learn representations that are indeed useful for accurate and stable offline policy evaluation (Lemma 3 and Theorem 1). Our work provides further evidence that bisimulation-like metric learning algorithms have desirable properties. Our goal is not to run a bake-off showing that our bisimulation-based metric learning method is better than other algorithms within this class, nor is it to claim that KROPE learns the true underlying latent dynamics.
Importance of studying only policy evaluation (reference to comment: “I do not think the discussion of policy evaluation and stability here is enough to show that KROPE can learn good policies for control.”): We want to mention that we focus only on offline policy evaluation (OPE) and do not make claims about policies for control. Furthermore, we want to emphasize that studying evaluation and prediction algorithms alone is still a significant contribution, since good prediction is often a precursor to understanding improved control algorithms. While control is important, OPE is important in its own right [8, 9, 10] because: 1) it has practical significance for AI safety and for building trustworthy RL agents, and 2) it is a clean problem for studying how to make better predictions without dealing with the exploration problem or a changing target policy. Of course, exploration and a changing target policy are important, but there is value in removing those aspects of the whole RL problem for study. As such, designing control algorithms for offline RL is not the focus of this work, but it is a very interesting future direction.
Comparison to prior work: We want to address the following:
- “several prior works have done this already, whether implicitly or explicitly showing the policy evaluation measure”: It would be great if the reviewer could point us to the mentioned references so we can give a more thoughtful response on this issue. However, at a high level: just because good control algorithms may hint at good policy evaluation, it does not mean we understand the policy evaluation algorithm well, nor is it clear whether the evaluation algorithm has desirable properties such as stability. Our work attempts to understand this important piece from the view of stability and accuracy.
- Other bisimulation methods: As mentioned in the introduction, KROPE is a representative of the class of bisimulation-based metric learning algorithms. Our work provides evidence that KROPE-like algorithms can have favorable stability and accuracy properties. Furthermore, our aim is not to show that our metric learning algorithm is better than another one. As such, we compare KROPE to non-bisimulation-based algorithms. While prior work has introduced different bisimulation metric learning algorithms, they have typically been limited: for example, they either assume deterministic [5] or Gaussian transition dynamics [6], or are more difficult to analyze theoretically [7]. Castro et al. [4] later introduced a metric learning algorithm that handles general stochasticity in the dynamics, and we built upon it to illustrate our point about the stability and accuracy of this class of methods.
- Relations to dynamics model: This is a very good point. As you mention, it is nice to theoretically show the relationship between KROPE and algorithms that model forward dynamics. This is actually what Theorem 2 tells us. Bellman completeness essentially tells us how well the representation models the dynamics of the MDP in the latent/abstract MDP. Theorem 2 draws the connection between KROPE representations and modeling latent dynamics of the environment. We will make this point clear in the camera ready. Empirically, we compare against BCRL as the representative method for the class of methods that capture the dynamics of the environment since it was introduced as a strongly competitive algorithm specifically for our problem setting of offline policy evaluation [11].
Dear uF25, Please let us know if we have addressed your concerns. Your comments are valuable and we would like to clarify any potential misunderstandings to strengthen our submission. Thanks!
Dear uF25, to ensure there is ample time for discussion before the deadline it would be great to get your thoughts on our response to see if we have addressed your concerns. Thank you!
Dear uF25, thankfully the conference briefly extended the discussion period. We would greatly appreciate it if you could let us know if we have addressed your concerns. Unfortunately, the opportunity to update the paper is no longer available, and hopefully we can keep that in mind during our discussion and in your response to our rebuttal from 12 days ago. Thanks!
Dear uF25, Please let us know if we have addressed your concerns.
The paper introduces Kernel Representations for Offline Policy Evaluation (KROPE), a novel algorithm aimed at stabilizing offline value function learning in reinforcement learning. By utilizing π-bisimulation to shape state-action representations, KROPE ensures that similar state-action pairs are represented similarly, enhancing convergence and reliability. The authors establish theoretical foundations demonstrating KROPE's stability through non-expansiveness and Bellman completeness. Empirical results show that KROPE outperforms non-bisimulation baselines in terms of stability and accuracy.
Strengths
- This paper rigorously proves that KROPE's representations stabilize least-squares policy evaluation and that KROPE's representations are Bellman complete.
- The authors show, through a series of experiments, that the KROPE method can learn offline value functions with both stability and accuracy.
Weaknesses
The experimental part of this paper does not explain how the evaluation policy is obtained, which results in the inability to complete the MSVE calculations.
Questions
- How can we obtain the evaluation policy, which is needed for the calculations in KROPE?
- In the Garnet MDPs domain and the 4 DM Control environments, how can we conduct offline value function learning when we don't have any offline datasets?
- We are interested in understanding the specific implementation of the representation of the (s, a) pair. Specifically, we wonder whether states and actions should be concatenated before being input into the network for representation, or whether they should be input into separate networks and their representations concatenated later. Furthermore, we wonder whether the implementation varies depending on whether the states are image-based or physics-based.
Thank you for appreciating our work, the rigor of our theoretical analysis, and the thoroughness of our experiments.
We address your remarks below.
Weaknesses
In practice, it is common to assume that the evaluation policy is available to us to evaluate it [3]. This is how algorithms such as FQE, BCRL, and KROPE typically work. The challenge, however, is knowing the value of the evaluation policy, since that is what we want to evaluate. In practice, this value is unknown and KROPE does not depend on it. We compute it only for the sake of empirical analysis, by running Monte Carlo rollouts to get an unbiased estimate of the policy's performance (see Appendix C.1), and then computing the error between this value and the output of KROPE and LSPE.
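For concreteness, here is a minimal sketch of this evaluation protocol (assuming a gymnasium-style environment API and a `policy` callable for the evaluation policy; this is illustrative, not our exact evaluation script):

```python
import numpy as np

def mc_policy_value(env, policy, gamma=0.99, n_rollouts=100, horizon=1000):
    """Monte Carlo estimate of the evaluation policy's discounted return."""
    returns = []
    for _ in range(n_rollouts):
        obs, _ = env.reset()                      # gymnasium-style reset
        g, discount = 0.0, 1.0
        for _ in range(horizon):
            action = policy(obs)                  # evaluation policy (assumed callable)
            obs, reward, terminated, truncated, _ = env.step(action)
            g += discount * reward
            discount *= gamma
            if terminated or truncated:
                break
        returns.append(g)
    return float(np.mean(returns))

def squared_value_error(v_monte_carlo, v_estimate):
    """Error reported for analysis: learned estimate (e.g., KROPE + LSPE) vs. MC value."""
    return (v_monte_carlo - v_estimate) ** 2
```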
Questions
- Refer above. Typically, in OPE, we will always know the policy that we want to evaluate (i.e., the evaluation policy) [3].
- In all cases, we do have an offline dataset, which is generated by a behavior policy (or a collection of behavior policies). We use this dataset for offline value prediction. We refer the reviewer to Appendix C.1 for details on how we generated the datasets.
- This is an interesting question. In our work, we concatenate the state and action and feed them into the encoder, similar to the BCRL paper [1]. However, you are right that it is also possible to have separate networks. One work that adopts this approach is [2], where separate encoders are jointly trained with a homomorphism loss. We conjecture that having separate networks has the benefit of capturing more precise symmetries between states and between actions, but it makes the optimization process more difficult since it introduces more hyperparameters and networks. It would be interesting to investigate the performance gains or losses of such an architecture. And no, there would be only a marginal difference in applying KROPE to image-based vs. state-based environments. For state-based environments, the state can be fed directly into KROPE. For image-based environments, we would typically use a CNN encoder to generate vector features, which can then be fed into KROPE.
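A minimal sketch of the two architectural options discussed above (layer sizes and module names are hypothetical and shown only to illustrate the design choice; for image observations, the state pathway's first linear layer would be replaced by a CNN that maps pixels to a vector):

```python
import torch
import torch.nn as nn

class ConcatEncoder(nn.Module):
    # Concatenate state and action first, then encode (the option we use, as in BCRL [1]).
    def __init__(self, s_dim, a_dim, d=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, 128), nn.ReLU(), nn.Linear(128, d))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class SeparateEncoders(nn.Module):
    # Encode state and action separately, then combine (the alternative, as in [2]).
    def __init__(self, s_dim, a_dim, d=64):
        super().__init__()
        self.f_s = nn.Sequential(nn.Linear(s_dim, 64), nn.ReLU(), nn.Linear(64, d))
        self.f_a = nn.Sequential(nn.Linear(a_dim, 32), nn.ReLU(), nn.Linear(32, d))
        self.head = nn.Linear(2 * d, d)

    def forward(self, s, a):
        return self.head(torch.cat([self.f_s(s), self.f_a(a)], dim=-1))

s, a = torch.randn(8, 17), torch.randn(8, 6)
print(ConcatEncoder(17, 6)(s, a).shape, SeparateEncoders(17, 6)(s, a).shape)
```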
[1] Learning Bellman Complete Representations for Offline Policy Evaluation. Chang et al.
[2] Continuous MDP Homomorphisms and Homomorphic Policy Gradient. Rezaei-Shoshtari et al.
[3] Benchmarks for Deep Off-Policy Evaluation. Fu et al.
This paper explores a method for learning representations of state-action pairs that can stabilize the learning of state-action value functions (e.g., for use in reinforcement learning) from offline datasets. The motivating idea is that if state-action pairs with similar values under a policy have similar representations, this will make it easier to learn the true value function. First, the authors adapt the kernel-based π-bisimulation representation proposed by Castro et al. (2023) from state to state-action value functions. Then, they present two novel stability results for this representation: first, they prove that using it for value function learning via least-squares policy evaluation will converge to the fixed point. Second, they show that it is "Bellman complete," an alternative notion of stability. Finally, the authors propose an algorithm for learning this representation from data and show that it is useful for stable value function learning in practice, both in tabular MDP environments and in DeepMind Control Suite and D4RL tasks.
Strengths
The paper is well-written, well-organized, and clear about the scope of the contributions and how they relate to prior work. The notation is clearly defined and easy to follow. The experiments compare the proposed representation learning algorithm (KROPE) with several appropriate baseline methods and convincingly show that KROPE supports stable value learning in standard RL benchmark tasks.
Weaknesses
The scope of the contributions may be a little narrow. I believe the figures could use significant improvement. They are generally hard to read, using only color to differentiate the methods, with thin lines that are difficult to distinguish when the confidence interval is 0. While the learning traces in figure 3 are interesting, I wonder if there might be alternative ways to show these results that are easier to parse, such as summary statistics that capture the degree of instability (similar to figure 4 but focusing on the value error over the course of training in the different environments). For example, in figure 3b the plot is dominated by what looks like a single unstable seed for BCRL-na, which makes it hard to tell what "good" vs "bad" seeds look like for each method.
Questions
- It would be great to have more description for a few of the introduced concepts: why is the notion of generalizability introduced in section 2.3 a notion of generalizability? Why do we think Bellman completeness is a desirable property for stability? Is the condition on the injectivity of the abstract reward function in theorem 2 reasonable in practice?
- I think the first paragraph of the introduction could do a better job of orienting readers who aren't familiar with the idea of value representation learning. What does value function learning look like without special representation learning methods, and how do these methods aim to improve it? I.e., something like mapping state-actions directly to values versus first mapping them to learned features and then to values, making the role of function approximation here more explicit.
- Why is the higher error of KROPE for WalkerStand unsurprising? Lemma 3 discusses that KROPE may be less accurate when the transition dynamics or policy are stochastic, but I believe all the environments (including Walker) are deterministic and the policy should be stochastic across environments.
- In figure 4b, why does KROPE have a lower condition number than BCRL which specifically optimizes for this? The text at the top of page 10 seems to suggest that BCRL achieves a lower condition number even though the values contradict this.
- The description of FQE in section 4.1 is a little confusing. A cartoon of the network emphasizing the last layers and the representation vs. value function outputs might be helpful.
- I think terms like "one-step similarity" and "future similarity" would be more clear than "short-term" and "long-term" in equation (3) and definition 3.
- Minor issues: mismatched bracket at the end of equation (2); strange spacing around the sampled-from symbols beneath equation (3); FQE acronym was used without being defined yet in section 2.4.
Thank you for the feedback, for your kind words appreciating the clarity of our work, and for acknowledging the thoroughness of our experiments in demonstrating KROPE's utility.
We address your comments below.
Weaknesses
Regarding the narrow scope: We want to emphasize that the stability of temporal difference learning algorithms is a fundamental problem in RL. Thus far, it was unclear whether bisimulation-based metric learning algorithms (instances of TD-like algorithms) produced representations that were indeed stable, since prior work focused only on generalization properties. Our work fills this gap and, through our theoretical and empirical analysis, shows that these algorithms do indeed learn stable and accurate representations for OPE.
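To make concrete what stability means here, below is a textbook-style sketch of LSPE(0) with fixed features on a random tabular MDP (illustrative only, not our experimental setup): whether the iteration converges hinges on the spectral radius of the induced update matrix, which depends jointly on the features and the off-policy data distribution. Our analysis asks when bisimulation-based features keep this iteration stable.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, d, gamma = 20, 5, 0.95
P = rng.dirichlet(np.ones(n_states), size=n_states)    # target-policy transition matrix
R = rng.normal(size=n_states)                           # reward vector
D = np.diag(rng.dirichlet(np.ones(n_states)))           # off-policy data distribution
Phi = rng.normal(size=(n_states, d))                    # fixed feature matrix

A = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D)         # weighted least-squares projection
M = gamma * A @ P @ Phi                                 # LSPE(0) iteration matrix
rho = np.max(np.abs(np.linalg.eigvals(M)))
print("spectral radius:", rho)                          # < 1  <=>  the iteration is stable

w = np.zeros(d)
for _ in range(50):                                     # w <- argmin_w ||Phi w - (R + gamma P Phi w_k)||_D
    w = A @ (R + gamma * P @ Phi @ w)
print("weight norm after 50 iterations:", np.linalg.norm(w))  # typically blows up if rho > 1
```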
Your comments regarding the clarity of the graphs make sense. We will update this for the camera ready. Thank you for the feedback.
Questions
- This completely makes sense. We will make this clear. To answer your questions briefly: a) the chosen notion of generalizability is quite common (see Section 4.2 in [1] and Figure 1 in [2]). Intuitively, if state-actions with similar representations had different action-values, performance would suffer because the action-value function would be confused about what to output for near-identical inputs; such state-actions should therefore not have the same representation (an example of bad generalization). b) Bellman completeness is commonly assumed in theoretical work on data-efficient policy evaluation [3]; a short formal statement is given after this list. Intuitively, it means that the representation captures the dynamics of the MDP perfectly in the latent MDP, which is naturally sufficient for accurate evaluation. c) It may not be reasonable in practice, but as shown in the experiments, this does not detract from the utility of KROPE. Theorem 2 shows that, under some circumstances, KROPE is related to this alternative notion of stability.
- Thanks for the suggestion. We will make this clearer. We refer the reviewer to Figure 1 in [4] for a nice illustration of the general idea of representations for value function learning. But briefly, yes, it is the contrast you suggested: mapping state-actions directly to values versus first mapping them to learned features and then to values.
- Yes, since the policy is stochastic in those environments, Lemma 3 tells us that two state-actions with similar representations may have different action-values, which would hurt performance.
- While BCRL does optimize for the condition number, it does so through a scalar regularization coefficient [3], so the result depends on hyperparameter tuning. KROPE, on the other hand, obtains favorable condition numbers without introducing any new objective or hyperparameters. We did a hyperparameter sweep for BCRL and reported the best result we got in terms of OPE error. At the expense of higher error for BCRL, it is possible to lower the condition number by increasing this coefficient.
- We will make the description of FQE clearer in the camera ready.
- This makes sense too.
- Thanks for the suggested edits! We will revise these accordingly.
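For reference (regarding point (b) above), the notion of Bellman completeness we have in mind is the standard one from the offline RL literature (e.g., as used in [3]): a function class is Bellman complete for the evaluation policy if applying the Bellman operator never leaves the class,

$$
\forall\, Q \in \mathcal{F}:\quad \mathcal{T}^{\pi} Q \in \mathcal{F},
\qquad \text{where} \qquad
(\mathcal{T}^{\pi} Q)(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a),\; a' \sim \pi(\cdot \mid s')}\!\left[ Q(s',a') \right].
$$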
[1] Learning Dynamics and Generalization in Reinforcement Learning. Lyle et al.
[2] Metrics and continuity in reinforcement learning. Le Lan et al.
[3] Learning Bellman Complete Representations for Offline Policy Evaluation. Chang et al.
[4] On the Generalization of Representations in Reinforcement Learning. Le Lan et al.
The paper studies the stability of bisimulation-based representations in the context of offline RL, specifically as they serve to stabilize the LSPE algorithm and guarantee Bellman completeness. To this end, the authors introduce a state-action similarity kernel, called KROPE, based on Castro et al. (2022) and Castro et al. (2023), in the spirit of bisimulation for state similarity. KROPE representations are shown to theoretically guarantee off-policy evaluation (OPE) stability for LSPE and to yield Bellman-complete representations. Empirically, KROPE is compared to various alternatives in terms of mean-squared value error and various proxy measures aimed at analyzing stability in more depth. The empirical results verify that KROPE representations indeed provide stability benefits for OPE.
Strengths
- I found the paper to be very well written and a breeze to read.
- The paper is clear about what the scientific questions being investigated are (L74), what the contributions are and where the key novelties lie. For example, I appreciated the forthcoming note in L215 that the properties discussed in Sec. 3.1 are largely due to prior work, but Sections 3.2, 3.3, and 4 provide novel insights.
- Given the clear scope of study (stability benefits of the bisimulation-like KROPE kernel in OPE) -- although somewhat narrow -- the theoretical results seem sound and provide clear answers (Theorems 1 and 2).
- Prior work is discussed clearly and credited appropriately to my knowledge.
- Experimental setup seems sound, providing additional insights and confirming theoretical results.
- The empirical results show benefit of the proposed KROPE kernel for OPE, which may be of practical interest for a broader audience.
Weaknesses
I found the discussion and the empirical study around generalization to be somewhat limited. While stability and realizability are important and seem to be at the center of this paper, bisimulation-based representations have attracted a lot of interest from the community due to their generalization capabilities (e.g., via invariance to distractors in the environment). Pearson correlation between value difference and orthogonality (Fig. 2c) is a useful proxy, but I would have expected to see more direct and thorough experiments in settings where bisimulation-like algorithms shine, e.g., value error over "The distracting control suite" by Stone et al. (2021).
Questions
A few questions, suggestions & typos:
- L121: is the number "of" state-actions in the dataset...
- L156: clarifying what "native features of the MDP" means may improve readability
- Proposition 1: Including the sampling distribution in the expectations (like in Eq. 1) would help the reader make the connection to how the stability of LSPE depends on the distribution shift between the data distribution and the distribution induced by the target policy.
- L373-374: I'm not sure about the higher chance of instability in high dimensions due to the covariance matrix. This seems to require some assumptions on the distribution of the learned features, which I am not sure these representations necessarily satisfy. Hence, the sentence sounds speculative to me.
- L367-374: To what extent is the degradation in stability with increasing dimensionality related to (4) being better satisfied? The difference between LHS and RHS of (4) in practice seems like an important proxy that could be studied in the Appendix.
- Fig. 3: Legend placement is awkward.
- L486: As "expected"
- L487: Reads like BCRL-EXP-NA achieves lower condition number than KROPE, which doesn't seem right to me (Fig. 4b).
We sincerely appreciate your very kind words and thank you for your positive assessment of our work. We are glad that you found the paper clear in its contributions and thorough in its empirical and theoretical analysis. We agree with you that our work will be of interest to a general audience.
We address your remarks below.
Weakness
This is a valid remark. Thank you for the suggestion. As you point out, most prior work already focused only on the generalization properties of bisimulation-based methods. However, there was still a gap in the literature of whether this class of methods produced representations that were stable and accurate for OPE. As such, our focus was mostly on addressing this gap. Our work showed that these representations are indeed stable and accurate for OPE through theoretical and empirical analysis.
Questions
Thanks for all the editing suggestions. We will make these changes for the camera ready.
Regarding Q4: This is a fair point. We made the statement based on what we observed in those experiments, but we will clarify this general speculative point in the camera ready. Thank you.
Regarding Q5: This is a good question. Your suggestion of studying the difference between the LHS and RHS is an interesting first step and worth investigating in future work. We think the actual analysis may be more complicated, since one way to measure the LHS/RHS difference is through the loss function, but, as noted by prior work, loss functions based on target networks and bootstrapping (such as the TD/FQE loss) tend to correlate poorly with the metrics we actually care about, such as value error [1].
Regarding Q8: Ah yes. We meant that it (gray line) produces a lower condition number than BCRL-NA and FQE (orange and yellow). We will clarify this in the camera ready.
[1] Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error. Fujimoto et al.
We thank all the reviewers for putting in the effort in reviewing our paper, for appreciating the significance of our empirical and theoretical analysis, and for appreciating the thoroughness of our work. We address each reviewer individually below.
This paper proposes learning a stable representation for least-squares value iteration by leveraging a bisimulation-style representation. However, my primary concern lies with the definition of the KROPE kernel (Lines 231--232), which seems fundamentally flawed. The main issue is that the successor samples of the two state-action pairs being compared, call them x and y, are drawn independently, so the metric fails to effectively capture the similarity between the transitions of x and y. In fact, it is possible for the kernel value between x and y to be larger than the kernel value between x and itself, even when x and y have very different transitions. For example, suppose x transitions to a reward-one state with probability 0.6 and to a reward-zero state otherwise, while y transitions to the same reward-one state with probability 1. In this case, the proposed kernel can return a higher similarity between x and y than between x and itself.
To put it plainly, the KROPE kernel can assign higher similarity scores to state-action pairs that are very different than to pairs that are identical. This undermines its validity as a bisimulation metric from the outset, and, as a result, the subsequent results and claims built upon it seem unsubstantiated.
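For concreteness, one way to read the numerical example above (under the assumed simplification that similarity is estimated as the probability that independently drawn successor states coincide; the paper's actual kernel also accounts for reward similarity):

$$
k(x,x) = 0.6^2 + 0.4^2 = 0.52,
\qquad
k(x,y) = 0.6 \cdot 1 = 0.6 > 0.52,
$$

so the stochastic pair x is scored as more similar to the very different pair y than to itself.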
Additional Comments on Reviewer Discussion
Reviewer uF25 expressed concerns regarding the practical effectiveness of the proposed method and the lack of comparisons with prior works on representation learning using bisimulation-style objectives. This concern was not well addressed during the rebuttal phase. Nonetheless, the decision was made mainly based on the flaw articulated in the meta review.
Reject