Proto Successor Measure: Representing the space of all possible solutions of Reinforcement Learning
Abstract
Reviews and Discussion
The authors introduce the Proto Successor Measure (PSM), a method for solving RL problems with a focus on zero-shot learning. PSM constructs a basis set for all possible solutions in an environment by decomposing state-action visitation distributions into linear combinations of policy-independent basis functions. This allows an agent to compute the optimal policy for any given task by adjusting the weights associated with the basis functions.
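For concreteness, a minimal sketch of the claimed structure in the discrete case, with notation paraphrased from this summary rather than taken from the paper (the paper's exact symbols may differ):

```latex
% Sketch only: notation paraphrased, not the paper's exact statement.
% The Bellman flow constraint on the state-action visitation $d^\pi$ is affine in $d^\pi$
% and does not reference the policy explicitly:
\begin{align*}
  \sum_{a} d^{\pi}(s,a)
    &= (1-\gamma)\,\mu_0(s) + \gamma \sum_{s',a'} P(s \mid s',a')\, d^{\pi}(s',a'), \\
% so every solution lies in an affine set and can be written with policy-independent
% basis functions $\Phi_i$ plus a bias $b$, with only the weights $w^\pi$ varying across policies:
  d^{\pi}(s,a)
    &= \sum_{i=1}^{d} w_i^{\pi}\, \Phi_i(s,a) + b(s,a).
\end{align*}
```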
Strengths
1: The authors claim: "We show that any solution of the Bellman Flow constraint for the visitation distribution can be represented as a linear combination of policy-independent basis functions and a bias". In Section 4 and the appendix, the mathematical reasoning to support this claim looks sufficient and is well-written. However, I do not know whether the authors are the first to prove this. If so, it would be beneficial to point this out in the paper, preferably at the end of the introduction.
Other strengths:
- Interesting idea, and a nice visual comparison to baselines in a maze environment.
- The paper is easy to read through.
Weaknesses
1: This paper looks like it has been created in haste, with nonexistent figure references in line 257 and 279, a relatively large whitespace left on page 9 after the conclusion, and a missing reproducibility statement.
2: The claim: "We show that PSM can produce the optimal Q function and the optimal policy for any goal conditioned task in a number of environments outperforming prior baselines." is not sufficiently proven. First of all, only a maze environment and Fetchreach are used, which does not correlate to "a number of environments". Second, the authors claim to outperform prior baselines. However, performance is only compared in the Fetchreach environment. Adding to this, in Section 5.2 of [Ahmed Touati & Yann Ollivier, 2021], it seems their Forward-Backward (FB) algorithm easily solves the Fetchreach environment. Can the authors comment on this performance discrepancy? Also, for completeness and to say you outperform prior baselines, the authors could at least add the Ms. Pacman environment as done in prior work.
3: Evaluation: No other evaluations or ablations can be found in the very limited Appendix section.
4: There is no real discussion or future work component in this paper, although there is space for it.
Other weaknesses:
- Fig.1: Figure 1 should be trimmed. The part of the image with "Find a Behaviour in a Dynamical System" is not necessary and confusing.
- References: The references are organized very poorly:
- Line 535, Line 540, Line 543 , Line 576, Line 579, Line 581, Line 591, Line 593, Line 599: All these references have no venue.
- Line 611 & Line 616: These are two double references of exactly the same papers.
- For a lot of references that have been published in conferences, only the Arxiv reference is given.
- This correlates with my statement in weakness 1, where it just feels that this paper has not been carefully prepared but has been submitted prematurely.
[Ahmed Touati & Yann Ollivier, 2021] - Learning One Representation to Optimize All Rewards, NeurIPS 2021.
Questions
- Why are there no ablations, and no other quantitative evaluations other than the Fetchreach environment?
We thank the reviewer for their comments on the paper. We are encouraged to see that the reviewer found the idea interesting, well-motivated, and theoretically sound. We address the weaknesses and issues pointed out by the reviewer below:
This paper looks like it has been created in haste, with nonexistent figure references in line 257 and 279, a relatively large whitespace left on page 9 after the conclusion, and a missing reproducibility statement.
We apologize for missing figure references and have corrected them now. The code to reproduce our results can now also be found in supplementary material. We have revised the paper to be more comprehensive based on the reviewer’s feedback.
Evaluation and Ablations
First of all, only a maze environment and Fetchreach are used, which does not correlate to "a number of environments".
We have added 16 continuous control tasks from DM-control for evaluating and comparing PSM to baselines. This complements our discrete domain experiments and establishes that PSM indeed outperforms prior methods on a number of environments and tasks.
Adding to this, in Section 5.2 of [Ahmed Touati & Yann Ollivier, 2021], it seems their Forward-Backward (FB) algorithm easily solves the Fetchreach environment. Can the authors comment on this performance discrepancy? Also, for completeness and to say you outperform prior baselines, the authors could at least add the Ms. Pacman environment as done in prior work.
The Forward-backward algorithm from [Ahmed Touati & Yann Ollivier, 2021] evaluates grid worlds, discretized Fetchreach, and Pacman. Unfortunately, their datasets were never publicly released, contributing to difficulties in reproducing their setup exactly. Our results are based on the datasets we collected by running the exploration algorithm RND on fetchreach environments and enumerating all transitions for discrete maze and gridworlds. We have now added evaluations on 16 more tasks on 4 continuous challenging environments.
There is no real discussion or future work component in this paper, although there is space for it.
We thank the reviewer for their comment; we have added a discussion on limitations and future work in the conclusion section of the paper.
Fig.1: Figure 1 should be trimmed. The part of the image with "Find a Behaviour in a Dynamical System" is not necessary and confusing.
Our intent is to demonstrate the zero-shot nature of RL that arises from the representations learned by PSM, i.e., given any task requested by the user, we can obtain an optimal policy just by searching in the affine polytope represented by PSM. We have made changes to the figure to make this clearer.
References: The references are organized very poorly
We appreciate the reviewers' advice and have updated all the references to point to the published venue as opposed to their arXiv citations.
Why are there no ablations, and no other quantitative evaluations other than the Fetchreach environment?
We have now added evaluations on a larger set of more complicated tasks in the paper. We are running experiments ablating the representation dimensionality of PSM which will be added to the paper before the discussion period ends.
While we have striven to make the evaluations more comprehensive by adding more environments and tasks, we believe that our theoretical contributions might have been overlooked. We hope that the reviewer can reassess our paper in the light of a more thorough evaluation and writing improvements. Please let us know if you have any more questions.
We have summarized the improvements to the paper based on the reviewers' comments as the General Response. We have further added the following additional experiments in Appendix C as requested by the reviewer:
a. Ablation on the dimensionality of affine space: We observe PSM performance to be stable across dimension sizes 32 to 256.
b. Quantitative results on grid-world and discrete maze environments: PSM outperforms the baselines.
As the discussion period is ending soon, it would be great to hear from the reviewer as we have addressed their concerns about the draft. Please let us know if you have any further questions. We would appreciate it if the reviewer could reassess our work in light of these clarifications and additional empirical results.
"We have added 16 continuous control tasks from DM-control for evaluating and comparing PSM to baselines. This complements our discrete domain experiments and establishes that PSM indeed outperforms prior methods on a number of environments and tasks."
- This certainly strengthens the paper.
"The Forward-backward algorithm from [Ahmed Touati & Yann Ollivier, 2021] evaluates grid worlds, discretized Fetchreach, and Pacman. Unfortunately, their datasets were never publicly released, contributing to difficulties in reproducing their setup exactly. Our results are based on the datasets we collected by running the exploration algorithm RND on fetchreach environments and enumerating all transitions for discrete maze and gridworlds. We have now added evaluations on 16 more tasks on 4 continuous challenging environments."
- Thanks for clarifying. If their datasets were never released, it makes sense that their baselines cannot be accurately reproduced.
"*We appreciate the reviewers' advice and have updated all the references to point to the published venue as opposed to their arXiv citations. *"
- Although this is obviously good, the point is that the original manuscript did not look like a version that was ready to send for publication to a top-tier conference. This, together with two clear figure reference question marks, and two pairs of duplicate references, does not give the impression that any of the authors thoroughly inspected the paper before submission.
- Adding to this, the subsequent change in the manuscript from 1 evaluation environment (Fetchreach) to 17 evaluation environments, new experiments and a lot of ablations, seems like a lot of needed changes. To me, it gives the impression that the authors used the full paper submission deadline as an abstract deadline, figuratively speaking. I feel this is unfair to reviewers and not how the ICLR submission process is supposed to work.
- Although I will raise my score to a 5, since the paper is in a much better shape now, I will not yet recommend it for acceptance largely due to the reasons I explain above. However, I will keep an open mind in the reviewer / AC discussion to see if my reasoning is justified, and change my views accordingly.
We thank the reviewer for raising their score.
- We respectfully disagree that our paper was more similar to an abstract at the paper submission deadline. Prior published works [1,2] in similar areas that focused on theoretical insights often demonstrated their empirical results mostly in discrete environments, as we did. We believe our paper contained impactful and novel theoretical insights. We clarify that our initial experiments were on 3 environments (~30 tasks, where tasks are defined by reward functions or goals), and we have subsequently added 4 environments (16 tasks). Since our method is zero-shot RL, a number of tasks are evaluations of a variety of reward functions for every environment. We also clarify that we have added one ablation since the initial draft. In our opinion, the additional experiments do not indicate that the initial experiments were lacking but rather show our commitment to demonstrating that the method is empirically promising on a different set of environments.
- We agree that we missed two references in the final draft, which arose on our part due to a last-minute LaTeX compilation error. Citing arXiv papers instead of the published versions has been adopted by many recently published papers (e.g., Direct Preference Optimization (NeurIPS 2023 best paper): references 11, 18, and 46 point to arXiv despite being older published papers). We still believe in citing the published versions and have updated all our references. We believe it is a bit harsh to penalize us for this, since a number of published papers have grammatical, vocabulary, and arXiv citation errors, and these aspects can be improved with reviewers' comments.
[1]: Diego Gomez, Michael Bowling, Marlos C. Machado; Proper Laplacian Representation Learning, ICLR 2024.
[2]: Lucas N. Alegre, Ana L. C. Bazzan, Bruno C. da Silva; Optimistic Linear Support and Successor Features as a Basis for Optimal Policy Transfer, ICML 2022.
We thank the reviewer for their constructive comments. It seems the score changes proposed by the reviewer might not have been reflected in the main review. We wanted to double check if this was the case from their end.
I first want to clarify the misunderstanding: I did not mean the paper resembled an abstract. I meant that it looked like the original manuscript was unfinished at the submission deadline, in order to still have a chance of improving it afterwards and during the discussion.
The initial experiments were done on 3 environments, of which 2 were simple maze environments. I think that these environments are fitting to support insights, but are generally not accepted as good evaluation environments the way Fetchreach is. This, combined with the limited discussion, future work, and ablations, and quite a few presentation errors which I listed in my review, is what led to my conclusion.
- I have successfully updated the score now.
We thank the reviewer for their comments. We would like to clarify that, through our paper, we introduced a principled representation learning method for successor measures that can represent all policies in the environment. We also used these representations to produce zero-shot RL policies for any downstream task. We had evaluated our method on discrete environments similar to prior works in this direction. During the discussion period, as asked by the reviewers, we experimented on continuous environments to show that our method is not limited to discrete settings.
The paper considers the 'zero-shot learning' problem, where the RL agent learns to understand a given environment and is expected to solve downstream tasks with different reward functions. The authors first show that any valid policy, through its visitation distribution and successor measure, can be represented by some affine combination of basis vectors (in the discrete setting). A corresponding algorithm is proposed to learn such basis vectors and then solve the optimization problem through the form of a Lagrangian dual. A couple of experiments are provided to show its advantage.
Strengths
Originality
- The paper proposed a novel perspective to represent the set of all valid policies using their visitation-based probability measures, which further gives rise to a concise representation of the optimization problem in zero-shot RL.
Quality
- Despite some notational confusion, the theoretical results provide new insights for understanding an arbitrary policy.
Clarity
- The presentation is mostly easy to follow, though some notation and theorems need more clarification.
Weaknesses
Originality
- A bit more summary of baseline methods would better show which parts of the paper are novel and which were already discussed.
Quality
- In Line 368-370, some procedure to recover the optimal policy in the setting of continuous spaces is provided. However, the former theoretical results are based on proof in the setting of finite discrete spaces.
- As implied by Equation (10), the final step is to solve the dual problem; some theoretical analysis of the computational cost could help.
Clarity
- In Equation (1) - (3), a comma or is missing between and .
- In Section A.1, Proof of Lemma 4.1: first, the left bracket is in the wrong place in Line 659; and an explicit definition of matrix would help
- In Section 4, toy example, there is a reference to the derivation of Equation (5), which does not appear in the Appendix.
- Without more explicit explanation of , currently Theorem 4.4. is not easily accessible. Some definitions would help.
- Equation (7) seems to be important in forming the method, however, model densities and other notations used there are not defined.
Significance
- Although the abstract implicitly claims to not make assumptions on the MDP structure, according to both theories and experiments, discrete state and action spaces are focused on in this paper.
Questions
- In the procedure in Section 5.3, is the reward function assumed to be known?
- Are Lemma 4.1 and Theorem 4.2 proposed by the paper or in former literature?
Thank you for taking the time to review our work. We are happy to hear that the reviewer found our work novel and easy to follow. We address the reviewer's questions and concerns below:
A bit more summary of baseline methods would better show which parts of the paper are novel and which were already discussed.
We have added more information about the baselines in Appendix B. To reiterate, our method identifies the affine nature of the Bellman equations and uses it to learn representations for any MDP without imposing additional structure (e.g., HILP uses representations only suitable for goal reaching, and Laplacian representations depend on the policy used to collect the data). It also avoids the instability of Bellman optimality backups (used in FB), which suffer from overestimation in deep RL, by only requiring Bellman evaluation backups.
In Line 368-370, some procedure to recover the optimal policy in the setting of continuous spaces is provided. However, the former theoretical results are based on proof in the setting of finite discrete spaces.
We clarify that our proposed method is not limited to discrete state or action spaces and extends to continuous domains as well. The matrix equations can be written using functional operators, and the linearity of the equation is still maintained. We have updated the paper in Section 4 to reflect this more clearly. We have also added experiments in continuous DM-control domains across 16 tasks to validate that PSM indeed scales to continuous domains.
As implied by Equation (10), the final step is to solve the dual problem; some theoretical analysis of the computational cost could help.
After our core pretraining phases of Sections 5.1 and 5.2, finding an optimal policy given any reward function amounts to solving the constrained linear program of Equation 10. The objective in Equation 10 is linear with affine constraints (thereby a convex optimization problem). These classes of problems are well studied and guaranteed to converge to optimal solutions at a sublinear rate. More importantly, Equation 10 reduces the hard non-convex RL problem that would otherwise need to be solved for each reward function to a simple convex optimization per reward function.
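To make this concrete, below is a toy sketch (not the paper's exact Equation 10) of a convex program of this form, assuming the visitation is parameterized as `Phi @ w + b` with a pretrained basis and bias; here `Phi`, `b`, and `r` are random placeholders, and the real Equation 10 may carry different or additional constraints.

```python
# Toy sketch of the inference step: a linear objective over the weights w with affine
# constraints. Phi, b, r are random stand-ins for the pretrained basis, bias, and the
# user-specified reward; the paper's Equation 10 may differ in its exact constraints.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
num_sa, dim = 50, 8                       # toy |S||A| and basis dimensionality
Phi = rng.standard_normal((num_sa, dim))  # stand-in for the policy-independent basis
b = rng.random(num_sa); b /= b.sum()      # stand-in bias: one particular valid visitation
r = rng.standard_normal(num_sa)           # reward vector defining the downstream task

w = cp.Variable(dim)
d_w = Phi @ w + b                         # affine parameterization of the visitation
problem = cp.Problem(
    cp.Maximize(r @ d_w),                 # linear objective: expected return under d_w
    [d_w >= 0, cp.sum(d_w) == 1],         # nonnegativity; normalization keeps the toy LP bounded
)
problem.solve()
print("optimal weights:", np.round(w.value, 3))
```

Note that `w = 0` is feasible in this toy (it recovers the bias `b`), so the problem is well posed; the point is only that per-reward inference reduces to a small convex program rather than RL from scratch.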
Punctuation and Matrix definition in Proof of Lemma 4.1
We thank the reviewer for pointing this out. We have corrected the punctuation and added the matrix definition in the paper.
In Section 4, toy example, there is a reference to the derivation of Equation (5), which does not appear in the Appendix.
Thanks for catching this. We have updated the paper to point to the derivation of Eq.5 in the supplementary material.
Without more explicit explanation of , currently Theorem 4.4. is not easily accessible. Some definitions would help.
We have added more definitions in Theorem 4.4 to make it easy to understand.
Equation (7) seems to be important in forming the method, however, model densities and other notations used there are not defined.
We have updated the paper in general to increase clarity. The definition for model densities is given in Section 5.1, first paragraph before using them in Equation (7). Please let us know if there are any other notations that are unclear.
Although the abstract implicitly claims to not make assumptions on the MDP structure, according to both theories and experiments, discrete state and action spaces are focused on in this paper
We apologize for the confusion. Our proposed method is not limited to discrete state or action spaces and extends to continuous domains as well. We have updated the paper to make this more clear. Furthermore, we have added evaluation on complex continuous state-action environments, demonstrating that PSM leads to significant improvement in these domains.
In the procedure in Section 5.3, is the reward function assumed to be known?
Yes, sections 5.1 and 5.2 form the core algorithm where we pre-train representations without assuming a reward function. Then, when a user provides us with a task (in the form of reward function), our fast inference procedure can quickly give an optimal policy by solving a constrained linear program.
Are Lemma 4.1 and Theorem 4.2 proposed by the paper or in former literature?
We have made minor modifications in the paper (highlighted in red) to make this clear. Both Lemma 4.1 and Theorem 4.2 are contributions of this work and novel to the best of our knowledge. To increase clarity, we have restructured these into Theorem 4.1 and Corollary 4.2.
The changes to the paper are marked in red. Along with increased evaluation, we have added detailed experimental details and made the writing more clear. Please let us know if you have any further questions. We hope the reviewer can reassess our work in light of these clarifications.
We have summarized the improvements to the paper based on the reviewers' comments as the General Response. We have further added the following additional experiments in Appendix C:
a. Ablation on the dimensionality of affine space: We observe PSM performance to be stable across dimension sizes 32 to 256.
b. Quantitative results on grid-world and discrete maze environments: PSM outperforms the baselines.
As the discussion period is ending soon, it would be great to hear from the reviewer as we have addressed their concerns about the draft. Please let us know if you have any further questions. We would appreciate it if the reviewer could reassess our work in light of these clarifications and additional empirical results.
Thanks for the detailed responses! Please see the following question:
We clarify that our proposed method is not limited to discrete state or action spaces and extends to continuous domains as well. The matrix equations can be written using functional operators, and the linearity of the equation is still maintained. We have updated the paper in Section 4 to reflect this more clearly. We have also added experiments in continuous DM-control domains across 16 tasks to validate that PSM indeed scales to continuous domains.
- First of all, solely the results in discrete settings are already novel. Second, to the reviewer's understanding, the theoretical results have not been generalized enough for continuous cases. For example, the proof of Theorem 4.1 still assumes finitely many available actions. If comparable results for continuous cases are ready to be presented, both the statements in the paper and the proofs in the Appendix need to be carefully reformulated.
In general this paper makes accessible and novel contribution.
We thank the reviewer for acknowledging the novelty of our method and bringing to our attention that the continuous setting can be made clearer.
As we mentioned earlier in Section 4 (lines 247-250 in the current revision), PSM can be easily extended to continuous domains. Now, in the updated Theorem 4.1 and Corollary 4.2, the reviewer can see how, by replacing sums with integrals and matrices with operators, all our theory extends to continuous spaces.
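For instance, a hedged sketch of the continuous-space flow constraint in standard occupancy-measure notation (not necessarily the paper's exact statement), where the sums of the discrete case become integrals and the transition matrix becomes a kernel:

```latex
% Continuous-space analogue (generic notation; the paper's exact statement may differ):
\begin{equation*}
  \int_{\mathcal{A}} d^{\pi}(s,a)\, \mathrm{d}a
    \;=\; (1-\gamma)\,\mu_0(s)
    \;+\; \gamma \int_{\mathcal{S}\times\mathcal{A}} P(s \mid s',a')\, d^{\pi}(s',a')\, \mathrm{d}s'\, \mathrm{d}a',
\end{equation*}
% which is still affine in $d^\pi$, so the basis-plus-bias representation carries over.
```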
Please let us know if more concerns remain. If the reviewer has no more concerns, we request the reviewer to reconsider the score for our paper.
The revised arguments look good and the score has been raised.
This paper presents Proto Successor Measure (PSM), a zero-shot RL method that compresses any MDP to allow for optimal policy inference for any reward function without additional environmental interactions. The method seems technically sound. But experiments are limited.
Strengths
- The proposed Proto Successor Measure (PSM) method combines policy-independent basis functions with biases, enabling it to generate new policies from offline data without additional environment interactions, and allowing it to adapt to different tasks.
- The paper introduces a policy-independent basis function set, Φ, which can be shared across different tasks and policies. Since these basis functions are independent of specific policies, new policies can be generated simply by adjusting the combination weights. This reduces the need for retraining on new tasks, enhancing generalization capability and computational efficiency.
Weaknesses
- The paper cites "Planning with Diffusion for Flexible Behavior Synthesis" for its world model, but this work primarily focuses on using diffusion models for trajectory planning and is not directly related to world models in reinforcement learning. Relevant world-model references could include "Dream to Control: Learning Behaviors by Latent Imagination."
- The paper represents policies as a compact discrete codebook, approximating an infinitely large policy space through a discrete set. While this simplifies optimization, it may also impose limitations. A discrete policy codebook can only represent a finite subset of the policy space, potentially leaving some strategies uncovered and leading to suboptimal results in certain tasks.
- The experimental setup is fairly limited, testing only two basic environments (Grid World and FetchReach), with relatively simple tasks. There is a lack of more complex, high-dimensional environments (such as robot control tasks in Meta-World or Mujoco) to demonstrate PSM's generalization capabilities and applicability. This limitation may affect the persuasiveness of the method's performance in a broader range of environments.
Questions
See previous section.
We thank the reviewer for the thoughtful review. We are glad to see that the reviewer found our method technically sound. We address the reviewer’s concerns below:
References for world models in reinforcement learning
We thank the reviewer for pointing this out and the reference has been fixed now.
While this simplifies optimization, it may also impose limitations. A discrete policy codebook can only represent a finite subset of the policy space
We clarify that while we use a discrete codebook to generate the policies, the number of codes can be unlimited. In fact, given an unbiased random code generator, we can show that our policies encapsulate all possible deterministic policies as the number of codes goes to infinity, and in practical settings the discrete codebook provides an unbiased estimator of the expectation w.r.t. sampled policies.
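As a purely hypothetical illustration of this point (the code-conditioned Q-function below is a made-up stand-in, not the paper's actual parameterization): each sampled code induces one deterministic greedy policy, so enlarging the codebook covers an increasing portion of the deterministic policy set.

```python
# Hypothetical sketch: q_from_code is a stand-in for a learned code-conditioned Q-function,
# not the paper's model. Each sampled code z induces one deterministic greedy policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, code_dim, n_codes = 10, 4, 16, 512
W = rng.standard_normal((code_dim, n_states * n_actions))  # fixed stand-in for learned weights

def q_from_code(z):
    """Toy code-conditioned Q-table Q(s, a; z)."""
    return (z @ W).reshape(n_states, n_actions)

codes = rng.standard_normal((n_codes, code_dim))            # unbiased random code generator
policies = {tuple(np.argmax(q_from_code(z), axis=1)) for z in codes}
print(f"{len(policies)} distinct deterministic policies induced by {n_codes} sampled codes")
```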
The experimental setup is fairly limited, testing only two basic environments (Grid World and FetchReach), with relatively simple tasks.
We have now included an evaluation of 16 continuous control tasks from the DM-control benchmark. These environments are typical for evaluating RL algorithms for continuous state-action spaces and PSM scales without difficulties in these domains.
The changes to the paper are marked in red. Along with increased evaluation, we have added detailed experimental details and made the writing more clear. Please let us know if you have any further questions. We hope that the reviewer can reevaluate our work in light of these clarifications.
The work is much nicer now. I have raised my score to 6.
Thank you for the response and raising the score. We wanted to check if any further clarifications or additions are needed to achieve a higher score.
We thank the reviewer for their time and effort in engaging with us and providing valuable feedback to improve our work. If there are any further questions or concerns regarding our work as the rebuttal period comes to a close, we are more than happy to provide further clarifications.
We have summarized the improvements to the paper based on the reviewers' comments as the General Response. We have further added the following additional experiments in Appendix C:
a. Ablation on the dimensionality of affine space: We observe PSM performance to be stable across dimension sizes 32 to 256.
b. Quantitative results on grid-world and discrete maze environments: PSM outperforms the baselines.
As the discussion period is ending soon, it would be great to hear from the reviewer as we have addressed their concerns about the draft. Please let us know if you have any further questions. We would appreciate it if the reviewer could reassess our work in light of these clarifications and additional empirical results.
In the paper, the authors present a novel approach to unsupervised reinforcement learning. They first present a theoretical foundation for the algorithm by showing that any state-action visitation distribution and any successor measure in a single MDP form an affine set. This motivates their approach - first learn the basis of the affine set (in an unsupervised manner, without a specific policy or reward), and then, given the affine space, the only optimization left given a downstream task is to find the optimal linear representation in the space. The authors present a practical algorithm to achieve both steps and evaluate it on two simple benchmarks and a robotic manipulation benchmark while comparing to two common unsupervised-rl baselines.
Strengths
- As far as I know, the authors are the first to make the observation that all state-action visitation distributions in an MDP can be represented as an affine set. This can have a potential impact on the field of unsupervised-rl in particular and other fields in rl in general.
- The theoretical claims in the paper are well supported and presented in a precise and clear way.
- The suggested algorithm can potentially leverage large offline datasets to effectively learn representations for downstream zero-shot tasks, which can have a significant practical impact.
Weaknesses
- The tasks chosen in the experiments section are quite simplistic, and it is not clear how the method would hold in more complex environments (e.g. with large state space or a more complex transition function).
- The technical details of the proposed algorithm are missing, e.g. how is parametrized, what optimizer is used, etc. Further, the technical details of the benchmarks are also missing, e.g. what is the horizon and reward function of each task (without these it is hard to assess the quality of the success rate), etc.
- The limitations of the proposed approach were not discussed in the paper.
- While the paper is well-written and the ideas are presented clearly, it is hard to keep track of its high-level flow. A high-level flow for Sections 4 and 5 could have helped the readability of the paper.
- A theoretical discussion and analysis of the dimensionality of the affine space is missing in Section 4 (i.e. what affects ).
Questions
- Typos:
  a. Line 432: toform -> to form
  b. Line 279 and Line 257: Add figure numbers
- In line 267 you mention the derivation for the basis vector is in the supplementary material. I did not find such a derivation, is it missing? If so please add one and add an exact reference to it.
- At the end of Section 5.1 you presented a two-player game optimization strategy and suggested simplifying it with the single-player game presented in Section 5.2. Are there any practical benefits of one approach or another? I would suggest incorporating a discussion on this into the paper or adding some sort of an ablation study on that to the empirical experiments.
- How many seeds were used in the experiments in the empirical study?
- You discussed the quality of the Q function based on the visualizations in lines 440-444. Did you also perform a quantitative study on this (e.g. by comparing to the oracle optimal Q function)?
- Did you use the same offline dataset mentioned in the paper for all baselines?
- Did you perform a hyperparameter tuning procedure for your algorithm and the baselines? If so, can you elaborate on this?
- In the robotic manipulation task (Section 6.2), how did you handle a continuous state space? Doesn't that make the basis function continuous?
We thank the reviewer for the detailed review. We are happy to see that the reviewer found the idea of leveraging the property that visitations form affine sets novel, theoretically sound, and to have a potentially significant practical impact.
We address the perceived weaknesses and questions raised by the reviewer here:
The tasks chosen in the experiments section are quite simplistic, and it is not clear how the method would hold in more complex environments (e.g. with large state space or a more complex transition function).
The PSM loss function directly extends to continuous state-action spaces, and we have now added experiments on the DM-control tasks, which represent a set of continuous control environments popular for benchmarking continuous RL agents. We have added evaluation for a variety of tasks defined by general reward functions, as opposed to goal-based tasks.
The technical details of the proposed algorithm are missing, e.g. how is parametrized, what optimizer is used, etc. Further, the technical details of the benchmarks are also missing, e.g. what is the horizon and reward function of each task (without these it is hard to assess the quality of the success rate), etc.
We thank the reviewer for pointing this out. We have now added a detailed summary of all the hyperparameters and experimental details in the appendix.
The limitations of the proposed approach were not discussed in the paper.
We have added a discussion about the limitations of our work in the conclusion section of the paper. The major limitation is that it is hard to ascertain a good dimensionality for the basis functions. Complex MDPs with highly non-linear and non-smooth transition dynamics might require a larger dimensionality, but a dimensionality larger than needed might lead to greater compute requirements and training time. An interesting future direction would be to investigate whether 'validation' can help determine the minimal dimensionality: running offline RL with the learned representations across a set of reward functions and choosing the dimension size that gives the highest average Q-values.
A high-level flow for Sections 4 and 5 could have helped the readability of the paper
Thanks for this suggestion. We have added a high-level flow for Sections 4 and 5.
A theoretical discussion and analysis of the dimensionality of the affine space is missing in Section 4 (i.e. what affects ).
We are currently running ablation experiments on how the dimensionality of affine spaces affects the performance of PSM. We expect to update this before the discussion period ends. A complete theoretical analysis of how the dimensionality of affine spaces affects performance is an interesting future direction and out of the scope of the current paper.
TYPOS
We have addressed this now, thank you!
Are there any practical benefits of one approach or another (two player vs single player)? I would suggest incorporating a discussion on this into the paper or adding some sort of an ablation study on that to the empirical experiments.
Two-player optimization is known to suffer from training instabilities when compared to single-player optimization (e.g. GAN vs Diffusion models, IRL vs. Behavior cloning, etc). Our initial attempts to train using a two-player setup led to performance collapse, and thus, we chose a more stable, principled and performant alternative for presenting our method.
How many seeds were used in the experiments in the empirical study?
We used 5 seeds for our continuous experiments and 3 seeds for our discrete experiments in Section 7 and have updated the paper to make this more clear.
You discussed the quality of the Q function based on the visualizations in lines 440-444. Did you also perform a quantitative study on this (e.g. by comparing to the oracle optimal Q function)?
We are running the quantitative evaluations on gridworlds, and the results will be updated before the discussion period ends.
Did you use the same offline dataset mentioned in the paper for all baselines?
Yes, we have now added a thorough explanation of implementation details in the appendix to make this clear.
Did you perform a hyperparameter tuning procedure for your algorithm and the baselines? If so, can you elaborate on this?
We have added the hyperparameters used for each method in the appendix. We kept the base hyperparameters the same (representation dimensionality, neural network capacity, inference samples, etc.) to ensure fair comparison but used the tuned hyperparameters for algorithm-specific hyperparameters (e.g. PSM inference learning rate). All hyperparameters can be found in Appendix B.3.1.
In the robotic manipulation task (Section 6.2), how did you handle a continuous state space? Doesn't that make the basis function continuous?
We clarify that PSM learns the basis of successor measures or successor visitation distributions, which are made up of continuous quantities. Thus, the PSM algorithm is invariant to discrete or continuous state or action spaces.
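For reference, the successor measure mentioned here is, in standard notation (the paper's exact formulation may differ), the discounted state-action occupancy conditioned on a starting pair, through which the Q-function is linear in the reward; this is why learning its basis is reward- and task-agnostic.

```latex
% Standard successor-measure identity (generic notation; the paper's form may differ):
\begin{equation*}
  Q^{\pi}(s,a)
    \;=\; \mathbb{E}_{\pi}\!\left[\textstyle\sum_{t \ge 0} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s,\ a_0 = a\right]
    \;=\; \int_{\mathcal{S}\times\mathcal{A}} r(s',a')\; M^{\pi}(\mathrm{d}s', \mathrm{d}a' \mid s, a).
\end{equation*}
```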
Please let us know if you have any more questions. We hope the reviewer can reassess our work in light of these clarifications.
We have summarized the improvements to the paper based on the reviewers' comments as the General Response. We have further added the following additional experiments in Appendix C as requested by the reviewer:
a. Ablation on the dimensionality of affine space: We observe PSM performance to be stable across dimension sizes 32 to 256.
b. Quantitative results on grid-world and discrete maze environments: PSM outperforms the baselines.
As the discussion period is ending soon, it would be great to hear from the reviewer as we have addressed their concerns about the draft. Please let us know if there are any further questions. We would appreciate it if the reviewer could reassess our work in light of these clarifications and additional empirical results.
I thank the authors for their amazing efforts during the rebuttal period. I think the added discussions and experiments, especially the DM-control and the limitations discussion made the paper much stronger and answered most of my major concerns. I modified my score.
Thank you for increasing the score and helping us with improving the paper!
We thank all reviewers for taking the time to review our work. We are encouraged to see reviewers point out: the importance of recognizing that visitations form an affine set and its potential impact to the field of unsupervised RL (Reviewer mxVW), that the paper has sound theory with theoretical justification (Reviewer mxVW, PoLf, NimH, nQqb) along with writing that was clean and easy to read (Reviewer mxVW, NimH, nQqb). A common issue raised by reviewers was limited evaluation to support the theory (PoLf, nQqb). We have addressed this limitation in the rebuttal.
The main changes to the paper are highlighted in red in the PDF. While we respond to each reviewer individually, we outline the general improvements to the paper in this comment:
- Continuous control benchmark: We evaluate PSM across 16 tasks in 4 continuous control domains. Along with the previously presented results in discrete environments, our evaluation demonstrates significant benefits from the proposed method PSM. PSM leads to an 11.5% improvement over the best-performing representation learning method (FB).
- Writing improvements: We incorporated advice from the reviewers to correct typos, increase readability, and use references that point to the published versions of arXiv papers. We have further added a section discussing theoretical connections between PSM and successor features.
- Complete experimental details in appendix: We have added complete details for reproducing our experiments in Appendix B.
- Ablation on latent dimension: We have added ablations on the dimensionality of the affine space in Appendix C.
This paper proposes a method for learning features from a set of data that are useful for quickly identifying an optimal policy for a new task (new reward function). The central component of the method is to learn successor features by leveraging the Bellman flow constraints on the state-action visitation distribution, which form a space in which the value function is linear in the successor measure.
The reviewers were generally positive about the paper's intent and proposed contributions. However, there are significant issues of clarity, a large number of experiments added after the paper deadline that could not be properly reviewed, and some serious flaws in the experiments that lead to ambiguous conclusions.
I detail issues not discussed by the reviewers below.
In reference to the paper's clarity, there are many typos, undefined portions, and some false statements. In the intro, the line "However, the eigenvectors from a random policy cannot represent all value functions." is false. If the graph Laplacian is full rank, it can represent any policy's value function. Thus this produces a misleading characterization of the motivation for the paper.
In Theorem 4.4 (main paper) the function r(s+) is not defined. w^pi is used in both definitions of the value function, but these depend on the basis functions \Phi and \Phi^{vf}. These basis functions also lack clear definitions. There is a "." in the expression for V^\pi, which should not be there.
In Theorem 4.4 (appendix) the theorem statement does not match the main body. The proof also makes a special assumption that is never stated nor proved to ever be true. The proof is not easy to follow and has a term \Beta^\pi_i, which is never used. It also makes unclear references such as "The two equations match"; it is unclear which equations these are referencing.
These errors and lack of clarity (in addition to the ones noted by the reviewers) make it difficult to accurately review the paper and be sure there are not other mistakes hidden in the paper.
In terms of the experiments, there are several major errors that prevent accurate interpretation of the results. First, all quantitative results are based on three or five random seeds, and no statistical testing is given to make comparisons. Despite this, the results are bolded to indicate that the proposed method is superior to the others. Second, there is no control over how the hyperparameters were tuned or selected. As a result, the comparison could only be valid for one specific set of hyperparameters, which is of little interest and likely to be biased to favor the new method over old ones. At most, what can be learned from the experiments is that the new method works and perhaps works well with the chosen hyperparameters, but it is unclear if it represents an improvement over existing methods.
As a result of the above, I cannot recommend the paper for acceptance. However, I do recommend the authors address these issues, improve the clarity, and resubmit.
Additional Comments from the Reviewer Discussion
There was limited back-and-forth with the reviewers, and several reviewers adjusted their scores positively. Despite the reviewers' change in stance, there were still major errors in the paper.
As a matter of public record, we are responding to the AC meta-review as there were several new points brought up. We summarize the new reasons the AC provided for their reject decision in spite of reviewer ratings (8, 8, 6, 5) and detail our response to each below.
The AC mentioned that a large number of experiments were added after the submission deadline. These experiments were added in response to the questions asked by the reviewers. All reviewers wanted to see the performance of our method on continuous domains, so we provided a comparison of our method with the baselines on these domains. We believed that our research contribution, the idea of representing all possible solutions of RL, was already strong enough, containing both theoretical and empirical evidence, to be a worthy submission, which the reviewers clearly agreed with given their change in scores.
The AC mentions that there are several major errors in the experiments that prevent accurate interpretation of the results. The AC believes that our method lacks statistical testing against other methods. Additionally, the AC points out that we present our performance comparisons with a specific set of hyperparameters. We believe we followed the standard practice of comparing mean performance across several seeds, which was also used by the baselines [1, 2, 3]. While statistical testing across several hyperparameters is a stronger evaluation, we submit that we should not be penalized for following the standard practice, which is also followed by several of the papers accepted at ICLR 2025.
The AC points out some clarity concerns in Theorem 4.4 of our paper. We thank the AC for their suggestions to further improve the clarity of the theorem. We had modified the theorem statement following suggestions by a reviewer but neglected to modify the corresponding proof in the appendix, which is easily corrected. We can further modify the statement and the proof in the camera-ready version. We would like to point out that while Theorem 4.4 is important to our discussion, it is not the main contribution of the paper. The AC claims that our statement in the introduction about eigenvectors from a random policy not being able to represent all value functions is false. The AC mentions that if the graph Laplacian is full rank, it is possible to represent any policy's value function. Our point is that, for all practical purposes, it is generally not possible to get a full-rank graph Laplacian or to have a random policy that covers the entire state space. Hence, in practical settings, it is impossible for eigenvectors from a random policy to be enough to cover all possible value functions.
The AC makes an incorrect statement in their final remark by saying "there was limited back-and-forth with the reviewers" (while nonetheless noting that "several reviewers adjusted their scores positively"). In fact, we submitted a total of 28 comments during the review process. As you can see from our responses, we were very prompt in replying to reviewers.
We thank the AC for thoroughly evaluating our paper and pointing out typos and concerns in clarity. But, the points raised by the AC are relatively minor, irrelevant to the main contribution, and/or easily addressable in camera-ready.
[1]: Touati, Ahmed, Jérémy Rapin, and Yann Ollivier. "Does Zero-Shot Reinforcement Learning Exist?" The Eleventh International Conference on Learning Representations.
[2]: Park, Seohong, Tobias Kreiman, and Sergey Levine. "Foundation Policies with Hilbert Representations." Forty-first International Conference on Machine Learning.
[3]: Wu, Yifan, George Tucker, and Ofir Nachum. "The Laplacian in RL: Learning Representations with Efficient Approximations." International Conference on Learning Representations.
Reject