Towards Robust Zero-Shot Reinforcement Learning
BREEZE is a new zero-shot reinforcement learning framework that improves the stability, expressivity, and performance of pre-trained generalist policies, achieving state-of-the-art results on benchmark tasks.
Abstract
Reviews and Discussion
This paper introduces several improvements to the FB algorithm for zero-shot RL. Specifically, they use attention for F and B, an additional value function to stabilize training, a diffusion model for the policy, and an IQL-style constrained update for the value function. They demonstrate that basic FB does not scale with additional parameters. Furthermore, their algorithm outperforms FB and slightly outperforms HILP in terms of downstream return.
Strengths and Weaknesses
Strengths:
- The visuals look good and are useful.
- Their explanations of what they did are clear.
- Their experiments are thorough: many environments, seeds, reasonable baselines.
Weaknesses:
- I think this paper is missing the why. Why is attention useful for F and B? Does some other aspect of attention explain the improved performance? It could be layer norm or residual connections explaining this difference. It is unclear to me why attention itself is useful here, and the authors do not motivate this other than to point to empirical results.
- In this setting, regularizing towards the exploratory policy may not be a good idea. The exploratory policy is likely not optimal for a downstream reward function. Therefore we want to deviate from the exploratory policy in order to maximize reward. While this does run the risk of OOD model updates, it seems necessary to transfer. In contrast to normal offline RL, we are trying to achieve a different objective from the offline data's policy. Why is staying close to the exploratory policy a good design choice?
- The connection to negative values output by FB is not fully expanded on or solved directly. They mention this is a problem for FB, and that their method outputs negative values less often, but still does. If the only argument is that negative values are a symptom of approximation error, and therefore they have less approximation error, I think this argument is only somewhat convincing. I also think the discussion of large magnitude values being an issue for FB may not be correct.
Minor:
Notation at line 85 is weird. You use big S_+ in equation 1, but use little s_+ in S in line 86.
Typo Line 88
Typo line 109
Typo figure 2
Line 170: by solving the following dataset-constrained problem
Line 204: typo
Line 216 typo
Questions
- In section 3, you mention that FB distributions "exhibit large absolute values". But the measure is effectively a sum of distributions, and if the transition function is deterministic, the measure for the next state should be a Dirac, i.e., infinity. In other words, how do you know that these values are too large? Infinity is potentially a correct value, although I agree this causes substantial problems in practice.
- Equation 4 and 5: By incorporating E(BB^T), you are effectively projecting the reward function corresponding to z onto the basis B via least squares. Therefore, B is effectively learning a linear span of reward functions. Please comment on this design decision. I think this will change what B represents.
- Equation 6: z is an input to V_pi_z on the left, but is maxed over on the right. I think you mean to only max over a in A?
- Equation 6: How do you know mu(a|s)? Are you assuming access to the exploratory policy? If not, are you estimating this? You cannot use the empirical distribution directly because you will never visit the same state twice in a continuous setting.
- Why do you use attention based architecture when the inputs are only 1 or 2 tokens? Especially in the case of 1 token, this is not intuitive because there is nothing else to attend to. How do you know which part of the transformer is necessary here? Maybe only layer norms are needed, dropout, residual connections etc. It is not clear why attention is useful in this case.
- For figure 11,12,13, why does HILP only train for half the time?
- From the training curves, this algorithm seems to converge asymptotically to a similar place as other algorithms. I see the primary advantage here in terms of convergence speed. However, there are many expensive components added, including attention, diffusion, and the additional learned value function. How does your approach compare to others in terms of compute speed?
- What happens if tau=1? From line 174 it seems like it should cause problems, but the trend of the ablation suggests it may perform well.
Limitations
Largely this work needs ablations to help explain the "why" of this paper.
- This work needs ablations on what about attention actually improves things over vanilla NNs
- This work needs an ablation showing the compute time relative to prior works. Many of the components are expensive: Attention, diffusion, extra value function. How do the training curves compare when the x axis is training time rather than update steps? It is possible the improved performance is due to using significantly more compute per gradient step.
- This work needs an ablation on the IQL value loss. What happens if you train the value function using a typical TD-type loss instead? Is the IQL loss what improves performance, or the use of an additional value function?
Final Justification
Some of my questions were answered, e.g. on IQL, tau, regularization, mu(a|s). However, my concern about why attention should be used here is unresolved, as no motivation is provided other than performance, where there are many other design choices at play. I also find the arguments about the distribution of M to be unconvincing because an architecture could enforce only positive values of M without making other improvements. In any case, their approach still has negative values of M. Lastly, the 5x increase in training time over FB is very concerning given that this paper relies entirely on empirical evidence. This is an extremely important point in my opinion, since this detail is effectively being swept under the rug, as training time does not appear in any plots.
Formatting Concerns
Minor: The limitations section is in the appendix. It is unclear from NeurIPS's website if this is allowed or not. If not, they can simply move it to the main body for the final submission (since they are given one more page). The area chair should clarify this if accepted.
We thank the reviewer for the constructive comments. Regarding the concerns of the reviewer 5D2W, we provide the following responses.
Weakness 1 & Question 5 & Limitation 1 [Comments regarding the use of attention architecture]
- We provided a discussion in Section 3 to describe why we need to deprecate standard MLP architectures in FB. This is primarily because learning a successor measure for all possible tasks is actually an extremely complex task, which requires sufficient expressivity for modeling. Moreover, as we have shown in Fig.2, simply scaling up MLP parameters does not provide performance improvement; by contrast, using an attention-based architecture greatly boosts the performance.
- The reason we choose attention-based architecture in our design is threefold:
- Attention mechanism and the derived transformer architecture have already demonstrated great success in terms of modeling capability and scalability. We want to build our model based on a well-verified architecture.
- As illustrated in Fig. 3, the attention mechanism naturally favors capturing the complex dependencies among different dimensions of state, action, and task vectors.
- In our preliminary exploration, we actually tested other network architectures, such as BRO [1] and OFENet [2]. We found that these structures do not provide meaningful performance improvement over the MLP-based FB architecture. Due to the page limit, we only reported the final architecture that works (i.e., attention-based architecture in Fig. 3), rather than these failed attempts.
- Regarding the concern from the reviewer that "the inputs are only 1 or 2 tokens". Each "token" here actually refers to a 512-dim embedding vector encoded by the encoder. It is different from the typical discrete token used in NLP tasks. We apologize for the inaccurate description in our initial submission and have revised it in our final paper.
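- For concreteness, below is a minimal, hypothetical sketch of how a state-action embedding and a task embedding can be treated as two tokens for self-attention. The module choices, the 512/8/50 dimensions, and the class name are illustrative assumptions and do not reproduce our exact architecture in Fig. 3.

```python
import torch
import torch.nn as nn

class TwoTokenForwardNet(nn.Module):
    """Illustrative sketch (not the exact BREEZE architecture): embed (s, a) and z
    as two 512-dim tokens, mix them with self-attention plus a residual connection
    and LayerNorm, then map to a forward representation of dimension out_dim."""

    def __init__(self, state_dim, action_dim, z_dim, embed_dim=512, out_dim=50):
        super().__init__()
        self.sa_embed = nn.Linear(state_dim + action_dim, embed_dim)
        self.z_embed = nn.Linear(z_dim, embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)
        self.head = nn.Linear(2 * embed_dim, out_dim)

    def forward(self, s, a, z):
        sa_tok = self.sa_embed(torch.cat([s, a], dim=-1)).unsqueeze(1)  # (batch, 1, 512)
        z_tok = self.z_embed(z).unsqueeze(1)                            # (batch, 1, 512)
        tokens = torch.cat([sa_tok, z_tok], dim=1)                      # (batch, 2, 512)
        mixed, _ = self.attn(tokens, tokens, tokens)                    # the two tokens attend to each other
        mixed = self.norm(mixed + tokens)                               # residual + LayerNorm
        return self.head(mixed.flatten(start_dim=1))                    # F(s, a, z)
```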
Weakness 2 and Question 4 [Regularizing towards the exploratory policy & how to get $\mu(a|s)$]
- It should be noted that the original FB objective (Eq. (3)) suffers from similar OOD error exploitation issues as in offline RL, an issue we have discussed in the last paragraphs of Sections 2 and 3. Without proper OOD regularization, FB policy learning could suffer from serious bias and learning instability. Some recent FB-based methods like MCFB/VCFB (Ref [7] in our paper) have also recognized this issue and used a much more conservative CQL-based regularization to restrict the learned policy from deviating too much from the data distribution (i.e., the exploratory policy mentioned by the reviewer).
- By contrast, as described in Section 4.1, we adopt an implicit value/policy regularization scheme in BREEZE: through proper reformulation (e.g., leveraging expectile regression in Eq. (7), and using the closed-form optimal policy in Eq. (9)), we do not need to explicitly learn the exploratory policy in the dataset or enforce an explicit policy constraint; instead, we can directly use dataset samples to implicitly enforce behavior regularization. This scheme is called in-sample learning in the offline RL literature [3, 4], which is less conservative than the explicit policy constraint or the CQL-based scheme used in MCFB/VCFB, while also providing better performance and learning stability.
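- For reference, a minimal sketch of the expectile (IQL-style) value objective that underlies this in-sample scheme is shown below; the function name and the default $\tau$ are illustrative, and this is the generic form from [3], not the exact Eq. (7) in our paper.

```python
import torch

def expectile_value_loss(q_target: torch.Tensor, v_pred: torch.Tensor, tau: float = 0.7) -> torch.Tensor:
    """Asymmetric squared loss: with tau > 0.5, residuals where the value sits below
    the in-sample Q target are penalized more, pushing V toward an upper expectile of Q
    while only ever querying dataset state-action pairs (no OOD actions)."""
    diff = q_target.detach() - v_pred
    weight = torch.abs(tau - (diff < 0).float())   # |tau - 1{diff < 0}|
    return (weight * diff.pow(2)).mean()
```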
Weakness 3 & Question 1 [Negative values in FB]
- In Section 3, for the successor measure $M$, as it is an expected discounted probability (see Eq. (1)), it should always be positive. But as we have shown in Fig. 1 as well as more results in Fig. 8-10 in the appendix, both FB and MCFB can produce large negative values for $M$, which is obviously problematic.
- As for $Q$, by the definition of the value function in RL, for a reward function with range $[r_{\min}, r_{\max}]$, the theoretical Q-value should fall within the range $[r_{\min}/(1-\gamma), r_{\max}/(1-\gamma)]$. In the Walker task in Fig. 1, the actual task reward is in the range $[0, 1]$, so we should not expect any negative Q-values. However, similar to the case of $M$, we again find FB and MCFB produce problematic Q-values with large negative values. By contrast, BREEZE produces much more reasonable Q-value estimates with proper scaling.
- The underlying cause for this phenomenon is that the FB learning objective Eq. (3) completely depends on the bootstrapped update based on the learned $F$ and $B$, without any external scale supervision. This is different from the typical Bellman update used in standard RL, where reward values provide external scale supervision. Hence, the quality of the $F$ and $B$ estimates becomes essential for unbiased FB learning, which requires sufficient model expressivity and careful handling of OOD error exploitation. These are exactly what BREEZE addresses, and the improvement in the final empirical results demonstrates the effectiveness of our design. (The standard definitions behind this argument are restated below.)
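- For reference, the two standard facts this argument relies on can be stated as follows (generic definitions; the exact notation in our paper may differ slightly):

```latex
% The successor measure is a discounted sum of probability measures, hence non-negative:
M^{\pi_z}(s_0, a_0, X) = \sum_{t \ge 0} \gamma^{t}\, \Pr\left(s_{t+1} \in X \mid s_0, a_0, \pi_z\right) \;\ge\; 0 .

% For a bounded reward r \in [r_{\min}, r_{\max}], the Q-value is bounded accordingly:
\frac{r_{\min}}{1-\gamma} \;\le\; Q^{\pi}(s, a) = \mathbb{E}\Big[\sum_{t \ge 0} \gamma^{t}\, r(s_{t+1})\Big] \;\le\; \frac{r_{\max}}{1-\gamma} .
```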
Weakness 4 [Minor typo] We sincerely apologize for the typos and thank the reviewer's correction. We will revise these issues pointed out by the reviewer in the final version of our paper.
Question 2 [Question regarding Eq.(4) and (5)] The incorporation of $\mathbb{E}[BB^\top]$ is introduced in the original FB paper [5] rather than proposed by us, and is also adopted in all later FB-based ZSRL works like [6, 7]. Actually, in FB methods, $B$ is designed as a function that maps a reward to a task vector (i.e., $z_R = \mathbb{E}_{s\sim\rho}[r(s)\,B(s)]$), hence $B(s)^\top z$ corresponds to the reward estimate based on task vector $z$. For more details, we refer the reviewer to the original FB papers [5, 6].
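As a hedged restatement of this convention (one common form in the FB literature [5, 6]; the exact normalization in our implementation may differ):

```latex
z_R = \Big(\mathbb{E}_{s \sim \rho}\big[B(s) B(s)^{\top}\big]\Big)^{-1}\,
      \mathbb{E}_{s \sim \rho}\big[r(s)\, B(s)\big],
\qquad
\hat{r}(s) = B(s)^{\top} z_R ,
```

i.e., $z_R$ is the least-squares projection of the reward function onto the span of $B$, and $B^{\top} z_R$ is the induced linear reward estimate, which is consistent with the reviewer's reading.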
Question 3 [Typo in Eq. (6)] We apologize for this typo. Yes, we only max over $a \in \mathcal{A}$. We will remove $z$ from under the max operator in our final version.
Question 6 [Half-time training on HILP] We apologize that the HILP curves only cover half of the training steps. We used the official codebase with default environment settings and hyperparameters but encountered an environment error around 600k training steps in every domain. Currently, we are not fully clear about the cause of this problem and are still investigating.
To ensure the correctness, we rigorously checked HILP's original reported results when plotting curves and recording scores, confirming that our obtained peak performances are consistent with the results reported in the HILP paper.
We will continue fixing this issue and try to obtain the complete curve of HILP in the final version of our paper.
Question 7 & Limitation 2 [Concern on performance and computation speed]
- We would like to clarify that BREEZE outperforms all baselines across most domains and datasets, as shown in Table 5 in the Appendix. Notably, it achieves strong performance simultaneously on all tasks within each domain.
- Moreover, if the reviewer checks Fig. 5, BREEZE converges much faster than other baseline methods, typically reaches good performance within just 100k training steps, and enjoys stable convergence. By contrast, other baselines need much larger training steps (typically about 300k-500k steps), and often suffer from unstable convergence.
- To demonstrate a more detailed computation cost vs performance trade-off, we report the training time per 100k steps and the corresponding aggregated scores in the following table:
| Methods | Training Time (h) per 100k Steps | IQM Walker (100k Steps) | IQM Jaco (100k Steps) | IQM Quadruped (100k Steps) | Aggregate IQM Score (100k Steps) |
|---|---|---|---|---|---|
| FB | 0.4 | 448±12 | 17±2 | 270±15 | 245±6 |
| VCFB | 1.2 | 373±15 | 18±2 | 300±27 | 231±10 |
| MCFB | 1.2 | 415±16 | 17±1 | 270±20 | 234±8 |
| HILP | 0.75 | 509±9 | 30±1 | 394±15 | 311±6 |
| BREEZE | 2 | 615±7 | 45±3 | 542±11 | 404±4 |

- The scores reported for each domain are averaged over five random seeds and all datasets. The aggregate score denotes the overall performance by averaging all domains in ExORL. Based on the above table and the learning curves in Fig. 5, we can observe that BREEZE reaches much better performance with a reasonable increase in training cost.
Note: each experiment was conducted on a single NVIDIA A6000 GPU.
Question 8 [What happens if $\tau = 1$] When $\tau = 1$, expectile regression effectively discards all data, equivalent to learning nothing. In our experiments, we used $\tau = 0.99$ for all ExORL tasks, and $\tau = 0.7$ for all Kitchen tasks (see Section B.7 in our Appendix).
Limitation 3 [Lacking ablation on IQL loss]
- Adding behavior regularization directly on a TD-type loss would require adding explicit policy regularization, i.e., adding a divergence penalty $D(\pi \,\|\, \mu)$ to avoid OOD action exploitation, where $D$ is some divergence function. As we have explained in the response to Weakness 2, this will require learning an explicit behavior policy $\mu$ and can be more conservative. We refer the reviewer to the in-sample learning offline RL papers [3, 4] for more detailed information regarding the advantage of using an IQL-style loss in the offline RL setting.
References
[1] Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłos. Bigger, Regularized, Optimistic: scaling for compute and sample-efficient continuous control. NeurIPS 2024.
[2] Kei Ota, Devesh K. Jha, and Asako Kanezaki. Training Larger Networks for Deep Reinforcement Learning.
[3] Kostrikov, Ilya, Ashvin Nair, Sergey Levine. Offline reinforcement learning with implicit q-learning. ICLR 2022.
[4] Xu, Haoran, et al. Offline RL with no ood actions: In-sample learning via implicit value regularization. ICLR 2023.
[5] Ahmed Touati, Yann Ollivier. Learning One Representation to Optimize All Rewards. NeurIPS 2021.
[6] Ahmed Touati, Jérémy Rapin, Yann Ollivier. Does Zero-Shot Reinforcement Learning Exist? ICLR 2023.
[7] Jeen, Scott, Tom Bewley, Jonathan Cullen. Zero-shot reinforcement learning from low quality data. NeurIPS 2024.
Thank you for your comments. For the following reasons, I will keep my score:
- I am not convinced by the argument about why attention is necessary here. Attention has proven to be an excellent architecture for sequence to sequence modeling, or even set-based modeling if positional encodings are not used. In this case, the inputs to F and B are always the same (s,a,z or s'). I am not aware of any works claiming that attention is a superior architecture for non-sequence modeling problems.
- The argument about the distribution of M is not convincing to me. One could easily design an architecture that never has negative values of M, e.g. by adding a relu. However, this does not mean that you would learn a better representation. Additionally, the magnitude of M is not very relevant, since the optimal policy depends only on the direction of z_R, not the magnitude. In FB implementations it is common to use Z_R = B(goal) for sparse, goal based rewards, because this is proportional to the true Z_r if you computed it as an expectation over the dataset. However, this approximate Z_R should not be used to estimate M.
- The increase in training time is a serious concern here. I think if you plotted the performance with training time as the x-axis, the empirical results would tell a different story. The authors argue that the increase in training cost is reasonable, but a 5x increase over FB is concerning.
We thank the reviewer 5D2W for the engagement in the discussion period. Regarding the reviewer's additional comments, we provide the following feedback:
Response to Q1
- Our proposed method is primarily focused on addressing two identified issues in FB-based methods: 1) expressive models are needed to properly capture complex F and B representations; and 2) the need to properly handle the OOD exploitation issue. The attention-based architecture is mainly proposed to improve model expressivity. Of course, there might be other ways/architectures to achieve the same goal, but as we responded previously, we had tested a wide range of model architectures in the early stage of our study, from large MLP nets, BRO, and OFENet, to the standard Transformer network. The attention-based architecture reported in our paper is by far the best solution we found through extensive experiments, providing greatly improved performance and learning stability. We did not include the previous failed attempts in our paper, as we think these might be distractions for readers, but we would be happy to report them if the reviewer thinks it is necessary.
- Also, note that the state, action, and task inputs are high-dimensional vectors, rather than single values. The attention-based mechanism actually does a good job of capturing the dependency/relationship across different dimensions of the state, action, and task vector. Even though this is different from the typical treatment of sequential data, we empirically found it very helpful for enhancing the expressivity of F and B modeling.
Response to Q2
- We agree that keeping $M$ positive does not necessarily guarantee a good representation, but a negative $M$ with improper scaling is clearly an indication of a problematic FB learning process, as it is contradictory to the theoretical foundation of the FB formulation.
- Note that our proposed method does not naively enforce a positive $M$, but provides sufficient expressivity and OOD regularization to learn more proper F and B representations. Actually, proper scaling and positive Q-values naturally emerge in our proposed method, even without adding any explicit positivity or scale regularization, and the learned $M$ and $Q$ patterns align much better with FB-related theory. This is also reflected in our comparative results and ablations: our design achieves significant performance gains over existing FB-based methods.
Response to Q3
- Regarding the reviewer's concern about training time, we further provide the training time and final performance for baseline FB methods at convergence (~500k training steps, see the learning curves in Fig. 5 of our paper) in the following table:
| Methods (training steps) | Training Time (h) | IQM Walker | IQM Jaco | IQM Quadruped | Aggregate IQM |
|---|---|---|---|---|---|
| FB (100k) | 0.4 | 448±12 | 17±2 | 270±15 | 245±6 |
| FB (300k) | 1.2 | 536±9 | 29±3 | 442±20 | 336±7 |
| FB (500k) | 2 | 591±10 | 36±9 | 533±16 | 387±12 |
| VCFB (100k) | 1.2 | 373±15 | 18±2 | 300±27 | 231±10 |
| VCFB (300k) | 3.6 | 490±17 | 26±3 | 489±14 | 335±7 |
| VCFB (500k) | 6 | 552±36 | 31±6 | 522±22 | 368±24 |
| MCFB (100k) | 1.2 | 415±16 | 17±1 | 270±20 | 234±8 |
| MCFB (300k) | 3.6 | 495±14 | 33±5 | 499±15 | 343±7 |
| MCFB (500k) | 6 | 592±30 | 42±11 | 539±16 | 391±20 |
| BREEZE (100k) | 2 | 615±7 | 45±3 | 542±11 | 404±4 |
- As shown in the above table, other FB-based methods typically require 500k training steps to converge to their highest scores. Nevertheless, their scores are still lower than BREEZE's performance with just 100k learning steps. If you take the complete training time needed for convergence into consideration, BREEZE has comparable training time to FB (e.g., BREEZE-100k with 2h vs. FB-500k with 2h), but much higher scores; VCFB and MCFB require almost 3x the training time of BREEZE, but still have inferior performance.
- We acknowledge that the per-step training time of BREEZE can be higher, but it is important to look at the complete training time of algorithms needed for convergence. In practice, we don't need prolonged training if the algorithm has already reached convergence.
This paper proposes several improvements to the Forward-Backward Representations (FB) framework for zero-shot RL. The FB framework represents the successor measure using the dot product of a forward representation $F(s, a, z)$, encoding the future state occupancy given a state-action pair and task policy $\pi_z$, and a backward representation $B(s')$, encoding the state distribution leading up to $s'$. A $z$-conditioned policy is jointly trained to decode the optimal policy for each task vector $z$. The paper is motivated by the observation that FB often learns heavily distorted $F$ and $B$. This is attributed to a few reasons:
- The FB objective does not enforce proper scaling,
- The policy class is not expressive enough to encode all possible optimal policies
- No pessimism for OOD states.
To address these, the paper proposes the following improvements:
- Regularize the FB networks using dataset samples by replacing the value target in the FB objective with a separate value network (as opposed to bootstrapping from the FB networks themselves). The value network is trained with an expectile loss on the dataset, ensuring proper normalization while being pessimistic OOD.
- Use a self-attention architecture to parameterize the $F$ and $B$ networks for better expressivity.
- Use a diffusion model to parameterize the policy. The model is trained via a weighted-regression objective which corresponds to learning the optimal policy for each task vector $z$.
These contributions result in state-of-the-art performance on multi-task RL problems on ExORL and D4RL tasks, outperforming baselines such as successor features, FB, and HILP. Overall, the methods proposed in this paper greatly improve the practicality of the FB framework.
Strengths and Weaknesses
Strengths
- The paper is well-motivated from an empirical analysis of the failure modes of the FB algorithm.
- The proposed solutions are sensible and apply well-established machineries in the literature.
- The claims made in this paper are supported by extensive experiments on ExoRL and D4RL domains. Each individual component is justified via ablation experiments.
- The derivations of the theorems and lemmas are sound.
Weaknesses
- The paper can feel like an incremental contribution to / modernization of the FB framework.
- A lot of extra machinery and compute is required to get marginal improvements over the strongest baseline (HILP). According to the ablations in Section 5.2, removing a single component can cause performance to drop below baseline.
Questions
- How should I interpret the $M$ and $Q$ plots in Figure 1? Why are positive values more favorable? Are there ground truth $M$ and $Q$ to compare to?
- FB has a regularizer to ensure proper scaling of $F$ and $B$ ([1] Appendix B, orthonormalization loss). Does this address the $M$ and $Q$ scaling problem?
- For policy extraction, can you do guided diffusion on the behavior policy by adding $\nabla_a Q$ to the noise prediction at each timestep?
- The statement of Lemma 1 in Appendix A.2 refers to a missing citation [42].
[1] Ahmed Touati, Jérémy Rapin, Yann Ollivier. Does Zero-Shot Reinforcement Learning Exist? ICLR 2023.
Limitations
The authors addressed the limitations of the paper in Appendix D.4, stating that the proposed method introduces extra compute compared to the standard FB framework.
Final Justification
The authors have sufficiently addressed my questions during the rebuttal. The paper makes an important contribution towards making zero-shot RL more practical. Therefore, I recommend acceptance.
Formatting Concerns
N/A
We sincerely appreciate the reviewer Arsf for the positive feedback and the constructive comments on our paper. Regarding the comments of the reviewer, we provide the following responses.
Weakness 1, 2 [Incremental contribution & extra machineries]
- Although our work is built upon the FB framework, we provide the first evidence that conventional FB-based zero-shot RL (ZSRL) methods suffer from systematically biased successor measure and Q-value estimations, due to the absence of proper scale supervision in the FB learning objective. We show in our paper that enhancing model expressivity and properly handling OOD error exploitation are crucial to solving this problem.
- It should be noted that the extra machinery introduced in BREEZE is targeted specifically to address the above issues. For example,
- We introduce an expressive self-attention architecture for $F$ and $B$ to capture the complicated state-action-task relationships and enhance the successor measure learning quality.
- We introduce an implicit behavior regularization scheme for both value learning and policy extraction, which enables a closed-form optimal policy (Eq. (9)) that can naturally be modeled using a weighted diffusion regression (Eq. (10)); a simplified sketch of this weighted objective follows this list.
- As the reviewer can notice, the above design is a very logical combination. Removing either component will reintroduce the original limitation in the conventional FB-based method.
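To illustrate the last point, below is a simplified sketch of an exponentially weighted noise-prediction loss under standard DDPM conventions. The weight definition, noise-schedule handling, and the `eps_net` signature are assumptions for illustration only, not the exact Eq. (10) in our paper.

```python
import torch

def weighted_diffusion_loss(eps_net, s, z, a0, alpha_bar, adv, temperature=1.0):
    """Exponentially weighted noise-prediction loss: dataset actions with higher
    estimated advantage contribute more, approximating samples from the closed-form
    target pi*(a|s,z) proportional to mu(a|s) * exp(adv / temperature) without
    explicitly modeling the behavior policy mu."""
    batch = a0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (batch,), device=a0.device)  # random diffusion steps
    noise = torch.randn_like(a0)
    ab = alpha_bar[t].unsqueeze(-1)                                       # cumulative alpha at step t
    a_t = ab.sqrt() * a0 + (1.0 - ab).sqrt() * noise                      # forward (noising) process
    pred = eps_net(a_t, t, s, z)                                          # conditional noise prediction
    weight = torch.exp(adv / temperature).clamp(max=100.0).detach()       # exponential advantage weights
    return (weight * (pred - noise).pow(2).sum(dim=-1)).mean()
```

In such a sketch, the advantage would typically be computed on dataset samples, e.g., as $Q(s, a, z) - V(s, z)$, so the weighting never requires sampling actions outside the dataset.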
Weakness 1, 2 [Performance comparison with HILP]
- The baseline HILP is actually not an FB-based ZSRL method. It is constructed upon a goal-conditioned RL framework, by learning a Hilbert representation of the state space, and using the normalized distance between the current state and the goal state under the Hilbert representation as the task vector. The final policy in HILP is learned by running an off-the-shelf offline RL algorithm as a sub-routine.
- As the underlying mechanisms of HILP and FB-based methods are very different, it is not very suitable to directly compare HILP with BREEZE's variants without the FB enhancement or diffusion component. These variants are more suitable for comparison with FB-based methods like FB or VCFB/MCFB, and as we have reported in Table 2, even these incomplete variants achieve stronger performance than conventional FB-based methods.
- Lastly, as we have reported in Table 1 and Fig. 5, in almost all tasks, our complete BREEZE algorithm achieves stronger performance and faster learning compared to HILP.
Question 1 [Interpretation of Figure 1]
- In Fig. 1, for the successor measure $M$, as it is an expected discounted probability (see Eq. (1)), it should always be positive. But as we have shown in Fig. 1 as well as more results in Fig. 8-10 in the appendix, both FB and MCFB can produce large negative values for $M$, which is obviously problematic.
- As for $Q$, by the definition of the value function in RL, for a reward function with range $[r_{\min}, r_{\max}]$, the theoretical Q-value should fall within the range $[r_{\min}/(1-\gamma), r_{\max}/(1-\gamma)]$. In the Walker task in Fig. 1, the actual task reward is in the range $[0, 1]$, so we should not expect any negative Q-values. However, similar to the case of $M$, we again find FB and MCFB produce highly problematic Q-values. By contrast, BREEZE produces much more reasonable Q-value estimates with proper scaling.
Question 2 [Orthonormalization regularization]
- We'd like to clarify that the orthonormalization loss can only regularize the scale and orthogonality of $B$, but cannot help to address the scaling problem of the successor measure $M$ and the Q-value $Q$. Note that $M$ is modeled as $F^\top B$; solely normalizing $B$ is insufficient to enforce proper scaling of $M$ (made explicit in the formula after this list).
- Also, the FB learning objective Eq. (3) completely depends on the bootstrapped update based on the learned $F$ and $B$, without any external scale supervision. This is different from the typical Bellman update used in standard RL, where reward values provide external scale supervision. Hence, the quality of the $F$ and $B$ estimates becomes essential for unbiased FB learning, which requires sufficient model expressivity and careful handling of the impact of OOD error exploitation.
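To make the scaling argument explicit (standard FB decomposition; notation may differ slightly from our paper):

```latex
M^{\pi_z}(s, a, \mathrm{d}s') = F(s, a, z)^{\top} B(s')\, \rho(\mathrm{d}s'),
\qquad
\mathbb{E}_{s' \sim \rho}\big[B(s') B(s')^{\top}\big] \approx I .
```

Even with the orthonormalization constraint on $B$ satisfied, any scale error in $F$ transfers directly to $M$ (and to $Q = F^{\top} z$), which is why the orthonormalization regularizer alone cannot fix the scaling problem.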
Question 3 and 4 [Diffusion guidance and missing citation]
- Yes, it is possible to perform guided diffusion by adding a weighted $\nabla_a Q$ term to the noise prediction at each step. We refer the reviewer to the detailed proof of Theorem 2 in a prior work [1].
- Paper [1] actually is the Ref[42] in Lemma 1 in Appendix A.2. We apologize for the missing reference. For some unknown reason, this citation was not properly compiled in our submission. We will correct this issue in the final version of our paper.
References
[1] Zheng, Yinan, et al. Feasibility-guided safe offline reinforcement learning. ICLR 2023.
Thank you for addressing my comments and questions. I have no further question and will maintain my assessment of the paper.
We thank the reviewer for the acknowledgement of our work and the constructive comments!
This paper proposes an extension to Forward-Backward (FB) models in offline reinforcement learning by incorporating an improved out-of-distribution regularization scheme alongside enhanced attention-based network architectures. The motivation stems from shortcomings identified in prior FB approaches. The proposed method is evaluated on a wide range of offline RL tasks, including different datasets and domains, and is benchmarked against several state-of-the-art baselines. Results indicate that the method achieves overall improved performance and robustness, supported by extensive ablations and visualizations.
Strengths and Weaknesses
Strengths
- Realistic Contribution: The paper makes a practically grounded contribution, avoiding overclaiming or speculative framing.
- Clear Motivation and Formal Soundness: The approach is well-motivated by formal considerations of current model shortcomings, and the proposed methodology is solidly defined.
- Comprehensive Evaluation: The experimental setup is broad, covering a range of tasks, datasets, and baselines. Ablation- and hyperparameter studies provide insight into the contribution of each component. The illustrative visualizations and provided results strongly support the empirical claims.
Weaknesses
- Clarity and Writing: Overall readability suffers due to long and dense sentences. Splitting them for clarity would improve comprehension significantly.
- Preliminaries and Task Formalization: Including the discount factor with the reward-free MDP seems unnecessary. Also, I am missing a thorough formalization of the considered task.
- Ambiguity in Zero-shot Framing: The terminology around "zero-shot" learning is ambiguous (e.g., "using only a small set of reward-labeled samples" could also be conceived as few-shot). A central definition would help.
- Missing Limitations and Assumptions: The paper lacks a clear discussion of limitations, such as assumptions on the availability and quality of offline data, or the implications of the KL constraint.
- Related Work: Could be improved by better positioning the contribution relative to prior approaches. Also, the improved representation modeling proposed could be considered related to the partial observability considered in [1].
- Computational Details: A discussion on model complexity, runtime, and training resource requirements is currently missing. These are important for understanding practical applicability.
[1] Altmann et al., "CROP: Towards Distributional-Shift Robust Reinforcement Learning using Compact Reshaped Observation Processing", in Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023.
Questions
What exactly constitutes the “ground truth” in Figure 1? The figure’s visual semantics do not clearly reflect the described differences in the text. Additionally, BREEZE also appears to show high values, contradicting the written interpretation.
Could the authors elaborate on what “zero-shot” means in their context? The usage could be interpreted as few-shot instead, especially since some reward-labeled data is used. A precise definition would help.
Could the authors discuss potential failure cases, data conditions, or general assumptions limiting their approach? This would strengthen the claims of robustness and highlight future directions for improvement.
Limitations
No. Limitations are not discussed. Suggested areas to address include:
- Dependence on large, high-quality offline datasets.
- Computational cost and scalability.
- Potential brittleness to domain shift beyond training datasets.
Final Justification
The rebuttal effectively addresses the previously raised questions and concerns, and demonstrates careful consideration of the feedback. In light of this, the initially positive assessment is maintained.
Formatting Concerns
- The ordering of sections is unconventional: results are presented prior to introducing the proposed approach. Reordering for logical flow -- first stating the formal problem, then the approach, followed by results -- would improve readability.
- Line 174: The notation should likely be or clarified if intentional.
We thank the reviewer for the constructive comments. Regarding the concerns of the reviewer Ytsj, we provide the following responses.
Weakness 1 [Clarity & writing] We thank the reviewer for the suggestion. We will perform thorough language editing and polish the writing in our final version.
Weakness 2 [Preliminaries & task formalization]
- We apologize for the relatively brief introduction of the problem setting of zero-shot RL (ZSRL) due to the page limit. In short, ZSRL trains a policy in an unsupervised manner based on a dataset collected from a specified domain without rewards. The desired reward function will be specified at test time within the same domain, and the learned agent must immediately produce a good policy without fine-tuning (i.e., zero-shot). We will add more preliminary information in our final version. We also refer the reviewer to the prior papers [1, 2, 3] on FB-based ZSRL for more detailed information.
- Regarding the discount factor, it is actually necessary to include it in FB-based ZSRL: the central element in this problem, the successor measure (see Eq. (1) in our paper), is defined as the expected discounted occurrences of a future state after starting from a given state-action pair. FB-based ZSRL methods estimate this quantity by solving a Bellman-like equation (Eq. (3)), in which $\gamma$ is also needed to ensure convergence to a fixed point during learning; see the identity below.
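For reference, the role of $\gamma$ can be seen from the generic Bellman-style identity that the successor measure satisfies (standard form; it may differ slightly in notation from Eq. (3) in our paper):

```latex
M^{\pi_z}(s, a, X) = \Pr\left(s_{1} \in X \mid s, a\right)
  + \gamma\, \mathbb{E}_{s_1 \sim P(\cdot \mid s, a),\; a_1 \sim \pi_z(\cdot \mid s_1)}
    \big[ M^{\pi_z}(s_1, a_1, X) \big] ,
```

so $\gamma < 1$ is precisely what makes the associated fixed-point operator a contraction and the iteration convergent.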
Weakness 3 & Question 2 [Ambiguity in zero-shot framing]
- We actually share the same feeling about the potentially ambiguous term "zero-shot RL", as it is essentially a class of unsupervised RL (training w/o task/reward specifications) that generalizes to arbitrary tasks at test time. However, the term "zero-shot RL" has already been used extensively in the existing literature, e.g., [1,2,3,4,5,6], hence we follow this naming convention from prior works in our paper. We thank the reviewer for the suggestion and will include a clearer definition in our final version.
Weakness 4 & Question 3 & Limitation 1 [Missing limitations & assumptions & discussion on failure cases]
We thank the reviewer for this constructive comment.
- We do have a limitation section in Appendix D.4 in our paper. The main limitation of BREEZE is that it has increased computation demand due to the use of more expressive model architectures. But as it enables noticeable improvements in learning performance and stability, we believe it is a reasonable trade-off.
- Regarding the assumptions:
- BREEZE inherits all theoretical assumptions from FB, e.g., the MDP assumption, as well as assuming that for each task vector $z$, we can find a pair of forward ($F$) and backward ($B$) representations such that the successor measure $M^{\pi_z}$ can be decomposed as $F^\top B$ (with respect to the data density $\rho$).
- The availability and quality of offline data require minimal assumptions, as the goal of ZSRL is to learn the environment structure from arbitrary datasets and handle any downstream tasks. In practice, an offline dataset with higher diversity is likely to provide better ZSRL performance, which has been discussed in the prior paper [2].
- Regarding the use of the KL constraint: KL-based behavior regularization is widely adopted in the offline RL literature to prevent the learned policy from deviating too much from the dataset distribution. We borrowed a similar design in Eq. (9), as the FB-based ZSRL problem suffers from a very similar out-of-distribution (OOD) exploitation problem as offline RL. A more important benefit of this design is that it enables a closed-form optimal policy (Eq. (9)), which can be seamlessly integrated with a diffusion policy learning objective (Eq. (10)); see the formula after this list.
- Lastly, we will consider adding additional failure case analysis in the final version of our paper, for example, testing the cases with small-sample and low-cover datasets (the OOD problem will become more severe), as well as testing the possible over-conservative issue due to adding additional behavior regularizations.
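The closed-form solution referenced above takes the familiar form from KL-regularized offline RL (a generic statement for illustration; the temperature $\alpha$ and normalization are assumptions rather than our exact Eq. (9)):

```latex
\pi^{*}(a \mid s, z) \;\propto\; \mu(a \mid s)\, \exp\!\left( \tfrac{1}{\alpha}\, Q(s, a, z) \right),
```

which is what allows policy extraction to be written as a weighted regression on dataset actions instead of an explicit divergence penalty against a separately learned behavior model.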
Weakness 5 [Related works]
We thank the reviewer for this recommendation. Reference [7] provides an effective approach to learning improved representations under partial observability. By contrast, FB-based ZSRL methods focus on learning a special type of successor representation that can facilitate zero-shot policy adaptation. The goals of these two representation learning approaches are very different. Nevertheless, we will add an extra discussion of [7] in our final paper.
Weakness 6 & Limitation 2 [Computational details]
We appreciate the reviewer's comments and would like to provide the following details.
- In our experiments, we train BREEZE on a single NVIDIA A6000 GPU. Peak GPU memory usage varied by tasks, normally below 23,000MB.
- As BREEZE uses more expressive model architectures, it requires more computational resources for training. Training for 1M steps, as adopted in other baseline methods, takes about 20 hours for most environments. However, as we have reported in Fig. 5, BREEZE can learn much faster, typically reaches good performance within just 100k steps, and enjoys stable convergence. By contrast, other baselines need much larger training steps (typically about 500k steps), and often suffer from unstable convergence (performance drop in later training stages).
- To demonstrate a more detailed computation cost vs performance trade-off, we report the training time per 100k steps and the corresponding aggregated scores in the following table:
| Methods | Training Time (h) per 100k Steps | IQM Walker (100k Steps) | IQM Jaco (100k Steps) | IQM Quadruped (100k Steps) | Aggregate IQM Score (100k Steps) |
|---|---|---|---|---|---|
| FB | 0.4 | 448±12 | 17±2 | 270±15 | 245±6 |
| VCFB | 1.2 | 373±15 | 18±2 | 300±27 | 231±10 |
| MCFB | 1.2 | 415±16 | 17±1 | 270±20 | 234±8 |
| HILP | 0.75 | 509±9 | 30±1 | 394±15 | 311±6 |
| BREEZE | 2 | 615±7 | 45±3 | 542±11 | 404±4 |

- The scores reported for the Walker, Jaco, and Quadruped tasks are averaged over five random seeds and all datasets of each domain. The aggregate score denotes the overall performance by averaging all domains in ExORL. Based on the above table and the learning curves in Fig. 5 of our paper, we can observe that BREEZE reaches much better performance over baselines with a reasonable increase in training cost.
Question 1 ["Ground truth" on Figure 1]
- For the successor measure $M$, as it is an expected discounted probability (see Eq. (1)), it should always be positive. But as we have shown in Fig. 1 as well as more results in Fig. 8-10 in the appendix, both FB and MCFB can produce large negative values for $M$, which is obviously problematic.
- As for $Q$, by the definition of the value function in RL, for a reward function with range $[r_{\min}, r_{\max}]$, the theoretical Q-value should fall within the range $[r_{\min}/(1-\gamma), r_{\max}/(1-\gamma)]$. In the Walker task in Fig. 1, the actual task reward is in the range [0,1], so we should not expect any negative Q-values. However, similar to the case of $M$, we again find FB and MCFB produce highly problematic Q-values. By contrast, BREEZE produces much more reasonable Q-value estimates with proper scaling.
Limitation 3 [Domain shift consideration]
- The ZSRL problem setting [1,2,4,5,6] only considers zero-shot adapting to different tasks in the same domain (e.g., walk, stand, run, flip tasks for Walker domain), rather than focusing on solving domain shift problems. Future research can explore learning agents that have both task and domain generalization capabilities, but it is not the current focus of our paper.
Paper Formatting Concerns
We sincerely appreciate the reviewer's suggestion. We will revise these issues pointed out by the reviewer in the final version of our paper.
References
[1] Ahmed Touati, Jérémy Rapin, Yann Ollivier. Does Zero-Shot Reinforcement Learning Exist? ICLR 2023.
[2] Jeen, Scott, Tom Bewley, Jonathan Cullen. Zero-shot reinforcement learning from low quality data. NeurIPS 2024.
[3] Ahmed Touati, Yann Ollivier. Learning One Representation to Optimize All Rewards. NeurIPS 2021.
[4] Park, Seohong, Tobias Kreiman, Sergey Levine. Foundation policies with Hilbert representations. ICML 2024.
[5] Ingebrand, Tyler, Amy Zhang, and Ufuk Topcu. Zero-Shot Reinforcement Learning via Function Encoders. ArXiv 2401.17173.
[6] Frans, Kevin, et al. Unsupervised zero-shot reinforcement learning via functional reward encodings. PMLR 235, 2024.
[7] Altmann et al., CROP: Towards Distributional-Shift Robust Reinforcement Learning using Compact Reshaped Observation Processing. IJCAI 2023.
Thank you for your extensive rebuttal, which addresses my open questions and the outlined concerns while taking my feedback into account. I will maintain my already positive assessment.
We thank the reviewer for acknowledging our work and the positive feedback! We will incorporate the discussion in our final version.
BREEZE is a behaviour-regularised, expressivity-enhanced Forward-Backward (FB) framework for zero-shot reinforcement learning. It (i) adds a behaviour-constrained expectile-value objective that keeps policy learning in-sample to curb OOD action extrapolation, (ii) replaces shallow MLP encoders with self-attention forward/backward networks for richer successor-measure estimates, and (iii) extracts actions with a guided task-conditioned diffusion policy to handle multi-modal action distributions. Across 12 ExORL domains and four long-horizon Franka-Kitchen tasks, BREEZE sets new state-of-the-art IQM returns and shows markedly smoother learning than prior FB variants.
Strengths and Weaknesses
- Principled behaviour regularisation: The KL-bounded, expectile-value loss forces learning to stay on the dataset support, directly attacking OOD action extrapolation.
- Expressive FB encoders: Two-token self-attention forward and transformer backward nets capture complex state–task relations far better than widened MLPs, lifting FB performance noticeably (Fig. 2).
- Diffusion actor models multi-modal actions: A task-conditioned diffusion policy with value guidance generates diverse candidate actions and picks the top candidates by Q-value, yielding higher returns on high-variance domains.
- Strong, broad empirical gains: BREEZE tops 10/12 ExORL tasks and all Kitchen benchmarks in IQM return and variance, clearly outperforming FB/MCFB/VCFB and HILP baselines.
- Informative diagnostics & ablations: Distribution plots reveal scale errors in previous FB methods, while component ablations isolate the impact of the FB enhancement and the diffusion actor (Table 2, Fig. 6).
Questions
- Compute footprint & efficiency – Please report training time, GPU type/count, and peak memory for ExORL/Kitchen runs; show BREEZE stays within 1.5 × wall-clock of FB or explain the trade-off.
- Generality beyond MuJoCo/Kitchen – Evaluate on at least one pixel-based suite (e.g., DMControl-Pixels) or Atari-URLB. Comparable gains would strengthen significance; failure analysis would also help.
- Ablate diffusion on Kitchen – Add a Kitchen variant using a Gaussian actor to clarify when diffusion is essential.
- Failure-case deep dive – For tasks with variance spikes, supply diagnostics linking instability to either critic bias or diffusion sampling.
Limitations
- Limited domain diversity: All experiments are MuJoCo or Franka-Kitchen; no pixel-based control or real-robot data are tested, so generality is uncertain.
- Parameter sensitivity: The expectile $\tau$ and the guidance weight demand careful tuning; Figure 6 shows sharp performance cliffs with poor choices.
- Sparse analysis of failure cases: Experiment results show variance spikes, but the paper does not dissect whether these stem from value mis-estimates or diffusion sampling.
- Missing comparison to function-encoder ZSRL lines: Recent methods like FROE or F-Reward are omitted, so progress relative to non-FB approaches is unclear.
Formatting Concerns
n/a
We thank the reviewer for the constructive comments. Regarding the concerns of the reviewer WnNG, we provide the following responses.
Question 1 [Compute footprint & efficiency]
- In our experiments, we train BREEZE on a single NVIDIA A6000 GPU. Peak GPU memory usage varied by tasks, normally below 23,000MB.
- As BREEZE uses more expressive model architectures, it requires more computational resources for training. Training for 1M steps, as adopted in other baseline methods, takes about 20 hours for most environments. However, as we have reported in Fig. 5, BREEZE can learn much faster, typically reaches good performance within just 100k steps, and enjoys stable convergence. By contrast, other baselines need much larger training steps (typically about 500k steps), and often suffer from unstable convergence (performance drop in later training stages).
- To demonstrate a more detailed computation cost vs performance trade-off, we report the training time per 100k steps and the corresponding aggregated scores in the following table:
| Methods | Training Time (h) per 100k Steps | IQM Walker (100k Steps) | IQM Jaco (100k Steps) | IQM Quadruped (100k Steps) | Aggregate IQM Score (100k Steps) |
|---|---|---|---|---|---|
| FB | 0.4 | 448±12 | 17±2 | 270±15 | 245±6 |
| VCFB | 1.2 | 373±15 | 18±2 | 300±27 | 231±10 |
| MCFB | 1.2 | 415±16 | 17±1 | 270±20 | 234±8 |
| HILP | 0.75 | 509±9 | 30±1 | 394±15 | 311±6 |
| BREEZE | 2 | 615±7 | 45±3 | 542±11 | 404±4 |

- The scores reported for the Walker, Jaco, and Quadruped tasks are averaged over five random seeds and all datasets of each domain. The aggregate score denotes the overall performance by averaging all domains in ExORL. Based on the above table and the learning curves in Fig. 5 of our paper, we can observe that BREEZE reaches much better performance over baselines with a reasonable increase in training cost.
Question 2 & Limitation 1 [Pixel-based experiments]
- We appreciate the reviewer's suggestion. In our paper, we follow other zero-shot RL literature and conduct the experiments on the widely used ExORL/Kitchen benchmarks. We are implementing pixel-based experiments in DMControl and will promptly update the results if we can finish producing results within the discussion period. These results will be included in our final version.
- Regarding Atari experiments, due to the lack of multi-task settings and relevant datasets, they have not yet been used for zero-shot RL in the task setting.
Question 3, 4 & Limitation 3 [Futher ablation & analysis]
-
We appreciate the reviewer's suggestion. We will further conduct relevant experiments and provide analysis on the potential impact of critic bias and diffusion sampling.
-
As for the importance of the diffusion model:
- Regarding the Kitchen environment, we have finished ablation on the setting w/o diffusion actor, as shown in the following table:
Setting Dataset Average Normalized Return with Diffusion kitchen-partial 72.5 with Diffusion kitchen-mixed 75.0 w/o. Diffusion kitchen-partial 33.3 w/o. Diffusion kitchen-mixed 37.5 - Combining the above results with Table 2 in our paper, we can clearly observe the benefits of using diffusion policy in BREEZE for all tasks, which helps enhance policy expressivity and enables better capturing complex data distributions.
Limitation 2 [Parameter sensitivity]
- In our ablation experiments shown in Fig. 6, the y-axis labels actually start at 650 and 670 instead of 0. The performance varies in the ranges of [650, 700] for the expectile $\tau \in [0.5, 0.99]$ and [670, 700] for the guidance weight in [0.01, 0.5]. The performance variations are actually not as significant as they seem to be.
- Moreover, if the reviewer checks Tables 3 and 4 in the Appendix of our paper, we used the consistent expectile (0.99 for all ExORL tasks and 0.7 for Kitchen) and guidance weight (0.05 for all tasks) for all our experiments without any hyperparameter tuning. Even using untuned hyperparameters for different tasks, BREEZE still achieves better performance. This also demonstrates the hyperparameter robustness of BREEZE.
Limitation 4 [Comparison to function-encoder lines]
- We appreciate the reviewer's suggestion. We have managed to conduct additional experiments on the baseline FRE [1] for the Walker domain in the ExORL benchmark during the short 1-week rebuttal period. We use the official codebase from the FRE paper and modify the evaluation part for rollouts in our task environments. The results, averaged over three random seeds, are shown in the table below, and we will report FRE's scores on other tasks and domains in our final version.

| Dataset | Task (Walker) | FRE IQM Score | BREEZE IQM Score |
|---|---|---|---|
| RND | Walk | 187.7±9 | 945±3 |
| RND | Run | 88.3±3 | 340±16 |
| RND | Stand | 407.3±19 | 950±1 |
| RND | Flip | 242.3±5 | 593±18 |
| RND | Whole Performance | 232±8 | 703±2 |
| APS | Walk | 162.3±9 | 949±4 |
| APS | Run | 56.7±11 | 278±9 |
| APS | Stand | 376.0±19 | 922±33 |
| APS | Flip | 104±5 | 589±15 |
| APS | Whole Performance | 175±12 | 685±7 |
| PROTO | Walk | 177.3±2 | 923±7 |
| PROTO | Run | 115.7±6 | 274±9 |
| PROTO | Stand | 341.3±7 | 930±4 |
| PROTO | Flip | 187.7±35 | 722±13 |
| PROTO | Whole Performance | 190±11 | 711±3 |
| DIAYN | Walk | 83.3±12 | 774±21 |
| DIAYN | Run | 99.0±7 | 229±10 |
| DIAYN | Stand | 336.3±10 | 781±49 |
| DIAYN | Flip | 94.0±22 | 395±14 |
| DIAYN | Whole Performance | 153.2±2 | 545±12 |

- As shown in the above table, BREEZE consistently outperforms FRE on Walker tasks. Function-encoder ZSRL methods, such as FRE, rely heavily on reward prior assumptions and extensive training/tuning, and they have fewer theoretical guarantees compared to FB-based ZSRL methods.
References
[1] Frans, Kevin, et al. Unsupervised Zero-Shot Reinforcement Learning via Functional Reward Encodings. ICML 2024.
As the discussion period draws to a close, we kindly invite reviewers to reach out if they have any additional questions. We are happy to provide further clarification. Any response throughout this discussion period is appreciated.
The paper proposes BREEZE, an improved Forward-Backward (FB) framework for zero-shot RL that addresses instability, non-expressive representations, and out-of-distribution (OOD) action extrapolation in offline RL. Strengths include 1) strong empirical improvements across multiple tasks; 2) well-motivated fixes to FB's key issues (OOD extrapolation, expressivity, multimodality); 3) clear ablations supporting the contributions. Weaknesses include 1) limited domain diversity (no pixel-based or real-robot benchmarks); 2) some missing theoretical clarity (why attention helps, the "zero-shot" definition); 3) weak discussion of limitations and failure cases. The paper is technically strong, with meaningful improvements to zero-shot RL, but reviewers flagged concerns about compute efficiency, scope of evaluation, and clarity. Overall, reviewers lean towards accepting this paper.