Conservative Offline Goal-Conditioned Implicit V-Learning
We propose Conservative Goal-Conditioned Implicit Value Learning (CGCIVL), which mitigates value overestimation in cross-trajectory sampling by penalizing unconnected state-goal pairs.
Abstract
Reviews and Discussion
This paper proposes conservative goal-conditioned implicit V-learning (CGCIVL). The main insight of CGCIVL is to penalize cross-trajectory goal-conditioned values, which may potentially be overestimated, with a conservative regularizer. To improve the empirical performance of CGCIVL, the authors additionally employ other techniques (e.g., quasimetric value functions, hierarchical policy extraction, etc.) from the literature. They evaluate CGCIVL on OGBench, showing that it outperforms the previous methods on navigation environments, including those that require stitching.
Questions for Authors
I don't have any questions other than the ones I asked above.
Claims and Evidence
The claims are empirically supported to some degree, but I do have several questions (see below).
Methods and Evaluation Criteria
Their evaluation criteria are reasonable in general, but the tasks are limited to (similar) navigation environments, and it'd have been more convincing if the authors had shown CGCIVL's performance on manipulation environments as well.
Theoretical Claims
I briefly reviewed the theoretical results (though I haven't thoroughly gone through the Appendix), and at least they look believable to me. The theoretical results are largely based on standard proof techniques about conservative value estimation.
Experimental Designs and Analyses
I don't have particular concerns about experimental designs or analyses other than the ones I listed in the weakness section below.
Supplementary Material
I briefly checked the supplementary material, and confirmed that the authors have submitted their code with (very) brief instructions to reproduce the results. I'd encourage the authors to polish the README file when they release the code to the public.
Relation to Broader Scientific Literature
CGCIVL is built upon several existing methods --- IQL, GCIVL, HIQL, CQL, and QRL. Although the "novelty" of CGCIVL is not necessarily extremely prominent, I think the paper does have a reasonable degree of contribution (given that the claims are fully empirically supported).
Missing Essential References
I don't see any particular missing work.
Other Strengths and Weaknesses
Strengths
- Figure 5 is quite convincing to me (especially in comparison with Figure 1). It is nice to see that the proposed techniques improve performance on "stitch" datasets.
- CGCIVL achieves the best performance on almost all tasks employed in the paper.
Weaknesses
- The paper omits a key ablation result -- how does CGCIVL's conservative regularization affect performance? This is the supposed key ingredient of the method, so I believe it is crucial to show how this design choice affects performance. In Figure 4 (in its current form), most of the performance gains are seemingly from quasimetric value functions and hierarchical policy extraction.
- The authors only evaluate CGCIVL on maze navigation environments. While the authors employ many datasets from OGBench, it'd have been much more informative if the authors had shown how CGCIVL works on other types of environments as well (e.g., manipulation). Does CGCIVL also work well on OGBench manipulation environments? If not, why?
- The authors use more training steps (e.g., 3M) for some challenging tasks (e.g., humanoidmaze), whereas the baseline results are obtained at 1M steps. Is CGCIVL still better than the baselines when they are trained with the same number of epochs?
- The proposed method is fairly complicated. It combines a number of different ingredients from previous methods -- quasimetric value functions, hierarchical policy extraction, implicit Q-learning, conservative Q-learning, etc. Hence, to some degree, their method is somewhat expected to work better than the baselines, because the baselines are usually more "atomic" (in the sense that they mostly employ one or two key techniques). While I don't think this is a major limitation, it would have been a great plus if their method had been simpler.
Overall, I'm not entirely convinced by the empirical results, mainly due to the lack of ablations and the limited types of environments. I'd be happy to adjust my score if these points are addressed.
Other Comments or Suggestions
- is never formally defined (it is instead somewhat implicitly defined around L180). Relatedly, is correct? I suspect is sampled from the dataset distribution, not (note that they are different when dataset trajectories are truncated).
- I'd explicitly mention that around Equation (13). This is not explicitly stated in the current draft.
- What is the value of used for the experiments?
Thank you for your thoughtful review and valuable suggestions. We have carefully addressed each of your concerns in the responses below.
R1: Methods and Evaluation Criteria
We have extended our evaluation to manipulation environments (see Table 1 in the linked PDF for details). Detailed analysis is provided in our response R1 to Reviewer 96R5.
R2: Supplementary Material
We will polish the README file to ensure our algorithm can be easily reproduced.
R3: Weaknesses
- We have conducted additional experiments comparing CGCIVL's performance with and without the conservative regularization term. As shown in Figure 1 (see the linked PDF for details), removing this component leads to a significant performance drop, confirming its critical role. Furthermore, we observe that performance is robust within a suitable range of the regularization coefficient, but both excessively small and excessively large values degrade the results.
- See our response in R1.
- In our paper, all algorithms were evaluated with the same number of training steps to ensure a fair comparison. Unlike the results reported in OGBench, where baselines were trained for 1M steps, we trained all algorithms for more steps in the complicated environments (see Appendix C.3 for details). Figure 3 (see the linked PDF for details) shows training curves for all algorithms in these complex environments.
- We would like to clarify that the CQL-style penalty and the quasimetric value parameterization are the key techniques of our algorithm, addressing value overestimation on unconnected state-goal pairs. IQL serves as the policy improvement component and could be replaced with other methods. The hierarchical structure is a common approach for handling long-horizon tasks and may be omitted in environments that are not long-horizon.
R4: Comments
- denotes an arbitrary distribution which satisfies . Trajectory truncation only alters the goal associated with states in different segments, without directly changing the distribution of states in the dataset. Therefore, we can approximate sampling states from as sampling from .
- We'll mention near Eq. (13) for clarification in the revised manuscript.
- The parameter in Equations (14)-(15) serves as the temperature coefficient for both the high-level and low-level policy extraction. We empirically set it to the same value across all experiments.
Thank you for the detailed response. I appreciate the additional results, and they look convincing to me. I've raised my score to 3.
Two minor comments:
- Why does Table 1 in the additional PDF not contain ? In case this result is omitted because CGCIVL doesn't perform better: I believe a new method doesn't necessarily need to achieve the best performance on every single task. It'd be more informative to the community to present the entire result to enable a more holistic evaluation.
- can be different from even without goals, because the former is the discounted state marginal of the policy (with infinite rollouts), whereas the latter is the truncated state marginal distribution (e.g., consider the extreme case where every trajectory has length 1, in which case would be the same as the initial state distribution, while isn't).
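For reference, the standard definition of the discounted state marginal the comment refers to (notation ours for illustration; $\rho_0$ denotes the initial state distribution):

$$d^{\pi}(s) \;=\; (1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\,\Pr\!\big(s_t = s \,\big|\, s_0 \sim \rho_0,\; a_t \sim \pi(\cdot\mid s_t)\big),$$

whereas the empirical dataset marginal simply averages over the states that actually appear in the (possibly truncated) trajectories of $\mathcal{D}$, which is why the two distributions can differ.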
Thank you for your additional comments. As you suggested, we will include the full set of puzzle tasks in the final version of our paper. Regarding the sampling of , I apologize for misunderstanding your point; your observation is correct. It is indeed the state marginal distribution of the dataset and is related to . Nevertheless, the conclusion of Theorem 4.1 still holds because this term is eliminated during the derivation and does not appear in the final expression.
This paper introduces conservatism to prevent overestimation in unconnected state-goal pairs and uses a quasimetric value network to prevent underestimation in connected cross-trajectory state-goal pairs. Theoretical analysis is provided for the idealized version of the algorithm, and the practical implementation of the algorithm outperforms offline goal-conditioned RL baselines on OGBench.
Questions for Authors
- Does Theorem 4.1 hold for continuous state spaces, or only for discrete state spaces? Similarly, does it require a discrete action space, or does it also hold for continuous action spaces? This question concerns how well the theoretical analysis aligns with practice.
- It seems unnatural to require the value function to be a quasimetric, because sometimes the ground-truth value function might not be a quasimetric. For instance, suppose states A, B are connected, but both (A,C) and (C,B) are unconnected. Then we should have , , . This violates the quasimetric property.
- Can CGCIVL fit in the framework of Eq. (8), assuming that we do not use any function approximation? This question determines the relationship between the theory and the practical algorithm.
- Does Proposition 4.5 still hold if conservatism (regularization) is not added in the algorithm? This is relevant to the novelty of this paper.
I am willing to raise the score if the above concerns are resolved and the comments in the previous part are handled.
Claims and Evidence
The performance of the proposed CGCIVL algorithm is demonstrated through abundant experiments, which are convincing evidence.
However, the connection between the theoretical analysis and the practical CGCIVL algorithm is weak, as the theory is based on an idealized version of CGCIVL (Eq. (8)). The theorems therefore serve more as motivation than as a guarantee.
Methods and Evaluation Criteria
Conservatism or regularization is a standard technique in reinforcement learning. The quasimetric framework is specific to the goal-conditioned RL problem. Therefore, the proposed method is overall appropriate for the problem at hand.
The benchmark OGBench is suitable for goal-conditioned reinforcement learning.
Theoretical Claims
Many notations are used without formal definition, thus hindering the understanding of the theorems along with their proofs. For instance, , and in Eq. (8).
The formulation of Proposition 4.5 is problematic. As stated, the inequality should hold for any , which would then imply that has to be . However, this seems to be a minor typo.
I believe the theoretical claims are sound after the above issues are addressed.
Experimental Designs and Analyses
The experiments are solid and support the claim.
Supplementary Material
No significant problems in supplementary material.
Relation to Broader Scientific Literature
The two key components of CGCIVL, conservatism and the quasimetric structure, are not novel in the RL literature. The former is standard in RL algorithms, e.g., CQL (Kumar et al., 2020) and COMBO (Yu et al., 2021), and the latter was also proposed in https://arxiv.org/abs/2304.01203. The paper only investigates the effect of combining these two techniques. Nonetheless, the successful combination of these two methods reveals the contribution of this paper.
Missing Essential References
No missing reference found.
Other Strengths and Weaknesses
Strength: The paper investigates the advantage of combining conservatism and quasimetric.
Weakness:
- Both conservatism and quasimetric are existing techniques in RL literature, although this does not severely harm originality, as the methods are tailored specifically for GCRL.
- Lack of clarity. Many notations are used without formal definition (already discussed above). In addition, the main algorithm (Algorithm 1) needs a more detailed description. For instance, in Eq. (12), how do we estimate the expectation, and how do we sample from ?
Other Comments or Suggestions
- All notation should be defined before use. Much of the notation is not standard across the literature, so it will confuse readers without clear definitions.
- Algorithm 1 should be described in detail to convince readers that it can be implemented in practice. For instance, we need to discuss how to sample .
- Both Propositions 4.3 and 4.5 use and , but these symbols have different meanings in the two propositions. Therefore, consider using distinct notation.
We sincerely appreciate your time and effort in reviewing our work. We have addressed each concern you raised below.
R1: Claims and Evidence
Thank you for this comment. The theoretical guarantees mentioned in our paper indeed refer to the algorithm prototype based on Eq. (8), rather than the practical algorithm. However, in Lemma 1 (Appendix A), we prove that the expectile regression in the practical algorithm is equivalent to the Bellman operator in Eq. (8). Additionally, compared to Eq. (8), the practical algorithm incorporates two key techniques: 1) hierarchical learning to address long-horizon tasks, and 2) quasimetric distillation to improve the efficiency of value learning. These techniques do not fundamentally alter the core components, namely the CQL-inspired penalty term and the quasimetric, which form the foundation of the theoretical analysis. Therefore, while Eq. (8) does not exactly match the practical algorithm, we believe the practical algorithm still benefits from the theoretical guarantees established by the analysis.
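For readers, a brief sketch of the expectile-regression step referenced above (notation ours and purely illustrative, following IQL-style value learning; $\bar{V}$ denotes a target network, and the paper's Eq. (8) and Lemma 1 give the exact statement):

$$L_2^{\tau}(u) = \left|\tau - \mathbf{1}(u < 0)\right| u^{2}, \qquad \min_{V}\; \mathbb{E}_{(s,\,s',\,g)\sim \mathcal{D}}\Big[ L_2^{\tau}\big(r(s,g) + \gamma \bar{V}(s', g) - V(s, g)\big)\Big],$$

where taking the expectile $\tau$ close to 1 makes the regression approximate a maximum over dataset transitions, i.e., an implicit Bellman backup.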
R2: Theoretical Claims
- We clarify notations you mentioned as follows:
- denotes an arbitrary distribution which satisfies .
- represents an empirical estimate of the true value function during iteration.
- denotes the empirical Bellman operator, which is the sample-based counterpart of the theoretical Bellman operator.
- Yes. The correct statement of Proposition 4.5 should be: For any and , there exists a hyperparameter such that the inequality holds.
R3: Relation to Broader Scientific Literature
We would like to clarify that our work extends beyond a mere combination of existing techniques. The key contributions are:
- Problem identification: To the best of our knowledge, we are the first to formalize the critical issue of value overestimation for unconnected state-goal pairs in offline GCRL.
- Feasible solutions: Our solution penalizes the values of all cross-trajectory state-goal pairs while ensuring that values on connected pairs are not excessively underestimated. We introduce a CQL-inspired regularization term to achieve the first, and use a quasimetric model for accurate value estimation of connected pairs to achieve the second (a schematic sketch of such a combined objective is given after this list). Both components are supported by theoretical guarantees.
- Differences from the original methods: Unlike CQL, which penalizes OOD actions, we introduce a penalty term tailored to state-goal pairs. Unlike QRL, which trains value functions without value iteration, our approach incorporates quasimetric properties into the value-iteration process to ensure accurate value estimation for connected state-goal pairs.
The novelty of our method lies in re-engineering these components to address a new problem in offline GCRL, rather than merely combining them.
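To make the combination concrete, below is a minimal, hypothetical sketch of a conservative goal-conditioned value loss of this flavor. This is not the authors' exact Eq. (12); the function names, the penalty weight `alpha`, and the expectile `tau` are illustrative assumptions.

```python
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss |tau - 1(diff < 0)| * diff^2 (IQL-style).
    weight = torch.abs(tau - (diff < 0).float())
    return weight * diff.pow(2)

def conservative_gc_value_loss(V, V_target, batch, alpha=1.0, gamma=0.99):
    # Goal-conditioned expectile (implicit) value backup on dataset goals.
    with torch.no_grad():
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * \
            V_target(batch["next_obs"], batch["goal"])
    td_loss = expectile_loss(target - V(batch["obs"], batch["goal"])).mean()

    # CQL-inspired penalty: push down values of cross-trajectory
    # (potentially unconnected) state-goal pairs, obtained here by
    # re-pairing states with goals from other samples in the batch.
    shuffled_goals = batch["goal"][torch.randperm(batch["goal"].shape[0])]
    penalty = V(batch["obs"], shuffled_goals).mean()

    # Parameterizing V as the negative of a quasimetric (e.g., an IQE
    # distance) is what keeps connected pairs from being over-penalized.
    return td_loss + alpha * penalty
```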
R4: Weaknesses
- The originality of our work is discussed in R3.
- In R2.1, we provide explanations for the undefined notation and will include these details in the revised version of the paper. As defined in Sec. 2, in Algorithm 1 states are sampled randomly from the dataset, and goals are sampled in two ways: 1) uniformly from all states in , and 2) from the same trajectory as the sampled state with probability , otherwise falling back to 1). A sketch of this sampler is given below.
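A minimal, hypothetical sketch of such a goal sampler (the names, data layout, and the mixing probability `p_traj` are our own illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

def sample_goals(dataset, state_idx, p_traj=0.7, rng=None):
    """dataset["obs"]: (N, obs_dim) array of all states in the dataset;
    dataset["traj_bounds"][i]: (start, end) indices (end exclusive) of the
    trajectory that contains state i."""
    rng = rng or np.random.default_rng()
    goals = np.empty_like(dataset["obs"][state_idx])
    for row, i in enumerate(state_idx):
        _, end = dataset["traj_bounds"][i]
        if rng.random() < p_traj:
            # In-trajectory (hindsight) goal: a state at or after i in the same trajectory.
            j = rng.integers(i, end)
        else:
            # Cross-trajectory goal: any state in the dataset, sampled uniformly.
            j = rng.integers(0, len(dataset["obs"]))
        goals[row] = dataset["obs"][j]
    return goals
```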
R5: Other Comments or Suggestions
- See R2.1.
- See R4.2.
- Future revisions will implement distinct notation per proposition for clarity.
R6: Questions
- Although the proof in the original paper is based on discrete state and action spaces, its key components can also be extended to continuous settings. Non-negative penalty terms for underestimation can be generalized to density-based terms, and concentration bounds for the empirical Bellman operator do not require discretization. The Neumann-series argument ensures that the relevant operator inverse remains well-defined in continuous spaces when the discount factor is strictly less than one (the standard identity is recalled after this list). Therefore, while Theorem 4.1 has not been strictly proven for continuous settings, our algorithm still benefits from the theoretical analysis in continuous environments, as further supported by the experimental results.
- As described in Sec. 2 of our paper, the distance between a state and a goal should satisfy the properties of a quasimetric (Eq. (6)). However, the value function should exhibit an inverse relationship with the distance to the goal (Eq. (7)). Thus we have , which holds when . A worked check of this point is given after this list.
- Please refer to R1 where we discuss the differences between the practical algorithm and Eq. (8).
- Proposition 4.5 is based on Theorem 4.1 by incorporating the quasimetric and replacing with the uniform distribution . Consequently, Proposition 4.5 cannot hold if the conservatism is not included.
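For reference on the continuous-space point above, the Neumann-series argument is the standard identity (notation ours; $P^{\pi}$ denotes the transition operator of policy $\pi$):

$$(I - \gamma P^{\pi})^{-1} \;=\; \sum_{k=0}^{\infty} \gamma^{k} (P^{\pi})^{k},$$

which converges as a bounded linear operator whenever $\gamma < 1$, independently of whether the state space is discrete or continuous.

On the quasimetric point, here is a worked check under the common convention $V^{*}(s,g) = -d^{*}(s,g)$, where $d^{*}$ is the shortest-path (temporal) distance; this convention is an illustrative assumption, not necessarily the paper's exact Eq. (7). In the reviewer's example, with $(A,C)$ and $(C,B)$ unconnected,

$$d^{*}(A,B) \;\le\; d^{*}(A,C) + d^{*}(C,B) \;=\; \infty + \infty \;=\; \infty,$$

so the triangle inequality is satisfied trivially, and the example does not violate the quasimetric axioms.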
Thank you for your response. It resolves my major concern of novelty, so I've updated my review and raised the score to 3.
This paper proposes a method for offline goal-conditioned reinforcement learning that penalizes the value function on unconnected state-goal pairs, and evaluates it on OGBench. The results suggest the method outperforms previous methods on goal-conditioned tasks.
Questions for Authors
- Is there a reason the method is not run for all of the environments on OGBench?
- The method has a lot of important hyperparameters. What is the sensitivity to hyperparameters other than alpha?
Claims and Evidence
The paper makes several claims.
- Offline goal-conditioned reinforcement learning suffers from value overestimation on unconnected state-goal pairs.
Support for this claim is presented through the theoretical analysis in Theorem 3.3 and experimental evidence in both Table 1 and Figure 3.
- The method proposed in this paper addresses the value overestimation and achieves better performance.
Generally this claim is supported by the main results in Table 1, but the table only includes a subset of tasks from OGBench. The evidence would be more convincing if results were shown for all of the OGBench experiments.
Methods and Evaluation Criteria
The method and evaluation criteria appear to be well suited to the problem. The proposed method directly addresses the problem and provides theoretical motivation. The benchmark selection is appropriate.
Theoretical Claims
The proofs appear to be correct.
Experimental Designs and Analyses
The experimental design is valid and the benchmark selection is good. There are ablation studies to back up the claims, and the paper compares against other state-of-the-art methods in offline goal-conditioned reinforcement learning. The experimental section could be improved by providing results on the entire OGBench suite.
Supplementary Material
The supplementary material provides the proofs of the theorems and experimental details.
Relation to Broader Scientific Literature
The key contributions of the paper relate to advancing offline goal-conditioned reinforcement learning. It proposes a new method that addresses an important problem in this research direction.
Missing Essential References
The essential related works are included in the paper.
Other Strengths and Weaknesses
The paper is original and brings ideas from conservative value estimation to offline goal-conditioned reinforcement learning. The approach is well motivated and achieves higher performance than previous algorithms.
A weakness of the paper is that it only compares methods on maze navigation tasks, and it is unclear how the approach would scale to other domains.
Other Comments or Suggestions
It would be helpful to include a mean for each of the environments.
We sincerely appreciate your insightful feedback on our work. Please refer to detailed responses below to each of the raised concerns.
R1: Weaknesses
To provide a more comprehensive evaluation of performance, we have conducted additional experiments in three manipulation environments (Cube, Scene, Puzzle), comprising a total of 8 manipulation tasks of varying complexity. The results, presented in Table 1 (see the linked PDF for details), demonstrate that CGCIVL achieves superior performance on all manipulation tasks, particularly on the Scene and elementary Cube tasks, consistent with the results in maze environments. We will include these additional experimental results in the revised version of the paper.
R2: Questions
- In the original paper, we aimed to validate the algorithm on goal-stitching tasks. However, OGBench currently only provides stitch datasets for maze environments. Nevertheless, we have supplemented our evaluation with additional experiments in manipulation environments to further verify the algorithm's performance (see our response in R1). We will expand the experimental scope and incorporate these additional results into the final version of the paper.
- Besides the analysis of , we have also performed sensitivity studies on both the penalty coefficient and the subgoal interval . The analysis of and the related experiments can be found in our response R2 to Reviewer Q24N. Figure 2 (see the linked PDF for details) shows the performance of CGCIVL across different subgoal interval sizes. Results indicate that CGCIVL achieves optimal performance with values between and . Overly small values lead to the "signal-to-noise" issue in the value functions identified in the HIQL paper [1], while excessively large values make subgoals difficult to achieve.
R3: Other Comments or Suggestions
We will include the mean score for each environment in the revised version of our paper to provide more comprehensive results.
[1] Park, S., Ghosh, D., Eysenbach, B., and Levine, S. HIQL: Offline goal-conditioned RL with latent states as actions. Advances in Neural Information Processing Systems, 36, 2024b.
This paper proposes an algorithm for goal-conditioned offline RL called Conservative Goal-Conditioned Implicit V-Learning (CGCIVL). CGCIVL improves upon Hierarchical Implicit Q-Learning (Park et al., 2024b) by introducing two techniques. First, it adopts a regularizer similar to CQL (Kumar et al., 2020) to penalize values for unconnected state-goal pairs. Then, based on the observation that a goal-conditioned value function is a quasimetric, it models the value function with Interval Quasimetric Embeddings to prevent over-penalization of values for connected state-goal pairs. CGCIVL outperforms existing baselines on the OGBench (Park et al., 2024a) benchmark containing various goal-reaching tasks.
Questions for Authors
I do not have any additional questions for the authors.
Claims and Evidence
Most of the claims made in the submission are supported by clear and convincing evidence. For those that are problematic, refer to the following sections.
Methods and Evaluation Criteria
It is unclear why the authors use instead of the distilled to estimate the advantage functions and . Aside from that, the proposed methods and the evaluation criteria make sense for the problem.
Theoretical Claims
The in Proposition 4.3 depends on the choice of the state-goal pairs, which means there might be no that satisfies the condition for all state-goal pairs. The proposition then becomes irrelevant, since is fixed for the entire training process. As Propositions 4.4 and 4.5 are both based on Proposition 4.3, those two propositions are also irrelevant.
Experimental Designs and Analyses
The penalty coefficient also seems to play an essential role in the algorithm, but the authors have not conducted a sensitivity analysis on it.
Supplementary Material
I have gone through the proofs in the appendix.
Relation to Broader Scientific Literature
The proposed algorithm is mainly based on HIQL (Park et al., 2024b). The penalization term for unconnected state-goal pairs was inspired by CQL (Kumar et al., 2020). The observation that an optimal goal-conditioned value function is a quasimetric was proved by Liu et al. (2023). Finally, the authors modeled their value function using IQE (Wang & Isola, 2022a).
Missing Essential References
To the best of my knowledge, the paper has cited all of the essential references.
Other Strengths and Weaknesses
Trajectory stitching is necessary for real-world problems because collecting high-quality data is challenging. This paper proposes an interesting method of applying HER for cross-trajectory state-goal pairs.
Other Comments or Suggestions
CQL adds a term to the loss function that maximizes the values for in-distribution data so that the regularizer is canceled out for in-distribution data. Similarly, adding a loss function term that maximizes the values for connected state-goal pairs might be helpful.
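For reference, the regularizer in CQL referred to here has the form (notation follows Kumar et al., 2020; $\mu$ is the distribution used to query out-of-distribution actions):

$$\alpha\left(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu(\cdot\mid s)}\big[Q(s,a)\big] \;-\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[Q(s,a)\big]\right),$$

added to the Bellman error, where the subtracted term is the value maximization on in-distribution data. The suggestion would analogously add a term that pushes up the values of connected state-goal pairs.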
We sincerely appreciate the reviewer’s constructive feedback. Below, we respond to each concern point-by-point.
R1: Methods and Evaluation Criteria
The choice between using or the distilled for advantage estimation ( and ) is flexible, as both approaches are empirically valid, surpassing all baselines. Our experiments confirm that achieves comparable performance when used for policy extraction, suggesting either value function can be adopted without compromising results. We will clarify this point in the final paper.
| | pointmaze-large-navigate | pointmaze-giant-navigate | pointmaze-large-stitch | pointmaze-giant-stitch |
|---|---|---|---|---|
| CGCIVL (with ) | | | | |
| CGCIVL (with distilled ) | | | | |
R2: Theoretical Claims
Thank you for raising this question. Here we briefly explain why there exists an that satisfies the condition for all state-goal pairs. The proof of Proposition 4.3 demonstrates that for any fixed and any tuple sampled from the dataset, where is in-trajectory and is cross-trajectory, there exists an such that the inequality holds when . Consequently, letting , we can find a static that satisfies the condition for all state-goal pairs. We will provide further clarification in the revised version of the paper.
R3: Experimental Designs or Analyses
As suggested, we have conducted additional ablation studies to analyze the sensitivity of the penalty coefficient , and the results are presented in Figure 1 (see the linked PDF for details). The results indicate that both excessively small values (causing insufficient regularization) and excessively large values (over-constraining the optimization) degrade performance. In practice, we determine the optimal value through empirical validation across multiple candidates.
R4: Other Comments or Suggestions
Maximizing values for connected state-goal pairs is indeed an interesting direction. However, we might need to carefully address several practical considerations: 1) directly sampling from the distribution of connected state-goal pairs in the dataset is difficult, and 2) further analysis is required to establish appropriate theoretical bounds, similar to those in CQL, when incorporating this approach. We plan to thoroughly explore these open issues in future work.
Thank you for your response. However, I still have one question. The proof of Proposition 4.3 in the current version of the paper does not seem to mention the existence of a global upper bound on . Could you elaborate on why such a bound should exist?
Thank you for your additional comments. Proposition 4.3 indicates that for any , there exists an such that the inequality holds. Furthermore, by the current proof, the conclusion of Proposition 4.3 also holds when is greater than . Since the offline dataset is finite, there are only finitely many combinations of . Therefore, we can select the maximum value from this finite set of as the upper bound. Thus, in theory, we can use a fixed that is not too small during training to obtain a lower estimate for the values of cross-trajectory state-goal pairs. We will include this clarification in the next version of the paper.
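To spell the argument out with hypothetical notation (the symbols below are ours, not the paper's): let $\alpha_{0}(s, g, g')$ denote the per-tuple threshold above which the inequality in Proposition 4.3 holds. Then

$$\alpha^{*} \;:=\; \max_{(s,\,g,\,g') \in \mathcal{D}\times\mathcal{D}\times\mathcal{D}} \alpha_{0}(s, g, g') \;<\; \infty$$

is well defined because the finite dataset induces only finitely many such tuples, and any fixed $\alpha \ge \alpha^{*}$ then satisfies the inequality simultaneously for all state-goal pairs.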
The authors present a way to incorporate stitching into hindsight experience replay for offline goal-conditioned RL using conservatism. This is a really interesting problem in goal-conditioned RL, where there should be stitching in the goal space as well as the state space. While standard RL takes care of stitching in the state space, hindsight experience replay does not allow for stitching in the goal space. The authors propose to use quasimetric structure and conservatism to generalize to unconnected state-goal pairs.
There were concerns about the completeness of the ablations and experiments, and about incomplete definitions in the theorems. However, the authors provided additional results and clarifications in the rebuttal that assuaged most of the concerns. I recommend the authors make the suggested edits in the camera-ready version.