Q-Supervised Contrastive Representation: A State Decoupling Framework for Safe Offline Reinforcement Learning
To address the OOD issue during testing for safe offline RL, we propose the first framework that decouples the global observations into reward- and cost-related representations through Q-supervised contrastive learning for decision-making.
Abstract
Reviews and Discussion
This paper introduces State Decoupling with Q-supervised Contrastive representation (SDQC), a framework that decouples observations into reward-related and cost-related representations to improve generalization and address out-of-distribution issues. The proposed framework is evaluated on the DSRL benchmarks. The results point to a better capacity of SDQC in learning safe and generalizable policies compared with existing approaches.
Strengths
This paper theoretically proves that SDQC generates a coarser representation than bisimulation-based methods while maintaining the optimal policy, evidenced by achieving higher information entropy of the global observations when conditioned on the representations.
Weaknesses
- The proposed method does not seem to be particularly elaborate or innovative, as it mainly makes use of known results/techniques.
- The evaluation results are not convincing, and certain comparisons with model-based methods are absent.
Questions
- In Section 3.2, the authors state that the representation and Q-values are learned and updated jointly (lines 243-244). How could this result in optimal policies and good representations? One could imagine that the Q-values during training are not yet optimal, and using these estimated Q-values to compute the similarity measure could introduce significant estimation errors for representation learning, which could in turn lead to inaccurate Q-value estimates.
- Why are the trade-off policy and the upper-bound cost-related value function necessary? In Section 3.3, the authors adopt a similar approach to Zheng et al. (2024) for updating Q-values and policies, differing primarily by replacing the global observations with the learned representations and by introducing the trade-off policy and the upper-bound cost-related value function. However, Zheng et al. (2024) demonstrate that optimal policies can be achieved without these two components when excluding representations. The authors assert that their representation method does not alter the optimal policy (lines 333-334). It seems possible that integrating the representation method directly into Zheng et al. (2024)'s approach could still yield optimal policies. Could the authors clarify the rationale behind introducing them?
- In Section 3.4, while the authors provide theoretical proof that their representation method outperforms bisimulation, empirical comparisons with methods that use bisimulation are absent. Could the authors provide additional evaluation results comparing their method with bisimulation-based approaches for a thorough evaluation?
- The evaluation metrics (lines 356-359) are unclear. While the authors highlight zero-constraint violation in Section 4, they use a non-zero cost threshold for evaluation. Why do the authors emphasize 'zero-constraint violation' when the cost threshold is not set to zero? If my understanding is correct, the evaluation should focus on comparing rewards among safe agents with normalized costs below 1, as per the DSRL benchmark. Additionally, SDQC seems to achieve safety through over-conservatism, as indicated by the low rewards (less than 0.2 in CarButton, PointButton, PointPush, and SwimmerVel tasks). Could the authors clarify the reason for these low rewards and provide the actual reward results of all tasks for better insight?
Besides, please ensure that proper citations are included where necessary.
We thank Reviewer 8jhZ for providing valuable suggestions on our research. We address the concerns in your review point by point below.
[W1] Lack of Novelty
We regret that our paper gave the impression of lacking novelty. In fact, we believe that our main contribution is the idea of decoupling global observations into a reward-related representation and a cost-related representation for decision-making. To the best of our knowledge, we are the first to utilize representation learning in continuous state-based RL tasks, and the first to introduce the concept of state decoupling for decision-making.
[W2] Lack of Comparison with Model-Based Methods
To the best of our knowledge, there are no model-based safe offline RL algorithms at present. For the comparison with bisimulation, please refer to our response to [Q3].
[Q1] Convergence of Joint Optimization
We agree that the joint optimization of the value functions and representations poses a potential risk of training instability. However, experimental results indicate that, although the inclusion of representation loss results in an increase in critic and value loss, it does not compromise the overall stability of the training. We have added a new section (Appendix F.2, Pages 28-29, lines 1502-1537) with new Figure 9 to present the value function estimation error during the training process. Notably, our proposed neural network architecture, which incorporates an attention-based state encoder, significantly enhances the precision and stability of value function learning compared to the simple MLP used by FISOR. In fact, we had previously recognized this issue intuitively and attempted to address it by employing a soft-start training approach, which involves introducing the representation contrastive term only after the value function converges. However, experimental results indicate that this method is not effective in clustering the representations, leading to suboptimal final performance.
[Q2] Necessity of the Trade-Off Policy
The proposed SDQC framework encompasses three distinct policies: the reward policy, which prioritizes maximizing rewards while disregarding costs; the trade-off policy, which considers both rewards and costs; and the cost policy, which aims to escape dangerous regions as swiftly as possible. We emphasize that the trade-off policy is essential and cannot be omitted.
For any given state $s$, we denote the set of feasible actions within the offline dataset as $\mathcal{A}_f(s)$, i.e., the actions that can still ensure a completely safe subsequent trajectory; $\mathcal{A}_f(s) = \emptyset$ indicates that no such action exists. Our reward policy relies solely on reward-related representations and neglects cost-related information. Consequently, if we use only the infeasibility condition $\mathcal{A}_f(s) = \emptyset$ as the switching threshold, it may be too late for the agent to transition from the reward policy to the cost policy once it has already entered an infeasible state. By introducing the upper-bound cost-related value function, which reflects the existence of an action that could potentially lead to an unsafe trajectory in the future, we enable the agent to consider both types of information through the trade-off policy. In fact, a naive implementation of this switching scheme within our SDQC framework is analogous to the safe policy in FISOR. However, owing to our incorporation of representation learning and the introduction of the trade-off policy, our SDQC exhibits superior generalization capabilities.
To substantiate this point with experimental evidence, we have conducted ablation studies on the deployment of the three distinct policies in Appendix D (Page 26, lines 1393-1411). The numerical results clearly demonstrate that, in the absence of a trade-off policy, a naive combination of the reward and cost policies leads to higher costs. Such an outcome can be attributed to the agent's inability to respond promptly to borderline dangers. Additionally, we have included a GIF in the supplementary materials to illustrate the trajectories generated by the collaboration of the three policies, as well as the trajectories produced by each of the three naive policies in isolation.
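To make the switching logic above concrete, here is a minimal sketch of a deployment-time rule of this kind, assuming access to trained lower-/upper-bound cost value estimates; the function names (`v_cost_lower`, `v_cost_upper`) and the zero threshold are illustrative assumptions rather than the paper's exact interfaces.

```python
import numpy as np

def select_action(obs, reward_policy, tradeoff_policy, cost_policy,
                  v_cost_lower, v_cost_upper, eps=0.0):
    """Illustrative three-policy switching rule (names are assumptions).

    v_cost_lower(obs) > eps : no in-dataset action can keep the future
                              trajectory safe  -> escape via the cost policy.
    v_cost_upper(obs) > eps : some action could lead to an unsafe trajectory
                              -> borderline case, use the trade-off policy.
    otherwise               : the state is safe -> reward-only policy.
    """
    if v_cost_lower(obs) > eps:
        return cost_policy(obs)
    if v_cost_upper(obs) > eps:
        return tradeoff_policy(obs)
    return reward_policy(obs)

# toy usage with dummy callables
obs = np.zeros(4)
dummy_policy = lambda o: np.zeros(2)
action = select_action(obs, dummy_policy, dummy_policy, dummy_policy,
                       v_cost_lower=lambda o: -1.0,
                       v_cost_upper=lambda o: 0.5)
```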
[Q3] Comparison with Bisimulation
Initially, we attempted to complete the training of representations using bisimulation. However, we encountered challenges during the initial training phase. Specifically, estimating cost functions proved to be difficult due to their non-smooth and sparsely distributed nature. To substantiate this claim, we have added a new section (Appendix F.3, Page 29, lines 1539-1582) where the cost-value estimation accuracy, precision, recall, and corresponding F1 score are reported. The experimental results indicate that the model correctly identifies actual dangerous state-action pairs with a probability of only about 77%, suggesting that the model’s ability to assess dangerous conditions is insufficient to support effective bisimulation training in subsequent stages. Therefore, we propose learning representations based on optimal Q functions to address these issues, owing to the continuity and non-sparsity of the Q functions. From a theoretical standpoint, our optimal Q based representation approach is more advantageous as it introduces a coarser representation.
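For intuition, the sketch below shows one way a Q-supervised soft contrastive objective of this kind could look in PyTorch; the exponential kernel over Q-value differences and the temperature names `eta`/`nu` are assumptions, and the actual Eqs. (5) and (8) in the paper may differ.

```python
import torch
import torch.nn.functional as F

def q_supervised_contrastive_loss(reps, q_values, eta=1.0, nu=0.1):
    """Sketch: pull together representations of states whose optimal Q-values
    are close, using the (smooth, non-sparse) Q-values as supervision.

    reps     : (B, d) state representations for a batch of anchors
    q_values : (B,)   optimal Q estimates for the same states
    """
    # soft similarity target: states with close Q-values get high weight
    q_diff = (q_values[:, None] - q_values[None, :]).abs()
    target = F.softmax(-q_diff / eta, dim=-1)            # (B, B), rows sum to 1

    # representation similarity, sharpened by a temperature
    reps = F.normalize(reps, dim=-1)
    logits = reps @ reps.t() / nu                        # (B, B)

    # match the representation-similarity distribution to the Q-based target
    return F.kl_div(F.log_softmax(logits, dim=-1), target.detach(),
                    reduction="batchmean")

# toy usage
reps = torch.randn(8, 16, requires_grad=True)
q_vals = torch.randn(8)
q_supervised_contrastive_loss(reps, q_vals).backward()
```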
[Q4] Unclear Evaluation Metrics
We acknowledge Reviewer 8jhZ for the valuable feedback. We assert that both FISOR [1] and our SDQC framework aim to achieve Zero Cost Violations, a goal that has not been addressed by previous baseline algorithms. We believe that in certain domains where safety is critical, a zero-cost setting is essential. For instance, in autonomous driving, it is imperative that a vehicle does not physically collide with any objects. Most prior works have focused on soft constraints, and the implementation of a zero-cost threshold has led to training instability. Consequently, we have adopted the approach used in FISOR [1] by establishing a small threshold. In the revised manuscript, we have revised our description of the evaluation metrics as follows:
Our ultimate objective is to achieve a zero-cost deployment, aligning with the framework established by FISOR [1]. However, most baseline algorithms struggle to operate effectively under a zero-cost threshold. Consequently, in accordance with FISOR [1], we impose a stringent cost limit of 10 for the Safety-Gymnasium environment and 5 for Bullet-Safety-Gym. We employ the metrics of normalized return and normalized cost for evaluation, where a normalized cost below 1 signifies a safe operation.
Furthermore, while the SDQC framework adopts more conservative strategies in certain environments to achieve lower costs, we believe this approach holds practical significance. Additionally, we used a unified set of hyperparameters at the domain level, without fine-tuning for specific tasks or environments; such task-specific tuning could potentially yield improved performance.
Regarding the reward metric, we adhere to the environment settings established by DSRL and report normalized values, as the reward signals vary significantly across different tasks (for instance, the maximum reward for CarGoal2 is approximately 30, whereas for HalfCheetah-Velocity, it is around 2700).
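For clarity, a small helper illustrating the DSRL-style normalization described above; the per-task `r_min`/`r_max` values come from the benchmark, and the cost limits (10 for Safety-Gymnasium, 5 for Bullet-Safety-Gym) follow the FISOR setup stated earlier. The example numbers are made up.

```python
def normalize(episode_return, episode_cost, r_min, r_max, cost_limit):
    """DSRL-style metrics (sketch): a normalized cost below 1 counts as safe;
    normalized return rescales raw returns with per-task r_min / r_max."""
    norm_return = (episode_return - r_min) / (r_max - r_min)
    norm_cost = episode_cost / cost_limit
    return norm_return, norm_cost

# hypothetical CarGoal2-like numbers (max reward ~30, cost limit 10)
print(normalize(episode_return=24.0, episode_cost=4.0,
                r_min=0.0, r_max=30.0, cost_limit=10))   # -> (0.8, 0.4)
```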
[1] Zheng, Yinan, et al. "Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model." International Conference on Learning Representations, 2024.
I appreciate the authors' response and clarifications. However, my concerns and questions are not fully addressed:
Q1: The explanation provided is not convincing. The authors showed results for the value function estimation error during the training process, but this does not directly answer my question. Could the authors provide the actual results for the value functions and Q-functions during training?
Additionally, could the authors explain why introducing the representation contrastive term only after the value function converges does not improve performance? Intuitively, this approach should lead to a better soft similarity measure as per Eq. (5), which could enhance representation learning and ultimately benefit performance, such as lowering costs and increasing rewards.
I also observed in Table 6 that as the contrast coefficient increases, the performance appears to degrade, with both lower rewards and higher costs. Could the authors clarify this? Based on Eqs. (8) and (12), a larger coefficient should result in better state representations, which should positively impact performance.
Q4: Could the authors provide results for SDQC and FISOR with a zero cost threshold? The authors highlighted that "their proposed neural network architecture significantly improves the precision and stability of value function learning compared to the simple MLP used by FISOR". Therefore, I would expect SDQC to achieve zero constraint violations and demonstrate better performance than FISOR under such conditions.
We acknowledge Reviewer 8jhZ for providing further feedback.
Value Function Estimations During Training Process
We have revised Figure 9 to include the actual-value curves for both the value functions and Q-functions during the training process. In the "CarPush2" task, FISOR experiences severe training instabilities, and both the value functions and Q-functions exhibit significant overestimation problems. In contrast, our proposed attention-based state encoder results in a more stable training process. For the "BallCircle" task, the actual-value curves for the three algorithms show almost no differences.
Why Introducing Representation Term after Value Function Converges Does Not Work
We believe that this phenomenon can be explained as follows: when we focus solely on optimizing the value functions without representation learning, some states that should have a close similarity measure (characterized by a small distance measure) may end up being far apart in the representation space. Once the training of the value function converges, the neural network may have settled into a local minimum. Introducing representation learning at this late stage poses significant challenges. The entrenched parameters of the converged value function network can inhibit the optimization of the representation network, as even minor changes could disrupt the converged value estimates, leading to instability. As a result, this soft-start training approach yields suboptimal final performance because the representation cannot effectively cluster similar states.
Why Larger Contrast Coefficient Leads to Worse Performance
In the field of machine learning, when the total loss comprises multiple components, it is essential to balance the coefficients of each component. A larger coefficient indicates a greater emphasis during the neural network's update process. Within our SDQC framework, if the coefficient for the representation component is excessively large, the neural network may disproportionately focus on learning the representation, potentially leading to suboptimal performance of the value function. In such instances, manual adjustment of the coefficients is often necessary, providing a trade-off to achieve optimal performance. This manual balancing is a common practice in RL [1,2].
Zero Cost Threshold for SDQC and FISOR
As we emphasized throughout the paper, our SDQC is consistent with FISOR and is dedicated to achieving zero-cost violations. Similar to FISOR, our SDQC cannot adjust the threshold; in other words, our threshold is consistently set to zero. The experimental results presented in Table 1 for both FISOR and SDQC already reflect outcomes with this zero-cost threshold. Our algorithm achieves zero-cost on nearly half of the tasks, whereas the previous state-of-the-art algorithm, FISOR, achieved this on only one-quarter of the tasks. To the best of our knowledge, no current Safe RL algorithm, including those for Safe online RL, can achieve zero cost across all state-based environments. In this regard, our SDQC framework demonstrates remarkable performance.
[1] Fujimoto, Scott, et al. "A minimalist approach to offline reinforcement learning." Advances in neural information processing systems, 2021.
[2] Wang, Zhendong, et al. "Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning." The Eleventh International Conference on Learning Representations, 2023.
Dear Reviewer 8jhZ:
Many thanks for your dedicated review and constructive suggestions, which have led to significant improvements in the revised manuscript. As the deadline for revision and discussion is approaching, we would like to follow up and seek your further valuable feedback. We would greatly appreciate it if you could re-evaluate our manuscript and provide additional insightful comments. We sincerely hope to continue the discussion with you and address any remaining concerns.
Best regards,
Authors of Paper Submission #3367
Thank you for providing additional clarifications. However, my primary concern about the convergence of joint optimization remains unresolved. In your response to Reviewer FtTa, you stated that "Contrastive Loss Serves as an Auxiliary Loss." Could the authors further clarify this? It seems that every loss term in Eq. (8) and Eq. (12) depends on the learned representations, and the representation learning loss, in turn, depends on the learned reward- and cost-related Q-values. The learning of value functions and representations is fully intertwined and occurs simultaneously.
We sincerely thank Reviewer 8jhZ for providing further feedback. Please refer to our network settings as detailed in Appendix C.2, on pages 22-23, lines 1168-1239. The representation layer is the intermediate layer of the neural network, positioned between the input ground-truth observations and the output Q- and V-values. By setting the contrastive coefficient to zero in Eqs. (8) and (12), the representation term can be entirely disregarded, and the network will eventually converge due to the nature of in-sample learning. When the contrastive term is added as an auxiliary loss, the intermediate layer clusters according to the optimal Q-values. Our experimental results demonstrate that this approach is effective in terms of both downstream performance (see Table 1 on page 8) and learning-curve stability (see Figure 10 on page 29).
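As an illustration of the layout just described (not the paper's exact architecture, which uses an attention-based state encoder), here is a minimal sketch in which the representation is the intermediate layer of a value network and the contrastive term enters only as an auxiliary loss whose coefficient can be set to zero; layer sizes and names are assumptions.

```python
import torch
import torch.nn as nn

class ValueNetWithRepresentation(nn.Module):
    """Sketch: raw observation -> intermediate representation -> value."""
    def __init__(self, obs_dim, rep_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                     nn.Linear(256, rep_dim))
        self.value_head = nn.Sequential(nn.ReLU(), nn.Linear(rep_dim, 1))

    def forward(self, obs):
        rep = self.encoder(obs)          # the representation layer
        return rep, self.value_head(rep)

def total_loss(value_loss, contrastive_loss, lambda_contrast):
    # lambda_contrast = 0 recovers plain in-sample value learning;
    # a positive value adds the contrastive term as an auxiliary loss.
    return value_loss + lambda_contrast * contrastive_loss

# toy usage
net = ValueNetWithRepresentation(obs_dim=8)
rep, value = net(torch.randn(32, 8))
```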
Additionally, you may refer to PSEs [1], an online reinforcement learning algorithm for image-based tasks. (PSEs learn representations based on policy similarities, while our SDQC approach learns representations based on optimal Q values.) They use similar settings, deriving the policy from representations and clustering the representations based on policy similarities. The policy and representations are jointly optimized, and they refer to the contrastive term as the auxiliary loss. We followed their naming conventions. Besides, we also adopted their soft similarity measures, as cited in Sections 3.1 and 3.2.
[1] Agarwal, Rishabh, et al. "Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning." International Conference on Learning Representations, 2021.
Dear Reviewer 8jhZ,
We hope that our recent discussions have effectively addressed all your concerns. As the deadline for the discussion period draws near, we would greatly appreciate it if you could re-evaluate our manuscript and provide any feedback.
Thank you for your time and consideration.
Best regards,
Authors of Paper Submission #3367
This paper introduces a contrastive learning framework for safe offline RL problems. It decouples states into reward-related and cost-related representations based on a reward-related Q network and a cost-related Q network. The Q networks are used as the supervision signal in the contrastive learning of the state representations. As a result, the learned representation clusters states with similar reward or cost Q-values, enhancing generalization in OOD environments. The framework also learns three policies and switches between the policies based on the safety assessments for each state. Experimental results show that SDQC provides a robust solution for safe RL across multiple offline safe RL tasks and also shows generalization in unseen environments.
Strengths
- Originality: SDQC’s approach of decoupling reward and cost representations through contrastive learning is a unique solution in offline safe RL.
- Quality: The paper has a theoretical comparison with bisimulation and an empirical comparison with existing offline safe RL baselines.
- Clarity: The paper clearly explains the motivation and the method.
- Significance: SDQC’s approach could learn policy to generalize to OOD scenarios better than existing methods. It has a meaningful impact on offline safe RL.
Weaknesses
- The comparison with bisimulation seems redundant and unfocused. While SDQC claims theoretical advantages over bisimulation methods, no empirical comparisons with bisimulation techniques are provided. If there are no such methods in safe offline RL, why add this comparison?
- The framework learns two types of state representations and trains three separate policies, which may add considerable computational overhead compared to most single-policy baselines.
- None of the baselines in the paper incorporate explicit representation learning as in SDQC. This difference may give SDQC a structural advantage in terms of performance, though at a higher computational cost.
Questions
- Q1: How does SDQC’s computational and sampling cost compare to baseline models that do not include representation learning? Does the contrastive learning component require sampling and evaluating Q-values over large numbers of state pairs? Although the ablation study shows that contrastive learning is essential for the algorithm, how do the parameters in the contrastive loss affect the results?
- Q2: Given the theoretical comparison with bisimulation, would a bisimulation-inspired method indeed underperform relative to the proposed method in safe offline RL? Could the authors provide an empirical comparison using the same offline safe RL datasets featured in the experimental section?
- Q3: How does SDQC manage high-dimensional or complex state spaces, such as visual input, where the demands of contrastive learning may increase the challenge of representation learning?
[Q1] Computational Cost During Training/Testing Phase. Ablation studies on the Contrastive-related Hyperparameters
We have detailed the computational costs for both the training and testing phases of SDQC in Appendix E.3 (Page 27, lines 1448-1462). During training, SDQC involves evaluating Q-values within a sampling batch, and the computational cost is significantly influenced by the number of anchors used. Table 5 (Page 26, lines 1359-1367) presents the computing time under different anchor choices. Notably, for the "CarPush2" task, using 8 anchors results in a runtime of 32.7 seconds per epoch (defined as 1000 gradient steps). Without contrastive learning, the runtime decreases to 23.7 seconds--the difference is not substantial. In the testing phase, SDQC requires 11.13 seconds for inference over 1000 RL timesteps, which is slightly higher than FISOR (6.11 seconds), yet significantly lower than TREBI (585.87 seconds).
Furthermore, we have added ablation studies on the contrastive-related hyperparameters (the contrastive term coefficient and the exponential temperature). With respect to the temperature, employing a very small value tends to destabilize the training process, ultimately resulting in collapse. Conversely, using a larger value produces poorly clustered representations, leading to a marked degradation in performance. Regarding the term coefficient, a smaller value results in a slight performance decline. However, a larger coefficient excessively prioritizes the contrastive loss, destabilizing the training of the value function and degrading performance. More detailed results and discussions can be found in Appendix D (Page 26, lines 1369-1390). While fine-tuning these hyperparameters for specific environments and tasks could potentially yield better experimental results on the benchmark, we choose not to do so.
[Q2] Additional Experiments on Bisimulations
Initially, we attempted to complete the training of representations using bisimulation. However, we encountered challenges during the initial training phase. Specifically, estimating cost functions proved to be difficult due to their non-smooth and sparsely distributed nature. To substantiate this claim, we have added a new section (Appendix F.3, Page 29, lines 1539-1582) where the cost-value estimation accuracy, precision, recall, and corresponding F1 score are reported. The experimental results indicate that the model correctly identifies actual dangerous state-action pairs with a probability of only about 77%, suggesting that the model’s ability to assess dangerous conditions is insufficient to support effective bisimulation training in subsequent stages.
[Q3] Extension on Visual-Based Environments
We sincerely thank Reviewer zft8 for the valuable comments. To the best of our knowledge, current safe offline RL algorithms have not yet addressed image-based tasks, and there are no relevant datasets available in the test benchmarks. It is well known that image-based tasks are more complex than state-based tasks and depend more heavily on representation learning. Nevertheless, due to time constraints, we are unable to apply our SDQC in image-based environments at this time. We plan to explore this avenue in our future research.
[1] Castro, Pablo, et al. "Using bisimulation for policy transfer in MDPs." Proceedings of the AAAI conference on artificial intelligence, 2010.
[2] Castro, Pablo. "Scalable methods for computing state similarity in deterministic markov decision processes." Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[3] Zhang, Amy, et al. "Learning Invariant Representations for Reinforcement Learning without Reconstruction." International Conference on Learning Representations, 2021.
[4] Agarwal, Rishabh, et al. "Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning." International Conference on Learning Representations, 2021.
[5] Lee, Vint, et al. "DreamSmooth: Improving Model-based Reinforcement Learning via Reward Smoothing." International Conference on Learning Representations, 2024.
The authors make valid points. Thanks for clarification. I will maintain my score.
We thank Reviewer zft8 for recognizing the importance of our work. Please see our response to your concerns below.
[W1] Why Compared with Bisimulation
As bisimulation is one of the most widely used representation learning methods in the realm of reinforcement learning [1–4], we choose to establish theoretical comparisons with it. In particular, we demonstrate that the optimal-Q-based representations proposed in our SDQC yield coarser representations than those derived from bisimulation.
While implementing our original idea of decoupling global observations into reward- and cost-related representations, bisimulation initially appeared to be a natural choice for training. Nevertheless, we found that bisimulation fails even at the initial model-estimation stage due to the non-smooth nature of the cost function and its sparse value distribution. Similar issues have been reported in sparse reward RL tasks [5]. For further details, please refer to our response in [Q2]. Alternatively, our representation method based on optimal Q values resolves these issues due to the continuity and non-sparsity of the Q functions.
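For reference, a standard (on-policy) bisimulation metric of the kind studied in [2, 3] is the fixed point of the recursion below; replacing the smooth reward term with a sparse, non-smooth cost is what makes the first term, and hence the whole metric, hard to estimate in our setting.

```latex
\begin{equation*}
  d^{\pi}(s_i, s_j) \;=\;
  \bigl|\, r^{\pi}_{s_i} - r^{\pi}_{s_j} \,\bigr|
  \;+\; \gamma\, W_1\!\bigl(P^{\pi}(\cdot \mid s_i),\, P^{\pi}(\cdot \mid s_j);\, d^{\pi}\bigr),
\end{equation*}
```

where $W_1(\cdot,\cdot;\,d^{\pi})$ denotes the 1-Wasserstein distance computed under the metric $d^{\pi}$ itself.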
[W2] Additional Computational Costs
We admit that the training of SDQC requires higher computational cost compared with other baseline algorithms, which we have listed as a limitation in Appendix G. However, during the deployment (testing) process, its inference speed is not significantly lower than that of diffusion-based policies. Please refer to our response in [Q1].
[W3] None of Baseline Algorithms Use Representation Learning for State-Based RL
Indeed, this is the key innovation of SDQC compared to existing baseline algorithms. Our experiments demonstrate that representation learning significantly enhances the model's generalization ability. To the best of our knowledge, we are the first to introduce representation learning for state-based tasks within the field of safe offline RL.
This paper proposes a collection of algorithms for reducing the impact of OOD samples on constraint satisfaction when using RL to solve constrained MDPs. The main hypothesis seems to be that learning independent latent representations for state-conditioned value and state-conditioned cost, and then selectively conditioning several different policies on these representations, may help decouple cost-related and reward-related concepts within the networks parameterizing the policies and thus lead to more robust decision making with respect to cost.
Strengths
This paper presents an interesting approach that, although it does not contain much novelty in each individual aspect, does as a whole seem relatively new.
The paper addresses an important topic in RL and has some relevance to representation learning.
The paper demonstrates impressive experimental results.
The large amount of math notation is handled well and I did not find it confusing.
Weaknesses
This paper has no major weaknesses in my opinion. Below I offer some smaller suggestions.
The biggest weakness in my opinion is that, with a very complicated method like the one the authors present, it is difficult as a reader to follow all the moving parts. I understand there is limited space, and thus many details must be omitted or relegated to the appendix, but I offer two concrete suggestions.
- Provide a diagram, pseudocode/algorithm block, or add some text within the writing that transitions the reader from one part to another while providing some context. At a certain point I felt like I was reading a long list of individual method descriptions without really understanding how they fit together.
- There are many, many steps to the proposed system. Almost none of them can be derived from first principles. Thus, they all need some form of justification. Often, AI/ML researchers just assert that their method makes some intuitive sense. That is sometimes sufficient, but it is much more convincing when the authors provide a list of properties they desire or a list of alternatives followed by some reasons why they think those alternatives would not work as well. This will also help the reader understand why each step is being taken, rather than simply that it has been taken.
Given the extreme complexity and many moving parts of the proposed approach, as well as how the paper is written, I am skeptical about its scientific value. Engineering-wise it is a very impressive system and stands as a good example of what kind of performance we might hope to achieve on some of these RL benchmarks. However, it is very difficult to be certain that we have learned something general (about representation learning, reward/cost decoupling, etc.) from this paper. Every time a new module is added it becomes exponentially more difficult to conduct or design the experiments necessary to rule out its confounding effect on whatever dependent variable we are interested in measuring, and conference papers generally do not contain enough space for such experiments.
Miscellaneous:
“such as industrial management, autonomous driving, and robot control” Is autonomous driving not a subset of robot control?
“(the set of the probability distribution over \mathcal{X} is denoted as \Delta (\mathcal{X}))” I don’t see \mathcal{X} defined before this… Maybe this is meant to be states \mathcal{S}? That would make the most sense.
“predefined cost limits” --> predefined cost limit
Try to avoid using citations as nouns e.g. "The most relevant works to ours are (Bellemare et al., 2019; Le Lan et al., 2021), which learn representations via Bellman value functions." You could instead say something like "The most ... are from Bellemare et al. (2019) and Le Lan et al. (2021), who ..."
Questions
- “(with \eta representing the temperature factor)” How to set \eta?
- “where \nu is a temperature parameter” How to set \nu?
- I don’t see how temperatures are set anywhere, though I do see some values in the appendix. Does this complicate hyperparameter tuning? I know that many algorithms are very sensitive to temperature settings. Is that the case for SDQC as well?
We thank Reviewer 6xP2 for providing positive feedback and recognizing the effectiveness of our work. Please see our response to your questions below:
[W1] Lack of Pseudocode
We apologize for any confusion resulting from the absence of pseudocode. We acknowledge that SDQC necessitates multi-phase training, and a detailed pseudocode can improve clarity. We have included it in our updated PDF documents, available on Page 21, lines 1099-1124.
[W2] Clear Justification
We agree that providing clear justification for each step is essential for an intuitive understanding of the proposed algorithms. To address this issue, we have added a summary section (Appendix C.1, Pages 21-22, lines 1094-1166) on the training and deployment of SDQC to enhance understanding of the process. In Appendix C.1, we explain the purpose of each phase of training and the rationale behind our three distinct policies.
[W3] Scientific Value of the Paper
We sincerely thank Reviewer 6xP2 for recognizing the excellent engineering performance of SDQC. While we acknowledge that SDQC is indeed complex to implement, our algorithm requires minimal hyperparameter tuning and is adaptable to various domains. We believe that our primary contribution lies in the novel idea of decoupling the state into reward-related and cost-related representations. Our experimental results demonstrate that this approach generalizes effectively to unseen states in the dataset. Although training SDQC is currently complex, it offers valuable insights for advancing safe reinforcement learning, particularly in enhancing the generalization capabilities of future algorithms.
[W4] Writing Issues
We acknowledge Reviewer 6xP2 for detailed guidance on improving the writing of the paper. We have thoroughly reviewed the manuscript, corrected the typographical errors, and made revisions where necessary.
[Q1&Q2] Temperature-related Parameters
The settings for the temperature-related hyperparameters are detailed in Table 3 (Page 24, lines 1268-1269). For all environments/tasks in our experiments, we consistently use the same temperature values, which proved to be effective. Additionally, we conducted ablation studies on the contrastive-related hyperparameters, the results of which are discussed in our response to [Q3].
[Q3] Additional Ablation Studies on Contrastive-related Hyperparameters
We added ablation studies on the contrastive-related hyperparameters (the contrastive term coefficient and the exponential temperature). With respect to the temperature, employing a very small value tends to destabilize the training process, ultimately resulting in collapse. Conversely, using a larger value produces poorly clustered representations, leading to a marked degradation in performance. Regarding the term coefficient, a smaller value results in a slight performance decline. However, a larger coefficient excessively prioritizes the contrastive loss, destabilizing the training of the value function and degrading performance. More detailed results and discussions can be found in Appendix D (Page 26, lines 1369-1390). While fine-tuning these hyperparameters for specific environments and tasks could potentially yield better experimental results on the benchmark, we choose not to do so.
The paper proposes a novel method to solve offline safe reinforcement learning problems by introducing representations that distinguish reward-related and cost-related states via contrastive learning. It leverages three different policies to handle different conditions (safe, unsafe, borderline safe). The paper's main motivation is to achieve a coarse state representation while preserving the optimal Q-value. Furthermore, it demonstrates that the method is better than prior work like bisimulation. After obtaining the representation, it aims to learn the reward Q-value and cost Q-value so it can extract the policy by following FISOR [1]. In the experiments, it illustrates that the method has good performance and can be generalized to complex tasks. However, this method struggles with the complexity of hyperparameter selection, network structure, and algorithm steps.
It seems to be a novel and strong method. But when carefully examined, there are some fatal issues in the method and the experimental results may be overclaimed. In addition, it lacks some important ablation experiments.
References:
[1] Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, Shengbo Eben Li, Xianyuan Zhan, and Jingjing Liu. Safe offline reinforcement learning with feasibility-guided diffusion model. arXiv preprint arXiv:2401.10700, 2024.
Strengths
- The paper is well-written. It explains the method step by step and also provides a comprehensive and detailed introduction to the experiment. The figures in the text are clearly expressed and easy to follow.
- The method is novel since no prior work uses contrastive learning to solve vector-based safe RL problems.
- It provided detailed proof about why the method is more efficient than bisimulation. It is a theoretical guarantee.
- The main experiments and experiments about contrastive learning are sufficient. Meanwhile, it provides some details about the tasks and hyperparameters.
Weaknesses
- The contributions are not claimed in the introduction. Since the method has borrowed a lot from FISOR, it should emphasize its improvements.
- The lack of pseudocode algorithms in the methods section, even in the appendix, hinders understanding. Due to the algorithm's complexity ("the necessity of executing three distinct training phases"), authors need to provide pseudocode to make their method more clear.
- This method needs to be optimized jointly for learning Q-values and the representations (Section 3.3). However, it did not clearly explain the convergence of doing so. Obviously, the learning process of representations uses optimal Q-values as distance measurements. This is not realistic in the training phase, since the Q-values must be initialized at the beginning. In addition, the learning of Q-values involves embedding representation vectors, and inaccurate representations can affect the optimization of Q-values. This is like asking whether the chicken or the egg came first. It is unlikely to obtain the optimal Q-values or representations if they are optimized together. I think this is an essential issue of the algorithm.
- The learning processes of the upper- and lower-bound cost value functions are unreasonable. I suspect the authors misunderstand the function of expectile regression, and I cannot understand the term "reverse expectile regression" (why is it called that?). Expectile regression is proposed in IQL. IQL leverages the advantages of expectile regression to improve the learning of value functions by changing the weights of different terms. It demonstrates that IQL achieves the best performance for an intermediate value of $\tau$. This indicates that the value function estimation can be relatively accurate in the offline setting. However, in this paper, the method uses expectile regression to learn the upper bound and lower bound of cost value functions. This usage is incomprehensible. Furthermore, even if it can learn the upper bound and lower bound of the cost value functions, the formulations do not make sense. The expectile loss indicates that when $\tau > 0.5$, the weights on positive residuals are larger than the weights on negative residuals. Thus, it only shows that larger targets are weighted more heavily, rather than that an upper-bound value function is learned. Meanwhile, defining a "reverse expectile regression" is quite strange, and I understand it should be equivalent to constraining the $\tau$ of the expectile regression to be below 0.5.
- The generalization experiment only generalizes over the number of obstacles, which needs to be emphasized. Since traditional generalization includes task generalization and scene generalization, involving unseen objects, it is important to explain that the experiments here only generalize to tasks with more obstacles (a state-based method obviously cannot deal with unseen objects). Otherwise, it would be an overclaim.
- The ablation experiments are not sufficient. One of the method's main contributions is the new state-decoupling framework, which contains three conditions. It should be compared with only two conditions in the ablation.
- One of my biggest regrets is that I cannot see any visual analysis in the paper on how to divide state-related and cost-related states, which is a core contribution of the method.
- The method involves many hyperparameters. I suspect its performance is due to parameter tuning, and Appendix C and Appendix D suggest that. Additionally, the ablations of the contrastive term coefficient are not presented.
Questions
My questions are mostly based on the weakness:
Main concerns:
- Please explain the convergence of the joint optimization. How can it learn the optimal Q-values and representations if the initialization is random?
- Why using expectile regression can learn the upper-bound and lower-bound of the value function? What's your motivation? Is this reasonable?
Other questions:
- If it cannot provide generalization at the level of obstacle types or tasks, why can it simply say that this method has generalization ability?
- How about the state-decoupling framework's ablation?
- In Table 4, what does it mean "Task: All/Vel"?
Writing:
- In line 1229, there is a repetition of "Zhang".
My initial recommendation is 'reject' because of my main concerns. If the authors can explain them clearly and convince me, I will raise my score.
We acknowledge Reviewer FtTa for providing the helpful suggestions. Please see our response to your concerns below.
[W1] Unclear contributions
We have revised our introduction to assert that our SDQC is built upon FISOR, with the main distinction being the state-decoupling framework. The modifications can be found on Page 2, lines 72-79, as follows:
Attributable to the successful application of Hamilton-Jacobi (HJ) reachability analysis in safe RL, which introduces a safety analysis method iterated through Q-learning with convergence guarantees, our approach makes safety assessments on the cost-related representations and makes decisions based on the assessment results. Our SDQC, developed based on FISOR, distinguishes itself from FISOR and other classical methods, which rely on global observations for decision-making (as depicted in the left subplot of Figure 2), by being the first to utilize decoupled representations for decision-making in safe RL tasks (see the right subplot of Figure 2).
[W2] Lack of Pseudo Code
We apologize for any confusion resulting from the absence of pseudocode. We acknowledge that SDQC necessitates multi-phase training, and a detailed pseudocode can improve clarity. We have included it in our revised manuscript, on Page 21, lines 1099-1124. Additionally, we have added a summarization section (Appendix C.1) on the training and deployment of SDQC, providing an intuitive understanding of the process (Pages 21-22, lines 1094-1166).
[W3] Convergence of Q for SDQC
We agree that the joint optimization of the value functions and representations poses a potential risk of training instability. However, the experimental results indicate that, although the inclusion of representation loss results in an increase in critic and value loss, it does not compromise the overall stability of the training. We have added a new section (Appendix F.2, Pages 28-29, lines 1502-1537) with the new Figure 9 to present the value function estimation error during the training process. Notably, our proposed neural network architecture, which incorporates an attention-based state encoder, significantly enhances the precision and stability of value function learning compared to the simple MLP used by FISOR.
[W4] Usage of Expectile Regression
We would like to emphasize that the expectile regression employed in IQL [1] is intended to fit the optimal value function, specifically to ensure that the learned value function approximates the maximum of the Q-function over actions within the dataset support. It is used for approximating this upper bound rather than merely for naive parameter tuning to enhance the accuracy of the value estimate. Here we provide a copied sentence from Section 4.2 (Page 4) of IQL [1]:
We can use expectile regression to predict an upper expectile of the TD targets that approximates the maximum over actions constrained to the dataset actions.
This can also be seen from Theorem 3 in IQL [1]. The reason for not setting $\tau$ too close to 1 is that the overestimation issue cannot be corrected under such circumstances; experimental results indicate that an intermediate value of $\tau$ (still above 0.5) is the optimal choice.
Note that the optimal value function under the general Bellman operator is given by the maximum of the optimal Q-function over actions, whereas under the cost-related safe Bellman operator it is given by the corresponding minimum. Expectile regression with $\tau > 0.5$ can be used to approximate the maximum value. Conversely, the reverse version with $\tau > 0.5$ can be utilized to approximate the minimum value. The difference lies in the direction of the indicator function inside the expectile loss. Following the setting in FISOR [2], we refer to the latter as "reverse expectile regression".
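For completeness, the two losses referred to above can be written as follows (the exact notation in the paper may differ):

```latex
\begin{align*}
  \text{expectile regression: } &
  L_2^{\tau}(u) = \bigl|\tau - \mathbb{1}(u < 0)\bigr|\, u^{2},
  \quad \tau > 0.5 \;\Rightarrow\; \text{approximates the maximum (upper bound)},\\
  \text{reverse expectile regression: } &
  \tilde{L}_2^{\tau}(u) = \bigl|\tau - \mathbb{1}(u > 0)\bigr|\, u^{2},
  \quad \tau > 0.5 \;\Rightarrow\; \text{approximates the minimum (lower bound)}.
\end{align*}
```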
[W5] Generalization Experiments
Thanks for pointing this out. We note that "generalization" in the field of machine learning typically refers to a model's ability to perform well on new, unseen data that was not part of its training set. It does NOT strictly pertain to the model's ability to handle entirely new tasks or scenes. To the best of our knowledge, most studies on representation learning in reinforcement learning use "generalization" to describe a model's ability to handle the same task but with a slightly different observation space (e.g., shifted or distorted images) [3, 4]. In our experiments, the number of obstacles in the training sets differs significantly from those in the testing sets, and thus the radar observations for the agent will be totally different. We do not agree that "generalization" is overstated for our settings. Nonetheless, in accordance with the reviewer's suggestions, we have emphasized this point in our revised introduction (Page 2, lines 103-104) as follows:
Further, in generalization tests where agents are evaluated in environments with a different number of obstacles than those in the training dataset, all baseline algorithms show a substantial increase in cost and/or a significant decline in reward. In contrast, SDQC stands out as the only approach that guarantees no increase in cost while experiencing only a slight decay in reward.
[W6] Incomplete Ablation Studies
Thank you for your suggestions. We have added ablation studies on the deployment of the three distinct policies in Appendix D (Page 26, lines 1393-1411). The experimental results demonstrate that none of these policies can be omitted, as the best performance consistently arises from their collaboration. It is important to note that if we rely solely on safe/unsafe conditions and make decisions based on global observations rather than representations, our SDQC method effectively reduces to FISOR[2].
We have also added ablation studies on the contrastive-related hyperparameters (the contrastive term coefficient and the exponential temperature). With respect to the temperature, employing a very small value tends to destabilize the training process, ultimately resulting in collapse. Conversely, using a larger value produces poorly clustered representations, leading to a marked degradation in performance. Regarding the term coefficient, a smaller value results in a slight performance decline. However, a larger coefficient excessively prioritizes the contrastive loss, destabilizing the training of the value function and degrading performance. More detailed results and discussions can be found in Appendix D (Page 26, lines 1369-1390).
[W7] Lack of Visualization
We agree that effective visualizations can provide an intuitive understanding of our proposed SDQC. To this end, we have included a GIF in the supplementary materials. It showcases the trajectory induced by the collaboration of the three policies, as well as the trajectories induced by each of the three naive policies individually. These results align with our discussion in response to [W6]. Due to size limits of supplementary materials, we are only presenting the visualization for the "PointGoal2" task. Additional results will be included in the final version of our paper.
[W8] Over-tuning on Hyperparameters
We would like to emphasize that the excellent performance of SDQC DOES NOT rely on over-tuning the hyperparameters. As shown in Table 3, we have set most of the hyperparameters to be consistent across all environments and domains. Furthermore, we have nearly unified all hyperparameters within the same environment, even when dealing with different tasks and agents, as illustrated in Table 4. For instance, in the Safety Gymnasium domain, we use the same hyperparameters for 12 tasks (PointGoal1, PointGoal2, PointPush1, PointPush2, PointButton1, PointButton2, CarGoal1, CarGoal2, CarPush1, CarPush2, CarButton1, CarButton2). Users only need to select the network structure and adjust the encoded state dimension according to the global observation space, which is largely determined by the physical characteristics of each domain. In fact, the guidelines for selecting hyperparameters are provided in our paper (Appendix C.3, lines 1289-1295). We believe that such domain-level hyperparameter unification is acceptable. While we acknowledge that fine-tuning hyperparameters for specific tasks may yield better results, we have chosen not to pursue it.
[Q1] Convergence of Joint Optimization
Please refer to our response in [W3].
[Q2] Usage of Expectile Regression
Please refer to our response in [W4].
[Q3] Terminology Generalization
Please refer to our response in [W5].
[Q4] State-Decoupling Framework Ablation
Please refer to our response in [W6].
[Q5] Content in Table 4
In Table 4, we use "All" to represent all tasks/environments for the same agent. We apologize for any confusion this may have caused and we have included further details in the corresponding caption.
[Q6] Typos in Paper
We apologize for the typos. We have carefully checked the entire paper to correct all of them.
[1] Kostrikov, Ilya, et al. "Offline Reinforcement Learning with Implicit Q-Learning." International Conference on Learning Representations, 2022.
[2] Zheng, Yinan, et al. "Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model." International Conference on Learning Representations, 2024.
[3] Agarwal, Rishabh, et al. "Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning." International Conference on Learning Representations, 2021.
[4] Kirk, Robert, et al. "A survey of zero-shot generalization in deep reinforcement learning." Journal of Artificial Intelligence Research 76 (2023): 201-264.
Thank you for your effort in responding. I think the experiments are sufficient now and the presentation is better. I fully understand the "reverse expectile regression" and the upper bounds. Meanwhile, I found the visualization to be truly interesting. However, I still have some concerns:
- About the convergence of joint optimization: the experiments could be a part of the explanation, but they are not convincing enough. It would be better for the authors to provide some proof of why the joint optimization can converge.
- Visualization: actually, I want to see the function of the representation. It would be better to provide some t-SNE results showing how the reward-related and cost-related states are separated.
I think this work's main contribution is the representation and separating into three policies serves as an auxiliary function. Since the author did indeed solve most of my concerns, I would like to raise my rating to 5.
We sincerely acknowledge Reviewer FtTa for providing further feedback.
Theoretical Proof on the Joint Optimization
Given the non-convex nature of neural networks, it is challenging to directly demonstrate why joint optimization converges and to ascertain whether such convergence leads to a global optimum. This issue remains an unresolved problem in the field of deep learning. Nonetheless, joint optimization is a prevalent practice in reinforcement learning. For instance, Agarwal et al. [1] learn representations by leveraging policy similarity, wherein the policy and representation share certain neural network parameters and are jointly optimized. Similarly, in [2] and [3], researchers tried to clone the actions observed in datasets and to bias the replicated policies towards regions with high Q-value distributions. They incorporated Q-value guidance into the loss function of behavior cloning through joint optimization. However, none of the aforementioned works provide proofs of convergence or optimality for joint optimization processes. We recognize that our study, SDQC, encounters similar challenges, and we aim to address the gap in our future work.
Additional t-SNE Visualization on the Representations
Thank you for the clarification. We agree that additional t-SNE visualizations of the representation space would provide deeper insights into our proposed SDQC. However, we faced challenges in addressing this request within such a short timeframe. In the SafetyGymnasium library, obstacle and destination positions are entirely random, and the library does not provide an interface for manual specification. We understand the reviewer's suggestion to perform t-SNE visualizations under reward representation for states with identical reward-related information but varying cost-related information, and vice versa under cost representation. While we are working to modify the library to enable manual specification for this purpose, we cannot complete these changes within the 3-day window. We commit to including this analysis in the final version of the paper.
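As a sketch of the planned analysis (the encoder handles and the coloring labels below are hypothetical placeholders for the trained SDQC components):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_tsne(representations, labels, title):
    """Project learned representations to 2-D and color each point by a
    quantity of interest (e.g. the corresponding reward or cost value)."""
    emb = TSNE(n_components=2, perplexity=30, init="pca",
               random_state=0).fit_transform(np.asarray(representations))
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=5, cmap="coolwarm")
    plt.title(title)
    plt.colorbar()
    plt.show()

# usage sketch (hypothetical handles): obs is an (N, obs_dim) batch from the
# dataset, reward_encoder / cost_encoder are the trained SDQC encoders.
# plot_tsne(reward_encoder(obs), reward_values, "reward-related representation")
# plot_tsne(cost_encoder(obs), cost_values, "cost-related representation")
```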
[1] Agarwal, Rishabh, et al. "Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning." International Conference on Learning Representations, 2021.
[2] Fujimoto, Scott, et al. "A minimalist approach to offline reinforcement learning." Advances in neural information processing systems, 2021.
[3] Wang, Zhendong, et al. "Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning." The Eleventh International Conference on Learning Representations, 2023.
Thank you. I have carefully checked the work [1] and gained a deeper understanding of your paper. Joint optimization is widely present in deep learning, and I agree it is unnecessary to provide theoretical proof, since the method does not involve a circular optimization problem. However, compared with [1], your paper's structure is not clear enough. In [1], it is stated that contrastive learning is an auxiliary loss that helps the RL agent learn the representation. But this work first explains how to distinguish between reward-related and cost-related states using different representations through contrastive learning, and then trains the RL agents based on the representations. This makes it seem as though the contrastive representation learning and the RL policy learning are separate, while in fact they are not. Since Reviewer 8jhZ also has this concern, the confusion could make it hard for more readers to follow. I suggest that you stress that contrastive learning is used as an auxiliary loss for distinguishing between reward-related and cost-related states, while the end-to-end RL training plays the main role in decision-making by optimizing the reward- or cost-related objectives. The representation is not only learned from contrastive learning but also from the RL training objectives. Clarifying the network structure is also important.
I believe better structure can make it easier for readers to understand and less confusing. So you should consider my suggestion and try to adjust your writing. Because of the workload in this paper, I found it hard to make decisions. I will reconsider it after subsequent discussions.
Besides, after reading [1], I found it more like a work applying [1] to safe offline RL areas. I would like to ask for more novelty beyond combining [1] and FISOR. (such as what problem you found in safe offline RL and why using representation can perfectly solve it)
The additional t-SNE visualization of the representations is not the core problem in your work, but it can help readers intuitively recognize the role of the representation. Since the discussion has been extended, I hope you can do that before the deadline.
Question 4: Clarifying the network structure is also important.
We devoted a whole section (Appendix C.2, Pages 22-23, lines 1168-1239) to introducing the network structure. We even mentioned it twice in the main text of our paper (Page 5, line 231, and Page 9, line 460).
Additional t-SNE Visualization on the Representations
We would like to kindly remind Reviewer FtTa that the deadline for submitting the final version of the revised paper is Nov. 27, not Dec. 3. During the extended period, only forum activities are permitted. We are unable to complete such a large amount of work (refactoring the codebase of SafetyGymnasium) in such a short time; we estimate it will take at least 1-2 weeks. We have committed to including this in the final version of our paper and hope for your understanding regarding the time constraints (only one day is left).
[1] Agarwal, Rishabh, et al. "Contrastive Behavioral Similarity Embeddings for Generalization in Reinforcement Learning." International Conference on Learning Representations, 2021.
[2] Khosla, Prannay, et al. "Supervised Contrastive Learning." Advances in Neural Information Processing Systems, 2020.
[3] Zheng, Yinan, et al. "Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model." International Conference on Learning Representations, 2024.
[4] Kostrikov, Ilya, et al. "Offline Reinforcement Learning with Implicit Q-Learning." International Conference on Learning Representations, 2022.
Thank you for your response. It seems my initial understanding was right, and I have seen Section C.2. I still have some questions about it:
- The loss for training the network is similar in PSEs and SDQC (RL loss + contrastive loss), yet you state that SDQC is different from PSEs. I think the training procedure is similar: SDQC has a structure similar to Figure 3 in PSEs but does not use a projection layer, and it learns the optimal Q-value instead of the RL policy. Is that right?
- I understand that SDQC proposes new representations based on optimal Q-values and that your contributions are unique. My point is that learning the critic (from whose optimal Q-values the policy is then extracted) also contributes to representation learning, while contrastive learning is used as an auxiliary loss. If this understanding is correct, I suggest you emphasize it in the paper. I do not believe that only contrastive learning shapes the representation; the training of the Q-value also plays a role, which may explain why the joint optimization can converge.
As for the additional experiments, I understand that they take time to finish. Please focus on the discussion.
We thank Reviewer FtTa for providing further feedback. We would first like to clarify the
Differences between PSEs [1] and SDQC (ours)
Firstly, we would like to emphasize that contrastive learning serves only as a tool employed by both PSEs [1] and SDQC for learning representations. (In fact, it is a common technique that originated in computer vision [2].) However, the methodologies underlying PSEs [1] and SDQC are fundamentally different: PSEs leverage behavioral similarity to supervise representation learning, whereas our SDQC uses optimal Q-values for this purpose.
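To make this difference concrete for readers of the thread, the snippet below is a minimal sketch (not our released implementation) of a supervised contrastive loss in the style of Khosla et al. [2] in which positives are defined by agreement of discretized Q-value estimates rather than by behavioral similarity; the quantile binning rule and all names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def q_supervised_contrastive_loss(z, q_values, n_bins=10, temperature=0.1):
    """Supervised contrastive loss where pseudo-labels come from
    discretizing (estimated) optimal Q-values into quantile bins.

    z        : (B, d) batch of state representations
    q_values : (B,)   Q-value estimates used as supervision
    """
    z = F.normalize(z, dim=1)                                # cosine-similarity space
    edges = torch.quantile(
        q_values, torch.linspace(0, 1, n_bins + 1, device=q_values.device)
    )[1:-1]
    labels = torch.bucketize(q_values, edges)                # pseudo-label = Q-value bin

    sim = z @ z.t() / temperature                            # (B, B) similarity logits
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))          # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]) & ~self_mask  # positives: same Q bin

    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(~pos, 0.0)               # keep only positive pairs
    n_pos = pos.sum(dim=1)
    has_pos = n_pos > 0                                      # skip anchors without positives
    return -(log_prob.sum(dim=1)[has_pos] / n_pos[has_pos]).mean()
```

In our actual method the supervision and grouping rule follow Eq. 5 of the paper; the sketch only illustrates how Q-value supervision replaces behavioral similarity as the source of positive pairs.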
Question 1: [1] states that contrastive learning is an auxiliary loss that helps the RL agent learn the representation. Your paper, in contrast, first explains how to distinguish between reward-related and cost-related states using different representations learned through contrastive learning, and then trains the RL agents based on those representations. This reads as if the contrastive representation learning and the RL policy learning were separate, when in fact they are not.
In PSEs [1], contrastive learning serves as an auxiliary loss that assists the RL agent in learning representations and is utilized for end-to-end policy learning. This is not the case in our proposed SDQC. As you noted, in SDQC, contrastive learning is employed solely for representation learning and is not applied to RL policy learning. Furthermore, the training processes for representations and policies are completely separate. This distinction is clearly articulated in Section 3.3:
In the initial phase, we undertake the learning process for the value functions and representations associated with cost and reward separately. Following that, we extract the policy based on the acquired value functions and representations.
Additionally, you can refer to the pseudocode provided on Page 21, lines 1099-1124, and our corresponding summarization on the same page, lines 1125-1136.
Question 2: I suggest stressing that contrastive learning is used as an auxiliary loss for distinguishing between reward-related and cost-related states, while end-to-end RL plays the main role in decision-making by optimizing the reward- or cost-related objectives. The representation is learned not only from the contrastive loss but also from the RL training objectives. I believe a better structure would make the paper easier to follow and less confusing, so please consider my suggestion and adjust your writing accordingly.
We would like to clarify that PSEs [1] and SDQC are fundamentally distinct concepts. Importantly, our approach does NOT integrate the policy learning process into the representation learning. We respectfully request that Reviewer FtTa re-evaluate the structure of our writing on methodology as outlined below:
- We first introduce the motivation behind SDQC, which involves abstracting coarser reward- and cost-related representations, in Section 3.1.
- Then we present our unique approach to abstracting the representations based on optimal Q values, alongside the corresponding methodology (contrastive learning), in Section 3.2.
- Following this, we provide a detailed implementation of our SDQC in Section 3.3: initially learning the representations using in-sample Q-learning methods, and then extracting the policies based on these pre-learned representations. We would like to kindly remind Reviewer FtTa that learning optimal Q-values without learning policies is the fundamental feature of in-sample learning, as introduced by IQL [4]; a minimal sketch of the expectile-regression idea behind in-sample learning is given right after this list.
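For completeness, here is a minimal sketch of the expectile regression at the heart of IQL-style in-sample learning (the standard formulation of Kostrikov et al. [4], written in our own illustrative code rather than copied from our repository):

```python
import torch

def expectile_loss(q_target, v, tau=0.7):
    """Asymmetric L2 loss from IQL: with tau > 0.5, V(s) is pushed toward an
    upper expectile of the in-sample Q-values, approximating max_a Q(s, a)
    using only actions that appear in the dataset."""
    diff = q_target - v
    weight = torch.where(diff > 0, torch.full_like(diff, tau), torch.full_like(diff, 1 - tau))
    return (weight * diff.pow(2)).mean()

# Toy usage: V is trained toward the upper expectile of sampled Q-values.
q_target = torch.tensor([1.0, 2.0, 0.5])
v = torch.tensor([0.8, 1.5, 0.9], requires_grad=True)
loss = expectile_loss(q_target, v)
loss.backward()
```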
If Reviewer FtTa has any further suggestions, we would be pleased to follow your instructions and polish our writing.
Question 3: Besides, after reading [1], this work reads to me more like an application of [1] to the safe offline RL setting. I would like to see the novelty beyond combining [1] and FISOR (e.g., what problem you identified in safe offline RL and why representation learning is the right way to solve it).
We reiterate that PSEs and SDQC are fundamentally different. The sole similarity lies in our use of supervised contrastive learning [2] to derive representations. PSEs learn representations based on policy similarities and function as an online RL algorithm tailored for image-based RL tasks. In contrast, our SDQC learns representations based on optimal Q-values and operates as an offline RL algorithm designed for state-based RL tasks. To the best of our knowledge, we are the first to apply representation learning to continuous state-based RL tasks and the first to introduce the concept of state decoupling in decision-making. Indeed, we believe that our main contribution is the innovative concept of decoupling the global observations into reward-related and cost-related representations for decision-making. We hope that Reviewer FtTa can reconsider the novelty of our proposed method.
We thank Reviewer FtTa for the further suggestions, and we are willing to follow your instructions to further polish our paper.
Similar Network Structure
Your understanding is correct: SDQC learns representations based on optimal Q-values instead of RL policies (as in PSEs). SDQC and PSEs indeed share the common characteristic of learning representations with a contrastive loss derived from reinforcement-learning signals. However, we would like to emphasize that PSEs are designed for image-based RL tasks in online settings and require a data-augmentation process and additional projection layers. In contrast, SDQC is designed for state-based RL tasks in offline settings and is implemented alongside in-sample learning methods. While the network structures of PSEs and SDQC may appear similar, we would like to clarify that, for representation learning in reinforcement learning, such a network structure is a necessary choice: representations must originate from the ground-truth observations, and value functions or policies must build upon these representations. This sequentially connected neural-network design is a common choice and is not unique to PSEs or SDQC.
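As a concrete, purely illustrative example of this sequential design, the module below shows an encoder that produces the representation and a Q-head that consumes it together with the action; this is a generic pattern, not the exact SDQC architecture described in Appendix C.2.

```python
import torch
import torch.nn as nn

class EncoderQNetwork(nn.Module):
    """Generic sequential design: observation -> representation -> Q-value.
    The representation z is returned as well, so that an auxiliary
    (e.g., contrastive) loss can be applied to it during training."""
    def __init__(self, obs_dim, act_dim, repr_dim=64, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, repr_dim),
        )
        self.q_head = nn.Sequential(
            nn.Linear(repr_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        z = self.encoder(obs)                          # representation
        q = self.q_head(torch.cat([z, act], dim=-1))   # Q-value built on the representation
        return q, z
```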
Contrastive Loss Serves as an Auxiliary Loss
Your understanding is accurate, and we have followed your guidance to revise our manuscript from the following three perspectives. (The detailed changes are marked in blue in the updated manuscript.)
- We presented our representation learning method predicated on optimal Q-values in Section 3.2, detailing the corresponding methodology in Eq. 5. Furthermore, we added an explanation of how the contrastive loss serves as an auxiliary loss on Page 5, lines 239-245 (a minimal schematic of the combined objective is also given after this list), as follows:
It is important to note that Eq. 5 requires precise calculation of optimal Q-values for all states across all actions, i.e., the constraints in Eq. 4 are satisfied. However, the Q-values are derived from the representation network, and even a small change in the network can result in variations in the Q-values. Therefore, it is necessary to integrate the training process of the representation network with the Q-learning process. This integration can be achieved by incorporating the contrastive loss as an auxiliary objective during the learning of the optimal value functions. Please refer to Section 3.3 for further details.
- Additionally, we have emphasized this point in the detailed methodology section. On Page 5, lines 259-260, we have revised the introduction to reward-related representation learning as follows:
The reward-related representations are acquired using the auxiliary contrastive loss term described in Eq. 5, with a weighting factor denoted by . Consequently, the overall loss for the reward-related value functions and representations is formulated as follows:
- On Page 6, lines 274-275, we made a similar modification, as follows:
By incorporating the auxiliary contrastive loss term (Eq. 5) with a weighting factor of , we express the overall loss for the cost-related value functions and representations as:
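For readers of this thread, the combined objective can be written schematically as follows (in our own notation, since the paper's symbols did not render here; $\lambda$ stands for the weighting factor mentioned above):

$$\mathcal{L}_{\text{total}} \;=\; \mathcal{L}_{\text{value}} \;+\; \lambda\,\mathcal{L}_{\text{contrast}},$$

where $\mathcal{L}_{\text{value}}$ is the in-sample (expectile-style) value-learning loss and $\mathcal{L}_{\text{contrast}}$ is the Q-supervised contrastive term of Eq. 5. Both terms are differentiated through the shared representation network, which is why the representation is shaped by the RL objective as well as by the contrastive loss.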
Thank you for taking my suggestions. I appreciate the effort the authors have put in during the discussion. My concerns have been resolved, and the work has improved. Therefore, I will raise my score further.
We sincerely appreciate the constructive feedback and recognition of our work provided by reviewer FtTa. Your efforts have significantly helped us enhance our paper from both experimental and linguistic perspectives. We are grateful for your consideration in raising the score.
We would like to express our sincere gratitude to the reviewers for their dedicated time and expertise in reviewing our work. We truly appreciate the professional and constructive suggestions and are encouraged by the recognition of the significance of our work. We provide responses and clarifications below for each reviewer respectively and hope they can address your concerns.
We also updated our paper with a few modifications to address reviewers' suggestions and concerns (in blue). A summary of the updates is listed below:
- We included a GIF in the supplementary materials that illustrates the trajectories generated by the collaboration of the three policies, as well as the trajectories produced by each of the three naive policies in isolation.
- We revised several expressions in both the introduction and the experimental evaluations.
- We added a new section (Appendix C.1, Pages 21-22) to provide pseudocode and summarize the training and deployment processes of the SDQC framework.
- We conducted ablation studies on contrastive-related hyperparameters, which can be found in Appendix D, Page 26.
- We performed ablation studies on the deployment of the three distinct policies, also detailed in Appendix D, Pages 26-27.
- We provided the training curves for SDQC and discussed the impact of the representation loss on value estimations in Appendix F.2, Pages 28-29.
- We included additional discussions and experimental verifications regarding the limitations of bisimulation in safe offline reinforcement learning in Appendix F.3, Pages 29-30.
- We corrected typographical errors and unprofessional expressions.
Summary: This paper addresses safe offline reinforcement learning, specifically focusing on the zero-violation constraint setting. To meet the zero-violation requirement, the authors propose a representation learning method based on contrastive learning to separately learn state representations. Experimental results show that the proposed method achieves improvements over state-of-the-art (SOTA) baselines, such as FISOR.
Strengths:
- The presentation is clear.
- The proposed method is reasonable and aligns with the problem setting.
- Experimental results indicate some improvement over existing baselines.
Weaknesses:
- High complexity of the proposed method: The method introduces significant complexity, and the authors do not provide sufficient demonstration or justification for each component of the algorithm. For instance, I would expect a more detailed justification of the particular choice of representation learning method. This weakens the overall clarity and impact of the proposed approach.
- Lack of justification for the experimental setup: A significant portion of the experimental setup, including the safety threshold selection, closely follows the FISOR paper without sufficient justification. The choice of safety thresholds is critical for evaluating safe RL methods, and simply adopting the setup from FISOR is not enough, especially given the variety of safety limits used in prior work. This aspect requires more detailed justification.
Decision: Reject. While the paper shows promise, the concerns regarding the complexity of the method, insufficient justification for key components, and the experimental setup prevent me from recommending acceptance at this stage.
Additional Comments from the Reviewer Discussion
During the discussion, the authors provided additional details regarding the experimental setup and the demonstration of their method. While these additions improved the clarity and addressed some of the raised concerns, I believe the paper still requires further refinement to meet the publication standards of ICLR.
Reject