PaperHub
6.3 / 10
Rejected · 3 reviewers
Scores: 4, 4, 2 (min 2, max 4, std 0.9)
ICML 2025

Semi-gradient DICE for Offline Constrained Reinforcement Learning

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-06-18

Abstract

Keywords
DICE · Offline Constrained RL

Reviews and Discussion

Review
Rating: 4

This paper investigates the limitations of SemiDICE in offline constrained reinforcement learning, revealing that it outputs policy corrections rather than stationary distribution corrections. This fundamental flaw significantly impairs SemiDICE's effectiveness in off-policy evaluation (OPE). Based on these findings, CORSDICE is proposed to facilitate OPE by computing state stationary distribution corrections.
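To make the distinction in this summary concrete, here is a minimal tabular sketch contrasting the two correction terms; the toy MDP, numbers, and variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Toy tabular example (illustrative only): 3 states, 2 actions.
rng = np.random.default_rng(0)
d_pi = rng.random((3, 2)); d_pi /= d_pi.sum()   # occupancy of the learned policy
d_D  = rng.random((3, 2)); d_D  /= d_D.sum()    # occupancy of the dataset (behavior) policy

# Stationary distribution correction, the quantity OPE needs:
#   w(s, a) = d_pi(s, a) / d_D(s, a)
w = d_pi / d_D

# Policy correction, a ratio of conditionals:
#   rho(s, a) = pi(a | s) / pi_D(a | s)
pi   = d_pi / d_pi.sum(axis=1, keepdims=True)
pi_D = d_D  / d_D.sum(axis=1, keepdims=True)
rho = pi / pi_D

# Re-weighting dataset samples with w recovers the expected reward under d_pi
# exactly; re-weighting with rho corrects only the action distribution and
# ignores the mismatch in state visitation.
reward = rng.random((3, 2))
print("value under d_pi:           ", (d_pi * reward).sum())
print("OPE with distribution ratio:", (d_D * w * reward).sum())
print("OPE with policy ratio only: ", (d_D * rho * reward).sum())
```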

Update after rebuttal

I thank the authors for offering the empirical evidence. I have no further questions and will keep my score.

Questions for the Authors

See the question I raised in the Claims and Evidence section.

Claims and Evidence

While the majority of the claims presented in this paper are substantiated by clear and compelling evidence, one particular assertion in Section 4 (Page 4) requires further clarification. The authors state that "there often exists a state $s$ such that $d_{\pi^{\star}}(s, a) = 0$ for all $a$ in OptiDICE." However, upon careful examination, I was unable to locate any empirical evidence or theoretical justification supporting this specific claim within the manuscript.

Methods and Evaluation Criteria

The method makes sense for the problem.

Theoretical Claims

I have checked the correctness of the theorem proofs presented in the main paper.

Experimental Design and Analysis

The experimental design is both comprehensive and persuasive, reflecting the authors' substantial effort in empirically validating CORSDICE. The evaluation comprises three components. The first systematically examines algorithmic characteristics, providing empirical validation of the limitations of existing DICE methods discussed in Section 4. The second demonstrates CORSDICE's superior capability in achieving precise OPE. The third evaluates CORSDICE's performance on offline constrained reinforcement learning benchmarks.

Supplementary Material

The Supplementary Material includes the code for CORSDICE; however, I have not executed it due to time constraints.

Relation to Existing Literature

The paper is well-grounded in the broader literature.

Missing Essential References

No

Other Strengths and Weaknesses

No

Other Comments or Suggestions

No

Author Response

We appreciate the reviewer's comment and provide empirical evidence below to better substantiate the claim in Section 4.

Q1. Empirical evidence on the claim from Section 4

We demonstrate that OptiDICE can yield a state $s$ where $d_{\pi^*}(s,a) = 0$ for all $a$. The figure provided in https://imgur.com/a/7zh2zlg is based on the Four Rooms domain from Figure 1 of OptiDICE [1], using 1000 trajectories collected under a behavior policy with 0.7 optimality. In this setup, red arrows indicate the optimal policy, and the heatmap shows the state visitation frequency under the optimal policy, $\sum_{a} d_{\pi^*}(s,a)$. While all states are visited by the behavior policy, some states have no arrows, indicating they are not visited by the optimal policy ($\sum_{a} d_{\pi^*}(s,a) \approx 0$). Consequently, the optimal policy in those states cannot be recovered by computing $d_{\pi^*}(s,a) / \sum_{a} d_{\pi^*}(s,a)$. We will include this demonstration in the final version.
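A tiny numerical illustration of this failure mode (the occupancy values below are hypothetical, not the actual Four Rooms data):

```python
import numpy as np

# Hypothetical d_{pi*}(s, a) for 3 states and 2 actions; state 2 is never
# visited by the optimal policy, so its row is (numerically) zero.
d_star = np.array([[0.4, 0.1],
                   [0.3, 0.2],
                   [0.0, 0.0]])

state_visitation = d_star.sum(axis=1)             # sum_a d_{pi*}(s, a)
with np.errstate(divide="ignore", invalid="ignore"):
    pi_star = d_star / state_visitation[:, None]  # d_{pi*}(s,a) / sum_a d_{pi*}(s,a)

print(pi_star)
# States 0 and 1 yield valid conditionals; state 2 gives 0/0 = nan, so the
# optimal action at an unvisited state cannot be recovered from d_{pi*} alone.
```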

[1] OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. ICML 2021.

Reviewer Comment

Thank you for offering the empirical evidence. I have no further questions and will keep my score.

Review
Rating: 4

The paper develops a new offline RL algorithm based on semi-gradient DICE, addressing the challenge of constraint violation when applying semi-gradient DICE in the context of constrained RL. The paper provides a theoretical analysis of the characteristics of the correction term (i.e., the ratio of the stationary distribution w.r.t. the learning policy to that w.r.t. the dataset policy) that lead to the violation of the Bellman equation (and thus the constraint violation). The paper then proposes a stationary distribution correction idea to address the issue.

Experiments on several benchmarks show the advantages of the proposed algorithm CORSDICE in comparison with other offline RL algorithms.

Questions for the Authors

I don't have questions for the authors.

Claims and Evidence

The advantages of the proposed algorithm CORSDICE are supported by both theoretical and experimental results. In particular, the paper provides substantial empirical analysis of different aspects (i.e., violations of the Bellman and policy correction constraints) and of the performance of CORSDICE compared with other baselines.

Methods and Evaluation Criteria

The SemiDICE approach is well-suited for offline RL. Benchmarks such as the D4RL and DSRL datasets are commonly used in the literature.

Theoretical Claims

No, I didn't check the correctness of any proofs in the appendix.

Experimental Design and Analysis

Experimental result analysis in the paper is extensive, examining the performance and characteristics of the proposed algorithm in various settings using both D4RL and DSRL datasets.

Supplementary Material

No, I didn't.

Relation to Existing Literature

The paper develops a new offline RL algorithm, utilizing DICE, a well-known framework in the RL literature. There is a long line of research that applies DICE to different settings, ranging from single-agent to multi-agent, and from unconstrained to constrained RL. This work extends semi-gradient DICE to the context of constrained RL, offering insights that could be valuable to the RL community.

Missing Essential References

I am not aware of missing essential references.

Other Strengths and Weaknesses

A key strength of the paper is a thorough theoretical analysis of the properties of semi-gradient DICE in constrained RL, including the analysis of the stationary distribution violation and the connection with behavior-regularized offline RL. The proposed idea of extracting the state stationary distribution to fix the cost violation is novel and well justified.

Other Comments or Suggestions

None.

Author Response

We appreciate the reviewer's acknowledgment of our contributions: extending semi-gradient DICE to constrained offline RL, grounded in a theoretical analysis of its optimal solution, and a stationary distribution correction approach that addresses the issue of Bellman flow constraint violation.

Review
Rating: 2

This paper proposes a DICE-based algorithm for offline constrained RL. The proposed method can be seen as a SemiDICE version of COptiDICE, with some extra designs. The paper is generally well-written, but I also feel there is some overclaiming of contributions and a lack of adequate acknowledgment of existing works. The final safety control performance is not very impressive compared to some recent SOTA safe offline RL algorithms.

Questions for the Authors

N/A

Claims and Evidence

  • The paper claims to identify the root cause of the limitation when using the SemiDICE framework to perform OPE. The claim is generally supported by the theoretical analysis, although I think the description overstates some points.

  • The paper claims that it achieves state-of-the-art (SOTA) performance on DSRL benchmarks; however, inspecting Tables 2 and 3, many results for CORSDICE have somewhat high costs, and many are close to the cost threshold of 1. This is not a desirable property for a safe policy.

Methods and Evaluation Criteria

The method generally makes sense. The evaluation on the DSRL benchmark is also reasonable.

Theoretical Claims

I think there is some overclaiming and a lack of acknowledgment of existing works in the paper.

  • For Proposition 4.1, it is well known that SemiDICE methods may not satisfy the Bellman flow constraint, since they replace $(1-\gamma)\,\mathbb{E}_{s_0 \sim p_0}[v(s_0)]$ with $\mathbb{E}_{(s,a) \sim d_D}[v(s_0)]$ and use the parameter $\alpha$, which causes the Bellman flow property to break apart. This is obvious and more or less mentioned in existing works, hence I do not think it is that new.
  • For the discussion in Section 4 about the connections to behavior-regularized offline RL, this has actually been thoroughly discussed in the ODICE paper [1] by Mao et al. (2024). In fact, they derive their methods by first noticing the relationship between SemiDICE and behavior-regularized offline RL. This is not adequately acknowledged in the paper.
  • The introduction of another function approximator $A(s)$ for bias reduction is identical to the trick used in PORelDICE [2]. Again, this is never mentioned nor discussed in the description of the methodology, even though the authors actually cited the paper in the preliminaries section.

[1] ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update. ICLR 2024.

[2] Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization. AAAI 2024.

Experimental Design and Analysis

  • The experiment design is generally reasonable; however, I feel the safety control performance of the proposed method is not very impressive. Although the proposed method can control the cost value below the threshold, it often gets high cost values (close to the threshold), especially in Table 2.
  • Why are many tasks that appear in Table 2 not tested and reported in Table 3?

Supplementary Material

I've read the appendix of the paper, but have not run the supplementary code provided along with the paper.

Relation to Existing Literature

Safe offline RL has broad applications in robotics, autonomous driving, and industrial control.

Missing Essential References

See my comment in the Theoretical Claims section.

Other Strengths and Weaknesses

Strengths:

  • The paper is generally well-written and easy to read. The key ideas are clearly conveyed and discussed.

Weaknesses:

  • The paper actually borrows many methodological designs and insights from existing papers but does not adequately acknowledge them. For example, the problem framework is from COptiDICE [1]; the insights and techniques of SemiDICE are from DualRL [2] and ODICE [3]; the insight connecting SemiDICE and behavior-regularized offline RL is from ODICE [3]; and the trick for bias reduction is from PORelDICE [4]. None of these is adequately stated in the paper.

[1] COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation

[2] Dual RL: Unification and New Methods for Reinforcement and Imitation Learning. ICLR 2024

[3] ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update. ICLR 2024.

[4] Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization. AAAI 2024.

Other Comments or Suggestions

  • From G.3, it seems the paper conducted heavy hyperparameter tuning to get good results. This is not encouraged in offline RL research: most offline RL methods are designed to solve real-world problems with restricted online system interaction during the training stage, so there is little opportunity for exhaustive hyperparameter tuning, and a bad policy can cause severe consequences on real systems. It is desirable to report results based on fixed hyperparameters or only a small set of hyperparameters. Using random search and hyperparameter optimization to get nicer results is bad practice in offline RL research.

Author Response

We thank the reviewer for the thorough and constructive comments. We hope we can address your concerns below.

Q1. Violation of Bellman flow (BF) constraint and the originality of Proposition 4.1

We respectfully disagree with the reviewer’s claim that replacing the term $(1-\gamma)p_0$ with the dataset distribution $d_D(s)$ is a well-known violation of the BF constraint. To clarify, there are two distinct scenarios for this replacement under the full-gradient DICE method:

  • Direct substitution: While replacing $(1-\gamma)p_0$ with $d_D$ does violate the BF constraint, the resulting $d$ becomes a scaled (due to the removal of the $1-\gamma$ factor) stationary distribution under a modified MDP with initial state distribution $d_D$ (due to $p_0 \to d_D$). The constant scaling can be easily corrected during policy extraction.
  • Assumption-based substitution (as used in SemiDICE, Appendix B, Eq. (31)-(32)): Here, we assume $d_D$ satisfies the stationary distribution condition. This method does not violate the Bellman flow constraint.

Thus, we argue that the fundamental cause of BF violation is the use of the semi-gradient update, not the replacement of the initial state distribution. To our knowledge, this is the first time this specific cause has been clearly identified.
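For concreteness, the Bellman flow constraint referred to in this exchange, and the effect of the direct substitution, can be written as follows (standard notation reconstructed from the discussion, not quoted from the paper):

```latex
% Bellman flow constraint on a state-action occupancy d:
\sum_{a} d(s,a) \;=\; (1-\gamma)\,p_0(s) \;+\; \gamma \sum_{s',a'} T(s \mid s',a')\, d(s',a')
\qquad \forall s.

% Direct substitution of the initial-state term (1-\gamma) p_0 by the dataset marginal d_D:
\sum_{a} d(s,a) \;=\; d_D(s) \;+\; \gamma \sum_{s',a'} T(s \mid s',a')\, d(s',a'),
% whose solution is, up to the removed (1-\gamma) scaling, the occupancy of a
% modified MDP whose initial state distribution is d_D.
```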

Furthermore, Proposition 4.1 does more than pinpoint the violation's cause. It shows that SemiDICE outputs the policy ratio, which is essential for applying our stationary distribution extraction technique. Without Proposition 4.1, this insight would not be possible. We believe this adds a meaningful contribution to the understanding of semi-gradient DICE methods.

Finally, $\alpha$ only affects the conservatism strength of the DICE method; it has no impact on the BF property.

Q2. Lack of acknowledgement of [1, 2] for discovering the connection

Although we cited [1, 2] and briefly discussed them in Appendix C, we will revise the manuscript to better acknowledge their discovery. While earlier studies implied the connection through similar loss functions, we make it more explicit by showing that a behavior-regularized MDP with a general f-divergence can be approximated by its corresponding SemiDICE. This insight is central to our work, as it justifies the SemiDICE policy ratio and supports its constrained extension, which is the main contribution of our paper.

Q3. Lack of acknowledgement of [3] for the bias reduction technique

While our bias reduction technique was inspired by PORelDICE [3], we approximate a different expectation. PORelDICE estimates $U(s,a) \approx r(s,a) + \gamma \sum_{s'} T(s'|s,a)\,\nu(s')$, while we approximate $A(s) \approx \sum_a \pi(a|s)\left(\gamma \sum_{s'} T(s'|s,a)\,\mu(s') - \mu(s)\right)$. Consequently, PORelDICE focuses on reducing transition bias to improve offline RL, whereas our approach targets bias from both the transition dynamics and the policy to improve OPE. We will acknowledge [3] appropriately and clarify this distinction in the final version.
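To illustrate the kind of estimator being described, here is a generic single-sample regression sketch for fitting a conditional-expectation network such as $A(s)$; the network sizes, how successor states are sampled, and the training loop below are assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn as nn

# Generic sketch: fit A_theta(s) by regressing on single-sample targets whose
# conditional mean is the desired expectation gamma * mu(s') - mu(s).
# Everything below (architectures, sampling of s') is assumed for illustration.
gamma = 0.99
state_dim = 8

A_net  = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
mu_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(A_net.parameters(), lr=3e-4)

def update(s, s_next):
    # s, s_next: batches of states and sampled successor states, shape [B, state_dim]
    with torch.no_grad():
        target = gamma * mu_net(s_next) - mu_net(s)   # single-sample estimate
    loss = ((A_net(s) - target) ** 2).mean()          # MSE regression -> conditional mean
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# usage with random placeholder data
s      = torch.randn(32, state_dim)
s_next = torch.randn(32, state_dim)
print(update(s, s_next))
```

The point of the sketch is only that minimizing squared error against single-sample targets drives the network toward their conditional mean, i.e., the expectation that a plug-in single-sample estimate would otherwise approximate noisily.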

Q4. High cost close to the limit

We respectfully clarify that obtaining cost values near the constraint threshold is not a flaw but desirable behavior in our average-constrained optimization setting (CMDP, Eq. (7)). Our objective explicitly encourages using as much of the cost budget as needed to improve the return, so long as the expected constraint is satisfied. This is a principled difference from hard or probability-constrained formulations, where any threshold violation is unacceptable. Recent works (e.g., [4], [5]) have studied these stricter formulations; our method is complementary and could be extended to them in the future. We will update the manuscript to make this distinction clearer.
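For reference, the average-constrained setting discussed here has the standard CMDP form below (written from the standard formulation; the paper's Eq. (7) is assumed to match this up to notation):

```latex
\max_{\pi} \;\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\right] \;\le\; \hat{c}.
% An optimal policy of such a program generally uses the available cost budget,
% so evaluated costs near (but below) the threshold \hat{c} are expected.
```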

Q5. Hyperparameter tuning

While G.3 lists all potential hyperparameters, we only tuned one, $\alpha$, for CORSDICE. For D-CORSDICE, we tuned two additional hyperparameters (the guidance scale and the number of inference action samples), which were directly adopted from D-DICE [6]. For the number of inference samples, we used a smaller search space than in D-DICE (Table 4 of [6]). The original paper did not specify a search space for the guidance scale, so we referred to the values in their official implementation. Importantly, all methods, including the baselines, were tuned with the same hyperparameter search budget and procedure to ensure a fair comparison. We will clarify these points in the revised manuscript.

[1] Dual RL: Unification and New Methods for Reinforcement and Imitation Learning. ICLR 2024

[2] ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update. ICLR 2024

[3] Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization. AAAI 2024

[4] Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model. ICLR 2024

[5] Quantile Constrained Reinforcement Learning: A Reinforcement Learning Framework Constraining Outage Probability. NeurIPS 2022

[6] Diffusion-DICE: In-Sample Diffusion Guidance for Offline Reinforcement Learning. NeurIPS 2024

Final Decision

The reviewers recognize the thoroughness of both the theoretical analysis and the experimental campaign in diagnosing and proposing solutions to subtle issues of DICE approaches. At the same time, since several recent papers address related problems and present related solutions, further work should be put into the presentation in order to clearly position the contribution within the recent DICE literature, in particular with respect to the insights from [1] and [2] (e.g., the connection between SemiDICE and behavior-regularized offline RL) and the algorithmic solutions from [3] and [4]. Besides crediting these previous works, the overall presentation should be adjusted to provide a clear picture to the reader.

[1] Dual RL: Unification and New Methods for Reinforcement and Imitation Learning. ICLR 2024

[2] ODICE: Revealing the Mystery of Distribution Correction Estimation via Orthogonal-gradient Update. ICLR 2024

[3] Relaxed Stationary Distribution Correction Estimation for Improved Offline Policy Optimization. AAAI 2024

[4] COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation