Online Optimization for Offline Safe Reinforcement Learning
Creating safe and reward-maximizing policies from offline data via a min-max optimization formulation solved using no-regret algorithms
Abstract
Reviews and Discussion
The authors propose O3SRL, a training framework for offline safe RL. O3SRL builds on strong duality of the Lagrangian to solve the constrained optimization problem, by iteratively solving a cost-penalized MDP and then updating the Lagrange multiplier with a no-regret “online” learning algorithm. The authors highlight an inherent limitation of offline policy evaluation that is required in their theoretical framework (used to update the Lagrange multiplier by the no-regret algorithm). To address this limitation practically, they discretize the Lagrange multiplier and use a multi-armed bandit to pick it, without directly evaluating policies offline. Finally, the authors demonstrate the performance of O3SRL on several offline safe RL tasks, and show that their method outperforms existing baselines, especially when the budget for the constraint is tight.
Strengths and Weaknesses
Strengths
- The empirical evaluation is comprehensive and adequate. The authors show that their method outperforms many existing baselines.
- The use of EXP3 to circumvent offline policy evaluation is novel and interesting.
Weaknesses
- The proposed framework and theory have been studied before in [1]. I couldn't find any reference to it in the paper.
- While providing convergence guarantees (Theorems 1 & 2) adds value to the paper, I believe that showing the conditions for constraint satisfaction upon deployment, after training with offline data, would give readers a more interesting theoretical insight and would considerably strengthen the paper. This is all the more true given that the main theoretical argument used in the paper has already been shown in [1].
- Some theoretical claims are not precise or not rigorous enough:
- L.128: the objective and constraints are generally linear only when there is no (non-linear) function approximation involved; this is usually not the case in many practical problems and even in your own experiments. The implication here is that, formally speaking, strong duality does not necessarily hold in many of these problems.
- L. 129: Slater's condition is typically assumed, i.e., Equation 1 does not always satisfy it.
- Remark 1, L. 162-164, “notions of coverage depend very mildly on the reward functions, if at all”. I don't think that this statement is true. What about sparse reward tasks? Consider for example the sparse reward tasks in [2].
- While the use of the bandit algorithm is novel and seems to yield strong results, its practicality is somewhat limited.
Minor
- Using only three seeds for evaluation should be improved.
Presentation
- I believe that the empirical results are the more significant contribution of this paper and should receive more focus.
- The significance of low budget constraints should be further motivated and highlighted.
- Consider presenting your empirical results in a plot instead of a table---it is much easier to parse and could convey your empirical results better.
[1] Le, Hoang, Cameron Voloshin, and Yisong Yue. "Batch policy learning under constraints." In International Conference on Machine Learning, pp. 3703-3712. PMLR, 2019.
[2] Nair, A., Gupta, A., Dalal, M., and Levine, S. AWAC: Accelerating online reinforcement learning with offline datasets. arXiv, June 2020.
Questions
No questions.
Limitations
I believe that the main limitation of the proposed work is the practicality of the proposed algorithm:
- Using a bandit to pick the Lagrange multiplier will require practitioners to implement an additional (non-trivial) subroutine.
- The algorithm returns a “mixing” policy that combines several policies and draws among them randomly, as is common in previous works that propose primal-dual algorithms for CMDPs. This is usually a limitation in practice, because such mixing policies are not easy to implement and might execute severely unsafe actions.
While the use of a bandit limits the practicality of O3SRL, I believe that this is still a useful contribution.
Justification for Final Rating
The method seems to be empirically strong. The use of MAB is novel and interesting.
I still have some concerns regarding the rigor of the presentation; however, since the main contribution of this work is not theoretical, I consider this a minor limitation.
Formatting Concerns
No formatting concerns.
Thanks for your constructive feedback and review comments. We address them below.
Re: “The proposed framework and theory have been studied before in [1]”
Thanks for pointing out this paper. We will discuss [1] and its connections in the final paper.
The Lagrangian formulation for offline safe RL is quite standard and is shared by both our work and [1]. [1] and our work also share the general idea of using online optimization algorithms to tune the Lagrange multiplier. The critical difference is in the design of the algorithmic frameworks: our framework requires one call to an offline RL oracle per loop and does not rely on any OPE oracle. In contrast, [1] requires two calls to an offline RL oracle and four calls to an OPE oracle per loop. Since OPE procedures are known to be typically unstable, we believe that our O3SRL framework has unique advantages.
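For concreteness, here is a minimal sketch of the one-oracle-call-per-loop structure described above. All names and signatures are placeholders for illustration, not the paper's actual interfaces; in particular, we assume the oracle also returns its own value estimate for the policy it outputs.

```python
# Minimal sketch (not the paper's code) of the loop structure described above.
# `offline_rl` stands for any off-the-shelf offline RL algorithm, and
# `no_regret_update` for the online learner over the Lagrange multiplier.
def o3srl_style_loop(dataset, cost_budget, num_iters, offline_rl, no_regret_update):
    # dataset: iterable of (state, action, reward, cost, next_state) tuples (assumed format)
    lam = 1.0            # initial Lagrange multiplier (arbitrary placeholder)
    policies = []
    for _ in range(num_iters):
        # One offline RL call per iteration: solve the cost-penalized MDP obtained
        # by shaping each transition's reward as r - lam * c.
        shaped = [(s, a, r - lam * c, s_next) for (s, a, r, c, s_next) in dataset]
        policy, value_estimate = offline_rl(shaped)
        policies.append(policy)
        # Update lam with a no-regret rule using only quantities already produced
        # by the offline RL call; no separate OPE procedure is invoked.
        lam = no_regret_update(lam, value_estimate, cost_budget)
    return policies      # the theory analyzes a mixture; in practice the last policy is used
```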
Re: “convergence guarantees with conditions for constraint satisfaction after deployment…main theoretical argument that is used in the paper has already been shown by [1]”
The solutions that our algorithms converge to (at the rates specified by Theorems 1 and 2) satisfy the cost constraint at deployment, by the very definition of the Lagrangian formulation in Eq. 3. The nature of our theoretical argument is completely different from that in [1]: our framework works with a general offline RL oracle, allowing us to integrate any state-of-the-art off-the-shelf offline RL algorithm, and does not need OPE procedures, whereas [1] only uses FQI and relies on OPE procedures, which are known to be unstable.
Re: “L.128: the objective and constraints are generally linear only when there is no (non-linear) function approximation involved”
The Lagrangian formulation in Section 4.1 has nothing to do with function approximation. When the policy class is expressive enough, Slater’s condition easily holds. The linearity we exploit is the linearity of the objective with respect to λ.
Re: “L. 129: Slater's condition is typically assumed, i.e., Equation 1 does not always satisfy it.”
When the policy class is expressive enough, Slater’s condition easily holds.
Re: “Remark 1, L. 162-164, “notions of coverage depend very mildly on the reward functions, if at all”. I don't think that this statement is true. What about sparse reward tasks?”
We refer the reviewer to our explanation in Lines 164-167, where we say “most notions of coverage”, not “all notions of coverage”. Even in sparse reward tasks, we still believe the statement applies, though the extremity of sparse rewards would place more emphasis on the influence of a particular (sparse) reward function on the data coverage.
Re: “While the use of the bandit algorithm is novel and seems to yield strong results, its practicality is somewhat limited.”
We use the multi-armed bandit approach precisely because of the practicality it offers: it avoids relying on off-policy evaluation procedures. We demonstrate this practicality through experiments on a standard suite of benchmarks.
Note that the bandit algorithm is used only during offline training. The final output is a standard policy that maps states to actions and can be deployed directly. No bandits or additional machinery are involved at inference time. The bandit component is used solely during training to embed the desired behavior into the final policy.
Re: “Suggestions for presentation and experimental results”
We will adjust the focus to better highlight the empirical results and the significance of low-budget constraints. Initially, space limitations influenced our choice of presentation. We also note that, per recent NeurIPS guidelines, we are not allowed to share external visualizations or link to supplemental PDFs during the review process. However, we agree that plots can improve clarity.
Re: “Using a bandit to pick the Lagrange multiplier will require practitioners to implement an additional (non-trivial) subroutine”
The multi-armed bandit (MAB) subroutine is standard. MAB methods are commonly used in practice for adaptive decision-making applications (e.g., A/B testing). The simplicity is apparent in the pseudo-code we provide, which shows that this is a very small wrapper around an offline RL algorithm.
Note that the bandit component is used only during training to guide the learning process. After training, O3SRL produces a single policy that maps states to actions and can be deployed directly, with no bandits or additional components involved at inference time.
Re: “The algorithm returns a “mixing” policy that combines several policies and draws among them randomly. This is usually a limitation in practice, because such mixing policies are not easy to implement and might execute severely unsafe actions.”
The mixture policy is only used for the theoretical analysis. In practice, we find that a single policy from the last training iteration is sufficient for strong results (see Lines 231–234 for the concrete approach used in evaluation). The final output is a standard policy that embeds the effect of the mixture during training. Although we use a mixture of shaped rewards based on sampled λ values during training, the resulting policy maps states to actions directly and requires no runtime mixing or sampling.
Re: “Using only three seeds for evaluation should be improved.”
We agree that more seeds can help provide a more reliable estimate of performance. To address this, we include results from three additional seeds (40, 50, and 60) on top of the original 10, 20, and 30. Columns 1 and 2 report the rewards and costs (mean ± standard deviation) over the original 3 seeds. Columns 3 and 4 show the same metrics computed over all 6 seeds.
We note that using three seeds with 20 evaluation episodes per seed is the standard practice in OSRL, as followed in prior works. Nonetheless, we appreciate the suggestion and have included the extended results for completeness.
Comparison over 3 Seeds vs. 6 Seeds
| Environment | Reward (3 seeds) | Cost (3 seeds) | Reward (6 seeds) | Cost (6 seeds) |
|---|---|---|---|---|
| BallRun | 0.25 ± 0.03 | 0.00 ± 0.00 | 0.26 ± 0.02 | 0.00 ± 0.00 |
| CarRun | 0.96 ± 0.01 | 0.02 ± 0.03 | 0.96 ± 0.01 | 0.00 ± 0.00 |
| DroneRun | 0.32 ± 0.05 | 0.68 ± 1.18 | 0.37 ± 0.03 | 0.74 ± 0.67 |
| AntRun | 0.33 ± 0.13 | 0.14 ± 0.10 | 0.22 ± 0.10 | 0.17 ± 0.19 |
| BallCircle | 0.62 ± 0.01 | 0.06 ± 0.07 | 0.63 ± 0.04 | 0.26 ± 0.25 |
| CarCircle | 0.66 ± 0.03 | 0.11 ± 0.16 | 0.66 ± 0.03 | 0.04 ± 0.10 |
| DroneCircle | 0.49 ± 0.07 | 0.23 ± 0.34 | 0.47 ± 0.03 | 0.02 ± 0.04 |
| AntCircle | 0.48 ± 0.06 | 0.00 ± 0.00 | 0.44 ± 0.05 | 0.02 ± 0.03 |
The results across 6 seeds confirm the stability of O3SRL, with consistent rewards and safe cost levels across tasks. Variance remains low, and performance trends observed with 3 seeds largely hold, indicating robust and reliable behavior.
Thank you for your response.
The Lagrangian formulation in Section 4.1 has nothing to do with function approximation. When the policy class is expressive enough, Slater’s condition easily holds. The linearity we exploit is the linearity of the objective with respect to λ.
I believe this stands in contradiction to line 128 in the paper: “Note that the objective and the constraint in Equation (1) are linear, thus convex and differentiable.”
Equation (1)
Equation 1 is linear (thus convex and differentiable) w.r.t. the state-occupancy measure of the policy. Therefore:
- I cannot find the relation to you are referring to. Is it a bug in the paper and you were actually referring to the Lagrangian form in Equation 2? There indeed the relation to is linear.
- Otherwise, when the policy is learned via function approximation, this linear relation typically breaks.
- In addition, when function approximation is involved, strong duality does not necessarily hold [1].
When the policy class is expressive enough, Slater’s condition easily holds. (answering “L. 129: Slater's condition is typically assumed, i.e., Equation 1 does not always satisfy it.”)
Precisely, when it is expressive enough. You implicitly assume that your policy class is expressive enough, which is not always the case. Being more explicit and clear about this point would make the paper more rigorous.
[1] Paternain, Santiago, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. "Constrained reinforcement learning has zero duality gap." Advances in Neural Information Processing Systems 32 (2019).
Thank you for your comments.
We remark that there is no contradiction between our previous response and Line 128. The objective and constraint in Eq (1) are linear (w.r.t. the optimization variable). The unconstrained optimization in Eq (2) is merely a standard Lagrangian formulation of the constrained optimization in Eq (1); function approximation is not relevant to the equivalence of these two forms.
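Concretely, the generic form under discussion is the following (notation assumed here for illustration, with $V_r$, $V_c$ the expected discounted reward and cost and $\tau$ the budget; the paper's Eq. (1)-(2) may use different symbols):

```latex
\max_{\pi \in \Pi} \; V_r(\pi)
\quad \text{s.t.} \quad V_c(\pi) \le \tau
\qquad \Longleftrightarrow \qquad
\max_{\pi \in \Pi} \, \min_{\lambda \ge 0} \; \mathcal{L}(\pi, \lambda),
\qquad \mathcal{L}(\pi, \lambda) \;=\; V_r(\pi) - \lambda \bigl( V_c(\pi) - \tau \bigr).
```

For a fixed $\pi$, $\mathcal{L}$ is affine in $\lambda$; written in terms of the state-action occupancy measure, it is also linear in that measure for a fixed $\lambda$, which is the sense in which both sides of this exchange invoke linearity.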
Precisely, when it is expressive enough
As long as there exists a policy distribution that is strictly feasible, i.e., one whose expected cumulative cost is strictly below the budget, Slater's condition holds. That condition holds easily, e.g., when the policy class is expressive enough that it contains at least one strictly safe policy.
We will clarify this in the revision.
Hello, and thank you again for your review and for your comments on our rebuttal. We wanted to check whether our last response addressed your concerns, or if there is anything you would like us to expand on before the deadline.
Dear authors,
Some points were addressed. It seems like we cannot reach an equilibrium on some other concerns; however, they are not major, so I will keep my (already positive) score.
Thank you for writing your rebuttal!
This paper proposes to solve the problem of offline safe RL by combining offline RL and online optimization. The incorporation of the cost into the reward enables the use of offline RL, while discretizing the value of the Lagrange multiplier enables the use of multi-armed bandit algorithms, eliminating the need for estimating off-policy value functions. Theoretical results are provided to show the optimality of the proposed framework, and modifications are made upon the framework to make it practically usable. Empirical evaluation on a set of benchmark tasks showcases the effectiveness of the proposed method.
Strengths and Weaknesses
Strengths
- The flow of this paper is clear and easy to follow.
- The incorporation of the cost into the reward enables the use of well-developed offline RL algorithms, which is novel in offline safe RL as far as I know.
- The discretization of the Lagrange multiplier is novel and empirically effective. Combined with the EXP3 algorithm, this eliminates the need for estimating the off-policy values which could be erroneous.
- The empirical results are pretty promising, especially in terms of the safety metric, which is desirable for offline safe RL algorithms.
Weaknesses
- The number and variety of the tasks evaluated on can still be improved. A more comprehensive evaluation like the one in the FISOR paper would make the comparative results much more convincing.
- Missing related work (and probably a baseline): TREBI [2], which serves as a baseline in the FISOR paper [1].
Minors
- Line 86: missing in-text reference.
References
[1]: Zheng et al., 2024, "Safe offline reinforcement learning with feasibility-guided diffusion model".
[2]: Lin et al., 2023, "Safe offline reinforcement learning with real-time budget constraints".
Questions
- Line 286 concludes that higher values of K lead to diminishing returns, but from Table 1 the trend is the opposite.
- Why does discretizing the Lagrange multiplier help stabilize learning?
- Could you provide a more intuitive explanation of how the EXP3 algorithm updates the distribution over arms? Although this is not the contribution of this work, I think an intuition of how it works to avoid OPE can be insightful to readers. Maybe putting it in the Appendix would be a good choice.
- Table 1: Why does increasing the granularity reduce the safety of the agent? How can we interpret this empirical result? I couldn't see an intuitive causal relationship between this hyperparameter and the safety performance.
- Table 3:
- Do the results mean that the agent gets closer to the safety budget when the budget is higher? Shouldn't the normalized cost stay relatively the same regardless of the budget?
- Will it finally reach or even violate the constraint when the budget is high enough?
- Line 331 says approaching the budget can "make better use of the available budget", does this mean that the results of O3SRL in Table 2 can still be improved since the normalized cost is mostly close to 0?
Limitations
N/A
Justification for Final Rating
I raised a few concerns about the evaluation and related work, and the authors addressed them sufficiently. So I decided to raise my confidence in giving this paper a high score.
Formatting Concerns
No major formatting issues found.
Thanks for your constructive feedback and review comments. We address them below.
Re: “The number and variety of the tasks evaluated on can still be improved. Missing related work (and probably a baseline).”
We have extended our evaluation by running O3SRL on additional environments from the Gymnasium Safety Suite, following the FISOR setup with a tight cost limit of 10. Results are summarized in the table below. (R) denotes the normalized reward and (C) the normalized cost (< 1 means safe); costs are not normalized in the last table.
Regarding TREBI [2], we will add it to the related work section. We attempted to include it as a baseline, but faced persistent issues with installing its dependencies and setting up the environment correctly. Unlike more recent baselines such as CAPS and CCAC [3], TREBI predates the standardized OSRL baselines framework, which makes integration and reproducibility more difficult.
Additionally, both the FISOR and CCAC papers report that TREBI underperforms compared to their approaches. We opted to include CCAC as an additional baseline in our experiments. We also already compare against CAPS, a recent method with strong performance. Both CCAC and CAPS are cost-limit agnostic, similar to TREBI, and can adapt across different budget levels.
[3] Constraint-Conditioned Actor-Critic for Offline Safe Reinforcement Learning. https://openreview.net/pdf?id=nrRkAAAufl
| Environment | CAPS | FISOR | CCAC | O3SRL |
|---|---|---|---|---|
| PointCircle1 (R) | 0.31 ± 0.05 | 0.43 ± 0.05 | 0.57 ± 0.08 | 0.53 ± 0.02 |
| PointCircle1 (C) | 0.94 ± 1.53 | 14.93 ± 4.24 | 6.65 ± 3.51 | 1.31 ± 0.62 |
| PointCircle2 (R) | 0.44 ± 0.06 | 0.76 ± 0.05 | 0.03 ± 0.75 | 0.52 ± 0.04 |
| PointCircle2 (C) | 0.10 ± 0.09 | 18.02 ± 4.17 | 3.99 ± 4.30 | 0.55 ± 0.41 |
| AntVelocity (R) | 0.87 ± 0.04 | 0.89 ± 0.01 | -1.01 ± 0.00 | 0.45 ± 0.43 |
| AntVelocity (C) | 0.19 ± 0.06 | 0.00 ± 0.00 | 0.00 ± 0.00 | 0.05 ± 0.06 |
| HalfCheetah (R) | 0.86 ± 0.01 | 0.89 ± 0.01 | 0.91 ± 0.04 | 0.93 ± 0.02 |
| HalfCheetah (C) | 0.27 ± 0.07 | 0.00 ± 0.00 | 0.97 ± 0.12 | 0.00 ± 0.00 |
| HopperVelocity (R) | 0.29 ± 0.14 | 0.12 ± 0.03 | -0.01 ± 0.02 | 0.51 ± 0.31 |
| HopperVelocity (C) | 1.61 ± 1.53 | 0.90 ± 1.07 | 0.00 ± 0.00 | 0.20 ± 0.35 |
| SwimmerVelocity (R) | 0.38 ± 0.10 | -0.02 ± 0.05 | 0.06 ± 0.25 | 0.05 ± 0.02 |
| SwimmerVelocity (C) | 2.11 ± 3.01 | 0.23 ± 0.20 | 0.91 ± 1.57 | 0.00 ± 0.01 |
| Walker2dVelocity (R) | 0.78 ± 0.01 | 0.51 ± 0.03 | 0.40 ± 0.17 | 0.77 ± 0.02 |
| Walker2dVelocity (C) | 0.08 ± 0.07 | 1.92 ± 0.57 | 5.20 ± 1.16 | 0.02 ± 0.02 |
O3SRL satisfies the cost constraint on 6 out of 7 additional tasks, demonstrating strong generalization across environments. The next best method, CAPS, is safe on 5 out of 7. In terms of performance, O3SRL achieves the highest reward on 3 tasks, the most among all methods. Additionally, it remains competitive with the best-performing method on SwimmerVelocity (0.05 vs. 0.06) and Walker2dVelocity (0.77 vs. 0.78), while maintaining safety.
The table below compares O3SRL and CCAC on the Bullet Gym environments, as reported in Table 2 of the main results.
| Environment | CCAC (Mean ± Std) | O3SRL (Mean ± Std) |
|---|---|---|
| BallRun (R) | 0.31 ± 0.01 | 0.25 ± 0.03 |
| BallRun (C) | 0.00 ± 0.00 | 0.00 ± 0.00 |
| CarRun (R) | 1.82 ± 0.84 | 0.96 ± 0.01 |
| CarRun (C) | 24.57 ± 20.94 | 0.02 ± 0.03 |
| DroneRun (R) | 0.50 ± 0.08 | 0.32 ± 0.05 |
| DroneRun (C) | 16.29 ± 9.35 | 0.68 ± 1.18 |
| AntRun (R) | 0.03 ± 0.10 | 0.33 ± 0.13 |
| AntRun (C) | 0.00 ± 0.00 | 0.14 ± 0.10 |
| BallCircle (R) | 0.47 ± 0.38 | 0.62 ± 0.01 |
| BallCircle (C) | 10.19 ± 17.65 | 0.06 ± 0.07 |
| CarCircle (R) | 0.70 ± 0.01 | 0.66 ± 0.03 |
| CarCircle (C) | 1.61 ± 0.51 | 0.11 ± 0.16 |
| DroneCircle (R) | 0.43 ± 0.04 | 0.49 ± 0.07 |
| DroneCircle (C) | 0.06 ± 0.07 | 0.23 ± 0.34 |
| AntCircle (R) | 0.49 ± 0.07 | 0.48 ± 0.06 |
| AntCircle (C) | 0.02 ± 0.03 | 0.00 ± 0.00 |
O3SRL satisfies the cost constraint on 8 out of 8 tasks, while CCAC is safe on only 4. Additionally, O3SRL achieves the highest reward on 6 out of the 8 tasks.
Re: “Line 286 concludes that higher values of K lead to diminishing returns, but from Table 1 the trend is the opposite.”
Thank you for pointing this out. The sentence was not phrased well. Our intended meaning was that increasing K beyond a certain point (e.g., from 5 to 10) provides relatively smaller gains compared to the improvement seen when increasing from K=2 to K=5.
Re: “Why does discretizing the Lagrange multiplier help stabilize learning?”
If we optimize over a continuous Lagrange multiplier, we generally need Off-Policy Evaluation (OPE) procedures on top of the offline RL oracle. OPE is known to be typically unstable, especially when the target policy, i.e., the policy returned by the offline RL oracle, keeps changing during the iterative process. By discretizing λ, we completely avoid the need for OPE.
Re: “Could you provide a more intuitive explanation of how the EXP3 algorithm updates the distribution over arms?”
EXP3 maintains a distribution over arms, where the mass is biased toward the arms that have been observed to achieve higher rewards in the past rounds. As per your suggestion, we will provide more intuition in the main paper and detailed description in the Appendix.
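To make the intuition concrete, here is the textbook EXP3 update as a sketch (the standard algorithm, for illustration only; not necessarily the exact variant implemented in O3SRL). Arms correspond to the discretized λ values, and feedback is observed only for the arm actually played, which is why no off-policy evaluation of other arms is needed.

```python
import numpy as np

# Illustrative EXP3 update (standard algorithm; not the paper's exact implementation).
# K arms correspond to the discretized Lagrange multiplier values.
class EXP3:
    def __init__(self, num_arms, gamma=0.1):
        self.K = num_arms
        self.gamma = gamma                  # exploration rate
        self.weights = np.ones(num_arms)

    def probabilities(self):
        w = self.weights / self.weights.sum()
        return (1.0 - self.gamma) * w + self.gamma / self.K

    def sample_arm(self, rng):
        return rng.choice(self.K, p=self.probabilities())

    def update(self, arm, reward):
        # reward is assumed rescaled to [0, 1]; only the played arm's weight
        # changes, via an importance-weighted reward estimate.
        p = self.probabilities()[arm]
        estimated = reward / p
        self.weights[arm] *= np.exp(self.gamma * estimated / self.K)
```

One could instantiate `EXP3(num_arms=5)` for five discretized λ values, call `sample_arm` to pick a λ, and call `update` with the observed feedback for that arm only.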
Re: “Table 1: Why does increasing the granularity reduce the safety of the agent? How can we interpret this empirical result?”
In our setup, safety is defined as keeping the normalized cost below 1, i.e., staying within the given cost budget. From this perspective, all policies that respect the budget are considered safe. Increasing granularity helps the agent discover better policies within the safe region. The goal is not to minimize cost to zero, but rather to maximize reward while staying under the cost limit. An ideal policy would use the available budget effectively to collect as much reward as possible. When the budget is tight (a limit of 5 means the agent is allowed only five mistakes over an entire trajectory), the agent may behave conservatively and avoid costs entirely, even if using some of the budget could improve reward.
Re: “Table 3: questions on results”
All reported costs are normalized (i.e., divided by the cost limit). For example, a normalized cost of 0.5 under a cost limit of 20 corresponds to an actual accumulated cost of 10. The goal is to stay below 1.0, and the percentage of budget used is not inherently important as long as this condition is met.
Given more budget, the agent accumulates more cost in absolute terms, which might not be fully reflected in the normalized cost values. In the table below, we report the unnormalized costs. The agent typically leaves a safety margin and avoids pushing right up to the limit. This is intentional. Since offline RL uses fixed data and faces distribution shift, the agent must be pessimistic to avoid violating the constraint during evaluation. A slightly conservative policy is preferable to one that risks infeasibility.
Regarding Table 2: yes, the normalized costs are mostly close to 0, but this is because those results use a tight budget of 5, and the agent learns to avoid costs as much as possible to remain safe across episodes (200–500 steps). In that regime, even a few binary cost events can be enough to exceed the limit. We include additional experiments with a larger cost limit of 80 below, where the agent accumulates more costs while still remaining safe. This supports the observation that O3SRL adapts based on the constraint and trades off conservatism appropriately.
O3SRL Performance with Unnormalized Costs
| Environment | Cost Limit 5 (R) | Cost Limit 5 (C) | Cost Limit 20 (R) | Cost Limit 20 (C) | Cost Limit 40 (R) | Cost Limit 40 (C) | Cost Limit 80 (R) | Cost Limit 80 (C) |
|---|---|---|---|---|---|---|---|---|
| DroneRun | 0.32 | 3.40 | 0.38 | 8.20 | 0.38 | 17.60 | 0.42 | 30.08 |
| BallCircle | 0.62 | 0.30 | 0.69 | 11.60 | 0.76 | 25.60 | 0.81 | 38.32 |
Thank you for the responses. My confusion has been well clarified. I would keep the score as is.
This paper studies offline safe reinforcement learning (RL) with a constrained MDP notion of safety. The goal is to maximize the reward while keeping the associated cost of transitions under a certain safety budget. An algorithm is proposed that solves the Lagrangian objective using two approaches:
- an offline RL oracle that optimizes over the given data using adjusted returns that include both reward and cost with current Lagrange multiplier values
- a no-regret algorithm for optimizing the Lagrangian multiplier that is implemented interestingly as a discretization of the multiplier value solved using a bandit approach
The results show that, on safe RL benchmarks, the proposed approach does well, always leading to safe policies that achieve high returns. The algorithm is also robust to the choice of offline RL oracle.
Strengths and Weaknesses
Strengths
- Safe RL is an important problem.
- The proposed approach does make sense
- Experiments are good
Weaknesses
- Parts of the paper, especially the theoretical part, do not make full sense to me. Firstly, there is no mention of data distribution coverage or use of pessimism, the latter of which is common in the offline RL literature. The offline RL oracle assumption seems particularly strong: it essentially assumes that, given the data you have, you can learn an optimal policy with high probability. This is not true in general unless you assume concentrability or other notions of good data coverage. Alternatively, most offline RL approaches assume a data distribution that only covers the optimal policy and use pessimism.
- Writing is unclear. The paper talks about how to approximate the value function of the optimal policy without explaining why this is needed. The oracle for the Lagrange multiplier in Definition 3 does not seem to take that as input, although later it is used in Line 4 of Algorithm 2.
- It is unclear how an alternative approach would perform that simply runs the offline RL algorithm, treating lambda as a hyperparameter and tuning it on some dev set. It seems the core contribution is just to find the right value of this scalar.
Questions
- Is the data distribution assumed to have good coverage? What assumptions will the offline RL oracle need to guarantee the requirements in Definition 2?
- Why does the value function need to be estimated? There is no mention of this in Definition 3. Also, it is stated in line 200 that there is an "approximation of the value function of the optimal policy", but then Equation 6 shows it is an estimate for the current policy. Shouldn't it be the optimal policy?
- What is a "stochastic approximation"?
- It seems to me that the entire contribution of the paper is to set the value of the Lagrange multiplier and then call the offline RL algorithm. If yes, then how would a much simpler approach do that simply calls an offline RL algorithm with the Lagrange multiplier treated as a single scalar hyperparameter, and then sets this hyperparameter by doing runs on a held-out set of online runs? It is possible that iterative learning with averaging is needed, but this baseline should have been evaluated.
- Related work: Provable Safe Reinforcement Learning with Binary Feedback, Bennett et al. AISTATS 2023.
Minor:
- Broken link on line 186
- Citations should be in parentheses at several places such as Line 209
Limitations
N/A
Justification for Final Rating
The authors have provided clarification, and so I have increased the score.
Formatting Concerns
N/A
Thanks for your constructive feedback and review comments. We address them below.
Re: “clarifications on theoretical part: offline RL oracle assumption and data coverage”
Remark 1 (Lines 158-170) clarifies that the offline RL oracle assumption absorbs the data coverage condition.
Pessimism is an algorithmic approach that an offline RL oracle might employ to achieve a good estimation error rate.
The offline RL oracle assumption abstracts away many state-of-the-art off-the-shelf offline RL algorithms. Any offline RL algorithm that returns a near-optimal policy, in theory or in practice, belongs to the oracle by definition. The rich offline RL literature, both theoretical and practical, offers many choices of good offline RL algorithms (e.g., TD3BC and IQL). More broadly, assuming oracles in theoretical work is a widely used device, as it allows theoretical problems to be decomposed into pieces where independent progress can be made. In summary, we provide a reduction approach that leverages powerful offline RL algorithms to effectively solve offline safe RL problems.
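To illustrate what the oracle abstraction amounts to, a hypothetical interface could look like the following sketch (names, types, and the assumption that the oracle also returns its own value estimate are illustrative only, not Definition 2 verbatim):

```python
from typing import Callable, Protocol, Sequence, Tuple

# (state, action, reward, cost, next_state); types are placeholders.
Transition = Tuple[object, object, float, float, object]

class OfflineRLOracle(Protocol):
    """Any algorithm (e.g., TD3BC, IQL) that, given an offline dataset and a scalar
    shaping of (reward, cost), returns a policy that is near-optimal for the shaped
    reward with high probability fits this contract."""

    def __call__(
        self,
        dataset: Sequence[Transition],
        shaped_reward: Callable[[float, float], float],
    ) -> Tuple[Callable[[object], object], float]:
        # Returns (policy mapping state -> action, the algorithm's own value estimate).
        ...
```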
Re: “unclear writing: approximate value function and parts of Algorithm 2”
The approximate value function (Line 3, Algorithm 2) is a function of the returned policy and the offline data. So the oracle for the Lagrange multiplier in Definition 3 does not explicitly rely on the approximate value function.
Re: “alternative approach: offline RL with lambda as a hyper-parameter”
Note that in our framework the term "online" refers to the use of online/streaming optimization algorithms (e.g., bandits) for optimizing parameters, in this case the Lagrange multiplier. This is different from how "online" is used in the context of online RL, where algorithms are allowed to run policies online in the environment. Our work is within offline RL.
We conducted the experiment that the reviewer suggested, where we fix the value of λ throughout training and evaluate the resulting policy at the end. This setup violates the spirit of offline RL, as it assumes we can execute policies in the environment to collect validation data for the selection of λ. We are aware that some works tune key hyperparameters this way, but it remains inconsistent with the offline RL setting. Moreover, we cannot exhaustively search the entire space of λ values.
In our method (O3SRL), we define a discrete set of λ values. Each value corresponds to an "arm". During training, we sample a λ from a learned distribution over these arms and use it to shape the reward. This distribution is updated at each batch following EXP3 logic, allowing the policy to adaptively mix behaviors instead of committing to a single λ.
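A minimal sketch of this per-batch step, with placeholder names (`exp3` denotes an EXP3-style learner over the discrete λ arms, and `policy_update` is assumed to perform one offline RL update and return a scalar feedback signal rescaled to [0, 1]):

```python
# Sketch for illustration only; not the exact training code.
def train_step(batch, lambda_values, exp3, policy_update, rng):
    arm = exp3.sample_arm(rng)                 # pick one discretized lambda for this batch
    lam = lambda_values[arm]
    shaped = [(s, a, r - lam * c, s_next) for (s, a, r, c, s_next) in batch]
    feedback = policy_update(shaped)           # offline RL update; scalar feedback (assumed)
    exp3.update(arm, feedback)                 # EXP3 sees feedback only for the played arm
    return lam
```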
For the experiment, we train five separate policies, each with a fixed λ (arms 0 through 4). No updating or mixture is used in these runs. The table below compares their final performance with O3SRL on the DroneCircle environment. Results are averaged over 3 seeds. We report normalized rewards and normalized costs (i.e., total cost divided by the budget) for three budget levels: 5, 20, and 40.
DroneCircle Results
| cost limit | Arm 0 | Arm 1 | Arm 2 | Arm 3 | Arm 4 | O3SRL |
|---|---|---|---|---|---|---|
| 5 - (R) | 0.90 | 0.86 | 0.55 | 0.47 | 0.39 | 0.49 |
| 5 - (C) | 18.18 | 16.52 | 2.47 | 0.00 | 0.00 | 0.23 |
| 20 - (R) | 0.90 | 0.87 | 0.75 | 0.49 | 0.40 | 0.53 |
| 20 - (C) | 4.54 | 4.33 | 2.80 | 0.04 | 0.00 | 0.36 |
| 40 - (R) | 0.90 | 0.87 | 0.82 | 0.52 | 0.42 | 0.65 |
| 40 - (C) | 2.27 | 2.21 | 1.88 | 0.08 | 0.00 | 0.85 |
Takeaways:
- Arm 0 (λ=0) obviously ignores cost and violates all constraints.
- Arms 1 and 2 fail to enforce the constraints reliably since the penalty is low.
- Arms 3 and 4 stay within budget but sacrifice reward, with Arm 4 being most conservative.
- O3SRL outperforms all fixed-λ policies by dynamically adjusting λ, achieving better reward and cost trade-offs.
Re: “Is data distribution assumed to have good coverage? What assumptions will the offline RL oracle need to guarantee the requirements in Definition 2”
Yes, we assume good data coverage; we already discussed this in Remark 1 (Lines 158-170). The assumptions needed by the offline RL oracle to guarantee the requirements in Definition 2 include the ones used in the theory of offline RL: good data coverage and good representation assumptions. Offline RL algorithms might use the pessimism principle to implement the offline RL oracle with a good estimation error rate. All of these are abstracted away and inherently embedded in the offline RL oracle.
Re: “Why does the value function need to be estimated? There is no mention of this in Definition 3. Also, it is stated in line 200 that there is an 'approximation of the value function of the optimal policy', but then Equation 6 shows it is an estimate for the current policy.”
We employ value-based methods and thus, the value function needs to be estimated. One can also opt for model-based methods to get rid of value function estimation, at the cost of using more complex models.
Re: "What is a "stochastic approximation"?"
It was a typo. It should be “the stochastic approximation of the value function of the current policy (distribution)”.
Re: “Related work: Provable Safe Reinforcement Learning with Binary Feedback, Bennett et al. AISTATS 2023”
This paper considers a very different framework with very different assumptions. First, that paper is in the online RL setting, whereas we are in the offline setting. Second, that paper assumes an offline oracle that provides binary feedback on the safety of state-action pairs. In our setting, such a safety signal is not available; instead, there is a cost associated with each state-action pair. Third, the nature of our offline RL oracle is also fundamentally different from the oracle in that paper: their oracle provides a safety signal, whereas our oracle abstracts away standard offline RL algorithms.
Minor
Thank you for pointing this out. We will fix those in the final paper.
Hello, and thank you again for your review. As the deadline is approaching, we just wanted to check whether our rebuttal addressed your questions or if there is anything you would like us to clarify further.
Thanks for the clarification. I will increase my score. Please try to revise the paper to make it easier to follow as discussed above.
The paper studies offline safe RL: each transition in the environment yields both a cost and a reward, and the goal is to have expected total discounted cost below a threshold while maximizing expected total discounted reward. We are only allowed to observe a dataset of trajectories collected ahead of time — i.e., we can’t try out our new policy to collect more data or make sure it is safe and rewarding.
The paper designs a wrapper that repeatedly calls an ordinary offline RL subroutine as a black box (i.e., an ORL subroutine with no consideration of cost/safety constraints). By editing the rewards in our training data to penalize unsafe behavior to varying degrees, the wrapper explores different policies; via no-regret learning, it guarantees convergence to a policy that is estimated to satisfy the safety constraint. (This kind of wrapper design is classical: the earliest wrapper modifying cost/reward that I know of is Dantzig-Wolfe-Benders decomposition from the 1960s, and the earliest I know of using no-regret is in RL and online convex programming research in the 1990s; boosting (1980s) is a no-regret wrapper design for a different problem.) The exact behavior of the combined method depends on the ORL subroutine: e.g., different ORL methods might be more or less conservative about returning a policy whose reward is imperfectly known due to coverage problems in the offline data, and different ORL methods might need different assumptions about data collection in order to work. An advantage of the wrapper construction is generality: we delegate such choices to the ORL subroutine.
The experiments verify that the wrapper (wrapped around an existing ORL method from the literature) is able to uncover safe policies in a variety of RL domains. The results show a better tradeoff than competing methods: higher reward while still maintaining safety. The need to call the ORL subroutine multiple times increases the computational cost, but in many domains, trajectories are much more expensive than computation, so this tradeoff can be worthwhile. (But a full exploration of scalability is left to future work, so the conclusion of a worthwhile tradeoff is provisional.)
The key contribution is the design of the wrapper to be both theoretically and practically sound. One of the key obstacles to be overcome was that there is necessarily high variance in value estimates based on offline data: ORL needs to worry about this for comparing policies, and safe RL needs to worry about this for satisfying the cost constraint. The paper mitigates this concern by switching to a multi-armed bandit formulation, which only needs to observe the value (under the safety-modified rewards) of the single policy selected by the ORL subroutine on each iteration. Unfortunately this aspect needs much better explanation in the final paper: several of us on the review team are RL experts, the authors were responsive during discussion, and we still had trouble following this argument. Key points to clarify:
- The paper repeatedly mentions that OPE is difficult and high variance (true). But the paper still relies on the ORL subroutine’s estimate of the value of its returned policy. Naively, estimating this value is the same problem as OPE, but the paper claims an advantage by using it instead of OPE. Worse, what we need to estimate in OPE is actually the exact same policy, just with a different set of rewards — an even more similar problem. Reading between the lines, the conservatism inherent in most ORL methods would tend to make the value estimate for the returned policy under the passed-in reward structure more reliable than estimates for other policies or reward structures — is this what is being taken advantage of?
- To get to MAB, the paper discretizes lambda. The paper states that continuous lambda would require OPE, but this isn’t obvious: yes, it would require OPE if we treat the learning problem as having infinitely many unrelated arms, one for each lambda value. But that ignores structure: why not treat it as an online convex programming problem, for which we just need a subgradient at the current value of lambda (which we can get from the output of the ORL subroutine, just as we do for the MAB).
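For concreteness, the online-convex-programming alternative raised above would look roughly like the following (notation ours, assumed rather than taken from the paper):

```latex
% \hat V_c(\pi_t): cost estimate for the policy returned by the ORL subroutine at round t,
% \tau: budget, \eta: step size, [\cdot]_+ = \max(\cdot, 0).
\pi_t \;=\; \mathrm{ORL}\bigl(D,\; r - \lambda_t\, c\bigr),
\qquad
\lambda_{t+1} \;=\; \bigl[\lambda_t + \eta \bigl(\hat V_c(\pi_t) - \tau\bigr)\bigr]_{+}.
```

Since the Lagrangian is affine in $\lambda$ for a fixed policy, $\tau - \hat V_c(\pi_t)$ is a (stochastic) subgradient of the dual objective at $\lambda_t$, so standard online-convex-programming guarantees would seem to apply without discretizing $\lambda$; whether this beats the MAB route presumably depends on how reliable that single cost estimate is.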
Despite the above clarity problem, the paper’s contributions are strong: it’s an interesting problem, framed well, with a novel (though classically inspired) solution that seems to work well in experiments. We hope that the final version of the paper will take the above questions into account, as well as the other results of the reviewer-author discussion. (I (the area chair) have posted a few final suggestions in the review thread.) We expect that these updates will lead to an excellent final paper.