PaperHub
Overall score: 6.8/10
Decision: Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 2.3 · Quality: 2.8 · Clarity: 2.8 · Significance: 2.8
NeurIPS 2025

Efficient Safe Meta-Reinforcement Learning: Provable Near-Optimality and Anytime Safety

OpenReview · PDF
Submitted: 2025-05-02 · Updated: 2025-10-29

Abstract

Keywords
Reinforcement learning

Reviews and Discussion

Official Review
Rating: 4

This paper presents a safe meta-RL algorithm. In particular, the paper considers a meta-training stage whose goal is to find a policy that is safe across tasks while maximizing the reward across tasks drawn from a probability distribution. The policy is then adapted to the new task. The policy is also guaranteed to be safe by considering an advantage-based policy update. The paper validates the theory across diverse simulation experiments.

Strengths and Weaknesses

Strengths:

  1. The meta safe policy is very important. The paper thus considers an important problem.

  2. The proposed algorithm has a theoretical basis, and the paper shows theoretical justifications as well.

  3. The empirical results show that the proposed approach has practical significance.

Weaknesses:

  1. (2) says that the policy has to satisfy the constraints of all the tasks, which seems to be very conservative. In particular, consider a grid-world scenario where some locations have obstacles that one wants to avoid; now consider another task where the obstacles are placed at the complementary set of locations; then there will be no feasible policy. Hence, it is not clear what type of policies one can expect to get by solving (2).

  2. In order to solve (1), one again needs to ensure that $\pi_{\phi}$ is safe. Without knowing the task, how to ensure that $\pi_{\phi}$ is safe is not clear. In particular, if the meta-test task is different from the meta-training tasks, it is not clear whether solving (2) will ensure that $\pi_{\phi}$ will be safe for the new task.

  3. Convergence rate results have not been provided.

Questions

Please see weakness

Limitations

N/A

Final Justification

This paper considers an important and interesting problem in meta CMDPs. The paper's idea is sound, and it has interesting simulation results. However, one criticism is that Theorem 1 does not specify an $\epsilon$ that can balance the sub-optimality gap against constraint satisfaction. I mean, if $\epsilon$ is high, then one can show that the policy is feasible; however, the sub-optimality gap is then a constant, and vice versa. Hence, I am not fully convinced about accepting this paper. Some sort of trade-off result would have been good. The authors have also accepted the above.

Formatting Issues

N/A

Author Response

We are grateful for the time and effort invested in evaluating our manuscript, as well as for all the suggestions and reference recommendations, which have made our manuscript a stronger and more valuable contribution.

Weakness 1: (2) says that the policy has to satisfy the constraints of all the tasks, which seems to be very conservative. In particular, consider a grid-world scenario where some locations have obstacles that one wants to avoid; now consider another task where the obstacles are placed at the complementary set of locations; then there will be no feasible policy. Hence, it is not clear what type of policies one can expect to get by solving (2).

Answer: Thanks for the question about Assumption 1 and providing the example. Here, we claim that, although Assumption 1 may be violated, it is not overly conservative and is realistic for the safe meta-RL setting.

Assumption 1 requires the existence of at least one policy that satisfies the constraints across all tasks. It only requires that, within the task distribution, the tasks' feasible policy sets are not mutually exclusive with respect to constraint satisfaction. This assumption is not overly conservative and is actually a minimal requirement if one wants to achieve anytime safety. If the assumption is violated, then for any initial policy, anytime safety would already be violated at the beginning of policy adaptation.

In the CMDP setting of this paper, constraint satisfaction is defined as the expected accumulated cost being below a threshold, i.e., $J_{c_i,\tau}(\pi) \leq d_{i,\tau}$. In the provided example, when the constraint violation threshold satisfies $d_{i,\tau} > 0$, it is possible that there exists a policy $\pi$ such that $J_{c_i,\tau}(\pi) \leq d_{i,\tau}$ for both environments with complementary obstacle sets. In this case, Assumption 1 is satisfied. By solving (2), an initial meta-policy $\pi$ with $J_{c_i,\tau}(\pi) \leq d_{i,\tau}$ can be obtained.
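
To make this concrete, here is a minimal numeric illustration (not taken from the paper): a uniform random-walk policy on a small line of cells is evaluated under two cost maps whose obstacle sets are complementary. With a discounted expected-cost constraint and a threshold $d > 0$, the same policy can be feasible for both tasks. The environment, policy, and numbers below are purely illustrative.

```python
import numpy as np

# Minimal numeric illustration (not from the paper): under an expected-cost
# constraint J_c(pi) = E[sum_t gamma^t c(s_t)] <= d with d > 0, one policy can
# be feasible for two tasks whose obstacle (cost) sets are complementary.
rng = np.random.default_rng(0)
n_cells, gamma, d = 6, 0.9, 6.0
horizon, n_traj = 100, 2000

cost_task1 = np.array([1.0 if i % 2 == 0 else 0.0 for i in range(n_cells)])
cost_task2 = 1.0 - cost_task1  # complementary obstacle set

def expected_cost(cost):
    """Monte Carlo estimate of the expected discounted cost of a uniform
    random-walk policy on a line of n_cells cells (clamped at the ends)."""
    total = 0.0
    for _ in range(n_traj):
        s, disc = n_cells // 2, 1.0
        for _ in range(horizon):
            total += disc * cost[s]
            s = int(np.clip(s + rng.choice([-1, 1]), 0, n_cells - 1))
            disc *= gamma
    return total / n_traj

J1, J2 = expected_cost(cost_task1), expected_cost(cost_task2)
print(f"J_c,task1 = {J1:.2f}, J_c,task2 = {J2:.2f}, threshold d = {d}")
# Both estimates come out below d, so this single policy satisfies the
# expected-cost constraint of both tasks even though every cell is an
# obstacle in exactly one of them, i.e., Assumption 1 holds for this pair.
```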

If we simultaneously consider (i) the strict case where constraint satisfaction requires $J_{c_i,\tau}(\pi) \leq 0$, i.e., strictly no costs are allowed, and (ii) two tasks with complementary sets of obstacles, then Assumption 1 is violated and the optimization problem has no solution. In this case, anytime safety is theoretically unachievable for any algorithm. On the other hand, safe meta-RL aims to extract useful information (a meta-policy) from correlated task samples. If the tasks are too dissimilar, such as when the constraint sets are nearly complementary, then the learned meta-policy is not useful for the meta-test. So the above extreme setting is less meaningful in the safe meta-RL setting, and Assumption 1 should not be viewed as overly conservative, even though the strict case described above does not satisfy it.

Weakness 2: In order to solve (1), one again needs to ensure that $\pi_\phi$ is safe. Without knowing the task, how to ensure that $\pi_\phi$ is safe is not clear. In particular, if the meta-test task is different from the meta-training tasks, it is not clear whether solving (2) will ensure that $\pi_\phi$ will be safe for the new task.

Answer: Thanks for the question. It gets at the intuition behind our anytime safety result.

Our meta-training process is explicitly designed to extract constraint information from a diverse set of correlated training tasks. Although the test task may be new, it often shares structural similarities with the training tasks—especially in terms of the constraints. Since the training tasks collectively cover the relevant constraint space, the meta-policy learns to satisfy a combination of the constraints seen during training. This means that even if the test task is new, its constraints are still within the span of the training constraints, enabling the meta-policy to maintain safety.

From a theoretical analysis point of view, we assume the training task and the test task follow the same distribution. Therefore, when the safe behavior is learned from the meta-policy, the initial policy will satisfy the constraints.

In practice, when the set of sampled training tasks is sufficient, the empirical constraint violation during meta‑test adaptation is almost zero. As shown in Figures 1 and 4, anytime safety is achieved during the meta-test after the meta-policy is learned from the training tasks.

If the test tasks are shifted from the training distribution, the initial policy may violate the safety constraints, and anytime safety cannot, in general, be guaranteed without additional mechanisms. Characterizing constraint violations under such shifts and designing new adaptation algorithms that retain safety is an important direction for future work.

Weakness 3: Convergence rate results have not been provided.

Answer: The convergence rate of stochastic constrained optimization has been extensively studied in general settings, such as in [a] and [b]. As these results already provide well-established convergence guarantees for a broad class of constrained optimization problems, we do not repeat such an analysis here. Instead, our focus is on the unique challenges of safe meta-RL, for which we derive a dedicated theoretical framework that emphasizes optimality bounds and anytime safety guarantees. Our work is the first to provide a comprehensive theoretical analysis of near-optimality and anytime safety for safe meta-RL. Specifically, first, we establish the theoretical basis of the algorithm design that guarantees anytime safety, i.e., zero constraint violation for any policy during policy adaptation. Second, we derive a lower bound on the expected accumulated reward of the adapted policies compared to that of the task-specific optimal policies, which shows the near-optimality of the proposed safe meta-RL framework. Finally, we demonstrate a trade-off between the optimality bound and constraint violation as the allowable constraint violation varies, which enables the algorithm to be adjusted to prioritize either safety or optimality.

Table 1 compares both the theoretical and experimental results between this paper and previous works [24, 12]. First, we study anytime safety and provide a zero constraint violation guarantee. Previous works only provided positive upper bounds on the constraint violation, and these upper bounds only hold for the final policy [24] or the adapted policies [12]. Second, although [24] provides an upper bound on the optimality gap, its experimental optimality is the worst. On the other hand, [12] does not provide an optimality bound. In contrast, our method exhibits high optimality and provides a near-optimality guarantee, outperforming existing approaches in terms of both experimental and theoretical outcomes. Third, our method is more efficient than the existing methods [24, 12].

[a] Dentcheva et al., Optimization with Stochastic Dominance Constraints, 2003.

[b] Geiersbach et al., Sample-Based Consistency in Infinite-Dimensional Conic-Constrained Stochastic Optimization, 2025.

Comment

I thank the authors for their responses. They did clarify a few things. I think that the idea is good and has some value. There is not much work in the meta-CMDP space. But the paper was not clear enough in explaining quite a few things, which raises confusion. I have some other concerns, which I explain in the following.

  1. Specifically, in Theorem 1, the bound may be loose. Are these bounds tight (meaning, can the authors comment on the lower bound, even for the unconstrained case)? Especially, the dependence on $R$ and $\mathcal{V}ar$ can be high in practice.

  2. In Theorem 1, it seems that there is no control knob. If $\delta = 4\gamma\alpha A_{\max}/(1-\gamma)$, then the constraint satisfaction bound is loose, as it is a constant factor away. On the other hand, when $\delta = 0$, the policy is a constant factor away. Given that the main goal is to achieve feasibility, this means that the policy is sub-optimal in this case.

  3. It seems that one needs to solve for the Lagrange multiplier and the optimal policy in line (5) of Algorithm 1, which might incur a large computational cost as it needs to be done for every task sampled in the training phase.

  4. Finally, I am not sure how it will work in the test/inference time. Will it get a few online trajectories from a task, and adapt to it? If so, it can still violate the constraint as the task emerges.

Comment

Thank you for your follow-up question. We appreciate the opportunity to clarify further.

Question 1: Specifically, in Theorem 1, the bound may be loose. Are these bounds tight (meaning, can the authors comment on the lower bound, even for the unconstrained case)? Especially, the dependence on $R$ and $\mathcal{V}ar$ can be high in practice.

Answer: Thanks for pointing out the question. We first clarify $\mathcal{V}ar^\epsilon(\mathbb{P}(\Gamma))$ and the optimality bound, i.e., $\frac{4\gamma\alpha A^{\max}}{(1-\gamma)^2}\mathcal{V}ar^\epsilon(\mathbb{P}(\Gamma))$, in Theorem 1, and show that they can be relatively small and tight. Next, we show the intuition behind the upper bound.

Notice that safe meta-RL aims to extract common knowledge from multiple existing RL tasks. The tasks in the task distribution are usually correlated, and their near-optimal policies usually produce similar actions on a large part of the state space. For example, in the experiments on navigation scenarios with collision avoidance, although optimal policies under different environments should produce different actions when meeting different obstacles, the actions in free space should be similar. As $\mathcal{V}ar(\mathbb{P}(\Gamma)) \triangleq \min_{\phi}\ \mathbb{E}_{\tau \sim \mathbb{P}(\Gamma)} \mathbb{E}_{s \sim \nu_{\tau}^{\pi_\phi}}[D_{KL}(\pi_{*}^{\tau}(\cdot|s) \,\|\, \pi_{\phi}(\cdot|s))]$ computes the expectation over the entire state space, it can be small, especially when the state space is large; it is the minimal average distance between the task-specific optimal policies and a meta-policy. According to our experiments, $0.05$ is a sufficiently large value for the KL divergence between two policies, and the value of $\mathcal{V}ar(\mathbb{P}(\Gamma))$ under this KL divergence is about $\frac{1}{4}\times 0.05$.

Next, we roughly evaluate the magnitude of the optimality bound $\frac{4\gamma\alpha A^{\max}}{(1-\gamma)^2}\mathcal{V}ar^\epsilon(\mathbb{P}(\Gamma))$ and show that it can be relatively tight. We set $\gamma = 0.99$ and the reward/cost $r \in [0,1]$, $c \in [0,1]$. The maximum advantage value $A^{\max} \leq 1$ is generally not very small. However, as indicated in the proof of Lemma 4 in line 1453, we can actually replace $A^{\max}$ by $\max_{s,a} A_\tau^{\pi_\phi^*}(s,a)$, which is computed on the meta-policy and is usually much smaller than $1$; we set it to $0.05$. Then the optimality bound $\frac{4\gamma\alpha A^{\max}}{(1-\gamma)^2}\mathcal{V}ar^\epsilon(\mathbb{P}(\Gamma))$ is about $25$. Note that the magnitude of the total reward/cost is about $\sum_{n=0}^{\infty}\gamma^n = \frac{1}{1-\gamma} = 100$. Therefore, the bound is about 25% of the total reward/cost, which is relatively tight.
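
For reference, the arithmetic behind these numbers can be reproduced in a few lines (assuming $\alpha = 1$, which the discussion above leaves implicit):

```python
gamma = 0.99        # discount factor
alpha = 1.0         # problem-dependent constant in Theorem 1 (assumed 1 here)
A_max = 0.05        # reduced max advantage max_{s,a} A_tau^{pi_phi*}(s, a)
var_eps = 0.05 / 4  # empirical value of Var^eps(P(Gamma)) quoted above

bound = 4 * gamma * alpha * A_max / (1 - gamma) ** 2 * var_eps
total_return = 1 / (1 - gamma)  # max total discounted reward/cost with r, c in [0, 1]

print(f"optimality bound ~ {bound:.1f}")                        # ~ 24.8
print(f"fraction of total return: {bound / total_return:.0%}")  # ~ 25%
```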

It is standard in RL and safe RL to derive a lower bound of an order similar to the optimality bound $\frac{4\gamma\alpha A^{\max}}{(1-\gamma)^2}\mathcal{V}ar^\epsilon(\mathbb{P}(\Gamma))$ in this paper. For example, in TRPO and CPO, the lower bounds also include the term $\frac{A^{\max}}{(1-\gamma)^2}D_{KL}$. Moreover, for conciseness and broad applicability, we have to derive a general optimality bound, which makes the bound looser. When the conditions used and the target problem are specified, a tighter optimality bound can be derived. For example, when a large task distribution is specified, i.e., $\mathcal{V}ar(\mathbb{P}(\Gamma))$ is large, we can also make some inequalities (such as those in lines 917-923) tighter. However, when the target problem must be specified, the generality of the optimality bound is lost and it becomes harder to understand and less intuitive.

Next, we show the intuition behind the upper bound. The variance of the task distribution $\mathcal{V}ar(\mathbb{P}(\Gamma))$ is defined as $\min_{\phi}\ \mathbb{E}_{\tau \sim \mathbb{P}(\Gamma)} \mathbb{E}_{s \sim \nu_{\tau}^{\pi_\phi}}[D_{KL}(\pi_{*}^{\tau}(\cdot|s) \,\|\, \pi_{\phi}(\cdot|s))]$, where $\pi_{*}^{\tau}$ is the task-specific optimal policy. So the variance is an inherent property of the task distribution, independent of the meta-policy $\pi_{\mathcal{T}}$. Therefore, the optimality bound provides the intuition that, if the variance is small, the meta-policy learned from the task distribution is more helpful for other tasks, and the performance is better. If the variance is very large, the meta-policy learned from the task distribution is not helpful, and it is unrealistic to achieve good performance with only one-step adaptation from any initial policy.

Comment

Question 2: In Theorem 1, it seems that there is no control knob. If $\delta = 4\gamma\alpha A_{\max}/(1-\gamma)$, then the constraint satisfaction bound is loose, as it is a constant factor away. On the other hand, when $\delta = 0$, the policy is a constant factor away. Given that the main goal is to achieve feasibility, this means that the policy is sub-optimal in this case.

Answer: Thanks for the comment. We agree that when $\delta = 0$, the constraint is strictly satisfied, and the adapted policy is $\epsilon$-conservatively optimal (as defined in Case 1). Here are two clarifications for the result.

  1. Consider the left-hand side of the inequality: $J_{\tau}(\mathcal{A}^{s}(\pi_{\phi^*}, \Lambda, \Delta, \tau))$ is the accumulated reward of the policy obtained using a single step of the safe policy adaptation algorithm, i.e., the trajectories for policy adaptation are sampled using a single policy and then the optimization is performed. This is in contrast to the traditional (safe) RL problem, where the policy is optimized over thousands of iterative sampling and optimization steps. Therefore, it is intuitive that, given only one step of sampling, the update needs to be conservative to avoid violating the constraint when the priority is feasibility, and hence the policy is sub-optimal under the feasibility constraint. Importantly, we can continue to optimize the policy using the proposed safe policy optimization approach many times to approach the optimal policy. As shown in Corollary 1, anytime safety is achieved (feasibility holds for each policy optimization step) and the policy is monotonically improved.

  2. The $\epsilon$-conservatively optimal policy could be very close to the optimal policy in certain cases. The optimal policy $\pi_{*}^{\tau}$ for task $\tau$ is defined as $\pi_{*}^{\tau} \triangleq \operatorname{argmax}_{\pi \in \Pi}\ J_{\tau}(\pi)\ \text{s.t.}\ J_{c_i,\tau}(\pi) \leq d_{i,\tau}$. The $\epsilon$-conservatively optimal policy is defined as $\pi_{*,[\epsilon]}^{\tau} \triangleq \operatorname{argmax}_{\pi \in \Pi}\ J_{\tau}(\pi)\ \text{s.t.}\ J_{c_i,\tau}(\pi) \leq d_{i,\tau} - \epsilon$. When the optimal point of the first optimization problem is not close to the constraint set boundary, the solutions to these two optimization problems could be the same. When the optimal point of the first optimization problem is not sensitive to the constraint set boundary, the solutions to these two optimization problems could be close. In these cases, the $\epsilon$-conservatively optimal policy could be very close to the optimal policy.

Question 3: It seems that one needs to solve for the Lagrange multiplier and the optimal policy in line (5) of Algorithm 1, which might incur a large computational cost as it needs to be done for every task sampled in the training phase.

Answer: Thanks for your question. We agree that one needs to solve for the Lagrange multipliers for each task. On the other hand, this is very efficient and does not incur a large computational cost. The derivation of the closed-form solution of Problem (1) makes it efficient to solve through the dual problem. The dual problem in (3) is always convex and low-dimensional; its dimension is the number of constraints $c_i$. Solving it is extremely efficient, e.g., about 0.3 seconds. This is close to the unconstrained meta-RL approach, where one step of policy gradient is required for each task. In contrast, meta-CRPO and meta-CPO require solving a constrained optimization problem, which is more computationally expensive than solving problem (1).
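
As an illustration of why such a dual is cheap, the sketch below solves a low-dimensional convex dual over the Lagrange multipliers with an off-the-shelf optimizer. The `dual_objective` here is a hypothetical stand-in (a log-sum-exp of advantage estimates plus a linear budget term), not the paper's exact dual in (3); only its dimensionality, one variable per cost constraint, mirrors the argument above.

```python
import numpy as np
from scipy.optimize import minimize

def dual_objective(lmbda, adv_r, adv_c, budgets, temperature):
    """Hypothetical stand-in for a closed-form dual like (3): only the Lagrange
    multipliers lambda_i (one per cost constraint) are free, so the problem is
    low-dimensional regardless of the policy network's size."""
    shifted = (adv_r - adv_c @ lmbda) / temperature
    # log-mean-exp of an affine function of lambda plus a linear term -> convex
    return temperature * np.log(np.mean(np.exp(shifted))) + lmbda @ budgets

rng = np.random.default_rng(0)
n_samples, n_constraints = 4096, 2
adv_r = rng.normal(size=n_samples)                   # reward advantage estimates
adv_c = rng.normal(size=(n_samples, n_constraints))  # cost advantage estimates
budgets = np.array([0.1, 0.1])                       # remaining constraint slack

res = minimize(dual_objective, x0=np.ones(n_constraints),
               args=(adv_r, adv_c, budgets, 1.0),
               bounds=[(0.0, None)] * n_constraints)  # lambda_i >= 0
print("approximate dual solution:", res.x)
```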

Question 4: Finally, I am not sure how it will work in the test/inference time. Will it get a few online trajectories from a task, and adapt to it? If so, it can still violate the constraint as the task emerges.

Answer: During the meta-test, when a test task (task 1) is given, the agent uses the safe meta-policy learned from the training tasks to collect trajectories on the new task, and then adapts the policy to that task. When another new task (task 2) is given, we still use the safe meta-policy learned from the training tasks as the initial policy for task 2. We do not continue to use the policy adapted to task 1 to collect data for task 2.

Given that the meta-test tasks share structural similarities with the training tasks in terms of the constraints, the initial policy for the test tasks will satisfy the constraints.

Comment

Thanks for your response. I do not have any further comments. I have raised my score.

Official Review
Rating: 4

In this paper, the authors develop a safe meta-RL framework consisting of two modules: safe policy adaptation and safe meta-policy training. They introduce a novel safe policy adaptation method, which guarantees monotonic improvement, ensures safety, and provides a closed-form solution for a single safe policy adaptation step. For the meta-policy training, the authors impose safety constraints on the meta-policy by leveraging the closed-form expression of the adapted policy and developing a Hessian-free meta-training algorithm. The authors validate their safe meta-RL algorithm on Mujoco safe tasks and collision avoidance tasks.

Strengths and Weaknesses

Strengths:

  1. The method is able to achieve zero-constraint violation for various safe RL tasks
  2. The theoretical analysis seems well developed.

Weaknesses:

  1. It is not very clear how zero constraint violation is achieved if some exploration is done by the policy adaptation module.
  2. The dependence of the regret guarantees for the meta-safe RL algorithm on the number of tasks is not shown, as it is in the Meta-CRPO paper.

Questions

  1. I have one clarification question for the authors about the one-step policy adaptation module. If state-action data points need to be collected to evaluate $A_\tau^{\pi_\phi}$, there could be constraint violations in task $\tau$ while collecting those state-action pairs. Then, how is zero constraint violation guaranteed?

  2. In Theorem 1, eq. (6), the upper bound is finite and will only be zero when $\delta_{c_i}$ is zero. How realistic is this assumption, as zero constraint violation depends on it?

  3. Is there a reason that the regret guarantee does not depend on the number of tasks in Theorem 1? I was expecting the total number of tasks to appear in the analysis and have a positive effect on the regret guarantee.

  4. How is the proposed method different from the zero constraint violation methods in the following papers? I think it would be good to add a discussion to compare these methods. [1] Liu, Tao, et al. "Learning policies with zero or bounded constraint violation for constrained MDPs." Advances in Neural Information Processing Systems 34 (2021): 17183-17193. [2] Xu, Siyuan, and Minghui Zhu. "Meta-Reinforcement Learning with Universal Policy Adaptation: Provable Near-Optimality under All-Task Optimum Comparator." Neural Information Processing Systems. 2024.

Limitations

  1. I think ensuring zero constraint violation only happens under some strict assumptions. The authors should clarify that.
  2. Dependence on the number of tasks for the regret guarantees is not shown.

Formatting Issues

None.

Author Response

We are grateful for the time and effort invested in evaluating our manuscript, as well as for all the suggestions and reference recommendations, which have made our manuscript a stronger and more valuable contribution.

Weakness 1: It is not very clear how zero constraint violation is achieved if some exploration is done by the policy adaptation module.

Question 1: I have one clarification question for the authors about the one-step policy adaptation module. If state-action data points need to be collected to evaluate $A_\tau^{\pi_\phi}$, there could be constraint violations in task $\tau$ while collecting those state-action pairs. Then, how is zero constraint violation guaranteed?

Answer: In the CMDP setting of this paper, constraint satisfaction is defined as the expected accumulated cost being below a threshold, i.e., $J_{c_i,\tau}(\pi) = \mathbb{E}\left[\sum_t c_i\right] \leq d_{i,\tau}$. The advantage function $A_\tau^{\pi_\phi}$ can be evaluated by sampling many trajectories with the policy $\pi_\phi$, but the policy $\pi_\phi$ still satisfies the constraint and is defined as safe as long as the expected accumulated cost, i.e., the average cost over all sampled trajectories, is below the threshold.
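
A minimal sketch of this expected-cost notion of safety (not the authors' code): the quantity that must stay below the threshold is a Monte Carlo average over trajectories, so individual trajectories may incur cost without the policy counting as unsafe. The `rollout` helper and its interface are hypothetical.

```python
import numpy as np

def estimate_expected_cost(rollout, policy, n_traj=40, gamma=0.99):
    """Monte Carlo estimate of J_c(pi) = E[sum_t gamma^t c_t].

    `rollout(policy)` is a hypothetical helper that returns the list of
    per-step costs of one trajectory sampled with `policy`.
    """
    returns = []
    for _ in range(n_traj):
        costs = np.asarray(rollout(policy), dtype=float)  # [c_0, c_1, ...]
        discounts = gamma ** np.arange(len(costs))
        returns.append(float(discounts @ costs))
    return float(np.mean(returns))

# Safety in this CMDP sense is a statement about the average over trajectories:
#   estimate_expected_cost(rollout, pi_phi) <= d_i  ->  pi_phi counts as safe,
# even if a few individual trajectories incur higher cost.
```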

In some papers [48][56][37], safety is defined as the agent never visiting certain states. In that case, the advantage function $A_\tau^{\pi_\phi}$ is hard to evaluate under the safety definition in those papers.

Weakness 2: The dependence of regret guarantees for the meta-safe RL algorithm is not shown with respect to the number of tasks as shown in the Meta-CRPO paper.

Question 3: Is there a reason that the regret guarantee does not depend on the number of tasks in Theorem 1? I was expecting the total number of tasks to appear in the analysis and have a positive effect on the regret guarantee.

Limitation 2: Dependence on the number of tasks for the regret guarantees is not shown.

Answer: The theoretical analysis in this paper does not derive the regret analysis for online meta-policy training, but derives the optimality analysis for the meta-test, i.e., the optimality analysis and the anytime guarantee during the safe policy adaptation.

Our work formulates safe meta-RL training as an offline optimization problem, where the meta-policy is learned from tasks sampled from a fixed task distribution, and focuses on the optimality analysis of safe policy adaptation from the learned meta-policy. We assume we can obtain the solution of the meta-training, no matter how many tasks are used during meta-training. Since the optimization problem is offline, the theoretical analysis does not include the regret of an online optimization problem.

In contrast, Meta-CRPO addresses online meta-RL, where tasks are revealed sequentially. So they provide the regret bound that depends on the number of revealed tasks.

Question 2: In Theorem 1, eq. (6), the upper bound is finite and will only be zero when $\delta_{c_i}$ is zero. How realistic is this assumption, as zero constraint violation depends on it?

Answer: We do not directly assume zero constraint violation. Instead, based on Corollary 1, the proposed method is designed to and has been shown to achieve the zero constraint violation while achieving good optimality. Theorem 1 provides the optimality analysis under different tolerance settings of the constraint violation, which includes the special case that the zero constraint violation is strictly required.

Specifically, the parameter $\delta_{c_i}$ in Theorem 1 is a user-defined tolerance hyperparameter that controls the maximum allowed violation of constraint $c_i$. Our optimality bound in (6) holds for any $\delta_{c_i} \geq 0$. When strict constraint satisfaction is required, we can set $\delta_{c_i} = 0$, and the optimality analysis naturally reduces to the zero-violation case. Otherwise, for a practical tolerance $\delta_{c_i} > 0$, Theorem 1 quantifies the trade-off between optimality and the small, controlled violation bounded by $\delta_{c_i}$. Therefore, Theorem 1 covers both the scenario where constraint satisfaction is strict and the more common settings with a positive tolerance.

Question 4: How is the proposed method different from the zero constraint violation methods in the following papers? I think it would be good to add a discussion to compare these methods. [1] Liu, Tao, et al. "Learning policies with zero or bounded constraint violation for constrained mdps." Advances in Neural Information Processing Systems 34 (2021): 17183-17193. [2] Xu, Siyuan, and Minghui Zhu. "Meta-Reinforcement Learning with Universal Policy Adaptation: Provable Near-Optimality under All-Task Optimum Comparator." Neural Information Processing Systems. 2024.

Answer: Thanks for the reference recommendations. Paper [1] considers an online policy optimization problem and derives a sub-linear regret bound on the constraint violation. That paper is not in the safe meta-RL setting and does not have a meta-training process. In the setting of paper [1], it is theoretically unachievable to attain anytime safety, which is defined as the constraint violation being no greater than 0 at every step. Therefore, although paper [1] achieves a very strong constraint satisfaction metric, it is different from the anytime safety in our paper. Paper [2] considers an unconstrained policy optimization problem, and no constraint satisfaction guarantee is provided. In our paper, we achieve anytime safety in the safe meta-RL setting by combining the proposed safe meta-training and safe policy adaptation.

We will cite these papers and include the above discussion in our modified version.

Limitation 1: I think ensuring zero constraint violation only happens under some strict assumptions. The authors should clarify that.

Answer: The zero constraint violation is developed based on (i) our CMDP constraint setting (lines 91-96) and (ii) Assumption 1 (line 274). The assumption for the theoretical analysis only includes Assumption 1, which is not overly strict.

(i) As mentioned in the answer to Weakness 1, this paper considers constraint violation in the CMDP setting, i.e., constraint satisfaction is defined as the expected accumulated cost being below a threshold, $J_{c_i,\tau}(\pi) = \mathbb{E}\left[\sum_t c_i\right] \leq d_{i,\tau}$. In this setting, the proposed method achieves zero constraint violation in this expected sense.

(ii) Assumption 1 requires the existence of at least one policy that satisfies the constraints across all tasks. It only requires that, within the task distribution, the tasks' feasible policy sets are not mutually exclusive with respect to constraint satisfaction. This assumption is not overly conservative and is actually a minimal requirement if one wants to achieve anytime safety. If the assumption is violated, then for any initial policy, anytime safety would already be violated at the beginning of policy adaptation.

Official Review
Rating: 4

The paper studied safe meta-reinforcement learning, with the objective of ensuring anytime safety during the meta-test phase, such that safety constraints are satisfied at every step of the adaptation process. The paper proposes a safe meta-RL framework consisting of two modules: safe policy adaptation and safe policy training. The paper provided theoretical guarantees, demonstrating the possible trade-off between optimality and anytime safety. The experimental results justify the theoretical guarantee.

Strengths and Weaknesses

Strengths

  • The considered problem is meaningful, especially in safety-critical applications where violations cannot be tolerated even during the learning phase. The paper successfully proposes a framework that is both theoretically grounded and practically relevant. The experiments are promising to support the theory.

Weaknesses

  • The algorithm and its performance heavily depend on the choice of hyperparameters such as $\lambda$, $\alpha$, and $\delta$, which are difficult (even impossible) to determine in practice.

  • Theorem 1 is a bit confusing, as it establishes a trade-off between constraint violation and optimality. This seems to contradict the claim of zero violation under anytime safety. It would be better to discuss optimality in the case of anytime safety, $\delta = 0$. Theorem 1 seems only to provide the distance between the regularized objective and the epsilon-optimal policy, and it would be more convincing to provide a performance analysis with respect to the original objective. Theorem 1 is also not ideal, as the optimality gap could be large. Besides, it would be helpful if the paper elaborated on how $R$ scales with the size and structure of the state and action spaces. Moreover, no sample complexity analysis is provided.

  • Though the safe policy adaptation step admits a closed-form solution, solving the dual problem for the Lagrange multipliers and computing the meta-gradient still involve non-trivial computations. The actual computational advantage over prior methods may not be as significant as claimed.

  • The anytime safety seems straightforward, given that the initial policy is assumed to be safe based on the formulation in (1). It is unclear how it can be adapted to the setting where the initial policy is also unsafe. The formulation (1) and the algorithm CRPO both seem to come from previous CMDP papers. The paper would benefit from better emphasizing its unique contributions beyond this integration.

Questions

Please see the weaknesses above.

Additional questions:

  1. The key function $f$ used in the softmax policy parameterization is not clearly defined or explained.

Limitations

No

Final Justification

The response addressed my concerns, and I increased my ratings.

Formatting Issues

No

Author Response

We are grateful for the time and effort invested in evaluating our manuscript and for all the suggestions, which have made our manuscript a better and stronger contribution.

Weakness 1: The algorithm and its performance heavily depend on the choice of hyperparameters such as $\lambda$, $\alpha$, and $\delta$; these are difficult to determine in practice.

Answer. Our method only introduces two practical hyperparameters: $\lambda$ and $\delta$. The constant $\alpha$ appears only in the theoretical analysis as a problem-dependent parameter and is not tuned in practice.

For the constraint violation threshold $\delta$, we can simply set $\delta = 0$ to guarantee anytime safety, which requires no tuning.

The hyperparameter $\lambda$ controls the trust-region size. We select $\lambda$ so that the KL divergence between the initial policy $\pi$ and the adapted policy $\pi'$ from problem (1) satisfies $\mathrm{KL}(\pi \,\|\, \pi') \approx 0.03$. Practically, we adjust $\lambda$ using a simple backtracking rule: if $\mathrm{KL}(\pi \,\|\, \pi') > 0.03$, we increase $\lambda$ (e.g., multiply by 2); if $\mathrm{KL}(\pi \,\|\, \pi') < 0.03$, we decrease $\lambda$ (e.g., divide by 2), until $\mathrm{KL}(\pi \,\|\, \pi') \in [0.02, 0.05]$. This procedure is standard and efficient in existing policy optimization algorithms, such as PPO and TRPO.
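
A minimal sketch of this backtracking rule, assuming hypothetical helpers `adapt(pi, lam)` (one safe adaptation step under trust-region weight `lam`) and `kl_div(p, q)` (a KL estimator); it illustrates the procedure described above rather than the authors' implementation:

```python
def tune_lambda(adapt, kl_div, pi_init, lam=1.0, max_iters=20):
    """Backtracking adjustment of the trust-region weight lambda.

    `adapt(pi, lam)` and `kl_div(p, q)` are hypothetical helpers; the
    thresholds follow the rule described in the text above.
    """
    pi_new = adapt(pi_init, lam)
    for _ in range(max_iters):
        kl = kl_div(pi_init, pi_new)
        if 0.02 <= kl <= 0.05:   # acceptable trust-region size: accept the step
            break
        if kl > 0.03:            # step too large -> strengthen the KL penalty
            lam *= 2.0
        else:                    # step too small -> relax the KL penalty
            lam /= 2.0
        pi_new = adapt(pi_init, lam)
    return pi_new, lam
```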

Note that existing methods for meta-RL and safe meta-RL, such as MAML and meta-CPO, also require tuning a similar hyperparameter, i.e., the stepsize for task-specific policy adaptation, which is analogous to the tuning of λ\lambda in our method. Therefore, our approach does not introduce any additional tuning complexity compared to these existing methods.

Weakness 2:

Weakness 2.1. Theorem 1 is a bit confusing, as it establishes the trade-off between constraint violation and optimality. This seems to contradict the claim of zero violation under anytime safety. It would be better to discuss optimality in the case of anytime safety, $\delta = 0$.

Answer 2.1: Theorem 1 characterizes the optimality gap under a general constraint violation threshold $\delta \geq 0$. For the special case $\delta = 0$, the theorem still provides an optimality analysis while guaranteeing zero constraint violation, consistent with the anytime safety property. The theorem simply indicates that allowing a nonzero $\delta$ can lead to a smaller optimality gap. It does not imply that near-optimality cannot be guaranteed when $\delta = 0$.

Weakness 2.2. Theorem 1 seems only to provide the distance between the regularized objective and the epsilon-optimal policy, and it would be more convincing to provide a performance analysis with respect to the original objective.

Answer 2.2: Theorem 1 directly bounds the distance between the original objective of safe policy optimization, i.e., the expected accumulated reward $\mathbb{E}_{\tau \sim \mathbb{P}(\Gamma)}[J_\tau(\mathcal{A}^s(\pi_{\phi^*}, \Lambda, \Delta, \tau))]$ of the adapted policy $\mathcal{A}^s(\pi_{\phi^*}, \Lambda, \Delta, \tau)$, and that of the epsilon-optimal policy.

Weakness 2.3. Theorem 1 is also not ideal as the optimality gap could be large.

Answer 2.3: In the manuscript, Appendix G discusses the tightness of the derived lower bound in Theorem 1. In summary, as $\mathcal{V}ar(\mathbb{P}(\Gamma))$ measures the minimal average KL divergence between the task-specific optimal policies and a meta-policy, it can be small, especially in large state spaces; for example, we observe $\mathcal{V}ar(\mathbb{P}(\Gamma)) \approx \frac{1}{4}\times 0.05$ in our experiments. With $\gamma = 0.99$ and a reduced $A^{\max} = 0.05$, the bound $\frac{4\gamma\alpha A^{\max}}{(1-\gamma)^2}\mathcal{V}ar^\epsilon(\mathbb{P}(\Gamma))$ is around $25$, which is about 25% of the total reward/cost (about 100), indicating reasonable tightness. It is standard in RL and safe RL to derive a lower bound of an order similar to the optimality bound $\frac{4\gamma\alpha A^{\max}}{(1-\gamma)^2}\mathcal{V}ar^\epsilon(\mathbb{P}(\Gamma))$ in this paper. For example, in TRPO and CPO, the lower bounds also include the term $\frac{A^{\max}}{(1-\gamma)^2}D_{KL}$. Moreover, for conciseness and broad applicability, we have to derive a general optimality bound, which makes the bound looser. When the conditions used and the target problem are specified, a tighter optimality bound can be derived. For example, when a large task distribution is specified, i.e., $\mathcal{V}ar(\mathbb{P}(\Gamma))$ is large, we can also make some inequalities (such as those in lines 917-923) tighter. However, when the target problem must be specified, the generality of the optimality bound is lost and it becomes harder to understand and less intuitive.

Weakness 2.4. Besides, it would be helpful if the paper elaborated on how $R$ scales with the size and structure of the state and action spaces.

Answer 2.4: The optimality bound is independent of the dimensions of the state and action spaces. Instead, the term $\mathcal{V}ar(\mathbb{P}(\Gamma))$ reflects the variability of the task-specific optimal policy parameters within the task distribution.

Weakness 2.5. Moreover, no sample complexity analysis is provided.

Answer 2.5: In each meta-training iteration, we sample 10 tasks from the task distribution. In each task, we sample 40 trajectories, and the horizon of each sampled trajectory is 200. Therefore, for each meta-training iteration, the number of sampled state-action pairs is 80k. We train the model from 300 to 500 iterations in the meta-training.
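
The per-iteration sample budget follows directly from these numbers (a trivial check, using only the figures quoted above):

```python
tasks_per_iter, traj_per_task, horizon = 10, 40, 200
pairs_per_iter = tasks_per_iter * traj_per_task * horizon
print(pairs_per_iter)                              # 80000 state-action pairs/iter
print(300 * pairs_per_iter, 500 * pairs_per_iter)  # 24,000,000 to 40,000,000 total
```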

Weakness 3: Though the safe policy adaptation step admits a closed-form solution, solving the dual problem for the Lagrange multipliers and computing the meta-gradient still involves non-trivial computations. The actual computational advantage over prior methods may not be as significant as claimed.

Answer: The computational complexity of the proposed method, including solving the dual problem for the Lagrange multipliers and computing the meta-gradient, is significantly lower than that of the baseline methods in both the meta-test and meta-training stages.

In the meta-test, the derivation of the closed-form solution of Problem (1) makes it efficient to solve through the dual problem. The dual problem in (3) is always convex and low-dimensional; its dimension is the number of constraints $c_i$. Solving it is extremely efficient, e.g., about 0.1 seconds. In contrast, meta-CRPO and meta-CPO require solving a constrained optimization problem, which is more computationally expensive than solving problem (1).

In the meta-training, the closed-form solution of the policy adaptation (1) is used to derive a Hessian-free meta-gradient, which reduces the computational complexity of the proposed algorithm to that of single-level optimization; the cost is on the scale of simple policy gradients. In contrast, meta-CRPO requires that the task-specific optimal policies have already been learned, which is impractical when the number of training tasks is large. Meta-CPO uses bi-level optimization to learn the meta-policy, which requires computing the Hessian and its inverse; the Hessian inverse for a large neural network is significantly computationally expensive.

In Section 6, we conduct experiments on seven scenarios, including navigation tasks with collision avoidance and locomotion tasks, to verify the computational efficiency of the proposed algorithms. Figures 1 and 5 show that our algorithm is much more efficient than the baselines, saving about 70% of the computation time for meta-training and 50% for meta-testing compared to meta-CPO.

Weakness 4: The anytime safety seems straightforward, given that the initial policy is assumed to be safe based on the formulation in (1). It is unclear how it can be adapted to the setting where the initial policy is also unsafe.

Answer: In the proposed safe meta-RL algorithm, we aim to learn a meta-policy from the training tasks, such that the policy adaptation from the meta-policy to the new task can achieve anytime safety. To achieve anytime safety, the meta-policy, which is the starting policy for the meta-test, is designed to be safe by being learned from the optimization problem (2). In Section 3.2, the optimization problem (2) is solved for the meta-training stage. The constraints in the optimization problem of (2) ensure that the solved meta-policy is safe, i.e., the starting policy for the meta-test is safe. To solve the optimization problem (2), in Algorithm 2, we prioritize the minimization of the constraint violation of the meta-policy (shown in lines 10-12 in Algorithm 2). Therefore, if there exists a safe policy for all tasks in the task distribution, the meta policy solved by (2) is always safe. As shown in the last column of Figures 1 and 4, the cost of the initial policy is always below the constraint threshold for all scenarios.

Moreover, achieving anytime safety is non-trivial even when the initial policy is safe. To ensure anytime safety, the safe policy adaptation in (1) should be safe for any safe initial policy. To achieve this, we first derive Lemma 1 to upper bound the accumulated cost of the optimized policy in terms of that of the initial policy, and then design the constraint set of (1) accordingly. Meanwhile, the closed-form solution of (1) is also required for the safe meta-RL algorithm. The proposed (1) is the first algorithm that simultaneously offers these two key properties.

Comment

Thanks again for the time and effort invested in reviewing our manuscript. I would like to take this opportunity to offer a complementary answer for completeness.

Weakness 5: The formulation (1) and the algorithm CRPO both seem to come from previous CMDP papers. The paper would benefit from better emphasizing its unique contributions beyond this integration.

Answer: We propose the formulation (1) in this paper. The algorithm CRPO is not our algorithm and is a baseline method from [54] that we compare with.

This paper designs a new policy adaptation algorithm via problem (1) and solves it by the dual method. The proposed algorithm provides (i) a safety guarantee for a single policy optimization step, (ii) a closed-form solution, and (iii) monotonic improvement. The proposed policy adaptation algorithm is the first that simultaneously offers the two key properties (i) and (ii).

The previous methods cannot hold the two properties simultaneously. In particular, they do not have a closed-form solution or a safety guarantee for a single policy optimization step. For example, CRPO uses the primal-dual method to solve safe RL, which cannot guarantee safety for a single policy optimization step. CPO does not have a closed-form solution and requires solving a constrained optimization problem, which is significantly more computationally expensive than solving problem (1). Note that these two key properties are crucial for the overall design of a safe meta-RL algorithm, for the reasons explained below.

The advantages of the formulation (1). The proposed formulation (1) enables the proposed safe meta-RL method to offer several key advantages over existing safe meta-RL methods, including meta-CPO and meta-CRPO. (i) Anytime safety guarantee during the meta-test: the safe meta-policy training produces a safe initial meta-policy by imposing the safety constraint. The safe policy adaptation imposes a constraint on the upper bound of the total cost, and thus is guaranteed to produce a safe policy at each iteration when the initial policy is safe. By integrating these two modules, anytime safety is achieved. In contrast, meta-CRPO employs CRPO, a primal-dual-based method, for policy adaptation, which has no safety guarantee for a single policy adaptation step; thus, anytime safety cannot be guaranteed in meta-CRPO. (ii) High computational efficiency in both the meta-test and meta-training stages: in the meta-test, the closed-form solution of Problem (1) makes it efficient to solve. In contrast, meta-CRPO and meta-CPO require solving a constrained optimization problem, which is more computationally expensive than solving problem (1). In the meta-training, the closed-form solution of the policy adaptation (1) is used to derive a Hessian-free meta-gradient, which reduces the computational complexity of the proposed algorithm to that of single-level optimization. In contrast, meta-CRPO requires that the task-specific optimal policies have already been learned, which is impractical when the number of training tasks is large. Meta-CPO uses bi-level optimization to learn the meta-policy, which requires computing the Hessian and its inverse. We conduct experiments on seven scenarios, including navigation tasks with collision avoidance and locomotion tasks, to verify these advantages of the proposed algorithms.

Question: The key function $f$ used in the softmax policy parameterization is not clearly defined or explained.

Answer: The function $f_\theta$ defined in line 99 could be any approximation function, such as a neural network with parameters $\theta$.

Comment

Thank you for the detailed response, which addressed most of my concerns. I will increase my rating and I have a few follow-ups:

  1. I am slightly confused about the results that are independent of the size/dimension of state and action spaces, which is a bit counterintuitive to me. Can you explain the underlying reason? Consider an extreme setting where only one task exists (reduced to the classical CMDP), the performance bounds are supposed to be related to the size/dimension of the state and action space? Similar for the examples of two/multiple independent tasks.

  2. For the sample complexity analysis, I mean: what is the number of data samples/interactions (specifically on lines 2 and 6 in Algorithm 1) needed to learn an epsilon-optimal policy? If I understand correctly, the paper assumes exact value functions; it would be more convincing to consider the learning errors.

Comment

Question 2. For the sample complexity analysis, I mean: what is the number of data samples/interactions (specifically on lines 2 and 6 in Algorithm 1) needed to learn an epsilon-optimal policy? If I understand correctly, the paper assumes exact value functions; it would be more convincing to consider the learning errors.

Answer: Thanks for the question. We first explain the setting of our problem, then explain the error analysis in our theoretical analysis, and finally show the sampling complexity in our practical experiment.

In the theoretical analysis of the safe meta-RL problem, the meta-policy $\pi_\phi$ is adapted to the new task using only one policy optimization step (problem (1)), which means the agent only obtains the value function of the initial policy $\pi_\phi$ and minimizes the loss, and is not allowed to explore the environment further to obtain more data on the adapted policy. Therefore, in Algorithm 1, we are not obtaining the epsilon-optimal policy; the epsilon-optimal policy is actually inaccessible under the above setting. Instead, we aim to study whether, under the above setting, if a good meta-policy can be learned during the meta-training, a nearly optimal policy for a new task can be obtained using only one optimization step during the meta-test. Therefore, during the meta-test, we collect data and assume the value function can be obtained and the optimization problem (1) can be solved. As mentioned in Proposition 2, the dual problem (3) used to solve (1) is always convex and low-dimensional, so the optimization error decreases exponentially fast. Therefore, it is reasonable to assume the optimization error approaches 0.

In contrast, in single-task RL problems, one can iteratively optimize the policy and collect a new value function of the optimized policy hundreds of times. In this case, the sample complexity analysis for evaluating the value function and the learning error is very important.

In our practical experiments, in each meta-training iteration we sample 10 tasks from the task distribution, i.e., the number of tasks in line 2 of Algorithm 1 is 10. In each task, we sample 40 trajectories, and the horizon of each sampled trajectory is 200, i.e., the number of state-action pairs in line 6 is 8000.

Comment

Thank you for the detailed response. I have increased my rating. It would be better to add these details in the revision.

Comment

We sincerely appreciate your time and effort in reviewing our paper and thank you for recognizing the contributions of this work. We will include all the additional clarifications in the new version of the paper.

Comment

Thanks again for the time and effort invested in reviewing our manuscript. Thank you for your follow-up question. We appreciate the opportunity to clarify further.

Question 1. I am slightly confused about the results that are independent of the size/dimension of state and action spaces, which is a bit counterintuitive to me. Can you explain the underlying reason? Consider an extreme setting where only one task exists (reduced to the classical CMDP), the performance bounds are supposed to be related to the size/dimension of the state and action space? Similar for the examples of two/multiple independent tasks.

Answer: Thank you for your intuitive question. We first introduce the optimality bound of our theoretical result in Theorem 1 and then explain why the bound is not directly related to the size/dimension of the state and action spaces.

In the safe-meta RL problem, the meta-policy is learned from a task distribution (during the meta-training), and then adapted to new tasks sampled from the same distribution (during the meta-test). In Theorem 1, we consider the optimality of the policy, which is adapted from the meta-policy for a new task (during the meta-test). The meta-policy is optimized using the training task distribution. Therefore, the distance between the new task and the training task distribution is the key factor that impacts the optimality bound in Theorem 1. In this paper, we derive the term Var(P(Γ))\mathcal{V}ar(\mathbb{P}(\Gamma)) to reflect the variance of the task distribution, which is defined by the mean square distance between task-specific optimal policy parameters within the task distribution. This term, similar to other problem settings like state-action space dimension, is an inherent characteristic of the training tasks and is fixed during the problem-solving. Theorem 1 uses the term Var(P(Γ))\mathcal{V}ar(\mathbb{P}(\Gamma)) to quantify the optimality bound, which is intuitive. Specifically, the optimality bound is improved when the variance of the task distribution Var(P(Γ))\mathcal{V}ar(\mathbb{P}(\Gamma)) is reduced, as πτ\pi^{\tau} is approaching the task-specific optimal policy π,[ϵ]τ\pi_{*,[\epsilon]}^\tau. It corresponds to the intuition of meta-learning, which is that, when the variance of a task distribution is smaller, the tasks are more similar, and then the experience learned from the task distribution works better for new tasks sampled from the task distribution.

In the general case, the term $\mathcal{V}ar(\mathbb{P}(\Gamma))$ may be related to the dimensions of the state-action space. For example, when the dimensions of the state-action space are large, the distance between tasks could also become large. However, when the tasks are very similar and the task distance is small, e.g., a robot needs to move at different speeds, the optimality bound can still be small even if the dimension of the state-action space is very large. Therefore, the optimality bound is not directly determined by the dimension of the state-action space.

Take the extreme example where only one task exists. During the meta-training, since only one task exists, the policy is already optimized for this specific task, so there is no need for adaptation during the meta-test. Then, the optimality bound of the meta-test is 0. This result is consistent with our theoretical results, i.e., the variance of the training task distribution (containing a single task) is 0.

Official Review
Rating: 5

The paper studies the safe meta-reinforcement learning (safe meta-RL) problem where anytime safety is ensured during the meta-test, which means that all policies output during the meta-test phase satisfy constraints. It proposes an algorithm similar to meta-CPO, but importantly the KL divergence constraint for policy updates has a reverse order, which leads to a closed-form policy update rule and can be efficiently implemented. As in meta-CPO and CPO, the proposed algorithm has monotonic policy improvement guarantee and anytime safe guarantee.

Strengths and Weaknesses

Strength

Consistent performance improvement over previous algorithms, particularly meta-CPO, which is the closest to the proposed algorithm in my opinion. The writing is clear and easy to understand.

Weakness

Changing the order of KL has a limited novelty. However, this change introduces computational advantages and no approximation error in contrast to meta-CPO, which is a highly related algorithm.

The theoretical analysis does not consider sampling error, which would be a serious issue in stochastic environments.

The proposed idea and the closed form of the policy updates are related to MPO (Maximum a Posteriori Policy Optimization by Abdolmaleki et al., 2018). Also, some existing research suggests the benefit of using the KL order used in this paper, stating that it confers stability to noise and function approximation error; for example, Theoretical Analysis of Efficiency and Robustness of Softmax and Gap-Increasing Operators in Reinforcement Learning by Kozuno et al. (2019) and Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning by Vieillard et al. (2020). I recommend that the authors cite those papers.

Questions

It would be nice if the authors provide experimental results in highly stochastic environments. Currently all experiments seem to be done in deterministic environments.

Limitations

The CMDP formulation used in the paper has been pointed out to be insufficient for safety-critical problems, and there are some efforts to fix it, e.g., Anytime-Constrained Reinforcement Learning by Jeremy McMahan and Xiaojin Zhu (2023). It would be nice to mention this limitation.

Final Justification

While the theoretical analysis and empirical results are solid, the core idea of using KL regularization in meta-RL is already proposed in meta-CPO. The algorithm employs a different order of KL than the one used in meta-CPO, and this order of KL has been already used in, for example, MPO. From these, I set my score to 5.

Formatting Issues

No

Author Response

We are grateful for the time and effort invested in evaluating our manuscript, as well as for all the suggestions and reference recommendations, which have made our manuscript a stronger and more valuable contribution.

Weakness 1: Changing the order of KL has a limited novelty.

Answer: The main contribution of the paper in terms of the algorithm design in (1) is not changing the order of KL. Instead, the algorithm design (i) guarantees monotonic improvement, (ii) ensures safety, and simultaneously (iii) provides a closed-form solution for a single safe policy adaptation step, which is applied to the meta-policy training and simplifies its computation by leveraging the closed-form expression of the adapted policy and developing a Hessian-free meta-training algorithm.

Specifically, the proposed algorithms offer three key advantages over existing safe meta-RL methods. (i) Superior optimality. Our safe meta-policy training algorithm in (2) maximizes the expected accumulated reward of the policies adapted from the meta-policy. In contrast, the meta-training of meta-CRPO learns the meta-policy by minimizing the distance between the meta-policy and the task-specific policy, which considers neither the optimality nor the safety of the task-specific policies adapted from the meta-policy. (ii) Anytime safety guarantee during the meta-test. The safe meta-policy training produces a safe initial meta-policy by imposing the safety constraint. The safe policy adaptation imposes a constraint on the upper bound of the total cost, and thus is guaranteed to produce a safe policy at every iteration when the initial policy is safe. By integrating these two modules, anytime safety is achieved. In contrast, meta-CRPO employs CRPO, a primal-dual-based method, for policy adaptation, which has no safety guarantee for a single policy adaptation step; thus, anytime safety cannot be guaranteed in meta-CRPO. (iii) High computational efficiency in both the meta-test and meta-training stages. In the meta-test, the closed-form solution of Problem (1) makes it efficient to solve. In contrast, meta-CRPO and meta-CPO require solving a constrained optimization problem, which is more computationally expensive than solving Problem (1). In meta-training, the closed-form solution of the policy adaptation (1) is used to derive a Hessian-free meta-gradient, which reduces the computational complexity of the proposed algorithm to approach that of single-level optimization. In contrast, meta-CRPO requires that the task-specific optimal policies have already been learned, which is impractical when the number of training tasks is large, and meta-CPO uses bi-level optimization to learn the meta-policy, which requires computing the Hessian and its inverse. In Section 6 (Figures 1, 2, 4, and 5), we conduct experiments on seven scenarios, including navigation tasks with collision avoidance and locomotion tasks, to verify these advantages of the proposed algorithms.

The novelties of the techniques. In order to derive the algorithm in (1), which (i) guarantees monotonic improvement, (ii) ensures safety, and simultaneously (iii) admits a closed-form solution, we apply the following techniques. First, we derive the closed-form solution of the safe policy adaptation by applying a KKT analysis to the softmax policy. Second, we derive an upper bound on the constraint violation by analyzing the accumulated cost gap between different policies. The safe policy adaptation imposes a constraint on this upper bound of the total cost, and thus, by the majorization-minimization theorem, is guaranteed to produce a safe policy at every iteration. Third, to derive a lower bound on the expected accumulated reward of the adapted policies, we define the variance of the task distribution and then obtain the bound by constructing a virtual sub-optimal meta-policy and quantifying the expected accumulated reward under this sub-optimal meta-policy.
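For intuition on the first technique, a schematic version of the KKT argument at a single state with a discrete action set (ignoring the trust-region and feasibility details handled in the paper) is

$ \max_{\pi'(\cdot|s)} \sum_a \pi'(a|s)\, A_R(s,a) \quad \text{s.t.} \quad \sum_a \pi'(a|s)\, A_C(s,a) \le b, \quad \mathrm{KL}\big(\pi'(\cdot|s)\,\|\,\pi(\cdot|s)\big) \le \delta, $

whose Lagrangian stationarity condition yields the exponentially tilted softmax form

$ \pi'(a|s) \;\propto\; \pi(a|s)\, \exp\!\Big(\tfrac{1}{\lambda}\big(A_R(s,a) - \nu\, A_C(s,a)\big)\Big), $

with $\lambda, \nu \ge 0$ the KKT multipliers. This is a sketch rather than the exact update rule derived in the paper.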

The theoretical contributions and novelties of the proposed method. This paper is the first to provide a comprehensive theoretical analysis of near-optimality and anytime safety guarantees for safe meta-RL. First, we establish the theoretical basis of the algorithm design that guarantees anytime safety, i.e., zero constraint violation for any policy used for exploration. Second, we derive a lower bound on the expected accumulated reward of the adapted policies relative to that of the task-specific optimal policies, which shows the near optimality of the proposed safe meta-RL framework. Finally, we demonstrate a trade-off between the optimality bound and the constraint violation as the allowable constraint violation varies. In meta-CRPO, an optimality bound is provided, but anytime safety is not guaranteed. Meta-CPO provides neither an optimality bound nor anytime safety.

Weakness 2: Theoretical analysis does not consider sample error, which would be a serious issue in stochastic environments.

Answer: Theoretically, we assume that the constrained policy optimization problem in (1) can be solved exactly, ensuring both monotonic policy improvement and zero constraint violations. This assumption is standard in the literature; CPO and TRPO, for example, also neglect sampling errors when establishing theoretical guarantees, since accounting for such errors would render the analysis intractable, especially for the anytime safety property. Under this widely accepted assumption, the anytime safety property demonstrated in our paper, which represents one of the most stringent safety criteria, is theoretically achievable.

In the experiments, with a sample size of 2000, the sampling error is sufficiently small to have a negligible impact on performance. In our Swimmer experiment, we additionally add large Gaussian noise to the state transitions (with variance up to 0.5), so the environment is highly stochastic. As shown in Figure 4, our experiments on this highly stochastic Swimmer environment demonstrate that the constraints are consistently satisfied, confirming the robustness of our approach in realistic scenarios.
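For reference, the noise injection can be reproduced with a simple wrapper of the following form (an illustrative sketch, not the code used in our experiments; `env` stands for any environment exposing `reset()`/`step(action)` that returns a NumPy state vector):

```python
import numpy as np

class NoisyTransitionWrapper:
    """Adds zero-mean Gaussian noise to the successor state returned by step().

    Illustrative only: `sigma2` is the noise variance (0.5 mimics the Swimmer
    setup described above).
    """

    def __init__(self, env, sigma2=0.5, seed=0):
        self.env = env
        self.std = np.sqrt(sigma2)
        self.rng = np.random.default_rng(seed)

    def reset(self):
        return self.env.reset()

    def step(self, action):
        state, reward, done, info = self.env.step(action)
        noisy_state = state + self.rng.normal(0.0, self.std, size=np.shape(state))
        return noisy_state, reward, done, info
```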

Weakness 3: The proposed idea and the closed form of policy updates are related to MPO (Abdolmaleki et al. 2018). Also, some existing research suggests the benefit of using the KL order used in this paper, stating that it confers stability to noise and function approximation error. For example, Theoretical Analysis of Kozuno et al. (2019) and Vieillard et al. (2020). I recommend the authors to cite those papers.

Answer: Thanks for pointing out the connection with MPO and for suggesting these valuable references. MPO and [30] (Liu et al., 2019) derived the closed-form solution for unconstrained policy optimization. In this paper, we extend the solution to the constrained policy optimization problem, simultaneously guaranteeing a closed-form solution, policy improvement, and strict constraint satisfaction. The proposed algorithm is the first to provide both a safety guarantee and a closed-form solution.
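As a simplified numerical illustration of this family of closed-form updates (not the exact rule in the paper; `lam` and `nu` stand in for the KL and cost multipliers, which the paper obtains from the KKT analysis rather than fixing by hand):

```python
import numpy as np

def closed_form_safe_update(pi, adv_r, adv_c, lam, nu):
    """Exponentiated-advantage reweighting of a discrete policy at one state.

    pi:     current action probabilities, shape (A,)
    adv_r:  reward advantages A_R(s, .), shape (A,)
    adv_c:  cost advantages   A_C(s, .), shape (A,)
    lam/nu: temperature and cost multiplier (placeholders for KKT multipliers)
    """
    logits = np.log(pi) + (adv_r - nu * adv_c) / lam
    logits -= logits.max()          # numerical stability
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()

# Toy usage: action 2 has a high reward advantage but also a high cost advantage,
# so with this cost multiplier its probability is pushed down while the cheaper
# action 1 is favored.
pi = np.array([0.3, 0.4, 0.3])
adv_r = np.array([0.0, 0.1, 0.8])
adv_c = np.array([0.0, 0.0, 1.0])
print(closed_form_safe_update(pi, adv_r, adv_c, lam=0.5, nu=1.0))
```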

We will cite the papers and add the following discussion to the related works section:

“We note that the closed-form policy update considered in our work is closely related to MPO (Abdolmaleki et al., 2018) and its extensions (Liu et al., 2019), which derived similar formulations for unconstrained policy optimization. Furthermore, the KL order adopted in our framework has been shown to improve robustness against noise and function approximation errors (Kozuno et al., 2019; Vieillard et al., 2020). Building upon these insights, our work extends the closed-form approach to the constrained policy optimization setting, providing the first algorithm that simultaneously achieves strict constraint satisfaction, guaranteed policy improvement, and a closed-form solution, thereby offering both theoretical safety guarantees and practical efficiency.”

Question: It would be nice if the authors provide experimental results in highly stochastic environments. Currently all experiments seem to be done in deterministic environments.

Answer: We appreciate the reviewer’s suggestion. In our paper, the Swimmer experiment adds large Gaussian noise to the state transitions (with variance up to 0.5), and therefore the environment is highly stochastic. Please refer to Appendix D4.1 for more details on the Swimmer experiment setup. As shown in Figure 4, the proposed method still works well and significantly outperforms meta-CPO and meta-CRPO in both the meta-training and the meta-test.

As for the other environments, the MuJoCo environments already include a degree of randomness, such as variations in initial states, actuation noise, and sensor noise, which introduces stochasticity into the dynamics.

Limitations: The CMDP formulation used in the paper is pointed out to be insufficient for safety-critical problems, and there is some efforts to fix it, e.g., Anytime-Constrained Reinforcement Learning by Jeremy McMahan and Xiaojin Zhu (2023). It would be nice to mention this limitation.

Answer: Thanks for pointing out the safety-critical metric and for the reference recommendation. We will cite the paper and incorporate the following discussion into the related works section:

“While Constrained Markov Decision Processes (CMDPs) are a common framework for incorporating constraints into reinforcement learning, they can be insufficient for safety-critical applications where violations must be avoided at all times. Recent advances, such as Anytime-Constrained Reinforcement Learning (McMahan and Zhu, 2023), introduce stricter formulations that enforce constraints at every timestep rather than in expectation. These approaches provide stronger safety guarantees and highlight potential limitations of standard CMDPs, suggesting promising directions for future work.”

Comment

Thank you very much for the detailed response, and I am sorry for the delayed response. I had been busy until yesterday due to a research grant interview. I will definitely check your response and reply today or early tomorrow.

Comment

We sincerely appreciate your time and effort in reviewing our paper and response. If there are any remaining questions or points that require further clarification, we would be more than happy to provide additional explanations. We understand that reading and assessing our response can be time-consuming, especially given your very busy schedule, and we truly appreciate your patience!

Comment

Thank you very much for the response, and I am really sorry for the delayed reply.

The main contribution of the paper, in terms of the algorithm design in (1), is not changing the order of the KL. Instead, the algorithm design (i) guarantees monotonic improvement, (ii) ensures safety, and simultaneously (iii) provides a closed-form solution for a single safe policy adaptation step. This closed-form expression of the adapted policy is then applied to meta-policy training, where it simplifies the computation and enables a Hessian-free meta-training algorithm.

Regarding (iii), the closed-form solution can be obtained because of the change in the order of the KL, right? And this closed-form solution confers computational advantages and no approximation error, in contrast to CPO, which uses a first-order approximation.

As noted already, it is a well-known fact that the proposed direction of KL regularization has a closed-form solution. (While constraints are additionally considered, it is obvious that a closed-form solution still exists if one knows convex optimization.) Furthermore, monotonic policy improvement and safety are guaranteed by a controlled KL divergence, which was already proposed in meta-CPO. From these points, I feel the novelty is limited.

That said, let me clarify my position, not only for the authors but also for the AC: while the change of KL order has limited novelty, it makes a significant contribution. This change introduces computational advantages and no approximation error, in contrast to meta-CPO.

(i) Superior optimality...

I am sorry, but I pointed out the similarity between the proposed method and meta-CPO rather than meta-CRPO.

Meta-CPO finds a meta-policy by solving Eq 10. For simplicity, let me consider only $F(\theta)$ and ignore constraints. Then, Eq 10 essentially aims at solving

$ \max_{\theta'} \mathbb{E}_{\tau \sim \mathbb{P}(\Gamma),\, s \sim d^{\pi_{\mathcal{A}(\theta)}}} \left[ \sum_a \pi_{\mathcal{A}(\theta')}(a|s)\, A_{i,R}^{\pi_{\mathcal{A}(\theta)}}(s, a) \right], $

where $\mathcal{A}$ is the CPO algorithm, and $\mathcal{A}(\theta)$ is its output with initial parameter $\theta$. The first-order approximation of this objective leads to Eq 12 in the meta-CPO paper. This objective is a surrogate of

$ \max_{\theta'} \mathbb{E}_{\tau \sim \mathbb{P}(\Gamma)}\, J_\tau(\mathcal{A}(\theta')) $

from the viewpoint of TRPO, and this original objective seems to be exactly the same as yours (Eq 2).

(ii) Anytime safety guarantee during the meta-test. The safe meta-policy training produces a safe initial meta-policy by imposing the safety constraint.

As for this, monotonic policy improvement seems to already be obtained by meta-CPO, as briefly mentioned in the meta-CPO paper.

As for the safety of a safe initial meta-policy, I am not sure how useful it is. Basically, Theorem 1 states that if the "variance" of tasks is small and a user correctly specifies $\varepsilon$, then a policy produced by the proposed method is safe. However, we have no knowledge of $\alpha$, $A_{c_i}^{\max}$, and $R$, so we cannot make sure a produced meta-policy is safe, apart from the fact that in some cases there might be no feasible solution.

Theoretically, we assume that the constrained policy optimization problem in (1) can be solved exactly, ... Under this widely accepted assumption, the anytime safety property demonstrated in our paper, which represents one of the most stringent safety criteria, is theoretically achievable.

I am indeed working on RL theory. Assuming no sampling error is an acceptable first step, but many RL theoreticians I know aim to analyze settings involving sampling in the end.

While revising my score, I thought my original score was a bit low. Therefore, I increased it from 4 to 5.

Final Decision

The authors' rebuttal has led to a significant improvement of the manuscript. After extensive discussions, all reviewers agree that the paper can be accepted.