Rethinking Optimal Transport in Offline Reinforcement Learning
Abstract
Reviews and Discussion
This paper provides a novel view of offline RL, which is roughly maximizing return while staying close to the data, as an OT problem. The key contribution is demonstrating, both theoretically and empirically, that partial OT can effectively address the challenge of stitching, a fundamental issue in offline RL.
Strengths
- The formulation of offline RL as OT from the stationary distribution over states to the dataset distribution induced by the behavior policy is promising.
- The formalization as OT is made convincing by presenting, in a principled way, how partial OT handles the stitching problem (the hyperparameter w controls where the algorithm sits between BC and naive value maximization).
- The main claim—that partial OT effectively handles stitching—is supported by experimental results, particularly in maze environments known to require stitching.
Weaknesses
In your method, stitching is achieved by dynamic programming (application of the Bellman operator), and your contribution regarding stitching is introducing partial OT to make this work by relaxing the matching restriction. At a high level, this is what existing regularization methods like TD3+BC (BC regularization) or ReBRAC (KL regularization) are doing. While this is not a weakness, the paper could be strengthened by providing a clearer explanation of the advantages of OT compared to BC or KL regularization, which is partially addressed in the experiment section.
Questions
Could you explain the theoretical/intuitive benefit of OT in offline RL compared with other techniques for keeping the policy close to the data, such as behavior cloning or KL regularization?
Limitations
Yes
Thank you for your review and for the positive assessment of the paper! Your valuable feedback will help us improve the manuscript. Please find below the answers to your questions.
Q1: In your method, stitching is achieved by dynamic programming (application of the Bellman operator), and your contribution regarding stitching is introducing partial OT to make this work by relaxing the matching restriction. At a high level, this is what existing regularization methods like TD3+BC (BC regularization) or ReBRAC (KL regularization) are doing. While this is not a weakness, the paper could be strengthened by providing a clearer explanation of the advantages of OT compared to BC or KL regularization, which is partially addressed in the experiment section. Could you explain the theoretical/intuitive benefit of OT in offline RL compared with other techniques for keeping the policy close to the data, such as behavior cloning or KL regularization?
Compared to BC/KL regularization, OT offers more flexibility while providing many building blocks to properly account for the geometry of the problem. For example:
- BC/KL regularization matches the policy to the data distribution at a pointwise level, which can lead to suboptimal policy learning in scenarios where exact matching is not feasible or necessary. In OT, by contrast, different constraints and regularizations can be used; the partial alignment proposed in our paper is a clear example.
- OT methods allow the choice of arbitrary cost functions that take the domain of the problem into account. For example, we proposed a Q-function-based cost for the RL domain. In comparison, BC/KL do not inherently consider the cost of transporting mass between distributions.
From a more general perspective, viewing RL as OT bridges the gap between the two areas, allowing tools from OT to be applied in RL for better efficiency. We will explicitly include this discussion in the final version of the paper.
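Schematically (in our notation, mirroring the policy-extraction updates in the pseudocode we give in a later response, not a verbatim copy of Eq. 12), the partial OT view replaces a pointwise BC/KL penalty with a max-min objective built on a task-aware cost c(s, a) = -Q(s, a):

$$\max_{f}\ \min_{\pi}\ \ \mathbb{E}_{s \sim d^{\beta},\, a \sim \pi(\cdot \mid s)}\big[-Q(s, a) - f(s, a)\big] \;+\; w\, \mathbb{E}_{(s, a) \sim \mathcal{D}}\big[f(s, a)\big],$$

where f is the dual potential and the mass parameter w controls how much of the data distribution must be matched, interpolating between BC-like behavior and pure value maximization.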
Concluding remarks: We truly value your review. We hope these clarifications are helpful. If there are remaining issues or questions on your mind, we are more than willing to address them.
Thank you for your reply. The claim that, by formulating offline RL as OT, we can leverage well-developed OT tools in RL sounds nice. Looking forward to your future work showing their effective use in RL.
Hello reviewer 7ePM: The authors have responded to your comments. I would expect you to respond in kind.
The authors propose a novel perspective on offline reinforcement learning by formulating it as a partial optimal transport problem. They view the policy as a transport map from the state distribution to the action distribution of the behavior policy and show that the dual form of the partial optimal transport problem can be expressed as a maximin optimization problem. The resulting maximin problem can be optimized easily, unlike other optimal transport formulations, due to the absence of 1-Lipschitz constraints. Experimental analyses on various offline RL benchmarks demonstrate the effectiveness of the proposed algorithm, especially in the antmaze environments.
Strengths
The paper introduces an interesting perspective of formulating offline RL as a partial optimal transport problem. Also, the proposed algorithm outperforms existing baselines on antmaze tasks by a huge margin.
Weaknesses
- The authors formulate offline RL as a partial optimal transport problem between the state distribution and β(a|s') by regarding the policy as a transport map. Since the problem depends on the choice of s' in β(a|s'), the resulting policy will also depend on it. This does not make much sense.
- The explanation of the environment used in the toy example is unclear. Line 193 states that the final state yields a reward of 1, while other states have zero reward. Then, what would necessitate the agent to seek the shortest path? The reward seems independent of the agent's action.
- It is difficult to understand how PPL, PPL-CQL, and PPL-R work. A pseudocode for each variant would be helpful.
- The paper has huge room for improvement in terms of formatting. Excessive use of underlines, an underfull hbox on Line 6 of Algorithm 1, and misaligned ±s and ragged purple boxes in the tables hurt the paper's readability.
Questions
Have you tried using a pre-trained Q-function learned by IQL or any other in-sample value learning algorithms instead of training the Q-function in parallel?
Limitations
The authors adequately addressed the limitations of their work.
Thank you for taking the time to review our paper and provide useful feedback. Your questions will help us improve the manuscript. Below are the answers to your questions. Please let us know if any issues remain!
Q1: The authors formulate offline RL as a partial optimal transport problem between the state distribution and β(a|s') by regarding the policy as a transport map. Since the problem depends on the choice of s' in β(a|s'), the resulting policy will also depend on it. This does not make much sense.
Please note that our method is completely offline, with no access to the online environment. The distribution over the states provided by the expert policy β is everything we have in offline learning. All we have to learn from are the states visited by the expert policy β. The goal of our method is to extract the best policy using the given distribution of expert states d^β(s) and actions β(a|s). The dependence on β in d^β(s) is the standard setup for offline RL, not something we invented (lines 16-17, 107-112). Please let us know if this answer addresses your concern.
Q2: The explanation of the environment used in the toy example is unclear. Line 193 states that the final state yields a reward of 1, while other states have zero reward. Then, what would necessitate the agent to seek the shortest path? The reward seems independent of the agent's action.
Our method works in the most common MDP setting with a discount factor γ < 1 (lines 94-106), which prioritizes immediate rewards over distant future rewards. By taking the shortest path to a reward, the agent ensures that it receives a higher discounted reward than by taking a longer path, where the same reward would be worth less due to discounting. The Q-function trained by Eq. 7 also tends to favor the shortest path. In these experiments, we simply show that BC limits the agent's performance by closely mimicking the provided suboptimal dataset. In contrast, our method allows the extraction of the best actions according to the Q-function, ignoring the suboptimal ones. We will clarify this section in the final version.
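For a quick numerical illustration of the discounting argument (γ = 0.99 is our assumed value here; the path lengths are illustrative), a reward of 1 reached after T steps contributes γ^T to the return, so shorter paths score strictly higher:

```python
# Discounted value of a terminal reward of 1 reached after T steps (gamma assumed).
gamma = 0.99
for T in (20, 30, 50):
    print(f"T={T}: discounted reward = {gamma ** T:.3f}")
# T=20: 0.818, T=30: 0.740, T=50: 0.605
```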
Q3. It is difficult to understand how PPL, PPL-CQL, and PPL-R work. A pseudocode for each variant would be helpful.
All of these variants differ mainly in how they learn the Q-function (the cost), not in the policy extraction. We agree with the reviewer that the description of the variants is an important component that needs to be discussed. Below is high-level pseudocode for these methods. We will add the complete pseudocode to the appendix and refer to it in the main text. Please note that for ReBRAC the cost additionally includes the behavior-regularization penalty (see the corresponding update below).
Input: Dataset D(s, a, r, s')
Initialize: Q, π, f, β, α, w, ℛ
for k in 1...N do
    (s, a, r, s') ← sample a batch of transitions from D
    ---------- Update the Q-function (cost) ----------
    if One-Step RL then pre-train Q by:
        Q^{k+1} ← argmin_Q E_{(s, a, s') ~ D} [(r(s, a) + γ E_{a' ~ β(s')} [Q^k(s', a')] - Q(s, a))^2]
    else if CQL then
        Q^{k+1} ← argmin_Q E_{s ~ D, a ~ π(s)} [Q(s, a)] - E_{s ~ D, a ~ β(s)} [Q(s, a)]
                    + (1/2) E_{(s, a, s') ~ D} [(r(s, a) + γ E_{a' ~ π(s')} [Q^k(s', a')] - Q(s, a))^2] + ℛ(π)
    else if ReBRAC then
        Q^{k+1} ← argmin_Q E_{(s, a, s') ~ D} [(r(s, a) + γ E_{a' ~ π(s')} [Q^k(s', a') - α(π(s') - a')] - Q(s, a))^2]
    end if
    ---------- Update OT (potential and policy) ----------
    f^{k+1} ← argmin_f E_{s ~ D, a ~ π^k(s)} [f(s, a)] - w E_{(s, a) ~ D} [f(s, a)]
    π^{k+1} ← argmin_π E_{s ~ D, a ~ π(s)} [-Q^k(s, a) - f^k(s, a)]
end for
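For concreteness, below is a minimal PyTorch-style sketch of the "Update OT" step above (function and variable names, batching, and optimizer details are our assumptions, not the paper's code):

```python
import torch

def ot_update(s, a, q_net, policy, potential, pi_opt, f_opt, w=0.5):
    """One alternating update of the OT potential f and the policy pi.

    s, a      : batched dataset states and actions, (s, a) ~ D
    q_net     : critic Q(s, a) acting as the (negative) transport cost
    policy    : deterministic actor, policy(s) -> a
    potential : dual potential network f(s, a)
    w         : partial-OT mass parameter
    """
    # Potential step: push f down on policy actions, up on dataset actions.
    a_pi = policy(s).detach()
    f_loss = potential(s, a_pi).mean() - w * potential(s, a).mean()
    f_opt.zero_grad(); f_loss.backward(); f_opt.step()

    # Policy step: maximize Q + f at the policy's own actions.
    a_pi = policy(s)
    pi_loss = (-q_net(s, a_pi) - potential(s, a_pi)).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return f_loss.item(), pi_loss.item()
```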
Q4: The paper has huge room for improvement in terms of formatting. Excessive use of underlines, an underfull hbox on Line 6 of Algorithm 1, and misaligned ±s and ragged purple boxes in the tables hurt the paper's readability.
We will incorporate your suggestions to improve the formatting. All misalignments will be corrected, and the purple boxes will be replaced with underlining. In the main text, underlining will be minimized. If you have any other suggestions for improving the presentation of the paper, we would appreciate them.
Q5: Have you tried using a pre-trained Q-function learned by IQL or any other in-sample value learning algorithm instead of training the Q-function in parallel?
No, we do not use IQL because it does not align with the optimal transport framework. If we look at the OT optimization problems (Eq. 1, 4, 12), we can see that the outputs of the map (policy) are used as inputs to the cost function (the Q-function in our case) during optimization. The IQL method, however, is a weighted behavior cloning approach and does not use the learned policy's outputs as inputs to the action-value function.
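To make the distinction concrete, here is an illustrative contrast (our notation and names, not the paper's code) between how the critic is queried in an OT-style extraction objective versus IQL's AWR-style extraction:

```python
import torch

def ot_style_policy_loss(q_net, f_net, policy, s):
    """Critic and potential are evaluated at the policy's own actions."""
    a_pi = policy(s)
    return (-q_net(s, a_pi) - f_net(s, a_pi)).mean()

def awr_style_policy_loss(q_net, v_net, log_prob, s, a, inv_temperature=3.0):
    """IQL/AWR: only dataset actions are used, reweighted by the advantage."""
    with torch.no_grad():
        weights = torch.exp(inv_temperature * (q_net(s, a) - v_net(s))).clamp(max=100.0)
    return -(weights * log_prob(s, a)).mean()
```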
Concluding remarks: We truly value your review. We hope that the revisions and clarifications improve your overall opinion of the paper. If we have resolved your principal concerns and questions, we would be thankful for your endorsement through an increased score. If there are remaining issues or questions on your mind, we are more than willing to address them.
Thank you for your response. I noticed some misunderstandings and wanted to clarify a few.
Q1. If I understood the paper correctly, you are viewing π as a transport map from the state distribution to the behavior policy β(·|s') for some s'. The transport map will depend on the choice of s'. How do you choose this s'?
Q2. Lines 190 and 191 state that the agent has to "navigate through 50 intermediate steps." Doesn't this mean that the episode length is fixed to 51 regardless of the agent's path?
Q5. The value learning part of IQL can be completely separated from the policy learning part, which uses AWR, as you mentioned. IQL Q function can be trained without weighted behaviour cloning and thus can be used together with the proposed algorithm.
Q1. If I understood the paper correctly, you are viewing π as a transport map from the state distribution d(s) to the behavior policy β(·|s') for some s'. The transport map will depend on the choice of s'. How do you choose this s'?
Sorry for the misunderstanding. We randomly sample s' from the given dataset D; please see the derivations at line 131 and Eq. 9. For the sampled s', we choose the conditional distribution over the action space provided by the dataset (expert), β(·|s'), as the target. We will make this explicit in the final version.
Q2. Lines 190 and 191 state that the agent has to "navigate through 50 intermediate steps." Doesn't this mean that the episode length is fixed to 51 regardless of the agent's path?
Yes, the length of the episode is fixed. This length of 50 corresponds to the number of steps provided by each expert policy in the dataset.
Q5. The value learning part of IQL can be completely separated from the policy learning part, which uses AWR, as you mentioned. IQL Q function can be trained without weighted behaviour cloning and thus can be used together with the proposed algorithm.
Thanks for the clarification; this experiment is really interesting. Following your question, we conducted a series of experiments on MuJoCo using the CORL [1] IQL implementation. In these tests, we found that a value function trained with expectile regression, as in IQL (see the sketch after the list below), is ill-suited for off-policy evaluation.
- First, we conducted an experiment without using our method. We trained a Q-function using the expectile regression loss and attempted to extract a policy by directly maximizing E_{s ~ D}[Q(s, π(s))] over π. This approach resulted in zero rewards.
- Second, we added a BC objective to the policy loss to avoid distribution shift. The results were the same: Q-functions trained via expectile regression dramatically overestimate actions sampled by the learned policy π.
- Finally, we tested our method with the expectile-trained Q-function as the cost and observed improvements in the scores! Even with this ill-suited cost function, policy optimization with respect to the potential f yielded the highest scores.
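For reference, here is a sketch of the expectile value objective mentioned above (our variable names; tau = 0.7 is the value IQL typically uses on MuJoCo):

```python
import torch

def expectile_value_loss(v_net, target_q, s, a, tau=0.7):
    """Asymmetric (expectile) regression of V(s) towards the target Q(s, a)."""
    with torch.no_grad():
        q = target_q(s, a)
    u = q - v_net(s)
    weight = torch.abs(tau - (u < 0).float())   # |tau - 1{u < 0}|
    return (weight * u.pow(2)).mean()
```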
We also tested the advantage and the exponentiated advantage functions from IQL as costs, but did not observe any improvements. The table below summarizes our analyses.
Table: Averaged normalized scores on MuJoCo tasks. Reported scores are the results of the final 10 evaluations and 3 random seeds.
| Dataset | Direct Q_IQL maximization | Q_IQL + BC | PPL with Q_IQL (Ours) |
|---|---|---|---|
| HalfCheetah-medium | -2.53 ± 0.1 | -2.54 ± 0.1 | 48.7 ± 0.3 |
| HalfCheetah-medium-expert | -2.53 ± 0.2 | -2.54 ± 0.2 | 39.9 ± 1.4 |
| Hopper-medium | 0.6 ± 0.1 | 0.6 ± 0.1 | 27.9 ± 15.4 |
| Hopper-medium-expert | 0.7 ± 0.1 | 0.6 ± 0.1 | 8.7 ± 4.6 |
| Walker-medium | -0.16 ± 0.1 | -0.16 ± 0.0 | 37.6 ± 3.6 |
| Walker-medium-expert | -0.23 ± 0.2 | -0.21 ± 0.1 | 17.5 ± 14.3 |
While the IQL method can formally be decoupled, we see that both components, expectile-based in-sample value learning and weighted behaviour cloning, rely on each other to achieve strong results. Adapting IQL value learning for better off-policy evaluation is a promising direction, but beyond the scope of our contribution. We believe that our strong performance on most tasks, combined with different types of cost functions, justifies our formulation and makes it valuable to the community. We will include these analyses in the final version of the paper.
Reference:
[1] JAX-CORL: https://github.com/nissymori/JAX-CORL
Thank you for your response.
Q1. I think I've failed to deliver my point. A transport map between the state distribution and the behaviour policy β(·|s') for some s' depends on the choice of s'. This means the optimal transport framework will produce multiple policies, one for each s'. How are you going to combine them into one policy π? It seems like you are defining π(·|s') as the value of the s'-conditioned transport map at s' itself (am I correct?), but this does not make much sense because any change of that map on a measure-zero set {s'} will not affect the result. Therefore, any arbitrary function would be a solution, which is definitely not what we want.
Q2. If the episode length is fixed to 50, the discounted return will always be the same regardless of the agent's policy.
Q5. Thank you for running the experiments. It is interesting that using IQL Q-functions as the cost function fails so miserably.
Q1. We learn a single neural network that approximates all the policies π(·|s'), one for each state s'. But the same is true for any BC method in offline RL when we learn π using log-likelihood, KL, or any other discrepancy. The neural approximator for the policy is able to generalize and provide a solution on the measure-zero set {s'}. Did we understand your question correctly?
Q2. Yes, you are right, this was an oversight on our part. Only this toy experiment was affected. We thank the reviewer for pointing this out. We fixed this by providing expert trajectories of different lengths in the dataset. This gave us the same visual results.
P.S. Although there was no explicit signal with the fixed episode length, the model still learned a near-shortest path. We think this is because the function approximator implicitly learned to value actions similar to the final rewarded action more highly.
Q5. Yes, these were really interesting results. We will make these experiments public along with the rest of the code.
Q1. Say we are going to use the log-likelihood as our regularizer, which means the objective function would be something like E_{(s,a) ~ D}[log π(a|s)]. For each s, the regularizer is only affected by the value of π(·|s) and nothing else. If we ignore the expressivity of a neural network, the resulting π is straightforward to interpret: for each s, π(·|s) fits the data conditional β(·|s). However, in the case of OT, the value of the transport map at all states matters, where the map is between the state distribution and β(·|s'). I'm concerned that trying to force the same π for all s' would cause conflicts between different s'.
Q2. Does it mean that an agent's actions in the intermediate 49 steps wouldn't affect the return?
Q1. Thank you for the clarification. Optimal transport does not cause conflicts between different s'. Our final objective (Eq. 12) for the policy is similar to the BC example that you gave: min_π E_{s ~ D, a ~ π(s)}[-Q(s, a) - f(s, a)]. For each s, the objective is affected only by the value of π(·|s) and nothing else. In the optimal transport literature, several methods with a similar objective have been considered to map into conditional distributions [1, Eq. 8c][2, Sec. 6.2]. Please note that [1] considers the case where even the source distribution is conditional and then uses a single map for all conditional distributions; see Sec. 2.4.
Q2. Yes, intermediate steps are not rewarded, but the Q-function is trained precisely to estimate the value of these intermediate steps. We added trajectories of different lengths to the dataset: 20, 30, 50. Consequently, trajectories that lead to the reward faster have a higher Q-value.
For a simpler illustration, we also made the per-step reward a function of the Euclidean distance between the current state-action pair and the final rewarded state-action pair (closer pairs receive higher reward). This simplifies the learning problem but gives a clear intuition for why the shortest path is optimal. If you find this more relevant, we will include such an experiment in the final version.
References:
[1] Nonlinear Filtering with Brenier Optimal Transport Maps: https://openreview.net/pdf/70633c38d3ce64c9b3b29dd7abd18c2f6b6e1dc6.pdf
[2] Neural Monge Map estimation and its applications: https://openreview.net/pdf?id=2mZSlQscj3
Q1. I'm confused now. Let's consider the simplest case where the state space has two elements, s_1 and s_2. Then the state distribution d(s) is supported on {s_1, s_2}, and the behavior policies are β(·|s_1) and β(·|s_2). The optimal transport map from d(s) to β(·|s_1) should be one map and the map from d(s) to β(·|s_2) should be another. If you try to use the same π in both cases, conflict would occur since the two maps generally differ, wouldn't it? Am I missing something?
Q2. What is the transition function? From the description in the paper, it looks like there are states uniformly spaced along the x-axis, and the agent transitions from s_t to s_{t+1} regardless of the action it takes. After 50 steps, the agent arrives at the final state no matter what and gets a reward of 1. What would encourage the agent to follow the shortest path?
Q1: Dear reviewer, as we already noted, we consider the conditional optimal transport setup. This means that we want to simultaneously learn a family of transport plans, indexed by some condition c, where each plan is between a pair of distributions (μ_c, ν_c). Following standard practice in neural OT, this means that we have to learn a map T(c, x, z), where c is the condition, x is an input point (from μ_c), and z is random noise needed to learn stochastic plans.
In our case, we need to learn a set of OT plans, each conditioned on a given state s'. Following our problem, each such plan should be a plan between the Dirac distribution δ_{s'} (this is our choice) and β(·|s'). Hence, we would have to learn a function with three arguments, T(s', x, z). However, since the input x comes from δ_{s'}, it coincides with s' with probability 1, so one may merge the arguments to get a function of the form T(s', z). In turn, we also remove the random noise component z, which is standard practice in neural OT methods unless additional regularization (variance, entropy, etc.) is used [1]. Hence, our final function is π(s'). We will add these details to the final version.
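Written out (notation ours), the collapse of arguments described above is:

$$T(s', x, z)\ \xrightarrow{\ x = s' \text{ a.s., since } x \sim \delta_{s'}\ }\ T(s', s', z)\ \xrightarrow{\ \text{drop the noise } z\ }\ \pi(s').$$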
Q2: The transition function moves the agent one step along the x-axis regardless of the action, as you describe. As we said in the previous answer, we have redefined the environment to avoid any confusion. Now a reward is given for each state-action pair based on its Euclidean distance to the target state-action pair (closer pairs receive higher reward). In this scenario, trajectories deviating from the straight line from the start to the target (actions define the y-axis coordinates) will have a lower cumulative reward.
References: [1] Neural Optimal Transport: https://openreview.net/pdf?id=d8CBRlWNkqH
Now I get it. So by the state distribution in line 129, you meant the Dirac mass δ_s. Please explicitly mention this in the paper. "State distribution", especially when written as d^π(s), often means the state occupancy measure under policy π, so I highly recommend you use another term to prevent misunderstandings like mine. As all of my concerns have been resolved, I have updated my score accordingly.
The authors address the problem of offline RL, rethinking it through the lens of optimal transport. In offline RL, the datasets often consist of several suboptimal trajectories that need to be stitched together. The authors use partial OT to incorporate stitching and a maximin formulation of this partial OT problem. They perform experiments mainly on D4RL.
Strengths
(1) Many methods that use OT rely on the Wasserstein distance, which requires optimizing a function constrained to be Lipschitz; this is often hard. The authors instead use a maximin formulation that does not need the function to be Lipschitz.
(2) They treat offline RL as an OT problem rather than using OT as a regularizer.
Weaknesses
(1) There should have been comparisons to W-BRAC.
(2) The results show that PPL^{CQL} produces some marginal improvement over PPL and PPL^{R} with ReBRAC.
(3) From Equation 12, you should not need \beta. But in the experiments you constantly talk about being in conjunction with something or training a \beta. Maybe this wasn't clear or I misunderstood: why do you need to be in conjunction with CQL, ReBRAC, or one-step RL? Why can't you simply train the maximin objective in Equation 12?
The work is interesting, but I would like the authors to clarify why there is any need for the conjunction at all. IQL performs better than CQL, so why would I use PPL^{CQL} and not train IQL directly? I request the authors to clarify the contribution of this paper in this context.
Questions
(1) What is the motivation for using OT here?
(2) What is d^\beta(s)? Is it the visitation of the behavioral policy \beta? Then why do you say that you want to learn a policy that transfers mass from d^\beta(s) to the corresponding distribution given by the behavioral policy?
Limitations
Limitations are addressed.
Thank you for your thoughtful questions; addressing them has helped us improve the paper. Below, we respond to each of your concerns: we include the W-BRAC comparison, explain the need for the conjunction, clarify the OT motivation, and provide details on the behavior policy visitation.
Q1: There should have been comparisons to W-BRAC.
We will include W-BRAC's results in Table 2 of our paper. Initially, we compared our method with other methods that have already shown significant improvements over W-BRAC, which suggests a clear performance hierarchy.
Q2: ... Why can't you simply train the maximin objective in Equation 12? ... Why is there a need for the conjunction with CQL, ReBRAC, or one-step RL at all? IQL performs better than CQL, so why would I use PPL^{CQL} and not train IQL directly? I request the authors to clarify the contribution of this paper in this context.
To address the comment regarding the conjunction, we would like to clarify several points:
- We do not simply train Eq. (12), because the Q-function trained via Eq. (7) and then used in Eq. (12) can suffer from overestimation bias (lines 107-113).
- Consequently, we tested our method in conjunction with various methods that avoid overestimation of the Q-function. Training \beta is actually required by these backbone methods. Please note that the contribution of our method is policy extraction, not solving the overestimation bias. These overestimation-avoiding methods are used to obtain different types of cost functions for our method, showing that it can work efficiently with any of them.
- We do not use IQL as a backbone because it does not align with the optimal transport framework. If we examine the optimization problems (Eq. 1, 4, 12), we can see that the outputs of the map (policy) are used as inputs to the cost function (the Q-function in our case) during optimization. The IQL method, however, is a weighted behavior cloning approach that does not use policy outputs as inputs to the critic; instead, only actions from the dataset are weighted by the action-value function.
In summary, to clarify our contribution: we propose a novel optimal transport-based policy extraction method and provide an analysis of its performance with various RL-based cost functions. We will state this very explicitly in the final version of the paper.
Q3: The results show that PPL^{CQL} produces some marginal improvement over PPL and PPL^{R} with ReBRAC.
The goal of these experiments was to show a side-by-side comparison between each base method and its improved version via our approach. In Table 1, we compare PPL^{CQL} to CQL and PPL^{R} to ReBRAC, showing that regardless of the backbone used, our method yields an improvement.
Q4: What is the motivation for using OT here?
The motivation follows from the need, when dealing with suboptimal data in offline learning, for a partial mapping that transports mass only to the best part of the action distribution. In OT, partial alignment methods have been extensively developed. To integrate OT methods into RL in the simplest way, we cast offline RL as a max-min OT problem, while avoiding the limitations of existing OT-in-RL methods.
Q5: What is d^\beta(s)? Is it the visitation of the behavioral policy \beta? Then why do you say that you want to learn a policy that transfers mass from d^\beta(s) to the corresponding distribution given by the behavioral policy?
Yes, you are right. Indeed, it is the state visitation distribution of the behavioral policy. Please note that our method is completely offline. Thus, the probability over the states, d^\beta(s), is the only distribution over the state space that we have. We have no way to obtain our own distribution, d^\pi(s), as we cannot interact with the environment. This is the standard setup for offline RL, not something we came up with. For each state visited by the expert, we aim to map it to the most efficient part of the distribution over actions for this state, provided by \beta(a|s).
Concluding remarks: We truly value your review. Please let us know whether the clarifications above suitably address your concerns. If you find the responses sufficient, we kindly ask that you consider raising your score.
Thank you for clarifying my doubts. I have increased my score.
Hello reviewer qCSm: The authors have responded to your comments. I would expect you to respond in kind.
The paper presents an approach to offline reinforcement learning based on data from multiple different (potentially suboptimal) experts. The problem is then to "stitch" together a policy from the best actions across experts. The authors formalize the problem as partial Optimal Transport. This is a novel role for Optimal Transport, interestingly different from its use in previous work on reinforcement learning. The results show promising performance against various baselines.
The reviewers are in consensus that this is interesting and sound work. I have read the reviews and the paper and agree. This is novel work that will likely be of broad interest to the community.