PaperHub

Average rating: 5.8 / 10 — Rejected (4 reviewers)
Ratings: 3, 6, 8, 6 (min 3, max 8, std 1.8)
Confidence: 2.5 · Correctness: 2.8 · Contribution: 3.0 · Presentation: 2.8

ICLR 2025

Zero-Shot Offline Imitation Learning via Optimal Transport

OpenReview | PDF
Submitted: 2024-09-27 · Updated: 2025-02-05
TL;DR

A non-myopic method for zero-shot imitation from arbitrary offline data.

Abstract

Keywords
Imitation Learning · Deep Reinforcement Learning · Optimal Transport

Reviews and Discussion

Official Review
Rating: 3

This paper presents a method for zero-shot imitation learning from a single demonstration (without action labels, possibly only a partial trajectory consisting of a sequence of goals). A method called ZILOT, which learns to plan and act according to these goals, is proposed. First, ZILOT learns a dynamics model (similar to TD-MPC2) using a dataset of transitions which can be sub-optimal. For planning with the demonstration, a non-Markovian method is employed to match the occupancy of the partial demonstration and the state-visitation distribution of the policy. The discrepancy between occupancies is computed using Optimal Transport, which requires value functions: V to estimate the reachability of a goal state and W to estimate the number of steps taken between goals. The non-Markovian planner uses this discrepancy to select the best action. Experiments conducted on multiple benchmarks show that ZILOT achieves a lower W1 distance between expert and policy occupancies and generates trajectories that better follow the order of goals in the demonstration.

Strengths

  • The idea of recovering expert behaviors from partial trajectories with subgoals and without action labels is interesting and challenging.
  • The drawback of prior methods that used goal-conditioned policies for this task is discussed well to motivate the proposed method.

Weaknesses

  • Many design choices of the proposed algorithm are not clear. Why does ZILOT use a non-Markovian policy, as I believe the expert policies are Markovian? How is the function ϕ that maps states to goals defined / learned?
  • The experiments are not convincing as the baselines like FB-IL [1] and OTR [2] are missing. It is not clear why the task success rate or the average returns are not used for comparing methods.
  • The limitations of the methods are not discussed. The only limitation presented is in Sec 7 that describes the dependence on a learned dynamics model.

Questions

  1. The paper is motivated by the fact that methods using goal-conditioned policies for zero-shot IL fail (Proposition 1 is used to support this argument). However, ZILOT plans using all the goal states in the demonstration. What if the goal-conditioned policy is given the expert demonstration as context? This way the goal-conditioned policy might discard the bad states to achieve a goal (shown in Proposition 1). Moreover, this would lead to learning a Markovian policy.
  2. The goodness of the proposed method depends on the planning horizon, and the paper discusses that ZILOT can be myopic too without a long planning horizon. This limits the applicability of the method in many real world tasks when planning for a long horizon. Is there a way to mitigate this by estimating the visitations using some form of bootstrapping?
  3. Since the method uses optimal transport and cites OTR [2] in the Related work, I feel it should be added as another baseline that uses the expert states and the sub-optimal dataset to recover a reward function and train a policy over it.
  4. Since sub-optimal data is used to learn the value functions, can the optimal V, W be recovered? If not, it is not highlighted how this gap will affect the final performance.
  5. The abstract talks about zero-shot IL with existing practical methods viewing demonstrations as a sequence of goals. I feel this is not true, as FB-IL [1] does zero-shot IL from a single demonstration. Moreover, I feel FB-IL should be a baseline. Although ZILOT deals with demonstrations with a subset of states, I feel using Eq. 8 in [1] should recover the policy with partial demonstrations.

References

[1] Pirotta et al., Fast Imitation via Behavior Foundation Models. ICLR '24
[2] Luo et al., Optimal Transport for Offline Imitation Learning. ICLR '23

Details of Ethics Concerns

NA

Comment

Question 1

The paper is motivated by the fact that methods using goal-conditioned policies for zero-shot IL fail (Proposition 1 is used to support this argument). However, ZILOT plans using all the goal states in the demonstration. What if the goal-conditioned policy is given the expert demonstration as context? This way the goal-conditioned policy might discard the bad states to achieve a goal (shown in Proposition 1). Moreover, this would lead to learning a Markovian policy.

Thank you for raising this very interesting point. We agree that this represents a promising algorithmic idea. However, to the best of our knowledge, designing an RL algorithm that is conditioned on an entire trajectory would not be trivial. The machinery developed for goal-conditioned RL does not easily generalize to this setting (e.g., how would we sample “goal trajectories” for relabeling?). We believe that a naive implementation would then only be capable of imitating trajectories that appear in the offline dataset. Synthesising a (Markovian) reward function for multiple goals that cannot be “exploited” by repeatedly traversing through one of the goals also does not seem trivial. Our method represents a solution that sidesteps these issues by leveraging zero-order optimization, but we share the opinion that further research on zero-shot, model-free imitation in RL is an interesting direction.

Question 2

The goodness of the proposed method depends on the planning horizon, and the paper discusses that ZILOT can be myopic too without a long planning horizon. This limits the applicability of the method in many real world tasks when planning for a long horizon. Is there a way to mitigate this by estimating the visitations using some form of bootstrapping?

We agree that bootstrapping represents a very interesting avenue of further research: in particular, estimating successor measures and future visitations would alleviate the finite horizon issue. We have decided to leave it for future research, as the current framework does not require an explicit policy representation, which would be necessary to estimate off-policy visitations, to the best of our knowledge. Moreover, we have observed empirically that myopic behavior is acceptable in practice, even for finite horizons.

Question 3

Since the method uses optimal transport and cites OTR [2] in the Related work, I feel it should be added as another baseline that uses the expert states and the sub-optimal dataset to recover a reward function and train a policy over it.

While OTR uses similar ideas to ours and was an inspiration for our work, it is not zero-shot and would require full training for each task we test as you have pointed out. For this reason we decided to not compare against this method. Finally, OTR is not straightforwardly applicable since it assumes access to the full expert states (as opposed to goals in our case), so a goal-conditioned value function would need to be trained additionally.

Question 4

Since sub-optimal data is used to learn the value functions, can the optimal V, W be recovered? If not, it is not highlighted how this gap will affect the final performance.

Since we are using an offline off-policy method to learn the goal-conditioned value function V, assuming convergence of the algorithm and a sufficient amount of data and coverage, the optimal V* (and thus also W) should be recovered. Of course, if the trained V is far from optimal, ZILOT might either not take a short available path to a goal, or try to reach a goal in a way that is not possible (e.g., through walls in a maze).

Question 5

The abstract talks about zero-shot IL with existing practical methods viewing demonstrations as a sequence of goals. I feel this is not true, as FB-IL [1] does zero-shot IL from a single demonstration. Moreover, I feel FB-IL should be a baseline. Although ZILOT deals with demonstrations with a subset of states, I feel using Eq. 8 in [1] should recover the policy with partial demonstrations.

Thank you for bringing this important line of research to our attention. We believe that this comparison highlights the novelty and the properties of ZILOT. Please refer to the general response for a detailed discussion.

We hope that we were able to clarify our contributions and address the reviewer’s comments. We remain open for further discussion.

Comment

W1

Computing actions through a planner is standard with world models (WMs), as done in PlaNet and TD-MPC2. However, in both cases the planning horizon is small, between 5 and 16 steps. Moreover, TD-MPC2 does use a Q-function while ranking trajectories. To highlight, I see a few major concerns with ZILOT:

  1. ZILOT will have to search or plan across the episode length as horizon which makes planning hard because the search space is quite large. In many tasks, it might not be feasible to plan across the entire episode.
  2. A solution is to use a fixed horizon length. However, as mentioned in the paper, ZILOT will be myopic and will have similar problems as the baselines. Since the key insights of this paper are around myopic behaviors of baselines, I feel the paper needs to show how ZILOT can tackle this challenge when planning with a horizon less than the episode length.
  3. Lastly, I agree that previous states are needed to infer about states that are unvisited and ZILOT uses a non-Markovian policy. My concern is if the task can be solved with a Markovian policy, do we need a more complex/general policy class for the task?

Q1

I appreciate the effort of getting results on discussing different formulations of FB-IL and comparison on tasks suitable for the state-only setting with partial trajectories.

Q2

Can you elaborate on the last point on myopic behavior being acceptable in practice? Specifically, which task used in the experiments shows that myopic behavior worked fine whereas which task shows that using myopic behavior leads to failures? If myopic behavior is acceptable in practice, then the rationale for focusing on non-myopic agents, which forms the core motivation of this work, becomes unclear.

Comment

Thank you for your quick response! We have addressed your concerns in the following.

Concern 1&2: Fixed Horizon

  1. ZILOT will have to search or plan across the episode length as horizon which makes planning hard because the search space is quite large. In many tasks, it might not be feasible to plan across the entire episode.
  2. A solution is to use a fixed horizon length. However, as mentioned in the paper, ZILOT will be myopic and will have similar problems as the baselines. Since the key insights of this paper are around myopic behaviors of baselines, I feel the paper needs to show how ZILOT can tackle this challenge when planning with a horizon less than the episode length.

You correctly point out that planning with a horizon as long as the episode is not feasible in many settings. What we have seen in all of our experiments is that a finite planning horizon of 16 steps is sufficient to avoid very bad myopic behavior. This is also reflected in our quantitative and qualitative results. While one surely can contrive an MDP where planning for anything less than the full episode length is arbitrarily bad, we would argue that, in practical settings, considering a few future goals (e.g., what ZILOT does) is much better than considering a single one, and approaches the performance of an algorithm considering the entire goal sequence.

Concern 3: Markovianity

3. Lastly, I agree that previous states are needed to infer about states that are unvisited and ZILOT uses a non-Markovian policy. My concern is if the task can be solved with a Markovian policy, do we need a more complex/general policy class for the task?

We concur that in the Markovian setting the OT objective can be optimized by a Markovian policy, and a more general policy class is not strictly necessary. However, while choosing the right policy class is important, it is also crucial to be able to easily find the optimal policy in this class. ZILOT needs to consider a richer class of policies (i.e., non-Markovian), but is capable of efficiently searching this class by instead optimizing over trajectories with a powerful zero-order method. In particular, ZILOT only needs to search over a space of size H|A|, while the space of all Markovian (deterministic) policies is already much larger (|S||A|).

Q2: Limited Myopia

Can you elaborate on the last point on myopic behavior being acceptable in practice? Specifically, which task used in the experiments shows that myopic behavior worked fine whereas which task shows that using myopic behavior leads to failures? If myopic behavior is acceptable in practice, then the rationale for focusing on non-myopic agents, which forms the core motivation of this work, becomes unclear.

We apologize for our imprecise wording in our response to Q2. We meant to make a similar statement to the one we made in the “Limitations” paragraph in Section 7:

However, we found the degree of myopia to be acceptable in our experimental settings, [...]

Full myopia, as found in our baselines (Pi+Cls, MPC+Cls), does not perform very well, as can be seen in our qualitative and quantitative results. But we have found that the myopia incurred by going from an infinite horizon to a finite one in ZILOT is acceptable in practice.

The degree of myopia that is acceptable depends on the MDP and the goal abstraction. For example, in an environment like Pointmaze, all states that reach a goal only differ in the pointmass's velocity, so even the worst possible state in which a goal can be reached, where the velocity points in the opposite direction of the next goal, is still recoverable within the next few steps. Thus, even our myopic baselines manage to imitate the expert in this environment by “over-shooting” each goal (see Fig. 30 in Appendix D). Contrary to this, in our Fetch-Slide environment a cube may slide out of reach of the arm from a single push, meaning that every goal can be reached in a state from which the policy cannot recover afterwards (i.e., it cannot retrieve the cube). Even worse: this state is exactly the one that our myopic baselines use to reach goals, since it is the fastest one to reach. In this environment the contrast between the myopic baselines and ZILOT is much larger: as long as ZILOT sees enough of the expert trajectory to realize that it will need to redirect the cube at some point in the future, it will not choose to push it so strongly that it slides out of reach. So, while ZILOT may still be myopic because of its finite horizon, it is not myopic enough to choose the very bad behavior of our baselines.

Comment

We thank the reviewer for the detailed feedback, and for raising several interesting points of discussion. We have significantly updated our work based on the suggestions, and we believe that we can fully address all comments as follows.

Weakness 1

Many design choices of the proposed algorithm are not clear. Why does ZILOT use a non-Markovian policy, as I believe the expert policies are Markovian?

We thank the reviewer for raising this important point, which we are happy to clarify. In standard RL, actions are selected to optimize an objective (i.e., the Q-function), which is conditioned on a specific policy. In this case, a Markovian policy can imitate a Markovian policy. ZILOT does not explicitly parameterize the policy, but computes actions in an online fashion, as is standard for zero-order optimization with learned world models [1]. Thus, the objective for action selection cannot be conditioned on a policy. Intuitively, when optimizing the action at timestep t, ZILOT needs to keep track of how the previous states align with the expert demonstration, in order to move towards states that are yet unvisited. In other words, this difference stems from the fact that ZILOT searches over actions, rather than over policies. Fortunately, this framework comes with an added benefit: ZILOT is fully able to imitate more general, non-Markovian policies.

[1] Hafner et al., Learning Latent Dynamics for Planning from Pixels. ICML '19
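
To make the action-selection procedure described above concrete, the sketch below illustrates a zero-order (CEM-style) planner of the kind referred to here: it searches directly over open-loop action sequences, unrolls them through a learned dynamics model, and scores whole rollouts with a non-Markovian objective such as an OT discrepancy to the expert occupancy. All names, interfaces, and hyperparameters are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def plan_actions(dynamics, objective, z0, horizon=16, act_dim=4,
                 n_samples=256, n_elites=32, n_iters=6):
    """Zero-order (CEM-style) search over open-loop action sequences.

    dynamics(z, a) -> next latent state; objective(latents) -> scalar cost
    (e.g. an OT discrepancy to the expert occupancy). These callables and
    hyperparameters are illustrative, not the paper's exact API.
    """
    mu = np.zeros((horizon, act_dim))
    std = np.ones((horizon, act_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences around the current distribution.
        actions = mu + std * np.random.randn(n_samples, horizon, act_dim)
        costs = np.empty(n_samples)
        for i in range(n_samples):
            z, latents = z0, []
            for t in range(horizon):
                z = dynamics(z, actions[i, t])
                latents.append(z)
            costs[i] = objective(latents)  # non-Markovian: scores the whole rollout
        # Refit the sampling distribution to the lowest-cost sequences.
        elites = actions[np.argsort(costs)[:n_elites]]
        mu, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mu[0]  # execute the first action, then replan (MPC)
```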

Weakness 2

How is the function that maps states to goals defined / learned?

As is established practice in the goal-conditioned RL literature [1], we consider the function ϕ as part of the problem definition. In our environments it captures the part of the state we want to control, e.g., the cube position in Fetch, the ball position in Pointmaze, and the x-position and orientation of the Cheetah. The exact function is detailed in Table 7 in the appendix.

[1] Andrychowicz et al., Hindsight Experience Replay. NeurIPS '17
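
Purely for illustration, such a goal abstraction can be as simple as selecting the controlled components of the state vector. The index ranges below are hypothetical placeholders; the actual, environment-specific definitions are the ones listed in Table 7 of the appendix.

```python
import numpy as np

def phi(state: np.ndarray, env: str) -> np.ndarray:
    """Map a full state to the goal-space components we want to control.

    The index ranges are hypothetical; the actual, environment-specific
    definitions are given in Table 7 of the appendix.
    """
    if env == "fetch":        # cube position
        return state[3:6]
    if env == "pointmaze":    # ball position
        return state[0:2]
    if env == "halfcheetah":  # x-position and torso orientation
        return state[[0, 2]]
    raise ValueError(f"unknown environment: {env}")
```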

Weakness 3

It is not clear why the task success rate or the average returns are not used for comparing methods.

Our expert trajectories are not retrieved from optimizing a reward signal, but specified by humans: as rewards are undefined, we cannot evaluate returns. We can instead evaluate success rates, and this is what we indirectly report as GoalFraction. Compared to raw success rates (1 if all goals are achieved, else 0), we find GoalFraction (a scalar from 0 to 1, proportional to the number of goals in the expert demonstrations that are achieved in order) to be more fine-grained and informative.
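
For intuition, a metric of this kind could be computed roughly as in the sketch below; the goal-space distance and success threshold are assumptions made for illustration, not our exact evaluation code.

```python
import numpy as np

def goal_fraction(rollout_goals, expert_goals, tol=0.05):
    """Fraction of expert goals achieved in order along the agent's rollout.

    rollout_goals: array (T, d) of the agent's states mapped through phi.
    expert_goals:  array (K, d) of goals from the expert demonstration.
    tol: success threshold on the goal-space distance (an assumption here).
    """
    achieved, t = 0, 0
    for g in expert_goals:
        # Advance along the rollout until this goal is reached (or we run out).
        while t < len(rollout_goals) and np.linalg.norm(rollout_goals[t] - g) > tol:
            t += 1
        if t == len(rollout_goals):
            break
        achieved += 1
    return achieved / len(expert_goals)
```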

Weakness 4

The limitations of the methods are not discussed. The only limitation presented is in Sec 7 that describes the dependence on a learned dynamics model.

Thank you for pointing this out; we have expanded the limitations into a dedicated paragraph in Section 7.

Official Review
Rating: 6

This paper proposes an approach for zero-shot imitation -- use a single expert trajectory to extract a policy that mimics this task. The general approach is to solve an optimal transport problem (equipped with a goal-reaching cost function) with a model that can be used to estimate the policy's stationary distribution. Across a number of tasks (fetch, halfcheetah, pointmaze), the proposed approach can learn useful skills.

Strengths

The paper studies a very important problem: the ability to quickly learn new skills and policies from few expert demonstrations is a much-desired property of policies. The method is conceptually simple and well-motivated theoretically.

I found the paper to be easy to read, and the background for all relevant concepts was motivated and well-introduced (within Sec 2-4)

Figure 4 clearly demonstrates the main improvement over the sub-goal approach, where the MPC+Cls baseline overshoots the puck when going to the first subgoal.

The experiments are relatively thorough (both in the main paper, and the Appendix), indicating improvement over other model-based zero-shot imitation methods.

Weaknesses

I had no major objections to the paper;

However, it would have been nice to see comparisons to other approaches to zero-shot imitation (e.g. along the line of FB representations). Another axis that could improve the paper would be to see a wider range of downstream tasks (requiring the synthesis of more diverse motions), perhaps those from (Frans et al.).

Minor nit: There is a lot of content in the Appendix, but it is not clearly linked within the main paper. I would encourage the authors to link and discuss this content within the main paper, so that it is not missed by a reader.

Questions

  1. How "in-distribution" are the inference tasks compared to the trajectories used to train TD-MPC? How does the quality of learned policy change we try to imitate more "OOD" expert behaviors?
Comment

Thank you for taking the time to review our work and your insightful comments. We have improved our paper based on your concerns, as addressed in the following.

Weakness

However, it would have been nice to see comparisons to other approaches to zero-shot imitation (e.g. along the line of FB representations)

Thank you for pointing out this important line of work. We analyse this family of methods in detail, compare it with ZILOT, and provide empirical evidence in Appendix B. Please refer to the general response for further discussion.

Question 1

How "in-distribution" are the inference tasks compared to the trajectories used to train TD-MPC? How does the quality of learned policy change we try to imitate more "OOD" expert behaviors?

We have added a paragraph in Appendix C.8 investigating this question. We confirm that expert demonstrations do not need to appear in the offline dataset used to train TD-MPC2. However, it is important that individual states in the demonstrations are largely within the support of the dataset, even if they appear in separate trajectories. This can be interpreted as a form of off-policy learning, i.e., stitching. In other words, ZILOT can imitate “OOD” sequences of states, as long as each state is in-distribution.

We hope that we were able to clarify our contributions and address the reviewer’s comments. We remain open for further discussion.

Official Review
Rating: 8

This paper identifies an existing problem with goal-conditioned imitation learning: an agent that is optimized to achieve individual goals can undermine long-term objectives. Instead of learning a goal-conditioned policy, ZILOT learns a dynamics model with a goal-conditioned value function. It uses optimal transport to minimize the divergence between the rollout policy and the expert demonstration.

Experiments show that ZILOT imitates the demonstration more closely compared with MPC and goal-conditioned policy baselines.

Strengths

  • Identifies an existing problem of goal-conditioned policy learning and proposes a solution based on optimal transport.
  • Demonstrates the success of ZILOT on a range of locomotion and manipulation tasks and shows better alignment with demonstrations.

Weaknesses

  • Lack of explanation and analysis of the myopic behavior in goal-conditioned imitation learning and why ZILOT bypasses those limitations.
  • Baselines are relatively simple. How does it compare with some diffusion-based methods, e.g. Diffusion Policy?

Questions

  • Instead of conditioning on an intermediate goal, can we condition on a sequence of goals to mitigate myopic behavior?
  • Instead of optimal transport to minimize the distance between distributions, how does it compare to diffusion-based methods?
Comment

We thank the reviewer for their feedback and comments. We are happy to address them, one by one, in this response.

Weakness 1

Lack of explanation and analysis of the myopic behavior in goal-conditioned imitation learning and why ZILOT bypasses those limitations.

The fundamental issue with hierarchical approaches is that they “break down” the imitation learning problem into a sequence of local goal-conditioned problems. Each of the problems is then solved without considering the global imitation problem. As a consequence, these approaches can display greedy or myopic behavior, by aggressively trying to achieve the first subgoal, while disregarding the rest. ZILOT does not perform this decomposition into local problems, but tries to directly match the entire expert distribution (up to approximations due to finite-horizon planning). In other words, the policy extracted by ZILOT considers not only the next subgoal, but also the following ones. The suboptimality of hierarchical approaches is formally proved in Proposition 1, and can be observed directly in Figure 4. In Fetch-Slide, the naive baseline starts by only focusing on the first subgoal, and has to later correct the trajectory to accommodate the following subgoals. By considering all subgoals, ZILOT produces a smoother trajectory that follows the expert more closely. We hope this discussion clarifies this important difference; we are happy to apply any further suggestion to improve our paper’s clarity.

Weakness 2

Baselines are relatively simple. How does it compare with some diffusion-based methods, e.g. Diffusion Policy?

To the best of our knowledge, diffusion-based methods for behavioral cloning need access to the expert’s actions, while our setting only provides sequences of expert states (or goals). The offline dataset D_β contains actions, but it does not demonstrate expert behavior: cloning this behavior would not recover an optimal goal-reaching policy. On the other hand, our baselines can handle suboptimal data, as they also build upon TD-MPC2. Since TD-MPC2 seems to be currently one of the best off-policy algorithms for online and offline RL [1], we are confident in the performance of our baselines. Nonetheless, in the context of this rebuttal, we evaluated an additional baseline (FB-IL): please see the common response for further details.

[1] Hansen et al., TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR '24

Question 1

Instead of conditioning on an intermediate goal, can we condition on a sequence of goals to mitigate myopic behavior?

Thank you for raising this interesting point. Given our analysis in Section 3, conditioning a policy on multiple future goals represents a promising direction for solving the issues faced by myopic policies. However, to the best of our knowledge there is no principled framework that trains such a policy from suboptimal data. This problem seems to be rather hard; even synthesising a reward function for multiple goals that cannot be exploited by repeatedly traversing through one of the goals does not seem to be trivial. Our method represents a first step in this direction.

Question 2

Instead of optimal transport to minimize the distance between distributions, how does it compare to diffusion-based methods?

While both diffusion models and optimal transport are concerned with distribution matching, their use cases differ: Optimal Transport, in particular Sinkhorn's algorithm that we use, is concerned with measuring the distance between two arbitrary distributions (over a certain space). Conversely, diffusion models are a solution for modeling one (usually very complex) distribution by minimizing a distribution matching objective. To the best of our knowledge, there is no established way of using a diffusion model to determine a distance between two distributions.
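
For concreteness, the following is a minimal, textbook sketch of entropy-regularized OT via Sinkhorn iterations: given a pairwise cost matrix (which in our case would be built from learned goal-reaching estimates), it returns a transport plan and a scalar discrepancy between two empirical distributions. The inputs and hyperparameters are assumptions for illustration, not our implementation.

```python
import numpy as np

def sinkhorn(cost, mu, nu, eps=0.05, n_iters=100):
    """Entropic optimal transport between two discrete distributions.

    cost: (n, m) pairwise cost matrix, e.g. estimated goal-reaching times
          between planned states and expert goals.
    mu, nu: weights of the two empirical distributions (each sums to 1).
    Returns the transport plan and the resulting OT discrepancy.
    (Log-domain stabilization is omitted for brevity.)
    """
    K = np.exp(-cost / eps)              # Gibbs kernel
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iters):             # alternating scaling updates
        u = mu / (K @ v)
        v = nu / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    return plan, float((plan * cost).sum())
```

A diffusion model, by contrast, would only give us a generative model of a single distribution, without such a coupling between two distributions.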

Comment

Thank you for your response! I appreciate the additional baseline (FB-IL) and most of my concerns are addressed. I will raise my score to 8.

Comment

Thank you for taking our revision into account! We appreciate your engagement in the discussion process.

Official Review
Rating: 6

The authors introduce a new method for zero-shot imitation learning based on optimal transport.
The approach involves combining a modified goal-conditioned TD-MPC2 algorithm with optimal transport. The authors begin by training goal-conditioned value functions V and W, as well as a dynamics model obtained from TD-MPC2. Then they run the OT process using the Sinkhorn algorithm, given V, W, P, and the expert trajectory, to compute the cost matrix and transport matrix.

Strengths

The method shows examples of non-myopic behaviour compared to the baselines. The proposed approach does not depend on the availability of actions in the expert trajectory. The proposed approach works as a zero-shot "agent correction" policy. The authors evaluated their method on a variety of different environments.

Weaknesses

The proposed method requires both the learned world model and expert demonstrations. In practice, training the world model requires a lot of resources and either access to a simulator or a large collection of datasets.

Though the authors position their paper as "zero-shot IL", it is still far away from "fair" zero-shot, since it requires some conditions for the method to work:

  1. Access to an offline dataset.
  2. A trained transition model.
  3. An expert trajectory.

Questions

  1. What is the additional computational cost introduced by solving OT problems compared to running just the planners?
  2. How does the method scale when there are more than 1 demonstration? Does the method have a problem with multimodal behavior in the bigger data collection?
  3. Can the authors apply the method to cross-domain / cross-embodiment settings?
  4. What if the expert policy is different from the resulting policy in TD-MPC? I.e., can we observe a distributional shift if our expert policy comes from an environment with slightly different dynamics, for example, cross-embodiment (x-magical https://github.com/kevinzakka/x-magical)?
  5. Can the current method be altered to remove the model-based approach and replace it with an offline dataset of video collections? I.e., why can't we learn value functions from just video data, without rewards and actions, and then use them in OT?
Comment

Thank you for taking the time to review our work and your valuable feedback. We have improved our paper based on your comments, which we address in the following response.

Weakness

The proposed method requires both the learned world model and expert demonstrations. In practice, training the world model requires a lot of resources and either access to a simulator or a large collection of datasets.

it is still far away from “fair” zero-shot

We would like to clarify our setting and the requirements of the algorithm. As the reviewer points out, our method requires access to an offline dataset of trajectories (of arbitrary quality). However, ZILOT does not require a trained transition model as an input: throughout the experiments, the transition model is simply trained offline on the aforementioned dataset. Thus, at training time, ZILOT only requires access to an offline dataset. At test time, ZILOT uses a single expert demonstration, only for specifying the task to be performed (i.e., it is not used for training). To the best of our knowledge, zero-shot imitation would not be possible without either of these two components. This notion of zero-shot imitation is consistent with the literature ([1]).

[1] Pathak et al., Zero-Shot Visual Imitation. ICLR '18

Question 1

What is the additional computational cost introduced by solving OT problems compared to running just the planners?

For the Fetch environments, the MPC+Cls planner runs at around 30 Hz, compared to ZILOT, which runs at around 3 Hz. We should point out that there exists a range of runtime improvements on the basic Sinkhorn algorithm that we did not test for this work. We included these details in Appendix C.3.

Question 2a

How does the method scale when there are more than 1 demonstration?

Thank you for suggesting this interesting setting. As ZILOT deals with occupancies, it can directly generalize to the imitation of multiple demonstrations, by including all given trajectories in the empirical approximation of ρ^E and truncating each trajectory individually for the finite-horizon problem that is optimized with MPC. Notably, the same approach can also be applied to the occupancy estimate of the agent policy ρ^π, which allows ZILOT to build upon probabilistic world models, e.g., diffusion-based world models [1].

[1] Alonso et al., Diffusion for World Modeling: Visual Details Matter in Atari. NeurIPS '24
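
As a rough sketch of the multi-demonstration generalization described above (uniform weighting and the simple per-trajectory truncation below are simplifying assumptions), the empirical expert occupancy could be assembled as follows and then used as one marginal of the OT problem.

```python
import numpy as np

def empirical_expert_occupancy(demos, horizon=16):
    """Pool several partial demonstrations into one empirical goal occupancy.

    demos: list of arrays, each (K_i, d), goals of one expert demonstration.
    Each trajectory is truncated individually (here, simply to the planning
    horizon), then all goals are pooled with uniform weights; both choices
    are simplifying assumptions for this sketch.
    """
    goals = np.concatenate([d[:horizon] for d in demos], axis=0)
    weights = np.full(len(goals), 1.0 / len(goals))
    return goals, weights
```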

Question 2b

Does the method have a problem with multimodal behavior in the bigger data collection?

ZILOT relies on offline learning for both its dynamics model and value function. For the former, we adopt the standard RSSM ([1]) in TD-MPC2, while the latter is learned through an offline reinforcement learning algorithm. Both algorithms are fully capable of handling multimodal behavior. In fact, the offline dataset D_β is highly multi-modal; nevertheless we do not observe significant issues during training.

[1] Shaj et al., Hidden Parameter Recurrent State Space Models for Changing Dynamics Scenarios. ICLR '22

Question 3 & 4

Can the authors apply the method to cross-domain / cross-embodiment settings? What if the expert policy is different from the resulting policy in TD-MPC? I.e., can we observe a distributional shift if our expert policy comes from an environment with slightly different dynamics, for example, cross-embodiment (x-magical https://github.com/kevinzakka/x-magical)?

As long as the goal spaces coincide, the expert can have a different embodiment, as ZILOT imitates experts from rough (potentially incomplete) and partial (goal-space only) trajectories and does not need expert actions. Further, we would also like to point out that TD-MPC2, the method used in ZILOT to train its dynamics model and value functions, shows great performance in cross-domain/cross-embodiment settings ([1]). We are happy to further discuss these points in case anything remains unclear.

[1] Hansen et al., TD-MPC2: Scalable, Robust World Models for Continuous Control. ICLR '24

(continued)

Comment

Question 5

Can the current method be altered to remove the model-based approach and replace it with an offline dataset of video collections? I.e., why can't we learn value functions from just video data, without rewards and actions, and then use them in OT?

Thank you for raising this interesting point. ZILOT needs to learn a goal-conditioned value function and a dynamics model. While optimal value functions can be learned from datasets without action labels through approaches like VIP [1], learning a dynamics model generally requires action labels. It is, however, possible to infer actions in certain domains, for instance by hand-tracking in manipulation. Recent work even considers latent actions [2,3] that may later be fine-tuned to the real action space with very little data. Thus, our framework can be extended to learn from largely unlabeled videos, as long as a small amount of data can then be used to label actions.

[1] Ma et al., VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training. ICLR '23

[2] Ye et al., Latent Action Pretraining from Videos. Preprint '24

[3] Bruce et al., Genie: Generative Interactive Environments. '24

We hope our response addresses your concerns, and answers your questions. We are happy to continue the discussion if further clarifications are required.

Comment

We would like to thank all reviewers for their valuable feedback and positive comments. We believe that the clarifications and the additional results we provide help to further improve our work. In this general response, we would like to provide a comment on model-free zero-shot imitation methods, and summarize the changes we made based on the feedback we received:

Comparison to FB-IL

We thank reviewers 3zhP and Lfwv for pointing out FB-IL as an alternative model-free zero-shot IL framework. FB-IL is a powerful and flexible framework, introducing several different approaches. We have added a new section as Appendix B, in which we discuss each of these instantiations in detail. We find that there are substantial differences between each of them and ours, namely in the underlying objective and in the approximations applied to it. In short, we find that two flavors of FB-IL (behavior cloning and distribution matching) do not apply to our setting, in which demonstrations are partial and rough; one flavor (goal-conditioned) recovers the performance of one of our existing baselines. The remaining instantiation (reward-based) is applicable to our setting and provides an interesting comparison. As suggested by reviewer Lfwv, we implement and evaluate this approach. In our setting, it does not match the performance of ZILOT. We refer to Appendix B for empirical results and a formal analysis.

Changes and Additional Results

  • We have added a thorough comparison to FB-based imitation learning methods in Appendix B.
  • We have given the limitations a dedicated paragraph in Section 7 that expands on the previously stated limitations, as suggested by Reviewer Lfwv.
  • We have added a discussion on the difficulty of tasks in Appendix C.8, as enquired by Reviewer 3zhP.
  • We have expanded our discussion of computational costs in a dedicated subsection of Appendix C.3, as suggested by Reviewer Ha2j.
Comment

Dear Reviewers,

This is a gentle reminder that the authors have submitted their rebuttal, and the discussion period will conclude on November 26th AoE. To ensure a constructive and meaningful discussion, we kindly ask that you review the rebuttal as soon as possible and verify if your questions and comments have been adequately addressed.

We greatly appreciate your time, effort, and thoughtful contributions to this process.

Best regards, AC

AC Meta-Review

This paper proposes ZILOT (Zero-shot Offline Imitation Learning via Optimal Transport), a novel method for zero-shot imitation learning that aims to address the myopic behavior issue present in existing approaches, which decompose expert demonstrations into sequences of individual goals and target each goal independently. The key innovation is directly optimizing an occupancy matching objective using optimal transport between the expert's and policy's state distributions, rather than treating goals independently. This allows ZILOT to consider multiple future goals during planning, leading to more sophisticated non-myopic behavior. The empirical results across manipulation and locomotion tasks show that ZILOT outperforms baseline methods, particularly in scenarios requiring longer-term planning. The reviewers appreciated the novelty and motivation of the approach as well as its theoretical underpinnings. The main weaknesses are: (a) ZILOT requires solving computationally expensive optimal transport problems during planning, leading to significantly slower inference compared to baselines, (b) the finite planning horizon means some degree of myopic behavior remains, though less severe than in the baselines, (c) missing comparisons to relevant baselines like FB-IL and OTR, though these were partially addressed in the rebuttal, and (d) questions about whether the added complexity of non-Markovian policies is truly necessary when Markovian policies may be sufficient. Overall, the paper had a mixed set of reviews, and during the internal discussion phase, reviewer Lfwv reiterated some of the above limitations, which the authors are encouraged to address in a subsequent version.

Additional Comments on Reviewer Discussion

See above.

Final Decision

Reject