Diffusion Imitation from Observation
Abstract
Reviews and Discussion
This paper introduces Diffusion Imitation from Observation, an adversarial imitation learning method using a conditional diffusion model as the discriminator for policy learning. DIFO learns a diffusion model on expert and agent state transitions, with an auxiliary binary classification objective to discriminate between the expert and the agent. Experiments show that DIFO can learn an imitation policy online, given only expert demonstration observations, outperforming relevant LfO baselines. Ablations and variants show that both loss terms are important for the diffusion model training, and justify the choice of diffusion model conditioning input.
Strengths
- Using conditional diffusion models for adversarial imitation learning from observations is a novel approach.
- The loss objectives in the diffusion model are well-motivated and shown to be necessary for downstream policy performance.
- The trained diffusion model captures the expert distribution, and exhibits some generalizability in the generated trajectories.
- Experiment results show that DIFO works across multiple state-based and image-based environments, outperforming previous methods for learning from observation.
Weaknesses
- This method requires online interactions to train the diffusion discriminator and the policy. One popular alternative approach in this setting (online + expert demonstrations) is optimal transport-based RL, which also works with only access to expert demo observations (optimal transport over expert and agent state trajectories as the reward). It would be convincing to see comparisons with this line of work.
- It would be great to see more evaluations on image-based environments.
Questions
- It seems unclear to me why behavioral cloning, with access to ground truth action labels, is doing worse than LfO methods in Figure 3.
- Does DIFO work with negative samples generated offline, e.g. adding Gaussian noise to the previous state, instead of policy online rollouts?
Limitations
- Most experiments are on state-based environments. It would be convincing to see more evaluations on image-based environments.
- This method requires online rollouts for collecting negative transitions for discriminator training. It would be great if this method can be shown to work completely offline.
We sincerely thank the reviewer for the thorough and constructive comments. Please find the response to your questions below.
This method requires online interactions to train the diffusion discriminator and the policy. One popular alternative approach in this setting (online + expert demonstrations) is optimal transport-based RL, which also works with only access to expert demo observations (optimal transport over expert and agent state trajectories as the reward). [1] [2] It would be convincing to see comparisons with this line of work.
We thank the reviewer for providing these references. We will revise the paper to discuss these works. As requested by the reviewer, we additionally implemented Optimal Transport (OT) [1]. OT uses proxy rewards derived from Sinkhorn distances rather than directly obtaining rewards from the raw logits of the discriminator, which enhances stability compared to AIL methods.
We report the result in Figure R.1 in the PDF file attached to the rebuttal summary, which shows that our method consistently outperforms OT across all evaluated tasks. We hypothesize that this is because OT computes distances at the trajectory level rather than the transition level, which requires monotonic trajectories. Consequently, OT performs well in environments like Walker and AdroitDoor, where trajectory variety is limited. However, it struggles in tasks with diverse trajectories, such as navigation, where the initial and goal states vary significantly. In contrast, our method generates rewards at the transition level, allowing us to identify transition similarities even when facing substantial trajectory variability. This flexibility enables our method to succeed in more complex environments with diverse trajectories.
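To make the contrast concrete, below is a minimal NumPy sketch of such a Sinkhorn-based, trajectory-level proxy reward (the cost function, entropic regularization, and reward scaling are illustrative choices, not the exact implementation we evaluated):

```python
import numpy as np

def sinkhorn_plan(cost, reg=0.05, n_iters=200):
    """Entropy-regularized optimal transport plan between two uniform marginals,
    computed with standard Sinkhorn iterations."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / reg)
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]  # transport plan with marginals (a, b)

def ot_proxy_rewards(agent_states, expert_states):
    """Per-step proxy rewards from aligning the whole agent trajectory (T_a, d)
    with the expert trajectory (T_e, d): each agent state is rewarded by (minus)
    the transport cost it contributes under the optimal plan."""
    cost = np.linalg.norm(agent_states[:, None, :] - expert_states[None, :, :], axis=-1)
    cost = cost / (cost.max() + 1e-8)  # normalize costs to keep exp(-cost/reg) well behaved
    P = sinkhorn_plan(cost)
    return -(P * cost).sum(axis=1)
```

Because each per-step reward depends on the optimal plan over the entire trajectories, a reasonable transition can still be penalized when the two trajectories as a whole are hard to align; a transition-level discriminator instead scores each (s, s') pair independently.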
We will revise the paper to include the results and the discussion.
It would be great to see more evaluations on image-based environments.
As requested by the reviewer, we additionally conducted experiments on the drawer close task introduced in Meta-World [2] using image-based states. This table-top manipulation task requires the agent to control a Sawyer robotic arm to close a drawer. An illustration of this task is shown in Figure R.5(a) in the PDF file attached to the rebuttal summary.
Figure R.5(b) presents the learning efficiency of our proposed method and the baselines. Our proposed method achieves an 80% success rate with only 70k environment interactions, outperforming BC, BCO, GAIfO, WAILfO, AIRLfO, IQ-Learn, and DePO.
We will revise the paper to include the results of this new image-based environment.
It seems unclear to me why behavioral cloning, with access to ground truth action labels, is doing worse than LfO methods in Figure 3.
While BC has access to ground-truth actions, it may suffer from compounding errors, i.e., the accumulation of errors from small initial deviations caused by covariate shift [3, 4, 5]. Because BC learns solely from the observed expert dataset, unlike the LfO methods that utilize online interaction with the environment, it is susceptible to accumulating errors from slightly deviated states, ultimately reaching an irrecoverable point. Therefore, BC is well known to require a substantial amount of expert data to cover the state space and reduce the number of unseen states. Under our experimental setting with limited expert data, BC does not have sufficient data to generalize to unseen states, leading to poor performance.
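For reference, the compounding-error argument can be stated with the classic bound from [3, 4] (paraphrased here with 0-1 costs bounded in [0, 1]; see those papers for the exact assumptions): if the cloned policy errs with probability at most ε under the expert's state distribution, the total cost can still degrade quadratically in the task horizon T.

```latex
% Behavioral cloning bound (Ross & Bagnell, 2010), paraphrased:
% T = task horizon, J = expected total cost, d_{\pi^*} = expert state distribution.
\epsilon \;=\; \mathbb{E}_{s \sim d_{\pi^{*}}}\!\left[\mathbb{1}\{\pi_{\mathrm{BC}}(s) \neq \pi^{*}(s)\}\right],
\qquad
J(\pi_{\mathrm{BC}}) \;\le\; J(\pi^{*}) \;+\; T^{2}\,\epsilon .
```

With limited expert data the supervised error ε remains large on rarely covered states, and the quadratic dependence on T is what allows LfO methods that correct themselves through online interaction to outperform BC.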
Does DIFO work with negative samples generated offline, e.g. adding Gaussian noise to the previous state, instead of policy online rollouts?
We thank the reviewer for the insightful idea. This idea of generating and utilizing negative samples offline for imitation learning resembles the idea of Implicit Behavior Cloning (Implicit BC) [6]. Implicit BC learns the joint probability of expert state-action pairs via contrastive learning using expert data and offline generated negative data without policy online rollouts.
We believe it is possible to generate negative data offline by adding Gaussian noise to the previous state, as suggested by the reviewer, and learning our diffusion classifier to classify expert state transitions and the generated state transitions. However, unlike Implicit BC, our work focuses on learning from observation, where we do not have access to action labels in expert demonstrations. Hence, without online interactions, it would be impossible to know the action space, let alone learn a policy. Thus, we believe that online interactions are essential for policy learning.
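For concreteness, such offline negatives could be generated roughly as follows (a hypothetical sketch of the reviewer's suggestion; the noise scale is arbitrary):

```python
import numpy as np

def offline_negative_transitions(expert_s, expert_s_next, noise_std=0.1):
    """Hypothetical offline negatives: keep the conditioning state s but perturb the
    successor state s' with Gaussian noise, so the pair no longer follows the expert
    transition distribution. No environment rollouts are needed to produce these."""
    fake_s_next = expert_s_next + np.random.normal(0.0, noise_std, size=expert_s_next.shape)
    return expert_s, fake_s_next
```

Such pairs could serve as negatives for the classification objective; however, as noted above, they would not remove the need for online interaction when it comes to learning the policy itself in the LfO setting.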
References
[1] Papagiannis et al. "Imitation learning with Sinkhorn Distances." In ECML PKDD, 2022.
[2] Yu et al. "Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning." In CoRL, 2019.
[3] Ross et al. “Efficient Reductions for Imitation Learning.” In AISTATS, 2010.
[4] Ross et al. “A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning.” In AISTATS, 2011.
[5] Laskey et al. “SHIV: Reducing supervisor burden in DAgger using support vectors for efficient learning from demonstrations in high dimensional state spaces.” In ICRA, 2016.
[6] Florence et al. “Implicit Behavioral Cloning.” In CoRL, 2022.
Thank you for the additional experiments and comments addressing the concerns. Overall I am positive about the paper and I will maintain my score of Accept.
We sincerely thank the reviewer for acknowledging our rebuttal and helping us improve our submission.
The paper introduces a novel method named Diffusion Imitation from Observation (DIFO), which integrates diffusion models into the adversarial imitation learning from observation (LfO) framework. Traditional adversarial imitation learning methods often struggle with hyperparameter sensitivity and training stability. DIFO leverages the generative capabilities of diffusion models to improve the imitation process. Specifically, the diffusion model generates the next state based on the current state, and its learning objective is reformulated as a binary classification task to distinguish between expert and agent transitions. This model then provides "realness" rewards to guide the policy learning process. The paper demonstrates that DIFO achieves superior performance across various continuous control tasks compared to existing LfO methods.
Strengths
- The paper introduces a novel approach by integrating diffusion models into the adversarial imitation learning from observation (LfO) framework, enhancing both stability and performance.
- The proposed DIFO method consistently outperforms existing LfO methods across various continuous control tasks, demonstrating improved data efficiency.
Weaknesses
- Although the authors name their paper "Diffusion Imitation from Observation," the performance gain does not seem to be due to the use of the diffusion model. As shown in Section 5.7, using the diffusion loss alone demonstrates very poor results. The major contributing factor is the BCE loss and the discriminator as a whole. It appears more like using diffusion loss as regularization for discriminator training. Do other regularization techniques lead to similar improvements, and why are diffusion models more suitable in this setting (an intuitive explanation)?
- The LfO baselines in the experiments are rather outdated. More recent LfO baselines should be compared and discussed [1, 2].
- Minor problems:
- In Line 111, there seems to be a missing symbol following the first tilde, and it should denote the "parameters" of the diffusion model.
- In the Preliminary section, the same symbol is used to denote the environment step and is later reused for the diffusion step, which might cause confusion. It is common practice in diffusion-based RL papers to denote the diffusion step and the environment step with different symbols, respectively written as superscripts and subscripts.
- The "Expert" lines are missing in Figure 3 and Figure 4.
[1] Liu et al. "Plan Your Target and Learn Your Skills: Transferable State-Only Imitation Learning via Decoupled Policy Optimization." In ICML, 2022.
[2] Liu et al. "Imitation Learning from Observation with Automatic Discount Scheduling." In ICLR, 2024.
Questions
- The diffusion discriminator is different from prior methods in that we can sample from the distribution 𝐷(𝑠,𝑠′). Can the resulting discriminator itself be used as a policy when combined with an inverse dynamics model 𝑝(𝑎|𝑠,𝑠′), and how does it perform during evaluation?
Limitations
Authors listed the limitation of their method in the appendix.
We sincerely thank the reviewer for the thorough and constructive comments. Please find the response to your questions below.
As shown in Section 5.7, using the diffusion loss alone demonstrates very poor results. The major contributing factor is the BCE loss and the discriminator as a whole.
Diffusion models have shown their ability to become classifiers with the diffusion loss [1]. DIFO-NA demonstrates that using the diffusion loss from an offline pre-trained model provides valid rewards, as shown in Figure 3(a), indicating that the diffusion loss can be a reasonable metric for the discriminator. However, as the agent continues to learn, it generates transitions that increasingly resemble those of the expert, causing the offline model to eventually lose its ability to provide precise rewards. Hence, we integrate the AIL framework to improve the discriminator simultaneously with the policy.
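To illustrate how a denoising loss can act as a classification score, below is a minimal PyTorch sketch (the interface `eps_model(noisy_next, s, t)` and the use of the score as a reward are placeholders, not our exact implementation):

```python
import torch

def denoising_score(eps_model, s, s_next, alphas_bar):
    """Score a batch of (s, s') transitions by the conditional diffusion model's
    denoising error at a randomly sampled diffusion step; lower error means the
    transition looks more like the expert data the model was trained on.
    `eps_model(noisy_next, s, t)` is assumed to predict the injected noise."""
    t = torch.randint(0, len(alphas_bar), (s.shape[0],))               # random diffusion step
    a_bar = alphas_bar[t].unsqueeze(-1)
    noise = torch.randn_like(s_next)
    noisy_next = a_bar.sqrt() * s_next + (1.0 - a_bar).sqrt() * noise  # forward process q(x_t | x_0)
    pred = eps_model(noisy_next, s, t)
    mse = ((pred - noise) ** 2).mean(dim=-1)
    return -mse  # higher = more expert-like; a DIFO-NA-style reward can be derived from this
```

In the full method, this denoising objective is combined with the binary classification objective so that the discriminator keeps adapting as the agent improves.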
It appears more like using diffusion loss as regularization for discriminator training. Do other regularization techniques lead to similar improvements?
WAILfO builds on GAIfO's work by incorporating gradient penalty as a regularization to improve performance. Additionally, our implementations of GAIfO, AIRLfO, and WAILfO already employ L2 regularization. The experimental results show that our proposed method outperforms these methods with various regularizations. We would like to highlight that these regularizations could also be applied to our method.
Why are diffusion models more suitable in this setting (an intuitive explanation)?
We hypothesize that there are two reasons why our diffusion model classifier performs better than the GAIL discriminator. First, the instability of GAIL arises from the tendency of discriminators to overfit, resulting in the policy's inability to learn from discriminator rewards. Unlike GAIL's MLP binary classifier, which maps a high-dimensional input to a one-dimensional logit, our diffusion model learns to predict high-dimensional noises, which is inherently more difficult to overfit. Second, diffusion models excel at modeling multimodal distributions and thus can outperform GAIL's discriminator when expert demonstrations exhibit significant variability. We will revise the paper to include these intuitions.
More recent LfO baselines should be compared and discussed [1, 2].
We thank the reviewer for these references. We will revise the paper to discuss these works. As requested by the reviewer, we additionally included the results of two recent methods: Decoupled Policy Optimization (DePO) [2] and Optimal Transport (OT) [3], suggested by Reviewer P9FE, which can be considered a more general version of ADS [4], suggested by Reviewer Dab5. We report the results in Figure R.1 in the PDF file attached to the rebuttal summary. The results show that our proposed method outperforms both DePO and OT.
DePO decouples the policy into a high-level state planner and an inverse dynamics model (IDM), utilizing an embedded decoupled policy gradient and generative adversarial training. This decoupling makes the planner and the IDM transferable across different domains. However, DePO still suffers from the same challenges as GAIL, such as overfitting and the inability to model multimodality. Moreover, it requires jointly training an IDM, making it more difficult to tune.
OT uses proxy rewards derived from Sinkhorn distances rather than directly obtaining rewards from the raw logits of the discriminator, which enhances stability compared to AIL methods. However, OT computes distances at the trajectory level rather than the transition level, which requires monotonic trajectories. Consequently, OT performs well in environments like Walker and AdroitDoor, where trajectory variety is limited. However, it struggles in tasks with diverse trajectories, such as navigation, where the initial and goal states vary significantly. In contrast, our method generates rewards at the transition level, allowing us to identify transition similarities even when facing substantial trajectory variability. This flexibility enables our method to succeed in more complex environments with diverse trajectories.
We will include the results and discussion of these two methods in the revised paper.
The diffusion discriminator is different from prior methods in that we can sample from the distribution 𝐷(𝑠,𝑠′). Can the resulting discriminator itself be used as a policy when combined with an inverse dynamics model 𝑝(𝑎|𝑠,𝑠′)?
As suggested by the reviewer, we believe that combining an ideal inverse dynamics model (IDM) with our diffusion discriminator could indeed result in an effective policy. This potential is illustrated by our successful trajectory generation in PointMaze experiments (Section 5.5 and Figure 5), where our diffusion discriminator shows the capability to accurately predict the next state based on the current state.
However, obtaining a high-quality IDM is challenging, as discussed in lines 27-29. Training an IDM requires data that is well aligned with the expert's data distribution, while collecting such data relies on having an effective policy, creating a deadlock. Even if this problem were solved, the planner may still produce invalid transitions for the IDM, posing challenges for both planners and IDMs.
Given these difficulties, our method directly learns a policy with the rewards produced by the diffusion discriminator instead of addressing the difficulties of obtaining an ideal IDM. We will revise the paper to make this clear.
References
[1] Li et al. "Your diffusion model is secretly a zero-shot classifier." In ICCV, 2023.
[2] Liu et al. "Plan your target and learn your skills: Transferable state-only imitation learning via decoupled policy optimization." In ICML, 2022.
[3] Papagiannis et al. "Imitation learning with Sinkhorn Distances." In ECML PKDD, 2022.
[4] Liu et al. “Imitation Learning from Observation with Automatic Discount Scheduling.” In ICLR, 2024.
We would like to express our sincere gratitude to the reviewer for the thorough and constructive feedback. We are confident that our responses adequately address the concerns raised by the reviewer, including the following points.
- A clarification of the importance of the BCE and denoising (MSE) losses
- Additional results of employing AIL regularization techniques
- An intuitive explanation of why diffusion models are suitable for AIL
- Additional results of more recent LfO baselines: Decoupled Policy Optimization and Optimal Transport
- A discussion of combining our diffusion model and an inverse dynamics model
Please kindly let us know if the reviewer has any additional concerns or if further experimental results are required. We are fully committed to resolving any potential issues, should time permit. Again, we thank the reviewer for all the detailed review and the time the reviewer put into helping us to improve our submission.
I appreciate the authors' thorough explanation of diffusion loss and BCE loss, as well as the additional baseline results. As the authors' responses have resolved my concerns, I will recommend accepting the paper.
We sincerely thank the reviewer for acknowledging our rebuttal and helping us improve our submission.
This paper introduces Diffusion Imitation from Observation (DIFO), a novel approach to Imitation Learning from Observation (ILfO). DIFO innovates by employing a diffusion model as the discriminator, departing from the conventional feed-forward neural network approach. The authors leverage the connection between the diffusion model's training loss and the Evidence Lower Bound (ELBO) to construct a discriminator based on the relative likelihood of state-pairs originating from the expert rather than the agent. Experimental results across six environments demonstrate DIFO outperforming baseline methods.
Strengths
The utilization of diffusion models' loss as ELBO to build a discriminator presents a novel and potentially fruitful direction in the field of ILfO.
Weaknesses
- While DIFO demonstrates sample efficiency in terms of expert demonstrations, it lacks efficiency in environment interactions.
- The experimental scope is limited, with tests conducted on only six environments and comparisons made against relatively older baselines.
Questions
- There are other methods [1, 2, 3] that use metrics from generative models, i.e., likelihood, ELBO, and entropy, for ILfO. I suggest discussing them in the related works.
- How many samples are needed to compute the discriminator score in Eq. 4? Is the value stable across different samples?
- Although you are motivated to improve the robustness of AILfO, I don't see a direct connection that using a diffusion model instead of an FFN should make things more stable, since in the end you are still running adversarial learning. Could you please elaborate on this?
- In Figure 7, some of the lines are too similar to each other in terms of both colours and symbols and are really hard to distinguish. Could you at least change one of them to a different colour or symbol?
Reference:
[1] Escontrela et al., Video Prediction Models as Rewards for Reinforcement Learning, NeurIPS 2023
[2] Zhang et al., Action Inference by Maximising Evidence: Zero-Shot Imitation from Observation with World Models, NeurIPS 2023
[3] Huang et al., Diffusion Reward: Learning Rewards via Conditional Video Diffusion
Limitations
Yes, the author has discussed the limitations in the appendix.
We sincerely thank the reviewer for the thorough and constructive comments. Please find the response to your questions below.
There are other methods [1, 2, 3] that use metrics from generative models, i.e., likelihood, ELBO, and entropy, for ILfO. I suggest discussing them in the related works.
We thank the reviewer for providing the references. We will revise the paper to discuss these methods using metrics from generative models for ILfO.
How many samples are needed to compute the discriminator score in Eq. 4? Is the value stable across different samples?
We sample a single denoising timestep (n=1) to compute the reward. As suggested by the reviewer, to investigate the robustness of our rewards with a varying number of timestep samples, we conducted experiments in PointMaze by averaging different numbers of timestep samples (n=1, 2, 5, 10) as rewards. The result presented in Figure R.3 in the PDF file attached to the rebuttal summary shows that our method can learn smoothly with a single sample.
To verify the stability of the rewards, we present the standard deviation to mean ratio of the rewards from 500 timesteps across different numbers of timestep samples in the table below. The values are averaged from a batch of 4096 transitions. The result below suggests that the reward is stable.
| Learning Progress | n=1 | n=2 | n=5 | n=10 |
|---|---|---|---|---|
| 20% | 0.323 | 0.237 | 0.294 | 0.246 |
| 40% | 0.234 | 0.199 | 0.206 | 0.230 |
| 60% | 0.201 | 0.175 | 0.157 | 0.206 |
| 80% | 0.157 | 0.152 | 0.150 | 0.190 |
| 100% | 0.142 | 0.133 | 0.145 | 0.161 |
Although you are motivated to improve the robustness of AILfO, I don't see a direct connection that using a diffusion model instead of an FFN should make things more stable
We hypothesize that the instability of most existing AIL frameworks arises from the tendency of FFN discriminators to overfit, resulting in the policy's inability to learn from discriminator rewards. Unlike FFN binary classifiers, which map a high-dimensional input to a one-dimensional logit, our diffusion model learns to predict high-dimensional noises, which is inherently more difficult to overfit.
Furthermore, expert demonstrations typically exhibit significant variability, which could be difficult for FFNs to model effectively [1]. Diffusion models, however, are adept at handling such multimodality, providing a more robust approach for capturing the diverse patterns present in expert behavior. We will revise the paper to include these insights.
While DIFO demonstrates sample efficiency in terms of expert demonstrations, it lacks efficiency in environment interactions.
We respectfully disagree with the reviewer. Figure 3 in the main paper shows that our method, DIFO, is very efficient in terms of environment interactions, i.e., sample efficiency. In PointMaze, DIFO requires only about 175k environment steps to achieve a 60% success rate, while the best-performing baseline, AIRLfO, needs over 400k steps. Similarly, in AdroitDoor, DIFO achieves a 75% success rate with only 4M steps, whereas the best-performing baseline, WAILfO, needs around 10M steps. In other words, DIFO is very sample-efficient.
The table below presents the number of environment steps required for each method to achieve 50% of expert performance. The results show that DIFO reaches the same performance with significantly fewer environment steps (environment interactions), further confirming its sample efficiency.
| Environment | Goal | DIFO (Ours) | BCO | GAIfO | WAILfO | AIRLfO | IQ-Learn |
|---|---|---|---|---|---|---|---|
| PointMaze | Success rate=50% | 120k | 100k | 850k | x | 310k | 160k |
| AntMaze | Success rate=50% | 1.1M | x | x | x | x | x |
| FetchPush | Success rate=50% | 630k | x | 510k | 670k | x | x |
| AdroitDoor | Success rate=50% | 3.1M | x | x | 4.9M | x | x |
| Walker | Return=3000 | 1.0M | x | 2.0M | 1.7M | x | x |
| CarRacing | Return=400 | 400k | x | 1.1M | x | x | x |
The experimental scope is limited, with tests conducted on only six environments and comparisons made against relatively older baselines.
We conducted extensive experiments to evaluate our method across various aspects.
- The six test environments we selected cover a wide range of domains, including navigation (PointMaze and AntMaze), manipulation (FetchPush and AdroitDoor), locomotion (Walker), and games (CarRacing). In contrast, most existing works [2, 3, 4, 5] focus only on locomotion tasks.
- The tasks included in our work cover both vectorized-state-based environments (PointMaze, AntMaze, FetchPush, AdroitDoor, and Walker) and image-based environments (CarRacing and DrawerClose), while most existing works [2, 3, 4, 5, 6, 7, 8, 9] only consider vectorized-state-based environments.
- Our work analyzes the data efficiency by varying the amount of available expert demonstrations and evaluating the learning performance of all methods in Section 5.4.
- We designed a toy environment and conducted experiments on it to visualize the reward functions learned by our proposed method and GAIfO.
More baselines
Please see the response to Reviewer Dab5.
References
[1] Li et al. "Infogail: Interpretable imitation learning from visual demonstrations." In NIPS, 2017.
[2] Liu et al. "Plan your target and learn your skills: Transferable state-only imitation learning via decoupled policy optimization." ICML, 2022.
[3] Papagiannis et al. "Imitation learning with sinkhorn distances." In ECML PKDD, 2022.
[4] Ni et al. "f-irl: Inverse reinforcement learning via state marginal matching." In CoRL, 2021.
[5] Liu et al. “CEIL: Generalized Contextual Imitation Learning.” In NeurIPS, 2023.
[6] Garg et al. "Iq-learn: Inverse soft-q learning for imitation." In NeurIPS, 2021.
[7] Fu et al. "Learning robust rewards with adversarial inverse reinforcement learning." In ICLR, 2018.
[8] Ho et al. "Generative adversarial imitation learning." In NIPS, 2016.
[9] Xiao et al. "Wasserstein adversarial imitation learning." arXiv, 2019.
I would like to thank the authors for their rebuttal. The additional experiments and clarifications do make the results more convincing. In light of this, I will increase my score from 5 to 6.
We sincerely thank the reviewer for acknowledging our rebuttal and helping us improve our submission.
This paper leverages a diffusion model to learn expert state transitions and, additionally, a discriminator that can differentiate expert and agent states. The paper conducts experiments on standard RL environments and demonstrates better performance and data efficiency.
Strengths
- The idea is simple and its experimental results demonstrate strong performance and data efficiency
- The paper is well written in general
Weaknesses
- Under Section 4.3, why do we only learn the MSE on the expert data but not also on the agent data? My understanding is that both datasets will impact the discriminator output based on Section 4.2.
- Diffusion models are usually used for modelling multimodal distribution. I have two questions:
- Is the expert stochastic? Does the expert policy exhibit diverse behaviours?
- Is it more convincing to have stochastic environments rather than purely deterministic environments?
- The experimentation seems very limited for the amount of theory it provides---in particular there is no theory regarding whether the proposed objective is truly sound.
- Tuning the λ's seems difficult as the performance can vary by ~20% in success rate according to Figure 7. How should one address this?
Questions
Questions
- On page 8, line 298, the paper indicates that a "smoother contour" allows for bringing a learning agent closer to the expert. I was wondering if this property is only useful for specific environments but not in general. Dense reward may be helpful but often can misguide policy learning.
Possible typos
- Page 3, line 90: "can't" should be "cannot"
Limitations
The approach leverages only expert data; it remains to extend it to settings with suboptimal demonstrations.
We sincerely thank the reviewer for the thorough and constructive comments. Please find the response to your questions below.
the paper indicates that a "smoother contour" allows for bringing a learning agent closer to the expert. I was wondering if this property is only useful for specific environments but not in general. Dense reward may be helpful but often can misguide policy learning.
We thank the reviewer for the question. The objective of RL is to maximize the overall return. The RL algorithm should converge to the optimal policy as long as the reward function produces the highest reward at the optimal transition. A smoother contour helps guide the RL algorithm without misguiding the final policy. Our method's reward peaks align with expert distribution (see Figure 7(a) in the main paper), ensuring accurate guidance.
Moreover, under the AIL framework, the reward function evolves alongside the policy. The reward function becomes sparser as the policy improves and eventually converges to a relatively sparse condition. Detailed theoretical guarantees can be found in the GAIL paper [1]. Besides, many previous imitation learning and inverse RL papers visualize the reward function with plots similar to ours [2, 3, 4, 5]. Hence, we believe that this approach is widely accepted for evaluating reward functions.
Lastly, we would like to highlight that we evaluate DIFO in diverse domains, including navigation, manipulation, locomotion, and control, in both state- and image-based environments. Our evaluation results demonstrate that DIFO is broadly applicable and not restricted to specific environments.
Why do we only learn the MSE on the expert data but not also the agent data?
Please see the overall response.
Is the expert stochastic? Does the expert policy exhibit diverse behaviors?
All of our expert data incorporates stochasticity to enhance diversity in trajectories. In the PointMaze environment, our expert trajectory data is generated using a breadth-first search planner with added stochasticity to create different path choices, resulting in various trajectories for the same start and end positions. In other environments, we add a small amount of noise to the experts' actions. These methods provide stochasticity and multimodality in expert behaviors. Our model effectively learns from such expert demonstrations, demonstrating its capability to handle multimodal distributions.
Diffusion models are usually used for modeling multimodal distribution.
As discussed in Section 5.5 and Section D of the main paper, we use a trained diffusion model to generate maze trajectories. As previously mentioned, our expert data exhibits a certain degree of multimodality, meaning that there are different paths in the dataset for the same start and goal locations. The generated results support our observations, as our model effectively produces multimodal possible paths. This demonstrates the benefit of modeling multimodal distributions using diffusion models.
Is it more convincing to have stochastic environments rather than purely deterministic environments?
Please see the overall response.
The experimentation seems very limited for the amount of theory it provides---in particular there is no theory regarding whether the proposed objective is truly sound.
Our method builds upon the established theories of Diffusion Classifier [6] and GAIL [1], as described in Section 4. The Diffusion Classifier paper shows that the diffusion loss (Equations 2 and 3) approximates ELBO, effectively transforming a diffusion model into a classifier. GAIL proves that IRL is a dual problem of an occupancy measure matching problem. Consequently, running RL with the discriminator IRL reward, which recovers the primal optimum from the dual optimum, ensures that the resulting policy is the primal optimum. As a result, by adversarially training the discriminator and the policy, we can match the distribution of the agent's policy to that of the expert. By incorporating the diffusion model as the discriminator into the AIL framework, our method leverages the strengths of the diffusion model while maintaining optimality guarantees. We will revise the paper to include these details.
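As a high-level summary of these two ingredients (paraphrased; weights and constants are suppressed, and the discriminator's label convention may differ from the one used in our paper):

```latex
% (i) Diffusion loss as an approximate ELBO (Li et al., 2023): for a conditional
%     diffusion model over next states,
\log p_\theta(s' \mid s) \;\ge\; \mathrm{ELBO}(s' \mid s)
\;\approx\; -\,\mathbb{E}_{t,\epsilon}\!\left[\left\| \epsilon - \epsilon_\theta(s'_t, s, t) \right\|^2\right] + \text{const},
% so comparing denoising errors amounts to comparing approximate log-likelihoods,
% which turns the diffusion model into a classifier.

% (ii) Adversarial imitation as occupancy matching (Ho & Ermon, 2016), written
%      over state transitions for the LfO setting, with D(s, s') estimating the
%      probability that a transition comes from the expert:
\min_{\pi}\;\max_{D}\;\;
\mathbb{E}_{(s,s') \sim \pi_E}\!\left[\log D(s,s')\right]
+ \mathbb{E}_{(s,s') \sim \pi}\!\left[\log\!\big(1 - D(s,s')\big)\right]
- \lambda_H\, H(\pi).
```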
We believe our experiments are comprehensive. Please see the response to Reviewer AzJp.
Tuning the lambda seems difficult as the performance can vary by ~20% in success rate according to Figure 7.
Figure 7 aims to investigate the effect of λ, and therefore, we experimented with a very wide range of values. The performance only drops by about 20% when λ is increased or decreased by two orders of magnitude from the optimal value we identified. In Figure 7(a), the performances across λ values are comparable, with overlapping standard deviations, indicating no statistically significant difference between them. Similarly, in Figure 7(b), although the performance appears to vary slightly, the standard deviations overlap. Therefore, we believe DIFO is not sensitive to λ; any λ in a reasonable range [0.001, 0.1] can yield good performance.
References
[1] Ho et al. "Generative adversarial imitation learning." In NIPS, 2016.
[2] Garg et al. "Iq-learn: Inverse soft-q learning for imitation." In NeurIPS, 2021.
[3] Fu et al. "Learning robust rewards with adversarial inverse reinforcement learning." In ICLR, 2018.
[4] Ni et al. "f-irl: Inverse reinforcement learning via state marginal matching." In CoRL, 2021.
[5] Liu et al. "Energy-based imitation learning." In AAMAS, 2021.
[6] Li et al. "Your diffusion model is secretly a zero-shot classifier." In ICCV, 2023.
Thank you for the detailed response and the additional experiments. In short, I will increase my score.
Regarding the theory: Please include the response as a high-level intuition to explain why this approach is reasonable. Although I was expecting something like convergence bounds, sample-complexity bounds, etc.
Regarding λ: Please also include a short sentence on this; I believe this can help practitioners when they aim to reproduce this work.
A further question about the computational cost: how much time should one expect when running the proposed algorithm compared to existing ones?
We sincerely thank the reviewer for acknowledging our rebuttal and helping us improve our submission. As suggested by the reviewer, we will revise the paper to include high-level intuitions explaining why this approach is reasonable and a detailed description of λ as provided in the rebuttal. Moreover, we will make the code, scripts, datasets, and model weights publicly available to ensure reproducibility.
Computational cost: Below, we provide the approximate time it took for our method to learn each task. Note that the times below were estimated while multiple jobs were running on the same workstation, and the running time could therefore be overestimated. The workstation is equipped with an Intel Xeon W7-2475X CPU, two RTX 4090 GPUs, and two 64GB DDR5-4800 RDIMM modules.
| Task | PointMaze | AntMaze | FetchPush | AdroitDoor | Walker | CarRacing |
|---|---|---|---|---|---|---|
| Time | 40 minutes | 3 hours | 2.5 hours | 5 hours | 7 hours | 8 hours |
We will include this information in the revised paper.
The attached PDF file contains the following content:
- Two additional baselines (DePO and OT) and the results [Reviewer AzJp, Reviewer Dab5, Reviewer P9FE]: We additionally include two recent and relevant baselines: Decoupled Policy Optimization (DePO, 2022) [1] and Optimal Transport (OT, 2022) [2] suggested by reviewer Dab5 and P9FE. These baselines represent two other major families of LfO approaches: inverse dynamics models (IDM) and sequence matching. The experiment results presented in Figure R.1 show that our method outperforms both DePO and OT across all tasks.
- Optimizing the MSE loss with agent data [Reviewer 7W6h]: We evaluate optimizing the diffusion model denoising loss with and without agent data in all tasks. The results presented in Figure R.2 show that optimizing with agent data leads to slower convergence and less stable performance, which justifies our choice of only optimizing with expert data.
- Sampling different numbers of denoising timesteps for computing rewards [Reviewer AzJp]: To examine how varying the number of denoising timestep sampled for computing rewards impacts policy learning, we experiment with different sample sizes (n=1, 2, 5, 10) in PointMaze. The results are shown in Figure R.3, indicating the effect of various numbers of samples is statistically insignificant, which justifies our choice of using one timestep sample for computing rewards.
- Stochastic environments [Reviewer 7W6h]: To investigate our diffusion model's ability to handle stochastic data, we created a stochastic AntMaze environment by adding strong Gaussian noise with a standard deviation of 0.5 to the actions. The results are shown in Figure R.4, demonstrating our method maintains robust performance under stochasticity.
- Additional image-based Meta-World drawer close task [Reviewer P9FE]: In response to the suggestion of adding more image-based tasks, we introduced the drawer close task from Meta-World [3]. This table-top manipulation task requires the agent to control a Sawyer robotic arm to close a drawer. Figure R.5(a) provides a screenshot of the task, and the learning efficiency is depicted in Figure R.5(b). Our method, DIFO, demonstrates superior performance compared to the baselines, including BC, BCO, GAIfO, WAILfO, AIRLfO, IQ-Learn, and DePO.
[Reviewer 7W6h] Why do we only learn the MSE on the expert data but not also the agent data?
Please see the overall response. As pointed out by the reviewer, our method optimizes the denoising (MSE) loss (Eq. 6), which approximates the ELBO, using only expert demonstrations. We hypothesize that also optimizing this loss on agent data leads to unstable training because, during the early stage of training, the agent policy changes frequently and generates diverse transitions.
To investigate the effect of optimizing this MSE loss on agent data, we experiment with optimizing it with and without agent data on all tasks. The results are reported in Figure R.2 in the PDF file attached to the rebuttal summary. We found that optimizing with agent data can lead to slower and less stable convergence, especially in tasks with larger state and action spaces, e.g., AdroitDoor, where it leads to a 0% success rate. We hypothesize that the rapidly changing agent distribution hinders learning the diffusion model when it is optimized with agent data, making the overall performance less stable. Hence, we design our method to optimize the MSE loss using only expert data.
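To make the design explicit, below is a minimal PyTorch sketch of one discriminator update under this choice (the names `model.classify`, `model.denoise`, `model.add_noise`, `model.num_steps`, and the weighting are placeholders, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def discriminator_update(model, optimizer, expert_batch, agent_batch, lambda_mse=0.01):
    """One hypothetical update step: the BCE term sees both expert and agent
    transitions, while the denoising (MSE) term is computed on expert data only."""
    s_e, s_next_e = expert_batch   # expert (s, s') pairs
    s_a, s_next_a = agent_batch    # agent (s, s') pairs

    # Binary classification term: expert transitions labeled 1, agent transitions 0.
    logits_e = model.classify(s_e, s_next_e)
    logits_a = model.classify(s_a, s_next_a)
    bce = F.binary_cross_entropy_with_logits(logits_e, torch.ones_like(logits_e)) \
        + F.binary_cross_entropy_with_logits(logits_a, torch.zeros_like(logits_a))

    # Denoising term: expert transitions only, so the generative part keeps tracking
    # the stationary expert distribution rather than the shifting agent distribution.
    noise = torch.randn_like(s_next_e)
    t = torch.randint(0, model.num_steps, (s_e.shape[0],))
    pred = model.denoise(model.add_noise(s_next_e, noise, t), s_e, t)
    mse = F.mse_loss(pred, noise)

    loss = bce + lambda_mse * mse
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The variant evaluated in Figure R.2 simply adds an analogous MSE term computed on the agent batch.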
We thank the reviewer for the question. We will revise the paper to include this experiment and the discussion.
[Reviewer 7W6h] Is it more convincing to have stochastic environments rather than purely deterministic environments?
We thank the reviewer for the suggestion. We additionally created a new stochastic AntMaze environment where Gaussian noise is added to the agent's actions before they are applied in the environment. The standard deviation of the noise is 0.5, so the actual action taken in the environment is a + ε with ε ~ N(0, 0.5²). Given that the action space of this environment is [-1, 1], this represents a high level of stochasticity.
We report the result in Figure R.4 in the PDF file attached to the rebuttal summary. The result shows that the performance of our method remains robust even under such high stochasticity, indicating our model's ability to adapt to stochastic environments effectively. We appreciate the reviewer's insightful suggestion, which leads to this experiment that strengthens our paper. We will include this result in the revised paper.
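For reproducibility, such a stochastic variant can be obtained with a simple action wrapper along the following lines (a sketch assuming a Gymnasium-style continuous action space, not necessarily the exact code we used):

```python
import gymnasium as gym
import numpy as np

class NoisyActionWrapper(gym.ActionWrapper):
    """Adds zero-mean Gaussian noise to every action before it is executed,
    then clips the result back into the action space."""

    def __init__(self, env, noise_std=0.5):
        super().__init__(env)
        self.noise_std = noise_std

    def action(self, action):
        noisy = action + np.random.normal(0.0, self.noise_std, size=np.shape(action))
        return np.clip(noisy, self.action_space.low, self.action_space.high)

# e.g. env = NoisyActionWrapper(gym.make("<AntMaze task id>"), noise_std=0.5)
```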
References
[1] Liu et al. "Plan your target and learn your skills: Transferable state-only imitation learning via decoupled policy optimization." In ICML, 2022.
[2] Papagiannis et al. "Imitation learning with Sinkhorn Distances." In ECML PKDD, 2022.
[3] Yu et al. "Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning." In CoRL, 2019.
Dear Reviewers,
Before the author-reviewer discussion period ends, please make sure that you have read the author responses, acknowledge that your review reflects them, and engage with the authors if necessary.
Thank you!
This paper proposes Diffusion Imitation from Observation (DIFO), a diffusion-model method for learning from state-only demonstrations without action labels, within an adversarial imitation learning framework. DIFO utilizes a diffusion model to capture expert and agent state transitions, and reformulates the learning objective as a binary classification task to train the diffusion model as a discriminator, providing "realness" rewards for policy learning. The method demonstrates superior performance in various continuous control tasks, such as navigation, locomotion, and manipulation, compared to existing methods. DIFO also exhibits better sample complexity, making it suitable for real-world applications where expert demonstrations are limited.
This paper highlights the effectiveness of diffusion models in improving the robustness and performance of imitation learning from observation (ILfO). All the reviewers agreed that this paper makes a good contribution to the conference.