PaperHub

Overall: 6.5/10 · Poster · 4 reviewers
Ratings: 9 / 6 / 5 / 6 (min 5, max 9, std. dev. 1.5)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.5

NeurIPS 2024

DiffuserLite: Towards Real-time Diffusion Planning

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Diffusion planning has been recognized as an effective decision-making paradigm in various domains. The capability of generating high-quality long-horizon trajectories makes it a promising research direction. However, existing diffusion planning methods suffer from low decision-making frequencies due to the expensive iterative sampling cost. To alleviate this, we introduce DiffuserLite, a super fast and lightweight diffusion planning framework, which employs a planning refinement process (PRP) to generate coarse-to-fine-grained trajectories, significantly reducing the modeling of redundant information and leading to notable increases in decision-making frequency. Our experimental results demonstrate that DiffuserLite achieves a decision-making frequency of $122.2$Hz ($112.7$x faster than predominant frameworks) and reaches state-of-the-art performance on D4RL, Robomimic, and FinRL benchmarks. In addition, DiffuserLite can also serve as a flexible plugin to increase the decision-making frequency of other diffusion planning algorithms, providing a structural design reference for future works. More details and visualizations are available at https://diffuserlite.github.io/.
Keywords

Diffusion model; Reinforcement learning; Deep reinforcement learning

Reviews and Discussion

Review (Rating: 9)

The paper introduces a lightweight framework employing a Plan Refinement Process (PRP) for generating trajectories from coarse to fine-grained levels. This approach reduces redundant information modeling, significantly enhancing planning efficiency. DiffuserLite achieves an impressive decision-making frequency of 122.2Hz, making it 112.7 times faster than existing frameworks and suitable for real-time applications. Additionally, it can be integrated as a flexible plugin to accelerate other diffusion planning algorithms. Experiments demonstrate state-of-the-art performance on multiple benchmarks, highlighting DiffuserLite's robustness, efficiency, and practical applicability across various domains.

Strengths

  • DiffuserLite achieves an impressive decision-making frequency of 122.2Hz, making it 112.7 times faster than existing frameworks, which is crucial for real-time applications.

  • The PRP generates trajectories from coarse to fine-grained levels, reducing redundant information modeling and significantly enhancing planning efficiency.

  • DiffuserLite can be integrated as a flexible plugin to accelerate other diffusion planning algorithms, demonstrating its adaptability and potential for widespread use across various domains.

Weaknesses

The paper is well-written and easy to follow; I do not see any obvious weaknesses.

Questions

  • The proposed DiffuserLite is a classifier-based conditional diffusion model. How does it provide gradients?

  • There may exist physical constraints on the generated trajectory that are not differentiable and cannot provide gradients. How does DiffuserLite work in that case?

Limitations

The authors have adequately addressed the limitations.

Author Response

Thank you for your effort in reviewing and acknowledging our work. Based on the questions you have raised, it seems you are particularly interested in the implementation details of DiffuserLite and its potential applications. We have already released an initial version of the code at diffuserlite/diffuserlite.github.io to help you understand the algorithm better. Below are our responses to your questions, which we hope address your concerns:


  • Q1: Your question offers a different perspective on the planning procedure of DiffuserLite, particularly in treating the value critic as a classifier, which is indeed correct. This classifier plays two key roles in the planning framework: (a) guiding the diffusion model to generate high-performance trajectories, which can be achieved either through gradient calculations (classifier guidance, CG, commonly implemented via automatic differentiation in deep-learning frameworks) or by feeding the value into the network as a condition (classifier-free guidance, CFG); and (b) evaluating candidate plans for selection (see Q2 below). Although the two forms of guidance are mathematically equivalent, we chose CFG in DiffuserLite because the gradient computations required by CG are expensive and significantly reduce decision frequency, and because our alternative generative backbone, Rectified flow, cannot be guided by CG at all. To our knowledge, DiffuserLite is the first work to validate Rectified flow's effectiveness in decision-making domains, and leveraging its reflow procedure further enhances decision frequency. To quantify the impact of CG on decision frequency, we tested DiffuserLite-D on the MuJoCo benchmarks: the average decision frequency is 27.2Hz with CG versus 68.2Hz with CFG, suggesting that CFG should be preferred when high decision frequency is crucial.
  • Q2: This question is highly relevant, as real-world applications often encounter non-differentiable physical constraints. As a diffusion planning algorithm, DiffuserLite offers several ways to address this issue: (a) Explicitly filtering out trajectories that violate physical constraints during planning: DiffuserLite generates multiple candidate trajectories at each decision step and evaluates their quality to select the optimal one. As in search methods, this procedure allows candidates that do not adhere to the constraints to be filtered out explicitly. (b) Incorporating additional guidance: While many physical constraints are non-differentiable, we can soften these constraints and approximate them with a neural network classifier that guides generation toward trajectories that best satisfy them. Previous research has shown significant improvements in the physical plausibility of generated trajectories by utilizing gradient guidance from an environment dynamics model [1]. Given DiffuserLite's flexible, plugin-based framework, this method can be incorporated to ensure compliance with physical constraints. In conclusion, we believe a combined approach using (a) and (b) can yield favorable performance in most real-world applications. As demonstrated in the paper, the Robomimic benchmark offers a real-world robot-arm dataset with inherent physical constraints, and with method (a) alone, DiffuserLite already achieves very good performance on it. (A toy sketch combining CFG sampling with this filtering step is given after this list.)
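
To make the two answers above concrete, here is a minimal sketch of one decision step as we understand it: candidates are sampled with CFG (the value enters the network as a condition rather than through gradients) and are then filtered and ranked by the critic, which is where a hard, non-differentiable constraint check can slot in. All names are illustrative, not the released API, and the update rule is a crude stand-in for the real sampler.

```python
import torch

# Illustrative sketch only: `denoiser`, `critic`, and `constraint_ok` are
# hypothetical callables, not functions from the released DiffuserLite code.
@torch.no_grad()
def plan_step(denoiser, critic, constraint_ok, target_value,
              horizon=9, obs_dim=17, n_candidates=32, n_sample_steps=5, w=1.5):
    x = torch.randn(n_candidates, horizon, obs_dim)      # start from pure noise
    cond = torch.full((n_candidates, 1), target_value)   # value condition for CFG
    for t in reversed(range(n_sample_steps)):
        eps_c = denoiser(x, t, cond)                     # conditional prediction
        eps_u = denoiser(x, t, None)                     # unconditional prediction
        eps = eps_u + w * (eps_c - eps_u)                # CFG mix: no NN gradients needed
        x = x - eps / n_sample_steps                     # crude Euler-style update
    keep = torch.tensor([constraint_ok(p) for p in x])   # hard, non-differentiable filter
    x = x[keep] if keep.any() else x                     # fall back if none survive
    return x[critic(x).argmax()]                         # best surviving plan by value
```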

I hope our response addresses your concerns. Please feel free to reach out if you need further clarification. Thank you once again for your insightful review.


[1] Ni, et al, "MetaDiffuser: diffusion model as conditional planner for offline meta-RL," in Proceedings of the 40th International Conference on Machine Learning, 2023.

Review (Rating: 6)

The paper introduces a method to accelerate diffusion model-based planning called the Plan Refinement Process (PRP). This method divides the planning of the entire trajectory into several temporal spans, focusing on the most recent spans in each planning stage and thereby discarding redundant distant information. The paper also employs techniques such as critic design and rectified flow to ensure both the frequency and accuracy of planning. Experimental results show that the proposed DiffuserLite, which incorporates PRP, achieves a high decision-making frequency, significantly faster than predominant frameworks. This substantially enhances the real-time applicability of diffusion planning across various benchmarks. Additionally, DiffuserLite's flexible design allows it to serve as a plugin to improve other diffusion planning algorithms.

Strengths

One of the advantages of the paper is the combination of a coarse-to-fine architecture with diffusion model-based planning. This method avoids redundant information in long-term planning and significantly improves planning efficiency. The authors also integrate classifier-free guidance (CFG), introduce a new critic, and use techniques such as rectified flow. As a result, the proposed DiffuserLite possesses the capability for fast and successful planning. Another advantage of the paper is the ability to flexibly apply the proposed method to other diffusion planning frameworks. This approach significantly increases the speed of planning inference with only a slight sacrifice in performance. The comprehensive comparisons in the paper demonstrate the potential of the proposed PRP in diffusion planning applications.

Weaknesses

One of the primary innovations of the paper is the introduction of the Plan Refinement Process (PRP). However, this hierarchical approach is not entirely original, as similar concepts have been introduced in prior works such as HDMI and HD-DA, mentioned in Section 6 (Related Works). Additionally, many of the other improvements, such as the use of rectified flow, rely on existing methods. As a result, the paper's novelty is somewhat limited.

Questions

  • In the Abstract, the website provided by the authors does not contain updated details and results. Please remember to update it.
  • In line 126, the authors mention that agents often struggle to reach distant states proposed by the planning process. Could you provide a deeper explanation of this? Otherwise, what is the significance of long-term planning?
  • In lines 176-178, why does this critic design make it difficult to distinguish better trajectories in short-term planning?
  • What is the "uniform critic" mentioned in line 186? Could you explain it further?
  • What are the specific metrics used in Tables 2, 3, and 4?

Limitations

Yes

Author Response

Thank you for your thorough review. The concerns you have raised are insightful, and we aim to address them effectively in the following responses. For brevity, we use "Lite" to refer to "DiffuserLite" below.


Weakness: Limited Novelty

We aim to elucidate the differences between Lite and HDMI as well as HD-DA (the latter is in fact concurrent work; both it and Lite were posted on arXiv in January 2024) from the perspectives of motivation, technology, and reproducibility.

1. Motivation: The motivation behind Lite lies in reducing redundant information during planning to increase decision frequency, whereas HDMI and HD-DA focus on generating higher-quality plans through "subgoal planning" and "goal reaching". The primary contribution of Lite lies in speed enhancement. Experimental results indicate that in terms of D4RL score, Lite outperforms HD-DA and HDMI, with a significantly higher increase in decision frequency (Lite-R1: 52.2x ≫ HDMI: 1.8x ≈ HD-DA: 1.3x, compared to Diffuser) [1,2]. We argue this disparity is due to the failure of the other two works to eliminate redundant information in planning, unlike Lite. DiffuserLite addresses a distinct problem and is not merely an improvement upon HDMI.

2. Technology: While all three algorithms exhibit a hierarchical structure, Lite offers additional insights:

  • Lite demonstrates high-performance diffusion planning WITHOUT meticulous dataset handling or the inclusion of prior knowledge, unlike HDMI.
  • Lite proves that disregarding distant redundant information and using a significantly shorter planning horizon can still yield excellent results and greatly enhance speed compared to HDMI and HD-DA.
  • Lite conducts extensive ablation experiments discussing the impact on performance of the number of hierarchical levels L and the interval I_l at each level, summarizing best practices in hierarchical structure design, a discussion lacking in HDMI and HD-DA (which only use L=2).
  • Lite validates its algorithm on real-world datasets like Robomimic and FinRL, showcasing its practical applicability, which is not extensively covered in HDMI and HD-DA.
  • Lite boasts a simpler structure that can serve as a flexible plugin for other diffusion planners, as validated on AlignDiff, setting it apart from HDMI and HD-DA.

3. Reproducibility: Currently, only DiffuserLite has released code (at diffuserlite/diffuserlite.github.io), making its results the most reliable.


Questions

Q1: Thank you for your reminder. The website content is updated. : )

Q2: Planning-based methods suffer from compounding errors, leading agents to deviate from the plan over time [3]. Hence, agents execute only a few steps of the plan (typically 1 step in planning-based RL algorithms). Therefore, we believe detailing every aspect of plan generation is unnecessary. Successful long-term planning requires: (a) sparser detail for more distant parts (as long as they accurately reflect the plan's quality), and (b) more detail for nearer parts to ensure decision consistency with the plan.
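
For intuition, a small sketch (our illustration, not code from the paper) of how this allocation might look under the default three-level design [5, 5, 9] mentioned later in the thread: each level plans a few keypoints, and the next level refines only the first segment, so nearby timesteps get dense coverage while distant ones stay coarse.

```python
# Sketch of coarse-to-fine keypoint spacing, assuming each level plans n
# keypoints over the first segment of the level above (illustrative only).
def prp_keypoints(points_per_level):
    span = 1
    for n in points_per_level:
        span *= n - 1                # effective horizon, e.g. 4 * 4 * 8 = 128
    levels = []
    for n in points_per_level:
        interval = span // (n - 1)   # keypoint spacing at this level
        levels.append([i * interval for i in range(n)])
        span = interval              # next level refines only the first segment
    return levels

print(prp_keypoints([5, 5, 9]))
# [[0, 32, 64, 96, 128], [0, 8, 16, 24, 32], [0, 1, 2, 3, 4, 5, 6, 7, 8]]
```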

Q3: In lines 176-178, the critic used in Diffuser/DD predicts the cumulative return within the horizon, leading to greedy plans that lack long-term judgment. For example, Hopper in Table 2 is a single-legged hopping robot for which faster movement yields higher rewards. A common failure mode of Diffuser and DD is that the agent jumps abruptly and then falls; we attribute this to the short-sightedness induced by greedy planning within the horizon. Critics with value terms can alleviate this issue by providing long-term judgment.

Q4: In line 186, a "uniform critic" refers to assigning the same judgment to all plans, i.e., assuming equal quality. DD generates one high-performance plan a priori for execution, without a plan-selection process. To align with our unified diffusion planning framework, DD can thus be viewed as utilizing a uniform critic for plan selection.

Q5: The metric used in Table 2 is the normalized D4RL score, which is the episodic return normalized against expert performance (100 means expert-level [4]). The metric used in Table 3 is the success rate of manipulation tasks in Robomimic [5]. The metric used in Table 4 is the episodic return in the realistic stock market simulation [6]. Across all tables, higher values indicate better performance.
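
For reference, the standard D4RL normalization from [4] maps a random policy's return to 0 and the reference expert's return to 100:

```python
# Standard D4RL score normalization [4]: 0 = random policy, 100 = expert level.
def d4rl_normalized_score(ret, random_ret, expert_ret):
    return 100.0 * (ret - random_ret) / (expert_ret - random_ret)
```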


I hope our response addresses your concerns. Please feel free to reach out if you need further clarification. Thank you once again for your insightful review. Moreover, we have released the code at https://github.com/diffuserlite/diffuserlite.github.io to facilitate a better understanding of the implementation details. I hope it can help you : )


References

[1] Chen, et al, "Simple Hierarchical Planning with Diffusion," in The Twelfth International Conference on Learning Representations, ICLR, 2024.

[2] Li, et al, "Hierarchical Diffusion for Offline Decision Making," in Proceedings of the 40th International Conference on Machine Learning, ICML, 2023.

[3] Xiao, et al, "Learning to Combat Compounding-Error in Model-Based Reinforcement Learning," in arXiv,1912.11206, 2019.

[4] Fu, et al, "D4RL: Datasets for Deep Data-Driven Reinforcement Learning," in arXiv,2004.07219, 2021.

[5] Mandlekar, et al, "What Matters in Learning from Offline Human Demonstrations for Robot Manipulation," in Conference on Robot Learning (CoRL), 2021.

[6] Liu, et al, "FinRL: A Deep Reinforcement Learning Library for Automated Stock Trading in Quantitative Finance," in arXiv,2011.09607, 2022.

Comment

Dear Reviewer 7Mwp,


As the deadline for the discussion phase is approaching, we are particularly eager to receive your feedback, as we believe some of your comments may stem from misunderstandings. Your engagement in discussion and feedback is truly important to us. We also greatly hope that the detailed explanations, additional experiments, and code releases in the rebuttal can fully address your concerns.


Warm regards,

Authors of submission 9426.

Comment

Thanks a lot for the efforts of the authors. I have carefully gone through the replies and added results. I really appreciate the authors' efforts. Most of my concerns are addressed, and I would like to increase my rating.

Comment

We are very pleased to be able to address your concerns. Your positive recognition means a great deal to us, and we truly appreciate it. Your feedback is significant, and the explanations provided in the rebuttal will be incorporated into the revision to help readers better understand our work. If you have any further questions during the upcoming discussion period, please feel free to contact us, and we will do our best to resolve them. Thank you for your time and efforts once again!

Comment

Reviewer 7Mwp,

As the author-reviewer discussion phase comes to a close, have the authors satisfactorily addressed your concerns? Please engage in the discussion if you still have any unsolved concerns.

Best,

AC

Review (Rating: 5)

This paper introduces DiffuserLite, a lightweight framework that utilizes progressive refinement planning to reduce redundant information generation and achieves real-time diffusion planning.

Key contributions are:

  • Introduced the plan refinement process (PRP) for coarse-to-fine-grained trajectory generation, reducing the modeling of redundant information.
  • Introduced DiffuserLite, a lightweight diffusion planning framework, which significantly increases decision-making frequency by employing PRP.

Results:

  • Achieved a decision-making frequency of 122.2Hz (112.7x faster than existing frameworks) and SOTA performance on multiple benchmarks, e.g., D4RL, Robomimic, and FinRL.

Strengths

Originality: This paper introduces a novel approach to diffusion planning, PRP, which brings various benefits and results in SOTA performance on multiple benchmarks.

Quality: Results are impressive. The experiment design is rigorous, with comprehensive evaluation across multiple benchmarks. There is detailed information about how the experiments were set up and carried out.

Clarity: The paper is well written. The figures are of good quality and help illustrate key points.

Significance: The improvements and the flexibility of the method could potentially benefit real-time decision-making applications.

Weaknesses

Details on implementation and design choices: A discussion in the main paper about design choices and hyperparameter selection would strengthen this paper. More ablation studies would help in understanding how the method generalizes.

Many of the details are in the appendix; readers must go through the appendix to fully understand the paper.

Some of the tables would benefit from clearer presentation, e.g., Table 6. While it saves some space, it is really hard to read.

Complexity and generalization: While the paper shows very impressive results on its benchmarks, the multi-level refinement process introduces quite some complexity. Additional theoretical analysis or more experimental results would help. It would be useful to provide data-driven recommendations on how to make design choices (e.g., for different levels) to make the method more practical.

Further discussion on the scalability of this method to more complex, real-world scenarios would be valuable.

Questions

  1. The paper discusses L = 2, 3, 4; what would the behavior be if L >= 5?
  2. If the authors can share any specific limitations or trade-offs found during the development of this method, that would be helpful to discuss. It would guide people in how to use the method and strengthen the paper.

Limitations

The authors addressed limitations in the paper.

Author Response

Thank you for your insightful suggestions. I am deeply touched by your diligent review and appreciate your dedication. I summarize the questions you raised and respond below:


  • Implementation Details: We have released the code at diffuserlite/diffuserlite.github.io to facilitate a better understanding of the implementation details. Furthermore, in response to your suggestion, we will incorporate more details on implementation and design choices into the main text during revision, so that complete comprehension does not require referring to the appendix.

  • Clearer Presentations: We commit to double-checking all figures and tables during revision to ensure they are presented more clearly and address any potential ambiguity issues.

  • Real-World Tests: Robomimic and FinRL benchmarks provide real-world datasets that significantly differ from D4RL. The strong performance of DiffuserLite on these benchmarks demonstrates its potential for real-world applications. Besides, our released code allows for easy adjustments of model size and training datasets to evaluate scalability in more complex real-world scenarios. We will explore this in future work.

  • PRP-Introduced Complexity: While the methodology may appear complex, the implementation of our algorithm is not significantly more intricate than one-shot generation. After excluding components such as neural networks, generative backbones, and the reflow procedure, the core algorithm comprises approximately 100 lines of code, alleviating concerns regarding complexity.

  • Design Choice Recommendations: More PRP levels allow more redundant information to be discarded but may also accumulate errors across levels. In the revision, we will provide theoretical justifications from this perspective. However, due to the numerical calculations and many estimations in PRP, theoretical justification may not suggest an optimal configuration. We therefore conducted additional experiments exploring various design choices, detailed in the table below (an expanded version of Table 6 in the paper; the third column corresponds to the right part of Table 6). The results align with the recommendations in Appendix C: 3 levels are a practical choice balancing performance and decision frequency; too many levels can lead to error accumulation and reduced decision frequency; and too few levels may make the trajectory distribution harder to simplify, resulting in lower performance. It is also important to balance the horizon length across levels, avoiding overweighting at the top or bottom. Additional design-choice recommendations can be found in Appendix C, and we will integrate these details into the main text based on your feedback.

| Number of Levels | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Design Choice | [5,33] / [9,17] / [17,9] / [33,5] | [3,5,17] / [5,5,9] / [9,5,5] / [17,5,3] | [3,3,3,17] / [5,3,5,5] / [5,5,3,5] / [17,3,3,3] | [3,3,5,3,5] |
| cheetah-me | 57.6 / 75.6 / 89.1 / 82.2 | 85.6 / 88.5 / 88.6 / 89.0 | 81.6 / 88.3 / 88.9 / 84.9 | 85.5 |
| antmaze-ld | 0 / 0 / 69 / 15.3 | 34.7 / 68.0 / 67.3 / 34.0 | 34.3 / 68 / 70 / 60 | 26.7 |
| Design Choice | [5,13] / [7,9] / [9,7] / [13,5] | [3,3,13] / [4,5,5] / [5,5,4] / [13,3,3] | [3,3,3,7] / [3,4,3,5] / [5,3,4,3] / [7,3,3,3] | [3,3,3,3,4] |
| kitchen-p | 69 / 72.8 / 72 / 74.5 | 66.7 / 74.4 / 74.2 / 31.7 | 72.8 / 74.4 / 73.8 / 74 | 71.7 |
| Average Cost | 0.0175 s | 0.0209 s | 0.0266 s | 0.0321 s |
  • Performance with L=5: To address your query, we include an experiment with L=5 in the table above (last column). Increasing the number of levels beyond a certain point may lead to error accumulation and reduced decision frequency. Considering these factors, we still recommend using L=3. (A quick consistency check on these configurations is sketched below.)
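
As a side note on reading the table (our own consistency check, under the assumption that a design [n_0, ..., n_{L-1}] spans (n_0 − 1) × ... × (n_{L-1} − 1) + 1 states, which matches the horizon of 129 quoted for "Lite w/o PRP" later in the thread): every design within a row covers the same effective horizon, so the ablation isolates the level structure itself.

```python
# Consistency check under the stated assumption about how a design's keypoint
# counts compose into an effective planning horizon (illustrative only).
def effective_horizon(design):
    h = 1
    for n in design:
        h *= n - 1
    return h + 1  # keypoints include t = 0

assert all(effective_horizon(d) == 129
           for d in ([5, 33], [9, 17], [3, 5, 17], [5, 5, 9],
                     [3, 3, 3, 17], [3, 3, 5, 3, 5]))
assert all(effective_horizon(d) == 49
           for d in ([5, 13], [7, 9], [3, 3, 13], [4, 5, 5],
                     [3, 3, 3, 7], [3, 3, 3, 3, 4]))
```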

I hope our response addresses your concerns. Please feel free to reach out if you need further clarification. Thank you once again for your insightful review.

Comment

Thank you for your comments and the detailed results. They helped solve many of my questions.

"the core algorithm comprises approximately 100 lines of code" -> Just to clarify, by complexity, I mean more on the complexity required when making design choices and generalize to other tasks, base models, etc. Therefore, complexity may not be measured by the lines of code, e.g. even if the solution is just one line of code, but there are 100 parameters to tune to ensure good results, I still think the complexity are there. Thank you for the detailed results for design choice recommendations. It could be super helpful if they can be included in a certain way in the final version.

Comment

Thank you for your positive feedback on our work and your appreciation of our rebuttal! I now better understand what you meant by "complexity." Admittedly, introducing PRP does require selecting additional design choices. However, in practice, as the experimental results show, selecting these design choices is not particularly challenging, and performance is not very sensitive to them. We believe this added "complexity" won't be a significant issue.

Thank you once again! If you have any other questions during the discussion period, please don't hesitate to contact us. We'll do our best to respond and further improve the paper!

Review (Rating: 6)

TL;DR: Unlike traditional methods that generate the entire trajectory at once, this process gradually refines the plan at each stage, reducing computational costs and improving real-time performance.

DiffuserLite is a lightweight diffusion planning framework designed to increase decision-making frequency in real-time applications. Traditional diffusion planning methods are limited by the high computational costs of modeling long-horizon trajectory distributions, leading to low-frequency decision-making. To overcome these limitations, DiffuserLite introduces a Progressive Refinement Process (PRP) that reduces redundant information and progressively generates fine-grained trajectories. As a result, DiffuserLite demonstrates superior performance compared to existing methods across various benchmarks such as D4RL, Robomimic, and FinRL. Additionally, DiffuserLite is designed as a flexible plugin that can be easily integrated with other diffusion planning algorithms, making it a valuable contribution to future research and practical applications.

Strengths

A simple idea with an order-of-magnitude speed improvement.

  • The PRP process reduces time cost by starting with coarse goals and refining them into detailed goals.
  • It reduces unnecessary redundant planning and ensures faster inference.
  • With the speed improvement, DiffuserLite shows reasonable performance.

Flexibility (RQ3)

  • In the form of a plugin, it can be applied to other planning methods such as AlignDiff.

Weaknesses

  • Terms and concepts such as "reflow" lack adequate explanation. Although they are covered in the appendix, the main text needs an explanation or at least a reference to that section. Even a brief explanation in the main text would be beneficial.
  • The abbreviation “PRP” is used to abbreviate multiple terms. On page 2, it refers to "plan refinement process," while on page 4, it stands for "progressive refined planning." Although these terms appear to be used interchangeably, it would be helpful to clarify this in the text.
  • It would be beneficial to have a theoretical justification for the experimental results. The concept is straightforward, but it would be helpful to explain why certain configurations of levels are better in the ablation study comparing the levels of DiffuserLite. Currently, the analysis seems too intuitive and empirical.
  • DiffuserLite adopts classifier-free guidance (CFG) instead of classifier guidance (CG) to prevent speed degradation. However, adapting CFG requires adjusting the target condition at each level of the multi-level structure. It would be beneficial to compare the performance and speed of DiffuserLite using CG in addition to CFG.
  • Equation 10 in the paper shows that the optimal value function is added as a term to the critic to solve the zero-critic-value issue in sparse-reward settings for fine-grained planning. However, using this optimal value requires an offline methodology, which limits the approach. Methods studied in online RL might replace the optimal value; for example, one could consider exploration strategies such as intrinsic motivation or curiosity-based rewards, or value-method enhancements such as double Q-learning or dueling DQN.

Questions

  • What is the default model for "Lite w/o PRP" mentioned in Section 5.5? You mentioned a one-level model, but the default design in Table 6 has three temporal horizons [5,5,9]. Is the entire planning horizon used as the temporal horizon? It would be helpful to clearly specify the actual temporal horizon used in the study.
  • In Figure 1 and Table 2, the scores of DD and Diffuser for the antmaze diverse dataset are recorded as 0. However, in the ICLR 2024 paper "Reasoning with Latent Diffusion in Offline Reinforcement Learning," those models show higher scores for antmaze-diverse-v2. This discrepancy suggests that either a different dataset version was used or the hyperparameter settings need to be adjusted for a fair comparison.
  • The trade-off between search depth and removing redundant information seems like a good idea. But is there an optimal point for this trade-off? Or is it chosen per task?
  • I think critic design is one of the most important parts of this paper, but it lacks explanation. Is there any literature that addresses this problem by modifying the critic? Is there other research that uses the sum of the discounted reward and the discounted optimal value to solve problems caused by sparse rewards?
  • According to Equation 10, the critic has an optimal-value term. However, this approach is limited to offline methodology. Are there any alternatives or methods to overcome this limitation?
  • The paper explains that CFG is adopted instead of CG to prevent speed degradation. Given that CFG requires adjusting the target condition at each level in the multi-level structure, are there any methods to mitigate this limitation? Additionally, could you provide experimental results comparing the performance and speed of DiffuserLite using CG?
  • Table 2 shows that DiffuserLite-R1, which uses rectified flow, achieves the best performance across most metrics. However, the paper lacks a detailed analysis of these results. Could you explain the specific mechanisms through which rectified flow contributes to the performance improvement?

Limitations

  • The paper does not sufficiently justify how to choose the number of planning levels and the temporal horizons. It would have been beneficial to compare the number of planning levels or the temporal horizons with respect to learning/inference frequency, analyzing the trade-offs as these factors change. Additionally, comparing whether it is better to choose the temporal horizons in a bottom-up or top-down manner across levels would have provided a more comprehensive analysis.
  • DiffuserLite may also have some societal impacts, such as expediting the deployment of robotic products, and it could potentially be utilized for military purposes. If the authors provided more examples highlighting the practical aspects of DiffuserLite, readers could more easily understand the paper's potential.
  • For some baselines (DD and Diffuser), the results shown in Table 2 are so low that the comparison appears unfair. Refer to Table 1 of an existing study on diffusion-based models performing one-shot planning (arXiv:2309.06599): for the same benchmarks (antmaze-diverse, kitchen-partial, kitchen-mixed), it reports better baseline performance. Even with the step-by-step planning method of this study, proper hyperparameter tuning could be expected to achieve at least similar performance.
Author Response

Thank you for your insightful suggestions. I am deeply touched by the diligent review and appreciate your dedication. I summarize the issues and respond below to address your concerns:


  • Paper revision: Following your suggestion, during revision we will further introduce reflow in the main text, fix PRP's inconsistent expansion, update the D4RL score report for Diffuser/DD, and move the design choices from the appendix to the main text.
  • About PRP configuration: More PRP levels allow more redundant information to be discarded but may also accumulate errors across levels. We will provide theoretical justifications from this perspective during revision. However, due to the numerical calculations and many estimations in PRP, theoretical justification may not suggest an optimal configuration. We therefore conducted additional experiments exploring various design choices, detailed in the table below (an expanded version of Table 6 in the paper; the third column corresponds to the right part of Table 6). The results align with the recommendations in Appendix C, emphasizing that 3 levels are a practical choice balancing performance and decision frequency. It is also important to balance the horizon length across levels, avoiding overweighting at the top or bottom. Additional design-choice recommendations can be found in Appendix C, and we will integrate these details into the main text based on your feedback.
| Number of Levels | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| Design Choice | [5,33] / [9,17] / [17,9] / [33,5] | [3,5,17] / [5,5,9] / [9,5,5] / [17,5,3] | [3,3,3,17] / [5,3,5,5] / [5,5,3,5] / [17,3,3,3] | [3,3,5,3,5] |
| cheetah-me | 57.6 / 75.6 / 89.1 / 82.2 | 85.6 / 88.5 / 88.6 / 89.0 | 81.6 / 88.3 / 88.9 / 84.9 | 85.5 |
| antmaze-ld | 0 / 0 / 69 / 15.3 | 34.7 / 68.0 / 67.3 / 34.0 | 34.3 / 68 / 70 / 60 | 26.7 |
| Design Choice | [5,13] / [7,9] / [9,7] / [13,5] | [3,3,13] / [4,5,5] / [5,5,4] / [13,3,3] | [3,3,3,7] / [3,4,3,5] / [5,3,4,3] / [7,3,3,3] | [3,3,3,3,4] |
| kitchen-p | 69 / 72.8 / 72 / 74.5 | 66.7 / 74.4 / 74.2 / 31.7 | 72.8 / 74.4 / 73.8 / 74 | 71.7 |
| Average Cost | 0.0175 s | 0.0209 s | 0.0266 s | 0.0321 s |
  • About CG and CFG: We do not use CG primarily for two reasons: (a) CG requires NN gradient computation, which is very expensive; a comparison of decision frequency for DiffuserLite-D on MuJoCo, 27.2Hz (CG) vs. 68.2Hz (CFG), suggests that CFG should be preferred when high decision frequency is crucial. (b) Rectified flow cannot use CG for guidance. Therefore, we solely employ CFG in Lite. Although CFG requires specifying a target condition for each level, various methods can address this: (a) using an additional NN to predict the maximum reachable return for the current state as the target, or employing the more sophisticated methods in [1]; (b) we also observe that Level 0 has the most significant impact on decision performance, so most cases require tuning the target condition only for Level 0.
  • About critic design: (Why is it helpful with sparse rewards?) Let $x_s$ denote the planned trajectory and $\tau$ denote $x_s$ together with its future trajectory, where $R$ estimates the trajectory's discounted return. A critic with only reward terms optimizes $\mathbb{E}_{q_0(x_0),q(\epsilon),s,\tau}\left[\Vert\epsilon_\theta(x_s,s,R(\tau))-\epsilon\Vert^2\right]$, whereas a critic with value terms optimizes $\mathbb{E}_{q_0(x_0),q(\epsilon),s}\left[\Vert\epsilon_\theta(x_s,s,\mathbb{E}_\tau[R(\tau)])-\epsilon\Vert^2\right]$. When rewards are sparse, a critic with value terms receives more stable conditions during training, leading to better performance on sparse-reward tasks. Additionally, DiffuserLite does not restrict the choice of critic; as demonstrated in AlignDiff-Lite, we used an attribute-strength model [2] as a critic for preference alignment. You can also deploy DiffuserLite in online tasks using curiosity-based rewards or value-based methods. (A toy sketch of this critic target follows after this list.)
  • In Section 5.5, "Lite w/ only last level" refers to one level with a horizon of 9, while "Lite w/o PRP" refers to one level with a horizon of 129. I apologize for any lack of clarity.
  • Low performance of Diffuser and DD in Antmaze: For Antmaze, we ran the official code for Diffuser and referenced the report in [3] for DD; we did not intentionally report low scores. The higher scores in LDGC [4] may be due to (a) their use of an inpainting version of Diffuser, as the original version yielded poor results (page 9, line 2), and (b) careful tuning of DD hyperparameters. To ensure a fair comparison, we will update the score report during revision; however, Diffuser and DD still score significantly lower than Lite. Regarding the Kitchen tasks, we have double-checked that both LDGC's reports and ours align with the official ones.
  • About DiffuserLite-R1: We attribute its best performance to the straight flow property of Rectified Flow, which confers an advantage in very few sampling steps. Compared to Lite-D, we only change the generative backbone from diffusion model to rectified flow, without any other tricks.
  • Practical Aspects: Robomimic and FinRL provide real-world datasets with substantial differences compared to D4RL. The strong performance of Lite on these datasets underscores its potential for real-world applications.
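
As promised above, a toy sketch of the critic target as Equation 10 is summarized in this thread: discounted in-horizon rewards plus the discounted optimal value of the final state, which stays informative even when every in-horizon reward is zero. Here `v_star` is a hypothetical learned estimator of the optimal value function; the exact form in the paper may differ.

```python
# Toy critic target: discounted in-horizon return plus discounted V*(s_H).
def critic_target(rewards, final_state, v_star, gamma=0.99):
    ret = sum(gamma ** t * r for t, r in enumerate(rewards))
    return ret + gamma ** len(rewards) * v_star(final_state)
```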

I hope our response addresses your concerns. Please feel free to reach out if you need further clarification. Thank you once again for your insightful review.


[1] Yu, et al, "Regularized Conditional Diffusion Model for Multi-Task Preference Alignment," in arXiv, 2404.04920, 2024.

[2] Dong, et al, "AlignDiff: Aligning Diverse Human Preferences via Behavior-Customisable Diffusion Model," in The Twelfth International Conference on Learning Representations, 2023.

[3] Hong, et al, "Diffused Task-Agnostic Milestone Planner," in arXiv, 2312.03395, 2023.

[4] Venkatraman, et al, "Reasoning with Latent Diffusion in Offline Reinforcement Learning," in arXiv, 2309.06599, 2023.

Comment

I have carefully reviewed your detailed responses to my comments. Your commitment to addressing the points raised is appreciated. Given the promised revisions and the forthcoming code release, I believe the simple yet effective idea of PRP could indeed be valuable in reducing the notorious planning time of diffusion models for planning.

While I find the empirical results compelling, I would have been even more supportive if there were stronger theoretical justifications for the approach. Nonetheless, the practical benefits of the method are clear.

Considering these factors, I am revising my overall assessment from 5 to 6.

Here are a few remaining questions.

  1. Regarding CFG and CG: In your response, you mentioned that CG was not used for several reasons, including decision frequency. Setting rectified flow aside, is there a difference in performance (score) between CG and CFG? In Section 4, you stated that you used CFG for decision frequency. However, in the conclusion, you cited CFG as the main cause of DiffuserLite's limitations. Could using CG address this issue? It would be helpful to explain why it was necessary to improve decision frequency by opting for CFG instead of CG. As it currently reads, your writing suggests, "We use CFG because CG is not good, but CFG also has limitations!" If CG can address this issue, it would be better to explain why decision frequency is important despite those limitations. If not, it would be clearer to state that the issue is not about the trade-off between CG and CFG.

  2. Additional question about the plugin: Have you tried using this method as a plugin for backbones other than AlignDiff? It would be beneficial to include tests showing whether it can be applied to more mainstream methods like Diffuser or HDMI, rather than AlignDiff.

Comment

Thank you very much for your thoughtful feedback and for increasing the score. I sincerely apologize for the delayed response (it took some time to train and test the models). I hope the following reply addresses your questions:

A1. The normalized scores for the CG/CFG-guided versions are listed below. We observe a slight performance drop when using CG. This drop was also noted in our earlier experiments, which was one of the initial reasons for choosing CFG. I believe the first work to propose using CFG was Decision Diffuser (DD), motivated by CFG's superior performance in image generation compared to CG. As for why CFG performs better in decision-making tasks, there is still a lack of systematic research. Since this is not the focus of DiffuserLite, we didn't discuss it in detail. However, we have observed that CG-generated trajectories are more prone to OOD issues as the guidance strength w increases, which might be one reason for the performance drop.

| env | CG | CFG |
|---|---|---|
| halfcheetah-me | 87.9 ± 1.6 | 88.5 ± 0.4 |
| kitchen-p | 66.2 ± 4.1 | 74.4 ± 0.6 |
| antmaze-ld | 46.1 ± 6.2 | 68.0 ± 2.8 |

A2. Can CG address the issue of adjusting the target return in CFG? I don't think it can fully address it. In practice, CG requires tuning one hyperparameter (guidance strength), while CFG needs two (guidance strength and target return). Although CG seems to have one less parameter to tune, the guidance strength has a much more significant impact on performance. The target return is normalized to [0, 1] based on the max/min values in the dataset, so during inference, choosing a relatively large value like [0.7-1.1] usually works well and is easy to tune. In contrast, the guidance strength must be carefully tuned for both CG and CFG.

A3. Why did we choose AlignDiff as the backbone? Because Diffuser/DD/HDMI are all reward-maximizing algorithms (and DiffuserLite can be seen as DD+PRP). We felt that adding PRP to these algorithms would still be solving the same problem in the same domain, which wouldn't effectively demonstrate the "flexible plugin" capability. Therefore, we chose AlignDiff, a preference-matching algorithm, to see whether DiffuserLite could be applied to a completely different task domain, and the answer is yes. Can DiffuserLite be used with Diffuser and HDMI? Absolutely. To integrate with Diffuser, the CG version mentioned in A1 & A2 can be viewed as Diffuser+PRP (if we ignore the inverse dynamics model). To integrate with HDMI, HDMI's heuristic method for automatically selecting high-quality sub-goals could provide different levels of training data for DiffuserLite, allowing PRP to generate more meaningful and higher-quality subgoals. However, implementing HDMI+PRP may have to wait until they open-source their code.

Thank you again for your insightful comments and patience! If you have any further questions during the discussion phase, please don't hesitate to contact us. We'll do our best to respond and further improve the paper.

Final Decision

This paper introduces a new diffusion-based planning scheme, which adopts a coarse-to-fine-grained generation procedure to overcome the high computational costs of modeling long-horizon trajectory distributions in the existing method. The proposed method, planning refinement process (PRP), can significantly speed up diffusion planning, while achieving reasonable performance. The strengths and weaknesses are summarized below:

Strength:

  • The coarse-to-fine-grained scheme introduced in this paper is interesting and intuitive.
  • The proposed method PRP has impressive speed improvement over the existing diffusion-based planning method.
  • The proposed method achieved reasonable performance on the D4RL benchmark.

Weakness:

  • Although the inference speed of this method is greatly improved, it requires training multiple diffusion models. This potentially leads to more costly training, as well as higher storage requirements during deployment.
  • As pointed out by one of the reviewers, the paper lacks theoretical justification or guarantees for the performance improvement. It can sometimes be tricky to choose the best configuration of planning levels and horizon lengths for different tasks.
  • The paper would benefit from further improving clarity and removing ambiguity in the writing, as well as including several missing references suggested by the reviewers.

The major concerns of the reviewers have been addressed after the rebuttal and the paper received unanimously positive feedback from the reviewers. I hope the authors can further polish the paper in their final version based on the provided suggestions, and make it a stronger submission.