PaperHub

Rating: 6.3/10 (Poster · 4 reviewers · scores 7, 6, 5, 7 · min 5, max 7, std 0.8)
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.5 · Presentation: 3.3
NeurIPS 2024

Closed-Loop Visuomotor Control with Generative Expectation for Robotic Manipulation

OpenReview · PDF
Submitted: 2024-04-25 · Updated: 2024-11-06
TL;DR

A closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control

Abstract

Keywords
Robotic Manipulation · Visuomotor Control

Reviews and Discussion

Review (Rating: 7)

This paper introduces a novel framework that incorporates closed-loop feedback into vision-based control systems for robot manipulation tasks. The authors propose a text-conditioned video diffusion model for high-level reference path planning. The error measurement and feedback policy are obtained through an encoder-decoder architecture. This framework improves task performance over open-loop policies, with a higher success rate and good instruction-following, without the need for giant LLMs. The method is tested on appropriate benchmarks for language-based manipulation, with a thorough discussion of the method and results.

Strengths

  1. Novelty: This paper presents a novel feedback policy for visual input. The proposed measurable embedded space can help identify the error between the reference plan and the executed plan, and the encoder-decoder bridges the input-feedback-output modules. This design helps in long-horizon planning where open-loop control easily fails.
  2. Presentation: This paper is well-written and provides a thorough discussion of the methods.
  3. Potential impact: A feedback system for vision-based control has potential impact on robotic systems beyond manipulators.

Weaknesses

  1. Generalization: How can this framework be generalized across different tasks, robot platforms, and datasets? This applies especially to the sub-goal replan/transition, which seems to be very specific to each task. The replanning threshold is also designed by hand; can it be determined during training?
  2. The paper does not seem to report the model's computation time during testing, which is crucial for a closed-loop policy in real time. With a diffusion-based policy in particular, the sampling time could be long. Could the authors provide more discussion on addressing this issue?
  3. Do you expect your proposed architecture to scale well when pre-trained on a large dataset, and what additional design might be needed?

Questions

  1. How is the dataset in CLOVER obtained? What data is used to train CLOVER? More details would be welcome.
  2. How does CLOVER perform in environments with significant variations from the training scenarios? Have you tested the framework on completely new tasks or objects not seen during training?

Limitations

The authors discussed the potential societal impact in the Appendix.

Author Response

Thanks for your valuable review. We address your concerns below.

Question 1: Generalization: how can this framework be generalized across different tasks, robot platforms, and datasets? Especially the sub-goal replan/transition, which seems to be very specific to each task. Can the threshold be determined during training?

Generalize across platforms and datasets: The idea of CLOVER (the closed-loop visuomotor control framework) itself is embodiment-agnostic and can be generalized across different platforms. For instance, the robot in CALVIN simulation is Franka Emika Panda, whereas ALOHA is employed for conducting real-world experiments. Therefore, the generalization is more related to data. In our work, we train CLOVER on the respective data from the two platforms and successfully conduct verification. It is also expected that training on large-scale cross-embodiment datasets may lead to zero-shot generalization.

Sub-goal replan/transition generalization across tasks and models: The thresholds for sub-goal replan/transition empirically generalize well across tasks. For the CALVIN benchmark, there are 34 different tasks in total and the evaluation suite involves 1,000 distinct instruction chains. We use the same set of hyperparameters for sub-goal replan and transition, which works well for all tasks and different visual encoders (Fig. 6(c)). Please also refer to Fig. R3 in the rebuttal PDF.

Threshold determination during training: In our current implementation, the visual planner and the subsequent inverse dynamics-based policy are trained separately. Consequently, the required threshold can only be measured during the inference stage, when the two components are integrated. Future research will focus on training these models jointly in an end-to-end manner, potentially facilitating the threshold determination process during training.
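For illustration, the sub-goal transition and replanning logic described above can be summarized in a short sketch. This is a minimal illustration, not the released implementation: `encode_state`, `generate_plan`, `policy`, and the threshold values are hypothetical placeholders.

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Error measure: 1 - cosine similarity between two state embeddings."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def closed_loop_rollout(env, obs, instruction, encode_state, generate_plan, policy,
                        transition_thresh=0.05, replan_thresh=0.5, max_steps=360):
    """Advance to the next sub-goal once the measured error is small enough;
    replan when the current sub-goal drifts out of reach."""
    sub_goals = generate_plan(instruction, obs)  # visual plan from the video diffusion model
    goal_idx = 0
    for _ in range(max_steps):
        err = cosine_distance(encode_state(obs), encode_state(sub_goals[goal_idx]))
        if err < transition_thresh:              # sub-goal reached -> transition
            goal_idx += 1
            if goal_idx == len(sub_goals):
                return True                      # all sub-goals achieved
        elif err > replan_thresh:                # too far from the plan -> replan
            sub_goals = generate_plan(instruction, obs)
            goal_idx = 0
        obs = env.step(policy(obs, sub_goals[goal_idx]))  # assumes env.step returns the next observation
    return False
```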

Question 2: Discussions on computational overhead

Thanks for the question. Please refer to the detailed discussion in Q2 of the Global Rebuttal section.

Question 3: Do you expect your proposed architecture to scale well when pre-trained on a large dataset; what additional design might be needed?

Thanks for the insightful question.

  • Diffusion-based video generation models have been proven to be scalable with the size of the dataset and network. All we need to do to scale up CLOVER is to collect more videos and extend the model with more channels and layers. Notably, the videos can be free of action labels, and even human videos would help as well. Works like UniPi and UniSim [16] have made very successful attempts towards building world simulators by scaling up the pre-training dataset, and it would be similar for CLOVER to perform such extensions as well.
  • Feedback-driven policy: Its training is grounded in an inverse dynamics objective. While it necessitates action labels, it does not require high-level, task-specific knowledge for policy training. This characteristic facilitates the potential for training the policy on extensive, cross-embodiment datasets [38], thereby enabling few-shot cross-embodiment generalization.
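As a rough illustration of this inverse dynamics objective, the sketch below trains a hypothetical two-input policy to regress the labeled action from an (observation, goal) pair. The architecture and shapes are illustrative stand-ins, not the paper's ViT encoders and MLP decoder.

```python
import torch
import torch.nn as nn

class InverseDynamicsPolicy(nn.Module):
    """Illustrative sketch: predict the action that transitions obs_t -> obs_goal.
    The paper uses ViT encoders over RGB and depth; a small CNN stands in here."""
    def __init__(self, emb_dim=256, action_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, emb_dim),
        )
        # MLP action decoder, matching the rebuttal's description
        self.decoder = nn.Sequential(
            nn.Linear(2 * emb_dim, 256), nn.ReLU(), nn.Linear(256, action_dim),
        )

    def forward(self, obs_t, obs_goal):
        z_t, z_g = self.encoder(obs_t), self.encoder(obs_goal)
        return self.decoder(torch.cat([z_t, z_g], dim=-1))

# Training step: frame pairs sampled from a demonstration, supervised by the
# recorded action label; no task instruction is required.
policy = InverseDynamicsPolicy()
obs_t, obs_goal = torch.randn(8, 3, 128, 128), torch.randn(8, 3, 128, 128)
action_label = torch.randn(8, 7)
loss = nn.functional.mse_loss(policy(obs_t, obs_goal), action_label)
loss.backward()
```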

Question 4: How is the dataset in CLOVER obtained? What data is used to train CLOVER? More details would be welcome.

We train our models exclusively on in-domain datasets. For simulation experiments, we utilize the ABC split of the CALVIN training dataset, which includes language instruction labels, to train the video diffusion model. Consecutive frames and their action labels are randomly sampled to train the low-level policy. Similar procedures are applied in real-world experiments. Additional training details are available in Appendix B2.

By decoupling the training of the visual planner from the feedback-driven policy, we allow the visual planner to be trained on diverse video datasets without action labels. As discussed in Question 3, this approach enables us to leverage internet-scale videos without action labels for training a robust and generalizable visual planner and to utilize large-scale cross-embodiment robot datasets (Open X-Embodiment [38]) for training a more effective low-level policy.

Question 5: How does CLOVER perform in environments with significant variations from the training scenarios? Have you tested the framework on completely new tasks or objects not seen during training?

Thanks for the question. Zero-shot generalization to completely new tasks or objects stands as a significant challenge for current robotic research. Following the common experimental design, we mainly adopt certain variations during testing to demonstrate the generalization ability of our method.

As requested, we conduct additional real-world experiments in the rebuttal, introducing entirely new objects (yellow clay and a doll) alongside the primary interaction object. Please refer to the Global Rebuttal section for detailed results.

Additional Analysis:

  • Our simulation experiment setup necessitates the policy's generalizability. In the CALVIN benchmark, the textures of the table and the positions of buttons, drawers, and sliders in the test environment differ from those in the training set, posing a substantial challenge to the generalizability of various policies. CLOVER demonstrates exceptional generalizability in this context.
  • CLOVER relies on a video generation model that must reliably follow task instructions to produce corresponding visual plans. Currently, video generation models struggle to generalize to completely novel tasks (instructions) outside the training set. This challenge could potentially be mitigated by extensive pre-training on large-scale internet datasets, which we hope future open-sourced video foundation models will provide to the community. Our inverse dynamics-based policy is inherently task-agnostic and capable of generalizing to new tasks.
Comment

Thank you for the rebuttal and the additional experiment results presented. My concern is addressed and I will raise my score to 7.

Comment

Thanks for considering our responses and recommending acceptance. We will update our paper according to these insightful discussions.

Review (Rating: 6)

This paper proposes CLOVER, a closed-loop visuomotor control framework that incorporates feedback mechanisms to improve adaptive robotic control. The framework consists of a text-conditioned video diffusion model for generating visual plans as reference inputs, an error measurement module to model the discrepancy between the current and goal states in the embedding space, and a feedback-driven controller that refines actions from feedback and initiates replans as needed.

Strengths

  1. The paper is well written, well organized, and clear to read.

  2. Integrating closed-loop feedback into robotic visuomotor control is interesting and useful, as it can help deal with deviations and improve control accuracy.

  3. Combines visual planning, error measurement, and feedback-driven control in a cohesive system.

Weaknesses

  1. Limited evaluation in the real-robot environment. The authors only show one experiment comprising three consecutive sub-tasks, which is not enough to test the generalizability and robustness of the approach across different environments and tasks.

  2. The measurable embedding space for error quantification does not seem to be sufficiently validated. This component's effectiveness could vary significantly in different scenarios, and the paper does not provide extensive evidence of its robustness.

  3. The framework seems to need a lot of computation: at each step in the loop, the diffusion model needs to generate a video and then obtain embeddings through two ViT-based encoders for RGB and depth, respectively. The high computation required could limit its real-time execution and practical applicability.

  4. Further validation on a wider range of benchmarks and task environments is necessary to confirm the framework's generalizability.

  5. Minor: in Figure 5, the authors show generated videos of four tasks conditioned on the same initial frame, but the first frame of each task looks different; it would be better to make them the same.

Questions

  1. In Figure 3, the authors mention it is obtained during a roll-out; what task and environment are you using? How many tasks and environments did you test?

  2. How long does it take to execute one task in simulation and on the real robot?

  3. Can you show video demos of your experiments, not just screenshots in the paper?

Limitations

  1. High computation required for both training and real-time execution could limit practical applicability. The proposed framework's complexity might pose challenges in implementation.

  2. The scalability of the framework to diverse and complex tasks remains uncertain.

  3. The real-time execution of the feedback-driven controller might face latency issues, especially in dynamic and unpredictable environments.

Author Response

Thanks for your careful review and valuable comments. We address each question below.

Question 1: Limited evaluation in the real-robot environment.

We conduct new real-world experiments to further validate the effectiveness and generalizability of CLOVER. Please refer to the Global Rebuttal section, Q1(1) and (2), above, and the qualitative analysis in Fig. R2 of our rebuttal PDF.

Question 2: The measurable embedding space for error quantification does not seem to be sufficiently validated; its effectiveness could vary significantly in different scenarios, and the paper does not provide extensive evidence of its robustness.

We address the concern on robustness from the following perspectives and will update our revision accordingly:

  • Empirically robust across tasks and scenes: The CALVIN benchmark comprises 1,000 unique instruction chains across 34 different tasks. We select the most challenging scenario, where the textures and positions of objects to interact with during testing are distinct from the training samples (ABC → D). As outlined in Algorithm 1, our sub-goal transitions and replanning depend entirely on measurable embeddings. Thus, the performance improvement on CALVIN reflects the robustness of the error quantification. For instance, embeddings with poor robustness can cause the policy to get stuck on certain goals, as the measured distance keeps failing to meet the threshold required to trigger a sub-goal transition.

  • Robustness comparison with different embeddings for error quantification: In the table below, we provide quantitative results for LIV, which introduces tailored training objectives for dense reward learning (error measurement). CLOVER yields notable robustness compared with previous works. Further qualitative illustrations are provided in Appendix A.

| Methods | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Avg. Len. |
|-|-|-|-|-|-|-|
| LIV [61] (newly added) | 70.8 | 48.2 | 29.2 | 18.2 | 10.2 | 1.77 |
| CLIP Feature (Fig. 3(a)) | 72.4 | 46.8 | 25.0 | 13.7 | 5.1 | 1.63 |
| State Embedding (Ours, Fig. 3(c)) | 96.0 | 83.5 | 70.8 | 57.5 | 45.4 | 3.53 |

  • Model- and parameter-insensitive: Figure 6(c) shows that we do not need to carefully cherry-pick hyperparameters for the measurement, even for models with different architectures.
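The monotonicity property underlying these comparisons (Fig. 3 and Fig. R3) can be checked with a short script. This is a sketch only: `encode` below is a placeholder for any frozen encoder under comparison (CLIP, LIV, or the state encoder), and the metric is an illustrative choice, not the paper's.

```python
import numpy as np

def distance_curve(frames, goal_frame, encode):
    """Cosine distance from each rollout frame to the sub-goal under a given encoder."""
    g = encode(goal_frame)
    g = g / (np.linalg.norm(g) + 1e-8)
    zs = [encode(f) for f in frames]
    return np.array([1.0 - float((z / (np.linalg.norm(z) + 1e-8)) @ g) for z in zs])

def monotonicity(curve: np.ndarray) -> float:
    """Fraction of steps on which the measured distance decreases (1.0 = perfectly monotone).
    A 'measurable' embedding space should score high as the rollout approaches the sub-goal."""
    return float(np.mean(np.diff(curve) < 0))
```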

Question 3: Computational overhead.

Thanks for the question. We provide a detailed discussion in Q2 of the Global Rebuttal section above.

Question 4: Further validation on a wider range of benchmarks and task environments is necessary to confirm the framework's generalizability.

Thanks for the comment. On the one hand, the tested simulation benchmark, CALVIN ABC → D, already involves certain scene generalization. The texture of the table and the positions of buttons, drawers, and sliders in the test environment are different from what the model has seen in the training set, which is quite challenging for the generalizability of different policies.

To further validate the generalizability of our framework, we conduct more real-world experiments under certain perturbations. Please refer to the submitted PDF for the detailed experiment settings and to the Global Rebuttal section.

Question 5: Minor: in Figure 5, the generated videos of four tasks are conditioned on the same initial frame, but the first frame of each task looks different; it would be better to make them the same.

Thanks for the suggestion. In Fig. 5, we show the 2nd, 4th, 6th, and 8th frames of the generated videos for simplicity. We have revised our manuscript so that the initial frame is positioned at the forefront.

Question 6: In Figure 3, the authors mention it is obtained during a roll-out; what task and environment are used? How many tasks and environments were tested?

The visualization in Fig. 3 is based on a randomly sampled task ("open the drawer") within the CALVIN D environment. We show only one task in Fig. 3 for simplicity and clarity. However, we note that similar patterns are observed across all tasks, where our measurable embedding space consistently shows monotonicity when approaching each sub-goal, compared to the other two counterparts. We also provide a plot similar to Fig. 3, but encompassing all possible tasks over 300+ independent rollouts, in Fig. R3 of our rebuttal PDF. As shown in the figure, the measurability holds and generalizes well in all scenarios.

Question 7: How long does it take to execute one task in simulation and on the real robot?

  • Simulation: Our experiments are conducted on NVIDIA RTX 3090 GPUs. The low-level policy, with an input size of 196, operates at a frequency exceeding 70 Hz. The video diffusion model requires approximately 5 seconds to generate 8 video frames at a resolution of 128×128. It takes only around 11 seconds on average to complete a task in the CALVIN simulation.
  • Real-world: We perform experiments on an NVIDIA RTX 5000 Ada GPU with an input size of 128×96, where the low-level policy runs at more than 25 Hz and visual plan generation takes around 4.3 seconds. Overall, it takes around 38 seconds to complete the 3 consecutive tasks on our real-world robot.

We have added the detailed execution time in the revised version. Please also refer to our reply in Global Rebuttal for further information and comparison with other methods.
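Control frequencies such as the 70 Hz and 25 Hz figures above are typically obtained by timing repeated forward passes; below is a generic PyTorch timing sketch (not the authors' benchmarking code), with illustrative names and iteration counts.

```python
import time
import torch

@torch.no_grad()
def measure_policy_hz(policy, example_inputs, n_iters=100, device="cuda"):
    """Estimate a policy's control frequency from averaged forward-pass latency.
    CUDA is synchronized so asynchronous kernel launches are not under-counted."""
    policy = policy.to(device).eval()
    inputs = [x.to(device) for x in example_inputs]
    for _ in range(10):                   # warm-up iterations
        policy(*inputs)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_iters):
        policy(*inputs)
    torch.cuda.synchronize()
    latency = (time.perf_counter() - start) / n_iters
    return 1.0 / latency                  # e.g. >70 Hz on an RTX 3090 per the rebuttal
```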

Question 8: Video demos?

Thanks for the advice. Regrettably, we are not allowed to provide an external link to present the demonstration videos according to the NeurIPS rebuttal guidelines. We will release a project page with video demos when publicly releasing the work.

Comment

Thanks to the authors for the response addressing my concerns; I have increased my score to 6.

Comment

Thank you for reviewing our responses and raising the score. We will update our paper based on your insightful comments.

Review (Rating: 5)

The paper presents a novel framework named CLOVER. The proposed system aims to enhance the adaptability and robustness of robotic manipulation in long-horizon tasks by incorporating closed-loop control principles. CLOVER consists of three main components: a text-conditioned video diffusion model for generating visual plans, a measurable embedding space for accurate error quantification, and a feedback-driven controller that refines actions based on real-time feedback. The framework shows significant improvement in real-world robotic tasks and sets a new state-of-the-art performance on the CALVIN benchmark.

Strengths

  1. Originality: CLOVER introduces a unique combination of closed-loop control principles with generative models for robotic manipulation. The use of a text-conditioned video diffusion model to generate visual plans is a creative approach that leverages recent advances in generative AI.
  2. Quality: The methodology is well-articulated, detailing the design of each component in the framework. The inclusion of depth map generation and optical flow regularization to enhance the reliability of visual plans shows careful attention to the challenges in robotic manipulation.
  3. Clarity: The paper is clearly written, with a logical structure that guides the reader through the motivation, methodology, and experimental validation. Figures and diagrams are effectively used to illustrate complex concepts and the overall system architecture.
  4. Significance: The proposed framework addresses a critical challenge in robotics: improving the robustness and adaptability of robotic systems for long-horizon tasks. The notable performance improvements on the CALVIN benchmark and real-world tasks underscore the potential impact of CLOVER on the field of robotic manipulation.

Weaknesses

  1. Novelty of the error measurement approach: While the paper introduces a measurable embedding space for error quantification, it does not provide a detailed comparison with existing methods in terms of novelty and performance. Additional analysis or experiments comparing this approach with other state-of-the-art error measurement techniques would strengthen the contribution.
  2. Clarification of the effect of the closed-loop system: The paper aims to tackle long-horizon manipulation tasks, but it provides no clear evidence of this.
  3. Scalability of the feedback-driven controller: The paper presents the feedback-driven controller as an effective solution for adaptive control. However, it lacks a discussion on the scalability of this approach for more complex and diverse task environments. Evaluating the controller's performance across a wider range of scenarios, particularly dynamic scenarios would provide a better understanding of its generalizability.
  4. Lack of methodology clarification: Some technical aspects, particularly the algorithms and specific implementation details, are not as thoroughly explained as they could be. This makes it harder to fully grasp the intricacies of the proposed system.
  5. Relatively limited experiments: The experiments conducted on the CALVIN dataset and real-world scenarios may not be sufficient to demonstrate the effectiveness of the proposed method. It is recommended to perform additional experiments in more long-horizon simulation environments such as RLBench and robosuite. Additionally, the real-world experiments are relatively limited, so further experiments should be conducted to provide a more comprehensive evaluation.
  6. Computational complexity: The incorporation of a video diffusion model and feedback mechanisms might introduce significant computational overhead. The paper does not provide an in-depth analysis of the computational requirements and how they impact real-time performance, which is crucial for practical deployment.

Questions

  1. Can the authors provide more details on the training process used for the text-conditioned video diffusion model and the error measurement strategy, including the training data and network architecture?
  2. How does CLOVER handle highly dynamic environments where changes occur rapidly and unpredictably?
  3. How is the training data organized during the diffusion model generation phase of the visual plans? How many intermediate points are selected for each trajectory? I believe this needs to be explained, as it affects the performance of the final trajectory.
  4. What are the computational requirements for running CLOVER in real-time, and how does it scale with more complex tasks?

Limitations

The authors have addressed some limitations, such as validating CLOVER in simulation and real-world experiments by training the models heavily on the corresponding data. However, further discussion on the scalability of the framework and its adaptability to a wider range of robotic morphologies and environments would be beneficial.

Author Response

Thanks for your detailed review. We address your questions below.

Question 1: Novelty of the error measurement approach; comparison with other state-of-the-art error measurement methods.

As discussed in Sec. 2, existing works rely on additional detection models with manually set rules or high-cost VLMs to understand tasks' completion state. These impose restrictions on their applicability and efficiency.

Compared to works like LIV [61], which also learn measurable embeddings, we do not incorporate additional contrastive objectives but instead investigate the inherent properties of inverse dynamics-based policies. In the new results below, our method shows more robust performance on consecutive tasks. We will update the analysis in our revision.

| Methods | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Avg. Len. |
|-|-|-|-|-|-|-|
| LIV [61] (newly added) | 70.8 | 48.2 | 29.2 | 18.2 | 10.2 | 1.77 |
| State Embedding (Ours, Fig. 3(c)) | 96.0 | 83.5 | 70.8 | 57.5 | 45.4 | 3.53 |

Question 2: Clarification of the effect of the closed-loop system (long-horizon manipulation).

Each test rollout in CALVIN consists of five distinct instructions (sub-tasks), which requires the policy to succeed in chained sub-tasks. Our framework's improved performance on CALVIN and long-horizon real-world tasks demonstrates its effectiveness in such settings. We acknowledge that current settings are limited compared to tasks like "making a cuisine". We will work to develop more complex long-horizon manipulation tasks in the future.

Question 3: Scalability of the feedback-driven controller; evaluating the controller's performance across a wider range of scenarios.

We provide additional real-world generalization experiments, including background distraction, object variation, and dynamic scenarios in Global Rebuttal.

Scalability: As discussed in Secs. 3.1 and 5, by decoupling planning and low-level control into a two-level hierarchy, the visual planner can learn from massive human videos to be robust world simulators [16]; our inverse dynamics-based policy offers greater robustness for multitask IL compared to BC-based policies [59]. Fig. R4 shows that the BC-based method RT-1 struggles with scene variation.

Question 4: Lack of methodology clarification; detailed training process for the text-conditioned video diffusion model and the error measurement strategy.

We have detailed our model architecture and training protocol in Appendix B1 & B2, and we will publicly release all materials. In the rebuttal, we provide details below to address the question.

  • Video diffusion model: Extending Imagen [32], we add input/output channels and separate noise injection for RGB and depth, enabling depth generation. Optical flow-based regularization is introduced using diffusion latent embeddings to create cost volumes, with a lightweight CNN serving as ContextNet, inspired by RAFT [39]. Training data consists of 8 randomly sampled frames with fixed intervals (5 in CALVIN, 20 in real-world experiments).
  • Feedback-driven policy: We use MLPs for the action decoder, but Transformers or Diffusion Policy could be alternatives. The policy is trained with the inverse dynamics objective, using frames from random intervals in demonstrations to enhance its robustness.

Question 5: Relatively limited experiments.

Thanks for the advice. RLBench and robosuite focus on few-shot learning, with task horizons comparable to a single subtask in CALVIN. Thus, CALVIN is the most suitable benchmark for validating long-horizon capabilities. We will consider adding further simulation experiments.

Please also refer to Global Rebuttal Q1(3) for additional real-world experiments.

Question 6: Computational complexity.

We provide a detailed discussion in Global Rebuttal Q2.

Question 7: How does CLOVER handle highly dynamic environments?

Most existing works test in static environments where the interaction object remains stationary. Our work also focuses on similar settings and does not address environmental dynamics. We envision this as an important future direction and have incorporated the discussion below into the revision.

  • How to adapt to dynamic environments: CLOVER could be adapted for dynamic settings by using multi-frame conditioning to capture velocity and acceleration information. Besides, incorporating the replanning mechanism (Appendix A) at each inference step could help handle new scenarios.
  • Additional experiments with unpredictable changes: In the added tests, we randomly place and pick up a doll to create unpredictable visual changes. Thanks to our inverse dynamics-based policy's robustness to visual misalignment, CLOVER surpasses RT-1 by a large margin.

Question 8: How is the training data organized during the diffusion model's generation phase of the visual plans?

During training, CLOVER extracts 8 frames at five-frame intervals, covering key task segments. This ensures better task alignment and fewer generation rounds during a test (1-2 rounds per task). As shown in Figs. 6(a) and 9, CLOVER effectively achieves sub-goals with adaptive steps and generalizes to different visual planners without specific adjustments. Detailed protocol is in Appendix B2 and will be further clarified.
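A minimal sketch of this sampling scheme (8 frames at a fixed stride: 5 in CALVIN, 20 in the real-world data) is given below; the function name and episode representation are illustrative, not the actual data-loading code.

```python
import random

def sample_training_clip(episode_frames, num_frames=8, interval=5):
    """Sample `num_frames` frames at a fixed `interval` from one demonstration,
    starting at a random offset (interval=5 for CALVIN, 20 for real-world data)."""
    span = (num_frames - 1) * interval
    if len(episode_frames) <= span:
        raise ValueError("episode shorter than the required clip span")
    start = random.randint(0, len(episode_frames) - span - 1)
    return [episode_frames[start + i * interval] for i in range(num_frames)]
```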

Question 9: Computational requirements; how does the method scale with more complex tasks?

Please refer to Global Rebuttal for the discussion on computational requirements.

For more complex tasks, we may introduce LLMs to decompose the tasks into more manageable subtasks similar to HiP [29]. CLOVER can then generate visual plans for simplified tasks and perform real-time low-level control. The computational cost of this system will potentially scale linearly with the number of subtasks derived.

Comment

Thank you for your response. I've read the other reviews and the rebuttal. I’m keeping my initial score.

Comment

Thank you for the kind feedback. We appreciate the time and effort you have put into reviewing our work! We will update the manuscript based on your helpful review.

Review (Rating: 7)

The authors introduce CLOVER, a generalizable closed-loop visuomotor control framework that incorporates a feedback mechanism to improve adaptive robotic control. The method uses a text-conditioned video diffusion model to generate reference visual sub-goals, an error measurement to quantify the difference between the current state and the planned sub-goal, and a feedback-driven controller via an inverse dynamics model to output actions. Experiments on the CALVIN benchmark and the ALOHA real robot show that the method outperforms all previous methods by a notable margin.

Strengths

  1. The problem of using closed-loop visuomotor control to solve long-horizon tasks is very meaningful.
  2. The motivation and storyline are stated clearly. The paper is well-written and easy to follow.
  3. The experiments on both the simulation and the real robot are convincing.
  4. The ablation studies of visual embedding, optical flow regularization, error measuring, multi-modal fusion, and sampling rate are very thorough.

Weaknesses

See questions

Questions

  1. How is your work different from AVDC?
  2. Could the authors elaborate more on the superiority of their design which use diffusion model generate reference frames and then use inverse dynamics model to get the actions, over the design like diffusion policy[60] which directly generate actions from diffusion model with a closed-loop style?
  3. What is the orange dash line meaning in Figure 6(a)?
  4. It seems the authors define sub-goals as unreachable when the cosine distance between consecutive state embeddings is too large.
  5. Would there be circumstances where the embedding is close but the sub-goal is actually not reachable due to singularity or other robot limits? Especially when testing the framework on a real robot?

Limitations

The generalization ability was not tested.

Author Response

Thanks for your careful review and we really appreciate your comments. We address your questions below.

Question 1: How is your work different from AVDC?

Our work differentiates from AVDC in the following aspects:

  • Model structure: For the visual planning part, we introduce a novel constraint term to enhance its temporal consistency and endow the model with additional depth generation ability.
  • Action output mechanisms: As discussed in the related work, AVDC infers actions from predicted video content with dense correspondence. It combines an off-the-shelf optical flow estimator with depth information to compute SE(3) transformations of the end-effector and transfers them into action control signals. In contrast, CLOVER learns an inverse dynamics model to output actions and adaptively reach given goals. Paired with error measurement and a feedback mechanism, CLOVER is more robust to the inherent instability of video diffusion models and can be adapted to broader scenarios.
  • Feedback policy: The AVDC framework lacks a feedback mechanism within its visuomotor control system. In contrast, our work aims to explicitly quantify errors and incorporate them into a unified framework for long-horizon manipulation.

We have added the above discussion to the revision.

Question 2: Could the authors elaborate on the superiority of their design, which uses a diffusion model to generate reference frames and an inverse dynamics model to obtain the actions, over a design like Diffusion Policy [60] that directly generates actions from a diffusion model in a closed-loop style?

Thanks for the question. Our decoupled paradigm has two main benefits:

  • As mentioned in Section 4.2, our diffusion-based video model can effectively understand high-level instructions and generate corresponding plans. This alleviates the learning complexity of policy by decoupling the planning and control into a two-level hierarchy, where the low-level policy can be trained in a task-agnostic manner (no need for high-level instruction labels).
  • The inverse dynamics model (IDM) can be established based on the generated sub-goals, mapping state transitions to actions. In contrast, Diffusion Policy uses only the current observation, following the behavior cloning (BC) paradigm. Previous work [59] shows that an IDM can be more performant and robust than BC and scales best with the pre-training dataset.

With the above benefits, we show the stronger performance of our method in Table 1, compared to the diffusion policy-based model "3D diffusion actor" [52], even though it utilizes more sensor input.

Question 3: What does the orange dashed line in Figure 6(a) mean?

We intend to highlight the highest performance achieved by open-loop methods for clearer comparison with closed-loop counterparts. We have revised the subfigure.

Question 4: It seems the authors define sub-goals as unreachable when the cosine distance between consecutive state embeddings is too large. Would there be circumstances where the embedding is close but the sub-goal is actually not reachable due to singularity or other robot limits? Especially when testing the framework on real robots?

Thanks for the insightful question. Injecting physical constraints or considering potential robot limitations in video generation models (world models) remains a longstanding challenge in visual plan generation. Even models like Sora, which are trained with extensive data and computational resources, have been found to fail in fully adhering to physical laws. However, the reliability of visual planners can be significantly improved by training on large-scale, real-world robot demonstrations that are inherently physically plausible. The generalizability of such models to unseen embodiments during the training phase remains an area that requires further exploration.

In practice, the lack of physical feasibility in visual planning can result in scenarios where a robotic arm stops at a certain position for an extended period without the error reaching the predefined threshold. To empirically enhance the reliability of visual planning, we can implement human-defined rules to detect these conditions and trigger replanning across multiple rounds of generation, thereby addressing this limitation.
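One hypothetical form of such a rule is a plateau detector: force a replan when the measured error has stopped improving over a recent window. The sketch below is illustrative only; the window size and tolerance are not values from the paper.

```python
from collections import deque

class StuckDetector:
    """Trigger a replan when the measured error has not improved by at least
    `tol` over the last `window` control steps (illustrative rule parameters)."""
    def __init__(self, window=50, tol=1e-3):
        self.history = deque(maxlen=window)
        self.tol = tol

    def should_replan(self, error: float) -> bool:
        self.history.append(error)
        if len(self.history) < self.history.maxlen:
            return False                  # not enough history yet
        # No meaningful progress from the oldest error to the best recent one -> stuck
        return (self.history[0] - min(self.history)) < self.tol
```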

Question 5: Did not test on generalization ability.

Thank you for the comment. In the current draft, the benchmark CALVIN ABC → D itself certifies the scene-generalization ability of the learned policy [45]. Specifically, the texture of the table and the positions of buttons, drawers, and sliders in the test environment are different from those in the training set, which challenges the model's generalization in unseen scenarios. In real-world robot experiments, we further validate position generalization by putting the fish to be grasped at a different position for each rollout.

During the rebuttal, we provide additional real-world results to verify CLOVER's robustness against background distraction and object variation. Please refer to Figs. R1 and R2 in the submitted PDF for the detailed experiment settings and to the results given in the Global Rebuttal section, Q1(1) and (2).

Author Response (Global Rebuttal)

Dear Area Chairs and Reviewers,

We thank all the Reviewers for their detailed and helpful comments on our work. We appreciate the Reviewers for acknowledging our strengths and contributions, such as a creative and novel feedback policy and cohesive system design (DjFL, hKyy, eUsr), a useful and meaningful research problem of integrating closed-loop feedback into visuomotor control (hfKW, DjFL, hKyy, eUsr), clear motivation and well-articulated methodology (hfKW, DjFL), sufficient ablations and notable improvements (hfKW, DjFL), and well-written (hfKW, DjFL, hKyy, eUsr).

During the rebuttal phase, we have made diligent efforts to address the concerns raised by the Reviewers, add new ablation studies and real-world experiments, provide discussions on computational complexity, and add clarity and depth to address any ambiguities. Our responses to specific concerns are detailed below. We thank you all for the opportunity to improve our work with your constructive feedback.

Best regards,
The Authors


Here we refer to two general questions:

Question 1 (Reviewer hfKW, Reviewer DjFL, Reviewer hKyy): More real-world experiments, or experiments on generalization.

We conduct an additional generalization evaluation with our original long-horizon task and perform two more tasks to test CLOVER against our baselines. From the results below, it can be observed that CLOVER greatly outperforms existing works under distractions and on more challenging tasks. Please refer to the submitted rebuttal PDF for the illustrated experiment settings. We will add the results to our revision.

  • (1) (hfKW Q.5, DjFL Q.3, hKyy Q.1, eUsr Q.5) Robustness evaluation with visual distractions. (See Fig. R1(a) in the rebuttal PDF for the detailed setting.)

| Methods | Task 1 | Task 2 | Task 3 | Avg. Len. |
|-|-|-|-|-|
| ACT [53] | 13.3 | 0 | 0 | 0.13 |
| R3M [54] | 20.0 | 0 | 0 | 0.20 |
| RT-1 [49] | 40.0 | 6.7 | 0 | 0.47 |
| CLOVER (Ours) | 73.3 | 66.7 | 6.7 | 1.47 |

  • (2) (DjFL Q.3, hKyy Q.4) Robustness evaluation with dynamic scene variation. (See Fig. R1(a) in the rebuttal PDF for the detailed setting; due to limited time, we compare only with the competitive RT-1 in this experiment.)

| Methods | Task 1 | Task 2 | Task 3 | Avg. Len. |
|-|-|-|-|-|
| RT-1 [49] | 33.3 | 0 | 0 | 0.33 |
| CLOVER (Ours) | 80.0 | 53.3 | 20.0 | 1.54 |

  • (3) (DjFL Q.5, hKyy Q.1) Experiments with two new tasks: pour shrimp into the plate & stack bowl. (See Fig. R1(b) in the rebuttal PDF for the detailed setting.)

| Methods | Pour Shrimp | Stack Bowl | Avg. |
|-|-|-|-|
| ACT [53] | 33.3 | 46.7 | 40.0 |
| R3M [54] | 46.7 | 53.3 | 50.0 |
| RT-1 [49] | 80.0 | 66.7 | 73.4 |
| CLOVER (Ours) | 80.0 | 86.7 | 83.4 |

Question 2 (Reviewer DjFL, Reviewer hKyy, Reviewer eUsr): Computational complexity.

Statistics: By default, a ViT-Base (86M) backbone is employed to encode RGB data, and a ViT-Small (22M) backbone is utilized to extract depth features. As indicated by Table 4, substituting the ViT-Base encoder in the RGB branch with a ViT-Small encoder also yields comparably strong performance. With the aforementioned configuration, running on an NVIDIA RTX 5000 Ada GPU, the inference time of the proposed policy model is less than 0.04 seconds, meaning the policy operates in real time at a frame rate greater than 25 Hz. We conduct simulation experiments on a server equipped with an RTX 3090 GPU, where our policy runs at over 70 Hz. In the rebuttal, we provide the following statistics to enable a full grasp of the computational complexity of CLOVER and other competitive methods.

| Methods | Performance | Video Generation (s) | Policy (s) | Avg. Time to Complete a Task (s) | # Params. in Total (M) |
|-|-|-|-|-|-|
| RoboFlamingo [50] | 2.48 | / | 0.072 | 8 | 3000 |
| SuSIE [15] | 2.69 | 9 | 0.15 | 49 | 400 |
| CLOVER (Ours) | 3.53 | 5 | 0.013 | 11 | 200 |

Note that our diffusion model serves as a high-level planner, i.e., it is activated only when a subtask starts or a replan is triggered. Therefore, it does not require real-time inference. Unlike image-editing models such as SuSIE, our video generation model covers a larger span of a task rollout, so fewer inferences are needed. As shown in the table above, CLOVER achieves better performance with much lighter computational requirements.

Analysis: We agree that the main bottleneck is video generation. Therefore, we made multiple design choices to enhance efficiency:

  • Illustrated in Sec. 4.4, we use a 20-step DDIM sampler for balanced efficiency and performance. A 10-step process halves video generation time while retaining competitive performance, achieving an Avg. Len. of 3.21 on CALVIN. In contrast, SuSIE uses a larger model with 50 denoising steps yet yields inferior performance (Avg. Len. 2.69). Integrating advanced samplers like DPM-Solver could further reduce the steps needed.
  • Based on Imagen, our diffusion model downscales the channel dimension and limits attention blocks to only the 1/8- and 1/16-downsampled feature maps, cutting the attention module's quadratic computational load. We build a compact diffusion network with merely 72M parameters to minimize latency.
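To illustrate how the sampler-step trade-off is configured in practice, the snippet below uses Hugging Face diffusers' DDIMScheduler. The paper's model is a custom Imagen variant, so this shows only the step-count mechanism, not the authors' pipeline.

```python
from diffusers import DDIMScheduler

# Fewer inference steps trade generation quality for latency.
scheduler = DDIMScheduler(num_train_timesteps=1000)

scheduler.set_timesteps(num_inference_steps=20)  # the paper's default: 20 DDIM steps
print(len(scheduler.timesteps))                  # -> 20 denoising iterations

scheduler.set_timesteps(num_inference_steps=10)  # roughly halves video generation time
print(len(scheduler.timesteps))                  # -> 10 (Avg. Len. 3.21 per the rebuttal)
```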

We will add the above results and discussions to the revision.


Please refer to the rebuttal modules below for our point-to-point responses to each reviewer.

Final Decision

The detailed rebuttals by the authors were appreciated. All reviewers recommend accepting the paper.