PaperHub
Overall rating: 5.5/10
Poster · 4 reviewers
Scores: 4, 3, 2, 3 (min 2, max 4, std 0.7)
ICML 2025

Efficient Robotic Policy Learning via Latent Space Backward Planning

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24

Abstract

Keywords
Planning · Embodied Agents · Goal-Conditioned Policy

Reviews and Discussion

Review
Rating: 4

The authors introduce LBP (Latent space Backward Planning), a novel approach for robotic planning. LBP works by grounding tasks into final latent goals and recursively predicting intermediate subgoals backward toward the current state. The authors evaluate LBP on simulation benchmarks and real-robot environments, demonstrating its performance over existing methods for long-horizon, multi-stage tasks.

Questions For Authors

N/A

Claims And Evidence

The experimental validation supports the claims by the authors. LBP achieves 82.3% success rate on LIBERO-LONG, outperforming baselines like MPI (77.3%) and Seer (78.6%). Ablation studies validate the contribution of each component, showing significant performance drops when removing key elements.

Methods And Evaluation Criteria

The methods are appropriate for the problem.

Theoretical Claims

N/A

Experimental Designs Or Analyses

Experimental designs are comprehensive, evaluating on 10 LIBERO-LONG tasks and 4 real-world tasks against multiple baselines. Ablation studies effectively isolate component contributions.

Supplementary Material

I watched the videos on the companion website.

Relation To Existing Literature

Ok

Essential References Not Discussed

References are complete. There may be some marginal connections with Hindsight Experience Replay.

Other Strengths And Weaknesses

N/A

Other Comments Or Suggestions

N/A

Author Response

We sincerely appreciate your positive feedback and recognition of our work! If you have any further concerns or questions related to LBP, we would be happy to discuss them.

Review
Rating: 3

In this work, the authors propose a robotic manipulation method called LBP. The method first grounds the task into a final latent goal and then recursively predicts intermediate subgoals closer to the current state. Compared to previous fine-grained approaches, LBP is more lightweight and less prone to accumulating inaccuracies. For implementation, the goal predictor and subgoal predictor of LBP use only two-layer MLPs, and a cross-attention block realizes the goal-fusion module. The effectiveness of LBP is demonstrated by experiments on both the LIBERO-LONG benchmark and four real-world long-horizon tasks.
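For concreteness, the described components could be sketched roughly as follows (my own reading of the paper, not the authors' code; the latent dimensions, hidden sizes, and module names are guesses):

```python
import torch
import torch.nn as nn

class LBPSketch(nn.Module):
    """Rough sketch of the components described above (hypothetical sizes and names)."""
    def __init__(self, latent_dim=512, lang_dim=512, n_heads=8):
        super().__init__()
        # Final-goal predictor: a two-layer MLP over (current latent, language embedding).
        self.goal_predictor = nn.Sequential(
            nn.Linear(latent_dim + lang_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim))
        # Subgoal predictor: a two-layer MLP applied recursively, moving from the
        # final goal back toward the current state.
        self.subgoal_predictor = nn.Sequential(
            nn.Linear(2 * latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, latent_dim))
        # Goal-fusion block: cross-attention of the current state over the predicted (sub)goals.
        self.goal_fusion = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)

    def forward(self, z_t, z_lang, n_subgoals=3):
        goal = self.goal_predictor(torch.cat([z_t, z_lang], dim=-1))
        subgoals, prev = [goal], goal
        for _ in range(n_subgoals):
            prev = self.subgoal_predictor(torch.cat([z_t, prev], dim=-1))
            subgoals.append(prev)
        goals = torch.stack(subgoals, dim=1)                      # (B, K+1, D)
        fused, _ = self.goal_fusion(z_t.unsqueeze(1), goals, goals)
        return goal, goals, fused.squeeze(1)                      # fused context for the policy
```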

Update after rebuttal

The generalization capabilities of LBP are demonstrated by the additional results on shifting cups, and my misunderstanding about the baseline selection has been well addressed. However, the real-world task settings are still very simple, so I am inclined to maintain my score.

Questions For Authors

No

Claims And Evidence

Yes, the claims made in the submission are supported by clear and convincing evidence.

Methods And Evaluation Criteria

The proposed methods make sense, and the evaluations contain both simulation and real-world experiments, which makes the evaluation results comprehensive.

Theoretical Claims

I have checked the soundness of LBP’s theoretical claims, especially the derivation of Eqs. (3)–(5) in the “Predicting Subgoals with a Backward Scheme” part of Sec. 4.2.

Experimental Designs Or Analyses

  • On the LIBERO-LONG benchmark, the two versions of LBP achieve nearly state-of-the-art performance. However, on certain tasks such as tasks 6 and 7, LBP still lags notably behind the best method. Overall, LBP’s average performance is the strongest.
  • The presentation of real-world results is great. Figure 4 gives a direct impression of each method’s performance at each stage. However, the long-horizon tasks in the real-world experiments are simple pick-and-place or stacking tasks. It would be better if the real-world experiments involved more contact-rich tasks, e.g., articulated object manipulation.
  • Besides, the baselines selected for the real-world experiments are not strong enough. I would recommend adding R3M [1], VC-1 [2], DP [3], or other policies to your comparisons.

[1] Suraj Nair, Aravind Rajeswaran, Vikash Kumar, Chelsea Finn, and Abhinav Gupta. R3M: A universal visual representation for robot manipulation. In CoRL, 2022.

[2] Arjun Majumdar, Karmesh Yadav, Sergio Arnaud, Yecheng Jason Ma, Claire Chen, Sneha Silwal, Aryan Jain, Vincent-Pierre Berges, Pieter Abbeel, Jitendra Malik, Dhruv Batra, Yixin Lin, Oleksandr Maksymets, Aravind Rajeswaran, Franziska Meier. Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? In arXiv, 2023.

[3] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In RSS, 2023.

Supplementary Material

I have reviewed the supplementary materials, including the implementation details, benchmark details, and additional results.

Relation To Existing Literature

Previous methods usually predict consecutive frames to model future outcomes, which can propagate inaccuracies. LBP is a new scheme that aims to strike a balance between efficiency and long-horizon capability.

Essential References Not Discussed

No

Other Strengths And Weaknesses

No generalization experiment results are provided to prove LBP’s robustness.

Other Comments Or Suggestions

No

Author Response

We thank the reviewer for the positive feedback and recognition of our work! Below are our responses to the concerns raised.

Experimental Designs Or Analyses

The presentation of real-world results is great...It could be better to involve more contact-rich tasks.

Thanks for your suggestion! We are willing to try different classes of robotic tasks in a future version, including contact-rich tasks.

The baselines selected in real-world experiments are not strong enough. I would like to recommend adding R3M, VC-1, DP, or other policies into your comparisons.

  • In our experiments, the LCBC baseline is actually implemented using Diffusion Policies (DP) [1] with language instructions. More details can be found in Appendix A of our paper.
  • R3M [2] and VC-1 [3] primarily focus on representation learning, while LBP is a planning framework that allows flexible representation choices. For planning in latent space, we adopt DecisionNCE [4] and SigLIP [5], two recent strong methods in robotic representation learning. As shown in [4], DecisionNCE outperforms R3M, making it a sufficiently strong choice for our experiment. SigLIP has been widely adopted in many robotic frameworks like OpenVLA [6].
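To illustrate this flexibility, the planner only assumes a minimal encoder interface of roughly the following form (the interface and method names below are illustrative placeholders, not any specific library’s API):

```python
from typing import Protocol
import torch

class LatentEncoder(Protocol):
    """Illustrative interface only: any frozen vision(-language) representation
    can supply the latent space that LBP plans in."""
    def encode_image(self, image: torch.Tensor) -> torch.Tensor: ...
    def encode_text(self, text: str) -> torch.Tensor: ...

# A DecisionNCE- or SigLIP-backed encoder (or a stronger future representation)
# could thus be swapped in without changing the planner, which only consumes latents.
```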

Other Strengths And Weaknesses

No generalization experiment results are provided to prove LBP’s robustness.

  • We test LBP on the longest real-world task, shift cups, with different backgrounds and distracting objects, and find that LBP maintains robust performance in these complex scenarios, still outperforming the strongest baseline, LCBC, in the base setting. The corresponding videos have also been updated on our website (click the link at the end of our abstract).
  • In our planning framework, the generalization capability also depends on the selected latent space. If adopting stronger latent spaces, the generalization capability of LBP can be further improved.
shift cups (success rate, %)

| Method | stage 1 | stage 2 | stage 3 | stage 4 | stage 5 |
|---|---|---|---|---|---|
| LCBC (Base setting) | 85.0 | 55.0 | 48.3 | 20.8 | 0.0 |
| LBP (Distracting objects) | 87.5 | 75.8 | 48.3 | 35.0 | 9.0 |
| LBP (Different backgrounds) | 91.6 | 84.1 | 55.8 | 37.5 | 13.3 |
| LBP (Base setting) | 97.5 | 87.5 | 74.1 | 50.0 | 26.6 |

[1] Chi, et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.

[2] Nair, et al. R3M: A universal visual representation for robot manipulation. CoRL 2022.

[3] Majumdar, et al. Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence? NeurIPS 2023.

[4] Li, et al. DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning. ICML 2024.

[5] Zhai, et al. Sigmoid Loss for Language Image Pre-Training. ICCV 2023.

[6] Kim, et al. OpenVLA: An Open-Source Vision-Language-Action Model. CoRL 2024.

Reviewer Comment

The generalization capabilities of LBP are demonstrated by the additional results on shifting cups, and my misunderstanding about the baseline selection has been well addressed. However, the real-world task settings are still very simple, so I am inclined to maintain my score.

Author Comment

We sincerely appreciate your positive recognition of our work. Let us address your concerns on our real-world task settings.

On the real-world task settings

  • Due to the limited rebuttal timeframe, we are unable to include additional real-world tasks. However, we are fully committed to exploring other tasks in the future, as suggested by the reviewer.
  • Nevertheless, we would like to emphasize that the core contribution of our work is a general, efficient planning framework, LBP, which provides a recursive backward subgoal planning scheme for long-horizon tasks.
  • Thanks to this scheme, even lightweight MLP-based planners can outperform significantly larger models, as we have already observed in our experiments. This demonstrates the efficiency of LBP and indicates its potential scalability to more complex tasks. We believe this is particularly relevant at a time when much of the field focuses on scaling up models to improve long-horizon planning performance.
Review
Rating: 2

This paper focuses on latent space planning to accomplish robotic tasks. It breaks down a long-horizon, language-conditioned manipulation task by first predicting the final goal, then using the final goal to predict sub-goals moving from the goal state back toward the initial state. Once these predictors have been learned, a sub-goal/final-goal conditioned policy is learned. At inference time, the approach generates the final goal and then the other sub-goals, which are used to predict the action that is then rolled out. Experiments are performed on the LIBERO-LONG benchmark and show some improvements over baselines.

Questions For Authors

please see above

Claims And Evidence

Yes the claims seem reasonable.

Methods And Evaluation Criteria

Yes

Theoretical Claims

none

Experimental Designs Or Analyses

Yes, the LIBERO-LONG experiments are a decent choice since the dataset is focused on long-horizon tasks. However, there is nothing in the method that is unique to manipulation, so other common long-horizon tasks could also have been used for comparison (e.g., different variants of AntMaze). The real-world experimental design seems good. Finally, the ablation analysis makes sense, and the main components of the approach have been validated.

Supplementary Material

no

Relation To Existing Literature

Many prior approaches have focused on sub-goal generation followed by goal-conditioned supervised learning, which is what the proposed approach is doing. The only big difference is how sub-goal generation happens. The paper claims that following a final-goal-to-initial-state approach should be better; this idea has also been applied in prior work (see the references below).

Essential References Not Discussed

See below. Classical works on backward chaining using skill trees should also be cited [1]. There is a large body of work in this area that predates many modern deep-learning-based approaches; none of it is cited or discussed.

[1] Konidaris et al. Robot Learning from Demonstration by Constructing Skill Trees

Other Strengths And Weaknesses

Pros:

The paper focuses on an important problem. Developing robust planning approaches would be super useful for robot tasks. Overall, the paper is also well written (although some details are missing; see below).

Cons:

Few baselines: There has been a tremendous amount of work on hierarchical approaches for control tasks. Many papers have tried techniques similar to the ones proposed in this paper, but these approaches have not been compared against, and some of them have not been cited at all [1, 2, 3]. I think some of these alternative approaches, which perform sub-goal generation differently, should be compared against and properly discussed.

Another interesting baseline would be using denser language labels for the entire task. Here, the need for sub-goals is motivated by the language labels not being dense enough for the task and not providing enough semantic value (Line 180: "language descriptions often reduce to task identifiers ..."). However, given the improved capabilities of large multi-modal models, it may be possible to obtain denser labelings for a task from a long-horizon video zero-shot. If the policy performance with dense language relabeling is worse, then one could conclude that latent image goal/sub-goal embeddings are indeed crucial, but it is unclear whether this is the case with the current set of experiments.

Fixed recursive time for sub-goal generation: This seems like a very big assumption. For many tasks, the challenging part might be much shorter than the other parts; in this case, a fixed scheme might simply miss the right sub-goal for the task. This will always be a problem in the proposed approach, since it relies on no a priori information for sub-goal generation. This makes the proposed approach hard to scale to more challenging and interesting tasks.

Inference-time action selection: How does action selection happen at inference time? Once all the sub-goals are selected, does the policy use the first sub-goal to generate an action and roll it out? When does the policy switch to the next sub-goal?

[1] Hierarchical reinforcement learning with timed subgoals

[2] Zhang et al. Generating adjacency-constrained subgoals in hierarchical reinforcement learning

[3] Lei et al. Goal-conditioned Reinforcement Learning with Subgoals Generated from Relabeling

Other Comments Or Suggestions

please see above

Author Response

Experimental Designs Or Analyses

Other long horizon tasks for comparison (e.g. AntMaze etc).

Thanks for the positive comments on our experimental design. While other long-horizon tasks, such as variants of AntMaze, exist, we excluded them from our primary benchmark suite for specific reasons aligned with the focus of our work.

  • Firstly, our method is explicitly designed and evaluated for language-guided robotic control, as detailed in the introduction of our paper. AntMaze, lacking language instructions, does not allow for the validation of language-driven task execution, which is a key application of our approach.
  • Furthermore, its low-dimensional state space and explicit coordinate goals make it significantly simpler than the complex image-based tasks that are the current focus of modern robotic research, as evidenced by [1, 2, 3, 4].

[1] Black, et al. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. ICLR 2024.

[2] Tian, et al. Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation. ICLR 2025.

[3] Nair, et al. R3M: A universal visual representation for robot manipulation. CoRL 2022.

[4] Chi, et al. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. RSS 2023.

Essential References Not Discussed

Classical works on backward chaining with skill trees should also be cited.

Thanks for the suggestion! We will discuss these relevant references in the final version.

Other Strengths And Weaknesses

Lacking baselines

We wish to emphasize that LBP's primary contribution lies in enhancing language-guided robotic control, a significant challenge in the field. Our experimental evaluation includes comparisons against the most relevant SOTA robotic methods, such as SUSIE, Seer, and OpenVLA.

  1. Other hierarchical approaches for control tasks

The hierarchical RL methods mentioned by the reviewer are not appropriate baselines for LBP due to their fundamentally different application scope. They are tailored for simpler, often customized environments (such as the AntMaze benchmark) and, crucially, do not support language-conditioned tasks, making them incomparable to LBP's core functionality.

Although they are not suitable as baselines, we will include a discussion of these methods in the related work of the final version.

  2. Another interesting baseline is to use denser language labels generated by large multi-modal models.
  • As far as we know, benchmarks with dense language labels are rare in the robotics community, as collecting reliable and sufficient language annotations is both costly and impractical.
  • Moreover, such methods typically require significantly larger models to process diverse and dense language inputs while also handling out-of-domain scenarios at test time.

In contrast, our LBP method offers a more robust, efficient, and lightweight approach for subgoal specification, which can take advantage of rich observation data without dense language labels, as is the common case.

Fixed recursive time for sub-goal generation (Q1) & Inference time action selection (Q2).

We would like to address a potential misunderstanding concerning how LBP is used for action selection.

  • LBP is designed to predict (update) future subgoals at every step of task execution, as we describe in lines 272-274. This dynamic planning scheme ensures that all parts of the task horizon are covered during planning, thus addressing the "challenging part" concern (Q1).
  • At test time, we roll out actions based on the fusion of all the subgoals generated at that step, as we describe in Section 4.3 (Q2).
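For illustration, this closed-loop procedure can be summarized by the following simplified sketch (the `env`, `encoder`, `planner`, and `policy` interfaces here are placeholders, not our released code):

```python
def rollout(env, encoder, planner, policy, task_embedding, max_steps=600):
    """Simplified closed-loop sketch: subgoals are replanned at every action step."""
    obs = env.reset()
    for _ in range(max_steps):
        z_t = encoder(obs)                              # current latent state
        goal, subgoals = planner(z_t, task_embedding)   # final goal + backward subgoals, replanned each step
        action = policy(z_t, goal, subgoals)            # policy conditions on the fused (sub)goal set
        obs, done = env.step(action)                    # assumed to return (observation, done)
        if done:
            break
    return obs
```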
Review
Rating: 3

To enable real-time planning for long-horizon and multi-stage tasks, the paper proposes LBP, a backward planning scheme in the latent space. By eliminating the need for pixel-level generation, the proposed scheme significantly improves inference speed while alleviating compound errors. Additionally, it enhances on-task planning through guidance from the final goal. The evaluation is conducted on LIBERO-LONG and real-world setups.

Questions For Authors

Q1) A point of confusion. I am confused by the sentence on Lines 220-221. It states that the proposed mechanism suffers from less compounding error because it is completely supervised with ground-truth data. My question is: if ground truth is not used, then what other kinds of supervision could there be?

Q2) Differences to DiffuserLite. The proposed method reminds me of DiffuserLite [2], which also introduces an efficient coarse-to-fine planning process that transitions from long horizons to short horizons. Could you elaborate on the differences?

[2] DiffuserLite: Towards Real-Time Diffusion Planning. Zibin Dong, et al.

Claims And Evidence

The main idea of the paper is backward planning in the latent space. However, it fails to provide two key ablations: forward planning and parallel planning. The absence of these comparisons makes it difficult to conclude that backward planning is superior. In fact, it is possible that the planning order has only minor effects and the performance gain is attributable to the informative subgoals in the SigLIP and DecisionNCE latent spaces.

Methods And Evaluation Criteria

The proposed methods are aligned with the motivation and evaluated using appropriate criteria.

Theoretical Claims

I have checked the theoretical claims in this paper.

Experimental Designs Or Analyses

The paper provides extensive experiments with sufficient details.

Supplementary Material

I have reviewed all appendices.

Relation To Existing Literature

Recent goal-conditioned robot planning typically uses generative models for goal prediction. Unlike these approaches, the paper suggests that predicting latent goals enhances computational efficiency and achieves better performance. Although the proposed backward planning scheme sounds technically novel, I have concerns about its superiority due to a lack of ablation studies.

Essential References Not Discussed

As far as I know, all closely related works are cited appropriately.

Other Strengths And Weaknesses

W1) Missing key ablations. The proposed backward planning is not compared to forward planning and parallel planning using the same latent subgoals, which compromises the validity of the main contribution. In some cases, leaving causal uncertainty while determining closer steps first could be beneficial for decision making [1].

W2) Predicting final goals directly may lead to large errors. There is no evidence suggesting that distant final goal prediction is easier than progressive prediction.

W3) Inference speed is not reported. Since the authors highlight the efficiency over previous generative planners, it would be beneficial to report the inference frequency.

[1] Diffusion Forcing: Next-Token Prediction Meets Full-Sequence Diffusion. Boyuan Chen, et al.

Other Comments Or Suggestions

Please see the questions below.

Author Response

Thank you for your efforts and valuable feedback!

Claims and Evidence

Missing ablations to forward planning

We add an ablation study comparing LBP to latent forward planning. The results demonstrate that LBP significantly outperforms the forward planning paradigm in both subgoal prediction accuracy and final policy performance. The updated prediction-error results are visualized on our project page, linked at the end of our abstract.

  • LBP obtains substantially lower prediction error.
    • Train forward planner: We learn a forward planner in latent space for our real-robot tasks, which predicts the subgoal 10 steps ahead, similar to SuSIE [1]. At each step, the forward planner autoregressively generates latent subgoals toward the final goal.
    • Evaluate planning accuracy: We randomly sample 3000 data points as current states from our real-robot datasets and compute the mean squared error (MSE) between predicted subgoals and their corresponding ground truths (a short sketch of this protocol follows the tables below).
    • Visualization of prediction-error results: Please refer to Figure 5 on our website, which illustrates that forward planning struggles with long-horizon subgoal prediction due to rapid error accumulation. Given that long-horizon tasks often span hundreds of frames, this error compounding makes forward planning impractical. In contrast, LBP consistently produces accurate subgoals with a significantly lower error magnitude, maintaining reliability throughout the planning horizon.
  • LBP obtains substantially stronger long-horizon performance. The tables below show that LBP significantly outperforms latent forward planning on all the long-horizon real-robot and simulation tasks, benefiting from the recursive backward planning scheme and its subgoal prediction accuracy. Note that all settings remain the same to ensure a fair comparison.
stack 3 cups (success rate, %)

| Method | stage 1 | stage 2 |
|---|---|---|
| latent forward planning | 78.3 | 6.7 |
| LBP (ours) | 94.1 | 75.0 |

stack 4 cups (success rate, %)

| Method | stage 1 | stage 2 | stage 3 |
|---|---|---|---|
| latent forward planning | 71.6 | 21.6 | 5.0 |
| LBP (ours) | 96.6 | 77.5 | 43.3 |

move cups (success rate, %)

| Method | stage 1 | stage 2 |
|---|---|---|
| latent forward planning | 43.3 | 5.0 |
| LBP (ours) | 90.0 | 65.8 |

shift cups (success rate, %)

| Method | stage 1 | stage 2 | stage 3 | stage 4 | stage 5 |
|---|---|---|---|---|---|
| latent forward planning | 95.0 | 65.0 | 11.6 | 0.0 | 0.0 |
| LBP (ours) | 97.5 | 87.5 | 74.1 | 50.0 | 26.6 |

LIBERO-LONG (average success rate, %)

| Method | Success rate |
|---|---|
| LCBC | 73.0 |
| latent forward planning | 73.6 |
| LBP (ours) | 82.3 |
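For reference, the prediction-error evaluation mentioned above can be summarized by the following simplified sketch (the planner and data interfaces are illustrative placeholders, not our actual implementation):

```python
import torch

@torch.no_grad()
def subgoal_mse(planner, encoder, samples):
    """MSE between predicted latent subgoals and ground-truth latents.
    `samples` yields (current_frame, task_embedding, gt_subgoal_frames), with one
    ground-truth frame per predicted subgoal; all names here are illustrative."""
    errors = []
    for frame, task_emb, gt_frames in samples:
        z_t = encoder(frame)
        _, subgoals = planner(z_t, task_emb)                  # (K, D) predicted latent subgoals
        gts = torch.stack([encoder(f) for f in gt_frames])    # (K, D) ground-truth latents
        errors.append(((subgoals - gts) ** 2).mean())
    return torch.stack(errors).mean()
```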

Lastly, we are unsure which specific approach the reviewer refers to as "parallel planning". We would greatly appreciate further clarification on this and would be happy to explore the comparison if time allows.

Weaknesses

"Predicting final goals directly may lead to large errors."

  • As shown in the above prediction error results, while predicting final goals may introduce some errors, they are negligible compared to the accumulated errors in forward (progressive) planning.
  • Grounding the task objective in final goals also stabilizes subgoal predictions along the horizon, keeping subgoal prediction errors low and demonstrating the effectiveness of error control in our recursive backward planning scheme.
  • Predicting the final goal not only plays a key role in LBP but is also not as difficult as it may seem, since the final goal is relatively deterministic given the current state and task description.

"It would be beneficial to report the inference frequency."

We report the inference time of LBP and a competitive generative planner, SuSIE. Other baselines either adopt large VLA models or are not planning methods, so a latency comparison with them would not be meaningful. The results show that LBP is significantly more efficient than SuSIE. Each model is tested on a single GPU (a sketch of a typical timing protocol follows the table).

| Method | Inference time |
|---|---|
| SuSIE | 28.13 s |
| LBP | 0.013 s |
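For completeness, per-call latency on a single GPU can be measured roughly as sketched below (the measurement protocol shown is an illustrative assumption, not our exact script):

```python
import time
import torch

@torch.no_grad()
def mean_latency(plan_step, inputs, warmup=10, iters=100):
    """Average per-call latency of a planning step (illustrative protocol)."""
    for _ in range(warmup):                      # warm up kernels and caches
        plan_step(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()                 # ensure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(iters):
        plan_step(*inputs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters
```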

Questions

A little confusion on line 220-221.

We apologize for confusing the readers. We meant to say:

  • This recursive mechanism suffers from considerably fewer compounding errors, as the λ-recursion effectively reduces the number of planning steps, and the training of f_w incorporates ground-truth supervision at every recursion level.
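For illustration, one possible way to construct such level-wise training targets is sketched below (this is our simplified reading with assumed names; Eqs. (3)-(5) in the paper give the precise formulation):

```python
def backward_targets(latents, t, lam=0.5, levels=3):
    """Illustrative lambda-recursion targets for one trajectory.
    latents: per-frame latent states; t: current timestep (assumed setup)."""
    T = len(latents) - 1
    targets = [latents[T]]                       # level 0: the final goal
    for k in range(1, levels + 1):
        # each level lands a factor `lam` closer to the current state
        offset = int(round((T - t) * lam ** k))
        targets.append(latents[t + offset])
    return targets                               # f_w is supervised at every level
```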

Differences to DiffuserLite.

  • LBP is designed for real-world task settings: Unlike LBP, DiffuserLite cannot handle language-conditioned tasks and struggles in high-dimensional spaces due to the high computational cost of diffusion.
  • LBP enjoys simplicity in design: DiffuserLite trains separate diffusion models at each level, while LBP trains a single MLP to recursively predict subgoals.
  • LBP enjoys computational efficiency: DiffuserLite only uses the first prediction of each trajectory for next-level trajectory generation, which results in many redundant computations. In contrast, LBP predicts one subgoal at each level with an MLP, which is more efficient and proves effective for policy guidance.
Reviewer Comment

I appreciate the authors' efforts to address my concerns.

For "parallel planning", I meant a variant that predicts all subgoals simultaneously instead of a progressive manner. This can be achieved through multi-round joint refinement. Since this formulation always accounts for task completion, it also suffers less from compounding errors.

Due to the limited time, I will not request this ablation during the rebuttal period. As mentioned by reviewer sA8h, the proposed backward planning introduces a strong assumption about the task horizon. Different tasks across different benchmarks have various intervals between critical goals, and a completely recursive scheme may not be adaptive enough.

Author Comment

Thanks for the response and the clarification for the "parallel planning" baseline.

Comparisons to parallel planning.

We add an ablation study comparing LBP to the parallel planning baseline. To ensure a fair comparison, all experimental settings for the parallel planning baseline are strictly aligned with those of LBP, including the approximate model size and the hyperparameter setup of the MLP-based planner.

  • We report the MSE between predicted subgoals and corresponding ground truths below. Notably, LBP consistently produces more accurate and reliable subgoals across various horizons. While parallel planning does not accumulate error, it tends to predict inaccurate subgoals throughout the planning horizon. This can be attributed to its challenging training objective, which requires supervising all subgoals simultaneously and thus demands higher model capacity and significantly increased computational cost. We have also updated Figure 5 on our website with the parallel planning errors.
Subgoal prediction errors (MSE) on stack 3 cups

| Task progress | 0.125 | 0.25 | 0.5 | 1.0 |
|---|---|---|---|---|
| latent forward planning | 0.015±0.001 | 0.027±0.002 | 0.050±0.008 | 0.098±0.025 |
| latent parallel planning | 0.369±0.288 | 0.250±0.123 | 0.102±0.166 | 0.226±0.276 |
| LBP | 0.018±0.002 | 0.018±0.003 | 0.016±0.002 | 0.014±0.003 |

Subgoal prediction errors (MSE) on move cups

| Task progress | 0.125 | 0.25 | 0.5 | 1.0 |
|---|---|---|---|---|
| latent forward planning | 0.039±0.004 | 0.105±0.037 | 0.224±0.066 | 0.353±0.236 |
| latent parallel planning | 0.027±0.100 | 0.018±0.141 | 0.091±0.131 | 0.082±0.018 |
| LBP | 0.024±0.013 | 0.044±0.024 | 0.036±0.011 | 0.020±0.004 |

Subgoal prediction errors (MSE) on stack 4 cups

| Task progress | 0.125 | 0.25 | 0.5 | 1.0 |
|---|---|---|---|---|
| latent forward planning | 0.015±0.000 | 0.036±0.009 | 0.154±0.035 | 0.489±0.064 |
| latent parallel planning | 0.086±0.286 | 0.073±0.410 | 0.135±0.360 | 0.294±0.055 |
| LBP | 0.009±0.001 | 0.014±0.003 | 0.016±0.002 | 0.014±0.001 |

Subgoal prediction errors (MSE) on shift cups

| Task progress | 0.125 | 0.25 | 0.5 | 1.0 |
|---|---|---|---|---|
| latent forward planning | 0.173±0.031 | 1.580±0.313 | 5.934±0.158 | 4.292±0.355 |
| latent parallel planning | 1.074±0.659 | 0.850±0.331 | 0.575±0.195 | 0.636±0.050 |
| LBP | 0.085±0.013 | 0.223±0.035 | 0.202±0.079 | 0.319±0.106 |
  • We further evaluate the policy performance of latent parallel planning on both real-world and simulation benchmarks. The results show that LBP achieves significantly better performance across all those long-horizon tasks.
stack 3 cups (success rate, %)

| Method | stage 1 | stage 2 |
|---|---|---|
| latent forward planning | 78.3 | 6.7 |
| latent parallel planning | 75.0 | 10.0 |
| LBP (ours) | 94.1 | 75.0 |

stack 4 cups (success rate, %)

| Method | stage 1 | stage 2 | stage 3 |
|---|---|---|---|
| latent forward planning | 71.6 | 21.6 | 5.0 |
| latent parallel planning | 75.0 | 30.0 | 10.0 |
| LBP (ours) | 96.6 | 77.5 | 43.3 |

move cups (success rate, %)

| Method | stage 1 | stage 2 |
|---|---|---|
| latent forward planning | 43.3 | 5.0 |
| latent parallel planning | 55.0 | 6.7 |
| LBP (ours) | 90.0 | 65.8 |

shift cups (success rate, %)

| Method | stage 1 | stage 2 | stage 3 | stage 4 | stage 5 |
|---|---|---|---|---|---|
| latent forward planning | 95.0 | 65.0 | 11.6 | 0.0 | 0.0 |
| latent parallel planning | 96.6 | 48.3 | 8.3 | 0.0 | 0.0 |
| LBP (ours) | 97.5 | 87.5 | 74.1 | 50.0 | 26.6 |

LIBERO-LONG (average success rate, %)

| Method | Success rate |
|---|---|
| latent forward planning | 73.6 |
| latent parallel planning | 76.6 |
| LBP (ours) | 82.3 |

"This recursive scheme may not be adaptive enough."

There appears to be a misunderstanding about how LBP operates during inference; we have provided an explanation in our response to reviewer sA8h. We wish to emphasize that the subgoals planned by LBP are highly adaptive rather than fixed, allowing the model to effectively capture future guidance across various horizons, for the reasons below:

  • Adaptive training: We train the planner over varying horizons and with λ-recursion subgoal supervision as in Eq. (5). The λ-recursion scheme allows the planner to predict subgoals adaptively according to the remaining task progress rather than at fixed planning steps. Sampling from trajectories of varying horizons helps it generalize across different temporal contexts at inference time. More details can be found in Sections 4.2 and 4.4.
  • Adaptive inference: LBP replans at each step, enabling the generated subgoals to dynamically cover the entire task horizon and provide sufficient guidance for policy extraction.
  • More adaptive than existing works: Compared to recent planning methods that rely on fixed planning steps [1,2] or lack the ability to replan [3], LBP can update subgoals adaptively according to task progress at every action step, which contributes to its strong performance on long-horizon tasks.

For further clarification, we provide an illustrative video on our website, to demonstrate how the subgoals update adaptively at inference time.

[1] Black, et al. Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. ICLR 2024.

[2] Tian, et al. Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation. ICLR 2025.

[3] Du, et al. Learning Universal Policies via Text-Guided Video Generation. NeurIPS 2023.

Final Decision

The paper proposes a method for generating latent-space subgoals for long-horizon language-conditioned robotic manipulation tasks by planning backwards from the goal. The evaluation on the LONG subset of the LIBERO benchmark shows the promise of this approach.

Reviewers see merits in the proposed approach, including its novelty and experimental design. The biggest outstanding concerns are the coverage of related work and the scarce experiments on physical robot hardware. Alternatively, as one of the reviewers mentioned, since LBP doesn't seem to rely on any robotics-specific task features, it could be evaluated more extensively by adding non-robotics long-horizon benchmarks. That said, there isn't a pre-existing method that clearly undermines LBP's novelty, so a more extensive coverage of related work, as the authors promised in the rebuttals, should address the former concern. Also, the evaluation on the LIBERO-LONG benchmark is convincing enough, even though a broader evaluation would be more desirable.

Given the above considerations, on balance the metareviewer recommends this work for publication.