PaperHub
6.4 / 10
Poster · 4 reviewers
Ratings: 4, 4, 3, 5 (lowest 3, highest 5, standard deviation 0.7)
Confidence: 3.8
Novelty: 3.0
Quality: 2.3
Clarity: 2.5
Significance: 2.8
NeurIPS 2025

UniDomain: Pretraining a Unified PDDL Domain from Real-World Demonstrations for Generalizable Robot Task Planning

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

UniDomain pre-trains a PDDL domain from robot manipulation demonstrations and applies it for online robotic task planning.

Abstract

Keywords
Robotics, Task Planning, Symbolic Planning, Vision Language Models, Robotic Knowledge Graph

Reviews and Discussion

Official Review
Rating: 4

This paper proposes a method for creating a large unified PDDL domain from real robot demonstrations. The domain generation process leverages VLMs and LLMs to ground predicates and to verify the syntax and solvability of the PDDL. Experiments in four domains compare the proposed approach to several baselines and ablations.

Strengths and Weaknesses

Strengths

Novelty. The paper does seem quite novel. I am not aware of other work that attempts to construct a large unified PDDL domain from real robot demonstrations and I really like that idea.

Baselines and ablations. The baselines and ablations in experiments are fairly comprehensive (although I have concerns about the meaning of the metrics, see below).

Well-written paper. I have a few concerns related to transparency (see below), but otherwise the paper is well-written and easy to read. I did not notice many sentence-level issues. I think the paper would be accessible to a broad audience.

Related work. The related work section is thorough and the connections between UniDomain and previous work are clear.

Weaknesses

Non-executable task plans. My biggest concern is that the plans produced by UniDomain are not executable, i.e., there is no way to directly map the task plans to actions. This seems like a fundamental limitation with several implications:

  • From a user’s perspective, UniDomain does not seem like it is ready for use. I don’t see a way that I could create a new task / domain and actually solve it with UniDomain, unless I am willing to manually map the outputs to executable robot actions, which may be very difficult or impossible.
  • From an evaluation perspective, all of the reported metrics are more subjective than they initially appear. My understanding is that all of the metrics ultimately depend on a human judge. (This is buried in Appendix C: “We therefore reported results based on human evaluation to ensure their reliability.”) The optimality / cost metrics are particularly fraught: for example, what prevents every task from being solved with a single long-horizon operator? Or many small operators? There is arguably no ground-truth cost here, but the paper implies that there is.
  • From a presentation perspective, the paper is not at all transparent about this limitation. For example, Figure 2(f) (“Execution”) and the robot videos heavily imply that UniDomain is solving the tasks shown (e.g., “UniDomain solves complex, unseen tasks in a zero-shot manner”), when that is not really the case.

Technical details.

  • Not all robot plans can be easily expressed in PDDL. It is not clear what happens when UniDomain attempts to synthesize PDDL for a demonstration that really shouldn’t use it. From the paper’s description, it sounds like there is a potentially infinite loop that repeats until a domain is synthesized. It may be better to automatically determine if a domain is PDDL-able first.
  • Fast Downward already performs “predicate and operator filtering” and does so in a way that maintains solvability. I’m confused about the predicate and operator filtering proposed here and how it apparently leads to improved task success rates (Figure 5). I would have thought that removing predicates and operators could only improve planning time, not success rate.
  • The third claimed contribution is a “novel online task planner”, but I don’t think there is actually novelty in the planner.

Code promised but not delivered. The paper indicates that code is available at https://unidomain.github.io/ but it was not there during my review. I was also hoping that the final unified domain itself would be provided, but it was not.

Size of unified domain. The paper emphasizes in several places that 12,393 demonstrations are used from DROID, leading to 3,137 operators, 170 semantic categories, 2,875 predicates, and 16,481 causal edges. These numbers are a bit misleading in my opinion because what ultimately matters is the size of the merged domain. Many of the operators and predicates in the original set are redundant and it would be easy to further inflate the reported numbers by simply copying the original demonstrations. Furthermore, in experiments, it says: “40 DROID demonstrations are retrieved to construct a meta-domain for all 254 evaluation tasks, which includes 78 predicates and 61 operators.” It then refers to Appendix A.2, which says: “For planning evaluation, we retrieve and fuse 40 kitchen-related atomic domains from this unified set to form a compact meta-domain with 106 predicates, 61 operators, and 332 causal edges.” So there is some conflicting information about the number of predicates. Also, at the end of the day, only a subset of UniDomain has an empirical evaluation, so it’s not yet clear if there is value to the entire set.

Questions

  1. Is there an easy way to resolve my concerns about non-executable task plans?
  2. Can you provide more information about the procedure used by the human evaluators?
  3. Can you clarify why predicate and operator filtering improves success rates?

Limitations

yes

Justification of Final Rating

I am very borderline for the reasons mentioned in my review and follow up discussions, but I ultimately raised my score because the authors promise to clarify limitations in the writing.

Formatting Issues

N/A

Author Rebuttal

We thank the reviewer for the thoughtful and detailed review. We appreciate the positive feedback on the novelty, clarity, and thoroughness of the paper. The reviewer raises several important concerns, primarily regarding plan executability, evaluation methodology, and implementation transparency. In this response, we will address each in detail to clarify our framework's intended use, justify our evaluation, and provide the missing technical details.


1. On Usability of the Approach and Objectiveness of Results

Q: From a user’s perspective, UniDomain does not seem like it is ready for use... unless I am willing to manually map the outputs to executable robot actions.
A: We would like to clarify UniDomain's role within a complete robotics system. UniDomain is designed as a high-level task planner, and its separation from low-level motor control is a deliberate feature that enables a generalizable and scalable robot system. This approach is not a limitation but a strength, creating a powerful division of labor.

The intended integration is automated, not manual:

  • UniDomain solves the long-horizon reasoning problem, producing a plan of symbolic operators like (pick_from_table robot spoon table).
  • Each symbolic operator is straightforwardly translated into a natural-language command, e.g., "Pick up the spoon from the table."
  • This command is then passed to any modern, language-conditioned Vision-Language-Action (VLA) model, which executes the physical motion.

Crucially, UniDomain is designed for direct compatibility with this paradigm. Our symbolic operators are learned from the same real-world demonstration data used to train state-of-the-art VLAs, ensuring seamless integration. This allows the system to leverage the best of both worlds: UniDomain handles the "what to do" (strategic planning and reasoning), while VLAs handle the "how to do it" (dexterous motor execution).
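
For illustration, a minimal sketch of such a template-based translation step (the template table, function name, and covered operators are our own assumptions, not the paper's):

# Hypothetical templates; the operator names follow examples from this rebuttal.
TEMPLATES = {
    "pick_from_table": "Pick up the {0} from the {1}.",
    "place_in_drawer": "Place the {0} in the {1}.",
    "open_drawer": "Open the {0}.",
}

def operator_to_command(step: str) -> str:
    """e.g. '(pick_from_table robot spoon table)' -> 'Pick up the spoon from the table.'"""
    name, _robot, *args = step.strip("()").split()
    return TEMPLATES[name].format(*args)

# The resulting instruction string is what a language-conditioned VLA policy
# would receive as its command.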


Q: From an evaluation perspective, all of the reported metrics are more subjective than they initially appear... ultimately depend on a human judge.
A: We respectfully disagree with the characterization of our evaluation as subjective. To make our evaluation as objective and scalable as possible, we primarily used an LLM‑as‑a‑judge, with human verification to ensure reliability. The vast majority of failures are not subjective but are clear violations of physical or commonsense constraints that both an LLM and a human can easily identify. Common, objective failure cases include:

  • Attempting to place an object into a bowl that is still on a vertical drying rack.
  • Attempting to pick an object from a stack without first removing the objects on top of it.
  • Attempting to interact with a container (e.g., drawer, pot) without first opening it.

The table below reveals high correlation between LLM-only judgement results and our reported numbers:

Method | LLM Score Mean ↑ | LLM Score Std | Human Score
VLM‑CoT | 38.11 | 2.20 | 40
Code‑as‑Policies | 51.33 | 2.24 | 51
ISR‑LLM | 52.89 | 2.32 | 52
UniDomain | 84.13 | 2.17 | 85

For complete transparency, we will release all raw evaluation records, including the generated plans and the LLM and human judgments, in the final version of the paper.


Q: The optimality / cost metrics are particularly fraught... What prevents every task from being solved with a single long‑horizon operator? ... There is no ground-truth cost...
A: While we use unit costs for simplicity, we argue that this cost design is valid and our framework readily supports more general cost functions.

  • Validity of Unit Costs: We ensure a consistent semantic granularity (one primitive skill per operator) through our domain‑learning process: Energy‑based keyframe extraction identifies transition points between primitive skills, ensuring that long-horizon behaviors are chopped into primitive skills; If a single skill is split across multiple keyframe pairs, our VLM‑based domain constructor (guided by prompt design) maps them to the same operator, preventing fragmentation.

  • Supporting Better Costs: Since we learn from real demonstrations, we can easily estimate costs like execution time (duration between keyframes), trajectory length, or motion energy from data. A PDDL solver can then directly use these non-unit costs to find a truly optimal plan.
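
For reference, here is a sketch of how such learned, non-unit costs are conventionally encoded in PDDL; the operator body and the cost value are illustrative, not taken from the learned domain:

(:requirements :strips :action-costs)
(:functions (total-cost))

(:action pick_from_table
    :parameters (?r ?o ?t)
    :precondition (and (hand_free ?r) (on_table ?o ?t) (table ?t))
    :effect (and (holding ?r ?o) (not (hand_free ?r)) (not (on_table ?o ?t))
                 (increase (total-cost) 6))   ; e.g., 6 s of measured execution time
)

; and, in the problem file:
(:metric minimize (total-cost))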


Q: Figure 2(f) (“Execution”) and the robot videos heavily imply that UniDomain is solving the tasks shown...
A: Figure 2(f) and the videos show the physical execution of the task plan generated by UniDomain. It is a standard methodology in task planning research to isolate the evaluation to the high-level reasoning capability, without confounding results from an imperfect low-level controller. Following this standard, for our evaluation of both UniDomain and all baselines, the low-level execution was performed via human teleoperation, representing a perfect low-level policy. UniDomain does indeed solve the complex planning tasks in a zero-shot manner; the teleoperation only enacts the solution. We will clarify this in the final manuscript.


2. On Technical Details and Other Concerns

Q: Not all robot plans can be easily expressed in PDDL... what happens when UniDomain attempts to synthesize PDDL for a demonstration that really shouldn’t use it?
A: We agree that PDDL is best suited for problems with clear semantic structure, which is the focus of UniDomain (high-level task planning). For demonstrations that are not "PDDL-able" (e.g., continuous drawing, playing), we can detect this upfront by checking if the video's energy curve lacks distinct extrema or if the language instruction is vague (e.g., "drawing stuff", "playing around"). Furthermore, the closed-loop refinement is not an infinite loop; it is capped at L=5 iterations. If it fails to converge, which is rare (see Figure 4a), the demonstration can be discarded.
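
As an illustration of the upfront check mentioned here, a small sketch (our own code, not the authors'; the energy signal and the prominence threshold are assumptions) that flags videos whose energy curve lacks distinct extrema:

import numpy as np
from scipy.signal import find_peaks

def has_distinct_extrema(energy: np.ndarray, prominence_frac: float = 0.05) -> bool:
    """energy: per-frame energy values of one demonstration video."""
    span = float(energy.max() - energy.min())
    if span == 0.0:
        return False  # flat curve: no usable structure
    prom = prominence_frac * span
    peaks, _ = find_peaks(energy, prominence=prom)
    troughs, _ = find_peaks(-energy, prominence=prom)
    return len(peaks) + len(troughs) > 0

# A demonstration failing this check (or carrying a vague instruction such as
# "playing around") would be skipped rather than forced into PDDL.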


Q: I’m confused about the predicate and operator filtering proposed here and how it apparently leads to improved task success rates...
A: There is a potential misunderstanding. The filtering is not for the PDDL solver, but for the LLM-based problem generator. A large meta-domain can overwhelm an LLM, making it difficult to ground the specific task and generate a correct problem file. Our two-stage process uses a first pass to identify a small, task-relevant subset of predicates and operators. The second pass then uses this filtered, minimal domain to generate the final problem. This focused context significantly improves the reliability and accuracy of the LLM's problem generation, which in turn increases the final planning success rate by 11%, as shown in the ablation in Figure 5.
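
To make the two-stage process concrete, here is a sketch of the idea (the function, prompts, and data layout are ours, not the authors'; llm stands for any text-in/text-out model call):

from typing import Callable, Dict

def two_pass_problem_generation(
    instruction: str,
    scene: str,
    operators: Dict[str, str],    # operator name -> PDDL text
    predicates: Dict[str, str],   # predicate name -> PDDL declaration
    llm: Callable[[str], str],
) -> str:
    # Pass 1: ask the LLM which domain elements this task actually needs.
    reply = llm(
        f"Task: {instruction}\nScene: {scene}\n"
        f"Operators: {', '.join(operators)}\nPredicates: {', '.join(predicates)}\n"
        "Reply with a comma-separated list of the names needed for this task."
    )
    keep = {name.strip() for name in reply.split(",")}
    # Pass 2: generate the problem file against the filtered, minimal context.
    minimal = "\n".join(
        [text for name, text in predicates.items() if name in keep]
        + [text for name, text in operators.items() if name in keep]
    )
    return llm(
        f"Domain elements:\n{minimal}\n"
        f"Write the PDDL problem file for: {instruction}\nScene: {scene}"
    )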


Q: The third claimed contribution is a “novel online task planner”, but I don’t think there is actually novelty...
A: The novelty lies specifically in the two-stage problem generation process described above. Using a first pass of problem generation as a filtering mechanism to create a minimal context for final problem generation is, to our knowledge, a new technique for improving the reliability of LLM/VLM-based PDDL planners. Figure 5 shows this is not a minor trick; removing it causes a significant drop, by 11%, in performance.


Q: Size of unified domain… operators and predicates are redundant... conflicting predicate counts… only a subset evaluated.
A:

  • Redundancy Control: We took significant care to minimize redundancy in the unified domain. Our prompt designs enforce strict naming conventions for operators and predicates, causing functionally similar elements to merge into a single node in the knowledge graph. This process reduced an initial set of over 24,199 potential operators to the 3,137 unique ones reported.
  • Predicate Count: The "106 predicates" in the appendix refers to the total nodes in the graph, including negated predicates (e.g., (is_open door) and (not (is_open door))). The "78 predicates" in the main text refers to the number of unique predicate names. We will make this consistent.
  • Value of the Full Set: We have now conducted new experiments with a fully automated pipeline that retrieves relevant domains from the entire unified set. For a given task class, an LLM infers relevant actions from the instruction. We then use sentence-embedding similarity to find the top-k best-matching operators in our unified domain. The atomic domains containing these matching operators are then selected and passed to our Domain Fusion pipeline to construct a meta-domain for online planning. This fully automated approach achieves 83% success and 80% optimality (K=0) on the existing evaluation set, and 99% success and 99% optimality (K=0) on 10 new, diverse tasks, providing strong empirical evidence for the value of the entire knowledge graph.
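
For concreteness, a sketch of this retrieval step (our own code; the embedding model, top_k, and function names are assumptions, with sentence-transformers as one possible backend):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_operator_names(inferred_actions, operator_names, top_k=40):
    """Rank unified-domain operators by similarity to the LLM-inferred actions."""
    query_emb = model.encode(inferred_actions, convert_to_tensor=True)
    op_emb = model.encode(operator_names, convert_to_tensor=True)
    # best similarity of each operator to any inferred action
    sims = util.cos_sim(query_emb, op_emb).max(dim=0).values
    ranked = sims.argsort(descending=True)[:top_k]
    return [operator_names[int(i)] for i in ranked]

# The atomic domains containing the returned operators are then fused into a
# meta-domain, as described above.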

Q: Code promised but not delivered.
A: We apologize for this oversight in the submitted manuscript. The statement and link regarding code availability were included in error during the paper's preparation. We are fully committed to reproducibility and will open-source the project, including the learned unified domain, the automated retrieval and domain fusion pipelines, and the online planner, with the final version of the paper.


Thank you again for the constructive feedback; we believe these clarifications and new results strengthen the paper substantially.

Comment

Thank you to the authors for their response. During the discussion period, I would like to further clarify the following points regarding "Usability of the Approach and Objectiveness of Results."

  1. I appreciate the explanation that the long-term vision for this work is to ground the operators using something like a VLA. That is worth saying explicitly in the paper. It helps my understanding quite a bit. It is also important to clarify in the paper that this is forward-looking and not something that has already been done. To clarify, none of the results in the paper actually use a VLA, right?
  2. Re: "Following this standard, for our evaluation of both UniDomain and all baselines, the low-level execution was performed via human teleoperation, representing a perfect low-level policy." Can you cite published works re: the claim that this is standard practice, especially the teleoperation part? (Also, I am not happy that this was omitted from the paper!)
  3. Regarding evaluation, it helps to see that the scoring was systematic and I appreciate the promise to release all raw results. I remain concerned that these results are limited in terms of their informativeness because they are not grounded in the environment at all. Ideally, we would be able to measure the actual task performance of the robot, and that is not yet the case in the paper. But let me know if my understanding is incorrect.
  4. I did not fully understand the response to "What prevents every task from being solved with a single long‑horizon operator?". The response mentions "primitive skills". What are you referring to here? I don't see any mention in the paper of primitive skills.
Comment

1. Our Scope & Role of VLAs

You are correct: we do not use a Vision-Language-Action (VLA) model to execute the plans in this work.

  • Scope of this work: UniDomain is evaluated purely as a high-level planner; no Vision-Language-Action (VLA) controller is involved in our experiments.
  • Envisioned use of UniDomain: A hierarchical system: UniDomain decides what to do (optimal sequence of symbolic operators); A separate low-level policy—e.g., a VLA—would later decide how to do it and execute each operator.
  • Why the decomposition matters: Isolating task planning from motor control lets each layer specialize: UniDomain handles compositional task generalization and complex constraints, while a VLA provides dexterous, adaptive execution. We envision this hierarchical decomposition as the most practical path to general-purpose robotics.
  • Manuscript fix: We will revise the introduction to state this separation explicitly and position VLA integration as promising future work, beyond this paper’s contributions.

2. Evaluation Methodology: Tele-operation as Oracle Executor

We sincerely apologize for the oversight in not making the use of teleoperation clear in the initial submission. We will rectify this by providing full transparency in the revised paper.

  • Evaluation protocol:
    • Numbers in Tables: planner → symbolic plan → VLM judge assesses feasibility of the plan from the initial scene → human rater double-checks and corrects the judgement.
    • Demo videos: planner → symbolic plan → trained human operator tele-operates the robot, terminating execution if action is infeasible.
  • Rationale: Assuming a perfect low-level policy allows us to measure the planning performance and efficiency without conflating results with execution-level imperfection. Tele-operation acts as an oracle executor for real-world visualization. We originally considered displaying only before/after key-frames of each action, but static snapshots make it hard to see what actually happened between states; uninterrupted tele-op videos give far clearer, more intuitive evidence of plan feasibility and optimality.
  • Recent top-tier work follows the same practice:
    • Tele-op/Human-op in real-world execution: [1-2].
    • Simulated perfect execution: [3-5].
    • Text-only abstraction: [6-9].
  • Manuscript fix: The revision will add a subsection “Evaluation protocol” that will (1) explicitly state our evaluation protocol, (2) claim upfront we used tele-op for real-world visualization, (3) cite the above works to demonstrate that our evaluation is rigorous and in line with community norms.

3. Informativeness of a “Non-Grounded” Evaluation

  • What is grounded: Each plan is checked against the actual initial scene image plus a set of commonsense pre-/post-conditions (e.g., “drawer must be open before a pick”). A plan is accepted only if all of its pre-conditions hold and its post-conditions remain logically consistent; thus, the evaluation is grounded in the visual scene as well as physical and operational commonsense.
  • Why physical roll-out is not necessary: All planning failures are detectable at the symbolic level, for example: pick from a closed drawer; grasp a second item while the arm is occupied; place an object onto an up-turned bowl on the drying-rack. These violations are objective planning faults, wholly independent of execution imperfection.
  • What is irrelevant: Measuring full end-to-end robot success would assess an integrated system; our paper focuses on high-level planning, assessing logical pre- and post-conditions for detecting failure.
  • Manuscript fix: We will release all generated plans, VLM judgements, and human audits, and state explicitly that our metrics evaluate UniDomain as the high-level planner, not as an integrated robot system.

4. “Primitive Skills” & Operator Granularity

"Primitive skills", or "primitive actions", in a common terminology used in task planning / task and motion planning research [10-13].

  • Definition: A primitive skill is an atomic, specialized skill for manipulating one object or one object–relation tuple (e.g., "pick an object", "open a container", "pour from a cup A to cup B"), inducing atomic, local effects, and acting as a building block for longer tasks.
  • Preventing mega-operators: Our energy-based key-frame extraction (Sec 4.1) detects local extrema in frame-energy curves—either a peak or a trough—that consistently occurs at primitive-skill boundaries (e.g., when a move ends and a pick begins). We split at these peaks, so every segment—thus every learned operator—covers exactly one primitive skill. (An illustrative sketch of this segmentation appears after this list.)
  • Resulting granularity: Operators are atomic, never spanning an entire long-horizon behaviour.
  • Manuscript fix: Section 4 will formally define “primitive skill,” link it to PDDL operators, and highlight that key-frame segmentation guarantees the desired granularity.
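
Purely for illustration, a sketch of this kind of extrema-based segmentation (our own code, not the authors'; the paper computes a frequency-space energy, whereas this stand-in uses mean grayscale intensity, and the window size is an assumption):

import numpy as np

def frame_energy(frames: np.ndarray) -> np.ndarray:
    """frames: (T, H, W) grayscale video; stand-in energy = mean intensity per frame."""
    return frames.reshape(len(frames), -1).mean(axis=1)

def extrema_keyframes(energy: np.ndarray, window: int = 10) -> list:
    """Frames that are local maxima or minima of the energy within +/- window."""
    keys = []
    for t in range(len(energy)):
        lo, hi = max(0, t - window), min(len(energy), t + window + 1)
        if energy[t] in (energy[lo:hi].max(), energy[lo:hi].min()):
            keys.append(t)
    return keys

# Consecutive keyframes delimit candidate primitive-skill segments, each of
# which becomes one operator: segments = list(zip(keys[:-1], keys[1:]))
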
Comment

Thanks for the additional response. I am clear on most points, but I just want to further clarify your point about primitive skills.

Our energy-based key-frame extraction (Sec 4.1) detects local extrema in frame-energy curves—either a peak or a trough—that consistently occurs at primitive-skill boundaries (e.g., when a move ends and a pick begins). We split at these peaks, so every segment—thus every learned operator—covers exactly one primitive skill.

Operators are atomic, never spanning an entire long-horizon behaviour.

Section 4 will formally define “primitive skill,” link it to PDDL operators, and highlight that key-frame segmentation guarantees the desired granularity.

I don't understand how this segmentation strategy "guarantees the desired granularity." What makes a certain granularity "desired"? How can you "guarantee" that desired granularity? These are deep and difficult questions. The energy-based key-frame extraction method is sensible but it is only one of several possible choices. Furthermore, depending on hyperparameters, you may end up with different granularities even with the same extraction method.

I feel this point is important because it underpins all the empirical results related to plan cost (number of primitive skills).

Comment

We thank the reviewer for the insightful follow-up questions. The concept of "desired granularity" is indeed deep, and we appreciate the opportunity to clarify our thoughts and provide further evidence.
What makes a certain granularity "desired"?
We regard a granularity as "desired" if it satisfies two key properties: it is human-aligned and planning-effective.

  1. Human-Aligned: A granularity is desirable if it aligns with human intuition, i.e., each learned operator corresponds to what a person would consider a single, atomic action.

    • Evidence: We validated this by comparing our algorithm's extracted keyframes against those annotated by humans on a subset of the DROID dataset (performed during development) and the AgiBot dataset [1] which comes with human keyframes (performed during rebuttal). We found that the human-annotated keyframes consistently appear at or very near the local energy extrema identified by our method. We will include this human-correlation analysis in the final paper.
  2. Planning-Effective: A granularity is desirable if minimizing the number of operators in a plan under this granularity reliably serves as a proxy for minimizing a real-world cost, such as execution time.

    • Evidence: We conducted a new analysis to verify this. We first measured the real-world execution time for each of the 61 operators in our meta-domain from the source demonstrations. We then provided these time costs to the Fast Downward planner and had it solve our 100 evaluation tasks, optimizing for minimal total time instead of minimal steps. The result was that the time-optimal plans were identical to the step-optimal plans in 100% of the tasks. This provides strong evidence that the granularity produced by our method is meaningful for efficient, real-world planning, thus "desired".

How can we "guarantee" it and how is the robustness to hyperparameters?
We concede that "guarantee" was a strong term. Our intent was to express that our segmentation method is not arbitrary but is grounded in both conceptual and empirical validation, consistently producing a granularity that meets the "desired" properties stated above.

  • How our method works: As discussed in response to other reviewers, our keyframe extraction method captures significant shifts in the image's frequency-space energy, which correlate strongly with semantic transitions in the video. The closed-loop verification & revisions further align the granularity of operators with the concept of primitive skills, using commonsense knowledge embedded in LLMs.
  • Hyperparameter sensitivity: To address your concern, we performed a sensitivity analysis on the keyframe extraction window size (which is the key hyperparameter). We re-generated the operators of 50 atomic domains, with the window size set to 0.5x and 2.0x of the original value. The total number of learned operators changed by less than 2% (from 105 to 107 and 104, respectively). This demonstrates that the granularity of operators produced by our pipeline is robust to hyperparameter changes.

An important note: would granularity inconsistency, if it exists but goes undetected, undermine UniDomain's effectiveness?
No. UniDomain can seamlessly adapt to optimize for more direct metrics. As demonstrated in our time-cost analysis, UniDomain can be easily extended to support direct time-optimal planning by simply providing the measured execution time for each operator to the planner. By optimizing for a real-world metric like time, the influence of how skills are segmented can be completely bypassed, making the approach robust and future-proofing it against this potential issue.

Reference:
[1] AgiBot‑World‑Contributors, Bu, Q., Cai, J., Chen, L., Cui, X., Ding, Y., … Zhu, J. (2025). AgiBot World Colosseo: A large‑scale manipulation platform for scalable and intelligent embodied systems. In Proceedings of the 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).

Comment

Thank you for the detailed response and additional analysis. The planning-effective results are particularly intriguing. Returning to my original question, though:

"What prevents every task from being solved with a single long‑horizon operator?"

By this paper's logic, if we are to assume that VLAs are capable of executing operators, then a single long-horizon operator is all we need. Of course, a VLA in practice may not work well over long horizons, but that reflects my larger point: it really doesn't make sense to learn and evaluate high-level task planning models in isolation, without considering how the task plans will be grounded.

I also want to emphasize that my main concerns were never about lacking analysis, but rather about lacking transparency and overly strong claims in the paper. Should the paper be accepted, I think it is absolutely necessary to revise the writing so that the following points are clear:

  1. The paper is about "task planning only" and assumes that something like a VLA will be able to execute any operator that might come out of UniDomain
  2. The empirical results related to correctness are based on a combination of human and LLM judgment
  3. The empirical results related to plan length / cost are based on an assumption that the energy-based segmentation is "correct" (and following our most recent discussion, it's not obvious what "correct" means in this context)
  4. The real-robot results all use teleoperation
Comment

Tele-operation / Oracle-Executor Evaluation Literature

  • [1] Pierre Sermanet, Tianhe Ding, Jianming Zhao et al. RoboVQA: Multimodal Long-Horizon Reasoning for Robotics. Proc. IEEE ICRA 2024. — Evaluates high-level reasoning with tele-operation as oracle low-level policy.
  • [2] Weijie Zhang, Mei Wang, Guangyi Liu et al. Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks. arXiv 2503.21696, 2025. — Human operators execute symbolic plans to isolate reasoning quality.
  • [3] Gautier Dagan, Frank Keller & Alex Lascarides. Dynamic Planning with a Large Language Model. Proc. NeurIPS Language Gamification Workshop 2024. — Conducts planning entirely in simulation with perfect low-level execution.
  • [4] Guanqi Chen, Lei Yang, Ruixing Jia et al. Language-Augmented Symbolic Planner for Open-World Task Planning. Proc. Robotics: Science & Systems 2024. — Simulator guarantees action success; evaluation focuses on the planner.
  • [5] Zirui Zhao, Wee Sun Lee & David Hsu. Large Language Models as Commonsense Knowledge for Large-Scale Task Planning. NeurIPS 2023. — Studies commonsense priors in simulated ALFWorld with perfect execution.
  • [6] Keisuke Shirai, Cristian C. Beltrán-Hernández, Masashi Hamaya et al. Vision-Language Interpreter for Robot Task Planning. Proc. IEEE ICRA 2024. — Validates plans in a text-only setting, abstracting away physical execution.
  • [7] Elliot Gestrin, Marco Kuhlmann & Jendrik Seipp. Towards Robust LLM-Driven Planning from Minimal Text Descriptions. ICAPS Human-Aware Explainable Planning Workshop 2024. — Purely textual environment; no physical actuation layer.
  • [8] Zhehua Zhou, Jiayang Song, Kunpeng Yao et al. ISR-LLM: Iterative Self-Refined Large Language Model for Long-Horizon Sequential Task Planning. Proc. IEEE ICRA 2024. — Plan validation via symbolic pre-/post-condition checks assuming perfect execution.
  • [9] Zeyu Yu, Yujie Yuan, Taizhou Xiao et al. Generating Symbolic World Models via Test-time Scaling of Large Language Models. TMLR 2025. — Generates world models and evaluates planners in text-only simulations.

Primitive-Skills for Task Planning or Task-and-Motion-Planning (TAMP) Literature

  • [10] Leslie P. Kaelbling & Tomás Lozano-Pérez. Hierarchical Task and Motion Planning in the Now. Proc. IEEE ICRA 2011. — Seminal real-time TAMP framework introducing atomic motion primitives.
  • [11] Zhanpeng Wang, Caelan R. Garrett, Leslie P. Kaelbling & Tomás Lozano-Pérez. Learning Compositional Models of Robot Skills for Task and Motion Planning. IJRR 2021. — Learns probabilistic sensorimotor primitives and composes them for long-horizon tasks.
  • [12] Mengyuan Guo & Matthias Bürger. Geometric Task Networks: Learning Efficient and Explainable Skill Coordination for Object Manipulation. IEEE T-RO 2021. — Coordinates low-level geometric skills via an interpretable task network.
  • [13] Rémi Strudel, Alexander Pashevich, Ivan Kalevatykh, Ivan Laptev, Josef Šivic & Cordelia Schmid. Learning to Combine Primitive Skills: A Step Towards Versatile Robotic Manipulation. Proc. IEEE ICRA 2020. — RL task planner that sequences vision-conditioned primitive skills for versatile manipulation.
Comment

We appreciate the engaging discussion and acknowledge that our earlier reply misunderstood the intent behind “what prevents every task from being solved with a single long-horizon operator.” If we now understand correctly, the reviewer was referring to low-level execution, not high-level planning. We address this and remaining concerns directly below.

  1. Why Compositionality Matters: The Limits of a Monolithic Operator
    A single "do-everything" operator sacrifices compositional generalization —a key principle for scalable robotics.

    • Monolithic vs. Compositional Generalization: State-of-the-art monolithic VLAs [1, 2] require large, task-specific fine-tuning data and can hardly recombine behaviors (e.g., “make tea” + “fetch milk” → “make milk tea”) without new training. UniDomain targets another route: compositional generalization. It represents primitive skills (e.g., pick, place, open/close) as atomic PDDL operators. These building blocks can be recombined to form new long-horizon tasks without retraining the executor.
    • Executor Agnosticism: UniDomain is executor-agnostic, compatible with various low-level policies:
      • VLA policies: Multiple recent surveys [3, 4] concluded that training VLAs to serve as low-level policies for high-level symbolic planning leads to significantly more generalizable systems.
      • Modular policies: A recent study [5] demonstrates that modular approaches, combining perception, motion planning, and affordances, outperform end-to-end models in generalization and robustness for mobile manipulation, without requiring in-domain demonstration data.
      • Motion planning + short-term VLA: The most promising low-level policy to support UniDomain, we envision, is to strategically combine motion planning and short-term VLAs: motion planning handles navigation, approach, and alignment operations, while VLAs are strictly reserved for repeatable, short-horizon dexterous manipulation.
    • Principled, Constraint-Optimal Reasoning: By decoupling planning from execution, our approach enables principled reasoning that imitation-based policies such as VLAs lack. The symbolic planner explicitly enforces action preconditions and optimizes plan cost, raising the success rate from 40% (VLM-CoT) to 85% and optimality rate (K=0) from 13% (VLM-CoT) to 83%. These results demonstrate that high-level reasoning itself is a distinct and challenging problem.

    In a nutshell, a single monolithic operator sacrifices reusability and compositionality and provides no guarantees on constraint satisfaction or optimality. A system using UniDomain as the top layer preserves all three.

  2. Why Evaluate Planning in Isolation?
    We respectfully disagree with the statement that our results are “not meaningful” without integration with an automatic executor. Our contribution lies within the scope of symbolic task planning, where evaluating the high-level reasoning module in isolation is an established, valid, and widely practiced methodology. This approach removes confounding factors from imperfect control, allowing planning quality to be measured directly—an essential step for modular systems in which the same planner can be paired with different executors. Our evaluation approach is consistent with numerous other works in the field, including widely accepted symbolic benchmarks [6-8]; many PDDL-based task planners [9–13]; and works that employ teleoperation or human-as-perfect-execution [14, 15].
    While end-to-end evaluation is standard for monolithic VLAs and Task-and-Motion Planning (TAMP), our work focuses on symbolic task planning and should be assessed according to the established standards of this subfield.

  3. Commitments for the Final Version

    • Add prominent clarifications of the scope, assumptions, the use of teleoperation, and our judgment protocol, to prevent any possible misunderstanding of our claims.
    • Append the full suite of evaluation tasks and results, with detailed planning outcomes and LLM/human judgments for complete transparency. (These supplementary materials will exceed 100 pages and were omitted from the initial submission for brevity.)
    • Include citations supporting the practice of modular evaluation in task planning.
    • Report human correlation of keyframes, time-vs-step optimality, and hyperparameter sensitivity results that were additionally provided during the rebuttal.
  4. Commitments for Open-Sourcing

    • Release the unified domain, enabling users to fully leverage the extensive real-world knowledge encoded in our knowledge graph.
    • Release the high-quality meta-domains used in evaluation, enabling faithful reproduction of our results.
    • Release the domain fusion pipeline, allowing users to construct new meta-domains via automatic retrieval.
    • Release the online planner, enabling the community to leverage our zero-shot visual–language planner for new tasks.
Comment

References

  • [1] Black, K., Brown, N., Driess, D., Esmail, A., Equi, M., Finn, C., ... & Zhilinsky, U. π0: A vision-language-action flow model for general robot control. Robotics: Science and Systems (RSS) 2025.
  • [2] Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., ... & Zhilinsky, U. π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054.
  • [3] Ma, Y., Song, Z., Zhuang, Y., Hao, J., & King, I. A survey on vision-language-action models for embodied ai. arXiv preprint arXiv:2405.14093.
  • [4] Zhong, Y., Bai, F., Cai, S., Huang, X., Chen, Z., Zhang, X., ... & Yang, Y. A Survey on Vision-Language-Action Models: An Action Tokenization Perspective. arXiv preprint arXiv:2507.01925.
  • [5] Gupta, A., Zhang, M., Sathua, R., & Gupta, S. Demonstrating MOSART: Opening articulated structures in the real world. Robotics: Science and Systems (RSS) 2025.
  • [6] Shridhar, M., Yuan, X., Côté, M.-A., Bisk, Y., Trischler, A., & Hausknecht, M. ALFWorld: Aligning text and embodied environments for interactive learning. International Conference on Learning Representations (ICLR).
  • [7] Chang, M., Chhablani, G., Clegg, A., Cote, M. D., Desai, R., Hlavac, M., Karashchuk, V., Krantz, J., Mottaghi, R., Parashar, P., Patki, S., Prasad, I., Puig, X., Rai, A., Ramrakhya, R., Tran, D., Truong, J., Turner, J. M., Undersander, E., & Yang, T.-Y. PARTNR: A benchmark for planning and reasoning in embodied multi-agent tasks. International Conference on Learning Representations (ICLR).
  • [8] X. Puig et al. VirtualHome: Simulating Household Activities Via Programs. 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [9] Zhehua Zhou, Jiayang Song, Kunpeng Yao, Zhan Shu, and Lei Ma. Isr-llm: Iterative self-refined large language model for long-horizon sequential task planning. In 2024 IEEE International Conference on Robotics and Automation (ICRA), 2024.
  • [10] Zhouliang Yu, Yuhuan Yuan, Tim Z. Xiao, Fuxiang Frank Xia, Jie Fu, Ge Zhang, Ge lin, and Weiyang Liu. Generating symbolic world models via test-time scaling of large language models. Transactions on Machine Learning Research (TMLR), 2025.
  • [11] Gautier Dagan, Frank Keller & Alex Lascarides. Dynamic Planning with a Large Language Model. Proc. NeurIPS Language Gamification Workshop 2024.
  • [12] Guanqi Chen, Lei Yang, Ruixing Jia et al. Language-Augmented Symbolic Planner for Open-World Task Planning. Proc. Robotics: Science & Systems 2024.
  • [13] Elliot Gestrin, Marco Kuhlmann & Jendrik Seipp. Towards Robust LLM-Driven Planning from Minimal Text Descriptions. ICAPS Human-Aware Explainable Planning Workshop 2024.
  • [14] Pierre Sermanet, Tianhe Ding, Jianming Zhao et al. RoboVQA: Multimodal Long-Horizon Reasoning for Robotics. ICRA 2024.
  • [15] Weijie Zhang, Mei Wang, Guangyi Liu et al. Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks. arXiv 2503.21696, 2025.
Official Review
Rating: 4

This paper introduces a framework that constructs a unified PDDL domain from a large-scale dataset of robotic manipulation demonstrations. Instead of relying on manually designed symbolic domains, the proposed method builds a single domain containing over 3,000 operators and more than 2,000 predicates. The framework operates in three stages. First, in the domain pretraining stage, atomic PDDL domains are learned from individual demonstrations using energy-based keyframe extraction and iterative refinement with LLMs. Second, in the domain fusion stage, relevant atomic domains are hierarchically merged into a meta-domain based on text and semantic similarity. Third, during deployment, task-relevant operators and predicates are filtered from the meta-domain to generate executable plans using a symbolic planner. The framework is evaluated on 100 tasks across four domains, demonstrating improved performance compared to existing methods that use LLMs or VLMs either alone or in combination with PDDL.

Strengths and Weaknesses

Strengths:

  1. While prior work has explored the integration of LLMs and VLMs with structured representations such as PDDL, this paper is unique in its attempt to learn a unified domain from a large collection of demonstration videos.
  2. The paper introduces several novel techniques to build and refine the unified domain, including the use of LLMs and PDDL planners in a verification loop, grouping of predicates and operators using text embeddings and LLMs, and pruning mechanisms to remove irrelevant elements from the domain for planning.

Weaknesses:

  1. Although the paper claims that a unified domain with over 3,000 operators is constructed, the quality of this domain is not directly evaluated. Instead, the experiments use a smaller domain learned from only 40 videos.
  2. The paper is clearly written and well organized, but many implementation details are missing. These missing details, including in the supplementary material, limit the reproducibility of the method. For example:
  • In Line 151, what is the size of the sliding window used for keyframe detection?
  • In Line 161, how accurate is the LLM at generating test problems from a given predicate set and task instruction? What specific feedback is returned by the planner?
  • In Line 170, how does the LLM evaluate whether a plan satisfies physical constraints and commonsense expectations?
  • In Line 203, how are domain pairs selected for merging? Are the child domains chosen randomly?
  3. The paper provides very limited examples of generated operators and predicates. This makes it difficult to assess their quality.
  4. The paper misses an opportunity to analyze the quality of the generated domains in a systematic way. It would be valuable to examine common failure cases, such as preconditions or effects that violate physical constraints, incorrect groupings of predicates and operators, missing or redundant predicates in operators, and mismatches in abstraction levels (such as grounded versus lifted operators).

Questions

Besides the questions listed above, I have the following questions:

  1. What is the computational cost and inference time for constructing the unified domain?
  2. Can the authors provide the full list of generated operators?
  3. What are the 100 real-world task instances used in evaluation? What operators are required to complete each of them? Which operators are frequently reused, and which ones are rarely needed?

Limitations

yes

Justification of Final Rating

The authors have provided many details in the rebuttal that address most of my concerns. My remaining concern is the lack of a systematic evaluation covering all 3,000 operators.

Formatting Issues

No

Author Rebuttal

We thank the reviewer for recognizing the novelty of our work and for providing detailed, constructive feedback. We will use this response to provide the requested details, including a systematic failure analysis, concrete examples, new experimental results, and an analysis of operator usage.

Q: Although the paper claims that a unified domain with over 3,000 operators is constructed, the quality of this domain is not directly evaluated. Instead, the experiments use a smaller domain learned from only 40 videos.

A: To demonstrate that the full unified domain (with 3,137 operators) is indeed valuable and can be used systematically, we have developed and evaluated a fully automated pipeline.

  1. Automated Pipeline: For a given task class, an LLM infers relevant actions from the instruction. We then use sentence-embedding similarity to find the top-k best-matching operators in our unified domain. The atomic domains containing these matching operators are then selected and passed to our Domain Fusion pipeline to construct a meta-domain for online planning. This pipeline removes any need for manual selection.
  2. New Experimental Results:
    • On Existing Tasks: Applied to our original 100 evaluation tasks, the automated pipeline, directly leveraging the full unified domain, maintains the strong performance reported in the paper, achieving a success rate of 83% and an optimality rate of 80%.
    • On New, Unseen Tasks: We tested generalization on 10 new, diverse tasks from two new classes:

    Household Object Rearrangement (sharing one meta-domain):

    • "Put the tableware into the tray and the trash into the trash bin."
    • "Pull off the bedsheet, and put it into the laundry basket. Return the pillow and blanket from the table to the bed and fold them."
    • "Put all the remote controls on the living room coffee table into the storage box and wipe the coffee table."
    • "Put the hood on the table onto the person's head."
    • "Coil the cable on the table and place it into the storage box."

    Room Cleaning (sharing one meta-domain):

    • "Wipe the dust off the chair with a sponge."
    • "Erase all drawings on the whiteboard."
    • "Scrub the sink to remove stains."
    • "Pick up the cloth from the armrest of the chair and use it to wipe the chair and the dresser."
    • "Put the milk in the microwave and turn it on, then wipe the surface of the microwave with a towel."

    Our automated pipeline achieved a 99% success rate and 99% optimality rate (K=0) on the new tasks, demonstrating robust zero-shot generalization to new concepts by activating operators such as put_hood_on_head, fold_sheet, scrub_sink, coil_cable, and wipe_chair.


Q: The paper is clearly written and well organized, but many implementation details are missing...

A: We apologize for the lack of these details in the manuscript and are happy to clarify them.

  • Sliding Window Size (Line 151): We use an adaptive sliding window size K based on the video's frame count n to handle varying demonstration lengths. The formula is given below (an illustrative code transcription appears after this list):

    K = 10                    if n ≤ 100
        20                    if 100 < n ≤ 150
        30                    if 150 < n ≤ 200
        40                    if 200 < n ≤ 500
        10 * (⌊(n - 501)/200⌋ + 1)  if n > 500
    
  • Test Problem Generation & Planner Feedback (Line 161): We acknowledge that LLM‑generated test problems can sometimes be flawed. To account for this, our solvability check uses a soft threshold: a domain passes if it can solve at least 60% of the generated test problems, making the process robust to a few ill‑formed tests. For feedback, the full, verbose output from the Fast Downward planner, including search process logs and any validation errors, is passed back to the LLM for domain refinement.

  • Plan Commonsense Evaluation (Line 170): The LLM is prompted to act as a verifier, reading the action sequence and identifying steps that violate physical or commonsense logic. For example, it successfully flags errors like trying to pick an object that is still occluded, attempting to stack an object on itself (stack b1 b1), or applying an action to an invalid object type (open_drawer blue_plushie).

  • Domain Fusion Pairing (Line 203): In the current implementation, the child domains are paired randomly for merging along the binary tree. Optimizing the fusion order is a promising direction for future work.
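
As referenced above, a direct transcription of the sliding-window rule into code (illustrative only):

def window_size(n: int) -> int:
    """Adaptive window size K as a function of the video frame count n."""
    if n <= 100:
        return 10
    if n <= 150:
        return 20
    if n <= 200:
        return 30
    if n <= 500:
        return 40
    return 10 * ((n - 501) // 200 + 1)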

To further ensure reproducibility, we commit to open‑source the UniDomain project, including releasing the unified domain, the automatic retrieval, domain fusion, and online planning components, together with the final paper.


Q: The paper provides very limited examples of generated operators and predicates. This makes it difficult to assess their quality. ... Can the authors provide the full list of generated operators?

A: Due to the 10k character limit, we can only present a minimal sample here. We will open‑source the entire unified domain file in the final paper.

(:action fold_sheet
    :parameters (?r ?s)
    :precondition (and (holding ?r ?s) (sheet ?s) (not (folded ?s)))
    :effect (and (folded ?s) (hand_free ?r) (not (holding ?r ?s)))
)
(:action cut_donut
    :parameters (?r ?tool ?donut ?t)
    :precondition (and (holding ?r ?tool) (on_table ?donut ?t) (table ?t) (can_cut_donut ?tool) (not (is_cut ?donut)))
    :effect (and (is_cut ?donut))
)
(:action hang_cloth
    :parameters (?r ?c ?obj)
    :precondition (and (holding ?r ?c) (cloth ?c))
    :effect (and (hung ?c ?obj) (hand_free ?r) (not (holding ?r ?c)))
)
(:action pair_shoes
    :parameters (?r ?rs ?ls)
    :precondition (and (holding ?r ?rs) (on_floor ?ls) (right_shoe ?rs) (left_shoe ?ls))
    :effect (and (on_floor ?rs) (hand_free ?r) (not (holding ?r ?rs)) (paired ?ls ?rs))
)
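
To show how such an operator would be exercised, here is a minimal, hypothetical problem file for fold_sheet (the domain name and objects are ours, not from the paper; the negated precondition assumes :negative-preconditions is declared):

(define (problem fold-sheet-demo)
    (:domain unidomain_meta)              ; hypothetical domain name
    (:objects robot1 sheet1)
    (:init (holding robot1 sheet1) (sheet sheet1))
    (:goal (and (folded sheet1) (hand_free robot1)))
)

The one-step plan (fold_sheet robot1 sheet1) satisfies this goal.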

Q: The paper misses an opportunity to analyze the quality of the generated domains in a systematic way. It would be valuable to examine common failure cases...

A: This is an excellent suggestion. We have analyzed the failures and can categorize them as follows:

  • Domain Learning Failure Modes

    • Syntax Errors: generates a syntactically invalid PDDL file (e.g., mismatched brackets) that the planner cannot parse.
    • Missing Operators: fails to generate a critical operator needed to solve a test problem (e.g., generating pick but not place).
    • Logical Violations: contains logical errors, such as an effect that conflicts with a precondition in the same operator.
    • Flawed Test Problems: generates a test problem with an impossible goal state, such as (and (is_open door) (not (is_open door))).
  • Online Planning Failure Modes

    • Goal Misinterpretation: The VLM misunderstands the goal, for example, by reversing a required stacking order (top‑down vs. bottom‑up).
    • Visual Grounding Errors: The VLM misidentifies an object's state or location in the initial scene, leading to an incorrect (:init) block and planning failure.
    • Domain Inherited Errors: The planner generates an infeasible plan due to flaws in the learned domain itself (e.g., missing preconditions).

Q: What is the computational cost and inference time for constructing the unified domain?

A: Generating a single atomic domain takes, on average, 5.27 calls to GPT‑4.1 (at 8.36 s/call) and 4.47 calls to GPT‑o3-mini (at 74.51 s/call). Note that constructing the full unified domain from 12,393 demonstrations was a one‑time, offline cost. We will release the full unified domain in the final paper.
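
By these figures, constructing one atomic domain costs roughly 5.27 × 8.36 s + 4.47 × 74.51 s ≈ 377 s, i.e., about 6.3 minutes of model-call time per demonstration.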


Q: What are the 100 real-world task instances used in evaluation? ... Which operators are frequently reused, and which ones are rarely needed?

A: Here we include a representative set of 5 task instructions due to space constraints:

  • "Arrange all blocks into two separate stacks on the table. The first stack should have blocks green, red, blue, and yellow in order from top to bottom. The second stack should have blocks 1, 2, 3, and 4 in order from top to bottom."
  • "There is a paper in the orange drawer. There is a towel in the yellow drawer. Please scrunch the paper on the table and wipe the table"
  • "There is a spoon in the cup. Please stir the two bowls and then put the spoon in the left cup"
  • "There are a spoon, a tissue, an orange block in the green drawer. Stir the bowl and put the spoon in the cup, put the orange block into the orange drawer, wipe the bowl and scrunch the tissue on the table"
  • "There is an unfolded towel in the yellow drawer. Wipe the table, wipe the bowl, and fold the towel on the table. Push the purple block, slide the blue block and press the white remote"

Below is a quantitative analysis of operator activations during online planning, across the four evaluation domains.

Domain | # Operators | 3 Most-Frequent Operators (# Activations) | 3 Least-Frequent Operators (# Activations)
BlockWorld | 4 | unstack (153), stack (143), put_on_table (84) | pick_from_table (74)
Desktop | 11 | open_drawer (30), pick_from_drawer (25), put_on_table (15) | fold_on_table (5), push_from_table (5), scrunch_on_table (5)
Kitchen | 12 | put_on_table (20), pick_from_rack (15), pick_from_bowl (15) | place_on_rack (5), remove_lid (5), fold_on_table (5)
Combination | 22 | pick_from_drawer (95), place_in_drawer (85), unstack (83) | stir_bowl (5), slide_from_table (5), press_remote (5)
Comment

I would like to thank the authors for their very detailed rebuttal. I have a few follow-up questions and comments:

Regarding the 10 new tasks: Why do the task instructions need to be specified in such a detailed way? How does the method perform when given higher-level or more natural instructions such as:

  • “Wipe the dust off the chair”
  • “Remove the stains in the sink”
  • “Wipe the chair and dresser clean”
  • “Warm the milk”

It would be helpful to understand the model’s robustness to variation in task abstraction and language compositionality.

Regarding the failure modes: Could the authors provide a breakdown of how frequently each failure mode occurs?

Regarding the operator examples: I have some concerns about the quality of the generated operators:

  • hang_cloth does not require using a hanger.
  • pair_shoes specifies holding the right shoe, whereas holding either shoe should be valid.
  • cut_donut requires the donut to be on a table, even though a cutting board or plate would be more appropriate.

These examples suggest that some operator definitions may be inaccurate. It would be helpful to understand whether such issues are common.

Comment

We sincerely thank the reviewer for the thoughtful follow-up questions. They have helped us further clarify the robustness and limitations of our framework. We address your points below.

On Robustness to Natural Task Instructions

This is a valuable question. The detailed instructions in our initial experiments were chosen to rigorously test long-horizon compositional planning, but they are not a requirement for UniDomain.
To directly test your point, we conducted new experiments with the higher-level instructions you suggested. For each task, we ran 10 trials.

  • Task Instructions:
    1. “Wipe the dust off the chair” (+ scene image from the original task 1)
    2. “Remove the stains in the sink” (+ scene image from task 3)
    3. “Wipe the chair and dresser clean” (+ scene image from task 4)
    4. “Warm the milk” (+ scene image from task 5)
  • Results: UniDomain achieved a 100% success rate on all tasks.
  • Why it works: Understanding natural, high-level task instructions solely requires visual-language grounding, i.e., translating an abstract instruction into precise PDDL formulations of the relevant objects and the task goal, which state-of-the-art VLMs excel at.

Quantitative Breakdown of Failure Modes

We have performed a quantitative analysis of all failures that occurred during our experiments. The breakdown is as follows:

  • Domain Learning Failures (happened within the closed-loop iterations of domain generation):

    • Syntax Errors: 48%.
    • Missing Operators: 39%.
    • Logical Violations: 35%.
    • Flawed Test Problems: 26%.
    • Notably, our closed-loop refinement successfully corrects all of the above, so no final domains have the above errors.
    • (The numbers add up to >100% because some failures covered multiple modes)
  • Online Planning Failures (using a valid meta-domain):

    • Visual Grounding Errors: 53.4%.
    • Domain Applicability Failures: 33.3%.
    • Goal Misinterpretation: 13.3%.

Key Takeaways:

  1. Closed-loop refinement is highly effective. The primary bottleneck in learning the domain is the initial generation of syntactically and logically sound PDDL, but our proposed verification loop reliably detects and corrects these issues.
  2. Visual grounding is the primary online challenge. Given a high-quality meta-domain, the main source of failure shifts to the VLM's ability to accurately perceive the world and ground the planning task.

On the Fidelity of Learned Operators

Your observations about the operator examples are astute and highlight a crucial aspect of our work: UniDomain is designed to faithfully learn the causal structure as it is demonstrated, and the quality of the learned domain is inherently tied to the quality of the source demonstrations.

The operators you questioned are not learning errors, but rather accurate reflections of the specific actions performed in the DROID dataset:

  • hang_cloth: In the source video, the robot is instructed to "hang the white cloth on the black object", not a dedicated hanger. UniDomain correctly learned this general affordance, omitting an is_hanger precondition that would have contradicted the language evidence.
  • pair_shoes: The operator precisely models the demonstration, where a robot held the right shoe and placed it next to the stationary left shoe. UniDomain learned the specific, one-sided strategy that was demonstrated.
  • cut_donut: The demonstration for this action occurred directly on a tabletop (without a cutting board), which the learned operator correctly reflects.

This analysis reveals a current limitation and a key insight: UniDomain successfully extracts the demonstrated logic, but it does not inject commonsense that would contradict the demonstration data. The path to acquiring more general, commonsense-aligned operators is to apply UniDomain to broader and higher-quality demonstration datasets. Our framework is agnostic to the data source, and applying it to curated human demonstrations is an efficient and promising direction for future work.
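As a concrete illustration of this fidelity, a hypothetical sketch of a demonstration-faithful hang_cloth operator (parameter and predicate names are illustrative, not the exact learned PDDL); the salient detail is that nothing requires ?support to be a hanger:

(:action hang_cloth
    :parameters (?robot ?cloth ?support)
    :precondition (and (holding ?robot ?cloth))   ; no (is_hanger ?support) required
    :effect (and (hanging_on ?cloth ?support)
                 (hand_free ?robot)
                 (not (holding ?robot ?cloth)))
)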

Comment

I want to thank the authors for their detailed follow-up. I found the clarifications extremely helpful and have decided to raise my score. I encourage the authors to incorporate the technical details, additional qualitative examples, and failure analysis into the revised version of the paper.

Comment

We thank the reviewer for their thoughtful follow-up and for revisiting their assessment after carefully considering the clarifications we provided. We will certainly integrate these elements into the revised version of the paper to further strengthen our contributions.

Review
3

The authors propose learning prior knowledge about PDDL domains required for solving robotics tasks from robot videos, and then utilizing that prior knowledge to perform better zero-shot planning in new environments. The authors show how the proposed method outperforms baselines in several real-world tasks, zero-shot solving planning problems that other methods struggle with.

Strengths and Weaknesses

While the proposed approach might work well, and would certainly be valuable if so, the paper provides almost no information from which to clearly judge that.

  • It is unclear why a demonstration can be well segmented by simply looking at the total grayscale intensity in each frame. There is also no evaluation done on this in isolation to show that that is actually the case.
  • Not a single example of a domain for a task generated at test time is given, or an example of a generated plan. It is unclear what the approach qualitatively does based on its prior knowledge to outperform the baselines.
  • On a high level, it is unclear what the approach actually learns to enable it to perform better than the other approaches. Is it just a large set of preconditions for each action that are checked, which a VLM without prior knowledge might otherwise miss?
  • While different domain variations and combinations get the total number of evaluation tasks to 100, they are based on only 3 relatively simple domains, covering a very small set of concepts. The majority of it seems to come down to: stacking blocks in order, putting things into bowls / onto plates / into drawers / onto dish racks, and performing table wiping.
  • Retrieval of relevant atomic domains being done manually seems like a significant downside. While it might be easy to overcome, it is not done so in the current version of the paper.

Questions

No additional questions.

Limitations

The authors discuss the limitations of the approach. On the surface, this seems sufficient, but without more information to understand the internals of the approach better, it is difficult to say whether there are any additional ones.

Final Justification

As detailed in my response below, I think all my concerns about the paper largely remain. In addition, I only became aware that the low-level skills are realized via teleoperation when reviewer cBvW noted it, as this was not stated in the paper. As detailed in my comment below, I do see this as a significant omission, but I will stay with my original rating for the paper.

Formatting Issues

No major concerns.

Author Response

We thank the reviewer for the critical and detailed feedback. The core concern raised is a lack of clarity regarding the internal workings and qualitative behavior of our method. We appreciate this feedback and will use this rebuttal to provide the necessary details, justifications, and concrete examples to address every point raised. Our goal is to offer a much clearer picture of what UniDomain learns, how it works, and why it is effective.


Q: It is unclear what the approach qualitatively does based on its prior knowledge to outperform the baselines. On a high level, it is unclear what the approach actually learns to enable it to perform better than the other approaches.

A: Thank you for this crucial feedback. Qualitatively, UniDomain's strength comes from a principled division of labor: it frees LLMs/VLMs from the burden of long-horizon reasoning under complex constraints (a known weakness of current models) and delegates that reasoning to a classical PDDL solver, which is explicitly designed for it.

What UniDomain Learns: The unified domain learns a large, structured, and grounded library of action models (operators with preconditions and effects) from real-world data. It's not just learning "a large set of preconditions"; it's learning a compositional symbolic knowledge graph where actions are connected through shared predicates, enabling complex plans.
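To make "connected through shared predicates" concrete, consider a simplified fragment (operator bodies are abbreviated; the actual learned operators carry additional preconditions and effects) in which remove_lid produces the (is_open ?pot) literal that pick_from_pot requires:

(:action remove_lid
    :parameters (?robot ?lid ?pot)
    :precondition (and (hand_free ?robot) (on ?lid ?pot))
    :effect (and (not (on ?lid ?pot)) (is_open ?pot))
)
(:action pick_from_pot
    :parameters (?robot ?obj ?pot)
    :precondition (and (hand_free ?robot) (is_open ?pot) (in_pot ?obj ?pot))
    :effect (and (holding ?robot ?obj) (not (in_pot ?obj ?pot)) (not (hand_free ?robot)))
)

Edges of this kind are what the symbolic planner traverses when ordering actions, and precisely what LLM-only planners tend to hallucinate or omit.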

How It Outperforms Baselines:

  • Grounding, Not Hallucinating: LLM-only planners must reason about and generate action sequences from scratch, often hallucinating incorrect preconditions or missing crucial steps. UniDomain provides the VLM with a validated "menu" of predicates and operators gathered from real-world atomic domains. The VLM's role is simplified to grounding the current scene and instruction into a novel PDDL problem (extracting objects of interest and specifying the initial and goal conditions with predicates) using this menu, which is a task it excels at.
  • Optimal, Constrained Reasoning: Once the problem is formulated, the symbolic planner performs an efficient, optimal search that inherently respects all preconditions and dependencies. For example, in our BlockWorld tasks, the planner can handle the long chain of dependencies required for sorting, while LLM-based methods achieved success rates as low as 20% (see Figure 3a and Fig. 4a in the Appendix). UniDomain's pre-trained knowledge provides the essential structure that makes this division of labor possible.

Q: Retrieval of relevant atomic domains being done manually seems like a significant downside. While it might be easy to overcome, it is not done so in the current version of the paper.

A: We are pleased to report that we have developed and evaluated a fully automated retrieval pipeline, removing the manual step entirely. For a given task class, an LLM infers relevant actions from the instruction. We then use sentence-embedding similarity to find the top-k best-matching operators in our unified domain. The atomic domains containing these top-matching operators are then selected and passed to our Domain Fusion pipeline to construct a meta-domain for online planning. We have run two new experiments with this automated system:

  1. On Existing Tasks: Applied to our original 100 evaluation tasks, the automated pipeline maintains the strong performance reported in the paper, achieving a consistent success rate of 83% and an optimality rate (K=0) of 80%. This confirms the viability of a fully automated approach.
  2. On New, Unseen Tasks: We tested generalization on 10 new, diverse tasks from two new classes:

    Household Object Rearrangement (sharing one meta-domain):

    • "Put the tableware into the tray and the trash into the trash bin."
    • "Pull off the bedsheet, and put it into the laundry basket. Return the pillow and blanket from the table to the bed and fold them."
    • "Put all the remote controls on the living room coffee table into the storage box and wipe the coffee table."
    • "Put the hood on the table onto the person's head."
    • "Coil the cable on the table and place it into the storage box."

    Room Cleaning (sharing one meta-domain):

    • "Wipe the dust off the chair with a sponge."
    • "Erase all drawings on the whiteboard."
    • "Scrub the sink to remove stains."
    • "Pick up the cloth from the armrest of the chair and use it to wipe the chair and the dresser."
    • "Put the milk in the microwave and turn it on, then wipe the surface of the microwave with a towel."

Our automated pipeline achieved a 99% success rate and a 99% optimality rate (K=0) on the new tasks, demonstrating robust zero-shot generalization to new concepts by composing operators such as put_hood_on_head, fold_sheet, scrub_sink, coil_cable, and wipe_chair.
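For concreteness, a minimal sketch of the embedding-based matching step described above, assuming a sentence-transformers model and an operator_to_domain mapping built from the unified domain (the LLM call that proposes candidate actions is abstracted away):

from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding backbone

def retrieve_atomic_domains(candidate_actions, operator_to_domain, k=20):
    """Select atomic domains whose operators best match the LLM-proposed actions.

    candidate_actions: action phrases inferred by the LLM from the instruction.
    operator_to_domain: dict {operator_name: atomic_domain_id} from the unified domain.
    """
    op_names = list(operator_to_domain.keys())
    op_emb = model.encode(op_names, normalize_embeddings=True)
    act_emb = model.encode(candidate_actions, normalize_embeddings=True)
    sims = act_emb @ op_emb.T                      # cosine similarity (rows: actions)
    best_per_op = sims.max(axis=0)                 # best match score for each operator
    top_idx = np.argsort(best_per_op)[::-1][:k]    # top-k operators overall
    return {operator_to_domain[op_names[i]] for i in top_idx}

The selected atomic domains are then passed to the Domain Fusion stage exactly as in the manual pipeline.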


Q: While different domain variations and combinations get the total number of evaluation tasks to 100, they are based on only 3 relatively simple domains, covering a very small set of concepts. The majority of it seems to come down to: stacking blocks in order, putting things into bowls / onto plates / into drawers / onto dish racks, and performing table wiping.

A: We respectfully disagree with the characterization of the domains as simple and wish to clarify their complexity and richness.

  • Conceptual Richness: While the objects are common, their interactions activate a diverse set of concepts. These tasks required 25 distinct semantic operations, including pick, place, pour, wipe, stir, open, close, stack, unstack, fold, remove lid, push, slide, press, and scrunch.
  • Deceptive Complexity: The core challenge lies not in the individual actions but in their composition and the resulting long-horizon dependencies. As shown in Fig. 3a in the main text and Fig. 4a in the Appendix, strong baselines struggle significantly: ReAct achieves only 20% success in the "simple" BlockWorld domain, and Code-as-Policies scores only 25% in the Kitchen domain. This demonstrates that correctly identifying and satisfying the chained preconditions for these tasks is a non-trivial reasoning problem that LLM-only planners fail at.
  • Scalability to Richer Domains: As mentioned above, our new experiments on "Household Object Rearrangement" and "Room Cleaning" confirm that our framework readily scales to a wider variety of concepts, activating new operators to solve significantly different tasks.

Q: It is unclear why a demonstration can be well segmented by simply looking at the total grayscale intensity in each frame. There is also no evaluation done on this in isolation to show that that is actually the case.

A: Thank you for this question. The justification for our energy-based keyframe extraction is both theoretical and empirical.

  • Theoretical Justification: Our method of summing squared grayscale intensities is equivalent to measuring the total energy of the image treated as a 2D signal. By Parseval's theorem (see "Contribution à l'étude de la représentation d'une fonction arbitraire par les intégrales définies"), this spatial-domain energy is proportional to the energy in the frequency domain. Semantic transitions in a video, such as an object being picked up or a drawer opening, cause significant changes in the image's structure and texture, which correspond to shifts in its frequency-space energy. Our method identifies keyframes by detecting the local extrema of this energy sequence, effectively capturing these points of significant semantic change (a minimal code sketch of this criterion follows this list).
  • Empirical Justification: We validated this method by comparing its output against human-annotated keyframes on a subset of the data. We observed that the human-annotated ground truth keyframes consistently cluster around the local energy extrema identified by our algorithm, demonstrating its effectiveness in practice. Furthermore, our ablation study (see Figure 4a) shows that this simple, efficient method (0.6 seconds/video) outperforms a more complex similarity-based approach (47.8 seconds/video) in generating successful atomic domains, with success rates of 28% vs. 15% in a single-pass setting.
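A minimal sketch of the energy-based criterion (the grayscale conversion and the smoothing window are our assumptions here; the full pipeline adds details not shown):

import numpy as np

def extract_keyframes(frames, smooth=5):
    """Pick candidate keyframes at local extrema of per-frame grayscale energy."""
    energies = []
    for f in frames:                                   # HxW or HxWx3 uint8 arrays
        g = f.mean(axis=2) if f.ndim == 3 else f       # crude RGB -> grayscale (assumption)
        energies.append(np.sum(g.astype(np.float64) ** 2))  # spatial-domain energy
    e = np.convolve(energies, np.ones(smooth) / smooth, mode="same")  # light smoothing
    # a local extremum is where the first difference of the energy changes sign
    return [i for i in range(1, len(e) - 1)
            if (e[i] - e[i - 1]) * (e[i + 1] - e[i]) < 0]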

Q: Not a single example of a domain for a task generated at test time is given, or an example of a generated plan.

A: We apologize for this omission. To make our pipeline more concrete, here is a detailed example illustrating how the overview task (Figure 2) is solved.

Task: "Move the corn from the pot into the orange bowl, wipe the table with the towel in the drawer and put it back to the closed drawer."

The fused meta-domain for this task class includes key operators like remove_lid, pick_from_rack, pick_from_pot, place_on_table, put_in_bowl, and wipe_table.

The VLM grounds the scene and the instruction into this PDDL problem:

(:init 
    (on_rack orange_bowl rack)
    (in_pot corn pot)
    (can_wipe_table towel)
    (in_drawer towel yellow_drawer)
    (hand_free robot)
    (on lid pot)
)
(:goal 
    (and 
        (in_bowl corn orange_bowl) 
        (wiped table) 
        (in_drawer towel yellow_drawer) 
        (not (is_open yellow_drawer))
    )
)

The PDDL solver produces the following optimal plan:

(remove_lid robot lid pot)
(pick_from_rack robot orange_bowl rack)
(place_on_table robot orange_bowl table)
(pick_from_pot robot corn pot)
(put_in_bowl robot corn orange_bowl table)
(open_drawer robot yellow_drawer)
(pick_from_drawer robot towel yellow_drawer)
(wipe_table robot towel table)
(place_in_drawer robot towel yellow_drawer)
(close_drawer robot yellow_drawer)

We hope this example clarifies how UniDomain translates a real-world request into a structured problem and an executable plan.

Review
5

This paper introduces UniDomain, a framework that extracts a unified PDDL domain from real-world robot manipulation demonstrations for generalizable task planning. This framework includes three parts: (1) Domain pretraining: extracts atomic domains from demonstration videos using keyframe segmentation and refines them via closed-loop verification with VLMs and LLMs; (2) Domain fusion: merges relevant predicates and operators; and (3) Online planning: constructs grounded PDDL problems for specific tasks and solves them using a symbolic planner. Experiments show it achieves significantly higher task success and plan optimality compared to LLM and LLM-PDDL baselines.

Strengths and Weaknesses

Strengths:

  1. The paper addresses a critical challenge in planning: the reliance on complete and manually constructed domain knowledge. This is a non-trivial bottleneck, as manually engineering domain models is both time-consuming and error-prone. The proposed approach of automatically constructing a unified domain from demonstrations presents a valuable direction for reducing manual effort and improving scalability.
  2. The proposed UniDomain framework demonstrates good performance across four task domains. The paper also includes comprehensive ablation studies, which effectively validate the contribution of individual components within the framework.

Weaknesses:

  1. The unified domain is constructed from 40 manually selected demonstrations, raising concerns about the robustness and generalizability of the approach. It remains unclear whether alternative selections of demonstrations would produce similar results. Moreover, the potential impact of using the full DROID dataset, where a larger set of predicates and operators is present, has not been explored. Specifically, the paper does not discuss whether the increased domain complexity would affect the quality of the domain and problem files generated by the LLM during online planning, and whether this would result in a performance drop.
  2. The observation that the "w/o grouping" variant outperforms the full method in the Kitchen domain is counterintuitive and warrants further analysis. The lack of an explanation for this result makes it difficult to understand the trade-offs involved in the grouping component.
  3. The related work section overlooks prior efforts on domain generation from demonstrations. In particular, the following works are relevant and should be discussed: [1] Automated Generation of Robotic Planning Domains from Observations, IROS 2021. [2] Automated Planning Domain Inference for Task and Motion Planning, ICRA 2025.
  4. The paper does not clearly specify the control policy used for executing actions in the real-world experiments. Clarifying this aspect is important to assess the reproducibility and the completeness of the proposed system.

Questions

See weaknesses.

Limitations

Yes.

Final Justification

Most of my concerns have been addressed. So I remain positive about this work.

Formatting Issues

There are no formatting errors.

Author Response

We sincerely thank you for your detailed and insightful review. We are encouraged that you recognize the importance of the problem we are addressing—the bottleneck of manual PDDL domain engineering—and find our approach of automatically constructing a unified domain from demonstrations to be a "valuable direction." We are also glad you found our experimental results and comprehensive ablation studies to be strong points of the paper.

Below is our response to your comments:


Q: The unified domain is constructed from 40 manually selected demonstrations, raising concerns about the robustness and generalizability of the approach. It remains unclear whether alternative selections of demonstrations would produce similar results. Moreover, the potential impact of using the full DROID dataset has not been explored.

A: We would like to begin with a crucial clarification:

  • Our Unified Domain is pretrained on the full set of 12,393 demonstrations from the DROID dataset, resulting in a large-scale symbolic knowledge graph with 3,137 operators and 2,875 predicates (as stated in Section 4.3).
  • The 40 demonstrations you mentioned were used to construct a specific meta-domain for the evaluation tasks presented in the paper. In the original experiments, the atomic domains corresponding to these 40 demonstrations were manually retrieved based on their language instructions to demonstrate the effectiveness of our domain fusion and online planning stages.

We acknowledge your concern about the manual retrieval step. To address this and demonstrate the full potential and scalability of our framework, we have now implemented an automated retrieval mechanism that operates on the entire unified domain. For a given task class, an LLM infers relevant actions from the instruction. We then use sentence-embedding similarity to find the top-k best-matching operators in our unified domain. The atomic domains containing these top-matching operators are then selected and passed to our Domain Fusion pipeline to construct a meta-domain for online planning.

We present two new sets of results from this fully automated pipeline:

  1. Performance on Existing Tasks: When applied to our original 100 evaluation tasks, this automated pipeline achieves a success rate of 83% and an optimality rate (K=0) of 80%, consistent with the numbers reported in the paper (85% success and 83% optimality (K=0)).

  2. Generalization to New, Unseen Task Classes: To further test the limits of our framework's generalization, we evaluated it on two new task classes significantly different from our original evaluation suite:

    Household Object Rearrangement (sharing one meta-domain):

    1. "Put the tableware into the tray and the trash into the trash bin."
    2. "Pull off the bedsheet, and put it into the laundry basket. Return the pillow and blanket from the table to the bed and fold them."
    3. "Put all the remote controls on the living room coffee table into the storage box and wipe the coffee table."
    4. "Put the hood on the table onto the person's head."
    5. "Coil the cable on the table and place it into the storage box."

    Room Cleaning (sharing one meta-domain):

    1. "Wipe the dust off the chair with a sponge."
    2. "Erase all drawings on the whiteboard."
    3. "Scrub the sink to remove stains."
    4. "Pick up the cloth from the armrest of the chair and use it to wipe the chair and the dresser."
    5. "Put the milk in the microwave and turn it on, then wipe the surface of the microwave with a towel."

    Results: Using our new automated retrieval and fusion pipeline, UniDomain achieved a 99% success rate and a 99% optimality rate (K=0) across these new tasks. In particular, these new tasks activated new operators, such as coil_cable, put_in_basket, erase_whiteboard, and turn_on_microwave, as well as predicates like is_coiled, in_laundry_basket, on_whiteboard, and microwave_on, which already existed in our unified domain.


Q: The paper does not discuss whether the increased domain complexity would affect the quality of the domain and problem files generated by the LLM during online planning, and whether this would result in a performance drop.

A: You are absolutely correct that increased domain complexity can negatively impact LLM performance during online planning, by introducing noise and increasing the cognitive load for grounding. This is precisely the reason we designed the Domain Fusion (Section 5) and Predicate & Operator Filtering (Section 6) stages. These components are critical for managing complexity. Our ablation studies quantitatively demonstrate their importance:

  • Without Domain Fusion, planning success drops from 85% to 19% (see Figure 4b), as it is highly challenging to ensure connectivity in the large-scale unified domain, whereas achieving the domain quality of a relevant subgraph (the meta-domain) is much easier.
  • Without Predicate and Operator Filtering at planning time, the success rate drops from 85% to 74% (see Figure 5), as the LLM struggles to ground the task in a cluttered symbolic space.

Q: The observation that the "w/o grouping" variant outperforms the full method in the Kitchen domain is counterintuitive and warrants further analysis.

A: This is indeed a counterintuitive result that we have investigated further. We believe this to be an experimental artifact specific to the "Kitchen" domain's setup rather than a systematic failure of the grouping strategy. On average, across all four domains, predicate grouping improves the success rate from 72% to 85% (see "Average" in Figure 5), demonstrating its overall effectiveness.

Our empirical analysis suggests a potential cause for the anomaly in the Kitchen domain. We found that the failures in the "with grouping" case were often due to the VLM failing to correctly ground the initial state for the (is_open pot) and (on_rack bowl) predicates. The semantic grouping operation happened to shift the position of these two specific predicates significantly further down the list of 78 predicates provided in the prompt, compared to their original (random) order. We hypothesize that this reordering may have interacted with a known positional bias in LLMs, where attention can be weaker for tokens appearing later in a long context, leading to these specific grounding errors. While this is a hypothesis, it is supported by our failure analysis.


Q: The following works are relevant and should be discussed: [1] Automated Generation of Robotic Planning Domains from Observations, IROS 2021. [2] Automated Planning Domain Inference for Task and Motion Planning, ICRA 2025.

A: Thank you for pointing us to this relevant work. UniDomain is distinguished from these two papers in its scale, generalization method, and verification process. Our framework is the first to pre-train a unified PDDL domain from over 12,000 real-world robot demonstrations, leveraging modern LLM/VLM priors for domain generation and grounding. In contrast, both [1] and [2] learn from single or few-shot demonstrations, often in simulated settings, and do not use these foundational models. Furthermore, UniDomain introduces a rigorous, multi-stage closed-loop verification process to ensure domain correctness and commonsense alignment, a more comprehensive approach than the limited or rule-based verification found in [1] and [2]. This large-scale, verified pre-training enables our framework's key contribution: zero-shot compositional generalization to solve complex, unseen, real-world tasks, whereas [1] and [2] require in-domain demonstrations to handle new domains.


Q: The paper does not clearly specify the control policy used for executing actions in the real-world experiments.

A: To isolate the evaluation and focus purely on the performance of our high-level symbolic planner—which is the core contribution of this work—we followed a standard practice in the task planning literature and assumed a perfect low-level control policy (human teleoperation in our experiments for both UniDomain and baselines). This ensures that the measured task success and plan optimality are not confounded by potential imperfections in an autonomous low-level controller.

However, UniDomain is designed for seamless integration into a complete robotic system. The high-level action sequence generated by our planner (e.g., (pick_from_table bowl)) can serve as direct input to any language-conditioned low-level skill policy. It is particularly compatible with modern Vision-Language-Action (VLA) models, as our unified domain is learned directly from the DROID dataset, which is also used to train these VLAs.

UniDomain enables a powerful hierarchical paradigm:

  1. High-Level Task Planning (UniDomain): Interprets complex goals, reasons over long horizons and symbolic constraints, and generates an optimal sequence of primitive actions.
  2. Low-Level Motion Execution (VLA/etc.): Translates each primitive action into a sequence of dexterous, adaptive robot motions.

This division of labor allows each component to address the challenges it is best suited for: symbolic reasoning for UniDomain, and sensorimotor control for the low-level policy.
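To illustrate the intended interface (a hedged sketch under our own assumptions, not a component evaluated in this paper), each planned action can be rendered as a short language command for a hypothetical language-conditioned skill policy:

def action_to_instruction(action):
    """Turn a ground PDDL action, e.g. '(pick_from_table robot bowl table)',
    into a language command; the string handling here is purely illustrative."""
    name, *args = action.strip("()").split()
    objects = [a for a in args if a != "robot"]
    return f"{name.replace('_', ' ')} {' '.join(objects)}"

def execute_plan(plan, skill_policy, get_observation):
    """Feed each planned action to a low-level policy such as a VLA model.

    skill_policy(instruction, observation) -> bool and get_observation() are
    assumed interfaces to the robot stack; they are not part of UniDomain."""
    for action in plan:
        done = skill_policy(action_to_instruction(action), get_observation())
        if not done:
            return False    # report failure / replan at the symbolic level
    return True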


We hope these clarifications and new results have fully addressed your concerns. We thank you again for your constructive and valuable feedback.

Comment

Thanks for the authors' response. Most of my concerns have been addressed.

Final Decision

This submission proposes a method that automatically constructs a unified PDDL domain from large-scale robot manipulation demonstrations to enable generalizable task planning. The approach consists of three stages: domain pretraining using keyframe extraction and closed-loop verification with VLMs/LLMs to learn atomic domains from individual videos; domain fusion to merge these into a coherent meta-domain; and online planning, which grounds new tasks into PDDL problems for a symbolic planner. Evaluated across 100 tasks in four domains, the method demonstrates strong zero-shot performance, outperforming LLM-based and LLM-PDDL baselines in both task success rate and plan optimality. This submission received two borderline rejects, one weak accept, and one accept. After the rebuttal, one reviewer upgraded their score from borderline reject to borderline accept. The reviewers recognized the merits of the proposed method; however, some concerns remain. The authors are encouraged to address these concerns clearly in the final version.