PaperHub
Overall score: 6.8/10 · Poster · 4 reviewers
Ratings: 5, 5, 4, 3 (min 3, max 5, std 0.8)
Confidence: 4.0
Novelty: 2.8 · Quality: 2.8 · Clarity: 3.0 · Significance: 3.3
NeurIPS 2025

Tru-POMDP: Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29
TL;DR

Tru-POMDP combines rigorous POMDP planning with structured belief generation via Large Language Models (LLMs).

Abstract

Keywords
Robotics · Planning under uncertainty · Robot task planning · Large language models · POMDPs

Reviews and Discussion

Review
Rating: 5

The paper proposes to do task planning in the open world by combining an LLM for world modelling with a POMDP planner. A Tree-of-Hypotheses method is introduced to query LLMs in order to generate particle beliefs over world states and human goals. Experiments on object rearrangement in kitchen environments show better performance compared to LLM prompting methods.

Strengths and Weaknesses

Strengths:

  • Extends robot planning to the open world while handling uncertainty in a principled manner. Before the success of LLMs, methods based on POMDP solvers operated under the closed-world assumption, where the domain is limited in order to make construction of the POMDP models feasible. LLM-MCTS initiated the combination of LLMs and POMDP solvers by using LLMs as a world model and search heuristic, but it still operates in a closed world where objects and areas are known. This paper proposes a method for generating open-world beliefs that does not require these to be known, through its Tree of Hypotheses method, making progress on the important problem of operating in the open world.
  • The proposed method outperforms LLM methods that are not based on POMDP solvers, namely ReAct and Reflexion.
  • The method is relatively clean and simple, hence others are likely able to build on it.

Weaknesses:

  • Novelty is not particularly high. LLMs are trained on very large amounts of data, hence it is not surprising that they are able to generate a reasonable distribution of open-world objects.

Questions

The paper mentions that LLM-MCTS uses a single maximum likelihood hypothesis for the task goal and hidden object placements (lines 293-294). This is true for the task goal, but LLM-MCTS uses a distribution for object placements, although the set of objects is known in advance. The authors should more carefully explain how this work makes progress beyond what LLM-MCTS does.

Limitations

Yes.

Final Justification

In rebuttal discussion, the authors provided analysis of errors made by the methods and identified cognitive shift as a source of errors that is not handled by current methods. I think this is useful and adds novelty to their paper. As a result, I have increased my rating.

Formatting Issues

None.

Author Response

Thank you for your valuable feedback and for recognizing our work's key strengths, particularly in extending principled POMDP planning to the open world. We appreciate your assessment that the method is clean, simple, and likely to be built upon by others. We are grateful for the opportunity to address your comments on novelty and to clarify the specific advancements our work makes over LLM-MCTS.


Q: Novelty is not particularly high. LLMs are trained on very large amounts of data, hence it is not surprising that they are able to generate reasonable distribution of open world objects.

A: We respectfully argue that while using an LLM to generate hypotheses may seem intuitive in retrospect, making this concept work for principled planning is a significant and novel contribution. We ask the reviewer to consider the novelty from two perspectives:

From the LLM Perspective: Our work addresses fundamental limitations of LLMs for probabilistic planning tasks.

  • First, LLMs have limited probabilistic reasoning capabilities on their own, often leading to suboptimal policies when faced with uncertainty, as shown by the performance of strong LLM-based baselines in our experiments (see Figure 2).
  • Second, naively asking an LLM to generate a flat distribution of open-ended world states is ineffective for complex planning. Our ablation study ("w/o TOH" in Figure 4) demonstrates that without the structured, factored approach of our novel Tree of Hypotheses (TOH), the generated belief is of poor quality and leads to low performance (dropping 26% success).
  • Third, our Hybrid Belief Update module is the first to seamlessly integrate principled Bayesian filtering with on-demand LLM inference, allowing both to function simultaneously to maintain a robust and efficient belief.

From the POMDP Planning Perspective: Our work expands the entire scope of what POMDPs can solve.

  • Traditionally, POMDP planning has been confined to closed-world problems, where the state, action, and observation spaces are fully predefined.

  • We introduce and solve the open-ended POMDP problem, where core elements are not predefined due to unknown objects and ambiguous human intentions. Our open-ended belief tracker and POMDP planner are the first to address this formulation, dynamically constructing the state, action, and observation spaces from LLM hypotheses. This framework could potentially be extended to handle unknown actions or action consequences in the future.

  • We developed a novel open-ended planner with two key contributions:

    • A dynamically constructed action space: A naive action space would be intractably large or infinite due to the unbounded set of possible objects. Our planner dynamically grounds a compact, relevant set of actions based on the current particle belief. This is essential for making online planning tractable in an open-ended world.

    • An LLM-generated rollout policy integrated into tree search: Unlike prior work using random rollouts or expensive LLM rollouts, we leverage an LLM to generate a coded heuristic policy that provides strong, commonsense guidance from the leaf nodes of the belief tree search. This novel integration of a code-based LLM policy within a principled planner significantly improves performance (by 22%), as shown in our ablation study (Figure 4, "w/o LRP").


Q: The paper mentions that LLM-MCTS uses a single maximum likelihood hypothesis for task goal and hidden object placements (line 293-294). This is true for task goal, but LLM-MCTS uses a distribution for object placements, although the set of objects are known in advance. The authors should more carefully explain how this work makes progress beyond what LLM-MCTS does.

A: Thank you for this important correction. You are right that LLM-MCTS uses a distribution for initial object placements. We will correct this in the final manuscript. Tru-POMDP makes several fundamental advances beyond LLM-MCTS, enabling planning in truly open-ended scenarios:

  • Open-Ended vs. Closed-Domain Planning: This is the most critical distinction. LLM-MCTS operates in a closed domain where the set of all possible objects is known in advance. In contrast, Tru-POMDP is designed for open-ended environments where the robot must reason about entirely unknown objects and ambiguous human goals. Our framework dynamically constructs its state, observation, and action spaces from LLM-generated hypotheses, a significant step towards real-world open-ended planning.

  • Factored Multi-Aspect Belief vs. Flat Single-Aspect Belief: Tru-POMDP explicitly models multiple, interacting sources of uncertainty (ambiguous goals, unknown target objects, and hidden placements) using the structured Tree of Hypotheses. This factored approach is key to generating high-quality beliefs scalably. LLM-MCTS uses a "flat" belief that only considers a single aspect of uncertainty: the placement of known objects.

  • Generalizable Belief Representation: Our use of a particle-based belief is inherently more flexible and generalizable to other domains beyond object rearrangement compared to the object-receptacle probability vectors used in LLM-MCTS.

  • Computational Efficiency and Weaker Priors: The LLM-MCTS planner is computationally intensive, as it requires querying the LLM to provide a search heuristic at each node during tree expansion. As seen in our experiments, this becomes intractable for complex tasks. Tru-POMDP is far more efficient because it does not call the LLM inside the belief tree search. Instead, it uses a pre-compiled, LLM-generated rollout policy that requires only weak prior knowledge (a language description of the domain).

We will incorporate these points to more clearly position our work relative to LLM-MCTS in the final version. Thank you again for your constructive feedback and your support for our paper.

Comment

Thanks for the clarifications. Part of the reason that I rate the paper lower on novelty is that I did not see too much that I did not expect. A quick analysis of the failure cases, particularly for the harder cases, may provide useful information that is not expected. Can the failures be avoided by spending more (but not prohibitively more) compute? Would humans also fail in those cases, or are humans using a different strategy?

Comment

Thank you for the insightful questions. Analyzing failure modes clarifies how Tru-POMDP advances beyond the fundamental architectural flaws of existing methods to address more nuanced, real-world challenges.


Failure Modes of Baseline Methods

  • LLM-as-Planners (ReAct, Reflexion):
    • Systematic Exploitation Bias: LLM planners have a strong tendency to use only currently visible objects, lacking a principled mechanism to reason about missing information and decide when to perform exploration (e.g., opening a cabinet). This frequently leads to failure when key objects are occluded.
    • Inability to Diagnose Complex Failures: Without an explicit belief-tracking module, these agents cannot effectively reason about why a plan is failing. When the environment repeatedly signals that a goal is not met, they cannot disentangle the multiple sources of potential causes (e.g., "Did I choose the wrong objects, or are they in the wrong place?"). This probabilistic reasoning is a known weakness of LLMs.
    • Endless Thinking Loops: In complex scenarios, these agents may recognize a plan's failure but are unable to generate new coherent plans, getting stuck in non-productive "thinking" loops.
  • LLM-MCTS:
    • Closed-World Brittleness: It requires a predefined object set, thus architecturally incapable of handling tasks where a target object is truly unknown and lies outside this predefined set. This is a key challenge addressed in our open-ended setting.
    • Single Goal Hypothesis: Committing to a single maximum-likelihood goal hypothesis is a critical point of failure. Given significant ambiguity of human instructions, this initial guess is often wrong, and the planner has no mechanism to recover.
    • Computational Intractability: Querying an LLM at each node during tree search is prohibitively expensive, failing to produce plans within a practical time limit.

Failure Modes of Tru-POMDP

Tru-POMDP successfully addresses the above issues. Its failures are strategic, not architectural, revealing the next layer of challenges:

  • Inefficient Exploration Strategy: The planner sometimes prefers to exhaust all plausible placement combinations of known target objects before exploring to find new hidden objects. While its Bayesian updates will eventually guide it to consider new objects, this exhaustive strategy can be inefficient, leading to timeouts in very hard cases.
  • Cognitive Shift: This is a subtle but critical failure mode that we deliberately modeled by using different LLMs for task generation (GPT-4o) and for planning (GPT-4.1). A failure occurs when the ground-truth goal relies on a commonsense assumption that is present in the "user's" model (the task generator) but not in the "robot's" model (the planner). This models a realistic cognitive gap between a user and a robot that computation alone cannot bridge.

Can More Compute Avoid These Failures?

This distinction is key to our contribution:

  • Yes, for some Tru-POMDP failures: The "inefficient exploration" issue is largely computational. Given more time, belief updates would correctly guide the planner to explore for new objects after exhausting unpromising combinations.
  • No, for baseline and other Tru-POMDP failures: The remaining failures are algorithmic and conceptual.
    • Baselines: No amount of compute can fix their architectural flaws, such as the lack of principled belief-tracking or reliance on closed-world assumptions.
    • Cognitive Shift (Tru-POMDP): This is an intrinsic knowledge mismatch. More compute cannot reconcile different worldviews; the solution requires algorithmic changes like multi-model ensembling or active communication.

Comparison to Human Problem-Solving

Humans also struggle with the exploration/exploitation trade-off and cognitive shifts. However, a key human strategy for ambiguity is active communication ("Prepare for a party?" -> "Should I get snacks?"). This is a primary limitation but also a crucial future direction for our work, which Tru-POMDP is well-suited to support, as POMDPs are designed to optimally balance task completion actions with information-gathering actions like asking disambiguation questions [1,2].


This analysis shows Tru-POMDP advances the research problem from the fundamental limitations of baselines (e.g., intractability, open-endedness) to more advanced challenges like strategic exploration and cognitive alignment. This progression highlights our framework's novelty, and we will incorporate this discussion into the final manuscript.

[1] H. Zhang, Y. Lu, C. Yu, et al. INVIGORATE: Interactive Visual Grounding and Grasping in Clutter. In Proceedings of Robotics: Science and Systems, 2021.

[2] M. Shridhar, D. Mittal, D. Hsu. INGRESS: Interactive visual grounding of referring expressions. Int. J. Rob. Res. , 39(2–3):217–232, 2020.

Comment

Thanks for the response which is informative. I'm more positively inclined now, going into discussion with the area chair and other reviewers.

Review
Rating: 5

The paper introduces Tru-POMDP, a novel approach designed to extend POMDP planners to more open-ended and uncertain scenarios. Specifically, the method uses LLMs to construct a structured hierarchical Tree of Hypotheses (TOH), forming an initial belief distribution and generating rollout policies. Tru-POMDP integrates this with a modified DESPOT algorithm, featuring a dynamically updated action space derived from current beliefs. Bayesian belief updates manage the belief distributions, supplemented by LLM queries to replenish particle sets when necessary. Experiments conducted on kitchen rearrangement tasks demonstrate Tru-POMDP’s performance against baseline methods, particularly highlighting robustness and efficiency in scenarios involving ambiguity and partial observability.

Strengths and Weaknesses

Strengths: The primary strengths of the paper include an innovative integration of LLMs with rigorous POMDP planning frameworks, effectively combining common-sense reasoning and probabilistic planning. The authors clearly contextualize their contributions within existing literature, emphasizing both novelty and practical applicability. I also appreciated the thoughtful use of belief-based action space restriction, significantly enhancing planning efficiency. The provided ablation studies offer a thorough analysis of each component's impact, clearly detailing improvements in terms of performance metrics, including computational efficiency.

Weaknesses: The paper exhibits several areas requiring clarification or enhancement. Primarily, the LLM-generated rollout policy, which is a crucial element for online tree-based algorithms, receives insufficient attention regarding its precise formulation and implementation details. Additionally, the experimental evaluation lacks a clear upper bound or optimal baseline, which would enable readers to better contextualize the performance results. Another weakness is that the discussion of scalability is somewhat neglected, particularly the practical limitations associated with generating hypotheses via LLMs as the complexity of the scenario increases, and how to generate reasonable beliefs from particles as this hypothesis space grows.

Quality: Technically sound with well-supported claims, though more detail on rollout policies and scalability would improve completeness.
Clarity: Clearly written and organized, but additional clarification on the rollout policy would enhance reproducibility.
Significance: Integrates LLM and POMDP planning effectively, addressing challenging uncertainty issues and likely impacting future research.
Originality: The novel combination and use of existing methods (LLMs and POMDPs) differentiates it from prior work.

Questions

  1. Could the authors clarify their considerations or measures for handling the potential combinatorial explosion in the hypothesis space as the complexity of target objects and placement goals increases?
  2. When generating hypotheses for current object placements, is independence between hypotheses assumed and/or required? If so, are there anticipated challenges associated with managing the significantly increased parameters of conditional distributions?
  3. Regarding particle replenishment via the LLM, are these particles newly generated from the existing Tree of Hypotheses, or is a completely new tree constructed?
  4. Can the authors provide more explicit details on how often the LLM-generated rollout policy is updated during action selection? Is it generated once per action or less frequently? Is there refinement of the policy or any issues with the policy/code being valid?
  5. Does the reported token count in the experiments include the tokens necessary for generating the LLM rollout policy?

Limitations

Yes

Final Justification

The paper presents a novel and well-executed integration of LLM-based structured reasoning with principled belief-space POMDP planning, addressing open-ended uncertainty in a way that is both sound and relevant. While the evaluation is limited to kitchen rearrangement tasks and adapting to new domains will require some engineering, the conceptual contributions are strong enough to merit acceptance. The authors have addressed all of my questions in the rebuttal, adding valuable details on rollout policy design, providing an upper-bound baseline, and assessing the independence assumption. I recommend acceptance, with tempered generality claims in the final version.

Formatting Issues

No major formatting issues.

Author Response

Thank you for your very supportive and thorough review. We are delighted that you found our integration of LLMs with POMDPs to be innovative, significant, and original. We appreciate your insightful questions, which give us the opportunity to provide more detail on the rollout policy, present new results with an upper-bound baseline, and clarify our approach to scalability.


Responses to Weaknesses

Q: The LLM-generated rollout policy, which is a crucial element for online tree-based algorithms, receives insufficient attention regarding its precise formulation and implementation details.

A: Thank you for pointing this out. We are happy to provide more detail.

  • Formulation: The rollout policy is a mapping from the current state $s = (G_s, p_s)$ to an action $a$. Here, $G_s$ is the scene graph representing the hypothesized world state, and $p_s$ is the placement goal. This mapping is implemented as a C++ function generated by an LLM (an illustrative sketch is given after this list). The generator LLM receives a general language description of the object rearrangement domain but is given no information about any specific task instance.
  • Implementation: The prompt used to generate this C++ policy code conveys four key aspects of knowledge to the LLM: (1) the objective (e.g., find and rearrange objects to reach the goal), (2) action constraints and semantics, (3) required input/output formats for the function, and (4) the API for accessing the state from within the policy code. The full prompt is provided in Appendix A.2. We will incorporate these details into Section 4.3 in the final version to improve reproducibility.
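To make the formulation above concrete, here is a minimal C++ sketch of what such a coded rollout policy could look like. It is only an illustration under simplifying assumptions: the type names, fields, and greedy logic are invented for exposition and are not the authors' generated code or their state-access API (the actual prompt and API are described in Appendix A.2).

```cpp
// Illustrative sketch only; all names and the greedy logic are hypothetical.
#include <string>
#include <unordered_map>
#include <vector>

struct SceneGraph {
    // object name -> receptacle it currently occupies ("unknown" if hypothesized but unseen)
    std::unordered_map<std::string, std::string> object_location;
    std::vector<std::string> closed_containers;  // containers not yet opened
};

struct PlacementGoal {
    // object name -> receptacle it should end up in
    std::unordered_map<std::string, std::string> target_location;
};

struct Action {
    std::string type;    // "pick_place", "open", or "noop"
    std::string object;  // object to move, or container to open
    std::string target;  // destination receptacle (for "pick_place")
};

// Rollout policy: maps a hypothesized state s = (G_s, p_s) to an action.
Action rolloutPolicy(const SceneGraph& G, const PlacementGoal& goal) {
    // 1. If some goal object is visible but misplaced, move it to its goal receptacle.
    for (const auto& [obj, dst] : goal.target_location) {
        auto it = G.object_location.find(obj);
        if (it != G.object_location.end() && it->second != "unknown" && it->second != dst) {
            return {"pick_place", obj, dst};
        }
    }
    // 2. Otherwise explore: open a still-closed container to reveal hidden objects.
    if (!G.closed_containers.empty()) {
        return {"open", G.closed_containers.front(), ""};
    }
    // 3. Nothing useful left to do under this hypothesis.
    return {"noop", "", ""};
}
```

The property this sketch is meant to highlight is that the policy is an ordinary compiled function of a hypothesized state, so it can be evaluated at every leaf of the belief tree without any further LLM queries.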

Q: The experimental evaluation lacks a clear upper bound or optimal baseline, which would enable readers to better contextualize the performance results.

A: This is an excellent suggestion. We have run a new experiment with an omniscient agent that serves as an oracle upper bound. This agent has full knowledge of the true world state (all hidden objects and the true goal) and plans optimally using MCTS. The results are shown below, integrated with the original results from Figure 2.

Table 1: Cumulative Reward Comparison

Method                 | Easy          | Medium         | Hard
Oracle (Upper Bound)   | 958.3 ± 1.6   | 937.5 ± 2.2    | 906.5 ± 3.5
Tru-POMDP (Ours)       | 737.9 ± 33.8  | 547.9 ± 47.3   | 381.6 ± 52.9
LLM-MCTS               | -53.6 ± 2.0   | -48.1 ± 1.7    | -46.0 ± 1.6
Reflexion              | 126.0 ± 64.5  | -295.8 ± 49.8  | -425.2 ± 49.6
ReAct                  | 26.1 ± 65.6   | -472.2 ± 36.6  | -578.1 ± 35.9

Table 2: Total Step Number Comparison

Method                 | Easy        | Medium      | Hard
Oracle (Upper Bound)   | 4.9 ± 0.2   | 7.3 ± 0.2   | 10.6 ± 0.3
Tru-POMDP (Ours)       | 13.2 ± 0.6  | 19.4 ± 0.7  | 25.7 ± 0.8
LLM-MCTS               | 25.0 ± 0.0  | 30.0 ± 0.0  | 35.0 ± 0.0
Reflexion              | 18.9 ± 0.8  | 27.6 ± 0.6  | 32.7 ± 0.6
ReAct                  | 19.1 ± 0.9  | 28.8 ± 0.5  | 33.8 ± 0.5

The significant performance gap between the oracle and Tru-POMDP highlights the cost of uncertainty. The oracle, knowing the precise world state, never needs to explore or reason about hypotheses, thus taking far fewer steps and always succeeding. Tru-POMDP's ability to close a substantial portion of this gap demonstrates its effectiveness in managing and resolving uncertainty.


Q: A discussion of scalability is somewhat neglected; particularly, the practical limitations associated with generating hypotheses via LLMs as the complexity of the scenario increases and how to generate reasonable beliefs... as this hypothesis space grows.

A: We appreciate this important point. Our primary strategy for tackling scalability is through factorization. The Tree of Hypotheses (TOH) explicitly encodes a factored belief over the state space, decomposing it into (1) target objects, (2) goal placements, and (3) current placements, with dependencies between them. The benefit of this design is empirically validated in our ablation study (Figure 4), where the "w/o TOH" variant, which generates an unfactored belief, is highly ineffective, leading to a performance drop of 26%. As long as a problem's state space can be decomposed into smaller factors, our approach remains effective.
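Written out, the factorization described above corresponds to a belief of roughly the following form (a schematic rendering for illustration, not the paper's exact notation):

$$b(s) \;\approx\; P(\mathcal{O}_{\text{target}})\; P(p_{\text{goal}} \mid \mathcal{O}_{\text{target}})\; \prod_i P(\ell_i \mid \mathcal{O}_{\text{target}}),$$

where $\mathcal{O}_{\text{target}}$ denotes the hypothesized target objects, $p_{\text{goal}}$ the hypothesized goal placements, and $\ell_i$ the hypothesized current location of the $i$-th hidden object (treated as conditionally independent, as discussed in the responses below). The TOH queries the LLM for each factor in turn, conditioned on its parent, rather than asking for the joint distribution in one shot.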


Responses to Questions

Q: Could the authors clarify their considerations or measures for handling the potential combinatorial explosion in the hypothesis space as the complexity of target objects and placement goals increases?

A: We manage this combinatorial growth through a multi-faceted approach. Let $C_1$ be the constant number of hypothesized target object sets, $C_2$ be the constant number of location hypotheses per object, and $N$ be the number of objects per set.

  1. Linear Growth in LLM Calls: By assuming conditional independence of object locations (see next point), the number of LLM calls scales linearly with the number of hidden objects ($1 + C_1 \times N$), not exponentially.
  2. Bounded Particle Set: While the number of possible hypothesis combinations (particles) grows exponentially ($C_1 \times C_2^N$), we resample the belief down to a constant number of particles ($k = 30$ in our experiments) before feeding it to the POMDP planner (a small numeric illustration is given after this list). This keeps the online planning cost manageable. For even larger problems, our framework could be integrated with highly parallelized POMDP solvers that support thousands of particles online [1].
  3. Hierarchical Decomposition: For tasks with a very large number of target objects, a hierarchical approach could be used, where the task is first decomposed into subtasks with a manageable number of objects, and Tru-POMDP is then applied to each subtask.
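For a purely illustrative sense of scale (the values of $C_1$, $C_2$, and $N$ below are invented; only $k = 30$ comes from our experiments): with $C_1 = 3$ target-object sets, $C_2 = 5$ location hypotheses per object, and $N = 4$ objects per set,

$$\text{LLM calls} = 1 + C_1 N = 1 + 3 \times 4 = 13, \qquad \text{hypothesis combinations} = C_1\, C_2^{N} = 3 \times 5^{4} = 1875,$$

yet the planner only ever operates on the $k = 30$ particles retained after resampling.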

Q: When generating hypotheses for current object placements, is independence between hypotheses assumed and/or required? If so, are there anticipated challenges associated with managing the significantly increased parameters of conditional distributions?

A: Yes, in our proposed method we assume the initial locations of different hidden objects are conditionally independent to reduce the number of required LLM calls. We ran a new ablation study to test the impact of this assumption. In the "w/o independence" variant, we query the LLM to generate a joint placement hypothesis for the entire set of target objects. The results show that our original approach (with the independence assumption) performs significantly better across all metrics.

Table 3: Ablation on Independence Assumption (Hard Difficulty)

Method                         | Cumulative Reward | Success Rate (%) | Total Planning Time
Tru-POMDP (with independence)  | 381.6 ± 52.9      | 71 ± 5           | 321.3 ± 19.7
w/o independence               | 39.9 ± 49.9       | 36 ± 5           | 493.1 ± 16.5

This suggests that prompting for independent locations reduces the cognitive load on the LLM and allows it to generate a more diverse set of hypotheses, which outweighs the benefit of capturing potential correlations. The variant without independence also required significantly more planning time, due to more serialized LLM queries.


Q: Regarding particle replenishment via the LLM, are these particles newly generated from the existing Tree of Hypotheses, or is a completely new tree constructed?

A: A completely new Tree of Hypotheses is constructed. This is necessary because new hypotheses must be conditioned on the most up-to-date observations. Any particles generated by the old TOH that remain consistent with all historical observations are already preserved in the belief via the Bayesian filter, so their information is fully leveraged.


Q: Can the authors provide more explicit details on how often the LLM-generated rollout policy is updated during action selection? Is it generated once per action or less frequently? Is there refinement of the policy or any issues with the policy/code being valid?

A: The LLM-generated rollout policy is generated only once for the entire object-rearrangement domain in an offline stage. It is task-agnostic and scene-agnostic and is reused across all experiments without any updates or refinement. The generated C++ code was valid and required no manual correction.


Q: Does the reported token count in the experiments include the tokens necessary for generating the LLM rollout policy?

A: No, the reported token counts reflect the online planning costs only. The rollout policy generation is a one-time, offline cost for the entire domain, which is negligible in comparison. For completeness, this offline pass consumed only 1,353 tokens.


Thank you again for your careful reading and strong support of our paper. We hope our answers and new results have satisfactorily addressed your questions.

References

[1] P. Cai, Y. Luo, D. Hsu, et al. HyP-DESPOT: A hybrid parallel algorithm for online planning under uncertainty. The International Journal of Robotics Research, 40(2-3), 558-573, 2021.

Comment

Thank you for your thorough and clear responses. You have effectively addressed all of my questions. I particularly appreciate the additional upper-bound baseline results and your careful assessment of the independence assumption. Given your new results highlighting the importance of this assumption, it might be valuable to explicitly note in the manuscript that Tru-POMDP’s performance could significantly degrade when independence between object placements cannot be reasonably assumed.

Comment

Thank you for your positive feedback and for raising this valuable point. We will ensure that a discussion of this limitation is included in the final version of our paper.

Review
Rating: 4

This paper proposes Tru-POMDP, a task planning framework that integrates Large Language Models (LLMs) with belief-space POMDP planning for robot manipulation tasks under uncertainty. The core idea is to build a Tree of Hypotheses using LLM queries, representing multiple plausible goals and hidden object placements. These hypotheses are filtered via a hybrid Bayesian mechanism and fed into an online planner, allowing for robust and efficient planning under ambiguity and partial observability. Experiments in diverse simulated household environments demonstrate significant performance gains over strong baselines, including ReAct, Reflexion, and LLM-MCTS.

Strengths and Weaknesses

Strengths:

This paper addresses a critical and underexplored research gap in robot task planning under uncertainty. The proposed integration of structured LLM reasoning with POMDP-based decision-making is both novel and timely.

The core insight of the paper has strong real-world applicability, particularly for household service robots dealing with ambiguous instructions and hidden object states.

Weaknesses:

  • The computational analysis of the method could be stronger. Figure 5 presents a comparison against ablation variants, but the performance in comparison to other works in the literature (e.g., the ones shown in Figure 2) is still unclear.

  • Although five different kitchen environments were used, the work could show more generality if experiments are performed in other types of environments, beyond only kitchen-related tasks.

  • Although the paper is well-written, the text could be made more accessible. For instance, abbreviations such as POMDP and Tru-POMDP are not being defined when they are first mentioned. Other technical terms could also be made more clear in the introduction (such as "open-ended uncertainty"), in order to communicate more clearly the motivation and the significance of the work to a wider audience.

Questions

Does the Bayesian filter in this paper involve an actual Bayesian update process, or does it primarily rely on the LLM to combine prior information with new observations?

What is the computational cost of the proposed approach, in comparison with other works in the literature?

Limitations

Yes.

Final Justification

I am happy with the authors rebuttal, and the additional experiments provided.

On the other hand, I agree with reviewer CHEj that the paper would be stronger if it was demonstrated in another domain. Therefore, I decided to keep my current score.

Formatting Issues

N/A

Author Response

Thank you for your supportive assessment and for recognizing that our work addresses a critical research gap with a novel and timely approach. We are grateful for your constructive feedback on strengthening the computational analysis, demonstrating generality, and improving clarity. We will address each of these points below.


Q: The computational analysis of the method could be stronger. Figure 5 presents a comparison against ablation variants, but the performance in comparison to other works in the literature (e.g., the ones shown in Figure 2) is still unclear.

A: Thank you for this suggestion. To clarify the computational cost relative to the baselines from Figure 2, we provide the detailed average planning times (in seconds) per episode across all difficulty levels below.

Method            | Easy          | Medium        | Hard
Tru-POMDP (Ours)  | 135.8 ± 14.6  | 188.4 ± 25.1  | 245.9 ± 29.5
LLM-MCTS          | 600.8 ± 0.1   | 600.8 ± 0.1   | 601.8 ± 0.7
Reflexion*        | 114.7 ± 15.1  | 161.4 ± 16.5  | 181.8 ± 16.5
Reflexion         | 100.2 ± 10.9  | 155.6 ± 16.6  | 181.7 ± 16.8
ReAct*            | 102.5 ± 11.0  | 150.3 ± 15.6  | 167.3 ± 14.4
ReAct             | 97.3 ± 11.2   | 148.8 ± 12.0  | 165.7 ± 13.1

As the table shows, Tru-POMDP is significantly faster than LLM-MCTS, which also performs explicit reasoning but timed out on all tasks due to excessive LLM calls within its search loop. While Tru-POMDP is slower than purely reactive LLM planners (ReAct/Reflexion), this overhead is a direct result of the comprehensive belief generation, which is crucial for robust planning under uncertainty and achieving the high success rates shown in Figure 2. This cost is a worthwhile trade-off for principled decision-making. We also note that this overhead could be substantially reduced in future work by fine-tuning a specialized belief-generation LLM.


Q: Although five different kitchen environments were used, the work could show more generality if experiments are performed in other types of environments, beyond only kitchen-related tasks.

A: We agree that demonstrating generality is important. While our experiments are situated in kitchen environments, the underlying algorithmic components—factored belief generation with TOH, representing hypotheses as a particle belief, and POMDP planning with dynamic action spaces and LLM rollout policies—are fully generalizable.

Adapting our framework to a new domain would require redesigning the semantic POMDP model and the specific factorization structure of the TOH. However, we argue that this is no longer a significant barrier. With the advent of powerful LLMs and Vision-Language Models (VLMs), semantic or symbolic models for robotics domains (e.g., table-top manipulation, assembly, in-factory logistics) can now be effectively learned from human demonstrations and interactions [1]. Furthermore, the factorization structure of the TOH, essentially reflecting the factorization structure of the hidden state space of a domain, could itself be determined by an additional LLM layer that analyzes a language description of the new domain, further automating adaptation.


Q: Although the paper is well-written, the text could be made more accessible. For instance, abbreviations such as POMDP and Tru-POMDP are not being defined when they are first mentioned. Other technical terms could also be made more clear in the introduction...

A: Thank you for this valuable feedback. We agree that improving accessibility is crucial. We will revise the introduction in the final version to define all abbreviations and technical terms clearly. To clarify for this rebuttal, here are our definitions for the central concepts:

  • Planning under uncertainty: In the real world, robots face numerous uncertainties: perception is imperfect and incomplete (e.g., occlusions), human intentions are unobservable, and actions can have stochastic effects (e.g., execution failures). To achieve safety and efficiency, robots must plan in a way that hedges against these uncertainties.
  • POMDP (Partially Observable Markov Decision Process): This is a principled mathematical framework for planning under the uncertainties mentioned above. A POMDP model requires a full specification of all possible states, actions, and observations, which limits its use to "closed-domain" problems.
  • Open-ended Uncertainty: This refers to planning problems where one or more fundamental factors of the POMDP model, e.g., state, action, or observation space, cannot be fully predefined, possibly induced by an unknown set of possible objects, an unknown set of potential goals, or even an unknown set of available skills.
  • Open-ended POMDP: Our work introduces this concept, where we integrate the commonsense reasoning of LLMs with the rigorous planning of POMDPs. This allows us to address open-ended problems by dynamically constructing states, actions, and observations from LLM-generated hypotheses.

Q: Does the Bayesian filter in this paper involve an actual Bayesian update process, or does it primarily rely on the LLM to combine prior information with new observations?

A: Our framework uses an actual Bayesian update process. The Hybrid Belief Update module (see Section 4.2) executes a principled particle filter update in every planning cycle. This filter refines the existing belief by eliminating particles inconsistent with the latest observation, ensuring the belief remains grounded in reality. The LLM, via the Tree of Hypotheses, is invoked only when necessary—specifically, when "particle depletion" occurs (i.e., the total weight of consistent particles drops below a threshold), indicating that the current belief no longer explains the observations well. This hybrid mechanism significantly reduces the number of expensive LLM queries by 61% while ensuring stable and accurate belief tracking, which contributes to the superior planning performance shown in our ablation study (Figure 4).
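For illustration, the following is a minimal C++ sketch of such a hybrid update under the assumption of a weighted-particle belief; the threshold value, type names, and the replenishment hook are placeholders rather than the authors' implementation.

```cpp
// Illustrative sketch: Bayesian particle filtering with LLM replenishment on depletion.
// All names, types, and the threshold value are hypothetical.
#include <functional>
#include <vector>

struct Particle {
    int state_id;    // index of a hypothesized world state (scene graph + goal), abstracted here
    double weight;
};

using Belief = std::vector<Particle>;
// True if the particle is consistent with the latest (action, observation) pair.
using ConsistencyCheck = std::function<bool(const Particle&, int, int)>;
// Queries the LLM via a fresh Tree of Hypotheses, conditioned on the full history.
using LlmReplenish = std::function<Belief()>;

Belief hybridBeliefUpdate(Belief belief, int action, int observation,
                          const ConsistencyCheck& consistent,
                          const LlmReplenish& replenish,
                          double depletion_threshold = 0.05) {
    // 1. Bayesian filter step: zero out particles inconsistent with the new observation.
    double total_weight = 0.0;
    for (auto& p : belief) {
        if (!consistent(p, action, observation)) p.weight = 0.0;
        total_weight += p.weight;
    }
    // 2. Particle depletion: if too little weight survives, ask the LLM for fresh hypotheses;
    //    consistent old particles are kept, so their information is not discarded.
    if (total_weight < depletion_threshold) {
        Belief fresh = replenish();
        belief.insert(belief.end(), fresh.begin(), fresh.end());
        total_weight = 0.0;
        for (const auto& p : belief) total_weight += p.weight;
    }
    // 3. Normalize the surviving weights.
    if (total_weight > 0.0) {
        for (auto& p : belief) p.weight /= total_weight;
    }
    return belief;
}
```

The point of the sketch is the ordering: the cheap Bayesian step runs every planning cycle, while the expensive LLM call is made only when the consistency check leaves the belief depleted.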


We thank you again for your insightful review and positive recommendation. We hope our responses have fully addressed your questions.

References

[1] Curtis, A., Tang, H., Veloso, T., Ellis, K., Tenenbaum, J., Lozano-Pérez, T., & Kaelbling, L. P. (2025). LLM-Guided Probabilistic Program Induction for POMDP Model Estimation. arXiv preprint arXiv:2505.02216.

Comment

Thank you for your detailed response. I look forward to seeing these changes incorporated into the final version. I will maintain my positive evaluation.

Review
Rating: 3

This paper presents Tru-POMDP, a framework that integrates LLM-based commonsense reasoning with principled POMDP belief-space planning to address robotic task execution under open-ended uncertainty. The system introduces a hierarchical Tree of Hypotheses to generate particle-based beliefs over scene graphs and task goals, applies Bayesian filtering for belief tracking, and uses an online POMDP planner (based on DESPOT) for action selection. While the overall framework is well-structured and practically relevant, the core technical novelty is concentrated mainly on the problem formulation. The paper relies heavily on existing solvers and standard belief tracking methods, with limited innovation in planning algorithms or inference techniques.

Strengths and Weaknesses

Strengths:

  1. Structured belief generation: The Tree of Hypotheses (TOH) effectively organizes LLM queries into object, goal, and placement hypotheses, providing interpretable and diverse particles.
  2. Belief-space planning: The use of online POMDP solvers like DESPOT allows the agent to reason over uncertainty instead of relying on brittle single-hypothesis planning.
  3. Hybrid belief update: Combining Bayesian filtering with LLM-generated particles improves update efficiency and robustness, reducing token cost and planning time.
  4. Efficient rollout policy: Using an LLM-generated C++ policy for DESPOT rollouts balances commonsense reasoning and computational tractability.

Weaknesses:

  1. Limited algorithmic novelty and theoretical analysis: The work does not contribute new theoretical insights or generalizable planning methods. Its scope is limited to a specific task domain. Most components, like particle filtering and DESPOT, are standard. The contribution lies primarily in their integration with LLMs.
  2. Heuristic hypothesis generation: The TOH process is prompt-based and manually structured, lacking learning or adaptation mechanisms to optimize hypothesis quality.
  3. Execution failure not modeled: The framework assumes deterministic action execution (e.g., always successful picks or opens). In practice, such actions may fail due to perception or actuation errors. The absence of failure detection and recovery mechanisms limits the system's real-world robustness.
  4. Limited task diversity despite general task planning title: Although the paper presents itself as addressing general task planning under uncertainty, the evaluation is focused solely on object rearrangement tasks in kitchen environments. It remains unclear how well the proposed method generalizes to other types of tasks, such as tool use, multi-agent collaboration, or dynamic goal adjustment, or how it handles real-world challenges like sensor noise, stochastic transitions, or action failures. A broader task set would be needed to support the generality implied by the title.

Questions

Suggestions:

  1. Incorporate action execution monitoring and failure recovery, such as retrying, replanning, or reasoning over failure causes within the belief update process.
  2. Add experiments in noisy or stochastic environments, and potentially test tasks beyond object rearrangement to validate generality.

Limitations

Yes

Formatting Issues

No

Author Response

Thank you for your constructive feedback and for acknowledging the strengths of our work, including the structured belief generation via the Tree of Hypotheses, the principled use of belief-space planning, the hybrid belief update mechanism, and the efficient LLM-generated rollout policy. We appreciate the opportunity to clarify our contributions and address your concerns regarding algorithmic novelty, generalizability, and the handling of execution failures. We believe our responses will demonstrate the significance of our contributions.


Q: Limited algorithmic novelty and theoretical analysis: The work does not contribute new theoretical insights or generalizable planning methods. Most components, like particle filtering and DESPOT, are standard. The contribution lies primarily in their integration with LLMs.

A: We respectfully disagree that the contribution is limited to a simple integration. While we build upon established methods, our primary contribution lies in creating the first framework that enables principled, belief-space POMDP planning in open-ended domains, a challenge that neither LLMs nor traditional POMDPs can solve alone. This synthesis offers significant conceptual and technical novelty:

  1. Bridging Open-Ended and Principled Planning: Traditional POMDPs are confined to "closed-world" problems with predefined state, action, and observation spaces. Conversely, LLMs, while capable of handling open vocabularies, often produce suboptimal or deficient policies under uncertainty (as shown in our experiments in Figure 2). Our work is the first to bridge this gap, empowering POMDPs to tackle open-ended uncertainty while grounding LLM reasoning in a rigorous probabilistic framework for long-horizon, near-optimal planning.

  2. Novel Belief Generation for Uncertainty: To our knowledge, prior work has used LLMs to generate only a single, most-likely hypothesis about the world state [1, 2, 3, 4]. A core insight of our work is to leverage LLMs to construct a complete belief, or a probability distribution over states, represented as a set of weighted particles. This is non-trivial; a naive extension from generating one hypothesis to many performs poorly, leading to a performance drop of 26%, as shown by the "w/o TOH" ablation in Figure 4. Our structured Tree of Hypotheses (TOH) is a novel and critical mechanism for generating a high-quality, diverse particle belief. Furthermore, the particle representation seamlessly integrates rigorous Bayesian filtering with LLM inference, a key aspect of our hybrid update module. We believe these insights will be valuable for future research in probabilistic planning with LLMs.

  3. Novel POMDP Planner Components: Our planner includes novel adaptations crucial for efficiency in open-ended domains.

    • The dynamic action space grounding, which instantiates actions based on hypothesized objects relevant to the task, prevents the action space from becoming intractably large.
    • The LLM-generated rollout policy, where a code-based policy is integrated into the tree search, significantly boosts planning performance by 22%, as demonstrated by the "w/o LRP" ablation in Figure 4.

Q: Heuristic hypothesis generation: The TOH process is prompt-based and manually structured, lacking learning or adaptation mechanisms to optimize hypothesis quality.

A: This is an excellent suggestion. Dynamically adapting the TOH structure is an important direction for future research. The current manually-structured TOH serves as a strong, interpretable baseline that demonstrates the viability of the approach. It reflects the factorization structure of the hidden state space of object rearrangement tasks. For future work, we envision two potential avenues for adaptation:

  1. One could introduce a meta-level agent that learns to determine the optimal factorization of the hidden state space, and thus the structure of the TOH, before generating hypotheses.
  2. Another promising direction is to treat the TOH structure as a parameter to be optimized via closed-loop learning, drawing inspiration from recent work on planner optimization [5].

Q: Execution failure not modeled... The framework assumes deterministic action execution... The absence of failure detection and recovery mechanisms limits the system's real-world robustness.

A: We agree that modeling execution failures is crucial for real-world robustness. A key strength of our framework is its inherent extensibility to such scenarios without requiring fundamental algorithmic changes. The POMDP formulation readily models perception noise and stochastic action outcomes via the probabilistic observation function $O(z \mid s', a)$ and transition function $T(s' \mid s, a)$.

While our current implementation uses a deterministic version of these models for clarity and to focus on the core contribution of handling open-ended uncertainty, the belief-space planning algorithm (DESPOT) and the particle filter-based belief tracking are designed to handle probabilistic models. Extending our system would involve learning the POMDP transition and observation models from data (which has been demonstrated in [6]). The resulting optimal policy under uncertainty would naturally exhibit robust behaviors, such as retrying failed actions or choosing information-gathering actions when uncertainty is high. Online replanning, which is already implemented, allows the agent to adapt its plan based on unexpected outcomes.
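As a hedged illustration of what such a probabilistic model would look like (the numbers below are invented for exposition, not measured values from the paper), a grasp that occasionally fails simply becomes a non-deterministic entry in the transition function:

$$T(s' \mid s, \texttt{pick}(o)) = \begin{cases} 0.9, & s' = \text{state with } o \text{ held by the robot (grasp succeeds)},\\ 0.1, & s' = s \text{ (grasp fails, world unchanged)}, \end{cases}$$

with the observation function $O(z \mid s', a)$ analogously capturing noisy detection of the outcome. The particle filter and DESPOT operate on such distributions without structural changes.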


Q: Limited task diversity... the evaluation is focused solely on object rearrangement tasks... It remains unclear how well the proposed method generalizes to other types of tasks... A broader task set would be needed to support the generality implied by the title.

A: We appreciate the feedback on task diversity. While our experiments focus on complex object rearrangement, the core technical contributions of Tru-POMDP are domain-agnostic and provide a foundation for general task planning under uncertainty.

  • LLM-generated particle beliefs can be applied to any domain where uncertain states can be described with semantic hypotheses.
  • The open-ended POMDP framework can be extended to tasks beyond object rearrangement, such as navigation, tool use, or multi-agent coordination, where the set of relevant states, objects, or goals is not known beforehand.
  • The semantic-level POMDP model can be applied to any domain where such a model can be formulated, for instance, by learning it from human demonstrations and interactions [6].
  • The hybrid belief update and the novel planner components (dynamic action space and LLM rollout policy) are general-purpose enhancements.

We acknowledge that demonstrating this generality on a broader set of tasks is an important next step. However, implementing and evaluating these extensions is substantial work that falls beyond the scope of a single paper. Our current work establishes the essential principles and demonstrates their effectiveness in a challenging, representative domain.


Suggestion: Incorporate action execution monitoring and failure recovery, such as retrying, replanning, or reasoning over failure causes within the belief update process.

A: This is an excellent suggestion. As discussed in our response regarding action failures, the Tru-POMDP framework already has replanning capability and can incorporate failure-recovery and retrying behaviors in the face of uncertainty, once equipped with stochastic transition and observation models. Implementing these extensions is a key part of our future research plan.


Suggestion: Add experiments in noisy or stochastic environments, and potentially test tasks beyond object rearrangement to validate generality.

A: As we outlined in our previous responses on generalizability and robustness, there are no fundamental barriers to applying Tru-POMDP more broadly. Our core technical designs—LLM-generated particle beliefs for semantic uncertainties, a hybrid belief update, a dynamic action space, and an LLM-generated rollout policy—are general. The POMDP formulation itself naturally supports noisy sensors and stochastic actions. While we believe these extensions would further strengthen our claims, implementing and thoroughly evaluating them in new domains and under noisy conditions is not feasible within the rebuttal period but remains a high-priority direction for future work.


References

[1] R. Hazra, P. Z. Dos Martires, and L. De Raedt. Saycanpay: Heuristic planning with large language models using learnable domain knowledge. In Proceedings of the AAAI Conference on Artificial Intelligence, 38, 20123–20133, 2024.
[2] A. Z. Ren, A. Dixit, A. Bodrova, et al. Robots that ask for help: Uncertainty alignment for large language model planners. arXiv preprint arXiv:2307.03046, 2023.
[3] B. Liu, Y. Jiang, X. Zhang, et al. Llm+p: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.
[4] T. Silver, S. Dan, K. Srinivas, et al. Generalized planning in pddl domains with pretrained large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, 38(18), 20256-20264, 2024.
[5] Y. Lee, K. Lee, P. Cai, et al. The Planner Optimization Problem: Formulations and Frameworks. arXiv preprint arXiv:2303.06768, 2023.
[6] Curtis, A., Tang, H., Veloso, T., Ellis, K., Tenenbaum, J., Lozano-Pérez, T., & Kaelbling, L. P. (2025). LLM-Guided Probabilistic Program Induction for POMDP Model Estimation. arXiv preprint arXiv:2505.02216.

Comment

I maintain my evaluation and rating for this work, as I still consider its contribution to be moderate and lacking theoretical novelty. However, I am comfortable if the Chairs decide to accept the paper. I also suggest that the authors clarify in the title or abstract that this work is primarily designed for object rearrangement tasks. For example, they could consider:

  • “Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs: A Case Study of Object Rearrangement”
  • “Task Planning Under Uncertainty via Tree of Hypotheses and Open-Ended POMDPs for Object Rearrangement”

Comment

We sincerely appreciate your support of the paper's potential acceptance, and will adopt your suggestion to revise the title in the final version, clarifying our current focus on object rearrangement.

Final Decision

This paper presents a novel and interesting approach to combining POMDP planning with LLMs for abstract planning. The formulation is well justified and the experimental results are promising, though they are currently demonstrated only in the setting of object rearrangement tasks. The work makes a valuable contribution by showing how symbolic reasoning and language-based hypothesis generation can be integrated into robot planning.

The main concerns raised during review centered on the generality of the approach. In particular, while the framework itself is domain-agnostic, adapting it to a new domain appears to require significant engineering effort, such as redesigning the semantic POMDP model and task-specific structures. The authors argued that LLMs and VLMs could automate much of this, but the AC did not find this to be a productive line of argumentation. Relying on future automation is a weaker justification than demonstrating generality directly. The reviewers generally agreed that the work stands on its novelty and promise, but that claims of generality should be tempered.

For the final version, it would be valuable to lessen the claims of generality and to consider a less general title.