PaperHub
Overall score: 6.4/10 · Spotlight · 4 reviewers
Ratings: 4, 4, 4, 4 (min 4, max 4, std 0.0)
Confidence: 3.8
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.0 · Significance: 2.5
NeurIPS 2025

ESCA: Contextualizing Embodied Agents via Scene-Graph Generation

Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

Contextualizing MLLM-based agents with grounded scene graphs boosts their performance.

Abstract

Keywords

neurosymbolic, scene graph, multimodal, MLLM agents

Reviews and Discussion

Review (Rating: 4)

This paper introduces ESCA to improve the perception and reasoning abilities of multimodal large language model (MLLM)-based embodied agents by integrating structured scene graph representations. At the core of ESCA is SGClip, a CLIP-based and promptable scene graph generation model, trained in a self-supervised, neurosymbolic manner on a large video dataset without requiring manual scene graph annotations. ESCA decomposes the visual perception process into modular steps and provides fine-grained, grounded visual context for agent planning and action. Experimental results across simulated environments demonstrate that ESCA consistently reduces perception errors and boosts the performance of both open-source and proprietary MLLMs, setting new benchmarks in embodied navigation and manipulation tasks.

Strengths and Weaknesses

Strengths:

  1. ESCA introduces a modular scene graph pipeline that grounds agent perception in explicit, structured representations, improving spatial and temporal understanding in complex environments.
  2. The proposed transfer protocol could adapt ESCA to diverse embodied benchmarks.
  3. The overall writing of the paper is clear and easy to understand and follow.

Weaknesses:

  1. The system relies on 2D visual inputs instead of video or 3D representations, which limits depth-aware reasoning and spatial precision.
  2. The proposed ESCA-Video-87K dataset contains a large number of semantically redundant words that have not been removed. For example, the relation keywords include 35,415 unique types, which may pose obstacles for applications.
  3. The entire system requires multiple calls to models such as MLLM and Grounding DINO, which may result in low operational efficiency.

Questions

  1. Since the video captions are generated using GPT, this indicates that GPT has the ability to perceive relationships within videos. So why not directly use GPT to generate scene graphs as part of the chain-of-thought process (similar to [1]), instead of building a new SGClip model to learn the results perceived by GPT? Isn't it inefficient to distill such knowledge from GPT into SGClip?
  2. An excessive number of relation vocabulary categories can significantly increase the number of tokens required during processing. For instance, SGClip conducts binary classification for each relation keyword, utilizing a vocabulary of 35,415 unique relation keywords. In contrast, models such as Qwen2.5-VL have a maximum input capacity of 32,768 tokens. This raises concerns about whether the inclusion of too many relation keywords and actions could exceed the model's context length, potentially impacting both the effectiveness and efficiency of the MLLM.

[1] Compositional Chain-of-Thought Prompting for Large Multimodal Models. CVPR 2024.

Limitations

See weaknesses.

Justification for Final Rating

The rebuttal addressed my concerns.

Formatting Issues

No

Author Response

Q5jV-Q1. Why not directly use GPT to generate scene graphs as part of the chain-of-thought process (similar to [1]), instead of building a new SGClip model to learn the results perceived by GPT?

We appreciate the reviewer's valuable feedback and evaluate ESCA against CoT-SG [1], a direct baseline where scene graphs are generated through a Chain-of-Thought MLLM pipeline. This approach prompts the MLLM to produce structured scene graph representations directly, without grounding. Further studies are detailed in kXSe-Q1.

EB-Navigation Results: ESCA consistently outperforms CoT-SG across all tested models. The Chain-of-Thought strategy actually degrades performance, particularly on long-horizon tasks where target objects are not initially visible. CoT-SG often fails to detect target objects after performing a few rotation actions to scan its surroundings, and it exhibits more hallucinations than the raw MLLM.

| Model | Raw Model | w/ GD | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 47.33% | 47.67% | 51.66% | 46.33% |
| Gemini-2.0-flash (Mar 2025) | 40.68% | 40.53% | 42.00% | 26.33% |
| Qwen2.5-VL-72B-Ins | 44.99% | 48.27% | 49.33% | 33.99% |
| GPT-4o | 51.33% | 53.66% | 54.67% | 44.00% |

EB-Manipulation Results: While CoT-SG shows modest improvements over raw models, ESCA maintains superior performance across all tested MLLMs, demonstrating the effectiveness of structured grounding over direct generation approaches.

| Model | Raw Model | w/ YOLO | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 19.31% | 19.30% | 24.30% | 20.00% |
| Gemini-2.0-flash (Mar 2025) | 11.81% | 16.54% | 21.94% | 17.60% |
| Qwen2.5-VL-72B-Ins | 4.72% | 13.34% | 19.04% | 17.33% |
| GPT-4o | 23.47% | 28.48% | 34.44% | 30.43% |

Q5jV-Q2. … raises concerns about whether the inclusion of too many relation keywords and actions could exceed the model's context length, potentially impacting both the effectiveness and efficiency of the MLLM.

Training vs. Inference Distinction: While our training dataset (ESCA-Video-87K) contains a large vocabulary (35,415 relation keywords, 57,930 actions, 220,905 names) reflecting natural language diversity, this does not translate to excessive token usage during inference. SGCLIP's core mechanism relies on cosine similarity between visual and textual embeddings rather than exhaustive binary classification over the entire vocabulary.

Efficiency Design: ESCA's four-phase architecture inherently prevents vocabulary explosion through progressive refinement. The concept extraction phase generates only task-relevant keywords specific to the given instruction and visual scene, while both object identification and scene graph prediction work within this constrained vocabulary.

Empirical Analysis: Our quantitative analysis on EB-Navigation shows that ESCA generates an average of 5.57 keywords per instruction, producing 84 unique target objects, 199 related objects, 296 relationships, and 117 attributes across the entire dataset. Even in the most complex scenarios, the total token count for scene graph representations remains manageable.
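To make the inference-time mechanism concrete, the sketch below shows CLIP-style cosine-similarity scoring restricted to a small, task-relevant candidate set; the checkpoint, candidate phrases, and dummy region crop are assumptions for illustration only, not the actual SGCLIP implementation.

```python
# Illustrative sketch only (not the authors' SGClip code): scoring a few
# task-relevant relation candidates with CLIP-style cosine similarity.
# The checkpoint, candidate phrases, and dummy region crop are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# A cropped subject-object region (a dummy image stands in for a real frame crop).
region = Image.new("RGB", (224, 224), color=(128, 128, 128))

# Concept extraction has already reduced the vocabulary to a handful of
# task-relevant candidates (avg. ~5-6 keywords per instruction in the analysis above).
candidate_relations = ["on top of", "next to", "inside", "behind", "holding"]
prompts = [f"the object is {r} the counter" for r in candidate_relations]

inputs = processor(text=prompts, images=region, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Scaled cosine-similarity logits over the *restricted* candidate set, so cost
# scales with the few extracted keywords, not the 35,415-relation training vocabulary.
probs = out.logits_per_image.softmax(dim=-1).squeeze(0)
for rel, p in zip(candidate_relations, probs.tolist()):
    print(f"{rel:>10s}: {p:.3f}")
```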

Q5jV-Q3. The system relies on 2D visual inputs instead of video or 3D representations, which limits depth-aware reasoning and spatial precision.

We agree that 3D representation generation is important for embodied agent applications. However, this direction is beyond the current scope of our work, which focuses on visual grounding using the available 2D egocentric inputs as defined by the benchmark setup. We will clarify this design choice in the paper and include 3D integration as a promising direction in our future work section.

Q5jV-Q4. The entire system requires multiple calls to models such as MLLM and Grounding DINO, which may result in low operational efficiency.

ESCA achieves significant perceptual grounding improvements while maintaining computational efficiency. Our method introduces only 1.09 seconds of additional processing per planning step, roughly 5% overhead compared to baseline systems. To put this in perspective: on the EB-Navigation benchmark with GPT-4o, a complete planning step takes 22.25 seconds with ESCA versus 21.16 seconds using the baseline MLLM + Grounding DINO approach. This modest increase in runtime delivers substantial benefits through structured, interpretable scene graphs that significantly enhance visual-semantic alignment. We consider this an excellent tradeoff: the marginal computational cost enables ESCA to generate rich scene representations that drive superior task performance and generalization across diverse embodied planning scenarios. The interpretability and accuracy gains far outweigh the modest time investment.

Comment

The rebuttal addresses my concerns, and I will keep my score.

Review (Rating: 4)

This paper addresses the challenge of fine-grained perception and physical world grounding in MLLM-based embodied agents. The authors propose the ESCA framework, which enhances an agent's environmental understanding by constructing a structured scene graph. A core contribution is the introduction of SGClip, a model specifically trained to predict relationships between objects, which forms the basis of the scene graph. The framework's effectiveness is demonstrated across multiple MLLMs and in two distinct embodied AI environments. Additionally, the authors contribute the ESCA-Video-87K dataset, generated via a neuro-symbolic self-supervised pipeline, which they plan to release to the community.

Strengths and Weaknesses

Strengths

  1. The paper accurately identifies a core shortcoming of current MLLM-based embodied agents: their limitations in fine-grained perception and grounding in the physical world. Enhancing the model's environmental understanding through structured scene graphs is a logically clear and practically significant research direction.
  2. The experimental evaluation is comprehensive, validating the framework's generality across several advanced MLLMs and demonstrating performance improvements in two different embodied environments.
  3. The paper is clearly written and well-structured, allowing readers to quickly grasp the core design ideas.

Weaknesses

  1. Necessity of the SGClip: The paper claims its main contribution and novelty lie in the SGClip model. However, I believe the core contribution is more in the workflow design (though this pipeline is quite common in robotics and embodied AI). I am skeptical about the necessity of SGClip. I believe that modern VLMs (e.g., GPT-4o), when properly prompted with outputs from upstream Concept Extraction and Object Identification, are sufficient to accurately and efficiently infer inter-object relationships. The paper's lack of a direct comparison with such strong baselines makes me unenthusiastic about introducing and specifically training an SGClip model.
  2. Limitations of CLIP-based Model: It is well-known that CLIP and its variants struggle with complex compositional relationships and precise spatial orientations (e.g., difficulty distinguishing "A is to the left of B" from "B is to the left of A," or compositional concepts like "person riding a horse" versus "horse riding a person"). While the paper proposes some engineering tricks, it lacks a thorough discussion and analysis of the inherent limitations of the CLIP model. Therefore, I am skeptical about the accuracy and robustness of using SGClip for predicting complex spatial and interaction relationships in an embodied agent setting.

Questions

  1. Could the authors provide a direct comparative analysis of SGClip's accuracy, robustness, and inference efficiency in relationship prediction against directly prompting an advanced VLM (e.g., GPT-4o) to describe inter-object relationships, assuming the same underlying Concept Extraction and Object Identification? If a general-purpose VLM can achieve similar or even superior results with lower development and inference costs, the necessity of specially designing and training SGClip would be severely challenged.

Limitations

Yes. The authors acknowledge limitations such as latency and reliance on 2D inputs in the conclusion section. This discussion is adequate. However, I believe they have not sufficiently addressed the core weakness I raised above, namely, "the necessity of SGClip compared to a simpler baseline using a powerful VLM to conclude spatial relationship." My question aims to prompt the authors to reflect on these deeper design choices and their implications.

Formatting Issues

None

Author Response

kXSe-Q1. Necessity of the SGClip: Could the authors provide a direct comparative analysis of SGClip's accuracy, robustness, and inference efficiency in relationship prediction against directly prompting an advanced VLM (e.g., GPT-4o) to describe inter-object relationships, assuming the same underlying Concept Extraction and Object Identification?

To clarify SGCLIP's critical contribution within the ESCA framework, we conduct a systematic ablation study comparing SGCLIP against (a) direct scene graph generation using MLLMs and (b) alternative grounding methods such as Grounding DINO (GD). We also provide qualitative case studies & failure analysis on the above models, illustrating with examples how ESCA boosts perception and embodied planning.

(a) SGCLIP vs. Direct MLLM Scene Graph Generation

We evaluate against CoT-SG, a direct baseline where scene graphs are generated through a Chain-of-Thought MLLM pipeline. This approach prompts the MLLM to produce structured scene graph representations directly, without grounding.

EB-Navigation Results: ESCA consistently outperforms CoT-SG across all tested models. The Chain-of-Thought strategy actually degrades performance, particularly on long-horizon tasks where target objects are not initially visible. CoT-SG often fails to detect target objects after performing a few rotation actions to scan its surroundings, and it exhibits more hallucinations than the raw MLLM.

| Model | Raw Model | w/ GD | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 47.33% | 47.67% | 51.66% | 46.33% |
| Gemini-2.0-flash (Mar 2025) | 40.68% | 40.53% | 42.00% | 26.33% |
| Qwen2.5-VL-72B-Ins | 44.99% | 48.27% | 49.33% | 33.99% |
| GPT-4o | 51.33% | 53.66% | 54.67% | 44.00% |

EB-Manipulation Results: While CoT-SG shows modest improvements over raw models, ESCA maintains superior performance across all tested MLLMs, demonstrating the effectiveness of structured grounding over direct generation approaches.

| Model | Raw Model | w/ YOLO | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 19.31% | 19.30% | 24.30% | 20.00% |
| Gemini-2.0-flash (Mar 2025) | 11.81% | 16.54% | 21.94% | 17.60% |
| Qwen2.5-VL-72B-Ins | 4.72% | 13.34% | 19.04% | 17.33% |
| GPT-4o | 23.47% | 28.48% | 34.44% | 30.43% |

(b) SGCLIP vs. Grounding DINO

Results in Figure 5 demonstrate that the full ESCA pipeline significantly outperforms the MLLM + Grounding DINO (GD) baseline across all tasks. In that experiment, we ablate the object identification component by comparing two variants: one using Grounding DINO (GD) alone, and the other using the full ESCA pipeline, which integrates GD for object identification and SGCLIP for scene graph generation. This performance gain indicates that SGCLIP's structured, probabilistic scene graph predictions provide more effective grounding for embodied planning compared to GD's object detection alone.

As detailed in Appendix C.1 and C.2, both configurations share the same concept extraction interface and prompting strategy; the only difference lies in the transfer protocol and grounding mechanism.

(c) Qualitative Analysis of ESCA vs. Raw MLLM vs. MLLM + GD

We demonstrate the superior scene graph representation of ESCA via case studies and error decomposition analysis, which illustrate that ESCA achieves SOTA performance on the benchmarks by generating higher-quality scene graphs & reducing perception error, particularly in identifying hard-to-see objects and reducing hallucinations.

Case Studies on EB-Navigation & EB-Manipulation: We kindly refer reviewers to Figures 14 and 16 in the supplementary material, Appendix C, for a detailed qualitative analysis on EB-Navigation and EB-Manipulation. For instance, Figure 14 shows scene graphs generated using the raw MLLM (GPT-4o), MLLM+GD, and ESCA. The task in Figure 14 is "navigate to a pillow": the raw MLLM does not recognize any pillow, MLLM+GD misclassifies a shadow as the pillow, while ESCA correctly recognizes the pillow and navigates to it.

ESCA Reduces Perception Error: We highlight the error decomposition analysis of GPT-4o with and without ESCA in Figure 17, which shows that ESCA significantly reduces perception errors among all errors, from 40% to 24%, highlighting its effectiveness in improving grounded visual understanding.

Error Analysis of Scene Graphs Generated via CoT-SG. In addition to Figure 17, we conducted an error analysis by sampling 10 failure cases from EB-Navigation tasks, where scene graphs were generated using GPT-4o with a Chain-of-Thought (CoT-SG) approach. The failure modes were categorized as follows:

  • Failure Type 1: Failure to detect the target object (6/10 cases)
  • Failure Type 2: Hallucination of non-existent objects (3/10 cases)
  • Failure Type 3: Incorrect object identification (e.g., misclassifying a red object as lettuce) (1/10 case)

All of these failure types stem from perception errors, whereas in our ESCA framework, only 24% of failures are attributed to perception issues, as shown in Figure 17. The discrepancy highlights a key limitation: despite leveraging a powerful MLLM, the CoT-SG method lacks sufficient grounding to reliably detect objects present in the visual scene. Further, as more entities are described without corresponding visual grounding, the planning context becomes diluted. This often leads to suboptimal action sequences and ultimately, task failure.

Comment

Dear reviewer kXSe,

We hope you’ve had a chance to review our rebuttal. We’d be happy to clarify any remaining questions or concerns you may have. Your feedback would be valuable in helping us improve the paper.

Thank you again for your time and thoughtful review.

Best regards, The Authors

Comment

I appreciate the additional experiments comparing against the CoT-SG baseline. This clarifies the contribution. While my general concern about the model's compositional limits stands, you've provided good evidence that the gains from improved grounding in your pipeline are significant for this application. I will keep my score.

Comment

We sincerely thank the reviewer for their insightful feedback and take this opportunity to further clarify the discussion on compositional limitations.

Addressing Limitations of CLIP-based Models
We agree that vanilla CLIP and its variants often underperform on fine-grained compositional and spatial reasoning tasks (e.g., distinguishing “A is to the left of B” vs. “B is to the left of A”). This is precisely why our training approach for SGCLIP was designed to mitigate such weaknesses. Following LASER [1], we incorporate positional encoding during training via distinct colored masks marking subject and object entities. This positional encoding guides the model toward precise subject–object localization, mitigating spatial ambiguity.
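A minimal sketch of the colored-mask positional cue described above: the subject and object boxes are overlaid with distinct translucent colors before the frame is encoded, so that "A left of B" and "B left of A" yield different inputs. The box coordinates, colors, and blending weights are illustrative assumptions, not the exact LASER/SGCLIP recipe.

```python
# Illustrative sketch of the colored-mask positional cue (LASER-style idea).
# Box coordinates, colors, and alpha are assumptions, not the authors' exact recipe.
from PIL import Image, ImageDraw

def mark_pair(frame: Image.Image, subj_box, obj_box, alpha=0.4) -> Image.Image:
    """Overlay a red mask on the subject box and a blue mask on the object box."""
    overlay = Image.new("RGBA", frame.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)
    draw.rectangle(subj_box, fill=(255, 0, 0, int(255 * alpha)))  # subject: red
    draw.rectangle(obj_box, fill=(0, 0, 255, int(255 * alpha)))   # object: blue
    return Image.alpha_composite(frame.convert("RGBA"), overlay).convert("RGB")

# Usage: the marked frame (rather than the raw frame) is what the image encoder
# sees, giving the model an explicit cue for subject-object localization.
frame = Image.new("RGB", (640, 480), color=(200, 200, 200))  # stand-in for a video frame
marked = mark_pair(frame, subj_box=(50, 100, 200, 300), obj_box=(350, 120, 520, 320))
marked.save("marked_pair.png")
```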

As shown in Appendix Table 2 and in our two additional evaluations, ESCA shows consistent improvements in downstream embodied-agent tasks that demand spatial reasoning:

| Task & Subset | Model | Baseline | +SGCLIP | Δ Improvement |
| --- | --- | --- | --- | --- |
| EB-Manipulation (spatial-split) | InternVL-2.5-38B-MPO | 12.50 | 27.08 | +14.58 |
| EB-Manipulation (spatial-split) | Gemini-2.0-Flash | 18.75 | 27.08 | +8.33 |
| EB-Manipulation (spatial-split) | Qwen2.5-VL-72B-Ins | 2.78 | 10.42 | +7.64 |
| EB-Manipulation (spatial-split) | GPT-4o | 19.44 | 31.25 | +11.81 |
| EB-Habitat (spatial-split) | InternVL-2.5-38B-MPO | 30.00 | 34.00 | +4.00 |
| EB-Habitat (spatial-split) | Gemini-2.0-Flash | 20.04 | 22.00 | +1.96 |
| EB-Habitat (spatial-split) | Qwen2.5-VL-72B-Ins | 24.00 | 36.00 | +12.00 |
| EB-Alfred (spatial-split) | InternVL-2.5-38B-MPO | 20.00 | 24.00 | +4.00 |
| EB-Alfred (spatial-split) | Gemini-2.0-Flash | 42.00 | 44.00 | +2.00 |
| EB-Alfred (spatial-split) | Qwen2.5-VL-72B-Ins | 34.00 | 37.00 | +3.00 |

These results collectively show that while CLIP alone struggles with complex spatial and compositional relationships, our targeted training strategy, combined with positional encoding and dataset design, substantially enhances its performance in embodied-agent settings requiring this type of reasoning. Thus, SGCLIP is not vanilla CLIP but a model designed specifically for relational grounding.

[1] Huang, Jiani, et al. “Laser: A neuro-symbolic framework for learning spatial-temporal scene graphs with weak supervision,” in The Thirteenth International Conference on Learning Representations (ICLR), 2025.

Review (Rating: 4)

The paper proposes the ESCA framework to provide MLLM-based agents with structured visual descriptions, scene graphs, which aids in their reasoning and planning. The core contribution is SGClip, a model trained using a self-supervised neurosymbolic pipeline that leverages CLIP to generate scene graphs without requiring human annotations. Experiments on two environments in EmbodiedBench demonstrate that the ESCA framework improves agent performance. The authors also provide code to ensure reproducibility.

Strengths and Weaknesses

Strengths:

  1. Novel and Scalable Self-Supervised Training Paradigm: The paper's key innovation is a scalable neurosymbolic pipeline that trains a scene graph model from video-caption pairs without human annotation. This elegantly solves a major data bottleneck in the field.
  2. Versatile and Generalizable SGClip: SGClip demonstrates strong performance not only within the ESCA framework but also in zero-shot and downstream tasks.
  3. Comprehensive Experimental Validation: The framework's effectiveness is rigorously tested on a diverse set of both open-source and proprietary MLLMs in multiple embodied tasks. The clear reduction in perception errors provides strong evidence for its utility.
  4. Community Contribution: The commitment to release the ESCA-Video-87K dataset and codebase is a valuable contribution that fosters reproducibility and future research.

Weaknesses:

  1. Limited "Open-Domain" Capability in Practice: The paper claims SGClip is an open-domain model, but its application within the ESCA framework appears constrained. The candidate sets for attributes and relations seem to be pre-defined, and concept extraction relies on environment-specific textual feedback. This setup limits the method's generality and raises questions about its direct applicability to real-world scenarios that lack such structured feedback.
  2. Lack of Deeper Result Analysis and Case Studies: The paper reports a more significant performance gain on EB-Manipulation than on EB-Navigation but provides no explanation for this discrepancy. Furthermore, all qualitative examples and visualizations are from the navigation task. Including case studies for the manipulation task would be crucial for a better understanding of how ESCA functions in more complex, interaction-rich scenarios.
  3. Insufficient Ablation and A Critical Missing Baseline:
    • Unclear Contribution of SGClip: The ESCA pipeline heavily involves MLLMs in multiple stages (concept extraction, summarization), making it difficult to isolate the precise impact of the SGClip module itself versus the overall structured prompting.
    • Missing a Stronger Baseline: The training of SGClip can be seen as a form of knowledge distillation. A critical baseline is absent: using a powerful MLLM to directly generate the scene graph zero-shot, aided by grounding from visual modules like GD+SAM2. Comparing against this simpler, distillation-free approach is essential to justify the complexity and necessity of training a specialized SGClip model.
  4. Ambiguous Error Decomposition: The error analysis lacks clarity. The paper does not provide formal definitions for its error categories; moreover, most failures are compounded, and some errors likely occur even in successful trajectories.
  5. Inadequate Engagement with Relevant Literature: The paper's positioning within the field could be strengthened by a more thorough discussion of related work. It overlooks several highly relevant prior works that also address grounding and planning for LLM agents in similar AI2-THOR environments, such as LLM-Planner[1], MART[2], and P-RAG[3]. A comparative discussion with these methods would provide a clearer context for ESCA's specific contributions and advantages over existing state-of-the-art approaches.

[1] C. H. Song, J. Wu, C. Washington, et al., "LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2998-3009.

[2] J. Yue, X. Xu, B. F. Karlsson, and Z. Lu, "MLLM-as-Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents," in The Thirteenth International Conference on Learning Representations (ICLR), 2025. [Online]. Available: https://openreview.net/forum?id=K5yeB4dTtS

[3] W. Xu, M. Wang, W. Zhou, et al., "P-RAG: Progressive Retrieval Augmented Generation for Planning on Embodied Everyday Task," in Proceedings of the 32nd ACM International Conference on Multimedia (ACM MM), 2024, pp. 6969-6978.

Questions

See above.

Limitations

Yes.

Section 7 (Conclusion and Limitations) provides a clear, direct discussion of primary limitations including latency, dependence on 2D visual inputs, and lack of formal verification during execution. The impact of synthetic supervision is not fully detailed and could be addressed more explicitly.

Justification for Final Rating

After the discussion with the authors during the rebuttal, my main concerns have been solved.

Formatting Issues

No

Author Response

szsY-Q1. "Open-Domain" Capability in Practice: The candidate sets for attributes and relations seem to be pre-defined, and concept extraction relies on environment-specific textual feedback.

We respectfully clarify that ESCA's attributes and relations are generated dynamically on the fly, not pre-defined. EmbodiedBench itself is an open-domain SGG benchmark, and our evaluation demonstrates this capability through comparison with other open-domain methods (Grounding DINO and CoT-SG). For a comprehensive analysis of "open domain" capability, please refer to our detailed response in SqXB-Q2.

szsY-Q2. Deeper Result Analysis and Case Studies: Can you provide a deeper analysis to explain when and why ESCA performs better?

The impact of ESCA varies depending on the visual demands and action structure of the task. In general, tasks that involve more detailed spatial reasoning or complex object interactions tend to benefit more.

Task characteristics

| Task | Visual Understanding | Action Granularity |
| --- | --- | --- |
| EB-Navigation | Medium: pixel-wise object localization | Fine-grained: continuous movement (e.g., "move forward by X") |
| EB-Manipulation | Low: image-to-3D coordinate mapping | Fine-grained: 7-tuple action specifying position |
| EB-Habitat | Medium: layout reasoning + object access | High-level: predefined discrete actions |
| EB-Alfred | High: composite object-task understanding | High-level: abstract actions (e.g., "find X", "pick up Y") |

We illustrate this by further conducting experiments on two high-level embodied planning benchmarks: EB-Habitat and EB-Alfred. ESCA showed the smallest performance gain on EB-Alfred, which we attribute to its abstract action space and lower reliance on fine-grained visual grounding.

EB-Habitat performance (NEW)

| Model | Raw Model | w/ GD | w/ GD + ESCA |
| --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 48.00% | 51.67% | 55.00% |
| Gemini-2.0-flash (Mar 2025) | 30.04% | 28.80% | 33.86% |
| Qwen2.5-VL-72B-Ins | 33.33% | 49.50% | 56.33% |

EB-Alfred performance (NEW)

| Model | Raw Model | w/ GD | w/ GD + ESCA |
| --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 26.67% | 34.64% | 38.50% |
| Gemini-2.0-flash (Mar 2025) | 53.00% | 53.67% | 54.67% |
| Qwen2.5-VL-72B-Ins | 39.67% | 44.67% | 42.00% |

szsY-Q3. Can you include qualitative examples or case studies from the manipulation task to better illustrate how ESCA operates in the manipulation task?

For a detailed qualitative analysis, we kindly refer reviewers to the full paper in the supplementary material, Appendix C, which provides fine-grained examples of intermediate scene graphs generated using different methods, including raw MLLM, Grounding DINO, and ESCA (see Figures 14 and 16). These examples demonstrate that ESCA improves scene graph accuracy in both EB-Navigation and EB-Manipulation environments.

Additionally, in Figure 17, we highlight the error decomposition analysis of GPT-4o with and without ESCA, on a manual inspection of 50 EB tasks. ESCA significantly reduces perception errors, from 40% to 24%, highlighting its effectiveness in improving grounded visual understanding.

szsY-Q4. Unclear Contribution of SGClip… A critical baseline is absent: using a powerful MLLM to directly generate the scene graph zero-shot... Comparing (ESCA) against this simpler, distillation-free approach is essential to justify the complexity and necessity.

We appreciate the reviewer's valuable feedback and we evaluate ESCA against CoT-SG, a direct baseline where scene graphs are generated through a Chain-of-Thought MLLM pipeline. This approach prompts the MLLM to produce structured scene graph representations directly, without grounding. Further studies are detailed in kXSe-Q1.

EB-Navigation Results: ESCA consistently outperforms CoT-SG across all tested models. The Chain-of-Thought strategy actually degrades performance, particularly on long-horizon tasks where target objects are not initially visible. CoT-SG often fails to detect target objects after performing a few rotation actions to scan its surroundings, and it exhibits more hallucinations than the raw MLLM.

| Model | Raw Model | w/ GD | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 47.33% | 47.67% | 51.66% | 46.33% |
| Gemini-2.0-flash (Mar 2025) | 40.68% | 40.53% | 42.00% | 26.33% |
| Qwen2.5-VL-72B-Ins | 44.99% | 48.27% | 49.33% | 33.99% |
| GPT-4o | 51.33% | 53.66% | 54.67% | 44.00% |

EB-Manipulation Results: While CoT-SG shows modest improvements over raw models, ESCA maintains superior performance across all tested MLLMs, demonstrating the effectiveness of structured grounding over direct generation approaches.

| Model | Raw Model | w/ YOLO | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 19.31% | 19.30% | 24.30% | 20.00% |
| Gemini-2.0-flash (Mar 2025) | 11.81% | 16.54% | 21.94% | 17.60% |
| Qwen2.5-VL-72B-Ins | 4.72% | 13.34% | 19.04% | 17.33% |
| GPT-4o | 23.47% | 28.48% | 34.44% | 30.43% |

szsY-Q5. Error Decomposition: Can you clarify how error types are defined and disambiguated in your analysis, especially given that failures may be compounded and minor errors may occur even in successful trajectories?

We appreciate the reviewer's concern and would like to clarify that formal definitions of the error categories are provided in Footnote 1 on Page 8 of the paper, quoted below:

“The three top-level error types are Perception, Reasoning, and Planning. The second-level categories include Hallucination, Wrong Recognition, Spatial Understanding, Spatial Reasoning, Reflection Error, Inaccurate Action, and Collision. For clarity, acronyms are used in the figure to label these error types.”

While eventual failures may involve compounded errors, we classify each failure case based on the first identifiable error in the execution trace to avoid ambiguity. Specifically, errors typically follow a cascade, where early perception or reasoning mistakes lead to downstream planning failures. For instance, incorrect spatial understanding may lead to collisions, and hallucinations can result in inaccurate actions.
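A minimal sketch of this first-identifiable-error labeling rule, using the category names from the footnote quoted above; the trace format and helper function are illustrative assumptions, not the actual analysis script.

```python
# Minimal sketch of the labeling rule described above: a failed episode is
# assigned the FIRST identifiable error in its execution trace. The trace
# format (a list of per-step annotations) is an assumption for illustration.
TOP_LEVEL = {
    "Hallucination": "Perception",
    "Wrong Recognition": "Perception",
    "Spatial Understanding": "Perception",
    "Spatial Reasoning": "Reasoning",
    "Reflection Error": "Reasoning",
    "Inaccurate Action": "Planning",
    "Collision": "Planning",
}

def classify_failure(trace):
    """Return (top-level, second-level) of the first annotated error, or None."""
    for step in trace:
        for err in step.get("errors", []):
            if err in TOP_LEVEL:
                return TOP_LEVEL[err], err
    return None

# Example: a hallucinated object at step 2 precedes the later collision,
# so the episode is counted as a Perception failure, not a Planning one.
trace = [
    {"action": "rotate_left", "errors": []},
    {"action": "move_forward", "errors": ["Hallucination"]},
    {"action": "move_forward", "errors": ["Collision"]},
]
print(classify_failure(trace))  # ('Perception', 'Hallucination')
```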

As for errors that occur during successful trajectories, they are often corrected by subsequent planning steps. For example, the model may initially fail to find the target object due to distance, but successfully locates it after exploring further. Given that these recoverable deviations do not lead to task failure, we focus our error analysis on unrecoverable failures only.

szsY-Q6. Inadequate Engagement with Relevant Literature: LLM-Planner, MART, and P-RAG.

We thank the reviewer for pointing us to this highly relevant literature; we will include the following discussion in the related work section. While these methods and ESCA both improve embodied agents, they target distinct pipeline components and are complementary: LLM-Planner focuses on hierarchical plan generation with ALFRED-specific grounding, while ESCA provides general-purpose, open-vocabulary grounding via scene graphs across diverse tasks. MART enhances planning through retrieval of past successful trajectories, P-RAG introduces progressive retrieval-augmented planning, while ESCA enhances the perception module with structured visual grounding.

Comment

Dear reviewer szsY,

We hope you’ve had a chance to review our rebuttal. We’d be happy to clarify any remaining questions or concerns you may have. Your feedback would be valuable in helping us improve the paper.

Thank you again for your time and thoughtful review.

Best regards, The Authors

Comment

Thank you for your response. Your reply has addressed my concerns, and I will increase my rating accordingly.

Review (Rating: 4)

The paper integrates scene graph descriptions with agents to describe object relationships, providing structured alignment between visual contexts and textual semantics. It proposes a CLIP-based, open-domain, and promptable model to generate scene graphs, and an Embodied and Scene-Graph Contextualized Agent to support MLLM-based embodied agents. A novel ESCA-Video-87K dataset is proposed, including object traces, open-domain concepts, and programmatic specifications for 87K video-caption pairs.

Strengths and Weaknesses

Strengths:

  1. Introducing scene graph descriptions into agents utilises spatial and semantic contexts in visual and language information, therefore facilitating the fine-grained operations.
  2. The new ESCA-Video-87K dataset with video-caption pairs contributes to exploring open-domain scene graph descriptions.
  3. The embodied agent pipeline augmented with the proposed ESCA provides a solution for accurate agent perception.

Weaknesses:

  1. The paper has limited novel content for embodied agent or scene graph generation. For embodied agent construction, the only difference is introducing scene graphs, which is straightforward and cannot capture the outstanding part. For scene graph generation, the proposed SGClip is only a fine-tuned CLIP with the proposed dataset.
  2. The paper stresses the superiority of open-domain perceptions but lacks more illustrations and experimental analysis. It only performs the open-domain comparisons with closed-set SGG datasets, missing the comparison with other open-domain SGG methods.
  3. How SGG is integrated with the agent is not well described. The relationships in SGG are not shown qualitatively, especially in Fig. 1. The authors should clarify how spatial and temporal relationships are integrated into the agent.

Questions

Please refer to the weaknesses.

Limitations

Yes.

Justification for Final Rating

Thank you for the long response and illustrations. The authors have carefully addressed my concerns. After reading all the reviews and responses, I would raise my rating to borderline accept.

Formatting Issues

No.

Author Response

SqXB-Q1. Novelty: the only difference is introducing scene graphs, which is straightforward and cannot capture the outstanding part … the proposed SGClip is only a fine-tuned CLIP with the proposed dataset.

We clarify that introducing scene graphs for embodied agents is non-trivial, leading to our novel contributions of (a) selective grounding, (b) the design of a transfer protocol, and (c) the fine-tuned SGClip model and its training dataset ESCA-Video-87K, which realize the above goals.

Selective Grounding. Rather than injecting full scene graphs indiscriminately, which, as our CoT-SG experiment shows, degrades performance, we present only the most relevant information to guide the MLLM in planning. Specifically, we use the MLLM to identify the objects, attributes, and relationships that are most pertinent to the instruction, and then determine which key objects are essential for task completion with symbolic execution. For navigation, this is the object the agent must locate; for manipulation, it is the precise (x, y, z) position mapping to the desired object feature.

Design of Transfer Protocol. We define the transfer protocol as the process of leveraging concepts extracted by the MLLM to construct prompts enriched with scene-specific information. The protocol filters and aggregates relevant scene elements tailored to the agent’s goal. For this purpose, it performs probabilistic reasoning over object names, spatial relationships, and attributes to identify the top-k most likely target objects. This fine-grained inference is enabled by our probabilistic scene graph design, which encodes SGCLIP predicted likelihood. This mechanism, detailed in Appendix C, is a technical contribution that goes beyond simply plugging in a scene graph module.

SGClip and ESCA-Video-87K. Regarding the novelty of SGCLIP, we emphasize that the contribution lies not in architectural innovation but in the data collection and training strategy. SGCLIP is trained on a large-scale, weakly-supervised video dataset generated entirely via multimodal LLM prompting, described in Appendix A. Our approach overcomes the scarcity of manually labeled scene graph annotations in video by enabling scalable, diverse, and compositional weak supervision at scale, requiring only raw video as input. SGCLIP thus represents a practical and generalizable solution for scene graph grounding in open-world tasks.

To the best of our knowledge, ESCA-Video-87K is the largest automatically labeled video scene graph dataset featuring rich object semantics, relations, and actions. Most other annotated SG datasets are much smaller. For comparison:

  • Ego4D-EASG includes 221 long videos annotated with 407 object classes, 219 verbs (action types), and 16 spatial/prepositional relation types.
  • OpenPVSG contains 400 videos, covering 126 object classes and 57 relation types.
  • BDD-100K, the largest densely labeled video dataset, focuses solely on driving scenarios. While it offers 100K videos with object trajectories, it includes only 10 object classes and lacks relation and action annotations.

In contrast, ESCA-Video-87K features automatic annotations of diverse object names, inter-object relations, and object actions across 87K videos. Human-labeled fine-grained datasets like OpenPVSG are limited in scale due to the complexity of dense annotation, highlighting the scalability advantage of our weak-supervision pipeline.

SqXB-Q2. How does ESCA compare against open-domain SGG methods? and on open-set SGG datasets?

EmbodiedBench is an open-domain SGG benchmark

A key strength of the ESCA framework is its ability to handle open-domain natural language instructions, which often contain free-form and compositional descriptions not present in the simulated environments. For example, a banana might be described as a "curved yellow fruit" and a loaf of bread as a "freshly baked baguette."

This open-domain capability is clearly demonstrated in our quantitative analysis of the EB-Navigation dataset. Using GPT-4o as the concept extractor, our pipeline generates an average of 5.57 free-form keywords per instruction, resulting in 84 unique target object descriptions across the dataset. This diversity significantly exceeds the 26 canonical object names defined in the ground-truth vocabulary. Beyond target objects, the process also produces 199 related objects, 296 relationships, and 117 attributes as keywords. This expanded vocabulary demonstrates how SGCLIP captures rich, unconstrained concept expressions that arise from both natural language variability in instructions (e.g., "curved yellow fruit" instead of "banana") and observations derived from the visual scene.

To further demonstrate generalizability across open-domain SGG datasets, we report performance results on EB-Alfred and EB-Habitat as follows:

EB-Habitat performance (NEW)

| Model | Raw Model | w/ GD | w/ GD + ESCA |
| --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 48.00% | 51.67% | 55.00% |
| Gemini-2.0-flash (Mar 2025) | 30.04% | 28.80% | 33.86% |
| Qwen2.5-VL-72B-Ins | 33.33% | 49.50% | 56.33% |

EB-Alfred performance (NEW)

| Model | Raw Model | w/ GD | w/ GD + ESCA |
| --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 26.67% | 34.64% | 38.50% |
| Gemini-2.0-flash (Mar 2025) | 53.00% | 53.67% | 54.67% |
| Qwen2.5-VL-72B-Ins | 39.67% | 44.67% | 42.00% |

Grounding DINO and CoT-SG are two open-domain SGG methods

To empirically validate ESCA's open-domain capability, we compare against two baseline approaches. First, as reported in our main paper, we evaluate against a baseline that omits SGCLIP and relies solely on Grounding DINO for open-world object grounding (MLLM + GD). We further evaluate against CoT-SG, a Chain-of-Thought approach that prompts MLLMs to generate structured scene graph representations directly without any visual grounding. This method represents a direct generation baseline where the model produces scene graphs purely through reasoning rather than structured visual-textual alignment.

EB-Navigation Results: ESCA consistently outperforms CoT-SG across all tested models. The Chain-of-Thought strategy actually degrades performance, particularly on long-horizon tasks where target objects are not initially visible. CoT-SG often fails to detect target objects after performing a few rotation actions to scan its surroundings, and it exhibits more hallucinations than the raw MLLM.

| Model | Raw Model | w/ GD | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 47.33% | 47.67% | 51.66% | 46.33% |
| Gemini-2.0-flash (Mar 2025) | 40.68% | 40.53% | 42.00% | 26.33% |
| Qwen2.5-VL-72B-Ins | 44.99% | 48.27% | 49.33% | 33.99% |
| GPT-4o | 51.33% | 53.66% | 54.67% | 44.00% |

EB-Manipulation Results: While CoT-SG shows modest improvements over raw models, ESCA maintains superior performance across all tested MLLMs, demonstrating the effectiveness of structured grounding over direct generation approaches.

| Model | Raw Model | w/ YOLO | w/ ESCA | CoT-SG (NEW) |
| --- | --- | --- | --- | --- |
| Intern-VL-2_5-38B-MPO | 19.31% | 19.30% | 24.30% | 20.00% |
| Gemini-2.0-flash (Mar 2025) | 11.81% | 16.54% | 21.94% | 17.60% |
| Qwen2.5-VL-72B-Ins | 4.72% | 13.34% | 19.04% | 17.33% |
| GPT-4o | 23.47% | 28.48% | 34.44% | 30.43% |

SqXB-Q3. how to integrate SGG with the agent and how spatial and temporal relationships are integrated into the agent.

We illustrate how the ESCA framework integrates scene graphs through the transfer protocol for EB-navigation, as described in Appendix C. We will provide more information in a revised version.

Concrete Example - EB-Navigation Task: This protocol combines the object name likelihood, attribute likelihood, and spatial relation likelihood to compute the probability of each object being the target. Consider the instruction "navigate to the loaf of bread that is on the counter" and three objects A, B, and C. Our protocol processes this as follows:

  • Object Recognition: A and B recognized as "loaf of bread" (0.8 likelihood) and "counter" (0.2). C recognized as "counter" (0.7) and "bread" (0.3).
  • Spatial Relations: SGCLIP predicts A is "above" C (0.9), B is "above" C (0.1)
  • Probabilistic Aggregation: We compute the aggregated likelihood using our three-factor formula: Target Likelihood = Object Confidence × Spatial Relation Confidence × Landmark Confidence. The target likelihood for A is 0.8 (bread confidence) × 0.9 (spatial relation) × 0.7 (counter confidence) = 0.504; likewise, for B it is 0.8 × 0.1 × 0.7 = 0.056. This calculation enables the agent to prioritize objects that not only match in name but also satisfy the instructed spatial context.

From Scene Graph to Action: We pass the top-k targets downstream. Assuming k = 1, Object A becomes the primary navigation target, with its bounding box coordinates and confidence score (0.504) passed to the motion planner. This probabilistic grounding enables robust decision-making under perceptual uncertainty. Though evaluated only on EB in static environments, our framework could generalize to dynamic environments with temporal relations. For "watch the person cook pasta", scene graphs would track action sequences with temporal edges, aggregating probabilities across time.
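A minimal sketch of the three-factor aggregation and top-k selection from the worked example above; the dictionaries and helper function are illustrative assumptions, not the actual ESCA transfer-protocol code, and the scores are copied from the example.

```python
# Minimal sketch of the three-factor aggregation from the worked example above.
# Data structures and the helper are illustrative, not the actual ESCA code.
bread_conf   = {"A": 0.8, "B": 0.8, "C": 0.3}      # P(object is "loaf of bread")
counter_conf = {"A": 0.2, "B": 0.2, "C": 0.7}      # P(object is "counter")
above_conf   = {("A", "C"): 0.9, ("B", "C"): 0.1}  # SGCLIP likelihood of "above"

def target_likelihood(obj, landmark):
    """Object confidence x spatial-relation confidence x landmark confidence."""
    return (bread_conf[obj]
            * above_conf.get((obj, landmark), 0.0)
            * counter_conf[landmark])

scores = {obj: target_likelihood(obj, lm) for (obj, lm) in above_conf}
top_k = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:1]
print(scores)  # {'A': ~0.504, 'B': ~0.056}, matching the worked example
print(top_k)   # [('A', ~0.504)]: A's bounding box is handed to the motion planner
```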

Comment

Dear reviewer SqXB,

We hope you’ve had a chance to review our rebuttal. We’d be happy to clarify any remaining questions or concerns you may have. Your feedback would be valuable in helping us improve the paper.

Thank you again for your time and thoughtful review.

Best regards, The Authors

Comment

Dear Reviewer SqXB,

We appreciate the time and effort you dedicated to your initial review and hope you have had a chance to read our rebuttal and follow-up clarifications. We hope these have addressed the concerns you raised.

With less than 12 hours remaining in the discussion period, we would be glad to clarify any remaining questions you may have. Your feedback at this stage would be greatly valued.

Thank you again for your thoughtful review.

Best regards, The Authors

Comment

Dear Reviewers,

We would like to express our sincere gratitude for your insightful feedback and constructive suggestions. Your comments have been invaluable in strengthening our submission and clarifying key technical contributions. We kindly invite you to review our comprehensive rebuttal and engage in further discussion to address any remaining questions or concerns. In summary, the following key additions and clarifications have been made:

  • We evaluate against CoT-SG, a direct MLLM scene graph generation baseline, demonstrating that ESCA consistently outperforms direct generation approaches across all tested models and tasks. (R-kXSe, R-Q5jV, R-szsY, s-SqXB)
  • We demonstrate ESCA's open-domain capability through comprehensive evaluation on EB-Habitat and EB-Alfred benchmarks, showing consistent improvements across diverse embodied tasks. (R-SqXB, R-szsY)
  • We provide detailed qualitative analysis and case studies for both navigation and manipulation tasks, with concrete examples showing how ESCA reduces perception errors from 40% to 24%. (R-szsY, R-kXSe)
  • We clarify ESCA's integration mechanism with concrete examples of the transfer protocol, showing how probabilistic aggregation enables robust target selection under perceptual uncertainty. (R-SqXB)
  • We clarify error category definitions and provide failure analysis demonstrating ESCA's superior grounding compared to hallucination-prone direct generation methods. (R-szsY)
  • We demonstrate computational efficiency with only 1.09 seconds (5%) overhead per planning step, showing excellent performance-cost tradeoffs. (R-Q5jV)
  • We address vocabulary scalability concerns by clarifying that inference uses only task-relevant keywords (avg. 5.57 per instruction) rather than the full training vocabulary. (R-Q5jV)
  • We engage with relevant literature including LLM-Planner, MART, and P-RAG, clarifying that ESCA is complementary to existing planning-focused approaches. (R-szsY)
  • We clarify that the contributions of the ESCA framework extend beyond the SGClip model, including selective grounding, transfer protocol design, and the scalable weak-supervision pipeline for ESCA-Video-87K. (R-SqXB, R-kXSe)

We hope these clarifications and extensive new experiments adequately address your concerns. We deeply appreciate your thorough reviews and look forward to any further discussion.

Final Decision

The paper integrates scene graphs with agents for embodied problems. Overall, the paper received uniformly positive reviews with 4 x Borderline Accepts. The outlined issues focused on: (1) limited novelty [SqXB], (2) potentially limited "open domain" capabilities [szsY], (3) lack of deeper result analysis [szsY], (4) missing stronger baselines [szsY], and (5) the necessity of SGClip [szsY, kXSe]. The authors addressed these concerns in the rebuttal, and reviewers agree that the most significant concerns have been resolved, with scores generally going up post-rebuttal.

AC has carefully considered the reviews, rebuttal and discussion that followed as well as the content of the paper itself. AC agrees that the problem domain is interesting and results are compelling. While pure technical novelty may indeed be somewhat limited, as [SqXB] suggests, paper truly excels in many other ways; including strong results on a number of SoTA models and potential impact (with new dataset and an interesting formulation) on the broader community that cares about embodied agents. As a result the AC is recommending Acceptance.