PaperHub
7.3/10
Poster · 4 reviewers
Ratings: 5, 6, 4, 3 (min 3, max 6, std 1.1)
Confidence: 3.3
Novelty: 2.5 · Quality: 2.8 · Clarity: 2.5 · Significance: 2.8
NeurIPS 2025

Synthesizing Photorealistic and Dynamic Urban Environments for Multimodal Robot Navigation and Collaboration

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Embodied Simulator, Multimodal Instruction Following, Robot Navigation, Multi-Robot Collaboration, Communication

Reviews and Discussion

Official Review (Rating: 5)

This paper introduces RoboScape, a new embodied AI simulation platform for large-scale, photorealistic, dynamic urban environments built on Unreal Engine. RoboScape supports procedurally generated cities with vehicles, pedestrians, and multiple embodied agents controlled asynchronously. They introduce RoboScape-20K, a dataset of 20,000 steps over 200 episodes in 100 procedurally generated cities. The authors also introduce two benchmarks: 1) ROBOSCAPE-MMNAV: Single-robot, multimodal (vision + language) navigation in city-scale settings with traffic and pedestrians. 2) ROBOSCAPE-MRS: Multi-robot collaborative search requiring communication. Experiments show that current vision-language models perform poorly on these tasks without fine-tuning, exposing critical gaps in instruction grounding, spatial reasoning, and planning. Fine-tuning improves performance but remains limited.

Strengths and Weaknesses

Strengths

  • An urban simulator possessing photorealism, procedural generation, agent diversity, and multi-agent control.
  • Robust engineering—procedural road/building/traffic generation, rich action/observation spaces, asynchronous multi-agent control.
  • The new MRS benchmark, multi-robot communication and search in large urban environments, is unique and meaningful.
  • RoboScape-20K provides large-scale training data with diverse city layouts.
  • Multiple VLM baselines (zero-shot and fine-tuned), realistic failure analyses.
  • Could become a standard evaluation platform for multimodal robot navigation research.

Weaknesses

  • Entirely simulation-based—no real-world transfer or deployment results.
  • Discrete robot action space despite claims of realism; continuous control not explored.
  • Training data relies on oracle traces; no learning of error correction or reinforcement learning shown.
  • Baselines are all VLM-based; no comparison to classical planners or hybrid methods.
  • Traffic is rule-based, whereas real-world traffic is much more complex.
  • Scalability claims not fully stress-tested (episodes still ~500m on average).

Questions

  • Do you plan to add real-world sensory noise or domain randomization to enable sim-to-real transfer?
  • Will you support continuous robot control, given that most controllers take continuous reference inputs?
  • Why were only VLM-based baselines included? Did you try classical planners?
  • What are the compute requirements for procedural generation and simulation?
  • Do you plan to release the full simulator and benchmark code for public use?

Limitations

  • The entire study is in a synthetic environment. There is no transfer to real robots or validation under real-world sensory noise. This limits immediate practical applicability.
  • Although the simulator is built for photorealistic, dynamic urban settings, the robot action space remains discretized (move/rotate in steps). Continuous low-level control is not explored.
  • The RoboScape-20K dataset uses expert trajectories without modeling errors or recovery. This limits robustness and generalization, as shown in the poor success rates even after fine-tuning.
  • All baselines are vision-language model (VLM)-based. There is no comparison to classical planning, imitation learning, or hybrid approaches that might perform better.
  • The traffic system is rule-based, and pedestrian behaviors remain relatively simple.
  • While the environments are procedurally generated at city scale (2 km²), the average navigation task length (500 m) is modest, and it’s unclear how well the simulation supports extremely long-horizon or many-agent scenarios in practice.

Final Justification

  • Resolved:

    • Sim-to-real considerations: Authors clarified plans to inject realistic sensor noise and designed RoboScape with sim-to-real transfer in mind.
    • Action space: The platform already supports continuous action spaces, and integration with ROS2 is possible.
    • Baselines: Additional experiments with RL and hybrid methods were provided. Results confirm the difficulty of tasks and demonstrate that RoboScape can benchmark future methods.
    • Compute requirements: Authors provided both minimum and recommended specs, showing that the platform is accessible on modest hardware.
    • Code release: Authors commit to releasing binaries, Gym wrapper, executables, and containers to ensure broad usability.
    • Scalability: Authors clarified that RoboScape supports procedural generation of larger urban environments and significantly longer episodes.
  • Better to have:

    • Performance of baselines: While RL and hybrid baselines were added, their relatively low performance highlights the challenge but leaves open how well current methods can adapt without heavy tuning.
    • Traffic simulation realism: Current rule-based traffic models are tailored to case studies; more sophisticated behavior remains future work.
    • Dataset diversity: Training data is currently expert-only; broader demonstrations (including noisy or human data) remain to be explored.
  • Weighting:

    • The rebuttal successfully addressed core feasibility concerns (action space, compute requirements, scalability, code release).
    • Remaining issues (traffic realism, dataset diversity) are secondary and can be addressed by the community once the platform is released.
    • The main contribution—a large-scale, photorealistic, extensible simulation platform with meaningful benchmarks—is strong and unique.

Overall: The rebuttal improved confidence in the paper’s impact. Despite some remaining limitations, the platform and benchmarks represent a valuable community resource. I recommend acceptance but keep my score.

Formatting Issues

None.

Author Response

We thank the reviewer for recognizing our contribution of “a new embodied AI simulation platform for large-scale, photorealistic, dynamic urban environments” and for describing our benchmarks as “very unique and meaningful”. We address your questions and concerns below.

Q1&W1&L1: Sim-to-real

Yes, we fully agree that enabling sim-to-real transfer is crucial for embodied AI research. Although our current experiments focus on simulation, RoboScape was designed with sim-to-real considerations in mind. We plan to inject realistic sensory noise, such as image blurring, depth corruption, and camera jitter.
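To make this concrete, below is a minimal sketch of how such noise injection could be layered on top of a Gym-style environment. The wrapper class, dictionary keys ("rgb", "depth"), and noise parameters are our own illustrative assumptions, not RoboScape’s actual interface.

```python
import numpy as np
import gymnasium as gym


class NoisyObservationWrapper(gym.ObservationWrapper):
    """Inject simple sensor noise into dict observations (hypothetical keys)."""

    def __init__(self, env, blur_kernel=3, depth_dropout=0.05, depth_sigma=0.02, seed=None):
        super().__init__(env)
        self.blur_kernel = blur_kernel
        self.depth_dropout = depth_dropout
        self.depth_sigma = depth_sigma
        self.rng = np.random.default_rng(seed)

    def observation(self, obs):
        obs = dict(obs)
        if "rgb" in obs:
            # Box blur as a stand-in for camera blur / jitter.
            obs["rgb"] = self._box_blur(obs["rgb"].astype(np.float32)).astype(np.uint8)
        if "depth" in obs:
            depth = obs["depth"].astype(np.float32)
            depth += self.rng.normal(0.0, self.depth_sigma, size=depth.shape)  # Gaussian noise
            mask = self.rng.random(depth.shape) < self.depth_dropout            # random dropout
            depth[mask] = 0.0
            obs["depth"] = depth
        return obs

    def _box_blur(self, img):
        # Separable box blur applied along height and width.
        k = self.blur_kernel
        kernel = np.ones(k, dtype=np.float32) / k
        blurred = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 0, img)
        blurred = np.apply_along_axis(lambda m: np.convolve(m, kernel, mode="same"), 1, blurred)
        return blurred

# Usage (hypothetical): env = NoisyObservationWrapper(make_roboscape_env())
```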

Q2&W2&L2: Continuous robot action space

Yes, we will. While the current benchmarks use a discrete action space for simplicity and clarity in the two case studies, RoboScape supports a continuous action space through velocity or acceleration commands. Additionally, RoboScape is built on Unreal Engine, which supports integration with ROS2, allowing users to implement low-level dynamics and control if needed.
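As a generic illustration of what a continuous velocity/yaw-rate command entails (independent of RoboScape’s actual call signature, which is not shown in the paper), the sketch below integrates such a command with a simple unicycle kinematics model at a fixed simulation tick.

```python
import numpy as np

def unicycle_step(state, cmd, dt=0.05):
    """Integrate a continuous velocity command for one fixed tick.

    state: (x, y, heading) in metres / radians.
    cmd:   (forward velocity m/s, yaw rate rad/s) -- the kind of continuous
           command described above; the actual RoboScape API may differ.
    """
    x, y, theta = state
    v, omega = cmd
    return np.array([x + v * np.cos(theta) * dt,
                     y + v * np.sin(theta) * dt,
                     theta + omega * dt])

state = np.zeros(3)
for _ in range(20):                      # one second at 20 Hz
    state = unicycle_step(state, cmd=(0.8, -0.15))
print(state)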

Q3&W4&L4: Selection of baselines

We did conduct preliminary experiments with standard RL-based baselines (e.g., PPO) and a hybrid baseline. However, these approaches performed poorly in our proposed case studies, likely due to the complexity of instruction-driven tasks and the need for long-horizon reasoning. Our focus is on developing a flexible and extensible simulation platform. While important, designing high-performing policies for these tasks would require significantly more sophisticated architectures and task-specific tuning beyond the current state of the art, which is out of scope and orthogonal to the primary contribution of our paper.

Having said that, we have conducted additional experiments during rebuttal to include several non-VLM baselines and will include them in the revised Table 2:

  • Hybrid Baseline: We have evaluated the performance of a hybrid planner where a VLM serves as the high‑level decision maker—deciding whether to continue straight down a street until the next intersection or to turn at that intersection—while A* is used as a low‑level path planner to generate and execute the movement commands. The subtask success rate is 32.53%.
  • RL Baseline: We utilize two NVIDIA L40S GPUs, each running two parallel instances of RoboScape. Following the setting of VLN-CE [1], the agent is powered by a multimodal policy that combines DeBERTa-v3 for instruction encoding and DINOv2 for visual perception. We first pretrain the policy through imitation learning (specifically, behavioral cloning) on expert demonstrations, and then further finetune it using PPO. Rollouts are collected concurrently across all environments. The current RL baseline achieves a 28.37% subtask success rate, highlighting the challenging nature of our tasks. We believe RoboScape provides a strong foundation for future research on more advanced RL methods, by offering parallelizable, photorealistic, large-scale, and dynamic city environments, along with the ability to collect and incorporate human demonstrations that naturally include errors and corrective behaviors.
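For readers unfamiliar with this kind of multimodal policy, the sketch below shows one plausible wiring of frozen DeBERTa-v3 and DINOv2 encoders into a small action head, as in the RL baseline described above. The specific checkpoints, hidden sizes, and number of discrete actions are assumptions for illustration; the authors’ actual architecture and training code are not described beyond the summary above.

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer, AutoImageProcessor

TEXT_CKPT = "microsoft/deberta-v3-base"   # assumed checkpoints; the rebuttal does not
VISION_CKPT = "facebook/dinov2-base"      # specify which model sizes were used


class NavPolicy(nn.Module):
    """Instruction + image -> logits over a discrete action set (illustrative only)."""

    def __init__(self, num_actions: int = 6):
        super().__init__()
        self.text_encoder = AutoModel.from_pretrained(TEXT_CKPT)
        self.vision_encoder = AutoModel.from_pretrained(VISION_CKPT)
        for p in self.text_encoder.parameters():
            p.requires_grad = False       # keep encoders frozen; only the head is trained
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        fused = self.text_encoder.config.hidden_size + self.vision_encoder.config.hidden_size
        self.head = nn.Sequential(nn.Linear(fused, 512), nn.ReLU(), nn.Linear(512, num_actions))

    def forward(self, text_inputs, pixel_values):
        txt = self.text_encoder(**text_inputs).last_hidden_state[:, 0]       # [CLS] token
        img = self.vision_encoder(pixel_values=pixel_values).last_hidden_state[:, 0]
        return self.head(torch.cat([txt, img], dim=-1))


tokenizer = AutoTokenizer.from_pretrained(TEXT_CKPT)
processor = AutoImageProcessor.from_pretrained(VISION_CKPT)
policy = NavPolicy()

text_inputs = tokenizer(["turn left at the next intersection"], return_tensors="pt")
dummy_frame = np.zeros((224, 224, 3), dtype=np.uint8)                         # placeholder image
pixel_values = processor(images=dummy_frame, return_tensors="pt").pixel_values
logits = policy(text_inputs, pixel_values)                                    # shape: (1, num_actions)
```

Such a policy head could then be pretrained with behavioral cloning on the expert demonstrations and finetuned with PPO, as the authors describe.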

Q4: Compute requirements

To ensure wide accessibility, RoboScape supports adjustable rendering resolutions, allowing deployment across both high-end servers and modest laptops.

Recommended Setup

  • CPU: Intel Core i7‑12700H or AMD Ryzen 9 5900HS
  • GPU: NVIDIA RTX 3070 or a GPU with more than 6 GB of VRAM
  • RAM: 32 GB

Minimum Setup (60 FPS for RoboScape)

  • CPU: Intel Core i7‑11300H or AMD Ryzen 9 4800H
  • GPU: NVIDIA RTX 2060 (notebook)
  • RAM: 16 GB

Q5: Release the full simulator and benchmark code

Yes, we plan to release the full simulator and benchmark code. Upon publication, we will open-source all binaries, the Gym wrapper, and provide both a Windows executable and a Singularity container for Ubuntu and macOS, allowing users to run RoboScape out of the box without compilation.
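Assuming the released Gym wrapper follows the standard Gymnasium episode loop, usage could look roughly like the following. The RoboScape entry point shown in the comment is hypothetical; a stock Gymnasium environment is used as a stand-in so the snippet runs as-is.

```python
import gymnasium as gym

# Hypothetical entry point -- the released wrapper's actual module and env ID may differ:
# env = roboscape.make("RoboScape-MMNav-v0", resolution=(720, 600))
env = gym.make("CartPole-v1")  # stand-in environment so the loop below runs as-is

obs, info = env.reset(seed=0)
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()          # replace with a policy / VLM agent
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"episode return: {total_reward:.1f}")
```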

W3&L3: RoboScape-20K Dataset

Indeed, the current training data only consists of expert (oracle) trajectories. However, we emphasize that RoboScape is fully compatible with reinforcement learning, which could potentially address this limitation, as shown in the additional experiments. Furthermore, RoboScape provides a human interface for collecting trajectories from human users, which naturally include suboptimal and error-correcting behaviors. This enables future work to explore learning from human corrections or noisy demonstrations.

W5&L5: Rule-based traffic simulation

All pedestrians and vehicles in RoboScape are independently controllable and support asynchronous control. By applying more sophisticated control policies to each pedestrian, RoboScape is capable of supporting highly realistic traffic simulations. Currently, the traffic system is tailored to support our two case studies, primarily to demonstrate RoboScape’s capabilities as a flexible and extensible platform.

W6&L6: Scalability

While the average episode length in our current experiments is around 500 meters, this setting already reveals significant performance degradation across baselines, highlighting the challenge even at this scale. Importantly, RoboScape is built with scalability in mind: it supports procedural generation of much larger urban environments and can accommodate significantly longer episodes. This design ensures that researchers can easily stress-test navigation agents in more complex, long-horizon tasks, which we plan to explore in future benchmarks.

References:

[1] Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

Comment

I thank the authors for the thoughtful rebuttal and for conducting new experiments. My concerns are resolved, and I look forward to seeing the open-source release!

Official Review (Rating: 6)

The paper presents RoboScape, a novel simulation platform for embodied AI in large-scale, photorealistic urban environments. The platform is built on Unreal Engine 5 and can procedurally generate unlimited photorealistic urban scenes with dynamic elements such as pedestrians and traffic systems. The authors also introduce two new benchmarks: ROBOSCAPE-MMNAV for multimodal robot navigation and ROBOSCAPE-MRS for multi-robot search tasks. The paper demonstrates that current state-of-the-art models, including vision-language models (VLMs), struggle with these tasks, highlighting the gap in current foundation models for urban robot tasks.

Strengths and Weaknesses

Strengths

  1. RoboScape is a novel simulator that addresses the lack of realism, scalability, and versatility in existing urban simulators.
  2. The platform offers high-fidelity rendering and dynamic elements like pedestrians and traffic, making it suitable for realistic robot training and evaluation.
  3. The two new benchmarks provide a comprehensive evaluation of robot capabilities in realistic scenarios.

Weaknesses

  1. Although more diverse than previous simulators, the action space for human agents is still limited.

Questions

  1. How does RoboScape handle the integration of new AI models or algorithms into its simulation framework?
  2. What are the specific computational requirements and limitations for running RoboScape on different hardware configurations?

Limitations

Yes

Final Justification

The authors’ response was clear and convincing, addressing the issues I had pointed out. I believe the paper is stronger than I initially assessed, and I will increase my score.

Formatting Issues

No formatting issues.

Author Response

We thank the reviewer for recognizing our contribution of “a novel simulator that addresses the lack of realism, scalability, and versatility in existing urban simulators” and for noting that the “two new benchmarks provide a comprehensive evaluation of robot capabilities in realistic scenarios.” We address your questions and concerns below.

Q1: Integration of New AI Models

Currently, RoboScape uses pre-defined 3D assets, primarily obtained from Unreal Engine Marketplace (FAB), which we further customize for integration. However, the framework allows easy integration of generative methods such as DreamFusion to synthesize 3D assets from text, and SMPL-X-based text-to-body motion generation models to generate human motion sequences from natural language descriptions. We are actively developing these extensions to further enhance RoboScape’s ability to support diverse AI research.

Q2: Specific Computational Requirements

To ensure wide accessibility, RoboScape supports adjustable rendering resolutions, allowing deployment across both high-end servers and modest laptops.

Recommended Setup

  • CPU: Intel Core i7‑12700H or AMD Ryzen 9 5900HS
  • GPU: NVIDIA RTX 3070 or a GPU with more than 6 GB of VRAM
  • RAM: 32 GB

Minimum Setup (60 FPS for RoboScape)

  • CPU: Intel Core i7‑11300H or AMD Ryzen 9 4800H
  • GPU: NVIDIA RTX 2060 (notebook)
  • RAM: 16 GB

W1: Limited Action Space for Human Agents

We agree that the action space can be further expanded. By defining poses, it is convenient to add more actions within our framework. As an initial effort, we have added 5 new types of human actions commonly observed in daily life: (1) picking up, carrying, and dropping objects; (2) sitting down and standing up; (3) entering and exiting vehicles; (4) opening and closing containers (e.g., boxes, cabinets, car trunks); and (5) engaging in social interactions (e.g., chatting, discussing, arguing).

We achieved this by using pre-defined human body motion for each type and adjusting the poses using inverse kinematics for human-object interactions (e.g., adjusting hand poses to grasp objects). We have also implemented a flexible API to enable these actions to be triggered not only by rule-based planners but also by other models such as LLMs, enabling more diverse and context-aware human behaviors. In ongoing work, we are exploring the use of generative models such as text-to-body motion for producing realistic human actions beyond the predefined space. This will allow RoboScape to support richer and more human-like behavior generation in the future.
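As a rough illustration of such an action-triggering API (all names and signatures below are hypothetical stand-ins, not RoboScape’s actual interface), the predefined human actions could be exposed through a registry that either a rule-based planner or an LLM tool call invokes:

```python
from typing import Callable, Dict

def _play_clip(agent_id: str, clip: str) -> None:
    # Stand-in for the engine call that plays a predefined motion clip.
    print(f"[{agent_id}] playing motion clip: {clip}")

# Hypothetical registry mapping action names to motion clips (names are illustrative).
HUMAN_ACTIONS: Dict[str, Callable[[str], None]] = {
    "pick_up":        lambda agent: _play_clip(agent, "pickup_carry_drop"),
    "sit_down":       lambda agent: _play_clip(agent, "sit_stand"),
    "enter_vehicle":  lambda agent: _play_clip(agent, "enter_exit_vehicle"),
    "open_container": lambda agent: _play_clip(agent, "open_close_container"),
    "chat":           lambda agent: _play_clip(agent, "social_interaction"),
}

def trigger(agent_id: str, action: str) -> None:
    """Entry point a rule-based planner or an LLM tool call could invoke."""
    HUMAN_ACTIONS[action](agent_id)

trigger("pedestrian_07", "chat")
```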

Comment

I believe the proposed approach is well-motivated and technically sound, and I look forward to seeing this work presented.

Official Review (Rating: 4)

The paper introduces a potentially valuable outdoor simulator RoboScape for robotics with support for high-fidelity assets and multi-agent cooperation. The experimental results demonstrate that state-of-the-art vision-language models (VLMs) lack robust perception, reasoning, and planning abilities necessary for urban environments.

Strengths and Weaknesses

Strengths:

  1. The paper presents an outdoor simulator designed for robotics tasks, featuring high-fidelity building and object assets. It is of significance to the field.

  2. The simulator is reported to support both single-agent and multi-agent collaborative operations.

Weakness:

  1. Key aspects such as simulation realism, agent interactions, and task definitions are insufficiently addressed, raising concerns about the simulator’s usability and generalizability.

  2. The paper would benefit from a more rigorous exposition of design choices, empirical validation, and clearer articulation of its unique value compared to existing platforms.

  3. The selection of baselines and planning methods appears underdeveloped. The authors only present fine-tuning results on QwenVL2.5 7B and do not propose any new baseline or planning method tailored to this specific simulator. Furthermore, the study lacks sufficient ablation experiments that could provide insights into the effectiveness of different components.

Questions

  1. As the simulator constitutes a core contribution of the paper and serves as the foundation for various tasks, the manuscript lacks sufficient discussion of critical aspects such as the runtime environment, hardware performance, rendering capability, API adaptability, simulation type (e.g., PP, OS, or RBF), and the availability of a human-in-the-loop (HITL) interface. These dimensions are essential benchmarks for assessing the simulator’s effectiveness. The authors are encouraged to provide a detailed specification of these components and offer a comparative analysis against existing simulators.

  2. Although the paper compares weather conditions, lighting, building structures, and pedestrian behavior with other simulators, it remains unclear how these features influence performance in VLN tasks. The presented scenarios are insufficient to clearly demonstrate the simulator’s unique contributions to the field.

Limitations

yes

Final Justification

Thank you for the rebuttal. While some limitations remain, I feel that my main concerns have been adequately addressed by the authors. As a result, I am willing to raise my score by 1 point.

Formatting Issues

No

Author Response

We are happy that the reviewer considered our work “significant to the field,” and we address your questions and concerns below.

Q1: RoboScape Details

Thanks for the suggestion. We further clarify the details of RoboScape below and will include them in the revised Section 3 and Appendix A.

  • Underlying simulation engine. Our simulation is built on Unreal Engine 5, leveraging its native Chaos physics pipeline: each object is assigned an appropriate collision mesh, and at each fixed simulation tick the engine performs discrete‑time integration of Newton’s equations, resolving forces, collisions, and joint constraints via its iterative solver.
  • Python API and runtime environment. On top of UE5, we’ve implemented a dedicated Python layer that communicates with the engine through an updated UnrealCV‑based TCP server, where high‑level commands issued in Python are forwarded over TCP straight into the UE5 runtime. On top of this API, we provide a standard Gym interface so that researchers can plug in and benchmark any baseline with minimal effort. We will distribute a Windows executable and a Singularity container for Ubuntu and macOS so that users can run RoboScape out of the box without local compilation; all binaries, container images, and the Gym wrapper will be open‑sourced upon publication.
  • Hardware requirements and simulation/rendering performance. To ensure wide accessibility, RoboScape supports adjustable rendering resolutions, allowing deployment across both high-end servers and modest laptops.

Recommended Setup

  • CPU: Intel Core i7‑12700H or AMD Ryzen 9 5900HS
  • GPU: NVIDIA RTX 3070 or a GPU with more than 6 GB of VRAM
  • RAM: 32 GB

Minimum Setup (60 FPS for RoboScape)

  • CPU: Intel Core i7‑11300H or AMD Ryzen 9 4800H
  • GPU: NVIDIA RTX 2060 (notebook)
  • RAM: 16 GB

We evaluated all baselines on a headless machine with an AMD EPYC 9534 CPU, an L40S GPU, 64 GB RAM, and 720×600 resolution. We can run two instances in parallel at a fixed 60 fps. The table below reports runtime performance when stress-testing different settings.

Resolution   Rendering Quality   GPU Util (%)   CPU Util (%)   RAM (MB)
640x360      low                 30.04          16.41          561.56
720x600      low                 29.72          17.73          596.4
1280x720     low                 26.5           18.56          734.81
640x360      high                30.03          16.59          564.04
720x600      high                29.61          17.47          589.3
1280x720     high                27.44          19.08          738.6
  • Human‑in‑the‑loop interface. We support a human‑in‑the‑loop interface through which a human operator uses a mouse and keyboard to control the robot; all trajectories are recorded automatically as expert demonstrations for downstream training.

Q2: Baselines Evaluation under Different Settings

To clarify, our ROBOSCAPE‑MMNAV benchmark includes two distinct task settings to isolate how different conditions affect baseline performance in Table 2 and Table 3 at line 196. In the easy setting, all static obstacles (trees, benches, fire hydrants) and dynamic agents (traffic, pedestrians) are removed, so the challenge reduces to navigation in static environments. In the hard setting, we introduce both static and dynamic obstacles under sunny conditions. While baseline models achieve similar overall success rates and progress scores in both settings, they are unable to effectively avoid static and dynamic obstacles, consistently fail to yield to pedestrians, and do not adhere to basic traffic rules. None of the baselines could reliably complete the task under the sunny setting, and we anticipate that under more adverse weather or lighting conditions (overcast skies, rain, or added visual occlusions) performance would degrade even further.

W3: Lack of New Methods & Ablation Experiment

The goal of this work is the development of a highly customizable, photorealistic, city-scale simulation platform designed to provide the embodied AI community with training and testing environments that closely mirror real-world scenarios (Line 44). We further introduce two new benchmarks as case studies to illustrate RoboScape’s functionalities and versatility. Developing new methods is out of the scope of this work. We believe that our platform can be a valuable resource for the community to develop new methods in future work.

We conducted new ablation experiments to assess the contribution of each input modality, using the same GPT‑4o backbone. Our basic setting includes a ground-truth segmentation input, an explicit ReAct prompt framework [1], and separate perception-action calls, which simplify the mapping. While none of the configurations successfully reach the destination, we observe that removing the GT segmentation image, ablating the ReAct framework, or merging the perception-action calls into a single inference decreases baseline performance. Furthermore, adding the depth image or stripping the vision input also hurts performance.
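For clarity, the sketch below illustrates separate perception and action calls with a ReAct-style action prompt, using the OpenAI Python client. The prompts, file name, and action set are illustrative placeholders, not the exact prompts used in the paper.

```python
import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY; all prompts below are illustrative

def encode(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()

def image_msg(text: str, image_path: str) -> list:
    return [{"role": "user", "content": [
        {"type": "text", "text": text},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{encode(image_path)}"}},
    ]}]

# Call 1 (perception): describe the scene from RGB + ground-truth segmentation.
scene = client.chat.completions.create(
    model="gpt-4o",
    messages=image_msg("Describe visible roads, intersections, and landmarks.", "rgb_seg.png"),
).choices[0].message.content

# Call 2 (action): reason over the description and pick one discrete action.
action = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content":
               f"Scene: {scene}\nInstruction: turn left at the next intersection.\n"
               "Thought: reason step by step, then output one of "
               "[move_forward, turn_left, turn_right, stop]."}],
).choices[0].message.content
print(action)
```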

We will include the following results in the revised Section 4.2 using an additional table.

Performance summary (subtask success rate):

  • GPT‑4o: 34.38%
  • Merged perception-action call: 33.54%
  • w/ depth: 33.39%
  • w/o explicit ReAct framework: 32.21%
  • w/o GT segmentation: 31.90%
  • w/ stripping: 31.22%

Analysis:

  • The perception-action framework simplifies the mapping process and reduces hallucinations.
  • The model already has a certain depth prediction capacity. Adding depth images does not help significantly and may even introduce noise if the colormap is not well aligned.
  • The ReAct framework does help reduce hallucination and stabilize the policy in complex scenarios.
  • GPT-4o lacks sufficient training on intersection-related scenarios, but the GT segmentation image helps mitigate this limitation.
  • Although shown to be promising in [2], stripping the observation into vertical chunks increases the difficulty of image matching, which is crucial to our current simple settings.

References

[1] ReAct: Synergizing Reasoning and Acting in Language Models
[2] NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large Language Models

Comment

Dear bmod,

I don't see any engagement from you in the discussion, but noted that you nevertheless submitted the "Mandatory Acknowledgement". This is not ok. Please engage in the discussion before submitting. If the authors have resolved your (rebuttal) questions, please do tell them so. If the authors have not resolved your (rebuttal) questions, please do tell them so too.

Thanks, Your AC

Official Review (Rating: 3)

This paper proposes a new simulation platform RoboScape for embodied AI tasks in large-scale urban environments. Based on the simulator, it proposes two new benchmarks: multimodal robot navigation and multi-robot search and benchmarks the performance of state-of-the-art vision language models.

Strengths and Weaknesses

Strengths:

  1. Based on an advanced graphics engine, the proposed simulator has better photorealism than existing urban simulators.
  2. It has more realistic pedestrian simulation with different behaviors.
  3. The proposed benchmarks are practical in the real world while less studied in the urban embodied settings.

Weaknesses:

  1. The paper only benchmarks the performance of general VLMs without evaluating the performance of some task-specific models like robot navigation models or RL-based policies. Given the very low success rate of VLMs on these tasks due to lack of spatial understanding, it is questionable whether these benchmarks are truly useful without at least some reasonable baselines.
  2. It seems there's only one type of embodied urban robot in the simulator (the robot dog). However, many other types of robots could appear in the urban setting, such as delivery robots, autonomous scooters, autonomous wheelchairs, etc.
  3. Following 2, different types of robots could also have different dynamics and low-level control, while the simulator only models the action space with continuous translation and free rotation. Some of these actions may not be achievable by certain types of robots. This would lead to a large sim2real gap if the simulator only models very high-level actions.
  4. The authors also mention the RoboScape-20K training set as their main contribution, but it is not explained in the main paper, with only some vague explanations in the supplementary material.

Questions

  1. What is the running efficiency of the simulator (e.g. memory cost, fps)? Does it support running multiple instances in parallel in headless mode for efficient RL training?
  2. Why the authors only evaluate the performance of VLMs on the proposed benchmarks without also comparing with task-specific models or trained RL policies?
  3. Does the simulator support user-friendly customization of new tasks and generating new scenarios? The current two tasks introduced in the paper can only represent a small fraction of tasks that agents can perform in urban settings.

Limitations

Yes

Final Justification

Based on the rebuttal and discussions, I have increased my score by 1 point to reflect the authors’ efforts in addressing my feedback on adding new RL/hybrid-based baselines. However, I still feel that the current state of the simulation framework could be further improved for a better user experience, and I still incline towards rejection.

Formatting Issues

There seem to be two separate appendices: one in the attached supplementary material, and another in the main paper after the reference section (which seems to be the incomplete version).

Author Response

We are happy that the reviewer considered our work to have “better photorealism than existing urban simulators” and found the two case studies “practical in the real world while less studied in the urban embodied settings”. We address your questions and concerns below.

Q1: Running Efficiency of RoboScape

RoboScape supports running multiple instances in parallel in headless mode for efficient RL training.

We evaluated all baselines on a headless machine with an AMD EPYC 9534 CPU, an L40S GPU, 64 GB RAM, and 720×600 resolution. We can run two instances in parallel at a fixed 60 fps. The table below reports runtime performance when stress-testing different settings.

Resolution   Rendering Quality   GPU Util (%)   CPU Util (%)   RAM (MB)
640x360      low                 30.04          16.41          561.56
720x600      low                 29.72          17.73          596.4
1280x720     low                 26.5           18.56          734.81
640x360      high                30.03          16.59          564.04
720x600      high                29.61          17.47          589.3
1280x720     high                27.44          19.08          738.6

Q2 & W1: Selection of Baselines

We appreciate the reviewer’s suggestion regarding the inclusion of task-specific models such as robot navigation systems or RL-based policies. We have conducted preliminary experiments with standard RL-based baselines (PPO). However, these approaches performed poorly in our proposed case studies, likely due to the complexity of instruction-driven tasks and the need for long-horizon reasoning. Designing high-performing policies for these tasks would require significantly more sophisticated architectures and task-specific tuning—an important direction for future work, but one that is orthogonal to the primary contribution of our paper. Our primary contribution lies in the introduction of RoboScape, an open-source and extensible simulation platform that supports customizable, diverse scenarios for evaluating embodied agents.

We have conducted the following additional experiments and will include them in the revised Table 2:

  • Hybrid Baseline: We have also evaluated the performance of a hybrid planner where a VLM, GPT-4o, serves as the high‑level decision maker—deciding whether to continue straight down a street until the next intersection or to turn at that intersection—while A* is used as a low‑level path planner to generate and execute the movement commands. The subtask success rate is 32.53%. (A minimal sketch of a grid-based A* planner of this kind is given after this list.)
  • RL Baseline: We utilize two NVIDIA L40S GPUs, each running two parallel instances of RoboScape. Following the setting of VLN-CE [1], the agent is powered by a multimodal policy that combines DeBERTa-v3 for instruction encoding and DINOv2 for visual perception. We first pretrain the policy through imitation learning (specifically, behavioral cloning) on expert demonstrations, and then further finetune it using PPO. Rollouts are collected concurrently across all environments. The current RL baseline achieves a 28.37% subtask success rate, highlighting the challenging nature of our tasks. We believe RoboScape provides a strong foundation for future research on more advanced RL methods, by offering parallelizable, photorealistic, large-scale, and dynamic city environments, along with the ability to collect and incorporate human demonstrations that naturally include errors and corrective behaviors.
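Below is a minimal, generic grid-based A* planner of the kind such a hybrid pipeline could use for low-level pathfinding. The map representation and cost model actually used in the experiments are not specified in the rebuttal, so this is only an illustrative sketch.

```python
import heapq

def astar(grid, start, goal):
    """4-connected grid A* with a Manhattan-distance heuristic.

    grid: 2-D list where 0 = free cell and 1 = obstacle; start/goal are (row, col).
    Returns the path as a list of cells, or None if the goal is unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    g_cost, parent = {start: 0}, {start: None}
    open_heap = [(h(start), start)]
    closed = set()
    while open_heap:
        _, node = heapq.heappop(open_heap)
        if node in closed:
            continue
        closed.add(node)
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (node[0] + dr, node[1] + dc)
            if 0 <= nxt[0] < rows and 0 <= nxt[1] < cols and grid[nxt[0]][nxt[1]] == 0:
                ng = g_cost[node] + 1
                if ng < g_cost.get(nxt, float("inf")):
                    g_cost[nxt] = ng
                    parent[nxt] = node
                    heapq.heappush(open_heap, (ng + h(nxt), nxt))
    return None

grid = [[0, 0, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))   # only feasible route goes through (1, 2)
```

In the hybrid setup described above, the VLM would choose the next waypoint (e.g., the corner of the next intersection) and a planner of this kind would produce the low-level movement commands to reach it.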

Q3: Customization

High customizability is a core feature of RoboScape. We will outline how to create new tasks, scenarios, and embodied agents in Appendix A.4 to support flexible benchmark extension. The two benchmark tasks included in the paper are intended as case studies to demonstrate the functionalities and utilities of RoboScape. RoboScape was explicitly designed to support user-friendly customization of both tasks and environments.

  • New scenarios. Users can generate diverse and realistic urban layouts through our Python API by providing simple metadata inputs (e.g., number of streets, street length, object categories, and their spatial distributions, such as “10% trees, 5% tables and chairs”). This allows researchers to create arbitrarily large and varied city environments.
  • New tasks. Defining tasks in RoboScape is similar to standard Gym environments, as the Python API of RoboScape follows the same format for agent control (including pedestrians and vehicles). Users can spawn different types of agents, customize observation spaces (e.g., RGB, depth, or semantic segmentation) and action spaces (continuous or discrete actions), and program new goal definitions (e.g., language instructions, target images, or spatial objectives). Additionally, while the current pedestrians follow rule-based logic, users can override behaviors to simulate complex or rare cases. For instance, one can simulate a jaywalking pedestrian by scripting a few agents to cross during a red light, while placing a robot at the intersection to evaluate its reaction.
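To illustrate the Gym-style task definition described in the last bullet, together with the kind of scenario metadata mentioned above, here is a minimal sketch. All key names, the environment class, and the reward logic are hypothetical stand-ins, since the actual RoboScape API is not reproduced here; the stub runs without RoboScape installed.

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

# Hypothetical scenario metadata of the kind described above; key names are ours.
SCENARIO = {
    "num_streets": 10,
    "street_length_m": 200,
    "object_distribution": {"tree": 0.10, "table_and_chairs": 0.05},
    "weather": "sunny",
}

class FindLandmarkTask(gym.Env):
    """Toy example of defining a new task through a Gym-style interface.

    A real task would query the simulator for observations and rewards; here the
    environment is a stub so the example runs without RoboScape installed.
    """

    def __init__(self, scenario: dict):
        super().__init__()
        self.scenario = scenario
        self.observation_space = spaces.Box(0, 255, shape=(84, 84, 3), dtype=np.uint8)
        self.action_space = spaces.Discrete(4)        # forward, left, right, stop
        self._steps = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self._steps = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        self._steps += 1
        terminated = bool(action == 3)                # "stop" ends the episode
        truncated = self._steps >= 50
        reward = 1.0 if terminated else 0.0           # placeholder goal condition
        return self.observation_space.sample(), reward, terminated, truncated, {}

env = FindLandmarkTask(SCENARIO)
obs, info = env.reset(seed=0)
```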

W2: New Embodied Robot

We fully agree that many other types of embodied agents (e.g., delivery robots, wheelchairs, humanoids, or drones) are relevant in urban settings. In RoboScape, we have included an additional embodied agent—a scooter—that can be controlled using velocity and angular velocity commands through our Python API, and it will be included in the revised Section 3.2.

To add a new robot type in RoboScape, one can leverage the existing Python API of RoboScape to conveniently customize action spaces—either continuous or discrete—as well as observation spaces such as RGB, depth, or semantic segmentation images. The required additional work includes: (1) obtaining a new robot asset, typically from the Unreal Engine Marketplace; (2) defining the robot's actions using Unreal's Blueprint system; (3) integrating these actions with our Python API to enable high-level control; and (4) attaching our camera components to support the desired observation space.

W3: Action Space in RoboScape

The current action space in RoboScape is consistent with other urban simulators such as CARLA and MetaUrban, which adopt a similar high-level action space. Additionally, RoboScape is built on Unreal Engine, which supports integration with ROS2, allowing users to implement low-level dynamics and control if needed. This makes it feasible to reduce the sim2real gap by customizing robot-specific control pipelines.

W4: Details of Dataset RoboScape-20K

Thanks for the suggestion. We will move the details of RoboScape-20K from Appendix D.4 to the main paper. To further clarify, RoboScape-20K is a training set consisting of 20,000 robot navigation cases for RoboScape-MMNav. The training environments are entirely disjoint from the evaluation environments, and 33% of the building assets used during evaluation are intentionally excluded from the training set to ensure a clean generalization split.

For each training case, the input is the robot’s current observation, and the label is the correct next action to take. Since our refined pipeline includes a reasoning module, we additionally provide optional intermediate supervision signals—such as the correct facing direction and the ground-truth distance to the goal—to encourage chain-of-thought learning. These auxiliary labels are optional and can be ignored depending on the training paradigm and model architecture.
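For concreteness, a single training case with the described fields might be represented as follows; the field names and file layout are our illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NavTrainingCase:
    """One RoboScape-20K-style training case (illustrative field names, not the released schema)."""
    episode_id: str
    step: int
    rgb_path: str                      # current egocentric observation
    instruction: str                   # natural-language navigation instruction
    action: str                        # label: correct next action, e.g. "turn_left"
    # Optional intermediate supervision for chain-of-thought-style training:
    facing_direction_deg: Optional[float] = None
    distance_to_goal_m: Optional[float] = None

case = NavTrainingCase(
    episode_id="city042_ep003", step=17,
    rgb_path="obs/city042_ep003_017.png",
    instruction="continue to the second intersection, then turn right",
    action="move_forward",
    facing_direction_deg=92.5, distance_to_goal_m=148.0,
)
```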

Paper Formatting

We will merge the appendix and supplementary materials into a single, complete appendix in the final version.

References:

[1] Beyond the Nav-Graph: Vision-and-Language Navigation in Continuous Environments

Comment

I thank the authors for their detailed rebuttal. It seems that both the hybrid and RL-based baselines perform poorly on the proposed navigation benchmark, even on the "easy" test set without any obstacles. I'm trying to understand why the success rate is so low on this task, especially compared to the multi-robot search task. The authors list some common failure modes in Sec. 4.2 but do not present comprehensive statistics on the frequency of each one. Can the authors elaborate on which of the listed failure modes are the most challenging to address, as well as on other failure modes for non-VLM-based approaches? Will the success rate increase significantly if the travelling distance becomes shorter?

Comment

We appreciate your thoughtful suggestion for deeper analysis. As observed, both hybrid and RL baselines perform poorly even on the easy split. This stems from the task’s demand for embodied spatial reasoning and multimodal instruction grounding.

Additional Analysis for non-VLM-based Approaches

Compared to initial behavior cloning (which achieved only 11.61% subtask SR), RL finetuning significantly boosts performance to 28.37%. However, we observe that RL policies frequently fail to recognize the correct intersection and make the correct turn, which highlights the challenge of visual and spatial reasoning in the realistic environments simulated by RoboScape. The correlation between trajectory distance and subtask SR for RL is strongly negative (-0.522), likely caused by sparse rewards and insufficient exploration on longer trajectories.

Compared to the GPT-4o baseline, the hybrid baseline performs better on turning (the success rate for turning increases from 42.90% to 66.60%), since once the position is correct, the planner executes turns reliably. However, it performs worse on intersection matching (decreasing from 17.50% to 7.70%) because it currently has only one opportunity to match at the corner, with less tolerance than VLM methods that benefit from fine-grained action updates.

Why Performance Is Worse than in the Multi-Robot Task

The multi-robot search task shifts emphasis to scene description and effective communication. The main robot uses an oracle planner that covers a substantial portion of the route, and repeated communication allows adjustment. The success condition, spotting the other robot, is also less strict than in the single-robot setting.

More Results on Failure Modes

We analyze failure modes for each subtask (subtasks are defined in Sec. 4.1). For the VLM baselines evaluated in the current benchmark, we further identify and analyze failure modes within each type of subtask. Below are the most frequent GPT-4o failure modes across these subtasks.

Table 1. Most frequent GPT-4o failure modes

Subtask                     Failure mode                                                  Frequency (%)
1. Moving to Intersection   1.1 Misestimate the distance to the intersection              53.33
                            1.2 Fail to detect the intersection                           28.33
                            1.3 Misidentify the reference landmark                        18.33
2. Turning                  2.1 Misinterpret the turning pattern                          42.86
                            2.2 Misunderstand history status summary                      42.86
                            2.3 Fail to detect upfront buildings                          14.29
3. Reaching Destination     3.1 Fail to match the landmark in a different perspective     60.00
                            3.2 Stop too early to face the landmark                       30.00
                            3.3 Fail to align the landmark                                10.00

Among these failure modes, 1.1, 2.1, 2.3, 3.2, and 3.3 are all related to embodied reasoning, since these scenarios involve locating oneself based on the egocentric view; 2.1 in particular requires the robot to reason about when and how to turn. Failure modes 1.3 and 3.1 are related to spatial reasoning, while 1.2 reflects basic visual grounding errors and 2.2 shows that performance can further benefit from enhanced planning and memory capacity. Therefore, most failure modes revolve around spatial and embodied reasoning, which aligns with our original focus on VLM-centered evaluation.

We will expand Sec. 4.2 with more results and analyses regarding failure modes and types of subtasks.

Additional Reasoning Model Results

To further validate our benchmark, we additionally tested o3 and o3-pro. The successful trials show that, with correct reasoning and accurate spatial understanding, the model can complete the task end-to-end. All successful trajectories contain at least 60 steps and include all task types, which reduces the chance of overly simple outliers.

Table 2. Additional Benchmark metrics

Model     SR (%)   Subtask SR (%)   Distance Progress (%)
GPT-4o    0.00     34.21            24.36
o3        5.00     42.25            38.43
o3-pro    8.33     43.23            39.46

The results indicate that improved reasoning abilities boost performance. In our experiment the reasoning models show improved depth estimation and destination alignment, which further demonstrates the importance of visual and spatial reasoning in our benchmark.

Thank you for the constructive suggestions. All the additions above will appear in the revision.

Comment

Thank you to the authors for the additional clarifications and the detailed analysis of the failure modes. I have no further questions at this time and will raise my rating by 1. However, I still have some concerns regarding the completeness and usability of the simulator. Currently, it appears that users need to interact directly with the back-end Unreal Engine to support new agents/sensors/tasks. It would be highly beneficial to provide a more user-friendly and flexible interface to facilitate easier customization and broader applications.

Comment

Thank you for your thoughtful feedback. Upon acceptance, we will include documentation on task customization through Python APIs and work toward automated Unreal integration to simplify the addition of new robots and actions.

Comment

Dear Reviewers,

Please take a look at the rebuttal and check if your questions and concerns have been addressed and everything is clear. Now is the time to clarify any remaining issues in discussion with the authors. Thanks @kyYy for already engaging in the discussion.

Thanks, Your AC

Final Decision

The paper presents a simulation platform for embodied AI in large-scale urban environments. The paper further introduces two benchmarks: multimodal instruction following for navigation, and multi-robot search. Experimental results demonstrate shortcomings of state-of-the-art VLMs.

The reviewers appreciated the introduction of a high-fidelity simulator and highlighted the rich action and observation space, the meaningfulness of the introduced benchmarks, and the analysis of benchmark results. Concerns were raised about the selection of baselines, which were only VLM-based rather than task-specific, with only one baseline fine-tuned for the task at hand. Further concerns regarded missing technical details, missing ablation experiments, limitations in the available types of robots, and the restriction of the training set to oracle traces. The rebuttal and discussion added additional baselines and experimental results, provided technical details, and clarified remaining concerns. While it shows that an impressive amount of work went into the development of the simulator, it seems that more time could have been spent on the implementation of baselines and on polishing the presentation. The additions made in the rebuttal clearly strengthened the paper here. Although a few things could still be improved (e.g., dedicated baselines developed for the benchmarks, building on findings from the zero-shot VLM baselines), they do not prevent a recommendation to accept the paper.