PaperHub
6.4 / 10
Poster · 4 reviewers
Ratings: 3, 3, 5, 5 (min 3, max 5, std 1.0)
Confidence: 3.5
Novelty: 2.3 · Quality: 3.3 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

SpatialLM: Training Large Language Models for Structured Indoor Modeling

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

This paper presents SpatialLM, a large language model designed to process 3D point cloud data and generate structured scene descriptions.

Abstract

Keywords
Structured Indoor Modeling, Large Language Models, Spatial Understanding, 3D Reconstruction

Reviews and Discussion

Review (Rating: 3)

This paper introduces SpatialLM, a 3D point cloud multimodal large language model for structured indoor modeling. SpatialLM regards 3D modeling as a scripting task, thereby leveraging the outstanding programming capabilities of large language models. The authors created a large synthetic 3D dataset, the SpatialLM dataset, which surpasses existing datasets in both quantity and scale. They conducted extensive experiments to demonstrate the effectiveness and generalizability of their model.

Strengths and Weaknesses

Strengths:

  1. The introduced synthetic dataset of 12,328 indoor scenes is a substantial contribution, enabling extensive training and benchmarking.
  2. The authors conducted extensive ablation experiments on the design of the point encoder, the selection of spatial resolution, and the training schedule.

Weaknesses:

  1. The motivation and approach of SpatialLM are similar to SceneScript, raising concerns about incremental novelty.
  2. The improvements of SpatialLM over SceneScript in the layout estimation task are modest (+1.3%/+0.8% at IoU@0.25/0.5). Additionally, SpatialLM does not surpass V-DETR, which does not use LLMs, in 3D object detection tasks.
  3. There are some structural issues in writing. The related work section could be presented earlier.

Questions

  1. What are the advantages of scripts over box texts (like \box{x, y, z, w, h, l} or [x, y, z, w, h, l])?
  2. What is the incremental novelty of SpatialLM over SceneScript? I believe that the format of the output script is not a problem for large language models.
  3. What are your rules for selecting scenes/categories? Are there any special considerations to ensure diversity and avoid a long-tail distribution?

Limitations

Yes.

Final Justification

This paper proposes a framework built upon an LLM to perform 3D perception and reconstruction tasks. The idea is similar to SceneScript, and the improvement is demonstrated with experiments. However, the limited application scope and the relatively incremental performance improvement do not offset the relatively minor technical contribution of this work. The experimental results were also corrected during the discussion phase, which suggests the results may not have been solid at submission.

Overall, I think it might be a promising direction, but it needs more experiments or applications to support it. Therefore, I vote for borderline rejection at this stage.

Formatting Issues

No.

Author Response

We thank the reviewer for the comments. Below is our detailed response.

Weaknesses

  1. Thanks for the question. Indeed, our work is inspired by SceneScript. However, a key difference is that SceneScript proposes a task-specific tokenizer, which is not compatible with the text tokenizers used in modern LLMs, and it trains the tokenizer with a task-specific decoder from scratch (please refer to Sections 4.2 and 4.3 of the SceneScript paper for details). In other words, a new domain-specific language with the associated tokenizer and decoder is proposed in SceneScript. This is also noted in Section 4.1 (lines 302 - 303) of our paper.

    In contrast, SpatialLM directly exploits modern LLMs’ capacity for Python language generation, without designing any new tokenizer or decoder. Thus, much of our work is focused on how to align point cloud features to existing LLMs with a standard “Encoder-MLP-LLM” architecture (Section 2), which has not been carefully studied before.

    Furthermore, by building upon existing LLMs, our experiment in Section 3 also reveals other differences from SceneScript. For example, as shown in Tables 5 and 6, while training on existing datasets like Structured3D and ScanNet may be sufficient for SceneScript to achieve competitive performance, these datasets are too small for SpatialLM. Instead, by first training on our new dataset, SpatialLM outperforms SceneScript on both tasks.

    To summarize, our work is the first to employ modern LLMs for structured scene reconstruction (i.e., layout estimation and 3D object detection) from point clouds. Our contributions include (i) a new large-scale dataset and (ii) thorough ablation studies on aligning point cloud features with LLMs. Furthermore, we show that SpatialLM can achieve competitive performances against existing task-specific models. We believe this is an important step towards equipping LLMs with general 3D scene understanding and generation capacities.

  2. Thanks for the comment. In fact, after the paper submission, we found a bug in our implementation related to using Fourier encodings of voxel indices as positional embeddings for the output Sonata features (an illustrative sketch of this type of encoding appears after the last item in this list). After fixing the bug, the performance of SpatialLM on both layout estimation and 3D object detection further improved. The new results are given below.

    Table 5. Experiment on layout estimation.

    | Method | F1 (IoU_2D@0.25) | F1 (IoU_2D@0.5) |
    | --- | --- | --- |
    | RoomFormer | 70.4 | 67.2 |
    | SceneScript | 83.1 | 80.8 |
    | SpatialLM (ft. Structured3D) | 32.8 | 17.9 |
    | SpatialLM (ft. Ours) | 51.2 | 38.3 |
    | SpatialLM (ft. Ours -> Structured3D) | 86.5 | 84.6 |

    Table 6. Experiment on 3D object detection.

    | Method | F1 (IoU_3D@0.25) | F1 (IoU_3D@0.5) |
    | --- | --- | --- |
    | V-DETR | 65.1 | 56.8 |
    | SceneScript | 49.1 | 36.8 |
    | SpatialLM (ft. ScanNet) | 2.9 | 0.7 |
    | SpatialLM (ft. Ours) | 33.8* | 22.6* |
    | SpatialLM (ft. Ours -> ScanNet) | 65.6 | 52.6 |

    As one can see, for layout estimation, SpatialLM consistently outperforms other baselines, with +3.4 IoU_2D@0.25 and +3.8 IoU_2D@0.5 over SceneScript. For 3D object detection, we have +16.5 IoU_3D@0.25 and +15.8 IoU_3D@0.5 over SceneScript, and +0.5 IoU_3D@0.25 and -4.2 IoU_3D@0.5 compared to V-DETR.

    Finally, we note that the primary goal of this work is not merely competing for the highest scores on existing benchmarks. Instead, with these experiments, we aim to show a feasible path to tackling 3D scene understanding tasks with the MLLM architecture.

  3. Thanks for the suggestion. We will revise the paper to present the related work section earlier.
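For concreteness, below is a minimal sketch of Fourier-encoding voxel indices into positional embeddings of the kind mentioned in item #2 above. The function name, the number of frequencies, and the way the embedding is combined with the encoder features are illustrative assumptions, not the exact implementation used in SpatialLM.

```python
import torch

def fourier_positional_embedding(voxel_indices: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map integer voxel indices of shape (N, 3) to Fourier features of shape
    (N, 3 * 2 * num_freqs). Frequencies and scaling are illustrative choices."""
    coords = voxel_indices.float()                                   # (N, 3)
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)      # (F,)
    angles = coords.unsqueeze(-1) * freqs                            # (N, 3, F)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, 3, 2F)
    return emb.flatten(start_dim=1)                                  # (N, 6F)

# Hypothetical usage: project the embedding to the feature width and add it to
# the point features produced by the encoder before they reach the LLM.
# point_feats = sonata(points)                                       # (N, C)
# point_feats = point_feats + proj(fourier_positional_embedding(voxel_idx))
```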

Questions

  1. Compared to domain-specific box texts, the Python script is more general and flexible. For example, one may specify various relationships, such as associating a door/window with a wall, by passing the necessary parameters. It can also be easily extended to include an arbitrary number of object classes or other properties. Further, since LLMs have already seen a large amount of Python code data, it is easier to align the scene descriptions with other LLM capacities, such as tool calling. (An illustrative sketch of such a script is given after this list.)

  2. Please refer to item #1 above.

  3. To ensure the diversity of our dataset, we mainly rely on our access to a large repository of interior designs from a leading platform in the interior design industry. As described in Section 2.1, most 3D scene designs in our dataset are created by professional designers and used for real-world production. This way, we ensure that the distribution of 3D scenes in our dataset closely resembles that of real-world houses. Please refer to Section 1 of the supplementary material for statistics of the rooms and objects in our dataset.

    During dataset curation, we select designs (i.e., scenes) by considering the following factors: (i) ratings by the professional designers, (ii) the number of renderings generated from the design, (iii) total floor area > 20 m², and (iv) number of unique objects > 35. For objects, we pick the 59 common object categories based on occurrence while filtering out objects in the long-tail distribution. As mentioned in Section 2.1, we also filter out small objects with a side length < 15 cm.
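As referenced in Question #1 above, the following is a hypothetical fragment of a Python-style scene script of the kind SpatialLM is trained to emit. The class names and parameters (Wall, Door, Bbox, wall_id, angle_z, etc.) are illustrative assumptions, not the exact schema used in the paper.

```python
from dataclasses import dataclass, field
from itertools import count

_next_id = count()

@dataclass
class Wall:                       # wall segment from corner (ax, ay) to (bx, by)
    ax: float; ay: float; bx: float; by: float
    height: float
    id: int = field(default_factory=lambda: next(_next_id))

@dataclass
class Door:                       # attached to a wall via wall_id
    wall_id: int
    position_x: float
    width: float
    height: float

@dataclass
class Bbox:                       # oriented 3D box with a semantic class
    class_name: str
    position_x: float; position_y: float; position_z: float
    angle_z: float
    scale_x: float; scale_y: float; scale_z: float

# The scene description itself is ordinary Python text, e.g.:
wall_0 = Wall(0.0, 0.0, 4.2, 0.0, height=2.8)
door_0 = Door(wall_id=wall_0.id, position_x=1.1, width=0.9, height=2.1)
sofa_0 = Bbox("sofa", 2.0, 1.5, 0.4, angle_z=1.57,
              scale_x=1.8, scale_y=0.9, scale_z=0.8)
```

Because the output is plain Python, relationships (the door referencing wall_0) and additional object classes or attributes can be expressed without changing the tokenizer, which is the flexibility argued for above.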

Comment

I appreciate that the authors provide a detailed comparison with SceneScript. However, I think the difference is largely an engineering one, and my concerns regarding the incremental originality of this paper are not fully addressed. Furthermore, the experimental results were corrected during the discussion phase, making this paper potentially not ready for publication. I would recommend that the authors further consider the benefits of evolving the paradigm of SceneScript by building it upon an LLM (perhaps extending it to more general tasks), which could make the submission more convincing. Therefore, I would keep my original rating.

Comment

We thank the reviewer for the comments. Instead of proposing a new algorithmic principle, our contribution mainly lies in extending LLM’s capacity to structured 3D scene modeling. We believe this is an important capacity for generalist models to have (e.g., Qwen), especially for applications like embodied agents.

We also thank the reviewer for suggesting extending the work to more general tasks. Indeed, the structured scene description can be invoked by LLM in various scenarios, such as interactive 3D scene generation and Q&A, which is an interesting future direction.

Review (Rating: 3)

This paper presents a framework for 3D visual grounding using pretrained language models augmented with spatial inductive biases. The model takes as input a structured scene layout consisting of object labels and their associated 3D attributes (position, size, orientation), and predicts the 3D bounding box corresponding to a natural language query. Spatial reasoning capabilities are introduced into a transformer-based language model by adding spatial embeddings and modifying the attention mechanism to reflect geometric proximity. Experiments on synthetic and real-world layouts demonstrate improved grounding accuracy over baseline models that lack spatial awareness. The approach is positioned at the intersection of 3D vision and multimodal language reasoning, with contributions focused on task-specific integration rather than architectural or theoretical innovation.

Strengths and Weaknesses

Strengths: The paper addresses a well-scoped problem at the intersection of 3D visual grounding and language-based spatial reasoning. It presents a clean architectural adaptation of transformer-based language models by introducing spatial inductive bias through position-aware embeddings and attention masking strategies. The proposed formulation is modular and lightweight, requiring no changes to the core transformer mechanism, which enhances reproducibility and compatibility with existing pretrained language models. Empirical evaluations span both synthetic environments and real-world datasets (ScanNet), with quantitative metrics showing consistent improvements over non-spatial baselines. The experimental design includes comparative analysis with spatial relation prediction and 3D bounding box localization, both of which validate the proposed enhancements. The use of object-centric representations as input provides computational efficiency and avoids the need for raw perceptual processing, making the system well-suited for layout-level inference tasks. Furthermore, the dataset construction, training setup, and evaluation protocol are clearly described, improving transparency and replicability. Conceptually, the paper explores how structural priors can be aligned with the attention mechanism in LLMs to support grounded reasoning. This contributes to ongoing discourse in multimodal learning on how domain-specific biases can be embedded into general-purpose models.

Weaknesses: Despite practical performance gains, the architectural contributions are modest. The proposed spatial embedding and attention mask mechanisms mirror existing practices in relational modeling and transformer locality bias, offering no new theoretical or algorithmic innovations. The model’s dependence on clean, discrete scene layouts means it cannot operate in realistic settings where perception is partial, noisy, or continuous. The spatial bias mechanism is treated as a black-box utility without detailed analysis of its effect on attention dynamics, training stability, or generalization. There is no investigation into whether the model internalizes spatial hierarchies or relational patterns, nor whether the improvements stem from inductive bias or training data regularity. Moreover, the model is evaluated only in bounded task settings involving synthetic or preprocessed layouts, and its performance under distributional shift, ambiguous language input, or sparse spatial cues is not tested. These omissions weaken claims of general-purpose spatial reasoning. From a broader research positioning, the work contributes primarily to task-specific improvements within the domain of 3D vision and grounding. It does not generalize to open-vocabulary grounding, sensor-based reasoning, or joint perception-language planning. As such, the contribution is best viewed as an applied extension of language modeling for layout-conditioned reasoning, rather than a conceptual advance in foundational representation learning.

Questions

How sensitive is the model’s performance to the accuracy or sparsity of the input 3D layout? Would the method remain effective under noisy or incomplete object representations, as would be expected in real-world perception pipelines? How does the spatial inductive bias affect attention behavior or sample efficiency during training? Is there evidence that the proposed modifications improve generalization to unseen spatial configurations or object arrangements beyond the training distribution? Finally, could the same spatial reasoning be achieved by finetuning with task-specific data alone, without architectural modifications?

Limitations

The method assumes access to structured scene layouts composed of object-centric tokens with precise 3D attributes including class, position, orientation, and size. These layouts are either synthetically generated or derived from processed ScanNet scenes, relying on accurate object detection, segmentation, and pose estimation pipelines. The performance and applicability of the model in scenarios where such structured representations are incomplete, noisy, or unavailable are not addressed. While the proposed spatial inductive bias improves performance on layout-conditioned reasoning tasks, it is unclear how the model would behave when exposed to real-world uncertainty, such as missing objects, ambiguous queries, or spatial occlusions. Additionally, the approach is strictly conditioned on discrete object instances, limiting extensibility to raw sensory data such as point clouds or voxelized scenes. The spatial inductive bias is implemented via embedding injections and localized attention masking, but its impact on model interpretability or reasoning trajectories is not analyzed. There is no investigation into whether the model develops spatial priors or relational structures beyond what is encoded in the training data. Generalization to unseen object configurations, novel spatial relationships, or out-of-distribution compositions is not systematically evaluated, which constrains the scope of the method’s robustness claims. Lastly, the model does not perform end-to-end reasoning from language to perception but instead operates in a middle-layer abstraction, making it difficult to assess its utility in embodied tasks or real-world robotic applications without substantial integration with upstream vision modules. More broadly, the research is situated within a structured-layout paradigm where perception is abstracted away, which limits its relevance to fields focused on holistic scene understanding or sensor-to-action pipelines. This constrains the potential impact of the work beyond its immediate 3D layout reasoning setting.

Formatting Issues

None

Author Response

We thank the reviewer for the comments. Below is our detailed response.

Weaknesses

  1. [Architectural contributions of our work] There seems to be a misunderstanding about the contributions of our work. In this paper, we use existing point cloud encoders such as Sonata and connect them to an LLM with the standard “Encoder-MLP-LLM” architecture (a minimal sketch of this wiring is given after the last item in this list). We do not propose any new architectural modifications, such as “modifying the attention mechanism to reflect geometric proximity”, as suggested by the reviewer.

    Instead, our primary goal is to train LLMs for structured indoor modeling tasks (i.e., layout estimation and 3D object detection). As stated in Section 1, our main contributions are

    • (1) We regard the structured descriptions as scripts of a general-purpose language (i.e., Python) and propose to predict the language in text form,
    • (2) We propose a new synthetic dataset for the task, and conduct the first empirical study on the best strategy of aligning the point cloud input with LLMs for structured scene modeling, and
    • (3) We empirically show that, by first training on our large-scale dataset and then on smaller downstream task data, our model gives competitive performance on public benchmarks.
  2. [Model’s dependency on clean, discrete scene layouts] There seems to be a misunderstanding of SpatialLM’s input. For example, the reviewer wrote “The model takes as input a structured scene layout consisting of object labels and their associated 3D attributes (position, size, orientation), and predicts the 3D bounding box corresponding to a natural language query.” But SpatialLM actually takes a 3D point cloud as input and generates structured scene descriptions as output. In other words, the “structured scene layout” is the output of our model, not the input.

    Similarly, the reviewer wrote that “The method assumes access to structured scene layouts composed of object-centric tokens with precise 3D attributes including class, position, orientation, and size.” This statement is not accurate for the same reason.

    In terms of handling incomplete or noisy point cloud inputs, please note that in Section 3.2, we conduct experiments on ScanNet, a real dataset whose point clouds are reconstructed from imperfect RGB-D data and contain holes, occlusions, and noise. In Section 3.3, we also report 3D indoor modeling results with point clouds reconstructed from monocular videos. These experiments demonstrate the robustness of our model to real-world inputs.

  3. [Investigation into the spatial bias mechanism] We point out that SpatialLM follows the standard “Encoder-MLP-LLM” architecture for MLLMs. We leverage existing networks such as Qwen2.5-0.5B as the base model and Sonata as the point cloud encoder. We respectfully refer the reviewer to the Sonata paper [1] for a detailed study about how the point cloud encoder captures spatial context and priors.

  4. [Generalization under distributional shift, ambiguous language input, or sparse spatial cues] In Section 3.3 (Zero-shot Detection on Videos), we show that our model exhibits reasonable generalizability across datasets. However, as we acknowledged in Section 4, fine-tuning SpatialLM on a task-specific dataset is still needed to achieve the best performance. Further scaling up the data and model size could be a promising future direction.

  5. [Limitation to task-specific improvements] We acknowledge that SpatialLM, in its current form, is a task-specific model for structured indoor modeling. Our plan is to focus on this well-established task first, and to show a feasible path to tackle 3D indoor modeling tasks (i.e., layout estimation and 3D object detection) with LLMs.

    In future work, we can improve its generalizability by generating open-vocabulary annotations with MLLMs and training them together with large-scale vision-language datasets. In fact, it is not uncommon for researchers to focus on one capacity of MLLMs at a time, such as video understanding [2] and function calling [3], with the understanding that the findings will later be integrated into a general-purpose foundation model.
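As a companion to item #1 above, here is a minimal sketch of the standard “Encoder-MLP-LLM” wiring. The class name, dimensions, and the HuggingFace-style inputs_embeds call are placeholders for illustration; in practice the point cloud encoder (e.g., Sonata) and the base LLM (e.g., Qwen2.5-0.5B) are loaded from pretrained checkpoints, and the exact connector design follows the ablations in Section 2.

```python
import torch
import torch.nn as nn

class PointCloudLLM(nn.Module):
    """Minimal 'Encoder-MLP-LLM' sketch: a point cloud encoder produces point
    tokens, an MLP projector maps them into the LLM embedding space, and the
    LLM generates the scene script as plain text. Dimensions and module
    choices are illustrative assumptions, not the paper's configuration."""

    def __init__(self, point_encoder: nn.Module, llm: nn.Module,
                 enc_dim: int = 512, llm_dim: int = 896):
        super().__init__()
        self.point_encoder = point_encoder      # e.g. a pretrained Sonata/PTv3
        self.projector = nn.Sequential(         # lightweight MLP connector
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                          # e.g. a small decoder-only LLM

    def forward(self, points: torch.Tensor, text_embeds: torch.Tensor):
        point_tokens = self.point_encoder(points)               # (B, N, enc_dim)
        point_embeds = self.projector(point_tokens)             # (B, N, llm_dim)
        # Point tokens are prepended to the text prompt embeddings; the LLM is
        # then trained with the usual next-token objective on the scene script.
        inputs = torch.cat([point_embeds, text_embeds], dim=1)  # (B, N+T, llm_dim)
        return self.llm(inputs_embeds=inputs)
```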

Questions

  1. [Model’s sensitivity to the accuracy and sparsity of 3D layout] Please refer to item #2 above.

  2. [Analysis on spatial inductive bias] Please refer to item #3 above.

  3. [Generalization to unseen spatial configurations] Please refer to item #4 above.

  4. [Achieving spatial reasoning by finetuning with task-specific data alone, without architectural modifications] Please note that we do not propose any architectural modifications in this paper. Instead, we follow the standard “Encoder-MLP-LLM” architecture for MLLMs, and leverage existing networks such as Qwen2.5-0.5B as the base model and Sonata as the point cloud encoder.

Limitations

We respectfully refer the reviewer to our response to the paper's weaknesses above for discussions about the contributions of our work, the input of SpatialLM, investigation on the spatial inductive bias, generalization to unseen object configurations, and potential impact on 3D scene understanding.

References

  • [1] Wu et al., Sonata: Self-Supervised Learning of Reliable Point Representations. In CVPR, 2025.
  • [2] Zohar et al., Apollo: An Exploration of Video Understanding in Large Multimodal Models. In CVPR, 2025.
  • [3] Zhang et al., xLAM: A Family of Large Action Models to Empower AI Agent Systems. In NAACL, 2025.
Comment

I appreciate the authors' clarification regarding the architectural details and input modality, and acknowledge my previous misunderstanding of these aspects. Nevertheless, the primary concerns regarding limited novelty and incremental contributions remain largely unaffected by these clarifications. The method integrates existing components without substantial theoretical innovation or novel algorithmic principles, constraining the overall impact. I recommend explicitly clarifying these points within the manuscript to avoid similar misunderstandings in the future. Consequently, my initial rating remains unchanged.

Comment

We thank the reviewer for the comments. Instead of proposing a novel theory or algorithmic principle, our work focuses on expanding LLMs’ capacity beyond text and image, towards 3D scene modeling. We believe this is an important capacity for generalist models to have (e.g., Qwen), especially for applications like embodied agents. Further, we point out that training an LLM for 3D scene modeling is not a trivial task, requiring the curation of a new, large-scale dataset (Section 2.1) and careful examination of network modules (Section 2.2) and the training schedule (Section 2.3). Through systematic experiments, our work may provide valuable insight to the research community on training future LLMs for 3D vision tasks.

Review (Rating: 5)

This paper presents an LLM tailored for structured understanding. SpatialLM predicts architectural elements (walls, doors, windows) and oriented object boxes with semantic categories from 3D point clouds. It follows a standard “Encoder-MLP-LLM” architecture and is jointly trained for each downstream task. SpatialLM is evaluated on two downstream tasks: (1) layout estimation on the Structured3D dataset, where SpatialLM outperforms the state-of-the-art models RoomFormer and SceneScript, and (2) 3D object detection on ScanNet, where SpatialLM outperforms SceneScript and is competitive with V-DETR.

Strengths and Weaknesses

Strengths:

  • The paper presents a large-scale (12,328 scenes) synthetic SpatialLM dataset with layout and object annotations. The visualization shows that this dataset is of high quality in terms of realism.
  • The paper presents a unified model that can accept language prompts to solve different 3D structured understanding tasks. The architecture follows standard multimodal LLM and can leverage the rich-knowledge encoded from pretrained LLMs.
  • There are some interesting ablations and empirical study on point cloud encoders and training schedules.
  • Strong performance: SpatialLM achieves state-of-the-art performance on layout estimation of 3D scenes on Structured3D dataset and competitive performance on 3D object detection on ScanNet.
  • The paper is clearly written and very easy to follow. There are also many nice visualizations, making the results more convincing.

Weaknesses:

  • Even though SpatialLM uses the same architecture for different tasks, it is not universal:
    • It requires dedicated training for different tasks and datasets.
    • And the training involves training all the three components of the “Encoder-MLP-LLM” architecture for best performance (tab 4), which defeats the purpose of large and universal models.
  • This paper aims for large language models and one important consideration should be model complexity and training data during evaluation. For tab. 2, 5, 6, please add a column of number of parameters so that readers know how model complexity affects results. For tab. 5, 6, please also add a column of training data to make the data factor more clear.
  • Some statement is not precise or wrong:
    • L24-26: the paper states that the first contribution is to “regard the structured descriptions as scripts of a general-purpose language (i.e., Python) and propose to predict the language in text form”. However, I think this is what SceneScript already does. What is the difference between this contribution and SceneScript?
    • L297: “RoomFormer [58] proposes to train a two-stage Transformer-based network to predict corners and rooms.” I think RoomFormer is a single-stage network using two-level queries, not two-stage.
  • Organization of the paper: I feel the naming is misleading. To me, Section 2 “An Empirical Study on Point Cloud Feature Alignment” mainly discusses SpatialLM, while section 3 “SpatialLM” is actually the experiment section. Therefore, I suggest to rename section 2 as “SpatialLM” and section 3 as “Experiments”, or something indicative.

Questions

  • What is the difference between the claimed 1st contribution and SceneScript?
  • Structured3D is a multi-room dataset while ScanNet is a single-room dataset. Does SpatialLM do inference on multiple rooms directly, or can it only do inference in a single-room manner?
  • Section 2.2 explored different point cloud encoders. However, eventually the point cloud encoder still needs to be trained jointly for each downstream task (tab. 4). Also, L198-199 says “current pre-trained point cloud encoders are not as versatile in supporting downstream tasks yet.” This motivates the question: is a pretrained point cloud encoder really necessary? How about just initializing the encoder (e.g. SPTv3) with random weights and then jointly training it with the MLP and LLM?

Limitations

Yes.

Final Justification

Thanks to the authors for the rebuttal. Initially, my major concern was the difference between this work and SceneScript. The authors' rebuttal helped address this concern. Some of my other questions/concerns are also addressed. Thus I would like to increase my rating from "Borderline accept" to "Accept". However, I have some suggestions for the authors to consider when revising the paper - see my comments.

Formatting Issues

None.

Author Response

We thank the reviewer for the comments. Below is our detailed response.

Weaknesses

  1. Thanks for the thoughtful comments. We acknowledge that SpatialLM, in its current form, is a task-specific model for structured indoor modeling. Our plan is to focus on this well-established task first, and to show a feasible path to tackle 3D indoor modeling tasks (i.e., layout estimation and 3D object detection) with LLMs.

    In future work, we can improve its generalizability by generating open-vocabulary annotations with MLLMs and training them together with general-purpose vision-language datasets. In fact, it is not uncommon for researchers to focus on one capacity of MLLMs at a time, such as video understanding [1] and function calling [2], with the understanding that the findings will later be integrated into a general-purpose foundation model.

  2. Thanks for the suggestion. In the revised paper, we will add columns to the tables to make the model complexity and data factors clearer.

  3. [Lines 24 - 26: difference from SceneScript] Thanks for the question. Indeed, our work is inspired by SceneScript. However, a key difference is that SceneScript proposes a task-specific tokenizer, which is not compatible with the text tokenizers used in modern LLMs, and it trains the tokenizer with a task-specific decoder from scratch (please refer to Sections 4.2 and 4.3 of the SceneScript paper for details). In other words, a new domain-specific language with the associated tokenizer and decoder is proposed in SceneScript. This is also noted in Section 4.1 (lines 302 - 303) of our paper.

    In contrast, SpatialLM directly exploits modern LLMs' capacity for Python language generation, without designing any new tokenizer or decoder. Hence our first contribution: to "regard the structured descriptions as scripts of a general-purpose language (i.e., Python) and propose to predict the language in text form".

    Furthermore, by building upon existing LLMs, our experiment in Section 3 also reveals other differences from SceneScript. For example, as shown in Tables 5 and 6, while training on existing datasets like Structured3D and ScanNet may be sufficient for SceneScript to achieve competitive performance, these datasets are too small for SpatialLM. Instead, by first training on our new dataset, SpatialLM outperforms SceneScript on both tasks.

  4. [Line 297: RoomFormer is a single-stage network using two-level queries] Thanks for pointing this out. We will fix it in the revised paper.

  5. [Organization of the paper] Thanks for the suggestion. We agree that Section 2 primarily discusses the design choices made in SpatialLM through a series of ablation studies, whereas Section 3 compares SpatialLM to existing methods on various benchmarks. We will adjust the section titles to better reflect their content in the revised paper.

Questions

  1. Please refer to item #3 above.

  2. SpatialLM is trained on our new dataset with a mix of single-room layouts and multi-room layouts. Therefore, it can do inference in both single-room and multi-room manners.

  3. For SPTv3, we follow common practice to leverage a pre-trained model as a starting point, rather than training it from scratch. Please note that our claim in lines 198 - 199 is more of a comparison of existing 3D point cloud encoders to powerful 2D image encoders, SigLIP and DINOv2, where one can freeze the image encoders during MLLM training and still obtain good results.

References

  • [1] Zohar et al., Apollo: An Exploration of Video Understanding in Large Multimodal Models. In CVPR, 2025.
  • [2] Zhang et al., xLAM: A Family of Large Action Models to Empower AI Agent Systems. In NAACL, 2025.
Comment

Thanks to the authors for the rebuttal. Initially, my major concern was the difference between this work and SceneScript. The authors' rebuttal helped address this concern. Some of my other questions/concerns are also addressed. Thus I would like to increase my rating from "Borderline accept" to "Accept". However, I have some suggestions for the authors to consider when revising the paper:

> We acknowledge that SpatialLM, in its current form, is a task-specific model for structured indoor modeling.

In this sense, I am afraid the paper should rephrase "SpatialLM", because "Spatial" covers many more tasks than structured indoor modeling. A general title is indeed appealing to a wider audience, but it can also be misleading.

> In the revised paper, we will add columns to the tables to make the model complexity and data factors clearer.

This work also proposes a large-scale (12,328 scenes) synthetic SpatialLM dataset. To disentangle the contributions of methodology and data, I suggest the authors also retrain some baselines on the proposed dataset. For example, on layout estimation, the authors could retrain RoomFormer and SceneScript on the proposed dataset and then follow the same evaluation protocol as the proposed method (i.e., pretraining on the SpatialLM dataset and then finetuning on Structured3D). This will (1) make the comparison fairer, and (2) demonstrate the benefits of the new dataset (if it also improves the performance of the baselines).

Comment

Thanks for the comments and additional suggestions. We are glad that our response has helped address the previous concerns, such as the difference between our work and SceneScript. We will carefully consider the reviewer’s suggestions when revising our paper.

Review (Rating: 5)

This paper presents SpatialLM, a VLM finetuned for structured indoor modeling tasks. The authors curated a large-scale 3D dataset in simulation and finetuned a small language model to take input from point cloud features and output Python-style text language to annotate objects and room layouts in 3D. Experiments demonstrate the feasibility of this fine-tuning paradigm and the effectiveness of SpatialLM compared with baselines.

Strengths and Weaknesses

Strengths

  1. Methodology. This paper proves the feasibility of directly finetuning an LM, even a small one, for generating 3D indoor modeling, which is a meaningful augmentation to language models, especially useful for real-world applications.
  2. Dataset. The curated dataset seems to be of high quality and produces good pretraining effects (Tables 5 and 6). Details of the dataset are also presented in the appendix. If the dataset can be open-sourced as promised by the authors, it will be of great use to the community. Can the authors confirm the open-sourcing plan?
  3. Solid performance. Design choices are validated with experiments in Section 2. SpatialLM produces strong performance on layout estimation and 3D object detection tasks. Further finetuning on task data gives even better performance, which is a good sign for the potential usefulness of the model.

Weakness

  1. The necessity of using a language model for the tasks. As the paper states, the model is currently confined to a predefined set of objects, which raises the question of whether a language model is needed. It is also a waste of MLLMs' generalizability. For such a fixed set, can a strong 3D point cloud feature extractor plus a strong segmentation model suffice for the task? Meanwhile, the paper justifies the advantage of using the Python language as the output by stating that it is editable and easily extendable. However, such features are not demonstrated in the paper. It would be nice to see whether prompting the language model can help edit the layouts.
  2. Robustness. It seems the performance is mainly tested assuming perfect 3D inputs. Only in Section 3.3 is the model tested with reconstructed inputs. Although it does show some robustness, the experiment is only qualitative. Is it possible to run quantitative experiments on datasets that have 3D ground truth but use reconstructed 3D as input? What would the performance be like?

Questions

  1. Line 167: What exactly is this 5-level structure?
  2. Can the model be potentially extended to handle dynamic scenes when objects and layout change? If so, what is needed?
  3. How would you augment the object sets for different application scenarios? Does that require a significant amount of additional training?
  4. After training, does SpatialLM still possess general language generation capabilities? Can you freely prompt the model to answer scene-related questions?

Limitations

Yes

Final Justification

My questions and concerns are largely addressed. I do notice the concern raised by reviewer ML3L, that the originality of this work is compromised by the prior work SceneScript. While I acknowledge the difference in technical approaches and the improved performance, I feel reviewer ML3L's concern is also valid.

Setting aside this concern, I highly look forward to the open-sourcing of the entire dataset. I believe it will be a great contribution of this work to the community. I thank the authors for confirming the open-sourcing decision. However, this is hard to assess in the current double-blind reviewing stage. In light of this, I am willing to stay at my rating, but wish to slightly tune it down to a hypothetical 4.5 for the AC's reference.

Formatting Issues

No formatting concerns

Author Response

We thank the reviewer for the comments. Below is our detailed response.

  • [Confirmation of the open-sourcing plan.] Yes, we will open-source the dataset as promised in the paper.

Weaknesses

  1. [Waste of MLLMs' generalizability] We acknowledge that SpatialLM, in its current form, is a task-specific model for structured indoor modeling. Our plan is to focus on this well-established task first, and to show a feasible path to tackle 3D indoor modeling tasks (i.e., layout estimation and 3D object detection) with LLMs. By limiting to a predefined set of objects, we are able to conduct a direct comparison with SOTA methods, such as RoomFormer and V-DETR, on existing public benchmarks.

    In future work, we can improve its generalizability by generating open-vocabulary annotations with MLLMs and training them together with large-scale vision-language datasets. In fact, it is not uncommon for researchers to focus on one capacity of MLLMs at a time, such as video understanding [1] and function calling [2], with the understanding that the findings will later be integrated into a general-purpose foundation model.

  2. [Comparison to a strong 3D point cloud feature + a strong segmentation model] Indeed, several studies in the literature detect 3D objects via point cloud instance segmentation, as suggested by the reviewer. For example, in V-DETR, the authors also included some point cloud instance segmentation methods, such as 3D-MPA, as baselines. Here, we note that, as a bottom-up approach, the instance segmentation methods have several limitations. First, they can only output axis-aligned bounding boxes (AABB) without an orientation. Second, they may fail to predict the correct object sizes in the presence of holes, occlusions, and noise. In contrast, our method can infer full object sizes and orientations from sparse or occluded inputs (please see lines 284 - 291 of the paper for more discussion and some examples).

    Furthermore, as we stated earlier, our long-term goal is to equip MLLMs with the capacity to make structured predictions in text form for 3D indoor modeling. Therefore, a dense prediction method for instance segmentation is not suitable for our purpose.

  3. [Use of Python language] Thanks for the suggestion. In Section 4.2 of the supplementary material, we provide some preliminary results on task adaptation via language-based prompts, including (i) detection with user-specified categories and (ii) semantic label completion. As for layout editing, we believe it is feasible with our current dataset by generating paired data with language instructions and editing actions in the future. But in this paper, our focus is on the reconstruction of the layout from point clouds.

  4. [Robustness to imperfect point clouds] Please note that in Section 3.2, we conduct experiments on ScanNet, a real dataset whose point clouds are reconstructed from imperfect RGB-D data and contain holes, occlusions, and noise. In Section 3.3 of the supplementary material, we also report quantitative 3D indoor modeling results with point clouds reconstructed from monocular videos (Table 6). Please refer to the supplementary results for more discussions.

Questions

  1. Line 167 (5-level hierarchical structure): The 5-level hierarchical structure refers to the 5 layers of neural blocks in the 3DCNN and Sonata/PTv3 encoders. The first layer doubles the feature dimensions but keeps the point cloud resolution the same, and the following 4 layers each downsample the point cloud resolution by half (an illustrative sketch is given after the last item in this list). We will add a figure in the revised paper to illustrate this structure.

  2. Thanks for the interesting question. Potentially, we can use SpatialTrackerV2 [3] to reconstruct the 3D point cloud of both static scenes and dynamic moving objects with 3D tracking trajectories. Then, we may use a SpatialLM-like architecture to reconstruct the static layout and the moving object bounding boxes in different time steps auto-regressively. In addition, a large-scale point-cloud dataset with annotations of dynamic objects (e.g., humans) is needed.

  3. In this paper, we only keep the 59 common categories while filtering objects in the long-tail distribution. To improve the model’s generalizability, one may augment the labels of all objects in our dataset with VLM captions. Specifically, one can get a diverse set of annotations by performing image captioning with rendered images of each object in our dataset (as shown in Figure 4 of the supplementary material). Alternatively, researchers may use a different dataset of their own for their specific applications. In either case, a one-stage training schedule as discussed in Section 2.3 is likely required.

    Besides, it is worth noting that the amount of training required also depends on the point cloud encoder. The Sonata/PTv3 point cloud encoder used in SpatialLM follows a DINO-like self-supervised training strategy. Such a strategy focuses on capturing local and global visual features, and does not consider alignment between visual features and text. A more powerful point cloud encoder with enhanced vision-language alignment will likely reduce the amount of training required, while better supporting language-based applications.

  4. In this paper, we focus on training LLMs to generate a structured scene description from point clouds only. This is one of the limitations of the paper (as stated in lines 335 - 337). It cannot answer scene-related questions based on the point clouds in natural language, as it is not trained to do so. And after finetuning, its general language generation capabilities may also be severely impacted.

    However, as we discussed before, we believe our work is an important first step towards equipping LLMs with general 3D scene understanding and generation capacities. And by training together with other general-purpose vision-language datasets, we can restore and improve the model’s general language generation capabilities.
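To make the 5-level structure described in item #1 above concrete, here is a small illustrative schedule. The channel widths and the starting dimension are placeholders and will differ from the actual 3DCNN and Sonata/PTv3 configurations.

```python
# Hypothetical 5-level encoder schedule matching the description above:
# level 0 doubles the feature width at full resolution; levels 1-4 each halve
# the point cloud resolution. Channel widths are placeholders only.
in_channels = 64
channels, downsample_factor = in_channels, 1
levels = []
for level in range(5):
    if level == 0:
        channels *= 2              # double features, keep resolution
    else:
        downsample_factor *= 2     # halve the point cloud resolution
    levels.append({"level": level, "channels": channels,
                   "downsample": downsample_factor})

for cfg in levels:
    print(cfg)   # level 0 stays at 1x resolution; level 4 is 1/16 of the input
```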

References

  • [1] Zohar et al., Apollo: An Exploration of Video Understanding in Large Multimodal Models. In CVPR, 2025.
  • [2] Zhang et al., xLAM: A Family of Large Action Models to Empower AI Agent Systems. In NAACL, 2025.
  • [3] Xiao et al., SpatialTrackerV2: 3D Point Tracking Made Easy. In ICCV, 2025.
Comment

Thank the authors for the response. My questions and concerns are largely addressed. I do notice the concern raised by reviewer ML3L, that the originality of this work is compromised by the prior work SceneScript. While I acknowledge the difference in technical approaches and the improved performance, I feel reviewer ML3L's concern is also valid.

Setting aside this concern, I highly look forward to the open-sourcing of the entire dataset. I believe it will be a great contribution of this work to the community. I thank the authors for confirming the open-sourcing decision. However, this is hard to assess in the current double-blind reviewing stage. In light of this, I am willing to stay at my rating, but wish to slightly tune it down to a hypothetical 4.5 for the AC's reference.

Comment

Thanks for acknowledging the difference in technical approaches and the improved performance with respect to SceneScript. Indeed, empowering LLMs for structured indoor modeling is not a trivial task, requiring the curation of a new, large-scale dataset (Section 2.1) and careful examination of network modules (Section 2.2) and the training schedule (Section 2.3). We look forward to open-sourcing the entire dataset to the research community soon as well.

Final Decision

The paper presents SpatialLM, a framework for adapting large language models to structured indoor scene modeling, supported by a large synthetic dataset. The method improves spatial reasoning and shows promising results on layout understanding tasks.

Reviewer scores are mixed (two positive, two negative). While concerns remain about incremental novelty and overlap with SceneScript, the rebuttal clarifies differences, addresses technical questions, and commits to stronger evaluations in the camera-ready version.

Overall, the contribution is considered valuable, especially with the dataset release and the demonstrated improvements in structured indoor modeling tasks. The AC recommends acceptance, with the expectation that the camera-ready version further strengthens comparisons to prior work, clarifies methodological novelty, and incorporates the promised analyses.