PaperHub

NeurIPS 2024 · Poster · 4 reviewers

Overall rating: 5.5 / 10 (individual scores: 4, 6, 7, 5; min 4, max 7, std 1.1)
Confidence: 4.3 · Correctness: 3.0 · Contribution: 2.0 · Presentation: 2.5

Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers

Submitted: 2024-05-02 · Updated: 2024-11-06
TL;DR

A 3D large language model unifying referencing and grounding capabilities.

Keywords
3D Scene Understanding · Multi-modal Large Language Model

Reviews and Discussion

Review (Rating: 4)

This paper proposes a 3D Multi-modal Large Language Model (MLLM) designed to perceive and represent 3D scenes at the object level. To interpret individual object instances, the authors develop object identifiers to convert the 3D scene into a series of distinguishable object tokens and present object-centric representations using foundation models. Experiments are conducted on various 3D scene-language tasks.

Strengths

  1. Introducing Large Language Models (LLMs) into 3D perception and representation is a valuable and innovative research direction.

  2. Leveraging foundation models to extract 3D and 2D embeddings shows significant potential for enhancing the performance and capabilities of the 3D MLLM model.

Weaknesses

  1. The concept of object identifiers is not new, as similar methods have been previously introduced, such as in "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V."
  2. The authors claim the proposed model enables "efficient object referencing and grounding abilities," but this efficiency is not evaluated in the experiments. Furthermore, experiments on 3D referring expression segmentation are not provided, making this claim hard to substantiate.
  3. It is unclear how the proposed object-centric representations address the problem of 3D data scarcity.
  4. Table 2 lacks comparisons with several notable works, such as the state-of-the-art method CORE-3DVG (NeurIPS 2023) on the ScanRefer dataset. More existing methods should be included for a comprehensive evaluation.

Questions

Please refer to the weaknesses section.

Limitations

The authors have discussed the limitations and potential societal impact of the proposed method.

Author Response

W1: Difference with Set-of-Mark.

  • Different ways to introduce object identifiers. Set-of-Mark attaches object identifiers directly onto the image, relying entirely on the multimodal LLM’s OCR capability to perceive the identifiers from the image. This method is indirect and can introduce ambiguity, especially when there are many objects in the image. Our method explicitly assigns the identifiers to each object in the language prompt. Intuitively, it is easier for the LLM to understand the link between the object and its identifier from the text.

  • Inefficiency of adapting Set-of-Mark to 3D. Adapting Set-of-Mark for 3D perception requires cross-view clustering of the segmented masks (from SAM) and labeling object identifiers on multi-view images. However, there is currently no MLLM that can handle a sequence of images with strong OCR capability.

  • Prompt-based vs Training-based. Set-of-Mark is a prompt-based method, which could produce uncontrollable outputs. Our training-based method can stably handle a series of tasks including grounding, captioning, and QA.

Finally, our SOTA performance across various 3D benchmarks demonstrates the effectiveness of our proposed use of object identifiers. Thus, the similar concept of object identifiers in Set-of-Mark does not diminish the contribution or novelty of our work.

W2.1: Evidence of object referencing and grounding abilities.

Section 3.3, along with Figures 2 and 3, illustrates how object identifiers are utilized during interactions with LLMs. Users can reference objects with identifiers for tasks like 3D dense captioning (Scan2Cap), while the LLM responds with identifiers to ground objects for both single object grounding (ScanRefer) and multiple object grounding (Multi3DRefer). The superior performance of our model on benchmarks such as Scan2Cap, ScanRefer, and Multi3DRefer demonstrates its efficient object referencing and grounding abilities.
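For readers unfamiliar with this interaction format, the sketch below illustrates how identifier-based referencing and grounding might look in practice. The prompt wording and the identifier format (e.g., `<obj07>`) are illustrative assumptions, not the paper's verbatim templates.

```python
# Hypothetical prompt/response pairs using object identifiers.
# Only the idea of answering or referencing via <objK> tokens follows the
# description above; the exact wording is made up for illustration.

# Single-object grounding (ScanRefer-style): the model replies with an identifier.
grounding_query = "Which object is the black chair next to the window?"
grounding_reply = "<obj07>"

# Multi-object grounding (Multi3DRefer-style): several identifiers may be returned.
multi_grounding_reply = "<obj07> <obj12>"

# Dense captioning (Scan2Cap-style): the user references an object by identifier.
caption_query = "Describe <obj07>."
caption_reply = "A black office chair placed beside the window, facing the desk."
```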

W2.2: Results of 3D referring expression segmentation.

Table A. Evaluation results of 3D referring expression segmentation on ScanRefer.

| Method       | Nr3D Overall | Nr3D Easy | Nr3D Hard | Nr3D View Dep | Nr3D View Indep | Sr3D Overall | Sr3D Easy | Sr3D Hard | Sr3D View Dep | Sr3D View Indep |
|--------------|--------------|-----------|-----------|---------------|-----------------|--------------|-----------|-----------|---------------|-----------------|
| 3DVG-Trans   | 40.8         | 48.5      | 34.8      | 34.8          | 43.7            | 51.4         | 54.2      | 44.9      | 44.6          | 51.7            |
| TransRefer3D | 48.0         | 56.7      | 39.6      | 42.5          | 50.7            | 57.4         | 60.5      | 50.2      | 49.9          | 57.7            |
| MVT          | 59.5         | 67.4      | 52.7      | 59.1          | 60.3            | 64.5         | 66.9      | 58.8      | 58.4          | 64.7            |
| 3D-VisTA     | 57.5         | 65.9      | 49.4      | 53.7          | 59.4            | 69.6         | 72.1      | 63.6      | 57.9          | 70.1            |
| Ours         | 63.9         | 75.7      | 52.6      | 53.8          | 69.2            | 73.1         | 78.3      | 60.8      | 66.6          | 73.4            |

Table A shows that our method surpasses previous baselines for 3D referring expression segmentation on ScanRefer. It is important to note that we did not provide referring expression segmentation results on ScanRefer simply because it is not a common evaluation metric for this dataset. Most previous baselines assess accuracy based on the IoU between the predicted boxes and the ground-truth boxes. The comparison based on box IoU can already demonstrate our model’s grounding ability.
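Since grounding accuracy here is measured by the IoU between predicted and ground-truth boxes, a minimal sketch of axis-aligned 3D box IoU may help clarify the Acc@0.25 / Acc@0.5 metrics. This is a standard computation, not code from the paper; the box parameterization is assumed to be center plus size.

```python
import numpy as np

def box3d_iou(box_a, box_b):
    """IoU of two axis-aligned 3D boxes given as (cx, cy, cz, w, h, l)."""
    def corners(b):
        center, size = np.asarray(b[:3], float), np.asarray(b[3:], float)
        return center - size / 2, center + size / 2  # (min corner, max corner)

    min_a, max_a = corners(box_a)
    min_b, max_b = corners(box_b)
    overlap = np.clip(np.minimum(max_a, max_b) - np.maximum(min_a, min_b), 0, None)
    inter = overlap.prod()
    union = (max_a - min_a).prod() + (max_b - min_b).prod() - inter
    return float(inter / union) if union > 0 else 0.0

# Acc@0.25 / Acc@0.5 count a prediction as correct when IoU exceeds 0.25 / 0.5.
```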

W3: Why the proposed object-centric representations alleviate the problem of 3D data scarcity.

As discussed in Section 1 (Lines 72-84), training robust scene-level representations typically requires a large amount of paired scene-language data, which is difficult to obtain. To overcome this challenge, we represent scenes using object-centric representations derived from well-trained 3D and 2D encoders. Benefiting from pre-training on at least millions of samples, the 3D encoder excels at extracting spatial and shape attributes from point clouds, while the 2D encoder extracts rich appearance features of objects from multi-view images. We then propose to combine these well-trained 3D and 2D representations explicitly at the object level, along with the object identifiers, to comprehensively represent a whole scene.
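As a rough illustration of this design, the sketch below projects frozen per-object 3D and 2D features into the LLM embedding space. The module name, feature dimensions, and the use of simple linear projections are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class ObjectCentricProjector(nn.Module):
    """Illustrative sketch: map per-object 3D and 2D features to LLM token embeddings."""

    def __init__(self, dim_3d=1024, dim_2d=768, dim_llm=4096):
        super().__init__()
        self.proj_3d = nn.Linear(dim_3d, dim_llm)  # projects frozen 3D-encoder features
        self.proj_2d = nn.Linear(dim_2d, dim_llm)  # projects frozen 2D-encoder features

    def forward(self, feats_3d: torch.Tensor, feats_2d: torch.Tensor):
        # feats_3d: (n_objects, dim_3d) from a pre-trained 3D point-cloud encoder
        # feats_2d: (n_objects, dim_2d) pooled from multi-view 2D features per object
        return self.proj_3d(feats_3d), self.proj_2d(feats_2d)
```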

Unlike previous 3D LLMs such as LEO[20] and 3D-LLM[22], which require constructing additional data (on the order of a million samples) for pre-training or alignment, our method achieves state-of-the-art performance without additional alignment data. Since we adopt instruction-tuning techniques similar to theirs, the biggest difference between our model and theirs is the design of the representations. Thus, our superior performance with less data indicates that the proposed object-centric representations alleviate the problem of 3D data scarcity to some extent.

W4: Lack of comparison with SOTA methods.

In Table 2, we've included previous SOTA methods for each dataset:

  • ConcreteNet [45] with the highest Acc@0.5 on ScanRefer,
  • M3DRef-CLIP [63] with the highest F1@0.5 on Multi3DRefer,
  • Vote2Cap-DETR++ [10] with the highest CIDEr@0.5 on Scan2Cap,
  • Scene-LLM [17] with the highest CIDEr on ScanQA and the highest EM on SQA3D.

In Tables 6–10 in the appendix, we provide comprehensive comparisons on each dataset by including additional SOTA models.

CORE-3DVG is indeed a missing reference on ScanRefer. Although it surpasses our method by 1.25% on Acc@0.25, our method still demonstrates superior grounding performance with a 6.39% higher Acc@0.5. We will include CORE-3DVG as a baseline for comparison in a future version.

Comment

Thank you for providing the rebuttal. However, I noticed that the performance of the proposed method on hard objects is subpar, as shown in Table A. Additionally, the comparison involving pre-trained 2D and 3D encoders—compared to methods without pre-trained encoders like CORE-3DVG—raises some questions, particularly since the performance is even lower than CORE-3DVG on Acc@0.25. Therefore, I will revise my rating to a borderline reject, but I remain inclined towards rejection.

Comment

Thank you for your prompt response.

We would like to respectfully draw your attention to our main contribution (L11-15, L92-94), which is the unification of various 3D tasks (grounding, captioning, and QA) within an LLM-based model, rather than the development of specialized models for individual tasks. Most existing works, including CORE-3DVG (which is limited to 3D visual grounding), are designed as task-specific models or require task-specific fine-tuning. We believe that creating a 3D generalist model is a significant direction for future research.

For your concerns about performance comparison:

  1. Please note our model's leading performance on the overall metric, the primary index in 3D grounding, rather than focusing solely on the hard metric:

    • In Table A, without employing specialized model designs or adjusting specific hyperparameters for a single dataset, our model achieves the best overall performance compared to other works.
    • Compared to CORE-3DVG, our model demonstrates superior performance (+6.39%) in Overall Acc@0.5 and competitive performance in Overall Acc@0.25 on ScanRefer.
  2. Please note our model's leading performance across all remaining benchmarks rather than focusing on a single benchmark. Given this, we believe our model provides a solid baseline for the community in developing 3D generalist models.

Therefore, we think that focusing solely on a sub-metric from a particular dataset while overlooking our main contribution is not entirely appropriate. We would greatly appreciate it if you could reconsider your rating in light of this.

Review (Rating: 6)

The paper proposes a new representation for 3D multimodal LLMs, a family of foundation models that repurpose LLMs to receive multimodal (visual and linguistic) input. Specifically, the paper advocates for an object-centric representation, where objects are first discovered (detected or segmented) with an off-the-shelf model, then they are fed as tokens to the LLM. This happens in the following way: i) objects are discovered in 3D using a 3D detector; ii) for every object we get a language identifier (e.g., <obj1>), a local point cloud and a 2D mask (by projecting the 3D mask back to multiple 2D views); the language identifier, featurized local point cloud and featurized 2D segment form tokens that are fed to the LLM as part of the prompt.

This formulation unifies many visual-language tasks as text generation. As a byproduct, using task-specific prompts, the model can be trained on several tasks jointly, leading to improved performance. The results show quantitative gains on visual grounding, captioning and VQA.

Strengths

  1. (Presentation) The paper is very clearly presented and easy to understand. The writing makes the right claims and all the useful details and questions are answered in the main paper.

  2. (Contribution) The submission addresses an important problem, which is to ground the knowledge of LLMs in the visual world. The proposed scene tokenization is not novel per se, but its combination with MLLMs is a nice feature. Also, several important details, such as appropriate featurization using both multi-view 2D and 3D features, are very useful to see, because MLLMs haven't shown such good results so far with other input representations.

  3. (Soundness) The results indeed show that the proposed approach manages to use MLLMs in an effective way. More importantly, the comparisons include baselines that also train multiple tasks simultaneously, so the gain doesn't seem to come from more data alone. Although we cannot conclude that this architecture is better than baselines which may have trained on a single task only, it seems to be the strongest multi-task approach now.

Weaknesses

  1. (Contribution) The paper enters the debate of the right representation for visual-language (VL) understanding without giving a clear answer, unfortunately due to current benchmarks' limitations. Object-centric transformers for VL tasks have been proposed long ago (see VilBERT, VisualBERT, LXMERT, OSCAR, UNITER etc for 2D VL understanding), but then got superseded by one-stage approaches like MDETR. The issue with two-stage object-centric models is the definition of a vocabulary. Open-vocabulary methods cannot really rely on detector bottlenecks, since they, by design, have a limited vocabulary. We cannot possibly enumerate all the concepts a user may refer to.

That said, in 3D VL understanding and especially grounding, most approaches are indeed object-bottlenecked. This can be attributed to ScanNet being the base dataset for most 3D VL benchmarks. ScanNet limits the scope of in-domain approaches to few classes. One cannot for example detect the legs of a chair using ScanNet, since parts of objects are not annotated. This is true for both one-stage and two-stage approaches, as long as they are trained on ScanNet. But among the two directions, the one more promising to extrapolate on a broad domain seems to be the non-bottlenecked one.

However, the object-centric tokenization is not bad, it provides a nice abstraction of the scene. My concern is that a detector will not cover some useful parts of the scene and, as a result, that part of the scene won't be visible at all to the model. This can be due to imperfect prediction or even limited concept vocabulary. Given how impressive the VLMs' generalization is, limiting them with in-domain detectors may be handicapping them.

  2. (Soundness) It seems that a lot of the quantitative gain comes from using 2D features. While it is fair to use 2D features in the proposed approach, I'm not sure whether the main factor behind the good results is the use of object identifiers, 2D features, or multi-tasking through the unification of the output space. For example, an approach that does the same unification (with different prompts of course) and uses scene tokens and multi-view features (for example by unprojecting 2D features from multiple views and performing some voxelization) would be the best baseline to validate the importance of object identifiers. Which baseline or ablation represents that direction? Right now, we can mainly conclude that 2D features are very important and that the format of object identifiers is better than other alternatives, but we cannot safely conclude that object identifiers are the most useful component.

Another interesting fact that I noticed in this paper is that the best-performing models on grounding (ScanRefer) are the ones trained only on ScanRefer. In fact, the proposed approach is the only multi-task approach that beats those models. I believe it would be useful to also see some results on Nr3D and Sr3D for several reasons. First, the margins on grounding are a bit narrow, so we don't know whether the most competitive approaches lack due to less data or architecture. Evaluating on more datasets could provide some more evidence. Second, Nr3D and Sr3D use ground-truth boxes. It's good to see what is the limit of the proposed approach if a part of perception is perfect.

Questions

Overall, I appreciate the presentation of the paper, its claims, good results and interesting technical contributions. At the same time, thinking broader of the field, I'm not sure whether this paper provides enough evidence towards the right direction; going through an in-domain bottleneck may drastically limit the generality modern LLMs/VLMs can offer. Moreover, the ablations lack some targeted experiments that would help us verify the proposed components.

I would appreciate if the response can clarify/add some of the following:

  • Discussion on the scope of using object-bottlenecked 3D VL models for the general field (beyond evaluating on ScanNet).

  • Discussion on the relation of this work to previous object-centric tokenization approaches from the 2D VL domain.

  • Adding results on Nr3D/Sr3D.

  • Adding some more targeted ablations that switch off object identifiers.

I will adapt my score after the discussion.


Post-rebuttal, some concerns are addressed and I'm increasing my score from 4 to 6.

Limitations

Addressed

Author Response

W1.1: Concerns about the recent trend of one-stage replacing two-stage methods in 2D.

Thanks for pointing out a promising future direction. However, one-stage models require large-scale training; for example, MDETR used 1.3M image-text pairs for pre-training. Given the currently limited 3D data (1,200 scenes and 150K language annotations), our SOTA performance across various benchmarks highlights the effectiveness of our two-stage architecture. We leave data scaling and the exploration of end-to-end architectures as future work.

Q1.1 & W1.2: Discussion on the scope of using object-bottlenecked 3D VL models.

Firstly, it's important to note that open-vocabulary is not claimed as a contribution in our paper, nor do the previous baselines (either one-stage or two-stage) include formal open-vocabulary evaluations.

As discussed in the Limitation section (Lines 594-599), we acknowledge the limitation of relying on object detectors. Based on object detectors, our object identifiers can refer to object-level instances or clustered objects but fail to represent part-level concepts. We will clearly state this in the revised version.

Although part-level detectors such as PointNeXt[a] could be adopted to extract part-level instances, we still lack well-trained part-level encoders and related benchmarks. It is important to highlight that current large-scale 3D datasets are object-level, such as Objaverse[b] and OmniObject3D[c], and a large-scale, high-quality concept-level dataset is lacking. Therefore, the exploration of open-vocabulary/concept-level evaluations should be left for future work.

[a] PointNeXt: Revisiting PointNet++ with Improved Training and Scaling Strategies. NeurIPS 2022.

[b] Objaverse: A Universe of Annotated 3D Objects. CVPR 2023.

[c] OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation. CVPR 2023.

Q1.2: Zero-shot evaluation on other datasets.

Table A. Evaluation results on scene-language tasks 3RQA, 3RDialog, and 3RPlan, based on 3RScan.

| Method            | 3RQA | 3RDialog | 3RPlan |
|-------------------|------|----------|--------|
| LEO (zero-shot)   | 35.8 | 25.5     | 23.4   |
| Ours (zero-shot)  | 36.2 | 32.7     | 30.9   |
| LEO (fine-tuned)  | 51.9 | 73.3     | 81.1   |
| Ours (fine-tuned) | 55.8 | 82.1     | 93.5   |

To assess the generalizability of our model, we follow the precedent set by the 3D LLM method LEO[20] and conduct an evaluation on their proposed tasks: 3RQA, 3RDialog, and 3RPlan. These tasks are built upon the 3RScan[d] dataset, which belongs to a different domain than ScanNet. We use the same detection results as LEO and leverage our pre-trained weights (including the projection layer and LLM pre-trained on ScanNet). The results shown in the table above demonstrate our model’s zero-shot capabilities on 3RScan.

[d] RIO: 3D Object Instance Re-Localization in Changing Indoor Environments. ICCV 2019.

Q2: Discussion on the previous object-centric tokenization approaches.

Compared to previous object-centric tokenization methods, our approach introduces a novel way to incorporate a sequence of {object identifier, object features} into the LLM, enabling it to solve different tasks in a unified question-answering format.

For instance, ViLBERT uses an additional task head to predict matching scores for object grounding, while VisualBERT ranks entities by comparing attention weights. These methods require task-specific designs and heads for different tasks, which is impractical for real-world human-assistant interactions.

In contrast, our method establishes a direct link between the object and its identifier, allowing the LLM to respond with an object identifier as the grounding result. This approach can naturally extend to more complex tasks, such as multiple object grounding (outputting several identifiers for multiple grounding results) and grounded captioning (producing complex captions interleaved with identifiers as the grounding result).

Q3 & W2.2: Results on Nr3D/Sr3D.

Table B. Evaluation results on Nr3D/Sr3D.

| Method       | Nr3D Overall | Nr3D Easy | Nr3D Hard | Nr3D View Dep | Nr3D View Indep | Sr3D Overall | Sr3D Easy | Sr3D Hard | Sr3D View Dep | Sr3D View Indep |
|--------------|--------------|-----------|-----------|---------------|-----------------|--------------|-----------|-----------|---------------|-----------------|
| 3DVG-Trans   | 40.8         | 48.5      | 34.8      | 34.8          | 43.7            | 51.4         | 54.2      | 44.9      | 44.6          | 51.7            |
| TransRefer3D | 48.0         | 56.7      | 39.6      | 42.5          | 50.7            | 57.4         | 60.5      | 50.2      | 49.9          | 57.7            |
| MVT          | 59.5         | 67.4      | 52.7      | 59.1          | 60.3            | 64.5         | 66.9      | 58.8      | 58.4          | 64.7            |
| 3D-VisTA     | 57.5         | 65.9      | 49.4      | 53.7          | 59.4            | 69.6         | 72.1      | 63.6      | 57.9          | 70.1            |
| Ours         | 63.9         | 75.7      | 52.6      | 53.8          | 69.2            | 73.1         | 78.3      | 60.8      | 66.6          | 73.4            |

The results show that our model achieves state-of-the-art performance compared to previous expert models.

Q4 & W2.1: Importance of object identifiers.

Table C. Ablation study on object identifiers.

| Method              | ScanRefer (Acc@0.5) | Multi3DRefer (F1@0.5) | Scan2Cap (C@0.5) | ScanQA (CIDEr) | SQA3D (EM) |
|---------------------|---------------------|-----------------------|------------------|----------------|------------|
| w/o obj identifiers | 22.8                | 25.6                  | 48.9             | 79.8           | 50.0       |
| Ours                | 50.2                | 52.4                  | 77.1             | 88.4           | 55.5       |

Object identifiers are crucial to our model's design. They allow users to reference objects for tasks such as 3D dense captioning (Scan2Cap), while the LLM uses these identifiers for grounding objects in both single object grounding (ScanRefer) and multiple object grounding (Multi3DRefer).

Without object identifiers, it is necessary to adopt an alternative method for grounding and dense captioning. Following 3D-LLM[22], we add special location tokens to represent bounding boxes in the LLM. Thus, the model can output bounding boxes for grounding tasks and input bounding boxes for dense captioning. As shown in the table above, training the model without object identifiers reveals a significant decline in performance, particularly in grounding and captioning tasks.

Comment

I would like to thank the authors for their effort in the rebuttal. My main concerns were regarding the object bottleneck, references to related work, additional evaluations, and more targeted ablations.

  • Discussion regarding the object bottleneck: while we don't expect a single paper to solve the debate, this paper positions in favor of two-stage approaches for limited data setups. That indeed opens a new discussion given recent advances in mixed 2D-3D training [1], but I agree that it is an avenue for future work.

  • Additional connections to related work were added.

  • Additional evaluations on ReferIt3D were added.

  • One additional ablation was added, but I'm not sure I get all the details. Are encoded bounding boxes fed to the LLM? Some description of this baseline would be helpful. If that's the case, one related approach is LanguageRefer [2]. I was also imagining that scene tokens (after some voxelization) could be fed directly to the transformer.

It would also be interesting to ablate the effect of multi-tasking. Is the architecture strong on its own, or does it benefit a lot from the unification of the output space? For that, some single-task results would be useful, e.g. on grounding.

For the final version, please consider adding the additional discussions on the broader position (two-stage vs one-stage), related work and additional results and ablations. If possible, including single-task results would provide great insight for disentangling the architecture design and output unification. While the latter is studied for NLP, it's not well-studied for 3D VL understanding.

Under these conditions, I'm increasing my score from 4 to 6.

[1] ODIN: A Single Model for 2D and 3D Segmentation, 2024 [2] LanguageRefer: Spatial-Language Model for 3D Visual Grounding, 2021

Comment

Thank you for your detailed comments and constructive suggestions. We appreciate your recommendations regarding additional discussions on model architecture and related object-centric methods, as well as the suggestion to include more results and ablation studies. We will incorporate these in the final version of the paper. Below, we want to further address your concerns.

Details of the ablation study on object identifiers:

Following 3D-LLM[1], we use 6 discrete tokens to represent a bounding box. Specifically, we add 1000 special tokens (<LOC000>, <LOC001>, ... , <LOC999>) into the language model's vocabulary to discretely represent numeric values in the [0, 1] range. (Coordinates were normalized into [0, 1]). For example, an original bounding box defined as (x=0.234, y=0.467, z=0.129, w=0.301, h=0.235, l=0.189) would be represented as: <LOC234> <LOC467> <LOC129> <LOC301> <LOC235> <LOC189>. This method has shown effectiveness in 2D models such as OFA[2] and Pix2Seq[3]. However, both our experiments and 3D-LLM's results indicate that the location tokens are not well-learned due to the lack of 3D data.
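A minimal sketch of this location-token discretization is given below. The quantization rule (rounding to the nearest of 1000 bins) is an assumption; the exact rule used by 3D-LLM is not specified here.

```python
def box_to_loc_tokens(box, n_bins=1000):
    """Map a normalized box (x, y, z, w, h, l), each value in [0, 1], to <LOCxxx> tokens.

    Sketch of the location-token baseline described above; the rounding rule
    is an assumption.
    """
    tokens = []
    for v in box:
        idx = int(round(min(max(v, 0.0), 1.0) * (n_bins - 1)))
        tokens.append(f"<LOC{idx:03d}>")
    return " ".join(tokens)

# Reproduces the example above:
# box_to_loc_tokens((0.234, 0.467, 0.129, 0.301, 0.235, 0.189))
# -> "<LOC234> <LOC467> <LOC129> <LOC301> <LOC235> <LOC189>"
```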

LEO[4] and LL3DA[5] choose to feed the encoded feature of a bounding box (or a single point) to the LLM to represent the user-referred object. However, this design cannot be directly applied to outputting a bounding box. Therefore, these models are not able to perform grounding tasks.

It is worth noting that Scene-LLM[6] feeds scene tokens (after voxelization) directly to the LLM, as you suggested. However, this model is only evaluated on QA tasks, as it lacks the ability to reference specific objects for captioning or grounding.

Consequently, our design, which employs object identifiers, is pioneering in unifying these tasks among 3D MLLMs. Notably, our model even achieves superior performance compared to expert models, highlighting its potential as a promising direction for LLMs in 3D tasks.

Multi-tasking ablation:

Thanks for the advice. The comparison between multi-task and single-task training could provide more insight into our method.

| Method               | Acc@0.25 | Acc@0.5 |
|----------------------|----------|---------|
| single-task training | 50.8     | 46.3    |
| multi-task training  | 55.5     | 50.2    |

We conduct a single-task training experiment on ScanRefer. The result in the table above shows that joint training on multiple tasks enhances performance. However, the comparison might not be entirely fair, as reduced data for single-task training also leads to fewer training steps per epoch, and adjusting hyperparameters could slightly improve single-task performance. We will include a more comprehensive comparison between single-task and multi-task training in the final version.

[1] 3D-LLM: Injecting the 3D World into Large Language Models. NeurIPS 2023.

[2] OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. ICML 2022.

[3] Pix2seq: A Language Modeling Framework for Object Detection. ICLR 2022.

[4] An Embodied Generalist Agent in 3D World. ICML 2024.

[5] LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning. CVPR 2024.

[6] Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning. arXiv 2024.

Comment

Thank you for the additional clarifications, I believe that the paper is now more complete and I vote for acceptance.

Review (Rating: 7)

This paper proposes a 3D MLLM that can understand the 3D environment at the object level. The proposed work designs object identifiers that are projected into language token space and can be understood by LLMs, which unifies several 3D scene understanding tasks into the same format. These object identifiers give a more natural way for humans and MLLMs to reference specific objects of interest in a 3D scene. The paper extracts features from 3D scenes using several 2D and 3D pretrained models and finetunes the MLLM on 3D scene understanding tasks. Experiments were conducted on several benchmarks based on ScanNet and the results show the proposed method achieves state-of-the-art performance, surpassing prior art by a large margin.

Strengths

  1. The paper decomposes a 3D scene into objects and develops object tokens to represent them for interaction with LLMs. This is a more natural way for people and LLMs to refer to object entities in the scenes.
  2. The paper conducts extensive experiments across several 3D scene understanding tasks (Grounding, VQA, and Captioning), showing the versatility of the proposed method. The reported results show that the proposed method achieves state-of-the-art performance and outperforms the existing expert and LLM-based models by large margins.
  3. The paper also conducts a study on video input using a 2D video instance tracking model, showing that the proposed method can still work without the 3D model.
  4. The paper is well-written and the code is provided in the supplementary material.

Weaknesses

  1. The performance of the proposed method is highly reliant on the ability of the pretrained 2D and 3D models (Mask3D, DINOv2, DEVA). Specifically, since the scene is represented as discrete object tokens, some fine-grained information existing in the full 3D scenes is lost. For example, if the 3D segmentation model wrongly merges 2 chairs into one, there is no way for the proposed method to recover. In this regard, I encourage the authors to provide some case studies about how the model fails (or succeeds) in this way.
  2. Although multiple tasks are tested, they are all based on the same underlying dataset, ScanNet. Thus it's unclear how the proposed method works on other datasets. While this may be because ScanNet has the most established benchmarks for these 3D scene understanding tasks, I still want to see how the proposed method works on other datasets (e.g., some outdoor datasets from the AV literature, or simply the next iteration of ScanNet, ScanNet++). Such evaluation can be quantitative or qualitative. It's also interesting to see how the model trained on ScanNet only can transfer to other datasets.
  3. I'm interested to see the zero-shot, open-vocabulary generalization of the proposed method. As the proposed method is based on several foundation models, it probably has the ability in such settings. I won't consider this as a major weakness though, as this is not claimed as a contribution in the paper.

Minor issues, typos, and grammar errors:

  • Ln 156, "due to due to".

Questions

  1. Ln 183 mentions that DINOv2 has superior handling of local features within images. Why and how does this matter in the 3D scene understanding task? How does it influence the final performance? The relevant ablations study is currently lacking.

Limitations

The authors provide a discussion on limitations and societal impact in the appendix. Yet they don't provide qualitative examples or discussion on the failure cases, which I highlighted in the weakness section.

Author Response

W1: Case study of the reliance on pre-trained detectors.

Please refer to Figure 1 in the attached PDF in the “Author Rebuttal”.

We provide several qualitative cases where the detected objects are imperfect (such as incomplete point clouds or an object being separated into two or more parts). Despite the direct influence of incomplete masks on grounding quality, there are successful cases in captioning and QA tasks. The model's ability to perceive the surroundings of the objects allows it to infer the correct captions or answers.

It is worth emphasizing that previous SOTA models such as LEO[20], 3D-VisTA[68], and M3DRef-CLIP[63] are also two-stage models reliant on pre-trained detectors, and they face the same challenge as we do when the detector fails.

W2: Zero-shot evaluation on other datasets.

Table A. Evaluation results on scene-language tasks 3RQA, 3RDialog, and 3RPlan, based on 3RScan.

| Method            | 3RQA | 3RDialog | 3RPlan |
|-------------------|------|----------|--------|
| LEO (zero-shot)   | 35.8 | 25.5     | 23.4   |
| Ours (zero-shot)  | 36.2 | 32.7     | 30.9   |
| LEO (fine-tuned)  | 51.9 | 73.3     | 81.1   |
| Ours (fine-tuned) | 55.8 | 82.1     | 93.5   |

Considering the lack of 3D scene understanding benchmarks built upon other datasets, we follow the precedent set by the 3D LLM method LEO[20] and conduct an evaluation on their proposed tasks: 3RQA, 3RDialog, and 3RPlan. These tasks are built upon the 3RScan[a] dataset, which belongs to a different domain than ScanNet. We use the same detection results as LEO and leverage our pre-trained weights (including the projection layer and LLM pre-trained on ScanNet). The results shown in the table above demonstrate our model’s zero-shot capabilities on 3RScan.

[a] RIO: 3D Object Instance Re-Localization in Changing Indoor Environments. ICCV 2019.

W3: About zero-shot open-vocabulary generalization.

Firstly, it's important to note that open-vocabulary is not claimed as a contribution in our paper, nor do the previous baselines include formal open-vocabulary evaluations.

Table A above reveals some zero-shot open-vocabulary abilities of our model on novel scenes in 3RScan. To achieve true open-vocabulary capabilities, we can adopt open-vocabulary detectors such as SAMPro3D[b] and Open3DIS[c], and scale up the language data to train a robust MLLM. This remains a valuable direction in this field, and we leave it for future work.

[b] SAMPro3D: Locating SAM Prompts in 3D for Zero-Shot Scene Segmentation. arXiv 2023.

[c] Open3DIS: Open-vocabulary 3D Instance Segmentation with 2D Mask Guidance. CVPR 2024.

Q1: How do DINOv2 features work in the 3D scene understanding task?

As described in Appendix A (Ln 531-541), we extract 2D representations from the area of projected masks in multi-view images of each object. DINOv2, trained with self-supervision objectives at both image and patch levels, captures detailed local information such as shape and texture, providing fine-grained perception abilities. This aligns with our design goal of extracting rich object-centric representations. The ablation results in Table 4 demonstrate the importance of these 2D representations in our final model.
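As a sketch of what mask-based 2D feature extraction could look like, the snippet below averages patch-level features inside an object's projected mask for one view. The tensor shapes, mask-averaging rule, and empty-mask fallback are illustrative assumptions; the actual procedure is described in Appendix A.

```python
import torch

def object_feature_from_view(patch_feats: torch.Tensor, obj_mask: torch.Tensor) -> torch.Tensor:
    """Average patch-level 2D features (e.g., from DINOv2) inside an object's projected mask.

    patch_feats: (H, W, C) patch features for one view; obj_mask: (H, W) boolean mask.
    """
    selected = patch_feats[obj_mask]          # (n_patches_in_mask, C)
    if selected.numel() == 0:                 # object not visible in this view
        return patch_feats.new_zeros(patch_feats.shape[-1])
    return selected.mean(dim=0)

# Per object, such view-level features can then be pooled (e.g., averaged) across
# the multi-view images in which the object appears.
```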

Table B. Ablation study on the 2D encoder.

| Method    | ScanRefer (Acc@0.5) | Multi3DRefer (F1@0.5) | Scan2Cap (C@0.5) | ScanQA (CIDEr) | SQA3D (EM) |
|-----------|---------------------|-----------------------|------------------|----------------|------------|
| w/ CLIP   | 46.3                | 48.7                  | 73.1             | 85.0           | 53.9       |
| w/ DINOv2 | 50.2                | 52.4                  | 77.1             | 88.4           | 55.5       |

We also add an ablation study replacing DINOv2 with CLIP, which was trained with image-level contrastive learning and tends to neglect rich pixel-level details. The results show that using the CLIP encoder leads to lower performance, particularly on grounding tasks.

Comment

Thanks a lot for the response!

I think all my concerns have been addressed. I have also briefly gone through other reviews and found no strong reasons to change my mind.

Thus I'm keeping my rating at 7 - accept.

Comment

We are pleased to have addressed all your concerns and appreciate your decision to keep your rating!

Review (Rating: 5)

This paper aims to enhance the efficiency in interpreting individual object instances and improve referencing and grounding capabilities for intricate scene comprehension. The method decomposes the input 3D scene into object identifier tokens. Experimental results on 3D scene-language datasets demonstrate the effectiveness of the proposed approach.

Strengths

  1. The paper is well-written and easy to follow.

  2. The experimental results on 3D scene-language datasets show the performance improvement and effectiveness of the proposed method.

  3. The training schema of the model is single-stage yet effective on downstream tasks.

Weaknesses

1. The authors do not provide details about the detectors, encoders, multi-modal inputs, and LLMs used by the other methods in Table 1. These design choices have already built a high-performance baseline; for example, the baseline results in Table 2 exceed the other methods in Table 1. It is difficult to judge the fairness of the comparisons in Table 1.

2. This paper lacks results on the ScanQA test set.

3. The ablations are insufficient. The paper lacks ablations on different sizes of LLMs and on the training and fine-tuning schema.

4. The paper lacks the computation and time cost of the proposed method.

5. This paper does not provide sufficient discussion of object-centric representation learning methods for 3D vision-language. For example, [a] explored object-centric representation learning with contrastive learning, and object-level tokens are also used in LEO[22], 3D-LLM[20], etc.

[a] Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding, AAAI2024

Questions

  1. More ablations to show the effectiveness of the object identifiers, such as the order of identifiers, or only combining a 3D object token embedding and a 2D object token embedding for each object without the special identifier token.

  2. More ablations about the LLM used for 3D-VL learning, such as different types of LLMs, different sizes of LLMs, and multimodal large models such as LLaVA[b] or other 3D LLMs[22].

  3. The details about the training set, such as the scale of the 3D, 2D, and language data, and the statistics of the identifiers.

  4. The details about the inputs at inference: will all objects of a scene be combined as input, and how are they combined?

[b] Visual Instruction Tuning, NeurIPS 2023

Limitations

The authors discuss limitations regarding the reliance on pre-trained models, data scarcity, and hallucinations.

Author Response

W1: Details about multi-modal inputs, detectors, encoders, and LLMs used in other methods.

Please refer to Table 1 in the attached PDF in “Author Rebuttal”.

W2: Results on ScanQA test set.

| Method     | w/ obj CIDEr | w/ obj B-4 | w/ obj METEOR | w/ obj ROUGE | w/o obj CIDEr | w/o obj B-4 | w/o obj METEOR | w/o obj ROUGE |
|------------|--------------|------------|---------------|--------------|---------------|-------------|----------------|---------------|
| ScanQA     | 67.29        | 12.04      | 13.55         | 34.34        | 60.24         | 10.75       | 12.59          | 31.09         |
| Multi-CLIP | 68.70        | 12.65      | 13.97         | 35.46        | 63.20         | 12.87       | 13.36          | 32.61         |
| 3D-VLP     | 70.18        | 11.23      | 14.16         | 35.97        | 63.40         | 15.84       | 13.13          | 31.79         |
| 3D-VisTA   | 76.6         | 16.0       | 15.2          | 38.6         | 62.6          | 11.9        | 12.9           | 32.8          |
| 3D-LLM     | 69.6         | 11.6       | 14.9          | 35.3         | -             | -           | -              | -             |
| LL3DA      | 78.16        | 13.97      | 16.38         | 38.15        | 70.29         | 12.19       | 14.85          | 35.17         |
| Ours       | 94.04        | 14.38      | 19.65         | 44.53        | 79.79         | 11.38       | 17.10          | 39.34         |

W3 & Q2: Ablations about LLMs and training schema.

| LLM            | Training Schema | ScanRefer (Acc@0.5) | Multi3DRefer (F1@0.5) | Scan2Cap (C@0.5) | ScanQA (CIDEr) | SQA3D (EM) |
|----------------|-----------------|---------------------|-----------------------|------------------|----------------|------------|
| OPT-1.3B       | one-stage       | 47.6                | 48.3                  | 75.7             | 86.2           | 54.3       |
| Tiny-Vicuna-1B | one-stage       | 49.5                | 50.1                  | 76.2             | 86.7           | 54.5       |
| Vicuna-7B      | one-stage       | 50.2                | 52.4                  | 77.1             | 88.4           | 55.5       |
| Vicuna-13B     | one-stage       | 49.9                | 51.7                  | 78.2             | 89.1           | 55.8       |
| Vicuna-7B      | two-stage       | 51.1                | 53.1                  | 73.7             | 85.6           | 55.3       |
  1. Size of LLM: We compared various sizes of LLMs: Vicuna 1B, 7B, and 13B. The 7B model achieves the best performance on grounding tasks, while the 13B model excels in captioning and QA tasks. The 13B model does not show significant performance gains over the 7B model, suggesting that the current task scope and data scale do not challenge larger LLMs.

  2. Type of LLM: We tested OPT-1.3B, which performs slightly worse than the Vicuna-1B of similar size. As for multimodal large models like LLaVA and LEO, they basically use the same LLM backbone (Vicuna-7B) and projection layer (MLP) as ours. However, directly adapting their pre-trained weights to our model is unsuitable because we use different 2D or 3D representations.

  3. Training schema: We compare one-stage training with two-stage training (fine-tuning on each dataset). Two-stage training improves performance on grounding tasks but significantly decreases performance on captioning and QA tasks. The performance decline in captioning and QA tasks may be due to these tasks being easier to converge, leading to overfitting during further fine-tuning.

W4: Computation and time cost.

| Method            | Data Preparation (per scene) | GPU Usage | Training Time |
|-------------------|------------------------------|-----------|---------------|
| 3D-LLM (BLIP-2)   | ~15 min                      | 64 × V100 | ~1 day        |
| 3D-LLM (Flamingo) | ~15 min                      | 8 × A100  | ~7 days       |
| Ours              | <1 min                       | 4 × A100  | ~8 hours      |

Compared to 3D-LLM[22], our model's simpler design significantly reduces computation and time costs.

W5: Discussion about object-centric representation.

Firstly, it’s important to note that the object-centric representation is not claimed as a contribution in the paper. We state that our study is based on well-trained encoders, and our contribution lies in how to incorporate a sequence of {object identifier, object features} into the LLM, which can solve various 3D scene-language tasks in unified formats. More comparisons among different object-centric representations can be left as future work.

Q1: Importance of object identifiers.

| Method              | ScanRefer (Acc@0.5) | Multi3DRefer (F1@0.5) | Scan2Cap (C@0.5) | ScanQA (CIDEr) | SQA3D (EM) |
|---------------------|---------------------|-----------------------|------------------|----------------|------------|
| fixed order         | 49.6                | 51.5                  | 76.2             | 88.7           | 55.4       |
| w/o obj identifiers | 22.8                | 25.6                  | 48.9             | 79.8           | 50.0       |
| Ours                | 50.2                | 52.4                  | 77.1             | 88.4           | 55.5       |

Order of Object Identifiers: In our actual implementation, we randomize the order of object identifiers during training to minimize the influence of any inherent order distribution. Our results show that using a fixed order of identifiers yields slightly poorer performance compared to using a random order.

Removing Object Identifiers: Object identifiers are crucial to our model's design. They allow users to reference objects for tasks such as 3D dense captioning (Scan2Cap), while the LLM uses these identifiers for grounding objects in both single object grounding (ScanRefer) and multiple object grounding (Multi3DRefer). Without object identifiers, it is necessary to adopt an alternative method for grounding and dense captioning. Following 3D-LLM[22], we add special location tokens to represent bounding boxes in the LLM. Thus, the model can output bounding boxes for grounding tasks and input bounding boxes for dense captioning. As shown in the table above, training the model without object identifiers reveals a significant decline in performance, particularly in grounding and captioning tasks.

Q3: Details about training data.

| Data type           | Size    |
|---------------------|---------|
| 3D Scenes           | 1201    |
| Point Cloud / Scene | 145K    |
| Images / Scene      | ~80     |
| Image Resolution    | 640×480 |
| Language            | 155K    |
| Identifiers         | 100     |

As a comparison, LEO[20] used 1.2M language training data, and 3D-LLM[22] used 700K, which are several times more than our training data (155K).

Q4: Details about the inputs at inference.

All objects detected by the 3D detector are combined as input. For each object, the extracted 3D and 2D representations are projected into the LLM’s embedding space, becoming 3D and 2D tokens. Consequently, each object is represented by three tokens: an identifier token, a 3D token, and a 2D token. As illustrated in Figure 2, these objects form a token sequence of length 3n, where n is the number of objects. For more details, refer to Section 3.3 for the prompt template and Section 3.2 for feature extraction.
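A minimal sketch of how such an input sequence could be assembled is shown below. The identifier naming and the interleaving order are illustrative assumptions; the actual prompt template is given in Section 3.3.

```python
def build_object_sequence(obj_embeds_3d, obj_embeds_2d):
    """Interleave an identifier token with the projected 3D and 2D tokens per object.

    Returns a mixed list of identifier strings and embedding tensors of length 3n,
    where n is the number of detected objects (illustrative sketch only).
    """
    sequence = []
    for i, (t3d, t2d) in enumerate(zip(obj_embeds_3d, obj_embeds_2d), start=1):
        sequence.append(f"<obj{i:02d}>")  # identifier token (added to the LLM vocabulary)
        sequence.append(t3d)              # projected 3D token
        sequence.append(t2d)              # projected 2D token
    return sequence  # prepended to the task prompt before feeding the LLM
```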

Comment

We appreciate all your suggestions that help us improve our paper. As the deadline for discussion is approaching, we would be glad to address any remaining questions or concerns. Please let us know if there are any further points you'd like to discuss!

Author Response

[Our Contributions]

We are glad to find out that the reviewers generally acknowledge our contributions:

(Contribution)

  • The combination of object-centric representations and MLLMs is a nice feature [KdPm], a valuable and innovative research direction [G2or], and it is a natural way for people and LLMs to refer to object entities using object identifiers [n84p].
  • Appropriate featurization using both multi-view 2D and 3D are very useful to see, because MLLMs haven't shown such good results so far using different input representation. [KdPm, G2or]

(Soundness)

  • The training schema of the model is single-stage yet effective on downstream tasks. [2kgC]
  • The experimental results show our state-of-the-art performance across various tasks, indicating the effectiveness of our method. [2kgC, n84p, KdPm]
  • The paper also conducts a study on video input using a 2D video instance tracking model, showing that the proposed method can still work without the 3D model. [n84p]
  • The code is provided in the supplementary material. [n84p]

(Presentation)

  • The paper is very clearly presented and easy to follow. [2kgC, n84p, KdPm]

[New Experiments and Results]

In this rebuttal, we have added more supporting results to address reviewers' concerns.

  • Details about multi-modal inputs, detectors, encoders, and LLMs used in other methods. [2kgC]
  • Results on ScanQA test set. [2kgC]
  • Ablation of object identifiers. [2kgC, KdPm]
  • Details about training data. [2kgC]
  • Zero-shot evaluation on 3RScan datasets. [n84p, KdPm]
  • Ablation of 2D encoder. [n84p]
  • Results on Nr3D/Sr3D datasets. [KdPm]
  • Results of 3D referring expression segmentation on ScanRefer dataset. [G2or]

Thank you again for your constructive comments. We would be happy to answer and discuss if you have further concerns.

Comment

We sincerely thank the reviewers and AC for their efforts during the review and discussion process.

We would like to briefly highlight the motivation and contributions of our work. The trend in the 3D field is moving towards generalist models, yet existing methods struggle with effective object referencing and grounding within LLMs. Our work is the first to address this challenge and unify 3D tasks—grounding, captioning, and QA—within a single LLM-based model, without the need for task-specific designs or fine-tuning.

Our goal is not to claim that the proposed model is completely superior to all specialist models in their specific tasks or metrics. We do not deny that specialists may continue to improve with data scaling or better pre-trained weights in the future, but their task-oriented designs inherently limit them to specific tasks. Compared to both specialist and LLM-based models, the superior performance of our model across all benchmarks establishes a solid baseline for the future development of 3D generalist models.

Final Decision

The paper proposes to unify different tasks involving 3D scenes and language into question answering by 1) introducing object identifiers for detected objects, 2) formulating the task as a question-answer task with object identifiers and object embeddings, and 3) using a fine-tuned LLM to generate answers. Objects are detected using Mask3D, and object features are composed of 3D features from Uni3D and 2D features from DINOv2. The LLM (Vicuna) is fine-tuned using LoRA. The proposed method is used to unify visual grounding in 3D scenes (ScanRefer, Multi3DRefer), dense captioning of 3D objects (Scan2Cap), and 3D QA tasks (ScanQA, SQA3D). In addition, the authors demonstrate they can adapt their model to 2D videos by using a 2D video instance tracking model (vs a 3D object detector).

Three of the reviewers advocate for acceptance (2kgC, n84p, KdPm) while one reviewer is negative (G2or).

Reviewers initially had questions about additional experiments and ablations (2kgC, n84p, KdPm), discussion of time and resource cost (2kgC), and requested additional details and clarification regarding training and inference (2kgC), as well as more discussion on prior work that uses object-centric token representations (n84p, KdPm, G2or) and on the reliance on the performance of pre-trained feature detectors and extractors (n84p, KdPm). The strong response has encouraged two of the reviewers (KdPm, G2or) to increase their ratings (although G2or remains negative).

Given the positive rating of the majority of reviewers, and the extensive experiments, the AC recommends acceptance, as the proposed method provides an interesting and promising approach to unify 3D vision and language tasks. The authors are encouraged to include the additional experiments from the rebuttal (including the ablation on the importance of object identifiers) and add the discussion and clarifications from reviewers in their camera-ready.