PaperHub
Overall rating: 6.0 / 10 · Poster · 5 reviewers
Ratings: 3, 4, 4, 4, 4 (min 3, max 4, std 0.4)
Average confidence: 3.4
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.6 · Significance: 2.2
NeurIPS 2025

Spatially-aware Weights Tokenization for NeRF-Language Models

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Neural Radiance Fields (NeRFs) are neural networks -- typically multilayer perceptrons (MLPs) -- that represent the geometry and appearance of objects, with applications in vision, graphics, and robotics. Recent works propose understanding NeRFs with natural language using Multimodal Large Language Models (MLLMs) that directly process the weights of a NeRF's MLP. However, these approaches rely on a global representation of the input object, making them unsuitable for spatial reasoning and fine-grained understanding. In contrast, we propose **weights2space**, a self-supervised framework featuring a novel meta-encoder that can compute a sequence of spatial tokens directly from the weights of a NeRF. Leveraging this representation, we build **Spatial LLaNA**, a novel MLLM for NeRFs, capable of understanding details and spatial relationships in objects represented as NeRFs. We evaluate Spatial LLaNA on NeRF captioning and NeRF Q&A tasks, using both existing benchmarks and our novel **Spatial ObjaNeRF** dataset consisting of $100$ manually-curated language annotations for NeRFs. This dataset features 3D models and descriptions that challenge the spatial reasoning capability of MLLMs. Spatial LLaNA outperforms existing approaches across all tasks.
Keywords
MLLM, LLM, NeRF

Reviews and Discussion

Official Review
Rating: 3

The paper proposes an extension to Neural Radiance Fields (NeRFs), which are neural networks - typically multilayer perceptrons (MLPs) - that represent the geometry and appearance of visual objects. NeRFs provide a global representation of an input object and do not capture fine-grained details, making them, according to the authors, unsuitable for spatial reasoning. The paper proposes weights2space, a self-supervised framework for encoding a sequence of spatial tokens directly from the weights of a NeRF. The paper also proposes Spatial LLaNA, a multimodal large language model capable of understanding details and the spatial relationships in which objects are involved, leveraging the NeRF representations of the objects. The paper also proposes a manually annotated dataset explicitly designed to evaluate spatial reasoning.

Strengths and Weaknesses

Strengths

  • The related work section is extensive and well grounded in the literature.
  • Extensive experimental results.

Weaknesses

  • The computation of a spatially aware representation of an object directly from the weights of its NeRF resembles nf2vec, except for a pooling operation performed by nf2vec.
  • The function of the seq2space module needs to be clarified.
  • A detailed ablation of the different steps/modules of the spatially-aware NeRF weights tokenization is missing.
  • The spatially-aware multimodal large language model should be evaluated on a broader range of downstream tasks and datasets beyond those requiring spatial reasoning, in order to ensure that its performance on other tasks is not degraded. As a foundation model, it is intended for general-purpose use.
  • The writing of the paper is not always very clear and precise.
  • The claims of significant improvements compared to state-of-the-art results need to be supported by statistical significance tests.

Questions

See weaknesses:

  • Content should be added with regard to the function of the seq2space module, a detailed ablation of the different steps/modules of the spatially-aware NeRF weights tokenization, and evaluation of the spatially-aware multimodal large language model on a broader range of downstream tasks and datasets.

Limitations

Limitations could be added regarding the chosen architectures and scalability.

Final Justification

Most of the issues that I raised were answered in a satisfactory way. I have added my final rating and have raised my score. I value the good results of the paper, but I worry about the imprecise writing style, which should be improved in a final version.

Formatting Issues

There are no major issues. The font size used for presenting results in tables is very small.

Author Response

Summary. The paper proposes an extension to Neural Radiance Fields (NeRFs), which are neural networks - typically multilayer perceptrons (MLPs) - that represent the geometry and appearance of visual objects. NeRFs provide a global representation of an input object and do not capture fine-grained details, making them, according to the authors, unsuitable for spatial reasoning. The paper proposes weights2space, a self-supervised framework for encoding a sequence of spatial tokens directly from the weights of a NeRF. The paper also proposes Spatial LLaNA, a multimodal large language model capable of understanding details and the spatial relationships in which objects are involved, leveraging the NeRF representations of the objects. The paper also proposes a manually annotated dataset explicitly designed to evaluate spatial reasoning.

First of all, we would like to clarify that our paper is not an extension to Neural Radiance Fields, i.e., we do not propose a novel NeRF architecture that may better capture fine-grained details. Instead, we propose S-LLaNA: an AI assistant to perform language tasks on 3D data represented as standard NeRFs. A key ability required for language tasks on 3D data is understanding the spatial relations between objects and their parts. Purposely, S-LLaNA includes a novel meta-encoder design, weights2space, that computes a sequence of spatially-aware tokens from the weights of the input NeRF.

Weakness 1. The computation of a spatially aware representation of an object directly from the weights of its NeRF has resemblance to the nf2vec except for a pooling operation performed by nf2vec.

We clarify the difference between our novel meta-encoder, weights2space, and nf2vec using Fig. 2 of the paper. The weights2seq module is the nf2vec model without its final pooling layer. However, the key difference between weights2space and nf2vec does not pertain to the absence/presence of the pooling layer in weights2seq, but rather to weights2space including, as additional components, the seq2space module and the downstream tri-plane representation (see Fig. 2). These two components endow weights2space with its distinctive ability to elicit spatial information from the weights of the input NeRF and obtain a sequence of spatially-aware tokens to be fed to the LLM. Hence, unlike nf2vec, our proposed weights2space meta-encoder is explicitly designed to extract spatially grounded features from NeRF weights and produce a sequence of tokens that are strongly and explicitly correlated with object parts in the input scene.

Weakness 2. The function of the seq2space module needs to be clarified.

The function of the seq2space module is to work in conjunction with the tri-plane representation to convert the sequence coming from weights2seq, where spatial information is distributed across the tokens, into a sequence to be fed to the LLM, where each token corresponds to a specific spatial location. More precisely, seq2space is a transformer decoder that, by cross-attending to the output of weights2seq, produces a sequence of tokens that, at training time, is reshaped into the features associated with the locations of the tri-plane to enable rendering images from the input NeRF. The intuition behind this design is that, by cross-attending to all the tokens from weights2seq, each learnable query of seq2space, which corresponds to a specific position in the tri-plane, can extract distributed spatial cues and rearrange them into spatially-localized features. Besides, the self-attention layers in seq2space realize interactions among the spatially-aware tokens, enabling the model to capture the relationships between the objects and object parts present in the input NeRF.
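For illustration, a minimal PyTorch-style sketch of this design is given below. The class name, number of layers, and attention settings are illustrative assumptions rather than the exact implementation; only the overall structure (learnable queries cross-attending to the weights2seq output and reshaped into tri-plane features) follows the description above, with sizes matching the configuration reported in the paper (3072 tokens, 32x32 tri-plane).

```python
import torch
import torch.nn as nn

class Seq2SpaceSketch(nn.Module):
    """Illustrative sketch: learnable queries cross-attend to the weights2seq
    output and are reshaped into tri-plane feature maps (sizes are assumptions)."""

    def __init__(self, n_queries=3072, dim=516, plane_res=32, n_layers=2, n_heads=4):
        super().__init__()
        # One learnable query per tri-plane cell (3 planes of plane_res x plane_res).
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.plane_res = plane_res

    def forward(self, w2s_tokens):
        # w2s_tokens: (B, L, dim) sequence from weights2seq, spatial info distributed.
        q = self.queries.unsqueeze(0).expand(w2s_tokens.shape[0], -1, -1)
        tokens = self.decoder(tgt=q, memory=w2s_tokens)  # (B, n_queries, dim), spatially-aware
        # At training time the tokens are reshaped into three feature planes
        # (F_xy, F_xz, F_yz) used to render views of the input NeRF.
        planes = tokens.view(tokens.shape[0], 3, self.plane_res, self.plane_res, -1)
        return tokens, planes
```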

 Weakness 3. A detailed ablation of the difference steps/modules of the spatially-aware NeRF weights tokenization is missing.

In Section 6 of the Supplementary Material, we provide an ablation of the modules of weights2space. In particular, LLaNA-13b-multitoken uses the original nf2vec encoder without its pooling layer. As previously highlighted, this setup corresponds to directly interfacing weights2seq to the LLM, without leveraging the key original components in our design, namely seq2space and the tri-plane representation. The results show clearly that weights2seq alone performs poorly compared to the meta-encoder design proposed in this paper.

Moreover, we conduct ablations to analyze sensitivity to the number of queries N (and consequently tri-plane resolution) and token dimensionality C. The results of these experiments are reported in our response to reviewer GQGs.

 Weakness 4. The spatially-aware multimodal large language model should be evaluated on a broader range of downstream tasks and datasets beyond those requiring spatial reasoning, in order to ensure that its performance on other tasks is not degraded. As a foundation model, it is intended for general-purpose use.

While our primary focus is spatial reasoning, we have explicitly included downstream tasks that go beyond spatial understanding to assess whether S-LLaNA retains general reasoning capabilities.

First, the brief captioning task on ShapeNeRF-Text, HST, and ObjaNeRF-Text involves generating concise textual descriptions focused on identifying the main subject of a scene. This task primarily evaluates global scene understanding, rather than spatial reasoning. In this setting, S-LLaNA consistently outperforms LLaNA, demonstrating its ability to preserve a holistic understanding of the input NeRF.

Furthermore, we evaluated S-LLaNA on a zero-shot NeRF classification task following the protocol in [6], where the model must infer the object class based on its multimodal input. This task does not require fine-grained spatial reasoning. As shown below, S-LLaNA outperforms LLaNA:

| Model | Modality | Accuracy (%) |
| --- | --- | --- |
| LLaNA-7b | NeRF | 67.56 |
| S-LLaNA-7b | NeRF | 68.67 |
| LLaNA-13b | NeRF | 69.27 |
| S-LLaNA-13b | NeRF | 71.85 |

These results confirm that S-LLaNA does not sacrifice holistic visual or linguistic understanding.

Regarding the general language understanding and reasoning of the underlying LLM (LLaMA-2), both LLaNA and S-LLaNA incorporate LLM finetuning in their training protocol to adapt the model to our scenario. While in theory this finetuning stage could degrade performance on other general language tasks, we note that S-LLaNA provides correct answers in the single-round QA task on ShapeNeRF-Text, which includes questions that tap into commonsense reasoning and real-world knowledge (as explained in [7]). Below are two representative examples showing how S-LLaNA provides answers consistent with human-level commonsense:


Example 1 (Phone)

ShapeNet ID 57056c8f465dd1aec03bc4569d70377c

Question: What is the purpose of the flash on the right side of the phone?

Ground-truth: The flash is used to enhance camera performance in low-light conditions.

Prediction: The flash is used to illuminate the subject in low-light conditions, providing a better quality image.


Example 2 (car)

ShapeNet ID 431ca41fdf0897c628ccbb4eb8965b05

Question: Can the blue car be used for daily driving?

Ground-truth: Yes, the blue car is a modern hatchback, making it suitable for daily driving.

Prediction: Yes, the blue car's compact design and modern features make it suitable for daily driving.


Finally, our ablation study in Section 6 (main paper) demonstrates that S-LLaNA performs robustly even without this step, achieving results comparable to the finetuned variant. Consequently, in this version of our framework, the pre-trained LLM remains untouched, and thus certainly preserves its original reasoning capabilities and world knowledge.

 Weakness 5. The writing of the paper is not always very clear and precise.

As also pointed out by reviewer 1Y9M, and as detailed in our response to their Weakness 1, we will revise Section 4 and update the captions of Figures 2 and 3 to enhance readability.

 Weakness 6. The claims on significant improvements compared to state-of-the-art results need to be supported by statistical significance tests.

We acknowledge that statistical significance testing would offer a more rigorous foundation for comparing performance.

Based on our internal experiments, we observed that our framework achieves consistent performance across multiple independent training runs. We are currently running additional training of our framework to quantify the consistency of our performance, yet, due to the short rebuttal period, we will be able to report statistics of our results in the camera-ready version of the paper.

Nevertheless, we would like to highlight that the methods we benchmark against do not provide results from multiple runs, likely due to the substantial computational cost of training and evaluating multimodal large language models (MLLMs). Consequently, meaningful significance testing would still be infeasible due to the absence of comparable data from other approaches.

Comment

We would kindly inquire whether the reviewer has had the opportunity to consider our explanation of the actual content and objective of our paper, which, as highlighted in the rebuttal, does not propose a new NeRF architecture, but instead an AI assistant to perform language tasks on NeRFs. Indeed, we hope that this fundamental clarification on the scope of the paper, along with the answers we provided to address both the nature of the original contributions (see Weaknesses 1 and 2) and the solidity of the experimental evaluation (see Weaknesses 3 and 4), may lead the reviewer to reconsider the assessment of the merits and limits of our work.

We conclude this comment by providing the additional repeatability results mentioned in the answer to Weakness 6. In particular, we trained two additional versions of the weights2space encoder, and on top of each of the three encoders, we trained our MLLM twice. So, as reported in the table below, we have a total of 6 runs, with the first row pertaining to the model already evaluated in the paper (the 7b variant of S-LLaNA, for computational efficiency). The last row reports the relative gap between the best and worst performing models. As we can notice, the gap is fairly small, which vouches for the consistency of the performance yielded by our framework. We would also highlight that, as can be checked by looking at Tables 1 to 5 of the paper, even considering the worst run for each metric, S-LLaNA would still result as the best performing model across all datasets and metrics.

| Model | SN-T brief (S-BERT / SimCSE) | ON-T brief (S-BERT / SimCSE) | SN-T detailed (S-BERT / SimCSE) | SN-T QA (S-BERT / SimCSE) | Spatial ObjaNeRF (S-BERT / SimCSE) |
| --- | --- | --- | --- | --- | --- |
| enc_1 - run_1 | 78.91 / 79.91 | 45.23 / 45.80 | 81.03 / 83.00 | 83.49 / 84.06 | 75.01 / 77.30 |
| enc_1 - run_2 | 79.05 / 80.39 | 45.66 / 45.64 | 81.22 / 84.45 | 84.46 / 82.93 | 74.87 / 76.97 |
| enc_2 - run_1 | 78.04 / 80.22 | 44.81 / 45.93 | 80.35 / 83.75 | 82.75 / 82.96 | 73.93 / 76.96 |
| enc_2 - run_2 | 78.89 / 79.10 | 45.82 / 44.75 | 81.97 / 82.36 | 84.46 / 83.44 | 74.88 / 76.13 |
| enc_3 - run_1 | 78.40 / 78.85 | 45.16 / 45.70 | 81.47 / 84.47 | 83.48 / 84.93 | 74.99 / 76.56 |
| enc_3 - run_2 | 79.13 / 79.74 | 44.67 / 44.89 | 80.59 / 82.17 | 84.83 / 82.91 | 75.22 / 77.15 |
| delta | 1.09 / 1.54 | 1.15 / 1.18 | 1.62 / 2.30 | 2.08 / 2.02 | 1.29 / 1.17 |
Comment

My questions were answered. I will raise my score.

Official Review
Rating: 4

This paper proposes weights2space, a self-supervised method that extracts spatial tokens from NeRF weights via a meta-encoder, enabling spatially-aware understanding. Based on this, the authors develop Spatial LLaNA, an MLLM that performs well on NeRF captioning and Q&A. They also introduce the Spatial ObjaNeRF dataset. The method outperforms prior work in all tasks.

Strengths and Weaknesses

Strengths:

  1. The idea of this paper is interesting—it uses a self-supervised approach to obtain spatially-aware visual tokens.
  2. The experiments show the proposed method clearly outperforms the previous SOTA LLaNA.
  3. A new dataset is introduced that focuses more specifically on object parts and their spatial relationships.

Weaknesses:

  1. The proposed method’s capability mainly stems from weights2space. However, the training of weights2space relies solely on 300K samples from ShapeNeRF-Text and ObjaNeRF-Text, which may limit its generalizability in real-world applications.
  2. It seems the method is limited to handling single-object NeRFs. It's unclear how it would scale to scenarios involving multiple objects in the same scene. For example, how would the model handle questions that require reasoning about spatial relationships between multiple objects? Addressing this limitation would be important for broader applicability.
  3. Lacking some ablation studies—for instance, the number of queries in weights2space.

Questions

See Weaknesses

Limitations

Yes

Final Justification

Thanks for your thoughtful response, which addresses most of my concerns. Although the authors responded to the question about generalization, I believe that using a real scene from Objaverse-NeRF-Text does not provide strong evidence, since the model has already seen data from this domain during training. Hence, I keep my score.

Formatting Issues

N/A

Author Response

1. The proposed method’s capability mainly stems from weights2space. However, the training of weights2space relies solely on 300K samples from ShapeNeRF-Text and ObjaNeRF-Text, which may limit its generalizability in real-world applications.

While the training set comprises 300K samples, we highlight that ObjaNeRF-Text includes a wide variety of synthetic and real objects and scenes, the latter featuring multiple objects. This diversity makes the test scenarios on ObjaNeRF-Text (including Spatial ObjaNeRF) a good first-level evaluation benchmark for generalization. As shown in Table 2 and Table 5 of the paper, S-LLaNA performance on the test set of ObjaNeRF-Text is superior to that of other MLLMs, highlighting the effectiveness of our approach on unseen NeRFs.

Moreover, to qualitatively assess the capability of S-LLaNA in handling real-world data, we also report here a result on a real scene from ObjaNeRF-Text:


Objaverse ID ba341c4ce89647ea9f6996ec58e3eacf

Spatial detailed description, ground truth: This is a 3D scene with two tractors, a loader, and a trailer parked on a plot of land with plenty of wood waiting to be loaded. The tractors are positioned side by side, with the loader attached to one of them. The trailer is parked in front of the tractors, and the wood is stacked neatly beside the tractors. The ground appears to be a mix of dirt and patches of grass, and there are no visible signs of recent activity, suggesting the scene might be a still moment in time.

Spatial detailed description, prediction: The 3D model represents a typical farming scene. It features two tractors, one of which is a large, robust machine with a visible driver's cabin, indicating it's likely used for heavy-duty tasks. The other tractor is smaller, possibly a compact model, which might be used for lighter tasks or for navigating narrower spaces. Both tractors are parked on a plot of land, suggesting they are either in a state of rest or not currently in use. There is also a loader, a versatile piece of equipment used for lifting and moving materials, which implies that the land might be a construction or building site.


The example above shows that S-LLaNA can produce accurate descriptions of real-world scenes. Notably, it can properly count objects and understand their relative positions: it features two tractors ... both tractors are parked on a plot of land ... there is also a loader.

Furthermore, to validate the applicability of S-LLaNA to in-the-wild NeRFs, we also test it on another dataset containing real objects, OmniObject3D. To conduct this experiment, we fit NeRFs for a few objects employing the images officially released with the dataset. These NeRFs are provided as input to our framework. Below, we show some qualitative results:


OmniObject3D ID: house_004

Spatial detailed description, ground truth: The house has a main building and an annex with doors on the left and right sides, a man standing in front of the main door and a coconut tree behind it, the main part of the house is mainly brown in color and the eaves are yellow, pointed, triangular, and not symmetrical.

Spatial detailed description, prediction: This is a 3D model of a cartoon-style building, predominantly brown in color, with a distinctive pink facade. A large coconut tree is behind the building. The structure appears to be a whimsical representation of a building, possibly from a children's animated show or game. The yellow roof adds a playful touch, making the building appear less realistic and more imaginative.


OmniObject3D ID: toy_train_044

Ground Truth: This toy train is overall green, the front is a cartoon face head, the back is dark green and with a small bump, the rear part is green with gold stripes, there is also a small bump, there are some irregular patterns on the side, the front and rear each have a bump, a total of six wheels, four small, two large, the overall structure of the axis of symmetry.

Prediction: This is a 3D model of a cartoon-style, green-colored train. At the front of the train, a cartoon-style head is visible. Its vibrant green color makes it eye-catching. This type of object could be used in various applications such as animation, video game design, or digital art due to its playful and modern aesthetic.


From the examples above, we can appreciate how S-LLaNA generalizes effectively to real-world NeRFs belonging to a dataset never seen during training. For instance, it correctly describes relative positioning (A large coconut tree is behind the building) and captures fine-grained details (at the front of the train, a cartoon-style head).

To better visualize these results, please refer to section "Note to visualize example models" at the end of response 2 of reviewer GQGs.

2. It seems the method is limited to handling single-object NeRFs. It's unclear how it would scale to scenarios involving multiple objects in the same scene...

We would like to clarify that ObjaNeRF-Text and Spatial ObjaNeRF already contain a substantial number of scenes featuring multiple objects. Moreover, Spatial ObjaNeRF provides spatially grounded descriptions for many of these scenes, including references to inter-object relationships, making it a suitable benchmark for evaluating the ability of the model to reason about spatial relations. Qualitative examples of such multi-object scenes can be found in Figure 2 of the supplementary material (e.g., first row: table, fruit bowl, and book; fourth row: telephone booth, wall, pavement, street lamp, and octopus).

Moreover, we include additional results below on multi-object scenes from our datasets:

Objaverse ID 0fa42f5b83084f0eb32533b760c8d146

Spatial detailed description, ground truth: This is the 3D model of scene with a man, two sheep and a sheepdog. The man is standing between the sheep and the sheepdog is standing in front of them. The man is wearing green trousers, a yellow jacket and a blue cap.

Spatial detailed description, prediction: The 3D model depicts a scene from a farm. A person, clad in a blue hat, light yellow coat, and brown pants with a green lining, is standing next to two white sheep and in front of a sheepdog. The scene suggests a tranquil, rural life setup. The attire of the person suggests a casual, possibly outdoor setting.


Objaverse ID 998cccab49ed496586994d9f4774daa0

Spatial detailed description, ground truth: The 3D model represents a stack of three antique brown suitcases. The suitcases are positioned one on top of the other, with the middle one being the largest. Each suitcase has a rectangular shape with rounded corners, and they are secured with metal rivets. The suitcases have a textured surface that appears to be worn, suggesting age and use. The handles of the suitcases are visible, and they are attached to the body of the suitcases. In particular, the suitcases at the top and at the bottom of the stack have a single handle in the middle of their front side. The suitcase in the middle also has a couple of leather straps and the handle is between them.

Spatial detailed description, prediction: The 3D model represents a stack of three antique brown suitcases. Each suitcase is of the same design, exhibiting a vintage aesthetic. The suitcase in the middle is larger than the others. The suitcases are vertically aligned, with the topmost one tilted slightly forward.


These examples show that S-LLaNA can understand scenes featuring multiple objects and reason about spatial relationships between them. For instance, S-LLaNA can describe various spatial properties such as the relative position of objects and their number in the first example (e.g., a person ... is standing next to two white sheep and in front of a sheepdog) and in the second example (e.g., a stack of three antique brown suitcases ... The suitcases are vertically aligned, with the topmost one tilted slightly forward). Notably, we appreciate how S-LLaNA can also capture relative object dimensions (e.g., The suitcase in the middle is larger than the others).

3. Lacking some ablation studies — for instance, the number of queries in weights2space.

We conduct ablations to analyze sensitivity to the number of queries N (and consequently tri-plane resolution) and token dimensionality C, reporting results in the table below:

| Model | Tri-plane resolution | Token dimensionality | SN-T brief (S-BERT / SimCSE) | ON-T brief (S-BERT / SimCSE) | SN-T detailed (S-BERT / SimCSE) | SN-T QA (S-BERT / SimCSE) | Spatial ObjaNeRF (S-BERT / SimCSE) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S-LLaNA-13b | 32 | 516 | 78.98 / 79.98 | 45.44 / 46.18 | 80.98 / 82.90 | 83.60 / 84.17 | 78.25 / 80.58 |
| S-LLaNA-13b | 16 | 516 | 76.65 / 77.68 | 41.77 / 42.11 | 76.74 / 78.69 | 83.36 / 83.83 | 75.48 / 77.26 |
| S-LLaNA-13b | 32 | 252 | 76.77 / 76.33 | 42.25 / 42.36 | 76.47 / 78.07 | 83.27 / 83.49 | 76.62 / 78.43 |

Each column reports results on various tasks and datasets: ShapeNeRF-Text (SN-T) brief and detailed captioning and QA, ObjaNeRF-Text (ON-T) brief captioning, and Spatial ObjaNeRF.

We observe that reducing the tri-plane resolution or token dimensionality leads to a marginal performance drop (∼2% across all metrics). This behaviour suggests that models with higher capacity (i.e., higher resolution and dimensionality) can encode richer and more fine-grained features from the input NeRF. However, the relatively small drop also indicates that S-LLaNA remains robust and effective even with more compact and computationally efficient representations.

Moreover, we point out that in Sec. 6 (supplementary), we report an additional ablation study of the modules of the weights2space meta-encoder proposed in this paper. This experiment clearly shows that the weights2seq module alone performs poorly compared to the full architecture.

Comment

Thanks for your thoughtful response, which addresses most of my concerns. Although the authors responded to the question about generalization, I believe that using a real scene from Objaverse-NeRF-Text does not provide strong evidence, since the model has already seen data from this domain during training. Hence, I keep my score.

Comment

Thank you for your valuable feedback and for encouraging us to evaluate S-LLaNA on a multi-object NeRF captured in the wild and sourced from a dataset outside the training data distribution of our model.

To address this point, we ran an additional experiment using a real-world NeRF scene not included in the training data of S-LLaNA. Specifically, we selected a scene from the dataset of mip-NeRF 360 [1], which is shown in both the original paper (Figure 1) and the main video on the mip-NeRF 360 project webpage.

The scene portrays an outdoor garden with lush green vegetation. At its center stands a round wooden table, with a vase of flowers placed on top. The full scene contains several visual elements and spatial cues that make it a meaningful test case for evaluating multi-object understanding.

We trained a NeRF on this scene and used it as input to S-LLaNA. Below, we report outputs from the model's conversations:

Q: Provide a very detailed description of the scene. Focus on the relative positions of the objects in the scene. Include as many details as possible.

A: This is a 3D model of an outdoor scene from a park or a garden. It features a round wooden table with a vase of flowers on top. The background is green, covered with plants. The scene suggests a peaceful, rural life setup.

Q: What is placed on top of the table?

A: A vase is placed on top of the table.

Q: Where is the vase placed in the scene?

A: The vase is placed on a table, which is positioned in the center of the scene.

This experiment, which will be included in the final version of the paper in case of acceptance, provides hints on the robustness and generalization of S-LLaNA. In particular, the model can accurately answer spatial and semantic questions about a real-world, out-of-distribution NeRF scene containing multiple objects.

We would like to clarify that we did not run this kind of experiment in the rebuttal because we interpreted your questions about generalization to real-world applications and multi-object scenes as two distinct concerns.

In fact, to address real-world generalization, we included evaluations on individual real-world objects from Objaverse and OmniObject3D. Importantly, OmniObject3D is a completely independent dataset that was not used in any form during the training of S-LLaNA. By including examples from this dataset, we demonstrate that S-LLaNA can effectively generalize to entirely unseen real-world objects.

Regarding multi-object scenes, we provided examples of scenes with multiple objects from ObjaNeRF-Text, demonstrating how our model is able to understand spatial relationships between multiple objects.

[1] J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman, “Mip-nerf 360: Unbounded anti-aliased neural radiance fields,” in CVPR, 2022, pp. 5470–5479.

Comment

Thanks for your kind response. Actually, I still think this paper sits at the borderline-accept level due to the lack of large-scale quantitative experiments proving the generalization of weights2space.

Comment

Dear Reviewer, thanks a lot for taking further time to read our comment and analyze our additional qualitative experiment. Regarding your feedback, we feel it necessary to clarify that it is impossible to run other large quantitative experiments besides those already included in our paper, because no other NeRF-language datasets with ground-truth annotations exist besides those we already use. Indeed, our direct competitor, LLaNA [7], carries out its quantitative experimental evaluation on exactly the same datasets as ours, i.e., the only ones that exist as of today. Thanks a lot again for your time and valuable comments.

Official Review
Rating: 4

The authors proposed a self-supervised framework featuring a novel meta-encoder (weights2space) that can compute a sequence of spatial tokens directly from NeRF weights. The idea is that it will output a sequence of tokens that are reorganized into a tri-plane structure so that it can extract more detailed and localized information. The authors then incorporate this idea into what they call Spatial LLaNA, a new version of Multimodal Large Language Models, and evaluate its performance in various datasets (including developing a benchmark).

Strengths and Weaknesses

Quality: The authors provide extensive experiment results, including a comparison to LLaNA, their direct competitor, as well as the other reference baselines that LLaNA compared against. They also compare across three tasks: brief captioning, detailed captioning, and single-round Q&A conversations.

Clarity: The authors include multiple figures to facilitate reading. However, the figures can be improved with better annotation and descriptive titles. The paper is overall easy to follow. But when the authors introduce weights2space and S-LLaNA (together with NeRF, MLLM) at the beginning, it can be a little confusing how the ideas are related. Writing can also be improved, in Section 4 especially. There are grammatical errors that make some sentences hard to understand.

Significance & Originality: This is my biggest question. While the experiment results demonstrate that S-LLaNA gains some performance improvement, the encoder that the authors proposed (weights2space) doesn't seem like a fundamental improvement. I understand that the authors discussed some of the aspects in Related Work, but I'd like to ask the authors to clarify more about their contribution.

Questions

(1) The significance of your work, especially compared to LLaNA. See Strengths And Weaknesses.

(2) Can you explain how to get the feature vector in your tri-plane representation? Specifically, you mentioned that "for each projection, bilinear interpolation is performed on the four nearest neighbors’ features within the corresponding plane to extract three vectors of dimension."

(3) Any intuition on why introducing the triplane representation can improve the model's understanding in part-whole hierarchies?

Limitations

yes

Final Justification

I increased my score because the authors clarified my question about their contribution.

Formatting Issues

NA

Author Response

Weakness 1. Clarity: The authors include multiple figures to facilitate reading. However, the figures can be improved with better annotation and descriptive titles. The paper is overall easy to follow. But when authors introduce weights2space and S-LLaNA (together with NERF, MLLM) at the beginning, it can be a little confusing about how the ideas are related. Writing can also be improved in Section 4, especially. There are grammatical errors that make some sentences hard to understand.

We thank the reviewer for the suggestion. We will include more descriptive captions for Figures 2 and 3 to enhance clarity. The captions will be:

Figure 2: Overview of weights2space. Given the weights of a NeRF MLP, our meta-encoder computes a sequence of spatially-aware tokens, $Y$. These tokens are reshaped into tri-plane feature maps, $F_{xy}$, $F_{xz}$, and $F_{yz}$, which are used to obtain a feature $f$ for each 3D point $p_i$. This feature is then processed by a decoder that computes the radiance field values for $p_i$. For a given viewpoint, radiance field values are sampled along camera rays, and pixel RGB values are obtained through volumetric rendering. The meta-encoder and decoder are jointly trained in a self-supervised manner to reconstruct the image rendered by the original NeRF from the same viewpoint.

Figure 3: Overview of Spatial-LLaNA. The sequence of spatially-aware tokens $Y$ is processed by a Linear Projection Network to map such vectors into the input embedding space of the LLM we employ, LLaMA. The projected tokens are concatenated with the text tokens of the input question and fed into the LLM for the generation of the output text. Both the Projector and the LLM are optimized through language supervision.
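For concreteness, a minimal sketch of the interface described in the Figure 3 caption is given below; the module name and dimensions are illustrative assumptions (516-dimensional NeRF tokens, a 5120-dimensional LLM embedding space, i.e., the LLaMA-2-13b hidden size), not the exact implementation.

```python
import torch
import torch.nn as nn

class LinearProjectionSketch(nn.Module):
    """Maps the spatially-aware NeRF tokens Y into the LLM input embedding space."""

    def __init__(self, token_dim=516, llm_dim=5120):
        super().__init__()
        self.proj = nn.Linear(token_dim, llm_dim)

    def forward(self, nerf_tokens, text_embeds):
        # nerf_tokens: (B, N, token_dim); text_embeds: (B, T, llm_dim) question embeddings.
        projected = self.proj(nerf_tokens)                  # (B, N, llm_dim)
        # Prepend the projected NeRF tokens to the text tokens before feeding the LLM.
        return torch.cat([projected, text_embeds], dim=1)
```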

We will also revise Section 4 in the final version of this paper to improve its fluency.

Weakness 2. Significance & Originality: This is my biggest question. While the experiment results demonstrate that S-LLaNA gains some performance improvement, the encoder that the authors proposed (weights2space) doesn't seem like a fundamental improvement. I understand that the authors discussed some of the aspects in Related work, but I'd like to ask the authors to clarify more about their contribution.

&

Question (1) - The significance of your work, especially compared to LLaNA. See Strengths And Weaknesses.

Thank you for your comment. We would like to clarify the novelty and significance of our contribution compared to previous work, i.e., LLaNA.

LLaNA introduced the idea of encoding a NeRF scene into a single global latent vector, which enabled a language assistant to reason about neural radiance fields. However, this global representation discards all spatial structure, limiting the ability of the assistant to capture spatial and local information. In contrast, S-LLaNA proposes a fundamentally different and more expressive representation computed thanks to our original meta-encoder architecture, dubbed weights2space. Unlike the meta-encoder design described in LLaNA, weights2space does not collapse the NeRF weights into a single embedding. Instead, it produces a sequence of spatially-aware tokens enabling the assistant to reason on spatial relationships among objects, capture fine details, and understand part-whole hierarchies.

From a slightly more technical standpoint, we wish to point out that the key originality of the S-LLaNA meta-encoder pertains to the seq2space module, a transformer decoder that cross-attends to the encoded weights of the input NeRF, along with the downstream tri-plane representation. These two components enable the model to extract spatially grounded features from the information spread across the weights of the input NeRF. In particular, because the output tokens of seq2space are reshaped into a tri-plane to enable rendering views from the input NeRF, the features learned by the S-LLaNA meta-encoder are strongly correlated with the object's parts and spatial layout. This novel strategy for eliciting spatially-aware features from the information distributed across the weights of the input NeRF marks a clear shift from the approach described in LLaNA, which compresses the weights into a global embedding and lacks any mechanism to preserve spatial structure.

Therefore, by enabling rich and detailed spatial reasoning on 3D objects, the design of S-LLaNA represents a significant step forward compared to LLaNA in advancing the capabilities of language assistants that can accept NeRFs as inputs.

Question (2) - Can you explain how to get the feature vector in your tri-plane representation? Specifically, you mentioned that "for each projection, bilinear interpolation is performed on the four nearest neighbors’ features within the corresponding plane to extract three vectors of dimension."

We provide below the algorithm employed to obtain feature vectors in our tri-plane representation. We will add this information to the supplementary material.

Inputs:

  • Tri-plane feature maps: three orthogonal 2D feature planes $(F_{xy}, F_{xz}, F_{yz})$, each of dimension $\mathbb{R}^{H \times W \times C}$.

  • A 3D point, $p = (x, y, z) \in \mathbb{R}^3$.

  • Axis-aligned bounding box for coordinate normalization, $aabb_{min}, aabb_{max} \in \mathbb{R}^3$.

Output:

  • Interpolated feature vector at point $p$, $f \in \mathbb{R}^C$.

1. Normalize coordinates

Normalize the 3D point $p$ to $[0, 1]$:

$coords_{01} = \dfrac{p - aabb_{min}}{aabb_{max} - aabb_{min}} \in [0, 1]$

2. Project the 3D point to 2D coordinates on the tri-planes

$p_{xy} = (u_{xy}, v_{xy}) = (coords_{01}.x, coords_{01}.y)$
$p_{xz} = (u_{xz}, v_{xz}) = (coords_{01}.x, coords_{01}.z)$
$p_{yz} = (u_{yz}, v_{yz}) = (coords_{01}.y, coords_{01}.z)$

3. Perform bilinear interpolation on each plane

Given the $xy$ feature plane, $F_{xy}$, and the projected coordinate of $p$, $p_{xy}$:

  1. Map from $[0, 1]$ to tri-plane grid coordinates:
     $u_{img} = u_{xy}(W - 1) \qquad v_{img} = v_{xy}(H - 1)$
  2. Compute integer and fractional parts:
     $i = \lfloor u_{img} \rfloor, \quad j = \lfloor v_{img} \rfloor, \qquad \delta_u = u_{img} - i, \quad \delta_v = v_{img} - j$
  3. Retrieve the four neighboring features:
     • $f_{00} = F_{xy}[j, i, :]$ (top-left)
     • $f_{10} = F_{xy}[j, i+1, :]$ (top-right)
     • $f_{01} = F_{xy}[j+1, i, :]$ (bottom-left)
     • $f_{11} = F_{xy}[j+1, i+1, :]$ (bottom-right)
  4. Perform bilinear interpolation:
     $f_{xy} = (1 - \delta_u)(1 - \delta_v)f_{00} + \delta_u(1 - \delta_v)f_{10} + (1 - \delta_u)\delta_v f_{01} + \delta_u \delta_v f_{11}$

Repeat steps 1 to 4 for the other planes and projected coordinates to obtain:

  • $f_{xz}$ from $F_{xz}$ and $p_{xz}$
  • $f_{yz}$ from $F_{yz}$ and $p_{yz}$

4. Combine features

Sum the three interpolated features to obtain the final feature vector for point $p$:

$f = f_{xy} + f_{xz} + f_{yz} \in \mathbb{R}^C$
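A compact NumPy sketch of the procedure above is reported below; clamping at the plane border, which is not specified in the steps above, is an added assumption.

```python
import numpy as np

def interpolate_triplane(F_xy, F_xz, F_yz, p, aabb_min, aabb_max):
    """Bilinearly interpolates tri-plane features at a 3D point p (sketch).
    F_*: (H, W, C) feature planes; p, aabb_min, aabb_max: (3,) arrays."""
    c01 = (p - aabb_min) / (aabb_max - aabb_min)           # step 1: normalize to [0, 1]
    projections = [                                        # step 2: project onto the planes
        (F_xy, c01[0], c01[1]),
        (F_xz, c01[0], c01[2]),
        (F_yz, c01[1], c01[2]),
    ]
    f = 0.0
    for plane, u, v in projections:                        # step 3: bilinear interpolation
        H, W, _ = plane.shape
        u_img, v_img = u * (W - 1), v * (H - 1)            # grid coordinates
        i, j = int(np.floor(u_img)), int(np.floor(v_img))  # integer parts
        du, dv = u_img - i, v_img - j                      # fractional parts
        i1, j1 = min(i + 1, W - 1), min(j + 1, H - 1)      # clamp at the border (assumption)
        f00, f10 = plane[j, i], plane[j, i1]
        f01, f11 = plane[j1, i], plane[j1, i1]
        f = f + ((1 - du) * (1 - dv) * f00 + du * (1 - dv) * f10
                 + (1 - du) * dv * f01 + du * dv * f11)    # step 4: sum over the three planes
    return f                                               # (C,) feature vector at p
```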

Question (3) - Any intuition on why introducing the triplane representation can improve the model's understanding in part-whole hierarchies?

The explicit tri-plane spatial structure allows the model to associate specific tokens with specific regions of the scene. Consequently, as tokens capture local part-level information, while the overall spatial configuration provides cues about how these parts compose the global scene, this representation can be effectively processed by the downstream LLM for understanding part-whole hierarchies.

For instance, if we have two spatial tokens representing a lamp and a chair that are located one above the other in the tri-plane representation, then a language model processing these tokens can infer that the lamp is positioned above the chair. The same reasoning applies in the case of individual objects, where spatial tokens correspond to different parts of the object. Based on their relative positions in the tri-plane, the model can correctly infer that the head is above the torso, and the torso is above the legs. Furthermore, by understanding how these parts are composed, the model can also infer the identity of the whole object. For instance, if the head, torso, and legs are vertically aligned, the object is likely a standing person.

These considerations are supported by the visualization in Figure 5 (main paper) and Figure 7 (supplementary), where the attention maps are clearly localized. Specifically, the tokens receiving the highest attention scores correspond to spatial regions that align with the referenced elements in the scene. This suggests that the LLM is capable of identifying both what and where the relevant parts of the scene are in response to a given question, thanks to the triplane representation, thereby enabling it to generate more effective and spatially grounded answers.

Comment

I thank the authors for their response. It answers my questions. I will increase my score.

Official Review
Rating: 4

This paper works on Multimodal Large Language Models (MLLMs) equipped with Neural Radiance Fields (NeRFs). Previous work utilizes the weights of a NeRF's MLP as a global representation of an object, which may not be suitable for capturing details. This paper instead introduces a self-supervised way to learn a meta-encoder and feed more representative tokens into the LLM. A new manually annotated dataset is developed, and the proposed method shows promising results on this new dataset compared to other NeRF-based MLLM approaches.

Strengths and Weaknesses

Strengths: Previous work uses the global representation of NeRFs as the input of MLLMs, while this work uses more detailed tokens to enable fine-grained understanding in LLMs. I reckon this is a significant improvement for NeRF-based MLLMs. Compared to LLaNA, S-LLaNA consistently shows a large margin on different metrics of different datasets, which demonstrates the effectiveness of the proposed method.

Weaknesses: It seems like all the experiments and ablations are conducted on the caption (e.g. S-BERT, SimCSE, BLEU-1, ROUGE-L, and METEOR) while the multi-choice and numerical evaluation is missing. It would be helpful if the authors could provide such evaluation on the proposed dataset or existing datasets. Some efforts should be made to annotate some multi-choice and numerical Q&As on the proposed dataset, though the authors have claimed such a limitation.

Questions

The framework has two modules, self-supervised meta encoder and MLLMs. They are trained separately but how to determine the best checkpoint of the meta encoder for the second stage training, i.e. MLLMs? Does it use the checkpoint of the third epoch for each training stage?

Limitations

In the supplementary material, the authors state that the trained model may perform poorly in the out-of-distribution data. To improve the generalization of the model, the scale and diversity of data can be investigated.

Formatting Issues

No concerns about formatting issues.

Author Response

Weakness. It seems like all the experiments and ablations are conducted on the caption (e.g. S-BERT, SimCSE, BLEU-1, ROUGE-L, and METEOR) while the multi-choice and numerical evaluation is missing. It would be helpful if the authors could provide such evaluation on the proposed dataset or existing datasets. Some efforts should be made to annotate some multi-choice and numerical Q&As on the proposed dataset, though the authors have claimed such a limitation.

To better assess NeRF-language understanding and text generation capabilities, we built a new test set of spatial multiple-choice questions focused on the 100 objects from the Spatial ObjaNeRF dataset. These questions are crafted to probe fine-grained spatial reasoning within the 3D scenes. Each object is associated with one or more questions, resulting in a total of 190 multiple-choice tasks, each with three answer options.

Below are two representative examples from the dataset:

Object id: ac58cb3a28cb4a0d92ce1016f74fa5ff

Question: How many spouts does the teapot have?

  • A: One
  • B: Two
  • C: Four

Ground-truth answer: C: Four


Object id: 103989411047470ab9f86341fd016539

Question: Where is the bowl of the fruit placed with respect to the book?

  • A: The bowl of fruit is next to the book
  • B: The bowl of fruit is on top of the book
  • C: The bowl of fruit is under the book

Ground-truth answer: A: The bowl of fruit is next to the book


The results of our evaluation on this dataset are summarized in the table below:

| Model | Modality | Accuracy (%) | LLM Accuracy (%) |
| --- | --- | --- | --- |
| GPT4Point-Opt-2.7b | Point cloud | 15.26 | 47.89 |
| LLaVA-vicuna-7b | Image (RV) | 63.16 | 63.16 |
| PointLLM-7b | Point cloud | 24.74 | 49.47 |
| ShapeLLM-7b | Point cloud | 23.16 | 44.21 |
| LLaNA-7b | NeRF | 6.32 | 47.89 |
| S-LLaNA-7b | NeRF | 22.63 | 54.74 |
| LLaVA-vicuna-13b | Image (RV) | 64.74 | 64.74 |
| PointLLM-13b | Point cloud | 24.74 | 54.21 |
| ShapeLLM-13b | Point cloud | 23.16 | 44.74 |
| LLaNA-13b | NeRF | 7.37 | 46.32 |
| S-LLaNA-13b | NeRF | 23.68 | 55.79 |

The "Accuracy" column reflects cases in which the model's output exactly matched the ground truth answer with the correct format, e.g., "C: Four". As the results show, LLaNA, S-LLaNA, and the other 3D MLLMs generally struggle on this task, while LLaVA achieves significantly higher accuracy.

We hypothesize that this is largely due to LLaVA having been trained on multiple-choice QA tasks, whereas the 3D MLLMs and NeRF-language models evaluated in this work were not. Consequently, these models often produce free-form textual responses rather than selecting an explicit multiple-choice answer in the expected format (i.e., "letter: answer"). While the correct answer is frequently mentioned in their generated text, it is not structured in a way that permits direct comparison with the ground truth. For example, we report the answers of S-LLaNA to the two questions shown above:

Object id: ac58cb3a28cb4a0d92ce1016f74fa5ff

S-LLaNA: The teapot has four spouts.


Object id: 103989411047470ab9f86341fd016539

S-LLaNA: The fruit bowl is placed on the table next to the book.

To address this limitation and ensure a fair comparison, we implement an auxiliary evaluation protocol. Specifically, we leverage a strong instruction-following LLM (LLaMA-3.3-70B-Instruct) to interpret the free-form outputs of each model and map them to the closest valid multiple-choice option, when possible. The resulting "LLM Accuracy" column reflects this refined evaluation.
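For reference, the sketch below illustrates how such an LLM-based mapping can be prompted; the prompt wording and the parsing step are our illustrative assumptions, not necessarily the exact protocol used in the evaluation.

```python
def build_judge_prompt(question, options, model_answer):
    """Builds a prompt asking an instruction-tuned LLM (e.g., LLaMA-3.3-70B-Instruct)
    to map a free-form answer to the closest multiple-choice option (illustrative)."""
    listed = "\n".join(f"{letter}: {text}" for letter, text in options.items())
    return (
        "You are grading a multiple-choice answer.\n"
        f"Question: {question}\n"
        f"Options:\n{listed}\n"
        f"Model answer: {model_answer}\n"
        "Reply with the single letter of the option that best matches the model answer, "
        "or 'None' if no option matches."
    )

# Example with the teapot question reported above.
prompt = build_judge_prompt(
    question="How many spouts does the teapot have?",
    options={"A": "One", "B": "Two", "C": "Four"},
    model_answer="The teapot has four spouts.",
)
# The prompt is then sent to the judge LLM and the returned letter is compared
# against the ground-truth option to compute "LLM Accuracy".
```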

The scores of LLaVA remain consistent across both evaluation strategies, proving the effectiveness of the LLM-based evaluation protocol and the proper answer formatting of this model. Among all baselines, LLaVA remains the top performer, likely due to its explicit exposure to multiple-choice QA during training.

Importantly, among the models that were not trained on multiple-choice QA tasks, S-LLaNA emerges as the best performer by a notable margin, outperforming both LLaNA and other 3D MLLMs. This suggests that S-LLaNA has the potential to perform well on such diverse NeRF-language tasks.

This multi-choice QA dataset will be publicly released, and the evaluation results will be included in the final version of the paper.

Questions. The framework has two modules, self-supervised meta encoder and MLLMs. They are trained separately but how to determine the best checkpoint of the meta encoder for the second stage training, i.e. MLLMs? Does it use the checkpoint of the third epoch for each training stage?

To determine the best checkpoint of the self-supervised meta encoder for the second-stage MLLM training, we select the one that achieves the highest reconstruction PSNR on the validation set. In representation learning scenarios such as ours, where the objective is to learn rich and spatially grounded embeddings of the input scene, PSNR provides a strong and interpretable proxy for feature quality. This approach has also been adopted in prior works such as [6], [7], and [51] for downstream language and semantic tasks, respectively, demonstrating strong empirical performance. By selecting the checkpoint with the highest PSNR, we ensure that the MLLM is trained on top of a robust and well-aligned scene representation.
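As a minimal sketch of this selection criterion (assuming rendered and reference views normalized to [0, 1]; function names are illustrative):

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio between a rendered view and the reference view."""
    mse = np.mean((rendered - reference) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def select_best_checkpoint(val_psnr_per_checkpoint):
    """val_psnr_per_checkpoint: dict mapping checkpoint id -> mean validation PSNR."""
    return max(val_psnr_per_checkpoint, key=val_psnr_per_checkpoint.get)
```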

Limitations. In the supplementary material, the authors state that the trained model may perform poorly in the out-of-distribution data. To improve the generalization of the model, the scale and diversity of data can be investigated.

During the rebuttal phase, we conducted additional experiments suggested by the reviewers to probe the generalization capabilities of S-LLaNA further. In particular, we evaluated S-LLaNA also on another dataset containing real objects, OmniObject3D. To conduct this experiment, we fit NeRFs for a few objects employing the images officially released with the dataset. These NeRFs are provided as input to our framework. Below, we show some qualitative results:


OmniObject3D ID: house_004

Ground Truth: The house has a main building and an annex with doors on the left and right sides, a man standing in front of the main door and a coconut tree behind it, the main part of the house is mainly brown in color and the eaves are yellow, pointed, triangular and not symmetrical.

Prediction: This is a 3D model of a cartoon-style building, predominantly brown in color, with a distinctive pink facade. A large coconut tree is behind the building. The structure appears to be a whimsical representation of a building, possibly from a children's animated show or game. The yellow roof adds a playful touch, making the building appear less realistic and more imaginative.


OmniObject3D ID: toy_train_044

Ground Truth: This toy train is overall green, the front is a cartoon face head, the back is dark green and with a small bump, the rear part is green with gold stripes, there is also a small bump, there are some irregular patterns on the side, the front and rear each have a bump, a total of six wheels, four small, two large, the overall structure of the axis of symmetry.

Prediction: This is a 3D model of a cartoon-style, green-colored train. At the front of the train, a cartoon-style head is visible. Its vibrant green color makes it eye-catching. This type of object could be used in various applications such as animation, video game design, or digital art due to its playful and modern aesthetic.


From the examples above, we can appreciate how S-LLaNA generalizes effectively to real-world NeRFs belonging to a dataset never seen during training, demonstrating that S-LLaNA can effectively handle some degree of domain shift. For instance, in the first example, it correctly describes relative positioning and object properties: A large coconut tree is behind the building ... The yellow roof. In the second example, S-LLaNA can capture object properties and fine-grained details: green-colored train ... at the front of the train, a cartoon-style head.

Nevertheless, we acknowledge that increasing the diversity and scale of the training data represents a promising direction to further improve the generalization capabilities of the model on out-of-distribution scenarios, and is therefore worth investigating in future work.

Comment

Thanks for the authors' response. After reading the rebuttal, I find that my concerns have been addressed. But I feel the work is still at the borderline of the NeurIPS bar, thus I keep my rating unchanged at borderline accept.

Comment

We thank you for taking the time to read our rebuttal and for acknowledging that we addressed the concerns you raised. Should there be any remaining doubts or aspects of the work that could benefit from further clarification, we would be more than happy to provide additional details.

Official Review
Rating: 4

This paper introduces weights2space, a self-supervised meta-encoder that reorganises NeRF-MLP weights into 3072 spatially-aware tokens arranged as a 32 × 32 tri-plane. Plugging these tokens into a frozen LLaMA-2 via a lightweight projector yields Spatial LLaNA, a NeRF-language model capable of fine-grained spatial reasoning. The authors introduce a new benchmark, Spatial ObjaNeRF. Across four datasets and three tasks (brief captioning, detailed captioning, single-round Q&A), S-LLaNA surpasses LLaNA and 2D/3D MLLM baselines.

Strengths and Weaknesses

[+] Clear architecture, end-to-end self-supervised training.

[+] Strong empirical gains and insightful attention visualisations.

[+] New dataset that targets spatial understanding.

[-] Lack of Ablations: Please provide sensitivity to the number of queries N, token dimensionality C, and tri-plane resolution.

[-] Real-world robustness: Can the model handle photogrammetry or in-the-wild neural fields?

[-] Given that tri-plane representations empirically work well, why not train the tri-plane tokens directly instead of first fitting a standard NeRF MLP and then converting its weights into tri-plane tokens? What technical or practical advantages does the two-step pipeline offer over a native tri-plane approach?

Questions

Please check the weaknesses.

Limitations

No.

Final Justification

The authors addressed my concerns. I keep my score

Formatting Issues

No.

Author Response

1. Lack of Ablations: Please provide sensitivity to the number of queries N, token dimensionality C, and tri-plane resolution.

We conducted several ablations to analyze sensitivity to the number of queries N (and consequently tri-plane resolution) and token dimensionality C. We report results in the table below. In particular, each table row reports:

  • First row: the official S-LLaNA version of the main paper.
  • Second row: an experiment to analyze the impact of the number of queries, and thus of the tri-plane resolution. Here, weights2space and S-LLaNA were trained using $N = 768$ queries. As each tri-plane is obtained by reshaping the queries into 3 planes of spatial dimension $H \times W$, with $H = W$ in our experiments, the corresponding tri-plane resolution is $\sqrt{768 / 3} = 16$.
  • Third row: an experiment to analyze the sensitivity to token dimensionality $C$. S-LLaNA was trained with $N = 3072$ tokens (i.e., tri-plane resolution 32) as in the main paper, yet with a smaller token dimensionality $C = 252$.
| Model | Tri-plane resolution | Token dimensionality | SN-T brief (S-BERT / SimCSE) | ON-T brief (S-BERT / SimCSE) | SN-T detailed (S-BERT / SimCSE) | SN-T QA (S-BERT / SimCSE) | Spatial ObjaNeRF (S-BERT / SimCSE) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| S-LLaNA-13b | 32 | 516 | 78.98 / 79.98 | 45.44 / 46.18 | 80.98 / 82.90 | 83.60 / 84.17 | 78.25 / 80.58 |
| S-LLaNA-13b | 16 | 516 | 76.65 / 77.68 | 41.77 / 42.11 | 76.74 / 78.69 | 83.36 / 83.83 | 75.48 / 77.26 |
| S-LLaNA-13b | 32 | 252 | 76.77 / 76.33 | 42.25 / 42.36 | 76.47 / 78.07 | 83.27 / 83.49 | 76.62 / 78.43 |

Each column reports the S-BERT and SimCSE results on various tasks and datasets: ShapeNeRF-Text (SN-T) brief and detailed captioning and QA, ObjaNeRF-Text (ON-T) brief captioning, and Spatial ObjaNeRF.  

We observe that reducing the tri-plane resolution or token dimensionality leads to a marginal performance drop (∼2% across all metrics). This behaviour suggests that models with higher capacity (i.e., higher resolution and dimensionality) can encode richer and more fine-grained features from the input NeRF. However, the relatively small drop also indicates that S-LLaNA remains robust and effective even with more compact and computationally efficient representations.

Moreover, we point out that in Sec. 6 (supplementary), we report an additional ablation study of the modules of the weights2space meta-encoder proposed in this paper. This experiment clearly shows that the weights2seq module alone performs poorly compared to the full architecture.

2. Real-world robustness: Can the model handle photogrammetry or in-the-wild neural fields?

Our experiments already demonstrate that S-LLaNA can handle real-world data. Indeed, the Objaverse dataset [17] used in the experiments of Table 2 and Table 5 (main paper) includes several real-world objects and scenes captured through diverse 3D acquisition methods, photogrammetry among them. Below, we show two representative examples from our test set featuring a real scene (Example 1) and a real-world object (Example 2).


Example 1 (Real Scene)

Objaverse ID ba341c4ce89647ea9f6996ec58e3eacf

Spatial detailed description, ground truth: This is a 3D scene with two tractors, a loader, and a trailer parked on a plot of land with plenty of wood waiting to be loaded. The tractors are positioned side by side, with the loader attached to one of them. The trailer is parked in front of the tractors, and the wood is stacked neatly beside the tractors. The ground appears to be a mix of dirt and patches of grass, and there are no visible signs of recent activity, suggesting the scene might be a still moment in time.

Spatial detailed description, prediction: The 3D model represents a typical farming scene. It features two tractors, one of which is a large, robust machine with a visible driver's cabin, indicating it's likely used for heavy-duty tasks. The other tractor is smaller, possibly a compact model, which might be used for lighter tasks or for navigating narrower spaces. Both tractors are parked on a plot of land, suggesting they are either in a state of rest or not currently in use. There is also a loader, a versatile piece of equipment used for lifting and moving materials, which implies that the land might be a construction or building site.


Example 2 (Real Object)

Objaverse ID: c4a0c2e2fb624bc0af9928b0ae6407ff

Spatial detailed description, ground truth: The blue teddy bear is depicted with a soft, plush texture, and its color is a consistent shade of blue throughout. The bear has a round head with two small, round ears, a snout, and a pair of eyes. It has a body with a round belly, and its arms are short and stubby. The bear's legs are also short and rounded. The teddy bear is in a sitting position. Around its neck, there is a white ribbon.

Spatial detailed description, prediction: The 3D object model is a toy, specifically a teddy bear, painted in a vibrant shade of blue. The teddy bear is in a seated position, showcasing a white plush neck, enhancing its cute appeal. The teddy bear, being a universal symbol of comfort and care, is usually used as a plaything for children or as a decorative item in a nursery room or any child-oriented space.


The examples above show that S-LLaNA produces accurate descriptions of real-world entities. For instance, in Example 1, it properly describes the scene: "it features two tractors ... both tractors are parked on a plot of land ... there is also a loader". Moreover, in Example 2, S-LLaNA recognizes the main features of the object: "a teddy bear, painted in a vibrant shade of blue ... in a seated position ... a white plush neck".

Furthermore, to validate the applicability of S-LLaNA to in-the-wild NeRFs, we also test it on another dataset containing real objects, OmniObject3D. To conduct this experiment, we fit NeRFs for a few objects employing the images officially released with the dataset. These NeRFs are provided as input to our framework. Below, we show some qualitative results:


OmniObject3D ID: house_004

Ground Truth: The house has a main building and an annex with doors on the left and right sides, a man standing in front of the main door and a coconut tree behind it, the main part of the house is mainly brown in color and the eaves are yellow, pointed, triangular and not symmetrical.

Prediction: This is a 3D model of a cartoon-style building, predominantly brown in color, with a distinctive pink facade. A large coconut tree is behind the building. The structure appears to be a whimsical representation of a building, possibly from a children's animated show or game. The yellow roof adds a playful touch, making the building appear less realistic and more imaginative.


OmniObject3D ID: toy_train_044

Ground Truth: This toy train is overall green, the front is a cartoon face head, the back is dark green and with a small bump, the rear part is green with gold stripes, there is also a small bump, there are some irregular patterns on the side, the front and rear each have a bump, a total of six wheels, four small, two large, the overall structure of the axis of symmetry.

Prediction: This is a 3D model of a cartoon-style, green-colored train. At the front of the train, a cartoon-style head is visible. Its vibrant green color makes it eye-catching. This type of object could be used in various applications such as animation, video game design, or digital art due to its playful and modern aesthetic.


From the examples above, we can appreciate how S-LLaNA generalizes effectively to real-world NeRFs from a dataset never seen during training. For instance, in the first example, it correctly describes relative positioning and object properties: "A large coconut tree is behind the building ... The yellow roof". In the second example, S-LLaNA captures object properties and fine-grained details: "green-colored train ... at the front of the train, a cartoon-style head".


Note to visualize example models

Regarding ObjaNeRF-Text, the 3D models can be accessed by downloading the Objaverse dataset from the official website and locating the models using the Objaverse IDs provided above.

Regarding OmniObject3D, the dataset can be downloaded using the openxlab library, specifying the data folder house or toy_train for the first and second example, respectively. To visualize the objects, refer to the folders house_004 and toy_train_044.

3. Given that tri-plane representations empirically work well, why not train the tri-plane tokens directly instead of first fitting a standard NeRF MLP and then converting its weights into tri-plane tokens? What technical or practical advantages does the two-step pipeline offer over a native tri-plane approach?

Our decision to adopt the NeRF MLP representation was driven by both practical and methodological considerations.

First, our work follows the same setup as LLaNA, which focuses on MLP-based NeRFs. The MLP representation is the original parameterization used in the first NeRF paper, and consequently, many NeRFs available online adopt it. Moreover, this is the only representation used in all existing NeRF-Language datasets. Therefore, using this setup allows for direct comparison with LLaNA and compatibility with existing NeRF-Language benchmarks.

Second, as there are currently no publicly available NeRF-language datasets built on tri-plane-based representations, transitioning from MLP to tri-plane NeRFs would require rendering multi-view images for each object and subsequently fitting the corresponding tri-plane NeRF. This process introduces a substantial computational burden and complexity.

Nevertheless, we agree that designing a NeRF-Language model that takes as input also the tri-plane representation is a promising research direction towards a universal NeRF assistant.

评论

Thank you for your review and valuable feedback. We have submitted a detailed rebuttal addressing your concerns and would be happy to provide further clarification or engage in additional discussion if needed. Please feel free to let us know if any part of our response requires further elaboration. We appreciate your time and consideration during the rebuttal process.

最终决定

The authors propose a method of tokenising NeRFs so that they can effectively become a new input modality for MLLMs. In contrast to prior work (LLaNA), the authors' NeRF tokenisation is location-aware. Unlike LLaNA, which directly encodes the NeRF weights as tokens, the proposed tokens are made location-aware by projecting onto a tri-plane representation of the xy, yz, and xz planes. To verify the location awareness, the authors also propose a new dataset, Spatial ObjaNeRF, for spatial reasoning tasks. The authors show strong performance on NeRF captioning and QA tasks.

Reviewers appreciated the strong empirical results, the motivation of the approach, and the evaluation dataset proposed by the authors. The authors posted convincing rebuttals, addressing most of the reviewers' concerns during the review period by adding ablation studies, adding multiple-choice questions to the dataset, and answering several clarification questions raised by the reviewers.

The sole negative-leaning reviewer (Vie1) mentioned the quality of the writing as the main limitation of the paper. However, after some deliberation, the AC and SAC found the writing quality to be sufficient and believe that the remaining issues can be fixed by addressing the reviewers' comments in the camera-ready version. Therefore, on balance, the final decision is to accept the paper. Please incorporate all updates from the rebuttal into the camera-ready.