LLaNA: Large Language and NeRF Assistant
We propose LLaNA, the first Multimodal Large Language Model (MLLM) able to perform NeRF-language tasks, such as NeRF captioning and NeRF QA.
Abstract
Reviews and Discussion
In this submission the authors propose a new pipeline that enables Large Language Models to interact with trained object-centric NeRF models. They achieve this by utilising a pretrained meta-network that ingests the weights of a NeRF MLP and outputs a low-dimensional feature vector. This low-dimensional feature vector is then transformed into the language model's input space via a projection layer, similar to LLaVA. The method is trained on a dataset of 40K NeRF models, which the authors extend with 240K paired text descriptions. Experiments on this dataset show favourable performance compared to prior works that solely work with images or point clouds.
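For context, a minimal sketch of the data flow described in this summary, assuming a frozen meta-encoder and a LLaVA-style linear projector (all names and dimensions below are illustrative assumptions, not the authors' actual API):

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: the real sizes are set by the paper's nf2vec
# meta-encoder and by the LLM (LLaMA) used in LLaNA.
NERF_FEAT_DIM = 1024     # output size of the meta-encoder (assumed)
LLM_EMBED_DIM = 4096     # token embedding size of the LLM (assumed)

class NeRFProjector(nn.Module):
    """Maps a NeRF-level feature vector into the LLM token space (LLaVA-style)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(NERF_FEAT_DIM, LLM_EMBED_DIM)

    def forward(self, nerf_feature: torch.Tensor) -> torch.Tensor:
        # nerf_feature: (batch, NERF_FEAT_DIM) produced by the frozen meta-encoder
        return self.proj(nerf_feature)  # (batch, LLM_EMBED_DIM)

def nerf_to_llm_tokens(nerf_weights: torch.Tensor,
                       meta_encoder: nn.Module,
                       projector: NeRFProjector) -> torch.Tensor:
    """Sketch of the flow: NeRF MLP weights -> global feature -> LLM embedding."""
    with torch.no_grad():
        feature = meta_encoder(nerf_weights)   # frozen, pretrained meta-network
    return projector(feature)                  # trainable projection layer
```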
Strengths
- The proposed problem and method of interacting with NeRFs via a LLM is novel and very interesting, enabling potential new applications in robotics and AR.
- The experiments performed in the submission are promising, with the proposed method achieving impressive results over the chosen baselines.
- The dataset/benchmark and code will be released upon acceptance, which will make follow-up works and comparisons much easier.
- The paper is very well written and concepts are easy to understand.
Weaknesses
In order of severity:
- The comparisons that were performed in the experiments seem unfair towards the baselines. The baselines are not trained or fine-tuned on the dataset proposed in the submission, and it is assumed that they would generalise to this new domain since they were trained on millions of images or hundreds of thousands of 3D shapes. Since the proposed dataset consists only of ShapeNet objects, there will be a domain gap for the baselines to overcome, as they were trained on the Objaverse (Deitke, Matt, et al. "Objaverse: A universe of annotated 3d objects." CVPR 2023) or ModelNet40 (Wu, Zhirong, et al. "3d shapenets: A deep representation for volumetric shapes." CVPR 2015) datasets in the case of GPT4Point [58] and PointLLM [77]. The same logic applies to the BLIP2 and LLaVA baselines, which will not have seen many synthetic object-centric images during their training. This weakens the overall argument to consider NeRFs as an input modality to LLMs because the presented comparisons are flawed.
- While it is a valid argument that view selection is an issue for 2D-LLMs, as stated in line 123, NeRFs can render arbitrary viewpoints after training. It would therefore be possible to render multiple images from varying viewpoints and use all of them as input to modern Multimodal LLMs jointly, e.g. by concatenating their text-tokens after the projection layer. This would provide a more balanced comparison than the single-view baselines that are chosen in the submission, since a lot more information can be passed onto the LLM with multiple views. Another avenue to make a fair comparison to 2D-LLMs like LLaVA would be to encode 2D images in an MLP (as shown for example in: Sitzmann, Vincent, et al. "Implicit neural representations with periodic activation functions." Advances in neural information processing systems 33 (2020): 7462-7473.) and then use these weights as input to the proposed method. Both of these points will probably be infeasible to address in a rebuttal but would be interesting experiments to support the use of implicit representations as input to LLMs.
- It is a bit unclear from the description in line 215 if the test set contains only object classes not seen during training or if they were seen before. This should be clarified.
- In table 5 it is not discussed what a ‘hard’ view is and what constitutes its ‘hardness/complicatedness’. This makes the table confusing and not self-explanatory.
- Stating in line 90 that Ballerini et al. [5] are the first to utilise NeRFs as an input modality is not correct; what about NeSF (Vora, Suhani, et al. "Nesf: Neural semantic fields for generalizable semantic segmentation of 3d scenes." arXiv preprint arXiv:2111.13260 (2021)) or NeRF-RPN (Hu, Benran, et al. "Nerf-rpn: A general framework for object detection in nerfs." CVPR 2023)? It probably refers to being the first to directly utilise NeRF MLP weights as an input modality when considering language tasks. This sentence should be re-written.
Questions
To focus the discussion about the weaknesses raised above here are some questions for the authors:
- Please further justify not finetuning the baselines on the proposed dataset. How is this evaluation done in prior work and what justifications do they give for why they did or did not do this? Do these arguments also apply to this submission?
- Is it possible to train a limited amount of NeRF models on Objaverse or ModelNet40 and then use the proposed pipeline without finetuning on them and compare to PointLLM and GPT4Point? This would be a very interesting experiment to showcase the generalisation ability of the proposed method.
- Can you evaluate the 2D baselines with more views? (related to Weakness 2)
Limitations
The authors addressed the technical limitations of their work quite well but limitations with regards to prior work are not discussed, as they do not exist in the current evaluation scheme. Societal Impact is briefly discussed in the paper checklist.
W1-Q1
Our choice of using frozen models as baselines has been motivated by LLaNA being the only assistant that works on NeRFs, i.e. whose input modality is a radiance field parametrized as a neural network. Indeed, the papers most closely related to ours -- those proposing object-centric assistants that ingest point clouds (PointLLM, GPT4Point) -- employ frozen VLMs as baselines operating on a different modality (i.e., images).
Nevertheless, we agree with the reviewer on the importance of complementing our evaluation with experiments that assess the performance of the baselines trained on ShapeNeRF-Text. Thus, we trained all the baselines with their official code on ShapeNeRF-Text following their official protocol, which, for all of them, keeps the modality-specific encoder frozen and trains an adaptor and the LLM in two steps.
We report the results of the new experiments alongside those already included in the paper (Frozen baseline models) in Tables 1 and 2 (rebuttal PDF; see global comment). We notice that the trained baselines exhibit different behaviors w.r.t. their frozen counterparts, with LLaVA performing significantly worse and PointLLM showing clear improvements. As for GPT4Point, we observe more variability across metrics, though, overall, we conclude that it does not benefit from training on ShapeNeRF-Text. The last row in both Tables 1 and 2 shows that LLaNA yields the best performance compared to all baselines, either frozen or trained on ShapeNeRF-Text.
Finally, we highlight that the new experiments strengthen the key finding of our paper: directly processing the weights of a NeRF is the most effective way to reason about the underlying radiance field. Moreover, converting a NeRF into a different modality, e.g., a point cloud or images, is inefficient and cumbersome (see A.3, main paper). For instance, extracting a point cloud from a NeRF mandates voxelizing the 3D space to evaluate the density function, with the required computation scaling cubically with the desired resolution. Images are also not easy to handle, as NeRFs do not come with an object-centered 3D reference frame, so one would need to guess from which viewpoint(s) to render meaningful images. Obtaining the rendering(s) can be quite slow, especially if high-resolution images are needed to capture important object details. In short, it is not worth undertaking any costly NeRF-to-X conversion, as processing a NeRF as a NeRF is more effective.
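To make the cubic-scaling argument concrete, here is a rough sketch of NeRF-to-point-cloud extraction by querying the density field on a regular voxel grid (the `density_fn` interface, bound, and threshold are assumptions for illustration, not the exact procedure used in the paper):

```python
import numpy as np

def nerf_to_point_cloud(density_fn, resolution=128, bound=1.0, threshold=10.0):
    """Query the NeRF density on a regular grid and keep occupied voxels.

    density_fn: callable mapping (N, 3) coordinates to (N,) densities
                (assumed interface; real NeRFs also need batching/chunking).
    The number of queries grows as resolution**3, which is the cubic
    scaling referred to above.
    """
    axis = np.linspace(-bound, bound, resolution)
    xs, ys, zs = np.meshgrid(axis, axis, axis, indexing="ij")
    coords = np.stack([xs, ys, zs], axis=-1).reshape(-1, 3)   # (res**3, 3)
    densities = density_fn(coords)                            # (res**3,)
    return coords[densities > threshold]                      # surviving points
```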
Q2
To assess the generalization ability of LLaNA, we trained NeRFs for the subset of 200 Objaverse objects with human-annotated captions used as a test set in the PointLLM paper [77]. This sets forth a challenging out-of-domain and open-set experiment (164 out of the 200 Objaverse objects belong to categories not present in ShapeNet). We could also perform a comparative evaluation as we trained the baselines on ShapeNeRF-Text (see W1-Q1). To this end, we extracted point clouds and rendered front views from the 200 Objaverse NeRFs. Results are reported in Table 4 (Rebuttal PDF). We can observe that the scores of all models are significantly lower compared to Table 1 (Rebuttal PDF), which hints at all models struggling when evaluated on objects very different from those in the training domain. LLaNA achieves the second-best generalization performance after PointLLM. Yet, it is worth highlighting that the frozen modality-specific encoder of PointLLM (and GPT4Point) is PointBERT, which has been trained on Objaverse, while the meta-encoder used by LLaNA, i.e., nf2vec, has been trained only on ShapeNet and thus has never seen any object outside such categories. We also wondered whether the comparison to PointLLM and GPT4Point suggested by the reviewer concerned frozen versions of these models. So, we conducted this experiment as well and found that both frozen models provide better scores than LLaNA on the considered 200 Objaverse objects. For example, PointLLM and GPT4Point yield a Sentence-BERT score equal to 38.09 and 33.61, respectively, while the score for LLaNA is 30.07. We consider these results reasonable because the frozen models have been trained on Objaverse, so for them this is an in-domain evaluation. Yet, the gap between GPT4Point and LLaNA, for which this experiment implies reasoning on out-of-domain data, is relatively limited.
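For reference, a Sentence-BERT similarity between a generated and a reference caption can be computed, e.g., with the sentence-transformers library as sketched below (the backbone name is an assumption chosen for illustration; the actual evaluation follows PointLLM's protocol):

```python
from sentence_transformers import SentenceTransformer, util

# Backbone chosen only for illustration, not necessarily the one used in the paper.
model = SentenceTransformer("all-mpnet-base-v2")

def sentence_bert_score(generated: str, reference: str) -> float:
    """Cosine similarity between Sentence-BERT embeddings, scaled to 0-100."""
    emb = model.encode([generated, reference], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]).item()) * 100.0
```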
W2 – Q3
Following the reviewer's suggestion, we implemented a multi-view baseline by rendering images from N viewpoints randomly chosen from the set of camera poses used to train the given NeRF. Then, we concatenated the tokens from the N images and fed them into LLaVA alongside the text instructions. We set N=3 because the model cannot correctly process a larger number of images. Results are in Table 4 (Rebuttal PDF). For all reasoning tasks, we observe a slight improvement of LLaVA in the multi-view set-up. Yet, LLaNA keeps outperforming LLaVA by large margins. Conversely, using multiple images boosts the zero-shot classification performance of LLaVA, which turns out to be the best model for this task.
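A conceptual sketch of this multi-view baseline follows (rendering, encoder, and projector interfaces are placeholders; the actual experiment reuses LLaVA's own code):

```python
import torch

def multiview_prompt_embeddings(nerf, camera_poses, vision_encoder, projector,
                                text_embeds, n_views=3, seed=0):
    """Render N random training views, project each into the LLM space,
    and concatenate the resulting visual tokens with the text instruction."""
    rng = torch.Generator().manual_seed(seed)
    idx = torch.randperm(len(camera_poses), generator=rng)[:n_views]
    visual_tokens = []
    for i in idx:
        image = nerf.render(camera_poses[i])          # assumed rendering interface
        patches = vision_encoder(image)               # (num_patches, vis_dim)
        visual_tokens.append(projector(patches))      # (num_patches, llm_dim)
    visual_tokens = torch.cat(visual_tokens, dim=0)   # tokens from all N views
    return torch.cat([visual_tokens, text_embeds], dim=0)
```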
W3
The test set does not contain object classes unseen during training. The dataset features 13 classes, and the train, val, and test splits are obtained by randomly sampling objects within each class, i.e., holding out a fixed percentage of objects per class (80%, 10%, and 10% for train, val, and test, respectively).
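For completeness, the per-class sampling described above can be sketched as follows (hypothetical helper shown only to illustrate the splitting scheme):

```python
import random

def split_per_class(objects_by_class, ratios=(0.8, 0.1, 0.1), seed=42):
    """Randomly split the objects of each class into train/val/test."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls, objs in objects_by_class.items():
        objs = objs[:]            # copy so the input list is not shuffled in place
        rng.shuffle(objs)
        n_train = int(len(objs) * ratios[0])
        n_val = int(len(objs) * ratios[1])
        train += objs[:n_train]
        val += objs[n_train:n_train + n_val]
        test += objs[n_train + n_val:]
    return train, val, test
```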
W4
We apologize for the typo in Table 5; we should have used “Back View,” which has the same meaning as in Tables 1 to 4.
W5
Regarding NeSF, it proposes a novel semantic field representation of 3D scenes rather than seeing neural fields as an input modality. On the other hand, we agree with the reviewer that NeRF-RPN also considers NeRFs as an input modality. Therefore, we agree that the sentence at L20 needs rewriting, and we will modify it.
Thank you for the detailed rebuttal and performing the additional experiments that were requested. This must have been quite a lot of work and I appreciate it. Seeing the fine-tuned results on the ShapeNeRF dataset and experiments on Objaverse, I believe this work presents an interesting approach of combining NeRFs with LLMs that should be presented to the community.
I will raise my score to Weak Accept (6).
This work proposes LLaNA, a multimodal large language model (MLLM) that is capable of aligning language with 3D scene fields embedded with (learned) NeRF models. By projecting the NeRF fields into an LLM’s embedding space, LLaNA utilizes a meta encoder to transform fields seamlessly into the token space of an LLM (Llama model in this work) and hence can be adapted to downstream higher-level reasoning tasks (centered around 3D understanding). This work also features a newly created dataset called NeRF-language, which has the emphasis on leveraging NeRF features in answering various visual questions.
Strengths
- The proposed LLaNA method, which directly projects a learned radiance field's weights into the LLM's space, is an interesting idea.
- The training on multi-turn 3D-aware question answering style makes good use of both fields and LLM’s capabilities.
- A new challenge is proposed with the emphasis on multi-turn QAs on 3D-awareness.
Weaknesses
- A more detailed statistical analysis of the dataset is required, such as detailed sizes, diversities of the questions, human performance or judgment on the dataset, type-token ratios, and some frequent word analysis.
- The questions introduced in NeRF-language do not seem to require a very thorough understanding of 3D scenes.
- There is an increasing amount of work aimed at marrying the benefits of radiance fields with the strong abilities of LLMs, just to name a few [1,2,3]. How does this work compare to them?
- Additionally, the 3D-LLM [1] also presents question answering capabilities from 3D input scenes, at least a comparison of the LLaNA method with it is required.
- Minor but not of least importance: additional standard VQA baselines can provide more in-depth analysis on the dataset, as well as language-only baseline models to gauge the potential spurious patterns of the curated question-answer relationships.
[1] Hong, Yining, et al. "3d-llm: Injecting the 3d world into large language models." NeurIPS 2023.
[2] Kerr, Justin, et al. "Lerf: Language embedded radiance fields." ICCV 2023.
[3] Qin, Minghan, et al. "LangSplat: 3D Language Gaussian Splatting." CVPR 2024.
Questions
- Are there any questions that depend specifically on the variation of views, where the NeRF fields would be beneficial?
- And if so, what is the detailed percentage analysis of these types of questions, and what do they look like in general?
Limitations
- The authors addressed some limitations of the work in the manuscript.
W1
The proposed ShapeNeRF-Text is built upon the 3D object dataset ShapeNet. ShapeNeRF-Text has 13 classes of ordinary objects, such as cars, airplanes, and chairs. The shapes are divided into 30939, 3846, and 3859 samples for the train, val, and test sets, respectively. A NeRF is trained for each of these objects. Regarding the text, we provide a brief and a detailed description, three single-round QAs, and one multi-round QA with three rounds for each object.
The average lengths in words of the instructions/responses are 8.81/14.25 for single-round QAs, 8.80/14.14 (per round) for multi-round QAs, 8.51/22.76 for brief descriptions, and 7.82/77.90 for detailed descriptions. We report instruction/response length histograms in Figure 1 - bottom (rebuttal PDF; see global response).
Figure 1 - top (rebuttal PDF) shows word clouds obtained after removing generic words like “model,” “object,” and “NeRF”, emphasizing frequent words in the detailed description instruction and response texts.
Finally, we would like to point the reviewer to Appendix C, where we included several details on ShapeNeRF-Text.
W2 - Q1 - Q2
Akin to PointLLM and GPT4Point, our paper focuses on an object-centric setup rather than on scenes. The proposed ShapeNeRF-Text dataset contains many questions that require a holistic understanding of 3D objects. We demonstrate this with the following two experiments.
First, we evaluate our dataset questions with an LLM (LLaMA3). For each question, we ask LLaMA3:
Is a random viewpoint of the object enough to answer this question?
If so, reply "YES"; if a specific viewpoint is needed, answer "NO"
By doing so, we obtained 5163 “YES” and 5847 “NO”, highlighting that most questions require multi-view information to be answered correctly.
Second, we run a vision-language model, LLaVA-13b, on each question of the single-round QA dataset, using the front and back views of the objects. Then, we keep only the LLaVA responses where the answer for the front or back view achieves a SimCSE score higher than 80%, i.e., likely correct answers; this selects approximately 45% of the answers. Among these correct responses, we calculate the percentage of those where the front and back answers are markedly different (i.e., a difference in SimCSE scores > 10). Remarkably, 26% of the answers are correct from one point of view but wrong from the other: these questions would have required multi-view information to be answered correctly. We report two qualitative examples in Figure 2. In the first row, the Mercedes-Benz logo cannot be recognized from the back view. In the second row, the monitor seems turned off, and thus it is not possible to correctly identify the helicopter displayed on the screen. Similarly, Figure 11 of the Appendix shows other examples of this kind of QA.
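The selection logic of this second experiment can be summarised as in the sketch below (the record structure is an assumed illustration; SimCSE scores are on a 0-100 scale):

```python
def view_dependent_fraction(records, correct_thresh=80.0, gap_thresh=10.0):
    """records: list of dicts with SimCSE scores of LLaVA answers against the
    ground truth, for the front and back views (assumed structure).

    Returns the share of 'likely correct' answers whose quality differs
    strongly between the two views (the 26% figure mentioned above)."""
    likely_correct = [r for r in records
                      if max(r["front_score"], r["back_score"]) > correct_thresh]
    view_dependent = [r for r in likely_correct
                      if abs(r["front_score"] - r["back_score"]) > gap_thresh]
    return len(view_dependent) / max(len(likely_correct), 1)
```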
W3 - W4
LERF [2] ([32] in our paper) and LangSplat [3] are innovative representations of 3D objects and scenes. They extend the radiance field formulation, considering functions that model density, color, and language features at each spatial coordinate. These language fields are parameterized by either a neural network (LERF) or a set of 3D Gaussians (LangSplat). We believe that, rather than competing with LLaNA -- a multimodal LLM that considers NeRFs as an input modality -- [2] and [3] propose field-based representations that are semantically richer than standard NeRFs.
3D-LLM [1] ([24] in our paper) is a multimodal LLM that processes a colored mesh or a set of posed images. Thus, like PointLLM and GPT4Point, it can be applied to data extracted from NeRFs. In our experiments, we did not include it among the baselines because 3D-LLM addresses scenes and scene-specific tasks, such as 3D grounding and navigation. On the contrary, LLaNA focuses on object-centric scenarios, like PointLLM and GPT4Point.
Yet, as requested by the reviewer, we have evaluated 3D-LLM on our ShapeNeRF-Text dataset to compare its performance to LLaNA. To this end, we have extracted colored 3D meshes from the NeRFs belonging to the test set of ShapeNeRF-Text and processed these data with the official 3D-LLM code to render the images and compute both the 2D and 3D features required by the model at inference time. The results are reported in Table 4 (Rebuttal PDF). We notice that 3D-LLM provides performance somewhat comparable to PointLLM (see Tables 1, 3, 4, and 5 of the paper). Yet, as highlighted by the last row of Table 4 (Rebuttal PDF), LLaNA performs much better on all tasks.
Moreover, we highlight that 3D-LLM has several disadvantages compared to LLaNA.
- 3D-LLM requires several components to extract the 3D language features that are the input to its LLM. Indeed, it requires a pre-trained vision-language feature extractor (CLIP), a general-purpose semantic segmentation network (SAM), and a way to handle noisy 3D data (gradSLAM) to project language features into the 3D space. Conversely, LLaNA extracts NeRF features with a single forward pass of the meta-encoder, which takes just a few ms on a GPU.
- As it operates on meshes and images, 3D-LLM does not scale well with the resolution of the input signal. Conversely, the number of NeRF weights is decoupled from the spatial resolution of the underlying signal, making LLaNA resolution-agnostic.
- In our scenario, where NeRF is the input modality, we must extract a 3D mesh and render N views to apply 3D-LLM. However, as already stated in the paper (see L53-57), this process is non-trivial and time-consuming. Moreover, important details might be lost when the geometry extracted from the NeRF is too noisy or low-resolution.
W5
To assess potential spurious patterns in the question-answer relationships, we evaluated the performance of LLaMA 2 fine-tuned on ShapeNeRF-Text (Table 5 rebuttal PDF). There is a significant performance gap between LLaMA2 and LLaNA, highlighting that our dataset consists of questions that can only be answered with access to information about objects.
Thanks for the detailed responses and the additional experiments, they are well-appreciated. The statistical analysis part is more complete now (though still lacking a few items), but it should have been there during submission, and hence I felt the manuscript wasn't ready in its current state. The additional experiments eased some of my doubts, but my W2 still stands; it's more a matter of how one designs the view sampling, unless the majority of your data requires question askers to examine and be aware of the whole 3D structure of objects.
Nevertheless, I raise my score to 5 to give credit to the whole rebuttal process and the clarification of some of my questions.
The authors propose a method to ingest NeRFs and project them into a language model's latent space for question answering and chat applications on NeRFs directly.
Strengths
S1. Very clear abstract, which nicely frames the paper
S2. Strong related work section
S3. The proposed dataset may be useful to others working on related investigations
S4. Many different (standard) metrics considered
S5. As NeRFs become a more standard 3D representation, dealing with them in the context of chat applications is increasingly relevant and forward thinking
Weaknesses
W1. Consider adding basic statistics about the proposed dataset in the intro (e.g., the dataset size in number of samples).
W2. Consider adding information about the kind of generalization that can be expected from the method. Can the method generalize to novel classes? To different instances of fixed classes with held out attributes, colors etc.?
W3. ShapeNet seems like a good starting point, but datasets like Objaverse and Objaverse-XL may be interesting to probe generalization. How does the proposed method, trained on ShapeNet, generalize to more complex NeRFs?
W4. I don't quite follow the addition on L141. Consider giving a more high-level description of this and deferring details to the Appx.
W5. L181-182 about retaining the safety of the model reads as an unsupported claim. Either point to an experiment to this effect, run this experiment, or remove this claim.
W6. Figure 3 does not stand alone. Just looking at it I am not able to understand what I should take away. Consider adding more information to the caption so it is clear how to interpret the figure and what the takeaway is.
W7. I am not 100% clear on how the training dataset is split in L216. Are there some held out classes? Or is there some data that is held out per class?
W8. The proposed method uses fine-tuning while the baseline methods do not. Hence, I do not consider these fair comparisons, calling into question the validity of the baselines and the gains of the proposed method for NeRFs over traditional 3D and 2D input representations. Adding fine-tuned models as baselines seems important to contextualize the results.
Questions
See weaknesses for specific questions.
Limitations
Yes, the authors address limitations in a specific section.
W1
ShapeNeRF-Text is built upon ShapeNet. The 3D shapes are divided into 30939, 3846, and 3859 samples for the train, val, and test sets, respectively. A NeRF is trained for each of these objects. As for the text, the dataset features a brief and a detailed description, three single-round QAs, and one three-round QA for each object. We computed word clouds and the instruction/response length statistics, partially shown in Figure 1 (rebuttal PDF; see global response). We will add basic details on the dataset size in the intro and all the details in the Appendix.
W2-W3
To address W8 (see below), we have trained all the baselines considered in our paper on ShapeNeRF-Text (Table 1, Rebuttal PDF). To train the baselines, we have used the official training code, which, for all of them, keeps the modality-specific encoder frozen and trains both an adaptor and the LLM in two steps. Thus, to probe generalization, we have evaluated LLaNA and the trained baselines on the subset of 200 Objaverse objects with human-annotated captions used as a test set in the PointLLM paper [77]. This sets forth a rather challenging out-of-domain and open-set experiment (164 out of the 200 Objaverse objects belong to categories not present in ShapeNet). To conduct our experiments, we created NeRFs for all 200 objects. Then, we extracted the point clouds and rendered front views from the NeRFs to test the baselines. The results are in Table 4 (Rebuttal PDF). We can observe that the scores of all models are significantly lower compared to Table 1 (Rebuttal PDF), which hints at all models struggling when evaluated on objects very different from those in their training domain. LLaNA achieves the second-best generalization performance after PointLLM. Yet, it is worth highlighting that the frozen modality-specific encoder of PointLLM (and GPT4Point) is PointBERT, which has been trained on Objaverse. In contrast, LLaNA's meta-encoder, i.e., nf2vec, has been trained only on ShapeNet and thus has never seen any object outside the ShapeNet categories.
W4
The addition shows how to calculate the number of rows of the input matrix obtained by stacking the NeRF's weights. This number depends on the number of hidden layers, the number of units per hidden layer, and the dimension of the input, which is an array obtained by frequency encoding of the 3D coordinates. We will move the detailed calculation to the Appendix.
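As an illustration of why the row count grows with depth, width, and input dimension, a hedged sketch of stacking an MLP's parameters into a single matrix (this conveys the general idea only, not the exact layout used by nf2vec):

```python
import torch

def stack_nerf_weights(mlp: torch.nn.Sequential) -> torch.Tensor:
    """Illustrative stacking of an MLP's weights and biases into one matrix.

    Each linear layer contributes its (transposed) weight matrix plus its bias
    as an extra row; rows are right-padded to a common width so that all
    blocks can be concatenated vertically.
    """
    width = max(layer.out_features for layer in mlp
                if isinstance(layer, torch.nn.Linear))
    rows = []
    for layer in mlp:
        if not isinstance(layer, torch.nn.Linear):
            continue
        block = torch.cat([layer.weight.t(), layer.bias.unsqueeze(0)], dim=0)
        pad = width - block.shape[1]
        rows.append(torch.nn.functional.pad(block, (0, pad)))
    # Row count grows with the number of layers, the units per layer,
    # and the (frequency-encoded) input dimension.
    return torch.cat(rows, dim=0)
```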
W5
We apologize for any confusion. We meant that, as our ShapeNeRF training dataset was generated using LLaMA's automated text creation, which includes built-in safeguards, and as ShapeNet is a manually curated dataset with only common items such as chairs and cars, our model trained on these data should also be safe. However, we acknowledge that we have not conducted any experiments to validate this claim and will remove that statement.
W6
The figure represents the automatic pipeline used to create the ShapeNeRF-Text dataset. Given the 3D model corresponding to a NeRF, we render views with a computer graphics engine. Each view is then processed by a VLM, i.e., LLaVA, obtaining view-specific captions. These captions are finally aggregated by an LLM, i.e., LLaMA 3, to obtain brief and detailed descriptions and single-round and multi-round Q&As. We will add this information to the caption.
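In pseudocode, this annotation pipeline reads roughly as follows (all function names are hypothetical placeholders for the graphics engine, LLaVA, and LLaMA 3 calls):

```python
def annotate_object(model_3d, camera_poses, render_view, caption_with_llava,
                    aggregate_with_llama3):
    """Sketch of the ShapeNeRF-Text annotation pipeline.

    render_view, caption_with_llava, and aggregate_with_llama3 are
    hypothetical callables standing in for the graphics engine, the VLM,
    and the LLM used to aggregate the per-view captions.
    """
    view_captions = []
    for pose in camera_poses:
        image = render_view(model_3d, pose)            # multi-view renderings
        view_captions.append(caption_with_llava(image))
    # The LLM merges the view-specific captions into the final annotations:
    # brief/detailed descriptions, single-round and multi-round Q&As.
    return aggregate_with_llama3(view_captions)
```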
W7
The proposed ShapeNeRF-Text dataset features 13 classes. Following [61], the train, val and test splits are obtained by randomly sampling objects within each class, i.e., holding out a fixed percentage of objects per class. So, there are no held-out classes.
W8
Our choice of using frozen models as baselines has been motivated by LLaNA being the only assistant that works on NeRFs, i.e. whose input modality is a radiance field parametrized as a neural network. Indeed, the papers most closely related to ours -- those proposing object-centric assistants that ingest point clouds (PointLLM, GPT4Point) -- use frozen VLMs as baselines operating on a different modality (i.e. images).
Nevertheless, we agree with the reviewer on the importance of adding experiments with baselines trained on ShapeNeRF-Text. Thus, we trained all the baselines on ShapeNeRF-Text following their official protocol and using the official training code (see also the answer to W3). We report the results of the new experiments alongside those already included in the paper (Frozen baseline models) in Tables 1 and 2 (Rebuttal PDF). We notice that the trained baselines exhibit different behaviors than their frozen counterparts, with LLaVA performing significantly worse and PointLLM showing clear improvements. As for GPT4Point, we observe more variability across metrics, though, overall, we conclude that it does not benefit from training on ShapeNeRF-Text. The last row in Tables 1 and 2 shows that LLaNA performs best compared to all baselines, frozen or trained on ShapeNeRF-Text.
Finally, we highlight that the new experiments strengthen the key finding of our paper: directly processing the weights of a NeRF is the most effective way to reason about the underlying radiance field. Moreover, converting a NeRF into a different modality, e.g., a point cloud or an image or set of images, is inefficient and cumbersome (see A.3, main paper). For instance, extracting a point cloud from a NeRF mandates voxelizing the 3D space to evaluate the density function, with the required computation scaling cubically with the desired resolution. Images are also not easy to handle, as NeRFs do not come with an object-centered 3D reference frame, so one would need to guess from which viewpoint(s) to render meaningful images. Obtaining the rendering(s) can be quite slow, especially if high-resolution images are needed to capture important object details. In short, it is not worth undertaking any costly NeRF-to-X conversion, as processing a NeRF as a NeRF is more effective.
Thanks for the detailed rebuttal and new experiments! I maintain that as NeRF representations become more standard as a 3D representation, it is important to understand how they can be ingested into causal transformers. While the proposed method does seem susceptible to distribution shifts, I consider adding this result for transparency a huge strength not a weakness. I am raising my score to a 6.
We thank the reviewers for their valuable feedback, which allowed us to carry out a more extensive investigation, improving the quality of our work.
We reported the results of these extensive experiments in the attached PDF, referred to as the “rebuttal PDF”, which we believe provides a clearer picture of the merits of our proposal.
We also highlight that, in case of acceptance, we will include all new tables, figures, and considerations, including discussion on the related papers pointed out by the reviewers, in the revised manuscript, either in the main paper or in the Appendix.
This submission received very positive reviews from three reviewers. The authors did a good job in the rebuttal and addressed all of the concerns from all the reviewers. The AC agreed with the reviewers' comments and recommendations and decided to accept this submission. The authors need to include all the new materials/comments in the final camera-ready submission.