GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation
Abstract
Reviews and Discussion
The primary objective of this paper is to develop a geometry-aware large reconstruction model. Previous approaches either predict tri-planes or per-pixel Gaussians for reconstruction from multi-view images. However, these methods lack an explicit correspondence between 2D image features and 3D representations. To address this issue, the authors employ a Proposal Transformer to identify grids with valid occupancy and use these tokens to perform Deformable Cross Attention for aggregating image features and subsequently predicting Gaussians. The results demonstrate that the proposed method achieves state-of-the-art performance and adapts well to different numbers of views.
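For concreteness, the two-stage flow described above can be sketched roughly as follows. This is an illustrative, pseudocode-level sketch only: the module names, tensor shapes, and occupancy threshold are assumptions made for illustration, not details taken from the paper.

```python
import torch

def geolrm_style_forward(images, cameras, proposal_net, recon_net, gaussian_head,
                         grid_res=64, occ_thresh=0.5):
    """Rough two-stage flow: occupancy proposal -> sparse anchor tokens -> Gaussians.

    `proposal_net`, `recon_net`, and `gaussian_head` are placeholder callables
    standing in for the Proposal Transformer, the reconstruction (anchor point)
    decoder, and the per-token Gaussian MLP; their interfaces are assumed.
    """
    # Stage 1: predict which cells of a dense 3D grid contain geometry.
    occ_logits = proposal_net(images, cameras)                 # (grid_res**3,)
    occupied = torch.sigmoid(occ_logits) > occ_thresh          # hard boolean mask

    # Keep only occupied cells as sparse "anchor" tokens (per the rebuttal, the
    # proposal grid is further upsampled before tokenization).
    anchor_idx = occupied.nonzero(as_tuple=False).squeeze(-1)  # (N_sparse,)

    # Stage 2: refine the sparse tokens with deformable cross-attention over the
    # multi-view image features, then decode each token into Gaussian parameters.
    tokens = recon_net(anchor_idx, images, cameras)            # (N_sparse, C)
    gaussians = gaussian_head(tokens)                          # centers, scales, colors, ...
    return gaussians
```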
Strengths
- This paper presents an alternative approach for building large reconstruction models by first predicting occupancy grids. This method may be beneficial in reducing memory consumption as the number of input views increases.
- The results show that the proposed method achieves state-of-the-art performance, with performance improving as the number of input views increases.
Weaknesses
- The authors compare their model with InstantMesh under different input view settings, which is unfair. InstantMesh fixes the number of input views at six during training, so testing with varying numbers of input views may degrade its performance. In contrast, GeoLRM is trained on varying input view numbers. Therefore, the main claim of this paper is not well-verified.
- The authors use a training and rendering resolution of 448x448, while other works render 512x512 images. It would be better to maintain the same settings for a fair comparison.
- Introducing the Proposal Transformer complicates the process. An inference time comparison with other methods would be beneficial.
- The comparison is only conducted on the GSO dataset. Evaluating and comparing with other methods on additional datasets, such as OmniObject3D, would be more convincing.
- The authors claim superiority compared to per-pixel Gaussian prediction methods and compare performance with LGM. However, another method, GRM [1], released around the same time as LGM and sharing a similar architecture but showing much better performance, could also be included for a more convincing comparison.
- The paper is not clearly written, especially the method section: details of the Proposal Transformer are missing, and the terminology is inconsistent: Figure 2 uses the term "Reconstruction Transformer", while Section 3.2 uses "Anchor Point Decoder".
[1] Xu, Yinghao, et al. "GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation." arXiv preprint arXiv:2403.14621 (2024).
Questions
Questions:
My main concern is that the comparison with previous methods using different numbers of input views is unfair, and this paper employs a different rendering resolution. The experimental results do not adequately support the authors' claims.
Suggestions:
- One of the main advantages of this paper is its potential to reduce GPU memory usage as the number of input views increases. However, there is no clear comparison with other methods. Adding GPU memory and inference time comparisons in Table 2 would be helpful.
- What if the number of input views is further increased beyond 12?
Limitations
The authors have discussed their limitations and provided potential solutions.
We thank the reviewer for their insightful comments. We have addressed the specific concerns and provided additional clarification and results as follows:
Response to Weaknesses:
- Number of input views: We agree that the comparison with InstantMesh using different numbers of input views might seem unfair. The motivation for this experiment is to pave the way for integrating video generation models into 3D AIGC applications: videos naturally contain 3D information, and the diversity and quality of video datasets are much better than those of 3D datasets. The flexibility to handle varying numbers of input views is therefore a significant advantage of our approach that should be considered. Moreover, even under conditions that are less favorable to our method (6 input views, see the table below), we still achieve superior performance and efficiency.
- Rendering resolution: We chose 448 due to the requirements of our image backbone. To ensure a fair comparison, we have re-evaluated our method at 512 resolution, and the results remain consistent with those at 448. The updated performance metrics are as follows:
\begin{array}{|l|c|c|c|c|c|c|c|}
\hline
\text{Method} & \text{PSNR}\uparrow & \text{SSIM}\uparrow & \text{LPIPS}\downarrow & \text{CD}\downarrow & \text{FS}\uparrow & \text{Inf. Time (s)} & \text{Memory (GB)} \\
\hline
\text{LGM} & 20.76 & 0.832 & 0.227 & 0.295 & 0.703 & 0.07 & 7.23 \\
\text{CRM} & 22.78 & 0.843 & 0.190 & 0.213 & 0.831 & 4^* & \underline{5.93} \\
\text{InstantMesh} & \underline{23.19} & \underline{0.856} & 0.166 & \underline{0.186} & \underline{0.854} & 0.78 & 23.12 \\
\text{Ours} & 23.57 & 0.872 & \underline{0.167} & 0.167 & 0.892 & \underline{0.67} & 4.92 \\
\hline
\end{array}
\begin{array}{|l|c|c|c|}
\hline
\text{Method} & \text{PSNR}\uparrow & \text{SSIM}\uparrow & \text{LPIPS}\downarrow \\
\hline
\text{InstantMesh} & \underline{23.98} & \underline{0.861} & \underline{0.146} \\
\text{Ours} & 24.65 & 0.893 & 0.130 \\
\hline
\end{array}
\begin{array}{|c|cc|cc|cc|cc|}
\hline
 & \text{PSNR} & \text{PSNR} & \text{SSIM} & \text{SSIM} & \text{Inf. Time (s)} & \text{Inf. Time (s)} & \text{Memory (GB)} & \text{Memory (GB)} \\
\text{Num. Input Views} & \text{InstantMesh} & \text{Ours} & \text{InstantMesh} & \text{Ours} & \text{InstantMesh} & \text{Ours} & \text{InstantMesh} & \text{Ours} \\
\hline
4 & 22.87 & 22.84 & 0.832 & 0.851 & 0.68 & 0.51 & 22.09 & 4.30 \\
8 & 23.22 & 23.82 & 0.861 & 0.883 & 0.87 & 0.84 & 24.35 & 5.50 \\
12 & 23.05 & 24.43 & 0.843 & 0.892 & 1.07 & 1.16 & 24.62 & 6.96 \\
16 & 23.15 & 24.79 & 0.861 & 0.903 & 1.30 & 1.51 & 26.69 & 8.23 \\
20 & 23.25 & 25.13 & 0.895 & 0.905 & 1.62 & 1.84 & 28.73 & 9.43 \\
\hline
\end{array}
I would like to thank the authors for their detailed response, which addressed most of my concerns. However, I am not convinced by the response regarding the comparison with InstantMesh, which is one of my main concerns. The authors also acknowledge that it is unfair to use a different number of input views when comparing with InstantMesh. Handling different numbers of input views is one of the main claims of this paper, so a fair comparison with previous methods under the same settings is important. One solution would be to retrain InstantMesh with the same training setup as in this paper (varying the number of input images during training), which would be more convincing. Therefore, I tend to maintain my rating.
Thank you for your feedback. Due to limited time, we fine-tuned InstantMesh with 8 A100 GPUs for 6 hours utilizing our dynamic input number training strategy. Here are the results:
Our method still outperforms InstantMesh, especially for denser views. Our analysis indicates that InstantMesh is limited by its low-resolution triplane representation (64x64). This limitation explains why InstantMesh does not benefit as significantly from denser inputs compared to our method. We believe that 3D AIGC is a systemic endeavor, where the training strategy plays a crucial role. Adapting our training strategy to other methods might alter their original characteristics.
Furthermore, it is important to highlight that our key contributions also include the integration of geometry into LRMs, which brings significant enhancements in memory efficiency and representation resolution compared to previous approaches. These factors should also be taken into consideration.
This paper introduces a sparse reconstruction model based on LRM. It uses the projection between 3D points and pixel positions to reduce computational cost, applies deformable attention to lift 2D features to 3D, and proposes a two-stage pipeline to generate Gaussians.
Strengths
- Using the projection correspondence between 3D points and pixel coordinates is a reasonable way to reduce computation.
- Deformable attention is also a useful solution to lift 2D multi-view image features to 3D.
Weaknesses
- Some confusion about the method and the experiments.
- Why can't the first and second stages be trained together? Is it because of memory consumption?
- The advantage with more input images is nice but seems unimportant for a sparse reconstruction model, which is designed to handle situations with only a few inputs. There is also a point of confusion in this experiment: does the number of Gaussian primitives increase as the number of inputs increases in your method?
- 3DGS-based methods have advantages in novel-view rendering quality, but the advantage over InstantMesh is not obvious, and the results shown in this paper seem to be much worse than those of existing 3DGS-based methods [1,2].
- How does your method generate the mesh? This comparison experiment seems to be missing. Is there still an advantage in mesh generation with the Gaussian representation?
- The contribution does not seem novel enough.
- Using a projection strategy is a common solution and has been widely used in existing sparse reconstruction methods such as [3,4,5].
- Using deformable attention to lift 2D image features into 3D space is also a widely used strategy, e.g., [6,7], and the two-stage pipeline of first generating sparse queries and then processing only those sparse queries is also an existing solution, e.g., [7].
[1] GRM: Large Gaussian Reconstruction Model for Efficient 3D Reconstruction and Generation
[2] GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting
[3] SparseNeuS: Fast Generalizable Neural Surface Reconstruction from Sparse Views.
[4] C2F2NeUS: Cascade Cost Frustum Fusion for High Fidelity and Generalizable Neural Surface Reconstruction
[5] GenS: Generalizable Neural Surface Reconstruction from Multi-View Images
[6] SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving
[7] VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion
Questions
Refer to the weaknesses for details. Due to these doubts, I tend to give a borderline rating and hope to see a sufficient response from the authors.
Limitations
Declared in the paper.
We would like to thank the reviewer for their constructive feedback. Below are our detailed responses to the specific points raised:
Response to Weaknesses:
About the method and the experiment:
- Training stages: The primary reason for training the two stages separately is the non-differentiability of the conversion from the occupancy grid to sparse tokens. This conversion step is necessary for processing the sparse 3D representation efficiently, but it cannot be backpropagated through directly (a minimal illustration of this hard selection step is given after this list). Memory consumption is indeed a secondary concern that also benefits from the separation.
- Denser inputs and Gaussian primitives: The motivation to handle denser image inputs was inspired by the success of video diffusion models (Sora [1], SVD [2], etc.). Videos naturally contain 3D information, and the diversity and quality of video datasets are much better than those of 3D datasets. Recent works (SV3D [3], VideoMV [4], etc.) therefore leverage video diffusion models for multi-view generation and have achieved great success. The output of video diffusion models is denser than that of multi-view diffusion models and cannot be efficiently processed by previous LRM methods. We are optimistic about the ability of video diffusion models to generate multi-view-consistent images and about their potential to be extended to scene-level 3D generation, which is why we propose a method that efficiently processes denser inputs. The number of Gaussian primitives is fixed at 512k for all input settings to avoid bottlenecks in the representation.
- Comparison with baselines:
- While the improvement may appear small in terms of quantitative metrics, our method achieves significant gains in resolution and efficiency. As demonstrated in Figure 3 of the manuscript, examples 3 and 4 show that our method recovers finer details from the input images. Additionally, please refer to the following table (which is an extension of Table 2 in the original manuscript) for a detailed comparison with InstantMesh, highlighting our memory efficiency and ability to process denser inputs effectively.
\begin{array}{|c|cc|cc|cc|cc|}
\hline
 & \text{PSNR} & \text{PSNR} & \text{SSIM} & \text{SSIM} & \text{Inf. Time (s)} & \text{Inf. Time (s)} & \text{Memory (GB)} & \text{Memory (GB)} \\
\text{Num. Input Views} & \text{InstantMesh} & \text{Ours} & \text{InstantMesh} & \text{Ours} & \text{InstantMesh} & \text{Ours} & \text{InstantMesh} & \text{Ours} \\
\hline
4 & 22.87 & 22.84 & 0.832 & 0.851 & 0.68 & 0.51 & 22.09 & 4.30 \\
8 & 23.22 & 23.82 & 0.861 & 0.883 & 0.87 & 0.84 & 24.35 & 5.50 \\
12 & 23.05 & 24.43 & 0.843 & 0.892 & 1.07 & 1.16 & 24.62 & 6.96 \\
16 & 23.15 & 24.79 & 0.861 & 0.903 & 1.30 & 1.51 & 26.69 & 8.23 \\
20 & 23.25 & 25.13 & 0.895 & 0.905 & 1.62 & 1.84 & 28.73 & 9.43 \\
\hline
\end{array}
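To make the non-differentiability point in the first item above concrete, here is a minimal, self-contained illustration (toy grid size and threshold assumed; this is not our actual training code):

```python
import torch

# Toy occupancy logits over an 8^3 grid (the real grid is much larger).
logits = torch.randn(8, 8, 8, requires_grad=True)
probs = torch.sigmoid(logits)

# Hard selection of occupied cells: thresholding and index gathering.
mask = probs > 0.5                  # boolean comparison -> no gradient
idx = mask.nonzero(as_tuple=False)  # integer indices -> no gradient

# Only the gathered values stay differentiable; the choice of *which* cells
# survive does not, so the second stage cannot send gradients back into the
# occupancy decision made by the first stage.
sparse_feats = probs[mask]          # differentiable w.r.t. logits of kept cells
sparse_feats.sum().backward()       # gradients flow only through the kept cells
print(idx.shape, logits.grad.abs().sum())
```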
Could you please explain why you lowered the score?
Because you didn't give any further response. I think the whole pipeline is reasonable, but I don't agree with separating this larger model from previous sparse reconstruction models. Although you claim that this model is designed for AIGC, everything this work does is sparse reconstruction. Thus, I think more analysis of previous sparse reconstruction work is needed, e.g., acknowledging that the projection strategy is inspired by previous sparse reconstruction models rather than treating them as a different domain.
The paper proposes a geometry-aware large reconstruction model that represents the scene as 3D Gaussians. The paper proposes a novel architecture for multi-view reconstruction that first generates a proposal occupancy grid for 3D Gaussians with a proposal transformer, and then refines it with a reconstruction transformer. The model uses hierarchical image encoders that encode both semantic features (DINOv2) and RGB values plus Plücker coordinates. A first proposal transformer classifies which dense 3D tokens are occupied, and based on that samples sparse tokens, denoted as "anchor points", which are processed by the reconstruction transformer into Gaussian tokens that are decoded by a lightweight MLP. The network uses deformable cross-attention to the hierarchical image features, which allows recovering higher-resolution features, and also uses a 3D version of Rotary Positional Encoding (RoPE). The model is trained in two stages: first the proposal transformer, then the reconstruction model. The model is trained on Objaverse and evaluated on GSO against LGM, CRM, and InstantMesh. The model is evaluated across different numbers of input views, and some of the contributions are ablated: types of features, 3D RoPE, and training with a fixed number of views.
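As a rough illustration of the described input encoding (channel sizes and module interfaces are assumed here, not taken from the paper), each view contributes both a low-level and a semantic feature map:

```python
import torch

def build_view_features(rgb, plucker, dino_backbone, low_level_conv):
    """Illustrative combination of low-level and semantic features for one view.

    `dino_backbone` stands in for the frozen DINOv2 encoder and `low_level_conv`
    for a small convolutional stem; both are assumed placeholder modules, and the
    384-channel choice simply mirrors the feature dimension quoted in the rebuttal.
    """
    # Low-level branch: RGB concatenated with the 6-channel per-pixel Plücker
    # ray coordinates, mapped to the shared token dimension by a conv stem.
    low = low_level_conv(torch.cat([rgb, plucker], dim=1))  # (B, 384, h1, w1)

    # Semantic branch: patch features from the frozen image backbone.
    sem = dino_backbone(rgb)                                 # (B, 384, h2, w2)

    # Both levels are kept so that deformable cross-attention can later sample
    # from the coarse semantic map and the finer low-level map.
    return [low, sem]
```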
Strengths
- The paper proposes a novel architecture to perform 3D reconstruction from multiple views based on transformers and 3D Gaussians that improves upon previous work. The main contributions are the use of higher frequency information thanks to Deformable Cross Attention, and selective computing thanks to a proposal network that computes anchor points around which 3D gaussians are generated.
- The paper compares results with a number of highly relevant recent papers (LGM, CRM, InstantMesh), and showcases the strength of their method.
- The method seems to be more robust to increasing the number of input views compared to previous works, which seems like a win.
- The paper is well written and easy to follow.
Weaknesses
- I think the deformable cross-attention is not ablated properly, yet it seems key to the success of this method. Similarly, understanding the sampled points would be quite interesting.
- The paper could be a bit more clearly written. See questions. I think it confuses the reconstruction transformer and the anchor point decoder. It is also not clear what the architecture of the proposal transformer is until one reads the supplementary and realizes that it is the same architecture as the reconstruction transformer (cool!). Adding a clearer structure for where the proposal network and the reconstruction transformer are explained would make the paper more readable.
- The paper could be a bit better if it showed that deformable attention allows the model to be robust to slight pose noise. As well, scaling experiments of the different models would be quite useful.
Questions
- What is the feature dimension of the low-level image features (e.g., after the conv)?
- The Anchor Point Decoder L125 is missing in Figure 2, and is labeled as Reconstruction Transformer.
- How much do the Plücker coordinates in the low-level image features matter?
- What is the typical number of sparse tokens? Is it even an issue when there are too many?
- L159: I did not get this subtle point. The proposal anchor points are upsampled to 128^3, right? And then each of the active ones becomes a token for the reconstruction transformer, with a max sequence length of 32k?
Limitations
Yes.
We sincerely appreciate the reviewer's positive feedback and valuable insights. Here are our detailed responses to the comments and questions:
Response to Weaknesses:
- About deformable cross attention: We agree that deformable cross attention plays a crucial role in our method. To further evaluate its impact, we have conducted an additional ablation study, summarized in the table below. This experiment was performed using a smaller model configuration, as described in line 242 of the manuscript. (A minimal illustrative sketch of the sampling mechanism is included at the end of this list.)
* 0 sampling points means directly using the projected points without any deformation.
The ablation results indicate that increasing the number of sampling points generally improves performance. Given the trade-off between computational cost and performance gain, we find that using 8 sampling points strikes the best balance.
- Clarity of writing: We apologize for any confusion caused by the presentation. To improve clarity, we will standardize the terminology for the second stage of our model as the 'Anchor Point Decoder'. Additionally, we will clarify in the figure caption that the Proposal Transformer shares the same architecture as the Anchor Point Decoder, as previously mentioned in L109.
- More explorations:
- Robust to slight pose noise: Thank you for your insightful advice! Given the absence of a baseline for this task, we have provided a qualitative visualization demonstrating how deformable attention responds to pose noise. Specifically, we perturbed one of the input camera poses by 0.02 along the z-axis and visualized the predicted offsets relative to the reference points. The results are detailed in the attached PDF. We observed that the average angle between the predicted offsets and the perturbation, when projected onto the image plane, was 31°. This indicates that the learned offsets attempt to counteract the perturbation, showcasing robustness to slight pose errors.
- Scaling experiments: Owing to constraints on both time and computational resources, currently we are unable to scale up the model. We plan to address this as part of our future work.
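As mentioned in the deformable cross attention item above, here is a minimal single-view, single-level sketch of the sampling mechanism in the spirit of Deformable DETR; the shapes, offset scaling, and single-head simplification are assumptions, and this is not our actual module:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableCrossAttn(nn.Module):
    """Single-level, single-head deformable cross-attention (illustrative only)."""

    def __init__(self, dim=384, num_points=8):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, num_points * 2)  # per-query 2D offsets
        self.weights = nn.Linear(dim, num_points)      # per-point attention weights
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_xy, feat_map):
        # queries:  (N, C) anchor-token features
        # ref_xy:   (N, 2) projected reference points, (x, y) in [-1, 1]
        # feat_map: (1, C, H, W) image features from one view / one level
        N, _ = queries.shape
        offs = self.offsets(queries).view(N, self.num_points, 2) * 0.1
        attn = self.weights(queries).softmax(dim=-1)              # (N, K)

        # Sampling locations scattered around each projected reference point.
        loc = (ref_xy[:, None, :] + offs).clamp(-1, 1)            # (N, K, 2)
        sampled = F.grid_sample(feat_map, loc[None], align_corners=False)
        sampled = sampled[0].permute(1, 2, 0)                     # (N, K, C)

        # Attention-weighted aggregation of the sampled features.
        out = (attn[..., None] * self.value_proj(sampled)).sum(dim=1)
        return self.out_proj(out)                                 # (N, C)
```

Here `num_points` corresponds to the number of sampling points varied in the ablation above; the 0-point baseline in that ablation reads features only at the projected reference points, without any learned offsets.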
Response to Questions:
- The feature dimension of the low-level image features after the convolutional layer is 384. This matches the dimensionality of the high-level features, facilitating the application of multi-level deformable attention.
- As addressed in the response to weakness 2, we will correct the label in Figure 2 to 'Anchor Point Decoder'.
- We performed an ablation study on the Plücker coordinates. The PSNR without Plücker coordinates was 20.64, compared to 20.73 for the full model. These coordinates help the model learn camera directions, contributing to improved performance. (An illustrative sketch of how per-pixel Plücker ray coordinates can be computed is given at the end of this list.)
- During training, the typical number of sparse tokens is 4k, while during inference it is 16k. A higher number of tokens during training significantly increases memory consumption. Thanks to the 3D RoPE, our model can efficiently handle more tokens during inference to capture finer details. However, we observed that excessive tokens with a simple model might introduce artifacts or 'floaters'.
- Your understanding is correct. The proposal anchor points are upsampled to a 128³ grid. Active points are then converted into tokens for the Reconstruction Transformer, with a maximum sequence length of 4k during training.
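As referenced in the Plücker-coordinate answer above, here is a small sketch of how such per-pixel ray coordinates can be computed; the camera conventions (pixel centers, intrinsics format, normalization) are assumed for illustration, and this is not our exact implementation:

```python
import torch

def plucker_rays(K_inv, cam_to_world, H, W):
    """Per-pixel Plücker coordinates (d, o x d) for one camera.

    `K_inv` is the 3x3 inverse intrinsics and `cam_to_world` the 4x4 pose,
    both float32 tensors; conventions are assumed for illustration.
    """
    ys, xs = torch.meshgrid(torch.arange(H) + 0.5, torch.arange(W) + 0.5,
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)      # (H, W, 3)

    # Ray directions in world space, normalized to unit length.
    dirs = (pix @ K_inv.T) @ cam_to_world[:3, :3].T
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)

    # The ray origin is the camera center; the moment m = o x d completes the
    # 6D Plücker representation used as a per-pixel pose encoding.
    origin = cam_to_world[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)
    return torch.cat([dirs, moment], dim=-1)                      # (H, W, 6)
```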
We hope these clarifications address the reviewer's concerns and provide a clearer understanding of our work. Thank you again for the valuable feedback.
We sincerely thank the reviewers for their feedback and valuable comments on our work. As suggested by Reviewer hKRc, we have further compared the mesh extraction of our method with other baseline methods. The detailed comparison is provided in the attached one-page PDF file. Additionally, following the suggestion of Reviewer DmmG, we have added a qualitative visualization demonstrating how deformable attention responds to pose noise.
The paper received mixed reviews. On the positive side:
- Novel architecture (reviewer DmmG)
- Very strong performance (reviewers DmmG, 7HGK)
- More robust to a varying number of views (reviewer DmmG)
- Less memory consumption (reviewers hKRc, 7HGK)
The AC read through all rebuttal posts and finds that the authors did a very good job at resolving the questions by providing the necessary experiments. The AC also checked the paper and finds that it is well written.
Regarding reviewer hKRc, the question about occlusions after the rebuttal is not clear to the AC. The second point of criticism about novelty is imprecise: as reviewer DmmG mentions, deformable attention is not novel by itself but has not been used for these types of models. Thus, designing such an architecture and demonstrating its strong performance can be considered an important insight for the community. The usefulness of deformable attention can be seen in the response to DmmG. Reviewer 7HGK confirmed in the discussion that the rebuttal addressed the concerns, even for the InstantMesh comparison, which seemed to be the main point of criticism.
A comparison to GRM should not be held against this paper, as this is concurrent (unpublished) work.
Due to the strong performance and the interesting contributions for the community, the AC decided to follow reviewer DmmG and recommends accepting the paper.