VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians
We introduce view-tied 3D Gaussians, a novel representation for RGBD SLAM that improves scalability by tying Gaussians to depth pixels, reducing storage needs and enabling finer detail representation.
Abstract
Reviews and Discussion
To address the high memory consumption of 3DGS, this paper proposes view-tied 3DGS, which determines Gaussians based on the views. The Gaussians from the last frame are tracked and processed in sections. Since the representation is view-dependent, it efficiently reduces the storage required for location, rotation, and scale values. The location of each Gaussian is expressed through depth, and adjacent frames are used to handle missing depth information. When a new section is optimized, pose optimization is performed using a rendering loss, and the head frame of the previous section is used to minimize total error. The Gaussians in the scene are trained with L1, SSIM, and depth losses. Finally, BA is applied by optimizing the head frame's Gaussians and the camera pose to reduce absolute pose error.
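For concreteness, mapping objectives of this kind typically take the following standard form, where $\hat{I}, \hat{D}$ are the rendered color and depth and $I, D$ the observations; the weights $\lambda, \beta$ are placeholders here, not the paper's values:

$$
\mathcal{L}_{\mathrm{map}} = (1-\lambda)\,\lVert \hat{I}-I \rVert_1 + \lambda\,\bigl(1-\mathrm{SSIM}(\hat{I},I)\bigr) + \beta\,\lVert \hat{D}-D \rVert_1
$$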
Questions for Authors
How much is the actual difference in memory usage in Table 11?
In Supplementary Section D.1, I am curious about the model's robustness to input depth noise. Are experimental results available to assess this?
Claims and Evidence
The paper's main claim is that, instead of storing all 3DGS parameters, a section-based storage approach is more efficient under limited GPU resources. The paper argues that the model can operate on large-scale scenes, demonstrating better rendering performance and lower pose error than existing models. However, the experiments are conducted only on indoor datasets, and no actual large-scale scenes are presented.
Methods and Evaluation Criteria
Instead of representing the entire scene with one set of Gaussians, this paper proposes a view-based approach in which the scene is represented using depth values from each view. To reduce the number of parameters, location is replaced with depth, and rotation is eliminated by representing Gaussians as spheres. Within each section, Gaussians are merged to create the scene. To further reduce redundancy, visibility across views is measured, and only Gaussians in invisible regions are additionally stored. For each new section, the head frame from the previous section with the highest view overlap is selected, and the Gaussians are then stored based on these views.
The comparison models include neural implicit field models and 3DGS-based models. On the Replica dataset, depth L1 and F1 scores were measured. On other datasets, ATE RMSE was used for tracking comparisons, and PSNR, SSIM, and LPIPS were used for rendering comparisons.
Theoretical Claims
There are no specifically proposed theoretical claims.
Experimental Design and Analysis
The experiments follow the other SLAM methodologies and are conducted on the Replica, TUM-RGBD, ScanNet and ScanNet++ datasets. The evaluations are performed on an NVIDIA RTX 4090, where comparisons of runtime and the number of Gaussians are presented. Additionally, the experiments include comparisons between anisotropic and isotropic Gaussians, analyses of different section lengths, evaluations of overlap selection methods, and ablation studies on the impact of the visibility mask.
Supplementary Material
The paper includes video results and the main function of the code. Additionally, it provides further details on the implementation, experimental results, visualizations of ablation studies, and supporting figures.
Relation to Existing Literature
The paper employs a method that represents RGBD SLAM using Gaussians and proposes a section-based representation. Instead of using point clouds, it adopts a 3DGS primitive-based approach. However, since the Gaussians are tied to depth-based representation, extending this method to RGB SLAM appears challenging.
Essential References Not Discussed
There do not appear to be any essential references that are not discussed.
Other Strengths and Weaknesses
Strengths
It is interesting that the model achieves good rendering results with a more simplified representation compared to other models.
Weaknesses
The explanation of overlapping sections in the method does not flow well, and the description of the threshold is unclear. Despite the significant impact of the overlap selection, it is not well explained in the overview, making it difficult to understand.
There are no experimental results on actual large-scale scenes. In Table 11, the runtime is slower than that of G-SL and NICE, and both the total and maximum numbers of Gaussians are higher than in G-SL. This suggests that additional experiments on large-scale scenes and a direct comparison of GPU memory usage are necessary.
As the number of frames increases, the number of overlap candidates is likely to grow, potentially increasing the time required to search for sections. However, there are no experiments addressing this issue. Additionally, as the number of sections increases, the redundancy of Gaussians stored in the head frame is expected to rise.
Since the model is designed based on depth modality, it is likely to be highly dependent on dataset-specific factors such as the number of captured frames and resolution. Furthermore, in large-scale outdoor environments where dense depth acquisition is not feasible, the model may fail to operate effectively. This suggests that the model has generalization limitations.
Other Comments or Suggestions
There is a spacing typo on line 417: "Visibility ."
Thanks for your review and positive comments on our idea, contributions, evaluations, and supplementary materials.
1. Overview
We will revise the overview accordingly to make it easier to follow.
2. Runtime comparison in Tab.11
As stated in Lines 434-438 (left column), we optimize many more Gaussians at each frame to achieve much better rendering quality and more accurate pose estimation than other NeRF-based and GS-based methods, at the cost of slightly longer total runtime. However, we show a clear advantage when runtime is averaged over Gaussians. In Tab. 11, all GS-based methods use as many Gaussians as possible until no further improvement can be made. We report peak GPU memory usage below as a supplement to Tab. 11.
| Methods | NICE-SLAM | Point-SLAM | SplaTAM | Gaussian-SLAM | Ours |
|---|---|---|---|---|---|
| Peak GPU Use (GiB) | 12.0 | 7.7 | 18.5 | 4.2 | 5.4 |
Although Gaussian-SLAM uses less memory, we employ more Gaussians and produce much better rendering, as shown below. We also report additional experiments on city-level scenes. Due to the time limit, we only report results on several KITTI scenes to evaluate tracking and mapping performance below. We also provide a visual comparison of rendering performance at the link: https://imgur.com/a/FOIgcrz.
| Methods | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 (ATE RMSE[m]) | 3.02 | 58.83 | 2.22 | 2.06 |
| 01 (ATE RMSE[m]) | 77.51 | 84.45 | 74.47 | 29.01 |
| 05 (ATE RMSE[m]) | 128.88 | 80.39 | 117.43 | 7.74 |
| 10 (ATE RMSE[m]) | 10.60 | 43.82 | 11.39 | 4.54 |
| Methods | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 (PSNR) | 15.51 | 9.82 | 15.82 | 28.54 |
| 01 (PSNR) | 15.95 | 12.89 | 14.69 | 30.33 |
| 05 (PSNR) | 16.22 | 26.48 | 15.98 | 28.19 |
| 10 (PSNR) | 15.58 | 25.58 | 14.58 | 27.59 |
| Peak GPU Use (GiB) | 2.74 | 22.37 | 3.56 | 4.79 |
3. Time on overlapping section selection
To select the overlapping section, we project the downsampled depth points (200,000) from the current view into the previous candidates. This simple scheme is quite fast and adds negligible overhead, as shown in the timing comparison below (a sketch of the projection test follows the table).
| Number of Overlapping Candidate Frames | 200 | 400 | 800 | 1600 |
|---|---|---|---|---|
| Time (s) | 0.038 | 0.079 | 0.144 | 0.282 |
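A minimal NumPy sketch of this projection-based overlap test, assuming pinhole intrinsics `K` and 4x4 camera-to-world poses `T_cur`, `T_cand`; the function name, stride, and bounds test are illustrative, not the paper's exact implementation:

```python
import numpy as np

def overlap_ratio(depth, K, T_cur, T_cand, stride=8):
    """Fraction of downsampled current-view depth points that project
    inside a candidate view. Hypothetical sketch of the selection test."""
    h, w = depth.shape
    v, u = np.mgrid[0:h:stride, 0:w:stride]
    d = depth[v, u].ravel()
    valid = d > 0
    pix = np.stack([u.ravel(), v.ravel(), np.ones_like(d)])  # homogeneous pixels
    cam = np.linalg.inv(K) @ pix * d                     # current camera frame
    world = T_cur[:3, :3] @ cam + T_cur[:3, 3:4]         # camera-to-world
    local = T_cand[:3, :3].T @ (world - T_cand[:3, 3:4])  # into candidate frame
    z = local[2]
    uv = K @ (local / np.clip(z, 1e-6, None))
    inside = (z > 0) & (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    return float((inside & valid).sum()) / max(int(valid.sum()), 1)
```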
4. What if depth is not available
SLAM methods using RGBD images are indeed dedicated to indoor scenes. For outdoor scenes, sensor-captured depth maps may be unavailable, but we can use monocular depth priors, such as DepthAnything or MiDaS, to predict depth maps from RGB images as initialization. In the first several frames, we can register the predicted depth to the sparse depth from SfM (Structure from Motion). In the following frames, we can register the predicted depth to the depth rendered with Gaussians as a coarse depth to initialize Gaussians.
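A common recipe for this registration step is a least-squares scale/shift fit of the predicted depth to the sparse metric depth; the sketch below is an assumed reasonable procedure, not necessarily the authors' exact one:

```python
import numpy as np

def align_mono_depth(pred, sparse, mask):
    """Least-squares scale/shift that registers a monocular depth
    prediction to sparse metric depth (e.g., from SfM)."""
    x = pred[mask].ravel()
    y = sparse[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)  # fit y = s*x + t
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t
```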
5. Memory comparisons
In Tab. 11, we report the memory cost with the most Gaussians each method can use until no further improvement can be made. Please see our memory consumption on KITTI in the second table of our response to Question 2 above.
6. Impact of depth noise on the performance
Our results on real datasets such as ScanNet were reported using depth maps that already contain noise.
Although our Gaussians are fixed at noisy depths, Gaussian splatting is flexible enough to overfit the current frame and neighboring frames by tuning other attributes such as color, opacity, and shape. Our results show that depth noise does not significantly impact rendering. We report additional results below (a sketch of the noise-injection protocol follows the table).
| | 10% pixels w/ noise | 20% pixels w/ noise | 30% pixels w/ noise | Ours (w/o additional noise, fixed) |
|---|---|---|---|---|
| PSNR | 43.41 | 43.40 | 43.29 | 43.06 |
| SSIM | 0.996 | 0.996 | 0.996 | 0.996 |
| LPIPS | 0.015 | 0.015 | 0.015 | 0.013 |
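The noise-injection protocol can be sketched as follows; the Gaussian noise model and `sigma` are assumptions, since the exact protocol is not specified here:

```python
import numpy as np

def add_depth_noise(depth, frac=0.1, sigma=0.05, rng=None):
    """Perturb a random fraction of depth pixels with Gaussian noise,
    as in the robustness test above (assumed noise model)."""
    rng = rng or np.random.default_rng(0)
    noisy = depth.copy()
    idx = rng.random(depth.shape) < frac   # select ~frac of pixels
    noisy[idx] += rng.normal(0.0, sigma, size=int(idx.sum()))
    return noisy
```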
7. Minor issues
We will revise the manuscript accordingly to resolve other minor issues.
I appreciate the authors' response, which addresses most of my concerns. I will keep my original score.
Thanks for your review and comments. We really appreciate it.
Best,
The authors
- This paper addresses the limitation of traditional 3DGS-SLAM methods, which struggle to scale up to extremely large scenes due to inefficient tracking and mapping strategies.
- The authors propose tracking and mapping strategies based on a new 3D representation called view-tied 3D Gaussians, which simplifies traditional 3D Gaussians by tying them to depth pixels, eliminating the need to learn their locations, rotations, and multi-dimensional variances.
- The proposed method demonstrates advantages in both reducing storage requirements and improving rendering quality. Extensive experiments supported the effectiveness of their design.
Questions for Authors
- The paper proposes simplifying an ellipsoidal Gaussian into a sphere, which retains only a color c ∈ R^{1×3}, a radius (variance) r ∈ R^1, and an opacity o ∈ R^1. How does this simplification differ from the approach in Point-SLAM: Dense Neural Point Cloud-based SLAM (ICCV 2023)? It seems that such a simplified Gaussian representation effectively reduces to a neural point cloud. Could the authors clarify the distinctions and advantages of their method compared to this prior work?
- The strategy of selecting overlapping regions and Frozen View-Tied Gaussians raises concerns about the ability to perform global bundle adjustment (BA). Could this approach lead to a degradation in the accuracy of the map and camera poses, potentially resulting in suboptimal solutions? Does the authors' framework include additional mechanisms to ensure global consistency and maintain optimality in the reconstruction?
- I am unclear about the concept of View-tied Gaussians. In traditional Gaussian splatting, only the Gaussians projected onto the current camera view are involved in optimization. How does View-tied Gaussians fundamentally differ from this approach? Could the authors elaborate on the unique aspects and advantages of their method compared to the standard Gaussian splatting framework?
Claims and Evidence
Yes. The claim that "existing methods struggle to scale up to extremely large scenes due to inefficient tracking and mapping strategies" is generally valid. However, the paper does not provide sufficient experimental evidence to demonstrate the superiority of VTGaussian-SLAM in large-scale scenarios. The experiments are conducted only on small-scale scenes from datasets such as Replica, TUM-RGBD, and a small subset of ScanNet++, which do not adequately test the scalability of the proposed method. To convincingly support the claim, the authors should evaluate their approach on larger-scale scenes. A relevant benchmark for comparison could be GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction (ICCV 2023), which addresses similar challenges in large-scale environments.
Methods and Evaluation Criteria
For method: the proposed View-Tied 3D Gaussians method introduces a novel approach to 3D representation in SLAM systems. By tying a 3D Gaussian to each pixel in the depth map, the positions of these Gaussians are determined solely by depth and camera poses, eliminating the need to learn and store their locations or perform density control. This design significantly reduces memory and computational overhead. Additionally, the proposed tracking and mapping strategies focus on rendering and optimizing only a subset of Gaussians associated with the most recent views, rather than all Gaussians in the scene. This approach removes the need to maintain and optimize all Gaussians in memory throughout the training process, thereby improving the scalability of 3D Gaussian Splatting (3DGS) in SLAM applications.
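As a concrete illustration of this design, the following sketch back-projects depth pixels into world-space Gaussian centers, assuming pinhole intrinsics `K` and a camera-to-world pose `T_c2w`; it reconstructs the idea, not the authors' code:

```python
import numpy as np

def view_tied_centers(depth, K, T_c2w):
    """Back-project every valid depth pixel into a world-space Gaussian
    center: positions follow from depth and pose alone, so they need
    not be stored or optimized."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    d = depth.ravel()
    pix = np.stack([u.ravel(), v.ravel(), np.ones_like(d)])  # homogeneous pixels
    cam = np.linalg.inv(K) @ pix * d                 # camera-frame points
    world = T_c2w[:3, :3] @ cam + T_c2w[:3, 3:4]     # world-frame centers
    return world.T[d > 0]                            # (N, 3) centers
```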
For evaluation: As I suggested in the previous section, the experiments are insufficient. Only small-scale datasets such as Replica, TUM-RGBD, and small subsets of ScanNet++ are conducted, which do not adequately test the scalability of the proposed method. To convincingly support the claim, the authors should evaluate their approach on larger-scale scenes.
Theoretical Claims
Yes. The theoretical proofs in this paper are quite concise, as the work is primarily based on modifications to an existing code framework. Consequently, the tracking and mapping sections lack substantial theoretical development, and the focus of the paper is predominantly on the implementation aspects rather than theoretical contributions.
Experimental Design and Analysis
Yes. I review the experiments in 4.1. Comparisons and 4.2. Ablation Studies and Analysis. The main experimental results are compelling. The authors demonstrate that their view-tied Gaussians significantly reduce storage requirements, enabling the maintenance of a large number of Gaussians within limited GPU memory. Furthermore, the ablation studies provide additional validation for the effectiveness of the proposed VT (view-tied) approach. Specifically, the ablation study compares different attributes of 3D Gaussians, including anisotropic Gaussians (aniso), isotropic Gaussians (iso), and view-tied Gaussians (VT), further confirming the advantages of the VT design.
Supplementary Material
I have reviewed the supplementary material provided by the authors to examine the implementation details, experimental results, and demo demonstrations in greater depth. I have gone through essentially all of the supplementary material.
Relation to Existing Literature
The key contributions of this paper build upon and address limitations in recent works (Keetha et al., 2024; Matsuki et al., 2024; Huang et al., 2024b; Yan et al., 2024; Yugay et al., 2023; Sandström et al., 2024), which employ various tracking and mapping strategies but require maintaining and optimizing all Gaussians covering the scene within limited GPU memory to ensure color and geometry consistency across all previous views.
There are also more complex systems (Liso et al., 2024; Zhu et al., 2024; Bruns et al., 2024; Sandström et al., 2024) that incorporate loop-closure mechanisms into their optimization. However, detecting loop closures across views often relies on pre-trained priors and is highly sensitive to image quality.
Essential References Not Discussed
No
Other Strengths and Weaknesses
Weakness:
- I believe the novelty and effectiveness of this paper are somewhat weak, although I agree with the motivation of addressing the scalability limitations of existing methods that struggle to optimize all 3D Gaussians within limited GPU memory. The proposed Learnable View-Tied Gaussians and Frozen View-Tied Gaussians essentially follow a standard incremental SLAM framework. The View-Tied Gaussians approach appears to focus primarily on optimizing the 3D scene for the current view, which seems conceptually similar to the tracking and mapping of the current frame in prior works. As such, the contribution leans more toward engineering implementation than groundbreaking innovation.
- The experiments fail to demonstrate the superiority of the proposed method in large-scale scenarios. The paper and video do not provide sufficient evidence regarding key performance metrics such as tracking and mapping accuracy, rendering quality, memory efficiency, and computational time in large-scale environments. While the results on smaller datasets (e.g., Replica, TUM-RGBD, and ScanNet++) are promising, they do not adequately validate the scalability claims. To strengthen the paper, the authors should include evaluations on larger-scale benchmarks, which would more convincingly showcase the advantages of their approach in real-world applications.
- The proposed method cannot achieve real-time performance, running at approximately 0.5 tracking FPS even on the easiest dataset (Replica), which is unacceptable for SLAM frameworks with real-time demands.
Other Comments or Suggestions
No
Thanks for your review and positive comments on motivation and performance.
1. Results on city-level scenes
Following SplaTAM, we evaluate on widely used benchmarks such as ScanNet++ and demonstrate superior storage efficiency, learning 20 times more Gaussians for more detailed rendering while overcoming out-of-memory issues during optimization. These advantages significantly improve our capability in large-scale scenes. Due to the time limit, we additionally report results on several city-level scenes in KITTI below; a visual comparison is shown at https://imgur.com/a/FOIgcrz.
| ATE RMSE[m] | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 | 3.02 | 58.83 | 2.22 | 2.06 |
| 01 | 77.51 | 84.45 | 74.47 | 29.01 |
| 05 | 128.88 | 80.39 | 117.43 | 7.74 |
| 10 | 10.60 | 43.82 | 11.39 | 4.54 |
| PSNR | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 | 15.51 | 9.82 | 15.82 | 28.54 |
| 01 | 15.95 | 12.89 | 14.69 | 30.33 |
| 05 | 16.22 | 26.48 | 15.98 | 28.19 |
| 10 | 15.58 | 25.58 | 14.58 | 27.59 |
| Peak GPU Use (GiB) | 2.74 | 22.37 | 3.56 | 4.79 |
Memory on KITTI is reported in the table above. Each method uses as many Gaussians as possible until no further improvement can be made. We use more memory, but manage to produce much better rendering with more Gaussians.
We will also cite GO-SLAM. Under the same RGBD setting as ours, GO-SLAM did not report results on city-level scenes either, and we cannot produce plausible GO-SLAM results on KITTI at this moment. Note that it requires an RGB or stereo setting for larger scenes.
2. Novelty and effectiveness
Our novelty lies not only in the view-tied Gaussians, which save storage so that more Gaussians can recover more details, but also in novel tracking and mapping strategies designed to work with view-tied Gaussians.
Our tracking strategy resolves camera pose error accumulation when not all Gaussians are accessible as a global reference. Our mapping strategy finds a balance between storage complexity and rendering quality. We believe these strategies are more than an engineering effort; their effectiveness has been justified in our extensive experiments and ablation studies.
3. The 1st misunderstanding
“The View-Tied Gaussians approach appears to … current frame in prior works.” is a misunderstanding. As in our response to your Question 2 and the explanations in Lines 214-239, we keep all Gaussians in the current section, which covers the current frame and its neighboring frames, learnable, rather than focusing merely on the current frame. This differentiates our method from the incremental SLAM framework.
4. Runtime
Compared to classic SLAM methods, rendering-based SLAM methods still have large room to improve runtime efficiency. However, rendering-based methods like ours provide novel view synthesis, which can be used directly in VR; this is a vital capability that classic SLAM methods cannot offer. Tab. 11 reports that our runtime efficiency is comparable to the latest rendering-based methods.
5. Difference to Point-SLAM
Although we use a simplified isotropic Gaussian representation, each sphere is still a Gaussian with its other attributes, rather than a point. Ours is therefore a GS-based SLAM, different from NeRF-based SLAM such as Point-SLAM. Runtime comparisons in Tab. 11 show that splatting is faster than ray tracing, and the numerical comparisons in Tab. 3-8 show our superior performance over Point-SLAM.
6. Global bundle adjustment
As stated in Sec. 3.5, we do have BA, but only at the head frame. Since we employ a large number of Gaussians over the scene and cannot access all of them at the same time, we cannot perform a global BA over all frames. But our superior performance shows no degradation.
We do, however, have special designs that ensure global consistency and optimality with multi-view constraints in both tracking and mapping, such as selecting the earliest overlapping section as a common reference in tracking, and keeping Gaussians in other sections fixed when mapping the current section.
7. Difference to traditional GS-based SLAM
Firstly, our Gaussians are fixed at depth points and have fewer attributes to learn. Secondly, we only need to keep the Gaussians in the current section learnable and in memory, without using keyframes, whereas traditional GS-based SLAM methods must keep all Gaussians learnable in memory to maintain consistency with the current frame and all keyframes. These designs enable us to employ more Gaussians to recover more details in larger scenes. A minimal sketch of this bookkeeping follows.
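A minimal PyTorch sketch of this per-section bookkeeping, assuming each section stores its Gaussian attributes as a dict of tensors (the names and structure are assumptions, not the authors' code):

```python
import torch

def set_section_learnable(sections, current_idx):
    """Keep only the current section's Gaussian attributes learnable;
    Gaussians of finished sections stay frozen (and could be offloaded
    from GPU memory)."""
    for i, sec in enumerate(sections):
        learnable = (i == current_idx)
        for name, param in sec.items():   # e.g. color, opacity, radius
            param.requires_grad_(learnable)
```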
8. The 2nd misunderstanding
“In traditional Gaussian splatting, … involved in optimization” is a misunderstanding. In the SLAM setting, traditional GS-based methods must keep all Gaussians (rather than merely those in the current frame) learnable and involved in the optimization, since they constrain Gaussians using both the current view and the keyframes to maintain global consistency.
Thank you for your detailed response, which addresses most of my concerns. I will update my rating accordingly.
Thanks for your review and comments. We really appreciate it.
Best,
The authors
The paper presents VTGaussian-SLAM, a novel RGBD SLAM system that utilizes view-tied 3D Gaussians for efficient mapping and tracking in large-scale scenes. It introduces the representation of Gaussians tied to depth pixels, thus improving optimization efficiency and reconstruction quality while enabling better scalability in SLAM applications.
Questions for Authors
- Did you prune or densify the GS during the mapping iterations?
- Selection of hyperparameters {α, β} in tracking. Depth maps from TUM-RGBD and ScanNet++ exhibit significantly higher levels of noise than those from Replica. When depth information is imperfect, it is intuitive that the system should rely less on the depth loss, suggesting that β (the weight assigned to the depth loss) should be smaller in such scenarios. However, the system uses a much larger β when running on TUM-RGBD and ScanNet. Could you explain this choice?
Claims and Evidence
Yes, the claims made in the submission supported by clear and convincing evidence.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria make sense.
Theoretical Claims
Yes, they are theoretically correct.
Experimental Design and Analysis
Overall, the experimental designs are rigorous, and the discussions are comprehensive.
Supplementary Material
The supplementary materials include implementation details and detailed results on each evaluated scene, more analysis and visualizations, code and a demo video. The submitted materials are comprehensive to facilitate paper understanding.
Relation to Existing Literature
The proposed methodology contributes to the improvement of GS-based RGBD SLAM systems.
Essential References Not Discussed
The most related works have been discussed in the paper.
Other Strengths and Weaknesses
Strengths:
- The paper develops an out-of-core algorithm for 3D GS based SLAM.
- The paper proposes a simple yet effective supervision strategy to improve RGBD SLAM performance, achieving SOTA results.
- The paper is well written and easy to follow.
Weaknesses:
- The influence of section length. The length of the Gaussian section, i.e., N, is an essential hyperparameter of the proposed system. The Gaussians are maintained within a fixed-length section, and the selection of this hyperparameter is sensitive to datasets. In Lines 257-261, the authors mention choosing different N on different datasets according to image resolution and mapping iterations. Although the paper conducts an ablation study on ScanNet to experimentally show that the chosen section length is optimal, there is no substantive analysis or evidence that this hyperparameter is related to image resolution and mapping iterations. Or should it be related to the camera motion pattern? A deeper analysis would be appreciated.
- Lack of implementation details. The numbers of optimization iterations in tracking and mapping do not appear to be reported. Are the optimization iterations consistent across different datasets, and if not, how do they vary? Furthermore, Table 11 could include average operation time.
- Lack of evaluation on "large-scale" scenes. The paper emphasizes the scalability challenges of SLAM in large-scale scenes. However, the method is only evaluated on datasets with room-level scenes. In general, large-scale scenes in SLAM tasks typically involve building- or even city-level environments. A more comprehensive evaluation on such large-scale scenarios, e.g., KITTI, would strengthen the paper's claims regarding scalability.
Other Comments or Suggestions
- At the top of each subfigure in Figure 1, should it be Section 1 or Section o, {g}_1 or {g}_o?
- Line 369, "Without using sections ("1"), we cannot ...": should it be section ("1") or section ("o")?
- Figure 4 can include the corresponding rendering image and error for each section to enhance visualization.
Thanks for your review and positive comments on our idea, contributions, evaluations, and supplementary materials.
1. Impact of section length
For fair comparisons with previous methods in rendering quality, we adopt the same number of mapping iterations. But our Gaussians are view-tied and are usually more numerous than those of previous methods when covering the same number of frames. So we set a proper section length to ensure the Gaussians in a section can be well optimized within the same number of iterations. Comparisons in Tab. 10 indicate that rendering may degenerate if more and more Gaussians are included in a section under the same number of iterations. We agree camera motion may also be a factor to consider, but we prioritize fair comparisons.
2. Tracking and mapping iterations
For fair comparisons, we follow previous methods and conduct optimization with the same numbers of iterations on different benchmarks, e.g., 100 iterations for mapping and 60 iterations for tracking on Replica. Tab. 11 reports the operation time per frame. We will state these iteration counts clearly in our revision.
3. Results on city-level scenes
We follow previous methods like SplaTAM to report evaluations on widely used benchmarks such as ScanNet++. We also show our advantage in storage complexity: we can learn 20 times more Gaussians than the latest methods, like SplaTAM, to recover more details on all frames, and we overcome out-of-memory issues at any frame during optimization. These advantages significantly improve our capability and performance in large-scale scenes. Due to the time limit, we only report results on several city-level KITTI scenes to evaluate tracking and mapping performance below. We also provide a visual comparison of rendering performance at the link: https://imgur.com/a/FOIgcrz.
| Methods | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 (ATE RMSE[m]) | 3.02 | 58.83 | 2.22 | 2.06 |
| 01 (ATE RMSE[m]) | 77.51 | 84.45 | 74.47 | 29.01 |
| 05 (ATE RMSE[m]) | 128.88 | 80.39 | 117.43 | 7.74 |
| 10 (ATE RMSE[m]) | 10.60 | 43.82 | 11.39 | 4.54 |
| Methods | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 (PSNR) | 15.51 | 9.82 | 15.82 | 28.54 |
| 01 (PSNR) | 15.95 | 12.89 | 14.69 | 30.33 |
| 05 (PSNR) | 16.22 | 26.48 | 15.98 | 28.19 |
| 10 (PSNR) | 15.58 | 25.58 | 14.58 | 27.59 |
| Peak GPU Use (GiB) | 2.74 | 22.37 | 3.56 | 4.79 |
We also report memory consumption on KITTI in the table above. Each method uses as many Gaussians as possible until no further improvement can be made. We use slightly more memory, but manage to use more Gaussians to produce much better rendering.
4. Densify and prune Gaussians during mapping
No, we do not need these operations, and skipping them speeds up mapping. Our view-tied Gaussians already provide enough Gaussians to achieve better rendering quality, together with an efficient strategy to resolve out-of-memory issues.
5. Larger weight on the depth loss
Although depth maps from TUM-RGBD and ScanNet are imperfect, quality issues such as motion blur, low resolution, and varying exposure times can make RGB supervision even less reliable to us than depth. We found that weighting the depth loss more heavily takes full advantage of our view-tied Gaussians on real images; a sketch of such a weighted objective follows.
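A minimal sketch of a weighted tracking objective with `alpha` and `beta` as the color/depth weights discussed above; the masking and exact terms are illustrative, not the paper's implementation:

```python
import torch

def tracking_loss(rgb_hat, rgb, depth_hat, depth, alpha, beta):
    """L1 color term plus a (more heavily weighted) L1 depth term,
    computed only on pixels with valid sensor depth."""
    valid = depth > 0
    l_color = (rgb_hat - rgb).abs().mean()
    l_depth = (depth_hat[valid] - depth[valid]).abs().mean()
    return alpha * l_color + beta * l_depth
```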
6. Minor issues
We will revise the manuscript accordingly to resolve other minor issues.
Thank you for your response. I do not have further concerns and keep the original score.
Thanks for your review and comments. We really appreciate it.
Best,
The authors
This work presents VTGaussian-SLAM, a method for RGB-D SLAM built on a novel view-tied 3D Gaussian representation with corresponding tracking and mapping strategies. The representation reduces the parameters to optimize (e.g., exact location, rotation, and covariance), so the system can store many more Gaussians in GPU memory for more detailed and/or larger-scale mapping. The method is organized in sections of frames. Within a section, added frames spawn new Gaussians (to cover novel geometry) or update existing ones for uniform appearance and geometry. Crucially, only the current section's Gaussians (plus a limited number of overlapping sections for pose consistency) are learnable at any point, removing the need for global consistency with all keyframes. Results on Replica, TUM-RGBD, ScanNet, and ScanNet++ show improvements in camera tracking accuracy (ATE), rendering quality (PSNR, SSIM, LPIPS), and reconstruction metrics (depth L1, F1) over prior state-of-the-art SLAM methods based on both implicit fields (NeRF variants) and 3D Gaussian splatting.
Questions for Authors
See weaknesses and other comments.
Claims and Evidence
The authors claim that by using view-tied Gaussian representation, the system reduces parameter storage (saving location, rotation, and variance) by directly binding each Gaussian to a depth pixel, and allows more Gaussians to represent local details. The authors use ablation experiments on benchmarks to demonstrate the effectiveness of the method using ATE RMSE, PSNR, and other quantitative metrics.
Methods and Evaluation Criteria
Evaluation criteria include ATE RMSE, rendering quality, reconstruction results, and runtime and memory usage. This is a good evaluation of the various aspects of neural SLAM.
Theoretical Claims
The contribution of this work lies more in algorithm design and is demonstrated experimentally rather than through formal proofs. The main hypothesis is that "view-tied" Gaussians can represent geometry/color for local views (with supervision in depth) sufficiently without the expense of unconstrained 3D Gaussian parameters.
Experimental Design and Analysis
The experimental metrics are well designed, but the benchmarks are all indoor datasets and do not involve the extremely large scenes mentioned in the abstract and introduction.
Supplementary Material
The video in the supplementary material explains in detail the main improvements of the article's algorithm and demonstrates the effectiveness of the paper's approach through visual comparison of results.
The code provided in the supplementary contains only a main.py file, without the other files, so the code cannot be run directly. However, the main file does show the main flow of the algorithm.
Relation to Existing Literature
The paper's approach builds on 3D Gaussian Splatting and Neural Implicit SLAM, and correctly cites the body of related research.
Essential References Not Discussed
No Essential References Not Discussed
Other Strengths and Weaknesses
- Although the authors say their method works better for extremely large scenes, most evaluations are still at single-room or short multi-room scale. It would be helpful to show benchmark results on city-scale or other large scenes.
- The authors' method uses the depth from the RGB-D sensor for each frame. But depth is usually noisy, so why don't the authors optimize the depth values along the camera ray direction? Would this lead to better rendering results?
- For the bundle adjustment part, the authors did not explain how many iterations were used. Also, after the online mapping stage, does the method run a final refinement bundle adjustment for several iterations (similar to Mono-GS)?
Other Comments or Suggestions
- In the ATE RMSE section of Table 9, it seems the results in the first three columns are filled in incorrectly; they differ too much from the results in the fourth column.
Thanks for your review and positive comments on our idea and evaluations.
1. Benchmark selection and large-scale scenes
We follow previous methods like SplaTAM to report evaluations on widely used benchmarks such as ScanNet++. We also show our advantage in storage complexity: we can learn 20 times more Gaussians than the latest methods, like SplaTAM, to recover more details on all frames, and we overcome out-of-memory issues at any frame during optimization. These advantages significantly improve our capability and performance in large-scale scenes.
As requested, we additionally report performance on extremely large scenes, such as city-level scenes in KITTI. Due to the time limit, we only report results on several scenes to evaluate tracking and mapping performance below. We also provide a visual comparison of rendering performance at the link: https://imgur.com/a/FOIgcrz.
| Methods | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 (ATE RMSE[m]) | 3.02 | 58.83 | 2.22 | 2.06 |
| 01 (ATE RMSE[m]) | 77.51 | 84.45 | 74.47 | 29.01 |
| 05 (ATE RMSE[m]) | 128.88 | 80.39 | 117.43 | 7.74 |
| 10 (ATE RMSE[m]) | 10.60 | 43.82 | 11.39 | 4.54 |
| Methods | Gaussian-SLAM | SplaTAM | LoopSplat | Ours |
|---|---|---|---|---|
| 00 (PSNR) | 15.51 | 9.82 | 15.82 | 28.54 |
| 01 (PSNR) | 15.95 | 12.89 | 14.69 | 30.33 |
| 05 (PSNR) | 16.22 | 26.48 | 15.98 | 28.19 |
| 10 (PSNR) | 15.58 | 25.58 | 14.58 | 27.59 |
| Peak GPU Use (GiB) | 2.74 | 22.37 | 3.56 | 4.79 |
We also report memory consumption on KITTI in the table above. Each method uses as many Gaussians as possible until no further improvement can be made. We use slightly more memory, but manage to use more Gaussians to produce much better rendering.
2. Impact of depth noise on the performance
Although our Gaussians are fixed at noisy depths, Gaussian splatting is flexible enough to overfit the current frame and neighboring frames by tuning other attributes such as color, opacity, and shape. Our results show that depth noise does not significantly impact rendering performance. Meanwhile, we tried optimizing the position of Gaussians along the ray direction but did not find an obvious improvement in rendering performance. We report additional results below. We also provide a visual comparison using either fixed Gaussians or movable Gaussians (along the ray) at the link: https://imgur.com/a/oEfbgro. A sketch of the movable-along-ray parametrization follows the table.
| | 10% pixels w/ noise | 20% pixels w/ noise | 30% pixels w/ noise | Gaussians movable along ray | Ours (w/o additional noise, fixed) |
|---|---|---|---|---|---|
| PSNR | 43.41 | 43.40 | 43.29 | 42.89 | 43.06 |
| SSIM | 0.996 | 0.996 | 0.996 | 0.995 | 0.996 |
| LPIPS | 0.015 | 0.015 | 0.015 | 0.020 | 0.013 |
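A schematic of the movable-along-ray variant tested above, with a hypothetical learnable per-Gaussian offset `delta`; initialization and regularization of `delta` are not specified here:

```python
import torch

def centers_along_ray(origins, dirs, depths, delta):
    """Each center is origin + (depth + delta) * ray_dir, where delta
    is a learnable per-Gaussian scalar offset along the viewing ray."""
    return origins + (depths + delta).unsqueeze(-1) * dirs

# Usage sketch: delta = torch.zeros(N, requires_grad=True), optimized
# jointly with the other Gaussian attributes.
```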
3. Bundle adjustment
As stated in Lines 243-245, our bundle adjustment is only used at the head frame of each section. Since the head frame is crucial to starting a section and serves as a reference for the following frames in the same section, this design greatly stabilizes the optimization and achieves high accuracy. Due to the large total number of Gaussians over a large-scale scene, we cannot conduct a final or global bundle adjustment using all Gaussians. The 80-iteration optimization is illustrated in Fig. 5.
4. Numerical comparison in Tab. 9
We confirmed the results are correct. We keep the experimental setting the same and change only the kind of Gaussians. One analysis is that these alternatives do not work well with some parameters, producing rendering errors that accumulate quickly across frames during tracking.
5. Code
The code in the supplementary materials is merely for demonstration. We will release the code upon acceptance.
This paper introduces VTGaussian-SLAM, an RGB-D SLAM system that utilizes a novel view-tied 3D Gaussian representation. By anchoring Gaussians to depth pixels, the method avoids the need to optimize explicit parameters like position, rotation, and scale, significantly reducing memory usage and computational overhead. The system organizes frames into sections, optimizing only the current and overlapping sections' Gaussians at any given time, enhancing scalability and efficiency.
All reviewers turned positive after reading the authors' comprehensive feedback. Reviewers 4tKp and wSTB gave acceptance scores, and reviewers HsjG and N2ss gave weak-acceptance scores. The AC carefully read the reviews and agrees with the reviewers' recommendation. The AC recommends incorporating the additional results and discussions into the revised version.