Holistic Large-Scale Scene Reconstruction via Mixed Gaussian Splatting
Abstract
Reviews and Discussion
This work proposes a large-scale scene reconstruction framework, MixGS, based on Gaussian Splatting. Specifically, this work introduces a Mixed Gaussian representation that combines Coarse 3D Gaussians from initial points with Shifted Gaussians derived from view-aware positional encoding. The Mixed Gaussian representation is optimized and then rendered for the final results. Experimental results on several datasets show the effectiveness of the proposed method.
Strengths and Weaknesses
Paper Strengths
This work focuses on two limitations of previous algorithms, and experimental results on several datasets show the effectiveness of the proposed method.
Major Weaknesses
Introduction
- The limitations of previous algorithms. The paper highlights two limitations of previous algorithms (Complex Parameter Tuning and Loss of Global Information), but it does not present specific examples. It would be better to provide concrete cases illustrating these problems.
Method
- Coarse 3D Gaussians Training. As illustrated in this section, Gaussians are initialized from sparse SfM points. According to line 199 ('Note that the number of decoded Gaussians are the same with Gaussians in the view frustum'), the decoded Gaussians are correlated with the initial set. How can areas with extremely sparse initialization (e.g., due to textureless input) be effectively processed?
- View-Aware Representation Modeling. The proposed process of learning view-aware Gaussians is quite similar to Scaffold-GS [1]. It would be better to discuss the difference between the proposed method and Scaffold-GS.
[1] Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering
Experiments
- Quantitative Results. Why does the proposed MixGS show consistently worse results on the human perception metric (LPIPS), as shown in Table 1?
- Rendering Speed. Is it feasible to compare all the methods on the same platform?
Minor Weaknesses
typo
- line 121: Define -> Definition
Questions
- It would be better to provide concrete cases illustrating the stated problems.
- How can areas with extremely sparse initialization (e.g., due to textureless input) be effectively processed? It would be better to provide some examples.
- Why does the proposed MixGS show consistently worse results on the human perception metric (LPIPS), as shown in Table 1?
- Is it feasible to compare all the methods on the same platform?
Limitations
yes
Final Justification
After reading the comments from other reviewers and the authors' rebuttal, the reviewer maintains the rating as borderline accept.
Formatting Issues
There is no major formatting issue.
We thank the reviewer for the thorough evaluation and constructive feedback. We address each concern below:
Q1: Specific Cases of the Problems
1. Complex Parameter Tuning
Prior divide-and-conquer approaches rely heavily on carefully crafted view-assignment strategies for each individual block, which introduces considerable tuning complexity. For example, VastGaussian requires setting a visibility threshold to determine whether a particular camera view should contribute to a block, while CityGaussian introduces an intersection threshold for selection. Beyond that, one must also decide how many blocks to divide the scene into and define the splitting intervals along each axis. These hyperparameters vary across scenes and typically require manual adjustment or grid search, significantly increasing the implementation burden and making reproduction more challenging.
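For illustration, a hypothetical configuration sketch of the per-scene knobs such pipelines typically expose; every name and value here is invented for this example and is not taken from any released codebase:

```python
# Hypothetical per-scene partitioning config for a divide-and-conquer pipeline.
# All names and values are illustrative, not from VastGaussian/CityGaussian code.
partition_config = {
    "num_blocks_x": 4,              # blocks along the x axis
    "num_blocks_y": 2,              # blocks along the y axis
    "block_overlap_ratio": 0.2,     # extra margin shared by adjacent blocks
    "visibility_threshold": 0.25,   # min. projected coverage for a view to join a block (VastGaussian-style)
    "intersection_threshold": 0.5,  # min. frustum/block intersection for view selection (CityGaussian-style)
}
# Each entry typically has to be re-tuned (or grid-searched) per scene,
# which is the tuning burden described above.
```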
2. Loss of Global Information
Existing large-scale scene reconstruction methods employ a divide-and-conquer strategy, partitioning the scene into independent blocks for multi-GPU training. Since each block is optimized in isolation, multi-view constraints do not span block boundaries, which can lead to visible artifacts or texture misalignment at those boundaries. Methods such as CityGaussian, CityGaussianV2, and VastGaussian attempt to mitigate these artifacts by increasing the overlap between adjacent blocks. The more recent DOGS framework further addresses boundary inconsistencies through a dedicated “consensus step.” The following statement from the DOGS paper acknowledges that boundary inconsistencies persist in prior methods:
“Our method presents higher-fidelity rendering results than VastGaussian near the splitting boundary, which validates the effectiveness and importance of the consensus step” (DOGS, page 17).
Because the independent optimization of blocks breaks multi-view constraints across boundaries, such post-processing cannot fundamentally resolve the boundary inconsistency.
Regarding qualitative comparisons, NeurIPS policy prevents us from including boundary-issue visualizations in this rebuttal; please refer instead to DOGS’s Figures 8 and 11, which illustrate these boundary effects.
To provide a quantitative comparison, we follow DOGS’s partitioning strategy and select 300 viewpoints across block boundaries in both the Rubble and Building scenes. We render each viewpoint using the DOGS model and our MixGS method, then compute PSNR, SSIM, and LPIPS for both, as reported in the table below. Despite DOGS’s consensus refinement, MixGS consistently yields higher quality at block interfaces, effectively eliminating inconsistencies at the source rather than relying on post-processing.
| Methods | Rubble PSNR ↑ | Rubble SSIM ↑ | Rubble LPIPS ↓ | Building PSNR ↑ | Building SSIM ↑ | Building LPIPS ↓ |
|---|---|---|---|---|---|---|
| DOGS | 27.93 | 0.858 | 0.209 | 25.23 | 0.848 | 0.215 |
| Ours | 29.18 | 0.865 | 0.187 | 26.40 | 0.859 | 0.201 |
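For reference, a minimal sketch of how boundary-crossing viewpoints can be selected for such an evaluation (the block layout and margin below are illustrative; the actual DOGS partition is scene-specific):

```python
import numpy as np

def select_boundary_views(cam_centers, x_splits, y_splits, margin=5.0):
    """Return indices of cameras whose centers lie within `margin` scene units of
    any block-boundary plane. cam_centers: (N, 3) world coordinates; x_splits /
    y_splits: boundary coordinates of the block grid along each horizontal axis."""
    dx = np.abs(cam_centers[:, 0:1] - np.asarray(x_splits)[None, :]).min(axis=1)
    dy = np.abs(cam_centers[:, 1:2] - np.asarray(y_splits)[None, :]).min(axis=1)
    return np.where((dx < margin) | (dy < margin))[0]

# Toy example with a hypothetical 2x2 partition whose boundaries sit at x=0 and y=0.
centers = np.random.uniform(-50.0, 50.0, size=(1000, 3))
boundary_ids = select_boundary_views(centers, x_splits=[0.0], y_splits=[0.0])
```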
Compared with other methods, MixGS optimizes the entire scene as a single entity: all Gaussians are updated under global multi-view constraints, and no block partitioning is performed. This holistic design eliminates boundary inconsistencies at the source, rather than relying on post-processing alignment. We appreciate the reviewer’s suggestion: in the revised manuscript, we will include a clearer discussion of block‑boundary issues and provide illustrative visualizations in the supplementary material.
[1] Dogs: Distributed-oriented gaussian splatting for large-scale 3d reconstruction via gaussian consensus. NeurIPS 2024.
Q2: Processing Areas with Extremely Sparse Initialization
To ensure that richly textured areas are faithfully reconstructed, MixGS employs two complementary mechanisms.
Adaptive Densification: On the one hand, in the coarse training stage, Gaussians are duplicated and split in regions with high-frequency textures and detailed structure (as in 3DGS [1]). As a result, areas with complex details are initially covered by a denser set of coarse Gaussians. This seeding provides a strong prior for the decoded Gaussians to refine fine details.
Offset-Driven Refinement: On the other hand, benefiting from the offset pool, the decoded Gaussians move toward under-reconstructed and high-frequency zones during training, instead of remaining near the corresponding coarse Gaussians. We show the center locations of the coarse and decoded Gaussians in Figure 4 of the manuscript. It can be observed that the decoded Gaussians shift toward more complex areas (e.g., the bushes and rooftop tiles in Fig. 4), further increasing the Gaussian density exactly where it is needed most.
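To make these mechanisms concrete, below is a schematic PyTorch sketch of an offset-driven decoder in the spirit described above. The dimensions, the frequency-encoding stand-in for the multi-resolution hash grid, and the parameter split are all assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def freq_encode(x, num_bands=8):
    """Stand-in for the multi-resolution hash encoding of Gaussian centers: a plain
    frequency (Fourier) encoding, used only to keep this sketch self-contained."""
    bands = (2.0 ** torch.arange(num_bands, device=x.device)) * torch.pi
    xb = x[..., None] * bands                                   # (N, 3, B)
    return torch.cat([xb.sin(), xb.cos()], dim=-1).flatten(-2)  # (N, 6*B)

class OffsetDecoder(nn.Module):
    """Schematic decoder: maps each in-frustum coarse Gaussian (encoded center +
    attributes + camera pose) to one decoded Gaussian with a positional offset."""
    def __init__(self, num_bands=8, attr_dim=7, pose_dim=12, hidden=64):
        super().__init__()
        in_dim = 6 * num_bands + attr_dim + pose_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 3 + 4 + 3 + 1),  # offset | scale | rotation | color | opacity
        )

    def forward(self, centers, attrs, cam_pose):
        pose = cam_pose.expand(centers.shape[0], -1)            # broadcast one camera pose to all Gaussians
        h = torch.cat([freq_encode(centers), attrs, pose], dim=-1)
        off, s, q, c, o = self.mlp(h).split([3, 3, 4, 3, 1], dim=-1)
        return {
            "xyz": centers + off,                 # offsets can pull Gaussians toward detail-rich regions
            "scale": s.exp(),
            "rotation": F.normalize(q, dim=-1),
            "color": c.sigmoid(),
            "opacity": o.sigmoid(),
        }

# Hypothetical usage: decoded = OffsetDecoder()(coarse_xyz, coarse_attrs, pose_vec);
# the decoded set is then concatenated ("mixed") with the coarse Gaussians for rendering.
```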
Moreover, MixGS can generate multiple decoded Gaussians per coarse Gaussian by simply modifying the decoder's output dimension. We evaluated different ratios K (i.e., the number of decoded Gaussians generated per coarse Gaussian) on the Rubble scene:
| Ratio K | PSNR↑ | SSIM↑ | LPIPS↓ | Peak VRAM |
|---|---|---|---|---|
| 1 | 26.66 | 0.792 | 0.267 | 20.45 GB |
| 2 | 26.76 | 0.796 | 0.263 | 22.91 GB |
| 3 | 27.02 | 0.801 | 0.255 | 25.41 GB |
Higher ratios improve quality (PSNR reaches 27.02 dB at 3:1) but increase memory usage. We use the 1:1 setting for optimal balance between performance and memory efficiency on a single 24GB GPU.
[1] 3D Gaussian Splatting for Real-Time Radiance Field Rendering. TOG 2023.
Q3: Different from Scaffold-GS
Scaffold‑GS builds a hierarchical, region‑aware representation by introducing fixed anchor points and dynamically spawning Gaussians around them based on view coverage and feature saliency. In contrast, MixGS explores spatial structure through a unified multi‑resolution hash encoding of Gaussian centers, enabling implicit feature learning at multiple scales.
Moreover, MixGS encodes not only each Gaussian’s position but also its auxiliary attributes, such as rotation, scale, and camera pose, into a high‑level feature vector. These rich, view‑aware features are then decoded by a lightweight MLP to generate fine‑detail Gaussians, whose positions are refined via an offset pool. Finally, our novel mixing mechanism combines the original “coarse” Gaussians with the decoded “fine” Gaussians into a single set for rendering, ensuring both global consistency and local fidelity across an entire large‑scale scene.
In summary, while Scaffold‑GS adaptively assigns Gaussians per region and view, MixGS holistically optimizes the full scene representation, jointly learning coarse structure and fine details, without any block partitioning or anchor placement. We will add this comparison in the revision to clarify these conceptual and technical distinctions.
Q4: Why Is LPIPS Relatively Higher?
MixGS prioritizes holistic consistency (critical for large scenes) rather than optimizing for LPIPS alone, so the slightly higher LPIPS does not reflect poor detail recovery. Methods like DOGS and CityGaussian use divide-and-conquer, which can create "pixel-perfect but lighting-inconsistent" local regions (lower LPIPS for small patches) but fail to align textures/lighting across block boundaries (e.g., the grassland shifts abruptly in row 4, column 2 of Fig. 1 in the supplementary). Our holistic framework avoids this by enforcing global consistency, which may introduce minor local color adjustments to align large structures, and these adjustments are what LPIPS occasionally penalizes.
Additionally, we evaluate MixGS on the MatrixCity and Campus scenes. Both datasets are recognized for their challenging scale and high-frequency details. As shown in the table below, MixGS achieves the highest rendering quality across PSNR, SSIM, and LPIPS. We will include these quantitative results and corresponding qualitative comparisons in the revised manuscript.
| Method | MatrixCity PSNR ↑ | MatrixCity SSIM ↑ | MatrixCity LPIPS ↓ | Campus PSNR ↑ | Campus SSIM ↑ | Campus LPIPS ↓ |
|---|---|---|---|---|---|---|
| MegaNeRF | - | - | - | 23.42 | 0.537 | 0.636 |
| SwitchNeRF | - | - | - | 23.62 | 0.541 | 0.616 |
| 3DGS | 23.67 | 0.735 | 0.384 | 18.95 | 0.630 | 0.447 |
| VastGaussian | 28.33 | 0.835 | 0.220 | 23.82 | 0.695 | 0.329 |
| CityGaussian | 27.32 | 0.813 | 0.211 | - | - | - |
| DOGS | 28.58 | 0.847 | 0.219 | 24.01 | 0.681 | 0.377 |
| Ours | 28.63 | 0.857 | 0.210 | 24.28 | 0.715 | 0.315 |
Q5: Rendering Speed Comparison on the Same Platform
Due to hardware constraints, we are unable to benchmark MixGS on A100 GPUs. Instead, we measure the rendering speed (FPS) of all methods on an NVIDIA RTX 3090 to ensure a fair comparison. The results are shown in the table below. We will include this direct comparison in the revised manuscript for greater clarity.
| Methods | GPU Type | Building | Rubble | Residence | Sci-Art |
|---|---|---|---|---|---|
| 3DGS | RTX 3090 | 62.8 | 79.4 | 64.5 | 98.7 |
| CityGaussian (w/ LoD) | RTX 3090 | 45.4 | 59.8 | 49.5 | 69.4 |
| CityGaussian (w/o LoD) | RTX 3090 | 30.7 | 48.1 | 36.3 | 60.6 |
| MixGS (Ours) | RTX 3090 | 33.5 | 42.5 | 36.6 | 53.9 |
Q6: Typo Error
Thank you for your careful review. We will perform a comprehensive proofread of the manuscript and correct all grammatical and spelling mistakes in the revised version.
Thanks for the rebuttal from the authors; the reviewer maintains the rating of borderline accept.
We are grateful for your insightful suggestions and glad that our rebuttal has addressed your concerns. We will update the paper with all the experiments and details mentioned.
This paper proposes MixGS, which aims to reconstruct large-scale scenes via 3D Gaussian Splatting. MixGS utilizes both a vanilla Gaussian representation and a neural network-based Gaussian representation to represent large-scale scenes. The vanilla Gaussians reconstruct the coarse information, and the neural Gaussians capture fine-grained details, enabling reconstruction of large-scale scenes with fewer Gaussian primitives.
Strengths and Weaknesses
Strengths:
- The proposed method MixGS achieves state-of-the-art rendering quality on Mill19 and UrbanScene3D datasets, especially for the PSNR and SSIM metrics.
- The idea to incorporate neural gaussians to capture details with vanilla gaussians is interesting. The neural gaussians are helpful to model view-dependent colors, which may be hard for vanilla gaussians due to the constraints of sh-degree.
- The paper is well-written and easy to follow.
Weaknesses:
- The method has a significant drawback in training time, as shown in Table 2 of the supplementary material. MixGS needs about 15 hours to optimize a large-scale scene, while previous 3D Gaussian Splatting-based methods need about 3 hours. Although the authors discuss this drawback in the limitation section of the supplementary material, I still consider the training time a weakness because training efficiency is a crucial issue for large-scale scene reconstruction.
- Although the authors claim that using neural Gaussians is to capture fine-grained details, there is a lack of deeper analysis about the actual effects. The method uses the view frustum to select the Gaussians to pass through the neural network and introduces camera poses as auxiliary information. This implementation introduces lots of view-dependent information to render the scenes. In a large-scale scene, the same scene often has very different lighting from different views. Therefore, it is confusing whether the improvement comes from a reasonable modeling of the scene or from introducing a representation of view-specific information. From Table 1, MixGS has better PSNR metrics, while the LPIPS metrics are not that significant. Modeling of view-specific information can lead to an improvement in PSNR, while LPIPS can well measure high-frequency information. And from Figure 3, I also find MixGS has an advantage in rendering view-dependent lighting effects, while the improvement in details is not significant.
Questions
My major concerns are the low training time efficiency and the confusing effects of the method. I will raise my score if the authors provide a convincing analysis of these two weaknesses in the rebuttal.
Limitations
yes
Final Justification
The proposed method achieves good global consistency when using 3DGS for large-scale rendering. However, the high LPIPS values make me uncertain whether the LPIPS gap is solely due to global consistency or also reflects differences in local detail reconstruction. Overall, I appreciate the benefits of the proposed method for global consistency and rate it as borderline accept.
Formatting Issues
No.
We thank the reviewer for the thorough evaluation and constructive feedback. We address each concern below:
Q1: Training Time Efficiency
We acknowledge the reviewer’s concern about training time. The longer training duration in our original results is primarily due to hardware limitations: while other methods (as shown in Table 2 of the supplementary material) were trained using five high-end GPUs (5 × RTX 6000 48GB), our experiments were conducted on a single RTX 3090 24GB. This significant difference in hardware configuration directly impacts training efficiency.
To provide a fairer comparison, we integrate MixGS into the open-source distributed training framework [1] and retrain our model using 5 × RTX 3090 GPUs. The training time comparison (hh:mm) is summarized below:
| Methods | Used GPU | Building | Rubble | Residence | Sci-Art |
|---|---|---|---|---|---|
| Mega-NeRF | 5× RTX6000 48GB | 19:49 | 30:48 | 27:20 | 27:39 |
| Switch-NeRF | 5× RTX6000 48GB | 24:46 | 38:30 | 35:11 | 34:34 |
| 3DGS | 1× RTX3090 24GB | 02:39 | 02:24 | 02:49 | 02:03 |
| VastGaussian† | 5× RTX6000 48GB | 03:26 | 02:30 | 03:12 | 02:33 |
| DOGS | 5× RTX6000 48GB | 03:51 | 02:25 | 04:33 | 04:23 |
| MixGS (Ours) | 1× RTX3090 24GB | 14:39 | 12:07 | 17:09 | 14:48 |
| MixGS (Ours) | 5× RTX3090 24GB | 03:48 | 03:02 | 04:28 | 04:13 |
With five GPUs, the training time of MixGS drops to a range comparable to or better than that of existing methods.
Furthermore, we confirm that distributed training preserves rendering quality. Below are the quantitative results comparing the original single-GPU model and our multi-GPU version. These results demonstrate that MixGS can scale effectively under distributed GPU training while preserving high rendering quality. We appreciate the reviewer’s suggestion and will include this discussion and additional results in the revised version.
| Methods | Building PSNR↑ | Building SSIM↑ | Building LPIPS↓ | Rubble PSNR↑ | Rubble SSIM↑ | Rubble LPIPS↓ | Residence PSNR↑ | Residence SSIM↑ | Residence LPIPS↓ | Sci-Art PSNR↑ | Sci-Art SSIM↑ | Sci-Art LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mega-NeRF | 20.92 | 0.547 | 0.454 | 24.06 | 0.553 | 0.508 | 22.08 | 0.628 | 0.401 | 25.60 | 0.770 | 0.312 |
| Switch-NeRF | 21.54 | 0.579 | 0.397 | 24.31 | 0.562 | 0.478 | 22.57 | 0.654 | 0.352 | 26.51 | 0.795 | 0.271 |
| GP-NeRF | 21.03 | 0.566 | 0.486 | 24.06 | 0.565 | 0.496 | 22.31 | 0.661 | 0.448 | 25.37 | 0.783 | 0.373 |
| 3DGS | 20.46 | 0.720 | 0.305 | 25.47 | 0.777 | 0.277 | 21.44 | 0.791 | 0.236 | 21.05 | 0.830 | 0.242 |
| VastGaussian | 21.80 | 0.728 | 0.225 | 25.20 | 0.742 | 0.264 | 21.01 | 0.699 | 0.261 | 22.64 | 0.761 | 0.261 |
| Hierarchy-GS | 21.52 | 0.723 | 0.297 | 24.64 | 0.755 | 0.284 | — | — | — | — | — | — |
| DOGS | 22.73 | 0.759 | 0.204 | 25.78 | 0.763 | 0.257 | 21.94 | 0.740 | 0.244 | 24.42 | 0.804 | 0.219 |
| CityGaussian | 21.67 | 0.764 | 0.262 | 24.90 | 0.785 | 0.256 | 21.90 | 0.805 | 0.217 | 21.34 | 0.833 | 0.232 |
| MixGS (x1 GPUs) | 23.03 | 0.771 | 0.261 | 26.66 | 0.792 | 0.267 | 23.39 | 0.815 | 0.219 | 24.20 | 0.856 | 0.220 |
| MixGS (x5 GPUs) | 22.95 | 0.781 | 0.253 | 26.63 | 0.807 | 0.253 | 23.38 | 0.824 | 0.216 | 24.13 | 0.853 | 0.222 |
[1] gsplat: An open-source library for Gaussian splatting. JMLR 2025.
Q2: Deep analysis of the proposed method
1. Differences Between PSNR and LPIPS
We appreciate the reviewer’s attention to metric consistency. However, it is critical to note that PSNR and LPIPS measure fundamentally distinct aspects of reconstruction quality, and their divergence does not inherently indicate a failure in detail recovery:
- PSNR quantifies pixel-level intensity/color error (MSE-based). A high PSNR confirms that models accurately match the ground truth in terms of low-level color/brightness fidelity, which is a prerequisite for detail recovery, as fine textures rely on precise pixel values.
- LPIPS measures perceptual similarity using pre-trained CNN features (e.g., VGG), which are sensitive to high-level structural cues (e.g., edge continuity, semantic consistency) but may over-penalize certain benign discrepancies (e.g., subtle lighting variations in large scenes that humans barely notice).
In large-scale scenes, LPIPS can be elevated by factors unrelated to "lack of details": for example, global illumination consistency (a strength of our holistic framework) may introduce minor local color adjustments to align lighting across the scene, which LPIPS sometimes flags despite preserving fine textures.
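For concreteness, a small sketch contrasting the two metrics, assuming the publicly available `lpips` package; the inputs below are random placeholder tensors in [0, 1]:

```python
import torch
import lpips  # pip install lpips

lpips_vgg = lpips.LPIPS(net="vgg")  # deep-feature (perceptual) distance

def psnr(pred, gt, max_val=1.0):
    """Pixel-level fidelity: a function of the mean squared error only."""
    mse = torch.mean((pred - gt) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def perceptual(pred, gt):
    """LPIPS compares deep VGG feature maps; it expects (N, 3, H, W) in [-1, 1]."""
    return lpips_vgg(pred * 2 - 1, gt * 2 - 1).mean()

pred, gt = torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256)
print(f"PSNR: {psnr(pred, gt).item():.2f} dB | LPIPS: {perceptual(pred, gt).item():.3f}")
```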
2. Fine Detail Recovery of MixGS
We directly demonstrate the capability of MixGS for detail recovery:
2.1 Qualitative Proof: According to the policy for NeurIPS 2025, we are unable to include additional links or figures. Please refer to Row 3 and Row 5 of Figure 1 in the supplementary materials. These clearly demonstrate that our method reconstructs richer details in the areas indicated by arrows, particularly for the rooftop tiles and blue building, despite exhibiting marginally lower LPIPS scores in certain scenes.
2.2 Quantitative Proof: In experiments on the Rubble scene where we removed auxiliary view-dependent information, i.e., camera poses and the rotation/scaling parameters of the Gaussian splats, the results presented in the table below demonstrate that reducing view-dependent information consistently degrades PSNR while simultaneously increasing LPIPS. Consequently, it would be inaccurate to simplistically conclude that incorporating more view-dependent information inherently deteriorates LPIPS performance.
| Methods | PSNR↑ | SSIM↑ | LPIPS↓ |
|---|---|---|---|
| Ours w/o auxiliary view-dependent information | 26.24 | 0.784 | 0.277 |
| Ours (full model) | 26.66 | 0.792 | 0.267 |
Additionally, we evaluate MixGS on the MatrixCity and Campus scenes. The MatrixCity dataset comprises 5,620 training views and 741 test views, while the Campus subset of UrbanScene3D includes 5,850 training images and 21 test images. Both datasets are recognized for their challenging scale and high-frequency details. As shown in the table below, on both MatrixCity and Campus, MixGS achieves the highest rendering quality across PSNR, SSIM, and LPIPS. We will include these quantitative results and corresponding qualitative comparisons in the revised manuscript.
| Method | MatrixCity PSNR ↑ | MatrixCity SSIM ↑ | MatrixCity LPIPS ↓ | Campus PSNR ↑ | Campus SSIM ↑ | Campus LPIPS ↓ |
|---|---|---|---|---|---|---|
| MegaNeRF | - | - | - | 23.42 | 0.537 | 0.636 |
| SwitchNeRF | - | - | - | 23.62 | 0.541 | 0.616 |
| 3DGS | 23.67 | 0.735 | 0.384 | 18.95 | 0.630 | 0.447 |
| VastGaussian | 28.33 | 0.835 | 0.220 | 23.82 | 0.695 | 0.329 |
| CityGaussian | 27.32 | 0.813 | 0.211 | - | - | - |
| DOGS | 28.58 | 0.847 | 0.219 | 24.01 | 0.681 | 0.377 |
| Ours | 28.63 | 0.857 | 0.210 | 24.28 | 0.715 | 0.315 |
3. Why Is LPIPS Relatively Higher?
MixGS prioritizes holistic consistency (critical for large scenes) rather than optimizing for LPIPS alone, so the slightly higher LPIPS does not reflect poor detail recovery. Methods like DOGS and CityGaussian use divide-and-conquer, which can create "pixel-perfect but lighting-inconsistent" local regions (lower LPIPS for small patches) but fail to align textures/lighting across block boundaries (e.g., the grassland shifts abruptly in row 4, column 2 of Fig. 1 in the supplementary). Our holistic framework avoids this by enforcing global consistency, which may introduce minor local color adjustments to align large structures, and these adjustments are what LPIPS occasionally penalizes.
4. Conclusion
The observed phenomenon of higher PSNR with slightly elevated LPIPS reflects a trade-off in our framework to maintain global consistency in large-scale scenes, rather than indicating any failure in detail recovery. Furthermore, our MixGS approach achieves reasonable scene modeling through an explicit coarse Gaussian training phase combined with implicit decoded Gaussians, while simultaneously incorporating view-specific representation information. This cooperative combination ultimately produces superior rendering results. We appreciate the reviewer's comments and will include additional discussion to eliminate confusing descriptions, along with more representative comparative visualizations in our revision.
Thank you very much for your time and thoughtful suggestions.
May I kindly ask whether the concerns you raised have been adequately addressed in our rebuttal?
We appreciate your valuable feedback and would be grateful to know if our response has helped clarify the key issues.
Thanks for the rebuttal. My concern regarding training efficiency has been addressed.
However, the reason behind the relatively high LPIPS values remains unclear to me, as also pointed out by Reviewer buD9. I am not fully convinced that maintaining global consistency necessarily requires a drop in LPIPS. While the rebuttal explains that minor local color adjustments can negatively affect LPIPS scores, I would appreciate a more detailed analysis or evidence supporting this trade-off. Also, why is LPIPS considered a “trivial” objective in this context, given its widespread use as a perceptual similarity metric?
Thanks for your time and reply. We address each concern below:
- Detailed analysis and supporting evidence:
We kindly refer the reviewer to Figure 1 (Row 5, Residence) in the supplementary material. As highlighted by the arrows, our method recovers finer structural details than CityGaussian, such as the blue building and the cars on the road, which are more aligned with human perceptual preference. However, the LPIPS scores present a counterintuitive result:
| Methods | PSNR↑ | LPIPS↓ |
|---|---|---|
| CityGaussian | 28.12 | 0.160 |
| Ours | 28.27 | 0.199 |
Despite achieving better detail restoration and a higher PSNR, our method yields a slightly worse LPIPS score. This behavior is due to the nature of LPIPS, which measures perceptual similarity by comparing deep features from pretrained networks (e.g., VGG) at spatially aligned locations. As described in previous studies [1][2][3][4], LPIPS can sometimes be overly sensitive to small spatial misalignments or local color adjustments, even when the perceptual quality, as judged by humans, is improved or unchanged.
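As a toy illustration of this sensitivity (not a result from the paper; any RGB test image can replace the random placeholder):

```python
import torch
import lpips  # pip install lpips

metric = lpips.LPIPS(net="vgg")
img = torch.rand(1, 3, 256, 256) * 2 - 1       # placeholder image in [-1, 1]
shifted = torch.roll(img, shifts=2, dims=-1)   # 2-pixel horizontal shift

# A tiny spatial shift is barely noticeable to a human observer, yet LPIPS
# (like MSE-based metrics) registers a non-zero penalty; the works cited above
# study exactly this behavior on natural images.
print("LPIPS(img, 2px-shifted img):", metric(img, shifted).item())
```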
Due to rebuttal policy limitations, we are unable to include more visual comparisons at this stage, but we will extend this analysis in the revised version.
[1] Shift-tolerant Perceptual Similarity Metric. ECCV 2022
[2] Locally Adaptive Structure and Texture Similarity for Image Quality Assessment. ACM MM 2021
[3] Attacking Perceptual Similarity Metrics. TMLR 2023
[4] Stochastic BIQA: Median Randomized Smoothing for Certified Blind Image Quality Assessment. CVIU 2025
- Clarification on our previous wording regarding LPIPS:
We would like to clarify that we did not intend to diminish the value of LPIPS, which remains one of the most widely adopted perceptual metrics in image reconstruction. Our goal was to emphasize that no single metric should be treated in isolation. LPIPS, PSNR, and SSIM offer complementary insights, measuring perceptual similarity, pixel-level fidelity, and structural similarity, respectively. For a complete evaluation, they should be considered together, along with visual comparisons.
To further compare the LPIPS metric intuitively, we compile the LPIPS scores across six benchmark scenes. As shown in the table below, our method achieves strong LPIPS performance across the board, with an average of 0.249. Notably, the average LPIPS of MixGS (0.235) is lower than that of CityGaussian (0.236) when excluding missing entries, confirming that our approach remains competitive in perceptual quality.
| Methods (LPIPS ↓) | Building | Rubble | Residence | Sci-Art | Campus | MatrixCity | Average |
|---|---|---|---|---|---|---|---|
| Mega-NeRF | 0.454 | 0.508 | 0.401 | 0.312 | 0.636 | - | - |
| Switch-NeRF | 0.397 | 0.478 | 0.352 | 0.271 | 0.616 | - | - |
| 3DGS | 0.305 | 0.277 | 0.236 | 0.242 | 0.447 | 0.384 | 0.315 |
| VastGaussian | 0.225 | 0.264 | 0.261 | 0.261 | 0.329 | 0.220 | 0.260 |
| DOGS | 0.204 | 0.257 | 0.244 | 0.219 | 0.377 | 0.219 | 0.253 |
| CityGaussian | 0.262 | 0.256 | 0.217 | 0.232 | - | 0.211 | - |
| MixGS (Ours) | 0.261 | 0.267 | 0.219 | 0.220 | 0.315 | 0.210 | 0.249 |
We appreciate the reviewer’s question and will include this clarification, statistical analysis, and additional visual comparisons in the revision. Please feel free to let us know if further elaboration would be helpful.
Thanks for referring me to Figure 1 (Row 5, Residence). The proposed method shows finer details in some areas (e.g., the blue building, cars), while the left part of the image appears blurrier than CityGaussian. This makes me uncertain whether the LPIPS gap is solely due to global consistency, or also reflects differences in local detail reconstruction.
Overall, I appreciate the benefits of the proposed method for global consistency, and I am raising my score to borderline accept. I also encourage the authors to include additional discussion on the LPIPS metric.
We are grateful for your insightful suggestions and glad that our rebuttal has addressed your concerns. We will update the paper with all the experiments and details mentioned.
This paper proposes a method that enhances image quality by adding one additional Gaussian splat per coarse Gaussian splat. To learn the parameter field of the decoded Gaussian splats, a multi-level hash encoding is used. The spatial feature from the hash encoding is then combined with the local features, including splat scale, orientation, and camera pose. The decoder finally generates additional Gaussian splats with spatial offsets. The quantitative and qualitative comparison demonstrates the effectiveness of the proposed method.
Strengths and Weaknesses
There are the following strong points.
The proposed method is promising as the additional Gaussian splats are jointly optimized to enhance the quality.
The hash-based neural parameter field doesn’t require a large memory size.
The ablation study thoroughly demonstrates the efficacy of each module.
I found the following weak points.
The decoded Gaussian splat could be far from a coarse Gaussian splat. This means that coarse Gaussians outside the frustum may be necessary to enhance image quality. This effect would be significant near frustum boundaries.
If coarse Gaussians are edited after optimization, the decoder part won’t produce the same image enhancement.
The network may produce unstable reconstructions at novel views, as it is not trained on dense camera views. This is crucial when we synthesize a video with a moving camera.
Questions
- For the given view frustum, do we need to expand it to maximize image quality? The expanded frustum includes coarse Gaussians whose decoded Gaussians enhance the quality near the image boundary.
- Does the neural decoder consistently produce a smooth image result when moving the viewpoint?
- Do you have an idea about how to address the post-processing of coarse Gaussians?
Limitations
Yes.
Final Justification
The rebuttal resolves my concerns. I think generating auxiliary Gaussians by MLP is a novel idea. I am leaning towards the positive for the publication of this paper.
Formatting Issues
None.
We thank the reviewer for the thorough evaluation and constructive feedback. We address each concern below:
Q1: Position of Decoded Gaussians
A1: We agree that decoded Gaussians may shift significantly relative to their coarse counterparts, because our offset pool tends to move primitives toward high-frequency, texture-rich regions within the current view frustum (e.g., the bush and rooftop-tile areas in Fig. 4) to improve local rendering quality.
Crucially, we observe through experiments that the decoded Gaussians remain strictly inside the original view frustum. This behavior emerges naturally during training: any decoded Gaussian shifting outside the view frustum would not project onto the current image plane (by definition of the frustum), thus contributing nothing to the rendered pixel values. Consequently, such Gaussians receive zero loss gradient during backpropagation, and the network naturally learns to restrict decoded Gaussians to regions that affect the current view. For the current camera pose, coarse Gaussians outside the frustum are irrelevant to rendering, so their exclusion does not sacrifice information needed for the current view. In other words, decoded Gaussians can roam freely within the view frustum to capture scene information, but they will not wander outside the frustum, since such movement would not improve (and indeed would harm) the training objective.
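For reference, a minimal sketch of the frustum test implied above (a standard pinhole projection; the matrix conventions are assumptions, not the authors' code):

```python
import numpy as np

def in_view_frustum(xyz, w2c, K, width, height, near=0.01):
    """xyz: (N, 3) Gaussian centers in world space; w2c: 4x4 world-to-camera;
    K: 3x3 intrinsics. Returns a boolean mask of centers that project inside
    the image and lie in front of the camera."""
    cam = (w2c[:3, :3] @ xyz.T + w2c[:3, 3:4]).T           # (N, 3) camera-space points
    in_front = cam[:, 2] > near
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], near, None)       # perspective divide
    in_img = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return in_front & in_img

# A decoded Gaussian whose center fails this test contributes no pixels to the
# current view, hence receives no gradient, which is why learned offsets stay in-frustum.
```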
Q2: Should we expand the view frustum?
A2: Expanding the view frustum brings more coarse Gaussians into the visible set and thus yields additional decoded Gaussians for rendering the same camera view. In other words, a larger frustum promotes more primitives contributing to the image and can improve quality.
As suggested by the reviewer, we expand the frustum by scaling the camera’s translation by a factor of 1.05, simulating raising a drone's altitude to capture the scene. Note that during the rendering process, all the camera poses remain unchanged, so the rendered images still correspond to the same ground truth. The table below compares standard MixGS to the expanded view frustum (EV) on four benchmark scenes. The expanded view frustum consistently improves reconstruction quality on the Rubble and Residence scenes, demonstrating the benefit of including more Gaussians in the frustum. Due to time constraints, we could not explore this idea more deeply. We appreciate the reviewer’s insightful suggestion, which is an interesting direction for our future research.
| Methods | Building PSNR↑ | Building SSIM↑ | Building LPIPS↓ | Rubble PSNR↑ | Rubble SSIM↑ | Rubble LPIPS↓ | Residence PSNR↑ | Residence SSIM↑ | Residence LPIPS↓ | Sci-Art PSNR↑ | Sci-Art SSIM↑ | Sci-Art LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mega-NeRF | 20.92 | 0.547 | 0.454 | 24.06 | 0.553 | 0.508 | 22.08 | 0.628 | 0.401 | 25.60 | 0.770 | 0.312 |
| Switch-NeRF | 21.54 | 0.579 | 0.397 | 24.31 | 0.562 | 0.478 | 22.57 | 0.654 | 0.352 | 26.51 | 0.795 | 0.271 |
| GP-NeRF | 21.03 | 0.566 | 0.486 | 24.06 | 0.565 | 0.496 | 22.31 | 0.661 | 0.448 | 25.37 | 0.783 | 0.373 |
| 3DGS | 20.46 | 0.720 | 0.305 | 25.47 | 0.777 | 0.277 | 21.44 | 0.791 | 0.236 | 21.05 | 0.830 | 0.242 |
| VastGaussian | 21.80 | 0.728 | 0.225 | 25.20 | 0.742 | 0.264 | 21.01 | 0.699 | 0.261 | 22.64 | 0.761 | 0.261 |
| Hierarchy-GS | 21.52 | 0.723 | 0.297 | 24.64 | 0.755 | 0.284 | — | — | — | — | — | — |
| DOGS | 22.73 | 0.759 | 0.204 | 25.78 | 0.763 | 0.257 | 21.94 | 0.740 | 0.244 | 24.42 | 0.804 | 0.219 |
| CityGaussian | 21.67 | 0.764 | 0.262 | 24.90 | 0.785 | 0.256 | 21.90 | 0.805 | 0.217 | 21.34 | 0.833 | 0.232 |
| MixGS | 23.03 | 0.771 | 0.261 | 26.66 | 0.792 | 0.267 | 23.39 | 0.815 | 0.219 | 24.20 | 0.856 | 0.220 |
| MixGS + EV | 22.93 | 0.783 | 0.250 | 26.76 | 0.796 | 0.263 | 23.47 | 0.825 | 0.215 | 24.24 | 0.851 | 0.225 |
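For clarity on the MixGS + EV rows above, a sketch of the expansion protocol under the stated assumption that camera centers are scaled only for the culling step while the rendering pose is untouched (the matrix convention is an assumption):

```python
import numpy as np

def expanded_pose_for_culling(c2w: np.ndarray, scale: float = 1.05) -> np.ndarray:
    """Return a copy of a 4x4 camera-to-world matrix with its translation scaled,
    moving the virtual camera farther from the scene origin so that more coarse
    Gaussians pass the frustum test."""
    expanded = c2w.copy()
    expanded[:3, 3] *= scale
    return expanded

# The scaled pose decides only which coarse Gaussians count as "visible";
# rendering still uses the original c2w, so the rendered images correspond to
# the same ground-truth views as before.
```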
Q3: Post-Processing of Gaussians
A3: If the coarse Gaussians are edited, such as being repositioned or having their appearance changed, our MixGS framework can adapt efficiently. In cases of moderate changes, the decoder can simply relearn the offsets without requiring retraining from scratch. Since the decoder is conditioned on the updated coarse Gaussians, it can reassign detail representations to match the new configuration.
For more substantial edits, we can freeze the coarse Gaussians and fine-tune only the decoder in the affected regions. This localized adaptation enables MixGS to flexibly handle post-processed scenes without globally re-optimizing the full representation, making it particularly suitable for downstream editing tasks.
Q4: Does the method produce smooth results with a moving camera?
A4: On the one hand, all the quantitative results reported in our paper are evaluated on the official test set, which contains viewpoints unseen during training. Despite this, our method consistently achieves high rendering quality, demonstrating its generalization ability.
On the other hand, due to NeurIPS policy, we are unable to provide an anonymous link to show qualitative videos rendered under a densely sampled, unseen camera trajectory.
To further validate the smoothness of our method under continuous camera motion, we conduct experiments on the large‐scale MatrixCity dataset [1]. MatrixCity comprises 5,620 training images and 741 test images, with the test set designed to provide densely sampled, consecutive viewpoints. We evaluated our approach alongside several baselines and report the results in the table below. It can be observed that our method achieves the highest results across all metrics, demonstrating its robustness in producing smooth, high‐quality renderings as the camera moves continuously. We appreciate the reviewer’s comments and will integrate video results into our revision.
| Method (MatrixCity) | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| MegaNeRF | - | - | - |
| SwitchNeRF | - | - | - |
| 3DGS | 23.67 | 0.735 | 0.384 |
| VastGaussian | 28.33 | 0.835 | 0.220 |
| CityGaussian | 27.32 | 0.813 | 0.211 |
| DOGS | 28.58 | 0.847 | 0.219 |
| Ours | 28.63 | 0.857 | 0.210 |
[1] Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. ICCV 2023.
Q2: Should we expand the view frustum?
A2: Expanding the view frustum brings more coarse Gaussians into the visible set and thus yields additional decoded Gaussians for rendering the same camera view. In other words, a larger frustum promotes more primitives contributing to the image and can improve quality.
Comment: Thank you for the experiment.
Q4: Does the method produce smooth results with a moving camera?
Comment: Lines between 187 and 189 evoke my concern about view consistency. It would be great to discuss how the rendering changes when each camera parameter, such as rotation and translation, changes.
Thank you for this thoughtful suggestion and reply. Since the rebuttal format does not allow visual results to show the impact of camera rotation (R) and translation (T), we design a quantitative strategy to reflect these variations.
Specifically, from the MatrixCity test set, we selected 195 test images with smooth and continuous changes in translation (T) while keeping rotation (R) relatively stable, and 58 images with smooth rotation variation while maintaining minimal translation changes. The following table reports the rendering performance on these subsets.
| Method | MatrixCity-R PSNR ↑ | MatrixCity-R SSIM ↑ | MatrixCity-R LPIPS ↓ | MatrixCity-T PSNR ↑ | MatrixCity-T SSIM ↑ | MatrixCity-T LPIPS ↓ |
|---|---|---|---|---|---|---|
| 3DGS | 23.67 | 0.737 | 0.380 | 23.58 | 0.739 | 0.380 |
| CityGaussian | 27.28 | 0.809 | 0.218 | 27.30 | 0.818 | 0.213 |
| DOGS | 28.43 | 0.839 | 0.222 | 28.52 | 0.840 | 0.223 |
| Ours | 28.57 | 0.850 | 0.214 | 28.62 | 0.855 | 0.212 |
As shown, our method demonstrates strong robustness to both camera rotation and translation changes, confirming view-consistent rendering behavior under varying camera parameters.
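For completeness, a sketch of one way consecutive test views could be split by dominant motion, as described above (the pose convention and thresholds are assumptions for illustration):

```python
import numpy as np

def relative_motion(c2w_a, c2w_b):
    """Rotation angle (degrees) and translation distance between two 4x4
    camera-to-world poses."""
    r_rel = c2w_a[:3, :3].T @ c2w_b[:3, :3]
    cos = np.clip((np.trace(r_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos)), np.linalg.norm(c2w_b[:3, 3] - c2w_a[:3, 3])

def split_by_dominant_motion(poses, rot_thresh_deg=2.0, trans_thresh=0.5):
    """Classify consecutive view pairs as rotation-dominant (R subset) or
    translation-dominant (T subset). Thresholds are illustrative."""
    rot_ids, trans_ids = [], []
    for i in range(len(poses) - 1):
        angle, dist = relative_motion(poses[i], poses[i + 1])
        if angle > rot_thresh_deg and dist < trans_thresh:
            rot_ids.append(i)
        elif dist >= trans_thresh and angle <= rot_thresh_deg:
            trans_ids.append(i)
    return rot_ids, trans_ids
```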
Thanks for your reply and support of our work. Please don't hesitate to let us know if you require any further clarifications.
This paper presents a new paradigm for large-scale scene reconstruction, which combines coarse explicit Gaussians and detailed neural Gaussians, showing improved rendering quality on the selected datasets.
Strengths and Weaknesses
Quality: There is some overclaiming, and the comparison is not sufficiently fair or extensive, weakening the solidity of the paper.
Clarity: Good writing, the paragraph and figures are easy to read.
Significance: The overall pipeline reveals a new paradigm for large-scale GS training.
Originality: The coarse-to-fine multi-stage training and neural Gaussians are existing techniques.
Questions
- The authors claimed discontinuity of global structure, illumination, and geometry across block boundaries as one of the main drawbacks of previous works. However, there is no qualitative comparison or quantitative analysis on boundary areas. Indeed, existing works such as VastGaussian and CityGaussianV2 show no clear evidence of texture or geometry discontinuity at the boundaries. Considering that, this part is to some extent overclaimed.
- The datasets used in the experiments are limited. Recent works like CityGaussian and DoGS compared on at least five datasets. The authors should supplement more results on other challenging cases, like the Campus scene of the UrbanScene3D dataset or the MatrixCity dataset.
- Excluding the appearance embedding for VastGaussian in the comparison doesn't seem to be fair. The consideration of appearance variations is an advantage and part of the model structure. The authors should include it and make the comparison fairer.
Limitations
The limitation is briefly discussed in the appendix. But a significant one is ignored. Specifically, the number of decoded Gaussians is the same as that of coarse Gaussians within the frustum. However, for some areas with detailed structure and high-frequency textures, it would be difficult for the model to avoid under-reconstruction with limited Gaussians, restricting its rendering quality and application. The authors should provide more explanation on this potential problem and its solutions.
Final Justification
My main concerns have been resolved. Therefore, I changed my score to the borderline accept.
Formatting Issues
Not any.
We thank the reviewer for the thorough evaluation and constructive feedback. We address each concern below:
Q1: Inconsistencies at Block Boundaries in Divide‑and‑Conquer Methods
A1: Existing large-scale scene reconstruction methods employ a divide-and-conquer strategy, partitioning the scene into independent blocks for multi-GPU training. Since each block is optimized in isolation, multi-view constraints do not span block boundaries, which can lead to visible artifacts or texture misalignment at those boundaries. Methods such as CityGaussian, CityGaussianV2, and VastGaussian attempt to mitigate these artifacts by increasing the overlap between adjacent blocks. The more recent DOGS framework further addresses boundary inconsistencies through a dedicated “consensus step.” The following statement from the DOGS paper acknowledges that boundary inconsistencies persist in prior methods:
“Our method presents higher-fidelity rendering results than VastGaussian near the splitting boundary, which validates the effectiveness and importance of the consensus step” (DOGS, page 17).
Because the independent optimization of blocks breaks multi-view constraints across boundaries, such post-processing cannot fundamentally resolve the boundary inconsistency.
Regarding qualitative comparisons, NeurIPS policy prevents us from including boundary-issue visualizations in this rebuttal; please refer instead to DOGS’s Figures 8 and 11, which illustrate these boundary effects.
To provide a quantitative comparison, we follow DOGS’s partitioning strategy and select 300 viewpoints across block boundaries in both the Rubble and Building scenes. We render each viewpoint using the DOGS model and our MixGS method, then compute PSNR, SSIM, and LPIPS for both, as reported in the table below. Despite DOGS’s consensus refinement, MixGS consistently yields higher quality at block interfaces, effectively eliminating inconsistencies at the source rather than relying on post-processing.
| Methods | Rubble PSNR ↑ | Rubble SSIM ↑ | Rubble LPIPS ↓ | Building PSNR ↑ | Building SSIM ↑ | Building LPIPS ↓ |
|---|---|---|---|---|---|---|
| DOGS | 27.93 | 0.858 | 0.209 | 25.23 | 0.848 | 0.215 |
| Ours | 29.18 | 0.865 | 0.187 | 26.40 | 0.859 | 0.201 |
Compared with other methods, MixGS optimizes the entire scene as a single entity: all Gaussians are updated under global multi-view constraints, and no block partitioning is performed. This holistic design eliminates boundary inconsistencies at the source, rather than relying on post-processing alignment. We appreciate the reviewer’s suggestion: in the revised manuscript, we will include a clearer discussion of block‑boundary issues and provide illustrative visualizations in the supplementary material.
[1] Dogs: Distributed-oriented gaussian splatting for large-scale 3d reconstruction via gaussian consensus. NeurIPS 2024.
Q2: Evaluation on Additional Datasets
A2: As suggested, we have evaluated MixGS on the MatrixCity and Campus scenes. The MatrixCity dataset comprises 5,620 training views and 741 test views, while the Campus subset of UrbanScene3D includes 5,850 training images and 21 test images. Both datasets are recognized for their challenging scale and high-frequency details.
| Method | MatrixCity PSNR ↑ | MatrixCity SSIM ↑ | MatrixCity LPIPS ↓ | Campus PSNR ↑ | Campus SSIM ↑ | Campus LPIPS ↓ |
|---|---|---|---|---|---|---|
| MegaNeRF | - | - | - | 23.42 | 0.537 | 0.636 |
| SwitchNeRF | - | - | - | 23.62 | 0.541 | 0.616 |
| 3DGS | 23.67 | 0.735 | 0.384 | 18.95 | 0.630 | 0.447 |
| VastGaussian | 28.33 | 0.835 | 0.220 | 23.82 | 0.695 | 0.329 |
| CityGaussian | 27.32 | 0.813 | 0.211 | - | - | - |
| DOGS | 28.58 | 0.847 | 0.219 | 24.01 | 0.681 | 0.377 |
| Ours | 28.63 | 0.857 | 0.210 | 24.28 | 0.715 | 0.315 |
On both MatrixCity and Campus, MixGS achieves the highest rendering quality across PSNR, SSIM, and LPIPS, demonstrating its robustness across diverse large-scale scenes. We will include these quantitative results and corresponding qualitative comparisons in the revised manuscript. Thanks.
Q3: Fair Comparison Setup Regarding Appearance Embedding
A3: The Appearance Embedding (AE) module used in VastGaussian introduces a lightweight CNN that takes rendered images from the 3D Gaussian pipeline as input and learns to compensate for the discrepancy between these renderings and the ground-truth images. This component serves as an image-level post-processing refinement that operates on the rendered output, rather than directly enhancing the underlying 3D Gaussian representation. Therefore, in our main manuscript, we followed the same evaluation protocol as DOGS by disabling the AE module to ensure a fair comparison of the underlying 3D representation and rendering quality.
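For context, a schematic sketch of an image-level appearance-refinement module of the kind described above (architecture, dimensions, and names are illustrative assumptions, not VastGaussian's released code):

```python
import torch
import torch.nn as nn

class AppearanceRefiner(nn.Module):
    """Image-level post-processing: takes a rendered image plus a per-image
    appearance embedding and predicts a correction, leaving the underlying
    3D Gaussians untouched."""
    def __init__(self, embed_dim=32, hidden=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + embed_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),
        )

    def forward(self, rendered, embedding):
        n, _, h, w = rendered.shape
        e = embedding.view(n, -1, 1, 1).expand(-1, -1, h, w)  # broadcast embedding spatially
        return (rendered + self.net(torch.cat([rendered, e], dim=1))).clamp(0, 1)

# Trained against ground-truth photos, so it absorbs per-image exposure/lighting
# differences as a 2D refinement rather than improving the 3D representation itself.
```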
As suggested by the reviewer, we also implement and test the AE module for further comparisons. The complete results are shown in the table below:
| Methods | Building PSNR↑ | Building SSIM↑ | Building LPIPS↓ | Rubble PSNR↑ | Rubble SSIM↑ | Rubble LPIPS↓ | Residence PSNR↑ | Residence SSIM↑ | Residence LPIPS↓ | Sci-Art PSNR↑ | Sci-Art SSIM↑ | Sci-Art LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mega-NeRF | 20.92 | 0.547 | 0.454 | 24.06 | 0.553 | 0.508 | 22.08 | 0.628 | 0.401 | 25.60 | 0.770 | 0.312 |
| Switch-NeRF | 21.54 | 0.579 | 0.397 | 24.31 | 0.562 | 0.478 | 22.57 | 0.654 | 0.352 | 26.51 | 0.795 | 0.271 |
| GP-NeRF | 21.03 | 0.566 | 0.486 | 24.06 | 0.565 | 0.496 | 22.31 | 0.661 | 0.448 | 25.37 | 0.783 | 0.373 |
| 3DGS | 20.46 | 0.720 | 0.305 | 25.47 | 0.777 | 0.277 | 21.44 | 0.791 | 0.236 | 21.05 | 0.830 | 0.242 |
| VastGaussian | 21.80 | 0.728 | 0.225 | 25.20 | 0.742 | 0.264 | 21.01 | 0.699 | 0.261 | 22.64 | 0.761 | 0.261 |
| VastGaussian+AE | 22.34 | 0.747 | 0.220 | 25.42 | 0.755 | 0.261 | 21.84 | 0.711 | 0.251 | 23.04 | 0.797 | 0.249 |
| Hierarchy-GS | 21.52 | 0.723 | 0.297 | 24.64 | 0.755 | 0.284 | — | — | — | — | — | — |
| DOGS | 22.73 | 0.759 | 0.204 | 25.78 | 0.763 | 0.257 | 21.94 | 0.740 | 0.244 | 24.42 | 0.804 | 0.219 |
| CityGaussian | 21.67 | 0.764 | 0.262 | 24.90 | 0.785 | 0.256 | 21.90 | 0.805 | 0.217 | 21.34 | 0.833 | 0.232 |
| MixGS (Ours) | 23.03 | 0.771 | 0.261 | 26.66 | 0.792 | 0.267 | 23.39 | 0.815 | 0.219 | 24.20 | 0.856 | 0.220 |
These results show that even with AE enabled, MixGS maintains superior performance across all metrics, confirming that its advantage stems from the holistic 3D modeling framework rather than the absence of post-processing.
Q4: Limited Number of Gaussians in Highly Textured Regions
A4: To ensure that richly textured areas are faithfully reconstructed, MixGS employs two complementary mechanisms.
Adaptive Densification: On the one hand, in the coarse training stage, Gaussians are duplicated and split in regions with high-frequency textures and detailed structure (as in 3DGS [1]). As a result, areas with complex details are initially covered by a denser set of coarse Gaussians. This seeding provides a strong prior for the decoded Gaussians to refine fine details.
Offset-Driven Refinement: On the other hand, benefiting from the offset pool, the decoded Gaussians move toward under-reconstructed and high-frequency zones during training, instead of remaining near the corresponding coarse Gaussians. We show the center locations of the coarse and decoded Gaussians in Figure 4 of the manuscript. It can be observed that the decoded Gaussians shift toward more complex areas (e.g., the bushes and rooftop tiles in Fig. 4), further increasing the Gaussian density exactly where it is needed most.
Moreover, MixGS can generate multiple decoded Gaussians per coarse Gaussian by simply modifying the decoder's output dimension. We evaluated different ratios K (i.e., the number of decoded Gaussians generated per coarse Gaussian) on the Rubble scene:
| Ratio K | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Peak VRAM |
|---|---|---|---|---|
| 1 | 26.66 | 0.792 | 0.267 | 20.45 GB |
| 2 | 26.76 | 0.796 | 0.263 | 22.91 GB |
| 3 | 27.02 | 0.801 | 0.255 | 25.41 GB |
Higher ratios improve quality (PSNR reaches 27.02 dB at 3:1) but increase memory usage. We use the 1:1 setting for optimal balance between performance and memory efficiency on a single 24GB GPU.
[1] 3D Gaussian Splatting for Real-Time Radiance Field Rendering. TOG 2023.
Thanks for the rebuttal. Most of my concerns are addressed. I would consider raising the score to borderline accept.
We are very grateful for your insightful suggestions and support of our work. We are happy to see that our rebuttal has addressed most of your concerns.
After rebuttal, all reviewers lean to accept the paper. The AC thereby recommends acceptance of the paper. The authors are required to include the new materials and results during the rebuttal into the final version, e.g., the discussion on the LPIPS metric.