Optimized Minimal 3D Gaussian Splatting
Abstract
Reviews and Discussion
Overall, this submission focuses on the problem of 3D Gaussian Splatting compression. Specifically, it focuses on compressing Gaussian primitive attributes given a minimal set of Gaussian primitives.
Strengths and Weaknesses
Strengths:
- This submission is in general easy to follow.
- This submission achieves relatively good experimental results.
Weaknesses:
(See the question section for more details)
Questions
Overall, the reviewer is currently around the borderline level for this submission. Below are the current concerns.
- Firstly, before Section 3.1, besides background on Gaussian Splatting, the authors are also encouraged to briefly review how existing methods leverage neural fields. From the reviewer's perspective, this can help reduce the reading difficulty of this submission.
- In lines 207-211, the authors claim that they perform SVQ separately for the different geometric attributes but jointly for the appearance features. The authors are expected to better elaborate on this part, i.e., explain the reason behind this difference.
- The reviewer is a bit confused by the claim in lines 232-233. Specifically, since the final rendering is achieved through the blending (aggregation) of several Gaussian kernels, the reviewer is confused as to why the existence of multiple Gaussians with near-identical contributions can be redundant. It seems to the reviewer that even if they have near-identical contributions, removing one (or several) of them does not guarantee that their aggregated contribution is preserved. The reviewer would thus appreciate a more detailed explanation of this part.
- Lastly, if the reviewer is not wrong, the authors seem to build their model on Mini-Splatting (see line 246). While this is fine, the reviewer would also appreciate applying the method to other baselines to better demonstrate its generalizability and enhance the experiments' comprehensiveness. Note that, from the reviewer's perspective, this can be especially important considering that the authors claim in the limitation section of the Appendix that OMG could provide a promising direction when combined with an effective large-scene representation. In this case, the reviewer believes that the authors at least need to show the generalization ability of OMG beyond a single baseline.
Given the above concerns, the reviewer currently is at the borderline level for this submission.
Limitations
The submission has properly discussed its limitations, although the reviewer believes more may need to be done to better show its potential in addressing them (see the last question above).
Final Justification
After reading the rebuttal, I believe that my concerns are largely addressed. I thus raise my score from 3 to 4.
Formatting Issues
N.A.
We sincerely appreciate the reviewers’ thoughtful comments and efforts in reviewing our manuscript. In the following, we address each comment individually, providing detailed explanations and clarifications as required.
Significance of our contributions
Our method achieves a 40-50% reduction in storage compared to the current state of the art, LocoGS (to put this into perspective, for a large-scale, ultra-high-resolution 3D digital twin of a city, this would mean going from 1 TB to 500 GB). To further contextualize this gain: in the field of video coding, each major standard over the past two decades, such as H.264 (2003), H.265 (2013), and H.266 (2020), has achieved approximately a 30–50% reduction per generation, typically at the cost of significantly increased computational complexity. Put differently, our method achieves a decade's worth of progress without sacrificing other important factors, including rendering quality, rendering speed, and training time. We would also like to highlight our training efficiency (LocoGS: 1 hour vs. ours: 20 mins). Even when compared to Mini-Splatting, a method not optimized for compression, we introduce only a 1-minute training time overhead. While our method has only been tested on standard benchmarks (Mip-NeRF 360, Tanks and Temples, and the Deep Blending dataset), we believe the improvements demonstrated on these datasets remain highly relevant and reflect the current interests of the research community. Last but not least, as the source code is already provided in the supplementary materials, the code and model weights will be made publicly available.
[Q4] Generalization ability
We have applied OMG to 3DGS-MCMC [1], a method well-known for effective densification. The following result highlights the broader applicability of OMG.
| Mip-NeRF | #G | PSNR | SSIM | LPIPS | Size |
|---|---|---|---|---|---|
| 3DGS-MCMC | 500 K | 27.42 | 0.807 | 0.248 | 115 MB |
| MCMC+OMG | 500 K | 27.21 | 0.797 | 0.256 | 5.1 MB |
| 3DGS-MCMC | 1 M | 27.83 | 0.823 | 0.221 | 230 MB |
| MCMC+OMG | 1 M | 27.63 | 0.813 | 0.227 | 10.0 MB |
Furthermore, OMG leverages per-Gaussian features to represent the irregularity of Gaussians, making it inherently scalable and also well-suited for complex environments. To demonstrate this, we tuned the original 3DGS-MCMC for a large-scale scene, the NYC scene from the Zip-NeRF [2] dataset, and applied OMG. Our method achieves superior performance while drastically reducing storage requirements. We believe large-scale scenes present a compelling challenge to other compression approaches that rely on compressibility and locality, and OMG's scalability opens new possibilities for efficient modeling of extremely large scenes.
| NYC | #G | PSNR | SSIM | LPIPS | Size |
|---|---|---|---|---|---|
| Zip-NeRF | - | 28.42 | 0.850 | 0.281 | 607 MB |
| 3DGS-MCMC | 3 M | 27.77 | 0.857 | 0.295 | 675 MB |
| MCMC+OMG | 3 M | 27.82 | 0.853 | 0.291 | 29 MB |
[Q3] Clarification of L. 232-233
Lines 232-233 do not suggest that "the existence of multiple Gaussians with near-identical contributions is inherently redundant." Rather, the intended point is that redundant Gaussians may exist, but cannot be effectively pruned if their contribution-based importance scores all exceed the pruning threshold. This issue often arises because Gaussians in close proximity tend to exhibit similar blending contributions. To address this, we introduce LD scoring, which differentiates their importance by assessing the distinctiveness of their appearance—increasing the importance of Gaussians with distinct appearance and decreasing it for those that are less distinctive. This enables the pruning of less distinctive Gaussians, whose contributions can be compensated by the spatial extension of nearby Gaussians during training. The effectiveness of this approach is demonstrated in Table 4 and Figure 5.
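To make this mechanism concrete, below is a minimal, hypothetical sketch (our own illustrative simplification, not the exact formulation in the paper) of how an appearance-distinctiveness term can rescale a contribution-based importance score before threshold-based pruning; the combination rule and normalization here are assumptions:

```python
import torch

def ld_adjusted_importance(importance, appearance, neighbor_idx, lam=2.0):
    # importance:   (N,) contribution-based importance per Gaussian
    # appearance:   (N, D) static appearance features (T in the paper)
    # neighbor_idx: (N, K) indices of (approximate) nearest neighbors
    # lam:          weighting hyperparameter (lambda in the paper)

    # Mean L1 distance between each Gaussian's appearance feature and its
    # K neighbors': large values mark distinctive Gaussians, small values
    # mark Gaussians whose appearance is locally redundant.
    diff = appearance.unsqueeze(1) - appearance[neighbor_idx]  # (N, K, D)
    distinctiveness = diff.abs().mean(dim=(1, 2))              # (N,)

    # Rescale the base importance: distinctive Gaussians gain importance,
    # less distinctive ones lose it (assumed multiplicative combination).
    return importance * (1.0 + lam * distinctiveness)
```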
[Q2] SVQ configuration
OMG requires four types of vectors: a 3-dimensional scale vector, a 4-dimensional rotation vector, and two 3-dimensional appearance vectors. As demonstrated in Table 5, although these vectors are already short, applying naive vector quantization still demands significant computation due to the need for large codebooks. Through empirical analysis, we found that using 2-dimensional sub-vectors offers a favorable trade-off, avoiding excessive computational overhead while achieving high performance. Accordingly, we apply 2-dimensional SVQ to the rotation vector and the unified appearance feature, while the 3-dimensional scale vector, due to its odd dimensionality, is quantized using scalar quantization.
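For illustration, here is a minimal sketch of the sub-vector quantization step under the assumptions above (2-dimensional sub-vectors, one small codebook per sub-vector); the codebook sizes and function names are hypothetical, not the released implementation:

```python
import torch

def sub_vector_quantize(x, codebooks):
    # x:         (N, D) attribute vectors with D even (e.g. 4-dim rotations)
    # codebooks: list of D//2 codebooks, each of shape (C, 2)
    N, D = x.shape
    subs = x.view(N, D // 2, 2)             # split into 2-dim sub-vectors
    quantized, indices = [], []
    for m, cb in enumerate(codebooks):      # independent VQ per sub-vector
        dist = torch.cdist(subs[:, m], cb)  # (N, C) distances to codewords
        idx = dist.argmin(dim=1)            # nearest codeword per sub-vector
        indices.append(idx)
        quantized.append(cb[idx])
    return (torch.stack(quantized, dim=1).view(N, D),
            torch.stack(indices, dim=1))    # (N, D), (N, D//2)
```

Because each codebook only has to cover a 2-dimensional space, it can stay small, which is the source of the computational savings over naive full-vector VQ.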
[Q1] Existing methods leveraging neural fields
Although we have introduced existing methods that leverage neural fields in the related work section (Lines 139–146), we will provide a more detailed technical discussion in Section 3.0.
[1] Kheradmand, S., et al. 3d gaussian splatting as markov chain monte carlo. Advances in Neural Information Processing Systems, 2024.
[2] Barron, J. T., et al. Zip-nerf: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
Dear Reviewer MiKC,
Once more, we genuinely express our gratitude for your insightful feedback on our manuscript. We gently remind you that the discussion phase will conclude in a couple of days. We believe we have effectively addressed your inquiries, concerns, and recommendations through the outcomes of our supplementary experiments. We are pleased to update our manuscript accordingly.
Should you have any additional concerns, queries, or suggestions, please feel free to reach out to us.
Best regards,
The Authors
Hi authors,
I am sorry; previously, I thought you could see the final justification and I thus left my feedback on your rebuttal there.
As mentioned there, after reading the rebuttal, I believe that my concerns are largely addressed. I thus raise my score from 3 to 4.
Thank you for your response and for raising your evaluation toward acceptance. We are pleased that we were able to address your concerns and will ensure that the final version is updated accordingly.
Once again, we sincerely appreciate the time and effort you dedicated to reviewing our work.
Best regards,
The authors
This paper introduces Optimized Minimal Gaussians (OMG), a framework for efficient 3D scene representation using significantly fewer Gaussian primitives. It innovates in three key areas: (1) a local distinctiveness-based importance metric for selecting informative Gaussians, (2) a compact and hybrid attribute representation combining per-Gaussian features and spatial-aware neural fields, and (3) sub-vector quantization (SVQ) to compress attributes with low codebook complexity. Experimental results across multiple benchmarks demonstrate strong compression performance (up to 50% smaller models) while preserving rendering quality and enabling real-time rendering at over 600 FPS.
Strengths and Weaknesses
Strengths:
- Novel contribution in Gaussian pruning: The introduction of a local distinctiveness metric addresses a practical challenge in sparse Gaussian selection, improving pruning efficiency.
- Efficient representation via SVQ: The use of sub-vector quantization is well-motivated and achieves a good trade-off between compression and computational cost.
- Practical utility: The system supports real-time rendering at >600 FPS with minimal memory usage, making it suitable for deployment on edge devices.
Weaknesses:
- Lack of theoretical analysis: The paper introduces important ideas (e.g., distinctiveness scoring, SVQ), but lacks formal analysis or theoretical guarantees about convergence, generalization, or approximation.
- Limited novelty in architecture: While effective, parts of the pipeline (e.g., feature fusion with MLPs) resemble prior works, especially Mini-Splatting and Compact-3DGS, and the innovation is largely incremental.
- No generalization test beyond rendering: The method is tightly scoped to rendering tasks; it would be stronger if generalized to other downstream 3D vision applications.
Questions
- How sensitive is the distinctiveness score to the choice of λ or the number of nearest neighbors K? Is there a principled way to set these values?
- How does SVQ compare to product quantization or learned codebooks in terms of training stability and encoding speed?
- Can OMG handle dynamic 3D scenes where Gaussian properties evolve over time? If not, what modifications would be needed?
- The neural field used to compute the space feature is described as lightweight—how many parameters does it use, and how critical is it to the final performance?
- How does OMG perform on mobile GPUs or CPUs? Can it realistically enable interactive rendering in those contexts?
Limitations
Yes
Final Justification
Thanks for the authors' response.
While the paper presents notable technical contributions, the direct comparison to video coding standards (e.g., H.266 vs. H.264) in the response is not suitable. The claimed 40-50% improvement appears to be derived from a single operating point, whereas video coding advancements are typically evaluated using the BDBR (Bjøntegaard Delta Bit Rate) metric, which rigorously quantifies gains across multiple quantization points (typically four). This discrepancy in evaluation methodology may lead to an overestimation of the contribution.
Anyway, I raise my score.
Formatting Issues
No
We sincerely appreciate the reviewers’ thoughtful comments and efforts in reviewing our manuscript. In the following, we address each comment individually, providing detailed explanations and clarifications as required.
Significance of our contributions
Our method achieves a 40-50% reduction in storage compared to the current state of the art, LocoGS (to put this into perspective, for a large-scale, ultra-high-resolution 3D digital twin of a city, this would mean going from 1 TB to 500 GB). To further contextualize this gain: in the field of video coding, each major standard over the past two decades, such as H.264 (2003), H.265 (2013), and H.266 (2020), has achieved approximately a 30–50% reduction per generation, typically at the cost of significantly increased computational complexity. Put differently, our method achieves a decade's worth of progress without sacrificing other important factors, including rendering quality, rendering speed, and training time. We would also like to highlight our training efficiency (LocoGS: 1 hour vs. ours: 20 mins). Even when compared to Mini-Splatting, a method not optimized for compression, we introduce only a 1-minute training time overhead. While our method has only been tested on standard benchmarks (Mip-NeRF 360, Tanks and Temples, and the Deep Blending dataset), we believe the improvements demonstrated on these datasets remain highly relevant and reflect the current interests of the research community. Last but not least, as the source code is already provided in the supplementary materials, the code and model weights will be made publicly available.
[Q1] Sensitivity of λ and K
We have tested the sensitivity of the hyperparameters used in LD scoring. To ensure a fair comparison in this experiment, we strictly matched the number of Gaussians by applying Top-K sampling to the importance scores, aligning it exactly with our model. Since the effect of LD scoring is more pronounced in smaller models, we adopt OMG-XS as our baseline for evaluation.
The following results show similar performance across different values of λ and K, with our chosen configuration (λ = 2, K = 2) yielding the best results. Although these values were determined empirically, we demonstrate that this setting consistently shows strong performance and generalizes well across a variety of scenes.
| λ | K | PSNR | SSIM | LPIPS |
|---|---|---|---|---|
| 2 | 2 | 27.06 | 0.807 | 0.243 |
| 2 | 1 | 27.05 | 0.807 | 0.243 |
| 2 | 4 | 27.00 | 0.806 | 0.244 |
| 1.5 | 2 | 26.97 | 0.806 | 0.244 |
| 2.5 | 2 | 26.98 | 0.807 | 0.243 |
[Q5] FPS on a weaker device
We have measured the FPS of our method on a low-end GPU, NVIDIA GTX 1080Ti, compared to LocoGS (with COLMAP initialization for a fair comparison). Thanks to the minimal number of Gaussians, OMG-XS achieves over 100 FPS even on this low-end hardware.
| Mip-NeRF 360 | FPS (1080Ti) | PSNR | SSIM | LPIPS | #Gauss | Size (MB) |
|---|---|---|---|---|---|---|
| OMG-XS | 106 | 27.06 | 0.807 | 0.243 | 0.43M | 4.06 |
| LocoGS | 73 | 27.09 | 0.798 | 0.250 | 1.04M | 7.96 |
| OMG-XL | 77 | 27.34 | 0.819 | 0.218 | 0.73M | 6.82 |
| LocoGS | 59 | 27.37 | 0.807 | 0.236 | 1.44M | 15.1 |
[Q4] Effectiveness of neural field
The neural field representing the space feature consists of a positional encoding and a tiny MLP with 6K weight parameters. The quantitative impact of the space feature represented by the neural field is reported in Table 4 and summarized below.
| Method | PSNR | SSIM | LPIPS | #Gauss | Size (MB) |
|---|---|---|---|---|---|
| OMG-M | 27.21 | 0.814 | 0.229 | 0.56M | 5.31 |
| w/o Space feature | 26.96 | 0.811 | 0.232 | 0.59M | 5.58 |
| OMG-XS | 27.06 | 0.807 | 0.243 | 0.43M | 4.06 |
| w/o Space feature | 26.85 | 0.804 | 0.246 | 0.44M | 4.17 |
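For reference, here is a hypothetical sketch of such a tiny neural field (positional encoding plus a small MLP); the frequency count, hidden width, and output dimension below are our assumptions, chosen only so the parameter budget lands near the reported 6K weights:

```python
import torch
import torch.nn as nn

class SpaceField(nn.Module):
    # Maps Gaussian centers to a coarse per-Gaussian space feature.
    def __init__(self, n_freq=6, hidden=96, out_dim=24):
        super().__init__()
        self.n_freq = n_freq
        in_dim = 3 * 2 * n_freq             # sin/cos per frequency per axis
        self.mlp = nn.Sequential(           # ~5.9K weights with these sizes
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, xyz):                 # xyz: (N, 3) Gaussian centers
        freqs = 2.0 ** torch.arange(self.n_freq, device=xyz.device)
        angles = xyz.unsqueeze(-1) * freqs  # (N, 3, n_freq)
        pe = torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)
        return self.mlp(pe)                 # (N, out_dim) space feature
```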
[Q2] Training strategy of SVQ
As referenced in L.197, SVQ is motivated by PQ, sharing the core principle of dividing the original vector into multiple subvectors and applying vector quantization independently to each. However, we introduce SVQ into the 3DGS literature with a novel design that offers scalability across multiple attributes and an efficient training strategy. Specifically, SVQ is applied once, followed by a short fine-tuning phase of 1K steps with the quantization indices kept fixed, which significantly simplifies training. In contrast, prior works using vector quantization, such as CompGS [1] and Compact-3DGS [2], suffer from the high complexity of learning codebooks with VQ iterations, leading to significantly increased training times compared to the original 3DGS, even though they utilize only about one-third of the Gaussians. Our SVQ approach avoids such overhead and enables efficient training, as demonstrated in Table 3 of the main paper.
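As a rough illustration of this training strategy, the toy sketch below freezes one-shot SVQ assignments and fine-tunes only the codebooks against a stand-in reconstruction loss; the sizes, loss, and optimizer settings are assumptions, not the paper's configuration:

```python
import torch

N, C = 1000, 64                            # toy sizes (assumptions)
rotations = torch.randn(N, 4)              # stand-in for rotation attributes
codebooks = [torch.randn(C, 2, requires_grad=True) for _ in range(2)]

# One-shot assignment: nearest codeword per 2-dim sub-vector, then freeze.
with torch.no_grad():
    subs = rotations.view(N, 2, 2)
    indices = torch.stack(
        [torch.cdist(subs[:, m], cb).argmin(dim=1)
         for m, cb in enumerate(codebooks)], dim=1)

opt = torch.optim.Adam(codebooks, lr=1e-3)
for step in range(1000):                   # short fine-tuning phase
    # Gathering by frozen indices is differentiable w.r.t. the codebooks,
    # so the selected codewords are refined without further VQ iterations.
    rot_hat = torch.stack(
        [cb[indices[:, m]] for m, cb in enumerate(codebooks)], dim=1)
    loss = (rot_hat.view(N, 4) - rotations).pow(2).mean()  # stand-in loss
    opt.zero_grad(); loss.backward(); opt.step()
```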
[Q3] OMG for dynamic scenes
The OMG attribute representation is applicable to all 3DGS-based methods, as it flexibly represents attributes with spatial locality (via space features and SVQ) as well as those without (using SVQ alone). This generalizability is supported by the following results, where OMG is applied to another baseline, 3DGS-MCMC [3]. We believe the same principle can extend to dynamic scenes; however, the process of obtaining a minimal set of Gaussians for such scenes remains an open question, and we leave this for future work.
| Mip-NeRF | #G | PSNR | SSIM | LPIPS | Size |
|---|---|---|---|---|---|
| 3DGS-MCMC | 500 K | 27.42 | 0.807 | 0.248 | 115 MB |
| MCMC+OMG | 500 K | 27.21 | 0.797 | 0.256 | 5.1 MB |
| 3DGS-MCMC | 1 M | 27.83 | 0.823 | 0.221 | 230 MB |
| MCMC+OMG | 1 M | 27.63 | 0.813 | 0.227 | 10.0 MB |
[W2] Architectural novelty
As noted by Reviewer WdNJ, the hybrid architecture for attribute representation is a clever design that elegantly balances the need for per-primitive specificity with the efficiency of continuous representation. This design makes OMG inherently scalable and well-suited not only for sparse Gaussians but also for complex environments. To validate this, we tuned the original 3DGS-MCMC [3] for a large-scale scene, the NYC scene from the Zip-NeRF [4] dataset, and applied OMG. Our method achieves superior performance while drastically reducing storage requirements. We believe large-scale scenes present a compelling challenge to other compression approaches that rely on compressibility and locality, and OMG's scalability opens new possibilities for efficient modeling of extremely large scenes. In this regard, we argue that the OMG architecture represents a distinct and novel contribution compared to existing approaches.
| NYC | #G | PSNR | SSIM | LPIPS | Size |
|---|---|---|---|---|---|
| Zip-NeRF | - | 28.42 | 0.850 | 0.281 | 607 MB |
| 3DGS-MCMC | 3 M | 27.77 | 0.857 | 0.295 | 675 MB |
| MCMC+OMG | 3 M | 27.82 | 0.853 | 0.291 | 29 MB |
[W3] Generalization test
Although our experiments focus on the rendering task, the significance of our contributions extends beyond it. When 3DGS models are used as inputs for downstream tasks, the efficiency of the representation, particularly in achieving high fidelity with a small number of Gaussians and attribute parameters, is directly tied to training and inference complexity. This issue is exacerbated in large-scale scenes, and we demonstrate OMG's generalization to large-scale scenes above. We believe that the effective minimal-set Gaussian representation proposed in this paper can benefit a wide range of downstream applications, representing a promising direction for future research.
[1] Navaneet, K. L., et al. Compgs: Smaller and faster gaussian splatting with vector quantization. In European Conference on Computer Vision, 2024.
[2] Lee, J. C., et al. Compact 3d gaussian representation for radiance field. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
[3] Kheradmand, S. et al. 3d gaussian splatting as markov chain monte carlo. Advances in Neural Information Processing Systems, 2024.
[4] Barron, J. T., et al. Zip-nerf: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
Dear Reviewer cPXW,
Once more, we genuinely express our gratitude for your insightful feedback on our manuscript. We gently remind you that the discussion phase will conclude in a couple of days. We believe we have effectively addressed your inquiries, concerns, and recommendations through the outcomes of our supplementary experiments. We are pleased to update our manuscript accordingly.
Should you have any additional concerns, queries, or suggestions, please feel free to reach out to us.
Best regards,
The Authors
This paper (OMG) aims at reducing the storage and computational demands of 3D scene representation while preserving real-time rendering capability. It tackles redundancy in 3DGS models through a pruning mechanism based on a novel Local Distinctiveness scoring and attribute compression via Sub-Vector Quantization. The scoring identifies essential primitives to keep by combining blending-weight importance with an appearance-aware metric that measures color variation among neighboring primitives. For compression, SVQ decomposes scale, rotation, and appearance features into sub-vectors quantized independently using small, storage-friendly codebooks. OMG also uses a lightweight neural field, driven by Gaussian positions, to capture coarse spatial context and compensate for the continuity loss caused by sparsity. Results demonstrate appreciable storage and training-time reductions versus prior art (LocoGS) and rendering at 612 FPS (on an RTX 4090 GPU).
Strengths and Weaknesses
Strengths
- Proposes a novel sub-vector quantization method for compressing primitives, to facilitate rendering with high FPS
- Appearance-aware local distinctiveness score for pruning redundant gaussians
- Addresses spatial continuity modeling using space features from positions
Weaknesses
- Exposition: The following points were raised as general questions due to the lack of adequate explanation
- The 600+ FPS claim is with an RTX 4090. What's the FPS for other methods on the same GPU?
- In sparse point clouds, Morton adjacency (which is already an approximate metric) often maps distant points, leading to spurious similarity measures and erroneous pruning. How is this handled? Since the number of points is already low, doesn't it make sense to use KNN itself?
- The LD metric is based on the outputs after SVQ is applied. Does this not cause compression artifacts to bias the pruning process, i.e., what if geometrically critical Gaussians now appear 'redundant' because of the quantization noise? i. What is done to avoid over-pruning in texture-less regions? ii. Could Table 4 have another row that shows the distortion effects of SVQ?
- Is it necessary to compute the LD metric using T_i/T_j? How does it work when RGB (DC components of SH) values are used instead for finding local similarity?
- How important are the space features (F_n)? Intuitively, T denotes the DC and V denotes the other SH bands. Please clarify why the positional encoding of a Gaussian is needed to derive this.
- Experimentation: The following points were raised due to a lack of proper empirical evidence
- Taming-3DGS also reduces the number of Gaussians with competitive quality. Why have there been no comparisons? How sensitive is OMG to the initialization?
- What is OMG-XS's performance on Mip-NeRF 360?
- SVQ uses fixed-length partitioning (M). How does it perform with non-uniform attribute distributions? E.g., high-variation regions like texture edges
- Ablations: a. How were M and L decided? Is there a sensitivity analysis on the choice of these parameters? b. It's unclear whether the improvement in FPS comes from the low number of Gaussians or from their quantized states. Could there be an ablation where FPS among methods is compared with the same number of primitives? c. There are several post-processing steps (G-PCC compression, Huffman encoding, etc.). How much do they reduce the storage footprint? To make the comparison fair, could there be a table where the storage is reported without the above steps, or where the above steps are applied to other methods as well?
Questions
Addressed in Strengths and Weaknesses
Limitations
yes
Final Justification
After reading the rebuttal from the authors to my comments and those of other reviewers, I moved my recommendation to Borderline Accept. I think that the paper is a significant contribution to the GS compression paradigm.
To justify my decision, let me address how the Authors have satisfactorily answered my questions/comments in the rebuttal process.
- Exposition: The following points were raised as general questions due to the lack of adequate explanation
- 600+ FPS claim is with an RTX 4090. What’s the FPS for other methods on the same GPU?
Solved. Authors provided a table for both an RTX 4090 and a 1080Ti, and they outperformed the state of the art by a large margin.
- In sparse point clouds Morton adjacency (which is already an approximate metric) often maps distant points, leading to spurious similarity measures and erroneous pruning. How is it handled? Since the number of points is already low, doesn’t it make sense to use KNN itself?
Solved. Authors provided a good explanation of their choice of Morton order approximation versus KNN, with numerical and theoretical motivation, thus showing that Morton adjacency was the right choice for their design.
- The LD metric is based on the outputs after SVQ is applied. Does this not cause compression artifacts to bias the pruning process, i.e., what if geometrically critical Gaussians now appear 'redundant' because of the quantization noise? i. What is done to avoid over-pruning in texture-less regions? ii. Could Table 4 have another row that shows the distortion effects of SVQ?
Solved. Authors clarify that LD scoring is not applied after the primitives have been quantized by SVQ. It was a misunderstanding on my side.
- Is it necessary to compute the LD metric using T_i/T_j? How does it work when RGB (DC components of SH) values are used instead for finding local similarity?
Solved. Authors provided a good motivation for why it is necessary to compute the LD metric using T_i/T_j. They provided empirical validation too.
- How important are the space features (F_n)? Intuitively, T denotes the DC and V denotes the other SH bands. Please clarify why the positional encoding of a Gaussian is needed to derive this.
Solved. Authors provided a justification of how important the space feature is for their method. Namely, they do not claim that the space feature should be implemented using PE+MLP; they provided empirical evidence that an extremely small neural field is performant enough.
- Experimentation: The following points were raised due to a lack of proper empirical evidence
- Taming-3DGS also reduces the number of gaussians with competitive quality. Why have there been no comparisons? How sensitive is OMG to the initialization?
Solved. Authors provide a good explanation of why they did not compare with Taming-3DGS, and they also provided numbers comparing OMG to Taming-3DGS.
- What is OMG-XS performance in MipNeRF360?
Solved
- SVQ uses fixed-length partitioning (M). How does it perform with non-uniform attribute distributions? E.g., high-variation regions like texture edges
Solved. Authors addressed their partitioning strategy through empirical analysis, showing that 2-dimensional sub-vectors offer a favorable trade-off, avoiding excessive computational overhead while achieving high performance.
- Ablations: a. How were M and L decided? Is there a sensitivity analysis on the choice of these parameters? b. It's unclear whether the improvement in FPS comes from the low number of Gaussians or from their quantized states. Could there be an ablation where FPS among methods is compared with the same number of primitives?
Solved when Authors answered my previous comment.
- There are several post-processing steps (G-PCC compression, Huffman encoding, etc.). How much do they reduce the storage footprint?
Solved. Authors showed that applying the post-processing steps achieves an overall storage reduction of approximately 30–32%.
Formatting Issues
None
We sincerely appreciate the reviewers’ thoughtful comments and efforts in reviewing our manuscript. In the following, we address each comment individually, providing detailed explanations and clarifications as required.
Significance of our contributions
Our method achieves a 40-50% reduction in storage compared to the current state of the art, LocoGS (to put this into perspective, for a large-scale, ultra-high-resolution 3D digital twin of a city, this would mean going from 1 TB to 500 GB). To further contextualize this gain: in the field of video coding, each major standard over the past two decades, such as H.264 (2003), H.265 (2013), and H.266 (2020), has achieved approximately a 30–50% reduction per generation, typically at the cost of significantly increased computational complexity. Put differently, our method achieves a decade's worth of progress without sacrificing other important factors, including rendering quality, rendering speed, and training time. We would also like to highlight our training efficiency (LocoGS: 1 hour vs. ours: 20 mins). Even when compared to Mini-Splatting, a method not optimized for compression, we introduce only a 1-minute training time overhead. While our method has only been tested on standard benchmarks (Mip-NeRF 360, Tanks and Temples, and the Deep Blending dataset), we believe the improvements demonstrated on these datasets remain highly relevant and reflect the current interests of the research community. Last but not least, as the source code is already provided in the supplementary materials, the code and model weights will be made publicly available.
[W1-5/2-2/3-c] Already Provided Results
We respectfully note that the following key points have already been addressed in the main manuscript or the appendix:
- [W1-5] Importance of the space feature: The space feature F, obtained via positional encoding (PE) and a tiny MLP, is one of the core components of OMG. It is designed to capture local continuity among sparse Gaussians, as elaborated in Lines 169–175 of the main paper. Since this feature encodes the basic structural information of Gaussians, the per-Gaussian features can concentrate more on representing fine details. Importantly, we do not claim that the space feature should be implemented using PE+MLP; rather, we demonstrate that even an extremely small neural field (with 6K weight parameters) is sufficiently effective. The quantitative impact of this space feature is reported in Table 4.
- [W2-2] OMG-XS performance on the Mip-NeRF 360 dataset: The performance of OMG-XS is reported in Table 1, the main table in our manuscript. OMG-XS achieves comparable rendering quality to the previous state-of-the-art method, LocoGS, while requiring nearly half the storage.
- [W3-c] Effectiveness of post-processing: We have analyzed the impact of post-processing techniques in Appendix 4, with corresponding quantitative results shown in Table 3 of the appendix. The results show that applying post-processing achieves an overall storage reduction of approximately 30–32%.
[W1-1] Other methods' FPS with RTX 4090
We have measured the FPS of LocoGS (with COLMAP initialization for a fair comparison) on an NVIDIA RTX 4090 and a lower-end GPU, NVIDIA GTX 1080 Ti. As shown in the results, OMG achieves significantly faster rendering speeds.
| Mip-NeRF 360 | FPS (4090) | FPS (1080Ti) | PSNR | SSIM | LPIPS | #Gauss | Size (MB) |
|---|---|---|---|---|---|---|---|
| OMG-XS | 612 | 106 | 27.06 | 0.807 | 0.243 | 0.43M | 4.06 |
| LocoGS | 396 | 73 | 27.09 | 0.798 | 0.250 | 1.04M | 7.96 |
| OMG-XL | 416 | 77 | 27.34 | 0.819 | 0.218 | 0.73M | 6.82 |
| LocoGS | 325 | 59 | 27.37 | 0.807 | 0.236 | 1.44M | 15.1 |
[W1-2/1-3/1-4] Clarification of LD scoring
To ensure a fair comparison in LD scoring experiments, we strictly matched the number of Gaussians by applying Top-K sampling to the importance scores, aligning it exactly with our model.
- [W1-2] Usage of Morton order approximation instead of KNN: The table below presents a comparison between Morton-order approximation and KNN-based scoring on 9 scenes from the Mip-NeRF 360 dataset. The results are nearly identical, demonstrating that Morton order effectively approximates KNN while offering significantly faster runtime (a small illustrative sketch of this Morton-based neighbor search follows this list).

| Mip-NeRF 360 | PSNR | SSIM | LPIPS | Time (ms) |
|---|---|---|---|---|
| OMG-XS | 27.06 | 0.807 | 0.243 | 28 |
| Morton -> KNN | 27.07 | 0.808 | 0.241 | 4100 |

While we agree with the reviewer that using KNN is reasonable in our current setup (LD scoring once at the 20K iteration with fewer than 1M Gaussians), our decision to adopt Morton order is motivated by the need for scalability and generalizability. In more complex or larger-scale scenes, the number of Gaussians can increase significantly during earlier training stages. In such cases, KNN becomes impractical due to its high memory usage and computational complexity of O(N²). Morton order, by contrast, provides a well-approximated yet efficient alternative that scales better with the number of primitives.
- [W1-3] LD scoring and SVQ: LD scoring is not applied after the primitives have been quantized by SVQ. Importance-based pruning with LD scoring is performed at the 20K iteration (as noted in Line 25 of the appendix), whereas SVQ is applied from the 29K iteration (during the final 1K iterations). Therefore, SVQ introduces no bias into the pruning process.
- [W1-4] LD scoring with RGB: The static appearance feature T is used to represent both RGB values (the DC component of spherical harmonics) and opacity. Since T shares similar characteristics with RGB, LD scoring based on RGB also yields comparable but slightly lower performance.

| Mip-NeRF 360 | PSNR | SSIM | LPIPS |
|---|---|---|---|
| OMG-XS | 27.06 | 0.807 | 0.243 |
| T → RGB | 26.99 | 0.807 | 0.243 |
| OMG-M | 27.21 | 0.814 | 0.229 |
| T → RGB | 27.19 | 0.814 | 0.229 |
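As referenced in [W1-2] above, here is a small, hypothetical sketch of the Morton-order neighbor search (quantize positions, interleave bits, sort, and treat curve-adjacent Gaussians as pseudo-neighbors); the bit depth and windowing details are assumptions, not our released code:

```python
import numpy as np

def morton_neighbors(xyz, K=2, bits=10):
    # Quantize each axis of the (N, 3) positions to [0, 2^bits - 1].
    q = (xyz - xyz.min(0)) / (np.ptp(xyz, axis=0) + 1e-12)
    q = (q * ((1 << bits) - 1)).astype(np.uint64)

    # Interleave the bits of x, y, z into a single Morton (Z-order) code.
    codes = np.zeros(len(xyz), dtype=np.uint64)
    for b in range(bits):
        for axis in range(3):
            bit = (q[:, axis] >> np.uint64(b)) & np.uint64(1)
            codes |= bit << np.uint64(3 * b + axis)

    order = np.argsort(codes)              # sort along the Z-order curve
    rank = np.empty_like(order)
    rank[order] = np.arange(len(order))

    # Pseudo-neighbors: the K Gaussians adjacent along the sorted curve
    # (clipped at the ends, so boundary points may repeat a neighbor).
    offsets = np.array([o for o in range(-(K // 2), K // 2 + 2) if o != 0][:K])
    idx = np.clip(rank[:, None] + offsets[None, :], 0, len(xyz) - 1)
    return order[idx]                      # (N, K) neighbor indices
```

Sorting is O(N log N) and the window lookup is O(NK), which is what makes this approximation attractive as the Gaussian count grows.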
[W1-3-ii] Including the effect of SVQ in Table 4
SVQ achieves substantial storage reduction while preserving the original performance. We appreciate the reviewer's suggestion and will include this result in Table 4 in the final version of the paper.
| Mip-NeRF 360 | PSNR | SSIM | LPIPS | #Gauss | Size (MB) |
|---|---|---|---|---|---|
| OMG-M | 27.21 | 0.814 | 0.229 | 0.56M | 5.31 |
| w/o SVQ | 27.26 | 0.817 | 0.226 | 0.56M | 26.1 |
| OMG-XS | 27.06 | 0.807 | 0.243 | 0.43M | 4.06 |
| w/o SVQ | 27.06 | 0.809 | 0.241 | 0.43M | 19.8 |
[W2-1-i] Comparison with Taming-3DGS
Taming-3DGS is primarily designed to accelerate training and achieves faster training times compared to our method. However, it does not incorporate any attribute compression, and yet it shows inferior rendering quality compared to OMG. In contrast, OMG achieves substantially better rendering quality while also requiring significantly less storage.
| Mip-NeRF 360 | PSNR | SSIM | LPIPS | #Gauss | Size (MB) |
|---|---|---|---|---|---|
| Taming-3DGS | 27.31 | 0.801 | 0.252 | 0.63M | 141.8 |
| OMG-M | 27.21 | 0.814 | 0.229 | 0.56M | 5.31 |
| OMG-XL | 27.34 | 0.819 | 0.218 | 0.73M | 6.82 |
[W2-1-ii] Sensitivity to initialization
Since OMG is built upon Mini-Splatting, it suffers from performance degradation without initialization. However, OMG can also be applied to other baseline methods such as 3DGS-MCMC, which exhibit stable performance regardless of COLMAP initialization. The following result highlights the broader applicability of OMG.
| Mip-NeRF | #G | PSNR | SSIM | LPIPS | Size |
|---|---|---|---|---|---|
| 3DGS-MCMC | 500 K | 27.42 | 0.807 | 0.248 | 115 MB |
| MCMC+OMG | 500 K | 27.21 | 0.797 | 0.256 | 5.1 MB |
| 3DGS-MCMC | 1 M | 27.83 | 0.823 | 0.221 | 230 MB |
| MCMC+OMG | 1 M | 27.63 | 0.813 | 0.227 | 10.0 MB |
[W2-3/W3-a] Clarification of M and L
As noted by Reviewer WdNJ, SVQ is proposed to strike a better balance among computational cost, storage efficiency, and representational precision for Gaussian attributes. OMG requires four types of vectors: a 3-dimensional scale vector, a 4-dimensional rotation vector, and two 3-dimensional appearance vectors. As demonstrated in Table 5, although these vectors are already short, applying naive vector quantization still demands significant computation due to the need for large codebooks. Through empirical analysis, we found that using 2-dimensional sub-vectors offers a favorable trade-off, avoiding excessive computational overhead while achieving high performance. The scale vectors, having an odd dimension (3), are instead quantized using scalar quantization.
Furthermore, by dividing the vector into sub-vectors, the value distribution within each sub-vector becomes less complex. This simplifies the statistical characteristics of the data, allowing Huffman encoding to exploit redundancy more effectively. As a result, the final storage size remains stable across varying bit allocations, making the method less sensitive to quantization bit-width choices and more robust to distributional variations.
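To illustrate why simpler index distributions help, below is a small, self-contained sketch that computes Huffman code lengths for a stream of quantization indices; it is illustrative only (the paper's actual post-processing combines several standard steps such as G-PCC and Huffman encoding):

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    # Returns {symbol: code length in bits} for a Huffman code built
    # from the empirical frequencies of the input sequence.
    freq = Counter(symbols)
    if len(freq) == 1:                      # degenerate single-symbol case
        return {next(iter(freq)): 1}
    # Heap entries: (frequency, tie-breaker, {symbol: current depth}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)      # two least-frequent subtrees
        fb, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (fa + fb, tie, merged))
        tie += 1
    return heap[0][2]

# A peaked index distribution (as produced by low-dimensional sub-vectors)
# encodes well below the fixed-length 2 bits/symbol baseline.
peaked = [0] * 7000 + [1] * 2000 + [2] * 800 + [3] * 200
lengths = huffman_code_lengths(peaked)
total_bits = sum(lengths[s] for s in peaked)  # 14000 vs. 20000 fixed-length
```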
[W3-b] Quantization and rendering speed
Quantization does not affect rendering speed, as rendering in this field is commonly performed on GPUs using floating-point operations. Therefore, the improved speed of OMG results from the reduced number of primitives rather than from quantization.
| Mip-NeRF 360 | PSNR | #Gauss | FPS |
|---|---|---|---|
| OMG-XS | 27.06 | 0.43M | 613 |
| w/o SVQ | 27.06 | 0.43M | 601 |
Dear Reviewer wotF,
Once more, we genuinely express our gratitude for your insightful feedback on our manuscript. We gently remind you that the discussion phase will conclude in a couple of days. We believe we have effectively addressed your inquiries, concerns, and recommendations through the outcomes of our supplementary experiments. We are pleased to update our manuscript accordingly.
Should you have any additional concerns, queries, or suggestions, please feel free to reach out to us.
Best regards,
The Authors
Thanks for your answer to my questions,
Dear Reviewer wotF,
It seems that your comment may have been cut off, as it ends with a comma and appears incomplete. Could you kindly take a moment to review it and see if any part was unintentionally omitted?
-> Thanks for your answer to my questions,
Thank you for your response, and we appreciate you increasing the rating. We are glad that our clarification sufficiently addressed the reviewer’s concerns. We would like to appropriately include the provided materials into the final version.
Once again, thank you for your valuable review.
Best regards,
The authors
That was it, sorry. You answered my questions with enough clarity, and I have read the answers to the other reviewers. I will raise my score to BA.
This paper addresses the significant storage and computational overhead associated with 3DGS. While prior work has focused on compressing the attributes of a large set of Gaussians, this paper tackles the more challenging problem of achieving high-quality results with a minimal number of primitives. The authors argue that as the set of Gaussians becomes sparse, each primitive becomes more sensitive to compression loss, and attribute irregularity increases, making traditional compression less effective. This paper presents Optimized Minimal Gaussians (OMG), a compact 3DGS representation that reduces both the number of Gaussians and storage requirements. OMG minimizes redundancy by selecting distinct Gaussians and introduces a precise, compact attribute representation. A sub-vector quantization technique further improves efficiency with minimal overhead. OMG achieves nearly 50% storage reduction compared to the SOTA and enables 600+ FPS rendering on an NVIDIA RTX 4090 GPU without compromising quality.
Strengths and Weaknesses
A. Strengths:
- The paper addresses a critical bottleneck in the practical application of 3DGS. As 3DGS becomes more widely adopted, methods for extreme compression and efficient rendering are quite important. The focus on minimizing the number of primitives, rather than just compressing their attributes, is an important direction.
- The hybrid architecture for attribute representation is clever. It elegantly balances the need for per-primitive specificity (to capture geometric detail) with the efficiency of a continuous representation (for appearance), which is perfectly suited for the sparse Gaussian problem. However, discussing other design choices would have been useful too.
- The Sub-Vector Quantization (SVQ) method is a practical and effective solution to a well-known problem in quantization, providing a better balance between computational cost, storage, and precision than prior approaches. Bringing it to 3DGS totally makes sense.
- I also like the local distinctiveness scoring metric. It's an intuitive and direct way to handle redundancy among nearby Gaussians.
- The paper is well-written and easy to follow. The core ideas are explained clearly.
- Good literature review.
B. Weaknesses:
- The paper misses a number of important qualitative and quantitative comparisons, as discussed below.
- Qualitative comparisons are only conducted with vanilla 3DGS; there are no visual comparisons with state-of-the-art methods such as LocoGS or HAC in either the main paper or the supplementary materials. Including these comparisons—particularly in the form of rendered videos in the supplementary—would have been a valuable addition.
- The results for LocoGS-L on the Tanks & Temples and Deep Blending datasets are missing. Including these is important to fully assess the performance gains of the proposed method across a diverse range of datasets.
- The scoring relies on finding K-nearest neighbors, which is approximated using a Morton order sort for efficiency. The potential impact of this approximation (i.e., not using the true geometric neighbors) on pruning performance is not analyzed.
- The metric is based on the L1 norm of the difference between static appearance features (T_i). This overlooks the potential importance of view-dependent features in distinguishing primitives, and the choice of the L1 norm over other distance metrics (e.g., L2) is also not motivated.
- The LD term is multiplicatively combined with the baseline importance score. This means a Gaussian that is highly unique but never the single most dominant contributor to any ray will have its importance score zeroed out, which may not be the most ideal solution.
- The decision to encode geometry directly while using a hybrid model for appearance is a fixed design choice. The paper would be stronger if it discussed or ablated this choice.
- Tables 1 and 2 lack comparisons with Mini-Splatting. Also, visual comparisons would be greatly helpful in identifying the benefits of the proposed method over the baseline.
Questions
- Size, FPS, and PSNR are similar for LocoGS-S and OMG-XS. Does that imply that the performance gains are not significant enough?
- Table 3: LocoGS-S uses more than twice the number of Gaussians yet achieves a similar storage size with comparable performance. Does this imply that LocoGS has a better compression ratio than OMG?
- Lines 261-262: It's unclear whether these statements refer specifically to the Mip-NeRF dataset, but they do not hold consistently when considering all datasets across Tables 1 and 2.
Limitations
yes
Final Justification
This is a good paper with strong potential to influence downstream research in 3DGS.
Formatting Issues
NA
We sincerely appreciate the reviewers’ thoughtful comments and efforts in reviewing our manuscript. In the following, we address each comment individually, providing detailed explanations and clarifications as required.
Significance of our contributions
Our method achieves a 40-50% reduction in storage compared to the current state of the art, LocoGS (to put this into perspective, for a large-scale, ultra-high-resolution 3D digital twin of a city, this would mean going from 1 TB to 500 GB). To further contextualize this gain: in the field of video coding, each major standard over the past two decades, such as H.264 (2003), H.265 (2013), and H.266 (2020), has achieved approximately a 30–50% reduction per generation, typically at the cost of significantly increased computational complexity. Put differently, our method achieves a decade's worth of progress without sacrificing other important factors, including rendering quality, rendering speed, and training time. We would also like to highlight our training efficiency (LocoGS: 1 hour vs. ours: 20 mins). Even when compared to Mini-Splatting, a method not optimized for compression, we introduce only a 1-minute training time overhead. While our method has only been tested on standard benchmarks (Mip-NeRF 360, Tanks and Temples, and the Deep Blending dataset), we believe the improvements demonstrated on these datasets remain highly relevant and reflect the current interests of the research community. Last but not least, as the source code is already provided in the supplementary materials, the code and model weights will be made publicly available.
[Q1/2] Performance improvement over LocoGS
- [Q1] OMG-XS is 48.6% smaller than LocoGS-S: We believe this gain is not trivial. For instance, in the field of video coding, each generation has achieved approximately a 2× improvement in compression ratio: H.265 (2013) roughly doubles the efficiency of H.264 (2003), and H.266 (2020) further doubles that of H.265, while imposing significantly increased computation. While 3DGS compression continues to be highly optimized, we would like to argue that the additional compression gains achieved by our method are substantial, especially when considering the accompanying improvements in rendering speed.

| Mip-NeRF 360 | PSNR | Size (MB) | FPS (3090) |
|---|---|---|---|
| LocoGS-S | 27.04 | 7.90 | 310 |
| OMG-XS | 27.06 | 4.06 (-48.6%) | 350 |

Furthermore, while a 4 MB reduction in absolute terms may appear trivial, it becomes highly significant in large-scale scenes. OMG leverages per-Gaussian features to represent the irregularity of Gaussians, making it inherently scalable and well-suited for complex environments. To demonstrate this, we tuned the original 3DGS-MCMC for a large-scale scene, the NYC scene from the Zip-NeRF dataset, and applied OMG. Our method achieves superior performance while drastically reducing storage requirements. We believe large-scale scenes present a compelling challenge to other compression approaches that rely on compressibility and locality, and OMG's scalability opens new possibilities for efficient modeling of extremely large scenes.

| NYC | #G | PSNR | SSIM | LPIPS | Size |
|---|---|---|---|---|---|
| Zip-NeRF | - | 28.42 | 0.850 | 0.281 | 607 MB |
| 3DGS-MCMC | 3 M | 27.77 | 0.857 | 0.295 | 675 MB |
| MCMC+OMG | 3 M | 27.82 | 0.853 | 0.291 | 29 MB |

- [Q2] MB per Gaussian is not the key factor: To reduce the number of Gaussians in LocoGS, we incorporate Mini-Splatting's densification strategy while retaining LocoGS's attribute representation. Although this approach successfully reduces storage by decreasing the number of Gaussians, LocoGS fails to maintain high-quality rendering. This limitation arises because a single neural field struggles to represent multiple attributes of sparse Gaussians effectively. A similar challenge is observed in Compact-3DGS as well, as demonstrated in Table 1 of the appendix.

| Mip-NeRF | #G | PSNR | SSIM | LPIPS | Size |
|---|---|---|---|---|---|
| MS+Loco-S | 946 K | 27.04 | 0.804 | 0.225 | 9.1 MB |
| MS+Loco-S | 537 K | 26.52 | 0.789 | 0.257 | 6.2 MB |
| MS+Loco-L | 923 K | 27.35 | 0.815 | 0.214 | 14.2 MB |
| MS+Loco-L | 519 K | 26.76 | 0.799 | 0.247 | 11.3 MB |
| OMG-XS | 427 K | 27.06 | 0.807 | 0.243 | 4.1 MB |
| OMG-XL | 727 K | 27.34 | 0.819 | 0.218 | 6.8 MB |
[Q3] Clarification of L. 261-262
Lines 261–262 refer to the results on the Mip-NeRF 360 dataset. For a more comprehensive evaluation, we assessed overall performance on Mip-NeRF 360 (9 scenes), Tanks & Temples (2 scenes), and Deep Blending (2 scenes). OMG-M achieves higher PSNR and SSIM than LocoGS-S while reducing storage by 37%. Furthermore, OMG-XL achieves superior SSIM compared to LocoGS-L, with a smaller storage footprint than LocoGS-S. We will revise the final version accordingly to clarify this point.
| All datasets | PSNR | SSIM | LPIPS | Size (MB) |
|---|---|---|---|---|
| LocoGS-S | 26.98 | 0.827 | 0.225 | 7.66 |
| OMG-M | 27.03 | 0.833 | 0.226 | 4.84 (-37%) |
| LocoGS-L | 27.22 | 0.834 | 0.214 | 13.6 |
| OMG-XL | 27.14 | 0.837 | 0.216 | 6.23 (-54%) |
[W4/5/6] Clarification of LD scoring
To ensure a fair comparison in LD scoring experiments, we strictly matched the number of Gaussians by applying Top-K sampling to the importance scores, aligning it exactly with our model. Since the effect of LD scoring is more pronounced in smaller models, we adopt OMG-XS as our baseline for evaluation.
- [W4] Usage of Morton order approximation instead of KNN: The table below presents a comparison between Morton-order approximation and KNN-based scoring on 9 scenes from the Mip-NeRF 360 dataset. The results are nearly identical, demonstrating that Morton order effectively approximates KNN while offering significantly faster runtime.

| Mip-NeRF 360 | PSNR | SSIM | LPIPS | Time (ms) |
|---|---|---|---|---|
| OMG-XS | 27.06 | 0.807 | 0.243 | 28 |
| Morton -> KNN | 27.07 | 0.808 | 0.241 | 4100 |

As our current setup applies LD scoring only once at the 20K iteration with fewer than 1 million Gaussians, KNN is affordable and potentially more reliable in this context. However, our decision to adopt Morton order is motivated by the need for scalability and generalizability. In more complex or larger-scale scenes, the number of Gaussians can increase significantly during earlier training stages. In such cases, KNN becomes impractical due to its high memory usage and computational complexity of O(N²). Morton order, by contrast, provides a well-approximated yet efficient alternative that scales better with the number of primitives.

- [W5] Design choice for LD scoring: We conducted an ablation study on the design choices for LD scoring and found that the results remain nearly unchanged. We would like to include this result in the final version of the paper.

| Mip-NeRF 360 | PSNR | SSIM | LPIPS |
|---|---|---|---|
| OMG-XS | 27.06 | 0.807 | 0.243 |
| T -> T&V | 27.06 | 0.807 | 0.243 |
| L2 | 27.05 | 0.806 | 0.244 |

- [W6] LD scoring and the most dominant contributor: As the reviewer pointed out, LD scoring is applied to Gaussians that have been the dominant contributor for at least one ray. This design reflects our objective to construct a minimal yet effective set of Gaussians. Given the limited number of Gaussians, further pruning typically leads to performance degradation. However, LD scoring enables the identification of less distinctive Gaussians whose contributions can be compensated by learning (generally extending) nearby Gaussians over the following iterations. The effectiveness of this approach is validated in Table 4 and Figure 5.
[W7] Hybrid model for geometry
Considering that the geometry of sparse Gaussians exhibits weak spatial continuity, we chose not to fuse the space feature into the geometry representation. We agree that an ablation study on this aspect would be valuable. However, incorporating geometry predicted by an MLP is not feasible in our setting, as it cannot be properly initialized during training (at the 15K iteration). Randomly initialized geometry leads to out-of-memory issues, especially due to the scale of each Gaussian. Despite considerable effort, we found it challenging to train the model under such conditions.
[W1/2/3/8] More results
We sincerely appreciate the reviewer’s suggestion and will include the following results in the final version of the paper.
- [W2/8] Qualitative results compared to other methods: Due to the strict rebuttal guidelines of NeurIPS, we are unable to include qualitative results here. However, we will certainly incorporate them in the final version.
- [W3/8] LocoGS-L in Table 2 and Mini-Splatting in Tables 1 and 2:

| Tanks & Temples | PSNR | SSIM | LPIPS | Size | FPS |
|---|---|---|---|---|---|
| LocoGS-L | 23.84 | 0.852 | 0.161 | 12.34 | 311 |
| Mini-Splatting | 23.41 | 0.846 | 0.180 | 67.6 | (1095) |

| Deep Blending | PSNR | SSIM | LPIPS | Size | FPS |
|---|---|---|---|---|---|
| LocoGS-L | 30.11 | 0.906 | 0.243 | 13.38 | 297 |
| Mini-Splatting | 30.04 | 0.910 | 0.241 | 124.9 | (902) |

| Mip-NeRF 360 | PSNR | SSIM | LPIPS | Size | FPS |
|---|---|---|---|---|---|
| Mini-Splatting | 27.39 | 0.822 | 0.216 | 119.5 | (601) |
I thank the authors for their detailed response to my concerns. They have provided sufficient clarification for points W4/5/6 and included the requested numbers for W2/3/8. I strongly encourage the authors to incorporate the corresponding qualitative results in the final version of the paper, or at least in the supplementary material.
That said, after carefully reviewing the authors’ response and the feedback from other reviewers, I have decided to raise my score to BA. This is a good paper with strong potential to influence downstream research in 3DGS.
Thank you for your response and for updating your rating toward acceptance. We would be happy to include the provided quantitative results, with the additional qualitative examples, in the final version or supplementary material.
Once again, thank you for your thoughtful feedback and efforts in reviewing our manuscript.
Best regards,
The authors
This paper drew my attention because reviewer engagement is low despite its borderline standing. Please carefully read the authors' rebuttals and the other reviews and leave corresponding comments.
An efficient 3DGS method is proposed to reduce the number of Gaussian primitives, removing redundancy and exploiting attribute representations that capture the continuity and irregularity of the primitives. Sub-vector quantization, which better represents irregularity, is especially helpful for fast training with a smaller codebook.
One of the interesting ideas in this work is the local distinctiveness metric, which considers the difference between the appearance features using the set of pseudo-K-nearest neighbors, where the neighbor selection is done by sorting Gaussians in Morton order.
The successful rebuttal phase received positive feedback from all reviewers, consolidating the diverse borderline opinions of the pre-rebuttal phase into unanimous support and leaving no major issues. The AC acknowledges that this seldom happens, and thanks the reviewers for their constructive feedback and the authors for their efforts. To sum up, the AC is glad to recommend this hardworking paper for our venue.
Besides, AC wants to leave some minor suggestions for further polishing the manuscript, although these are not considered in the recommendation assessment:
- According to the official NeurIPS style guideline, “all headings should be lower case (except for first word and proper nouns)” in Line 55. Major ML/vision conferences adopt this style, except for some other venues.
- Inconsistent styles for the paragraph titles should be corrected, e.g., “Implementation Details” in Section 4 Experiment.
- Aligning with Reviewer MiKC’s concern, the motivation (especially the first two paragraphs) of Section 3.1 needs to be polished to convey the idea. Some sentences are hard to follow.
- Tables 1, 2, 3, and 5 need a space between the caption and the table.
- Finally, in Table 5, it’s challenging to determine that each group of three columns represents the model sizes (XS/M/XL) due to the absence of explicit group headers.
- The paper title seems too general. The AC suggests slightly updating the title, if the authors find it suitable, since the proposed method achieves the "optimized minimal 3DGS" using sub-vector quantization and local distinctiveness.