PaperHub
NeurIPS 2024 · Poster · 4 reviewers
Ratings: 5, 7, 5, 4 — average 5.3/10 (min 4, max 7, std 1.1)
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.8

ContextGS : Compact 3D Gaussian Splatting with Anchor Level Context Model

Submitted: 2024-04-23 · Updated: 2024-11-06

Abstract

Keywords
3D scene compression, 3D Gaussian Splatting

Reviews & Discussion

Review (Rating: 5)

This paper presents Context-GS, a method designed to reduce the memory overhead of 3D Gaussian Splatting (3DGS). Inspired by context modeling in image compression, the authors introduce a similar concept into Scaffold-GS, which uses anchors to predict 3D Gaussian distributions. The method encodes anchors at a coarse-to-fine level, significantly enhancing storage efficiency. Experimental results on real-world datasets demonstrate that the proposed method achieves a high compression ratio while maintaining comparable fidelity.

Strengths

1. The method is novel, integrating the concept of context modeling from the image compression domain.

2. The results are strong. It achieves significantly better compression while retaining high rendering quality, comparable to the original Scaffold-GS and clearly superior to other counterparts. The evaluation is comprehensive, and the paper conducts thorough ablation studies that fairly analyze the actual compactness.

3. The overall presentation is easy to follow.

Weaknesses

  1. The main concern is that the proposed main components have a very minor effect on performance. As shown in Table 2, the primary contribution to compression is adopted from Compact-3DGS, while the proposed "HP" and "CM" components reduce the memory by only up to 4 MB.

  2. The proposed method appears complicated and specifically customized to Scaffold-GS, which limits its extendability. One of the most appealing features of 3DGS is its compatibility with many graphics engines. However, the use of neural networks and view-dependent Gaussian prediction undermines this advantage.

Questions

The teaser figure is hard to understand and does not effectively convey the main idea. Upon reading the caption, my initial impression was that (b), (c), and (d) are from the proposed Context-GS. However, the caption stating "(c) verifies the spatial redundancy" and the main text at L48 stating "spatial dependency has been significantly reduced" confused me about what exactly is being reduced. Is one of the figures taken from Scaffold-GS? It would be more effective to first provide a reference scene so readers can understand what is actually being reduced. Additionally, the points are too small and chaotic, making it difficult to convey the ideas effectively.

Minors:

  • L21: "quired" should be "queried."
  • L29: "Neural Gaussian"? It is unclear why this is referred to as "Neural Gaussian" instead of "3D Gaussian." 3D Gaussian typically does not involve neural networks. The paper might define "neural" as meaning differentiable, but this could confuse some readers into thinking it is related to neural networks.

Limitations

The paper discusses limitations in encoding and decoding costs, but the primary limitation appears to be highlighted in the weaknesses section. The proposed method also seems difficult to apply to more generic Gaussian splatting techniques.

One of the authors' main motivations is that "these papers mainly focus on improving the efficiency of a single neural Gaussian and neglect the spatial redundancy among neighboring neural Gaussians." However, this motivation does not resonate with me, as the ablation study does not indicate significant improvements from considering spatial redundancy.

Author Response

We sincerely thank you for your thorough review and valuable suggestions.

  

"The primary contribution to compression is adopted from Compact-3DGS"

Our method is significantly different from Compact-3DGS; we adopt only its masking loss. We also achieve significantly better performance than Compact-3DGS, i.e., a 3.5 dB improvement using only 16.9% of the bitstream, measured on BungeeNeRF.

Compact-3DGS does use entropy coding, but only as a post-processing step: Huffman coding is applied to the indices after vector quantization (VQ) once training is complete. In contrast, we optimize the estimated entropy of the 3D scene by learning the distribution of anchors during training.

To demonstrate the difference from Compact-3DGS and the effectiveness of the proposed method, we set the weight of the masking loss used in Compact-3DGS in Eq. 10 to 0, with the following results:

| Measured on Rome (BungeeNeRF) | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| \lambda_m = 5e-4 (default) | 26.38 | 0.871 | 0.214 | 14.06 |
| \lambda_m = 0 | 26.42 | 0.872 | 0.212 | 13.97 |

Since the entropy loss is trained in an end-to-end manner, we can achieve a similar or even better rate-distortion trade-off compared with using the masking loss proposed in Compact-3DGS in our framework. This strongly highlights the difference between our end-to-end entropy framework and the Huffman coding as a post-process in Compact-3DGS.
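The end-to-end rate term described here can be sketched as follows — a minimal illustration under our own simplifying assumptions (a Gaussian feature model discretized to unit-width bins), not the paper's implementation. The bit cost of a quantized symbol is the negative log of the probability mass its bin receives, which is differentiable with respect to the predicted distribution parameters and can therefore be minimized jointly with the rendering loss.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Gaussian CDF, used to integrate the density over a quantization bin."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def estimated_bits(y, mu, sigma):
    """Estimated code length (bits) of quantized symbol y under N(mu, sigma^2):
    probability mass of the unit bin [y-0.5, y+0.5], then -log2."""
    p = gaussian_cdf(y + 0.5, mu, sigma) - gaussian_cdf(y - 0.5, mu, sigma)
    return -math.log2(max(p, 1e-12))

# A feature predicted accurately (mu close, small sigma) costs few bits;
# a poorly predicted one costs many. This differentiable rate estimate is
# the kind of term an end-to-end entropy objective minimizes during training.
cheap = estimated_bits(3.0, mu=3.1, sigma=0.8)
costly = estimated_bits(3.0, mu=-2.0, sigma=0.8)
assert cheap < costly
```

Post-hoc Huffman coding, by contrast, only assigns codewords to whatever symbol statistics training happened to produce; it cannot shape those statistics.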

  

4 MB Improvement

The improvement from 18.67 MB to 14 MB is indeed significant. As highlighted in the rebuttal summary, the relative improvement depends on the choice of baseline. It is worth noting that while both HAC and the proposed method use Scaffold-GS as the backbone, the baseline we use for the ablation study is much stronger than the one used by HAC, as shown in the following table: it achieves much better rendering quality at almost the same size as HAC. On top of such a strong, SOTA-level baseline, a 25% size reduction is already very significant (as mentioned previously, image compression accumulated only ~11% improvement over three years).

| Tested on BungeeNeRF | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| HAC | 26.48 | 0.845 | 0.250 | 18.49 |
| Our proposed ablation baseline | 26.93 | 0.867 | 0.222 | 18.67 |
| ContextGS (Ours) | 26.90 | 0.866 | 0.222 | 14.00 |

  

"Specifically customized to Scaffold-GS"

The core idea of the proposed context model is not limited to ScaffoldGS, as it does not involve new data formats and does not make any assumptions about the basic elements representing the 3D scene, such as anchors or 3D Gaussians. The context model aims to explore the relationship among existing elements by predicting the probability distribution of the current context based on given decoded contexts. In principle, it could also be applied to vanilla 3DGS or new backbones in the future, which may be left for future exploration.
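The decoding order this describes can be sketched as a generic coarse-to-fine loop — a toy illustration with hypothetical stand-in functions, not the authors' released code. The key property is that each level's entropy parameters come only from what has already been decoded, so no extra storage is needed for the context itself.

```python
def decode_levels(levels, decode_fn, predict_params):
    """Coarse-to-fine autoregressive decoding: each level's entropy parameters
    are predicted from anchors decoded at coarser levels (the 'context')."""
    decoded = []                               # already-decoded anchors
    for payload in levels:
        params = predict_params(decoded)       # context -> distribution params
        decoded.extend(decode_fn(payload, params))
    return decoded

# Toy stand-ins: 'decoding' just offsets symbols by the predicted parameter;
# in practice both would involve an entropy coder and a small neural network.
decode_fn = lambda payload, params: [x + params for x in payload]
predict_params = lambda ctx: len(ctx)
assert decode_levels([[1], [2, 3]], decode_fn, predict_params) == [1, 3, 4]
```

Because the loop makes no assumption about what an element is (anchor, 3D Gaussian, or otherwise), the same structure applies to other backbones.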

  

Teaser figure

Yes, figures (b, c, d) in the teaser figure are all from the proposed ContextGS.

Our method does not aim to reduce the similarity between levels; whether or not the proposed context model is used does not significantly change the similarity among anchors at different levels.

In fact, we rely on this high similarity to better model the relationship between two levels, i.e., to predict the probability distribution of the current context from the already decoded contexts. In general, the higher the similarity between the current context and the decoded contexts, the larger the gain from the context model:

  • Higher similarity → more accurate distribution modeling → reduced entropy (smaller bitstream)
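This chain can be made concrete with a toy entropy calculation (illustrative numbers only, not measurements from the paper): conditioning on a similar, already-decoded coarse anchor turns a near-uniform symbol distribution into a sharply peaked one, and the expected code length drops accordingly.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Without context: the symbol looks uniform over 8 values -> 3 bits each.
flat = entropy_bits([1 / 8] * 8)
# With a similar decoded coarse anchor as context, the conditional
# distribution is sharply peaked -> far fewer expected bits per symbol.
peaked = entropy_bits([0.9, 0.05, 0.02, 0.01, 0.01, 0.005, 0.0025, 0.0025])
assert flat == 3.0
assert peaked < 1.0
```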

Fig. (c) aims to visualize the similarity to illustrate the feasibility of implementing the context model in 3DGS. It shows that many anchors in level 0 have highly similar or almost the same features as their corresponding anchors in a coarser level.

The detailed explanations are as follows:

  • The relationship between similarity and redundancy. Strictly speaking, high similarity is a necessary but not sufficient condition for context modeling to help. Fig. (c) shows that, although the representation of a 3D scene is sparse, it contains "flat" regions (regions with similar values), just as natural images do, and these are what make context modeling effective.

  • Visualization of the bit-saving map. The bit saving is calculated by estimating the number of bits to encode each anchor with and without using the information of already coded anchors from coarser levels (similar to previous works that calculate the bit-saving maps, e.g., Fig. 4 in [R1]).

  • The anchors whose storage costs are reduced are those that have high feature similarity with their corresponding anchors at coarser levels and are difficult to represent directly from the existing hyperprior features. This is why the bit-saving map in Fig. (d) and the similarity map in Fig. (c) do not match exactly.

We have enlarged the point size of anchors at coarser levels to make them more visible, since they occupy only a small portion. However, it remains challenging to keep the points from appearing small and cluttered, given the large number of anchors representing a scene.

[R1] Checkerboard Context Model for Efficient Learned Image Compression, CVPR 2021

  

Typos and Neural Gaussian

We use "neural Gaussian" to convey its differentiable nature, following the symbols and definitions in Scaffold-GS, which mixes "3D Gaussian" and "neural Gaussian." To avoid confusion, we will unify the terminology to "3D Gaussian" in the revised paper.

Comment

Most of my concerns have been addressed. However, I am uncertain whether a 4 MB reduction represents a significant improvement, given that the original Gaussian data is typically several hundred MB. Additionally, the proposed method appears to rely heavily on the concept of anchors, which may make it somewhat customized to Scaffold-GS. I am still unclear on how this method can be applied to 3DGS.

Nevertheless, I would like to raise my score.

Comment

Dear Reviewer 8ybR,

Thank you for your valuable feedback, which has significantly improved our submission.

Regarding your concerns about customizing to Scaffold-GS, we are conducting experiments on various backbones, including vanilla 3DGS. Preliminary results, attached below, show huge improvements over the most recent SOTA methods on different backbones.

We will provide more detailed results soon.

Best regards,
Authors


Table R1: Performance of the proposed method on the vanilla 3DGS backbone.

| Measured on Bilbao (BungeeNeRF) | PSNR (dB) | SSIM | Size (MB) |
| --- | --- | --- | --- |
| Compressed3D (CVPR'24) | 25.81 | 0.8403 | 49.32 |
| Ours | 27.77 | 0.8845 | 13.39 |

Table R2: Performance of the proposed method on the Compact-3DGS backbone (CVPR'24 Oral). Unlike the vanilla 3DGS, Compact-3DGS uses a small MLP to predict the colors of 3D Gaussians.

| Measured on Bilbao (BungeeNeRF) | PSNR (dB) | SSIM | Size (MB) | Decoding time (s) |
| --- | --- | --- | --- | --- |
| Compact-3DGS (CVPR'24 Oral) | 25.12 | 0.8581 | 51.15 | 613 |
| Ours (low bpp) | 25.58 | 0.8613 | 13.36 | 16.12 |
| Ours (high bpp) | 26.28 | 0.8668 | 14.85 | 21.52 |


Dear Reviewer 8ybR,

Attached are the updated full results on the BungeeNeRF dataset, demonstrating the generalizability of our method: we achieve large improvements on every backbone tested.

Table R1: The performance of the proposed method on the vanilla 3DGS backbone.

| Method | Backbone | PSNR (dB) | Size (MB) |
| --- | --- | --- | --- |
| 3DGS | 3DGS | 24.87 | 1616 |
| Compressed3D (CVPR'24) | 3DGS | 24.13 | 55.79 |
| Ours | 3DGS | 25.06 | 14.36 |

Table R2: The performance of the proposed method on the backbone used by Compact3DGS.

| Method | Backbone | PSNR (dB) | Size (MB) |
| --- | --- | --- | --- |
| Compact3DGS (CVPR'24 Oral) | 3DGS + tiny color MLP | 23.36 | 82.60 |
| Ours | 3DGS + tiny color MLP | 25.83 | 13.93 |

Review (Rating: 7)

The paper aims to compress Gaussian Splatting-based neural rendering representations. To achieve higher representation performance with a smaller size, the paper proposes hierarchical anchors, where coarser level anchors work as context to achieve a higher compression rate. Additionally, the paper introduces hyperprior coding for improved performance.

Strengths

Novelty: The main idea is novel within the 3DGS framework.

Performance: The performance improvement is significant.

Experiments: The ablation study has been conducted thoroughly, making it easy to understand the contribution of each part in improving the rate-distortion curve.

Weaknesses

Mathematical Notations (minor): The meaning of ^ is unclear. In Eq. 5, \hat{x} denotes that x has been quantized. However, in Eq. 6 and line 156, \hat{V} does not seem to refer to a quantized set of anchors.

Questions

Line 167: Is “senses” a typo of “scenes”?

Lines 201-202: It is mentioned that the number of levels was set to three. Did you run experiments with different numbers of levels, and how did they affect the performance?

Tab. 5 (appendix): Does “coding” refer to the hyperprior coding used in the main paper? If so, could you explain why “w/o encoding anchors” results in similar or smaller sizes compared to “w/ encoding anchors”?

Limitations

The paper provides limitations of the proposed method.

Author Response

We sincerely thank you for your review and valuable suggestions on our paper.

  

Typos

Thank you for pointing out the typo; we have corrected it.

  

Ablation of Different Number of Levels

Thank you for your suggestion. We conducted an ablation study on the Rome scene of BungeeNeRF without the learnable hyperprior. The results are shown in the following table:

| Number of levels | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| 1 | 26.38 | 0.8731 | 0.2079 | 18.26 |
| 2 | 26.27 | 0.8706 | 0.2130 | 15.99 |
| 3 (default) | 26.43 | 0.8730 | 0.2107 | 15.12 |
| 4 | 26.32 | 0.8712 | 0.2113 | 15.27 |

We find that decreasing the number of levels to 2 leads to an obvious degradation in performance, while increasing it beyond 3 slightly increases the storage cost: the extra level does not significantly improve the compression of anchors but adds approximately 0.1 MB of MLPs.

  

Notations

Thank you for pointing out the duplicate use of \hat{\mathbf{V}}. We will use \tilde{\mathbf{V}} instead to avoid confusion.

  

Without Encoding Anchors in Table 5 (Appendix)

"Coding" in Table 5 refers to using entropy coding techniques to encode the anchor positions, i.e., the detailed results of (w/ APC) in Table 4. We do not encode the anchor positions in the main paper because, as shown in Table 4, it significantly slows down the coding speed.

Specifically, we find that anchor positions are very important and difficult to compress. Through training with an adaptive quantization width, we find that anchor positions require high numerical precision for storage. This limits the compression ratio and slows coding, due to the large number of symbols required for arithmetic (entropy) coding.

Since anchor positions only occupy approximately 15% of the storage space and are crucial for rendering, compressing them sometimes does not contribute to significant improvements in performance.

Comment

I appreciate the authors' rebuttal. While I share some concerns raised by other reviewers, such as the dependency on Scaffold-GS (even though the authors provided outperforming results, this only addresses half of the proposed method, as one of the two main methods was Scaffold (anchor)-dependent), I believe the paper's strengths, particularly in performance, outweigh these concerns and weaknesses.

Comment

Dear Reviewer gvyt,

We sincerely thank you for your support and recognition of our work.

Regarding the concerns raised by other reviewers about the dependency on Scaffold-GS, we are pleased to share our new results demonstrating the effectiveness of our method on other backbones, such as vanilla 3DGS. By utilizing our proposed end-to-end entropy optimization and context model, we achieved impressive performance on these backbones, as shown in the attached tables. We will release the models based on both Scaffold-GS and vanilla 3DGS upon acceptance.

Thank you again for your time and valuable insights.

Sincerely,

Authors

 


The results in the following tables are measured on the BungeeNeRF dataset.

Table R1: The performance of the proposed method on the vanilla 3DGS backbone.

| Method | Backbone | PSNR (dB) | Size (MB) |
| --- | --- | --- | --- |
| 3DGS | 3DGS | 24.87 | 1616 |
| Compressed3D (CVPR'24) | 3DGS | 24.13 | 55.79 |
| Ours | 3DGS | 25.06 | 14.36 |

Table R2: The performance of the proposed method on the backbone used by Compact3DGS.

| Method | Backbone | PSNR (dB) | Size (MB) |
| --- | --- | --- | --- |
| Compact3DGS (CVPR'24 Oral) | 3DGS + tiny color MLP | 23.36 | 82.60 |
| Ours | 3DGS + tiny color MLP | 25.83 | 13.93 |

Review (Rating: 5)

In this paper, the authors propose ContextGS to reduce spatial redundancy among anchors using an autoregressive model. Specifically, the authors divide anchors into three levels, performing entropy coding from the top (coarse) level to the bottom (fine) level. Anchors from coarser levels are utilized as context to assist in the entropy coding of anchors at finer levels. Experimental results show that the proposed method achieves a size reduction of 15 times compared to Scaffold-GS and 100 times compared to 3DGS.

Strengths

  1. The authors have explored the correlation between anchors and, for the first time, introduce autoregressive entropy models for spatial prediction of anchors.
  2. The entire pipeline is trained end-to-end with joint rate-distortion optimization, supporting multiple bitrates by adjusting λ.

Weaknesses

  1. The size reduction brought by the entropy model is limited. The authors claim a 15× size reduction compared to Scaffold-GS. However, Table 2 in the ablation study indicates that the reduction primarily stems from entropy coding and the masking loss (from 183.0 MB to 18.67 MB), while the paper's main contributions, the hyperprior and the anchor-level context model, yield only about a 25% bitrate reduction (from 18.67 MB to 14.00 MB), which is relatively modest.
  2. The description of the anchor division method is somewhat confusing, and the equations are difficult to understand.
  3. The authors design a learnable hyperprior vector for each anchor as an additional prior. However, this approach may introduce additional spatial redundancies between anchors.

Questions

  1. The authors should further clarify how anchors are divided into different levels. Are anchors that share the same position after quantization selected for a higher level (according to lines 154-155)? Figure 2(b) seems to contradict this, as v_3^{k-1}, which is far from the quantization center, is selected as the anchor for the upper level.
  2. I wonder about the effects of the learnable hyperprior feature z_i and would like to see the performance without it.

Limitations

This paper seems to lack discussions of the limitations or the potential negative societal impact.

Author Response

We deeply appreciate your thorough review and valuable feedback on our submission. Here are our detailed responses to your comments and suggestions:

   

Performance Improvement

As illustrated in the summary of the rebuttal, we argue that the proposed main components indeed bring significant improvements.

Any compression method with SOTA performance relies on entropy coding, and most such methods also contribute improvements to the entropy coding itself. Different papers use different entropy-based backbones as baselines. Our entropy baseline (i.e., Ours w/o HP w/o CM) is well designed and very strong: the baseline used in our ablation study even outperforms the concurrent work HAC (ECCV'24).

| Tested on BungeeNeRF | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| HAC | 26.48 | 0.845 | 0.250 | 18.49 |
| Our proposed ablation baseline | 26.93 | 0.867 | 0.222 | 18.67 |
| ContextGS (Ours) | 26.90 | 0.866 | 0.222 | 14.00 |

A 25% improvement on such a strong baseline is already significant. If we instead use a plain entropy-based baseline like the one in HAC (removing the anchor position we additionally use as a prior), the size is ~30 MB, corresponding to roughly a 50% improvement. Besides, as mentioned previously, image compression accumulated only ~11% improvement over three years.

   

Additional Cost from Hyperprior

While introducing the hyperprior requires additional bitstream and storage, it is itself compressed and optimized end-to-end, and the number of bits allocated to it is jointly optimized. As shown in Table 2, the hyperprior model further reduces the total size. Besides, almost all image compression works have used a hyperprior model to improve performance since it was proposed [a].

[a] Variational image compression with a scale hyperprior, ICLR 2018
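The hyperprior mechanism from [a] can be sketched minimally — all names and numbers below are illustrative, not the paper's model. A small side signal z is coded first; a learned function maps it to the parameters of the distribution used to entropy-code the main features, so its own small cost is repaid by sharper distributions.

```python
import math

def rate_bits(y, mu, sigma):
    """Bits to code integer symbol y under a Gaussian discretized to unit bins."""
    cdf = lambda x: 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
    return -math.log2(max(cdf(y + 0.5) - cdf(y - 0.5), 1e-12))

def params_from_hyperprior(z):
    """Hypothetical mapping from a decoded hyperprior to distribution params;
    in practice this is a small learned network."""
    return 0.0, math.exp(z)          # (mu, sigma)

mu, sigma = params_from_hyperprior(-1.0)   # z signals a well-behaved feature
total = rate_bits(0, mu, sigma) + 0.5      # feature bits + cost of storing z
assert total < rate_bits(0, 0.0, 8.0)      # beats a fixed, wide prior
```

The joint rate-distortion optimization decides automatically how many bits the side signal deserves.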

   

Anchor Division

Yes, anchors that share the same position after quantization are selected for the higher level. The reason that "v_3^{k-1} in Fig. 2(b), which may be far from the quantization center, is selected as the anchor for the upper level" is as follows:

  • The quantization center of the voxel does not necessarily have a corresponding anchor.
  • If we create an anchor for it, it goes against the core idea of our work, which is using the context model to improve coding efficiency (since the context model does not involve new storage requirements).
  • Currently, we select the anchor with the minimum index within each voxel, as elaborated in Eq. 5. This choice is motivated by implementation efficiency, and it already demonstrates significant improvements.

Besides, we have modified Fig. 2(b) to make it clearer (by swapping the positions of v_0^{k-1} and v_1^{k-1} and adding a text description below "Anchor forward"), as shown in the attached PDF. These modifications highlight that the indices of anchors within the same voxel are unsorted, since all anchors are stored discretely and unordered, e.g., {v_0^{k-1}, v_1^{k-1}, v_2^{k-1}} in the figure. We select the anchor with the minimal index, i.e., v_0^{k-1}, which may lie either on the border or at the center of the voxel.

Hope the modified figure is clearer.
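The selection rule described above — quantize positions to a coarser voxel grid and keep the minimum-index anchor per occupied voxel — can be sketched in a few lines. This is a hypothetical helper for illustration, assuming anchor positions as an N×3 array; it is not the paper's code.

```python
import numpy as np

def promote_anchors(positions, voxel_size):
    """Quantize anchor positions to a coarser voxel grid and, for each occupied
    voxel, promote the anchor with the smallest index to the next level."""
    voxels = np.floor(positions / voxel_size).astype(np.int64)
    promoted = {}
    for idx, vox in enumerate(map(tuple, voxels)):
        promoted.setdefault(vox, idx)   # first (i.e., minimum) index wins
    return sorted(promoted.values())

pos = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.1, 0.2],   # same voxel as anchor 0 at voxel_size=1.0
                [1.5, 0.0, 0.0]])
assert promote_anchors(pos, 1.0) == [0, 2]
```

Note that no new anchor is created at the voxel center, which matches the point above: the context model adds no new storage.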

   

The Effects of the Learnable Hyperprior

We included the ablation study of the learnable hyperprior in Table 2. Removing the learnable hyperprior increases the size from 14.00 MB (Ours) to 15.41 MB (Ours w/o HP), demonstrating its effectiveness and complementarity with the context model. The baseline model (Ours w/o CM w/o HP) plus learnable hyperprior (Ours w/o CM) leads to a size reduction from 18.67 MB to 15.03 MB (~20% reduction).

It is worth noting that, different from HAC, the baseline model in our paper (Ours w/o CM w/o HP) already uses the anchor position as a kind of hyperprior. This is one of the reasons why the proposed baseline method (Ours w/o CM w/o HP) can already achieve comparable or even slightly better performance than HAC.

   

Limitations

We discussed limitations in L391-396 in the submitted paper.

Comment

Dear Reviewer otkA,

Thank you very much for your insightful review and valuable suggestions. We have carefully considered your feedback and addressed your comments thoroughly in our rebuttal. We believe the clarifications and improvements we've made effectively address the concerns you raised.

If you find that our responses have satisfactorily resolved the issues, could you please consider adjusting your rating accordingly?

If you still have concerns, we are more than happy to provide any additional information or discuss further if needed.

   

Additional Highlight on Performance if you still have concerns

We want to highlight that a strong end-to-end entropy baseline is also our contribution since there is no standard implementation and different papers have different entropy models, e.g., Huffman coding in Compact3DGS, and entropy and run-length encoding in Compressed3D. Our proposed entropy framework is much stronger and can be applied to different 3DGS backbones, not only limited to ScaffoldGS.

 

More results on different 3DGS backbones

We further conducted experiments on the vanilla 3DGS and on the modified 3DGS backbone used in Compact3DGS (CVPR'24). The results on the challenging BungeeNeRF benchmark are as follows:

Table R1: The performance of the proposed method on the vanilla 3DGS backbone.

| Method | Backbone | PSNR (dB) | Size (MB) |
| --- | --- | --- | --- |
| 3DGS | 3DGS | 24.87 | 1616 |
| Compressed3D (CVPR'24) | 3DGS | 24.13 | 55.79 |
| Ours | 3DGS | 25.06 | 14.36 |

As shown in Table R1, under the same backbone, we achieve 0.9dB PSNR improvements with a size reduction of ~3x compared with the most recent SOTA on the same backbone. We use less than 1% bitrate and achieve better PSNR than the vanilla 3DGS.

Table R2: The performance of the proposed method on the backbone used by Compact3DGS.

| Method | Backbone | PSNR (dB) | Size (MB) |
| --- | --- | --- | --- |
| Compact3DGS (CVPR'24 Oral) | 3DGS + tiny color MLP | 23.36 | 82.60 |
| Ours | 3DGS + tiny color MLP | 25.83 | 13.93 |

Compact3DGS uses a slightly modified version of the vanilla 3DGS backbone, incorporating a small MLP for 3D Gaussian color prediction. As shown in Table R2, our method significantly outperforms the latest SOTA method: compared with Compact3DGS (CVPR'24 Oral), we achieve a 2.5 dB improvement with a ~5x size reduction.

   


Because half of the discussion period has passed, please feel free to raise any concerns so that we can better address any potential misunderstandings.

Thank you again for your time and thoughtful consideration.

Best regards,

Authors

Comment

Thanks for the detailed response and clarification. Part of my concerns have been addressed, and I have increased my score.

Comment

Dear Reviewer otkA,

Thanks for your update and support! We appreciate the time and effort you have dedicated to reviewing our manuscript.

Best regards,

Authors

Review (Rating: 4)

This paper proposes ContextGS, a compact 3D Gaussian Splatting (3DGS) framework that requires only a minimal amount of storage while demonstrating high rendering quality. Building on the neural-Gaussian-based 3DGS framework Scaffold-GS, the authors construct a multi-level anchor structure to reduce spatial redundancy and adopt context modeling from image compression. Consequently, it achieves a 15× compression ratio compared to Scaffold-GS and a 100× compression ratio compared to 3DGS. Despite the minimal storage size, it outperforms existing compact 3DGS approaches in rendering quality.

Strengths

  • The quantitative evaluation shows that this method outperforms existing compact 3DGS frameworks, including Scaffold-GS and HAC, in rendering quality while requiring minimal storage.

  • The proposed entropy modeling of neural Gaussian features further reduces the storage size with minor degradation of rendering quality.

Weaknesses

Although it achieves high compression performance, there are several critical points to concern.

  • There are limited technical contributions to their method. The concept of multi-level (or multi-resolution) anchor structure has already been proposed in Scaffold-GS and context modeling of neural Gaussians has been introduced in HAC. There are only minor technical contributions compared to the previous work.

  • It requires noticeable encoding and decoding time, creating a bottleneck for practical 3DGS applications. In particular, the encoding/decoding time of 'w/ APC' exceeds 40 sec for a single scene. Also, Scaffold-GS needs a per-view decoding process for view-adaptive rendering, resulting in slower rendering than 3DGS.

Questions

  • What does the ‘mask’ of Scaffold-GS in Table 4 mean? To my knowledge, there is no masking parameter in Scaffold-GS. I am confused that the masking strategy of Lee et al. is applied to Scaffold-GS.

  • In L271-L272, the authors argue that this method has fewer anchors due to the masking loss. Please clarify this part.

  • As they have mentioned in L392-394, there exists additional computational costs for entropy minimization during training. I wonder about the exact training time for this method compared to Scaffold-GS and HAC.

  • Also, the exact quantitative evaluation of rendering speed is needed to support the faster rendering time of ContextGS compared to Scaffold-GS as described in L271-L272. Please provide a comparison of rendering speed to prove the argument.

  • The multi-level partitioning strategy is similar to the multi-resolution structure of Scaffold-GS. Please clarify the difference between the multi-level strategy of this paper and Scaffold-GS.

  • Does the "encoding anchor" in Table 5 denote the APC in Table 4?

Limitations

The additional computational costs cannot be avoided when pursuing a small storage size. Also, the format of this representation does not fit standard 3DGS, so existing applications such as real-time interactive renderers cannot be used. Therefore, it has disadvantages as a practical 3DGS representation.

Author Response

Thanks for your valuable comments!

 

Novelty

We want to highlight that our method differs significantly and essentially from both HAC and Scaffold-GS. A context model, in the generally accepted sense, requires no additional storage: in our work, already coded anchors (a subset of the existing anchors, not newly created ones) are used to model the uncoded ones. HAC does not use a context model in this sense, since it needs additional storage for hash features; what HAC terms "context" is actually a kind of hyperprior model [1].

The differences in "multi-level" between different papers are summarized as follows:

| | Scaffold-GS (not designed for entropy coding) | HAC | Our context model |
| --- | --- | --- | --- |
| Multi-level / extra level | Introduces and stores a new data type (anchors) compared with 3DGS | Introduces and stores a new data type (grid hash features) compared with Scaffold-GS | Within the anchor levels of Scaffold-GS itself |
| Additional storage cost of the extra level | Yes | Yes | No |

Taking decompression as an example: HAC first decompresses the hash grid features and then uses them to help decode the anchor features, i.e., all anchors are decoded at the same time (as noted above, this plays the role of a hyperprior [1]). In our work, we decode some anchors first and use them to decode the remaining ones in an autoregressive manner (this is the commonly termed context model, and no additional bitstream is needed for the "extra levels").

 

Encoding and Decoding Time

Encoding and decoding time is required by all entropy-coding-based 3DGS compression methods, e.g., HAC (ECCV'24 Oral) and Compact3DGS (CVPR'24 Oral).

For example, our decoding is much faster than Compact3DGS, as shown in the following table. Besides, the decoding time (17.85 s) is well worth the reduced model size (187 MB → 14 MB).

| Measured on Rome (BungeeNeRF) | PSNR (dB) | Decoding Time (s) | Size (MB) |
| --- | --- | --- | --- |
| HAC (ECCV'24) | 25.68 | 22.77 | 19.3 |
| Compact3DGS (CVPR'24 Oral) | 25.17 | 613.5 | 51.3 |
| Ours | 26.38 | 17.85 | 14.1 |

 

"Especially, the encoding/decoding time of 'w/ APC' requires more than 40 sec for a single scene"

This may be a misunderstanding. As indicated in L267-268, the "w/ APC" results are reported only to explain why we do not encode the positions, even though doing so yields some rate-distortion improvement. We did not use APC in any of the experiments, as indicated in L267-268.

 

ScaffoldGS vs. 3DGS

While Scaffold-GS needs a per-view decoding process for view-adaptive rendering, it can achieve comparable (or in some cases even faster) speed by limiting the prediction of neural Gaussians to anchors within the view frustum. Scaffold-GS also renders with better quality than vanilla 3DGS. It is worth noting that we do not perform encoding or decoding for view-adaptive rendering.

 

Mask in Scaffold-GS

Thanks for pointing out the typo. The "mask" of Scaffold-GS originally aims to represent the property "opacity (float32, dim=1)" in the saved checkpoint. Since it is not used in the rendering, we will revise it to "N/A" in the revised paper. Other "mask"s in Table 4 refer to the encoded 3D Gaussian-level mask using Lee et al.'s pruning strategy.

 

Fewer Anchors and Faster Speed

Lee et al. [14] demonstrate that using their proposed masking loss can significantly reduce the number of Gaussians and increase the rendering speed. In our work, any anchor for which all its 3D Gaussians are masked will also be removed. As a result, as shown in the "Number of anchors" in Table 4, compared with ScaffoldGS, which utilizes 61.9K anchors, the proposed method uses only 52.5K anchors, approximately 15% less, and achieves better PSNR.

Our rendering is exactly the same as ScaffoldGS so fewer anchors lead to faster rendering speed when other hyperparameters are the same. We re-trained a model with similar rendering quality to ScaffoldGS, and the results are as follows. We achieve slightly faster rendering speed due to a smaller number of anchors and Gaussian points.

| Rome (BungeeNeRF) | Scaffold-GS | Ours |
| --- | --- | --- |
| FPS | 202.8 | 205.4 |
| PSNR | 26.25 | 26.24 |

 

Training Speed

While entropy coding is relatively slow at inference time due to its serial nature, estimating the entropy loss during training is parallel and fast.
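
As a rough illustration of why the training-time estimate is cheap: the expected bit cost of every quantized attribute can be evaluated in one vectorized pass over the prior, with no serial arithmetic coding involved. The names and the factorized-Gaussian prior here are our assumptions, not the paper's exact entropy model:

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def gaussian_cdf(v, mu, sigma):
    """CDF of N(mu, sigma^2), evaluated elementwise."""
    return 0.5 * (1.0 + _erf((v - mu) / (sigma * math.sqrt(2.0))))

def entropy_loss_bits(x, mu, sigma, q=1.0):
    """Estimated total bit cost of attributes x quantized with bin width q,
    under a factorized Gaussian prior with parameters (mu, sigma)."""
    sigma = np.maximum(sigma, 1e-6)
    # Probability mass of the quantization bin of width q around each x.
    p = gaussian_cdf(x + q / 2, mu, sigma) - gaussian_cdf(x - q / 2, mu, sigma)
    return float(np.sum(-np.log2(np.maximum(p, 1e-9))))
```

The same quantity, computed with a differentiable framework instead of numpy, serves directly as the rate term of the training loss.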

As mentioned in our limitations section, estimating the entropy during training does slightly increase the training time. Nevertheless, this minor additional training cost is worthwhile for the reduced size. Many existing concurrent works also explicitly estimate the entropy during training, e.g.,

  • End-to-End Rate-Distortion Optimized 3D Gaussian Representation, ECCV'24 (The training time is not reported, and the code is not released yet)
  • Hash-grid Assisted Context for 3D Gaussian Splatting, ECCV'24

Compared with other models for compression, we have a similar training speed.

| | Scaffold-GS | HAC (ECCV'24) | Compact3DGS (CVPR'24) | Ours |
| --- | --- | --- | --- | --- |
| Training Time (mins) | ~25 | ~40 | ~60 | ~60 |
| Size (MB) | 186.7 | 19.30 | 51.27 | 14.06 |
| PSNR (dB) | 26.25 | 25.68 | 24.80 | 26.38 |

 

"Does not fit standard 3DGS"

The core idea of our context model is not limited to Scaffold-GS, as it does not involve new data formats and makes no assumptions about the basic elements of the 3D scene. Applying it to vanilla 3DGS is left for future exploration.

 

Does the encoding anchor in Table 5 denote the APC in Table 4?

Yes, the encoding anchor is the “w/ APC” (w/ Anchor Point Coding) in Table 4. "w/ APC" in Table 5 is only for reference, and we do not use it in the main paper.

Comment

I appreciate the authors’ efforts during the rebuttal period. I am pleased that your responses have addressed most of my concerns. Despite the rebuttal, I have decided to maintain my rating due to the lack of technical novelty and evaluations.

The key idea to compress Gaussians (entropy minimization with a hyper-prior for anchor-based Gaussians) has been explored in previous approaches. Moreover, the proposed multi-resolution strategy needs more evaluation to show its effectiveness.

The limitations regarding additional computation costs still remain. The longer training time (~60 min) is clearly larger, not similar, compared to previous approaches (Scaffold-GS: ~25 min / HAC: ~40 min). Also, the decoding time is much larger than that of previous methods that do not use entropy coding. Moreover, the encoding time was not addressed in the rebuttal.

Comment

Dear Reviewer fG6F,

Thank you very much for your insightful review and valuable suggestions. We have carefully considered your feedback and addressed your comments thoroughly in our rebuttal. We believe the clarifications and improvements we've made effectively address the concerns you raised.

If you find that our responses have satisfactorily resolved the issues, could you please consider adjusting your rating accordingly?

If you still have concerns, we are more than happy to provide any additional information or discuss further if needed.

To further alleviate your concerns about the limitations, we evaluated the proposed method on both the Compact-3DGS and vanilla 3DGS backbones. The results are presented in the following tables:

| Measured on Bilbao (BungeeNeRF) | PSNR (dB) | SSIM | Size (MB) |
| --- | --- | --- | --- |
| Compressed3D (CVPR'24) | 25.81 | 0.8403 | 49.32 |
| Ours | 27.77 | 0.8845 | 13.39 |

Table R2: Performance of the proposed method on the Compact-3DGS backbone (CVPR'24 Oral). Unlike the vanilla 3DGS, Compact-3DGS uses a small MLP to predict the colors of 3D Gaussians.

| Measured on Bilbao (BungeeNeRF) | PSNR (dB) | SSIM | Size (MB) | Decoding time (s) |
| --- | --- | --- | --- | --- |
| Compact-3DGS (CVPR'24 Oral) | 25.12 | 0.8581 | 51.15 | 613 |
| Ours (low bpp) | 25.58 | 0.8613 | 13.36 | 16.12 |
| Ours (high bpp) | 26.28 | 0.8668 | 14.85 | 21.52 |

The results show significant improvements over the most recent SOTA methods on different backbones, strongly demonstrating the significance of the proposed method as a general framework.

Thank you again for your time and thoughtful consideration.

Best regards,
Authors

Comment

Dear Reviewer fG6F,

Thanks for your response and we are happy to provide additional information for your concerns.

   

"The key idea to compress Gaussians (entropy minimization w/ hyper-prior for anchor-based Gaussians) has been explored in the previous approaches."

  1. As shown in the title of our paper, the main contribution we claim is that we are the first to apply the context model to 3DGS.

  2. As shown in our previous comment, our method is not limited to anchor-based Gaussians. On vanilla 3DGS, we achieve a 0.93 dB improvement with a ~3x size reduction compared with the CVPR'24 Oral work.

  3. We have shown significant improvements from the proposed context model. On our strong entropy baseline (even stronger than the ECCV'24 SOTA), a ~25% improvement is huge. (For reference, the accumulated improvement in recent entropy-based compression is only ~11% over 3 years.)

  4. For the two papers you mentioned in the initial review: Scaffold-GS does not introduce a hyperprior since it does not use entropy coding. HAC is a concurrent work and has significant differences in the hyper-prior design.

  • Specifically, HAC was submitted to arXiv on March 21st (and became visible even later) and was accepted on July 2nd. We submitted our abstract to NeurIPS on May 15th, so this is concurrent work according to the NeurIPS guidelines.

  • "Papers appearing less than two months before the submission deadline are generally considered concurrent to NeurIPS submissions. Authors are not expected to compare to work that appeared only a month or two before the deadline." from the NeurIPS 2024 guidelines.

  • Even though HAC is concurrent work and was only an arXiv paper at our submission, we still included it in the comparison for completeness. We differ significantly in both motivation and model design, leading to significantly better performance.

  5. Could you please provide references for the papers that share the same core idea, besides HAC? As far as we know, no previous work uses a similar idea to ours.

  6. Our novelty is acknowledged by all other reviewers, e.g., "for the first time, introduce autoregressive entropy models for spatial prediction of anchors." by Reviewer otkA, "The main idea is novel within the 3DGS framework." by Reviewer gvyt, and "The method is novel, integrating the concept of context modeling from the image compression domain." by Reviewer 8ybR.

   

Additional computational costs

  1. With significant performance improvements, a slightly increased training overhead should not be a reason to reject a paper. Many papers do not even report their training time, e.g., [a], accepted at ECCV'24.

[a] End-to-End Rate-Distortion Optimized 3D Gaussian Representation, ECCV'24

  2. Our training speed is in fact similar to that of Compact3DGS (CVPR'24), i.e., around 1 hour for a city-level scene. However, as shown in the comment above, we achieve a 0.46 dB improvement with a ~3x size reduction compared with it.

Table R1: The performance of the proposed method on the same backbone as Compact-3DGS.

| Measured on Bilbao (BungeeNeRF) | PSNR (dB) | SSIM | Size (MB) | Decoding time (s) |
| --- | --- | --- | --- | --- |
| Compact-3DGS (CVPR'24 Oral) | 25.12 | 0.8581 | 51.15 | 613 |
| Ours (low bpp) | 25.58 | 0.8613 | 13.36 | 16.12 |

   

"Encoding time has not been addressed"

Encoding time is included in the training time, since our model directly outputs the bitstreams. If we take it out of the pipeline, our encoding time is ~20 s for a large-scale city-level dataset. If needed, we will report detailed encoding times later for comparison.

   

Decoding time

As far as we know, almost all the SOTA works for 3DGS compression depend on coding techniques. Could you provide reference papers that achieve SOTA performance without coding techniques?

Besides, the results in our rebuttal demonstrate that our decoding time is faster than that of other SOTA works.


   

If you find that our new responses have resolved your concerns, could you please consider adjusting your rating accordingly? Thank you again for your time and please do not hesitate to reply to us if you still have any concerns.

Best regards,

Authors

Comment

While we still hold the opinion that a slightly increased training overhead should not be a reason to reject a paper, to further alleviate your concern regarding the training time, we did some preliminary exploration of reducing the total number of training iterations. The results are shown in Table R1. We can achieve much better performance than HAC using less training time. Even with a similar training time, we can achieve performance similar to Scaffold-GS.

Table R1: The performance, size, and training time of different methods on Rome/BungeeNerf.

| | PSNR (dB) | Size (MB) | Training time |
| --- | --- | --- | --- |
| Scaffold-GS | 26.25 | 184.34 | ~25 mins |
| HAC (ECCV'24, concurrent work) | 25.68 | 19.30 | ~40 mins |
| Ours (20k iterations) | 26.35 | 14.56 | ~35 mins |
| Ours (15k iterations) | 26.10 | 15.99 | ~25 mins |

The encoding time (already included in our training time) is ~15 s; as a comparison, Compact3DGS uses ~50 s and HAC uses ~32 s.

   


We believe the clarifications we have made can effectively address all of the concerns you raised. If our responses have resolved your concerns, could you please consider adjusting your rating accordingly? If not, as the discussion period is coming to an end, could you let us know of any remaining concerns?

Thank you for your time.

Best regards,

Authors

Author Response

Thanks to All the Reviewers for the Insightful Comments

We would like to thank the reviewers for their efforts and insightful comments. We appreciate the reviewers’ acknowledgment regarding the novelty/motivation and performance of the proposed method. For example:

Novelty/motivation:

  • "For the first time, introduce autoregressive entropy models" from Reviewer otkA10,
  • "The main idea is novel within the 3DGS framework." from Reviewer gvyt,
  • "The method is novel, integrating the concept of context modeling from the image compression domain" from Reviewer 8ybR26.

Performance:

  • "The performance improvement is significant." from Reviewer gvyt,
  • "Outperforms in rendering quality while requiring minimal storage usage" from Reviewer fG6F,
  • "The results are strong.", "Significantly better compression while retaining high rendering quality" from Reviewer 8ybR.

The questions or weaknesses mentioned by each reviewer are answered separately. Please feel free to discuss with us if you have any further concerns or questions.

       

Highlights

Some reviewers may have concerns regarding the improvements in the ablation study. We want to emphasize that these improvements (~25%) are indeed very significant, outperforming the gains achieved in recently published papers on image/representation compression (~11% accumulated improvement over 3 years).

   

The Baseline We Used for Ablation is Very Strong

We want to highlight that the selected baseline method for the ablation study is strong enough as shown in the following table. Its performance is even better than the most recent SOTA method HAC (ECCV'24).

| Tested on BungeeNeRF | PSNR | SSIM | LPIPS | Size (MB) |
| --- | --- | --- | --- | --- |
| HAC | 26.48 | 0.845 | 0.250 | 18.49 |
| Baseline We Proposed for Ablation | 26.93 | 0.867 | 0.222 | 18.67 |
| ContextGS (Ours) | 26.90 | 0.866 | 0.222 | 14.00 |

A 25% improvement on such a strong baseline, which is itself stronger than the most recent SOTA method, is already very significant.

   

The Relative Improvements Depend on How We Select/Claim the Baseline

The improvement upon the baseline highly depends on how we design/select the baseline. As shown in Eq. 8 of the paper, all the experiments include the anchor position as the hyperprior. Even in the ablation study, "w/o hyperprior" means we do not use the proposed learnable hyperprior but still use the anchor position as the hyperprior.

If we remove the anchor position from the input of the baseline model (similar to the one used in HAC), the results are as follows:

| Tested on Rome (BungeeNeRF) | PSNR | SSIM | Size (MB) |
| --- | --- | --- | --- |
| Scaffold-GS | 26.25 | 0.872 | 186.7 |
| Ours* w/o CM w/o HP | 24.83 | 0.850 | 27.85 |
| Ours* w/o Learnable Hyperprior | 26.53 | 0.873 | 16.54 |
| Ours* w/o Context Model | 26.44 | 0.872 | 19.99 |
| Ours* | 26.43 | 0.872 | 13.74 |
| Ours | 26.38 | 0.871 | 14.06 |

("Ours*" represents that we do not use the anchor position as the hyperprior. "Ours" represents that we use the anchor position as the additional hyperprior, which is the result in Table 4 of the paper.)

Compared with the baseline model without any hyperprior, both the proposed learnable hyperprior and the proposed context model yield very significant improvements (~50% compression-rate reduction and over 1 dB PSNR improvement in total). Besides, we find that our method is not affected by removing the anchor position. We did not use such a baseline in our paper previously, since we believe a strong baseline better represents the true performance of the model and benefits the community.

   

~25% is a Significant Improvement in Entropy-Based Compression

Thirdly, we want to highlight that in the deep entropy-based compression field, a ~25% improvement is already very significant. For example, comparing the CVPR'23 Oral work [a] with the CVPR'20 work [b], over 3 years of development the accumulated improvement in image compression is only around ~11% on the standard benchmark. (Shown in the PSNR/bpp subfigure of Fig. 7 in [a]: at the same PSNR, the bpp of [a] is 0.4 while that of [b] is around 0.45.)

This strongly demonstrates the difficulty of improving performance on a strong entropy baseline and supports the significance of our improvements.

[a] Learned Image Compression with Mixed Transformer-CNN Architectures, CVPR 2023 Oral

[b] Learned Image Compression with Discretized Gaussian Mixture Likelihoods and Attention Modules, CVPR 2020

Final Decision

This work proposed an autoregressive model at the anchor level for 3DGS compression, which is of certain importance given the prevailing trend of using 3DGS for rendering and the value of effective compression techniques. The overall idea has certain novelty and the evaluation is comprehensive. All reviewers give positive scores with high confidence. The AC's recommendation is consistent with the reviewers: acceptance.