PaperHub
Average rating: 5.5 / 10
Poster · 4 reviewers
Min 3 · Max 7 · Std 1.5
Individual ratings: 7 / 6 / 3 / 6
Confidence: 3.8
Correctness: 3.3
Contribution: 2.5
Presentation: 2.3
NeurIPS 2024

ODGS: 3D Scene Reconstruction from Omnidirectional Images with 3D Gaussian Splattings

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

We develop a new method for 3D scene reconstruction from omnidirectional images via 3D Gaussian Splattings.

Abstract

Keywords
3D scene reconstruction · 3D Gaussian splatting · Omnidirectional images

Reviews and Discussion

Review (Rating: 7)

This paper proposes a method for rasterizing and optimizing 3DGS for omnidirectional cameras. Built on top of the original 3DGS perspective-camera rendering, this paper approximates the rendering of each 3D Gaussian with a perspective camera pointing towards that Gaussian. By limiting the size of each individual Gaussian, this effectively approximates the omnidirectional camera model applied to 3D Gaussians.

Strengths

Motivation

  • Unlike NeRF, which operates at the ray level, 3DGS is by its nature limited by the perspective projection applied to 3D Gaussians and is not easily migrated to different camera models. Extending 3DGS to the omnidirectional camera model is useful in many real-world applications due to the data-collection efficiency of omnidirectional cameras.

Method

  • The method proposed by this paper is simple but effective. Without introducing a very complicated omnidirectional camera model to project the 3D Gaussians, this paper proposes to approximate the projection with individual perspective cameras pointing towards each Gaussian. This effectively approximates the rasterization without changing much of the pipeline.
  • The method proposed by this paper is inspiring, not only in the context of omnidirectional cameras but also for arbitrary camera models. Most cameras can be approximated in a similar manner, handling 3D Gaussians with local tangent perspective cameras.
  • The proposed method is quite general and is not limited to a particular type of scene, compared to the previous 360-GS.

Experiments

  • The quantitative rendering quality is consistently higher than that of the previous NeRF counterparts and the perspective-warped versions of NeRF and 3DGS. The qualitative results also demonstrate considerable improvements.
  • The high rendering speed of 3DGS is well preserved to a large extent.

Weaknesses

Method

  • The error of the approximation depends on the size of the Gaussian (or the size of the projected 2D Gaussian). The paper mentions that the maximum size of each Gaussian is limited to minimize this error, but I could not find details about the limit; more details should be included. A hard limit in 3D might not be the best way of enforcing it.

Experiment

  • Although the flexibility of this method is preferred over 360-GS, it would be better to include a performance comparison between these two works, as they are very similar.

Typo

  • On Line 138 "distribution of after projection".

Questions

I am very interested in one thing and hope the authors can provide some answers: can a 3DGS model trained on omnidirectional images with this method be used directly to render with a perspective camera, while preserving the rendering quality without fine-tuning?

The reason I think this is important in practice is that omnidirectional cameras are usually useful in the data-capturing stage, but not in the inference stage. It is ideal to train with panorama images but render with perspective cameras during application. I think adding the relevant results to the paper could further strengthen its usefulness.

Limitations

There is no societal impact involved.

One of the main technical limitations is the approximation error caused by the size of the Gaussians. This paper chooses to limit the size, but there might be a better way of handling it, such as splitting the Gaussians during approximation to handle more complex scenarios efficiently.

Author Response

Q1) The details about the maximum size of the Gaussian

A) We applied different maximum size thresholds depending on the elevation angle. As seen in Eq. (7), under omnidirectional projection a Gaussian located in the polar region is rasterized to a wider image area than one in the equatorial region. Reflecting this characteristic, we apply the split size limit differently depending on the elevation angle at which each Gaussian is located: the larger the absolute value of the angle, the larger the size threshold, reaching its minimum in the equatorial region. When a Gaussian's size exceeds this threshold, it is split into two smaller Gaussians that are optimized separately.
In Figure R3, we qualitatively compare images created with and without the above method. We understand that it could be difficult to fully grasp this concept based solely on the contents of the paper, so we will provide a detailed explanation to make it easier to understand.
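For illustration, an elevation-dependent split limit could be sketched as below. This is a minimal sketch under our own assumptions: the linear ramp, the constants, and the function names are illustrative, not the authors' actual schedule.

```python
import numpy as np

def split_threshold(elevation_rad, base_thresh=0.01, polar_scale=4.0):
    """Elevation-dependent size threshold (illustrative).

    Smallest at the equator (elevation = 0) and growing toward the poles
    (|elevation| = pi/2), mirroring how the equirectangular projection
    stretches polar Gaussians over a wider image area.
    """
    t = np.abs(elevation_rad) / (np.pi / 2)      # 0 at the equator, 1 at the poles
    return base_thresh * (1.0 + (polar_scale - 1.0) * t)

def needs_split(gaussian_scale, center, cam_center):
    """True if a Gaussian exceeds its elevation-dependent size limit."""
    d = center - cam_center
    elevation = np.arcsin(d[1] / np.linalg.norm(d))   # assumes a y-up convention
    return np.max(gaussian_scale) > split_threshold(elevation)
```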

Q2) Performance comparison with 360-GS

A) We attempted to compare with 360-GS before submission, but were unable to do so. Since 360-GS is currently only available on arXiv and the official code implementation has not been made public, we endeavored to reproduce 360-GS ourselves. Using the code we reproduced, however, the Gaussian splats are hardly optimized, producing implausible results. After a thorough analysis of the issue, we noticed an error in the equation calculating the Jacobian matrix $J$ in 360-GS: specifically, the signs of the diagonal terms of the matrix in Eq. (7) of the 360-GS paper [2] should be negative. This could be due to a differentiation error or a mistake in the description within the paper. While our reproduction might have been incorrect, we believe it would not be appropriate to include this comparison in the main table under these circumstances. If the official code for 360-GS becomes available, we could include the comparison in the main table.

Q3) Is it feasible to render using a perspective camera without fine-tuning?

A) ODGS delivers high-quality perspective camera rendering results without any additional fine-tuning.

Our method is fundamentally based on projecting onto the tangent plane of a sphere, which is conceptually compatible with the perspective rendering of each Gaussian. As a result, Gaussian splats trained with ODGS produce high-quality perspective images when rasterized with a pinhole camera model, without fine-tuning. Some sample images rendered using a perspective camera model are shown in Figure R4. 3DGS(P6) and 3DGS(P18) were optimized using perspective images and rendered with a perspective camera, while ODGS was optimized using omnidirectional images and rendered with a perspective camera. Despite the difference between the camera model used during optimization and the one used during inference, ODGS demonstrates significantly sharper and superior results.

Comment

I appreciate the effort the authors put into the rebuttal. My questions are well answered, and I hope the details can be included in the main paper or the supplementary material in the future. I would like to wait for the responses of the other reviewers as well before deciding whether I should increase my score further, but I really think this paper proposes an insightful and useful algorithm.

Comment

Dear reviewer sZSB,

We are pleased to hear that our response has addressed your concerns well.

Your invaluable feedback has greatly helped us improve our draft. We promise to include the points you raised and enhance clarity in the updated manuscript.

Finally, we sincerely thank you for highly valuing the contributions of our work.

Best regards,

Authors.

Review (Rating: 6)

This paper implements a rasterization module for 3D Gaussian Splatting (3DGS) for omnidirectional images. The rasterizer assumes a local affine approximation and projects Gaussian primitives onto the unit sphere centered at the camera position, which is reasonable for relatively small Gaussians. Based on the CUDA implementation, experiments show faster and more accurate rendering compared with omnidirectional extensions of NeRF (e.g., EgoNeRF).

优点

First attempt at omnidirectional 3DGS

This paper would be the first attempt to extend 3DGS to omnidirectional images in general (360-GS [2] is available but targets the specific purpose of room modeling, requiring layout information, and OmniGS [a] is considered concurrent work uploaded just before the NeurIPS deadline).

[a] Li, Longwei, et al. "OmniGS: Omnidirectional Gaussian Splatting for Fast Radiance Field Reconstruction using Omnidirectional Images." arXiv preprint arXiv:2404.03202 (2024).

Technical correctness

The method seems adequate. Under the affine approximation, this paper correctly formulates the projection and implements it with CUDA. This is a good engineering effort, and many people will benefit from it. It is good news that the omnidirectional version of 3DGS generally achieves better rendering than omnidirectional NeRF methods.

Weaknesses

Novelty

I am struggling to evaluate the technical novelty of this paper. The paper implements a useful tool, but its contribution lies rather in the engineering effort, and the scientific (or technical) novelty may be limited.

I understand it is quite important to release such useful tools to the community, but the idea of extending 3DGS to omnidirectional images is somewhat natural. A local affine approximation may be the straightforward choice for anyone extending 3DGS to omnidirectional images.

Questions

Is there any specific technical novelty we (readers and reviewers) should care about? I would rather like to support the usefulness and technical correctness of the paper, but I think others may not.

Limitations

I did not find notable negative social impacts.

Author Response

Q) Are there any specific technical novelty we (readers and reviewers) should care about?

A) The idea of applying a local affine approximation to a sphere may seem simple at first glance. However, aside from concurrent work, this approach is being attempted for the first time, and we hope that the process of developing and validating this idea will not be dismissed as merely an engineering effort. We want to highlight the following unique contributions and technical innovations:

  • New Omnidirectional Rasterizer: We introduce a novel rasterization technique specifically designed for omnidirectional images, which is not a straightforward extension of existing 3DGS methods and provides a more accurate and efficient representation of 3D scenes captured from 360-degree cameras.

  • Advanced Densification and Splitting Policy: ODGS employs a sophisticated policy for managing the densification and splitting of Gaussians in the omnidirectional domain. This is not just an engineering effort but a significant technical consideration that enhances the quality and efficiency of the 3D reconstruction, and a key aspect that we believe sets our work apart.

  • Comprehensive Evaluation Across Diverse Datasets: Unlike many other approaches, the paper thoroughly tests various datasets, covering both egocentric and roaming scenes. This comprehensive evaluation showcases the robustness and versatility of ODGS.

Fortunately, all other reviewers have acknowledged the strengths of our work, and none have raised concerns about its novelty. Specifically:

  • Reviewer cMFb noted that our work is "quite relevant to the research community, simple, and efficient."
  • Reviewer iEY4 highlighted it as "practically useful and valuable."
  • Reviewer sZSB described it as "simple but effective, and inspiring."

Therefore, while we can understand the concerns raised by reviewer oCBJ to some extent, we hope that our work will be evaluated positively.

Comment

I read the others' reviews and rebuttals. Indeed, implementing the omnidirectional rasterizer and splatting policy would be a new part. I am now leaning a bit toward acceptance, but I would also like to discuss it with other reviewers.

Comment

Dear reviewer oCBJ,

We appreciate the positive evaluation of our work. We will update the draft to better showcase our novelty.

Best regards,

Authors.

Review (Rating: 3)

The paper introduces a novel approach for 3D scene reconstruction from 360 omnidirectional imagery that the authors make compatible with a 2D Gaussian Splatting representation [x].

The proposed method is evaluated on 3 public datasets against 4 Radiance Field variants including NeRF derivatives and 3D Gaussian Splatting [25].

[x]: Huang, B., Yu, Z., Chen, A., Geiger, A., & Gao, S. (2024). 2d gaussian splatting for geometrically accurate radiance fields. Siggraph 2024.

Strengths

+ Practical usefulness of the considered problem.

The paper addresses a narrow yet valuable problem in the scope of data acquisition from imaging-equipped vehicles in urban areas.

+ Structure, and Organization of the Contents.

Each dedicated section of the paper is properly balanced. The use of text real-estate is fair.

+ Amount of discussions on the experimental evaluation is relatively good.

+ Fair results of the proposed contribution, overall.

+ Level of implementation details, in particular in the appendix is good.

Weaknesses

- (1). Writing, and consistency between key sections and the contents, and overall maturity of the presentation.

(i) Writing and readability.

In its current state, the paper is relatively dense, with grammar issues and typos, even in key sections. E.g., a typo in the title of the paper:

"ODGS: 3D Scene Reconstruction from Omnidirectional Images with 3D Gaussian Splattings" -> "ODGS: 3D Scene Reconstruction from Omnidirectional Images with 3D Gaussian Splatting"

The writing of the abstract, for example, is very dense and hard to understand.

(ii) Text vs. Contents - Inconsistencies.

The title and paper suggest the proposed work builds on the 3D Gaussian Splatting framework (e.g., title, l. 7), while it in fact uses 2D Gaussian Splatting (Eq. (9)).

- (2). Comparative evaluation lacks depth.

In particular, additional comparative and standard baselines could have been envisioned to better help the reader understand the relative positioning of the proposed conceptual contribution, performance-wise.

E.g., COLMAP (standard SfM), Neus, NeusFacto, NeuralAngelo:

  • Zehao Yu, Anpei Chen, Bozidar Antic, Songyou Peng, Apratim Bhattacharyya, Michael Niemeyer, Siyu Tang, Torsten Sattler, and Andreas Geiger. SDFStudio: A unified framework for surface reconstruction, 2022.

  • Schonberger, J. L., & Frahm, J. M. (2016). Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4104-4113).

  • Schönberger, J. L., Zheng, E., Frahm, J. M., & Pollefeys, M. (2016). Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14 (pp. 501-518). Springer International Publishing.

  • Li, Z., Müller, T., Evans, A., Taylor, R. H., Unberath, M., Liu, M. Y., & Lin, C. H. (2023). Neuralangelo: High-fidelity neural surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 8456-8465).

  • Wang, P., Liu, L., Liu, Y., Theobalt, C., Komura, T., & Wang, W. (2021). NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. Advances in Neural Information Processing Systems, 34, 27171-27183.

In particular, the unified codebase makes it very easy to produce results for the aforementioned baselines.

- Other minor issues and suggestions.

  • Missing reference (2DGS): Huang, B., Yu, Z., Chen, A., Geiger, A., & Gao, S. (2024). 2d gaussian splatting for geometrically accurate radiance fields. Siggraph 2024.

Questions

I do not have more questions or suggestions than the ones underlying the aforementioned weaknesses.

Limitations

The authors provide one dedicated paragraph that reasonably addresses such considerations.

Author Response

Q1) Inconsistencies between key sections and the contents

A) We strongly contend that reviewer iEY4's assessment is based on a substantial misapprehension of our study, leading to erroneous conclusions. Our work is based on 3DGS [25] and is not related to 2DGS at all. Eq. (9) describes the density function of a 3D Gaussian projected onto the image plane, modeled as a 2D Gaussian distribution through a local affine approximation; the actual 3D space consists of 3D Gaussians (please refer to Section 4, Equation 5 of the 3DGS paper [25]). This is completely different from 2DGS, which composes the 3D space as a set of disk-shaped 2D Gaussian primitives.
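For reference, the local affine approximation referred to here is the standard 3DGS/EWA projection of a 3D covariance to a 2D image-plane covariance (Section 4, Eq. 5 of [25]); in generic notation (the symbols may differ from the ODGS paper's Eqs. (7)-(9)):

```latex
% W: world-to-camera transform, J: Jacobian of the (omnidirectional) projection
% evaluated at the Gaussian center; \Sigma_{3D} remains a full 3D covariance.
\Sigma_{2D} = J \, W \, \Sigma_{3D} \, W^{\top} J^{\top}
```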

Q2) Comparison with other methods

A) The contributions of the listed papers are outside the scope of the problem we are trying to tackle. First, Sdfstudio, Neuralangelo, NeuS, and NeusFacto aim at neural 'surface' reconstruction, which is different from the problem we are addressing. The surface normal maps generated by these models do not contain texture information and, therefore, cannot be directly compared with high-quality rendered images obtained from ODGS. In addition, the datasets used in these works for training and inference do not contain omnidirectional images.

Meanwhile, we excluded classic 3D reconstruction using Structure-from-Motion (SfM) and Multi-view Stereo (MVS) from the main table because they produce significantly inferior results compared to recent reconstruction methods. Here, we present these results in Table R2 for better understanding. The SfM results are obtained using the OpenMVG library, and the MVS results using the OpenMVS library. (We used OpenMVG instead of COLMAP because COLMAP currently does not support camera pose estimation from omnidirectional images.) Both methods fail to create a complete point cloud, resulting in blank areas in the projected image, as observed in the samples shown in Figure R2. The valid area ratio in the table indicates the proportion of pixels that receive 3D information and have valid values. The SfM results show a valid area ratio under 5%, which implies that most regions remain empty. Although the high accuracy of the estimated points yields high PSNRs within the valid area, the number of matched points is small, and the PSNR over the whole region is measured to be less than 5 dB. Applying MVS to the SfM results increases the number of generated points, but still at least 40% of pixels remain unfilled after projection. In addition, comparing the PSNR measured within the valid area shows that the accuracy of the points created by applying MVS decreases. ODGS, in contrast, shows consistently much higher PSNR than those methods while maintaining a 100% valid area ratio on all datasets.

Comment

Dear Authors,

Having read (all of) the feedback and having revisited the paper, here is my response.

Thank you for providing meaningful details. There was indeed a confusion on my end in my initial take on Q1 in particular.

Regarding Q2 though, and the relative lack of experimental depth, I do maintain that the relative positioning of the proposed contribution, performance-wise, cannot be properly and completely assessed in the absence of more competitive baselines such as the following:

  • Mip-NeRF 360: Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded Anti-Aliased Neural Radiance Fields. In CVPR, pages 5470–5479, 2022. Its absence even from the related-work discussion and references is also puzzling, especially given it is very competitive against EgoNeRF in the original CVPR paper [10].

To a lesser extent, the following ones:

  • Tianhao Wu, Fangcheng Zhong, Andrea Tagliasacchi, Forrester Cole, and Cengiz Oztireli. D2NeRF: Self-Supervised Decoupling of Dynamic and Static Objects from a Monocular Video. NIPS, 35:32653–32666, 2022

  • Sara Sabour, Suhani Vora, Daniel Duckworth, Ivan Krasin, David J Fleet, and Andrea Tagliasacchi. RobustNeRF: Ignoring Distractors with Robust Losses. In CVPR, pages 20626– 20636, 2023

I will ponder and discuss with the other reviewers the remaining issue regarding experimental evaluation depth and the relative positioning of the proposed contribution in terms of its performance.

Warm regards.

Comment

Dear iEY4, Thank you for thoroughly reviewing our rebuttal. We have carefully examined the three additional papers and have reached the following conclusions:

  1. While Mip-NeRF 360 is compared in EgoNeRF [10], we excluded it from our comparison for the following reasons:
  • (Different Scenario) Mip-NeRF 360 represents the scene using a coordinate system combining a spherical center coordinate system with an inverse-depth unbounded coordinate system, making it suitable for egocentric scenarios focused on a specific object but not for roaming scenarios. Our work involves comparisons on both egocentric and roaming data, which differs from [10], which considers only egocentric scenarios.
  • (Performance) As shown in [10], Mip-NeRF 360 is neither comparable to EgoNeRF nor time-efficient (100k steps, > 12 hours). (From [10]: "In contrast, Mip-NeRF 360 needs approximately 8 hours to outperform our (EgoNeRF) results at 5k steps.") Therefore, it is not representative in terms of performance compared to EgoNeRF. In contrast, our method significantly outpaces this with much faster optimization.
  • (Representativeness for NeRF) The original NeRF and Mip-NeRF do not natively support omnidirectional cameras, so we conducted our experiments as NeRF(P). As reported in [10], the performance of NeRF and Mip-NeRF 360 is comparable (especially at 5k steps, regarding the fast optimization time). Therefore, we concluded that Mip-NeRF would not provide additional information beyond the NeRF case.

For these reasons, we deemed including Mip-NeRF 360 in our table redundant.

  2. We believe that D2NeRF and RobustNeRF address problems that differ from the focus of our paper. These works focus on handling dynamic scenes or scenarios with distractors. In contrast, our research centers on 3D reconstruction using Gaussian splatting with a static, clean scene and an omnidirectional camera. While we might consider addressing similar challenges (dynamic scenes, distractors) with omnidirectional cameras in future research, for now these issues are not closely aligned with the research direction of our current submission.

Review (Rating: 6)

This submission tackles the problem of extending 3D Gaussian Splatting (3DGS) to omnidirectional imagery. 3DGS, and specifically its proposed rasterizer, is limited to perspective cameras. While omnidirectional images can be decomposed into perspective cameras, this typically introduces severe distortion artifacts that impact the applicability and quality of the reconstruction results.

To address this, the authors present a number of contributions:

  • A fast omnidirectional rasterizer (implementable in CUDA) that relies on a local affine approximation while preserving optimization and rendering speed. Each Gaussian is projected for rasterization onto the tangent plane of a sphere centered at the camera origin, keeping distortion to a minimum.
  • The pruning and splitting heuristics of 3DGS are adapted to account for the omnidirectional projection.
  • An array of qualitative and quantitative evaluations on egocentric (OmniBlender, Ricoh360, OmniPhotos) and roaming (360Room, OmniScenes, 360VO) datasets is shared, with comparisons against 3DGS and NeRF-based baselines (NeRF, TensoRF, EgoNeRF).

Strengths

  • With more and more image content coming from egocentric cameras, identifying and proposing ways to address the limitations of an established approach like 3DGS on this new type of content is quite relevant to the research community.

  • The proposed changes to cope with omnidirectional imagery are very simple: essentially projecting Gaussians to use a per-Gaussian rasterization frame that limits distortion while still allowing an efficient implementation on graphics hardware.

  • The quantitative and qualitative evaluations do a good job of demonstrating the benefits of introducing a custom omnidirectional rasterizer. Performance is measured against relevant baselines (3DGS of course, but also NeRF-based ones, including one targeting egocentric content: EgoNeRF) across multiple datasets. The authors' approach is shown in Table 1 to significantly outperform (generally) all other approaches in terms of quality (using the typical PSNR, SSIM, and LPIPS metrics) while being by far the fastest. Figures 3 and 4 also share convincing qualitative results, with significant artifacts in other approaches that ODGS does not generate.

Weaknesses

  • I believe there is a major gap in the current presentation of the approach: how to implement alpha blending of Gaussians appears to be missing. It is in particular unclear to me in which common coordinate frame Gaussians are composited on top of one another to produce a final image. As this is, in my opinion, essential to understand (and reproduce) the approach, until resolved, this gap is dragging my proposed rating of the paper down despite the strengths above.

  • Decomposing an omnidirectional field-of-view into six perspective cameras to allow comparison against approaches that only work on perspective imagery makes sense. Such a comparison could however be improved and pushed much further: how does the comparison hold with

    • the omnidirectional field of view being decomposed into gradually more perspective cameras,
    • the rasterizer properly handling the different perspective coordinate frames (compositing all the relevant Gaussians overlapping the current perspective field-of-view)? The areas of distortion introduced by the rectification could be visualized and superimposed on the omnidirectional images of the qualitative comparisons to better understand and correlate with the observed artifacts.
  • The abstract and introduction indicate some adjustments applied to the densification heuristics, which are not apparently covered in the paper (not in Section 3 or 4) apart from a mention of limiting the maximum size of a Gaussian.

  • Some reference (and possibly comparison) to (very) related work appears to be missing: OmniGS: Omnidirectional Gaussian Splatting for Fast Radiance Field Reconstruction using Omnidirectional Images (April 2024)

  • Some minor typos to correct:

    • l.111 available To -> available to
    • l.285 often misses -> it often misses
    • l.289 ODGS overcome -> ODGS overcomes

Questions

  • Please clarify how alpha-blending of Gaussians is implemented with per-Gaussian rasterization (see Weaknesses above). While the presented results otherwise do look good, this is a critical gap in the presentation of the approach that prevents me from proposing a higher rating.

  • Could the authors indicate how ODGS manages to be significantly (more than 3x) faster than 3DGS applied on perspective images?

Limitations

N/A

Author Response

Q1) How to implement alpha blending?

A) We conducted our work based on the 3DGS framework, utilizing the same tile-based rasterization and alpha-blending pipeline. Our work describes how to render with an omnidirectional (omni for short) camera model instead of a perspective (persp for short) camera model; the remaining steps are identical to those of 3DGS. For clarity, the rasterization process is as follows:

  1. Projecting 3D Gaussians to 2D Gaussians in the planar pixel space (as described in Section 3.2).
  2. Tile-based culling against the spherical shell and sorting by depth.
  3. For a given pixel, accumulating color and alpha values by traversing the list front-to-back until we reach a target saturation of alpha in the pixel. (alpha-blending)
$$C = \sum_{j\in N} c_j \alpha_j T_j, \qquad T_j = \prod_{k=1}^{j-1} (1-\alpha_k),$$

where $C$ is the pixel color, the $c_j$ are the colors of the splats, and $\alpha_j$ is computed by multiplying the learned opacity of the Gaussian with the power of the rasterized 2D Gaussian at the pixel (calculated using Eq. (9)).

We omitted the details of the tile-based culling and alpha-blending process because they are not contributions of our work. However, we acknowledge that this omission may make the paper appear incomplete to some readers and reviewers. We will include this information in the final version of the paper to enhance its completeness and clarity.
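As a concrete reference for step 3 above, a per-pixel front-to-back compositing loop in the spirit of the 3DGS rasterizer could look like the following sketch; the early-termination threshold and names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def composite_pixel(colors, alphas, stop_alpha=0.9999):
    """Front-to-back alpha blending of depth-sorted splats for one pixel.

    colors: (N, 3) splat colors c_j overlapping the pixel, sorted near to far.
    alphas: (N,)   alpha_j = learned opacity * 2D Gaussian weight at the pixel.
    Computes C = sum_j c_j * alpha_j * T_j with T_j = prod_{k<j} (1 - alpha_k).
    """
    pixel = np.zeros(3)
    transmittance = 1.0                       # T_j, accumulated front to back
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * np.asarray(c)
        transmittance *= (1.0 - a)
        if transmittance < 1.0 - stop_alpha:  # target alpha saturation reached
            break
    return pixel
```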

Q2.1) What happens if we increase the number of persp cameras?

A) While using more views can yield slightly better results than the 6-view setup, it still falls short of the performance achieved by ODGS. Cubemap projection is a well-known method for converting an omni image into six persp images, each corresponding to a face of a cube, and it is widely used in many studies involving 360 cameras [A, B, C]. However, as suggested by cMFb, we also believe it is an interesting experiment to observe the trend when using more persp cameras for optimization. While the standard approach uses six perspective cameras, we optimized the model with 12 additional perspective cameras (18 in total) facing the edges of the cube.
Table R1 shows the performance of the optimized results according to the number of persp viewpoints. With 18 views, the performance is comparable to 6 views at the 10-minute mark but surpasses the 6-view results at the 100-minute mark. The increased number of training images keeps the model from sufficiently learning from all views in the early stage (10 minutes), resulting in slightly lower performance; after sufficient optimization time (100 minutes), the additional views allow for further optimization, leading to improved results. Still, ODGS shows the highest performance in most metrics, even compared to 3DGS using 18 views.
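For clarity on the two configurations compared in Table R1, the camera forward directions can be generated as in this sketch; it is our own construction of the 6 cube-face normals plus 12 edge directions, not necessarily the authors' exact setup.

```python
import itertools
import numpy as np

def cubemap_directions(include_edges=False):
    """Unit forward directions for cubemap-style perspective cameras.

    Returns the 6 face normals (+/-x, +/-y, +/-z); with include_edges=True,
    also the 12 directions toward the cube edges (normalized sums of
    adjacent face normals), giving an 18-view setup.
    """
    faces = [np.array(v, dtype=float)
             for v in [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                       (0, -1, 0), (0, 0, 1), (0, 0, -1)]]
    dirs = list(faces)
    if include_edges:
        for a, b in itertools.combinations(faces, 2):
            if np.dot(a, b) == 0:             # adjacent (not opposite) faces
                e = a + b
                dirs.append(e / np.linalg.norm(e))
    return np.stack(dirs)

assert cubemap_directions().shape == (6, 3)
assert cubemap_directions(include_edges=True).shape == (18, 3)
```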

Q2.2) Is the rasterizer properly handling the different persp coordinate frames?

A) In the case of the 6-view cubemap, artifacts can occur at the edges of the cube due to the rasterization method used by 3DGS. Since 3DGS omits 3D Gaussians at image boundaries for stable optimization, discontinuities may occur at the edges of the cube, as illustrated in Figure R1(c). (The orange dashed lines in Figure R1(b) represent the edges of the cubemap after being combined into one omni image.) The image optimized with 18-view persp images has fewer artifacts, as shown in Figure R1(d). However, artifacts still appear even at points other than the cube's edges, resulting in less favorable outcomes than ODGS.

Q2.3) How ODGS runs more than three times faster than 3DGS applied on persp images?

A) The majority of the time is spent stitching the six persp images into the omni image. Since the total resolution of an omni image (2000 x 1000) is similar to that of six persp images (512 x 512 x 6), we empirically measured that the time for generating six perspective images is similar to that for generating one omni image. However, producing the final omni image from the six persp images takes a large proportion of the inference time (about 80%), since every persp image must be non-linearly warped to compose the final omni image.

Q3) The details of the densification heuristics

A) We acknowledge that the concept might be challenging to fully understand from the paper alone, so we will include a detailed explanation for clarity. As represented by $\mathbf{C}_o$ in Eq. (7), Gaussians of the same size in 3D space are rasterized to different sizes in an omnidirectional image depending on the angle at which they are located. Consequently, starting from a pre-defined size threshold for densification, we dynamically adjust the threshold according to the vertical location of each Gaussian. As depicted in Figure R3, this approach effectively reduces artifacts when representing the lanes of the road.

Q4) Missing a related work: OmniGS

A) We now recognize OmniGS as concurrent work. We thoroughly reviewed the suggested paper, OmniGS, and noted that it employs a similar approach. Unfortunately, the work is only available on arXiv; no code or implementation details have been provided. We would have liked to compare their method with ours, but we will settle for mentioning it as concurrent work in our references.

Reference

[A] H. Jang et al., "Egocentric scene reconstruction from an omnidirectional video," SIGGRAPH 2022.
[B] M. Gond et al., "LFSphereNet: Real-time spherical light field reconstruction from a single omnidirectional image," SIGGRAPH 2023.
[C] F. Wang et al., "Bifuse: Monocular 360 depth estimation via bi-projection fusion," CVPR 2020.

Comment

I have read the rebuttal of the authors and the other reviews. I appreciate the efforts of the authors and would like to thank them for preparing this rebuttal and trying to answer the concerns and questions.

We conducted our work based on the 3DGS framework, utilizing the same tile-based rasterization and alpha-blending pipeline [...]

Unfortunately, I am still missing where the 3D Gaussians are individually rasterized. Is this "planar pixel space" shared across all rasterizations or unique to each 3D Gaussian? If shared, could you elaborate on how it is parametrized (is it just the equirectangular projection of the sphere)? If unique, could you explain how the different rasterizations are combined in the same space? Thanks in advance for any clarification.

Comment

Dear cMFb, Thank you for acknowledging our efforts in addressing all your questions. We are happy to provide further clarification on any additional inquiries you may have.

Each 3D Gaussian is projected onto its corresponding tangent plane (Figure 1(b)). This tangent plane is uniquely determined for each Gaussian as the plane that touches the sphere at the point where the line connecting the center of the spherical camera and the center of the Gaussian intersects the sphere. The covariance of the 2D Gaussian is then calculated through a local affine approximation on this tangent plane ($J_o$ in Eq. 9). The 2D Gaussians are subsequently mapped onto the shared equirectangular image plane (or planar pixel space) (Figure 1(c)). The position of each 2D Gaussian is determined by the equirectangular projection (Eq. 5), and the covariance is adjusted to account for the distortion inherent in the ERP ($S_o$, $C_o$ in Eq. 7, and $\Sigma_{O,2D}$ in Eq. 8). The 2D Gaussians are now gathered on the equirectangular image plane and are used to render the image through a tile-based rasterization process (including depth sorting and alpha blending).

For clarity, here are the further step-by-step details of the rasterization process.

  1. Split the screen (image) into 16x16 tiles.
  2. Cull 3D Gaussians against the spherical shell (from 'near' to 'far') and each tile.
  3. Instantiate each Gaussian according to the number of tiles they overlap and assign each instance a key that combines view space depth and tile ID.
  4. Sort Gaussians based on these keys (i.e., view space depth)
  5. Produce a list for each tile by identifying the first and last depth-sorted entry that splats to a given tile.
  6. For a given pixel, accumulate color and alpha values by traversing the list front-to-back until we reach a target saturation of alpha in the pixel.

This tile-based rasterizer is derived from the 3DGS rasterizer (Section 6 of the 3DGS paper), with modifications such as culling techniques adapted to suit spherical cameras.
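To make the mapping onto the shared equirectangular plane concrete, here is a minimal sketch of projecting a camera-space Gaussian center to equirectangular pixel coordinates; the axis convention and normalization are our assumptions and may differ from the paper's Eq. (5).

```python
import numpy as np

def project_equirect(p_cam, width, height):
    """Map a camera-space 3D point to equirectangular (u, v) pixel coordinates.

    Assumes z forward, x right, y up; longitude spans [-pi, pi] across the
    image width and latitude spans [-pi/2, pi/2] across the height.
    """
    x, y, z = p_cam
    r = np.linalg.norm(p_cam)
    lon = np.arctan2(x, z)                 # azimuth around the vertical axis
    lat = np.arcsin(y / r)                 # elevation above the equator
    u = (lon / (2 * np.pi) + 0.5) * width
    v = (0.5 - lat / np.pi) * height
    return np.array([u, v]), r             # r can serve as the sorting depth
```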

Comment

Thanks for your patience and for providing much more detailed explanations of the rasterization process. I believe these now finally address my concerns about the approach (due to the presentation in the paper), and I hope this level of clarity can be included in the paper.

The remaining concerns and questions had already been addressed in the initial answer as part of the rebuttal. Consequently, I am willing to raise my rating to accept this submission.

Comment

Dear reviewer cMFb,

Thank you for your positive reassessment of our work. We greatly appreciate your valuable and constructive review, and we will update the draft to ensure that its presentation meets the level of clarity highlighted in your feedback.

Best regards,

Authors.

Author Response

Statements to All Reviewers

We appreciate all reviewers for their valuable comments. We have thoroughly examined the reviews and hope to address all questions and misunderstandings through this rebuttal. We have responded to each reviewer's queries in a question-and-answer format. We will also fix the typos and polish the writing to enhance the clarity of the paper in the updated manuscript.

Final Decision

The submitted paper has undergone a thorough review process by four reviewers. After careful consideration of the reviewers' feedback and ratings, I believe this paper should be accepted for publication at the conference.

The paper initially received some concerns from two reviewers about the novelty of the approach. However, the authors were able to effectively address these concerns in their rebuttal, leading the initially skeptical reviewers to increase their ratings. Additionally, the missing references [1, 2, 3], available on arXiv, are recognized as concurrent work; evaluation and discussion with respect to these concurrent works is encouraged but not compulsory. Although one reviewer remains negative, their main concerns and the missing baselines are not particularly relevant to the proposed work and were therefore not weighted heavily in the final decision.

This paper provides a practical and effective solution to omnidirectional scene reconstruction using Gaussian splatting which contributes to an important and growing research area.

[1] Huang, Letian, et al. "On the Error Analysis of 3D Gaussian Splatting and an Optimal Projection Strategy." arXiv preprint arXiv:2402.00752 (2024). ECCV 2024.
[2] Li, Longwei, et al. "OmniGS: Omnidirectional Gaussian Splatting for Fast Radiance Field Reconstruction using Omnidirectional Images." arXiv preprint arXiv:2404.03202 (2024).
[3] Bai, Jiayang, et al. "360-GS: Layout-guided Panoramic Gaussian Splatting For Indoor Roaming." arXiv preprint arXiv:2402.00763 (2024).