Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning
Abstract
Reviews and Discussion
This paper introduces a framework for 3D scene inpainting using 3D Gaussian Splatting that leverages multi-view information. The framework consists of two key components: VISTA-GI, which guides inpainting by measuring the visibility uncertainty of 3D points across multiple views, and VISTA-CL, which learns scene concepts without masked objects using a diffusion model. These components work iteratively to refine the 3D representation using geometric and semantic information progressively. The method extends to handling dynamic distractors in scenes, making it effective for challenging scenarios like underwater scenes with moving objects. Evaluated on the SPIn-NeRF dataset and a new underwater dataset derived from UTB180, VISTA demonstrates superior performance over state-of-the-art methods in quantitative metrics and visual quality, particularly in maintaining consistency across viewpoints.
Strengths
- Novel use of visibility uncertainty to guide 3D inpainting
- Integration of 3D Gaussian Splatting with diffusion-based scene learning
- Iterative refinement combining geometric and semantic cues
- Strong quantitative results across metrics (PSNR, SSIM, LPIPS, etc.)
Weaknesses
- Missing analysis of how the visibility uncertainty threshold affects results
- The iterative process requires three iterations, potentially making it computationally expensive for real-time applications, and there is no clear ablation showing the impact of different numbers of iterations beyond Figure 5
- No discussion of failure cases or limitations with large occlusions
- Only tested on ten forward-facing scenes from the SPIn-NeRF dataset - a relatively small evaluation set and scenario.
- Missing key hyperparameters like the size of the adjacent view window for uncertainty calculation
- No evaluation of common real-world scenarios like removing people from tourist photos. Dynamic object types are limited (mostly fish)
Questions
- What thresholds were used for the visibility uncertainty measure in Equation (2)? How sensitive is the method to these thresholds? Could you provide ablation studies showing how different thresholds affect the final results?
- Why exactly are three iterations optimal? The plot in Figure 5 shows diminishing returns, but could more iterations help in challenging cases? What is the runtime for each iteration? Is real-time performance possible with architectural optimizations?
- Could you provide examples of failure cases and discuss their causes?
- Could you evaluate more diverse scenarios, especially tourist photos with people?
To Reviewer VfkU:
1. Ablation study of the uncertainty map
Response:
Thank you for your comments. In our practical implementation, the uncertainty calculation (i.e., the variance) is performed in the LAB color space and is normalized. Therefore, we do not explicitly define a threshold for the uncertainty itself. If your concern refers to the uncertainty threshold in Eq. (3), we investigate how different increasing strategies affect reconstruction quality. In our paper, the hyperparameter is initialized to 0 and increases by 0.1 with each iteration. We evaluate five different increase ratios {0.1, 0.2, 0.3, 0.4, 0.5} and analyze their impact on reconstruction quality across iterations using our dataset. As shown in Figure 12 (b), all ratios lead to improved PSNR values over iterations. In the first two iterations, a higher increase ratio improves the reconstruction performance. However, an increase ratio above 0.1 means that the algorithm becomes overly confident in the inpainted areas too early, resulting in insufficient interaction of geometric and semantic information between the VISTA-GI and VISTA-CL modules, which subsequently leads to a decline in reconstruction performance in later iterations.
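For concreteness, the snippet below is a minimal sketch of how such a normalized per-pixel variance over a target rendering and its adjacent views could be computed in LAB space. The stacking of already-aligned views, the channel averaging, and the exact normalization are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def visibility_uncertainty(views_lab):
    """Per-pixel visibility uncertainty as the variance of LAB colors that
    the same pixel receives from the target rendering and its adjacent
    reference renderings.

    views_lab: (N, H, W, 3) array holding the target view and its adjacent
               views, already converted to LAB and projected/aligned into
               the target viewpoint (the alignment step is assumed here).
    """
    views_lab = np.asarray(views_lab, dtype=np.float64)

    # Variance across the N views, averaged over the three LAB channels.
    per_pixel_var = views_lab.var(axis=0).mean(axis=-1)      # (H, W)

    # Normalize by the standard deviation of the whole map (see the
    # normalization response below) so values are comparable across scenes.
    return per_pixel_var / (per_pixel_var.std() + 1e-8)
```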
2. About iteration and time consumption
Response:
Thank you for the comments.
The influence of the iteration number: In our setup, the number of iterations is strongly correlated with the hyperparameters involved in Figure 12. Simply increasing the number of iterations without altering the hyperparameters goes against the original intent. To evaluate the impact of different numbers of iterations, we analyzed two hyperparameters that are related to the number of iterations. By examining Figures 12 (a) and (b), we can infer that fewer iterations lead to insufficient interaction of geometric and semantic information between VISTA-GI and VISTA-CL. Conversely, excessive iterations do not improve performance on challenging scenes and may introduce excessive computational costs.
The total time cost: To quantitatively evaluate performance and computational efficiency, we compare our method against baseline approaches (InFusion, SPIn-NeRF, and SpotLess) on the synthetic scene shown in Fig.8. This scene provides ground truth data, enabling evaluation through reference-based metrics for both rendering quality and computational efficiency during optimization. As shown in Table 9, while our method incurs additional computational overhead compared to vanilla 3DGS due to integrating iterations and diffusion models, it achieves superior rendering quality while maintaining comparable efficiency to state-of-the-art 3DGS methods (e.g., SpotLess). Furthermore, our approach demonstrates significantly better reconstruction quality while being approximately 10× faster than leading NeRF-based methods such as SPIn-NeRF.
Our method indeed introduces higher computational resource demands due to the iterative repair process. For one scenario in our dataset, the specific time comparisons are as follows:
Table 9: Quantitative results and time costs on the synthesis data in Fig.8.
| Metric | InFusion | SPIn-NeRF | SpotLess | Ours |
|---|---|---|---|---|
| LPIPS | 0.23 | 0.15 | 0.14 | 0.10 |
| PSNR | 19.34 | 23.33 | 24.75 | 26.38 |
| Time | 16m34s | 7h32m18s | 30m26s | 33m34s |
Time cost for each iteration: In the aforementioned case, our method undergoes three rounds of iterations, with an average iteration time of approximately 10 minutes per iteration. By optimizing the current pipeline and reusing the reconstruction results from the previous iteration, it may be possible to further reduce the runtime to achieve real-time performance. Such acceleration efforts are still worth further investigation.
3. Failure cases and limitations
Response: Thank you for providing us with a new perspective to analyze our method. We have added the discussion of failure cases of our algorithm in Sec A.2 with Figure 10. Due to errors in the prior mask, some fish were not detected by the object detection model. Furthermore, since the fish did not move significantly during the shooting process, these areas did not produce inconsistencies across multiple viewpoints during reconstruction, making it difficult for the VISTA-GI component to identify these areas through uncertainty. This also validates our algorithm design approach: VISTA-CL introduces semantic information through masks, while VISTA-GI incorporates geometric information through uncertainty, complementing each other to remove distractors. However, in the failed cases, issues arose in both aspects, resulting in poor reconstruction quality of the final scene.
4. Our work uses the largest evaluation set compared to SOTA methods
Response:
Thank you for your comments. We collected the reported datasets and scene counts used in the referenced SOTA method papers, and the details are as follows:
| Method | Training datasets | Number of scenes |
|---|---|---|
| InFusion | Mip-NeRF 360, SPIn-NeRF | 19 |
| SPIn-NeRF | SPIn-NeRF | 10 |
| RefFusion | SPIn-NeRF | 10 |
| GScream | SPIn-NeRF, IBRNet | 15 |
| SpotLess | NeRF On-the-go, RobustNeRF | 10 |
| Ours | Underwater 3D Inpainting, SPIn-NeRF | 20 |
It can be seen that our method utilizes the largest number of scenes to validate its effectiveness. The underwater dataset we used is derived from videos shot by real-world divers. Due to the effects of the underwater medium and floating debris, these underwater scenes represent an extreme scenario in the real world.
5. Influence of the number of adjacent views
Response:
Thank you for your careful review comments. When calculating the uncertainty of a particular reconstructed viewpoint, our algorithm selects the 6 nearest viewpoints as adjacent reference views. To further clarify the details of this setting, we conducted an ablation study on this hyperparameter, with the results as follows:
| Num of adjacent views | UCIQE | URanker | CLIP Score |
|---|---|---|---|
| 2 | 0.49 | 1.53 | 0.69 |
| 4 | 0.51 | 1.63 | 0.71 |
| 6 | 0.51 | 1.64 | 0.72 |
| 8 | 0.51 | 1.64 | 0.72 |
It can be observed that when too few viewpoints are selected, performance declines unacceptably due to insufficient spatial points, leading to inaccurate variance estimation. On the other hand, selecting more adjacent viewpoints does not improve the reconstruction quality and, due to the increased point count, results in higher computational demands.
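As an illustration of the neighbor-selection step described above, the sketch below picks the k nearest reference views by camera-center distance; the distance criterion and all names are assumptions for illustration, since the response does not specify how "nearest" is measured.

```python
import numpy as np

def nearest_adjacent_views(camera_centers, target_idx, k=6):
    """Select the k reference views whose camera centers are closest to the
    target view's camera center (Euclidean distance, assumed here)."""
    centers = np.asarray(camera_centers, dtype=np.float64)   # (N, 3) positions
    dists = np.linalg.norm(centers - centers[target_idx], axis=1)
    dists[target_idx] = np.inf                               # exclude the target itself
    return np.argsort(dists)[:k]

# Example with the k = 6 setting used in the ablation above:
# neighbors = nearest_adjacent_views(camera_centers, target_idx=0, k=6)
```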
6. More common real-world scenarios
Response:
Thank you for the comments. As we discussed earlier in the response, underwater scenes are a challenging type of scenario in the real world.
However, to further demonstrate the effectiveness of our method, we tested it on a pedestrian-removal scene from the NeRF On-the-go dataset. This scene, called tree, contains 212 images, with the main distractors coming from moving pedestrians. As shown in Figure 9, our method achieved very high-quality results on this dataset. Due to the abundance of viewpoints in the dataset, there is a lot of supplementary information between perspectives, allowing our method to effectively utilize other viewpoints to repair the blurring caused by moving distractors.
After reviewing other comments and the rebuttal, I would like to maintain my negative rating for these reasons:
- The authors' explanation of texture consistency relies too heavily on basic L2/SSIM losses without clear innovation, as highlighted by Reviewer eMA4's concerns.
- The ablation studies in the rebuttal show only modest improvements in metrics (Table 7), suggesting limited effectiveness of the individual components.
- The rebuttal reveals significant performance issues with viewpoint variations - PSNR drops dramatically from 26.25 to 15.66 with increased sampling intervals. I don't quite understand this effect.
- Limited generalizability testing - only one additional real-world scene was tested despite multiple reviewers requesting a broader evaluation.
Please carefully read our paper and response, and provide reasonable and responsible feedback on the rating.
- Regarding Texture Consistency: We introduced L2/SSIM as relevant background for texture consistency, rather than as a means to maintain texture consistency itself. Our method uses uncertainty to help the algorithm determine which areas have consistent textures, as discussed in our response to Reviewer eMA4.
- Ablation Studies: Non-reference metrics in the generation domain (such as CLIP Score, URanker, etc.) are not as sensitive to generation quality as reference metrics like SSIM and PSNR. While our results may appear to show relatively small differences in non-reference metrics, significant differences in the generated results can be observed from our visualizations. We also provided a synthetic dataset in Table 7 (with ground truth for calculating reference metrics) to demonstrate the advantages of our method in dynamic scenes.
- Large Variations in Viewpoint Are Not a Performance Issue: Please carefully read our rationale for setting up such ablation studies. Increasing viewpoint variation involves incorporating a large portion (from 50% to 85%) of the training data into the test set, which increases the task difficulty. Our method only shows noticeable performance degradation under extreme conditions, demonstrating robustness.
- We Have Provided More Visualization Results: The underwater dataset is a challenging real-world scenario, and we have also provided visual results for pedestrian-removal scenes. The number of scenes our method is compared on exceeds that of other SOTA methods in the field, so requesting visualizations for even more scenes is unreasonable.
The paper introduces a general 3D Gaussian Splatting (3DGS) inpainting framework called VISTA, which achieves 3D inpainting for both static and dynamic objects within 3DGS reconstruction, resulting in enhanced novel view synthesis performance. By leveraging multi-view information, the method extracts visibility uncertainty maps to guide the 3D reconstruction of inpainted scenes. Additionally, masked multi-view images are employed to learn the scene concept through textual inversion techniques, which improves the accuracy of image inpainting priors. VISTA iteratively inpaints scenes and learns scene concepts from inter-view visibility information, enhancing scene consistency and preserving scene identity during the 3D inpainting process.
Strengths
- The work proposes a general 3D Gaussian inpainting framework applicable to both static and dynamic scenes.
- The proposed VISTA leverages multi-view visibility information for more detailed texture consistency, facilitating the inpainting of unseen areas through conceptual learning.
- The whole paper is well-written and easy to follow.
Weaknesses
- Although the comparison experiments are sufficient and the discussions are insightful, there is a lack of certain experiments, such as an ablation study on the effect of conceptual learning. Please see details in Question 1.
- The paper does not discuss the multi-view inpainting consistency of this method, nor does it compare the results with or discuss differences from those obtained using video diffusion models. See details in Question 2.
Questions
- In a complex scene (e.g., with objects and backgrounds), the scene features are not as distinct as an object's. Is the conceptual learning robust in such scenes? How does conceptual learning improve inpainting performance?
- It would be beneficial to discuss multi-view consistency during inpainting, as well as to review and discuss video inpainting works that concern multi-view consistency.
- Minor issues:
  3.1. Fig.1: InFusion's upper results with red boxes appear to be incorrectly cropped in misaligned areas?
  3.2. Fig.2 refers to wrong section numbers.
To Reviewer qvjD:
1. Ablation study of the conceptual learning
Response:
Thank you for the comments. We have conducted ablation experiments on two components of our algorithm using our dataset, and the specific results are as follows:
| Method | UCIQE | URanker | CLIP Score |
|---|---|---|---|
| Ours w/o VISTA-GI | 0.48 | 1.52 | 0.70 |
| Ours w/o VISTA-CL | 0.50 | 1.59 | 0.69 |
| Ours | 0.51 | 1.64 | 0.72 |
Our experiments demonstrate two key findings. First, attempting reconstruction using only a 2D generative model (i.e., VISTA-CL) without VISTA-GI leads to significantly degraded image quality metrics. This validates that VISTA-GI's uncertainty guidance effectively mitigates multi-view inconsistencies during 3D reconstruction, resulting in higher-quality outputs. Second, while omitting VISTA-CL maintains image quality comparable to existing methods like SpotLess and SPIn-NeRF, the lack of concept-guided learning significantly reduces CLIP-Score metrics. This indicates that without conceptual constraints, the inpainting process produces results that are visually plausible but semantically inconsistent with the scene context.
To address the concerns, we have added the discussion in Sec 5.
2. Multi-view consistency
Response:
Thank you for the comments. Given multi-view images of a static scene, the original 3DGS trains the Gaussian splats with a pixel-wise reconstruction loss combined with an SSIM loss (i.e., Eq. (1) in the submission) to make sure the Gaussian splats can render detail-preserving images across all perspectives. Hence, texture consistency across views is implicitly achieved via this mix of reconstruction and SSIM losses, which are good at reconstructing structure and details, as validated in [A].
However, when the scene contains dynamic objects, an object can be seen in one view but not observed in another. This leads to conflicts between different view renderings and disrupts the training of the Gaussian splats under the original objective function.
Our visibility uncertainty map indicates the potential conflicts across different views, and we use it to reweight the contributions of different pixels in Eq. (1), yielding Eq. (4) in the submission. Intuitively, we assign low weights to pixels with high visibility uncertainty. Such a strategy avoids conflicts during the training of the Gaussian splats and leads to better texture consistency.
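A minimal PyTorch sketch of this kind of uncertainty-weighted reweighting is given below. The function names, the clamping, and the per-pixel SSIM helper are illustrative assumptions, not the submission's exact Eq. (4).

```python
import torch

def uncertainty_weighted_loss(render, gt, uncertainty, ssim_map_fn, lam=0.2):
    """Sketch of an uncertainty-reweighted photometric objective in the
    spirit of the reweighting described above.

    render, gt:   (3, H, W) rendered and observed images.
    uncertainty:  (H, W) normalized visibility uncertainty, clamped to [0, 1].
    ssim_map_fn:  callable returning an (H, W) per-pixel SSIM map (assumed
                  to be supplied by the caller).
    lam:          SSIM mixing weight, as in the standard 3DGS objective.
    """
    # High-uncertainty pixels (likely multi-view conflicts such as dynamic
    # objects) get low weight, so they barely influence the Gaussians.
    weight = 1.0 - uncertainty.clamp(0.0, 1.0)                    # (H, W)

    l1_term = (weight.unsqueeze(0) * (render - gt).abs()).mean()
    dssim_term = (weight * (1.0 - ssim_map_fn(render, gt))).mean()

    return (1.0 - lam) * l1_term + lam * dssim_term
```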
We analyzed the quality of reconstruction between the masked and non-masked areas in an underwater scene. The results are as follows:
Table 8: Quantitative results of different areas.
| Method | Non-masked PSNR | Masked PSNR | Non-masked SSIM | Masked SSIM |
|---|---|---|---|---|
| w/ mask | 26.92 | 11.9 | 0.88 | 0.22 |
| w/o mask | 24.81 | 11.23 | 0.81 | 0.20 |
It can be observed that even when the mask is not used, our method maintains good texture consistency in both the masked and non-masked areas. For the restoration of the masked areas, our method also ensures texture consistency through the constraints imposed by uncertainty.
[A] H. Zhao, O. Gallo, I. Frosio and J. Kautz, "Loss Functions for Image Restoration With Neural Networks," in IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47-57, 2017.
3. Different from video diffusion models
Response:
Thank you for providing us with a new perspective to analyze our method.
General video generation methods (such as Tune-A-Video, VideoBooth, and AnimateDiff) employ cross-frame attention, such as temporal attention, to maintain object consistency across multiple frames. These video generation algorithms typically do not build explicit 3D models at runtime; instead, they merge multi-frame information through motion blocks, with the expected output being a sequence of 2D images.
In contrast, methods based on 3D modeling explicitly reconstruct a 3D model, with the desired output being a clean and coherent 3D scene. Since NeRF and 3DGS methods inherently possess some capability to maintain viewpoint consistency, 3D reconstruction methods are generally more stable in preserving viewpoint consistency. Moreover, 3D-consistent video priors can be beneficial for reconstruction. For example, MVInpainter utilizes video along with 2D inpainting to assist in tasks such as object removal and composition during reconstruction.
To sum up, since our method focuses on generating a 3D scene, while video generation models aim to produce a video, some methods for maintaining 3D consistency in video generation cannot be directly applied to our task. Although video generation and 3D reconstruction can produce similar visual effects, there is a fundamental difference in their output focus.
4. About these minor issues
Response:
Thank you for your careful review comments. We have fixed these mistakes in the revised paper.
I thank the authors for the effort made in their detailed responses. After reading the other reviews and rebuttals, I find that my main concerns have been addressed.
In addition, the revised version includes more comprehensive experiments for robustness validation, such as runtime comparisons, more ablation studies, and an exploration of robustness across different levels of input frame sparsity. These evaluations strengthen the paper and suggest new directions for future research.
Given that the multi-view consistency issue was highlighted by several reviewers, I recommend that the detailed discussion on multi-view consistency (e.g., Table 8) be included as supplementary content in the revised paper.
Considering that the revision provides sound experiments verifying the method's robustness, I would like to raise my score and recommend acceptance.
Thank you for your thoughtful feedback and for recognizing the efforts we put into our detailed responses.
We are delighted to hear that you find our revised version to include more comprehensive experiments that validate the robustness of our method. We appreciate your suggestion to include a discussion on multi-view consistency as supplementary content, and we will incorporate this in our final revision.
Thank you again for your positive feedback!
This paper presents VISTA, a novel framework for 3D Gaussian inpainting that leverages visibility uncertainty and scene conceptual learning to address the challenges of inpainting in both static and dynamic scenes. The authors propose a method that integrates visibility uncertainty maps to guide the inpainting process, allowing for more coherent and contextually appropriate fills for masked regions. The framework is evaluated on two challenging datasets, demonstrating superior performance compared to existing state-of-the-art methods.
Strengths
• Originality: The integration of visibility uncertainty with scene conceptual learning is a fresh approach that addresses existing limitations in 3D inpainting. Although the idea is intuitively simple, the performance is good thanks to the complementary visual cues.
• Quality: The experiments are robust, showcasing the effectiveness of the proposed method across diverse datasets. The paper also presents some ablation studies in the discussion and appendix sections to support the results.
• Clarity: The main ideas are presented clearly, and the paper provides a comprehensive overview of related work and the proposed contributions.
• Significance: The results indicate potential applications in augmented and virtual reality, highlighting the practical implications of the research.
Weaknesses
• While the paper presents a robust methodology, certain experimental comparisons could be expanded. For instance, a more detailed analysis of the impact of different hyperparameters on performance would provide deeper insights, particularly regarding the parameter v in equation (3) and the rate at which the initial noise decreases during inference.
• There are some writing issues to address: in Section 4.1, you state that Figure 1 shows the initial representation's failure to exclude dynamic objects; however, Figure 1 actually displays the results of different approaches rather than the initial representation itself.
Questions
- Since the iterative process may slow down the pipeline, how does the computational efficiency of VISTA compare to other state-of-the-art methods, particularly in real-time applications?
- In equation (3), Ui represents the visibility uncertainty map for a single frame, and the uncertainty for each pixel is calculated from equation (2). I would like to know if any normalization is applied to Ui between equations (2) and (3).
- Regarding the experimental setup illustrated in the appendix, you mention that the initial noise strength during inference decreases by 0.2 with each iteration. Was any ablation study conducted? I agree that reducing the initial noise over iterations is reasonable, as the rendered image quality becomes more confident with each iteration, but does this hyperparameter affect the final result?
- The results in Figure 7 show the reconstruction without a mask, which retains the fish in the top-left corner of the image while removing the other fish. I believe the visibility uncertainty used in Equation (3) to update the mask has favorable properties for removing dynamic distractors. However, why does only the top-left fish remain? Could you also provide some statistical evaluation for this ablation study?
- I would also like to inquire about the image resolution used for training and inference with stable-diffusion-v1.5, as the resolution may influence the inpainting quality in SD-1.5.
4. Influence of image resolution
Response:
Thank you for the comments. In our experimental setup, we use Stable Diffusion v1.5 as the inpainting model and train and test the model following its default setup: if the input image has a resolution higher than 512x512, we crop the image to a new size that is both the closest to the original image size and a multiple of 8; if the input image is smaller than 512x512, we rescale the image to 512x512. To analyze the influence of this strategy at different original resolutions, given an original scene with input images of size 1299x974, we downsample these images to four resolutions: 64x64, 128x128, 256x256, and 512x512. Then, for each resolution, we build a 3D model and evaluate the rendering quality. As shown in Table 2, we observe that: (1) reducing the resolution to 512x512 does not significantly impact any of the metrics, demonstrating our method's robustness to substantial resolution changes; (2) further decreasing the resolution leads to gradual degradation in reference-based metrics, while non-reference metrics remain relatively stable.
We have added the above discussion in the Appendix Sec.A.4.
Table 2: Quantitative ablation results of different resolutions.
| Resolution | LPIPS | SSIM | PSNR | CLIP Score | URanker | UCIQE |
|---|---|---|---|---|---|---|
| 64x64 | 0.51 | 0.68 | 16.27 | 0.64 | -0.23 | 0.56 |
| 128x128 | 0.42 | 0.69 | 18.89 | 0.69 | 0.29 | 0.57 |
| 256x256 | 0.26 | 0.71 | 21.33 | 0.69 | 0.42 | 0.57 |
| 512x512 | 0.11 | 0.84 | 26.04 | 0.69 | 0.30 | 0.57 |
| 1299x974 | 0.10 | 0.86 | 26.38 | 0.70 | 0.44 | 0.58 |
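For reference, a minimal sketch of the crop-or-rescale preprocessing described above is given below. The center-crop choice and the handling of images where only one side is below 512 are assumptions for illustration, not the exact code used in the paper.

```python
from PIL import Image

def preprocess_for_sd15(img: Image.Image) -> Image.Image:
    """Resolution handling in the spirit of the response above:
    images larger than 512x512 are cropped to the nearest multiple-of-8
    size (center crop assumed), smaller images are rescaled to 512x512."""
    w, h = img.size
    if w >= 512 and h >= 512:
        # Round each side down to the nearest multiple of 8, then center-crop.
        new_w, new_h = (w // 8) * 8, (h // 8) * 8
        left, top = (w - new_w) // 2, (h - new_h) // 2
        return img.crop((left, top, left + new_w, top + new_h))
    # Inputs below the model's native resolution are rescaled to 512x512.
    return img.resize((512, 512), Image.BICUBIC)
```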
5. Concerns about Fig. 7
Response:
We appreciate your careful review. Regarding why the fish in the top-left corner of Fig. 7 is retained while the other fish are removed: this is primarily because the fish remains stationary across different views (as evident in the optical flow map of Fig. 7, where the top-left fish exhibits low flow values at its center). Consequently, it has a lower value in the visibility uncertainty map (see the corresponding map in Fig. 7). Without using a mask to manually label this area for repair, the fish's geometric characteristics resemble those of a stationary object, such as a rock, making it indistinguishable to our uncertainty-based detection.
In contrast, moving fish create significant geometric inconsistencies across viewpoints, enabling our uncertainty detection to flag them as anomalies. This leads to their removal through the inpainting process. To address these challenging scenarios, we introduced mask annotations for fish detection, providing semantic guidance for our inpainting method. As shown in the last column of Fig. 7, incorporating the mask ensures the successful removal of the top-left fish.
To address the concern, we have revised the discussion in A.1 and Figure 7 by adding the optical flow results and uncertainty map.
To Reviewer M4f5:
1. Impact of Hyperparameters
Response:
Thank you for the suggestions and comments. We have added the discussion about the influence of the hyperparameter in Eq. (3) and noise reduction ratios in the Appendix section Sec A.3 with the revised Figure 12.
Impact of the threshold v in Eq. (3): We investigate how different increasing strategies affect reconstruction quality. In our paper, the hyperparameter is initialized to 0 and increases by 0.1 with each iteration. We evaluate five different increase ratios {0.1, 0.2, 0.3, 0.4, 0.5} and analyze their impact on reconstruction quality across iterations using our dataset. As shown in Figure 12 (b), all ratios lead to improved PSNR values over iterations. In the first two iterations, a higher increase ratio improves the reconstruction performance. However, an increase ratio above 0.1 means that the algorithm becomes overly confident in the inpainted areas too early, resulting in insufficient interaction of geometric and semantic information between the VISTA-GI and VISTA-CL modules, which subsequently leads to a decline in reconstruction performance in later iterations.
Impact of noise reduction ratios in diffusion inference: During diffusion model inference, we investigate how different noise reduction strategies affect reconstruction quality. Starting from an initial noise strength of 1.0, we systematically decrease the noise at each iteration by a fixed ratio. We evaluate four different reduction ratios {0.1, 0.2, 0.3, 0.4} and analyze their impact on reconstruction quality across iterations using our dataset. As shown in Figure 12 (a), while all ratios lead to improved PSNR values over iterations, the reduction ratio of 0.2 achieves optimal convergence in the fewest iterations. Based on this empirical analysis, we adopt 0.2 as the noise reduction ratio in our method.
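The two per-iteration schedules described above can be summarized in the small sketch below; the function names and the zero-based iteration indexing are assumptions for illustration.

```python
def threshold_schedule(iteration, step=0.1, initial=0.0):
    """Uncertainty threshold in Eq. (3): starts at 0 and grows by `step`
    (0.1 in the paper) with each outer refinement iteration."""
    return initial + step * iteration

def noise_strength_schedule(iteration, reduction=0.2, initial=1.0):
    """Diffusion inpainting noise strength: starts at 1.0 and decreases by
    `reduction` (0.2 in the paper) per iteration, floored at zero."""
    return max(initial - reduction * iteration, 0.0)

# Example: the three refinement iterations used in the paper.
for it in range(3):
    v = threshold_schedule(it)        # 0.0, 0.1, 0.2
    s = noise_strength_schedule(it)   # 1.0, 0.8, 0.6
```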
2. Computing cost comparison and analysis
Response:
Thank you for the comments. To quantitatively evaluate performance and computational efficiency, we compare our method against baseline approaches (InFusion, SPIn-NeRF, and SpotLess) on the synthetic scene shown in Fig.8. This scene provides ground truth data, enabling evaluation through reference-based metrics for both rendering quality and computational efficiency during optimization. As shown in Table 1, while our method incurs additional computational overhead compared to vanilla 3DGS due to integrating iterations and diffusion models, it achieves superior rendering quality while maintaining comparable efficiency to state-of-the-art 3DGS methods (e.g., SpotLess). Furthermore, our approach demonstrates significantly better reconstruction quality while being approximately 10× faster than leading NeRF-based methods such as SPIn-NeRF.
Table 1: Quantitative results and time costs on the synthesis data in Fig.8.
| Metric | InFusion | SPIn-NeRF | SpotLess | Ours |
|---|---|---|---|---|
| LPIPS | 0.23 | 0.15 | 0.14 | 0.10 |
| PSNR | 19.34 | 23.33 | 24.75 | 26.38 |
| Time | 16m34s | 7h32m18s | 30m26s | 33m34s |
3. Normalization of the visibility uncertainty map
Response:
Thank you for the comments. To standardize the visibility uncertainty values, we apply normalization to the pixel-wise uncertainty map computed using Eq. (2). Specifically, each pixel's uncertainty value is divided by the standard deviation of all uncertainty values in the map, and the resulting normalized visibility uncertainty map is used in Eq. (3).
To address the concern, we have added the above revision into Sec. 4.1.
Dear Reviewer M4f5,
I hope this message finds you well.
I am one of the authors. First, I would like to express my gratitude for your time and effort in reviewing our paper. We greatly appreciate your valuable feedback and have made the necessary revisions based on your comments.
As the submission deadline approaches, we would like to ask if you could provide further insights or feedback as soon as possible. Your thoughts are very important to us, and we are eager to make any final adjustments before resubmitting.
Thank you for your understanding and support!
This paper introduces VISTA (Visibility-Uncertainty-Guided 3D Gaussian Inpainting via Scene Conceptual Learning), a framework specifically designed for 3D scene inpainting. VISTA combines visibility uncertainty maps with scene conceptual learning to improve inpainting quality across multiple viewpoints. The main contributions include VISTA-GI, which guides 3D Gaussian inpainting across views, and VISTA-CL, which uses scene conceptual learning to fill in areas lacking semantic information. Experimental evaluations demonstrate that VISTA outperforms existing methods on the SPIn-NeRF dataset for static scenes and an underwater dataset with dynamic objects.
Strengths
- VISTA introduces a novel multi-view 3D inpainting framework by combining visibility uncertainty maps with a diffusion model. This approach effectively uses complementary information across multiple views and includes scene conceptual learning to handle semantic filling, showing a certain level of originality.
- The methodology clearly separates VISTA-GI and VISTA-CL, using an alternating iterative optimization to ensure consistency across views. This design handles both static and dynamic objects, broadening the framework's flexibility for different application scenarios.
Weaknesses
- The main contribution of this paper is the introduction of visibility uncertainty to guide inpainting, but the uncertainty map considers only the visibility of 3D points across views without explicitly addressing texture consistency. This may lead to color or texture inconsistencies in the inpainted regions. Further analysis on how the model maintains texture consistency would strengthen this aspect.
- VISTA-GI is designed to maintain consistency between neighboring views. However, as noted in the appendix, it struggles with scenes that have large viewpoint differences, which haven't been quantitatively evaluated. Testing on larger viewpoint changes would provide a clearer picture of the model's performance in maintaining consistency across wide angles.
- The framework involves multiple components (visibility uncertainty maps, a diffusion model, and scene conceptual learning), which could increase computational demands. This may impact feasibility in large-scale or real-time applications. Providing more details on computational costs, such as training time, would support its applicability.
- The paper does not include ablation studies for VISTA-GI and VISTA-CL, making it difficult to assess the individual contribution of each component. Adding these ablation studies would clarify each module's role in the framework's overall performance.
- The paper includes few visual results on the SPIn-NeRF dataset, limiting readers' understanding of the framework's effectiveness in static scenes. Providing more visual results across different scenarios, as well as showing the impact of the visibility uncertainty map on static scenes, would help illustrate the method's inpainting capabilities.
- There is a lack of visualizations of depth maps to verify the geometric accuracy of the inpainting results.
- The paper does not compare VISTA with GScream [1], which is another 3D inpainting method operating in a similar setting. Including a comparison with GScream would help establish VISTA's advantages.
[1] Wang, Yuxin, et al. "GScream: Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal" ECCV. 2024.
Questions
- I still have concerns about texture consistency. Could you explain how the model ensures texture consistency in the inpainting results and provide visualizations of texture across different viewpoints in static scenes?
- I am curious about the performance drop with large viewpoint differences. Could you quantify this to assess VISTA's robustness to viewpoint variations?
- Could you conduct ablation studies on VISTA-GI and Scene Conceptual Learning to show their specific roles within the overall framework?
- Could you visualize depth maps to demonstrate the geometric accuracy of the inpainting results?
To Reviewer eMA4:
1. How visibility uncertainty benefits texture consistency
Response:
Thank you for the comments. Given multi-view images of a static scene, the original 3DGS trains the Gaussian splats with a pixel-wise reconstruction loss combined with an SSIM loss (i.e., Eq. (1) in the submission) to make sure the Gaussian splats are able to render detail-preserving images across all perspectives. Hence, texture consistency across views is implicitly achieved via this mix of reconstruction and SSIM losses, which are good at reconstructing structure and details, as validated in [A].
However, when the scene contains dynamic objects, an object can be seen in one view but not observed in another. This leads to conflicts between different view renderings and disrupts the training of the Gaussian splats under the original objective function.
Our visibility uncertainty map indicates the potential conflicts across different views, and we use it to reweight the contributions of different pixels in Eq. (1), yielding Eq. (4) in the submission. Intuitively, we assign low weights to pixels with high visibility uncertainty. Such a strategy avoids conflicts during the training of the Gaussian splats and leads to better texture consistency.
We analyzed the quality of reconstruction between the masked and non-masked areas in an underwater scene. The results are as follows:
Table 3: Quantitative results of different areas.
| Method | Non-masked PSNR | Masked PSNR | Non-masked SSIM | Masked SSIM |
|---|---|---|---|---|
| w/ mask | 26.92 | 11.9 | 0.88 | 0.22 |
| w/o mask | 24.81 | 11.23 | 0.81 | 0.20 |
It can be observed that even when the mask is not used, our method maintains good texture consistency in both the masked and non-masked areas. For the restoration of the masked areas, our method also ensures texture consistency through the constraints imposed by uncertainty.
[A] H. Zhao, O. Gallo, I. Frosio and J. Kautz, "Loss Functions for Image Restoration With Neural Networks," in IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47-57, 2017.
2. Quantitative analysis of large viewpoint differences
Response:
To evaluate the impact of varying viewpoint differences, we first capture 34 images from continuously distributed viewpoints around a scene to create a ground truth (GT) 3DGS model. We then systematically reduce the number of viewpoints by sampling them at different intervals (from 2 to 7), where larger intervals represent larger viewpoint differences. For each sampling interval, we construct a new 3DGS model and assess its quality by comparing its rendered images against those from the GT model using standard metrics: LPIPS, SSIM, and PSNR. This methodology allows us to quantitatively analyze how viewpoint difference affects reconstruction quality.
Table 4: Quantitative results of large viewpoint differences.
| Sampling interval | LPIPS | SSIM | PSNR |
|---|---|---|---|
| 2 | 0.09 | 0.89 | 26.25 |
| 3 | 0.14 | 0.83 | 23.42 |
| 4 | 0.16 | 0.80 | 22.42 |
| 5 | 0.27 | 0.65 | 18.09 |
| 6 | 0.25 | 0.69 | 18.71 |
| 7 | 0.41 | 0.57 | 15.66 |
Considering that the reduction in available viewpoints for training itself leads to a decrease in 3DGS reconstruction quality, our method still achieves good results even with significant viewpoint variation. This validates that our approach can detect inconsistencies between viewpoints and repair those areas despite the large viewpoint differences. However, in extreme cases, the absence of key viewpoints results in a loss of critical complementary information between viewpoints, leading to a significant decline in the reconstruction metrics of the scene.
To validate our advantages in the extreme case with large viewpoint differences, we conducted a quantitative evaluation of various methods for the extreme case mentioned in Figure 8; the results are shown in the following table. It can be seen that our method still outperforms existing methods in removing dynamic distractors under such extreme conditions.
Table 5: Quantitative comparison of different methods in the extreme case.
| Method | LPIPS | SSIM | PSNR |
|---|---|---|---|
| InFusion | 0.23 | 0.78 | 19.34 |
| SPIn-NeRF | 0.15 | 0.82 | 23.33 |
| SpotLess | 0.14 | 0.84 | 24.75 |
| Ours | 0.10 | 0.86 | 26.38 |
To address the concerns, we have added the discussion in Section A.5.
3. Computing cost comparison and analysis
Thank you for the comments. To quantitatively evaluate performance and computational efficiency, we compare our method against baseline approaches (InFusion, SPIn-NeRF, and SpotLess) on the synthetic scene shown in Fig.8. This scene provides ground truth data, enabling evaluation through reference-based metrics for both rendering quality and computational efficiency during optimization. As shown in Table 6, while our method incurs additional computational overhead compared to vanilla 3DGS due to integrating iterations and diffusion models, it achieves superior rendering quality while maintaining comparable efficiency to state-of-the-art 3DGS methods (e.g., SpotLess). Furthermore, our approach demonstrates significantly better reconstruction quality while being approximately 10× faster than leading NeRF-based methods such as SPIn-NeRF.
Our method indeed introduces higher computational resource demands due to the iterative repair process. For one scenario in our dataset, the specific time comparisons are as follows:
Table 6: Quantitative results and time costs on the synthesis data in Fig.8.
| Metric | InFusion | SPIn-NeRF | SpotLess | Ours |
|---|---|---|---|---|
| LPIPS | 0.23 | 0.15 | 0.14 | 0.10 |
| PSNR | 19.34 | 23.33 | 24.75 | 26.38 |
| Time | 16m34s | 7h32m18s | 30m26s | 33m34s |
4. Ablation studies on VISTA-GI and VISTA-CL
Response:
We have conducted ablation experiments on two components of our algorithm using our dataset, and the specific results are as follows:
Table 7: Quantitative ablation study of VISTA-GI and VISTA-CL on the Underwater 3D Inpainting Dataset.
| Method | UCIQE | URanker | CLIP Score |
|---|---|---|---|
| Ours w/o VISTA-GI | 0.48 | 1.52 | 0.70 |
| Ours w/o VISTA-CL | 0.50 | 1.59 | 0.69 |
| Ours | 0.51 | 1.64 | 0.72 |
Our experiments demonstrate two key findings. First, attempting reconstruction using only a 2D generative model (i.e., VISTA-CL) without VISTA-GI leads to significantly degraded image quality metrics. This validates that VISTA-GI's uncertainty guidance effectively mitigates multi-view inconsistencies during 3D reconstruction, resulting in higher-quality outputs. Second, while omitting VISTA-CL maintains image quality comparable to existing methods like SpotLess and SPIn-NeRF, the lack of concept-guided learning significantly reduces CLIP-Score metrics. This indicates that without conceptual constraints, the inpainting process produces results that are visually plausible but semantically inconsistent with the scene context.
To address the concerns, we have added the discussion in Section 5.
5. Visualizations of the views and uncertainty map in static cases
Response:
Thank you for the comments. To address your concern, we visualize the uncertainty map of static scenes in the Appendix section Sec A.2 with Figure 11. It can be observed that our method identifies areas in the rendered image that are inconsistent with other viewpoints. By repairing these high-uncertainty regions, the algorithm gradually becomes more confident in the repaired content, resulting in a more reasonable and cleaner reconstructed scene.
6. Visualizations of depth maps
Response:
Thank you for the comments. Visualizations of the depth maps of static scenes are shown in the Appendix, Sec. A.2, with Figure 11. The deeper the color, the closer the depth. It can be observed that our algorithm generates reasonable content for the target area by incorporating guidance from the repair regions during the reconstruction process.
7. Comparison with GScream
Response:
Thank you for pointing out this related work. While writing our paper, we noted the GScream method; however, since its code was not publicly available at that time, we were unable to conduct a quantitative comparison. During the rebuttal period, we attempted to run GScream on our underwater dataset. Unfortunately, GScream requires preprocessing of the dataset, and the currently available open-source release does not provide a way to run GScream on other datasets, making it difficult for us to conduct a quantitative comparison.
However, we added a discussion about the GScream method in the related work: 'GScream (Wang et al., 2025) focuses on removing objects by introducing monocular depth estimation and employing cross-attention to enhance texture.' Intuitively, GScream incorporates depth estimation to assist in generating consistent content while our method uses the visibility uncertainty to achieve the goal.
Dear Reviewer eMA4,
I hope you are doing well.
I am writing to follow up on our responses to your review. We sincerely appreciate the time and effort you dedicate to providing feedback on our work.
As the deadline for submission is fast approaching, we would be grateful if you could share any updates or additional comments you may have. Your insights are invaluable to us, and we are keen to ensure that our revisions align with your expectations.
Thank you very much for your attention to this matter. We look forward to your reply.
Dear Authors,
Thank you for your responses. However, I remain concerned about several unresolved issues, particularly the uncertainty maps of VISTA in static scenes. Below are my key points:
- Quantitative Comparison with GScream on SPIn-NeRF Dataset: Could you provide a comparison to validate texture consistency, even if limited to the SPIn-NeRF dataset?
- SPIn-NeRF Dataset Blocky Artifacts in Uncertainty Maps (Figure 11): Are these artifacts inherent to your method or the dataset? Additionally, what is their impact on reconstruction quality?
- Discrepancies Between Table 4 and Table 5: Could you clarify whether these represent different datasets or evaluation conditions to address reproducibility concerns?
- Ablation Studies on SPIn-NeRF Dataset: Could you share ablation results for this dataset to better validate VISTA-GI and VISTA-CL contributions?
I look forward to your response to these key points to further understand and evaluate your work.
Reviewer eMA4
Dear Authors,
Thank you for your detailed and comprehensive responses to my previous comments. I truly appreciate the effort and the additional insights you’ve provided. However, I still find that several critical concerns remain unaddressed, leading me to maintain my current score. Below are my key points:
- Quantitative Comparison with GScream on SPIn-NeRF Dataset: While you noted the difficulty in running GScream on your underwater dataset, I believe a quantitative comparison on the SPIn-NeRF dataset, which includes ground truth, would provide a stronger evaluation of texture consistency. Without this comparison, it is challenging to conclusively validate your method's advantages over GScream in scenarios with both masked and non-masked regions.
- Blocky Artifacts in Uncertainty Maps (Figure 11): The uncertainty maps for the SPIn-NeRF dataset (Figure 11) exhibit blocky artifacts. It is unclear whether these artifacts are a byproduct of your method or the dataset itself. More importantly, their impact on the reconstruction quality needs further explanation. Without clarity on this issue, the robustness of your approach is difficult to assess.
- Discrepancies Between Table 4 and Table 5: There is a noticeable mismatch between the quantitative results in Table 4 and Table 5. If these tables represent different datasets or evaluation conditions, this needs to be explicitly clarified. The current presentation raises questions about the consistency and reproducibility of your results.
- Ablation Studies on SPIn-NeRF Dataset: Your ablation studies were conducted on the Underwater 3D Inpainting Dataset. While insightful, the lack of ablation results on the SPIn-NeRF dataset, which includes ground truth, is a missed opportunity. Such experiments would provide more concrete evidence of the contributions of VISTA-GI and VISTA-CL, especially in controlled settings where objective metrics can be thoroughly assessed.
While I commend the efforts made to improve the clarity and comprehensiveness of your work, these unresolved issues leave significant gaps in validating the proposed approach. Therefore, I will maintain my current score, as I believe further revisions are necessary to address these critical aspects.
Reviewer eMA4
Thank you for your attention to our work.
- We compared our method with the data reported in the GScream paper, and the results are as follows:
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| SPIn-NeRF | 15.80 | 0.46 | 0.47 |
| Gaussians Grouping | 14.86 | 0.27 | 0.45 |
| Masked Gaussians | 10.77 | 0.29 | 0.59 |
| InFusion | 15.59 | 0.53 | 0.57 |
| GScream | 15.84 | 0.58 | 0.28 |
| Ours | 16.48 | 0.59 | 0.42 |
It can be seen that our method outperforms all related methods on two metrics, while on one metric (LPIPS) it is behind GScream. Considering that the other methods cannot repair dynamic scenes, while our method applies to both dynamic and static scenes, we believe our method has certain advantages.
- The blocky areas are not artifacts. The explanation is as follows: when calculating uncertainty, our method references the rendered images from adjacent viewpoints, but not every point can be observed from every adjacent viewpoint. Therefore, we need to assess the visibility of spatial points and remove those that are not within a camera's visible space. This operation is reflected in the uncertainty map as blocky areas (which can be considered the intersection of the visible regions from adjacent viewpoints projected onto the current viewpoint); a small geometric sketch of this visibility check is given after this list.
This phenomenon does not lead to a decrease in performance, because such areas are merely due to the inconsistent number of samples from adjacent viewpoints for different points. Essentially, they are effective samples of the true distribution, and the estimated uncertainty/variance is also valid. This phenomenon can be resolved by increasing the number of samples from adjacent viewpoints, but that would introduce more computational cost.
- Tables 4 and 5 were tested on the same synthetic underwater dataset, and the visual content is shown in Figure 8. Table 4 compares the time required for different methods, while Table 5 compares the results obtained using different resolutions during the reconstruction process. Since it is the same scene, our optimal results are consistent.
- We conducted ablation experiments on our method using the SPIn-NeRF dataset; the results are as follows:
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| Ours w/o VISTA-GI | 13.34 | 0.27 | 0.51 |
| Ours w/o VISTA-CL | 14.75 | 0.28 | 0.48 |
| Ours | 16.48 | 0.59 | 0.42 |
It can be seen that VISTA-GI has a more significant impact on the reconstructed scene than VISTA-CL. We attribute this to the latter's reliance on the uncertainty produced by the former. When VISTA-GI cannot generate high-quality uncertainty maps, the uncertainty can no longer help exclude interference during the reconstruction process, and the subsequent VISTA-CL also struggles to identify the areas that need repair, resulting in further performance degradation. The absence of VISTA-CL has a relatively smaller impact, but due to the loss of semantic constraints, the validity of the generated areas is harder to ensure, leading to a decline in performance.
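As referenced in the blocky-areas point above, the sketch below illustrates the kind of frustum visibility check being described: a 3D point contributes to the uncertainty estimate from a reference view only if it projects inside that view's image. The pinhole-camera formulation and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def in_view_mask(points_world, world_to_cam, K, img_hw):
    """Mark which 3D points fall inside a reference camera's image frustum
    (pinhole model; a sketch of the visibility check described above)."""
    H, W = img_hw
    pts_h = np.concatenate(
        [points_world, np.ones((points_world.shape[0], 1))], axis=1)  # (N, 4)

    cam = (world_to_cam @ pts_h.T)[:3]          # (3, N) points in camera frame
    proj = K @ cam                              # (3, N) homogeneous pixels
    with np.errstate(divide="ignore", invalid="ignore"):
        u, v = proj[0] / proj[2], proj[1] / proj[2]

    # A point is usable for uncertainty estimation only if it lies in front
    # of the camera and projects inside the image bounds.
    return (cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

# Intersecting such masks over all adjacent views yields the block-shaped
# valid region that appears in the uncertainty maps of Figure 11.
```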
Thank you once again for your attention and valuable suggestions for our work. We look forward to your recognition of our efforts.
To all reviewers:
Thank you very much to all the reviewers for their constructive and responsible feedback. In this paper, we present an innovative method for removing dynamic and static interference in 3DGS reconstruction. This approach combines geometric information from three-dimensional reconstruction with semantic information from concept learning, facilitating better scene reconstruction of real-world environments. We are delighted that Reviewer M4f5 emphasized the "Originality" of our method and its "Significance" in the field of virtual reality. Reviewer eMA4 described our framework as "novel," highlighting its "flexibility" across different scenarios. Reviewer qvjD noted that our framework is "general" and considered our experiments to be "sufficient." Reviewer VfkU not only highlighted the "novel" use of uncertainty in our method but also pointed out that it achieved "Strong quantitative results." While some reviewers expressed concerns about the details of our method and its performance in more scenarios, we believe our rebuttal can address these concerns. Below are some key responses, and we will provide more detailed replies in our responses to each reviewer.
List of Changes
According to the comments of all four reviewers, we have made the following major revisions to our paper.
- We have added the ablation study on VISTA-GI and VISTA-CL in Sec 5.2 to demonstrate the roles of these two components in our algorithm, in response to Reviewer eMA4 and Reviewer qvjD.
- We have added a discussion in Sec 5.3 to compare the computing cost with other SOTA methods, in response to Reviewer M4f5, Reviewer eMA4, and Reviewer VfkU.
- We have added a discussion in Sec A.1 to clarify the impact of the mask on inpainting, in response to Reviewer M4f5.
- We have presented more varied types of visualization results for our method in Sec A.2, in response to Reviewer M4f5 and Reviewer VfkU.
- We have added two experiments and accompanying discussions in Sec A.3 to clarify the hyperparameters related to iteration, in response to Reviewer M4f5 and Reviewer VfkU.
- We have added a subsection in Sec A.4 to discuss the impact of different resolutions during the training and inference process of the diffusion model, in response to Reviewer M4f5.
- We have added ablation studies and comparisons on cases with large viewpoint differences in Sec A.5, in response to Reviewer eMA4.
The paper introduces VISTA, a framework for 3D Gaussian inpainting that integrates visibility uncertainty to enhance inpainting. While the method demonstrates technical novelty and competitive performance, several key concerns from reviewers remain unresolved. These include limited real-world evaluation, insufficient analysis of key parameters, lack of comprehensive ablation studies, and missing valid comparisons with relevant baselines such as GScream. I would recommend that the authors consider all of these points and further improve the work.
Additional Comments from Reviewer Discussion
The reviewers raised concerns about texture consistency, evaluation on real-world use cases, and the lack of comparisons with strong baseline models. Despite several rounds of discussion, these issues remain unresolved.
Reject