Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning
Abstract
Reviews and Discussion
This paper introduces a framework for 3D scene inpainting using 3D Gaussian Splatting that leverages multi-view information. The framework consists of two key components: VISTA-GI, which guides inpainting by measuring the visibility uncertainty of 3D points across multiple views, and VISTA-CL, which learns scene concepts without masked objects using a diffusion model. These components work iteratively to refine the 3D representation using geometric and semantic information progressively. The method extends to handling dynamic distractors in scenes, making it effective for challenging scenarios like underwater scenes with moving objects. Evaluated on the SPIn-NeRF dataset and a new underwater dataset derived from UTB180, VISTA demonstrates superior performance over state-of-the-art methods in quantitative metrics and visual quality, particularly in maintaining consistency across viewpoints.
Strengths
- Novel use of visibility uncertainty to guide 3D inpainting
- Integration of 3D Gaussian Splatting with diffusion-based scene learning
- Iterative refinement combining geometric and semantic cues
- Strong quantitative results across metrics (PSNR, SSIM, LPIPS, etc.)
Weaknesses
- Missing analysis of how the visibility uncertainty threshold affects results
- The iterative process requires three iterations, potentially making it computationally expensive for real-time applications, and there is no clear ablation showing the impact of different numbers of iterations beyond Figure 5
- No discussion of failure cases or limitations with large occlusions
- Only tested on ten forward-facing scenes from the SPIn-NeRF dataset - a relatively small evaluation set and scenario.
- Missing key hyperparameters like the size of the adjacent view window for uncertainty calculation
- No evaluation of common real-world scenarios like removing people from tourist photos. Dynamic object types are limited (mostly fish)
Questions
- What thresholds were used for the visibility uncertainty measure in Equation (2)? How sensitive is the method to these thresholds? Could you provide ablation studies showing how different thresholds affect the final results?
- Why exactly are three iterations optimal? The plot in Figure 5 shows diminishing returns, but could more iterations help in challenging cases? What is the runtime for each iteration? Is real-time performance possible with architectural optimizations?
- Could you provide examples of failure cases and discuss their causes?
- Could you evaluate more diverse scenarios, especially tourist photos with people?
To Reviewer VfkU:
1. Ablation study of the uncertainty map
Response:
Thank you for your comments. In our practical implementation, the uncertainty calculation (i.e., the variance) is performed in the LAB color space and is normalized. Therefore, we do not explicitly define a threshold for the uncertainty itself. If your concern refers to the uncertainty threshold in Eq. (3), we investigate how different increasing strategies affect reconstruction quality. In our paper, the hyperparameter is initialized to 0 and increases by 0.1 with each iteration. We evaluate five different increase ratios {0.1, 0.2, 0.3, 0.4, 0.5} and analyze their impact on reconstruction quality across iterations using our dataset. As shown in Figure 12 (b), all ratios lead to improved PSNR values over iterations. In the first two iterations, a higher increase ratio improves the reconstruction performance. However, an increase ratio above 0.1 means that the algorithm becomes overly confident in the inpainted areas too early, resulting in insufficient interaction of geometric and semantic information between the VISTA-GI and VISTA-CL modules, which subsequently leads to a decline in reconstruction performance in later iterations.
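For concreteness, the snippet below is a minimal sketch of how such a normalized per-pixel variance over a target rendering and its adjacent views could be computed in LAB space. The stacking of already-aligned views, the channel averaging, and the exact normalization are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def visibility_uncertainty(views_lab):
    """Per-pixel visibility uncertainty as the variance of LAB colors that
    the same pixel receives from the target rendering and its adjacent
    reference renderings.

    views_lab: (N, H, W, 3) array holding the target view and its adjacent
               views, already converted to LAB and projected/aligned into
               the target viewpoint (the alignment step is assumed here).
    """
    views_lab = np.asarray(views_lab, dtype=np.float64)

    # Variance across the N views, averaged over the three LAB channels.
    per_pixel_var = views_lab.var(axis=0).mean(axis=-1)      # (H, W)

    # Normalize by the standard deviation of the whole map (see the
    # normalization response below) so values are comparable across scenes.
    return per_pixel_var / (per_pixel_var.std() + 1e-8)
```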
2. About iteration and time consumption
Response:
Thank you for the comments.
The influence of the iteration number: In our setup, the number of iterations is strongly correlated with the hyperparameters involved in Figure 12. Simply increasing the number of iterations without altering the hyperparameters goes against the original intent. To evaluate the impact of different numbers of iterations, we analyzed two hyperparameters that are related to the number of iterations. By examining Figures 12 (a) and (b), we can infer that fewer iterations lead to insufficient interaction of geometric and semantic information between VISTA-GI and VISTA-CL. Conversely, excessive iterations do not improve performance on challenging scenes and may introduce excessive computational costs.
The total time cost: To quantitatively evaluate performance and computational efficiency, we compare our method against baseline approaches (InFusion, SPIn-NeRF, and SpotLess) on the synthetic scene shown in Fig.8. This scene provides ground truth data, enabling evaluation through reference-based metrics for both rendering quality and computational efficiency during optimization. As shown in Table 9, while our method incurs additional computational overhead compared to vanilla 3DGS due to integrating iterations and diffusion models, it achieves superior rendering quality while maintaining comparable efficiency to state-of-the-art 3DGS methods (e.g., SpotLess). Furthermore, our approach demonstrates significantly better reconstruction quality while being approximately 10× faster than leading NeRF-based methods such as SPIn-NeRF.
Our method indeed introduces higher computational resource demands due to the iterative repair process. For one scenario in our dataset, the specific time comparisons are as follows:
Table 9: Quantitative results and time costs on the synthesis data in Fig.8.
| Metric | InFusion | SPIn-NeRF | SpotLess | Ours |
|---|---|---|---|---|
| LPIPS | 0.23 | 0.15 | 0.14 | 0.10 |
| PSNR | 19.34 | 23.33 | 24.75 | 26.38 |
| Time | 16m34s | 7h32m18s | 30m26s | 33m34s |
Time cost for each iteration: In the aforementioned case, our method undergoes three rounds of iterations, with an average iteration time of approximately 10 minutes per iteration. By optimizing the current pipeline and reusing the reconstruction results from the previous iteration, it may be possible to further reduce the runtime to achieve real-time performance. Such acceleration efforts are still worth further investigation.
3. Failure cases and limitations
Response: Thank you for providing us with a new perspective to analyze our method. We have added the discussion of failure cases of our algorithm in Sec A.2 with Figure 10. Due to errors in the prior mask, some fish were not detected by the object detection model. Furthermore, since the fish did not move significantly during the shooting process, these areas did not produce inconsistencies across multiple viewpoints during reconstruction, making it difficult for the VISTA-GI component to identify these areas through uncertainty. This also validates our algorithm design approach: VISTA-CL introduces semantic information through masks, while VISTA-GI incorporates geometric information through uncertainty, complementing each other to remove distractors. However, in the failed cases, issues arose in both aspects, resulting in poor reconstruction quality of the final scene.
4. Our work uses the largest evaluation set compared to SOTA methods
Response:
Thank you for your comments. We collected the reported datasets and scene counts used in the referenced SOTA method papers, and the details are as follows:
| Method | Training datasets | Number of scenes |
|---|---|---|
| InFusion | Mip-NeRF 360, SPIn-NeRF | 19 |
| SPIn-NeRF | SPIn-NeRF | 10 |
| RefFusion | SPIn-NeRF | 10 |
| GScream | SPIn-NeRF, IBRNet | 15 |
| SpotLess | NeRF On-the-go, RobustNeRF | 10 |
| Ours | Underwater 3D Inpainting, SPIn-NeRF | 20 |
It can be seen that our method utilizes the largest number of scenes to validate its effectiveness. The underwater dataset we used is derived from videos shot by real-world divers. Due to the effects of the underwater medium and floating debris, these underwater scenes represent an extreme scenario in the real world.
5. Influence of the number of adjacent views
Response:
Thank you for your careful review comments. When calculating the uncertainty of a particular reconstructed viewpoint, our algorithm selects the 6 nearest viewpoints as adjacent reference views. To further clarify the details of this setting, we conducted an ablation study on this hyperparameter, with the results as follows:
| Num of adjacent views | UCIQE | URanker | CLIP Score |
|---|---|---|---|
| 2 | 0.49 | 1.53 | 0.69 |
| 4 | 0.51 | 1.63 | 0.71 |
| 6 | 0.51 | 1.64 | 0.72 |
| 8 | 0.51 | 1.64 | 0.72 |
It can be observed that when too few viewpoints are selected, performance declines unacceptably due to insufficient spatial points, leading to inaccurate variance estimation. On the other hand, selecting more adjacent viewpoints does not improve the reconstruction quality and, due to the increased point count, results in higher computational demands.
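As an illustration of the neighbor-selection step described above, the sketch below picks the k nearest reference views by camera-center distance; the distance criterion and all names are assumptions for illustration, since the response does not specify how "nearest" is measured.

```python
import numpy as np

def nearest_adjacent_views(camera_centers, target_idx, k=6):
    """Select the k reference views whose camera centers are closest to the
    target view's camera center (Euclidean distance, assumed here)."""
    centers = np.asarray(camera_centers, dtype=np.float64)   # (N, 3) positions
    dists = np.linalg.norm(centers - centers[target_idx], axis=1)
    dists[target_idx] = np.inf                               # exclude the target itself
    return np.argsort(dists)[:k]

# Example with the k = 6 setting used in the ablation above:
# neighbors = nearest_adjacent_views(camera_centers, target_idx=0, k=6)
```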
6. More common real-world scenarios
Response:
Thank you for the comments. As we discussed earlier in the response, underwater scenes are a challenging type of scenario in the real world.
However, to further demonstrate the effectiveness of our method, we tested it on a pedestrian-removal scene from the NeRF On-the-go dataset. This scene, called tree, contains 212 images, with the main distractors coming from moving pedestrians. As shown in Figure 9, our method achieved very high-quality results on this dataset. Due to the abundance of viewpoints in the dataset, there is a lot of supplementary information between perspectives, allowing our method to effectively utilize other viewpoints to repair the blurring caused by moving distractors.
After reviewing other comments and the rebuttal, I would like to maintain my negative rating for these reasons:
- The authors' explanation of texture consistency relies too heavily on basic L2/SSIM losses without clear innovation, as highlighted by Reviewer eMA4's concerns.
- The ablation studies in the rebuttal show only modest improvements in metrics (Table 7), suggesting limited effectiveness of the individual components.
- The rebuttal reveals significant performance issues with viewpoint variations - PSNR drops dramatically from 26.25 to 15.66 with increased sampling intervals. I don't quite understand this effect.
- Limited generalizability testing - only one additional real-world scene was tested despite multiple reviewers requesting a broader evaluation.
Please carefully read our paper and response, and provide reasonable and responsible feedback on the rating.
- Regarding Texture Consistency: We introduced L2/SSIM as relevant background for texture consistency, rather than as a means to maintain texture consistency itself. Our method uses uncertainty to help the algorithm determine which areas have consistent textures, as discussed in our response to Reviewer eMA4.
- Ablation Studies: Non-reference metrics in the generation domain (such as CLIP Score, URanker, etc.) are not as sensitive to generation quality as reference metrics like SSIM and PSNR. While our results may appear to show relatively small differences in non-reference metrics, significant differences in the generated results can be observed from our visualizations. We also provided a synthetic dataset in Table 7 (with ground truth for calculating reference metrics) to demonstrate the advantages of our method in dynamic scenes.
- Large Variations in Viewpoint Are Not a Performance Issue: Please carefully read our rationale for setting up such ablation studies. Increasing viewpoint variation involves incorporating a large portion (from 50% to 85%) of the training data into the test set, which increases the task difficulty. Our method only shows noticeable performance degradation under extreme conditions, demonstrating robustness.
- We Have Provided More Visualization Results: The underwater dataset is a challenging real-world scenario, and we have also provided visual results for pedestrian-removal scenes. The number of scenes our method is compared on exceeds that of other SOTA methods in the field, so requesting visualizations for even more scenes is unreasonable.
The paper introduces a general 3D Gaussian Splatting (3DGS) inpainting framework called VISTA, which achieves 3D inpainting for both static and dynamic objects within 3DGS reconstruction, resulting in enhanced novel view synthesis performance. By leveraging multi-view information, the method extracts visibility uncertainty maps to guide the 3D reconstruction of inpainted scenes. Additionally, masked multi-view images are employed to learn the scene concept through textual inversion techniques, which improves the accuracy of image inpainting priors. VISTA iteratively inpaints scenes and learns scene concepts from inter-view visibility information, enhancing scene consistency and preserving scene identity during the 3D inpainting process.
Strengths
- The work proposes a general 3D Gaussian inpainting framework applicable to both static and dynamic scenes.
- The proposed VISTA leverages multi-view visibility information for more detailed texture consistency, facilitating the inpainting of unseen areas through conceptual learning.
- The whole paper is well-written and easy to follow.
Weaknesses
- Although the comparison experiments are sufficient and the discussions are insightful, there is a lack of certain experiments, such as an ablation study on the effect of conceptual learning. Please see details in Question 1.
- The paper does not discuss the multi-view inpainting consistency of this method, nor does it compare the results with or discuss differences from those obtained using video diffusion models. See details in Question 2.
Questions
- In a complex scene (e.g., with objects and backgrounds), the scene features are not as distinct as an object's. Is the conceptual learning robust in such scenes? How does conceptual learning improve inpainting performance?
- It would be beneficial to discuss multi-view consistency during inpainting, as well as to review and discuss video inpainting works that concern multi-view consistency.
- Minor issues:
  3.1. Fig.1: InFusion's upper results with red boxes appear to be incorrectly cropped in misaligned areas?
  3.2. Fig.2 refers to wrong section numbers.
To Reviewer qvjD:
1. Ablation study of the conceptual learning
Response:
Thank you for the comments. We have conducted ablation experiments on two components of our algorithm using our dataset, and the specific results are as follows:
| Method | UCIQE | URanker | CLIP Score |
|---|---|---|---|
| Ours w/o VISTA-GI | 0.48 | 1.52 | 0.70 |
| Ours w/o VISTA-CL | 0.50 | 1.59 | 0.69 |
| Ours | 0.51 | 1.64 | 0.72 |
Our experiments demonstrate two key findings. First, attempting reconstruction using only a 2D generative model (i.e., VISTA-CL) without VISTA-GI leads to significantly degraded image quality metrics. This validates that VISTA-GI's uncertainty guidance effectively mitigates multi-view inconsistencies during 3D reconstruction, resulting in higher-quality outputs. Second, while omitting VISTA-CL maintains image quality comparable to existing methods like SpotLess and SPIn-NeRF, the lack of concept-guided learning significantly reduces CLIP-Score metrics. This indicates that without conceptual constraints, the inpainting process produces results that are visually plausible but semantically inconsistent with the scene context.
To address the concerns, we have added the discussion in Sec 5.
2. Multi-view consistency
Response:
Thank you for the comments. Given multi-view images of a static scene, the original 3DGS trains the Gaussian splats with a pixel-wise reconstruction loss combined with an SSIM loss (i.e., Eq. (1) in the submission) to make sure the Gaussian splats can render detail-preserving images across all perspectives. Hence, texture consistency across views is implicitly achieved via this mix of reconstruction and SSIM losses, which are good at reconstructing structure and details, as validated in [A].
However, when the scene contains dynamic objects, an object can be seen in one view but not observed in another. This leads to conflicts between different view renderings and disrupts the training of the Gaussian splats under the original objective function.
Our visibility uncertainty map indicates the potential conflicts across different views, and we use it to reweight the contributions of different pixels in Eq. (1), yielding Eq. (4) in the submission. Intuitively, we assign low weights to pixels with high visibility uncertainty. Such a strategy avoids conflicts during the training of the Gaussian splats and leads to better texture consistency.
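A minimal PyTorch sketch of this kind of uncertainty-weighted reweighting is given below. The function names, the clamping, and the per-pixel SSIM helper are illustrative assumptions, not the submission's exact Eq. (4).

```python
import torch

def uncertainty_weighted_loss(render, gt, uncertainty, ssim_map_fn, lam=0.2):
    """Sketch of an uncertainty-reweighted photometric objective in the
    spirit of the reweighting described above.

    render, gt:   (3, H, W) rendered and observed images.
    uncertainty:  (H, W) normalized visibility uncertainty, clamped to [0, 1].
    ssim_map_fn:  callable returning an (H, W) per-pixel SSIM map (assumed
                  to be supplied by the caller).
    lam:          SSIM mixing weight, as in the standard 3DGS objective.
    """
    # High-uncertainty pixels (likely multi-view conflicts such as dynamic
    # objects) get low weight, so they barely influence the Gaussians.
    weight = 1.0 - uncertainty.clamp(0.0, 1.0)                    # (H, W)

    l1_term = (weight.unsqueeze(0) * (render - gt).abs()).mean()
    dssim_term = (weight * (1.0 - ssim_map_fn(render, gt))).mean()

    return (1.0 - lam) * l1_term + lam * dssim_term
```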
We analyzed the quality of reconstruction between the masked and non-masked areas in an underwater scene. The results are as follows:
Table 8: Quantitative results of different areas.
| Method | Non-masked PSNR | Masked PSNR | Non-masked SSIM | Masked SSIM |
|---|---|---|---|---|
| w/ mask | 26.92 | 11.9 | 0.88 | 0.22 |
| w/o mask | 24.81 | 11.23 | 0.81 | 0.20 |
It can be observed that even when the mask is not used, our method maintains good texture consistency in both the masked and non-masked areas. For the restoration of the masked areas, our method also ensures texture consistency through the constraints imposed by uncertainty.
[A] H. Zhao, O. Gallo, I. Frosio and J. Kautz, "Loss Functions for Image Restoration With Neural Networks," in IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47-57, 2017.
3. Different from video diffusion models
Response:
Thank you for providing us with a new perspective to analyze our method.
General video generation methods (such as Tune-A-Video, VideoBooth, and AnimateDiff) employ cross-frame attention, such as temporal attention, to maintain object consistency across multiple frames. These video generation algorithms typically do not build explicit 3D models at runtime; instead, they merge multi-frame information through motion blocks, with the expected output being a sequence of 2D images.
In contrast, methods based on 3D modeling explicitly reconstruct a 3D model, with the desired output being a clean and coherent 3D scene. Since NeRF and 3DGS methods inherently possess some capability to maintain viewpoint consistency, 3D reconstruction methods are generally more stable in preserving viewpoint consistency. Moreover, 3D-consistent video priors can be beneficial for reconstruction. For example, MVInpainter utilizes video along with 2D inpainting to assist in tasks such as object removal and composition during reconstruction.
To sum up, since our method focuses on generating a 3D scene, while video generation models aim to produce a video, some methods for maintaining 3D consistency in video generation cannot be directly applied to our task. Although video generation and 3D reconstruction can produce similar visual effects, there is a fundamental difference in their output focus.
4. About these minor issues
Response:
Thank you for your careful review comments. We have fixed these mistakes in the revised paper.
I thank the authors for the effort made in their detailed responses. After reading the other reviews and rebuttals, I find that my main concerns have been addressed.
In addition, the revised version includes more comprehensive experiments for robustness validation, such as runtime comparisons, more ablation studies, and an exploration of robustness across different levels of input frame sparsity. These evaluations strengthen the paper and suggest new directions for future research.
Given that the multi-view consistency issue was highlighted by several reviewers, I recommend that the detailed discussion on multi-view consistency (e.g., Table 8) be included as supplementary content in the revised paper.
Considering that the revision provides sound experiments verifying the method's robustness, I would like to raise my score and recommend acceptance.
Thank you for your thoughtful feedback and for recognizing the efforts we put into our detailed responses.
We are delighted to hear that you find our revised version to include more comprehensive experiments that validate the robustness of our method. We appreciate your suggestion to include a discussion on multi-view consistency as supplementary content, and we will incorporate this in our final revision.
Thank you again for your positive feedback!
This paper presents VISTA, a novel framework for 3D Gaussian inpainting that leverages visibility uncertainty and scene conceptual learning to address the challenges of inpainting in both static and dynamic scenes. The authors propose a method that integrates visibility uncertainty maps to guide the inpainting process, allowing for more coherent and contextually appropriate fills for masked regions. The framework is evaluated on two challenging datasets, demonstrating superior performance compared to existing state-of-the-art methods.
Strengths
• Originality: The integration of visibility uncertainty with scene conceptual learning is a fresh approach that addresses existing limitations in 3D inpainting. Although the idea is intuitively simple, the performance is good thanks to the complementary visual cues.
• Quality: The experiments are robust, showcasing the effectiveness of the proposed method across diverse datasets. The paper also presents some ablation studies in the discussion and appendix sections to support the results.
• Clarity: The main ideas are presented clearly, and the paper provides a comprehensive overview of related work and the proposed contributions.
• Significance: The results indicate potential applications in augmented and virtual reality, highlighting the practical implications of the research.
Weaknesses
• While the paper presents a robust methodology, certain experimental comparisons could be expanded. For instance, a more detailed analysis of the impact of different hyperparameters on performance would provide deeper insights, particularly regarding the parameter v in equation (3) and the rate at which the initial noise decreases during inference.
• There are some writing issues to address: in Section 4.1, you state that Figure 1 shows the initial representation's failure to exclude dynamic objects; however, Figure 1 actually displays the results of different approaches rather than the initial representation itself.
Questions
- Since the iterative process may slow down the pipeline, how does the computational efficiency of VISTA compare to other state-of-the-art methods, particularly in real-time applications?
- In equation (3), Ui represents the visibility uncertainty map for a single frame, and the uncertainty for each pixel is calculated from equation (2). I would like to know if any normalization is applied to Ui between equations (2) and (3).
- Regarding the experimental setup illustrated in the appendix, you mention that the initial noise strength during inference decreases by 0.2 with each iteration. Was any ablation study conducted? I agree that reducing the initial noise over iterations is reasonable, as the rendered image quality becomes more confident with each iteration, but does this hyperparameter affect the final result?
- The results in Figure 7 show the reconstruction without a mask, which retains the fish in the top-left corner of the image while removing the other fish. I believe the visibility uncertainty used in Equation (3) to update the mask has favorable properties for removing dynamic distractors. However, why does only the top-left fish remain? Could you also provide some statistical evaluation for this ablation study?
- I would also like to inquire about the image resolution used for training and inference with stable-diffusion-v1.5, as the resolution may influence the inpainting quality in SD-1.5.
4. Influence of image resolution
Response:
Thank you for the comments. In our experimental setup, we use Stable Diffusion v1.5 as the inpainting model and train and test the model following its default setup: if the input image has a resolution higher than 512x512, we crop the image to a new size that is both the closest to the original image size and a multiple of 8; if the input image is smaller than 512x512, we rescale the image to 512x512. To analyze the influence of this strategy at different original resolutions, given an original scene with input images of size 1299x974, we downsample these images to four resolutions: 64x64, 128x128, 256x256, and 512x512. Then, for each resolution, we build a 3D model and evaluate the rendering quality. As shown in Table 2, we observe that: (1) reducing the resolution to 512x512 does not significantly impact any of the metrics, demonstrating our method's robustness to substantial resolution changes; (2) further decreasing the resolution leads to gradual degradation in reference-based metrics, while non-reference metrics remain relatively stable.
We have added the above discussion in the Appendix Sec.A.4.
Table 2: Quantitative ablation results of different resolutions.
| Resolution | LPIPS | SSIM | PSNR | CLIP Score | URanker | UCIQE |
|---|---|---|---|---|---|---|
| 64x64 | 0.51 | 0.68 | 16.27 | 0.64 | -0.23 | 0.56 |
| 128x128 | 0.42 | 0.69 | 18.89 | 0.69 | 0.29 | 0.57 |
| 256x256 | 0.26 | 0.71 | 21.33 | 0.69 | 0.42 | 0.57 |
| 512x512 | 0.11 | 0.84 | 26.04 | 0.69 | 0.30 | 0.57 |
| 1299x974 | 0.10 | 0.86 | 26.38 | 0.70 | 0.44 | 0.58 |
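For reference, a minimal sketch of the crop-or-rescale preprocessing described above is given below. The center-crop choice and the handling of images where only one side is below 512 are assumptions for illustration, not the exact code used in the paper.

```python
from PIL import Image

def preprocess_for_sd15(img: Image.Image) -> Image.Image:
    """Resolution handling in the spirit of the response above:
    images larger than 512x512 are cropped to the nearest multiple-of-8
    size (center crop assumed), smaller images are rescaled to 512x512."""
    w, h = img.size
    if w >= 512 and h >= 512:
        # Round each side down to the nearest multiple of 8, then center-crop.
        new_w, new_h = (w // 8) * 8, (h // 8) * 8
        left, top = (w - new_w) // 2, (h - new_h) // 2
        return img.crop((left, top, left + new_w, top + new_h))
    # Inputs below the model's native resolution are rescaled to 512x512.
    return img.resize((512, 512), Image.BICUBIC)
```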
5. Concerns about Fig. 7
Response:
We appreciate your careful review. Regarding why the fish in the top-left corner of Fig. 7 is retained while the other fish are removed: this is primarily because the fish remains stationary across different views (as evident in the optical flow map of Fig. 7, where the top-left fish exhibits low flow values at its center). Consequently, it has a lower value in the visibility uncertainty map (see the corresponding map in Fig. 7). Without using a mask to manually label this area for repair, the fish's geometric characteristics resemble those of a stationary object, such as a rock, making it indistinguishable to our uncertainty-based detection.
In contrast, moving fish create significant geometric inconsistencies across viewpoints, enabling our uncertainty detection to flag them as anomalies. This leads to their removal through the inpainting process. To address these challenging scenarios, we introduced mask annotations for fish detection, providing semantic guidance for our inpainting method. As shown in the last column of Fig. 7, incorporating the mask ensures the successful removal of the top-left fish.
To address the concern, we have revised the discussion in A.1 and Figure 7 by adding the optical flow results and uncertainty map.
To Reviewer M4f5:
1. Impact of Hyperparameters
Response:
Thank you for the suggestions and comments. We have added the discussion about the influence of the hyperparameter in Eq. (3) and noise reduction ratios in the Appendix section Sec A.3 with the revised Figure 12.
Impact of the threshold v in Eq. (3): We investigate how different increasing strategies affect reconstruction quality. In our paper, the hyperparameter is initialized to 0 and increases by 0.1 with each iteration. We evaluate five different increase ratios {0.1, 0.2, 0.3, 0.4, 0.5} and analyze their impact on reconstruction quality across iterations using our dataset. As shown in Figure 12 (b), all ratios lead to improved PSNR values over iterations. In the first two iterations, a higher increase ratio improves the reconstruction performance. However, an increase ratio above 0.1 means that the algorithm becomes overly confident in the inpainted areas too early, resulting in insufficient interaction of geometric and semantic information between the VISTA-GI and VISTA-CL modules, which subsequently leads to a decline in reconstruction performance in later iterations.
Impact of noise reduction ratios in diffusion inference: During diffusion model inference, we investigate how different noise reduction strategies affect reconstruction quality. Starting from an initial noise strength of 1.0, we systematically decrease the noise at each iteration by a fixed ratio. We evaluate four different reduction ratios {0.1, 0.2, 0.3, 0.4} and analyze their impact on reconstruction quality across iterations using our dataset. As shown in Figure 12 (a), while all ratios lead to improved PSNR values over iterations, the reduction ratio of 0.2 achieves optimal convergence in the fewest iterations. Based on this empirical analysis, we adopt 0.2 as the noise reduction ratio in our method.
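The two per-iteration schedules described above can be summarized in the small sketch below; the function names and the zero-based iteration indexing are assumptions for illustration.

```python
def threshold_schedule(iteration, step=0.1, initial=0.0):
    """Uncertainty threshold in Eq. (3): starts at 0 and grows by `step`
    (0.1 in the paper) with each outer refinement iteration."""
    return initial + step * iteration

def noise_strength_schedule(iteration, reduction=0.2, initial=1.0):
    """Diffusion inpainting noise strength: starts at 1.0 and decreases by
    `reduction` (0.2 in the paper) per iteration, floored at zero."""
    return max(initial - reduction * iteration, 0.0)

# Example: the three refinement iterations used in the paper.
for it in range(3):
    v = threshold_schedule(it)        # 0.0, 0.1, 0.2
    s = noise_strength_schedule(it)   # 1.0, 0.8, 0.6
```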
2. Computing cost comparison and analysis
Response:
Thank you for the comments. To quantitatively evaluate performance and computational efficiency, we compare our method against baseline approaches (InFusion, SPIn-NeRF, and SpotLess) on the synthetic scene shown in Fig.8. This scene provides ground truth data, enabling evaluation through reference-based metrics for both rendering quality and computational efficiency during optimization. As shown in Table 1, while our method incurs additional computational overhead compared to vanilla 3DGS due to integrating iterations and diffusion models, it achieves superior rendering quality while maintaining comparable efficiency to state-of-the-art 3DGS methods (e.g., SpotLess). Furthermore, our approach demonstrates significantly better reconstruction quality while being approximately 10× faster than leading NeRF-based methods such as SPIn-NeRF.
Table 1: Quantitative results and time costs on the synthesis data in Fig.8.
| Metric | InFusion | SPIn-NeRF | SpotLess | Ours |
|---|---|---|---|---|
| LPIPS | 0.23 | 0.15 | 0.14 | 0.10 |
| PSNR | 19.34 | 23.33 | 24.75 | 26.38 |
| Time | 16m34s | 7h32m18s | 30m26s | 33m34s |
3. Normalization of the visibility uncertainty map
Response:
Thank you for the comments. To standardize the visibility uncertainty values, we apply normalization to the pixel-wise uncertainty map computed using Eq. (2). Specifically, each pixel's uncertainty value is divided by the standard deviation of all uncertainty values in the map, and the resulting normalized visibility uncertainty map is used in Eq. (3).
To address the concern, we have added the above revision into Sec. 4.1.
Dear Reviewer M4f5,
I hope this message finds you well.
I am one of the authors. First, I would like to express my gratitude for your time and effort in reviewing our paper. We greatly appreciate your valuable feedback and have made the necessary revisions based on your comments.
As the submission deadline approaches, we would like to ask if you could provide further insights or feedback as soon as possible. Your thoughts are very important to us, and we are eager to make any final adjustments before resubmitting.
Thank you for your understanding and support!
This paper introduces VISTA (Visibility-Uncertainty-Guided 3D Gaussian Inpainting via Scene Conceptual Learning), a framework specifically designed for 3D scene inpainting. VISTA combines visibility uncertainty maps with scene conceptual learning to improve inpainting quality across multiple viewpoints. The main contributions include VISTA-GI, which guides 3D Gaussian inpainting across views, and VISTA-CL, which uses scene conceptual learning to fill in areas lacking semantic information. Experimental evaluations demonstrate that VISTA outperforms existing methods on the SPIn-NeRF dataset for static scenes and an underwater dataset with dynamic objects.
Strengths
- VISTA introduces a novel multi-view 3D inpainting framework by combining visibility uncertainty maps with a diffusion model. This approach effectively uses complementary information across multiple views and includes scene conceptual learning to handle semantic filling, showing a certain level of originality.
- The methodology clearly separates VISTA-GI and VISTA-CL, using an alternating iterative optimization to ensure consistency across views. This design handles both static and dynamic objects, broadening the framework's flexibility for different application scenarios.
Weaknesses
- The main contribution of this paper is the introduction of visibility uncertainty to guide inpainting, but the uncertainty map considers only the visibility of 3D points across views without explicitly addressing texture consistency. This may lead to color or texture inconsistencies in the inpainted regions. Further analysis on how the model maintains texture consistency would strengthen this aspect.
- VISTA-GI is designed to maintain consistency between neighboring views. However, as noted in the appendix, it struggles with scenes that have large viewpoint differences, which haven't been quantitatively evaluated. Testing on larger viewpoint changes would provide a clearer picture of the model's performance in maintaining consistency across wide angles.
- The framework involves multiple components (visibility uncertainty maps, a diffusion model, and scene conceptual learning), which could increase computational demands. This may impact feasibility in large-scale or real-time applications. Providing more details on computational costs, such as training time, would support its applicability.
- The paper does not include ablation studies for VISTA-GI and VISTA-CL, making it difficult to assess the individual contribution of each component. Adding these ablation studies would clarify each module's role in the framework's overall performance.
- The paper includes few visual results on the SPIn-NeRF dataset, limiting readers' understanding of the framework's effectiveness in static scenes. Providing more visual results across different scenarios, as well as showing the impact of the visibility uncertainty map on static scenes, would help illustrate the method's inpainting capabilities.
- There is a lack of visualizations of depth maps to verify the geometric accuracy of the inpainting results.
- The paper does not compare VISTA with GScream [1], which is another 3D inpainting method operating in a similar setting. Including a comparison with GScream would help establish VISTA's advantages.
[1] Wang, Yuxin, et al. "GScream: Learning 3D Geometry and Feature Consistent Gaussian Splatting for Object Removal" ECCV. 2024.
Questions
- I still have concerns about texture consistency. Could you explain how the model ensures texture consistency in the inpainting results and provide visualizations of texture across different viewpoints in static scenes?
- I am curious about the performance drop with large viewpoint differences. Could you quantify this to assess VISTA's robustness to viewpoint variations?
- Could you conduct ablation studies on VISTA-GI and Scene Conceptual Learning to show their specific roles within the overall framework?
- Could you visualize depth maps to demonstrate the geometric accuracy of the inpainting results?
To Reviewer eMA4:
1. How visibility uncertainty benefits texture consistency
Response:
Thank you for the comments. Given multi-view images of a static scene, the original 3DGS trains the Gaussian splats with a pixel-wise reconstruction loss combined with an SSIM loss (i.e., Eq. (1) in the submission) to make sure the Gaussian splats are able to render detail-preserving images across all perspectives. Hence, texture consistency across views is implicitly achieved via this mix of reconstruction and SSIM losses, which are good at reconstructing structure and details, as validated in [A].
However, when the scene contains dynamic objects, an object can be seen in one view but not observed in another. This leads to conflicts between different view renderings and disrupts the training of the Gaussian splats under the original objective function.
Our visibility uncertainty map indicates the potential conflicts across different views, and we use it to reweight the contributions of different pixels in Eq. (1), yielding Eq. (4) in the submission. Intuitively, we assign low weights to pixels with high visibility uncertainty. Such a strategy avoids conflicts during the training of the Gaussian splats and leads to better texture consistency.
We analyzed the quality of reconstruction between the masked and non-masked areas in an underwater scene. The results are as follows:
Table 3: Quantitative results of different areas.
| Method | Non-masked PSNR | Masked PSNR | Non-masked SSIM | Masked SSIM |
|---|---|---|---|---|
| w/ mask | 26.92 | 11.9 | 0.88 | 0.22 |
| w/o mask | 24.81 | 11.23 | 0.81 | 0.20 |
It can be observed that even when the mask is not used, our method maintains good texture consistency in both the masked and non-masked areas. For the restoration of the masked areas, our method also ensures texture consistency through the constraints imposed by uncertainty.
[A] H. Zhao, O. Gallo, I. Frosio and J. Kautz, "Loss Functions for Image Restoration With Neural Networks," in IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47-57, 2017.
2. Quantitative analysis of large viewpoint differences
Response:
To evaluate the impact of varying viewpoint differences, we first capture 34 images from continuously distributed viewpoints around a scene to create a ground truth (GT) 3DGS model. We then systematically reduce the number of viewpoints by sampling them at different intervals (from 2 to 7), where larger intervals represent larger viewpoint differences. For each sampling interval, we construct a new 3DGS model and assess its quality by comparing its rendered images against those from the GT model using standard metrics: LPIPS, SSIM, and PSNR. This methodology allows us to quantitatively analyze how viewpoint difference affects reconstruction quality.
Table 4: Quantitative results of large viewpoint differences.
| Sampling interval | LPIPS | SSIM | PSNR |
|---|---|---|---|
| 2 | 0.09 | 0.89 | 26.25 |
| 3 | 0.14 | 0.83 | 23.42 |
| 4 | 0.16 | 0.80 | 22.42 |
| 5 | 0.27 | 0.65 | 18.09 |
| 6 | 0.25 | 0.69 | 18.71 |
| 7 | 0.41 | 0.57 | 15.66 |
Considering that the reduction in available viewpoints for training itself leads to a decrease in 3DGS reconstruction quality, our method still achieves good results even with significant viewpoint variation. This validates that our approach can detect inconsistencies between viewpoints and repair those areas despite the large viewpoint differences. However, in extreme cases, the absence of key viewpoints results in a loss of critical complementary information between viewpoints, leading to a significant decline in the reconstruction metrics of the scene.
To validate our advantages in the extreme case with large viewpoint differences, we conducted a quantitative evaluation of various methods for the extreme case mentioned in Figure 8; the results are shown in the following table. It can be seen that our method still outperforms existing methods in removing dynamic distractors under such extreme conditions.
Table 5: Quantitative comparison of different methods in the extreme case.
| Method | LPIPS | SSIM | PSNR |
|---|---|---|---|
| InFusion | 0.23 | 0.78 | 19.34 |
| SPIn-NeRF | 0.15 | 0.82 | 23.33 |
| SpotLess | 0.14 | 0.84 | 24.75 |
| Ours | 0.10 | 0.86 | 26.38 |
To address the concerns, we have added the discussion in Section A.5.
3. Computing cost comparison and analysis
Thank you for the comments. To quantitatively evaluate performance and computational efficiency, we compare our method against baseline approaches (InFusion, SPIn-NeRF, and SpotLess) on the synthetic scene shown in Fig.8. This scene provides ground truth data, enabling evaluation through reference-based metrics for both rendering quality and computational efficiency during optimization. As shown in Table 6, while our method incurs additional computational overhead compared to vanilla 3DGS due to integrating iterations and diffusion models, it achieves superior rendering quality while maintaining comparable efficiency to state-of-the-art 3DGS methods (e.g., SpotLess). Furthermore, our approach demonstrates significantly better reconstruction quality while being approximately 10× faster than leading NeRF-based methods such as SPIn-NeRF.
Our method indeed introduces higher computational resource demands due to the iterative repair process. For one scenario in our dataset, the specific time comparisons are as follows:
Table 6: Quantitative results and time costs on the synthesis data in Fig.8.
| Metric | InFusion | SPIn-NeRF | SpotLess | Ours |
|---|---|---|---|---|
| LPIPS | 0.23 | 0.15 | 0.14 | 0.10 |
| PSNR | 19.34 | 23.33 | 24.75 | 26.38 |
| Time | 16m34s | 7h32m18s | 30m26s | 33m34s |
4. Ablation studies on VISTA-GI and VISTA-CL
Response:
We have conducted ablation experiments on two components of our algorithm using our dataset, and the specific results are as follows:
Table 7: Quantitative ablation study of VISTA-GI and VISTA-CL on the Underwater 3D Inpainting Dataset.
| Method | UCIQE | URanker | CLIP Score |
|---|---|---|---|
| Ours w/o VISTA-GI | 0.48 | 1.52 | 0.70 |
| Ours w/o VISTA-CL | 0.50 | 1.59 | 0.69 |
| Ours | 0.51 | 1.64 | 0.72 |
Our experiments demonstrate two key findings. First, attempting reconstruction using only a 2D generative model (i.e., VISTA-CL) without VISTA-GI leads to significantly degraded image quality metrics. This validates that VISTA-GI's uncertainty guidance effectively mitigates multi-view inconsistencies during 3D reconstruction, resulting in higher-quality outputs. Second, while omitting VISTA-CL maintains image quality comparable to existing methods like SpotLess and SPIn-NeRF, the lack of concept-guided learning significantly reduces CLIP-Score metrics. This indicates that without conceptual constraints, the inpainting process produces results that are visually plausible but semantically inconsistent with the scene context.
To address the concerns, we have added the discussion in Section 5.
5. Visualizations of the views and uncertainty map in static cases
Response:
Thank you for the comments. To address your concern, we visualize the uncertainty map of static scenes in the Appendix section Sec A.2 with Figure 11. It can be observed that our method identifies areas in the rendered image that are inconsistent with other viewpoints. By repairing these high-uncertainty regions, the algorithm gradually becomes more confident in the repaired content, resulting in a more reasonable and cleaner reconstructed scene.
6. Visualizations of depth maps
Response:
Thank you for the comments. Visualizations of the depth maps of static scenes are shown in the Appendix, Sec. A.2, with Figure 11. The deeper the color, the closer the depth. It can be observed that our algorithm generates reasonable content for the target area by incorporating guidance from the repair regions during the reconstruction process.
7. Comparison with GScream
Response:
Thank you for pointing out this related work. While writing our paper, we noted the GScream method; however, since its code was not publicly available at that time, we were unable to conduct a quantitative comparison. During the rebuttal period, we attempted to run GScream on our underwater dataset. Unfortunately, GScream requires preprocessing of the dataset, and the currently available open-source release does not provide a way to run GScream on other datasets, making it difficult for us to conduct a quantitative comparison.
However, we added a discussion about the GScream method in the related work: 'GScream (Wang et al., 2025) focuses on removing objects by introducing monocular depth estimation and employing cross-attention to enhance texture.' Intuitively, GScream incorporates depth estimation to assist in generating consistent content while our method uses the visibility uncertainty to achieve the goal.
Dear Reviewer eMA4,
I hope you are doing well.
I am writing to follow up on our responses to your review. We sincerely appreciate the time and effort you dedicate to providing feedback on our work.
As the deadline for submission is fast approaching, we would be grateful if you could share any updates or additional comments you may have. Your insights are invaluable to us, and we are keen to ensure that our revisions align with your expectations.
Thank you very much for your attention to this matter. We look forward to your reply.
Dear Authors,
Thank you for your responses. However, I remain concerned about several unresolved issues, particularly the uncertainty maps of VISTA in static scenes. Below are my key points:
- Quantitative Comparison with GScream on SPIn-NeRF Dataset: Could you provide a comparison to validate texture consistency, even if limited to the SPIn-NeRF dataset?
- SPIn-NeRF Dataset Blocky Artifacts in Uncertainty Maps (Figure 11): Are these artifacts inherent to your method or the dataset? Additionally, what is their impact on reconstruction quality?
- Discrepancies Between Table 4 and Table 5: Could you clarify whether these represent different datasets or evaluation conditions to address reproducibility concerns?
- Ablation Studies on SPIn-NeRF Dataset: Could you share ablation results for this dataset to better validate VISTA-GI and VISTA-CL contributions?
I look forward to your response to these key points to further understand and evaluate your work.
Reviewer eMA4
Dear Authors,
Thank you for your detailed and comprehensive responses to my previous comments. I truly appreciate the effort and the additional insights you’ve provided. However, I still find that several critical concerns remain unaddressed, leading me to maintain my current score. Below are my key points:
- Quantitative Comparison with GScream on SPIn-NeRF Dataset: While you noted the difficulty in running GScream on your underwater dataset, I believe a quantitative comparison on the SPIn-NeRF dataset, which includes ground truth, would provide a stronger evaluation of texture consistency. Without this comparison, it is challenging to conclusively validate your method's advantages over GScream in scenarios with both masked and non-masked regions.
- Blocky Artifacts in Uncertainty Maps (Figure 11): The uncertainty maps for the SPIn-NeRF dataset (Figure 11) exhibit blocky artifacts. It is unclear whether these artifacts are a byproduct of your method or the dataset itself. More importantly, their impact on the reconstruction quality needs further explanation. Without clarity on this issue, the robustness of your approach is difficult to assess.
- Discrepancies Between Table 4 and Table 5: There is a noticeable mismatch between the quantitative results in Table 4 and Table 5. If these tables represent different datasets or evaluation conditions, this needs to be explicitly clarified. The current presentation raises questions about the consistency and reproducibility of your results.
- Ablation Studies on SPIn-NeRF Dataset: Your ablation studies were conducted on the Underwater 3D Inpainting Dataset. While insightful, the lack of ablation results on the SPIn-NeRF dataset, which includes ground truth, is a missed opportunity. Such experiments would provide more concrete evidence of the contributions of VISTA-GI and VISTA-CL, especially in controlled settings where objective metrics can be thoroughly assessed.
While I commend the efforts made to improve the clarity and comprehensiveness of your work, these unresolved issues leave significant gaps in validating the proposed approach. Therefore, I will maintain my current score, as I believe further revisions are necessary to address these critical aspects.
Reviewer eMA4
Thank you for your attention to our work.
- We compared our method with the data reported in the GScream paper, and the results are as follows:
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| SPIn-NeRF | 15.80 | 0.46 | 0.47 |
| Gaussians Grouping | 14.86 | 0.27 | 0.45 |
| Masked Gaussians | 10.77 | 0.29 | 0.59 |
| InFusion | 15.59 | 0.53 | 0.57 |
| GScream | 15.84 | 0.58 | 0.28 |
| Ours | 16.48 | 0.59 | 0.42 |
It can be seen that our method outperforms all related methods on two metrics, while on one metric (LPIPS) it is behind GScream. Considering that the other methods cannot repair dynamic scenes, while our method applies to both dynamic and static scenes, we believe our method has certain advantages.
- The blocky areas are not artifacts. The explanation is as follows: when calculating uncertainty, our method references the rendered images from adjacent viewpoints, but not every point can be observed from every adjacent viewpoint. Therefore, we need to assess the visibility of spatial points and remove those that are not within a camera's visible space. This operation is reflected in the uncertainty map as blocky areas (which can be considered the intersection of the visible regions from adjacent viewpoints projected onto the current viewpoint); a small geometric sketch of this visibility check is given after this list.
This phenomenon does not lead to a decrease in performance, because such areas are merely due to the inconsistent number of samples from adjacent viewpoints for different points. Essentially, they are effective samples of the true distribution, and the estimated uncertainty/variance is also valid. This phenomenon can be resolved by increasing the number of samples from adjacent viewpoints, but that would introduce more computational cost.
- Tables 4 and 5 were tested on the same synthetic underwater dataset, and the visual content is shown in Figure 8. Table 4 compares the time required for different methods, while Table 5 compares the results obtained using different resolutions during the reconstruction process. Since it is the same scene, our optimal results are consistent.
- We conducted ablation experiments on our method using the SPIn-NeRF dataset; the results are as follows:
| Method | PSNR | SSIM | LPIPS |
|---|---|---|---|
| Ours w/o VISTA-GI | 13.34 | 0.27 | 0.51 |
| Ours w/o VISTA-CL | 14.75 | 0.28 | 0.48 |
| Ours | 16.48 | 0.59 | 0.42 |
It can be seen that VISTA-GI has a more significant impact on the reconstructed scene than VISTA-CL. We attribute this to the latter's reliance on the uncertainty produced by the former. When VISTA-GI cannot generate high-quality uncertainty maps, the uncertainty can no longer help exclude interference during the reconstruction process, and the subsequent VISTA-CL also struggles to identify the areas that need repair, resulting in further performance degradation. The absence of VISTA-CL has a relatively smaller impact, but due to the loss of semantic constraints, the validity of the generated areas is harder to ensure, leading to a decline in performance.
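As referenced in the blocky-areas point above, the sketch below illustrates the kind of frustum visibility check being described: a 3D point contributes to the uncertainty estimate from a reference view only if it projects inside that view's image. The pinhole-camera formulation and all names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def in_view_mask(points_world, world_to_cam, K, img_hw):
    """Mark which 3D points fall inside a reference camera's image frustum
    (pinhole model; a sketch of the visibility check described above)."""
    H, W = img_hw
    pts_h = np.concatenate(
        [points_world, np.ones((points_world.shape[0], 1))], axis=1)  # (N, 4)

    cam = (world_to_cam @ pts_h.T)[:3]          # (3, N) points in camera frame
    proj = K @ cam                              # (3, N) homogeneous pixels
    with np.errstate(divide="ignore", invalid="ignore"):
        u, v = proj[0] / proj[2], proj[1] / proj[2]

    # A point is usable for uncertainty estimation only if it lies in front
    # of the camera and projects inside the image bounds.
    return (cam[2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

# Intersecting such masks over all adjacent views yields the block-shaped
# valid region that appears in the uncertainty maps of Figure 11.
```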
Thank you once again for your attention and valuable suggestions for our work. We look forward to your recognition of our efforts.
To all reviewers:
Thank you very much to all the reviewers for their constructive and responsible feedback. In this paper, we present an innovative method for removing dynamic and static interference in 3DGS reconstruction. This approach combines geometric information from three-dimensional reconstruction with semantic information from concept learning, facilitating better scene reconstruction of real-world environments. We are delighted that Reviewer M4f5 emphasized the "Originality" of our method and its "Significance" in the field of virtual reality. Reviewer eMA4 described our framework as "novel," highlighting its "flexibility" across different scenarios. Reviewer qvjD noted that our framework is "general" and considered our experiments to be "sufficient." Reviewer VfkU not only highlighted the "novel" use of uncertainty in our method but also pointed out that it achieved "Strong quantitative results." While some reviewers expressed concerns about the details of our method and its performance in more scenarios, we believe our rebuttal can address these concerns. Below are some key responses, and we will provide more detailed replies in our responses to each reviewer.
List of Changes
According to the comments of all four reviewers, we have made the following major revisions to our paper.
- We have added the ablation study on VISTA-GI and VISTA-CL in Sec 5.2 to demonstrate the roles of these two components in our algorithm, in response to Reviewer eMA4 and Reviewer qvjD.
- We have added a discussion in Sec 5.3 to compare the computing cost with other SOTA methods, in response to Reviewer M4f5, Reviewer eMA4, and Reviewer VfkU.
- We have added a discussion in Sec A.1 to clarify the impact of the mask on inpainting, in response to Reviewer M4f5.
- We have presented more varied types of visualization results for our method in Sec A.2, in response to Reviewer M4f5 and Reviewer VfkU.
- We have added two experiments and accompanying discussions in Sec A.3 to clarify the hyperparameters related to iteration, in response to Reviewer M4f5 and Reviewer VfkU.
- We have added a subsection in Sec A.4 to discuss the impact of different resolutions during the training and inference process of the diffusion model, in response to Reviewer M4f5.
- We have added ablation studies and comparisons on cases with large viewpoint differences in Sec A.5, in response to Reviewer eMA4.
The paper introduces VISTA, a framework for 3D Gaussian inpainting that integrates visibility uncertainty to enhance inpainting. While the method demonstrates technical novelty and competitive performance, several key concerns from reviewers remain unresolved. These include limited real-world evaluation, insufficient analysis of key parameters, lack of comprehensive ablation studies, and missing valid comparisons with relevant baselines such as GScream. I would recommend that the authors consider all of these points and further improve the work.
Additional Comments from Reviewer Discussion
The reviewers raised concerns about texture consistency, evaluation on real-world use cases, and the lack of comparisons with strong baseline models. Despite several rounds of discussion, these issues remain unresolved.
Reject