PaperHub

Overall: 6.8 / 10 · Poster · 4 reviewers (min 5, max 8, std 1.3)
Ratings: 8, 8, 5, 6
Confidence: 4.3 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.5

ICLR 2025

Flow Distillation Sampling: Regularizing 3D Gaussians with Pre-trained Matching Priors

OpenReview · PDF
Submitted: 2024-09-26 · Updated: 2025-05-17

Abstract

Keywords

3D Vision · Differentiable Rendering · 3D Gaussian Splatting

Reviews and Discussion

Review (Rating: 8)

In this paper, Flow Distillation Sampling (FDS) is proposed to improve the geometric accuracy and rendering quality of 3D Gaussian Splatting. FDS first adopts a camera sampling scheme to sample unobserved views near the training views, and then uses the flow predicted by a pre-trained model to guide the flow calculated from the 3DGS geometry.

Strengths

  1. The ideas are intuitive, and the paper is well written and easy to understand.
  2. FDS leverages the matching prior to mitigate the overfitting problem and enhance the geometry.

Weaknesses

  1. More ablation studies are needed.
  • The depth-adaptive radius is calculated by Eq. (8). How is the value of the hyperparameter $\sigma$ determined? Is FDS robust to different $\sigma$?
  • Does FDS require the normal consistency loss $\mathcal{L}_{n}$? In Table 3, how is the performance of “2D-GS+FDS”?
  • How is the FDS weight $\lambda_{fds}$ determined?
  • How is the start iteration (e.g., 15,000) of applying FDS determined?

  2. Lack of visual comparison on the ScanNet dataset.

Questions

See Weakness.

Comment

[Q1]: How to determine the value of the hyperparameter $\sigma$?

[A1]: Thank you for your question. We have derived the physical meaning of this hyperparameter $\sigma$: it represents the average radius of the 2D flow between the current input view and the unobserved sampled view. Therefore, this parameter directly corresponds to a measurable property and can be set to the flow length that the optical flow model predicts most accurately, which is typically determined by the model's training data. This parameter can be shared across different types of datasets, including the DTU dataset (an object-centered dataset with dense sampling) and the Mushroom dataset (an indoor dataset with sparse sampling). We have updated the results of FDS in our paper.

Derivation:

As noted in [1], the rotational component of the flow in image warping is independent of depth. We can therefore set the rotation to the identity matrix, so that the transformation between the input view $i$ and its sampled view $s$ is a pure translation, and Eq. (4) in our paper becomes:

$$D^s(u_2, v_2) \begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = D^i(u_1, v_1) \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix} + K \begin{bmatrix} t_1 \\ t_2 \\ t_3 \end{bmatrix}$$

where $K$ is the intrinsic matrix of the camera. Solving the above equation gives:

$$\begin{bmatrix} u_2 \\ v_2 \end{bmatrix} = \begin{bmatrix} \frac{D^i(u_1, v_1)\, u_1 + f_x t_1 + c_x t_3}{D^i(u_1, v_1) + t_3} \\ \frac{D^i(u_1, v_1)\, v_1 + f_y t_2 + c_y t_3}{D^i(u_1, v_1) + t_3} \end{bmatrix}$$

We set $t_3 = 0$ in our camera sampling scheme and assume the camera intrinsics satisfy $f_x \approx f_y = f$. The radiance flow $F^{i\rightarrow s}(u_1, v_1) = \begin{bmatrix} u_2 - u_1 \\ v_2 - v_1 \end{bmatrix}$ from the training view $i$ to its sampled view $s$ is then:

$$F^{i\rightarrow s}(u_1, v_1) = \begin{bmatrix} \frac{f}{D^i(u_1, v_1)} t_1 \\ \frac{f}{D^i(u_1, v_1)} t_2 \end{bmatrix}$$

We aim to keep the value of $\|F^{i\rightarrow s}(u_1, v_1)\|_2$ constant for a pixel $x = (u_1, v_1)$ during each camera sampling. Setting $\|F^{i\rightarrow s}(u_1, v_1)\|_2 = \sigma$, we get:

$$\epsilon_t = \sqrt{t_1^2 + t_2^2} = \sigma \frac{D^i(u_1, v_1)}{f}$$

Thus, the translation radius in our camera sampling is defined as $\epsilon_t = \sigma \frac{D^i(u_1, v_1)}{f}$, which helps maintain stable flow. The parameter $\sigma$ can be tuned as a hyperparameter. Since pixel depths vary within an image, we use the mean depth $\bar{D}^i$ of the image and set the translation radius to $\epsilon_t = \sigma \frac{\bar{D}^i}{f}$. This yields $\|F^{i\rightarrow s}(u_1, v_1)\|_2 = \sigma$ on average, demonstrating that $\sigma$ represents the average radius of the 2D flow between the current input view and its unobserved sampled view.
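For concreteness, here is a minimal NumPy sketch of this depth-adaptive sampling radius (the function name and the uniform direction sampling are our assumptions; the paper's exact camera sampling scheme is given by Eq. (8)):

```python
import numpy as np

def sample_translation(mean_depth, focal, sigma, rng=None):
    """Sample an in-plane camera translation (t3 = 0) whose induced 2D flow
    has an average magnitude of sigma pixels, per the derivation above."""
    rng = rng or np.random.default_rng()
    eps_t = sigma * mean_depth / focal       # translation radius: eps_t = sigma * D_bar / f
    theta = rng.uniform(0.0, 2.0 * np.pi)    # random direction in the image plane
    return np.array([eps_t * np.cos(theta), eps_t * np.sin(theta), 0.0])

# Example: f = 600 px, mean depth 2.0 m, sigma = 23 px  ->  eps_t is about 0.077 m
t1, t2, t3 = sample_translation(mean_depth=2.0, focal=600.0, sigma=23.0)
```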

[Q2]: Is FDS robust to different radii?

[A2]: As noted in A1, the radius parameter represents the average radius of the 2D flow between the current input view and its unobserved sampled view, so FDS is robust to different radii within the range covered by the training data of the optical flow model. We added an experiment on the Mushroom dataset to validate this statement. The results are shown below:

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS + FDS (σ=11.5) | 0.0765 | 0.0593 | 0.0679 | 0.8107 | 0.6469 | 0.0574 | 23.94 | 0.8244 | 0.2633 |
| 2DGS + FDS (σ=23) | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |
| 2DGS + FDS (σ=30) | 0.0594 | 0.0539 | 0.0566 | 0.8089 | 0.7023 | 0.0571 | 24.09 | 0.8276 | 0.2609 |

It can be observed that both increasing and decreasing the radius allow FDS to achieve a consistent level of improvement.

[1]. Bian, Jia-Wang, et al. "Auto-Rectify Network for Unsupervised Indoor Depth Estimation." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.12 (2021): 9802-9813.

Comment

[Q3]: Does FDS require the normal consistency loss?

[A3]: The normal consistency loss is not an essential component of FDS. It is a regularization loss introduced in 2DGS that helps align the splats' normals with the gradients of the depth map. FDS also improves the reconstruction without this loss. We add the results of 2DGS (w/o nc loss) and 2DGS + FDS (w/o nc loss) on the Mushroom dataset:

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS + FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |
| 2DGS (w/o nc loss) | 0.1643 | 0.0904 | 0.1273 | 0.6982 | 0.3853 | 0.1165 | 23.80 | 0.8189 | 0.2627 |
| 2DGS (w/o nc loss) + FDS | 0.0774 | 0.0473 | 0.0624 | 0.7778 | 0.6527 | 0.0578 | 24.32 | 0.8290 | 0.2541 |

[Q4]: In Table 3, how is the performance of “2D-GS+FDS”?

[A4]: The results of “2D-GS+FDS” compared with other prior information on the Mushroom dataset are shown in Table 3 of the updated paper.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS+Depth | 0.0862 | 0.0702 | 0.0782 | 0.8153 | 0.5965 | 0.0672 | 23.92 | 0.8227 | 0.2619 |
| 2DGS+Normal | 0.0939 | 0.0637 | 0.0788 | 0.8359 | 0.5782 | 0.0768 | 23.78 | 0.8197 | 0.2676 |
| 2DGS+FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |
| 2DGS+Depth+FDS | 0.0561 | 0.0519 | 0.0540 | 0.8295 | 0.7282 | 0.0454 | 24.22 | 0.8291 | 0.2570 |
| 2DGS+Normal+FDS | 0.0529 | 0.0450 | 0.0490 | 0.8477 | 0.7430 | 0.0443 | 24.10 | 0.8283 | 0.2590 |
| 2DGS+Depth+Normal | 0.0695 | 0.0513 | 0.0604 | 0.8540 | 0.6723 | 0.0523 | 24.09 | 0.8264 | 0.2575 |
| 2DGS+Depth+Normal+FDS | 0.0506 | 0.0423 | 0.0464 | 0.8598 | 0.7613 | 0.0403 | 24.22 | 0.8300 | 0.0403 |

Our “2DGS + FDS” achieves better performance than using any other single prior individually. In combinations with other prior injections, FDS makes the most significant contribution to the performance improvement and can be effectively combined with other priors to achieve enhanced results.

[Q5]: How to determine the weight for FDS?

[A5]: Similar to other methods supervised by prior information, this weight is determined by the ratio between the different loss terms. We set the weight of the FDS loss to approximately double that of the L1 image loss.
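For illustration, here is a minimal sketch of the resulting loss combination (the function and variable names, and the exact weight value, are our assumptions, chosen to reflect the roughly 2:1 ratio described above):

```python
def total_loss(l1_image_loss, fds_term, lambda_fds=2.0):
    # Hedged sketch: the FDS loss is weighted at roughly twice the L1 image
    # loss, per the reply above; other regularizers are omitted here.
    return l1_image_loss + lambda_fds * fds_term
```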

Comment

[Q6]: How to determine the start iteration (e.g., 15,000) of applying FDS?

[A6]: We set the start iteration of the FDS loss to 15,000 to maintain the same number of points as the baseline for a fair comparison (point densification stops after 15,000 iterations). We tested different start iterations on the Mushroom dataset, and the results are presented below:

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS + FDS (15000) | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |
| 2DGS + FDS (10000) | 0.0590 | 0.0602 | 0.0596 | 0.8053 | 0.6932 | 0.0545 | 24.21 | 0.8300 | 0.2579 |
| 2DGS + FDS (7000) | 0.0640 | 0.0538 | 0.0589 | 0.8059 | 0.6873 | 0.0584 | 24.13 | 0.8278 | 0.2583 |

In our experience, incorporating the FDS loss once the training PSNR stabilizes yields the best results. We set the start iteration of the FDS loss to 15,000 both to maintain the same number of points as the baseline and because the training PSNR has stabilized by then.

[Q7]: Lack of visual comparison on the ScanNet dataset.

[A7]: Thank you for your suggestion. We have added a visual comparison of ScanNet in Figure 3 of the updated paper.

Comment

Thank you for your detailed rebuttal. I recommend acceptance of the paper.

Comment

Thank you for your comments and support.

Review (Rating: 8)

This paper aims to improve the 3D Gaussian Splatting reconstruction quality in regions with sparse or no observational input views by integrating a pre-trained matching prior into the 3DGS optimization process. The matching prior is incorporated through optical flow from a pre-trained model, which supervises the Radiance Flow calculated using 3DGS-rendered depth. Additionally, the authors introduce a Flow Distillation Sampling scheme to efficiently sample unobserved camera views around the input views. The proposed Flow Distillation Loss effectively avoids the scale ambiguity present in monocular priors. The authors present clear ablation studies and quantitative improvements to support their claims.

Strengths

  • The idea is simple but makes intuitive sense. Matching priors can provide absolute scale information, in contrast to monocular priors. This paper offers a promising direction of using pair-wise prior information to advance sparse-view reconstruction.
  • The ablation study clearly demonstrates the improvement gained from using this pair-wise matching prior without scale ambiguity. The quantitative results in Table 3 validate the advantage over the monocular depth prior, and even over the multi-view depth prior. A clear visualization of the mutual refinement of the two flows is also provided.

Weaknesses

  • This paper lacks evaluation on widely used geometry reconstruction and novel view synthesis benchmarks such as DTU, Tanks and Temples, and MipNeRF 360. The advantages of the proposed method would be more convincing if the authors could present results on one or more of these benchmarks.
  • As mentioned in line 252, both the Prior Flow and Radiance Flow suffer from inaccuracies, raising concerns about the stability of the benefits provided by the proposed Flow Distillation Loss. It is possible that this loss could introduce artifacts or incorrect guidance due to bias. While the metrics in Table 3 appear strong, it’s unclear why the loss significantly outperforms multi-view depth supervision, which does not suffer from inaccurate prior flow. More explanation and analysis are needed to clarify this point.

Questions

  • Given that RAFT is computed at every time step, how does the training time for 2DGS + FDS compare to 2DGS?
  • Why do both mono-depth and multi-view depth seem to only worsen the results, as shown in Table 3?
  • It is said in line 316 that the normal prior is introduced for evaluation on the ScanNet dataset. But the metrics in Table 2 and Table 3 seem inconsistent regarding the results of 2DGS + FDS. Does this mean the ScanNet result in Table 2 doesn't use the normal prior?
  • Can 2DGS + FDS outperform 2DGS + Normal in Table 3?
Comment

[Q1]: This paper lacks evaluation on widely used geometry reconstruction and novel view synthesis benchmarks such as DTU, Tanks and Temples, and MipNeRF 360.

[A1]: Thank you for your advice. Since DTU is an object-level dataset with dense observations and our FDS is primarily designed to mitigate insufficient sampling of observation regions, we did not initially prioritize testing on the DTU dataset. We have added the results of our FDS on DTU (per-scan Chamfer distance, lower is better), shown below. While FDS performs better under sparse observations, it still achieves notable improvements on the DTU dataset with dense observations.

| Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.48 | 0.86 | 0.36 | 0.43 | 0.90 | 0.94 | 0.80 | 1.27 |
| 2DGS+FDS | 0.51 | 0.85 | 0.36 | 0.43 | 0.79 | 1.00 | 0.77 | 1.23 |

| Method | 97 | 105 | 106 | 110 | 114 | 118 | 122 | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 1.30 | 0.72 | 0.70 | 1.24 | 0.47 | 0.70 | 0.58 | 0.78 |
| 2DGS+FDS | 1.06 | 0.73 | 0.65 | 1.14 | 0.44 | 0.58 | 0.53 | 0.73 |

[Q2]: While the metrics in Table 3 appear strong, it's unclear why the loss significantly outperforms multi-view depth supervision, which does not suffer from inaccurate prior flow. More explanation and analysis are needed to clarify this point.

[A2]: Thank you for your question. First, the Mushroom dataset has limited overlap between input views, making multi-view depth estimation unreliable. In contrast, our prior flow is extracted under a controlled overlap between the input view and a sampled unobserved view, which is more reliable than offline multi-view depth computed from input views with uncertain overlap.

Second, to reduce inaccuracies caused by relatively blurry sampled views, FDS adopts a random sampling strategy. For any single input viewpoint, FDS effectively generates a sufficient number of sampled viewpoints during training. This acts as a form of model ensemble, helping to average out errors. To validate this claim, we conducted experiments using fixed sampling viewpoints on the Mushroom dataset instead of random sampling. The results showed a significant decline in performance, highlighting the importance of this approach.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS + FDS (fixed sampling) | 0.0729 | 0.0617 | 0.0673 | 0.8015 | 0.6312 | 0.0724 | 23.97 | 0.8260 | 0.2623 |
| 2DGS + FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |

[Q3]: Given that RAFT is computed at every time step, how does the training time for 2DGS + FDS compare to 2DGS?

[A3]: We have included our training time analysis in Table 1 of the revised paper. On the Mushroom dataset, adding the FDS loss led to a half-hour increase in training time. Despite this increase, the overall training time remained comparable to other baselines.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 | 0.8h |
| 2DGS + FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 | 1.3h |
Comment

[Q4]: Why do both mono-depth and multi-view depth seem to only worsen the results, as shown in Table 3?

[A4]: We initially tested the scale- and shift-invariant loss used in Robust NeRF [1] for monocular depth supervision. This loss has limited ability to recover absolute depth, which in turn worsens the geometric reconstruction. We have updated our results using a more advanced strategy for monocular depth supervision, as employed in MonoSDF [2]:

$$L_{depth} = \sum_{r \in R} \left\| \left( w^i D^i + q^i \right) - \hat{D}^{i} \right\|^2$$

where $w^i$ and $q^i$ are the scale and shift used to align the mono-depth with the absolute depth of input view $i$, calculated using the least-squares method, and $\hat{D}^{i}$ is the monocular depth prediction. The updated results on the Mushroom dataset in Table 3 are presented below.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS+Depth | 0.0862 | 0.0702 | 0.0782 | 0.8153 | 0.5965 | 0.0672 | 23.92 | 0.8227 | 0.2619 |
| 2DGS+Normal | 0.0939 | 0.0637 | 0.0788 | 0.8359 | 0.5782 | 0.0768 | 23.78 | 0.8197 | 0.2676 |
| 2DGS+FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |
| 2DGS+Depth+FDS | 0.0561 | 0.0519 | 0.0540 | 0.8295 | 0.7282 | 0.0454 | 24.22 | 0.8291 | 0.2570 |
| 2DGS+Normal+FDS | 0.0529 | 0.0450 | 0.0490 | 0.8477 | 0.7430 | 0.0443 | 24.10 | 0.8283 | 0.2590 |
| 2DGS+Depth+Normal | 0.0695 | 0.0513 | 0.0604 | 0.8540 | 0.6723 | 0.0523 | 24.09 | 0.8264 | 0.2575 |
| 2DGS+Depth+Normal+FDS | 0.0506 | 0.0423 | 0.0464 | 0.8598 | 0.7613 | 0.0403 | 24.22 | 0.8300 | 0.0403 |

From the results, the depth-order information provided by monocular depth improves reconstruction accuracy. Meanwhile, our FDS achieves the best single-prior performance, and integrating all three components yields the optimal results. For multi-view depth, we use an L1 loss to supervise the rendered depth. The multi-view depth results remain unreliable due to the limited overlap between input views, which is not accounted for in the model's training data. We measured the average "Abs Rel" of the multi-view depth prior: it is 0.19, worse than the 0.10 achieved by depth rendered from the original 2DGS.
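As a reference for the alignment step above, here is a small NumPy sketch of the closed-form least-squares fit of the scale $w^i$ and shift $q^i$ (which depth plays the source role follows the equation above; the function name and the optional masking are our assumptions):

```python
import numpy as np

def align_scale_shift(src_depth, dst_depth, mask=None):
    """Solve min_{w,q} ||(w * src_depth + q) - dst_depth||^2 in closed form,
    aligning one depth map to another via least squares."""
    s = src_depth.reshape(-1)
    d = dst_depth.reshape(-1)
    if mask is not None:
        valid = mask.reshape(-1)
        s, d = s[valid], d[valid]
    A = np.stack([s, np.ones_like(s)], axis=1)   # [N, 2] design matrix
    (w, q), *_ = np.linalg.lstsq(A, d, rcond=None)
    return w, q
```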

[Q5]: Inconsistency regarding the results of 2DGS + FDS between Table 2 and Table 3.

[A5]: Thank you for your question. In Table 3, we use the Mushroom dataset for the ablation study, while Table 2 presents the results of our model on the ScanNet dataset, which is a different dataset. Therefore, the numbers in Table 2 and Table 3 are not directly comparable.

[Q6]: Can 2DGS + FDS outperform 2DGS + Normal in Table 3?

[A6]: Thank you for your question. Our 2DGS + FDS achieves better performance than 2DGS + Normal. The results on the Mushroom dataset are shown below, and we have added more comprehensive results to Table 3. It is worth noting that prior normals provide more accurate geometric orientations, as indicated by "NC", while our FDS enables more precise geometric positioning, as indicated by the other metrics. These two improvements are largely complementary.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS+FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |
| 2DGS+Normal | 0.0939 | 0.0637 | 0.0788 | 0.8359 | 0.5782 | 0.0768 | 23.78 | 0.8197 | 0.2676 |

[1]. Liu, Yu-Lun, et al. "Robust Dynamic Radiance Fields." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[2]. Yu, Zehao, et al. "MonoSDF: Exploring Monocular Geometric Cues for Neural Implicit Surface Reconstruction." Advances in Neural Information Processing Systems 35 (2022): 25018-25032.

Comment

Thank you for the detailed response and the additional experiments. All my previous concerns have been addressed. However, I agree with Reviewer 148K that the proposed radiance flow supervision appears not to be limited to 3DGS. It would be valuable to explore whether the proposed method can be applied to improve NeRF baselines, e.g., Mip-NeRF 360, by addressing issues like floaters. Incorporating these discussions would strengthen the technical contributions of the paper. I will consider raising my score if the proposed method can also enhance NeRF baselines.

Comment

Thank you for your response. Applying our method to NeRF might lead to improvements. However, our approach requires rendering entire images of neighboring views, whereas NeRF is optimized per ray. Rendering full images in NeRF would drastically increase memory consumption and training time, which in turn makes debugging more challenging and difficult to complete in the limited time available.

3DGS is better suited to our method given its higher rendering quality and speed. We have followed the original suggestion from Reviewer 148K and compared our method with NeRF-based methods, and we plan to explore applying FDS to NeRF soon. Thank you for your suggestion!

Comment

Thank you for your response. The proposed flow supervision method demonstrates potential for various reconstruction tasks with limited observations, as evidenced by the provided experiments. I will recommend accepting this paper.

Comment

We are very delighted to receive your response and suggestion! Thank you for raising the rating and for your support.

Review (Rating: 5)

This paper addresses the geometric under-constraint problem of 3D Gaussian Splatting (3DGS) in sparse-view setups. By incorporating pre-trained matching priors into the optimization process of 3DGS, the method significantly improves both the geometric accuracy and rendering quality of 3DGS.

Strengths

The method proposed in this paper integrates matching priors derived from a pretrained optical flow model to guide the optimization of 3DGS. This approach effectively enhances both the reconstruction quality and rendering quality of existing 3DGS-based methods.

The methodology is well-structured, providing clear and detailed explanations of the proposed FDS technique, the adaptive camera sampling scheme, and the associated loss functions.

Comprehensive experimental evaluations are conducted across multiple datasets, demonstrating the method's effectiveness and robustness.

Additionally, the paper includes interpretive experiments to illustrate the mutual refinement process of the flows, thereby enhancing the understanding of the method's capabilities.

Weaknesses

The results of the proposed method are constrained by the initial quality of the prior flow; however, the reliability of the prior flow cannot be assured under certain sparse-viewpoint configurations.

As noted in the limitations section, the method's reliance on the performance of a pretrained optical flow model restricts its generalizability.

While the authors have conducted experiments across multiple datasets, it is important to point out that these datasets are primarily limited to indoor scenes. It is recommended that the authors evaluate their method on a more diverse range of datasets to assess its applicability in various scenarios.

The paper lacks a discussion on the computational complexity of the method. It is recommended that the authors include a detailed report on the training and inference times of the model in the experimental section, along with comparative metrics against other existing methods.

Questions

In the related work section, the review of existing prior art aimed at improving 3DGS performance should be more comprehensive and clearer, particularly concerning the relevant work on optical flow priors. Additionally, in the subsection on Prior Regulation for Rendering, there are sentences with grammatical errors that require careful review and correction.

In Algorithm 1, there are notation errors that need careful checking and correction.

The comparison of depth reconstruction experiments needs to be supplemented with results from other methods to validate the superiority of the proposed approach in geometric reconstruction.

In the dataset section, the paper mentions that the authors have evaluated their method on the Replica dataset, but the experimental results are not presented.

The authors provide limited comparisons with existing methods; it is recommended that they include more baseline methods for a more robust evaluation.

Comment

[Q4]: It is recommended that the authors include a detailed report on the training and inference times of the model in the experimental section, along with comparative metrics against other existing methods.

[A4]: Thank you for your advice. We have included our training time analysis in Table 1 of the revised paper. On the Mushroom dataset, adding the FDS loss led to a half-hour increase in training time. Despite this increase, the overall training time remained comparable to other baselines. Since FDS is only applied during the training process, it does not affect the inference time.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ | Time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GOF | 0.1812 | 0.1093 | 0.1453 | 0.6292 | 0.3665 | 0.2380 | 21.37 | 0.7762 | 0.3132 | 1.4h |
| PGSR | 0.0971 | 0.1420 | 0.1196 | 0.7193 | 0.5105 | 0.1723 | 22.13 | 0.7773 | 0.2918 | 1.2h |
| 3DGS | 0.1167 | 0.1033 | 0.1100 | 0.7954 | 0.3739 | 0.1214 | 24.18 | 0.8392 | 0.2511 | 0.8h |
| 3DGS + FDS | 0.0527 | 0.0565 | 0.0546 | 0.8178 | 0.6958 | 0.0568 | 24.76 | 0.8486 | 0.2381 | 1.3h |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 | 0.8h |
| 2DGS + FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 | 1.3h |

[Q5]: The review of existing prior art aimed at improving 3DGS performance should be more comprehensive and clearer. There are sentences with grammatical errors that require careful review and correction. In Algorithm 1, there are notation errors that need careful checking and correction.

[A5]: Thank you for your advice. We have carefully reviewed and revised our paper in the related work and methods part. In our updated paper, we used colors to highlight the changes. Specifically, we adopted a more concise and clear discussion approach in the related work section and added a discussion on the application of priors in optical flow models. In the method section (including Algorithm 1), we used more rigorous notation and corrected some errors.

[Q6]: The comparison of depth reconstruction experiments needs to be supplemented with results from other methods; it is recommended to include more baseline methods for a more robust evaluation.

[A6]: Thank you for your advice. We have compared the depth reconstruction results in Table 1 and Table 2 of our paper, where a lower "Abs Rel" metric indicates better reconstruction quality. Compared with GOF, PGSR, and 2DGS, which are SOTA 3DGS-based 3D reconstruction methods, our FDS achieves the best depth reconstruction performance. We chose 2DGS as the baseline method for FDS because it performed best on the Mushroom dataset compared with GOF and PGSR.

[Q7]: In the dataset section, the paper mentions that the authors have evaluated their method on the Replica dataset, but the experimental results are not presented.

[A7]: The experimental results of the Replica dataset are shown in our supplementary materials due to page limitations. We also updated our Replica results using a more advanced optical flow model.

Comment

[Q1]: The reliability of the prior flow cannot be assured under certain sparse viewpoint configurations.

[A1]: Our FDS introduces two strategies to improve the reliability of the prior flow. First, we control the overlap between the input view and its sampled unobserved view: per the derivation in our updated paper, the hyperparameter $\sigma$ in FDS represents the average radius of the 2D flow between the current input view and the sampled view, which ensures the optical flow model operates under suitable overlap and generates accurate optical flow. Second, the random sampling introduced by FDS provides a further advantage: it effectively generates a sufficient number of sampled viewpoints per input view during training, which helps average out errors. We tested FDS with fixed sampling viewpoints instead of random sampling on the Mushroom dataset. The results are shown below:

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS + FDS (fixed sampling) | 0.0729 | 0.0617 | 0.0673 | 0.8015 | 0.6312 | 0.0724 | 23.97 | 0.8260 | 0.2623 |
| 2DGS + FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |

We can see that the random sampling further helps 2DGS improve geometric accuracy.

[Q2]: The method's reliance on the performance of a pretrained optical flow model restricts its generalization.

[A2]: Although our method is limited by the accuracy of the optical flow model, the upper bound of FDS will continue to improve with the emergence of more annotated data and larger models. To validate this, we replaced RAFT with the more advanced SEA-RAFT model in our updated paper, which further enhanced the accuracy of geometric reconstruction.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS + FDS (RAFT) | 0.0689 | 0.0646 | 0.0667 | 0.8042 | 0.6582 | 0.0589 | 23.98 | 0.8255 | 0.2621 |
| 2DGS + FDS (SEA-RAFT) | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |

[Q3]: It is important to point out that these datasets are primarily limited to indoor scenes.

[A3]: To test FDS on more diverse data, we have added results on the DTU dataset, shown below. Although FDS is primarily designed to mitigate insufficient sampling of observation regions, we still achieve improvements on the DTU dataset with dense observations.

| Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.48 | 0.86 | 0.36 | 0.43 | 0.90 | 0.94 | 0.80 | 1.27 |
| 2DGS+FDS | 0.51 | 0.85 | 0.36 | 0.43 | 0.79 | 1.00 | 0.77 | 1.23 |

| Method | 97 | 105 | 106 | 110 | 114 | 118 | 122 | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 1.30 | 0.72 | 0.70 | 1.24 | 0.47 | 0.70 | 0.58 | 0.78 |
| 2DGS+FDS | 1.06 | 0.73 | 0.65 | 1.14 | 0.44 | 0.58 | 0.53 | 0.73 |
Comment

We hope our response has addressed your questions. As the discussion phase is coming to a close, we look forward to your feedback and would like to know if there are any remaining concerns we can address. We would be grateful if you find our revisions satisfactory and would consider raising your score for our paper.

Thank you once again for the time and effort you have dedicated to reviewing our paper.

Best regards

Flow Distillation Sampling Authors

Comment

We would like to sincerely thank you for your valuable feedback and the time you've dedicated to reviewing our paper. As the extended discussion phase is now nearing its end, we would greatly appreciate your feedback on our revisions. Please let us know if there are any remaining issues we can address.

Thank you once again for your support and effort!

Best regards

Flow Distillation Sampling Authors

Review (Rating: 6)

The paper proposes an optical-flow-based regularization for 3D and 2D Gaussian Splatting. It compares the optical flow between an input view and a sampled unobserved view to a so-called radiance flow determined from the camera motion and the reconstructed scene. The authors claim that a loss on the difference between the radiance flow and the optical flow results in improved geometry and view-synthesis quality. Experiments show improved performance of 3DGS and 2DGS with flow distillation on the ScanNet, Mushroom, and Replica datasets.

Strengths

  • The paper describes an interesting idea on incorporating a flow prior for 3D reconstruction with Gaussian Splatting.
  • The related work contains all relevant geometry-reconstruction 3DGS works and sets them in context to the proposed method.
  • The approach is simple and easy to understand, and Figures 1 and 2 are well done and intuitive.

Weaknesses

  • The experimental evaluation only considers indoor room datasets: MuSHRoom, ScanNet, and Replica. Baseline methods usually use more diverse datasets such as DTU [Jensen et al. 2014], Tanks and Temples [Knapitsch et al. 2017], and Mip-NeRF360 [Barron et al. 2022].
  • The authors found that the depth distortion loss in 2DGS degrades the results. However, they do not provide evidence or explanations for this, and it is unclear how it influenced the quantitative comparison to 2DGS.
  • The paper solely focuses on 3DGS-based methods in the related work and in the experimental evaluation. Comparisons to neural-field-based methods such as Geo-NeuS or NeuralAngelo would significantly strengthen the claim of state-of-the-art performance.
  • The overall quality of the presentation should be improved, e.g., the conclusion is unspecific and contains general claims, the capitalization of 'Gaussian' is inconsistent, and $\hat{\alpha}$ in Equation 2 is not explained.

Questions

  • The Radiance Flow maps pixels from the source view to the target view. Considering pixels in the target view containing splatted Gaussians that are occluded in the source view, how is the Radiance Flow computed for these regions?
  • In line 267, how does detaching the optical flow influence the overall performance?

Ethics Concerns

no concerns

Comment

[Q1]: Baseline methods usually use more diverse datasets.

[A1]: Thank you for your advice. We did not initially prioritize testing on the DTU dataset, since DTU is an object-level dataset with dense observations and our FDS is primarily designed to mitigate insufficient sampling of observation regions. To test FDS on more diverse datasets, we have now included results on the DTU dataset, as shown below. While FDS performs better under sparse observations, it still achieves notable improvements on DTU with dense observations.

| Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.48 | 0.86 | 0.36 | 0.43 | 0.90 | 0.94 | 0.80 | 1.27 |
| 2DGS+FDS | 0.51 | 0.85 | 0.36 | 0.43 | 0.79 | 1.00 | 0.77 | 1.23 |

| Method | 97 | 105 | 106 | 110 | 114 | 118 | 122 | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 1.30 | 0.72 | 0.70 | 1.24 | 0.47 | 0.70 | 0.58 | 0.78 |
| 2DGS+FDS | 1.06 | 0.73 | 0.65 | 1.14 | 0.44 | 0.58 | 0.53 | 0.73 |

[Q2]: No evidence or explanation is provided for the depth distortion loss, and it is unclear how this influenced the quantitative comparison to 2DGS.

[A2]: We found that the depth distortion loss tends to negatively impact performance on indoor scenes. To validate this observation, we evaluated "2DGS (with depth distortion loss)" and "2DGS (with depth distortion loss) + FDS". The results on the Mushroom dataset are shown below.

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS (distortion loss) | 0.1225 | 0.1227 | 0.1226 | 0.7621 | 0.4707 | 0.1634 | 22.40 | 0.7916 | 0.3031 |
| 2DGS (distortion loss) + FDS | 0.0768 | 0.0705 | 0.0736 | 0.7999 | 0.6511 | 0.0975 | 23.27 | 0.8106 | 0.2815 |
| 2DGS + FDS | 0.0615 | 0.0534 | 0.0574 | 0.8151 | 0.6974 | 0.0561 | 24.06 | 0.8271 | 0.2610 |

For the DTU results in A1, the depth distortion loss is used by default in both 2DGS and 2DGS+FDS. Our FDS also improves reconstruction results.

[Q3]: Comparisons to neural field based methods such as Geo-NeUS or NeuralAngelo would significantly strengthen the claim of state-of-the-art performance.

[A3]: Thank you for your advice. We are testing NeRF-based methods, such as NeuralAngelo, on the Mushroom dataset. We are running the relevant program and will update the results in about two days.

[Q4]: The overall quality and the presentation should be improved.

[A4]: Thank you for your advice. We have carefully revised the updated paper to improve the presentation quality and fix the identified errors in the method and conclusion sections.

Comment

[Q5]: Considering pixels in the target view containing splatted Gaussians that are occluded in the source view, how is the Radiance Flow computed for these regions?

[A5]: From the source view to the target view, we compute radiance flow using the following equation, based on depth warping:

$$D^n(u_2, v_2) \begin{bmatrix} u_2 \\ v_2 \\ 1 \end{bmatrix} = K T_{m}^{n} K^{-1} D^m(u_1, v_1) \begin{bmatrix} u_1 \\ v_1 \\ 1 \end{bmatrix}$$

Specifically, we back-project the depth map rendered from the input view into 3D space and compute the 2D positions of these 3D points in the sampled view using the relative camera pose. The radiance flow is then obtained by subtracting the original pixel positions from these 2D positions. This method computes the radiance flow for all pixels, including those in self-occluded regions. The prior flow, computed via pixel matching, tends to be inaccurate in occluded regions. However, thanks to the random sampling introduced by FDS, each training viewpoint is paired with many sampled viewpoints over the course of training. This acts as a form of model ensemble and helps average out errors in occluded regions, since pixels self-occluded in one sampled viewpoint can still be observed from others.
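To make this concrete, here is a minimal NumPy sketch of the depth-warping computation described above (array conventions, the z-depth assumption, and all names are ours rather than the paper's code):

```python
import numpy as np

def radiance_flow(depth, K, T_rel):
    """Compute the 2D radiance flow from a source view to a target view by
    back-projecting the rendered (z-)depth and re-projecting with the
    relative pose, then subtracting the original pixel coordinates."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T.astype(float)
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)   # 3D points, source frame
    pts_t = T_rel[:3, :3] @ pts + T_rel[:3, 3:4]          # move to target frame
    proj = K @ pts_t
    uv_t = proj[:2] / proj[2:3]                           # re-projected pixels
    return (uv_t - pix[:2]).T.reshape(H, W, 2)            # per-pixel flow
```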

[Q6]: In line 267, how does detaching the optical flow influence the overall performance?

[A6]: Detaching the optical flow helps reduce training time and improves performance. We tested FDS on the Mushroom dataset without detaching the optical flow: the rendered results tend to become corrupted, and training on the 'honka' scene runs out of memory on a 4090D GPU. We report the compared results on the last four scenes:

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Abs Rel ↓ | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.1078 | 0.0850 | 0.0964 | 0.7835 | 0.5170 | 0.1002 | 23.56 | 0.8166 | 0.2730 |
| 2DGS+FDS (without detach) | 0.1622 | 0.1999 | 0.1811 | 0.6979 | 0.2376 | 0.1680 | 17.23 | 0.6459 | 0.5593 |
| 2DGS+FDS | 0.0567 | 0.0485 | 0.0526 | 0.8202 | 0.7188 | 0.0509 | 24.47 | 0.8377 | 0.2547 |
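To make the role of the detach explicit, here is a minimal PyTorch-style sketch of the flow distillation loss with a stop-gradient on the prior flow (the L1 penalty and the names are our assumptions):

```python
import torch

def fds_loss(radiance_flow, prior_flow):
    """Flow distillation loss with the prior flow detached: gradients
    propagate only through the radiance flow (i.e., the rendered depth),
    not back into the optical flow network."""
    return torch.abs(radiance_flow - prior_flow.detach()).mean()
```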
Comment

[Q3]: Comparisons to neural field based methods such as Geo-NeUS or NeuralAngelo would significantly strengthen the claim of state-of-the-art performance.

[A3] (Supplement): We evaluated the geometric reconstruction performance of the NeuralAngelo method on the Mushroom dataset. We used the original method's data preprocessing script, set the scene type to "indoor," and processed the ground truth camera poses from Mushroom along with the initial point cloud from other experiments in this paper to obtain the bounding sphere of interest. The experimental parameters were configured exactly as in the authors' paper for the TnT dataset, with a batch size of 16 and 500k iterations.

Since the NeuralAngelo method was not designed to reconstruct scene geometry from sparse viewpoints, its performance on Mushroom was suboptimal. Specifically, in the “sauna” scene, using the default preprocessing script caused the NeRF model to diverge. Following the recommendations of the NeuralAngelo authors, we tried adjusting the radius of the bounding sphere of interest. While the model no longer diverged, it tended to overfit to the input viewpoints during training.

To ensure a fair comparison, we exclude the 'sauna' scene, where NeuralAngelo's reconstruction failed, and compare the average results of the remaining four scenes, as shown in the table below:

| Method | Acc ↓ | Comp ↓ | C-L1 ↓ | NC ↑ | F-Score ↑ | Time ↓ |
| --- | --- | --- | --- | --- | --- | --- |
| NeuralAngelo | 0.1016 | 0.0946 | 0.0981 | 0.6505 | 0.4480 | >128h |
| 2DGS | 0.1076 | 0.0826 | 0.0951 | 0.7788 | 0.5326 | 0.8h |
| 2DGS+FDS | 0.0635 | 0.0505 | 0.0570 | 0.8102 | 0.7064 | 1.3h |
Comment

We hope our response has addressed your questions. As the discussion phase is coming to a close, we look forward to your feedback and would like to know if there are any remaining concerns we can address. We would be grateful if you find our revisions satisfactory and would consider raising your score for our paper.

Thank you once again for the time and effort you have dedicated to reviewing our paper.

Best regards

Flow Distillation Sampling Authors

Comment

Thank you for the efforts in the rebuttal. You addressed most of my concerns with the DTU experiment, the comparison to NeuralAngelo, and the ablation on the optical flow gradients. For the comparison on DTU you could further compare to NeRF-based methods. All in all, I raise my score to a borderline accept (6).

Comment

Thank you for your response and for improving your rating. We’re delighted to receive and follow your suggestions.

Comment

We sincerely thank all the reviewers for their constructive and insightful advice. We are encouraged that all reviewers consider the proposed FDS intuitive and interesting.

  • Reviewer w1zd: The idea is simple but makes intuitive sense. This paper offers a promising direction of using pair-wise information prior to advance sparse view reconstruction.
  • Reviewer LTym: The ideas are intuitive, the paper is well written and easy to understand.
  • Reviewer 148K: The paper describes an interesting idea on incorporating a flow prior for 3D reconstruction with Gaussian Splatting.
  • Reviewer GPCw: This approach effectively enhances both the reconstruction quality and rendering quality of existing 3DGS-based methods.

Meanwhile, we acknowledge the reviewers' primary concern that this paper lacks evaluation on some widely used geometry reconstruction datasets, such as DTU. Since DTU is an object-level dataset with dense observations and our FDS is primarily designed to mitigate insufficient sampling of observation regions, we did not initially prioritize testing on DTU. We have now included results of our FDS on the DTU dataset in our supplementary material, as shown below. While FDS performs better under sparse observations, it still achieves notable improvements on DTU with dense observations.

| Method | 24 | 37 | 40 | 55 | 63 | 65 | 69 | 83 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 0.48 | 0.86 | 0.36 | 0.43 | 0.90 | 0.94 | 0.80 | 1.27 |
| 2DGS+FDS | 0.51 | 0.85 | 0.36 | 0.43 | 0.79 | 1.00 | 0.77 | 1.23 |

| Method | 97 | 105 | 106 | 110 | 114 | 118 | 122 | mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2DGS | 1.30 | 0.72 | 0.70 | 1.24 | 0.47 | 0.70 | 0.58 | 0.78 |
| 2DGS+FDS | 1.06 | 0.73 | 0.65 | 1.14 | 0.44 | 0.58 | 0.53 | 0.73 |

We have updated the submitted paper based on each reviewer's suggestions. The main updates include:

  • We replaced the original RAFT model with a more advanced optical flow model, SEA-RAFT, enabling FDS to achieve a significant improvement in its performance. The experimental results presented in the submitted paper have all been updated accordingly.
  • We have updated Table 3 to include more comprehensive ablation experiments on prior supervision.
  • We have added detailed derivations to the paper to clarify the physical meaning of the hyperparameter $\sigma$, providing valuable insights for its configuration.

We have carefully addressed the additional concerns raised by each reviewer in the corresponding official comments. Finally, we extend our heartfelt gratitude to all reviewers and remain open to further suggestions to improve all aspects of our work.

AC Meta-Review

The paper presents a method that integrates pretrained matching priors to guide the optimization of 3D Gaussian Splatting (3DGS). This approach leverages pretrained geometric knowledge to effectively enhance both the reconstruction and rendering quality of 3DGS methods. Incorporating a flow prior into 3D reconstruction with Gaussian Splatting is interesting and intuitive. The primary concern in the original submission was the lack of experiments on popular datasets such as DTU and Mip-NeRF 360. The rebuttal addressed this by including additional experiments on the DTU dataset, strengthening the evaluation of the proposed method.

Additional Comments on Reviewer Discussion

Several issues were raised in the initial reviews, and the rebuttal effectively addressed most of them.

A shared concern is about the limitations of the datasets. In particular, there is a lack of evaluation of some widely used geometry reconstruction datasets, such as DTU. The rebuttal addressed this by including additional experiments on the DTU dataset, where the proposed method demonstrated notable improvements. Additionally, at the reviewer's suggestion, the rebuttal provided comparisons with neural field-based methods, further strengthening the evaluation.

Another concern was the method's dependency on the performance of the pretrained optical flow model. The rebuttal argued that advancements in optical flow models would continuously benefit the proposed method. This was demonstrated by replacing RAFT with SEA-RAFT, which improved performance for the proposed method, illustrating its adaptability to better flow models.

The rebuttal effectively addressed most of the concerns raised during the review process. Most reviewers were optimistic about the paper by the end of the discussion stage.

Final Decision

Accept (Poster)