PaperHub
Overall rating: 7.0 / 10 (Spotlight, 4 reviewers)
Individual ratings: 7, 8, 7, 6 (min 6, max 8, std 0.7)
Confidence: 4.5
Correctness: 3.0
Contribution: 3.0
Presentation: 2.8
NeurIPS 2024

DiffSF: Diffusion Models for Scene Flow Estimation

Submitted: 2024-05-13 · Updated: 2024-11-06

Abstract

Keywords
Scene Flow Estimation, Denoising Diffusion Models, Uncertainty

Reviews and Discussion

Review
Rating: 7

This paper introduces a novel diffusion model designed for scene flow estimation, aiming to enhance both accuracy and robustness, particularly in the presence of noisy inputs or occlusions. The proposed denoising diffusion models effectively handle noisy data by modeling stochastic processes, filtering out sensor-induced noise, and focusing on underlying patterns. This approach not only improves prediction accuracy but also introduces a measure of uncertainty by averaging over multiple hypotheses without altering the training process. Additionally, the paper presents a new architecture that integrates transformers with diffusion models, further boosting accuracy and robustness across various benchmark datasets.

Strengths

  1. The motivation for using diffusion models in scene flow estimation is well-justified. Real-world data often presents challenges such as severe occlusion and noisy inputs, necessitating robust models for scene flow estimation.
  2. The proposed method demonstrates state-of-the-art performance through extensive experiments on multiple benchmarks, including FlyingThings3D, KITTI Scene Flow, and Waymo-Open.
  3. The paper is well-written and easy to understand, making the concepts accessible to readers.
  4. The reproducibility of the results is strong, ensuring that the findings can be reliably replicated.

Weaknesses

The paper lacks qualitative results. I would have liked to see visual comparisons of scene flow results (especially for challenging cases) on various datasets. This makes it hard to interpret where the proposed approach performs better than current methods and, more importantly, where it fails.

Questions

  1. Line 250 states "showcasing DiffSF’s effectiveness on real-world data.." Can the authors provide some challenging samples of data that your model is robust against?
  2. Which datasets are used in generation of results for figures 3 and 4?

Limitations

Authors have presented the limitations of the work (which is a good thing), but this has been mostly discussed in the supplementary section (which is outside of the main paper). I would encourage the authors to include a brief summary of the limitations in the main paper as well.

Author Response

Question 1: Additional visualizations.

We provide additional qualitative results in the attached PDF in the common comment section above. Figure 3 shows a visual comparison between GMSF and DiffSF on the KITTI dataset. The results show that DiffSF is more robust than GMSF when the point clouds are sparse and thus more likely to be noisy. This is consistent with our quantitative results, where DiffSF shows a large improvement in all evaluation metrics, especially in robustness (ACC_S, ACC_R, Outliers). Figure 2 shows a visual comparison between GMSF and DiffSF on the FlyingThings3D dataset. Similarly, DiffSF performs better when the point clouds are sparse (column 1) and when occlusion happens (columns 2 and 3). However, when the occlusion is so severe that the object completely disappears, neither model can give accurate predictions (column 4). Similar results are found on the Waymo dataset (see Figure 4, top row). In the bottom row we show that the proposed DiffSF is robust against various road types on the Waymo dataset, which is very challenging compared to KITTI, with complicated long sequences and fast vehicle motions.

Question 2: Which datasets are used in the generation of results for Figures 3 and 4?

The dataset used to generate Figures 3 and 4 is FlyingThings3D. We will add this to the figure captions.

Question 3: Limitations in the main paper.

Thanks for the comment. We will include the limitations in the main paper.

Review
Rating: 8

This paper deals with scene flow estimation in 3D point clouds. It proposes a formulation based on diffusion models. The model takes the source and target frames of 3D point clouds as the condition and turns the problem into a conditional generation problem. Different from naive conditional generation, the formulation takes the characteristics of the scene flow problem into account so that each reverse diffusion step acts as a "flow adjustment". Experiments are done on standard benchmarks.

Strengths

  • The proposed formulation takes the characteristics of the scene flow problem into account so that each reverse diffusion step acts as a "flow adjustment". It makes a lot of sense and is an elegant solution to this problem.
  • The presentation of this paper is clear and easy to follow.
  • The experiment results significantly outperform existing baselines.
  • Code and visualization are provided.

Weaknesses

  • The authors could consider providing more visualizations of the (reverse) diffusion process other than the only GIF attached in the supplementary. This would be very helpful.
  • Minor: the citation notation seems to be wrong. For example, citing paper 1 should be "[1]", not "(1)".
  • Minor: in Tables 1, 2, and 6, there are numbers in both bold and underline. What do the underlined numbers mean? This should be explained.

Questions

What is the definition of "uncertainty"? I understand "EPE" and "outlier". Is there a definition of "uncertainty" somewhere in the paper?

Limitations

Limitations were mentioned. No other limitations were found besides the ones mentioned in the paper.

Author Response

Question 1: Consider providing more visualization of the reverse process other than the only one GIF attached.

Thanks for the comment. We plan to add more visualizations of the diffusion process other than the visualization in Figure 1 in the main paper. However, due to the page limitation of the rebuttal process, here we only provide one additional visualization in the attached PDF in the common comment section above. More will be added to the supplemental document of the final version.

Question 2: The citation notation seems to be wrong. Citations should use "[]" instead of "()".

Thanks for the comment. We will revise the citations in the paper.

Question 3: What are the meanings of the bold and underlined numbers in Tables 1, 2, and 6? This should be explained.

Thanks for noting that the meanings of the bold and underlined numbers were not explained in the tables. The following explanation will be added to the tables: The bold numbers represent the best performance, and the underlined numbers represent the second-best performance.

Question 4: Definition of "uncertainty".

Good catch! We will include the following definition in the paper: By uncertainty we refer to the epistemic uncertainty, which reflects the confidence the model has in its predictions. In our case, we predict an uncertainty for the prediction of each point. That is, if we have a predicted scene flow vector field of size N × 3, the predicted uncertainty will have a size of N × 1. In practice, each uncertainty is computed as the standard deviation over 20 hypotheses obtained from different initial randomly sampled noise.
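For illustration, a minimal sketch of this per-point uncertainty estimate is given below. It assumes a hypothetical `sample_scene_flow` function that runs one full reverse-diffusion pass and returns an (N, 3) flow prediction; the reduction from the per-coordinate standard deviation to a single value per point is our own simplification and may differ from the paper's exact choice.

```python
import numpy as np

def flow_with_uncertainty(sample_scene_flow, p_source, p_target, num_hypotheses=20):
    """Sample several flow hypotheses from different initial noise and summarize them."""
    # Each call starts the reverse diffusion process from freshly sampled Gaussian noise.
    hypotheses = np.stack(
        [sample_scene_flow(p_source, p_target) for _ in range(num_hypotheses)], axis=0
    )  # shape (K, N, 3)
    flow = hypotheses.mean(axis=0)                                      # (N, 3) final prediction
    uncertainty = hypotheses.std(axis=0).mean(axis=-1, keepdims=True)   # (N, 1) per-point std
    return flow, uncertainty
```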

Review
Rating: 7

This paper introduces DiffSF, which integrates transformer-based scene flow estimation with denoising diffusion models. The diffusion process involves progressively perturbing the ground-truth scene flow vector field by adding Gaussian noise. Conversely, during the reverse process, the scene flow vector field prediction is recovered from randomly sampled Gaussian noise by conditioning on a source and a target point cloud. DiffSF achieves state-of-the-art performance on FlyingThings3D, KITTI Scene Flow, and Waymo-Open.
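To make the summarized formulation concrete, the sketch below shows a standard DDPM-style forward perturbation and conditional reverse sampling loop for a scene flow field. It assumes a noise-prediction (epsilon) parameterization and a generic `denoiser` network conditioned on the source and target point clouds; the variable names and noise schedule are illustrative, not the authors' exact implementation.

```python
import torch

def forward_diffuse(gt_flow, t, alphas_cumprod):
    """Perturb the ground-truth flow field V_0 into V_t with Gaussian noise (training)."""
    noise = torch.randn_like(gt_flow)
    a_bar = alphas_cumprod[t]
    v_t = a_bar.sqrt() * gt_flow + (1.0 - a_bar).sqrt() * noise
    return v_t, noise  # the network learns to predict `noise` from (v_t, source, target)

@torch.no_grad()
def sample_flow(denoiser, p_source, p_target, alphas, alphas_cumprod):
    """Recover a flow field from pure noise, conditioned on both point clouds (inference)."""
    # `alphas` and `alphas_cumprod` are 1-D tensors defining the noise schedule.
    v_t = torch.randn(p_source.shape[0], 3, device=p_source.device)
    for t in reversed(range(len(alphas))):
        eps = denoiser(v_t, p_source, p_target, t)
        a_t, a_bar = alphas[t], alphas_cumprod[t]
        # Each reverse step nudges the noisy field towards a clean flow ("flow adjustment").
        v_t = (v_t - (1.0 - a_t) / (1.0 - a_bar).sqrt() * eps) / a_t.sqrt()
        if t > 0:
            v_t = v_t + (1.0 - a_t).sqrt() * torch.randn_like(v_t)
    return v_t
```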

Strengths

Extending denoising diffusion models for scene flow estimation is a novel approach that is straightforward to understand.

The ablation study is comprehensive, such as evaluating the impact of varying the number of time steps during both training and sampling for the diffusion model.

Weaknesses

The technical contribution is limited. This paper does not introduce a new concept, given the related work on optical flow estimation using diffusion models [1].

To enhance understanding of the diffusion process for scene flow estimation, visualizations of source points with added noise at different time steps should be included.

The study is limited to the GMSF model architecture. It is strongly recommended to evaluate other popular architectures, as the key concept of the paper is to introduce the diffusion model for scene flow estimation. Additionally, it would be valuable to investigate whether this approach works in the context of self-supervised scene flow estimation such as [2][3].

[1] The Surprising Effectiveness of Diffusion Models for Optical Flow and Monocular Depth Estimation

[2] Self-Supervised Robust Scene Flow Estimation via the Alignment of Probability Density Functions

[3] Just Go with the Flow: Self-Supervised Scene Flow Estimation

Questions

It would be beneficial to include a discussion or exploration on the effects of adding noise directly to the ground truth scene flow vector field versus adding noise to the embedding of the ground truth scene flow vector field, as achieved by a separate network such as a VAE model.

Limitations

N/A

Author Response

Question 1: Limited contribution.

Optical flow and scene flow share similar goals, i.e., estimating object movement from sensor data. However, the sensors are completely different and the generated data has different modalities. For RGB cameras, the output images are usually on a regular grid, with RGB values always between 0 and 255, enabling the generation of dense feature maps. On the contrary, for LiDAR sensors, the output point clouds are unstructured and sparse, and the point clouds of consecutive frames can even have different sizes. Due to these inherent differences in data structure, approaches for optical flow estimation cannot be directly transferred to scene flow estimation. For example, the method in [1] takes the noisy label and one or more conditioning images as input; the noisy label and the images are concatenated and then fed into a U-Net, whose output is the learned optical flow. Point clouds cannot be processed with CNNs (unless they are densified, which is a separate computer vision problem), and due to the varying number of points, concatenation over the feature dimension cannot be performed as in [1].

Question 2: The visualizations of source points with added noise at different time steps should be included.

Thanks for the comment. We plan to add more visualizations of the diffusion process other than the visualization in Figure 1 in the main paper. However, due to the page limitation of the rebuttal process, here we only provide one additional visualization in the attached PDF in the common comment section above. More will be added to the supplemental document of the final version.

Question 3: The study is limited to the GMSF model architecture.

We chose GMSF as our baseline as it is currently the most powerful scene flow estimation model, more than 50% better than the second best (DifFlow3D) on the F3D_o dataset in every metric. Note that instead of taking GMSF directly, we also made modifications to the GMSF model to integrate it better into the diffusion framework. For more information about the importance of the architecture modifications, please refer to the reply to reviewer gpDD (Question 3). The table shows that by improving the architecture, the EPE_3D on F3D_o (all points) improves from 0.061 (line 2) to 0.036 (line 4). The EPE_3D on KITTI_o improves from 0.054 (line 2) to 0.029 (line 4), which supports our claim that the modifications we made to the architecture allow for better integration into the diffusion model.

To investigate whether the proposed diffusion model works for self-supervised scene flow estimation, instead of adapting other models into our proposed method, we adapt the proposed method by simply changing the supervision from fully-supervised to self-supervised. The self-supervision consists of two parts: the Chamfer loss and the smoothness loss, which are two common losses employed in self-supervised scene flow estimation.

The Chamfer loss is defined as

L_{Chamfer} = \sum_{x_i\in\mathcal{X}} \min_{y_j\in\mathcal{Y}} \|\hat{x}_i' - y_j\|^2 + \sum_{y_j\in\mathcal{Y}} \min_{x_i\in\mathcal{X}} \|x_i - \hat{y}_j'\|^2,

where x_i and y_j are points from the source and target point clouds, and \hat{x}_i' and \hat{y}_j' are the corresponding transformed (warped) points.

The smoothness loss is defined as

L_{Smoothness} = \sum_{x_i\in\mathcal{X}} \frac{1}{N} \sum_{y_j\in\mathcal{N}(x_i)} \|SF(x_i) - SF(y_j)\|^2,

where \mathcal{N}(x_i) denotes the nearest neighbors of x_i, N is the number of points in the local region, and SF(\cdot) is the predicted scene flow.
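For illustration, a minimal PyTorch sketch of these two losses is given below. It assumes point clouds small enough for dense pairwise distance matrices, and the Chamfer term is written in the common symmetric form between the warped source and the target, which is one possible instantiation of the equation above rather than the authors' exact implementation.

```python
import torch

def chamfer_loss(source, target, flow):
    """Symmetric Chamfer distance between the warped source points and the target points."""
    warped = source + flow                        # warped source: x_i + SF(x_i)
    dist = torch.cdist(warped, target) ** 2       # (N, M) squared pairwise distances
    return dist.min(dim=1).values.sum() + dist.min(dim=0).values.sum()

def smoothness_loss(source, flow, k=8):
    """Penalize flow differences between each point and its k nearest neighbors."""
    dist = torch.cdist(source, source)
    knn_idx = dist.topk(k + 1, largest=False).indices[:, 1:]   # drop the point itself
    neighbor_flow = flow[knn_idx]                               # (N, k, 3)
    diff = flow.unsqueeze(1) - neighbor_flow
    return (diff ** 2).sum(dim=-1).mean(dim=1).sum()
```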

The results are shown in the following table, demonstrating that our proposed approach also works in the context of self-supervised scene flow estimation. An initial attempt without parameter tuning already gives a large performance improvement on the KITTI dataset. The result on FlyingThings3D is also comparable to the state of the art.

Table 1: Self-supervised scene flow estimation on F3D_s and KITTI_s.

| method | F3D_s EPE_3D | F3D_s ACC_S↑ | F3D_s ACC_R↑ | F3D_s Outliers↓ | KITTI_s EPE_3D | KITTI_s ACC_S↑ | KITTI_s ACC_R↑ | KITTI_s Outliers↓ |
|---|---|---|---|---|---|---|---|---|
| PointPWC [4] | 0.1213 | 32.39 | 67.42 | 68.78 | 0.2549 | 23.79 | 49.57 | 68.63 |
| SS-APDF [2] | 0.0750 | 58.90 | 86.20 | 47.00 | 0.0920 | 74.70 | 87.00 | 28.30 |
| SS-SFE [3] | - | - | - | - | 0.1050 | 46.48 | 79.42 | - |
| DiffSF | 0.0796 | 50.35 | 81.49 | 54.10 | 0.0315 | 89.75 | 97.03 | 15.91 |

Question 4: Adding noise to the embeddings instead of the vector field.

We thank the reviewer for this novel perspective, which we had not considered so far. While VAE models are commonly and very successfully used in image generation, their application to scene flow estimation is not commonly studied. Latent representations of data are often global, making them suitable for higher-level tasks like image generation [5], point cloud generation [6], classification [7], etc. However, scene flow estimation is a rather low-level task that focuses on how each point moves instead of understanding the whole scene. This difference in nature suggests that using a VAE to encode high-level point cloud features may not be favorable for scene flow estimation. Nevertheless, we are happy about the suggestion and will consider this option in future research.

Additional References

[4] PointPWC-Net: A Coarse-to-Fine Network for Supervised and Self-Supervised Scene Flow Estimation on 3D Point Clouds

[5] High-Resolution Image Synthesis with Latent Diffusion Models

[6] LION: Latent Point Diffusion Models for 3D Shape Generation

[7] Robust Latent Subspace Learning for Image Classification

Comment

I have reviewed the rebuttal and found that most of my concerns have been adequately addressed. Therefore, I will increase my rating to 'Accept'.

Review
Rating: 6

This paper proposes a scene flow estimation method that estimates scene flow from point clouds. The proposed method combines a previous scene flow method, GMSF, and diffusion model, where a denoising block is introduced based on GMSF. During training, the source point cloud is first warped via noisy scene flow, and then the warped source point cloud and the target point cloud are fed to the denoising block. Experimental results demonstrate the proposed method achieves the best performance compared with state-of-the-art approaches.
In addition, the paper shows interesting results that the proposed method can be used to measure the hardness of input point clouds. That is, given a pair of point clouds, the proposed method can obtain multiple scene flow results using different initial randomly sampled noise, where the uncertainty (standard deviation) of scene flow results is highly correlated with their accuracy.

Strengths

  1. The idea of the proposed method is interesting and sounds good.

  2. The result of uncertainty-error correspondence is interesting, indicating that the proposed method can be used to predict the hardness (estimation difficulty) of the input point cloud pair.

  3. Extensive experimental results show the effectiveness of the proposed method.

Weaknesses

  1. The organization of the technical part can be improved. It is a bit difficult to follow Sec. 3.2. For example, the paper has already introduced the forward process, reverse process and the training loss, and then introduces the details of the reverse process again.

  2. The proposed method uses the same feature extractors with shared weights to extract features from the warped and target point clouds. However, the distributions of the warped point cloud may be rather different from the target one, since the warped point cloud is noisy and the target point cloud is clean. Would such differences negatively affect the training of feature extractors? Similarly, the training of the linear layers in the Local-Global-Cross Transformer may also be negatively affected. If the feature extractor of the warped point cloud does not share weights with that of the target one, would the estimation performance of the proposed method be improved?

  3. Table 7 in the Appendix shows that when GMSF and the proposed method use the same architecture, the proposed method's performance improvement is marginal on F3D_o, compared with GMSF. Does such performance also occur on other datasets ( such as KITTI and Waymo-Open datasets)? It is not clear which part of the proposed method mainly contributes to the performance improvement, the architecture or the denoising process.

  4. Some statements may be a bit overclaimed. The paper claims that "To the best of our knowledge, we are the first to apply a diffusion model successfully for the task of scene flow estimation.". However, as described in the main paper, a few methods have attempted to employ diffusion models for scene flow estimation [24]. The statement of "the first" may be overclaimed and can be replaced with a more proper claim.

Questions

Please refer to my comments above.

Limitations

The paper provides the limitations of the proposed method, while it does not discuss the potential negative societal impact of their work.

Author Response

Question 1: The organization of the technical part can be improved.

The original idea of the organization was to decouple the introduction of the standard diffusion process from our contribution. We first recap the basics of diffusion models in Section 3.1. In Section 3.2 we give an overview of formulating scene flow estimation as a diffusion process, including the training and sampling algorithms. Then we introduce our contribution, the detailed model architecture, and the training loss we employ in Sections 3.2 and 3.3. We thank the reviewer for pointing out that this part can be difficult to follow; we will clarify the organization in the main paper.

Question 2: Will the shared weight in Feature Extraction and linear layers in the Transformer negatively affect the training?

DiffSF uses a global matching model to predict the noise in the reverse process. Since the core of the global matching approach is to match similar points, it makes sense to use shared-weight feature extractors. However, we agree that the distributions of the (noisy) source point cloud and the (clean) target point cloud might differ, and we therefore follow the reviewer's recommendation and try feature extraction without shared weights. The results in Table 1 show that, while having a lower number of parameters, the model with a shared-weight feature extractor performs better than the same model without shared weights.

Table 1: Ablation study on shared weights on F3D_o.

| model | EPE_3D (all) | ACC_S↑ (all) | ACC_R↑ (all) | Outliers↓ (all) | EPE_3D (non-occ) | ACC_S↑ (non-occ) | ACC_R↑ (non-occ) | Outliers↓ (non-occ) |
|---|---|---|---|---|---|---|---|---|
| shared | 0.036 | 93.9 | 95.9 | 5.8 | 0.015 | 96.7 | 98.1 | 3.5 |
| not shared | 0.042 | 92.2 | 95.2 | 7.6 | 0.020 | 95.7 | 97.8 | 4.9 |

Question 3: Which part contributes to the performance improvement, the architecture or the denoising process?

Both the improved architecture and the denoising process contribute to the performance improvement. In Table 2 we extend Table 7 of the supplementary material to make the contributions of the different design choices clearer. The architecture is modified to allow for a better integration into the diffusion model. Table 2 shows that when fitting the original architecture of GMSF directly into the diffusion model, there is a performance drop (EPE_3D from 0.039 to 0.061). By improving the architecture, the performance of GMSF improves from an EPE_3D of 0.039 to 0.037, and the performance of DiffSF improves from an EPE_3D of 0.061 to 0.036, which is significant and demonstrates that the improved architecture integrates better with the diffusion model. The numbers in parentheses mark the relative improvement of DiffSF compared to GMSF with the same architecture. On the F3D_o dataset, the result shows a more than 10% improvement in ACC_S, ACC_R, and Outliers when evaluating all points. An even higher improvement is shown when evaluating only the non-occluded points. For the EPE_3D, there is an improvement of 2.7% over all points and 6.3% over the non-occluded points. Given that FlyingThings3D is close to saturation, we argue that relative improvements are a more reasonable measure than absolute improvements. A similar trend is found when testing the generalization ability on the KITTI_o dataset. Note that the models are only trained on F3D_o and then generalized to KITTI_o. The new training scheme on F3D_o (number of points 4096 and batch size 24) causes some generalization performance degradation on KITTI_o (EPE_3D from 0.033 to 0.089). From Table 3 we can tell that both the improved architecture and the diffusion process contribute to the improvement in performance. Due to limited resources, we did not do an additional ablation study on the Waymo dataset, where we would have to additionally train three different models.

Table 2: Ablation study on F3D_o.

| improved architecture | diffusion | EPE_3D (all) | ACC_S↑ (all) | ACC_R↑ (all) | Outliers↓ (all) | EPE_3D (non-occ) | ACC_S↑ (non-occ) | ACC_R↑ (non-occ) | Outliers↓ (non-occ) |
|---|---|---|---|---|---|---|---|---|---|
|  |  | 0.039 | 92.9 | 95.4 | 6.7 | 0.017 | 96.0 | 97.8 | 4.2 |
|  | ✓ | 0.061 | 84.8 | 92.3 | 16.7 | 0.037 | 88.9 | 95.3 | 13.9 |
| ✓ |  | 0.037 | 93.2 | 95.4 | 6.5 | 0.016 | 96.2 | 97.7 | 4.1 |
| ✓ | ✓ | 0.036 (-2.7%) | 93.9 (+10.3%) | 95.9 (+10.9%) | 5.8 (-10.8%) | 0.015 (-6.3%) | 96.7 (+13.2%) | 98.1 (+17.4%) | 3.5 (-14.6%) |

Table 3: Ablation study on KITTI_o.

| improved architecture | diffusion | EPE_3D | ACC_S↑ | ACC_R↑ | Outliers↓ |
|---|---|---|---|---|---|
|  |  | 0.089 | 71.2 | 82.8 | 29.8 |
|  | ✓ | 0.054 | 80.1 | 92.2 | 23.1 |
| ✓ |  | 0.050 | 85.7 | 91.7 | 18.1 |
| ✓ | ✓ | 0.029 (-42.0%) | 94.5 (+61.5%) | 97.0 (+63.9%) | 13.0 (-28.2%) |

Question 4: Overclaimed statements.

By this sentence we mean that we are the first to apply diffusion models to estimate the full scene flow vector field directly from two point clouds instead of a refinement plug-in module, as in DifFlow3D. However, we agree that this sentence might be misinterpreted and we will therefore rephrase it as: To the best of our knowledge, we are the first to propose using diffusion models to estimate the full scene flow directly from two point clouds.

Question 5: Negative societal impacts.

Potential negative societal impacts might be: As with any other tracking algorithm, scene flow estimation can be used in surveillance scenarios, which might raise privacy concerns and ethical issues. From an ecological perspective, training deep learning models usually takes time and resources, so the environmental impact should be taken into consideration when training and applying such compute-intensive models. However, in our opinion, the positive impact of our proposed approach in practical applications such as robotics and autonomous driving outweighs the negative impact in the applications we consider.

Comment

Thank the authors for addressing my questions.

- It is a bit difficult to follow Sec. 3.2, not the whole Sec. 3. It may be better to first introduce the forward and reverse processes (Eq. 6 and Eq. 8) and then describe the training loss.
- According to the response, the new architecture also contributes to the performance improvement. Can the architecture of GMSF be replaced with this new one? If so, how is the performance?

I will keep my original positive rating.

Comment

Thanks for the reply and the follow up questions.

Question 1: It is a bit difficult to follow Sec. 3.2., not the whole Sec. 3.

Answer: Sorry for misunderstanding your previous question. In Section 3.2 we first introduce the forward and the reverse process in Eq. 4 and Eq. 5. Then we derive how to learn the parameters of the reverse process, i.e., μ_θ(V_t, P_source, P_target), in Eq. 6 and Eq. 7. With the learned parameters we rewrite Eq. 5 as Eq. 8. We hope this helps clarify Section 3.2. We appreciate your comment and will revise Section 3.2 for maximum clarity.

Question 2: Can the architecture of GMSF be replaced with this new one?

Answer: Yes indeed, the architecture of GMSF can be replaced with the new architecture. We did ablation studies in Table 2 on how the performance changes from GMSF (first line) to GMSF with the improved architecture (third line). The performance shows an improvement of 5.1% in EPE_3D over all points (from 0.039 to 0.037). The ablation is done with the same training scheme (number of points 4096, batch size 24) as the proposed method for a fair comparison. However, we did not perform the ablation of GMSF with its original (slightly inferior) training scheme (number of points 8192, batch size 8).

Comment

Thank the authors for addressing my questions. I will increase my score.

Author Response

We thank the reviewers for their valuable comments and questions. We are happy that the reviewers find that using diffusion models in scene flow estimation is novel (qmev), interesting (gpDD), and elegant (yrBG), and that the motivation is well-justified (2g78). Reviewer gpDD also finds the result of uncertainty-error correspondence interesting (gpDD), which enables the estimation of prediction uncertainty and is rarely investigated in previous scene flow estimation approaches. All reviewers agree that the experiments are extensive and ablation studies show the effectiveness of the proposed method. Two reviewers (yrBG, 2g78) comment that the paper is well-written and easy to follow, and appreciate the release of our code to make the results reproducible.

In the following we will address the individual questions.

Comment

Thank you authors for the rebuttal response. I have read the rebuttal and most of my concerns have been addressed. I'll stick with my "Accept" rating here.

Final Decision

The ratings for this paper include one strong accept (ygGr), two accepts (2g78 and qmev), and one weak accept (gpDD). All reviewers acknowledge the idea of extending denoising diffusion models for scene flow estimation. The main concerns are the lack of qualitative results and the organization of the technical section. Most of these concerns were satisfactorily addressed during the rebuttal, so the AC recommends acceptance.