Grouped Correlation Aggregation with Propagation for Stereo Matching
摘要
评审与讨论
This work introduces a new stereo matching method designed to address inefficiencies in existing iterative optimization-based stereo approaches. On the top of RAFT-Stereo, a new iterative updater based on PatchMatch Stereo is presented for better convergence. Also, the contour-aware aggregation (CA) was proposed by modifying multi-pass dynamic programming used in the semi-global matching (SGM). The grouped window-shifting (GWS) is further presented by considering that a large portion of matching costs are never accessed in the stereo matching process. For a faster inference on video stream, authors introduce a simple trick of using previous frame's estimate as an initial disparity at current frame.
优点
- The deformable convolution for approximating dynamic programming-based cost aggregation and the multi-level propagation updater seem to be effective according to experimental results.
缺点
-
The contour-aware aggregation (CA) needs clarifications. The original dynamic programming formulation of (2) is approximated as (4), when starting point and endpoint of the same object contour are given. Rather than predicting them exhaustively, authors propose to use deformable convolution. Though the overall idea is understandable, this part should be revised in a completely different form. In (4), what does '1' means? This may be a starting point . In (5), 'a' seems to be a pixel of deformable convolution, but it is not defined correctly. Are and learnable?
-
Multi-level propagation updater in Section 3.4 needs more details.
- should be explained for completeness.
- Figure 6: what do and dimensions mean? And why is the green area isolated with multiple parts?
- PatchMatch Stereo uses current estimated disparities of neighboring pixels as disparity candidates during an iterative optimization. Is a similar method used in the proposed method?
- What does the multi-level update mean? Are disparity candidates used from different resolutions?
- 'each scale is swapped with different neighbors': This is hard to understand.
- 'two types of updaters are utilized alternately for updates.': The two types of updaters have not been explained before.
In summary, this work introduces an efficient iterative optimization method, which is based on RAFT-Stereo. The deformable convolution for approximating dynamic programming-based cost aggregation and the multi-level propagation updater seem to be effective according to experimental results. Nevertheless, substantial revisions are required in terms of technical presentation.
问题
See the comments in the weaknesses.
伦理问题详情
N.A.
This is a paper based on RAFT Stereo to improve binocular reconstruction performance, training speed, and memory consumption. The main improvements made are: 1. Replace the original RNN with PatchMatch as the new updator to improve the performance of a single iteration. 2. The window shifting mechanism was used to reduce the volume of the cost volume. 3. SGM based cost aggregation is used to enhance the robustness of the cost volume.
优点
- Conducted experiments on diverse datasets.
- The method has improved the performance of the baseline.
缺点
- The window shifting mechanism is based on the continuity assumption of scene disparity or depth, and then sets a threshold to crop the cost volume. This is more like a trick, rather than an innovative module. And this article does not explain the potential negative impacts that this may bring.
- Lack of novelty in SGM based cost aggregation. The method of cost aggregation has been widely and persistently applied in geometric estimation tasks. However, since this paper has not made many improvements to it, we believe that this cannot be considered as the contribution point of this paper.
- This article lacks an explanation for changing the original RNN uploader to PatchMatch.
- The explanation of some legends is not clear enough. For example, in Fig 6, it is not explained what the different colored bands represent. Although it is mentioned in the main text, it caused difficulties in reading.
- The performance on the actual benchmark differs from the performance described in the paper. Line 454 "Our GCAP Stereo ranks 2 nd place on the bad 1.0 metric and 1 st place on EPE metric respect", but when we checked the actual benchmark, we found that the method is not the first. And there is a lack of comparison with the latest methods, such as CVPR2022 CREStereo and others, CVPR2024 LoS: Local Structure guided Stereo Matching, Confidence Aware Stereo Matching for Realistic Cluttered Scenarios. ICIP 2024 also includes a large number of methods that have not been publicly published as papers. Overall, we believe that ranking first is an exaggeration.
问题
- Why replace the RNN updator with PatchMatch updator, and what are the core advantages?
- Will cutting the cost volume result in performance loss?
This paper proposed a new propagation-based updater, which combines the grouped window shifting mechanism and a contour-aware aggregation modified from semi-global matching (SGM), the proposed GCAP-STEREO has superior cross-domain generalization and real-time performance.
优点
- This paper introduces the grouped window-shifting mechanism to discard the vast majority of invalid points for efficiency and proposes a modified SGM-based cost aggregation method for robustness. GCAP-Stereo ranks 1st on ETH3D and has superior cross-domain generalization and real-time performance.
缺点
- The performance comparison is insufficient. Why is there no comparison of performance on the KITTI leaderboard? KITTI is currently the most common public benchmark available. The Zero-shot generalization comparison in Table 1 also lacks the comparison in KITTI 2012.
- The comparison method of generalization is not sufficient. In my opinion, to prove the generalization of this paper, you should compare some algorithms known for generalization, such as Cfnet: Cascade and fused cost volume for robust stereo matching and CREStereo: Practical stereo matching via cascaded recurrent network with adaptive correlation, only such a sufficient contrast can be convincing enough.
问题
- Can you provide a performance comparison on the KITTI leaderboard and a generalization comparison on the KITTI 2012? Adequate experimental comparisons are crucial for enhancing the persuasiveness of the results.
- Can you provide some comparison to the SOTA solution for generalization? The comparison with the precision scheme may not be convincing enough.
The paper presents grouped correlation aggregation with propagation to improve the performance of single iteration optimization. Specifically, the author has proposed some novel designs based on traditional methods, such as a novel updater based on PatchMatch Stereo, a modified SGM-based cost aggregation, and grouped window-shifting mechanism.
The author claims that the proposed method outperforms existing methods on public benchmarks such as ETH3D and demonstrates advantages in zero-shot generalization and video stream inference.
优点
The motivation is commendable; this paper aims to improve the optimization efficiency of RAFT-Stereo. The author believes that the all-pairs correlation constructed by RAFT-Stereo contains a large amount of redundant information, and therefore introduces the grouped window-shifting mechanism to significantly reduce the cost volume.
Furthermore, considering the matching cost between single pixels, which often leads to frequent occurrences of noise points and matching errors, the author proposes a modified SGM-based cost aggregation method to improve the robustness of the cost.
The author provides a distribution map of disparity ground truth (gt) to give an intuitive explanation.
缺点
-
The proposed solution seems to be not very novel and effective. In fact, IGEV-Stereo also constructs a geometry encoding volume for a small range of disparities (disp < 192 px), filtering out most of the redundant disparities.
-
This performance comparison is quite limited, and the author has missed comparisons with recent methods on both the KITTI benchmark and the Middlebury benchmark, such as Selective-IGEV, CREStereo, Los.
-
The writing quality is poor, and the method description is unclear. The author has not adequately explained how the propagation updater works in Section 3.4.
-
The results on Scene Flow are not convincing. In fact, previous methods typically use the test set to evaluate model performance, while the author only conducts ablation studies on the validation set, which is not convincing.
问题
-
Compared with the geometry encoding volume of IGEV-Stereo, what are the advantages of the grouped window-shifting mechanism in this paper?
-
Please provide the evaluation results on the KITTI benchmark
-
Please include more comparisons with recent methods, such as Selective-IGEV [CVPR 2024], CREStereo [CVPR 2022], Los[CVPR 2024] on the Middlebury benchmark.
-
Please provide a comprehensive comparison with RAFT-Stereo when the number of iterations increases from 1 to 16, so that readers can have a more intuitive understanding.
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.