PaperHub
5.8/10
Poster · 5 reviewers
Lowest: 5 · Highest: 7 · Std: 0.7
Ratings: 7, 5, 6, 6, 5
Confidence: 3.8
Correctness: 3.0 · Contribution: 3.2 · Presentation: 2.4
NeurIPS 2024

ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-11
TL;DR

We reduce the time consumption of local feature matching by reducing the number of units that participate in the transformer, while using more complicated homography hypotheses to maintain accuracy.

Abstract

Keywords

local feature matching, 3D vision, pose estimation

Reviews and Discussion

Review (Rating: 7)

This paper proposes an efficient transformer-based local feature matching approach named ETO. ETO consists of three steps: hypothesis estimation, segmentation and refinement. The first and second steps can obtain coarse matching results whose resolution is similar to LoFTR’s coarse results. The third step can provide fine-level matching results. The hypothesis estimation performs the Transformer computation on the 1/32 resolution (rather than the 1/8 resolution in LoFTR), which is the core design that reduces the inference time. Experimental results demonstrate that ETO provides a good balance between accuracy and speed.

Strengths

  1. The motivation of this paper is clear and valuable. Transformer-based approaches are prevalent in the area of feature matching. Nonetheless, it is still an open problem how to design an efficient and accurate transformer-based model.
  2. The technical design of ETO is novel. The manner of predicting the local homography parameters on a very coarse resolution (and then refining them) differs from most of the existing coarse-to-fine transformer-based approaches.
  3. The accuracy of ETO is acceptable, considering its superiority in inference speed.

Weaknesses

1. This paper lacks some important technical details.

1.1 What is the computation manner of the Transformer before the hypothesis estimation step? In line 237, the authors state that "then we perform transformer five times at M_1". Does the transformer contain some cross-attention processes? Or does it only involve the self-attention?

1.2 The computation process of the hypothesis estimation step is confusing. The authors seem to use the notations "M_1, M_2, M_3, f_i^1" to represent the features of both the source and target images. Such definitions make many statements in Sec. 3.2 hard to understand. For example, the meaning of the variable M_1 seems inconsistent in the statements "For each unit i on M_1" (line 144) and "within the neighborhood of the target units on M_1" (line 155).

1.3 The computation architecture of the segmentation step is unclear. The authors should provide the computation details on how to predict the classification confidence from the intermediate features. The statement in lines 177-180 is too brief to understand.

1.4 Why is the classification label in the segmentation step termed as the "pseudo" ground truth (line 181)? To my understanding, the classification label of every unit should be obtained from the ground truth camera pose and depth. It should be the "normal ground truth" rather than the "pseudo ground truth" if the above understanding is correct. The authors should clarify this detail.

2. Some details should be further clarified.

2.1 The statement "while we only need to feed 300" (line 50-51) was not clarified in the subsequent text. Is it the unit number on the 1/32 resolution for a 640x480 input image?

2.2 In the segmentation step, what is the size of output segmentation results for a local input window whose size is 3x3? Is it 4x4? Or 12x12?

2.3 The statement in lines 304-305 should be further discussed. The authors just state that ETO is better on the more difficult YFCC100M without discussing the probable reason.

2.4 The title "Hypotheses Estimation" (line 137) should be "Hypothesis Estimation".

3. Some intermediate results should be visualized.

Some real intermediate results after the hypothesis estimation/segmentation steps should be visualized to show how these steps provide appropriate predictions. The virtual intermediate "results" in Figure 3 are helpful to understand, but the actual results are still necessary.

Questions

Please provide more discussions and experimental results to address the above weaknesses.

Limitations

The authors have discussed the limitations.

Author Response

Thank you for your review and valuable comments.

1.1 What is the computation manner of the Transformer before the hypothesis estimation step? In line 237, the authors state that "then we perform transformer five times at M_1". Does the transformer contain some cross-attention processes? Or does it only involve the self-attention?

No, we only use this strategy at the fine level, because this approach cannot cope with the need to iterate the transformer multiple times; it only manages to improve efficiency without compromising accuracy if it is iterated once. This drawback stems from the fact that it only optimizes the feature used to estimate the refined bias, which is only one out of every 16 features on the source image's M_3 feature map, while the features on the target image are not optimized at all in this uni-directional attention operation. Therefore, this uni-directional attention can only be used in the refinement stage; it is not suitable for stages that require multiple iterations, such as the coarse level.

1.2 The computation process of the hypothesis estimation step is confusing.

We use a and i to represent the units on the target and source images respectively, and a_i to describe the unit on the target image that corresponds to unit i on the source image. Since you find this confusing, we will add an s or t annotation in the upper right corner, for example: f_k^{3s} and f_k^{3t}.

1.3 The computation architecture of the segmentation step is unclear. The authors should provide the computation details on how to predict the classification confidence from the intermediate features. The statement in lines 177-180 is too brief to understand.

Segmentation refers to the classification of each unit: we determine which homography hypothesis should be adopted for unit j on M_2 through classification. The classification result is obtained by comparing the classification scores C_j of unit j for the different hypotheses H_i, where the largest one is the result of our classification operation. This classification uses the concept of multi-label classification, a method widely applied in detection problems; we therefore follow DETR and use focal loss to optimize the segmentation. The classification score can be written as C_{ji} = (T(f_i) + P(i), f_j), where C_{ji} is the matching score of unit j for hypothesis i, T is the function that converts the feature dimension of i (256 dimensions) to the feature dimension of j (128 dimensions), implemented here as a 2D CNN, P(i) is a positional embedding that directly represents the relative position of the unit corresponding to hypothesis i within the local 3×3 units, and (·, ·) denotes the inner product. We will add this part to the supplementary materials.
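A minimal sketch of this scoring step, assuming a linear projection in place of the 2D CNN described above and a learned embedding for P; the module names, shapes, and the pre-gathering of the nine local hypothesis features are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class HypothesisScorer(nn.Module):
    """Sketch of C_ji = <T(f_i) + P(i), f_j> over the 9 local hypotheses."""
    def __init__(self, dim_hyp=256, dim_unit=128, num_hyp=9):
        super().__init__()
        self.T = nn.Linear(dim_hyp, dim_unit)     # stand-in for the 2D CNN T
        self.P = nn.Embedding(num_hyp, dim_unit)  # relative 3x3 positions

    def forward(self, f_hyp, f_unit):
        # f_hyp:  (B, N, 9, 256) hypothesis features gathered around each unit j on M_2
        # f_unit: (B, N, 128)    feature of unit j itself
        idx = torch.arange(f_hyp.size(2), device=f_hyp.device)
        keys = self.T(f_hyp) + self.P(idx)              # (B, N, 9, 128)
        scores = (keys * f_unit.unsqueeze(2)).sum(-1)   # C_j: (B, N, 9)
        return scores

# The adopted hypothesis per unit is then scores.argmax(dim=-1), a label in [0, 8].
```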

1.4 Why is the classification label in the segmentation step termed as the "pseudo" ground truth (line 181)? To my understanding, the classification label of every unit should be obtained from the ground truth camera pose and depth. It should be the "normal ground truth" rather than the "pseudo ground truth" if the above understanding is correct. The authors should clarify this detail.

Here we use the term 'pseudo ground truth' merely because the annotations are computed on the fly. However, considering that 'pseudo-labels' generally refers to labels obtained from the predictions of neural networks, this usage is indeed incorrect and you are right. We will change the term to 'computed ground truth'.

2.1 The statement "while we only need to feed 300" (line 50-51) was not clarified in the subsequent text. Is it the unit number on the 1/32 resolution for a 640x480 input image?

300 indeed refers to the number of units obtained at 1/32 resolution on a 640×480 image. We will change this to: 'Previous methods feed 80×60 tokens to the transformer at 1/8 resolution, while we only need to feed 20×15 at 1/32 resolution' (4800 vs. 300 tokens). This will be easier to understand.

2.2 In the segmentation step, what is the size of output segmentation results for a local input window whose size is 3x3? Is it 4x4? Or 12x12?

The output of the segmentation step is an H/8 × W/8 array with values ranging from 0 to 8, indexing the nine hypotheses in the local 3×3 window. This value determines which local hypothesis is adopted to compute the input for the refinement step.
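A hedged sketch of how such a label map could be consumed, assuming each label indexes the 3×3 window of hypothesis units at 1/32 resolution around a unit's parent cell; the indexing convention and function name are our assumptions for illustration:

```python
import numpy as np

def select_homographies(labels, hyps):
    # labels: (H8, W8) ints in [0, 8], the segmentation output at 1/8 resolution
    # hyps:   (H32, W32, 3, 3) one homography matrix per 1/32-resolution unit
    H8, W8 = labels.shape
    out = np.empty((H8, W8, 3, 3))
    for y in range(H8):
        for x in range(W8):
            dy, dx = divmod(int(labels[y, x]), 3)  # offset within the 3x3 window
            cy = np.clip(y // 4 + dy - 1, 0, hyps.shape[0] - 1)
            cx = np.clip(x // 4 + dx - 1, 0, hyps.shape[1] - 1)
            out[y, x] = hyps[cy, cx]
    return out
```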

2.3 The statement in lines 304-305 should be further discussed. The authors just state that ETO is better on the more difficult YFCC100M without discussing the probable reason.

All of the methods we tested achieved lower metrics on the YFCC dataset. From this perspective, we consider the YFCC dataset to be more challenging.

2.4 The title "Hypotheses Estimation" (line 137) should be "Hypothesis Estimation".

Thank you for pointing out the typos. We will correct the typos later.

3.1 Some real intermediate results after the hypothesis estimation/segmentation steps should be visualized to show how these steps provide appropriate predictions. The virtual intermediate "results" in Figure 3 are helpful to understand, but the actual results are still necessary.

We provide the visualization results in the PDF of the Author Rebuttal.

Comment

Thank the authors for the additional experimental results and discussions. I think the proposed approach is valuable, considering the new technical designs and the good balance of accuracy and efficiency. Therefore, I keep my original rating (Accept).

Review (Rating: 5)

The authors propose a local feature matching method that leverages homography to accelerate the transformer-based feature matching pipeline. Additionally, they employ unidirectional cross-attention in the refinement stage to further reduce computational overhead. Experimental results demonstrate the efficiency and effectiveness of this approach.

Strengths

  1. The paper is well-written and easy to understand.
  2. Integrating homography as theoretical guidance into the transformer pipeline is a commendable approach, enhancing the transformer-based pipeline with theoretical support.
  3. Experimental results demonstrate that the proposed method achieves a much lower runtime, validating its efficiency in practice.

Weaknesses

  1. Though the time usage decreases, the accuracy also decreases, as seen in Tab. 1, 2 & 3.
  2. The paper lacks significant citations in feature matching methods, such as Efficient LoFTR[1], RoMa[2]. Some of these works also focus on improving efficiency in feature matching and should be referenced to provide a comprehensive background. Including these methods in the experiments would offer a more thorough comparison of the proposed approach's performance.
  3. Though the authors acknowledge that some other methods (e.g., [14, 38]) are better in certain aspects (line356-357), it would be beneficial to include a comparative analysis with these methods. This would provide a clearer understanding of the strengths and weaknesses of the proposed approach.

[1] Wang, Yifan, et al. "Efficient LoFTR: Semi-dense local feature matching with sparse-like speed." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[2] Edstedt, Johan, et al. "RoMa: Robust dense feature matching." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

Questions

No question to the authors.

Limitations

The authors have addressed the limitations in their paper.

Author Response

Thank you for your review and valuable comments.

  1. Though the time usage decreases, the accuracy also decreases, as seen in Tab. 1, 2 & 3.

Yes, you are right. But we argue that the large improvement in runtime is valuable in real-time applications such as robotics or SLAM.

  1. The paper lacks significant citations in feature matching methods, such as Efficient LoFTR, RoMa. Some of these works also focus on improving efficiency in feature matching and should be referenced to provide a comprehensive background. Including these methods in the experiments would offer a more thorough comparison of the proposed approach's performance.

We add a couple of experiments on the MegaDepth dataset in the Author Rebuttal materials. It should be mentioned that our paper is contemporaneous with Efficient LoFTR. The accuracy of Efficient LoFTR is a bit better than LoFTR's, and its runtime of 56.9 ms is much faster than LoFTR's 93.2 ms; however, it is still much slower than our 21 ms. And while PATS achieves extremely good results, it is almost 100 times slower than our method.

  1. Though the authors acknowledge that some other methods (e.g., [14, 38]) are better in certain aspects (line356-357), it would be beneficial to include a comparative analysis with these methods. This would provide a clearer understanding of the strengths and weaknesses of the proposed approach.

Please see our answer to Q2 above.

Comment

Thanks to the authors for the experimental results. However, I noticed that my request for a comparison with RoMa was not addressed in your response. RoMa is another relevant SOTA method which should be included in the comparison.

Comment

| Method | auc@5 | auc@10 | auc@20 | Runtime on RTX 2080 Ti (ms) |
|---|---|---|---|---|
| RoMa | 64.8 | 77.4 | 86.1 | 688.8 |
| Tiny-RoMa | 36.2 | 53.6 | 67.5 | 29.0 |
| ETO | 51.7 | 66.6 | 77.4 | 21.0 |

The experiments show that when efficiency and accuracy are both taken into account, ETO has advantages over the state-of-the-art methods. Note that Tiny-RoMa's demo on GitHub uses a very strict set of RANSAC parameters (ransacReprojThreshold=0.2, method=cv2.USAC_MAGSAC, confidence=0.999999, maxIters=10000), so it is not surprising that its results decrease when using settings consistent with ours.
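For reference, a minimal sketch of the two RANSAC settings being contrasted, using OpenCV; the synthetic match arrays and the choice of findFundamentalMat are illustrative assumptions, while the strict parameter values are the ones quoted above:

```python
import cv2
import numpy as np

# Placeholder matches; in practice these come from the matcher being evaluated.
pts0 = (np.random.rand(500, 2) * 640).astype(np.float32)
pts1 = (np.random.rand(500, 2) * 640).astype(np.float32)

# Strict MAGSAC settings quoted from the Tiny-RoMa demo.
F_strict, inliers_strict = cv2.findFundamentalMat(
    pts0, pts1, method=cv2.USAC_MAGSAC,
    ransacReprojThreshold=0.2, confidence=0.999999, maxIters=10000)

# A looser, more common baseline setting.
F_loose, inliers_loose = cv2.findFundamentalMat(
    pts0, pts1, method=cv2.FM_RANSAC, ransacReprojThreshold=1.0)
```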

Comment

Thank you for the additional comparisons with RoMa and the discussion provided. This effectively addresses my concerns. Your work is both interesting and effective, and I will maintain my initial rating.

Review (Rating: 6)

This paper proposes a novel framework for efficient transformer-based local feature matching. Transformer-based local feature matching usually contains two stages: a coarse matching stage which applies self-attention and cross-attention on coarse-level features (usually H/8 x W/8) to obtain coarse matches, and a refinement stage that refines the coarse matches in a local window based on fine-grained features. This paper proposes two methods to improve the efficiency of this pipeline. First, coarse matching is performed at an even coarser level (H/32 x W/32) to reduce the cost of the expensive attention operations. Additionally, a homography hypothesis for each patch is estimated at this level. Leveraging the piece-wise smooth prior, matching at H/8 x W/8 resolution is directly approximated by selecting the most probable homography hypothesis in local windows using a proposed segmentation technique. Second, in the refinement stage, the bi-directional cross-attention is reduced to a uni-directional one. Experiments show that the proposed framework achieves comparable performance with less inference time compared with state-of-the-art transformer-based local feature matching algorithms.

Strengths

  1. The idea of using piece-wise smooth prior to accelerate transformer-based feature matching is novel and promising, and the supervision and the homography hypothesis re-selection algorithms are carefully designed.
  2. Comprehensive ablation studies show that the key design features, including homography hypothesis proposal, homography hypothesis re-selection and uni-directional cross attention have positive effects on the performance and are necessary.

Weaknesses

  1. The symbols used in the method part are too complicated, making it hard to read.
  2. The symbol \mathscr{H} used in Figure 4 is undefined in the text.
  3. The framework adopts two-stage training, possibly making it hard to train and limiting the accuracy.

Questions

  1. Do you have any plan to reorganize the symbols?
  2. If the framework is trained end-to-end, will it yield acceptable performance?

Limitations

The authors have adequately addressed the limitation of this work in the conclusion and limitations section of the paper. Enabling the network to be trained end-to-end, allowing for dense matching and further improving accuracy are my suggestions for directions of improvement.

Author Response

Thank you for your review and valuable comments.

  1. The symbols used in the method part are too complicated, making it hard to read.

Thank you for pointing this out; we apologize for the confusion.

  1. The symbol \mathscr{H} used in Figure 4 is undefined in the text.

Thanks for pointing this out. \mathscr{H} here refers to the output of the segmentation, i.e., the set of homography transformations that each unit on M_2 is ultimately assigned to. We will add this explanation.

  1. The framework adopts two-stage training, possibly making it hard to train and limiting the accuracy.

Yes, an end-to-end pipeline may improve our method, but we don't have enough computing resources to complete it. We will explore this idea in future work.

  1. Do you have any plan to reorganize the symbols?

We can directly add s and t superscripts to the features, changing f_k^3 to f_k^{3s} and f_k^{3t}; this will be clearer.

  1. If the framework is trained end-to-end, will it yield acceptable performance?

Then the batch size would be very small, making the training process very slow. An end-to-end pipeline may improve our method, but we don't have enough resources to complete it.

Comment

Thank you for providing the additional experimental results and discussions. The proposed method demonstrates strong accuracy and efficiency, with novel and well-reasoned technical designs. The authors have addressed my questions and have outlined a plan to clarify the symbols. Therefore, I will maintain my original rating of "Weak Accept."

Review (Rating: 6)

This paper presents a local feature matching method based on the multiple homography hypotheses. The paper explicitly introduces the homography hypotheses for coarse matching, combined with homography segmentation, cross-attention, and sub-pixel refinement to obtain fine matching results. The introduction of the homography hypotheses significantly reduces the number of input tokens for the attention mechanism, and with further model modification, the algorithm proposed in this paper has achieved a substantial increase in inference speed while maintaining performance as much as possible. Experiments on multiple datasets have proven the effectiveness and efficiency of the proposed method.

Strengths

Clear motivation, reasonable structural design, convincing experiments

Weaknesses

- The pre-compiled transformer model may have some impact on inference speed; can additional tests be conducted on the efficiency of a normal transformer module?

- Lack of the latest state-of-the-art comparison: some of the latest methods, such as RoMa[1] and Efficient LoFTR[2], were made public at an early stage, so it is necessary to compare and analyze against these methods.

[1] Edstedt, Johan, et al. "RoMa: Robust dense feature matching." CVPR 2024.
[2] Wang, Yifan, et al. "Efficient LoFTR: Semi-dense local feature matching with sparse-like speed." CVPR 2024.

Questions

- In Section 3, 2-stage training is adopted, but in the Implementation Details the training phase contains 3 stages. Why is there a joint training stage at first? Moreover, joint training is usually conducted for the final finetuning; why is it the first stage in your strategy?

- For Basic Refinement w/ Segmentation in Section 4.4, why is the comparison at the same training hours instead of the same number of training samples? Early stopping may lead to a performance drop.

- In the introduction, the authors claim that the multiple self- and cross-attention layers in the fine-level stage are redundant; are there any numerical results? What if you stack several uni-directional cross-attention layers in your method?

Limitations

see Weaknesses

Author Response

Thank you for your review and valuable comments.

  1. The pre-compiled transformer model may have some impact on inference speed; can additional tests be conducted on the efficiency of a normal transformer module?

Without the pre-compiled transformer model, the runtime of ETO increases from 21.0 ms to 32.8 ms for 640×480 images, while the runtime is almost the same for LightGlue.

  1. Lack of the latest state-of-the-art comparison.

It should be mentioned that our paper is contemporaneous with Efficient LoFTR. The accuracy of Efficient LoFTR is a bit better than LoFTR's, and its runtime of 56.9 ms is much faster than LoFTR's 93.2 ms; however, it is still much slower than our 21 ms. And while PATS achieves extremely good results, it is almost 100 times slower than our method. More detailed results on the MegaDepth dataset can be found in the supplementary table inside the Author Rebuttal.

  1. In Section 3, 2-stage training is adopted, but in the Implementation Details the training phase contains 3 stages. Why is there a joint training stage at first? Moreover, joint training is usually conducted for the final finetuning; why is it the first stage in your strategy?

Our purpose is to handle multi-resolution images better, which is of great significance for real-world applications, although it is not demonstrated in the standard experimental data in this paper. This step is placed in the first stage rather than the second because different image resolutions mean that the coarse matching stage has to deal with variable-size inputs, which has a relatively large impact on the results. Correspondingly, the fine matching stage uses fixed-size inputs from the coarse stage and is therefore inherently more robust to resolution changes.

  1. For Basic Refinement w/ Segmentation in Section 4.4, why is the comparison at the same training hours instead of the same number of training samples? Early stopping may lead to a performance drop.

We try to demonstrate our superiority in training efficiency. When using the same number of training samples, the accuracy of that method is 52.3/67.1/77.9 (auc@5/auc@10/auc@20) with a runtime of 32.8 ms, while our accuracy is 51.7/66.6/77.4 at 21.0 ms.

  1. In the introduction, the authors claim that the multiple self- and cross-attention layers in the fine-level stage are redundant; are there any numerical results? What if you stack several uni-directional cross-attention layers in your method?

The numerical results are given in the answer to Q4 above: a simple version of cross-attention achieves accuracy similar to a complete transformer with self- and cross-attention, while being much faster.

Comment

Thanks for the response.

Although most of my questions are resolved, some questions were ignored by the authors.

  1. In Q3, I wonder whether stacking more of the proposed uni-directional cross-attention blocks would be helpful and make the method more powerful.

  2. The authors compared with PATS instead of RoMa as I mentioned in W2. Additionally, some other reviewers also mentioned RoMa; I don't know why the authors try to avoid comparing with RoMa. In my experience, RoMa runs at about ~180 ms/frame on MegaDepth with an RTX 3090, and there is a tiny version of RoMa. According to the paper and official repo, the accuracies of RoMa (62.6/76.7/86.3) and Tiny-RoMa (56.4/69.5/79.5) are much better than the proposed method's (although it is faster).

The authors' evasiveness heightens my concern. I do not know whether stacking the proposed uni-directional cross-attention is useless, and I do not know why RoMa, which has been open-sourced for a long time, was ignored. However, it is still an interesting work, and I will maintain my original rating (Weak Accept).

Comment

| Method | auc@5 | auc@10 | auc@20 | Runtime on RTX 2080 Ti (ms) |
|---|---|---|---|---|
| RoMa | 64.8 | 77.4 | 86.1 | 688.8 |
| Tiny-RoMa | 36.2 | 53.6 | 67.5 | 29.0 |
| ETO with 2 uni-directional attention | 52.0 | 66.6 | 76.8 | 22.7 |
| ETO | 51.7 | 66.6 | 77.4 | 21.0 |

The experiments show that when efficiency and accuracy are both taken into account, ETO has advantages over the state-of-the-art methods. Note that Tiny-RoMa's demo on GitHub uses a very strict set of RANSAC parameters (ransacReprojThreshold=0.2, method=cv2.USAC_MAGSAC, confidence=0.999999, maxIters=10000), so it is not surprising that its results decrease when using settings consistent with ours.

ETO with 2 uni-directional attention blocks makes only a small, unimportant difference in performance. We consider that this is because the uni-directional attention only manages to improve efficiency without compromising accuracy when it is iterated once. This drawback stems from the fact that it only optimizes the feature used to estimate the refined bias, which is only one out of every 16 features on the source image's M_3 feature map, while the features on the target image are not optimized in this uni-directional attention operation. Therefore, more uni-directional attention layers cannot bring great improvements to our method.

Comment

Thanks for the response, which partly addressed my question.

According to the response, stacking the uni-directional attention achieves an improvement on auc@5 but a degradation on auc@20 (both small), and it seems that you did not re-train a new model (due to the rebuttal schedule) but simply reused the attention parameters.

When it comes to the comparison with RoMa and Tiny-RoMa, thanks for your insight into the experiment settings. As the metric is the accuracy of the camera pose, which needs only a few matches (e.g., 8 matches) to regress, it is acceptable to filter out less-reliable matches when computing camera poses. Maybe you can also test your method with the strict setting (just a suggestion, not required).

Nevertheless, this work is interesting and provides an effective and efficient method for 2-view matching.

Comment

We only re-trained the refinement part with two uni-attention blocks, which took us only 12 hours; we did not simply reuse the attention parameters.

However, here we just stack two uni-attention blocks and supervise them once; perhaps supervising the residuals of each block could bring a meaningful improvement.

Finally, thank you again for your high opinion and useful suggestions for our work.

Review (Rating: 5)

This paper proposes an efficient framework to reduce the computational load of transformer-based matching approaches. It is a coarse-to-fine two-stage solution. During the coarse matching phase, multiple homography hypotheses are estimated to approximate continuous matches. Each hypothesis covers a few features to be matched, reducing the number of features that require enhancement via transformers. In the refinement stage, uni-directional cross-attention is proposed to replace the classical bidirectional self-attention and cross-attention mechanism, which further decreases the computational cost. Comprehensive evaluations on open datasets such as MegaDepth, YFCC100M, ScanNet, and HPatches demonstrate its efficacy.

Strengths

The whole idea makes sense to me. The experimental studies are quite comprehensive.

Weaknesses

Some parts of the description are quite vague to me, which may hinder its reproducibility. Some of the experimental results may need further clarification.

Questions

  1. I think some descriptions in Sec. 3.4 need further clarification, e.g., Lines 192-193: \hat{f}_j^3 is computed by querying f_k^3 with …; is the query done by nearest-neighbor search?
  2. How is ΔP_j^t in Line 188 obtained, or is it predicted?
  3. In Figure 3, is there an MLP after the cross-attention block? If yes, what are its input/output? Also, the presence of "Module 3" is confusing; should it contain the cross-attention?
  4. In Table 4, it seems that bidirectional cross-attention performs worse than uni-directional cross-attention at auc@5. Is that reasonable? Please give some analysis.
  5. From Table 2, the performance of LightGlue is much worse than ASpanFormer and LoFTR. Interestingly, ASpanFormer is better than LoFTR by a small margin on the MegaDepth dataset, similar to the results reported in LightGlue. The performance gap between LightGlue and LoFTR is much larger than that reported in the LightGlue paper; could you please explain? Moreover, why is LightGlue not reported on ScanNet?

Limitations

  1. Some of the network architectures are not depicted in figures, making it difficult to know the overall diagram of the method.
  2. The two-stage training process may not be easily trained.
  3. It would be better to present some failure case studies.
Author Response

Thank you for your review and valuable comments.

  1. Some descriptions in Sec 3.4 need further clarification.

Yes. Due to space limitations, we have included a more intuitive description and illustrations of this step in Sec. 2 of the supplementary materials. Specifically, since the corresponding point P_j^s on the source image is self-chosen, following LoFTR and for computational convenience, we select a point P_j^s at the center of each specific feature \hat{f}_j^3 on the M_3 feature map. For the target image, after calculating the initial P_j^t coordinates through the accepted homography hypotheses, we query the 7×7 features around P_j^t with \hat{f}_j^3. We then perform cross-attention between this single feature from the source image and the 49 features from the target image. If we were to use traditional methods to perform a complete transformer, we would need a 49×49 transformer to achieve such a large receptive field, but here we only need a 1×49 transformer.
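A minimal sketch of this 1×49 uni-directional cross-attention, assuming a standard multi-head attention module with a residual update; dimensions and module choices are our assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class UniDirectionalCrossAttention(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_src, f_tgt_window):
        # f_src:        (B, 1, dim)  the single source feature \hat{f}_j^3
        # f_tgt_window: (B, 49, dim) the 7x7 target features around P_j^t
        # Only the source feature is refined; the target features serve as
        # keys/values and are never updated (hence "uni-directional").
        refined, _ = self.attn(query=f_src, key=f_tgt_window, value=f_tgt_window)
        return f_src + refined  # residual update of the source feature
```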

  1. How is ΔP_j^t in Line 188 obtained, or is it predicted?

After refining \hat{f}_j^3 using the above method, we follow traditional methods (such as LoFTR) and correlate this feature with the 49 features on the target image, obtaining a local heatmap representing the matching probability of each feature. By computing the expectation over this probability distribution, we obtain the final offset for the match.
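A sketch of this expectation step, assuming a dot-product correlation with a softmax temperature; the function name, the temperature, and the centered 7×7 coordinate grid are illustrative assumptions (in practice the offset would also be scaled by the feature stride):

```python
import torch

def expected_offset(f_src, f_tgt_window, window=7, temperature=0.1):
    # f_src: (B, dim) refined source feature; f_tgt_window: (B, 49, dim)
    scores = torch.einsum('bd,bnd->bn', f_src, f_tgt_window) / temperature
    heatmap = scores.softmax(dim=-1)                         # (B, 49) local heatmap
    # coordinates of the 7x7 window, centered at 0 (in window units)
    r = torch.arange(window, dtype=torch.float32) - window // 2
    gy, gx = torch.meshgrid(r, r, indexing='ij')
    coords = torch.stack([gx.flatten(), gy.flatten()], -1)   # (49, 2)
    return heatmap @ coords                                  # (B, 2) offset for the match
```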

  1. In Figure 3, is there an MLP after the cross-attention block? If yes, what are its input/output? Also, the presence of "Module 3" is confusing; should it contain the cross-attention?

Here, Module 3 refers to the entire refinement part, including the cross-attention. The MLP shown in the figure follows traditional methods (like PATS, ASpanFormer, or LoFTR) in the transformer module, passing the feature through an MLP before computing the expectation. Its function is to further enhance the \hat{f}_j^3 mentioned in Section 3.4.

  1. In Table 4, it seems that bidirectional cross-attention performs worse than uni-directional cross-attention at auc@5. Is that reasonable? Please give some analysis.

This is because we use the same training time rather than the same number of training samples as the comparison criterion, which makes a small but not significant difference to the effectiveness of the model. If we had used the same training samples, bidirectional cross-attention would have achieved 52.3 auc@5, which is slightly better than ours, though still much slower.

  1. From Table 2, the performance of LightGlue is much worse than ASpanFormer and LoFTR. Interestingly, ASpanFormer is better than LoFTR by a small margin on the MegaDepth dataset, similar to the results reported in LightGlue. The performance gap between LightGlue and LoFTR is much larger than that reported in the LightGlue paper; could you please explain? Moreover, why is LightGlue not reported on ScanNet?

It is important to note that LightGlue employs two RANSAC methods in its paper: LO-RANSAC and OpenCV's RANSAC. We use OpenCV's RANSAC exclusively. In the LightGlue experiments, it is evident that under OpenCV's RANSAC, LightGlue's performance decreases significantly. Additionally, to better demonstrate the impact of sub-pixel precision, we use a RANSAC threshold of 0.25 in outdoor scenarios. Under this threshold, LoFTR, ASpanFormer, and our method ETO achieve smaller errors within this precision range, further widening the gap with LightGlue. The reason LightGlue was not tested on ScanNet is that its model was not trained on indoor scenes, so a comparison there would be unfair.

  1. Some of the network architectures are not depicted in figures, making it difficult to know the overall diagram of the method.

Due to space limitations, we have placed the illustrations of the refinement steps in the supplementary materials.

  1. The two-stage training process may not be easily trained.

Yes, an end-to-end pipeline may improve our method. But we don't have enough computing resources to complete it.

  1. It would be better to present some failure case studies.

One shortcoming of our approach is that in the case of very complex planar assemblages, we provide too few plane hypotheses, which are also more difficult to distinguish. This can be seen in Figure 1 of our Author Rebuttal.

Comment

Thank you for your response.

While I appreciate the addition of valuable experiments and the inclusion of a failure case study, I still believe that the current version of the manuscript lacks sufficient descriptions of some technical details. As a result, I will be maintaining my initial rating.

Author Response

Thank you for your review and valuable comments. We will correct the problems mentioned by the reviewers.

Here we add some comparative experiments against recent methods; compared with these methods, ETO still has a strong advantage in efficiency. We also provide a graphical representation of how our segmentation and homography hypotheses work.

Final Decision

The paper presents an innovative framework that significantly enhances the efficiency of transformer-based local feature matching by introducing a multi-stage approach that includes coarse matching with homography hypotheses and a refinement stage using unidirectional cross-attention. All reviewers rated the paper positively, recognising its technical soundness, comprehensive experiments, and effective design. The authors convincingly addressed most reviewer concerns in their rebuttal, particularly regarding methodological details and experimental comparisons. However, the need for further experiments and improved clarity was highlighted. To ensure the paper meets the highest standards, the AC recommends accepting the paper with the condition that the authors include additional comparisons, specifically with the latest methods such as RoMa, and enhance the clarity of the submission in the final version. This decision has been approved by the Senior AC. Congratulations!