EMVP: Embracing Visual Foundation Model for Visual Place Recognition with Centroid-Free Probing
This paper proposes a novel and effective Parameter-Efficient Fine-Tuning (PEFT) pipeline for adapting a visual foundation model to the visual place recognition task.
Abstract
Reviews and Discussion
Existing VPR models often need to be trained from scratch on environment-specific data, resulting in insufficient generalization. The paper aims to improve generalization across environments by fine-tuning a visual foundation model. Specifically, it uses DINOv2 as the foundation model and enhances its feature extraction capabilities for VPR by introducing a novel probing method and adapter.
Strengths
- The proposed method is novel. It designs a DPN module that automatically learns the power value from the context.
- The motivation of the paper is clear, and the experimental results are both visualized and quantified adequately. The implementation details are described clearly.
- The proposed method is highly parameter-efficient compared to the current state-of-the-art methods.
Weaknesses
- The explanation of why NetVLAD is the most popular aggregation method is lacking in lines 90-91. This directly affects the rationale for choosing NetVLAD as the starting point for probing.
- Why do models trained on the GSV-Cities dataset achieve high accuracy on other datasets, as shown in Table 1? Further explanation is welcome.
- The paper makes some design choices for deploying VPR models on mobile devices, such as introducing linear layers for dimensionality reduction and proposing DPN for PEFT. However, I am concerned whether ViT-series models, with their large parameter counts, are suitable for deployment on mobile devices.
Questions
Please refer to the Weaknesses.
Limitations
The technologies used in the paper do not exhibit any evident negative social impacts.
Thank you for acknowledging our motivations and experimental results.
- The VPR task involves more fine-grained discrimination: two images of the same location may have minimal overlap and must be matched by features from a small region. As discussed in the introduction of this paper, NetVLAD can be seen as a second-order feature, which gives it an accuracy advantage over the first-order features used by alternative techniques. This perspective is also supported by many experimental results in the field [12; 15; 17; 18; 9].
- Mainstream VPR datasets (e.g., Pittsburgh250k) use GPS coordinates to determine positive and negative examples. In fact, images with distant GPS coordinates may contain common buildings, while images with nearby coordinates may share no common objects due to different orientations. In other words, the labels in current mainstream VPR datasets are noisy. Consequently, some cutting-edge works have improved data efficiency by refining the dataset construction methods [12; 19; 20].
- On one hand, with the advancement of computing power, ViT-based models are increasingly employed in cutting-edge areas of mobile robotics, such as localization [9; 21], perception [22], and manipulation [23]. On the other hand, the size of foundation models is continuously decreasing: the 0.23B Florence-2 [11], for example, can rival the performance of the 80B Flamingo. Therefore, it is feasible to study VPR tasks based on foundation models.
I would like to thank the authors for the detailed response. After reading all the comments and responses, I think the authors have addressed the key issues raised by the reviewers. For now, I am inclined to increase my score, but further discussion is welcome.
This paper aims to fine-tune a visual foundation model for visual place recognition, with an innovative focus on the probing stage. Specifically, it makes three contributions:
- It proposes a probing method called CFP and introduces a normalization method named DPN within CFP.
- It provides theoretical and experimental support for the proposed probing and normalization methods.
- It integrates DPN as an adapter into the backbone stage to enhance performance.
Strengths
- The paper is well-organized and easy to follow.
- The paper proposes a Centroid-Free Probing (CFP) stage for fine-tuning a visual foundation model on VPR tasks, providing a novel and task-specific probing method. CFP eliminates the explicit calculation of semantic centroids in typical NetVLAD by introducing a simple and effective Constant Normalization (CN) operation.
- The paper thoroughly discusses related probing techniques (LP, MP, GeM, NetVLAD), supported by extensive experiments. Moreover, EMVP achieves SOTA performance on mainstream VPR datasets.
- Interestingly, the DPN module can be used both for post-processing during the probing stage and as an adapter for parameter-efficient fine-tuning.
Weaknesses
Some details need to be expressed more clearly, and some experimental phenomena need more detailed explanation. Please refer to the "Questions".
Questions
- What is the relationship between VPR and the image classification task? Why are the LP and MP methods from image classification not suitable for direct application to VPR?
- In Table 8, why does increasing the number of recalibrated blocks not lead to a performance improvement?
- As discussed in line 155, NetVLAD can be implemented with bilinear pooling. How does it perform compared to CFP?
Limitations
The paper thoroughly discusses the limitations and potential social impacts. Currently, the application of VPR is limited by its insufficient accuracy. However, research in this area undoubtedly contributes to the improvement of mobile robot safety.
Thanks for acknowledging our paper, and we appreciate your valuable suggestions.
- As discussed in our response to reviewer psQJ, LP is a first-order feature and is less accurate than second-order features on the fine-grained VPR task. MP was proposed solely for coarse-grained classification tasks, with a focus on accelerating the computation of second-order features, which limits its accuracy.
- This phenomenon was also observed by the authors of SALAD. More thorough fine-tuning can, in principle, yield better in-domain performance, but it requires more data. Therefore, there is a correlation between the amount of data and the optimal number of recalibrated blocks.
- The baseline in Table 2 is the bilinear form of NetVLAD (with the centroids removed). We implemented it by removing the optimal transport operation in the SALAD code base. After introducing the constant normalization and the linear layer, our CFP achieves a significant improvement over the baseline. For example, it improves Recall@1 on MSLS Val by 2.3%.
Thank you for your response to my questions. The paper's focus on probing techniques is novel and intriguing to me, so I prefer to keep my current score unchanged.
The paper presents a method for parameter-efficient fine-tuning of Visual Foundation Models (VFMs) for the Visual Place Recognition (VPR) task. The method includes a DPN module, which is placed between frozen VFM blocks to recalibrate features for the VPR task, and an aggregation (probing) module named Centroid-Free Probing (CFP), a simplified version of NetVLAD. The DPN module enables the VFM to focus more on features that are important for the VPR task, such as background objects. The CFP module addresses the costly centroid initialization problem of NetVLAD and enhances generalization. The authors conduct experiments demonstrating the effectiveness of the proposed method compared to existing state-of-the-art (SOTA) VPR methods.
Strengths
- The paper is devoted to the use of Visual Foundation Models (VFMs) in the Place Recognition task, a relevant and promising area of research.
- The structure of the paper is well-organized and easy to follow, with the main contributions explicitly highlighted in the introduction. Each subsection of the related work clearly defines the paper's place in the existing research. The method section introduces the preliminaries and then explains the proposed solution in detail.
- The proposed method outperforms existing state-of-the-art (SOTA) methods.
Weaknesses
- Minor weakness: The large number of abbreviations (CFP, CN, DPN_C, DPN_R, EMVP, PEFT, etc.) makes the paper a bit hard to read.
- The training details are incomplete, missing the loss function and the training procedure itself (e.g., how the batches were sampled).
- The metrics used are not explained at all. While these are standard metrics for Place Recognition, the paper would benefit from a brief explanation.
- Although the proposed method outperforms existing state-of-the-art (SOTA) methods, the overall novelty of the paper is moderate. It represents more of an iterative improvement of existing results rather than a fundamentally new method.
- When using large Visual Foundation Models, a major drawback is their computational costs, which may affect the applicability of the proposed method. Information about the models' time and memory consumption is missing.
Questions
- Can you provide more detailed information on the training procedure, including the loss function, sampling strategy for batches, and any data augmentation techniques used?
- The paper mentions limitations and future work but could benefit from more concrete plans or proposals to address these limitations. Can you elaborate on your future work plans to address the identified limitations, such as handling adverse weather conditions and ambiguous scenes? What specific approaches are you considering?
Limitations
The paper adequately addresses that the lack of analysis of different VFM backbones is a major limitation.
Thanks for your insightful comments. We provide our feedback as follows:
Q1&W2. Detailed information on the training procedure. To ensure a fair comparison, we kept all details unrelated to the innovations of this paper consistent with previous SOTA methods (i.e., Conv-AP [12], MixVPR [15], and SALAD [9]) during training. Therefore, the loss function, batch sampling strategy, and data augmentation techniques are the same as in those methods. Specifically, the model is trained with the Multi-Similarity loss [10]. We use batches containing 120 places, each represented by 4 randomly sampled images, resulting in batches of 480 images, as shown in Table 4 in Appendix A.1. Training images are resized to , and no data augmentation technique is applied.
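For concreteness, here is a minimal sketch of this batch construction and loss, using the public pytorch-metric-learning implementation of the Multi-Similarity loss [10] (the hyperparameters shown are that library's defaults, not necessarily ours; the descriptor tensor is a stand-in for the model output):

```python
import torch
import torch.nn.functional as F
from pytorch_metric_learning import losses

# Multi-Similarity loss [10]; alpha/beta/base are the library defaults.
loss_fn = losses.MultiSimilarityLoss(alpha=2, beta=50, base=0.5)

# A GSV-Cities-style batch: 120 places x 4 images per place = 480 images.
num_places, imgs_per_place, dim = 120, 4, 256
labels = torch.arange(num_places).repeat_interleave(imgs_per_place)  # place IDs

# Stand-in for the L2-normalized global descriptors the model would produce.
descriptors = F.normalize(torch.randn(num_places * imgs_per_place, dim), dim=1)

loss = loss_fn(descriptors, labels)  # positive/negative pairs are mined within the batch
```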
Q2. Limitations and future work. Current VPR works (e.g., EMVP, SALAD, and SelaVPR) rely solely on extracting more generalized visual features from VFMs. We have observed a recent trend in large multi-modal models toward lightweight and unified designs, represented by Florence-2 [11], whose backbone is only 0.23B parameters, and in which different image-text understanding tasks are unified into the Visual-Question-Answering task. We are currently investigating how to guide the model's reasoning in adverse weather and ambiguous scenes by constructing multi-modal chains of thought.
W1. Thanks for patiently reading our paper. We will reduce the number of abbreviations in the revised version.
W3. Evaluation metrics. We use the standard Recall@k evaluation metric, following [4; 9; 12; 13; 14; 15]. A query image is considered successfully retrieved if at least one of the top-k retrieved reference images is within 25 meters of the query.
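For concreteness, a minimal sketch of this protocol (function and variable names are illustrative; actual evaluations use each dataset's GPS metadata and official ground truth):

```python
import numpy as np

def recall_at_k(query_desc, ref_desc, query_pos, ref_pos, k=1, radius_m=25.0):
    """Fraction of queries for which at least one of the top-k retrieved
    references lies within `radius_m` meters (positions in a metric frame)."""
    sims = query_desc @ ref_desc.T            # cosine similarity for L2-normalized descriptors
    topk = np.argsort(-sims, axis=1)[:, :k]   # indices of the top-k references per query
    hits = 0
    for q, idxs in enumerate(topk):
        dists = np.linalg.norm(ref_pos[idxs] - query_pos[q], axis=1)
        hits += bool((dists <= radius_m).any())
    return hits / len(query_desc)
```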
W4. Novelty of the paper. We are among the first to explore probing techniques for the VPR task. Specifically, our CFP admits a theoretical and empirical justification for the simplification of NetVLAD, fixing interpretability and performance issues that were otherwise present. Note that our CFP improves Recall@1 on NordLand by 2.8% compared to NetVLAD, while also reducing the memory for descriptors by 66%. Reviewers M9Bz and psQJ have also recognized our work's novelty, highlighting the value of our contributions.
W5. Time and memory consumption. We compare both single-stage and two-stage methods in the table below. Without re-ranking, SALAD and our EMVP outperform all other methods while being orders of magnitude faster. Results marked with are computed using an RTX 4090 GPU. Memory footprint is calculated on the MSLS Val dataset, which includes around 18,000 images. The evaluation protocol and code are provided by R2Former [16]. Note that the table below reports test results of PyTorch models; if TensorRT is used for acceleration in deployment, the applicability of the proposed method on mobile devices (such as unmanned vehicles) will improve greatly. The introduction of VFMs increases the backbone size, but it simultaneously reduces the dependence on re-ranking, saving latency and memory.
| Method | Global feature size | Local feature size | Memory (GB) | Retrieval (ms) | Reranking (ms) |
|---|---|---|---|---|---|
| Patch-NetVLAD | 4096 | 2826×4096 | 908.30 | 9.55 | 8377.17 |
| TransVPR | 256 | 1200×256 | 22.72 | 6.27 | 1757.70 |
| R2Former | 256 | 500×131 | 4.7 | 8.88 | 202.37 |
| SelaVPR | 1024 | | 32.01 | 6.87 | 150.78 |
| SALAD | | 0.0 | 0.63 | 2.41 | 0.0 |
| SALAD | | 0.0 | 0.57 | 1.41 | 0.0 |
| EMVP-B (ours) | | 0.0 | 0.57 | 1.42 | 0.0 |
| EMVP-L (ours) | | 0.0 | 0.57 | 3.66 | 0.0 |
This work presents a novel pipeline, i.e., EMVP, for visual place recognition based on foundation models. In this pipeline, a Centroid-Free Probing (CFP) method is used to process the output of the foundation model and get the global features of place images. Besides, the authors also propose a Dynamic Power Normalization (DPN) module and incorporate it into both the foundation model and the CFP module, which can improve fine-tuning performance and make the features more task-specific. Extensive experiments demonstrate the effectiveness of the proposed method.
Strengths
- The paper is easy to read.
- The proposed method is novel.
- The experimental results are good and outperform SOTA methods.
Weaknesses
- The biggest problem with this paper is that there are many inaccuracies. For example:
a. This paper claims that NetVLAD is a second-order feature, but it is actually a first-order feature. Please refer to the description in the NetVLAD paper [1] (“the NetVLAD layer aggregates the first order statistics of residuals”) and this literature [2] ("the original NetVLAD method utilizes only the first-order statistical information").
b. The authors claim that "In this paper, we resort to post normalization methods (e.g., softmax and L2 normalization) to constrain the value of…to be constant". In fact, NetVLAD also has softmax and L2 operations, which do not make the summation term in equation (4) a constant. Given a local feature, I know that the sum of the probabilities of assigning it to all clusters is constant due to the softmax. But given a cluster, the sum of the probabilities of all local features assigned to it is not constant. I think the authors' deduction is wrong; that is, the global feature G in equation (5) is different from equation (3).
c. There are many missing results in Table 1. The results of other methods are directly copied from other papers. In fact, the existing papers used different versions of the Nordland dataset, and the authors do not realize this and compare the results of the different versions of Nordland together, which is wrong.
- The Dynamic Power Normalization (DPN) module has a structure similar to the classic adapter, and its functions are basically the same. However, the authors do not explain the difference between the two, or what improvements DPN brings over the classic adapter. The parameters of the linear layer in DPN are also not provided.
- In the ablation experiment, the authors call the baseline "the simplified version of NetVLAD adapted by SALAD", which is not appropriate. This baseline does not use softmax and L2, which makes it essentially different from NetVLAD.
- In Table 1 (the comparison to other methods), the SALAD method is based on DINOv2-Base, while the authors provide results of the proposed method only based on DINOv2-Large. It would be more appropriate to provide results for both DINOv2-Large and DINOv2-Base.
[1] Arandjelovic, Relja, et al. "NetVLAD: CNN architecture for weakly supervised place recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[2] Chen, Boheng, et al. "A novel localized and second order feature coding network for image recognition." Pattern Recognition 76 (2018): 339-348.
Questions
see weaknesses
Limitations
see weaknesses
Thanks for carefully reading our paper and recognizing its writing and novelty. We appreciate the opportunity to address your questions:
1.a. We posit that the second-order statistics (bilinear features) employed in NetVLAD are implemented through the outer product of two vectors corresponding to each location of an image. This interpretation is rooted in the seminal work [1], which pointed out that "VLAD can be written as a bilinear model". Consequently, our paper adheres to the definition of second-order statistics established in the research line of [1], [2], and [8]. Note that the LSO-VLADNet mentioned in [3] applied the "element-wise square operation" to first-order features (residuals) and considered it a second-order statistic operation.
We also agree that both [3] and [4] consider residuals as first-order features, a definition that does not take the soft-assignment into account. However, if the soft-assignment is seen as another type of first-order feature extracted by MLP layers, then the features ultimately output by NetVLAD are second-order. Specifically, Eq. (4) in [4] can be understood as computing the outer product of each local residual with the soft-assignment vector, and then using sum pooling to obtain global second-order features. Note that the process is also formulated by Eq. (5) in our paper.
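To make this reading concrete, a minimal sketch of the outer-product view (tensor names and shapes are illustrative, not taken from our implementation):

```python
import torch

N, K, D = 100, 64, 128                              # local features, clusters, feature dim
assign = torch.softmax(torch.randn(N, K), dim=1)    # soft-assignment: each row sums to 1
residuals = torch.randn(N, K, D)                    # residual x_n - c_k for every pair (n, k)

# Outer-product reading of Eq. (5): weight each residual by its assignment,
# then sum-pool over the N local features to get a (K, D) global descriptor.
V = torch.einsum('nk,nkd->kd', assign, residuals)
```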
1.b. Softmax in NetVLAD is introduced to ensure that the sum of the probabilities of assigning a local feature to all clusters is constant, which can be formulated as $\sum_{k=1}^{K} a_{b,n,k} = 1$ for all $b \in \{1,\dots,B\}$ and $n \in \{1,\dots,N\}$, where $B$, $K$, and $N$ denote the batch size, the number of clusters, and the number of local features, respectively. Moreover, NetVLAD adopts intra- and L2-normalization to reduce the effect of bursty image features, as described in paper [5]: "intra-normalization fully suppresses burst". The intra-normalization conducted on the features weighted by soft_assign can be formulated as $\hat{V}_{b,k} = V_{b,k} / \lVert V_{b,k} \rVert_2$ with $V_{b,k} \in \mathbb{R}^{D}$, where $D$ denotes the dimension of the local features.
The motivation of our post normalization method is to remove the cluster centers, as described in [5]: "VLAD similarity measure is strongly affected by the cluster center". Accordingly, the post constant normalization is introduced to ensure that the sum of the probabilities of all local features assigned to a cluster is constant, which can be formulated as $\sum_{n=1}^{N} a_{b,n,k} = 1$ for all $b \in \{1,\dots,B\}$ and $k \in \{1,\dots,K\}$. Note that our EMVP retains both the intra- and L2-normalization of NetVLAD.
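The contrast between the two constraints comes down to the axis over which softmax is applied; a minimal sketch (tensor names are illustrative):

```python
import torch

scores = torch.randn(100, 64)    # (N local features, K clusters) pre-softmax logits

# NetVLAD: each local feature distributes unit mass over the K clusters.
a_netvlad = torch.softmax(scores, dim=1)   # rows sum to 1: sum_k a[n, k] = 1

# Post constant normalization (ours): each cluster receives unit mass from
# the N local features, so no single cluster can dominate the aggregation.
a_cfp = torch.softmax(scores, dim=0)       # columns sum to 1: sum_n a[n, k] = 1
```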
1.c. Thank you very much for pointing out this error. The Nordland dataset exists in two versions, and the version used in our paper is provided by [6]. In the table below, we display the results for the other version, provided by [7]. EMVP even surpasses the previous best method (SelaVPR with a re-ranking stage). In the revised version, we will separately and clearly present the test results for the different versions of the Nordland dataset.
| Method | Nordland-test R@1 | R@5 | R@10 |
|---|---|---|---|
| SelaVPR (global) | 72.3 | 89.4 | 94.4 |
| SelaVPR (re-rank) | 85.2 | 95.5 | 98.5 |
| EMVP-L (ours) | 88.7 | 97.3 | 99.3 |
2. Our DPN module leverages power-law operations to effectively alter the magnitude relationships between local features, surpassing the capabilities of multiplication operations employed in classic adapters. This empowers the DPN module to accentuate discriminative features, enhancing its ability to capture task-specific information as discussed in lines 181 to 187. Note that controlling the preservation of task-specific information through changing the power value is also supported by previous theoretical research [8]. The superiority of the proposed DPN is evidenced by the empirical results in Table 3. Compared to the advanced adapter method PRSP, EMVP achieves notable improvements in the Recall@1 by 0.5%, 3.4%, 0.4%, and 0.5% on MSLS Val, NordLand, Pitts250k-test, and SPED, respectively. Remarkably, these enhancements are accompanied by a substantial 64.3% reduction in parameters, highlighting the efficiency and scalability of the DPN module.
In addition, the implementation of DPN is shown in Algorithm 1 in Appendix A.1. Specifically, the input and output dimensions of the linear layer in DPN_C are 128 and 64, respectively, the former matching the output size of the preceding stage. The input and output dimensions of the linear layer in DPN_R are determined by $D$, the feature dimension of the ViT model; in the case of the ViT-B model, $D = 768$.
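To illustrate how such a module differs from a plain multiplicative adapter, the sketch below shows one possible realization of dynamic power normalization in PyTorch (the exact structure follows Algorithm 1 in Appendix A.1; the context pooling, layer sizes, and sign-preserving power here are illustrative simplifications):

```python
import torch
import torch.nn as nn

class DPNSketch(nn.Module):
    """Illustrative dynamic power normalization: a small bottleneck predicts
    per-channel powers from global context, then rescales feature magnitudes."""
    def __init__(self, dim: int = 768, hidden: int = 64):
        super().__init__()
        self.squeeze = nn.Linear(dim, hidden)
        self.expand = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D) tokens
        ctx = x.mean(dim=1)                               # global context per image
        p = torch.sigmoid(self.expand(torch.relu(self.squeeze(ctx))))  # powers in (0, 1)
        p = p.unsqueeze(1)                                # broadcast over the N tokens
        # Sign-preserving power law: unlike a multiplicative adapter, this
        # changes the magnitude ordering of activations, not just their scale.
        return torch.sign(x) * x.abs().clamp_min(1e-6).pow(p)

# Usage: recalibrate frozen-backbone tokens before aggregation.
tokens = torch.randn(2, 196, 768)
recalibrated = DPNSketch()(tokens)
```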
3. SALAD also uses L2-normalization, as described in paper [9]: "Following NetVLAD, we do an L2 intra-normalization and an entire L2 normalization of this vector." SALAD and the baseline do not employ softmax, and they differ from NetVLAD to some extent. Therefore, we will revise the description of the baseline in the updated version to: "the baseline can be seen as a form of bilinear pooling with the centroids removed, implemented by removing the optimal transport operation in the SALAD code base".
4. As shown in the table below, our EMVP achieves the best results under different backbone configurations. For instance, compared to the existing single-stage method based on DINOv2-L, EMVP-L improves Recall@1 on MSLS Val by 6.2%.
| Method | Backbone | MSLS Val R@1 | R@5 | R@10 | Pitts250k-test R@1 | R@5 | R@10 | SPED R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| SALAD | DINOv2-B | 92.2 | 96.4 | 97.0 | 95.1 | 98.5 | 99.1 | 92.1 | 96.2 | 96.5 |
| EMVP-B (ours) | DINOv2-B | 93.2 | 96.9 | 97.2 | 95.7 | 98.9 | 99.3 | 91.8 | 96.5 | 97.4 |
| SelaVPR (global) | DINOv2-L | 87.7 | 95.8 | 96.6 | - | - | - | - | - | - |
| EMVP-L (ours) | DINOv2-L | 93.9 | 97.3 | 97.6 | 96.5 | 99.1 | 99.5 | 94.6 | 97.5 | 98.4 |
Thanks for your response to my concerns.
For 1.a, none of the literature you provided directly mentions VLAD as a second-order feature. Moreover, the literature you provided [1] also clearly states that "The Vector of Locally Aggregated Descriptors (VLAD) descriptor aggregates the first order statistics of the SIFT descriptors" (in Section 2.2), which agrees with the literature I provided (including the NetVLAD paper). Strictly speaking, I think the soft assignment in NetVLAD only calculates a degree of belonging to assign local features to different clusters. Although soft assignment also uses local features, the information of these local features does not directly contribute to the final vector; therefore, the final feature vector is only a first-order feature. A new perspective on NetVLAD is also welcome. I no longer think this issue should affect the grading of this paper.
For 1.b, the detailed explanation provided by the authors addressed my concern about the equations. However, the usage of softmax in the proposed method is a little odd. For vanilla NetVLAD, the softmax is used to assign a local feature to all clusters (it’s a natural and reasonable assigning way), so the sum of the probabilities of assigning a local feature to all clusters is a constant. However, for the proposed method, the softmax is used to ensure the sum of the probabilities of all local features assigned to a cluster is constant. This is actually equivalent to assigning a cluster to all local features. How should we understand this operation (motivation, pros and cons)?
For 1.c, I am not questioning the performance of the proposed method. Using different versions of Nordland is just one of the problems. More importantly, there are too many missing results in Table 1 (e.g., on Pitts250k-test and SPED), which is inappropriate in a top conference paper. The code for all of these methods is available on GitHub, and I hope you can complete these experimental results in your final paper.
[1] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In Int. Conf. Comput. Vis., pages 1449–1457, 2015.
Dear Reviewer,
We sincerely appreciate your reevaluation and are especially grateful for the valuable time you have spent to help improve our paper. Please allow us to address your concerns once again.
1.a We agree that the aggregated features in VLAD are first-order. We consider soft assignment to be an extracted feature primarily because NetVLAD uses a separate ("decoupled") layer to predict soft assignment. The literature you provided introduces a new second-order feature production method, which is very enlightening for us. In our future research, we will conduct a more comprehensive discussion on the definition of higher-order features.
1.b Regardless of whether softmax is applied, a cluster will be assigned to all local features simultaneously, because NetVLAD uses soft assignment instead of the hard assignment in the original VLAD. Given a cluster, we consider its physical meaning to be implicit, with each local feature containing some degree of this physical meaning, although this interpretation is not immediately intuitive (con).
As SALAD points out, "some features, such as those representing the sky, might contain negligible information for VPR." Therefore, SALAD introduces a dustbin in the optimal transport operation, allowing the sum of the probabilities of assigning a local feature to all clusters not to be a constant. Instead, the optimal transport operation ensures the sum of the probabilities of all local features assigned to a cluster remains constant. This approach keeps the sum of assignment weights corresponding to each cluster constant, preventing the dominance of any single cluster and making the assignment more "effectively distributed".
Overall, our use of softmax serves a similar function to the optimal transport operation in SALAD, but we provide a theoretical basis for this operation (motivation). Furthermore, we have explained why the cluster centers no longer need to be included in the computation when using our softmax (motivation & pros). Additionally, unlike optimal transport, which requires complex iterative algorithms, our implementation of softmax is much simpler and faster (pros).
1.c Indeed, we should complete the missing results in Table 1. As the author-reviewer discussion period is nearing its end, we will test all the compared methods on these datasets and include the results in the revised version.
Dear Reviewers,
Thanks for your hard work. Your constructive comments will help us continuously improve this paper. We list all the references mentioned in our rebuttals for your convenience.
[1] Tsung-Yu Lin, Aruni RoyChowdhury, and Subhransu Maji. Bilinear cnn models for fine-grained visual recognition. In Int. Conf. Comput. Vis., pages 1449–1457, 2015.
[2] Mingze Gao, Qilong Wang, Zhenyi Lin, Pengfei Zhu, Qinghua Hu, and Jingbo Zhou. Tuning pre-trained model via moment probing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11803–11813, 2023.
[3] Boheng Chen, Jie Li, Gang Wei, and Biyun Ma. A novel localized and second order feature coding network for image recognition. Pattern Recognition, 76:339–348, 2018.
[4] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Netvlad: Cnn architecture for weakly supervised place recognition. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5297–5307, 2016.
[5] Relja Arandjelovic and Andrew Zisserman. All about vlad. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 1578–1585, 2013.
[6] Niko Sünderhauf, Peer Neubert, and Peter Protzel. Are we there yet? challenging seqslam on a 3000 km journey across all four seasons. In Proc. of workshop on long-term autonomy, IEEE international conference on robotics and automation (ICRA), page 2013. Citeseer, 2013.
[7] Daniel Olid, José M Fácil, and Javier Civera. Single-view place recognition under seasonal changes. arXiv preprint arXiv:1808.06516, 2018.
[8] Qilong Wang, Mingze Gao, Zhaolin Zhang, Jiangtao Xie, Peihua Li, and Qinghua Hu. Dropcov: A simple yet effective method for improving deep architectures. Advances in Neural Information Processing Systems, 35:33576–33588, 2022.
[9] Sergio Izquierdo and Javier Civera. Optimal transport aggregation for visual place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2024.
[10] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5022–5030, 2019.
[11] Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a unified representation for a variety of vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4818–4829, 2024.
[12] Amar Ali-bey, Brahim Chaib-draa, and Philippe Giguère. Gsv-cities: Toward appropriate supervised visual place recognition. Neurocomputing, 513:194–203, 2022.
[13] Frederik Warburg, Soren Hauberg, Manuel Lopez-Antequera, Pau Gargallo, Yubin Kuang, and Javier Civera. Mapillary street-level sequences: A dataset for lifelong place recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2626–2635, 2020.
[14] Mubariz Zaffar, Sourav Garg, Michael Milford, Julian Kooij, David Flynn, Klaus McDonald-Maier, and Shoaib Ehsan. Vpr-bench: An open-source visual place recognition evaluation framework with quantifiable viewpoint and appearance change. International Journal of Computer Vision, 129(7):2136–2174, 2021.
[15] Amar Ali-Bey, Brahim Chaib-Draa, and Philippe Giguere. Mixvpr: Feature mixing for visual place recognition. In IEEE Winter Conf. Appl. Comput. Vis., pages 2998–3007, 2023.
[16] Sijie Zhu, Linjie Yang, Chen Chen, Mubarak Shah, Xiaohui Shen, and Heng Wang. R2former: Unified retrieval and reranking transformer for place recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19370–19380, 2023.
[17] Nikhil Keetha, Avneesh Mishra, Jay Karhade, Krishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. Anyloc: Towards universal visual place recognition. IEEE Robotics and Automation Letters, 2023.
[18] Feng Lu, Lijun Zhang, Xiangyuan Lan, Shuting Dong, Yaowei Wang, and Chun Yuan. Towards seamless adaptation of pre-trained models for visual place recognition. In Int. Conf. Learn. Represent.
[19] Gabriele Berton, Gabriele Trivigno, Barbara Caputo, and Carlo Masone. Eigenplaces: Training viewpoint robust models for visual place recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11080–11090, 2023.
[20] María Leyva-Vallina, Nicola Strisciuglio, and Nicolai Petkov. Data-efficient large scale place recognition with graded similarity supervision. In IEEE Conf. Comput. Vis. Pattern Recog., pages 23487–23496, 2023.
[21] Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, and André Araujo. Omniglue: Generalizable feature matching with foundation model guidance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19865–19875, 2024.
[22] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. In Forty-first International Conference on Machine Learning.
[23] Brianna Zitkovich, Tianhe Yu, Sichun Xu, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In 7th Annual Conference on Robot Learning, 2023.
Dear Reviewers,
We sincerely appreciate your time and effort in reviewing our paper. We have made every effort to address all the concerns and questions you raised. Could you please reevaluate our paper in light of the revisions? If any aspects remain unclear, please do not hesitate to let us know. We are more than willing to discuss any further questions you may have.
Best regards, The Authors
All reviewers ultimately provided positive recommendations for the paper. The AC has determined that the novelty and effectiveness of this work meet the standards for acceptance. Additionally, the reviewers' comments contain several suggestions that could further improve the paper, as reflected in the authors' responses. The AC encourages the authors to use them to update the manuscript.