Score: 7.1 / 10

Poster · 5 reviewers

Lowest 4 · Highest 5 · Standard deviation 0.5

Confidence: 3.6

Novelty 2.8 · Quality 3.0 · Clarity 2.6 · Significance 2.8

NeurIPS 2025

GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization

Pengyue Jia,Seongheon Park,Song Gao,Xiangyu Zhao,Sharon Li

Submitted: 2025-05-07 · Updated: 2025-10-29

Keywords

Image Geolocalization, Image-to-GPS retrieval, Reranking, Large Multi-Modal Models

Reviews and Discussion

Review

Rating: 4 · Confidence: 3 · 2025-07-01

This paper proposes GeoRanker, a distance-aware reranking framework for the task of “worldwide image geolocalization”. The approach uses large vision-language models to encode the interactions between the query and candidates, and introduces a multi-order distance loss to enable the model to capture rich spatial structure. Moreover, the authors construct a new dataset, GeoRanking, which is tailored for spatial ranking tasks with multimodal annotations, including GPS coordinates, textual descriptions, and image data. The experimental results demonstrate that GeoRanker achieves superior performance on two widely used benchmark datasets (IM2GPS3K and YFCC4K), surpassing existing methods with a clear margin.

Strengths and Weaknesses

Strengths:

Worldwide image geolocalization is an interesting and meaningful task, and the motivation is clearly stated.
This paper is well-written, easy to read and follow.
The proposed method is novel and the experimental results are solid, outperforming current SOTA methods on two well-established benchmarks (IM2GPS3K and YFCC4K).
Code, checkpoint, and dataset are publicly available.

Weaknesses:

There is an obvious formatting problem in the paper: the paragraph between L126 and L127 lacks line numbers.
The authors propose the GeoRanking dataset as a contribution. However, in Section 3.1 (i.e., GeoRanking Dataset Construction), there is no explicit description or summary of what the dataset consists of. Instead, most of the content is devoted to describing how to encode database samples and queries to complete retrieval candidates. Although there is some additional introduction about GeoRanking in subsequent chapters, the section title and the paragraph content don't match well.
GWS15K [1] is a more challenging test dataset for worldwide image geolocalization. Some prior studies (e.g., PIGEOTTO [2], GeoCLIP [3], and GeoDecoder [1]) do not perform well on it. I would like to know how the proposed GeoRanker performs on the GWS15K dataset.
The paper lacks analysis of failure examples, which is important for understanding the model's limitations and guiding future improvements. Including some qualitative analysis of incorrect predictions would strengthen the work.

Reference

[1] Brandon Clark, Alec Kerrigan, Parth Parag Kulkarni, Vicente Vivanco Cepeda, and Mubarak Shah. Where we are and what we’re looking at: Query based worldwide image geo-localization using hierarchies and scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23182–23190, 2023.

[2] Lukas Haas, Michal Skreta, Silas Alberti, and Chelsea Finn. Pigeon: Predicting image geolocations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12893–12902, 2024.

[3] Vicente Vivanco Cepeda, Gaurav Kumar Nayak, and Mubarak Shah. Geoclip: Clip-inspired alignment between locations and images for effective worldwide geo-localization. Advances in Neural Information Processing Systems, 36:8690–8701, 2023.

Questions

Between L126 and L127, the authors claim that “These adapters, along with the GPS encoder, are trained using an InfoNCE loss to align the query and candidate representations, as in G3.” I'm not sure what the “GPS encoder” here is.
The authors use MP16-Pro as a database in constructing the GeoRanking dataset, but where is the query from? Could you describe in detail the composition of the GeoRanking dataset?
In Section 3.4 (i.e., Inference), Equation (6) does not include negative samples (C_neg). However, in Figure 2, negative samples are involved in the inference process. Is there an error in Equation (6)?
According to the ablation study (i.e., Table 2), it seems that using GPT4V to generate some candidates brings even more improvement on IM2GPS3K than the second-order distance loss. Have you tried using other LVLMs to generate candidates? How were the results?
Regarding Figure 4 (i.e., time efficiency), I would like to know where the candidates come from for the different candidate counts: only retrieved candidates, or both retrieved and generated ones? Is the time needed to generate candidates with GPT4V included in the inference time?

Limitations

Limitations are included in the appendix. GeoRanker introduces an additional ranking stage, resulting in substantial computational overhead during inference. The authors offer two possible approaches to alleviate this problem.

Final Justification

Authors have addressed my concerns. After reading the comments of the other reviewers, I decide to keep the original rating.

Formatting Concerns

The paragraph between L126 and L127 lacks line numbers.

Author Response

2025-07-31

W1: There is an obvious formatting problem in the paper: the paragraph between L126 and L127 lacks line numbers.

Thank you for pointing this out. We will correct this formatting issue and carefully check the entire paper for similar problems.

W2: The authors propose the GeoRanking dataset as a contribution. However, in Section 3.1 (i.e., GeoRanking Dataset Construction), there is no explicit description or summary of what the dataset consists of. Instead, most of the content is devoted to describing how to encode database samples and queries to complete retrieval candidates. Although there is some additional introduction about GeoRanking in subsequent chapters, the section title and the paragraph content don't match well.

Thank you for the helpful feedback. We agree that the current presentation of Section 3.1 could be improved for clarity. While detailed information about the structure of the GeoRanking dataset is provided in Appendix B, we acknowledge that this is not sufficiently highlighted in the main text.

To address this, we will revise Section 3.1 to include a clear summary of the dataset composition, such as the number of queries, candidates information, modality types, and annotation format. We will also adjust the section title and ensure the content better aligns with it, so that the dataset contribution is explicitly and clearly conveyed in the revised version.

W3: GWS15K is a more challenging test dataset for worldwide image geolocalization. Some prior studies (e.g., PIGEOTTO, GeoCLIP, and GeoDecoder) do not perform well on it. I would like to know how the proposed GeoRanker performs on the GWS15K dataset.

Thank you for the suggestion. We would also be interested in evaluating GeoRanker on the GWS15K dataset. However, to our knowledge, the dataset is not publicly available, which unfortunately prevents us from conducting experiments on it at this time.

W4: The paper lacks analysis of failure examples, which is important for understanding the model's limitations and guiding future improvements. Including some qualitative analysis of incorrect predictions would strengthen the work.

Thank you for the suggestion. Due to the rebuttal policy, we are unable to include images here, but we analyze several failure cases from GeoRanker and observe some common patterns. These are often extremely challenging samples with limited or ambiguous geographic cues, for example, indoor scenes (index 55), generic landscapes (index 245), campfire portraits (index 531), and close-up portraits (index 844) in the Im2GPS3k dataset.

These cases suggest that future improvements may require modeling fine-grained visual differences or incorporating reasoning mechanisms to disambiguate visually similar but geographically distant scenes. We will include this qualitative analysis in the revised version.

Q1: Between L126 and L127, the authors claim that “These adapters, along with the GPS encoder, are trained using an InfoNCE loss to align the query and candidate representations, as in G3.” I'm not sure what the “GPS encoder” here is.

Thank you for the question. The “GPS encoder” referred to here follows the design used in G3, which is inspired by GeoCLIP and aims to embed geographic coordinates (latitude and longitude) into a dense representation space suitable for contrastive training.

More specifically, in G3, raw GPS coordinates are first projected from WGS84 (EPSG:4326) into the Mercator coordinate system (EPSG:3857) and normalized to the [-1, 1] range. To capture multi-scale spatial relationships, G3 applies Gaussian encoding at multiple resolutions. Each encoded vector is then passed through an MLP to obtain a 512-dimensional representation. The outputs of the resolution-specific encoders are summed to form the final GPS embedding.

We adopt this GPS encoder directly in our implementation to remain consistent with the retrieval pipeline of G3. Since our focus is on ranking rather than retrieval, we did not modify the GPS encoding component. We will revise the paper to clarify this design, and will consider adding a footnote or an appendix section to explain this module more clearly.
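A minimal NumPy sketch of the encoder steps described above (Mercator projection, normalization to [-1, 1], multi-resolution encoding, an MLP to 512 dimensions, summation). It is an illustration under stated assumptions, not G3's code: the function names, the `sigmas` scales, the number of frequencies, the random-Fourier-feature reading of "Gaussian encoding", and the randomly initialized two-layer MLP weights are all hypothetical choices made for the example.

```python
import numpy as np

EARTH_RADIUS_M = 6378137.0             # WGS84 semi-major axis used by EPSG:3857
MERCATOR_MAX = np.pi * EARTH_RADIUS_M  # half-width of the Web Mercator square, in metres
MAX_LAT = 85.05112878                  # latitude limit of EPSG:3857


def to_normalized_mercator(lat_deg, lon_deg):
    """Project WGS84 degrees to Web Mercator and scale both axes to [-1, 1]."""
    lat = np.radians(np.clip(lat_deg, -MAX_LAT, MAX_LAT))
    lon = np.radians(lon_deg)
    x = EARTH_RADIUS_M * lon
    y = EARTH_RADIUS_M * np.log(np.tan(np.pi / 4 + lat / 2))
    return np.clip(np.stack([x, y], axis=-1) / MERCATOR_MAX, -1.0, 1.0)


def gaussian_features(xy, sigma, num_freqs, rng):
    """Assumed form of the encoding: random Fourier features, frequencies ~ N(0, sigma^2)."""
    freqs = rng.normal(0.0, sigma, size=(2, num_freqs))
    proj = 2.0 * np.pi * xy @ freqs
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=-1)


def mlp(x, w1, w2):
    """Two-layer MLP with ReLU; the weights are random placeholders, not trained."""
    return np.maximum(x @ w1, 0.0) @ w2


def encode_gps(lat_deg, lon_deg, sigmas=(1.0, 16.0, 256.0), num_freqs=128, dim=512, seed=0):
    """Sum the outputs of one encoder per resolution to get a dim-dimensional embedding."""
    rng = np.random.default_rng(seed)
    xy = to_normalized_mercator(np.asarray(lat_deg, float), np.asarray(lon_deg, float))
    out = np.zeros(xy.shape[:-1] + (dim,))
    for sigma in sigmas:  # one encoder per spatial resolution (hypothetical scales)
        feats = gaussian_features(xy, sigma, num_freqs, rng)
        w1 = rng.normal(0.0, 0.05, size=(2 * num_freqs, dim))
        w2 = rng.normal(0.0, 0.05, size=(dim, dim))
        out += mlp(feats, w1, w2)
    return out


embeddings = encode_gps([48.8584, 40.6892], [2.2945, -74.0445])
```

In a trained model the MLP weights and possibly the frequency matrices would be learned with the contrastive objective; here they only demonstrate the shapes and the data flow.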

Q2: The authors use MP16-Pro as a database in constructing the GeoRanking dataset, but where is the query from? Could you describe in detail the composition of the GeoRanking dataset?

Thank you for the question. The queries in the GeoRanking dataset are randomly sampled images from the MP16-Pro dataset. For each query image, we retrieve its top-20 most similar images (excluding the query itself) using the retrieval pipeline. These retrieved images serve as the candidate set for ranking. We will update the paper to clarify the construction process of the GeoRanking dataset in more detail.
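The construction step above (sample queries from the database, then keep each query's top-20 neighbours while excluding the query itself) can be sketched as follows. This is a simplified illustration: it assumes a single embedding per image and plain cosine similarity, whereas the actual retrieval pipeline may combine several modalities; `build_candidate_sets` and the random stand-in embeddings are hypothetical.

```python
import numpy as np


def build_candidate_sets(embeddings, query_ids, top_k=20):
    """Return, for each query index, the indices of its top_k most similar
    database items by cosine similarity, excluding the query itself."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    candidates = {}
    for q in query_ids:
        sims = normed @ normed[q]
        sims[q] = -np.inf  # never retrieve the query itself
        top = np.argpartition(-sims, top_k)[:top_k]  # unordered top_k
        candidates[int(q)] = top[np.argsort(-sims[top])].tolist()  # most similar first
    return candidates


rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 64))  # stand-in for database image embeddings
queries = rng.choice(len(database), size=5, replace=False)  # randomly sampled queries
candidate_sets = build_candidate_sets(database, queries)
```

Each entry of `candidate_sets` then corresponds to one query with its 20 ranked candidates.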

Q3: In Section 3.4 (i.e., Inference), Equation (6) does not include negative samples (C_neg). However, in Figure 2, negative samples are involved in the inference process. Is there an error in Equation (6)?

Thank you for your comment. You are correct, negative samples $C_{\text{neg}}$ are indeed involved during the inference stage, as illustrated in Figure 2. We will revise Equation (6) accordingly to ensure consistency.

Q4: According to the ablation study (i.e., Table 2), it seems that using GPT4V to generate some candidates brings even more improvement on IM2GPS3K than the second-order distance loss. Have you tried using other LVLMs to generate candidates? How were the results?

Thank you for the thoughtful question. As you pointed out, the candidate generation module plays an important role in overall performance. In addition to GPT-4V, we also experiment on IM2GPS3K using the open-source model Qwen2-VL-7B to generate candidates. The results are shown below:

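| Methods     | 1km   | 25km  | 200km | 750km | 2500km |
|-------------|-------|-------|-------|-------|--------|
| G3          | 16.65 | 40.94 | 55.56 | 71.24 | 84.68  |
| Qwen2-VL-7B | 18.18 | 44.08 | 60.79 | 75.64 | 88.56  |
| GPT-4V      | 18.79 | 45.05 | 61.49 | 76.31 | 89.29  |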

As shown in the table, using Qwen2-VL-7B to generate the candidate set $C_g$ results in some performance degradation compared to GPT-4V, but it still yields a clear improvement over G3. This also indicates that GeoRanker can further improve geolocalization accuracy as the capabilities of the underlying generation model advance. In addition, GeoRanker is fully compatible with open-source candidate generation, improving both reproducibility and accessibility. We will include these results in the revised version and clarify that the model choice for candidate generation is flexible, allowing researchers to reproduce and extend our work with fully open-source alternatives.
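For readers unfamiliar with the column headers, the sketch below shows how accuracy at distance thresholds is commonly computed in geolocalization work: the percentage of predictions whose great-circle distance to the ground truth falls within each threshold. The response does not spell out the evaluation code, so the haversine formula, the mean Earth radius, and the function names here are assumptions.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed)
THRESHOLDS_KM = (1, 25, 200, 750, 2500)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def accuracy_at_thresholds(pred, truth, thresholds=THRESHOLDS_KM):
    """Percentage of predictions within each distance threshold of the truth."""
    pred, truth = np.asarray(pred, float), np.asarray(truth, float)
    dist = haversine_km(pred[:, 0], pred[:, 1], truth[:, 0], truth[:, 1])
    return {t: 100.0 * float(np.mean(dist <= t)) for t in thresholds}


# Two queries photographed in Paris: one predicted exactly, one predicted in London.
truth = [(48.8566, 2.3522), (48.8566, 2.3522)]
pred = [(48.8566, 2.3522), (51.5074, -0.1278)]
acc = accuracy_at_thresholds(pred, truth)
```

In this toy case the London prediction misses the three fine-grained thresholds but still counts at 750 km and 2500 km, which is why coarse columns are always at least as high as fine ones.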

Q5: Regarding Figure 4 (i.e., time efficiency), I would like to know where the candidates come from for the different candidate counts: only retrieved candidates, or both retrieved and generated ones? Is the time needed to generate candidates with GPT4V included in the inference time?

Thank you for the insightful question. In Figure 4, the inference time reported does not include the time required to generate candidates using GPT-4V. The goal of this figure is to compare the ranking efficiency under a fixed set of candidates.

Your observation touches on an important trade-off in GeoRanker’s design:

For time-sensitive scenarios, using only retrieved candidates with GeoRanker ranking offers a strong balance between efficiency and accuracy.
For performance-critical applications, combining retrieved and generated candidates provides better accuracy at the cost of higher latency due to the additional generation step.

We appreciate your comment and will incorporate this clarification into the final version of the paper.

2025-08-04

Many thanks for the authors' detailed response. I have no other concerns and will keep the original score.

2025-08-04

I also strongly encourage authors to discuss the feasibility of using the Visual Place Recognition techniques to improve geolocalization in the paper, as mentioned by Reviewer ppXS.

2025-08-05

Thank you for your valuable feedback and support. We will add a discussion section on the feasibility of Visual Place Recognition techniques for improving geolocalization in the final version of the paper.

Review

Rating: 5 · Confidence: 5 · 2025-07-03

GeoRanker is a novel distance-aware ranking framework that utilizes large vision-language models for worldwide image geolocalization, focusing on modeling spatial relationships among candidate locations. It introduces a multi-order distance loss to rank both absolute and relative distances, enabling better reasoning over structured spatial relationships.

Strengths and Weaknesses

Strengths

Well-written and Easy to Understand: The paper is clearly articulated and accessible.
Extensive Ablations and Benchmarks: The authors have provided thorough experimental validation.
Significant Performance Improvement: The paper demonstrates a notable leap in performance.
Logical and Elegant Intuition: The concept of encoding cross-similarity is intuitive and makes strong logical sense.

Weaknesses

Missing Citations:
- [1] GOMAA-Geo: GOal Modality Agnostic Active Geo-localization, NeurIPS 2024
- [2] OpenStreetView-5M: The Many Roads to Global Visual Geolocation, CVPR 2024
- [3] Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation, CVPR 2025
- [4] GaGA: Towards Interactive Global Geolocation Assistant, arXiv 2024
Dataset Contamination: The use of Mp-16 is a concern due to known contamination with evaluation frameworks, which can unfairly benefit retrieval-based methods like G3 or GeoRanker. To address this, results on OSV-5M [2] would be more convincing, as it employs a 1km separation radius between test and train data, preventing data leakage. Additionally, OSV-5M is not based on YFCC, ensuring compatibility with other benchmarks without contamination.
Unclear Benefit of VLM Usage: The rationale for employing a Vision-Language Model (VLM) is not clear. It is questioned whether positional embeddings for GPS coordinates could achieve similar results, and if the extensive parameters of a VLM are truly necessary.
Philosophical Concerns with Retrieval Methods: There is a fundamental philosophical issue with retrieval-based methods for geolocalization. As the dataset grows, performance mechanically improves, potentially making the task trivial if global image coverage is achieved. This approach is seen as diminishing the inherent complexity and "beauty" of geolocalization, a task that ideally involves information extraction, analysis, critical thinking, and world knowledge, rather than mere retrieval.

Questions

One question I have: could you ablate the number of samples in the retrieval pool?

Limitations

yes

Final Justification

The discussion confirmed to me that this is a good paper that deserves to be accepted; I therefore keep my rating. Integrating the discussion from the rebuttal will strengthen the paper even more.

Formatting Concerns

Author Response

2025-07-31

W1: Missing Citations

Thank you for the pointer. We will include these relevant works in the revised version under the related work section.

W2: Dataset Contamination: The use of Mp-16 is a concern due to known contamination with evaluation frameworks, which can unfairly benefit retrieval-based methods like G3 or GeoRanker. To address this, results on OSV-5M would be more convincing, as it employs a 1km separation radius between test and train data, preventing data leakage. Additionally, OSV-5M is not based on YFCC, ensuring compatibility with other benchmarks without contamination.

Thank you for the suggestion. We use the MP-16 dataset to ensure fair comparison with prior work such as GeoCLIP, PIGEON, and Img2Loc. We agree that results on OSV-5M would be more convincing. Given its scale and the limited rebuttal time, we will consider incorporating OSV-5M in future work.

W3: Unclear Benefit of VLM Usage: The rationale for employing a Vision-Language Model (VLM) is not clear. It is questioned whether positional embeddings for GPS coordinates could achieve similar results, and if the extensive parameters of a VLM are truly necessary.

Thank you for the thoughtful comment. Our motivation for using a Vision-Language Model (VLM) lies in two aspects: (1) leveraging the strong visual recognition and understanding capabilities learned during pretraining, and (2) utilizing the VLM’s architecture to model rich interactions between the query image and candidate information (image, text, and GPS) in a unified manner.

The suggestion of using positional embeddings for GPS coordinates is very insightful. We agree this could be an interesting alternative that can be explored in follow-up work. Regarding model size, we use the 7B Qwen2-VL model, which we find to be a practical balance between performance and cost. We also evaluated a smaller 2B variant (Section 4.7 and Appendix H) and observed a noticeable drop in performance, suggesting that model scale contributes meaningfully to the task.

W4: Philosophical Concerns with Retrieval Methods: There is a fundamental philosophical issue with retrieval-based methods for geolocalization. As the dataset grows, performance mechanically improves, potentially making the task trivial if global image coverage is achieved. This approach is seen as diminishing the inherent complexity and "beauty" of geolocalization, a task that ideally involves information extraction, analysis, critical thinking, and world knowledge, rather than mere retrieval.

Thank you for this deep and thought-provoking perspective. While our approach is retrieval-based, the core contribution of GeoRanker lies in modeling the complex relationships between the query and candidate locations, beyond simple similarity. This enables the model to understand spatial relevance in a more nuanced and structured way. We believe this insight is broadly applicable and can also benefit non-retrieval-based approaches that aim to incorporate semantic reasoning, geographic understanding, or world knowledge into geolocalization models. Additionally, we also observe that some current mainstream reasoning models, such as O3, require retrieving images from the internet to aid in thinking and decision-making. GeoRanker can seamlessly integrate with these models.

Q1: Could you ablate the number of samples in the retrieval pool?

Thank you for the suggestion. We conduct an ablation study on IM2GPS3K to analyze the impact of the retrieval pool size on GeoRanker’s performance. Specifically, we sampled 10%, 25%, and 50% of the full retrieval pool and compared the results:

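| Variants | 1km   | 25km  | 200km | 750km | 2500km |
|----------|-------|-------|-------|-------|--------|
| 10%      | 14.91 | 40.77 | 58.26 | 75.61 | 88.52  |
| 25%      | 16.95 | 43.04 | 59.93 | 76.38 | 88.59  |
| 50%      | 17.25 | 43.31 | 60.06 | 76.74 | 88.72  |
| 100%     | 18.79 | 45.05 | 61.49 | 76.31 | 89.29  |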

The results show that: (1) GeoRanker’s performance consistently improves as the retrieval pool size increases; (2) The improvement is more pronounced on fine-grained metrics (1km, 25km, 200km) compared to coarse-grained ones (750km, 2500km), indicating that a larger pool provides more precise candidates that benefit high-resolution geolocalization.

This is a very interesting and meaningful finding. Thank you again for your insightful question. We will include this experiment and its results in the final version of the paper.

2025-08-01

Thanks to the authors for the time spent in the rebuttal!

w2: Would be great if results on OSV-5M can be included in the camera ready!

w3 and w4: Thank you for the clarifications; it would be great if those elements could be added to the main paper.

Q1: So clearly, the retrieval helps with very low-scale geoloc. It's a bit worrying, as it seems to show you "overfit" more!

Overall, I think this is a good paper and the authors have answered most of my issues.

I keep my rating and recommend to accept this paper!

2025-08-03

Thank you for your feedback and for reviewing our rebuttal. We're glad our response could address your concerns, and we appreciate your support.

Review

Rating: 4 · Confidence: 3 · 2025-07-05

This paper introduces GeoRanker, a novel distance-aware ranking framework for worldwide image geolocalization. The task of predicting GPS coordinates from images globally is challenging due to the vast diversity in visual content across regions. Existing methods often rely on simplistic similarity heuristics and point-wise supervision, which fail to model spatial relationships among candidate locations.

GeoRanker addresses these limitations by leveraging large vision-language models (LVLMs) to jointly encode query-candidate interactions and predict geographic proximity. It incorporates a multi-order distance loss that ranks both absolute and relative distances, enabling the model to reason over structured spatial relationships.

The contributions of this work are:

-C1. GeoRanker Framework: Introduction of GeoRanker, a distance-aware ranking framework that models spatial relationships among candidate locations using a multi-order distance loss and large vision-language models.

-C2. GeoRanking Dataset: Creation of GeoRanking, the first dataset tailored for spatial ranking tasks, featuring rich multimodal annotations including GPS coordinates, textual descriptions, and image data to facilitate future research.

-C3. State-of-the-Art Performance: Achievement of state-of-the-art results on two well-established public geolocalization benchmarks (IM2GPS3K and YFCC4K), demonstrating substantial gains at fine-grained localization levels, and effectiveness shown through comprehensive ablations. The authors also provide their code, checkpoint, and dataset online for reproducibility

Strengths and Weaknesses

Strengths:

a. A Rather Clever New Framework: The paper puts forward GeoRanker, a fresh and rather astute framework for worldwide image geolocalization that's commendably distance-aware. It quite smartly tackles the spatial relationships among potential locations, which is a significant step up from some of the more simplistic methods we've seen.
b. Making Good Use of Modern Tech: It very wisely employs those large vision-language models (LVLMs). This means it's adept at understanding how queries and candidates interact, and rather accurately predicts geographic proximity – quite the modern approach.
c. The Multi-Order Distance Loss is a Stroke of Genius: This particular loss function is a rather neat innovation. It allows the model to rank distances, both absolute and relative, which in turn means it can reason about structured spatial relationships with a good deal more sophistication, leading to improved accuracy.

Weaknesses:

The Sheer Scale of the Task: While GeoRanker is clearly a step forward, the business of worldwide image geolocalization is inherently a colossal challenge. The visual diversity across the globe is simply immense. Even with such an advanced model, truly capturing every local visual nuance remains a persistent hurdle.
As for GeoRanking, presented as the first dataset for geographic ranking with multimodal information: could you provide a bit more detail on the scale of this dataset – for instance, the number of images, unique locations, and the diversity of the textual descriptions?

Questions

Could you elaborate a bit more on the specific architecture of the Large Vision-Language Models (LVLMs) you've employed within GeoRanker? Were these pre-trained, and if so, on what sort of data?
The "multi-order distance loss" sounds rather interesting. Could you perhaps walk us through a more detailed example of how it operates, particularly in distinguishing between absolute and relative distances during training?
You mention jointly encoding query-candidate interactions. What specific mechanisms or fusion techniques are used to achieve this effective joint encoding?
How sensitive is the performance of GeoRanker to the choice of the underlying LVLM, or to the specific hyperparameters of the multi-order distance loss?

Limitations

A Bit Resource-Intensive (One Might Infer): Given the reliance on those hefty Large Vision-Language Models, one might reasonably surmise that training and perhaps even running GeoRanker could demand a fair amount of computational horsepower. This isn't explicitly stated as a limitation in the paper, but it's often the case with such powerful models, which could be a bit of a sticky wicket for some researchers lacking the necessary computing muscle.

Final Justification

Overall, this is an interesting work with many promising details. I look forward to seeing novel domain applications and a theoretical framework for data and learning modeling.

Formatting Concerns

n/a

Author Response

2025-07-31

Weakness: The Sheer Scale of the Task: While GeoRanker is clearly a step forward, the business of worldwide image geolocalization is inherently a colossal challenge. The visual diversity across the globe is simply immense. Even with such an advanced model, truly capturing every local visual nuance remains a persistent hurdle. As for GeoRanking, presented as the first dataset for geographic ranking with multimodal information: could you provide a bit more detail on the scale of this dataset – for instance, the number of images, unique locations, and the diversity of the textual descriptions?

Thank you for the comment. We agree that worldwide image geolocalization remains an extremely challenging task due to the vast visual diversity across the globe, and current performance still leaves room for improvement. This is precisely the motivation behind our work.

To support progress in this direction, we introduce GeoRanking, the first dataset explicitly designed for geographic ranking with multimodal information. It contains 100,000 query images, paired with 2 million query–candidate pairs (total 2,100,000 locations) sampled from diverse global locations. Each candidate includes GPS coordinates, textual descriptions (e.g., city, country), and image data, covering a broad spectrum of geographic and semantic contexts. We hope this dataset can help facilitate future advances in this area.

Q1: Could you elaborate a bit more on the specific architecture of the Large Vision-Language Models (LVLMs) you've employed within GeoRanker? Were these pre-trained, and if so, on what sort of data?

Thank you for the question. We use Qwen2-VL-7B as the backbone LVLM in GeoRanker. This model is pretrained for general-purpose vision-language tasks (details available in the corresponding technical report [1] ). Since the pretrained model is not specifically designed for geolocation, we apply LoRA-based fine-tuning together with a lightweight value head to adapt it to the image geolocalization task. This enables the model to predict distance scores based on spatial relationships between query–candidate pairs.
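To make the adaptation recipe above concrete, here is a minimal NumPy sketch of a LoRA-style linear layer followed by a scalar value head. It illustrates the general technique only: the rank, the `alpha` scaling, the layer sizes, and the choice of which hidden state feeds the value head are assumptions, and none of the names below come from the GeoRanker code.

```python
import numpy as np


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=8, alpha=16.0, seed=0):
        rng = np.random.default_rng(seed)
        out_dim, in_dim = weight.shape
        self.weight = weight                                    # frozen pretrained weight
        self.A = rng.normal(0.0, 0.01, size=(rank, in_dim))     # trainable
        self.B = np.zeros((out_dim, rank))                      # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        # The base projection is untouched; only the low-rank path is adapted.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T


def value_head(hidden, w, b=0.0):
    """Map a hidden vector to one scalar score per query-candidate pair."""
    return hidden @ w + b


rng = np.random.default_rng(1)
pretrained_w = rng.normal(size=(16, 32))      # stand-in for one backbone projection
layer = LoRALinear(pretrained_w, rank=4)
hidden_in = rng.normal(size=(3, 32))          # three query-candidate pairs (toy sizes)
scores = value_head(layer(hidden_in), rng.normal(size=16))
```

Because `B` starts at zero, the adapted layer initially reproduces the pretrained output exactly, so fine-tuning begins from the backbone's original behaviour.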

Q2: The "multi-order distance loss" sounds rather interesting. Could you perhaps walk us through a more detailed example of how it operates, particularly in distinguishing between absolute and relative distances during training?

Thank you for the question. Our multi-order distance loss consists of two components:

The first-order loss supervises the model to rank candidates by their absolute distance to the query. It encourages the model to assign higher scores to geographically closer candidates. For example, if the ground-truth distances are: $d_1 = 10\text{km}, \quad d_2 = 50\text{km}, \quad d_3 = 200\text{km}$ , then the model should output scores such that: $s_1 > s_2 > s_3$ . This trains the model to rank candidates according to their absolute distances from the query.
The second-order loss focuses on the relative distance gaps between candidate pairs. It trains the model to reflect larger geographic differences with larger score differences, enabling more fine-grained spatial understanding. For example: $|d_1 - d_2| = 40\text{km}, \quad |d_1 - d_3| = 190\text{km}$ , then we expect $|\Delta s_{1,2}| < |\Delta s_{1,3}|$ .

Together, these two components help the model learn not only which candidate is closest, but also how much closer it is compared to others.
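The worked example above can be turned into a small numerical sketch. The first-order term below is a top-k Plackett-Luce negative log-likelihood over candidates sorted by distance. The second-order term is an assumed formulation that applies the same likelihood to pairwise score gaps, ordered by distance gap; the response states only the ordering constraint, so this is one plausible reading rather than the paper's exact loss. The defaults `k1=1` and `k2=1` are illustrative.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def plackett_luce_nll(ordered_scores, top_k):
    """Negative log-likelihood that the first top_k items are picked in the given order."""
    s = np.asarray(ordered_scores, float)
    return float(sum(logsumexp(s[i:]) - s[i] for i in range(top_k)))


def multi_order_loss(scores, distances_km, k1=1, k2=1):
    scores = np.asarray(scores, float)
    distances_km = np.asarray(distances_km, float)
    order = np.argsort(distances_km)          # nearest candidate first
    s, d = scores[order], distances_km[order]

    # First order: closer candidates should receive higher scores.
    first = plackett_luce_nll(s, k1)

    # Second order (assumed form): larger distance gaps should map to larger score gaps.
    i, j = np.triu_indices(len(s), k=1)       # all pairs with candidate i nearer than j
    gap_d = d[j] - d[i]                       # distance gap, always >= 0
    gap_s = s[i] - s[j]                       # score gap produced by the model
    pair_order = np.argsort(-gap_d)           # largest distance gap first
    second = plackett_luce_nll(gap_s[pair_order], k2)
    return first + second


distances = [10.0, 50.0, 200.0]               # d_1, d_2, d_3 from the example above
consistent = multi_order_loss([3.0, 2.0, 0.0], distances)   # s_1 > s_2 > s_3
reversed_ = multi_order_loss([0.0, 2.0, 3.0], distances)    # ranking is inverted
```

With these inputs the consistent scoring yields a lower loss than the reversed one, matching the intuition that the model is rewarded both for ranking the 10 km candidate first and for separating it most from the 200 km candidate.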

Q3: You mention jointly encoding query-candidate interactions. What specific mechanisms or fusion techniques are used to achieve this effective joint encoding?

Thank you for the question. We leverage the inherent multi-modal fusion capability of the LVLM (Qwen2-VL-7B) to jointly encode query–candidate interactions. Specifically, we construct a prompt that includes both the query image and candidate information (GPS, text, and image), and feed it into the LVLM directly. The model attends across all modalities to capture joint semantics without requiring any additional fusion module.

Q4: How sensitive is the performance of GeoRanker to the choice of the underlying LVLM, or to the specific hyperparameters of the multi-order distance loss?

Thank you for the excellent question. We here detail more on the sensitivity of GeoRanker to both the LVLM backbone and the multi-order distance loss hyperparameters.

For the LVLM, we experiment with different model scales (0.5B, 2B, and 7B parameters), as shown in Figure 6 and Appendix H. Performance improves consistently with larger models, indicating that GeoRanker benefits from stronger backbone representations.

For the loss hyperparameters:

The impact of $K^{(1)}$ , which controls the number of top candidates supervised in the first-order loss, is analyzed in our hyperparameter study (Section 4.4, Figure 3). Larger values lead to a performance drop due to train-test mismatch.

To ablate the impact of $K^{(2)}$ , we fix $K^{(1)}$ and vary the formula used to compute $K^{(2)}$ as $K^{(2)} = \frac{[(k_1-1) + (k_1 - K^{\prime})] \cdot K^{\prime}}{2}$ , modifying $K^{\prime}$ to evaluate the effectiveness of our proposed design.

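| Variants                      | 1km   | 25km  | 200km | 750km | 2500km |
|-------------------------------|-------|-------|-------|-------|--------|
| $K^{(2)}=0$                   | 18.48 | 44.61 | 60.96 | 75.61 | 88.28  |
| $K^{(2)}$ with $K^{\prime}=1$ | 18.79 | 45.05 | 61.49 | 76.31 | 89.29  |
| $K^{(2)}$ with $K^{\prime}=2$ | 18.42 | 44.61 | 61.36 | 76.11 | 88.66  |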

As shown in the table, the model achieves the best performance when $K^{\prime} = K^{(1)}$ .
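The closed form for $K^{(2)}$ above is the sum of the arithmetic series $(k_1-1) + (k_1-2) + \dots + (k_1-K^{\prime})$. The short check below verifies this identity numerically. The response does not define $k_1$; the sketch assumes it is the number of candidates per query, and the value 20 used in the example is purely illustrative.

```python
def num_second_order_pairs(k1: int, k_prime: int) -> int:
    """K^(2) = [(k1 - 1) + (k1 - K')] * K' / 2, as written in the response."""
    # The product is always even, so integer division is exact.
    return ((k1 - 1) + (k1 - k_prime)) * k_prime // 2


def series_sum(k1: int, k_prime: int) -> int:
    """Term-by-term sum (k1 - 1) + (k1 - 2) + ... + (k1 - K')."""
    return sum(k1 - i for i in range(1, k_prime + 1))


# Illustrative values only: 20 candidates per query (assumed), K' in {0, 1, 2}.
pairs_by_k_prime = {kp: num_second_order_pairs(20, kp) for kp in (0, 1, 2)}
```

Under this reading, $K^{\prime}=0$ gives no second-order pairs, which lines up with the $K^{(2)}=0$ row of the table, and each increment of $K^{\prime}$ adds the pairs formed by one more top-ranked candidate with every candidate ranked below it.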

[1] Wang, Peng, et al. "Qwen2-vl: Enhancing vision-language model's perception of the world at any resolution." arXiv preprint arXiv:2409.12191 (2024).

2025-08-06

Dear reviewer, we sincerely appreciate your valuable time and insightful suggestions on our paper. We hope that we have addressed your concerns with our responses. Since the reviewer-author discussion deadline is approaching, please let us know if you have any other questions. We are glad to further respond to your concerns. With these adjustments in place, we kindly ask if you would consider the possibility of raising the score for our submission. Thank you very much!

Review

Rating: 4 · Confidence: 4 · 2025-07-12

This paper introduces GeoRanker, a framework for worldwide image geolocalization that refines the retrieve-and-rank paradigm. Instead of relying on simple similarity scores, GeoRanker employs a Large Vision-Language Model (LVLM) to learn a fine-grained, distance-aware scoring function that explicitly models the relationship between a query image and multiple candidates. The core of the method is a multi-order distance loss that supervises the model on both the absolute ranking of candidates and the relative distance gaps between them. To facilitate this, the authors created and have released GeoRanking, a large-scale, multimodal dataset for this task.

Strengths and Weaknesses

Strengths:

The paper convincingly argues that existing geolocalization methods fail to model the rich spatial relationships between candidates.
GeoRanker achieves state-of-the-art performance on two standard benchmarks, IM2GPS3K and YFCC4K, outperforming 11 baselines, including the most recent SOTA methods.
The paper introduces GeoRanking, a dataset specifically designed for learning to rank geographic entities. With 100,000 samples and 2 million query-candidate pairs containing rich multimodal information (images, GPS, text), this dataset is a valuable contribution to the community. Its public release will likely foster future research in geolocalization, multimodal learning, and information retrieval.

Weaknesses:

During inference, the proposed method integrates candidates generated by GPT-4V, a proprietary, expensive, and non-deterministic model. This dependency raises significant concerns regarding the reproducibility and accessibility of the results.
The authors identify that GeoRanker introduces computational overhead compared to retrieval-only methods. While the efficiency analysis in Figure 4 shows GeoRanker is faster than a generic LVLM prompting baseline, this comparison is not sufficient. A direct comparison of inference latency and computational cost against the primary SOTA baseline, G3, is missing. Without this, it's hard to assess the practical trade-offs of the proposed method.
Here are the design choices for the multi-order loss function that could benefit from deeper justification:

The choice to set the hyperparameter for the first-order loss to 1 is counterintuitive. This simplifies the Plackett-Luce loss to only supervising the ranking of the single best candidate, seemingly underutilizing the ranking formulation. The ablation study shows that performance degrades for $K^{(1)} > 1$ . The accompanying explanation of a "train-test mismatch" is brief and requires more detail to be fully convincing.
The formula for calculating $K^{(2)}$ , the number of pairs in the second-order loss, is complex and presented without an ablation study or strong intuition. It is unclear if this specific formulation is critical or if a simpler heuristic would suffice.

Questions

The method's reliance on GPT-4V for candidate generation is a central concern for reproducibility. Could you provide a more granular breakdown of performance? Specifically, what is the accuracy of GeoRanker when using (a) only the top-k retrieved candidates, (b) only the candidates generated by GPT-4V, and (c) the combined set? This would clarify the exact contribution of each candidate source. Furthermore, to improve accessibility, have you considered using a powerful open-source LVLM (like the Qwen2-VL-7B you use as a backbone) for the candidate generation step as well? An analysis of this trade-off would be very valuable.
The Prompting baseline is an important point of comparison for ranking ability and efficiency. Could you clarify its implementation? Was this a zero-shot evaluation where the LVLM was asked to choose the best location from a list in a single pass? How does this setup compare to the RAG pipelines used in other baselines like Img2Loc and G3? Clarifying this would help situate GeoRanker's contribution more precisely within the landscape of LVLM-based geolocalization methods.

Limitations

N/A

Final Justification

After reading the authors' response, all my concerns have been addressed. Therefore, I would like to keep my original score: borderline accept.

Formatting Concerns

N/A

Author Response

2025-07-31

W1: During inference, the proposed method integrates candidates generated by GPT-4V, a proprietary, expensive, and non-deterministic model. This dependency raises significant concerns regarding the reproducibility and accessibility of the results.

Thank you for pointing this out. We understand the concern regarding the reproducibility and accessibility of using GPT-4V.

Our use of GPT-4V was intended to ensure consistency with prior works such as Img2Loc and G3, which also used GPT-4V for candidate generation. However, in response to your comment, we additionally evaluate our method using Qwen2-VL-7B, a fully open-source vision-language model. The results on IM2GPS3K are as follows:

Methods	1km	25km	200km	750km	2500km
G3	16.65	40.94	55.56	71.24	84.68
GeoRanker with Qwen2-VL-7B	18.18	44.08	60.79	75.64	88.56
GeoRanker with GPT-4V	18.79	45.05	61.49	76.31	89.29

W2: The authors identify that GeoRanker introduces computational overhead compared to retrieval-only methods. While the efficiency analysis in Figure 4 shows GeoRanker is faster than a generic LVLM prompting baseline, this comparison is not sufficient. A direct comparison of inference latency and computational cost against the primary SOTA baseline, G3, is missing. Without this, it's hard to assess the practical trade-offs of the proposed method.

Thank you for the suggestion. We conduct experiments and add a direct comparison of inference latency and computational cost between GeoRanker and G3.

Inference Latency

G3 runs inference on multiple combinations of reference GPS coordinates to produce diverse candidate predictions, and we assume these inference steps can be parallelized. We perform inference on a single sample, using the same hyperparameter settings as reported in the G3 paper.

Method	Inference Latency
G3	6.70s (generation) + 0.06s (geo-verification) = 6.76s
GeoRanker	6.53s (candidate generation) + 1.89s (ranking) = 8.42s

GeoRanker introduces moderate overhead due to the ranking step but remains within a practical latency range.

Computational Cost

Method	Input Tokens	Output Tokens
G3	21860 / sample	3600 / sample
GeoRanker	4565 / sample	900 / sample

G3 requires about 4.8 times the input tokens and 4 times the output tokens of GeoRanker because it performs multiple rounds of inference over different combinations of reference GPS coordinates.

In the geo-verification and ranking phases, G3 uses an efficient embedding-based verifier, whereas GeoRanker performs forward passes with a Qwen2-VL-7B-based model, which is also efficient to deploy locally.

We will include these results and discussion in the revised version to provide a clearer view of the trade-offs.

W3: Here are the design choices for the multi-order loss function that could benefit from deeper justification:

The choice to set the hyperparameter for the first-order loss to 1 is counterintuitive. This simplifies the Plackett-Luce loss to only supervising the ranking of the single best candidate, seemingly underutilizing the ranking formulation. The ablation study shows that performance degrades for $K^{(1)}>1$ . The accompanying explanation of a "train-test mismatch" is brief and requires more detail to be fully convincing.

Thank you for the question. We believe the main reason is the distribution shift in candidate sets: both training and testing candidates are retrieved from MP16-Pro, and selecting the best prediction corresponds to the case of $K^{(1)} = 1$ . When $K^{(1)} > 1$ , the mismatch between candidate distributions during training and testing may introduce noise, which we suspect leads to the observed performance drop. Thank you again for the suggestion. We will provide a more detailed explanation in the corresponding section.
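To make the $K^{(1)} = 1$ case concrete, here is a minimal sketch assuming the textbook top-K Plackett-Luce negative log-likelihood; the paper's exact first-order loss may differ. The function name `pl_topk_nll`, the toy scores, and the toy ordering are illustrative and do not come from the paper. With `top_k=1` the loss reduces to softmax cross-entropy on the single nearest candidate, which matches the inference-time objective of selecting one best prediction.

```python
import numpy as np


def logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))


def pl_topk_nll(scores, order, top_k):
    """Top-K Plackett-Luce negative log-likelihood (textbook form, an assumption).

    scores: model score for each candidate.
    order:  candidate indices sorted from nearest to farthest from the ground truth.
    top_k:  number of leading positions that receive supervision.
    """
    s = np.asarray(scores, dtype=float)[np.asarray(order)]
    nll = 0.0
    for i in range(min(top_k, len(s))):
        # Probability of picking the i-th nearest candidate among those not yet picked.
        nll -= s[i] - logsumexp(s[i:])
    return nll


# Toy values for illustration only.
toy_scores = [2.0, 0.5, 1.0, -1.0]
toy_order = [2, 0, 1, 3]  # candidate 2 is nearest, candidate 3 is farthest

loss_k1 = pl_topk_nll(toy_scores, toy_order, top_k=1)
loss_k2 = pl_topk_nll(toy_scores, toy_order, top_k=2)
```

Each extra supervised position adds a non-negative term, so a larger `top_k` spends part of the training signal on the order of candidates that are never selected at test time.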

The formula for calculating $K^{(2)}$ , the number of pairs in the second-order loss, is complex and presented without an ablation study or strong intuition. It is unclear if this specific formulation is critical or if a simpler heuristic would suffice.

Thank you for the suggestion. Our formulation of $K^{(2)}$ is designed to focus second-order supervision on candidate pairs that are most relevant to the primary supervision signal.

Specifically, because $L_{\text{PL}}^{(1)}$ supervises only the top $K^{(1)}$ candidates, the relative ordering among the rest is unconstrained. Therefore, in $L_{\text{PL}}^{(2)}$ , we restrict the pairwise supervision to those pairs where at least one candidate is within the top $K^{(1)}$ . This prevents noisy or irrelevant candidate pairs (e.g., both far from the ground truth) from dominating the optimization.

The proposed formula ensures this by setting $K^{(2)} = \frac{[(k_1-1) + (k_1 - K^{(1)})] \cdot K^{(1)}}{2}$ , which accounts for all pairs involving the top $K^{(1)}$ candidates. We found this setting to be both principled and effective.
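The pair count in this formula can be checked with a short sketch. It assumes that $k_1$ is the number of candidates per query, which is not restated in this response; the function names are illustrative. The closed form is compared against a brute-force enumeration of all candidate pairs in which at least one member is among the top $K^{\prime}$.

```python
from itertools import combinations


def num_second_order_pairs(k1, k_prime):
    """Closed form: K2 = [(k1 - 1) + (k1 - K')] * K' / 2."""
    return ((k1 - 1) + (k1 - k_prime)) * k_prime // 2


def count_pairs_brute_force(k1, k_prime):
    """Count unordered pairs (i, j), i < j, with at least one member in the top K'."""
    # combinations() yields i < j, so the pair touches the top K' exactly when i < k_prime.
    return sum(1 for i, _ in combinations(range(k1), 2) if i < k_prime)


pairs_k1 = num_second_order_pairs(20, 1)  # 20 candidates, K' = 1
pairs_k2 = num_second_order_pairs(20, 2)  # 20 candidates, K' = 2
```

With 20 candidates, $K^{\prime} = 1$ gives 19 pairs (the top candidate against every other one) and $K^{\prime} = 2$ gives 37.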

To ablate the impact of $K^{(2)}$ , we fix $K^{(1)}$ and vary the formula used to compute $K^{(2)}$ as $K^{(2)} = \frac{[(k_1-1) + (k_1 - K^{\prime})] \cdot K^{\prime}}{2}$ , allowing us to evaluate the effectiveness of our proposed design.

Variants	1km	25km	200km	750km	2500km
$K^{(2)}=0$	18.48	44.61	60.96	75.61	88.28
$K^{(2)}$ with $K^{\prime}=1$	18.79	45.05	61.49	76.31	89.29
$K^{(2)}$ with $K^{\prime}=2$	18.42	44.61	61.36	76.11	88.66

As shown in the table, the model achieves the best performance when $K^{\prime} = K^{(1)}$ . We will include this explanation and the result in the formal version of our paper.

Q1: The method's reliance on GPT-4V for candidate generation is a central concern for reproducibility. Could you provide a more granular breakdown of performance? Specifically, what is the accuracy of GeoRanker when using (a) only the top-k retrieved candidates, (b) only the candidates generated by GPT-4V, and (c) the combined set? This would clarify the exact contribution of each candidate source. Furthermore, to improve accessibility, have you considered using a powerful open-source LVLM (like the Qwen2-VL-7B you use as a backbone) for the candidate generation step as well? An analysis of this trade-off would be very valuable.

Thank you for the thoughtful question. To better understand the contribution of each candidate source, we provide a detailed breakdown of GeoRanker’s performance on IM2GPS3K under three settings:

Methods	1km	25km	200km	750km	2500km
only the top-k retrieved candidates	18.21	43.47	59.69	75.47	88.75
only the candidates generated by GPT-4V	12.11	35.67	51.62	68.93	81.98
the combined set	18.79	45.05	61.49	76.31	89.29

These results show that while retrieval provides stronger candidates, generation offers complementary value, particularly for long-tail or hard cases: relative to the retrieval-only setting, the combined set improves accuracy from 43.47 to 45.05 at the 25km level and from 59.69 to 61.49 at the 200km level.

Regarding accessibility, we also evaluate candidate generation using Qwen2-VL-7B, the powerful open-source LVLM you mentioned. Please refer to our response to W1 for the results.

Q2: The Prompting baseline is an important point of comparison for ranking ability and efficiency. Could you clarify its implementation? Was this a zero-shot evaluation where the LVLM was asked to choose the best location from a list in a single pass? How does this setup compare to the RAG pipelines used in other baselines like Img2Loc and G3? Clarifying this would help situate GeoRanker's contribution more precisely within the landscape of LVLM-based geolocalization methods.

Thank you for the question. We clarify that the Prompting baseline is implemented as a zero-shot, single-pass evaluation. Specifically, we construct a prompt that:

Presents the query image to the LVLM.
Includes several negative samples (GPS + textual descriptions) to provide contrastive context.
Provides a list of candidate options, each with GPS coordinates, textual descriptions, and image data.

The LVLM is asked to select the most plausible candidate from this list as the final prediction. Importantly, the output must correspond to one of the provided candidates.

This differs from prior RAG-style pipelines such as Img2Loc, which (1) only provide GPS coordinates for candidates (without multimodal context), and (2) allow the model to generate predictions outside the candidate set. Our Prompting baseline enforces strict candidate selection and richer input signals, thus offering a stronger and more controlled comparison.

G3 is essentially performing candidate generation rather than ranking; in G3, the final prediction is selected through geo-verification, which relies on an embedding-based model. Compared to G3, our Prompting baseline focuses purely on ranking within a given candidate set, and is more directly comparable to GeoRanker.

2025-08-06

Dear reviewer, we sincerely appreciate your insightful and valuable feedback on our paper. If you have any additional questions or suggestions, we are delighted to address them promptly. In addition, we are committed to revising our paper based on your suggestions to enhance its quality. In light of these revisions, we respectfully request that you consider reevaluating the score for our submission. Thank you very much for your time and consideration.

2025-08-08

After reading the author's response, all my concerns have been addressed. Therefore, I would like to keep my original score: borderline accept.

Review

Rating: 5 | Confidence: 3 | 2025-07-22

A model for geolocalization based on G3 is presented. The authors hypothesize that previous models fail to correctly rank the list of candidates even though the initial set of candidates might contain correct predictions. To address this, the authors preprocess the MP16-Pro dataset to include multiple candidates per query and train a model with a novel two-order loss that accounts for the spatial relations between candidates. They perform experiments on standard datasets, obtaining state-of-the-art results and improving over the G3 baseline.

Strengths and Weaknesses

Strengths:

This paper addresses a very interesting idea that has not been widely studied before: the ranking of visual geolocalization systems. As the authors describe, the top-k candidates retrieved by G3 contain better candidates than the final answer of the model. This suggests that explicitly addressing the ranking can bring benefits to the model.

The multi-order loss seems like a neat way of addressing the issue. Besides, it is relevant to bring the Plackett-Luce loss into the visual geolocalization field for future research.

The experiments show that the model obtains the best performance to date, improving over the G3 baseline, even though it uses a very similar approach. Ablation experiments show the benefits of the different components of the method.

Weaknesses:

The pipeline is very similar to that of G3. The authors include a novel loss and a preprocessing of the dataset to accommodate that loss, but most of the architecture remains very similar. The claim of curating a new dataset should be heavily downplayed, as this is just a preprocessing of an existing dataset. To alleviate the problem of the pipeline being similar to G3, I recommend that the authors explicitly explain the G3 pipeline in the text and highlight the differences.

There is a big focus in this paper on how it addresses the ranking of candidates. However, the metrics from the experiments, although they show improvements in global geolocalization, do not measure ranking effectiveness. The only experiment that addresses the ranking is the Case Study in Section 4.8, but this is only qualitative. I invite the authors to include quantitative experiments with metrics from the “learning to rank” literature to showcase the improvements in ranking capabilities.

Questions

Can the authors give a detailed summary of the similarities and differences with G3 to more easily assess the novelty of the paper?
Could the authors include some metrics or quantitative experiments that showcase the improvement in ranking performance? This could strengthen the paper and the formulation of the ranking issue.
This comment is not specific to this paper but rather to the field. The field is closely related to Visual Place Recognition, yet the two fields have developed their own literature with little overlap. I wonder whether some of the advances in Visual Place Recognition for creating robust and discriminative embeddings could be applied to this field to improve the candidate retrieval phase. Conversely, this field could also help Visual Place Recognition. Could the authors provide some insights regarding this?

Limitations

This kind of work can have societal impact in terms of privacy violations, as models can recover the approximate location of images. Yet the authors do not discuss this.

Final Justification

Authors addressed most of the concerns in the rebuttal and engaged in fruitful discussions.

They provided extra quantitative experiments that defend the relevance of their ranking losses. Besides they promised to revisit and rephrase the GeoRanking dataset part to downplay it a little bit.

Overall it's a good paper. Provided the authors explain clearly the differences with G3, the contributions are valuable and interesting for the community.

Formatting Concerns

N/A

Author Response

2025-07-31

W1: The pipeline is very similar to that of G3. The authors include a novel loss and a preprocessing of the dataset to accommodate that loss, but most of the architecture remains very similar. The claim of curating a new dataset should be heavily downplayed, as this is just a preprocessing of an existing dataset. To alleviate the problem of the pipeline being similar to G3, I recommend that the authors explicitly explain the G3 pipeline in the text and highlight the differences.

Thank you for the suggestion. We acknowledge that our retrieval component builds on the results produced by G3. However, our work specifically targets the ranking problem, which is distinct from the retrieval step in a standard information retrieval pipeline. While G3 focuses on generating plausible candidate locations, our core contribution lies in the ranking stage, where we design a novel model (GeoRanker), introduce a new data construction protocol, and propose a loss function tailored to fine-grained geolocation ranking. To address the concern about similarity, we will revise the paper to explicitly describe the G3 pipeline and clearly delineate how our approach differs, particularly in terms of motivation, architecture, and optimization objectives.

W2: There is a big focus in this paper on how it addresses the ranking of candidates. However, the metrics from the experiments, although they show improvements in global geolocalization, do not measure ranking effectiveness. The only experiment that addresses the ranking is the Case Study in Section 4.8, but this is only qualitative. I invite the authors to include quantitative experiments with metrics from the “learning to rank” literature to showcase the improvements in ranking capabilities.

Thank you for the valuable suggestion. We agree that directly evaluating the ranking effectiveness is important to support our claim.

In response, we conduct additional experiments on IM2GPS3K using standard learning-to-rank metrics, specifically Recall@K and NDCG@K, to quantitatively assess our ranking performance. We use the top-20 candidates retrieved by G3 and treat the most accurate candidate (closest to the ground truth) as the positive label. The results are summarized below:

Methods	Recall@1	Recall@5	Recall@10	NDCG@10	NDCG@20
G3	0.0894	0.3217	0.5672	0.2825	0.3912
GeoRanker	0.1982	0.5169	0.7387	0.4318	0.4979

As shown, GeoRanker significantly improves both Recall and NDCG over G3. This demonstrates that our ranking module effectively re-orders the candidates to improve fine-grained geolocalization accuracy. We will include these results in the revised version of the paper.

Q1: Can the authors give a detailed summary of the similarities and differences with G3 to more easily assess the novelty of the paper?

Please refer to our response to W1.

Q2: Could the authors include some metrics or quantitative experiments that showcase the improvement in ranking performance? This could strengthen the paper and the formulation of the ranking issue.

Please refer to our response to W2.

Q3: This comment is not specific to this paper but rather to the field. The field is closely related to Visual Place Recognition, yet the two fields have developed their own literature with little overlap. I wonder whether some of the advances in Visual Place Recognition for creating robust and discriminative embeddings could be applied to this field to improve the candidate retrieval phase. Conversely, this field could also help Visual Place Recognition. Could the authors provide some insights regarding this?

Thank you for raising this insightful and forward-looking question. Your suggestion to leverage robust and discriminative embeddings developed in the VPR literature for candidate retrieval in global geolocalization is very compelling. To our knowledge, current image geolocalization methods, including ours, have not yet fully explored representation learning techniques specifically optimized for robustness under varying appearance and viewpoint changes, which is a strength of VPR research. We agree this could be a promising direction for future work.

Conversely, the image geolocalization community has introduced techniques that could potentially benefit VPR, such as the use of multimodal alignment with textual/geographic priors (e.g., in GeoCLIP, G3), vision-language model reasoning (e.g., Img2Loc, G3), and the ranking-based framework we propose in this paper.

We also believe there is an opportunity for hybrid frameworks. For example, a hierarchical pipeline where a global geolocalization module provides coarse candidate regions, followed by fine-grained matching using VPR techniques, could effectively combine the strengths of both fields. We hope this paper can serve as a step toward bridging the two communities.

2025-08-03

Thank you for your thoughtful comments. We are glad to provide more details regarding the calculation of the metrics and GeoRanking Dataset.

For Recall@k, your understanding is mostly correct. We compute the metric based on whether the correct candidate (denoted as y) appears within the top-k candidates retrieved by the model. Specifically, Recall@k represents the percentage of times that the correct candidate appears within the top-k retrieved candidates across all samples. For example, Recall@5 refers to the percentage of cases where y is among the top-5 retrieved candidates. The calculation is done as follows:

# Find the position of the closest GPS (0-based index)
position = pred_list.index(closest_gps)

# Calculate Recall@k
metrics[method]['recall@1'].append(1 if position == 0 else 0)
metrics[method]['recall@5'].append(1 if position < 5 else 0)
metrics[method]['recall@10'].append(1 if position < 10 else 0)

Finally, we average the results across all samples to obtain the final Recall@k metrics.
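The averaging step can be sketched as a small self-contained function. This is an illustrative sketch, not the authors' evaluation script: `recall_at_k` and the toy positions are hypothetical, and each position is assumed to be the 0-based rank of the closest candidate for one query, as in the snippet above.

```python
def recall_at_k(positions, k):
    """Fraction of queries whose closest candidate is ranked within the top k.

    positions: 0-based rank of the closest candidate, one entry per query.
    """
    if not positions:
        return 0.0
    return sum(1 for p in positions if p < k) / len(positions)


# Toy positions for four hypothetical queries (not results from the paper).
toy_positions = [0, 3, 7, 12]

recall_1 = recall_at_k(toy_positions, 1)
recall_5 = recall_at_k(toy_positions, 5)
recall_10 = recall_at_k(toy_positions, 10)
```

Because exactly one candidate per query counts as correct, Recall@k here is the same as the hit rate at k.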

For NDCG@k, we appreciate your insight. In our previous implementation, we did not use the scores generated by each method but rather focused on the position of the closest candidate in the ranking list. NDCG requires each candidate to have a relevance label with respect to the query. In our previous experiments, we chose to assign a label of 1 to the closest candidate and 0 to all others. Here is the code we used to compute NDCG@k:

# For NDCG@k, use 1 / log2(position + 2) if position < k, else 0 (position is 0-based)
# here IDCG is always 1, so we just need to calculate the DCG
metrics[method]['ndcg@20'].append(1 / np.log2(position + 2))  # Always < 20
if position < 10:
    metrics[method]['ndcg@10'].append(1 / np.log2(position + 2))
else:
    metrics[method]['ndcg@10'].append(0)

Based on your feedback, we realized that the previous NDCG calculation method overlooked the ranking of other candidates. We have come up with the following labeling method that is more suitable for image geolocation: based on the actual distance between the query and the candidate, we set the following labels:

0km $\leq$ Distance $<$ 1km, label = 1
1km $\leq$ Distance $<$ 25km, label = 0.8
25km $\leq$ Distance $<$ 200km, label = 0.6
200km $\leq$ Distance $<$ 750km, label = 0.4
750km $\leq$ Distance $<$ 2500km, label = 0.2
2500km $\leq$ Distance, label = 0
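The label scheme above can be expressed as a threshold lookup. The sketch below is one possible implementation, assuming the distance has already been computed in kilometres; the names `THRESHOLDS_KM`, `LABELS`, and `distance_to_label` are illustrative and not taken from the authors' code.

```python
from bisect import bisect_right

# Upper bounds (exclusive) of the first five distance buckets, in km.
THRESHOLDS_KM = [1, 25, 200, 750, 2500]
# One label per bucket; the last label covers distances of 2500km and beyond.
LABELS = [1.0, 0.8, 0.6, 0.4, 0.2, 0.0]


def distance_to_label(distance_km):
    """Map a query-candidate distance in km to a graded relevance label."""
    if distance_km < 0:
        raise ValueError("distance must be non-negative")
    # bisect_right places a distance equal to a threshold in the next bucket,
    # which matches the half-open intervals [lower, upper) listed above.
    return LABELS[bisect_right(THRESHOLDS_KM, distance_km)]
```

The bucket boundaries coincide with the accuracy thresholds reported in the tables (1km, 25km, 200km, 750km, 2500km), so a candidate's label drops by 0.2 each time it falls outside one more threshold.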

We then recompute the NDCG using the following function:

import numpy as np

def calculate_ndcg(relevance_scores, k):
    # relevance_scores: relevance labels of the top-20 candidates, in the order ranked by GeoRanker or G3
    top_k = min(k, len(relevance_scores))  # Only consider the top k

    # Calculate DCG; position j is 0-based, hence log2(j + 2)
    dcg = sum(relevance_scores[j] / np.log2(j + 2) for j in range(top_k))

    # Calculate IDCG (ideal DCG, candidates sorted by relevance)
    sorted_relevance_scores = sorted(relevance_scores, reverse=True)
    idcg = sum(sorted_relevance_scores[j] / np.log2(j + 2) for j in range(top_k))

    return dcg / idcg if idcg != 0 else 0

Finally, we average the results across all samples to obtain the final NDCG@k metrics, which are shown below:

Methods	NDCG@5	NDCG@10	NDCG@20
G3	0.6238	0.6735	0.7989
GeoRanker	0.8026	0.8419	0.8872

We can still observe that GeoRanker shows significant improvements over G3 across all metrics.
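As a sanity check on how the graded NDCG relates to the earlier binary version, the sketch below re-implements the same computation in vectorized form; `ndcg_at_k` and the toy label lists are illustrative, not the authors' code. With a single candidate labeled 1 and all others 0, the ideal DCG is 1, so the score reduces to 1 / log2(position + 2), the quantity used in the earlier binary computation.

```python
import numpy as np


def ndcg_at_k(relevance, k):
    """NDCG@k for a list of relevance labels given in ranked order."""
    rel = np.asarray(relevance, dtype=float)
    k = min(k, rel.size)
    discounts = 1.0 / np.log2(np.arange(k) + 2)  # 0-based positions
    dcg = float(np.sum(rel[:k] * discounts))
    ideal = np.sort(rel)[::-1][:k]  # best achievable ordering of the same list
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0


# Toy label lists for illustration only.
binary_labels = [0, 0, 1, 0]         # the only relevant candidate sits at position 2
all_irrelevant = [0.0, 0.0, 0.0]     # every candidate is 2500km or more away
perfectly_ranked = [1.0, 0.8, 0.6]   # already sorted by relevance

ndcg_binary = ndcg_at_k(binary_labels, 10)
ndcg_irrelevant = ndcg_at_k(all_irrelevant, 3)
ndcg_perfect = ndcg_at_k(perfectly_ranked, 3)
```

A list with no relevant candidate scores 0 under this definition, whereas a perfectly ordered list scores 1.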

For the GeoRanking Dataset, yes, we processed the existing MP16-Pro dataset to obtain the GeoRanking dataset, which is the first dataset in the worldwide image geolocalization field used for training a ranker. We will revisit the phrasing in the paper to better clarify this point while maintaining its relevance to the community, and we will downplay the novelty of the dataset as you suggested.

Thank you once again for your helpful comments and for taking the time to review our work. We hope these clarifications address your concerns, and we look forward to incorporating them in the revised version. If you believe that our responses have sufficiently addressed the issues raised, we kindly ask you to consider the possibility of raising the score.

2025-08-07

Thanks for your detailed explanation of the Recall and NDCG metrics! I think these experiments will help highlight the novel ranking improvements presented in this paper.

Besides, I appreciate the authors' intention to rephrase the GeoRanking dataset part.

I am happy with the authors' rebuttal and happy to accept the paper.

2025-08-08

Thank you for your positive feedback and constructive suggestions. We sincerely appreciate your multiple rounds of thoughtful comments and engagement throughout the review process, which have significantly improved the quality and clarity of our work. We will carefully incorporate your suggestions into the final version of the paper.

2025-08-03

Dear authors,

Thanks for your detailed answer. I appreciate the summarized differences with G3, the ranking metrics, and the insightful discussion about VPR.

The ranking metrics are particularly interesting, and I'm happy to see such good results from the authors' method. However, I would like a few more details, if possible, on how both metrics were computed.

For Recall@k, I understood that the closest candidate among the top-20 of G3, let's call it y, is considered the correct one. Thus, Recall@1 measures whether the top candidate from G3 or GeoRanker is 'y'. Is that right?

For NDCG@k, how was it computed? Using the scores computed by each method and the rankings of the retrieved candidates over the top k candidates? Wouldn't this ignore how relevant these k candidates are among the global list of candidates? For example, if the retrieved top-k candidates are correctly ranked, but they are all non-relevant.

Is my understanding correct that the proposed GeoRanking dataset could be described as preprocessing of an existing dataset? This is still relevant to the community, but the phrasing on the main text might need to be revisited.

Thanks for your time

Final Decision: Accept (poster)

2025-09-17

GeoRanker introduces a distance-aware ranking framework for worldwide image geolocalization, leveraging large vision-language models and a multi-order distance loss to model spatial relationships among candidate locations. The paper presents strong empirical results, outperforming state-of-the-art baselines on IM2GPS3K and YFCC4K, and includes ablations, reproducibility efforts, and a new multimodal dataset specifically designed for ranking tasks.

A key concern relates to dataset contamination: the use of the MP16-Pro dataset was flagged due to known contamination with evaluation frameworks. This could unfairly benefit retrieval-based methods like G3 or GeoRanker, as models might have seen similar data during pretraining or evaluation setup. Evaluating on OSV-5M, a cleaner benchmark, was recommended. The authors acknowledged this concern and explained that MP16-Pro was used for consistency with prior work, but they agreed that OSV-5M would be a stronger benchmark and committed to including it in future work.

The reviewers broadly agree on the paper’s technical soundness and relevance, with reviewers generally recommending acceptance. Concerns around reproducibility, dataset novelty, and efficiency were sufficiently addressed in the rebuttal. Given the paper’s contributions, clarity, and community value, I recommend acceptance with the expectation that the additional results and final comments by the reviewers are included in a final version of the paper.