G3: An Effective and Adaptive Framework for Worldwide Geolocalization Using Large Multi-Modality Models
Summary
Reviews and Discussion
In the paper, the authors propose a novel framework, G3, for worldwide geolocalization, i.e., locating a given photograph anywhere on Earth. The authors address the challenges of capturing location-specific visual cues and handling variations in image data distribution across the globe. G3 uses a three-step process: Geo-alignment, which learns location-aware image representations; Geo-diversification, which employs multiple retrieval-augmented prompts for robust location prediction; and Geo-verification, which combines retrieved and generated location data for the final prediction. The authors also introduce the MP16-Pro dataset to support location-aware visual representation learning. Experiments on the IM2GPS3k and YFCC4K datasets demonstrate the superiority of G3 over existing methods.
Strengths
- All the modules in the G3 framework (Geo-alignment, Geo-diversification, and Geo-verification) seem logical and rational. Three kinds of embeddings from the vision encoder are used for retrieval, and an LLM is used to generate a set of plausible coordinates given positive and negative examples.
- The method achieves superior performance over several baselines at various levels of granularity on IM2GPS3k and YFCC4K.
- Overall, the method is interesting and novel, and the writing and flow of the paper are clear.
Weaknesses
- The only limitation discussed concerns inference efficiency. However, there is no mention of how much compute time and memory (in numbers) is required to geolocalize a given input image.
- There are no concrete qualitative examples of failures reported in the paper. Can the system be fooled easily? For example, if an image from Italy contains a person holding a flag of the Netherlands, is the system capable of correctly geolocalizing the image? How does the RAG system along with the LLM perform in such a case?
- Limited evaluation considering the state of the art. No mention of recent works such as Pigeon [1] or GeoReasoner [2].
- Why did the authors choose the CLIP vision encoder for extracting image features? Recent works have shown that purely image-based pretrained models such as DINOv2 are better feature extractors than CLIP. No ablation study is done for the choice of the vision encoder.
- Overall, from the discussion in L316-L333 and Figure 4, it looks like the number of references provided to the LLM depends heavily on, and varies with, the image content. The performance is highly sensitive to this hyperparameter, and a single value cannot guarantee optimal performance. This can make the framework unreliable for practical use cases.
[1] Haas, Lukas, Michal Skreta, Silas Alberti, and Chelsea Finn. "PIGEON: Predicting Image Geolocations." In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12893-12902. 2024.
[2] Li, Ling, Yu Ye, Bingchuan Jiang, and Wei Zeng. "GeoReasoner: Geo-localization with Reasoning in Street Views using a Large Vision-Language Model." In Forty-first International Conference on Machine Learning. 2024.
Questions
- There is only a marginal improvement in performance when including the Geo-diversification step, considering it is potentially the most expensive step during inference.
- Currently, the text associated with each coordinate only includes the country and city labels. Will the performance of the framework improve by including fine-grained details such as region and/or street name?
Limitations
Limitations are included but failure cases are missing.
W1: No mention of how much compute time and memory (in number) is required to geo-localize a given input image.
Response:
Thanks for your comment. We gather the compute time and memory costs of Geo-diversification and Geo-verification, as Geo-alignment is not directly used during inference. We will add this information in our final version. The statistics are as follows:
| Phase | Time cost | Memory cost |
|---|---|---|
| Geo-diversification | LLaVA generating time: 4s/10 prompts; Loading Model: 10s | GPU Memory: 20.81GB; Memory: 6GB |
| Geo-verification | Evaluating time on IM2GPS3K: 56s | GPU Memory: 8.85GB; Memory: 10.37GB |
W2: Hard and failure case analysis.
Response:
Thanks for your interesting question.
We identify an image similar to the situation you mentioned, depicting a man in Paris holding an American flag. Detailed images can be found in the PDF file in global response. Upon testing, G3 accurately predicts the location as Paris without being fooled.
Additionally, we use images from the failure-analysis section of GeoReasoner concerning the Eiffel Tower for testing, which are also included in the PDF. We find that G3 can accurately identify the replica of the Eiffel Tower in the USA but fails to recognize the replica in China, which is still better than GeoReasoner. This discrepancy may be due to the presence of more iconic buildings in the images of the USA replica, which aid in location determination, whereas the images of the Chinese replica lack clear geographic indicators, leading to incorrect predictions by the model. All these results and analyses will be included in our final version.
W3: Limited evaluation considering the state of the art. No mention of recent works such as Pigeon [1] or GeoReasoner [2].
Response:
Thank you for your comment. We present a comparison of the performance of G3 and PIGEON on the commonly used datasets. There are two reasons we do not take GeoReasoner as a baseline:
- GeoReasoner is specifically fine-tuned for country classification, which differs from our focus on worldwide geolocalization.
- GeoReasoner conducts experiments on a filtered IM2GPS3K dataset, which includes only highly locatable data. Therefore, the experimental results in the GeoReasoner paper are not directly comparable.
The results of PIGEON and G3 are shown below:
IM2GPS3K
| Model | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| PIGEON | 11.3 | 36.7 | 53.8 | 72.4 | 85.3 |
| G3(GPT4V) | 16.65 | 40.94 | 55.56 | 71.24 | 84.68 |
YFCC4K
| Model | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| PIGEON | 10.4 | 23.7 | 40.6 | 62.2 | 77.7 |
| G3(GPT4V) | 23.99 | 35.89 | 46.98 | 64.26 | 78.15 |
From the experimental results, we find that G3 demonstrates superior performance on most metrics. Additionally, it is worth noting that during training PIGEON incorporates not only the MP16 dataset but also an additional 340k images from the Google Landmarks v2 dataset, utilizing a larger training set than our work. Thanks again for your advice; we will include these discussions in our final version.
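For reference, the Street/City/Region/Country/Continent numbers in the tables above are accuracies at distance thresholds: the fraction of test images whose predicted coordinate lies within 1/25/200/750/2500 km of the ground truth. A minimal sketch of this metric (helper names are ours, not the paper's code):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points in kilometres.
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_thresholds(preds, truths, thresholds_km=(1, 25, 200, 750, 2500)):
    # Fraction of predictions within each distance threshold of the ground truth.
    dists = [haversine_km(p[0], p[1], t[0], t[1]) for p, t in zip(preds, truths)]
    return {k: sum(d <= k for d in dists) / len(dists) for k in thresholds_km}
```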
W4: Why using CLIP Vision encoder.
Response:
Thanks for your questions. In this paper, we use CLIP's vision encoder to align our work with previous studies such as GeoCLIP and Img2Loc, ensuring the results are comparable.
W5: From the discussion in L316-L333 and Figure 4, it looks like the number of references provided to the LLM depends heavily on, and varies with, the image content. The performance is highly sensitive to this hyperparameter, and a single value cannot guarantee optimal performance, which can make the framework unreliable for practical use cases.
Response:
Thank you for your question.
You are correct: as mentioned in lines L316-L333, the number of references provided to the LLM depends heavily on and varies with the image content. This heterogeneity in the spatial distribution of images can make the LMM's predictions highly sensitive. To address this issue, we propose Geo-diversification: by combining RAG templates with different numbers of reference coordinates, we consider all candidates jointly to enhance the robustness of the model's predictions.
Figure 4 illustrates the trade-off between different levels of prediction accuracy. If the application scenario requires high accuracy for small-scale predictions, a smaller number of candidates for each RAG prompt should be chosen. Conversely, if the scenario requires high accuracy for large-scale predictions, a larger number of candidates for each RAG prompt is preferable. Overall, selecting 5 as the hyperparameter provides a balanced performance, achieving optimal results at the region, country, and continent levels, and near-optimal results at the street and city levels.
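The multi-prompt construction described above can be sketched as follows (the template wording and function names are illustrative, not the paper's exact prompts):

```python
def build_rag_prompts(reference_coords, counts=(0, 1, 2, 5, 10)):
    # One RAG prompt per reference count; 0 references degenerates to zero-shot.
    prompts = []
    for k in counts:
        refs = reference_coords[:k]
        if refs:
            ref_text = "; ".join(f"({lat:.4f}, {lon:.4f})" for lat, lon in refs)
            prompts.append(
                f"Here are coordinates of {len(refs)} visually similar images: {ref_text}. "
                "Predict the (latitude, longitude) of the query image."
            )
        else:
            prompts.append("Predict the (latitude, longitude) of the query image.")
    return prompts
```

Each prompt is sent to the LMM independently, and all outputs enter the candidate pool for Geo-verification.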
Q1: There is only a marginal improvement in performance when including the geo-diversification step considering it is potentially the most expensive step during the inference.
Response:
Thanks for your comment. Through the ablation study in Table 2, we observe that removing Geo-diversification results in a significant drop in performance. On the Im2GPS3K dataset, the average accuracy across five scales decreases by 2.14%, and on the YFCC4K dataset, the average accuracy across five metrics drops by 8.28%. These results demonstrate the necessity and effectiveness of Geo-diversification.
Q2: Including fine-grained details in Geo-alignment.
Response:
Thank you for your question. Exploring the impact of finer-grained text descriptions on performance requires rerunning the Geo-alignment, which takes approximately 70 hours of training time. We have not yet completed this experiment, but we will provide the results during the discussion phase next week.
I thank the authors for posting the clarifications and additional results which help strengthen the paper. I have updated my score accordingly.
Thank you very much for your feedback and for updating the score. The experimental results regarding integrating more fine-grained geographical textual descriptions in the Geo-alignment will be provided tomorrow. Thanks for your patience.
Thank you for your valuable reviews and patience. We have completed the experiments on incorporating more fine-grained textual descriptions in Geo-alignment. Specifically, in addition to the city, county, and country information in the textual descriptions of coordinates, we also introduce neighbourhood information, the most fine-grained level that can be obtained from Nominatim. We use G3-N to denote this variant and keep all other hyperparameters the same as G3. The experimental results on IM2GPS3K are presented below:
| Methods | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| G3-N | 16.44 | 40.64 | 54.35 | 70.57 | 83.98 |
| G3 | 16.65 | 40.94 | 55.56 | 71.24 | 84.68 |
From the results, we can see that G3 outperforms G3-N across all metrics. This may be because the text encoder's pre-training corpus contains very few instances of neighborhood-level information, resulting in weaker modeling capabilities for neighborhood names. Therefore, introducing neighborhood information into the textual descriptions of coordinates actually adds noise, which negatively impacts the effectiveness of Geo-alignment and subsequently reduces the model's prediction accuracy.
Once again, thank you for taking the time to review our paper. If you feel that our responses have adequately addressed your concerns, we kindly ask if you could consider raising the score. Thank you! If you have any further questions, please do not hesitate to let us know.
Thank you for the additional experiment on including fine-grained text during training. I raise my score to 6.
Thank you for your positive feedback. We're glad the additional experiment could address your concerns.
This paper introduces G3, a RAG framework for geo-localization. With a three-step process, the G3 framework achieves superior performance against other SoTA methods. To improve the expressiveness of the image embeddings, the paper proposes a new dataset, MP16-Pro, which adds textual descriptions to the existing MP-16 dataset. Through comprehensive experiments, the paper demonstrates the necessity and merits of the G3 framework.
Strengths
- The proposed method, G3, achieved competitive performance in geo-localization against existing classification-based, retrieval-based and RAG-based methods.
- Compared to the original MP16 dataset, the proposed MP16-Pro dataset additionally provides textual geolocation-related data, which could be beneficial to the community if fully open-sourced.
- Comprehensive experiments and explanations are provided to prove the effectiveness and necessity of each G3 component.
- The authors have open-sourced their project in a clear and instructive manner, which is very positive for reproduction.
Weaknesses
- The necessity of the Geo-alignment module in the G3 framework has been indicated by the experimental results in Table 2. However, the authors do not address the motivation for their particular design of the alignment module. Why do the image features have to align with both text features and GPS features? Does aligning with just one modality work as well? The authors should clarify this by conducting the corresponding ablation study.
- One of the main contributions claimed by the authors is the introduction of the MP16-Pro dataset. However, the description of the construction process for the MP16-Pro dataset is not sufficiently detailed.
- The G3 framework is not the first work to incorporate RAG into geolocalization, nor the first to use retrieval-based models. While the proposed method achieves superior performance, its novelty is somewhat limited.
- In the experimental setup in Section 5 (line 226), no retrieved coordinate is considered when evaluating G3 on IM2GPS3K, which is inconsistent with Figure 2 of the paper. If using no retrieved coordinates works better in certain cases, I wonder about the necessity and applicability of such a design.
Questions
- Since the proposed dataset, MP16-Pro, is relatively large, I wonder if the authors have run decontamination procedures to ensure there is no overlapping data between training and evaluation.
- The authors of Img2Loc provided experimental results with other LMMs (LLaVA). I wonder what G3's performance is when switching the LMM to LLaVA, compared to Img2Loc. I think by incorporating this result the authors can more robustly state their superiority over existing methods.
- As shown in Figure 2, the text descriptions of the location in MP16-Pro are not used during inference, I wonder if adding text descriptions to the prompt would work better.
Limitations
The authors of this paper have addressed the limitation of the G3 framework by pointing out its high computational cost. The introduction of alignment and diversification brings more computation and latency compared to existing methods, which limits retrieval and inference speed.
W1: The necessity of the Geo-alignment module in the G3 framework has been indicated by experiments results in Table 2. However, the authors did not address the motivation for their particular design choice of the alignment module. Why do the image features have to align with both text features and gps features? Does aligning with simply one modality work just as well? The authors should clarify that by conducting the corresponding ablation study.
Response:
Thank you for your suggestion.
Firstly, from an intuitive perspective, GPS and text data model continuous and discrete geographic information, respectively. For example, two cities on either side of a national border belong to different countries and may even have different driving directions; such discrete, boundary-sensitive information is best captured by aligning text with image data. On the other hand, their terrain, landscape, and climate are generally similar, which is naturally expressed by aligning GPS data with image data.
Second, we also conduct experiments to clarify the necessity of aligning image features with both text and GPS features. Please refer to the Ablation study of Geo-alignment section in the global response. We will also add these results to our final version.
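A CLIP-style two-term contrastive loss is one plausible instantiation of this two-modality alignment (a minimal numpy sketch; the loss form, equal weighting, and function names are our assumptions, not the paper's exact objective):

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    # One-directional InfoNCE: matched rows of a and b are positives,
    # all other rows in the batch serve as negatives.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def geo_alignment_loss(img, txt, gps):
    # Align image features with both text and GPS features (equal weights assumed).
    return info_nce(img, txt) + info_nce(img, gps)
```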
W2: One of the main contributions claimed by the authors is the introduction of the MP16-Pro dataset. However, the description of the construction process for the MP16-Pro dataset is not sufficiently detailed.
Response:
Thanks for your comment.
The original MP16 dataset provides image data along with the GPS information of the image, such as: image 4f/a0/3963216890.jpg, LAT: 47.217578, LON: 7.542092. The MP16-Pro dataset enhances MP16 by adding textual descriptions about the location for each sample.
The enhanced sample becomes: image 4f/a0/3963216890.jpg, LAT: 47.217578, LON: 7.542092, neighbourhood: Wengistein, city: Solothurn, county: Amtei Solothurn-Lebern, state: Solothurn, region: NA, country: Switzerland, country_code: ch, continent: NA.
It is also worth noting that we perform reverse geocoding using Nominatim. This tool is error-free and is equivalent to directly inputting latitude and longitude coordinates into OpenStreetMap to obtain address information. We will add these details of MP16-Pro construction in our final version.
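The per-sample text can then be assembled from a Nominatim-style reverse-geocoding address dict; a minimal sketch mirroring the example above (the field list and "NA" fallback are our assumptions about the construction):

```python
FIELDS = ("neighbourhood", "city", "county", "state",
          "region", "country", "country_code", "continent")

def mp16_pro_text(address):
    # Flatten a Nominatim-style address dict into the "key: value"
    # textual description attached to each MP16-Pro sample.
    return ", ".join(f"{k}: {address.get(k, 'NA')}" for k in FIELDS)
```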
W3: The G3 framework was not the first work to incorporate RAG into geolocalization, nor was it the first work to use retrieval-based models. While the proposed method achieved superior performance, it lacks certain novelty to the field.
Response:
Thanks for the opportunity to clarify the motivation and contribution of G3. Our work builds upon Img2Loc, but Img2Loc and other existing approaches suffer from two serious issues: they may easily confuse distant images with similar visual content, and they cannot adapt to locations worldwide with varying amounts of relevant data. To address these issues, we propose G3 and achieve significant performance improvements. Additionally, to further advance the field, we introduce the MP16-Pro dataset, which supplements the original MP16 dataset with geographic text descriptions for each sample.
W4: Necessity of retrieved coordinate.
Response:
Thank you for your comment. The setting of retrieved coordinates is to ensure the lower bound of the pipeline's effectiveness, especially when the LMM model's capabilities are not strong enough. We conduct experiments on the LLaVA model on IM2GPS3K with the following variants:
- LLaVA S=0: LLaVA variant without retrieved coordinates.
- LLaVA S=1: LLaVA variant with one retrieved coordinate.
- LLaVA S=2: LLaVA variant with two retrieved coordinates.
| Model | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| LLaVA S=0 | 14.15 | 35.57 | 48.78 | 66.53 | 81.68 |
| LLaVA S=1 | 14.41 | 35.64 | 48.88 | 66.40 | 81.55 |
| LLaVA S=2 | 14.31 | 35.87 | 49.42 | 66.93 | 81.78 |
From the experimental results, we can see the benefit of retrieved coordinates in the LLaVA experiments: as the number of retrieved coordinates increases from 0 to 2, all metrics generally improve.
Q1: Since the proposed dataset, MP16, is relatively large, I wonder if the authors have run decomtamination procedures to ensure there is no overlapping data between training and evaluation.
Response:
Thank you for your insightful question. The dataset we use, MP16, is consistent with those used in prior works in this domain. This consistency ensures comparability and reproducibility of our results.
Q2: LMM Ablation Study.
Response:
Thanks for your advice. We conduct the experiments with LLaVA on IM2GPS3K, please refer to the Open-source LMM (LLaVA) experiments Section in the global response.
Q3: As shown in Figure 2, the text descriptions of the location in MP16-Pro are not used during inference, I wonder if adding text descriptions to the prompt would work better.
Response:
Thank you for your insightful suggestion. Based on your recommendation, we design the following experiments. The experimental variants and results are as follows:
- ZS: Zero-shot template, RAG template without any information from reference images.
- GPS: RAG template incorporating the GPS coordinates of reference images.
- GPS+Text: RAG template incorporating both the GPS information and textual descriptions of reference images.
| Variants | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| ZS | 12.41 | 35.87 | 50.88 | 67.60 | 80.75 |
| GPS | 16.65 | 40.94 | 55.56 | 71.24 | 84.68 |
| GPS+Text | 16.75 | 41.44 | 55.68 | 71.60 | 85.01 |
The experimental results indicate that adding information from more modalities of the reference images to the RAG template can effectively improve prediction accuracy. We will include these experimental results in our final version.
I appreciate the authors for providing clarifications and additional experimental results, which have addressed some of my concerns. I will raise my score to 6.
We appreciate your constructive feedback and valuable reviews. Thank you for your time and effort in reviewing our work.
This work focuses on the task of "worldwide geolocalization" with an effective and adaptive framework based on large multi-Modality models. A novel framework, i.e., G3, is proposed, including Geo-alignment, Geo-diversification, and Geo-verification. This work also releases a new dataset MP16-Pro. The experiment results show that G3 has superior performance on two well-established datasets IM2GPS3k and YFCC4K.
Strengths
- The task of "worldwide geolocalization" is very important and quite interesting.
- This paper is easy to follow, it is well-written.
- The experimental results are solid. The G3 model achieves better results than GeoCLIP and Img2Loc on IM2GPS3k and YFCC4K.
Weaknesses
- In the Geo-diversification part, there is no ablation study on different LMMs and different RAG templates. I wonder whether it still works on open-source LLMs.
- Minor:
a. Figure 1 is often set as a teaser figure, which could show the basic design of the whole work. It would be better to indicate the solution, instead of only showing the limitations.
b. Table 1 needs citations for each previous work, and GeoCLIP should be cited as NeurIPS 2023 instead of arXiv.
Questions
- Please address my above concerns on Weaknesses.
- The qualitative results show that most of the retrieved images are tourist photos. What about using real-world images from an official source (e.g., Google Maps)?
- Please briefly describe how the MP16-Pro dataset helps (or improves) the G3 model.
Limitations
None
W1: In the Geo-diversification part, there is no ablation study on different LMMs and different RAG templates. I wonder whether it still works on open-source LLMs.
Response:
Thanks for your advice. For the ablation study on different LMMs, please refer to the Open-source LMM (LLaVA) experiments section in the global response. Regarding the RAG templates, our template follows the previous work Img2Loc to ensure a fair comparison. We agree that exploring RAG templates is worthwhile, so we conduct experiments to explore the impact of incorporating different kinds of reference-image information into the RAG template on prediction performance. The experimental variants and results are as follows:
- ZS: Zero-shot template, RAG template without any information from reference images.
- GPS: RAG template incorporating the GPS coordinates of reference images.
- GPS+Text: RAG template incorporating both the GPS information and textual descriptions of reference images.
| Variants | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| ZS | 12.41 | 35.87 | 50.88 | 67.60 | 80.75 |
| GPS | 16.65 | 40.94 | 55.56 | 71.24 | 84.68 |
| GPS+Text | 16.75 | 41.44 | 55.68 | 71.60 | 85.01 |
The experimental results indicate that adding information from more modalities of the reference images to the RAG template can effectively improve prediction accuracy. We will include these experimental results in our final version. Thank you again for your assistance.
W2: Figure 1 is often set as a teaser figure, which could show the basic design of the whole work. It would be better to indicate the solution, instead of only showing the limitations.
Response:
Thank you for your comment. We understand that teaser figures are typically used to present the basic design of the entire work. We would like to clarify that our intention with the current figure was to highlight the issues with existing methods and help readers understand our motivation. In response to your suggestion, we will revise Figure 1 to include an overview of our solution, making it easier to comprehend our approach. Thank you for your valuable feedback.
W3: Table 1 needs citations for each previous work. And GeoCLIP should be noted as NeurIPS 2023 instead of arXiv.
Response:
Thanks for your comment. We will add citations for each previous work in Table 1. Additionally, we will update the reference for GeoCLIP to NeurIPS 2023 instead of arXiv and check the other references. We appreciate your attention to these details.
Q1: The qualitative results show that most of the retrieved images are photos for tourists. What about using real world images from official company (e.g., Google Map)?
Response:
Thank you for your insightful question. Adding more images from official sources could indeed enhance the robustness of the database, thereby improving the overall effectiveness of the pipeline. However, we use MP16 as the image source for this work for two main reasons:
- Consistency: To maintain consistency with other works, such as GeoCLIP and Img2Loc, which also use MP16 as their data source.
- Dataset Distribution Rights: We extended the text descriptions based on MP16 to create the new dataset MP16-Pro, which we are able to distribute. In contrast, Google's Street View images have clear distribution restrictions, so we did not include them in this work.
Your insights are thought-provoking, and we greatly appreciate your constructive feedback.
Q2: Please briefly describe how does the MP16-Pro Dataset help (or improve) the G3 model.
Response:
Thanks for your comment. The original MP16 dataset provides image data along with the GPS information of the image, such as: image 4f/a0/3963216890.jpg, LAT: 47.217578, LON: 7.542092. The MP16-Pro dataset enhances MP16 by adding textual descriptions about the location for each sample.
The enhanced sample becomes: image 4f/a0/3963216890.jpg, LAT: 47.217578, LON: 7.542092, neighbourhood: Wengistein, city: Solothurn, county: Amtei Solothurn-Lebern, state: Solothurn, region: NA, country: Switzerland, country_code: ch, continent: NA.
MP16-Pro primarily improves G3 through the following aspects:
- Adding Geographic Semantic Information: By incorporating additional geographic semantic information in geo-alignment, aligning it with GPS and image data, which makes the geographic information of samples in MP16 more accurate.
- Enhancing Image Retrieval Accuracy: More precise geographic information enhances the accuracy of images retrieved based on the query image in geo-diversification, providing more effective references for the LMM to predict coordinates. For specific details, please refer to Section 5.5 Case Study on reference image retrieval.
- Help Train the GPS Encoder: Introducing textual descriptions and aligning them with GPS information in geo-alignment helps train the GPS encoder more effectively, further increasing the reliability of the GPS encoder's judgments during geo-verification.
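The last point can be illustrated with a minimal Geo-verification sketch: score each candidate coordinate by the similarity between the image embedding and the GPS encoder's embedding of that coordinate, then keep the best-scoring candidate (function names and the cosine scoring rule are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def geo_verify(image_emb, candidates, gps_encoder):
    # Pick the candidate (lat, lon) whose GPS embedding is most similar to
    # the image embedding; gps_encoder is assumed to be the encoder trained
    # during Geo-alignment.
    img = image_emb / np.linalg.norm(image_emb)
    best, best_score = None, -np.inf
    for coord in candidates:
        g = gps_encoder(coord)
        g = g / np.linalg.norm(g)
        score = float(img @ g)
        if score > best_score:
            best, best_score = coord, score
    return best
```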
Thank you again for taking the time to review our work, and we hope our response addresses your concerns.
As we approach the end of the author-reviewer discussion period, we respectfully wish to check in and ensure that our rebuttal has effectively addressed your concerns regarding our paper. Should there be any remaining questions or if further clarifications or additional experimental results are needed, please do not hesitate to let us know. We appreciate the thoughtful reviews and the time you’ve invested in providing us with valuable feedback to improve our work. If you believe that our responses have sufficiently addressed the issues raised, we kindly ask you to consider the possibility of raising the score.
This paper proposes three steps, i.e., geo-alignment, geo-diversification, and geo-verification to optimize both retrieval and generation phases of word-wide geo-localization.
Strengths
- The motivation is clearly stated.
- The experimental results show the effectiveness of the proposed method.
- The proposed method achieves state-of-the-art performance.
- Code is publicly available.
Weaknesses
The experiments are insufficient. The model seems too large, so the authors should provide the number of parameters and GFLOPs.
Questions
- What is the purpose of image vectorization? Please provide a detailed explanation.
- Figure 6 is difficult to understand; what do the authors want to express? What does it mean that the number of references has a significant impact on the model?
- The model seems too large, the author should provide the number of parameters and gflops experiments.
- The authors propose a new dataset, MP16-Pro, but it seems that no results have been reported on this dataset.
- The authors obtain textual descriptions by reverse geocoding during database construction. Are the descriptions generated for nearby latitudes and longitudes consistent? Is the text description useful for images taken at similar locations? The authors could add descriptive-text experiments to the ablation study to prove that the text is helpful for feature representation.
- The experiments lack the results of some baselines.
- The authors do not introduce enough details about the Geo-diversification module.
- Figures 5/8 lack a comparison with the visualization results of the baselines.
Limitations
The authors discuss the limitation in the paper, but not in enough depth.
W1: The experiments are not insufficient. The model seems too large, so the author should provide the number of parameters and gflops experiments.
Response:
Thanks for your question. We compile the model's parameter counts and computational load, as shown in the table below.
| Total params | Trainable params | GFLOPs |
|---|---|---|
| 441,266,179 | 13,648,131 (3.09%) | 304.38 |
Since the trainable parameters of G3 are concentrated in the Geo-alignment stage and most module parameters are frozen, only 3.09% of the parameters need to be optimized.
Thank you again for your suggestion, we will include this data in the final version of the paper.
Q1: What is the purpose of image vectorization? Please provide a detailed explanation by the author.
Response:
Thank you for your question. Image vectorization converts images into vector representations. These vectors allow us to calculate the similarity between a query image and the images stored in our database, so that we can efficiently retrieve the reference images for the RAG process that help accurately generate the geographic location of the query image.
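A minimal sketch of this retrieval step over precomputed embeddings (brute-force cosine similarity; in practice a vector index such as Faiss would be used, and the helper name is ours):

```python
import numpy as np

def top_k_neighbors(query_vec, db_vecs, k=10):
    # Cosine-similarity retrieval of the k database images most similar
    # to the query image's embedding.
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]  # indices of the k nearest database images
```

The returned indices are then mapped to the stored GPS coordinates (and, in MP16-Pro, text descriptions) of those images.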
Q2: Figure 6 clarification.
Response:
Thank you for your question. In Figure 6, the left side of each example shows the query image, with the ground truth displayed in red text below it. On the right side are the predictions made by LMM using the same RAG template combined with different amounts of reference image information, with the most accurate prediction highlighted in red text. The blue reference text below indicates the coordinates of the top 10 most similar reference images retrieved from the database based on the query image.
In Figure 6, we aim to illustrate that introducing different numbers of coordinates as references during the RAG process significantly impacts the accuracy of the generated results. This is due to the heterogeneity of images in geographic space, which causes the number of effective reference images retrieved from the database to vary significantly across query images, further affecting the RAG outcomes. This case study validates the necessity of Geo-diversification, which mitigates the prediction sensitivity caused by the heterogeneous spatial distribution of images by generating candidate coordinates using multiple prompts that contain varying numbers of reference coordinates. We will add a more detailed description of Figure 6 to its caption.
Q3: The authors propose a new dataset, MP16-Pro, but it seems that no results are reported on this dataset.
Response:
Thank you for the opportunity to clarify our perspective. In the paper, the MP16-Pro dataset was used solely for training and for constructing the retrieval database, while all testing was conducted on the IM2GPS3K and YFCC4K datasets. This setup is consistent with existing works such as GeoCLIP and Img2Loc.
Q4: Questions about textual descriptions.
Response:
Thanks for your questions. (1) Generation consistency: we perform reverse geocoding using Nominatim. This process is error-free and equivalent to directly entering latitude and longitude coordinates into OpenStreetMap to obtain address information. (2) Effectiveness of the textual descriptions: please refer to the "Ablation study of Geo-alignment" section in the global response.
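As a minimal sketch of the reverse-geocoding step, the snippet below builds a Nominatim `/reverse` request URL for one coordinate; the exact parameters used in the paper are not specified, so the `zoom` choice here is an assumption. (A real GET request additionally requires a valid User-Agent per Nominatim's usage policy.)

```python
from urllib.parse import urlencode

def nominatim_reverse_url(lat, lon, zoom=18):
    """Build a Nominatim reverse-geocoding request URL for one coordinate.

    A GET request to this URL returns the address hierarchy --
    neighbourhood, city, state, country -- for the point, usable
    as the textual description of an image's location.
    """
    base = "https://nominatim.openstreetmap.org/reverse"
    return base + "?" + urlencode({"lat": lat, "lon": lon,
                                   "format": "jsonv2", "zoom": zoom})

print(nominatim_reverse_url(48.8584, 2.2945))
```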
Q5: The experiments lack results for recent baselines.
Response:
Thanks for your advice. We will add the results of PIGEON to the overall results. The comparison between G3 (GPT4V) and PIGEON on IM2GPS3K and YFCC4K is shown below:
IM2GPS3K
| Model | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| PIGEON | 11.3 | 36.7 | 53.8 | 72.4 | 85.3 |
| G3(GPT4V) | 16.65 | 40.94 | 55.56 | 71.24 | 84.68 |
YFCC4K
| Model | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| PIGEON | 10.4 | 23.7 | 40.6 | 62.2 | 77.7 |
| G3(GPT4V) | 23.99 | 35.89 | 46.98 | 64.26 | 78.15 |
From the results, we find that G3 outperforms PIGEON on eight of the ten metrics across IM2GPS3K and YFCC4K. We will add PIGEON's performance to the overall results in our final version.
Q6: The paper does not provide enough details about the Geo-diversification module.
Response:
Thanks for your advice. The purpose of Geo-diversification is to generate diverse predictions from RAG prompts that carry different numbers of references, addressing the inconsistent number of effective references retrieved due to the spatial heterogeneity of images. In Geo-diversification, we combine the top S retrieved candidates with the generated candidates, where K denotes the number of RAG prompts and N the number of results generated per prompt. Therefore, Geo-diversification produces a total of S + K×N candidate predictions.
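The candidate pooling described above can be sketched as follows. This is a simplified illustration with hypothetical helper names; `generate_fn` stands in for an LMM call, which in the actual system would return N guesses per prompt.

```python
def build_candidate_pool(retrieved_coords, generate_fn, ref_counts):
    """Combine the top-S retrieved coordinates with LMM-generated ones.

    retrieved_coords : top-S retrieval results, as (lat, lon) pairs
    generate_fn      : callable(refs) -> list of N generated (lat, lon) guesses
    ref_counts       : the K different reference counts, one per RAG prompt
    """
    candidates = list(retrieved_coords)        # the S retrieved candidates
    for k in ref_counts:                       # one RAG prompt per count (K total)
        refs = retrieved_coords[:k]            # prompt carries k reference coords
        candidates.extend(generate_fn(refs))   # N generations per prompt
    return candidates                          # S + K*N candidates in total

# Toy check with a stub "LMM" that returns one guess (N = 1) per prompt
retrieved = [(48.85, 2.29), (48.86, 2.30), (48.87, 2.31)]        # S = 3
stub = lambda refs: [(sum(lat for lat, _ in refs) / len(refs), 0.0)]
pool = build_candidate_pool(retrieved, stub, ref_counts=[1, 2])  # K = 2
print(len(pool))  # 3 + 2*1 = 5
```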
Q7: Fig.5/8 lacks a comparison with the visualization results of baseline.
Response:
Thanks for your advice.
In Figure 5, the CLIP ViT on the left represents the baseline, while the right side shows the retrieval results of G3. It can be observed that CLIP ViT focuses only on the visual features of the image (such as the presence of two people), whereas G3 pays more attention to the location where the image was taken. As a result, G3 can retrieve images that are geographically closer to the query image.
In Figure 8, we aim to visually demonstrate the performance of G3 under different error bounds. We can observe that the model's predictions are more accurate when the image contains clear geographic indicators such as buildings and decorations. However, when the image is filled with elements like the ocean or sky, which lack clear geographic indicators, the prediction error is larger.
Thank you very much for your valuable reviews and the time you've invested. As the author-reviewer discussion period is coming to an end, we sincerely want to confirm whether we have addressed your concerns. If there are any points that require further clarification or additional experimental results, please do not hesitate to let us know. If you believe our response has adequately resolved the issues you raised, we kindly ask you to consider the possibility of raising the score.
This paper proposes a RAG-based framework for worldwide geo-localization. The first stage, Geo-alignment, projects input images into an embedding space aligned with GPS coordinates and text descriptions via contrastive learning. Given a new input image, the system retrieves similar images along with their GPS coordinates and text descriptions. The retrieved candidate GPS coordinates and text prompts are then fed to GPT4V with a pre-defined prompt template to generate GPS coordinates. The final stage conducts a similarity-based verification over the multi-modal representations. The method is evaluated on two worldwide geo-localization datasets, i.e., IM2GPS3k and YFCC4K, with state-of-the-art performance.
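The similarity-based verification stage summarized above can be sketched roughly as below: score each candidate coordinate by the similarity between the query image embedding and that coordinate's GPS embedding, and keep the best. This is an assumption-laden sketch, not the paper's code; `gps_encoder` stands in for the location encoder learned in geo-alignment, and `toy_encoder` is a fabricated stand-in for the toy check only.

```python
import numpy as np

def geo_verify(query_img_emb, cand_coords, gps_encoder):
    """Return the candidate whose GPS embedding best matches the query image
    embedding (all embeddings assumed L2-normalized, so dot = cosine)."""
    scores = [float(query_img_emb @ gps_encoder(lat, lon))
              for lat, lon in cand_coords]
    return cand_coords[int(np.argmax(scores))]

# Toy stand-in for the learned location encoder: direction of (lat, lon)
def toy_encoder(lat, lon):
    v = np.array([lat, lon])
    return v / np.linalg.norm(v)

query = toy_encoder(48.85, 2.29)   # pretend the image embeds near Paris
cands = [(40.71, -74.01), (48.86, 2.30), (35.68, 139.69)]
print(geo_verify(query, cands, toy_encoder))  # -> (48.86, 2.3)
```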
Strengths
- The RAG-design for geo-localization is interesting and promising.
- The writing is easy to follow.
- The performance is much better than previous methods.
- Ablation result is provided for the three stages.
- The case study and failure cases are informative and interesting.
Weaknesses
- My major concern is that the current pipeline is highly dependent on the LMM which is the powerful closed-source model, GPT-4V. This model is expensive for large-scale applications and also hard to reproduce due to unannounced updates for API across time. It would be better to provide the results with open-source large multi-modal models, for example, LLaVA. I would expect a lower accuracy with open-source models.
- The proposed MP16-Pro dataset is also claimed as a contribution, but there is no guarantee that the data will be released. Hope this will be provided in final version.
Questions
See the weaknesses.
Limitations
The limitation is included in the appendix.
W1: My major concern is that the current pipeline is highly dependent on the LMM which is the powerful closed-source model, GPT-4V. This model is expensive for large-scale applications and also hard to reproduce due to unannounced updates for API across time. It would be better to provide the results with open-source large multi-modal models, for example, LLaVA. I would expect a lower accuracy with open-source models.
Response:
Thanks for your suggestions. Please refer to the Open-source LMM (LLaVA) experiments section in the global response to see our response.
W2: The proposed MP16-Pro dataset is also claimed as a contribution, but there is no guarantee that the data will be released. Hope this will be provided in final version.
Response:
Thank you very much for your attention to the MP16-Pro dataset presented in our paper. MP16-Pro comprises image data (360GB) and metadata (700MB). Due to the file-size limits of anonymous file-sharing platforms and of the supplementary material, we were unable to release the full data during the review stage. To further support the credibility of MP16-Pro, we have uploaded 100k rows of metadata to the data folder of the anonymous repository, whose URL is given in the paper.
Thanks for the rebuttal. My concerns have been addressed and I am raising the rating to 6.
Thank you for raising the rating and for your constructive reviews. We're glad our response could address your concerns, and we appreciate your support.
Global Response
We would like to extend our sincere gratitude to all the reviewers for your valuable comments and constructive feedback on our manuscript. Your insights have been instrumental in improving the quality of our work.
We identified two common concerns raised by multiple reviewers, which we address in this global response.
Open-source LMM (LLaVA) experiments
We conduct the experiments of G3 with LLaVA (LLaVA-Next-LLaMA3-8b) on Im2GPS3K. Since Img2Loc did not specify the version of LLaVA they used, we also re-ran the experiments on Img2Loc. The results are as follows:
| Model | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| GeoCLIP | 14.11 | 34.47 | 50.65 | 69.67 | 83.82 |
| Img2Loc (LLaVA) | 10.21 | 29.06 | 39.51 | 56.36 | 71.07 |
| Img2Loc (GPT4V) | 15.34 | 39.83 | 53.59 | 69.70 | 82.78 |
| G3(LLaVA) | 14.31 | 35.87 | 49.42 | 66.93 | 81.78 |
| G3(GPT4V) | 16.65 | 40.94 | 55.56 | 71.24 | 84.68 |
We find that:
- After switching the LMM from GPT4V to LLaVA, the performance of G3 declines somewhat across the metrics but remains competitive: compared to GeoCLIP, it still performs better at the street and city levels.
- G3 (LLaVA) significantly outperforms Img2Loc (LLaVA), demonstrating the effectiveness of the proposed modules.
- Finally, comparing G3 equipped with LLaVA and GPT4V against Img2Loc equipped with the same LMMs, G3 shows more stable performance across different LMMs.
Ablation study of Geo-alignment
To verify the necessity of aligning the three modalities in geo-alignment, we conduct experiments on the following variants.
- IMG: Directly using the pretrained CLIP vision encoder as the image encoder.
- IMG+GPS: Aligning image representations with GPS representations in Geo-alignment; textual descriptions are not used.
- IMG+GPS+Text (G3): Aligning three modalities simultaneously in Geo-alignment.
| Model | Street 1km | City 25km | Region 200km | Country 750km | Continent 2500km |
|---|---|---|---|---|---|
| IMG | 15.71 | 40.64 | 54.85 | 70.80 | 84.05 |
| IMG+GPS | 16.91 | 41.41 | 55.02 | 70.94 | 84.18 |
| IMG+GPS+Text (G3) | 16.65 | 40.94 | 55.56 | 71.24 | 84.68 |
From the experimental results, we can draw the following conclusions:
- Comparing IMG+GPS+Text, IMG+GPS, and IMG, we find that adding GPS and text information both enhance the feature representation compared to using the original image information alone.
- By comparing IMG+GPS+Text with IMG+GPS, we find that IMG+GPS performs better at smaller scales, while IMG+GPS+Text performs better at larger scales. This might be because GPS is suitable for modeling variations at smaller scales, whereas text descriptions do not vary significantly at small scales and may even remain the same.
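The three-modality alignment compared in this ablation can be sketched as a CLIP-style contrastive objective between the image embeddings and each additional modality (GPS, text). The snippet below is a simplified NumPy illustration under our own assumptions (symmetric InfoNCE, temperature 0.07), not the paper's training code.

```python
import numpy as np

def info_nce(img_emb, other_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE) loss aligning image embeddings with a
    second modality (GPS or text), CLIP-style.

    Inputs are (B, D) arrays, assumed L2-normalized row-wise; matching rows
    are positives, every other pair in the batch is a negative.
    """
    logits = img_emb @ other_emb.T / temperature        # (B, B) similarities
    labels = np.arange(len(img_emb))

    def xent(l):                                        # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    return 0.5 * (xent(logits) + xent(logits.T))        # both directions

# Toy check: matched pairs give a lower loss than shuffled pairs
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x /= np.linalg.norm(x, axis=1, keepdims=True)
aligned, shuffled = info_nce(x, x), info_nce(x, x[::-1])
print(aligned < shuffled)  # True
```

In the full IMG+GPS+Text variant, one such loss term per auxiliary modality would be summed during training.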
We hope these responses address your concerns. Thank you once again for your time and effort in reviewing our manuscript.
This paper tackles the challenging problem of worldwide geo-localization by introducing a new dataset, MP16 Pro, and a novel retrieval-augmented generation framework, G3, which leverages recent LLMs to explore diverse clues such as retrieved images, GPS data, and textual information. The reviewers engaged actively in the discussion and reached a consensus with positive evaluations. The AC concurs with the reviewers' assessment that this paper addresses an important problem and offers a novel solution. The authors are encouraged to incorporate the additional feedback provided during the rebuttal and discussion period. Additionally, it is recommended that they include details on the number of trainable parameters and GFLOPs of the comparison algorithms, and ensure that the proposed dataset and code are made publicly available as promised.