PaperHub
Overall: 7.0/10 · Poster · 3 reviewers (min 7, max 7, std dev 0.0)
Scores: 7, 7, 7
Confidence: 4.0
COLM 2025

The World According to LLMs: How Geographic Origin Influences LLMs' Entity Deduction Capabilities

OpenReview · PDF
Submitted: 2025-03-21 · Updated: 2025-08-26
TL;DR

LLMs demonstrate geographic bias in entity deduction games, with notably better performance on entities from the Global North and West despite controlling for language, popularity and frequency factors.

Abstract

Keywords
geographic representation, LLM evaluation, fairness and bias, reasoning capabilities, cross-cultural NLP, interactive question answering

Reviews and Discussion

Review
Score: 7

This work proposes a 20-questions dataset on geographically specified entities, and evaluates how several LLMs perform on it along several dimensions of geographic biases, from country, to entity type, to language (7 total). The questions are selected to be balanced across countries and were verified by native speakers. They also performed a correlation study of performance to the pretraining corpora and Wikipedia pageviews.

Reasons to Accept

  • The idea of using 20 questions for geographic entities to evaluate LLMs' geographic biases is simple yet well-motivated.
  • The curation of the dataset is carefully done and balanced across countries.
  • The correlation study between performance and Wikipedia views or the pretraining corpora is well thought-out and interesting.
  • The experiments were done in 7 high-resource languages. The finding that language has a moderate impact on performance is interesting.
  • The results are presented in various granularities, such as by country, by continent, by Global North/South, by people vs. things, etc.

Reasons to Reject

  • I am not sure the claim in lines 123-126 is warranted, that this dataset avoids contamination better than other 20Q datasets. For example, LeBron James or any other famous sports star (47.9% of the people entities) are definitely commonly studied.
  • The Global North/South West/East divisions are overly simplified and inaccurate. For example, richer countries like UAE, Singapore, and Saudi Arabia are not in the Global North. And why are all Latin American countries considered part of the Global East, while they are literally in the West, and most follow Catholicism, a decidedly western religion? Perhaps for each division you should omit certain countries which do not neatly fall into a category. Even though you cite 2 works, these are from 2010, and you can fix their mistakes in your work.
  • The corpus frequency analysis only considers the English-only Dolma, while your entities are multilingual. This makes that analysis less robust. You should have used a corpus like FineWeb-2 instead.
  • RQ2 does not account for variants of the same entity. For example, "Jack Ma" could be "Ma Yun", "马云", "CEO Ma", etc.

Questions for Authors

  • I don't like the title "The World According to LLMs"; it is overly generic
  • 35-36: avoid anthropomorphizing LLMs, use "models output" and "models encode" instead
  • 29: "paradoxically" is too strong; use "however" or "interestingly"
  • Figure 1: could you also have a version of this figure which has the multilingual entities?
  • what does the + stand for in Geo20Q+? It's confusing and could suggest there are more than 20 questions
  • 49-50: should say "we evaluate an unlimited turn configuration" instead of "we introduce", since this is part of the evaluation not of the dataset
  • Figure 1: consider using an equal-area map projection, as the current one makes countries in the Global North appear larger which goes against the spirit of your findings
  • Consider adding citations: https://arxiv.org/abs/2305.14610 https://arxiv.org/abs/2205.12247
  • 174: remove this citation, seems like padding
  • 4.2: can you add more details on the processes in which native speakers confirmed the quality of translation and observed some games, in an appendix section? like how they were recruited, any observations or corrections they made
  • Table 1: add % to the success rates to make them clearer.
Comment

Thank you for your thorough feedback and interest in our work! We address your feedback below:

Feedback1: I am not sure the claim in lines 123-126 is warranted, that this dataset avoids contamination better than other 20Q datasets. For example, LeBron James or any other famous sports star (47.9% of the people entities) are definitely commonly studied.

Thank you for highlighting this concern. We agree that well-known figures in our dataset, such as LeBron James and other prominent individuals, are indeed likely to have been frequently encountered during model pre-training. Our claim of reduced contamination is relative to prior 20Q datasets, which often feature generic or extremely common categories (e.g., “car,” “fruit”) and sometimes have game-play data available online, thus leading to pretraining contamination.

While our approach cannot fully eliminate the risk of pre-training exposure, it does prioritize geographically specific and fine-grained entities to reduce overreliance on memorized facts. We will make the following revision to the claim thanks to your suggestion:

Old Text: Avoiding Contamination. Our dataset design avoids generic or trivia-style entities commonly featured in pretraining corpora and standard 20 Questions datasets. By selecting geographically specific entities, we reduce the risk of models succeeding through memorization of benchmark examples rather than through genuine deductive reasoning.

Revised Text: Avoiding Contamination. Rather than using generic, high-frequency entities (e.g., “car,” “fruit”) that are likely to be memorized during pretraining, we focus on geographically specific and fine-grained instances (e.g., “Audi,” “Alphonso mango”). This reduces the risk of models succeeding through rote recall and encourages evaluation of genuine deductive reasoning over entity-specific knowledge.

Feedback2: The corpus frequency analysis only considers the English-only Dolma, while your entities are multilingual. This makes that analysis less robust. You should have used a corpora like Fineweb-2 instead. RQ2 does not account for variants of the same entity. For example, "Jack Ma" could be "Ma Yun", "马云", "CEO Ma", etc.

We acknowledge that our frequency analysis currently does not account for variant representations of the same entity (e.g., “Jack Ma,” “Ma Yun,” “马云,” “CEO Ma”). Capturing these variants, especially across different languages and scripts, would provide a more robust estimate of an entity’s presence in pre-training data. Nevertheless, we will highlight this as an area for improvement and consider implementing more sophisticated entity normalization and matching strategies in future work.

Comment

Feedback3: The Global North/South West/East divisions are overly simplified and inaccurate. For example, richer countries like UAE, Singapore, and Saudi Arabia are not in the Global North, and all Latin American countries are considered part of the Global East, while they are literally in the West, and most follow Catholicism, a decidedly western religion.

Thank you for raising this point about the limitations and oversimplifications inherent in our division. We agree that such divisions may be imperfect and can misrepresent countries with unique economic, cultural, or geographic context, such as UAE, Singapore, and Saudi Arabia, or the treatment of Latin American countries.

To address this, we report results using an alternative division that incorporates your adjustments and find that our main conclusions remain unchanged. Nevertheless, we recognize that no single grouping will be universally accepted, and as such we will release the raw results and scripts so that the community can compute and analyze custom divisions according to their own criteria. We believe this is the most transparent and effective approach!

Here is the alternative division:

Global North:

United States, Canada, United Kingdom, France, Germany, Italy, Spain, Portugal, Netherlands, Belgium, Sweden, Norway, Finland, Denmark, Iceland, Austria, Switzerland, Luxembourg, Ireland, Australia, New Zealand, Japan, South Korea, Taiwan, Israel, Singapore, UAE, Saudi Arabia

Global South:

Mexico, Argentina, Chile, Uruguay, Brazil, Colombia, Peru, India, Pakistan, Bangladesh, Sri Lanka, Nepal, Bhutan, Indonesia, Thailand, Vietnam, Philippines, Malaysia, Myanmar, Cambodia, Laos, East Timor, Nigeria, Kenya, Ethiopia, Egypt, Ghana, Tanzania, South Africa, Zimbabwe, Angola, Morocco, Algeria, Libya, Tunisia, Republic of the Congo, Cameroon, Senegal, Uganda, Zambia, Namibia, Sudan, South Sudan, Burkina Faso, Mali, Niger, Chad, Madagascar, Malawi, Benin, Ivory Coast, Guinea, Togo, Mauritania, Sierra Leone, Lesotho, Eswatini, Central African Republic, Yemen, Syria, Palestine

Global West:

United States, Canada, United Kingdom, France, Germany, Italy, Spain, Portugal, Netherlands, Belgium, Sweden, Norway, Finland, Denmark, Iceland, Austria, Switzerland, Luxembourg, Ireland, Australia, New Zealand, Japan, South Korea, Taiwan, Israel, Mexico, Argentina, Chile, Uruguay, Brazil, Colombia, Peru

Global East:

Russia, Ukraine, Belarus, Serbia, Romania, Bulgaria, Poland, Czech Republic, Hungary, Slovakia, Slovenia, North Macedonia, Bosnia and Herzegovina, Albania, Georgia, Armenia, Azerbaijan, Kazakhstan, Uzbekistan, Turkmenistan, Kyrgyzstan, Tajikistan, Saudi Arabia, United Arab Emirates, Iran, Iraq, Turkey, Jordan, Lebanon, Oman, Qatar, Bahrain, China, North Korea, Mongolia

For the above divisions, we obtain the following results after performing Mann-Whitney U tests on the significance of the difference between the mean success rates of Global North vs. Global South and Global West vs. Global East.

Notable People:

| Turns | Model | Acc. North | Acc. South | Sig. | Acc. West | Acc. East | Sig. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Can. | GPT-4o mini | 20.9 | 7.5 | ✅*** | 17.0 | 7.0 | ✅** |
| Can. | Gemini 2.0 Flash | 30.9 | 10.2 | ✅*** | 23.4 | 10.6 | ✅*** |
| Can. | LLaMA 3.3 70B | 24.1 | 13.5 | ✅* | 25.0 | 10.2 | ✅*** |
| Unl. | GPT-4o mini | 64.6 | 55.9 |  | 66.7 | 57.4 |  |
| Unl. | Gemini 2.0 Flash | 83.8 | 73.3 | ✅* | 81.8 | 76.8 |  |
| Unl. | LLaMA 3.3 70B | 61.2 | 49.4 | ✅* | 60.0 | 37.6 | ✅*** |

Things:

| Turns | Model | Acc. North | Acc. South | Sig. | Acc. West | Acc. East | Sig. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Can. | GPT-4o mini | 5.9 | 4.5 |  | 5.7 | 3.0 |  |
| Can. | Gemini 2.0 Flash | 12.0 | 3.6 | ✅*** | 8.4 | 3.8 | ✅* |
| Can. | LLaMA 3.3 70B | 6.5 | 3.2 |  | 5.9 | 3.0 |  |
| Unl. | GPT-4o mini | 38.3 | 31.0 |  | 36.0 | 31.0 |  |
| Unl. | Gemini 2.0 Flash | 47.1 | 35.7 | ✅** | 45.8 | 35.3 | ✅** |
| Unl. | LLaMA 3.3 70B | 41.2 | 29.6 | ✅* | 42.3 | 28.5 | ✅** |

The results still support our hypothesis of better performance of models on Global North compared to the Global South and Global West compared to Global East.
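For reference, a minimal sketch of how such a per-division significance test can be computed with SciPy (the per-country success rates below are illustrative placeholders, not our actual results, which will be released with the raw data and scripts):

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Illustrative per-country success rates for one model/configuration (made-up values)
north_rates = np.array([0.31, 0.28, 0.35, 0.22, 0.30])
south_rates = np.array([0.09, 0.12, 0.07, 0.15, 0.10])

# Two-sided Mann-Whitney U test on the difference between the two groups
stat, p_value = mannwhitneyu(north_rates, south_rates, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.4f}")
```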

Comment

Q1: I don't like the title "The World According to LLMs", this is overly generic

We thank you for the feedback. Based on your suggestion, the author team is deciding among several alternative candidate titles and will adopt the most appropriate one in the revised paper. Some alternatives:

  • A Model-Initiated Evaluation Framework Reveals Geographic Biases in LLM Reasoning
  • How Geographic Origin Influences LLMs' Entity Deduction Capabilities
  • Measuring Regional Biases in LLMs through Multi-Turn Deduction Games

In case you have a preference among these, we would love to know!

Q2: what does the + stand for in Geo20Q+? It's confusing and could suggest there are more than 20 questions

The “+” in Geo20Q+ denotes our inclusion of an unlimited-turn setting, allowing for games with more than 20 questions, as used in our evaluation. We apologize for any confusion and will clarify this naming choice in the camera-ready.

Q3: Figure 1: consider using an equal-area map projection, as the current one makes countries in the Global North appear larger which goes against the spirit of your findings

That’s a great catch, thank you. In the camera-ready, we will update Figure 1 to use an equal-area projection (potentially Mollweide, via the cartopy Python library) and briefly describe the rationale in the text to align the visualization with the spirit of our findings.
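For instance, a minimal sketch of the intended projection change (assuming matplotlib and cartopy are available; plotting of the per-country values themselves is omitted here):

```python
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

# Equal-area Mollweide projection, so Global North countries are not visually inflated
ax = plt.axes(projection=ccrs.Mollweide())
ax.set_global()
ax.coastlines()
# Per-country success rates would be drawn on top, e.g. via ax.add_geometries(...)
plt.savefig("figure1_mollweide.png", dpi=300, bbox_inches="tight")
```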

Q4: Citations, framing suggestions, and clarifications in the tables

We will improve the readability of the paper by incorporating the suggested corrections, framing changes, and citations. Thanks!

Q5: can you add more details on the processes in which native speakers confirmed the quality of translation and observed some games, in an appendix section? like how they were recruited, any observations or corrections they made

As detailed in Section 4.2, for each language, at least one native or fluent speaker, recruited from our academic networks, reviewed all translations, edited prompts as needed, and observed multiple games to check for accuracy and consistency. Their review protocol included:

  • Verifying and, if needed, editing the prompt templates and translations to ensure naturalness and accuracy in each language.
  • Observing several sample games played in the target language to check the consistency of game play. We suggested that each speaker review around 3 games; however, they were free to review as many as they liked.
  • Confirming the final quality of translations and reporting any issues or corrections (e.g., overly literal translations, culturally inappropriate terminology, or ambiguous instructions).

Any issues (such as awkward phrasing or ambiguous instructions) were corrected based on their feedback. We will summarize this process, compile all the edits made to the prompts, and include representative observations in the Appendix.

Comment

Thank you for the detailed response, for considering my feedback, and for providing results for the alternative division of Global North/South East/West. Those revised titles sound reasonable to me. The argument that while this work does not entirely avoid data contamination, it does so better than related datasets, is convincing. And I re-emphasize that follow-up studies should account for variant entities, but it's fine as a limitation for now. I will keep my score of 7.

Comment

Thank you for your thoughtful and constructive feedback. In the final version, we will indeed release the raw results and scripts so that the community can determine custom geopolitical groupings according to their criteria.

-Authors

Review
Score: 7

The paper investigates implicit geographic bias in large language models (LLMs) using a 20 Questions-style game. One instance of a model plays the role of the judge (who knows the target entity), while another instance acts as the guesser, asking yes/no/maybe questions to identify the entity. The entities span culturally significant people and objects from different global regions. The authors introduce a new dataset, Geo20Q+, which is highly diverse and geographically balanced, and evaluate model performance across several axes: entity popularity, pretraining corpus frequency, language, and geopolitical groupings (Global North vs. South and West vs. East). The study compares mainly "western" LLMs in both canonical (20-turn) and unlimited-turn (max 150 turns) settings across seven languages. A key contribution lies in shifting the fairness evaluation paradigm from static prompt-response bias probing to dynamic multi-turn, model-initiated reasoning. This approach enables identification of subtle geographic disparities that may not surface in traditional single-turn Q&A evaluations. The paper is clearly written and provides sufficient methodological detail to follow the experimental design and replicate key aspects, though some elements such as regression quality and reproducibility require clarification.

Reasons to Accept

  • Presents a novel and well-executed study that extends fairness research in LLMs to an underexplored axis: geographic origin.
  • Introduces Geo20Q+, a large, diverse and highly curated dataset of notable individuals and culturally significant objects balanced across 172 countries and multiple entity types.
  • Evaluates three major LLMs (GPT-4o-mini, Gemini 2.0, and LLaMA 3) across two gameplay formats and seven languages.
  • Uses a creative and rigorous multi-turn game framework that captures model-initiated reasoning paths, offering deeper insight than static prompts.
  • Provides systematic evidence of geographic performance disparities that persist even after controlling for Wikipedia pageviews and corpus frequency.
  • The results broadly confirm and extend prior work on geographic underrepresentation, especially with respect to Global South and East entities.
  • Demonstrates that geographic disparities persist even after controlling for popularity/frequency and across languages.
  • Paper is clearly written, easy to follow, and methodologically mostly sound.

Reasons to Reject

  • Although the evaluation strategy is interesting, one could argue it is not strictly needed; standard bias evaluations may still suffice for some questions. The added value over simpler probing techniques should be better justified.
  • The paper does not mention any plans to release the Geo20Q+ dataset (maybe I missed it), which limits reproducibility and broader community utility.
  • Reproducibility concerns due to unclear temperature used during completion generation. While this may be less critical in the absence of a public dataset, it still undermines methodological clarity.
  • While the paper is promising and well written, a major issue lies in the core regression analysis. Table 3 reports multiple negative R² values which are not explained. Maybe these are adjusted R² values? The authors rely on these results to argue that entity popularity and frequency do not explain model performance disparities, yet the low or negative variance explained weakens this claim, especially without raw scatter plots or visualizations to support the interpretation.
  • The use of odds ratios is statistically valid, but not very intuitive for readers. Including raw correlations or binned visual summaries would improve clarity and interpretability.

I would adjust my rating accordingly if these issues are addressed or clarified.

Questions for Authors

  • Could you show a clear example of a deduction path that illustrates a geographic bias (e.g., Western vs. Eastern entity), to better understand how the method surfaces such disparities?

  • Was initial data collection done only in English?

  • Could you share or visualize success rate or number of turns as a function of entity popularity or frequency (e.g., scatter plots or binned averages)? This would greatly help interpret the regression analyses.

  • Could you clarify whether the reported R² values in Table 3 are adjusted?

  • Do you control for the number of guesses allowed per game as well, similar to the number of turns up to "Bingo" or max turns?

  • Do early exits or refusals to guess count as deduction failures in your success rate metric? (Apologies if this was mentioned and I missed it.)

  • What motivated the use of odds ratios to assess popularity/frequency effects? Would raw correlations or binned plots be more interpretable and transparent?

  • What’s the ratio of English to non-English data? Could this correlate better with performance than popularity?

  • Table 1: Please add units (e.g., "% success", "# turns").

  • Table 5: Please define symbols (✓, ✗, *, ***) explicitly in the caption or text.

  • Appendix A.3: it appears the prompt labels for the guesser and judge roles may be reversed.

Comment

Thank you very much for your thorough evaluation of our work. We appreciate the recognition of the novelty of our geographic fairness focus, the contribution of the Geo20Q+ dataset, and the depth of analysis enabled by our multi-turn evaluation framework. We address your concerns below:

Feedback1: Although the evaluation strategy is interesting, one could argue it is not strictly needed; standard bias evaluations may still suffice for some questions. The added value over simpler probing techniques should be better justified.

Thank you for raising this important point! Many static prompt-based evaluations only test single-turn interactions similar to training data, missing implicit process-level biases that emerge during multi-step reasoning. Our game-based approach reveals these biases, demonstrating that some reasoning asymmetries only surface through extended interactions. We will further clarify this added value in our introduction.

Feedback2: The paper does not mention any plans to release the Geo20Q+ dataset (maybe I missed it), which limits reproducibility and broader community utility.

We are indeed planning on publicly releasing the Geo20Q+ dataset upon publication of the paper for community use. Our apologies that we did not make that clear in the draft. As evidence of good faith, note that we had included the processed dataset in the supplementary materials for review.

Feedback3: Reproducibility concerns due to unclear temperature used during completion generation. While this may be less critical in the absence of a public dataset, it still undermines methodological clarity.

Thank you for this helpful feedback. Following Zhang et al. (2024), we set the temperature to 0.8 for the guesser and 0.2 for the judge. We will add this detail in the camera-ready.

Yizhe Zhang, Jiarui Lu, and Navdeep Jaitly. 2024. Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1495–1516, Bangkok, Thailand. Association for Computational Linguistics.
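As an illustration, a minimal sketch of how these per-role temperatures can be passed to a chat-completion API (the OpenAI Python client is used here only as an example; the model name and message contents are placeholders, not our actual game prompts):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Guesser: higher temperature (0.8) for more exploratory questions
guesser_reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Ask your next yes/no question."}],
    temperature=0.8,
)

# Judge: lower temperature (0.2) for consistent Yes/No/Maybe answers
judge_reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Answer Yes, No, or Maybe."}],
    temperature=0.2,
)
```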

Feedback4: While the paper is promising and well written, a major issue lies in the core regression analysis. Table 3 reports multiple negative R^2 values which are not explained. Maybe these are adjusted values? The authors rely on these results to argue that entity popularity and frequency do not explain model performance disparities, yet the low or negative variance explained weakens this claim, especially without raw scatter plots or visualizations to support the interpretation.

Thank you for this helpful suggestion, and apologies for the overly concise presentation in the original submission. The values reported in Table 3 are signed squared correlations: because each regression has a single independent variable (popularity or frequency), R² equals the square of the Pearson correlation, and we attach the sign of the correlation to convey direction as well as magnitude; in this single-predictor setting the values are also equivalent to adjusted R². Negative values therefore indicate a negative relationship between the dependent variable (number of turns taken to answer) and the independent variable (popularity/frequency), i.e., more popular entities tend to be guessed in fewer turns. We will clarify this in the revision to avoid confusion. Regarding our claim, the low R² values and odds ratios close to 1 support our interpretation that popularity and frequency only partially explain model performance, as stated in lines 258–259.
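As a small illustration of this convention, a sketch of how a signed R² can be obtained from a single-predictor fit (synthetic data; the variable names are ours, for illustration only):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-entity (log) popularity and number of turns to answer
log_popularity = rng.normal(10, 1, size=200)
turns = 30 - 2 * log_popularity + rng.normal(0, 5, size=200)

r, p = pearsonr(log_popularity, turns)
signed_r2 = np.sign(r) * r**2  # with one predictor, |signed_r2| equals the regression R^2
print(f"Pearson r = {r:.3f}, signed R^2 = {signed_r2:.3f}, p = {p:.3g}")
```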

With regards to your suggestion, we will add raw scatter plots and relevant visualizations in the final version to further support our analysis. Here is a scatter plot of success/failure against popularity score; we observe that increasing popularity has a limited effect on success: link to anonymized image

Feedback5: The use of odds ratios is statistically valid, but not very intuitive for readers. Including raw correlations or binned visual summaries would improve clarity and interpretability.

Thank you for this helpful suggestion. We agree that including raw correlations and binned visual summaries alongside odds ratios would improve clarity and reader intuition. We will incorporate these additional visualizations and descriptive statistics in the camera-ready to enhance interpretability.
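For completeness, a minimal sketch of how an odds ratio per unit of (log) popularity can be obtained from a logistic regression (statsmodels; synthetic data for illustration only, not our results):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
log_popularity = rng.normal(10, 1, size=500)
# Synthetic success indicator, weakly dependent on popularity
success = (rng.random(500) < 1 / (1 + np.exp(-0.2 * (log_popularity - 10)))).astype(int)

X = sm.add_constant(log_popularity)
fit = sm.Logit(success, X).fit(disp=0)
odds_ratio = np.exp(fit.params[1])  # odds ratio per unit increase in log popularity
print(f"Odds ratio = {odds_ratio:.3f}")
```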

Comment

Q1: Could you show a clear example of a deduction path that illustrates a geographic bias (e.g., Western vs. Eastern entity), to better understand how the method surfaces such disparities?

Absolutely. The following example shows the deduction path for the target entity A. P. J. Abdul Kalam (an Indian president, poet and scientist). Despite the answer being an Indian figure, the model’s reasoning chain systematically eliminates Western options before considering Asian and Indian possibilities:

Turn 9, [Judge]: Yes.
Turn 10, [Guesser]: Is the poet American?
Turn 10, [Judge]: No.
Turn 11, [Guesser]: Is the poet British?
Turn 11, [Judge]: No.
Turn 12, [Guesser]: Is the poet from France?
Turn 12, [Judge]: No.
Turn 13, [Guesser]: Is the poet from Latin America?
Turn 13, [Judge]: No.
Turn 14, [Guesser]: Is the poet from Asia?
Turn 14, [Judge]: Yes.
Turn 15, [Guesser]: Is the poet Japanese?
Turn 15, [Judge]: No.
Turn 16, [Guesser]: Is the poet from India?
Turn 16, [Judge]: Yes.
Turn 17, [Guesser]: Is the poet Rabindranath Tagore?
Turn 17, [Judge]: No.
Turn 18, [Guesser]: Is the poet Rumi?
Turn 18, [Judge]: No.

We will include several qualitative examples like this in the camera-ready (1 in the main paper and the rest in the appendix) to provide readers with a clearer view into how such biases emerge in model reasoning paths.

Q2: Was initial data collection done only in English? What’s the ratio of English to non-English data? Could this correlate better with performance than popularity?

Thank you for raising this question. Initial data collection and entity selection were performed exclusively in English, as both Wikipedia and the derived notability datasets provide the most comprehensive and consistent coverage in this language. No non-English versions were referenced during data collection; however, game play spans 7 languages. As a result, the dataset is entirely English-based, which could correlate with model performance independently of popularity or pageviews. We will note this as a limitation of our work: non-canonical variants of the same entity are not accounted for in our popularity/frequency computation.

Q3: Do you control for number of guesses allowed per game as well? similar to number of turns up to "Bingo" or max turns.

In the current setup, we do not limit the number of guesses allowed per game, so as to allow for the full range of model behaviours. One example would be allowing the guesser to reduce the search space by asking questions (“Is it man-made?”), making some guesses (“Is it the Eiffel Tower?”), and then searching again to increase the chances of making a correct guess later. Limiting the number of guesses would make this a very interesting problem, adding another layer of optimization/planning for the model, especially in an RL setting. We thank you for this suggestion and will list it as a potential future direction for this work.

Q4: Do early exits or refusals to guess count as deduction failures in your success rate metric? (Apologies if this was mentioned and I missed it.).

No worries! Yes, if the game involves early exits or refusals to guess, we count it as a failure.

Q5: Suggestions for Table 1 and 5:

We will include the units and correct symbol definitions in all the tables to maintain consistency and improve readability.

Q6: Appendix A.3: it appears the prompt labels for the guesser and judge roles may be reversed.

The prompt labels in Appendix A.3 appear to be correct for the guesser and judge roles: the guesser is prompted to ask questions, and the judge is prompted to answer Yes/No/Maybe. Could you kindly confirm this with us? We are happy to clarify if there is any confusion.

Comment

Thank you for providing a concrete example of geographic bias. I appreciate the clarifications on the regression analysis and dataset release plans. I have increased my score from 6 to 7. Good luck with the paper.

Comment

Thanks a lot! We appreciate your engagement with our clarifications, and thank you for raising the score.

-Authors

Review
Score: 7

This paper investigates the potential systematic geographic disparities or biases encoded in large language models (LLMs). To do so, the authors introduce a new dataset named Geo20Q+, which includes two broad categories: Things and Notable People. They then evaluate several popular LLMs under two gameplay configurations: the canonical 20-question format and an unlimited-turns variant. Experiments are conducted in multiple high-resource languages to eliminate the confounding variable of input/output language. The results show that LLMs are generally better at guessing entities from the Global North than the Global South, and from the Global West than the Global East. Additionally, the authors examine whether such disparities are correlated with the popularity of the entity (measured via Wikipedia pageviews and entity frequency), which appears to have minimal impact.

Reasons to Accept

  • The paper is generally well-written and enjoyable to read.

  • The newly introduced dataset seems useful and could serve as a complementary evaluation format to standard prompt-based assessments of LLM knowledge.

  • The evaluation is extensive and interesting, though I have some concerns (see below).

Reasons to Reject

  • I have concerns regarding the two-role game design. There may be instances where the judge provides an incorrect answer during the conversation -- either due to insufficient knowledge of the entity or because the guesser's question is ambiguous, yet the judge is still required to respond with a "yes" or "no." In such cases, the guesser might never arrive at the correct answer.

  • Although the authors conduct extensive experiments, they do not provide explanations for why LLMs display such disparities or biases. This is a crucial aspect of the paper. While the results suggest that entity popularity plays a minimal role, the authors are encouraged to propose some hypotheses, perhaps in the limitations or future work sections.

Questions for Authors

  • In lines 121–122, the authors mention that they retain only those entities for which all models agree on a primary country. How many entities were filtered out at this stage?

  • To address the first weakness, would it be possible to first investigate whether the model actually possesses knowledge of the entity through a separate prompt-based evaluation?

  • In lines 173–175, the phrase "span multiple language families and writing systems" appears twice -- it’s redundant.

  • There is also inherent randomness in the game -- the guesser may not always use the same strategy, potentially leading to different success rates or numbers of turns. Perhaps I missed it, but I didn’t see how the authors account for this randomness. Are multiple runs conducted for each entity?

  • The paper doesn’t clearly describe how frequency is computed. I assume the authors use WiMBD to calculate how many documents contain the string representing the entity -- could this be clarified?

Comment

Thanks for your feedback! We are glad that you found the paper well-written, the newly introduced dataset useful as a complementary evaluation resource, and the evaluation methodology extensive and rigorous. We address your key comments below:

Feedback1: I have concerns regarding the two-role game design. There may be instances where the judge provides an incorrect answer during the conversation -- either due to insufficient knowledge of the entity or because the guesser's question is ambiguous, yet the judge is still required to respond with a "yes" or "no." In such cases, the guesser might never arrive at the correct answer.

We agree that our interactive framework cannot eliminate all sources of judgment error (Section 7, Limitations); however, we believe the following steps we took may address your concern:

  • All entities in our evaluation are based on Wikipedia articles, a source that is extensively used in the pre-training corpora of all evaluated LLMs. This design choice mitigates the risk that the judge (or guesser) completely lacks knowledge about a given entity, as these entities are highly likely to be present in the models' knowledge base.
  • Using the same model for both roles ensures that while there may be insufficient knowledge of the entity, both instances have the same knowledge cutoffs.
  • To minimize errors due to insufficient knowledge, we filtered entities to ensure that all models agreed on a primary country attribution (see Section 3.3), which reduces cases where the judge lacks key knowledge of the entity.
  • For ambiguous questions, the judge is permitted to reply with "Maybe" (Appendix A.3), which serves as a soft signal of uncertainty rather than misleading the guesser. On average, “Maybe” occurs infrequently, only in about 1.8% of the turns.

Feedback2: Although the authors conduct extensive experiments, they do not provide explanations for why LLMs display such disparities or biases. This is a crucial aspect of the paper. While the results suggest that entity popularity plays a minimal role, the authors are encouraged to propose some hypotheses, perhaps in the limitations or future work sections.

Our primary focus in the paper was to find a novel evaluation strategy to effectively reveal implicit geographic biases in models. As for why exactly they display such behavior, we did in fact test a particular hypothesis (RQ2 and Table 2) and demonstrated that these biases are only partially explained by simple pretraining data coverage, suggesting that these behaviours are driven by a multitude of complex factors.

We therefore fully agree with you that investigating these potential factors would provide important insights, and we will propose the following experiments as future work in the revised version:

  • Experiment: Analyze model attention and reasoning patterns during deduction to determine whether certain activation paths and model regions are systematically favored at various decision points, using interpretability techniques such as Patchscopes (Ghandeharioun et al., 2024).
  • Experiment: Systematically evaluate the impact of different tuning and alignment techniques (e.g., instruction tuning, reinforcement learning) on geographic bias by retraining models with regionally balanced data and observing shifts in deduction performance.

These are potential directions that we believe will help disentangle the contributions of data coverage, model architecture, and training dynamics to the observed biases. We also welcome any suggestions from you for future experiments that will help towards this goal.

Ghandeharioun, Asma, et al. "Patchscopes: A unifying framework for inspecting hidden representations of language models." arXiv preprint arXiv:2401.06102 (2024).

Q1: In lines 121–122, the authors mention that they retain only those entities for which all models agree on a primary country. How many entities were filtered out at this stage?

Thank you for raising this clarification. At the disambiguation stage, we used an iterative sampling process for both “Things” and “Notable people,” with minor differences in sourcing. For “Things,” we scraped country-specific entities from Wikipedia and only retained those with unanimous model agreement on the primary country; otherwise, we resampled until each country had 50 valid entities. For “Notable people,” we drew from a global notability dataset, again requiring full agreement and resampling as needed. This approach ensured all included entities had consistent country assignments. Approximately 12% of candidates were excluded for ambiguity or disagreement, resulting in final totals of 8,375 “Things” and 24,049 “Notable people” (see Section 3.3). We will clarify the exact filtering statistics in the camera-ready.
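As a small illustration of this retention rule, a sketch of the unanimous-agreement filter (entity names and country votes below are made up for the example):

```python
# Each entity maps to the primary country attributed by each model (illustrative values).
country_votes = {
    "Alphonso mango": {"gpt-4o-mini": "India", "gemini-2.0-flash": "India", "llama-3.3-70b": "India"},
    "Danube": {"gpt-4o-mini": "Austria", "gemini-2.0-flash": "Hungary", "llama-3.3-70b": "Romania"},
}

# Retain only entities on which all models agree; disagreeing ones are resampled.
retained = [entity for entity, votes in country_votes.items() if len(set(votes.values())) == 1]
print(retained)  # -> ['Alphonso mango']
```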

Comment

Q2: To address the first weakness, would it be possible to first investigate whether the model actually possesses knowledge of the entity through a separate prompt-based evaluation.

Great point, thank you for the suggestion. We tried a simple prompt ("Who is {Notable person}?" or "Tell me about {Thing}") on 20 randomly sampled entities and manually inspected the outputs, finding that all 3 models were 100% correct (accurate in their descriptions of the entities). Note that since all entities are sourced from popular Wikipedia pages, which are widely used in LLM pre-training, the risk of either model role lacking knowledge is minimal, as seen above. We hope the explanations above address your concerns.

Q3: There is also inherent randomness in the game -- the guesser may not always use the same strategy, potentially leading to different success rates or numbers of turns. Perhaps I missed it, but I didn’t see how the authors account for this randomness. Are multiple runs conducted for each entity?

Our experiments are conducted with a single run per entity for each model-language-configuration combination, which results in a total of 50,400+ games (3 models × 1,200 entities × 7 languages × 2 configurations for RQ1-3). Conducting multiple runs per entity would therefore incur non-trivial computation costs.

However, based on your suggestion, we ran a small-scale study with multiple independent runs per entity per model (50 entities with 5 runs per entity, notable persons, Gemini 1.5 Flash). Here are the results (mean ± standard deviation) across these runs to better account for stochasticity in model behaviour:

| Continent | Canonical: Success | Canonical: Turns to Answer | Unlimited: Success | Unlimited: Turns to Answer |
| --- | --- | --- | --- | --- |
| Africa | 0.22 ± 0.04 | 13.17 ± 2.87 | 0.50 ± 0.10 | 23.20 ± 3.24 |
| Asia | 0.14 ± 0.09 | 11.90 ± 7.02 | 0.40 ± 0.19 | 26.95 ± 11.52 |
| Australia | 0.42 ± 0.16 | 11.68 ± 1.48 | 0.64 ± 0.13 | 24.45 ± 10.32 |
| Europe | 0.40 ± 0.19 | 11.61 ± 0.85 | 0.62 ± 0.13 | 18.82 ± 3.65 |
| North America | 0.32 ± 0.11 | 9.89 ± 2.02 | 0.54 ± 0.05 | 19.46 ± 3.96 |

Overall, we see that the aggregate level patterns and behaviours remain the same. We will include the results of this small-scale study in the appendix.
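A minimal sketch of one way such per-continent mean ± standard deviation values across runs can be computed (the records below are placeholders, not our data):

```python
import pandas as pd

# Illustrative per-game records (continent, run index, success flag, turns taken)
games = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Africa", "Asia", "Asia", "Asia", "Asia"],
    "run":       [1, 1, 2, 2, 1, 1, 2, 2],
    "success":   [1, 0, 0, 0, 0, 1, 0, 0],
    "turns":     [12, 20, 15, 18, 25, 9, 30, 22],
})

# Per-run success rate and mean turns for each continent
per_run = games.groupby(["continent", "run"], as_index=False).agg(
    success_rate=("success", "mean"), mean_turns=("turns", "mean"))

# Mean and standard deviation across runs
summary = per_run.groupby("continent")[["success_rate", "mean_turns"]].agg(["mean", "std"])
print(summary)
```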

Q4: The paper doesn’t clearly describe how frequency is computed. I assume the authors use WiMBD to calculate how many documents contain the string representing the entity -- could this be clarified?

To estimate corpus frequency, we used the WiMBD project to query the Dolma corpus, which we believe approximates the data sources used for pre-training commercial LLMs. For each entity, we count the number of documents in Dolma that contain at least one exact string match to the entity name. This document-level frequency serves as a proxy for how often a model would have encountered that entity during pre-training. We will clarify this implementation in Section 4.4 of the camera-ready.
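Conceptually, the document-level count can be sketched as follows (this is not the actual WiMBD tooling, which queries a search index built over Dolma; the documents and entity below are illustrative):

```python
from typing import Iterable

def document_frequency(entity: str, documents: Iterable[str]) -> int:
    """Count documents containing at least one exact string match of the entity name."""
    return sum(1 for doc in documents if entity in doc)

docs = [
    "Audi unveiled a new electric model this week.",
    "Mango season has started across South Asia.",
    "The Alphonso mango is prized for its flavour.",
]
print(document_frequency("Alphonso mango", docs))  # -> 1
```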

Q5: Typos and corrections:

We will make the necessary corrections to the paper. Thanks a lot for your careful attention.

Comment

Dear Authors,

Thank you for your nice response and new results. My concerns are properly addressed, so I increased my score from 6 to 7.

Good luck with this submission!

Comment

It's our pleasure! Thank you very much for your great feedback and for reconsidering your score. We're grateful for the time and effort you invested in reviewing our work!

-Authors

Final Decision

The paper introduces Geo20Q+, a novel curated dataset of geographically specific entities in the 20-questions game format, to evaluate LLMs for the potential of systematic geographic disparities. Geo20Q+ comes in seven high-resource languages (English, Hindi, Mandarin, Japanese, French, Spanish, and Turkish) and allows for free-form evaluation, as LLMs are instructed to proactively ask questions and guess the target entity (thing or notable people).

The authors provided convincing additional information during the author rebuttal. The reviewers unanimously agree that this is a good paper which should be accepted. I second this. This is a good paper, which introduces a new evaluation protocol and interesting dataset to test LLMs for systematic geographic biases.

I urge the authors to check the font/template in future submissions and to fix it for the camera-ready, as it deviates from the COLM template; this paper was flagged for this issue but was retained and sent for review.