PaperHub
7.3 / 10 (average rating)
Poster · 4 reviewers
Ratings: 7, 7, 7, 8 (min 7, max 8, std 0.4)
Average confidence: 4.3
TL;DR

This work assesses how large language models represent subjective global perspectives, revealing biases and the effects of prompting and translation on model-human opinion similarity across countries.

Abstract

Keywords

opinions on global issues, cultural representation, subjectivity in LLMs, evaluation, societal impact

Reviews and Discussion

Review
Rating: 7

NOTE: I am very familiar with the preprint version of this paper released in June 2023. The preprint is well-known. I have not had any direct contact with the authors of this paper.

This paper introduces the GlobalOpinionQA dataset for measuring the representation of subjective global opinions in language models. The dataset was a seminal contribution to the growing literature on pluralistic alignment when it first came out in June 2023, marking a shift from concerns around national (US-centric) representation towards global representation. The dataset and corresponding experiments are designed with care and described in detail. Reviewing the paper on its own terms in April 2024, however, I believe there are two missed opportunities: 1) the paper should engage with critical literature around the use of multiple-choice / survey questions for language model evaluation which has been published since June 2023, and 2) the paper should consider reporting reproducible results for at least one open language model. I will gladly raise my review score if the authors commit to one or both of these points.

Q: What is your overall opinion on the paper? A: Positive, but I would like to see the two points mentioned above addressed or at least acknowledged in the camera-ready version.

Reasons to Accept

The GlobalOpinionQA dataset was a seminal contribution when it was first released, and it remains one of the few big datasets for measuring representation of subjective global opinions in language models.

The different experimental setups are carefully designed and described in detail. The results are interesting, and their description and interpretation are nuanced.

The paper is clearly structured and well-written. The main body is self-sufficient despite relegating many details to the appendix.

Reasons to Reject

My impression is that this paper largely presents a reformatted version of the June 2023 preprint (see notes above). Submitting the paper to COLM nearly a year later, I believe there are two big missed opportunities:

First, the paper does not engage with much of the relevant literature published since the first preprint came out. For example, several works have challenged the use of multiple-choice questions for evaluating LLMs (e.g. here) and more specifically the practice of basing evaluations on token probabilities (e.g. here, here). There is also more conceptual discussion around the use of “tidy” multiple-choice surveys in contrast to messy real-world use of LLMs (e.g. here). To some extent, these works are directly motivated by the June 2023 preprint of this paper. I appreciate that some of these works are quite recent, but it would be great to have more discussion of limitations around the multiple-choice format.

Second, the paper only presents results for one proprietary language model, about which very little detail is publicly available. The main contribution of the June 2023 (industry) preprint clearly was the dataset/method. Now, submitting this paper to an academic conference, there is a clear missed opportunity in not at least evaluating one open model. The results presented here are not reproducible, and cannot easily be interrogated by anyone outside the authors’ organisation. In fact, even the authors’ organisation has published new model versions since, which may make the proprietary results presented here obsolete.

Minor notes & formatting

  • Most in-text citations throughout the paper are not correctly formatted, missing parentheses. For examples see the first sentence of the Intro: a correct format would be “(Bommasani et al., 2021; Brown et al., 2020; …)”.
  • Relatedly, subsequent citations should be ordered by year of publication, in ascending order -> Brown 2020, Bommasani 2021, etc.
  • Please check the COLM formatting guidelines in the COLM latex template. Section and subsection titles, for example, should not be fully capitalised.
  • I would consider prepending “GlobalOpinionQA” to the title of the paper. This will make it easier for people to find the paper reference (saying this as someone who always forgets the name of this paper when looking for the dataset).
  • Footnote 9 arguably deanonymizes the authors / their institution
Author Response

We thank the reviewer for the insightful review. We appreciate the reviewer's view of our GlobalOpinionQA dataset as a seminal contribution to the literature on pluralistic alignment. We agree with the suggestion to include more recent literature. We will add these papers to our discussion of related work, including those highlighting the challenges associated with using multiple-choice questions in LLM evaluations. We appreciate the reviewer's point about presenting results for only one model. While we unfortunately could not run evaluations on open-source models due to certain limitations, we fully agree that this is critical. Our primary goal was to open-source the benchmark itself, hoping it would enable the community to conduct these evaluations across a wide range of open models. We will highlight the need for evaluations on open models in our analysis section. We will fix the formatting issues in our final version. We again thank the reviewer for the constructive review.

Comment

Thank you for engaging with my review. I appreciate the commitment to discussing more recent related work. I maintain that this would be a better paper if these "certain limitations" did not prevent you from testing other (open) models, but I understand that this is the world that we live in, and that you would probably like to include other models yourself if you could. The paper should be published at COLM, but I will stick with my score of 7 just to indicate my reservations about the irresolvable reproducibility issue.

Review
Rating: 7

The paper compares human values to LLMs' values across cultures and languages. It does so via a multiple-choice questionnaire, additional prompting strategies, and changing of languages. In summary, this paper presents interesting results but could be improved by making the experimental design more thorough and by sharpening the story and take-home message of the paper.

Reasons to Accept

  • Interesting findings, such as that simply changing the language of the questionnaire does not necessarily make an LLM's responses more aligned with the culture in which that language is spoken.
  • The discussion of how cross-national prompting could lead to stereotypes being expressed was also insightful.

Reasons to Reject

  • I have some concerns regarding the experimental setup:
    • First, I worry that a forced-choice setup might not be the most suitable way of evaluating LLMs' values, since such a setup has also been criticized in previous work (see: Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models). Using a complementary, open-ended question evaluation might provide a better picture of LLMs' values.
  • Further discussion on how to prevent stereotyping, and on possible prompting directions for doing so, could be interesting.
  • In general, I was missing a clear takeaway message on what to take from these results and what they are useful for.

Questions for Authors

  • Typos:
    • “The majority of the pre-training data IS in English”
    • Citations that don’t use the citation as a syntactic argument should have brackets around them.
    • Footnote 9 has a missing closing bracket
Author Response

We thank the reviewer for the insightful review. While a multiple-choice setup may not be the most accurate way to evaluate an LLM's values, we believe that it still provides valuable insights, especially for a large-scale study like ours. Analyzing open-ended model generations in an automated way is challenging. Therefore, we tried to complement our quantitative analysis with qualitative analysis of the model's open-ended generations (examples are provided in the appendix). We will make the limitations associated with the multiple-choice format clearer by adding the relevant sources in the final version. We agree that mitigating existing biases is crucial. While our current study focused on establishing a benchmark and quantifying the issue, developing effective mitigation strategies is an essential next step. In the final version, we will add a discussion of promising directions to mitigate these biases, such as debiasing techniques and prompting strategies. The main takeaways of our work are as follows:

  • Understanding cultural representation in language models is important. As these models become more prevalent, their biases can have significant societal implications.
  • We need methods to measure cultural representations. We believe that we cannot mitigate issues that we cannot measure. Our benchmark and proposed methods are steps in this direction.
  • Our findings reveal concerning biases—models often align more with Western perspectives and can propagate harmful stereotypes when prompted to adopt different cultural viewpoints. This highlights the importance of developing effective mitigation strategies.
Comment

Thank you for your response, I have adjusted my score and hope that you will add the promised discussion to the camera ready!

Review
Rating: 7

In this work the authors evaluate an LLM on two sets of subjective multiple-choice questions that social scientists have used to measure differences between countries. The LLM's answers turn out to be more similar to those of the USA and some other countries, and less similar to e.g. China. Asking the model about the likely responses of people from country X yields the expected results, whereas translating the question doesn't necessarily invoke another identity in the model.

Reasons to Accept

The paper is well written and easy to follow. All steps and design decisions are explained in great detail.

The investigated question is timely and relevant, and the model used is one of the best currently available, and thus a good candidate for evaluation.

Most limitations of the evaluation are already mentioned in the text and footnotes.

The paper includes a large number of useful references and comprehensive citations.

Given the detailed explanations and examples in the appendix, the work is highly reproducible and could become a standard evaluation to assess biases / priors of models.

Reasons to Reject

While very interesting, the contribution is limited. As far as I can tell, the authors joined two sets of available questions and coin a new name (GlobalOpinionQA) for that concatenation. The evaluation would have worked equally well with numbers on the Global Attitudes survey and the World Values Survey presented separately. The proposed metric is just the Jensen-Shannon distance between the model predictions and the human distribution, and the authors themselves claim that any other metric could be used as well.
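For concreteness, here is a minimal sketch of the kind of similarity computation being discussed, assuming per-question answer distributions for the model and for a country's survey respondents (the function name and data layout are illustrative, not the paper's released code):

```python
# Minimal sketch (not the authors' code): similarity = 1 minus the Jensen-Shannon
# distance between the model's answer distribution and a country's aggregated
# human answer distribution, averaged over all questions.
import numpy as np
from scipy.spatial.distance import jensenshannon

def similarity(model_dists, country_dists):
    """Return 1 - mean Jensen-Shannon distance over questions.

    Each argument is a list with one probability vector per question; every
    vector sums to 1 over that question's answer options.
    """
    distances = [
        jensenshannon(np.asarray(m, dtype=float), np.asarray(c, dtype=float), base=2)
        for m, c in zip(model_dists, country_dists)
    ]
    return 1.0 - float(np.mean(distances))

# Toy example: two questions with 3 and 2 answer options, respectively.
model_answers = [[0.7, 0.2, 0.1], [0.5, 0.5]]
usa_answers = [[0.6, 0.3, 0.1], [0.55, 0.45]]
print(similarity(model_answers, usa_answers))  # closer to 1.0 means more similar
```

Any other divergence between distributions could be swapped in for `jensenshannon` without changing the rest of the setup, which is presumably what the authors mean by the metric being interchangeable.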

Some results could benefit from more in-depth discussion. The appendix mentions automatically induced topics for the questions - it would be interesting to see if biases are stronger for some dimensions than others. Can socio-economic dependencies be included in addition to language or nationality? How do the results differ between pretraining, RLHF etc?

Questions for Authors

The legends in Fig 2-4 are too small to read.

In Section 2.1 3), $n$ is used twice where it should be $\frac{1}{|Q|} \sum_{q \in Q}$.
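For reference, the corrected expression presumably reads as follows (an assumed reconstruction of the averaged similarity in Section 2.1, with $P_m(\cdot \mid q)$ and $P_c(\cdot \mid q)$ denoting the model's and country $c$'s answer distributions for question $q$; the notation is illustrative, not copied from the paper):

```latex
S_{m,c} \;=\; 1 \;-\; \frac{1}{|Q|} \sum_{q \in Q} \mathrm{JSD}\!\left( P_m(\cdot \mid q) \,\big\|\, P_c(\cdot \mid q) \right)
```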

For the human results - is there demographic information available?

In Appendix A.1 / B.1, should the prompt read "If /I/ had to select one of the options…" ?

In the paragraph "Analysis of Linguistic Prompting" %57 should be 57%.

At the end of Section 3, the authors express disappointment that purely linguistic cues do little to change the predictions. One could argue that this shows language independence or consistency of outputs across languages which would be a desirable property.

I found the term "cross-national prompting" confusing. What about default-prompting vs. country-prompting vs. language prompting?

Author Response

We thank the reviewer for the insightful review. While the questions are sourced from the Global Attitudes Survey and World Values Survey, our work proposes novel methods that use these questions to measure the cultural values of language models. To the best of our knowledge, our GlobalOpinionQA is one of the few resources to study cultural representations in models. In our data release, we will provide questions from both sources separately, allowing researchers to analyze each dataset on its own. Unfortunately, we only have access to aggregated selections for each country, not individual demographic data. It would be great follow-up work to collect and integrate demographic information to enable more granular analyses of cultural biases. We appreciate the reviewer's comment about language independence and consistency. It raises an interesting research question: how much should these models adapt their responses to different languages? While language consistency could be desirable in some contexts (e.g., maintaining factual accuracy), it may be less so in others (e.g., cultural sensitivity). Our paper doesn't take a stance but offers an empirical analysis: linguistic cues alone may not be sufficient for representing different cultural identities. Regarding model training stages, we did analyze the effect of RLHF step sizes on our results but didn't find a major impact.

Comment

Thanks for addressing my concerns in your rebuttal!

I'm fine with the approach of not taking a stance regarding how models should behave when prompted in different languages, and I agree it is worth pointing out that this doesn't directly lead to simulating a (potentially stereotypical) stance of a set of speakers.

I stand by my initial rating and think this paper should be published.

Review
Rating: 8

This paper studies which population of people LLMs' perspectives on societal issues are most aligned with. They combine Pew Research's Global Attitudes survey and the World Values Survey to create a dataset of 2.5k questions that cover a variety of topics. They then evaluate a series of prompting strategies and compare how well the LLM aligns with people from different regions and countries. Their results show that by default LLMs are more aligned with Western countries, and instructing the model to answer from the perspective of a certain country's population can propagate harmful biases. The authors plan to release the dataset after peer review.

Reasons to Accept

  • The study of cultural bias in LLMs is an important topic and this paper explores it from different aspects.
  • The dataset of questions and their translations would make for a useful contribution to the community.
  • The experimental setup and prompts are documented in detail in the appendix which improves this paper’s reproducibility.
  • The paper is well written and all the figures and tables are of high quality.

Reasons to Reject

  • The authors only test their approach on a single model which makes it hard to understand how the results generalize to other LLM families and models of different sizes.
  • The authors do not include Hofstede’s Cultural Dimensions survey in their dataset, which is an important benchmark for studying cultural alignment.
  • The authors do not try combining cross-national and linguistic prompting, so that the question is asked in a more natural manner when trying to align the LLM with a certain community. Combining this with the phrasing suggested below would be even better.
  • The paper does not present any mitigation strategies for the harmful stereotypes presented by the LLM.
  • While the world maps look nice, it would be more useful to include plots which more accurately present the numerical results along with error bars.
  • Multiple aspects of this work have already been explored before but the authors fail to cite the following relevant papers:
    • Naous, T., Ryan, M. J., & Xu, W. (2023). Having Beer After Prayer? Measuring Cultural Bias in Large Language Models. arXiv preprint arXiv:2305.14456.
    • Li, B., Haider, S., & Callison-Burch, C. (2023). This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models. arXiv preprint arXiv:2305.14610.
    • Jakobsen, T. S. T., Cabello, L., & Søgaard, A. (2023). Being Right for Whose Right Reasons?. arXiv preprint arXiv:2306.00639.

Questions for Authors

  • What is the reason behind the specific phrasing of the cross-national prompt? A more useful approach might be to instruct the model to inhabit the persona itself instead of answering from a different perspective in the third person. Instead of “How would someone from Russia answer the following question:”, we might simply say “You are a resident of Russia.” Did the authors try anything like this?
  • Did the authors try varying the amount of explanation the LLM is allowed to provide with its responses? Does only allowing it to pick its choice or being asked to give a more detailed reason change its performance?
Author Response

We thank the reviewer for the insightful review. Regarding the recommended related work, we agree that these are very relevant papers and will add discussion of them in the final version. The reviewer raises a good point that testing on only a single model limits the generalizability of the results. The main goal of our work was to provide an initial benchmark dataset and some starting points for experiments on the cultural representation of language models (e.g., cross-national prompting, linguistic prompting). While covering more variations in prompts, model families, and sizes was out of scope for this work given our constraints, we agree that these are important directions for follow-up work. We tried to cover as much ground as possible by including experiments on four languages and several countries. Regarding Hofstede's Cultural Dimensions, we believe our benchmark is quite different. Rather than specifying cultural values and assessing the similarity with particular values explicitly, our benchmark measures alignment between model outputs and human responses across thousands of questions for different countries. We believe both approaches can offer complementary insights. Developing effective mitigation strategies for the harmful biases and stereotypes we observed is a crucial direction for future work. While we considered it out of scope for this study, we agree that it is an essential next step. Promising directions could include exploring debiasing techniques, alternative prompting strategies, or other interventions to reduce problematic model behaviors. We will update our discussion to emphasize the importance of this direction more clearly. Regarding the phrasing of our cross-national prompt, our primary goal was to see how the model would characterize the views of people from various nations, which is why we used prompts like "How would someone from Russia answer the following question?". However, we agree that a role-playing experiment, where the model is instructed to simulate a persona, could also be interesting. It would be a good follow-up study to compare our prompt with role-playing prompts to see if they elicit similar cultural perspectives.

Comment

Thank you for your response to my feedback, I look forward to seeing this paper published.

Final Decision

This paper introduces the GlobalOpinionQA dataset, which covers subjective opinions from different countries on a range of topics. The results reveal Western-centric cultural biases, and the finding that cross-national prompting can lead to harmful stereotypes particularly highlights the need for further work accounting for cultural variation.

Four reviewers provided insightful comments on the paper, and the authors and reviewers have engaged in thorough discussions during the rebuttal period. Reviewers are in unanimous agreement that the paper is clearly written and makes a valuable contribution to our understanding of cultural biases in language models. I agree with reviewers that evaluating on a single proprietary model limits the generality of the findings and reporting results for at least one open model would significantly improve reproducibility. Reviewers also suggest incorporating important points of engagement with recent work, including critiques of multiple-choice evaluation methods.

Despite these concerns, this paper is a clear accept; I am confident the authors will make the suggested revisions and the GlobalOpinionQA dataset will spur significant follow-up work.