PaperHub
Overall rating: 7.0/10 (Oral; 4 reviewers; min 6, max 8, std 0.7)
Individual ratings: 7, 6, 8, 7
Average confidence: 3.8
Correctness: 3.0 · Contribution: 2.5 · Presentation: 2.8
NeurIPS 2024

Questioning the Survey Responses of Large Language Models

OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
large language models, surveys

Reviews and Discussion

Review (Rating: 7)

The paper evaluates 42 different language models on the American Community Survey, highlighting how model responses are governed by ordering and labeling biases and, more generally, that any apparent correlation with specific demographic subgroups arises because those subgroups' aggregate statistics are closest to uniform.

Strengths

The study in the paper is well designed and well discussed, covering a very large variety of LLMs and, in particular, both base and instruction-tuned models. The presented results are very relevant for any study that examines LLMs to understand their underlying characteristics and, most importantly, for researchers who plan to use LLMs as a proxy to study human subgroups. The strong A-bias shown in the paper is a very important takeaway.

Weaknesses

There are two main weaknesses that I have encountered while reading the paper, namely:

  1. It is not clear what the authors would expect the models to do based on their training data. It is true that we do not know exactly what these models have been trained on, but if we consider, for instance, Common Crawl as the main corpus, would we expect models trained on it to somehow produce answers correlating with certain subgroups? I think the paper needs a larger discussion of what we should reasonably expect the models' answers to be.

  2. Given the emphasis on subgroups, I would have expected the authors to explore impersonation as a way of seeing whether it would steer survey responses in the direction of what we would expect those subgroups' typical responses to be. This would clarify whether LLMs have the ability to produce answers relevant to specific subgroups when instructed to do so.

Questions

Do humans exhibit A-bias as well?

Have you considered exploring whether RLHF introduces a slight steer of the model towards certain subgroups (reflected in the way people give feedback) that was not present in the base model?

Limitations

I think the authors have addressed the main limitation of this work (namely the focus on a U.S. survey) by examining other surveys.

Author Response

Thank you for the feedback.

What would you expect the models’ answers to be? It is unclear what to expect models’ responses to be. Prior work has hypothesised that models may trend towards certain demographics; for example, younger demographics, which tend to be more present on the internet. Another candidate could be models trending towards the responses of particularly populous U.S. states, simply because they may produce larger volumes of data. However, we observe that base models' ACS responses are qualitatively different from those of human populations. Because of these qualitative differences, we argue that the quantitative analysis of prior work may be misleading (Section 5). This is not to say that base models do not have biases, or do not represent certain populations better than others. Rather, our findings signal the need to move beyond multiple-choice prompting towards a more holistic evaluation of LLMs (e.g., open ended survey responses) to elicit more faithful representations of the population a language model might represent.

Impersonation We focus on evaluating whether the survey responses of LLMs are representative of certain U.S. demographic subgroups. In this setting, it is standard to prompt the model without any added context. Assessing whether models have the ability to produce answers relevant to specific subgroups if instructed to do so is beyond the scope of our work.

Do humans have A bias? There is evidence for ordering bias in humans in the context of opinion surveys, and a tendency not to pick extreme values. However, in the context of the ACS demographic survey, it is well-understood that ordering effects play a very minor role in the distribution of responses collected by the U.S. census. The recent work of Tjuatja et al., 2023 finds that the response biases of language models (e.g., A-bias) are generally not human-like.

Tjuatja, Lindia, et al. "Do LLMs exhibit human-like response biases? A case study in survey design." arXiv preprint arXiv:2311.04076 (2023).

Does RLHF introduce steering? We evaluate models that have undergone RLHF, particularly the Llama 2 Instruct models, text-davinci-003, GPT-4, and GPT-4 Turbo. However, these models have undergone both standard supervised fine-tuning (i.e., instruction tuning) as well as RLHF. Overall, we observe that the responses of fine-tuned models vary more across questions (e.g., are not as balanced as those of base models). We, however, find no evidence that the responses of RLHF models better represent those of human populations. This is not to say that RLHF introduces no steer, but rather that the multiple-choice survey methodology that has recently gained traction in the community may not be appropriate to study this phenomenon.

Comment

Thank you for your detailed answers, all clear! I'm happy to increase my score to 7

Comment

Thank you for your response. We are very pleased to have addressed your concerns.

Review (Rating: 6)

This paper prompts LLMs with 25 multiple choice questions (on basic demographic information, education attainment, healthcare coverage, disability status, family status, veteran status, employment status, and income) from the 2019 ACS. The authors use eight kinds of prompts which vary in additional context, instructions, and asking in the second person. However, each time, the next-token probabilities are used to determine the immediate reply by the LLM to the multiple choice question. To evaluate the responses and the differences between LLM and human generated responses, the authors compute the normalized entropy and use KL divergence. They find that "smaller" LLMs are vulnerable to ordering and labeling biases, and that after correcting for these through randomized answer ordering, the LLMs trend towards uniform distributions in their responses (~high entropy). Instruction-tuning seems to increase the variance in the entropy measure for LLMs, but nonetheless the entropy remains higher overall compared to the human generated responses. The authors state that the main takeaway from their paper questions the popular methodology of using survey responses from LLMs using multiple choice questions. They challenge prior work and give the explanation that models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform.
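
For concreteness, the two evaluation metrics mentioned above can be sketched as follows (a minimal illustration in Python, not the paper's code; the example distributions are made up):

```python
import numpy as np

def normalized_entropy(p):
    """Shannon entropy of an answer-choice distribution, scaled to [0, 1] by log(K)."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(len(p)))

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two distributions over the same K answer choices."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical example: a near-uniform model distribution vs. a skewed human one.
model_dist = [0.26, 0.25, 0.24, 0.25]
human_dist = [0.70, 0.20, 0.05, 0.05]
print(normalized_entropy(model_dist))   # close to 1.0, i.e., near-uniform
print(normalized_entropy(human_dist))   # well below 1.0
print(kl_divergence(human_dist, model_dist))
```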

Strengths

  • The paper has a clear goal to challenge previous work on survey-derived alignment measures by offering the explanation that models consistently appear to better represent subgroups whose aggregate statistics are closest to uniform.
  • The paper shows that LLMs should not be used out-of-the-box to replace human responses in census data.
  • The paper is generally well-written and clear which makes it easy to follow.

Weaknesses

While I find the paper enjoyable to read and it challenges important earlier findings on LLMs, I am not convinced that the current experiment fully supports the claims:

  • The 2019 ACS uses stratified sampling. Therefore, I believe that the variance is already being increased to obtain a representative sample. Since the additional context given to the LLM is very limited and the draws are independent, it seems like a very hard task for the LLM to generate a matching distribution. This is less important when assessing, for example, the political views of an LLM.
  • First-token probabilities may be a biased measure to obtain the replies (see Wang et al., 2024).
  • In addition to the first-token probabilities, more advanced prompting techniques, such as Chain-of-Thought (Wei et al., 2022a), could improve the coherence and dependencies in the responses, especially for the sequential generation.
  • I believe that the sensitivity to ordering and labeling biases is known (Wei et al. (2022b) and Wei et al. (2023)).

Overall, the authors show that independent draws from an LLM with limited context generate a uniform distribution, which calls into question earlier findings made using such a methodology. While I agree with the authors on that statement, I believe that their experiment adds little value to further support their claim that models "better represent subgroups whose aggregate statistics are closest to uniform." I believe additional, more fine-grained analysis is required for this.

References

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., ... & Zhou, D. (2022a). Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35, 24824-24837.

Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., ... & Fedus, W. (2022b). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.

Wei, J., Wei, J., Tay, Y., Tran, D., Webson, A., Lu, Y., ... & Ma, T. (2023). Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846.

Wang, X., Ma, B., Hu, C., Weber-Genzel, L., Röttger, P., Kreuter, F., ... & Plank, B. (2024). "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models. arXiv preprint arXiv:2402.14499.

Questions

Minor comments:

  • There is a discrepancy between the 42 models mentioned in the abstract vs 39 in the main text.
  • Typo line 561: rompt instead of prompt (in title)

Limitations

The paper does not explicitly state the limitations of the experiments but carefully reassesses its findings in the conclusion. The checklist points to Section 2 where, for example, the authors point to the prompt ablations.

Author Response

Thank you for the feedback. We hope to address your concerns and clarify some misunderstandings below.

their experiment adds little value to further support their claim on "better represent subgroups whose aggregate statistics are closest to uniform."

We believe this to be a misunderstanding. We do not claim that models better represent subgroups whose aggregate statistics are closest to uniform. Let us clarify.

Our experiments show that, using the de-facto standard multiple-choice methodology, models strongly trend towards uniformly random responses (Figures 4 and 5). This results in a very strong correlation between subgroup entropy and alignment (Figure 6). Such correlation consistently holds across surveys, subgroups, and models. Therefore, models appear to better represent subgroups whose aggregate statistics are closest to uniform.

Our experiments explain the findings of earlier work (i.e., Santurkar et al. 2023, see the discussion in Section 5, “Beyond the ACS”), and why these findings may be misleading. Models appear to better represent younger demographics not because of the pre-training data, but because younger demographics happen to have more uniform responses for the ACS. We are not claiming that language models actually better represent certain populations intrinsically. We are arguing that the de-facto standard methodology to survey language models has strong limitations, and it can potentially lead to misleading insights. Our claims are not about whether language models better represent certain populations, but rather about the limitations of the dominant survey methodology itself.

The 2019 ACS uses stratified sampling. Therefore, I believe that the variance is already being increased to obtain a representative sample. Since the additional context given to the LLM is very limited + the draws are independent, it seems like a very hard task to generate a matching distribution by the LLM.

If we understand correctly, your point is that the aggregate responses of the U.S. census (the reference population used in our work) appear to be more entropic than they actually are, due to the U.S. census not representing households uniformly at random. Our claim is that, when using the dominant multiple-choice methodology to survey models, model responses are qualitatively different from those of the U.S. census. Model responses strongly trend towards uniformly random, irrespective of the survey question being asked. The responses of the U.S. census do not – they are heterogeneous. Even if stratified sampling were to have some effect on the U.S. census responses, it would still be the case that model responses (e.g., the blue dots in Figure 4a) look nothing like those of the U.S. population (green dots).

Please note that there are no “draws” – for each survey question, we extract the models’ survey response analytically by extracting its next-token probabilities over each of the answer choices (e.g., “A”, “B”, …). This is the standard methodology introduced by Santurkar et al. for surveying language models using multiple-choice questionnaires. Our contribution is to shed light on the properties of these output distributions.
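
As a rough illustration of this next-token scoring step, here is a hedged sketch assuming a HuggingFace causal language model (the model name, prompt wording, and answer-label tokenization are placeholders, not the authors' exact setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM can be scored the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Question: What is this person's marital status?\n"
    "A. Married\nB. Widowed\nC. Divorced\nD. Never married\n"
    "Answer:"
)
choices = ["A", "B", "C", "D"]

with torch.no_grad():
    logits = model(**tokenizer(prompt, return_tensors="pt")).logits[0, -1]

# Token id of each answer label as the next token (label tokenization,
# e.g. leading spaces, can differ across models).
choice_ids = [tokenizer.encode(" " + c, add_special_tokens=False)[0] for c in choices]

# Softmax restricted to the label tokens equals the full next-token
# distribution renormalized over the answer choices.
probs = torch.softmax(logits[choice_ids], dim=0)
print(dict(zip(choices, probs.tolist())))
```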

First-token probabilities may be a biased measure to obtain the replies [...] more advanced prompting techniques, such as Chain-of-Thought (Wei et al., 2022a), could improve the coherence and dependencies in the responses, especially for the sequential generation.

We agree with your points. They support the overall conclusion of our work that the current multiple-choice methodology used to survey language models has strong limitations, and we should move towards a more holistic evaluation of LLMs (e.g., open ended survey responses rather than multiple choice) in order to elicit more faithful representations of the population a language model might represent.

I believe that the sensitivity to ordering and labeling biases is known (Wei et al. (2022b) and Wei et al. (2023)).

Ordering bias has been observed in various works. We cite Robinson and Wingate (2023a). We are happy to include more references in the final version. Our work differs from prior work in studying the effects of ordering biases on models’ survey responses. We show that models’ survey responses can substantially change after adjusting for their ordering biases, leading to fundamentally different insights regarding the populations that models best represent.

We hope the additional explanations helped address your concern. We are happy to answer further questions that you may have.

Comment

Thank you for the clarifications.

Re-reading your paper from the perspective that your main goal is to question the validity of prior work on survey-derived alignment measures, which (unconditionally) sample survey responses from LLMs, made me realize that my initial rating and assessment of the paper could have been much higher. I have adjusted my rating accordingly.

My initial assumption, related to the stratified sampling comment, was that surveys through LLMs were conducted conditionally on demographics. For example, the paper released today by Ashokkumar et al. (2024) requires LLMs to respond to survey questions conditional on random demographic characteristics. This is a more valid approach to conducting surveys through LLMs, and I agree with your paper that there are better approaches than unconditional sampling (or evaluating token probabilities). Similar to the remark of Reviewer gFgE, conditioning on demographics seemed to me to conflict with the US Census data set, as many questions related to the demographic information would then be embedded in the prompt. However, as you point out, you also run experiments on opinion surveys.

It might be worthwhile to point out that conditioning LLMs on demographics via prompts may be part of the solution. However, you implicitly already do this with the sequential prompting strategy, keeping previous demographic information in context. Maybe this points toward conditioning not being a solution?

References

Ashokkumar et al. (2024). Predicting Results of Social Science Experiments Using Large Language Models.

Comment

Thank you for your response. We are very pleased to have addressed your concerns.

Regarding sequential prompting, we condition on models' previous outputs rather than on existing demographics from U.S. census individuals, which again does not result in meaningful aggregate response statistics. That said, we think that conditioning on existing demographics (e.g., U.S. census demographics) may be one way to obtain more reliable survey responses from language models. However, studying the effectiveness of such an approach is beyond the scope of this work.

Review (Rating: 8)

This paper critically examines possible pitfalls of using the responses of LLMs to survey queries to study model alignment. The authors find substantial biases, e.g., with respect to the ordering of response options.

Strengths

The paper examines a very important methodological topic that has gained significant attention beyond computer science as well: measuring the values and opinions in which LLMs are rooted. As such, the topic is very relevant, interesting, timely, and certainly fitting for the conference.

The paper is well written and well motivated. The provided materials (i.e., code and documentation) are exemplary. Results are presented in a clear and concise way.

Weaknesses

The paper is motivated by surveys on the "demographics, political opinions, or values best represented by current language model". For this, the paper mostly relies on the ACS dataset. However, this questionnaire mostly covers demographic information, for which the LLM naturally cannot have a "correct" answer. It would be critical to see for which type of question the high-entropy responses actually hold. From my own experience, I would not expect at all that a similar uniform distribution would also occur for (political) opinion or value-based questions.

Minor note: I would recommend adding a forward mention in Section 2 that other datasets will also be covered in Section 5.

Although a lot of LLMs have been included in the study, only the large-scale models of OpenAI have been studied. To see whether the observations also hold for other, similarly large models, it would be nice to include other commercial providers (e.g., Anthropic or Google) as well. This is not required for the key results of the paper, however.

Questions

Limitations

The limitations are mostly described well in the paper. For an exception, see weaknesses.

Author Response

Thank you for the positive assessment and the feedback. Please note that we discuss in Appendix E how our findings for the ACS transfer to opinion surveys. We agree that our observations regarding models’ survey responses may be partially attributable to survey questions not having a “correct” answer. This is in stark contrast with the multiple-choice questions that are typically used to evaluate LLMs (e.g., MMLU), and reveals interesting new insights for model evaluation.

Comment

Thank you for your comment and the pointer to the appendix!

Review (Rating: 7)

This paper conducts experiments to verify the alignment between human and LLM responses to the ACS survey. In particular, the paper questions existing literature suggesting that LLMs can be used as proxies for measuring responses to survey questions, arguing instead that LLM choices are biased by the ordering of the answer options, and that when the choice order is randomized, models tend to give uniformly random survey responses, thereby appearing to best model subgroups whose aggregate statistics are closest to uniform. The paper suggests that using LLMs as human proxies for multiple-choice surveys is a questionable strategy.

Strengths

  1. The authors test 42 different LLMs in their experiment, testing base, instruction-tuned, and RLHF-tuned models. This is comprehensive and substantive, and a lot of work. The results they find agree across model size and type barring one outlier in instruction-tuned models over one survey.
  2. The authors test the use of randomized choice order and the original choice order, finding that randomizing the response order results in a uniform distribution of responses.
  3. The authors investigate the effect of using instruction-tuning to train models.
  4. The authors also test surveys besides the ACS, and find that the results persist.
  5. The authors interpret findings in earlier papers and explain why the LLM responses more closely resemble responses from certain demographic subgroups, i.e., that those subgroups' aggregate distributions are closest to uniform.
  6. The paper is well-written and presents a simple and elegant experiment, and clear and consistent takeaways.

Weaknesses

  1. Choice-ordering bias is a well-known phenomenon in LLMs (Lu, Yao, et al. "Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity." arXiv preprint arXiv:2104.08786 (2021)). Therefore, the main new finding in this paper is that LLM responses trend toward a uniform distribution once the choice orderings are randomized.
  2. Please use a different color for the sub-group dots in fig. 5 -- it is confusing because the same color is used for survey responses in earlier figures.
  3. Takeaways are harder to gauge from Fig. 3. Please use simpler aggregate statistics like means, confidence intervals, variance and save this figure for the appendix.
  4. Please plot the log linear scale as a dashed line in Fig. 2 for easy comparison.

Questions

  1. How was the subset of 25 questions selected from the ACS?
  2. Do the authors have any comments on the training data and its influence on survey responses, beyond the frequency of appearance of certain letters in English as noted in C?
  3. Does skewed randomization (as opposed to simple randomization) of the responses, present a similar skew in the model response distributions?
  4. Instruction-tuned models show higher variance in entropies. What could be causing this?
  5. What form of instruction tuning was tried?
  6. Were the effects of question-ordering investigated?

Limitations

N/A

Author Response

Thank you for your comments. We will implement the suggested changes to improve the figures. Let us address your questions in the following.

Selection of questions. We chose 25 representative questions to achieve diversity over topics (e.g., educational attainment, healthcare status, employment status, etc.), while keeping figures readable. In the rebuttal pdf we include additional results for all ACS survey questions for comparison. We provide the A-bias and response entropy of models with publicly available weights. They show identical trends.

The role of training data. Regarding the pre-training data, we find that all base models exhibit similar response distributions, despite being trained on substantially different pre-training data. While we find little difference across base models, beyond the ordering effects identified in Appendix C for smaller language models, we do observe substantial differences between the response distributions of different fine-tuned models. This suggests that fine-tuning data can have a larger effect on models’ distributions. This is a positive result for future work seeking to fine-tune models to alter their survey responses (e.g., emulate those of certain populations).

Why do instruction-tuned models have higher variance in entropy? We generally find that, compared to base models, instruction-tuned models tend to have higher confidence in their responses for at least some of the survey questions. This causes instruction-tuned models to have higher variance in their response entropy compared to base models, as any deviation from balanced responses is more amplified. Note that we used publicly available instruction-tuned models; we did not perform instruction tuning ourselves. For some models (e.g., the Llama models), the instruction-tuning datasets are not publicly available.

Effects of question-ordering We follow the predominant methodology of asking questions independent of each other. Therefore, there are no question-ordering effects. If questions were to be asked in sequence, putting the answer to previous questions in context, then we would expect to observe substantial question-ordering effects. But this was not the focus of our work.

Skewed randomization Yes, for models that exhibit choice-ordering biases, skewed randomization would change the response distribution. This is because we would no longer average uniformly across each of the possible choice orderings, but perform a weighted average instead. However, uniform randomization is the only principled way to adjust for the ordering bias.
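
To make the uniform adjustment concrete, here is a schematic sketch (not the authors' implementation; the `get_choice_distribution` helper is hypothetical and stands in for the next-token scoring step):

```python
from itertools import permutations

import numpy as np

def debiased_distribution(question, options, get_choice_distribution):
    """Average the model's answer distribution uniformly over all orderings of the
    answer options, mapping the probability mass back to the original options."""
    totals = np.zeros(len(options))
    # In practice, a random subset of orderings may be sampled when the number
    # of options is large.
    orderings = list(permutations(range(len(options))))
    for order in orderings:
        shuffled = [options[i] for i in order]
        # Returns P("A"), P("B"), ... for the options presented in this order.
        probs = np.asarray(get_choice_distribution(question, shuffled))
        for position, original_index in enumerate(order):
            totals[original_index] += probs[position]
    return totals / len(orderings)
```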

Author Response

We thank all reviewers for their feedback.

We hope to have addressed your concerns, and we are happy to answer any further questions you may have.

Thank you, Authors

Final Decision

This paper conducted large-scale experiments to understand LLMs' answers to survey questions designed for humans. It found that the models suffer from strong positional bias (which is generally known to the community) and that the models' answers approximate an aggregate uniform distribution once the positional bias is properly controlled for.

The paper studies LLMs from a novel perspective, as survey participants. The conclusion may not surprise an LLM researcher too much, but it has good value for other researchers who could be convinced by the models' human-like behavior and try to use them in studies on social topics, which could be dangerously misleading.

The paper is well received by the reviewers and is well written.