PaperHub
Average rating: 6.8 / 10
Decision: Rejected (4 reviewers)
Ratings: 6, 7, 6, 8 (min 6, max 8, std. dev. 0.8)
Confidence: 3.3
Correctness: 3.5 · Contribution: 3.5 · Presentation: 3.5
NeurIPS 2024

Beyond accuracy: understanding the performance of LLMs on exams designed for humans

Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

We apply traditional psychometrics tools to evaluate the performance of large language models and compare their patterns of correct and incorrect answers against a large dataset of human students doing college-entrance level exams.

Abstract

Keywords

large language models, model evaluation, psychometrics

Reviews and Discussion

Review
Rating: 6

This paper shows the use of psychometric modeling techniques to measure the reasoning ability of LLMs on human exams. Specifically, the author(s) use Item Response Theory (IRT) on a Brazilian college-entrance exam and demonstrate that IRT can provide a more informative evaluation of LLMs, including the ability to distinguish human-like from non-human-like response patterns and to determine whether an exam can reliably measure an LLM's abilities. The empirical results suggest that traditional accuracy metrics are insufficient to assess the abilities of LLMs, and the paper advocates for using IRT/psychometric theory to evaluate them.
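For context, a minimal sketch of the three-parameter logistic (3PL) item response function that this kind of IRT analysis builds on (the parameter values below are illustrative and not taken from the paper):

```python
import numpy as np

def p_correct(theta, a, b, c):
    """3PL item response function: probability that a test taker with
    ability theta answers correctly an item with discrimination a,
    difficulty b, and guessing parameter c."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Illustrative values: an average-ability test taker (theta = 0) on a
# moderately difficult five-option multiple-choice item.
print(p_correct(theta=0.0, a=1.2, b=0.5, c=0.2))  # ~0.48
```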

Strengths

  1. Provides more comprehensive evaluation methods for LLM performance; I personally agree that accuracy metrics often do not give a complete picture of LLM ability.
  2. The results section is methodical: it evaluates not only the IRT scores but also how reliable they are, based on several metrics (which increases the reliability of the evaluations).

Weaknesses

  1. The results analysis would benefit from a more detailed and clearer/deeper analysis; some statements made (e.g. L293-298) are high-level observations based on the results, but lack further insight into why certain LLM behaviors occur. Performing more detailed analyses of the specific subset of questions that contribute to the scores could help to further understand the limitations of the LLM (L328-331 alludes to this, but very briefly).

  2. All the evaluations were done on variations of the ENEM exam dataset; showing that these psychometric methods would also work on other datasets would make a more convincing case that the approach will work for wider applications. I understand that there is limited time to run more experiments, so this is more of a comment.

Questions

  1. L212 mentions "The results we show in this section are strongly consistent with the results we get for the 2022 ENEM exam and for one-shot prompting", but there are no descriptions/analyses of the zero-shot prompting models (only graphs in the appendix). Were the findings similar? If not, in what ways did they differ?
  2. L181 mentions that "developers may exclude questions having poor model fit" for the IRT model, can we have more details about how "poor model fit" is determined?

Limitations

As mentioned above, since the experiments are done on variations of one dataset, there are doubts about the generalizability of these methods to other datasets.

Author Response

The results analysis would benefit from a more detailed and clearer/deeper analysis; some statements made (e.g. L293-298) are high-level observations based on the results, but lack further insight into why certain LLM behaviors occur. Performing more detailed analyses of the specific subset of questions that contribute to the scores could help to further understand the limitations of the LLM (L328-331 alludes to this, but very briefly).

We agree that having a better understanding of why these LLM behaviors occur would be great. We start this analysis using the discrimination index (DI), showing certain attributes that affect an item’s discrimination power (Appendix A.2) and analyzing questions with very high/low DI (Appendix A.10). However, due to time and space constraints, we leave a complete study for future work.

All the evaluations were done on variations of the ENEM exam dataset; showing that these psychometric methods would also work on other datasets would make a more convincing case that the approach will work for wider applications. I understand that there is limited time to run more experiments, so this is more of a comment.

We agree that having more dataset variants would make the work more convincing. However, there is a limitation in terms of data availability: the ENEM dataset is unique in terms of the number of test takers and the availability of their IRT parameters/scores. Moreover, the use of psychometric methods to evaluate test takers' abilities is very well tested in many different exams, such as the SAT and GRE.

Questions:

L212 mentions "The results we show in this section are strongly consistent with the results we get for the 2022 ENEM exam and for one-shot prompting", but there are no descriptions/analyses of the zero-shot prompting models (only graphs in the appendix). Were the findings similar? If not, in what ways did they differ?

Thank you for pointing this out. As you note, results for zero-shot prompting are presented in the appendices. We will add an analysis of the zero-shot prompt to the main text. Our takeaway for the zero-shot prompt is that it almost always underperforms both the one-shot and four-shot methods in all subjects except Mathematics, where all prompting methods have similar performance.

L181 mentions that "developers may exclude questions having poor model fit" for the IRT model, can we have more details about how "poor model fit" is determined?

Yes. Poor item fit can be determined in a number of ways, including likelihood-based methods analogous to ℓ_z (but for the item parameters rather than θ). For the 2022 and 2023 ENEM exams, the reason reported by the exam designers was that the items removed were those that caused convergence problems during the IRT fit.
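For readers unfamiliar with ℓ_z, a minimal sketch (not the authors' code) of the standardized log-likelihood person-fit statistic referred to above; the item-fit analogue applies the same idea per item across test takers:

```python
import numpy as np

def lz_statistic(responses, p):
    """Standardized log-likelihood person-fit statistic (l_z).

    responses: 0/1 vector of one test taker's answers
    p: model-implied probabilities of a correct answer for each item,
       evaluated at the test taker's estimated ability
    """
    responses = np.asarray(responses, dtype=float)
    p = np.asarray(p, dtype=float)
    q = 1.0 - p
    l0 = np.sum(responses * np.log(p) + (1 - responses) * np.log(q))
    expected = np.sum(p * np.log(p) + q * np.log(q))
    variance = np.sum(p * q * np.log(p / q) ** 2)
    return (l0 - expected) / np.sqrt(variance)

# Hypothetical example: values near zero indicate a response pattern that is
# consistent with the model; large negative values flag aberrant patterns.
print(lz_statistic([1, 1, 0, 1, 0], [0.9, 0.8, 0.4, 0.7, 0.3]))
```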

Limitations: As mentioned above, since the experiments are done on variations of one dataset, there are doubts about the generalizability of these methods to other datasets.

We agree that this is a limitation of our work. However, we would like to emphasize that the ENEM exam is rather unique data that enables us to perform an in-depth IRT analysis. The generalizability of the methods is a valid concern. We would like to highlight that IRT has been used in multiple exams to evaluate humans’ abilities, and that IRT models always have the same assumptions: monotonicity, unidimensionality of the latent trait, local independence, and invariance. For these reasons we believe that our methods are generalizable to any scenario where we have exams designed for humans and scored by IRT.

Comment

Thank you for addressing the comments and questions. This paper offers a thorough and sound evaluation of the ENEM dataset. My main concern remains the generalizability of the technique beyond this specific context. However, the fact that IRT is a widely accepted and versatile framework does help to alleviate some of this concern; a more thorough literature review on this point could further strengthen the argument for the generalizability of your approach. Given the additional validation of the method, I am increasing my rating for the paper.

Review
Rating: 7

The paper focuses on evaluating LLM abilities on a dataset of 8 college-entrance exams in Brazil (translated to English), using Item Response Theory (IRT) instead of accuracy. It highlights how such a metric is useful for better understanding models' performance.

Strengths

I found the work very well structured and appreciated the amount of care the authors have given to the preparation of the dataset for the experiments (PDF processing, translation to English, use of exams designed for blind people in order to address questions based on images, etc.). The experiments and results are discussed in detail, with clear comparisons with human performance and discussion of clear differences (e.g. in Mathematics).

Weaknesses

While the paper is well structured, I felt it was missing a "what now?" message. The authors wrote a convincing argument in favour of using IRT; how do we now convince the field of ML/AI to use it more extensively? What are its limitations in comparison with accuracy-based metrics (given there are many; for instance, you need information on overall human performance) and how do we overcome them?

Questions

Could you add more to the text about the way the questions were translated? Was it machine translation? Did someone double-check the consistency of the translated questions? Would you advise a similar approach if someone wanted to further translate these questions into other languages?

Limitations

I think the work should have discussed more about the specificities of ENEM. I agree with the authors that this is a relevant test-bed for this sort of evaluation, but in which ways is it specific/tailored to Brazil? Is there anything researchers should know about ENEM that would make future testing/applications more challenging? For instance, which topics are covered in Humanities or Languages, and how specific are they to the country's cultural context?

Author Response

While the paper is well structured, I felt it was missing a "what now?" message. The authors wrote a convincing argument in favour of using IRT; how do we now convince the field of ML/AI to use it more extensively? What are its limitations in comparison with accuracy-based metrics (given there are many; for instance, you need information on overall human performance) and how do we overcome them?

We appreciate the referee's comments.

How to convince the ML/AI field:

  • IRT provides a more "in-depth" analysis of the models' performance by looking at how likely a response pattern is according to the data.
  • With IRT, the instruments can be smaller compared to CTT.
  • IRT enables a more detailed comparison against human performance.

Limitations of IRT in comparison to accuracy:

  • Need for scores from multiple independent individuals (human test-takers in this paper)
  • Need for fitting the model and performing extra analysis (e.g. confirming unidimensionality)

How to overcome them:

  • The need for better benchmark data: well-designed exams (construct validity and the related ideas we discussed) are better instruments of measurement.
  • If we want to compare against human performance, we have to have human performance data. With this data, we can fit an IRT model.

We will add a "what now?" paragraph summarizing these points in the main text.
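To make the contrast with accuracy concrete, here is a minimal hypothetical sketch (invented item parameters, not the paper's code) of how an IRT ability estimate can separate two response patterns that a raw accuracy (CTT) score treats as identical:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def p3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def estimate_theta(responses, a, b, c):
    """Maximum-likelihood ability estimate with fixed item parameters."""
    responses = np.asarray(responses, dtype=float)

    def neg_log_lik(theta):
        p = p3pl(theta, a, b, c)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    return minimize_scalar(neg_log_lik, bounds=(-4, 4), method="bounded").x

# Hypothetical 5-item exam: both patterns have accuracy 3/5, but IRT weighs
# *which* items were answered correctly, so the theta estimates differ.
a = np.array([1.5, 1.2, 1.0, 0.8, 2.0])   # discrimination
b = np.array([-1.0, -0.5, 0.0, 0.5, 1.5])  # difficulty
c = np.full(5, 0.2)                        # guessing floor (5 options)
print(estimate_theta([1, 1, 1, 0, 0], a, b, c))  # easy items correct
print(estimate_theta([0, 0, 1, 1, 1], a, b, c))  # hard items correct
```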

Questions: Could you add more to the text about the way questions were translated? Was it machine translation? Did someone double-check consistency of translated questions? Would you advise a similar approach if someone would like to further translate those questions to other languages?

We used the AI-based DeepL API to translate, and then performed a manual correction pass on the translated exams to double-check the consistency of the translations. The entire translation process is described in Appendix A.1 (A.1.2-A.1.5). The main advantage of this approach is its low cost. However, it required a fair amount of manual inspection of the translations. This approach seems most viable for languages with good automatic translation from Portuguese (e.g. Spanish, Italian, or other high-resource languages).
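As an illustration only (placeholder key and question; the authors' full pipeline with manual post-editing is described in Appendix A.1), a translation call with the official DeepL Python client looks roughly like this:

```python
import deepl  # official DeepL Python client

# Placeholder credentials and text, not the authors' actual pipeline.
translator = deepl.Translator("YOUR_DEEPL_AUTH_KEY")

question_pt = "Qual alternativa completa corretamente a frase?"
result = translator.translate_text(
    question_pt, source_lang="PT", target_lang="EN-US"
)
print(result.text)  # machine translation, manually reviewed afterwards
```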

I think the work should have discussed more about the specificities of ENEM. I agree with the authors that this is a relevant test-bed for this sort of evaluation, but in which ways is it specific/tailored to Brazil? Is there anything researchers should know about ENEM that would make future testing/applications more challenging? For instance, which topics are covered in Humanities or Languages, and how specific are they to the country's cultural context?

Your points about the country's cultural context are correct: the ENEM has some questions that are specific to Brazilian culture (e.g. containing fragments of a Brazilian song or poem), especially in the Humanities and Languages exams. However, the exam is designed in such a way that these questions are interpretative rather than knowledge-based.

In Appendix A.11, we provide more details on the four sections of the ENEM exam. Some examples of actual questions are shown in Appendix A.10.

Comment

Thank you for your reply and the further details. It is a really interesting piece of work and I'm sure the community will find it very useful.

Review
Rating: 6

This paper initiates the empirical study of the performance of LLMs using Item Response Theory (IRT) models fit to a large college-entrance exam.

Strengths

  • The question of construct validity of LLM evaluations (based on scores in exams designed for humans) is very important. This paper addresses this question in earnest, by leveraging the primary tool developed in the education and psychometrics field, IRT. As far as I know, this is an original contribution, as no previous work has used IRT in the same way to tackle the construct validity issue of LLM evaluations.

  • The paper is largely well-written and the concepts (e.g. section 3) are explained clearly.

  • Relative strength of empirical work. The experiments are well designed and there is some breadth in the range of hypotheses explored, e.g. the effect of English vs. Portuguese on performance, different exam topics, and response patterns in LLMs vs. humans with questions sorted by IRT difficulty. Seven different LLMs were evaluated.

  • Significance. The method of this paper (i.e. using IRT in LLM evals) is an important first step to understanding what LLM evals are trying to measure. The paper already observes interesting phenomena, e.g. (1) the Fisher information of the math exam for the LLM test response distributions is low compared to other exams (although this is a somewhat obvious corollary of the p_i's being close to random for the LLMs' performance on the math exam, the FI is a metric that points in the right direction); (2) the joint distribution of IRT scores and CTT scores for LLMs is meaningfully different from that of the human test takers.
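For reference, the standard test (Fisher) information formula behind this observation; under the 3PL model, items contribute almost no information when the response probabilities sit near the guessing floor c_i, which is exactly the near-random regime described above:

```latex
I(\theta) = \sum_i I_i(\theta), \qquad
I_i(\theta) = a_i^2 \,\frac{Q_i(\theta)}{P_i(\theta)}
\left(\frac{P_i(\theta) - c_i}{1 - c_i}\right)^2,
\qquad Q_i(\theta) = 1 - P_i(\theta).
```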

Weaknesses

  1. Some of the conclusions drawn by the paper appear unscientific/not well-substantiated. To me, the empirical results are subtle and require more thoughtful interpretations. Most of the interpretations of the experiments are confusing to me (i.e. I'm skeptical the conclusions follow), given the actual plots shown. For example,

(a) What are "outlier models" (line 237)? We cannot see from Figure 1 that "outlier models ... have higher accuracy and/or lower IRT scores..." - how is this statement supported?

(b) line 223-224. The scale of IRT scores and CTT scores is not comparable. How can you conclude there is "greater variability" in the latter than in the IRT score? This is not scientific.

(c) line 264-265. The statement "...questions that are easy for humans but difficult for LLMs" is again inaccurate. The questions are relatively easier for humans but may not be "easier" than the other questions for humans, if easier means for humans anyway.

(d) Why is the math exam not meaningful for evaluating LLMs? Doesn't it suggest that the models are randomly guessing and therefore bad? I don't agree with this interpretation.

  2. A clarity issue with the math writing. Line 154-155: This sentence "...j has a more likely response vector than indicated by their ability" is mathematically wrong. It is not possible to have a random draw from a Multinomial distribution that is "more likely" (i.e. higher probability) than the expectation vector (which is not even in the space of possible draws).

  3. The experiment section writing is missing some details, and the figures are somewhat difficult to interpret (especially Figure 1). I have several unanswered questions. How was the closed curve generated from the 30 points (of random shuffles)? The caption for Figure 1 could be more informative, e.g. was the exam answered in English or Portuguese by the LLM? If English, are the IRT model fits still valid? I don't think so.

  4. Typo in lines 232-233: "Natural sciences" appears twice, and the sentence contradicts the graph.

Questions

  1. Is the dataset of the ENEM exams translated to English made publicly available?
  2. How was the closed curve generated from the 30 points (of random shuffles)?
  3. The caption for Figure 1 could be more informative, e.g. was the exam answered in English or Portuguese by the LLM? If English, are the IRT model fits still valid? I don't think so.
  4. What are "outlier models" (line 237)?

Limitations

Yes.

Author Response

(a) What are "outlier models" (line 237)? We cannot see from Figure 1 that "outlier models ... have higher accuracy and/or lower IRT scores..." - how is this statement supported?

We agree that this statement lacks a quantitative interpretation. We have removed this sentence (and its paragraph) from the paper.

(b) line 223-224. The scale of IRT scores and CTT scores is not comparable. How can you conclude there is "greater variability" in the latter than in the IRT score? This is not scientific.

We agree that this statement lacks a quantitative justification. We have removed this sentence (and its paragraph) from the paper.

(c) line 264-265. The statement "...questions that are easy for humans but difficult for LLMs" is again inaccurate. The questions are relatively easier for humans but may not be "easier" than the other questions for humans, if easier means for humans anyway.

We agree this statement needs clarification. We have updated the text to make the point more clearly:

It is informative that there are sets of questions having similar difficulty for humans (β value) but on which many LLMs answer some questions mostly correctly and some questions mostly incorrectly. This suggests that the β values estimated from humans may not be good measures of the difficulty of some questions when given to an LLM.

(d) Why is the math exam not meaningful for evaluating LLMs? Doesn't it suggest that the models are randomly guessing and therefore bad? I don't agree with this interpretation.

The point is that the Math exam does not give reliable estimates of θ for LLMs. This is supported by the goodness-of-fit (ℓ_z) scores, the Fisher information of the exam, and the discrimination indexes. As we wrote, this “[casts] doubt on the informativeness of the models’ θ scores on this exam.”

A clarity issue with the math writing. Line 154-155: This sentence "...j has a more likely response vector than indicated by their ability" is mathematically wrong. It is not possible to have a random draw from a Multinomial distribution that is "more likely" (i.e. higher probability) than the expectation vector (which is not even in the space of possible draws).

E[L_j] is not the expectation vector; it is the expectation of the likelihood under the model. This sentence has been clarified to read: "a response vector with a higher likelihood than expected given the test taker’s ability (θ parameter)."

The experiment section writing is missing some details, and the figures are somewhat difficult to interpret (especially Figure 1). I have several unanswered questions. How was the closed curve generated from the 30 points (of random shuffles)? The caption for Figure 1 could be more informative, e.g. was the exam answered in English or Portuguese by the LLM? If English, are the IRT model fits still valid? I don't think so.

The closed curves are isolines of a KDE of the distribution derived from the 30 samples. We will clarify this in the text. The legend identifies whether the exam was given in English or Portuguese; in either case the answers were from the set (A, B, C, D, E). We will update the caption to refer to Section 4.2 (Models) and A.4 (Prompting). The IRT model fits are for the Portuguese-language exams given to Portuguese speakers. However, the central thesis of IRT modeling is that it estimates latent variables of test takers that are independent of question specifics.
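For readers trying to reproduce this kind of figure, a minimal sketch with synthetic data (not the authors' plotting code) of drawing KDE isolines from 30 two-dimensional points:

```python
import numpy as np
from scipy.stats import gaussian_kde
import matplotlib.pyplot as plt

# Synthetic stand-in for the 30 (accuracy, IRT score) points per model.
rng = np.random.default_rng(0)
points = rng.normal(loc=[0.6, 0.4], scale=[0.03, 0.15], size=(30, 2))

kde = gaussian_kde(points.T)  # fit a 2-D kernel density estimate
xs, ys = np.meshgrid(np.linspace(0.5, 0.7, 200), np.linspace(-0.2, 1.0, 200))
density = kde(np.vstack([xs.ravel(), ys.ravel()])).reshape(xs.shape)

plt.contour(xs, ys, density, levels=5)          # closed isolines
plt.scatter(points[:, 0], points[:, 1], s=10)   # the 30 shuffles
plt.xlabel("accuracy (CTT)")
plt.ylabel("IRT score")
plt.show()
```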

Typo in lines 232-233: "Natural sciences" appears twice, and the sentence contradicts the graph.

“Natural Sciences” appears twice because different models are being discussed. The sentences don’t contradict the graph, although colors make it hard to distinguish here. We will look for a better way to make this distinction clear in the figure.

Questions: Is the dataset of the ENEM exams translated to English made publicly available?

We provide them in the supplementary material, but they are not publicly available yet. We will make them public upon publication.

How was the closed curve generated from the 30 points (of random shuffles)?

Answered above.

The caption for Figure 1 could be more informative, e.g. was the exam answered in English or Portuguese by the LLM? If English, are the IRT model fits still valid? I don't think so.

Answered above.

What are "outlier models" (line 237)?

Addressed above.

Review
Rating: 8

This paper provides a fresh perspective on evaluating LLMs by arguing for a stronger emphasis on psychometric methods, particularly Item Response Theory (IRT), when evaluating them on exams designed for humans, rather than relying on traditional metrics such as accuracy. The authors argue that IRT provides a more comprehensive evaluation by considering not just the number of correct answers but also the difficulty of the questions and the patterns of responses. The authors use the Brazilian college entrance exam ENEM for their case study and compare how various LLMs fare against human test-takers. They show how psychometric methods can be leveraged to distinguish between human-like and non-human-like responses. Furthermore, they demonstrate how IRT can be used to assess the suitability of an exam for making meaningful measurements of an LLM's abilities in the given area.

Strengths

The paper is very well written. It provides a comprehensive literature review and does a good job of covering the background information. The experimental analysis is sound, with sufficient supporting material. The paper makes significant novel contributions to the evaluation of LLMs. The application of psychometric methods, and the insights that can be mined through them when used to compare LLMs, can be of significant interest to the research community. The experimental results on assessing whether an exam is a good indicator of an LLM's ability are particularly interesting and open up significant opportunities for future research.

Weaknesses

The error analysis could be more detailed, especially in areas where the results are surprising; this would better support the conclusions. For instance, for the questions in Math and Natural Sciences where the models show fluctuating performance, it would be useful to know what those questions aim to test. Are LLMs unable to solve the problems due to calculation errors, do these problems involve more complex multi-step reasoning, or is it just linked to the knowledge cutoff (e.g. questions involving current events)?

Questions

The evaluations have been performed with relatively small models (in comparison to SOTA frontier models, e.g. GPT-4 / Gemini 1.5 Pro / Claude 3). It would be interesting to benchmark frontier models and see whether the results highlight any obvious potential biases in the smaller models.

Limitations

The prerequisite for this type of evaluation seems to be the existence of a strong IRT model, which in turn requires a large amount of carefully annotated human data.

Author Response
  1. For instance, for the questions in Math and Natural Sciences where the models show fluctuating performance, it would be useful to know what those questions aim to test.

We appreciate the referee's comments. An important difference between the Math/Natural Sciences and Languages/Humanities exams is that the former contains questions requiring mathematical reasoning and image interpretation, whereas the latter primarily focuses on knowledge and language understanding. In the 2023 exam, the percentages are as follows:

  • Humanities: 9% of the questions contain images; 0% contain numbers in the answers.

  • Languages: 24% of the questions contain images; 0% contain numbers in the answers.

  • Natural Sciences: 18% of the questions contain images; 51% contain numbers in the answers.

  • Math: 33% of the questions contain images; 96% contain numbers in the answers.

The Natural Sciences exam features a particular mix of knowledge-based questions and questions requiring mathematical reasoning and/or reasoning with images. Appendix A.2 presents an analysis showing that questions containing numbers and images tend to be less discriminative, which partially explains the fluctuating performance observed.

  2. Experiment with SOTA frontier models such as GPT-4 / Gemini 1.5 Pro / Claude 3

We agree it would be desirable to experiment with SOTA models, and we plan to do so in future work. However, our methodology measures the next-token logits across the 5 letter options, which has been shown to be a particularly reliable strategy (see L712 and [32]), and hence we cannot use Gemini 1.5 Pro or Claude 3 since they do not provide top logits. For GPT-4, we did not run it due to time and cost (45 questions x 4 subjects x 2 years x 2 languages x 30 shuffles = 21,600 requests, with GPT-4 costing up to 30 times more than GPT-3.5 at the time of writing).
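For clarity, a minimal sketch of this next-token-logits strategy using the HuggingFace transformers API; the model name and prompt are placeholders and this is not the authors' exact setup (their prompting details are in Appendix A.4):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Question: ...\nOptions: A) ... B) ... C) ... D) ... E) ...\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

# Compare the logits of the five answer letters and pick the largest.
option_ids = [tokenizer.encode(f" {letter}", add_special_tokens=False)[0]
              for letter in "ABCDE"]
scores = logits[option_ids]
print("ABCDE"[int(scores.argmax())])
```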

Comment

Thank you for addressing my questions.

Final Decision

The paper argues that using accuracy for evaluating LLMs does not allow a deep understanding of the true capabilities of LLMs. It proposes to use item response theory (IRT) as a tool to better understand LLM performance on human exams. The results show that LLMs exhibit significant deviations from human-like behavior in many cases.

The paper is well-written and the experiments are well-designed as noted and agreed by most reviewers.

Some reviewers consider using IRT to evaluate LLMs to be a significant and novel contribution. However, using IRT to evaluate LLMs has already been explored in several previous works, such as [22, 33, 54] and the following two papers:

  1. Vania, C., Htut, P. M., Huang, W., Mungra, D., Pang, R. Y., Phang, J., Liu, H., Cho, K., and Bowman, S. R. Comparing Test Sets with Item Response Theory. ACL, 2021.
  2. Maia Polo, F., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. tinyBenchmarks: Evaluating LLMs with Fewer Examples. arXiv, 2024.

Therefore, using IRT to evaluate LLMs can hardly be considered a novel and significant contribution of this paper.

The second concern about the paper is its findings: a key finding of the paper is that LLMs behave differently from humans on certain questions, but this is already more or less known; some findings are subtle and require more thoughtful interpretation (Wmva), e.g. the claim that the math exam is not meaningful for evaluating LLMs; and there is also a lack of clearer/deeper analysis providing insight into why certain LLM behaviors occur (KhP9).

Given the above concerns, and after discussion with the SAC, we think that the paper is more of a case study of using IRT to understand how LLMs perform on "human exams" using a large educational dataset, and the technical contributions of the paper are quite limited. The paper might be better suited to a conference other than NeurIPS, e.g. conferences at the intersection of AI and education, or an NLP conference, where the community would be more interested in the hypotheses and findings of the paper. NeurIPS, with its emphasis on AI/machine learning foundations and methodology, does not seem to be an appropriate venue for a "case study" paper.