PaperHub
Average rating: 8.3/10 (Poster; 4 reviewers; min 8, max 9, std. dev. 0.4)
Individual ratings: 8, 8, 8, 9
Confidence: 4.0
COLM 2024

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

OpenReview · PDF
Submitted: 2024-03-22 · Updated: 2024-08-26
TL;DR

We craft a benchmark of PhD-level science questions that are difficult for highly skilled non-domain-experts with full access to the internet.

Abstract

Keywords
benchmark, evaluation, dataset, scalable oversight, alignment

Reviews and Discussion

Review
Rating: 8

This paper presents GPQA, a new dataset of challenging multiple-choice question-answer pairs. GPQA was created with the help of human experts to be challenging even to experts in the field, as a means of assessing LLM performance on frontier knowledge. The authors selected as annotators people holding or pursuing a PhD in the subject matter of each question (physics, chemistry, biology). Each question was created by one expert, double-checked by another expert, and then revised by the first expert. The authors then test non-experts (annotators who worked in other fields) on each question.

The dataset contains several parts. The "diamond split" is a subset of 198 questions where 2/2 experts agree on the answer and 2/3 non-experts get the answer wrong. The "main set" extends this to 448 questions; it may include questions that required revision, and questions where only 1/3 non-experts got the answer wrong. Finally, the "extended set" includes all 546 questions created. The authors recommend the main set for experiments.
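
For readers who want to run models on these splits, a minimal loading sketch (not part of the paper or this review) is shown below; it assumes the released data is hosted on the Hugging Face Hub under an identifier like "Idavidrein/gpqa" with one config per split, so the repository id, config names, and split name should be treated as assumptions.

```python
# Minimal sketch for loading the three GPQA splits. Assumptions: the data is
# on the Hugging Face Hub as "Idavidrein/gpqa" with one config per split and a
# single "train" split; access may also require accepting the dataset's terms.
from datasets import load_dataset

for config in ("gpqa_main", "gpqa_diamond", "gpqa_extended"):
    ds = load_dataset("Idavidrein/gpqa", config)["train"]
    print(config, len(ds))  # expected roughly 448 / 198 / 546 questions
```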

For the main set, experts achieve 71.9% accuracy, while non-experts (with access to Google) achieve 30.4% accuracy. Language models tend to fall in between, with the best performer, Claude 3 Opus, at 52.7%.

Reasons to Accept

The dataset provides an interesting testbed for question answering performance on highly difficult problems. With each question requiring two expert annotators, substantial resources went into the creation of this dataset; it will provide a good evaluation benchmark for the community. The finding that LLMs perform better than non-experts, even when equipped with Google, is a good demonstration of their potential.

Reasons to Reject

Despite the effort that went into creating this dataset, it is a small resource. With only 448 examples in the main set, it may be difficult, for example, to obtain tight 95% confidence intervals for model performance on the dataset.
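
To make that concern concrete, here is a back-of-the-envelope illustration (not from the paper or the review) of how wide a normal-approximation 95% confidence interval on accuracy is for each split size:

```python
import math

def ci_halfwidth(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation 95% CI for accuracy p on n questions."""
    return z * math.sqrt(p * (1 - p) / n)

for name, n in [("diamond", 198), ("main", 448), ("extended", 546)]:
    print(f"{name:>8} (n={n}): about ±{ci_halfwidth(0.5, n):.1%} near 50% accuracy")
# -> roughly ±7.0, ±4.6, and ±4.2 percentage points, respectively
```

So on the main set, two models whose true accuracies differ by only a few percentage points may not be reliably distinguishable.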

Questions to Authors

I am a little confused about the framing in the introduction. You argue that this dataset is constructed to evaluate models "on questions where we cannot produce or verify the truth on our own". Yet, we (taking the word to mean humanity) can answer all the questions on this dataset -- experts have this ability. Why would you expect evaluation techniques for frontier knowledge known by some humans to generalise to frontier knowledge known by no humans? If you do not have this expectation, how does this paper connect to scalable oversight?

Author Response

Thank you for your thoughtful review and question! We agree that because the dataset is on the smaller side, resolving small differences in accuracies with statistical significance will often not be possible.

Regarding your concern about generalization: This is a fantastic question that directly engages with the core question scalable oversight poses. While ultimately we can’t be certain scalable oversight methods will generalize, we can cleverly design our experiments to help us build confidence. Specifically, we want to be able to show that non-experts can supervise experts, even when the non-experts do not have all of the knowledge or understanding of the experts. If we can show this, then the question reduces to whether there are fundamental differences between the kinds of knowledge that human experts currently have, and the kinds of knowledge that future AI systems have.

So, if we can show that non-experts are able to supervise a wide range of questions, knowledge, and expertise in different domains, then as long as the kinds of questions, knowledge, and expertise future AI systems have is similar enough to what we’ve evaluated with non-experts, then we can be confident the oversight will generalize. We don’t think this assumption is guaranteed (e.g. we could imagine systems having different internal ontologies from us that are intractable to translate), but we think it should allow us to supervise a very wide class of questions or tasks.

We also don't need to purely rely on empirical evidence—we can also learn/build theoretical justifications for how oversight will scale. For example, in the paper “AI Safety via Debate”, which introduces debate as a method for scalable oversight, a complexity theory analogy is given that demonstrates how polynomial-time verifiers (i.e. judges) can formally verify “debates” (i.e. alternating witnesses) from computationally unbounded debaters on questions in PSPACE (and later work extends this to NEXP). Ideally we can bridge the theory-empirical gap from both ends, by improving the scope of our purely theoretical models, like the complexity theory proofs, and by building empirical models of our observations of scalable oversight methods.
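
For intuition about that analogy, consider TQBF (truth of quantified Boolean formulas), a PSPACE-complete problem: the formula is true exactly when the player controlling the existential quantifiers has a winning strategy in a game where the two players alternately fix variables and a judge only evaluates the quantifier-free formula at the final leaf, which takes polynomial time. The toy sketch below illustrates that game-tree view; it is an illustration of the style of argument, not the actual protocol or proof from "AI Safety via Debate".

```python
from typing import Callable, Dict, List, Optional, Tuple

def debate_value(prefix: List[Tuple[str, str]],
                 judge: Callable[[Dict[str, bool]], bool],
                 assignment: Optional[Dict[str, bool]] = None) -> bool:
    """Exhaustive minimax over the debate game tree (the debaters are unbounded)."""
    assignment = dict(assignment or {})
    if not prefix:
        # The only work the judge ever does: evaluate the formula at one leaf.
        return judge(assignment)
    quantifier, var = prefix[0]
    outcomes = [debate_value(prefix[1:], judge, {**assignment, var: value})
                for value in (False, True)]
    # "exists": the pro debater picks the move; "forall": the adversary picks.
    return any(outcomes) if quantifier == "exists" else all(outcomes)

# Example: exists x . forall y . (x or y) is true, since playing x = True wins.
print(debate_value([("exists", "x"), ("forall", "y")],
                   lambda a: a["x"] or a["y"]))  # True
```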

We hope this helps clarify how we think about scalable oversight and supervising frontier AI systems—please feel free to ask follow-up questions during the discussion period!

Comment

Let us know if you have any more questions about the motivation and context on scalable oversight—happy to discuss more!

Review
Rating: 8

This paper presents GPQA, a highly challenging evaluation dataset of 448 multiple-choice questions in the fields of biology, physics, and chemistry. The questions are designed to be "Google-proof": highly skilled non-experts achieve only 34% accuracy despite spending over 30 minutes per question on average with unrestricted access to the web. The dataset aims to aid the development of scalable oversight methods for supervising AI systems that surpass human capabilities. The proposed dataset construction pipeline, involving both experts and non-experts, ensures the difficulty and objectivity of the questions in GPQA. A high-quality subset of the dataset, GPQA Diamond, is also proposed, in which both experts answer correctly and the majority of non-experts answer incorrectly. The difficulty and objectivity have also been validated by humans in the follow-up analysis. The accuracies of several LLMs with self-consistency are also reported.

Reasons to Accept

  • The proposed evaluation dataset contains difficult and objective questions. The dataset construction process and follow-up analysis ensure the quality of the dataset.
  • The experiments show that the dataset is difficult for even strong LLMs such as GPT-4. The evaluation results of humans and LLMs should inspire future work on "scalable oversight".

Reasons to Reject

  • The size of the dataset, 448, is relatively small. Such small evaluation datasets may not be adequate to ensure the statistical significance of model performance.
  • The focus of the dataset is limited to biology, physics, and chemistry, which is narrow compared to the broad range of domains in practical applications of LLMs.
  • The paper seems to lack references to and comparisons with existing evaluation datasets, which makes its contribution unclear.

Author Response

Thank you for your thoughtful review, and for your kind words about the quality of the questions, our construction process, and our analysis/experiments! We agree that because the dataset is on the smaller side, resolving small differences in accuracies with statistical significance will often not be possible.

Regarding your comment on our lack of references and comparison to existing evaluation datasets, we agree that we could provide more context and comparison in the main paper. We currently have a detailed related work section in the appendix (section A.2), but we’ll happily add this back to the main paper for the camera-ready version to help contextualize our benchmark and highlight its novelty, especially given the newly increased page limit!

Comment

Thank you for your hard work on the paper, and the rebuttal! The authors' comments addressed my third concern, and I understand that the small size is acknowledged in the Limitations section. So I'll raise my score.

Review
Rating: 8

The paper introduces a challenging dataset designed to test the abilities of both human experts and AI systems in solving difficult multiple-choice questions across three scientific domains: biology, physics, and chemistry. The dataset consists of 448 questions, which were found to be extremely challenging for experts and non-experts alike and are designed to be inaccessible through simple Google searches, hence the term "Google-proof". The study highlights significant differences in performance between human experts, non-expert validators, and various AI models, with experts achieving 65% accuracy, non-experts 34%, and AI models like Claude 3 Opus around 60%.

Reasons to Accept

  • Innovative Dataset:
    The GPQA dataset fills a critical gap in existing benchmarks by focusing on extremely hard questions that require deep domain expertise and are resistant to simple internet searches.
  • Rigorous Validation Process:
    The paper details a thorough validation process for question objectivity and difficulty, involving multiple rounds of expert and non-expert reviews, ensuring the reliability and challenge of the dataset.

Reasons to Reject

  • Limited Diversity of Questions:
    The dataset, while challenging, consists of only 448 questions due to the high costs and complexity of question generation and validation. This small size might limit the statistical power and generalizability of the findings.

Author Response

Thank you for your generous review! We appreciate your words regarding the novel nature of the dataset and the quality and rigor of our validation process. We agree that because the dataset is on the smaller side, resolving small differences in accuracies with statistical significance will often not be possible.

Review
Rating: 9

Contribution:

  • 448 multiple choice questions in biology, physics, chemistry, written by experts
  • PhDs in the corresponding domain get only 65% of the questions right; skilled non-experts with Google access get 34%
  • GPT-4 achieves 39% accuracy, Claude 3 Opus 60%
  • GPQA may enable “scalable oversight” because of its high difficulty

GPQA seems to be a very useful benchmark that will benefit the community.

Reasons to Accept

Looking at the examples in Table 1, the examples in this task seem to be of great quality and incredibly difficult for non-experts. I'm very impressed with the quality of the data and I'm surprised that the non-expert human performance is so low.

The incentive structure for the data creators is very interesting. It's very useful to know exactly how much the workers were paid and how their compensation was tied to performance.

Reasons to Reject

The fact that the questions in GPQA are multiple choice could be a big limitation of this dataset, mainly because it might make it unrealistic. By this I mean that real-world researchers will likely never face tricky multiple-choice questions in their work; writing useful questions is much more likely to be a useful research task.

There is additionally a real risk that expert question writers found ways to game the rules of the data collection to maximize their pay, making the dataset even less realistic.

This leads me to wonder whether the fact that in-domain PhDs can't solve this task with a high degree of accuracy actually means that the questions are needlessly obscure rather than difficult in a useful way.

Questions to Authors

What accuracy would in-domain PhDs have if they had 30 minutes and access to the internet?

It would be interesting to see some ablations for the data collection choices. For example, how much do performance based financial incentives improve data quality?

It would be really interesting to measure how useful LLMs can be to experts or non experts that are trying to solve difficult questions like the ones in GPQA. Can the LLM output accurate and comprehensive rationales that are useful to a human's understanding?

Author Response

Thank you for your kind words about the quality of the questions, and for your thoughtful and engaged review! Responding to your comments/questions:

  • Multiple choice vs. free-response: To try and make the questions usable in a free-response setting, we instructed all question writers to format their questions such that they can be answered by an expert who doesn’t see the answer choices—while we didn’t enforce this strictly, it was an explicit prompt that expert validators were instructed to review, and anecdotally many of the questions seem to be structured properly regarding this.
  • Question writers gaming the rules: I agree this is in principle a possibility—qualitatively it doesn’t appear that the questions are written in this way, but I’d be curious to hear of strategies that could be used here!
  • In-domain expert validator accuracy w/ 30mins and internet access: Expert validators were actually allowed to access the internet, and they were allowed to spend as much time as they wanted answering the questions. However, on average they spent less than 30 minutes by choice, likely informed by their judgement of the expected return of spending more time on the questions.
  • Ablations for incentives: This is a great question—we’re also very interested in understanding better how incentives affect behavior. Unfortunately, we’d expect these experiments to be quite expensive, and given the costly nature of this project, we ultimately decided to use the payment structure we felt from previous data collection projects would be most likely to produce the best results.
  • Answering with help from LLMs: We completely agree—this is one of the directions we’re most excited to see work using GPQA focus on, particularly as it relates to scalable oversight!

Comment

Thanks for the responses

Final Decision

This is a dataset paper, which presents a small set (448 examples) of very challenging multiple-choice questions written by domain experts. The dataset is carefully constructed by hiring domain experts and going through careful validation steps. The difficulty of this dataset will enable studying new research questions such as scalable oversight. Human performance, as well as the performance of state-of-the-art models on the newly proposed dataset, is presented in the paper. I find the lack of analysis of model performance a bit disappointing, and reviewers have some concern about the size of the dataset. Nonetheless, gathering this type of dataset, even at a smaller scale, is very challenging, and it is sufficient to rank LLMs. The resulting dataset is already used in the community and is very valuable for evaluating the progress of LLMs.