EvidenceBench: A Benchmark for Extracting Evidence from Biomedical Papers
A large biomedical benchmark to evaluate LLMs' evidence-finding ability for scientific hypotheses and an automatic pipeline to create such benchmarks.
Abstract
Reviews and Discussion
The authors propose a novel pipeline for generating hypotheses and finding relevant evidence, which is useful for researchers investigating scientific hypotheses and seeking supporting evidence. They show that their pipeline achieves accuracy and validity close to those of human experts. They also benchmark different models on the two datasets they created, which shows there is still room for improvement.
Reasons to Accept
- The paper is written clearly and is easy to follow.
- The authors provide a comprehensive illustration and explanation of the pipeline.
- The datasets are constructed, with the assistance of GPT models, at a quality comparable to human-expert annotation.
- The task is important and useful for researchers who use generative AI to find supporting evidence for a hypothesis.
Reasons to Reject
- The authors target biomedical papers, but they do not say why they target such a field/area, or the significance of providing such a benchmark in this field. Why is identifying the most important pieces of evidence relevant to a hypothesis crucial for this domain? Why not other domain papers?
- It seems that the GPT models have the highest scores under this benchmark's evaluation. Could this be because the authors use GPT in the dataset-generation procedure [1]? How would this affect the validity of the benchmark?
- EvidenceBench-100k is only assessed using 50 randomly sampled papers for quality purposes; also, I could not find any table describing the human & GPT agreement for this assessment.
[1] LLM Evaluators Recognize and Favor Their Own Generations.
Questions to Authors
How do you calculate the p-values in Table 2?
Thank you for your thoughtful review and for recognizing the importance and utility of our proposed pipeline for hypothesis generation and evidence discovery in scientific research.
The authors target biomedical papers, but they do not say why they target such a field/area, or the significance of providing such a benchmark in this field. Why is identifying the most important pieces of evidence relevant to a hypothesis crucial for this domain? Why not other domain papers?
We targeted biomedical papers because the biomedical field has systematic reviews that follow well-established guidelines and structured formats (see an example here: https://doi.org/10.1002/14651858.CD001871.pub4), making it easier to cleanly extract evidence summaries.
Identifying evidence for biomedical hypotheses is crucial for evidence-based medicine. Meta-analyses and systematic reviews are built from these identified and retrieved pieces of evidence to provide clinicians and researchers with on-demand knowledge and keep them up to date with best practices. Many systematic reviews endeavor to present the most important evidence from each reviewed paper in a clean tabular format, so it is crucial for automated systems to identify important evidence while avoiding redundancies and trivialities.
As described in our annotation procedure, we require medical domain experts and biology PhD students to confirm the viability of our pipeline. Consequently, while our method could be applied to other fields, recruiting domain experts and PhD students from those fields would exceed our resource limits.
It seems that the GPT models have the highest scores using the evaluation of this benchmark. Would it be possible because the authors use GPT in the procedure of generating the dataset [1]? How will it affect the validity of the benchmark?
We used different GPT-4 variants for dataset generation and evaluation: GPT-4-0125-preview for dataset creation and GPT-4o for evaluation.
More importantly, GPT-4 does not generate any content included in the dataset itself. Its sole role is to label whether sentences support or refute hypotheses based on expert-written evidence summaries.
We will conduct additional experiments with Claude Sonnet 4 and Gemini 2.5 Pro before the discussion period ends.
EvidenceBench-100k is only assessed using 50 randomly sampled papers for quality purposes; also, it seems that I could not find any table describing the Human & GPT agreement on this.
Thank you for raising the sample-size concern. Although only 50 papers were annotated, each paper contains roughly 160 (sentence, aspect) pairs, totaling 8,111 pairs. In the paper, correlation and agreement metrics are calculated based on these 8,111 pairs and not on the 50 papers. Because annotations for sentences in the same paper are partially correlated, the effective sample size therefore lies between 50 and 8,111. In order to account for this correlation structure, we can use hierarchical bootstrap sampling (first sampling papers, then sampling sentences within papers) to compute confidence intervals and p-values.
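For concreteness, here is a minimal sketch of the two-level resampling scheme (papers first, then (sentence, aspect) pairs within each paper). The function and variable names are illustrative rather than taken from our codebase, and `metric_fn` stands in for any of the agreement metrics we report:

```python
import numpy as np

def hierarchical_bootstrap(metric_fn, papers, n_boot=10_000, seed=0):
    """Hierarchical bootstrap: resample papers, then sentences within papers.

    `papers` is a list where each element is a NumPy array holding the
    (sentence, aspect)-level annotation rows for one paper, and `metric_fn`
    maps a pooled array of rows to a scalar agreement metric.
    (Names and data layout are illustrative assumptions.)
    """
    rng = np.random.default_rng(seed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        # Level 1: resample papers with replacement.
        chosen = rng.integers(0, len(papers), size=len(papers))
        rows = []
        for p in chosen:
            paper = papers[p]
            # Level 2: resample (sentence, aspect) pairs within the paper.
            idx = rng.integers(0, len(paper), size=len(paper))
            rows.append(paper[idx])
        stats[b] = metric_fn(np.concatenate(rows))
    lo, hi = np.percentile(stats, [2.5, 97.5])  # 95% confidence interval
    return stats, (lo, hi)
```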
We will provide the hypothesis testing results for GPT-4o-mini (which was used to generate EvidenceBench-100k) before the end of the discussion period.
How do you calculate the p-values in Table 2?
We will explain how the p-values for Spearman's ρ are calculated; the other p-values are computed analogously. We test the null hypothesis $H_0: \Delta = 0$, where $\Delta = \rho_{\text{human-human}} - \rho_{\text{human-GPT}}$, using 10,000 paired bootstrap resamples that preserve the original annotation pairing. For each resample we recompute the two Spearman correlations and record the difference $\Delta^{*}$, thus obtaining a bootstrap distribution of 10,000 values. The observed statistic $\Delta_{\text{obs}}$, computed once on the original (unresampled) data, is compared against this null distribution: the two-sided p-value is the fraction of $\Delta^{*}$ values whose absolute magnitude exceeds $|\Delta_{\text{obs}}|$. This procedure answers whether GPT's agreement with humans is statistically distinguishable from the humans' agreement with each other.
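A minimal sketch of one way to implement this paired bootstrap test, simplified to a single human-human pair and a single human-GPT pair (variable names are hypothetical); it approximates the null by centring the bootstrap distribution of the difference at zero:

```python
import numpy as np
from scipy.stats import spearmanr

def paired_bootstrap_pvalue(h1, h2, gpt, n_boot=10_000, seed=0):
    """Paired bootstrap test: human-human vs. human-GPT Spearman agreement.

    h1, h2, gpt: label arrays over the same (sentence, aspect) pairs from two
    human annotators and the GPT labeller (illustrative simplification).
    """
    rng = np.random.default_rng(seed)
    n = len(h1)
    # Observed difference between the two agreement coefficients.
    delta_obs = spearmanr(h1, h2)[0] - spearmanr(h1, gpt)[0]
    deltas = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample indices, preserving pairing
        deltas[b] = spearmanr(h1[idx], h2[idx])[0] - spearmanr(h1[idx], gpt[idx])[0]
    # Centre the bootstrap distribution at zero to approximate the null (one common choice).
    null = deltas - deltas.mean()
    return float(np.mean(np.abs(null) >= abs(delta_obs)))
```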
As promised in our previous response, we have completed the additional experiments, including hypothesis testing for GPT-4o-mini and evaluations of Claude Sonnet 4 and Gemini 2.5 Pro.
Hypothesis testing for GPT-4o-mini
Here is the hypothesis testing result for GPT-4o-mini on 8,111 (sentence, aspect) pairs using standard bootstrap sampling:
| Metrics | Human & Human Average | Human & GPT Average | p-value |
|---|---|---|---|
| Exact Accuracy | 98.7 ± 0.5 | 98.6 ± 0.5 | 0.134 |
| F1 Binary | 66.5 ± 12.3 | 64.0 ± 9.9 | 0.362 |
| Cohen’s κ | 65.8 ± 12.6 | 63.3 ± 9.9 | 0.356 |
| Spearman’s ρ | 65.9 ± 12.6 | 63.4 ± 9.9 | 0.360 |
Here is the hypothesis testing result for GPT-4o-mini using hierarchical bootstrap sampling:
| Metrics | Human & Human Average | Human & GPT Average | p-value |
|---|---|---|---|
| Exact Accuracy | 98.7 ± 0.6 | 98.6 ± 0.6 | 0.339 |
| F1 Binary | 66.5 ± 9.9 | 64.0 ± 7.4 | 0.541 |
| Cohen’s κ | 65.8 ± 10.1 | 63.3 ± 7.6 | 0.536 |
| Spearman’s ρ | 65.9 ± 10.0 | 63.4 ± 7.6 | 0.533 |
All p-values are above 0.05, indicating a lack of evidence for any statistically significant difference between human-human and human-GPT agreement. We also note that the correlations and accuracies for human-GPT agreement are close to those for human-human agreement.
Evaluations of Claude Sonnet 4 and Gemini 2.5 Pro
Here is the experiment for Claude Sonnet 4 and Gemini 2.5 Pro with our baseline Chain-of-Thought prompt.
| Model | ER@Optimal |
|---|---|
| Gemini 2.5 Pro | 52.5 |
| Claude 4 Sonnet | 49.1 |
| GPT-4o | 48.1 |
| Claude 3 Opus | 41.1 |
| Gemini 1.5 | 42.7 |
These results show that newer models (Gemini 2.5 Pro, Claude 4 Sonnet) achieve better performance, demonstrating that GPT-family models do not dominate EvidenceBench.
Thank you for the response and the additional experimental results, which address my concerns. I recommend the acceptance of this work.
Thank you again for your detailed feedback. If you have a moment to adjust the overall score to match your latest recommendation, it would make the recommendation clearer for the area chairs.
This paper introduces EvidenceBench, a benchmark dataset for extracting relevant evidence sentences from biomedical research papers given a hypothesis. The authors propose an efficient LLM-powered pipeline for dataset construction, including hypothesis generation, decomposition of expert summaries into study aspects, and aligning aspects with sentences. This method allows for the creation of large-scale datasets with significantly reduced cost and time compared to purely manual annotation. The quality and validity of the pipeline outputs (hypotheses and alignments) are evaluated against human experts, showing comparable performance to biomedical PhD students for the single-sentence annotation task. The paper evaluates various LLMs and embedding models on the benchmark, but the performance still falls significantly short of the expert level. The task is important, and the datasets (EvidenceBench and EvidenceBench-100k) are original and potentially useful resources. The paper is mostly well-structured, but following the experimental details is challenging as much of the important information is located in the appendix rather than the main body.
Reasons to Accept
- The creation and release of EvidenceBench and the large-scale EvidenceBench-100k dataset is a major contribution to the field, providing a valuable resource for research on evidence extraction from biomedical literature.
- The LLM-powered pipeline for dataset construction is a novel approach that enables efficient and scalable creation of high-quality annotated data, significantly reducing manual effort.
- The paper provides human evaluation results demonstrating the validity and accuracy of the dataset construction pipeline outputs, lending credibility to the generated annotations.
- The paper provides a benchmark evaluation of various models on the task, offering insights into the current state-of-the-art performance and highlighting the significant gap that still exists compared to human expert performance, indicating the task's difficulty.
Reasons to Reject
- The definitions used in the annotation scheme, specifically regarding the decomposition of summaries into study aspects and the criteria for defining a "source of information" sentence, appear subjective. It is unclear how consistently these subjective definitions can be applied by different annotators or if they depend significantly on expert interpretation, which could impact the consistency and reliability of the dataset annotations.
- The benchmark shows LLMs performing significantly below human expert level despite the dataset being created using an LLM-powered pipeline. The paper does not fully clarify the implications of this result or explicitly discuss why evaluating the end-to-end performance of the dataset creation pipeline (or a similar guided approach that leverages the multi-step process) on the extraction task was not included in the benchmark. This could lead to ambiguity regarding the benchmark's objective or the potential for achieving higher performance with different methodologies not evaluated.
- A considerable amount of crucial detail regarding the experimental setup and methodology for the model evaluation is placed in the appendix (Appendices C and D) rather than the main body. This structure makes it challenging for readers to easily follow and fully understand the experimental procedures and interpret the results presented in the main tables without frequently referring to the appendix.
Questions to Authors
- Could the authors provide more detailed information or analysis regarding the subjectivity in the decomposition into study aspects and the definition of the source of information among experts? How was consistency ensured for these potentially subjective judgments during annotation?
- The dataset was annotated using an LLM-powered pipeline, yet LLMs show relatively low performance in the end-to-end benchmark evaluation. This makes the reviewer wonder about the primary objective of the evaluation: is it solely to measure the performance of current standalone SOTA LLMs on this task, or is there a broader aim? Given the LLM-based creation process, could the authors discuss why evaluating an approach that leverages a similar pipeline flow for the end task was not considered or presented as a potential direction for achieving higher performance?
- The authors consider recall-oriented evaluation criteria (Aspect Recall@k), but it is unclear from the description how this was considered in designing the instructions provided to the LLMs during the benchmark. Did the authors consider to explicitly instruct LLMs to optimize for recall within a specific number of top relevant sentences?
- The paper mentions filtering cases where less than 70% of study aspects from the evidence summary are covered by sentences. How frequent are such cases, and what is the distribution of the percentage of covered study aspects in the final dataset?
Thank you for your thorough review and for recognizing the significance of our contribution in creating EvidenceBench as a valuable resource for the biomedical research community.
The definitions used in the annotation scheme, specifically regarding the decomposition of summaries into study aspects and the criteria for defining a "source of information" sentence, appear subjective. It is unclear how consistently these subjective definitions can be applied by different annotators or if they depend significantly on expert interpretation, which could impact the consistency and reliability of the dataset annotations.
Could the authors provide more detailed information or analysis regarding the subjectivity in the decomposition into study aspects and the definition of the source of information among experts? How was consistency ensured for these potentially subjective judgments during annotation?
We addressed the subjectivity concerns through several measures to ensure consistent application of our annotation scheme. Prior to annotation, all annotators received comprehensive training on the concepts and terminologies, with detailed examples provided to ensure uniform understanding. Complete annotation guidelines are available in Appendix B.2, which standardize the criteria for defining source of information.
Our annotation protocol was specifically designed to mitigate individual interpretation biases. Each annotation team consisted of two Ph.D. researchers in bioinformatics who first completed annotations independently, then collaborated to reach consensus judgments. This two-stage process helps identify and resolve subjective disagreements before final labels are assigned.
The benchmark shows LLMs performing significantly below human expert level despite the dataset being created using an LLM-powered pipeline. The paper does not fully clarify the implications of this result or explicitly discuss why evaluating the end-to-end performance of the dataset creation pipeline (or a similar guided approach that leverages the multi-step process) on the extraction task was not included in the benchmark. This could lead to ambiguity regarding the benchmark's objective or the potential for achieving higher performance with different methodologies not evaluated.
The dataset was annotated using an LLM-powered pipeline, yet LLMs show relatively low performance in the end-to-end benchmark evaluation. This makes the reviewer wonder about the primary objective of the evaluation: is it solely to measure the performance of current standalone SOTA LLMs on this task, or is there a broader aim? Given the LLM-based creation process, could the authors discuss why evaluating an approach that leverages a similar pipeline flow for the end task was not considered or presented as a potential direction for achieving higher performance?
During dataset creation, our LLM-powered pipeline had access to evidence summaries extracted from systematic review papers written by human experts. These evidence summaries enable the generation of reliable benchmark questions and answers.
However, when we evaluate models on the benchmark, they do not have access to these evidence summaries or the original review article text. The models must perform the extraction task using only the primary research papers, without the benefit of the expert-curated contextual information that was available during dataset creation. This asymmetry is intentional – it allows us to create a challenging yet reliable benchmark by leveraging expert knowledge during creation while ensuring that the evaluation truly tests the models' ability to perform the evidence retrieval task independently.
A considerable amount of crucial detail regarding the experimental setup and methodology for the model evaluation is placed in the appendix (Appendices C and D) rather than the main body. This structure makes it challenging for readers to easily follow and fully understand the experimental procedures and interpret the results presented in the main tables without frequently referring to the appendix.
Thank you for your suggestion. We were originally constrained by the page limit, but we will revise the paper to include the experimental procedures and methodologies in the main body, given the extra page allowed in the camera-ready version.
The authors consider recall-oriented evaluation criteria (Aspect Recall@k), but it is unclear from the description how this was considered in designing the instructions provided to the LLMs during the benchmark. Did the authors consider to explicitly instruct LLMs to optimize for recall within a specific number of top relevant sentences?
The LLMs are given the optimal number of sentences to retrieve and are instructed to retrieve exactly that number. We provide the prompts in Appendix D.
The paper mentions filtering cases where less than 70% of study aspects from the evidence summary are covered by sentences. How frequent are such cases, and what is the distribution of the percentage of covered study aspects in the final dataset?
Approximately 50% of the data was filtered out due to insufficient coverage of study aspects. When a datapoint has less than 70% of its study aspects covered by sentences in the research paper, this typically indicates either that the evidence summary is too general and lacks specific evidence details, or that noise was introduced in earlier processing steps, such as during the evidence summary extraction phase.
In the final dataset, the average coverage of study aspects is 87%. 27% of datapoints fall within 70-79% coverage, 33% fall within the 80-89% range, and the remaining 40% fall within the 90-100% range.
Thank you for the clarification. Since some of my concerns are resolved by the responses, I raised my score.
This paper presents EvidenceBench and EvidenceBench-100k, a pair of similarly constructed benchmarks to evaluate the abilities of (text-only) LLMs to retrieve evidence for a given set of aspects of a hypothesis from a given paper. The benchmark construction occurs in two stages: generating hypotheses and aspects from paragraphs in review articles, and aligning the generated aspects with candidate sentences in the papers referred to by the review. The authors benchmark several recent LLMs, finding that they do not match human performance. I think this paper presents a thoughtfully built, useful benchmark whose construction is well described.
Reasons to Accept
This is a great example of well-validated use of model-based annotations for dataset construction. I appreciate that there is well-described expert annotation for each stage of the dataset construction — hypothesis/aspect extraction and candidate/aspect matching — and that that annotation process is repeated before scaling the dataset up to the 100k version. The authors report correlation/agreement/accuracy scores for all of this.
I think this is also a clever use of review articles as a source of expert data. These summaries already exist and are written for the target audiences of systems that this benchmark proposes to evaluate. I do have some concerns with this choice of dataset, which I discuss in the section below, but I don't think this is a major fault of this work.
The authors also clearly document licensing information for all of the collected data.
Reasons to Reject
Though I noted above that this seems like a good use of review articles, I do still have a few concerns about this use of review articles, largely due to my own unfamiliarity with the form. The authors state at several points that sentences are chosen for their "relevance" to aspects of the hypothesis. I would appreciate some degree of qualitative exploration of the data to estimate the distribution of that relevance - how much of the candidate claims in papers support, vs. refute the hypothesis? Without some estimation of this, it is possible that this benchmark would fail to assess systems that become a confirmation bias machine - i.e. systems that consider all of the evidence that supports a piece of evidence or an aspect, but ignores evidence that contradicts it.
Questions to Authors
The authors very carefully describe the process of finding and validating hypotheses, but do not actually describe how these hypotheses are used in the final evaluation of models. I assume that the hypothesis is provided as input to the retrieval/embedding models; is this the case?
Thank you for your thoughtful and encouraging review, and for recognizing EvidenceBench as a carefully constructed, well-validated benchmark that leverages expert annotations and openly licensed data.
The authors state at several points that sentences are chosen for their "relevance" to aspects of the hypothesis. I would appreciate some degree of qualitative exploration of the data to estimate the distribution of that relevance - how much of the candidate claims in papers support, vs. refute the hypothesis? Without some estimation of this, it is possible that this benchmark would fail to assess systems that become a confirmation bias machine - i.e. systems that consider all of the evidence that supports a piece of evidence or an aspect, but ignores evidence that contradicts it.
Thank you for raising an important point about relevance. In review papers, "relevance" typically indicates support rather than contradiction, as reviewers generally select evidence that substantiates their claims rather than presenting hypotheses intended to be disproven. However, we recognize that a robust evaluation should test systems' ability to effectively detect valid contradictory evidence.
To strengthen our benchmark's ability to detect confirmation bias, we propose creating negative variants of existing hypotheses. Since we have already identified supporting sentences for each hypothesis, we can negate the hypotheses to transform support relationships into refutation relationships.
We have conducted an experiment using negated hypotheses from the original EvidenceBench dataset with our baseline Chain-of-Thought prompt. We provide the results in the table below:
| Model | Original Hypothesis | Negated Hypothesis |
|---|---|---|
| GPT-4o | 48.1 | 46.7 |
| Claude 3 Opus | 41.1 | 36.7 |
The results demonstrate that both models show decreased performance when evaluating negated hypotheses, indicating that the task becomes more challenging when models must identify evidence that refutes rather than supports a claim. However, the decrease in performance is small for GPT-4o.
The authors very carefully describe the process of finding and validating hypotheses, but do not actually describe how these hypotheses are used in the final evaluation of models. I assume that the hypothesis is provided as input to the retrieval/embedding models; is this the case?
Yes, the hypotheses are directly incorporated into the model evaluation process. For large language models, we provide the hypothesis as part of the input prompt, and we provide the prompts in Appendix D. For embedding models, we compute sentence embeddings for both sentences from the paper and the hypothesis, then calculate their cosine similarity scores.
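For illustration, here is a minimal sketch of the embedding-based retrieval described above, with `embed` standing in for any sentence-embedding model (names and layout are placeholders, not our exact implementation):

```python
import numpy as np

def rank_sentences(embed, hypothesis, sentences, k):
    """Rank paper sentences by cosine similarity to the hypothesis embedding.

    `embed` maps a list of strings to an (n, d) array of embeddings; it is a
    placeholder for any sentence-embedding model, not a specific API.
    """
    vecs = embed([hypothesis] + sentences)
    h, s = vecs[0], vecs[1:]
    # Cosine similarity between the hypothesis and each paper sentence.
    sims = s @ h / (np.linalg.norm(s, axis=1) * np.linalg.norm(h) + 1e-12)
    top_k = np.argsort(-sims)[:k]  # indices of the k most similar sentences
    return [(sentences[i], float(sims[i])) for i in top_k]
```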
Thank you for your response! I hope that the negative hypothesis results will find their way into the paper. Good work!
This paper introduces EvidenceBench to measure LLM performance on finding evidence relevant to hypotheses in biomedical papers. The paper proposes a LLM driven pipeline for hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence. The paper demonstrates the pipeline’s validity and accuracy with a small set of 50 human-expert annotations. The paper also presents evaluation of language models and retrieval systems on the benchmark and find that model performances still fall significantly short of the expert level on this task.
Reasons to Accept
This paper introduces EvidenceBench to measure LLM performance on finding evidence relevant to hypotheses in biomedical papers. The paper proposes a LLM driven pipeline for hypothesis generation and sentence-by-sentence annotation of biomedical papers for relevant evidence. The paper demonstrates the pipeline’s validity and accuracy with a small set of 50 human-expert annotations. The paper also presents evaluation of language models and retrieval systems on the benchmark and find that model performances still fall significantly short of the expert level on this task.
Reasons to Reject
The paper demonstrates the pipeline’s validity and accuracy with a small set of 50 human-expert annotations. I find this a very low number to claim that the annotation generated by the proposed pipeline is accurate enough to train/fine-tune and test solutions that aim to be state-of-the-art for this task.
The paper states "There are two data sources for EvidenceBench and EvidenceBench-100k. First, a collection of 107,887 CC-BY open-sourced biomedical research papers where each research paper represents a datapoint. Second, a collection of 44,772 review papers from PubMed Central." This is again a very small benchmark compared to the tens of millions of articles in PubMed and PubMed Central. Also, authors do not provide any evidence on whether the papers selected in the benchmark are good representation/sample of the full biomedical publication repository.
Thank you for recognizing the importance of EvidenceBench in addressing the critical gap in evaluating LLMs' evidence-finding capabilities in biomedical literature. We appreciate your constructive feedback and address each of your points in detail below.
The paper demonstrates the pipeline’s validity and accuracy with a small set of 50 human-expert annotations. I find this a very low number to claim that the annotation generated by the proposed pipeline is accurate enough to train/fine-tune and test solutions that aim to be state-of-the-art for this task.
Thank you for raising the sample-size concern. Although only 50 papers were annotated, each paper contains roughly 160 (sentence, aspect) pairs, totaling 8,111 pairs. In the paper, correlation and agreement metrics are calculated based on these 8,111 pairs and not on the 50 papers. Because annotations for sentences in the same paper are partially correlated, the effective sample size therefore lies between 50 and 8,111. In order to account for this correlation structure, we can use hierarchical bootstrap sampling (first sampling papers, then sampling sentences within papers) to compute confidence intervals and p-values. We provide the results in the table below:
| Metrics | Human & Human Average | Human & GPT Average | p-value |
|---|---|---|---|
| Exact Accuracy | 98.7 ± 0.5 | 98.6 ± 0.5 | 0.46 |
| F1 Binary | 66.5 ± 9.9 | 65.8 ± 8.8 | 0.91 |
| Cohen’s κ | 65.8 ± 10.1 | 65.1 ± 9.0 | 0.90 |
| Spearman’s ρ | 65.9 ± 10.0 | 65.3 ± 8.9 | 0.91 |
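The agreement metrics in the table are standard; a minimal sketch of how each could be computed for a single annotator pair, assuming binary relevance labels over the 8,111 (sentence, aspect) pairs (names are illustrative):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def agreement_metrics(labels_a, labels_b):
    """Pairwise agreement on binary (sentence, aspect) relevance labels.

    `labels_a` and `labels_b` are 0/1 arrays over the same pairs; the keys
    mirror the metrics reported in the table above (illustrative only).
    """
    return {
        "exact_accuracy": accuracy_score(labels_a, labels_b),
        "f1_binary": f1_score(labels_a, labels_b),        # F1 on the positive class
        "cohens_kappa": cohen_kappa_score(labels_a, labels_b),
        "spearmans_rho": spearmanr(labels_a, labels_b)[0],
    }
```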
There are two data sources for EvidenceBench and EvidenceBench-100k. First, a collection of 107,887 CC-BY open-sourced biomedical research papers where each research paper represents a datapoint. Second, a collection of 44,772 review papers from PubMed Central. This is again a very small benchmark compared to the tens of millions of articles in PubMed and PubMed Central.
We acknowledge that our benchmark represents a subset of the broader biomedical literature available in PubMed and PubMed Central. The scale of our dataset was determined by budget constraints associated with large-scale LLM API calls required for our automated evaluation pipeline. However, we emphasize that our methodology is designed for scalability and reproducibility. Our complete data generation pipeline (including all prompt templates in Appendices C and G) is fully documented in the paper to enable easy replication and extension. The procedures we describe allow researchers to apply our approach to additional systematic reviews and expand the benchmark as needed.
Also, authors do not provide any evidence on whether the papers selected in the benchmark are good representation/sample of the full biomedical publication repository.
We quantified topical diversity for the 100k papers using 2025 MeSH headings to demonstrate representativeness. The dataset covers all 16 top-level MeSH branches (A–N, V, Z), confirming coverage of every major biomedical domain. With 17,505 distinct MeSH descriptors representing 56.5% of the entire 2025 vocabulary (30,956 terms), the breadth matches that of a full-year MEDLINE slice and would be unattainable if the collection were thematically narrow. The branch-level distribution shows a Gini-Simpson diversity of 0.81 (where 1 indicates perfect evenness), demonstrating that no single area dominates the corpus. Additionally, MeSH tree numbers reach a median depth of 5 and 90th-percentile of 7, indicating the benchmark encompasses both fine-grained molecular and procedural topics alongside high-level concepts. This comprehensive topical coverage provides strong evidence that our selected papers constitute a representative sample of the biomedical literature.
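As an illustration of the diversity figure, the Gini-Simpson index is $1 - \sum_i p_i^2$ over the branch-level distribution; a minimal sketch, with the input format assumed for illustration:

```python
import numpy as np

def gini_simpson(branch_counts):
    """Gini-Simpson diversity of the MeSH branch distribution: 1 - sum(p_i^2).

    `branch_counts` maps each top-level MeSH branch (A-N, V, Z) to the number
    of papers tagged with that branch (assumed input format).
    """
    counts = np.array(list(branch_counts.values()), dtype=float)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))  # 1.0 would mean perfectly even coverage
```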
Dear Reviewer F3p6
Thank you again for your thorough review of our paper. Since the discussion period is coming to an end soon, we would greatly appreciate it if you could indicate whether our rebuttal addresses the questions raised in your initial review.
Thank you for your time, The EvidenceBench Team
Dear Reviewer F3p6
Thank you again for your thorough review of our paper. Since the discussion period comes to an end today, we would greatly appreciate it if you could indicate whether our rebuttal addresses the questions raised in your initial review.
Thank you for your time, The EvidenceBench Team
We thank the reviewers for their insightful feedback. All responding reviewers either raised their scores or reinforced their recommendations for acceptance following the rebuttal period. Our responses focused on three main areas.
First, we addressed questions about the benchmark design. For reviewer Gnr9, we conducted new experiments with recent LLMs, demonstrating that Gemini 2.5 Pro and Claude 4 Sonnet outperform GPT-4o, confirming that the benchmark is not biased towards GPT-family models. Reviewer Gnr9 stated the new experiments "address my concerns" and "recommend the acceptance of this work". For reviewer 4ogA, we introduced a new evaluation using negated hypotheses, showing the benchmark can effectively test for this behavior. Reviewer 4ogA positively responded to our new evaluation.
Second, we clarified our dataset’s scale and representativeness for reviewer F3p6. We noted that our human validation corresponds to over 8,000 annotated sentence pairs, and we performed hierarchical bootstrap analysis to account for possible correlations among annotations from the same research article. We also demonstrated that our benchmark covers all major biomedical domains. We thank F3p6 for their constructive recommendations, though reviewer F3p6 did not reply to our clarifications.
Finally, we clarified the benchmark’s methodology and provided additional descriptions for reviewer b6XD. These clarifications led reviewer b6XD to raise their score to 7.
The revisions have created reviewer consensus around our key contributions: the creation and release of EvidenceBench as a "large-scale", "valuable", "high-quality" (reviewer b6XD) and "useful" (reviewers Gnr9 and 4ogA) resource for the community, and a "clever", "well-validated" (reviewer 4ogA), "novel" (reviewers b6XD and Gnr9) LLM-based pipeline to generate such benchmarks at scale. Reviewer 4ogA explicitly noted this work as a "major contribution to the field" and "a great example of well-validated use of model-based annotations".
This paper introduces EvidenceBench, a corpus instantiating an "evidence extraction" task in which the goal is to identify the sentences in an article that are most relevant to a given hypothesis. There was consensus amongst reviewers that this is an important task and that the authors have done a laudable job of validating each annotation step. The authors also did a nice job of addressing remaining concerns during the discussion period.