MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding
We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning.
Abstract
Reviews and Discussion
This paper introduces MedXpertQA, a challenging benchmark comprising 4,460 clinical medical questions. Compared to previous, simpler question-answering (QA) datasets, MedXpertQA applies difficulty and diversity filtering to ensure data quality. Additionally, the authors evaluate frontier LLMs on the proposed benchmark. Evaluation results are also thoroughly discussed.
update after rebuttal
The authors' rebuttal addressed most of my concerns. I decided to keep my original score as accept. This paper is a valuable contribution for the medical LLMs community.
Questions For Authors
Please refer to weaknesses.
Claims And Evidence
This paper claims the increased difficulty and diversity of the proposed MedXpertQA dataset. These claims are supported by the observed decline in performance of evaluated LLMs and the analysis of data distribution.
Methods And Evaluation Criteria
The data collection and filtering in this paper are reasonable and solid. Table 1 offers a comparison among current medical QA datasets, which helps validate the quality of the proposed benchmark.
Theoretical Claims
This paper employs the Brier score to evaluate the posterior difficulty of each question, which is a logical approach.
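For readers unfamiliar with the metric, below is a minimal sketch of a Brier-score-based difficulty estimate, assuming access to per-option model probabilities; the function, variable names, and example values are illustrative and not taken from the paper.

```python
# Illustrative sketch (not the authors' code): estimating per-question
# "posterior difficulty" with the Brier score, assuming each reference
# model returns a probability distribution over the answer options.
import numpy as np

def brier_score(option_probs, correct_index):
    """Mean squared error between the predicted probabilities and the
    one-hot vector of the correct option; higher means harder / worse calibrated."""
    probs = np.asarray(option_probs, dtype=float)
    target = np.zeros_like(probs)
    target[correct_index] = 1.0
    return float(np.mean((probs - target) ** 2))

# Example: a 5-option question where the model puts 0.4 on the correct answer.
print(brier_score([0.4, 0.3, 0.1, 0.1, 0.1], correct_index=0))  # ~0.096

# Averaging this score across several reference models gives a posterior
# difficulty estimate that could be thresholded during difficulty filtering.
```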
Experimental Designs Or Analyses
Tables 3 and 4 benchmark leading LLMs on MedXpertQA, suggesting the effectiveness of reasoning LLMs (such as o1 and QVQ-72B) in addressing challenging medical questions. However, several weaknesses remain in the experimental setting:
- Lack of evaluation on different test-time scaling methods (such as RAG) and prompting strategies (few-shot, CoT) on the proposed benchmark. Will the retrieval-based test-time scaling methods facilitate the solution of challenging medical reasoning problems?
- Lack of benchmarking of specialized medical LLMs. Will these medical LLMs perform better on knowledge questions?
Supplementary Material
The additional material details the methodology for data filtering and includes a case-by-case comparison with other medical question-answering datasets.
Relation To Broader Scientific Literature
This paper provides a more challenging benchmark to further evaluate the performance of LLMs in the medical domain. Compared to previous works such as MedQA [1], MedMCQA [2], and PubMedQA [3], the proposed MedXpertQA can better evaluate the capability of solving challenging clinical problems.
[1] Jin, Di, et al. "What disease does this patient have? a large-scale open domain question answering dataset from medical exams." Applied Sciences 11.14 (2021): 6421.
[2] Pal, Ankit, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. "Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering." Conference on health, inference, and learning. PMLR, 2022.
[3] Jin, Qiao, et al. "Pubmedqa: A dataset for biomedical research question answering." arXiv preprint arXiv:1909.06146 (2019).
Essential References Not Discussed
The main contribution of this paper lies in presenting a varied and demanding medical benchmark. It would be beneficial if the authors could offer a comparison regarding difficulty and data distribution with the medical part of Humanity's Last Exam [4].
[4] Phan, Long, et al. "Humanity's Last Exam." arXiv preprint arXiv:2501.14249 (2025).
Other Strengths And Weaknesses
For weaknesses, please refer to the Experimental Designs and Essential References sections.
Other Comments Or Suggestions
Do AvgR and AvgK in Table 1 refer to performance on the reasoning and understanding subsets? The authors should clarify this.
We appreciate your insightful comments and hope to further address your concerns.
Response 1 - Experimental Designs Or Analyses
1.1 Evaluation on Different TTS Methods
Thank you for the valuable suggestion. We will include these comparisons in the next version of the paper. At the same time, the main motivation of our work is the high-quality benchmark itself, and dedicated works that focus on comparing inference-time methods (e.g., MedPrompt [1]) are better references for researchers interested in this direction. We also note that the inference scheme we use, i.e., zero-shot CoT, aligns with the mainstream consensus for evaluating foundation models. We hope the above points address your concerns.
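For concreteness, here is a minimal sketch of a zero-shot CoT evaluation setup of the kind described above; the prompt wording, option letters, and answer-extraction regex are illustrative assumptions rather than the authors' exact implementation.

```python
# Illustrative sketch (not the authors' exact prompt or code): zero-shot
# chain-of-thought evaluation of a multiple-choice question.
import re

def build_zero_shot_cot_prompt(question: str, options: dict) -> str:
    # options maps option letters (e.g. "A") to option text.
    opts = "\n".join(f"{k}. {v}" for k, v in sorted(options.items()))
    return (
        f"{question}\n{opts}\n\n"
        "Let's think step by step, then conclude with "
        "'The answer is X', where X is the option letter."
    )

def extract_choice(model_output: str):
    # Pull the final option letter out of the model's free-form reasoning.
    match = re.search(r"answer is\s*\(?([A-J])\)?", model_output, re.IGNORECASE)
    return match.group(1).upper() if match else None

# Accuracy is then the fraction of questions where
# extract_choice(response) equals the gold option letter.
```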
1.2 Specialist Model Results
Your suggestions are very valuable. We agree with the need to include medical specialist models in our evaluation. We did consider it, but left it out due to time constraints. As a preliminary experiment, we evaluated the specialist text model UltraMedical-70B [2]: UltraMedical-70B Results.
We provide the results of o1 for comparison, showing that the specialist model still falls behind the most advanced general reasoning model.
We plan to more thoroughly survey domain-specific models and present the full results in the next version of our paper. We will also systematically compare the performances of generalist models, specialist models, and humans to gain further insights.
Response 2 - Essential References Not Discussed
We totally agree that comparisons with other demanding benchmarks will be informative. We provide the dataset statistics of Humanity’s Last Exam (Biology / Medicine): Humanity’s Last Exam Dataset.
We note obvious discrepancies in relative model performances between the two benchmarks, which demonstrate the informative value of diverse benchmarks.
Regarding benchmark difficulty, we do recognize that the absolute scores for the Biology / Medicine subset of HLE are lower than MedXpertQA's. However, this does not take away from the quality and value of MedXpertQA, for three reasons:
- First, most of HLE's biology/medicine questions focus on biology instead of clinically relevant medical tasks, for instance: Biology Question Example.
- Second, HLE questions that cover comprehensive patient information, and thus support realistic clinical reasoning as most questions in MedXpertQA do, are scarce (~35 in total). Many questions related to medical tasks cover only single steps within the complex reasoning process required to form a clinical decision for a patient. For example, the following question represents the step of interpreting a single statistic (R-R interval) from a single piece of patient information (ECG results):
What is the longest R-R interval in seconds on this rhythm strip ECG?
In contrast, questions in MedXpertQA that involve ECGs typically include the image as one piece of information within a multifaceted, realistic patient profile. The answerer not only interprets the ECG but also needs to consider the role of this information within a more complex decision, e.g. proposing a diagnosis or treatment. The example illustrated in Figure 2 in our paper precisely shows this scenario. Therefore, while medical questions in HLE effectively pinpoint challenging individual tasks, they are less informative of models' holistic abilities and less clinically relevant. As Reviewer y9fU happened to mention, "reasoning-heavy" and "difficult" are two different concepts, and we believe MedXpertQA has an advantage over HLE in terms of the former.
- Third, MedXpertQA and HLE are fundamentally different types of benchmarks. The construction of HLE was extremely labor-intensive, requiring experts to manually design individual questions, whereas the construction process of MedXpertQA is scalable and systematic, enabling large-scale evaluations over questions that are clinically relevant and diverse. Moreover, the difference in benchmark scale is crucial. It is in fact possible for us to achieve a similar level of difficulty through further stringent filtering that reduces the dataset size: we simply need to retain questions that stump current models. This would, however, contradict the need for systematic and comprehensive coverage in medical evaluation, as we mentioned in Section 3.1. We believe MedXpertQA strikes a good balance.
Response 3 - Other Comments Or Suggestions
Yes, thanks for your suggestion! We will clarify this in the caption later.
The author rebuttal addressed most of my concerns. I will keep my score at this stage.
Thanks for your response. Some new information regarding the comparison between MedXpertQA and HLE has caught our attention, so we would like to present it here to further address your concerns.
Previously, we obtained the Biology / Medicine (B/M) scores from the original HLE paper for comparison. A recent study has conducted an evaluation specifically on HLE (Med) and MedXpertQA. It is important to note that this evaluation is not on the B/M subset but rather on a more fine-grained subset, Medicine, which also serves as a medical evaluation benchmark. The authors selected medicine questions from the B/M subset in HLE.
Their fine-grained subset allows for new comparisons on both dataset statistics and benchmark difficulty. Surprisingly, our benchmark appears to be more challenging than HLE (Med): https://postimg.cc/RNdmR5nz.
Quick Look:
| Benchmark | Llama3.1-Instruct-8B | Mistral-Instruct-7B |
|---|---|---|
| HLE (Med) | 13.6 | 14.6 |
| MedXpertQA Text | 13.2 | 11.4 |
Statistics:
Below, we present statistical information on the HLE (Med) benchmark based on the released dataset:
| Benchmark | Size | Avg. Length |
|---|---|---|
| HLE (Med) | 103 | 224.39 |
| MedXpertQA Text | 2450 | 257.37 |
The HLE, derived from original questions contributed by nearly 1,000 experts representing over 500 institutions across 50 countries, has attracted considerable attention. It has emerged as one of the most challenging benchmarks for assessing the limitations of state-of-the-art models.
It can be observed that MedXpertQA is not only more comprehensive but also more challenging. In comparison to HLE (Med), which was meticulously constructed through extensive human effort, our benchmark is approximately 24x larger and presents an even greater level of difficulty. This makes it the most extensive and demanding medical question-answering benchmark to date.
We hope that the above clarifications fully address all of your concerns.
The authors present a new expert-level knowledge and reasoning benchmark for real-world clinical scenarios. It appears to be the largest multimodal dataset in this category (with human annotations) and the second largest in the text-only category. The dataset seems to establish a new performance barrier on this task for the most common LMM models.
Questions For Authors
- It seems to me that this is a "simple (only in terms of comparison, not effort)" extension of previous datasets, collecting questions across more exams. Would it then not be easy to fine-tune on these exams (which I imagine are available to all, even for a price) and get a boosted score, knowing that all the questions are limited to these exams?
Claims And Evidence
Yes. (Minor question on this asked at the end)
Methods And Evaluation Criteria
Yes. (Not aware of the best practices in general)
Theoretical Claims
None.
Experimental Designs Or Analyses
Sound.
Supplementary Material
Not in depth. I assure everyone involved I have adhered to the leakage prevention statement in Appendix A.
Relation To Broader Scientific Literature
Extremely important.
- While there are larger datasets, this seems to be the most varied and largest expert annotated dataset.
- There is a clear need for a tougher benchmark. The results here depict a nice "barrier".
- Overall evaluation framework seems to be well within the accepted norms across various previous publications in the domain.
Essential References Not Discussed
None missing to my limited knowledge.
Other Strengths And Weaknesses
While not a weakness per se, why not evaluate a domain-specific model such as LLaVA-Med (https://github.com/microsoft/LLaVA-Med)?
Other Comments Or Suggestions
- If o1 has such a high score on the USMLE (stand-alone), does that simply imply that other exams are harder? Why is there such a drastic reduction in performance when exams of other countries are considered? It would be interesting to have a comment on whether this stems from the dataset structuring or is representative of the exams themselves.
- While I am not a clinician, I am not sure there is succinct literature proving that exam questions (which seem to be the sole data source) are completely representative of the real world. (Note: in my short search, I could not find a definite citation agreeing or disagreeing with this.) Maybe having a comment on this could provide stronger confidence to a non-clinical user/reader?
Ethics Review Issues
Massive medical dataset involving human expert annotation.
Thank you very much for your recognition of our work and your valuable suggestions!
Response 1 - Other Strengths And Weaknesses
You make a great point. Please refer to Response 1.2 (Specialist Model Results) to Reviewer vM2L. Thank you for your understanding!
Response 2 - Other Comments Or Suggestions
2.1 o1 Performance on USMLE
- We want to clarify that the difficulty of MedXpertQA is not due to additional question sources or countries. During dataset construction, we found USMLE questions to be the most difficult, even compared to specialist board exams. Specialty assessments were included to enhance clinical relevance, not to increase difficulty.
- To explain your observation of o1's high scores: USMLE questions in our benchmark and those in MedQA [1] are markedly different. USMLE does not publicize exam questions, so "USMLE questions" usually refers to mock questions devised by third-party experts, of varying difficulty. For MedXpertQA, we collected questions from high-quality, diverse sources and performed rigorous dataset construction steps, e.g., question filtering. This ensured the difficulty of MedXpertQA, accounting for o1's lower score compared to other benchmarks.
In summary, the difficulty of MedXpertQA is primarily due to the dataset construction process.
Note: EDiR questions only make up a small percentage of MedXpertQA, and the large majority of questions still represent US exams.
2.2 Real-World Representative
Point 1: First, we would like to emphasize that even if exam questions do not completely simulate real-world clinical scenarios, this is a common issue faced by all existing medical benchmarks (every baseline in Tables 1 and 2), not a new challenge introduced by our work. However, these benchmarks still represent the most important and impactful way to evaluate medical AI. For example, MedQA has already been cited over 100 times this year alone and has been used to evaluate frontier medical AI models such as MedPaLM 2 [2] and Med-Gemini [3]. We believe MedXpertQA can greatly contribute to medical AI progress, and that its impact is amplified by the widespread use of similar, yet less clinically relevant alternatives.
Moreover, your concerns are totally valid - we are aware of some works calling into question whether exam questions are representative of the real world. However, full representation of clinical relevance seems to be an elusive and distant goal that no existing benchmark can achieve, nor is it a claim of our work. We intend to convey in good faith that new work should be recognized primarily for the significant improvements it introduces over existing works, rather than for pursuing overarching goals.
Point 2: Secondly, we need to clarify that we do NOT claim that MedXpertQA completely simulates real-world scenarios. Our claim has always been that it significantly improves clinical relevance (as stated in the Abstract) compared with previous, widely adopted benchmarks, which we believe is sufficiently supported and represents an important contribution in itself. Compared with previous benchmarks, MedXpertQA improves clinical relevance through fundamental improvements for both subsets:
- For Text, our addition of medical specialist evaluations is certain to improve relevance, since realistic medical tasks are highly specialized and assigned to different departments. A single, general evaluation suite is evidently inadequate for evaluating the full spectrum of clinical tasks. GMAI-MMBench [4] raised a similar claim. Our discussions with medical expert collaborators further verified this point.
- For MM, current benchmarks commonly design surface-level questions without realistic patient information, leading to extremely limited relevance. MedXpertQA's questions were constructed by human experts and are intended to demonstrate how well a medical student would likely perform when facing real patients. These expert-designed questions far surpass, in quality and relevance, those of existing benchmarks constructed through fixed templates or automatic LLM generation. The corresponding images are also realistic and diverse.
Response 3 - Questions for Authors
Again, your concerns are reasonable, and we were also worried about this during our work. This is the main reason we did not publicize the sources of our questions. We also note that this issue is ubiquitous for benchmarks across different domains, such as the AIME datasets in mathematics. While omitting sources may not solve the problem completely, there's not more we can feasibly do for now, and we hope you understand the difficulty behind this issue.
References:
[1] https://arxiv.org/abs/2009.13081
[2] https://arxiv.org/abs/2305.09617
The paper introduces MedXpertQA, a novel and challenging benchmark for evaluating expert-level medical knowledge and reasoning. MedXpertQA consists of 4,460 questions covering 17 medical specialties and 11 body systems, divided into text-based (Text) and multimodal (MM) subsets. The authors employed a rigorous methodology for benchmark construction, including extensive filtering based on both AI and human expert evaluations, data synthesis to mitigate leakage risks, and multiple rounds of expert review to ensure accuracy. Additionally, they developed a reasoning-oriented subset to facilitate the assessment of advanced reasoning capabilities in medical AI models. The authors evaluate 16 leading language and multimodal models on MedXpertQA, demonstrating that current state-of-the-art models still face significant challenges in expert-level medical reasoning tasks.
Questions For Authors
- The paper mentions that MedXpertQA includes questions from 17 American specialty board exams. Could you clarify if the ground truth responses for the questions are also collected by rich sourcing? Are there expert annotations involved for quality control on responses?
- What criteria did you use to quantify "difficulty" in medical questions, and how did you calibrate these metrics across different humans to ensure the measurement of challenge levels is consistent?
- The approach to preventing data leakage relies primarily on paraphrasing ("rephrase the question through alternative expressions or structural adjustments"). Given that modern LLMs can recognize semantically equivalent content despite rewording, why do you believe this strategy is sufficient? Did you consider a verification or test with more extensive modifications that might more effectively prevent recognition?
- The metrics in Table 5 (perplexity and n-gram similarity) may not fully capture whether models have seen semantically equivalent content during training. Did you explore alternative methods to evaluate potential leakage?
- Do any MedXpertQA questions require decision-making over time (e.g., follow-up management after an initial diagnosis)? Would adding a longitudinal subset improve clinical realism?
- Since MedXpertQA is largely based on medical exams, have you considered incorporating non-medical exams to improve generalizability? Could MedXpertQA be used to evaluate interactive AI models that engage in back-and-forth questioning, mimicking real physician-patient interactions?
- How do you determine whether a question requires deep multi-step reasoning versus factual recall?
Happy to adjust scores if those major concerns can be solved.
Claims And Evidence
The authors clearly state the limitations of existing medical benchmarks and provide detailed justification for the development of MedXpertQA:
- Statistics on the coverage of the benchmark across specialties, body systems, and task types
- Explanation of data collection, filtering, and quality assurance processes
- Evaluation of 16 models and quantitative analysis demonstrating the benchmark's difficulty level compared to existing benchmarks
The clinical relevance claim is supported by including questions from 17 medical specialty board exams, though additional validation with practicing clinicians would further strengthen this claim.
Two claims that could benefit from stronger evidence:
- Data leakage prevention: The authors' approach to preventing data leakage relies primarily on having LLMs "rephrase the question through alternative expressions or structural adjustments while preserving all original information." This simple paraphrasing strategy is unlikely to be effective against modern LLMs that can recognize semantic equivalence despite surface-level rewording. The metrics in Table 5 (perplexity and n-gram similarity; a minimal sketch of such a check appears after this list) may not adequately capture whether models have seen semantically equivalent content during training. More rigorous methods and analysis would be needed to substantiate this claim.
- Reasoning-oriented evaluation: The distinction between reasoning and understanding questions was made using GPT-4o, but the paper would benefit from clearer operational definitions and validation of these categorizations by medical experts. The paper does not sufficiently demonstrate that the "reasoning" subset genuinely requires medical reasoning rather than just being more difficult questions.
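As a concrete illustration of what such surface-level checks measure, here is a minimal sketch of an n-gram overlap metric; the tokenization, choice of n, and function names are assumptions and not the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): an n-gram similarity
# score between a benchmark question and a candidate source text, of the kind
# used in surface-level leakage checks.
from collections import Counter

def ngram_overlap(text_a: str, text_b: str, n: int = 3) -> float:
    """Fraction of text_a's word n-grams that also appear in text_b."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    a, b = ngrams(text_a), ngrams(text_b)
    if not a:
        return 0.0
    shared = sum(min(count, b[gram]) for gram, count in a.items())
    return shared / sum(a.values())

# A perplexity-based check would instead score the question under the evaluated
# model and flag unusually low perplexity as possible memorization; as noted in
# the review, neither metric catches semantically equivalent rewrites.
```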
Methods And Evaluation Criteria
The proposed methods and evaluation criteria are appropriate and well-motivated for the problem at hand. The authors' approach to benchmark creation is methodical, involving (1) data collection from authoritative medical sources, including USMLE, COMLEX-USA, specialty board exams, and image-rich sources; (2) filtering using both AI expert and human expert filtering; (3) similarity filtering to ensure diversity and remove redundant questions; (4) question and option augmentation to mitigate data leakage and increase difficulty; and (5) expert review. The evaluation metric (accuracy) and zero-shot CoT prompting method are standard and appropriate for this task. The creation of distinct reasoning and understanding subsets enables a more nuanced evaluation of models' capabilities. Some areas that could be improved:
- no explicit discussion is provided on realistic patient contexts—would MedXpertQA be applicable to realistic scenarios requiring sequential diagnostic reasoning in medical records?
- no ablation study is presented to measure the impact of individual filtering steps. Understanding how much each filtering stage contributes to difficulty would be beneficial.
- The paper would benefit from more detail on how humans quantified difficulty, and a finer-grained analysis of human annotations across different medical specialties would add real value.
Theoretical Claims
The paper makes limited theoretical claims, focusing primarily on empirical evaluation. The authors' claim about the relationship between inference-time scaling and medical reasoning capabilities is supported by their experimental results.
Experimental Designs Or Analyses
The experimental designs and analyses are well-executed. The authors evaluate a diverse set of 16 models ranging from proprietary to open-source, including both vanilla models and inference-time scaled versions. The zero-shot CoT prompting approach is appropriate for the evaluation. The breakdown of performance by task type (reasoning vs. understanding) and by the medical system provides valuable insights into model capabilities and limitations. Below are some points that can be improved:
- The data leakage analysis methodology, which uses perplexity and n-gram-based metrics to assess potential memorization, could be a main issue for the realistic usage of the constructed benchmark.
- Further analysis is needed to determine whether MedXpertQA’s reasoning subset accurately captures distinct clinical reasoning.
Supplementary Material
I read the supplementary material on case studies, expert review guidelines and statistics on identified errors, and complete prompts used for attribute annotation and data augmentation.
Relation To Broader Scientific Literature
The authors position MedXpertQA effectively within the broader landscape of medical benchmarks. They provide a comprehensive comparison with existing text-based benchmarks (PubMedQA, MedQA, MedMCQA, MMLU) and multimodal medical benchmarks (VQA-RAD, VQA-Med, Path-VQA, SLAKE-En, PMC-VQA, OmniMedVQA, GMAI-MMBench, MMMU), highlighting key differences in terms of complexity, clinical relevance, and diversity.
Essential References Not Discussed
The paper has a good coverage of relevant literature. However, a few potentially relevant works not discussed include:
- Relevant research on medical reasoning benchmarks
- Recent work on medical AI evaluation and how such benchmarks might eventually be used to understand LLMs' practical real-life usage.
Other Strengths And Weaknesses
Strengths:
- The benchmark focuses on expert-level questions with filtering and augmentation, addressing the insufficient difficulty of existing benchmarks like MedQA and providing new evaluation datasets for recent models.
- Developing a reasoning-oriented subset demonstrates the recognition that medicine provides a rich context for assessing complex reasoning.
- By incorporating questions from 17 medical specialty board exams, MedXpertQA achieves clinical relevance and specialization diversity compared with previous benchmarks.
Weaknesses:
- Inadequate Data Leakage Prevention: The paper's approach to preventing data leakage relies primarily on simple paraphrasing ("rephrase the question through alternative expressions or structural adjustments"). This strategy is likely insufficient against modern LLMs that can recognize semantically equivalent content despite surface-level rewording. More sophisticated techniques would be needed to genuinely prevent the leakage of medical exam questions that may already be in training data.
- While the reasoning vs. understanding categorization is valuable, the paper lacks clear operational definitions of different reasoning types in medicine and relies on GPT-4o for these annotations, which raises concerns about their validity.
- Unclear validation of reasoning - do models exhibit clinically meaningful reasoning or just perform well on structured multiple-choice formats?
Other Comments Or Suggestions
- The benchmark would benefit from establishing human performance baselines on the released benchmark, which would provide valuable context for evaluating model performance and validating the difficulty levels.
- Consider including confidence calibration analysis for the evaluated models, as this is particularly important in high-stakes medical domains.
- Minor typos: Page 4, paragraph 2: "we instruct an LLM to annotate each question with its most relevant human body system" - It's unclear which LLM was used
Thanks for your thoughtful comments!
Response 1 - Claims And Evidence
1.1 Leakage Prevention
First, we note that data leakage prevention is an extra precaution we took on top of our main contribution, a challenging, clinically relevant benchmark. Models' subpar performance on MedXpertQA already reflects that its questions haven't been well learned during pretraining.
Our literature review found no effective method for reducing leakage risk of benchmarks, so we used the intuitive LLM rewriting strategy. We detail our extensive efforts here: https://postimg.cc/D4w1hw7g.
Changing questions drastically tends to lower their quality and relevance, and our method balances general quality and low leakage risk. Our method exceeds "simple paraphrasing" - it combines meticulously designed instructions with strict multi-round human reviews and error correction (Appendix D).
1.2 Leakage Risk Evaluation
- We intend to solidly show MedXpertQA's low leakage risk, not devise a new method.
- We highlight the validity of the metrics we used (Please reply for further clarification. Thanks!).
- No previous publication in medical AI benchmarking covered this.
- A recent work compared leakage risks of different benchmarks, showing the superiority of MedXpertQA: https://postimg.cc/yDrvtTB3.
1.3 Reasoning-Oriented Evaluation
- We initially considered having experts label Reasoning (R)/Understanding (U), but we realized that this task is quite straightforward for LLMs given clear prompt guidelines (an illustrative sketch of such a labeling prompt follows this list). We also provide expert-written answers and explanations collected from the sources (Table 13). This dense guidance enables a simplified form of annotation under expert supervision.
- We sufficiently considered the distinction between reasoning complexity and general difficulty when designing our labeling prompt (Table 13).
- Human reviewing: For 10% sampled questions (Text-490, MM-400), reviewers found 28 and 11 questions incorrectly labeled as R. We think this error rate (~4.3%) is acceptable.
- Empirical results (Section 5.2) reflect the validity of annotations. LRMs perform much better than their backbones on R, and this does not hold on U. We even note the opposite trend for Qwen-series models on U, which would not hold if U questions were merely easier. For single models, R scores aren't consistently lower than U.
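For illustration, a hypothetical labeling prompt of the kind described above might look like the following; the wording, placeholders, and function name are assumptions and not the actual prompt from Table 13.

```python
# Illustrative sketch (not the paper's Table 13 prompt): classifying a question
# as Reasoning (R) vs. Understanding (U) with an LLM, using the expert-written
# explanation as additional context.
LABEL_PROMPT = """You are annotating medical exam questions.
Label the question 'R' if answering it requires multi-step clinical reasoning
(e.g., integrating several findings to reach a diagnosis or treatment decision),
and 'U' if it mainly tests recall or understanding of a single fact.
Question: {question}
Expert explanation: {explanation}
Reply with exactly one letter: R or U."""

def build_label_prompt(question: str, explanation: str) -> str:
    # The returned letter from the LLM would then be spot-checked by human
    # reviewers, as the authors report doing for a 10% sample.
    return LABEL_PROMPT.format(question=question, explanation=explanation)
```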
Response 2 - Methods And Evaluation Criteria
2.1 Sequential Reasoning
Two of our question sources included many sequential questions. We found that each question had sufficient context and was not dependent on answering previous ones, so we used individual questions instead of preserving the sequence. Though MedXpertQA dissects the multistep process into separate questions, its coverage of all stages of clinical decision-making (Figure 3) ensures that it tests the wide-ranging abilities needed for realistic multistep tasks. Recent works, e.g. MedChain [1], have focused on this, and we'll add discussions of these works in the next version of the paper.
2.2 Statistics on Individual Filtering Steps
Please see Response 2.1 to Reviewer Rrtx.
2.3 Difficulty Quantification and Calibration
We combine two metrics: https://postimg.cc/RNrP4tWF.
Response 3 - Experimental Designs Or Analyses
Please see Response 1
Response 4 - Essential References Not Discussed
We'll add these works in the next version, omitting them here due to space limitations.
Response 5 - Other Strengths And Weaknesses
1 & 2: Please see Response 1
3: We show examples in F.1, F.3 (with human analyses of model errors). Models handle complex clinical information and make nuanced decisions between different possibilities. In addition, fine-grained analysis of model reasoning will be more informative once models achieve better performance.
Response 6 - Other Comments Or Suggestions
- Please see Response 2.2 to Reviewer Rrtx
- Valuable suggestion! We will consider adding it in the next version.
- In Page 4 paragraph 2, we used GPT-4o-2024-11-20. We'll clarify this in revisions.
Response 7 - Questions For Authors
- Ground truth answers were collected from question sources, whose QA pairs were designed by medical experts.
- Difficulty: -> Response 2.3
- Leakage Prevention: -> Response 1.1
- Risk Evaluation: -> Response 1.2
- Sequential decision-making: -> Response 2.1
- While MCQA is not tailored for interactive chat models, MedXpertQA covers diverse tasks, some of which touch on clinical questioning: https://postimg.cc/ZCVYzGXs.
- Reasoning: -> Response 1.3
While responding to your comments, we noticed aspects where our current paper did not sufficiently reflect our efforts. That being said, we believe the drastic improvement in benchmark difficulty is a convincing indication of our extensive efforts. We hope our response addresses your concerns and look forward to incorporating relevant information into the paper.
In this work, the authors contribute a new synthetic dataset for the evaluation of medical reasoning of large language models (LLM), and the newest models in this class, also called large reasoning models (LRM). The creation of the dataset follows several steps that are well described, to ensure the benchmarking tasks are varied, of sufficient difficulty and do not suffer from data leakage with the training data of the models. As the authors outline, it is very important to shape the benchmarks properly as they, in turn, shape the model development.
update after rebuttal
I have slightly increased my grade after the discussion period, but some important points have not been addressed by the authors, in particular a comment about the relatively low human performance on their benchmark, a comment about the need for the different steps based on the number of questions filtered out at each step, and the measurement of any performance bias between genders for the models in the benchmark. The question of question difficulty is key to the relevance of this benchmark, as the authors claim that one of its advantages is its difficulty, but the difficulty may stem from impossible questions, which would undermine its relevance.
Questions For Authors
I have disseminated questions throughout the review. I attempt a summary of the main ones here, but more details and context are given in the previous sections.
- provide the distribution of metrics across all the 37k initial questions, the number of filtered questions at each step, and the filtering threshold with respect to the whole distribution, to better grasp which filtering steps have been the most crucial, as it could inform the construction of future similar benchmarks, in the medical field or other fields. (addressed in the rebuttals, but no discussion about which steps are important and why)
- report human performance (and model performance on the benchmark at this step) (done for a junior doctor, not yet for a senior doctor; the overall performance would need further discussion as human performance is low. Are the questions unsolvable because of a lack of information? That would importantly affect the relevance of the benchmark if half of the questions are actually not solvable)
- figure 4 for all models - addressed when possible (with the table)
- figure 5 for all models + human performances - addressed
- variations of performances for gender, and ethnicity if/when available (not addressed so far in the rebuttal, the authors have reported the number of questions for male and female patients, but not the associated performances)
- change of model ranking with respect to other reasoning tasks in the literature (to show the relevance of yet another LLM benchmark)
- discussion of the actual relevance for example in routine clinical use
Claims And Evidence
The authors claim that their benchmark is of adequate difficulty and robust, and provides a satisfying evaluation of the models, which seems well supported by evidence. However, the claim that the benchmark reflects real-world clinical data and scenarios is more difficult to prove (as a writer) and to verify (as a reviewer). My main concern is that the audience of this conference has an extremely limited medical background, and the existence of a benchmark has the power to shift the orientation of future research in medical informatics, so the audience needed to further discuss the content of the benchmark and the relevance of the medical tasks should include medical doctors. A journal of medical informatics might be better suited for this task. That being said, providing benchmarks with tasks from other domains is valuable for the improvement of future models, but the claims of relevance and usefulness for the domain of origin should be attenuated in the absence of a thorough multi-disciplinary discussion.
One concern that I have regarding relevance would be that the state of the patient is provided as a clean summary narrative to the model, which already incorporates the doctor's work to gather all relevant information. It's possible that this preliminary data selection and organization is the most difficult task for the doctor.
Methods And Evaluation Criteria
The main evaluation criterion for the relevance of the benchmark is the performance of the different tested models on this benchmark, which seems to exhibit good discrimination ability between the different models. There is also a series of selection criteria involved throughout the different steps of the benchmark construction; however, they are only briefly described, and it would be very informative to provide the distribution of metrics across all the 37k initial questions, the number of filtered questions at each step, and the filtering threshold with respect to the whole distribution, to better grasp which filtering steps have been the most crucial, as it could inform the construction of future similar benchmarks, in the medical field or other fields.
Framing the article more as a detailed methodology for creating a benchmark would be a way to make it better adapted to this conference's audience rather than to a medical informatics journal, if that is the authors' wish. The authors have actually relied on similar information from previous works to assess the data leakage, illustrating the relevance of reporting all the steps of their work in more thorough detail.
It would also be extremely insightful, as at some of those steps there is a human performance assessment. This performance assessment is not on the complete final benchmark, but it can be a very good proxy for human performance. Maybe it would be worth reporting the 16 models' performances at this intermediate step to provide a comparison, even if this intermediate performance should be taken with caution as it occurs before the data leakage mitigation step.
In the impact statement, the authors mention the importance of ethical concerns, mostly with respect to medical data privacy; however, there are well-known issues with model biases. It could be relevant to assess the performance of the different models for different categories of patients. Gender seems to be mentioned in most medical questions. How about ethnicity? Also in this line of thought, the fact that this benchmark is in English and might be tailored to US (or English-speaking country) medicine could be discussed.
Theoretical Claims
no theoretical claims.
Experimental Designs Or Analyses
As mentioned before, many relevant "experimental" results evaluating the benchmark through all the steps are not reported, which is a major issue of this article, except for Table D in the Supplementary, which does not mention the number of questions evaluated at this step, only the number of questions flagged by the experts (unless I missed the info in the text, but it should be mentioned again with the table). Figures like Figure 4 should be reported for all models in the supplementary as well.
Figure 5 reports interesting results on performance variation with respect to medical specialty. Do the other models show the same pattern? How about the human performance assessed in step 2?
Supplementary Material
I have reviewed most of the supplementary material, but not the expert review guidelines.
Relation To Broader Scientific Literature
This article brings a new evaluation benchmark for LLMs and LRMs. The main novelty of this paper seems to be the discrimination power for method evaluation, however, it is not clear that this line of tasks is relevant for actual clinical practice, though it might be interesting to evaluate a different aspect of model reasoning in general.
Essential References Not Discussed
It would be interesting to discuss and cite relevant literature comparing the reasoning ability of those models in different domains to see if there is a specificity to the medical domain, or said otherwise, does the evaluation of this medical reasoning task provide a different ranking of the models compared to other reasoning benchmarks?
Other Strengths And Weaknesses
The provided dataset seems more diverse and thorough than previously existing benchmarks
Other Comments Or Suggestions
There is a typo: MMMU, p. 5.
We hope that our clarifications fully address your concerns!
Response 1 - Claims And Evidence
1.1 Target Audience
- We agree with the value of expert insight. We worked closely with medical practitioners when designing and reviewing MedXpertQA.
- An audience with an AI background is irreplaceable. Since the goal of MedXpertQA is to help researchers understand and improve model limitations, our most important target audience is medical AI researchers.
1.2 Real-World Representation
Please see Response 2.2 to Reviewer gh7D.
1.3 Concern on Patient Data Collection
- This is only one type of question in MedXpertQA, whose subtasks span different stages of medical decision-making (Figure 3). It also contains questions requiring data processing and organizing: https://postimg.cc/ZCVYzGXs.
- We acknowledge that MedXpertQA doesn't explicitly model the multi-step process of single patients. This is an inherent limitation of MCQA, and relevant works, e.g. MedChain [1], tackle the issue. Nevertheless, MedXpertQA's diverse task coverage ensures that it adequately tests wide-ranging capabilities.
- Finally, patient data collection may not be most challenging. Poor results of sota models reflect the difficulty of downstream clinical decision-making, the focus of MedXpertQA.
Response 2 - Methods And Evaluation Criteria
2.1 Benchmark Construction Details
We've attempted to present the construction process concretely and in detail. We provide metric formulas, hyperparameters, etc. This level of detail exceeds many prominent benchmarks, e.g. MMLU-Pro, MMMU. If any aspect remains unclear, we can provide further clarification.
- Unfortunately, we are unable to provide the initial dataset distribution, as labeling 37k questions for multiple attributes would be too costly. We appreciate your understanding.
- Statistics on questions remaining after each step: Dataset Filtering Stats.
- Our dataset construction is tailored to the medical domain, thus not intended for direct application to general AI. Our main contribution is the benchmark itself, especially MM, which fills crucial gaps in current medical multimodal evals.
2.2 Human Performance Evaluation
We already have some preliminary human performance results: https://postimg.cc/fkkJCjy0. The details:
- Expert (Junior) Score: Response distributions collected from question sources mostly come from medical students and reflect human performance. Rewriting has little impact on human performance, since it retains the question information. Thus, human performance on the original questions can be compared with final model results. The large number of responses per question (up to 238k) makes these stats highly representative. Fewer than 200 final questions lack response data, and we'll hire humans of similar caliber to answer them and complete the human performance evaluation. We'll incorporate this information before April 8 (AoE).
- Expert (Senior) Score: We will assess human experts with medical licenses/MDs for a separate expert performance measurement. These experiments will take longer and will be added in future versions.
2.3 Model Biases
- MedXpertQA primarily aligns with practices from the US. We'll cover this in the revised paper.
- Model bias analysis, while important, exceeds the scope of this work. Bias mitigation should be conducted by model developers, and works such as [2,3,4] would be better references for practitioners interested in these issues.
- We provide MedXpertQA's coverage of patient demographics. Gender (from keyword matching):
- Text: Male 1025, Female 903
- MM: Male 874, Female 550
- Total: Male 1899, Female 1453
For ethnicity, we re-examined 100 questions and found no mentions.
Response 3 - Experimental Designs Or Analyses
See Response 2 (R2) for stats.
Table D was from expert evaluations. Experts not only reviewed each question, but also did multiple rounds of editing.
Figure 5 results for all models: https://www.hostize.com/zh/v/TH3ddN2Bz3.
Response 4 - Essential References Not Discussed
Please see R2 to Reviewer vM2L.
For MedXpertQA specifically, an interesting result is the noticeable gap between R1 and o1, which would be unexpected if we directly extrapolated from other reasoning benchmarks.
Response 5 - Questions For Authors
- -> R2.1
- -> R2.2
- Figure 4 requires data on paired backbone LLMs and LRMs, thus can't be done for all models. Please refer to the main results table for results on more models.
- -> R3
- -> R2.3
- -> R4
- -> R1.1-1.3 MedXpertQA's main goal is evaluating models' fundamental medical abilities. These abilities provide crucial support for clinical use, but downstream applications are not our focus.
References:
[1] https://arxiv.org/abs/2412.01605
[2] https://www.nature.com/articles/s41591-024-03113-4
I thank the authors for their answers. Regarding response 2.3, I totally understand that the bias mitigation is not the authors' concern, but the benchmark could report the performance change between questions regarding men and women, the same way results for different medical fields are reported.
The dataset filtering stats show little variation after the first two steps; could you discuss that more? What value does this bring to the benchmark performance with respect to the work involved?
Thanks for your response! We appreciate your feedback and have made every effort to address your concerns thoroughly.
Question 1:
We have performed an additional analysis comparing all model performances on male and female patients, with the following results: Text: https://postimg.cc/wtB3qrqz, MM: https://postimg.cc/rz4swxkD. We see that there are no consistent trends of model bias. On Text, more models have slightly higher accuracies on male patients, and on MM, more models have higher accuracies on female patients. The performance gaps are small across all models. This is expected since patient gender is not a decisive factor for most questions in MedXpertQA (specific symptoms and examination results are generally the most important).
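To make this kind of analysis concrete, here is a minimal sketch of keyword-based gender tagging followed by per-group accuracy, under the assumption that per-question predictions and gold answers are available; the keyword lists and record fields are illustrative, not the authors' actual code.

```python
# Illustrative sketch (not the authors' analysis code): tag questions by
# patient gender via keyword matching and compare accuracy per group.
import re

MALE_KW = re.compile(r"\b(male|man|boy|gentleman)\b", re.IGNORECASE)
FEMALE_KW = re.compile(r"\b(female|woman|girl|lady)\b", re.IGNORECASE)

def gender_of(question_text: str) -> str:
    male = bool(MALE_KW.search(question_text))
    female = bool(FEMALE_KW.search(question_text))
    if male and not female:
        return "male"
    if female and not male:
        return "female"
    return "unspecified"

def accuracy_by_gender(records):
    """records: iterable of dicts with 'question', 'prediction', 'answer' keys."""
    totals, correct = {}, {}
    for r in records:
        g = gender_of(r["question"])
        totals[g] = totals.get(g, 0) + 1
        correct[g] = correct.get(g, 0) + (r["prediction"] == r["answer"])
    return {g: correct[g] / totals[g] for g in totals}
```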
Question 2:
After the first two steps, the primary factors influencing the number of questions are Similarity Filtering (Edit Distance Filtering and Semantic Similarity Filtering) and Expert Review, which filtered out 54 and 223 questions, respectively.
These two steps primarily focus on data quality, whereas the first two steps are designed to assess difficulty. Since our data is collected from high-quality and authoritative sources, it is expected that filtering based on difficulty has a greater impact on the dataset size compared to quality-based filtering. The roles of these two filtering stages are as follows:
- Similarity Filtering: Although this step has a relatively small impact on the number of questions, it is crucial for maintaining the robustness of the benchmark. Our evaluation of traditional visual medical benchmarks, such as SLAKE [1], reveals that these benchmarks often generate QA pairs based on fixed question templates, resulting in a limited number of question types (as shown in Table 1 of VQA-RAD [2]). For example, in SLAKE, 18 questions are "Does the picture contain a lung? Answer 'yes' or 'no'." Among these, 13 answers are "Yes." The performance of models varies significantly across different question types. Once a model learns specific features or shortcuts associated with a question type, it can exploit these patterns, leading to benchmark hacking. Thus, Similarity Filtering is essential.
Furthermore, due to the high diversity of our dataset, which is not generated from fixed templates, a relatively small number of filtered questions at this stage is expected. Although this filtering stage does not significantly reduce the dataset size, it removes highly similar questions that could otherwise bias model performance, undermining the robustness of the evaluation. To ensure the benchmark's reliability, we consider this step indispensable. (An illustrative sketch of this kind of similarity filter is given after this list.)
- Expert Filtering: Human involvement in dataset construction is critical for ensuring both domain expertise and factual accuracy. This is particularly important in specialized fields such as medicine, where expert review is necessary to maintain data quality. During the final step (multi-round expert review), experts directly corrected the most problematic questions they identified and deleted a smaller portion. Since direct editing is involved, the impact of expert reviewing is not fully reflected in the change in the total question number. Please see Table 6 in our paper for more details on this stage (a full tally of flagged questions including fine-grained error types). These modifications significantly enhance data reliability.
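As referenced above, here is a minimal sketch of a similarity filter combining an edit-distance ratio with embedding cosine similarity; the thresholds, greedy strategy, and function names are assumptions rather than the exact pipeline used for MedXpertQA.

```python
# Illustrative sketch (not the authors' pipeline): deduplicating questions with
# a character-level edit-distance ratio plus an embedding cosine similarity.
import difflib
import numpy as np

def edit_ratio(a: str, b: str) -> float:
    # Similarity in [0, 1]; 1.0 means identical strings.
    return difflib.SequenceMatcher(None, a, b).ratio()

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def similarity_filter(questions, embeddings, edit_thr=0.9, cos_thr=0.95):
    """Greedily keep a question only if it is not a near-duplicate of a kept one.
    `embeddings` is a list of precomputed sentence-embedding vectors."""
    kept = []
    for i, q in enumerate(questions):
        duplicate = any(
            edit_ratio(q, questions[j]) > edit_thr
            or cosine(embeddings[i], embeddings[j]) > cos_thr
            for j in kept
        )
        if not duplicate:
            kept.append(i)
    return [questions[i] for i in kept]
```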
The overall reviews of this paper are positive. The contribution of a new and challenging test bed for medical reasoning is significant, with more diverse and challenging questions, and the authors answered many of the reviewers' concerns during the rebuttal. The only remaining concerns relate to data leakage, which seems bound to happen for any open benchmark and was not discussed further after the rebuttal, and to the feasibility of the questions, which seems to be addressed by the expert filtering. I think that introducing such a challenging benchmark, with a large number of questions validated by experts, and showing that current LLMs are not able to solve them, is a significant contribution. Moreover, the sound methodology used to create the benchmark could be used to create follow-up datasets to evaluate benchmark leakage/overfitting.
For these reasons, I recommend accepting the paper.