PaperHub
Rating: 6.3 / 10 · Spotlight · 3 reviewers
Scores: 4, 3, 3 (min 3, max 4, std 0.5)
ICML 2025

Visual and Domain Knowledge for Professional-level Graph-of-Thought Medical Reasoning

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24

Abstract

Keywords
Medical, Medical Reasoning

Reviews and Discussion

Official Review
Rating: 4

The paper introduces a novel dataset specifically designed for professional-level medical reasoning in medical visual question answering. It leverages a decade-long collection of MRI and clinical data related to Hypoxic-Ischemic Encephalopathy (HIE), enriched with expert annotations and insights. The authors generate clinical question-answer pairs and MRI interpretations to facilitate comprehensive diagnosis and neurocognitive outcome prediction. Furthermore, the paper proposes the innovative Clinical Graph of Thoughts (CGoT) model that integrates domain-specific medical knowledge with LVLMs. The reported results, including a 15% absolute gain on key neurocognitive outcome tasks, underscore the dataset’s potential and the model’s promising performance.

Questions for Authors

The paper mentions that domain prompts are defined by doctors with varying years of experience, specifically distinguishing between low-experience and high-experience physicians. Could you elaborate on whether this difference in clinical expertise leads to significant performance variations or impacts the model’s outputs? How have you ensured that the variation in clinical experience among the doctors does not bias the dataset or the evaluation of the Clinical Graph of Thoughts (CGoT) model?

Claims and Evidence

The results presented in the paper convincingly support the conclusions drawn by the authors. The experimental findings are robust and clearly demonstrate that the proposed approach leads to significant improvements over baseline methods. Detailed evaluations on the dataset illustrate that the clinical reasoning and predictions are well-grounded in the data. The evidence provided aligns with the claims, ensuring that the novel contributions are both meaningful and reproducible. Overall, the alignment between claims and supporting experimental outcomes is strong and well-articulated.

Methods and Evaluation Criteria

The methods and evaluation criteria used in this paper are well-suited for the problem of clinical diagnostic reasoning. The paper describes a thorough experimental setup and uses appropriate metrics to assess performance, ensuring that the impact of the proposed methods can be reliably measured. The benchmark dataset, constructed from real clinical data, provides a realistic and challenging testbed for evaluating diagnosis and prognosis tasks. The inclusion of both visual and clinical data in the evaluation strengthens the overall design. In summary, the approach is methodologically sound and the evaluation criteria are fitting for the intended application.

Theoretical Claims

The paper does not involve formal theoretical proofs or derivations, which is appropriate given its focus on clinical application and experimental demonstration.

Experimental Design and Analysis

The experimental design is well-constructed and thoroughly detailed, providing confidence in the validity of the analyses performed. The paper clearly explains the procedure for data collection, annotation, and the subsequent generation of clinical question-answer pairs. The evaluation is comprehensive, with proper consideration given to both visual interpretation and neurocognitive outcome prediction tasks. Each experiment is designed to test specific aspects of the proposed Clinical Graph of Thoughts (CGoT) model, and the results are transparently presented. Overall, the experimental design and analysis section is robust and leaves little room for ambiguity regarding the scientific claims.

Supplementary Material

The supplementary material has been carefully reviewed and does not raise any issues.

Relation to Broader Scientific Literature

The paper makes a significant contribution by providing a benchmark that is closely aligned with real-world clinical applications. By incorporating a dataset sourced from relevant clinical settings and focusing on professional-level medical reasoning, the work fills an important gap in the literature on medical visual question answering. It relates well to prior studies in clinical diagnosis while also expanding the scope by integrating both imaging and clinical data for neurocognitive outcome prediction. The proposed CGoT model builds upon existing concepts in LVLMs and adapts them with domain-specific insights, further positioning the work within the broader scientific discourse. Overall, the paper successfully bridges the gap between academic research and clinical practice.

Essential References Not Discussed

No

Other Strengths and Weaknesses

One potential concern is that the paper mentions the dataset is sourced from Massachusetts General Hospital, which might conflict with the anonymity requirements of a double-blind review process. This detail could inadvertently reveal the institution behind the dataset, thereby compromising the anonymity of the submission. Aside from this, the paper is strong in its methodological design and clinical relevance. The combination of long-term data collection, expert annotations, and integrated clinical reasoning within the model is highly commendable. Addressing the anonymity issue explicitly would help mitigate any ethical or procedural concerns.

Other Comments or Suggestions

Figure 3 could be improved as the current design allows for some overlap between images and text, which hampers clarity. A clearer layout that avoids any overlap will improve both readability and the overall presentation of the results. The authors might consider revising the figure with better spacing or annotations to ensure that all visual elements are distinct and easy to interpret. Enhancing the visual quality here would further support the clarity of the paper's contribution. Overall, this is a minor presentation issue that, once fixed, will polish an otherwise well-structured paper.

Author Response

We thank the reviewer for the constructive feedback; we have addressed the concerns below.

Q1. One potential concern is that the paper mentions the dataset is sourced from Massachusetts General Hospital, which might conflict with the anonymity requirements of a double-blind review process. This detail could inadvertently reveal the institution behind the dataset, thereby compromising the anonymity of the submission. Aside from this, the paper is strong in its methodological design and clinical relevance. The combination of long-term data collection, expert annotations, and integrated clinical reasoning within the model is highly commendable. Addressing the anonymity issue explicitly would help mitigate any ethical or procedural concerns.

Thanks for recognizing the value of our dataset and method. We address the anonymity issue further here. While the dataset used in this study originates from Massachusetts General Hospital (MGH) and was collected and shared under appropriate IRB approvals, this does not imply that the research was conducted by or on behalf of MGH; it simply reflects that we were able to collect HIE-specific data from one of the world's leading hospitals, with some of the most specialized clinical expertise in HIE. The use of the MGH dataset is independent of any institutional affiliation and does not reveal the authors' identities.

Moreover, before submission we carefully reviewed the manuscript to ensure that any potentially identifying references to the authors' institutions were removed or anonymized, in full compliance with double-blind review requirements. We will restore the appropriate acknowledgments and dataset provenance upon acceptance, as per standard publishing practice.

Q2. Figure 3 could be improved as the current design allows for some overlap between images and text, which hampers clarity. A clearer layout that avoids any overlap will improve both readability and the overall presentation of the results. The authors might consider revising the figure with better spacing or annotations to ensure that all visual elements are distinct and easy to interpret. Enhancing the visual quality here would further support the clarity of the paper's contribution. Overall, this is a minor presentation issue that, once fixed, will polish an otherwise well-structured paper.

We thank the reviewer for the thoughtful suggestion regarding Figure 3. We agree that improving the layout and removing the overlap between images and text will enhance the clarity and readability of the figure. In the revised version, we will update Figure 3 to ensure that all visual elements are clearly separated, with improved spacing and annotations. We appreciate the reviewer’s feedback on this presentation.

Q3. The paper mentions that domain prompts are defined by doctors with varying years of experience, specifically distinguishing between low-experience and high-experience physicians. Could you elaborate on whether this difference in clinical expertise leads to significant performance variations or impacts the model’s outputs? How have you ensured that the variation in clinical experience among the doctors does not bias the dataset or the evaluation of the Clinical Graph of Thoughts (CGoT) model?

The initial annotations were performed by a junior yet experienced fellow and subsequently reviewed through consensus checks with senior experts. This process resulted in a single, unified annotation set. By consolidating annotations through expert agreement and consensus, we aimed to minimize inter-reader variability and reduce potential bias introduced by individual experience levels in both the dataset and the evaluation of the CGoT model. We will include more details of the annotation protocol in the revised manuscript to enhance transparency.

Official Review
Rating: 3

The authors introduce the HIE-Reasoning dataset and a Clinical Graph of Thought (CGoT) model for professional-level Medical Visual Question Answering (MVQA) focused on Hypoxic-Ischemic Encephalopathy (HIE). The dataset, built from a decade of MRI and clinical data, includes 749 expert-annotated question-answer pairs and aims to simulate complex clinical reasoning. CGoT integrates visual and textual clinical knowledge into LVLMs, outperforming baselines by ~15% on neurocognitive outcome prediction. Evaluations reveal limitations in existing LVLMs for such tasks.

Questions for Authors

N/A

Claims and Evidence

  • The HIE-Reasoning dataset is a pioneering effort, shifting MVQA from basic perception to clinically relevant reasoning, with tasks like neurocognitive outcome prediction.

Methods and Evaluation Criteria

  • CGoT’s graph-of-thought approach, mimicking clinical workflows, creatively decomposes complex tasks into manageable steps, enhancing interpretability and performance.
  • Fig. 1 contrasts existing MVQA with HIE-Reasoning but lacks examples of “clinically irrelevant” questions for clarity.
  • Sec. 5.2: Med-Flamingo’s failure is noted, but no discussion of why or how CGoT avoids similar pitfalls (e.g., hallucination).

Theoretical Claims

N/A

Experimental Design and Analysis

  • The dataset’s decade-long curation and expert validation ensure high clinical fidelity, while CGoT’s significant gains (e.g., 71.73% vs. 56.60% on outcome prediction) validate its efficacy.
  • Ablation studies (Tables 3, 4) lack depth—e.g., no analysis of visual knowledge components (ADC vs. ZADC) or alternative reasoning structures beyond task omission. How robust is CGoT to variations in graph design?
  • Sec. 4.2.1: How are ZADC thresholds (−2) justified beyond citing Bao et al. (2023)? Sensitivity to this choice is unexplored.

Supplementary Material

N/A

Relation to Broader Scientific Literature

  • It is good that recent work is included, but some related evaluation work is missing [1,2].

[1] Yan, Q., He, X., Yue, X., et al. "Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA." arXiv preprint arXiv:2405.20421, 2024.

[2] Xia, P., Chen, Z., Tian, J., et al. "CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models." Advances in Neural Information Processing Systems, 2024, 37: 140334-140365.

Essential References Not Discussed

As noted in the Relation to Broader Scientific Literature section.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for the valuable suggestions and address the concerns below (within the 5000-character limit).

Q1. Fig. 1 lacks examples of "clinically irrelevant" questions.

"Clinically irrelevant questions" include general or superficial queries about images, such as modality ("What type of scan is this?") or organ identification ("What organ is shown?"). Although valid in general VQA, these questions lack direct clinical relevance or diagnostic utility.

Q2. How does CGoT avoid hallucination pitfalls like Med-Flamingo?

  • Structured Reasoning Pathway: CGoT decomposes tasks into clinically meaningful subtasks aligned with expert workflows, increasing interpretability and traceability.
  • Clinical Knowledge Grounding: Reasoning steps explicitly use specialist-curated clinical knowledge (e.g., ADC maps, injury scores), anchoring predictions in verifiable evidence.
  • Modular Task and Output Verification: Each subtask undergoes ground-truth validation, reducing hallucinations via fine-grained feedback. Intermediate outputs (injury scores, consistency signals) allow both users and the model itself to validate predictions and detect inconsistencies early in the pipeline (a minimal sketch of this dependency structure follows below).
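
To make the structured reasoning pathway concrete, here is a minimal sketch of how a dependency-ordered reasoning graph of this kind could be represented and executed. This is not the authors' implementation: the exact edge set is assumed from the rebuttal's description (lesion tasks feed injury scoring, which feeds outcome prediction), and `query_lvlm` is a hypothetical stand-in for the actual LVLM call.

```python
# A minimal sketch (not the authors' implementation) of a dependency-
# structured reasoning pipeline. Task names follow the six benchmark
# tasks; the exact edge set is assumed from the rebuttal's description.
from graphlib import TopologicalSorter

CGOT_GRAPH = {  # subtask -> prerequisite subtasks whose outputs it consumes
    "lesion_grading": [],
    "lesion_anatomy": ["lesion_grading"],
    "rare_lesion_locations": ["lesion_anatomy"],
    "mri_injury_score": ["lesion_grading", "lesion_anatomy"],
    "neuro_outcome": ["mri_injury_score"],
    "interpretation_summary": ["mri_injury_score", "neuro_outcome"],
}

def query_lvlm(task, image, context):
    """Hypothetical LVLM call: the prompt for `task` would combine the
    image with the outputs of its prerequisite subtasks. Returns a
    placeholder here so the sketch runs end to end."""
    return f"<answer to {task} given {sorted(context)}>"

def run_cgot(image):
    results = {}
    # Topological order guarantees each prerequisite's output exists
    # before any subtask that depends on it is queried, so intermediate
    # outputs can be checked before they propagate downstream.
    for task in TopologicalSorter(CGOT_GRAPH).static_order():
        context = {dep: results[dep] for dep in CGOT_GRAPH[task]}
        results[task] = query_lvlm(task, image, context)
    return results
```

The topological traversal is what makes each intermediate output inspectable before later subtasks consume it, which is the verification property the bullets above describe.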

Q3. Ablation studies on visual knowledge components (ADC vs. ZADC).

Table 3-1. Ablation study on visual knowledge components

| Raw ADC | ZADC | Brain Anatomy | Lesion Grading | Lesion Anatomy | Rare Locations | MRI Injury Score | Neuro Outcome | Interpretation Summary |
|---|---|---|---|---|---|---|---|---|
| ✔️ | ✔️ | ✔️ | 62.41%, 0.0703 | 43.57% | 41.47% | 49.62% | 71.73% | 53.68% |
| | ✔️ | ✔️ | 58.64%, 0.0849 | 42.78% | 41.22% | 47.37% | 68.11% | 51.07% |
| ✔️ | | ✔️ | 46.62%, 0.1152 | 26.96% | 25.10% | 30.83% | 60.44% | 34.55% |
| ✔️ | ✔️ | | 62.41%, 0.0703 | 39.90% | 37.98% | 39.85% | 64.05% | 47.97% |

Each visual knowledge type (ADC, ZADC, brain anatomy) plays a complementary role. Raw ADC provides signal details, ZADC identifies abnormal regions, and brain anatomy supplies anatomical priors. Removing any component degrades performance, but removing ZADC has the most critical effect, as it provides the probability of abnormal brain regions—crucial for MRI injury interpretation.

Q4. Justification and sensitivity of ZADC (–2).

  • Justification (z < –2): The threshold of –2 aligns with clinical conventions and prior studies on abnormal-region probabilities in HIE [A]; it flags ADC values more than two standard deviations below the normal atlas—commonly interpreted as abnormally reduced diffusion in neonatal HIE ADC maps—and serves as a marker for potential brain injury regions (see the sketch after Table 3-2).

Table 3-2. Ablation study on varying ZADC threshold values

| Threshold | Lesion Grading | Lesion Anatomy | Rare Locations | MRI Injury Score | Neuro Outcome | Interpretation Summary |
|---|---|---|---|---|---|---|
| z < –1.8 | 58.64%, 0.0782 | 43.04% | 41.42% | 47.14% | 68.23% | 51.42% |
| z < –2.2 | 60.90%, 0.0867 | 42.58% | 38.19% | 37.59% | 70.93% | 50.67% |
| z < –2 | 62.41%, 0.0703 | 43.57% | 41.47% | 49.62% | 71.73% | 53.68% |

Across ZADC threshold variations, CGoT outperforms the baselines, demonstrating robust and effective performance. The drop at z < −2.2 on MRI injury scoring stems from an overly strict threshold that misses mild injuries: the NRN score includes mild cases (levels 0 and 1), which are often excluded at this threshold, leading to missed low-grade injuries. z < −2 better captures these signals, yielding the best performance.
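
For concreteness, here is a minimal sketch of how a ZADC map and the z < −2 abnormality mask could be computed, assuming voxel-wise mean and standard-deviation volumes from a normative ADC atlas; the function and array names are illustrative, not from the paper.

```python
import numpy as np

def zadc_mask(adc, atlas_mean, atlas_std, threshold=-2.0):
    """Voxel-wise z-score of an ADC map against a normative atlas,
    thresholded to flag abnormally reduced diffusion.

    Values below `threshold` (default -2, i.e. more than two standard
    deviations below the atlas mean) mark candidate injury voxels,
    matching the convention discussed above. Illustrative only.
    """
    zadc = (adc - atlas_mean) / np.maximum(atlas_std, 1e-6)  # guard /0
    return zadc < threshold

# Example on a tiny synthetic "volume" (real inputs would be 3D arrays):
adc = np.array([[800.0, 450.0], [1100.0, 300.0]])
mean = np.full_like(adc, 1000.0)
std = np.full_like(adc, 200.0)
print(zadc_mask(adc, mean, std))  # True where (adc - 1000) / 200 < -2
```

Sweeping the `threshold` argument over −1.8, −2.0, and −2.2 is all the sensitivity analysis in Table 3-2 requires at the mask level.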

Q5. Alternative reasoning structures.

Clinically Grounded. CGoT is predefined and grounded in well-established diagnostic and prognostic workflows used in neonatal care. Its structure captures a real clinical reasoning pipeline [A,B], where each edge and reasoning step reflects real-world clinical logic. Modifying or removing these edges would compromise both interpretability and clinical validity.

Quantitative Analysis. Table 4 (main paper) demonstrates that omitting intermediate tasks significantly reduces neurocognitive outcome prediction accuracy, confirming that all subtasks are essential. Alternative structures (e.g., bypassing injury scoring or directly predicting outcomes from lesions) resulted in lower accuracy and clinical implausibility (Table 4). The intermediate reasoning nodes and the current reasoning structure are therefore necessary; moreover, the model is robust and tolerant to minor errors in these steps, as shown in the response to Reviewer #fHCe (Tables 2-1 and 2-2). Future explorations of adaptive structures must maintain clinical interpretability. Ablation studies on node features (Tables 3-1, 3-2) confirm that all types of knowledge inputs contribute meaningfully to performance, reinforcing the deliberate and necessary design of CGoT.

[A] Mining multi-site clinical data to develop machine learning MRI biomarkers: application to neonatal hypoxic ischemic encephalopathy. 2019.

[B] NICHD Magnetic Resonance Brain Imaging Score in Term Infants With HIE: A Secondary Analysis of a Randomized Clinical Trial.

Q6. Add related evaluation work [1,2].

We will add and discuss the suggested related work in the revised version.

Reviewer Comment

Thanks for your rebuttal. It has addressed most of my comments and I will maintain the original score.

Official Review
Rating: 3

The paper introduces HIE-Reasoning, a professional-level medical visual question answering (MVQA) benchmark focused on neonatal Hypoxic-Ischemic Encephalopathy (HIE). The authors propose the Clinical Graph-of-Thought Model (CGoT), which integrates visual and clinical domain knowledge into a structured reasoning framework, significantly enhancing the interpretability and accuracy of clinical decisions and prognoses.

Questions for Authors

  1. How to evaluate the model performance on open-ended questions in the HIE-Reasoning benchmark?
  2. How to identify the clinical knowledge (i.e., visual and textual knowledge) most relevant to each specific question? Was this knowledge preprocessed at the beginning?
  3. In Table 3, why does retaining only GoT or clinical knowledge lead to worse performance compared to removing both simultaneously?

Claims and Evidence

See the Other Strengths and Weaknesses section.

Methods and Evaluation Criteria

See the Other Strengths and Weaknesses section.

Theoretical Claims

See the Other Strengths and Weaknesses section.

Experimental Design and Analysis

See the Other Strengths and Weaknesses section.

Supplementary Material

None

Relation to Broader Scientific Literature

See the Other Strengths and Weaknesses section.

Essential References Not Discussed

See the Other Strengths and Weaknesses section.

Other Strengths and Weaknesses

Strengths

  1. This paper provides a new HIE-Reasoning benchmark tailored for professional-level medical reasoning, which consists of six MRI-related tasks.
  2. This paper introduces a CGoT framework, which integrates clinical knowledge and clinical reasoning steps with LVLMs, enhancing both model performance and transparency.
  3. The paper is well-structured and easy to follow.

Weaknesses

  1. This paper lacks an analysis of the computational complexity, efficiency, and resource requirements of the proposed CGoT compared to SOTA baselines, including the inference time, memory usage, and API costs. This would be helpful in assessing its scalability and practical applicability.
  2. The performance of CGoT on more complex tasks (e.g., predicting 2-year neurocognitive outcomes) is heavily dependent on the results of previous tasks (e.g., MRI injury scores) (shown in Table 4). In other words, solving these more complex tasks requires first addressing related prerequisite tasks, which reduces efficiency and limits practical applicability.
  3. The paper lacks comparisons with important baselines. The proposed CGoT incorporates clinical knowledge into the LVLM, guiding it to address more complex medical questions via a clinician-like diagnostic process. However, existing methods such as MDAgent [1] and MedAgents [2] also emulate real-world medical decision-making processes for multimodal medical reasoning tasks using LLMs and LVLMs. These should be included as strong baselines to better validate the effectiveness of CGoT.
  4. The proposed HIE-Reasoning benchmark requires a more detailed description, particularly regarding the distribution of open-ended and multiple-choice questions.

Reference

[1] MDAgents: An Adaptive Collaboration of LLMs for Medical Decision-Making, NeurIPS, 2024.

[2] MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning, ACL, 2024.

Other Comments or Suggestions

  1. Several typos are present in the paper. For example, in Line 048 and Line 090, "Fig. 1" and "Fig. 2" should be revised to "Figure 1" and "Figure 2". In Line 124, "proposed" should be revised to "propose". In Line 255, "Sec" should be revised to "Section". In Line 207, "MRI Interpretation Summary." should be revised to "Task 6. MRI Interpretation Summary.".

  2. "MRI" needs to be introduced with its full name at the beginning of the paper.

Author Response

We thank the reviewer for the constructive comments and address the concerns below (within the 5000-character limit).

Q1. Computational Complexity

| Model | Input Tokens | Output Tokens | Inference Time | API Cost |
|---|---|---|---|---|
| Gemini | 10,580 | 446 | 6.50 s | $0.00080 |
| CGoT-Gemini | 36,256 | 1,146 | 21.09 s | $0.00272 |
| MedAgent-Gemini | 354,801 | 10,497 | 210.75 s | $0.02976 |
| MDAgent-Gemini | 236,193 | 9,228 | 142.75 s | $0.02048 |
Table 1-1: Case-wise computational analysis on the Gemini-1.5-flash backbone.

CGoT introduces modest overhead compared to Gemini-1.5-Flash but achieves a 15% accuracy improvement with clinical transparency. CGoT also significantly reduces computational cost compared to SOTA agent frameworks (MedAgent, MDAgent) through structured reasoning that mirrors clinical workflows. We will add this analysis in the revised version.

Q2. Dependency on Prerequisite Tasks

Completing prerequisite tasks before the final decision aligns with standard clinical practice. Our approach enhances both transparency and effectiveness.

  1. Intermediate steps are necessary, and the model is robust to minor errors in these steps:

    The intermediate step is necessary:

    Table 2-1: Removing MRI injury scoring from the reasoning chain leads to a sharp performance drop in outcome prediction.

    | MRI Injury Score in chain | 2-year Outcome (%) |
    |---|---|
    | Removed | 50.94 |
    | Included | 71.73 |

    Robust to small errors in intermediate steps: We applied random ±1-level perturbations (across 4 severity levels) to the MRI injury score in varying percentages of cases (0–30%). Results show graceful degradation, indicating robustness to small inaccuracies (a minimal sketch of this perturbation protocol appears after this list).

    Table 2-2. CGoT is robust to minor errors of intermediate steps.

    | Perturbation Ratio (%) | 2-year Outcome (%) |
    |---|---|
    | 0 | 71.73 |
    | 10 | 67.83 |
    | 20 | 66.22 |
    | 30 | 62.72 |
  2. Clinically Grounded: CGoT mimics clinical diagnostic workflows by modeling the stepwise reasoning clinicians use. Its intermediate tasks are clinically meaningful and enhance interpretability. Step-by-step inference via chain-of-thought reasoning [A] and inference-time scaling [B] has been shown to enhance interpretability and improve performance on other complex reasoning tasks. Table 4 (main paper) shows that incorporating such intermediate tasks improves the final outcome prediction compared to relying solely on end-to-end black-box models (Table 2, main paper).
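
As referenced in item 1, here is a minimal sketch of the ±1-level perturbation protocol; `SEVERITY_LEVELS = 4` follows the rebuttal, while the sampling details (uniform choice of cases and shift direction) are assumptions, not the authors' stated procedure.

```python
import random

SEVERITY_LEVELS = 4  # MRI injury score levels, per the rebuttal

def perturb_scores(scores, ratio, seed=0):
    """Randomly shift the injury score of `ratio` of the cases by one
    severity level (up or down), clamped to the valid range. A sketch
    of the robustness protocol; the real experiment's sampling details
    may differ."""
    rng = random.Random(seed)
    perturbed = list(scores)
    for i in rng.sample(range(len(scores)), k=int(ratio * len(scores))):
        shifted = perturbed[i] + rng.choice([-1, 1])
        perturbed[i] = min(SEVERITY_LEVELS - 1, max(0, shifted))
    return perturbed

# Example: perturb 30% of 10 cases, mirroring the largest setting above.
print(perturb_scores([0, 1, 2, 3, 1, 2, 0, 3, 2, 1], ratio=0.3))
```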

Q3. Comparisons with MDAgent and MedAgent

| Model | Lesion Grading (%) | Lesion Anatomy (%) | Rare Lesion Locations (%) | MRI Injury Score (%) | Neurocognitive Outcome (%) | Interpretation Summary (%) |
|---|---|---|---|---|---|---|
| MDAgent | 28.57, 0.1508 | 42.61 | 38.29 | 47.85 | 48.81 | 51.22 |
| MedAgent | 30.95, 0.1674 | 41.88 | 37.57 | 45.80 | 54.32 | 49.17 |
| CGoT | 62.41, 0.0703 | 43.57 | 41.47 | 49.62 | 71.73 | 53.68 |

Table 3-1: Performance comparisons.

CGoT outperforms both baselines by incorporating structured clinical reasoning with domain-specific visual and textual knowledge, enabling superior performance on complex clinical tasks.

Q4. Question Types

| Total QA Pairs | Open-Ended | Multiple Choice |
|---|---|---|
| 749 | 399 | 350 |

Table 4-1: Question distribution.

  • Open-ended: Lesion percentage, lesion anatomy, rare lesion localization, MRI interpretation summaries.
  • Multiple-choice: Lesion severity ratings, MRI injury scoring, outcome classification.

Q5. Evaluation Metric

(1) ROUGE-L, capturing content overlap and fluency by comparing generated answers to expert references, as is common in medical summarization tasks with LLMs [C]. (2) F1 score for questions about brain regions, assessing the correctness and completeness of the injured regions identified (Sec. 5).
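
For the F1 component, here is a minimal sketch of a set-level score over injured brain regions; the matching rules (e.g., region-name normalization) are assumptions rather than the paper's specification.

```python
def region_f1(predicted, reference):
    """F1 between predicted and expert-annotated injured brain regions,
    treated as sets. Matching rules (e.g., name normalization) are
    assumptions, not the paper's specification."""
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0  # both empty: vacuously perfect
    tp = len(predicted & reference)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(reference)
    return 2 * precision * recall / (precision + recall)

# One of two predicted regions is correct (precision 0.5) and the single
# reference region is recovered (recall 1.0), so F1 ≈ 0.67.
print(region_f1({"basal ganglia", "thalamus"}, {"thalamus"}))
```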

Q6. Identify Clinical Knowledge Relevance

Clinical experts (radiologists, neonatologists, and neurologists) curated, identified, and validated all relevant clinical knowledge during dataset construction to align with real-world clinical reasoning.

Q7. Analysis of Components in Table 3 (main paper):

Retaining only GoT or only clinical knowledge performs worse than removing both due to misaligned or incomplete reasoning. GoT without clinical context may propagate irrelevant patterns, while clinical knowledge without GoT ignores task dependencies. When both are removed, the model defaults to a purely end-to-end approach, which avoids the confusion caused by partially constrained reasoning. This highlights the importance of combining knowledge and clinical reasoning.

Q8. Paper Revision

We will carefully review and revise the paper.

[A]. W., J., et al. "Chain-of-thought prompting elicits reasoning in large language models." NeurIPS 2022.

[B]. G., D., et al. "Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning." arXiv 2025.

[C]. T., L., et al. "Evaluating large language models on medical evidence summarization." NPJ digital medicine 6.1 (2023): 158.

Reviewer Comment

Thanks for the authors' rebuttal. It has mostly addressed my comments and I will maintain the original score.

Final Decision

This paper presents HIE-Reasoning, a professional-level benchmark for Medical Visual Question Answering (MVQA) focused on neonatal Hypoxic-Ischemic Encephalopathy (HIE). The authors contribute a highly valuable and unique dataset, curated over a decade and consisting of 749 expert-annotated question–answer pairs that simulate complex clinical reasoning tasks. The benchmark provides a realistic and challenging setting for evaluating Large Vision-Language Models (LVLMs) in clinical decision-making.

To address this challenge, the authors propose CGoT, a reasoning model that integrates clinical workflows and structured visual–textual knowledge. CGoT achieves ~15% improvement in neurocognitive outcome prediction over existing baselines, highlighting its practical effectiveness.

All reviewers commend the significant effort in building this pioneering dataset, the thoughtful design of CGoT to mirror actual medical practice, and the well-constructed evaluation protocol. The paper is a notable example of bridging the gap between academic AI research and clinical application, offering both a realistic benchmark and a working system that addresses real-world diagnostic needs.

Overall, this work is strong in contribution, clearly written, and would make a valuable addition to ICML.