MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models
Abstract
Reviews and Discussion
The authors propose a method, MMed-RAG, to address alignment issues in medical applications, which is essential for enhancing factual accuracy. MMed-RAG is a generalizable approach for medical RAG applications, with key contributions as follows: (1) a domain-aware retriever, (2) adaptive context selection, and (3) a direct preference optimization (DPO)-based fine-tuning algorithm built on a newly developed preference-dataset curation process. Additionally, a theoretical justification is provided.
MMed-RAG is evaluated on five datasets across three different medical image types—Radiology, Ophthalmology, and Pathology—on tasks such as VQA and report generation. The authors also conducted a detailed ablation study and analysis of the misalignment issues addressed by MMed-RAG. Overall, MMed-RAG achieves better performance across all tasks and demonstrates the ability to reduce misalignment.
Strengths
- Originality: The perspective of decomposing RAG errors into cross-modal and overall misalignment is novel, and it gives rise to a new algorithm for collecting preference data.
- Quality: The work is solid and is supported by a theoretical proof.
- Clarity: The authors illustrate their motivation and proposed method well with clear examples and figures, making the paper easy to follow.
- Significance: The authors are working on improving the safety of VLLMs in a high-stakes field (i.e., medicine). Therefore, I appreciate the idea and the improvements brought by MMed-RAG.
Weaknesses
I think the method is overall well-motivated and well-written. However, I have some concerns that make me doubt the evidence presented in the experiments, and I would like to hear the authors' thoughts.
- Somewhat over-claimed novelty: The authors claim MMed-RAG is a general approach for handling images from different sources. While this is technically true, the implementation merely puts three independent pieces together without much creativity and calls the result general (i.e., it still relies on training an "individual" retriever model for each image source). It feels more like an engineering trick than a methodological novelty.
- Lack of potential baselines, and the evaluation needs more clarification:
  - When comparing to Med-Flamingo, which setting did you use? (6-shot or zero-shot?)
  - I would suggest including LLaVA-Med 1.5 few-shot as a baseline. Based on the GitHub record [1], LLaVA-Med 1.5 was published on 05/13/2024, while Med-Flamingo, RadFM, MedVInT, and miniGPT-Med were published in 2023 based on the references. So I suppose LLaVA-Med 1.5 is a more powerful VLM than those. I think it is important to see how far this newer, more powerful model can go with a simple trick (i.e., few-shot in-context learning), as the Med-Flamingo authors did (they report both 0-shot and few-shot results).
- Concerns about the alignment analysis:
  - First of all, what data did you use for the evaluation in Figure 3?
  - Copy-reference rate: If I understand correctly, this refers to line 8, y_{l,o1}, in Algorithm 1, where the model overly relies on the retrieved text. I find the construction of this idea confusing in practice. Since X_{r} is based on the context retrieved with the original image, you then pair this context with the noisy image. However, in practice, you would never get this clean X_{r} from the ground truth. What you will have is a noisy image + noisy context, which the model takes as input. I am a bit unsure what the point of this evaluation is.
  - Over-reliance rate: If I understand correctly, this refers to line 16, y_{l,o3}, in Algorithm 1, where the model is interfered with by noisy retrieved context. My question is: does it just mean the retriever is bad or not well trained, so that the retrieved context is noisy and hurts the model's alignment? Referring to Figure 3, the OR rate is over 0.4. With such a number, I wonder how well the retriever is trained in the paper; do the authors have any results or observations?
  - Following the logic above, I feel a better evaluation scheme would be to (1) use the original image, (2) feed it to the retriever, and (3) select data with bad retrieved context (i.e., where the retriever fails), then see how MMed-RAG can rescue these examples. Any comment on this?
Questions
My main questions are: (1) Do the authors have a rationale for not adding more basic baselines so that we have a better reference method in the paper? (2) I am willing to hear any thoughts or clarification regarding my concerns about the alignment analysis.
Q7: Following the logic above, I feel a better evaluation scheme would be to (1) use the original image, (2) feed it to the retriever, and (3) select data with bad retrieved context (i.e., where the retriever fails), then see how MMed-RAG can rescue these examples. Any comment on this?
A7: For retrieved information, we minimize noise by optimizing the number of retrieved contexts (e.g., Adaptive Retrieved Context Selection in Section 3.2). Following this, we introduce RAG-PT to specifically address the misalignment issues that arise after incorporating RAG, thereby strengthening the Med-LVLM's ability to balance its internal knowledge and external retrieved information. Inspired by your suggestion, we employ a rationale-guided approach [6] that uses an LLM to explicitly learn to denoise retrieved content through self-synthesized rationales. First, given a question, the retrieved documents, and the ground truth from the training set, we prompt a powerful Med-LLM (i.e., LLaMA3-Med42-70B [7]) to generate a rationale. This rationale explains how to derive the answer from potentially noisy inputs. Next, we use the synthesized rationale from the previous step to guide another, smaller Med-LLM (i.e., LLaMA3-Med42-7B [7]) to explicitly learn to denoise the retrieved documents through in-context learning and supervised learning. By employing this rationale-guided Med-LLM to filter noisy retrieval information, the reliability of our retrieved data improves. As shown in Table R4, experimental results show that after applying rationale-guided RAG, the model's performance further improves. Thank you very much for your valuable suggestions. We have revised the paper and put these details in Appendix A.6.12.
Table R4: Performance comparison on IU-Xray dataset, including RAG and Rationale-Guided RAG variants.
| Model | Acc | F1 | AUC |
|---|---|---|---|
| LLaVA-Med | 75.47 | 64.04 | 67.46 |
| + RAG | 84.82 | 68.85 | 77.54 |
| + RAG-PT | 89.54 | 80.72 | 87.13 |
| + Rationale-Guided RAG | 85.38 | 69.23 | 77.90 |
| + Rationale-Guided RAG + RAG-PT | 89.91 | 80.86 | 87.32 |
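For readers who want to see the shape of this two-step pipeline, the sketch below illustrates it in Python. The prompt templates, the `generate(model, prompt)` helper, and the function names are assumptions made for illustration; this is not the authors' released code, it simply follows the description above (rationale synthesis with a strong Med-LLM, then in-context denoising with a smaller Med-LLM).

```python
# Illustrative sketch of the rationale-guided denoising idea from A7 (cf. [6]).
# `generate` is an assumed helper wrapping any chat-completion interface;
# the prompt templates below are hypothetical.

RATIONALE_PROMPT = (
    "Question: {q}\nRetrieved documents:\n{docs}\nGround-truth answer: {a}\n"
    "Explain step by step how to reach the answer from the documents, "
    "pointing out which documents are irrelevant or noisy."
)

DENOISE_PROMPT = (
    "Question: {q}\nRetrieved documents:\n{docs}\n"
    "Following the reasoning style shown in the examples, discard noisy "
    "documents and answer the question."
)

def build_rationales(teacher, train_set, generate):
    """Step 1: a strong Med-LLM synthesizes rationales on the training set."""
    rationales = []
    for ex in train_set:
        prompt = RATIONALE_PROMPT.format(q=ex["question"],
                                         docs="\n".join(ex["retrieved"]),
                                         a=ex["answer"])
        rationales.append({"example": ex, "rationale": generate(teacher, prompt)})
    return rationales

def denoise_with_student(student, query, retrieved, demos, generate):
    """Step 2: a smaller Med-LLM imitates the rationales (via in-context
    examples) to filter/denoise retrieved documents before generation."""
    context = "\n\n".join(f"{d['example']['question']}\n{d['rationale']}" for d in demos)
    prompt = context + "\n\n" + DENOISE_PROMPT.format(q=query, docs="\n".join(retrieved))
    return generate(student, prompt)
```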
Q8: Do the authors have a rationale for not adding more basic baselines so that we have a better reference method in the paper?
A8: We have already compared several state-of-the-art baselines (post-2023), including MedDr (released in April 2024), FactMM-RAG (released in July 2024), and RULE (EMNLP 2024), which are highly relevant baseline methods. If you have any additional specific suggestions, we would be happy to include more baselines.
Reference
[1] Zhang J, Qu X, Zhu T, et al. CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling. arXiv preprint arXiv:2409.19291, 2024.
[2] Wang X, Zhou Y, Liu X, et al. Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences. ACL, 2024.
[3] Meng F, Wang J, Li C, et al. MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models. arXiv preprint arXiv:2408.02718, 2024.
[4] Gravel P, Beaudoin G, De Guise J A. A Method for Modeling Noise in Medical Images. IEEE Transactions on Medical Imaging, 2004, 23(10): 1221-1232.
[5] Sanchez M G, Sánchez M G, Vidal V, et al. Medical Image Restoration with Different Types of Noise. 2012 Annual International Conference of the IEEE Engineering in Medicine and Biology Society, IEEE, 2012: 4382-4385.
[6] Wei Z, Chen W L, Meng Y. InstructRAG: Instructing Retrieval-Augmented Generation with Explicit Denoising. arXiv preprint arXiv:2406.13629, 2024.
[7] Christophe C, Kanithi P K, Raha T, et al. Med42-v2: A Suite of Clinical LLMs. arXiv preprint arXiv:2408.06142, 2024.
Thank you for your valuable feedback to help us improve our paper. We have revised our paper based on your feedback. We detail our response below and please kindly let us know if our response addresses your concerns.
Q1: Somewhat over-claimed novelty: The authors claim MMed-RAG is a general approach for handling images from different sources. While this is technically true, the implementation merely puts three independent pieces together without much creativity and calls the result general (i.e., it still relies on training an "individual" retriever model for each image source). It feels more like an engineering trick than a methodological novelty.
A1: We have tried training a general retriever by mixing images from all modalities together, instead of using a domain-specific retriever. We conduct experiments based on BiomedCLIP and MedCLIP, but the results are unsatisfactory (please see Table R1). We then adopt an MoE (Mixture of Experts) architecture: we fine-tune CLIP-MoE [1] on a mixture of images from all medical imaging modalities, but the performance is still suboptimal. This might be because CLIP-MoE is not pretrained on large-scale biomedical data. All the results are reported in Table R1. Considering model performance, we ultimately adopt a domain-specific retriever architecture. In fact, our approach is also flexible and scalable. When encountering a completely new modality, both a general retriever and our method may still require fine-tuning to achieve optimal retrieval performance. In the future, we plan to explore pretraining a general retriever on large-scale biomedical data using an MoE architecture, aiming to determine whether it is possible to develop a truly universal retriever. We have revised the paper and put these details in Appendix A.6.9.
Table R1: Performance comparison based on different retrievers.
| Model | IU-Xray (Acc) | IU-Xray (F1) | IU-Xray (AUC) |
|---|---|---|---|
| LLaVA-Med | 75.47 | 64.04 | 67.46 |
| + RAG (BiomedCLIP-FT) | 79.09 | 65.87 | 69.52 |
| + RAG (MedCLIP-FT) | 75.13 | 63.88 | 67.16 |
| + RAG (CLIP-MoE-FT) | 72.13 | 62.72 | 65.11 |
| + RAG (Ours) | 84.82 | 68.85 | 77.54 |
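The rebuttal does not spell out the fine-tuning objective used for each domain-specific retriever, so the snippet below is only a generic sketch of CLIP-style contrastive fine-tuning on image-report pairs from a single modality (e.g., radiology only), included to make the "domain-specific retriever" idea concrete. The encoders, optimizer, and temperature are placeholders, and the authors' actual training setup may differ.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_step(image_encoder, text_encoder, images, reports,
                          optimizer, temperature=0.07):
    """One fine-tuning step of a CLIP-style retriever on image-report pairs
    drawn from a single imaging modality (e.g., radiology only)."""
    img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D)
    txt_emb = F.normalize(text_encoder(reports), dim=-1)   # (B, D)
    logits = img_emb @ txt_emb.t() / temperature            # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    # symmetric InfoNCE: match each image to its report and vice versa
    loss = 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```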
Q2: When comparing to Med-Flamingo, which setting did you use? (6-shot or zero-shot?)
A2: All our experiments are conducted under a zero-shot setting.
Q3: I would suggest including LLaVA-Med 1.5 few-shot as a baseline. Based on the GitHub record [1], LLaVA-Med 1.5 was published on 05/13/2024, while Med-Flamingo, RadFM, MedVInT, and miniGPT-Med were published in 2023 based on the references. So I suppose LLaVA-Med 1.5 is a more powerful VLM than those. I think it is important to see how far this newer, more powerful model can go with a simple trick (i.e., few-shot in-context learning), as the Med-Flamingo authors did (they report both 0-shot and few-shot results).
A3: Thank you very much for your suggestion. We conduct experiments on LLaVA-Med-1.5 using the same few-shot strategy as in Med-Flamingo. As shown in Table R2, the results show that compared to the zero-shot setting, the model's performance significantly decreases, even with RAG applied. Our analysis of this phenomenon reveals that, unlike Med-Flamingo, LLaVA-Med does not use interleaved multimodal data for pretraining. As a result, it lacks the capability for few-shot learning. This point has been mentioned in some discussion forums and GitHub issues. In addition, LLaVA-1.5's unsatisfactory performance on multi-image understanding benchmarks also supports this observation [2,3]. We have revised the paper and put these details in Appendix A.6.10.
Table R2: Performance comparison under zero-shot and few-shot settings.
| Model | Acc | F1 | AUC |
|---|---|---|---|
| LLaVA-Med (zero-shot) | 75.47 | 64.04 | 67.46 |
| +MMed-RAG | 89.54 | 80.72 | 87.13 |
| LLaVA-Med (few-shot) | 66.77 | 51.56 | 66.60 |
| +MMed-RAG | 84.10 | 71.92 | 86.40 |
Q4: Concerns about the alignment analysis: First of all, what data did you use to evaluate in Figure 3?
A4: As mentioned in Lines 195-202 (Section 3.3, Alignment Analysis), we conduct two tests on LLaVA-Med-1.5 using the Harvard-FairVLMed dataset to analyze the misalignment issues caused by RAG in the Med-LVLM.
Q5: Copy-reference rate: If I understand correctly, this refers to line 8, y_{l,o1}, in Algorithm 1, where the model overly relies on the retrieved text. I find the construction of this idea confusing in practice. Since X_{r} is based on the context retrieved with the original image, you then pair this context with the noisy image. However, in practice, you would never get this clean X_{r} from the ground truth. What you will have is a noisy image + noisy context, which the model takes as input. I am a bit unsure what the point of this evaluation is.
A5: In medical imaging, noise refers to random variations in image signals caused by hardware limitations or environmental factors [4,5]. However, the noise we refer to here pertains to images unrelated to the original image, generated through a two-step process: 1. We use a retriever to select images with the lowest similarity to the target image. 2. We introduce strong diffusion noise to these images. As a result, the noisy images in our case are almost entirely random noise and are not merely examples of domain shifts, such as changes in lighting conditions. Refer to the third section of Figure 1 for examples, and additional examples are included in the Appendix A.9 for reference.
The motivation behind our design is that replacing the original image with a highly noisy image while adding retrieved information corresponding to the original image reveals a significant issue of cross-modal misalignment in the Med-LVLM—namely, it ignores the image information and directly copies the retrieved contexts. To mitigate this issue, we construct such preference pairs to specifically strengthen the model's cross-modal alignment capability.
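To make the construction of the dispreferred input concrete, here is a minimal sketch of the two-step procedure described above (least-similar retrieval followed by heavy noising). The tensor shapes, the `noise_strength` parameter, and the specific noising formula (a single forward-diffusion-style Gaussian step) are illustrative assumptions rather than the exact settings used in the paper.

```python
import torch
import torch.nn.functional as F

def build_noisy_image(target_emb, candidate_imgs, candidate_embs, noise_strength=0.9):
    """Construct the 'unrelated + noised' image used for dispreferred samples:
    pick the candidate least similar to the target, then heavily corrupt it
    with Gaussian noise in the style of a forward diffusion step."""
    sims = F.cosine_similarity(target_emb.unsqueeze(0), candidate_embs)  # (N,)
    least_similar = candidate_imgs[sims.argmin()]
    alpha = 1.0 - noise_strength                 # keep only a small clean fraction
    noise = torch.randn_like(least_similar)
    return alpha ** 0.5 * least_similar + (1 - alpha) ** 0.5 * noise
```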
Q6: Over-reliance rate: If I understand correctly, this refers to line 16, y_{l,o3}, in Algorithm 1, where the model is interfered with by noisy retrieved context. My question is: does it just mean the retriever is bad or not well trained, so that the retrieved context is noisy and hurts the model's alignment? Referring to Figure 3, the OR rate is over 0.4. With such a number, I wonder how well the retriever is trained in the paper; do the authors have any results or observations?
A6: The overall alignment issue arises from the conflict between retrieved information and the model's internal knowledge. For retrieved information, we cannot guarantee 100% accuracy, so some noise is inevitable. The Over-Reliance (OR) rate shown in Figure 3 refers to the proportion of initially correct responses that become incorrect after adding the retrieved context, calculated relative to the total number of incorrect samples, not the total number of all samples. This rate represents the proportion of errors caused by over-reliance, rather than indicating poor performance of the retriever. Through RAG-PT, we can effectively mitigate this issue, significantly reducing the OR rate. Regarding the retriever's performance, as shown in Table R3, we compared the performance of our retriever with several CLIP-based models on radiology datasets for image-to-text retrieval. The results demonstrate that our retriever significantly outperforms the other models in retrieval performance. We have revised the paper and put these details in Appendix A.10.
Table R3: Performance comparison of different retrievers on Recall@1 (R@1) and Recall@5 (R@5) metrics.
| Model | R@1 | R@5 |
|---|---|---|
| CLIP | 3.91 | 7.88 |
| PubMedCLIP | 1.47 | 1.64 |
| MedCLIP | 6.74 | 12.69 |
| BiomedCLIP | 15.7 | 23.8 |
| PMC-CLIP | 12.3 | 21.2 |
| Ours | 45.6 | 71.8 |
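For clarity, the two rates discussed in Q5/Q6 can be computed roughly as below. The over-reliance rate follows the definition given in A6; the copy-reference rate is one plausible formalization of the behavior described in A5 (the exact matching criterion used in the paper is not restated here), and the helper names are assumptions.

```python
def over_reliance_rate(correct_no_rag, correct_with_rag):
    """OR rate as defined in A6: among samples that are incorrect *with* RAG,
    the fraction that were correct before the retrieved context was added.
    Both inputs are parallel lists of booleans over the evaluation set."""
    flipped = sum(a and not b for a, b in zip(correct_no_rag, correct_with_rag))
    total_incorrect = sum(not b for b in correct_with_rag)
    return flipped / max(total_incorrect, 1)

def copy_reference_rate(answers_with_noisy_image, retrieved_references, matches):
    """One plausible formalization of the CR rate: the fraction of samples where
    the model, given a noise-corrupted image but the original retrieved context,
    still reproduces the retrieved reference. `matches(a, r)` is a user-supplied
    predicate (e.g., exact match or a high text-overlap score)."""
    hits = sum(matches(a, r) for a, r in zip(answers_with_noisy_image, retrieved_references))
    return hits / max(len(retrieved_references), 1)
```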
Dear reviewer mHiX,
We would like to follow up to see if the response addresses your concerns or if you have any further questions. We would really appreciate the opportunity to discuss this further if our response has not already addressed your concerns. Thank you again!
I really appreciate the authors' effort in conducting additional experiments to clarify all of my concerns. After the rebuttal, I do not find any major issues in this paper and have already reflected my new score in my review.
Dear Reviewer mHiX,
We sincerely thank you for your detailed and insightful feedback, and we are happy to see that our response addresses your concerns. Thanks again for your valuable comments, which help us improve our paper.
This paper presents MMed-RAG, a retrieval-augmented generation (RAG) framework to improve factual accuracy in medical vision-language models (Med-LVLMs), addressing issues of factual inaccuracies in current models. MMed-RAG incorporates a domain-aware retrieval mechanism, adaptive context selection, and preference-based fine-tuning to enhance alignment with medical knowledge. Experimental results across five datasets show improved accuracy compared to existing methods. While primarily refining existing RAG techniques for medical applications, the approach is well-executed and clearly presented. MMed-RAG offers a focused improvement for Med-LVLMs, potentially increasing their reliability in clinical use.
Strengths
- MMed-RAG introduces a domain-specific retrieval mechanism, adapting RAG techniques to address the nuances of different medical fields. By implementing adaptive context selection and preference-based fine-tuning, the paper offers a creative and impactful advancement over traditional RAG methods, effectively enhancing cross-modal and ground-truth alignment.
- The paper is methodologically sound, thoroughly evaluating MMed-RAG on five diverse medical datasets. Empirical results show significant improvements in factual accuracy, with gains up to 43.8%, underscoring the effectiveness and robustness of the approach.
- The paper is well-structured, making complex ideas accessible. Each component of the framework—domain-aware retrieval, adaptive context selection, and fine-tuning—is clearly explained, with helpful visual aids to support understanding.
- Addressing factual reliability in medical diagnostics, this work has substantial real-world implications. By reducing hallucinations in Med-LVLMs, MMed-RAG contributes to safer, more reliable AI diagnostics and could be adapted to other high-stakes domains where factual accuracy is critical.
Weaknesses
- In the analysis of RAG-PT, the authors evaluate the effectiveness of each individual RAG-PT component (1, 2, and 3). To better understand how RAG-PT mitigates misalignment and improves performance, it would be helpful to include experiments with different combinations, such as RAG-PT 1+2, RAG-PT 1+3, and RAG-PT 2+3.
- In Table 2, it is unclear which specific BLEU score is being reported—is it BLEU-1, BLEU-2, BLEU-3, BLEU-4, or an average of these?
- In the "Comparison with Other Med-LVLMs" section, it’s not clear which metrics are averaged for the medical VQA and report generation tasks. Additional clarification on this point would help interpret the results.
- In the section on Cross-Modality Alignment, more intuitive explanations are needed to clarify how constructing preference pairs (by introducing noisy images) addresses the issue of cross-modality alignment.
Questions
Please refer to the Weaknesses.
Q3: In the "Comparison with Other Med-LVLMs" section, it’s not clear which metrics are averaged for the medical VQA and report generation tasks. Additional clarification on this point would help interpret the results.
A3: Thanks for your advice. As shown in Table R3, we have revised this part and put these details in Appendix A.6.2.
Table R3: Accuracy (%) of different Med-LVLMs based on LLaVA-Med-1.5 on medical VQA task.
| Models | Radiology (IU-Xray) | Radiology (MIMIC-CXR) | Ophthalmology (Harvard-FairVLMed) | Pathology (Quilt-1M) | Pathology (PMC-OA) |
|---|---|---|---|---|---|
| LLaVA-Med-1.5 | 75.47 | 75.79 | 63.03 | 62.80 | 59.28 |
| Ours | 89.54 | 83.57 | 87.94 | 72.95 | 64.54 |
| Med-Flamingo | 26.74 | 61.27 | 42.06 | 27.11 | 32.62 |
| MedVInT | 73.34 | 66.06 | 35.92 | 26.81 | 27.77 |
| RadFM | 26.67 | 69.30 | 52.47 | 27.02 | 25.12 |
| miniGPT-Med | 54.87 | 53.92 | 66.73 | 26.82 | 27.03 |
Q4: In the section on Cross-Modality Alignment, more intuitive explanations are needed to clarify how constructing preference pairs (by introducing noisy images) addresses the issue of cross-modality alignment.
A4: To construct preference pairs for cross-modality alignment, we first obtain a preferred response by having the model generate an answer using the correct medical image, the clinical query, and the retrieved knowledge, keeping only responses that match the ground-truth answer. We then obtain a dispreferred response by substituting an unrelated input image: this image is chosen as the one with the lowest similarity to the target image and is further distorted with added noise. The dispreferred response is the case where the model, given this noisy, unrelated image along with the query and retrieved knowledge, still produces the "correct" answer (i.e., it has ignored the image and copied the retrieved context). By comparing these pairs during training, the model learns to prioritize relevant and accurate inputs (e.g., the correct medical image) over noisy or irrelevant ones, improving cross-modality alignment. We have revised this part and put these details in Appendix A.9.
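Below is a minimal sketch of how such preference pairs could be assembled for DPO training, following the description above. `generate_answer`, `is_correct`, and `corrupt_image` are assumed helpers (the last one corresponding to the least-similar-plus-noise construction discussed elsewhere in this thread); this illustrates the pairing logic only, not the authors' full RAG-PT pipeline.

```python
def build_cross_modal_pairs(samples, generate_answer, is_correct, corrupt_image):
    """Assemble DPO-style preference pairs for cross-modality alignment.
    All helpers are assumptions for illustration: `generate_answer(image, query,
    context)` queries the Med-LVLM, `is_correct` compares against the ground
    truth, and `corrupt_image` returns an unrelated, heavily noised image."""
    pairs = []
    for s in samples:
        clean_ans = generate_answer(s["image"], s["query"], s["retrieved"])
        noisy_ans = generate_answer(corrupt_image(s["image"]), s["query"], s["retrieved"])
        # preferred: correct answer grounded in the real image;
        # dispreferred: "correct" answer produced while ignoring the image
        if is_correct(clean_ans, s["answer"]) and is_correct(noisy_ans, s["answer"]):
            pairs.append({"prompt": (s["image"], s["query"], s["retrieved"]),
                          "chosen": clean_ans, "rejected": noisy_ans})
    return pairs
```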
Q2: In Table 2, it is unclear which specific BLEU score is being reported—is it BLEU-1, BLEU-2, BLEU-3, BLEU-4, or an average of these?
A2: Apologies for the confusion. Table 2 reports the average BLEU score. We provide the detailed BLEU-1 to BLEU-4 results in Table R2 below and in Appendix A.6.8.
Table R2: BLEU Score (%) of different methods based on LLaVA-Med-1.5 on report generation task.
| Models | IU-Xray (BLEU-1) | IU-Xray (BLEU-2) | IU-Xray (BLEU-3) | IU-Xray (BLEU-4) | MIMIC-CXR (BLEU-1) | MIMIC-CXR (BLEU-2) | MIMIC-CXR (BLEU-3) | MIMIC-CXR (BLEU-4) | Harvard-FairVLMed (BLEU-1) | Harvard-FairVLMed (BLEU-2) | Harvard-FairVLMed (BLEU-3) | Harvard-FairVLMed (BLEU-4) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLaVA-Med-1.5 | 17.69 | 10.55 | 6.47 | 3.83 | 21.82 | 13.35 | 6.11 | 3.64 | 32.57 | 19.86 | 9.11 | 5.38 |
| + Greedy | 21.04 | 12.57 | 5.75 | 3.35 | 29.91 | 18.26 | 8.27 | 5.03 | 32.40 | 19.82 | 9.04 | 5.37 |
| + Beam Search | 21.78 | 12.71 | 6.05 | 3.63 | 30.55 | 17.79 | 8.49 | 5.09 | 33.07 | 19.14 | 9.14 | 5.48 |
| + DoLa | 21.22 | 12.39 | 5.90 | 3.54 | 30.80 | 17.97 | 8.58 | 5.15 | 32.87 | 19.02 | 9.08 | 5.45 |
| + OPERA | 19.79 | 11.19 | 5.33 | 3.20 | 27.72 | 16.05 | 7.65 | 4.59 | 29.90 | 17.45 | 8.32 | 4.99 |
| + VCD | 19.35 | 10.94 | 5.21 | 3.13 | 27.27 | 15.76 | 7.51 | 4.51 | 30.14 | 17.61 | 8.39 | 5.04 |
| + MedDr | 22.27 | 12.99 | 6.19 | 3.71 | 33.43 | 19.33 | 9.22 | 5.53 | 35.64 | 20.61 | 9.82 | 5.89 |
| + FactMM-RAG | 26.45 | 15.25 | 7.26 | 4.36 | 33.64 | 19.44 | 9.27 | 5.56 | 37.47 | 21.64 | 10.30 | 6.18 |
| + RULE | 49.56 | 28.61 | 13.62 | 8.17 | 33.47 | 19.36 | 9.23 | 5.54 | 40.21 | 23.26 | 11.08 | 6.66 |
| MMed-RAG | 56.48 | 32.67 | 15.56 | 9.34 | 41.81 | 24.18 | 11.52 | 6.92 | 44.65 | 25.79 | 12.29 | 7.38 |
Thank you for your constructive comments and suggestions. We have revised our paper according to your comments. We respond to your questions below and would appreciate it if you could let us know if our response addresses your concerns.
Q1: In the analysis of RAG-PT, the authors evaluate the effectiveness of each individual RAG-PT component (1, 2, and 3). To better understand how RAG-PT mitigates misalignment and improves performance, it would be helpful to include experiments with different combinations, such as RAG-PT 1+2, RAG-PT 1+3, and RAG-PT 2+3.
A1: Thank you very much for your suggestion. Preference data designed for different alignment objectives can indeed produce varying effects. Therefore, conducting ablation experiments on combinations of different types of preference data is necessary. We perform comprehensive ablation experiments on RAG-PT1/2/3 as well as their combinations (1+2, 2+3, 1+3) to analyze the effectiveness of each type of data and their combinations. As shown in Table R1, we find that the combination of 1+3 produced the most significant results, indicating that the two misalignment issues (i.e., cross-modality and over-reliance issues) are the most prominent. Targeted mitigation of these two issues yielded the greatest improvement. However, incorporating data for all three alignment objectives yields the best performance, demonstrating the importance of each alignment component. We have revised the paper and put these details in Appendix A.6.6.
Table R1: Ablation results using RAG-PT based on subsets of preference data. VQA: Visual Question-Answering; RG: Report Generation.
| Model | IU-Xray (VQA) | IU-Xray (RG) | Harvard-FairVLMed (VQA) | Harvard-FairVLMed (RG) |
|---|---|---|---|---|
| LLaVA-Med-1.5 | 68.99 | 10.04 | 66.63 | 13.41 |
| +RAG-PT 1 | 80.19 | 19.38 | 79.42 | 18.37 |
| +RAG-PT 2 | 80.27 | 20.16 | 79.35 | 18.66 |
| +RAG-PT 3 | 81.30 | 19.43 | 80.07 | 18.92 |
| +RAG-PT 1+2 | 82.58 | 22.74 | 82.08 | 18.97 |
| +RAG-PT 1+3 | 82.96 | 24.50 | 82.87 | 19.22 |
| +RAG-PT 2+3 | 83.61 | 25.77 | 83.89 | 19.30 |
| +RAG-PT 1+2+3 | 85.58 | 29.69 | 87.02 | 20.31 |
The authors proposed MMed-RAG, a Multimodal Medical Retrieval-Augmented Generation framework. The authors introduce a domain-aware retrieval mechanism, using a domain identification module to adaptively select the appropriate retrieval model for medical images. MMed-RAG employs adaptive calibration to determine the number of retrieved contexts. RAG-based preference fine-tuning is used to enhance cross-modality and overall alignment, ensuring the model effectively utilizes input medical images and relevant retrieved information.
Strengths
The authors address the limitations of the original retrieval-augmented generation (RAG) method by tackling two key challenges: the reliance on retrieved contextual knowledge without consideration of the input image, and interference from incorrect retrievals, which results in misalignment with the ground truth. To solve these challenges, the authors propose MMed-RAG. First, it aims to enhance cross-modality alignment by ensuring the model utilizes input medical images before generating responses. Second, it aims to improve overall alignment by prompting the model to rely on retrieved contexts when uncertain, while minimizing interference from irrelevant information. MMed-RAG, as an innovative multimodal retrieval-augmented generation framework, significantly enhances the performance of medical image-text tasks. The manuscript is well-written and easy to follow.
Weaknesses
(1) The manuscript lacks a comparison with other general methods, such as GPT-4, Gemini, and QwenVLM. An intuitive comparison to these general-purpose models would provide valuable context for understanding the advantages and limitations of MMed-RAG. (2) A major concern is the use of preference datasets for fine-tuning, which may cause the model to focus excessively on hard samples, potentially leading to challenges such as catastrophic forgetting and overfitting. It would be beneficial to use an external validation dataset from the same domain to assess the model's generalizability.
Questions
Same as the weaknesses.
Thank you for reviewing our paper and for your valuable feedback. Below, we address your concerns point by point and we’ve revised our paper according to your suggestions. We would appreciate it if you could let us know whether your concerns are addressed by our response.
Q1: The manuscript lacks a comparison with other general methods, such as GPT-4, Gemini, and QwenVL. An intuitive comparison to these general-purpose models would provide valuable context for understanding the advantages and limitations of MMed-RAG.
A1: Thank you for your valuable suggestion. We have conducted a comparison of general models, including GPT-4, Gemini, QwenVL, LLaVA-1.6, and InternVL-2. As shown in Table R1, our findings show that MMed-RAG consistently outperforms these models, further demonstrating its effectiveness. We have revised the paper to include these details in Appendix A.6.2.
Table R1: Accuracy (%) of different Med-LVLMs based on LLaVA-Med-1.5 on medical VQA task.
| Models | Radiology (IU-Xray) | Radiology (MIMIC-CXR) | Ophthalmology (Harvard-FairVLMed) | Pathology (Quilt-1M) | Pathology (PMC-OA) |
|---|---|---|---|---|---|
| LLaVA-Med-1.5 | 75.47 | 75.79 | 63.03 | 62.80 | 59.28 |
| Ours | 89.54 | 83.57 | 87.94 | 72.95 | 64.54 |
| Med-Flamingo | 26.74 | 61.27 | 42.06 | 27.11 | 32.62 |
| MedVInT | 73.34 | 66.06 | 35.92 | 26.81 | 27.77 |
| RadFM | 26.67 | 69.30 | 52.47 | 27.02 | 25.12 |
| miniGPT-Med | 54.87 | 53.92 | 66.73 | 26.82 | 27.03 |
| GPT-4o | 63.25 | 60.61 | 61.50 | 53.56 | 49.70 |
| Gemini-1.5 | 59.73 | 61.02 | 58.53 | 56.88 | 52.17 |
| LLaVA-v1.6 | 58.05 | 63.70 | 48.52 | 35.73 | 38.54 |
| Qwen-VL-Chat | 59.43 | 60.43 | 38.06 | 28.74 | 29.53 |
| InternVL-2 | 54.06 | 59.47 | 44.38 | 37.82 | 34.40 |
Q2: A major concern is the use of preference datasets for fine-tuning, which may cause the model to focus excessively on hard samples, potentially leading to challenges such as catastrophic forgetting and overfitting. It would be beneficial to use an external validation dataset from the same domain to assess the model's generalizability.
A2: Considering the risk of overfitting, and following your suggestion, we use external validation datasets from the same domain to evaluate the generalizability of MMed-RAG. We select two domain-specific subsets from PubMedVision [1], i.e., fundus digital photography and microscopy images, for ophthalmology and pathology, respectively. As shown in Table R2, our method still significantly outperforms other Med-LVLMs on the external validation datasets, indicating that MMed-RAG generalizes well to external data. We have revised the paper and put these details in Appendix A.6.7.
Table R2: Performance comparison of models on external validation datasets.
| Model | Ophthalmology (BLEU) | Ophthalmology (ROUGE-L) | Ophthalmology (METEOR) | Pathology (Acc) | Pathology (F1) | Pathology (AUC) |
|---|---|---|---|---|---|---|
| LLaVA-Med-1.5 | 17.11 | 20.05 | 17.09 | 59.65 | 71.90 | 54.87 |
| Ours | 22.64 | 14.98 | 17.85 | 62.88 | 72.24 | 59.69 |
Reference
[1] Chen J, Gui C, Ouyang R, et al. Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. EMNLP 2024.
Thank you for the authors' response; it has effectively addressed my concerns.
Dear Reviewer tmxy,
Thank you for your quick response; we are happy to see that our response addresses your concerns. Thank you again!
This work proposes MMed-RAG, a RAG system that corrects for misalignment issues for context introduced by RAG for multimodal large vision models (LVLMs). MMed-RAG creates preference pairs for cross-modality alignment and finetunes their LVLM (LLaVA-Med-1.5 7B) via DPO. This is described as RAG-based preference fine-tuning (RAG-PT). Other components in MMed-RAG include heuristics for domain-aware retrieval across different sources. MMed-RAG is evaluated across several medical imaging domains (radiology, ophthalmology, pathology), and compared with recent LVLMs (Med-Flamingo, RadFM, miniGPT-Med) and other RAG systems (MedDr, FactMM-RAG, RULE).
Strengths
- RAG-PT (RAG-based preference fine-tuning) is a nice idea in improving the alignment performance when introducing RAG context to current LVLMs.
- Comprehensive evaluation against various medical LVLMs and their strategies for multimodal RAG and mitigating hallucination.
Weaknesses
- The contribution of domain-specific retrievers for RAG context in MMed-RAG is not clear to me. Practically, it is simpler to build domain-specific LVLMs that solve a targeted clinical problem well than generalist biomedical LVLMs. If we are evaluating an off-the-shelf model for VQA in radiology, why would we also include pathology and ophthalmology as possible documents for RAG? To evaluate this contribution, we would need to compare the performance of MMed-RAG against domain-specific LVLMs (radiology, pathology, ophthalmology). All comparisons in this experiment should ideally also adopt RAG-PT.
- Though the idea of RAG-PT is novel in its application to medical imaging, there exist several related works exploring DPO for RAG fine-tuning in LLMs. The authors should discuss these works in the related work section and explain how MMed-RAG adds to the literature [1,2].
- It is unclear if RAG-PT is specific to medical imaging domains only, or if it can be applied more broadly. I think it would be valuable to assess the contribution of RAG-PT in problems outside of medical imaging, especially where image-text prompts are likely to come from heterogeneous domains.
- Is the problem of misalignment when introducing RAG context specific to LLaVA-Med 1.5 7B, or does it also exist in other open-source (e.g., Med-Flamingo, miniGPT-Med) and commercial LVLMs (like GPT-4o)?
Refs
- Dong, G., Zhu, Y., Zhang, C., Wang, Z., Dou, Z. and Wen, J.R., 2024. Understand what llm needs: Dual preference alignment for retrieval-augmented generation. arXiv preprint arXiv:2406.18676.
- Li, X., Mei, S., Liu, Z., Yan, Y., Wang, S., Yu, S., Zeng, Z., Chen, H., Yu, G., Liu, Z. and Sun, M., 2024. RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards. arXiv preprint arXiv:2410.13509.
Questions
None
Reference
[1] Wu C, Zhang X, Zhang Y, et al. Towards Generalist Foundation Model for Radiology. arXiv preprint arXiv:2308.02463, 2023.
[2] Seyfioglu M S, Ikezogwo W O, Ghezloo F, et al. Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos. CVPR, 2024.
[3] Holland R, Taylor T R P, Holmes C, et al. Specialist Vision-Language Models for Clinical Ophthalmology. arXiv preprint arXiv:2407.08410, 2024.
[4] Dong G, Zhu Y, Zhang C, et al. Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation. arXiv preprint arXiv:2406.18676, 2024.
[5] Li X, Mei S, Liu Z, et al. RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards. arXiv preprint arXiv:2410.13509, 2024.
[6] Li H, Liu J, Wang Z, et al. LITE: Modeling Environmental Ecosystems with Multimodal Large Language Models. COLM, 2024.
Q2: Though the idea of RAG-PT is novel in its application to medical imaging, there exist several related works exploring DPO for RAG fine-tuning in LLMs. The authors should discuss these works in the related work section and explain how MMed-RAG adds to the literature [4,5].
A2: Thank you for your suggestion. DPA-RAG [4] addresses the alignment issues between the external reranker and the internal LLM through supervised fine-tuning. Later, RAG-DDR [5] leverages a rollout method to generate perturbed responses, further mitigating conflicts between parametric memory and external knowledge. We have discussed these references in the related work section of the revised paper. In addition, we would like to point out that RAG-DDR is concurrent work with ours and was released after the ICLR 2025 submission deadline.
Q3: It is unclear if RAG-PT is specific to medical imaging domains only, or if it can be applied more broadly. I think it would be valuable to assess the contribution of RAG-PT in problems outside of medical imaging, especially where image-text prompts are likely to come from heterogeneous domains.
A3: We have applied RAG-PT to one additional domain (i.e., environmental ecosystem modeling) to further validate its effectiveness. We conduct experiments on two environmental system modeling datasets [6]. The CRW-Temp dataset is a river water temperature prediction dataset aimed at forecasting the daily average water temperature on a given day from observed physical variables. The CRW-Flow dataset focuses on predicting river segment flow from observed physical variables. The model used is LITE [6], a LLaMA2-based large model for environmental systems. We train a semantic time-series encoder on time-series/text pairs, which works in conjunction with a text encoder as the retriever; we then retrieve the environmental descriptions most similar to the current one. As shown in Table R2, our approach also demonstrates significant performance improvements on tasks in this domain. We have revised the paper and put these details in Appendix A.6.4.
Table R2: Performance comparison of different models on CRW-Temp and CRW-Flow datasets.
| Model | CRW-Temp RMSE | CRW-Temp MAE | CRW-Flow RMSE | CRW-Flow MAE |
|---|---|---|---|---|
| LITE | 2.02 | 1.70 | 2.39 | 1.02 |
| +RAG | 1.93 | 1.62 | 2.27 | 0.96 |
| +RAG-PT | 1.74 | 1.46 | 2.11 | 0.90 |
Q4: Is the problem of misalignment when introducing RAG context specific to LLaVA-Med 1.5 7B, or does it also exist in other open-source (e.g., Med-Flamingo, miniGPT-Med) and commercial LVLMs (like GPT-4o)?
A4: Following the alignment analysis method we apply to LLaVA-Med-1.5 in Section 3.3, we conduct two alignment analysis tests on multiple open-source Med-LVLMs and commercial LVLMs using the Harvard-FairVLMed dataset with the incorporation of retrieved information. These tests respectively evaluate (1) cross-modality alignment and (2) overall alignment with the ground truth. As shown in Table R3, the results indicate that both existing open-source Med-LVLMs and commercial LVLMs exhibit misalignment issues with retrieved information. In addition, it is worthwhile to mention that GPT-4o demonstrates the best alignment performance compared with other models when incorporating RAG, especially in cross-modal alignment. This is likely because GPT-4o has been well-trained in visual perception and may also have utilized some post-training methods (like preference optimization) to optimize modal alignment. We have revised the paper and put these details in Appendix A.6.5.
Table R3: Comparison of Copy-Reference Rate and Over-Reliance Rate across different models.
| Model | Copy-Reference Rate (%) | Over-Reliance Rate (%) |
|---|---|---|
| LLaVA-Med-1.5 | 55.08 | 43.31 |
| Med-Flamingo | 60.17 | 33.74 |
| miniGPT-Med | 56.75 | 46.06 |
| GPT-4o | 12.54 | 24.80 |
Thank you for your valuable feedback to help us improve our paper. We have revised our paper based on your feedback. We detail our response below and please kindly let us know if our response addresses your concerns.
Q1: The contribution of domain-specific retrievers for RAG context in MMed-RAG is not clear to me. Practically, it is simpler to build domain-specific LVLMs that solve a targeted clinical problem well than generalist biomedical LVLMs. If we are evaluating an off-the-shelf model for VQA in radiology, why would we also include pathology and ophthalmology as possible documents for RAG? To evaluate this contribution, we would need to compare the performance of MMed-RAG against domain-specific LVLMs (radiology, pathology, ophthalmology). All comparisons in this experiment should ideally also adopt RAG-PT.
A1: We design a domain-aware retrieval mechanism for a generalist Med-LVLM: based on the identified modality of the input medical image, information is retrieved from the corresponding dedicated database. Here, the domain identification model reliably recognizes image modalities (~99.83% accuracy in our experiments). For radiology VQA tasks, input radiology images are classified as "radiology" by the model, enabling the retrieval of knowledge exclusively from the radiology database to enhance generation. All retrieved documents are specific to radiology and exclude other modalities.
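A small sketch of this routing logic is shown below, with illustrative helper names (`domain_classifier`, per-domain `retrievers` and `databases`); it is meant only to make the described mechanism concrete, and the value of `k` is assumed to come from the adaptive retrieved context selection step.

```python
def domain_aware_retrieve(image, domain_classifier, retrievers, databases, k):
    """Route retrieval by image modality, as described in A1: a domain
    identification model predicts the modality, and only the matching
    domain-specific retriever and knowledge base are used. All helper
    objects and method names here are illustrative assumptions."""
    domain = domain_classifier(image)          # e.g., "radiology"
    retriever = retrievers[domain]             # modality-specific CLIP-style retriever
    query_emb = retriever.encode_image(image)
    return databases[domain].search(query_emb, top_k=k)
```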
In addition, we conduct additional experiments to compare our method with domain-specific Med-LVLMs: Radiology: RadFM [1]; Pathology: Quilt-LLaVA [2]; Ophthalmology: RetinaVLM [3]. For radiology, we use the IU-Xray dataset to evaluate VQA. For pathology, we use the PMC-OA pathology subset to evaluate VQA. For ophthalmology, since the domain-specific Med-LVLM, i.e., RetinaVLM, is only trained on report-generation tasks, we use the Harvard-FairVLMed dataset to evaluate report generation. As shown in Table R1, our method significantly outperforms each domain-specific Med-LVLM. Additionally, following your advice, we apply RAG-PT to each domain-specific Med-LVLM. As shown in Table R1, after incorporating RAG-PT, the performance of these models improves significantly, demonstrating the compatibility of our method. Furthermore, domain-specific Med-LVLMs can outperform generalist Med-LVLMs in their specialized domains, as they are fine-tuned on specialized medical domain data. While this significantly enhances their medical understanding in specific domains, it may reduce their generalization ability, such as their capacity to comprehend retrieved information. Consequently, even after incorporating RAG-PT, the performance of several domain-specific Med-LVLMs (e.g., RetinaVLM and RadFM) remains inferior to that of MMed-RAG. We have revised the paper and put these details in Appendix A.6.3 and A.7.
Table R1: Model performance comparison with domain-specific Med-LVLMs.
| Model | Acc (Rad. VQA) | F1 (Rad. VQA) | AUC (Rad. VQA) | Acc (Path. VQA) | F1 (Path. VQA) | AUC (Path. VQA) | BLEU (Oph. RG) | ROUGE-L (Oph. RG) | METEOR (Oph. RG) |
|---|---|---|---|---|---|---|---|---|---|
| Radiology | |||||||||
| RadFM | 26.67 | 30.36 | 55.31 | - | - | - | - | - | - |
| + RAG-PT | 48.39 | 39.40 | 59.70 | - | - | - | - | - | - |
| Pathology | |||||||||
| Quilt-LLaVA | - | - | - | 62.59 | 72.30 | 56.96 | - | - | - |
| + RAG-PT | - | - | - | 64.72 | 73.36 | 61.39 | - | - | - |
| Ophthalmology | |||||||||
| RetinaVLM | - | - | - | - | - | - | 19.96 | 12.73 | 13.52 |
| + RAG-PT | - | - | - | - | - | - | 22.26 | 14.64 | 16.87 |
| Generalist Med-LVLM | |||||||||
| LLaVA-Med-1.5 | 75.47 | 64.04 | 67.46 | 59.28 | 71.98 | 54.19 | 18.11 | 11.36 | 10.75 |
| MMed-RAG | 84.10 | 71.92 | 86.40 | 64.54 | 73.09 | 61.42 | 24.82 | 16.59 | 19.85 |
Dear reviewer 3Jou,
We would like to follow up to see if the response addresses your concerns or if you have any further questions. We would really appreciate the opportunity to discuss this further if our response has not already addressed your concerns. Thank you again!
Hi reviewer 3Jou,
As the discussion period is ending soon, we would like to follow up to see if the response addresses your concerns or if you have any further questions. We would really appreciate the opportunity to discuss this further if our response has not already addressed your concerns. Thank you again!
Thank you for the authors' detailed response. The rebuttal addresses all of my points nicely. I plan to raise my score during the post-rebuttal discussion.
We sincerely appreciate the thoughtful discussion and your valuable feedback. We are delighted that our rebuttal has addressed all your concerns. We kindly note that your rating appears unchanged and would greatly appreciate it if you could take a moment to double-check your review. Thank you again!
We sincerely appreciate all reviewers for their insightful and constructive feedback. According to these comments, we have improved the paper (new pdf uploaded) and highlighted the main changes with blue text. Below, we summarize all changes:
- We have supplemented the discussion on two LLM-based RAG fine-tuning methods in the Related Work section. (Reviewer 3Jou)
- We have added a performance comparison of generalist LVLMs and other Med-LVLMs in Appendix A.6.2. (Reviewers 3Jou, WFqW and tmxy)
- We have included a comparison with domain-specific Med-LVLMs and their results after incorporating RAG-PT in Appendix A.6.3. (Reviewer 3Jou)
- We have added a comparison of models in the environmental ecosystems domain in Appendix A.6.4. (Reviewer 3Jou)
- We have conducted additional statistical analysis of Copy-Reference Rate and Over-Reliance Rate in Appendix A.6.5. (Reviewer 3Jou)
- We have added a more comprehensive ablation analysis of preference data in Appendix A.6.6. (Reviewer WFqW)
- We have included external validation in Appendix A.6.7. (Reviewer tmxy)
- We have presented the detailed BLEU score in Appendix A.6.8. (Reviewer WFqW)
- We have conducted an in-depth analysis of retrievers and their performance in Appendices A.6.9 and A.6.11. (Reviewer mHiX)
- We have conducted experiments under the few-shot setting in Appendix A.6.10. (Reviewer mHiX)
- We have supplemented experiments on rationale-guided RAG in Appendix A.6.12. (Reviewer mHiX)
- We have analyzed the contribution of domain-specific retrievers in Appendix A.7. (Reviewer mHiX)
- We have included an explanation of cross-modality alignment in Appendix A.8 and an analysis of noisy images in cross-modality alignment in Appendix A.9. (Reviewer mHiX)
- We have provided an explanation of over-reliance rate in Appendix A.10. (Reviewer mHiX)
Key Strengths:
- The paper introduces RAG-PT (RAG-based preference fine-tuning) to improve alignment when using RAG with medical Large Vision Language Models (LVLMs)
- Thorough testing against various medical LVLMs and detailed analysis of multimodal RAG strategies
- Shows significant improvements in factual accuracy (up to 43.8% gains) with real-world implications for medical diagnostics
Major Concerns:
- Evaluation and Comparison Issues:
  - Lacks comparison with general-purpose models (GPT-4, Gemini, QwenVLM)
  - Need for clearer metrics reporting (e.g., which BLEU score version)
  - Should include LLaVA-Med 1.5 few-shot baseline
  - Unclear evaluation data for alignment analysis
- Methodology Questions:
  - Concerns about preference dataset fine-tuning possibly leading to overfitting
  - Questions about the effectiveness of the retriever training
  - Need for clearer explanation of cross-modality alignment process
- Scope and Novelty:
  - Unclear if RAG-PT benefits extend beyond medical imaging
  - Question about whether combining three independent pieces constitutes meaningful methodological novelty
  - Need to clarify if misalignment issues exist in other LVLMs besides LLaVA-Med
- Domain-Specific Questions:
  - Unclear benefit of including multiple medical domains (radiology, pathology, ophthalmology) in retrieval
  - Need for comparison with domain-specific LVLMs
  - Better explanation needed for handling noisy contexts and images
Additional Comments from Reviewer Discussion
The authors have clearly clarified the issues raised by the reviewers. The explanations in the rebuttal look reasonable and correct to me.
Accept (Poster)