Uni-Med: A Unified Medical Generalist Foundation Model For Multi-Task Learning Via Connector-MoE
Abstract
Reviews and Discussion
The paper introduces Uni-Med, a medical generalist foundation model designed for multi-task learning across six different medical tasks. The proposed CMoE module leverages a mixture of projection experts to align visual and language embedding spaces effectively. The model demonstrates significant performance improvements across diverse medical tasks, validated through extensive experiments and ablation studies.
Strengths
- The introduction of the CMoE module to address the tug-of-war problem at the connector level is novel and well-executed.
- The paper provides thorough ablation studies to validate the effectiveness of the proposed CMoE module under various configurations.
- Uni-Med achieves impressive performance with minimal training computational overhead, highlighting its efficiency in handling large-scale multi-modal medical data.
Weaknesses
- The ablation studies show that certain configurations (e.g., using a high number of projection experts) might lead to overfitting. This aspect could be discussed in more detail, including strategies to mitigate overfitting.
- While the interpretation analysis only focuses on visual features across different tasks, an analysis of visual features across different medical image modalities should also be considered.
- It would be better to add more work on the evaluation of medical vision-language models to the related work section to ensure that the relevant work is fully discussed, such as [1,2].
[1] Yan Q, He X, Yue X, et al. Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA[J]. arXiv preprint arXiv:2405.20421, 2024.
[2] Xia P, Chen Z, Tian J, et al. CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models[J]. arXiv preprint arXiv:2406.06007, 2024.
Questions
- More detailed ablation studies.
- The complete interpretation analysis.
- More complete reference work.
Limitations
N/A
Thanks for your insightful comments!
Q1: The ablation studies show that certain configurations (e.g., using a high number of projection experts) might lead to overfitting. This aspect could be discussed in more detail, including strategies to mitigate overfitting.
A1: Thank you for your suggestion.
- Choosing the optimal parameter configuration in multi-modal multi-task scenarios has always been a concern, aiming to strike a balance between performance and computational efficiency [1-2]. It is a challenging problem, as the complexity of the scenario can lead to overfitting on simpler modalities or tasks and underfitting on more complex ones.
- As shown in Table 2 (c), in the experiment exploring the key parameter of the number of experts, increasing the number of experts still brings performance gains on some datasets, but the average gain tends to stabilize across all tasks and datasets.
- To our knowledge, recent studies [3-5] have also conducted ablation experiments on these key parameters, but observations differ across scenarios (number of modalities/tasks/data volumes). Therefore, we believe that selecting the optimal parameter settings on a development set is a simple and effective way to balance performance and computational efficiency.
Thanks to your feedback, we will provide more detailed discussions in the revised version. We will continue to focus on this in our future research.
Q2: While the interpretation analysis only focuses on visual features on different tasks, the analysis of visual features on different medical image modalities should be considered.
A2: Thank you for your suggestion. We also use the t-SNE method to visualize the distribution of visual features across medical image modalities and provide the results in Figure Re.1.
- Specifically, we first observe the visual feature distribution of different modalities under the same task in Figure Re.1 (a-c). We find that the feature distributions of CT and MRI modalities in the REG task have good discriminability after passing through the frozen visual encoder. After passing through the connector, the improvement in Silhouette score (from 0.3049 to 0.3335) is relatively limited.
- In addition, we select 100 samples from each of the 8 modalities and observe their visual feature distributions after passing through different visual encoders in Figure Re.1 (d-f). It can still be observed that the majority of modality distributions are ordered and tightly packed.
Based on the above observations, the distinction between medical image modalities is achieved effectively by the visual encoder, while task differentiation requires a well-designed connector. We will add the analysis of visual features across different medical image modalities in the revised version.
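For readers who wish to reproduce this style of modality-level analysis, a minimal sketch of the t-SNE projection and Silhouette-score computation is given below. The feature array, modality labels, and output file name are hypothetical placeholders rather than the actual Uni-Med pipeline, and the Silhouette score here is computed directly on the raw features, which may differ from the exact protocol used for Figure Re.1.

```python
# Minimal sketch of the modality-level feature analysis described above.
# `features` and `modality_labels` are hypothetical placeholders: in practice they
# would be visual embeddings collected before/after the connector, with one
# modality label (e.g., "CT", "MRI", "X-ray") per sample.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
features = rng.normal(size=(800, 512))          # [num_samples, feature_dim]
modality_labels = rng.integers(0, 8, size=800)  # 8 image modalities

# Quantify how well modalities separate in the feature space.
print("Silhouette score:", silhouette_score(features, modality_labels))

# Project to 2D with t-SNE for visualization.
embedded = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=modality_labels, cmap="tab10", s=5)
plt.title("t-SNE of visual features by modality")
plt.savefig("tsne_by_modality.png", dpi=200)
```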
Q3: It could be better to add more work on evaluation of medical vision-language models in the section of related work to make sure that the relevant work is fully discussed.
A3: Thank you for your suggestion. We attach great importance to your suggestions and will add cutting-edge developments [6-7] on evaluation of medical vision-language models to the related work section in the revised version.
References
[1] Liu Q, Wu X, Zhao X, et al. When MOE Meets LLMs: Parameter Efficient Fine-tuning for Multi-task Medical Applications[C]//Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024: 1104-1114.
[2] Chen S, Jie Z, Ma L. LLaVA-MoLE: Sparse mixture of LoRA experts for mitigating data conflicts in instruction finetuning MLLMs[J]. arXiv preprint arXiv:2401.16160, 2024.
[3] Chen T, Zhang Z, Jaiswal A K, et al. Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers[C]//The Eleventh International Conference on Learning Representations, 2023.
[4] Dou S, Zhou E, Liu Y, et al. LoRAMoE: Revolutionizing mixture of experts for maintaining world knowledge in language model alignment[J]. arXiv preprint arXiv:2312.09979, 2023.
[5] Gou Y, Liu Z, Chen K, et al. Mixture of cluster-conditional LoRA experts for vision-language instruction tuning[J]. arXiv preprint arXiv:2312.12379, 2023.
[6] Xia P, Chen Z, Tian J, et al. CARES: A Comprehensive Benchmark of Trustworthiness in Medical Vision Language Models[J]. arXiv preprint arXiv:2406.06007, 2024.
[7] Yan Q, He X, Yue X, et al. Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA[J]. arXiv preprint arXiv:2405.20421, 2024.
Thank you for the rebuttal, which addressed some of my concerns. I have increased my score and look forward to reading your revision at a future venue.
The paper presents Uni-Med, a medical generalist foundation model designed to perform multiple medical tasks efficiently through multi-task learning. This model introduces a Connector-Mixture-of-Experts (CMoE) module to mitigate the tug-of-war problem in multi-modal, multi-task optimization, which is a common issue in current models. Uni-Med achieves competitive or superior performance across six medical tasks without requiring task-specific fine-tuning.
Strengths
- Multi-modal multi-task optimization is a complex and important problem for large multimodal models. The introduction of the Connector-Mixture-of-Experts (CMoE) module, which employs a mixture of projection experts to align visual and language embedding spaces, shows superior performance on multiple tasks.
- The paper conducts a comprehensive interpretation analysis of the problem from the perspective of gradient optimization and parameter statistics.
- Extensive experiments demonstrate Uni-Med's effectiveness across multiple tasks and datasets.
Weaknesses
- The model currently supports only 2D images, whereas most commonly used medical imaging modalities, such as CT and MRI, are in 3D.
- For the report generation task, the evaluation should include metrics like RadGraph Score and RadCliQ, as BLEU and ROUGE cannot fully assess the semantic accuracy.
Questions
Is the number of projection experts correlated with the number of tasks or the number of image modalities?
Figure 5 shows that visual features of the same task are more tightly distributed. How would t-SNE behave for different modalities? Why are visual features related to tasks? A single image can be used for both classification and report generation.
Limitations
Yes.
Thanks for your insightful comments!
Q1: The model currently supports only 2D images, whereas most commonly used medical imaging modalities, such as CT and MRI, are in 3D.
A1: Thank you for your advice. Like most medical MLLMs, we input 2D slices and the corresponding questions for 3D modalities such as CT and MRI. We acknowledge that Uni-Med has certain limitations in handling genuine 3D medical image inputs. The primary challenge is that processing 3D images effectively requires a different visual encoder [1], and replacing the visual encoder to handle 3D images would compromise our ability to process 2D image datasets. This remains an area for future research.
Q2: For the report generation task, the evaluation should include metrics like RadGraph Score and RadCliQ, as BLEU and ROUGE cannot fully assess the semantic accuracy.
A2: Thank you for your suggestion. First, we review the concepts of the two metrics mentioned above:
- RadGraph-based metrics. The RadGraph model [2] parses radiology reports into graphs containing clinical entities and relations between them. The RadGraph F1 metric computes the overlap in entities and relations separately, then reports their average.
- RadCliQ. Radiology Report Clinical Quality (RadCliQ) is a composite metric that integrates RadGraph F1 and BLEU score in a linear regression model to predict the total number of errors in a report [3].
Second, we calculate and report RadGraph entity F1, RadGraph relation F1, RadCliQ-v0, and RadCliQ-v1 on the MIMIC-CXR dataset in Table Re.1, using the code released by Yu et al. [3]. The improvement in the RadGraph-based metrics and the decrease in RadCliQ both indicate that Uni-Med achieves better semantic accuracy in the report generation task.
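As a rough illustration of how a RadGraph-style F1 is assembled once reports have been parsed into entity and relation sets, a minimal sketch follows. The actual metric of Yu et al. [3] relies on the RadGraph parser's outputs and is more involved; the entity/relation tuples and labels below are purely illustrative assumptions.

```python
# Minimal sketch of a RadGraph-style F1 given already-parsed report graphs.
# In practice, entities and relations come from the RadGraph parser; here they
# are illustrative (text, label) and (head, tail, relation) tuples.
def set_f1(pred: set, ref: set) -> float:
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall)

def radgraph_style_f1(pred_entities, ref_entities, pred_relations, ref_relations):
    entity_f1 = set_f1(set(pred_entities), set(ref_entities))
    relation_f1 = set_f1(set(pred_relations), set(ref_relations))
    # Compute entity-level and relation-level overlap separately, then average.
    return entity_f1, relation_f1, (entity_f1 + relation_f1) / 2

pred_e = [("opacity", "observation"), ("lung", "anatomy")]
ref_e = [("opacity", "observation"), ("lung", "anatomy"), ("effusion", "observation")]
pred_r = [("opacity", "lung", "located_at")]
ref_r = [("opacity", "lung", "located_at")]
print(radgraph_style_f1(pred_e, ref_e, pred_r, ref_r))
```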
Q3: Is the number of projection experts correlated with the number of tasks or the number of image modalities?
A3: Thank you for your question. We believe both are important, and the answer depends on the actual scenario. We analyze the following three cases:
- Single-modal, multi-task. The same image may need to serve different tasks, so the number of experts should be related to the number of tasks.
- Multi-modal, single-task. The number of experts should be set according to the number of modalities.
- Multi-modal, multi-task. In this scenario, further analysis of the data is required. Taking Uni-Med as an example, when we visualize visual features separately by task and by modality, we find that the distribution of features by task is more chaotic, while the distribution by modality is more orderly (detailed in Q4 & A4). Therefore, CMoE needs to consider task information more, and the number of experts is more closely related to the number of tasks.
Q4: Figure 5 shows that visual features of the same task are more tightly distributed. How would t-SNE behave for different modalities? Why are visual features related to tasks? A single image can be used for both classification and report generation.
A4: Thank you for your question.
- In Figure 5, we visualize the distribution of visual features by task. It can be clearly observed that the distribution of features by task is chaotic in Figure 5 (a), meaning there is no obvious discrimination between tasks after the frozen visual encoder. Visual features of the same task are more tightly distributed after CMoE in Figure 5 (c) than after the MLP in Figure 5 (b).
- We use t-SNE to visualize the distribution of visual features by modality and provide the results in Figure Re.1. Specifically, we first observe the visual feature distribution of different modalities under the same task in Figure Re.1 (a-c). We find that the feature distributions of the CT and MRI modalities in the REG task already have good discriminability after the frozen visual encoder; after passing through the connector, the improvement in Silhouette score (from 0.3049 to 0.3335) is relatively limited. In addition, we select 100 samples from each of the 8 modalities and observe their visual feature distributions after passing through different visual encoders in Figure Re.1 (d-f). It can still be observed that the majority of modality distributions are ordered and tightly packed.
- The above findings also provide a new perspective for explaining the explicit task-conditioned projection in CMoE. When aligning visual and language embedding spaces through the connector in the Uni-Med scenario, task information is more difficult to distinguish than modality information.
- As described in the question, a single image can be used for different tasks. If the visual features are not related to the task, then the tokens of the same image input into the LLM are exactly the same (through the linear, MLP, and token-level CMoE connectors in Table 2). In this case, achieving multi-tasking relies entirely on the capability of the LLM. Instead, we assume that different tasks require attention to different image features: after passing through the connector, the features of the same image adaptively change for different tasks, which alleviates the negative impact of the tug-of-war problem on the LLM in multi-task learning. The significant improvement in experimental results supports the latter assumption.
References
[1] Bai F, Du Y, Huang T, et al. M3D: Advancing 3D medical image analysis with multi-modal large language models[J]. arXiv preprint arXiv:2404.00578, 2024.
[2] Jain S, Agrawal A, Saporta A, et al. RadGraph: Extracting clinical entities and relations from radiology reports[C]//Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, December 2021.
[3] Yu F, Endo M, Krishnan R, et al. Evaluating progress in automatic chest X-ray radiology report generation[J]. Patterns, 2023, 4(9).
Thank you for the response. The authors have adequately addressed my primary concerns, and I have no further questions. I will maintain my previous rating.
The authors propose to build a medical generalist multi-modal foundation model using a novel "connector mixture of experts" module to solve the problem of "multi-task" learning. Their connector-MOE technique introduces a projection and routing module from the visual encoder into the LLM that is explicitly conditioned on the underlying task. The model is tested on a set of medical tasks, and benchmarked against several other multimodal medical models demonstrating improved performance.
Strengths
The explicit task conditioned projection is novel, and integrates an older concept of MoEs into a SOTA MLLM framework.
The authors do an excellent job sourcing and assembling a large, multi-task, multi-modal set of medical benchmarks and performing an extensive set of ablations.
The paper is benchmarked broadly across multiple tasks - fitting the definition of a foundation model.
Weaknesses
I disagree with the claim that there is limited research on connecting modalities in multi-modal models. This is an area of immense interest and extensive research broadly within the field of machine learning.
I don't think that this is a unique medical issue, and the benchmarking of a novel architecture would be better served using datasets more common in the ML community. This is particularly important because, as the authors note, the medical datasets and models involved were hard for them to control for data leakage. Why not compare the routing technique against non-medical models on non-medical datasets where this issue won't be the case?
The results are hard to follow in two very large tables, Table 2 and Table 3, particularly Table 3, which is the comparison to existing standards. The explanation of these comparisons is quite brief and not well described in the text, despite being the primary comparison for the paper.
The paper cites data leakage as being an issue, but isn't dataset shift an issue too? If I understand correctly, Uni-Med is being tested on held-out samples from datasets it was trained on, while the other LLMs are being tested on a mixture of data that in some cases was even in their training datasets? A fairer comparison would be to utilize the same datasets across model architectures.
Questions
In the discussion of MoEs and the paper motivation, I think it could be clarified substantially. Lines 97-105 clearly disambiguate the usage of the term MoE, and it might be helpful to do this sooner in the introduction to help clarify this work for readers.
Generalist foundation model seems to be redundant? Isn't a "Foundation model" by definition "generalist"?
A simple linear projection and purely autoregressive design with visual instruction fine-tuning like LLaVA learns to implicitly condition the projected visual tokens on task and is the appropriate benchmark for this explicit routing framework.
I feel like the paper would, in general, benefit from framing it as an investigation into novel connectors with which to build any foundation model - medical or otherwise. Framing it as a novel medical foundation model focuses on the wrong thing, and makes me wonder why it isn't tested on a broader range of medical tasks and datasets, scaled up and down, and so forth.
Limitations
No limitations are mentioned, and I think that this is a missed opportunity. There are clearly limitations with regards to the comparisons to other models, datasets involved with these comparisons, and overall size of the involved models utilized in Uni-Med.
Thanks for your insightful comments!
Q1: I disagree with the claim that there is limited research on connecting modalities in multi-modal models.
A1: There is no doubt that research on connecting modalities in multi-modal models is popular and extensive. As mentioned in lines 4-6, 50-52, and 104-105, our research focuses specifically on the connector component of MLLMs. We will highlight this scope of our research in the revised version to avoid misunderstanding.
Q2: I don't think that this is a unique medical issue, and the benchmarking of a novel architecture would be better served using datasets more common in the ML community. As the authors note, the medical datasets and models involved were hard for them to control for data leakage. Why not compare the routing technique against non-medical models on non-medical datasets?
A2: Thank you for your suggestion.
- First, we agree that this is not a uniquely medical issue. However, when we construct medical MLLMs for multi-modal, multi-task scenarios, we observe the tug-of-war problem at the connector level within standard MLLM architectures. This is the background behind our proposal of Uni-Med.
- Second, it is of great significance to research and develop a medical generalist foundation model. Providing a superior solution to the tug-of-war problem, which is particularly serious due to the diversity of image modalities and tasks in the medical field, is our basic motivation. We believe that "solving problems in prominent fields" and "exploring generalization in general fields" are equally important contributions to the machine learning community.
- Third, model evaluation is plagued by data leakage not only in the medical field but also in the general field. In our work, the data leakage issue is only observed on MPx-Single using the model checkpoint provided by RadFM. We ensure that all experiments of Uni-Med are free of data leakage and that the results are reliable.
- Fourth, we conduct a preliminary exploration of the generalization of our method in the general field. We fully follow the training strategy of LLaVA-1.5 and report metrics on 9 benchmarks with/without CMoE in Table Re.2. The results show that the introduction of CMoE brings significant improvements on all benchmarks.
Q3: The results are hard to follow in the two very large Tables 2 and 3. The explanation of these comparisons is quite brief and not well described in the text, despite being the primary comparison for the paper.
A3: Thank you for your suggestion. Due to the page limit, we acknowledge that our explanation of the experimental results is inadequate. We will add these details in the revised version and hope the presentation will then be clear and easy to follow.
Q4: If I understand correctly, Uni-Med is being tested on held-out samples from datasets it was trained on, while the other LLMs are being tested on a mixture of data that in some cases was even in their training datasets?
A4: Your understanding is accurate. For the other medical MLLMs, we use readily available model checkpoints for testing. We are aware that a completely fair comparison would utilize the same datasets across model architectures, but we have ensured relative fairness:
- We use the official test set split for all datasets, except for Slake-VQA, as we utilize it to build data for other tasks. In this case, the Slake-VQA comparison is not entirely fair, but it is actually unfavorable to Uni-Med, because part of the test data was used for training by the other models.
- As for our report of RadFM's data leakage on the MPx-Single dataset (RadFM open-sources this dataset and provides a split), we conduct testing strictly according to that split, which only indicates that RadFM's model checkpoint may not have been trained according to this split.
- None of the ablation experiments has data leakage issues, and the effectiveness of CMoE in any configuration is reliable.
Q5: Line 97-105 clearly disambiguates the usage of the term MoE, and it might be helpful to do this sooner in the introduction to help clarify this work for readers.
A5: Thank you for your suggestion. As mentioned in line 50-53, current research to mitigate the tug-of-war problem mainly tailors the MoE approach to the language model components, overlooking the potential benefits of exploring and enhancing the connector. We will provide a clearer presentation of our motivation and the usage of MoE in the introduction section.
Q6: LLaVA learns to implicitly condition the projected visual tokens on task and is the appropriate benchmark for this explicit routing framework.
A6: In fact, all experiments with linear and MLP connectors in Table 2 use model architectures consistent with LLaVA and LLaVA-1.5, respectively. In addition, CMoE with the token-level routing strategy is also an implicitly conditioned projection architecture. We hope these explanations help clarify our benchmark settings.
Q7: The paper would benefit from framing it as an investigation into novel connectors with which to build any foundation model. Framing it as a novel medical foundation model focuses on the wrong thing, and makes me wonder why it isn't tested on a broader range of medical tasks and datasets.
A7: We elaborate on the background and motivation for choosing the medical field in the first and second points of A2. With our current computing resources, we have trained and tested on as wide a range of medical tasks as possible. Compared to existing medical MLLMs, Uni-Med has added more diverse tasks and datasets. We do look forward to having more data and a wider variety of tasks to validate our method, and we will continue to focus on this in our future research.
This paper introduces Uni-Med, which applies mixture of experts at the connector level for efficient training toward a unified medical multi-modal foundation model. The contributions of this work include: 1) curation of indexes to quantify the tug-of-war problem in multi-modal multi-task models; 2) a novel perspective of applying Connector-MoE to multi-modal multi-task models, which enables efficient training; 3) comprehensive ablative studies to evaluate various configurations for different modules; 4) commitment to providing open-source code and weights of the proposed method.
Strengths
- It is a technically solid work. The mitigation of the tug-of-war problem is justified from multiple perspectives, including the developed indexes, parameter statistics scores, routing weights, and t-SNE feature visualization.
- The proposed framework is evaluated thoroughly in ablation studies.
- Consistent improvements over existing open-source medical foundation models are observed.
- The presentation is clear and easy to follow.
Weaknesses
- While the reviewer appreciates the acknowledgment of several limitations of this work, it would be better to mention them in the main text, especially limitation no. 5 in lines 662-663. If the space is not sufficient, at least they should be briefly mentioned in the main text, and a reference to detailed limitations should be provided.
- Confusion about the training/fine-tuning details: for models presented in Table 3, are they individually fine-tuned for each dataset based on the split introduced in the appendix? How is the fine-tuning implemented? Is it end-to-end or LoRA fine-tuning? For the Uni-Med, the reviewer understands that it is trained on all datasets appearing in Table 3, but there is no individual fine-tuning. Is this understanding correct?
Questions
- Line 143 is missing an introduction about the soft router.
Limitations
Most of the limitations are discussed in the Appendix, while some of them are not addressable.
Thanks for your insightful comments!
Q1: While the reviewer appreciates the acknowledgment of several limitations of this work, it would be better to mention them in the main text, especially limitation no. 5 in lines 662-663. If the space is not sufficient, at least they should be briefly mentioned in the main text, and a reference to detailed limitations should be provided.
A1: Thank you for your suggestion. We have realized that the limitations of our work and related references should be mentioned in the main text. We will add a limitation section in the main text of the revised version.
Q2: Confusion about the training/fine-tuning details: for models presented in Table 3, are they individually fine-tuned for each dataset based on the split introduced in the appendix? How is the fine-tuning implemented? Is it end-to-end or LoRA fine-tuning? For the Uni-Med, the reviewer understands that it is trained on all datasets appearing in Table 3, but there is no individual fine-tuning. Is this understanding correct?
A2: Thank you for your questions. We will elaborate on the implementation details of the models in Table 3 to reduce your confusion.
- Your understanding of Uni-Med is basically accurate. Uni-Med achieves joint training on six distinct medical tasks and 12 datasets, requiring only one-stage training on a single A800 GPU and no task- or dataset-specific fine-tuning. It strictly follows the dataset split introduced in the appendix.
- For model comparison, we use readily available model checkpoints for testing. The details are as follows: (1) Training data and model type. Except for Med-Flamingo, the raw training data of the comparison models all contain some of the datasets in Table 3. For example, LLaVA-Med uses full-parameter fine-tuning on Slake-VQA and Path-VQA separately, which means it offers different dataset-specific model checkpoints. XrayGPT is a task-specific model trained on MIMIC-CXR. RadFM is a generalist foundation model whose training data includes Slake-VQA, MIMIC-CXR, and MPx-Single. (2) Dataset splits. We use the official test set split for all datasets, except for Slake-VQA, as we utilize it to build data for other tasks. In this case, the Slake-VQA comparison is not entirely fair, but it is actually unfavorable to Uni-Med, because part of the test data was used for training by the other models.
- A completely fair comparison across different model architectures would use the same dataset split for training and testing. The medical MLLMs used for comparison all follow the standard architecture consisting of a vision encoder, a connector (e.g., XrayGPT: linear layer; LLaVA-Med: MLP), and an LLM. From this perspective, the experiments with linear and MLP connectors in Table 2, to some extent, represent the results of the XrayGPT and LLaVA-Med frameworks, respectively, under the same training and fine-tuning strategy as Uni-Med.
Thank you for recognizing our extensive experiments and analyses. To reduce the confusion of readers, we will add more implementation details in the revised version.
Q3: Line 143 is missing an introduction about the soft router.
A3: Thank you for your meticulous review and reminder. The soft router receives input tokens and calculates the routing weights for each expert. We will add the missing introduction about the soft router in the revised version.
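For readers unfamiliar with the term, a minimal sketch of a soft-routed mixture of projection experts at the connector is shown below. This is an illustration of the general idea only, not Uni-Med's actual CMoE implementation; the dimensions, the task-embedding conditioning, and the linear expert structure are all assumptions made for the example.

```python
# Minimal sketch of a connector with a soft-routed mixture of projection experts.
# Illustrative only; visual_dim, llm_dim, and the task-conditioned routing are assumptions.
import torch
import torch.nn as nn

class SoftRoutedConnector(nn.Module):
    def __init__(self, visual_dim=1024, llm_dim=4096, num_experts=4, num_tasks=6):
        super().__init__()
        # Each expert is a projection from the visual space to the LLM embedding space.
        self.experts = nn.ModuleList(
            [nn.Linear(visual_dim, llm_dim) for _ in range(num_experts)]
        )
        self.task_embed = nn.Embedding(num_tasks, visual_dim)
        # Soft router: produces a weight for every expert (no hard top-k selection).
        self.router = nn.Linear(visual_dim, num_experts)

    def forward(self, visual_tokens, task_id):
        # visual_tokens: [batch, num_tokens, visual_dim]; task_id: [batch]
        cond = visual_tokens + self.task_embed(task_id)[:, None, :]
        weights = torch.softmax(self.router(cond), dim=-1)                 # [B, T, E]
        expert_out = torch.stack(
            [expert(visual_tokens) for expert in self.experts], dim=-2
        )                                                                   # [B, T, E, llm_dim]
        return (weights.unsqueeze(-1) * expert_out).sum(dim=-2)            # [B, T, llm_dim]

tokens = torch.randn(2, 32, 1024)
task_id = torch.tensor([0, 3])
print(SoftRoutedConnector()(tokens, task_id).shape)  # torch.Size([2, 32, 4096])
```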
I thank the author for their time and effort preparing for the rebuttal. After reading the rebuttal, I have a follow-up question about the comparison with open-source MLLM and Table 3 (correspond to Q2 and A2 above).
Other models, such as LLaVA-Med and Med-Flamingo, are never trained on some of the other datasets. For instance, the authors used the LLaVA-Med checkpoints that are not trained on MIMIC-CXR and also did not fine-tune it on the MIMIC-CXR dataset. Rather, that checkpoint was directly applied to the test set of MIMIC-CXR, and the results were reported in Table 3. Is this understanding correct?
Thanks again for your patience and meticulousness! For the follow-up question mentioned above:
- Your understanding in the comment is correct. Taking the evaluation of LLaVA-Med as an example, for Slake-VQA and Path-VQA we use the third-stage checkpoints (dataset-specific fine-tuning) for each dataset separately; for the other datasets, we use the second-stage checkpoint (medical instruction tuning). Some open-source medical MLLMs have never been trained on some of the datasets; in these cases, we have annotated "zero-shot" (i.e., gray background) in Table 3.
- If you are concerned about the performance of other models under the same data and training strategy as Uni-Med, the experiments with linear and MLP connectors in Table 2, to some extent, represent the results of the XrayGPT and LLaVA-Med frameworks, respectively.
- Furthermore, we would like to clarify the purpose of Table 3: (1) Uni-Med covers more medical tasks than existing open-source medical MLLMs; (2) Uni-Med achieves competitive or superior evaluation metrics on various medical tasks compared to other "task-specific" MLLMs (e.g., LLaVA-Med for VQA, XrayGPT for report generation).
We hope these explanations are helpful for you to address your concerns.
Thank you for your prompt reply. I have to admit that I overlooked this detail in my initial review; I thought each compared MLLM was not trained on all 12 datasets collected in this study but was fine-tuned on each task/dataset separately.
If the other MLLMs are not fine-tuned for specific tasks and datasets (e.g., LLaVA-Med is not fine-tuned for report generation on MIMIC-CXR), the comparison in Table 3 is too unfair. What is the challenge hindering you from at least partially fine-tuning via LoRA? Even if the performance gap after fine-tuning is smaller, it could still justify the major contribution that Uni-Med covers more tasks and does not need any fine-tuning compared to some open-source medical MLLM.
Thanks for your suggestion! We think there is some confusion between two different concepts: comparing model architectures' capabilities and comparing existing models' capabilities.
- Comparing model architectures' capabilities. A completely fair comparison across different model architectures uses the same dataset split for training and testing. Table 2 represents the results of this comparison strategy. In particular, the experiments with linear and MLP connectors, to some extent, represent the architectural capabilities of XrayGPT and LLaVA-Med, respectively.
- Comparing existing models' capabilities. Any fine-tuning will inevitably change the initial capability of a model. Therefore, in Table 3, we directly compare Uni-Med with existing open-source medical MLLMs using their available checkpoints. Under this comparison strategy, if a model has been trained on a certain dataset, it is fairly comparable to Uni-Med on that dataset. The results indicate that Uni-Med has advantages over these models, which further demonstrates the effectiveness of our method. The zero-shot results are listed for reference only, not for the purpose of comparing model capabilities.
Your comments and suggestions have sparked our deep thinking. Considering your concerns, we are very willing to use LLaVA-Med as an example for fine-tuning on specific datasets. We will do our best to provide you with corresponding results before the discussion deadline.
Thanks for your patience!
Considering your concerns and suggestions, we have supplemented the relevant experiments with LLaVA-Med. Specifically, we use the second-stage checkpoint (medical instruction tuning) to perform two strategies of LLM full-parameter fine-tuning: (1) dataset-specific fine-tuning; (2) joint-training fine-tuning. The data split and the prompt format are completely consistent with Uni-Med and LLaVA-Med, respectively. Both strategies are trained for 3 epochs (the same as Uni-Med). The results are as follows:
| Task | Dataset | Metric | LLaVA-Med (Joint Training) | LLaVA-Med (Dataset-specific) | Uni-Med (Joint Training) |
|---|---|---|---|---|---|
| Visual Question Answering | Slake-VQA | BLEU-1 | 33.69 | 72.00 | 82.12 |
| | | F1 | 35.83 | 73.07 | 83.07 |
| | Path-VQA | BLEU-1 | 37.79 | 56.86 | 58.07 |
| | | F1 | 38.55 | 57.51 | 58.74 |
| Report Generation | MIMIC-CXR | BLEU-1 | 20.43 | 21.03 | 27.79 |
| | | BLEU-4 | 4.86 | 4.96 | 6.46 |
| | | ROUGE-1 | 26.11 | 28.28 | 28.81 |
| | | ROUGE-2 | 7.66 | 9.01 | 9.62 |
| | | ROUGE-L | 19.00 | 20.61 | 22.58 |
| | | METEOR | 8.73 | 8.89 | 10.59 |
| | MPx-Single | BLEU-1 | 15.11 | 14.63 | 15.80 |
| | | BLEU-4 | 2.40 | 1.75 | 2.47 |
| | | ROUGE-1 | 13.22 | 13.03 | 14.32 |
| | | ROUGE-2 | 2.39 | 2.19 | 2.68 |
| | | ROUGE-L | 10.99 | 10.85 | 12.29 |
| | | METEOR | 5.83 | 5.79 | 5.92 |
| Image Classification | DermaMNIST | Accuracy | 25.84 | 79.95 | 76.96 |
| | OrganSMNIST | Accuracy | 66.80 | 77.84 | 78.07 |
| Referring Expression Comprehension | Slake-REC | IoU | 4.07 | 22.41 | 37.71 |
| | | R@0.5 | 1.99 | 18.41 | 39.30 |
| | SA-Med2D-REC | IoU | 8.64 | 17.67 | 21.60 |
| | | R@0.5 | 4.75 | 9.98 | 14.42 |
| Referring Expression Generation | Slake-REG | BLEU-1 | 27.21 | 50.79 | 75.78 |
| | | F1 | 30.97 | 53.15 | 77.35 |
| | | Accuracy | 20.40 | 44.78 | 68.16 |
| | SA-Med2D-REG | BLEU-1 | 45.83 | 55.15 | 61.47 |
| | | F1 | 47.11 | 55.98 | 62.17 |
| | | Accuracy | 40.80 | 50.92 | 57.69 |
We can observe that:
- There is a serious tug-of-war problem when using the original LLaVA-Med architecture for joint fine-tuning on multiple tasks and datasets. The dataset-specific fine-tuning strategy significantly improves the evaluation metrics on each dataset.
- It is worth noting that Uni-Med achieves competitive and leading results through joint training, without any dataset-specific fine-tuning. It can be concluded that our method provides a superior solution to the tug-of-war problem, reducing interference and promoting more efficient knowledge sharing.
Dear Reviewers,
We would like to express our heartfelt gratitude for your invaluable time, expertise, and meticulous attention in reviewing our manuscript. The insightful comments and constructive feedback have immensely enriched the quality and rigor of our work.
We appreciate that the reviewers acknowledge the advantages of our work:
- About module design. The introduction of the CMoE module to address the tug-of-war problem at the connector level is novel and well-executed (Reviewer wK4x); the explicit task-conditioned projection is novel and integrates an older concept of MoEs into a SOTA MLLM framework (Reviewer Snz8).
- About experiments. The paper provides thorough ablation studies to validate the effectiveness of the proposed CMoE module under various configurations (Reviewer wK4x); extensive experiments demonstrate Uni-Med's effectiveness across multiple tasks and datasets (Reviewer b87p); the authors do an excellent job sourcing and assembling a large, multi-task, multi-modal set of medical benchmarks and performing an extensive set of ablations (Reviewer Snz8); the paper is benchmarked broadly across multiple tasks, fitting the definition of a foundation model (Reviewer Snz8); the proposed framework is evaluated thoroughly in ablation studies (Reviewer mvVf).
- About results and significance. Uni-Med achieves impressive performance with minimal training computational overhead, highlighting its efficiency in handling large-scale multi-modal medical data (Reviewer wK4x); multi-modal multi-task optimization is a complex and important problem for large multimodal models, and the CMoE module, which employs a mixture of projection experts to align visual and language embedding spaces, shows superior performance on multiple tasks (Reviewer b87p); it is a technically solid work (Reviewer mvVf); consistent improvements over existing open-source medical foundation models are observed (Reviewer mvVf).
- About analysis. The paper conducts a comprehensive interpretation analysis of the problem from the perspective of gradient optimization and parameter statistics (Reviewer b87p); the mitigation of the tug-of-war problem is justified from multiple perspectives, including the developed indexes, parameter statistics scores, routing weights, and t-SNE feature visualization (Reviewer mvVf).
- About writing. The presentation is clear and easy to follow (Reviewer mvVf).
On the other hand, we actively adopt the suggestions put forward by the reviewers and diligently address all the issues. Allow us to summarize the revisions made during the rebuttal:
- Exploring visual feature distributions across image modalities. We conduct an additional interpretation analysis focusing on visual features across different medical image modalities, using the t-SNE method for visualization, and obtain instructive observations.
- Adding metrics for the report generation task. To more fully assess semantic accuracy, RadGraph F1 and RadCliQ are used to evaluate the results of different models on the MIMIC-CXR dataset.
- Emphasizing the contribution to the ML community. Although this is not a uniquely medical issue, we observe that the tug-of-war problem is particularly serious due to the diversity of image modalities and tasks in the medical field. We believe that "solving problems in prominent fields" and "exploring generalization in general fields" are equally important contributions to the machine learning community. Through extensive experiments and interpretability analysis from multiple perspectives, Uni-Med has shown its effectiveness in mitigating the tug-of-war problem in the medical field, and the main idea is instructive for the general field.
- Clarifying fair comparison. None of the ablation experiments has data leakage issues, and the effectiveness of CMoE in any configuration is reliable. For model comparison, we use readily available model checkpoints for testing. We use the official test set split for all datasets, except for Slake-VQA, as we utilize it to build data for other tasks. In this case, the Slake-VQA comparison is not entirely fair, but it is actually unfavorable to Uni-Med, because part of the test data was used for training by the other models.
- Preliminary exploration of generalization in the general field. We fully follow the training strategy of LLaVA-1.5 and report metrics on 9 benchmarks with and without CMoE. The results show that the introduction of CMoE brings significant improvements on all benchmarks.
- Writing revision and explanation. Thanks to the reviewers' feedback, in the revised version we will (1) provide more detailed experimental discussions; (2) provide clearer explanations of the usage details of the comparison models; (3) supplement, emphasize, and refine some textual expressions; (4) move the limitations section to the main text.
- Adding references. We will add cutting-edge developments on the evaluation of medical vision-language models to the related work section.
- Future work. Addressing the limitations in handling 3D image data, the optimal setting of key parameters such as the number of experts in different scenarios, and validation on more data and a wider variety of tasks will be part of our future work.
This paper introduces a novel approach to addressing the tug-of-war problem in multi-modal, multi-task optimization within the medical field. The authors propose a unified medical generalist foundation model, Uni-Med, which leverages a Connector Mixture-of-Experts (CMoE) module to efficiently align visual and language embedding spaces. The model is evaluated across six distinct medical tasks, including question answering, visual question answering, report generation, and image classification, demonstrating competitive or superior performance compared to existing medical multimodal large language models (MLLMs). Extensive experiments, including ablation studies, validate the effectiveness of the proposed CMoE module. The paper also provides a comprehensive interpretation analysis of the tug-of-war problem from the perspective of gradient optimization and parameter statistics. After the rebuttal, the paper received mixed feedback, including one strong accept, one accept, one weak accept, and one reject. The reviewers generally found the paper technically sound, with a well-executed novel approach to solving a critical issue in medical MLLMs. The CMoE module was particularly praised for its innovative design and its ability to improve performance across multiple tasks. However, there were concerns about the brief explanation of the comparisons and the fairness of the comparison. For fairness, the authors used the official test set split for all datasets, except for Slake-VQA, as the authors utilized it to build data for other tasks. After carefully considering the paper and all comments, the AC believes the detailed explanation issue can be resolved in the revised version, and the effectiveness of the proposed method can be demonstrated by the results on other testing datasets, except for Slake-VQA. Therefore, the AC recommends accepting this paper.