Preference fine-tuning for factuality in chest X-ray interpretation models without human feedback
Abstract
Reviews and Discussion
This is an empirical analysis paper based on the CheXagent VLM, which compares the performance of different alignment methods for preference fine-tuning in medical report generation, including DPO, KTO, SimPO, IPO, and ORPO.
Strengths
The article explores a preference fine-tuning pipeline and, by aligning the VLM against the GREEN metric, substantially improves GREEN performance over the SFT baseline.
Weaknesses
- This is an experimental analysis article, but the generalizability of the experimental conclusions is insufficient, as it only evaluates one medical VLM model, leaving unclear whether the conclusions apply to other models as well.
- In preference fine-tuning, the quality of the generated candidates seems critical; however, the overall performance of report generation is not good enough, with metrics like precision and recall generally below 0.5. The SFT baseline struggles to produce high-quality candidates, making it difficult to ensure the chosen candidates are reliable, which raises doubts about the alignment's ability to optimize the model in the desired direction.
- The paper lacks clinical efficacy (CE) metrics, which are commonly used and very important in report generation. Although preference fine-tuning optimized for the GREEN metric performs much better than the baseline, the GREEN metric itself is not necessarily what is clinically relevant. Clinicians are more concerned about whether a specific disease in the image has been detected, so a comparison of CE metrics for dozens of diseases in MIMIC-CXR is needed, including precision, recall, and F1-score.
- This paper is vague in the description of some results or conclusions, lacking in-depth analysis. For example, in the experimental results in Table 5, what are the reasons for the error increase in alignment methods in (e) and (f) compared to the SFT baseline? Additionally, the authors state that "verbosity will limit the clinical utility of generated reports," yet, according to the results, SimPO has the smallest verbosity bias but performs the worst among the alignment methods. What is the relationship between verbosity bias and the quality of the generated reports? This is not clearly explained.
- The alignment methods lead to performance drops in additional diverse tasks, albeit not significantly. This raises questions about the practicality and reliability of performing alignment on medical VLMs. While alignment improves report generation performance, indicating better model understanding of images, why does it not exhibit advantages in other image understanding tasks? What exactly does alignment change in the model? Is it merely a shift to a radiologist-friendly expression? This is not the focus in the clinical setting, and what we care about more is the model’s general ability to understand images.
Questions
Please refer to the Weaknesses section for details.
-
W4. Verbosity bias and report quality.
A: Please see our response in two parts below:
First, why do we observe an increase in errors for certain subcategories in Table 5? To be frank, we do not know exactly. However, since both “(e) Mentioning a comparison that isn’t in the reference” and “(f) Omitting a comparison detailing a change from a prior study” deal with comparisons, it may be because we consider both report generation and “progression report generation” during alignment. That is, the two chest X-rays may be from the same point in time or from two different points in time; if they are from two different points in time, the model is asked to compare the current image with the prior image. It is possible that this formulation has shifted the probabilities of tokens related to “comparisons”, leading to two types of errors occurring more frequently: making comparisons when the two chest X-rays are from the same point in time, and failing to make comparisons when the two images are from different points in time.
Second, what is the relation between verbosity and quality? While DPO and IPO exhibited significant length exploitation, yielding no clinically meaningful improvements, ORPO and KTO achieved meaningful clinical improvements by generating less false information, ultimately improving factuality. Please see our response to W2 from reviewer wmBo for a more detailed discussion. -
W5. What has changed in the model post-alignment?
A: Please see our response in two parts below:
First, why does improved report generation capability not improve other tasks? This is an interesting question, related to the second part of this response. It is not clear to us that improved report generation capability reflects an improvement in image understanding. Rather, what we are trying to achieve is an improved alignment between the image encoder and text decoder, essentially a better use of the image understanding already embedded in the image encoder. Hence, we do not expect improved performance on other image understanding tasks post-alignment. In fact, we are very happy with the result that there is no change in the performance on these diverse tasks.
Second, what is actually improved post-alignment? Mechanistically, the objective of alignment algorithms is to reallocate probability mass towards the preferred completions. Stylistic differences between the accepted and rejected responses will therefore be reinforced. The content, and style, was selected by GREEN in our work and, as shown in our reader study, this led to a model that was preferred by the radiologists. However, alignment not only improves style but also leads to an improvement in clinical efficacy metrics, outlined in our response to W3 above. Our ultimate objective is to enable an AI-radiologist workflow where the VLM drafts an initial report. Hence, even if the only change is to generate reports in a different style that is preferred by radiologists, this is considered a win. In fact, commercially available products, such as RAD-AI, go to great lengths to mimic a specific radiologist's style. Connecting back to the issue of the alignment tax, preference alignment, essentially subjective optimization, kept the objective evaluation (the diverse image understanding tasks) fixed, which is an encouraging result.
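For concreteness, the probability-mass reallocation described above can be read off the standard sequence-level DPO objective (the other DAAs we consider are variations on this template); for a preference triple $(x, y_w, y_l)$, reference policy $\pi_{\mathrm{ref}}$, and KL weight $\beta$:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Minimizing this loss pushes probability mass towards the chosen report $y_w$ and away from the rejected report $y_l$, which is exactly the mechanism referred to above.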
If you have any additional questions or need further clarification, please let us know.
First of all, we thank the reviewer for valuable and insightful feedback. We address your concerns below (split into two comments due to the character limit):
-
W1. Generalizability.
A: We are currently working on implementing MAIRA-2 [1], a state-of-the-art VLM developed by Microsoft. If finished in time, these results will be included in the final camera-ready version. However, as also mentioned in Q3 from reviewer oXGy, the main contribution of this paper is the automated preference generation technique; the model itself is just a vehicle for us to assess whether the preference data is sufficiently good to lead to clinically meaningful improvements. -
W2. Quality of preference data.
A: Thank you for drawing attention to this important detail! The quality of the generated candidates is indeed critical. However, certain DAAs are more dependent on a strong SFT baseline than others. For instance, DPO and IPO require a strong baseline, capable of generating high-quality candidates, to be successful. In contrast, KTO is less sensitive to a strong baseline and can actually be applied without prior fine-tuning [2]. In addition, ORPO bypasses this problem entirely by explicitly including SFT as a part of the alignment process. Hence, the large performance difference between DPO/IPO and KTO/ORPO could partially be explained by the quality of the SFT baseline. We have updated the manuscript to make this more clear. -
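To sketch why ORPO sidesteps the need for a separately trained SFT starting point (the notation below follows the common formulation of ORPO and is ours, not necessarily the exact implementation used in the paper), its loss couples the SFT negative log-likelihood on the chosen report with an odds-ratio preference term weighted by $\lambda$:

$$\mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x, y_w, y_l)}\left[\mathcal{L}_{\mathrm{SFT}}(y_w \mid x) - \lambda \log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right)\right], \qquad \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}$$

Because the SFT term is optimized jointly with the preference term, alignment does not presuppose a strong pre-existing SFT policy.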
W3. Clinical efficacy metrics.
A: Thank you for highlighting this very important point! Based on prior results on GREEN, including its strong correlation with expert human judgment, we do treat it as a silver standard. However, as you correctly pointed out, this need not be what clinicians are interested in, and we should include CE metrics. Inspired by MAIRA-2 [1], we approach this problem by extracting labels (14 categories) from the generated and reference reports using the CheXbert labeler [3]. We then measure the F1 score. The following table has been included in section 4.3. We observe an 8.4% and 5.9% increase in the micro and macro averages for KTO, and an 8.1% and 6.7% increase for ORPO.

| Model (F1) | ECm. | Cmgl. | LOpac. | LLes. | Edema | Cnsl. | Pna. | Atel. | Pmtx. | PEff. | POth | Frac. | SuDev. | NoF. | Micro | Macro |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CheXagent | 0.347 | 0.620 | 0.461 | 0.171 | 0.493 | 0.158 | 0.227 | 0.453 | 0.444 | 0.655 | 0.092 | 0.240 | 0.787 | 0.304 | 0.509 | 0.389 |
| +DPO | 0.383 | 0.688 | 0.257 | 0.144 | 0.352 | 0.254 | 0.087 | 0.349 | 0.268 | 0.625 | 0.149 | 0.219 | 0.815 | 0.333 | 0.500 | 0.352 |
| +KTO | 0.400 | 0.683 | 0.425 | 0.240 | 0.554 | 0.167 | 0.164 | 0.441 | 0.500 | 0.724 | 0.130 | 0.158 | 0.840 | 0.340 | 0.552 | 0.412 |
| +IPO | 0.423 | 0.675 | 0.307 | 0.178 | 0.433 | 0.189 | 0.111 | 0.335 | 0.261 | 0.643 | 0.185 | 0.146 | 0.819 | 0.326 | 0.513 | 0.359 |
| +SimPO | 0.381 | 0.668 | 0.398 | 0.150 | 0.320 | 0.167 | 0.178 | 0.332 | 0.456 | 0.669 | 0.152 | 0.078 | 0.812 | 0.351 | 0.506 | 0.365 |
| +ORPO | 0.348 | 0.684 | 0.479 | 0.201 | 0.492 | 0.224 | 0.247 | 0.475 | 0.511 | 0.698 | 0.072 | 0.177 | 0.835 | 0.365 | 0.550 | 0.415 |
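As a minimal sketch of how these CE metrics can be computed from labeler outputs (here `chexbert_label`, `generated_reports`, and `reference_reports` are illustrative placeholders, and we assume the 14 per-report observations have been binarized, e.g. positive mentions mapped to 1 and everything else to 0, which may differ from the exact mapping used in the paper):

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative placeholder: run the CheXbert labeler over a list of report strings and
# return a (num_reports, 14) binary matrix over the 14 observation categories.
gen_labels = np.asarray(chexbert_label(generated_reports))   # predictions from the VLM
ref_labels = np.asarray(chexbert_label(reference_reports))   # ground-truth reports

micro_f1 = f1_score(ref_labels, gen_labels, average="micro")   # pooled over all categories
macro_f1 = f1_score(ref_labels, gen_labels, average="macro")   # unweighted mean over categories
per_class_f1 = f1_score(ref_labels, gen_labels, average=None)  # one F1 score per category
```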
References:
[1] Shruthi Bannur, Kenza Bouzid et al. “MAIRA-2: Grounded Radiology Report Generation”, arXiv preprint arXiv:2406.04449, 2024.
[2] Kawin Ethayarajh et al. “KTO: Model Alignment as Prospect Theoretic Optimization”, Proceedings of the 41st International Conference on Machine Learning, 2024.
[3] Akshay Smit, Saahil Jain, Pranav Rajpurkar, et al. “Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT”. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
Thanks for your response. Based on the supplementary experimental results, I am more concerned about the performance of the baseline CheXagent (Macro F1: 0.353). PromptMRG (Jin et al., AAAI'24) reports a Macro F1 score of 0.476 on the MIMIC dataset, which is significantly higher than that of the CheXagent used in this paper. Most preference fine-tuning strategies perform worse than the baseline, and although KTO improved to 0.387, there is still a considerable gap compared to PromptMRG. It therefore remains uncertain whether the experimental conclusions of the paper can be generalized to other scenarios and stronger baselines. Besides, I remain concerned about why a model that performs better in report generation shows no advantages in downstream image understanding tasks. The current explanation has not been convincing enough for me. Given that my concerns still hold, I will keep my original rating.
We thank you for your insightful comment, as it has drawn our attention to some issues with our presented CE metrics. In particular, the number that you cite is based on the CheXbert labeler [1], whereas we initially used the CheXpert labeler [2]. CheXbert is a slightly newer and more accurate labeler, and we have opted to update our results accordingly. We have edited our previous post and the manuscript.
In addition, thank you very much for alerting us to the PromptMRG work. Upon a careful review of that manuscript, we note that its best reported macro-averaged F1 score for the MIMIC dataset is actually 0.381. This can be found in Table 2 on page 12 (appendix), with a prelude and a reference to the appendix given in the Disease balance subsection on page 7. In contrast, employing the CheXbert labeler, our SFT baseline (CheXagent) has a macro-averaged F1 score of 0.389 on the MIMIC dataset. This gives us confidence that our baseline model is still among the state of the art. In addition, as can be seen in our edited post above, the baseline aligned using ORPO achieves a macro-averaged F1 score of 0.415, a 6.7% improvement on a state-of-the-art baseline.
Nonetheless, your point is very well taken that our work would become stronger with the addition of further report generation models. This is something we are currently working on, with some preliminary results on MAIRA-2, and we hope to provide an update during the remaining review period as well as before the camera-ready deadline.
References:
[1] Akshay Smit, Saahil Jain, Pranav Rajpurkar, et al. “Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT”. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020.
[2] Jeremy Irvin, Pranav Rajpurkar, et al. “CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison.” In Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
This paper develops a framework for fine-tuning VLMs without requiring clinician annotation. Using various DAA algorithms, the authors claim to have achieved good performance based on automated scoring metrics. I appreciate their comprehensive study in exploring how different DAA algorithms contribute to model performance, as well as their findings that some RL frameworks exploit the automatic grading system by systematically increasing response length. Overall, the authors have done an excellent job explaining their methodology and results, which provide valuable insights for future research. However, I have questions and concerns about whether this framework genuinely enhances the capacity of VLM models.
Strengths
Innovative Framework: The framework allows fine-tuning VLMs without clinician annotation, addressing a key barrier in medical AI by reducing dependency on expert-labeled data.
Comprehensive Algorithm Comparison: The study thoroughly evaluates different DAA algorithms, providing insights into how each algorithm uniquely enhances model performance.
Detailed Methodology and Evaluation: The paper offers a clear, thorough explanation of its methods and results, making it a valuable resource for future research in VLM fine-tuning and performance evaluation.
Weaknesses
-
One major issue is that you used the GREEN score as the judge for your reward model, which introduces bias and makes the comparison less fair. When I evaluate performance without considering the GREEN score, the improvement across other tasks appears somewhat marginal. Additionally, there is evident over-optimization or reward hacking occurring by simply extending response length. I am not convinced this result is sufficiently robust to demonstrate the effectiveness of this method.
-
My primary concern, which raises doubts about this paper, is that DPO and IPO—the two methods with the highest performance in Table 3—are rated the lowest in expert evaluation. It seems insufficient to attribute this discrepancy to verbosity alone negatively impacting human evaluation. There appears to be a more fundamental issue in these responses, leading clinicians to favor responses generated by SFT. For example, the model produces extensive text, some of which is counterfactual. If this is the case, it contradicts the paper’s objective of improving model factuality through RL.
-
Reproducibility. It would be beneficial to make the code public to enhance reproducibility. At a minimum, a clear summary of the DAA algorithms would assist future researchers.
Questions
-
I don't think general metrics like BERTScore, BLEU-4, and ROUGE-L are particularly helpful here, as they do not assess the factual quality of the response. Consider including these in the supplementary figures.
-
I would like to see more examples to understand why clinicians favor the SFT model. I suspect verbosity is a significant issue, negatively impacting response reliability.
-
A straightforward way to mitigate verbosity issues could be to add a rule-based reward term to DPO, IPO, etc., to see if it improves performance in human ratings. If expert ratings are difficult to obtain, alternatively, showing that this method generally reduces verbosity without affecting evaluation scores could also strengthen the findings. This would be a convincing result that could change my assessment of the paper.
-
You might also want to cite this paper, the first to perform RLHF fine-tuning in VLM models: Sun, Z., Shen, S., Cao, S., Liu, H., Li, C., Shen, Y., ... & Darrell, T. (2023). Aligning large multimodal models with factually augmented RLHF. arXiv preprint arXiv:2309.14525.
-
Q1. Relevance of metrics.
A: We have updated Table 3 to only include the radiology-specific metrics, plus one example from the general domain (we chose BERTScore). The previous table has been moved to the appendix. Non-radiology-specific metrics may not capture the complexity and nuances relevant for evaluating radiology reports. Please see our response to W1 for a more extended discussion. -
Q2. Reason for preferences in reader study.
A: Please see the response for W2. -
Q3. Methods to address verbosity.
A: Verbosity bias seems to be a major issue, in particular for DPO and IPO, so thank you for drawing attention to this important topic! There are two direct ways to address verbosity bias within the current framework. We can use a Judge, or reward model, that is less gameable, and we can employ alignment algorithms that directly control for length.
Initial results of using length-controlled DPO [1] are available in the table below, where we show 95% confidence intervals for GREEN and length on the MIMIC-CXR validation set. An explicit regularization term for length has been added, controlled by a dedicated hyperparameter; when this hyperparameter is nonzero, the regularization term is active and explicitly controls for length. As expected, the configurations with active length regularization are significantly less verbose than the unregularized configuration, although the more verbose configurations also achieve a higher reward (average GREEN). Length-controlled DPO allows us to significantly mitigate verbosity, leading to roughly a factor-of-two difference in length between the most and least verbose configurations. However, since there is an apparent tradeoff between reward and verbosity, it is unclear which configuration is best.
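For reference, our reading of the length-regularized objective in [1] is that a length-difference penalty, weighted by a hyperparameter $\alpha$, is subtracted inside the standard DPO logistic term (with $\beta$ weighting the implicit KL-divergence to the reference policy):

$$\mathcal{L}_{\mathrm{DPO\text{-}LC}} = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} - \alpha \left(|y_w| - |y_l|\right)\right)\right]$$

Setting $\alpha = 0$ recovers vanilla DPO, while $\alpha > 0$ penalizes preferring a completion merely because it is longer.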
Hence, we would like to explicitly consider length in our evaluation metrics as well. To do this in a principled way, one could employ a length-controlled version of GREEN. One simple update is as follows: GREEN-LC = GREEN / max(length of generated report/length of reference report,1). Intuitively, this downweights GREEN when the length of the generated report is larger than that of the reference. If this is not the case, then the correction does nothing. Although very simplistic, such a correction will allow us to deal with the apparent tradeoff between GREEN and verbosity.
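A minimal sketch of this correction, assuming report lengths are measured in tokens or words:

```python
def green_lc(green_score: float, gen_len: int, ref_len: int) -> float:
    """Length-controlled GREEN: downweight GREEN whenever the generated report is
    longer than its reference; leave the score untouched otherwise."""
    return green_score / max(gen_len / ref_len, 1.0)

# A report twice as long as its reference has its GREEN halved;
# a shorter-than-reference report keeps its original score.
assert abs(green_lc(0.50, gen_len=120, ref_len=60) - 0.25) < 1e-9
assert green_lc(0.50, gen_len=40, ref_len=60) == 0.50
```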
We are currently working on updating the framework to deal with verbosity bias as described above, which we will try to include in the final version or in future work. We have also updated the manuscript with a discussion.

| Hyperparameters | Length (95% CI) | GREEN (95% CI) |
|---|---|---|
|  | (135.04, 145.40) | (0.48, 0.51) |
|  | (59.21, 65.94) | (0.41, 0.45) |
|  | (106.69, 116.35) | (0.46, 0.49) |
|  | (67.05, 70.27) | (0.44, 0.47) |
|  | (52.58, 54.49) | (0.43, 0.46) |
|  | (63.39, 66.22) | (0.43, 0.46) |
|  | (59.88, 62.49) | (0.41, 0.45) |
|  | (52.48, 54.65) | (0.41, 0.44) |
|  | (58.13, 60.59) | (0.41, 0.44) |
| SFT baseline | (54.55, 57.10) | (0.35, 0.38) |
-
Q4. Cite first paper on RLHF for VLMs.
A: Thank you for drawing our attention to this pioneering paper, we have included it as a reference.
If you have any additional questions or need further clarification, please let us know.
References:
[1] Ryan Park, Rafael Rafailov, et al. “Disentangling length from quality in direct preference optimization.” In Findings of the Association for Computational Linguistics: ACL, 2024.
First of all, we thank the reviewer for valuable and insightful feedback. We address your concerns below (split into two comments due to the character limit):
-
W1. GREEN used for preference generation and evaluation.
A: The gold standard would be to get expert human feedback, from radiologists, for both preference data generation and evaluation. However, this is unfortunately not feasible at the scale required. Hence, we employ GREEN as the silver standard, essentially a low-cost approximation of expert human judgment. GREEN is a state-of-the-art metric for radiology report evaluation, well correlated with expert human judgment, as shown in [1]. Due to this setup, we do expect to see a boost in GREEN, which is ultimately by design since this is our approximation for expert judgment. However, F1RadGraph, our second radiology specific metric, is increased by 10%+ on both the MIMIC-CXR and CheXpert Plus datasets, for all alignment algorithms except SimPO.
Non-radiology-specific metrics have been shown not to accurately reflect the complexity and nuances of radiology reports [2], explaining why we do not see large differences. While no metric is perfect, GREEN has been shown to better reflect radiologist preferences, which is why we focus on improving GREEN over all other metrics. Overall, the question we are trying to address is whether it is possible to improve upon a CXR VLM, in the eyes of human experts, without any additional radiologist feedback to produce the preference data. As was shown in the reader study, the answer to this question was yes. We have updated the manuscript to make these points more clear. -
W2. Discrepancy between automated metrics and reader study.
A: This is an observation that made us concerned as well, thank you for pointing it out! There is evidence of very significant length exploitation, especially for DPO and IPO. It seems that GREEN has been inflated artificially by simply producing more verbose output. Moreover, we do actually have more granular reasoning for the preferences indicated in the reader study, as the radiologists had the option to select reasons for why they had a certain preference. As seen in Table 9, currently in the appendix, the two most common reasons for preferring the SFT baseline over DPO and IPO were: “(b) Selected report contains LESS repeated Information” and “(c) Selected report is of a MORE preferable length.” Moreover, if we consider why ORPO and KTO were chosen over the SFT baseline, then by far the most common reason was “(a) Selected report contains LESS false information.” Hence, it seems that GREEN is fairly gameable via increased verbosity, a bias which was heavily exploited by DPO and IPO, essentially leading to no clinically meaningful improvements, just verbosity. However, alignment algorithms such as ORPO and KTO exploited this bias in GREEN less, leading to clinically meaningful improvements by reducing the prevalence of false information (i.e. improving factuality). To make this more clear, we have moved Table 9 from the appendix into section 4.3, along with a discussion of the results. Based on these insights, and the reviewer's keen observation, we are now focusing on building length-controlled metrics going forward, which we will try to incorporate in future versions of this work. Please see Q3 for more details on this direction. -
W3. Reproducibility.
A: Thank you for pointing this out! Code to run all experiments, as well as all the examples in the reader study and their preference, will be made publicly available. We have stated this in the manuscript.
References:
[1] Sophie Ostmeier et al. "Green: Generative radiology report evaluation and error notation." In Findings of the Association for Computational Linguistics: EMNLP, 2024.
[2] Dave Van Veen et al. "Adapted large language models can outperform medical experts in clinical text summarization." Nature Medicine 30, 1134–1142, 2024.
Thank you for your efforts in conducting additional experiments. I understand that the GREEN score is the second-best approach available. However, it is not surprising that using a reward model to improve a model results in better performance specifically for that reward score on your test set. As you mentioned, other metrics are not very informative, so I see no substantial evidence that this method genuinely improves model performance. One straightforward way to address this would be to create a small test set with ground-truth information. You could run it through ChatGPT and have ChatGPT provide scores when comparing the GT and the VLM outputs, penalizing hallucinated information. Even this would be more convincing than relying solely on the GREEN score. Overall, if you cannot demonstrate clearly superior performance for your model using a metric other than your reward model, I remain unconvinced, and the clinician evaluation so far appears more like a negative result to me. Therefore, I will keep my score unchanged.
Many thanks for emphasizing your concerns!
You are absolutely right that it is not surprising to see an improvement in the reward score. Based on insightful feedback from reviewer GH2g, we have included clinical efficacy metrics in the form of F1 scores over 14 categories extracted using the cheXbert labeler. These results are available in table 9 in the updated manuscript and in our response to W3 from reviewer GH2g. Our proposed method boosts the macro-averaged F1 score by 5.9% and 6.7% for KTO and ORPO, respectively. In addition, the policy aligned by ORPO was preferred over the SFT baseline in the reader study. These two datapoints together indicate that our method is capable of yielding clinically meaningful improvements, without relying on any results using GREEN. Future versions of this work will additionally rely on other clinically relevant metrics such as RadCliQ [1].
We thank you for the great suggestion of using ChatGPT to evaluate the aligned policies! We note that GREEN is in effect a distilled version of ChatGPT for the task of evaluating radiology reports. Following your suggestion, we asked GPT-4o (gpt-4o-2024-08-06) to compare the aligned policies with the SFT baseline on the CheXpert Plus test set (209 samples), using the available reference reports. We explored two prompts: the exact prompt in A.3 of the GREEN paper, and the same prompt adjusted to state that length is important (length-controlled, LC). Win rates against the SFT baseline are available in the table below, followed by a minimal sketch of the evaluation loop. Consistent with the results for the CE metrics, ORPO and KTO are the top performers.
| Method | Win rate | Win rate (LC) |
|---|---|---|
| DPO | 0.52 | 0.46 |
| KTO | 0.61 | 0.61 |
| IPO | 0.55 | 0.54 |
| SimPO | 0.49 | 0.51 |
| ORPO | 0.56 | 0.61 |
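The sketch below outlines the evaluation loop referenced above; `build_green_prompt` and `parse_score` are illustrative placeholders (the actual prompt is the one in A.3 of the GREEN paper), the snippet assumes the official OpenAI Python client, and ties are counted as losses for simplicity:

```python
from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()

def judge_score(report: str, reference: str) -> float:
    """Ask GPT-4o to grade a single generated report against its reference report."""
    prompt = build_green_prompt(report=report, reference=reference)  # illustrative placeholder
    response = client.chat.completions.create(
        model="gpt-4o-2024-08-06",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return parse_score(response.choices[0].message.content)  # illustrative placeholder

def win_rate(aligned_reports, sft_reports, references) -> float:
    """Fraction of test samples where the aligned policy's report outscores the SFT baseline's."""
    wins = sum(
        judge_score(aligned, ref) > judge_score(sft, ref)
        for aligned, sft, ref in zip(aligned_reports, sft_reports, references)
    )
    return wins / len(references)
```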
References:
[1] Feiyang Yu et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns, 4(9), 2023.
The paper introduces a scalable preference alignment technique using an automated “LLM-as-a-Judge” mechanism, reducing the cost and complexity of obtaining preference data without requiring expert radiologist feedback.
Strengths
1. The paper is well-structured overall, with clear and thoughtfully designed figures that effectively illustrate the concepts and results.
2. The paper introduces a scalable preference alignment technique using an automated “LLM-as-a-Judge” mechanism, reducing the cost and complexity of obtaining preference data without requiring expert radiologist feedback.
3. The approach leverages Direct Alignment Algorithms (DAAs), showing significant improvements in multiple metrics over SFT baselines, which enhances the accuracy and clinical relevance of the outputs.
4. The model demonstrates consistent performance across various tasks, avoiding the “alignment tax” phenomenon, which further enhances its applicability in high-stakes medical domains.
Weaknesses
Q1. Please discuss specific theoretical limitations of DAAs that may affect applicability to other medical domains or high-stakes fields beyond medicine.
Q2. Specific experiments to test robustness to biased/noisy data are necessary, such as evaluating performance across different age groups or testing with artificially injected label noise.
Q3. How to detect and mitigate potential bias accumulation in the LLM-as-judge approach.
Q4. Please justify the choice of baseline and comparison methods, and suggest 1-2 specific additional relevant baselines you should consider including, explaining why they would be particularly relevant comparisons for this work.
Q5. How does this approach scale in terms of cost and efficiency over time and with larger datasets?
Q6. Please discuss how the proposed approach specifically compares to or differs from Token-level Direct Preference Optimization (TDPO) https://doi.org/10.48550/arXiv.2404.11999, and whether incorporating ideas from TDPO could potentially improve their method.
Questions
Q1. Please discuss specific theoretical limitations of DAAs that may affect applicability to other medical domains or high-stakes fields beyond medicine.
Q2. Specific experiments to test robustness to biased/noisy data are necessary, such as evaluating performance across different age groups or testing with artificially injected label noise.
Q3. How to detect and mitigate potential bias accumulation in the LLM-as-judge approach.
Q4. Please justify the choice of baseline and comparison methods, and suggest 1-2 specific additional relevant baselines you should consider including, explaining why they would be particularly relevant comparisons for this work.
Q5. How does this approach scale in terms of cost and efficiency over time and with larger datasets?
Q6. Please discuss how the proposed approach specifically compares to or differs from Token-level Direct Preference Optimization (TDPO) https://doi.org/10.48550/arXiv.2404.11999, and whether incorporating ideas from TDPO could potentially improve their method.
- Q6. Discuss how our approach compares with TDPO.
A: Thank you for drawing our attention to this fascinating paper. It presents a novel alignment algorithm, TDPO, whereas our key contribution is a method for generating preference data for medical VLMs to address the prohibitively high cost of eliciting preference from medical professionals at scale. Please see our answer to Q1 for more details on this point. With that said, our paper could tentatively benefit from including TDPO. The alignment algorithms considered (DPO, IPO, KTO, SimPO, and ORPO) are all sequence-level optimizers. This is a fundamental difference compared to classical RLHF using policy gradient methods such as PPO [1], where optimization occurs at the token-level. Like classical RLHF, TDPO is a token-level optimizer, but maintains the simplicity of DPO, circumventing explicit reward modeling. Hence, TDPO is a DAA according to our usage of the term. As outlined above for Q4, we chose representative samples from different categories of DAAs. As TDPO represents a novel category, token-level DAA, the generality of our approach could be improved by also including it. We plan to run experiments using token-level optimizers, such as TDPO, in the future.
If you have any additional questions or need further clarification, please let us know.
References:
[1] John Schulman et al. “Proximal policy optimization algorithms”. arXiv preprint arXiv:1707.06347, 2017.
-
Q4. Justify choice of baseline and comparison methods.
A: CheXagent was chosen as a representative example of state-of-the-art, open-source VLMs for CXR interpretation. It has been trained in the canonical way: first, the LLM was adapted to medical text via continued pre-training. Second, a vision encoder was adapted via vision pre-training, i.e. contrastive learning on CXR image-text pairs. Third, the two modalities were merged by training a vision-language bridger, or adapter network, keeping the LLM and vision encoder frozen. Finally, the model was instruction tuned. CheXagent is also of average size, 8B, for an open-source model, providing a good balance between computational complexity and performance.
The argument is similar for the chosen comparison methods. Since it is infeasible to include all available alignment algorithms, we opted for a representative subset. The first downselection was to focus only on offline DAAs due to computational constraints. But even within this subset there is an abundance of different methods. Hence, we chose representative algorithms from different categories. DPO is the original DAA and serves as our baseline. IPO is an example of a DAA with generalized preference, relaxing the assumption of the Bradley-Terry model. KTO is an example of a DAA that does not require preference pairs, but instead only binary feedback on whether a completion is desirable or undesirable. SimPO is an example of a DAA that does not require a reference policy, meaning that it is computationally lighter weight; in addition, SimPO explicitly controls for length. ORPO, almost outside of the definition of DAAs, is an example of an algorithm that jointly runs SFT and preference alignment. There are some overlaps between these categories as, for example, ORPO is also reference-free. We have updated the manuscript to make the justification behind these choices more clear.
Other relevant baselines include MAIRA-2 [1] and LLM-CXR, as kindly pointed out by reviewer oXGy. MAIRA-2 is of interest since it focuses on grounded radiology report generation, which is a much more structured setup than free-text generation. In particular, it reports a list of sentences, each of which corresponds to at most a single observation, and an associated bounding box, if relevant, in the corresponding image. In addition, MAIRA-2 uses all information available, including prior reports for progression report generation. In other words, for the progression report generation task we considered, CheXagent only has access to the most recent and prior image, whereas MAIRA-2 will also condition on the prior report. Conditioning on more information is likely to yield better performance. We plan to run these experiments and include the results in the eventual camera-ready version. LLM-CXR would be of interest as it represents a novel way of aligning text and image modalities, circumventing the need to train an adapter, or bridger, network which may be a significant information bottleneck. -
Q5. Scalability.
A: Since our method requires image-report pairs, we are limited by the availability of such data. Fortunately, for chest X-rays, there are large datasets publicly available, such as MIMIC-CXR and PadChest [2]. Preference datasets are usually not that large, and hence this availability should not immediately be a limiting factor. Currently, the main bottleneck is generating the preference data: the baseline model is first prompted to generate N draws per prompt, and these N draws are then evaluated by the Judge. This bottleneck, which is fundamentally computational in nature, can be significantly reduced via parallelization or GREEN distillation/quantization. With that said, provided that we have access to sufficient compute, our approach allows for the generation of hundreds of thousands of preference pairs, a task that simply would not be feasible if feedback from radiologists were required.
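As an illustration of this stage, a minimal sketch of best-of-N preference-pair construction is given below; `policy.generate` and `judge` are illustrative placeholders rather than actual APIs, and pairing the highest- and lowest-scored draws is one natural strategy, not necessarily the exact one used in the paper:

```python
def build_preference_pairs(dataset, policy, judge, n_draws=5):
    """Sample N candidate reports per study from the SFT policy, grade each against the
    reference report with the Judge (e.g. GREEN), and keep the best/worst candidates as
    the chosen/rejected completions."""
    pairs = []
    for image, reference in dataset:  # (chest X-ray, reference report) pairs
        candidates = [policy.generate(image) for _ in range(n_draws)]
        scores = [judge(candidate, reference) for candidate in candidates]
        best = max(range(n_draws), key=lambda i: scores[i])
        worst = min(range(n_draws), key=lambda i: scores[i])
        if scores[best] > scores[worst]:  # skip studies the Judge cannot separate
            pairs.append({"image": image,
                          "chosen": candidates[best],
                          "rejected": candidates[worst]})
    return pairs
```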
References:
[1] Shruthi Bannur, Kenza Bouzid, et al. “MAIRA-2: Grounded Radiology Report Generation”, arXiv preprint arXiv:2406.04449,2024.
[2] Aurelia Bustos et al. “Padchest: A large chest x-ray image dataset with multi-label annotated reports”. Medical image analysis 2020;66:101797.
First of all, we thank the reviewer for valuable and insightful feedback. We address your concerns below (split into three comments due to the character limit):
- Q1. Theoretical limitations of DAAs for medicine.
A: Thank you for drawing attention to this very interesting point! To the best of our knowledge, there are no theoretical limitations of DAAs, or alignment algorithms in general, that prevent their successful application in medicine. Alignment algorithms work by shifting up the probabilities of preferred completions while simultaneously pushing down the probabilities of dispreferred completions. The application of DAAs can likely be domain agnostic; however, the key component for successful preference alignment is the collection of the preference data itself [1]. If we could get radiologists to rate a dataset of, say, 50k+ preferences, then DAAs are likely to work well off-the-shelf. The problem lies in the prohibitive cost of obtaining said preference data, given the current global shortage of radiologists and the increasing volumes of radiology studies. Thus, generating radiologist-derived preferences is not scalable in time or cost. Consequently, the main question our study is trying to address is whether or not it is feasible to obtain sufficiently good preference data without requiring direct radiologist preferences, where sufficiently good means that employing DAAs on the obtained data leads to clinically meaningful improvements in the generated reports. That ORPO was preferred 62% of the time over the SFT baseline was an initial proof-of-concept that our approach indeed generated “sufficiently good” preference data. As an additional minor comment, this is, to the best of our knowledge, the first systematic study of a range of DAAs for VLMs, both in the general and medical domain. The manuscript has been updated to make these points more clear.
- Q2. Additional experiments for robustness.
A: Thank you for drawing our attention to this very important problem! As an initial step towards understanding how alignment impacts different subsets of the data, and to check robustness, we have included AUROC and F1 metrics across 14 categories extracted by the CheXpert labeler. For instance, considering our best model aligned with ORPO, while we observe an overall improvement in macro and micro averages for F1 and AUROC, we can see deteriorating performance in certain categories such as Pleural Other and Fracture. For more details please see our response to W3 from reviewer GH2g.
- Q3. Detecting and mitigating bias in preference data.
A: This is an important consideration as there is a range of possible biases. For the LLM-as-a-Judge mechanism there are well-known biases such as verbosity, position, and self-selection. Position bias is relevant for pairwise comparisons, whilst our Judge conducts single answer grading. Self-selection bias would occur when the model producing the completions is the same as the Judge, for instance GPT-4. Although the average length of the chosen subsets is only marginally longer, as shown in Table 1, we do observe significant length exploitation, especially for DPO and IPO. The current understanding of this phenomenon is that small verbosity biases in the preference dataset get exacerbated via reward overoptimization, or hacking [2]. For more on how to address verbosity bias, please see our response to Q3 from reviewer wmBo. Moreover, there might be other societal biases with regard to age, race, and sex embedded in the data or the Judge, as GREEN was trained on data generated by GPT-4 and may have inherited certain implicit biases. Exploring this further is an interesting direction for future work.
References:
[1] Hamish Ivison et al. “Unpacking DPO and PPO: Disentangling best practices for learning from preference feedback”, arXiv preprint arXiv:2406.09279, 2024.
[2] Ryan Park, Rafael Rafailov, et al. “Disentangling length from quality in direct preference optimization.” In Findings of the Association for Computational Linguistics: ACL, 2024.
Thanks for your responses; there are some follow-up questions that need to be answered.
Q1: As mentioned, the feasibility of obtaining radiologist-derived preferences is a critical limitation due to cost and time constraints. Please elaborate on whether any alternative methods have been explored for mitigating this issue.
Q2: The main bottleneck is generating preference data. Could you provide more insight into how the proposed parallelization or distillation can maintain quality while scaling, particularly in the context of large-scale datasets like MIMIC-CXR? Are there specific trade-offs in accuracy or efficiency observed during this process?
Many thanks for your additional questions and sorry for the delay in getting back to you! Please see our responses below:
Q1: Alternative approaches to obtaining preferences.
A: To the best of our knowledge, our work is the first to consider preference fine-tuning in the context of medical VLMs. We are currently evaluating using the BERTScore instead of a “Judge” to generate preference data. The BERTScore is computationally significantly cheaper than GREEN. Please see our response for Q1 from reviewer oXGy for some initial results.
Q2: Accuracy and efficiency for parallelization or distillation.
A: The objective here is to enable the generation of more preference pairs in a given amount of time. For parallelization, we simply suggest running the same process across multiple GPUs. As for distillation, or quantization, we would be reducing the number of parameters or precision used. However, as you shrewdly pointed out, this is unlikely to come without a cost. The question is how big of a cost. A careful study of the trade off between accuracy and efficiency would be required. With that said, there are many works on quantized general-purpose LLMs, for example SmoothQuant [1], that demonstrate negligible loss in accuracy.
References:
[1] Guangxuan Xiao, Ji Lin, et al. “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models”, Proceedings of the 40th International Conference on Machine Learning, 2023.
Thanks for your response to my concerns. Your reply noted that the proposed method is the first to consider preference fine-tuning in the context of medical VLMs, which is great. However, this research resembles a benchmarking study on preference fine-tuning in chest X-ray LLMs rather than introducing a novel method. Therefore, I will keep my score unchanged.
This paper proposes a scalable, automated preference alignment technique for chest X-ray (CXR) report generation using publicly available datasets and an LLM-as-a-Judge mechanism. The authors evaluate and benchmark five direct alignment algorithms (DAAs) and show that their approach achieves an overall improvement over the SFT baseline. Additionally, they analyze the alignment tax and clinical implications of the proposed methods.
Strengths
- The article is well-motivated and clearly written, making it easy to follow.
- The authors systematically examine how DAAs can enhance factual accuracy from three perspectives, considering both improvements and potential degradations introduced by DAAs.
Weaknesses
- The method for collecting preference data is relatively simplistic. An ablation study on the impact of different metrics used as the Judge for preference data generation would be valuable.
- The use of LLMs for preference data collection without human feedback has been explored in prior work, such as RLAIF[1]. A comparison or discussion between the proposed method and RLAIF would strengthen this study.
- As the authors acknowledge, this work focuses on a single model, raising concerns about whether the experimental results adequately support the study's conclusions.
[1] Lee, Harrison, et al. "RLAIF: Scaling reinforcement learning from human feedback with AI feedback." arXiv preprint arXiv:2309.00267 (2023).
Questions
Overall, this paper is strong, though it may not fully meet the ICLR criteria. I would consider raising my score if the authors:
- Compare the impact of metrics beyond GREEN in collecting preference data.
- Provide a comparative analysis between their method and RLAIF for preference data collection.
- Conduct additional experiments, such as evaluating their method on other vision-language models (VLMs) like LLM-CXR[1].
[1] Suhyeon Lee, et al. "LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation." In The Twelfth International Conference on Learning Representations, 2024.
First of all, we thank the reviewer for valuable and insightful feedback. We address your concerns below:
-
W1: Simplistic preference data collection.
A: Thank you for this comment. We would not necessarily argue that simplicity is a weakness. Given that it is a challenge to collect expert human feedback in this setting, we propose a simple method. With that said, it is likely possible to improve upon our current setup, in terms of computational complexity and quality. Please see our response for Q1 for more on this. -
W2: How the preference data collection compares with that in RLAIF.
A: Please see our response for Q2. -
W3: Additional experiments using another VLM.
A: Please see our response for Q3. -
Q1. Using different metrics than GREEN for preference data.
A: Thank you for this comment. We agree that there needs to be robustness across metrics for preference data; however, the fundamental challenge here is the lack of clinically-relevant metrics that can actually quantify this aspect. GREEN has demonstrated that it can accurately quantify the quality of generated radiology reports. Our preliminary results indicated that GREEN was superior to employing, for instance, the BERTScore. However, this was on a relatively small subset of the training data and we only evaluated using DPO, our baseline DAA. Prompted by your feedback, we now present preliminary results for using BERTScore as the preference metric for ORPO, the best model according to the reader study, in the table below. We show the average values (and the difference with respect to the SFT baseline in brackets) on the MIMIC-CXR validation set for different hyperparameters. One striking result is that it is possible to achieve about the same average GREEN, but with less verbosity, using BERTScore to generate the preference data. We are currently repeating this experiment for the remaining DAAs and full results will be available in the final camera-ready version.
| Judge | GREEN | F1RadGraph | BERTScore | Length | λ |
|---|---|---|---|---|---|
| GREEN | 0.415 (0.051) | 0.293 (0.027) | 0.858 (-0.005) | 65.3 (9.4) | 0.5 |
|  | 0.449 (0.086) | 0.306 (0.040) | 0.871 (0.008) | 59.2 (3.4) | 1.0 |
|  | 0.463 (0.099) | 0.301 (0.035) | 0.869 (0.006) | 63.4 (7.6) | 4.0 |
|  | 0.465 (0.101) | 0.309 (0.043) | 0.871 (0.007) | 63.1 (7.3) | 5.0 |
| BERTScore | 0.393 (0.029) | 0.296 (0.030) | 0.866 (0.002) | 59.0 (3.2) | 0.5 |
|  | 0.449 (0.086) | 0.306 (0.040) | 0.871 (0.008) | 53.7 (-2.2) | 1.0 |
|  | 0.463 (0.099) | 0.301 (0.035) | 0.869 (0.006) | 54.8 (-1.0) | 4.0 |
|  | 0.322 (-0.042) | 0.244 (-0.022) | 0.862 (-0.002) | 55.1 (-0.7) | 5.0 |
-
Q2. How the preference data collection compares with that in RLAIF.
A: Thank you for drawing attention to this detail! LLM-as-a-Judge can be set up in three main ways [1]:
a) pairwise comparison: the most common approach, where the Judge selects which of two completions is better for a given prompt,
b) single answer grading: when the Judge directly assigns a numeric score for a completion, given the prompt, and
c) reference-guided grading: conditions on some additional ground truth information to improve the performance of the Judge.
RLAIF employs a pairwise comparison. In contrast, in our work, we use reference-guided single answer grading. The Judge takes as input the completion and its corresponding reference and yields a numeric score. Making the Judge reference-guided allows for preference data generation in a factually grounded manner. In addition, reference reports help us circumvent the issue of making the Judge multimodal. Thank you for helping us realize that this distinction was not sufficiently clear. We have updated the manuscript accordingly. -
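To make this distinction concrete, a minimal sketch of the two prompt shapes is given below; the wording is illustrative, not the actual GREEN or RLAIF prompt:

```python
def pairwise_prompt(report_a: str, report_b: str) -> str:
    # (a) Pairwise comparison (as in RLAIF): the Judge sees two completions and picks one.
    return (
        "Which of the two candidate chest X-ray reports is better?\n"
        f"Candidate A: {report_a}\nCandidate B: {report_b}\nAnswer with A or B."
    )

def reference_guided_prompt(candidate: str, reference: str) -> str:
    # (b)+(c) Reference-guided single answer grading (our setup): the Judge sees one
    # completion together with the ground-truth reference and returns a numeric score.
    return (
        "Score the candidate report against the reference report for factual accuracy, "
        "from 0 (entirely wrong) to 1 (fully consistent).\n"
        f"Reference: {reference}\nCandidate: {candidate}"
    )
```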
Q3. Additional experiments using another VLM.
A: Thank you for drawing our attention to LLM-CXR. We are currently working on implementing MAIRA-2 [2], a state-of-the-art VLM developed by Microsoft. We opted for this model instead since it is available within the Hugging Face ecosystem (https://huggingface.co/microsoft/maira-2). If finished in time, these results will be included in the final camera-ready version. With that said, the main contribution of this paper is the automated preference generation technique, and the model itself is just a vehicle for us to assess whether the preference data is sufficiently good to lead to clinically meaningful improvements.
If you have any additional questions or need further clarification, please let us know.
References:
[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena”, NeurIPS, 2023.
[2] Shruthi Bannur, Kenza Bouzid, et al. “MAIRA-2: Grounded Radiology Report Generation”, arXiv preprint arXiv:2406.04449,2024.
My concerns align closely with those of Reviewer GH2g. This research resembles a benchmarking study on preference fine-tuning in chest X-ray LLMs rather than introducing a novel method. Thus, the key question is whether the experiments and analysis are sufficiently robust to support the conclusions. While the authors have partially addressed my concerns, I am inclined to raise my score. However, I strongly recommend that the authors conduct experiments on a broader range of chest X-ray LLMs to enhance the comprehensiveness of this study.
We thank all reviewers for their constructive criticism and timely follow-up to our responses throughout the discussion period.
We have now updated the manuscript by:
- Including clinical efficacy metrics, further emphasizing the clinical utility of our proposed approach
- Updating the introduction and conclusion to clarify a key contribution: the automated pipeline for preference data generation, enabling preference fine-tuning in this setting which otherwise would not be feasible due to the prohibitive cost of obtaining feedback from radiologists
- Clarifying the choices of CheXagent, GREEN, and DAAs, and connecting our approach with previous work using AI feedback, in Section 3
- Updating sections 4.1 and 4.3 to:
- Make the error subcategories and reader study results more clear
- Emphasize the relation between quality and verbosity
- Making section 5, now called Limitations and Discussion, much more extensive
We hope that the updated manuscript, and our posted comments, effectively address your concerns. If not, please let us know what other aspects we can improve on. We have ongoing work that we will share with you once finished, and we will remain active here throughout the extended discussion period.
We thank all reviewers again for their insightful comments which have allowed us to greatly improve our work.
This paper proposes a scalable, automated preference alignment technique for chest X-ray report generation using publicly available datasets and an "LLM-as-a-Judge" mechanism, empirically analyzing and benchmarking five direct alignment algorithms (DAAs) to improve model performance without requiring expert radiologist feedback.
The reviewers found this paper has strengths in its innovative use of a scalable "LLM-as-a-Judge" mechanism to reduce reliance on expert-labeled data and systematic evaluation of Direct Alignment Algorithms (DAAs). The main weaknesses of the paper include limited generalizability, methodological shortcomings, and insufficient evaluation. Multiple reviewers highlighted concerns about the focus on a single model, raising doubts about the general applicability of the findings. The reliance on the GREEN score introduces bias, leading to verbosity and counterfactual information, undermining the goal of improving factual accuracy (Reviewers wmBo, GH2g). The lack of clinical efficacy (CE) metrics, crucial for medical applications, further weakens the evaluation (Reviewer GH2g). Reviewers also noted insufficient comparisons with related works, unclear explanations of key results, and the need for robustness testing against biased or noisy data (Reviewers oXGy, P4Hs). These limitations reduce the impact and reliability of the proposed methods.
During the discussion, the authors responded by adding clinical efficacy metrics, clarifying key contributions and methods, and expanding the discussion and limitations section for improved clarity. Reviewers appreciated the updates, including clinical efficacy metrics and clarifications on methodology, which enhanced the paper’s clarity and addressed some concerns (Reviewers oXGy, P4Hs, wmBo). However, the study was still criticized for limited generalizability due to reliance on a single, underperforming baseline model and the potential bias introduced by the GREEN score (Reviewers oXGy, GH2g, wmBo). Concerns included insufficient validation with alternative metrics, lack of improvement in downstream tasks, and a focus on benchmarking rather than introducing novel methods (Reviewers wmBo, GH2g, P4Hs). Most reviewers maintained their original scores (Reviewers P4Hs, wmBo, GH2g), while Reviewer oXGy slightly raised their score after acknowledging the updates.
This AC found this work to be meaningful as it proposed a method to fine-tune radiology report generation models towards expert preferences/feedback (via the "LLM-as-a-Judge"), distinguishing itself from the common reliance on word-level cross-entropy loss. This approach is expected to align the trained model more closely with radiologist perception and decision-making. However, as most reviewers pointed out, this AC agrees that the most significant issue with this work is its limited generalizability due to reliance on a low-performance baseline. While the paper showed that the proposed strategies could significantly improve the baseline, the improved performance still falls short of the SOTA results achieved by other radiology report generation models. It is unclear whether the proposed framework will work for more advanced baseline report generation models. This fails to convincingly demonstrate the broader effectiveness and significance of the work.
Additional Comments on Reviewer Discussion
The reviewers pointed out that the main weaknesses of the paper include limited generalizability, methodological shortcomings, and insufficient evaluation. During the discussion, the authors responded by adding clinical efficacy metrics, clarifying key contributions and methods, and expanding the discussion and limitations section for improved clarity. Reviewers appreciated the updates, including clinical efficacy metrics and clarifications on methodology, which enhanced the paper’s clarity and addressed some concerns (Reviewers oXGy, P4Hs, wmBo). However, the study was still criticized for limited generalizability due to reliance on a single, underperforming baseline model and the potential bias introduced by the GREEN score (Reviewers oXGy, GH2g, wmBo). Concerns included insufficient validation with alternative metrics, lack of improvement in downstream tasks, and a focus on benchmarking rather than introducing novel methods (Reviewers wmBo, GH2g, P4Hs). Most reviewers maintained their original scores (Reviewers P4Hs, wmBo, GH2g), while Reviewer oXGy slightly raised their score after acknowledging the updates.
Reject