PaperHub
6.4/10
Poster · 4 reviewers
Lowest 3 · Highest 5 · Std. dev. 0.7
3
4
5
4
3.5
Confidence
Novelty 2.8
Quality 2.8
Clarity 2.5
Significance 2.3
NeurIPS 2025

Walking the Tightrope: Autonomous Disentangling Beneficial and Detrimental Drifts in Non-Stationary Custom-Tuning

OpenReview · PDF
Submitted: 2025-04-05 · Updated: 2025-10-29

Abstract

Keywords
Concept Drift · Reinforced Fine-tuning · MLLMs

Reviews and Discussion

Review
Rating: 3

This paper investigates the issue of concept drift in multi-modal large language models (MLLMs) during non-stationary reinforcement fine-tuning (RFT), particularly within chain-of-thought (CoT) reasoning. The authors propose a novel method called Counterfactual Preference Optimization (CPO), which combines causal inference and counterfactual reasoning to disentangle beneficial distribution adaptation from harmful concept drift. CPO utilizes a hierarchical concept graph to guide the generation of counterfactual reasoning trajectories, enabling more stable and robust model tuning in dynamic environments, especially in the medical domain. The study makes four key contributions: (1) introducing a theoretical framework that formalizes autoregressive token generation in MLLMs using concept drift theory; (2) proposing the CPO approach that integrates structured domain knowledge with counterfactual interventions; (3) validating the method through experiments across multiple clinical benchmarks for chest radiograph tasks; and (4) contributing the CXR-CounterFact (CCF) dataset, comprising 320,416 curated counterfactual reasoning trajectories derived from MIMIC-CXR. Experimental results demonstrate that CPO outperforms existing methods in robustness, generalization, and accuracy under non-stationary conditions.

Strengths and Weaknesses

Strengths

  1. This paper covers a wide range of content and has an extensive scope of work, such as: (1) introducing a theoretical framework that formalizes autoregressive token generation in MLLMs using concept drift theory; (2) proposing the CPO approach that integrates structured domain knowledge with counterfactual interventions; (3) validating the method through experiments across multiple clinical benchmarks for chest radiograph tasks; and (4) contributing the CXR-CounterFact (CCF) dataset.

  2. The authors apply relatively novel techniques from the AI field to medical scenarios.

Weaknesses

  1. The motivation section of the paper does not clearly describe the specific issues and consequences of concept drift in real-world medical problems, making it difficult for readers to grasp its practical impact.

  2. The authors introduced too many new concepts, making the paper overly incremental and lacking a focused research innovation. For example, the authors mentioned concept drift, counterfactual inference, reinforcement custom-tuning, and also contributed a dataset, but each of these contributions appears to be only a minor improvement. Meanwhile, the proposed CPO method seems to be nothing more than a simple variant of DPO.

  3. The authors' main contribution involves making minor modifications to existing methods, such as DPO, and then applying them in the medical domain. From the perspective of AI technology, I believe the paper's technical contribution is relatively limited.

Questions

Why does Table 2 compare only SFT and not directly compare with DPO?

Limitations

yes

Formatting Concerns

NA

Author Response

Sincere thanks for your meaningful and detailed review. Your positive feedback is a huge encouragement to our team, including: 1) the theoretical framework, 2) the CPO method, 3) experiments across multiple clinical benchmarks, 4) the contributed CCF dataset, and 5) novel techniques. In response to your concerns, we have provided a detailed explanation below and revised our manuscript accordingly.

  1. (Weakness 1) Motivation of concept drift. Thanks for your kind and helpful suggestions. Our motivation is based on Observation 1.1 in our manuscript, where a small, semantics-preserving perturbation imposed on the CoT reasoning process nevertheless causes the output distribution to deviate substantially; this is the concept drift, as shown in Fig. 1 of our manuscript.

    In real-world medical environments, diagnosis requires strong robustness to resist noise, bias, and other issues in the data. However, current MLLMs produce biased results when only small, semantics-preserving perturbations are added to the CoT. This poses a huge challenge to the application of MLLMs in the medical field. We have added more details to clarify our motivation for concept drift in real-world medical environments. Thanks for your helpful suggestions.

  2. (Weakness 2 & 3) Contributions. Many thanks for your comments. We are sincerely sorry for the misunderstanding. As for contributions:

    • First, we formalized the issues we found in Observation 1.1 using the concept drift framework, regarding the autoregressive CoT generation as a stream of next-token prediction actions, providing a solid theoretical foundation to study the non-stationary reinforced custom-tuning.

    • Second, within the concept drift framework, counterfactual causal inference is used to further tighten the model's decision boundary under non-stationary reinforced custom-tuning, leading to our proposed Counterfactual Preference Optimization (CPO). By embedding learnable concept graphs as the expert, our approach automates the generation of adversarially-constrained reasoning trajectories, achieving substantial decoupling between beneficial distribution adaptation and detrimental concept drift.

    • Third, we conduct comprehensive empirical validation to verify our proposed method across various clinical benchmarks, including disease classification, report generation and zero-shot generalization. The superior results demonstrate statistically significant improvements in robustness, generalization, and accuracy of our method under non-stationary custom-tuning.

    • Fourth, we contribute the CXR-CounterFact dataset, with over 320,000 counterfactual reasoning trajectories, facilitating future studies on counterfactual reasoning in medical applications.

    In terms of the differences between our proposed CPO and DPO: although CPO and DPO both utilize a preference-based optimization framework, CPO is fundamentally distinct from DPO. DPO compares human-preferred responses against dispreferred ones. CPO instead contrasts factuals with counterfactuals generated under explicit causal interventions, specifically designed to isolate the causal effect. CPO has a tighter decision boundary than DPO, as shown by the additional experiments in item 4 below (Experiment of DPO).

    As for the motivation for choosing chest diagnostics as the application, it is due to the abundance of public professional medical diagnosis reports, which serve as rich CoT reasoning processes. This domain provides an ideal platform for studying MLLM reasoning processes and validating the robustness of our proposed CPO in real-world settings. We will explore additional multimodal applications in the future, and we have added related discussions in the section of Outlooks.

    We have added more details and descriptions about our contributions in the section of Introduction. Many thanks for your helpful suggestions.

  3. (Weakness 2) Relationships among concepts. In terms of the relationships among the different concepts, such as concept drift, counterfactual inference, and reinforcement custom-tuning, they are strongly related to each other in our work. We initially discovered the problem of concept drift in reinforcement custom-tuning for chest diagnosis, so counterfactual causal inference is used to solve it. We have added more details about their relationships in the section on Methodology. Thanks for your kind review.

  4. (Question) Experiment of DPO. We appreciate your kind and helpful suggestions. We have added additional experiments with DPO on multi-label chest disease classification on MS-CXR-T, as shown in Table 2. Our proposed CPO is significantly superior to DPO with the same amount of training data, proving the performance of our proposed counterfactual preference optimization. We have added discussions about DPO in the sections of Methodology and Experiments. Thanks for your sincere review.

    | Method | Con. | PE   | Pna. | Pnx. | Ede. | Avg. |
    |--------|------|------|------|------|------|------|
    | SFT    | 54.9 | 71.7 | 70.0 | 95.9 | 76.5 | 73.8 |
    | DPO    | 63.2 | 72.4 | 76.7 | 93.5 | 76.3 | 76.4 |
    | CPO    | 77.7 | 72.7 | 87.4 | 95.8 | 75.3 | 81.8 |
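
For concreteness, the preference-optimization skeleton that CPO and DPO share can be sketched in a few lines; only the source of the negative trajectory differs (a dispreferred response for DPO, a causally intervened counterfactual CoT for CPO). This is a minimal illustrative sketch with toy numbers; the function name and `beta` value are our assumptions, not the paper's implementation.

```python
import math

def preference_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Log-sigmoid preference objective over one (positive, negative) pair.

    For DPO the negative is a dispreferred response; for CPO, as described
    in the rebuttal, it is a counterfactual reasoning trajectory generated
    under a causal intervention on drift-prone radiographic features.
    """
    # Implicit reward: policy log-prob margin relative to a reference model.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Toy numbers: factual (expert-report) CoT vs. counterfactual CoT.
loss = preference_loss(logp_pos=-12.0, logp_neg=-10.5,
                       ref_logp_pos=-12.5, ref_logp_neg=-11.0)
```

Swapping what supplies `logp_neg` is the entire interface difference; the causal targeting of the negatives, not the loss algebra, is what the rebuttal credits for the tighter decision boundary.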

Thank you once more for generously dedicating your time to provide a thorough review. Your feedback is tremendously valuable, and we are open to hearing from you at any time. If you find our response satisfactory, we would greatly appreciate your assistance in improving our rating score.

Comment

Thank you for your rebuttal. I think this is a solid work. I would like to maintain my original score.

Comment

Thank you very much for your recognition that our work is solid. However, your current score is 3: Borderline reject. Would you be willing to raise your score? If you have any further questions or need clarification on our rebuttal, we are happy to provide additional details.

Comment

Dear Reviewer 4zYr:

Thank you for your time and valuable feedback on our submission. As the response period will conclude soon, we want to kindly check if you have any further questions or need clarification on our rebuttal. We are happy to provide additional details if needed.

Best wishes,

Authors.

Review
Rating: 4

This paper proposes and theoretically analyzes the problem of harmful concept drift in chain-of-thought (CoT) reasoning of multimodal large language models (MLLMs) in the field of medical diagnosis during non-stationary reinforcement fine-tuning. The authors introduce the notion of counterfactuals and propose counterfactual preference optimization (CPO) to solve the concept drift problem. In addition, the authors build a new dataset, CXR-CounterFact, based on the existing MIMIC-CXR and conduct experiments to prove the effectiveness of CPO.

Strengths and Weaknesses

Strengths:

  1. The author raised the problem of concept shifting and proposed a solution.
  2. The author's work is done with both theoretical analysis and experimental support.

Weaknesses:

  1. The introduction of the background and problem in Sec. 1 may not be clear enough (e.g., what concept shifting is and how it manifests, how non-stationary environmental dynamics manifest here, etc.). I hope the authors could give a more understandable explanation or description of this concept.
  2. The author mentions that this phenomenon occurs in scenarios with inherent domain-specific data characteristics (in Section 1, lines 24-25). But this paper only mentions medical diagnosis. There are certainly many scenarios or fields with such features. Can this framework be transferred to other fields?
  3. The trade-off between time/cost and the efficiency of CPO is not discussed. CPO introduces an additional LLM expert. And from the loss function, each step involves at least four inferences by MLLMs, which may introduce additional training costs.

Questions

  1. How do the authors get the positive/negative samples (in Sec. 2.4, lines 202-203)? Is there any additional human involvement here?
  2. Lines 151 and 155 might have a typo: the X might be T.

Limitations

Yes.

Formatting Concerns

N/A

Author Response

Thank you very much for your kind and helpful review, especially for summarizing the strengths of our work, including 1) raising the problem of concept drift in CoT of MLLMs and solution, and 2) theoretical analysis and experimental support. In response to your concerns, we have provided a detailed explanation below and revised our manuscript accordingly.

  1. (Weakness 1) Background and problem. Thank you very much for your sincere review. Our background and problem are based on Observation 1.1 in our manuscript, where a small, semantics-preserving perturbation imposed on the CoT reasoning process causes the output distribution to deviate substantially, as shown in Fig. 1 of our manuscript. Specifically,

    • Concept drift denotes unpredictable distribution changes in data streams [1], where concept drift manifests itself as a large deviation in the distribution due to a small perturbation in CoT reasoning, which is viewed as a sequential token stream generation process.

    • Non-stationary environment refers to the CoT reasoning process, which is generated by the autoregressive decoding paradigm inherent to MLLMs and is vulnerable to interference from the data characteristics. We posit that it is characterized as a sequential token-stream generation process.

    We have added more details and explanations of our background and proposed problem in the section of Introduction. Thanks for your kind review.

  2. (Weakness 2) Transferred to other fields. Thanks very much for your detailed and thoughtful review. Our framework can be transferred to other fields.

    • First, chest diagnostics was chosen as the application domain in our manuscript due to the wealth of publicly accessible professional diagnostic reports, which constitute valuable CoT reasoning data. This makes it particularly suitable for studying MLLM reasoning and assessing the real-world robustness of our CPO framework.

    • Second, we mainly focus on the methodology for solving the concept drift problem in the CoT reasoning process. Our implementation is also highly automated and can be easily transferred to other similar fields.

    We have added more discussion about our future work, exploring additional multimodal fields in the section on Outlooks. Many thanks for your kind comments.

  3. (Weakness 3) Training costs. We appreciate the reviewer's valuable feedback regarding computational costs. As noted, the primary overhead arises from the additional LLM expert that generates counterfactual samples. We mitigate this through two key strategies:

    • Targeted Counterfactual Generation: Leveraging our hierarchical concept graph, we statistically identify features susceptible to concept drift. This allows targeted perturbation of high-drift-risk features instead of random perturbation. Consequently, we generate only two highly relevant counterfactual samples per original instance, significantly reducing computational demands.

    • Static Dataset Construction: The targeted approach enables pre-generation of all counterfactuals offline. Thus, the entire counterfactual dataset, our published CCF dataset, is built once via LLM inference prior to training, eliminating dynamic generation costs during model optimization. This process required approximately 5 days using 4 A100 GPUs. While still time-consuming and computationally intensive, this one-time cost is feasible and manageable for real-world deployment. We apologize for the confusion about each step involving four inferences by MLLMs: since counterfactuals are pre-generated offline, no such per-step inference cost is incurred during training. We have added more details in the section of Methodology.

    Many thanks for your kind and thoughtful suggestions.

  4. (Question 1) Positive and negative samples. Thanks for your kind question. We apologize for the misunderstanding. In terms of positive samples, MIMIC-CXR provides not only the diagnosis results of the disease but also a detailed diagnosis report issued by the doctor, in which the doctor explains the reasons for the diagnosis results. Therefore, we regard the reports issued by professional doctors as the human-preferred chains-of-thought, which are the positive samples. As for negative samples, we generate counterfactual CoT for diagnostic report instances, that is, we perturb specific radiological features according to our proposed concept graph to interfere with the diagnosis results. We have added more details about our positive and negative samples at Lines 202-203. Many thanks for your detailed review.

  5. (Question 2) Typos. Thanks very much for your corrections. We have corrected these typos and double-checked the entire manuscript to make sure there are no further typos.

We appreciate your time and thorough review. Your feedback is highly valuable to us, and we welcome further communication from you. If you are satisfied with our response, we would be grateful for your support in enhancing our rating score.

[1] Lu, Jie, et al. "Learning under concept drift: A review." IEEE Transactions on Knowledge and Data Engineering 31.12 (2018): 2346-2363.

Comment

Dear Reviewer T1Us:

We're grateful for your mandatory acknowledgement during this busy period. As the rebuttal period is ending soon, please feel free to let us know whether your concerns have been addressed, and if there are any further questions.

Best regards, Authors.

Review
Rating: 5

This paper addresses the critical issue of detrimental concept drift in CoT reasoning during non-stationary RFT of MLLMs. The paper reveals that unpredictable shifts in token distributions during reasoning lead to significant biases in final predictions. To tackle this, the authors establish a theoretical bridge between concept drift theory and RFT, formalizing CoT's autoregressive token streams as non-stationary distributions undergoing temporal shifts. Based on this theoretical framework, they propose Counterfactual Preference Optimization (CPO), which leverages a concept graph-empowered LLM expert to generate counterfactual reasoning trajectories, systematically disentangling beneficial distribution adaptation from harmful concept drift. CPO optimizes a dual objective to maximize alignment with human preferences while minimizing similarity to adversarial counterfactual paths, enabling stable RFT in non-stationary environments, particularly in medical domains. Additionally, they introduce CXR-CounterFact (CCF), a large-scale dataset with 320,416 curated counterfactual reasoning trajectories derived from MIMIC-CXR. Extensive experiments demonstrate CPO's superior performance in robustness, generalization, and accuracy, validating its effectiveness in mitigating concept drift for non-stationary custom-tuning.

Strengths and Weaknesses

Strengths

  1. Theory, method, and experiments in a closed loop

This work establishes a solid theoretical foundation to formalize concept drift, which directly motivates the CPO method. The approach is validated through well-designed experiments and ablation studies.

  2. Practical significance

This paper proposes a method to address the detrimental concept drift within CoT reasoning, which is of practical significance. The release of the CCF dataset also provides a valuable resource for research in causal inference and robust AI.

Weaknesses

  1. Computational overhead: The entire pipeline (concept graph construction → counterfactual data generation via LLM experts → CPO fine-tuning) appears to incur substantial computational costs. Notably, generating counterfactual samples would require extensive LLM inference operations, representing a significant computational burden. In Appendix C, the authors provide model training details (hyperparameters, 2x2 A100 GPUs) but offer no thorough analysis of these computational demands, which is critical for evaluating the method's practical feasibility in real-world deployment scenarios.

  2. Generalizability of concept graph construction: The method's core relies on a high-quality domain-specific concept graph. While the paper mentions using Med-PaLM for automated extraction from the MIMIC-CXR dataset, the following aspects remain underexplored: the technical details of this extraction process, the required human verification costs (if needed), and the feasibility of extending this approach to other domains. The current validation, focused solely on chest X-rays, may represent a potential bottleneck for broader applications.

  3. Some suggestions for word choice:

  • Line 16: contributed→contribute. I think the tense here should be present tense rather than past tense?

  • Line 263: under→in. It seems a bit strange to use "under" with "environments" here; should it be changed to "in"?

Questions

See weakness.

Limitations

yes

Final Rating Justification

The authors have addressed my concerns, and thus I have decided to increase my ratings to accept.

Formatting Concerns

No concerns.

Author Response

Sincere thanks for your meaningful and detailed review. Your positive feedback is a huge encouragement to our team, including: 1) solid theoretical foundation and well-designed experiments, and 2) practical significance and a valuable dataset. We have provided detailed responses below to address your questions one by one and revised our manuscript according to your feedback:

  1. (Weakness 1) Computational overhead. Many thanks for your helpful and thoughtful suggestions. We acknowledge the reviewer's valid concern regarding computational costs and appreciate the opportunity to clarify the practical feasibility of our approach. As the reviewer points out, the additional computational overhead is concentrated in the generation of counterfactual samples by the LLM. We alleviate it in the following two ways:

    • Targeted Counterfactual Generation instead of Random Perturbation. The constructed hierarchical concept graph establishes disease entities, radiographic features, clinical relationships and taxonomies. By analyzing the statistical relationship between radiographic features and disease entities in the concept graph, we can identify the features that are prone to drift and make targeted perturbations instead of random perturbations. Therefore, for each sample, we only generated the two most relevant counterfactual samples, which greatly reduces the number of counterfactuals generated and reduces computing resources.

    • Static Generation. Because we perform targeted counterfactual generation, we do not need to generate them dynamically during training. Therefore, we only need to build the entire dataset once through the inference of LLMs, which is the CCF dataset we published. The entire dataset construction process took about 5 days on 4 A100 GPUs. Although this is still very time-consuming and computationally intensive, it is affordable and feasible in real-world deployment scenarios.

    We have added more detailed discussions about the computation and generation of counterfactuals in the section on Implementation Details. Thanks so much for your elaborate and thoughtful review.
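
The "targeted instead of random perturbation" strategy above can be illustrated with a simple heuristic: rank each disease's features by how many disease profiles in the concept graph share them, and perturb only the top two. The ranking rule and toy data below are our hypothetical sketch, not the paper's actual concept-graph statistics.

```python
from collections import Counter

def perturbation_targets(disease_feature_map, k=2):
    """Pick, per disease, the k features most shared across diseases.

    Features appearing in many disease profiles are diagnostically ambiguous
    and hence plausible drivers of concept drift when perturbed.
    (Illustrative heuristic only.)
    """
    counts = Counter(f for feats in disease_feature_map.values() for f in feats)
    return {d: sorted(feats, key=lambda f: -counts[f])[:k]
            for d, feats in disease_feature_map.items()}

# Toy slice of a disease-feature concept graph.
graph = {
    "Pneumonia":   ["Consolidation", "Ground-glass opacity", "Air bronchogram"],
    "Edema":       ["Ground-glass opacity", "Pleural effusion", "Cardiomegaly"],
    "Atelectasis": ["Volume loss", "Consolidation"],
}
targets = perturbation_targets(graph)  # two perturbation targets per disease
```

Restricting generation to such targets is what caps the cost at two counterfactual samples per original instance, as described above.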

  2. (Weakness 2) Generalizability of concept graph construction. We sincerely thank the reviewer for raising the points about the generalizability of our concept graph construction.

    • First, we added more technical details of this extraction process in the Implementation Details. The prompt we used for automated extraction from the MIMIC-CXR dataset is as follows:

      TASK ROLE: You are a senior chest radiologist. I will provide a large number of chest DR case report texts. Please automatically extract key imaging feature words related to various chest diseases from these reports, and use these features to build a structured imaging knowledge graph to reveal the association between different diseases based on common imaging features.
      
      CORE REQUIREMENTS:
      1. Standardized feature extraction:
          - Extract all abnormal imaging descriptors from reports.
          - Normalize terminology using Radiology Lexicon (RadLex) and Fleischner Society guidelines.
      2. Disease-feature mapping:
          - Link standardized features to diagnoses diseases per report.
          - Identical features MUST use identical normalized terms across diseases.
      3. Knowledge graph construction:
          - Nodes: 
              Diseases, such as Pneumothorax, Atelectasis and Pneumonia;
              Features, such as lung opacity and air bronchograms.
          - Relationship:
              (Disease)-[HAS_FEATURE]->(Feature);
              (Feature)-[ASSOCIATED_WITH]->(Disease)
          - Semantics:
              Diseases sharing one or more identical feature nodes are interconnected.
      
      OUTPUT REQUIREMENTS:
      Please output the final constructed radiological knowledge graph and the standardized feature-disease association data extracted from the report in a structured JSON format. The format is as follows:
      {
          "data":[
              {"diseases":[], "features":[]},
              // ... cases
          ],
          "disease_feature_map":{
              "Pneumonia": ["Consolidation", "Ground-glass opacity", "Air bronchogram", "Pleural effusion"],
              // ... relationships
          }
      }
      
      

      As can be seen, we have achieved highly automated concept graph extraction and construction through the LLM. Then, one chest radiology expert assisted us in verifying the constructed concept graph.

    • Second, in terms of broader applications, we have achieved highly automated concept graph extraction and construction through the LLM, with minimal manual verification. We argue that our method can be transferred to other multimodal fields for broader application. We will continue to expand to more multimodal applications in future work. Thanks for your sincere review.

    • Third, the motivation for choosing chest diagnostics as the application is the abundance of public professional medical diagnosis reports, which serve as rich CoT reasoning processes. This domain provides an ideal platform for studying MLLM reasoning processes and validating the robustness of our proposed CPO in real-world settings. Future work will explore additional multimodal applications, and we have added related discussions in the section of Outlooks. Many thanks for your kind comments.
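
If helpful, the JSON format requested by the extraction prompt above can be consumed directly to recover the "diseases sharing a feature node are interconnected" semantics; the parser below is a sketch against that documented schema, not the authors' code.

```python
import json

def linked_diseases(disease_feature_map):
    """Return disease pairs connected through at least one shared feature node."""
    items = list(disease_feature_map.items())
    pairs = set()
    for i, (d1, f1) in enumerate(items):
        for d2, f2 in items[i + 1:]:
            if set(f1) & set(f2):  # shared feature => interconnected diseases
                pairs.add(tuple(sorted((d1, d2))))
    return pairs

# Minimal instance of the prompt's output format.
raw = json.loads("""{
  "disease_feature_map": {
    "Pneumonia": ["Consolidation", "Ground-glass opacity", "Pleural effusion"],
    "Edema": ["Ground-glass opacity", "Pleural effusion"],
    "Pneumothorax": ["Visceral pleural line"]
  }
}""")
edges = linked_diseases(raw["disease_feature_map"])
```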

  3. (Weakness 3) Typos. Thanks very much for your corrections. We have corrected these typos and double-checked the entire manuscript to make sure there are no further typos.

Thank you for your continued time and thorough review. Your feedback is immensely valuable to us, and we are always eager to hear from you. If you find our response satisfactory, could you please consider helping us improve our rating score?

Comment

Thank the authors for the detailed reply. I appreciate the authors' thorough clarifications regarding the computational overhead, the generalizability of the concept graph, and the language issues I pointed out. All of my concerns have been resolved, and I will raise the score to 5. Wishing the authors all the best in the next stages of their work.

Comment

We greatly appreciate your increased rating. Your positive feedback is incredibly helpful and a tremendous boost for our team. Wishing the reviewers all the best.

Review
Rating: 4

This paper addresses a critical issue in MLLMs—concept drift during CoT reasoning in non-stationary RFT. Concept drift refers to the unpredictable changes in reasoning token distributions during RFT, which can introduce significant biases in final predictions. The authors establish a theoretical link between concept drift and RFT, viewing CoT’s autoregressive token streams as non-stationary distributions subject to temporal shifts. To counteract the negative effects of concept drift, the paper introduces a novel method, CPO, which separates beneficial distribution adaptation from harmful drift.

Strengths and Weaknesses

Strengths

  • The paper establishes a novel theoretical connection between concept drift and reinforcement fine-tuning, addressing a significant gap in the literature regarding the stability of reasoning processes in MLLMs.
  • The creation of the CXR-CounterFact dataset, with over 320,000 counterfactual reasoning trajectories, is a valuable contribution to the research community, facilitating future studies on counterfactual reasoning in medical applications.
  • The proposed Counterfactual Preference Optimization (CPO) offers a methodologically robust way to manage and counteract the harmful effects of concept drift during non-stationary RFT, especially in the context of chain-of-thought reasoning. This is a significant step forward in ensuring stability and accuracy in dynamic environments.

Weaknesses

  • This paper focuses on the issue of concept drift in CoT reasoning in language models, but the drift problem in the visual modality has not been addressed.
  • The title and abstract of the paper should narrow the scope of the work, limiting it to medical imaging or chest diagnostics. The current title overclaims the contributions of the paper, as it only discusses chest diagnostics.
  • It appears that the CPO algorithm is essentially the DPO algorithm, with the negative samples replaced by counterfactual samples. Therefore, referring to it as a new CPO algorithm may not be entirely appropriate.
  • Your baseline should likely compare with the DPO using random negative samples, as this would demonstrate the effectiveness of your counterfactual approach. It would not be sufficient to simply use reasoning errors as negative samples and claim effectiveness.
  • In fact, most of your benchmark baselines use LLM models that are not pre-trained or fine-tuned with instructions, which makes the comparison unfair. Even if your method outperforms them, it would be difficult to attribute this to the CPO rather than the optimization of the backbone model.
  • Moreover, I noticed that the number of training steps for SFT is much smaller than that for CPO. The two should be compared at similar convergence steps, as CPO has about three times the amount of training data as SFT, which may mean that SFT was stopped before fully converging.
  • In the era of Long CoT [1-2], can various Reasoning LLMs with self-reflection strategies partially solve the problem of inconsistent causal inference?
  • Recently, methods like GRPO, which match based only on the final answer, are gaining popularity in reinforcement learning (RL) and other techniques, as they can achieve good reasoning abilities without requiring additional preference annotations. One concern I have is whether, if GRPO works, we still need CPO, which relies on the difficult process of constructing counterfactual data.
  • There is a lack of case studies and statistical analysis, making it difficult to demonstrate that CPO truly addresses the model's conceptual causal relationships, rather than simply allowing the model to better learn how to reason or acquire more knowledge.
  • Missing Multimodal Reasoning Reference: [1-6]

[1] From System 1 to System 2: A Survey of Reasoning Large Language Models

[2] Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

[3] Multimodal Chain-of-Thought Reasoning in Language Models

[4] M3CoT: A Novel Benchmark for Multi-Domain Multi-step Multi-modal Chain-of-Thought

[5] Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

[6] Ddcot: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models

Questions

Refer to Weakness

Limitations

Refer to Weakness

Formatting Concerns

No

Author Response

We sincerely appreciate your insightful, thoughtful, and detailed review. Your positive feedback serves as a great source of encouragement for our team, especially recognizing the strengths: 1) a novel theoretical connection addressing a significant gap in the stability of reasoning in MLLMs, 2) the valuable contribution of the CXR-CounterFact dataset, and 3) the methodological robustness of our proposed method. Furthermore, we have diligently responded to each of your questions and incorporated your feedback into our manuscript revisions:

  1. (Weakness 1) Drift in the visual modality. Thanks so much for your sincere and constructive suggestions. We acknowledge that this work does not address concept drift within the visual modality, as rightly pointed out by the reviewer. The motivation for focusing on concept drift in the language model is based on Observation 1.1 in the manuscript: we find that when MLLMs perform CoT reasoning within the language model, even a slight perturbation that does not change the semantics will cause concept drift. Consequently, our current analysis and proposed methods focus primarily on this textual reasoning aspect. We sincerely thank the reviewer for this valuable insight. In future work, we will focus on drift in the visual modality.

  2. (Weakness 2) The scope of the abstract. Thank you for your thoughtful and helpful feedback. We have revised the abstract to narrow the scope: "This paper uncovers a critical yet overlooked phenomenon in multi-modal large language models (MLLMs), especially for chest diagnosis: detrimental concept drift within chain-of-thought (CoT) reasoning during non-stationary reinforcement fine-tuning (RFT), where reasoning token distributions evolve unpredictably, thereby introducing significant biases in final predictions". The motivation for choosing chest diagnostics as an application is that, there are a large number of public medical diagnosis reports as professional CoT. It is particularly helpful in studying the reasoning process of MLLMs and verifying the robustness of our proposed CPO in real environments. We will continue to expand more multimodal applications in future works. Thanks for your constructive suggestions.

  3. (Weakness 3 & 4) CPO algorithm. Many thanks for your kind and astute suggestions. While CPO and DPO share a preference-style optimization framework, CPO is fundamentally distinct. DPO contrasts human-preferred vs. dispreferred responses, where the latter are random negative samples. CPO instead contrasts factuals with counterfactuals generated under explicit causal interventions, specifically designed to isolate the causal effect; this yields a tighter decision boundary than DPO. Crucially, we added experiments that replace CPO's counterfactuals with DPO-style random negatives on multi-label chest disease classification on MS-CXR-T, as shown in Table 2. With the same amount of training data, our proposed CPO is significantly superior to DPO, proving the two are not functionally equivalent. We have added discussions of DPO to the Methodology and Experiments sections. Thanks for your sincere review.

    | Method | Con. | PE | Pna. | Pnx. | Ede. | Avg. |
    | --- | --- | --- | --- | --- | --- | --- |
    | SFT | 54.9 | 71.7 | 70.0 | 95.9 | 76.5 | 73.8 |
    | DPO | 63.2 | 72.4 | 76.7 | 93.5 | 76.3 | 76.4 |
    | CPO | 77.7 | 72.7 | 87.4 | 95.8 | 75.3 | 81.8 |
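The contrast drawn above between DPO's random negatives and CPO's counterfactual negatives can be sketched with the standard pairwise preference loss. This is a minimal illustration, not the authors' exact implementation; the log-probabilities and `beta` are placeholder values, and the difference between the two methods lies entirely in how the rejected trajectory is constructed.

```python
import math

def pairwise_preference_loss(logp_chosen, logp_rejected,
                             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * implicit-reward margin), as in DPO-style objectives."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# DPO: the rejected trajectory is a random dispreferred sample.
# CPO (as described above): the rejected trajectory is a counterfactual
# generated by an explicit causal intervention on one concept, so the
# preference pair differs only in that concept.
loss_zero_margin = pairwise_preference_loss(-10.0, -10.0, -10.0, -10.0)  # = log 2
loss_separated = pairwise_preference_loss(-9.0, -12.0, -10.0, -10.0)
```

With a zero margin the loss sits at log 2; as the policy separates chosen from rejected trajectories relative to the reference, the loss decreases toward zero.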
  4. (Weakness 5) Pre-trained or fine-tuned baselines. Many thanks for your comments; we apologize for the confusion. We follow the benchmark settings from BioViL-T [7], and most of the compared methods have been both pre-trained and fine-tuned. For example:

    | Methods | Pre-trained Datasets | Fine-tuned Datasets |
    | --- | --- | --- |
    | BioViL | PubMed, MIMIC-III, MIMIC-CXR | RSNA, MIMIC-CXR |
    | BioViL-T | MIMIC-CXR v2 | Chest ImaGenome, MS-CXR, RSNA, MS-CXR-T |
    | Med-ST | MIMIC-CXR | MS-CXR-T, RSNA, COVIDx |
    | TempA-VLP | MIMIC-CXR, Chest ImaGenome | MS-CXR-T, Chest ImaGenome |
    | CoCa-CXR | MIMIC-CXR, Chest ImaGenome | MIMIC-CXR, Chest ImaGenome |
    | PromptMRG | MIMIC | MIMIC, IU X-ray |
    | MambaXray | CXPMRG | MIMIC-CXR, CheXpert Plus, IU X-ray |
    | Ours | — | MIMIC-CXR, MS-CXR-T |

    Significantly, our method outperforms the baselines despite utilizing less data, indicating that the performance improvement is largely attributable to CPO rather than to backbone model optimization. We have added more discussion of the compared methods to the Implementation Details section. Thanks again!

  5. (Weakness 6) Training steps of SFT and CPO. Thank you for raising this point. We appreciate your attention to the training details. To clarify, both the SFT and CPO models were trained for exactly one epoch, as suggested by Qwen2.5-vl [9]. The difference in the number of training steps arises solely because the CPO dataset is approximately three times larger than the SFT dataset due to the addition of counterfactual trajectories. Therefore, both models completed their training after seeing their respective datasets once, indicating comparable convergence points in terms of epoch count, rather than SFT being stopped prematurely. We have also added a discussion about the training steps in Implementation Details. Thanks for your detailed review.

  6. (Weakness 7) Inspiration of long CoT. Thank you very much for your constructive and thought-provoking suggestion. We agree that LLMs employing long CoT reasoning with self-reflection strategies represent significant advancements and can indeed mitigate certain forms of reasoning errors. However, self-reflection primarily acts as a reactive correction mechanism during the model's own inference process. In contrast, our counterfactual samples constitute a proactive intervention during the training phase: they actively shape the model's internal representations and decision boundaries to be robust against causal fallacies from the outset, reducing the need for later correction. Thus, we view these approaches as highly complementary rather than mutually exclusive. Moreover, we have cited these papers [1-2] and added more discussion in the Outlooks.

  7. (Weakness 8) Discussion about GRPO. Thank you for your insightful comment. We have noted GRPO's excellent efficiency and reasoning ability, especially on DeepSeekMath and DeepSeek-R1. We also tried using GRPO to train Qwen2.5-vl and DeepSeek-VL2 on MIMIC-CXR, but both runs encountered reward collapse that caused training to fail. Inspired by the DAPO paper [8], we attribute this to GRPO's reliance on sparse final-answer rewards, where suboptimal reward assignment obscures high-quality samples. In contrast, CPO provides denser causal trajectories, significantly improving generalization and robustness through controlled counterfactual interventions. We have also added a discussion of GRPO to the Experiments section. Thanks for your detailed review.
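The reward-collapse failure mode described above can be illustrated with GRPO's group-relative advantage. This is a simplified sketch under the usual formulation (reward normalized against its sampled group); the group size and epsilon are illustrative, not values from the paper.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each reward against its sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Sparse final-answer reward: if every sampled trajectory in a group fails
# (or every one succeeds), all advantages vanish and the policy receives
# no gradient signal from that group.
collapsed = group_relative_advantages([0.0, 0.0, 0.0, 0.0])

# A group containing at least one correct answer still yields a signal.
informative = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
```

With hard 0/1 final-answer rewards on a difficult domain like MIMIC-CXR, many groups fall into the all-fail case, which is consistent with the collapse the authors report.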

  8. (Weakness 9) Case studies. We appreciate the reviewer's insightful feedback. We have added case studies based on Observation 1.1 to our manuscript. The CPO-finetuned logit results for the case under Observation 1.1 are presented below.

    | CoT attribute | Atelectasis | Cardiomegaly | Lung Opacity | Pneumonia | Pneumothorax |
    | --- | --- | --- | --- | --- | --- |
    | "lung opacity" | 0.19 | 0.22 | 0.83 | 0.75 | 0.08 |
    | "opacity" | 0.19 | 0.17 | 0.88 | 0.80 | 0.04 |

    Compared to the results shown in Fig. 1 of our manuscript, CPO shows consistent performance even when slight perturbations are applied during CoT inference. In addition, our model assigns higher logits to the predictions of "lung opacity" and "pneumonia" inferred from the "opacity" attribute, and lower logits to all other diseases. This indicates that our model successfully performs causal inference.

    Besides, we also provided comparative experimental results for DPO in our answer to (Weakness 3 & 4) CPO algorithm. These statistically superior results indicate that the performance gain is attributable to learned conceptual causal relationships rather than to acquiring more knowledge. Thanks for your thoughtful review.
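The stability claim in the case study above can be checked mechanically from the reported numbers: the largest per-label logit shift between the two paraphrased CoT attributes is small, and the two causally linked findings keep the highest logits under both phrasings. A toy check over the values reported above (label order follows the logit table):

```python
labels = ["Atelectasis", "Cardiomegaly", "Lung Opacity", "Pneumonia", "Pneumothorax"]
logits_lung_opacity = [0.19, 0.22, 0.83, 0.75, 0.08]  # CoT attribute "lung opacity"
logits_opacity      = [0.19, 0.17, 0.88, 0.80, 0.04]  # perturbed attribute "opacity"

# Largest per-label logit change under the "lung opacity" -> "opacity" perturbation.
max_shift = max(abs(a - b) for a, b in zip(logits_lung_opacity, logits_opacity))

# The two highest-logit labels under the perturbed phrasing.
top_two = sorted(range(len(labels)), key=logits_opacity.__getitem__, reverse=True)[:2]
top_labels = {labels[i] for i in top_two}
```

Here `max_shift` stays at roughly 0.05 and `top_labels` remains {"Lung Opacity", "Pneumonia"}, matching the consistency claim in the rebuttal.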

  9. (Weakness 10) References. Many thanks for your detailed review. We have cited these papers [1-6], and added more discussion about the multimodal reasoning in the section on Introduction and Related Works.

Thanks again for your valuable time and thorough review. Your feedback is immensely valuable to us, and we welcome further communication from you. If you find our response satisfactory, could you please consider helping us improve our rating score?

[7] S. Bannur et al., “Learning To Exploit Temporal Structure for Biomedical Vision-Language Processing,” CVPR 2023.

[8] Yu, Qiying, et al. "DAPO: An Open-Source LLM Reinforcement Learning System at Scale." 2025.

[9] Bai, Jinze, et al. "Qwen technical report." arXiv preprint arXiv:2309.16609 (2023).

Comment

Dear Reviewer 3kr3:

Thanks again for your detailed and insightful reviews! We would like to check whether our responses have addressed your concerns, as the rebuttal deadline is approaching. If you have any further questions, please feel free to let us know. We are more than happy to clarify our paper further and discuss it with you.

Best wishes, Authors.

Comment

We sincerely thank all reviewers and ACs for your constructive and detailed reviews. We have answered the reviewers' questions one by one, and revised our paper according to the suggestions. We really appreciate that Reviewer PZmM has read our response, left comments and increased the score. We would like to know if our response has addressed your concerns and questions. If you have any further concerns or suggestions for the paper or our rebuttal, please let us know. We would be happy to engage in further discussion and manuscript improvement.

Final Decision

This paper addresses detrimental concept drift in chain-of-thought reasoning during non-stationary reinforcement fine-tuning of multi-modal LLMs. The author(s) propose Counterfactual Preference Optimization, which uses concept graphs to generate counterfactual reasoning trajectories that decouple beneficial adaptation from harmful drift in medical chest X-ray diagnosis. The work establishes an interesting theoretical connection between concept drift theory and reinforcement fine-tuning. The CPO method demonstrates statistically significant improvements over baselines across multiple benchmarks. Nevertheless, concerns such as computational overhead remain. The author(s) should address these concerns in the final version and explore applications beyond medical imaging to demonstrate broader generalizability of their approach.