PaperHub
Overall score: 6.8/10 · Spotlight · 4 reviewers (min 4, max 5, std 0.4)
Ratings: 4, 4, 5, 4
Confidence: 3.0
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

SGCD: Stain-Guided CycleDiffusion for Unsupervised Domain Adaptation of Histopathology Image Classification

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Transfer Learning · Domain Adaptation · Generative Models

Reviews and Discussion

Official Review
Rating: 4

The paper proposes a novel method called Stain-Guided CycleDiffusion (SGCD) for unsupervised domain adaptation in histopathology image classification. Recognizing that conventional models struggle to generalize across hospitals due to differences in staining protocols and imaging devices, the authors introduce a dual diffusion model framework that enables style transfer between domains without requiring paired data. The method leverages a cyclic structure and introduces a stain-guided consistency loss to preserve crucial structural and semantic information during the domain translation process. Experimental results on multiple public pathology datasets demonstrate that SGCD outperforms existing stain normalization, GAN-based, and general domain adaptation approaches, particularly in maintaining high-quality, diagnostically relevant features in the target domain images. The approach is designed to facilitate more robust histopathology image analysis across diverse clinical environments.

Strengths and Weaknesses

Strengths

  • Originality: The paper presents a timely and creative approach by adapting diffusion models—rather than traditional GANs—to the problem of unsupervised domain adaptation in histopathology images. The integration of a cyclic framework with stain-guided consistency loss offers a fresh perspective within this area.

  • Significance: Domain shift remains a major barrier for clinical deployment of machine learning in pathology. By addressing domain adaptation without requiring paired data or precise reference images, the proposed method has practical relevance for real-world applications.

Weaknesses

  • Clinical Relevance and Semantic Consistency: While the paper achieves strong quantitative results in classification tasks, it remains unclear whether the transformed images consistently preserve diagnostically critical structures at a level required for clinical decision-making. The stain-guided consistency loss primarily enforces color and statistical similarity, but does not guarantee preservation of fine-grained semantic or pathological features—an aspect that is crucial in actual pathology workflows.

  • Rare Cohort and Limited Data Scenarios: The method, like many domain adaptation approaches, relies on sufficient representation of target domain characteristics in the training data. In practice, rare disease subtypes or underrepresented cohorts remain a significant challenge. The current evaluation does not directly address how well the model preserves or generalizes rare pathological features, which can limit its impact in real-world clinical settings.

  • Practical Deployment Considerations: Although diffusion models offer benefits in diversity and stability over GANs, they are also considerably more resource-intensive, both in terms of training and inference time. The paper does not thoroughly discuss these limitations or provide guidance for efficient deployment in clinical environments.

  • Ablation and Qualitative Assessment: While the proposed method demonstrates quantitative superiority, a more detailed ablation study dissecting the contributions of each architectural component (e.g., cyclic structure, stain-guided loss) would strengthen the claims. Additionally, qualitative analysis—such as expert review of the generated images for diagnostic fidelity—would be valuable for establishing trust in the method.

Questions

  1. Clinical/Pathological Semantic Consistency: Can the authors provide more evidence that the proposed method preserves diagnostically meaningful structures in the transformed images? For example, have any pathology experts reviewed the generated images for semantic fidelity, or are there metrics (e.g., segmentation overlap, class activation consistency) that can better quantify this? Additional qualitative or expert-blinded assessments would be helpful to build clinical trust in the transformations.

  2. Handling of Rare Cohorts and Limited Data: How does the method perform when the target domain contains rare disease subtypes or underrepresented pathological features? Are there experiments or analyses—such as leave-one-class-out, few-shot adaptation, or stratified performance reporting—that can clarify the model’s robustness in these challenging settings? If possible, consider including experiments or discussions on rare cohort generalization.

  3. Component-wise Ablation and Contribution Analysis: Could the authors provide a more detailed ablation study to isolate the effect of each architectural component, such as the cyclic structure and the stain-guided consistency loss? Understanding the individual contributions would help clarify which elements are most important for the observed gains.

  4. Practical Deployment and Computational Overhead: Given the resource demands of diffusion models, what are the practical implications for clinical deployment, especially compared to existing CNN or GAN-based approaches? Are there strategies to mitigate computational costs, or are there scenarios where the overhead is justified by performance gains?

  5. Failure Cases and Limitations: Can the authors share any observed failure cases or known limitations—such as when the domain shift is too large, or when key features are not preserved? Transparency about boundaries of the method’s effectiveness will be valuable for both clinical and research users.

Score Adjustment Criteria: If the authors can provide convincing evidence of preserved diagnostic structures (via expert review or strong semantic metrics), demonstrate robustness to rare or limited data scenarios, and clarify the practical deployment implications, my evaluation score would increase. Conversely, persistent uncertainty in these areas would limit my confidence in the method’s clinical applicability.

Limitations

The authors have addressed several methodological limitations, such as the need for sufficient target domain diversity and the computational demands of diffusion models. However, the discussion of clinical limitations—especially regarding semantic consistency of diagnostically relevant features and handling of rare disease cohorts—could be expanded. I would encourage the authors to include more discussion of these aspects, as well as to consider the potential risks of over-reliance on synthetic images in clinical decision-making. Explicit acknowledgment of these points, and transparency about where the method may or may not be reliable, would further strengthen the paper.

Final Justification

Summary: The rebuttal meaningfully addresses my core concerns, so I raise my score by one level to Borderline Accept.

Resolved:

  • Semantic preservation / clinical relevance: Clearer description of classifier-guided training plus a quantitative class-consistency improvement on Camelyon16 (0.66 → 0.85) directly mitigates my concern.

  • Data scarcity: New sensitivity analysis with as little as 1% target data shows consistent gains over source-only, supporting sample efficiency.

  • Deployment clarity: Inference relies only on the trained target classifier; diffusion cost is confined to training.

  • Ablation support: Component-wise ablations (dual diffusion, SGC loss, fine-tuning) substantiate the architectural choices.

  • Transparency: Commitment to document failure cases and limitations improves reproducibility.

Overall Recommendation: The paper is methodologically original and empirically solid, with improved clarity and supporting analyses after rebuttal. Given remaining gaps in clinical validation, I cannot justify a larger increase, but the manuscript is just above the bar for acceptance.

Formatting Concerns

I did not notice any major formatting issues in the paper. All sections, figures, and references appear to conform to the NeurIPS 2025 Paper Formatting Instructions.

Author Response

Dear Reviewer 5ZXg,

We sincerely thank you for your thorough and insightful review of our manuscript. We deeply appreciate your positive feedback on our SGCD method's originality, the effectiveness of our bidirectional diffusion framework and stain-guided loss, and its strong empirical performance. Your constructive comments have significantly helped us improve the manuscript.

We have carefully addressed all your points and made substantial revisions, including:

  • Clarifying the semantic consistency of transformed images.
  • Discussing robustness to limited data and rare cohorts.
  • Addressing practical deployment considerations and computational overhead.
  • Presenting a detailed component-wise ablation study.
  • Discussing failure cases and limitations.

Please find our point-by-point responses below.


Q1: Clinical Relevance and Semantic Consistency.
A1: You raise a critical point regarding semantic consistency for clinical decision-making. We agree that solely relying on statistical information may not guarantee semantic preservation. Our method addresses this through collaborative training between the diffusion model and the target classifier, where the classifier actively guides the diffusion model to ensure generated images retain semantic information consistent with the source. To quantify this, we measured the class consistency on Camelyon16, comparing the class labels of original images with their transformed counterparts:

| Method | Class Consistency Ratio |
| --- | --- |
| No Adaptation | 0.66 |
| With SGCD | 0.85 |

This demonstrates that our method significantly improves the preservation of class-level semantic information. While this provides strong evidence, we acknowledge that direct expert review by pathologists or specific segmentation overlap metrics for fine-grained pathological features are crucial for full clinical trust. We consider these critical next steps for future work and will explicitly discuss this limitation and future plans in the revised manuscript.
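
In pseudocode terms, the ratio is computed as sketched below (a simplified illustration; `classifier`, `originals`, and `translated` are placeholder handles for the trained target classifier and two aligned batches of patches, not our exact evaluation script):

```python
import torch

def class_consistency_ratio(classifier, originals, translated, batch_size=64):
    """Fraction of patches whose predicted class survives translation.

    originals[i] and translated[i] are assumed to depict the same region,
    before and after domain translation.
    """
    classifier.eval()
    agree, total = 0, 0
    with torch.no_grad():
        for i in range(0, len(originals), batch_size):
            pred_orig = classifier(originals[i:i + batch_size]).argmax(dim=1)
            pred_trans = classifier(translated[i:i + batch_size]).argmax(dim=1)
            agree += (pred_orig == pred_trans).sum().item()
            total += pred_orig.numel()
    return agree / total
```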

Q2: Handling of Rare Cohorts and Limited Data Scenarios.
A2: We agree that handling rare disease subtypes and limited data presents a significant challenge in real-world clinical settings. It's important to note that our four datasets, being WSI slices, are inherently class-imbalanced, with tumor regions often representing relatively rare cohorts. The results presented in our main paper already offer insights into our method's performance under such imbalanced conditions.

To specifically address performance with limited target domain data, we conducted a supplementary experiment on Camelyon16, evaluating our method with varying percentages of available target domain data:

| Target Data % | Source-only | 1% | 10% | 50% | 100% |
| --- | --- | --- | --- | --- | --- |
| AUC | 75.9 | 87.5 | 89.4 | 93.5 | 95.8 |

The results indicate that our method shows promising adaptability even with very limited target domain data (e.g., 1% of target data), significantly outperforming the source-only baseline. This suggests our method's capability to generalize effectively in challenging, data-scarce scenarios. We will include this supplementary experiment in the appendix of the revised manuscript. We acknowledge that direct experiments on rare disease subtypes or few-shot adaptation are crucial for a comprehensive understanding, and we will explicitly state this as a key direction for future research.
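
The limited-data protocol amounts to randomly retaining a fixed fraction of the unlabeled target patches; a short sketch (illustrative only; `target_indices` and the fixed seed are placeholder assumptions, not our exact experimental script):

```python
import numpy as np

def subsample_target(target_indices, fraction, seed=0):
    """Keep a random `fraction` of the unlabeled target patches
    (e.g. fraction=0.01 for the 1% setting)."""
    rng = np.random.default_rng(seed)
    n = max(1, int(len(target_indices) * fraction))
    return rng.choice(target_indices, size=n, replace=False)
```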

Q3: Practical Deployment and Computational Overhead.
A3: You correctly highlight the resource demands of diffusion models. We want to clarify that our method's primary resource demand is during the training phase, not inference. For clinical deployment, only the trained target classifier is used for inference, which does not introduce additional computational overhead. For the training phase, we acknowledge the resource intensity. However, this can be mitigated by optimizing diffusion model sampling schedules and adjusting model sizes. Reducing training costs will be a key focus of our future research. We have expanded our discussion on these practical implications and mitigation strategies in the revised manuscript.

Q4: Component-wise Ablation and Qualitative Assessment.
A4: Thank you for this excellent suggestion. We have conducted a detailed component-wise ablation study to isolate the contributions of the dual diffusion model, the stain-guided consistency (SGC) loss, and the fine-tuning (FT) of diffusion models. The results are summarized below:

| Dual Diffusion Model | SGC Loss | FT of Diffusion Models | AUC |
| --- | --- | --- | --- |
| ✓ | ✓ | ✓ | 95.8 |
| – | ✓ | ✓ | 92.8 |
| ✓ | – | ✓ | 89.4 |
| – | – | – | 86.8 |

This study clearly demonstrates that both the dual diffusion architecture and the SGC loss are critical for achieving the observed performance gains. We have incorporated this detailed ablation analysis into the revised manuscript. Regarding qualitative analysis and expert review, as mentioned in Q1, we fully concur with its importance for building clinical trust and consider it a critical next step for future work.

Q5: Failure Cases and Limitations.
A5: Thank you for emphasizing transparency regarding failure cases and limitations. Our method does not rely on a specific reference image. During early training, when the diffusion model is not yet fully fine-tuned, or when domain shifts are very large, unexpected images might be generated, potentially increasing training difficulty. We will supplement the appendix of the revised manuscript with visual examples of observed failure cases and a detailed discussion on these limitations, providing valuable insights into the boundaries of our method's effectiveness.


Once again, we extend our sincerest gratitude for your invaluable feedback and time. We believe that the revisions made in response to your comprehensive comments have substantially improved the clarity, completeness, and clinical relevance of our manuscript. We sincerely hope that our detailed responses address your concerns, and we eagerly look forward to your further feedback.

Comment

Thank you for the detailed and thoughtful responses. I appreciate the additional experiments on limited-data settings, the clearer ablation study, and the transparency around failure cases and computational costs. These revisions have certainly strengthened the manuscript.

That said, I remain mindful that the ultimate goal of medical-AI research is to create tools that can be trusted and adopted in real clinical workflows. Two aspects still appear insufficiently validated for that purpose:

  1. Fine-grained pathological fidelity: Class-level consistency is a useful first check, but histopathology decisions often hinge on subtle nuclear or glandular details. A small-scale expert review (even ≤ 50 regions) or a quantitative nucleus/gland overlap metric would provide much stronger evidence that the transformation truly preserves diagnostic cues.

  2. Robustness to rare or under-represented cohorts: The new 1%/10% target-data experiment is helpful, yet it does not clarify performance on genuinely rare subtypes. Including subtype-specific metrics—or at minimum reporting where performance degrades—would be invaluable for estimating real-world reliability.

I believe addressing these two points—perhaps in the appendix for space reasons—would substantially strengthen the clinical relevance of the work and its appeal to both pathologists and translational researchers.

Overall, the manuscript is clearly moving in a positive direction, and I encourage the authors to keep the focus on demonstrable clinical value as they refine the paper.

Comment

Dear Reviewer 5ZXg,

We truly appreciate your time and thoughtful feedback throughout this process. Your insights have been invaluable in helping us further improve the quality and clarity of our work. For the remaining two aspects that need further validation, we are conducting experiments and will include them in the appendix.

Best regards, Authors of #19920

Comment

Dear Reviewer 5ZXg,

Thank you for your thoughtful and encouraging feedback. We greatly appreciate your focus on the clinical relevance of our work, and we agree that demonstrating trust and real-world value is paramount for medical AI research. We have conducted additional experiments specifically to address your two remaining points, and we believe the new evidence substantially strengthens the manuscript.

Q1: Fine-Grained Pathological Fidelity

A1: You correctly noted that histopathology diagnoses hinge on subtle details. To provide a stronger, quantitative validation that our method preserves these diagnostically meaningful structures, we conducted a new experiment on a subset of the Camelyon16 dataset. We used a semantic segmentation model to measure the pixel-level overlap of positive cell nuclei regions before and after our image translation. The segmentation model was trained on the target domain for a consistent evaluation.
The results, shown below, compare segmentation performance on original target images and on our translated images (T_S), each against the corresponding ground truth.

| Metric | Original | Translated |
| --- | --- | --- |
| IoU | 0.9661 | 0.9124 |
| Dice Score | 0.9827 | 0.9542 |

These metrics confirm that our method is highly effective at preserving fine-grained pathological structures. The high Dice Score and IoU for the translated images, with only a marginal drop in performance, provide strong quantitative evidence that our approach maintains the critical pixel-level details essential for accurate pathological interpretation.
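
For reference, both overlap metrics can be computed directly from binary masks; a self-contained sketch (mask names are placeholders):

```python
import numpy as np

def iou_and_dice(pred_mask: np.ndarray, gt_mask: np.ndarray):
    """IoU and Dice for binary nucleus masks (arrays of 0/1)."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    denom = pred.sum() + gt.sum()
    dice = 2 * inter / denom if denom else 1.0
    return iou, dice
```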

Q2: Robustness to Rare Cohorts

A2: We appreciate your request for further clarification on performance for genuinely rare subtypes. Our previous 1% and 10% target-data experiments were designed to simulate these scenarios, as tumor regions are a minority class in our imbalanced dataset.

We would also like to clarify a fundamental aspect of our method. Our approach is designed to be inherently robust to a limited amount of target domain data. This is because it primarily leverages a large, labeled source domain for knowledge and uses the translated images (T_S) to bridge the style gap. The problem of handling rare data within the source domain itself is a distinct research challenge, typically addressed in the field of few-shot learning, and lies outside the scope of this work. Therefore, our 1% target data experiment precisely tests our method's core capability: achieving high performance on an imbalanced target domain with minimal real-world data.

To provide the subtype-specific metrics you requested and to demonstrate where a standard approach might degrade, we have compared our full method with a Source-Only baseline on the 1% target data setting.

Negative/Majority:

| Metric | No Adaptation (Source Only) | SGCD with 1% Target Data |
| --- | --- | --- |
| Precision | 0.985 | 0.835 |
| Recall | 0.967 | 0.917 |
| F1-score | 0.976 | 0.874 |

Positive/Rare:

| Metric | No Adaptation (Source Only) | SGCD with 1% Target Data |
| --- | --- | --- |
| Precision | 0.104 | 0.908 |
| Recall | 0.212 | 0.819 |
| F1-score | 0.140 | 0.861 |

The results clearly show a significant performance degradation in the Source-Only baseline for the rare positive class, with a Recall of only 0.212 and a low F1-score of 0.140. In stark contrast, our full method achieves a robust Recall of 0.819 and a high F1-score of 0.861 for the same rare class. This evidence demonstrates that our domain adaptation strategy is not only robust but also provides a crucial, decisive benefit in identifying the most challenging, clinically relevant rare samples.
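
The per-class numbers above can be reproduced with standard tooling; a minimal sketch (assuming scikit-learn, with `y_true`/`y_pred` as placeholder label arrays in which 1 denotes the rare positive class):

```python
from sklearn.metrics import precision_recall_fscore_support

def per_class_report(y_true, y_pred):
    """Precision/recall/F1 per class; label 1 is the rare positive class."""
    prec, rec, f1, support = precision_recall_fscore_support(
        y_true, y_pred, labels=[0, 1], zero_division=0)
    for cls, name in [(0, "negative/majority"), (1, "positive/rare")]:
        print(f"{name}: P={prec[cls]:.3f}  R={rec[cls]:.3f}  "
              f"F1={f1[cls]:.3f}  (n={support[cls]})")
```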

We believe these new, demonstrative experiments directly address your concerns about real-world reliability and the handling of rare cohorts. We will include these findings in a revised appendix to strengthen the clinical appeal of our work.

Once again, we are grateful for your insightful guidance.

Comment

Dear Reviewer 5ZXg, Thank you very much for confirming that most of your concerns have been addressed! Regarding the remaining concern, we have conducted new experimental verification to address it. Since the rebuttal process will end soon, is it possible to confirm if you are satisfied with our response? We greatly appreciate your time and constructive feedback, which have helped us improve both the clarity and quality of the paper.

Best, Authors of #19920

Official Review
Rating: 4

The paper introduces a new technique for unsupervised domain adaptation that uses diffusion models. The method converts source-domain images into the style of the target domain using a bidirectional diffusion process. The translated source images are then used to train a target classifier, via cross-entropy on those images together with distribution alignment between target and source.

优缺点分析

Strengths:

  1. The paper is well written and easy to follow to understand the proposed architecture.

  2. The proposed SGCD method uses a bidirectional diffusion framework combined with a stain-guided consistency mechanism. Unlike traditional domain adaptation approaches based on unidirectional image translation (GANs), SGCD employs two separate diffusion models, trained on the source and target domains respectively, enabling a cyclic conversion process (S → T → S and T → S → T); a minimal sketch of one leg of this cycle follows this list. This dual-path architecture ensures that the generated images retain essential structural and semantic features of the original data. Furthermore, the stain-guided consistency loss guides the reverse process during generation, allowing more consistent style transfer across domains.

  3. Also, the proposed approach doesn’t require paired images across domains, which is a significant advantage in medical imaging scenarios such as histology.

  4. Empirical results show improved performance on different standard histology datasets.
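
To make the cyclic conversion in point 2 concrete, here is how I read one S → T leg (an illustrative sketch, not the authors' code; `target_reverse` is a hypothetical handle to the target-domain model's learned denoising chain):

```python
import torch

def translate_s_to_t(x_s, target_reverse, alphas_cumprod, k):
    """Noise a source patch to step k with the closed-form forward process
    q(x_k | x_0), then denoise it with the target-domain diffusion model.
    The T -> S leg mirrors this, and the cycle constraint penalizes the
    round-trip reconstruction error."""
    a_bar = alphas_cumprod[k]
    x_k = a_bar.sqrt() * x_s + (1.0 - a_bar).sqrt() * torch.randn_like(x_s)
    return target_reverse(x_k, k)
```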

Weaknesses:

  1. The second part of the method relies on a standard feature alignment strategy that is already used in existing domain adaptation literature [1].

  2. Since the adaptation step is not novel, the main contribution lies in the generation part of the source images in the style of the target domain using diffusion models. However, the paper does not provide sufficient justification or empirical evidence demonstrating the advantage of diffusion-based generation over traditional GAN-based approaches.

Minor Weaknesses:

  1. The authors define C_t and C_s sometimes as classifiers and sometimes as the target model (lines 227 and 228); it would be nice to clarify this. Equation 9 also needs clarification: the authors use C_t(T_s) in both parts, but in the first part it denotes the model's prediction, while in the second part (the MMD term) it denotes the features of T_s, which is inconsistent.

  2. The authors could support the claim in lines 33 and 34 about annotation being "time consuming" and requiring "expert interpretation" with a citation.

[1] Vinod K Kurmi, Venkatesh K Subramanian, Vinay P Namboodiri. Domain Impression: A Source Data Free Domain Adaptation Method.

问题

  1. As previously mentioned, the adaptation component of the second phase of the method is similar to standard practices commonly used in unsupervised domain adaptation [1]. Can the authors clarify the novelty of this part, as it seems limited? The benefits of the proposed approach seem to result essentially from the image generation stage rather than from the target model training step.

  2. Also, since the advantage of the proposed method seems to rely on the generated images, can the authors explain why diffusion-based images perform better than those generated by GANs for adaptation? What would happen if images from a GAN were used instead during the second stage of adaptation?

  3. Can the authors clarify how their model would perform on more challenging cross-organ adaptation tasks, such as adaptation between GlaS and CAMELYON as defined in [2], or on other datasets involving stronger domain shifts?

  4. Also, it appears that training the model requires a reference image, which I assume is selected from the target domain. However, the paper does not explain how the reference image is selected. It would be interesting to understand whether selecting different images impacts the results. Since each image in the target domain may have a different stain matrix, this parameter could influence the method's performance. Can the authors explain this aspect more precisely?

[2] Alexis Guichemerre, Soufiane Belharbi, Tsiry Mayet, Shakeeb Murtaza, Pourya Shamsolmoali, Luke McCaffrey, Eric Granger. Source-Free Domain Adaptation of Weakly-Supervised Object Localization Models for Histology.

Limitations

yes

Final Justification

The authors' responses are convincing on several points and highlight the advantages of the proposed method, particularly in scenarios where shifts between the source and target domains result from staining variations within the same organ. Nevertheless, it remains unclear which component of the method contributes most significantly to the observed performance: is it the diffusion process, or the adaptation step between the generated images T_s and T?

I recommend a borderline accept.

Formatting Concerns

No

Author Response

Dear Reviewer dvqh,

Thank you for your thorough review and valuable feedback. We appreciate your recognition of our paper's clarity, the originality of our diffusion-based framework, and its practical significance for addressing stain heterogeneity in pathology without requiring paired data. Your insights have helped us further strengthen our manuscript.

We've addressed your points with significant revisions, including:

  • Clarifying the novelty of our adaptation strategy.
  • Justifying the advantages of diffusion-based generation.
  • Refining notation and supporting claims.
  • Committing to further experiments on challenging cross-organ tasks.
  • Explaining our approach to reference image selection.

Please find our point-by-point responses below.


Q1: Novelty of feature alignment strategy.
A1: You raise a crucial point about the novelty of our adaptation step and the advantage of diffusion over GANs. While some aspects of classifier fine-tuning may resemble existing methods like SDDA [1], our novelty lies in the collaborative and guided generative process. Unlike SDDA, which relies on statistical alignment and can struggle with large domain shifts or pseudo-labeling [3, 4], our method does not emphasize pure statistical alignment. Instead, we leverage pre-trained diffusion models' ability to capture data distributions. During fine-tuning, stain information and semantic consistency constraints from the classifier dynamically guide the diffusion model, generating images precisely suited for target classifier refinement. This synergistic training (diffusion model and classifier working together) is distinct from approaches where the generator is trained independently.

Q2: Justification for diffusion vs. GANs in generation.
A2: We agree that justifying the advantage of diffusion-based generation is crucial. Our empirical results in Tables 3 and 4 show that methods like StainGAN and StainNet, despite incorporating stain normalization and consistency constraints, perform slightly weaker than our approach. We attribute this to GANs' inherent mode collapse and training instability, which hinder accurate distribution modeling. Diffusion models, conversely, offer superior training stability and produce higher-quality, more diverse image generations. This is vital for robust data augmentation. Unlike GANs, diffusion models benefit from a step-by-step denoising process that can be precisely guided by fine-grained information (like our stain guidance), leading to more consistent and semantically accurate style transfer. GANs, due to their architectural differences, cannot achieve this gradual guidance, making their generated images highly dependent on the initial GAN quality. We will include a dedicated discussion on this comparative advantage in the appendix, potentially with supplementary experiments using GANs for image generation within our framework.

Q3: Clarification of classifier notation and Equation 9.
A3: Thank you for pinpointing these important inconsistencies. We will revise lines 227-228 to consistently define C_S as the source domain classifier and C_T as the target domain classifier (our target model). For Equation 9, we'll clarify that C_T(x) refers to the prediction (e.g., logits) in the first part, and to the feature representation (embedding) used for MMD in the second part. This precision will be reflected in the revised manuscript.
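
To make the two roles of C_T(x) explicit, the combined objective can be sketched as below. This is an illustrative reading of Equation 9 under simplifying assumptions: hypothetical `backbone`/`head` modules stand in for C_T's feature extractor and classification head, and a single-bandwidth RBF kernel is used for MMD.

```python
import torch
import torch.nn.functional as F

def rbf_mmd2(x, y, sigma=1.0):
    """Squared MMD between two feature batches with an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def adaptation_loss(backbone, head, t_s, labels_s, t, lam=1.0):
    """Cross-entropy on translated source images T_s (labels inherited
    from the source) plus MMD between features of T_s and real target
    images T."""
    f_ts, f_t = backbone(t_s), backbone(t)       # feature embeddings
    ce = F.cross_entropy(head(f_ts), labels_s)   # prediction role of C_T
    return ce + lam * rbf_mmd2(f_ts, f_t)        # feature role of C_T (MMD)
```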

Q4: Support for WSI annotation claim.
A4: We agree this claim needs support. Annotating Whole Slide Images (WSIs) is indeed highly time-consuming and requires expert interpretation. We'll support this claim in lines 33-34 by referencing relevant literature [5] that discusses the complexity and cost of obtaining such annotations.

Q5: Novelty of the adaptation component.
A5: As detailed in our response to Q1, the novelty lies in the collaborative training of our adaptation phase. Unlike most UDA methods where the generator is trained independently, our approach enables synergistic training where the target classifier actively guides the diffusion model's generation. This dynamic feedback loop ensures generated images are tailored to the classifier's needs, optimizing directly for downstream classification. This active co-training distinguishes our adaptation from a passive consumption of pre-generated data.

Q6: Advantages of diffusion-based images over GANs for adaptation.
A6: As addressed in Q2, diffusion models generally offer superior fidelity and diversity compared to GANs. While pre-trained diffusion models capture target distribution, our method integrates stain guidance and classifier-driven semantic consistency for effective image transformation. If GANs were used, their structural differences would prevent step-by-step, fine-grained stain guidance, making generated images highly dependent on initial GAN quality and prone to instability/mode collapse. This would hinder precise style transfer and semantic preservation, limiting adaptation performance. We will include further discussion on this in the appendix.

Q7: Performance on challenging cross-organ adaptation.
A7: We appreciate this excellent suggestion. In response, we are conducting additional experiments on challenging cross-organ adaptation tasks, referencing the settings from [2] (GlaS and CAMELYON adaptation) and potentially other datasets with stronger domain shifts [6]. However, these cannot be finished before the rebuttal deadline (July 30). Once they are finished, the results and a discussion of our model's performance under these more extreme conditions will be included in the appendix of the revised manuscript, providing valuable insights into its generalizability.

Q8: Reference image selection process.
A8: Thank you for this clarification request. Our method does not rely on a specific fixed reference image. During training, we dynamically use randomly sampled images from the target domain to provide stain information for our stain-guided consistency loss. This inherently accounts for variations in stain matrices within the target domain, making our method robust. Moreover, occasionally generated images with slight deviations can act as a form of data augmentation for the classifier [7]. We will clarify this crucial aspect in the revised manuscript.
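
For readers unfamiliar with how stain information can be obtained from an unpaired target image, a common Macenko-style estimate proceeds from Beer-Lambert optical density; the sketch below illustrates that standard recipe, not necessarily our exact implementation:

```python
import numpy as np

def optical_density(rgb_patch):
    """Beer-Lambert optical density of an RGB patch (uint8, HxWx3):
    OD = -log10((I + 1) / 256)."""
    return -np.log10((rgb_patch.astype(np.float64) + 1.0) / 256.0)

def stain_directions(od, od_threshold=0.15):
    """Two dominant stain directions of the OD pixel cloud (Macenko-style),
    estimated via SVD after dropping near-background pixels."""
    pixels = od.reshape(-1, 3)
    pixels = pixels[(pixels > od_threshold).any(axis=1)]
    _, _, vt = np.linalg.svd(pixels - pixels.mean(axis=0), full_matrices=False)
    return vt[:2]  # plane spanned by the two leading components
```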


Once again, we sincerely thank you for your invaluable feedback. We believe our revisions significantly enhance the manuscript's clarity, completeness, and practical relevance. We hope our detailed responses address your concerns and look forward to your further feedback.

Reference:
[3] Haifeng Xia, Handong Zhao, and Zhengming Ding. Adaptive adversarial network for source-free domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9010–9019, October 2021.

[4] Song Tang, Wenxin Su, Mao Ye, and Xiatian Zhu. Source-free domain adaptation with frozen multimodal foundation model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23711–23720, June 2024.

[5] Weiyi Wu, Chongyang Gao, Joseph DiPalma, Soroush Vosoughi, and Saeed Hassanpour. Improving representation learning for histopathologic images with cluster constraints. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21404–21414, 2023.

[6] Fangda Li, Zhiqiang Hu, Wen Chen, and Avinash Kak. Adaptive supervised patchnce loss for learning h&e-to-ihc stain translation with inconsistent groundtruth image pairs. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 632–641. Springer, 2023.

[7] Jia-Ren Chang, Min-Sheng Wu, Wei-Hsiang Yu, Chi-Chung Chen, Cheng-Kung Yang, Yen-Yu Lin, and Chao-Yuan Yeh. Stain mix-up: Unsupervised domain generalization for histopathology images. In Marleen de Bruijne, Philippe C. Cattin, Stéphane Cotin, Nicolas Padoy, Stefanie Speidel, Yefeng Zheng, and Caroline Essert, editors, Medical Image Computing and Computer Assisted Intervention -- MICCAI 2021, pages 117–126. Springer International Publishing, 2021.

Comment

I would like to thank the authors for addressing my comments and those of the other reviewers, which helped clarify multiple points.

I still have one concern about the proposed method. The theoretical rationale behind the adaptation process after the diffusion part is not clear, especially the need to fix the backbone and retrain the image classifier.

What is the performance with the new images T_s before and after the adaptation part (after diffusion)?

Why not retrain the backbone while freezing the image classifier, using T_s and T?

Comment

Dear Reviewer dvqh,

Thank you again for your focused feedback. Your latest comments have helped us identify a crucial point of confusion in our manuscript, and we appreciate the opportunity to clarify our approach with new evidence.

Q1: Clarifying the Adaptation Process

A1: We sincerely apologize for the imprecise description in our manuscript that led to a misunderstanding. We would like to clarify that our proposed method actually employs the exact strategy you suggested: we retrain the backbone while freezing the image classifier head. This is a deliberate choice, allowing the backbone to adapt to the new domain's feature distribution while preserving the high-level semantic knowledge of the classifier. We will revise Section 4.3 to describe this process more clearly.

Q2: Performance with the new images T_S

A2: We appreciate you asking about the performance of our generated images (T_S) before and after adaptation. This allowed us to directly quantify our method's core contribution: the ability to produce high-quality, labeled T_S images that are both stylistically consistent with the target domain and class-consistent.

This means that even before the full adaptation process, the generated T_S images themselves are a powerful resource. Training a classifier on these images alone provides a strong baseline performance, which we use to quantify the inherent quality of our generated images. The full adaptation pipeline then refines the model further by incorporating unlabeled target images (T) to ensure the classifier's performance is optimized on the true target domain distribution.

The results below quantify the impact of these two steps:

| Training Setup | AUC |
| --- | --- |
| No Adaptation (Source Only) | 75.9 |
| Training with only T_S | 92.6 |
| Training with T_S and T | 95.8 |

The results show that even when training with only our generated T_S images, we achieve a strong AUC of 92.6%. This demonstrates the primary benefit of our method: generating class-consistent, labeled images from an unlabeled target domain that can be used directly for DA. The further performance gain to 95.8%, achieved by additionally incorporating the unlabeled target domain images (T) via a feature alignment strategy like MMD, highlights that our approach provides an excellent starting point for adaptation with T_S, and subsequent steps can further refine the model and achieve even greater performance.

We believe this clarification and new evidence directly address your concerns about our adaptation strategy and its theoretical foundation. We will add these findings to the manuscript to enhance clarity.

Comment

Dear Reviewer dvqh, Thank you very much for confirming that most of your concerns have been addressed! Regarding the remaining concerns, we have conducted new experimental verification and clarified a misleading part. Since the rebuttal process will end soon, is it possible to confirm if you are satisfied with our response? We greatly appreciate your time and constructive feedback, which have helped us improve both the clarity and quality of the paper.

Best, Authors of #19920

Official Review
Rating: 5

The authors tackle stain/style shifts in histopathology with Stain-Guided CycleDiffusion (SGCD): two pre-trained diffusion models (one per domain) translate patches back and forth under (i) bidirectional generative constraints and (ii) a stain-guided consistency (SGC) loss. Unlike prior GAN or color-deconvolution approaches, SGCD needs no paired data or hand-picked reference slide and explicitly pushes the reverse process to preserve pathology-relevant structure. On four public datasets, the method consistently outperforms current methods.

Strengths and Weaknesses

Strengths:

  1. Clever reuse of pre-trained diffusion models plus a stain-aware guidance term. The cyclic bi-directional training is a nice twist that avoids the usual need for paired patches and still keeps semantics intact (although not really new, as this is a known trick first introduced by CycleGAN).
  2. Uses four well-accepted datasets, multiple cross-hospital splits, and compares to 10+ baselines, including the latest diffusion UDA (StainDiff) and medical DG methods
  3. Pathology practitioners struggle with stain heterogeneity; a drop-in diffusion-based adaptor that outperforms current methods is valuable. The approach may generalize to other low-data medical domains.
  4. The paper is mostly easy to follow; pseudo-code and training details (optimizer, timesteps, GPUs) are given, and the code is promised.

Weaknesses:

  1. The method adds overhead. Performing a cycle while training is always a big cost in training run-time and resources.
  2. Minor: No ablation on the (seemingly important) hyper-param k. It would be interesting to see how sensitive the method is to it.
  3. It would be nice to see visually the failure cases (which I'm sure exist) in order to understand better the method.

Questions

  1. Why does Connect Later perform marginally better in Table 2 but not in Table 3?
  2. Would it generalize to other areas in the medical domain?

Limitations

yes

Final Justification

The rebuttal has addressed most of my concerns.

Formatting Concerns

no

Author Response

Dear Reviewer TZLz,

We are truly grateful for your thorough and insightful review of our manuscript. We deeply appreciate your positive feedback on the originality of our dual-diffusion architecture, the effectiveness of the stain-guided loss, and the state-of-the-art performance of SGCD. Your recognition of our method's contribution to unsupervised domain adaptation in histopathology and its practical implications is very encouraging.

Your constructive suggestions and precise questions have been invaluable in guiding us to further strengthen the clarity, completeness, and practical considerations of our paper. We have carefully addressed each of your points and made significant revisions to the manuscript accordingly. Our main updates include:

  • Providing a discussion on the computational overhead and strategies for efficient deployment.

  • Presenting a comprehensive ablation study on the hyperparameter k.

  • Committing to include visual examples of failure cases in the appendix.

  • Clarifying the performance of baselines in different experimental settings.

  • Discussing the generalizability of our method to other medical domains.

Please find our point-by-point responses below, addressing each of your valuable comments.

Q1: Computational Cost--"The method adds overhead. Performing a cycle while training is always a big cost in training run-time and resources."
A1: We sincerely appreciate your concern regarding the computational overhead during training. We acknowledge that the cyclic bi-directional training, while crucial for maintaining semantic integrity without paired data, indeed introduces additional computational demands. This is an inherent trade-off for the benefits our method provides. However, we believe this can be mitigated by exploring different diffusion model sampling schedules and optimizing the size of the diffusion model itself. Enhancing the training efficiency will be a key focus of our future research, and we have added a discussion on this aspect in the revised manuscript.

Q2: Hyperparameter k Ablation
A2: Thank you for pointing out the importance of an ablation study on hyperparameter k. We agree that understanding its sensitivity is crucial for a robust method. We have now conducted a dedicated ablation study on the Camelyon16 dataset, and the results (AUC) are presented in the table below, where k_G is as defined in Equations 5 and 7 of our manuscript:

| k \ k_G | 10 | 100 | 150 |
| --- | --- | --- | --- |
| 200 | 88.6 | 94.5 | 85.6 |
| 400 | 93.6 | 94.2 | 89.1 |
| 600 | 94.1 | 95.8 | 90.9 |
| 800 | 85.9 | 94.7 | 90.4 |
| 1000 | 91.5 | 94.9 | 83.9 |

The experimental results demonstrate that our proposed method consistently achieves superior performance compared to existing methods across a wide range of k values. This suggests that while k does influence performance, our method's effectiveness is robust to reasonable variations in this hyperparameter. We have incorporated this ablation study and its discussion into the revised manuscript.

Q3: Visual Failure Cases
A3: We appreciate this valuable suggestion. Due to the constraints of the rebuttal submission format, we are unable to provide experimental images here. However, we fully agree that visualizing failure cases is important for a comprehensive understanding of our method's limitations. We will supplement the failure case images and their analysis in the appendix of the revised manuscript. Our method primarily aims to improve the performance of target classifiers in the target domain by generating high-quality images, and understanding failure modes will further guide future improvements.

Q4: Connect Later Performance Discrepancy.
A4: In both Table 2 and Table 3, we strictly followed the hyperparameter settings proposed by the original authors of Connect Later [1]. However, it's important to note that these settings might not generalize optimally across all simulated scenarios. Specifically, Table 3 simulates a more complex environment (transfer from a single source domain to multiple target domains) compared to Table 2. The original Connect Later [1] also highlights the critical importance of targeted data augmentation for specific scenarios. We believe that the inherent complexity of the multi-target domain setting in Table 3, combined with the fixed hyperparameter settings, explains the observed performance difference for Connect Later.

Q5: Generalizability to Other Medical Domains.
A5: This is an excellent question that touches upon the broader impact of our work. Given the nature of our currently available datasets, our primary research focus has been on pathological images. However, we are profoundly optimistic about the potential for our method to generalize effectively to other medical imaging domains, such as immunology problems, where stain or domain heterogeneity is also a significant challenge. Extending our methodology to these diverse medical imaging fields will be a crucial and exciting direction that we envision for our future research.

Once again, we extend our sincerest gratitude for your invaluable feedback and time. We believe that the revisions made in response to your comments have substantially improved the clarity, completeness, and practical relevance of our manuscript. We sincerely hope that our detailed responses address your concerns, and we eagerly look forward to your further feedback.

Reference: [1] Helen Qu and Sang Michael Xie. Connect Later: Improving Fine-tuning for Robustness with Targeted Augmentations. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, Volume 235, pages 41769–41786. PMLR, July 21–27 2024.

Comment

Thank you for the response and your effort! The rebuttal has answered most of my concerns. I raised the score accordingly.

Comment

Dear Reviewer TZLz,

We sincerely thank you for carefully reading our rebuttal, acknowledging our effort, and raising the score accordingly. The discussions involved in the rebuttal phase will be included in the final revised version. Thank you again for your comments!

Authors of #19920

Official Review
Rating: 4

This paper addresses the UDA problem for histopathology image classification, where models trained on slides stained at one site often underperform on slides from another due to stain and preparation variations. The authors propose SGCD, a dual-diffusion generative framework that translates images from the source to the target domain while preserving crucial morphological features. The SGCD architecture imposes bidirectional generative constraints (cyclic consistency between source-to-target and target-to-source translations, similar to CycleGAN) to maintain semantic fidelity during translation. Additionally, a stain-guided consistency loss is introduced to guide the diffusion models and enhance their denoising and translation quality without requiring any paired training images. The authors also use a classification loss to ensure that the source classifier predicts the same label for the reconstructed image. Experiments on four public histopathology datasets demonstrate that SGCD substantially improves target-domain classification performance compared to prior stain normalization and UDA methods.

Strengths and Weaknesses

Strengths:

  1. The authors propose a dual-diffusion architecture with bidirectional generative constraints, which first pushes an image to the opposite domain and then brings it back through another noise–denoise cycle, similar to CycleGAN; the idea is interesting.

  2. The proposed stain-guided loss steers each denoising step toward a Beer–Lambert–based stain reference while simultaneously enforcing task-level classification consistency, enabling high-quality translation without paired data.

  3. The authors apply source-trained classifiers to the synthesized images; the classification loss keeps synthesized images discriminative for downstream diagnosis.

  4. SGCD delivers state-of-the-art performance on four pathology benchmarks, demonstrating superior stability, reproducibility, and cross-hospital robustness.

Weaknesses:

  1. The approach involves training/fine-tuning two diffusion models (for the source and target domains) and then generating many synthetic images for classifier training. Diffusion models are resource-intensive, and although the authors fine-tune rather than train from scratch, the method still likely requires substantially more computation than simpler UDA techniques (like feature adaptation or one-pass style transfer). The paper does not report training times or complexity in detail. As a reviewer, I worry about the practicality: e.g., how many diffusion steps k are needed for good results? Some discussion or results on efficiency would strengthen the work.

  2. All experiments focus on binary classification tasks (tumor vs. normal or mitosis detection), and mostly on H&E stain variations. It remains to be seen how SGCD performs for more complex scenarios, such as multi-class classification or adaptation between entirely different staining protocols (e.g. H&E vs. immunohistochemistry).

Questions

  1. While you compare to GAN-based and other UDA methods, it would strengthen the work to see an ablation or comparison showing the benefit of dual diffusion and the SGC loss over a one-diffusion approach. Are the gains primarily coming from the cycle consistency or from diffusion in general?
  2. The method is described as “stain-guided”, suggesting it’s tailored to color/stain variations. How would SGCD handle domain shifts that are not purely stain differences? For instance, if the target domain has different tissue structures or artifacts (blur, scanner noise) that the source doesn’t, can the dual diffusion still adapt effectively?

Limitations

Please kindly refer to the weaknesses and questions above.

Final Justification

These results from the rebuttal alleviate some of my concerns. Nevertheless, the method remains computationally demanding and, at present, is limited to binary classification on a single stain, which limits its immediate clinical applicability. For these reasons, I will keep my original borderline-accept rating.

Formatting Concerns

No Paper Formatting Concerns.

Author Response

Dear Reviewer 1BK2,

We are truly grateful for your insightful and comprehensive review of our manuscript. Your positive feedback on the originality of our dual-diffusion architecture, the effectiveness of the stain-guided loss, and the state-of-the-art performance of SGCD is deeply appreciated. We are particularly encouraged by your recognition of our method's contribution to unsupervised domain adaptation in histopathology and its practical implications.

Your constructive criticisms and precise questions have been invaluable in guiding us to further strengthen the clarity, completeness, and practical considerations of our paper. We have carefully addressed each of your points and made significant revisions to the manuscript accordingly. Our main updates include:

  • Providing a detailed discussion on the computational overhead and strategies for efficient deployment.

  • Presenting a comprehensive ablation study to clarify the individual contributions of our key architectural components.

  • Including new experimental results demonstrating the method's robustness to non-stain related domain shifts.

Please find our point-by-point responses below, addressing each of your valuable comments.

Q1: Computational overhead and efficiency concerns.
A1: You raise a very valid and important concern regarding the computational overhead and practicality of our method. We wholeheartedly agree that diffusion models, by their nature, are more resource-intensive than traditional UDA techniques. We acknowledge that the cyclic bi-directional training, while crucial for maintaining semantic integrity without paired data, indeed introduces additional computational demands during training.
We understand the need for more transparency on efficiency. While a full breakdown of training times across all experiments is extensive, we will add a dedicated discussion in the revised manuscript to address these practical implications. We propose several strategies to mitigate computational costs for more efficient deployment:

  • Optimized Sampling Schedules: Utilizing advanced diffusion model sampling schedules (e.g., DDIM, DPM-Solver) can significantly reduce the number of inference steps required (the 'k' you mentioned), thereby speeding up inference time considerably without compromising quality; a brief sketch follows this list.

  • Model Size Optimization: Exploring smaller, more efficient diffusion model architectures tailored for specific histopathology tasks can reduce both training and inference resource requirements.
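
As an illustration of the first strategy, swapping the sampling schedule is a one-line change in common diffusion toolkits; the sketch below uses Hugging Face diffusers with illustrative step counts, not our training configuration:

```python
from diffusers import DDIMScheduler

# Replace a 1000-step ancestral (DDPM-style) schedule with 50 DDIM steps.
scheduler = DDIMScheduler(num_train_timesteps=1000)
scheduler.set_timesteps(num_inference_steps=50)
print(len(scheduler.timesteps))  # 50 denoising steps instead of 1000
```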

Enhancing training and inference efficiency will be a key focus of our future research, and we appreciate you highlighting this important challenge.

Q2: Performance on complex scenarios (multi-class, different staining protocols).
A2: Thank you for this insightful comment on the scope of our current evaluation. We agree that our initial experiments primarily focus on binary classification tasks and variations within H&E staining. This was a deliberate choice to establish the foundational effectiveness of SGCD in a controlled yet clinically relevant setting.
We acknowledge that extending SGCD to more complex scenarios, such as multi-class classification or adaptation between entirely different staining protocols (e.g., H&E to immunohistochemistry, or even different tissue types), presents unique challenges and opportunities. We believe our underlying dual-diffusion architecture and the concept of stain-guided consistency are fundamentally adaptable to these scenarios, potentially with modifications to the guidance term or network architecture. We will explicitly discuss this as a crucial and exciting direction for future research in the revised manuscript's "Limitations and Future Work" section. Exploring these more diverse and challenging domain shifts is indeed a high priority for us.

Q3: Ablation/comparison showing benefit of dual diffusion and SGC loss over one-diffusion approach; source of gains (cycle consistency vs. general diffusion).
A3: Thank you for this precise and excellent question. We completely agree that a component-wise ablation study is vital for understanding the individual contributions of our key architectural elements. We have now conducted a comprehensive ablation study to isolate the effects of the dual diffusion model, the stain-guided consistency (SGC) loss, and the fine-tuning (FT) of diffusion models. The results, using the Camelyon16 dataset, are summarized in the table below:

| Dual Diffusion Model | SGC Loss | FT of Diffusion Models | AUC |
| --- | --- | --- | --- |
| ✓ | ✓ | ✓ | 95.8 |
| – | ✓ | ✓ | 92.8 |
| ✓ | – | ✓ | 89.4 |
| – | – | – | 86.8 |

This experiment clearly demonstrates that the dual diffusion model significantly contributes to the method's performance (comparing 'V V V' to '- V V'), indicating that the cyclic nature and the bidirectional generative constraints are crucial. Furthermore, the introduction of the SGC loss provides a substantial boost to the proposed method's effectiveness (comparing 'V V V' to 'V - V'). The baseline without any of our proposed components ('- - -') shows significantly lower performance. These results underscore that the gains are primarily coming from the synergistic effect of both the cycle consistency (enabled by the dual diffusion architecture) and the targeted stain guidance provided by the SGC loss, rather than diffusion in general alone. We have incorporated this detailed ablation study and its discussion into the revised manuscript.

Q4: SGCD's handling of non-stain domain shifts (e.g., tissue structures, artifacts like blur/noise).
A4: Thank you for this insightful follow-up question, which helps us clarify the broader applicability of SGCD beyond pure stain variations. While our method is indeed "stain-guided" in its core loss design, the underlying diffusion process, particularly the bidirectional generative constraints, inherently provides a degree of robustness to other forms of domain shifts, including those related to tissue structures or common image artifacts.

To demonstrate this, we have conducted an additional experiment on the Camelyon16 dataset, where we simulated common image imperfections like blur and noise through data augmentation. The results, using our SGCD method, are as follows:

| Augmentation Method | No Adaptation | With SGCD |
| --- | --- | --- |
| Blur | 61.4 | 86.3 |
| Noise | 59.3 | 82.9 |
| Blur + Noise | 62.1 | 83.5 |

As shown in the table, our method significantly improves performance even when the input images are corrupted with blur and noise, suggesting its ability to effectively handle certain types of input degradation that are not purely stain differences. This aligns with similar conclusions found in related work, such as Gao et al. (2023), which explored diffusion-driven adaptation to test-time corruption. This demonstrates that our dual-diffusion framework, while optimized for stain, possesses a broader adaptability to other structural or artifactual variations commonly encountered in real-world medical imaging. We will include this discussion and the supplementary experiment in the revised manuscript.
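
For transparency, such blur and noise corruptions can be simulated with standard augmentation primitives; the sketch below (torchvision, with illustrative parameters rather than our exact settings) shows one way to build the pipeline for image tensors in [0, 1]:

```python
import torch
from torchvision import transforms

# Hypothetical corruption pipeline simulating scanner blur and sensor noise.
corrupt = transforms.Compose([
    transforms.GaussianBlur(kernel_size=5, sigma=(0.5, 2.0)),
    transforms.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0.0, 1.0)),
])
```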

Once again, we extend our sincerest gratitude for your invaluable feedback and time. We believe that the revisions made in response to your comments have substantially improved the clarity, completeness, and practical relevance of our manuscript. We sincerely hope that our detailed responses address your concerns, and we eagerly look forward to your further feedback.

Reference: Jin Gao, Jialing Zhang, Xihui Liu, Trevor Darrell, Evan Shelhamer, and Dequan Wang. Back to the source: Diffusion-driven adaptation to test-time corruption. In CVPR, 2023.

Comment

Thank you for the detailed response and for running the additional experiments on cross-stain domain shifts. These results alleviate some of my concerns. Nevertheless, the method remains computationally demanding and, at present, is limited to binary classification on a single stain, which limits its immediate clinical applicability. For these reasons, I will keep my original borderline-accept rating.

Final Decision

The paper proposes a new method, Stain-Guided CycleDiffusion (SGCD), to address domain adaptation challenges in histopathology image analysis. It leverages dual diffusion models with cyclic constraints and a stain-guided consistency loss to generate realistic, diagnostically useful images without requiring paired data. Experiments on multiple datasets show that SGCD outperforms existing approaches in accuracy and robustness, making it a promising direction for medical imaging where stain and preparation differences can hinder model generalization.

Reviewers praised the paper’s originality, technical soundness, and strong empirical performance. They highlighted its clear writing, innovative use of diffusion models, and demonstrated improvements across several benchmarks. Concerns centered on high computational cost, the method’s current focus on binary classification, and the need for more evidence of clinical relevance, particularly regarding rare disease subtypes and preservation of fine-grained pathological details. The authors responded with detailed rebuttals, additional ablation studies, new experiments on limited-data settings, and quantitative evidence showing that SGCD preserves diagnostic features.

The paper received positive final ratings from reviewers--two Accepts and two Borderline Accepts. Three of the four reviewers kept their original rating after rebuttal, citing remaining doubts about clinical validation and computational practicality; one reviewer raised their rating, acknowledging that the rebuttal satisfactorily addressed their concerns. Overall, the paper was judged to be technically solid, original, and impactful within its subfield. The area chair agrees with the reviewers' judgements, and recommends acceptance as a spotlight paper, recognizing it as the best submission within the set of papers assigned to this AC.