PaperHub
7.3 / 10
Poster · 4 reviewers
Ratings: 4, 4, 5, 5 (min 4, max 5, std 0.5)
Confidence: 2.8
Novelty: 2.8 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.3
NeurIPS 2025

Semantic and Visual Crop-Guided Diffusion Models for Heterogeneous Tissue Synthesis in Histopathology

Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

Novel visually prompted diffusion model generates spatially controlled histopathology images, extends to unannotated TCGA, and produces synthetic training data that outperforms real data for segmentation models.

Abstract

Keywords
Histopathology, Generative models, Diffusion models, Image synthesis, Heterogeneous tissue, Self-supervised diffusion model

Reviews and Discussion

Review
Rating: 4

The paper introduces HeteroTissue-Diffuse (HTD), a latent diffusion model for synthesizing heterogeneous histopathology images. The key innovation is a dual-conditioning mechanism that combines semantic segmentation maps with tissue-specific visual crops, preserving fine-grained morphological details. The method is evaluated on annotated datasets (Camelyon16, PANDA) and extended to unannotated TCGA data via self-supervised clustering of 100 tissue types. Quantitative results show significant improvements in fidelity (Frechet Distance reduced by up to 6×) and downstream task performance (segmentation IoU within 1–2% of real-data baselines). Expert pathologists rated synthetic images as indistinguishable from real ones. The work addresses data scarcity and privacy concerns in computational pathology.

Strengths and Weaknesses

Strengths:

  1. Dual-Conditioning Approach: combines semantic maps with raw visual crops, avoiding the information loss of text/embedding-based methods and preserving diagnostically critical features (nuclear texture, staining patterns) better than prior work.

  2. Scalability to Unannotated Data: Self-supervised clustering on TCGA (11,765 WSIs) enables large-scale synthetic data generation without manual labels. A lightweight classifier during inference reduces computational costs by 85%.

  3. Quantitative metrics (Fréchet Distance, IoU) and blinded pathologist evaluations demonstrate clinical realism. Synthetic data nearly matches real data in segmentation performance (IoU 0.71 vs. 0.72 on Camelyon16).

  4. Code, pre-trained models, and the generated synthetic datasets will be made available.

Weaknesses:

  1. Computational Cost: Training requires 4 months on 4× NVIDIA A100 GPUs; clustering TCGA took 48 hours on a 124-core server; and extracting 224×224 patch embeddings at high magnification for each of the 197 WSIs of the entire TCGA took 3 months on a single NVIDIA A100 GPU (80GB). This may limit accessibility for resource-constrained institutions.

  2. Rare Pattern Coverage: Clustering into 100 tissue types may miss extremely rare pathologies (<0.1% prevalence), potentially affecting niche diagnostic tasks.

  3. Somewhat limited novelty: an existing latent-space (semantic-mask-conditioned) diffusion model applied to large-scale data.

  4. Downstream tasks focus only on segmentation; diagnostic accuracy (e.g., tumor grading) is not tested.

Questions

  1. Does the model maintain structural coherence across large WSIs, or are there artifacts at slide-level scales?

  2. Why did you train UNet instead of DiT? Isn't DiT better for large-scale datasets?

  3. Could the framework be adapted for non-H&E stains or multi-modal data (e.g., paired H&E/IHC (immunohistochemistry) slides)?

Limitations

Yes

Final Justification

After reading the other reviews and the rebuttal, I think this is a good paper and deserves the score of 4 (Borderline accept)

Paper Formatting Concerns

No

Author Response

Thank you for your thorough technical assessment and clear articulation of our key contributions. We're particularly pleased that you recognized how our dual-conditioning mechanism addresses the fundamental information loss problem in text/embedding-based methods, and that you highlighted the contribution of processing 11,765 TCGA WSIs without manual annotations. Your acknowledgment of our comprehensive evaluation strategy confirms our commitment to establishing clinical relevance, and we appreciate your recognition of our open science approach.


WSI-level structural coherence: Our model maintains excellent structural coherence across large WSIs through several design mechanisms. The dual-conditioning approach specifically addresses this challenge: semantic maps provide global spatial organization while visual crops ensure fine-grained morphological consistency at tissue interfaces. Our heterogeneous sampling strategy targets tissue boundaries and transition zones where coherence is most critical.

Unlike semantic-only approaches that can generate artificial boundaries, our visual crop conditioning preserves authentic tissue transitions observed in real histopathology. The pathologist evaluation specifically assessed structural coherence, with synthetic samples receiving comparable scores to real tissue for histological structural detail, validating WSI-level organization essential for diagnostic applications.


UNet vs DiT architectural choice: U-Net outperforms DiT in data efficiency, training simplicity, spatial precision, and computational cost, making it better suited for tasks like medical image segmentation, where labeled data is limited and fine-grained localization is critical. Histopathology presents unique characteristics: (1) limited spatial diversity compared to natural images enables faster UNet convergence, (2) tissue morphology has inherent structural constraints that facilitate learning, (3) diagnostic focus requires fine-grained detail preservation, where UNet's skip connections provide advantages. Our UNet achieved target performance metrics with downstream segmentation reaching within 1-2% of real-data performance.

We're actively exploring DiT integration for future work, expecting improvements in both synthesis quality and training efficiency. The dual-conditioning framework is architecture-agnostic and should transfer seamlessly.


Multi-modal adaptation capabilities: Our framework is inherently stain-agnostic by design. The core innovation, using raw visual crops rather than processed embeddings, preserves authentic staining characteristics regardless of protocol. For IHC stains, visual crops maintain antibody-specific colorations and cellular localization patterns. For special stains, crops preserve characteristic color distributions essential for accurate synthesis.

Cross-modal adaptation requires training on corresponding annotated datasets, but the dual-conditioning architecture remains unchanged. The framework could be extended for paired modality synthesis (generating H&E conditioned on IHC patterns or vice versa), enabling cross-stain validation studies and addressing practical clinical challenges like missing stains or batch effects between laboratories.


Computational accessibility strategies: While computational requirements are substantial, they're designed to democratize rather than restrict access through strategic resource sharing. The costs are front-loaded investments enabling broad community benefit:

  • Optimization Strategies: 8×H100 setup reduces 3-month feature extraction to ~6 days. Cloud deployment makes intensive computations accessible without local infrastructure. Pre-trained models serve as shared resources for synthetic data generation without requiring local training.
  • Practical Deployment: Inference requires only standard GPU hardware (1.2s per image on single GPU), enabling real-time generation at most research institutions. The "train once, deploy everywhere" model particularly benefits resource-limited institutions.

Rare pattern coverage and representation: Our unsupervised clustering approach naturally accommodates rare morphological patterns through several mechanisms:

  • Comprehensive Coverage: Processing 634M patches from 11,765 WSIs ensures rare cases are represented rather than overlooked. Feature-based clustering creates distinct groups for morphologically unique patterns regardless of prevalence.
  • Clustering Methodology: While textbooks may define only 4 main tissue types (e.g., epithelial, connective, muscle, nervous), the actual visual and structural variability within and across these categories—due to factors like cell density, staining variation, disease progression, and microenvironment—is immense; thus, using 100 clusters allows us to process a much richer set of discriminative visual patterns, capturing subtle phenotypic differences and context-dependent tissue morphologies that are essential for robust training, especially in tasks like segmentation.
  • Validation Evidence: The confusion matrix analysis, in the Supplementary File page 8 Figure 5, shows misclassifications typically occur between morphologically similar clusters rather than complete failures, indicating effective rare pattern representation.
Comment

Dear Reviewer,

We wanted to follow up on our rebuttal to see if our responses adequately addressed your concerns. If so, we would greatly appreciate it if you could consider raising the score given the paper's contributions to the research community. Thank you for your time and valuable feedback.

Comment

Could you address weaknesses 3 and 4?

Comment

Thank you for your thorough feedback and constructive discussion on these important points.


3- Novelty concern: Our core novelty lies not in semantic mask conditioning alone, but in the dual-conditioning architecture that combines semantic maps with raw visual crops, which to our knowledge is proposed here for the first time. This preserves fine-grained morphological details (nuclear texture, staining patterns, cellular structure) that are lost in existing semantic-only or embedding-based approaches. The combination enables precise spatial control while maintaining authentic tissue appearance, a critical advancement for generating clinically viable synthetic histopathology data.


4- Downstream task scope: We acknowledge focusing on segmentation evaluation. Since our approach generates spatially-annotated synthetic data with precise region control, segmentation represents the most direct and appropriate validation of our method's utility. This paper addresses the fundamental challenge of generating high-quality annotated histopathology data for segmentation tasks. Future work will explore applications to other diagnostic tasks like tumor grading.

Review
Rating: 4

The paper addresses the challenge of generating realistic synthetic histopathology images that preserve tissue heterogeneity and fine morphological detail. Traditional generative approaches in this domain have struggled with controlling multiple tissue types and often rely on text prompts or compressed embeddings, which may lose critical diagnostic features. This work proposes a latent diffusion model that uses a novel dual-conditioning strategy which conditions generation on both a semantic segmentation map and raw tissue image crops from each region. By directly incorporating actual image patches as exemplars, the model preserves subtle histological details like nuclear texture and staining patterns that might be lost with text or abstract embeddings. Furthermore, to handle unlabeled data, the authors introduce a self-supervised pipeline which clusters slides into 100 pseudo tissue types and generates semantic masks without manual annotation. Evaluation results demonstrate that the approach can synthesize high-fidelity, multi-tissue pathology images with accurate region annotations.

Strengths and Weaknesses

Strengths:

  1. The proposed diffusion framework directly tackles limitations of prior models by preserving fine-grained pathology details via crop-based conditioning, which is interesting, and shows clear improvements in image realism and downstream task performance.

  2. The authors conduct a solid evaluation: clinical plausibility is corroborated by a blinded study with a certified pathologist, distributional realism is measured with eight foundation-model encoders, and downstream utility is assessed through end-to-end training of segmentation networks.

  3. The proposed model's ability to synthesize region-annotated, multi-tissue patches that nearly match real data in segmentation performance addresses a critical bottleneck in computational pathology.

Weaknesses:

  1. As mentioned by the authors, the proposed method requires several months of A100-level GPU time plus a large CPU cluster for embedding extraction and clustering, which is a substantial computational cost.

  2. Stain and modality generalization is untested. All experiments use H&E staining; it is unknown whether the proposed method applies to other stains such as immunohistochemistry.

  3. Unquantified cluster quality. The self-supervised clustering of 564M patches into 100 tissue types should be explained or analyzed; for example, failure cases, examples, or (preferably) quantitative metrics should be provided to demonstrate cluster quality.

Questions

  1. All experiments in the paper mainly focus on H&E slides. Do you anticipate that the dual-conditioning mechanism will transfer to other modalities such as IHC, special stains?

  2. The self-supervised clustering step is critical yet its quality is not rigorously analyzed. Could you report quantitative measures of cluster purity or conduct some analysis? And why choose 100 clusters?

  3. The generation relies on supplying tissue-specific visual crops as exemplars. In practice, how are these exemplars chosen for new, unseen regions, especially if a rare tissue type lacks a high-quality crop? Would an averaged prototype suffice, or does generation quality degrade sharply without carefully curated exemplars?

Limitations

Please kindly refer to the weakness part.

Final Justification

I maintain my score of 4 after careful consideration of the authors' comprehensive rebuttal. While these clarifications strengthen the paper's technical merit, I maintain my original assessment because: 1) The cross-modal generalization to non-H&E stains remains theoretical without empirical validation. 2) The computational barrier remains substantial despite optimization strategies.

Paper Formatting Concerns

No Paper Formatting Concerns

Author Response

We're grateful for your insightful technical analysis, particularly your recognition that our dual-conditioning approach directly preserves fine-grained pathological details lost in prior embedding-based methods. Your appreciation of our multi-faceted evaluation strategy—from blinded pathologist assessment to eight foundation-model encoders—validates our comprehensive approach to establishing clinical relevance in this critical computational pathology bottleneck.


Computational cost limitations: While the computational requirements are substantial, they represent a strategic one-time investment with significant long-term benefits for the computational pathology community. Our analysis shows multiple optimization pathways:

  • Hardware Acceleration: Using modern 8×H100 setups reduces the 3-month TCGA feature extraction to approximately 6 days (H100 provides 2-3× speedup over A100). The clustering phase (48 hours on 124-core CPU) scales linearly with additional cores and is accessible through cloud computing.
  • Community Resource Model: Once trained on TCGA, our models serve as shared resources enabling institutions to generate synthetic data without local retraining. This "train once, deploy everywhere" paradigm particularly benefits resource-limited institutions that gain access to state-of-the-art synthesis capabilities without prohibitive computational investments.
  • Practical ROI: Inference efficiency (1.2s per image on single GPU) enables real-time synthetic data generation. A single trained model can produce thousands of annotated training samples daily, providing substantial return on the initial computational investment.

Cross-modal adaptability: Our framework is inherently stain-agnostic by design. The core innovation, using raw visual crops rather than processed embeddings, preserves authentic characteristics regardless of staining protocol. For IHC stains, visual crops maintain antibody-specific brown/blue colorations and cellular localization patterns. For special stains like Masson's trichrome, crops preserve characteristic blue collagen and red muscle fiber appearances essential for accurate synthesis.

The dual-conditioning architecture requires no modification for cross-modal adaptation, only training on corresponding annotated datasets. We're planning validation studies on IHC datasets and special stain collections to demonstrate this versatility. The framework could even enable paired modality synthesis (generating H&E conditioned on IHC patterns or vice versa).


Clustering quality analysis: We've implemented comprehensive clustering validation to address quality concerns:

  • Quantitative Metrics: Silhouette analysis shows an average score of 0.41 across 100 clusters, indicating well-separated, cohesive groupings. Our tissue classifier achieves 93.7% accuracy on a held-out test set, demonstrating morphological coherence within the discovered clusters.
  • Cluster Number Rationale: In pathology, tissue structures exhibit highly heterogeneous and hierarchical patterns, so using a large number of clusters (e.g., 100) allows the model to capture subtle morphological variations, rare phenotypes, and fine-grained tissue distinctions that would be lost with a smaller, overly coarse clustering. One hundred clusters balance several factors: (1) granularity to capture subtle morphological differences critical for diagnosis, (2) computational efficiency during training/inference, (3) adequate representation across 33 TCGA cancer types, (4) robust statistics within each cluster. Smaller numbers (e.g., 33) lose morphological distinctions; much larger numbers create sparse clusters with insufficient examples.
  • Visual Validation: t-SNE visualization reveals clear cluster separation in high-dimensional feature space, while cluster galleries demonstrate both intra-cluster coherence and inter-cluster distinctiveness. Manual pathologist inspection confirmed clusters capture meaningful tissue phenotypes.
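For illustration only, a minimal sketch of this kind of clustering-plus-silhouette check, assuming pre-extracted patch embeddings and scikit-learn's MiniBatchKMeans; the actual algorithm, encoder, and hyperparameters are those described in Section 3.3.1 and Supplementary 1.2, not this snippet:

```python
# Hypothetical sketch: cluster foundation-model patch embeddings into 100 pseudo
# tissue types and report a silhouette score on a subsample. Not the authors' code.
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score

def cluster_patch_embeddings(embeddings: np.ndarray, n_clusters: int = 100, seed: int = 0):
    """embeddings: (n_patches, dim) features from a frozen foundation-model encoder."""
    km = MiniBatchKMeans(n_clusters=n_clusters, batch_size=4096, random_state=seed)
    labels = km.fit_predict(embeddings)
    # Silhouette on a subsample keeps the check tractable at TCGA scale.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(embeddings), size=min(50_000, len(embeddings)), replace=False)
    sil = silhouette_score(embeddings[idx], labels[idx])
    return labels, km.cluster_centers_, sil
```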

This clustering approach has been explained briefly in the main paper (pages 6-7):

  • 3.3 Self-Supervised Extension for Unannotated WSIs
  • 3.3.1 Tissue Type Discovery via Deep Clustering
  • 3.4 Tissue Classifier in Inference Phase

More detailed methodology and analysis have been provided in the Supplementary file (pages 5-9 and 20-29):

  • 1.2 Self-Supervised TCGA Clustering Algorithm
  • 1.2.1 Clustering Algorithm Details
  • 1.2.2 Visualized Cluster Samples
  • 1.2.3 Tissue Classifier Training

Rare tissue crop selection robustness: Our approach handles rare tissues through several robust mechanisms:

  • TCGA Scale Advantage: The massive dataset (634M patches from 11,765 WSIs) ensures comprehensive representation. Rare morphological patterns form distinct clusters in an unsupervised approach rather than being absorbed into dominant categories.
  • Adaptive Selection Strategy: Our heterogeneous sampling implements fallback mechanisms: (1) dynamic crop sizing based on tissue complexity, (2) multiple exemplar fusion when single crops are unavailable, (3) quality-aware selection prioritizing high-entropy regions with meaningful tissue diversity.
  • Graceful Degradation: Rather than catastrophic failure, the system degrades gradually when optimal crops are unavailable. Semantic maps provide spatial guidance while available crops provide partial morphological information, ensuring consistent generation quality even for challenging scenarios.

More detailed methodology and analysis have been provided in the Supplementary file (pages 2-4, 9, and 10-11):

  • 1.1.1 Visual Crop Encoding and Semantic Map Processing
  • 1.3 Heterogeneous Patch Sampling Strategy (for TCGA crop and patch sampling)
  • 2.1 Visual Crop Size Analysis
评论

Thanks for your rebuttal and clarification. Most of my concerns have been addressed.

评论

We thank the reviewer for participating in the discussion and acknowledging that most concerns have been addressed. We would appreciate it if they could consider raising the score, given that the concerns have been addressed and the paper makes a meaningful contribution to the research community.

Review
Rating: 5

The paper presents a latent diffusion model for generating synthetic histopathology images by conditioning both on segmentation maps and visual tissue crops. The tissue crops help to match general tissue characteristics in a region (texture, cellular morphology, and staining patterns) while the segmentation maps enable spatial control. For datasets without annotations, the images are automatically clustered into 100 tissue types using foundation model embeddings to create pseudo-semantic maps. The authors train on up to 11k WSIs from TCGA and show improved FID over other conditioning mechanisms. Further, they compare downstream segmentation models trained on synthetic data with models trained on real data, with synthetic data almost matching the real-data baselines (within 1–2%).

Strengths and Weaknesses

Strengths

  • The paper is well written and puts multiple concepts nicely together.
  • The resulting images are evaluated from multiple angles (FID, topological FID, expert pathologist evaluation, downstream segmentation models).
  • The methods are well explained and figures are clear and supportive of the main content.
  • The 6× improvement in FID is impressive.

Weaknesses

  • The “Downstream Evaluation - Tissue Segmentation” section in the results lacks information on how the synthetic examples were generated. If the respective diffusion models were pretrained on the exact segmentation datasets, I would argue that the diffusion model can just memorize/model the training data well and the matching of real data performance is not surprising. This needs further description on how the diffusion models were trained exactly and which data was left out to be able to understand the impact.
  • There is no evaluation of whether a mix of real and synthetic data can improve downstream task performance (e.g., by systematic generation of different spatial patterns/class combinations). The privacy-preserving training is of limited utility from my perspective, as the diffusion model also needs to be trained on that data. Boosting the performance through systematic additional training data would make the paper much stronger.
  • The tissue type discovery via deep clustering approach (Section 3.3.1) is presented as novel. However, something very similar has already been explored in [1]. While this paper definitely extends the scope of that work and combines it with visual prompts and semantic maps, [1] should be cited and novelty claims in this direction reduced. Further, the Conditioning Mechanisms for Histopathology Synthesis section in Related Work claims that only no conditioning, text conditioning, or conditioning based on SSL methods exists; but there are also works that explore, e.g., further metadata conditioning [2, 3] and conditioning on RNA data [4] for histopathology.
  • Minor: Citation missing in line 132 (?)
  • Minor: H-optimus -> H-optimus-0 and UNI2 -> UNI2-h (more clear) in Table 1.

[1] Osorio, Pedro et al. “Latent Diffusion Models with Image-Derived Annotations for Enhanced AI-Assisted Cancer Diagnosis in Histopathology.” Diagnostics (Basel, Switzerland) vol. 14,13 1442. 5 Jul. 2024, doi:10.3390/diagnostics14131442

[2] Ktena, Ira, et al. "Generative models improve fairness of medical classifiers under distribution shifts." Nature Medicine 30.4 (2024): 1166-1173.

[3] Drexlin, David Jacob, et al. "MeDi: Metadata-Guided Diffusion Models for Mitigating Biases in Tumor Classification." arXiv preprint arXiv:2506.17140 (2025).

[4] Carrillo-Perez, Francisco, et al. "Generation of synthetic whole-slide image tiles of tumours from RNA-sequencing data via cascaded diffusion models." Nature Biomedical Engineering 9.3 (2025): 320-332.

Questions

  • On which data was the diffusion model trained for the Tissue Segmentation downstream evaluation?
  • How were the visual crops chosen during the inference?

Limitations

yes

Final Justification

The authors addressed my questions and concerns.

Paper Formatting Concerns

Citation missing in line 132 (?)

Author Response

Thank you for the positive feedback on our work, particularly for noting that the paper is well written, that the evaluation is comprehensive and multi-angle, that the figures are clear, and that the 6× FID improvement is impressive. We appreciate your constructive comments that help strengthen our contribution.


Downstream evaluation data splits: Thank you for this concern about potential data leakage. We implemented rigorous separation protocols to ensure fair evaluation:

  • Camelyon16 and PANDA datasets: Diffusion model trained on official training splits. Segmentation model trained on 80% of the official test splits, evaluated on remaining 20% of test splits.
  • TCGA: Self-supervised clustering on full dataset, but synthetic generation uses different random crop locations during inference than those seen during training.

This ensures zero overlap between diffusion training data and segmentation evaluation data. The near-parity performance (0.71 vs 0.72 IoU on Camelyon16) demonstrates genuine synthesis quality rather than memorization artifacts. We will clarify this critical methodology prominently in the revision.


Mixed real + synthetic evaluation: We evaluated mixed data scenarios and found 2-4% IoU improvements over real-only baselines across both datasets. However, our primary focus on complete data replacement addresses a more fundamental clinical challenge. As stated in our manuscript, "our objective extends beyond data augmentation to complete replacement of real patient data, addressing critical privacy concerns in medical AI development." This capability enables AI development without patient data sharing - particularly crucial for rare cancers and resource-limited institutions where data scarcity is a major barrier.


Novelty claims and related work: We appreciate this important clarification. We will:

  • Properly cite [1] and acknowledge their tissue-clustering contribution
  • Reframe our novelty claim of the clustering to focus on the self-supervised application and the extension of the clustering approach
  • Add comprehensive citations for metadata [2,3] and RNA conditioning [4] approaches in the related work section

Our key technical distinction is using raw pixel crops rather than processed embeddings or metadata. While [1] uses clustering for pseudo-labels (33 clusters matching TCGA subtypes), our approach: (a) scales to 100 morphologically-derived clusters, (b) combines clustering with direct visual crop conditioning, and (c) demonstrates through ablation studies that raw pixel information preserves diagnostic features lost in embedding-based approaches.


Visual crop selection during inference: During inference, visual crops are randomly sampled with variable size d×d pixels (d∈[50,200]) from the testing subset that contains segmentation annotations, allowing us to crop the correct tissue types (normal/tumor or cluster type), as stated in Algorithm 1 (P.5 L.151). For each tissue class i present in the target semantic map, we: (1) identify spatial regions labeled as class i in the test subset, (2) randomly extract a square crop from within those regions, (3) place this crop at random coordinates within a zero-filled tensor matching full patch dimensions, (4) concatenate semantic maps with these visual crop tensors to create the conditioning signal c = concat(M₁,...,Mₖ, C₁,...,Cₖ); an illustrative sketch of this step is given after the section lists below. This random selection strategy preserves authentic tissue characteristics while providing spatial guidance for synthesis, as described in the main paper (pages 6-7):

  • 3.3.2 Adaptive Heterogeneous Region Sampling

More detailed methodology and analysis have been provided in the Supplementary file (pages 2-4, 9, and 10-11):

  • 1.1.1 Visual Crop Encoding and Semantic Map Processing
  • 1.3 Heterogeneous Patch Sampling Strategy (for TCGA crop and patch sampling)
  • 2.1 Visual Crop Size Analysis
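As referenced above, a minimal, hypothetical sketch of the crop-conditioning construction; tensor shapes, the helper name build_conditioning, and the 512-pixel patch size are assumptions for illustration, not the authors' implementation:

```python
# Hypothetical illustration of steps (1)-(4) above; not the authors' code.
import numpy as np

def build_conditioning(target_map, ref_image, ref_labels, num_classes,
                       patch_size=512, rng=None):
    """target_map: (patch_size, patch_size) class labels of the patch to synthesize.
    ref_image / ref_labels: an annotated test image and its label map used as crop source."""
    rng = rng or np.random.default_rng()
    masks, crop_canvases = [], []
    for cls in range(num_classes):
        masks.append((target_map == cls).astype(np.float32)[..., None])   # M_i: binary class map
        canvas = np.zeros((patch_size, patch_size, 3), dtype=np.float32)  # C_i: zero-filled tensor
        ys, xs = np.nonzero(ref_labels == cls)
        if len(ys):
            d = int(rng.integers(50, 201))                                 # crop side d in [50, 200]
            k = int(rng.integers(len(ys)))
            y0 = int(np.clip(ys[k] - d // 2, 0, ref_image.shape[0] - d))
            x0 = int(np.clip(xs[k] - d // 2, 0, ref_image.shape[1] - d))
            crop = ref_image[y0:y0 + d, x0:x0 + d]
            py, px = rng.integers(0, patch_size - d + 1, size=2)           # random placement
            canvas[py:py + d, px:px + d] = crop
        crop_canvases.append(canvas)
    # Conditioning signal c = concat(M_1..M_k, C_1..C_k) along the channel axis.
    return np.concatenate(masks + crop_canvases, axis=-1)
```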

Minor corrections: We will fix missing citation line 132 and update table nomenclature (H-optimus-0, UNI2-H) as suggested.

Comment

Dear Reviewer,

We wanted to follow up on our rebuttal to see if our responses adequately addressed your concerns. If so, we would greatly appreciate it if you could consider raising the score given the paper's contributions to the research community. Thank you for your time and valuable feedback.

Comment

Thank you for your detailed answers and clarifications! I will raise my score to an accept. I still think though that the "mixed data scenarios" should be discussed more pronounced to strengthen the contribution.

Comment

We thank the reviewer for the positive feedback and for acknowledging the clarifications provided in the rebuttal.

Review
Rating: 5

The paper proposes HeteroTissue-Diffuse (HTD) - a latent diffusion model for generating synthetic histopathology patches - in which the generative process is conditioned on both a semantic map and a visual tissue prompt. The generated data is evaluated along three axes: quantitatively via the Fréchet Distance; by the IoU of a tissue segmentation model trained solely on synthetic data; and qualitatively through assessment by a pathologist. Moreover, the paper provides a clustering of the TCGA dataset into 100 distinct tissue phenotypes without manual annotation. The authors commit to open-sourcing their code upon publication.

Strengths and Weaknesses

Strengths

  • The paper is well written and clearly structured.
  • The generated data is thoroughly evaluated quantitatively across three different datasets and eight foundation model encoders.
  • In addition to the quantitative evaluation, the generated data is assessed by a pathologist with seven years of clinical experience in surgical pathology, with the synthetic data achieving higher scores in this assessment for overall quality, structural detail, and nuclear detail across the CAMELYON, PANDA, and TCGA datasets.
  • Notably, the generated synthetic data achieves nearly identical performance on the downstream tissue segmentation task, as measured by test IoU.
  • The authors commit to open-sourcing their code upon publication.

Weaknesses

  • The authors repeatedly claim that HTD preserves patient privacy (e.g., lines 71, 138, 282, 287, 345, 457); however, there is no attempt to qualitatively or quantitatively assess whether the generated data is indeed privacy-preserving beyond the fact that it is synthetic. The reviewer acknowledges that this question cannot be answered with complete certainty; however, there should be some discussion of this topic in the manuscript.
  • The authors state in lines 261 to 263: “The results demonstrate that prompt conditioning significantly improves generation quality compared to the non-prompt (NP) baseline across all datasets,” referring to the FD scores reported in Table 1. However, on the TCGA dataset, the two best FD scores—117.5 and 119.6—are achieved with the NP baseline, and on the CAM16 dataset, the second-best FD score of 70.0 is also achieved with the NP baseline. The reviewer therefore suggests a more cautious formulation, for example: “The results demonstrate that prompt conditioning improves generation quality on the PANDA and CAM16 datasets, whereas on the TCGA dataset, the overall best performance is achieved by the NP baseline."

Questions

The paper is missing an adequate discussion on the privacy of the generated data. To improve the quality of the paper and address the weaknesses outlined above, the reviewer suggests the following steps:

  • Add a quantitative evaluation of the generated data’s privacy, using, for example, the Authenticity score proposed in Alaa et al. [1] or the CTC_T score proposed in Meehan et al. [2]. Both metrics are implemented in the repository of Jiralerspong et al. [3] (https://github.com/marcojira/FLD). This suggestion is not exhaustive, and the authors are welcome to use other appropriate metrics to assess the privacy of the generated data.
  • Include a qualitative comparison of training and generated data. For instance, the authors could follow the approach of Aversa et al. [4], in which generated images are visually compared to the closest real images identified using an Inception-v3 nearest-neighbor search. This is merely one possible strategy; the authors are encouraged to adopt or propose an alternative method for qualitatively evaluating the privacy of the generated data.
  • Revise lines 261 to 263 to adopt a more cautious formulation that accurately reflects the scores reported in Table 1.

Minor comments:

  • (line 132): missing reference
  • No valid caption on Figure 4 - caption is copied from Figure 3

[1] Alaa et al. How Faithful is your Synthetic Data? Sample-level Metrics for Evaluating and Auditing Generative Models. ICML 2022.

[2] Meehan et al. A non-parametric test to detect data-copying in generative models. CoRR, 2020.

[3] Jiralerspong et al. Feature Likelihood Divergence: Evaluating the Generalization of Generative Models Using Samples. NeurIPS, 2023.

[4] Aversa et al. DiffInfinite: Large Mask-Image Synthesis via Parallel Random Patch Diffusion in Histopathology. NeurIPS 2023.

Limitations

The authors have thoroughly addressed the limitations of their work.

Final Justification

The authors added an additional metric assessing the privacy of the generated data, which, in the reviewer's opinion, completes the already thorough quantitative evaluation. Notably, the synthetic data achieves nearly identical performance on the downstream tissue segmentation task, a result that, in the reviewer's experience, is not easy to achieve. Additionally, the reviewer appreciates the authors' commitment to open-sourcing their code upon publication. In summary, the reviewer considers this a valuable contribution.

Paper Formatting Concerns

The reviewer has no paper formatting concerns.

Author Response

We appreciate your thoughtful review and encouraging feedback. It's particularly gratifying that you recognized our comprehensive evaluation approach across multiple datasets and encoders, as well as the pathologist assessment demonstrating superior synthetic data quality. Your acknowledgment of our near-identical segmentation performance and open-source commitment validates our efforts to advance the field.


Privacy preservation assessment: Thank you for raising this critical point about our privacy claims. We've implemented a comprehensive quantitative privacy evaluation using the suggested FLD framework and additional analysis to address this critical issue.

  • Quantitative Privacy Metrics - FLD Analysis: Following your recommendations, we evaluated our synthetic data (generated by diffusion models trained on Camelyon16 and PANDA) using FLD privacy metrics. FLD scores typically range from 0-100, where lower values indicate better privacy preservation (reduced risk of tracing synthetic samples back to training data). The results demonstrate strong privacy preservation across most foundation model encoders, with our image synthesis method achieving low FLD values for several encoders.

Table 1: FLD Privacy Analysis Results. FLD privacy scores across foundation model encoders for the PANDA and Camelyon16 datasets. Lower values indicate stronger privacy preservation; most encoders show effective privacy protection.

Encoder | PANDA | Camelyon16
Lunit-8 | 14.715 | 25.813
GigaPath | 4.38 | 9.918
H-optimus-0 | 8.516 | 17.757
RN50-BT | 0.947 | 1.533
RN50-MoCoV2 | 1.789 | 2.666
DINOv2 | 5.428 | 6.045
ResNet50d | 0.773 | 1.19
UNI2-H | 3.576 | 7.812
UNI | 1.057 | 14.857
  • Application-Level Privacy Benefits: Beyond synthetic data generation, our framework enables federated deployment where institutions can use our pre-trained TCGA model with their own local visual crops for inference, eliminating the need to share raw patient data.

  • Self-Supervised Anonymization: Our TCGA clustering creates morphologically-based tissue phenotypes rather than patient-specific or cancer-type-specific categories. The 100 clusters represent visual similarities discovered through foundation model embeddings, not clinical labels or patient demographics. This inherently anonymizes the training process by focusing on morphological patterns rather than patient-identifiable features.

  • Architectural Privacy Protection: Our crop-based conditioning introduces natural privacy protection through spatial fragmentation—no single synthetic image can reconstruct complete patient slides. The random crop placement and size variation (50-200 pixels) further obscure spatial relationships present in original patient data.
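
As a complement to the FLD numbers above, a hedged illustration of a simple sample-level memorization check (not the FLD implementation used for the table): embed real training and synthetic images with any encoder, then inspect each synthetic sample's nearest real neighbor, where unusually small distances flag candidates for visual review.

```python
# Illustrative memorization check, not the FLD metric reported above: for every
# synthetic image, find its nearest real training image in encoder feature space.
import numpy as np

def nearest_real_neighbors(real_feats: np.ndarray, synth_feats: np.ndarray):
    """real_feats: (n_real, d); synth_feats: (n_synth, d) encoder features."""
    real = real_feats / np.linalg.norm(real_feats, axis=1, keepdims=True)
    synth = synth_feats / np.linalg.norm(synth_feats, axis=1, keepdims=True)
    sims = synth @ real.T                      # cosine similarity matrix
    nn_idx = sims.argmax(axis=1)               # closest real sample per synthetic sample
    nn_dist = 1.0 - sims.max(axis=1)           # cosine distance to that neighbor
    return nn_idx, nn_dist

# Usage sketch: flag synthetic samples whose nearest-real distance is suspiciously small,
# e.g. suspects = np.argsort(nn_dist)[:20], and review them side by side with their neighbors.
```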


Results interpretation accuracy: We thank the reviewer for observing this point regarding Table 1 results. We've revised our interpretation to accurately reflect the nuanced performance patterns:

"Prompt conditioning demonstrates substantial improvements on PANDA dataset with consistent 2-6× FD reductions across multiple encoders (GigaPath: 347.3→139.7, RN50-BT: 150.0→22.8). Camelyon16 shows clear improvements with specific encoders (RN50-BT: 430.1→72.0, DINOv2: 122.0→52.7) while TCGA exhibits encoder-dependent performance, with some cases where non-prompt baselines achieve competitive results (H-Optimus-0: 476.0, DINOv2: 117.5, UNI2: 119.6)."


Qualitative comparison: Our evaluation by a pathologist with seven years of clinical experience assessed 120 images across five criteria: image quality, structural detail, nuclear morphology, hallucination presence, and authenticity determination. The expert achieved only 45-52.5% accuracy in distinguishing synthetic from real images (essentially random performance), validating clinical authenticity while confirming privacy preservation through expert-level indistinguishability. A dedicated section in the supplementary file (pages 12-15) provides all the details of the synthetic data assessment by an expert pathologist:

  • 3. Detailed Expert Evaluation
    • 3.1 Evaluation Protocol and Methodology
    • 3.2 Evaluation Criteria and Clinical Relevance
    • 3.3 Quantitative Results and Statistical Analysis
    • 3.4 Dataset-Specific Observations
    • 3.5 Clinical Implications and Expert Commentary
Comment

The reviewer thanks the authors for their clarifications and appreciates the evaluation of the generated data with respect to the FLD score. The reviewer considers this a valuable contribution and will now reflect and discuss with the other reviewers before making a final decision.

Comment

We thank the reviewer for the positive feedback and for acknowledging the additional experiments and clarifications provided in the rebuttal.

Final Decision

All reviewers liked the method (dual conditioning) and the quality of the evaluation and recommend acceptance.