PaperHub
Overall score: 7.3/10 · Poster · 4 reviewers
Ratings: 4, 5, 5, 4 (min 4, max 5, std 0.5)
Confidence: 4.0
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.5 · Significance: 2.8
NeurIPS 2025

MAPLE: Multi-scale Attribute-enhanced Prompt Learning for Few-shot Whole Slide Image Classification

OpenReview | PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

We propose Multi-scale Attribute-enhanced Prompt Learning (MAPLE), a hierarchical framework for few-shot WSI classification that jointly integrates multi-scale visual semantics and performs prediction at both the entity and slide levels.

Abstract

Keywords
multiple instance learning, whole slide image classification, prompt learning, vision-language model, few-shot learning

Reviews and Discussion

Official Review
Rating: 4

This work leverages large language models (LLMs) to generate both entity-level and slide-level prompts, integrating fine-grained and coarse-grained visual and textual information for few-shot whole-slide image classification. The proposed MAPLE method consistently outperforms existing SOTA models in this field. The integration of entity-level knowledge is promising to boost few-shot classification performance, especially in the era of LLMs.

Strengths and Weaknesses

Strengths:

  1. This work introduces entity-level descriptions generated from LLMs to provide fine-grained information for visual-textual alignment.
  2. The performance of the proposed MAPLE model is promising, surpassing existing SOTAs.
  3. Experiments are relatively solid, covering three cancer cohorts from the TCGA dataset.

Weaknesses:

  1. Visuals in this work (for example, Figure 1) are not precise. For instance, the third paragraph of the Introduction states that “TOP [38] introduces instance-level phenotypic prompts to guide patch aggregation into slide-level features, while ViLa-MIL [43] leverages learnable visual prototypes to guide the fusion process of patch features and considers dual-scale visual descriptive text prompt to boost the performance.”; however, Figure 1(a) only illustrates the slide-level similarity computation process, which is too simplified and does not correspond to the text. What’s more, the last paragraph of the Introduction mentions “MAPLE … across different scales, as illustrated in Fig. 1”; however, Fig. 1 clearly does not contain multi-scale visualization.
  2. Similar concerns apply to the text as well. For example, the second paragraph of Section 2.2 describes the learning process of few-shot WSI analysis. However, it is far too simplified and only mentions the slide-level similarity computation for logit generation. Most existing works compute the logits with more than this, such as ViLa-MIL [1] and FOCUS [2], which also apply patch-level similarity computation.
  3. Lack of certain ablation studies. It is mentioned that PLIP is used as the feature extractor for the visual branch of MAPLE, however, as mentioned in FOCUS [2] (Figure 2 Ablation Study on Foundation Models), PLIP typically performs the worst in few-shot WSI classification. The choice of using PLIP is not well explained. Please refer to the Questions section for more related details.

[1] Shi, Jiangbo, et al. "ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[2] Guo, Zhengrui, et al. "Focus: Knowledge-enhanced adaptive visual compression for few-shot whole slide image classification." Proceedings of the Computer Vision and Pattern Recognition Conference. 2025.

Questions

  1. Reliability of using LLMs to generate knowledge-required descriptions for entities. Although it’s common practice to utilize LLMs for prompt or diagnosis description generation, it would be better to demonstrate the reliability of the generated prompts. For example, do they accurately describe the slides’ morphology of a certain cancer? I see that Table 6 provides an ablation of the impact of using different LLMs for prompt generation. One thing I’d like to discuss further is that existing works mostly conduct this kind of ablation, yet sometimes the results are not consistent due to different environments or settings. For example, in FOCUS, they show that Claude demonstrates the strongest performance for prompt generation. This kind of difference leads to difficulty in choosing appropriate LLMs in clinical scenarios. Could the authors share some insights regarding this problem?
  2. As mentioned in the Weaknesses part, could the authors provide more explanation regarding the choice of using PLIP as the feature extractor (although there is a comparison between PLIP and CLIP, CLIP achieves even worse performance)? There have been many powerful pathology foundation models (or feature extractors) in recent years, such as UNI [1], GPFM [2], and Virchow [3]. A major concern here is that existing models might use different feature encoders to extract features, for example, FOCUS uses CONCH [4] as the visual encoder. Thus, this lack of sufficient ablation study on feature encoders could lead to unfair comparisons.
  3. Meanwhile, from Table 1, it can be seen that the results are relatively low compared to those reported in other works. Taking FOCUS as an example again, the 16-shot classification results on TCGA-NSCLC are around 0.95 AUC for several models, yet in this work the same task under the same few-shot setting only achieves around 0.9 AUC. I believe this also stems from the choice of feature extractor, i.e., PLIP.
  4. Could the authors explain more about the visualization details in section 4.3? Are the patches shown for each entity selected from the ones with the highest attention score?
  5. An additional question is that, if the entity can be precisely identified, then what’s the point of using the global description and WSI-level representation for further integration and analysis?

[1] Chen, Richard J., et al. "Towards a general-purpose foundation model for computational pathology." Nature Medicine 30.3 (2024): 850-862.

[2] Ma, Jiabo, et al. "Towards a generalizable pathology foundation model via unified knowledge distillation." arXiv preprint arXiv:2407.18449 (2024).

[3] Vorontsov, Eugene, et al. "A foundation model for clinical-grade computational pathology and rare cancers detection." Nature medicine 30.10 (2024): 2924-2935.

[4] Lu, Ming Y., et al. "A visual-language foundation model for computational pathology." Nature Medicine 30.3 (2024): 863-874.

Limitations

N/A

Final Justification

After carefully reviewing the authors' response and other reviewers' comments, I believe the revised manuscript after rebuttal would satisfy NeurIPS's standards.

Formatting Issues

Putting the Related Works section in the Appendix does not seem appropriate.

Author Response

We thank the reviewer for the comments and for the time spent reviewing our paper. We address the weaknesses (W) and questions (Q) as follows:


[W1] Clarification on Figure 1.

We thank the reviewer for pointing this out. Figure 1 is intended to emphasize the key difference in alignment strategy between our MAPLE and existing studies (e.g., TOP and ViLa-MIL): the existing methods only perform slide-level alignment (shown in Fig. 1(a)), while our method considers additional entity-level features and incorporates subtype-specific phenotypic attributes for more interpretable and precise alignment (shown in Fig. 1(b)). Since multi-scale modeling also appears in methods like ViLa-MIL, we do not explicitly distinguish this in the figure, but note it in the caption. We will further revise the figure and related text in the final version to avoid confusion.

[W2] Concerns on texts in Section 2.2.

We acknowledge that different methods may adopt varying strategies to compute logits, often involving additional architectural components and complex aggregation strategies. However, the purpose of Section 2.2 is to provide a concise and coherent overview of the general paradigm, rather than to detail the full implementation of all existing approaches. Therefore, we believe that our formulation captures the essential inference process commonly adopted in prompt-based few-shot WSI classification. We would be happy to further clarify any part of this formulation if needed.
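
For concreteness, a minimal sketch of the slide-level similarity-to-logits computation referred to above is given below; the tensor names, shapes, and temperature value are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch (PyTorch) of prompt-based slide-level logit computation.
# Assumptions: slide_feat is an aggregated slide embedding and text_feats are
# class-prompt embeddings from a frozen VLM text encoder; names are illustrative.
import torch
import torch.nn.functional as F

def slide_level_logits(slide_feat: torch.Tensor,   # (D,) aggregated slide feature
                       text_feats: torch.Tensor,   # (C, D) one prompt embedding per class
                       temperature: float = 0.07) -> torch.Tensor:
    slide_feat = F.normalize(slide_feat, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Cosine similarity between the slide feature and each class prompt,
    # scaled by a temperature, serves as the classification logits.
    return slide_feat @ text_feats.t() / temperature

# Example with random features: 3 classes, 512-dim embeddings.
logits = slide_level_logits(torch.randn(512), torch.randn(3, 512))
probs = logits.softmax(dim=-1)
```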

[Q1] Reliability of LLM and insights of choosing appropriate LLMs.

We acknowledge the reviewer's concern regarding the reliability of LLM-generated prompts, and we provide a detailed discussion in the response to [W3] of Reviewer dMRb. To assess the reliability of our generated prompts, we examine the entities generated by GPT-4 by querying different LLMs (e.g., Claude 3.5 Sonnet, DeepSeek-V3 and Qwen2.5) using the prompt: "Do you think [ENTITY] is a key entity relevant to distinguishing different subtypes in [DATASET]?" All of these LLMs affirm the relevance and importance of the generated entities, which supports the accuracy of the selected entities.
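
Purely as an illustration, the cross-model verification described above could be scripted as follows; the `judges` callables are hypothetical stand-ins for whichever chat-completion clients are used and are not part of any released code.

```python
# Illustrative sketch of the cross-LLM entity verification described above.
# The judge callables are hypothetical wrappers around arbitrary chat APIs;
# no real client library or endpoint is assumed here.
from typing import Callable, Dict, List

PROMPT = ("Do you think {entity} is a key entity relevant to distinguishing "
          "different subtypes in {dataset}?")

def verify_entities(entities: List[str],
                    dataset: str,
                    judges: Dict[str, Callable[[str], str]]) -> Dict[str, Dict[str, str]]:
    """Ask each judge LLM whether every GPT-4-generated entity is relevant."""
    verdicts: Dict[str, Dict[str, str]] = {}
    for entity in entities:
        question = PROMPT.format(entity=entity, dataset=dataset)
        # Collect one free-text answer per judge model (e.g., Claude, DeepSeek, Qwen).
        verdicts[entity] = {name: query(question) for name, query in judges.items()}
    return verdicts
```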

For the practical choice of LLMs in clinical scenarios, we suggest selecting recently released LLMs that benefit from improved medical knowledge coverage and instruction-following ability. In addition, combining outputs from multiple LLMs and incorporating domain expert verification may offer a more robust and trustworthy solution than relying solely on a single model. We consider a more systematic exploration of the reliability and choice of different LLMs to be an important direction for future work.

[W3 and Q2] The choice of using PLIP and CONCH as VLMs.

Vision-language models (VLMs) have experienced explosive growth in the visual domain; however, CLIP is still widely regarded as a standard backbone for few-shot learning research [1][2]. In this work, we further explore the applicability of CLIP and its pathology-domain variant, PLIP, as our backbone. PLIP retains the architecture of CLIP while being fine-tuned on large-scale pathology-specific image-text data, making it suitable for investigating the effectiveness of prompt learning in few-shot WSI classification. Compared to CONCH, which is based on the more complex CoCa architecture, PLIP has fewer parameters and a structure more consistent with CLIP, facilitating direct and fair comparisons.

We appreciate the reviewer’s suggestion to consider other strong pathology foundation models such as UNI, GPFM, and Virchow. However, we note that these are image-only models that do not include a compatible text encoder, and therefore cannot be directly integrated into our prompt-learning pipeline, which relies on vision–language alignment. To this end, we instead conduct the ablation study using pathology VLMs, including CLIP-based methods (QuiltNet, PLIP) and the CoCa-based method (CONCH). As shown in the table below, CONCH provides the best classification performance as it is a stronger foundation model.

Next, for the purpose of illustrating the advantage of our MAPLE under different foundation models, we also compare it with all baselines using CONCH as the backbone. The experimental results shown in the response to [Q4] of Reviewer TGVt clearly verify the advantage of our MAPLE for few-shot WSI classification. We will include these CONCH-based results in the revised paper.

In summary, we can observe that the choice of different VLM backbones could affect the few-shot WSI classification results. However, our MAPLE can achieve consistently superior prediction results under different vision-language foundation models (e.g., CLIP, PLIP, CONCH), highlighting the advantages of our MAPLE to jointly integrate multi-scale visual semantics and perform prediction at both the entity and slide levels.

| Setting | Method | AUC(RCC) | F1(RCC) | ACC(RCC) | AUC(BRCA) | F1(BRCA) | ACC(BRCA) | AUC(NSCLC) | F1(NSCLC) | ACC(NSCLC) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-shot | MAPLE-QuiltNet | 0.688±0.063 | 0.560±0.050 | 0.630±0.068 | 0.914±0.017 | 0.720±0.060 | 0.748±0.061 | 0.728±0.063 | 0.641±0.086 | 0.651±0.064 |
| 4-shot | MAPLE-PLIP | 0.722±0.063 | 0.594±0.076 | 0.664±0.134 | 0.909±0.020 | 0.705±0.055 | 0.728±0.057 | 0.740±0.056 | 0.663±0.052 | 0.675±0.053 |
| 4-shot | MAPLE-CONCH | 0.844±0.109 | 0.653±0.126 | 0.695±0.166 | 0.947±0.017 | 0.791±0.086 | 0.805±0.069 | 0.889±0.055 | 0.774±0.032 | 0.786±0.033 |
| 8-shot | MAPLE-QuiltNet | 0.760±0.107 | 0.595±0.075 | 0.653±0.090 | 0.948±0.016 | 0.794±0.041 | 0.824±0.038 | 0.832±0.049 | 0.742±0.044 | 0.748±0.040 |
| 8-shot | MAPLE-PLIP | 0.786±0.070 | 0.618±0.024 | 0.673±0.018 | 0.957±0.015 | 0.791±0.024 | 0.806±0.024 | 0.855±0.041 | 0.762±0.031 | 0.766±0.030 |
| 8-shot | MAPLE-CONCH | 0.900±0.082 | 0.748±0.110 | 0.797±0.120 | 0.971±0.011 | 0.888±0.025 | 0.899±0.023 | 0.964±0.032 | 0.894±0.033 | 0.894±0.034 |
| 16-shot | MAPLE-QuiltNet | 0.791±0.027 | 0.660±0.033 | 0.715±0.033 | 0.964±0.009 | 0.846±0.032 | 0.884±0.027 | 0.891±0.039 | 0.805±0.052 | 0.806±0.052 |
| 16-shot | MAPLE-PLIP | 0.801±0.031 | 0.672±0.076 | 0.735±0.039 | 0.969±0.014 | 0.838±0.034 | 0.867±0.031 | 0.903±0.033 | 0.806±0.060 | 0.810±0.055 |
| 16-shot | MAPLE-CONCH | 0.916±0.024 | 0.790±0.029 | 0.825±0.030 | 0.984±0.006 | 0.910±0.016 | 0.919±0.015 | 0.981±0.005 | 0.914±0.050 | 0.914±0.047 |

[1] Zeng et al. Local-Prompt: Extensible Local Prompts for Few-Shot Out-of-Distribution Detection. ICLR 2025.
[2] Pan et al. NLPrompt: Noise-Label Prompt Learning for Vision-Language Models. CVPR 2025.

[Q3] Results in Table 1 are relatively low, which stems from the choice of feature extractor.

We appreciate the reviewer’s observation. We agree that CONCH is a significantly stronger foundation model than PLIP, which may partly explain the AUC difference observed in Table 1 of the original paper. To address this, we conduct additional experiments using CONCH as the visual encoder, with results presented in the response to [Q4] of Reviewer TGVt. Notably, MAPLE still achieves the best performance under this stronger backbone, reinforcing the effectiveness and generalizability of our method across different VLMs. We will include these CONCH-based results in the revised paper.

[Q4] Explanation on visualization details in Section 4.3.

The visualized patches are selected based on attention scores from the entity-guided cross-attention module. For each entity, we rank the attention scores and visualize the top-k scoring patches, which reflect the most relevant regions associated with that entity. We will clarify this detail explicitly in Section 4.3 of the revised paper.
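
A small sketch of this top-k selection, assuming the per-patch attention scores for one entity are available as a tensor (names and sizes are illustrative):

```python
# Sketch of selecting the top-k patches per entity by attention score.
# `attn` holds one attention score per patch for a given entity; values and
# shapes are illustrative assumptions.
import torch

def top_k_patch_indices(attn: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Return indices of the k patches with the highest attention scores."""
    k = min(k, attn.numel())
    return torch.topk(attn, k=k).indices

attn_scores = torch.rand(1000)          # e.g., 1000 patches in a slide
selected = top_k_patch_indices(attn_scores, k=8)
```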

[Q5] The point of using the global description and WSI-level representation if the entity can be precisely identified.

In MAPLE, entity-level prompts are designed to capture fine-grained and localized entities. However, WSIs are inherently heterogeneous, and accurate diagnosis often requires integrating evidence from multiple spatial and semantic cues, rather than relying on isolated entities alone. For instance, the presence of necrosis alone may not be diagnostically conclusive. However, when necrosis co-occurs with abnormal mitotic activity and specific inflammatory patterns, the combination strongly supports malignancy or tumor subtyping [1]. In such cases, global representations are essential for capturing diagnostic dependencies and interrelationships across the entire slide. Based on the above consideration, our slide-level representation is designed specifically to provide a holistic view that aggregates entity features. As confirmed in Table 2 of the main paper, MAPLE combined with entity-level and slide-level representations consistently outperforms using either alone, highlighting the effectiveness of both entity-level and slide-level information.

[1] Wesseling et al. The pathological diagnosis of diffuse gliomas: towards a smart synthesis of microscopic and molecular information in a multidisciplinary context[J]. Diagnostic Histopathology, 2011, 17(11): 486-494.
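
For reference, a minimal sketch of combining the two prediction heads, following the λ-weighted averaging of entity-level and slide-level logits described elsewhere in this thread; names and shapes are illustrative assumptions.

```python
# Minimal sketch of fusing entity-level and slide-level class logits.
# The weighting parameter `lam` mirrors the lambda mentioned in the reviews;
# tensor names and shapes are illustrative assumptions.
import torch

def fuse_logits(entity_logits: torch.Tensor,   # (C,) logits from the entity branch
                slide_logits: torch.Tensor,    # (C,) logits from the slide branch
                lam: float = 0.5) -> torch.Tensor:
    # Convex combination of the two prediction heads.
    return lam * entity_logits + (1.0 - lam) * slide_logits

fused = fuse_logits(torch.randn(3), torch.randn(3), lam=0.6)
prediction = fused.argmax().item()
```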

Comment

Thanks for the comprehensive response. The authors have addressed my concerns, and I'll raise the score.

In the revised manuscript, please incorporate the aforementioned revision. Also, please move the Related Works part into the main text instead of the Appendix if possible.

Comment

Thank you for considering our response and raising the score! We are glad to know that our rebuttal has addressed your concerns. We will incorporate the revisions and move the Related Works part into the main text in the revised paper.

Official Review
Rating: 5

Few-shot WSI classifiers often miss fine-grained, subtype-specific histological variations crucial for accurate diagnosis. To remedy this, the authors introduce MAPLE, which uses a frozen language model to generate both entity-level prompts for detailed histological attributes and slide-level prompts for global tissue context. A cross-attention module guided by these prompts, together with a cross-scale entity graph, refines and aligns multi-resolution features before aggregating them for prediction, boosting both accuracy and interpretability.

Strengths and Weaknesses

Strengths:

(1) MAPLE consistently outperforms leading MIL and prompt-learning baselines across different few-shot settings on multiple datasets.

(2) By fusing entity-level features from both low and high magnification with slide-level cues, MAPLE surpasses single-scale variants. Ablation studies show that every key module (language-guided instance selection, entity-guided cross-attention, and cross-scale graph learning) contributes positively.

(3) With fewer trainable parameters and inference/training times comparable to simpler baselines, MAPLE strikes a practical balance between accuracy and efficiency.

Weaknesses:

(1) Prompt quality depends entirely on raw LLM descriptions, which can hallucinate or omit clinically crucial details. The authors note this limitation and suggest future human-in-the-loop refinement.

(2) Without ground-truth entity labels, the paper lacks direct quantification of how well selected patches match textual prompts, relying instead on indirect ablation and qualitative figures.

(3) Although the paper highlights gains from multi-scale integration, it doesn’t benchmark against established multi-scale MIL architectures, leaving it unclear how much improvement stems from language cues versus simply using multiple resolutions.

Questions

  1. Prompt reliability seems crucial for this study. How did the authors generate prompts that prevent hallucinations or the omission of clinically critical attributes? It would be helpful to provide results with and without expert correction to quantify the performance gains from text guidance. The authors also claim that MAPLE’s hierarchy mirrors diagnostic practice, yet they offer no user study, expert evaluation, or references to validate this alignment.

  2. Although qualitative examples show that selected patches visually match the LLM-described attributes, there are no region-level ground-truth annotations with heatmap visualizations (or IoU / Dice metrics) to demonstrate slide-level alignment.

  3. Both the low-resolution-only and high-resolution-only models perform nearly as well as the multi-scale model, which only marginally outperforms the single-scale variants. How can we assess that each scale provides complementary information rather than redundant features? Furthermore, how can we ensure that 10× magnification captures sufficient cellular morphology detail compared with 20× or 40×?

  4. The paper does not compare MAPLE against existing multi-scale MIL methods [1, 2, 3]. Could the authors include these baselines to isolate the benefits of their language-guided enhancements?

[1] DTFD-MIL: Double-Tier Feature Distillation Multiple Instance Learning for Histopathology Whole Slide Image Classification
[2] Dual-stream Multiple Instance Learning Network for Whole Slide Image Classification with Self-supervised Contrastive Learning
[3] Cross-scale multi-instance learning for pathological image diagnosis

Limitations

MAPLE depends on uncurated LLM prompts, risking hallucinations without expert oversight. It lacks pixel-level annotations for rigorous, quantitative region–prompt validation, relying instead on indirect ablation and qualitative examples.

Final Justification

The author addressed all my concerns with explicit illustrations and supporting experiments, which have improved the quality of the manuscript. Therefore, I would like to increase my score and recommend accepting the paper.

Formatting Issues

n/a

Author Response

We thank the reviewer for the comments and for the time spent reviewing our paper. We address the weaknesses (W) and questions (Q) as follows:


[W1 and Q1] Prompt reliability and MAPLE’s hierarchy.

We appreciate the reviewer’s insightful comments regarding the reliability of LLM-generated prompts and the clinical plausibility of MAPLE’s hierarchical design.

Regarding the first point, we address this concern in detail in our response to [W3] of Reviewer dMRb. In the current version, we carefully design the LLM instructions in the Prompt Construction stage and evaluate the reliability of GPT-4 outputs by querying multiple independent LLMs (e.g., Claude 3.5 Sonnet, DeepSeek-V3, and Qwen2.5), aiming to minimize the potential impact of hallucinations and increase the trustworthiness of the extracted entities. In future work, we plan to incorporate expert correction and clinical guidelines to further refine and validate entity construction.

On the second point, MAPLE’s hierarchical design is motivated by real-world diagnostic workflows. We clarify this hierarchy and provide further explanation to validate it. Specifically, the hierarchy in MAPLE is reflected in two aspects: the multi-scale (low- and high-magnification) design and the multi-level (entity-level and slide-level) design:

  • On the one hand, at high magnification, pathologists analyze cellular components such as nuclear pleomorphism and cytoplasmic features for cancer diagnosis. At low magnification, they distinguish different cancer subtypes by examining tissue architecture such as gland formation and tumor-stroma interfaces [1][2].
  • On the other hand, clinical diagnosis is rarely based on isolated observations. Instead, pathologists synthesize cues from multiple key tissue entities (e.g., mitotic activity, necrosis, glandular structures) and integrate them with global tissue context to reach a diagnosis [3]. For instance, necrosis alone may be inconclusive, but when co-occurring with abnormal mitoses and specific inflammatory patterns, the combination supports malignancy or tumor subtyping [4]. Our design emulates this diagnostic reasoning by jointly modeling localized entity-level and global slide-level descriptions.

[1] Kumar et al. Whole slide imaging (wsi) in pathology: current perspectives and future directions. Journal of digital imaging, 33(4):1034–1040, 2020.
[2] Li et al. A multi-resolution model for histopathology image classification and localization with multiple instance learning. Computers in biology and medicine, 131:104253, 2021.
[3] Heba et al. A comprehensive review of the deep learning-based tumor analysis approaches in histopathological images: segmentation, classification and multi-learning tasks. Cluster Computing 2023.
[4] Wesseling et al. The pathological diagnosis of diffuse gliomas: towards a smart synthesis of microscopic and molecular information in a multidisciplinary context[J]. Diagnostic Histopathology, 2011, 17(11): 486-494.

[W2 and Q2] Lack of direct quantification of how well selected patches match textual prompts.

To the best of our knowledge, public WSI datasets typically lack spatial annotations that directly correspond to textual entities, making it infeasible to compute IoU or Dice metrics. In our current setup, we follow common practice adopted by prior works [1, 2], where semantic alignment is indirectly validated by visualizing the top-k patches with the highest attention scores from the entity-guided module and confirming that they visually reflect the corresponding textual descriptions.

To further address the reviewer’s concern and provide quantitative insights, we additionally compute similarity scores between selected patches and their corresponding entity descriptions. Specifically, for each entity, we select the top-50 and bottom-50 patches based on the similarity scores from the entity-guided attention, and then use the CONCH model to compute similarity scores between the entity description and the patches in each set. Finally, we derive the average scores of the top-50 and bottom-50 patches. We report the scores for the examples (Figure 4 of the original paper) in the table below. The results confirm that the selected patches align well with their associated textual descriptions, further validating the effectiveness of our method.

| | Stroma (TCGA-55-6970 from LUAD) | Nucleolus (TCGA-55-6970 from LUAD) | Stroma (TCGA-58-A46M from LUSC) | Nucleolus (TCGA-58-A46M from LUSC) |
|---|---|---|---|---|
| Top-50 score | 0.32 | 0.34 | 0.25 | 0.27 |
| Bottom-50 score | 0.15 | 0.12 | 0.08 | 0.07 |

[1] Lu et al. Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images. CVPR 2023.
[2] Jaume et al. Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction. CVPR 2024.
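
A sketch of the top-50/bottom-50 scoring procedure described above; the patch and entity embeddings are assumed to come from a pathology VLM such as CONCH, whose loading and preprocessing code is omitted here.

```python
# Sketch of the top-50 / bottom-50 patch-entity similarity check described above.
# Patch and entity-description embeddings are assumed to be precomputed with a
# pathology VLM (e.g., CONCH); all names and shapes are illustrative.
import torch
import torch.nn.functional as F

def avg_similarity(patch_feats: torch.Tensor,   # (N, D) patch embeddings
                   entity_feat: torch.Tensor,   # (D,) entity-description embedding
                   rank_scores: torch.Tensor,   # (N,) entity-guided attention scores
                   k: int = 50):
    """Average patch-text cosine similarity over the top-k and bottom-k ranked patches."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    entity_feat = F.normalize(entity_feat, dim=-1)
    sims = patch_feats @ entity_feat                 # (N,) cosine similarity per patch
    order = rank_scores.argsort(descending=True)     # patches ranked by attention score
    top, bottom = order[:k], order[-k:]
    return sims[top].mean().item(), sims[bottom].mean().item()

# Example with random embeddings: 1000 patches, 512-dim features.
top_avg, bottom_avg = avg_similarity(torch.randn(1000, 512), torch.randn(512), torch.rand(1000))
```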

[Q3] Complementarity of multi-scale features and selection of magnifications.

MAPLE is designed to capture distinct histological entities at different magnifications. Different resolution levels in WSIs naturally correspond to different semantic scales: low magnification (e.g., 5×) reveals tissue architecture and global organizational patterns, while high magnification (e.g., 10×) focuses on cellular morphology and nuclear detail. These features are inherently complementary rather than redundant. As demonstrated in Table 2 of the original paper, integrating both scales consistently outperforms using either scale alone. To further validate this observation, we perform the paired t-test comparing multi-scale and single-scale results. The resulting p-values are consistently below 0.05, confirming that the performance gains from multi-scale integration are statistically significant.
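
A minimal sketch of the paired t-test mentioned above, assuming per-run metric values for the multi-scale and single-scale variants are collected into arrays (the numbers shown are placeholders, not the paper's results):

```python
# Sketch of the paired t-test comparing multi-scale vs. single-scale results.
# The arrays below are placeholder values, not the paper's actual per-run metrics.
from scipy import stats

multi_scale_auc = [0.80, 0.79, 0.81, 0.80, 0.82]    # e.g., AUC over repeated runs
single_scale_auc = [0.78, 0.77, 0.79, 0.78, 0.80]

t_stat, p_value = stats.ttest_rel(multi_scale_auc, single_scale_auc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")       # p < 0.05 -> significant gain
```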

Our resolution choices follow ViLa-MIL [1], which also employs dual-scale inputs at 5× and 10× for few-shot slide-level classification. We agree that higher magnification (e.g., 20×) may offer finer morphological cues. To explore this, we conduct experiments using 20× as the high resolution, and find that 10× and 20× perform comparably, suggesting that both magnifications can effectively capture high-resolution histological features. Considering that 20× images significantly increase memory usage and computational cost, we adopt 10× in our main experiments to achieve a balance between performance and efficiency.

| Setting | Method | AUC(RCC) | F1(RCC) | ACC(RCC) | AUC(BRCA) | F1(BRCA) | ACC(BRCA) | AUC(NSCLC) | F1(NSCLC) | ACC(NSCLC) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-shot | MAPLE-20× | 0.714±0.052 | 0.583±0.118 | 0.647±0.181 | 0.904±0.027 | 0.691±0.075 | 0.717±0.080 | 0.739±0.061 | 0.654±0.121 | 0.666±0.074 |
| 4-shot | MAPLE-10× | 0.722±0.063 | 0.594±0.076 | 0.664±0.134 | 0.909±0.020 | 0.705±0.055 | 0.728±0.057 | 0.740±0.056 | 0.663±0.052 | 0.675±0.053 |
| 8-shot | MAPLE-20× | 0.784±0.075 | 0.621±0.092 | 0.680±0.109 | 0.958±0.013 | 0.796±0.060 | 0.813±0.062 | 0.847±0.073 | 0.751±0.088 | 0.760±0.082 |
| 8-shot | MAPLE-10× | 0.786±0.070 | 0.618±0.024 | 0.673±0.018 | 0.957±0.015 | 0.791±0.024 | 0.806±0.024 | 0.855±0.041 | 0.762±0.031 | 0.766±0.030 |
| 16-shot | MAPLE-20× | 0.796±0.034 | 0.688±0.077 | 0.746±0.098 | 0.960±0.012 | 0.831±0.019 | 0.857±0.021 | 0.909±0.041 | 0.793±0.022 | 0.805±0.023 |
| 16-shot | MAPLE-10× | 0.801±0.031 | 0.672±0.076 | 0.735±0.039 | 0.969±0.014 | 0.838±0.034 | 0.867±0.031 | 0.903±0.033 | 0.806±0.060 | 0.810±0.055 |

[1] Shi et al. Vila-mil: Dual-scale vision-language multiple instance learning for whole slide image classification. CVPR 2024.

[W3 and Q4] Comparison against existing multi-scale MIL methods.

We appreciate the reviewer’s suggestion and conduct additional experiments that include existing multi-scale MIL baselines, namely DTFD-MIL, Dual-Stream MIL, and Cross-Scale MIL, under the same 5×/10× magnification settings used in MAPLE. As shown in the table below, MAPLE consistently outperforms these baselines across all datasets, highlighting the advantage of our multi-scale, prompt-guided, VLM-based method. These results further confirm that the performance gain of MAPLE arises not merely from multi-scale integration, but from the combination of multi-scale modeling and language-guided prompt supervision via vision-language models.

| Setting | Method | AUC(RCC) | F1(RCC) | ACC(RCC) | AUC(BRCA) | F1(BRCA) | ACC(BRCA) | AUC(NSCLC) | F1(NSCLC) | ACC(NSCLC) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-shot | DTFD-MIL | 0.648±0.050 | 0.520±0.065 | 0.593±0.068 | 0.869±0.031 | 0.626±0.057 | 0.660±0.070 | 0.624±0.049 | 0.555±0.051 | 0.581±0.046 |
| 4-shot | Dual-Stream MIL | 0.669±0.052 | 0.562±0.132 | 0.623±0.162 | 0.877±0.039 | 0.664±0.037 | 0.693±0.036 | 0.649±0.122 | 0.564±0.094 | 0.590±0.080 |
| 4-shot | Cross-Scale MIL | 0.673±0.097 | 0.552±0.061 | 0.615±0.068 | 0.876±0.019 | 0.667±0.037 | 0.690±0.039 | 0.651±0.103 | 0.560±0.093 | 0.586±0.084 |
| 4-shot | MAPLE | 0.722±0.063 | 0.594±0.076 | 0.664±0.134 | 0.909±0.020 | 0.705±0.055 | 0.728±0.057 | 0.740±0.056 | 0.663±0.052 | 0.675±0.053 |
| 8-shot | DTFD-MIL | 0.733±0.048 | 0.546±0.061 | 0.603±0.086 | 0.907±0.024 | 0.722±0.036 | 0.748±0.046 | 0.729±0.040 | 0.632±0.042 | 0.652±0.030 |
| 8-shot | Dual-Stream MIL | 0.758±0.073 | 0.548±0.071 | 0.576±0.088 | 0.926±0.025 | 0.761±0.041 | 0.789±0.037 | 0.752±0.074 | 0.651±0.032 | 0.667±0.034 |
| 8-shot | Cross-Scale MIL | 0.756±0.062 | 0.554±0.063 | 0.588±0.075 | 0.924±0.023 | 0.757±0.028 | 0.782±0.031 | 0.748±0.037 | 0.645±0.075 | 0.657±0.071 |
| 8-shot | MAPLE | 0.786±0.070 | 0.618±0.024 | 0.673±0.018 | 0.957±0.015 | 0.791±0.024 | 0.806±0.024 | 0.855±0.041 | 0.762±0.031 | 0.766±0.030 |
| 16-shot | DTFD-MIL | 0.738±0.044 | 0.623±0.064 | 0.679±0.088 | 0.919±0.019 | 0.762±0.051 | 0.799±0.040 | 0.812±0.047 | 0.742±0.044 | 0.747±0.042 |
| 16-shot | Dual-Stream MIL | 0.752±0.038 | 0.636±0.059 | 0.696±0.062 | 0.946±0.014 | 0.813±0.030 | 0.827±0.032 | 0.824±0.029 | 0.760±0.037 | 0.762±0.032 |
| 16-shot | Cross-Scale MIL | 0.759±0.052 | 0.632±0.055 | 0.698±0.060 | 0.948±0.019 | 0.815±0.034 | 0.830±0.035 | 0.830±0.042 | 0.764±0.029 | 0.768±0.028 |
| 16-shot | MAPLE | 0.801±0.031 | 0.672±0.076 | 0.735±0.039 | 0.969±0.014 | 0.838±0.034 | 0.867±0.031 | 0.903±0.033 | 0.806±0.060 | 0.810±0.055 |
Comment

Thank you so much for the author’s efforts in providing explicit illustrations and additional experiments to address my concerns. Please make sure to include all of the clarifications and new experiments in the final version of the manuscript. I would like to increase my score to reflect the improvements in the quality of the manuscript.

Comment

We sincerely thank the reviewer for the positive feedback and for acknowledging our efforts. We are pleased to hear that the revisions have addressed your concerns. We will ensure that all clarifications and new experimental results are included in the final version. We greatly appreciate your thoughtful review and the revised score.

Official Review
Rating: 5

This paper addresses the limitation of existing few-shot WSI classification methods that lack fine-grained, entity-level prompts. It proposes a Multi-scale Attribute-enhanced Prompt Learning approach, which incorporates both slide-level and entity-level prompts. The method fully leverages the capabilities of large language models (LLMs) throughout the process. Furthermore, an entity graph is constructed to enhance instance features and derive entity-level representations. Finally, the predictions from both the entity-level and slide-level are integrated to produce the final classification result.

Strengths and Weaknesses

Strengths:

  1. To achieve prompt learning aligned with fine-grained features, this paper designs components such as entity name extraction, hierarchical entity graphs, and an entity-guided cross-attention module—all of which are validated through ablation studies.
  2. By introducing entity-level prompt learning, the paper enables more interpretable predictions at the entity level.

Weaknesses:

  1. The font size in Figure 1 is too small and difficult to read; it is recommended to enlarge the text for better readability.
  2. In Figure 2, the meaning of the arrows between the entity-level features and the learnable prompts is unclear. It is suggested to add symbols or annotations to clarify their purpose.
  3. How does the use of LLM ensure comprehensive and accurate extraction of entity names? Is there a possibility of incomplete or missing entity names? If the extraction is not sufficiently comprehensive, will it affect model performance, and how is this issue addressed?
  4. In the Region Selection module, is there a high degree of overlap among the selected instances, meaning that most image patches come from the same region? Is this phenomenon beneficial or detrimental to the final model performance?

Questions

please refer to Weaknesses

Limitations

please refer to Weaknesses

Final Justification

The authors’ responses have addressed most of my concerns. I will maintain my score.

Formatting Issues

None

Author Response

We thank the reviewer for the comments and for the time spent reviewing our paper. We address the weaknesses (W) as follows:


[W1] The font size in Figure 1 is too small and difficult to read.

We thank the reviewer for pointing this out. We will revise Figure 1 by enlarging the font size and optimizing the layout to ensure clarity and readability in the final manuscript.

[W2] In Figure 2, the meaning of the arrows between the entity-level features and the learnable prompts is unclear.

We appreciate the reviewer’s suggestion. The arrows represent the computation of similarity scores between the extracted entity-level features and the learnable prompts. We will revise Figure 2 by adding appropriate annotations to clarify their meaning and ensure interpretability.

[W3] Reliability and completeness of entity extraction by LLMs.

We thank the reviewer for this insightful comment. To ensure comprehensive and accurate extraction of entities, we carefully design the LLM instructions in the Prompt Construction stage to prioritize entities with higher clinical importance. For example, we incorporate prompts such as "iteratively suggest a discriminative histological entity not in the current entity set" (see Appendix B.1 for details), thereby reducing the likelihood of missing critical entities. This prioritization encourages the LLM to focus on highly discriminative entities.

To further evaluate the accuracy of the selected entities, we examine the entities generated by GPT-4 by querying different LLMs (e.g., Claude 3.5 Sonnet, DeepSeek-V3 and Qwen2.5) using the prompt: "Do you think [ENTITY] is a key entity relevant to distinguishing different subtypes in [DATASET]?" All of these LLMs affirm the relevance and importance of the generated entities, which supports the accuracy of the selected entities.

Furthermore, as shown in our ablation study on the number of entities (Figure 8 in the original paper), increasing the number of entities beyond a certain threshold does not improve performance and may even cause a slight decrease. This suggests that redundant entities may lack discriminative phenotypic attributes relevant to cancer subtype classification and may introduce noise that hinders the model's decision-making process. This observation supports the completeness and effectiveness of our selected entity set.

As acknowledged in our Limitations section, we recognize that LLM-generated entities may still suffer from hallucinations or generate clinically irrelevant descriptions in the absence of expert curation. To mitigate this, we propose the following future directions:

  • Incorporating expert review or clinical guidelines to refine and validate the extracted entities;
  • Aggregating outputs from multiple LLMs (e.g., a mixture-of-experts approach) to achieve more robust and reliable entity sets and descriptions rather than relying solely on a single model.

We believe these strategies will enhance the reliability and clinical applicability of LLM-driven entity construction, and we plan to explore them in future versions of MAPLE.

[W4] Patch overlap and the region selection module.

The Region Selection module in MAPLE is designed to accurately identify tumor-relevant regions while suppressing irrelevant background areas, thereby guiding the model to focus on informative patches that align well with the entity-level prompts. Since patch instances are partitioned in a non-overlapping manner and the region selection ratio is set to r = 0.7, the selected patches effectively cover a variety of heterogeneous tumor-related regions within a WSI and thus do not exhibit a high degree of spatial redundancy or overlap. We also investigate the impact of r in our ablation study (Section E.3 of the main paper). As shown in Figure 9, higher values of r (e.g., r > 0.7) yield slight performance degradation, which can be attributed to the inclusion of less informative or non-tumor regions that compromise the quality of the selected patch set. In contrast, lower values (e.g., r < 0.7) tend to overly concentrate on a few localized regions, also resulting in performance degradation.
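
A sketch of this selection step, keeping the top fraction r of patches ranked by a relevance score; the score values and function names are illustrative assumptions.

```python
# Sketch of the region-selection step: keep the top fraction r of patches
# ranked by a relevance score. Names and scores are illustrative assumptions.
import torch

def select_regions(scores: torch.Tensor, r: float = 0.7) -> torch.Tensor:
    """Return indices of the top r fraction of patches by relevance score."""
    n_keep = max(1, int(round(r * scores.numel())))
    return torch.topk(scores, k=n_keep).indices

relevance = torch.rand(1000)            # one relevance score per patch
kept = select_regions(relevance, r=0.7)
```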

Comment

The authors’ responses have addressed most of my concerns. I will maintain my score.

Comment

Thank you for taking the time to review our responses. We sincerely appreciate your thoughtful feedback and your engagement throughout the review process.

We are happy to address any further questions you may have!

Official Review
Rating: 4

The work presents prompting-based learning for few-shot WSI classification. An LLM is used to form prompts at both the entity level and the slide level, at both low and high resolutions. A cross-attention mechanism is used to extract features for each entity, and cosine similarity is then computed with subtype-specific prompts to generate logits for classification. A Graph Attention Network is employed to account for correlations among various entities. Additionally, slide-level representation is formed in a similar manner, where the LLM is prompted to identify slide-level features. Slide-level logits and entity-level logits are then averaged using a parameter lambda to produce the final logits. The effectiveness of the proposed method is evaluated on three WSI datasets across three few-shot settings.

Strengths and Weaknesses

Strengths:

  1. The method is clearly and effectively presented.

  2. MAPLE explores the use of scale-specific prompts, which is intuitive and well-motivated.

  3. The integration of the Graph Attention Network (GAT) is both interesting and effective.

  4. Exhaustive ablation studies are provided for each component of the method.

Weaknesses:

  1. While the proposed method depends on a VLM-based pathology foundation model, the baselines using traditional MIL approaches such as ABMIL and TransMIL could benefit from stronger image-only foundation models like Virchow 2 or UNI2-h. The study lacks a fair comparison in which MIL methods are trained using these state-of-the-art pathology feature encoders.

  2. All experiments are conducted with at most a 16-shot setting, which limits the generalizability of the proposed approach to more realistic scenarios where around fifty to several hundred WSIs per class may be available.

  3. The paper lacks any comparison with the ConcepPath method, which is highly relevant as it also decomposes slide-level prompts into concept-level prompts and proposes a guided aggregation strategy. Although ConcepPath was developed under a fully supervised setting, it can be readily adapted to few-shot scenarios as well.

Questions

  1. Please include baseline results for ABMIL and TransMIL using state-of-the-art pathology foundation models such as Virchow 2 or UNI2-h in Table 1, evaluated across the three datasets and all three few-shot settings. The PLIP model used in this study is a comparatively weak feature extractor.

  2. Provide results for the 32-shot, 64-shot, and fully supervised (all-label) settings, and compare them with the existing baselines in Table 1. These results will offer a clearer perspective on the generalizability of the proposed method across varying levels of label availability.

  3. Include a comparison of the proposed method with ConcepPath across multiple few-shot settings (Table 1) to highlight differences in performance.

  4. Provide an ablation study evaluating the effect of using different VLMs, such as CONCH or QuiltNet, on the performance of both the proposed method and baseline methods, at least on one representative dataset.

Limitations

yes

Final Justification

I thank the authors for the thorough rebuttal.

Formatting Issues

None

Author Response

We thank the reviewer for the comments and for the time spent reviewing our paper. We address the weaknesses (W) and questions (Q) as follows:


[W1 and Q1] SOTA pathology foundation models for MIL baselines.

We agree that traditional MIL approaches like ABMIL and TransMIL could indeed benefit from stronger image-only foundation models such as Virchow 2 or UNI2-h. In our main experiments (Tables 1 and 5), we use the same feature extractors (i.e., PLIP and CLIP) across all baselines and MAPLE to ensure a fair comparison. To address the reviewer's concern, we additionally conduct experiments that include ABMIL and TransMIL with a stronger image-only foundation model (UNI2-h) and MAPLE with a stronger VLM (CONCH). From the table below, we derive the following observations. First, both ABMIL and TransMIL with UNI2-h exhibit noticeable performance gains, validating the importance of encoder strength. Second, MAPLE with CONCH consistently beats the MIL-based methods with the latest image-only foundation model (i.e., UNI2-h), demonstrating the scalability and effectiveness of our method. Moreover, we compare MAPLE with all baseline methods under CONCH for a fair comparison in the response to [Q4]. MAPLE achieves the best performance across different few-shot settings and datasets, highlighting its effectiveness under different VLMs. In summary, these findings confirm that while backbone strength contributes to overall performance, MAPLE consistently delivers superior results.

| Setting | Method | AUC(RCC) | F1(RCC) | ACC(RCC) | AUC(BRCA) | F1(BRCA) | ACC(BRCA) | AUC(NSCLC) | F1(NSCLC) | ACC(NSCLC) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-shot | ABMIL-UNI2 | 0.787±0.045 | 0.604±0.073 | 0.651±0.085 | 0.932±0.012 | 0.763±0.031 | 0.781±0.031 | 0.806±0.054 | 0.704±0.029 | 0.713±0.028 |
| 4-shot | TransMIL-UNI2 | 0.779±0.075 | 0.587±0.198 | 0.631±0.228 | 0.941±0.017 | 0.768±0.046 | 0.785±0.058 | 0.815±0.078 | 0.719±0.089 | 0.729±0.081 |
| 4-shot | MAPLE-CONCH | 0.844±0.109 | 0.653±0.126 | 0.695±0.166 | 0.947±0.017 | 0.791±0.086 | 0.805±0.069 | 0.889±0.055 | 0.774±0.032 | 0.786±0.033 |
| 8-shot | ABMIL-UNI2 | 0.855±0.030 | 0.710±0.068 | 0.742±0.070 | 0.953±0.008 | 0.861±0.024 | 0.872±0.020 | 0.927±0.023 | 0.819±0.038 | 0.830±0.038 |
| 8-shot | TransMIL-UNI2 | 0.847±0.052 | 0.701±0.062 | 0.730±0.073 | 0.961±0.002 | 0.870±0.011 | 0.878±0.014 | 0.915±0.045 | 0.808±0.068 | 0.811±0.065 |
| 8-shot | MAPLE-CONCH | 0.900±0.082 | 0.748±0.110 | 0.797±0.120 | 0.971±0.011 | 0.888±0.025 | 0.899±0.023 | 0.964±0.032 | 0.894±0.033 | 0.894±0.034 |
| 16-shot | ABMIL-UNI2 | 0.884±0.031 | 0.767±0.048 | 0.790±0.040 | 0.962±0.003 | 0.865±0.012 | 0.868±0.005 | 0.961±0.011 | 0.891±0.023 | 0.897±0.023 |
| 16-shot | TransMIL-UNI2 | 0.897±0.026 | 0.772±0.104 | 0.804±0.121 | 0.969±0.002 | 0.874±0.011 | 0.879±0.012 | 0.949±0.026 | 0.885±0.036 | 0.885±0.036 |
| 16-shot | MAPLE-CONCH | 0.916±0.024 | 0.790±0.029 | 0.825±0.030 | 0.984±0.006 | 0.910±0.016 | 0.919±0.015 | 0.981±0.005 | 0.914±0.050 | 0.914±0.047 |

[Q2 and W2] Generalizability beyond 16-shot settings.

For few-shot WSI classification, we follow prior works (e.g., MSCPT and FOCUS) and set the number of shots to 4, 8, and 16. We agree that evaluating performance under larger-shot settings (e.g., 32-shot and 64-shot) is helpful for assessing MAPLE in more realistic scenarios where more WSIs per class are available. To this end, we conduct additional experiments under 32-shot and 64-shot settings and present the results in the table below. As shown, MAPLE continues to outperform existing baselines, demonstrating its generalizability under larger-shot settings. Due to the character limit, we report the fully supervised results directly in the text: our MAPLE achieves the best AUC of 0.903, 0.990, and 0.966 across the three datasets, a significant improvement over the second-best method, FOCUS, with AUC of 0.880, 0.975, and 0.955.

| Setting | Method | AUC(RCC) | F1(RCC) | ACC(RCC) | AUC(BRCA) | F1(BRCA) | ACC(BRCA) | AUC(NSCLC) | F1(NSCLC) | ACC(NSCLC) |
|---|---|---|---|---|---|---|---|---|---|---|
| 32-shot | ViLa-MIL | 0.830±0.029 | 0.704±0.018 | 0.775±0.023 | 0.954±0.014 | 0.806±0.020 | 0.837±0.023 | 0.892±0.017 | 0.809±0.018 | 0.809±0.018 |
| 32-shot | MSCPT | 0.816±0.038 | 0.670±0.044 | 0.729±0.052 | 0.949±0.014 | 0.801±0.054 | 0.828±0.051 | 0.899±0.028 | 0.823±0.036 | 0.823±0.036 |
| 32-shot | FOCUS | 0.826±0.021 | 0.696±0.032 | 0.769±0.050 | 0.961±0.009 | 0.836±0.037 | 0.849±0.032 | 0.909±0.028 | 0.850±0.028 | 0.851±0.028 |
| 32-shot | MAPLE | 0.849±0.025 | 0.727±0.050 | 0.784±0.065 | 0.975±0.015 | 0.855±0.049 | 0.869±0.038 | 0.926±0.022 | 0.856±0.029 | 0.857±0.029 |
| 64-shot | ViLa-MIL | 0.859±0.025 | 0.715±0.044 | 0.772±0.048 | 0.966±0.003 | 0.861±0.012 | 0.884±0.012 | 0.924±0.006 | 0.851±0.010 | 0.851±0.010 |
| 64-shot | MSCPT | 0.849±0.014 | 0.687±0.036 | 0.742±0.040 | 0.953±0.005 | 0.857±0.018 | 0.864±0.013 | 0.928±0.013 | 0.853±0.021 | 0.854±0.021 |
| 64-shot | FOCUS | 0.852±0.017 | 0.710±0.035 | 0.772±0.036 | 0.967±0.004 | 0.864±0.014 | 0.885±0.013 | 0.930±0.018 | 0.859±0.015 | 0.861±0.014 |
| 64-shot | MAPLE | 0.873±0.025 | 0.733±0.027 | 0.796±0.034 | 0.980±0.001 | 0.889±0.018 | 0.899±0.014 | 0.942±0.031 | 0.876±0.029 | 0.876±0.029 |

[W3 and Q3] Comparison with ConcepPath.

We thank the reviewer for pointing out the relevance of ConcepPath. We report the results of ConcepPath in the table below. MAPLE consistently outperforms ConcepPath across different datasets and few-shot settings, highlighting the effectiveness of our method in few-shot WSI classification.

| Setting | Method | AUC(RCC) | F1(RCC) | ACC(RCC) | AUC(BRCA) | F1(BRCA) | ACC(BRCA) | AUC(NSCLC) | F1(NSCLC) | ACC(NSCLC) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-shot | ConcepPath | 0.672±0.031 | 0.555±0.091 | 0.605±0.086 | 0.877±0.026 | 0.654±0.057 | 0.676±0.043 | 0.657±0.045 | 0.601±0.053 | 0.604±0.031 |
| 4-shot | MAPLE | 0.722±0.063 | 0.594±0.076 | 0.664±0.134 | 0.909±0.020 | 0.705±0.055 | 0.728±0.057 | 0.740±0.056 | 0.663±0.052 | 0.675±0.053 |
| 8-shot | ConcepPath | 0.759±0.072 | 0.571±0.093 | 0.635±0.142 | 0.920±0.032 | 0.753±0.068 | 0.774±0.066 | 0.750±0.034 | 0.668±0.033 | 0.680±0.035 |
| 8-shot | MAPLE | 0.786±0.070 | 0.618±0.024 | 0.673±0.018 | 0.957±0.015 | 0.791±0.024 | 0.806±0.024 | 0.855±0.041 | 0.762±0.031 | 0.766±0.030 |
| 16-shot | ConcepPath | 0.764±0.068 | 0.631±0.073 | 0.694±0.069 | 0.938±0.031 | 0.817±0.035 | 0.835±0.032 | 0.831±0.053 | 0.753±0.046 | 0.756±0.039 |
| 16-shot | MAPLE | 0.801±0.031 | 0.672±0.076 | 0.735±0.039 | 0.969±0.014 | 0.838±0.034 | 0.867±0.031 | 0.903±0.033 | 0.806±0.060 | 0.810±0.055 |

[Q4] Evaluating the effect of using different VLMs.

In addition to the experiments using CLIP and PLIP in the original paper, we provide further results with stronger pathology VLMs such as CONCH in the table below. MAPLE consistently outperforms baseline methods across different datasets and few-shot settings, highlighting the effectiveness of our method. We will include these results in the revised paper. Furthermore, we conduct a comprehensive ablation study evaluating the impact of different pathology VLMs including QuiltNet, PLIP, and CONCH, as detailed in our response to [W3 and Q2] of Reviewer yzxQ.

| Setting | Method | AUC(RCC) | F1(RCC) | ACC(RCC) | AUC(BRCA) | F1(BRCA) | ACC(BRCA) | AUC(NSCLC) | F1(NSCLC) | ACC(NSCLC) |
|---|---|---|---|---|---|---|---|---|---|---|
| 4-shot | ABMIL | 0.770±0.069 | 0.591±0.070 | 0.648±0.105 | 0.924±0.026 | 0.755±0.063 | 0.773±0.062 | 0.832±0.041 | 0.714±0.049 | 0.721±0.044 |
| 4-shot | TransMIL | 0.757±0.127 | 0.602±0.111 | 0.672±0.126 | 0.938±0.019 | 0.761±0.098 | 0.778±0.099 | 0.848±0.059 | 0.756±0.125 | 0.762±0.109 |
| 4-shot | GTMIL | 0.728±0.102 | 0.552±0.094 | 0.630±0.141 | 0.923±0.031 | 0.743±0.070 | 0.765±0.054 | 0.836±0.093 | 0.750±0.080 | 0.754±0.081 |
| 4-shot | WiKG | 0.768±0.109 | 0.605±0.105 | 0.651±0.123 | 0.925±0.010 | 0.761±0.038 | 0.777±0.036 | 0.820±0.087 | 0.730±0.084 | 0.733±0.083 |
| 4-shot | TOP | 0.728±0.170 | 0.583±0.117 | 0.638±0.105 | 0.916±0.033 | 0.743±0.057 | 0.758±0.053 | 0.816±0.066 | 0.683±0.130 | 0.707±0.092 |
| 4-shot | ViLa-MIL | 0.783±0.108 | 0.590±0.110 | 0.635±0.130 | 0.919±0.030 | 0.768±0.058 | 0.793±0.051 | 0.853±0.073 | 0.759±0.081 | 0.756±0.082 |
| 4-shot | MSCPT | 0.782±0.087 | 0.605±0.072 | 0.632±0.088 | 0.931±0.019 | 0.770±0.057 | 0.785±0.050 | 0.842±0.059 | 0.730±0.068 | 0.748±0.066 |
| 4-shot | FOCUS | 0.810±0.115 | 0.632±0.135 | 0.667±0.160 | 0.930±0.032 | 0.767±0.075 | 0.780±0.065 | 0.875±0.077 | 0.762±0.060 | 0.769±0.061 |
| 4-shot | MAPLE | 0.844±0.109 | 0.653±0.126 | 0.695±0.166 | 0.947±0.017 | 0.791±0.086 | 0.805±0.069 | 0.889±0.055 | 0.774±0.032 | 0.786±0.033 |
| 8-shot | ABMIL | 0.857±0.049 | 0.705±0.057 | 0.763±0.058 | 0.941±0.013 | 0.835±0.055 | 0.844±0.047 | 0.925±0.013 | 0.835±0.021 | 0.835±0.021 |
| 8-shot | TransMIL | 0.853±0.044 | 0.710±0.044 | 0.771±0.046 | 0.940±0.012 | 0.840±0.030 | 0.851±0.024 | 0.916±0.010 | 0.821±0.027 | 0.821±0.027 |
| 8-shot | GTMIL | 0.861±0.052 | 0.711±0.068 | 0.780±0.065 | 0.945±0.006 | 0.845±0.020 | 0.857±0.016 | 0.928±0.016 | 0.844±0.020 | 0.844±0.020 |
| 8-shot | WiKG | 0.851±0.045 | 0.685±0.050 | 0.741±0.060 | 0.947±0.012 | 0.834±0.025 | 0.855±0.020 | 0.919±0.007 | 0.834±0.021 | 0.834±0.020 |
| 8-shot | TOP | 0.859±0.028 | 0.701±0.050 | 0.758±0.054 | 0.929±0.020 | 0.814±0.039 | 0.828±0.035 | 0.908±0.043 | 0.817±0.068 | 0.818±0.067 |
| 8-shot | ViLa-MIL | 0.880±0.081 | 0.726±0.117 | 0.776±0.090 | 0.945±0.008 | 0.835±0.037 | 0.858±0.032 | 0.934±0.037 | 0.856±0.051 | 0.857±0.051 |
| 8-shot | MSCPT | 0.882±0.091 | 0.720±0.132 | 0.774±0.136 | 0.950±0.011 | 0.849±0.043 | 0.851±0.041 | 0.924±0.043 | 0.849±0.055 | 0.849±0.055 |
| 8-shot | FOCUS | 0.875±0.060 | 0.719±0.114 | 0.747±0.145 | 0.959±0.008 | 0.871±0.033 | 0.875±0.031 | 0.949±0.030 | 0.873±0.043 | 0.873±0.042 |
| 8-shot | MAPLE | 0.900±0.082 | 0.748±0.110 | 0.797±0.120 | 0.971±0.011 | 0.888±0.025 | 0.899±0.023 | 0.964±0.032 | 0.894±0.033 | 0.894±0.034 |
| 16-shot | ABMIL | 0.876±0.017 | 0.759±0.020 | 0.789±0.018 | 0.954±0.004 | 0.857±0.009 | 0.859±0.007 | 0.935±0.013 | 0.865±0.021 | 0.865±0.021 |
| 16-shot | TransMIL | 0.884±0.030 | 0.761±0.057 | 0.795±0.057 | 0.955±0.003 | 0.854±0.010 | 0.860±0.013 | 0.926±0.014 | 0.851±0.027 | 0.851±0.027 |
| 16-shot | GTMIL | 0.891±0.025 | 0.770±0.073 | 0.803±0.083 | 0.962±0.004 | 0.865±0.008 | 0.878±0.013 | 0.938±0.016 | 0.874±0.020 | 0.874±0.020 |
| 16-shot | WiKG | 0.882±0.013 | 0.762±0.019 | 0.796±0.018 | 0.957±0.010 | 0.839±0.031 | 0.856±0.031 | 0.939±0.007 | 0.864±0.021 | 0.864±0.020 |
| 16-shot | TOP | 0.887±0.011 | 0.768±0.016 | 0.790±0.015 | 0.944±0.003 | 0.829±0.027 | 0.835±0.027 | 0.924±0.013 | 0.859±0.025 | 0.859±0.025 |
| 16-shot | ViLa-MIL | 0.902±0.033 | 0.775±0.038 | 0.812±0.042 | 0.966±0.006 | 0.862±0.024 | 0.872±0.016 | 0.941±0.023 | 0.877±0.028 | 0.877±0.017 |
| 16-shot | MSCPT | 0.894±0.018 | 0.767±0.027 | 0.808±0.024 | 0.958±0.004 | 0.859±0.021 | 0.871±0.025 | 0.934±0.017 | 0.866±0.031 | 0.867±0.031 |
| 16-shot | FOCUS | 0.893±0.017 | 0.764±0.041 | 0.805±0.042 | 0.974±0.006 | 0.884±0.045 | 0.891±0.051 | 0.964±0.007 | 0.894±0.052 | 0.895±0.050 |
| 16-shot | MAPLE | 0.916±0.024 | 0.790±0.029 | 0.825±0.030 | 0.984±0.006 | 0.910±0.016 | 0.919±0.015 | 0.981±0.005 | 0.914±0.050 | 0.914±0.047 |
Comment

I thank the authors for the thorough rebuttal. My concerns are addressed.

Comment

Thank you for carefully reading our response and for your prompt feedback. We truly appreciate your thoughtful engagement with our work. If you have any further questions or concerns during your review, please don't hesitate to reach out - we're more than happy to provide additional clarification or information to assist in your assessment.

Final Decision

This paper proposes MAPLE, a multi-scale attribute-enhanced prompt learning framework for few-shot WSI classification. By combining entity-level prompts (generated via LLMs) with slide-level prompts, and refining features through entity-guided cross-attention and cross-scale graph learning, MAPLE achieves consistent improvements across multiple cancer datasets and few-shot regimes.

The paper is well-motivated and supported by extensive ablations. Reviewers initially raised concerns about fairness of baselines (stronger encoders, larger-shot settings, comparisons with ConcepPath and multi-scale MIL methods), clarity of figures, and reliability of LLM-generated prompts. The authors provided comprehensive additional experiments and clarifications during rebuttal, which convincingly addressed these points. While prompt reliability and lack of direct region–prompt alignment remain limitations, they are acknowledged with clear future directions.

Overall, reviewers converged towards acceptance after rebuttal, with multiple raising scores. I recommend accept as a solid and meaningful contribution, though not at spotlight/oral level given the prompt reliability and limited clinical validation.