PaperHub
Overall: 7.8/10 · Poster · 4 reviewers
Ratings: 5, 5, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 3.3 · Significance: 3.0
NeurIPS 2025

Sequential Attention-based Sampling for Histopathological Analysis

OpenReview | PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

An attention-augmented deep reinforcement learning method for automated histopathology

Abstract

Keywords
Deep Learning, Deep Reinforcement Learning, Multiple Instance Learning, Histopathological Analysis, Whole Slide Images

Reviews and Discussion

Official Review
Rating: 5

This paper proposes a new RL technique for whole-slide classification of histopathology images. The idea is to use RL to select higher-magnification patches to save on computation while maintaining classification performance. The novelty of this work lies in the feature aggregator, which allows the authors to combine high-magnification features with low-magnification ones, and in a targeted state updater that selects only patches with high similarity for updating the state of the RL agent. Experiments show some improvement on two histopathology tasks, and ablation studies demonstrate the importance of each component of the method.

Strengths and Weaknesses

Strengths

  • The proposed method works relatively well for the selected tasks, achieving 4x faster inference and comparable accuracy.
  • The design is well motivated, and the ablation study shows the necessity of each component of the technique.

Weaknesses

  • The tasks selected here are limited to tumor detection-style binary classification. Given the complexity of RL, it would be helpful to demonstrate its ability on more complex, multi-class tasks. One potentially interesting case is breast cancer mitotic activity scoring, which requires the model to detect dividing cells and classify them into three density classes.

Questions

N/A

Limitations

Yes

Final Justification

I am satisfied with the rebuttal that shows the proposed technique to be on par with SOTA on more complex problems.

Formatting Issues

N/A

Author Response

We thank the reviewer for the insightful suggestions, for highlighting the motivation of the study, and for noting the computational advantages of our approach. We address the suggestion below.

helpful to demonstrate its ability in more complex, multi task classes ... breast cancer mitotic activity scoring

  • We sought to download the TCGA-BRCA dataset for breast cancer mitotic activity scoring (e.g., Ibrahim et al., Mod. Pathol., 2024).
  • However, the data were not readily accessible because of the need for specialized download managers associated with the GDC data portal, requiring detailed manifest files that we found challenging to configure. We will resolve these download issues and seek to include this dataset in our future evaluations.
  • At this time, following the reviewer’s suggestion, we have evaluated two additional benchmarks, CamelyonPlus and BRACS, each of which represents a multi-class classification problem.

 

1) Extension to multi-class classification

CamelyonPlus 4-class classification

  • First, we employed the curated CamelyonPlus dataset [1].
  • This dataset comprises 1,347 WSIs with labels from the curated Camelyon-16 and Camelyon-17 datasets after removing low-quality slides and upgrading the task to a four-class classification with the following labels: negative, micro-metastasis, macro-metastasis, and Isolated Tumor Cells (ITC).
  • The following are the train/validation/test splits (numbers indicate WSI counts). We trained the model with the pooled datasets and report the performance separately for the Camelyon-16 and Camelyon-17 test WSIs.
         |  CAMELYON16+(4-class) |  CAMELYON17+(4-class) | 
         | Neg | Mic | Mac | ITC | Neg | Mic | Mac | ITC |
Train    | 172 | 45  | 51  | 6   | 436 | 77  | 123 | 30  |
Val      | 30  | 12  | 9   | 1   | 100 | 14  | 29  | 8   |
Test     | 35  | 14  | 9   | 1   | 96  | 12  | 29  | 8   |
  • We observed performance metrics that are on par with SOTA methods like DTFD and ACMIL (table below). We will add these results to those in Table 1 of the main text.
Method      |     CAMELYON16+(4-class) |    CAMELYON17+(4-class)   | 
            |  Acc   |  AUC   |   F1   |   Acc   |   AUC  |   F1   | 
DTFD        |  0.918 |  0.947 |  0.904 |  0.896  |  0.903 |  0.891 | 
ACMIL       |  0.962 |  0.925 |  0.965 |  0.826  |  0.932 |  0.827 | 
HAFED       |  0.963 |  0.911 |  0.949 |  0.883  |  0.931 |  0.868 | 
SASHA-0.1   |  0.856 |  0.880 |  0.801 |  0.869  |  0.914 |  0.841 | 
SASHA-0.2   |  0.939 |  0.937 |  0.894 |  0.889  |  0.921 |  0.864 |

BRACS multi-class classification

  • For the BRACS (BReAst Carcinoma Subtyping) dataset, we employed the public dataset available at [2].
  • This dataset comprises 547 labeled WSIs and involves a 7-class classification that characterizes the tumor as one of 3 benign (BT), 2 atypical (AT), or 2 malignant (MT) sub-types.
  • We ignored the “normal” sub-type in the BT class (Group_BT/Type_N), and the two sub-types within each class were grouped into their respective “uber”-class.
  • In addition, a few slides that did not include all relevant levels of magnification (7/503) were discarded.
  • The final distribution of train/validation/test WSIs is as follows:
         |        BRACS (3-class)        |
         | Benign | Atypical | Malignant |
Train    |  171   |   52     |   135     |
Val      |   16   |   14     |    21     |
Test     |   25   |   23     |    16     |
  • We performed a 3-class classification among the BT, AT, and MT classes and compared the performance of SASHA-0.2 with the other SOTA models.
Method      |         BRACS (3-class)        |   
            | Accuracy  |   AUC    |   F1    | 
DTFD        |   0.609   |   0.796  |  0.609  | 
ACMIL       |   0.547   |   0.750  |  0.522  | 
HAFED       |   0.593   |   0.756  |  0.559  | 
SASHA-0.2   |   0.609   |   0.755  |  0.589  | 
  • We find that SASHA, even while sampling only 20% of the patches, performs comparably with methods like DTFD, and even outperforms ACMIL, each of which samples all of the WSI patches at high resolution.
  • The lower F1 score compared to the SOTA (0.695, Brancati & Frucci, Information, 2024) may be because we used the intermediate 5x→20x zoom levels for SASHA, whereas conventional models have employed the highest zoom level (40x).
  • In future work, we will perform more extensive hyperparameter searches, extend this classification to all the different subtypes (2 per class) as well as normal, and evaluate our model on this 7-class classification problem.
  • These results will also be reported in the paper.

 

2) Comparison with competing methods for multi-class classification

In addition, in response to a suggestion from reviewer tugf, we have also compared our approach with a previous non-RL-based approach, called ZoomMIL [3], in a multi-class setting.

Differences between SASHA and ZoomMIL

  • Briefly, ZoomMIL uses a rational strategy of zooming into patches recursively, at different levels of magnification, so that all patches need not be sampled at high resolution, thereby providing significant speedups.
  • A key difference is that in ZoomMIL RoI selection for zooming is performed with a gated attention and a differentiable top-K module, and not with an RL agent, like SASHA.
  • Effectively, this sampling is similar to the RL policy, in that our RL agent also chooses to sample and zoom into patches afforded a high attention score by conventional methods, as shown in Fig. 3f (Top-k overlap) and Fig. 3g (Avg. Attention Score).
  • Yet, the RL policy does not exclusively sample the top-K attention patches, and in this sense the two algorithms’ sampling policies are not identical.
  • Moreover, while ZoomMIL samples a fixed budget of K patches at each resolution, SASHA samples a fixed proportion of patches at high resolution (0.1 or 0.2); the two sampling rules are contrasted in the sketch below.
  • Therefore, the relative efficacies must be determined empirically through performance comparisons.
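To make the difference concrete, the sketch below contrasts a hard top-K selection (a simplification of ZoomMIL's differentiable top-K; the gated-attention scoring itself is omitted) with SASHA's fixed-proportion budget. All names and signatures are illustrative assumptions, not code from either paper.

    import torch

    def zoommil_style_select(attention_scores, K=300):
        # fixed budget: keep the K highest-attention patches at each zoom level
        return torch.topk(attention_scores, K).indices

    def sasha_style_budget(num_patches, fraction=0.2):
        # fixed proportion: the episode length scales with the slide size, so
        # larger WSIs receive proportionally more high-resolution samples
        return int(fraction * num_patches)

    # e.g., for a 1,000-patch slide: ZoomMIL zooms into exactly K patches,
    # whereas SASHA-0.2 zooms into ~200, chosen sequentially by the RL policy
    print(len(zoommil_style_select(torch.rand(1000), K=300)))  # 300
    print(sasha_style_budget(1000, fraction=0.2))              # 200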

Performance comparison of SASHA with ZoomMIL

  • We ran ZoomMIL on the CamelyonPlus benchmark using the 4-class classification. We performed this comparison in two stages:

Histopathology-tuned ViT feature extractor

  • First, we re-trained the ZoomMIL model using the histopathology-tuned ViT feature extractor [4] for the 10x→20x zoom levels (top-K=300), for the Camelyon-16 dataset.
  • As shown in the table below, this encoder change improved upon the results reported in the original study, which used a ResNet50 encoder, by ~6-7%.
Method              |    CAMELYON16 (2-class)   |
                    |  Accuracy |  Weighted-F1  |
ZoomMIL (ResNet50)  |   0.842   |     0.833     |
ZoomMIL (ViT-histo) |   0.906   |     0.905     |

Comparison with SASHA & HAFED

  • Second, we report a comparison of ZoomMIL with SASHA-0.1, SASHA-0.2 and HAFED for evaluation of two different datasets -- Camelyon-16 and Camelyon-17 -- with the models trained together using the entire CamelyonPlus database.
  • In this case, we report the results for the 5x→20x zoom levels for SASHA (as in our paper), as well as for ZoomMIL (as in their paper), but using the histopathology-tuned ViT feature extractor for all methods.
  • Moreover, SASHA-0.1 and SASHA-0.2 sample on average 84+/-62 (mean +/- std; median=68) and 168+/-125 (median=136) patches respectively, based on the observation budget. Therefore, we ran ZoomMIL for 2 different values of top-K -- K=80 and K=160 -- for a fair comparison with SASHA-0.1 and -0.2, respectively.
  • We also run ZoomMIL with K=300, as per the choice in the original study for the Camelyon-16 dataset.
Method             |  CAMELYON16+(4-class)  |  CAMELYON17+(4-class) | 
                   |  Acc  |  AUC  |   F1   |  Acc |  AUC  |   F1   | 
ZoomMIL(ViT, K=80) | 0.706 | 0.841 |  0.710 | 0.841| 0.885 |  0.812 | 
ZoomMIL(ViT, K=160)| 0.672 | 0.852 |  0.683 | 0.827| 0.860 |  0.792 |
ZoomMIL(ViT, k=300)| 0.741 | 0.852 |  0.688 | 0.820| 0.859 |  0.790 |
HAFED              | 0.963 | 0.911 |  0.949 | 0.883| 0.931 |  0.868 | 
SASHA-0.1          | 0.856 | 0.880 |  0.801 | 0.869| 0.914 |  0.841 | 
SASHA-0.2          | 0.939 | 0.937 |  0.894 | 0.889| 0.921 |  0.864 |
  • SASHA-0.1 and -0.2 outperform ZoomMIL -- across all top-K values -- by ~5-20% for both Camelyon-16 and Camelyon-17.
  • This difference may be due to the choice of 5x→20x zoom levels. Attention scores -- which ZoomMIL uses for magnification -- were perhaps less reliably computed at the 5x zoom level than at the 10x level, used in their paper.
  • Nevertheless, zooming between the same two levels (5x→20x), SASHA outperformed ZoomMIL.
  • We will perform more extensive hyperparameter searches to validate these findings before reporting them in the paper.

 

References

[1] Ling et al., Science Data Bank, 2025.

[2] Brancati et al., Database, 2022.

[3] Thandiackal et al., ECCV, 2022.

[4] Kang et al., CVPR, 2023.

Comment

I am satisfied with the rebuttal. Thank you, authors, for adding additional experiments that show the proposed method is on par with SOTA on multi-class problems. I will be keeping my score.

Official Review
Rating: 5

The paper presents SASHA, a reinforcement learning-based framework for the efficient classification of gigapixel whole-slide histopathology images (WSIs). Unlike traditional methods that process the entire high-resolution slide, SASHA learns to adaptively sample a small subset (only 10-20%) of high-resolution patches while still achieving state-of-the-art diagnostic classification accuracy.

The framework consists of three main components: 1) Hierarchical Attention-based Feature Distiller (HAFED): a two-stage multi-head attention MIL module that extracts label-aware feature representations from high-resolution WSI patches and aggregates them for slide-level classification. 2) Targeted State Updater (TSU): a network that maintains a global state of the WSI with a d-dimensional feature vector for each low-resolution patch. When a patch is zoomed in to high resolution, the state is updated selectively, only for patches with features correlated to the sampled one. 3) Deep RL agent: a policy network that, given the current WSI state (low-res features and any updated high-res embeddings), sequentially selects the next patch to zoom into, aiming to maximize classification accuracy.
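To make the interplay of the three components concrete, here is a minimal, hypothetical sketch of the inference loop in PyTorch-style Python. The module interfaces, the greedy action selection, and the zoom helper are all illustrative assumptions, not the authors' code.

    import torch

    @torch.no_grad()
    def sasha_inference(low_res_feats, zoom, policy, hafed, tsu, classifier,
                        budget=0.2):
        """low_res_feats: (N, d) features of the N low-resolution patches;
        zoom(i) returns the high-resolution pixels of patch i (assumed helper)."""
        state = low_res_feats.clone()              # per-patch WSI state
        visited = torch.zeros(len(state), dtype=torch.bool)
        for _ in range(int(budget * len(state))):  # visit only 10-20% of patches
            scores = policy(state)                 # one score per patch, (N,)
            scores[visited] = float("-inf")        # never revisit a patch
            a_t = int(torch.argmax(scores))        # greedy stand-in for the policy
            visited[a_t] = True
            v_at = hafed(zoom(a_t))                # distilled high-res feature, (d,)
            state = tsu(state, a_t, v_at)          # targeted update of similar patches
        return classifier(state)                   # slide-level prediction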

On the CAMELYON16 and TCGA-NSCLC benchmarks, SASHA matches or exceeds full-slide MIL baselines (e.g., ACMIL, DTFD) using only 20% of patches, reduces inference time by 4-8x, and cuts memory use by 16x. Extensive ablation studies confirm the importance of each component of the framework, and explainability analyses show that the agent focuses on tumor-rich, highly informative regions of the WSI.

Strengths and Weaknesses

Strengths

  1. Real-world motivation: the challenge of computationally efficient yet accurate WSI analysis is central to digital pathology. The paper tackles the inefficiencies of full-resolution WSI analysis and of prior sparse sampling methods.

  2. Novelty of the architecture: combining label-aware MIL (multi-head HAFED), targeted state propagation (TSU), and on-policy RL into a coherent framework is innovative and well justified.

  3. Strong empirical results: under the same observation budget, SASHA consistently outperforms RLogist by a notable margin (e.g., +9% AUC at 10% sampling). Moreover, SASHA-0.2 achieves near-SOTA performance using only ~20% of high-res patches, and HAFED (100%) is competitive with top full-slide MIL models like DTFD and ACMIL. In addition to its strong performance, the framework offers significant gains in efficiency, with inference time reduced by up to 8x and memory usage by over 16x.

  4. Each component of the framework is well justified: removing any key component, such as the multi-head attention of HAFED or the targeted update mechanism of TSU, or replacing the learned policy with a random one, leads to substantial drops in performance, clearly justifying each design decision.

Weaknesses

  1. Training time cost: although the framework is efficient at inference, training remains costly and less scalable. It requires access to all high-resolution patches and involves computing high-res embeddings V(i) for all patches within each correlated set C, as part of TSU’s training. This contradicts the efficiency objective of avoiding exhaustive high-resolution patch processing (l.195), a trade-off that is acknowledged but not extensively examined.

  2. Generalization beyond binary classification: the evaluation is limited to binary classification tasks. While this is common in the literature, it remains unclear how SASHA would perform in the multi-class or multi-label settings common in real-world pathology.

  3. Minor presentation inconsistency: there is a discrepancy between Section 3.3 and Figure 2b regarding the TSU input. The figure and appendix suggest that S(i), S(a_t), and V(a_t) are concatenated, whereas the text says only S(i) and V(a_t) are used.

Questions

  1. TSU input clarification: does the MLP of TSU receive [S(i), S(a_t), V(a_t)] (as per Fig. 2b & Appendix) or only [S(i), V(a_t)] (as per Sec. 3.3)?
  2. Definition of “time fraction”: in Fig. 3a and 3b, the patch boxes are colored according to the “time fraction in episode.” However, the precise meaning of this term is not defined in the text. Is it meant to indicate the normalized timestep at which each patch was visited within an RL episode? Clarifying this would aid interpretability.
  3. Multi-class extension: have you attempted (or can you comment on) applying SASHA to tasks with > 2 labels, such as tumor grading or multi-label subtype classification?
  4. Pretraining sensitivity: in your ablations, replacing the WSI-pretrained ViT with an ImageNet-pretrained ResNet50 caused a substantial drop. Could you discuss how robust performance is if the ViT encoder is pretrained on smaller or out-of-domain datasets?

Limitations

none relevant

Final Justification

The paper is solid and technically sound. The experiments are good and the novelty of the proposal is significant. The authors provide a solid rebuttal.

Formatting Issues

none relevant

Author Response

We thank the reviewer for the valuable suggestions and for the positive comments on our motivation, novelty, strength of evaluation, and justification for the model components. We address the questions below.
 

1) TSU input clarification

there’s a discrepancy between Section 3.3 and Figure 2b

Typo corrected

  • Thank you for identifying this typo. As shown in the figure and clarified in the Appendix: “For each patch a_τ that crosses the similarity threshold, the TSU model takes as input a concatenated vector comprising (i) its low-resolution features Z_a_τ ∈ R^d, (ii) the low-resolution feature representation of the selected patch a_t, Z_a_t ∈ R^d, and (iii) the high-resolution feature representation of the selected patch, V_a_t ∈ R^d.”
  • We will make the description of the TSU in Section 3.3 consistent with Fig. 2b in the main text; a minimal sketch of this concatenation follows this list.
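A minimal sketch of this concatenated input, assuming an illustrative feature dimension and a generic two-layer MLP (the actual TSU architecture may differ):

    import torch
    import torch.nn as nn

    d = 384                                        # assumed feature dimension
    tsu_mlp = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU(), nn.Linear(d, d))

    def tsu_update(z_atau, z_at, v_at):
        """z_atau: low-res features Z_a_tau of a patch crossing the similarity
        threshold; z_at: low-res features Z_a_t of the selected patch;
        v_at: high-res features V_a_t of the selected patch (each of shape (d,))."""
        return tsu_mlp(torch.cat([z_atau, z_at, v_at], dim=-1))  # updated state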

 

2) Definition of “time fraction”

the precise meaning of this term is not defined

Definition added

  • We regret the lack of clarity. As the reviewer correctly points out, it is the normalized timestep (time fraction = t/T, where t is the timestep at which the patch was visited and T is the episode length) at which each patch was visited by the RL agent within an episode.
  • We will clarify this in the revised manuscript.

 

3) Multi-class extension

have you attempted (or can you comment on) applying SASHA to tasks with > 2 labels

Per the reviewer’s suggestion, we have now evaluated two other benchmarks, CamelyonPlus and the BRACS dataset, each of which represents a multi-class classification problem.

CamelyonPlus 4-class classification

  • First, we employed the curated CamelyonPlus dataset [1].
  • This dataset comprises 1,347 WSIs with labels from the curated Camelyon-16 and Camelyon-17 datasets after removing low-quality slides and upgrading the task to a four-class classification with the following labels: negative, micro-metastasis, macro-metastasis, and Isolated Tumor Cells (ITC).
  • The following are the train/validation/test splits (numbers indicate WSI counts). We trained the model with the pooled datasets and report the performance separately for the Camelyon-16 and Camelyon-17 test WSIs.
         |  CAMELYON16+(4-class) |  CAMELYON17+(4-class) | 
         | Neg | Mic | Mac | ITC | Neg | Mic | Mac | ITC |
Train    | 172 | 45  | 51  | 6   | 436 | 77  | 123 | 30  |
Val      | 30  | 12  | 9   | 1   | 100 | 14  | 29  | 8   |
Test     | 35  | 14  | 9   | 1   | 96  | 12  | 29  | 8   |
  • We observed performance metrics that are on par with SOTA methods like DTFD and ACMIL (table below). We will add these results to those in Table 1.
Method      |     CAMELYON16+(4-class) |    CAMELYON17+(4-class)   | 
            |  Acc   |  AUC   |   F1   |   Acc   |   AUC  |   F1   | 
DTFD        |  0.918 |  0.947 |  0.904 |  0.896  |  0.903 |  0.891 | 
ACMIL       |  0.962 |  0.925 |  0.965 |  0.826  |  0.932 |  0.827 | 
HAFED       |  0.963 |  0.911 |  0.949 |  0.883  |  0.931 |  0.868 | 
SASHA-0.1   |  0.856 |  0.880 |  0.801 |  0.869  |  0.914 |  0.841 | 
SASHA-0.2   |  0.939 |  0.937 |  0.894 |  0.889  |  0.921 |  0.864 |

BRACS multi-class classification

  • For the BRACS (BReAst Carcinoma Subtyping) dataset, we employed the public dataset available at [2].
  • This dataset comprises 547 labeled WSIs and involves a 7-class classification that characterizes the tumor as one of 3 benign (BT), 2 atypical (AT), or 2 malignant (MT) sub-types.
  • We excluded the “normal” sub-type in the BT class (Group_BT/Type_N), and the two sub-types within each class were grouped into their respective “uber”-class.
  • In addition, a few slides that did not include all relevant levels of magnification (7/503) were discarded.
  • The final distribution of train/validation/test WSIs is as follows:
         |        BRACS (3-class)        |
         | Benign | Atypical | Malignant |
Train    |  171   |   52     |   135     |
Val      |   16   |   14     |    21     |
Test     |   25   |   23     |    16     |
  • We performed a 3-class classification among the BT, AT, and MT classes and compared the performance of SASHA-0.2 with the other SOTA models.
Method      |         BRACS (3-class)        |   
            | Accuracy  |   AUC    |   F1    | 
DTFD        |   0.609   |   0.796  |  0.609  | 
ACMIL       |   0.547   |   0.750  |  0.522  | 
HAFED       |   0.593   |   0.756  |  0.559  | 
SASHA-0.2   |   0.609   |   0.755  |  0.589  | 
  • We find that SASHA, even while sampling only 20% of the patches, performs comparably with methods like DTFD, and even outperforms ACMIL, each of which samples all of the WSI patches at high resolution.
  • The lower F1 score compared to the SOTA (0.695, Brancati & Frucci, Information, 2024) may be because we used the intermediate 5x→20x zoom levels for SASHA, whereas conventional models have employed the highest zoom level (40x).
  • In future work, we will extend this classification to all the different subtypes (2 per class) as well as normal, perform more extensive hyperparameter searches, and evaluate our model on this 7-class classification problem.
  • These results will also be added to the paper.

 

4) Pretraining sensitivity

discuss how robust is performance if the ViT encoder is pretrained on smaller or out-of-domain datasets

ViT-encoder generalizes to OOD data

  • We seek to clarify that our ViT encoder was not pretrained on the WSIs specific to each dataset before training.
  • Rather, it was pretrained on a common set of datasets, including >20k WSIs from the TCGA dataset and >15k WSIs from the TULIP dataset, as described in [3]; this pretrained ViT encoder was used as-is, without further fine-tuning.
  • In other words, we did not tailor the ViT encoder to be pretrained on the specific datasets on which it was to be evaluated.
  • The fact that our model (SASHA) performed well on the Camelyon-16 dataset, even though this dataset was not part of the ViT’s pretraining, indicates that these features were general enough to extend to OOD (out-of-distribution) scenarios.
  • We will clarify this in the revision.

Alternative encoder models are effective

  • Per the reviewer’s suggestion, we have now tested an alternative, histopathology-based, vision-language foundation model -- CONCH (CONtrastive learning from Captions for Histopathology).
  • CONCH is pretrained on 1.17M image caption pairs and provides significant advantages with feature extraction for downstream tasks [4].
  • Importantly, because the CONCH dataset was created by crawling the PubMed Open Access (OA) database of histopathology images, and the model was trained in an unsupervised manner without access to WSI labels, it is not in any way tailored to the specific datasets used in our study (e.g., Camelyon-16).
  • Here, we utilized only the image-encoder output from the CONCH model, and ignored the text-encoder output, for encoding WSI features. Performance was evaluated with the Camelyon-16 dataset.
Variant   | Feat. | Classif. | TSU | RL Pol.| Accuracy | AUC   | F1 
SASHA-0.2 | ✓     | ✓        | ✓   | ✓      | 0.964    | 0.980 | 0.953 
ResNet-50 | *     | ✓        | ✓   | ✓      | 0.860    | 0.817 | 0.780 
CONCH     | *     | ✓        | ✓   | ✓      | 0.930    | 0.950 | 0.905 
  • As shown in the ablation experiment table above (extending Table 2 in the paper; ✓ denotes the original component and * the replaced one), we find that replacing the ViT with the CONCH encoder produces a far smaller drop in performance than replacing it with ResNet-50.
  • Broadly, our results highlight the advantage of employing a feature extractor appropriate for medical image analyses, over generic feature extractors.

We will report this in the revised “ablation experiments” in Table 2, and discuss the sensitivity to the pretrained encoder in the revision.


 

5) Training time cost

training remains costly and less scalable … a trade-off that is acknowledged but not extensively examined.

We appreciate this important point.

Value in low-resource settings

  • In its current form, the algorithm still offers significant value in low-resource settings.
  • By training the full model in high-resource settings, and fine-tuning HAFED and TSU with a limited number of OOD (out-of-distribution) WSI images, we can run the more efficient RL inference engine on unlabeled WSIs in low-resource settings.

End-to-end pipeline for RL training

  • To reduce training time, future improvements would involve developing an end-to-end pipeline that trains the RL agent concurrently with the HAFED and the TSU modules.
  • In such a model, the policy network decides the specific low-resolution patch to sample in real time, followed by HAFED feature distillation for the zoomed-in patch, and TSU state update for the entire slide, with all steps being trained end to end with reward at each time step.
  • This would be followed by the additional steps of training the policy and value networks, and so on, until the end of each episode, for several episodes.
  • With this method, there would be no need to sample every patch at high resolution during training, and this would align with how RL agents conventionally learn in novel environments with sporadic rewards [5].

Additional points

  • In the response #1 to reviewer fOU5, titled Training Cost, we have provided a detailed discussion about this point.
  • We are unable to repeat the entire response here, due to limits on character counts. We request the reviewer to refer to that response for further details regarding RL agent training stability and the like.

We will discuss this scope for improvement in detail in the revision, in Appendix A.6 (Limitations).


 

References

[1] Ling et al., Science Data Bank, 2025.

[2] Brancati et al., Database, 2022.

[3] Kang et al., CVPR, 2023.

[4] Lu et al., Nature Medicine, 2024.

[5] Mnih et al., Nature, 2015.

Comment

I am pleased with the response, which strengthens my opinion of the work. I am still inclined to score the paper as a clear accept.

Official Review
Rating: 5

The authors introduce SASHA, a novel deep RL framework designed for the efficient diagnosis of gigapixel-sized whole-slide images (WSIs) in histopathology. SASHA uses an RL agent that intelligently samples a small fraction (10-20%) of informative patches from a low-resolution view of the WSI to "zoom in" on for high-resolution analysis. It consists of several key modules:

Hierarchical Attention-based Feature Distiller (HAFED): a 2-stage attention model that learns to extract diagnostically relevant features from high-resolution patches, aggregating them into both a set-based representation (for updating state via the TSU) and a slide-level representation (for classification).

Targeted State Updater (TSU): an efficient method to update the representation of the WSI. After a patch is selected and analyzed at high resolution, the TSU updates the state of not only the selected patch but also other, unsampled patches that are highly correlated with it, supervised by their ground-truth features during training.

On two WSI-classification benchmarks, the paper demonstrates that SASHA achieves classification performance comparable to state-of-the-art methods that analyze the entire WSI at high resolution, but does so at a fraction of the computational cost, with up to 8x faster inference times and significantly lower memory requirements.

Strengths and Weaknesses

Strengths

The method is well thought out and motivated, maximizing the learning signal from multiple resolutions while allowing computationally efficient inference after training.

Weaknesses

The experiments are lacking, with only 2 datasets that focus on binary diagnosis (e.g., no prognosis tasks or more challenging large-scale classification tasks) and are already near saturation in performance, making it difficult to assess the potential performance impact across diverse tasks in computational pathology. The norm in the field for MIL-based methods is usually 5+ benchmarks, in addition to potential few-shot / ablation experiments.

There is also no comparison with non-RL-based sampling methods such as ZoomMIL.

Questions

I recommend expanding the scope of evaluation to at least 2-3 other benchmarks, including more challenging classification tasks (e.g., BRACS or EBRAINS), as well as prognostic tasks (e.g., TCGA or SurGen). A comparison with non-RL-based sampling methods such as ZoomMIL should also be added.

Limitations

Yes.

Final Justification

My concerns have been addressed. Raised scores.

Formatting Issues

NA

Author Response

We thank the reviewer for the important suggestions, and for evaluating our method as well-motivated and computationally efficient at inference time. We address the suggestions below.
 

1) Expanding the scope, including more challenging classification

recommend expanding the scope of evaluation to at least 2 - 3 other benchmarks, including more challenging classification tasks

Per the reviewer’s suggestion, we have now evaluated two other benchmarks, CamelyonPlus and the BRACS dataset, each of which is a multi-class classification problem.

CamelyonPlus 4-class classification

  • First, we employed the curated CamelyonPlus dataset [1].
  • This dataset comprises 1,347 WSIs from the Camelyon-16 and Camelyon-17 datasets after removing low-quality slides and upgrading the task to a four-class classification: negative, micro-metastasis, macro-metastasis, and Isolated Tumor Cells (ITC).
  • The following are the train/validation/test splits (numbers indicate WSI counts). We trained the model with the pooled datasets and report the performance separately for the Camelyon-16 and Camelyon-17 test WSIs.
         |  CAMELYON16+(4-class) |  CAMELYON17+(4-class) | 
         | Neg | Mic | Mac | ITC | Neg | Mic | Mac | ITC |
Train    | 172 | 45  | 51  | 6   | 436 | 77  | 123 | 30  |
Val      | 30  | 12  | 9   | 1   | 100 | 14  | 29  | 8   |
Test     | 35  | 14  | 9   | 1   | 96  | 12  | 29  | 8   |
  • We observed performance metrics that are on par with SOTA methods like DTFD and ACMIL (table below). We will add these results to those in Table 1.
Method      |     CAMELYON16+(4-class) |    CAMELYON17+(4-class)   | 
            |  Acc   |  AUC   |   F1   |   Acc   |   AUC  |   F1   | 
DTFD        |  0.918 |  0.947 |  0.904 |  0.896  |  0.903 |  0.891 | 
ACMIL       |  0.962 |  0.925 |  0.965 |  0.826  |  0.932 |  0.827 | 
HAFED       |  0.963 |  0.911 |  0.949 |  0.883  |  0.931 |  0.868 | 
SASHA-0.1   |  0.856 |  0.880 |  0.801 |  0.869  |  0.914 |  0.841 | 
SASHA-0.2   |  0.939 |  0.937 |  0.894 |  0.889  |  0.921 |  0.864 |

BRACS multi-class classification

  • For the BRACS (BReAst Carcinoma Subtyping) dataset, we employed the public dataset available at [2].
  • This dataset comprises 547 labeled WSIs, and involves a 7-class classification that characterizes the tumor as one of 3 benign (BT), 2 atypical (AT), or 2 malignant (MT) sub-types.
  • We grouped the sub-types within each class into their respective “uber”-class and excluded the “normal” sub-type in the BT class (Group_BT/Type_N), as well as 7/503 slides that did not include all relevant levels of magnification.
  • The final distribution of train/validation/test WSIs is as follows:
         |        BRACS (3-class)        |
         | Benign | Atypical | Malignant |
Train    |  171   |   52     |   135     |
Val      |   16   |   14     |    21     |
Test     |   25   |   23     |    16     |
  • We performed a 3-class classification among the BT, AT, and MT classes and compared the performance of SASHA-0.2 with the other SOTA models.
Method      |         BRACS (3-class)        |   
            | Accuracy  |   AUC    |   F1    | 
DTFD        |   0.609   |   0.796  |  0.609  | 
ACMIL       |   0.547   |   0.750  |  0.522  | 
HAFED       |   0.593   |   0.756  |  0.559  | 
SASHA-0.2   |   0.609   |   0.755  |  0.589  | 
  • SASHA, even while sampling only 20% of the patches, performs comparably with DTFD, and outperforms ACMIL, each of which samples 100% of the patches at high resolution.
  • The lower F1 score compared to the SOTA (0.695, Brancati & Frucci, Information, 2024) may be because we used the intermediate 5x→20x zoom levels for SASHA, whereas conventional models have employed the highest zoom level (40x).
  • In future work, we will perform more extensive hyperparameter searches, extend this classification to all the different subtypes (2 per class) as well as normal, and evaluate our model on this 7-class classification problem.

 

2) Comparison with non-RL based methods

Comparison with non-RL based sampling methods such as ZoomMIL should also be added.

Differences between SASHA and ZoomMIL

  • There are comparatively few non-RL-based sampling methods for histopathology. We thank the reviewer for the pointer to ZoomMIL [3].
  • ZoomMIL uses a rational strategy of zooming into patches recursively, at different levels of magnification, so that all patches need not be sampled at high resolution, thereby providing significant speedups.
  • In ZoomMIL, RoI selection for zooming is performed with gated attention and a differentiable top-K module, rather than with an RL agent as in SASHA.
  • One similarity is that our RL agent also chooses to sample and zoom into patches afforded a high attention score by conventional methods, as shown in Fig. 3f (Top-k overlap) and Fig. 3g (Avg. Attention Score).
  • Yet, the RL policy does not exclusively sample the top-K attention patches and, hence, the two policies are not identical.
  • Also, while ZoomMIL samples a fixed budget of K patches at each resolution, SASHA samples a fixed proportion of patches at high resolution (0.1 or 0.2).
  • Their relative performances must be compared empirically.

Performance comparison of SASHA with ZoomMIL

  • We ran ZoomMIL on the CamelyonPlus benchmark using the 4-class classification, in two stages:

Histopathology-tuned ViT feature extractor

  • First, we re-trained the ZoomMIL model using the histopathology-tuned ViT feature extractor [4] for the 10x→20x zoom levels (top-K=300), for the Camelyon-16 dataset.
  • This encoder change improved upon the results reported in the original study, based on the ResNet50 encoder, by ~6-7%.
Method              |    CAMELYON16 (2-class)   |
                    |  Accuracy |  Weighted-F1  |
ZoomMIL (ResNet50)  |   0.842   |     0.833     |
ZoomMIL (ViT-histo) |   0.906   |     0.905     |

Comparison with SASHA & HAFED

  • Second, we compared ZoomMIL with SASHA-0.1, SASHA-0.2 and HAFED for evaluation of two different datasets -- Camelyon-16 and Camelyon-17. Models were trained together using the entire CamelyonPlus database.
  • We report the results for the 5x→20x zoom levels for SASHA and ZoomMIL, but using the histopathology-tuned ViT feature extractor for all methods.
  • Moreover, SASHA-0.1 and SASHA-0.2 sample on average 84+/-62 (mean +/- std, median=68) and 168+/-125 (median=136) patches respectively, based on the observation budget. Therefore, we run ZoomMIL for 2 different values of top-K -- K=80 and 160 -- for a fair comparison with SASHA-0.1 and -0.2, respectively.
  • We also ran ZoomMIL with K=300 for the Camelyon-16 dataset, as per the original study.
Method             |  CAMELYON16+(4-class)  |  CAMELYON17+(4-class) | 
                   |  Acc  |  AUC  |   F1   |  Acc |  AUC  |   F1   | 
ZoomMIL(ViT, K=80) | 0.706 | 0.841 |  0.710 | 0.841| 0.885 |  0.812 | 
ZoomMIL(ViT, K=160)| 0.672 | 0.852 |  0.683 | 0.827| 0.860 |  0.792 |
ZoomMIL(ViT, K=300)| 0.741 | 0.852 |  0.688 | 0.820| 0.859 |  0.790 |
HAFED              | 0.963 | 0.911 |  0.949 | 0.883| 0.931 |  0.868 | 
SASHA-0.1          | 0.856 | 0.880 |  0.801 | 0.869| 0.914 |  0.841 | 
SASHA-0.2          | 0.939 | 0.937 |  0.894 | 0.889| 0.921 |  0.864 |
  • SASHA-0.1 and -0.2 outperform ZoomMIL -- across all top-K values -- by ~5-20% for both Camelyon-16 and Camelyon-17.
  • This difference may be due to the choice of 5x→20x zoom levels. Attention scores -- which ZoomMIL uses for magnification -- were perhaps less reliably computed at the 5x zoom level than at the 10x level, used in their paper.
  • Nevertheless, zooming between the same two levels (5x→20x), SASHA outperformed ZoomMIL.
  • We will perform more extensive hyperparameter searches to validate these findings before reporting them in the paper.

 

3) Ablation experiments

norm in the field … is usually 5+ benchmarks, in addition to potential few-shot / ablation experiments.

Detailed ablation experiments are reported in the paper

  • In the paper, Section 4.4 and Appendix A.3, we already conducted full-fledged ablation experiments by removing or replacing each component of the SASHA model -- including the feature extractor, classifier (HAFED), state-updater (TSU) and RL agent policy.
  • We describe these in the paper as follows: ”We performed a systematic ablation study removing each component of the model in turn – or replacing it with a naive variant – and evaluating it with the CAMELYON16 dataset. These included: a) replacing the WSI-pretrained ViT with a ResNet50, used commonly for feature extraction [37, 36], b) replacing multi-branch attention with single branch in the Classifier [14], c) replacing the targeted state update policy with a global state update [36], d) removing the TSU and updating the state only of the sampled (local, high-resolution) patch, and e) selecting an action based on a random policy (see “Variant”s in Table 2). In every case, we observed sharp drops in accuracy, ranging from 7-45%, relative to the baseline SASHA model... (Table 2). The strongest impact occurred upon changing the RL policy (Table 2, e), with the next most impactful effects occurring upon changing the feature extractor (a) or the state update method (c-d). The least impact occurred upon changing the classifier to single-branch attention (b), but even here, the F1 score dropped by 14 percentage points. Additional ablation experiments, including exploring the effect of terminal versus intermediate reward during RL agent training and adopting a stochastic action selection policy, are described in Appendix A.3.”
  • These results indicate that every component of our model was critical for achieving high performance, comparable to SOTA models.

 

References

[1] Ling et al., Science Data Bank, 2025.

[2] Brancati et al., Database, 2022.

[3] Thandiackal et al., ECCV, 2022.

[4] Kang et al., CVPR, 2023.

Official Review
Rating: 4

This paper introduces Sequential Attention-based Sampling for Histopathological Analysis (SASHA), a deep reinforcement learning (RL) framework that selectively analyzes diagnostically relevant high-resolution regions in gigapixel whole-slide images (WSIs). SASHA achieves performance comparable to state-of-the-art full-resolution models while utilizing only 10–20% of the high-resolution patches.

Strengths and Weaknesses

Strengths:

(1) The proposed approach closely mimics practical diagnostic workflows -- initial low-resolution scanning followed by selective high-resolution inspection.

(2) The paper is well-written, clearly structured, and easy to follow.

(3) Experimental results on two benchmark datasets demonstrate that the proposed method outperforms existing approaches while using only 10–20% of the high-resolution patches.

(4) Qualitative visualizations further support the effectiveness of the proposed sampling strategy.

Weaknesses:

(1) Training Cost: The method requires access to all high-resolution patches during training, which contradicts its low-resource inference objective and significantly increases training time and memory usage.

(2) Inference Time Reporting: While Section 4.3 provides a general analysis of inference efficiency, the paper would benefit from more detailed, side-by-side comparisons of actual inference times across methods.

(3) Clinical Deployment Considerations: Although the approach is promising, the paper lacks discussion on practical deployment challenges in clinical settings, such as interpretability, robustness to failure, and integration into existing diagnostic workflows.

(4) TSU Similarity Threshold (τ): The threshold parameter τ plays a key role in the targeted state update mechanism, yet no sensitivity analysis or justification is provided for its choice.

Questions

See the Weaknesses.

Limitations

Yes

Final Justification

The experiments included in the rebuttal have satisfactorily addressed some of the issues I raised; thus I will uphold my existing rating.

Formatting Issues

N/A

Author Response

We thank the reviewer for these valuable suggestions, and for the positive review of our methods, writing, evaluation, and visualizations. We address each question below.

1) Training Cost

method requires access to all high-resolution patches during training

Substantial savings during inference

  • Our approach is similar in training cost to existing state-of-the-art models such as ABMIL [1], ACMIL [2], and DTFD [3], which require access to all high-resolution patches, and perform feature extraction on them, during training. Indeed, we acknowledge this limitation in Section 5.
  • Yet, where we differ -- and a major advantage of our approach -- is our low-resource inference objective: access to all high-resolution patches is not required at inference time. This makes our inference 4x-8x faster, depending on the observation budget -- a substantial saving, without compromising performance.

Stable training of the RL agent

  • We train and freeze the feature extraction (HAFED) and state-update (TSU) mechanism before training the RL agent.
  • An advantage of such an approach is that it stabilizes the training of the RL agent, because these other modules do not have to train concurrently with the policy and value networks.

Value in low-resource settings

  • In its current form, the algorithm still offers significant value in low-resource settings.
  • By training the full model in high-resource settings, and fine-tuning HAFED and TSU with a limited number of OOD (out-of-distribution) WSI images, we can run the more efficient RL inference engine on unlabeled WSIs in low-resource settings.

End-to-end pipeline for RL training

  • To reduce training time, future improvements would involve developing an end-to-end pipeline that trains the RL agent concurrently with the HAFED and the TSU modules.
  • In such a model, the policy network decides the specific low-resolution patch to sample in real time, followed by HAFED feature distillation for the zoomed-in patch, and TSU state update for the entire slide, with all steps being trained end to end with reward at each time step.
  • This would be followed by the additional steps of training the policy and value networks, and so on, until the end of each episode, for several episodes.
  • With this method, there would be no need to sample every patch at high resolution during training, and this would align with how RL agents conventionally learn in novel environments with sporadic rewards [4]. A rough sketch of such a loop follows this list.
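As a rough illustration of the loop sketched above -- the environment interface, the REINFORCE-style update, and the single gradient step per episode are all assumptions for exposition, not the authors' design:

    import torch

    def train_episode(env, policy, value, hafed, tsu, optimizer, n_steps):
        """Assumed interface: env.reset() yields the low-res state, env.zoom(a)
        returns high-res pixels, env.reward(s) returns a scalar per-step reward."""
        state = env.reset()
        log_probs, rewards = [], []
        for _ in range(n_steps):
            dist = torch.distributions.Categorical(logits=policy(state))
            a_t = dist.sample()                    # patch chosen in real time
            v_at = hafed(env.zoom(a_t))            # distill only the sampled patch
            state = tsu(state, a_t, v_at)          # update the whole-slide state
            log_probs.append(dist.log_prob(a_t))
            rewards.append(env.reward(state))      # reward at each time step
        returns = torch.cumsum(torch.tensor(rewards[::-1]), 0).flip(0)
        baseline = value(state).squeeze().detach() # value net trained separately (omitted)
        loss = -(torch.stack(log_probs) * (returns - baseline)).mean()
        optimizer.zero_grad(); loss.backward(); optimizer.step()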

Caveats

  • Yet, in this case, there could be challenges associated with stable training of the RL agent (the policy and value networks), along with the HAFED and TSU networks.
  • We seek to overcome these challenges through alternating or sequential minimization of the loss functions associated with each of these components [5], thereby training all components in a coordinated manner; this is scope for future work.

We will discuss this scope for improvement in detail in the revision, in Appendix A.6 (Limitations).


 

2) Inference Time Reporting:

detailed, side-by-side comparisons of actual inference times across methods

Detailed inference time comparisons

  • We have provided below a detailed table of inference times (in seconds) for the different methods, including the breakdown across the different components of SASHA.
  • Our method achieves a ~4-8x speedup in total inference time.
Method    |Patch | FE(HR) | FE(LR) | Infer. | TSU  |  RL   |  Total
DTFD      | 1.46 | 116.75 |   N/A  | 0.0230 |  N/A |  N/A  | 116.77
ACMIL     | 1.46 | 116.75 |   N/A  | 0.0140 |  N/A |  N/A  | 116.76 
HAFED     | 1.46 | 116.75 |   N/A  | 0.0010 |  N/A |  N/A  | 116.75
SASHA-0.1 | 1.46 |   8.60 |  4.74  | 0.0007 | 2.06 | 0.105 |  15.50
SASHA-0.2 | 1.46 |  18.23 |  4.95  | 0.0007 | 4.09 | 0.219 |  27.49

Table Caption

  • FE: Feature Extraction | HR: High Resolution | LR: Low Resolution | Infer.: Inference time for one model pass | TSU: Time for state update | RL: Time for policy and value update
  • “Total” time (last column) corresponds to the values shown in Fig. 3c. of the paper.
  • “Patch” refers to time required for dividing the WSI at low resolution into non-overlapping 256x256 patches, while also removing background content (e.g. portions with the glass slide alone, or with insufficient biological tissue).
  • This is a constant overhead for all methods, and is not included in the Total time.
  • For SASHA, “Infer.” corresponds to time for one forward pass through the HAFED model.

Conclusions

  • The main cost saving at inference time is with feature extraction for the high-resolution patches.
  • Because SASHA visits only a small fraction (10-20%) of patches at high resolution, this yields considerable savings over other approaches that visit all (100% of) patches at high resolution.
  • In addition to feature extraction, the TSU is a secondary bottleneck in SASHA, because at each timestep the cosine similarity with all other non-masked patches must be computed, albeit at low resolution.
  • We are working toward making this component more computationally efficient, for example by precomputing the pairwise similarities and storing them in a lookup table (see the sketch below).

We will discuss this approach in the revision, and seek to incorporate it into future extensions of our method.
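A sketch of this lookup-table idea, with assumed names and shapes; the (N, N) table is computed once per slide from the low-resolution features, after which each timestep reduces to a row lookup plus a threshold test:

    import torch
    import torch.nn.functional as F

    def precompute_similarity(low_res_feats):
        z = F.normalize(low_res_feats, dim=1)      # (N, d) unit-norm features
        return z @ z.T                             # (N, N) cosine-similarity table

    sim = precompute_similarity(torch.randn(1000, 384))
    a_t, tau = 42, 0.95                            # sampled patch and threshold
    correlated = (sim[a_t] >= tau).nonzero(as_tuple=True)[0]  # patches to update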


 

3) Clinical Deployment Considerations:

lacks discussion on practical deployment challenges in clinical settings,

With regard to clinical workflows [6] and utility for real-world deployment, we discuss two cornerstone features -- Explainability and Calibration -- in Sections 4.5-4.6.

Explainability -- easy to explain RL agent’s strategies for clinicians

  • In Section 4.5, we show that: “tumor fraction in patches sampled by the RL agent at high resolution (10% with SASHA-0.1 or 20% with SASHA-0.2) significantly exceeds that in a corresponding number of randomly chosen patches, among those not sampled by the agent (Fig. 3e, p<0.001). Moreover, the RL policy guides the agent to preferentially sample patches afforded a high attention score by SOTA models: both the average attention score (Fig. 3g, p<0.001) and the fraction of top-k patches sampled under the RL policy were statistically significantly greater than the fraction of patches sampled by a random policy (Fig. 3f, p<0.001).”
  • In other words, our RL agent consistently selects high-attention, tumor-rich patches far more often than a random policy, making its decisions intuitive and easy to explain to clinicians.

Calibration -- useful indicator for referral to a clinician specialist

  • In Section 4.6 we show that: “sampling a higher fraction of patches yields better calibrated predictions… We quantified the expected calibration error (ECE) [10] as a measure of the degree of calibration of the models. The ECE decreased systematically as the proportion of sampled patches increased from 10% (SASHA-0.1) to 20% (SASHA-0.2) to 100% (HAFED). This pattern occurred possibly because providing the model more timesteps to sample the WSI in each RL episode enabled the model to recover from sub-optimal sampling actions in the early timesteps, and to sample the relevant portions of the WSI in the later timesteps.”
  • Moreover, the ECE of SASHA-0.2 (0.0292) was lower than that of competing SOTA attention-based models like ACMIL (0.0617) and DTFD (0.0452), as shown in Fig. 3h.
  • A lower ECE reflects a better-calibrated model. With well-calibrated models, the model’s confidence in its prediction (e.g., predictive uncertainty) is accurate. In this case, it becomes a useful indicator for when referral to a clinician specialist is warranted.
  • In other words, by quantifying the predictive uncertainty of the SASHA-0.2 classifier with established approaches (e.g., predictive entropy), our RL model could provide clinicians with a valuable preview of when its prediction can (or cannot) be trusted. For reference, a standard ECE computation is sketched after this list.
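For reference, a standard expected-calibration-error computation with equal-width confidence bins (a common convention; the paper's exact binning scheme is an assumption here):

    import numpy as np

    def expected_calibration_error(confidences, correct, n_bins=15):
        """confidences: max predicted probability per slide, in [0, 1];
        correct: 1 where the prediction matched the label, else 0."""
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if in_bin.any():
                gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
                ece += in_bin.mean() * gap         # weight the gap by bin occupancy
        return ece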

We will include this discussion on incorporating our method into clinical workflows in more detail in the revised paper.


 

4) TSU Similarity Threshold:

The threshold parameter τ … no sensitivity analysis or justification is provided for its choice.

We regret the lack of clarity.

τ was tuned as a hyperparameter

  • As shown in Table 6, and related text in Appendix A.2.2, the TSU similarity threshold (τ) was tuned as a hyperparameter with a validation set (details in Table 3 of Appendix A.1) to maximize the sum of AUC and F1 scores.
  • We paste below the relevant rows of the table; a sketch of this validation sweep follows the table.
  • Table 6 in the Appendix also provides a full list of model hyperparameters, selection criteria, and the ranges explored and values selected for the two cancer benchmarks.
Hyperparam    | Sel. crit. |      CAMELYON16      |         TCGA         |
              |            |   Range   |  Values  |   Range   |  Values  |
Threshold (τ) | Validation | 0.88-0.98 | 0.9/0.95 | 0.8-0.92  | 0.9/0.9  |

Caption: Hyperparameters for the CAMELYON16 and TCGA-NSCLC datasets. In the Values column, entries separated by a slash (/) represent values for SASHA-0.1 and SASHA-0.2 models respectively.
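A sketch of the sweep this implies; train_and_eval is a hypothetical callable standing in for one full train/validate run at a given threshold:

    import numpy as np

    def tune_tau(train_and_eval, grid=np.arange(0.88, 0.99, 0.02)):
        best_tau, best_score = None, -np.inf
        for tau in grid:
            auc, f1 = train_and_eval(threshold=tau)  # validation AUC and F1
            if auc + f1 > best_score:                # selection criterion: AUC + F1
                best_tau, best_score = float(tau), auc + f1
        return best_tau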


 

References

[1] Ilse et al., ICML, 2018.

[2] Zhang et al., ECCV, 2024.

[3] Zhang et al., IEEE/CVF CVPR, 2022.

[4] Mnih et al., Nature, 2015.

[5] Choromanska et al., ICML, 2019.

[6] Pantanowitz et al., Lancet Digital Health, 2020.

Comment

Thanks for the clarification. I’ll stick with my borderline accept rating.

Final Decision

The paper initially received mixed reviews (5/5/4/3). The major concerns were:

  1. high training cost, and lacks inference time comparisons [foU5, jTur]
  2. missing ablation study [foU5, tugf]
  3. missing discussion about interpretability, robustness, etc. [foU5]
  4. experiments only on 2 datasets, when 5+ are expected [tugf]
  5. missing comparison to non-RL based sampling methods (ZoomMIL) [tugf]
  6. lacks evaluation on multi-label datasets [jTur, LuCH]

The authors wrote a response to address the concerns. All reviewers were satisfied with the response, and tugf raised score from 3 to 5, resulting in all positive ratings. Overall, the AC agrees with the reviewers, and notes that they all appreciated the motivation, novelty, and experiments. The authors should revise the paper according to the reviews, response, and discussions.