PaperHub
Rating: 5.0 / 10 (withdrawn; 3 reviewers; min 3, max 6, std dev 1.4)
Individual ratings: 6, 3, 6 · Average confidence: 4.3
ICLR 2024

Revisiting Subsampling and Mixup for WSI Classification: A Slot-Attention-Based Approach

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-03-26

Abstract

Keywords
multiple instance learning, regularizer, attention mechanism, medical imaging, histopathology, weakly-supervised learning

Reviews and Discussion

Review
Rating: 6

This paper proposes an original architecture for WSI classification which combines MIL (multiple instance learning) with multi-head attention, slot attention, and pooling. The model can be seen as summarizing the WSI (split into M patches) into S slots, where S is a fixed and small hyperparameter (e.g. 16). Another model of similar architecture classifies these S slots into K classes. The M patches are first converted to codes using a ResNet-18 pre-trained on ImageNet. The paper also proposes a data augmentation scheme based on patch subsampling and MIXUP. Experiments show that the proposed architecture and data augmentation improve upon SOTA on 3 datasets (CAMELYON-16/17 and TCGA-NSCLC) for cancer/non-cancer or subtype classification. Ablation studies show the effect of hyperparameter selection and of combining the data augmentations.
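For concreteness, here is a minimal PyTorch-style sketch of the summarization idea described in this review (cross-attention from S learnable slots to M patch features). The class and parameter names (`SlotPool`, `n_slots`, etc.) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SlotPool(nn.Module):
    """Illustrative sketch: summarize M patch features into S slot vectors via cross-attention."""
    def __init__(self, dim=512, n_slots=16, n_heads=4):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(1, n_slots, dim))   # learnable slot queries
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patches):                                   # patches: (B, M, dim), M can be thousands
        slots = self.slots.expand(patches.size(0), -1, -1)
        summary, _ = self.attn(slots, patches, patches)           # (B, S, dim); cost O(M*S) rather than O(M^2)
        return summary

# A second, similar module would then classify the S slot vectors into K classes;
# here a plain linear head over the pooled slots stands in for it.
feats = torch.randn(1, 5000, 512)                                 # pre-extracted ResNet-18 patch codes
logits = nn.Linear(512, 2)(SlotPool()(feats).mean(dim=1))
```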

Strengths

  • The paper proposes a novel architecture for WSI classification that improves upon SOTA MIL. The architecture choices are well motivated and the results show a clear improvement. The idea of using a fixed number of attention-guided slots to summarize the important patches of a WSI prior to classification is original and powerful.
  • A typical drawback of MIL is that it tends to overtrain, as the number of bags is generally small compared to the number of instances (a single WSI may generate several thousand patches). The proposed approach of subsampling and MIXUP augmentation appears to be effective at reducing overtraining and improving classification measures. Again, those approaches are relatively well motivated in the paper.
  • The choice of using a ResNet-18 pre-trained on ImageNet makes the approach simpler, faster and more reproducible.

Weaknesses

  • The SOTA AUC for all 3 datasets, as reported in the current literature and on the Grand Challenge website, is significantly higher than the baselines chosen in the paper. For TCGA-NSCLC, [1] reports 0.9377 AUC, while the top baseline in the paper is 0.893. For the CAMELYON datasets, the Grand Challenge leaderboard also outperforms the reported baselines. Furthermore, TransMIL [2] reports 0.9309 AUC for CAMELYON-16, while the paper reports it at 0.834! And for TCGA-NSCLC, the discrepancy is 0.893 vs 0.9603! Those are significant differences, which make the proposed approach not SOTA anymore. [update: this issue has been cleared] [1] Zhang, Jingwei, et al. "Gigapixel whole-slide images classification using locally supervised learning." MICCAI 2022. [2] Shao, Zhuchen, et al. "Transmil: Transformer based correlated multiple instance learning for whole slide image classification." NIPS 2021.

  • Different subsampling rates are applied to the 3 datasets, based on information that is not generally known a-priori (the percentage of positive patches). This unfairly inflates the reported performance.

  • It is not clear how the optimal hyperparameters are obtained. At least the subsampling rate seems to be obtained heuristically (see bullet above), so it makes me suspicious about the others too. [update: this issue has been addressed by the authors in the rebuttal]

  • Since one of the claims the authors make repeatedly is that their approach is more computationally efficient, it would be good to include some numbers and compare them to other approaches. [update: this issue has been addressed by the authors in the rebuttal]

Overall, this is a technically solid paper presenting a novel approach to WSI classification using MIL sampling and augmentation. Unfortunately, despite its claims, it doesn't reach SOTA on the reported datasets.

Questions

see weaknesses

Comment
  • Weakness 1

Thank you for providing a detailed comparison.

We would like to highlight that our method achieves an AUC of 0.975 for CAMELYON-16 and an AUC of 0.981 for TCGA-NSCLC, showcasing better-calibrated predictions with improved Negative Log-Likelihood (NLL) values. These results are obtained using features extracted from SimCLR, provided by the DSMIL paper[4].

As previously outlined in our general response, the overall performance disparities observed across papers can be attributed to the challenges posed by deficient and unbalanced datasets. A standardized evaluation protocol becomes crucial in such scenarios. Notably, TRANSMIL stands out as the most heavily parameterized model among the baselines, making it particularly vulnerable to issues like overfitting and changes in the validation set. In the case of CAMELYON-16, TRANSMIL[3] exhibits varying AUC values of 0.877[1], 0.906[2], and 0.931[3].

For further details on our efforts to ensure fair and reproducible comparisons, we invite you to refer to our general response.

  • Weakness 2

In terms of the subsampling rate (p), the empirical evidence from Table 6 and Table 8 in the Appendix shows that adopting any subsampling rate results in a performance gain compared to not adopting it. This implies that the benefits of subsampling can be harnessed without needing a prior understanding of the ratio of positive patches. While we reported the optimal subsampling rate (p), it's worth noting that an improvement over the baseline model can still be achieved without precisely tuning p.
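For readers unfamiliar with the augmentation, a minimal sketch of per-iteration bag subsampling at rate p might look as follows; the function name and shapes are illustrative, not the authors' implementation.

```python
import torch

def subsample_bag(patch_feats: torch.Tensor, p: float = 0.2) -> torch.Tensor:
    """Randomly keep a fraction p of the patches in one bag (WSI); the slide label is unchanged."""
    m = patch_feats.size(0)
    keep = max(1, int(p * m))
    idx = torch.randperm(m)[:keep]        # a fresh random subset every training iteration
    return patch_feats[idx]

# Note: at inference time the authors state that all patches of a slide are used (no subsampling).
```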

  • Weakness 3

The main hyperparameters in our method comprise the number of slots (S), mixup beta distribution alpha (α), subsampling rate (p), and Late-mix (L), as elaborated in Section 4.3 of the paper. We fix the optimal hyperparameters based on the results of k-fold AUC. AUC is recognized as a more robust metric than ACC, as it is not influenced by threshold variations.

For Slot-MIL, where S is the sole hyperparameter, we observe minimal performance differences when the number of slots exceeds a certain threshold. Especially in TCGA-NSCLC, there is no discernible trend when varying the number of slots. It's important to highlight that our approach is based on learnable (implicit) clustering using attention scores, making meticulous hyperparameter search unnecessary, unlike non-learnable k-means clustering. Additional experiment results are provided in the table below.

| # of Slots\Dataset | CAMELYON_16 ACC(⭡) | AUC(⭡) | NLL(⭣) | TCGA-NSCLC ACC(⭡) | AUC(⭡) | NLL(⭣) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| S = 4 | 0.841±0.016 | 0.874±0.026 | 1.000±0.444 | 0.842±0.022 | 0.906±0.022 | 0.901±0.480 |
| S = 8 | 0.846±0.013 | 0.892±0.024 | 1.221±0.445 | 0.843±0.021 | 0.910±0.018 | 0.964±0.337 |
| S = 16 | 0.834±0.047 | 0.893±0.023 | 1.242±0.979 | 0.852±0.025 | 0.914±0.016 | 1.001±0.202 |
| S = 32 | 0.852±0.014 | 0.891±0.031 | 1.438±0.536 | 0.847±0.032 | 0.905±0.031 | 1.106±0.409 |

In the case of Slot-MIL + SubMix, we employ grid search to select hyperparameters, with α ∈ {0.2, 0.5, 1.0}, p ∈ {0.1, 0.2, 0.4}, and L ∈ {0.1, 0.2, 0.3}, while maintaining the same number of slots as in Slot-MIL. Similar to the subsampling rate (p) and the number of slots (S), the results show minimal changes with variations in α and L. It's important not to use an excessively large α (beyond 1.0) or to skip Late-mix. Detailed results can be found in Table 9 in our paper.
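As a rough illustration of the slot-level mixup whose α is being tuned here (a sketch, not the authors' exact implementation), the mixing weight is drawn from Beta(α, α) and applied to the fixed-size slot representations of two slides:

```python
import torch

def slot_mixup(slots_a, slots_b, label_a, label_b, alpha=0.5):
    """Mix two slides at the slot level; both inputs have shape (S, dim) because S is fixed."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed_slots = lam * slots_a + (1 - lam) * slots_b
    mixed_label = lam * label_a + (1 - lam) * label_b   # soft label for the classification loss
    return mixed_slots, mixed_label
```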

Comment
  • Weakness 4

non-fully attention

Non-fully attention methods such as ABMIL and DSMIL do not incorporate attention between all patches, so their efficiency in both training and inference is notable. However, they tend to exhibit inferior performance compared to fully attention methods.

fully-attention

On the other hand, fully attention methods like TRANSMIL, ILRA, and Slot-MIL incorporate attention across all patches in a slide. Among these, Slot-MIL stands out with a training time 2.2 times faster and an inference time 1.4 times faster than its counterparts. Additionally, the FLOPs of Slot-MIL are significantly smaller, owing to the absence of the positional-encoding CNN and iterative attention layers that are essential for the others.

augmentations

When comparing augmentation methods, achieving accurate FLOPs measurements can be challenging, as other papers often incorporate additional methods.

DTFD-MIL, which relies on Grad-CAM[5], is usually based on ABMIL although it is model-agnostic. Its inference time is slow due to the need to distill patches using gradients, which introduces extra computation.

RankMix, based on our Slot-MIL model, requires more time to train, as pre-training a teacher model (the patch rank classifier) is essential for optimal performance. The result without the teacher model is provided in Appendix B.3, offering a fairer comparison in terms of complexity; without a teacher, SubMix clearly outperforms RankMix. Since mixup is not used during the inference phase, the inference time is the same as for Slot-MIL.

Compared to DTFD-MIL and RankMix, Slot-MIL + Slot-Mix and Slot-MIL + SubMix are significantly faster in both training and inference. It's noteworthy that our method achieves fast inference times while demonstrating superior performance on datasets with distribution shifts, such as CAMELYON-17, highlighting its advantages in real-world scenarios.

Experiment Details

We conducted experiments on CAMELYON-17, where the training set comprises 240 WSIs and the test set consists of 198 WSIs. Training time and inference time indicate seconds per epoch. To measure FLOPs, we utilized FlopCountAnalysis with a bag size of 10,000. Notably, we didn't count the FLOPs or parameters of the common feature extractor, for a clearer comparison. We followed the experimental settings for each method detailed in Appendix A.2, and the number of slots (S) employed is 16.

| Method | Training Time | Inference Time | FLOPs | # of Parameters |
|:---:|:---:|:---:|:---:|:---:|
| ABMIL | 9.59 | 4.79 | 1.32G | 132,483 |
| DSMIL | 10.15 | 4.84 | 3.30G | 331,396 |
| TRANSMIL | 36.92 | 9.02 | 50.31G | 2,147,346 |
| ILRA | 27.53 | 7.98 | 9.92G | 1,555,330 |
| Slot-MIL (Ours) | 12.53 | 5.63 | 5.45G | 1,590,785 |
| ABMIL + DTFD-MIL | 15.09 | 6.90 | - | - |
| Slot-MIL + RankMix | 30.91 | 5.64 | - | Additional patch rank classifier |
| Slot-MIL + Slot-Mix (Ours) | 23.20 | 5.62 | - | No additional params |
| Slot-MIL + Sub-Mix (Ours) | 22.74 | 5.66 | - | No additional params |
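As noted above, FLOPs were measured with FlopCountAnalysis (from the fvcore library). A minimal usage sketch with a placeholder model, assuming a bag of 10,000 pre-extracted 512-dimensional features, is:

```python
import torch
from torch import nn
from fvcore.nn import FlopCountAnalysis

# Placeholder aggregator just to demonstrate the measurement call; not the paper's model.
model = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 2))
bag = torch.randn(1, 10_000, 512)                       # bag size of 10,000 as in the setup above

flops = FlopCountAnalysis(model, bag)
print(flops.total())                                    # FLOPs for one forward pass
print(sum(p.numel() for p in model.parameters()))       # parameter count
```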

If you have any further inquiries or require additional clarification, please don't hesitate to reach out.

Best, Authors

References

[1] Xiang, Jinxi, and Jun Zhang. "Exploring low-rank property in multiple instance learning for whole slide image classification." The Eleventh International Conference on Learning Representations. 2022.

[2] Zhang, Hongrun, et al. "DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Shao, Zhuchen, et al. "Transmil: Transformer based correlated multiple instance learning for whole slide image classification." Advances in neural information processing systems 34 (2021): 2136-2147.

[4] Li, Bin, Yin Li, and Kevin W. Eliceiri. "Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[5] Selvaraju, Ramprasaath R., et al. "Grad-CAM: Why did you say that?." arXiv preprint arXiv:1611.07450 (2016).

Comment

I appreciate the authors' extensive response to the reviewers' critiques. In my opinion, most points raised by the reviewers have been addressed constructively. In particular, the misunderstanding related to the SOTA numbers has been cleared. I will update my review accordingly.

Comment

We are happy to hear that our rebuttal addressed your concerns well. We also extended our examination to assess the validity of our method in the context of multi-class classification, thereby expanding its applicability to real-world scenarios.

If you have any further concerns or questions, please do not hesitate to reach out. Your feedback is invaluable to us.

Sincerely, Authors

Review
Rating: 3

The paper proposes Slot-MIL, which incorporates ideas related to inducing points and slot attention to simplify pooling mechanisms in multiple instance learning for WSIs. Specifically, given two WSIs (bags of instance features), the model encodes both into the same number of slots for classification. This work assesses performance on CAMELYON and TCGA-NSCLC subtyping, with additional ablation studies on the impact of subsampling and the number of slots.

Strengths

  • This work performs a comprehensive assessment of different subsampling approaches for WSIs, and of how subsampling strategies perform as data augmentation methods for MIL. In addition to performing substantive ablation experiments for the main augmentation method, SubMix (different parameters and combinations of subsampling and Slot-Mixup), this work also thoroughly assesses ablation strategies in combination with other strategies such as RankMix. Additional figures and presentations in the supplement also convey the stability of training and validation loss under subsampling.

Weaknesses

  • Though adapting new techniques such as slot attention, I found this method to have limited novelty as it addresses common concerns such as patch redundancy in MIL. Many works such as DeepAttnMISL (Yao et al. 2020) achieve similar goals as Slot-MIL in filtering the bag to a smaller set of patches. Overall, relative to the performance improvement demonstrated, the contributions presented by the method may still be too limited and lack extensive validation with diverse downstream tasks.
  • One of the outlined contributions of this work (#3) is that Slot-MIL reaches state-of-the-art performance on CAMELYON and TCGA-NSCLC. Slot-MIL outperforms baselines relative to the comparisons developed in this work. However, when compared across studies, the reported best performance for C16 on the test set underperforms other reported results by a large margin. For example, on C16, whereas the accuracy / AUC for Slot-MIL+SubMix is 0.890 / 0.921, the reported best performance for ILRA-MIL (FRC) in Xiang et al. 2023 is 0.922 / 0.965. In other works such as MHIM-MIL (DSMIL) by Tang et al. 2023, the reported best performance is 0.925 / 0.965 (evaluated using cross-validation, not on the official C16 test set), and Bayes-MIL-APCRF by Cui et al. 2023 has best performances of 0.900 / 0.948. Though not using the same splits for TCGA-NSCLC, the AUC for this task using 10-fold CV is generally 0.930+ (0.977 in Xiang et al. 2023).
  • Benchmarks such as C16 and TCGA-NSCLC lack difficulty and can be easily solved without adapting techniques related to WSI augmentation and slot attention. It would be interesting to explore this method on more diverse tasks that would benefit from data augmentation and "sparsity", such as gene mutation prediction (such as MSI prediction in TCGA-COADREAD), survival analysis, and other challenging tasks such as Gleason score grading in PANDA. The tasks evaluated in this work are limited to diagnostically-simple binary classification problems that do not need sparse MIL or virtual augmentation methods to see clinical translation.
    • Additionally, tasks such as C16, TCGA-NSCLC, and TCGA-RCC have been over-explored in computational pathology and, in the reviewer's opinion, should no longer be the only tasks on which MIL is evaluated. C16 already has many state-of-the-art performances and is nearly solved from both fully-supervised and weakly-supervised perspectives. Similarly, TCGA-NSCLC and TCGA-RCC can generally be solved without requiring sophisticated MIL approaches. Overall, it would be more interesting to demonstrate how this method would enable more challenging tasks to be solved in computational pathology.
  1. Xiang, J. and Zhang, J., 2022, September. Exploring low-rank property in multiple instance learning for whole slide image classification. In The Eleventh International Conference on Learning Representations.
  2. Yufei, C., Liu, Z., Liu, X., Liu, X., Wang, C., Kuo, T.W., Xue, C.J. and Chan, A.B., 2022, September. Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images. In The Eleventh International Conference on Learning Representations.
  3. Tang, W., Huang, S., Zhang, X., Zhou, F., Zhang, Y. and Liu, B., 2023. Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 4078-4087).
  4. Yao, J., Zhu, X., Jonnagaddala, J., Hawkins, N. and Huang, J., 2020. Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks. Medical Image Analysis, 65, p.101789.

Questions

See above comments.

Comment
  • Weakness 1

Thanks for your constructive review.

However, we want to emphasize that our method is a novel approach and contributes substantially to MIL problems, even though the underlying ideas exist in other domains. You mentioned DeepAttnMISL, but we consider DeepAttnMISL to be notably different from ours for two main reasons: its clustering mechanism and its pooling strategy.

Firstly, in the clustering phase, while seemingly achieving similar goals, DeepAttnMISL utilizes non-learnable k-means clustering, which cannot be trained with backpropagation or end-to-end. In contrast, our approach employs a learnable, O(n)-complexity attention-based (implicit) clustering mechanism, which has already been explored in many different deep learning tasks including sets [7], object-centric learning [8], etc. Our method is also much simpler to implement, requiring no manual step to form clusters. This crucial distinction allows our method to adapt more dynamically to the underlying data structure, enhancing its efficacy.

Moreover, in the pooling phase, DeepAttnMISL relies on local attention, whereas our model integrates an O(n) complexity global attention pooling strategy. This strategic choice contributes to the model's ability to capture dependencies between clusters, which may be particularly valuable in tasks involving complex relationships. Again, our method is simple, computationally efficient, robust to distribution shifts, and able to capture underlying structures well, leading to state-of-the-art performance.

It's worth noting that non-learnable k-means clustering has its limitations, as some previous studies have indicated: its adoption results in only a marginal improvement in performance [1], and its application is confined solely to intra-label mixup [2]. Also, k-means clustering shows optimal performance only within a limited range of the number of clusters (k) (as shown in Table 3 of [3] and Fig. A.1 of [2]), while our Slot-MIL shows consistent performance once the number of slots (S) exceeds a certain threshold.

Furthermore, our novel attention clustering, with its linear time complexity, extends its utility to inter-label mixup, thereby addressing a broader spectrum of scenarios. To the best of our knowledge, the mixup between WSIs using a fixed number of attention-based clusters remains unexplored by other researchers.

  • Weakness 2

Thanks for your insightful comparison.

We forgot to mention the results with SimCLR features in the main paper, which were already detailed in Appendix B.6. With comparable contrastive-learning-based features, our model achieves a notable performance of 0.923/0.975 (ACC/AUC), demonstrating superior results compared to ILRA-MIL and the other papers discussed in the context of C16. We'd like to underscore that performance is significantly influenced by the type of pre-trained features and by the evaluation protocol, given the scarcity and imbalance of the overall train/validation/test sets. We encourage you to refer to our general response at the top, where we elaborate on our dedicated efforts to ensure fair and reproducible comparisons.

  • Weakness 3,4

We appreciate your comments, and we find that applying our method to gene mutation prediction or multi-class Gleason grading in PANDA is very intriguing. Recognizing the potential limitations of C16 and TCGA-NSCLC for evaluating sophisticated MIL approaches, we intentionally included the CAMELYON-17 (C17) experiment in Table 4.

C17, being a multi-center dataset with a substantial distribution shift, offers a more realistic reflection of real-world scenarios compared to other datasets. In the case of C17, both Slot-MIL and Slot-MIL + SubMix outperform the baselines by a large margin with well-calibrated predictions. It's worth noting that C17 has been recognized as challenging; even pre-training[5] or surgical fine-tuning[6] has shown limited promise. Notably, our method addresses distribution shifts in C17 without the need for pre-training or fine-tuning.

Bayes-MIL[4] also reports its performance on C17 in their appendix Table 5. However, they didn't specify whether they partitioned the multi-center data to create distribution shifts, which is common practice on C17. It's noteworthy that our method achieves superior performance on C17 without relying on a slide-dependent regularizer (SPDR) or incorporating spatial information (APCRF), both of which are essential components in Bayes-MIL.


Your understanding and consideration are greatly appreciated. Reflecting your feedback, we plan to reorganize the updated version of the paper to highlight the results on C17, emphasizing the strong performance gap. If you have any further questions or require additional clarification, please don't hesitate to reach out.

Best regards, Authors

Comment

References

[1] Chen, Yuan-Chih, and Chun-Shien Lu. "RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[2] Yang, Jiawei, et al. "Remix: A general and efficient framework for multiple instance learning based whole slide image classification." International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2022.

[3] Yao, Jiawen, et al. "Whole slide images based cancer survival prediction using attention guided deep multiple instance learning networks." Medical Image Analysis 65 (2020): 101789.

[4] Yufei, Cui, et al. "Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images." The Eleventh International Conference on Learning Representations. 2022.

[5] Wiles, Olivia, et al. "A fine-grained analysis on distribution shift." arXiv preprint arXiv:2110.11328 (2021).

[6] Lee, Yoonho, et al. "Surgical fine-tuning improves adaptation to distribution shifts." arXiv preprint arXiv:2210.11466 (2022).

[7] Lee, Juho, et al. "Set transformer: A framework for attention-based permutation-invariant neural networks." International conference on machine learning. PMLR, 2019.

[8] Locatello, Francesco, et al. "Object-centric learning with slot attention." Advances in Neural Information Processing Systems 33 (2020): 11525-11538.

Comment

Below are additional questions and concerns following the rebuttal:

Do we need better MIL if improving the SSL encoder demonstrates the most performance gains?: The reviewer thanks the authors for updating their results (Table 5) with self-supervised features using SimCLR. However, these results suggest a different finding – that developing more principled MIL architectures or augmentation strategies is not as significant as using a better SSL encoder. In the CAMELYON-16 results in Table 5, ABMIL reaches the same AUC as base Slot-MIL (with smaller standard deviation) and with lower NLL (better supposed calibration). Slot-MIL is matched in AUC by DTFD-MIL. Potentially, ABMIL with subsampling can do better than Slot-MIL with SubMix (the baselines presented in Table 5 do not seem to have subsampling). In the TCGA-NSCLC results in Table 5, all MIL methods reach the same AUC as Slot-MIL (with or without SubMix). The reviewer also acknowledges the authors' rebuttal in that "not all tasks possess an optimal feature extractor". However, it is becoming increasingly prevalent with histopathology domain-specific encoders such as CTransPath [1] that SSL is the mainstay approach for extracting features in MIL. As suggested, evaluating on more difficult tasks may highlight better performance gains with Slot-MIL. However, the current results significantly diminish Slot-MIL's contributions.

Confusion with calibration: NLL is used as the main approach for measuring calibration. Upon review, the reviewer is unsure why NLL is used, when metrics such as Brier score, conformal prediction, and Expected Calibration Error (ECE) are some of the main methods for evaluating calibration [2,3].

Missing comparison with CLAM-SB: Upon review, the results of this work do not compare against CLAM-SB [4,5,6], which has been demonstrated to be a strong baseline (commonly evaluated in other MIL works on these tasks). As the evaluated tasks can be solved at the instance-level, comparing against CLAM-SB is necessary.

Concluding Thoughts: The reviewer thanks the authors for the detailed reply. From reviewing the submission again, I did find the experimentation and investigation of subsampling as an augmentation strategy to be interesting and thoughtful. This is a worthwhile investigation, and I appreciate both positive and negative results that would make salient how subsampling works. However, the evaluation on NSCLC subtyping and CAMELYON-16, coupled with the updated results via SSL, may not be challenging enough to demonstrate the novelty of Slot-MIL and SubMix. The evaluation of SubMix itself is not convincing, as established calibration metrics are not used. The performance gains are greatly reduced when using SSL features, which raises further interesting points on: 1) the trade-off between MIL architecture and patch-level encoder, and 2) what new clinical problems MIL architectures can solve that can't be solved with a strong patch-level encoder. At the moment, I am not changing my rating, but would encourage the authors to continue working on this problem as I believe it would have important results if evaluated on more diverse tasks.

  1. Wang, X., Yang, S., Zhang, J., Wang, M., Zhang, J., Yang, W., Huang, J. and Han, X., 2022. Transformer-based unsupervised contrastive learning for histopathological image classification. Medical image analysis, 81, p.102559.
  2. Naeini, M.P., Cooper, G. and Hauskrecht, M., 2015, February. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the AAAI conference on artificial intelligence (Vol. 29, No. 1).
  3. Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., Tran, D. and Lucic, M., 2021. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34, pp.15682-15694.
  4. Lu, M.Y., Williamson, D.F., Chen, T.Y., Chen, R.J., Barbieri, M. and Mahmood, F., 2021. Data-efficient and weakly supervised computational pathology on whole-slide images. Nature biomedical engineering, 5(6), pp.555-570.
  5. Xiang, J. and Zhang, J., 2022, September. Exploring low-rank property in multiple instance learning for whole slide image classification. In The Eleventh International Conference on Learning Representations.
  6. Yufei, C., Liu, Z., Liu, X., Liu, X., Wang, C., Kuo, T.W., Xue, C.J. and Chan, A.B., 2022, September. Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images. In The Eleventh International Conference on Learning Representations.
Comment

First of all, we sincerely appreciate your thoughtful response. Your suggestions have significantly contributed to improving the contribution of our paper.

In our initial submission, we chose to report NLL as the calibration metric, considering its widespread use in the uncertainty literature[1,2] and its joint evaluation of both performance and calibration error. However, for the multi-class experiment on CAMELYON-17 (C17), we have expanded our reporting to include both NLL and ECE. For ECE calculations, we employed 10 bins within the [0,1] range. Additionally, for AUC measurements, we report the mean one-vs-one value over all class pairs, as it is known to be less sensitive to class imbalance than one-vs-rest.
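For clarity on the two metrics described above, here is a rough sketch of 10-bin ECE and the one-vs-one macro-averaged AUC using standard NumPy/scikit-learn and dummy predictions (illustrative only, not the authors' code):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE with equal-width confidence bins over [0, 1]; probs is (N, K), labels is (N,)."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    ece = 0.0
    for lo in np.linspace(0.0, 1.0, n_bins, endpoint=False):
        in_bin = (conf > lo) & (conf <= lo + 1.0 / n_bins)
        if in_bin.any():
            acc = (pred[in_bin] == labels[in_bin]).mean()
            ece += in_bin.mean() * abs(acc - conf[in_bin].mean())
    return ece

# Dummy 4-class predictions, mirroring the C17 multi-class setting.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=200)
labels = rng.integers(0, 4, size=200)
print(expected_calibration_error(probs, labels))
print(roc_auc_score(labels, probs, multi_class="ovo", average="macro"))  # mean one-vs-one AUC
```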

Due to time constraints, we are unable to provide comprehensive and fully tuned results (in particular, hyperparameter tuning) for our method. Here, we used the same hyperparameter settings as in the C17 experiments, except that we used 32 slots in the first PMA module (instead of 4 slots in the binary classification tasks), since it is a 4-class classification problem. Also, we set the Late-mix hyperparameter (L) to 0.4. As you mentioned in your earlier response, the C17 multi-class task poses greater difficulty than C16, given its class imbalance and distribution shifts. In this challenging context, SubMix demonstrated a substantial accuracy gain, with a 7.16% improvement over Slot-MIL, which is also better than applying subsampling alone. Additionally, the calibration metrics (NLL and ECE) of SubMix were lower, providing evidence of the regularization effect of our method.

Given the time constraints until the end of the rebuttal period, reporting ECEs for all tables in the main paper is not feasible. However, in the provided table for the C17 multi-class task, we observed a significantly high Pearson's correlation of r=0.9313 (p-value = 0.0003) between NLL and ECE. We anticipate similar trends in other experiments on different datasets. As suggested, we commit to reporting both NLL and ECE in the final version of the paper.

For CLAM-SB, given its consistently low performance in subsequent works [3,4], we maintain the belief that Slot-MIL will outperform CLAM-SB. We are currently conducting experiments on CLAM-SB and plan to incorporate the results into the final version of the paper. Thank you for your understanding and suggestions.

Finally, we would like to emphasize that our primary focus is on the contributions of Slot-MIL and SubMix, considering feature extraction as an orthogonal concern. However, the results with SimCLR features indicate that an SSL encoder can enhance performance irrespective of the MIL method. Given the challenging nature of tasks like C17 multi-class classification and PANDA classification (ACC of 0.64 [5] and ACC of 0.73 [3] in the current best literature, respectively), the combination of SSL-based features and MIL methods becomes crucial. In such demanding scenarios, both the model and feature extraction will play significant roles.

Thank you once again for your valuable feedback and insightful suggestions. If you have any additional questions or further clarification, please don't hesitate to reach out.

Best regards, Authors


References

[1] Guo, Chuan, et al. "On calibration of modern neural networks." International conference on machine learning. PMLR, 2017.

[2] Verma, Vikas, et al. "Manifold mixup: Better representations by interpolating hidden states." International conference on machine learning. PMLR, 2019.

[3] Xiang, Jinxi, and Jun Zhang. "Exploring low-rank property in multiple instance learning for whole slide image classification." The Eleventh International Conference on Learning Representations. 2022.

[4] Zhang, Hongrun, et al. "DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[5] Yufei, Cui, et al. "Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images." The Eleventh International Conference on Learning Representations. 2022.

Comment

To quickly follow up, are the baselines presented in Table 5 trained with subsampling?

Comment

The reviewer thanks the authors for their thoughtful response. After reading the authors' rebuttal and the thoughts of the other reviewers, and after inspecting the methodological details of this work and its preceding works further, a point-by-point reply is presented below:

Reply to Weakness 1 - Novelty of Slot-MIL: Slot-MIL is indeed distinct from DeepAttnMISL (not requiring clustering beforehand) for addressing patch redundancy. At the same time, as the methodology of Slot-MIL is more of a direct extension of the already-existing Slot Attention paper, it would be nice to visualize and characterize which slots are associated with clinically relevant pathological features, to further justify adapting this method from other literature. Any number of methods can be intuitively adapted from Transformer literature and broader deep learning literature for addressing redundancy (e.g., I imagine Clustering Transformers would also work well here [1]). As the performance of Slot-MIL is ultimately modest, there does not seem to be a strong motivation to use Slot-MIL when ABMIL, DeepAttnMISL, TransMIL, and ILRA-MIL (combined with self-supervised features) already work very well. In addition to visualization of slots, other ways to demonstrate the applicability of Slot-MIL would be additional experimentation regarding: 1) solving harder tasks and 2) data-efficiency experiments. The improvement with SubMix (as raised by vwP9 and its reply) is only in NLL, which does not translate into any improvement in clinically relevant performance metrics.

Reply to Weakness 2 – On “SOTA” performance: The reviewer understands that experimental designs (splits, parameters, implementation) differ across studies, especially when comparing MIL results with different pretrained encoders for feature extraction. The intention of my critique was to lessen the tone, as the previous (and also current versions) of this work continue to emphasize “SOTA” performance.

Reply to Weakness 3 – Diverse Benchmarks: The reviewer understands that C17 is a good benchmark for studying distributional shift. At the same time, the reviewer contends that the experiments in this study remain limited to binary classification tasks. As utilized in many works, PANDA presents a more challenging task with 1) multi-class classification with label uncertainty, and 2) domain shift. The suggestion of PANDA (and other tasks) is two-fold in improving the study:

  1. Finding tasks where Slot-MIL would have better improvement: As the results stand, there is limited performance gain for Slot-MIL (and its combination with SubMix) (see below).

  2. Task diversity: Even with C17, the only tasks evaluated are lung cancer subtyping and breast cancer metastasis detection. These tasks are diagnostically simple and over-represented in computational pathology studies, relative to much harder tasks such as gene mutation prediction and even Gleason score grading.

  [1] Vyas, A., Katharopoulos, A. and Fleuret, F., 2020. Fast transformers with clustered attention. Advances in Neural Information Processing Systems, 33, pp.21665-21674.

Comment

The baselines listed above the double line in Table 5 do not incorporate augmentation (nor subsampling), while the baselines below the double line use augmentation. To be clear,

  • DTFD-MIL employs a pseudo-bag with knowledge distillation based on the ABMIL model but does not exhibit significant improvement. Here, a pseudo-bag splits a single WSI into subsets of patches, which can be seen as a variant of subsampling.
  • RankMix, built on our Slot-MIL along with subsampling and patch-level mixup, displays inferior performance compared to SubMix, underscoring the importance of mixing at the attention-based clustered feature level. The rationale for running RankMix on Slot-MIL is that Slot-MIL was identified as the most powerful model; the RankMix authors emphasized the significance of baseline performance in achieving optimal results.
  • Also, as demonstrated in Table 6 (Appendix B.1), even when subsampling is applied to ABMIL and DSMIL, it does not come close to the performance achieved by SubMix, which leads by a substantial margin.
  • We also conducted experiments with other baselines (TRANSMIL, ILRA) incorporating subsampling and found that they do not come close to SubMix's performance, by a considerable margin. Additionally, strictly speaking, applying subsampling to TRANSMIL is impractical, as the model relies on the ordering of the full set of patches for its additional positional-encoding CNN module.

We can provide a further analysis within a day if needed.

In summary, our experiments indicate that even when subsampling is applied to the baselines, improving their performance by regularizing the over-concentration of attention on specific patches, they still do not come close to the performance achieved by SubMix.

Review
Rating: 6

The authors propose an efficient model called SlotMIL that leverages an attention-based mechanism to organize patches into a fixed number of slots. They demonstrate that combining the attention-based aggregation model with subsampling and mixup augmentation techniques enhances both generalization and calibration in WSI classification.

A key contribution is the subsampling/mixup augmentation, which creates new bags of patches by randomly sampling subsets from the original slides. This helps limit overfitting to the weak slide-level supervision. They also introduce an efficient model called SlotMIL that summarizes patches into a fixed number of slots using attention. Experiments show subsampling helps make more informative slots and improves generalization.

Strengths

  • The subsampling augmentation is a simple but effective way to create new training bags that reduces overfitting, without altering underlying slide semantics or adding training cost. This is an improvement over complex augmentation techniques.
  • The SlotMIL model provides an efficient attention-based aggregation method to summarize patches into discriminative slots. This is more sophisticated than relying only on the max-pooling approaches commonly used.
  • The authors showed that subsampling plus mixup augmentation can work well together, whereas prior work found mixup had limited applicability in MIL frameworks.
  • Thorough experiments on multiple datasets demonstrate state-of-the-art performance, including on class imbalance and distribution shifts.
  • Authors considered various relevant baselines (ABMIL, DSMIL) and conducted rigorous ablation experiments to test the various components of the proposed model.
  • The paper is well written and easy to follow.

Weaknesses

  1. MIL attention (https://arxiv.org/pdf/1802.04712.pdf) has been widely used for WSI analysis and several other extensions to the method have been explored in the field. While it is a relevant baseline, the proposed improvements do not significantly improve upon performance (<2-5%) and it is unclear if Mixup augmentation alone is better than other extensions to improve performance in the proposed binary classification task.
  2. The paper focuses solely on binary classification problems. Extending the approach to multi-class classification could be challenging.
  3. In some pathology samples (e.g., in cancer), a very small proportion of patches might contain the relevant signal for classification. The subsampling augmentation could potentially discard useful patch information. Strategies that retain all patches may be able to learn more robust features.

Questions

  • What impact does the subsampling rate have on model performance? Is there an optimal sampling fraction or range across datasets?
  • How well do the SlotMIL model and subsampling augmentation transfer to multi-class classification tasks? The codebase is not public - making it open-source would help reproducibility efforts.
Comment
  • Weakness 1

Thanks for your comprehensive comparison.

We want to highlight that our enhancements amount to approximately 9.4%, 6.5%, and 3.1% when measured against the best baselines in terms of AUC for CAMELYON-16, CAMELYON-17, and TCGA-NSCLC, respectively. While this improvement might seem modest, it's crucial to note that we have concurrently improved calibration while enhancing both ACC and AUC, making this margin noteworthy.

As detailed in Table 3, we acknowledge that Slot-Mix doesn't yield a substantial improvement in terms of ACC and AUC. However, its contribution to calibration is significant, especially as it involves mixing inter-label slides using informative slots. Notably, our mixup approach stands out for its simplicity and efficiency, especially when compared to other augmentations such as DTFD-MIL[1] and RankMix[2], which necessitate knowledge distillation based on Grad-CAM[3] or a pre-training phase to unify the number of patches between slides. It's also important to highlight that DTFD-MIL and RankMix naturally bear similarities to subsampling, as they involve dividing WSIs into pseudo-bags or selecting subsets of patches. The comparison with these baselines becomes fairer in the context of Slot-MIL + SubMix. We also include additional complexity metrics in our reply to reviewer Wtm8.

  • Weakness 3, Question 1

The ratio of positive patches in a positive slide varies across datasets, as it is known to be less than 10% in CAMELYON-16 and over 80% in TCGA-NSCLC[4]. The precise number of patches containing relevant signals for classification is naturally unknown. While subsampling may potentially discard some important patches in certain iterations, the stochasticity introduced by subsampling and the abundance of patches within a slide mitigate the likelihood of discarding all useful patches. This can be further adjusted by the subsampling rate (p).

Retaining all patches, as depicted in our "Base train" in Figure 3, exacerbates overfitting since attention scores become excessively concentrated on specific patches. Additional details are provided in Section 4.1.2. Regarding the subsampling rate (p), empirical evidence in Table 6 and Table 8 in the appendix demonstrates that adopting any subsampling rate leads to performance gain compared to not adopting it. This implies that the benefits of subsampling can be harnessed without requiring a prior understanding of the ratio of positive patches.

We have already validated the effectiveness of subsampling on CAMELYON-16, where only a small proportion of patches may contain relevant signals for classification. Also, at the inference stage, we utilize all patches in a slide, not a subset of them. The optimal range of subsampling rates might be higher in scenarios where the ratio of positive patches is low, as low subsampling rates may result in the exclusion of all positive patches (p=0.1 indicates the use of 10% of total patches per iteration).

  • Weakness 2, Question 2

The adoption of subsampling for multi-class scenarios is anticipated to pose no significant obstacle, as subsampling is an augmentation orthogonal to the number of classes, akin to dropout rates[5] and masking ratios in MAE[6]. While determining an optimal subsampling ratio (p) might require further elaboration in multi-class scenarios, our observations suggest that performance is not overly sensitive to p, making this task manageable.

It is worth noting that Slot-Mix originated from manifold mixup[7], where multi-class scenarios are the basic premise. Consequently, the adoption of Slot-MIL + SubMix in multi-class situations is not only feasible but also aligns with the method's conceptual foundations.

Demonstrating the validity of our method on multi-class datasets, such as PANDA, where WSIs are categorized into 6 classes based on Gleason score, would indeed be valuable. However, given the large scale of PANDA, with tens of thousands of WSIs, feature extraction might require more time than the rebuttal period allows. If you are aware of any open-source repositories providing extracted features from PANDA, we are keen to conduct experiments as time permits. Regarding the code, it has already been posted in the supplementary materials for Slot-MIL. We are in the process of making it open-source and will endeavor to release it, including SubMix, anonymously within the coming week.


If you have any further questions or need additional clarification, please feel free to reach out.

Sincerely, Authors

Comment

References

[1] Zhang, Hongrun, et al. "DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[2] Chen, Yuan-Chih, and Chun-Shien Lu. "RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[3] Selvaraju, Ramprasaath R., et al. "Grad-CAM: Why did you say that?." arXiv preprint arXiv:1611.07450 (2016).

[4] Xiang, Jinxi, and Jun Zhang. "Exploring low-rank property in multiple instance learning for whole slide image classification." The Eleventh International Conference on Learning Representations. 2022.

[5] Srivastava, Nitish, et al. "Dropout: a simple way to prevent neural networks from overfitting." The journal of machine learning research 15.1 (2014): 1929-1958.

[6] He, Kaiming, et al. "Masked autoencoders are scalable vision learners." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.

[7] Verma, Vikas, et al. "Manifold mixup: Better representations by interpolating hidden states." International conference on machine learning. PMLR, 2019.

Comment

I thank the authors for addressing my comments. The authors acknowledge that Slot-Mix doesn't yield a substantial improvement in ACC and AUC but argue for its significant contribution to calibration. However, the quantifiable evidence of this contribution is not well presented. The new results also point to the fact that SSL is more promising in certain experiments than MIL-based methods, somewhat undermining the relevance of the presented approach. Hence, I will keep the current score.

Comment

Dear vwP9,

In response to your suggestion, we conducted a multi-class classification experiment on the challenging CAMELYON-17 dataset, which comprises four classes: negative, itc, micro, and macro. In this rigorous evaluation, SubMix demonstrated its efficacy by achieving an ACC of 0.733, representing a notable improvement over applying Sub alone on Slot-MIL (ACC of 0.722). Moreover, SubMix exhibited superior calibrated predictions, as evidenced by lower ECE and NLL values. This underscores the practical utility of SubMix in real-world applications, where uncertainty is as critical as accuracy.

Given the demanding nature of tasks such as CAMELYON-17 multi-class classification and PANDA classification (ACC of 0.64 [1], ACC of 0.73 [2] in the current best literature, respectively), the synergy between SSL-based features and MIL methods becomes pivotal. In this context, our model, Slot-MIL, and SubMix are poised to make substantial contributions, recognizing that both the model and feature extraction play significant roles.

We sincerely appreciate your assistance in refining our paper, and we welcome any further suggestions or feedback.

Best regards, Authors


References

[1] Yufei, Cui, et al. "Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images." The Eleventh International Conference on Learning Representations. 2022.

[2] Xiang, Jinxi, and Jun Zhang. "Exploring low-rank property in multiple instance learning for whole slide image classification." The Eleventh International Conference on Learning Representations. 2022.

Comment

Dear Reviewers,

Thank you for your meticulous reviews and valuable feedback on our paper. We have carefully considered each of your insightful comments and are fully committed to addressing the raised concerns. The reviewers commonly highlighted the novel attention-based aggregation in Slot-MIL (reviewers vwP9, Wtm8) and the effectiveness of subsampling and MIXUP (reviewer vwP9). They also appreciated that we addressed the challenges of over-fitting and over-confidence in MIL well (reviewers vwP9, Wtm8), outperforming previous papers without complicated architectures or augmentations (reviewer vwP9), and they valued the extensive ablation experiments (reviewers 5YV1, vwP9).

However, there seem to be some concerns shared by the reviewers on the experimental results, especially 1) the claim that our method achieves SOTA performance, 2) the discrepancy between the results we reproduced and those reported in the baseline papers, and 3) the novelty of our work.

For 1), we'd like to mention that our model achieves an AUC of 0.975 for CAMELYON-16 and an AUC of 0.981 for TCGA-NSCLC with better-calibrated predictions (better NLL values), using the features extracted from SimCLR (provided by the DSMIL paper [4]). With these features, ours indeed outperforms previous SOTA methods, even though the margin is not as large as in the setting used in the paper. We forgot to include a mention or link in our paper to the results with SimCLR features, which are already presented in Appendix B.6. The details can be found there or in the table at the bottom. The results in the main paper are based on ResNet-18 features, and we chose this for two reasons. Firstly, as can be seen in the table in Appendix B.6, when solved with SimCLR features, due to the increased flexibility of the features, the MIL algorithms could not make a significant difference in performance, making it hard to judge the contribution of the MIL algorithms themselves. It is also known that a strong feature extractor like SimCLR makes patch-level features more linearly separable [7]. Secondly, perhaps for a similar reason, previous works [2,3,7] chose to report performance based on ResNet-50 features, so we followed the literature.

For 2), the gap in the performance between the ones reported in the references and the ones reproduced by ourselves is mainly due to the difference in the way we construct the train / validation / test splits. We make sure that slides from the same patient do not exist in the train and test set, and use stratified k-fold for splitting a validation set from the training set. Both are essential for reproducibility and fair comparison but seem not to be well followed by previous works [2,4,6] (they don’t usually report which splits have been used).

Due to the limited number of slides in common medical MIL benchmarks and class imbalance, we observed that performance is quite sensitive to the ratio of positive/negative slides in the validation set, and thus choosing different splits largely affects the results. Many papers in the literature have reported different figures with the same model. In the case of CAMELYON-16, TRANSMIL[3]'s reported AUC varies across 0.877[1], 0.906[2], and 0.931[3], and DSMIL[4]'s across 0.894[1] and 0.818[3]. In the case of TCGA-NSCLC, ABMIL[5]'s AUC varies across 0.921[1], 0.941[2], and 0.866[3]. This performance difference between identical models further highlights the importance of selecting validation sets and evaluation protocols fairly. Therefore, we used stratified k-fold cross-validation, as the distribution of labels in the training set is given a priori. We firmly believe that our evaluation protocol will positively contribute to fair evaluation and reproducibility in the WSI area, where even test sets are occasionally not designated. We will provide our train/validation/test slide indices for clarity, which have not been released by other papers.
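One way to realize the split protocol described above (patient-disjoint folds with label stratification) using standard scikit-learn utilities is sketched below; the variable names and toy data are illustrative assumptions, not the authors' actual splits or script.

```python
from sklearn.model_selection import StratifiedGroupKFold

# Illustrative per-slide labels and patient ids (toy data, not the real datasets).
slide_labels = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
patient_ids  = ["p1", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9"]

skf = StratifiedGroupKFold(n_splits=2, shuffle=True, random_state=0)
for train_idx, val_idx in skf.split(slide_labels, slide_labels, groups=patient_ids):
    # Slides from the same patient never appear on both sides of a split,
    # and the label ratio is approximately preserved in each fold.
    print(train_idx, val_idx)
```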

Considering the fact that training a self-supervised feature extractor on WSIs takes 4 days using 16 Nvidia V100 GPUs[1] and 2 months to be well-optimized[4], it may not be applicable to all real-world scenarios. In addition, our experiment in Appendix B.6 showed that the SimCLR-based features are so powerful that simple mean/max pooling reaches nearly the best performance, which makes the comparison less meaningful. Nevertheless, given that not all tasks possess an optimal feature extractor, we contend that our robust method, which performs effectively across various extractors, holds greater value and is better suited for real-world applications.

Comment

For 3), to the best of our knowledge, using mixup augmentation with a fixed number of intermediate features is a very new approach in MIL problems, while being simpler to implement and more cost-efficient than other SOTA augmentation methods[2,6]. In addition, our experiment on CAMELYON-17, with its large distribution shift, which is rarely examined in the WSI classification area because such shifts are challenging[7], empirically showed that our method is much more robust than other methods. As real-world applications inevitably suffer from distribution shift, this performance is quite notable. Moreover, we want to underscore the importance of our well-calibrated predictions, a crucial factor in the medical domain where reliability and explainability are needed.

Lastly, we will address each reviewer's concerns as soon as possible. We appreciate your valuable feedback and encourage any additional questions or requests for further experiments. Please feel free to stay in touch with us.

Best regards, Authors

Experiment with SimCLR-based features (also found in Appendix B.6). We use the identical hyper-parameters and evaluation protocol as the experiment with ResNet-extracted features.

| Method\Dataset | CAMELYON_16 ACC(⭡) | AUC(⭡) | NLL(⭣) | TCGA-NSCLC ACC(⭡) | AUC(⭡) | NLL(⭣) |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| Meanpool | 0.693±0.000 | 0.604±0.003 | 0.674±0.023 | 0.927±0.014 | 0.972±0.010 | 0.232±0.055 |
| Maxpool | 0.920±0.002 | 0.967±0.002 | 0.353±0.147 | 0.920±0.023 | 0.961±0.013 | 0.608±0.053 |
| ABMIL | 0.921±0.009 | 0.972±0.002 | 0.234±0.033 | 0.933±0.018 | 0.981±0.010 | 0.263±0.101 |
| DSMIL | 0.916±0.012 | 0.968±0.008 | 0.456±0.182 | 0.936±0.017 | 0.981±0.010 | 0.324±0.133 |
| TRANSMIL | 0.889±0.026 | 0.939±0.010 | 0.988±0.188 | 0.924±0.020 | 0.974±0.009 | 0.381±0.127 |
| ILRA | 0.923±0.009 | 0.973±0.007 | 0.333±0.043 | 0.933±0.018 | 0.981±0.011 | 0.277±0.103 |
| Slot-MIL | 0.922±0.008 | 0.972±0.007 | 0.294±0.065 | 0.937±0.018 | 0.981±0.011 | 0.276±0.131 |
| Slot-MIL + SubMix | 0.923±0.009 | 0.975±0.008 | 0.229±0.071 | 0.935±0.018 | 0.981±0.012 | 0.248±0.105 |


References

[1] Xiang, Jinxi, and Jun Zhang. "Exploring low-rank property in multiple instance learning for whole slide image classification." The Eleventh International Conference on Learning Representations. 2022.

[2] Zhang, Hongrun, et al. "DTFD-MIL: Double-tier feature distillation multiple instance learning for histopathology whole slide image classification." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.

[3] Shao, Zhuchen, et al. "Transmil: Transformer based correlated multiple instance learning for whole slide image classification." Advances in neural information processing systems 34 (2021): 2136-2147.

[4] Li, Bin, Yin Li, and Kevin W. Eliceiri. "Dual-stream multiple instance learning network for whole slide image classification with self-supervised contrastive learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021.

[5] Ilse, Maximilian, Jakub Tomczak, and Max Welling. "Attention-based deep multiple instance learning." International conference on machine learning. PMLR, 2018.

[6] Chen, Yuan-Chih, and Chun-Shien Lu. "RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[7] Yufei, Cui, et al. "Bayes-MIL: A New Probabilistic Perspective on Attention-based Multiple Instance Learning for Whole Slide Images." The Eleventh International Conference on Learning Representations. 2022.

Comment

Dear reviewers,

Thank you for dedicating your time and efforts once again to reviewing our paper. We would like to gently remind you that the discussion period is concluding soon (in a few days).

We believe that we have genuinely and effectively addressed your comments, incorporating the results of the supporting experiments. Additionally, we have updated our main paper to highlight the results from CAMELYON-17 (Table 4), incorporate efficiency metrics (Table 4), and present results with SimCLR-based features (Table 5).

If you have any further concerns or questions, please do not hesitate to reach out! Your feedback is highly valued.

Sincerely, Authors

Comment

Dear reviewers,

In response to the reviewers' suggestion, we conducted experiments on the multi-class classification task using the CAMELYON-17 (C17) dataset, which includes four classes (negative, itc, micro, and macro) based on the size of tumor cells. This dataset is characterized by class imbalance and distribution shifts, as it involves data collected from multiple hospitals, adding complexity to the problem. In this challenging setting, our Slot-MIL demonstrates relatively decent performance, and the SubMix method significantly enhances accuracy while providing better-calibrated predictions, in terms of both ECE and NLL. We hope that this experiment sufficiently answers the multi-class request from reviewers 5YV1 and vwP9 and strengthens our contribution. Below are the detailed results.

CAMELYON-17

| Method\Metric | ACC(⭡) | AUC(⭡) | NLL(⭣) | ECE(⭣) |
|:---:|:---:|:---:|:---:|:---:|
| Meanpool | 0.642±0.029 | 0.601±0.024 | 0.969±0.036 | 0.095±0.028 |
| Maxpool | 0.701±0.012 | 0.620±0.015 | 0.894±0.010 | 0.077±0.027 |
| ABMIL | 0.702±0.025 | 0.677±0.024 | 0.986±0.161 | 0.133±0.029 |
| DSMIL | 0.699±0.032 | 0.675±0.020 | 1.283±0.477 | 0.145±0.065 |
| TRANSMIL | 0.715±0.019 | 0.686±0.036 | 1.405±0.335 | 0.183±0.041 |
| ILRA | 0.670±0.091 | 0.664±0.021 | 1.186±0.296 | 0.142±0.052 |
| Slot-MIL | 0.684±0.071 | 0.703±0.023 | 1.370±0.112 | 0.192±0.035 |
| Slot-MIL + Sub | 0.722±0.043 | 0.702±0.017 | 1.116±0.065 | 0.144±0.016 |
| Slot-MIL + SubMix | 0.733±0.047 | 0.703±0.017 | 0.901±0.114 | 0.099±0.024 |