xMIL: Insightful Explanations for Multiple Instance Learning in Histopathology
Abstract
Reviews and Discussion
This work proposes an LRP-based extension for the MIL setup in histopathology. It focuses on explanation quality, not on deriving novel architectures.
Strengths
It is quite well written in terms of language.
The images nicely depict the key ideas.
Weaknesses
This paper does not follow the MIL paradigm. It references only standard MIL assumptions and methods dedicated to them. There is work extending standard MIL methods to other MIL assumptions, such as additive MIL [1], presence- and threshold-based MIL [2], and lastly percentage MIL [3].
The novelty is extremely limited, because I do not see more here than just the application of LRP to MIL models.
Moreover, there are models such as CLAM [4] and ProtoMIL [5] that are addressing MIL in a more interpretable way; how do they work in your setup? Are they subject to the same limitations?
[1] Javed, Syed Ashar, et al. "Additive mil: Intrinsically interpretable multiple instance learning for pathology." Advances in Neural Information Processing Systems 35 (2022): 20689-20702.
[2] Rymarczyk, Dawid, et al. "Kernel self-attention for weakly-supervised image classification using deep multiple instance learning." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021.
[3] Struski, Łukasz, et al. "ProMIL: Probabilistic multiple instance learning for medical imaging." ECAI 2023. IOS Press, 2023. 2210-2217.
[4] Lu, Ming Y., et al. "Data-efficient and weakly supervised computational pathology on whole-slide images." Nature biomedical engineering 5.6 (2021): 555-570.
[5] Rymarczyk, Dawid, et al. "Protomil: Multiple instance learning with prototypical parts for whole-slide image classification." Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Cham: Springer International Publishing, 2022.
Questions
At some point, you are relaxing MIL assumptions so much that it is not MIL anymore, especially when the bag is not a set of instances anymore, because the spatial dependence between instances is important.
What is the difference between your work and a straightforward application of LRP to the MIL problem?
Limitations
They are discussed.
Limited Novelty
Regarding your concerns about the novelty of applying existing XAI methods to MIL, we would like to clarify that it was not the aim of our work to develop a new explanation method for MIL. Instead, the novelty of our work extends across other dimensions:
- Methodological novelty: Despite attempts to apply XAI to MIL models in histopathology (e.g., [1-10]), there exists no formalism guiding the interpretation of the heatmaps and defining their desired properties. Instead, the heatmaps are often considered a prediction of assumed ground-truth instance labels, which we identified as problematic (Section 2.2). xMIL is a novel framework addressing this gap. Unlike previous works, it posits that bags possess a ground-truth evidence function that quantifies the instances’ impact on the bag label. Within xMIL, heatmaps estimate this ground truth, which makes interpreting the heatmaps straightforward and insightful.
- Empirical novelty: Many high-impact medical publications use attention heatmaps or Integrated Gradients to explain histopathological MIL models [1-8]. To our knowledge, no prior research has systematically evaluated whether these methods produce sensible results. Our extensive empirical evaluation on synthetic and real-world histopathology datasets is the first of its kind. It reveals that the widely used explanation methods may yield misleading results, as they often fail to reflect the model decisions faithfully. In contrast, xMIL-LRP sets a new state of the art for explainability in AttnMIL and TransMIL models in histopathology.
- Novelty in insight generation: Previous studies [9-10] conducted qualitative assessments of heatmaps on easy-to-learn datasets like CAMELYON or TCGA NSCLC. The insights gained in these settings are limited to model debugging, i.e., “Does the model focus on the disease area?” To our knowledge, we are the first to present a method generating heatmaps that enable pathologists to extract fine-grained insights about the model in a difficult biomarker prediction task.
“This paper does not follow the MIL paradigm. It references only standard MIL assumptions and methods dedicated to them. [...]”
It seems to us that this must be a misunderstanding.
We are aware that there are various formulations and methods beyond standard MIL, and we explicitly reference and discuss many of them (lines 72-73, 86-88, 95-97, 113-115, 125-128), including additive MIL (lines 95-97, 125-128) and the formulation of threshold-based MIL from Foulds & Frank [14] (lines 72-73, 113-115).
In fact, we investigated various MIL formulations but didn’t find any that we deemed suitable to cover all the complexities possibly present in histopathology tasks like biomarker or outcome prediction (Section 2.2). Concretely, with respect to the works mentioned:
- Additive MIL is unable to learn arbitrary instance interactions (lines 37-41, 125-128).
- The MIL formulations in threshold-based and percentage MIL do not include negative evidence, i.e., evidence speaking against a class label. They also don’t account for patch interactions.
- We argue that including negative evidence and instance interactions may be crucial for capturing the complexities of histopathological datasets (lines 105-112).
This is why we formulated the xMIL framework, introducing a novel perspective on MIL that overcomes those issues.
“At some point, you are relaxing MIL assumptions so much that it is not MIL anymore [...]”
Thanks for your comment. However, we do not agree that Definitions 3.1 and 3.2 of our manuscript do not formulate a MIL problem.
In Definition 3.1, we use bags of instances with bag-level labels just as standard MIL and most other MIL formulations (lines 67-68, 135-136). The key difference between our formulation and other formulations is that we do not make assumptions about the relationship between the instances and the bag label, but instead posit the existence of an evidence function assigning context-sensitive evidence scores to the individual instances.
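For concreteness, the following is a schematic rendering of this setup. The notation below is ours, written only to illustrate the idea; the paper's exact Definitions 3.1 and 3.2 may state this differently.

```latex
% Schematic sketch only; notation assumed, not the paper's exact definitions.
\begin{align*}
  &\text{Bag } X = \{x_1, \dots, x_n\}, \qquad \text{bag label } Y = g(X)
     \text{ for some aggregation function } g \text{ (cf.\ Def.~3.1)},\\
  &\text{evidence function } e(x_i \mid X) \in \mathbb{R}
     \text{ (context-sensitive: it may depend on the whole bag } X\text{)},\\
  &e(x_i \mid X) > 0 \text{ supports } Y, \qquad
   e(x_i \mid X) < 0 \text{ speaks against } Y,\\
  &e(x_i \mid X) > e(x_j \mid X) \;\Rightarrow\;
   x_i \text{ influences } Y \text{ more strongly than } x_j \text{ (cf.\ Def.~3.2)}.
\end{align*}
```

An explanation method such as xMIL-LRP then produces scores $\hat{e}(x_i \mid X)$ that estimate $e$; a heatmap is faithful to the extent that $\hat{e}$ preserves the sign and ordering of $e$.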
Note that we do not make any assumptions about the (spatial) dependence between the instances. Hence, we neither require nor rule out that spatial dependence plays a role. Thus, our framework is a general MIL framework that can be reduced to existing MIL formulations with extra assumptions. For example, AttnMIL does not model the correlation among instances, whereas TransMIL does.
We would appreciate it if you could expand on why you think that our xMIL formulation is not MIL anymore. In particular, we would like to understand what exactly you consider requirements for a MIL formulation.
“Moreover, there are models such as CLAM and ProtoMIL that are addressing MIL in a more interpretable way [...]”
Although CLAM extends AttnMIL, its explanations are also based on attention scores. Therefore, the method is subject to the same conceptual limitations, namely the lack of faithfulness and the inability to distinguish between positive and negative evidence (lines 35-37, 89-90, 118-120; we cited CLAM in reference [6]).
To our understanding, ProtoMIL is conceptually different from our approach, as it provides explainability in terms of similarity scores to prototypical instances instead of generating heatmaps over all instances of a bag. Within our xMIL framework, we would therefore argue that it does not estimate the evidence function. Hence, it is unclear to us how to quantitatively or qualitatively compare the evaluated explanation methods with ProtoMIL. We acknowledge that we did not cite ProtoMIL in our paper, but we are happy to include it.
Additional comments
As outlined above, we believe some of the comments and questions may have been based on misunderstandings. We have outlined our key contributions and clarified the raised questions. We are happy to continue the discussion.
References
See Author Rebuttal.
I appreciate the detailed responses provided, and I must admit that I had some misunderstandings when I was reviewing the paper. I now recognize that my argument about limited novelty no longer holds. However, I recommend that the authors incorporate a structured description of novelty into the paper, as it would enhance transparency and clarity. Excellent job on this aspect!
Regarding the MIL formulation, I refer to a scenario where there is a bag of instances $X$ with an associated known label $Y$. For simplicity, let's assume $Y$ is binary, $Y \in \{0, 1\}$. Each instance $x_i$ is independent and has its own hidden label $y_i$. The bag is considered positive if at least one instance is labeled positive, which is the standard MIL assumption. Additionally, there are other MIL assumptions, such as:
- Presence MIL: Multiple hidden labels must co-occur for the bag to be positive.
- Threshold MIL: More than $t$ instances within a bag must have a given hidden label.
Both of these are well described in [1]. There is also a percentage MIL [2], among others. Crucial to the MIL definition is the independence of instances and their hidden labels constituting the bag-level label.
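To make the assumptions above concrete, here is a minimal illustrative sketch (the helper names, the threshold `t`, and the fraction `p` are ours, not taken from [1] or [2]):

```python
# Hypothetical sketch of bag labels under the MIL assumptions listed above,
# given hidden instance labels; not code from the paper or the cited works.

def standard_mil(y):
    # Standard MIL: the bag is positive if at least one instance is positive.
    return int(any(y))

def threshold_mil(y, t=3):
    # Threshold-based MIL: more than t instances must carry the hidden label.
    return int(sum(y) > t)

def percentage_mil(y, p=0.2):
    # Percentage MIL: at least a fraction p of the instances is positive.
    return int(sum(y) / len(y) >= p)

def presence_mil(labels, required=frozenset({"tumor", "necrosis"})):
    # Presence-based MIL: several hidden labels must co-occur in the bag.
    return int(required <= set(labels))

y = [0, 1, 0, 0, 1, 1, 1]  # binary hidden instance labels (illustrative)
print(standard_mil(y), threshold_mil(y), percentage_mil(y))  # -> 1 1 1
print(presence_mil(["tumor", "stroma", "necrosis"]))         # -> 1
```

Note that in all four cases, the instances enter only through their individual hidden labels, which is exactly the independence assumption under discussion.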
From your manuscript, I understood that you work primarily with the Standard MIL assumption but relax the assumption about instance independence within a bag, suggesting that the order of instances within a bag is significant for xMIL. Was this a misunderstanding on my part based on your response?
Regarding the other works I mentioned, I appreciate the comparison and acknowledgment provided.
I would like to also point you to the application of MIL to the histopathology dataset that I've personally found interesting, as the labels are derived from the location of the pathological change on the tissue as well as its severity, i.e., the area covered by the pathological change [3]. But this is just out of curiosity, no action is needed :)
In general, you did well in the responses. I am keen to increase my grade, but first I would like to see your comments regarding MIL formulations and whether we agree on the definitions.
References:
[1] Rymarczyk, Dawid, et al. "Kernel self-attention for weakly-supervised image classification using deep multiple instance learning." Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2021.
[2] Struski, Łukasz, et al. "ProMIL: Probabilistic multiple instance learning for medical imaging." ECAI 2023. IOS Press, 2023. 2210-2217.
[3] Rymarczyk, Dawid, et al. "Deep learning models capture histological disease activity in Crohn’s disease and ulcerative colitis with high fidelity." Journal of Crohn's and Colitis 18.4 (2024): 604-614.
Many thanks for your response! We are happy that we could make our point more clear. In the following, we address the points in your comment:
“I recommend that the authors incorporate a structured description of novelty into the paper“
We agree that a description of the novelty is needed in the paper. We will definitely include it as a structured paragraph in the introduction.
“Crucial to the MIL definition is the independence of instances and their hidden labels constituting the bag-level label. [...] From your manuscript, I understood that you work primarily with the Standard MIL assumption but relax the assumption about instance independence within a bag, suggesting that the order of instances within a bag is significant for xMIL.”
Thanks for clarifying your point!
It is correct that in xMIL, we drop assumptions about instance independence and hidden instance labels. That being said, we also don’t suggest that the correlations among instances necessarily exist in every case. Instead, we propose that such (in)dependencies are an inherent property of the prediction task at hand, encoded in the aggregation function (Definition 3.1). We posit that the instances possess evidence scores that can be ordered, reflecting how they influence the bag label. We would like to emphasize that within xMIL, ordering is a property of the evidence scores (Definition 3.2), whereas instance order in the bag is not necessarily relevant in the aggregation function.
With your clarification in mind, though, we understand that one may argue that a problem formulation should not be called MIL anymore if the instance independence or hidden instance label assumptions are dropped. However, we would still argue the opposite. We would like to make two points to support our position:
- There are widely adopted MIL formulations that also drop such assumptions. For example, “correlated MIL” proposes the existence of correlations among the instances of a bag, which results in the TransMIL model [R1]. This correlation may or may not be position-dependent. For example, in TransMIL, a positional encoder is used to take this possible dependency into account. From our point of view, this relaxation of the i.i.d. assumption is intuitive in histopathology: in a whole slide image, different parts of the tissue are not biologically independent, and even long-range correlations may exist among them.
- Going back to the seminal work by Foulds & Frank [14], we realized that this seems to be an old debate: in Section 2.5, the authors state that “[...] the literature is not in agreement on whether the relaxed version of the MI problem belongs within the umbrella of MI learning, or is a separate problem.” However, they support our point of view, namely that “[...] the term ‘multi-instance learning’ should contrast directly with ‘single-instance learning’, and connotes any type of learning where several instances can be included within a single learning example, regardless of the assumptions used.”
“I would like to also point you to the application of MIL to the histopathology dataset that I've personally found interesting [...]”
Many thanks for the pointer!
Please let us know if you have further questions or comments!
References
[R1] Shao et al. “Transmil: Transformer based correlated multiple instance learning for whole slide image classification.” NeurIPS (2021). (ref. [4] of the original paper)
[14] See Author Rebuttal.
Thank you for your comments and clarification. I will revise my score accordingly. However, I strongly encourage you to explicitly mention that the assumption of instance independence and hidden labels has been dropped. I suggest including this clarification in the Abstract, as many advancements in MIL rely on this assumption to develop their models.
The authors propose xMIL-LRP, an explainable MIL method designed to overcome the shortcomings of previous instance-score-based methods (such as attention-based ones). It can provide positive/negative evidence towards the label and can be slotted into existing MIL methods, since it is a post-hoc interpretability method.
Strengths
With most MIL methods reaching saturation, especially with the advent of transformer-based modules and very powerful patch-level foundation models, the performance gain to be had with a new MIL method would likely be marginal. Therefore, it is encouraging that the authors are proposing a tool for interpretability that can fit into any of these MIL methods (if I didn't misunderstand) - I am assuming this also works with any feature encoder. LRP-based methods have been utilized many times outside this field, so I think it is positive that the authors brought them into the field with great rigor.
Weaknesses
There are a few weaknesses/points of confusion that I would like the authors to address to make this a worthwhile contribution to the field.
Presentation: I am assuming the authors are very familiar with LRP, but most of the audience in computational pathology would not be. Therefore, I found the lack of motivation/explanation behind LRP quite confusing. I think Section 3.2 is too compressed (especially lines 178-182). The section on faithfulness evaluation was also not easy to understand, as this was an entirely new analysis. How do these analyses relate to the fact that predicted evidence scores are positive/negative?
- Can we do away with formalizing the evidence function? It is a good formalization of the concept, but as soon as it was applied to the actual histology data, the concept was abandoned for lack of ground truth, which I found quite awkward. Why not go directly into introducing relevance scores (which do not need the evidence function formalism) and allocate more explanation for general computational pathology readers?
- I think explaining more rigorously how LRP would concretely be applied to AB-MIL and TransMIL (with a few-step explanation) would be immensely helpful, since these are quite different architectures (multi-head attention vs. not) and the authors simply bypass this by explaining the chain rules for "neurons". Only then can readers be convinced that LRP is easy to apply to any MIL of their choice.
Clarification vs. other methods: My understanding is that Additive MIL also provides positive/negative instance-level evidence (positive/negative logits) and can function very similarly to the proposed LRP. I think the authors need to provide more explanation/experiments differentiating LRP from Additive MIL. In particular, the heatmap ablations should probably include Additive MIL results to convince the readers. In addition, integrated gradients methods have been applied to explain positive/negative evidence towards bag-level labels [1], [2]. Can the authors expand on this?
Restriction to binary classification tasks: My understanding is that the current methodology is restricted to binary classification problems. For multi-class subtyping problems and survival prediction (where the ordering of predicted risks is important), how can it be adapted? How would evidence scores add up to the bag-level label if the predicted endpoint is continuous (e.g., survival prediction)? I would like the authors to expand on this, since only a subset of the problems in pathology can be regarded as binary classification problems.
Speculation on different feature encoders: While an explicit head-to-head comparison would not be necessary, I am curious to know how different feature encoders, especially with the deluge of patch-level foundation models, affect LRP performance. Can the authors share their thoughts on this?
[1] Lee, Yongju, et al. "Derivation of prognostic contextual histopathological features from whole-slide images of tumours via graph deep learning." Nature Biomedical Engineering (2022): 1-15.
[2] Song, Andrew H., et al. "Analysis of 3D pathology samples using weakly supervised AI." Cell 187.10 (2024): 2502-2520.
Questions
See the section on the weaknesses.
Limitations
Please see above
Details on LRP
We acknowledge that some descriptions are too condensed. Therefore, we will add a section in the appendix with text and figures to properly describe LRP.
Faithfulness experiments
We will clarify the faithfulness experiments and their motivation in the manuscript, with algorithmic details already provided in Appendix A.4. The primary goal of the faithfulness experiments is to evaluate the ordering of relevance scores (property 3 of evidence function in Definition 3.2). However, the results also reflect whether the explanation scores contain meaningful positive/negative evidence for the target class: if such information is present, we expect the model’s prediction to flip when all patches supporting the target class are excluded. In Figure 4, the model decision always flips when patches are excluded based on xMIL-LRP scores, whereas other methods show inconsistent results.
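For illustration, a minimal sketch of such a patch-removal check follows. The exact protocol is in Appendix A.4; the model interface (a MIL aggregator taking a `(1, n, d)` tensor of patch embeddings and returning class logits) and the removal schedule here are assumptions:

```python
import torch

@torch.no_grad()
def removal_curve(mil_model, embeddings, scores, target_class, steps=10):
    """Remove patches most-relevant-first and record the target-class logit.

    For a faithful explanation, the logit should drop quickly, and the
    prediction should flip once all positively scored patches are removed.
    """
    order = torch.argsort(scores, descending=True)  # most relevant first
    n = embeddings.shape[0]
    logits = []
    for k in torch.linspace(0, n - 1, steps).long().tolist():
        keep = order[k:]                            # drop the top-k patches
        out = mil_model(embeddings[keep].unsqueeze(0))
        logits.append(out[0, target_class].item())
    return logits
```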
Formalization of evidence function
The evidence function is crucial to xMIL. It serves as the ground truth, while the explanation method estimates it. We posit that each instance contains evidence for or against a bag label. Sometimes, the ground truth is (partly) known, such as through pathologist annotations in disease detection tasks. However, many biological associations remain unknown even to experts. Thus, relevance scores provide inexpensive annotation for each patch, as they estimate the ground truth. We will add more details about this to Sections 3.1 and 4.2.
Applying LRP to AttnMIL and TransMIL
We will allocate a section in the appendix for LRP details. In a general attention mechanism, let $x_k$ be the embedding vector of the $k$-th token and $a_{ij}$ the attention score between tokens $i$ and $j$. The output vector is $y_j = \sum_i a_{ij} x_i$. The AH rule of LRP treats attention scores as a constant weighting matrix during the backpropagation pass of LRP, not affecting the forward predictions. If $R(y_{j,d})$ is the relevance of the $d$-th dimension of $y_j$, the AH rule computes the relevance of the $d$-th feature of $x_i$ as $R(x_{i,d}) = \sum_j \frac{a_{ij}\, x_{i,d}}{\sum_{i'} a_{i'j}\, x_{i',d}}\, R(y_{j,d})$. This formulation can be directly applied to AttnMIL, and also adapted to a QKV attention block in a transformer, where $x_i$ is the embedding associated with the value representation. A figure will be added to illustrate how the AH and gamma LRP propagation rules combine to explain multi-head attention.
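As an illustration of the formula above, here is a minimal sketch of the AH rule for a single attention layer (the tensor shapes, the epsilon stabilizer, and the function name are ours, not the paper's implementation):

```python
import torch

def ah_rule(x, a, relevance_out, eps=1e-9):
    """Propagate relevance through y_j = sum_i a_ij * x_i (AH rule sketch).

    x:             (n, d) token embeddings entering the attention layer
    a:             (n, n) attention scores, a[i, j] weighting token i in y_j,
                          treated as constants during the backward pass
    relevance_out: (n, d) relevance R(y_{j,d}) of the attention outputs
    returns:       (n, d) relevance R(x_{i,d}) of the inputs
    """
    # Contributions z[i, j, d] = a_ij * x_{i,d}
    z = a.unsqueeze(-1) * x.unsqueeze(1)                # (n, n, d)
    denom = z.sum(dim=0, keepdim=True)                  # sum over i: (1, n, d)
    stab = torch.where(denom >= 0, torch.ones_like(denom), -torch.ones_like(denom))
    denom = denom + eps * stab                          # avoid division by zero
    # Distribute R(y_{j,d}) proportionally to each input's contribution
    return (z / denom * relevance_out.unsqueeze(0)).sum(dim=1)  # sum over j

# Relevance is conserved up to eps: the total input relevance matches the
# total output relevance, as expected for an LRP rule.
x = torch.randn(5, 8)
a = torch.softmax(torch.randn(5, 5), dim=0)   # columns sum to 1 over i
r = torch.randn(5, 8)
print(ah_rule(x, a, r).sum().item(), r.sum().item())
```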
Additive MIL and integrated gradients (IG)
Additive MIL (AddMIL) provides positive and negative class-wise evidence. However, AddMIL cannot model arbitrary instance interactions, which we consider crucial in histopathological prediction tasks (lines 109-112). Comparing post-hoc explanations and AddMIL is challenging due to the lack of ground truth for the heatmaps. The faithfulness experiments do not apply to AddMIL, as its explanations are faithful by design [9]. Furthermore, our manuscript focuses on explaining non-additive models, as they are widely used in the computational histopathology community (see e.g. [1-8]).
We ran additional experiments with AddMIL on the toy datasets, where ground truth evidence is available. We used the Additive Attention MIL implementation (with Adam optimizer) provided by Javed et al. [9], and the same training protocol in Sections 4.1 and A.2 of our paper. The results presented in Table B.1 indicate that AddMIL performed worse in all settings, making its explanations not competitive with the post-hoc explanation methods on AttnMIL and TransMIL. This supports our point that AddMIL may not be competitive in difficult prediction tasks (lines 126-128).
IG has been used to interpret MIL models in histopathology. We implemented IG, and our faithfulness evaluations (Table B.2) show that IG explanation scores are less faithful to model decisions compared to xMIL-LRP, especially for TransMIL. This underscores the importance of our work in rigorously assessing explanation methods for MIL models in histopathology. We will add these additional results to the manuscript.
Multi-class classification
xMIL-LRP is applicable to multi-class classification. All evaluated explanation methods in the paper, except attention, yield one heatmap per class. For example, the 4-Bags dataset in the toy experiments is a 4-class classification problem (Table 1 of the paper).
Regression and survival prediction
We do not explicitly evaluate regression tasks, but prior work shows how to apply LRP in regression [11], which can be adapted to histopathological MIL tasks. In survival prediction, outcomes are typically not predicted via regression due to censored data; instead, survival probabilities at event times are computed, resulting in Kaplan-Meier analyses. While attention heatmaps have been used to interpret survival models (e.g. [12]), explaining these models remains challenging since they are neither purely regression nor classification tasks. We are extending our work to survival analysis in an ongoing project and will address this in a future work section of the paper.
Feature encoders
A major advantage of our method is that propagation through the feature encoder is not required: the patch scores can be computed at the input level of the MIL aggregation. Thus, the feature encoder’s architecture does not affect the explanation quality. Possible differences in model predictions and heatmaps are due to the quality of the feature embedding. To address this, we repeated the histopathology experiments with the more recent UNI foundation model [13], with the same training parameters as for the best models with CTransPath (CTP) encoder. In our results (Table B.3), xMIL-LRP shows a robust performance in faithfulness experiments in comparison to the results with CTP backbone. Interestingly, while “Single” shows high faithfulness for the LUAD TP53, its performance for NSCLC is much worse in comparison to that with CTP.
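For illustration, a minimal sketch of this setup follows. Gradient × Input stands in here for the full LRP pass, and `encoder`/`mil_model` are assumed interfaces, not our actual implementation:

```python
import torch

@torch.no_grad()
def precompute_embeddings(encoder, patches):
    # The (frozen) feature encoder runs once, offline; it is never part of
    # the explanation's backward pass.
    return encoder(patches)                       # (n_patches, d)

def patch_scores(mil_model, embeddings, target_class):
    # Attribution starts at the input of the MIL aggregator, i.e., at the
    # patch embeddings, so the encoder architecture never matters here.
    emb = embeddings.detach().clone().requires_grad_(True)
    logit = mil_model(emb.unsqueeze(0))[0, target_class]
    logit.backward()                              # stops at the embeddings
    return (emb.grad * emb).sum(dim=-1)           # one score per patch
```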
References
See Author Rebuttal.
The paper introduces xMIL-LRP, an efficient solution for explainable Multiple Instance Learning (MIL) by incorporating layer-wise relevance propagation (LRP) into MIL. The authors further employ the AH-rule to define the "propagation rules" for relevance. Experiments are conducted on three toy datasets and four real-world Whole Slide Image (WSI) prediction tasks, resulting in improved faithfulness scores for the explanations.
Strengths
- The experimental setup on both toy datasets and real-world WSI tasks is robust, demonstrating the effectiveness of the proposed method.
- The paper is logically structured, with well-illustrated figures that aid in understanding the results.
- The development of explainable MIL methods is crucial for the deployment of these models in real-world applications, where understanding the model's decisions is important.
Weaknesses
Limited Novelty: The method primarily applies existing explainable AI techniques to the MIL domain, which limits its novelty.
Unclear Method Description: The definition of the relevance score via the attention score between neurons $i$ and $j$ is unclear. Attention-based MIL typically defines attention at the instance level, making it confusing how relevance is defined and propagated within this context.
Absence of Fair Comparison with Additive MIL: The performance of additive MIL is directly cited without conducting new experiments under the same conditions (data splits, feature extractors, etc.). This could lead to an unfair comparison and underestimation of additive MIL's performance.
Questions
Please refer to weaknesses.
One more question: considering that xMIL-LRP potentially provides more explainable results, does it affect the prediction performance of the original MIL model? How do the authors balance model explainability and prediction accuracy?
Limitations
N/A
Limited Novelty
Regarding your concerns about the novelty of applying existing XAI methods to MIL, we would like to clarify that it was not the aim of our work to develop a new explanation method for MIL. Instead, the novelty of our work extends across other dimensions:
- Methodological novelty: Despite attempts to apply XAI to MIL models in histopathology (e.g., [1-10]), there exists no formalism guiding the interpretation of the heatmaps and defining their desired properties. Instead, the heatmaps are often considered a prediction of assumed ground-truth instance labels, which we identified as problematic (Section 2.2). xMIL is a novel framework addressing this gap. Unlike previous works, it posits that bags possess a ground-truth evidence function that quantifies the instances’ impact on the bag label. Within xMIL, heatmaps estimate this ground truth, which makes interpreting the heatmaps straightforward and insightful.
- Empirical novelty: Many high-impact medical publications use attention heatmaps or Integrated Gradients to explain histopathological MIL models [1-8]. To our knowledge, no prior research has systematically evaluated whether these methods produce sensible results. Our extensive empirical evaluation on synthetic and real-world histopathology datasets is the first of its kind. It reveals that the widely used explanation methods may yield misleading results, as they often fail to reflect the model decisions faithfully. In contrast, xMIL-LRP sets a new state of the art for explainability in AttnMIL and TransMIL models in histopathology.
- Novelty in insight generation: Previous studies [9-10] conducted qualitative assessments of heatmaps on easy-to-learn datasets like CAMELYON or TCGA NSCLC. The insights gained in these settings are limited to model debugging, i.e., “Does the model focus on the disease area?” To our knowledge, we are the first to present a method generating heatmaps that enable pathologists to extract fine-grained insights about the model in a difficult biomarker prediction task.
Unclear Method Description
We apologize for any potential misunderstanding regarding the description of the attention mechanism used in the considered MIL models. We will clarify our description of the AH rule in lines 173-174. In a general attention mechanism, let $x_k$ be the embedding vector of the $k$-th token and $a_{ij}$ the attention score between tokens $i$ and $j$. The output vector is $y_j = \sum_i a_{ij} x_i$. The AH rule of LRP treats attention scores as a constant weighting matrix during the backpropagation pass of LRP, not affecting the forward predictions. If $R(y_{j,d})$ is the relevance of the $d$-th dimension of $y_j$, the AH rule computes the relevance of the $d$-th feature of $x_i$ as $R(x_{i,d}) = \sum_j \frac{a_{ij}\, x_{i,d}}{\sum_{i'} a_{i'j}\, x_{i',d}}\, R(y_{j,d})$. This formulation can be directly applied to AttnMIL, and also adapted to a QKV attention block in a transformer, where $x_i$ is the embedding associated with the value representation. A figure will be added to illustrate how the AH and gamma LRP propagation rules combine to explain multi-head attention.
Absence of Fair Comparison with Additive MIL
We agree that just mentioning the performance from the original paper without testing under the same conditions is unfair. We will remove the respective lines from the manuscript (lines 277-278).
A comparison between post-hoc explanations and Additive MIL (AddMIL) is inherently difficult, as we cannot compare heatmaps from different models without having a ground truth. The faithfulness experiments from Section 4.2 are not applicable to Additive MIL, as its explanations are faithful by design [9]. Furthermore, we would like to highlight that the focus of our manuscript is explaining non-additive models, as they are widely used in the computational histopathology community.
Regardless, we ran additional experiments with AddMIL on the toy datasets, where ground truth evidence is available. We used the Additive Attention MIL implementation (with Adam optimizer) provided by Javed et al. [9], and applied the same training protocol as described in Sections 4.1 and A.2 of our manuscript. The results are presented in Table B.1. The test performance of AddMIL was worse in all settings, resulting in explanations that are not competitive with the post-hoc explanation methods on AttnMIL and TransMIL. This supports our point that AddMIL may not achieve competitive performances in difficult prediction tasks (lines 126-128). We will add these results to the manuscript.
Effect of xMIL-LRP on prediction performance
xMIL-LRP and all the benchmarked methods in this manuscript are post-hoc explanation methods. Therefore, they do not impact the training process or prediction performance of the original model, as they are applied after the model is fully trained. Thus, balancing between model explainability and prediction accuracy is not a concern in this context. We did not investigate utilizing xMIL-LRP scores to further improve prediction performance, but may consider this in future works.
Additional comments
Many thanks for the thoughtful and constructive comments! We hope our responses have effectively addressed your concerns and showcased the novelty of our research. Given your acknowledgment of the relevance and soundness of our work, we kindly hope you might reconsider your rating. If you have any further questions or suggestions, we are happy to continue the discussion.
References
See Author Rebuttal.
Overview
We thank the reviewers for their comments and valuable feedback. We have addressed their questions and concerns and made the following additions and changes to our manuscript:
- More clarity regarding the contributions of the work (reviewers utz5 & DaUT): We discussed our contributions in detail where the reviewers asked about them.
- Comparisons to Additive MIL (all reviewers): We attached experimental comparisons to Additive MIL in the presence of available ground truth (Table B.1). This confirmed that xMIL-LRP outperformed all baselines, including additive MIL, across all three tasks. We also discussed how Additive MIL is conceptually out of the scope of our manuscript.
- Comparison to Integrated Gradients (IG) as an increasingly popular explanation approach in histopathological MIL papers (reviewer GyUV): We added IG to our faithfulness evaluation (Table B.2). It reached consistently lower faithfulness compared to xMIL-LRP, and performed poorly for TransMIL.
- Impact of the foundation model on the results (reviewer GyUV): We elaborated upon this point and conducted faithfulness experiments using a different histopathological foundation model as an ablation (Table B.3). We did not observe major changes in the results.
- Clarity of method description (reviewers utz5 & GyUV): We included a more precise description of the xMIL-LRP method. We will also release our code on GitHub upon publication, which will provide further clarity on the implementation details. We also expanded on the motivation and idea of the formalization of the evidence function in the xMIL definition. Finally, we appended a clarification on how our work extends to regression and survival prediction.
In summary, we provided clear additional evidence for the soundness of our evaluation and effectiveness of the xMIL framework. We further clarified the novelty of this contribution by formalizing xMIL using evidence functions and providing faithful patch-level relevance scores for models that are not inherently interpretable but widely used. As appreciated during the review, this is crucial for the deployment of MIL models in real-world applications.
References
To avoid redundancies, we collect all references from the individual rebuttals here.
[1] Lu et al. “Data-efficient and weakly supervised computational pathology on whole-slide images.” Nature Biomedical Engineering (2021). (ref. [6] of the original paper)
[2] Lu et al. "AI-based pathology predicts origins for cancers of unknown primary." Nature (2021). (ref. [28] of the original paper)
[3] Lipkova et al. “Deep learning-enabled assessment of cardiac allograft rejection from endomyocardial biopsies.” Nature Medicine (2022). (ref. [27] of the original paper)
[4] Chen et al. “Pan-cancer integrative histology-genomic analysis via multimodal deep learning.” Cancer Cell (2022).
[5] Wagner et al. "Transformer-based biomarker prediction from colorectal cancer histology: A large-scale multicentric study." Cancer Cell (2023). (ref. [29] of the original paper)
[6] Calderaro et al. “Deep learning-based phenotyping reclassifies combined hepatocellular-cholangiocarcinoma.” Nature Communications (2023).
[7] Song et al. "Analysis of 3D pathology samples using weakly supervised AI." Cell (2024).
[8] El Nahhas et al. “Regression-based Deep-Learning predicts molecular biomarkers from pathology slides.” Nature Communications (2024).
[9] Javed et al. "Additive MIL: Intrinsically interpretable multiple instance learning for pathology." NeurIPS (2022). (ref. [33] of the original paper)
[10] Early et al. “Model agnostic interpretability for multiple instance learning.” ICML (2022). (ref. [34] of the original paper)
[11] Letzgus et al. "Toward explainable artificial intelligence for regression models: A methodological perspective." IEEE Signal Processing Magazine (2022).
[12] Chen et al. "Multimodal co-attention transformer for survival prediction in gigapixel whole slide images". ICCV (2021).
[13] Chen et al. "Towards a general-purpose foundation model for computational pathology." Nature Medicine (2024). (ref. [42] of the original paper)
[14] Foulds & Frank. “A review of multi-instance learning assumptions.” The Knowledge Engineering Review (2010). (ref. [38] of the original paper)
The paper introduces xMIL-LRP for multiple instance learning (MIL) from Whole Slide Images (WSIs) by integrating layer-wise relevance propagation (LRP) into the MIL framework, thereby equipping the model's decision process with an interpretable module. The proposed xMIL-LRP provides a tool to fit LRP into any MIL framework used for aggregation/classification in WSI downstream tasks. Overall, the paper is well written and easy to follow.

The paper received mixed reviews. There is consensus among the reviewers that, while the idea of xMIL-LRP is limited in terms of technical novelty, it provides a feasible tool for the computational pathology field to transform any MIL aggregation model into an interpretable framework. Reviewers raised concerns about (a) the lack of motivation and explanation behind LRP for adoption in computational pathology, (b) how LRP differs from Additive MIL, (c) the limitation to binary classification of WSIs, (d) the deployment of existing explainable AI methods in MIL, which raises a concern about the novelty of the method, (e) fair comparison with other MIL approaches such as additive MIL, threshold-based MIL, and percentage MIL, (f) whether the explainability module affects the prediction performance of the original MIL model, (g) comparison to other interpretable MIL methods such as CLAM and ProtoMIL, and (h) the utility of different encoders for patch feature embedding.

The authors addressed several concerns during the rebuttal, elaborated on the novelty of their tool, and provided further comparative analysis results. The AC also finds that the proposed tool can be very useful for computational pathology practitioners and sees enough support from the reviewers to merit publication. The authors are highly encouraged to take advantage of the discussions raised by the reviewers, as highlighted above, in their final revision, drawing on both pre- and post-rebuttal comments.