PaperHub
Score: 6.1 / 10
Poster · 4 reviewers
Lowest 3 · Highest 4 · Std 0.4
Individual ratings: 4, 3, 3, 3
ICML 2025

Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images

OpenReview · PDF
Submitted: 2025-01-09 · Updated: 2025-07-24

Abstract

Keywords
Computational Pathology · Whole Slide Image · Cancer Diagnosis and Prognosis

Reviews and Discussion

Review (Rating: 4)

The paper introduces Querent, a query-aware long contextual modeling framework for whole slide image (WSI) analysis, addressing the challenge of computational efficiency in gigapixel images. Unlike standard transformer architectures with quadratic complexity, Querent dynamically selects relevant regions for each patch using region-wise metadata summarization and importance estimation. This enables efficient self-attention while preserving long-range dependencies. The method outperforms existing approaches in biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across multiple WSI datasets. Empirical results show that Querent achieves state-of-the-art accuracy while significantly reducing computational costs.
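
To make the described mechanism concrete, a minimal sketch of query-aware region selection is shown below (illustrative only; function and variable names are assumptions, not the paper's implementation). Each query patch cheaply scores all region summaries, keeps the top-k regions, and attends only over patches from those regions.

```python
import torch

def query_aware_attention(patch_feats, region_meta, region_ids, top_k=8):
    """Illustrative sketch of query-aware selective attention.

    patch_feats: (N, D) patch embeddings for one WSI.
    region_meta: (R, D) one summary vector per region.
    region_ids:  (N,) region index of each patch.
    """
    N, D = patch_feats.shape
    # Cheap importance estimate: query vs. region metadata similarity (R scores per query, not N).
    scores = patch_feats @ region_meta.T                  # (N, R)
    top_regions = scores.topk(top_k, dim=-1).indices      # (N, top_k)

    out = torch.empty_like(patch_feats)
    for i in range(N):                                    # per-query sparse attention
        mask = torch.isin(region_ids, top_regions[i])     # patches in the selected regions
        keys = patch_feats[mask]                          # (M, D), with M << N
        attn = torch.softmax(patch_feats[i] @ keys.T / D**0.5, dim=-1)
        out[i] = attn @ keys
    return out
```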

Update after rebuttal

I think the authors have addressed my concerns well, so I will raise my score.

Questions for Authors

  1. In the second set of ablation experiments, why does the Estimation Side Network perform worse than the Random Region Selection, which serves as the lower bound?
  2. In the final step of Lemma B.4, since $q$ and $\hat{q}$ are not in the same feature space, how can they be combined to yield the final result? Could you provide a detailed derivation?
  3. In the first set of ablation experiments and in Appendix G.1, it is mentioned that distance matrices are calculated. After summarization, each region has a feature vector, but before summarization, since a region contains multiple patches, it has multiple feature vectors. How, then, are the distance matrices calculated in this case? Could you elaborate?
  4. Is the $\alpha$ in Theorem 3.1 the same as the $\alpha$ in Equation 8?

Claims and Evidence

The authors have demonstrated, through theoretical analysis and experimental validation, that their proposed query-aware attention mechanism possesses expressiveness comparable to that of full self-attention, while achieving greater computational efficiency. Moreover, the effectiveness of the region-level metadata summarization and importance estimation modules introduced by the authors has also been empirically substantiated.

Methods and Evaluation Criteria

I believe that the methods and evaluation criteria proposed by the authors are well-aligned with the problem at hand.

Theoretical Claims

Upon reviewing the authors' theoretical proofs, I have some reservations and cannot guarantee their complete accuracy.

Experimental Designs or Analyses

The experimental design and analysis conducted by the authors are reasonably sound.

Supplementary Material

I have specifically reviewed Appendices B, C, and G.

Relation to Existing Literature

The paper builds on MIL and transformer-based WSI analysis, addressing efficiency challenges seen in TransMIL and HIPT. While prior work uses linear approximations (Shao et al., 2021) or local-global attention (Chen et al., 2022), Querent introduces query-aware attention, dynamically selecting relevant regions, inspired by context-dependent tissue relationships (Heindl et al., 2015). This improves efficiency while maintaining long-range modeling, advancing adaptive sparse attention in pathology AI.

Essential References Not Discussed

No, there aren’t.

Other Strengths and Weaknesses

Strengths:

  1. The paper introduces query-aware attention, a novel approach to dynamically selecting relevant regions in WSIs, improving upon rigid local-global attention and linear approximations.
  2. By significantly reducing computational costs while maintaining long-range dependencies, Querent advances scalable WSI analysis, impacting biomarker prediction, cancer subtyping, and survival analysis.
  3. Strong performance across 11 datasets and multiple CPath tasks demonstrates robustness, outperforming state-of-the-art MIL and transformer-based models.

Weaknesses:

  1. The assumptions in the theoretical proofs may be challenging to satisfy in practical code implementation, which undermines their persuasiveness. For instance, the authors did not specify how to ensure that the neural networks $f_{min}$ and $f_{max}$ satisfy $L$-Lipschitz continuity. Additionally, the fulfillment of the four conditions in Theorem B.6 during actual network training was not addressed.
  2. Could you provide a comparison of training and inference times between this method and other networks? How does the training convergence speed of this method fare? In practical inference for WSI, the most time-consuming part is likely the extraction of patch features, with aggregation taking up a relatively small portion of the time. How much does this method improve the overall inference time compared to other methods during inference?

Other Comments or Suggestions

It appears that there is an error in Equation 12 (NLL survival loss) in the appendix. The second and third terms on the right side of the equation should be preceded by a minus sign rather than a plus sign. Additionally, the second term should be $y^{(i)}_{j-1}$ instead of $y^{(i)}_j - 1$.
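
For reference, the commonly used discrete-time NLL survival loss has all three terms negated; a sketch of that standard form is given below (the exact indexing and notation in the paper's Equation 12 may differ):

$$
\mathcal{L}_{\mathrm{NLL}} \;=\; -\,c\,\log S(Y \mid X)\;-\;(1-c)\,\log S(Y-1 \mid X)\;-\;(1-c)\,\log h(Y \mid X),
$$

where $h(\cdot \mid X)$ is the discrete hazard, $S(t \mid X) = \prod_{k \le t}\bigl(1 - h(k \mid X)\bigr)$ is the survival function, $Y$ is the discretized event time, and $c$ is the censoring indicator.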

Author Response

We thank the reviewer for the valuable suggestions. We address the reviewer's questions one by one below:

Weakness

The assumptions in the theoretical proofs may be challenging to satisfy ...

We appreciate the reviewer's concerns. Regarding the Lipschitz continuity of $f_{min}$ and $f_{max}$: while strict $L$-Lipschitz continuity is a common theoretical assumption used to bound approximation errors, in practice we implement these functions as single-layer perceptrons with ReLU activations. Moreover, we apply regularization techniques such as normalization and weight clipping during training to encourage Lipschitz-like behavior, which empirically ensures that the projections behave in a sufficiently smooth manner for our theoretical guarantees to hold approximately.
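
For illustration, a minimal sketch of one way to encourage such Lipschitz-like behavior is given below (hypothetical code, not the paper's implementation): a single-layer perceptron whose weights are clipped after each optimizer step, which bounds the layer's Lipschitz constant.

```python
import torch
import torch.nn as nn

class ClippedProjection(nn.Module):
    """Single-layer perceptron with ReLU; weight clipping encourages
    Lipschitz-like behavior (illustrative sketch only)."""

    def __init__(self, in_dim: int, out_dim: int, clip_value: float = 0.5):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)
        self.clip_value = clip_value

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x))

    @torch.no_grad()
    def clip_weights(self) -> None:
        # Call after each optimizer.step(); bounding the weight magnitudes
        # bounds the layer's Lipschitz constant.
        self.fc.weight.clamp_(-self.clip_value, self.clip_value)
```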

As for the four conditions in Theorem B.6, they are idealized guidelines to understand the behavior of our query-aware selective attention mechanism. In our implementation, we approximate these conditions by: (1) using a large hidden dimension (e.g., $d = 512$) to meet JL lemma requirements; (2) selecting an appropriate number of regions based on spatial decay analysis; and (3) designing region sizes (e.g., $K = 16, 24$) that balance the need for a small diameter with computational efficiency. Although perfect adherence is challenging, our parameter choices, guided by validated performance, effectively control the approximation error and preserve the key theoretical properties, as supported by our empirical results.

Could you provide a comparison of training and inference times ...

We appreciate this question on computational efficiency. In our experiments (see Figure 2 in the anonymous link), Querent trains in 72.33s per batch — faster than more complex models like HIPT (473.49s) yet slightly slower than simpler baselines. It converges in a similar number of epochs as other transformer-based methods while achieving state-of-the-art accuracy. Notably, Querent has the lowest memory footprint (2286 MB) among the compared methods. For inference, Querent processes a slide in 0.1328s.

As the reviewer correctly points out, patch feature extraction (typically 2-3 minutes per slide) dominates the WSI processing pipeline, while feature aggregation accounts for less than 1% of the total time. We believe that the slight increase in computation time compared to simpler methods is a worthwhile trade-off given Querent's state-of-the-art performance across all tasks.

Other Comments or Suggestions

It appears that there is an error in Equation 12 ...

We thank the reviewer for the careful review of Supplementary Materials. We will correct this equation in the revised manuscript to ensure mathematical accuracy.

Questions

In the second set of ablation experiments, why does the Estimation Side Network ...

This counter-intuitive result stems from fundamental limitations in the Estimation Side Network approach. By attempting to predict region importance independently without considering query context, this network struggles with optimization challenges and fails to capture the relational information critical for accurate importance assessment. It also tends to overfit to region patterns seen during training. Random Region Selection, while simple, provides diverse contextual sampling that occasionally includes relevant regions by chance, avoiding biased selection. Our query-aware approach resolves these issues by dynamically assessing region importance relative to each specific query, leading to significantly better performance than both alternatives. We will clarify this explanation in the revised manuscript.

In the final step of Lemma B.4, ...

We thank the reviewer for the insightful comment regarding the feature space transition in Lemma B.4. To clarify, although $q$ and $\hat{q}$ reside in different feature spaces, the projection functions are assumed to be Lipschitz continuous, which allows us to control the distortion when moving from the original space to the projected space, with the detailed derivation shown in Figure 3 in this anonymous link.

In the first set of ablation experiments and in Appendix G.1 ...

In our distance matrix calculations, pre-summarization distances between regions are computed by first flattening all patches in each region into a single vector, then calculating Euclidean distances between these region vectors. Post-summarization distances are simply the Euclidean distances between the metadata vectors.
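
For illustration, a minimal sketch of this computation is given below (hypothetical variable names; assumes patch features of shape (num_regions, patches_per_region, feature_dim)):

```python
import numpy as np
from scipy.spatial.distance import cdist

def region_distance_matrices(patch_feats: np.ndarray, metadata: np.ndarray):
    """patch_feats: (R, P, D) patch features per region (pre-summarization).
    metadata:      (R, M) summarized metadata vector per region (post-summarization).
    Returns the pre- and post-summarization region-to-region Euclidean distance matrices."""
    R = patch_feats.shape[0]
    # Pre-summarization: flatten all patches in each region into one long vector.
    flat = patch_feats.reshape(R, -1)                       # (R, P * D)
    pre_dist = cdist(flat, flat, metric="euclidean")        # (R, R)
    # Post-summarization: distances between the region metadata vectors.
    post_dist = cdist(metadata, metadata, metric="euclidean")
    return pre_dist, post_dist
```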

Is the $\alpha$ in Theorem 3.1 the same as ...

No, these are different quantities. In Theorem 3.1, $\alpha$ is the exponential decay rate parameter for attention scores with spatial distance. In Equation 8, $\alpha$ represents normalized attention weights for feature aggregation. We'll use distinct notation in our revision to prevent confusion.

Reviewer Comment

Thanks for the feedback. I think the authors have addressed my concerns well, so I will raise my score.

Author Comment

We appreciate the reviewer for raising the score and we are glad that our rebuttal effectively addressed concerns. Following the reviewer's constructive suggestions, we will further polish our manuscript and include all revisions mentioned above for better readability.

Review (Rating: 3)

This paper introduces Querent, a framework for dynamic long-range contextual modeling of gigapixel WSIs through the adaptive determination of patch relationships. The key idea is to maintain the modeling power of full self-attention while achieving computational efficiency through dynamic sparsification. The method adaptively predicts which surrounding regions are most relevant for each patch, enabling focused yet unrestricted attention computation only with potentially important contexts. By using efficient region-wise metadata computation and importance estimation, their approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations. The effectiveness of the proposed method is validated on benchmark datasets, showing improvements over existing approaches.

Questions for Authors

  • If a comprehensive global attention mechanism were employed instead of this approximate variant, what impact would it have on performance?
  • Would the adoption of alternative patch feature extractors—such as CHIEF, UNI/UNI2, Virchow/Virchow2, PRISM, or GigaPath—yield analogous conclusions?

Claims and Evidence

The authors claim that their method outperforms existing techniques in both efficiency and accuracy when analyzing WSIs. The experimental results presented partially support these claims, showing improvements in key metrics.

Methods and Evaluation Criteria

The proposed query-aware attention mechanism dynamically adapts to the unique context of each patch in gigapixel WSIs, preserving global attention while substantially reducing computational complexity. This results in enhanced computational efficiency, making it suitable for the intended application.

Theoretical Claims

The paper includes theoretical justifications for the proposed approach, particularly in the modeling techniques used. The proofs and derivations appear sound.

Experimental Designs or Analyses

The experimental design and analyses are sound, offering lucid delineations of datasets, metrics, and methodologies. However, the comparative methods delineated in the paper appear to diverge from the tasks reported in the original studies. For instance, the performance of the RRT-MIL method, as reported on the TCGA-BRCA dataset, pertains to a sub-typing task, whereas this study utilizes the BRCA dataset for survival prediction. A comparative evaluation of the sub-typing task’s performance could enhance the persuasiveness of the experimental findings.

Supplementary Material

The supplementary material provides additional experimental results and technical details that complement the main text. This material enhances the paper's comprehensiveness and provides valuable insights for replication and further study.

Relation to Existing Literature

The paper builds upon existing work in sequence modeling, introducing novel adaptations for WSIs. It contributes to the literature by addressing specific challenges associated with WSI and proposing a method that integrates context-aware mechanisms.

Essential References Not Discussed

The paper covers relevant literature.

Other Strengths and Weaknesses

Strengths:

  • The integration of query-aware mechanisms with dynamic sequence modeling in WSI analysis.
  • The paper is clearly written and well-structured.
  • The proposed methodology has yielded commendable performance across a diverse array of tasks and datasets.

Weaknesses:

  • The primary contribution of the paper lies in its ability to reduce computational complexity while preserving global attention. As evidenced in Tables 1 and 2, the proposed methods outperform existing approaches; however, the analysis appears somewhat deficient. For instance, the absence of results derived from global attention computations raises questions about whether the superior performance of the proposed method stems predominantly from the MIL paradigm, the extraction of patch features via PLIP, or the novel Dynamic Attention mechanism introduced in the text.

Other Comments or Suggestions

  • Some sections, particularly the theoretical derivations, could be elaborated for better clarity. For instance, the meaning of B in Theorem 3.1 should be promptly elucidated.
  • Exploring the integration of the proposed method with other advanced models, such as vision transformers, could be a valuable direction.

Ethics Review Issues

N/A

Author Response

We thank the reviewer for these constructive comments. We address the reviewer's questions one by one below:

Experimental Designs or Analyses

However, the comparative methods delineated in the paper appear to diverge from the tasks ...

We appreciate the reviewer's concern regarding task alignment between our evaluation and original studies. We want to clarify that our evaluation framework was intentionally designed to be comprehensive, spanning multiple computational pathology tasks (biomarker prediction, gene mutation prediction, cancer subtyping, and survival prediction) to demonstrate the robustness and generalizability of our method across diverse clinical applications. While some baseline methods like RRT-MIL were originally evaluated on specific tasks, we adapted all methods for multiple tasks using standardized feature extraction and training protocols to ensure fair comparison. This approach provides stronger evidence of our method's versatility than limiting evaluation to a single task type would, as demonstrated by Querent's consistent performance advantages across all tasks in Tables 1 and 2. We believe our comprehensive evaluation strategy enhances rather than diminishes the persuasiveness of our experimental findings by showing our approach's effectiveness across the spectrum of computational pathology applications.

Weakness

however, the analysis appears somewhat deficient. For instance, the absence of results derived from global attention ...

We appreciate the reviewer's concern about determining the source of our method's superior performance. It's worth noting that directly applying global self-attention to tens of thousands of WSI patches leads to out-of-memory problems, which explains why existing methods use alternatives like local-global or linear attention approximations. To respond to the reviewer's comment directly, we implemented a full global attention approach using FlashAttention (which achieves linear memory complexity) and compared it with our method with the same PLIP feature extractor and same training protocol (see Table 1 in this anonymous link). The experimental results show that Querent consistently outperforms the global attention approach (FlashMIL). This confirms that our performance improvements stem specifically from the proposed query-aware dynamic attention mechanism rather than other components, as these elements were identical across both compared methods.

Other Comments or Suggestions

Some sections, particularly the theoretical derivations, could be elaborated ...

In response to this comment, we acknowledge that some elements of Theorem 3.1 would benefit from additional clarification. Specifically, the parameter $B$ in Theorem 3.1 represents the bound on input norms ($\|q_i\|, \|K_j\| \leq B$), which is critical for establishing the error bounds of our query-aware attention approximation. This parameter is properly defined in Lemma B.4 of the appendix but should have been explicitly introduced in the main text for clarity. We will ensure this and other theoretical elements are more thoroughly explained in the revised version to enhance readability and comprehension of our technical contributions.

Exploring the integration of the proposed method with other advanced models ...

We agree with the reviewer that integrating our query-aware dynamic attention mechanism with advanced vision transformer architectures represents a promising direction for future work. Our current implementation focuses on efficient modeling of long-range dependencies in gigapixel images, but the core principles of our approach — dynamic region-level metadata summarization and importance-based selective attention — could be readily adapted to enhance various vision transformer frameworks. We appreciate this valuable suggestion and plan to explore such integrations in our future research.

Questions

Should a comprehensive global attention mechanism be employed ...

We have responded to this comment with detailed interpretation in the Weakness section.

Would the adoption of alternative patch feature extractors—such ...

We appreciate this insight. As demonstrated in Table 2 in this anonymous link, we conducted additional experiments using state-of-the-art foundation models (Virchow and CHIEF). Results show that Querent consistently outperforms other methods with these advanced feature extractors, confirming that our method's superiority stems from its long contextual modeling capability rather than the choice of feature extractor. This indicates Querent's contribution is complementary to advances in foundation models and will continue to provide advantages as foundation models evolve.

We welcome any further questions or clarifications regarding our rebuttal and are happy to provide additional details if needed.

Reviewer Comment

Thanks for the responses, which have addressed most of my concerns. After carefully reading all the comments and responses, I decide to raise the score to weak accept.

Author Comment

We thank the reviewer for raising the score. Following the reviewer's constructive suggestions, we will polish our manuscript further and include all revisions mentioned above for a better presentation of our method.

Review (Rating: 3)

This paper introduces Querent, a query-aware dynamic modeling framework for analyzing whole-slide images in computational pathology. To address the computational inefficiency of standard transformer architectures, which struggle with the quadratic complexity of self-attention in large-scale WSI analysis, the authors propose a novel approach that dynamically adapts attention computation to the most relevant regions for each query patch. The framework includes: 1) Region-Level Metadata Summarization, 2) Query-Aware Attention Mechanism, and 3) Efficient Importance Estimation. Experiments on biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis demonstrate that Querent achieves state-of-the-art performance while significantly reducing computational overhead, enabling efficient processing of gigapixel WSIs.

Questions for Authors

  1. The authors could visualize the original WSI regions corresponding to the K regions selected by Querent, together with their min/max features, to demonstrate the accuracy of the method. For example, visualizing the metadata feature score for each region could help illustrate that the model indeed selects influential patches.
  2. It is not clear how the method can avoid the situation where the top K regions miss the regions containing key information.
  3. It is uncertain what the advantages of the proposed method are compared with the latest methods that can also achieve efficient and fast classification through Mamba (such as MambaMIL and MamMIL). MamMIL can also perceive the topological structures among the instances and incorporate short-range feature interactions.

Claims and Evidence

Yes, the claims made in the manuscript are largely supported by clear and convincing evidence.

Methods and Evaluation Criteria

Yes, the proposed method(s) and/or evaluation criteria (e.g., benchmark datasets) are appropriately justified for the current problem or application.

Theoretical Claims

Yes, I reviewed the principles of the method proposed in the paper, primarily focusing on Query-Aware Attention Approximation. I examined the detailed derivation provided in the paper, and after referencing the cited literature (Kaban et al., 2015), I found the methodological derivation to be reasonable.

Experimental Designs or Analyses

Yes, I have checked the validity of experimental designs. The experimental design and analysis conducted for the proposed method in the article are methodologically sound and empirically valid.

Supplementary Material

I have reviewed the supplementary material in Appendix A.

Relation to Existing Literature

The paper introduces query-aware sparse attention, which dynamically selects relevant regions for each query patch, maintaining the expressive power of full self-attention while achieving near-linear computational complexity. This aligns with findings from prior studies (e.g., HIPT, Chen et al., 2022) that proposed various attention mechanisms and region selection strategies to improve computational efficiency and model performance in WSI analysis.

Essential References Not Discussed

To the best of my knowledge, I think the authors have already provided sufficient explanation and discussion.

Other Strengths and Weaknesses

Strengths:

  1. The paper introduces a novel query-aware attention mechanism that dynamically adapts to the context of each patch, addressing the computational bottleneck of standard transformers in large-scale WSI analysis.
  2. The paper provides theoretical guarantees for the query-aware attention mechanism, proving its error bounds in approximating full self-attention.

Weaknesses:

The performance of the model depends on the quality of the region-level metadata. In the computation of region-level metadata, using min/max/mean/mean-std features to summarize the patch features within a region may lead to the loss of important local information, especially in regions with high tissue heterogeneity or significant noise.

Other Comments or Suggestions

Some minor issues:

  1. The experiments did not leverage features from the latest foundation models (e.g., UNI and CONCH). Incorporating these advanced features could potentially reduce the performance disparity between existing MIL methods and Querent.
  2. The region size in Querent significantly impacts performance, potentially limiting its generalizability across diverse WSI datasets. When applied to new datasets, it may require careful tuning, posing challenges for real-world applications.

Author Response

We thank the reviewer for the constructive comments. We respond to the comments one by one as follows:

Weakness

The performance of the model depends on the quality of the region-level metadata ...

We acknowledge that any approach to summarizing region-level metadata, whether it be min/max, mean, or mean-std, will inherently lose some local information. This trade-off is necessary to achieve computational efficiency when handling gigapixel whole slide images. In our work, the min-max strategy was specifically chosen because it captures the extreme values of feature distributions, which are critical for preserving discriminative patterns — especially in heterogeneous tissues. Our additional analysis (see Figure 1 in this anonymous link), using the "Adjusted Average Distance to Summary" metric, shows that the min-max approach outperforms other summarization methods, particularly in high-heterogeneity regions, where it achieves significantly lower error compared to mean or mean-std methods.
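
For illustration, a minimal sketch of region-level summarization under these strategies is given below (hypothetical code; the paper's exact operators and dimensions may differ):

```python
import torch

def summarize_region(patch_feats: torch.Tensor, strategy: str = "min-max") -> torch.Tensor:
    """patch_feats: (P, D) features of the patches in one region.
    Returns a fixed-size metadata vector summarizing the region."""
    if strategy == "min-max":
        # Extremes of the per-dimension feature distribution, concatenated.
        return torch.cat([patch_feats.min(dim=0).values,
                          patch_feats.max(dim=0).values])      # (2D,)
    if strategy == "mean":
        return patch_feats.mean(dim=0)                          # (D,)
    if strategy == "mean-std":
        return torch.cat([patch_feats.mean(dim=0),
                          patch_feats.std(dim=0)])              # (2D,)
    raise ValueError(f"unknown strategy: {strategy}")
```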

Furthermore, our query-aware attention mechanism complements the summarization by dynamically selecting and weighting the most relevant patches, which helps mitigate the loss of local information and filters out noise. Although some information loss is unavoidable with any summarization method, our experimental results demonstrate that the min-max approach, in combination with our selective attention, provides an effective and robust representation that leads to superior performance across all tasks, even in challenging heterogeneous and noisy scenarios.

Other Comments or Suggestions

The experiments did not leverage features from latest foundation models ...

We appreciate this suggestion. As demonstrated in Table 2 in this anonymous link, we conducted additional experiments using state-of-the-art foundation models (CHIEF and Virchow). Results show that Querent consistently outperforms other methods with these advanced feature extractors, confirming that our method's superiority stems from its long contextual modeling capability rather than the choice of feature extractor. This indicates Querent's contribution is complementary to advances in foundation models and will continue to provide advantages as foundation models evolve.

The region size in Querent significantly impacts performance, potentially ...

We appreciate the reviewer's comment on region size impact. While region size is a hyperparameter, our ablation studies (Figure 5) show that moderate-sized regions (16-24) consistently deliver strong performance across diverse datasets. This pattern aligns with pathological intuition, i.e., region size should capture meaningful local context without diluting distinctive tissue patterns. This provides a reliable starting point that significantly narrows the hyperparameter search space, enhancing Querent's practical applicability without extensive tuning.

Questions

The method can visualize the original WSI corresponding to the K regions selected ...

We thank the reviewer for this great suggestion and will include visualizations in the revised version.

It is not clear how the method can avoid the situation where the top K ...

We address this important concern through two mechanisms: (1) our region importance estimation algorithm provides theoretical guarantees (Theorem 3.1) that selected regions are at most $2\epsilon$-suboptimal compared to the true top-K regions, ensuring minimal information loss; and (2) our min-max summarization strategy (superior in Fig. 4, p<0.005) effectively captures extreme feature distributions, making the metadata highly discriminative for identifying diagnostically relevant areas. Our ablation studies confirm that our approach significantly outperforms random region selection (Table 3, 8.9% accuracy improvement on UBC-OCEAN), demonstrating that our method reliably identifies regions containing key diagnostic information.

It is uncertain what the advantages of the proposed method are compared with the latest methods ... through Mamba ...

We directly compare with both MambaMIL and MamMIL (see Table 3 in this anonymous link) and Querent consistently outperforms these methods across all metrics and datasets.

The fundamental difference between Querent and these methods lies in our query-aware dynamic modeling approach. While MambaMIL uses sequence reordering and MamMIL employs graph-based representations with MST, both still process all patches with predetermined patterns. In contrast, Querent adaptively determines which surrounding regions are most relevant for each patch based on content, focusing computational resources only where needed.

We welcome any further questions or clarifications regarding our rebuttal and are happy to provide additional details if needed.

Reviewer Comment

The authors address my concerns in the rebuttal and I'll retain my score.

Author Comment

We appreciate the reviewer for these constructive suggestions and we are glad that our rebuttal effectively addressed concerns. We will further polish our manuscript and include all revisions/discussions mentioned above for better readability.

Review (Rating: 3)

To alleviate the $O(n^2)$ complexity of self-attention when modeling WSIs, this paper introduces a query-based large-region pruning method to replace linear-attention and local-global attention mechanisms. By ignoring regions that are irrelevant to the current patch, all computation between the current patch and the patches in those regions can be pruned. The experimental evaluations demonstrate the method's computational efficiency and performance effectiveness.

Questions for Authors

No further question.

Claims and Evidence

The claim in lines 023-025 (abstract), 'the query-aware long contextual dynamic modeling framework, which maintains the expressive power of full self-attention while achieving practical efficiency', is not evident. If full self-attention is maintained, how can the computation be sped up? I think this expression should be refined.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria make sense.

Theoretical Claims

I have reviewed all the proofs and found no issues.

Experimental Designs or Analyses

I have checked the experimental design. An issue with the experiments in this paper is that the performance of full self-attention is not included. To the best of my knowledge, full self-attention can be implemented via FlashAttention to avoid out-of-memory problems in WSI tasks.

Supplementary Material

I have reviewed the computational complexity part of the supplementary material.

Relation to Existing Literature

The method of this paper may also be applicable to other tasks that require long-sequence modeling with Transformers, e.g., document-level language understanding and AI4Science tasks with long DNA and RNA sequences.

Essential References Not Discussed

I find that all the essential references are discussed.

Other Strengths and Weaknesses

Strengths: The proposed method is novel and can greatly speed up Transformer-based WSI modeling.

Weaknesses:

  1. The claim on the relationship between 'Querent' and 'full self-attention' should be further discussed.
  2. FlashAttention (which computes full self-attention with linear memory and quadratic time cost) should be compared.
  3. A very important issue: there seems to be no explanation or motivation for why the method can improve the results. If the motivation is as in weakness 1), then weakness 2) should be validated in the rebuttal. If full self-attention (implemented via FlashAttention) cannot reach results comparable to Querent, how can this be explained?

Other Comments or Suggestions

The authors should provide more discussion of FlashAttention. FlashAttention has linear memory cost and quadratic time cost, but its speed is accelerated through hardware-aware kernel optimizations. Can your method be combined with it? By how much does your method surpass it?

Author Response

We thank the reviewer for the valuable comments. We respond with detailed interpretations as follows:

Claims and Evidence

The claim in line 023-025 (abstract) 'the query-aware ...

We should clarify that Querent provides a theoretically bounded approximation of full self-attention rather than claiming it maintains the exact same expressive power. As demonstrated in Theorem 3.1, our approach maintains the expressiveness within a small constant bound of full self-attention while significantly reducing computational complexity. We will revise this wording in the final version to be more precise: "Querent achieves a theoretically bounded approximation of full self-attention and meanwhile delivers practical computational efficiency for modeling gigapixel images."

Relation with Full Self-Attn (FlashAttn)

Since the reviewer's major concern is the necessity of developing a new dynamic attention pattern instead of directly using Flash Attention for full self-attention implementation, here we address these concerns step by step.

1. Comparison with Full Self-Attention via FlashAttention
Following the reviewer's suggestion, we have implemented a full self-attention-based MIL model (FlashMIL) using FlashAttention to enable the processing of long sequences without memory limitations. Specifically, our implementation applied 4 flash-attn layers to model the WSI patch sequence, followed by a mean operation to obtain the slide-level representation for prediction. The comparison results on our three classification datasets are shown in Table 1 in this anonymous link. While FlashMIL achieves comparable performance on the BCNB-ER dataset, Querent significantly outperforms it on the more complex TCGA-LUAD TP53 and UBC-OCEAN datasets. This demonstrates that our method provides benefits beyond just addressing memory efficiency.
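
For illustration, a minimal sketch of such a FlashAttention-backed MIL baseline is given below (hypothetical code, not the authors' implementation; PyTorch's scaled_dot_product_attention is used as a stand-in that can dispatch to FlashAttention kernels when available):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlashMILSketch(nn.Module):
    """Full self-attention MIL baseline: a few attention layers over all patches,
    then mean pooling to a slide-level prediction (illustrative sketch only)."""

    def __init__(self, dim=512, heads=8, layers=4, num_classes=2):
        super().__init__()
        self.qkv = nn.ModuleList([nn.Linear(dim, 3 * dim) for _ in range(layers)])
        self.proj = nn.ModuleList([nn.Linear(dim, dim) for _ in range(layers)])
        self.heads = heads
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                         # x: (B, N, dim) patch features
        B, N, D = x.shape
        h, hd = self.heads, D // self.heads
        for qkv, proj in zip(self.qkv, self.proj):
            q, k, v = qkv(x).chunk(3, dim=-1)
            q, k, v = (t.view(B, N, h, hd).transpose(1, 2) for t in (q, k, v))
            # Dispatches to a memory-efficient / FlashAttention kernel when available.
            attn = F.scaled_dot_product_attention(q, k, v)
            x = x + proj(attn.transpose(1, 2).reshape(B, N, D))
        return self.head(x.mean(dim=1))           # mean-pool patches -> slide logits
```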

2. Relationship Between Querent and Flash Attention
While both Querent and FlashAttention address the computational challenges of self-attention, they do so through fundamentally different approaches. FlashAttention optimizes the implementation of full self-attention through IO-aware algorithms and hardware optimizations, but still computes attention between all pairs of patches. In contrast, Querent introduces a context-dependent attention mechanism that dynamically identifies and focuses only on the most relevant regions for each query patch.

The superior performance of Querent over FlashMIL can be explained by this contextual selectivity, which serves as an implicit regularization mechanism. By focusing only on relevant regions, Querent filters out noise and irrelevant information that could potentially confuse the model, especially in highly heterogeneous WSIs. This is particularly important in computational pathology, where diagnostically relevant features may be sparsely distributed across the gigapixel image.

3. Why Querent Improves Results Beyond Memory Efficiency
The improved performance of Querent over the FlashAttention-based implementation can be attributed to several factors. First, the context-aware attention dynamically determines which surrounding regions are most relevant for each patch, allowing Querent to adapt to the heterogeneous nature of WSIs, where different tissue types require different contextual considerations. Second, selective attention acts as a form of implicit regularization by reducing the influence of irrelevant or noisy patches, which is particularly beneficial in weakly-supervised settings with limited training data. Third, while reducing computational overhead, our min-max region metadata approach ensures that potentially important long-range dependencies are still captured, unlike fixed local-global attention patterns that make strong assumptions about which spatial relationships matter. These advantages explain why Querent outperforms Flash-Attention-optimized self-attention.

4. Compatibility with FlashAttention
Regarding the potential combination of Querent with FlashAttention: Yes, our method is compatible with and complementary to FlashAttention. While FlashAttention optimizes how attention is computed through IO-aware algorithms, Querent determines which attention computations are most valuable to perform. A combined approach could leverage FlashAttention's efficiency for computing the selected region attention in our Step 3 (Query-Aware Selective Attention), potentially providing even greater computational benefits. This represents an interesting direction for future work.

Our current implementation already demonstrates significant efficiency gains over standard attention (as shown in Figure 6), requiring only ~1% of the memory and ~5% of the computational cost for 100k patches. Even compared to FlashAttention, Querent offers advantages in computational complexity (near-linear vs. quadratic time complexity) while achieving superior performance on CPath tasks.

Reviewer Comment

The rebuttal has resolved all my issues, and I will keep the initial rating. I believe that the authors will include these discussions in their final version.

Author Comment

We thank the reviewer for these valuable comments and we are glad that our rebuttal effectively addressed concerns. We will polish our manuscript further and include all revisions mentioned above for better readability.

Final Decision

This paper proposes to improve whole slide prediction through a dynamic attention mechanism. The strength of the method is tested on various downstream tasks. All reviewers are positive about the paper and are satisfied with the rebuttal.