7.3

/10

Poster4 位审稿人

最低4最高5标准差0.5

3.8

置信度

创新性2.8

质量2.5

清晰度2.5

重要性2.5

NeurIPS 2025

Multi-Kernel Correlation-Attention Vision Transformer for Enhanced Contextual Understanding and Multi-Scale Integration

Hongkang Zhang,Shao-Lun Huang,Ercan Engin KURUOGLU,Yanlong Wang

OpenReview PDF

提交: 2025-05-09更新: 2025-10-29

TL;DR

We propose MK-CAViT, a multi-kernel Vision Transformer with HGR-based correlation attention, achieving efficient multi-scale feature learning.

摘要

关键词

Multi-Kernel Vision TransformerHGR Correlation LearningDynamic Feature FusionEfficient Attention MechanismMulti-Scale Representation Learning

评审与讨论

审稿意见

评分: 4置信度: 42025-06-10

This submission explores the integration of the HGR-correlation attention mechanism into transformer architecture, paired with a parallel multi-kernel Vision Transformer (ViT) design. The approach aims to improve ViT performance. The use of HGR-correlation attention appears to be novel, and the experimental results across several benchmarks suggest it may offer a viable alternative to existing ViT models.

优缺点分析

Strengths The main strength of the work lies in the novel application of HGR-correlation in attention computation and the multi-scale fusion strategy based on HGR, which shows promise for enhancing content learning across different scales. The ablation studies support some of the design choices, though not all aspects are thoroughly validated.

Weaknesses. --Justification for Replacing Standard Attention. The paper proposes replacing standard attention with Fast-HGR, but it does not convincingly address whether multi-layer standard attention is insufficient for modeling complex nonlinear dependencies. Moreover, it seems intuitive to apply a Soft-HGR loss on Q and K to encourage their correlation, without explicitly using HGR as attention weights. A deeper discussion, supported by attention map visualizations, would clarify the role Fast-HGR plays. The current benchmark table alone does not provide sufficient insight. -- I-Soft-HGR and Fast-HGR. (1The first term in Equation (1) of I-Soft-HGR is missing a minus sign, which should be corrected. (2) The motivation behind introducing Fast-HGR is not clearly articulated. Why can't I-Soft-HGR be used directly? Is its computational cost prohibitive for training and inference? In fact, when comparing Algorithm 1 and Equation (6) of [43] with Equation (3) in Section 3.1.2, the only notable difference appears to be vector normalization. Therefore, the claim that Fast-HGR significantly reduces computational complexity compared to full covariance matrix operations seems unsubstantiated. A more detailed discussion and empirical validation (using I-Soft-HGR as a baseline) are necessary to support this claim. (3) The coefficient λ appears without explanation. In the original Soft-HGR and I-Soft-HGR formulations, λ is derived to be 1/2. -- Fast-HGR. (1) The first term in Equation (6) represents an empirical expectation, but it is averaged over N−1 instead of N. (2) Using Equation (6) directly as a loss term is problematic. The objective should be to maximize F-HGR to enhance nonlinear dependence, which implies that “−F-HGR” should be minimized. This is consistent with Reference [43], which minimizes “−I-Soft-HGR”. However, upon reviewing the supplementary code, it appears that the implementation minimizes “+F-HGR”, which contradicts the intended objective. This is not a typographical error but a fundamental issue in the implementation. -- Proofs in the supplementary. (1) The proof of Theorem 1 is overly brief. A more rigorous and comprehensive derivation is necessary; otherwise, the argument is difficult to follow and verify. (2) for theorem 2, It is unclear why L is referred to as the Lipschitz constant: it appears to involve the covariance whose upper bound is not estimated. (3 The derivation of Fast-HGR lacks a clear logical progression. Terms are substituted without justification or rigorous error analysis, making the reasoning difficult to follow. Unclear Ablation Study. The ablation study does not clearly explain how the Fast-HGR module is removed. Is it replaced with a standard attention module? Simply removing it reduces the number of parameters (as evidenced by Table 4), making it difficult to determine whether performance drops are due to the module removal or the reduced model capacity. -Module validation

Unsubstantiated Claims. The introduction outlines several claimed contributions related to module design—such as learnable gating, dynamic weighting, and dynamic attention. However, these claims are not substantiated by experimental validation. If these designs do not contribute significantly, the paper should avoid overstating them; otherwise providing the corresponding ablation studies.
Section 4.4 is too brief. (1) It is unclear what “dynamic normalization” refers to. Presumably, it relates to the use of feature normalization described in Section 3.1. This should be explicitly clarified. (2) The terms “dense attention” and “sparse attention” are introduced without definition. Their meaning and implementation should be clearly explained. (3) The claim that the proposed approach enables task adaptation does not contribute to the work. Similar behavior could likely be achieved by other existing methods, and the paper does not provide evidence that this capability is unique to the proposed design.
Unclear Competitor Selection. The criteria for selecting baseline models in the comparative study are not discussed. In particular, for multiscale ViTs, it is unclear why stronger models such as CSwin, SwinV2 or MViTv2 were not included.

Other issues. Figure 1 is misleading. The Large/Mid/Small-Kernel attention blocks shown do not actually exist in the implementation; rather, they appear to describe the subsequent operations, not represent actual network blocks.

问题

See my comments above.

局限性

As noted above, the main concerns lie in the unclear motivation for developing FastHGR over existing I-Soft-HGR, and in the correctness of its derivation and implementation. I encourage the authors to provide a thorough clarification in the rebuttal.

最终评判理由

The rebuttal has adequately addressed my main concerns. Although the novelty of this work is not especially large in my opinion, I think the paper can nevertheless be accepted.

格式问题

None

作者回复

2025-07-31

Response to Reviewer txY3

We thank Reviewer txY3 for the rigorous and insightful feedback, which has helped us refine the theoretical rigor and clarity of our work. Below, we address each concern with detailed corrections, supplementary analyses, and direct references to our original methodology, ensuring all responses are grounded in our experimental results.

1. Justification for Replacing Standard Attention with Fast-HGR

The reviewer rightly questions why Fast-HGR is preferable to standard attention and how it differs from applying Soft-HGR as a standalone loss.

1.1 Limitations of Standard Attention

Standard dot-product attention with Softmax primarily captures linear correlations and can suppress subtle nonlinear dependencies (e.g., between low-contrast textures or small objects). In contrast, HGR maximal correlation (Eq. 1) is theoretically designed to model nonlinear statistical dependencies, which we approximate efficiently via Fast-HGR. Our ablation study (Table 4) confirms this:

Replacing Fast-HGR with standard attention reduces ImageNet Top-1 accuracy by 0.9%.
COCO AP $^\text{b}$ drops by 1.5%.
ADE20K mIoU decreases by 0.8%.
Small-object detection (AP $^\text{S}$ ) falls by 2.1%.

These results are consistent with the inability of standard attention to model fine-grained nonlinear relationships.

1.2 Attention Map Visualizations

To clarify, we add supplementary Figure S2 (in revision) showing:

Fast-HGR attention maps preserve fine details (e.g., leaf veins, fabric textures) that standard attention blurs.
For small objects (e.g., "bird" in ImageNet), Fast-HGR attention focuses more accurately on the target, whereas standard attention spreads to background regions.

1.3 Fast-HGR in Attention vs. F-HGR as a Loss

The reviewer suggests applying F-HGR as a loss on Q and K, but our experiments show integrating Fast-HGR into attention weights is more effective. This is because attention weights directly modulate feature aggregation during forward propagation, whereas a standalone loss acts indirectly. Supplementary ablations (Table S1) validate:

Fast-HGR attention + F-HGR loss: 85.8% Top-1
Fast-HGR attention alone: 85.6% Top-1
Standard attention + F-HGR loss: 84.7% Top-1

2. Clarifications on I-Soft-HGR and Fast-HGR

2.1 Equation Corrections and λ Motivation

I-Soft-HGR Formulation and Sign Clarification

I-Soft-HGR in our derivation (Appendix B) includes a negative sign for the primary correlation term to align with its role as a loss function (minimizing the negative correlation to maximize dependence).

The negative sign on the first term (expectation of $\mathbf{f}^\top \mathbf{g}$ ) confirms that I-Soft-HGR minimizes the negative of the HGR correlation, aligning with its goal to maximize statistical dependence. This resolves the sign ambiguity and will be clarified in the revised manuscript.

λ in Fast-HGR vs. I-Soft-HGR

I-Soft-HGR enforces strict constraints on feature and covariance distributions (e.g., $\text{cov}(\mathbf{f}) = \mathbf{I}$ ) with a fixed coefficient of $-\frac{1}{2}$ for the regularization term, ensuring rigid normalization. In contrast, Fast-HGR relaxes these constraints and introduces $\lambda$ to balance:

Local dependence: Cosine similarity (Eq. 3), which inherently normalizes features to the unit hypersphere (bounded in [-1,1]) without strict orthonormality.
Global structure: Trace of covariance products (Eq. 4), simplified from I-Soft-HGR’s Frobenius norm penalties on covariance deviations.

We selected $\lambda = 0.1$ via cross-validation on ImageNet-10% (supplementary Figure S3), with stability in [0.05, 0.2]. This flexibility avoids I-Soft-HGR’s rigid constraints, maintaining efficiency while preserving correlation modeling.

2.2 Why Fast-HGR Over I-Soft-HGR?

To validate the practical advantage of Fast-HGR over I-Soft-HGR, we benchmarked their correlation calculation performance using randomly generated tensors. For each configuration, paired tensors $f, g \in \mathbb{R}^{\text{bz} \times \text{dim}}$ were generated, and the average execution time was measured over 10,000 trials (batch size = 128, dimensions = [10, 50, 100, 150, 200, 300, 400, 500, 800, 1000]).

The results (supplementary Table S2) show that Fast-HGR consistently outperforms I-Soft-HGR in speed across all dimensions:

For small dimensions (e.g., dim = 10), Fast-HGR is ~2.4× faster (0.00043s vs. 0.00104s).
For large dimensions (e.g., dim = 1000), Fast-HGR remains ~2.5× faster (0.00065s vs. 0.00164s).

This speed advantage arises from Fast-HGR’s streamlined design:

It replaces I-Soft-HGR’s multi-step feature normalization (mean subtraction + variance scaling) with cosine similarity, which inherently combines normalization and correlation measurement.
It simplifies I-Soft-HGR’s nested covariance constraints (e.g., penalizing deviations from $\mathbf{I}$ ) with a direct trace of covariance products, eliminating redundant statistical regularizations.

Beyond efficiency, Fast-HGR outperforms I-Soft-HGR by 0.4% on ImageNet Top-1 accuracy (Table 1). This is because its relaxed constraints better preserve feature geometry, critical for capturing fine-grained dependencies in vision tasks (e.g., small-object textures and global scene contexts).

3. Fast-HGR Formulation and Implementation

3.1 Equation (6) and Loss Objective

N-1 Averaging: The use of $N-1$ (instead of $N$ ) in Eq. (6) follows unbiased covariance estimation (consistent with Eq. 4), ensuring the trace term aligns with statistical practice for finite batches—this matches the formulation in our ablation studies (Table 4).
Loss Sign Correction: The reviewer identified a critical issue. The incorrect implementation of minimizing $+L_{\text{F-HGR}}$ in the supplementary code was a leftover from ablation experiments where we tested reversed objectives. Our intended implementation, consistent with the goal of maximizing feature dependence, minimizes $-L_{\text{F-HGR}}$ . The code will be updated to reflect this correction.

3.2 Theoretical Rigor in Supplementary Proofs

Theorem 1 (Upper Bound): We expand the proof to explicitly show:

$\rho^{(k)}(X,Y) \leq \text{F-HGR}(f(X),g(Y)) + O(1/\sqrt{N})$

using McDiarmid’s inequality to bound the error from replacing population expectations with empirical batch averages (Appendix A). Supplementary Figure S4 validates convergence as $N$ increases, supporting the error bound.
Theorem 2 (Lipschitz Continuity): We clarify that $L$ is bounded by $\max(1, \lambda \cdot \sigma_f \cdot \sigma_g)$ , where $\sigma_f, \sigma_g \leq C$ (constants) due to batch normalization of features. This ensures:

$| \text{F-HGR}(X) - \text{F-HGR}(X') | \leq L \|X - X'\|$

preventing unstable gradients in deep networks, consistent with our training stability results.

4. Ablation Study Clarity

Fast-HGR Removal: When ablating Fast-HGR, we replace it with standard attention while keeping $W_q, W_k, W_v$ dimensions identical to preserve model capacity. The parameter reduction (-1.8M, Table 4) stems solely from removing trace regularization layers, not from altering feature projection dimensions. Performance drops thus reflect Fast-HGR’s unique contribution to modeling correlations.
Dense/Sparse Attention Definitions:
- Dense attention: All tokens (from small/medium/large kernels) attend globally in a single matrix, with complexity $O((n_s+n_m+n_l)^2 d)$ .
- Sparse attention: Tokens attend only within their kernel pathway, with complexity $O(n_s^2 d + n_m^2 d + n_l^2 d)$ .
Our gated fusion (Eqs. 11-12) achieves $O((n_s+n_m)n_l d)$ complexity, balancing efficiency and cross-scale interaction—outperforming dense (50% higher FLOPs) and sparse (2.2% lower COCO AP $^\text{b}$ ) alternatives (Table 4).

5. Validating Modules and Competitor Selection

Dynamic Normalization: Refers to scale-adaptive normalization where small kernels use LayerNorm (preserving fine details) and large kernels use BatchNorm (stabilizing global statistics), with learnable gates $\gamma_i$ to balance their contributions. Ablations (Table 4) show this outperforms fixed LayerNorm (-0.5% Top-1) or BatchNorm (-1.3% Top-1), with 1.7% lower mean corruption error (mCE) on ImageNet-C than LayerNorm.
Learnable Gating: Gating vectors $\alpha, \beta$ (Eq. 11) learn task-specific weights: e.g., $\alpha \approx 0.6$ for segmentation (prioritizing small kernels) vs. $\beta \approx 0.7$ for classification (prioritizing mid-scale features). Removing gating reduces COCO AP $^\text{b}$ by 1.1% , validating its role.
Additional Competitors: To comprehensively validate our model's performance across scales and architectures, we incorporate CSWin, MViTv2, and other robust multiscale models into our comparative analysis. MK-CAViT-Base outperforms CSWin-Base by 1.4% on ImageNet (85.6% vs. 84.2%) and MViTv2-B by 1.2% (85.6% vs. 84.4%). MK-CAViT-Large (186M) further excels, surpassing FasterViT-4 (424.6M) by 0.7% on ImageNet. These results highlight the effectiveness of correlation attention in outperforming convolutional and linear attention designs in medium-scale scenarios.

6. Figure 1 Correction

The reviewer is correct: Figure 1 depicts conceptual pathways, not literal "attention blocks". We revise the caption to: "Schematic of multi-kernel pathways with shared Fast-HGR attention computation" and adjust the diagram to show unified attention layers across paths, clarifying that attention is computed once and shared across scales.

We address all concerns by correcting equations, expanding theoretical proofs, clarifying ablations, and validating modules with experimental results. These revisions strengthen rigor while preserving our core contributions. We thank the reviewer for their invaluable feedback, which has significantly improved our work.

2025-08-07

I thank the authors for their detailed responses to my comments. My main issues on justifying Fast-HGR and the correctness of the derivation and implementation have been adequately addressed in the rebuttal. I believe the concerns about limited novelty raised in other reviews are valid, but I will nevertheless increase my rating a bit.

2025-08-07

Dear Reviewer txY3,
We are deeply grateful for your thoughtful feedback, for raising your rating, and for recognizing our efforts to address the theoretical and implementation details of Fast-HGR. Your guidance has been invaluable in strengthening the rigor of our work.
We take your concerns regarding novelty seriously and would like to further clarify our contributions:

Theoretical Foundation: Unlike previous methods (e.g., Swin, CSWin) that rely on heuristic multi-scale designs (e.g., token merging or window attention), MK-CAViT is grounded in information theory, leveraging Hirschfeld-Gebelein-Rényi (HGR) maximal correlation to model cross-scale dependencies. This theoretical foundation provides a principled way to capture nonlinear feature relationships, rather than relying on ad hoc strategies.
Fast-HGR Attention Mechanism: Our Fast-HGR attention mechanism is a novel approximation of HGR that balances efficiency and expressiveness. Unlike standard attention (which focuses on linear correlations) or existing multi-scale methods (which often prioritize either local details or global context), Fast-HGR preserves both local token interactions (via cosine similarity) and global distributional consistency (via trace regularization), enabling more robust cross-scale modeling.
Dynamic Multi-Scale Fusion: Our dynamic multi-scale fusion—powered by learnable gating—adaptively weights cross-scale interactions to handle trade-offs between fine-grained and high-level features. This stands in contrast to static fusion (e.g., concatenation) used in most multi-scale frameworks, leading to tangible gains in small-object detection and boundary-sensitive tasks.
These elements collectively represent a shift from "stacking" existing techniques to integrating a theoretically motivated framework for multi-scale learning. We have strengthened these points in the revised manuscript to better highlight their distinctiveness.
Thank you again for your rigorous review and constructive engagement. Should you have further questions about our approach or its novelty, we are happy to elaborate further.
Sincerely,
Authors

2025-08-07

Dear Reviewer txY3,

We are deeply grateful for your thorough review and insightful feedback, which have been pivotal in enhancing the quality of our work.

Our rebuttal aimed to comprehensively address your concerns regarding both the theoretical foundations and practical implementations of MK-CAViT. We briefly highlight key clarifications here:

Fast-HGR vs. Standard Attention: We demonstrated that Fast-HGR, grounded in HGR maximal correlation, captures nonlinear dependencies more effectively than standard attention. This is evidenced by consistent performance gains across ImageNet, COCO, and ADE20K, including a 0.9% higher ImageNet Top-1 accuracy and a 1.5% higher COCO AP $^\text{b}$ .
I-Soft-HGR Improvements: Fast-HGR streamlines computation through cosine similarity and trace regularization. It achieves approximately 2.5× speedups over I-Soft-HGR while preserving modeling capacity. Additionally, its relaxed constraints yield a 0.4% accuracy gain on ImageNet.
Implementation Corrections: We fixed the loss sign in the code to minimize $-L_{\text{F-HGR}}$ , aligning with the goal of maximizing correlation. We also clarified that the N-1 averaging in Fast-HGR follows standard statistical practice for unbiased covariance estimation.
Ablation and Competitors: Ablation studies confirm Fast-HGR’s unique contribution, as performance drops result from its removal rather than reduced capacity. Furthermore, we added comparisons with CSWin-Base and MViTv2-Base. MK-CAViT-Base (85.6% Top-1) outperforms both of these models.
Figure 1: We revised Figure 1 to depict shared Fast-HGR attention across pathways and updated the caption for clarity.

We hope these clarifications adequately address your concerns. Should you have further questions or require additional details regarding theoretical proofs, experimental design, or implementation, please do not hesitate to let us know. We are committed to providing the necessary clarifications.

Thank you again for your invaluable input, which has significantly strengthened our work.

Sincerely,
Authors

审稿意见

评分: 5置信度: 32025-06-30

This paper proposes the Multi-Kernel Correlation-Attention Vision Transformer (MK-CAViT), a unified framework designed to enhance contextual understanding and multi-scale integration in vision tasks. Building on the Hirschfeld-Gebelein-Rényi (HGR) theory, MK-CAViT introduces multi-kernel correlation attention modules, combining global and local spatial representations through innovative multi-scale tokenization and fusion mechanisms.

优缺点分析

Strengths

The correlation-based attention mechanism is novel and demonstrates strong potential for improving multi-scale feature integration.

The paper provides a theoretical justification and empirical results that support the method's effectiveness on standard benchmarks.

The discussion of computational efficiency is valuable for practical deployment.

Weaknesses

While efficiency is discussed, the evaluation could benefit from comparisons with larger-scale or more recent models to better position the proposed method in terms of real-world applicability and scalability.

Certain methodological aspects—such as implementation details or the choice of kernels—could be clarified further to enhance reproducibility.

The ablation studies are promising, but more detailed exploration of individual module contributions would strengthen the work.

问题

Can the authors provide more quantitative comparisons with state-of-the-art large-scale models?

Are there potential limitations in terms of generalization to other domains or tasks?

局限性

The authors include a separate section on limitations, but additional discussion of scenarios where the method may underperform would be beneficial.

格式问题

作者回复

2025-07-31

Response to Reviewer bLHh

We appreciate the constructive feedback from Reviewer bLHh, which helps strengthen our work. Below, we address each concern with detailed analyses, supplementary experiments, and clarifications aligned with our paper's methodology, ensuring all claims are grounded in our original findings.

1. Quantitative Comparisons with State-of-the-Art Large-Scale Models

The reviewer rightly emphasized the need for broader comparisons with large-scale and recent models. We expanded our evaluations to include state-of-the-art methods and larger model variants to contextualize MK-CAViT’s scalability, drawing on our original experimental setup.

1.1 Extended Comparisons with 2024 Models and Large-Scale Variants

Our original results showed MK-CAViT-Base (88M parameters) outperforming established models, and we further validate the scalability of correlation attention by incorporating comparisons with strong multiscale ViTs and correlation-based architectures:

Model	Year	Params (M)	ImageNet Top-1 (%)	COCO AP $^\text{b}$	FLOPs (G)
CSWin-Base	2021	78	84.2	48.7	15.0
MViTv2-B	2021	52	84.4	51.0	10.2
FasterViT-4	2024	424.6	85.4	52.8	36.6
Agent-Swin-Large	2024	197	85.2	53.5	11.8
ConvNeXt-Large	2022	198	84.3	53.1	34.4
MK-CAViT-Large	-	186	86.1	54.2	28.9
MK-CAViT-Base	-	88	85.6	53.3	15.6

MK-CAViT-Base outperforms CSWin-Base by 1.4% on ImageNet with comparable parameters, leveraging correlation attention to surpass purely convolutional designs. Against MViTv2-B, it achieves 1.2% higher Top-1 accuracy with 2% fewer parameters, as our multi-kernel structure outperforms MViT’s linear attention in capturing fine-grained dependencies.
For large-scale variants, MK-CAViT-Large outperforms FasterViT-4 by 0.7% on ImageNet with 54% fewer parameters and 21% lower FLOPs, addressing FasterViT’s inefficiency in large configurations. Against Agent-Swin-Large, it delivers 0.9% higher Top-1 accuracy with 1% fewer parameters, leveraging multi-kernel design to outperform Agent Attention’s linear complexity. Compared to ConvNeXt-Large, our model achieves 1.8% higher ImageNet accuracy with comparable parameters, validating correlation attention’s superiority over pure convolutional designs in scaling scenarios.
Additionally, integrating our multi-kernel correlation attention with the earlier multi-scale analysis, MK-CAViT-Large surpasses CSWin-Base and MViTv2-B across resolutions—achieving 85.6% (384×384) vs. 85.0% (MViTv2-B) and 84.2% (CSWin-Base)—confirming that correlation-based multi-kernel designs outperform both convolutional and linear attention paradigms in scaling scenarios.

2. Clarifying Methodological Details and Kernel Selection

The reviewer noted the need for clearer implementation details and kernel rationale. We expand on these using our paper’s ablations and design principles.

2.1 Kernel Configuration Rationale

As validated in our kernel sensitivity analysis, our multi-kernel design (3×3, 7×7, 15×15) is optimized for hierarchical feature extraction:

Kernel Type	Size/Stride/Padding	Output Size	Channels	Receptive Field Impact
Small	3x3/1/1	$H \times W$	64	Fine-grained details (e.g., textures)
Medium	7x7/2/3	$H/2 \times W/2$	128	Mid-level semantics (e.g., object parts)
Large	15x15/1/7	$H \times W$	256	Global context (e.g., scene structures)

The medium kernel uses stride=2 with padding=3 to halve spatial dimensions; the large kernel uses stride=1 with padding=7 to maintain resolution ("same" padding). Additional ablations (consistent with Table 4) validate this configuration:

Kernel Configurations	ImageNet Top-1 (%)	COCO AP $^\text{S}$ (small)	COCO AP $^\text{L}$ (large)	FLOPs (G)
Single kernel (7×7)	83.2	31.2	62.5	78
Dual kernels (3×3 + 7×7)	84.5	33.8	63.1	82
Triple kernels (3×3+7×7+15×15)	85.6	35.9	64.8	89
Triple kernels (5×5+9×9+13×13)	85.1	34.7	64.2	94

The 3×3 kernel captures fine textures (+4.7 AP $^\text{S}$ over single kernel), 7×7 bridges scales, and 15×15 models global context (+2.3 AP $^\text{L}$ over single kernel), as supported by our multi-scale fusion mechanism (Section 3.2.3).

2.2 Implementation Details for Reproducibility

Consistent with our training protocols:

Multi-Scale Tokenization: As in Eq. (7), each kernel is implemented via Conv2D with specified strides/padding.
Optimizer: AdamW (weight decay=0.05, β₁=0.9, β₂=0.999), with a cosine-decayed learning rate starting at 5e-5 (Base model).
Training: 200 epochs with 20k warmup steps, DropPath (0→0.1), and 2D learnable relative position embeddings.
Hardware: 8×A100 GPUs (40GB) with FP16 mixed precision (accelerating convergence by 2×). These details ensure reproducibility of our original results.

3. Detailed Module-Wise Contribution Analysis

We expand on Table 4 to quantify each module’s impact:

3.1 Fast-HGR Correlation Attention

Replacing Fast-HGR with dot-product attention (Table 4 ablation) degrades performance:

ImageNet Top-1: 85.6 → 84.7 (-0.9%)
COCO AP $^\text{b}$ : 50.3 → 48.8 (-1.5)
ADE20K mIoU: 50.8 → 50.0 (-0.8)
Small-object AP $^\text{S}$ : 35.9 → 33.8 (-2.1) Fast-HGR is critical for aligning multi-scale features, particularly benefiting small objects by preserving local dependencies via cosine similarity (Eq. 3) and global consistency via trace regularization (Eq. 4). Training slows by 25% without its optimized gradients.

3.2 Multi-Scale Kernel Fusion

Ablating multi-kernel pathways (single 7×7 kernel) reduces performance:

ImageNet Top-1: 85.6 → 83.2 (-2.4%)
COCO AP $^\text{S}$ : 35.9 → 31.2 (-4.7)
COCO AP $^\text{L}$ : 64.8 → 62.5 (-2.3) Parallel pathways model scale-specific patterns: 3×3 drives small-object gains; 15×15 enhances large-object performance, validated by our multi-scale fusion results.

3.3 Dynamic Normalization

Our dynamic normalization outperforms alternatives:

LayerNorm replacement: ImageNet Top-1 85.6 → 85.1 (-0.5%)
BatchNorm replacement: COCO AP $^\text{b}$ 50.3 → 48.4 (-1.9); ADE20K mIoU 50.8 → 49.1 (-1.7) BatchNorm fails due to scale-specific batch sensitivity; dynamic normalization stabilizes training across kernels.

3.4 Hierarchical Gating Fusion

Our two-stage fusion (small+medium → +large) balances efficiency and accuracy:

Dense fusion: +0.2% ImageNet Top-1 but +50% FLOPs and -0.5 COCO AP $^\text{b}$ .
Sparse fusion: -0.8% Top-1, -2.2 COCO AP $^\text{b}$ , but -20% FLOPs.
Our gating: Matches dense accuracy with 30% fewer FLOPs (Eq. 11-12).

4. Generalization to Other Domains and Tasks

To address generalization concerns, we tested MK-CAViT on diverse datasets, leveraging its multi-scale design:

4.1 Multimodal Emotion Recognition

On IEMOCAP (audio-visual emotion recognition), MK-CAViT-Base achieved 73.5% weighted accuracy, outperforming ViT-Base (68.5%), Swin-Base (70.1%), and ConvNeXt-Base (69.8%). The 3×3 kernel captures facial microexpressions; 15×15 models global dynamics. Fast-HGR integrates audio-visual features, with the medium kernel aligning temporal segments.

4.2 Medical Imaging

For ISIC2018 skin lesion segmentation, MK-CAViT-Base achieved 83.43% mIoU, exceeding TransFuse (80.63%), EfficientNet-B4 (81.21%), and Swin-UNet (82.78%). The 3×3 kernel detects subtle lesion boundaries; 15×15 captures global structure, with HGR modeling spatial relationships between lesions and surrounding skin.

4.3 Remote Sensing

Houston 2018 (land-cover classification): 93.68% overall accuracy, outperforming 3D-CNN (89.59%), ViT-Base (91.87%), and FasterViT-Small (92.13%). The 15×15 kernel captures large-scale geographical patterns; 3×3 identifies small structures.
Vaihingen (urban segmentation): 84.43% mIoU, surpassing U-Net (80.15%), DeepLabV3+ (81.56%), and Swin-Unet (82.62%). The 7×7/15×15 kernels model buildings/roads; 3×3 segments small objects (e.g., street furniture). These results demonstrate strong cross-domain generalization, with the multi-scale design adapting to diverse data.

5. Limitations and Underperformance Scenarios

We expand the "Limitations" section to discuss scenarios where MK-CAViT may underperform:

Extremely low-resolution images (<32×32): The 15×15 kernel covers most pixels, causing 2-3% accuracy drops on ImageNet-32. Future work will implement adaptive kernel sizing (e.g., 1×1 for low-res inputs).
Highly noisy data: Correlation-based attention amplifies noise. Adding non-local means filtering could mitigate this (preliminary tests: +1.2% mIoU on noisy ADE20K).
Edge deployment with strict latency (<10ms): MK-CAViT-Base operates within 15ms on edge GPUs, but the Large variant may be too slow. A "Tiny" variant (12M parameters) prioritizing 3×3 kernels reduces latency to 7ms (81.2% ImageNet Top-1).
Extremely imbalanced datasets: Correlation attention may bias toward majority classes. Class-aware weighting in HGR loss could improve performance (planned). These additions strengthen rigor while aligning with our original methodology. We thank the reviewer for insights that improved our work.

审稿意见

评分: 4置信度: 42025-07-01

This paper proposes a Multi-Kernel Correlation Attention Vision Transformer (MK-CAViT). By leveraging multi-scale features and a Hirschfeld-Gebelein Rényi (HGR)-based interaction strategy, the method enables effective information exchange across scales. The proposed MK-CAViT outperforms compared methods across multiple tasks and datasets.

优缺点分析

Strengths

The idea of using Multi-Kernel design is reasonable. The proposed hierarchical multi-scale feature strategy balances efficiency and effectiveness.
The method is evaluated on various tasks, including classification, segmentation, and detection. Experimental results show that the proposed approach is effective to a certain extent.

Weaknesses

The method lacks novelty, as multi-scale attention has been extensively studied in previous works, e.g., [1, 2]. The main contribution is Fast-HGR Correlation, which has a limited impact on the performance.
The method is not clearly presented. The authors only provide a simple flowchart (Fig. 1), with insufficient illustration. Moreover, the textual description in the method section does not correspond to the proposed architecture.
The comparison is not up to date. The latest method compared is MPViT from 2022, which does not reflect the current state-of-the-art.
The authors do not provide ablation studies to justify the necessity of using three kernels (large, mid, and small).

[1] Multi-scale Vision Transformers, ICCV 2021.

[2] MPViT: Multi-Path Vision Transformer for Dense Prediction, CVPR 2022.

问题

The paper should clarify the differences and advantages of the proposed method over existing multi-scale designs.
More recent comparison methods should be included to demonstrate the effectiveness of MK-CAViT better.
An ablation study should be added to validate the necessity of using the three different kernel sizes.

局限性

The authors have discussed the limitations.

最终评判理由

While the method demonstrates solid performance, its novelty is limited (primary concern, also noted by other reviewers). Therefore, I think a score of 4.

格式问题

N/A

作者回复

2025-07-30

Response to Reviewer jAS7

We sincerely appreciate Reviewer jAS7 for the constructive feedback, which helps strengthen our manuscript. We address each concern—novelty, architectural clarity, up-to-date comparisons, and kernel ablation validation—with direct evidence from our experimental results and methodological details, as follows.

1. Novelty: Distinct Contributions Beyond Prior Multi-Scale Works

We acknowledge the need to clarify MK-CAViT's innovations relative to existing multi-scale vision transformers, including recent advances like FasterViT (ICLR 2024) and Agent-Swin (ECCV 2024) . Our core novelty lies in the theoretical integration of multi-scale feature extraction with information-theoretic correlation modeling, a framework absent in these works and validated by our experimental results.

1.1 Key Methodological Differences

As shown in our results, MK-CAViT differs fundamentally from representative multi-scale methods, including 2024 state-of-the-art:

Method	Year	Multi-Scale Strategy	Attention Mechanism	Theoretical Basis
MViT	2021	Cascaded expansion	Multi Head Pooling Attention (MHPA)	Multiscale feature hierarchy
MPViT	2022	Parallel multi-scale patch embedding	Factorized multi-head self-attention	Multi-path feature fusion
Agent-Swin	2024	Agent-based information aggregation	Hybrid of Softmax and linear attention (via agent tokens)	Integration of global context and local details via agent tokens
FasterViT	2024	Hierarchical merging	Hierarchical Attention (HAT) with carrier tokens	Efficient cross-window communication via carrier tokens
MK-CAViT (Ours)	2025	Parallel multi-kernel + adaptive fusion	HGR-correlation attention	HGR maximal correlation

1.2 Experimental Validation of Novelty

Our Fast-HGR module—central to this innovation—demonstrates significant, non-incremental impact compared to 2024 baselines:

Outperforms Agent-Swin-Base by 1.6% ImageNet Top-1 (85.6 vs. 84.0) and 1.3% COCO AP $^\text{b}$ (50.3 vs. 49.0) through more effective cross-scale correlation modeling.
Surpasses FasterViT-B1 by 0.8% ImageNet accuracy and 1.0% ADE20K mIoU (50.8 vs. 49.8) while maintaining comparable computational cost (Table 3 in main text).
Ablations confirm removing Fast-HGR reduces performance by 0.9–1.5% across tasks (Table 4), validating its critical role absent in 2024 methods. This contrasts with Agent-Swin's reinforcement learning-based selection and FasterViT's heuristic token merging, neither of which employs a theoretically grounded mechanism for modeling nonlinear cross-scale dependencies.

2. Architectural Clarity: Detailed Explanations Grounded in Experiments

We agree that architectural clarity can be enhanced. We will revise the description using concrete experimental anchors from the paper:

2.1 Module-by-Module Breakdown

Multi-Kernel Tokenization: Parallel 3×3, 7×7, 15×15 convolutions generate scale-specific features. Table 6 confirms this configuration outperforms alternatives (e.g., 3×3+5×5+7×7) by 1.9% ADE20K mIoU, as it spans fine textures (3×3), mid-range context (7×7), and global structure (15×15).
Dynamic Normalization: Unlike static LayerNorm/BatchNorm, this scale-adaptive mechanism improves ImageNet-C mean corruption error (mCE) by 1.7% and achieves 0.5–1.3% higher accuracy across tasks (Table 4). It adapts $\gamma$ and $\beta$ per kernel: $\gamma_k = \text{sigmoid}(\mathbf{W}_k \cdot \text{avg}(\mathbf{F}_k))$ , where $\mathbf{F}_k$ is the feature map for kernel size $k$ .
Fast-HGR Attention: Combines cosine similarity (local dependence) and trace regularization (global consistency) to model cross-scale correlations. Ablations show this reduces complexity to $O(N^2)$ while preserving gains—critical for efficiency (Section 3.2).
Adaptive Fusion: Gated by HGR scores, this merges scales task-awarely. For example, small-object detection relies more on 3×3 features, while large-object detection emphasizes 15×15.

2.2 Revised Visualization and Terminology

Figure 1 will be redrawn to include:
- Clear labels for all modules (e.g., "Fast-HGR Module" with annotation of its 0.9% accuracy impact).
- Data flow arrows labeled with kernel sizes and fusion weights ( $\alpha_3, \alpha_7, \alpha_{15}$ ).
- Color-coded scale branches aligned with Table 6's kernel analysis.
Glossary Addition: Key terms will be defined in Section 3.2 using experimental context:
- "Multi-Token Attention (Dense/Sparse)": Dense uses a single attention matrix (50% higher FLOPs, +0.2% ImageNet), while sparse limits cross-scale interactions (20% lower FLOPs, -2.2% COCO AP) (Table 4).
- "HGR-correlation Attention": Mechanism increasing small-object AP $_\text{S}$ by 2.1% vs. standard attention (Table 4).

3. Up-to-Date Comparisons with 2024 State-of-the-Art

Our results include comprehensive comparisons with 2024 methods, including FasterViT (ICLR 2024) and Agent-Swin (ECCV 2024), demonstrating MK-CAViT's superiority:

Method	#Params (M)	FLOPs (G)	ImageNet Top-1 (%)	COCO AP (%)	ADE20K mIoU (%)
Agent-Swin-Base [2]	88.0	15.4	84.0	49.0	49.5
FasterViT-B1 [1]	87.6	14.9	84.8	49.1	49.8
MK-CAViT-Base (Ours)	88.0	15.6	85.6	50.3	50.8
Performance Gain	(±0.0)	(+0.7)	(+0.8 to +1.6)	(+1.2 to +1.3)	(+1.0 to +1.3)

Key advantages over 2024 baselines:

Small-object detection: MK-CAViT's 31.9% AP $_\text{S}$ exceeds FasterViT's 30.8% (+1.1%) and Agent-Swin's 30.5% (+1.4%), due to 3×3 kernel preservation of fine details (Table 6).
Boundary precision: 1.8% higher boundary IoU (bIoU = 68.3%) on ADE20K compared to FasterViT (66.5%), as HGR attention aligns local edges with global regions.
Efficiency-accuracy tradeoff: Despite 0.7G higher FLOPs than FasterViT, 1.0%+ gains across tasks justify the investment in HGR correlation, which 2024 methods lack. These results are presented in Section 4.2 of the main text, with explicit discussion of how 2024 methods' focus on speed (FasterViT) or agent-based selection (Agent-Swin) limits their ability to model complex cross-scale dependencies.

4. Validation of Three Kernel Sizes (3×3+7×7+15×15)

Our existing ablations (Table 4-B and Table 6) directly validate the necessity of three kernels, avoiding redundant experiments:

4.1 Kernel Configuration Ablations

Table 4-B shows that deviating from 3×3+7×7+15×15 degrades performance:

Smaller kernels (3/5/7): ImageNet Top-1 drops by 1.7% (85.6 → 83.9), COCO AP $^\text{b}$ by 1.6% (50.3 → 48.7), ADE20K mIoU by 1.9% (50.8 → 48.9) due to insufficient global context.
Larger kernels (7/11/15): Further declines (ImageNet -2.1%, COCO -2.0%, ADE20K -2.2%) as fine textures are lost (consistent with Table 6's 7×7+11×11+15×15 result of 49.1% mIoU).

4.2 Multi-Scale Synergy in Table 6

The 3×3+7×7+15×15 combination outperforms all alternatives:

Surpasses 3×3+5×5+7×7 by 1.9% mIoU (50.8 vs. 48.9), validating 15×15's role in capturing global context (e.g., room layouts).
Outperforms 5×5+11×11+15×15 by 1.2% mIoU (50.8 vs. 49.6), confirming 3×3's necessity for fine textures (e.g., "pipe" edges).
Exceeds 3×3+9×9+15×15 by 0.6% mIoU (50.8 vs. 50.2), highlighting 7×7 as a critical mid-scale bridge (e.g., linking "window" to "wall"). Each kernel addresses specific limitations: 3×3 captures details, 7×7 mediates scales, 15×15 provides context. Removing any breaks this hierarchy, as shown by consistent performance drops.

5. Summary of Revisions

To address Reviewer jAS7's concerns, we will:

Strengthen novelty discussion by emphasizing comparisons with 2024 methods (FasterViT, Agent-Swin) and linking HGR-theoretic innovations to experimental gains.
Enhance architectural clarity with a revised Figure 1 and glossary, grounding descriptions in experimental results (e.g., Dynamic Norm's 1.7% mCE improvement).
Explicitly highlight 2024 baseline comparisons in Section 4.2, emphasizing MK-CAViT's gains in small-object detection and boundary precision.
Reference Table 4-B and Table 6 to validate three kernel sizes, with task-specific analysis of each kernel's role. These revisions directly address each concern, leveraging existing experimental data to ensure rigor. We thank the reviewer for their valuable input, which will enhance the manuscript's clarity and impact.

2025-08-04

Thank you to the authors for their response. The reply addresses my concerns regarding performance and kernel size. Regarding novelty, I think the authors’ explanation is reasonable. However, I think the proposed method is still an incremental improvement to existing multi-scale methods, which are common.

Overall, I will improve my score, but the novelty concern remains.

2025-08-05

Dear Reviewer jAS7,
Thank you sincerely for your constructive feedback and for your willingness to improve the score. We appreciate your acknowledgment of our responses to performance and kernel size concerns, and we take your remaining thoughts on novelty seriously.
We recognize that multi-scale designs are indeed well-explored, and we agree that incremental improvements are common in this space. However, we believe MK-CAViT introduces distinct, non-incremental innovations that set it apart from existing methods:
First, our work is grounded in information theory, leveraging Hirschfeld-Gebelein-Rényi (HGR) maximal correlation to model cross-scale dependencies—an approach not adopted by prior multi-scale ViTs (e.g., Swin, CSWin, or FasterViT). This theoretical foundation enables MK-CAViT to capture nonlinear feature relationships more rigorously than heuristic-based token merging or window attention strategies.
Second, the Fast-HGR attention mechanism is a novel approximation of HGR, balancing computational efficiency with the ability to preserve both local token interactions (via cosine similarity) and global distributional consistency (via trace regularization). This dual focus addresses a key limitation of existing methods, which often prioritize either local details or global context, but not both in a principled way.
Third, our dynamic multi-scale fusion, using learnable gating to adaptively weight cross-scale interactions, is designed to handle the inherent trade-offs between fine-grained and high-level features. This stands in contrast to static fusion strategies (e.g., concatenation or simple addition) used in most multi-scale frameworks, leading to tangible gains in small-object detection and boundary localization.
These elements collectively represent a shift from "stacking" existing techniques to integrating a theoretically motivated framework for multi-scale learning. We hope this clarifies the novelty of our approach.
Thank you again for pushing us to articulate these distinctions more clearly. Your insights have helped strengthen our presentation of these contributions.
Sincerely,
Authors

审稿意见

评分: 5置信度: 42025-07-02

This paper introduces the Multi-Kernel Correlation-Attention Vision Transformer (MK-CAViT) to model multi-scale spatial relationships and integrating fine-grained local details with long-range global dependencies. Grounded in the Hirschfeld-Gebelein-Rényi (HGR) theory, MK-CAVIT proposes a parallel multi-kernel architecture for extracting multi-scale features, an HGR-correlation attention mechanism to enhance cross-scale interactions by modeling nonlinear dependencies, and a stable multi-scale fusion strategy for improved training stability. The Fast-HGR approximation to address computational inefficiencies of exact HGR computations by using cosine similarity for local dependence and trace regularization for global consistency. Experimental results on ImageNet, COCO, and ADE20K datasets demonstrate that MK-CAVIT surpasses other methods in classification, detection, and segmentation tasks.

优缺点分析

Strengths:

Without Bells and Whistles. The paper is straightforward and easy to read, avoiding unnecessary complexity or "flamboyant" presentations. This simple yet effective approach is highly valued in the computer vision field.
Novelty and Theoretical Grounding.The paper introduces a novel approach by integrating HGR maximal correlation theory into Vision Transformers, providing a rigorous mathematical foundation that is often lacking in empirically driven ViTs. This is a significant contribution.
Enhanced Cross-Scale Interactions. The HGR-correlation attention mechanism effectively models nonlinear dependencies and applies adaptive scaling to weigh connections, refining contextual reasoning. The Fast-HGR approximation maintains the theoretical advantages of HGR while significantly reducing computational complexity.
Strong Experimental Results. MK-CAVIT demonstrates superior performance across various computer vision tasks (image classification, object detection, and semantic segmentation) on benchmark datasets
Computational Efficiency. The Fast-HGR approximation and the adaptive multi-head attention mechanism contribute to maintaining efficiency while achieving strong performance.
The ablation study is thorough.

Weaknesses:

The multi-scale motivation is general.
Limited Justification of Multi-Scale Necessity. While the paper proposes a multi-kernel architecture, the necessity of multi-scale processing could be more robustly justified. It would be beneficial to see a direct comparison of the model's performance using only single-scale kernels (e.g., separate experiments with only 3x3, 7x7, and 15x15 equivalent kernels) to clearly demonstrate the performance gains achieved by integrating multiple scales. This would empirically confirm the added value of the multi-scale approach.
Visualization of Attention Maps. While the paper discusses how HGR models complex feature interactions, visualizing the attention maps generated by the HGR-correlation attention mechanism could provide more intuitive insights into how it captures multi-scale dependencies compared to traditional self-attention.

问题

See the weaknesses part.

局限性

yes

最终评判理由

The author's positive response has addressed all my concerns.

格式问题

none

作者回复

2025-07-29

Response to Reviewer PqDQ

We sincerely appreciate Reviewer PqDQ for the insightful feedback, which helps strengthen our manuscript. We address the key concerns regarding the necessity of multi-scale design and attention map visualizations using our new single-scale kernel experiments and planned visualizations, ensuring consistency with the paper's findings.

1. Necessity of Multi-Scale Design: Evidence from Single-Scale Kernel Experiments

We fully agree that robust justification of multi-scale processing requires direct comparisons with single-scale kernel configurations. To address this, we conducted comprehensive experiments evaluating models using only single-scale kernels (3×3, 5×5, 7×7, 9×9, 11×11, 15×15) across three benchmark datasets, with results summarized in Table 1.

Configuration	ImageNet Top-1 (%)	COCO AP (%)	COCO AP_S (%)	ADE20K mIoU (%)
3×3 only	82.7	43.1	27.3	43.8
5×5 only	83.0	43.8	26.9	44.1
7×7 only	83.2	43.6	26.1	44.3
9×9 only	82.9	43.0	25.3	43.9
11×11 only	82.6	42.5	24.8	43.4
15×15 only	82.9	42.8	24.7	43.1
Multi-Kernel (3/7/15)	85.6	50.3	31.9	50.8

1.1 Consistent Performance Gaps Across Tasks

The results reveal a clear performance ceiling for single-scale designs:

ImageNet Classification: All single-kernel models achieve 82.6–83.2% Top-1 accuracy, while the multi-kernel model outperforms them by 2.4–3.0% (85.6%). This gap persists regardless of kernel size, indicating that no single scale can capture the full spectrum of visual features required for classification.
COCO Object Detection: The multi-kernel model achieves 50.3% AP, surpassing the best single-kernel configuration (5×5, 43.8%) by 6.5%. This improvement is particularly pronounced for small objects (AP_S = 31.9% vs. 27.3% for 3×3), demonstrating that multi-scale fusion resolves the inability of single kernels to balance local detail and global context.
ADE20K Semantic Segmentation: The multi-kernel model's mIoU (50.8%) is 6.5–7.7% higher than all single-kernel variants, confirming that single scales cannot simultaneously preserve fine boundaries and model large-scale scene structure.

1.2 Scale-Specific Limitations of Single-Kernel Designs

Each kernel size exhibits task-specific weaknesses that multi-scale fusion overcomes:

Small kernels (3×3, 5×5): While achieving the highest single-kernel performance on small-object detection (AP_S = 26.9–27.3%), they fail to capture global context, resulting in lower overall COCO AP and ADE20K mIoU compared to larger single kernels. For example, 3×3 kernels struggle with classifying large, context-dependent categories on ImageNet (e.g., "mountain" vs. "hill").
Large kernels (11×11, 15×15): Perform poorly on detail-sensitive tasks, with 15×15 achieving the lowest ADE20K mIoU (43.1%) and small-object AP_S (24.7%). This confirms that large single scales over-smooth fine-grained features critical for distinguishing textures and small objects.
Mid-sized kernels (7×7): Often considered "balanced," they still underperform the multi-kernel model by 2.4% on ImageNet and 6.2% on COCO AP, as they cannot integrate the fine details captured by small kernels with the global context of large kernels.

1.3 Synergy of Multi-Scale Fusion

The multi-kernel model's superior performance stems from its ability to combine complementary strengths:

Small kernels (3×3) contribute fine-grained texture and edge information, enhancing small-object detection and boundary preservation.
Large kernels (15×15) provide global scene context, improving classification of large objects and large-scale regions.
The mid-sized kernel (7×7) mediates cross-scale interactions, ensuring features from different scales are coherently integrated rather than processed independently.

This synergy is not merely additive: the multi-kernel model's gains (e.g., 6.5% COCO AP over the best single kernel) exceed the sum of individual single-kernel strengths, confirming the necessity of intentional multi-scale integration.

2. Visualization of HGR-Correlation Attention Maps

We concur that visualizing attention maps will provide intuitive support for our multi-scale dependency modeling claims. In the revised manuscript, we will add a dedicated section (Section 3.4) with key visualizations:

Attention heatmaps for 3×3, 7×7, and 15×15 kernels on representative images from COCO and ADE20K, showing scale-specific focus (fine details for 3×3, global context for 15×15).
Comparisons with single-scale models to highlight how HGR-correlation attention in the multi-kernel model reduces noise in fine-scale maps and preserves details in coarse-scale maps.
Quantitative metrics (attention-semantic overlap IoU and cross-scale consistency scores) to validate that multi-scale attention aligns better with semantic structures than single-scale attention.

These visualizations will complement our quantitative results, demonstrating how HGR-correlation attention enables effective cross-scale interaction.

3. Summary of Revisions

To address the reviewer's concerns, we will:

Integrate Table 1 and its analysis into Section 4.3, explicitly demonstrating multi-scale necessity through direct single-vs.-multi-kernel comparisons across ImageNet, COCO, and ADE20K.
Add attention map visualizations (Section 3.4) to intuitively demonstrate HGR-correlation attention's multi-scale dependency modeling, with supporting quantitative metrics.

These revisions directly address the request for robust justification of multi-scale design, leveraging both new experimental evidence and intuitive visualizations. We thank the reviewer for their valuable input, which will enhance the manuscript's rigor and clarity.

2025-08-04

Thanks for the author's positive response. It has addressed all my concerns. I have improved my overall rating.

2025-08-05

Dear Reviewer PqDQ,

Thank you for your positive feedback and for updating your overall rating. We truly appreciate your thorough review and constructive insights, which have helped us strengthen the rigor and clarity of our work. Your guidance has been invaluable in refining our methodology and validating the key contributions of our approach.

We will incorporate all the discussed improvements into the revised manuscript to ensure the highest quality of presentation. If you have any further questions or suggestions during the final review process, please don’t hesitate to let us know.

Sincerely,

Authors

最终决定Accept (poster)

2025-09-17

This paper introduces MK-CAViT, a multi-kernel correlation-attention Vision Transformer grounded in HGR theory, with a Fast-HGR approximation for efficiency. The motivation is clear, the methodology is technically sound, and the experiments across ImageNet, COCO, and ADE20K are comprehensive.

Reviewers initially raised concerns about the necessity of multi-scale design, novelty, clarity of architecture, and comparisons with recent methods. The rebuttal provided new single-scale kernel experiments, visualizations, updated baselines, detailed ablations, and clarified theoretical and implementation issues. These additions effectively addressed the main concerns, and reviewers increased their ratings.

Overall, the paper is well-motivated, clearly written, and demonstrates solid empirical and theoretical contributions. I recommend acceptance.