TRUST: Test-Time Refinement using Uncertainty-Guided SSM Traverses
Abstract
Reviews and Discussion
This paper introduces TRUST, a novel test-time adaptation (TTA) method specifically designed for the VMamba architecture. TRUST leverages the model's unique internal traversal mechanism by permuting the scanning order to generate diverse causal "views" of an input image. The model is then adapted on the most confident permutations, and the resulting weights are averaged to produce a single, more robust model for inference.
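As a reading aid, the rank-adapt-average loop described above might be sketched as follows (plain NumPy; `forward` and `adapt` are hypothetical stand-ins for the model's forward pass under a given traversal permutation and for the paper's entropy-minimization step):

```python
import numpy as np

def predictive_entropy(probs):
    """Mean Shannon entropy of a batch of softmax outputs (lower = more confident)."""
    return float(-(probs * np.log(probs + 1e-12)).sum(axis=1).mean())

def trust_sketch(weights, permutations, forward, adapt, k=2):
    """Rank traversal permutations by predictive entropy, adapt a copy of the
    weights on each of the top-k, then average the adapted weights."""
    # Offline phase: one forward pass per permutation, ranked by confidence.
    ranked = sorted(permutations, key=lambda p: predictive_entropy(forward(weights, p)))
    top_k = ranked[:k]
    # Adaptation phase: one adapted weight set per selected traversal.
    adapted = [adapt(dict(weights), p) for p in top_k]
    # Evaluation phase: uniform weight averaging across the adapted copies.
    return {name: np.mean([w[name] for w in adapted], axis=0) for name in weights}
```

This is only a structural sketch under the assumptions above, not the authors' implementation; in the paper, adaptation updates Mamba-specific SS2D parameters by entropy minimization.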
Strengths and Weaknesses
Strengths
The primary strength of this work is its novelty and clever use of the model's intrinsic properties. By treating the internal traversal paths as a form of implicit data augmentation, the method avoids the computational overhead of traditional augmentation techniques and is the first of its kind for State Space Models. The experimental results are impressive, showing consistent and significant improvements over well-established, general-purpose TTA methods across a variety of benchmarks.
Weaknesses
My main concerns with this paper relate to its specificity and experimental setup:
- The core mechanism of TRUST is fundamentally tied to the VMamba architecture. The concept of permuting traversal paths cannot be applied to other popular architectures like CNNs or standard Vision Transformers, which limits the broader impact of the proposed method. It's an effective technique for one specific model family, not a general TTA paradigm. Moreover, the adaptation exclusively targets "Mamba-specific state space parameters." The paper would benefit from justifying this choice—it's unclear if adapting other components (e.g., normalization layers, as in TENT) was explored and what the performance trade-offs might be.
- For the VSS block, do you need to process the same image K times? This sounds even worse than data augmentation when the number of augmented samples is much smaller than K (e.g., 1). Moreover, are the traversal permutations fundamentally different from rotating the original image? For example, if we rotate the input image, would this be similar to changing from permutation (a) to (b) in Fig. 1? It seems like data augmentation methods require the model to process one image T times (T = number of augmented samples) while TRUST processes the same image K times with different traversal orders. Both methods require multiple forward passes per image, right? Also, for weight averaging approaches, there are works like [6] that achieve this without data augmentation (weight averaging prompt pools for the same image without multiple forward passes). A discussion of works in this direction is also necessary.
- The baselines used are outdated. Stronger recent baselines [1-5] should be discussed to demonstrate performance. Also, [2, 4] are architecture-specific TTA methods (learning visual prompts for ViT) but could be extended to CNNs, whereas TRUST only works for VMamba. This makes the comparison potentially unfair since other methods are more architecture-agnostic.
- The multi-stage process (generating permutations, ranking by entropy, separate adaptation for top-K paths, and final weight averaging) appears more complex and computationally intensive than single-pass methods like TENT, especially since it involves multiple backpropagation steps. A computation and memory comparison is needed.
[1] Self-Bootstrapping for Versatile Test-Time Adaptation (ICML2025)
[2] Test-Time Model Adaptation with Only Forward Passes (ICML2024)
[3] Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation (ICML2025)
[4] OT-VP: Optimal Transport-Guided Visual Prompting for Test-Time Adaptation (WACV2025)
[5] Active Test-Time Adaptation: Theoretical Analyses and An Algorithm (ICLR2024)
[6] DPCore: Dynamic Prompt Coreset for Continual Test-Time Adaptation (ICML2025)
[7] Continual test-time domain adaptation (CVPR2022)
[8] NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation (NeurIPS2022)
[9] Robust Test-Time Adaptation in Dynamic Scenarios (CVPR2023)
Questions
- The experiment design only focuses on the simplest TTA setting. Testing TRUST under more challenging settings could further demonstrate its performance since the model architecture differs significantly from existing approaches—consider Continual TTA [7] or temporal correlation settings [6, 8, 9].
- What's the latency when each batch requires all three phases (offline, adaptation, evaluation) to be performed?
- Could the proposed method be applied to other tasks like segmentation?
- I'm not familiar with Mamba—is it really necessary to develop a method specialized for it? Probably not a good question for this paper, but this might strengthen the paper's motivation if addressed.
I'd be happy to raise my rating if these weaknesses and questions are thoroughly addressed in the rebuttal.
Limitations
Though the authors mention that the limitations are discussed "in the last part of the conclusion section" (Line 508), I failed to find them.
Justification for Final Rating
My concerns have been addressed, and I have raised my final score to 4.
Formatting Concerns
NA
We thank the reviewer for their thoughtful review. We are glad that the paper’s novelty and the use of VMamba’s intrinsic traversal mechanism were well appreciated. The recognition of our permutation strategy as a form of implicit augmentation, along with the strong empirical results, reinforces the effectiveness and originality of the proposed approach.
1. The experiment design only focuses on the simplest TTA setting. Consider Continual TTA [7] or temporal correlation settings [6, 8, 9].
We would like to clarify that continual test-time adaptation and temporal correlation represent distinct subfields within test-time adaptation, with different assumptions and objectives, often requiring longer adaptation timelines and specialized setups. Additionally, most papers in the TTA literature do not compare their results against continual adaptation or temporally correlated test-time settings, such as those explored in [1, 2, 3]. Moreover, implementing and evaluating TRUST in this setting would require extensive design and tuning beyond the scope of a one-week rebuttal period. We will explore this in future work.
2. What's the latency when each batch requires all three phases (offline, adaptation, evaluation) to be performed?
Our method consists of three stages.

Offline phase: We evaluate our pre-trained model under different traversal permutations and select the top-K permutations with the lowest entropy values, corresponding to the most confident predictions. Since no model adaptation occurs at this stage, the procedure amounts to forward-pass evaluation and is computationally lightweight. For instance, on the PACS dataset, evaluating a single permutation over all images takes only 4 seconds; extending this across all 24 permutations gives a total evaluation time of 1.5 minutes, which is a one-time cost. Our experiments show that the top-K permutations remain consistent across all test datasets, as visualized in Figure 2 of the supplementary material. This consistency stems from the similarity of these permutations to the original traversal order used during pre-training, as reflected in their high confidence scores. The procedure can therefore be performed once on a single dataset to determine the top-K permutations, and then reused across all test datasets.

Adaptation stage: With the batch size of 128 used in our experiments, a forward and backward pass takes approximately 1.2 seconds per permutation.

Evaluation stage: We first perform weight averaging, a simple averaging of the model weights across the selected permutations (0.018 seconds). The subsequent evaluation functions like the offline stage: no adaptation is performed, and we simply test the model (0.2 seconds).

The overall per-permutation latency is thus 1.2 + 0.018 + 0.2 ≈ 1.42 seconds.
3. Could the proposed method be applied to other tasks like segmentation?
We conducted additional experiments on segmentation tasks using various datasets, including Pascal VOC21 and Pascal Context 59. The results demonstrate that TRUST performs well in segmentation, outperforming methods like TENT. This further supports the generalizability and effectiveness of our approach beyond classification settings.
Pascal VOC21:
| Corruption | gaussian | shot | impulse | defocus | glass | motion | zoom | frost | snow | fog | brightness | contrast | elastic | pixelate | jpeg | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source only | 29.1 | 33.1 | 28.3 | 21.0 | 8.2 | 33.1 | 25.4 | 50.9 | 50.3 | 70.7 | 76.5 | 63.9 | 25.5 | 22.2 | 59.2 | 39.8 |
| Tent | 33.0 | 35.7 | 32.0 | 22.3 | 14.7 | 38.2 | 25.3 | 46.5 | 49.0 | 60.2 | 63.9 | 66.2 | 38.5 | 28.8 | 43.9 | 39.9 |
| Trust | 38.8 | 42.0 | 38.7 | 29.8 | 22.6 | 45.1 | 29.8 | 50.5 | 53.5 | 63.4 | 66.4 | 68.5 | 45.1 | 37.7 | 48.6 | 45.4 |
Pascal Context 59:
| Corruption | gaussian | shot | impulse | defocus | glass | motion | zoom | frost | snow | fog | brightness | contrast | elastic | pixelate | jpeg | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source only | 17.1 | 19.6 | 17.4 | 27.4 | 14.9 | 29.2 | 19.5 | 30.2 | 28.5 | 42.1 | 50.8 | 41.0 | 23.9 | 30.4 | 38.4 | 28.7 |
| Tent | 17.6 | 18.9 | 17.8 | 22.2 | 15.9 | 27.5 | 17.9 | 26.9 | 30.0 | 36.7 | 41.9 | 42.9 | 25.8 | 28.2 | 28.3 | 26.6 |
| Trust | 24.4 | 27.4 | 25.4 | 24.6 | 21.2 | 30.1 | 19.8 | 29.8 | 32.8 | 39.2 | 42.4 | 43.2 | 31.5 | 36.1 | 31.6 | 30.6 |
4. Is it really necessary to develop a method specialized for it?
As we discussed in the introduction, Mamba is emerging as a powerful alternative to both CNNs and ViTs across a range of vision tasks. Unlike ViTs, which rely on self-attention with quadratic complexity in sequence length, Mamba is built on SSMs that enable linear-time sequence modeling, making it highly efficient for long input sequences. This makes Mamba especially attractive in settings with limited compute resources. Despite being more lightweight, Mamba retains competitive or even superior performance in many tasks. Several recent papers [4, 5, 6] have shown that Mamba outperforms CNNs and ViTs on image classification, segmentation, and even video understanding, while using fewer parameters and less memory. Previous test-time adaptation methods, such as TENT and SHOT, are not well-suited to Mamba-based models, as demonstrated in Table 2 of our paper. Given Mamba's unique architecture and favorable properties, such as linear-time complexity, we focused on designing a strategy specifically tailored to Mamba to effectively handle domain shifts at test time in vision applications.
5. The adaptation exclusively targets "Mamba-specific state space parameters." How does the performance differ when other components (e.g., normalization layers, as in TENT) are adapted?
We provided this experiment in Table 2 of our supplementary material. The results show that our method outperforms TENT in both settings: by 3.1% when adapting BatchNorm parameters and by 11.8% with SS2D adaptation. We opted to proceed with SS2D adaptation, as traversal permutations directly influence the Mamba-specific parameters encoded in the SS2D blocks. Updating these parameters is therefore essential to fully capture the effects of traversal-based modifications.
ImageNet-C:
| Metric | Source only | TENT (BN) | TRUST (BN) | TENT (SS2D) | TRUST (SS2D) |
|---|---|---|---|---|---|
| Mean | 38.7 | 41.7 | 44.8 | 44.3 | 56.1 |
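To make the two adaptation settings concrete, here is a minimal sketch of restricting the trainable set to a named parameter group, as one would do when choosing between norm-only and SS2D-only adaptation (plain dicts stand in for a model's `named_parameters()`; the parameter names below are hypothetical):

```python
def select_adaptable(named_params, keywords):
    """Keep only parameters whose names contain one of the keywords,
    mirroring the BatchNorm-only vs. SS2D-only adaptation choices."""
    return {name: p for name, p in named_params.items()
            if any(kw in name for kw in keywords)}

# Hypothetical VMamba-style parameter names.
params = {
    "blocks.0.ss2d.A_log": 0.1,
    "blocks.0.ss2d.dt_proj.weight": 0.2,
    "blocks.0.norm.weight": 0.3,
    "head.weight": 0.4,
}
ss2d_only = select_adaptable(params, ["ss2d"])   # SS2D-style setting
norm_only = select_adaptable(params, ["norm"])   # TENT-style norm setting
```

In a real framework the same filter would be applied to the optimizer's parameter groups, freezing everything else.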
6. Compare with the suggested baselines.
Due to architectural differences between VMamba and the backbones used in the baselines mentioned by the reviewer, a direct comparison would not be entirely fair. These ViT-based methods will be included in a separate category in the final version of the paper. However, to provide a fair evaluation, we adapted the ViT-based method from [2] to the VMamba backbone. Specifically, we replaced the CLS token (absent in VMamba) with the mean of all tokens and adjusted the number of learnable tokens to match VMamba's dimensionality. This ensures that both methods operate under the same architectural constraints. Our experimental results demonstrate that TRUST, which is specifically designed to exploit VMamba's traversal mechanism, significantly outperforms the adapted baseline, highlighting its robustness under distribution shift.
ImageNet-C:
| Corruption | gaussian | shot | impulse | defocus | glass | motion | zoom | frost | snow | fog | brightness | contrast | elastic | pixelate | jpeg | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source only | 24.3 | 26.1 | 25.1 | 22.2 | 23.2 | 35.4 | 43.2 | 49.3 | 48.4 | 56.9 | 70.0 | 26.8 | 45.1 | 43.7 | 41.4 | 38.7 |
| Tent | 27.8 | 30.0 | 28.8 | 24.9 | 25.9 | 38.0 | 45.5 | 51.0 | 51.3 | 59.1 | 70.6 | 30.0 | 48.2 | 47.8 | 45.7 | 41.7 |
| Trust | 46.8 | 49.4 | 48.5 | 42.8 | 40.8 | 57.1 | 57.9 | 57.3 | 61.7 | 66.8 | 71.9 | 54.9 | 61.4 | 63.6 | 60.2 | 56.1 |
| FOA | 17.8 | 19.4 | 18.5 | 15.1 | 18.2 | 22.7 | 28.6 | 38.7 | 33.9 | 44.3 | 57.4 | 22.7 | 38.2 | 40.9 | 41.7 | 30.5 |
7. Are the traversal permutations fundamentally different from rotating the original image?
While image rotation changes the spatial arrangement of tokens, it is not equivalent to the traversal permutations we apply within VMamba, as the specific traversal orders used in our method cannot be replicated through simple rotations. As shown in Figure 4 of our paper, replacing traversal permutations with standard image rotations leads to significantly lower performance. This confirms that the gains from TRUST arise specifically from manipulating the causal processing order through traversal permutations, rather than from altering the input appearance via rotation.
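A toy 2×2 token grid illustrates the counting side of this argument (ignoring that rotation also changes pixel content, which traversal permutation does not): re-reading a rotated grid in raster order realizes only 4 of the 4! = 24 possible token orders.

```python
import numpy as np
from itertools import permutations

grid = np.arange(4).reshape(2, 2)  # token ids 0..3 in raster-scan order

# Token orders obtainable by rotating the image and re-reading in raster order.
rotation_orders = {tuple(np.rot90(grid, r).ravel()) for r in range(4)}
# All possible traversal orders of the four tokens.
all_orders = set(permutations(range(4)))
```

So rotations cover only a strict subset of traversal orders, consistent with the observation that TRUST's permutations cannot be replicated by rotating the input.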
[1] WATT: Weight average test time adaptation of CLIP. Advances in neural information processing systems, 37, 48015-48044.
[2] Clipartt: Adaptation of clip to new domains at test time. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 7092-7101). IEEE.
[3] Test-time prompt tuning for zero-shot generalization in vision-language models. Advances in Neural Information Processing Systems, 35, 14274-14289.
[4] Vmamba: Visual state space model. Advances in neural information processing systems, 37, 103031-103063.
[5] Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model.
[6] LocalMamba: Visual State Space Model with Windowed Selective Scan. CoRR.
I thank the authors for their detailed rebuttal and the new experiments, which have addressed most of my initial concerns.
However, a key point from my second weakness remains unaddressed:
Also, for weight averaging approaches, there are works like [6] that achieve this without data augmentation (weight averaging prompt pools for the same image without multiple forward passes). A discussion of works in this direction is also necessary.
The rebuttal provides a latency breakdown for the proposed method but omits a discussion and comparison of its multi-pass weight averaging approach against more efficient single-pass methods, such as the one described in reference [6]. This discussion is necessary to fully contextualize the method's contributions and understand its efficiency trade-offs relative to existing work.
I will re-evaluate my score once this point has been clarified.
Thank you for the thoughtful follow-up; we are glad that our rebuttal and new experiments addressed most of the initial concerns. We now address the remaining point:
1 · Fundamental Difference in Design.
DPCore does not perform model weight averaging. Instead, it constructs a weighted prompt by interpolating prompt tokens from a learned coreset, where the weights are computed from the distance between the current batch's feature statistics (extracted without prompts) and those stored in the coreset. As described in the paper: “each batch requires two additional forward passes (one without prompt, one with weighted prompt)”. Therefore, while DPCore avoids per-image multi-prompt trials, it is strictly dual-pass. Moreover, DPCore learns a new prompt from scratch over 50 steps and refines the existing prompt for 1 step. By contrast, TRUST adapts with a single pass per selected traversal, and evaluation is a single pass after weight averaging across traversals. It is worth noting that our method is not limited to K=6 traversal permutations; similar to DPCore's dual-pass setup, it can benefit from as few as two traversal permutations (Table 1).
Table 1:
| Methods | CIFAR10-C (%) | CIFAR100-C (%) | ImageNet-C (%) |
|---|---|---|---|
| Source Only | 65.9 | 41.2 | 38.7 |
| Tent | 66.5 | 41.8 | 41.7 |
| TRUST (K=2) | 75.6 | 51.7 | 54.4 |
2 · Efficiency Trade-Offs and Practical Implementation.
As DPCore and TRUST use different architectures, a direct performance comparison is not possible. Instead, we conducted a comparative evaluation using Tent as a baseline.
We report the relative latency of our method against DPCore and Tent. As stated in the DPCore paper, their method incurs approximately 1.8× the latency of Tent. In our case, with K = 2 traversals, our method incurs roughly 3.2× Tent's latency (see Table 2). However, this increase is modest relative to the performance gains over Tent. Notably, TRUST achieves an additional 1.6% improvement compared to DPCore.
Table 2:
| Methods | Time |
|---|---|
| Tent | 1.0 |
| DPCore | 1.8x |
| TRUST | 3.2x |
3 · Different assumptions: source-free vs. source-anchored.
Furthermore, DPCore uses source images during adaptation and depends on pre-computed source/reference statistics (typically from ~300 unlabeled source images, or from a proxy public dataset) to anchor its alignment objective. In contrast, TRUST is source- and reference-free: we do not require source images, source statistics, or proxy datasets at any stage.
4 · Conclusion.
Finally, TRUST and methods like DPCore operate in different architectural regimes: DPCore is prompt-centric and well-suited for ViTs, whereas TRUST is traversal-centric and tailored for Mamba-based models. Our aim is not to replace single-pass methods, but to offer an architecture-aligned TTA strategy that is both robust and efficient for state-space models. We will incorporate a brief discussion of DPCore and related approaches in the final version to contextualize TRUST’s design choices and highlight its unique contributions.
Thank you for the follow-up response. It has addressed my question regarding the comparison between TRUST and DPCore. The clarification that DPCore is also a 2-pass method and the distinction between source-free and source-anchored approaches are noted.
Incorporating the new experimental results and the clarifications regarding related literature from the rebuttal will improve the final paper by contextualizing its contributions. I have no further questions and will raise my score to 4.
Thank you for your thoughtful response and for acknowledging our clarifications. We appreciate your updated evaluation and your time in reviewing our work.
The paper proposes TRUST, a TTA method that capitalizes on the architectural properties of VMamba, a vision-adapted State Space Model. It tackles the issue of performance degradation of SSMs under distribution shifts, which arises from strong inductive biases and the accumulation of domain-specific artifacts in hidden states due to fixed traversal paths. TRUST generates multiple traversal permutations of input images, which are then used to create diverse causal perspectives. By selecting the top-K permutations with the lowest predictive entropy and averaging the Mamba-specific parameters, the method mitigates the impact of noisy predictions and promotes robustness. The authors validate TRUST on various corruption-based and domain generalization benchmarks, showing consistent improvements over baseline TTA approaches.
Strengths and Weaknesses
Strengths
- Novelty in SSM Adaptation: TRUST is presented as the first TTA approach explicitly designed for Mamba-based vision models, leveraging their internal traversal dynamics rather than relying on external data augmentation or auxiliary models. This is a significant contribution to the field of TTA for emerging architectures like SSMs.
- Effective Use of Architectural Properties: The method effectively utilizes the unique traversal mechanism of VMamba to generate diverse causal perspectives of the input, which is a clever way to induce variability and explore flatter minima in the loss landscape without additional computational overhead from data augmentations.
- Strong Experimental Results: The paper provides comprehensive experimental validation on seven standard benchmarks, including CIFAR10-C, CIFAR100-C, ImageNet-C, PACS, ImageNet-Sketch, ImageNet-V2, and ImageNet-R. TRUST consistently outperforms existing state-of-the-art TTA methods, demonstrating significant improvements in robustness and generalization under various distribution shifts.
- Clear Ablation Studies: The paper includes "TRUST naive" as a baseline which adapts only Mamba-specific SS2D parameters, while "TRUST" further enhances robustness by averaging over multiple traversal permutations during adaptation. This effectively highlights the contribution of the proposed permutation and weight averaging strategy.
Weaknesses
- Complexity of Traversal Permutations: While novel, the concept of systematically generating and selecting traversal permutations based on entropy might introduce a non-trivial computational overhead during the offline phase, especially if the number of permutations or the size of the model increases. The paper states that "All experiments were conducted using a single NVIDIA A6000 GPU", but doesn't elaborate on the time taken for the offline phase. Further analysis on the scalability of this permutation generation and selection process would strengthen the paper.
- Limited Generalizability to Other Architectures: The method is highly specific to the architectural properties of VMamba and its SS2D module. While this is presented as a strength, it also implies that the core mechanism of TRUST might not be directly transferable to other backbone architectures (e.g., CNNs or standard Transformers) without significant modifications. This limits its broad applicability across the wider machine learning community.
- Detailed Explanation of "Confidence Paths": The paper mentions that TRUST "averages model weights from the most confident paths", and selects permutations "with the lowest entropy, which are associated with more stable and domain-robust hidden states". While the link to flat minima is made, a deeper theoretical or empirical dive into why these "confident paths" (low entropy permutations) consistently lead to better generalization beyond the current explanations would be beneficial.
- Discussion on Hyperparameter Sensitivity: The method involves selecting the top-K permutations. A more detailed discussion on the sensitivity of the results to the choice of K, and how this hyperparameter is determined, would improve the completeness of the experimental analysis.
Questions
- Could you provide a more detailed analysis of the computational overhead, particularly the time complexity, of the traversal permutation generation and selection process in the offline phase? How does the scalability of this component affect the practical application of TRUST, especially with larger models or an increased number of permutations?
- Given that TRUST's core mechanism is highly specific to the architectural properties of VMamba and its SS2D module, could you elaborate on the potential challenges or modifications required to adapt the underlying principles of uncertainty-guided traversal to other neural network architectures, such as CNNs or standard Transformers?
- Regarding the "confidence paths" and the selection of permutations with the lowest entropy: Could you provide a deeper theoretical or empirical explanation as to why these specific paths consistently lead to better generalization and more stable, domain-robust hidden states, beyond the current explanation of exploring flatter minima? Additionally, could you discuss the sensitivity of the results to the choice of the hyperparameter K (number of top permutations selected) and the methodology used to determine its optimal value?
Limitations
- Dedicated Limitations Section: Authors should include a "Limitations" section to acknowledge the inherent constraints and potential drawbacks of their method. This could include:
- Computational Overhead: Acknowledge and quantify the potential computational cost associated with generating and selecting traversal permutations, especially in real-time or resource-constrained environments.
- Specificity to VMamba: Explicitly state that the method's reliance on VMamba's unique architectural properties limits its direct applicability to other model architectures. Discuss whether the core ideas could be generalized and what challenges might arise.
- Hyperparameter Sensitivity: Discuss the sensitivity of the results to key hyperparameters, such as the number of top-K permutations, and the robustness of the method across a wider range of hyperparameters.
- Theoretical Guarantees: Address the lack of strong theoretical guarantees for why "confident paths" (low entropy permutations) consistently lead to better generalization.
- Discussion on Potential Negative Societal Impact: It's crucial to discuss any potential negative societal impacts, even if seemingly minor or indirect. This could involve:
- Bias Amplification: Consider whether the adaptation process could inadvertently amplify existing biases present in the training data, especially when applied to real-world, diverse datasets.
- Adversarial Robustness: While improving robustness to natural distribution shifts, discuss if the method inadvertently introduces new vulnerabilities to adversarial attacks or if its effectiveness in adversarial settings is yet to be explored.
- Ethical Considerations in Deployment: Briefly touch upon the ethical considerations when deploying such adaptive models, particularly in high-stakes applications where errors due to unaddressed biases or unforeseen shifts could have significant consequences.
- Environmental Impact: While not explicitly explored, a brief mention of the energy consumption associated with extensive training and adaptation (if applicable) could be considered, aligning with broader discussions on sustainable AI.
Formatting Concerns
Looks fine
We thank the reviewer for their thoughtful feedback and are pleased that TRUST was recognized as a novel and effective TTA method for Mamba-based vision models. We appreciate the positive remarks on our use of traversal dynamics, strong experimental results across diverse benchmarks, and the clarity of our ablation studies. Below, we address the reviewer comments in detail.
1-1. Time complexity of the traversal permutation generation and selection in the offline phase.
In the offline phase, we evaluate our pre-trained model under different traversal permutations and select the top-k permutations with the lowest entropy values, corresponding to the most confident predictions. Since no model adaptation occurs at this stage, the procedure is equivalent to a forward-pass evaluation and is computationally lightweight. For instance, on the PACS dataset, evaluating for a single permutation takes only 4 seconds for all images in this dataset. Extending this across all 24 permutations results in a total evaluation time of 1.5 minutes, which is a one-time cost. Our experiments show that the top-K permutations remain consistent across all test datasets, as visualized in Figure 2 of the supplementary material. This consistency stems from the similarity of these permutations to the original traversal order used during pre-training, as reflected in their high confidence scores. Therefore, this procedure can be performed only once on a single dataset to determine the appropriate Ks, and then used across all test datasets.
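As a small numeric illustration of the selection rule described above (made-up softmax outputs; lower entropy corresponds to higher confidence):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy of a single softmax distribution."""
    return float(-(probs * np.log(probs)).sum())

# Hypothetical mean softmax outputs under three traversal permutations.
perm_probs = {
    "perm_A": np.array([0.90, 0.05, 0.05]),  # confident
    "perm_B": np.array([0.34, 0.33, 0.33]),  # near-uniform
    "perm_C": np.array([0.60, 0.30, 0.10]),
}
ranked = sorted(perm_probs, key=lambda name: entropy(perm_probs[name]))
top_k = ranked[:2]  # the two most confident permutations are kept
```

The ranking requires only forward passes, matching the claim that the offline phase is lightweight.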
1-2. How does the scalability of this component affect the practical application of TRUST, especially with larger models or increased number of permutations?
As mentioned previously, each permutation in the offline phase is evaluated using only a forward pass, no backpropagation is involved. The total evaluation time scales linearly with the number of permutations. In terms of memory, the model size remains constant regardless of the number of permutations, since the same model is reused for each forward pass. Therefore, the memory required is simply that needed to run a single evaluation of the model.
2. Elaborate on the potential challenges or modifications required to adapt to other NNs such as CNNs or ViTs.
Mamba is emerging as a powerful alternative to both CNNs and ViTs across a range of vision tasks. Unlike ViTs, which rely on self-attention and quadratic complexity with respect to sequence length, Mamba is built on SSMs that enable linear-time sequence modeling, making it highly efficient for long input sequences. This makes Mamba especially attractive in settings with limited compute resources. Despite being more lightweight, Mamba retains competitive or even superior performance in many tasks. Several recent papers [1, 2, 3] have shown that Mamba outperforms ViTs on image classification, segmentation, and detection, while using less memory.
Previous test-time adaptation methods, such as TENT and SHOT, are not well-suited for Mamba-based models, as demonstrated in Table 2 of our paper. Due to the unique architecture of Mamba and favorable properties, such as linear-time complexity, we focused on designing a strategy specifically tailored for Mamba to effectively handle domain shifts during test time in vision applications. Extending our approach to CNNs or ViTs would undermine our core objective, as it would forgo the key advantage of leveraging Mamba’s efficient and sequential computation.
3-1. Deeper theoretical explanation of why these confidence paths consistently lead to better generalization, beyond the current explanation of exploring flatter minima?
We provide a deeper theoretical justification based on SWAD [5]. In our setting, each traversal permutation defines a distinct causal ordering through which VMamba processes input patches, resulting in different trajectories of hidden states. Low-entropy (high-confidence) permutations consistently correspond to stable recurrent dynamics, where the influence of corruptions is minimized. This stabilizes the hidden-state evolution, yielding low-variance predictions and reducing sensitivity to domain-specific perturbations. From a loss landscape perspective, such paths correspond to regions of low curvature, as smoother predictions typically reflect flatter neighborhoods around the adapted model's parameters.
Therefore, selecting low-entropy permutations acts as a proxy for identifying solutions that are both dynamically stable (in terms of hidden-state recurrence) and geometrically robust (in terms of curvature). Averaging the adapted weights from these confident paths further concentrates the model in a flat, low-loss region of parameter space. This synergy between entropy-based selection and weight averaging aligns directly with the generalization theory of SWAD [5], and provides a principled explanation for the observed robustness of TRUST across domains.
3-2. Sensitivity of the results to the choice of the hyperparameter K (number of top permutations selected) and the methodology used to determine its optimal value?
By applying multiple traversal permutations, we expose the model to diverse causal perspectives of the same input. This variation encourages the model to learn different adaptation patterns at test time. Increasing the number of traversal permutations introduces more variation and, as shown in works such as Model Soups [4], a greater number of diverse, correctly aligned representations (our top-K permutations) can enhance the effectiveness of weight averaging. Figure 5 in the paper (table below) empirically supports this observation across three datasets. Our experiments show that increasing permutation diversity improves robustness, with performance gains saturating around six permutations, which offers a good trade-off between accuracy and efficiency.
| Traversal Permutation Number | CIFAR10-C (%) | CIFAR100-C (%) | ImageNet-C (%) |
|---|---|---|---|
| 2 | 75.6 | 51.7 | 54.4 |
| 4 | 77.1 | 53.7 | 55.6 |
| 6 | 77.5 | 54.3 | 56.1 |
| 8 | 77.6 | 54.7 | 55.5 |
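For concreteness, the entropy-guided selection of the top-K permutations described above can be sketched as follows; `predict_probs` is a hypothetical stand-in for a forward pass of VMamba under a given traversal order:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy of one predictive distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_top_k(permutations, batches, predict_probs, k=6):
    """Rank traversal permutations by mean prediction entropy; keep the k lowest.

    predict_probs(perm, batch) -> list of probability vectors (hypothetical).
    """
    scored = []
    for perm in permutations:
        ents = [shannon_entropy(p)
                for batch in batches
                for p in predict_probs(perm, batch)]
        scored.append((sum(ents) / len(ents), perm))
    scored.sort(key=lambda pair: pair[0])
    return [perm for _, perm in scored[:k]]
```

A permutation producing sharp (confident) predictions is ranked ahead of one producing near-uniform predictions; with K = 6, the six lowest-entropy traversal orders are kept for adaptation.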
Limitations
Thank you for the thoughtful suggestions. We will incorporate a dedicated Limitations section in the final version, addressing potential societal impacts, including bias, adversarial robustness, ethical considerations, and environmental footprint.
[1] Liu et al. VMamba: Visual state space model. Advances in Neural Information Processing Systems, 37, 103031-103063.
[2] Zhu et al. Vision Mamba: Efficient visual representation learning with bidirectional state space model. In International Conference on Machine Learning. PMLR.
[3] Huang et al. LocalMamba: Visual state space model with windowed selective scan. CoRR.
[4] Wortsman et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning (pp. 23965-23998). PMLR.
[5] Cha et al. SWAD: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34, 22405-22418.
Dear Reviewer sdHM,
Thank you very much for your thorough and insightful review. We hope that our rebuttal has addressed your concerns, especially regarding the computational cost and scalability of permutation selection, the theoretical basis for entropy-guided traversal selection, and the sensitivity to hyperparameters such as the number of top-K permutations. We would be happy to provide further clarification or discussion on any remaining points you might have.
TRUST addresses two key challenges faced by VMamba under distribution shifts: (1) the strong inductive bias introduced by fixed traversal scans, and (2) the accumulation of domain-specific artifacts in hidden states during sequential processing. To tackle these issues, TRUST introduces a test-time adaptation strategy specifically designed for vision-oriented state-space models.
To mitigate the first challenge, TRUST proposes Traversal Permutation, which involves training multiple versions of the model using different traversal orders. For the second challenge, TRUST merges the weight embeddings by averaging the parameters across all trained model variants, effectively reducing domain-specific biases.
Experiments demonstrate that TRUST outperforms existing SOTA strategies and significantly improves VMamba's performance in cross-domain classification tasks.
Strengths and Weaknesses
Strengths:
- The paper proposes a novel strategy for adapting VMamba to domain-specific data.
- The strategy does not rely on external source data, making it easily transferable across datasets from different domains.
- The proposed method significantly improves VMamba’s performance in cross-domain classification tasks and outperforms other adaptation methods applied to VMamba.
- The writing is clear and easy to follow.
Weaknesses:
- The paper would benefit from a more thorough explanation of how the simple weight averaging across multiple models leads to performance gains. A deeper analysis or ablation study could strengthen the justification.
Questions
- How is the value of K (i.e., the number of traversal permutations or model variants) determined across different images or datasets? Is K fixed, adaptive, or tuned per dataset or image? Clarifying this would help assess the generalizability and scalability of the method.
- While the proposed strategy is evaluated on classification tasks, is it applicable to other vision tasks such as object detection or semantic segmentation?
Limitations
Yes.
Final Justification
Having reviewed your response and the comments from the other reviewers, I will be keeping my original score.
Formatting Issues
N/A
We thank the reviewer for their thoughtful review. We are glad our paper was recognized as novel and practical in addressing VMamba's limitations under distribution shifts. We appreciate the positive comments on the method's effectiveness, its source-free nature, and the clarity of the writing.
1. Explanation of how the simple weight averaging across multiple models leads to performance gains.
The concept of weight averaging seeks to identify a solution in parameter space that lies within a flat region of the loss landscape. Such flat minima are associated with low loss values that remain stable under small perturbations to the model parameters. Models converging to these regions tend to generalize better under distribution shifts, as their performance is less sensitive to noise or minor changes in input data and parameters. In our approach, the integration of traversal permutations with weight averaging is inspired by Robust Risk Minimization (RRM) [1]. RRM aims to minimize the worst-case empirical loss within a neighborhood around the current parameters, formulated as:

$$\min_{\theta} \max_{\|\Delta\| \le \gamma} \mathcal{E}(\theta + \Delta),$$

where $\mathcal{E}$ is the empirical loss, $\gamma$ defines the neighborhood around the model parameters $\theta$, and $\Delta$ represents small perturbations. While we do not directly optimize this objective, the principle aligns with our use of weight averaging across traversal permutations. By averaging the adapted weights, we converge toward a flatter region in weight space, improving generalization during test-time adaptation and enhancing robustness to distribution shifts.
2. How is the value of K (i.e., the number of traversal permutations or model variants) determined across different images or datasets? Is K fixed, adaptive, or tuned per dataset or image?
As shown in Figure 5 of the paper (table below), we experimented with different values of K across multiple datasets. Prior work such as Model Soups [4] suggests that averaging a diverse set of well-aligned models (our top-K permutations) can improve performance. Consistent with this, our results indicate that accuracy generally increases with more permutations, but the gains saturate beyond K = 6. This suggests that using six permutations provides an effective balance between accuracy and computational cost. Accordingly, we fixed K = 6 for all experiments.
| Traversal Permutation Number | CIFAR10-C (%) | CIFAR100-C (%) | ImageNet-C (%) |
|---|---|---|---|
| 2 | 75.6 | 51.7 | 54.4 |
| 4 | 77.1 | 53.7 | 55.6 |
| 6 | 77.5 | 54.3 | 56.1 |
| 8 | 77.6 | 54.7 | 55.5 |
3. Is your method applicable to other vision tasks such as object detection or semantic segmentation?
Thank you for this thoughtful question. We conducted additional experiments on segmentation tasks using various datasets, including Pascal VOC (21 classes) and Pascal Context (59 classes). The results demonstrate that TRUST performs well in segmentation, outperforming methods like TENT. This further supports the generalizability and effectiveness of our approach beyond classification settings.
Pascal VOC (21 classes):
| Corruption | gaussian | shot | impulse | defocus | glass | motion | zoom | frost | snow | fog | brightness | contrast | elastic | pixelate | jpeg | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source only | 29.1 | 33.1 | 28.3 | 21.0 | 8.2 | 33.1 | 25.4 | 50.9 | 50.3 | 70.7 | 76.5 | 63.9 | 25.5 | 22.2 | 59.2 | 39.8 |
| Tent | 33.0 | 35.7 | 32.0 | 22.3 | 14.7 | 38.2 | 25.3 | 46.5 | 49.0 | 60.2 | 63.9 | 66.2 | 38.5 | 28.8 | 43.9 | 39.9 |
| Trust | 38.8 | 42.0 | 38.7 | 29.8 | 22.6 | 45.1 | 29.8 | 50.5 | 53.5 | 63.4 | 66.4 | 68.5 | 45.1 | 37.7 | 48.6 | 45.4 |
Pascal Context (59 classes):
| Corruption | gaussian | shot | impulse | defocus | glass | motion | zoom | frost | snow | fog | brightness | contrast | elastic | pixelate | jpeg | Mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source only | 17.1 | 19.6 | 17.4 | 27.4 | 14.9 | 29.2 | 19.5 | 30.2 | 28.5 | 42.1 | 50.8 | 41.0 | 23.9 | 30.4 | 38.4 | 28.7 |
| Tent | 17.6 | 18.9 | 17.8 | 22.2 | 15.9 | 27.5 | 17.9 | 26.9 | 30.0 | 36.7 | 41.9 | 42.9 | 25.8 | 28.2 | 28.3 | 26.6 |
| Trust | 24.4 | 27.4 | 25.4 | 24.6 | 21.2 | 30.1 | 19.8 | 29.8 | 32.8 | 39.2 | 42.4 | 43.2 | 31.5 | 36.1 | 31.6 | 30.6 |
[1] Cha et al. SWAD: Domain generalization by seeking flat minima. Advances in Neural Information Processing Systems, 34, 22405-22418.
[4] Wortsman et al. Model soups: Averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In International Conference on Machine Learning (pp. 23965-23998). PMLR.
Thank you for the clarifications regarding the choice of K and the downstream task performance. Having reviewed your response and the comments from the other reviewers, I will be keeping my original score.
Dear Reviewer pWep,
Thank you very much for your thoughtful review and constructive feedback. We hope our rebuttal has addressed your questions, particularly regarding how weight averaging contributes to improved generalization, how the top-K value is selected consistently across datasets, and the applicability of TRUST to other vision tasks like segmentation. We would be glad to address any further questions or feedback you may have.
The authors introduce a method called TRUST, to adapt vision SSMs at test-time in out-of-distribution conditions. The core idea in TRUST is to leverage variations of the patch traversal mechanism in the VMamba style architectures. Specifically, the authors permute the order of traversals and pick top-k orders that have minimal entropy on the test-time distribution. Then, the authors increase the sharpness of the predicted distribution by self-training on the pseudo-labels assigned by each of the top-k traversal orders to obtain a weight per traversal order. Eventually, the authors average the weights across all the top-k permutations to get the network weights. Through experiments, the authors demonstrate that TRUST outperforms algorithms like TENT and SHOT, when applied to the base VMamba model. The authors demonstrate these results on ImageNet and CIFAR corruption datasets, and other OOD shifts including PACS, ImageNet-S/v2/R. The authors further conduct several ablations to establish the impact of batch-size, input augmentations, number of traversal permutations, adaptation iterations, ensembling methods, specificity to permutations and computational costs with increase in traversals. Overall, within the purview of the conducted experiments, the authors propose a useful TTA method specifically for VMamba style architectures.
Strengths and Weaknesses
Strengths:
The following points highlight the strengths associated with the submission.
- The paper is generally well-written and easy to follow. The authors do a good job of motivating the architecture specific TTA method with TRUST. There is novelty in the sense that authors explicitly exploit the traversal procedure of VMamba to improve generalization. That said, this comes with the cost of being limited to a specific architecture.
- The authors conduct thorough experiments to outline the method’s effectiveness. The controlled set of experiments across differing visual distribution shifts clearly demonstrate that varying the traversal mechanisms in itself can provide substantial utility over applying more architecture-agnostic methods like TENT or SHOT. The margins over the base algorithms are quite substantial in some cases, highlighting the usefulness of the proposed method.
- Further, to justify and assess the impact of different design decisions, the authors conduct thorough ablations – ranging from the impact of batch-sizes, to specificity to permutation orders. For future work attempting to build on top of TRUST, these ablations serve as useful starting points to iterate on design decisions.
Weaknesses:
I have the following weaknesses to point out in the submission.
- The model performance being sensitive to the traversal order (abcd > badc) seems odd. Ideally, at test-time, one doesn’t have access to ground-truth to infer the best performing permutation. The submission might benefit from a little discussion about this observation. This being an outcome despite weight averaging seems especially odd – the model is likely carrying over some bias from the source training dataset that TRUST isn’t able to overcome. Concretely, two things would help improve the submission in this regard – (1) an investigation about why this is happening? and (2) if it’s possible to find a reliable heuristic to choose the best test-time permutation.
- The method generally seems very computationally intensive – both in terms of memory and latency. A natural counter-opinion is to ask if similar performance gains can be squeezed out of existing methods (combining augmentations, finding the best TENT, SHOT TTA variants) instead of adopting such a computationally expensive method. Understanding the limits or the experimental settings in which it’s feasible to apply TRUST will help understand downstream use-cases.
Questions
I think addressing the points outlined under weaknesses would help improve my rating of the submission. Specifically, the sensitivity to a particular traversal order and assessing the computational overhead vs performance gain trade-off (where best possible performance is squeezed out of a combination of strong augmentations and prior TTA methods).
Limitations
Yes. The authors do acknowledge the limitations of the proposed method.
Final Justification
I primarily had two concerns -- (1) sensitivity to the traversal order, (2) combating computational complexity. For the former, the rebuttal provided by the authors sufficiently addresses my concern (and emphasizes + ties together other design decisions). For the latter, the authors address my concern adequately by providing appropriate comparisons and highlighting efforts to deal with the computational trade-off.
Given that my concerns are sufficiently addressed, I am inclined to increase my rating of the submission.
Formatting Issues
N/A
We thank the reviewer for the valuable comments and are pleased that TRUST was recognized as a novel and well-motivated method for Mamba-style architectures. We appreciate the positive remarks on the clarity of our writing, the effectiveness of our approach, and the depth of our experiments and ablations. Below, we address the reviewer's comments in detail.
1-1. Why is the model performance sensitive to traversal orders?
State Space Models (SSMs) are inherently direction-sensitive: the order in which input patches are traversed has a significant impact on model performance. According to the Mamba formulation (Eq. 1 of our paper), the hidden state at each step directly influences the next, so the choice of traversal order propagates its effects through the entire sequence of hidden-state updates. This directional dependency also stems from the fact that the hidden states learned during training are aligned with the original traversal order. When a different patch order is used at test time, the model is exposed to a sequence it has not encountered before, resulting in a mismatch between learned and test-time dynamics. As noted in the Mamba paper, this directional bias can cause the model to encode information more effectively along the original traversal path.
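This order dependence can be verified on a toy scalar SSM: since each hidden state feeds into the next, permuting the input sequence generally changes the final state. The recurrence below uses made-up scalar parameters, not VMamba's learned ones:

```python
def final_state(inputs, a=0.9, b=1.0):
    """Final hidden state of a scalar linear SSM: h_t = a * h_{t-1} + b * x_t."""
    h = 0.0
    for x in inputs:
        h = a * h + b * x
    return h

seq = [1.0, 2.0, 3.0, 4.0]
h_orig = final_state(seq)        # original traversal order
h_perm = final_state(seq[::-1])  # permuted (reversed) traversal order
# Earlier inputs are decayed by `a` more times, so the two orders
# weight the same patches differently and yield different final states.
assert h_orig != h_perm
```

The same asymmetry holds per channel in the full selective-scan recurrence, which is why a traversal order unseen during training produces mismatched hidden-state dynamics.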
1-2. How are the top-k traversal orders selected without labels?
We emphasize that our selection of the top-k traversal permutations is entirely unsupervised and does not require access to labels. As detailed in Section 3.2 and illustrated in Figure 1 (offline phase) of the paper, we rank traversal orders solely by the average Shannon entropy of predictions on a test dataset. We conducted experiments on ImageNet-C and selected the top-k traversal permutations with the lowest prediction entropy. We observed that these top-k permutations remained largely consistent across different test data distributions, as illustrated in Figure 2 of the supplementary material.
1-3. Find a reliable heuristic to choose the best test-time permutation
To identify the most effective test-time permutation, we adopt Shannon entropy as a self-supervised confidence metric. While low entropy does not strictly guarantee better accuracy, it serves as a strong proxy for model certainty. In particular, entropy captures the sharpness of the predicted probability distribution: lower entropy indicates more confident predictions, which often correlate with more reliable traversal orders. Entropy therefore offers a useful heuristic for ranking permutations by their potential to produce accurate and stable outputs in the absence of labeled test data.
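As a quick numeric illustration of why entropy is a usable confidence proxy: a peaked prediction has far lower Shannon entropy than a near-uniform one (toy probability vectors, unrelated to any real model output):

```python
import math

def entropy(probs):
    """Shannon entropy in nats; uniform distributions maximize it."""
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.94, 0.02, 0.02, 0.02]  # sharp, low-entropy prediction
uncertain = [0.25, 0.25, 0.25, 0.25]  # uniform over 4 classes
assert entropy(confident) < entropy(uncertain)
# The uniform case attains the maximum for 4 classes: log(4) ≈ 1.386 nats.
```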
2-1. Computational Complexity
Our method supports an efficient parallel adaptation strategy, illustrated in Figure 1 of the supplementary material; its computational overhead is discussed in the ablation study of the paper. In this mode, we handle all K traversal permutations simultaneously. One batch is first split into K subsets, each corresponding to a different permutation. Each subset is then passed to an independent SS2D TRUST block, where the SS2D parameters are adapted in parallel while the rest of the model remains shared across all paths. After adaptation, the outputs are concatenated back into a single batch, which allows for efficient GPU utilization. For evaluation, we perform a weight-averaging step across all adapted SS2D modules, producing a single unified SS2D TRUST block that is then used for inference on the full batch.
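A heavily simplified schematic of this parallel mode (the function names `adapt_ss2d` and `average_params` are hypothetical placeholders, not the real implementation) might look like:

```python
def parallel_adapt(batch, permutations, adapt_ss2d, average_params):
    """Adapt K SS2D copies in parallel, one per traversal permutation.

    adapt_ss2d(subset, perm) -> adapted SS2D parameter dict (hypothetical).
    average_params(dicts)    -> single merged parameter dict for inference.
    """
    k = len(permutations)
    # Split the batch into K subsets, one per permutation.
    subsets = [batch[i::k] for i in range(k)]
    adapted = [adapt_ss2d(subset, perm)
               for subset, perm in zip(subsets, permutations)]
    # Weight-average the adapted SS2D modules into one block.
    return average_params(adapted)
```

Because only the SS2D parameters differ across the K paths while the rest of the network is shared, the K adaptations can be batched on one GPU rather than run as K sequential forward passes.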
2-2. Could the combination of augmentation and TENT yield comparable performance?
We explored different augmentations as an alternative to multiple traversal permutations in Figure 4 of the paper. That experiment evaluates the impact of weight averaging across models adapted with various augmentations versus models adapted with different traversal permutations. To ensure a fair comparison, we match the number of data augmentations to the number of traversal permutations. As shown in the table below, adapting over different traversal permutations (TRUST) consistently outperforms adapting over different augmentations, highlighting the superior effectiveness of permutation diversity in enhancing model performance.
CIFAR10-C:
| Source only | Tent | Rotation | Jitter | Crop | TRUST |
|---|---|---|---|---|---|
| 65.9 | 66.5 | 66.8 | 68.3 | 66.9 | 77.5 |
Dear Reviewer GHW6,
Thank you sincerely for your thoughtful and detailed review. We hope that our rebuttal has addressed your concerns, particularly those regarding (1) the sensitivity of model performance to traversal order and the rationale behind entropy-based selection, and (2) the computational trade-offs, including our proposed parallel adaptation strategy and comparative analysis with augmentation-based TTA methods. We would be happy to further elaborate on any remaining questions or suggestions you might have.
Thanks to the authors for responding to the concerns outlined in my review.
I primarily had two concerns -- (1) sensitivity to the traversal order, (2) combating computational complexity. For the former, the rebuttal provided by the authors sufficiently addresses my concern (and emphasizes + ties together other design decisions). For the latter, the authors address my concern adequately by providing appropriate comparisons and highlighting efforts to deal with the computational trade-off.
Given that my concerns are sufficiently addressed, I am inclined to increase my rating of the submission.
Reviewers confirmed the novelty, thorough experiments, and clear writing of this paper. After the rebuttal and discussion, most of the concerns raised by the reviewers were resolved by the authors, and all ratings are positive. I therefore recommend accepting this submission.