PaperHub
Score: 6.4/10
Poster · 4 reviewers
Ratings: 5, 3, 4, 4 (lowest 3, highest 5, std 0.7); average 4.0
Confidence
Novelty: 3.0
Quality: 2.8
Clarity: 2.5
Significance: 2.3
NeurIPS 2025

Mint: A Simple Test-Time Adaptation of Vision-Language Models against Common Corruptions

Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
vision-language models, test-time adaptation

Reviews and Discussion

Review (Rating: 5)

The paper introduces Mint, a test-time adaptation technique aimed at improving model robustness against image corruptions. Mint leverages pseudo-labels to maximize inter-class variance, addressing the authors’ observation that both intra- and inter-class variances of CLIP image embeddings tend to collapse under corruptions, which is closely linked to performance degradation. Experimental results demonstrate that Mint significantly improves robustness across various corruption benchmarks, even when limited test data is available.

Strengths and Weaknesses

Strengths:

  • The paper is clearly written and easy to follow.

  • The proposed method is based on insightful observations about variance collapse of intra- and inter-class embeddings, also supported by a theoretical rationale.

  • The usage of mean and gradient accumulators is effective in low batch size cases, where only a few test samples are available.

  • Mint demonstrates time efficiency compared to many TTA methods.

  • The paper includes comprehensive ablation studies and analysis that highlight the roles of different accumulators and the influence of hyperparameters.

Weaknesses:

  • The paper lacks comparative results showing how SOTA methods perform under varying batch sizes. While Mint is evaluated across different batch sizes, it is unclear how it compares to SOTA methods in these settings, i.e. it is uncertain whether Mint maintains its advantage in more constrained batch size scenarios.

Questions

Table 2 shows that increasing the batch size does not always lead to improved accuracy, which appears counterintuitive given the common assumption that more data should yield better estimates of unknown data statistics. What could be the underlying reasons for this behavior?

Limitations

Yes

Final Justification

I did not find any significant technical flaws. The exchange of comments and responses between reviewer 3aNj and the authors helped clarify several doubts. In particular, the responses of the authors made the setup of corruption robustness and TTA more understandable, and their responses regarding “confusion in theoretical analysis” indicate that they can revise the manuscript to improve its quality. Given the overall strength of the work, I intend to give a score of accept.

Formatting Issues

N/A

Author Response

Thank you for your detailed and encouraging review. We appreciate your recognition of the clarity of the writing, the theoretical support for our variance-based perspective, and the effectiveness of our design under low-batch test-time scenarios. Below, we respond to your comments and provide clarifications where needed.

W1. Comparative results under varying batch sizes

Thank you for the suggestion! To address this concern, we conducted additional experiments comparing Mint with the most competitive baselines (CLIPArTT and WATT-S) under varying batch sizes $B = 1, 2, 5, 10, 20, 50$. The results show that Mint consistently outperforms these baselines, particularly in more constrained settings (e.g., batch size = 1, 2, 5, 10), where it demonstrates greater stability and a more pronounced performance advantage. These findings further highlight Mint's effectiveness and robustness across different batch sizes.

Table A: Accuracy, ViT-B/32 on CIFAR-10-C (CLIP: 59.0)

Method      BS=1   BS=2   BS=5   BS=10  BS=20  BS=50
CLIPArTT    59.2   59.3   61.7   63.5   64.4   64.5
WATT-S      63.6   63.9   64.1   64.8   67.1   70.6
Mint        70.5   70.5   71.0   71.0   71.0   71.0

Table B: Accuracy, ViT-B/16 on CIFAR-100-C (CLIP: 35.8)

Method      BS=1   BS=2   BS=5   BS=10  BS=20  BS=50
CLIPArTT    39.2   39.3   39.4   40.0   40.7   41.4
WATT-S      35.9   36.4   36.9   38.8   41.9   42.0
Mint        43.1   43.1   43.3   44.1   44.5   44.5

Table C: Accuracy, ViT-L/14 on ImageNet-C (CLIP: 39.6)

Method      BS=1   BS=2   BS=5   BS=10  BS=20  BS=50
CLIPArTT    39.2   39.7   40.3   40.5   40.7   41.2
WATT-S      42.1   42.4   42.7   43.2   43.9   45.0
Mint        45.8   46.2   46.7   46.8   47.0   47.1

Q1. Batch size vs. accuracy

This is because Mint leverages both a mean accumulator and a gradient accumulator, which allow it to aggregate information from all previously seen test samples, not just the current batch. As a result, a smaller batch size does not necessarily imply "less data" for adaptation, and thus does not always lead to degraded performance. When the batch size becomes sufficiently large, Mint's performance tends to stabilize, and the small variations observed are primarily due to randomness rather than systematic effects.

As shown in Figure 6, when either the mean accumulator or the gradient accumulator is removed, the resulting ablated variants fail to fully utilize information from prior batches. In these cases, performance drops more significantly as batch size decreases, highlighting the importance of both components in maintaining robustness across batch sizes.
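To make the accumulator idea above concrete, here is a minimal sketch (our own illustration, not the authors' implementation; the class and method names are hypothetical) of how per-pseudo-class mean statistics can be carried across batches so that even a batch of size 1 still contributes to, and benefits from, the pooled statistics. A gradient accumulator can be maintained analogously by averaging parameter gradients over past batches.

```python
import numpy as np

class MeanAccumulator:
    """Running per-pseudo-class sums and counts of image embeddings.
    Hypothetical sketch: class means are estimated from ALL test samples
    seen so far, not only from the current (possibly tiny) batch."""

    def __init__(self, num_classes: int, dim: int):
        self.sum_z = np.zeros((num_classes, dim))  # running sum per pseudo-class
        self.count = np.zeros(num_classes)         # samples seen per pseudo-class

    def update(self, z: np.ndarray, pseudo_labels: np.ndarray) -> None:
        for c in np.unique(pseudo_labels):
            mask = pseudo_labels == c
            self.sum_z[c] += z[mask].sum(axis=0)
            self.count[c] += mask.sum()

    def class_means(self) -> np.ndarray:
        seen = self.count > 0
        means = np.zeros_like(self.sum_z)
        means[seen] = self.sum_z[seen] / self.count[seen, None]
        return means

# Usage sketch: after embedding each incoming test batch, update the accumulator
# and compute the variance-based loss from the accumulated class means rather
# than from the current batch alone.
acc = MeanAccumulator(num_classes=10, dim=512)
acc.update(np.random.randn(2, 512), np.array([3, 7]))  # a batch of size 2 still works
```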

Comment

Thank you for the clarification and additional experiment results, and I would keep my score.

Review (Rating: 3)

The work discusses variance collapse under corruptions, i.e., embeddings of images tend to become more similar, regardless of whether they belong to the same class or different classes. A strong correlation between inter-class variance and classification accuracy is further shown. The proposed method uses test-time adaptation to improve model accuracy under noise by:

  1. Maximizing inter-class variance among features using pseudo-labels.
  2. Using mean accumulator (average image embeddings for each pseudo-class) and gradient accumulator to improve TTA under limited batch.
  3. Only updating the LayerNorm parameters to improve image embedding.
    The authors postulate that models should show low intra-class variance (same-class embeddings should be closer) and high inter-class variance (different-class embeddings separated) for higher accuracy.

Strengths and Weaknesses

Strengths:

  1. The paper gets credit for mathematically justifying some known results for low resolution (and other similar kinds of noise).
    a. Effect of noise severity on feature variance.
    b. “maximizing PL-inter is equivalent to jointly maximizing PL-total variance and minimizing PL-intra variance”.
  2. Code is provided with the supplementary material; hopefully the authors will make the code public, as it is not going to be an easy implementation.
  3. Small batch size in TTA has a series of problems like “noisy and biased gradients” and few samples across classes, where the distance between a sample and its mean becomes zero. Thus, the work opts for a mean accumulator and a gradient accumulator to aggregate information across batches, which is the first of its kind.
  4. Cost-efficient as the model updates only the layer norm during test time.
  5. Ablation successfully establishes the effectiveness of each component.

Weakness

  1. The idea of variance collapse is not novel. Feature collapse under low resolution is a fairly well-researched topic across severities. The work simply extends the observation to other forms of noise. Reference works are missing, and the tone of the writing suggests novelty. Here are some works indicating a similar property:
    a) Feature collapse on foundation models: LR0.FM: Low-Res Benchmark and Improving robustness for Zero-Shot Classification in Foundation Models, ICLR’25
    b) Overlapping of positive and negative instances of class under low resolution: Dive into the Resolution Augmentations and Metrics in Low Resolution Face Recognition: A Plain yet Effective New Baseline, AAAI’23.
    c) Overlap of feature centers: Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation, CVPR’23.

  2. One of the main contributions of the paper is the theoretical justification for variance collapse with noise. But the theoretical proof is not at all an easy read, with tons of confusion in the derivation of Theorem 3.1 and Theorem 3.2.
    a. Confusion regarding LayerNorm (equation number missing), Line 165. Why are the authors using an L2 norm as LayerNorm? In traditional LayerNorm there is no L2 normalization:
    Z = ( (v – mean_dim) / (variance_dim) **0.5 ) * w + beta
    The authors have instead used two L2 normalization steps, which is not how PyTorch implements LayerNorm either (ref: https://docs.pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html). The original LayerNorm does not use L2 normalization either.
    b. A core assumption seems to be that the feature vector is disentangled as $v_i = v_i^{cls} + v_i^{irr} + v_i^{shift} + v_i^{noise}$ and the weight as $w = w^{cls} + w^{irr} + w^{shift} + w^{noise}$, where $v_i^j \odot w^k = 0$ if $k \ne j$. How the weights are broken into these components is not clear, nor why $v_i^j \odot w^k = 0$ if $k \ne j$, for example $v_i^{cls} \odot w^{irr} = 0$. If it is not addition but instead concatenation, it will raise more questions.
    c. Appendix: Line 576 (equation number missing) appears to be using the Cauchy-Schwarz inequality. I'm not sure why $\| u \odot w^{cls} \| = \| u \|$ (not sure why $w^{cls}$ is assumed to be taken from a similar distribution as that of $u$). Similarly for the other multiplication products as well.
    d. Line 580 (equation number missing). Under the infinite sum, does the average (mean) become an L2-normalized vector? Clarity for the equation is required; with missing steps and a lack of stated assumptions, it is not easy to understand how Theorem 3.1 and Theorem 3.2 are derived. I can only take the authors' word that these equations are derived correctly.
    e. B.4 Adaptation: What is Cov? How is the equation at Line 589 (equation number missing) derived?
    f. [Line 190] “maximize the PL-inter variance by updating the parameters of LayerNorm, the parameters associated with structured distribution shifts (i.e., $w^{shift}$) are necessarily suppressed”: how? Which equation, 3 or 4?
    g. [Line 194] Maximizing PL-inter variance will increase the weights associated with task-relevant features (i.e., $w^{cls}$)? How?
    h. The text states inferences without actually pointing to the equations used to derive them. This makes it very hard for me to follow.
    i. [line 176] “GT-inter variance has strong relevance to classification accuracy, as it reflects the proportion of task-relevant features within the overall feature representation”. How is accuracy reflected in Theorem 3.1, which primarily connects variance to noise severity?

  3. Ablation results requested for:
    a. What happens if you maximize the inter variance directly (instead of total - intra)?
    b. Other variations: updating the last block? The last MLP layer?

  4. Fundamental flaw in the logic: why test-time adaptation?
    Missing works include super-resolution (generate an HQ sample on the fly), LR-TK0 [1] (pretraining the model on synthetic diffusion-generated images with noise), and RobustSAM [2] (prompts help improve accuracy).
    If the idea is that the noise has never been seen before, why do the authors treat the model's training on each noise independently [line 265]? Separating the noises means the authors already know what the noise looks like. Not training on it is purely a design choice. Ideally, the same model should be evaluated across all noises, i.e., adapted using an amalgam of all noises, if the assumption is that the noise has not been seen before.
    [1] LR0.FM: Low-Res Benchmark and Improving robustness for Zero-Shot Classification in Foundation Models, ICLR’25 (feature collapse on foundation models)
    [2] RobustSAM: Segment Anything Robustly on Degraded Images, CVPR’24

  5. How does this lab-simulated noise translate to the real world? Can the authors show how the model would be affected when faced with real-world images, which often have a large variety of noise?

Minor complaint: Figure 3 does not clearly convey what it is trying to show. I don’t see how severity is playing any role here. Missing citation: Feature Augmentation based Test-Time Adaptation, WACV'25; it has a running mean (mean accumulator) for test-time adaptation.

Questions

Please refer to the above weaknesses. Empirically, the observation that features start to mix into one another as noise severity goes up is not novel and has been shown before. The main contribution of theoretically showing how severity affects features is novel, but the steps also need to be clearly explained, which I believe severely weakens the paper. Also, the case for using test-time adaptation is not clear, given that previous works which have seen these noises before may easily outperform such an approach.

Limitations

No limitation as such.

Final Justification

After careful consideration and seeing additional experiments, I have revised my evaluation to a weak reject. Below, I outline both the strengths and concerns that informed my decision:

Gain for the research community:

  • The paper offers a theoretical perspective on feature/variance collapse under noise. The derivation, while complex, attempts to formalize an important phenomenon.
  • The idea of using test-time adaptation (TTA) via layer norm updates to improve robustness is interesting.

Concerns regarding the work:

  • The core theoretical derivation is difficult to follow (with missing steps) and currently resides in the supplementary material. A clearer and more integrated presentation in the main paper would significantly enhance accessibility.
  • The key goal of the work is to improve the robustness of the model on one kind of noise (results in paper). A simple Google search reveals a missing comparison with prior work (e.g., super-resolution, LR-TK0, RobustSAM) that solves a similar problem. This weakens the empirical justification for the proposed approach. A deeper search may reveal more methods, but at a bare minimum super-resolution needs to be incorporated.
  • In my subjective opinion, the authors have put artificial constraints on the problem statement to spare their work against comparison with existing approaches, namely, 1) No training, 2) No information about the noise the model is improving robustness against.
  • However, a similar argument can be made against the work, as existing approaches improve the robustness in zero-shot conditions without any information about the domain of test set images. Additionally, "No information about the noise during inference" ensures the model will be exposed to only one type of noise, which is weirdly convenient (for robustness results in paper).
  • Thus, not training, or training a simple classifier to inform the model about the type of noise, are simply design choices in my subjective opinion, which would have put them in direct comparison with existing methods.
  • Most types of noise (e.g., Gaussian blur, impulse noise, motion blur, etc.) can be easily simulated and addressed with existing approaches. But since comparison with existing work is missing, we don't know how good this solution really is. Comparison with only TTA approaches is too narrow in my opinion.
  • What would have earned an acceptance from my side? No assumptions about the test set: one TTA Mint model evaluated on all noises (not 15 noises, 15 TTA Mints). Table 2 and Table 3 results would be ablations showing improvement on individual noises, but the main result would be one model exposed to an “amalgam” of all noises.

Formatting Issues

Equation numbers are missing.

Author Response

Thank you for your thoughtful and encouraging feedback. We appreciate your recognition of our theoretical insights and efforts to address the challenges of small-batch TTA. Below, we respond to your concerns point by point.

W1. Idea of variance collapse

Thank you for the helpful feedback and references. While we agree feature collapse under low-resolution inputs has been previously studied, our work explores a broader set of 15 corruption types spanning noise, blur, weather, and digital transformations (e.g., shot noise, motion blur, snow), many fundamentally different from low-resolution. Importantly, our key contribution is not only observing variance collapse but quantifying it, providing theoretical insights, and designing a corresponding adaptation loss. We appreciate your suggestions on relevant literature and will include these references to better contextualize our contributions.

W2. Confusion in theoretical analysis

Thank you very much for carefully reading through our theoretical analysis. Below, we address your concerns and clarify the derivations:

a. Layer norm

We used a simplified form of LayerNorm for analytical tractability. The first $\text{normalize}(\cdot)$ in our derivation comes from a simplified but general form of LayerNorm. Let the input of LayerNorm be $v_i = [v_{i1}, \cdots, v_{id}]^\top \in \mathbb{R}^d$ and the output be $u_i \in \mathbb{R}^d$. The standard LayerNorm can be written as

$$u_i = \frac{v_i - \bar{v}_i}{\sqrt{\frac{1}{d}\sum_{j=1}^d (v_{ij} - \bar{v}_i)^2}} \odot w + b,$$

where $\bar{v}_i = \frac{1}{d}\sum_{j=1}^d v_{ij}$. As mentioned in Line 563, for analytical simplicity, we omit the demeaning step and the bias term, which leads to the following form:

$$u_i = \frac{v_i}{\sqrt{\frac{1}{d}\sum_{j=1}^d v_{ij}^2}} \odot w.$$

This corresponds to the widely used RMSNorm [1], a common simplification of LayerNorm that retains the normalization effect. Factoring out the constant $\sqrt{1/d}$ and writing the result in terms of L2 normalization:

$$u_i = \sqrt{d} \cdot \text{normalize}(v_i) \odot w.$$

Finally, the second $\text{normalize}(\cdot)$ in our derivation is not part of LayerNorm or RMSNorm, but instead comes from the CLIP architecture (see Figure 3 in [2]), where the output image embeddings are L2-normalized before computing cosine similarity.

We will revise the text to make this simplification process clearer and avoid ambiguity.

[1] Root Mean Square Layer Normalization. NeurIPS 2019

[2] Learning Transferable Visual Models From Natural Language Supervision. ICLR 2021
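For concreteness, here is a small numerical check (our sketch, not the paper's code) of the simplification described above: dropping the demeaning step and the bias turns LayerNorm into an RMSNorm-style map, which is exactly $\sqrt{d} \cdot \text{normalize}(v) \odot w$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
v = rng.normal(size=d)   # input to the (simplified) LayerNorm
w = rng.normal(size=d)   # elementwise affine weight

# Standard LayerNorm (bias omitted): demean, divide by the std over the feature dim, scale by w
u_ln = (v - v.mean()) / np.sqrt(((v - v.mean()) ** 2).mean()) * w

# Simplified form used in the analysis: drop the demeaning step -> RMSNorm
u_rms = v / np.sqrt((v ** 2).mean()) * w

# Equivalent rewriting: sqrt(d) * L2-normalize(v), elementwise-multiplied by w
u_rewritten = np.sqrt(d) * (v / np.linalg.norm(v)) * w

print(np.allclose(u_rms, u_rewritten))  # True: the two simplified forms coincide
print(np.allclose(u_ln, u_rms))         # False in general: demeaning does change the output
```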

b. Feature disentanglement

To clarify, we use concatenation to construct the feature and weight vectors, and $\odot$ denotes the Hadamard (element-wise) product, not the dot product. $v^{cls}$ and $w^{irr}$ apply to different subspaces, so their Hadamard product is not well-defined.

This decomposition follows a common analytical framework used in prior works [3,4], and is mainly intended to help identify which part of the feature vector each component of $w$ operates on, making the theoretical results more interpretable. In a more general setting where $v_i$ undergoes a rotation, each dimension becomes a mixture of the original components, and the gradient on $w$ will be a rotated version of our current result in Theorem 3.2. While the essence of the conclusion remains, such formulations would significantly complicate the interpretation.

[3] A fine-grained analysis on distribution shift. ICLR 2022.

[4] Entropy is not Enough for Test-Time Adaptation: From the Perspective of Disentangled Factors. ICLR 2024.

c. Inequality

We would like to clarify that the derivation in Appendix Line 576 does not use the Cauchy-Schwarz inequality. All expressions are exact equalities. The identity holds because we assume that $w^{cls}$ is initialized as an all-one vector. As a result, the Hadamard product $u \odot w^{cls} = u$ holds for any $u$, and no distributional assumption is involved.

d. Convergence of means

We would like to clarify that the expression on Line 580 does not treat the mean vector as L2-normalized. The normalization factor $Z(w, s)$ arises from the final normalization step in CLIP's image encoder, and applies to each individual feature vector $z_i$; it is not related to the sample average.

Specifically, the convergence $\bar{z} \xrightarrow{p} \mathbb{E} z_i$ follows from the Weak Law of Large Numbers, indicating that the empirical average converges to the expected value. Each feature vector is defined as

$$z_i = \frac{1}{Z(w, s)} \cdot [v^{cls} \odot w^{cls};\ v^{irr} \odot w^{irr};\ \delta \odot w^{shift};\ v^{noise} \odot w^{noise}].$$

To compute the expectation, we note that $\mathbb{E} v^{cls} = 0$ due to class balance, and $\mathbb{E} v^{irr} = 0$, $\mathbb{E} v^{noise} = 0$ due to the Rademacher distribution. Thus, we obtain

$$\bar{z} \xrightarrow{p} \mathbb{E} z_i = \frac{1}{Z(w, s)} \cdot [0;\ 0;\ s \cdot \delta \odot w^{shift};\ 0].$$

e. B.4 Adaptation

“Cov” in Line 589 refers to the covariance operator. The first two lines follow directly from the definition of covariance, $\mathrm{Cov}(X, Y) = \mathbb{E}[XY] - \mathbb{E}[X]\,\mathbb{E}[Y]$. The third line follows from the Weak Law of Large Numbers.

f&g. Theorem 3.2

This refers to Equation 4, where the gradient of the PL-inter variance with respect to $w^{shift}$ is negative. Since we maximize $\mathcal{V}^{PL}_{inter}$ via gradient ascent, the update takes the form

$$w^{shift} \leftarrow w^{shift} + \eta \cdot \nabla_{w^{shift}} \mathcal{V}^{PL}_{inter},$$

where $\eta$ is the learning rate. Because the gradient is negative, this update reduces the magnitude of $w^{shift}$, effectively suppressing the structured shift component.

Similarly, we show that the gradient with respect to $w^{cls}$ is positive, meaning the same gradient ascent step will increase $w^{cls}$, thus amplifying task-relevant features.

h. Equation reference

Thank you for the feedback. We appreciate the suggestion and will revise the text to more clearly reference the equations associated with each inference to improve readability and traceability. If there are specific instances that are considered unclear, we would greatly appreciate it if you could point them out, so we can address them more precisely in the revision.

i. Accuracy

Theorem 3.1 shows that GT-inter variance increases with corruption severity, indicating a lower proportion of task-relevant features in the overall representation. This implies a reduced signal-to-noise ratio, which typically leads to lower accuracy. We will clarify this connection in the text.

W3. Ablation study

a. Maximize inter variance

If we directly maximize the PL-inter variance as defined by the LHS of Equation 5, we need to remove the mean accumulator (otherwise, the objective has zero gradients with respect to the current batch). We instead maximize the objective $\frac{1}{C_b} \sum_{c=1}^{C_b} \| m_c - m \|_2^2$, where we use the batch statistics $m_c = \frac{\sum_{i \in \mathbb{B}_b} \hat{y}_{ic} z_i}{\sum_{i \in \mathbb{B}_b} \hat{y}_{ic}}$ and $m = \frac{1}{|\mathbb{B}_b|} \sum_{i \in \mathbb{B}_b} z_i$ as approximations of $\tilde{z}_c$ and $\tilde{z}$. The results are shown below for different batch sizes.

Method                 BS=1   BS=5   BS=20  BS=100
Maximize inter         61.4   67.8   68.9   69.9
Mint (total - intra)   70.5   71.0   71.0   70.9

Notice that this objective is numerically equivalent to the "gradient accumulator only" variant in Figure 6, and thus yields the same results in practice.
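For reference, a minimal sketch (our illustration; the function name and shapes are ours, not the authors' code) of the batch-statistics objective used in this ablation, i.e., the mean squared distance between per-pseudo-class means and the batch mean:

```python
import torch

def batch_pl_inter_variance(z: torch.Tensor, pseudo_labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Batch-level PL-inter variance estimate: average of ||m_c - m||^2 over the
    pseudo-classes present in the batch, using only current-batch statistics."""
    m = z.mean(dim=0)                     # batch mean embedding
    terms = []
    for c in range(num_classes):
        mask = pseudo_labels == c
        if mask.any():
            m_c = z[mask].mean(dim=0)     # pseudo-class mean within the batch
            terms.append(((m_c - m) ** 2).sum())
    return torch.stack(terms).mean()

# Maximizing this quantity (e.g., minimizing its negative with a standard optimizer
# over the LayerNorm parameters) gives the "batch statistics only" variant discussed
# above; Mint additionally folds in the running accumulators.
```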

b. Other variations

We tried several variations, including updating the last block, the last MLP layer, and the patching layer of the image encoder, but found that updating all LayerNorm layers yields the best performance. This choice is also consistent with prior works such as CLIPArTT and WATT, which also adapt LayerNorm.

Method                 Accuracy
Last block             63.4
Last MLP               62.9
Patching layer         63.6
All LayerNorm (Mint)   71.0
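The last row corresponds to the common practice of adapting only the LayerNorm affine parameters. A sketch of how this parameter selection typically looks in PyTorch (illustrative only; the optimizer choice and learning rate are assumptions, not the authors' settings):

```python
import torch
import torch.nn as nn

def layernorm_parameters(model: nn.Module):
    """Collect only the affine parameters of LayerNorm modules, so the
    optimizer updates nothing else during test-time adaptation."""
    params = []
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            params.extend(p for p in module.parameters() if p.requires_grad)
    return params

# Usage sketch (hyperparameters are illustrative):
# optimizer = torch.optim.AdamW(layernorm_parameters(clip_image_encoder), lr=1e-3)
```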

W4. Why TTA

Thank you for the comment. We would like to clarify the motivation and setup of our experiments.

  • We follow the most standard TTA setting widely used in the literature [5,6,7,8]. Specifically, we do not train a model on each corruption. Instead, we use the same pretrained VLM, and perform TTA separately on each corruption type, using only unlabeled test samples. This per-corruption evaluation is a common and practical protocol designed to assess the average performance of TTA algorithms under different corruptions, in a controlled and interpretable manner.
  • This setup does not assume the model “knows” the corruption types. During training, the model is never exposed to these corruptions. The corruptions are only encountered at test time, during adaptation, which is consistent with the goal of evaluating corruption-agnostic TTA methods.
  • Regarding the statement “not training on it is purely a design choice”, this is not the case. Our focus is on the adaptation of pretrained VLMs, where we only have access to public checkpoints and no access to training data. This is a practical constraint, not an arbitrary design decision, and reflects real-world deployment conditions for foundation models.

[5] Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models. CVPR 2024

[6] Efficient test-time adaptation of vision-language models. CVPR 2024

[7] WATT: weight average test time adaptation of CLIP. NeurIPS 2024

[8] Clipartt: Adaptation of CLIP to new domains at test time. WACV 2025

W5. Translation to the real world

We conducted experiments on the Natural Distribution Shifts benchmark (please refer to Table B in our response to Reviewer Cqv3). Mint remains effective when faced with real-world images.

Figure 3

In Figure 3, severity is encoded by the color intensity of each point, with darker shades indicating higher corruption severity levels.

Missing citation

We will include the WACV’25 paper in our revision and clarify its connection to our method.

Comment

W1 Thank you for resolving my doubt. I appreciate the effort to explore feature collapse in the context of various types of noise. I apologize if my original query wasn't clear.
Feature collapse (previously observed under low resolution) does not rely on specific properties of pixelation / low resolution / downsampling. In that sense, the current work extends existing observations rather than introducing a fundamentally novel insight. While generalization is valuable, it is not the same as novelty. I would recommend adjusting the tone of the writing to reflect this more clearly, so as not to overstate the novelty.

W4 Thank you for the explanation. I appreciate the clarity with which the TTA setting is used. I understand you are using the TTA setting correctly. However, I’m not convinced how the proposed TTA fundamentally differs in its goals or assumptions from other existing approaches that also adapt pretrained models to noisy inputs without re-training on clean data.

For instance: super-resolution (SR) based methods correct the input without requiring any model retraining. RobustSAM introduces trainable prompts and uses labeled data with distillation (since you are not using labels, ignore this point, my apologies). LR-TK0 trains prompts using unlabeled synthetic data and distillation. Your method uses trainable LayerNorms, progressively adapted using samples of the specific noise.

All of these methods (including yours) assume knowledge of the corruption type and DO NOT retrain the original pretrained model. If the goal is simply to adapt to test-time noise, then the choice of whether to use prompts, SR, or layer norms seems more like a design preference than a fundamentally different objective.

I'm still going through the proof for W2 (my apologies for taking longer than anticipated). For W2, W3, and W5, give me some more time to see if I have any follow-ups.

Comment

Thank you for the thoughtful follow-up and constructive feedback. We greatly appreciate the time and effort you've put into reading our rebuttal and providing additional clarification. Below we address W1 and W4 with further explanation:

W1. Idea of variance collapse

Thank you for the clarification and for resolving the earlier misunderstanding. We fully agree that feature collapse has been observed in prior work, especially under low-resolution settings. Our goal was not to claim a new discovery, but to provide a theoretical explanation for this phenomenon and to design an algorithmic solution inspired by it.

We will revise the contribution statement in our paper to more accurately reflect this: rather than emphasizing the novelty of the observation itself, we will highlight the theoretical analysis and the resulting adaptation method, which are the core contributions of this work.

W4. Why TTA

Thank you again for your thoughtful comments. We believe there are two key differences in goals and applicability between TTA and other methods:

  1. Broader applicability: TTA methods use a fixed adaptation strategy that can be applied to a wide variety of corruptions and distribution shifts, without requiring highly specialized designs. For example, SR-based methods are mainly effective for pixelation-like corruptions, but may fail under other types such as impulse noise. Similarly, in LR-TK0, the synthetic data consists of paired LR-HR versions of the same image, which is specifically designed for low-resolution scenarios. In contrast, our method, like many TTA approaches, adapts to different types of distribution shifts using the same algorithm, making it more suitable for real-world settings where the specific type of corruption is unknown.
  2. Lower data and computation requirements: TTA typically requires no additional data and performs adaptation directly at test time, progressively adjusting the model as test samples arrive. On the other hand, methods like RobustSAM and LR-TK0 require extra labeled or synthetic data for distillation, which introduces higher data and computational demands.

That said, we agree that when the distribution shift is well-understood and can be precisely simulated (as in the case of low resolution), specialized designs can lead to more optimal solutions.

We appreciate your continued engagement and look forward to your feedback on the remaining points. :)

Comment

After careful consideration and seeing additional experiments, I have revised my evaluation to a weak reject, though I acknowledge that a borderline accept could also be a reasonable stance depending on interpretation. Below, I outline both the strengths and concerns that informed my decision:

Gain for the research community:

  • The paper offers a theoretical perspective on feature collapse under noise, which could be valuable to the community. The derivation, while complex, attempts to formalize an important phenomenon.
  • The idea of using test-time adaptation (TTA) via layer norm updates to improve robustness is interesting and could be seen as a contribution when considered in isolation.

Areas for Improvement

  • The core theoretical derivation is difficult to follow and currently resides in the supplementary material. A clearer and more integrated presentation in the main paper would significantly enhance accessibility.
  • A simple Google search reveals a missing comparison with prior work (e.g., super-resolution, LR-TK0). This weakens the empirical justification for the proposed approach. A deeper search may reveal more methods, but at a bare minimum super-resolution needs to be incorporated.
  • The rationale for not training despite the knowledge of noise remains a design choice and is not convincingly argued.
  • Claims around broader applicability are not in the paper. Many types of noise (e.g., Gaussian blur, impulse noise, motion blur etc.) can be simulated and addressed with Super Resolution.
  • Assertions about lower data and computational requirements are less compelling in the context of CLIP-like transformer models, which are inherently resource-intensive.
Comment

Thank you for your thoughtful follow-up and the revised evaluation. We genuinely appreciate the time you have taken to reconsider your review and provide detailed feedback. We would like to take this opportunity to clarify a few points where we believe there may have been some misunderstandings:

On comparisons with super-resolution and related works

We agree that super-resolution (SR) based approaches may be beneficial for specific types of corruptions such as pixelation and JPEG artifacts. We appreciate your suggestions and will include the relevant references in our revised draft to better situate our method in the existing literature. That said, SR methods are typically tailored to particular corruption types and often rely on prior knowledge of the corruption, which contrasts with our method’s more general and adaptive nature.

On the motivation for test-time adaptation without knowing the noise type

We would like to clarify a possible misunderstanding regarding our design choice. Test-time adaptation does not assume knowledge of the corruption type. The goal is to adapt a single pre-trained model to arbitrary shifts using the same adaptation procedure, without requiring explicit identification or simulation of the corruption. This is in stark contrast to corruption-specific methods that must be hand-crafted or trained for each anticipated noise pattern.

On broader applicability vs. per-corruption solutions:

While it may be possible to design specific models or pipelines for each corruption type, this approach is not consistent with the original intent of standard corruption benchmarks. For example, the ImageNet-C benchmark explicitly states in Appendix B that:

Directly fitting the types of IMAGENET-C corruptions should be avoided, as it would cause researchers to overestimate a model’s robustness.

In real-world deployment scenarios, one typically does not know in advance which type of corruption the model will encounter. TTA methods like ours address this challenge directly by using a single model and adaptation rule that generalizes across corruption types.

On computational efficiency

We believe there may be a conflation between the model's inherent computational cost (e.g., using a CLIP-like backbone) and the computational cost of the adaptation strategy. Our claim is that our method avoids costly data collection, data generation, and model retraining procedures used in other approaches like RobustSAM and LR-TK0. At the same time, in fact, both of these methods also use transformer-based or foundation models—including CLIP itself in the case of LR-TK0—so we respectfully disagree that our approach is inherently less efficient in comparison.

Comment

Apologies if I'm not getting the basics right.

[Line 265] in the paper says "We consider a standard TTA setting, where the model is adapted to each type of corruption independently"

Which looks to me like the pretrained model is adapted via LayerNorm separately for each corruption. In my original review, I asked this point blank: "Ideally, the same model should be evaluated across all noises, i.e., adapted using an amalgam of all noises, if the assumption is that the noise has not been seen before." To which the authors replied "perform TTA separately on each corruption type". Which means TTA adapts to each noise separately, so why can't a super-resolution model fix this noise separately as well?

Comment

Thank you for your follow-up and thoughtful question.

We believe the key distinction lies in the assumption of corruption awareness. While our evaluation is conducted separately on each corruption type (as is standard practice in prior TTA work), this does not imply that the TTA method has access to the identity of the corruption before deployment. In contrast, most super-resolution or corruption-specific restoration models are corruption-aware, i.e., they require knowledge of what type of degradation (e.g., blur, noise) occurred, in order to apply the appropriate model (distill with a specific type of noise) or processing pipeline (apply a specific noise-canceling preprocessing).

To clarify with a practical scenario: consider deploying a vision system on mobile phones. Some users may have blurred images due to dirty lenses, some may face Gaussian noise due to older sensors, and others may experience rain-related distortions due to weather. These are all different types of distribution shifts. Crucially, we do not know a priori which type each user will face. However, the shifts within one user’s stream exhibit some consistency over time. In this case, a super-resolution approach (or other restoration-based method) would struggle unless we have an oracle to select the correct model from a collection—something not realistic in practice.

In contrast, a single TTA model (such as Mint, or other general TTA methods) can be deployed universally. Each user can download the {model, TTA code} as a whole, then independently adapt the model to their own data distribution on the fly, without any need to identify or classify the corruption type.

As for your suggestion of "training using an amalgam of all noises": we still believe this may lead to overfitting to the benchmark set of corruptions, which is why it is generally avoided in TTA. That said, we include an experiment along these lines using ImageNet-R(endition), a dataset containing a diverse mixture of real-world distribution shifts (art, cartoons, sketches, etc.). As we reported in our response to Reviewer Cqv3, Mint still outperforms strong TTA baselines in this hybrid scenario, suggesting that our method generalizes well even when exposed to diverse and unknown shifts.

We hope this clarifies the motivation and positioning of our approach. Thank you again for the valuable discussion.

Comment

For your reference, here is a short description from their github repo:

ImageNet-R(endition) contains art, cartoons, deviantart, graffiti, embroidery, graphics, origami, paintings, patterns, plastic objects, plush objects, sculptures, sketches, tattoos, toys, and video game renditions of ImageNet classes.

ImageNet-R has renditions of 200 ImageNet classes resulting in 30,000 images.

And here is the table including the ImageNet-R dataset.

Table B: (Updated) Performance on Natural Distribution Shifts Benchmark (ViT-B/16)

Method      ImageNet-A   ImageNet-V2   ImageNet-R   ImageNet-Sketch
CLIP        49.2         60.4          72.7         44.9
CLIPArTT    49.6         60.5          72.8         45.0
WATT-S      51.7         61.2          75.7         47.0
TDA         51.0         61.2          73.9         46.4
DMN         49.7         60.5          73.0         45.4
Tent        51.9         61.0          77.0         45.4
ETA         52.0         61.0          77.4         46.8
Mint        54.7         62.6          78.1         48.4

Comment

Hello Authors,

I'm requesting you to please give a simple and direct yes or no answer. Else, we are just going round and round on the same doubt.

The results reported in the paper were from 1 single TTA being evaluated on all forms of noise? Yes or No?

The results reported in the paper had a different TTA being evaluated on each different noise? Yes or No?

Regarding the reply "While our evaluation is conducted separately on each corruption type (as is standard practice in prior TTA work), this does not imply that the TTA method has access to the identity of the corruption before deployment": I'm still emphasizing that I'm not questioning the methodology of TTA; I'm questioning the use of TTA for robustness.

I'm not so focused on ImageNet-A, ImageNet-V2, etc., because they are not the same as noise. For example, a high-resolution (> 128 x 128) sketch will not cause feature collapse on these datasets, and hence the entire story or theoretical foundation of the LayerNorm update would be invalidated. Since I didn't raise this doubt, I'm not considering this for my final rating. The reason you are seeing the jump in accuracy is that the CLIP zero-shot model is being fine-tuned in an unsupervised manner. But its explanation/performance will not be a part of my final judgement.

Comment

Thank you for your follow-up. Below are the direct answers you requested:

Q: The results reported in the paper were from 1 single TTA being evaluated on all forms of noise? Yes or No?

A: If by “TTA” you refer to the algorithm and its hyperparameters: Yes. If you refer to the adapted model parameters after TTA, then No—each corruption type results in a separately adapted model, since TTA is performed independently on each stream.

Q: The results reported in the paper had different TTA being evaluated on all different noises? Yes or No?

A: If by “TTA” you mean the algorithm: No. We use the same algorithm and hyperparameters across all corruptions. If you mean the adapted model weights, then Yes, because adaptation is performed independently per corruption type.


For example, the high-resolution (>128×128) sketch will not cause feature collapse on these datasets, and hence the entire story or theoretical foundation of layer norm update will be invalidated.

Respectfully, we believe there is a misunderstanding here. Nowhere in our paper do we state or assume that feature collapse only happens at low resolution. The paper you referenced in your earlier comment discusses feature collapse in low-resolution settings, but our analysis shows that variance collapse can also occur in high-resolution corrupted images, such as those from corruptions in ImageNet-C. Does adding frost to an ImageNet image decrease its resolution? We don't think so. The phenomenon is not inherently resolution-dependent, but corruption-dependent, and more broadly, distribution-shift-dependent.


Finally, we would like to clarify one point for better mutual understanding:

You mentioned that you're “not questioning the methodology of TTA,” but rather “the use of TTA for robustness.” It also seems that you do not accept the idea of unsupervised fine-tuning of the zero-shot CLIP model. Could we confirm whether your view is that test-time adaptation methods in general should not be considered valid robustness techniques, and that models should instead be evaluated in a domain generalization setting with frozen parameters?

We ask this in good faith, as it would help us better understand your criteria and expectations for robustness evaluation.

Thank you again for your time and engagement.

Comment

I see the confusion here, finally. I'm not trying to be hostile/rude here; my sincere apologies if my message seemed that way.

  • Take the example of Table 1. There are 15 columns corresponding to 15 noises. That means TTA was adapted 15 times.

  • The semantics here are: TTA wasn't informed what noise it was going to be adapted to. However, once TTA is adapted to a certain noise, such as Gaussian blur, it won't work for motion blur.
    Pro: The model can blindly be adapted to any one particular type of noise.
    Con: The model is sort of domain-aware.

  • Super-resolution / LR-TK0:
    Pro: The model can be fixed without being shown the domain of the dataset. True zero-shot.
    Con: The model needs to be told what noise it's going to be shown.

  • Both methods are solving the same task, but the semantics are different, i.e., each caters to one particular type of noise. Model parameters tuned for one type of noise won't work on another type of noise, which puts them on the same level in terms of the task they are trying to achieve.

  • Here, our perspective is fundamentally different, as I would keep them in the same category of problems. I leave it to the area chairs to decide after this.



  • The 128x128 was just an example. Unless we are talking about noise (motion blur, rain, etc.), all of these constitute a domain shift rather than a distribution shift.
  • A cat in ImageNet vs a sketch of a cat is a domain shift. Clear cat in ImageNet vs Cat in snow is a distribution shift. Feature collapse / Variance collapse occurs mostly on Distribution shift (cat in snow/rain etc.), not so much in the sketch of the cat or the video graphic rendition of the cat.


  • Regarding test-time adaptation methods in general: the way I'm seeing it, if I'm already aware that the test set is sorted or conveniently organized, then yes, I would strongly favor a fine-tuning method over TTA.
  • Example: the BDD-100K dataset. Its test/train images are well sorted into different weather conditions: sandy weather, rain, fog, etc. Mint seems to be a method to evaluate on BDD per weather class, just without being told which weather class it's adapting to. TTA would have proved really useful here if Mint had been evaluated on the whole BDD test set as an amalgam (mixture) of ALL noises. That's purely a design choice for me. I would train a classifier on the training set and fine-tune my model / use SR for the noise-specific train set, then evaluate per noise category on BDD-100K. For me, both are solving the same task. Not using the train set is a design choice. Using TTA is a design choice. The only way it would be a different problem statement is if the authors let TTA adapt to the entire test set with all forms of noise. Then it would be fair to say SR and LR-TK0 are not in the same category of problem, and the comparison is not meaningful. I guess our perspectives on this are different, and I guess that's ok.
Comment

Thank you for the thoughtful clarification. We now better understand your perspective and appreciate the clear articulation.


Comparison of TTA and SR/Lr-TK0

We largely agree with your summary comparing TTA and SR/Lr-TK0: while the semantics differ, they aim to address similar robustness challenges. Though we may interpret certain details differently, we recognize your framing as a valid viewpoint.


On variance collapse

Actually we have also observed a noticeable decrease in feature variance on datasets like ImageNet-Sketch compared to ImageNet.

Table A: Variances on Natural Distribution Shifts Benchmark (ViT-B/16)

Variance type   ImageNet   ImageNet-Sketch
GT-total        0.479      0.387
GT-intra        0.257      0.224
GT-inter        0.222      0.163

This aligns with our variance collapse discussion: all types of variances—especially GT-inter—tend to decrease as the distribution shift increases. At the same time, we agree that fully characterizing when and where variance collapse occurs is beyond the scope of this paper; our focus is to use this observation to motivate the design of our method.
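For clarity, this is how we read the three quantities in the table; the sketch below (ours, not the authors' exact code) computes them from embeddings and ground-truth labels, and reflects the standard decomposition total = intra + inter used throughout the discussion.

```python
import torch

def gt_variances(z: torch.Tensor, labels: torch.Tensor):
    """Total, within-class (intra) and between-class (inter) variance of the
    embeddings z, so that v_total == v_intra + v_inter (law of total variance)."""
    n = z.shape[0]
    z_bar = z.mean(dim=0)
    v_total = ((z - z_bar) ** 2).sum(dim=1).mean()

    v_intra = torch.zeros(())
    v_inter = torch.zeros(())
    for c in labels.unique():
        z_c = z[labels == c]
        z_c_bar = z_c.mean(dim=0)
        v_intra = v_intra + ((z_c - z_c_bar) ** 2).sum() / n
        v_inter = v_inter + (z_c.shape[0] / n) * ((z_c_bar - z_bar) ** 2).sum()
    return v_total, v_intra, v_inter
```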


On Mixture of Corruption Evaluation

Your suggestion about evaluating on a mixture of all corruptions is well taken. Following our earlier response, we have now conducted this exact experiment: we evaluate TTA methods on a test set formed by mixing all 15 corruption types, with a random corruption type applied to each test image. Therefore, there is only one model, adapted on the fly on an unsorted mixture of corrupted images. The results are shown below.

Table C: Accuracy on Mixture of 15 Types of Corruptions

Method      CIFAR-10-C   CIFAR-100-C   ImageNet-C
CLIP        59.0         35.8          39.6
TDA         62.1         38.3          42.3
DMN         60.2         36.0          39.9
WATT-S      63.6         39.0          43.9
CLIPArTT    56.9         38.7          40.5
Mint        65.9         39.8          45.2

The results show that:

  • Many TTA methods still outperform the non-adaptive CLIP baseline.
  • Mint continues to show strong performance, even under this harder setting with mixed corruption types.

We hope these results help clarify that Mint (as well as TTA methods in general) is effective even in the mixture setting, where adaptation is more challenging and more realistic.

Review (Rating: 4)

This paper proposes a novel test-time adaptation (TTA) method for vision-language models (VLMs) like CLIP, targeting image corruption. The authors found a phenomenon in which the variance of image embeddings decreases and the embeddings become less distinguishable under image corruption, which they call variance collapse. The mechanism of variance collapse and the fact that variance collapse is closely related to model performance are also theoretically explained. Based on this finding, a novel TTA method called Mint, which maximizes the inter-class variance estimated with pseudo-labels, is proposed. Experimental results show that Mint outperforms prior TTA methods for VLMs under image corruption.

Strengths and Weaknesses

Strengths

  • The paper is well-structured, and the motivation is clear.
  • The finding of variance collapse is interesting.
  • The design choice of Mint based on variance collapse is reasonable.

Weaknesses

  • W1: It is questionable whether the assumption for the theoretical analysis, that the image features can be disentangled into the four components, holds. Under image corruption, such as noise, information is lost and cannot be recovered. Although this assumption is motivated by Wiles et al. [33], they mainly consider attributes rather than corruption.
  • W2: It would be interesting to validate that variance collapse is specific to image corruption experimentally. For example, comparing vanilla ImageNet and ImageNet-A/R/S would support the claim.
  • W3: Maximizing PL-total variance while minimizing PL-intra may be related to neural collapse [a,b]. Discussing neural collapse would improve the coverage of related works.

[a] Kothapalli, Neural Collapse: A Review on Modelling Principles and Generalization, TMLR, 2023.  
[b] Papyan et al., Prevalence of neural collapse during the terminal phase of deep learning training, PNAS, 2020.

Questions

  • Q1: Since $\mathcal{V}^\text{PL}_\text{inter} = \mathcal{V}^\text{PL}_\text{total} - \mathcal{V}^\text{PL}_\text{intra}$, $\mathcal{V}^\text{PL}_\text{intra}$ should have a negative correlation with accuracy. However, L136 and Fig. 3 state that $\mathcal{V}^\text{PL}_\text{intra}$ increases with severe corruption and has a positive correlation with performance. Is the sign flipped?
  • Q2: In Tabs. 1 and 2, how were the model and dataset combinations, i.e., (ViT-B/32, CIFAR-10-C), (ViT-B/16, CIFAR-100-C), and (ViT-L/14, ImageNet-C), selected?

Limitations

Yes.

Final Justification

Considering the other reviews and the authors' rebuttal, my concerns have been addressed. I keep my score. Although it is questionable whether the theoretical assumption holds, the finding on the variances is intriguing.

Formatting Issues

N/A

Author Response

Thank you very much for your positive and encouraging feedback. We're pleased that you found the paper well-structured, the motivation clear, and the proposed phenomenon and method compelling. We truly appreciate your support and are happy to address any remaining points or clarifications below.

W1. Assumption of disentangled components

Thank you for raising this point. We would like to clarify that the disentanglement assumption is primarily a conceptual tool to make the theoretical analysis more interpretable and analytically tractable.

In fact, our analysis can be extended to a more general setting where each feature vector undergoes a rotation, such that each dimension becomes a mixture of the original components. While this generalization does not change the conclusion of Theorem 3.1, the gradient derivation in Theorem 3.2 would involve additional rotation terms, making the result harder to interpret while offering limited new insight.

That said, we acknowledge that in extreme scenarios, such as when corruption completely entangles class-relevant signals (e.g., classifying different noise patterns themselves), the disentanglement assumption may indeed fail. In these cases, adaptation methods relying on discriminative features become fundamentally ineffective, as meaningful class-specific signals no longer exist or cannot be distinguished from corruption.

W2. ImageNets benchmark

Thank you for the insightful suggestion. Following your recommendation, we computed the GT-total, GT-intra, and GT-inter variances on ImageNet, ImageNet-A, ImageNet-R, and ImageNet-Sketch using the ViT-B/16 backbone. While this comparison may not be as tightly controlled as our corruption-based experiments, due to differences in image content and number of classes (e.g., ImageNet and ImageNet-Sketch have 1,000 classes, while ImageNet-A and ImageNet-R contain only 200), it still provides valuable insights.

Table A: Variances on Natural Distribution Shifts Benchmark (ViT-B/16)

Variance type   ImageNet   ImageNet-A   ImageNet-R   ImageNet-Sketch
GT-total        0.479      0.496        0.445        0.387
GT-intra        0.257      0.346        0.330        0.224
GT-inter        0.222      0.150        0.115        0.163

On ImageNet-Sketch, we observed a clear variance collapse pattern consistent with our corruption benchmarks: both GT-inter and GT-intra variances decrease. However, on ImageNet-A and ImageNet-R, we observed that GT-inter variance still decreases, but GT-intra variance increases instead.

That said, the consistent drop in GT-inter variance across all three datasets supports the potential applicability of Mint to a broader range of distribution shifts, as it directly targets inter-class variance. Therefore, we evaluated Mint on these datasets, and the results are as follows:

Table B: Performance on Natural Distribution Shifts Benchmark (ViT-B/16)

Method      ImageNet-A   ImageNet-R   ImageNet-Sketch
CLIP        49.2         72.7         44.9
CLIPArTT    49.6         72.8         45.0
WATT-S      51.7         75.7         47.0
Mint        54.7         78.1         48.4

Notably, Mint also outperforms our strongest baselines (CLIPArTT, WATT-S) from the main paper on all three datasets, further supporting its effectiveness beyond corruption benchmarks.

W3. Neural collapse

Thank you for the insightful suggestion. We agree that the idea of maximizing PL-total variance while minimizing PL-intra shares a conceptual connection with neural collapse phenomena, which also involve the alignment of features and separation of class means. We will include a discussion of neural collapse in the revised version to better contextualize our work within this broader literature.

Q1. Variances

Thank you for the question. In Line 136 and Figure 3, we are referring to the GT variances (e.g., $\mathcal{V}^{GT}_{inter}$, $\mathcal{V}^{GT}_{intra}$), not the PL variances. As shown in Tables 4–6 in Appendix C.1, $\mathcal{V}^{GT}_{inter}$, $\mathcal{V}^{GT}_{intra}$, and $\mathcal{V}^{GT}_{total}$ all positively correlate with accuracy, i.e., they decrease as corruption severity increases, and so does performance.

This is due to the phenomenon we described in Line 173: “As a result, the image encoder tends to embed corruption-related patterns into the representation itself.” Consequently, embeddings from different classes become more similar, reducing both inter- and intra-class variance, and degrading performance.

Q2. Model and dataset combination

Thank you for the question. The chosen model-dataset combinations reflect practical usage scenarios, where more efficient, lightweight models (e.g., ViT-B/32) are often used for simpler tasks like CIFAR-10-C, while stronger models (e.g., ViT-L/14) are typically employed for more complex tasks such as ImageNet-C. This setup ensures that the evaluation remains realistic and meaningful across a range of difficulty levels.

Comment

I appreciate the authors' further clarification and experiment. I would like to keep my score.

Review (Rating: 4)

This paper investigates the degradation of CLIP's performance under input corruptions and introduces a novel phenomenon, variance collapse. The authors analyze variance collapse and provide a theoretical explanation: corrupted images weaken class-discriminative features, leading to a performance drop. To address this, the authors propose Mint, a test-time adaptation method that maximizes inter-class variance estimated from pseudo-labels. Mint consists of a mean accumulator to estimate PL-inter variance and a gradient accumulator to stabilize updates. Extensive results show that Mint improves the classification performance of CLIP across three commonly used corruption benchmarks.

Strengths and Weaknesses

Strengths:

  1. The authors provide a solid theoretical framework connecting the variance collapse to performance degradation, enhancing the interpretability of the proposed method.
  2. Experimental results show that Mint outperforms existing TTA methods on several corrupted benchmarks.
  3. The submission is well written and easy to follow.

Weaknesses:

  1. The decomposition in Eq.5 may be redundant. If certain classes have no samples, they contribute nothing to the computation of the inter-class variance. The average embedding of these classes can effectively be treated as the global center. In this way, the calculation of the left term of Eq.5 can be directly performed while retaining the update of Eq.6. If there is a meaningful numerical difference between the two formulations, the authors should provide quantitative evidence to justify the added complexity.
  2. The experiments are limited to corrupted benchmarks and do not report classification accuracy on clean images. Moreover, the paper omits evaluation on other standard distribution shift benchmarks commonly used for VLMs (e.g., ImageNet-A, -V, -R, -K).
  3. Is Mint sensitive to optimization space? The authors need to provide the results of optimizing the prompt or the text encoder and explain whether the image encoder must be selected for optimization.
  4. In Table 3, the authors only select some training-based methods for time comparison, but explain that they select the top-performing algorithm. In fact, training-free methods such as TDA are better than TPT in experimental results, but they were not selected. The authors should provide the time of high-performance training-free methods for reference.

Minor Weakness:

  1. The statement ‘where $K_{prior}$ is the strength of the prior’ in Line 249 would make one think that $K_{prior}$ is an adaptive variable rather than a hyperparameter.
  2. There is a conflict between the statement in Lines 49-53 about ‘simultaneous reduction of GT-intra and GT-inter variances ... dilutes class-discriminative information.’ and the method that continues to minimize PL-intra variances.

问题

Please refer to the above weaknesses and provide corresponding explanations.

局限性

Yes, the authors provide the limitations and potential negative societal impact in the supplementary material.

最终评判理由

The paper presents a clearly written TTA method that consistently outperforms strong baselines. The rebuttal satisfactorily addresses all concerns, adding results on clean and natural distribution shift benchmarks, clarifying the decomposition of Eq. 5, explaining optimization choices, and providing runtime comparisons. Given its strong empirical results and authors' detailed responses, the reviewer decided to raise the score to borderline accept.

格式问题

No major formatting issues.

作者回复

Thank you for taking the time to review our work and for your thoughtful feedback. We appreciate your recognition of our theoretical framework and experimental results. We also acknowledge your concerns and value your suggestions. Below, we respond to each point in detail and clarify potential misunderstandings.

W1. Decomposition in Eq (5)

Thank you for the question. You are right that even if certain classes have no samples, we can still evaluate the left-hand side of Eq (5) with the mean accumulator defined in Eq (6). However, the key issue with this formulation is that the loss becomes disconnected from the current batch, making it non-differentiable with respect to the current features $\{z_i\}$. This prevents backpropagation and gradient-based optimization. Our reformulation (right-hand side of Eq. 5) addresses this by preserving differentiability via the use of batch-level embeddings $\{z_i\}$, while still leveraging $\tilde{z}_c$ and $\tilde{z}$ as running estimates. This enables optimization using stochastic gradients and facilitates information sharing across batches.
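
For concreteness, the sketch below illustrates this reformulation in PyTorch (hypothetical class and variable names, not our released implementation): the mean accumulator is updated without gradients, while the loss remains differentiable with respect to the current-batch features.

```python
import torch

class MeanAccumulator:
    """Running per-class and global feature means, updated without gradients."""
    def __init__(self, num_classes: int, dim: int, device="cpu"):
        self.class_sum = torch.zeros(num_classes, dim, device=device)
        self.class_cnt = torch.zeros(num_classes, device=device)

    @torch.no_grad()
    def update(self, feats: torch.Tensor, pseudo_labels: torch.Tensor):
        self.class_sum.index_add_(0, pseudo_labels, feats.detach())
        self.class_cnt.index_add_(
            0, pseudo_labels, torch.ones_like(pseudo_labels, dtype=feats.dtype))

    def means(self):
        cnt = self.class_cnt.clamp(min=1).unsqueeze(1)
        class_mean = self.class_sum / cnt                              # class centers
        global_mean = self.class_sum.sum(0) / self.class_cnt.sum().clamp(min=1)
        return class_mean, global_mean

def pl_inter_variance(feats, pseudo_labels, accumulator):
    # Differentiable only with respect to the current batch features.
    accumulator.update(feats, pseudo_labels)
    class_mean, global_mean = accumulator.means()                      # constants (no grad)
    v_total = ((feats - global_mean) ** 2).sum(dim=1)
    v_intra = ((feats - class_mean[pseudo_labels]) ** 2).sum(dim=1)
    return (v_total - v_intra).mean()   # maximize (e.g., loss = -value)
```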

W2. Clean images and ImageNet benchmarks

Thank you for the helpful suggestion. We have added results on clean datasets corresponding to the three corruption datasets (CIFAR-10, CIFAR-100, and ImageNet), as well as four widely used natural distribution shift datasets: ImageNet-A, ImageNet-V2, ImageNet-R, and ImageNet-Sketch. The tables below report accuracy under the same no-augmentation setting as used in our main experiments. We compare against the strongest baselines from the paper, including CLIPArTT and WATT.

Table A: Performance on Clean Datasets

| Method | ViT-B/32 on CIFAR-10 | ViT-B/16 on CIFAR-100 | ViT-L/14 on ImageNet |
|---|---|---|---|
| CLIP | 88.3 | 68.4 | 73.0 |
| CLIPArTT | 89.1 | 70.2 | 72.1 |
| WATT-S | 89.8 | 72.3 | 74.5 |
| Mint | 91.6 | 74.1 | 75.6 |

Table B: Performance on Natural Distribution Shifts Benchmark (ViT-B/16)

| Method | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-Sketch |
|---|---|---|---|---|
| CLIP | 49.2 | 60.4 | 72.7 | 44.9 |
| CLIPArTT | 49.6 | 60.5 | 72.8 | 45.0 |
| WATT-S | 51.7 | 61.2 | 75.7 | 47.0 |
| Mint | 54.7 | 62.6 | 78.1 | 48.4 |

W3. Sensitivity to optimization space

Thank you for the question. In our method, optimizing the image encoder is essential, as our loss is explicitly designed to address the variance collapse in image embeddings. The PL-inter variance objective is defined on image features and is not differentiable with respect to the text encoder or prompt, since we treat the model’s predictions as hard pseudo-labels. As a result, updating only the text prompt or text encoder yields zero gradients and cannot adapt the model.

W4. Time comparison

Thank you for the suggestion. By “top-performing algorithms,” we refer to CLIPArTT and WATT-S, which achieve 4.9% and 6.1% accuracy gain, respectively. Following your advice, we have added all previously omitted training-free baselines (Ensemble, Zero, VTE, DMN, TDA) to the runtime comparison in Table 3, ordered by accuracy. This provides a more comprehensive reference for efficiency comparison.

Table 3: (Updated) Comparison of computation time for one corruption on CIFAR-100-C.

| Method | Testing Time | Accuracy (%) | Gain (%) |
|---|---|---|---|
| CLIP | 21s | 35.8 | - |
| TPT | 23m21s | 36.0 | +0.2 |
| VTE | 9m45s | 36.6 | +0.8 |
| Zero | 9m50s | 36.8 | +1.0 |
| Ensemble | 21s | 37.1 | +1.3 |
| TDA | 33s | 38.4 | +2.6 |
| DMN-ZS | 30s | 38.5 | +2.7 |
| TPS | 9m58s | 38.6 | +2.8 |
| CLIPArTT | 7m40s | 40.7 | +4.9 |
| WATT-S | 50m20s | 41.9 | +6.1 |
| Mint | 1m07s | 44.1 | +8.3 |

Minor 1. Hyperparameter $K_{prior}$

Thank you for pointing this out. We agree that the current phrasing may be misleading. We will revise the sentence to avoid confusion.

Minor 2. Statement conflict

Thank you for the question. We did observe a simultaneous reduction in both GT-inter and GT-intra variances. Of the two, GT-inter variance, which reflects the separation between class centers, should naturally be kept large, whereas a decrease in GT-intra variance, which reflects within-class variation, is not necessarily harmful.

In practice, we found that GT-inter variance correlates more strongly with accuracy than GT-intra and GT-total (see Tables 4,5,6 in Appendix C.1). Therefore, in our method, we choose to maximize PL-inter variance (computed as PL-total minus PL-intra), without explicitly penalizing PL-intra further in the loss.

评论

I have read the authors' rebuttals.

W1. Regarding the statement that the left side of Eq.5 is not differentiable, I believe this is incorrect; it is rather a code-implementation issue. If a new variable is used to compute the updated running average, with the previous value treated as a constant, the same effect as the right side can be achieved. Therefore, the left side of the equation is also differentiable.

W2. Since this is an online method, I think more online TTA results should be provided, including TDA (mentioned in the submission) and some methods not mentioned in the submission. By contrast, comparisons with methods such as Zero and TPS, which do not use historical sample information, will not lead to any meaningful conclusions.

W3, W4: Thanks for your reply. I have no further questions about this part of the rebuttal.

评论

Thank you for your acknowledgement of our responses to W3 and W4. Below we provide further clarification.

W1. Decomposition in Eq (5)

We agree with you that the objective formulation you propose is differentiable. In fact, we also experimented with it during the early stages of our work. However, we observed a gradient scaling issue with this formulation: as adaptation progresses, the contribution of each individual sample to the running mean becomes smaller, leading to systematically smaller gradients in later stages. This makes subsequent updates less effective and weakens the overall adaptation process.

To illustrate, consider the case where the batch size is 1. Let the only test sample in the batch have feature $z_i$, with pseudo-label $c$. Define:

  • $\tilde{z}^{pre}$ as the global average feature before the update;
  • $\tilde{z}^{pre}_c$ as the class-$c$ average feature before the update;
  • $\tilde{z}^{post} = \frac{K}{K+1} \cdot \tilde{z}^{pre} + \frac{1}{K+1} \cdot z_i$ as the global average after the update;
  • $\tilde{z}^{post}_c = \frac{K_c}{K_c+1} \cdot \tilde{z}^{pre}_c + \frac{1}{K_c+1} \cdot z_i$ as the class-$c$ average after the update, where $K$ is the total number of seen samples and $K_c$ is the number of samples with pseudo-label $c$.

Under the formulation you proposed, the objective becomes:

$$\mathcal{V}^{PL}_{inter} = \left\| \tilde{z}^{post}_c - \tilde{z}^{post} \right\|_2^2 = \left\| \frac{K_c}{K_c+1} \cdot \tilde{z}^{pre}_c + \frac{1}{K_c+1} \cdot z_i - \frac{K}{K+1} \cdot \tilde{z}^{pre} - \frac{1}{K+1} \cdot z_i \right\|_2^2$$

Treating $\tilde{z}^{pre}$ and $\tilde{z}^{pre}_c$ as constants, the gradient with respect to $z_i$ becomes:

$$\nabla_{z_i} \mathcal{V}^{PL}_{inter} = 2\left(\frac{1}{K_c+1} - \frac{1}{K+1}\right) \cdot \left( \frac{K_c}{K_c+1} \cdot \tilde{z}^{pre} + \frac{1}{K_c+1} \cdot z_i - \frac{K}{K+1} \cdot \tilde{z}^{pre} - \frac{1}{K+1} \cdot z_i \right) = 2\left(\frac{1}{K_c+1} - \frac{1}{K+1}\right) \cdot \left( \tilde{z}^{post}_c - \tilde{z}^{post} \right)$$

As $K$ and $K_c$ increase over time, this gradient term shrinks toward zero. Consequently, new test samples make increasingly negligible contributions to adaptation.

In contrast, our objective takes the form:

$$\mathcal{V}^{PL}_{inter} = \left\| z_i - \tilde{z}^{post} \right\|_2^2 - \left\| z_i - \tilde{z}^{post}_c \right\|_2^2$$

Treating $\tilde{z}^{post}$ and $\tilde{z}^{post}_c$ as constants, the gradient with respect to $z_i$ becomes:

$$2\left(z_i - \tilde{z}^{post}\right) - 2\left(z_i - \tilde{z}^{post}_c\right) = 2\left(\tilde{z}^{post}_c - \tilde{z}^{post}\right)$$

This gradient is independent of $K$ and $K_c$, making the update scale stable throughout the entire adaptation process and ensuring that each test sample contributes equally, regardless of when it arrives.
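
As a toy illustration of this point (made-up dimensions and sample counts, purely for intuition), the following autograd check contrasts the gradient scales of the two formulations as the number of seen samples grows:

```python
import torch

torch.manual_seed(0)
D = 8
z_pre, zc_pre = torch.randn(D), torch.randn(D)    # running means, treated as constants

def grad_norms(K: int, Kc: int):
    z = torch.randn(D, requires_grad=True)         # current test feature z_i
    # (A) objective written directly on the updated running means
    z_post  = (K  * z_pre  + z) / (K  + 1)
    zc_post = (Kc * zc_pre + z) / (Kc + 1)
    loss_a = ((zc_post - z_post) ** 2).sum()
    g_a, = torch.autograd.grad(loss_a, z)

    # (B) batch-level objective with detached (constant) post-update means
    with torch.no_grad():
        z_post_c  = (K  * z_pre  + z) / (K  + 1)
        zc_post_c = (Kc * zc_pre + z) / (Kc + 1)
    loss_b = ((z - z_post_c) ** 2).sum() - ((z - zc_post_c) ** 2).sum()
    g_b, = torch.autograd.grad(loss_b, z)
    return g_a.norm().item(), g_b.norm().item()

for K in (10, 100, 1000, 10000):
    print(K, grad_norms(K, max(K // 10, 1)))   # ||grad_A|| shrinks with K; ||grad_B|| does not
```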

W2. Clean images and ImageNet benchmarks

Thank you for your suggestion. We are currently running additional online TTA baselines, including TDA and other relevant methods, on our evaluation benchmarks. We will share these results with you as soon as they are available.

评论

W2: Clean images and ImageNet benchmarks

Thank you again for your helpful suggestions. Following your advice, we have included additional online TTA baselines in our experiments:

  • TDA and DMN, two online adaptation algorithms already included in our submission. While their original papers report results on some of the benchmarks, we re-ran both methods under our unified evaluation protocol to ensure a fair comparison. Specifically, we use no data augmentation and the same set of 7 fixed templates across all methods. This setting differs from their original implementation, which includes AugMix augmentation and a broader CuPL template pool. These differences are documented in public discussions (e.g., GitHub issues #1 and #9 in the kdiAAA/TDA github repository). Due to rebuttal policies, we cannot provide direct links.
  • Tent and ETA, two widely adopted online TTA algorithms. Since our model uses ViT backbones without BatchNorm, we adapted their methods by updating LayerNorm parameters, which aligns with other baselines including CLIPArTT and WATT-S (a minimal sketch of this adaptation is given below).
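
The sketch below illustrates this LayerNorm-only adaptation, assuming the OpenAI clip package where model.visual is the image encoder; the optimizer and learning rate are illustrative placeholders, not the baselines' actual settings.

```python
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

model, _preprocess = clip.load("ViT-B/16", device="cpu")
model.eval()

# Freeze everything, then re-enable only the LayerNorm affine parameters
# of the image encoder (model.visual); ViT backbones contain no BatchNorm.
for p in model.parameters():
    p.requires_grad_(False)

ln_params = []
for module in model.visual.modules():
    if isinstance(module, nn.LayerNorm):
        for p in module.parameters():            # weight (scale) and bias (shift)
            p.requires_grad_(True)
            ln_params.append(p)

optimizer = torch.optim.AdamW(ln_params, lr=1e-3)  # optimizer and lr illustrative only
```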

We summarize the extended results in the tables below. Mint consistently outperforms these strong online baselines across all benchmarks, demonstrating the robustness and effectiveness of our method even beyond corruption datasets.

Table A: (Updated) Performance on Clean Datasets

| Method | ViT-B/32 on CIFAR-10 | ViT-B/16 on CIFAR-100 | ViT-L/14 on ImageNet |
|---|---|---|---|
| CLIP | 88.3 | 68.4 | 73.0 |
| CLIPArTT | 89.1 | 70.2 | 72.1 |
| WATT-S | 89.8 | 72.3 | 74.5 |
| TDA | 89.6 | 70.1 | 73.4 |
| DMN | 90.2 | 69.4 | 73.1 |
| Tent | 91.1 | 72.2 | 73.4 |
| ETA | 91.4 | 73.0 | 73.6 |
| Mint | 91.6 | 74.1 | 75.6 |

Table B: (Updated) Performance on Natural Distribution Shifts Benchmark (ViT-B/16)

| Method | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-Sketch |
|---|---|---|---|---|
| CLIP | 49.2 | 60.4 | 72.7 | 44.9 |
| CLIPArTT | 49.6 | 60.5 | 72.8 | 45.0 |
| WATT-S | 51.7 | 61.2 | 75.7 | 47.0 |
| TDA | 51.0 | 61.2 | 73.9 | 46.4 |
| DMN | 49.7 | 60.5 | 73.0 | 45.4 |
| Tent | 51.9 | 61.0 | 77.0 | 45.4 |
| ETA | 52.0 | 61.0 | 77.4 | 46.8 |
| Mint | 54.7 | 62.6 | 78.1 | 48.4 |

评论

I thank the authors for the additional experimental results and clarification of Eq.5.

But I still want to point out that the gradient calculation provided by the authors in their response is incorrect.

The fact that $\tilde{z}^{pre}$ appears repeatedly in the first formula is a minor error. The main problem is that the authors did not take into account the other terms involving $\tilde{z}^{post}$ when calculating the gradient.

According to my calculations, the gradients for $z_i$ of the left and right objectives of Eq.5 are the same. Eq.5 just shows two different forms, so it is reasonable that the gradients are the same.

I think the authors should carefully check the objective used in the experiment. If the incorrect one written in the response ($\|z_i - \tilde{z}^{post}\|_2^2 - \|z_i - \tilde{z}_c^{post}\|_2^2$) is used, then this incomplete objective is inconsistent with Eq.5, which will lead to incorrect experimental results.

My conclusion remains that the decomposition in Eq.5 is redundant.

评论

In addition, we would like to highlight another key distinction in our derivation. While Eq.5 provides a theoretical objective involving the full test distribution, our actual optimization target in the algorithm is Eq.7, which estimates the objective using only the current batch.

Eq.5, whether on the left-hand or right-hand side, effectively assumes the loss is computed over the full test set, but gradients are only taken with respect to a small subset (i.e., the current batch). This naturally leads to gradient scale decay as more samples accumulate.

In contrast, Eq.7 explicitly maximizes the PL-inter variance over the current batch, ensuring that the gradient scale remains stable. The aggregation of information across the test stream is delegated to the gradient accumulator, which maintains the benefits of long-horizon adaptation without compromising per-step gradient scale.
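
For intuition only, the sketch below shows one form such a gradient accumulator could take (an exponential moving average over per-batch gradients); the class name, momentum, and learning rate here are illustrative assumptions, not the actual configuration.

```python
import torch

class GradientAccumulator:
    """Smooths per-batch gradients across the test stream before applying them."""

    def __init__(self, params, momentum: float = 0.9, lr: float = 1e-3):
        self.params = [p for p in params if p.requires_grad]
        self.momentum = momentum
        self.lr = lr
        self.buffers = [torch.zeros_like(p) for p in self.params]

    @torch.no_grad()
    def step(self):
        for p, buf in zip(self.params, self.buffers):
            if p.grad is None:
                continue
            # Exponential moving average of gradients accumulated over batches.
            buf.mul_(self.momentum).add_(p.grad, alpha=1.0 - self.momentum)
            p.add_(buf, alpha=-self.lr)   # apply the smoothed update
            p.grad = None

# Usage sketch (hypothetical names): loss = -batch_pl_inter_variance; loss.backward(); accumulator.step()
```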

评论

I thank the authors for the further clarification.

I have basically understood the motivation of Eq.5 and Eq.7.

评论

Thank you for your kind response and for taking the time to engage with our clarification.

We are glad to hear that the motivation behind Eq.5 and Eq.7 is now clearer. This discussion has been particularly valuable, and we will make sure to improve the clarity of this part in our next revision.

If you have any further questions or concerns, we would be more than happy to address them promptly. We sincerely hope you will consider revisiting your overall evaluation of our paper. Thank you again for your thoughtful feedback and support.

评论

Hi Reviewer Cqv3,

Thank you again for taking the time to read and respond. I'm glad to hear that the clarification on Eq.5 and Eq.7 was helpful.

If there are any remaining questions or concerns you'd like us to address, we’d be happy to provide further clarification.

Otherwise, if everything looks good to you, we kindly invite you to submit your final evaluation at your convenience. Thanks again for your thoughtful review!

评论

Hi Reviewer Cqv3,

Thank you very much for this insightful question. We truly appreciate your engagement with the theoretical details—it has been very helpful in clarifying and improving the presentation of our paper.

We agree with your observation: the right-hand side of Eq.5 does indeed depend on $K_c$. However, a key point we would like to emphasize is that Eq.5 is still not the final objective actually optimized during gradient ascent; the actual objective function used in each update step is Eq.7. (This is also true for the batch-size = 1 special case we provided earlier.)

If we were to use the RHS of Eq.5 directly as the loss, the normalizer's denominator would be $K_c + 1$, as you said, which effectively corresponds to computing the loss over the full test stream while only updating based on the current sample. This would lead to a gradient-scale decay problem, since the influence of each new sample diminishes over time.

In contrast, Eq.7 is computed over only the current batch, and its normalization term is independent of the total number of previously seen samples. This design ensures that the gradient scale remains stable across adaptation steps, which is crucial for effective test-time optimization.

评论

Thank you for pointing out the typo! We have corrected it in this response.

We would also like to clarify a potential misunderstanding regarding the gradient derivation. If the assumptions about what is treated as differentiable vs. constant are not changed, then of course the resulting gradient remains the same. However, our derivation and yours differ precisely in which values require grad and which do not.


In your version, $\tilde{z}_c^{pre}$ and $\tilde{z}^{pre}$ are treated as constants, which leads to a gradient with respect to $z_i$ whose scale decreases over time, potentially causing optimization issues.


In contrast, our version treats $\tilde{z}_c^{post}$ and $\tilde{z}^{post}$ as constants. As you noted, this means we "did not take into account the other terms involving $\tilde{z}^{post}$ when calculating the gradient"; however, we emphasize that this is by design, not an oversight. This design choice is what ensures that the gradient magnitude is independent of the number of seen samples $K$, and thus remains stable throughout optimization. In this sense, what may appear to be a "missing term" is actually a feature, not a bug.


We also followed your suggestion and double-checked our code implementation. In practice, we first update the mean accumulators under torch.no_grad, so that both $\tilde{z}_c^{post}$ and $\tilde{z}^{post}$ are computed with requires_grad=False before computing the loss. This aligns exactly with the form of the RHS of Equation 5 (and more precisely, Equation 7), confirming that our implementation is consistent with the derivation.

评论

Thanks for your reply.

I still have another question.

If the authors treat $\tilde{z}_c^{post}$ and $\tilde{z}^{post}$ as constants, the gradient of the right term of Eq.5 should become $\frac{2}{C(K_c+1)} \cdot (\tilde{z}^{post}_c - \tilde{z}^{post})$, which also depends on $K_c$.

最终决定

This paper investigates why pretrained VLMs like CLIP are vulnerable to distribution shifts caused by input perturbations or corruptions. The authors identify what they call "variance collapse", where intra-class and inter-class variances collapse as corruption increases. Their theoretical results argue that the visual encoder tends to encode corruption-related signals, which suppresses discriminative features. They also show that maximizing inter-class variance can counteract this.

This work has several strengths. Notably, the writing is clear and easy to follow, the work introduces and formalizes an interesting and interpretable phenomenon (variance collapse), and the proposed method (MINT) is a simple, efficient yet effective TTA algorithm.

Several concerns were identified during the initial review stage. Primarily, questions were raised regarding some of the equations, the decomposition, and the theoretical claims [Cqv3, EhQi]; experiments were only on corrupted benchmarks and no clean-image accuracy was reported [Cqv3]; questions regarding why some state-of-the-art methods were excluded from comparison [Cqv3]; whether variance collapse applies only to corrupted images [EhQi]; concerns regarding the novelty of variance collapse [3aNj]; lack of clarity in some of the theoretical/technical components [3aNj]; missing related works (Super-resolution, LRTK0, RobustSAM) [3aNj]; and the TTA problem setting requiring no fine-tuning or training of any sort potentially being too restrictive and not a sufficient reason to avoid comparison with other methods [3aNj].

The authors provided substantial clarifications and new experiments during rebuttal: they added results on clean datasets and natural distribution shift benchmarks; addressed concerns about Eq. (5) vs. Eq. (7), explaining why their formulation avoids gradient-scale decay; expanded the baselines to include additional online TTA methods (TDA, Tent, ETA), showing Mint still performed best; and cited additional related works while acknowledging overlaps with prior observations of feature collapse.

After the rebuttal, most weaknesses were adequately addressed, though some concerns remained. Primarily, reviewer 3aNj raised concerns that RobustSAM, LR-TK0, and Super-resolution likewise "assume knowledge of the corruption type" and also do not "retrain the original pretrained model", and that they should be included in the empirical evaluation as "the choice of whether to use prompts, SR, or layer norms seems more like a design preference than a fundamentally different objective." Further, "not training, or training a simple classifier to inform the model about the type of noise, are simply design choices in my subjective opinion, which would have put them in direct comparison with existing methods." Thus, restricting the comparison to include only other TTA methods can be seen as an artificial limitation, as the methods cited by the reviewer are still aiming to achieve the same objective.

While the authors included results on a mixture of 15 corruption types during the rebuttal, all other experiments and empirical analyses were conducted on test sets containing only a single type of perturbation, which may be overly artificial.

Other concerns surround the clarity of the mathematical formulations. The authors would be required to update their manuscript to include the clarifications made during the rebuttal.

While I agree that the methods identified by reviewer 3aNj should be included in the related work, I feel that this work offers interesting theoretical insights and proposes a simple yet sound method for easily adapting pretrained models to be more robust to corruptions. The new results added during the rebuttal, especially those on domain shifts and mixtures of noise types, strengthen the work to the point where I recommend acceptance.