PaperHub
6.8 / 10
Poster · 5 reviewers
Ratings: 4, 4, 5, 4, 4 (min 4, max 5, std 0.4)
Confidence: 4.4
Novelty: 2.6 · Quality: 2.6 · Clarity: 3.0 · Significance: 2.4
NeurIPS 2025

Test-Time Spectrum-Aware Latent Steering for Zero-Shot Generalization in Vision-Language Models

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We propose Test-Time Spectrum-Aware Latent Steering (STS), a lightweight adaptation that learns a few per-sample steering parameters in latent space to align visual and text embeddings and boost zero-shot robustness.

Abstract

Keywords
Vision-Language Models · Foundation Model · CLIP · Test-Time Adaptation · Generalization · Out-of-Distribution Robustness

Reviews and Discussion

Review
Rating: 4

This paper introduces STS, an efficient Test-Time Adaptation (TTA) technique for VLMs such as CLIP. Inspired by the processing pipeline of previous works such as TPT, STS treats adaptation as an alignment problem between the visual and text spaces, particularly focusing on modifying the text features. A small set of parameters is learned in order to shift the right singular vector matrix from the Singular Value Decomposition (SVD) of the text features. A marginal entropy objective (based on augmentations of a single image) is used to adapt these linear parameters, alongside L2 regularization. The method obtains competitive results against popular TTA baselines.

Strengths and Weaknesses

Strengths:

  1. The paper introduces an unexplored and interesting idea: adaptation can be carried out by modifying the text features along the most significant semantic directions based on the SVD decomposition. This allows "steering" the text representation toward a better alignment with the input images without incurring expensive backpropagation, while enforcing black-box adaptation.

  2. The experiments show competitive results on several of the popular datasets used in the field, and integrate MaPLe as well as CLIP, including previous adaptation strategies from the state of the art.

  3. The writing is clear and the explanations are easy to follow.

Weaknesses:

  1. Although the main contribution (SVD on the text space) is interesting, its cohesion with the rest of the ideas is not entirely clear. The paper ZERO [5] offers some intuition, but it is not discussed here.

  2. Popular datasets such as the ImageNet domain variants and CLIP's zero-shot suite are used, but some of the previous (and customary) datasets from TTA are disregarded: corruptions such as CIFAR-10/100-C or ImageNet-C, simulations such as VisDA-C, etc. It is a trend in recent works to ignore the previous protocols (see TENT, for instance).

  3. The comparison against the state of the art is based solely on methods derived from TPT. The paper intends to reduce the adaptation complexity, but that does not invalidate other gradient-based methods such as WATT or TENT, or parameter-free methods like TDA. The only difference is that they do not learn text prompts, but instead fine-tune the visual encoder.

Questions

  1. For instance, what's the advantage of performing marginal entropy minimization instead of conditional entropy minimization on filtered samples?

  2. Also, are the augmentations needed? What is the impediment of using a single image? Is the strategy of augmenting an image related solely to the fact that the marginal entropy is the objective function?

  3. If the domain shift is on the visual side, why not apply STS to the visual features so that they align better with the class concepts embedded in the text space?

  4. What is the performance on corruptions? Datasets such as CIFAR-10/100-C or Imagenet-C can be included. Oftentimes important gains can be obtained.

Limitations

Yes

Final Justification

The idea of the paper is interesting, and the evaluation seems correct. I believe this paper is a good effort towards new directions for TTA in VLMs. The authors have not responded to my follow-up questions. I keep my score.

Formatting Issues

The paper formatting seems correct.

Author Response

Dear Reviewer eVss,

Thank you for your positive assessment of our work. We’re delighted that you appreciate STS’s combination of black-box simplicity and semantic grounding for episodic test-time adaptation.

W1

We thank the reviewer for raising this point. We agree that ZERO's hard-voting scheme is important, and it is not in competition with STS but rather complementary. In fact, one can seamlessly replace STS's marginal-probability averaging over confident views with a majority vote across those same views:

  1. Adapt with STS as usual. Compute the low-dimensional shift $\Delta z = B\gamma$ via entropy minimization (Eq. 4) and apply it to the prototypes.

  2. Predict by voting instead of averaging. For each of the $N$ filtered augmentations, take the arg max over the adapted similarities to cast a "vote" for its top class, then choose the class with the most votes.

When we tried this combined STS + ZERO voting pipeline on ImageNet-A, we observed an additional +0.15 % accuracy gain over STS’s standard averaging, demonstrating that ZERO’s discrete consensus step can further sharpen STS’s predictions without any extra gradient computation.
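For concreteness, a minimal sketch of what this combined pipeline could look like (the function and tensor names are hypothetical, not taken from the paper's code; `B` and `gamma` denote the spectral basis and the coefficients already learned via Eq. 4):

```python
import torch

def predict_with_voting(text_protos, B, gamma, view_feats):
    """Hypothetical sketch: steer the prototypes with the learned spectral
    shift, then replace marginal-probability averaging with a majority vote.

    text_protos: (C, D) frozen text prototypes
    B:           (D, k) top right-singular-vector basis
    gamma:       (k,)   coefficients learned by entropy minimization (Eq. 4)
    view_feats:  (N, D) L2-normalized embeddings of the filtered views
    """
    protos = text_protos + B @ gamma                  # steered prototypes
    protos = protos / protos.norm(dim=-1, keepdim=True)

    logits = 100.0 * view_feats @ protos.T            # (N, C) similarities
    votes = logits.argmax(dim=-1)                     # one vote per view
    return torch.bincount(votes, minlength=protos.shape[0]).argmax()
```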

Q1 By minimizing the marginal entropy over the filtered views, rather than the conditional entropy of each view, we gain several practical and robustness benefits:

  • Robustness to outlier views. Conditional entropy forces every selected crop to be highly confident, so a single ambiguous or noisy view can pull the update in a harmful direction. Marginal entropy only requires the aggregate distribution (across all filtered views) to be sharp, allowing a few less‐confident or noisy views to be “overruled” by the majority.

  • Encourages consensus without over-fitting. By focusing on the consensus distribution, marginal entropy guides STS to find a shift that works well overall, not just for each individual augmentation. This is particularly important when only a small number of filtered views is available: the fewer the views, the more one wants to lean on their collective signal.

In short, marginal entropy minimization on the filtered subset yields a cleaner, more robust, and more easily optimized objective for single-step, black-box adaptation, exactly what STS needs to steer the prototypes confidently without getting thrown off by individual noisy views.
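To illustrate the distinction, a small sketch of the two objectives over the filtered views (the names are ours; `probs` holds the per-view softmax outputs):

```python
import torch

def marginal_entropy(probs):
    # probs: (N, C) class probabilities of the N filtered views.
    # Entropy of the *averaged* distribution: a few noisy views are smoothed
    # out by the consensus before the entropy is computed.
    p_bar = probs.mean(dim=0)
    return -(p_bar * p_bar.clamp_min(1e-12).log()).sum()

def mean_conditional_entropy(probs):
    # Average of per-view entropies: every view must be confident on its own,
    # so a single ambiguous view directly adds a large penalty.
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
```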

Q2

Augmentations are tied to the marginal entropy minimization and to our goal of obtaining a more robust representation of the test sample.

Q3

We tried this approach and we found that the performance improvement of having a steering vector for both the visual and textual embeddings is marginal.

Q4

Please see the response to reviewer QTGH for the results on CIFAR10-C.

Comment

Thank you for addressing my concerns.

Q3: Is there any hypothesis of why the language side is more effective even when the domain shifts are on the visual side? I find this discussion important to help us advance our understanding of VLMs, especially during unsupervised adaptation, where methods tend to be brittle.

Q4: Is there any exact comparison in accuracy on CIFAR10-C with any other method?

Comment

Dear Reviewer, we are happy to see that our rebuttal addressed the concerns raised in the initial review. We answer the additional questions below:

Q3. We would like to clarify that we did not claim, either in our previous response or in the manuscript, that text-side STS is categorically more effective than visual-side steering. Rather, our choice to focus on text-side adaptation was driven by three key considerations:

  1. Most episodic TTA methods that touch the text modality rely on relatively heavy prompt-engineering or prompt-tuning steps, which incur non-trivial computational and memory overhead. By contrast, applying STS to each text prototype requires only a single closed-form update, no back-propagation through the entire encoder and no storage of the intermediate activations, making it exceptionally cheap and scalable in the episodic setting.

  2. As an intuition, the singular vectors of the text prototype space often align with coherent semantic shifts. We expect that the visual embedding space will capture primarily low-level augmentation axes and more entangled mid-level patterns that are harder to ascribe clear meanings to.

  3. Across all the 15 benchmarks, the gap between (a) visual-only steering, (b) text-only steering, and (c) their naive combination was consistently under ~0.10%. In some datasets visual steering edged out text steering by a few hundredths of a percentage point; in others the reverse was true. In practice, the Gavish-Donoho thresholding retains the principal steering subspace, and those axes capture similar dominant domain-shift directions whether computed on visual embeddings or on text prototypes. The remaining directions lie near the noise bulk and vary unpredictably.

Note that we tested both approaches on CIFAR10-C and we observed that text-prototype steering performed marginally better than visual-side steering.

Q4. Below is a concise summary of our CIFAR10-C (severity 5) results under the TPT evaluation protocol (10% most confident views, lr=0.005, hand-crafted prompt "a photo of a {cls}"). Note that for the ensemble results, we do not use specific hand-crafted templates (as in the previous response to reviewer QTGH, which used 3 hand-crafted templates). Instead, we use the set of 7 generic templates highlighted in the official CLIP repository. The results are the following:

| Zero-shot | Ensemble | TPT | TPS | STS | STS (ensemble) |
|---|---|---|---|---|---|
| 61.12 | 63.31 | 63.83 | 61.73 | 63.79 | 67.24 |

Summary:

  • STS matches TPT to within 0.05%, showing that our steering-subspace update is as effective as their prompt tuning when used in the same setup.

  • STS substantially outperforms the naive shifting of the full per-class textual prototypes (TPS), highlighting the value of constraining adaptation to a structured subspace.

  • By combining STS with the 7 generic CLIP prompts we achieve 67.24%, a large gain over all other methods, demonstrating that our subspace steering and prompt ensemble are highly complementary.

  • Even under very severe corruptions (CIFAR10-C severity 5), our simple linear steering update recovers part of the lost accuracy and outperforms prior episodic TTA methods.

These results confirm that (1) structured subspace steering is crucial, and (2) STS is competitive with heavier prompt-tuning approaches. STS operates purely on the pre-computed embeddings: no access to weights, no back-prop through the encoder, and no extra modules to store. Its closed-form updates make it a trivial "drop-in" for any case, opening the door to future research on lean, high-performance adaptation methods that deliver real-world impact without bells and whistles.

Review
Rating: 4

The paper introduces Spectrum-Aware Test-Time Steering (STS), a lightweight test-time adaptation (TTA) method for vision-language models (VLMs), specifically addressing the challenge of zero-shot generalization under domain shifts. STS works by steering latent text representations within a low-dimensional spectral subspace obtained from the singular value decomposition (SVD) of initial textual embeddings. This steering optimizes a minimal set of parameters (spectral coefficients) at inference, significantly enhancing model robustness and efficiency.

Strengths and Weaknesses

Strengths:

  1. The idea of using an SVD-based spectral subspace to guide adaptation is novel and well-motivated, leveraging intrinsic low-dimensionality in text embeddings.
  2. The approach is parameter-efficient, significantly reducing computational overhead compared to existing TTA strategies.

Weaknesses:

  1. The method's adaptation mechanism is linear, which may limit its effectiveness in handling more complex, nonlinear domain shifts. Such a limitation could restrict generalization performance under extreme or highly varied conditions.
  2. Although the paper presents an ablation study on the computational aspects, additional detailed ablations exploring the sensitivity to the choice of subspace dimensionality or impact of specific hyperparameters (like the number of augmented views) could further clarify robustness.
  3. The experimental results appear suboptimal on certain datasets. What could be the reasons for this?

Questions

  1. Although the method is simple and easy to understand, I believe it requires some theoretical justification or relevant citations to support the claims made in the paper.
  2. Is data augmentation for images necessary? Can the method perform without image augmentation?
  3. Can TTA-based methods or the method proposed in this paper be applied to a wider range of domains?
  4. Line 243: Inconsistent capitalization in the title.
  5. Line 176: Citation error.
  6. Lines 247, 253, and 256: Tables are missing citations.
  7. Line 312 in the provided document appears incomplete.

Limitations

See questions.

Final Justification

After checking the authors' responses, which addressed my concerns well, I raised my rating and recommend accepting this manuscript.

Formatting Issues

No major formatting issues observed.

Author Response

Dear Reviewer QTGH,

We sincerely thank you for recognizing the novelty and motivation behind our proposed approach. We also appreciate the acknowledgment of the method’s parameter efficiency and its ability to significantly reduce computational overhead compared to existing test-time adaptation (TTA) strategies. These aspects were central to our design goals, and we are glad they resonated with the reviewer.

We address the raised concerns and questions below:

W1 - Method's Adaptation Mechanism:

We appreciate the reviewer's point and in fact explicitly note this in our Limitations section (Sec. 6): STS's single linear shift in Eq. 4 may under-fit highly complex transformations. In STS, however, we find that:

a. Across ImageNet-A and -V2, a single linear shift in our SVD-derived subspace recovers a significant percentage of the zero-shot accuracy gap under moderate to severe corruptions. This suggests that, even when the visual perturbation is nonlinear, the resulting change in optimal text-prototype geometry often lies on a roughly linear manifold.

b. We also evaluated STS on CIFAR-10-C at the most severe corruption level (severity 5) and observed a +3.52 % accuracy gain over the zero-shot baseline, demonstrating that even under extreme, nonlinear corruptions, steering in our low-dimensional semantic subspace yields substantial robustness improvements.

| Method | Accuracy (%) |
|---|---|
| Zero-shot | 61.12 |
| STS | 64.64 |

We believe that this design reflects a deliberate trade-off that maximizes robustness and efficiency:

c. Under the *manifold hypothesis*, pretrained text embeddings concentrate on a low-dimensional semantic manifold. An SVD of the class-prototype matrix uncovers the principal axes along which classes vary. Even complex visual corruptions often project predominantly onto these top directions in text space, so a linear shift suffices to recover the bulk of the alignment gap.

d. By constraining adaptation to the top-k singular vectors, STS avoids over-fitting low-energy, noise-driven directions, an ever-present risk in richer, nonlinear adapters when only a few unlabeled views are available. This bottleneck both stabilizes and focuses learning on the most meaningful shifts.

In summary, although STS’s core update is linear, it is carefully designed to harness the most significant semantic directions, yielding a practical balance of expressivity, robustness, and efficiency.

W2 - Ablation Studies:

We thank the reviewer for suggesting a deeper ablation on (a) the choice of spectral subspace dimension and (b) the number of augmented views. Below we report two experiments:

a. Subspace dimension sensitivity

We compare two strategies for selecting $k_t$ over three random initializations (seeds), reporting performance on the ImageNet-A dataset:

(i) Gavish-Donoho hard thresholding, and (ii) 98% energy retention

| Strategy | Seed 1 Acc (%) | Seed 2 Acc (%) | Seed 3 Acc (%) | Mean Acc (%) |
|---|---|---|---|---|
| Gavish–Donoho ($k_t=31$) | 61.07 | 61.51 | 61.13 | 61.24 |
| 98% Energy ($k_t=183$) | 60.91 | 61.35 | 60.92 | 61.06 |
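For reference, below is a sketch of how the Gavish-Donoho rule can be applied to the prototype spectrum, using the commonly cited polynomial approximation of the optimal hard threshold for unknown noise level (our own naming and a sketch only, not the paper's code):

```python
import numpy as np

def gavish_donoho_rank(Z):
    """Pick k_t by hard-thresholding the singular values of the C x D
    prototype matrix Z (Gavish & Donoho, 2014); sketch only."""
    C, D = Z.shape
    s = np.linalg.svd(Z, compute_uv=False)
    beta = min(C, D) / max(C, D)                      # aspect ratio <= 1
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    tau = omega * np.median(s)                        # unknown-noise threshold
    return int((s > tau).sum())                       # retained dimension k_t
```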

b. Number of Augmented Views

We vary the number of views $N \in \{16, 32, 64, 128\}$, reporting accuracy on ImageNet-1K:

| Num Augmentations $N$ | Accuracy (%) |
|---|---|
| 16 | 67.52 |
| 32 | 68.61 |
| 64 | 68.93 |
| 128 | 69.08 |

Increasing from 64 to 128 views gives only a +0.15% boost, while nearly doubling time and memory. We therefore adopt $N=64$ (as in prior TPT work) to balance performance and efficiency, still remaining substantially faster than prompt-tuning baselines.

W3 - Suboptimal on certain Datasets:

We thank the reviewer for raising this point. Here, we pinpoint two possible reasons that have to do with the visual transformations (augmentations):

a. **Familiarity of the semantic categories:**

Many of the ImageNet‐variant benchmarks (and fine‐grained datasets like SUN397 or Caltech101) cover object classes that CLIP likely saw often during pretraining. As a result, CLIP’s predictions remain stable across simple augmentations for these “common” categories. In contrast, classes in datasets such as Flowers102 or Oxford‐Pets represent less frequent concepts, so CLIP is more sensitive to their augmented views.

b. Nature of the visual transformations:

Our augmentation pipeline uses only random resized crops and horizontal flips, essentially random “zoom‐ins.” For some domains (e.g. FGVC‐Aircraft, Stanford Cars), zooming can highlight distinctive details like logos or text, which helps CLIP recover the correct label. But on datasets like Flowers102, cropping can remove critical features (e.g. petals or stems), degrading performance.

In this work, we intentionally kept our augmentations uniform across all tasks rather than tuning them per dataset. Nevertheless, these results suggest that the choice of augmentations, and how they interact with the model’s learned visual priors, is a rich avenue for future study.

Q1

We thank the reviewer for calling attention to the need for stronger theoretical grounding. In the revised manuscript, we will add the following justifications:

a. Low intrinsic dimension of pretrained embeddings:

Prior work shows that deep‐network features—both vision and text—concentrate on a low‐dimensional manifold. In particular, Aghajanyan et al. (ICML 2020) empirically measure this “intrinsic dimension” and demonstrate that most semantic information lies in a small subspace. This motivates our choice to perform adaptation in a structured, low-dimensional basis rather than the full embedding space.

[1] Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning, Aghajanyan et al. (ICML 2020)

b. Domain‐adaptation generalization bound:

We reference the theoretical bound of Ben-David et al. (JMLR 2010), which states that the target risk is controlled by the source risk plus a divergence term between source and target feature distributions. By minimizing distributional discrepancy in our spectral subspace (via entropy minimization), STS directly reduces this divergence and thus the target error.

[2] A Theory of Learning from Different Domains, Ben-David et al. (Machine Learning, 2010)

Q2

STS can indeed be applied with no augmentations (i.e. a single view, $N=1$), since the optimization in Eq. (4) does not strictly require multiple crops. However, in our experiments, we observe that:

As $N$ increases, accuracy improves, but the computational cost grows roughly linearly.

Q3

Thank you for asking about TTA's and STS's broader applicability. Our experiments cover a broad range of domains, from ImageNet variants to satellite images and flower datasets. While our experiments focus on zero-shot image classification, STS is not tied to any specific modality or task, as long as two conditions hold:

a. We have static embeddings (e.g. text prompts, class centroids, label vectors) that represent the categories or concepts of interest.

  b. We can generate multiple views of each test sample (e.g. augmentations of an image, noise-perturbed audio) without requiring labels.

Q4-7

All minor errors and typos will be corrected.

Comment

Dear reviewer QTGH,

Thank you for dedicating your time and effort to evaluate our manuscript.

With today being the closing date for author-reviewer discussions, we trust our comprehensive response has resolved your points. If you need any clarification on any points, please do not hesitate to reach out to us.

Thank you again!

Best,

Authors

Review
Rating: 5

This paper proposes Spectrum-Aware Test-Time Steering (STS), a lightweight method for adapting vision-language models like CLIP to distribution shifts without using labels or updating model weights. STS identifies a low-dimensional semantic subspace via SVD of text prototypes and learns a small set of coefficients to shift all class embeddings in this space. The shift is optimized at test time by minimizing the entropy of predictions across augmented views, promoting confident and consistent outputs. STS achieves better or comparable performance to existing test-time adaptation methods (e.g., TPT, TPS), while offering reasonable gains in speed and memory efficiency.

Strengths and Weaknesses

The following are the strengths of the paper.

  1. The idea of constraining prototype shifts to a spectral subspace derived via SVD is reasonable. It introduces a well-motivated alternative to existing methods like TPT and TPS by grounding adaptation in semantically meaningful directions rather than unconstrained latent shifts or prompt tuning.

  2. STS operates entirely in latent space without requiring any backpropagation through the encoder or modifications to the model. This makes the approach efficient, reducing memory footprint compared to prior methods.

  3. The paper demonstrates consistent gains over zero-shot CLIP and prior TTA methods like TPT and TPS across a wide range of distribution-shifted and cross-domain datasets. The trend remains the same with a stronger MaPLe initialization.

  4. The paper is well written. The method is introduced step by step and the intuition behind using SVD to define the adaptation space is well-motivated. The use of entropy minimization as a label-free optimization objective is standard but well integrated.

The following are the weaknesses of the paper.

  1. All experiments are conducted only on CLIP ViT-B/16 which is now considered relatively outdated. With the increasing use of larger and stronger backbones like CLIP-L/14, SigLIP-SO400M [1], SigLIP2-SO400M [2] and Perception Encoder [3], it is unclear how STS scales or performs in those settings.

  2. While the core method is evaluated well, the paper lacks deeper ablations. For instance, how performance varies with the number of optimization steps, sensitivity to initialization, or the impact of using alternative subspaces (e.g., PCA instead of SVD). These would improve our understanding of robustness and generality.


[1] SigLIP (SO‑400M): Zhai, Xiaohua, et al. "Sigmoid Loss for Language Image Pre‑Training." arXiv preprint arXiv:2303.15343, 2023.

[2] SigLIP 2 (SO‑400M variant): Tschannen, Michael, et al. "SigLIP 2: Multilingual Vision‑Language Encoders with Improved Semantic Understanding, Localization, and Dense Features." arXiv preprint arXiv:2502.14786, 2025.

[3] Perception Encoder: Bolya, Daniel, et al. "Perception Encoder: The best visual embeddings are not at the output of the network." arXiv preprint arXiv:2504.13181, 2025.

Questions

  1. How does STS perform on stronger VLMs such as CLIP-L/14, SigLIP2-SO400M or Perception Encoder L/14 & G/14? Even a small-scale evaluation (e.g., on a subset of datasets) would be highly informative. A detailed justification for why such models were not considered (e.g., computational limitations) would also be appreciated but ideally with pointers to expected behavior or work-in-progress.

  2. Can you provide a deeper ablation on the choice of spectral subspace, such as comparing SVD to PCA? A strong response showing that the method is either robust to subspace choice or that the current design is optimal, would strengthen confidence in the generality of STS and may improve my rating.

Limitations

Yes

Final Justification

This paper proposes a practical and well-motivated method for test-time adaptation of vision-language models using a spectral subspace derived via SVD of text prototypes. The approach is elegant in its simplicity, avoiding any changes to model weights or architecture, and achieves strong results while being significantly more efficient than existing methods like TPT and TPS.

The authors' rebuttal addressed my key concerns. They pointed to CLIP-L/14 results in the appendix, and their explanation for choosing SVD over PCA is reasonable. While I would still like to see broader evaluations on more modern VLMs (e.g., SigLIP2, Perception Encoder) and deeper ablations on subspace design, the core contribution is solid, and I would like to increase my score to 5: Accept.

Formatting Issues

No major formatting issues.

Author Response

Dear Reviewer 2jDh,

Thank you for the positive recommendation of our work and the comments.

W1

We appreciate the reviewer’s interest in understanding STS’s behavior on more modern backbones. In the Supplementary Material, we report results for CLIP ViT-L/14 on ImageNet and its OOD variants. As we can see, our method still outperforms zero-shot by a large margin on average across 5 datasets, showcasing that our method generalizes well to larger-scale VLMs.

W2

Thank you for your interest in the spectral subspace.

Here, we note that the PCA approach needs to center the class prototypes (subtract the mean of the C rows in a C×D matrix), then compute the covariance, and then perform an eigen-decomposition on the covariance matrix and take its top-k eigenvectors.

Because CLIP’s embeddings live on the unit hypersphere, their global mean is not zero, and removing it (as PCA requires) can actually distort the geometry you care about.

  • CLIP optimizes ⟨zᵥ, zₜ⟩ with both zᵥ and zₜ normalized to ‖·‖₂ = 1. Those points lie on the surface of a high-dimensional sphere, and their mean (∑ᵢ zᵢ)/N sits strictly inside the sphere.

  • PCA subtracts that mean before finding principal axes, effectively “recentering” your data cloud at the origin. That shifts your spherical manifold in a way that no longer respects the original contrastive geometry.

SVD on uncentered data respects the learned hypersphere

  • A raw SVD of the C×D text-prototype matrix Zₜ finds the directions of maximum energy, including any offset from the sphere's center, which often encodes globally relevant semantics (e.g. a "generic image" direction).

  • Those directions align more naturally with CLIP’s training objective, which never removed the mean and whose singular vectors capture both mean and variation.

Empirical parity (and simplicity)

  • In practice, centering then doing PCA yields virtually the same subspace span as SVD on the centered data, but differs from SVD on the raw data.
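A small sketch of the two constructions being contrasted (assuming standard NumPy; `Z` is the C×D prototype matrix and the names are ours):

```python
import numpy as np

def raw_svd_basis(Z, k):
    # Right singular vectors of the *uncentered* prototype matrix: the leading
    # direction typically tracks the shared offset on the hypersphere.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Vt[:k]                              # (k, D)

def pca_basis(Z, k):
    # PCA removes the mean first, recentering the cloud at the origin; the
    # covariance eigenvectors then span only the variation around that mean.
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / (Z.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
    return eigvecs[:, ::-1][:, :k].T           # top-k principal axes, (k, D)
```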
Comment

Thank you for the rebuttal.

I appreciate you pointing out the CLIP-L/14 results in the appendix and the explanation about preferring SVD over PCA. While broader evaluations on newer models like SigLIP2 or Perception Encoder would further strengthen the paper, I understand the practical constraints.

Thanks again for the detailed response.

Review
Rating: 4

This paper tackles the well-known efficiency challenge of test-time prompt tuning (TPT) by proposing Spectrum-Aware Test-Time Steering (STS). The key idea behind STS is to selectively tune parameters at the output layer of the model. Specifically, it extracts a spectral subspace from the textual embeddings and learns to adapt a small number of per-sample shift parameters in a spectrum-aware manner. Empirical results demonstrate that the proposed method accelerates TPT by up to 8x and reduces GPU memory usage by up to 12x, all without compromising performance.

Strengths and Weaknesses

Strengths

  1. This paper addresses an important issue in test-time prompt tuning, enhancing the practicality of this test-time technique.
  2. The empirical results are promising, significantly reducing computational costs while preserving performance.
  3. The writing is clear, and the paper is well-structured and easy to follow.

Weaknesses

  1. Limited Scope of Evaluation: While STS is presented as a plug-and-play technique applicable to various TPT methods, the experiments are restricted to TPT alone. Evaluating STS on additional methods, such as DiffTPT [1] and PromptSRC [2], would provide stronger evidence of its generalizability and effectiveness.
  2. Missing Baselines: The paper omits several important baselines that would help contextualize the contributions of STS. For example: 1) Directly ensembling results with negative entropy, which incurs no additional computational cost but can boost performance. 2) Updating only the last transformer layer, a potentially effective and efficient alternative. 3) Adding an identity matrix to the text features and optimizing its parameters. Including these intuitive baselines and demonstrating the gains of STS over them would significantly strengthen the paper.
  3. More Performance-Efficiency Comparisons: As the paper focuses on addressing efficiency, it would be valuable to compare STS with test-time training-free techniques for zero-shot image classification tasks, such as [3, 4]. These approaches demonstrate strong zero-shot performance while maintaining efficiency. Incorporating such comparisons and discussing the implications would elevate the paper's contributions and provide a more comprehensive analysis.

[1] Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning. ICCV 2023. [2] Self-regulating Prompts: Foundational Model Adaptation without Forgetting. ICCV 2023. [3] A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models. ICML 2023. [4] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation. NeurIPS 2024

Questions

Please see the weakness.

Limitations

Yes

Final Justification

The rebuttal addresses my concerns regarding the baseline experiments. However, the plug-and-play generality remains largely conceptual and is not supported by experimental results. I recommend that the authors include additional experiments to substantiate this aspect in their revision. Additionally, the authors declined to compare their approach with training-free baselines, due to their reliance on LLMs. However, the use of LLMs in downstream tasks is increasingly widespread and, in some cases, even essential (e.g., for never-seen classes in CLIP). Furthermore, test-time adaptation methods are known to be computationally intensive, which can limit their practical applicability. Therefore, a comprehensive comparison with alternative approaches, e.g., training-free methods, would be valuable. Based on these considerations, I keep my borderline acceptance and strongly encourage the authors to include the suggested experiments.

Formatting Issues

No

Author Response

Dear Reviewer VSFB,

Thank you for your comments and positive recommendation of our work. We provide point-by-point responses to address your concerns below.

We would like to clarify that STS does not perform prompt tuning; instead, it steers the frozen text prototypes directly in a low-dimensional spectral subspace.

W1 - Limited Scope of Evaluation: Thank you for the suggestion to evaluate STS alongside other adaptation pipelines. A few points address this:

a. STS vs. DiffTPT as baseline
In Table 1, DiffTPT appears as an independent baseline; its diffusion-based augmentations are not combined with STS. While STS is fully compatible with any view generation scheme (including DiffTPT’s), our core goal is ultra-efficient, episodic black-box TTA: adopting DiffTPT’s augmentations would multiply per-sample latency and GPU memory by 5–10×, undermining STS’s real-time, low-footprint advantage.

b. Plug-and-play generality
Conceptually, STS can replace the gradient-through-prompt step in any TPT-style framework with a single spectral steering update. If a domain demands richer augmentations, one can simply feed those views into STS’s entropy-minimization loop (no changes to the SVD are required).

c. PromptSRC
PromptSRC [2] is designed for few-shot prompt tuning with labeled examples at test time. Because STS targets zero-shot, unlabeled, per-sample black-box adaptation, combining it with a few-shot method falls outside our stated problem setting.

W2 - Missing Baselines:

a. Directly ensembling results with negative entropy

We thank the reviewer for these helpful suggestions. Below we explain how each of these baselines fits into our black-box, episodic TTA setting, and point to where we already cover or will cover them in revision:

We appreciate the reviewer’s suggestion to include a test-time ensembling baseline via averaging predictions over augmented views (i.e., without adaptation). We implemented this “zero-shot ensemble” baseline and tested it on the challenging ImageNet-A dataset. We found that it provides a +8.63% absolute improvement over standard zero-shot CLIP, increasing top-1 accuracy from 47.87% to 56.50%.

While this confirms that multi-view ensembling can modestly improve robustness, our proposed STS method still significantly outperforms it, achieving up to 61.23% with a single hand-crafted prompt and 64.29% with ensemble.

This result demonstrates that STS goes beyond passive confidence averaging. It delivers active and semantically targeted adaptation in the latent space, leading to greater gains under distribution shift than ensembling alone.

b. Updating the last transformer layer

Such a baseline requires white-box access to CLIP’s internal weights and architecture to backpropagate into a deep layer, contradicting our design goal of black-box TTA. STS assumes only query access to the model’s logits or prototypes, never internal gradients. For real-world deployment (e.g. proprietary APIs), updating a hidden layer is infeasible, so we do not evaluate it.

c. Identity-residual on text features

Learning a shift on the D-dimensional text embeddings is exactly the shared-residual variant of Test-Time Prototype Shifting (TPS) [31], and learning a full C×D residual is TPS's class-specific variant. In the TPS paper, the authors show that the class-specific variant performs slightly better than the shared-residual variant. This class-specific variant of TPS appears in Table 1 and Table 2. Across all 15 benchmarks, STS, using only a small number of spectral coefficients, consistently outperforms TPS class-specific residuals, demonstrating that our SVD-based bottleneck is a strictly stronger, more parameter-efficient adaptation mechanism that steers the textual prototypes along the most semantically meaningful directions.
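To make the parameter-count contrast concrete, a small illustrative computation (not from the paper; it assumes ImageNet-A's 200 classes, the $k_t=31$ reported elsewhere in this discussion, and ViT-B/16's 512-dimensional joint embedding space):

```python
C, D, k_t = 200, 512, 31     # ImageNet-A classes, CLIP ViT-B/16 embedding dim, reported k_t

shared_residual = D          # TPS shared shift r in R^D
class_specific = C * D       # TPS class-specific residual R in R^{C x D}
sts_coefficients = k_t       # STS spectral coefficients per test sample

print(shared_residual, class_specific, sts_coefficients)   # 512 102400 31
```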

W3 - More Performance-Efficiency Comparisons:

We thank the reviewer for highlighting these training-free baselines, which indeed represent an important point on the performance-efficiency frontier.

The prompt-weighting method [3] re-weights a fixed ensemble of hand-crafted prompts by minimizing the ensemble entropy. Unfortunately, no public code or details were released, so a faithful re-implementation is infeasible. In the revised manuscript, we will incorporate a qualitative discussion of it.

AWT method relies on an external LLM to craft dataset-specific text prompts from prior knowledge, then re-weights and transports multi-view image features via optimal transport (still no backprop through the VLM but with significant dependency on the LLM and multiple heavy forward passes). AWT’s use of the external LLM to craft dataset-specific prompts brings a practical caveat: it requires prior knowledge of the entire target dataset’s class taxonomy and domain characteristics in order to generate tailored descriptions. In true episodic, zero-shot TTA settings, where each sample arrives in isolation and no labeled or domain-wide metadata is available, this a priori dataset overview is typically unavailable, limiting AWT’s applicability. In contrast, STS operates purely on the model’s frozen prototypes and per-sample augmentations, with no dependence on external models or dataset-level information. These additions will clarify why, even among training-free methods, STS occupies a unique point on the performance-efficiency frontier.

Comment

Dear reviewer VSFB,

Thank you for your positive recommendation of our work and for dedicating your time and effort to its evaluation.

As today concludes the author-reviewer discussion phase, we trust our comprehensive responses have resolved your concerns. Should any point remain unclear, please let us know and we will be happy to provide further clarification.

Thank you again!

Best,

Authors

Review
Rating: 4

This paper addresses the problem of test-time adaptation of CLIP-like vision-language models, where the goal is to adapt the pretrained CLIP model on the fly for each test sample. A pioneering work that addresses the same task is [29], where the test-time adaptation is performed using prompt learning. This paper adopts the same framework, composed of image augmentation to multiple views and confident views selection. Yet, instead of tuning the prompt, which requires full backpropagation over the text encoder, it performs a singular value decomposition of the text embeddings matrix and identifies a spectral subspace with a basis formed from the largest right singular vectors. Then a vector is learned in this basis, which leads to a steering of initial class prototypes (i.e. text embeddings). The proposed method is faster and requires less memory than test-time prompt tuning, and its effectiveness is shown for out-of-distribution generalization and cross-dataset generalization.

[29] Shu et al., Test-time prompt tuning for zero-shot generalization in vision-language models. NeurIPS 2022
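To make the described pipeline concrete, here is a minimal sketch of one adaptation episode (a sketch only: the function name, learning rate, filtering ratio, and step count are illustrative placeholders, and the L2 regularizer on the steering parameters is omitted):

```python
import torch

def sts_episode(view_feats, text_protos, k_t, steps=1, lr=5e-3, keep=0.1):
    """Sketch of one episodic STS update: steer the frozen text prototypes
    along the top-k_t right singular vectors by marginal entropy minimization.

    view_feats:  (N, D) L2-normalized embeddings of the augmented views
    text_protos: (C, D) frozen, L2-normalized text prototypes
    """
    _, _, Vt = torch.linalg.svd(text_protos, full_matrices=False)
    B = Vt[:k_t].T                                   # (D, k_t) spectral basis
    gamma = torch.zeros(k_t, requires_grad=True)     # per-sample coefficients
    opt = torch.optim.AdamW([gamma], lr=lr)

    for _ in range(steps):
        protos = text_protos + B @ gamma             # steered prototypes
        protos = protos / protos.norm(dim=-1, keepdim=True)
        probs = (100.0 * view_feats @ protos.T).softmax(dim=-1)   # (N, C)
        ent = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        idx = ent.topk(max(1, int(keep * len(ent))), largest=False).indices
        p_bar = probs[idx].mean(dim=0)               # marginal over confident views
        loss = -(p_bar * p_bar.clamp_min(1e-12).log()).sum()
        # (the L2 regularizer on gamma is omitted in this sketch)
        opt.zero_grad(); loss.backward(); opt.step()

    with torch.no_grad():
        protos = text_protos + B @ gamma
        protos = protos / protos.norm(dim=-1, keepdim=True)
        return (100.0 * view_feats @ protos.T).softmax(-1).mean(0).argmax()
```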

Strengths and Weaknesses

Strengths

  • The paper is written with a clear language and is easy to follow, contributions are well-conveyed and the method is well-illustrated (e.g. Figure 1).

  • Performing test-time adaptation of CLIP in the latent space instead of prompt tuning is a very interesting direction, especially thanks to the faster optimization and lower memory usage it induces.

Weaknesses

1- Missing important details

There are several missing details about the method and experiments. For example, the actual value of $k_t$, the number of optimizable parameters for each dataset, is missing. The value of $\lambda_R$, the regularization loss weight, is not specified either.

2- The method

On the method's level, the necessity of the SVD decomposition is not justified enough. In other words, why isn't adaptation performed by just learning parameters $R \in \mathbb{R}^{C \times D}$, initialized with zeros, on top of $Z_{T_{\text{init}}}$, similarly to [A]? This means that there might be no justified need for optimization in a subspace, and a residual can be directly learned on top of the text prototypes. Additionally to $R \in \mathbb{R}^{C \times D}$, which is class-dependent, another choice could be explored by learning $r \in \mathbb{R}^{D}$, a shared residual across all classes.

3- The experiments

  • On the experimental level, when not using ensembling (i.e. 7 templates to encode class names), STS performance is very close to TPT except on ImageNet-A (cf. Table 1). The reasons why such a difference occurs only and specifically for this dataset are not discussed. Such discussion/interpretation is crucial to better understand insights about the effectiveness of the method, as well as to estimate where using the proposed setting could be best. Otherwise, when ensembling is used, the performance gain is more significant. This is a good signal about the complementarity between using ensembling and the proposed adaptation approach, which can't be leveraged in prompt tuning methods like TPT. However, using the ensembling imposes exploring different design choices for the method. One of them could be the residual mentioned above [A]. Other choices could be simply tuning $Z_{T_{\text{init}}}$, or using adapters [8], or tuning the projection matrix of the vision encoder [B]. All the mentioned methods allow for ensembling and do not perform prompt tuning, so no full backpropagation is performed. Such comparisons help corroborate the effectiveness of STS as a design.

  • In Tables 1 and 2, STS is tested on top of MaPLe. This setting is a bit confusing as there are missing details about the so-called " MaPLe initialization". How is MaPLe trained in both tables? Is it trained in a few-shot labeled setting? On which dataset is it trained? Calling the MaPLe-initialized model "zero-shot" in the same tables is also confusing. Additionally, STS seems to degrade MaPLe performance in Table 2. Why does it happen in Table 2, while the same observation doesn't hold in Table 1? Interpreting this might help practical choices: deciding when to use STS on top of a method like MaPLe and when not.

[A] Yu et al., Task Residual for Tuning Vision-Language Models. CVPR 2023
[8] Gao et al., Clip-adapter: Better vision-language models with feature adapters. IJCV 2024
[B] Fahes et al., CLIP’s Visual Embedding Projector is a Few-shot Cornucopia. arxiv 2024

Minor:

  • Line 176: Figure ?? -> Figure 2
  • Lines 247, 253: Table 1 not clickable
  • Line 256, 293: Table 3 not clickable
  • Line 306: whe -> when
  • Table 2 caption: cross-datesets -> cross-datasets

Questions

  • The paper claims that inference is 8× faster than test-time prompt tuning (Lines 16-17). What does inference exactly refer to in this context? Lines 51-52 in the supplementary material indicate that inference includes the optimization steps, since it claims that inference efficiency decreases with more steps. This might be confusing since inference usually refers only to the computation of a prediction from an input, using only a forward pass.

  • Why is the cross-dataset generalization setting called so? I might have missed something here, but this setting is referred to in few-shot adaptation [18,39] when training on ImageNet and testing the generalization ability of the adapted model on other datasets. However, in this paper, in Table 2, test-time adaptation is performed separately for each test image in each dataset. Thus, the cross-dataset generalization name might be confusing as there is no training on one dataset and testing on others.

[18] Khattak et al., Maple: Multi-modal prompt learning. CVPR 2023
[39] Zhou et al., Conditional prompt learning for vision-language models. CVPR 2022

Limitations

yes

Final Justification

I would like to thank the authors for the rebuttal. The latter partially addressed my concerns. I acknowledge having missed TPS [31] from the tables, which indeed learns shifts of class prototypes. However, regarding [B], the authors state that "[B] is evaluated using full-batch adaptation and has not been tested in an episodic TTA setting", which seems to be incorrect, as full-batch adaptation is only used for the few-shot adaptation experiments, while TTA is performed in [B] the same way it is performed in the proposed STS and the TPT [29] baseline, as Appendix E of [B] indicates.

That being said, I increased my score to 4 since [B] does not seem to be published yet and I would not penalize the authors for not comparing with an arXiv paper. However, I highly encourage the authors to include this baseline in their work, as it seems to achieve strong results on TTA by simply tuning the projection layer and will absolutely help in understanding whether performing SVD is necessary or not. I also highly encourage the authors to include the discussion about the apparent effectiveness of methods tuning parameters in the deep layers for ImageNet-A.

[29] Shu et al., Test-time prompt tuning for zero-shot generalization in vision-language models. NeurIPS 2022

[31] Sui et al., Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models. WACV 2025

[B] Fahes et al., CLIP’s Visual Embedding Projector is a Few-shot Cornucopia. arxiv 2024

Formatting Issues

None

Author Response

Dear Reviewer EXj6,

We greatly appreciate your valuable feedback on our paper. We address the raised concerns and questions below.

W1 - Missing Details

Spectral dimension $k_t$: We select $k_t$ by applying the Gavish-Donoho thresholding rule to the singular-value spectrum of the text prototypes. For our CLIP ViT-B/16 experiments with the hand-crafted prompt "a photo of a", this yields:

| Dataset | ImageNet | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-Sketch | Flower102 | DTD | OxfordPets | UCF101 | Caltech101 | Aircraft | EuroSAT | StanfordCars | Food101 | SUN397 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $k_t$ | 106 | 31 | 106 | 37 | 106 | 20 | 7 | 10 | 19 | 16 | 29 | 3 | 51 | 19 | 72 |

W2 - The method

We thank the reviewer for prompting a deeper comparison. Three points clarify why SVD-based steering outperforms both class-specific and shared residuals:

a) *Structured, low-dimensional adaptation*: As discussed in Section 3.2 (main manuscript), pretrained deep features exhibit a low intrinsic dimension [1], meaning their essential semantics lie on a smaller manifold. By performing an SVD of the full text-prototype matrix we extract the principal singular vectors that capture the most salient axes of class variation. Adapting within this constrained subspace inherently regularizes learning (preserving the rich, frozen knowledge of the VLM) and focuses updates along directions of maximal semantic relevance, rather than allowing noisy, unconstrained shifts in the D-dimensional embedding space.

b) *Relation to TPS's residuals*: In Section 2 (main manuscript) we describe Test-Time Prototype Shifting (TPS) [Sui et al., WACV 2025], which indeed learns either a full class-specific residual $R \in \mathbb{R}^{C \times D}$ or a shared residual $r \in \mathbb{R}^D$ on top of $Z_{T_{\text{init}}}$, avoiding backprop through the encoders, similar to [A]. Our STS method advances this by introducing a spectrum-aware mechanism: rather than learning arbitrary shifts, we learn compact coefficients that steer prototypes only along the top-k singular directions. TPS's experiments show that class-specific residuals modestly outperform shared ones, but both remain unconstrained and thus prone to overfitting under scarce, unlabeled test-time data.

c) *Empirical evidence*: As reported in Section 4 (main manuscript), STS, using just $k_t$ parameters per sample, consistently outperforms TPS across all 15 benchmarks. This demonstrates that constraining adaptation to a semantically meaningful, low-rank subspace yields better OOD accuracy and stability than learning full or shared residuals in the high-dimensional prototype space.

Together, these points show that SVD-driven subspace adaptation is not only theoretically motivated by intrinsic dimensionality, but also empirically superior to both class-dependent and shared residual approaches.

W3 - The experiments

Thanks for your thoughtful feedback. We acknowledge that our STS method largely surpasses or compares favorably against state-of-the-art test-time adaptation methods, while introducing only a handful of additional parameters and achieving inference speeds up to 8× faster with a 12× smaller memory footprint than conventional test-time prompt tuning. Here we discuss why without ensembling, STS and TPT yield closer performance on most ImageNet variants, but diverge sharply on ImageNet-A:

a1) ImageNet-A is made of "borderline" images that exploit spurious, low-level cues which CLIP has not fully suppressed. ImageNet-A was intentionally filtered to keep pictures whose foreground object is hard to spot (small, partially occluded, off-center) or whose background contains misleading texture cues. Its adversarial filtering process selects images that standard models misclassify, so they lie at the extremes of semantic variation, which align closely with the principal axes of the text-embedding manifold. Because every image already sits near the decision boundary of the right class, the model's output entropy is high, and a tiny shift in the decision boundary flips a large fraction of errors into correct predictions. STS adapts by directly steering the text prototypes along the top-k singular vectors of the text-embedding matrix (directions that capture the dominant semantic axes, e.g. object vs. background, canonical vs. extreme pose). Under the extreme distortions of ImageNet-A, the optimal correction indeed lies very close to one of those principal axes, so a single spectrum-aware steering step recovers a disproportionate amount of accuracy. TPT's prompt tokens must propagate their effect through multiple transformer layers, making it harder to effect such a large, coherent shift in a single gradient step (especially when tuning must remain lightweight). The same observation holds for the TPS method since, as our experiments show, TPS greatly improves performance over TPT.

a2) Complementarity with ensembling: The framework requires a single representation per class, but if such a representation is derived from a single prompt, the amount of information it carries is limited by the number of input tokens that the CLIP text encoder was trained with. Hence, we can easily improve the robustness of the class prototypes by taking the mean of each class's embeddings. This allows us to leverage multiple prompts and retain the knowledge from these various representations while keeping the computational and memory efficiency of our method, without sacrificing prototype representation strength. As we mention in the paper, the design of TPT does not allow leveraging this text ensemble (as also pointed out by concurrent work [31]).

a3) Alternative adaptation choices: Thank you for your suggestion. We note that TPS's class-specific and shared-residual variants (which mirror two of the suggested choices: [A] and directly tuning $Z_{T_{\text{init}}}$) are already compared in Table 2 and fall short of STS's performance (confirming the value of our spectral bottleneck). CLIP-Adapter [8] and [B] rely on a prevalent assumption, namely the accessibility of labeled data from the target domain, a condition that is frequently incompatible with the requirements for rapid deployment in real-world applications. In addition, a full study of every ensemble-compatible alternative is beyond our current scope. We believe the principled regularization and semantic interpretability of SVD-based adaptation make STS uniquely effective, especially in the episodic black-box TTA setting that we investigate.

We thank the reviewer for pointing out the ambiguity around our MaPLe initialization and for asking about the differing TPT and STS effects in Tables 1 vs 2. We follow the official MaPLe implementation of Khattak et al. (CVPR 2023), which provides three pretrained prompt sets obtained via few-shot prompt learning on ImageNet, each corresponding to a different random seed. To avoid arbitrarily favoring any single seed, we associate one of these three weight sets with each of our three independent runs. In all tables, the entry “MaPLe zero-shot” therefore means “apply the respective ImageNet-trained MaPLe prompt to the target OOD benchmark without any additional adaptation.” We will state this explicit mapping in the revision and annotate the tables accordingly to remove any confusion.

Why TPT and STS can slightly degrade MaPLe on cross-dataset TTA (Table 2)?

Specialized base prompt: MaPLe’s prompt tokens are finely tuned to the ImageNet distribution. When they are transferred to datasets with different semantics or styles, those tokens already sit close to a local optimum for many classes. As a result, subsequent shift-based updates (TPT or STS) have limited head-room and can occasionally over-correct, producing a small drop.

For the natural ImageNet variants, the ImageNet-trained MaPLe prompt remains reasonably well aligned with the target domain. In this case STS identifies truly helpful low-rank principal directions, yielding a net gain, while TPT achieves positive results as well. We will incorporate this explanation into the manuscript to make the interaction between MaPLe, TPT, and STS clear across all benchmarks.

[Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.]

Question 1 (“inference-time” wording)

Thank you for pointing out the possible ambiguity. In the submission we used the term “inference” to denote the entire test-time adaptation pipeline:

The time it takes to produce a first prediction for a previously unseen image, including any optimization / update steps that the TTA method performs on that very image. We agree that “inference” is often reserved for a single forward pass through a frozen model. To avoid confusion, we will replace “inference time” with “end-to-end adaptation latency” throughout the paper.

Question 2 (“cross-dataset generalization”)

Thank you for drawing our attention to the possible confusion. In recent TTA literature the expression has been used in the following way:

This experiment was first designed in the TPT paper to compare few-shot prompt learning vs instance-specific prompt learning (without supervision).

In our case, there is no few-shot learning, so this would not apply. To avoid confusion with the few-shot prompt learning, we will rename this setting to fine-grained generalization in Table 2.

Typos & citation links.

All figure/table references and minor errors (Lines 176, 247, 253, 256, 306) will be corrected.

Comment

Dear Reviewer EXj6,

Thank you sincerely for the time and effort you have volunteered to evaluate our manuscript.

Since today is the final day for the author-reviewer discussions, we hope our detailed explanations have fully addressed your concerns. Should you need any further clarification on any points, please do not hesitate to reach out to us.

Thank you again!

Best,

Authors

Comment

I would like to thank the authors for their answers.

W1: there are still missing details about the selection of the regularization loss weight $\lambda_R$.

W2: thanks, these answers make sense, I missed TPS.

W3: a1) Thanks for your answer. The provided interpretation is intuitive; however, it lacks theoretical rigor and/or empirical evidence, and intuition isn't sufficient for explaining the effectiveness of STS on ImageNet-A. However, I would not penalize this point, as it might be too much to ask for a theoretical interpretation. The remark about TPS is interesting: it seems that in the specific case of ImageNet-A, methods that tune parameters in the deep layers are better than others tuning parameters in earlier layers. Also please note that [B], which fine-tunes the projection matrix of the vision encoder, also presents remarkably strong performance on ImageNet-A for test-time adaptation (cf. Table 9); this adds more evidence to the effectiveness of fine-tuning parameters close to the latent space for this specific dataset. This discussion is crucial to include as it can trigger future research in understanding "what to fine-tune" for a specific dataset.

a3) [8] and [B] do not necessarily need labeled data. They can be applied within the same test-time adaptation framework in a straightforward manner. [B] already performs experiments on test-time adaptation and shows strong results compared to TPT by adopting the same framework but training the projection matrix instead of tuning the prompt. This is an extremely simple baseline that seems to be powerful and can call into question the necessity of the STS method. [8] doesn't show experiments on test-time adaptation, but it can be applied to this setting in the same manner, as can the simple regularized linear adapter proposed in [B] if black-box adaptation is required.

Thanks for the explanation about MaPLe. So in both Tables 1 and 2, MaPLe is trained on ImageNet (is it 16-shot? please specify this in the paper) and evaluated on variants (Table 1) and other datasets (Table 2). Please make sure to write this clearly in the paper.

Q2 I see that the word "generalization" propagated from TPT, because it was showing few-shot training then generalization vs. tuning the prompts on-the-fly for datasets that were OOD for the few-shot setting. "Cross-dataset generalization" and "fine-grained generalization" can both be misleading, again because for test-time adaptation, the source domain is the pretraining dataset of CLIP and STS/TPT is performed on-the-fly for each test image in each dataset. I see that this naming started in TPT, but it's a bit misleading and it might be better not to propagate it. That being said, this point is minor.

Comment

Response to W1

We thank the reviewer for pointing this out. In our experiments, we select the value of the regularization loss weight by running 5 adaptation steps on the ImageNet test set and keeping this value fixed for all benchmarks. We will add this detail to the manuscript to ensure reproducibility.

Response to W2

We appreciate the reviewer’s comment and clarification regarding TPS.

Response to W3-a1

We appreciate the reviewer’s insightful remarks about the connection between ImageNet-A performance and tuning parameters in deeper layers. We will definitely incorporate this discussion into the manuscript, including the supporting evidence from method [B], and highlight its implications for understanding “what to fine-tune” for specific datasets.

Response to W3-a3

We thank the reviewer for the insightful observations regarding [8] and [B].

For [8], while we acknowledge that its architecture could in principle be applied within a TTA framework, we note that the method itself has only been evaluated in a few-shot supervised setting and was not originally designed for black-box, per-instance adaptation. In our initial exploration of STS, we implemented a non-linear adapter following the same architecture as [8] (i.e., $\mathrm{ReLU}(W^T W_t^1) W_t^2$) and evaluated it in the same black-box TTA setting. We observed that its performance was markedly lower than STS. We believe this is because such adapters introduce a substantially larger number of parameters that generally require labeled supervision and multiple gradient steps to learn effectively, conditions that are absent in our target setting. We will include this empirical observation in the manuscript for completeness.

Regarding [B], which is also proposed by the authors as a few-shot method, we respectfully note that its approach involves fine-tuning the projection layer of the vision encoder. In a black-box TTA scenario, where access to internal model weights is prohibited, this operation is not feasible. Furthermore, [B] is evaluated using full-batch adaptation and has not been tested in an episodic TTA setting, which is the focus of our work. For these reasons, we did not include [B] in our experiments. We will further clarify these constraints explicitly in the revised manuscript.

Other points

Regarding MaPLe, in both Tables 1 and 2, MaPLe is trained in a 16-shot setting on ImageNet and then evaluated either on ImageNet variants (Table 1) or on other datasets (Table 2). We will clearly state this in the manuscript. For terminology, we agree that “cross-dataset generalization” may be misleading in this context, and we will make sure not to propagate this term in the final version.

Final Decision

This paper has received mixed recommendations before the rebuttal as follows: 2 x Borderline Reject (EXj6, QTGH), and 3 x Borderline Accept (VSFB, 2jDh, eVss).

Overall the reviewers appreciated the practicality of this approach, its good performance in terms of scores, parameter and computational efficiency and the clear writing.

Regarding the criticisms and concerns, the reviewers pointed out the lack of several computationally efficient baselines, of different CLIP encoders, of ablations and potentially limited generalization on distribution shift.

The authors submitted a rebuttal to respond to reviewer questions and criticism. After the rebuttal and discussions, reviewers generally increased their scores: EXj6 and QTGH to Borderline Accept, and 2jDh to Accept. The reviewers unanimously recommend this work for acceptance, but still have some recommendations of baselines and comparisons to be made in the updated paper, in particular from EXj6 and VSFB.

After analyzing and discussing the strengths and weaknesses identified by the reviewers, the meta-reviewer agrees with the feedback and recommendations expressed by the reviewers and recommends this work for acceptance. We encourage the authors to take into consideration the useful advice from the reviewers towards improving this work.