PaperHub

Overall rating: 4.5/10 (withdrawn; 4 reviewers; lowest 3, highest 5, standard deviation 0.9)
Individual ratings: 3, 5, 5, 5
Confidence: 3.3 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 3.0

ICLR 2025

Test-time Zero-shot Recognition with Good Attributes

Submitted: 2024-09-25 · Updated: 2024-11-15

Abstract

Keywords

Test-time adaptation, prompt learning, attribute search, soft voting, vision recognition

Reviews and Discussion

Review
Rating: 3

The paper presents a method for test-time zero-shot recognition that leverages attribute-based reasoning to improve model performance. The experiments are thorough, and the results are compelling.

Strengths

The paper introduces a framework, Search4Prompt, to address the challenge of test-time adaptation in zero-shot learning scenarios. Its two main components, the Retrieval-based Attribute Search (RAS) module and the Implicit-Explicit Attribute Injection (IEAI) module, contribute to the overall effectiveness of the framework. The paper provides extensive experimental results demonstrating the effectiveness of Search4Prompt over existing methods on benchmark datasets.

Weaknesses

  1. The idea of searching for relevant attributes for ZSL is not novel. For example, [a] and [b] show that it is possible to achieve the same recognition accuracy with a significantly smaller attribute vocabulary. What are the differences from these works? Besides, the proposed Retrieval-based Attribute Search is just a cosine-similarity evaluation, which is somewhat simplistic; a minimal sketch of this kind of retrieval is given after the references below.
  2. The effectiveness of Search4Prompt relies on the comprehensiveness of the attribute bank. However, details such as how the attribute bank is determined and which attributes it includes are not clear.
  3. As we know, LLMs are sensitive to prompts. An evaluation of the framework's robustness to noisy or incorrect attribute descriptions from the LLMs is missing. It is also very important to evaluate the quality of the generated attribute descriptions.

[a] Learning Concise and Descriptive Attributes for Visual Recognition. ICCV 2023.

[b] Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification. CVPR 2023.
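
For illustration only (this is not the authors' code): a minimal sketch of what a purely cosine-similarity-based attribute retrieval step could look like, matching the reviewer's characterization above. The attribute-bank size, embedding dimension, top-k value, and the random placeholder embeddings (standing in for CLIP text/image features) are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_attributes, dim, top_k = 1000, 512, 15

# Placeholder embeddings: in practice these would be CLIP text embeddings of
# the candidate attributes and the CLIP image embedding of one test sample.
attribute_bank = rng.normal(size=(num_attributes, dim))
test_image = rng.normal(size=(dim,))

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Cosine similarity between the test image and every attribute embedding.
sims = l2_normalize(attribute_bank) @ l2_normalize(test_image)

# Retrieve the top-k most similar attributes to build the prompt.
retrieved = np.argsort(-sims)[:top_k]
print(retrieved, sims[retrieved])
```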

Questions

See weaknesses above.

Review
Rating: 5

This paper presents a new test-time adaptation method that identifies critical attributes for prompt learning. The selected attributes are used for text prompt augmentation and pseudo-labeling via an implicit-explicit attribute injection module. The experiments demonstrate the method's effectiveness.

Strengths

  1. The experimental results demonstrate the effectiveness of the proposed method.

  2. This paper is well-written and easy to follow.

Weaknesses

  1. The proposed method reduces testing efficiency compared to the baseline, as the time required to search for attributes grows with the size of the attribute pool and the number of categories, making it difficult to scale up. As shown in Table 8, the inference time increases considerably on ImageNet-R when using three optimization steps.

  2. This paper lacks a discussion of and comparison with several prompt learning methods that also utilize fine-grained attributes for VLM adaptation, e.g., ArGue [1] and MAP [2].

[1] ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models. CVPR 2024.

[2] Multi-Modal Attribute Prompting for Vision-Language Models. TCSVT, 2024.

Questions

  1. It is odd that the few-shot setting and the unsupervised setting are compared in the same table (Table 1). Can you provide results under the few-shot learning setting?

  2. Could you offer a qualitative analysis of the generated and retrieved attributes for each category, with more examples?

  3. Typos: Line 166 Retrieve.

Review
Rating: 5

This paper leverages descriptors generated by large language models (LLMs) to assist vision-language models (VLMs) at the inference stage, achieving performance improvements across multiple datasets.

Strengths

  1. The paper is well-written and organized, making it easy to follow.

  2. Leveraging LLM-generated descriptors to enhance VLM performance during inference is a reasonable and meaningful approach.

  3. The paper demonstrates notable performance improvements over baseline methods, supporting the practical value of the approach.

Weaknesses

  1. The approach of generating descriptors and the design of the Discriminative Attribute Generation module lack novelty; directly using top-k prototypes for each class at inference is also a common approach. Similarly to the soft voting used here, DVDet [1] also selects high-confidence descriptors by voting, which can lead to ambiguous category selection through misclassification. Please clarify the distinctions.

  2. CaF [2] should be used as a baseline to demonstrate the effectiveness of the method.

  3. The statement, “The challenge for our test-time adaptation, where test data lacks labeled information, lies in how to retrieve discriminative attributes from A and use them to generate specific prompts for each test sample,” is unnecessary, as test data inherently lacks label information.

References:

[1] LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-Grained Descriptors.

[2] Visual Classification via Description from Large Language Models.

Questions

Please see the weaknesses above.

Review
Rating: 5

This paper introduces a test-time adaptation method called Search4Prompt, which improves TPT by identifying representative "good" attributes.

Strengths

  1. In general, the paper is well organized and clearly written.

  2. They retrieve and combine discriminative attributes to tune prompts during test-time adaptation.

Weaknesses

  1. While the paper introduces an improved version of TPT by integrating text-based attributes, it would benefit from a clearer articulation of the specific innovations distinguishing it from existing TPT approaches. For instance, the authors could strengthen the novelty claim by further emphasizing unique aspects of the Retrieval-based Attribute Search (RAS) module or exploring novel attribute filtering techniques.

  2. While RAS effectively retrieves attributes, it shows limitations in filtering irrelevant ones, especially when relying on LLM-generated attributes. To address this, it would be beneficial if the authors evaluated Search4Prompt with multiple LLMs to assess generalizability. Additionally, the paper could enhance its comparative analysis by including results from consistent auxiliary models, such as ViT-B-16-based CLIP, to provide a balanced perspective on performance variations under different model sizes (see Table 6).

  3. It is not clear how the soft voting scores are derived from the top-k attributes. Following the approach described in Section 3.2.1, the class with the highest probability in the pseudo-labels of the IEAI module in Figure 1 should be "Red velvet cake," with the highest matching value of 0.23, rather than "cup cakes," which only reaches 0.22. The authors are encouraged to provide a step-by-step illustration or example of the soft voting process, explaining how these scores are calculated, as this would improve understanding and reproducibility (one plausible interpretation is sketched after the references below).

  4. Given the addition of 15 retrieved attributes, it would be valuable for the authors to provide a comparison of computational resources, such as memory usage, parameter count, and FLOPS, between their method and baselines (e.g., TPT and TDA). This addition would help readers assess the computational trade-offs associated with the approach.

  5. The experimental results require further verification. First, the same baseline model yields different results in Table 3 and Table 4. Second, the VCD results in Table 5 appear to be questionable. For instance, VCD [1] achieves a zero-shot result of 86.92 on the Pets dataset without any prompts. Based on experience, a combination of VCD with the baseline TPT should yield improved results; however, this paper reports only 81.30. Similarly, in a related paper [2], CLIP + A evaluates pre-trained CLIP with attributes obtained from LLMs, achieving 80.84 on the Flower dataset, while this paper reports only 69.23. I recommend that the authors revisit and verify these results, providing further details on the experimental setup. Additionally, an analysis to reconcile the reported outcomes with those from related studies (e.g., VCD and CLIP + A) would enhance result validity and comparability.

[1] Menon, Sachit, and Carl Vondrick. "Visual Classification via Description from Large Language Models." ICLR 2023.

[2] Saha, Oindrila, Grant Van Horn, and Subhransu Maji. "Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions." CVPR 2024.
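
To make weakness 3 concrete, here is one plausible, purely hypothetical reading of how soft voting over the top-k retrieved attributes might aggregate per-class matching scores: if each attribute's class scores are normalized and then averaged, the class with the single highest matching value ("Red velvet cake", 0.23) need not receive the highest pseudo-label. This is not the paper's actual procedure; all numbers and class names are illustrative.

```python
import numpy as np

classes = ["Red velvet cake", "Cup cakes", "Carrot cake"]

# Rows: top-k retrieved attributes; columns: per-class matching scores.
attr_class_scores = np.array([
    [0.23, 0.22, 0.10],   # attribute 1 slightly favours "Red velvet cake"
    [0.12, 0.21, 0.11],   # attribute 2 favours "Cup cakes"
    [0.10, 0.20, 0.09],   # attribute 3 favours "Cup cakes"
])

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Soft vote: average each attribute's softmax-normalized class distribution.
pseudo_label = softmax(attr_class_scores).mean(axis=0)
print(dict(zip(classes, pseudo_label.round(3))))  # "Cup cakes" wins overall here
```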

Questions

No questions.

Withdrawal Notice

I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.