PaperHub

Rating: 4.8 / 10 — Rejected (4 reviewers)
Individual ratings: 4, 5, 5, 5 (min 4, max 5, std 0.4)
Average confidence: 4.0
Correctness: 2.8 · Contribution: 2.5 · Presentation: 2.3

NeurIPS 2024

Class Concept Representation from Contextual Texts for Training-Free Multi-Label Recognition

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

Inspired by the cognitive neuroscience perspective on how humans form concepts for words, we propose a class concept representation that exploits vision-language models for training-free multi-label recognition.

Abstract

Keywords
Training-free Multi-Label Recognition · Vision-Language Model · Class Concept · Context-Guided Visual Feature

Reviews and Discussion

Review (Rating: 4)

The paper supplies a post hoc method to tune the ResNet-based CLIP model on the multi-label recognition task. First, the method includes a class concept representation, which is an alternative to the default prompt "The photo of a {class}". It is the average of class description sentence embeddings from a text description source (MS-COCO and GPT-3.5-generated captions in the paper). Second, the paper proposes a sequential attention mechanism to iteratively transfer the visual features to align with the class concept representation.

Strengths

  1. Training-free enhancement: The proposed method significantly improves zero-shot and prompt-tuning performance without requiring additional training or labeled samples, making it computationally efficient.
  2. Robust performance: Experimental results on multiple benchmark datasets (MS-COCO, VOC2007, and NUS-WIDE) show substantial performance gains, demonstrating the method's effectiveness.

Weaknesses

  1. Lack of clear differentiation: While the authors mention TaI-DPT and claim differences, the paper does not clearly articulate the advantages of the proposed method over TaI-DPT, leaving the comparative benefits ambiguous.

  2. Unfair comparison in Table 5: The Class Concept Representation is based on the MS-COCO dataset, making direct comparisons with the baseline CLIP method potentially unfair due to inherent advantages provided by the dataset-specific information.

  3. Limited model implementation: The paper only implements the ResNet-based CLIP model and does not explore transformer-based CLIP models. It is unclear whether the method is ineffective for transformer-based models or if there are specific reasons behind this omission. This limits the generalizability of the findings.

  4. Ambiguous terminology: The paper uses the term "training-free" in its title, yet it describes the approach as "test-time adaptation" within the content. This inconsistency can lead to confusion about the nature of the proposed method.

  5. Dependent on the source text descriptions: It seems that the text description source needs to be carefully selected. A comparison of different description datasets would be interesting.

Questions

  1. Confusing softmax explanation: The description of softmax values in Figure 3 is unclear. Typically, in CLIP, text-image similarity is calculated using cosine similarity, with softmax applied separately to each modality. The paper does not adequately explain why softmax is necessary between image and text similarities in this context.
  2. I am curious about the reasons behind the performance improvement observed when switching from default prompts to class concept representation. Is the enhancement due to the use of phrase-level embeddings being more effective than word-level embeddings? Can you provide any insights or explanations for this improvement?

Limitations

See the weaknesses and questions.

Author Response

We appreciate your valuable comments and suggestions.

[W1] Clarification of the difference between TaI-DPT and our method.

To clarify, our method differs from TaI-DPT in two key ways:

  • Training-free approach: Unlike TaI-DPT, our method performs multi-label recognition without any training, using massive text descriptions, inspired by how humans form concepts of words (as detailed in L75-80).
  • Enhanced adaptability: Our method employs a sequential attention mechanism to adaptively project visual features onto the same linear space as the class concept representation. This adaptive approach, in contrast to TaI-DPT, leads to improved visual-text alignment at inference. We will revise this explanation for further clarity.

[W2] Unfair comparison in Table 5.

The class concept representation in Table 5 is based on the MS-COCO captions, potentially leading to inherent advantages from dataset-specific information on the MS-COCO dataset only. However, several points should be considered:

  • Ablation Study: We isolated our method's effectiveness by using the training set of MS-COCO captions in an ablation study.
  • Consistency Across Datasets: We observed consistent performance improvements on the VOC2007 dataset, which lacks MS-COCO-specific information.
  • Generalizability to NUS-WIDE: Using GPT-3.5 generated texts from only target class names, we demonstrated performance gains on NUS-WIDE (Table 1), supporting our approach's broader applicability.

[W3] Model implementation for fair comparison.

To ensure a fair comparison with existing methods, we intentionally opted to implement ResNet-based CLIP. It is important to note that TaI-DPT and DualCoOp, the SOTA benchmark methods, were also developed only on ResNet architectures.

[W4] Explanation of the terminology "training-free" and "test-time adaptation" in our method.

Our approach can be accurately described as both training-free and employing test-time adaptation.

  • Test-time adaptation: At inference, the visual features of a test image are adaptively projected onto the linear space of the class concept representation $\mathcal{Z}$ through sequential attention. This adaptive process dynamically enhances visual-text feature alignment and occurs entirely at test time without requiring any training, making it a form of test-time adaptation.

This aligns with Reviewer jAJ4's strength comment that "The proposed method does not require training and can be seen as a form of test-time adaptation."

[W5] Dependent on source text description.

Thank you for the interesting suggestion. We acknowledge the reviewer's concern regarding the potential dependence of our method on the quality of text descriptions. To address this, we conducted experiments using both high-quality (MS-COCO captions) and low-quality text descriptions. The low-quality texts (e.g., "a art of a chair, mouse, car, and bus.") were generated naively by combining 1~4 random class names with simple handcrafted templates like "a photo of {}" and "a art of {}".

Table 1: Evaluation of our method with low-quality texts

| Zero-shot MLR | MS-COCO | VOC2007 |
| --- | --- | --- |
| ZSCLIP | 57.4 | 82.8 |
| + Ours (with low-quality texts) | 66.8 | 88.0 |
| + Ours (with MS-COCO captions) | 70.0 | 89.2 |

Even when using low-quality, context-free text descriptions, our approach significantly improves the baseline (ZSCLIP). This indicates that while carefully curated descriptions can enhance performance, our method effectively leverages diverse text information from even simplistic combinations of random class names.
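For concreteness, a minimal sketch of how such low-quality texts could be assembled; the function name, template strings, and sampling scheme are illustrative assumptions rather than the exact script used for the rebuttal:

```python
import random

def make_low_quality_texts(class_names, templates, n_texts=1000, seed=0):
    """Naively combine 1~4 random class names with simple handcrafted
    templates to produce context-free descriptions (illustrative sketch)."""
    rng = random.Random(seed)
    texts = []
    for _ in range(n_texts):
        k = rng.randint(1, 4)                    # pick 1~4 random class names
        names = rng.sample(class_names, k)
        if k == 1:
            filled = names[0]
        else:
            filled = ", ".join(names[:-1]) + ", and " + names[-1]
        texts.append(rng.choice(templates).format(filled))
    return texts

# Example usage, producing texts like "a art of a chair, mouse, car, and bus."
texts = make_low_quality_texts(
    ["chair", "mouse", "car", "bus", "person", "dog"],
    ["a photo of a {}.", "a art of a {}."],
)
```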

[Q1] Explanation of Softmax operation for measuring the similarity between visual and text features.

The cosine similarity between L2-normalized visual ($f$ and $F$) and text ($T$) features is computed using matrix multiplication as detailed in Eqs. 2 and 3 ($t$ refers to the transpose operation). The softmax operation is subsequently applied to these cosine similarity values, and Figure 3 visualizes the resulting softmax values. We will clarify this in more detail within the paper to avoid any further misunderstandings.
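As a rough illustration of this computation, a PyTorch-style sketch assuming a single global visual feature $f$ and a matrix $T$ of text features; the temperature value and function name are assumptions, not the authors' code:

```python
import torch
import torch.nn.functional as F

def text_image_softmax(f, T, alpha=1/80):
    """Cosine similarity between L2-normalized visual and text features via
    matrix multiplication, followed by a softmax over the text descriptions
    (alpha plays the role of a temperature; its value here is illustrative)."""
    f = F.normalize(f, dim=-1)                 # (1, D) global visual feature
    T = F.normalize(T, dim=-1)                 # (N, D) text description features
    sim = f @ T.t()                            # (1, N) cosine similarities
    return torch.softmax(sim / alpha, dim=-1)  # softmax values, as visualized in Fig. 3
```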

[Q2] Performance improvement from class concept representation.

The class concept representation $t_c^{concept}$ combines the vectors of text descriptions, capturing co-occurrence patterns with other classes and rich contextual information. Thus, $t_c^{concept}$ encapsulates essential information about relevant classes and nuanced context for the target class. We hypothesize that this process mirrors how humans form concepts of words from past experience and is crucial for multi-label recognition. This implies that these phrase-level embeddings are more effective than word-level default prompts.
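A minimal sketch of how such a class concept representation could be computed, assuming a CLIP-style text encoder; the selection rule (descriptions mentioning the class name) and all names are illustrative assumptions, not the authors' pipeline:

```python
import torch
import torch.nn.functional as F

def class_concept(text_encoder, descriptions, class_name):
    """Build t_c^concept by averaging the embeddings of text descriptions
    that mention the target class name (equal-weight average, sketch only)."""
    selected = [d for d in descriptions if class_name.lower() in d.lower()]
    emb = F.normalize(text_encoder(selected), dim=-1)  # (N_c, D) text embeddings
    concept = emb.mean(dim=0)                          # equal-weight average
    return F.normalize(concept, dim=-1)                # renormalized concept vector
```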

To verify the effectiveness of $t_c^{concept}$, we compared the closest text captions in the MS-COCO captions to both (1) hand-crafted prompt embeddings $t_c^{hand\text{-}crafted}$ (e.g., "a photo of") and (2) the class concept representation $t_c^{concept}$.

  • Top-3 co-occurring objects of class "knife" in the MS-COCO captions
    • dining table, person, cup
  • Top-3 closest texts to $t_c^{concept}$
    • A knife sticking out of the top of a wooden table.
    • A person holding a knife on a cutting board.
    • A guy looking down at a big knife he is holding.
  • Top-3 closest texts to $t_c^{hand\text{-}crafted}$
    • A knife is sitting on a plate of food
    • A sharp bladed knife left out on the ground.
    • A knife about to cut a birthday cake in pieces.

The results indicate that our $t_c^{concept}$ is better aligned with text descriptions containing multiple class names that frequently co-occur with the target class "knife". This demonstrates the suitability of $t_c^{concept}$ for multi-label recognition tasks where images depict complex scenes with multiple objects.

Comment

I have read the author's response and maintain my rating.

Comment

Thanks to the authors for the detailed reply. Most of my questions have been resolved. I am still curious about the decision to use only a ResNet backbone instead of the more widely adopted transformer backbone. The reason behind this omission does not seem to be explained.

Comment

Thank you for your response. As mentioned in "[W3] Model implementation for fair comparison," the primary reason for omitting the implementation of a transformer backbone is that existing state-of-the-art (SOTA) methods (DualCoOp and TaI-DPT) are solely based on the CLIP ResNet architecture. To ensure a fair performance comparison with these methods, we also employ CLIP ResNet.

To explore the potential of our method with a different backbone, we extended our experiments to the ViT/B-16 architecture (Table 2). However, due to ViT's inherent lack of locality compared to CNNs [1,2], CLIP ResNet-101 shows better performance than CLIP ViT/B-16 in zero-shot CLIP (ZSCLIP). The performance improvement was also especially pronounced for CLIP ResNet-101. When it comes to capturing spatial information through local features, ResNet-based CLIP models are generally better suited for multi-label recognition tasks. To effectively handle local features, modifications to the vision transformer architecture are necessary, as exemplified by MaskCLIP [3].

Nevertheless, our proposed approach significantly improved zero-shot CLIP (ZSCLIP) performance for both ResNet and ViT backbones. Consequently, we have demonstrated the general applicability of our method to both ResNet and transformer architectures.

Table 2: Evaluation of our method on CLIP ViT/B-16.

| Zero-shot MLR | Architecture | MS-COCO | VOC2007 |
| --- | --- | --- | --- |
| ZSCLIP | ViT/B-16 | 56.9 | 81.8 |
| + Ours | ViT/B-16 | 65.6 (+8.7) | 86.7 (+4.9) |
| ZSCLIP | ResNet-101 | 57.4 | 82.8 |
| + Ours | ResNet-101 | 70.0 (+12.6) | 89.2 (+6.4) |

[1] Mao, Mingyuan, et al. "Dual-stream network for visual recognition." NeurIPS, 2021.

[2] Chen, Zhiwei, et al. "Lctr: On awakening the local continuity of transformer for weakly supervised object localization." AAAI, 2022.

[3] Zhou, Chong, et al. "Extract free dense labels from clip." ECCV, 2022.

Review (Rating: 5)

This paper proposes a class concept representation for zero-shot multi-label recognition in a label-free manner and introduces a context-guided visual feature that enhances the alignment of the visual feature of VLM with the class concept.

Strengths

  1. This paper presents a novel class concept representation for training-free multi-label recognition tasks using VLMs and massive text descriptions, inspired by how humans form concepts of words.
  2. This paper proposes a context-guided visual feature, which is transformed onto the same text feature space as class concepts using sequential attention, to better align multi-modal features.
  3. The method presented in this paper synergistically enhances the performance of ZSCLIP and other state-of-the-art just-in-time tuning methods, with a minimal increase in inference time.

Weaknesses

  1. Tip-Adapter is a training-free method proposed in 2021; it would be better to compare against a newer training-free method in the few-shot setting.
  2. It would be more appealing to emphasize label-free in the abstract.

Questions

  1. Are the sentences for the MS-COCO dataset generated using GPT's API? Is this process time-consuming?

Limitations

Yes

Author Response

We appreciate your valuable feedback and comments!

[W1] Additional multi-label few-shot method in Table 3.

As you mentioned, Tip-Adapter is a training-free few-shot method proposed in 2021, primarily designed for single-label classification. To incorporate a recent few-shot method for evaluating multi-label recognition, we adopt BCR [1], published in 2024, which is specifically designed for multi-label few-shot learning. BCR [1] has demonstrated superior performance, achieving 58.6 mAP for 1-shot and 66.5 mAP for 5-shot on the MS-COCO dataset. To provide a comprehensive evaluation, we will compare our method to BCR [1] in Table 3 of the main paper.

[1] An, Yuexuan, et al. "Leveraging Bilateral Correlations for Multi-Label Few-Shot Learning." IEEE Transactions on Neural Networks and Learning Systems (2024).

[W2] Emphasizing "label-free" in the abstract.

Thank you for pointing out the importance of emphasizing the 'label-free' aspect of our method. We agree that this is a crucial aspect of our work. We will revise the abstract to clearly highlight the 'training-free' and 'label-free' properties of our approach.

[Q1] Is the generation of massive sentences time-consuming?

The MS-COCO caption dataset we used is a publicly available dataset and was not generated using GPT-3.5.

For evaluating NUS-WIDE, we generated a large set of sentences using GPT-3.5. This process was relatively quick, taking only a few hours. In comparison to acquiring a labeled multi-label dataset, generating sentences was significantly less time-consuming.

Comment

I have read the author's response and maintain my rating.

Comment

We sincerely appreciate your constructive and positive feedback. We will carefully consider your valuable suggestions and incorporate them into the revised paper. Thank you for your time and efforts in reviewing our paper.

Review (Rating: 5)

The authors propose a method to adapt without training a large vision-language model for the task of multi-label recognition. They introduce a class concept representation, based on averaging the representation of image descriptions relevant to each class, to replace simple hand-crafted text prompts (e.g., “a photo of {class name}”). Furthermore, they propose to use a context-guided visual process to align visual features with the class concept representation. Experiments conducted on several benchmarks and in zero-shot and partial labeling settings show state-of-the-art performance compared to relevant baselines. Combination with some baseline methods further shows the improvements that can be obtained with the proposed method. Ablation studies show the contribution of each component of the method and the sensitivity to some of the method's parameters.

Strengths

  • The method achieves state-of-the-art performance
  • The proposed method does not require training and can be seen as a form of test-time adaptation
  • The method can be combined with existing prompt-tuning methods

Weaknesses

  • Parts of the method descriptions, especially the Context-Guided Visual Feature, are unclear
  • The method relies on thousands of text descriptions relevant to the target classes, which could hinder the scalability of the method to a large number of classes

Questions

  • Section 3.2 on Context-Guided Visual Feature is unclear, G is used both for the number of groups and iterations. Is the number of groups and the assignment of text descriptions to groups constant in this process? Figure 2(b) shows an iterative reduction of the number of groups but that does not seem to be described in the formulas. Have the authors explored alternatives like e.g. just selecting the top K descriptions with the highest similarities before computing softmax? Are the modulation parameters similar to a softmax temperature parameter?
  • Section 3.3, p5-l179:180, the authors state that “Following TaI-DPT [18], we adopt the structure of double-grained prompts (DPT)”. However, TaI uses localized text features (text tokens) in their approach, while it seems the authors only use the global text features (or an aggregation of them in the class concept representation). Are text tokens used directly anywhere in the proposed method? If not, the author should clarify this difference with TaI-DPT.
  • In Eq. (4), is the spatial aggregation in $S^{loc}_c$ sensitive to the spatial extent of a class?
  • Table 5, it seems there could be an intermediate setting where the class concepts are used instead of text descriptions in the “Context-guided visual feature” process. Has this been explored by the authors?
  • Table 6, with just 1K text descriptions the performance already exhibits most of the improvement, especially for VOC2007. How many text descriptions per class does that correspond to? Are there duplicates or near duplicates in these text descriptions when going to 32K or more? Have the authors considered a sampling strategy to select diverse text descriptions for each class?
  • “As the number of text descriptions ranges from 1K to 32K, the text embeddings of Zall can cover the wider range of test dataset, resulting in increased performance gains.“ Are all the text descriptions used for MS-COCO and VOC2007 coming from the training set? As more and more descriptions are added it is likely some descriptions would correspond almost exactly to all the elements present in the test image, in that sense the proposed method could be seen as “retrieving” the most relevant description(s) for an image to guide the multi-label recognition. That almost feels like using image captioning (by retrieval) to perform multi-label recognition. Have the authors explored what would be the performance of the approach if the “Context-guided visual feature” process was using the ground truth descriptions or a caption generated by an image captioning model?
  • The performance reported for DualCoOp does not seem to align with what is reported in their paper, were the results reported in the submission recomputed? What could explain the difference? Furthermore, an updated version of DualCoOp, DualCoOp++ [Hu, Ping, Ximeng Sun, Stan Sclaroff, and Kate Saenko. "Dualcoop++: Fast and effective adaptation to multi-label recognition with limited annotations." IEEE Transactions on Pattern Analysis and Machine Intelligence (2023)] was published last year and shows even better performance, why these results were not reported?

Typos etc:

  • p3-l123: “so that ways of using them are completely different.”
  • p4-l147: missing space “model(i.e.“
  • p5: Eq (2) and (3) use $\text{Softmax}_{\dim_k}$ but the text after mentions $\text{Softmax}_{M_k}$
  • p5-l173: $V_{(3)}$ should be $V^{(3)}$
  • p6-l216: tuining

Limitations

Yes

Author Response

Thank you for your instructive comments and suggestions.

[W1 & Q1] Clarification on Sec.3.2 Context-Guided Visual Feature.

We appreciate the reviewer's careful reading of Section 3.2. The iterative reduction of the number of groups $G$ is formalized in Eqs. 2 and 3. For a detailed explanation of the dimension changes in the sequential attention, please refer to the response to Reviewer 7zr3 [W2]. We will provide a detailed algorithm and correct Sec. 3.2 for clearer understanding.

For each dataset, we set $G=3$ and $(M_1, M_2, M_3)$ to (40, 40, 25) for MS-COCO, (40, 40, 40) for VOC2007, and (40, 40, 36) for NUS-WIDE. The modulation parameters $\alpha_f$ and $\alpha_F$ function as temperature values within the softmax operation.

As suggested by the reviewer, we explored selecting the top-K descriptions with the highest similarities before the softmax in Eqs. 2 and 3 with $G=1$. Table 1 demonstrates that this approach effectively improves performance without sequential attention. However, our proposed sequential attention on context-guided visual features achieved the best overall performance.

Table 1: Alternative approach using top-k similarity selection

| $G$ | Top-k | MS-COCO | VOC2007 | NUS-WIDE | Avg |
| --- | --- | --- | --- | --- | --- |
| 1 | x | 67.0 | 88.7 | 44.9 | 66.9 |
| 1 | 500 | 69.6 | 89.4 | 44.8 | 67.9 |
| 1 | 1000 | 69.3 | 89.6 | 45.3 | 68.0 |
| 3 (Ours) | x | 70.0 | 89.2 | 46.6 | 68.6 |

[W2] Scalability with a large number of classes.

As the number of target classes increases, our method demands a correspondingly large number of text descriptions. Handling such massive amounts of text can pose scalability challenges. To address this, we propose sequential attention, which efficiently manages numerous text descriptions as demonstrated in Figure 3(b).

[Q2] Correction for double-grained prompt (DPT) structure.

Thank you for your detailed feedback. As you correctly point out, while our method utilizes visual features at both local and global levels, similar to TaI-DPT, we do not incorporate text representations at either level. Consequently, our approach does not employ a double-grained prompt (DPT) structure. We will clarify this crucial distinction in the text to prevent any misunderstandings.

[Q3] Spatial aggregation sensitivity to the spatial extent of class.

Following DualCoOp and TaI-DPT, we employ spatial aggregation to obtain the local similarity score $S^{loc}_c = \sum_{j=1}^{HW}\text{Softmax}(s_{c,j}^{loc}) \cdot s^{loc}_{c,j}$, as shown in Eq. 4. The softmax operation makes this approach robust to variations in the spatial extent of each class.
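A minimal sketch of this aggregation under an assumed (C, HW) layout of the local similarity scores:

```python
import torch

def aggregate_local_score(s_loc):
    """Softmax-weighted spatial aggregation of local similarity scores,
    following the form of Eq. 4 (tensor layout is an assumption).

    s_loc: (C, HW) similarity of each class c at each spatial location j.
    Returns: (C,) aggregated local score S^loc_c per class."""
    weights = torch.softmax(s_loc, dim=-1)   # Softmax(s^loc_{c,j}) over locations j
    return (weights * s_loc).sum(dim=-1)     # sum_j Softmax(s^loc_{c,j}) * s^loc_{c,j}
```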

[Q4] Intermediate setting in Table 5.

We conducted the sequential attention process with the class concept representation $t_c^{concept}$ for the class-guided visual feature. In the table, $T$ refers to the text feature matrix of all text descriptions. We will incorporate these results into the paper.

Table 2: Ablation study on each component of our method

| Method | MS-COCO | VOC2007 |
| --- | --- | --- |
| Baseline (ZSCLIP) | 57.4 | 82.8 |
| Baseline + Class concept representation | 61.5 (+4.1) | 83.9 (+1.1) |
| Baseline + Class concept representation + Context-guided visual feature with $t_c^{concept}$ | 68.8 (+7.3) | 88.3 (+4.4) |
| Baseline + Class concept representation + Context-guided visual feature with $T$ | 70.0 (+8.5) | 89.2 (+5.3) |

[Q5] Performance convergence with increasing number of text descriptions.

In Table 6, the rapid performance convergence at 1K texts for VOC2007 is primarily attributed to the dataset's small number of target classes (20 classes). In contrast, MS-COCO (80 classes) requires a larger number of texts for convergence.

While using exclusively MS-COCO captions might introduce redundancies as the number of texts grows, we believe that the quantity and diversity of descriptions are crucial to our method's effectiveness. Sampling strategies might limit the model's exposure to valuable information. Instead, we believe that incorporating text data from multiple sources would be a more promising approach to enhance performance and address potential redundancy issues.

[Q6] Leveraging the ground truth descriptions for test images.

Interesting question! As you noted, our method might be seen as retrieving the most relevant descriptions for multi-label recognition. However, it is crucial to emphasize that our proposed sequential attention mechanism differs from a simple retrieval approach by performing a weighted averaging of vectors within the linear space $\mathcal{Z}$ spanned by the text features.

We acknowledge that GT captions could potentially provide valuable information. However, our observations revealed that most GT captions lack comprehensive coverage of all object classes, especially for smaller or less prominent objects. This limitation motivated our decision to leverage a massive dataset of text descriptions, which offers a richer and more diverse vocabulary.

  • Example of GT captions from the MS-COCO dataset
    • Labels of a test image: (person, bicycle, car, stop sign, backpack, handbag)
  • night is falling on an empty city street.
  • a street view of people walking down the sidewalk.
  • a street with a few people walking and cars in the road.
  • woman walking down the side walk of a busy night city.
  • a street with cars lined with poles and wires.

As you can see, these captions often omit certain object classes, such as the bicycle, backpack, and handbag. Note that we use the training-set annotations of the MS-COCO captions for evaluation on MS-COCO and VOC2007.

[Q7] Performance discrepancies of DualCoOp and why DualCoOp++ results were not reported?

(DualCoOp) For Table 2, we use mean average precision (mAP) for zero-shot learning, while DualCoOp reports the F1 score. Table 4 presents our reproduced DualCoOp results, which exhibit minor variations from the reported performance.

(DualCoOp++) Due to the unavailability of public code for DualCoOp++, we were unable to report its performance.

Comment

After reading all the reviewers' comments and the authors' answers, I believe my initial rating of 5: Borderline accept is a fair assessment of this contribution and thus maintain it.

Comment

Thanks for your valuable comments and the time you devoted to reviewing our paper. We appreciate the reviewer's insightful feedback and will incorporate it when revising the paper.

Review (Rating: 5)

This paper proposes class concept representation for zero-shot multi-label recognition. The paper also proposes context-guided visual representation, which is in the same linear space as class concept representation, with sequential attention. Experiments show the proposed methods improved the performance of zero-shot methods.

Strengths

  1. The paper uses class concept representation for training-free multi-label recognition tasks
  2. The paper proposes context-guided visual feature using sequential attention.
  3. Experiments show the proposed methods improved the performance of zero-shot methods.

Weaknesses

  1. The class concepts obtained by averaging the vectors of text descriptions need to be verified. E.g., what text/image embeddings are the closest to the class concepts? What clusters do the concepts belong to? Since taking the average for class concepts "was guided by the prior work on prompt ensembling [4]" (L280), it is not a novel representation for class concepts.
  2. Eqs. 2 and 3 need further explanation. What is "t" in the equations? If "t" is transpose, which dimensions are swapped for a tensor T? Taking k=1 as an example, how do the dimensions change in each step of the equation? In the experiments, there should be ablation studies on G and the value of each Mg. Also, is T randomly reshaped? It would be better to have ablation studies on random reshaping or reshaping by clusters.
  3. What is the implementation detail for partial label learning with the proposed method?

Questions

  1. It would be better to use other representations for class concepts beyond simple averaging. E.g. decomposition.
  2. Incorrect grammar in L182
  3. The class "sofa" is in VOC but not in COCO, how do you get the class concept from COCO captions?
  4. What is the motivation behind setting the value of $\alpha_{f,t}$ to be half of $\alpha_{F,t}$? Different ratios should be experimented with.

Limitations

Yes

Author Response

We appreciate your insightful feedback and suggestions.

[W1] Effectiveness of the class concept representation $t_c^{concept}$.

(Verification of the class concept): To evaluate the class concept representation, we compared the closest text descriptions in the MS-COCO caption dataset to both (1) hand-crafted prompt embeddings $t_c^{hand\text{-}crafted}$ (e.g., "a photo of") and (2) the class concept representation $t_c^{concept}$.

  • Top-3 co-occurring objects of class "knife" in the MS-COCO test set
    • dining table, person, cup
  • Top-3 closest texts to $t_c^{concept}$
    • A knife sticking out of the top of a wooden table.
    • A person holding a knife on a cutting board.
    • A guy looking down at a big knife he is holding.
  • Top-3 closest texts to $t_c^{hand\text{-}crafted}$
    • A knife is sitting on a plate of food
    • A sharp bladed knife left out on the ground.
    • A knife about to cut a birthday cake in pieces.

The results indicate that our $t_c^{concept}$ is better aligned with text descriptions containing multiple class names that frequently co-occur with the target class "knife". This demonstrates the suitability of $t_c^{concept}$ for multi-label recognition tasks where images depict complex scenes with multiple objects.

(Difference from prompt ensembling [4]): While [4] demonstrates the effectiveness of equal-weight averaging of text features from several hand-crafted templates (e.g., "a photo of" and "a art of"), we average the vectors of text descriptions containing diverse contextual information associated with the target class name. Therefore, this enriched representation results in a significant improvement in performance.

[W2] The explanation of reshaping the text feature $T$ in "Context-Guided Visual Feature".

(Detailed explanation): In Eqs. 2 and 3, $t$ refers to the transpose, as you mentioned. At the $k$-th iteration, the transpose operation swaps the dimensions corresponding to $M_k$ and $D$ of $T \in \mathbb{R}^{M_1 \times \cdots \times M_G \times D}$, so $T^t \in \mathbb{R}^{D \times \cdots \times M_G \times M_1}$. Let the L2-normalized global feature be $f \in \mathbb{R}^{1 \times D}$ and the global context-guided visual feature $v^{(0)} = T$, with $v^{(0)} \in \mathbb{R}^{M_1 \times \cdots \times M_G \times D}$ and $(v^{(0)})^t \in \mathbb{R}^{D \times \cdots \times M_G \times M_1}$.

For example, in the case of Eq. 2 at $k=1$, we compute $v^{(1)} = \text{Softmax}_{\dim_1}\left(\frac{f (v^{(0)})^t}{\alpha_f}\right) v^{(0)} \in \mathbb{R}^{M_2 \times \cdots \times M_G \times D}$, where $f (v^{(0)})^t \in \mathbb{R}^{M_2 \times \cdots \times M_G \times M_1}$ and $\text{Softmax}_{\dim_1}$ is applied along the dimension of $M_1$. Finally, we obtain $v^{(G)} \in \mathbb{R}^{1 \times D}$. We will clarify this in the main paper and add the detailed algorithm in the supplementary material.
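To make the dimension bookkeeping concrete, a rough PyTorch sketch of one possible reading of this sequential attention; the default group sizes follow the MS-COCO setting mentioned earlier in the rebuttal, while the temperature value and the reshaping of $T$ are assumptions for illustration only:

```python
import torch

def sequential_attention(f, T_flat, group_sizes=(40, 40, 25), alpha=1/80):
    """Collapse grouped text features into a single context-guided visual
    feature v^(G) by repeated softmax-weighted averaging guided by the
    global visual feature f (illustrative sketch, not the authors' code).

    f: (1, D) L2-normalized global visual feature.
    T_flat: (M1*...*MG, D) L2-normalized text features."""
    D = T_flat.shape[-1]
    f_vec = f.reshape(D)                        # (D,)
    v = T_flat.reshape(*group_sizes, D)         # v^(0): (M1, ..., MG, D)
    for _ in group_sizes:
        sim = (v * f_vec).sum(dim=-1)           # similarities, shape (M_k, ..., M_G)
        w = torch.softmax(sim / alpha, dim=0)   # softmax along the leading group dim
        v = (w.unsqueeze(-1) * v).sum(dim=0)    # collapse the leading group dim
    return v.reshape(1, D)                      # v^(G) in R^{1 x D}
```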

(Ablation study on $G$ and $M_g$): The text feature $T$ is reshaped to a pre-defined shape, not randomly. We have conducted ablation studies varying the values of $G$ and $M_g$ in $T \in \mathbb{R}^{M_1 \times \cdots \times M_G \times D}$.

Table 1: Ablation study on the number of groups $G$

| $G$ | MS-COCO | VOC2007 | NUS-WIDE |
| --- | --- | --- | --- |
| 1 | 67.0 | 88.7 | 44.9 |
| 2 | 68.7 | 89.1 | 45.5 |
| 3 | 70.0 | 89.2 | 46.6 |

Table 2: Ablation study varying the values of $M_g$ for 64,000 texts

| $M_1$ | $M_2$ | $M_3$ | VOC2007 |
| --- | --- | --- | --- |
| 20 | 32 | 100 | 89.1 |
| 40 | 40 | 40 | 89.2 |
| 80 | 20 | 40 | 89.1 |

The experimental results demonstrate that increasing $G$ and balancing the number of texts $M_g$ in each group are important for the sequential attention process. Note that we set $(M_1, M_2, M_3)$ to (40, 40, 40) for VOC2007.

[W3] The implementation of the partially labeled setting with our method.

In the partial label setting, we combine the prediction values of DualCoOp and our method by averaging them with a weight of 0.5 for each. Since DualCoOp was originally designed for practical partial label scenarios (limited-annotation MLR), our approach can enhance DualCoOp by providing complementary information during inference.

[Q1] Exploring alternatives to equal-weight averaging for the class concept representation.

We have explored weighted averaging for the class concept representation $t_c^{concept}$ using an attention mechanism. Weights $w$ were calculated by applying a softmax operation to the cosine similarities between the context-guided visual features ($v^{(G)}$, $V^{(G)}$) and the set of corresponding text feature vectors for the target class.

By varying the temperature $\alpha_{concept}$ of the softmax in the attention module, we observed that nearly uniform weight averaging, obtained with large values of $\alpha_{concept}$, yielded optimal performance. For simplicity and efficiency, we adopt equal-weight averaging.

Table 3: Ablation study varying $\alpha_{concept}$

| Method | $\alpha_{concept}$ | MS-COCO | VOC2007 | NUS-WIDE |
| --- | --- | --- | --- | --- |
| Attention | 1/10 | 68.0 | 87.9 | 45.0 |
| Attention | 10 | 69.6 | 88.7 | 45.6 |
| Equal weight averaging (Ours) | - | 70.0 | 89.2 | 45.6 |

[Q2] Incorrect grammar in L182.

Thank you for the comment. We will correct the grammar in L182.

[Q3] The class "sofa" is present in VOC2007 but absent from the MS-COCO classes. How can its class concept be constructed from MS-COCO captions?

While the class "sofa" is not explicitly defined as a category in the MS-COCO dataset, the term "sofa" appears frequently within the numerous text descriptions of the MS-COCO caption dataset.

[Q4] Exploring different ratios $\alpha_{f,t}/\alpha_{F,t}$.

The main reason for setting $\alpha_{f,t}/\alpha_{F,t}=0.5$ is that it yielded the best overall performance across the datasets. We fixed $1/\alpha_{f,t}$ at 80, 60, and 40 for MS-COCO, VOC2007, and NUS-WIDE, respectively.

Table 4: Ablation study varying the ratio $\alpha_{f,t}/\alpha_{F,t}$ in the sequential attention

| $\alpha_{f,t}/\alpha_{F,t}$ | MS-COCO | VOC2007 | NUS-WIDE |
| --- | --- | --- | --- |
| 0.25 | 68.4 | 88.6 | 46.5 |
| 0.50 | 70.0 | 89.2 | 46.6 |
| 0.75 | 69.7 | 89.0 | 45.9 |
Comment

Thanks for the author's reply. I have adjusted my rating to 5: Borderline accept.

Comment

We are grateful for your response and valuable feedback. Many thanks for your efforts and suggestions, which have made our paper better. We are pleased that our clarifications have adequately addressed your concerns.

Author Response

We sincerely appreciate the constructive feedback from the reviewers. Their insightful comments and suggestions have significantly enhanced the quality of our work.

We are pleased that the reviewers highlighted the following:

  • Training-free zero-shot method (7zr3, jAJ4, TJWi, xF3A)
    • We introduce a training-free and label-free approach for multi-label recognition.
  • Novel class concept representation and alignment of multi-modal features (7zr3, TJWi)
    • Our novel class concept representation, inspired by how humans form concepts of words from past experience, may contain diverse and realistic patterns of co-occurrence between the target class name and other related class names.
    • Our class-guided visual feature enhances the alignment between visual and textual information during inference.
  • Robust performance improvement and synergistic combination with existing prompt tuning methods (jAJ4, TJWi, 7zr3, xF3A)
    • We demonstrated that our proposed method can be effectively combined with other existing methods, resulting in improved performance.
    • Our proposed approach leads to substantial performance gains across various benchmarks during inference.

We have provided detailed responses to each reviewer's comments, which will be incorporated into the revised manuscript.

Final Decision

The submission focuses on improving pre-trained vision-language models on zero-shot image classification tasks. They achieve the improvement by (1) replacing the class-name text prompts with class concept representation; and (2) leveraging context-guided visual features. Evaluations are performed on several image classification benchmarks under the "multi-label recognition" setup, including MS-COCO, VOC2007, NUS-WIDE, under zero-shot, few-shot, and partially labeled settings.

The authors provided detailed rebuttals to address the reviewers' concerns, and all reviewers have finalized their ratings accordingly. The submission received three borderline accepts and one borderline reject. Reviewers did not raise additional questions during the rebuttal discussion, and the AC found the question on the choice of ResNet-50 backbone to be reasonably addressed. However, there is also a lack of a strong champion advocating for the acceptance of the submission. In addition, the AC shares the reviewers' concerns about the clarity of the paper's presentation, and also believes that the submission should discuss other concept-based zero-shot image classification work, such as [a,b,c,d]. Therefore, the AC believes the submission should be revised and resubmitted.

[a] Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification

[b] Visual Classification via Description from Large Language Models

[c] Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

[d] What does a platypus look like? Generating customized prompts for zero-shot image classification