PaperHub
Overall rating: 5.8 / 10
Poster · 4 reviewers (min 4, max 7, std 1.1)
Individual ratings: 6, 4, 7, 6
Confidence: 4.5
Soundness: 3.3 · Contribution: 2.3 · Presentation: 3.3
NeurIPS 2024

TPR: Topology-Preserving Reservoirs for Generalized Zero-Shot Learning

OpenReview · PDF
Submitted: 2024-04-23 · Updated: 2024-11-06
TL;DR

a model incorporating dual-space alignment and topology-preserving strategies for GZSL

Abstract

Keywords
Generalized Zero-shot Learning · Vision-Language Models · Contrastive Learning

Reviews and Discussion

Review
Rating: 6

This paper proposes a new task, "generalized zero-shot learning (GZSL)," in which both seen and unseen objects should be recognized in vision-language tasks. It also proposes a new method based on CLIP that uses a loss in the "attribute space" to perform better on both seen and unseen classes. The method is evaluated on various datasets using the harmonic mean of the accuracies of the seen and unseen classes.

Strengths

The proposed approach using the attribute space seems novel enough, and its effectiveness is verified by a detailed comparison with other methods and by well-designed ablation studies.

Weaknesses

  1. It is unclear what is learned in "learnable attribute tokens." It is not so beneficial for unseen classes. It is unclear what information is represented as tokens for seen classes. It may be better to analyze the acquired tokens in more detail.

  2. It is difficult to think about the case that we have never seen an object, but we know its attributes quite well. In such a sense, I believe this method is more appropriate for few-shot learning.

Questions

Strangely, the accuracy increases as the number of learnable tokens increases in Fig. 4(a) (AwA2). It would be appreciated if you could provide any insights into this phenomenon.

Limitations

None.

Author Response

Response to Reviewer 8m4c

Response Q1-Q3

Thank you for the positive and insightful comments. The reviewer appreciated the novelty of our approach, the creation of the attribute space, the effectiveness of our method, and the well-designed ablation studies. We address the mentioned concerns below.

Q1: Analysis of the learned attribute tokens in more detail.

A1: Thanks for the insightful comment. To better analyze the learned attribute tokens, we attach a PDF in the top-level comment with two figures (Fig. R1 and Fig. R2). Our analysis is two-fold:

  • In Fig. R1 of the attached top-level PDF file, we visualized the correlation matrix of the base attribute vocabulary and the learned attribute tokens, respectively. We observe that the correlation of learned tokens is indeed different from that of base vocabulary. Specifically, many items in the base vocabulary are highly correlated, with an average value of 0.858, indicating redundancy within the base vocabulary. In contrast, the correlation of learned tokens is much smaller, with an average value of only 0.003, indicating that separable and independent tokens are learned (see the sketch after this list for how such a statistic can be computed).
  • In Fig. R2 of the attached top-level PDF file, we visualize the distribution of the base attribute vocabulary and the learnable attribute tokens using t-SNE. It can be seen that the learnable tokens (marked in orange) can fill in part of the gaps in the base vocabulary (marked in blue), indicating that the two represent different semantic meanings and are therefore complementary to each other.
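
For readers without access to the attached PDF, the correlation statistic in the first point is easy to reproduce. Below is a minimal numpy sketch (not the authors' code) computing the mean off-diagonal Pearson correlation of a set of token embeddings; the random input is purely illustrative.

```python
import numpy as np

def mean_offdiag_corr(tokens: np.ndarray) -> float:
    # tokens: (N, d), one embedding per attribute token.
    # np.corrcoef treats each row as a variable -> (N, N) Pearson matrix.
    corr = np.corrcoef(tokens)
    off_diagonal = corr[~np.eye(len(tokens), dtype=bool)]
    return float(off_diagonal.mean())

# Illustrative only: random Gaussian tokens come out nearly decorrelated,
# mirroring the low average (0.003) reported for the learned tokens.
rng = np.random.default_rng(0)
print(mean_offdiag_corr(rng.standard_normal((64, 512))))
```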

Q2: It is difficult to think about the case that we have never seen an object, but we know its attributes quite well. In such a sense, I believe this method is more appropriate for few-shot learning.

A2: Thank you for the comment. In the ZSL/GZSL setting, a general assumption is that for unseen classes, their prior semantic information such as attributes is known in advance, although there are no images from unseen classes for training. The overlapping attributes between seen and unseen classes can facilitate the knowledge learned on the seen classes to be transferred to the unseen for zero-shot recognition.

As per your suggestion, our framework can be readily extended to few-shot learning. It is important to note that few-shot learning differs from ZSL/GZSL in that it leverages a small number of labeled samples from unseen/novel classes. To accommodate this difference, we can align the multimodal representations of both the seen/base and unseen/novel classes within the dual-space. The model can then be trained using our multi-modal alignment loss (Eq. 5 and 6) plus the topology-preserving loss (Eq. 7) proposed in this work. In this setting, our model still inherits the good generalization ability of the CLIP model for effective few-shot learning.

Q3: Explanations for the accuracy increase in Fig. 4(a).

A3: On the AwA2 dataset (Fig. 4(a)), the performance on the unseen classes (i.e., U) improves as the number of learnable attribute tokens (i.e., N_2) increases.

We attribute this to the following reason: The AwA2 dataset is a coarse-grained dataset encompassing a wide range of animal categories. The base vocabulary may not comprehensively cover all essential attributes, leaving some important features unrepresented. This gap necessitates the addition of more learnable attribute tokens, which contribute to the observed increase in accuracy as N_2 grows.

Review
Rating: 4

In this paper, the authors propose a dual-space feature alignment module to keep the semantic consistency between the visual and attribute modalities. In addition, they propose a Topology-Preserving Reservoir (TPR) to tackle the generalized zero-shot learning (GZSL) setting, which utilizes the Pearson correlation coefficient to define a topology-preserving loss that effectively prevents overfitting on the seen and unseen classes. Sufficient experiments demonstrate the effectiveness of the proposed method.

Strengths

(1) The paper is well-written, and the method is intuitive and easy to understand.
(2) The proposed method focuses on Generalized Zero-Shot Learning (GZSL), presenting a Topology-Preserving Reservoir to fine-tune the pre-trained CLIP to better fit the distribution of seen and unseen classes, which seems reasonable.
(3) Sufficient and significant experiments demonstrate the effectiveness of the proposed method.

Weaknesses

(1) The dual-space feature alignment proposed by the authors, which uses a cross-attention mechanism for cross-modal alignment, lacks innovation.
(2) The authors mention an "attribute reservoir" in the article, but essentially it is just a fully connected layer that generates different feature representations through various loss constraints. Additionally, in Figure 2 the attribute reservoir is shown in two states, frozen and trained; I am unsure when these two states should transition into each other.
(3) The idea proposed by the authors of fine-tuning the feature distribution using spatial topological structures is intriguing. However, relying solely on the Pearson correlation coefficient to define a topology-preserving loss seems somewhat simplistic.

Questions

NA

Limitations

NA

Author Response

Response to Reviewer k9s8

Response Q1-Q3

Thank you for the valuable comments and for recognizing the various strengths of our paper: "well-written", "intuitive and easy to understand", "better fit the distribution of seen and unseen classes", "reasonable", and "sufficient and significant experiments". We address the concerns raised by the reviewer as follows.

Q1: Lack of innovation in the dual-space feature alignment module.

A1: We appreciate the opportunity to further clarify the innovation of our dual-space feature alignment module. To the best of our knowledge, there is no existing related work that employs a latent-attribute dual-space representation to address GZSL. Most previous methods rely solely on a single latent space for multimodal feature alignment. We observe that these methods fail to effectively capture complex and fine-grained patterns, leading to suboptimal performance, especially for unseen classes (see Table 1 of the main paper).

To overcome these limitations, we propose to align multimodal features in a dual-space consisting of a latent space and an attribute space. The latent space provides a general representation of the input multimodal features, while the attribute space, induced by a predefined base attribute vocabulary and learnable attribute tokens, offers a more structured and interpretable representation. By aligning multimodal features within these two complementary spaces, our method not only captures both prior knowledge and task-specific information more effectively but also reduces the risk of overfitting to seen classes. This dual-space alignment design leads to superior performance compared to previous works.

Finally, we would like to emphasize that cross-attention is a common practice for cross-modal alignment. How to effectively align image and text modalities is the key contribution of our work. Thus, we design a novel dual-space to align the multimodal image-text representations instead of projecting them into a single latent space.

Q2: Explanations for the attribute reservoir.

A2: Thank you for the comment. To the best of our knowledge, our work is the first attempt to leverage an attribute reservoir to solve the description-based GZSL task. By utilizing this reservoir, we are able to significantly improve performance through capturing complex and fine-grained patterns and facilitating alignment between image and description features.

More specifically, the attribute reservoir consists of two complementary components, the base attribute vocabulary and the learnable attribute tokens, a design that is conceptually and technically distinct from simply applying a fully-connected layer to project features.

  • The base attribute vocabulary contains an extensive set of 5,996 attributes collected from the existing literature on attribute recognition. These attributes are universally shared across all datasets, providing prior knowledge and effectively improving the generalization of the model to unseen classes.
  • The learnable attribute tokens are introduced to compensate for cases where the base attribute vocabulary does not cover sufficient attributes to perform effective GZSL. Concretely, they have two functions: 1) learning complementary attribute knowledge that is missing in the base vocabulary; 2) integrating task-specific information into the attribute reservoir so as to better align the image and description features.

Experimentally, the two components of the attribute reservoir are complementary to each other (Table 2 of the main paper) and together they achieve superior performance over the state-of-the-art. This demonstrates the effectiveness of our attribute reservoir.
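
To make the distinction from a plain fully-connected projection concrete, here is a hedged PyTorch sketch of such a reservoir, not the paper's actual module: queries cross-attend over the concatenation of frozen base-vocabulary features and learnable tokens. The dimensions, the random stand-in for the LLM features, and the attention direction are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttributeReservoirSketch(nn.Module):
    """Hypothetical sketch of a frozen-plus-learnable attribute reservoir
    queried via cross-attention (not the paper's exact implementation)."""

    def __init__(self, dim=512, n_base=5996, n_learnable=64, n_heads=8):
        super().__init__()
        # Stand-in for LLM features of the base vocabulary; kept frozen.
        self.register_buffer("base_vocab", torch.randn(n_base, dim))
        # Learnable attribute tokens, trained with the rest of the model.
        self.learnable = nn.Parameter(torch.randn(n_learnable, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, L, dim) image or description features used as queries.
        reservoir = torch.cat([self.base_vocab, self.learnable], dim=0)
        kv = reservoir.unsqueeze(0).expand(feats.size(0), -1, -1)
        out, _ = self.attn(query=feats, key=kv, value=kv)
        return out  # attribute-space representation, (B, L, dim)
```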

Q3: The idea of using spatial topological structures is intriguing. Relying solely on the Pearson correlation coefficient to define a topology-preserving loss seems somewhat simplistic.

A3: Thank you for the helpful comment. In this work, our primary goal is to maintain the class topology structure of the CLIP embedding space, thereby improving generalization to unseen classes. As shown in Table 3 of the main paper, we experimented with various forms of topology-preserving losses and found that using the Pearson correlation coefficient to define the topology-preserving loss is simple yet effective. Further empirical results show that it indeed improves the generalization ability of the model significantly (Table 1 and Fig. 1(a)).
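
Eq. 7 itself is not reproduced in this thread, but the idea admits a compact sketch. The following PyTorch snippet is a hedged illustration, not the authors' code: compute the pairwise Pearson correlation matrix of class embeddings in the adapted space and in the frozen CLIP space, and penalize the mismatch.

```python
import torch
import torch.nn.functional as F

def pearson_matrix(z: torch.Tensor) -> torch.Tensor:
    # z: (C, d), one embedding per class; returns (C, C) correlations.
    z = z - z.mean(dim=1, keepdim=True)  # center each class embedding
    z = F.normalize(z, dim=1)            # unit norm, so dot = correlation
    return z @ z.t()

def topology_preserving_loss(z_adapted: torch.Tensor,
                             z_clip: torch.Tensor) -> torch.Tensor:
    # Match the class correlation structure of the frozen CLIP space;
    # the anchor is detached so no gradient flows into it.
    target = pearson_matrix(z_clip).detach()
    return F.mse_loss(pearson_matrix(z_adapted), target)
```

Whether an MSE or another distance is taken between the two correlation matrices is an assumption here; the detached CLIP anchor is the essential ingredient.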

Comment

Dear Reviewer,

Thank you for your comprehensive review of our manuscript. We have taken your feedback seriously and have revised the paper to address your concerns and incorporate your suggestions.

If there are any areas where you would like us to provide more information or clarification, please feel free to ask. Your expertise and insights are invaluable to us, and we aim to address any outstanding concerns.

Review
Rating: 7

This paper is a new study that introduces the Generalized Zero-Shot Learning (GZSL) framework within VLMs, aiming to classify both known and novel classes without class partitioning. Key innovations include a dual-space feature alignment module, enhancing latent representations with an attribute reservoir for nuanced visual-linguistic patterns. Additionally, a topology-preserving objective ensures that model adaptations preserve the semantic structure learned by CLIP, thus maintaining generalization across all classes. Extensive experiments across diverse datasets validate the proposed Topology-Preserving Reservoir (TPR) model, demonstrating superior performance over conventional methods in recognizing both seen and unseen classes, underlining its potential for practical applications in complex visual recognition tasks.

Strengths

  1. This paper introduces a novel research aspect for VLMs: generalized zero-shot learning, which requires the model to identify both seen and unseen concepts at the same time. From my perspective, this proposal could be a great contribution to the VLM community.
  2. This paper is well-organized and well-written, which makes it easy to follow.
  3. Extensive experiments, ablation studies, and visualization results demonstrate the effectiveness and rationality of TPR.

Weaknesses

None in particular.

Questions

GZSL is a long-standing problem in the ML/AI community with many classic solutions, like stacking [1], etc. Since the authors only compare with VLM-based methods, my concern is about the performance of classic methods in VLM-based GZSL tasks: can they achieve surprising performance with well-trained features? [1] Chao W L, Changpinyo S, Gong B, et al. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. ECCV 2016.

Limitations

None in particular.

Author Response

Response to Reviewer BqNV

We sincerely thank Reviewer BqNV for the very positive and helpful comments. Thank you for acknowledging the novelty of our idea, our contribution to the VLM community, the writing and organization of our paper, and the extensive experiments and ablation study. We address your questions as follows:

Q1: Can classic methods [1] achieve superior performance with well-trained features in VLM-based GZSL tasks?

A1: Thank you for the comment. In Table R1 below, we compare with the two variants of the classic method [1] using pretrained VLM-based features (i.e., ViT-B/32 CLIP). The classic method SynC [1, 2] achieves competitive results on seen classes, but degrades significantly on unseen classes. Specifically, SynC [1, 2] constructed classifiers for unseen classes by linearly combining the phantom classifiers trained on seen class data. Therefore, it exhibits a significant bias towards seen classes, resulting in inadequate generalization to unseen classes. This poor generalization phenomenon is also observed in Ref [3].

In contrast, our TPR method obtains promising results on both seen and unseen classes, demonstrating the generalization capability. We will include the results and discussion with Ref [1-3] in the revision.

Table R1. Performance comparison between the classic methods [1] and ours in the VLM-based framework.

Model                | AwA2: S / U / H     | CUB: S / U / H
a) CLIP              | 81.7 / 77.7 / 79.6  | 29.9 / 29.6 / 29.7
b) SynC^{o-vs-o} [1] | 86.0 / 15.1 / 25.7  | 31.9 / 10.5 / 15.8
c) SynC^{struct} [1] | 88.3 / 16.7 / 28.1  | 33.8 / 11.4 / 17.0
d) TPR (Ours)        | 87.1 / 76.8 / 81.6  | 41.2 / 26.9 / 32.5

[1] Wei-Lun Chao, et al. An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild. ECCV, 2016.

[2] Soravit Changpinyo, et al. Synthesized Classifiers for Zero-Shot Learning. CVPR, 2016.

[3] Yongqin Xian, et al. Zero-Shot Learning—A Comprehensive Evaluation of the Good, the Bad and the Ugly. IEEE T-PAMI, 2018.

Review
Rating: 6

The proposed approach targets the generalized zero-shot learning (GZSL) problem for the vision language model (VLM). It is observed that a strong VLM model shows promising results for novel class generalization. Fine-tuning these models for seen classes leads to a loss in generalization capability and poor results for unseen classes. Additionally, a single latent space demonstrates limited ability to adapt to complex visual-linguistic patterns in fine-grained datasets. The paper proposes dual-space alignment, augmenting the latent space with static and learnable tokens. To address the generalization problem post fine-tuning, the paper introduces a Topology-Preserving Reservoir (TPR), which helps preserve the model's generalization ability for unseen classes. The authors conducted extensive experiments across several standard ZSL datasets and explored the impact of various components through ablation studies.

Strengths

[1] Generalization to unseen classes in VLMs is a critical problem. Even a strong pretrained model loses its generalization ability after fine-tuning, which the authors explore, and the proposed model shows a significant impact.

[2] The idea and intuition behind the static and learnable attribute reservoir are interesting. Additionally, TPR helps improve generalization.

[3] The wide-ranging experiments conducted across various ZSL datasets and the ablation studies are satisfactory.

Weaknesses

[1] The standard ZSL model assumes that there is a description per class rather than per sample, which is more intuitive since a single description for each class suffices for the model to understand the class, making it cost-efficient. Standard annotation-based attributes often yield better results for ZSL/GZSL settings. For example, [a] demonstrates impressive results for the CUB dataset compared to the proposed complex static, learnable, and description-based model. This issue is particularly observed in fine-grained datasets. Why is this the case?

[2] It is unclear how the base attribute vocabulary is created. At a high level, the author collected a few attributes and obtained LLM embeddings. This description may not be sufficient for reproducibility since the code and data are not provided.

[3] There are multiple variants of TPR (Table 3), and different variants work in different scenarios, making it difficult to apply and choose the best one. What do the authors conclude here?

[4] In Table 1, for the SUN dataset, the model shows inferior performance. While we do not expect the model to outperform in all scenarios, a clear description and the authors' observations are required: why is this the case?

[a] Meta-Learned Attribute Self-Interaction Network for Continual and Generalized Zero-Shot Learning, WACV-24

Questions

Please refer to the weakness section.

Limitations

The authors have not discussed the limitations in the paper, while in the checklist they said "Yes", i.e., that they had discussed them. This is bad practice; I don't know what to do.

Dear AC, please look into it.

Author Response

Response to Reviewer TcJo

Response Q1-Q6

Thank you for the valuable comments and the many kind words about our work: "a critical problem", "significant impact", "interesting", "helps improve generalization", "wide-ranging experiments", and "satisfactory" ablation studies. Below, we address the raised questions:

Q1: Is a description provided for each class or for each sample?

A1: We would like to clarify that our method utilizes the Class-Level textual descriptions (a description per class). Therefore, our method is cost-efficient with the class-level design.

Q2: Standard annotation-based attributes often yield better results for GZSL settings. For example, [a] demonstrates impressive results on CUB.

A2: Thank you for your constructive comments. Firstly, we agree with Reviewer TcJo that sometimes standard annotation-based attributes yield good results for ZSL/GZSL. However, they require extensive expert knowledge and detailed class-level attribute annotation for both seen and unseen classes. In contrast, our method relies only on attribute names and LLM-generated descriptions, which are easier to obtain and significantly reduce the need for human efforts.

Secondly, we speculate that the impressive results of [a] on CUB are due to the usage of meticulously annotated attributes. Specifically, the CUB dataset provides 312-dimensional densely labeled attributes, which capture the characteristics of bird body parts, such as the beak, head, trunk, tail, and claws. Compared with the well-annotated attributes, LLM-generated descriptions may overlook some key attributes characterizing the bird classes. Thus, learning with detailed attribute annotations [a] leads to recognition improvements.

We will cite [a] and include the above discussions in the revision.

[a] Vinay Verma, et al. Meta-Learned Attribute Self-Interaction Network for Continual and Generalized Zero-Shot Learning. WACV, 2024.

Q3: How is the base attribute vocabulary created?

A3: We detail its construction as follows:

  • First, we collected readily available attribute words from diverse attribute-recognition repositories, such as MAD, VAW, and LSA. For example, the MAD dataset has 158 attributes, which are incorporated into our base vocabulary. Note that we used only the attribute names without involving any annotations.
  • Then, we eliminated duplicate attribute words to form a base vocabulary of 5,996 attribute words. A pre-trained LLM was adopted to extract features (a sketch of this pipeline follows the list).
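
A hedged sketch of this two-step pipeline is below; the attribute words and the sentence-transformers checkpoint are illustrative stand-ins, since the specific LLM used for feature extraction is not named here.

```python
from sentence_transformers import SentenceTransformer

# Tiny illustrative subsets; the real sources (MAD, VAW, LSA) together
# yield 5,996 unique attribute words after deduplication.
mad = ["striped", "furry", "metallic"]
vaw = ["furry", "wooden", "glossy"]
lsa = ["glossy", "striped", "transparent"]

vocab = sorted(set(mad) | set(vaw) | set(lsa))  # merge and deduplicate

# Stand-in text encoder (an assumption); any pretrained embedder works.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
features = encoder.encode(vocab)                # (len(vocab), d) array
print(len(vocab), features.shape)
```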

In Appendix Line 751, we provided the anonymous code link. We will make all code and data public upon acceptance.

Q4: Choice among the multiple variants of TPR, and the conclusion.

A4: Thank you for the comment. In GZSL, the harmonic mean (H) is the most comprehensive metric for evaluating model performance, as it takes both seen and unseen classes into account. Therefore, we choose the best model according to H. For ease of comparison, in Table R2 below we average the performance over the three datasets and observe that our proposed topology-preserving loss (row e: L_tp) obtains the best results.
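
For reference, H here is the standard GZSL harmonic mean of the seen accuracy S and the unseen accuracy U:

```latex
H = \frac{2 \cdot S \cdot U}{S + U}
```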

Our conclusions are two-fold: 1) all regularization variants exhibit a degree of performance improvement compared to the baseline, indicating the importance of constraining the spatial structure of CLIP; 2) among these, our topology-preserving loss L_tp shows better performance than the other variants, highlighting its effectiveness.

Table R2. Average performance of different variants of TPR on three datasets: AwA2, CUB, and FLO. The proposed topology-preserving loss L_tp obtains the best results.

Model           | S / U / H
a) w/o L_tp     | 66.6 / 54.2 / 59.6
b) nuclear norm | 67.9 / 54.2 / 60.0
c) orthogonality| 67.1 / 55.3 / 60.4
d) L_tp^{lat}   | 67.4 / 55.8 / 60.8
e) L_tp         | 68.6 / 56.1 / 61.5

Q5: Explanations for the inferior performance on the SUN dataset.

A5: Unlike object-specific classes, SUN is a generic scene dataset containing 717 categories, such as "amphitheater" and "amusement park". These broad scene categories exhibit extensive variations in their components and features, making it challenging for our TPR method to align images with descriptions. For instance, the category "amusement park" may include diverse elements such as "crowds" and "buildings", which vary significantly across different instances. This diversity complicates the projection of multimodal features into the attribute space, where consistency is harder to achieve.

Moreover, we observe that several class descriptions in the SUN dataset exhibit high cosine similarity scores. Specifically, about 5% of the textual descriptions have similarities higher than 0.8, with an average similarity of 0.69. In contrast, the average similarity for the FLO dataset is 0.52. The high similarity across descriptions of different categories adversely impacts the model's performance, especially for unseen classes.
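
These similarity statistics can be reproduced with a few lines of numpy; the sketch below is an illustration (assuming one embedding per class description), not the authors' analysis script.

```python
import numpy as np

def description_similarity_stats(emb: np.ndarray):
    # emb: (C, d), one text embedding per class description.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                            # pairwise cosine similarity
    pairs = sim[np.triu_indices(len(emb), k=1)]  # unique class pairs only
    return pairs.mean(), (pairs > 0.8).mean()    # mean, fraction above 0.8
```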

Furthermore, we would like to emphasize that although TPR is 3.2% lower than ProGrad [12] on the SUN dataset, on the other datasets TPR is on average about 10% higher than the state of the art.

Q6: Summary of limitations.

A6: We summarize the limitations of TPR as follows:
Text alone may not fully capture the nuances of fine-grained datasets like CUB, while attribute annotations, though more accurate, are costly. Thus, a more desirable solution would be combining the knowledge from expert-provided attribute annotations with LLM-generated text to enhance performance. Additionally, our method may face challenges in aligning visual features of generic scenes with description features in the attribute space, especially when descriptions are not sufficiently specific. This could be alleviated by providing more distinct and human-refined descriptions.

Author Response

Global Rebuttal

We thank all reviewers for their insightful and positive feedback. We are encouraged that the reviewers acknowledge our paper:

  • Novel and impactful. Reviewer BqNV -- "introduces a novel research aspect", "a great contribution to VLM community"; Reviewer TcJo -- "the proposed model shows a significant impact", "idea and intuition are interesting", "TPR helps improve generalization"; Reviewer k9s8 -- "the method is intuitive"; Reviewer 8m4c -- "seems novel enough".
  • Sufficient and significant experiments demonstrating the effectiveness of our method. Reviewers reckon our experiments are extensive (BqNV), wide-ranging (TcJo), sufficient and significant (k9s8), and with detailed comparison and well-designed ablation studies (8m4c).
  • Well-organized, well-written, and easy to follow. Reviewer BqNV -- "This paper is well-organized and well-written, which makes it easy to follow."; Reviewer k9s8 -- "The paper is well-written, meanwhile, the method is intuitive and easy to understand."

We thank the reviewers for their suggestions and comments, which have significantly improved the quality of our work, and we have done our best to address the reviewers' concerns.

Below, we summarize the key changes we have made for rebuttal:

  • Performance of classic methods in VLM-based GZSL tasks (reviewer BqNV)
  • Clarification of class-level descriptions, comparison with standard annotation-based attribute methods, choice of model variants, and explanation of the performance on the SUN dataset (reviewer TcJo)
  • Clarification of key innovations regarding the dual-space design, attribute reservoir, and topology-preserving loss (reviewer k9s8)
  • Explanation of the attribute reservoir (reviewers TcJo and k9s8)
  • Provision of a summary of limitations (reviewer TcJo)
  • Visualization and analysis of the learned attribute tokens (PDF attached, reviewer 8m4c)

Please find individual responses to your questions below.

Final Decision

This paper received an overall positive review score. The reviewers pointed out that the work has merits in motivation, performance, and presentation, and the point of contention is about the novelty of this paper. After reading the paper, reviews, and rebuttal, the AC agrees that the proposed dual-space alignment learning exhibits differences from the prior approaches and leads to superior performance. Due to the above considerations, the AC gives a recommendation for acceptance.