PaperHub
Rating: 4.8 / 10
Poster · 3 reviewers
Reviewer scores: 4, 3, 1 (min 1, max 4, std 1.2)
ICML 2025

GenZSL: Generative Zero-Shot Learning Via Inductive Variational Autoencoder

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We propose an inductive variational autoencoder for generative zero-shot learning

Abstract

Remarkable progress in zero-shot learning (ZSL) has been achieved using generative models. However, existing generative ZSL methods merely generate (imagine) the visual features from scratch guided by the strong class semantic vectors annotated by experts, resulting in suboptimal generative performance and limited scene generalization. To address these and advance ZSL, we propose an inductive variational autoencoder for generative zero-shot learning, dubbed GenZSL. Mimicking human-level concept learning, GenZSL operates by inducting new class samples from similar seen classes using weak class semantic vectors derived from target class names (i.e., CLIP text embedding). To ensure the generation of informative samples for training an effective ZSL classifier, our GenZSL incorporates two key strategies. Firstly, it employs class diversity promotion to enhance the diversity of class semantic vectors. Secondly, it utilizes target class-guided information boosting criteria to optimize the model. Extensive experiments conducted on three popular benchmark datasets showcase the superiority and potential of our GenZSL with significant efficacy and efficiency over f-VAEGAN, e.g., 24.7% performance gains and more than $60\times$ faster training speed on AWA2. Codes are available at https://github.com/shiming-chen/GenZSL.
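As a rough illustration only (not the authors' released implementation; the module layout, names such as `InductiveVAE`, and all dimensions are assumptions), the induction idea described in the abstract can be pictured as a conditional VAE that encodes a referent seen-class feature and decodes it under the target class's CLIP text embedding:

```python
import torch
import torch.nn as nn

class InductiveVAE(nn.Module):
    """Hypothetical sketch: induct a target-class feature from a referent seen-class feature."""
    def __init__(self, feat_dim=512, sem_dim=512, latent_dim=64):
        super().__init__()
        # Encoder sees the referent visual feature together with the target class semantic vector.
        self.encoder = nn.Sequential(nn.Linear(feat_dim + sem_dim, 1024), nn.ReLU())
        self.mu = nn.Linear(1024, latent_dim)
        self.logvar = nn.Linear(1024, latent_dim)
        # Decoder reconstructs a visual feature conditioned on the target semantics.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + sem_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim),
        )

    def forward(self, referent_feat, target_sem):
        h = self.encoder(torch.cat([referent_feat, target_sem], dim=-1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        synthesized = self.decoder(torch.cat([z, target_sem], dim=-1))
        return synthesized, mu, logvar
```

The point of the sketch is the contrast with imagination-based generators: the decoder starts from an encoded referent seen-class feature rather than from pure noise.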
Keywords
Zero-shot learning; transfer learning

Reviews and Discussion

Review (Rating: 4)

This paper proposes a novel generative paradigm for zero-shot learning (GenZSL), which is based on the idea of induction rather than imagination. To ensure the generation of informative samples for training an effective ZSL classifier, GenZSL incorporates two key strategies, i.e., class diversity promotion and target class-guided information boosting criteria. The experimental results are extensive and meaningful.

Questions for Authors

Please see my detailed comments in strengths and weaknesses.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Design and Analysis

Yes

Supplementary Material

Yes

Relation to Prior Work

Please see my detailed comments in strengths and weaknesses.

Missing Important References

No

Other Strengths and Weaknesses

Strengths:

  • This paper is well-organized, and the motivation is clear and interesting.
  • The technical contributions are novel and clearly presented.
  • The extensive results demonstrate the effectiveness of the proposed method.

Weaknesses:

  • The method is based on the assumption that the target classes are inducted from similar referent classes. If there are no similar classes for the unseen classes in the seen class set, might the method fail to work?
  • In Sec 3, the authors state that the refinement of text embeddings preserves the semantic relationships between classes. Is there any theoretical evidence or empirical analysis to validate this statement?
  • As shown in Table 2, GenZSL achieves SOTA results on SUN and AWA2, but not on CUB. Please provide more discussion of this inconsistent performance.
  • In Table 4, the unseen-class performance of GenZSL falls short of the standalone CLIP results. The improved seen-class performance, while valuable, somewhat diverges from the primary objective of zero-shot learning.

Other Comments or Suggestions

No

Author Response

Response: Thank you for the comprehensive review and detailed comments! We are very happy to address your concerns!

Q1: If there are no similar classes for the unseen classes in the seen class set, might the method fail to work?

A1: Thank you for this constructive comment. As in our response to Q6 of Reviewer eZyG, if we randomly sample seen-class samples to synthesize unseen-class samples, the performance of GenZSL drops only slightly, i.e., CUB (acc: $63.3\% \rightarrow 62.5\%$; H: $57.4\% \rightarrow 55.9\%$) and AWA2 (acc: $92.2\% \rightarrow 91.1\%$; H: $87.4\% \rightarrow 85.3\%$). These results show that GenZSL does not rely heavily on similar seen-class samples for synthesizing unseen-class samples, while similar classes for the unseen classes in the seen class set can further improve GenZSL.
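For readers who want to picture the two sampling schemes compared in this answer, here is a minimal sketch (illustrative only; the function name `select_referent_classes`, the value of `k`, and the variable names are assumptions, not the released code):

```python
import torch
import torch.nn.functional as F

def select_referent_classes(unseen_sem, seen_sem, k=2):
    """Pick the top-k seen classes whose class semantic vectors (e.g., CLIP text
    embeddings) are closest to each unseen class vector; a random-sampling
    baseline is included for comparison with the ablation above."""
    sim = F.normalize(unseen_sem, dim=-1) @ F.normalize(seen_sem, dim=-1).T  # (U, S)
    topk = sim.topk(k, dim=-1).indices                       # similar-class selection
    rand = torch.randint(0, seen_sem.size(0), topk.shape)    # random-sampling ablation
    return topk, rand
```

Samples drawn from the returned seen classes would then serve as referent inputs to the generator.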

Q2: In Sec 3, the authors state that the refinement of text embeddings preserves the semantic relationships between classes. Is there any theoretical evidence or empirical analysis to validate this statement?

A2: As shown in Fig. 3 and Fig. 8, the qualitative results show that two classes with high similarity still keep relatively high similarity after refinement with CDP. We computed statistics on the similarities between class semantic vectors pre- and post-CDP and found that the class relationships are preserved. We will further highlight this in Sec. 3.1.

Q3: Provide more discussions on inconsistent performances on Tab. 2.

A3: As shown in Tab. 2, GenZSL achieves larger improvements on SUN and AWA2 than on CUB. This is because CUB is a fine-grained dataset consisting of 200 bird classes that are very similar to each other, so GenZSL inevitably synthesizes similar samples for different unseen classes. That is, the diversity of the synthesized samples is limited, and thus the performance gains are not significant. Although SUN is also fine-grained, it consists of more classes (717) with higher diversity between classes. Meanwhile, AWA2 is a coarse-grained dataset whose classes are also more diverse. Accordingly, GenZSL obtains larger performance gains on SUN and AWA2.

Q4: In Table 4, the unseen-class performance of GenZSL falls short of the standalone CLIP results. The improved seen-class performance, while valuable, somewhat diverges from the primary objective of zero-shot learning.

A4: In the GZSL setting, the goal of ZSL methods is to achieve good performance on both seen and unseen classes. That is, we mainly evaluate the harmonic mean in GZSL. The unseen accuracy of GenZSL drops slightly relative to CLIP on CUB, but it significantly improves the performance on seen classes (i.e., a 7.1% improvement). This means that GenZSL addresses the seen-unseen bias issue in GZSL well. We will add these discussions to the final version.
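For reference, the harmonic mean used in GZSL evaluation is $H = \frac{2 \times U \times S}{U + S}$; the quick check below uses GenZSL's CUB numbers quoted elsewhere in this discussion, purely to illustrate the formula:

```python
def harmonic_mean(u, s):
    """GZSL harmonic mean of unseen (U) and seen (S) class accuracies."""
    return 2 * u * s / (u + s)

print(round(harmonic_mean(53.5, 61.9), 1))  # -> 57.4, matching the reported H on CUB
```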

Reviewer Comment

After checking the authors' responses and other reviewers' comments, I keep my initial positive rating, because this work 1) introduces a new induction-based generative model that offers a new insight for ZSL, 2) aligns with vision-language models (e.g., CLIP) to enable attribute-free generalization, which paves the way for further advancements in ZSL, and 3) bridges the gap between classical ZSL methods (e.g., generative models) and VLM-based methods (e.g., CLIP).

Review (Rating: 3)

This paper introduces GenZSL, a novel inductive framework for generative zero-shot learning (ZSL) that addresses limitations in existing generative ZSL methods. Traditional approaches generate visual features "from scratch" using expert-annotated class semantic vectors, leading to suboptimal performance and poor generalization. Inspired by human concept induction, GenZSL synthesizes unseen class features by inductively transforming similar seen-class samples guided by weak semantic vectors (e.g., CLIP text embeddings of class names) and class diversity promotion, achieving state-of-the-art accuracy.

Questions for Authors

No

Claims and Evidence

  1. "Class Diversity Promotion (CDP) preserves original class relationships". CDP’s SVD-based orthogonalization (Eq. 1) removes redundancy but risks distorting semantic relationships. The claim lacks ​quantitative validation (e.g., semantic similarity metrics pre/post-CDP).
  2. "60× faster training". Speed comparisons (Fig. 6) lack details on hardware parity or implementation optimizations. GANs are notoriously slower, so gains may reflect architectural simplicity rather than algorithmic superiority.
  3. Robustness to hyperparameters (Fig. 5). Results show stability for $\lambda$ and $N_{syn}$, but the number of top-k referent classes (critical for induction) is fixed to k=2. Sensitivity to k is untested on fine-grained datasets (e.g., SUN's 717 classes).
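To make point 1 above concrete, the sketch below shows one plausible SVD-based refinement of the kind Eq. 1 might describe; this is an illustrative guess, not the authors' code, and the function name is hypothetical:

```python
import numpy as np

def class_diversity_promotion(sem):
    """One plausible SVD-based refinement (a guess at Eq. 1, for illustration only):
    replace the singular values with ones, so that the refined class vectors become
    mutually orthogonal whenever the number of classes does not exceed the embedding
    dimension."""
    u, _, vt = np.linalg.svd(sem, full_matrices=False)   # sem: (num_classes, dim)
    refined = u @ vt                                      # singular values dropped
    return refined / np.linalg.norm(refined, axis=1, keepdims=True)
```

Whether this exact form matches Eq. 1 is an assumption; the point is that such a step removes redundancy while potentially distorting pairwise similarities, which is why the pre/post-CDP similarity check requested in point 1 matters.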

Methods and Evaluation Criteria

Yes

Theoretical Claims

NA

Experimental Design and Analysis

  1. While GenZSL outperforms methods using strong semantics (Table 5), the paper does not isolate the impact of the semantic source (CLIP embeddings vs. attributes). Are gains due to induction or CLIP's inherent cross-modal alignment? And there is no comparison to hybrid approaches (e.g., CLIP + expert attributes).
  2. The authors compare with embedding-based and generative methods. They use CLIP features, which might give an unfair advantage because other methods don't use similar features.
  3. Overreliance on CLIP's implicit alignment: the paper attributes gains to "induction" but does not isolate CLIP's role. No experiments compare 1) GenZSL with CLIP embeddings vs. expert-annotated attributes and 2) induction vs. imagination using identical semantic vectors.
  4. Description-based ZSL vs. class-name-based ZSL. Recently, methods have been proposed that use natural language descriptions as semantic information. What is the comparison with these methods under this setting? [a] TPR: Topology-Preserving Reservoirs for Generalized Zero-Shot Learning, NeurIPS 2024.

Supplementary Material

Yes

Relation to Prior Work

GenZSL bridges cognitive-science-inspired induction mechanisms (e.g., Bayesian concept learning) with generative ZSL, advancing prior imagination-based methods (e.g., f-VAEGAN) by replacing expert-dependent attributes with scalable CLIP-guided induction, while aligning with vision-language models (e.g., CLIP) to enable attribute-free generalization.

Missing Important References

The authors should compare with more prompt-learning-based methods, such as SHIP, PromptSRC, and MaPLe.

Other Strengths and Weaknesses

No

Other Comments or Suggestions

Fig. 1 caption, line 4: "strong" should be "limited".

Author Response

Response: Thank you for the comprehensive review and detailed comments! We are very happy to address your concerns!

Q1: The claim for CDP lacks quantitative validation (e.g., semantic similarity metrics pre/post-CDP)

A1: The quantitative validation for CDP is presented below; we will add these results to the ablation study section. The results show that CDP effectively improves the performance of GenZSL.

Here are the results of pre-CDP.

| Dataset | U | S | H | acc |
|---|---|---|---|---|
| CUB | 48.2 | 64.6 | 55.2 | 60.9 |
| AWA2 | 82.3 | 87.9 | 85.0 | 90.7 |

Here are the results of post-CDP.

| Dataset | U | S | H | acc |
|---|---|---|---|---|
| CUB | 53.5 | 61.9 | 57.4 | 63.3 |
| AWA2 | 86.1 | 88.7 | 87.4 | 92.2 |

Q2: Speed comparisons (Fig. 6) lack details on hardware parity or implementation optimizations.

A2: We obtained the results in Fig. 6 on a single NVIDIA RTX 3090 graphics card with 24 GB of memory, following the official codes without any further implementation optimization. Furthermore, our GenZSL inducts the unseen samples from similar seen classes under the guidance of the target class semantic vectors, which makes the target distribution easier to learn than for imagination-based generative models that learn from scratch (e.g., from a Gaussian distribution). As such, GenZSL learns the target distribution of unseen classes efficiently. We will further highlight these discussions in the final version.

Q3: Sensitivity to top-k is untested on SUN.

A3: Due to the page limit, the hyper-parameter analyses on SUN and AWA2 are presented in Appendix E. Specifically, the evaluation of top-k on SUN is presented in Fig. 10(b).

Q4: Are gains due to induction or CLIP's inherent cross-modal alignment?

A4: As shown in Tab. 4, GenZSL achieves improvements over CLIP-based models (e.g., CoOp, CoOp+SHIP). Meanwhile, GenZSL also outperforms the imagination-based generative model with identical class semantic vectors (i.e., CLIP embeddings). Because the dimensions of CLIP visual features and human-annotated attributes differ, GenZSL cannot be implemented on top of the latter. As such, we cannot provide GenZSL (strong) results analogous to f-VAEGAN in Tab. 5. However, f-VAEGAN (strong) obtains better performance than f-VAEGAN (weak), which indicates that human-annotated attributes can be a better semantic condition for generative ZSL than CLIP embeddings. Furthermore, GenZSL (weak) achieves performance gains over f-VAEGAN with either type of semantic vector. These results demonstrate the effectiveness of the induction mechanism in generative ZSL.

We will further highlight these discussions in the final version.

Q5: Comparisons between embedding-based and generative methods.

A5: In Tab. 1, we mainly categorize the compared methods by whether their visual features are extracted from a ViT or a CNN backbone; CLIP visual features are also extracted from a ViT. Furthermore, we also compare GenZSL with VADS (Hou et al., 2024), which uses CLIP visual features. The results show the superior performance of GenZSL.

Q6: Comparison of description-based ZSL vs. class-name-based ZSL.

A6: Thank you for this helpful comment. Indeed, we include description-based ZSL methods (e.g., I2MVFormer-Wiki (Naeem et al., 2023) and I2MVFormer+ (Naeem et al., 2024)) in the comparison under the CZSL setting in Tab. 1. Recently, TPR (Chen et al., 2024) provided comprehensive results for description-based ZSL under the GZSL setting. We will discuss these description-based GZSL methods (especially prompt-based methods, e.g., TPR, SHIP, PromptSRC, MaPLe) in the final version.

Q7: Fig.1 caption line 4: strong should be limited.

A7: Thank you for this helpful comment; we will update it.

Review (Rating: 1)

This paper introduces GenZSL for generative zero-shot learning. It first employs a class diversity promotion module to reduce redundant information in class semantic vectors. Additionally, a semantically similar sample selection module is used to select referent class samples. Experiments conducted on three popular benchmark datasets demonstrate the effectiveness of the proposed method.

Update after rebuttal

The proposed method lacks comparison with recent approaches, and its performance is over 10% worse than a 2024 method (VADS) on two datasets. Additionally, the authors attribute the performance drop with more generated samples to "limited diversity," which directly contradicts their claim of improving diversity. This explanation is unconvincing. Thus, I recommend rejecting the paper.

Questions for Authors

N/A.

Claims and Evidence

The authors claim that their proposed class diversity promotion (CDP) module enhances the diversity of class semantic vectors. However, Fig. 3 only shows that the vectors with CDP are less similar to each other. Can the authors provide evidence that these vectors are also more diverse?

Methods and Evaluation Criteria

Yes, the proposed method and the evaluation criteria make sense for the problem.

Theoretical Claims

The paper does not provide formal proofs.

Experimental Design and Analysis

Yes, I checked the ablation study in 4.2, the qualitative evaluation in 4.3, the comparison between induction-based and imagination-based generative ZSL in 4.4, and the hyper-parameter analysis in 4.5. Based on Fig. 5(c), the accuracy fluctuates as the number of synthetic samples increases. Can the authors discuss why the performance does not positively correlate with the amount of augmented data?

Supplementary Material

Yes, I reviewed the supplementary material, including the class semantic vectors’ similarity heatmaps, additional t-SNE visualization and hyper-parameter analysis, and performance of Generative ZSL with weak class semantic vectors.

Relation to Prior Work

Using weak class semantic vectors for feature generation has been explored in previous ZSL/FSL studies [1]. The sample selection module also aligns with prior works that transfer information from base classes to novel classes for data generation [2][3].

[1]. Xu & Le, Generating representative samples for few-shot classification, CVPR 2022

[2]. Yang et al, Free Lunch for Few-shot Learning: Distribution Calibration, ICLR 2021

[3]. Schwartz et al, ∆-encoder: an effective sample synthesis method for few-shot object recognition, NeurIPS 2018

Missing Important References

In Tab. 2, only one method from 2023 is listed for comparison. Can the authors also include comparisons with the following works?

[1] Hou et al, Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning, CVPR 2024

[2] Cavazza et al, No adversaries to zero-shot learning: Distilling an ensemble of gaussian feature generators, TPAMI 2023

Other Strengths and Weaknesses

Strengths:

  1. The proposed method is simple and easy to implement.
  2. It outperforms previous methods on various datasets.

Weaknesses:

  1. The temperature parameter in equation (4) is set to 0.07. How is this value determined? Can the authors provide some analysis for choosing this specific parameter?
  2. In Section 4.2, the authors provide an analysis of the class diversity promotion module, the target class reconstruction loss, and the target class-guided information boosting loss. Can the authors also show the effectiveness of the semantically similar sample selection module?

Other Comments or Suggestions

N/A.

Author Response

Response: Thank you for reviewing our submission and the comments! Here are our responses to your concerns:

Q1: Can the authors provide evidence that these vectors are also more diverse?

A1: In Fig. 3 and Fig. 8, we present similarity heatmaps of the class semantic vectors extracted by the CLIP text encoder and by CLIP with our class diversity promotion (CDP). The results show that the similarities between class semantic vectors are smaller with CDP; for example, the mean similarity between classes drops from 0.5726 to $1.825 \times 10^{-5}$ on CUB. That is, CDP makes the refined class semantic vectors nearly perpendicular to each other. As such, the refined class semantic vectors are more diverse.
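A small sketch of how such a pre/post-CDP similarity statistic can be computed (illustrative only; `clip_embeddings` and `refined_embeddings` in the comment are placeholders for the class vectors before and after CDP):

```python
import numpy as np

def mean_offdiag_cosine(vectors):
    """Mean cosine similarity over all distinct pairs of class vectors."""
    x = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = x @ x.T
    n = sim.shape[0]
    return (sim.sum() - np.trace(sim)) / (n * (n - 1))

# e.g., compare mean_offdiag_cosine(clip_embeddings) with mean_offdiag_cosine(refined_embeddings)
```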

Q2: Can the authors discuss why the performance does not positively correlate with the amount of augmented data in Fig. 5(c)?

A2: When the number of augmented samples for unseen classes is small (e.g., fewer than 1600 on CUB), GenZSL can synthesize high-quality unseen-class samples to effectively train a supervised classifier. However, if the number is set too large, the model fails to synthesize sufficiently diverse unseen samples, resulting in overfitting to seen classes, because there is an upper bound on synthetic diversity. We will add this discussion to Sec. 4.5 in the final version.

Q3: Using weak class semantic vectors for feature generation has been explored in previous ZSL/FSL studies [1]. The sample selection module also aligns with prior works that transfer information from base classes to novel classes for data generation [2][3].

A3: We should emphasize that our GenZSL is different from existing FSL studies. The reasons are as follows:

First, [1] is an imagination-based generative model that synthesizes data for augmentation in FSL tasks: it selects representative samples to learn an imagination-based generative model (e.g., a standard VAE). In contrast, our GenZSL is a novel induction-based generative model that mimics human-level concept learning and is effective at synthesizing high-quality samples for unseen classes. Imagination-based generative models synthesize samples from scratch, so the generator must capture a high-dimensional data distribution without sufficient data.

Second, [2][3] sample the top-k samples based on the distribution of visual features, which requires samples of the novel classes. However, there are no samples of unseen classes in the ZSL task, so these methods can be used for FSL but not for ZSL. In contrast, GenZSL relies on class semantic vectors to select similar classes for synthesizing unseen samples. Accordingly, GenZSL can be applied to the ZSL task, which is the focus of this manuscript.

To avoid readers' confusion, we will add these discussions to the final version.

[1] Xu & Le, Generating representative samples for few-shot classification, CVPR 2022.

[2] Yang et al, Free Lunch for Few-shot Learning: Distribution Calibration, ICLR 2021.

[3] Schwartz et al, ∆-encoder: an effective sample synthesis method for few-shot object recognition, NeurIPS 2018.

Q4: Can GG (Cavazza et al., 2023) and VADS (Hou et al., 2024) be added to Tab. 2 for comparison?

A4: Yes, we will add these two works for comparison in Tab. 2.

Q5: Can the authors provide some analysis for choosing the temperature parameter in equation (4)?

A5: Initially, we set the temperature parameter $\tau$ to 0.007 by default. Following [4], we additionally evaluated $\tau \in \{0.007, 0.02, 0.03, 1.0\}$ on CUB. The results show that GenZSL achieves the best performance with $\tau = 0.007$.

| Setting | U | S | H | acc |
|---|---|---|---|---|
| GenZSL ($\tau$=0.007) | 53.5 | 61.9 | 57.4 | 63.3 |
| GenZSL ($\tau$=0.02) | 50.4 | 65.7 | 57.0 | 63.0 |
| GenZSL ($\tau$=0.03) | 49.5 | 65.7 | 56.5 | 62.4 |
| GenZSL ($\tau$=1.0) | 49.1 | 65.7 | 56.2 | 62.4 |

[4] "Understanding the Behaviour of Contrastive Loss." In CVPR, 2021.

Q6: Can the authors also show the effectiveness of the semantically similar sample selection module?

A6: We conducted additional experiments on GenZSL without the semantically similar sample selection module; the results show that performance decreases compared to GenZSL (full). We will add these results to Tab. 3.

Results on CUB.

| Method | U | S | H | acc |
|---|---|---|---|---|
| GenZSL w/o similar sample selection | 48.0 | 67.0 | 55.9 | 62.5 |
| GenZSL w/ similar sample selection | 53.5 | 61.9 | 57.4 | 63.3 |

Results on AWA2.

| Method | U | S | H | acc |
|---|---|---|---|---|
| GenZSL w/o similar sample selection | 84.2 | 86.4 | 85.3 | 91.1 |
| GenZSL w/ similar sample selection | 86.1 | 88.7 | 87.4 | 92.2 |
Reviewer Comment

Thanks to the authors for the response. I still have the following concerns:

  1. Regarding the comparison with recent methods, VADS (Hou et al., 2024) achieves 74.1 (unseen) and 74.6 (seen) on CUB, while the performance reported in the paper on CUB is 53.5 and 61.9, respectively. Similarly, on SUN, VADS achieves 64.6 (unseen) and 49.0 (seen), compared to the reported values of 50.6 and 43.8. These suggest that the results are not state-of-the-art.

  2. Regarding the correlation between the number of synthesized samples and performance, I understand there should be an upper limit—i.e., performance should no longer improve once the number of synthesized samples becomes sufficiently large. However, based on Fig. 5(c), the harmonic mean actually decreases as the number of samples increases from 1600 to 3200. Does this phenomenon suggest that the generator is not well trained?

  3. Also, using CLIP features during inference violates the ZSL setting since CLIP is trained on a very large quantity of samples. Thus, it is highly likely that some training samples of CLIP overlap with the unseen classes. To ensure a fair comparison with existing ZSL methods, it is necessary to conduct experiments using ResNet-101 features, which are commonly employed in prior work.

Author Comment

Q7: Comparison with VADS.

A7: Compared to VADS, our GenZSL achieves better performance on AWA2 (seen classes: 88.7% vs. 83.6%; unseen classes: 86.1% vs. 75.4%), but not on the fine-grained datasets (CUB and SUN). The reason is that VADS takes human-annotated attributes as semantic information, which provides fine-grained information to the model. We will add these discussions to the final version.

Q8: The harmonic mean actually decreases as the number of samples increases from 1600 to 3200. Does this phenomenon suggest that the generator is not well trained?

A8: When the number of synthesized unseen-class samples is set larger than 1600 on CUB, GenZSL overfits to the seen classes because the diversity of the synthesized unseen samples is limited. This is a normal phenomenon in generative ZSL. Accordingly, a good value of the hyper-parameter $N_{syn}$ should be selected.

Q9: Using CLIP features during inference violates the ZSL setting since CLIP is trained on a very large quantity of samples.

A9: In fact, CLIP has led to new trends in ZSL because it generalizes well and has been widely applied to ZSL tasks, e.g., zero-shot segmentation, zero-shot detection, and zero-shot retrieval. Analogously to SHIP (Wang et al., 2023), how to leverage the advantages of CLIP for ZSL is a promising research direction. As noted by Reviewer omKE, our work aligns with vision-language models (e.g., CLIP) to enable attribute-free generalization, which is an important contribution of this work.

Additionally, because GenZSL requires the visual and semantic features to have the same dimension for model learning, and the dimension of ResNet features is inconsistent with both strong and weak semantic features, we cannot provide additional experiments using ResNet features.

Final Decision

This paper proposes a generative zero-shot learning (GenZSL) method for generating samples for unseen classes from their weak CLIP embeddings via induction from the most similar seen classes. For ZSL from weak labels, it shows SOTA performance. Three reviewers provided scores of "Reject", "Weak Accept", and "Weak Accept". The AC carefully read the paper, all reviews, the authors' responses to the reviewers' comments, and all discussion between the reviewers and the authors. The main concern raised by the reviewers was the lack of an apples-to-apples comparison with the prior VADS method in the strong semantic label setting. Two reviewers raised this concern, but the authors did not provide this comparison. Given this, while the method is novel and intuitive, and has been shown to improve performance in the weak semantic information setting of using CLIP embeddings, it has not been adequately verified to also improve performance in the strong semantic label setting. Hence, the AC feels that while it advances research in the field of ZSL, its performance has only been partially validated. All things considered, the AC feels that this paper is at the borderline of acceptance and recommends "Weak Accept". It would not be bad to reject it if there isn't space in the program.