Text Descriptions are Compressive and Invariant Representations for Visual Learning
Abstract
Reviews and Discussion
This paper is in line with a recent trend of augmenting CLIP's classification templates with class-specific descriptions generated by LLMs (e.g., GPT-3). The major technical contribution of this work is to concatenate the descriptions of all classes of interest, learn weights to aggregate these descriptions' text features, and form new classifiers for each class. The weights are regularized with an L1 norm to encourage sparsity. Experiments show that tuning with this classifier outperforms methods that work directly on top of vision features (without texts). In addition, an information-theoretic analysis is provided to show the improved invariance of text features (in CLIP's joint VL space) over raw vision features (the output of the vision tower prior to the linear projection).
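For concreteness, here is a minimal sketch of that recipe (synthetic data; the variable names, shapes, and hyperparameters are illustrative, not the authors' implementation): each image is represented by its similarities to all class descriptions, and a sparse (L1) logistic regression selects a few descriptions per class.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
C, D, N = 10, 500, 160                  # classes, total descriptions, few-shot images
text_feats = rng.normal(size=(D, 512))  # CLIP text embeddings of all class descriptions
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)
img_feats = rng.normal(size=(N, 512))   # CLIP image embeddings of the few-shot set
img_feats /= np.linalg.norm(img_feats, axis=1, keepdims=True)
labels = rng.integers(0, C, size=N)

# Represent each image by its cosine similarities to every description,
# then fit an L1-penalized (sparse) logistic regression on top.
desc_scores = img_feats @ text_feats.T                          # (N, D)
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=2000)
clf.fit(desc_scores, labels)
print("avg non-zero coefficients per class:", (clf.coef_ != 0).sum(axis=1).mean())
```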
Strengths
Originality: The technical contribution is relatively limited and still fits within the scope of description selection; its combination with other schemes like LP-FT is somewhat instrumental. Yet the theoretical analysis is relatively novel and inspiring.
Clarity: The paper is overall clearly written, yet the organization could be improved, and more details could be provided to help readers understand the proposed method.
Significance: This paper fits in the scope of visual representation learning under text supervision, and provides analysis on feature invariance and compression, which is helpful for the community.
Weaknesses
I put the minor concerns in the section above (which did not harm my rating much), and list the major concerns here:
1) The theoretical analysis is not well-aligned with the proposed method:
- The major conclusion that could be derived from sec 3.3 is that features from CLIP's joint VL space are more compressed and invariant to visual variations (compared with vision features).
- This is good and helps us understand CLIP itself, but does not explain why using descriptions (no matter w/ or w/o selection) is better than using other texts (eg, the default templates), which is the foundation of the proposed method.
- From Fig. 2, I find that the invariance (mutual information with the domain) of the description-based features is almost identical to that of the template ensemble, indicating that descriptions do not introduce more invariance than template ensembling, which raises a concern about why descriptions are needed in this work. The mutual information with the label is higher for the descriptions than for the template ensemble, which should indicate better predictions, yet in Tab. 1 the gain of ZS-AVD over ZS is just marginal.
2) The experiments also do not support this work (description selection)'s superiority over other template designs:
- In Tab. 2 & 3, SLR's superiority over FT & LP under the few-shot setting is expected, since both WISE-FT's and LP's classifiers are trained from scratch on only very few samples and thus cannot match the performance of classifiers derived from CLIP's text encoder.
- Yet this does not help us understand SLR's strength over other classifiers. For instance, a) what if we drop the selection process and just follow Menon & Vondrick's average ensemble, b) what about WaffleCLIP [1]'s random descriptors, c) how much is SLR's improvement over CLIP's default templates, and d) how do k-NN classifiers (using average pooling of the labeled few-shot samples to form classifiers, with both vision features and VL features) perform?
3) No ablation study is provided to understand the proposed method (relatively minor)
Ref: [1] Roth et al., Waffling around for Performance: Visual Classification with Random Words and Broad Concepts, ICCV'23
Questions
One possible cause of ZS-VD performing much worse than ZS in Tab. 1 is the use of class templates. If the templates of ZS are also applied to ZS-VD, the difference in their performance should be marginal (refer to WaffleCLIP). I suggest the authors give it a try.
We would like to thank the reviewer for their thoughtful comments. Our responses are below.
Weaknesses
1. It is actually a favorable result that the invariance of the AVD features is almost identical to that of the template ensemble. One can think of the template-ensemble (class prompt) features as a post-processing of the AVD features (since the former are a subset of the latter). This result suggests that even though AVD has more features, these extra features do not harm invariance. The difference between AVD and the class prompts is reflected in the newly revised Figure 3, where the L1 class prompts perform far worse than AVD.
2.1 Even though SLR's features are derived from the text encoder, its coefficients are not regularized to be close to ZS-AVD; we only enforce sparsity. Therefore, the model can still be wrong in arbitrary ways. The results suggest that the discrete text space has a stronger implicit bias than the image embedding space.

2.2 The average ensemble corresponds to ZS-VD or ZS-AVD. Notice that without WISE-FT, SLR does not really incorporate the strong language prior: since only sparsity is enforced, the coefficients can still overfit the few-shot data in almost arbitrary ways. One can of course overcome this issue by regularizing the difference between the SLR coefficients and the zero-shot weights, which is almost what WISE-FT does. Compared to regularization during SLR training, WISE-FT is post hoc, so we only need to train the model once. We choose to train this way because it helps us understand the implicit bias of each space (the image embedding space vs. the discrete text space that is congruent to natural language), rather than the effect of the zero-shot prior.
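For readers following this exchange, WISE-FT's post-hoc step amounts to a convex combination of the zero-shot and fine-tuned weights, which is why it can be applied after training a single model. A minimal sketch (array names and shapes are illustrative, not taken from the paper):

```python
import numpy as np

def wise_ft_interpolate(w_zeroshot, w_finetuned, alpha):
    """Post-hoc weight-space ensembling in the spirit of WISE-FT:
    a convex combination of the zero-shot and fine-tuned weights."""
    return alpha * w_finetuned + (1.0 - alpha) * w_zeroshot

# Illustrative linear heads of shape (num_classes, feature_dim).
rng = np.random.default_rng(0)
w_zs = rng.normal(size=(100, 512))
w_ft = rng.normal(size=(100, 512))

for alpha in (0.1, 0.3, 0.5, 0.7):
    w_mix = wise_ft_interpolate(w_zs, w_ft, alpha)
    # evaluate w_mix on ID and OOD validation sets here; alpha can be swept post hoc
```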
We added WaffleCLIP zero-shot performance in Table 12, and we notice that its performance tends to decrease as more random tokens are added. On the other hand, our descriptors have a length of around 8-10 tokens. This suggests that the most important factor is to perturb the basis around the class prompt in a controlled way, which is more easily achieved by using language-congruent descriptions. We also compare WaffleCLIP+WISE-FT to SLR+WISE-FT; the results in Figure 15 also suggest that SLR performs better than WaffleCLIP.
Questions
In Table 12, we also tried averaging the templates for ZS-VD; the results are listed in the ZS-VD column. We see a slight improvement.
We also want to remark that the gap between WaffleCLIP and our ZS baseline arises because we use the 7 hand-selected templates later released in the OpenAI CLIP GitHub repo, while the original WaffleCLIP paper randomly selects 30 out of the original 80 templates for ensembling.
Thanks to the authors for the reply. Still, my major concerns are not well-resolved.
Considering the methodology as one core contribution of this paper, given its negligible performance gain, the reason one would prefer it over other practices, e.g., simply ZS, remains unclear and unconvincing. I raised WaffleCLIP because it challenged the use of descriptions by showing that even random words can perform comparably; I cited it for its message, not its performance. The fact that this work can perform better than random words is unsurprising.
Considering the information-theoretic analysis, it is good to see progress in this direction. Still, it only justifies the merit of previous works, not this one itself. The fact that it "does not harm invariance" is insufficient to support it as a preferable method.
Overall, I prefer to keep my initial recommendation unchanged.
The paper proposes a method for obtaining more robust CLIP models. First, multiple visual descriptions are generated by an LLM and then used by the CLIP model alongside encodings of the class names. The visual representation is projected into the class + descriptors space, and then a sparse logistic regression is trained on this representation. As shown by experiments estimating mutual information, projecting on the class embeddings and visual descriptions minimizes the mutual information with the domain, while selecting sparse features follows the bottleneck principle. Experimentally, the method is shown to produce better results than strong baselines and can be combined with existing methods.
Strengths
- S1. The method is well-motivated.
- S2. The method is sound. Both the visual descriptions and sparsity are good ways to improve robustness.
Weaknesses
- W1. The paper already compares against a few methods, but some additional baselines and ablations would also help. First, the full method but with plain logistic regression instead of sparse logistic regression (this will show whether the sparsity is really important), and the same method without using the visual descriptions (partially shown in Table 1). Second, an MLP with a similar number of parameters as SLR-AVD, with different levels of weight decay. Third, use random projections / learnable matrices instead of the U projections (this will also induce a bottleneck).
- W2. It would be good to have an overall comparison of the different methods and an ablation study of the proposed method. I am thinking mainly of using the same base model (e.g., ViT-B/16) and perhaps a single standard few-shot k (e.g., k=16): a table with all datasets as columns and different methods (ZS, LP, MLP probing, full fine-tuning, CoOp, WISE-FT) and ablations (ZS-AVD, SLR-AVD, etc.).
- W3. It would be good to see the performance of all models using different numbers of shots: 1, 2, 4, 8, 16, 32. At the moment some ablations contain only up to 4 or 16 shots.
- W4. The paper focuses on few-shot learning, which is an important area. To see the trade-offs of the proposed method and the baselines, it would be interesting to also see the performance with larger training sets, for example k = 64, 256, 1024, etc. This way we can see at which scale of training samples the proposed method is most beneficial.
- W5. The comparison with CoOp does not seem to be fair. “Since CoOp injects “classname” to the prompt during inference directly, this enforces a very strong prior. For a fair comparison, we also inject a strong prior by interpolating our learned linear head”. It is not clear what this means, or why a direct comparison is not fair. Why is the “classname” injection a strong prior for CoOp? Isn’t SLR-AVD also using the same classname to produce the class prompts (CP)? Combining the proposed method with WISE-FT and comparing to plain CoOp seems unfair.
- W5.2. The comparison with CoOp is made using ResNet-50, which gives poorer performance for CLIP. A comparison using the ViT models should also be made.
- W6. It is not clear how hyperparameter selection is done. Hyperparameter and model selection are crucial for domain generalization, so this should be made clearer. For the WISE-FT models, alpha seems to be selected optimally, using validation OOD data.
- W7. In Figure 4 (top), it seems like WISE-FT+LP, which fine-tunes only the last linear layer, is compared against WISE-FT+SLR-AVD, which fine-tunes the entire model. Is this correct, or does WISE-FT+SLR-AVD fine-tune only the linear classifier? Both should either update the linear layer or do full fine-tuning. Also, what is the difference between WISE-FT+LP and WISE-SLR?
- W8. The paper would benefit from a better presentation. There are multiple acronyms, and sometimes the difference between them is not clear. The section in the appendix explaining the acronyms should be expanded with more details and should contain all acronyms and combinations used (e.g., WISE-FT+SLR-AVD, WISE-SLR).
- W8.2 Minor: Figure 4 should be improved, e.g., use consistent symbols, especially for start and end points. Show the optimal checkpoint in the figure.
Questions
Q: What is the number of learnable parameters of SLR-AVD, and how does it compare to linear probing?
We would like to thank the reviewer for their positive review of the paper and we hope that the following comments address their outstanding concerns.
Weaknesses
W1. We have added comparisons to several of the methods you suggested.
W1.1 We have added L2-regularized AVD (i.e. L2 regularized logistic regression replacing L1/sparse logistic regression) to show that L1 regularization is very important.
W1.2 For random projection, we initialize a random projection matrix of size 512 × 300, where 512 is the image embedding dimension. We pick the projected dimension to be roughly 300 since we found that the number of features picked by L1 is around 100-300 per class. This bottleneck underperforms our method; see Figure 3 for a detailed comparison.
W1.3 We have also added a comparison to an MLP in Table 13 in the appendix. The MLP has 3 layers of sizes 512-4500-1000, which matches the raw number of parameters in SLR. It consistently underperforms the linear models because the CLIP embeddings are trained to be linearly classifiable, and the MLP can easily overfit.
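To make these added baselines concrete, a hedged sketch of how the L2-regularized probe (W1.1) and the 300-dimensional random-projection bottleneck (W1.2) could be set up; the synthetic data and variable names are illustrative rather than the actual experimental code, and the MLP baseline (W1.3) is omitted for brevity:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, D, C = 200, 1000, 10                 # few-shot samples, description features, classes
desc_scores = rng.normal(size=(N, D))   # image-to-description similarity features
img_feats = rng.normal(size=(N, 512))   # raw 512-d image embeddings
labels = rng.integers(0, C, size=N)

# (a) L2-regularized AVD probe: same description features, ridge-style penalty.
l2_probe = LogisticRegression(penalty="l2", C=0.1, max_iter=1000)
l2_probe.fit(desc_scores, labels)

# (b) Random-projection bottleneck: project the 512-d image embeddings down to
#     ~300 dimensions (roughly the 100-300 features per class kept by L1), then probe.
proj = rng.normal(size=(512, 300)) / np.sqrt(300)
rp_probe = LogisticRegression(max_iter=1000)
rp_probe.fit(img_feats @ proj, labels)
```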
W2. We added the comparison in Table 13 in the appendix for k=4, since this is the k that we have used for every model.
W3. We have added the comparison to CoOp on 32 shots, see figure 15 in the appendix. For Full FT, we are limited by our resources.
W4. See Figure 16 in the appendix for larger k. We tried k = 64, 256, 1024. L1 AVD still outperforms linear probing on most datasets.
W5. The reason behind this statement is that CoOp optimizes a continuous prompt like “X X X X {classname}” for each class, and only the “X X X X” part is optimized in the continuous space; the “{classname}” is always fixed. On the other hand, the coefficients of SLR-AVD are not regularized until it incorporates WISE-FT. To prevent confusion, we change Table 5 so that CoOp also incorporates WISE-FT, and SLR-AVD still outperforms CoOp, especially when the amount of data increases.
W5.2. The CoOp comparison is now done on ViT-B/16.
W6. This is one drawback that we inherit from WISE-FT. The nice property of WISE-FT is that it is fully post hoc, so in practice, even if α is not optimally picked initially, it can always be easily updated. The original WISE-FT method suggests the heuristic of choosing α = 0.5. However, their original setting has abundant ID data, and 0.5 fails in our case. The heuristic value of α that we pick seems to perform quite well.
W7. WISE-FT+SLR-AVD only fine-tunes the last layer. WISE-SLR is the method that first finds a sparsity pattern using SLR-AVD, then fixes that pattern and fine-tunes the whole model. WISE-FT+LP is linear probing with image embeddings, with WISE-FT applied to only the last linear layer.
W8. We added more details in the appendix, and also Table 6 for a better visualization.
W8.2 We updated the figures. As for the optimal checkpoint, it is a little ambiguous since sometimes the optimal ID accuracy and OOD accuracy are achieved at slightly different α, so we decided not to include it.
Questions
Q1. The total number of learnable parameters in SLR-AVD is 6804*1000. The average number of non-zero entries for each class after learning is around 100-300, depending on the number of shots (rounded to the nearest integer). So the total number of "used" parameters is that per-class count times 1000 (one set per class). Linear probing optimizes 512*1000 parameters, so SLR in the end uses far fewer effective parameters. Also, as the training data increases, SLR-AVD can eliminate more irrelevant descriptors.
This paper proposes a method to produce better zero-shot classifiers for vision-language models, specifically CLIP. To do so, the set of standard class prompts used to construct zero-shot classifiers is enhanced with extra textual descriptions generated by GPT-3. Moreover, a regularized logistic regression classifier is trained to select certain textual descriptions for each class. This way, slightly better performance is achieved on the ImageNet datasets compared to the standard prompts.
Strengths
One of the strengths of this paper is the idea of sparse selection of automatically generated prompts for the classes (for which we want to learn a zero-shot classifier). Several prompt templates are provided as input to GPT-3 to retrieve textual descriptions for the classes. Then L1-regularized logistic regression is trained to select the most discriminative prompts for each class. This strategy brings a slight performance boost over manually constructing only a few (if not one) class prompts.
Weaknesses
There are two main concerns that I would like to raise.
- The benefit of automatically generating prompts for obtaining zero-shot classifiers is not very clear or significant. Table 1 compares the proposed method of generating textual descriptions from GPT-3 (ZS-AVD) against using only manually defined class prompts (ZS), and we see marginal improvements (at most a few decimal points). It would be nice to see the impact of the number and diversity of generated prompts on performance. Section 4.1 mentions that "...class names are probably one of the strongest prompts... One can certainly try to improve ZS-VD results by more carefully prompting GPT-3, or gathering descriptors from different data sources/search engines." — doesn't this contradict the motivation of the paper?
- The benefit of using the L1-regularized logistic regression technique is not very clear either. It would be nice to see whether simple L2 regularization performs the same, or what kinds of prompts are selected for certain classes under different regularization criteria. Also, a simple k-NN based approach could be applied as a baseline, with either soft or hard assignment (see the sketch after this list). On the other hand, a similar logistic regression could also be trained for ZS (where there can be multiple manually defined class prompts). Maybe the impact is only due to learning weights that connect the two modalities (image and text) using some regularization technique. Section 5 mentions that "Applying sparse logistic regression then successfully selects the important features, which turn out to be intuitive", but we don't see any evidence, right?
Minor comments:
- 1st sentence of Introduction: "Self-supervised vision-language models (VLMs) like CLIP..." is there any reference claiming CLIP to be self-supervised?
- 2nd sentence of Section-2: "WLOG..." what is WLOG?
- Missing closing parenthesis in Figure-1
- The 4th paragraph of Section-3.1 ("Denote M ...") is very confusing. It would be nice to explain all the symbols in a diagram/visualization.
- Figure-3 caption "the x-axis represents..." (not y)
- Figure-3 what is the label for y-axis?
Questions
I would like the authors to address the concerns I listed in the weaknesses part.
We thank the reviewer for their comments and we hope that the following points address the concerns they brought up.
Weaknesses
- The main motivation is that we want to use multiple features to learn a classifier in the presence of few-shot data. If we only have one prompt per class, then there is no candidate set to learn from (other than the vanilla image embeddings). An important finding is that the visual descriptions perturb the original class prompt in a specific way such that the perturbed features have nice information-theoretic properties. See the newly added Table 12 in the appendix for a comparison to WaffleCLIP, which appends random words to the class prompts. Our visual descriptors have a length of around 8-10 tokens, and WaffleCLIP's performance decreases as the random token length increases. WaffleCLIP only does well when a couple of words are added: this small perturbation is beneficial even if the words are irrelevant. However, when more words (8-10) are added, WaffleCLIP's performance decreases, while our method can pick useful descriptors of this length that constructively enhance the original class prompt.
We also have small-scale experiments on CIFAR-10 in the appendix, in the paragraph "Choosing and LLM prompting". There we found that the most important factor is to generate more diverse descriptors by setting the frequency penalty.
- See the revised Figure 3. L1 consistently leads to better performance than L2. Sparsely chosen class prompts perform worse than AVD on most datasets. These results still hold when combining with WISE-FT; see Figure 15 in the appendix. All results indicate that choosing features with L1 is important.
Minor comments:
-"1st sentence of Introduction: "Self-supervised vision-language models (VLMs) like CLIP..." is there any reference claiming CLIP to be self-supervised?” We changed this to “natural language supervised.”
-"2nd sentence of Section-2: "WLOG..." what is WLOG?” This stands for “without loss of generality.” We modified the text to clarify this term.
-"Missing closing paranthesis in Figure-1” Noted - we fixed this in the paper.
-"4th paragraph of Section-3.1 ("Denote M ...") is very confusing. It would be nice to explain all and in a diagram/visualization. " We added a visualization of at the end of the appendix
-"Figure-3 caption "the x-axis represents..." (not y)” and “what is the label for y-axis?” The y-axis refers to test accuracy; we have modified the text accordingly.
This paper proposes a model, SLR-AVD, which first automatically generates visual descriptions of each class via an LLM, then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image. The features are shown, via an information-theoretic argument, to be more invariant to domain shift than traditional image embeddings. SLR-AVD is validated on both in-distribution and out-of-distribution classification.
Strengths
The proposed model is novel: it extracts multiple potential visual features for each class and then uses L1-regularized logistic regression to fit a sparse linear classifier on top of these visual descriptions. The generated descriptive features are shown to retain substantial information about the true labels, making them good invariant representations.
Weaknesses
The paper writing and the experiments need improvement.
- The training and inference process is not clear.
- What is the loss function used to train the model?
- The three W matrices W_{vd}, W_{cp}, and W_{avd} are used as zero-shot classifiers. However, the inference process of the zero-shot classifier is not clearly explained.
- In Section 3.2 paragraph 2, it would be better to mathematically specify how to regularize W_{avd}, and how to pick three features for each class with the largest coefficients.
- The experiments show limited performance gain in zero-shot classification.
- The results in Table 1 indicate that ZS-VD performs worse than ZS, and ZS-AVD only provides a marginal improvement over ZS, i.e., 0.74 (IN), 0.74 (IN-V2), 0.13 (IN-R), 0.23 (IN-A), 0.27 (ObjectNet).
- In Figure 3, it would be interesting to show the performance of the ZS-VD model.
- In section 4.2, which dataset is the OOD test set?
I am looking forward to the authors' responses and would like to adjust my rating if the questions are properly addressed.
Questions
Figure 4 is unclear and the sub-caption in the bottom-right corner is blocked.
We would like to thank the reviewer for their thoughtful comments. Please see our responses below.
Weaknesses
1.1 The model is trained with the standard multiclass cross-entropy loss plus L1 regularization.
1.2 For any zero-shot matrix W, inference is done by computing the class scores W x for an image embedding x and predicting argmax_c (W x)_c; the same rule applies to each of W_{cp}, W_{vd}, and W_{avd}.
1.3 The particular algorithm we use to perform cross-entropy loss minimization with L1 regularization is a mini-batch variant of SAGA, a first-order method detailed in [1]. The appendix paragraph "hyperparameter" gives a brief introduction to the method used.
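As a concrete illustration of 1.1-1.3, the objective is multinomial cross-entropy plus an L1 penalty on the classifier weights; the sketch below uses plain Adam on synthetic data purely for illustration, whereas the paper's response above says the actual solver is the mini-batch SAGA variant cited in [1]:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
N, D, C, lam = 200, 1000, 10, 1e-3       # samples, description features, classes, L1 weight
desc_scores = torch.randn(N, D)          # image-to-description similarities
labels = torch.randint(0, C, (N,))

W = torch.zeros(C, D, requires_grad=True)    # linear head over description features
opt = torch.optim.Adam([W], lr=1e-2)
for step in range(200):
    logits = desc_scores @ W.T                                      # (N, C)
    loss = F.cross_entropy(logits, labels) + lam * W.abs().sum()    # CE + L1 penalty
    opt.zero_grad(); loss.backward(); opt.step()

# Zero-shot style inference with any weight matrix W: argmax over class scores.
pred = (desc_scores @ W.T).argmax(dim=1)
```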
2 While the improvement observed is relatively modest, we believe the exploration of learning with descriptive features is still noteworthy. The application of information theoretic results in this context is itself valuable, as it provides a useful framework for analyzing these features. Furthermore, the MI-based analysis offers a convincing explanation for the strong performance and robustness of both AVD and the zeroshot baseline, which is an interesting aspect on its own.
3 The results have been added; see the updated Figure 3. VD performs worse than AVD but better than the class prompts, suggesting the descriptors are useful.
4 The OOD datasets are ImageNet-A (an adversarial dataset), ImageNet-R (renditions such as art and paintings), ImageNet-Sketch (sketches only), ImageNet-V2 (natural distribution shift), and ObjectNet.
Questions
Figure 4 has been updated according to the comments.
[1] Gazagnadou et al., Optimal Mini-Batch and Step Sizes for SAGA, ICML'19
We appreciate the reviewer's willingness to update their score and would be happy to provide any more information as needed to inform this decision.
Thanks for the responses. Unfortunately, my major concern is not fully addressed as the experiments show limited performance gain in zero-shot classification.
I would like to keep the original rating.
Additional experiments
We added extra experiments in three categories: zeroshot, probe only (so no regularization considered), and WISE-FT
Zeroshot
We added a comparison to zeroshot WaffleCLIP. This method appends random words to the original class prompts "a photo of {}". We tried to append 2, 5, and 10 random words. The results are demonstrated in the updated Table 12 in the appendix. The results suggest that the semantically meaningful descriptors still have an edge over random words.
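For readers unfamiliar with WaffleCLIP, the comparison amounts to appending randomly sampled words to the plain class prompt; a rough sketch (the vocabulary and exact prompt format here are made up for illustration and do not reproduce WaffleCLIP's construction):

```python
import random

random.seed(0)
# Illustrative vocabulary; WaffleCLIP samples random words/characters.
vocab = ["quartz", "lantern", "meadow", "copper", "orbit", "velvet",
         "gravel", "prism", "ember", "willow", "cobalt", "thistle"]

def waffle_prompt(classname, n_random_words):
    """Append n randomly chosen words to the plain class prompt."""
    noise = " ".join(random.sample(vocab, n_random_words))
    return f"a photo of a {classname}, {noise}."

for n in (2, 5, 10):
    print(waffle_prompt("golden retriever", n))
```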
Probing only
- Training AVD with L2 instead of L1 (labeled as "L2 AVD")
- KNN (k=1, since we only consider the few-shot cases) with both image features (KNN image) and AVD features (KNN AVD)
- Sparse learning (that is, learning with the L1 regularizer) using just VD (labeled "VD")
- Random projection (with 300 dimensions)
- Only class prompts with templates. For example, ImageNet has 1000 classes and we consider 7 templates; for each class and each template, we get one text embedding, which amounts to 7000 total prompts. Then we learn an L1 classifier on them (labeled "L1 class prompts")
- MLP with layer sizes 512-4500-1000. This has the same number of parameters as the SLR model. We optimized using AdamW with weight decay of 0.1 and 0.01.

The results are presented in the updated Figure 3 (everything except the MLP) and Table 13 (MLP). We see consistent improvement of SLR-AVD over the other baselines on almost all datasets.
We also compare SLR vs LP when the number of shots is large. For k=64, 256, 1024, SLR still outperforms LP on most datasets. See Figure 16 in the appendix.
WISE-FT
We compare WaffleCLIP, L1 Class prompts, and CoOp vs SLR-AVD; each method is incorporated with WISE-FT. The frontier figure is presented in Figure 15 in the appendix. We see that SLR-AVD consistently outperforms other methods.
The AC recommendation is that the paper is not yet ready for publication due to its limited performance improvements and unclear methodological advantages. Reviewers noted that the proposed SLR-AVD model shows only marginal gains in zero-shot classification compared to existing methods. This modest improvement challenges the significance and practical relevance of the research.
Furthermore, the paper lacks clarity in its presentation, particularly regarding the training and inference processes, and the application of L1-regularized logistic regression. The theoretical alignment with the proposed method is also insufficiently justified, raising questions about the novelty and effectiveness of the approach.
Why not a higher score
Despite the authors' responses to reviewer queries, major concerns remain unaddressed, particularly regarding the method's performance and rationale. These unresolved issues, combined with the lack of comparative analysis with baseline methods and detailed explanations of experimental design, lead to the conclusion that the paper falls short of the standards required for acceptance.
Why not a lower score
N/A
Reject