Exploring Weak-to-Strong Generalization for CLIP-based Classification
Abstract
Reviews and Discussion
This paper aims to explore the weak-to-strong generalization framework in image classification tasks. The paper proposes to learn a classifier for a strong model guided by a pre-trained weak model.
Strengths
- The paper explores an interesting weak-to-strong generalization problem in image classification tasks, which is still underexplored.
- The proposed approach is simple and easy to implement, which only needs to learn a set of class prototypes in downstream tasks.
- Experiments show that the proposed weak-to-strong generalization method outperforms several baselines.
Weaknesses
- Regarding the problem setup, the reviewer finds it unclear why labeled data cannot be directly used to fine-tune a strong model, especially since weak models in the experiments are trained with labeled data. In my opinion, the weak-to-strong generalization framework is more reasonable if the pre-trained CLIP serves as a weak learner to guide a strong model with more learnable parameters.
- Regarding the novelty of the method, the proposed class prototype learning is closely related to training cosine classifiers initialized with CLIP text prompts, a strategy already explored in previous work [1]; additionally, knowledge distillation using unlabeled data has been extensively studied in previous semi-supervised learning research [2].
- The paper may overclaim its contribution of extending weak-to-strong generalization to multi-modal tasks, because the paper only uses the CLIP model and does not experiment on multi-modal datasets.
- Implementation details are missing because the appendix of the paper is not accessible.
[1] Long-Tail Learning with Foundation Model: Heavy Fine-Tuning Hurts. ICML 2024.
[2] Big Self-Supervised Models are Strong Semi-Supervised Learners. NeurIPS 2020.
Questions
- The implementation details mention that the experiments were conducted on a single V100 GPU with 40GB of memory. However, does the V100 GPU actually have 40GB of memory?
Dear Reviewer DfP2,
Thank you for your detailed feedback and insightful questions. Below, we address your concerns:
w-1 Problem Setup:
Our work aims to explore weak-to-strong generalization in controlled settings, leveraging weak models trained with labeled data to simulate scenarios where direct supervision is impractical or unavailable. Currently we follow the setting of [1]. Your suggestion to consider CLIP as a weak model guiding a more parameter-rich strong model is more challenging and aligns with potential extensions of our framework.
w-2 Novelty:
While CPL shares similarities with cosine classifiers initialized using CLIP prompts, our contribution lies in leveraging prototypes for alignment under weak supervision. Furthermore, while knowledge distillation has been extensively studied in semi-supervised learning, our focus is on applying weak-to-strong generalization in multimodal settings, which involves unique challenges not addressed in traditional semi-supervised frameworks.
w-3 Multi-Modal Contributions:
Our current work primarily focuses on CLIP, a Vision-Language Model designed to align images and text in a shared embedding space. Image classification datasets serve as a practical and measurable way to evaluate this alignment by assessing how well the learned representations differentiate between classes. While our study does not extend to explicitly multimodal datasets, it establishes a foundation for understanding weak-to-strong generalization in the context of image-text alignment.
w-4 Implementation Details:
We apologize for the omission of implementation details in the appendix. These will be included in the revised version to ensure reproducibility and clarity.
q-1 GPU Specification:
Thank you for catching this inconsistency. We had a typo here. The 40GB specification refers to the A100 GPUs, and this will be corrected in the revised manuscript for accuracy.
[1] Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., ... & Wu, J. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390.
We will incorporate these clarifications and insights into the revision to strengthen the manuscript's rigor and clarity.
Sincerely,
The Authors
This paper focuses on knowledge distillation for vision-language models, specifically CLIP. It proposes a simple method called CPL, which extracts knowledge from additional weak models and additional unlabeled data to enhance a strong model.
Strengths
The proposed method is quite simple and easy to follow.
Weaknesses
The reviewer may lack familiarity with recent work on weak-to-strong generalization and will therefore await additional perspectives from other reviewers to assess the technical novelty of this paper. Based on my expertise in representation learning and knowledge distillation, I have the following concerns and questions:
- Methodology: The approach proposed in this paper is somewhat straightforward, essentially using the original KD (Knowledge Distillation) loss. This may detract from the novelty of the method. Do the authors have any distinctive insights tailored to the characteristics of vision-language models?
- Experiments: The experimental comparisons presented in the paper lack clarity. For instance:
- Strong Ceiling: What fine-tuning method is applied here?
- KD+LP: What does this refer to, and how does it differ from CPL? From my understanding, CPL is exactly KD between two models.
- Most importantly, how are the two data splits utilized in the various settings?
- Motivation: The authors also need to clarify the relationship between their problem setup and other weakly supervised learning setups, particularly semi-supervised learning. If the Strong Ceiling (directly fine-tuning a strong model with the labeled data) achieves optimal performance (as shown in Table 2), what motivates us to first fine-tune a weak model with the labeled data and then use that weak model to supervise the strong model? In both cases, the labeled data is assumed to be accessible, and the computational cost is similar (if not higher). Conversely, following general knowledge distillation principles, I would expect the two-step approach to achieve better results than Strong Ceiling. This should ideally be independent of whether the teacher model is weak or strong, as even simple label smoothing can sometimes yield improved performance.
Questions
See weakness.
Dear Reviewer L8km,
Thank you for your thoughtful feedback and questions. Below, we address your concerns:
Methodology:
We recognize that CPL may appear straightforward. However, its novelty lies in adapting KD to VLMs by leveraging class prototypes rather than logits. This approach specifically addresses the risk of overfitting strong models to weak models' outputs in multimodal tasks. Additionally, CPL’s design avoids reliance on the text encoder during fine-tuning, reducing computational costs and improving efficiency.
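For clarity, below is a minimal sketch of this design: the prototypes are built once from the text encoder and are then the only trainable parameters, so the text encoder is not needed during fine-tuning or inference. The function names, shapes, and hyperparameters are illustrative, not our exact implementation.

```python
import torch
import torch.nn.functional as F

def init_prototypes(clip_model, class_token_ids):
    """Illustrative sketch: build learnable class prototypes from CLIP text features.

    After this one-time step the text encoder is no longer needed; only the
    (k x d) prototype matrix is optimized during fine-tuning.
    """
    with torch.no_grad():
        text_feats = clip_model.encode_text(class_token_ids)   # (k, d)
    return torch.nn.Parameter(F.normalize(text_feats.float(), dim=-1))

# Only the prototypes receive gradients; both CLIP encoders stay frozen.
# prototypes = init_prototypes(clip_model, class_token_ids)
# optimizer = torch.optim.AdamW([prototypes], lr=1e-3)
```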
Experiments:
- Strong Ceiling: This refers to directly fine-tuning the strong model (CLIP) using the labeled data, without weak supervision.
- KD+LP: This baseline combines the knowledge distillation (KD) loss with a linear probing (LP) layer trained in the same setting. CPL differs by using prototypes as class representations instead of logits, enabling more robust alignment.
- Utilization of the two data splits: one split is used to fine-tune the weak model, while the other provides weak supervision for the strong model in CPL. This split simulates scenarios where weak supervision arises from a less reliable source.
Motivation:
The motivation for CPL lies in scenarios where directly fine-tuning a strong model with labeled data is either infeasible or undesirable due to resource constraints or access limitations. While the Strong Ceiling approach achieves optimal performance in controlled conditions, CPL aims to explore a weak-to-strong paradigm where strong models can be guided by weaker models in data-limited settings. Notably, CPL's performance improvement over KD+LP demonstrates the potential of prototype-based alignment in this context.
We will incorporate these clarifications and insights into the revision to strengthen the manuscript's rigor and clarity.
Sincerely,
The Authors
This paper investigates weak-to-strong generalization in CLIP-based classification. The author introduces a class prototype learning (CPL) method to enhance the classification capabilities of a stronger CLIP model with assistance from a weaker model. CPL achieves this by developing more representative prototypes for each category. Empirical results demonstrate that CPL consistently outperforms baseline methods across various weak models.
Strengths
- This paper is the first to explore weak-to-strong generalization in CLIP-based classification.
- This paper proposes a straightforward yet effective method for achieving good performance.
Weaknesses
- The aim of weak-to-strong generalization is to mitigate harmful outputs while enhancing model performance. However, in your experimental results, while an improvement in the strong model's performance is evident, the aspect of protection against harmful outputs is not sufficiently demonstrated.
- You have only validated your method on the DomainNet dataset. We recommend testing the effectiveness of your approach on additional datasets.
Questions
See above.
Dear Reviewer Fics,
Thank you for your detailed feedback and suggestions. Below, we address your concerns:
w-1 The aspect of protection against harmful outputs is not sufficiently demonstrated. Our work focuses on CLIP-based weak-to-strong generalization rather than text generative models. While the study [1] introduced weak-to-strong generalization to address harmful outputs in text-only models, our aim is to explore its potential in multimodal settings, specifically for classification tasks. Evaluating CPL's effectiveness in mitigating harmful outputs remains an exciting direction for future work.
w-2 You have only validated your method on the DomainNet dataset. We recommend testing the effectiveness of your approach on additional datasets. DomainNet was chosen due to its diverse domains, allowing us to simulate different knowledge transfer scenarios for strong models. While we recognize the importance of broader dataset evaluations, many benchmarks may require modification to enable similar simulations.
[1] Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., ... & Wu, J. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390.
We will incorporate these clarifications and insights into the revision to strengthen the manuscript's rigor and clarity.
Sincerely,
The Authors
The paper proposes Class Prototype Learning (CPL), a novel method that aims to improve the classification performance of CLIP models via learning more representative prototypes for each category under weak supervision. Despite using a simple loss function, CPL achieves clear improvements over baselines on the DomainNet dataset (including 6 different styles: Clipart, Infograph, Painting, Quickdraw, Real, and Sketch). This work highlights the potential of weak-to-strong generalization for image-text alignment, providing a new direction for future fine-tuning research on VLMs.
Strengths
1. As claimed by the authors, their CPL method achieves SOTA across many kinds of weak models (including ResNet and CvT) when tested on six distinct domains of the DomainNet dataset.
2. Unlike traditional knowledge distillation, weak-to-strong generalization uses the weaker model as the teacher, and this work is the first to introduce this knowledge distillation setting to VLMs. In fact, the research is quite interesting and meaningful.
Weaknesses
1. More experiments are needed to increase confidence in your method, as experimental comparisons with counterparts on many important datasets are missing.
2. There are some imprecise statements in the paper, and the CPL method lacks a visual demonstration of results. It is necessary to add the relevant content to the appendix.
Please see the questions section for more details.
Questions
1. Why do you conduct experiments on the DomainNet dataset only? As you aim to improve the classification performance of CLIP models, consider testing on ImageNet (including the comprehensive Stanford version and the -V2/-A/-R/Sketch variants) and other smaller but typical datasets (Flower102, DTD, Pets, StanfordCars, UCF101, Caltech101, Food101, SUN397, Aircraft, EuroSAT, etc.). Conducting experiments on those datasets would significantly enhance the credibility of your CPL method.
2. Some statements in your paper are not rigorous and lack clarity. For example, in lines 242-243, 'students' in 'Here, the weaker models act as students guiding the stronger model' should be 'teachers', if you mean the knowledge distillation in weak-to-strong generalization. Also, in line 489, 'RLHF' is mistakenly written as 'RLFH'.
3. The paper lacks an introduction to prototype learning, which not only reduces the readability of the article but also violates the academic requirement of referencing previous work. Additionally, it is better to give references for terminology like RLHF and AdaptConf, even though the terminology is well-known.
4. The paper should specify the basic conditions and configurations of the experiments. For example, you should clarify which image encoder and text encoder are used in the strong model (CLIP), and state that all strong models referred to in this paper are configured with this specific CLIP setup. This avoids confusion for readers and the need to repeatedly check the context for confirmation.
5. Other popular VLMs (such as Llama and LLaVA) could be introduced as strong models and compared against CLIP. That would further enhance the generalizability and universality of your paper.
In fact, I have no idea why your work is so rough. Maybe it is due to a lack of time before the submission deadline? If so, you can supplement the relevant information during the subsequent rebuttal phase. I hope you can carefully check the references and mathematical derivations in the paper again to ensure the accuracy and standardization of the article. Once the requirements are met, I will reconsider my review score.
Dear Reviewer 4hGp,
Thank you for your detailed feedback and suggestions. Below, we address your concerns:
W-1 Dataset Selection: We focused on DomainNet due to its six diverse domains (e.g., Clipart, Infograph, Painting, etc.), which naturally simulate scenarios where weak supervision captures distinct knowledge. This diversity allows us to evaluate CPL in settings where the knowledge transfer challenges vary, mimicking real-world weak-to-strong generalization scenarios.
We agree that evaluating CPL on additional benchmarks like ImageNet and other datasets (e.g., Flowers, Pets, Aircraft) would further demonstrate its robustness and applicability. However, some of these datasets may require modifications to align with our experimental framework. For instance, in our setup, weak models must be fine-tuned to generate supervision signals compatible with the strong model’s learning process.
W-2 Imprecise Statements: Thank you for highlighting issues with imprecise phrasing and the need for additional clarity. We will correct all identified inaccuracies, such as the misuse of "students" instead of "teachers" in lines 242-243 and the typo in line 489 ("RLFH" instead of "RLHF").
Additionally, to improve the paper's clarity and visual appeal, we plan to include visual demonstrations of results in the appendix. For example, we will present learned prototype representations and alignments between weak and strong models, providing deeper insights into the behavior and effectiveness of CPL.
Q-1 Why Only DomainNet? DomainNet was chosen because it contains multiple domains that simulate different styles and knowledge representations (e.g., Infograph vs. Real). This makes it a suitable benchmark for validating weak-to-strong generalization, as it naturally captures diverse scenarios where the weak model provides complementary or incomplete knowledge.
Expanding CPL to benchmarks like ImageNet and others (e.g., Flowers, Pets, SUN397) is a logical extension of our work. However, many of these datasets lack the inherent diversity found in DomainNet and may require modifications to adapt them to our framework. For instance, splitting the dataset in a way that mimics weak-to-strong supervision would involve generating weak model outputs for specific domains or classes.
Q-2 Imprecise Statements: We appreciate your attention to detail. To address the specific examples provided:
- In lines 242-243, we will revise the phrase "students guiding the stronger model" to "teachers guiding the stronger model," accurately reflecting the role of weak models in our framework.
- In line 489, we will correct the typo "RLFH" to "RLHF."
These revisions, along with a thorough review of the entire manuscript, will ensure greater precision and consistency in our language.
Q-3 Lack of Prototype Learning Introduction: We agree that a concise introduction to prototype learning is essential to improve the paper’s readability and adherence to academic standards. In the revised version, we will expand the related work section to include a summary of prototype learning techniques and how CPL builds upon them.
Furthermore, we will ensure all technical terms, such as RLHF and AdaptConf, are accompanied by references to their original works. Even though these terms are well-known, providing context will help readers unfamiliar with specific details better understand our approach.
Q-4 Experiment Configurations: We will include detailed descriptions of the experimental setup in the revised manuscript. For example, we will explicitly state the configurations of the strong model, including the image encoder (e.g., ViT or ResNet) and text encoder used in CLIP. Additionally, we will clarify that all strong models in this paper are based on a specific CLIP setup.
Q-5 Comparison with Other VLMs: We appreciate the suggestion to compare CPL against other VLMs, such as LLaVA and LLaMA. While our current focus is on CLIP-based classification, LLaMA and LLaVA are more complex than CLIP, with different training strategies and evaluation methods. In this study, we mainly explore weak-to-strong generalization based on the CLIP model, and we will consider studying these models in the future.
We will incorporate these clarifications and insights into the revision to strengthen the manuscript's rigor and clarity.
Sincerely,
The Authors
This paper investigates the concept of weak-to-strong generalization within the context of vision-language models (VLMs), specifically focusing on the CLIP model. The authors proposed class prototype learning (CPL), which performs logit distillation between a supervised learning (weak) model and the class prototypes of a CLIP model (strong). The effectiveness of CPL is demonstrated through experiments on the DomainNet dataset, reporting a 3.67% improvement over baseline methods.
Strengths
- This paper explores weak-to-strong generalization -- how to train models when they surpass human knowledge. Unlike previous works that consider LLMs, this work studies CLIP, a VLM.
- Experiments show that the proposed method shows improvements over other CLIP tuning methods.
Weaknesses
- The paper starts with an ambitious story (weak-to-strong generalization) but remains unclear how the setting and solution can benefit human-surpassing models. The scope of the paper narrows down to a specific application in CLIP classification, which is disconnected from the overarching goals of weak-to-strong generalization. If the major novelty is considering a VLM, why not consider LLaVA, BLIP2, or similar models?
- The setting and proposed method appear to be somewhat ad-hoc focused on improving CLIP classification through distillation from a supervised learning model. This raises questions about the generalizability of the findings beyond the specific context of classification and CLIP. For one thing, can the method help CLIP recognize unseen classes or unseen concepts? For another, can the method improve generative VLMs?
- The method itself is naive and lacks novelty. Can the authors clarify more on the novel contributions compared with prior methods?
- The writing lacks clarity and coherence, making it difficult for readers to grasp the main contributions and findings. The paper spends excessive time on background information and settings, which detracts from the core message. Also, the setting of data splits (L315-L323) is confusing and hard to understand.
Questions
N/A
Dear Reviewer Lxe8,
w-1 Connection to Weak-to-Strong Generalization Goals:
We acknowledge the disconnect between our specific CLIP-based classification experiments and the broader goals of weak-to-strong generalization. Our work aims to explore the feasibility of this paradigm in VLMs starting with CLIP classification as a controlled setting. Expanding to models like LLaVA or BLIP2 is a natural next step to enhance applicability and address tasks beyond classification.
w-2 Generalizability of CPL:
We appreciate the question about generalization. While the current work focuses on classification, the principles of CPL could extend to help CLIP recognize unseen classes by refining prototype representations. However, applying CPL to generative VLMs is less straightforward, particularly in adapting prototype-based learning to generation tasks.
w-3 Novel Contributions:
Our method integrates weak supervision with class prototype learning, which contrasts with prior works that rely on logits or direct knowledge distillation. CPL’s emphasis on prototype alignment provides robustness against overfitting to weak model outputs, as evidenced by its performance gains on challenging domains. We will refine the manuscript to better highlight these distinctions.
w-4 Clarity and Coherence:
We acknowledge the writing issues and will streamline the introduction and background sections to focus on our contributions. Additionally, we will improve the explanation of data splits (e.g., L315-L323) and ensure the experimental setup is more accessible.
We will incorporate these clarifications and insights into the revision to strengthen the manuscript's rigor and clarity.
Sincerely,
The Authors
This paper aims to enhance the classification capability of the CLIP model given a weak model and an unlabeled training set. It proposes a new method called class prototype learning (CPL). Specifically, for a classification task with k categories, CPL first initializes k prototypes using the text features, and calculates the strong logits by matching the image features and the prototypes. After that, it utilizes a weak model to calculate the weak logits, and then uses the weak logits to teach the strong logits by optimizing the class prototypes. In the inference stage, the model is able to make predictions by matching the image features and the learned prototypes. The authors also conduct experiments to support the proposed method.
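For concreteness, my reading of the described training and inference procedure can be sketched as follows; the tensor shapes, temperature, and exact form of the distillation loss are my assumptions, not necessarily the authors' implementation.

```python
import torch
import torch.nn.functional as F

def cpl_loss(prototypes, image_feats, weak_logits, temperature=0.07):
    """One CPL training step as described above (illustrative sketch).

    prototypes:  (k, d) learnable, initialized from CLIP text features
    image_feats: (B, d) from the frozen CLIP image encoder
    weak_logits: (B, k) predictions of the fine-tuned weak model
    """
    img = F.normalize(image_feats, dim=-1)
    proto = F.normalize(prototypes, dim=-1)
    strong_logits = img @ proto.t() / temperature        # match images to prototypes
    # The weak logits teach the strong logits; only the prototypes are optimized.
    return F.kl_div(F.log_softmax(strong_logits, dim=-1),
                    F.softmax(weak_logits, dim=-1),
                    reduction="batchmean")

# Inference: predict the class whose learned prototype best matches the image feature.
# preds = (F.normalize(feats, dim=-1) @ F.normalize(prototypes, dim=-1).t()).argmax(dim=-1)
```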
Strengths
- The idea of weak-to-strong generalization is somewhat meaningful.
- The efficiency is improved by removing the text encoder in the inference stage.
Weaknesses
- I am quite confused by the 'unlabeled data' setting. Although CPL does not use labeled data, the weak model is trained with ground-truth labels. Maybe you just want to verify the idea of 'weak-to-strong'. However, in practice, you will not have a weak model that happens to have exactly the k candidate categories (particularly when the weak model is a vision-only model), so it is infeasible to optimize the CPL loss to teach the k prototypes.
- The ideas of using the text encoder for initialization and dropping the text encoder during the inference stage have been proposed by multiple previous works [1-3].
- To demonstrate the efficiency of CPL, it is better to include the complexity analyses.
- The related works regarding unsupervised learning with CLIP are not discussed [4-5].
[1] Masked Unsupervised Self-training for Label-free Image Classification.
[2] Parameter-Efficient Long-Tailed Recognition.
[3] Exploring Vision-Language Models for Imbalanced Learning.
[4] Unsupervised Prompt Learning for Vision-Language Models.
[5] PromptKD: Unsupervised Prompt Distillation for Vision-Language Models.
Questions
See weaknesses.
Dear Reviewer NPKw,
Thank you for your detailed feedback and suggestions. Below, we address your concerns:
w-1 In practice, you will not have a weak model that happens to have exactly the k candidate categories (particularly when the weak model is a vision-only model), so it is infeasible to optimize the CPL loss to teach the k prototypes. This design aims to simulate the weak-to-strong generalization problem, aligning with prior studies (e.g., [1]). While our study focuses on image classification tasks, it is easy to obtain a label set (which is how we build the candidate categories), so the CPL loss remains applicable. We will clarify this in the revised version to make sure readers can easily understand the setup.
w-2 The ideas of using the text encoder for initialization and dropping the text encoder during the inference stage have been proposed by multiple previous works. We recognize that leveraging text-encoder initialization and excluding the text encoder during inference has been explored. Our contribution focuses on enhancing weak-to-strong generalization. While the underlying mechanisms share similarities, we believe our application offers a unique perspective on a different set of challenges; in particular, we found this technique to be especially effective in the context of weak-to-strong generalization. We will emphasize this distinction in the revised manuscript.
w-3 To demonstrate the efficiency of CPL, it is better to include the complexity analyses. CPL reduces computational overhead by excluding the text encoder during training and inference. We will include a detailed complexity analysis in the revised version.
w-4 The related works regarding unsupervised learning with CLIP are not discussed. We appreciate your suggestion to include relevant works on unsupervised learning with CLIP. We will expand this discussion in our revision.
[1] Burns, C., Izmailov, P., Kirchner, J. H., Baker, B., Gao, L., Aschenbrenner, L., ... & Wu, J. (2023). Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390.
We will incorporate these clarifications and insights into the revision to strengthen the manuscript's rigor and clarity.
Sincerely,
The Authors
I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.