PaperHub
Rating: 3.5/10 (Rejected; 4 reviewers)
Scores: 3, 3, 3, 5 (min 3, max 5, std 0.9, mean 3.5)
Correctness: 2.8 · Contribution: 2.3 · Presentation: 2.5
ICLR 2025

Weak Supervision from Vision-Language Models to Self-Improve on Downstream Tasks

OpenReview · PDF
Submitted: 2024-09-13 · Updated: 2025-02-05

Abstract

Keywords
Semi-supervised Learning, Vision-language Model

Reviews & Discussion

Review (Rating: 3)

The authors present SelfPrompt, a prompt-tuning approach for vision-language models (VLMs) in a semi-supervised learning setup. SelfPrompt introduces a weakly-supervised sampling technique that selects a diverse and representative labelled set, a cluster-guided pseudo-labelling method that improves pseudo-label accuracy, and a confidence-aware semi-supervised learning module that maximizes the utilization of unlabelled data by combining supervised learning and weakly-supervised learning.

Strengths

  • The paper is well-structured and easy to follow.

  • The authors provide thorough ablation studies, which are helpful for understanding the different design choices in applying pseudo-labelling to semi-supervised learning of CLIP.

  • The qualitative examples as well as the failure cases effectively help the reviewer understand how the model performs on specific examples and the limitations.

Weaknesses

  • Lack of explanation for some hyperparameters, such as $\lambda$, the number of quantiles, and the number of samples (for example, for the number of quantiles, why choose such a wide range as 3, 5, 20?).
  • Why does CPL fall short against SelfPrompt under 4-shot experiments?
  • What are the results if we set the number of samples larger than 50, given that Table 7c suggests the larger, the better?
  • If we drop different proportions of the high- and low-confidence sets, does it improve the performance?
  • The part I am most concerned about is dropping the high- and low-confidence subsets. For example, dropping the samples with probability <= 50% is reasonable, but dropping both high and low could remove good information and make the model slow to converge.

Questions

See Weaknesses.

Comment

We thank the reviewer for the positive feedback on the clarity of the paper's structure, the thoroughness of our ablation studies, and the usefulness of the qualitative examples in illustrating the model's performance and limitations. Below, we provide detailed responses to the questions and suggestions.

Lack of explanation for some hyperparameters.

We have revised the manuscript to ensure all the hyper-parameters are clearly mentioned by name throughout the paper. Specifically:

$\lambda$: loss factor controlling the importance of the partial-label learning loss.

$q$: the number of quantiles into which the unlabelled samples are divided before filtering.

$p$: the number of samples selected from each cluster for pseudo-labelling.

For the number of quantiles ($q$), we initially reported a wide range (including the optimal value) for brevity. However, we also explored other intermediate values. In the revised manuscript, we have expanded Table 7c (included below) to show these intermediate values (e.g., $q=10$). Our experiments indicate that the best performance is achieved at $q=5$.

| $q$ | Accuracy |
| --- | --- |
| 3 | 78.25 |
| 5 | 79.33 |
| 10 | 79.12 |
| 20 | 78.75 |

Why does CPL fall short against SelfPrompt under 4-shot experiments?

We believe CPL falls short against SelfPrompt in 4-shot experiments because of its random selection of labelled samples, which often includes noisy and less representative data. This results in a suboptimal model that struggles to generalize effectively. The suboptimal model, in turn, generates noisy pseudo-labels through CPL's incremental pseudo-labelling approach, which results in inefficient use of unlabeled data and a decline in the final model's performance. On the other hand, our proposed solution selects the most representative set of samples as the labelled set, and our cluster-guided pseudo-labelling and confidence-aware semi-supervised learning ensure optimal utilization of the labelled set budget and learning from the unlabelled set.

Results for a number of samples larger than 50?

The performance saturates at $p = 50$. In the revised manuscript, we now include results for a larger number of samples selected per cluster ($p$). Specifically, we show that increasing $p$ to 75 results in a slight performance decline. This decrease occurs because higher values of $p$ increase the likelihood of including incorrect pseudo-labels, as cluster-guided pseudo-labelling tends to select samples farther from the cluster centers.

| $p$ | Accuracy |
| --- | --- |
| 5 | 77.01 |
| 20 | 79.03 |
| 50 | 79.33 |
| 75 | 79.31 |
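
For concreteness, the sketch below illustrates this selection step: a minimal, assumed implementation (not the authors' code) that clusters frozen vision-encoder embeddings with k-means and keeps the $p$ samples nearest each centroid. The names `features` and `select_per_cluster` are ours.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_per_cluster(features: np.ndarray, n_clusters: int, p: int) -> np.ndarray:
    """Return indices of the p samples closest to each cluster center."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(features)
    selected = []
    for k in range(n_clusters):
        members = np.where(km.labels_ == k)[0]
        # Distance of each member to its own centroid.
        dist = np.linalg.norm(features[members] - km.cluster_centers_[k], axis=1)
        selected.append(members[np.argsort(dist)[:p]])  # the p nearest members
    return np.concatenate(selected)
```

Larger $p$ walks further from the centers, which is why the table above shows a slight decline at $p=75$.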

Potential impact of dropping different sizes of high- and low-confidence sets?

Thank you for this interesting question. Performance can vary with the proportion of high- and low-confidence samples removed from the unlabelled set. As discussed in Section 3.2, we control the number of samples to be dropped with a quantile value, $q$. For example, $q = 20$ indicates that the unlabeled dataset is divided into 20 quantiles, and we remove the lowest and highest 5% of samples while retaining 90% of the samples. As shown in Table 7 (b), the best result is obtained when we filter the most and least confident 20% of the samples ($q=5$).
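
As an illustration of this filter, here is a hedged sketch (our own minimal implementation, not the paper's code) of the quantile-based confidence filtering; `conf` stands for the pre-trained VLM's zero-shot confidence per unlabelled sample.

```python
import numpy as np

def quantile_filter(conf: np.ndarray, q: int) -> np.ndarray:
    """Keep indices outside the lowest and highest 1/q confidence quantiles."""
    order = np.argsort(conf)          # indices sorted by ascending confidence
    per_quantile = len(conf) // q     # samples per quantile
    return order[per_quantile : len(conf) - per_quantile]

# Example: q = 20 drops the bottom and top 5% and retains 90%;
# q = 5 drops the bottom and top 20% and retains 60%.
```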

Comment

Concern about dropping high- and low-confidence subsets.

The primary reason for excluding both high-confidence and low-confidence samples is that retaining high-confidence samples (in the labelled set) results in representations that fail to generalize effectively to distributions outside the selected labelled set. Previous works, such as Roy et al. (2024) and Sarkar et al. (2023), have demonstrated that overconfidence on a specific distribution hinders effective generalization. This issue is particularly pronounced when the model has not been fine-tuned on the distribution of the specific downstream task. Accordingly, we hypothesize that selecting a more diverse and representative set of samples and excluding both high- and low-confidence samples would improve generalization. To validate this hypothesis, we conduct experiments comparing the outcomes of different sampling strategies: (a) removing only the low-confidence samples, (b) removing only the high-confidence samples, (c) keeping only the high-confidence samples, and (d) keeping only the low-confidence samples. The results, averaged across all datasets with $q=5$, are presented below. These experiments were performed using only the weakly-supervised sampling module to isolate the behaviour of this specific module. As we find from the table below (Table 10 in the revised manuscript), excluding both the high- and low-confidence samples yields the best performance. This discussion is included in Appendix A.2 of the revised manuscript.

| Setting | Accuracy |
| --- | --- |
| Ours (removes both high- and low-confidence samples) | 73.08 |
| Removing only the low-confidence samples | 72.03 |
| Removing only the high-confidence samples | 72.43 |
| Keeping only the high-confidence samples | 71.78 |
| Keeping only the low-confidence samples | 72.55 |

References:

  • Roy, S. & Etemad, A. Consistency-guided prompt learning for vision-language models. In ICLR, 2024.
  • Sarkar, P., Beirami, A., & Etemad, A. Uncovering the hidden dynamics of video self-supervised learning under distribution shifts. In NeurIPS, 2023.
Comment

I thank the authors for their detailed rebuttal. However, my concerns remain after reading it.

  1. The method for selecting the $q$ value relies on random selection, which means that when fine-tuning on different datasets, we need to determine the optimal value again. The same applies to the $p$ value. Furthermore, the difference between using a $p$ value of 50 and 75 is minimal, so asserting that "performance saturates at p = 50" is somewhat ambiguous.

  2. In [1], it is demonstrated that utilizing both high-confidence and low-confidence predictions can enhance performance. That study therefore suggests that eliminating excessively high- and low-confidence predictions may discard useful information.

  3. Due to the random selection of labeled samples, which often includes noisy and less representative data, the performance of CPL falls short compared to SelfPrompt. I wonder whether this effect is also evident in 8-shot and 16-shot scenarios.

[1] Nguyen, Khanh-Binh, and Joon-Sung Yang. "Boosting Semi-Supervised Learning by bridging high and low-confidence predictions." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.

Comment

Requires dataset-specific hyperparameters.

We would like to clarify that our method does not rely on dataset-specific hyperparameters. This is a strength of our approach, as it demonstrates robust performance with a single set of hyperparameters (as reported in the implementation details). As with any method, there are some hyper-parameters to tune, and in our case, they were tuned to $p=50$ and $q=10$, accompanied by experiments showing optimal performance (see Table 7). Once again, we would like to reiterate that these hyperparameters are constant across all 13 datasets.

[1] demonstrates that utilizing both high-confidence and low-confidence predictions can enhance performance.

Our method shares the same core concept as [1], as our confidence-aware semi-supervised framework learns from both high- and low-confidence samples. Specifically, high-confidence samples are used in a fully supervised manner, while the remaining samples (irrespective of their confidence) are leveraged through weakly-supervised learning (please refer to the confidence-aware semi-supervised learning module, Section 3.2, page 6). However, we propose excluding the most and least confident samples (as predicted by the pre-trained encoder, not the online encoder used during semi-supervised training) from being selected as the labelled set, aiming instead to select the most representative samples. As discussed in Section A.2 (page 14), the most and least confident samples are not the most representative samples for labelling.
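
To make the hybrid objective concrete, below is a minimal PyTorch sketch. It is an assumed form, not the authors' implementation: the weak term here is one possible partial-label style loss over each low-confidence sample's top-$t$ candidate classes (suggested by the paper's $\mathrm{top}_t(\cdot)$ notation), and `hybrid_loss`, `logits_conf`, `logits_weak`, and `lam` (standing in for $\lambda$) are our own names.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(logits_conf: torch.Tensor, pseudo_labels: torch.Tensor,
                logits_weak: torch.Tensor, top_t: int = 3, lam: float = 1.0):
    # Fully supervised term on the confident pseudo-labelled samples.
    loss_sup = F.cross_entropy(logits_conf, pseudo_labels)
    # Weak term: push probability mass onto each low-confidence sample's
    # top-t candidate classes (one possible partial-label style loss).
    probs = logits_weak.softmax(dim=-1)
    cand_mass = probs.topk(top_t, dim=-1).values.sum(dim=-1)
    loss_weak = -cand_mass.clamp_min(1e-8).log().mean()
    return loss_sup + lam * loss_weak
```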

Is the performance difference evident with 8-shot and 16-shot scenarios?

Yes, SelfPrompt outperforms CPL across all labelled set sizes, including the 8-shot and 16-shot scenarios, as shown in the table below (on the RESISC45 dataset only, due to the time constraints of the rebuttal).

| Setting | CPL | SelfPrompt |
| --- | --- | --- |
| 1-shot | 73.44 | 84.09 |
| 2-shot | 80.98 | 85.58 |
| 4-shot | 72.84 | 85.60 |
| 8-shot | 80.51 | 85.92 |
| 16-shot | 81.12 | 86.85 |
Comment

Dear Reviewer mPat,

As we approach the end of the rebuttal period, we would like to offer any further clarifications regarding our paper if there are additional questions. In response to your original comments, we have clarified our intuition on the usage of high- and low-confidence samples, provided additional results on different shots, details on hyper-parameters, and additional sensitivity studies. We hope these changes, along with additional results (Section A.2), further explanations of the proposed solution (Section 3), expanded related work on weakly-supervised learning (page 3), comparisons to more methods (Tables 1, 2), pseudo-code (Section A.3), the slightly adjusted positioning of our paper with results for both the standard and our proposed active semi-supervised learning setups (suggested by Reviewer 352h and Reviewer mPat), and other minor changes made during the rebuttal, have addressed your concerns and strengthened the overall quality of our paper. If you believe these updates have successfully addressed your concerns, we kindly ask you to consider raising your score to reflect the improved state of the paper.

Once again, we would like to sincerely thank you for your comments and constructive discussions, which have without a doubt significantly improved our paper.

Best regards,

Authors

Review (Rating: 3)

This paper proposes SelfPrompt, a prompt-tuning approach for VLMs. SelfPrompt consists of three modules: a weakly supervised sampling technique, a cluster-guided pseudo-labeling method, and a confidence-aware semi-supervised learning module. The authors conducted extensive evaluations across 13 datasets, and SelfPrompt achieves state-of-the-art performance.

Strengths

  • Tuning VLMs in the semi-supervised learning setting is an important problem as it can utilize abundant unlabeled data.

  • Using a limited budget to label important samples is an important and practical problem. However, this is a new setting different from standard semi-supervised learning in [3] (refer to weaknesses).

  • The authors conducted extensive experiments and SelfPrompt achieved much better performance than existing methods (but with potential comparison problems, see weaknesses).

Weaknesses

  1. Weakly-supervised learning (WSL) is an important topic in machine learning. WSL frequently appears and seems to be an important component of this paper. Authors should discuss related WSL papers in the related work section and other parts of this paper.

  2. The presentation of the problem setting is unclear. The authors define the setting as being given $M$ unlabeled data and $N = n \times c$ labeled data. However, in line 180, the authors also state, "In a few-shot learning setting with $N$ samples, the common strategy is to randomly select $n$ samples per class from the unlabeled set $U$ to form the labeled set." So, how is the labeled dataset formed? Are we given labeled data, or do we select labeled data from the unlabeled dataset and assign pseudo labels to them to form a labeled dataset? In standard semi-supervised learning, both labeled and unlabeled datasets are provided, so I am confused by the setting in this paper. The authors are suggested to write this part more clearly. Below, I can only continue by guessing that the setting is selecting important data (the 'limited budget' in the paper) and then assigning ground truth labels to them as the labeled dataset. If this is true, then:

  • The setting in this paper is not standard semi-supervised learning (SSL). In standard SSL, we are given both labeled and unlabeled datasets (we cannot select which data are labeled), and methods are designed to address this setup. However, in lines 38–41, the authors claim that one limitation of existing methods is that "existing methods typically select the labeled sample set randomly." However, this is not a limitation of existing methods. In standard SSL, we cannot control the selection of labeled data, so random selection is used to mimic practical scenarios, and it is not part of the method but the experiment setting. Using a limited budget to label important samples is an important problem, but it is different from the standard SSL problem in [3] and should be regarded as a new setting. So, the experimental comparison is not proper, as other methods are designed for standard SSL. It is better to make the setting more clear and show that the performance of these existing methods can be boosted using these selected labeled datasets. To show the effectiveness of the later part of pseudo-labeling for SSL, the labeled set should be the same.
  3. One motivation for filtering is: "Highly confident samples offer minimal information gain, as the model is already certain of their classification." However, is this claim really correct? Are there any references to support this? In my view, in many semi-supervised learning works [1, 2], the authors have proposed using thresholds to select highly confident samples for training and to filter out unconfident, noisy data. This suggests that highly confident samples are important for training. I am not sure whether the motivation in this paper is correct and appropriate. Moreover, "highly confident samples" is very vague. Can the authors use something like probability to describe it?

  4. It is unclear why the authors perform k-means with $N$ clusters instead of $C$. $N = n \times C$ is the number of labeled data, with no semantic meaning (in my view). To me, k-means with $C$ clusters makes more sense, since $C$ is the number of classes, and clustering with respect to classes seems appropriate. However, there seems to be no semantic meaning for $N$, so why are we clustering samples into $N$ clusters? Could the authors further clarify the motivation of this part?

  5. The authors claim that "Cluster-guided pseudo-labeling" does not rely on VLM predictions for pseudo-labeling. However, this claim is not entirely convincing. As stated in line 256, "the clusters are formed based on embedding similarity," and the embedding similarity is derived from VLM pretraining knowledge. So, if there is bias in the VLM-generated pseudo-labels, the embeddings generated by VLMs are likely to contain similar bias.

  6. The module "Confidence-aware semi-supervised learning" contradicts the earlier claim that "Highly confident samples offer minimal information gain, as the model is already certain of their classification." If the authors claim that highly confident samples are not important and exclude them from the pseudo-labeled dataset, why do they now claim that confident samples are important and use them here? Moreover, this module seems similar to [3]; could the authors discuss the difference?

  7. Reproducibility: The authors did not report the values of the hyperparameters $p$, $q$, $\tau$, and $\lambda$ in the experiment section. For reproducibility, the authors should report these values and discuss how to select them in the implementation details section.

  8. Undefined and unclear variables:

  • In line 264, what is $j$ in $(x_{j1}, y_j)$?

  • The definition of $X_{\text{weak}}$ is unclear. The authors only define $x_i \notin X^+$, but where does $x_i$ come from?

  • $L_{\text{weak}}$ is not defined.

  • How is the function $\mathrm{top}_t(\cdot)$ defined?

  • $f$ is not defined. Is it $\theta$? Moreover, the authors should define $\lambda$ as a hyperparameter in the paper.

  • Subscripts are very confusing: the authors use subscripts to distinguish different vectors but also use subscripts to represent the corresponding elements of a vector. For example, the authors use $s_i$ to represent the vector corresponding to $x_i$, but in Eq. 6, they seem to use $s_c$ to represent the $c$-th element.

Minor problems:

The presentation of this paper is not very polished. The authors should pay attention to the math formulas and variable definitions. The form of variables should be consistent. Some of the issues with the math formulas are listed below:

  • The input $x_i$ is a vector, and it should be represented in bold.
  • In Eq. 2, $p_{ic}$ is not defined. Maybe it should be $p_i^c$?
  • In line 238, the form of $Q_k$ is different from line 236.
  • In line 262, $\mathbf{z}^*_j$ is not defined, and $\|\cdot\|^2$ is not defined.
  • The form of $P_j$ is different from how it is defined in line 260.
  • The form of $X_p$ in Eq. 4 is different from how it is defined in line 264.
  • $s_i$ is a vector but is not represented in bold.
  • In Eq. 5, $X_L$ is not in a consistent form.
  • SOTA is not defined. The authors should introduce it as "state-of-the-art (SOTA)" and then use the abbreviation.

References

[1] FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence

[2] FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling

[3] Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data

Questions

How do the three datasets ($X_L$, $X^+$, and $X_{\text{weak}}$) change during the tuning process? Could the authors provide pseudo-code to better understand how the method is implemented?

Comment

Does cluster-guided pseudo-labelling not rely on VLM predictions for pseudo-labelling?

The clusters are formed using only the image embedding from the vision encoder of the VLM and do not utilize the prediction (using both vision and text encoder) for pseudo-labelling. Specifically, unlike prior works, we do not utilize the zero-shot prediction from the VLM to generate the pseudo-labels. To avoid any confusion, we have revised the concerning line from "We propose a novel clustering-guided pseudo-labelling approach that does not rely on the VLM to generate the pseudo-label." to the following: "We propose a novel clustering-guided pseudo-labelling approach that does not utilize the zero-shot prediction from the VLM as the pseudo-label."

Additionally, the bias in the zero-shot predictions of the VLM does not necessarily carry over to our cluster-guided pseudo-labelling approach, as it only utilizes the vision encoder of the pre-trained VLM, while the VLM prediction requires both the text and vision encoders and is often sensitive to the prompt used with the text encoder for zero-shot prediction. Our empirical evaluation, shown in Figure 1 (left), demonstrates that the proposed pseudo-labelling method consistently generates more accurate pseudo-labels compared to prior approaches.

Values of the hyper-parameters $q$, $\tau$, and $\lambda$?

In the revised manuscript, we discuss the values of $q$, $\tau$, and $\lambda$ in the implementation details, along with the sensitivity study on $q$ and $\tau$ in Table 7. Specifically, all the main results are reported with $q = 5$, $\tau = 0.05$, and $\lambda = 1$.

Undefined and unclear variables.

Thank you for pointing this out. Following is a clarification of the variables:

  • $j \in \{1, 2, \ldots, N\}$ is the cluster index.

  • $\mathcal{X}_{\text{weak}}$ is composed of the remaining samples from the unlabelled set $U$ that are not part of the pseudo-label set $\mathcal{X}^{+}$. We have revised its definition as: $\mathcal{X}_{\text{weak}} = \{(x_i, s_i) \mid x_i \in U \setminus \mathcal{X}^{+}\}$.

  • $X_{\text{weak}}$ was a typo intended to be $\mathcal{X}_{\text{weak}}$.

  • $f(x)$ is the prediction of the VLM, $f(x) = p(y \mid x; \theta, \phi, P)$.

  • Subscripts confusion: to avoid any confusion with the subscripts, we have revised Eq. 6 to use a superscript to represent the elements of the vector $s$.

We have now clarified all of these in the paper on lines 254, 291, 295, 184, and 299, respectively.

Minor problems.

Thank you very much for pointing these out. All the minor problems are now addressed in the revised manuscript.

How do the three datasets $X_L$, $\mathcal{X}^{+}$, and $\mathcal{X}_{\text{weak}}$ change during the tuning process?

$X_L$ is the labelled set, which remains constant throughout training. Following prior works such as GRIP and CPL, the training is conducted over $S$ sessions, where $\mathcal{X}^{+}$ and $\mathcal{X}_{\text{weak}}$ are linearly expanded by including samples from the unlabelled set $U$.

Could authors provide a pseudo-code to better understand how the method is implemented?

Certainly. Please refer to the pseudo-code (Algorithm 1) in the Appendix of the revised manuscript.

Comment

We thank the reviewer for the positive feedback on our paper's extensive experiments and the method's performance. Below, we address the specific weaknesses and provide detailed responses.

Authors should discuss related weakly supervised learning papers.

In the revised manuscript, we have added a section to discuss the related work on weakly supervised learning. Specifically, we have included the following:

"Weakly supervised learning encompasses a class of methods that train models with limited or imprecise supervision (Peyre et al., 2017; Li et al., 2019; Zhou, 2018). Unlike fully supervised learning, which requires large amounts of precisely labelled data, weakly supervised learning leverages weak annotations, such as noisy, incomplete, or coarse-grained labels. This paradigm significantly reduces the reliance on costly and time-intensive data annotation processes. This paradigm has demonstrated broad applicability across various domains, including vision-language models (Wang et al., 2022b), medical image analysis (Kanavati et al., 2020). Despite its potential, it has been underexplored in the context of semi-supervised learning, where pseudo-label predictions frequently introduce noise. This positions weakly supervised learning as an ideal candidate to address the challenges posed by noisy labels in semi-supervised frameworks."

Labelled set being selected randomly in other works.

Thank you for this interesting question. Please note that standard semi-supervised learning papers do indeed select the labelled set randomly from the whole dataset; it is neither fixed nor provided. Specifically, a subset of samples is randomly selected from the unlabeled set, along with their ground truth labels, to form the labelled set. Please refer to the seminal SSL work FixMatch [1] (page 17, Section C). As mentioned in that paper, the method selects the labelled set randomly, trains the model with different random labelled sets, and reports the average and standard deviation. Other popular methods, such as FlexMatch, MixMatch, and ReMixMatch, followed the same protocol of random selection of labelled sets. The previous SOTA (CPL [3]) on tuning VLMs with semi-supervised learning also followed the same protocol. We would like to clarify that there is no mention of how the labelled set is created in the original CPL paper, but the official implementation shows that the labelled set is, in fact, selected randomly, like in all the methods mentioned above: https://github.com/vanillaer/CPL-ICML2024/blob/master/methods/main_SSL.py#L74

We acknowledge that the idea of diverse and representative labelled set selection can be utilized with other existing methods too. In the table below, we show the results of our weakly-supervised sampling (WS) with existing methods and also present the performance of SelfPrompt with and without a weakly-supervised sampling module. As we find from this table, SelfPrompt (without WS) outperforms the prior works by 4.71%, while WS with GRIP, CPL, and SelfPrompt improves the performance by 1.72%, 1.67%, and 3.21%, respectively. In the revised manuscript, we have now included a new section on Page 7 to include these results.

| Method | Acc. |
| --- | --- |
| *Random sampling* | |
| GRIP | 67.40 |
| CPL | 71.41 |
| SelfPrompt | 76.12 |
| *Weakly-supervised sampling* | |
| GRIP + WS | 69.12 |
| CPL + WS | 73.08 |
| SelfPrompt + WS | 79.33 |
Comment

(A) Why are highly confident samples dropped during labelled set selection? (B) Why are confident samples important in confidence-aware semi-supervised learning?

(A) The primary reason for excluding both high-confidence and low-confidence samples is that retaining high-confidence samples (in the labelled set) results in representations that fail to generalize effectively to distributions outside the selected labelled set. Previous works, such as Roy et al. (2024) and Sarkar et al. (2023), have demonstrated that overconfidence on a specific distribution hinders effective generalization. This issue is particularly pronounced when the model has not been fine-tuned on the distribution of the specific downstream task. Accordingly, we hypothesize that selecting a more diverse and representative set of samples and excluding both high- and low-confidence samples would improve generalization. To validate this hypothesis, we conduct experiments comparing the outcomes of different sampling strategies: (a) removing only the low-confidence samples, (b) removing only the high-confidence samples, (c) keeping only the high-confidence samples, and (d) keeping only the low-confidence samples. The results, averaged across all datasets with $q=5$, are presented below. These experiments were performed using only the weakly-supervised sampling module to isolate the behaviour of this specific module. As we find from the table below (Table 10 in the revised manuscript), excluding both the high- and low-confidence samples yields the best performance.

| Setting | Accuracy |
| --- | --- |
| Ours (removes both high- and low-confidence samples) | 73.08 |
| Removing only the low-confidence samples | 72.03 |
| Removing only the high-confidence samples | 72.43 |
| Keeping only the high-confidence samples | 71.78 |
| Keeping only the low-confidence samples | 72.55 |

(B) Once the labelled set has been selected, we fine-tune the model on the downstream task. As shown in prior works (e.g., FixMatch), after training for a few iterations, the model gradually starts to generate highly accurate predictions, where the most confident samples are, in most cases, correct. As a result, such pseudo-labels are utilized to learn from the unlabeled data.

References:

  • Roy, S. & Etemad, A. Consistency-guided prompt learning for vision-language models. In ICLR, 2024.
  • Sarkar, P., Beirami, A., & Etemad, A. Uncovering the hidden dynamics of video self-supervised learning under distribution shifts. In NeurIPS, 2023.

Difference w.r.t. [3].

Our confidence-aware semi-supervised learning module differs from [3] in that ours is a hybrid approach of fully-supervised and weakly-supervised learning, while [3] only learns from noisy samples in a weakly supervised setting. Specifically, we learn from high-confidence samples in a fully-supervised setting while learning from low-confidence samples in a weakly-supervised setting, ensuring the best utilization of the unlabelled data.

"Highly confident samples" is vague. Can the authors use something like probability to describe it?

We define highly confident samples using quantiles instead of probabilities. Specifically, we sort all samples in ascending order of confidence scores and divide them into $q$ quantiles. The first quantile contains the least confident samples, while the last quantile contains the most confident ones. The reason for choosing this strategy over a fixed probability threshold is that, with the quantile-based definition, the confidence cut-offs adapt to the distribution of a particular set of data, making the approach more suitable and dynamic for our solution. In other words, discarding samples based on a fixed probability may result in a different number of kept/discarded samples depending on the distribution, whereas with our approach, we always end up with a fixed number of samples.

Why perform k-means with $N$ clusters instead of $C$?

Our goal is to select the most diverse $N$ samples from the unlabeled set. Creating $N$ clusters divides the whole dataset into $N$ clusters of distinct features, allowing us to select a diverse set of samples for our labelled set by selecting one sample per cluster (the sample closest to the center in our approach). On the other hand, selecting $C$ (the number of classes) clusters does not necessarily fulfill the labelling budget ($N$).
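
A minimal sketch of this diversity sampling, assuming scikit-learn k-means over frozen vision-encoder embeddings (our own illustrative code and names, not the authors'):

```python
import numpy as np
from sklearn.cluster import KMeans

def diverse_labelled_set(features: np.ndarray, n: int, num_classes: int) -> np.ndarray:
    """Pick N = n * num_classes indices, one nearest each of the N centroids."""
    N = n * num_classes
    km = KMeans(n_clusters=N, n_init=10, random_state=0).fit(features)
    # Per-sample distance to the centroid of its own cluster.
    dist = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
    picks = [int(np.where(km.labels_ == k)[0][np.argmin(dist[km.labels_ == k])])
             for k in range(N)]
    return np.array(picks)
```

The returned indices are the $N$ samples that would be sent for ground-truth labelling, one representative per fine-grained cluster.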

Comment

Thank you to the authors for their detailed rebuttal. However, my main concerns about problem setting and motivation still exist. Specifically,

  1. Problem setting. Please note that in semi-supervised learning, we are given a small labeled dataset and a large unlabeled dataset. Data and labels in the labeled dataset are sampled from an unknown distribution p(x, y), while data in the unlabeled dataset are sampled from an unknown distribution p(x) (refer to the problem background or problem setting in [4, 5]). The labeled dataset is not drawn from the unlabeled dataset, and we cannot control the samples in the labeled dataset. Moreover, I have carefully checked the FixMatch paper and did not find that the authors selected labeled data from the unlabeled dataset (if I missed it, please point that out).

  2. Motivation. I understand that N is the number of labeled data. However, to me, N does not have any semantic meaning, so it is not clear why the authors conduct k-means clustering with N clusters and claim that the N clusters will have distinct features.

Given that these concerns still exist, I will currently maintain my score.

[4] A survey on semi-supervised learning

[5] Realistic Evaluation of Deep Semi-Supervised Learning Algorithms

Comment

Thank you for your quick response, for taking the time to read our long rebuttal, and for acknowledging most of our responses. Below, we provide more detailed responses to the remaining two concerns.

Problem setting. In existing methods, is labelled data given or randomly selected?

To clarify how the existing method implements sampling, consider an example experiment from FixMatch (Sohn et al., 2020): the first column (CIFAR-10, 40 labels) of Table 2 in the FixMatch paper (https://arxiv.org/pdf/2001.07685). In this example, the labelled set size is 40. For CIFAR-10, FixMatch randomly sampled these 40 labelled examples from the CIFAR-10 training set of 50,000 samples, with the remaining 49,960 samples treated as the 'unlabeled set', and reported the average accuracy over five such randomly selected labelled sets, as stated in Section C (Page 17: "We report results over five random folds of labelled data"). Here, the labelled set in one fold is part of the unlabelled set in other folds. We can further verify the random selection protocol by looking into the implementations of existing methods here: https://github.com/TorchSSL/TorchSSL/blob/main/datasets/data_utils.py#L58, which contains the official implementation of FlexMatch (Zhang et al., 2021) and re-implementation of many other popular semi-supervised learning methods under a unified framework, including FixMatch (Sohn et al., 2020), ReMixMatch (Berthelot et al., 2019), SoftMatch (Chen et al., 2023), FreeMatch (Wang et al., 2022), and others.
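
For reference, the random split protocol described above can be sketched as follows (our own minimal reconstruction of the benchmark convention, not code from any of the cited repositories):

```python
import numpy as np

def random_ssl_split(num_train: int, num_labeled: int, seed: int):
    """Randomly draw the labelled budget; everything else is unlabelled."""
    rng = np.random.default_rng(seed)
    labeled = rng.choice(num_train, size=num_labeled, replace=False)
    unlabeled = np.setdiff1d(np.arange(num_train), labeled)
    return labeled, unlabeled

# e.g., CIFAR-10 with 40 labels: one of five random folds in the
# FixMatch-style protocol described above.
labeled, unlabeled = random_ssl_split(50_000, 40, seed=0)
```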

All these prior works followed the same strategy of randomly selecting the labelled set from the training data and using the remaining samples as the unlabelled set. We see a similar strategy in recent literature for fine-tuning foundation models in semi-supervised learning setups, e.g. GRIP (Menghini et al., 2024) and CPL (Zhang et al., 2024). This also implies that in the standard semi-supervised learning setup, labelled and unlabeled samples come from the same data distribution; only the labelled set contains ground truth labels. However, there are other settings, which are not standard semi-supervised learning setups, where the labelled and unlabelled data come from different distributions, for instance, in OpenSet semi-supervised learning (Saito et al., 2021), and unconstrained semi-supervised learning (Roy et al., 2024).

Nonetheless, in the revised manuscript, we have provided additional results by considering representative sample selection (our approach) as a special case and integrating our weakly-supervised sampling module with prior works. Please kindly refer to Table 3 (Page 7).

References:

  • S. Roy and A. Etemad, "Scaling Up Semi-supervised Learning with Unconstrained Unlabelled Data", AAAI, 2024.

  • K. Saito, D. Kim, and K. Saenko, "Openmatch: Open-set Consistency Regularization for Semi-supervised Learning with Outliers", arXiv preprint arXiv:2105.14148, 2021.

  • J. Zhang, Q. Wei, F. Liu, and L. Feng, "Candidate Pseudolabel Learning: Enhancing Vision-Language Models by Prompt Tuning with Unlabeled Data", ICML, 2024.

  • C. Menghini, A. Delworth, and S. Bach, "Enhancing Clip with Clip: Exploring Pseudolabeling for Limited-label Prompt Tuning", NeurIPS, 2024.

  • K. Sohn, D. Berthelot, N. Carlini, Z. Zhang, H. Zhang, C. A. Raffel, E. Dogus Cubuk, A. Kurakin, and C-L. Li, "Fixmatch: Simplifying Semi-supervised Learning with Consistency and Confidence", NeurIPS, 2020.

  • B. Zhang, Y. Wang, W. Hou, H. Wu, J. Wang, M. Okumura, and T. Shinozaki, "Flexmatch: Boosting Semi-supervised Learning with Curriculum Pseudo Labeling", NeurIPS, 2021.

  • D. Berthelot, N. Carlini, E. D. Cubuk, A. Kurakin, K. Sohn, H. Zhang, and C. Raffel, "ReMixMatch: Semi-Supervised Learning with Distribution Matching and Augmentation Anchoring", ICLR, 2020.

  • H. Chen, R. Tao, Y. Fan, Y. Wang, J. Wang, B. Schiele, X. Xie, B. Raj, and M. Savvides, "SoftMatch: Addressing the Quantity-Quality Tradeoff in Semi-supervised Learning", ICLR, 2023.

  • Y. Wang, H. Chen, Q. Heng, W. Hou, Y. Fan, Z. Wu, J. Wang, M. Savvides, T. Shinozaki, B. Raj, and B. Schiele, "FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning", ICLR, 2022.

Comment

Motivation for using $N$ for clustering.

Considering that $N = n \times C$, we can deduce that $N \geq C$, and therefore, our approach results in finer-grained representatives. Consider the following example to illustrate our approach. Suppose we have a dataset with $C=2$ classes, 'dog' and 'car', where the dog class contains a few images of 'golden retriever' and 'bulldog', while the car class includes a few images of 'SUV' and 'truck'. Given a labelling budget of $N=4$, an ideal selection of diverse samples would include one image each of a golden retriever, bulldog, SUV, and truck. If we create $C$ clusters, we would form two broad semantic clusters corresponding to 'dog' and 'car', which would not necessarily help with our goal of selecting a diverse set of $N=4$ samples. However, creating $N$ clusters results in fine-grained semantic clusters of golden retriever, bulldog, SUV, and truck. Selecting one sample from each of these clusters ensures the selection of $N$ semantically diverse samples.

We hope these answers help address the remaining concerns of the reviewer. Please let us know if we can provide any additional clarifications.

Comment

I agree with Reviewer homK that this framework is arguably valid. As noted in my initial review, labeling important data within a limited budget is a significant and practical problem. However, it is fundamentally different from semi-supervised learning. In my view, the authors may consider reframing the narrative of the paper to focus on the context of labeling important data with a limited budget rather than positioning it within the scope of semi-supervised learning. However, considering the substantial changes required and the limited remaining time available for the ICLR discussion, the current paper may not yet be ready for publication and may be better revised and submitted to another conference later.

Comment

Please note that, as you mentioned, in FixMatch, the labeled data are sampled from the dataset (data distribution) rather than from the unlabeled dataset. As I previously stated, this experimental setup is designed to mimic real-world conditions where we cannot choose which data to label. Therefore, unfortunately, this approach is flawed within the context of semi-supervised learning.

Comment

Thank you for your response. We are glad to see that you have acknowledged most of our responses, and the discussion has boiled down to one point. Before we further discuss this point, we would like to note two important facts: (1) even without the point in question (regarding our first contribution), our method outperforms existing methods by 4.71% on an identical setup, which is an important contribution to this field. (2) We have now included experiments (Table 3) and results that treat our sampling technique as a new setup, as suggested by the reviewer, and have shown improvements across ours and the existing methods. We believe this change eliminates any discussion of unfair comparisons.

Regarding the sampling: we are also glad to see that the reviewer now acknowledges that labelled data is, in fact, selected randomly by prior methods; it is neither fixed nor provided, which was the original point of discussion. In light of the current concern (that the labelled set is randomly sampled from the whole dataset, not from the unlabelled dataset), we would like to point out that the dataset, in this context, is the unlabelled data. Specifically, the existing methods consider the whole dataset as unlabeled data and randomly sample a subset from this unlabelled set. No label information is utilized at this point, even though the labels are available with the benchmark datasets. In other words, we sample from the unlabelled data distribution, representing the unknown distribution of the whole data of the target domain. Once such a smaller subset of the unlabelled data is sampled, we form the labelled set. For research and benchmarking purposes, we take such labels from a benchmark dataset, but in the real world, ground truth labels are annotated by human experts, and we can choose which of the unlabelled samples we want to be annotated by the expert. Our proposed method makes no change to this pipeline. We simply propose strategically selecting the subset from the unlabelled set.

Comment

To clarify, I did not claim that the labeled dataset is not given or fixed. The problem setting of semi-supervised learning is that we are given a small labeled dataset sampled from p(x, y) and a large unlabeled dataset sampled from p(x). Sampling the labeled dataset with multiple random seeds does not mean that we can select the labeled data. The experiment setting of FixMatch is to conduct experiments with multiple random seeds for robust evaluation. Again, please refer to the problem setting of semi-supervised learning in [4, 5].

Comment

"in the real world, ground truth labels are annotated by human experts- and we can choose which of the unlabelled samples we want to be annotated by the expert."

I think this comment is the key stumbling block, which has been completely missing from discussions within the paper and, until now, also from the rebuttal.

I think this framework is arguably valid; however, it is fundamentally different from those of FixMatch and other works, in which labeled data is already given and we have no control over it. This distinction is absolutely critical.

Comment

Dear Reviewer homK and Reviewer 352h,

We deeply thank the reviewers for their continued engagement. This means a lot, regardless of the outcome of this submission, and has without a doubt increased the quality of our paper. We would like to clarify that the overarching goal of our method is to fine-tune a pre-trained model on a downstream task. Nonetheless, semi-supervised learning is one of the benchmarks we use to evaluate our solution (alongside base-to-novel generalization), following prior works addressing the same problem. However, since our setup assumes the ability to select the labelled set (inspired by the active learning literature), based on your recommendation, we have now slightly re-positioned this work as active semi-supervised learning. Accordingly, we present our results in two distinct setups as follows. First, we use a randomly selected labelled set, similar to the standard semi-supervised learning literature (please see Section 4.2 of the revised manuscript). Second, we evaluate our method with the additional ability to strategically choose the samples to label. These results are now presented in Section 4.3. We observe that in both setups (as well as the third setup of base-to-novel generalization), our method achieves very strong results, outperforming other methods. Kindly see the revised manuscript, where the changes regarding this discussion have been highlighted in red.

We hope that this repositioning of the work, along with using two explicitly separate experiment setups (covering both random sampling and selected sampling), addresses the concerns regarding this particular component of our method. If the reviewers have any additional questions or concerns, we would be more than happy to provide further clarifications or make additional changes to the paper in the remaining duration of the rebuttal period.

Comment

Dear Reviewer 352h,

As we approach the end of the rebuttal period, we would like to offer any further clarifications regarding our paper if there are additional questions. As mentioned in our previous response, we have slightly adjusted the positioning of our paper and presented results for both the standard and our proposed active semi-supervised learning setups. We hope this change, along with additional results (Section A.2), sensitivity analysis (Table 7), further explanations of the proposed solution (Section 3), expanded related work (page 3), comparisons to more methods (Tables 1, 2), pseudo-code (Section A.3), and other minor changes made during the rebuttal, have addressed your concerns and strengthened the overall quality of our paper. If you believe these updates have successfully addressed your concerns, we kindly ask you to consider raising your score to reflect the improved state of the paper.

Once again, we would like to sincerely thank you for your comments and constructive discussions, which have without a doubt significantly improved our paper.

Best regards,

Authors

Comment

Thank you to the authors for polishing their paper. I appreciate them for reconsidering the context of their paper. However, in my view, changing the narrative and context goes beyond simply renaming the setting from "semi-supervised learning" to "active semi-supervised learning." The authors should rewrite the whole storyline of the paper by focusing on active learning. The current paper still focuses on semi-supervised learning. Moreover, I believe that if the main idea of labeling with a limited budget is rooted in active learning, the paper should also be reviewed by experts in active learning to properly assess its quality. Therefore, based on the above reasons, I think this paper is not yet ready for publication and needs more revision.

Comment

We respectfully disagree with the comment. Our approach was not a mere renaming of the setting but involved substantial modifications to address the reviewer’s concerns. Specifically, the reviewer highlighted that existing semi-supervised methods assume the labelled set is randomly sampled from the entire dataset, whereas our approach introduces a strategic sampling of the labelled set. This distinction makes comparisons to prior methods unfair, as the proposed setting cannot be categorized as standard semi-supervised learning. While strategic sampling could be considered a specific case of random sampling, we adopted the reviewer’s suggestion to avoid any ambiguity or potential misunderstanding. Accordingly, we made the following changes:

  1. We reframed this setup as a variant of semi-supervised learning, where the labelled set is assumed to be sampled and labelled with the assistance of an oracle or human in real-world scenarios. Within this setup, we reproduced prior works, demonstrated improvements with our weakly supervised sampling technique, and ensured fair comparisons.
  2. We also evaluated our method under the standard semi-supervised learning setup, i.e., using the same randomly sampled labelled sets as in prior works without employing strategic sampling.

Additionally, while strategic sampling holds potential for broader applications in machine learning, exploring such scenarios would require additional studies tailored to specific cases or setups, which fall beyond the scope of this research. Active learning, for instance, could benefit from our intuition; however, our solution and narrative cannot be directly adapted to active learning. Active learning operates under a completely different framework, where training occurs over multiple rounds, with each round selecting and annotating additional labelled data based on the model’s current state. In contrast, semi-supervised learning involves a single round of training, where the labelled and unlabeled data remain fixed throughout the process, and our proposed solution is specialized in this setup.

Overall, this paper makes several contributions, including the introduction of these two additional modules (beyond the sampling module), demonstrating substantial improvements over existing methods across three benchmark evaluations (not limited to semi-supervised learning), and presenting a comprehensive set of experiments and findings. Considering these contributions and the extensive revisions made to address concerns of unfair comparisons to semi-supervised learning, we do not find it justified to argue that this work should be reframed as a completely new paper for a new setup.

Review (Rating: 3)

This paper proposes a prompt-tuning approach for VLM adaptation. Specifically, to mitigate the impact of sub-optimal VLM predictions, a quantile-based sub-sampling strategy removes the most and least confidently predicted samples and selects candidate samples based on their closeness to class-based cluster centers. These labelled samples are used to prompt-tune the VLM.

Strengths

The paper is well written and easy to follow throughout.

The results are promising and experimentation is comprehensive, including the ablation study.

Weaknesses

The number of baselines is too small. There are many other baselines the authors could have implemented (as evidenced in the base-to-novel generalization results).

Please see the questions section.

Questions

Line 253 says "does not rely on the VLM to generate pseudo-labels". This appears incorrect to me, as the VLM probabilities are used to determine which samples are used in clustering?

Line 249 says to "gather the labels of the selected samples". By labels, is this referring to the pseudo-labels? If not, then where are these labels taken from?

It is unclear to me what the loss function defined in line 282 is used to optimise. The prompt-tuning pipeline is not clearly and formally defined in the preliminaries as it should be.

Comment

We appreciate the reviewer’s positive feedback, highlighting that the paper is well-written and easy to follow, as well as recognizing the promise of our results and the comprehensiveness of our experiments, including the ablation study. Below, we address the questions and comments in detail.

The number of baselines is too small.

Our work focuses on the semi-supervised tuning of VLMs, a relatively new area compared to more established fields like few-shot tuning, in which the pre-trained encoder is fine-tuned using a combination of a few labelled samples and large amounts of unlabeled data. We have compared our results with all prior methods in this domain, similar to the previous SOTA CPL (published at ICML 2024), which we present in Tables 1 and 2. Additionally, as per your suggestion, we have now reproduced the SOTA (PromptKD) on the base-to-novel generalization task (Table 5) in the semi-supervised learning setup using the same training details (encoder and prompt) as our proposed solution. As we find from the table below (please refer to Tables 1, 2 in the revised manuscript for more details), SelfPrompt (ours) outperforms PromptKD by a large margin, indicating the effectiveness of our solution in utilizing the unlabelled data while fine-tuning the pre-trained VLM.

| Methods | Average Acc. |
| --- | --- |
| *Textual Prompt Tuning* | |
| CLIP | 55.17 |
| CoOp | 62.28 |
| GRIP | 67.40 |
| PromptKD | 66.90 |
| CPL | 71.41 |
| SelfPrompt | 79.33 |
| *Visual Prompt Tuning* | |
| CLIP | 55.17 |
| CoOp | 60.02 |
| GRIP | 64.77 |
| PromptKD | 64.60 |
| CPL | 67.11 |
| SelfPrompt | 75.04 |

VLM probabilities are used to determine which samples are used in clustering.

The VLM is only used to filter out some of the samples in the weakly-supervised sampling module, but not for pseudo-labelling. Specifically, unlike prior works, we do not utilize the zero-shot prediction from the VLM to generate the pseudo-labels. To avoid any confusion, we have revised the concerning line from "We propose a novel clustering-guided pseudo-labelling approach that does not rely on the VLM to generate the pseudo-label." to the following: "We propose a novel clustering-guided pseudo-labelling approach that does not utilize the zero-shot prediction from the VLM as the pseudo-label."

By labels, is this referring to the pseudo-labels?

In this context, labels refer to ground truth labels and not pseudo-labels. In a semi-supervised setting, we have a large unlabelled set, and we select a small subset (randomly or strategically) from the unlabelled data to form the labelled set. The labels for this set are taken from the ground truth labels provided with the dataset.

What does the loss function on line 282 optimize?

The loss function optimizes the learnable tokens/prompts added to the pre-trained encoder. This is discussed in line 181 of section 3.1 of the paper. We used the same model and prompt tuning approach as the prior works in this area (CPL, GRIP) and did not introduce/change any details to ensure a fair comparison to prior work.
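
As an illustration of this setup, here is a hedged CoOp-style sketch in PyTorch (an assumed form of prompt tuning, not the authors' exact pipeline; the class and parameter names are ours): only the learnable context vectors receive gradients, while the CLIP encoders remain frozen.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Learnable context tokens prepended to frozen class-name embeddings."""
    def __init__(self, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

    def forward(self, class_embeds: torch.Tensor) -> torch.Tensor:
        # class_embeds: (num_classes, n_tok, dim); share the context per class.
        ctx = self.ctx.unsqueeze(0).expand(class_embeds.size(0), -1, -1)
        return torch.cat([ctx, class_embeds], dim=1)

prompt = LearnablePrompt()
# Only the prompt parameters are optimized; the encoders stay frozen.
optimizer = torch.optim.SGD(prompt.parameters(), lr=2e-3)
```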

Comment

I have to agree with Reviewer 352h that the problem setting is fundamentally flawed: it is not realistic to assume we can selectively choose which samples are most helpful to somehow obtain labels for when they come from an unlabeled set. Unfortunately, this means I will downgrade my decision.

Comment

Please note that it is not unrealistic to select samples from an unlabelled set for labelling. There is an entire field of research called active learning [1, 2] that focuses on selecting and labelling samples from unlabeled data; it not only selects the initial labelled set but also uses a trained model to identify and select the most informative samples for further labelling. The question asked by Reviewer 352h concerns whether, in existing works, the labelled set is pre-defined or selected randomly. We consider this an important question because if prior works used a pre-defined labelled set, our approach of strategically selecting the labelled set would result in an unfair comparison. However, if the labelled set was randomly sampled from the whole dataset in prior works, our method of strategic sampling cannot be considered unfair. Our response clearly addresses this concern by referencing prior papers and their implementations, which explicitly indicate that the labelled set was randomly selected from the entire dataset. Therefore, we believe our problem setup is not flawed. For convenience and clarity, we have copied our full discussion with Reviewer 352h at the end of this response.

Nonetheless, as per the suggestion of Reviewer 352h, we have now included new results for both our method and prior methods by considering strategic sampling as a new setup. Specifically, we demonstrate the impact of SelfPrompt's other two components separately while incorporating the sampling technique into both our method and prior works. We believe this completely eliminates any concerns regarding unfair or flawed experimental design, but we are more than happy to answer any further questions on this matter.

References:

  1. Brame, Cynthia. "Active learning." Vanderbilt University Center for Teaching (2016).
  2. Felder, Richard M., and Rebecca Brent. "Active learning: An introduction." ASQ higher education brief 2, no. 4 (2009): 1-5.
Comment

Dear Reviewer homK,

As we approach the end of the rebuttal period, we would like to offer any further clarifications regarding our paper if there are additional questions. In response to your previous comment (agreeing with Reviewer 352h), we have slightly adjusted the positioning of our paper and presented results for both the standard and our proposed active semi-supervised learning setups. We hope this change, along with additional results (Section A.2), sensitivity analysis (Table 7), further explanations of the proposed solution (Section 3), expanded related work (page 3), comparisons to more methods (Tables 1, 2), pseudo-code (Section A.3), and other minor changes made during the rebuttal, have addressed your concerns and strengthened the overall quality of our paper. If you believe these updates have successfully addressed your concerns, we kindly ask you to consider raising your score to reflect the improved state of the paper.

Once again, we would like to sincerely thank you for your comments and constructive discussions, which have without a doubt significantly improved our paper.

Best regards,

Authors

Review (Rating: 5)

This work proposes the SelfPrompt method to address the semi-supervised fine-tuning problem of CLIP. SelfPrompt contains: (1) weakly supervised sampling to select the few-shot confident examples as pseudo-labels; (2) cluster-guided pseudo-labeling to select minimum-centroid-distance samples as cluster-wise pseudo-labels; (3) confidence-aware semi-supervised learning, which separately learns from confident samples (with supervised learning) and low-confidence samples (with weakly-supervised learning). The empirical results demonstrate significant performance gains compared with several state-of-the-art baselines on existing semi-supervised learning and few-shot learning benchmarks. The ablation results justify the contribution of each proposed component.

Strengths


  • The performance gains of the proposed method are significant compared with previous baselines.
  • The identified sampling issue of CLIP few-shot benchmark is insightful.
  • The experiments are comprehensive to validate the effectiveness of the proposed method.

Weaknesses


  • Lack of technical novelty: The proposed method is primarily based on pseudo-label thresholding and selection, which was well studied in semi-supervised learning a few years ago. The identified sampling issue of existing CLIP few-shot benchmarks is somewhat insightful.
  • Lack of implementation details: Although the sensitivity of the hyperparameters is studied, how hyperparameters are determined for various datasets is unclear. Does it involve dataset-specific tuning?
  • Lack of baselines: Two recent baselines are neglected. The comparisons with them are necessary to claim the state-of-the-art performance of the proposed method.

[1] PLOT: Prompt Learning with Optimal Transport for Vision-Language Models

[2] AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

Questions

All my concerns are listed in the weaknesses.

Comment

We thank the reviewer for the positive feedback, acknowledging the significant performance gains achieved by our method, the insightful identification of the sampling issue, and the comprehensiveness of our experiments. Below, we provide detailed responses to the questions raised.

Lack of technical novelty

The proposed method is not primarily based on pseudo-label thresholding and selection. Of the three modules in our proposed method, only the third (confidence-aware semi-supervised learning) uses the concept of pseudo-label thresholding. Specifically, our confidence-aware semi-supervised learning is a hybrid of fully-supervised and weakly-supervised learning, which learns from the high-confidence samples in a fully-supervised manner and from the low-confidence samples in a weakly-supervised setting.

Overall, this paper introduces three key technical novelties that distinguish it from the existing literature. We identify and address three key limitations of existing approaches to tuning VLMs in a semi-supervised setup: (a) the random selection of the labelled set by existing methods does not adequately represent the underlying data distribution, leading to inefficient use of the limited label budget; (b) relying on the zero-shot capabilities of pre-trained VLMs produces a considerable number of incorrect pseudo-labels; and (c) the incremental pseudo-labelling of existing approaches leads to an accumulation of noisy pseudo-labels, ultimately degrading performance. Each of these issues contributes to a significant drop in the final performance of the model and requires a unique solution to mitigate its adverse impact. Our solutions to each of these problems are as follows:

Contribution (a): We introduce weakly-supervised labelled set sampling to overcome the limitation that existing solutions do not select the most representative set of samples as the labelled set. The proposed technique is a two-step protocol: the first step removes the least informative and noisy samples with our filtering-with-weak-supervision approach, and the second step selects a diverse set of samples as the labelled set using our diversity sampling technique (see the sketch below). No prior work has explored such an approach for sampling a representative labelled set for semi-supervised learning.
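For concreteness, the following is a minimal sketch of this two-step protocol. It assumes precomputed CLIP image embeddings and zero-shot confidence scores; the function name, variable names, library choice, and default values are illustrative assumptions rather than our exact implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def weakly_supervised_sampling(embeddings, confidences, num_quantiles=5, budget=16):
    # Step 1: filtering with weak supervision. Divide the unlabelled samples
    # into confidence quantiles and drop the most and least confident ones,
    # since the former contribute little to learning and the latter are noisy.
    edges = np.quantile(confidences, np.linspace(0, 1, num_quantiles + 1))
    bins = np.digitize(confidences, edges[1:-1])  # bin index in [0, num_quantiles - 1]
    keep = np.where((bins > 0) & (bins < num_quantiles - 1))[0]

    # Step 2: diversity sampling. Cluster the remaining samples and pick the
    # point closest to each centroid, so the labelled set covers the data.
    kmeans = KMeans(n_clusters=budget, n_init=10).fit(embeddings[keep])
    selected = []
    for c in range(budget):
        members = keep[kmeans.labels_ == c]
        dists = np.linalg.norm(embeddings[members] - kmeans.cluster_centers_[c], axis=1)
        selected.append(members[np.argmin(dists)])
    return np.array(selected)  # indices of samples to send for human labelling
```

Here `budget` denotes the total labelled-set size; dropping the extreme quantiles mirrors our observation that the most confident samples contribute little to learning while the least confident ones tend to be noisy.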

Contribution (b): We introduce cluster-guided pseudo-labelling. Unlike existing methods that rely on the zero-shot capabilities of pre-trained VLMs, which produce a considerable number of incorrect pseudo-labels due to miscalibration in the VLM, we leverage the clusters formed in the weakly-supervised sampling module to generate the pseudo-labels (see the sketch below). To our knowledge, this approach to pseudo-labelling has never been explored in the context of semi-supervised learning.
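A minimal sketch of this pseudo-labelling step is given below, reusing a k-means model fitted on the filtered unlabelled embeddings (as in the sampling sketch above). It assumes each cluster has already been associated with a class, for example through its human-labelled representative; the `cluster_to_class` mapping, the names, and the default value of `p` are illustrative assumptions:

```python
import numpy as np

def cluster_pseudo_labels(embeddings, kmeans, cluster_to_class, p=50):
    # `kmeans` is assumed to have been fitted on `embeddings`
    pseudo_idx, pseudo_lbl = [], []
    for c, centroid in enumerate(kmeans.cluster_centers_):
        members = np.where(kmeans.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - centroid, axis=1)
        # the p samples closest to the centroid are the most prototypical
        # members of the cluster and the least likely to be mislabelled
        nearest = members[np.argsort(dists)[:p]]
        pseudo_idx.extend(nearest.tolist())
        pseudo_lbl.extend([cluster_to_class[c]] * len(nearest))
    return np.array(pseudo_idx), np.array(pseudo_lbl)
```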

Contribution (c): Finally, we introduce confidence-aware semi-supervised learning. The proposed solution is a novel technique that combines fully-supervised learning with weakly-supervised learning to enable efficient utilization of the unlabelled data (see the sketch below). Our solution is the first of its kind to use such a hybrid approach to utilizing the unlabelled data.
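The sketch below illustrates the hybrid objective. The confidence threshold, the number of candidate labels k, and the loss factor λ (lam) are illustrative assumptions, and the partial-label term shown is one simple instantiation of weakly-supervised learning over candidate label sets rather than our exact loss:

```python
import torch
import torch.nn.functional as F

def confidence_aware_loss(logits, pseudo_labels, threshold=0.7, k=3, lam=0.5):
    probs = logits.softmax(dim=-1)
    conf = probs.max(dim=-1).values
    high, low = conf >= threshold, conf < threshold

    # fully-supervised term on the confident samples
    sup = F.cross_entropy(logits[high], pseudo_labels[high]) if high.any() else 0.0

    # weakly-supervised (partial-label) term on the uncertain samples:
    # maximize the total probability mass on the top-k candidate labels
    # instead of committing to a single, possibly noisy, pseudo-label
    weak = 0.0
    if low.any():
        topk = probs[low].topk(k, dim=-1).values
        weak = -topk.sum(dim=-1).clamp_min(1e-8).log().mean()
    return sup + lam * weak
```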

Does it involve dataset-specific tuning?

No. Our proposed solution does not involve any dataset-specific tuning. A key strength of our method is that it is dataset-invariant: all results reported across datasets use a single set of hyper-parameters. All hyper-parameters and implementation details are provided in Section 4.1 to ensure reproducibility of our proposed solution.

Two recent baselines are neglected.

Thank you for pointing out these two papers. In the revised manuscript, we have discussed them under the prompt tuning section (page 3) of the related work. However, we cannot directly compare our solution with these methods, as our approach is based on semi-supervised learning whereas the two mentioned papers focus on few-shot learning; unlike our solution, they are unable to leverage unlabelled data.

Comment

Dear Reviewer WqY4,

As we approach the end of the rebuttal period, we would like to offer any further clarifications regarding our paper if there are additional questions. In response to your original comments, we have clarified our technical novelty, discussed two more related works, and addressed your other questions and concerns. We hope these changes, together with the additional results (Section A.2), sensitivity analysis (Table 7), further explanation of the proposed solution (Section 3), expanded related work on weakly-supervised learning (page 3), comparisons to more methods (Tables 1 and 2), pseudo-code (Section A.3), the adjusted positioning of our paper with results for both the standard and our proposed active semi-supervised learning setups (suggested by Reviewer 352h and Reviewer mPat), and other minor changes made during the rebuttal, have addressed your concerns and strengthened the overall quality of our paper. If you believe these updates have successfully addressed your concerns, we kindly ask you to consider raising your score to reflect the improved state of the paper.

Once again, we would like to sincerely thank you for your comments and constructive discussions, which have without a doubt significantly improved our paper.

Best regards,

Authors

Comment

Most of my previous concerns have been addressed. However, the following points remain.

  1. Baselines: Although the listed baselines are not specialized for the semi-supervised setting, some of them have few-shot learning capability and are easy to adapt to this setting. I believe the performance of this work is stronger; therefore, it would be better to also adapt some few-shot baselines to this setting.

  2. Technical Novelty: I am still concerned about the limited technical novelty of this paper. For example, in lines 66-69, "This approach is based on the observation that the most confident samples do not contribute significantly towards learning, while the least confident samples tend to be noisy and are not representative of the dataset": this is a well-known fact in the semi-supervised learning community, with many studies and technical papers on it. Second, although clustering-based pseudo-labeling may not have been proposed in the specialized area of semi-supervised learning with CLIP, the general idea of leveraging structural information in the embedding space for pseudo-labeling is not new [1]. Besides, contrastive learning with selected nearest neighbors in the CLIP embedding space is usually preferred and is normally superior to clustering-based methods (see the sketch after this list); if the authors truly aim to claim this contribution, I would suggest adding such a comparison experiment. Third, learning with confidence scores is also widely studied.

  3. Additional Comments: I am personally not entirely satisfied with the presentation quality of the revised script for an ICLR conference paper. For instance, there are writing issues, such as in lines 77-80, where the sentences do not follow logically. Moreover, as argued by other reviewers, the proposed setting is not well motivated. I personally agree with the value of active SSL, but you need to demonstrate the unique challenges of SSL and active SSL in the vision-language learning area.
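To make the suggested comparison concrete, the following is an illustrative sketch of one simple way to exploit neighborhood structure in the CLIP embedding space: refining pseudo-labels by majority vote over nearest neighbors. The names and the choice of k are assumptions for illustration, not a specific published method:

```python
import numpy as np

def knn_refined_pseudo_labels(embeddings, zero_shot_labels, k=10):
    # cosine similarity between all pairs of L2-normalized CLIP embeddings
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = z @ z.T
    np.fill_diagonal(sims, -np.inf)  # exclude self-matches
    refined = np.empty_like(zero_shot_labels)
    for i in range(len(z)):
        nn = np.argsort(sims[i])[-k:]  # indices of the k nearest neighbors
        refined[i] = np.bincount(zero_shot_labels[nn]).argmax()  # majority vote
    return refined
```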

References:
[1] Graph-Based Semi-Supervised Learning: A Comprehensive Review
[2] PROTOCON: Pseudo-label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-supervised Learning

In summary, I greatly appreciate the effort made by the authors. After serious consideration, I would like to keep my original score.

Comment

We sincerely thank the review committee for their time and for providing constructive feedback. We are happy to see the engaged comments from all four reviewers, especially the acknowledgment of our comprehensive experiments and strong results. We have carefully addressed all the concerns raised by the reviewers in the individual response sections and have updated the main PDF accordingly. Below, we provide a summary of our responses:

  • Expanded baseline comparisons: We have expanded the main tables of the experiments by reproducing PromptKD (state-of-the-art in base-to-novel generalization) in the semi-supervised learning setup, as suggested by Reviewer homK.

  • Extensive hyper-parameter study: In response to the question by Reviewer mPat, we have expanded the hyper-parameter study in Table 7 to include more setups.

  • Discussion and experiments on weakly-supervised sampling: In response to the concerns of Reviewer mPat and Reviewer 352h, we have included a discussion section in the revised manuscript on the intuition for removing high-confidence samples during labelled set selection, along with new experimental results to support our intuition.

  • Generalization of the weakly-supervised sampling module: As per the suggestion of Reviewer 352h, we show that our proposed weakly-supervised sampling module can be utilized with existing semi-supervised learning methods, improving their performance.

  • Pseudo-code: We have included the pseudo-code of our method for more clarity, as suggested by Reviewer 352h.

  • Expanded related work section: As per the suggestions by Reviewer WqY4 and Reviewer 352h, we have expanded the related work by including very recent works on prompt tuning and a section on weakly-supervised learning.

We hope these changes and responses address the reviewers' questions. Should the reviewers require further clarification, we would be happy to provide additional details during the remainder of the rebuttal period.

AC Meta-Review

This paper presents a prompt-tuning approach for VLMs in a semi-supervised learning setup. All the reviewers raised questions about the experiments and the evaluation:

(WqY4 & homK) Lack of baseline methods.

(352h & homK) Some parts of the problem setting, especially the semi-supervised learning setup, are unclear.

(mPat) Lack of explanation and analysis for some experimental results.

The rebuttal did not provide satisfactory answers and failed to convince the reviewers. The authors can further improve the current work by addressing the reviewers' concerns.

Additional Comments from Reviewer Discussion

The main concern raised by the reviewers is about the experimental setup proposed in this paper: the sampling method is flawed within the context of semi-supervised learning, and the authors' rebuttal failed to address this concern.

Reviewer WqY4's concerns regarding the limited technical novelty (as highlighted in lines 66-69) and the insufficient justification of the proposed SSL setting remain unresolved after the discussion stage.

After the discussion stage, Reviewer homK agrees with Reviewer 352h that the problem setting is fundamentally flawed.

Reviewer mPat's concerns about the selection method in this paper still remain after reading the rebuttal.

Final Decision

Reject