Boundary Matters: A Bi-Level Active Finetuning Method
Abstract
Reviews and Discussion
In this paper, the authors address the problem of active finetuning. The difference from active learning is that in active finetuning the models are pre-trained rather than trained from scratch, so the characteristics of the pre-trained model are important for this problem. While the uncertainty-based methods of active learning disturb the stability provided by the pre-trained model, ActiveFT ensures stable training by selecting the most representative samples. The authors' idea is to refine the boundaries by selecting uncertain samples on top of the representative ones. Their active finetuning framework, BiLAF, selects central and boundary samples to capture both diversity and uncertainty. Besides, it has novel improvements, including unsupervised denoising and an iterative strategy for boundary sample selection. Finally, the paper presents extensive experiments and ablation studies.
Strengths
The authors provide a new active finetuning framework that combines the advantages of traditional uncertainty methods with the state-of-the-art ActiveFT method. They also provide extensive experiments and analyses to illustrate the effectiveness of the proposed method.
Weaknesses
Boundary sample selection is not discussed for fine-grained categorization, a case with less differentiation between categories. Besides, what is the gap between active finetuning and the upper-bound performance of fine-tuning on all samples?
Questions
I have the following questions/comments for the authors:
- Does using a pre-trained model to select fine-tuning samples amplify errors in the pre-trained model? I noticed significant performance differences between the different pre-trained models in Table 7.
- Is it reasonable to consider samples that deviate from the center of the pseudo-class as noise?
- On different datasets, below approximately what labeled sample size is boundary sample selection not required?
- Is the proposed method stable for tasks like fine-grained classification?
- The notation for intra-class and inter-class distances should be changed from Xi to Ci, using the cluster notation to describe intra-class and inter-class distances.
Limitations
For downstream task fine-tuning, which in many cases is application-specific, one of the problems I often encounter is how to determine the data collection conditions as well as possible in order to minimize costs; the biggest problem I encounter is therefore not having enough samples for classification. I suggest the authors further improve the results for tasks with higher annotation costs, such as detection and segmentation, or domains requiring expert-knowledge annotation such as remote sensing and medicine.
Q1: Fine-grained classification?
R1: Following your suggestion, we utilized the CUB-200-2011 dataset, which contains images of 200 bird species. According to the default split of the dataset, 5,994 samples are used for training, with the remainder used for testing. Given the large number of classes, we used 10% of the data as Core Samples to select Boundary Samples, while all other parameters were set to the default values reported in the paper. The following table clearly demonstrates that our approach continues to hold a leading position on fine-grained classification datasets.
| Percent | Select Number | Random | ActiveFT | BiLAF(Ours) |
|---|---|---|---|---|
| 20% | 1198 | 46.32 | 47.83 | 48.53 |
| 30% | 1798 | 58.85 | 59.67 | 60.52 |
| 40% | 2397 | 66.75 | 67.36 | 68.31 |
| 50% | 2997 | 72.98 | 73.25 | 74.06 |
Q2: What is the gap between active fine-tuning learning and the upper performance of fine-tuning using all samples?
R2: The tables below present the upper-bound performance achieved by training with all available data. Despite a significant gap between training on a selected subset and training on the full dataset, our method consistently outperforms random sampling at equivalent annotation rates.
| DataSet | 0.5% Rand | 0.5% BiLAF | 1% Rand | 1% BiLAF | 2% Rand | 2% BiLAF | 100% Full |
|---|---|---|---|---|---|---|---|
| Cifar10 | 77.3 | 81.0 | 82.2 | 89.2 | 88.9 | 92.5 | 99.0 |
| DataSet | 1% Rand | 1% BiLAF | 2% Rand | 2% BiLAF | 5% Rand | 5% BiLAF | 10% Rand | 10% BiLAF | 100% Full |
|---|---|---|---|---|---|---|---|---|---|
| Cifar100 | 14.9 | 31.8 | 24.3 | 43.5 | 50.8 | 62.8 | 69.3 | 73.7 | 90.5 |
| DataSet | 1% Rand | 1% BiLAF | 2% Rand | 2% BiLAF | 5% Rand | 5% BiLAF | 100% Full |
|---|---|---|---|---|---|---|---|
| ImageNet | 45.1 | 50.8 | 52.1 | 56.9 | 64.3 | 66.2 | 81.5 |
Q3: Does using a pre-trained model to select fine-tuned samples amplify errors in the pre-trained model? Significant performance differences in Table 7.
R3: In Table 7 and the principal experiments of our paper, ViT-S models pretrained with iBOT and DINO frameworks exhibit comparable performance, whereas ResNet50 differs due to its unique architecture and model size. Objectively speaking, variations in the samples selected can arise based on the quality of features and the upper limits of the pretrained models. Nonetheless, Table 7 underscores the universality of our approach, showcasing its capability to effectively select superior samples across various pretrained models.
Q4: Is it reasonable to consider samples that deviate from the center of the pseudo-class as noise?
R4: 1. Similar to typical denoising scenarios, points significantly distant from their cluster are often deemed as noise. In our approach, we implement localized denoising, where each category has a center, and noise specific to that locality is removed. 2. In practice, due to inadequate feature learning, it's common for samples potentially belonging to two categories to intermingle at the boundaries. Therefore, our denoising aims to clear these mixed areas to avoid erroneous sample selection. This is a trade-off: denoising allows us to conservatively select boundary samples to address poorly learned features and classifications, while opting not to denoise would constitute a bolder choice.
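For illustration, here is a minimal sketch of one plausible instantiation of the localized denoising idea described above: within each pseudo-class, drop the fraction of samples farthest from the assigned center. The Euclidean criterion and the `ratio` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def localized_denoise(features, assignment, centers, ratio=0.1):
    """Remove, within each pseudo-class, the fraction `ratio` of samples farthest
    from their assigned center. Returns a boolean mask of kept samples.
    features: (N, d); assignment: (N,) center index per sample; centers: (K, d)."""
    keep = np.ones(len(features), dtype=bool)
    dist = np.linalg.norm(features - centers[assignment], axis=1)
    for k in range(len(centers)):
        idx = np.where(assignment == k)[0]
        n_drop = int(ratio * len(idx))
        if n_drop > 0:
            drop = idx[np.argsort(dist[idx])[-n_drop:]]  # farthest from center k
            keep[drop] = False
    return keep
```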
Q5: On different datasets, below approximately what labeled sample size is boundary sample selection not required?
R5: We employed the Core Sample Selection method to obtain different numbers of central points and then analyzed the benefits these central points brought. Specifically, we define Distance as the mean Euclidean distance from each sample to the nearest selected point in the feature space, and we investigate the Rate of Return (incremental benefit per core sample) across different ranges, where Rate of Return = Distance Difference / Core Number Difference between two adjacent columns (a worked computation is sketched below).
- CIFAR10
| Core Num | 50 | 100 | 150 | 250 | 375 | 500 | 1000 | 1500 | 2500 | 5000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Distance | 0.7821 | 0.7588 | 0.7472 | 0.7307 | 0.7203 | 0.7117 | 0.6896 | 0.6746 | 0.6506 | 0.6010 |
| Rate of Return | - | 4.6575 | 2.3264 | 1.6422 | 0.8362 | 0.6887 | 0.4420 | 0.3002 | 0.2398 | 0.1984 |
- CIFAR100
| Core Num | 50 | 100 | 150 | 250 | 375 | 500 | 1000 | 1500 | 2500 | 5000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Distance | 0.8378 | 0.8082 | 0.7913 | 0.7724 | 0.7564 | 0.7478 | 0.7229 | 0.7059 | 0.6800 | 0.6272 |
| Rate of Return | - | 5.9221 | 3.3791 | 1.8906 | 1.2797 | 0.6845 | 0.4985 | 0.3404 | 0.2589 | 0.2112 |
We found that the rate of return diminishes gradually, indicating that core samples are crucial in the early stages, while the benefits decrease significantly later on. A clear demarcation point can serve as a guide for when to begin Boundary Sample Selection, such as the 250-375 range for CIFAR-10 and the 375-500 range for CIFAR-100. This provides a simple yet effective guideline. Additionally, in practical applications, we discovered that introducing boundary points earlier may yield better results, such as on CIFAR-100 with a 1% (500-sample) annotation budget.
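For concreteness, a short sketch that reproduces the Rate of Return row from the CIFAR-10 distances above; the ×1e4 scaling is our inference from the reported values.

```python
core_num = [50, 100, 150, 250, 375, 500, 1000, 1500, 2500, 5000]
distance = [0.7821, 0.7588, 0.7472, 0.7307, 0.7203, 0.7117,
            0.6896, 0.6746, 0.6506, 0.6010]  # CIFAR-10 row above

# Rate of Return = Distance Difference / Core Number Difference (scaled by 1e4).
rate_of_return = [
    1e4 * (distance[i - 1] - distance[i]) / (core_num[i] - core_num[i - 1])
    for i in range(1, len(core_num))
]
print([round(r, 2) for r in rate_of_return])
# -> [4.66, 2.32, 1.65, 0.83, 0.69, 0.44, 0.3, 0.24, 0.2],
#    matching the table up to rounding of the reported distances.
```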
Q6: The notation for intra-class and inter-class distances should be changed from Xi to Ci.
R6: Thank you for your suggestion. Given that our distance metric applies to each individual sample, we understand your recommendation to express it with the cluster notation C_i for greater clarity. We will adopt your suggestion and implement this change in the revised version.
Q7: Recommend tasks with higher annotation costs, such as detection and segmentation, remote sensing, and medicine.
R7: There are two aspects to consider: which data should be annotated and how to increase the speed of annotators. Due to time constraints and our lack of exposure to remote sensing and medical data, we plan to conduct further research in the future. As for detection and segmentation tasks, our experiments have also covered these, specifically in PASCAL VOC and ADE20k datasets.
We hope this response helps address your concerns, and we look forward to your further feedback.
Thank you for your insightful suggestions. We believe we have comprehensively addressed your questions regarding the applicability of different pre-trained models, scenarios involving fine-grained classification, the rationale for denoising, and the threshold for selecting boundary samples.
It is worth noting that our method can maintain a leading performance across different pre-trained models and effectively enhance data quality in fine-grained classification scenarios to achieve optimal performance. Additionally, we have proposed a simple yet effective guideline as a threshold for selecting boundary samples.
We are wondering whether you have any additional questions or comments regarding our response to your review comments. We will do our best to address them.
We appreciate your time and effort reviewing our manuscript. Thanks for your consideration!
The paper proposes an active fine-tuning method that considers both diversity and uncertainty. This method selects uncertain samples through an unsupervised denoising approach and boundary score evaluation. The efficiency and effectiveness of this method, which involves selecting central and boundary samples, have been validated through multiple experiments.
Strengths
- The writing is generally clear, and the content is relatively comprehensive.
- The method considers both diversity and uncertainty, which is a good idea.
- The experiments are quite thorough.
Weaknesses
- The main diagram, Figure 2, lacks detail and clarity, and its meaning overlaps with the schematic in Figure 1. Figure 2 should focus on explaining how the methods "Boundary Score Calculation" and "Iterative Selection and Removal" are implemented (in detail) rather than simply outlining the overall process.
- Is there consistency between the decision boundaries of the K pseudo-classes and the decision boundaries of the true task categories? Are the decision boundaries of the unsupervised method consistent with the decision boundaries when the pre-trained model is fine-tuned on downstream tasks? (Different types and capabilities of models have different decision boundaries.) It seems these issues were neither considered nor analyzed.
- The theoretical foundation is weak. Is it necessary to introduce data on class decision boundaries? Is there a theoretical basis for applying "diversity and uncertainty" sampling to active fine-tuning methods? Is there a theoretical basis for the design and implementation methods?
- Exploration of the number of uncertain sample points: How is K determined? Having too many such sample points presents two problems: incorrectly labeled samples and disruption of learning the general features of the categories.
- "but neglects the contributions of samples to local boundaries." Is the statement in the abstract deviating from the original meaning? Should it be "but neglects the contributions of local boundary samples" instead of "but neglects the contributions of samples to local boundaries"?
Questions
Please see the weakness.
Limitations
Yes.
Q1: Figure 2 should be modified.
R1: Thank you for your suggestion. In the revised version, we will enhance the figure to improve its detail and clarity.
Q2: Is there consistency between the decision boundaries of the K pseudo-classes and the decision boundaries of the true task categories? Are the decision boundaries of the unsupervised method consistent with the decision boundaries when the pretrained model is fine-tuned on downstream tasks?
R2: Yes, these conditions are consistent! We conducted linear probing on all the samples with true labels using features from both the pre-trained model and the oracle model (fine-tuned on all samples). We analyzed whether samples selected using different methods—Random, ActiveFT, BiLAF (ours)—tend to be near the decision boundaries. We used two metrics for this analysis: 1) Entropy, where a higher value indicates greater uncertainty and a propensity towards boundary samples. 2) Prob_Diff (probability difference between the top two classes) calculated as the difference between the highest and second-highest probabilities. A smaller value indicates that the sample is closer to the boundary between these two classes.
We computed the mean values for all samples and for the top 50 samples. We found that samples selected based on pre-trained features exhibit higher Entropy and smaller Probability Differences in both settings, which sufficiently demonstrates that our method aligns with the true decision boundaries, and that the samples selected from unsupervised pre-trained features remain near the decision boundaries after the pre-trained model is fine-tuned on downstream tasks.
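For clarity, a minimal sketch of how the two metrics can be computed from per-sample class probabilities (e.g., softmax outputs of the linear probe); the function name and the small epsilon are ours.

```python
import numpy as np

def boundary_metrics(probs):
    """probs: (n_samples, n_classes) predicted class probabilities."""
    # Entropy: higher -> more uncertain -> closer to a decision boundary.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Prob_Diff: gap between the top-2 class probabilities; smaller -> the
    # sample lies closer to the boundary between those two classes.
    top2 = np.sort(probs, axis=1)[:, -2:]
    prob_diff = top2[:, 1] - top2[:, 0]
    return entropy.mean(), prob_diff.mean()
```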
Q3: The theoretical foundation is weak. Is it necessary to introduce data on class decision boundaries? Is there a theoretical basis for applying "diversity and uncertainty" sampling to active fine-tuning methods? Is there a theoretical basis for the design and implementation methods?
R3:
- Theoretical Assurance: Assuming a linearly separable binary classification problem, our theoretical foundation lies in Support Vector Machines (SVMs). The objective of an SVM is to find the hyperplane that maximizes the margin, enhancing the classifier's generalization ability. Only the points on the boundary (the support vectors) contribute to the margin, while interior points do not affect its size. This classical design emphasizes the importance of boundary samples (the standard margin formulation is sketched after this list).
- Related Work: Previous works in Active Learning invariably involve diversity and uncertainty. Diversity aims to fit the original data distribution better with fewer samples. As the labeled data increases, allocating the remaining budget to uncertainty becomes common.
- Design Philosophy: Faced with multiple challenges in a single-iteration data-selection scenario, we must determine class centers and choose boundary samples. We also need to consider potential noise, where different classes mix at the boundaries and each class may have multiple feature centers. Thus, we select more pseudo-classes than actual classes to ensure coverage of the different classes, and we employ denoising to reduce sample mixing at boundaries and avoid noise as much as possible. The denoising process mainly targets noise in non-ideal scenarios. The Opponent Penalty and Iterative Selection encourage regional diversity among the boundary samples. Overall, building on solid boundary theory, we have designed each part for real-world noisy scenarios, using a single default set of parameters throughout (except for the number of pseudo-classes, which varies with the sample size), which further demonstrates our method's effectiveness.
- Oracle Verification: We applied our method to features from an Oracle model (fine-tuned on all samples). At a 5% budget on CIFAR-100 we achieved 65.7%, and at a 10% budget we reached 75.3%, both over 1.5% higher than selecting central points alone, demonstrating the practical upper bound. We also confirmed our method's consistency with actual conditions in Q2. This further validates the effectiveness of our approach from this perspective.
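For reference, the standard hard-margin SVM formulation behind the Theoretical Assurance point above (textbook notation, not notation from our paper):

$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i\,(w^\top x_i + b) \ge 1,\quad i=1,\dots,n,$$

whose margin equals $2/\lVert w\rVert$. Only samples with $y_i(w^\top x_i + b) = 1$ (the support vectors) determine the optimal hyperplane; removing strictly interior points leaves the solution unchanged, which is the sense in which boundary samples carry the information.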
Q4: Exploration of the number of uncertain sample points: How is K determined? Having too many such sample points presents two problems: incorrectly labeled samples and disruption of learning the general features of the categories.
R4:
- Core Number K: We conduct experiments to estimate an approximate threshold for the Core Number needed.
- Boundary Sample Selection: Our method's components tackle the problem:
- Denoising: Removes highly uncertain boundary points to prevent class mixing, conservatively choosing more accurate points.
- Opponent Penalty: Encourages diverse explorations across boundaries, allowing a more varied distribution.
- Iterative Selection: Removes nearby points of selected samples, maintaining diversity by choosing boundary-nearer points instead of regional centers.
- Practical Experiments: Since exact sample centers are unknown, we use numerous pseudo-classes, K, with boundaries between same-label pseudo-classes acting as internal points.
- Experiment Consistency: This flexible framework achieves SOTA results using the same default hyperparameters. The ablation study showed that a suitable K further enhances sample quality, indicating considerable room for improvement.
Q5: Should it be "but neglects the contributions of local boundary samples" in abstract?
R5: We will adopt your suggestion in the revised version, as it appears more reasonable. Thank you for helping to further enhance the readability of our paper.
Thank you for your valuable time and insightful comments. We hope our responses have addressed your concerns. We welcome further discussions and would sincerely appreciate it if you could reconsider your rating.
[Finally, experiments for Q2 and Q4 will be detailed in official comments due to the characters limitation.]
Experiments For Q2.
Experiments are as follows:
| CIFAR10(Pretrain) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 250 | 0.0935 | 0.9486 | 0.4260 | 0.7536 |
| ActiveFT | 250 | 0.0424 | 0.9747 | 0.2073 | 0.8743 |
| BiLAF(ours) | 250 | 0.1023 | 0.9366 | 0.4769 | 0.6924 |
| Random | 500 | 0.0960 | 0.9416 | 0.7022 | 0.5056 |
| ActiveFT | 500 | 0.0433 | 0.9763 | 0.3690 | 0.7799 |
| BiLAF(ours) | 500 | 0.1149 | 0.9273 | 0.7089 | 0.4500 |
| Random | 1000 | 0.0955 | 0.9430 | 0.9181 | 0.3081 |
| ActiveFT | 1000 | 0.0849 | 0.9495 | 0.8810 | 0.3560 |
| BiLAF(ours) | 1000 | 0.1461 | 0.9064 | 1.0262 | 0.2005 |
| CIFAR100(Pretrain) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 500 | 0.6240 | 0.7295 | 2.2516 | 0.0876 |
| ActiveFT | 500 | 0.2962 | 0.8664 | 1.5663 | 0.2369 |
| BiLAF(ours) | 500 | 0.3933 | 0.8167 | 1.7832 | 0.1423 |
| Random | 1000 | 0.5317 | 0.7766 | 2.2375 | 0.0594 |
| ActiveFT | 1000 | 0.3650 | 0.8430 | 2.0812 | 0.0833 |
| BiLAF(ours) | 1000 | 0.4751 | 0.7815 | 2.2606 | 0.0542 |
| Random | 2500 | 0.5253 | 0.7749 | 2.6790 | 0.0232 |
| ActiveFT | 2500 | 0.4851 | 0.7936 | 2.6653 | 0.0206 |
| BiLAF(ours) | 2500 | 0.5795 | 0.7476 | 2.7196 | 0.0192 |
| Random | 5000 | 0.5442 | 0.7652 | 2.8791 | 0.0151 |
| ActiveFT | 5000 | 0.5197 | 0.7768 | 2.8487 | 0.0118 |
| BiLAF(ours) | 5000 | 0.6219 | 0.7336 | 2.9194 | 0.0090 |
| CIFAR10(Oracle) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 250 | 0.001512 | 0.999831 | 0.002437 | 0.999729 |
| ActiveFT | 250 | 0.001521 | 0.999830 | 0.002403 | 0.999717 |
| BiLAF(ours) | 250 | 0.001541 | 0.999827 | 0.002473 | 0.999714 |
| Random | 500 | 0.001483 | 0.999835 | 0.002782 | 0.999676 |
| ActiveFT | 500 | 0.001542 | 0.999828 | 0.002958 | 0.999644 |
| BiLAF(ours) | 500 | 0.001605 | 0.999820 | 0.003020 | 0.999641 |
| Random | 1000 | 0.001551 | 0.999823 | 0.003543 | 0.999575 |
| ActiveFT | 1000 | 0.001527 | 0.999829 | 0.003472 | 0.999586 |
| BiLAF(ours) | 1000 | 0.001598 | 0.999820 | 0.003588 | 0.999567 |
| CIFAR100(Oracle) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 500 | 0.049238 | 0.992216 | 0.176811 | 0.956287 |
| ActiveFT | 500 | 0.040140 | 0.993335 | 0.124594 | 0.962662 |
| BiLAF(ours) | 500 | 0.042792 | 0.994406 | 0.137114 | 0.965457 |
| Random | 1000 | 0.047730 | 0.993905 | 0.202417 | 0.963037 |
| ActiveFT | 1000 | 0.042763 | 0.993913 | 0.198896 | 0.961337 |
| BiLAF(ours) | 1000 | 0.045447 | 0.993854 | 0.203469 | 0.960462 |
| Random | 2500 | 0.046654 | 0.993407 | 0.297133 | 0.919803 |
| ActiveFT | 2500 | 0.047036 | 0.993245 | 0.303767 | 0.919122 |
| BiLAF(ours) | 2500 | 0.047357 | 0.993103 | 0.309907 | 0.917868 |
| Random | 5000 | 0.047028 | 0.993741 | 0.407092 | 0.897825 |
| ActiveFT | 5000 | 0.045173 | 0.994264 | 0.340326 | 0.925870 |
| BiLAF(ours) | 5000 | 0.049476 | 0.992885 | 0.466601 | 0.842029 |
Therefore, there is consistency between the decision boundaries of the K pseudo-classes and the decision boundaries of the true task categories. The decision boundaries of the unsupervised method are also consistent with the decision boundaries obtained when the pre-trained model is fine-tuned on downstream tasks.
Experiments For Q4.
We employed the Core Sample Selection method to obtain different numbers of central points and then analyzed the benefits these central points brought. Specifically, we define Distance as the mean Euclidean distance from each sample to the nearest selected point in the feature space, and we investigate the Rate of Return (incremental benefit per core sample) across different ranges, where Rate of Return = Distance Difference / Core Number Difference between two adjacent columns.
- CIFAR10
| Core Num | 50 | 100 | 150 | 250 | 375 | 500 | 1000 | 1500 | 2500 | 5000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Distance | 0.7821 | 0.7588 | 0.7472 | 0.7307 | 0.7203 | 0.7117 | 0.6896 | 0.6746 | 0.6506 | 0.6010 |
| Rate of Return | - | 4.6575 | 2.3264 | 1.6422 | 0.8362 | 0.6887 | 0.4420 | 0.3002 | 0.2398 | 0.1984 |
- CIFAR100
| Core Num | 50 | 100 | 150 | 250 | 375 | 500 | 1000 | 1500 | 2500 | 5000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Distance | 0.8378 | 0.8082 | 0.7913 | 0.7724 | 0.7564 | 0.7478 | 0.7229 | 0.7059 | 0.6800 | 0.6272 |
| Rate of Return | - | 5.9221 | 3.3791 | 1.8906 | 1.2797 | 0.6845 | 0.4985 | 0.3404 | 0.2589 | 0.2112 |
We found that the rate of return diminishes gradually, indicating that core samples are crucial in the early stages, while the benefits decrease significantly later on. A clear demarcation point can serve as a guide for when to begin Boundary Sample Selection, such as the 250-375 range for CIFAR-10 and the 375-500 range for CIFAR-100. This provides a simple yet effective guideline. Additionally, in practical applications, we discovered that introducing boundary points earlier may yield better results, such as on CIFAR-100 with a 1% (500-sample) annotation budget.
Thank you for your insightful suggestions. We believe we have comprehensively addressed your questions regarding the consistency of decision boundaries, the basis for designing boundary samples, and the threshold for the number of uncertain samples.
It is worth noting that our method aligns with the true decision boundaries, and the samples selected from unsupervised pre-trained features are consistent with those used when the pre-trained model is fine-tuned on downstream tasks. We have validated the effectiveness of boundary samples from three perspectives: theoretical guarantees, related work, and Oracle experiments. Ultimately, we have proposed a simple yet effective guideline as a threshold for selecting boundary samples.
We are wondering whether you have any additional questions or comments regarding our response to your review comments. We will do our best to address them.
We appreciate your time and effort reviewing our manuscript. Thanks for your consideration!
Thank you very much for your recognition of our paper and the increased score! We also appreciate the valuable suggestions you've provided. We will make further improvements in the revised version to meet your expectations.
In this paper, the authors introduce a novel Bi-Level Active Finetuning Framework (BiLAF) designed to address the limitations of existing active learning methods in the context of the pretraining-finetuning paradigm. The framework aims to optimize sample selection for finetuning models within a limited annotation budget. The authors propose an innovative unsupervised denoising technique to eliminate noisy samples and use a newly designed boundary score metric for iterative boundary sample selection. Extensive experiments demonstrate that BiLAF outperforms existing methods across various datasets and tasks.
Strengths
- The bi-level framework that combines core sample selection with boundary sample selection effectively addresses limitations in existing active learning methods.
- The novel unsupervised denoising technique effectively eliminates noisy samples, improving the reliability of sample selection.
- Extensive experiments on multiple datasets and tasks consistently show BiLAF outperforms state-of-the-art methods.
- The paper is well-organized with a clear explanation of the motivation, proposed method, and experimental results.
Weaknesses
Active Finetuning is advantageous for model fine-tuning in scenarios with limited data, and the motivation behind the proposed BiLAF is clear. However, I have concerns regarding the generalizability of this method to different data sizes. As shown in Table 1, there is a significant performance drop on the CIFAR10 dataset under the 0.5% setting. This decline might be attributed to the small size of CIFAR10 images and the low 0.5% ratio, making it challenging to select and denoise boundary samples. It raises the question of whether BiLAF requires a certain threshold of fine-tuning data to be effective, which warrants further investigation by the authors. Additionally, the CVPR 2024 paper (see reference below) reports the performance on CIFAR10 under 0.1% and 0.2% settings, which is highly relevant. I believe that Active Finetuning would be more meaningful in scenarios with extremely limited data (e.g., few-shot fine-tuning), where changing the random seed can lead to significant accuracy fluctuations. Therefore, I recommend the authors include experiments and analyses on such scenarios. Furthermore, comparative experiments with other settings in the paper should be included to substantiate the method's effectiveness. Moreover, when the amount of fine-tuning data increases, BiLAF's performance becomes comparable to the baseline ActiveFT. Hence, it would be insightful for the authors to discuss whether BiLAF remains effective with larger amounts of fine-tuning data, such as 20% of the ImageNet-1k dataset. This additional discussion would enhance the paper's comprehensiveness.
Xu, Wenshuai, et al. "ActiveDC: Distribution Calibration for Active Finetuning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
Questions
Same as weakness.
Limitations
As described in the Weakness section, there might be some critical aspects that are not verified enough. The authors are encouraged to show additional results to support their claims.
Q1: Threshold: Does BiLAF require a certain threshold of fine-tuning data to be effective?
R1: This is an excellent question. The performance gap with just 0.5% of CIFAR-10 data indeed triggers considerations about scale. In traditional active learning, the applicability of different methods varies with scale. The classic Coreset[1] method aims to cover all samples with the selected points, whereas ProbCover[2] notes that Coreset performs poorly when the data scale is very small, leading to the proposal of a coverage radius to ensure each selected point controls a fixed area. In our scenario, it is evident that using central points to fit different dense distributions appears to be a more rational choice when the budget is extremely limited, without using additional samples for boundary selection. We believe that this situation is both normal and objectively present.
Following this, we explored what such a threshold might look like. We employed the Core Sample Selection method to obtain different numbers of central points and then analyzed the benefits these central points brought. Specifically, we define Distance as the mean Euclidean distance from each sample to the nearest selected point in the feature space, and we investigate the Rate of Return (incremental benefit per core sample) across different ranges, where Rate of Return = Distance Difference / Core Number Difference between two adjacent columns.
- CIFAR10
| Core Num | 50 | 100 | 150 | 250 | 375 | 500 | 1000 | 1500 | 2500 | 5000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Distance | 0.7821 | 0.7588 | 0.7472 | 0.7307 | 0.7203 | 0.7117 | 0.6896 | 0.6746 | 0.6506 | 0.6010 |
| Rate of Return | - | 4.6575 | 2.3264 | 1.6422 | 0.8362 | 0.6887 | 0.4420 | 0.3002 | 0.2398 | 0.1984 |
- CIFAR100
| Core Num | 50 | 100 | 150 | 250 | 375 | 500 | 1000 | 1500 | 2500 | 5000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Distance | 0.8378 | 0.8082 | 0.7913 | 0.7724 | 0.7564 | 0.7478 | 0.7229 | 0.7059 | 0.6800 | 0.6272 |
| Rate of Return | - | 5.9221 | 3.3791 | 1.8906 | 1.2797 | 0.6845 | 0.4985 | 0.3404 | 0.2589 | 0.2112 |
We found that the rate of return diminishes gradually, indicating that core samples are crucial in the early stages, while the benefits decrease significantly later on. A clear demarcation point can serve as a guide for when to begin Boundary Sample Selection, such as the 250-375 range for CIFAR-10 and the 375-500 range for CIFAR-100. This provides a simple yet effective guideline. Additionally, in practical applications, we discovered that introducing boundary points earlier may yield better results, such as on CIFAR-100 with a 1% (500-sample) annotation budget.
Similarly, how to allocate proportions remains a question. Our ablation study shows that for different budgets, there are varying optimal core numbers, which remains an open question. To some extent, we can determine this based on the rate of return and the number of sample categories. In the paper, we adhered to a uniform hyperparameter setting, which actually demonstrates that our framework offers more flexible parameter choices with significant potential for performance improvement.
[1] Sener, Ozan, and Silvio Savarese. "Active Learning for Convolutional Neural Networks: A Core-Set Approach." ICLR 2018.
[2] Yehuda, Ofer, et al. "Active Learning Through a Covering Lens." NeurIPS 2022.
Q2: Limited Data Condition: ActiveDC reports the performance on CIFAR10 under 0.1% and 0.2% settings. Would Active Finetuning be more meaningful in scenarios with extremely limited data?
R2: ActiveDC employs pseudo-labeling from semi-supervised learning, which pertains to how to fine-tune after sample selection, rather than the selection of the samples themselves. Therefore, the appropriate comparisons should be with other semi-supervised training or fine-tuning methods. Following Reviewer V2Vc's suggestion in Q2, we experimented with KNN and Linear Probing in scenarios with very small data volumes and achieved results that significantly surpassed those of ActiveDC.
| CIFAR10 | 0.1% | 0.2% |
|---|---|---|
| Random+KNN | 64.2 | 76.6 |
| Random+Linear | 67.1 | 78.7 |
| Finetune(ActiveDC) | 61.3 | 73.1 |
We also found that these basic methods become ineffective as the data volume increases. Consequently, we believe that active selection for full-scale fine-tuning should focus on scenarios where the data volume is not extremely small. Our method selects higher-quality samples and achieves superior results across different fine-tuning paradigms. We will reference the ActiveDC paper in the revised version and discuss the differences between our work and ActiveDC.
Q3: Larger Data Condition: How about larger amounts of finetuning data, such as 20% of ImageNet.
R3: We have supplemented our results with 10% and 20% budget scenarios on ImageNet, and our method continues to maintain a strong advantage. Theoretically, as the data volume increases, the ActiveFT selection tends toward Random, i.e., toward a uniform distribution, whereas our method effectively identifies boundaries, providing better separability. Empirically, our method has shown impressive results, and the Improvement Ratio increases, where Improvement Ratio = (BiLAF − Random) / (ActiveFT − Random) (a worked example follows the table below). In active learning, it is normal for accuracy differences to shrink as the data volume grows, making the improvement ratio a reasonable metric to assess performance.
| Selection Ratio | Random | ActiveFT | BiLAF(ours) | Improvement Ratio |
|---|---|---|---|---|
| 1% | 45.1 | 50.1 | 50.8 | 1.14 |
| 2% | 52.1 | 55.8 | 56.9 | 1.30 |
| 5% | 64.3 | 65.3 | 66.2 | 1.9 |
| 10% | 71.2 | 71.8 | 72.5 | 2.17 |
| 20% | 74.5 | 74.8 | 75.2 | 2.33 |
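As a worked example of the metric, using the 1% row above:

$$\text{Improvement Ratio}_{1\%} = \frac{\text{BiLAF} - \text{Random}}{\text{ActiveFT} - \text{Random}} = \frac{50.8 - 45.1}{50.1 - 45.1} = \frac{5.7}{5.0} = 1.14.$$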
Overall, we have effectively analyzed our method across different scales, proposed effective boundary metrics for smaller data volumes, and empirically demonstrated the superior results and feasibility of our method with larger data volumes. We hope our responses have addressed your concerns. We welcome further discussions and sincerely appreciate it if you could reconsider your rating.
Thank you for your insightful suggestions. We believe we have comprehensively addressed your questions regarding the data scale used for fine-tuning the model and the scenarios where our method is applicable.
It is worth noting that we have proposed a simple yet effective guideline as a threshold for selecting boundary samples, which enhances the flexibility of our method. Additionally, we have conducted supplementary experiments that demonstrate our method maintains its advantages and is even more effective when applied to larger datasets.
We are wondering whether you have any additional questions or comments regarding our response to your review comments. We will do our best to address them.
We appreciate your time and effort reviewing our manuscript. Thanks for your consideration!
Thank you for your recognition of our paper and the increased score!
This paper proposes a Bi-Level Active Finetuning Framework (BiLAF) for optimizing sample selection in the pretraining-finetuning paradigm. BiLAF combines global diversity and local decision uncertainty through two stages: core sample selection and boundary sample selection. Without requiring labels, the method successfully identifies pseudo-class centers, employs a tailored denoising technique, and iteratively selects boundary samples. Experimental results demonstrate that BiLAF consistently outperforms existing baseline methods across various vision tasks, demonstrating its superior efficacy in improving model performance.
Strengths
- The paper is well-structured, with clear organization into subsections that introduce the different stages of the method. The use of an algorithm to summarize the entire process is effective.
- The approach not only focuses on the centers of each class but also pays attention to the boundary samples between classes, effectively balancing global diversity and local decision uncertainty when selecting samples for annotation.
- The iterative selection and removal strategy, along with the opponent penalty, effectively prevents the aggregation of multiple samples near the same pseudo-class boundary that faces the same opposing pseudo-class center.
Weaknesses
- There is ambiguity in the use of symbols in the method description, leading to unclear explanations.
- The authors claim that the method is effective for imbalanced datasets but do not provide a thorough theoretical explanation. Additionally, experimental results show that the improvement over ActiveFT on the CIFAR100LT dataset with a 15% annotation ratio is only 0.3%, whereas the improvement is much more significant on balanced datasets. This contradicts the authors' claim. The authors explain that "the default denoising removal ratio parameters might remove minority samples in long-tail distributions," which highlights a potential issue with their algorithm design under long-tail conditions, conflicting with their earlier statement.
- The ablation study shows that the inclusion of the opponent penalty contributes little to the performance improvement and even causes performance degradation under lower budgets. The necessity and design of the opponent penalty need to be reconsidered.
Questions
- What does c_i in equation (1) refer to? Previously, c_i was used to denote the cluster centers, but here it seems the authors use c_i to represent the assignment of f_j. Additionally, what does a_i of X_i in section 3.3.2 represent? The inconsistent use of symbols is confusing.
- On what basis do the authors claim that this algorithm is effective on imbalanced datasets? Particularly when the authors state that classes with fewer samples might disappear during the denoising and iterative selection processes.
- In the explanation of Fig. 3, the authors state that "Pentagrams represent the selected core samples, while circles denote the chosen boundary samples." However, the figure shows that the learned center points are still biased towards the boundaries. Does this not risk the model focusing too much on the boundaries and neglecting the intra-class distribution characteristics?
Limitations
The authors acknowledge the issue that in long-tail scenarios, the denoising methods might remove outliers that are key samples. However, what the authors do not mention is that the numerous hyperparameters in their algorithm can lead to difficulties in achieving optimal performance and even cause instability in different scenarios.
Important Notes. The reviewer's assertion in the Limitations that "numerous hyperparameters can cause instability in different scenarios" is not justified. Except for the Core Number, we consistently employ the same hyperparameters across all six different tasks, including the number of nearest neighbors, the denoising rate, the clustering fraction, and the opponent penalty. The adjustment of the Core Number is tailored to the varying numbers of classes across tasks, and we keep the Core Number uniform across different budgets within the same task. This uniformity underscores the stability and scalability of our framework. Tuning these parameters could further raise the performance ceiling of our method; in our Ablation Study, changes to the Core Number and the Core Selection Method indicated further enhancements.
Q1: There is ambiguity in the use of symbols, such as c_i in equation (1) and a_i of X_i in Sec. 3.3.2.
R1: The sample set is {x_j}_{j=1}^N, where the subscript j indexes the sample x_j and its corresponding feature f_j. In Section 3.2, Core Sample Selection identifies the center points, represented by an index set. In Section 3.3.1, Equation (1), for each sample x_j we find the center point c_i whose feature is nearest to f_j in the feature space, and we assign this sample to the cluster controlled by the center c_i, denoted as X_i.
In Sec. 3.3.2, we introduce a_i^j, where a_i^j ∈ {1, ..., N}, to represent the index of the j-th element in the set X_i controlled by the i-th center c_i. Since the subsequent operations are performed on each set X_i individually, a_i^j serves as the index of the sample and its feature within that set. In effect, the a_i^j record all the indices of samples belonging to the center c_i.
The same subscript is used as an enumeration variable in different contexts, which might have led to some ambiguity. We will address these ambiguities and provide clearer explanations in the revised version.
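To make the assignment in Equation (1) concrete, here is a minimal sketch of nearest-center assignment on the pre-trained features; the function and variable names are ours, and the Euclidean metric is an assumption consistent with the distances used elsewhere in this rebuttal.

```python
import numpy as np

def assign_to_centers(features, center_indices):
    """features: (N, d) pre-trained features; center_indices: indices of the
    selected core samples. Returns, for each sample, the index (into
    center_indices) of its nearest center, i.e. its pseudo-class X_i."""
    centers = features[center_indices]                                     # (K, d)
    sq_dist = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # (N, K)
    return sq_dist.argmin(axis=1)                                          # (N,)
```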
Q2: Why is the method effective on imbalanced datasets? The improvement on the CIFAR100LT dataset with a 15% annotation ratio is only 0.3%, given the claim that "the default denoising removal ratio parameters might remove minority samples".
R2: Firstly, these methods are not specifically tailored for long-tailed distributions. Our method, which focuses on boundaries, exhibits strong representational capability in general scenarios, providing a natural advantage. This supports our claim of the method's universality based on its performance on the long-tailed dataset.
- Denoising involves a trade-off; excessive denoising can eliminate minority classes at the boundary. However, totally without any denoising, selecting boundary samples based on minority class centers could mistakenly include majority class samples because minority and majority classes are often close in boundaries.
- Objectively, as the annotation rate increases, ActiveFT tends to select long-tailed samples as well, which may reduce the gap as the data volume grows.
- The default core number and denoising ratio were selected in the paper, which could potentially exclude long-tailed samples due to their isolated distribution or outlier status. Upon adjusting the Core Number and Denoising Ratio in experiments, we observed considerable improvement.
| Denoising Ratio | 0% | 5% | 10% | 20% |
|---|---|---|---|---|
| Core Number 2.5% | 37.10±0.60 | 37.61±0.66 | 37.33±0.71 | 37.04±0.89 |
| Core Number 5% | 37.24±0.56 | 37.85±0.59 | 37.67±0.64 | 37.17±0.78 |
Q3: The necessity and design of the opponent penalty need to be reconsidered with little contribution to the performance improvement.
R3: In the Ablation Study, we observed an intriguing trend: the negative impact of the components examined in IDs 1-4 generally diminishes as the number of selected samples increases, while the benefit of the opponent penalty grows with a larger sample size. The design intention behind this metric is to prevent the selection of too many similar boundary points, which can degrade model performance. At sample sizes of 5% and 10%, the opponent penalty yields an accuracy improvement of over 0.5%, which we consider significant.
As explained in the Important Notes above, we used the same hyperparameters in all experiments. In practice, parameters such as the Core Number, Denoising Rate, and Opponent Penalty can be further tailored to specific scenarios to enhance the model's training capacity. This metric can also be applied flexibly depending on the amount of data selected.
Q4: Learned center points are still biased towards the boundaries? Does this not risk the model focusing too much on the boundaries and neglecting the intra-class distribution characteristics?
R4: The possible reasons are: 1) each class may have multiple centers; 2) t-SNE visualizes the data in two dimensions, so points that appear near boundaries can still be centers in the high-dimensional space. In our method, the Opponent Penalty encourages diverse exploration across boundaries, allowing a more varied distribution, and the Iterative Selection removes points near already selected samples, maintaining diversity by choosing points nearer to boundaries rather than regional centers. Moreover, we may sample multiple points within a class as pseudo-centers, so boundary points between pseudo-classes of the same label act as internal points of the class and better fit its distribution. Therefore, our method not only focuses on the classification boundaries but can also capture the intra-class distribution characteristics. (A simplified sketch of the iterative selection-and-removal step is given below.)
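To illustrate the selection-and-removal loop described above, here is a heavily simplified sketch. It assumes a precomputed per-sample boundary score and removes the nearest neighbors of each pick; the paper's actual boundary score and Opponent Penalty formulas are not reproduced here.

```python
import numpy as np

def iterative_boundary_selection(features, boundary_score, budget, n_remove=5):
    """Greedy sketch: repeatedly pick the candidate with the highest boundary
    score, then drop its n_remove nearest neighbors from the candidate pool so
    that later picks come from other boundary regions."""
    candidates = set(range(len(features)))
    selected = []
    while candidates and len(selected) < budget:
        best = max(candidates, key=lambda i: boundary_score[i])
        selected.append(best)
        candidates.discard(best)
        cand = np.fromiter(candidates, dtype=int)
        dists = np.linalg.norm(features[cand] - features[best], axis=1)
        for j in cand[np.argsort(dists)[:n_remove]]:
            candidates.discard(int(j))
    return selected
```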
Thank you for your valuable time and insightful comments. We hope our responses have addressed your concerns. We welcome further discussions and would sincerely appreciate it if you could reconsider your rating.
Thank you for your insightful suggestions. We believe we have comprehensively addressed your questions regarding the effectiveness of our method in long-tail problems, the necessity of the opponent penalty design, and the selection of sample distributions.
It is worth noting that our method maintained default parameters across all experiments, demonstrating the universality of the model. The design of different components not only preserves this universality but also enhances the flexible architecture and performance ceiling of our method. Additional experiments have confirmed that our approach further improves performance on long-tail tasks.
We are wondering whether you have any additional questions or comments regarding our response to your review comments. We will do our best to address them.
We appreciate your time and effort reviewing our manuscript. Thanks for your consideration!
Thanks very much for the feedback. My raised issues have been fully addressed. After going through the authors' responses, the other reviewers' comments, and the whole review process, I am glad to raise my score by one point.
Thank you for your recognition of our paper and the improved score! We also appreciate the valuable suggestions you provided. We will incorporate these insights into the revised version of our paper, enhancing the clarity of our expressions and formulas to improve the paper's readability.
This paper introduces BiLAF, a novel approach for selecting boundary examples alongside core samples to enhance the fine-tuning of pre-trained models for downstream tasks. Specifically, the boundary selection strategy leverages the distinction between intra- and inter-class distances within the pre-trained feature space and incorporates an opponent penalty to promote diversity across different boundaries. Experiments and ablation studies demonstrate the effectiveness of the proposed BiLAF in achieving state-of-the-art results.
Strengths
- The paper is well-written and easy to follow.
- In addition to classification tasks, the experiments include different scenarios such as detection and segmentation. In these settings, BiLAF demonstrates the state-of-the-art results.
- Ablation studies and execution time are included in the analysis to help provide a deeper understanding of the proposed BiLAF.
Weaknesses
- BiLAF seems to heavily rely on pre-trained features, as all of its data selection processes are based on them. This reliance raises concerns due to potential discrepancies between the pre-training and fine-tuning tasks. Note that features from pre-trained classifiers are commonly used in traditional active learning because the labeled and unlabeled datasets are usually from the same task; however, this might not hold true in general pre-training and fine-tuning paradigms. Moreover, Line 120 mentions that “By leveraging pre-trained models, data samples are mapped to robust feature representations that elucidate the relationships among samples, their intra-class counterparts, and inter-class samples from diverse classes.” It remains unanswered and uncertain to what extent these pre-trained features are robust enough to effectively apply BiLAF.
- In the typical pre-training and fine-tuning paradigm, full fine-tuning may not always be the optimal approach. For instance, when we only have limited/few-shot data, techniques like linear probing or even nearest-neighbor classifiers can yield better results. Furthermore, the choice of fine-tuning method can also be influenced by the similarity between pre-training and downstream tasks [1]. However, the paper lacks a discussion of this aspect and solely focuses on full fine-tuning, neglecting the potential benefits of alternative methods. Specifically, does BiLAF remain necessary for selecting high-quality data points if we can apply more suitable fine-tuning methods? Or would simple random sampling suffice?
[1] Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning, ICML 2022.
Questions
Besides the weakness shown in the above section, please also see the following questions:
- The efficacy of pre-trained features for the fine-tuning task is essential for BiLAF. Would there be a consistent improvement if we first apply unsupervised pre-training on the unlabeled fine-tuning set?
- Algorithm 1's pseudo-code indicates that ActiveFT is the first stage of BiLAF. However, Table 3 suggests that BiLAF is faster than ActiveFT. Why is BiLAF faster than ActiveFT if it needs to run ActiveFT first?
- It is an interesting idea to consider an opponent penalty to encourage diversity across boundaries. I wonder if this penalty has varying effects on different pseudo-classes. Specifically, does it help more on a pseudo-class with more neighboring classes (which could potentially be the more confusing class), like the pink one in Figure 3, while helping less for those on the border, such as the green one in Figure 3?
Limitations
In the appendix, the paper acknowledges a limitation in the proposed method, as it is based on general features. However, as I indicated in the weakness section, the paper lacks a discussion/study to investigate the severity of this limitation and its potential impact on the performance.
Q1: Concern about pre-trained features. Would there be a consistent improvement if we first apply unsupervised pre-training on the unlabeled fine-tuning set?
R1:
- Our main experiment utilized a checkpoint pre-trained on ImageNet using DINO. We conducted studies both on consistent tasks (ImageNet) and inconsistent ones (CIFAR & Others), demonstrating that our method is effective regardless of whether the pre-trained features come from the same task.
- Furthermore, we explored features trained using different paradigms and architectures, such as iBOT and DINO, as well as ResNet50 and DeiT-S, detailed in Appendix F’s Ablation Study. This highlights the generalizability of our approach across different feature sets.
- It is widely acknowledged and applied that many pre-trained models, such as DINOv2 and CLIP, claim robust transferability across diverse downstream tasks after extensive pre-training on large datasets.
- Due to constraints of time and resources, we were unable to perform additional unsupervised training. However, we conducted experiments showing that the boundary points selected based on pseudo-classes are consistent with the actual boundaries, and that the boundaries selected based on pre-trained features are consistent with those of fully trained Oracle models. The experimental results are detailed in the following official comments due to the character limit.
Q2: Full fine-tuning may not always be the optimal approach? Linear probing or even nearest-neighbor classifiers can yield better results. Specifically, does BiLAF remain necessary for selecting high-quality data points with more suitable fine-tuning methods?
R2: Thank you for your insightful comments and suggestions. Following your ideas, we experimented with Linear Probing and KNN on the CIFAR10 dataset and compared the results across Random, ActiveFT, and BiLAF(ours).
| CIFAR10 | 0.5% | 1% | 2% | 5% |
|---|---|---|---|---|
| KNN | ||||
| Random | 82.1 | 86.7 | 88.2 | 90.8 |
| ActiveFT | 86.8 | 87.2 | 88.5 | 91.7 |
| BiLAF | 85.7 | 87.4 | 88.7 | 91.8 |
| Linear | ||||
| Random | 85.1 | 87.6 | 90.1 | 92.5 |
| ActiveFT | 87.8 | 88.7 | 90.6 | 92.8 |
| BiLAF | 86.5 | 89.1 | 91.0 | 93.0 |
| Finetune | ||||
| Random | 77.3 | 82.2 | 88.9 | 94.8 |
| ActiveFT | 85.0 | 88.2 | 90.1 | 95.2 |
| BiLAF | 81.0 | 89.2 | 92.5 | 95.7 |
Based on these results, we have supplemented our experiments with 5% budget. From the table, the following conclusions can be drawn:
- At extremely low data volumes, Linear Probing and KNN outperform full fine-tuning.
- As the data volume increases, the performance improvements of Linear Probing and KNN start to slow down, which gradually necessitates the use of full fine-tuning.
- Interestingly, the quality of data selected by different methods shows a consistent trend across KNN, Linear, and full fine-tuning. Our method, compared to competitors, is able to select more suitable data, which is effective across different fine-tuning paradigms.
- Following prior work, our original focus was on full fine-tuning. We appreciate your comments, which have helped enhance the comprehensiveness of our work.
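For reference, a minimal sketch of the two frozen-feature baselines discussed above; the scikit-learn models, their default settings, and the choice of k are our assumptions rather than the exact configuration behind the table.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

def frozen_feature_baselines(train_feats, train_labels, test_feats, test_labels, k=20):
    """Evaluate a selected (labeled) subset with KNN and a linear probe on
    frozen pre-trained features, instead of full fine-tuning."""
    knn = KNeighborsClassifier(n_neighbors=k).fit(train_feats, train_labels)
    linear = LogisticRegression(max_iter=1000).fit(train_feats, train_labels)
    return knn.score(test_feats, test_labels), linear.score(test_feats, test_labels)
```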
Q3: Why is BiLAF faster than ActiveFT if it needs to run ActiveFT first?
R3: The primary reason is that ActiveFT in our approach is utilized solely to select the core sample set, not the entire budget of 2%, 5%, or 10%. In our CIFAR-100 experiments, we fixed the core number at 0.5%. Therefore, the time consumption of BiLAF mainly comprises the time ActiveFT takes to select 0.5% of the samples plus the time for boundary sample selection, which can be faster than directly selecting all samples with ActiveFT. For example, when B=2%, the runtime consists of the denoising process, the iterative boundary sample selection, and the time allocated to ActiveFT for selecting core points and other operations. A significant portion of the time, including the denoising, is independent of B. When B is small, the impact of boundary sample selection based on B is minimal, resulting in a negligible increase in runtime for our method across different values of B. For a comprehensive time complexity analysis, please see Appendix D.
Q4: Specifically, does it help more on a pseudo class with more neighboring classes while helping less for those on the border?
R4: In an ideal scenario, if a central representative is identified for each class, the class on the border would select samples closer to other classes rather than samples farther away, as demonstrated in Figure 1. For instance, as shown in Figure 3, the green class would select adjacent samples from the blue, pink, orange, or even light blue categories, thereby promoting diversity at these boundaries. Such selections can significantly enhance the distinguishability of boundary classes from others, demonstrating that the 'opponent penalty to encourage diversity across boundaries' is a universally applicable strategy. However, in light of your concerns and upon further reflection, we recognize that as the number of boundary selections increases, the advantages for border classes may not be as pronounced as for central classes, which may require more samples to depict their more complex and varied boundaries accurately. Optimizing the distribution of boundary points based on the positions of the centers presents a compelling and valuable new direction for future research.
Thank you once again for your valuable time and insightful comments, which have greatly enhanced our work. We hope our responses have addressed your concerns and demonstrated the versatility of our method. We look forward to further discussions and sincerely appreciate it if you could reconsider your rating.
Experiments For Q1.
We conducted linear probing on all the samples with true labels using features from both the pre-trained model and the oracle model (fine-tuned on all samples). We analyzed whether samples selected using different methods—Random, ActiveFT, BiLAF (ours)—tend to be near the decision boundaries. We used two metrics for this analysis: 1) Entropy, where a higher value indicates greater uncertainty and a propensity towards boundary samples. 2) Prob_Diff (probability difference between the top two classes) calculated as the difference between the highest and second-highest probabilities. A smaller value indicates that the sample is closer to the boundary between these two classes.
| CIFAR10(Pretrain) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 250 | 0.0935 | 0.9486 | 0.4260 | 0.7536 |
| ActiveFT | 250 | 0.0424 | 0.9747 | 0.2073 | 0.8743 |
| BiLAF(ours) | 250 | 0.1023 | 0.9366 | 0.4769 | 0.6924 |
| Random | 500 | 0.0960 | 0.9416 | 0.7022 | 0.5056 |
| ActiveFT | 500 | 0.0433 | 0.9763 | 0.3690 | 0.7799 |
| BiLAF(ours) | 500 | 0.1149 | 0.9273 | 0.7089 | 0.4500 |
| Random | 1000 | 0.0955 | 0.9430 | 0.9181 | 0.3081 |
| ActiveFT | 1000 | 0.0849 | 0.9495 | 0.8810 | 0.3560 |
| BiLAF(ours) | 1000 | 0.1461 | 0.9064 | 1.0262 | 0.2005 |
| CIFAR100(Pretrain) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 500 | 0.6240 | 0.7295 | 2.2516 | 0.0876 |
| ActiveFT | 500 | 0.2962 | 0.8664 | 1.5663 | 0.2369 |
| BiLAF(ours) | 500 | 0.3933 | 0.8167 | 1.7832 | 0.1423 |
| Random | 1000 | 0.5317 | 0.7766 | 2.2375 | 0.0594 |
| ActiveFT | 1000 | 0.3650 | 0.8430 | 2.0812 | 0.0833 |
| BiLAF(ours) | 1000 | 0.4751 | 0.7815 | 2.2606 | 0.0542 |
| Random | 2500 | 0.5253 | 0.7749 | 2.6790 | 0.0232 |
| ActiveFT | 2500 | 0.4851 | 0.7936 | 2.6653 | 0.0206 |
| BiLAF(ours) | 2500 | 0.5795 | 0.7476 | 2.7196 | 0.0192 |
| Random | 5000 | 0.5442 | 0.7652 | 2.8791 | 0.0151 |
| ActiveFT | 5000 | 0.5197 | 0.7768 | 2.8487 | 0.0118 |
| BiLAF(ours) | 5000 | 0.6219 | 0.7336 | 2.9194 | 0.0090 |
| CIFAR10(Oracle) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 250 | 0.001512 | 0.999831 | 0.002437 | 0.999729 |
| ActiveFT | 250 | 0.001521 | 0.999830 | 0.002403 | 0.999717 |
| BiLAF(ours) | 250 | 0.001541 | 0.999827 | 0.002473 | 0.999714 |
| Random | 500 | 0.001483 | 0.999835 | 0.002782 | 0.999676 |
| ActiveFT | 500 | 0.001542 | 0.999828 | 0.002958 | 0.999644 |
| BiLAF(ours) | 500 | 0.001605 | 0.999820 | 0.003020 | 0.999641 |
| Random | 1000 | 0.001551 | 0.999823 | 0.003543 | 0.999575 |
| ActiveFT | 1000 | 0.001527 | 0.999829 | 0.003472 | 0.999586 |
| BiLAF(ours) | 1000 | 0.001598 | 0.999820 | 0.003588 | 0.999567 |
| CIFAR100(Oracle) | Selected Nums | Entropy | Prob_Diff | Entropy(Top50) | Prob_Diff(Top50) |
|---|---|---|---|---|---|
| Random | 500 | 0.049238 | 0.992216 | 0.176811 | 0.956287 |
| ActiveFT | 500 | 0.040140 | 0.993335 | 0.124594 | 0.962662 |
| BiLAF(ours) | 500 | 0.042792 | 0.994406 | 0.137114 | 0.965457 |
| Random | 1000 | 0.047730 | 0.993905 | 0.202417 | 0.963037 |
| ActiveFT | 1000 | 0.042763 | 0.993913 | 0.198896 | 0.961337 |
| BiLAF(ours) | 1000 | 0.045447 | 0.993854 | 0.203469 | 0.960462 |
| Random | 2500 | 0.046654 | 0.993407 | 0.297133 | 0.919803 |
| ActiveFT | 2500 | 0.047036 | 0.993245 | 0.303767 | 0.919122 |
| BiLAF(ours) | 2500 | 0.047357 | 0.993103 | 0.309907 | 0.917868 |
| Random | 5000 | 0.047028 | 0.993741 | 0.407092 | 0.897825 |
| ActiveFT | 5000 | 0.045173 | 0.994264 | 0.340326 | 0.925870 |
| BiLAF(ours) | 5000 | 0.049476 | 0.992885 | 0.466601 | 0.842029 |
Therefore, the conclusion is that our method effectively selects boundary samples, maintaining consistency across models with various capabilities. Consequently, as the quality of model features improves, training with our selected high-quality data can also yield consistent improvements.
Thank you for your insightful suggestions. We believe we have comprehensively addressed your questions concerning the impacts of various pre-training features, fine-tuning paradigms, and the opponent penalty design.
It is worth noting that our method effectively identifies high-quality samples that significantly enhance model training across various pre-training features and tasks. These samples facilitate best accuracy across different fine-tuning paradigms. Our approach is fast, efficient, and generalizable.
We are wondering whether you have any additional questions or comments regarding our response to your review comments. We will do our best to address them.
We appreciate your time and effort reviewing our manuscript. Thanks for your consideration!
Thanks to the authors for their detailed rebuttal. As it has addressed my concerns, I will raise my score from 4 to 5.
Thank you for your positive feedback and the increased rating! We are deeply grateful for your review, which has greatly assisted us in supplementing and perfecting our paper. We will incorporate your suggestions and these new findings into the revised version.
We are grateful to the reviewers for their time and insightful feedback. Overall, we are heartened by their recognition of our paper's clear writing and structure (V2Vc, 2Ep4, 8trC, kN8s), insightful balance of diversity and uncertainty (2Ep4, kN8s, vV4Z), technical novelty and effectiveness (2Ep4, 8trC), extensive experiments and analysis (V2Vc, 8trC, kN8s, vV4Z) with the state-of-the-art performance (V2Vc, 8trC).
We have carefully considered and responded to each of the critical insights and suggestions provided by the reviewers. Our aim is to address all concerns and enhance our work through this collaborative process. We will incorporate these constructive comments into the revised version of our paper. We are confident that incorporating the reviewers' suggestions will significantly improve the quality of our paper and contribute to the field.
Dear Reviewers:
Thank you again for your wisdom and valuable comments. We have provided complete explanations for all the questions. Since the discussion period has been underway for some time, we would be glad to hear from you whether our rebuttal has addressed your concerns. Feel free to comment on our rebuttal if you have further questions and considerations.
Summary: The paper proposes a novel, bi-level fine-tuning method that actively selects uncertain and diverse samples through an unsupervised denoising approach and boundary score evaluation. Extensive experiments demonstrate that BiLAF outperforms existing methods across various datasets and tasks.
Strength: Clear, well-organized, clearly motivated writing; thorough experiments; interesting and meaningful ideas; a new active fine-tuning framework that is effective and combines the advantages of traditional uncertainty methods with the state-of-the-art method and effectively balances global diversity and local decision uncertainty; the novel unsupervised denoising technique effectively eliminates noisy samples, improving the reliability of sample selection.
Weakness: The main diagram can be improved; analysis or discussions on the consistency of decision boundaries, the quality and effect of pre-trained models, fine-grained categorization (a case with less category differentiation) were missing; the theoretical foundation is weak; some technical details (e.g., hyperparameters) and ablation were missing; the generalizability of this method to different data sizes was not clear; comparative experiments with other settings should be included; further justification on imbalanced datasets was missing; the necessity and effectiveness of the opponent penalty need to be reconsidered; only full fine-tuning is considered.
After rebuttal: The authors provided a rebuttal, including new experimental results. Three out of five reviewers read the rebuttal and provided feedback. These reviewers have acknowledged that most of their concerns are well-addressed and have raised the ratings. After rebuttal, the paper received increased ratings 5-5-6-6-7, with an average rating of 5.8.
Recommendation: The AC checked the review, rebuttal, and the reviewers’ further feedback. The AC agreed with the reviewers about the multiple strengths of the papers and appreciated the authors’ efforts to effectively address weaknesses. The AC thus recommended acceptance and suggested the authors incorporate reviewers’ comments in their final version.