PaperHub
Score: 7.8/10 · Oral · 4 reviewers (min 4, max 4, std 0.0; ratings: 4, 4, 4, 4)
ICML 2025

Foundation Model Insights and a Multi-Model Approach for Superior Fine-Grained One-shot Subset Selection

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

This paper investigates the effectiveness of using foundation models (FMs) as information extractors for one-shot subset selection on a set of image datasets, and proposes a novel multi-foundation-model subset selection method called RAM-APL.

Abstract

Keywords

one-shot subset selection · foundation models · data-efficient learning

Reviews and Discussion

Review (Rating: 4)

This paper explores the use of foundation models (FMs) for one-shot subset selection, focusing on fine-grained image datasets. The authors find that FMs outperform traditional information extractors (IEs) in fine-grained tasks but struggle with noisy, coarse-grained datasets. To address this, they propose RAM-APL, a multi-FM framework that combines intra-class (via Ranking Mean, RAM) and inter-class (via Accuracy of Pseudo-class Labels, APL) feature analysis for improved subset selection. The method is evaluated on fine-grained datasets like Oxford-IIIT Pet and CUB-200-2011, demonstrating its effectiveness.
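As a rough intuition for the inter-class component summarized above, an "accuracy of pseudo-class labels" style score can be approximated by nearest-centroid pseudo-labeling: the better a feature space separates classes, the more often a sample's nearest class centroid matches its true label. This is an illustrative stand-in, not the paper's exact APL definition:

```python
import numpy as np

def pseudo_label_accuracy(features: np.ndarray, labels: np.ndarray) -> float:
    """Nearest-centroid pseudo-labeling accuracy (illustrative stand-in for APL).

    Each sample is pseudo-labeled with the class whose feature centroid is
    closest; agreement with the true labels is high when the feature space
    separates classes well.
    """
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # One centroid per class, computed from the (assumed) FM-extracted features.
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    # Euclidean distance of every sample to every centroid: shape (n, n_classes).
    d = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    pseudo = classes[np.argmin(d, axis=1)]
    return float(np.mean(pseudo == labels))
```

On well-separated clusters this score approaches 1.0; heavy class overlap (as on coarse-grained noisy data) drives it down.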

Questions for Authors

Please see the weaknesses in the "Other Strengths And Weaknesses" part, where a summary is provided.

Claims and Evidence

The authors demonstrate RAM-APL's superiority on fine-grained datasets like Oxford-IIIT Pet and CUB-200-2011, with ablation studies validating its components. However, the claim that FMs underperform on noisy, coarse-grained datasets could benefit from broader experiments. Overall, the evidence is solid but could be further reinforced.

Methods and Evaluation Criteria

The proposed RAM-APL method addresses fine-grained subset selection by leveraging multiple foundation models and combining intra-class and inter-class feature analysis, which appears reasonable for the task. The evaluation criteria, including accuracy on datasets like Oxford-IIIT Pet and CUB-200-2011, are appropriate for assessing performance. However, the evaluation could be expanded to include more diverse datasets, particularly those with noisy or coarse-grained characteristics, to better validate the method's robustness and generalizability. Overall, the methods and evaluation criteria are suitable, though broader validation could strengthen the findings.

Theoretical Claims

The paper does not present explicit theoretical claims or proofs; it focuses on the empirical evaluation of the proposed RAM-APL method.

Experimental Design and Analysis

The experimental design is well-structured, with appropriate benchmark datasets (e.g., Oxford-IIIT Pet, CUB-200-2011) and a comprehensive set of baseline methods for comparison. The inclusion of ablation studies and parameter analyses is good, as they provide valuable insights into the contributions of different components of the proposed RAM-APL method. However, the ablation experiments in Tables 1, 2, and 3 are limited to sampling rates of 1%, 50%, and 70%, which raises some concerns. The 1% sampling rate seems unconventional and may not reflect practical scenarios, while the 70% rate sometimes underperforms, as seen in the results. A more balanced evaluation, including the 10% and 30% sampling rates used in Figure 3, would provide a clearer understanding of the method's performance across a wider range of realistic settings.

Supplementary Material

The supplementary material provides detailed experimental setups, results, and analyses for both single-model and multi-model studies. It includes comprehensive comparisons with 12 baseline methods, visualizations of the RAM metric, and an exploration of feature relationships between different foundation models.

Relation to Prior Literature

The key contributions of this paper are closely related to the broader literature on subset selection and foundation models.

Missing Essential References

From my perspective, this article does not omit any essential references.

Other Strengths and Weaknesses

There are no additional strengths or weaknesses beyond those noted in the previous parts; I summarize them here.

Strengths:

  1. Clarity of Methodology: The paper presents the proposed method, RAM-APL, in a straightforward and concise manner, making it easy to understand and follow.
  2. Comprehensive Experimental Setup: The experiments are well-designed and cover a range of datasets and scenarios, providing a thorough evaluation of the method's effectiveness.

Weaknesses:

  1. Lack of Analysis on FM Performance in Coarse-Grained Tasks: While the paper highlights that foundation models (FMs) underperform in coarse-grained, noisy datasets, it does not provide a detailed analysis or explanation for this behavior. Additional experiments or theoretical insights could help clarify why FMs struggle in these scenarios.
  2. Limited Sampling Rate Evaluation in Ablation Studies: Tables 1, 2, and 3 only compare results at 1%, 50%, and 70% sampling rates. The 1% rate is unconventional and may not reflect practical use cases, while the 70% rate sometimes underperforms. It would be more informative to include results for 10% and 30% sampling rates, as shown in Figure 3, to provide a more balanced and realistic evaluation of the method's performance.

Other Comments or Suggestions

N/A

Author Response

Thank you for your insightful feedback! We address your questions in the following responses.


W1: A detailed analysis of why FMs as IEs underperform on coarse-grained datasets with noisy labels.

A1: We sincerely appreciate the reviewer's insightful comments. Due to the character limit in the rebuttal, we kindly refer the reviewer to our response to Reviewer aGx6's "W1&Q1: A deeper discussion of why FM as IE performs poorly on coarse-grained datasets with noisy labels."


W2: Limited sampling rate evaluation in ablation studies.

A2: We sincerely appreciate the reviewer’s insightful feedback and acknowledge the importance of conducting ablation studies across diverse sampling rates.

To address this concern, we have conducted additional ablation studies at 10% and 30% sampling rates. The revised Tables 1, 2, and 3 are provided below.

Table 1. Ablation study based on Pet.

| Method | Information Extractor (IE) | 1% | 10% | 30% | 50% | 70% |
|---|---|---|---|---|---|---|
| MIN | Model-TD | 5.6±0.7 | 14.6±0.5 | 26.4±1.6 | 40.3±2.6 | 55.2±2.7 |
| MIN | CLIP | 5.6±0.2 | 15.4±1.0 | 29.3±2.4 | 45.9±1.8 | 56.3±0.7 |
| MIN | DINOv2 | 6.2±0.1 | 15.5±0.7 | 32.0±1.4 | 46.8±2.0 | 60.5±2.9 |
| RAM | CLIP+DINOv2 | 5.9±0.3 | 15.1±0.5 | 33.1±2.3 | 47.1±1.4 | 56.5±2.7 |
| RAM-APL | CLIP+DINOv2 | 6.5±0.4 | 15.2±1.2 | 32.4±2.9 | 47.5±1.9 | 58.7±2.2 |

Across various sampling rates, RAM consistently outperforms MIN (CLIP as IE), while RAM-APL further improves performance, reaching levels comparable to DINOv2. Though RAM-APL (CLIP+DINOv2) demonstrates overall superior performance, its effectiveness at 70% sampling can be improved. In future work, we aim to enhance our method’s effectiveness at high sampling rates to further improve its practical utility.

Table 2. Comparison of the performance of our method using different numbers of foundation models as information extractors.

(The check marks indicating which FM combination each row uses did not survive text extraction; the row order follows the original table.)

| DINOv2 | CLIP | SigLIP | EVA-CLIP | 1% | 10% | 30% | 50% | 70% |
|---|---|---|---|---|---|---|---|---|
| | | | | 5.9±0.3 | 15.4±1.1 | 31.6±2.3 | 47.7±1.1 | 57.9±4.1 |
| | | | | 5.7±0.4 | 15.0±0.2 | 27.9±1.2 | 43.6±1.9 | 57.0±0.4 |
| | | | | 6.6±0.3 | 14.1±1.0 | 28.8±1.1 | 43.9±1.7 | 55.1±2.6 |
| | | | | 5.4±0.3 | 15.0±0.6 | 30.2±2.5 | 44.4±2.3 | 56.6±1.8 |
| | | | | 6.5±0.4 | 15.2±1.2 | 32.4±2.9 | 47.5±1.9 | 58.7±2.2 |
| | | | | 5.9±0.3 | 16.2±0.1 | 31.4±3.2 | 45.0±1.3 | 58.6±1.2 |
| | | | | 6.0±0.6 | 16.0±0.9 | 35.8±2.9 | 46.5±1.8 | 54.9±3.5 |
| | | | | 6.4±0.2 | 15.1±0.4 | 29.8±1.6 | 45.9±1.3 | 56.2±2.7 |
| | | | | 5.9±0.3 | 15.5±0.7 | 31.4±1.7 | 44.2±2.2 | 55.9±1.8 |
| | | | | 6.7±0.4 | 16.2±0.6 | 34.7±0.3 | 45.7±0.8 | 56.6±2.4 |
| | | | | 6.2±0.8 | 15.6±0.5 | 33.2±1.4 | 48.3±1.1 | 57.6±0.1 |
| | | | | 6.0±0.4 | 17.5±1.0 | 35.2±1.8 | 47.9±1.5 | 55.6±2.1 |
| | | | | 6.1±0.3 | 16.8±0.6 | 34.4±2.1 | 47.0±2.0 | 55.1±1.6 |
| | | | | 6.1±0.2 | 16.1±0.3 | 33.9±1.4 | 46.8±1.5 | 55.1±0.5 |
| | | | | 6.5±0.2 | 16.8±1.1 | 34.0±2.7 | 46.3±0.5 | 56.9±1.1 |

We observe that leveraging multiple foundation models outperforms using a single model. The optimal balance of computational efficiency, memory usage, and performance is achieved with DINOv2 + CLIP. For the highest overall accuracy, DINOv2+CLIP+EVA-CLIP is recommended. These findings validate the benefits of multi-model selection, and the results will be included in the supplementary material.

Table 3. Comparison of feature fusion strategies.

| Fusion Method | 1% | 10% | 30% | 50% | 70% |
|---|---|---|---|---|---|
| Concatenate | 5.9±0.4 | 16.3±0.4 | 31.7±1.3 | 47.7±3.0 | 57.8±1.2 |
| Ours | 6.5±0.4 | 15.2±1.2 | 32.4±2.9 | 47.5±1.9 | 58.7±2.2 |

We observe that our strategy outperforms concatenation, especially at higher sampling rates, which are crucial for practical applications. To maximize the performance of the multi-model method at high sampling rates, we adopt our fusion strategy. These findings and the new results will be included in the supplementary material.

Review (Rating: 4)

The paper makes comparisons between traditional information extractors (IEs) and a single foundation model (FM) on a series of datasets to explore scenarios in which a single FM would be advantageous as an IE. It reveals that a single FM performs poorly on coarse-grained image datasets with noisy labels and performs well on fine-grained image datasets with clean and noisy labels. The paper introduces a one-shot subset selection approach (called RAM-APL) tailored for fine-grained datasets, which ingeniously maps the misaligned features extracted by an ensemble of FMs into a unified distance ranking space, considering both intra-class distribution and inter-class distribution of samples. Experiments demonstrate the SOTA selection performance of RAM-APL on three fine-grained image datasets.
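The idea of aligning features from embedding spaces with different (and incomparable) dimensions via distance rankings can be sketched as follows. This is a hedged NumPy illustration: function names and details are assumptions, not the paper's exact RAM formulation.

```python
import numpy as np

def ranking_mean(features_per_fm, labels):
    """Fuse intra-class distance rankings across several foundation models.

    `features_per_fm` is a list of (n, d_k) arrays, one per FM; the feature
    dimensions d_k may differ across FMs, which is why per-sample scores are
    aligned in rank space rather than feature space.
    """
    labels = np.asarray(labels)
    n = len(labels)
    all_ranks = np.zeros((len(features_per_fm), n))
    for k, feats in enumerate(features_per_fm):
        for c in np.unique(labels):
            idx = np.where(labels == c)[0]
            center = feats[idx].mean(axis=0)                    # class prototype
            dist = np.linalg.norm(feats[idx] - center, axis=1)  # distance to prototype
            order = np.argsort(dist)                            # closest sample gets rank 0
            ranks = np.empty(len(idx))
            ranks[order] = np.arange(len(idx))
            all_ranks[k, idx] = ranks
    return all_ranks.mean(axis=0)  # mean rank per sample across FMs
```

Because each FM contributes only an ordering, its feature scale and dimensionality drop out, and the mean rank gives a single comparable score per sample.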

Questions for Authors

Please see Weaknesses 1-2 in the Other Strengths And Weaknesses section. If the authors address them, I would be willing to raise my rating.

Claims and Evidence

The claims regarding the foundation model insights and the effectiveness of RAM-APL in the paper are supported by strong empirical evidence. However, additional analysis on fine-grained datasets with noisy labels could further strengthen the claims.

Methods and Evaluation Criteria

The proposed approach is reasonable, but the evaluation is limited to fine-grained datasets. A broader range of datasets would help confirm RAM-APL’s robustness.

Theoretical Claims

The paper does not present formal theoretical proof, as it is largely focused on empirical evaluation. The conceptual framework of the RAM-APL method is well-explained, and the reliance on empirical analysis is justified. No significant issues were found in the presentation of the algorithmic ideas.

Experimental Design and Analysis

To explore scenarios in which a single FM would be advantageous as an IE, the paper employs systematic and rigorous experiments to analyze the impact of various factors (such as coarse-grained and fine-grained, labels that are clean or noisy, and balanced or unbalanced class distributions) and provides in-depth discussions of the results. Besides, the experimental design is robust in validating the effectiveness of RAM-APL, and the chosen datasets (Oxford-IIIT Pet, Food-101, and CUB-200-2011) are well suited for evaluating subset selection methods in the context of fine-grained image classification. The paper compares RAM-APL against multiple baseline methods, showing clear improvements.

However, it would be beneficial to include additional experiments to explicitly examine the performance of subset selection methods on fine-grained datasets with noisy labels. The paper highlights the strengths of FMs as IEs on both fine-grained datasets with clean and noisy labels, so more explicit comparisons between RAM-APL and other methods on fine-grained datasets with noisy labels would further strengthen the effectiveness of RAM-APL.

Supplementary Material

The supplementary material is detailed, providing valuable insights into the methodology and additional experimental results. However, the paper would benefit from including code or links to a code repository to facilitate reproducibility.

Relation to Prior Literature

The paper effectively situates itself within the subset selection literature, particularly focusing on feature-based subset selection. The use of FMs for subset selection is both relevant and timely, especially as research increasingly relies on large pre-trained models.

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths

  • The paper is well-written and clear and presents a novel contribution to subset selection in fine-grained image classification.
  • The paper systematically compares FMs and traditional IEs through rigorous experiments, offering practical insights into their relative performance across diverse image datasets.
  • The paper is highly original in combining multiple FMs for subset selection, a novel approach that significantly improves performance in fine-grained datasets.
  • The paper provides comprehensive empirical results, strengthening its practical impact.

Weaknesses

  1. The paper provides convincing results on three classical image fine-grained datasets. However, the paper does not compare the performance of RAM-APL with other methods on fine-grained image datasets with noisy labels. Experiments on fine-grained image datasets with noisy labels are important to further demonstrate the effectiveness of RAM-APL.
  2. The paper lacks a deeper discussion of why FM as IE performs well on fine-grained datasets with noisy labels and poorly on coarse-grained datasets with noisy labels. For example, which specific types of images or classes does FM as IE perform better on, and which classes does it not perform well on? Understanding these nuances would help in understanding the advantages of FM as an IE and would help in adapting RAM-APL to other domains.

Other Comments or Suggestions

  • Consider including a code release or a link to a public repository for reproducibility purposes.

Ethics Review Concerns

N/A

Author Response

Thank you for your positive feedback! We address your questions in the following responses.


W1: Evaluation on fine-grained image datasets with noisy labels.

A1: We sincerely appreciate the reviewer’s insightful suggestion. We acknowledge the importance of evaluating the effectiveness of our approach with other selection methods on fine-grained image datasets with noisy labels.

To address this concern, we conducted additional experiments (as detailed in the tables below) on the Oxford-IIIT Pets dataset with 20% and with 40% symmetric label noise. Subsets are sampled following the same experimental setup described in the manuscript.
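Symmetric label noise of this kind is commonly constructed by flipping each label, with a fixed probability, to a uniformly chosen *different* class. The sketch below follows that common construction; the paper's exact noise protocol is not quoted here, so treat names and details as assumptions.

```python
import numpy as np

def add_symmetric_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip each label to a uniformly random different class with prob `noise_rate`.

    A common construction of symmetric label noise (illustrative sketch).
    """
    rng = np.random.default_rng(seed)
    noisy = np.asarray(labels).copy()
    flip = rng.random(len(noisy)) < noise_rate  # which samples get corrupted
    for i in np.where(flip)[0]:
        # choose uniformly among the num_classes - 1 *other* classes
        choices = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(choices)
    return noisy
```

With `noise_rate=0.2` on a 37-class dataset like Oxford-IIIT Pets, roughly 20% of labels end up wrong, spread evenly over the remaining classes.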

Dataset: Oxford-IIIT Pets dataset with 20% symmetric label noise

| Method | IE | 1% | 10% | 30% | 50% | 70% | 100% |
|---|---|---|---|---|---|---|---|
| Random | - | 4.9±0.7 | 10.0±1.0 | 16.7±0.6 | 25.3±0.4 | 33.4±2.6 | 42.7±1.8 |
| Herding | Model-TD | 5.0±0.1 | 8.1±1.8 | 15.3±1.8 | 20.7±0.5 | 33.1±0.4 | 42.7±1.8 |
| KCG | Model-TD | 5.3±1.4 | 7.8±0.8 | 15.4±1.1 | 22.3±1.7 | 32.2±1.5 | 42.7±1.8 |
| CD | Model-TD | 5.2±0.4 | 6.6±0.8 | 13.7±0.2 | 22.4±1.6 | 32.1±1.4 | 42.7±1.8 |
| Margin | Model-TD | 4.7±0.1 | 9.0±0.7 | 16.3±0.5 | 23.9±0.6 | 33.5±1.2 | 42.7±1.8 |
| Forgetting | Model-TD | 5.9±0.6 | 11.5±0.9 | 18.7±1.3 | 29.5±0.9 | 36.9±0.4 | 42.7±1.8 |
| GraNd | Model-TD | 4.3±0.2 | 7.8±0.8 | 15.6±0.7 | 22.9±1.5 | 32.4±2.1 | 42.7±1.8 |
| Cal | Model-TD | 6.2±0.8 | 12.2±0.6 | 22.2±2.6 | 29.4±1.3 | 38.7±0.8 | 42.7±1.8 |
| Glister | Model-TD | 4.8±0.2 | 10.5±1.1 | 17.2±1.0 | 26.1±2.6 | 34.7±2.2 | 42.7±1.8 |
| GC | Model-TD | 5.2±0.6 | 12.8±1.5 | 20.3±1.3 | 27.0±0.6 | 32.9±0.9 | 42.7±1.8 |
| MDS | Model-TD | 3.8±0.4 | 9.8±0.3 | 17.1±0.6 | 24.3±1.7 | 30.7±3.1 | 42.7±1.8 |
| MIN | Model-TD | 5.6±0.2 | 11.6±0.4 | 19.8±1.4 | 28.0±2.2 | 35.3±2.5 | 42.7±1.8 |
| Ours | CLIP+DINOv2 | 6.7±0.3 | 16.7±0.3 | 32.5±1.8 | 46.0±1.6 | 56.7±0.7 | 42.7±1.8 |

Dataset: Oxford-IIIT Pets dataset with 40% symmetric label noise

| Method | IE | 1% | 10% | 30% | 50% | 70% | 100% |
|---|---|---|---|---|---|---|---|
| Random | - | 5.1±0.5 | 8.0±0.6 | 12.6±0.6 | 15.0±0.3 | 19.1±0.5 | 23.0±0.6 |
| Herding | Model-TD | 4.4±0.2 | 6.3±0.5 | 11.1±0.9 | 13.1±0.6 | 18.2±1.3 | 23.0±0.6 |
| KCG | Model-TD | 4.9±0.8 | 6.3±0.5 | 9.9±1.2 | 14.3±1.1 | 18.1±0.9 | 23.0±0.6 |
| CD | Model-TD | 4.8±0.8 | 6.3±0.5 | 10.3±0.8 | 14.0±0.3 | 17.7±1.2 | 23.0±0.6 |
| Margin | Model-TD | 4.1±0.3 | 7.0±0.8 | 11.1±0.9 | 14.3±0.9 | 19.0±0.8 | 23.0±0.6 |
| Forgetting | Model-TD | 5.4±0.8 | 10.2±1.6 | 12.9±0.8 | 17.2±0.4 | 21.4±0.9 | 23.0±0.6 |
| GraNd | Model-TD | 4.4±0.9 | 6.7±1.0 | 10.2±0.5 | 14.5±1.6 | 18.8±1.2 | 23.0±0.6 |
| Cal | Model-TD | 5.4±0.3 | 10.6±1.1 | 14.9±1.1 | 18.9±1.0 | 22.2±1.2 | 23.0±0.6 |
| Glister | Model-TD | 5.2±0.3 | 7.6±1.1 | 12.4±0.8 | 18.3±0.8 | 21.8±1.6 | 23.0±0.6 |
| GC | Model-TD | 4.9±0.7 | 9.7±1.1 | 12.8±0.7 | 15.4±0.8 | 20.5±1.7 | 23.0±0.6 |
| MDS | Model-TD | 3.9±0.2 | 7.2±0.3 | 12.0±0.2 | 15.0±1.5 | 18.5±0.8 | 23.0±0.6 |
| MIN | Model-TD | 5.3±0.4 | 9.4±0.7 | 14.3±0.7 | 18.3±0.6 | 20.9±0.6 | 23.0±0.6 |
| Ours | CLIP+DINOv2 | 6.1±0.3 | 15.0±1.2 | 30.4±1.7 | 44.8±0.1 | 42.6±0.8 | 23.0±0.6 |

("IE" means information extractor, "Model-TD" denotes the model trained on the full set for 10 epochs.)

We observe that RAM-APL consistently outperforms all baselines across different sampling rates on each noisy fine-grained dataset, demonstrating its effectiveness.

Your suggestion has been highly valuable. Through experimental analysis, we have identified the significant advantages of designing selection algorithms based on foundation models for noisy datasets, which motivates us to explore more effective foundation model-based denoising approaches in future work. The above experimental results and discussions will be included in the supplementary material.


W2: A deeper discussion of why FM as IE performs poorly on coarse-grained datasets with noisy labels.

A2: We sincerely appreciate the reviewer's insightful comments. Due to the character limit in the rebuttal, we kindly refer the reviewer to our response to Reviewer aGx6's "W1&Q1: A deeper discussion of why FM as IE performs poorly on coarse-grained datasets with noisy labels."


S1: Code release.

A3: We thank the reviewer for this suggestion. We will release the full implementation code in a public repository upon paper acceptance to ensure reproducibility.

Reviewer Comment

Thanks to the authors for their positive response and detailed rebuttal. The authors have addressed my concerns.

Review (Rating: 4)

To investigate whether foundation models (FMs) can truly replace task-specific information extractors (IEs) in subset selection, this paper examines the effectiveness of FMs as IEs for one-shot subset selection. Through extensive experiments across a set of image datasets, this paper identifies the strengths and limitations of FMs as IEs: they excel on fine-grained image datasets but underperform on coarse-grained datasets with noisy labels. To capitalize on the complementary strengths of multiple FMs and overcome limitations in existing feature-based selection methods, this paper introduces RAM-APL, which maps misaligned features from multiple FMs into a unified distance ranking space, considering intra-class and inter-class distributions. The selection methods are evaluated on three fine-grained classification datasets.

Questions for Authors

  1. Would the subsets selected by RAM-APL retain their effectiveness when applied to architectures beyond ResNet?
  2. Why do FMs struggle with coarse-grained datasets containing noisy labels but perform well on fine-grained image datasets?
  3. How do RAM and APL strategies influence the distribution of representations in selected subsets?

Claims and Evidence

Yes. The claims in this paper are generally supported by clear experimental results, particularly in demonstrating that RAM-APL improves subset selection on fine-grained image datasets.

Methods and Evaluation Criteria

Yes. The methodology is well-motivated for subset selection, and the benchmark datasets are appropriate.

Theoretical Claims

Yes. Since this paper is largely data-driven, there are no formal proofs for the theoretical claims, but the empirical justification is sound.

Experimental Design and Analysis

Yes. The experimental design is rigorous and well-structured:

  1. The evaluation considers three fine-grained image datasets (CUB-200-2011, Oxford-IIIT Pets, Food-101), making the conclusions well-supported in the targeted domain.
  2. A range of baselines is compared, including random selection, single-FM approaches (e.g., DINOv2, CLIP), and other subset selection methods.
  3. The ablation study analyzes hyperparameters (α, β) and the effect of different FM combinations, showing that DINOv2 + CLIP provides the best results.

However, a few concerns:

  1. This paper does not assess whether the selected subsets generalize across different model architectures. A key question is whether subsets selected by RAM-APL would maintain their effectiveness when applied to architectures beyond ResNet.
  2. This paper claims FMs are ineffective on coarse-grained datasets with noisy labels but does not analyze why in depth. A more detailed study (e.g., feature visualizations or error analysis) would help substantiate this finding.

Supplementary Material

Yes. The supplementary material is well-structured, providing additional insights into the methodology and extended experimental results.

Relation to Prior Literature

This paper relates to subset selection approaches but differs by leveraging multiple FMs to form a unified ranking space.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths:

  1. This paper structure is reasonable, with each component of the proposed method clearly explained, making it easy to understand and implement.
  2. This paper introduces a well-motivated and innovative approach to subset selection. RAM-APL effectively harnesses the complementary advantages of multiple foundation models, addressing the variability in FM performance across datasets and selection methods. The empirical evaluations are thorough and provide strong evidence supporting the effectiveness of the proposed method.

Weaknesses:

  1. This paper does not assess whether the selected subsets generalize across different model architectures. Would the subsets selected by RAM-APL retain their effectiveness when applied to architectures beyond ResNet?
  2. While this paper argues that FMs perform poorly on coarse-grained datasets with noisy labels, it lacks an in-depth analysis of the underlying reasons. Incorporating feature visualizations or error analysis could provide stronger empirical justification for this claim.
  3. The analysis is somewhat limited, primarily focusing on accuracy. Additional insights, such as diversity or difficulty analysis of the selected subsets, would enhance the evaluation.

Other Comments or Suggestions

See the Weakness part.

Author Response

Thank you for your positive feedback! We address your questions in the following responses.


W1&Q1: Cross-architecture generalization of RAM-APL.

A1: We sincerely appreciate the reviewer’s insightful question regarding the cross-architecture generalization of RAM-APL. We acknowledge the importance of evaluating whether our selected subsets remain effective across different model architectures beyond ResNet.

To address this concern, we conducted additional experiments on the Oxford-IIIT Pets dataset (Pets) using MobileNet-V3 as the target model. The results, presented in the table below, compare RAM-APL against five strong baselines that maintain identical architectures for their information extractors (IE) and target models.

MobileNet-V3 (MBV3)

| Method | IE → Target Model | 10% | 30% | 50% |
|---|---|---|---|---|
| Random | MBV3 → MBV3 | 10.9±1.1 | 42.1±3.6 | 61.6±1.9 |
| Forgetting | MBV3 → MBV3 | 13.3±0.8 | 42.0±2.0 | 61.0±2.3 |
| GC | MBV3 → MBV3 | 12.4±1.7 | 40.4±0.2 | 61.3±1.5 |
| MDS | MBV3 → MBV3 | 11.9±0.7 | 39.8±2.0 | 62.1±3.3 |
| MIN | MBV3 → MBV3 | 11.9±1.8 | 38.4±0.8 | 61.0±1.4 |
| RAM-APL (Ours) | (CLIP+DINOv2) → MBV3 | 13.6±0.3 | 45.7±0.9 | 62.3±1.4 |

We observe that RAM-APL consistently outperforms all baselines across different sampling rates, indicating its strong cross-architecture generalization ability.

Your suggestion has been highly valuable, inspiring us to further explore multi-model subset selection in broader cross-architecture settings in future work. The above experimental results and discussion will be included in the supplementary material.


W2&Q2: A deeper discussion of why FMs struggle with coarse-grained datasets containing noisy labels but perform well on fine-grained image datasets.

A2: We sincerely appreciate the reviewer's insightful comments. Due to the character limit in the rebuttal, we kindly refer the reviewer to our response to Reviewer aGx6's "W1&Q1: A deeper discussion of why FM as IE performs poorly on coarse-grained datasets with noisy labels."


W3&Q3: How do RAM and APL influence the distribution of representations in selected subsets?

A3: We sincerely appreciate the reviewer’s insightful question regarding the influence of RAM and APL strategies on the distribution of representations in the selected subsets. We acknowledge the importance of analyzing how these strategies shape the feature space and their impact on sample diversity and representativeness.

To address this concern, we conducted additional experiments and analyzed the feature distributions of different selection strategies. Specifically, we examined the average cosine distance between data pairs within the selected subsets, which provides insights into intra-class and overall diversity. The results are summarized in the table below:

Table. Average cosine distance of data pairs in the subset

| Method | IE | Class 0 | Class 1 | Class 2 | Class 3 | Class 4 | Whole subset |
|---|---|---|---|---|---|---|---|
| Min | CLIP | 0.1617 | 0.1795 | 0.1176 | 0.1509 | 0.1327 | 0.2680 |
| RAM | CLIP+DINOv2 | 0.1695 | 0.1919 | 0.1259 | 0.1611 | 0.1392 | 0.2767 |
| RAM-APL | CLIP+DINOv2 | 0.1659 | 0.1986 | 0.1317 | 0.1597 | 0.1399 | 0.2787 |

From these results, we observe that RAM and RAM-APL lead to a more diverse feature distribution in the selected subset compared to Min-based selection. The whole-subset average cosine distance is highest under RAM-APL (0.2787), indicating that it selects more diverse samples overall, improving coverage of the feature space. Moreover, the per-class distances suggest that RAM-APL encourages a balance between inter-class and intra-class diversity, with slightly higher values in harder-to-distinguish classes.
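The average pairwise cosine-distance measure used above can be sketched as follows, assuming an (n, d) array of FM-extracted embeddings for the selected subset (names are illustrative, not the authors' code):

```python
import numpy as np

def avg_pairwise_cosine_distance(features: np.ndarray) -> float:
    """Mean cosine distance (1 - cosine similarity) over all unordered pairs."""
    # L2-normalize so that dot products become cosine similarities.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = normed @ normed.T                 # (n, n) cosine similarity matrix
    iu = np.triu_indices(len(features), k=1) # unordered pairs i < j
    return float(np.mean(1.0 - sims[iu]))    # higher value = more diverse subset
```

A higher value indicates that the selected samples point in more varied directions of the feature space, i.e., a more diverse subset.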

Furthermore, the t-SNE visualizations in Figures 9-11 (https://anonymous.4open.science/r/RAM-APL-DED5/README.md) further confirm these findings. Compared to Min-based selection, which tends to concentrate samples within certain regions of the feature space, RAM and RAM-APL distribute samples more broadly across the space, ensuring better representational coverage. This suggests that our approach enhances model performance by capturing a more comprehensive representation of the dataset.

Your suggestion has been highly valuable in strengthening our analysis. The above results and discussions will be included in the supplementary material to provide a clearer understanding of the impact of our proposed selection strategies.

Reviewer Comment

After reading the response, the authors have addressed my concerns. Thus, I support accepting this paper.

Review (Rating: 4)

This paper investigates one-shot subset selection using Foundation Models (FMs) to reduce deep learning training costs by improving efficiency. Traditional Information Extractors (IEs) rely on models pre-trained on the target dataset, introducing dataset dependency. The paper addresses two key questions: (1) Can FM-based subset selection outperform traditional IE-based methods across diverse datasets? (2) Do all FMs perform equally well for subset selection? Experimental results show that FMs excel on fine-grained datasets but underperform on coarse-grained datasets with noisy labels. Based on these findings, the authors propose RAM-APL (RAnking Mean-Accuracy of Pseudo-class Labels), a novel method that leverages multiple FMs to enhance subset selection performance on fine-grained datasets. Extensive experiments validate the superiority of RAM-APL on three fine-grained datasets.

Questions for Authors

  1. Why do FMs underperform on coarse-grained datasets with noisy labels? Is this related to feature distribution or noise levels in the datasets?
  2. Can the authors provide a theoretical analysis of the RAM-APL method, explaining why it effectively leverages the complementary strengths of multiple FMs?
  3. Have the authors considered applying the RAM-APL method to other tasks, such as few-shot learning or various noisy datasets?

Claims and Evidence

The main claims of the paper are supported by experimental data. For instance, the superiority of FMs as IEs on fine-grained datasets and the effectiveness of the RAM-APL method are validated through experiments. However, some conclusions (e.g., the limitations of FMs on coarse-grained datasets) lack deeper explanations.

Methods and Evaluation Criteria

The proposed RAM-APL method significantly improves subset selection performance on fine-grained datasets by leveraging the feature extraction capabilities of multiple FMs. The method is well-designed, and the evaluation criteria (e.g., prediction accuracy) are appropriate for the subset selection task. The experimental results demonstrate that RAM-APL outperforms SOTA methods on multiple datasets, validating its effectiveness.

Theoretical Claims

The paper does not provide rigorous theoretical proofs but validates the effectiveness of FMs for subset selection through experiments. The experimental design is sound, and the results support the advantages of FMs on fine-grained datasets. However, the paper lacks a deeper theoretical analysis of the RAM-APL method, such as why it effectively leverages the complementary strengths of multiple FMs. The authors are encouraged to supplement the paper with relevant theoretical analysis to enhance the credibility of the method.

Experimental Design and Analysis

The experimental design of Single-model Study is comprehensive, covering multiple datasets (e.g., CIFAR-10, CIFAR-10N, Oxford-IIIT Pet) and different FMs (e.g., DINOv2, CLIP). The results demonstrate that FMs perform well on fine-grained datasets but struggle on coarse-grained datasets with noisy labels. The RAM-APL method significantly improves performance on fine-grained datasets by combining the feature extraction capabilities of multiple FMs.

A limitation of the experimental design is the lack of in-depth analysis of why FMs underperform on coarse-grained datasets. For example, is this due to the feature distribution or noise levels in the datasets?

Supplementary Material

The supplementary material provides detailed methodological explanations and additional experimental results, enhancing the credibility of the paper.

Relation to Prior Literature

The paper clearly situates itself within the existing literature. Traditional subset selection methods rely on IEs pre-trained on the target dataset, which introduces dataset dependency. By introducing FMs, the paper proposes a dataset-agnostic subset selection method, expanding the scope of subset selection research. However, the paper does not sufficiently discuss the relationship with existing FM-related work, such as FM applications to few-shot learning or noisy datasets.

Missing Essential References

The paper cites a wide range of related literature but omits some key works. For example, FM applications on noisy datasets (e.g., “CLIPCleaner: Cleaning Noisy Labels with CLIP” by Chen Feng et al., 2024) are highly relevant. The authors are encouraged to include relevant references and discuss the implications.

Other Strengths and Weaknesses

Strengths:

  1. The findings of the effectiveness of FMs for subset selection are interesting.
  2. The proposed RAM-APL method significantly improves subset selection performance on fine-grained datasets.
  3. The experimental design is comprehensive, covering multiple datasets and FM combinations.
  4. The supplementary material provides detailed experimental explanations and results, enhancing the paper's credibility.

Weaknesses:

  1. The paper lacks a theoretical explanation for the underperformance of FMs on coarse-grained datasets with noisy labels.
  2. The paper lacks a theoretical analysis of the RAM-APL method, explaining why it effectively leverages the complementary strengths of multiple FMs.
  3. The discussion of related FM literature is insufficient.

Other Comments or Suggestions

NA

Author Response

Thank you very much for your positive feedback! We greatly appreciate your insightful questions, which have deepened our analysis of the findings and inspired further exploration for future work. Below, we address your questions in sequence.


W1&Q1: A deeper discussion of why FM as IE performs poorly on coarse-grained datasets with noisy labels.

A1: We sincerely appreciate the reviewer's insightful comments regarding the theoretical understanding of foundation models (FMs) on noisy datasets. We have conducted extensive additional analyses to explain this phenomenon, with key findings visualized in Figures 1-8 (https://anonymous.4open.science/r/RAM-APL-DED5/README.md).

Our empirical investigation reveals:

In coarse-grained datasets (CIFAR-10N-worse, Figures 1-4):

  • FM-extracted features exhibit:
    • Weak inter-class separation for visually similar categories (e.g., dog/cat in CIFAR-10N-worse);
    • Substantial overlap between clean and noisy samples' feature distributions.

This explains FMs' limited effectiveness as information extractors for coarse-grained noisy data.

By contrast, in fine-grained datasets (Oxford-IIIT Pet with 40% symmetric label noise, Figures 5-8):

  • FM-extracted features exhibit:
    • Compact clustering of correctly-labeled samples;
    • Strong inter-class separation for visually similar categories;
    • Smaller overlap between clean and noisy samples in feature space.
  • Features from models trained on full noisy set show:
    • Loose clustering of correctly-labeled samples;
    • Significant overlap between clean and noisy samples in feature space.

This leads to the selection of more noisy samples (visible as dark red points) and substantially weaker performance of traditional information extractors (IEs) compared to FMs.

Key Inference:

The comparative analysis reveals that FMs serve as superior information extractors when their features demonstrate:

  • Tighter clustering of correctly-labeled samples;
  • Reduced overlap between clean and noisy distributions.

We will incorporate these analyses in the supplementary material to strengthen our empirical analysis and contributions.
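The two diagnostics above (tightness of clean-sample clusters and clean/noisy distribution overlap) can be made concrete with a minimal NumPy sketch. The feature arrays, the centroid-distance compactness measure, and the k-nearest-neighbour overlap measure are illustrative assumptions, not the paper's exact analysis pipeline:

```python
import numpy as np

def intra_class_compactness(feats, labels):
    """Mean distance from each sample to its class centroid (lower = tighter clusters)."""
    dists = []
    for c in np.unique(labels):
        cls = feats[labels == c]
        centroid = cls.mean(axis=0)
        dists.append(np.linalg.norm(cls - centroid, axis=1).mean())
    return float(np.mean(dists))

def clean_noisy_overlap(feats, is_noisy, k=5):
    """Fraction of samples whose k nearest neighbours mix clean and noisy points
    (higher = more overlap between the two feature distributions)."""
    n = len(feats)
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude self from the neighbour set
    mixed = 0
    for i in range(n):
        flags = is_noisy[np.argsort(d[i])[:k]]
        if flags.any() and not flags.all():
            mixed += 1
    return mixed / n
```

Under this sketch, the rebuttal's inference corresponds to: FM features on fine-grained data yield low compactness values for clean samples and low overlap, while traditional IE features trained on the noisy set yield the opposite.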


W2&Q2: Theoretical Analysis of Multi-FM Complementarity in RAM-APL.

A2: We thank the reviewer for this important question. RAM-APL's effectiveness in leveraging multiple FMs stems from two fundamental principles:

1. Feature Space Orthogonality:

Our analysis reveals that different FMs learn nearly orthogonal feature representations (Figure 6 in the Suppl.), with:

$$\text{cossim}\langle M_i(x), M_j(x)\rangle \approx 0 \quad \forall\, i \neq j$$

This orthogonality demonstrates that each FM (e.g., $M_i$, $M_j$) captures distinct, complementary aspects of the data $x$.
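The near-orthogonality check can be sketched as the mean cosine similarity between two models' features for the same samples. This assumes both feature sets share a dimension (e.g., after a common projection); the function name and setup are illustrative:

```python
import numpy as np

def mean_cross_model_cossim(feat_a, feat_b):
    """Mean cosine similarity between row-aligned feature matrices from two FMs.
    Values near 0 indicate the models encode largely complementary directions."""
    a = feat_a / np.linalg.norm(feat_a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))
```

For high-dimensional features, even random unrelated representations give values near 0, so in practice this diagnostic is most informative when contrasted against the within-model similarity baseline.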

2. Bias Reduction via Ensemble Consensus:

The ensemble mechanism could mitigate individual FM biases and preserve robust cross-model agreements. Table 2 in the manuscript demonstrates RAM-APL's performance gains when combining CLIP and DINOv2 versus individual FMs, confirming the benefits of multi-FM integration.
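The consensus idea can be illustrated with a generic rank-averaging sketch: convert each FM's per-sample importance scores to ranks, then average the ranks so that no single model's score scale or bias dominates. This is a simplified stand-in for the paper's Ranking Mean (RAM) component, not the exact RAM-APL procedure:

```python
import numpy as np

def ranking_mean(scores_per_model):
    """Fuse per-sample scores from several FMs by averaging their ranks.
    Rank 0 = highest score within each model; lower fused rank = stronger consensus."""
    ranks = []
    for s in scores_per_model:
        # argsort of argsort yields each element's rank under descending score
        ranks.append(np.argsort(np.argsort(-np.asarray(s, dtype=float))))
    return np.mean(ranks, axis=0)
```

Averaging ranks rather than raw scores makes the fusion invariant to each model's score calibration, which is one way an ensemble can suppress individual-model biases while preserving cross-model agreement.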


W3: Expanded Discussion of FM Literature.

A3: We sincerely appreciate the reviewer for this constructive suggestion. We will significantly strengthen our discussion of FM literature by incorporating CLIPCleaner (Chen et al., ACM MM 2024). The key insights are:

Both our work and CLIPCleaner leverage CLIP's zero-shot capabilities. In contrast, CLIPCleaner is a single-FM method that focuses on noisy-label cleaning via prediction probabilities, whereas our RAM-APL is a multi-FM approach specializing in subset selection for clean and noisy fine-grained data using visual deep features.

We'll discuss CLIPCleaner in Section 2 of the revised manuscript.


Q3: Extensions to Few-shot Learning and Noisy Data.

A4: We sincerely appreciate the reviewer's insightful question regarding the broader applicability of RAM-APL. By leveraging the strong feature discriminability of foundation models and mitigating biases through ensemble consensus, RAM-APL shows a strong ability to identify noisy samples. While our current work focuses on standard subset selection for fine-grained datasets, its theoretical framework and algorithmic design naturally extend to:

  • Noisy Few-shot Learning: Enhances robustness in few-shot scenarios by effectively identifying label-feature mismatches in small support sets.

  • Noisy Label Scenarios: Particularly effective for fine-grained noisy data (as demonstrated in our experiments). Moving forward, we plan to develop more effective denoising strategies tailored to such datasets within the RAM-APL framework.

We will include this extended analysis in the supplemental materials to better position RAM-APL's broader applicability.

Final Decision

In this paper, the authors explored using foundation models for the one-shot subset selection problem. Based on their findings, they further propose a method that leverages multiple foundation models to enhance subset selection by exploiting their complementary strengths. Extensive experimental results on fine-grained datasets, including Oxford-IIIT Pet, Food-101, and Caltech-UCSD Birds-200-2011, are provided to show the effectiveness of the proposed method.

This paper received consistent positive ratings after the rebuttal period. All four expert reviewers decided to accept this paper. Three of them confirmed that their concerns have been addressed during the author-reviewer discussions. Therefore, it is a clear acceptance.