PaperHub
6.6/10
Poster · 4 reviewers
Lowest: 3 · Highest: 4 · Standard deviation: 0.5
Ratings: 4, 3, 4, 3
ICML 2025

Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

Forgetting-free fine-tuning of vision foundation models via Proxy-based Feature Distribution Alignment (Proxy-FDA)

Abstract

Keywords
Proxy-FDA, robust fine-tuning, concept forgetting, vision foundation model

Reviews and Discussion

Review
Rating: 4

The paper introduces Proxy-FDA, a novel feature-space regularization method designed to prevent concept forgetting during the fine-tuning of vision foundation models. The key idea is to align the local structures of pre-trained and fine-tuned feature distributions using nearest neighbor graphs, which is further enhanced by generating synthetic features (proxies) to increase data diversity. Experiments demonstrate that Proxy-FDA significantly reduces concept forgetting across various fine-tuning settings and tasks, including end-to-end fine-tuning, few-shot prompt tuning, continual fine-tuning, and applications beyond classification like image captioning and visual question answering. The method achieves state-of-the-art results in mitigating forgetting, showing a strong correlation between a structure-aware distributional distance metric (OTDD) and concept forgetting.

Questions for Authors

No.

Claims and Evidence

The claims made in the submission are well-supported by clear and convincing evidence. The authors provide extensive experimental results across multiple datasets and tasks, demonstrating the effectiveness of Proxy-FDA in reducing concept forgetting. They also conduct ablation studies to analyze the impact of different components of their method, such as the feature distribution alignment and proxy generation. The correlation analysis between OTDD and concept forgetting further strengthens the validity of their approach. However, the claim that Proxy-FDA consistently outperforms all other methods across all settings might be slightly overstated, as there could be specific scenarios where other methods perform comparably or better, though the provided evidence strongly supports its general effectiveness.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria make sense for the problem of mitigating concept forgetting during fine-tuning. The use of feature distribution alignment with nearest neighbor graphs captures the local structure of the feature space effectively, and the introduction of synthetic proxies addresses data scarcity issues in few-shot scenarios. The evaluation criteria, including the change in performance on unseen tasks (∆LP) and the distributional distance metric OTDD, are appropriate for assessing the extent of concept forgetting and the quality of feature alignment.

Theoretical Claims

I did not check the correctness of any proofs for theoretical claims in this paper, as the focus is primarily on empirical evaluation and methodological innovation rather than theoretical analysis.

Experimental Design and Analysis

The experimental designs and analyses appear sound and valid. The authors conduct experiments on a diverse range of datasets and tasks, ensuring the robustness of their findings. They compare against relevant baselines, including naive fine-tuning, LP-FT, L2SP, and LDIFS, providing a comprehensive evaluation of their method's performance. The ablation studies and sensitivity analyses further validate the effectiveness of individual components and design choices in Proxy-FDA.

Supplementary Material

I reviewed the supplementary material, including the detailed architecture of the proxy generator, the hard class mining strategy, and additional experimental results. These materials provide valuable insights into the implementation details and further support the claims made in the main paper.

Relation to Existing Literature

The key contributions of this paper are well situated within the broader scientific literature on robust fine-tuning of foundation models. The work builds upon and advances prior research in regularization techniques for fine-tuning, such as L2SP and LDIFS, by introducing a structure-aware feature distribution alignment method. It also relates to knowledge distillation and domain adaptation literature, where preserving and transferring knowledge across different data distributions is a central concern. The use of optimal transport for measuring distributional distances connects to a broader body of work on optimal transport in machine learning.

Essential References Not Discussed

There are no critical references missing from the discussion that would significantly impact the understanding of the paper's contributions. The authors adequately cite and discuss relevant prior work in the areas of fine-tuning, concept forgetting, and related regularization methods.

Other Strengths and Weaknesses

Strengths: The method is conceptually simple yet effective, leveraging feature space regularization with nearest neighbor graphs and proxy generation. Extensive experiments across diverse datasets and tasks demonstrate consistent improvements over existing methods. The analysis of the correlation between OTDD and concept forgetting provides valuable insights into the effectiveness of structure-aware regularization.

Weaknesses: The computational overhead introduced by the proxy generator could be a limitation in resource-constrained settings. The method's effectiveness might be sensitive to the choice of hyperparameters, such as the neighborhood size K and the scalar s for proxy generation, though the authors provide some analysis on their sensitivity.

Other Comments or Suggestions

The paper is well-written and clearly presents the methodology and experimental results.

Author Response

Thank you for the recognition of our work and constructive feedback. Below is our response to each concern, as well as new comparisons on compute cost.

Q1: Sensitivity to hyperparameters like the neighborhood size K and the scalar s for proxy generation

Given a held-out validation set for each downstream dataset (a reasonable assumption for model fine-tuning), we tune Proxy-FDA's hyperparameters on that validation set. Fig. 6 shows that results are largely insensitive to each key hyperparameter over a wide range. We acknowledge that some methods learn hyperparameters automatically instead of relying on manual tuning, which further reduces hyperparameter sensitivity. Examples include meta-learning methods and recent "validation data-free" methods (e.g., the CLAP paper cited in Table 7) that optimize hyperparameters online without requiring a validation set. Integrating such ideas into Proxy-FDA is a promising direction for future work.

Q2: Computational overhead by the proxy generator could be a limitation in resource-constrained settings

This concern is related to the hypothesized claims in the 1st paragraph of Section 3.2: despite incurring some computational overhead, proxy feature generation is still a comparatively compute- and data-efficient way to improve data diversity. Other viable strategies, such as retrieving external data, should have a higher cost and suffer from distribution shift. After paper submission, we implemented one type of Retrieval Augmented Fine-Tuning (RAFT) method using external data, and the comparison results support the above claims.

Concretely, to enrich both the positive feature set $\mathbf X_i^{+}$ and the negative set $\mathbf X_i^{-}$ (by an amount controlled by the scalar $s$), we retrieve the most similar samples from the large-scale LAION-400M dataset instead of synthesizing proxy features $\{\mathbf P_i^{+}, \mathbf P_i^{-}\}$ online. Both methods include the augmented features (retrieved or synthesized) in the FDA loss computation, and we call them RAFT-FDA and Proxy-FDA, respectively.

Note that for RAFT-FDA to retrieve external data effectively, the same model must extract features for both the sampled batch in $\mathcal D_{\text{ft}}$ and the external $\mathcal D_{\text{LAION}}$ in order to measure their feature similarity. It would be prohibitively expensive to use the fine-tuned model to repeatedly refresh all the feature representations of $\mathcal D_{\text{LAION}}$. Hence we use the pre-trained model to pre-compute similarities between the entire $\mathcal D_{\text{ft}}$ and $\mathcal D_{\text{LAION}}$, so that each sample in $\mathcal D_{\text{ft}}$ has external neighbor indices that remain fixed. We treat this as a "pre-processing step" for RAFT-FDA, and its reported training time excludes this step. In each fine-tuning iteration of RAFT-FDA, the only additional feature extraction with the fine-tuned model is for a limited number of external samples (i.e., the indexed neighbors from $\mathcal D_{\text{LAION}}$). Overall, the computational overhead of RAFT-FDA is mainly determined by FDA and by external feature extraction/augmentation.
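The pre-processing step described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `precompute_external_neighbors` and `external_indices_for_batch` are hypothetical, cosine similarity is assumed as the similarity measure, and the brute-force similarity matrix would presumably have to be replaced by an approximate nearest-neighbor index at LAION-400M scale.

```python
import numpy as np

def precompute_external_neighbors(ft_feats, ext_feats, top_k):
    """For each fine-tuning sample, return the indices of its top_k most
    cosine-similar external samples. Both feature sets are assumed to come
    from the frozen pre-trained model, so the table is computed only once."""
    ft = ft_feats / np.linalg.norm(ft_feats, axis=1, keepdims=True)
    ext = ext_feats / np.linalg.norm(ext_feats, axis=1, keepdims=True)
    sim = ft @ ext.T  # shape: (n_ft, n_ext)
    return np.argsort(-sim, axis=1)[:, :top_k]

def external_indices_for_batch(neighbor_table, batch_ids):
    """Unique external samples that the fine-tuned model has to re-encode
    for the given batch of fine-tuning sample ids."""
    return np.unique(neighbor_table[batch_ids])

# Toy data: 2 fine-tuning samples, 4 external samples, 2-D features.
ft_feats = np.array([[1.0, 0.0], [0.0, 1.0]])
ext_feats = np.array([[0.0, 2.0], [3.0, 0.0], [1.0, 1.0], [-1.0, 0.0]])

neighbor_table = precompute_external_neighbors(ft_feats, ext_feats, top_k=2)
batch_external = external_indices_for_batch(neighbor_table, [0, 1])
```

Because the neighbor table is fixed, each iteration only pays for encoding the few external samples listed in `batch_external`, which matches the cost argument made in the response.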

The table below compares RAFT-FDA against Proxy-FDA for few-shot prompt tuning based on CoOp.

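| | Proxy-FDA | RAFT-FDA | RAFT-FDA | RAFT-FDA | RAFT-FDA |
|---|---|---|---|---|---|
| Percent (%) of augmented features | $s$ | $s$ | $2s$ | $4s$ | $6s$ |
| Avg $\mathcal A_{\text{H}}$ across 11 datasets | 78.13 | 76.02 | 77.1 | 78.29 | 78.84 |
| Training time overhead | 21% | 58% | 83% | 127% | 171% |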

Observations: 1) Proxy generation is data-efficient in the low-data regime. Our Proxy-FDA outperforms RAFT-FDA when the latter retrieves the same or double the amount of external data. Generated proxies achieve higher data efficiency because they adapt to the fine-tuned feature distributions, whereas external data inevitably suffers from distribution shift and provides less effective feature regularization. However, our performance advantage disappears once the amount of external data reaches $4s$ or more. 2) On the other hand, external data augmentation is costly, and the cost increases drastically with the size of the retrieved data. This is due to the extra feature extraction required for each external sample using the large vision model. In contrast, our proxy generator is lightweight and can generate proxy features all at once (not individually).

More comments: 1) We may be able to improve RAFT-FDA with feature fusion strategies (e.g., Mixup) that address the distribution shift issue, but the training cost would remain high. 2) It is possible to speed up Proxy-FDA through a more efficient architecture design for our proxy generator, which we leave as future work.

Review
Rating: 3

The paper proposes Proxy-FDA, a regularization method for fine-tuning vision foundation models that mitigates concept forgetting by aligning local structural relationships in feature spaces. The core innovation lies in preserving neighborhood structures via nearest neighbor (kNN) graphs derived from pre-trained and fine-tuned features, augmented by dynamically generated proxies to enhance data diversity. By regularizing both neighbor indices and similarity scores, Proxy-FDA transfers rich semantic attributes (e.g., color, texture) encoded in foundation models while adapting to downstream tasks. Extensive experiments on classification, captioning, VQA, and continual learning demonstrate significant reductions in forgetting (quantified via LP and OTDD metrics) compared to point-wise regularization baselines (e.g., LDIFS, L2SP). The method excels in data-scarce settings (e.g., 16-shot tuning) and integrates seamlessly with prompt-tuning frameworks, achieving state-of-the-art performance without compromising downstream accuracy.

Update after rebuttal

Dear authors, I've reviewed your rebuttal. While your responses to some questions, such as proxy diversity and hard class mining, are thoughtful, there are gaps. You did not fully answer the questions on training time and theoretical aspects inline and instead referred to other responses, which makes them hard for readers to follow. For the 1-2 shot tuning results, a summary was lacking. I appreciate your efforts, yet considering these points, I'm keeping the "Weak Accept" rating. Please revise to make your answers more complete and straightforward.

Questions for Authors

  1. Proxy Diversity: How is the diversity of generated proxies quantified (e.g., feature entropy, pairwise distance)? Could metrics like Fréchet Inception Distance (FID) evaluate synthetic feature quality?
  2. Hard Class Mining: Does random batch sampling degrade performance compared to the proposed greedy strategy? If so, by what margin?
  3. Computational Trade-offs: How does Proxy-FDA’s training time compare to methods using external data augmentation (e.g., Mixup or retrieval-augmented tuning)?

Claims and Evidence

Supported Claims:

  1. Proxy-FDA reduces concept forgetting: Empirical validation across 10+ datasets (Tables 1, 4) and correlation analysis with OTDD (Fig. 3) robustly support this claim. The method consistently achieves higher LP (up to +1.54) than baselines, indicating superior retention of pre-trained knowledge.
  2. Proxies enhance feature distribution alignment: Ablation studies (Fig. 7) confirm that proxy generation improves FDA by synthesizing features from underrepresented regions of the data manifold. Comparisons with interpolation-based augmentation (e.g., VOS, NPOS) further validate its effectiveness.
  3. Structure-wise alignment outperforms point-wise methods: Proxy-FDA’s use of kNN graphs and similarity transfer (Eq. 2) yields statistically significant gains over LDIFS and L2SP (Tables 1–3), with OTDD analysis (Fig. 3) demonstrating stronger correlation to forgetting than L2 distance.

Unsubstantiated Claims:

  1. Proxy diversity and novelty: While proxies are qualitatively shown to represent unseen concepts (Fig. 4), their diversity and novelty lack quantitative metrics (e.g., entropy, coverage).
  2. Computational efficiency: Though Proxy-FDA incurs a 17–21% training time overhead (Appendix D), comparisons with retrieval-augmented methods or gradient-based alternatives are absent.

Methods and Evaluation Criteria

Methods: The integration of kNN graph alignment and proxy generation is novel. The use of OTDD, a structure-aware distribution distance, appropriately evaluates alignment quality.
Evaluation:

  1. Strengths: Broad validation across tasks (classification, VQA), architectures (CLIP, DINOv2), and settings (end-to-end, continual).
  2. Weaknesses: The selection of "other datasets" for ΔLP computation (Table 1) lacks explicit justification (e.g., domain overlap or task relevance).

Theoretical Claims

The paper does not present formal theoretical guarantees. Claims about the superiority of structure-wise alignment are supported empirically but lack proofs (e.g., convergence analysis or generalization bounds). The correlation between OTDD and forgetting is observational, not causal.

Experimental Design and Analysis

Strengths:

  1. Comprehensive ablations (Fig. 7) isolate contributions of hard class mining, proxy architecture, and similarity transfer.
  2. Hyperparameter sensitivity analysis (Fig. 6) validates robustness to batch size and neighborhood size.

Weaknesses:

  1. Hard class mining: The heuristic batch construction (Appendix A) is not rigorously compared to alternative strategies (e.g., entropy-based sampling).
  2. Extreme low-data regimes: Results for 1–2 shot tuning are omitted, limiting insights into Proxy-FDA’s applicability to ultra-scarce data.

Supplementary Material

The appendices provide critical implementation details:

  1. Batch construction (Appendix A): Describes greedy class mining to maximize inter-class similarity.
  2. Proxy generator (Appendix B): A lightweight architecture with attention and adaptive pooling (23.6K parameters).
  3. OTDD computation (Appendix C): Uses K-means pseudolabels for label-aware distribution alignment.

Relation to Existing Literature

Proxy-FDA bridges gaps in robust fine-tuning and relational knowledge distillation:

  1. Robust Fine-Tuning: Extends LDIFS by replacing point-wise feature matching with structural alignment, akin to graph-based knowledge transfer (Park et al., 2019).
  2. Proxy Learning: Differs from metric learning proxies by synthesizing instance-wise features rather than class prototypes.
  3. Continual Learning: Outperforms rehearsal-free methods (e.g., DualPrompt) by preserving structural knowledge without task-specific prompts (Table 9).

Essential References Not Discussed

None.

Other Strengths and Weaknesses

Strengths:

  1. Practical Versatility: Demonstrated efficacy across vision-language tasks (captioning, VQA) and architectures (ViT, ResNet).
  2. Novel Regularization: Combines structural preservation with proxy-augmented diversity, advancing beyond point-wise or logit-based methods.

Weaknesses:

  1. Interpretability: The t-SNE visualization (Fig. 4) lacks statistical rigor (e.g., clustering metrics).
  2. Scalability: Proxy generation for large-scale models (e.g., ViT-L/14) is not benchmarked.

Other Comments or Suggestions

Clarity: Eq. 3–4 could be simplified by merging redundant terms.

Author Response

Thanks for the detailed feedback! Below is our response to the main questions and "weaknesses".

Q1: Proxy diversity and novelty lack quantitative metrics. Could FID evaluate synthetic feature quality?

As suggested, many metrics are available to quantify our proxy feature diversity. Also, high proxy diversity often implies a high probability of proxy novelty, especially when the feature distribution is sparsely sampled and has a vast space of unseen data. Hence we focus on quantifying proxy diversity, and the novelty is simply examined through qualitative analysis — for example, one can perform some visual validation by image retrieval (Fig. 4) or even training a decoder on feature representations.

To quantify proxy diversity, we choose the variance loss in Eqs. (3-4), which is widely used in many domains like self-supervised learning to measure feature diversity. Note FID can be used to measure how well our generated proxies maintain the diversity of true features. Our adopted variance loss actually serves a similar purpose, since it is computed in an embedding space that's forced to align with the true one.

Here we report the averaged standard deviation of proxy features in the variance loss: higher value indicates larger diversity. To further aggregate the standard deviation values of the positive and negative proxies, we take their mean and compute its moving average till fine-tuning is completed. The table below compares the aggregated diversity metric of all the proxy generation baselines ablated in Fig. 7.

| Method | Diversity metric ($\times 10^{-2}$) |
|---|---|
| Proxy generation (default) | 3.14 |
| Random interpolation | 2.89 |
| VOS | 1.53 |
| NPOS | 1.72 |

Our method clearly achieves higher proxy diversity than VOS/NPOS. The latter two methods focus on outlier synthesis in low-likelihood regions, thus miss the chance to encode diverse unseen data that are crucial for improving FDA. Our method also obtains marginally higher proxy diversity than random interpolation. More importantly, our learning-based method improves diversity in a way that best helps FDA: the diverse proxies not only enrich data but also refine the decision boundary between positive and negative feature manifolds. This is not possible with random interpolation, which explains its lower performance in Fig. 7.
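To make the aggregation behind the diversity metric concrete, here is a minimal sketch of one possible reading of the description above. Several details are assumptions rather than facts from the paper: the per-set diversity is taken as the plain per-dimension standard deviation averaged over dimensions, the "moving average" is interpreted as a cumulative mean over training steps, and the names `mean_feature_std` and `DiversityTracker` are hypothetical.

```python
import numpy as np

def mean_feature_std(feats):
    """Average per-dimension standard deviation of a (n_samples, dim) array."""
    return float(np.std(feats, axis=0).mean())

class DiversityTracker:
    """Cumulative average of the per-step proxy diversity (assumed reading
    of 'moving average till fine-tuning is completed')."""

    def __init__(self):
        self.total = 0.0
        self.steps = 0

    def update(self, pos_proxies, neg_proxies):
        # Mean of the positive-set and negative-set diversity for this step.
        step_value = 0.5 * (mean_feature_std(pos_proxies)
                            + mean_feature_std(neg_proxies))
        self.total += step_value
        self.steps += 1
        return self.total / self.steps

# Toy proxies: a spread-out set (std 1 per dimension) and a collapsed set (std 0).
spread = np.array([[0.0, 0.0], [2.0, 2.0]])
collapsed = np.zeros((2, 2))

tracker = DiversityTracker()
first = tracker.update(spread, collapsed)   # (1 + 0) / 2
second = tracker.update(spread, spread)     # running mean of 0.5 and 1.0
```

Under this reading, a higher final value simply means the generated proxies were, on average, more spread out in feature space during fine-tuning.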

Q2: Compare Proxy-FDA’s training time with that of retrieval-augmented methods.

Please refer to our response to Q2 of Reviewer bxDC.

Q3: Lack theoretical guarantees for FDA. Also, the correlation between OTDD and forgetting is observational, not causal.

Please refer to our response to Q2 of Reviewer 3vR2.

Q4: Hard class mining: does random batch sampling degrade performance? Comparing to entropy-based sampling.

As detailed in Appendix A, our batch sampling is performed by hard class mining plus random data sampling within each class. Our ablation studies (Fig. 7) already compare with random batch sampling (the "No hard class mining" baseline), where classes are randomly sampled too. The average $\mathcal A_{\text{H}}$ is compared for few-shot prompt tuning when we apply Proxy-FDA to the CoOp/PromptSRC baselines. Proxy-FDA obtains an $\mathcal A_{\text{H}}$ of 78.13/80.81 with hard class mining, and 75.23/80.35 with random batch sampling. As mentioned in Appendix E (L800-804), the performance gap, which is large for CoOp (78.13 vs. 75.23) and smaller for PromptSRC (80.81 vs. 80.35), shows that hard class mining is crucial: it samples close class distributions, among which we can have meaningful modeling and matching of kNN graphs.

We now compare with an entropy-based batch sampling strategy that is also implemented under our greedy framework in Appendix A (for a fair comparison). We simply change the inter-class similarity metric in step 2: our default hard class mining uses the FDA loss to select similar class samples, while the entropy-based strategy prioritizes them by low entropy. The entropy-based strategy shows a moderate decrease in $\mathcal A_{\text{H}}$ (77.46/80.75 vs. 78.13/80.81). This is because entropy cannot characterize similarity adaptively as a function of the current feature distribution structure. As a result, the batch sampling criterion is decoupled from the structural FDA within the sampled batch. In contrast, FDA loss-based sampling adapts to the feature distribution structure and is coupled with the FDA-based regularization within the batch. We will add these results to the paper.

Q5: Clarifications needed.

Results for 1–2 shot tuning are shown in Fig. 8 (more details in L860-865).

Table 4 (and L853-857) contains results across different foundation models and architectures (including the large-scale ViT-L/14), where proxy generation (Proxy-FDA vs. FDA) improves $\Delta_{\text{LP}}$ consistently.

Review
Rating: 4

This paper presents a new approach to mitigate concept forgetting in model fine-tuning (robust fine-tuning) by building on existing feature-matching methods. Specifically, this work aims to align the feature structure by regularizing the feature space using k-nearest neighbors (KNN) within each batch. They also propose generating proxies from the data to preserve diversity across datasets.

Questions for Authors

See strengths and weaknesses.

Claims and Evidence

Yes. Section 4 presents the empirical results.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

There are no theoretical results.

Experimental Design and Analysis

Yes. Section 4 presents the empirical results, and the appendix provides further experimental details.

Supplementary Material

The appendix contains experimental details and supplementary results.

Relation to Existing Literature

This paper is mainly related to work on model generalization, which focuses on controlling the upper bound of the test error using information about the training error and the function class.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths

  1. This paper is well written and provides a clear statement of the results.
  2. The newly proposed method shows strong performance, which seems reasonable and significant.

Weaknesses

  1. There is a lack of discussion about the motivation for the new method. Could you provide more discussion on why such a method can improve model performance?
  2. There is a lack of theoretical guarantees. Are there theoretical explanations for the benefits of such a method in reducing forgetting?

Other Comments or Suggestions

See strengths and weaknesses.

Author Response

Thank you for the constructive feedback on our work. Below is our point-by-point response to your questions.

Q1: Motivation discussion: why Proxy-FDA improves performance.

(Proxy-)FDA is essentially a feature-space regularization term added to the task loss during model fine-tuning. The goal is to keep the fine-tuned model in the desired vicinity of the pre-trained one, so that the tuned model can preserve pre-trained knowledge while still learning the task at the same time. Extensive experiments show that (Proxy-)FDA can significantly reduce forgetting while achieving strong fine-tuning performance (sometimes better).

For better forgetting mitigation, our high-level idea is to extend existing point-wise feature regularization methods that lack explicit awareness of feature neighborhood structures. Proxy-FDA is proposed to fill this gap - it aligns the structural relations between kNN feature graphs, which is further improved by a proxy feature generator that increases feature diversity. Two empirical observations confirm our benefits, and hence reaffirm the motivation behind Proxy-FDA: 1) FDA can transfer the structural knowledge in kNN feature graphs, e.g. visual attribute shared between class concepts (Fig. 4). Preserving such common-sense knowledge is useful to maintain the generalizability of foundation model. 2) There's a strong correlation between forgetting and a structure-aware distributional distance metric OTDD (Fig. 3). Such correlation suggests the need of structure-wise feature regularization in some form to effectively mitigate forgetting, and our structural method Proxy-FDA is one such instantiation. In other words, this observation explains our advantage from an optimization perspective, i.e., optimizing our Proxy-FDA objective is close to optimizing a metric directly related to forgetting.

Q2: Theoretical explanations of the benefits of Proxy-FDA in reducing forgetting

Good suggestion. Note that the primary focus of this paper is on the empirical evaluation of our new method, along with two empirical observations (refer to our response to Q1) that shed light on the benefits of Proxy-FDA. However, we argue that it is promising to derive theoretical guarantees based on the 2nd observation, namely that forgetting is strongly correlated with the OTDD metric. Interestingly, OTDD is computed in a way that is very similar to our FDA loss: both use clustering techniques to account for the clustering structure of the underlying space, and hence both can compare feature distributions with rich geometry awareness. In other words, unlike the L2 loss, our FDA loss is a good proxy for a metric (OTDD) that itself is directly related to forgetting. This finding has two key implications: 1) There is a clear advantage to optimizing the FDA loss over L2 loss-based optimization for direct forgetting prevention. 2) More importantly, the generalization error (or forgetting effect) could be bounded by some function of our own distance metric FDA (akin to OTDD). We leave such theoretical analysis for future work.

Reviewer Comment

Thanks for the author's reply. It has addressed my concerns. I will update the score.

Author Comment

Thanks for raising the score! We will integrate the new insights into the paper.

Review
Rating: 3

This paper introduces a novel approach to mitigate concept forgetting during model fine-tuning by extending existing feature-matching methods. The authors propose to preserve feature structure by regularizing the feature space using k-nearest neighbors (KNN) within each batch. Additionally, they develop a method for generating proxies from the data to maintain diversity across datasets.

Questions for Authors

Update after rebuttal

I appreciate the authors' thorough response, which has successfully addressed most of my initial concerns. I will maintain my current score.

For the final version of the paper, I strongly recommend that the authors clearly articulate the fundamental differences between traditional KL based Knowledge Distillation and the proposed FDA method. This distinction should be presented prominently in the main paper to help readers immediately understand the novel contribution.

Thank you for your attention to the review comments.

Claims and Evidence

Yes, the claims are supported by the reported experimental results.

Methods and Evaluation Criteria

Yes. They make sense.

Theoretical Claims

No new theory is proposed in this paper.

Experimental Design and Analysis

Some of the improvements in the experimental results look marginal. The experimental designs are valid.

Supplementary Material

Yes. Additional experimental results.

Relation to Existing Literature

The solution is interesting, but the broader impact of this work is limited.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

  • The research addresses robust fine-tuning, which is a highly active and valuable area with significant practical relevance.
  • The approach is well-motivated, with the preservation of data structure during feature matching being both reasonable and innovative.
  • The paper is clearly written, making the technical approach accessible and easy to follow.

Weaknesses:

  • Distribution Alignment Method: The motivation for using distribution alignment through feature matching with KNN regularization lacks sufficient justification. The approach bears similarities to knowledge distillation, but the paper doesn't adequately explain how this method differs from distillation between original and fine-tuned models.

  • KNN Clustering Limitations: The method clusters features in each batch using KNN, which may not effectively group samples with similar labels, particularly during fine-tuning. The clustering approach could be enhanced with label-based constraints that either include or exclude samples based on label proximity to improve feature alignment.

Other Comments or Suggestions

N/A

Author Response

Thanks for the constructive feedback on our work. Below we include the results of requested experiments, and respond to your specific comments.

Q1: Justify motivation of using kNN feature-based FDA, and differences from knowledge distillation.

In response, our introduction section (paragraphs 3 & 4) motivates that we aim to improve over existing feature regularization methods that are often point-wise and preserve limited concepts since they lack explicit awareness of feature neighborhood structures. We propose Feature Distribution Alignment (FDA), a structure-wise regularization method that aligns the structural relations between pre-trained and fine-tuned feature distributions. The structural relations are modeled by feature graphs, and we choose kNN feature graphs because they not only enable efficient graph matching, but also are effective enough to capture the rich knowledge in local feature neighborhoods (e.g. shared visual attribute).

The related work section (L145-152) mentions that Proxy-FDA is indeed similar to Knowledge Distillation (KD), especially to those relational KD methods. Our main difference is that we distill knowledge from both neighbor indices and similarities, with an additional proxy learning component. Appendix G shows Proxy-FDA is directly applicable to KD and is quite performant compared to related KD baselines.
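As a rough illustration of what transferring both neighbor indices and similarities between kNN feature graphs could look like, consider the sketch below. It is not the paper's $\mathcal L_{\text{FDA}}$: the squared-error form, the use of cosine similarity, and the names `cosine_sim`, `knn_indices`, and `knn_alignment_penalty` are all assumptions made for illustration, and the proxy learning component is omitted entirely.

```python
import numpy as np

def cosine_sim(feats):
    """Pairwise cosine similarity between the rows of feats."""
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    return normed @ normed.T

def knn_indices(sim, k):
    """Indices of the k most similar other samples for each row."""
    sim = sim.copy()
    np.fill_diagonal(sim, -np.inf)  # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]

def knn_alignment_penalty(pre_feats, ft_feats, k):
    """Hypothetical structure-wise penalty: neighbors are selected on the
    pre-trained graph (index transfer), and the fine-tuned similarities to
    those same neighbors are compared with the pre-trained ones
    (similarity transfer)."""
    sim_pre = cosine_sim(pre_feats)
    sim_ft = cosine_sim(ft_feats)
    nbrs = knn_indices(sim_pre, k)
    rows = np.arange(len(pre_feats))[:, None]
    diff = sim_ft[rows, nbrs] - sim_pre[rows, nbrs]
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
pre = rng.normal(size=(8, 16))                  # stand-in pre-trained features
drift = pre + rng.normal(size=pre.shape)        # stand-in fine-tuned features

same = knn_alignment_penalty(pre, pre, k=3)     # unchanged features
drifted = knn_alignment_penalty(pre, drift, k=3)
```

A point-wise regularizer would compare `pre` and `drift` row by row, whereas this penalty only depends on how each sample relates to its neighbors, which is the structural aspect the response emphasizes.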

Q2: Improve FDA by kNN clustering with label constraints

During our FDA-based fine-tuning, class labels are mainly used in the task loss $\mathcal L_{\text{task}}$, while $\mathcal L_{\text{FDA}}$ is only treated as a feature regularization term without involving labels. The intuition behind the label-free $\mathcal L_{\text{FDA}}$ is that we aim to preserve a foundation model's general knowledge, which can be much richer than class labels on downstream datasets. More specifically, $\mathcal L_{\text{FDA}}$ matches kNN feature graphs to align their structural relations only based on feature (not label) similarities. Fig. 4 shows this can go beyond class concepts in a feature neighborhood with different classes (e.g., cross-class attributes), which is key to maintain the generalizability of foundation models. On the other hand, matching kNN graphs with label constraints may end up aligning class semantics on the downstream task, thus may risk forgetting the general knowledge embedded in foundation models.

In the table below, we empirically compare with an FDA variant that models and matches kNN feature graphs using both feature similarities $\hat w_{ij}$ and label similarities $w_{ij}^t$. Note we use the text encoder of CLIP to compute $w_{ij}^t$ as the text-text similarity between the class templates "a photo of a {class}". Comparisons are performed under the base-to-new setting for few-shot prompt tuning (average across 11 datasets). Results show that the use of $w_{ij}^t$ may produce comparable or better $\mathcal A_{\text{Base}}$ (i.e., fine-tuning accuracy on seen classes), but always leads to lower $\mathcal A_{\text{New}}$ (i.e., worse generalization on unseen classes, meaning more concept forgetting), with a large drop for CoOp and a small one for PromptSRC.

| Method | $\mathcal A_{\text{Base}}$ | $\mathcal A_{\text{New}}$ | $\mathcal A_{\text{H}}$ |
|---|---|---|---|
| CoOp | 82.69 | 63.22 | 71.66 |
| +Proxy-FDA ($\hat w_{ij}$, default) | 83.16 | 73.67 | 78.13 |
| +Proxy-FDA ($\hat w_{ij} \cdot w_{ij}^t$) | 83.02 | 70.91 | 76.49 |
| PromptSRC | 84.26 | 76.10 | 79.97 |
| +Proxy-FDA ($\hat w_{ij}$, default) | 84.47 | 77.45 | 80.81 |
| +Proxy-FDA ($\hat w_{ij} \cdot w_{ij}^t$) | 84.55 | 77.12 | 80.66 |
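
The discussion does not define $\mathcal A_{\text{H}}$. In the base-to-new protocol used by CoOp-style prompt tuning work it is commonly the harmonic mean of base and new accuracy, and that reading is assumed in the short check below. The helper name `harmonic_mean` is hypothetical; the numbers are copied from the table above.

```python
def harmonic_mean(base_acc, new_acc):
    """Harmonic mean of base-class and new-class accuracy (in percent)."""
    return 2.0 * base_acc * new_acc / (base_acc + new_acc)

# (A_Base, A_New, reported A_H) for each row of the table.
rows = {
    "CoOp": (82.69, 63.22, 71.66),
    "CoOp + Proxy-FDA (feature sim.)": (83.16, 73.67, 78.13),
    "CoOp + Proxy-FDA (feature x label sim.)": (83.02, 70.91, 76.49),
    "PromptSRC": (84.26, 76.10, 79.97),
    "PromptSRC + Proxy-FDA (feature sim.)": (84.47, 77.45, 80.81),
    "PromptSRC + Proxy-FDA (feature x label sim.)": (84.55, 77.12, 80.66),
}

# Largest gap between the recomputed harmonic mean and the reported value.
max_gap = max(abs(harmonic_mean(b, n) - h) for b, n, h in rows.values())
```

Under this assumption every reported $\mathcal A_{\text{H}}$ matches the recomputed value to within rounding, so the drop in $\mathcal A_{\text{H}}$ for the label-constrained variant is driven by its lower $\mathcal A_{\text{New}}$.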

Final Decision

This paper introduces a novel approach to mitigating concept forgetting during model fine-tuning by extending existing feature-matching methods. Its strengths include practical versatility across vision-language tasks and an innovative regularization technique, and the paper is well written. By employing feature-space regularization with nearest neighbor graphs and proxy generation, the paper achieves significant improvements across diverse datasets. After the author-reviewer discussion, all reviewers recommended acceptance.