PaperHub

ICLR 2024 · Poster
Rating: 6.0 / 10 (4 reviewers; scores 6, 8, 5, 5; min 5, max 8, std 1.2)
Confidence: 3.8

Towards Few-Shot Adaptation of Foundation Models via Multitask Finetuning

OpenReview · PDF
Submitted: 2023-09-18 · Updated: 2024-04-06
TL;DR

We delve into a theoretical framework of multitask finetuning for foundation models on new tasks with limited labels. Our introduced task selection algorithm notably boosts model performance, backed by extensive empirical evidence.

Abstract

Keywords
Foundation model · Multitask finetuning · Few-shot learning

Reviews and Discussion

Official Review (Rating: 6)

This paper studies multitask finetuning for adapting foundation models to new tasks in few-shot. It provides a theoretical justification to analyze the multitask finetuning and a practical task selection algorithm for choosing good finetuning tasks. Extensive experiments are conducted to validate the proposed algorithm.

Strengths

  • The paper proposes a theoretical framework to analyze multitask finetuning for foundation models, providing definitions and analysis of task diversity and consistency. Based on the theoretical concepts, the authors introduce a practical task selection algorithm.
  • The authors conducted extensive experiments and showed the effectiveness of the proposed method compared to the direct adaptation and full fine-tuning baseline.
  • The overall writing is good, which clearly explains the problem, approach, analysis and results.

Weaknesses

  • The theoretical part discusses diversity and consistency, but the method part simplifies consistency to similarity and diversity to coverage. So the final algorithm is basically a similarity-based sorting. The simplicity is not a problem, but I wonder if it is really related to the theoretical concepts.
  • The authors evaluated the proposed algorithm on a wide range of datasets/models, but how does it compare with some stronger baselines besides direct adaptation and full fine-tuning, e.g. meta-learning algorithms?

Questions

See above

Comment

We thank the reviewer for the valuable comments and address the questions below.

Consistency to similarity and diversity to coverage

We formally show that consistency reduces to similarity and diversity to ellipsoid coverage under a Gaussian distribution. We consider each latent class/representation as a feature vector, whose non-zero entries carry the information/features. Both consistency and diversity can be depicted through the variations in the entries of the target tasks' latent classes, which are encoded by entries in the finetuning tasks' latent classes.

To meet consistency, the latent classes in finetuning tasks should not include an excessive amount of unrelated varying entries. This is directly related to similarity in the context of Gaussian-distributed features. In our framework, consistency is quantified by how closely the task-related feature distributions align, which is effectively captured by the similarity between mean vectors.

Similarly, to meet the diversity criterion, we would want the latent features in target tasks to be covered by those in the finetuning tasks. Under our theoretical model, diversity equates to the spread of the feature vectors across the task space. The ellipsoidal coverage represents the extent to which the feature space of the finetuning tasks encapsulates the feature space of the target task.
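To make the reduction concrete, below is a minimal numpy sketch of the two proxies, assuming per-sample feature vectors for the target task and a candidate finetuning task. The function names and the exact coverage formula are illustrative assumptions, not the exact form used in our Algorithm 1.

```python
import numpy as np

def consistency_score(target_feats, task_feats):
    """Proxy for consistency (illustrative): cosine similarity between the mean
    feature vectors of the target task and a candidate finetuning task."""
    mu_t = target_feats.mean(axis=0)
    mu_s = task_feats.mean(axis=0)
    return float(mu_t @ mu_s / (np.linalg.norm(mu_t) * np.linalg.norm(mu_s) + 1e-12))

def coverage_score(target_feats, selected_feats):
    """Proxy for diversity (illustrative): variance-scaled squared distance from
    target features to the pool of selected finetuning features. Smaller values
    mean the target features fall closer to the ellipsoid spanned by the pool."""
    mu = selected_feats.mean(axis=0)
    var = selected_feats.var(axis=0) + 1e-6               # per-dimension variance
    d2 = ((target_feats - mu) ** 2 / var).mean(axis=1)    # scaled distance per sample
    return float(d2.mean())
```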

Stronger baselines besides direct adaptation and full finetuning, e.g. meta-learning algorithms

Our experiments were designed to demonstrate the effectiveness of the proposed method against common approaches like direct adaptation and full finetuning. We interpret that the reviewer is pointing to Table 2 regarding our multitask finetuning approach. It's important to recognize that our multitask finetuning strategy falls under the category of meta-learning algorithms, as it involves treating target few-shot tasks as individual tasks and creating analogous few-shot finetuning tasks to refine the model. This process assists the model in learning to tackle these few-shot challenges effectively. In response to the feedback, we aim to incorporate additional meta-learning algorithms into our comparison table to ensure a more equitable evaluation.

Comment

We incorporated the Model-Agnostic Meta-Learning (MAML) algorithm, as outlined by Finn et al. [1], as another baseline for our few-shot tasks.

MAML operates in a two-step process: it first adapts the parameters based on the within-episode support loss (the inner loop), then evaluates the adapted parameters on the query samples and updates the initialization based on that loss (the outer loop). We follow the pipeline in [2] to implement MAML for few-shot tasks.
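For reference, here is a minimal second-order MAML sketch in PyTorch. This is illustrative only; our actual implementation follows the pipeline of [2], and the function and variable names are assumptions.

```python
import torch
from torch.func import functional_call

def maml_outer_step(model, episodes, inner_lr=0.01, inner_steps=5):
    """One outer-loop update over a batch of episodes (illustrative sketch)."""
    params = {n: p for n, p in model.named_parameters()}
    outer_loss = 0.0
    for support_x, support_y, query_x, query_y in episodes:
        # Inner loop: adapt a copy of the parameters on this episode's support set.
        fast = {n: p.clone() for n, p in params.items()}
        for _ in range(inner_steps):
            logits = functional_call(model, fast, (support_x,))
            loss = torch.nn.functional.cross_entropy(logits, support_y)
            grads = torch.autograd.grad(loss, list(fast.values()), create_graph=True)
            fast = {n: p - inner_lr * g for (n, p), g in zip(fast.items(), grads)}
        # Outer loop: evaluate the adapted parameters on the query set.
        q_logits = functional_call(model, fast, (query_x,))
        outer_loss = outer_loss + torch.nn.functional.cross_entropy(q_logits, query_y)
    return outer_loss / len(episodes)   # caller backpropagates and steps the optimizer
```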

Here we show the results of the comparison for DINOv2 ViT-S on tieredImageNet. We refer the reviewer to Table 14 in Appendix G.8 for complete results.

Method       | 1-shot       | 5-shot
-------------|--------------|-------------
Adaptation   | 74.54 (0.32) | 89.20 (0.19)
Standard FT  | 74.84 (0.32) | 89.30 (0.19)
MAML         | 74.63 (0.34) | 89.60 (0.19)
Ours         | 77.78 (0.32) | 90.23 (0.18)

As illustrated in the table, MAML can slightly outperform Adaptation and Standard FT for DINOv2 ViT-S on tieredImageNet, but lags behind our multitask finetuning. The reason is that in few-shot tasks, support samples (shot images) are very limited, preventing the model from learning a good set of parameters for adapting to new tasks.

[1] Finn et al. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. PMLR 2017.

[2] Triantafillou et al. Meta-Dataset: A Dataset of Datasets for Learning to Learn from Few Examples. ICLR 2020.

Comment

Dear Reviewer AgCa,

Since the deadline for the author-reviewer discussion phase is fast approaching, we would like to follow up with you to see if you have any further questions. In our rebuttal, we have tried to address all your questions. In particular, we clarified the connection between our algorithm and the theoretical results, and added MAML as a meta-learning baseline. We are willing to provide more clarification if you have any additional concerns. Thank you!

Best,

Authors

Official Review (Rating: 8)

This paper presents theoretical results that to adapt a pretrained model to a target task, it is beneficial to first finetune the model on a diverse set of related tasks and then tune the model on the target task. The authors prove the theorems by first defining the diversity and consistency between different tasks that the model is going to adapt to, where diversity refers to the coverage of the other tasks on the target task and consistency refers to the similarity between the other tasks and the target task. Then, under some Lipschitzness assumptions, sufficient consistency and diversity assumptions, and sufficient tasks and sufficient sample assumptions (on the finetuning tasks, not the target task), the model with multitask finetuning can achieve a reduced error on the target task compared to the setting without multitask finetuning. Moreover, the authors propose a multitask selection algorithm based on the consistency and diversity requirements. Experimental results indicate that with the proposed multitask selection, the ViT-B model achieves better results on multiple target datasets, and increasing the number of tasks and the number of samples per task are most effective in improving target task performance. Lastly, with multitask finetuning, various pretrained models achieve better results than direct adaptation / standard finetuning on multiple datasets.

Strengths

  • The presented theoretical results are quite intuitive -- if the tasks in multitask finetuning are diverse and consistent with the target task $T_0$, then the model with pretraining followed by multitask finetuning achieves better target prediction performance than the model only with pretraining. My intuition is that, through the training/finetuning on the proxy tasks, the model learns some knowledge related to the target tasks. It is nice that the authors present the theorems that support this intuition, although I didn't verify the proofs.
  • Although the theorems are not directly applicable to real problems, the authors propose a practical Consistency-Diversity Task Selection algorithm, which is effective in experiments. The algorithm is a nice method to try for target task adaptation in the real world.
  • The overall experiments are quite extensive and echo the theoretical results, clearly demonstrating the advantages of multitask finetuning over standard finetuning and direct adaptation.

Weaknesses

  • The Gaussian distribution assumption in Section 3.2 may not be realistic, leading to biased task selection that may be suboptimal to the model performance on the target task. However, it is okay to derive further refined algorithms in future work.

Questions

  • I found the notations in Section 3.1 (Linear Data and Tasks) somewhat difficult to understand. $T_{z, z'}$ is discussed in the text, but how is it related to the $T_i\ (i = 1, \dots, M)$ in Figure 1? It would ease the reader's understanding to revise the explanation in that part.
  • Does the Experiment Protocol apply to all three major experiments (4.1, 4.2, and 4.3)? It seems to me that the experiments have their own protocols, which causes confusion.
  • What are the multitasks for finetuning in Section 4.3?
Comment

We thank the reviewer for the valuable comments and address the questions below.

Gaussian distribution assumption

Please see the global rebuttal.

Relation between $\mathcal{T}_{z,z'}$ and $\mathcal{T}_i\ (i = 1,\ldots,m)$

We realize that the notation used in Section 3.1 could be made clearer. In Figure 1, each task contains two latent classes, namely $(z, z')$. Each task in Figure 1 can thus be written in this notation: $T_1$ corresponds to $\mathcal{T}_{z_1,z_1'}$, $T_2$ corresponds to $\mathcal{T}_{z_2,z_2'}$, and so on. We thank the reviewer for pointing this out. We have updated this in the new revised manuscript.

Confusions in the Experiment Protocol in Section 4

The paragraph Experiment Protocol outlined at the beginning of Section 4 applies to all three major experiments (Sections 4.1, 4.2, and 4.3). We clarify further here: In Section 4.1, we retain the same finetuning setup and vary the sample complexity in tieredImageNet. In Section 4.2, we continue with the same finetuning process and apply the task selection algorithm on Meta-Dataset. In Section 4.3, we retain the same finetuning setup and include the additional standard finetuning where we append the encoder with a linear head to map representations to class logits and finetune the whole model.

All experiments focus on evaluating models using few-shot tasks. For multitask finetuning, the model's parameters are updated using the nearest centroid classifier. This involves encoding all samples, calculating class centroids, and using the cosine similarity between a query sample and these centroids as the class logits. For adapting to a specific target task, we employ the model's encoder along with the same nearest centroid classifier approach.
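For concreteness, a minimal sketch of the nearest centroid classifier described above (the encoder interface and function name are illustrative assumptions, not our exact code):

```python
import torch
import torch.nn.functional as F

def nearest_centroid_logits(encoder, support_x, support_y, query_x, num_classes):
    """Encode all samples, average support embeddings per class into centroids,
    and use cosine similarity between query embeddings and centroids as logits."""
    z_support = F.normalize(encoder(support_x), dim=-1)
    z_query = F.normalize(encoder(query_x), dim=-1)
    centroids = torch.stack([
        z_support[support_y == c].mean(dim=0) for c in range(num_classes)
    ])
    centroids = F.normalize(centroids, dim=-1)
    return z_query @ centroids.t()   # shape (num_query, num_classes)
```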

We will ensure that the protocol is clearly described for each experiment in the revised manuscript to avoid any confusion.

Multitasks for finetuning in Section 4.3

In Section 4.3, the multitasks are few-shot tasks consisting of $K=15$ classes with $N=1$ support samples and $Q=10$ query samples per class (known as $K$-way $N$-shot). The multitasks selected for finetuning were chosen based on their relevance and consistency with the target tasks, as prescribed by our Consistency-Diversity Task Selection algorithm.
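A minimal sketch of how one such finetuning task could be sampled, assuming a mapping from class labels to example lists; the helper name and data layout are illustrative:

```python
import random

def sample_episode(samples_by_class, k_way=15, n_shot=1, q_query=10):
    """Sample one K-way N-shot task matching the K=15, N=1, Q=10 setting above.
    `samples_by_class` maps each class label to a list of examples."""
    classes = random.sample(list(samples_by_class), k_way)
    support, query = [], []
    for label, cls in enumerate(classes):
        picks = random.sample(samples_by_class[cls], n_shot + q_query)
        support += [(x, label) for x in picks[:n_shot]]
        query += [(x, label) for x in picks[n_shot:]]
    return support, query
```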

Official Review (Rating: 5)

Given a pretrained foundation model, a target task with limited labeled samples, and a set of auxiliary tasks possibly with many labeled data, this paper analyzes how to select good auxiliary tasks to finetune the foundation model first and then adapt the auxiliary-task-finetuned model to the target task (a.k.a. multitask finetuning), such that the resulting model can outperform the model directly adapted to the target task from pretraining, or the model finetuned with all the auxiliary tasks before adaptation. To this end, the authors derive several bounds which indicate that the model with multitask finetuning outperforms that without, and that the selected auxiliary tasks should be sufficiently diverse and have good consistency w.r.t. the target task. Based on this, the authors formulate an auxiliary-task-selection algorithm, whose advantage over standard finetuning is validated on various datasets.

Strengths

  1. Several bounds about multitask finetuning are derived.
  2. The multitask finetuning with task selection outperforms that finetuned on all the tasks.
  3. The experiments have been extensively performed on a dozen datasets.

Weaknesses

  1. While several bounds have been derived, their implications are quite straightforward. For instance, the findings that the model with multitask fine-tuning outperforms the one without it, and that the selected auxiliary tasks should exhibit sufficient diversity and strong consistency w.r.t. the target task, are well accepted by the community, despite being derived under somewhat strict assumptions. It would be more valuable if these bounds could lead to novel insights that were previously overlooked by the community.
  2. A similar limitation lies in the task-selection algorithm, where the authors simply assume a Gaussian distribution of features and then use the angle between means and a variance-scaled L2 distance to indicate consistency and diversity.
  3. As $\phi$ evolves during training, should Algorithm 1 be performed repeatedly during training?
  4. The paper in its current version is quite difficult to read; it is suggested to give more intuition before diving into the derivation, and also to simplify the symbols where applicable.

Questions

  1. Why does the performance decrease with the increase of sample shots in Fig. 2a?
  2. Is Standard FT in Table 2 the same as All in Table 1?
  3. First Paragraph on Page 4, In contrastive pretraining -> In contrast to pretraining; First Paragraph on Page 7, the cosine similarity similarity -> the cosine similarity
  4. It seems that Theorem 3.3 (or the index 3.3) is missing in the main text.
Comment

We thank the reviewer for the valuable comments and address the questions below.

Novel insights of theoretical bound

While our derived bounds may coincide with what has been empirically observed, it's important to recognize that such empirical observation, until formalized, remains a hypothesis. By providing a rigorous theoretical foundation, we not only validate these empirical observations but also establish a framework upon which future hypotheses can be tested and understood.

Within the theoretical community, various concepts of diversity have been proposed to establish guarantees [1,2]. Yet, we propose an additional notion of consistency, highlighting the significance of aligning finetuning data with target data. Conversely, practitioners in the empirical community have predominantly concentrated on algorithms whose selection is based on similarity ([3,4] and many more). Our algorithm, which integrates both diversity and consistency, aims to enhance task selection when adapting foundation models to specific tasks.

[1] Tripuraneni et al. On the theory of transfer learning: The importance of task diversity. NeurIPS 2020.

[2] Zhao et al. Blessing of Class Diversity in Pre-training. AAAI 2023.

[3] Liu et al. Learning customized visual models with retrieval-augmented knowledge. CVPR 2023.

[4] Liu et al. What Makes Good In-Context Examples for GPT-3? arXiv 2021.

Gaussian distribution assumption

Please see the global rebuttal.

Evolving $\phi$ in Algorithm 1

We appreciate this insightful observation. Currently, our algorithm is not performed repeatedly during training. Instead, we use a pretrained $\phi$ as a feature extractor to obtain the representation of each individual sample. Updating $\phi$ online is interesting, and we leave it as future work.

Difficulty in readability

Thank you for pointing out improvements in the presentation. To aid in the understanding of task selection, we further include a diagram in Figure 2 in Section 3.2 to show the relation between tasks. We plan the following to improve readability:

  • Example illustration: We will provide a simplified example before introducing new notation, including the definition of tasks and loss in Section 2.
  • Intuitive explanations: We will introduce each major theoretical result with an intuitive explanation, including our main theorem and the theorems for diversity and consistency in Section 3.
  • Symbol simplification: We will simplify the notation by reducing the number of symbols and by using more familiar notation, including $\phi$ for models, $\mathcal{T}$ for tasks, and $\mathcal{L}$ for losses in Section 3.

We hope these improvements will enhance clarity and make our paper more approachable to a broader range of readers.

Performance decreases with the increase of shots in Figure 2a

We explain this phenomenon below, as mentioned in the Results paragraph of Section 4.1. In Figure 2a we fix the target task to a 1-shot setting but vary the number of shots from 1 to 4 in finetuning. We recognize that the results may initially appear counterintuitive, as they oppose the common belief that finetuning using meta-learning has to mimic the exact setting of the target task; a mismatch will hurt performance [5]. However, in our study, we discovered that within the range of shots considered (from 1 to 4), the number of shots in finetuning did not significantly alter the accuracy. This observation underscores a key insight from our work: it is the total sample size $M \times m$, not just the number of shots, that determines the performance gains. This aligns with our theoretical findings (Theorem 3.2).

[5] Snell et al. Prototypical networks for few-shot learning. NeurIPS 2017.

Standard FT in Table 2 and All in Table 1

We illustrate “Standard FT” in the Section 4.3 Setup paragraph. We use multitask finetuning for results in Table 1 (including “All”) as illustrated in the Section 4 Experiment Protocol paragraph. Here we give a brief explanation.

In Table 2, "Standard FT" refers to another finetuning method where the encoder is appended with a linear head that maps representations learned by the encoder to class logit probabilities, and the whole model is finetuned on the logit loss. On the other hand, in multitask finetuning, the model is updated using the nearest centroid classifier. This approach differs from "Standard FT" in that it does not rely on the logit loss but rather uses class centroids computed from encoded samples. In Table 1, we use the multitask finetuning strategy and focus on comparing the task selection algorithm. "All" refers to results without the task selection algorithm, finetuning on all datasets.
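For concreteness, a minimal sketch of the "Standard FT" objective described above, assuming a PyTorch-style encoder; the linear head and training-loop details are illustrative, not our exact code:

```python
import torch
import torch.nn as nn

def standard_ft_step(encoder, head, images, labels, optimizer):
    """One 'Standard FT' update (sketch): map encoder representations to class
    logits with a linear head and finetune the whole model on the logit loss."""
    logits = head(encoder(images))
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# head would be nn.Linear(embed_dim, num_classes), with both encoder and head
# parameters passed to the optimizer.
```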

We hope this explanation resolves the confusion and distinguishes between the two finetuning methods.

Typos of contrastive pretraining, cosine similarity, and missing index of theorem

We thank the reviewer for identifying these textual errors. We have corrected them in the revised manuscript.

Comment

Dear Reviewer 1yNr,

Since the deadline for the author-reviewer discussion phase is fast approaching, we would like to follow up with you to see if you have any further questions. In our rebuttal, we have tried to address all your questions. In particular, we clarified the theoretical bounds, the Gaussian distribution assumption, the algorithm, and the experimental details. We have also fixed the typos and improved readability as planned. We are willing to provide more clarification if you have any additional concerns. Thank you!

Best,

Authors

Official Review (Rating: 5)

This paper explores how to effectively adapt foundational models to new tasks, especially those with limited labeled data. The authors investigate a multi-task fine-tuning approach, wherein the foundational model is fine-tuned for a set of related tasks before fine-tuning it for the target task.

Strengths

The paper provides a theoretical analysis, revealing that fine-tuning with a diverse set of related tasks reduces errors in the target task compared to directly adapting a pre-trained model. The authors introduce diversity and consistency metrics to quantify the relationship between fine-tuning tasks and the target task and propose a practical task selection algorithm.

Weaknesses

  1. In Section 1.1, it would be beneficial to provide a more detailed enumeration of recent advancements in the field of multitask fine-tuning.
  2. I suggest adding some diagrams in Section 3.2 to visually illustrate the entire process. Visual representation will help readers gain a clearer understanding and assess the effectiveness of task selection.
  3. I suggest conducting more detailed ablation experiments for TASK SELECTION to provide stronger evidence for its effectiveness.
  4. The experimental section lacks sufficient detail, such as the absence of descriptions for hyperparameter settings. Providing these details would make the findings more convincing.

Questions

  1. In Section 1.1, it would be beneficial to provide a more detailed enumeration of recent advancements in the field of multitask fine-tuning.
  2. I suggest adding some diagrams in Section 3.2 to visually illustrate the entire process. Visual representation will help readers gain a clearer understanding and assess the effectiveness of task selection.
  3. I suggest conducting more detailed ablation experiments for TASK SELECTION to provide stronger evidence for its effectiveness.
  4. The experimental section lacks sufficient detail, such as the absence of descriptions for hyperparameter settings. Providing these details would make the findings more convincing.
Comment

We thank the reviewer for the valuable comments and address the questions below.

Recent advancements in the field of multitask finetuning

We included the detailed related work in Appendix C. We refer the reviewer to Section Multitask Learning and Adapting Foundation Models in Appendix C for multitask finetuning. Please let us know if the reviewer has further concerns or suggestions.

Visual diagrams in Section 3.2

The procedure of the proposed task selection method was illustrated in Algorithm 1 of the main paper. Per the reviewer’s request, we further highlight the key intuition using the diagram in Figure 2 under Section 3.2 of the new revised manuscript.

Detailed ablation experiments for task selection

We conducted several ablation studies presented in Appendix G.4 that examine the task selection:

  • G.4.1: Violate both consistency and diversity: We maintained the same level of sample complexity but incorporated data from unrelated tasks during finetuning. This resulted in a decline in performance.
  • G.4.2: Only retain diversity while violating consistency: Introducing additional unrelated data during the finetuning phase also led to a deterioration in performance.

Per the reviewer’s request, we further included an additional ablation study on our task selection algorithm, focusing solely on either consistency or diversity, while compromising on the other:

  • Ignoring Diversity: If the algorithm terminates early without fulfilling the stopping criteria, the data utilized in finetuning tasks fails to encompass all the attributes present in the target data. This leads to a breach of the diversity principle.
  • Ignoring Consistency: Conversely, if the algorithm persists beyond the stopping criteria, the finetuning tasks become overly inclusive, incorporating an excessive amount of unrelated data, thus breaching the consistency.

We include details in Appendix G.4.4 and Table 9 of the new revised manuscript.

Descriptions for hyperparameter settings

We detailed the experimental settings and hyperparameters in Appendix G.2, including the number of tasks, the number of classes (N) per task, the number of shot samples (K), and the number of query samples (Q).

We have further expanded the appendix to include specifics on acquiring image embeddings and the methods used to measure consistency and diversity. Additionally, we have included information on the hyperparameters used during the finetuning optimization process. This expansion ensures that our experimental procedures are reproducible and that our findings can be validated. All these additional details are available in Appendix G.2.

Comment

Dear Reviewer 72L9,

Since the deadline for the author-reviewer discussion phase is fast approaching, we would like to follow up with you to see if you have any further questions. In our rebuttal, we have tried to address all your questions. In particular, we clarified the related work and the hyperparameter descriptions. We also added a visual diagram and a more detailed task selection ablation study. We are willing to provide more clarification if you have any additional concerns. Thank you!

Best,

Authors

Comment

We thank all the reviewers for their constructive and valuable feedback. We are honored that all reviewers unanimously agree that our multitask finetuning theoretical analysis with diversity and consistency metrics is intuitive. Most reviewers agree that our task selection algorithm is nice, practical, and effective (1yNr, mX1E, AgCa). Multiple reviewers note that the experiments are extensive, which supports our methods and analysis (1yNr, mX1E, AgCa). We are glad that our writing is recognized as good and clear (AgCa).

We have submitted a new revised manuscript based on the feedback. In short, we added a visual diagram in Section 3.2, additional hyperparameter details in Appendix G.2, and an ablation study on the task selection algorithm in Appendix G.4 and Table 9. We highlight our revisions in blue.

Several reviewers have pointed out that the Gaussian distribution assumption in Section 3.2 is a simplification relative to our theory (1yNr, mX1E). In the following, we address this specific issue; additional questions are responded to individually for each reviewer.

We agree that our assumption of a Gaussian distribution is indeed a simplification of the theoretical model we have proposed. However, this simplification was made to capture the essence of the theoretical insights while ensuring the algorithm remains effective and applicable in practical settings. We consider each latent class/representation as a feature vector. Both consistency and diversity can be depicted through the variations in the entries of latent classes/representations. By using the angle between mean vectors and a variance-scaled $L_2$ distance, we provide a computationally feasible proxy for these theoretical constructs without compromising the effectiveness of the algorithm in our empirical study.
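Below is one plausible reading of the resulting selection loop as a numpy sketch: sort candidate tasks by similarity of mean features to the target (consistency), then greedily add tasks until the variance-scaled distance from target features to the selected pool drops below a tolerance (coverage). The function name, sorting rule, and exact stopping criterion are illustrative assumptions, not the precise form of Algorithm 1.

```python
import numpy as np

def select_tasks(target_feats, candidate_feats, coverage_tol=1.0):
    """Greedy consistency-then-coverage task selection (illustrative sketch).
    `target_feats` is (n, d); `candidate_feats` is a list of (n_i, d) arrays."""
    mu_t = target_feats.mean(axis=0)

    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    # Rank candidates by angle between mean vectors (consistency proxy).
    order = sorted(range(len(candidate_feats)),
                   key=lambda i: -cos(mu_t, candidate_feats[i].mean(axis=0)))
    selected, pool = [], []
    for i in order:
        selected.append(i)
        pool.append(candidate_feats[i])
        feats = np.concatenate(pool, axis=0)
        var = feats.var(axis=0) + 1e-6
        cover = (((target_feats - feats.mean(axis=0)) ** 2) / var).mean()
        if cover <= coverage_tol:   # stop once target features are covered
            break
    return selected
```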

This was a starting point for our theoretical framework, and we aim to develop more refined algorithms that can account for different types of data distributions in future work as Reviewer mX1E suggested.

Comment

Per reviewer AgCa's request, we added an additional MAML baseline in Appendix G.8 and Table 14 in the second version of the revised manuscript. MAML can slightly outperform Adaptation and Standard FT under certain settings (DINOv2 ViT-S on tieredImageNet), but lags behind our multitask finetuning. The reason is that in few-shot tasks, support samples (shot images) are very limited, preventing the model from learning a good set of parameters for adapting to new tasks.

AC Meta-Review

This paper presents theoretical results that, to adapt a pretrained model to a target task, it is beneficial to first finetune the model on a diverse set of related tasks (potentially with many labels) and then tune the model on the target task (potentially with few labels).

The presented theoretical results are quite intuitive. The overall experiments are quite extensive and echo the theoretical results, clearly demonstrating the advantages of multitask finetuning over standard finetuning and direct adaptation.

I recommend acceptance as a poster.

Why Not a Higher Score

While several bounds have been derived, their implications are quite straightforward. For instance, the findings that the model with multitask fine-tuning outperforms the one without it, and that the selected auxiliary tasks should exhibit sufficient diversity and strong consistency w.r.t. the target task, are well accepted by the community.

Why Not a Lower Score

Intuitive theoretical results, extensive experiments

Final Decision

Accept (poster)