PaperHub

Overall rating: 6.0 / 10 · Poster · 4 reviewers
Individual scores: 3, 6, 8, 7 (min 3, max 8, std 1.9)
Confidence: 4.5 · Correctness: 3.0 · Contribution: 3.3 · Presentation: 3.5
NeurIPS 2024

Dual-Personalizing Adapter for Federated Foundation Models

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06


Keywords
Federated Learning · Foundation Models · Personalization · Test-Time Distribution Shifts

Reviews and Discussion

Review (Rating: 3)

This paper focuses on personalization and test-time adaptation in federated learning of foundation models. The authors propose two solutions for this setting and show their performance on two splits of the FLAN dataset.

Strengths

  • The paper is generally easy to read.
  • The constructed experimental setup is generally reasonable.

Weaknesses

  • The motivation and significance of the setting are not sufficiently strong. The training method in this paper is not new in personalized FL [1,2]. The test-time adaptation method is straightforward and does not convey new insight. Therefore, I do not clearly see the significance of considering personalization and test-time adaptation at the same time.

  • Although the paper considers personalization and test-time adaptation at the same time, it proposes two methods: one performs better at personalization while the other performs better at test-time adaptation. This is really confusing. If the authors cannot find one solution that fits both metrics, then why consider these two issues in one paper?

  • Limited performance gain. From the two main tables in the paper, FedLoRA is stable across the two metrics and consistently performs second best.

[1] Exploiting shared representations for personalized federated learning

[2] Perada: Parameter-efficient and generalizable federated learning personalization with guarantees

Questions

  • Why is the convergence curve of FedIT so distinct from the others? Also, it is unclear what the difference between FedIT and FedLoRA is.

Limitations

yes

Author Response

W1: Cannot clearly see the significance of considering personalization and test-time adaptation at the same time.

Our proposed method does not perform test-time adaptation. In general, test-time adaptation methods [1] involve fine-tuning steps to align model parameters with new distributions at test time. In contrast, our proposed method is designed without any adaptation step, aiming instead for test-time generalization, where clients must maintain performance on their main tasks while also demonstrating generalization capabilities on new tasks during the testing stage.

Moreover, our method addresses an innovative application scenario known as federated foundation models (FFM) [2], an emerging domain that significantly diverges from traditional federated learning methods. We are the first to explore test-time generalization in personalized FFM. Unlike the works you mentioned, we define a new setting (test-time personalization) in FFM, propose a new training paradigm with self-reconstructed datasets to train personalized FFMs in extreme scenarios such as cross-task settings, and introduce a new dynamic weighting mechanism to determine the combination of different adapters for test-time personalization. We will add a discussion of the differences between our work and the mentioned works to the related work section.

[1] Liang, J., He, R., & Tan, T. (2024). A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, 1-34.

[2] Ren, C., Yu, H., Peng, H., Tang, X., Li, A., Gao, Y., ... & Yang, Q. (2024). Advances and open challenges in federated learning with foundation models. arXiv preprint arXiv:2404.15381.

W2: Proposing two methods and not finding one solution that fits both metrics.

Our paper aims to improve test-time generalization without sacrificing performance on the personalization metric in federated settings. In traditional machine learning, an algorithm is evaluated on both a validation dataset and a test dataset, which in our context correspond to the personalization metric and the test-time generalization metric, respectively.

Our proposed framework (Section 4), model architecture (Figure 1), and loss function (Equation 1) form a unified design. The two mentioned methods can be considered two settings of a hyperparameter that controls the updating strategy (sequential or iterative) of the local and global adapters within the federated learning process. Specifically, FedDPA-F adopts a sequential approach, first optimizing the global adapter and then the local adapter. In contrast, FedDPA-T alternates between optimizing the global and local adapters iteratively during each communication round; a sketch of the two strategies follows the reference below. To facilitate understanding, Figure 2 illustrates the two options in the learning process. In distributed learning scenarios, trying different updating strategies is common practice in federated learning, e.g., [3].

[3] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith, “Ditto: Fair and Robust Federated Learning Through Personalization”, ICML 2021
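For illustration, here is a minimal, self-contained PyTorch sketch of the two updating strategies (simplified: a frozen linear layer `base` stands in for the foundation model, `LoRAAdapter` is a toy low-rank adapter, and the loss is squared error; the actual method trains LoRA modules inside an LLM and aggregates the global adapter across clients on the server, which is omitted here):

```python
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """Toy LoRA-style adapter: a trainable low-rank residual x -> x A^T B^T."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.A = nn.Parameter(torch.randn(rank, dim) * 0.01)
        self.B = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.A.T @ self.B.T

def loss_fn(pred, target):
    return ((pred - target) ** 2).mean()

def local_round_feddpa_f(base, g_adapter, l_adapter, batches, lr=1e-2):
    """FedDPA-F (sequential): finish the global adapter first, then the local one."""
    opt_g = torch.optim.SGD(g_adapter.parameters(), lr=lr)
    for x, y in batches:  # phase 1: global adapter only
        opt_g.zero_grad()
        loss_fn(base(x) + g_adapter(x), y).backward()
        opt_g.step()
    opt_l = torch.optim.SGD(l_adapter.parameters(), lr=lr)
    for x, y in batches:  # phase 2: local adapter (frozen global omitted for brevity)
        opt_l.zero_grad()
        loss_fn(base(x) + l_adapter(x), y).backward()
        opt_l.step()

def local_round_feddpa_t(base, g_adapter, l_adapter, batches, lr=1e-2):
    """FedDPA-T (iterative): alternate global/local adapter updates on every batch."""
    opt_g = torch.optim.SGD(g_adapter.parameters(), lr=lr)
    opt_l = torch.optim.SGD(l_adapter.parameters(), lr=lr)
    for x, y in batches:
        opt_g.zero_grad()
        loss_fn(base(x) + g_adapter(x), y).backward()
        opt_g.step()
        opt_l.zero_grad()
        loss_fn(base(x) + l_adapter(x), y).backward()
        opt_l.step()

# Toy usage: a frozen linear layer stands in for the foundation model.
dim = 8
base = nn.Linear(dim, dim).requires_grad_(False)
batches = [(torch.randn(4, dim), torch.randn(4, dim)) for _ in range(3)]
local_round_feddpa_f(base, LoRAAdapter(dim), LoRAAdapter(dim), batches)
```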

W3: Limited performance gain.

The objective of this paper is to improve the test-time generalization of federated foundation models without sacrificing personalization performance. In the two main tables (Tables 1 & 2), on the test-time generalization metric, our proposed method shows significant improvement over FedLoRA. For example, FedDPA-F and FedDPA-T demonstrate average improvements of approximately 2.3% and 0.9%, respectively, over FedLoRA. Notably, on the summarization task, our approach achieves an 8.4% higher score than FedLoRA.

Similarly, on the personalization metric, our method achieves relatively higher performance than FedLoRA. For example, FedDPA-F and FedDPA-T demonstrate average improvements of approximately 0.5% and 2.3%, respectively, over FedLoRA. Notably, on the reading comprehension task, our methods outperform FedLoRA by 5.9% in terms of personalization.

Moreover, we will add an analysis of statistical significance. Specifically, we will take as hypothesis that our method improves test-time generalization performance, and then employ statistical tests (e.g., a p-value approach) to verify the significance and validate the efficacy and confidence of our method.
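As an illustration of what such a test could look like (a sketch only, not an analysis from the paper): a paired t-test over per-task scores, here using the FedDPA-F vs. FedLoRA Dataset 2 test-time personalization numbers that appear in the table later in this thread.

```python
from scipy import stats

# Per-task test-time personalization scores on Dataset 2
# (from the table posted later in this thread).
fedlora  = [69.60, 71.64, 71.09, 71.28, 65.63, 68.89, 70.32, 70.44]
feddpa_f = [71.64, 72.28, 72.42, 72.39, 71.12, 70.46, 71.00, 71.82]

# Paired t-test: the same eight tasks are scored under both methods.
t_stat, p_value = stats.ttest_rel(feddpa_f, fedlora)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # p < 0.05 would indicate a significant gain
```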

Q1: Difference between FedIT and FedLoRA.

Compared to personalized federated learning methods, FedIT focuses on training a global model applicable across all clients, without incorporating personalization techniques. At each communication round of the convergence curve, FedIT evaluates the global model on test data. Conversely, FedLoRA, a personalized federated learning method, evaluates performance using fine-tuned models at each communication round. Therefore, FedIT shows relatively lower accuracy in the early stage of the learning process. For a more detailed discussion, please refer to Appendix A.2.

Comment

Dear Reviewer 94he,

As the author-reviewer discussion period is approaching its end on 13 August, we respectfully ask whether we have addressed all your questions and concerns.

Comment

Dear Reviewer 94he,

We believe that we have addressed the concerns that you have raised. Specifically,

  1. Our work does not focus on test-time adaptation [1] but on test-time generalization, where clients must maintain performance on their main tasks while also generalizing to new tasks during testing, without any adaptation step. Unlike the works you mentioned, ours is the first to explore the test-time generalization of personalized federated foundation models (FFM) [2], an emerging domain that significantly diverges from traditional FL. We establish a benchmark by defining the new setting (test-time personalization), proposing a new training paradigm with self-reconstructed datasets for personalized FFMs, and introducing a new dynamic weighting mechanism to determine the combination of different adapters for test-time personalization.

  2. Our proposed solution is a unified design; the two mentioned methods are the same solution with different hyperparameters (different optimizing strategies), as illustrated in Section 4, and trying different optimizing strategies is common in FL, e.g., [3]. Since our paper aims to improve test-time generalization without sacrificing personalization performance, we take the personalization metric as validation results and the test-time generalization metric as test results.

  3. The objective of our paper is to improve the test-time generalization of FFM without sacrificing personalization performance. In our two main tables (Tables 1 & 2), our proposed method shows an average improvement of 2.3% over FedLoRA on the test-time generalization metric (even 8.4% higher on the summarization task), with relatively higher personalization performance.

  4. The difference between FedIT and FedLoRA lies in whether there is a personalization step. FedIT focuses on training a global model applicable across all clients, while FedLoRA adapts the global model to each client with further fine-tuning for personalization. For a more detailed discussion, please refer to Appendix A.2.

[1] Liang, J., He, R., & Tan, T. (2024). A comprehensive survey on test-time adaptation under distribution shifts. International Journal of Computer Vision, 1-34.

[2] Ren, C., Yu, H., Peng, H., Tang, X., Li, A., Gao, Y., ... & Yang, Q. (2024). Advances and open challenges in federated learning with foundation models. arXiv preprint arXiv:2404.15381.

[3] Tian Li, Shengyuan Hu, Ahmad Beirami, and Virginia Smith, “Ditto: Fair and Robust Federated Learning Through Personalization”, ICML 2021


We would like to gently remind you that the end of the discussion period is imminent. We would appreciate it if you could let us know whether our comments addressed your concerns.

Best regards, Authors

Comment

Thanks for the responses. However, my concerns are not well addressed.

  • W1. I still do not clearly see the motivation for considering such a scenario. Are there specific and reasonable real-world application scenarios?
  • W2 & W3. It seems that you are cherry-picking results in the rebuttal. In Table 1, your two methods perform comparably with FedLoRA, and no single method consistently performs best. Specifically, for test-time personalization, FedDPA-T performs worse than FedLoRA.
Comment

Thanks for your responses.

For W1, consider a recommendation system as a real-world application scenario. Each user has their own areas of interest (e.g., user1 focuses on book reading), and there is sufficient historical clicking/viewing data for training and personalization. However, users may develop new interests at any time; for example, user1, who focuses on reading, becomes interested in dancing. Since the user has no historical data about the new interest, and recommendations for the new interest only happen during inference/testing, the test-time generalization ability of the personalized model becomes significant in this case.

For W2 and W3, our main focus is to improve generalization ability without sacrificing personalization performance; therefore, as long as our method outperforms other methods on test-time personalization with comparable results on personalization, it demonstrates the effectiveness of our method. FedDPA-T and FedDPA-F are the same solution with different optimizing strategies, which means that either method performing better than other baselines on test-time personalization supports the effectiveness of our proposed solution. In Tables 1 & 2, FedDPA-F consistently performs better than FedLoRA on test-time personalization and maintains comparable performance on personalization, which sufficiently demonstrates the effectiveness of our proposed solution. FedDPA-T performing worse than FedLoRA only indicates that the optimizing strategy used in FedDPA-T is not beneficial for test-time generalization.

These results are from Tables 1 & 2 on test-time personalization:

| Dataset 1 | Paraphrase | Entailment | Structure to Text | Text Formatting | Linguistic Acc | Word Dis | Coreference | Question CLS |
|---|---|---|---|---|---|---|---|---|
| FedLoRA | 75.56 | 76.55 | 75.21 | 74.94 | 76.16 | 74.64 | 74.99 | 76.97 |
| FedDPA-F | 78.10 | 77.36 | 77.18 | 76.98 | 77.11 | 76.23 | 76.84 | 77.19 |

| Dataset 2 | Paraphrase | Commonsense | Entailment | Text Formatting | Summarization | Reading Com | Sentiment | Open QA |
|---|---|---|---|---|---|---|---|---|
| FedLoRA | 69.60 | 71.64 | 71.09 | 71.28 | 65.63 | 68.89 | 70.32 | 70.44 |
| FedDPA-F | 71.64 | 72.28 | 72.42 | 72.39 | 71.12 | 70.46 | 71.00 | 71.82 |
Comment

Dear Reviewer 94he,

We sincerely hope our responses have resolved your concerns; if you have further questions, please let us know.

Best Regards, Authors

Review (Rating: 6)

The authors use two sets of adapters to personalize a model with federated learning. The idea is to use FL to learn the global adapter, while each device has a local adapter to personalize the model for that client.

Strengths

  • The paper is well written and easy to understand.
  • The overall approach is sound.
  • The authors evaluated their approach on a number of NLP tasks.

Weaknesses

Overall this is an interesting paper. However, here are some areas that could be improved:

  • Firstly, the overall premise: the authors say this approach is used to learn from distribution shifts between training and test datasets. However, that is the whole reason we use federated learning. Typically we have models trained on public datasets (e.g., ImageNet, Wikipedia), but these don't work in real-world applications because the domain is slightly different. FL was proposed so we can get a real-world (and real-time) signal about in-domain distributions in a privacy-preserving way. As a result, we actually do want to learn this "test" distribution during FL. It is unclear why, in the FL scenario, the "test" distribution is not similar to the training distribution. Overall, the paper would benefit from a much stronger motivation.

  • There are many works that use federated learning to learn a global model and then personalize it for each client (e.g., https://arxiv.org/abs/2103.00710). As a result, the overall approach of training a global model and then adding client-side personalization is not entirely novel. Furthermore, there have been many works that train adapters instead of full models to save on communication and on-device compute.

  • In Section 3.1 the authors say there are two datasets belonging to two different distributions (Ps vs. Pt): why does this happen within a device? How do you detect that the distribution has shifted? How does this happen in practice?

  • The evaluation has been done in a very artificial scenario where each of the 8 devices has a completely different NLP task (Section 5.1). This is quite extreme and likely favours methods that take advantage of personalization. However, this extreme is unlikely to occur in real FL deployments: typically we want to build a global model from a set of non-IID users, but not to the extreme where each user has a different task. Furthermore, we attempt to train large-enough models that can tolerate some distribution shift (as long as it has been seen in the data).

Questions

See Above

Limitations

N/A

Author Response

W1: The paper would benefit from a much stronger motivation.

Unlike traditional FL (especially personalized FL), which primarily addresses heterogeneity among clients during training, our setting also considers heterogeneity within each client during testing, building on insights from previous work [1]. Since the learning process of foundation models involves more heterogeneous datasets from different tasks than traditional machine learning methods, our setting considers more complex scenarios with various distribution shifts in federated foundation models. For example, in traditional personalized FL, each client's training and testing are restricted to the same distribution (e.g., client 1 trains and tests on task 1); in contrast, our setting involves clients training on one task and testing on multiple different tasks (e.g., client 1 trains on task 1 and tests on tasks 1, 2, 3, ...).

Moreover, imagine the on-device foundation model of the near future: it should be lightweight and versatile enough to tackle many different tasks. With this motivation, this paper takes a first step towards enabling lightweight on-device foundation models that achieve the desired performance on training data while also gaining better test-time generalization when the test data come from another domain or task.

[1] Tan, Y., Chen, C., Zhuang, W., Dong, X., Lyu, L., & Long, G. (2024). Is heterogeneity notorious? taming heterogeneity to handle test-time shift in federated learning. Advances in Neural Information Processing Systems, 36.

W2: Novelty of our methods.

As discussed in the recent literature on Federated Foundation Models (FFM) [2], many new challenges need to be rethought. For example, training versatile foundation models requires incorporating more heterogeneous datasets within the federated learning framework. Moreover, communication-efficient model fine-tuning strategies also need to be considered for federated foundation models. Before the research community can go further and explore more exciting applications of federated foundation models, some foundational work must be done. Our work rethinks the problems of traditional FL (especially PFL) in the context of FFM and establishes a benchmark for personalized FFM.

We are the first to explore the test-time generalization of personalized FFM. The overall design is new, with a refined setting (test-time personalization) in FFM, and the technical framework is composed of multiple pieces, including a new training paradigm with self-reconstructed datasets to train personalized FFMs and a new dynamic weighting mechanism to determine the combination of different adapters for test-time personalization.

[2] Ren, C., Yu, H., Peng, H., Tang, X., Li, A., Gao, Y., ... & Yang, Q. (2024). Advances and open challenges in federated learning with foundation models. arXiv preprint arXiv:2404.15381.

W3: Different distributions within a device and how to detect shifts.

In federated foundation models, each client's device will be deployed with a lightweight intelligent assistant to tackle a broad range of downstream tasks, so it is very likely that a client will need to tackle unseen tasks with different distributions during testing.

In this paper, we make new assumptions for FFM: 1) for each device, the test distribution may differ from the training distribution; 2) different clients have different distributions; and 3) the test distribution of client A may be observed by another client among its training tasks. For example, in practice, a client accustomed to writing emails in English may require translation assistance when working on a new project in Chinese. Here, the client has ample historical data for training on English email writing, but none for Chinese translation, which only emerges during the testing/inference phase. Meanwhile, other clients may specialize in translation tasks.

To detect distribution shifts during the testing phase, we designed an instance-wise dynamic weighting mechanism. Specifically, we measure the similarity between the representations of test and training instances to adjust the importance weights of the local and global adapters; a sketch follows below. A more detailed discussion can be found in Section 4.3.
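For illustration, here is a simplified sketch of such an instance-wise weighting (assumptions made for this sketch only: cosine similarity against cached training representations, the maximum similarity clipped to [0, 1] as the local-adapter weight, and a simple additive combination of adapter outputs; see Section 4.3 of the paper for the exact formulation):

```python
import torch
import torch.nn.functional as F

def dynamic_alpha(test_repr: torch.Tensor, train_reprs: torch.Tensor) -> torch.Tensor:
    """Weight for the local adapter: high when the test instance resembles training data.

    test_repr:   (d,)   representation of the incoming test instance
    train_reprs: (n, d) cached representations of the client's training instances
    """
    sims = F.cosine_similarity(test_repr.unsqueeze(0), train_reprs, dim=-1)  # (n,)
    return sims.max().clamp(0.0, 1.0)  # sketch: max similarity, clipped to [0, 1]

def dual_adapter_output(base_out, local_out, global_out, alpha):
    # In-distribution inputs (alpha -> 1) lean on the personalized local adapter;
    # shifted inputs (alpha -> 0) lean on the generalizing global adapter.
    return base_out + alpha * local_out + (1.0 - alpha) * global_out

# Toy usage
d = 16
alpha = dynamic_alpha(torch.randn(d), torch.randn(32, d))
```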

W4: Extreme evaluation setting.

The problem setting and application scenarios of federated foundation models differ from those of traditional federated learning methods. Our research focuses on the challenges of FFM, wherein foundation models, pre-trained on extensive datasets, already exhibit a degree of generalization towards general non-IID problems, a topic thoroughly investigated in traditional FL. Consequently, in the realm of FFM, we aim to address more complex scenarios, such as cross-domain and cross-dataset challenges, where centralized foundation models typically underperform [3].

In future work, we could add experiments covering traditional federated learning settings by splitting a dataset into many pieces with slight non-IIDness. We believe our proposed method can be easily adapted to this type of non-IID setting.

[3] Wang, Y., Ivison, H., Dasigi, P., Hessel, J., Khot, T., Chandu, K., ... & Hajishirzi, H. (2023). How far can camels go? exploring the state of instruction tuning on open resources. Advances in Neural Information Processing Systems, 36, 74764-74786.

Comment

I want to thank the authors for their answers. After reading the rebuttal and the other reviews, the authors did address some of the comments. I am still not entirely convinced about the motivation (i.e., real-world use-cases), but I will increase my score to reflect this.

Comment

Dear Reviewer mFUh,

As the author-reviewer discussion period is approaching its end on 13 August, we respectfully ask whether we have addressed all your questions and concerns.

Review (Rating: 8)

This paper proposes a novel dual-personalizing adapter to tackle test-time distribution shift for federated foundation models (FedFM). FedFM is a new research domain that aims to enhance foundation models by leveraging many fine-tuning tasks on many protected datasets of end users. The solution essentially tackles the trade-off between personalisation and generalisation of the client-specific models in FedFM.

优点

  1. FedFM is a new research domain that has been regarded as an important pathway to enhance foundation models by leveraging decentralized data. Tackling the tradeoff between client-specific personalisation and generalisation is a key challenge of the FedFM research domain.
  2. The proposed method is a new idea with original design. Moreover, this paper is the first work to discuss the test-time generalization challenge on FedFM.
  3. The proposed method is technically sound. The design of the method is simple yet effective, and it is very easy to follow.
  4. In terms of clarity, the paper’s contents are well organized and clearly presented with easy-to-understand figures and well-described details.
  5. The appendix with experiment details and the source codes are provided to support the reproducibility of this work.

Weaknesses

  1. The importance weights of the global and personalised adapters are a key factor of the proposed dual-personalising adapter mechanism in FedFM. A more insightful discussion is expected to analyse the selection of the importance weights.
  2. The proposed instance-wise trade-off mechanism essentially relies on the similarity between the test instance and the training dataset. The current solution is straightforward and reasonable; however, it could be more elaborately designed.
  3. It would be better if a large-scale experiment could be conducted to further evaluate the proposed method. However, considering the limited computation resources, this paper's experiments are sufficient as an initial exploration and discussion of this research direction.

Questions

  1. In line 260, which type of similarity function is chosen and why?
  2. According to the design of alpha, if a new test instance is more similar to the client's training instances in representation space, the local adapter gains a bigger importance weight. Is there any theoretical discussion about this?
  3. Why does FedDPA-F win more often than FedDPA-T?
  4. Are there data sources other than Flan that can be used in this type of setting?

Limitations

N/A

Author Response

W2 and W3: Design of the trade-off mechanism and large-scale experiments.

Thanks for your insightful advice. We will further explore better mechanisms from the perspective of generalization theory and evaluate on more datasets from real applications.

Q1: Choice of similarity function.

We selected cosine similarity due to its robustness and normalization with high-dimensional vectors compared with other metrics; its effectiveness has been demonstrated in many NLP/CV works. Additionally, we conducted an ablation study comparing various similarity functions in Appendix B.2, and the results in Table 9 show that cosine similarity outperforms the other similarity functions.
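To illustrate the normalization point with a toy snippet (ours, not from the paper): cosine similarity is bounded and scale-invariant, whereas dot-product and Euclidean-based scores depend on vector magnitudes, which vary widely for high-dimensional representations.

```python
import torch
import torch.nn.functional as F

a = torch.randn(768)
b = 10.0 * a + 0.1 * torch.randn(768)  # same direction as a, very different magnitude

print(F.cosine_similarity(a, b, dim=0))  # ~1.0: bounded in [-1, 1], scale-invariant
print(torch.dot(a, b))                   # large and unbounded: magnitude-dependent
print(-torch.norm(a - b))                # Euclidean-based score: also magnitude-dependent
```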

W1 and Q2: Theoretical analysis of the design of alpha.

In Section 3.2, we conducted a preliminary theoretical analysis of the discordance between personalization and test-time tasks, which motivated us to learn two adapters: a local adapter for the personalization task (training distribution) and a global adapter for test-time tasks. Therefore, when a test instance is more similar to the client's training instances, we infer that its distribution is closer to the training data; and because the local adapter, aimed at personalization, contains more knowledge of the training distribution, we increase the weight of the local adapter. We will further explore the theoretical relation between them.

Q3: Why FedDPA-F wins more often than FedDPA-T.

This is because FedDPA-F maintains a more generalized local adapter than FedDPA-T. The difference between FedDPA-F and FedDPA-T lies in the local adapter: FedDPA-F uses the global adapter to initialize the local adapter, while FedDPA-T initializes the local adapter randomly. Given that the global adapter, aggregated over different distributions, maintains certain generalization capabilities, the local adapter of FedDPA-F generalizes better than that of FedDPA-T, which leads to better performance on most test-time tasks. We will add this discussion to our experimental analysis for clarity.

Q4: Other data sources.

Our method is designed with a high degree of flexibility for adaptation across various data sources. As illustrated in Appendix A.1, any public text data source (e.g., Dolly), unified into a generative task with appropriate prompts as in Table 6, can be used in this setting. In addition, other types of data can also be handled by replacing the LLM with a corresponding foundation model; for example, for image datasets (e.g., ImageNet, MS COCO), a ViT can be used as the foundation model in our methods. We will add a discussion about adaptation to other data sources.

Review (Rating: 7)

Federated Foundation Models (FedFM) are an emerging research domain that studies collaboratively fine-tuning pre-trained foundation models. This paper studies a test-time distribution shift problem in FedFM by proposing a new dual-personalising adapter.

Strengths

  1. The proposed method is novel. The targeted problem is new, and a new setting is created in the experimental study.

  2. The paper presents high-quality content and a novel design. The claimed points are well supported by theoretical discussion and experimental analysis. The appendix and source codes provide sufficient details of the experiment.

  3. The clarity of this paper is excellent.

  4. The targeted problem is significant to the emerging domain of FedFM.

Weaknesses

  1. According to Eq. 1, P_all denotes all potential distributions. However, in real applications, the clients are usually insufficient to represent all possible distributions. A discussion is required to clarify this assumption.

  2. The implementation framework relies heavily on LoRA, which reduces the contribution.

  3. The experiment setting assumes each client owns one type of dataset or task. What is the difference from multi-task learning? Should it be compared with some multi-task learning baseline methods?

Questions

  1. Is it possible to apply this dual-personalizing adapter framework to other types of data, for example, image recognition, recommendation, time series, and multimodal data?

  2. Is it possible to merge the two federated datasets into one so that you have 16 clients?

  3. Line 315: how do you fix the value of alpha? According to the design, the value of alpha should be decided by the dynamic weighting mechanism.

  4. In future work, the authors mention "theoretical analysis and more datasets". Why do you think a theoretical analysis would make a significant difference to this paper? What are the major challenges in applying this method to other datasets that prevented you from finishing these experiments in this paper?

Limitations

N/A

Author Response

W1: Discussion of the assumption.

Yes, our main experiments are based on the assumption that all possible distributions are covered by the union of the clients. We already discussed this in the first paragraph of Section 4 and in Appendix C. To enhance clarity, we will rewrite this part and put the keywords in bold.

W2 and Q1: Heavy reliance on LoRA and adaptation to other data types.

Our method is designed with a high degree of flexibility, facilitating its adaptation to various adapter-based PEFT methods and transformer-based foundation models for different data types. In this paper, we use an LLM and LoRA as examples to illustrate our method; it can easily be adapted to other frameworks by substituting LoRA and the LLM with alternative adapter-based PEFT methods and transformer-based foundation models. We will revise the paper to include a discussion of adaptation to other adapter-based PEFT methods (e.g., series adapters) and other foundation models (e.g., ViT for images and UNITER for multimodal data).

W3: Difference from multi-task learning.

Multi-task learning is centralized, and tuning a centralized foundation model on the combination of multi-task data can be taken as the multi-task learning baseline. Since, in the centralized foundation model, all tasks are standardized into a uniform format (refer to Appendix A.1 and Table 6) and the model already acquires task-agnostic token embeddings through extensive pre-training, directly tuning on these multi-task data implements multi-task learning with foundation models [1]. We have included this baseline, referred to as "Centralized," in our experimental comparisons. In addition, our setting accounts for test-time distribution shifts, a factor not typically addressed in multi-task learning. Although our main experiments assume that all possible distributions are covered by the clients, our setting also allows scenarios involving distributions unseen by all clients, and our methods have demonstrated robustness in these scenarios, as detailed in Appendix C.1. We will add a discussion to the related work to clarify the difference from multi-task learning.

[1] Yu, J., Dai, Y., Liu, X., Huang, J., Shen, Y., Zhang, K., ... & Chen, Y. (2024). Unleashing the Power of Multi-Task Learning: A Comprehensive Survey Spanning Traditional, Deep, and Pretrained Foundation Model Eras. arXiv preprint arXiv:2404.18961.

Q2: Merging two federated datasets into one for 16 clients.

Yes, we can merge the two datasets into one to obtain more clients. As explored in the ablation study on the impact of client number in Section 6.2, our methods maintain performance when scaling up the number of clients (up to 40), as shown in Figure 4. Therefore, our methods should remain effective with more clients and more tasks.

Q3: How to fix the value of alpha.

In line 315, we aim to investigate the impact of the global adapter on the training of the local adapter in FedDPA-T. As illustrated on the left of Figure 2(b), during the local adapter training of FedDPA-T, we employ the frozen global adapter to expedite the learning of the local adapter, because the global adapter contains task-related knowledge that accelerates the local adapter's learning. It is worth noting that during the training phase there are no distribution shifts; such shifts occur solely in the inference/testing phase, and the dynamic weighting mechanism is used only at inference to determine the value of alpha for addressing test-time distribution shifts. In contrast, during training, alpha only controls the contribution of the frozen global adapter to the local adapter's learning. Therefore, during the training of FedDPA-T, we can fix the value of alpha, as its function differs from that during inference; a sketch of the two roles follows below. To enhance clarity, we will revise the notation and use two distinct symbols to differentiate these uses.
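For illustration, a simplified sketch contrasting the two roles of alpha (the fixed training value of 0.5 and the additive combination are illustrative assumptions for this sketch, not values from the paper; `base` and the adapters are as in the earlier toy sketch):

```python
import torch

ALPHA_TRAIN = 0.5  # fixed hyperparameter during FedDPA-T local training (illustrative)

def train_step_local_adapter(base, global_adapter, local_adapter, x, y, opt_local):
    """FedDPA-T local-adapter step: the frozen global adapter contributes with a fixed alpha."""
    with torch.no_grad():
        g_out = global_adapter(x)  # frozen: no gradient flows into the global adapter
    pred = base(x) + ALPHA_TRAIN * local_adapter(x) + (1.0 - ALPHA_TRAIN) * g_out
    loss = ((pred - y) ** 2).mean()
    opt_local.zero_grad()
    loss.backward()                # updates the local adapter only
    opt_local.step()

def test_time_forward(base, global_adapter, local_adapter, x, alpha):
    """Inference: alpha instead comes from the instance-wise dynamic weighting mechanism."""
    return base(x) + alpha * local_adapter(x) + (1.0 - alpha) * global_adapter(x)
```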

Q4: Theoretical analysis and more datasets.

For the theoretical analysis, we believe further work could rethink this problem within existing theoretical frameworks and establish a new theoretical framework for FFM. For example, one could start from out-of-domain generalization theory to rethink the generalization analysis and establish a new generalization and convergence analysis framework for FFM.

One main challenge in applying the method to more datasets is the significant computational resources required. Given that foundation models often have billions of parameters, they require considerable time and storage for tuning. Additionally, the computational demand varies significantly across data types; for example, a thousand text examples in Flan occupy only about 1 MB, whereas the same number of images in ImageNet can require around 130 MB. Therefore, we believe future work should explore more efficient methods for tuning FFMs on larger datasets.

Comment

The authors' great efforts in the rebuttal are appreciated. After carefully checking it, I find my queries have been well addressed, and I will keep my score.

Author Response

We thank all the reviewers for their valuable reviews. We also appreciate their recognition of the key contributions of our work and the efficacy of our method.

  1. Contributions to federated foundation models:

    • "The targeting problem is significant to the emerging domain of FedFM." (Reviewer D5Ab)

    • "FedFM is a new research domain that has been regarded as an important pathway to enhance foundation models by leveraging decentralized data. Tackling the tradeoff between client-specific personalisation and generalisation is a key challenge of the FedFM research domain." (Reviewer vXiN)

  2. The novelty of our method:

    • "The proposed method is novel. The targeting problem is new while a new setting is created in the experimental study." (Reviewer D5Ab)

    • "The proposed method is a new idea with original design. Moreover, this paper is the first work to discuss the test-time generalization challenge on FedFM." (Reviewer vXiN)

  3. Comprehensive experiments and efficacy of our method:

    • "The paper presents high-quality content and novelty design. The claimed points are well supported by theoretical discussion and experimental analysis. The appendix and source codes provide sufficient details of the experiment." (Reviewer D5Ab)

    • "The appendix with experiment details and the source codes are provided to support the reproducibility of this work." (Reviewer vXiN)

    • "The authors evaluated their approach on a number of NLP tasks." (Reviewer mFUh)

  4. Excellent presentation of our paper:

    • "The clarity of this paper is excellent." (Reviewer D5Ab)

    • "In terms of clarity, the paper’s contents are well organized and clearly presented with easy-to-understand figures and well-described details." (Reviewer vXiN)

    • "The paper is well written and easy to understand." (Reviewer mFUh)

Detailed responses to each reviewer are provided below. We will incorporate all the feedback in the final version.

Comment

Dear Reviewers,

We sincerely thank you for reading our rebuttal.

We hope that our rebuttal has effectively addressed your comments and concerns. If you have any further questions about our paper or the rebuttal, please let us know. Thank you.

Best regards, Authors

Final Decision

In this paper, the authors concentrate on foundation models within the framework of federated learning and propose a novel approach. The paper received four reviews, resulting in the following scores: one Strong Accept, one Accept, one Weak Accept, and one Reject.

On the positive side, most of the reviewers feel that (1) the paper is easy to understand, (2) the proposed method is technically sound, and (3) the overall idea is interesting. At the same time, the reviewers have pointed out several weaknesses, such as unconvincing motivation, unsatisfactory experimental results, and unclear technical details.

In response to these weak points, the authors provided feedback to each reviewer during the rebuttal phase, successfully addressing or clarifying quite a few concerns. Nevertheless, two main issues were not fully addressed by the rebuttal:
(1) Unconvincing motivation: it is not entirely clear which real-world applications motivate the studied scenario and proposed method. (Reviewer mFUh and Reviewer 94he)
(2) Unsatisfactory experimental results: two versions of the solution (FedDPA-F and FedDPA-T) are compared in the experiments, and FedDPA-T can perform worse than the baseline. (Reviewer 94he)
Unfortunately, by the end of the discussion period the scores remained divergent, which makes the paper borderline.

Although this is a tough decision, we recommend accepting this paper for NeurIPS 2024 for the following reasons:
(1) Most of the reviewers feel positive about this work;
(2) The authors have tried to resolve or explain all the weak points during the rebuttal;
(3) From a long-term perspective, it is meaningful for a top-tier conference to encourage researchers to explore emerging research domains (e.g., FedFM, as in this work), even if some preliminary results are not perfect.
Meanwhile, we also hope the authors will consider and address the remaining concerns in the camera-ready version.