Hot-pluggable Federated Learning: Bridging General and Personalized FL via Dynamic Selection
This work proposes a selective federated learning approach to integrate personalized modules into general federated learning.
Abstract
Reviews and Discussion
After authors' responses: The rating has been updated considering the authors' inputs and clarifications. I appreciate their efforts in providing those responses.
To solve the selective FL (SFL) problem, this paper leverages model components that edit a submodel for specific purposes to design a framework referred to as Hot-Pluggable Federated Learning (HPFL). In HPFL, clients individually train personalized plug-in modules based on a shared backbone and upload them, each with a plug-in marker, to a modular store on the server. During the inference stage, a selection algorithm allows clients to identify and retrieve suitable plug-in modules from the modular store to improve their generalization performance on the target data distribution. The paper also provides differential privacy protection during selection, with a theoretical guarantee. The key contributions can be summarized as:
- Identifying a major gap between Generic FL (GFL) and Personalized FL (PFL), and formulating a new problem, SFL, to bridge this performance gap
- Developing a general, efficient and effective framework, HPFL, which practically solves SFL and adds noise to the communicated markers to provide differential privacy protection with a theoretical guarantee
- Experiments on four datasets and three neural networks demonstrating the effectiveness of HPFL
Strengths
The paper is well written. It fairly cites the prior works it builds on and shows how it leverages those solutions. It clarifies what the problem is and which research questions it answers. After carefully reviewing the authors' clarifying points and their responses to my concerns and those of other reviewers, I increased two of my scores.
Weaknesses
The originality of this paper is not clear. For instance, the following are two major claimed contributions according to the paper: "Identifying a major gap between Generic FL (GFL) [e.g., the works of Karimireddy et al., 2019; Woodworth et al., 2020; Tang et al., 2022b] and Personalized FL (PFL) [e.g., the works of Li & Wang, 2019; Chen & Chao, 2021; Li et al., 2021c], and formulate a new problem SFL to bridge this performance; and Developing a general, efficient and effective framework HPFL, which practically solves SFL and adding noise on communicated markers to provide differential privacy protection with theoretical guarantee." However, the methodology seems to be a combination of existing works on PFL, leveraging several existing works with minimal advances. It is not clear how the mentioned theorems add value to the literature; for example, this statement is vague and not adequately explained: "solving SFL means that clients achieve performance in GFL as high as in PFL." It is not clear how the plug-in marker can contribute to bridging the gap between GFL and PFL; it would be very helpful to explain this elaborately. Specifically, please provide a more detailed explanation and a concrete example of how the plug-in markers help bridge the gap between GFL and PFL performance.

While using differential privacy can add value to the method, it is not clear whether it benefits from local DP, central DP, or both. Please clarify which type(s) of differential privacy (local, central, or both) are used, and provide a more detailed explanation of how the plug-ins interact with the DP mechanisms and what novel contributions are made in this area; specifically, how plug-ins affect DP and what the contribution is in this part of the paper.

Figure 2 should also include results of HPFL. The comparison should be expanded to cover more advanced PFL studies to showcase how HPFL performs against those methods; currently it mainly focuses on basic PFL algorithms for comparison purposes. For instance, some key papers in this domain can help the authors provide a more compelling comparison between the proposed method and existing PFL solutions, such as FedAlt/FedSim of Krishna et al., 2022 @ ICML, and for the specific case of PFL with differential privacy, Hu et al., 2020 @ IEEE IoT Journal.
Questions
- Q1. What is the main contribution of HPFL that makes it outperform existing PFL models? This could be described by adding a table with the core update rules of existing PFL methods [including more recent studies] and the proposed method.
- Q2. What is the difference between a plug-in module and a vanilla personalized model? Intuitively they seem to be the same, both using the local model to personalize each client's model.
- Q3. What is the novelty of integrating DP in the algorithm? How does it advance the proposed HPFL solution? Is it simply HPFL+DP, or does the integration pose challenges? If so, what are the challenges and how does this paper tackle them? Why can't other privacy-preservation mechanisms be used?
- Q4. Detailed comparison with two types of studies: i. existing PFL algorithms compared with HPFL; ii. existing privacy-preserving algorithms compared with DP.
- Q5. How can the plug-in marker contribute to bridging the gap between GFL and PFL?
- Q6. Can you please clarify which type(s) of differential privacy (local, central, or both) are used, and provide a more detailed explanation of how the plug-ins interact with the DP mechanisms and what novel contributions are made in this area?
Q5 :
How can the plug-in marker contribute to bridging the gap between GFL and PFL?
Ans for Q5): Below we clarify in detail how HPFL contributes to bridging the gap between GFL and PFL:
- The plug-in mechanism improves GFL performance with PFL modules: Sharing all personalized plug-ins (PFL modules) and making them available to all clients greatly enriches the model space a client can choose from when facing all datasets (the GFL setting).
- The naive selection design is inefficient: Following the design of MoE, one could use a gating layer to identify PFL modules. However, on real-world module-sharing platforms like Hugging Face, a unified, continually updated gating layer is not achievable because new plug-ins arrive asynchronously and continuously.
- Markers help identify the training data distribution of the plug-in modules: We therefore design a selection mechanism based on plug-in markers (which act as an identifier of the client in the selection process). We match the appropriate plug-in to the test data by measuring the distance between plug-in markers and task markers (see the sketch below). Take one of the real-world examples in our introduction: when traveling abroad, a personal map app might recommend entirely different restaurants from those at home. Here, plug-in markers help find the models trained on local restaurant and personal data, which make better recommendations; the recommendation produced by the suitable model is then shown to the user, increasing the overall chance of satisfying them.
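To make the marker matching concrete, here is a minimal sketch of the selection step; the function name and the generic `distance` callable are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def select_plugin(task_marker, plugin_markers, distance):
    """Pick the plug-in whose marker is closest to the client's task marker.

    task_marker:    (n, d) array of intermediate features from the test data
    plugin_markers: list of (m_k, d) arrays, one marker set per stored plug-in
    distance:       callable returning a scalar distance, e.g. empirical MMD
    """
    scores = [distance(task_marker, marker) for marker in plugin_markers]
    return int(np.argmin(scores))  # index of the best-matching plug-in
```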
Q6 :
(1) Can you please clarify which type(s) of differential privacy (local, central, or both) are used, (2) and to provide a more detailed explanation of how the plug-ins interact with the DP mechanisms and (3) what novel contributions are made in this area?
Ans for Q6):
- (1) As explained above in the answer to Q3, DP is applied to the markers instead of the models, so it does not map cleanly onto the local/central taxonomy used for model DP. In our code implementation, the noise is added to the markers on the client side.
- (2) Since DP is applied to the markers rather than to model components such as the backbone or the plug-ins, we believe no direct interaction exists between the plug-ins and the DP mechanism [1].
- (3) Compared with applying DP to models, applying DP to shared information is significantly less explored. We therefore hope our study adds diversity to DP applications by applying DP to shared information other than model parameters.
References
[1] Wei, Kang, Jun Li, Ming Ding, Chuan Ma, Howard H. Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H. Vincent Poor. "Federated learning with differential privacy: Algorithms and performance analysis." IEEE transactions on information forensics and security 15 (2020): 3454-3469.
[2] Acar, Abbas, Hidayet Aksu, A. Selcuk Uluagac, and Mauro Conti. "A survey on homomorphic encryption schemes: Theory and implementation." ACM Computing Surveys (Csur) 51, no. 4 (2018): 1-35.
[3] Knott, Brian, Shobha Venkataraman, Awni Hannun, Shubho Sengupta, Mark Ibrahim, and Laurens van der Maaten. "Crypten: Secure multi-party computation meets machine learning." In NeurIPS, 2021.
[4] Lalitha, Anusha, Shubhanshu Shekhar, Tara Javidi, and Farinaz Koushanfar. "Fully decentralized federated learning." In Third workshop on bayesian deep learning (NeurIPS), 2018.
[5] Nofer, Michael, Peter Gomber, Oliver Hinz, and Dirk Schiereck. "Blockchain." Business & information systems engineering 59 (2017): 183-187.
[6] Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." In ICLR, 2016.
[7] Xie, Cong, Sanmi Koyejo, and Indranil Gupta. "Asynchronous federated optimization." arXiv preprint arXiv:1903.03934 (2019).
[8] Yoon, Jaehong, Wonyong Jeong, Giwoong Lee, Eunho Yang, and Sung Ju Hwang. "Federated continual learning with weighted inter-client transfer." In ICML, 2021.
Q3 :
(1) What is the novelty of integrating DP in the algorithm? (2) How does it lead to advancing the proposed HPFL solution? (3) Is it HPFL+DP or the integration has some challenges? If so, what are the challenges and how does this paper tackle them? (4) Why can’t we use other privacy preservation mechanisms?
Ans for Q3):
- (1) Instead of adding Gaussian noise to the partial or full model, we apply Differential Privacy (DP) to the markers, which are intermediate features output by the model on the training and test data, to provide better privacy for HPFL (a minimal sketch of this mechanism follows this list). Compared with applying DP to models [1], applying DP to shared information is significantly less explored.
- (2) DP is mainly used to provide privacy protection for the shared plug-in markers.
- (3) We encountered no special issues when applying DP to HPFL. DP is a commonly used mechanism for protecting privacy in FL, so we adopt this technique to provide a privacy-protection guarantee. Both the theoretical analysis in Section 3.4 and the experiments in Appendix E.2 show its effectiveness.
- (4) Indeed, other privacy-preservation mechanisms could also protect HPFL from information leakage; we adopt a commonly used and simple technique in the reference implementation of HPFL presented in the paper. This indicates that HPFL is not restricted to DP for privacy protection and can adopt other methods to enhance it; for example, Homomorphic Encryption [2] or Secure Multi-Party Computation [3] could be incorporated into HPFL.
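As an illustration of the marker-noising design mentioned above, below is a minimal sketch of client-side noising of a marker before upload; the clipping bound and noise multiplier are hypothetical placeholders, and the exact calibration in Section 3.4 of the paper may differ:

```python
import numpy as np

def privatize_marker(marker, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """Gaussian mechanism applied to a marker (rows = feature vectors).

    Each row is clipped to L2 norm `clip_norm` to bound sensitivity, then
    i.i.d. Gaussian noise with std `noise_multiplier * clip_norm` is added
    on the client before the marker is uploaded to the modular store.
    """
    rng = np.random.default_rng() if rng is None else rng
    norms = np.linalg.norm(marker, axis=1, keepdims=True)
    clipped = marker * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    return clipped + rng.normal(0.0, noise_multiplier * clip_norm, marker.shape)
```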
Q4 :
Detailed comparison with two types of studies: i. existing PFL algorithms and comparing with HPFL, ii. Existing privacy preserving algorithms and comparison with DP.
Ans for Q4):
- Differences of HPFL with existing PFL algorithms:
- (1) PFL algorithms focus on personalizing a model to perform best in the PFL setting (local test datasets) rather than on utilizing it in the GFL setting (all datasets); HPFL is the first framework to leverage personalized model components from other clients to adapt to varied and shifting test distributions.
- (2) PFL algorithms cannot adapt the model at test time to keep the inference model suited to the test data; this significantly degrades their robustness and performance when the test data fall outside the local data distribution. In contrast, HPFL manages to choose a suitable plug-in according to the test data via its selection mechanism.
- (3) Clients in PFL algorithms only have access to their own personalized model, while clients in HPFL can obtain personalized plug-ins from other clients, which greatly enriches the knowledge a client can access.
- (4) The plug-in in HPFL is much more lightweight than the personalized model in PFL, which greatly increases the efficiency of HPFL.
- Differences between DP and other existing privacy-preserving algorithms: DP is a commonly used privacy-protection mechanism. The main differences between DP and other privacy-preserving algorithms used in FL can be summarized as:
- (1) DP provides a strict theoretical framework to quantify the risk of privacy leakage, whereas other methods cannot bound the probability of leakage.
- (2) Its implementation is simple, unlike Decentralized Federated Learning without a central server [4] and Secure Multi-Party Computation [3], which require complicated protocols for clients to communicate directly.
- (3) It requires less computation than methods such as Homomorphic Encryption [2] and blockchain [5].
We sincerely thank the reviewer for taking the time to review. According to your insightful comments, we provide detailed feedback below.
Q1 :
What is the main contribution of the HPFL that makes it outperform existing PFL models? This could be described by adding a table with core update rule of existing PFLs [including more recent studies] and the proposed method;
Ans for Q1):
| | GFL | PFL | HPFL (SFL) |
|---|---|---|---|
| Focus on performance of | all datasets | local datasets | all datasets |
| Training paradigm | train a single model by aggregation | fine-tune the local model | fine-tune model components on top of a backbone |
| Inference paradigm | use the global model | use personalized models | select an appropriate personalized plug-in and the common backbone, then run inference |
| Real-world deployment | all clients share a common model | clients can only use locally personalized models | all clients can share all personalized plug-ins and a common backbone |
- Focus on performance of: this row distinguishes the target test data of the different algorithms, showing that HPFL targets all datasets (the global dataset) instead of only local datasets as in PFL algorithms.
- Training paradigm: this row shows the differences in the training processes of the algorithms.
- Inference paradigm: this row shows the different inference pipelines; the selection mechanism ensures our method obtains a suitable plug-in for inference.
- Real-world deployment: this row shows the differences when the various algorithms are deployed in real-world applications; here we highlight the importance of sharing all personalized plug-ins and regard this as our key contribution leading to better GFL performance than all PFL algorithms.
With the above design, HPFL has the following advantages over previous GFL and PFL paradigms:
- Superb performance on all datasets instead of only local datasets.
- Lower resource requirements than model selection [6].
- Enables a possible market mechanism for plug-in sharing between different users [5].
- Potential application to asynchronous and continual learning scenarios [7, 8].
Q2 :
What is the difference of plug-in module and vanilla personalized model? Intuitively they seem to be the same and help considering the local model to personalize the local model of each client.
Ans for Q2):
- Plug-in modules are the personalized part of the model: In HPFL, plug-ins are produced by fine-tuning on client data with a frozen backbone, whereas vanilla personalized models are often personalized by directly fine-tuning the whole model on client data.
- Plug-in modules can be selected and plugged: Personalized plug-ins can be shared with and plugged into all clients thanks to their lightweight nature. Moreover, since a plug-in is part of the model, the encoded intermediate feature (marker) can be shared and used to identify which plug-in matches; for a vanilla personalized model, the identifier information (auxiliary information) on which the selection process is based is difficult to construct.
- Selecting complete personalized models is inefficient: storing the whole models on the server or client devices occupies memory on the order of $N \cdot |w|$, in which $N$ is the number of clients and $|w|$ the size of the whole model, while storing the plug-ins requires only $N \cdot |w_p|$ memory, where $|w_p|$ is the size of a plug-in (a toy fine-tuning sketch follows this list).
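For intuition, here is a toy PyTorch sketch of producing a plug-in by fine-tuning with a frozen backbone; the architecture, data, and hyperparameters are invented for illustration and do not correspond to the paper's models:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())  # shared, frozen
plugin = nn.Linear(256, 10)                       # lightweight personalized plug-in

for p in backbone.parameters():                   # freeze: only the plug-in is trained
    p.requires_grad_(False)

opt = torch.optim.SGD(plugin.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(32, 1, 28, 28)                    # stand-in for a local client batch
y = torch.randint(0, 10, (32,))
for _ in range(5):                                # a few local fine-tuning steps
    opt.zero_grad()
    loss_fn(plugin(backbone(x)), y).backward()
    opt.step()

marker = backbone(x).detach()                     # intermediate features act as the marker
```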
Dear Reviewer kxCM,
Thanks a lot for your time in reviewing and reading our response and the revision, and for your valuable comments. We sincerely understand that you are busy, but as the window for discussion is closing, would you mind checking our responses and confirming whether you have any further questions? We look forward to answering any further questions from you.
Best regards and thanks,
Authors of #9027
Dear reviewer kxCM,
Thanks for your efforts in reviewing our paper and raising the score. Your constructive and concrete suggestions have greatly contributed to our paper. We are pleased to respond to any further concerns you may have.
Best regards and thanks,
Authors of #9027
The paper introduces Hot-Pluggable Federated Learning (HPFL), a framework aimed at bridging the gap between Generic Federated Learning (GFL) and Personalized Federated Learning (PFL) by proposing Selective Federated Learning (SFL). SFL optimizes PFL while allowing for the selection of personalized models (PMs) to enhance generalization performance across diverse test data. In HPFL, clients train personalized plug-in modules based on a shared backbone model, which are then uploaded to a server for selection during inference. This process also incorporates differential privacy to protect user data during the selection phase. Experimental results demonstrate that HPFL significantly outperforms traditional GFL and PFL methods, suggesting its applicability in various federated learning scenarios, including continual learning.
Strengths
- The introduction of SFL effectively addresses the limitations of existing PFL methods, allowing for better adaptation to real-world scenarios where test data may differ significantly from local training data.
- The HPFL framework's modular approach enables efficient communication and computation, making it suitable for practical applications in federated learning, including scenarios with resource-constrained clients.
Weaknesses
- The dependency on a common backbone model may lead to reduced performance if the backbone fails to generalize well across heterogeneous client data distributions.
- The plug-in selection process during inference could introduce additional computational delays, particularly for clients with limited resources, potentially hindering real-time performance.
- While the framework claims differential privacy protection, the effectiveness of this mechanism in preventing information leakage during plug-in selection remains to be empirically validated in diverse operational contexts.
Questions
See the weaknesses.
We thank the reviewer for taking the time to review. We appreciate that you find the proposed framework HPFL efficient and practical, and the introduction of SFL effective. According to your valuable comments, we provide feedback below.
Q1 :
The dependency on a common backbone model may lead to reduced performance if the backbone fails to generalize well across heterogeneous client data distributions.
Ans for Q1):
- Low sensitivity to backbone GFL performance: HPFL does not require a highly performant backbone, because even given a moderate backbone, HPFL can ensure the performance of the plug-ins by fine-tuning on local data. HPFL is not very sensitive to the generality of the backbone, as shown in Table 2 in the main text: the PFL performance gap between the low-heterogeneity setting (10 clients) and the high-heterogeneity setting (100 clients) is small or even reversed, even though GFL performance decreases significantly under high heterogeneity.
- Orthogonality to backbone training methods: Our main contribution and novelty is the design of the HPFL framework, which is orthogonal to traditional FL; the backbone can be trained with any GFL algorithm. Even if the backbone performance is insufficient for HPFL under high heterogeneity, we can train the model with more advanced GFL algorithms such as FedSAM, which take data heterogeneity into consideration.
- Easy access to an appropriate backbone: In the era of LLMs, a pretrained backbone decent enough to fine-tune for PFL performance is easy to obtain, such as Llama 3.
For the three reasons above, we believe a backbone usable in HPFL is easily accessible.
Q2 :
The plug-in selection process during inference could introduce additional computational delays, particularly for clients with limited resources, potentially hindering real-time performance.
Ans for Q2): To assess the risk of the selection process hindering real-time performance, we analyze the computation costs of the selection process. Moreover, we discuss how to reduce the number of plug-ins and the frequency of the selection process to relieve HPFL from heavy computation costs.
- The computation cost of selection is correlated with the total number of plug-in training samples. In a large FL system, each client tends to hold few samples, so the number of training samples is controllable in real deployments.
- Reducing the number of plug-ins: Within an FL system involving millions of clients, many clients share similar plug-ins. Therefore, in Appendix F.2.1, we propose initial ideas for controlling the number of plug-ins, and show that HPFL surpasses the best GFL-PM baseline FedTHE with only 1/3 of the plug-ins, simply by using the selection score to measure the similarity of plug-ins and discarding similar ones.
- In a real implementation, clients keep the selected plug-in locally. Hence, in real deployments, plug-in updates do not happen constantly, since distribution shifts occur at a relatively low frequency (e.g., traveling to another country or place, or the climate change of a certain region). This behaviour is consistent with many real-world FL systems [1, 2], where FL updates are often scheduled only during clients' idle time [1, 2, 3], further reducing the burden on the system.
Q3 :
While the framework claims differential privacy protection, the effectiveness of this mechanism in preventing information leakage during plug-in selection remains to be empirically validated in diverse operational contexts.
Ans for Q3):
- In Appendix E.2 and Figures 15 and 16, we carry out image reconstruction with a feature-inversion method and observe that, after our protection of the markers, no raw image can be successfully reconstructed by inverting the representation through the pretrained global backbone parameters. These results experimentally demonstrate that our privacy-protection scheme effectively shields HPFL from information leakage.
- Besides, our experiments show that differential privacy does little harm to HPFL's performance: Table 13 shows that under Differential Privacy protection, HPFL still achieves significantly better GFL performance than all baselines. Moreover, Table 8 shows that model performance is not significantly affected by Differential Privacy: HPFL's performance does not drop significantly as the noise coefficient rises from 0 to 1000.
References
[1] Age-Based Scheduling Policy for Federated Learning in Mobile Edge Networks. In ICASSP, 2020.
[2] CMFL: Mitigating communication overhead for federated learning. In ICDCS, 2019.
[3] Towards federated learning at scale: System design. In SysML, 2019.
Dear Reviewer 8vJb,
Thanks a lot for your time in reviewing and reading our response and the revision, and for your valuable comments. We sincerely understand that you are busy, but as the window for discussion is closing, would you mind checking our responses and confirming whether you have any further questions? We look forward to answering any further questions from you.
Best regards and thanks,
Authors of #9027
This article proposes a new federated learning framework called Hot-Pluggable Federated Learning (HPFL), which aims to solve the performance gap problem between general federated learning (GFL) and personalized federated learning (PFL). Traditional GFL cannot cope with the diversity of data distribution, while PFL is only suitable for scenarios where local data distribution is similar. When the client encounters test data that is different from local data, PFL's personalized model has difficulty maintaining efficient generalization performance. To this end, this paper proposes a new problem framework of Selective Federated Learning (SFL), which enhances the effect of GFL by selecting an appropriate personalized model for each client in the inference stage. The HPFL framework divides the model into shared backbone and personalized plug-in modules. The client trains and uploads plug-ins based on local data. During inference, it can select appropriate plug-ins to adapt to different data distributions, while protecting data privacy through differential privacy.
Strengths
Originality: The HPFL framework proposed in this paper innovatively introduces a plug-in selection mechanism into federated learning, bridging general and personalized models and resolving the performance trade-off between traditional GFL and PFL. This is a new attempt in federated learning.
Quality: The experimental part is relatively comprehensive, covering a variety of datasets and models, and enhancing security through differential privacy. The overall design is rigorous, and the results demonstrate the advantages of HPFL in performance and adaptability.
Clarity: The paper is well structured, with clear background and problem descriptions, and clear algorithm design, framework details, and experimental procedures, making it easy for readers to understand its core contributions.
Significance: This study proposes a new solution to the adaptability problem of federated learning under heterogeneous data distributions, which has practical application potential and provides a new direction for the future development of federated learning.
Weaknesses
- Insufficient details of the selection mechanism (page 5, Section 3.3): HPFL uses multiple distance metrics such as MMD, SVCCA, and CKA to select plug-ins, but the specific algorithm steps and implementation details are scarcely described. The article could add mathematical expressions or pseudocode for some of the selection methods to increase readability and help readers assess the robustness of the selection process.
- Selection of comparison methods (pages 7-8, Tables 2 and 3): The paper compares a variety of GFL and PFL algorithms, but lacks comparisons with newer solutions that focus on heterogeneous data distributions. It is recommended to add more to further highlight the advantages of HPFL in performance and adaptability.
Questions
- How can the computational and communication overheads brought by the storage and selection of personalized plug-in modules in the HPFL framework be balanced to improve model efficiency?
- Could the differential privacy protection during model selection in this article, or the use of other privacy-protection methods, significantly affect model performance?
Q3 :
How to balance the computational and communication overheads brought by the storage and selection of personalized plug-in modules in the HPFL framework to improve model efficiency?
Ans for Q3): We provide three implementations of HPFL inference, each addressing a different trade-off between computational/communication and storage/selection costs:
- (1) Plug-in Updating: only update the locally stored plug-in at a client-defined frequency, which greatly reduces the communication and computation cost of a single inference;
- (2) Cache All: request all plug-ins and markers from the server to eliminate communication costs at inference time, suitable for applications requiring low latency;
- (3) Cache Selected: store the plug-ins that the client has chosen before. In the real world, the number of distributions from other clients that one client can meet tends to be limited, so a single client may not need many plug-ins for inference. This trades the storage of several regularly used plug-ins for significantly lower communication costs, which usually dominate in FL systems (see the sketch after this list).
Plug-in Updating and Cache All are the first and second implementations described in the introduction.
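A minimal sketch of the Cache Selected scheme follows; the `server.select` call, the `features` method, and the acceptance `threshold` are hypothetical interfaces introduced for illustration, not the paper's API:

```python
class CacheSelectedClient:
    """Keep previously selected plug-ins locally; contact the server only when
    no cached plug-in matches the current test distribution well enough."""

    def __init__(self, backbone, server, distance, threshold):
        self.backbone, self.server = backbone, server
        self.distance, self.threshold = distance, threshold
        self.cache = {}  # plugin_id -> (plugin_module, plugin_marker)

    def infer(self, batch):
        task_marker = self.backbone.features(batch)
        best = min(
            ((pid, self.distance(task_marker, m)) for pid, (_, m) in self.cache.items()),
            key=lambda t: t[1],
            default=None,
        )
        if best is None or best[1] > self.threshold:
            # Cache miss: one download of the best-matching plug-in and its marker.
            pid, plugin, marker = self.server.select(task_marker)
            self.cache[pid] = (plugin, marker)
        else:
            plugin = self.cache[best[0]][0]
        return plugin(task_marker)  # backbone features flow into the selected plug-in
```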
Besides, we also attempt to reduce the number of plug-ins to control the overhead of plug-in storage and selection.
- Controllable number of plug-ins: In Appendix F.2.1, we provide some initial ideas for reducing the number of plug-ins, and show that HPFL outperforms the best GFL-PM baseline FedTHE with only 1/3 of the plug-ins, simply by eliminating plug-ins using the computed selection score.
Q4 :
Is it possible that differential privacy protection during model selection in this article, or when using other privacy protection methods, may significantly affect model performance?
Ans for Q4):
Experimental results suggest that differential privacy does little harm to HPFL's performance: Table 13 shows that under Differential Privacy protection, HPFL still achieves significantly better GFL performance than all baselines. Moreover, Table 8 shows that model performance is not significantly affected by Differential Privacy: HPFL's performance does not drop significantly as the noise coefficient rises from 0 to 1000.
References
[1] Mingsheng Long, Han Zhu, Jianmin Wang, and Michael I Jordan. Deep transfer learning with joint adaptation networks. In ICML, 2017.
[2] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. In NeurIPS, 2017.
[3] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In ICML, 2019.
[4] Karimireddy, Sai Praneeth, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. "Scaffold: Stochastic controlled averaging for federated learning." In ICML, 2020.
[5] Chen, Hong-You, and Wei-Lun Chao. "On Bridging Generic and Personalized Federated Learning for Image Classification." In International Conference on Learning Representations, 2022.
[6] Qu, Zhe, Xingyu Li, Rui Duan, Yao Liu, Bo Tang, and Zhuo Lu. "Generalized federated learning via sharpness aware minimization." In ICML, 2022.
[7] Jiang, Liangze, and Tao Lin. "Test-Time Robust Personalization for Federated Learning." In ICLR, 2023.
We thank the reviewer for taking the time to review. In light of your insightful comments, we offer responses below.
Q1 :
Insufficient details of the selection mechanism (page 5, Section 3.3):HPFL uses multiple distance metrics such as MMD, SVCCA, and CKA to select plug-ins, but the specific algorithm steps and implementation details are rarely described. The article can add mathematical expressions or pseudocodes for some of the selection methods to increase readability and make it easier for readers to understand the robustness of the selection process.
Ans for Q1): Thank you for your insightful suggestion, which helped us clarify HPFL. Due to the page limit, we add the specific implementations and formulas of MMD [1], SVCCA [2], and CKA [3] in Appendix D.2 to better elaborate the selection process; a code sketch of two of these metrics follows below.
- MMD: Given observations $X:=\{x_{1}, \ldots, x_{m}\}$ and $Y:=\{y_{1}, \ldots, y_{n}\}$, the empirical estimate is
$$\mathrm{MMD}^2(X, Y) = \frac{1}{m^2}\sum_{i,j=1}^{m} k(x_i, x_j) - \frac{2}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} k(x_i, y_j) + \frac{1}{n^2}\sum_{i,j=1}^{n} k(y_i, y_j),$$
where $k(\cdot,\cdot)$ is the kernel function.
- CKA: Let $K_1$ and $K_2$ be two kernel matrices computed over the same samples such that $\|K_1\|_F \neq 0$ and $\|K_2\|_F \neq 0$. Then the alignment between $K_1$ and $K_2$ is defined by
$$\mathrm{CKA}(K_1, K_2) = \frac{\langle K_1, K_2 \rangle_F}{\|K_1\|_F \, \|K_2\|_F}.$$
- SVCCA:
  - Input: two feature matrices $X$ and $Y$.
  - Perform SVD($X$) and SVD($Y$), keeping the top singular directions; output the reduced representations $X'$ and $Y'$.
  - Perform CCA($X'$, $Y'$); output the canonical correlations, where $m$ and $n$ denote the numbers of samples of $X$ and $Y$.
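For concreteness, here is a NumPy sketch of two of these metrics as one could compute them on marker matrices; it mirrors the standard formulas above rather than quoting the code in Appendix D.2:

```python
import numpy as np

def mmd2(x, y, sigma=1.0):
    """Empirical squared MMD between row-wise samples x (m,d) and y (n,d),
    using the RBF kernel k(a,b) = exp(-||a-b||^2 / (2 sigma^2))."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return gram(x, x).mean() - 2 * gram(x, y).mean() + gram(y, y).mean()

def linear_cka(x, y):
    """Linear CKA between feature matrices x (n,d1) and y (n,d2) computed
    over the same n samples (Kornblith et al., 2019)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    hsic = np.linalg.norm(x.T @ y, "fro") ** 2
    return hsic / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))
```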
Q2 :
Selection of comparison methods (page 7&8, Table 2 and Table 3):The paper compares a variety of GFL and PFL algorithms, but lacks a comparison of newer solutions that focus on heterogeneous data distribution problems. It is recommended to add more to further highlight the advantages of HPFL in performance and adaptability.
Ans for Q2): As far as we are concerned, most previous works addressing data heterogeneity fall into two categories:
- (1) How to train a better global model given heterogeneous data scattered across clients, usually by (i) designing more robust aggregation schemes or (ii) reducing the heterogeneity of locally trained models given heterogeneous local data. For (1.i), SCAFFOLD [4] is the representative method; however, few recent works follow this path, so we did not choose a recent representative from this category. FedRoD [5] optimizes local models towards class-balanced objectives to improve GFL performance, an instance of scheme (1.ii). FedSAM [6] also follows (1.ii) by applying a Sharpness-Aware Minimization (SAM) local optimizer. Moreover, our method is orthogonal to this category, since HPFL can adopt these methods to train a better backbone.
- (2) How to adaptively adjust the inference model to mitigate test-time distribution shifts. FedTHE [7] achieves this by adjusting the ensemble weights of global and local heads according to the test data. The experimental results of HPFL and these baselines show that HPFL has advantages over both kinds of methods.
Dear Reviewer VDPF,
Thanks a lot for your time in reviewing and reading our response and the revision, and for your valuable comments. We sincerely understand that you are busy, but as the window for discussion is closing, would you mind checking our responses and confirming whether you have any further questions? We look forward to answering any further questions from you.
Best regards and thanks,
Authors of #9027
Thank you for the response, I tend to keep my positive score.
Dear reviewer VDPF,
Thanks for your time in reviewing our paper and for replying despite such a busy period. Your comments have helped us improve our presentation. If you have any further concerns, we will do our best to respond to them.
Best regards and thanks,
Authors of #9027
This paper presents a novel framework called Hot-Pluggable Federated Learning (HPFL) that aims to bridge the gap between generic federated learning (GFL) and personalized federated learning (PFL). The authors propose a new learning paradigm, Selective Federated Learning (SFL), which combines model optimization with model selection. HPFL addresses the challenges of storing and selecting whole models by designing an efficient framework that allows clients to train personalized plug-in modules and upload them to a server. During inference, a selection algorithm identifies suitable plug-in modules to enhance performance on target data distributions. The paper also incorporates differential privacy protection during the selection process. Comprehensive experiments demonstrate HPFL's effectiveness in improving GFL performance and its potential in addressing other FL challenges like continual learning and one-shot FL.
Strengths
- The authors identify a substantial gap between GFL and PFL, and formulate a new problem, SFL, to bridge them and address this performance gap. The optimization objectives of both GFL and PFL are special cases of it.
- The authors propose a general, efficient and effective framework, HPFL, which practically solves SFL.
- Comprehensive experiments and ablation studies on four datasets and three neural networks demonstrate the effectiveness of HPFL.
Weaknesses
- It's admirable that the authors define a paradigm to bridge the gap between GFL and PFL, but the selective method is not novel enough. For example, [1] allowed each client to choose an appropriately scaled model to train. And the training process in HPFL is identical to split federated learning; the authors only add a selection step in the inference process.
- The authors claim that they theoretically show PMs can be used to enhance GFL with a new learning problem named Selective FL (SFL), which involves optimizing PFL and model selection. But a statement about the loss alone cannot fully evaluate the effectiveness of SFL. And in Eq. 4, should the greater-than sign \geq be a less-than sign \leq? The better one should attain the smaller loss?
- What is the originality in the analysis of the privacy protection? It seems that adding Gaussian noise to the partial model or the full model is identical. Thus I think it is only an existing result.
- The notation needs to be improved. For example, in Section 2.4's definition of the Selective FL (SFL) problem, the introduction of the auxiliary information is confusing. And in Theorem 2.3, the function s(\cdot) lacks a description.
- The presentation in the experimental section should be improved. For example, Table 2 is hard to read; it is confusing that the authors highlight not only the best results in grey but also some second-best results. And why does the proposed HPFL not attain the best result, especially compared to the GFL method FedSAM?
[1] Cho, Yae Jee, et al. "Heterogeneous ensemble knowledge transfer for training large models in federated learning." arXiv preprint arXiv:2204.12703 (2022).
Questions
As shown in the Weakness.
References
[1] Cho, Yae Jee, et al. "Heterogeneous ensemble knowledge transfer for training large models in federated learning." arXiv preprint arXiv:2204.12703 (2022).
[2] Liang, Paul Pu, Terrance Liu, Liu Ziyin, Nicholas B. Allen, Randy P. Auerbach, David Brent, Ruslan Salakhutdinov, and Louis-Philippe Morency. "Think locally, act globally: Federated learning with local and global representations." arXiv preprint arXiv:2001.01523 (2020).
[3] Wu, Zhaomin, Qinbin Li, and Bingsheng He. "A coupled design of exploiting record similarity for practical vertical federated learning." Advances in Neural Information Processing Systems 35 (2022): 21087-21100.
[4] Karimireddy, Sai Praneeth, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. "Scaffold: Stochastic controlled averaging for federated learning." In ICML, 2020.
[5] Chen, Hong-You, and Wei-Lun Chao. "On Bridging Generic and Personalized Federated Learning for Image Classification." In ICLR, 2022.
[6] Li, Tian, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith. "Federated optimization in heterogeneous networks." In MLSys, 2020.
[7] Wei, Kang, Jun Li, Ming Ding, Chuan Ma, Howard H. Yang, Farhad Farokhi, Shi Jin, Tony QS Quek, and H. Vincent Poor. "Federated learning with differential privacy: Algorithms and performance analysis." IEEE transactions on information forensics and security 15 (2020): 3454-3469.
[8] Shazeer, Noam, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." In International Conference on Learning Representations. 2016.
[9] Ren, Jie, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. "Likelihood ratios for out-of-distribution detection." Advances in neural information processing systems 32 (2019).
[10] Loh, Wei‐Yin. "Classification and regression trees." Wiley interdisciplinary reviews: data mining and knowledge discovery 1, no. 1 (2011): 14-23.
[11] Cutler, Adele, D. Richard Cutler, and John R. Stevens. "Random forests." Ensemble machine learning: Methods and applications (2012): 157-175.
Q2 :
(1) The authors claim that they theoretically show PMs can be used to enhance GFL with a new learning problem named Selective FL (SFL), which involves optimizing PFL and model selection. But only the statement of the loss cannot totally evaluate the effectivess of the SFL. (2) And in the Eq.4, the greater-than sign \geq should be a less-than sign \leq? The better one should gain the less loss?
Ans for Q2):
- (1) We understand that the value of the loss function in training (i.e., the training error) is not enough to evaluate model performance on test data; however, in our analysis we actually bound the generalization error of the models, which we believe suffices to evaluate the actual performance at inference time. Besides, given the same training samples and model complexity/architecture, which obviously holds in our analysis, a lower training error implies a lower generalization bound and thus better model performance. The theoretical analysis in our paper follows a series of works [4, 5, 6]. We also experimentally report the accuracy of HPFL, which is the implementation of SFL.
- (2) We apologize for the misleading presentation. The greater-than sign in Eq. 4 conveys the intended meaning: the local model is tested on its own data, while the other term involves the model of another client i (see the reading sketched below). We will revise the notation from "client i" to "client m" in the text description to prevent potential misunderstanding in our revision. Thanks for your suggestion and careful reading.
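To make the intended direction of the inequality explicit, here is one plausible reading in notation introduced purely for illustration; the paper's exact symbols in Eq. 4 may differ:

```latex
% F_m(\cdot): expected loss of a model on client m's own distribution D_m;
% w_m: client m's personalized model; w_i: another client's model (i \neq m).
% Eq. 4 is then read as: another client's model incurs no smaller loss on D_m.
F_m(w_i) \;\geq\; F_m(w_m), \qquad \forall\, i \neq m .
```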
Q3 :
What’s the originality in the analysis of the privacy protection? It seems that add Gaussian noise to the partial model or full model is identical.
Ans for Q3): Instead of adding Gaussian noise to the partial or full model, we apply Differential Privacy (DP) to the markers to provide better privacy for HPFL. Compared with applying DP to models, applying DP to other shared information such as intermediate features is significantly less explored. Using DP for privacy protection, which is common in FL [7], is one component of HPFL.
Q4 :
(1) The notations needs to be improved. For example, in section 2.4 the definition of Selective FL (SFL) problem, the introducing of auxiliary information is confused. (2) And in Theorem 2.3, the function s(\dot) lacks description.
Ans for Q4):
- (1) To resolve the potential confusion, we add a text description explaining the role of the auxiliary information in Section 2.4 of our revision: the auxiliary information is exploited to select the plug-in module, e.g., a noisy feature for calculating distance metrics like Maximum Mean Discrepancy (MMD). In a real implementation, the auxiliary information is not limited to feature prototypes like markers; it can also be a mechanism such as a learned gate as in MoE [8], or a discriminative model such as an OOD detector [9], a decision tree [10], or a random forest [11]. We hope that with this modification, the introduction of the auxiliary information makes more sense.
- (2) " is called selection function that outputs the model index to select a model from the PMs based on the input and the auxiliary information , which will be illustrated in Section 3", we use the lowercase letter to denote the instantialized version of Selection function , i.e. a specific selection method, e.g. MMD, SVCCA and CKA. We revise the notation to more clearly explain the select function s(\dot).
Q5 :
The presentation in experimental section should be improved. For example, the Table 2 is hard to read. It’s confused that the authors not only give the best result in grey, but also some second best result. And why the proposed HPFL does not gain the best result especially compared to the GFL FedSAM?
Ans for Q5): Thank you for the kind reminder. We use different colors to distinguish the best GFL performances under different settings, namely GFL-GM and GFL-PM. In GFL-GM, a single global model faces all datasets, while GFL-PM refers to the setting where personalized local models face all datasets. We find that PMs often perform poorly across all datasets, and we remedy this via adaptation of the personalized model, which we find can even surpass the GFL performance of the GM.
- ForestGreen: overall best GFL results across the GFL-GM and GFL-PM settings (the highest among those two settings).
- Grey: best results under the GFL-GM setting only (the highest within the traditional GFL-GM setting).
There may be a misunderstanding of the experimental results listed in Table 2. The focus of our experiments is to compare the GFL performance of the baselines and HPFL (covering both the GFL-GM and GFL-PM columns, not only the first GFL-GM column). As shown in Table 2 in the main text, HPFL attains most of the best GFL performances among all these baselines, including FedSAM.
We sincerely thank the reviewer for taking the time to review. We appreciate that you find our proposed framework novel, general, and efficient, and our experiments and ablation studies comprehensive. According to your valuable comments, we provide detailed feedback below and have added it to the main text or appendix in the revision. We hope these changes address your concerns and improve the overall quality of our work.
Q1 :
It’s adorable that the authors define a paradigm to bridge the gap between GFL and PFL, but the selective method are not novelty enough. For example, [1] allowed each client to choose the appropriate scale model to train. And the training process in HPFL are identical to the split federated learning, the authors only add another select process in inference process.
Ans for Q1): Thanks for your valuable comments.
- Differences from [1]: To our knowledge, the only selection in [1] happens when deciding which clients take part in a training round according to their dataset sizes; this is quite different from the plug-in selection of HPFL in four ways: (1) our method tackles the notorious data-heterogeneity problem by selecting appropriate well-trained plug-in models, while [1] relies on knowledge distillation; (2) their selection of participating clients is completely random with probability proportional to dataset size, whereas our selection is based on the matching degree of task markers and plug-in markers, measured by Maximum Mean Discrepancy (MMD); (3) our selection happens only in the inference phase to adapt the test-time model to the test data, while the selection of clients in [1] happens every training round; (4) their method requires a public unlabelled dataset on the server to perform knowledge distillation (KD) and transfer knowledge from client models to the server, which is not always accessible, especially in fields where privacy is highly valued, such as medicine and finance.
- Differences from split federated learning: (1) Model decoupling is a commonly used technique in FL, so we do not regard it as our main novelty; as stated in Section 3.2, it serves to reduce storage and communication costs in real implementations, i.e., any training technique can be incorporated into HPFL as long as it yields lightweight model components to act as plug-ins. (2) Split federated learning typically focuses on the PFL [2] or vertical FL [3] setting, whereas HPFL aims to enhance GFL performance using plug-ins trained with PFL methods, not PFL performance itself. (3) The model architectures in split FL are fixed to enable aggregation of client-side model updates, while HPFL supports heterogeneous personalized model parts with different architectures, since the model components are pluggable and personalized. (4) To support dynamic hot-pluggable module selection, HPFL further introduces the prototypes named markers, while split federated learning only considers computation partitioning and parallelism. (5) The design of HPFL supports asynchronous and one-shot FL scenarios, which split FL cannot, due to its requirement of aggregating client-side model updates.
- Our novelty: We would like to highlight our major novelty as follows: (1) we first formulate the performance gap between GFL and PFL, and through this formulation we derive a theoretical framework named Selective FL (SFL) that utilizes the excellent PFL performance of local models (implemented as plug-ins in HPFL) to boost GFL performance; to our knowledge, this is the first work that enhances GFL through learning, sharing, and selecting model components, instead of the classic paradigm relying on a single global model. (2) We instantiate SFL with an efficient and practical framework named HPFL, and (3) conduct comprehensive experiments and ablation studies demonstrating HPFL's advantage over previous methods. (4) Our practical and efficient framework HPFL supports asynchronous, one-shot, and continual FL scenarios.
Dear Reviewer FWNo,
Thanks a lot for your time in reviewing and reading our response and the revision, and for your valuable comments. We sincerely understand that you are busy, but as the window for discussion is closing, would you mind checking our responses and confirming whether you have any further questions? We look forward to answering any further questions from you.
Best regards and thanks,
Authors of #9027
Dear reviewer FWNo,
Thanks for your valuable time in reviewing and constructive comments, according to which we have tried our best to answer the questions and carefully revise the paper. Here is a summary of our response for your convenience:
- (1) Comparison with Fed-ET & split FL: We listed the differences in the table below to convey them more intuitively.
| | Fed-ET | split FL | HPFL |
|---|---|---|---|
| Major mechanism | ensemble and KD | parameter decoupling | adapt to test data with an appropriate well-trained plug-in model |
| Focus on performance of | all datasets | local datasets | all datasets |
| Training paradigm | train a single model by aggregation and KD | fine-tune the local model | fine-tune model components on top of a backbone |
| Inference paradigm | use the global model | use personalized models, or inference in the vertical-FL way | select an appropriate personalized plug-in and the common backbone, then run inference |
| Real-world deployment | all clients share a common model | clients can only use locally personalized models | all clients can share all personalized plug-ins and a common backbone |
| Public dataset | required on the server for KD | not required | not required |
| Transmitted model | whole model | none | plug-in (part of the model) |
| When and how selection happens | randomly picking participating clients every round | randomly picking participating clients every round | selected by a determined mechanism (e.g., a distance metric like MMD) at test time |

- Focus on performance of: this row distinguishes the target test data of the different algorithms, showing that HPFL targets all datasets (the global dataset) instead of local datasets as in split FL algorithms.
- Real-world deployment: this row shows the differences when the various algorithms are deployed in real-world applications; here we highlight the importance of sharing all personalized plug-ins and regard it as our key contribution leading to better GFL performance than all split FL algorithms.
With the above design, HPFL has the following advantages over Fed-ET and split FL methods:
- Superb performance on all datasets instead of only local datasets.
- Lower communication cost than Fed-ET.
- Free from the requirement of a public dataset, which is difficult to obtain in many cases, such as medical data.
- Enables a possible market mechanism for plug-in sharing between different users.
- Potential application to asynchronous and continual learning scenarios.
We have added the above works and their comparison with HPFL to our revision.
- (2) Problems with the theoretical analysis: Following your constructive comments, we have revised the text description of Theorem 2.1. Besides, we also note that analyzing the loss is a common approach for theoretical performance analysis in FL.
- (3) Originality of our DP protection: Following your valuable suggestions, we have made the novelty of our DP clear: our DP is applied to shared information such as intermediate features, instead of to models or gradients.
- (4) Presentation problems: Following your valuable suggestions, we have revised the presentation of our theoretical analysis: 1. in Section 2.4 of our revision, we explain that the auxiliary information is exploited to select the plug-in module, e.g., a noisy feature for calculating distance metrics like Maximum Mean Discrepancy (MMD); 2. we clarify that the selection function outputs the model index to select a model from the PMs based on the input and the auxiliary information, as illustrated in Section 3, and we revise the notation to explain the selection function s(\cdot) more clearly.
- (5) Presentation of the experimental section: We explain the meaning of the results shown in different colors in Table 2, together with the motivation of our experiments.
We humbly hope our response has addressed your concerns. If there are any additional concerns or comments that we may have missed, we would be most grateful for further feedback to help us enhance our work.
Best regards,
Authors of #9027
Thank you for the detailed response, I will raise my score accordingly.
Dear reviewer FWNo,
Thanks for your efforts in reviewing our paper. Your constructive comments have greatly helped us improve our paper. If you have any further concerns, we are pleased to respond to them.
Best regards and thanks,
Authors of #9027
We sincerely thank all reviewers for taking the time to review our work. We appreciate that you find our framework novel, innovative, general, efficient and effective (Reviewers VDPF, FWNo and 8vJb); our paper well written (Reviewers VDPF and kxCM) and easy to follow (Reviewer VDPF); our discussion of related works detailed and fair (Reviewer kxCM); that we identify a substantial gap between GFL and PFL and formulate a new problem, SFL, to bridge them (Reviewers VDPF, FWNo); our extensive experiments and ablation studies showing superior performance (Reviewers VDPF, FWNo and 8vJb); our method promising and with application potential (Reviewers VDPF and 8vJb); and that we provide a new direction for the future development of FL (Reviewer VDPF).
Here, we provide a summary of our responses to frequent questions for convenient reading.
Q1: Clarification on contributions and advantages: (Reviewer VDPF, FWNo and kxCM)
Ans for Q1):
We identify our advantages over previous works as follows:
- Superb performance on all datasets instead of only local datasets.
- Lower communication resources required.
- Free from the requirement of a public dataset, which is difficult to obtain in many cases, such as medical data.
- Implementing a possible market mechanism of plug-in sharing between different users.
- Potential application to asynchronous and continual learning scenarios.
These advantages are achieved by the following:
- We first formulate the performance gap between GFL and PFL, and through this formulation we derive a theoretical framework named Selective FL (SFL) that can utilize the excellent PFL performance of local models to boost GFL performance.
- We instantiate SFL with an efficient and practical framework named HPFL, which incorporates the plug-in mechanism to improve GFL performance with PFL modules.
- We use markers to identify the training data distribution of the plug-in modules and adapt the test-time model accordingly.
Q2: Selection method: (Reviewer VDPF, FWNo)
Ans for Q2):
Clarification: we add the specific implementations and formulas of MMD, SVCCA, and CKA in Appendix D.2 to better elaborate the selection process.
Issues with formulation and presentation: we have revised the notation and text to better clarify the auxiliary information and the selection function in Section 2.4 of our revision.
Q3: Differential Privacy: (Reviewer VDPF, FWNo, 8vJb and kxCM)
Ans for Q3):
Novelty: Compared with applying DP to models, applying DP to other shared information like intermediate features is significantly less explored.
Experimental Verification:
- In Table 13, we experimentally show that under Differential Privacy protection, HPFL still achieves better GFL performance than all baselines.
- Moreover, results in Table 8 also show that model performance is not significantly affected by Differential Privacy protection.
- In Appendix E.2 and Figures 15 and 16, we observe that after our protection of the markers, no raw image can be successfully reconstructed with a model-inversion attack.
Q4: Discussion on extra system overheads (Reviewer VDPF and 8vJb)
Ans for Q4):
We provide three real-world deployment schemes for HPFL inference, each targeting a specific scenario to reduce the corresponding system-overhead bottleneck:
- (1) In lightweight applications requiring a low update frequency (such as background activities), the bottleneck is the storage and communication cost of frequent updates; Plug-in Updating only updates the locally stored plug-in at a client-defined frequency, which greatly reduces inference costs;
- (2) In latency-sensitive applications, the bottleneck is the communication time to fetch plug-ins; Cache All requests all plug-ins and markers from the server to remove the need for communication at inference time;
- (3) When a client meets a limited number of distributions, the bottleneck comes from repeatedly downloading the same plug-in; we adopt Cache Selected, which stores the plug-ins that the client has chosen before.
Plug-in Updating and Cache All are the first and second implementations described in the introduction.
Below we briefly summarize the factors reducing the system overheads of HPFL.
- Low update frequency and fewer transmitted parameters; detailed explanations can be found in the responses to Reviewers VDPF and 8vJb.
- Controllable number of plug-ins and storage: in Appendix F.2.1, we provide some initial ideas for reducing the number of plug-ins, and show that HPFL outperforms the best GFL-PM baseline FedTHE with only 1/3 of the plug-ins by simply eliminating plug-ins using the computed selection score (a greedy pruning sketch follows below).
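As an illustration of this idea, here is a greedy pruning sketch; the rule and the similarity threshold `tau` are simplifying assumptions introduced here, and the exact procedure in Appendix F.2.1 may differ:

```python
def prune_plugins(markers, distance, tau):
    """Greedily keep roughly one plug-in per distinct distribution: a plug-in
    is dropped if its marker lies within distance `tau` of an already-kept one.

    markers: list of marker arrays, one per plug-in
    returns: indices of the plug-ins to keep
    """
    kept = []
    for i, m in enumerate(markers):
        if all(distance(m, markers[j]) > tau for j in kept):
            kept.append(i)
    return kept
```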
We have integrated the above mentioned modifications into our revision.
a) Summary
The paper introduces Selective Federated Learning (SFL), a framework to bridge the gap between Generic Federated Learning (GFL) and Personalized Federated Learning (PFL) by enabling clients to share and selectively integrate lightweight plug-in modules trained on local data. The core insight is that personalized modules can enhance generalization across diverse test distributions when combined with a shared backbone and selected using markers. The main result demonstrates that the proposed Hot-Pluggable Federated Learning (HPFL) framework significantly outperforms state-of-the-art GFL and PFL methods while addressing privacy concerns through differential privacy and reducing system overhead with efficient plug-in sharing mechanisms.
b) Strengths
- The Hot-Pluggable Federated Learning (HPFL) framework, which sets up a "module store" and picks the most relevant module for the test distribution. This framework is realistic and potentially impactful. This could be interesting to extend to a marketplace setting as well.
- Introduces privacy protection for shared plug-in markers using differential privacy, with theoretical guarantees and experimental validation.
- Extensive evaluations across multiple datasets and architectures demonstrate significant improvements over traditional GFL and PFL methods.
c) Weakness
- The core methodology—using modular plug-ins, markers, and differential privacy—largely combines existing techniques, with minimal theoretical innovation beyond the problem formulation. Potential extensions could try to analyze more interesting data and model acquisition techniques (see e.g. Lu et al. 2023: https://arxiv.org/abs/2403.13893). It is also clearly related to the routing modules in mixtures of experts (MoEs): see Yadav et al. 2024: https://arxiv.org/abs/2408.07057.
- Performance depends heavily on the backbone model's ability to generalize across heterogeneous client data.
d) Decision to recommend accept
The authors introduce a system that allows clients to share and use small, adaptable components (plug-ins), which improves performance and reduces resource requirements. The experiments show that this approach works better than existing methods, and it also includes privacy protection to keep shared information secure. While some aspects build on existing ideas, the practical usefulness and strong results lead all the reviewers to recommend accepting this work.
Additional Comments from the Reviewer Discussion
During the rebuttal, reviewers raised concerns about the originality of the approach, clarity in the selection mechanism, reliance on the backbone model, and adequacy of comparisons with recent methods. The authors addressed these by providing additional implementation details, revising notations, elaborating on the plug-in selection process, and expanding comparisons with related works. They clarified that the framework’s novelty lies in adapting personalized components for generalization, not in theoretical advancements. While the originality remains moderate, the practical contributions and thorough revisions demonstrated the approach's effectiveness and relevance, leading to a positive overall evaluation.
Accept (Poster)