Federated Learning Can Find Friends That Are Advantageous
Summary
Reviews and Discussion
This paper introduces a novel algorithm called Merit-based Federated Learning (MeritFed), designed to tackle the challenges posed by heterogeneous data distributions in Federated Learning (FL). The authors employ an adaptive selection of aggregation weights for clients based on their data distributions and provide a theoretical framework demonstrating the provable convergence of MeritFed under mild assumptions, indicating that the method can achieve faster convergence compared to existing approaches. The paper conducts experiments comparing MeritFed with other baseline methods like TAWT and FedAdp, highlighting its efficiency and reduced computational complexity, as it does not rely on intensive hypergradient estimations or cosine similarity for weight calculation. Meanwhile, the incorporation of zero-order mirror descent in the algorithm enhances privacy during the training process. Overall, the paper contributes to the advancement of Federated Learning by proposing a method that promotes collaborative learning while addressing the challenges of data heterogeneity and privacy concerns.
Strengths
S1. The introduction of adaptive aggregation weights based on the data distributions of clients represents a significant advancement in Federated Learning, allowing for more effective collaboration and utilization of diverse datasets.
S2. The paper provides a rigorous convergence analysis, proving that MeritFed converges under mild assumptions. This theoretical backing enhances the credibility and reliability of the proposed method compared to traditional FL approaches.
S3. MeritFed is designed to handle outlier clients effectively, which is crucial in real-world scenarios where some clients may have non-representative or adversarial data. This robustness contributes to the overall stability of the learning process.
S4. The paper discusses the potential for future extensions and enhancements, indicating that MeritFed can be adapted for large-scale FL scenarios. This forward-looking perspective suggests that the method can evolve to meet the demands of diverse applications, making it a valuable contribution to the field.
Weaknesses
W1. Lack of theoretical or experimental support for the claim that the algorithm discerns optimal collaborators for faster convergence and better generalization.
W2. The paper does not clearly distinguish its approach from traditional heterogeneous federated learning, especially empirical risk-based methods.
W3. Although communication is identified as a bottleneck in federated learning, the paper does not propose any improvements or solutions to this issue.
W4. Unclear definitions of key terms, such as the term "similar distributions," lead to potential confusion about the method's applicability and assumptions.
W5. Inadequate experimental evaluation, e.g., on existing federated learning benchmarks or via ablation studies. This limits the understanding of the method's performance.
Questions
Q1. In line 54, the authors claim that “Through this innovative training mechanism, our algorithm discerns which clients are optimal collaborators to ensure faster convergence and potentially better generalization”. Could you please provide theoretical proof or conduct an empirical analysis to validate the improvement of the generalization ability of the proposed method, such as performing a comparison test on unseen clients or other domains/clusters?
Q2. Domain adaptation addresses the issue of data distribution dissimilarity of client datasets, while most are empirical risk-based methods. How does the approach in this study differ in principle from empirical risk-based domain adaptation methods?
Q3. Communication is a bottleneck in federated learning. I am wondering how the proposed approach impacts communication efficiency.
Q4. In line 208, how to define “similar distributions”? Could you please provide a formal definition of the "similar distributions" or give a quantitative measure of distribution similarity used in this work?
Q5. The paper conducts experiments on different tasks, such as mean estimation, image classification, and text classification, including pervasive domain application tasks. Yet, it does not thoroughly evaluate the method's performance against existing federated learning benchmarks. The three selected baseline approaches are old. Please benchmark the proposed approach against federated learning algorithms published after 2022. I am also wondering about the performance difference between the three methods presented in Section 3, for which it would be worth conducting an experiment and reporting the results.
W1. Lack of theoretical or experimental support for the claim that the algorithm discerns optimal collaborators for faster convergence and better generalization.
We kindly but firmly disagree with the reviewer, since we provided four different sets of experiments supporting our claim: mean estimation, ResNet18 training on CIFAR10, BERT fine-tuning on GoEmotions, and ResNet18 training on MedMNIST. All of these experiments suggest that MeritFed automatically adjusts the weights to make the collaboration beneficial for the target client, both in terms of faster convergence and better generalization. If the reviewer has particular comments about the experiments, we will be happy to address them as soon as possible.
Q1. In line 54, the authors claim that “Through this innovative training mechanism, our algorithm discerns which clients are optimal collaborators to ensure faster convergence and potentially better generalization”. Could you please provide theoretical proof or conduct an empirical analysis to validate the improvement of the generalization ability of the proposed method, such as performing a comparison test on unseen clients or other domains/clusters?
Theoretically, MeritFed performs no worse than Ideal SGD, which is a very reasonable guarantee, since the worst-case scenario allows all the clients (except the target one) to possess irrelevant data. In this scenario, MeritFed theoretically and practically recovers the best possible performance, given by Ideal SGD. In contrast, when clients with similarly distributed data participate in training, we observe improved generalization of the model, since test accuracy (on unseen data) increases; see, for example, Figures 7 and 11. The part of the question about other domains/clusters does not apply to the considered setup. Note that MeritFed also possesses linear speed-up if clients with the same distribution participate in training.
W2. The paper does not clearly distinguish its approach from traditional heterogeneous federated learning, especially empirical risk-based methods.
We kindly but firmly disagree with the reviewer’s comment. As we explain in lines 91-103, the majority of works in FL consider standard empirical risk minimization (problem (2) in our paper). Moreover, we explicitly mention that our paper focuses on problem (1). We also discuss in detail related approaches, in particular, in lines 175-182 we discuss FedAdp and TAWT that are the closest to our work.
Q2. Domain adaptation addresses the issue of data distribution dissimilarity of client datasets, while most are empirical risk-based methods. How does the approach in this study differ in principle from empirical risk-based domain adaptation methods?
Domain adaptation is the process of adjusting models trained on one domain to work effectively on a different domain where labeled data is not available. Domain adaptation does not consider training a model on data aggregated from clients. Moreover, we do not consider the special case where labeled data is unavailable in a different domain. Instead, our approach improves the training itself, since the model is trained on larger amounts of data, and it automatically and provably benefits from data having the same distribution: MeritFed possesses linear speed-up if there are clients (unknown in advance) with the same distribution participating in training.
W3. Although communication is identified as a bottleneck in federated learning, the paper does not propose any improvements or solutions to this issue.
Q3. Communication is a bottleneck in federated learning. I am wondering how the proposed approach impacts communication efficiency.
Reduced communication is another challenge in FL, which we do not address and which is orthogonal to the problem we consider in the paper. Our paper focuses on dealing with diverse or even adversarial data among clients and provides a new way of training in this setup.
In our setup, each client sends one (mini-batched) stochastic gradient (not parameters) once per communication round. Additional communication between the target client and server can be efficiently and privately done using zero-order MD, as detailed in Section 2.2. In lines 270-274, we state that the main concurrent algorithms, FedAdp and TAWT, also require the server to store (and send) vectors. However, for modern servers, this is not an issue.
W4. Unclear definitions of key terms, such as the term "similar distributions," lead to potential confusion about the method's applicability and assumptions.
Q4. In line 208, how to define “similar distributions”? Could you please provide a formal definition of the "similar distributions" or give a quantitative measure of distribution similarity used in this work?
We do not need any definition since we do not make any assumptions on data distribution similarity, in contrast to main concurrent works, see lines 176-179. Therefore, the term “similar distributions” is used informally in our paper to provide an intuition on why our methods can benefit from collaboration with clients having different distributions. Nevertheless, one can formally define the similarity of the distributions in terms of the proximity of the solutions of the corresponding expected risks. Moreover, we specifically point out that we allow adversarial participants having irrelevant or even malicious data; see lines 064-065.
Indeed, MeritFed is robust to Byzantine attacks, since our proof of Theorem 3.4 does not make any assumptions on the vectors received from workers whose data distribution differs from the target client's. This means that any such worker can send arbitrary vectors at each iteration, and MeritFed will still converge. Moreover, MeritFed can tolerate Byzantine attacks even if Byzantine workers form a majority, e.g., the method converges even if all clients are Byzantine except for the target one. We also tested the Byzantine robustness of our method on the mean estimation problem; see Appendix C.3.
W5. Inadequate experimental evaluation, e.g., on existing federated learning benchmarks or via ablation studies. This limits the understanding of the method's performance.
Q5. The paper conducts experiments on different tasks, such as mean estimation, image classification, and text classification, including pervasive domain application tasks. Yet, it does not thoroughly evaluate the method's performance against existing federated learning benchmarks. The three selected baseline approaches are old. Please benchmark the proposed approach against federated learning algorithms published after 2022.
We agree that there are many related works, but comparing with all of them in the experiments is infeasible. Therefore, as explained in lines 320-323, we limit our consideration only to the methods that satisfy the following two conditions: (i) the method should have theoretical convergence guarantees, and (ii) the method should be designed for the same problem as we consider. To the best of our knowledge, the only methods satisfying these two conditions are FedAdp, TAWT, and FedProx. We believe that such criteria are fair considering the scope of our work, and we would be happy to benchmark any suggested work from 2022 satisfying these criteria.
I am also wondering about the performance difference between the three methods presented in Section 3, for which it would be worth conducting an experiment and reporting the results.
These experiments were conducted and presented in Appendix C.1, “Results Without Additional Validation Dataset”, for “Approach 3: use training data”. These results are very similar to the ones we report in the main paper for “Approach 2: use additional validation data”. Finally, “Approach 1: use fresh data” is less practical and is therefore considered only for the mean estimation problem.
Thank you very much for your reply. I will keep my score this time but I believe this work will be significantly improved after a major revision.
Thank you for reading our responses. Could you kindly provide specific advice for the revisions? Additionally, could you let us know if we have addressed all your questions, comments, and concerns?
This work achieves collaborative learning among clients with both similar and different distributions by solving for the optimal weight allocation and provides a theoretical proof of its convergence. Numerous figures in the experiments visualize the dynamic changes in weights, demonstrating the effectiveness of MeritFed in image classification and text classification tasks.
Strengths
- The paper is well-written and clearly presents the dynamic process of weight allocation.
- The analysis of feasible solutions surrounding the weight optimization objective function (Problem 7) is detailed, with extensive experimental evidence for each feasible solution.
- Sufficient theoretical proofs are provided.
Weaknesses
- The computational overhead could limit the scalability of this method, as the experiments only involved 20 clients. As the number of clients increases, the computation for optimal weights will also increase significantly.
- The paper’s contributions emphasize that benefits can be obtained from clients with both similar and different data distributions; however, this aspect is not explained in the experimental analysis.
Questions
- Solving for an optimal set of weights for each client entails significant computational overhead. Providing additional details on computational and communication costs would make the experiments more complete.
- From the loss iteration graphs in Figures 4-7, it can be observed that larger values of α lead to greater fluctuations in loss. Is this fluctuation due to weight allocation changes? Please analyze.
- The weight allocation heatmaps in Figures 4-7 show that the weight distribution for MeritFed SMD and MeritFed MD becomes increasingly stable as iterations progress. Could weight optimization be conducted only in the early rounds, with fixed weights used in later stages?
- Please elaborate on the statement “not all collaborations are beneficial; some may even be detrimental,” and analyze how harmful clients could be assigned smaller weights.
The computational overhead could limit the scalability of this method, as the experiments only involved 20 clients. As the number of clients increases, the computation for optimal weights will also increase significantly.
We acknowledged certain limitations of our work, including scalability (see lines 527-534). In the Appendix, we also provide experiments with 40 workers for ResNet18 on CIFAR10. Having a much larger number of clients for the same problem can also be handled using partial participation of clients (client sampling), which is a standard approach in Federated Learning. Nevertheless, even without client sampling, we believe our experiments for the mean estimation with 150 clients illustrate that MeritFed works well even when the number of clients is much larger.
The paper’s contributions emphasize that benefits can be obtained from clients with both similar and different data distributions; however, this aspect is not explained in the experimental analysis.
We thank the reviewer for pointing out this aspect. However, lines 436-448 provide explanations of why benefits can be obtained from clients with both similar and different data distributions. We can expand these explanations upon the reviewer’s request.
Solving for an optimal set of weights for each client entails significant computational overhead. Providing additional details on computational and communication costs would make the experiments more complete.
In lines 271-274, we state that, similarly to our method, the main concurrent algorithms, FedAdp and TAWT, also require the server to store vectors. However, for modern servers, this is not an issue. Additionally, we emphasize that in our experiments, it was sufficient to perform just a few Mirror Descent steps to solve the auxiliary problem from Line 7 at each communication round (we missed including the exact number of MD steps for the experiments in the main part but have added these details to the revised version). Therefore, the computational overhead is not too large.
We also highlight that the validation dataset is stored only on the target client; there is no need to make it public or accessible to other clients. However, our approach requires a computation of products of the target gradient and other clients' gradients and a weighted sum to perform an MD step. This can possibly be done on the server side. In our work, we consider the case where the target client calculates these products. To address the privacy issue caused by the target client's access to all the gradients, we propose using zeroth-order MD, which requires communicating only function values and thus preserves privacy.
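To make the zeroth-order idea concrete, here is a minimal sketch (with our own hypothetical names and step sizes, not the paper's exact Algorithm 1) of a two-point zeroth-order mirror-descent step on the probability simplex: the server perturbs the candidate weights and asks the target client only for scalar loss values, so gradients never leave the target client.

```python
import numpy as np

def zo_md_step(w, query_loss, eta=0.05, delta=1e-4, rng=None):
    """One zeroth-order mirror-descent (exponentiated-gradient) step on the simplex.

    `query_loss(w)` models a round trip in which the server sends candidate
    aggregation weights and the target client replies with a single scalar
    (its validation loss), so no gradients leave the target client.
    """
    rng = np.random.default_rng() if rng is None else rng
    u = rng.standard_normal(len(w))
    u -= u.mean()  # perturbation tangent to the simplex (coordinates sum to 0)
    # Two-point finite-difference estimate of the directional derivative.
    # (The toy loss below tolerates queries slightly off the simplex.)
    d = (query_loss(w + delta * u) - query_loss(w - delta * u)) / (2 * delta)
    g_est = d * u                       # stochastic gradient estimate
    w_new = w * np.exp(-eta * g_est)    # entropic mirror step (stays positive)
    return w_new / w_new.sum()          # renormalize onto the simplex

if __name__ == "__main__":
    # Toy check: drive the weights toward a known optimum on the simplex.
    target = np.array([0.7, 0.2, 0.1])
    loss = lambda w: float(np.sum((w - target) ** 2))
    rng = np.random.default_rng(0)
    w = np.full(3, 1.0 / 3.0)
    for _ in range(3000):
        w = zo_md_step(w, loss, rng=rng)
    print(loss(w))  # should be far below the initial loss
```

The design point is that only `query_loss` values cross the network; the finite-difference trick reconstructs enough gradient information on the server side for the entropic update.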
From the loss iteration graphs in Figures 4-7, it can be observed that larger values of α lead to greater fluctuations in loss. Is this fluctuation due to weight allocation changes? Please analyze.
We believe that these fluctuations are due to stochasticity in the gradients and are not connected with the MeritFed framework since we do not observe similar behavior with other experiments, e.g. Figures 8-11.
The weight allocation heatmaps in Figures 4-7 show that the weight distribution for MeritFed SMD and MeritFed MD becomes increasingly stable as iterations progress. Could weight optimization be conducted only in the early rounds, with fixed weights used in later stages?
From a theoretical point of view, we cannot expect convergence better than that of Ideal SGD. From a practical point of view, the most illustrative Figure 13 shows that weights can evolve non-monotonically, making it challenging to determine a moment to switch to the second stage. However, to reduce the computation cost, one can optimize the aggregation weights only once per some number of communication rounds.
Please elaborate on the statement “not all collaborations are beneficial; some may even be detrimental,”...
Theorem 3.4 does not make any assumptions about the vectors received from workers whose data distribution differs from the target client's. This means that any such worker can send arbitrary vectors at each iteration, and MeritFed will still converge. Moreover, MeritFed can tolerate Byzantine attacks even if Byzantine workers form a majority; for example, the method converges even if all clients are Byzantine except for the target one.
Thank you for addressing my concerns about scalability. However, the supplementary experiment involving 40 clients, achieved by duplicating data from the initial 20 clients, significantly reduces heterogeneity among clients and lowers the task difficulty. Under such a setup, increasing the number of clients further would lose its meaning. In future work, I suggest exploring scenarios where the total data volume remains fixed while increasing the number of clients. This would make identifying beneficial clients in a more heterogeneous distribution meaningful. Based on the scalability concerns, I will maintain my initial score.
... and analyze how harmful clients could be assigned smaller weights.
While solving the auxiliary problem in Line 7, clients whose gradients improve the loss value are assigned greater weights. Even if a harmful client sends a gradient that decreases the loss, it is assigned a greater weight, since it improves training. Conversely, clients whose gradients increase the loss when aggregated are assigned lower weights. It is worth noting that, due to stochasticity, even stochastic gradients from the target client can increase the expected loss while decreasing the mini-batch loss; however, as in the case of standard SGD applied to a non-distributed problem, the loss decreases in expectation.
We also tested the Byzantine robustness of our method on the mean estimation problem; Figures 22-25 show that harmful clients are assigned smaller weights.
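For intuition, this mechanism can be sketched on a toy mean-estimation problem (a hedged illustration with our own names and constants, not the paper's exact method): one client reports gradients pointing toward a wrong mean, and exponentiated-gradient updates of the aggregation weights, driven by the target client's validation loss, push that client's weight toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, T = 5, 4, 200
mu = np.ones(d)        # true mean for the target distribution
harmful = 3            # index of a client holding irrelevant data

def client_grad(x, i):
    """Stochastic gradient of 0.5*||x - mean_i||^2; the harmful
    client's data is centred at a wrong mean."""
    mean_i = mu if i != harmful else -10.0 * np.ones(d)
    return (x - mean_i) + 0.1 * rng.standard_normal(d)

def val_loss_grad(x):
    """Gradient of the target client's validation loss 0.5*||x - mu||^2."""
    return x - mu

x = np.zeros(d)
w = np.full(n_clients, 1.0 / n_clients)  # aggregation weights on the simplex
gamma, eta = 0.1, 0.5                    # model / weight step sizes

for t in range(T):
    g = np.stack([client_grad(x, i) for i in range(n_clients)])
    # Trial step with current weights; the derivative of the validation loss
    # w.r.t. w_i is -gamma * <grad_val, g_i>, so the mirror step multiplies
    # w_i by exp(+eta * gamma * <grad_val, g_i>).
    grad_val = val_loss_grad(x - gamma * (w @ g))
    w = w * np.exp(eta * gamma * (g @ grad_val))
    w /= w.sum()
    x = x - gamma * (w @ g)              # aggregate with the updated weights

print(np.round(w, 3))  # the harmful client's weight should be near zero
```

Clients whose gradients reduce the target client's validation loss see their weights multiplied by a factor greater than one, while the harmful client's factor is below one, so its influence decays geometrically.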
Dear Reviewer XAQL,
We thank you once again for your time and feedback. We kindly ask you to let us know whether our responses addressed your concerns, questions, and comments. If you have any further comments or requests, we will be happy to reply to them promptly.
Authors
This paper proposes a merit-based federated learning algorithm (MeritFed) that addresses data heterogeneity in federated learning (FL) by solving an auxiliary minimization problem in each iteration to adaptively select aggregation weights. The core innovation of the algorithm lies in leveraging auxiliary optimization tasks to enhance the model’s generalization capability and robustness, thereby enabling it to better adapt to the diverse data distributions across different clients.
Strengths
S1: Extensive experimental analysis across multiple tasks.
S2: Sufficient theoretical proofs that ensure the algorithm’s convergence.
S3: A thorough analysis of the limitations of the study and potential future research directions.
S4: Code is open-sourced.
Weaknesses
W1: The related work looks like methodology.
W2: The paper lacks a detailed explanation of the main research focus (Algorithm 1).
Questions
Q1: The paper uses the variable x in several formulas, where it represents a d-dimensional vector, but what does it actually signify?
Q2: In Section 2.2, the authors propose three approaches to address the issue in line 7 of Algorithm 1. Do these methods cover most practical application scenarios, thereby making the algorithm more universally applicable?
The related work looks like methodology.
We would like the reviewer to elaborate on their concerns further, since we are not sure that we correctly understand the point. Our method is described in Section 2, while the related work provides a summary of closely related works and their relation to our approach. Such a structure is common; see [1-3,5,7] for examples.
W2: The paper lacks a detailed explanation of the main research focus (Algorithm 1).
We respectfully disagree with the reviewer, since we believe lines 051-087 sufficiently discuss our motivation and research focus. To support the claim, we kindly ask the reviewer to provide more details on what requires further clarification. We are committed to providing all necessary explanations as soon as possible. Regarding the particular use case of MeritFed, we refer to FL on medical image data (lines 068-071), which is supported by our experiments (lines 479-511).
Q1: The paper uses the variable x in several formulas, where it represents a d-dimensional vector, but what does it actually signify?
The vector x represents the trainable model parameters, e.g., the trainable parameters of a neural network. This is quite standard notation for the unknown parameters to be learned; see [1-6].
Q2: In Section 2.2, the authors propose three approaches to address the issue in line 7 of Algorithm 1. Do these methods cover most practical application scenarios, thereby making the algorithm more universally applicable?
Yes, they do. “Approach 3: use training data” is the most practical: it does not require any additional data and, at the same time, achieves comparable experimental performance, as we show in our extensive numerical experiments.
References
[1] Wang, Jianyu, et al. "Tackling the objective inconsistency problem in heterogeneous federated optimization." Advances in neural information processing systems 33 (2020): 7611-7623.
[2] Pathak, Reese, and Martin J. Wainwright. "FedSplit: An algorithmic framework for fast federated optimization." Advances in neural information processing systems 33 (2020): 7057-7066.
[3] Charles, Zachary, and Jakub Konečný. "Convergence and accuracy trade-offs in federated learning and meta-learning." International Conference on Artificial Intelligence and Statistics. PMLR, 2021.
[4] Reddi, Sashank J., et al. "Adaptive Federated Optimization." International Conference on Learning Representations, 2021.
[5] Lin, Tao, et al. "Ensemble distillation for robust model fusion in federated learning." Advances in neural information processing systems 33 (2020): 2351-2363.
[6] Chen, Wenlin, Samuel Horváth, and Peter Richtárik. "Optimal Client Sampling for Federated Learning." Transactions on Machine Learning Research.
[7] Li, Tian, et al. "Federated optimization in heterogeneous networks." Proceedings of Machine learning and systems 2 (2020): 429-450.
Dear Reviewer 4HCL,
We thank you once again for your time and feedback. We kindly ask you to let us know whether our responses addressed your concerns, questions, and comments. If you have any further comments or requests, we will be happy to reply to them promptly.
Authors
The paper tackles non-IID problem in federated learning by adjusting the aggregation weights dynamically. For the aggregated gradients at each communication round, the proposed method finds the optimal aggregation weights that can minimize the loss on the target client. Experiments on some designed non-IID scenarios show that the proposed method achieves dynamic similarity-based weight-assigning by assigning large weights to clients with similar data distributions and small weights to clients with distinct data distributions.
Strengths
- The targeted problem is an essential challenge in Personalized FL. Discovering friendships or similarities across clients is critical to advancing the research on Personalized FL.
- The overall composition of the paper is comprehensive, including problem formulation, theoretical analysis, and experimental analysis.
- The experimental analysis shows the proposed method can indeed find the inherent client distribution structure, which is achieved by dynamically assigning large weights to similar clients and small weights to distinct clients.
Weaknesses
- Lack of discussion about the overall FL framework. The proposed PFL objective function only focuses on the target client by leveraging external knowledge from other clients. Eq. 1 makes the problem formulation a transfer learning task rather than a federated learning one.
- The introduction section should be enhanced by adding a more detailed discussion about discovering structures among FL clients, thus providing better support for the motivation. In the meantime, the related work is too long, with irrelevant information. For example, communication compression or clustered FL is not related to the proposed topic.
- Missing some related works, e.g., graph/structure-guided aggregation in PFL [1-4].
[1] J. Ma, et al., "Structured Federated Learning through Clustered Additive Modeling," NeurIPS 2023.
[2] F. Chen, et al., "Personalized Federated Learning With A Graph," IJCAI 2022.
[3] C. Zhang, et al., "GPFedRec: Graph-Guided Personalization for Federated Recommendation," KDD 2024.
[4] Y. Tan, et al., "Federated Learning on Non-IID Graphs via Structural Knowledge Sharing," AAAI 2023.
Questions
Please refer to weakness.
Lack of discussion about the overall FL framework. The proposed PFL objective function only focuses on the targeting client by leveraging the external knowledge from other clients. Eq.1 makes the problem formulation toward a transfer learning task rather than a federated learning.
Following the reviewer's comment, we have added a discussion about the relation between the method proposed in our paper and transfer learning (please see Appendix A.1 in the revised version). Although there are some similarities between MeritFed and standard transfer learning, these two approaches are significantly different, and, in particular, transfer learning cannot be applied in the online PFL setup with (potentially harmful) clients/datasets.
The introduction section should be enhanced by adding a more detailed discussion about discovering structures among FL clients, thus providing better support to the motivation.
Standard Parallel SGD often leads to models that generalize across all clients rather than personalize to the specific data distributions and unique characteristics of individual clients, potentially resulting in suboptimal performance on personalized tasks. We have added this sentence to the introduction.
In the meantime, the related work is too long with irrelevant information. For example, communication compression or clustered FL is not related to the proposed topic.
We moved some parts of the related work to Appendix A.
Missing some related works, e.g. graph/structure-guided aggregation in PFL [1-4]
Indeed, the literature on FL is very rich, and many existing works are broadly related to our paper. To accommodate this fact, we decided to keep the discussion of closely related works in the main part of the paper and moved the discussion of other related works to Appendix A. We thank the reviewer for the relevant suggestions and kindly ask the reviewer to check the revised version of our manuscript.
Dear Reviewer Nbh4,
We thank you once again for your time and feedback. We kindly ask you to let us know whether our responses addressed your concerns, questions, and comments. If you have any further comments or requests, we will be happy to reply to them promptly.
Authors
We thank the reviewers for their feedback and time. We appreciate that the reviewers acknowledged the multiple strengths of our work. In particular, they write that
The paper is:
- well-written (Reviewer XAQL)
- comprehensive (Reviewer Nbh4)
- provides sufficient theoretical proofs (Reviewer 4HCL and XAQL)
- provides a rigorous convergence analysis (Reviewer tjtk)
- possesses extensive experimental analysis (Reviewer 4HCL and XAQL)
The method:
- converges under mild assumptions (Reviewer tjtk).
- can evolve to meet the demands of diverse applications (Reviewer tjtk)
- reliable compared to traditional FL approaches (Reviewer tjtk)
We have updated our manuscript, highlighting the changes in green. Additionally, we have addressed the reviewers' questions, comments, and concerns in individual responses, referencing line numbers from the revised manuscript. We remain committed to promptly addressing any further questions or comments and are happy to engage in back-and-forth discussions as needed.
The author discussion phase is coming to an end. We have provided detailed responses to the reviewers' comments. Could you confirm whether your concerns have been sufficiently addressed?
Thanks,
Authors
Dear Reviewers,
Thank you very much for your effort. As the discussion period is coming to an end, please acknowledge the author responses and adjust the rating if necessary.
Sincerely, AC
Dear Reviewers,
As you are aware, the discussion period has been extended until December 2. Therefore, I strongly urge you to participate in the discussion as soon as possible if you have not yet had the opportunity to read the authors' response and engage in a discussion with them. Thank you very much.
Sincerely, Area Chair
This paper proposes an algorithm that assigns adaptive aggregation weights to clients participating in federated learning. The reviewers agreed that this paper tackles a very important problem and provides a thorough theoretical analysis. However, no reviewers strongly supported the acceptance of this paper during the discussion period. The technical innovation of this paper is not very high according to the reviews and ratings. Thus, I recommend a reject.
Additional Comments on the Reviewer Discussion
The reviewers raised various concerns on presentation, novelty, and evaluation.
- I ignored most of Reviewer Nbh4's points, because there were no specific comments.
- It seems that the authors successfully addressed the presentation issues raised by Reviewer Nbh4.
- The other reviewers acknowledged the authors' rebuttal, but they kept the original rating.
Reject