Improving Generalization in Federated Learning with Highly Heterogeneous Data via Momentum-Based Stochastic Controlled Weight Averaging
Abstract
Reviews and Discussion
Generalization capability is critical for FL in real-world applications. This paper revisits the generalization problem in FL, focusing on the impact of data heterogeneity. The authors propose FedSWA, which uses Stochastic Weight Averaging (SWA) to find flatter minima, and FedMoSWA, a momentum-based variant designed to better align local and global models. Theoretical analysis provides convergence and generalization bounds for both algorithms, showing that FedMoSWA achieves smaller generalization errors than FedSAM and its variants. Empirical experiments on CIFAR-10/100 and Tiny ImageNet demonstrate the superiority of the proposed methods.
Questions for Authors
Can you provide a comparison to FedSMOO and FedGAMMA?
Claims and Evidence
yes
Methods and Evaluation Criteria
yes
Theoretical Claims
Some of them.
Experimental Design and Analysis
Lacking comparison to:
- FedSMOO: Dynamic regularized sharpness aware minimization in federated learning: Approaching global consistency and smooth landscape
- FedGAMMA: Federated learning with global sharpness-aware minimization
Supplementary Material
Yes, some of the proofs and loss visualizations.
Relation to Prior Literature
The paper tries to improve the generalization ability of FL in real-world scenarios.
Missing Important References
Lacking comparison to:
- FedSMOO: Dynamic regularized sharpness aware minimization in federated learning: Approaching global consistency and smooth landscape
- FedGAMMA: Federated learning with global sharpness-aware minimization
Other Strengths and Weaknesses
Strengths:
- Well written and well presented
- Good results, although lacking comparison to FedSMOO and FedGAMMA
- Extensive theoretical analysis
Weaknesses:
- The only major weakness is the lack of comparison to FedSMOO and FedGAMMA
Other Comments or Suggestions
NA
Thanks for your valuable comments. We will make the suggested modifications in the final version, and our point-by-point responses to your major comments are given below.
To address your concern, we conducted additional experiments on CIFAR-100 with ResNet-18 under different data heterogeneity levels (0.1, 0.3, 0.6) and compared the proposed algorithms with FedSMOO and FedGAMMA, as suggested by the reviewer. The reproduced results are based on the pre-activation ResNet18-GN network, which is widely used in federated learning algorithms such as FedSAM, FedGAMMA, and FedACG. The following results were produced under our framework, referencing the open-source code of FedSMOO and FedSAM. Because the FedGAMMA algorithm lacks open-source code, we implemented it based on the algorithm flow described in the paper and referenced the FedGAMMA implementation in FedLESAM's source code. We will include these results in the final version, and we will release our code and the federated learning framework to ensure reproducibility.
| CIFAR-100 (accuracy, %) | 0.1 | 0.3 | 0.6 |
|---|---|---|---|
| FedSMOO | 46.5 | 47.8 | 49.2 |
| FedGAMMA | 48.4 | 51.8 | 52.6 |
| FedSWA (ours) | 50.3 | 55.5 | 59.8 |
| FedMoSWA (ours) | 61.9 | 66.2 | 67.9 |
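For reference, the sketch below illustrates how a Dirichlet-based non-IID split of the kind behind such heterogeneity levels is typically generated; the function and its arguments (`dirichlet_partition`, `alpha`, `num_clients`) are illustrative only and are not taken from our actual data pipeline.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with Dirichlet(alpha) label skew.

    Smaller alpha (e.g. 0.1) concentrates each class on fewer clients
    (more heterogeneous); larger alpha (e.g. 0.6) is closer to i.i.d.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices
```

Smaller `alpha` values concentrate each class on fewer clients, which is why the 0.1 setting is harder than 0.6.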
Thanks! I have no further questions.
I would like to sincerely thank you for your thoughtful and detailed feedback on our paper. We greatly appreciate the time and effort you dedicated to reviewing our work. Your suggestions and comments were extremely helpful in improving the quality of our research, and we have carefully addressed them in our revised paper. We believe these revisions have strengthened our paper and made it more robust. Thank you once again for your invaluable input. We are grateful for your constructive criticism, which has been instrumental in improving the quality of our work.
Tackles generalization issues in FL with highly heterogeneous data.
Introduces a new momentum-based stochastic controlled weight averaging FL algorithm.
Questions for Authors
The source code of the method should be included.
Claims and Evidence
Provides some theoretical guarantees and empirical results.
Methods and Evaluation Criteria
Evaluations are conducted on different datasets and methods, but the improvement with this method is minimal.
Theoretical Claims
Provides both theoretical guarantees (convergence and generalization bounds) and solid empirical results.
Experimental Design and Analysis
Experiments are conducted on several datasets and models.
Supplementary Material
The supplementary material includes more theoretical analysis.
Relation to Prior Literature
Related to the problem of data heterogeneity.
Missing Important References
None.
Other Strengths and Weaknesses
The proposed methods (FedSWA and FedMoSWA) appear to offer incremental improvements over existing approaches like FedSAM and MoFedSAM. The results do not show consistent improvement.
Other Comments or Suggestions
The experiments focus primarily on benchmark datasets under simulated heterogeneity (Dirichlet-based splits). I'd like to see results on imbalanced CIFAR.
Thanks for your valuable comments. We will make the suggested modifications in the final version, and our point-by-point responses to your major comments are given below.
- Different from FedSAM, the proposed FedSWA algorithm offers a new way to improve FL generalization. By using SWA, FedSWA better maintains global flatness consistency than FedSAM, as shown in Figure 1. Moreover, FedMoSWA uses momentum stochastic control, unlike MoFedSAM, which uses only local momentum SAM. Both techniques (i.e., SWA and momentum stochastic control) significantly improve the theoretical results over federated learning algorithms such as FedSAM and MoFedSAM, as shown in Table 1.
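To make this difference concrete, the following is a minimal sketch of a local SWA pass (a running weight average over SGD iterates, in the spirit of standard SWA); it is an illustration only, and details such as the learning-rate schedule, averaging frequency, and normalization-layer handling may differ from our actual FedSWA implementation.

```python
import copy
import torch

def local_swa_update(model, loader, loss_fn, lr=0.01, avg_start=0, device="cpu"):
    """One client's local training pass that returns an SWA-averaged model.

    Plain SGD produces iterates w_1, ..., w_K; SWA returns their running
    average (here over all steps from `avg_start` on), which tends to land
    in a flatter region of the loss surface than the last iterate.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    swa_model, n_avg = copy.deepcopy(model), 0
    for step, (x, y) in enumerate(loader):
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
        if step >= avg_start:  # accumulate the running weight average
            n_avg += 1
            for p_avg, p in zip(swa_model.parameters(), model.parameters()):
                p_avg.data += (p.data - p_avg.data) / n_avg
    return swa_model  # sent back to the server instead of the last SGD iterate
```

The key point is that the client returns the averaged weights rather than the last SGD iterate.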
Experimental results: All the experimental results in the manuscript show that our algorithms achieve consistent improvements on both ResNet and Transformer networks. As shown in Tables 2 and 3 and the following table, FedSWA with the SWA optimizer (59.8%) outperforms FedAvg (54.2%) and FedSAM (47.8%) on ResNet-18 with CIFAR-100. Our FedMoSWA (67.9%) also surpasses MoFedSAM (60.1%). Both theoretical and experimental results show that FedSWA outperforms FedSAM, and FedMoSWA outperforms MoFedSAM.
| Algorithm | Accuracy (%) | Improvement (%) |
|---|---|---|
| FedAvg | 54.2 | - |
| SCAFFOLD | 54.1 | -0.1 |
| FedSAM | 47.8 | -6.4 |
| MoFedSAM | 60.1 | +5.9 |
| Our FedSWA | 59.8 | +5.6 |
| Our FedMoSWA | 67.9 | +13.7 |
Our theoretical results in Table 1 show that the generalization bound of our FedSWA is correlated with data heterogeneity and is superior to those of FedSAM and MoFedSAM. We also prove that the generalization error of FedMoSWA is better than those of FedSAM, MoFedSAM, and FedSWA.
- To address your concern, we conducted additional experiments on CIFAR-100 with ResNet-18, following the pathological imbalanced settings from the FedSMOO and FedLESAM papers, where each client has at most 10 classes. This represents a highly imbalanced data distribution. These results will be included in our final paper.
Our FedSWA algorithm outperforms FedAvg and FedSAM. FedSWA uses SWA as the local optimizer, while FedAvg uses SGD and FedSAM uses SAM; none of them employ variance reduction or momentum acceleration techniques. As the advanced version of FedSWA, our FedMoSWA, which also incorporates momentum and variance reduction, surpasses the other algorithms in all settings.
| Algorithm | Accuracy (%) | Improvement (%) |
|---|---|---|
| FedAvg | 42.7 | - |
| FedDyn | 49.1 | +6.4 |
| SCAFFOLD | 43.1 | +0.4 |
| FedSAM | 41.2 | -1.5 |
| MoFedSAM | 45.6 | +2.9 |
| FedLESAM | 44.3 | +1.6 |
| FedACG | 52.6 | +9.9 |
| FedSWA (ours) | 48.3 | +5.6 |
| FedMoSWA (ours) | 55.5 | +12.8 |
- In fact, our source code was included in the supplementary materials. We will also release our code and the federated learning framework to ensure reproducibility.
This paper proposes two novel algorithms for improving generalization in federated learning. The first, FedSWA, is a variant of FedAvg with stochastic weight averaging, a method known for finding flatter minima. The second, FedMoSWA, extends FedSWA with control variates that are updated using momentum, in order to handle heterogeneity. Both approaches are studied theoretically, showing better generalization guarantees than FedSAM and MoFedSAM. Extensive numerical experiments confirm these findings.
Questions for Authors
In the end, the generalization error still depends on heterogeneity. Surprisingly, this term does not disappear when taking only one local training step. Is it an artefact of the analysis, or is there something fundamental that remains even when doing only one local step?
Claims and Evidence
Theoretical claims are provided with full proofs. Numerical claims are supported by intuitive explanations as well as thorough comparisons with many existing methods on multiple datasets and in multiple settings.
Methods and Evaluation Criteria
Benchmark datasets make sense, as well as the splits used to emulate heterogeneity.
Theoretical Claims
The proofs for optimization errors seem correct, and seem to be largely inspired by the proofs from Karimireddy et al., 2020.
I am less familiar with analyses of generalization error and did not check them in detail, but the results seem correct.
Experimental Design and Analysis
Experimental analyses are quite extensive, with comparison with many other algorithms on multiple datasets.
It seems that the only hyperparameters that were tuned are the client learning rate, global step size, momentum step size, and local learning rate decay for FedSWA and FedMoSWA. In particular, it is not clear to me whether the hyperparameters of other methods have been tuned or not, which could make the comparison unfair.
More specifically, the client selection rate is set to an arbitrary value. To my knowledge, most of the baselines (e.g. Scaffold) are known to underperform when selecting only a fraction of the clients at each round of communication. There are thus two major differences between Scaffold and FedMoSWA that should be studied in isolation: (i) the use of stochastic weight averaging, and (ii) momentum in the control variate updates. It is therefore not clear whether the superiority observed in experiments is due to improvement (i) or (ii). This is a bit concerning, especially seeing that FedSWA is closer to the baselines in Tables 2 and 3, suggesting that the improvement over methods like Scaffold may come from the momentum stochastic control. Performing experiments with full participation would clarify this question.
Supplementary Material
I skimmed through the supplementary material and did not identify errors.
Relation to Prior Literature
Related scientific literature is widely and appropriately discussed.
Missing Important References
Not to my knowledge.
Other Strengths and Weaknesses
Strengths
- The two proposed methods are shown to achieve better generalization error theoretically and numerically.
- The paper is very well written, with precise discussion describing intuition about the studied phenomena.
- FedMoSWA is shown to outperform many baselines numerically.
Weaknesses
- There may be a lack of precision in the numerical analysis, which does not make it possible to distinguish whether FedMoSWA outperforms baselines due to SWA or due to the use of momentum in the control variates (see the Experimental Design or Analyses section).
- The differences in optimization errors, notably between the methods presented in Table 1, are not discussed. Providing a discussion on this, specifically on whether it should have an impact on the results, would greatly improve this part of the paper.
Other Comments or Suggestions
The difference between FedSWA and FedMoSWA in Algorithm 1 is only shown using different colors. This is a problem when reading the paper in black and white or for colorblind readers: the difference should be indicated in another way (on top of using colors) for accessibility reasons.
We sincerely thank you for your valuable comments; our point-by-point responses to your major comments are given below.
- For Experimental Design or Analyses: In fact, we performed hyperparameter tuning for all the algorithms, and we also followed the parameter settings from their original papers. We conducted full client participation experiments on CIFAR-100 with 10 clients and 300 rounds, where the model is ResNet-18. Here, FedMo denotes FedMoSWA without SWA (i) but only with Momentum Stochastic Control (ii), as shown in the following table (a minimal sketch of this momentum control update is given after the table). FedMo (62.5%) outperforms SCAFFOLD (59.9%) by 2.6%. With 100 clients and 10% participation, our FedMoSWA achieves 3.3% higher accuracy than SCAFFOLD, addressing the update delay issue with partial client participation.
Full client participation on CIFAR-100 and ResNet-18 with 10 clients and 300 rounds:
| Algorithm | Accuracy (%) | Improvement (%) |
|---|---|---|
| FedAvg | 58.2 | - |
| FedDyn | 58.5 | +0.3 |
| SCAFFOLD | 59.9 | +1.7 |
| FedSAM | 48.3 | -9.9 |
| MoFedSAM | 37.9 | -20.3 |
| FedLESAM | 59.2 | +1.0 |
| FedACG | 60.9 | +2.7 |
| Our FedSWA (i) | 60.2 | +2.0 |
| Our FedMo (ii) | 62.5 | +4.3 |
| Our FedMoSWA (i)+(ii) | 63.2 | +5.0 |
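The sketch below illustrates the momentum stochastic control in (ii): a SCAFFOLD-style server update in which the averaged client control-variate deltas are smoothed by a momentum buffer before being applied. It is a simplified illustration with illustrative names (`beta`, `momentum_buf`, flattened parameter vectors); the precise FedMoSWA update is the one given in Algorithm 1 of our paper.

```python
import torch

@torch.no_grad()
def momentum_control_variate_update(global_c, momentum_buf, client_c_deltas, beta=0.9):
    """Illustrative server-side momentum update of the global control variate.

    Vanilla SCAFFOLD adds the plain average of the participating clients'
    control-variate deltas to the global control variate; the momentum
    variant sketched here first smooths that average with an exponential
    moving average, so corrections computed from a small sampled subset of
    clients are less noisy across rounds.
    """
    avg_delta = torch.stack(client_c_deltas).mean(dim=0)  # mean over sampled clients
    momentum_buf = beta * momentum_buf + (1.0 - beta) * avg_delta
    return global_c + momentum_buf, momentum_buf
```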
- For Weaknesses 1: From Tables 1 and 4 in our manuscript and the following table, FedSWA using only SWA (59.8%) outperforms FedAvg (54.2%) and FedSAM (47.8%) on ResNet-18 with CIFAR-100, and FedMo using only momentum stochastic control achieves 65.9%, higher than SCAFFOLD (54.1%), demonstrating that momentum variance reduction mitigates SCAFFOLD's variance-reduction delay. By combining SWA and momentum variance reduction, FedMoSWA achieves 67.9%, showing that both SWA (+5.6%) and momentum control (+11.7%) are effective, with momentum control having the greater impact. Moreover, FedSWA is a simple algorithm like FedSAM and can be combined with other techniques. We will clarify this in our final paper.
10% client participation on CIFAR-100 and ResNet-18 with 100 clients and 1000 rounds:
| Algorithm | Accuracy (%) | Improvement (%) |
|---|---|---|
| FedAvg | 54.2 | - |
| SCAFFOLD | 54.1 | -0.1 |
| FedSAM | 47.8 | -6.4 |
| MoFedSAM | 60.1 | +5.9 |
| FedSWA (ours) | 59.8 | +5.6 |
| FedMo (ours) | 65.9 | +11.7 |
| FedMoSWA (ours) | 67.9 | +13.7 |
- For Weaknesses 2: To address your concern, we will add a discussion of the optimization error analysis in our final paper. In fact, Section 5.2 already discusses the optimization error analysis, showing that FedMoSWA converges faster than the best-known algorithm, SCAFFOLD, and outperforms both FedSWA and other baselines such as FedSAM, FedAvg, and MoFedSAM. Unlike MoFedSAM, which uses only local momentum, our FedMoSWA employs momentum variance reduction. Additionally, our FedSWA also converges faster than both FedSAM and FedAvg, as shown in Table 1. We will include all these discussions in the Introduction section of our final paper.
- For Other Comments or Suggestions: To address your concern, we will improve the final version of the paper and use italics and bold to distinguish the two algorithms.
- For Questions: In our analysis, data heterogeneity does not vanish when performing only one local training step. This is not purely an artifact of the analysis but a fundamental characteristic of federated learning algorithms, driven by the inherent bias introduced by local training on heterogeneous data. This bias does not completely vanish with only one local training step. Our future work will focus on addressing this issue.
Thank you for your answer. I remain overall positive about this work, and will keep my score to 3.
"Here, FedMo denotes FedMoSWA without SWA (i) but only with Momentum Stochastic Control (ii), as shown in the following table. FedMo (62.5%) outperforms SCAFFOLD (59.9%) by 2.6%. With 100 clients and 10% participation, our FedMoSWA achieves 3.3% higher accuracy than SCAFFOLD, addressing the update delay issue with partial client participation."
Indeed, SWA gives a slight additional performance bonus, although it seems that most of the improvement comes from the momentum in the end.
"In our analysis, data heterogeneity does not vanish when performing only one local training step. It is not purely an artifact of analysis but a fundamental characteristic of federated learning algorithms, driven by the inherent bias introduced by local training on heterogeneous data. This bias does not completely vanish with only one local training step. "
While I agree that this is a "fundamental characteristic of federated learning algorithms", it should not have any impact when only one local training step is used: in this setting, all algorithms boil down to SGD on the averaged loss, which does not suffer from heterogeneity. This suggests that there may be a flaw in the analysis (which could be fixed in future work).
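Concretely (with illustrative notation: $N$ clients with local objectives $f_i$, shared iterate $x^t$, and a single local step of size $\eta$):

$$x^{t+1} \;=\; \frac{1}{N}\sum_{i=1}^{N}\bigl(x^t - \eta \nabla f_i(x^t)\bigr) \;=\; x^t - \eta\,\nabla\Bigl(\tfrac{1}{N}\textstyle\sum_{i=1}^{N} f_i\Bigr)(x^t),$$

i.e., exactly one (stochastic) gradient step on the averaged objective, with no client-drift term left.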
Thanks to the reviewer for the follow-up comments, which were very helpful, especially the second question.
For the first question: the essence of the FedSWA algorithm is to find a flatter minimum, and we experimentally demonstrate that FedSWA finds a flatter solution with better generalization than FedSAM and its variants. However, FedSWA does not speed up the optimization process the way momentum or variance reduction does, so we propose FedMoSWA, which combines momentum and variance reduction to accelerate optimization. When analyzing an optimization algorithm, we can consider two aspects: the speed of the optimization process and the generalization ability of the algorithm. FedSWA mainly addresses generalization and can be used in conjunction with other acceleration methods such as SCAFFOLD and FedACG; we propose FedMoSWA to accelerate the convergence of FedSWA.
For the second issue: when the local iteration step is 1, the theoretical analysis cannot eliminate the influence of data heterogeneity, which seems to be a limitation of this theory. The same problem is encountered in [1,2], and the generalization error bounds of our algorithms are better than those of both of these works. In the future, we will study new stability theories to address this limitation. Under the independent and identically distributed (i.i.d.) setting, our results are consistent with the classical results of stochastic gradient descent (SGD).
[1] Sun, Z., Niu, X., and Wei, E. Understanding generalization of federated learning via stability: Heterogeneity matters. In International Conference on Artificial Intelligence and Statistics, PMLR, 2024, pp. 676-684.
[2] Sun, Y., Shen, L., and Tao, D. Understanding how consistency works in federated learning via stage-wise relaxed initialization. Advances in Neural Information Processing Systems, 36: 80543-80574, 2023.
The paper presents valuable contributions to the field of federated learning, particularly in addressing the generalization issue in highly heterogeneous data settings. The novel algorithms and theoretical analysis provide a solid foundation for further research. However, the improvements are sometimes incremental, and there are minor concerns regarding experimental design and theoretical limitations.