Efficient Federated Learning against Heterogeneous and Non-stationary Client Unavailability
We propose FedAWE, an efficient algorithm for handling heterogeneous and non-stationary client unavailability in federated learning, achieving linear speedup and outperforming state-of-the-art methods in experiments over diversified dynamics.
Abstract
Reviews and Discussion
To address intermittent client availability, the authors study heterogeneous and non-stationary client availability, highlighting the significant impact of such heterogeneity using FedAvg. They propose FedAWE, which (i) compensates for missed computations due to unavailability and (ii) evenly diffuses local updates through implicit gossiping, despite non-stationary dynamics. The authors demonstrate that FedAWE converges to a stationary point for non-convex objectives while achieving the desired linear speedup property, supported by numerical experiments on real-world datasets with varied client unavailability dynamics.
Strengths
- The authors propose a new algorithm to address the issue of intermittent client availability.
- The paper is easy to follow, and a large number of unavailability-dynamics cases are used for validation.
Weaknesses
- The proposed algorithm is intuitive. The weight of the model update for each client is modified such that the longer the interval since the client's last participation, the larger the weight on that client's model update, with the weight proportional to the interval length. However, when the client participation rate is very low (e.g., 1% of clients participate randomly), the expected interval is 100 rounds, so the update is magnified by a factor of 100. This is somewhat unreasonable.
- For the rate of convergence in Corollary 1, the last term on the right-hand side apparently does not show a linear speedup with respect to the number of local steps and the number of active clients.
- In the rate of convergence in Corollary 1, it appears that one term combines two variables with different "physical dimensions", which suggests that there might be an issue with the proof. For example, given the "physical dimension" of the learning rate, some combinations are reasonable, while others might be incorrect.
- It is not reasonable to see the performance of "FedAvg over all clients" being much lower than that of "FedAvg over active clients" in Table 1.
Questions
Can the authors provide an in-depth analysis on the algorithm's complexity?
Limitations
The authors have discussed the limitations of the algorithm.
We are truly grateful for your appreciation for our paper flow and extensive numerical experiments. We believe that many of your concerns can be easily addressed, and we apologize for our lack of clarity on these points. We hope that our response can address your concerns, and the insights discussed here will be incorporated into the camera-ready version if the paper gets accepted.
W1: on gradient amplification. We note that solving the issue of client unavailability by amplifying local gradients is widely adopted in the prior literature [19,27,53,52,40]. As pointed out by [13], when the exact availability distribution is known, the most intuitive approach is to collect only the fastest responses and re-weight them according to the response probabilities. Thus, like our method, these weights can be very large when the participation rate is very low. Although somewhat counterintuitive, as suggested by the numerical results in our manuscript and in [13,53], these types of algorithms perform well in practice.
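To make this concrete, here is a minimal simulation (with hypothetical availability probabilities and synthetic "updates", not our experimental setup) showing that amplifying each contribution by the inverse of its availability probability yields an unbiased estimate of the all-client average, even when some probabilities are as low as 1%:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_clients, n_rounds = 5, 10, 20_000
p = rng.uniform(0.01, 1.0, size=n_clients)       # hypothetical per-client availability probabilities
updates = rng.normal(size=(n_clients, d))        # stand-ins for the clients' expected local updates
target = updates.mean(axis=0)                    # the unbiased average over all clients

plain = np.zeros(d)      # average over active clients only (no amplification)
amplified = np.zeros(d)  # 1/p-amplified aggregate, normalized by the total client count
for _ in range(n_rounds):
    active = rng.random(n_clients) < p
    if active.any():
        plain += updates[active].mean(axis=0)
    amplified += (updates[active] / p[active, None]).sum(axis=0) / n_clients

print(np.linalg.norm(plain / n_rounds - target))      # biased toward frequently available clients
print(np.linalg.norm(amplified / n_rounds - target))  # should be close to zero (unbiased in expectation)
```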
W1: on intuitive algorithm (novelty). We emphasize that our contribution is highly non-trivial. For example, the counterexample below shows that simply modifying the weights according to the participation interval, without postponing the broadcast, will not debias FedAvg even under stationary availability. The proof is skipped due to space constraints. It highlights the need for both of the two algorithmic components in Section 5.
[Counterexample]. Suppose that we have two clients such that client #1 is always available, while client #2 is available only in odd rounds. Each client holds a simple local objective and computes 1 local step. The server broadcasts as in FedAvg [30] and aggregates local updates with weights modified according to the participation interval. For a suitable choice of local objectives and learning rate, the resulting iterates remain biased away from the minimizer of the global objective.
W2: on speedup in the last term. We achieve the desired speedup property for a sufficiently large number of training rounds, consistent with the prior literature [52,53,60]. In the special case where clients participate u.a.r., the convergence rate in (16) reduces to a two-term bound. If we ignore the constants hidden behind the asymptotic bound, the first term dominates once the number of training rounds is sufficiently large, and the convergence bound reduces further to that dominant term. We can then see that the R.H.S. of the bound indicates a clear speedup w.r.t. the number of local steps and the number of sampled clients. We will make this clear in the camera-ready version if accepted.
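For concreteness, the dominance argument can be illustrated with a generic two-term rate of the form commonly seen in this literature; the constants c_1 and c_2, and the symbols M (sampled clients), K (local steps) and T (rounds), are placeholders for this sketch and are not the exact quantities in (16):

```latex
% Generic illustration with M sampled clients, K local steps and T rounds (placeholders):
\[
  \frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\big\|\nabla F(\bar{x}^{t})\big\|^{2}
  \;\le\; \frac{c_1}{\sqrt{MKT}} \;+\; \frac{c_2}{T}.
\]
% The first term dominates whenever
\[
  \frac{c_1}{\sqrt{MKT}} \;\ge\; \frac{c_2}{T}
  \quad\Longleftrightarrow\quad
  T \;\ge\; \Big(\tfrac{c_2}{c_1}\Big)^{2} MK,
\]
% in which case the overall rate is O(1/\sqrt{MKT}), i.e., linear speedup in MK.
```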
W3: on the appearance of the Lipschitz constant in Corollary 1. We note that its appearance in Corollary 1 is not a glitch in our proof but a testament that we account for the key parameters in the assumptions. Although we agree with the reviewer that, with the chosen learning rate, the term in question would be at a constant level, it is not uncommon to find the Lipschitz constant in the final bound in the prior literature; see, e.g., Theorems 1-3 in [27], Corollary 4 in [53], and Theorem 3 in [R1].
Technically speaking, the appearance of this term results from Assumption 4. This assumption is more general than the popular assumptions on gradient divergence in the prior literature; see details in Table 2 on page 16. Then, together with Assumption 2, Proposition 3 on page 26 holds. Notice that the coefficient of the first term on its R.H.S. is a standalone constant instead of a coupling of quantities. Consequently, we see a standalone Lipschitz constant, for example, when we apply Proposition 3 to the first term in the second row of Lemma 3.
W4: on the performance of FedAvg (all). In the numerical experiments, the phenomenon that FedAvg (all) lags behind FedAvg (active) has been observed under stationary availability in [53] and, more significantly, in [13, Fig. 2] (FedAvg with device sampling). Non-stationary dynamics brings in more uncertainty, so it is expected that the gap can become even larger. We recall that FedAvg (all) differs from FedAvg (active) only in the gradient aggregation rule.
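As a rough, hypothetical illustration (the exact aggregation rules are defined in the manuscript; the normalizations below are assumptions made for this sketch), one common way such rules differ is the normalizing count, which alone can explain a large gap under low participation:

```python
import numpy as np

def aggregate(updates, active, rule):
    """Toy one-round aggregation of client updates; 'rule' picks the hypothetical normalization."""
    s = updates[active].sum(axis=0)
    if rule == "all":                       # normalize by the total number of clients
        return s / len(active)
    return s / max(int(active.sum()), 1)    # normalize by the number of responding clients

rng = np.random.default_rng(1)
n_clients, d = 100, 3
updates = rng.normal(size=(n_clients, d))
active = rng.random(n_clients) < 0.05       # only ~5% of clients respond this round
print(np.linalg.norm(aggregate(updates, active, "all")))     # effective step shrinks with participation
print(np.linalg.norm(aggregate(updates, active, "active")))  # scale roughly independent of participation
```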
Question: provide an in-depth analysis of the algorithm's complexity. We have briefly discussed the memory and computational complexity in lines 160-165 on page 4, but we are happy to address it in more detail here. A detailed complexity comparison of the baseline algorithms, including ours, can be found in Table 1 in the pdf. Since our algorithm does not draw new stochastic samples for training, there is no increase in sample complexity. It takes one extra memory unit per client to store that client's last active round. The weight echoing step incurs only a small amount of additional computation to adjust the weight of the gradient accumulation.
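To make the overhead concrete, here is a minimal sketch of the extra server-side bookkeeping; the names are hypothetical and this is not our exact pseudocode, but it shows that the cost is one stored integer per client plus one subtraction and one multiplication when a client returns:

```python
# Hypothetical bookkeeping for the echoing weight; names are illustrative, not the paper's pseudocode.
last_active = {}  # client_id -> last round in which the client reported back (one integer per client)

def echoing_weight(client_id, current_round):
    """Number of rounds since the client last participated (at least 1)."""
    prev = last_active.get(client_id, current_round - 1)
    last_active[client_id] = current_round
    return current_round - prev

# Usage in round t: scale client cid's accumulated local update by its echoing weight, e.g.
#   aggregate += echoing_weight(cid, t) * client_update[cid]
```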
References:
[R1] Fatkhullin, I. et al. (2023). Momentum provably improves error feedback!. Advances in Neural Information Processing Systems, 36.
I would like to thank the authors for providing the response, but I still have concerns regarding the gradient amplification and the appearance of the Lipschitz constant term.
- Gradient amplification: First, the authors provide some references, but these references are vague and offer limited relevant information on this specific point. For instance, [19] is a review, while [27, 52] focus on the convergence analysis of FedAvg. Second, in more extreme cases where the client participation rate drops to 1 in 1,000 or even 1 in 10,000, the gradient would be amplified by a factor of one thousand or ten thousand, which seems clearly unreasonable. The authors merely assert that their experiments are valid; however, they do not discuss or test this specific issue in their experimental analysis.
- Appearance of the Lipschitz constant term: More importantly, the authors acknowledge the point about the term's physical dimension, yet instead of resolving it they attempt to justify the reasonableness of the term by citing other references. However, upon reviewing the provided literature [27, 53, R1], I found no terms identical or even similar to it. Therefore, if this issue is not resolved, I do not think this paper is in good shape for publication.
Dear Reviewer i6Qf:
Thanks for your response. We apologize for any lack of clarity in the previous response due to insufficient space and hope that our further elaborations can address your concerns.
On gradient amplification.
- First, we are happy to elaborate on the details of the cited references to corroborate our claim on gradient amplification.
- [53] tackles stationary client unavailability, where the accumulated local gradients of each client are amplified by a factor obtained from a separate online estimation scheme.
- [40] generalizes FedAvg in Algorithm 1 by amplifying the accumulation of local gradients according to the corresponding response probabilities.
- [19], as a classic survey in the federated learning community, points out open challenges in mitigating bias. For example, Section 7.2.3 on page 84 in [arXiv:1912.04977] mentions in the second bullet point that ''If the expected rate of contribution depends on factors outside our control, ..., one can correct for bias by scaling up or down contributions from devices depending on their stratum.''
- [52] studies non-stochastic client unavailability and proposes a generalized FedAvg algorithm, which is different from vanilla FedAvg. In line 12 of its pseudocode, the accumulation of local gradients is periodically amplified by a corresponding factor.
- [27], although it studies partial device participation with known probabilities in Section 3.3, does not touch on gradient amplification. We are willing to remove this reference from the response.
- Second, we are running additional experiments under extremely small participation probabilities per your request. We will update the experimental results in a separate response or, in the worst case, in the camera-ready version if accepted, as we are less than one day away from the end of the author-reviewer discussion period.
- In our submission, the contributing weight of each client's local gradient accumulation is not the echoing weight alone but its coupling with the global learning rate in line 10 of Algorithm 1.
- For empirical success, the negative impact of extremely long unavailability durations can potentially be mitigated through a carefully tuned global learning rate. Alternatively, we conjecture that clipping the unavailable duration at a threshold, as in FedAU [53], might help (a one-line sketch follows below). It is also interesting to explore this direction theoretically in the future.
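For completeness, the clipping conjecture above amounts to a one-line change; the helper name and the default cap below are hypothetical:

```python
def clipped_echoing_weight(rounds_since_last_active, cap=50):
    """Clip the echoing weight at a tunable threshold (cap is a hypothetical tuning knob)."""
    return min(max(int(rounds_since_last_active), 1), cap)
```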
On the appearance of the Lipschitz constant. After an additional careful review of our proofs, we did not identify any issue that affects our presentation of Theorem 1 and Corollary 1.
However, we discovered an interesting fact that allows us to remove the standalone Lipschitz constant by following the same proof outline and learning rate conditions, with slightly different simplification techniques in lines 789 and 792 on page 34 and line 807 on page 37. The results hold in parallel to those in our initial submission, and we will incorporate the complete results, together with their proofs, into the camera-ready version if accepted.
For ease of presentation, we discuss only the leading asymptotics. By Proposition 3, the relevant term in lines 789, 792 and 806 can be expanded into three terms. Since the first two terms share the same order, we are interested only in the last consensus error term and its coefficient (for line 807, no rearrangement is needed).
- Sketch of the new derivation. Directly plugging Lemma 6 on page 29 into lines 789 and 792 to bound the consensus error term yields coefficients of one order higher in the learning rate. During the rearrangement in line 799, that order is reduced by one. Line 807 can be shown analogously. Choosing the learning rate accordingly for a sufficiently large number of training rounds then removes the standalone Lipschitz constant.
- Sketch of the original derivation. First, we apply the learning rate condition in (10) to simplify the coefficients of the respective terms in the consensus error bound (Lemma 6) on page 29. Then, the simplified consensus error bound is plugged into lines 789, 792 and 806. The resulting coefficients remain proportional to the Lipschitz constant (similarly for line 807). Therefore, the Lipschitz constant remains after the rearrangement (a one-order reduction) and the choice of learning rate.
We appreciate the reviewer's effort to check the provided literature in detail. We believe that the disagreement was only due to miscommunication, as we tried to justify the appearance of the Lipschitz constant itself rather than the combined term in question.
Best regards,
Authors of Submission 449
Thank you for your detailed response. I believe the concern regarding gradient amplification can be addressed. Regarding the appearance of the Lipschitz constant term, the authors have made efforts to resolve it by revising the proof, and I hope it can be fixed.
Dear Reviewer i6Qf,
Thank you for your confirmation. We appreciate it.
Best regards,
Authors of Submission 449
The work studies heterogeneous and non-stationary client availability in FL, showing that it can have a significant impact on the performance of FedAvg. As a solution, the work presents the FedAWE algorithm, which compensates for the missed computations due to unavailability with minimal additional overhead w.r.t. FedAvg and is agnostic to non-stationary participation dynamics. The work presents a convergence proof in the non-convex setting and numerical experiments to showcase the problem and assess the effectiveness of the proposed solution.
Strengths
- Importance of the research problem. The work tackles an important and often overlooked problem, namely heterogeneous and non-stationary client participation patterns. The research problem is particularly important given that in FL clients are edge devices, which are heterogeneous in computing capabilities and thus in availability patterns.
- Simple, principled and convenient approach. The proposed approach is simple to implement following the details in Algorithm 1, since it only involves the calculation of a discount factor on the clients' pseudo-gradients and postponing the sending of the updated model.
- Good results. The provided experimental results underscore the effectiveness of the approach in addressing both non-stationary and heterogeneous client availability. A plus is that the authors provide the code.
Weaknesses
The biggest weakness of the work is, in my opinion, the presentation. I found the writing not clear and straightforward enough for the reader to grasp the intuitions behind the technical innovations of the proposed approach. There are too many pointers to the supplementary material, and in most cases it seems to be material that belongs in the main paper (e.g., line 131). Also, the implications of the technical lemmas are often poorly discussed (e.g., Propositions 1-2, Lemmas 1-3), leaving the reader confused about the need for that particular piece of information.
I kindly suggest the authors revise the writing of the manuscript to better highlight the most important information to be conveyed, and to avoid continuous pointers to the appendix as much as possible.
Questions
In Section 7, heterogeneous client data distributions are simulated, as is common in the FL literature. The authors then simulate the availability probability of each client as a function of a term modeling the non-stationary dynamics and a term characterizing the importance of each image class. Why is the availability probability defined from the image class distribution, given that it models the probability of being sampled? I would expect the data distribution and the availability to be uncorrelated, so perhaps the availability probabilities could instead be sampled from a Dirichlet distribution, whose concentration parameter decides how heterogeneous the sampling probabilities of the clients are.
Limitations
Limitations of the approach are not sufficiently discussed in section 8.
Thank you for your appreciation for the motivation, principled algorithm and good results of the paper. We hope that our response can address your concerns.
W1: on pointers to the supplementary material. We apologize for not being clear enough in the main text. If the paper gets accepted, we will have one additional page to accommodate those pointers in the main text, including but not limited to line 131, line 224 and line 241 per your suggestions.
W2: on implications of the technical lemmas. In the manuscript, each technical lemma is accompanied by brief illustrations either before or after it, but they may not be clear enough. We are happy to elaborate the implications here in detail.
- Proposition 1 indicates that, when a client becomes available, the weight echoing procedure equalizes the number of local improvements with that of the other available peers. It captures the algorithmic intuition to allow the unavailable clients to catch up to missed computations.
- Proposition 2 builds the connection between the actual iterates and the auxiliary sequence. At a high level, it says that the approximation error stays bounded as long as the gradient norm and the consensus error of the auxiliary sequence are bounded. Therefore, the problem boils down to bounding those two terms.
- Lemma 1 is well known in the fully decentralized learning literature for characterizing information mixing [33,32,49]. The spectral norm is expected to be strictly smaller than 1 to ensure exponential decay, which we have shown in Lemma 4 (see the generic mixing inequality sketched after this list).
- Lemma 2 is an intermediate result to characterize the statistical properties of the unavailable duration. It indicates that despite the lack of structure in the non-stationary dynamics, the expected echoing weight remains finite.
- Lemma 3 quantifies the progress of the global objective per global round. Similarly to Proposition 2, it points out the need for the other technical lemmas, for example, Lemma 2 for the second term, Proposition 2 for the approximation error.
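For intuition on the Lemma 1 bullet above, the textbook mixing inequality from decentralized learning reads as follows; this is the generic statement for a doubly stochastic mixing matrix, not the exact form of our Lemma 1:

```latex
% W: doubly stochastic mixing matrix; X: client states stacked as rows; n: number of clients.
\[
  \big\| W X - \bar{X} \big\|_F \;\le\; \rho \,\big\| X - \bar{X} \big\|_F,
  \qquad
  \bar{X} := \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} X,
  \qquad
  \rho := \big\| W - \tfrac{1}{n}\mathbf{1}\mathbf{1}^{\top} \big\|_2 ,
\]
% so, when rho < 1, t mixing steps contract the consensus error by a factor of rho^t.
```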
Again, we will make sure to improve the presentation if the paper gets accepted.
Q1: why is the availability probability defined from the image class distribution. We note that correlating the local data distribution and the probability of client availability is a common practice in the prior literature. For example, MIFA [13] experiments with a construction in which clients that hold images of smaller digits participate less frequently. Similarly to our construction, FedAU [53] defines the availability probability as an inner product of the client's local data distribution and an external distribution.
Moreover, we recall that the bias caused by client unavailability fundamentally arises from heterogeneous local data distributions (a.k.a. non-i.i.d. local data). As we have demonstrated in Fig. 2, Section 4, the expected output of FedAvg can be far away from the global optimum under non-i.i.d. local data. Intuitively, when clients hold homogeneous local data, the bias will be much milder even under heterogeneous availability, because every client becomes interchangeable when the local data distributions are identical.
Q2: why not, for example, sample the availability probabilities from a Dirichlet distribution. The issue of directly applying the Dirichlet distribution to construct the availability probabilities is that the generated probabilities are likely to be more concentrated in the presence of a larger client population: the variance of each coordinate of a Dirichlet sample shrinks as the dimension grows. In cross-device federated learning, the client population can be quite large, leading to small variance. For example, Fig. 1 and Fig. 2 in the pdf are histograms of Dirichlet realizations for a large client population under two different concentration parameters. It can be seen that all the generated probabilities concentrate around their mean, and their range is narrow in both cases. In contrast, our construction results in less concentration. We copy Fig. 5 from our manuscript on page 42 and attach it as Fig. 3 in the pdf, which is a histogram of the probabilities generated under our construction. Clearly, our probabilities are more spread out, with a significantly heavier tail over the entire interval.
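As a quick numerical check of this concentration effect (with hypothetical concentration parameters and client counts, not the exact settings behind Fig. 1-2 in the pdf), one can sample from a symmetric Dirichlet and observe how the spread of the coordinates collapses as the client population grows:

```python
import numpy as np

rng = np.random.default_rng(0)
for n_clients in (100, 10_000):
    for alpha in (0.1, 1.0):                              # hypothetical concentration parameters
        q = rng.dirichlet(alpha * np.ones(n_clients))     # one draw of n_clients probabilities summing to 1
        # Each coordinate has mean 1/n_clients; its spread shrinks roughly like 1/(n_clients * sqrt(alpha)).
        print(n_clients, alpha, q.std(), q.max() - q.min())
```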
Limitations: the limitations are not sufficiently discussed in Section 8. Thank you for bringing this to our attention. Per the NeurIPS guidelines, we have Section A in the Appendix to elaborate on the potential limitations of our work. We briefly reiterate them here for completeness.
- The availability probabilities in the manuscript are theoretically assumed to be independent and strictly positive. It is interesting to theoretically characterize the setting where the probabilities are correlated under arbitrary probabilistic trajectories.
- Client unavailability can vary greatly in the presence of heterogeneous and non-stationary dynamics. It is also worth exploring variance-reduction techniques for a more robust update.
Thanks to the authors for the elucidating response. I have also read and considered all the other reviews and the corresponding author rebuttals. Since the provided information allowed me to better understand the work, I can more confidently confirm that the work provides interesting insights, and so I raised my confidence score accordingly. The authors promised to enhance the presentation, and based on the clear rebuttal I believe this will enhance the paper's quality.
Dear Reviewer ZjRF,
Thank you so much for increasing your confidence score!
We are glad that you find our response to be elucidating and allowing you to better understand our work. We appreciate it and will further improve our presentation in the camera-ready version if accepted.
Best regards,
Authors of Submission 449
This paper primarily focuses on addressing the issue of intermittent client availability in federated learning, where the problem scenario involves heterogeneity in participation and non-stationary client availability. The authors draw on ideas from other federated learning algorithms, including asynchronous federated learning, and introduce two novel algorithmic structures, "Adaptive Innovation Echoing" and "Implicit Gossiping", to address the issues of heterogeneous and non-stationary client unavailability. The authors demonstrate through experiments that heterogeneity and non-stationarity significantly impact the convergence performance of federated learning algorithms. Their newly proposed algorithm, FedAWE, exhibits the linear speedup property while also saving substantial memory. The experimental results are comprehensive and show that their algorithm outperforms others, highlighting its advantages.
Strengths
The paper combines ideas from federated learning algorithms, including asynchronous federated learning and peer-to-peer networks. Two novel algorithmic structures, "Adaptive Innovation Echoing" and "Implicit Gossiping", are proposed, which can greatly reduce memory. At the same time, the paper's assumptions on client availability are more general, requiring only that availability be independent and strictly positive across clients and rounds.
Weaknesses
In the section "Heterogeneity and Non-stationarity May Lead to Significant Bias", the authors only provide experimental evidence to verify that these factors affect the convergence performance of the algorithm. Could the authors perhaps reference Theorems 1 and 2 from the FedAU paper for a theoretical analysis of these two factors? I believe that Theorem 1 from the article "A Lightweight Method for Tackling Unknown Participation Probabilities in Federated Averaging" could encompass the authors' experimental results concerning heterogeneity; that is, if client participation is heterogeneous and the weights are not chosen appropriately, convergence is directly affected.
Questions
Firstly, do you believe that the conclusions regarding the heterogeneity part of your experiments in the section "Heterogeneity and Non-stationarity May Lead to Significant Bias" overlap with Theorem 1 from the paper "A Lightweight Method for Tackling Unknown Participation Probabilities in Federated Averaging"? Or do your conclusions offer any particular insights beyond this theorem? Secondly, the paper “FedVARP: Tackling the Variance Due to Partial Client Participation in Federated Learning” introduces the FedVARP algorithm, which is similar to the MIFA algorithm but employs a different variance reduction method. Could you also conduct related experiments to compare your algorithm with FedVARP?
Limitations
The authors' work generalizes the application scenarios for federated learning by relaxing assumptions about client participation. It is hoped that in future research, the authors will incorporate variance reduction and other related techniques within the current algorithm framework to further enhance performance.
Thank you for your appreciation for the novelty and generality of our problem setup. Please find our responses to your questions below.
W1/Q1: reference Theorems 1-2 in the FedAU paper [53] for a theoretical analysis of Section 4. Yes, we are happy to reference Theorems 1 and 2 of the FedAU paper in the camera-ready version if the paper gets accepted. Thank you for pointing out our oversight. Those two theorems share a similar spirit with ours and will complement our numerical results in Section 4, where we use numerical experiments to illustrate the significant bias incurred under stationary dynamics. However, it is worth noting that our paper focuses on a much harder problem, where the non-stationary dynamics is unstructured. Thus, we generalize their results to the problem of non-stationary dynamics.
Q1: insights beyond Theorem 1 in the FedAU paper [53]. Theorem 1 in the FedAU paper theoretically characterizes the biased global objective under heterogeneous but stationary client unavailability. However, as we focus on the general case where clients are allowed to have unstructured non-stationary dynamics subject to Assumption 1, we believe it is in general hard to obtain an analytical form as clean as that in the FedAU paper. This is because the complex interplay of the availability probabilities across rounds and clients inevitably complicates and thus hinders the theoretical analysis. Fortunately, the numerical experiments help us confirm that non-stationary dynamics further degrades the performance of the FedAvg algorithm.
Q2: conduct related experiments to compare with FedVARP. Thank you for the great suggestion. Yes, we are happy to add new experiments on FedVARP for a more in-depth understanding of our work.
Specifically, we test the FedVARP algorithm under the client unavailability dynamics in Table 1 on page 9 of our manuscript. Please refer to Table 2 in the pdf for the complete results. Due to the space limit, Table Q2 below lists only the comparisons among our algorithm, FedVARP and MIFA under the staircase non-stationary dynamics.
We can see from Table Q2 that the performance of FedVARP is in general slightly better than that of MIFA, consistent with the observations in [53] and [R1]. We want to emphasize that it is somewhat unfair to compare the proposed algorithm with these variance-reduced algorithms, as they take advantage of significantly more memory, proportional to the number of clients times the model size, and thus violate our design principle for this work: efficiency. In fact, most of the clients in cross-device federated learning are mobile or IoT devices with limited storage and memory capacity [19]. The added memory burdens their hardware, eventually impacting regular operations. Nevertheless, our algorithm can sometimes even outperform them. Similarly to the statements in the manuscript (lines 340-341 on page 9), we attribute this phenomenon to our usage of gradients from always-fresh stochastic samples, in contrast to their reuse of stored gradients from unavailable clients.
| Algorithms | SVHN (Train) | SVHN (Test) | CIFAR-10 (Train) | CIFAR-10 (Test) | CINIC-10 (Train) | CINIC-10 (Test) |
|---|---|---|---|---|---|---|
| Ours | | | | | | |
| MIFA | | | | | | |
| FedVARP | | | | | | |
Table Q2: Additional performance comparisons with FedVARP under the staircase non-stationary dynamics.
References:
[R1] Jhunjhunwala, D., et al. (2022). Fedvarp: Tackling the variance due to partial client participation in federated learning. In Uncertainty in Artificial Intelligence (pp. 906-916). PMLR.
Thanks to the authors for the elucidating response. With the information you provided, I can better understand the significance of this paper, and the supplementary experiments also show the advantages of the algorithm. I believe that the authors' efforts can enhance the overall quality of the paper.
Dear Reviewer DLNT,
Thank you for confirming our efforts to enhance the overall quality of the paper. We greatly appreciate it!
Best regards,
Authors of Submission 449
We sincerely thank all reviewers for their constructive and thorough reviews, and for their careful evaluation of our paper.
In this global response, we would like to provide a brief summary of the reviews, based on our understanding. Overall, the reviewers confirm that
- our research topic is important and more general than prior literature (Reviewers ZjRF and DLNT);
- our algorithm is simple and convenient (Reviewers ZjRF and DLNT);
- our experiments are extensive (Reviewers i6Qf and ZjRF).
The tables and figures in the attached pdf address the concerns raised by each reviewer. Specifically, we have additionally
- compared the extra complexity (in terms of sampling, memory, and computation) incurred by our algorithm, MIFA, and FedAU relative to FedAvg in Table 1 (Reviewer i6Qf);
- conducted experiments on FedVARP for performance evaluations in Table 2 (Reviewer DLNT);
- examined the use of the Dirichlet distribution to generate the client availability probabilities in Fig. 1 and Fig. 2 (Reviewer ZjRF).
In addition, we include a copy of Fig. 5 from our manuscript on page 42 as Fig. 3 in the pdf for comparisons.
Please find a point-by-point response to all reviewer concerns with individual comments below. Again, thank you for your time and effort in evaluating our paper.
This paper explores Federated Learning (FL) in the context of heterogeneous and non-stationary client availability. Most reviewers commend the effectiveness of the proposed algorithm in addressing these challenges, as well as the rigor of the theoretical analysis and experimental evaluation. One reviewer raised a concern regarding a constant term in the convergence theorem, suggesting that a dimensionally consistent formulation might be more appropriate. While this point is valid, the presence of this term does not detract from the overall contribution of the main theorem. We recommend the paper for acceptance, with a suggestion for the authors to refine the term in the camera-ready version.