Introduction

The introduction lacks contemporary citations. The most recent references date back to 2021, which may be considered outdated in the fast-moving field of Federated Learning. In addition, the small number of cited papers does not adequately reflect the current state of the field. I recommend starting with comprehensive overview papers:

Wang, J., Charles, Z., Xu, Z., Joshi, G., McMahan, H. B., Al-Shedivat, M., ... & Zhu, W. (2021). A field guide to federated optimization. arXiv preprint arXiv:2107.06917.

Kairouz, P., McMahan, H. B., Avent, B., Bellet, A., Bennis, M., Bhagoji, A. N., ... & Zhao, S. (2021). Advances and open problems in federated learning. Foundations and Trends® in Machine Learning, 14(1–2), 1-210.

Please also consider more recent papers such as:

Patel, K. K., Wang, L., Woodworth, B. E., Bullins, B., & Srebro, N. (2022). Towards optimal communication complexity in distributed non-convex optimization. Advances in Neural Information Processing Systems, 35, 13316-13328.

Wang, J., Lu, Y., Yuan, B., Chen, B., Liang, P., De Sa, C., ... & Zhang, C. (2023, July). CocktailSGD: fine-tuning foundation models over 500mbps networks. In International Conference on Machine Learning (pp. 36058-36076). PMLR.

Grudzień, M., Malinovsky, G., & Richtárik, P. (2023, April). Can 5th Generation Local Training Methods Support Client Sampling? Yes!. In International Conference on Artificial Intelligence and Statistics (pp. 1055-1092). PMLR.

By carefully designing the server-client update coordination, we show that SAFARI achieves an convergence rate to a stationary point, matching the convergence rates of state-of-the-art classic FL algorithms.

Please provide citations for this statement, and specify which state-of-the-art methods and convergence rates you have in mind.

Related works

This section lacks recent citations on client participation in Federated Learning. The current overview does not cover modern methods that tackle heterogeneity:

Mishchenko, K., Malinovsky, G., Stich, S., & Richtárik, P. (2022, June). Proxskip: Yes! local gradient steps provably lead to communication acceleration! finally!. In International Conference on Machine Learning (pp. 15750-15769). PMLR.

Mitra, A., Jaafar, R., Pappas, G. J., & Hassani, H. (2021). Linear convergence in federated learning: Tackling client heterogeneity and sparse gradients. Advances in Neural Information Processing Systems, 34, 14606-14619.

This section also does not cover works that consider special client sampling procedures:

Malinovsky, G., Horváth, S., Burlachenko, K., & Richtárik, P. (2023). Federated learning with regularized client participation. arXiv preprint arXiv:2302.03662.

Chen, W., Horváth, S., & Richtárik, P. (2022). Optimal Client Sampling for Federated Learning. Transactions on Machine Learning Research.

Cho, Y. J., Sharma, P., Joshi, G., Xu, Z., Kale, S., & Zhang, T. (2023). On the convergence of federated averaging with cyclic client participation. In International Conference on Machine Learning (pp. 5677-5721). PMLR. https://proceedings.mlr.press/v202/cho23b.html

Moreover, the related work section does not cover the modern literature on asynchronous methods:

Mishchenko, K., Bach, F., Even, M., & Woodworth, B. E. (2022). Asynchronous sgd beats minibatch sgd under arbitrary delays. Advances in Neural Information Processing Systems, 35, 420-433.

Koloskova, A., Stich, S. U., & Jaggi, M. (2022). Sharper convergence guarantees for asynchronous sgd for distributed and federated learning. Advances in Neural Information Processing Systems, 35, 17202-17215.

Tyurin, A., & Richtárik, P. (2023). Optimal Time Complexities of Parallel Stochastic Optimization Methods Under a Fixed Computation Model. arXiv preprint arXiv:2305.12387.

PAC-learnability of FL with incomplete client participation

Conventional Federated Learning with Incomplete Client Participation

This formula on page 3 might be confusing for readers:

It is not clear whether or Please clarify this aspect in the text.

I carefully reviewed Theorem 1 (the Impossibility Theorem) and its corresponding proof in the appendix. The proof appears to be sound; however, I must note that I lack expertise in statistical learning theory, so I may have overlooked certain details.

It's worth noting that the provided proof heavily relies on prior work, and the outcome it yields does not appear to be particularly surprising. In essence, the core concept of this assertion revolves around the idea that in cases where a system lacks access to specific information (in this context, data from clients that are entirely uncommunicative), it is unfeasible to construct a model with an error rate tending towards zero. This seems to be a rather self-evident conclusion, especially when considering scenarios where such non-communicative clients possess unique and isolated data that does not intersect with the data available from other clients. In such cases, it becomes impractical to derive a model suitable for the broader data distribution.
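To make this intuition concrete, here is a minimal toy sketch (not the paper's construction). It assumes a hypothetical finite domain split between participating clients and one silent client, a hypothetical `true_label` target, and a lookup-table learner that falls back to a default label on unseen points. The labels on the silent region are chosen to disagree with that default, which is the adversarial choice available whenever the learner never observes that region.

```python
import random

def true_label(x):
    # Hypothetical target: the label flips on the region held by the silent client.
    return 0 if x < 5 else 1

def sample(domain, n, rng):
    # Draw n labelled points uniformly from the given part of the domain.
    return [(x, true_label(x)) for x in (rng.choice(domain) for _ in range(n))]

def fit_lookup(data, default=0):
    # Memorise every observed point; predict `default` on anything unseen.
    table = {x: y for x, y in data}
    return lambda x: table.get(x, default)

def error(h, domain):
    # Exact error under the uniform distribution over `domain`.
    return sum(h(x) != true_label(x) for x in domain) / len(domain)

rng = random.Random(0)
participating = list(range(0, 5))   # data held by clients that communicate
silent = list(range(5, 10))         # data held by the client that never does
full_domain = participating + silent

errors = {}
for n in (10, 1000, 10000):
    h = fit_lookup(sample(participating, n, rng))
    errors[n] = error(h, full_domain)

print(errors)
```

Under these assumptions the error on the full distribution equals the probability mass of the silent client's region, regardless of how many samples the participating clients contribute.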

With all due respect, the term "fundamental" ascribed to this result, as well as the subsequent assertion that "This result sheds light on system and algorithm design for FL," might be perceived as somewhat exaggerated and overstated.

In addition to system heterogeneity, other factors such as Byzantine attackers could also render incomplete client participation. For example, even for full client participation in FL, if part of the clients are Byzantine attackers, the impossibility theorem also applies.

I respectfully disagree with this statement. Consider the standard formulation used in the literature on Byzantine robustness, as in the following references:

Karimireddy, S. P., He, L., & Jaggi, M. (2021, October). Byzantine-Robust Learning on Heterogeneous Datasets via Bucketing. In International Conference on Learning Representations.

Gorbunov, E., Horváth, S., Richtárik, P., & Gidel, G. (2022). Variance reduction is an antidote to byzantines: Better rates, weaker assumptions and communication compression as a cherry on the top. arXiv preprint arXiv:2206.00529.

We assume that there are $n$ clients consisting of two groups: $[n] = \mathcal{G} \sqcup \mathcal{B}$, where $\mathcal{G}$ denotes the set of good clients and $\mathcal{B}$ is the set of bad/malicious/Byzantine workers. The goal is to solve the following optimization problem

$$\min_{x \in \mathbb{R}^d} \left\{ f(x) = \frac{1}{G} \sum_{i \in \mathcal{G}} f_i(x) \right\},$$

where $G = |\mathcal{G}|$. In essence, we seek to build a model based solely on data from the good clients. Therefore, data from Byzantine workers does not enter the problem at all. As a result, the concept of incomplete client participation is not applicable to this scenario. If there are any aspects I may have missed, please clarify.
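As a small illustration of this point, the sketch below evaluates an objective that averages only over the good clients. The quadratic local losses, the client indices, and the names `good_client_objective` and `minimizer` are hypothetical choices made for this example, not taken from the paper or the cited works.

```python
def good_client_objective(x, centers, good):
    # f(x) = (1/G) * sum over good clients i of 0.5 * (x - c_i)^2
    return sum(0.5 * (x - centers[i]) ** 2 for i in good) / len(good)

def minimizer(centers, good):
    # For these quadratics the minimizer is the mean of the good clients' centers.
    return sum(centers[i] for i in good) / len(good)

good = [0, 1, 2]        # indices of good clients
byzantine = [3, 4]      # indices of Byzantine clients

# Two scenarios that differ only in what the Byzantine clients hold.
centers_a = {0: 1.0, 1: 2.0, 2: 3.0, 3: 100.0, 4: -50.0}
centers_b = dict(centers_a)
for i in byzantine:
    centers_b[i] = -10.0 * centers_a[i]

x_star_a = minimizer(centers_a, good)
x_star_b = minimizer(centers_b, good)

print(x_star_a, x_star_b)
```

Because the Byzantine clients' data never appears in the objective, the target solution is the same in both scenarios; their absence is not a form of incomplete participation with respect to this target.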

The PAC Learnability of Server-Assisted Federated Learning

The intuition of SA-FL is to utilize a dataset i.i.d. sampled from distribution with cardinality as a vehicle to correct potential distribution deviations due to incomplete client participation. By doing so, the server steers the learning by a small number of representative data, while the clients assist the learning by federation to leverage the huge amount of privately decentralized data . Note that the assumption of having a server-side dataset is not restrictive since such datasets are already available in many FL systems: although not always necessary for training, an auxiliary dataset is often needed for defining FL tasks (e.g., simulation prototyping) before training and model checking after training (e.g., quality evaluation and sanity checking) (McMahan et al., 2021; Wang et al., 2021a).

I agree that server-side datasets are plausible in practical applications. However, I find it hard to accept, without qualification, the assumption that the server-side dataset T is sampled i.i.d. from the distribution P. If the server-side dataset is drawn from P in any manner, the server must have some level of access to data from all clients. Access to clients' raw data directly contradicts the fundamental principles of Federated Learning and constitutes a severe breach of privacy; without such access, inherent biases may arise in the dataset. In practice, a server-side dataset sampled i.i.d. from P is often unattainable, and it may be more reasonable to consider alternative distributions or methods for constructing server-side data.

Could you kindly provide more detailed information regarding this matter?

Assumption 2 specifies a stronger constraint between distributions and . It implies that the difference of excess error for one hypothesis between and is bounded by the excess error of in some exponential form. Assumption 2 is one of the major novelty in our paper and unseen in the literature. We note that this -positively-related condition is a mild condition.

Could you please provide a more comprehensive explanation as to why this assumption is necessary in the analysis and elucidate its role?

The proof of Theorem 3 relies heavily on the assumption that the dataset T is sampled i.i.d. from the distribution P. Please clarify what results and implications hold when the server-side dataset T is drawn from a non-i.i.d. or biased distribution.
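The concern can be illustrated with a toy calculation. Everything below is hypothetical: a three-point domain, a target distribution `P`, a biased server-side distribution `Q`, and two constant hypotheses. The sketch computes exact risks (no sampling) and shows that minimizing risk under the biased distribution can select a hypothesis that is suboptimal under the target.

```python
# Hypothetical labels on a three-point domain.
labels = {"a": 0, "b": 1, "c": 1}

# Target distribution P and a biased server-side distribution Q (both assumed).
P = {"a": 0.6, "b": 0.3, "c": 0.1}
Q = {"a": 0.2, "b": 0.4, "c": 0.4}

hypotheses = {
    "always_0": lambda x: 0,
    "always_1": lambda x: 1,
}

def risk(h, dist):
    # Exact 0-1 risk: total probability mass of misclassified points.
    return sum(p for x, p in dist.items() if h(x) != labels[x])

best_under_Q = min(hypotheses, key=lambda name: risk(hypotheses[name], Q))
best_under_P = min(hypotheses, key=lambda name: risk(hypotheses[name], P))

# Excess risk, measured under P, of the hypothesis selected using Q.
excess_risk = risk(hypotheses[best_under_Q], P) - risk(hypotheses[best_under_P], P)

print(best_under_Q, best_under_P, excess_risk)
```

In this example the gap does not shrink with more server-side samples, since it comes from the mismatch between the two distributions rather than from estimation error.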

Last but not least, it is worth pointing out that, for ease of illustration, Theorem are based on the assumption that the auxiliary dataset . Nonetheless, it is of practical importance to consider the scenario where is sampled from a related but slightly different distribution rather than the target distribution itself. In fact, the above assumption could be relaxed to for any as long as the mixture distribution is -positively-related with . Under such condition, we can show that the main results in Theorem 2-3 continue to hold.

Can you please characterize the difference between distribution and such that mixture distribution is -positively-related with ?

THE SAFARI ALGORITHM FOR TRAINING UNDER SA-FL

Assumption 5 appears to be rather restrictive, as it necessitates the bound to hold for all in the range , and the overall expression is constrained by a constant. I would like to propose a more general assumption:

It's worth noting that this assumption is a standard concept well-documented in the existing literature.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., & Suresh, A. T. (2020, November). Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning (pp. 5132-5143). PMLR.

Gorbunov, E., Horváth, S., Richtárik, P., & Gidel, G. (2022). Variance reduction is an antidote to byzantines: Better rates, weaker assumptions and communication compression as a cherry on the top. arXiv preprint arXiv:2206.00529.

I have significant concerns regarding Theorem 4. Firstly, the constants, where and , do not provide any information regarding their potential magnitudes. These constants necessitate the calculation of the minimum among server iterates for and the minimum among client iterates for . Including such terms in the convergence bound is impractical, both in theory and practice, as estimating these constants is infeasible.

Upon reviewing the proof of Theorem 4, I observed that the analysis is less rigorous than it should be: it bounds sums of client gradients by the minimum of such sums. Similar analytical approaches were used in subsequent works, but in these works such sums are bounded in a more rigorous way. Please clarify this aspect of the proof.

Moreover, the end of the proof is not correct:

where and

The term "minimum" should be replaced with "maximum" since the formula requires an upper bound. I acknowledge that this is likely a typographical error, but the criticism mentioned earlier still holds true.
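For reference, the elementary inequality behind this remark, written for a generic real sequence $a_1, \dots, a_T$ (placeholder notation, not the paper's symbols), is

$$T \min_{1 \le t \le T} a_t \;\le\; \sum_{t=1}^{T} a_t \;\le\; T \max_{1 \le t \le T} a_t .$$

Only the right-hand inequality gives an upper bound on the sum; the minimum gives a lower bound.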

It's also unclear why bounding the term is meaningful and what it signifies. Typically, in the non-convex literature, the term is utilized in the bound.

Karimireddy, S. P., Kale, S., Mohri, M., Reddi, S., Stich, S., & Suresh, A. T. (2020, November). Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning (pp. 5132-5143). PMLR.

Khaled, A., Mishchenko, K., & Richtárik, P. (2020, June). Tighter theory for local SGD on identical and heterogeneous data. In International Conference on Artificial Intelligence and Statistics (pp. 4519-4529). PMLR.

Malinovsky, G., Mishchenko, K., & Richtárik, P. (2022). Server-side stepsizes and sampling without replacement provably help in federated optimization. arXiv preprint arXiv:2201.11066.

The most recent paper, titled "Server-side stepsizes and sampling without replacement provably help in federated optimization," delves into an algorithm closely related to the one discussed in this study, and their analytical approach bears similarities. However, it notably avoids the pitfalls associated with bounding sums by the minimum among server and client iterates.

I would suggest conducting a comparative analysis of the results obtained in this study with those derived from the aforementioned relevant papers to gain a deeper understanding of the strengths and limitations of each approach.

Further, Theorem 4 immediately implies that, by choosing parameters and the learning rate appropriately, we achieve linear convergence speedup to a stationary point:

What is in this sentence? Is this server or client stepsize?

Numerical experiments

In this section, there is a notable absence of a comparative analysis with respect to baselines other than FedAvg. To offer a more comprehensive evaluation, it would be advantageous to include a comparison with additional baseline algorithms commonly employed in similar studies. This would provide a more holistic perspective on the performance of the proposed approach.

Furthermore, to enhance the clarity and effectiveness of the analysis, it is advisable to incorporate graphical representations such as plots depicting the convergence of the algorithm in terms of the loss function. Visualizing the convergence process can aid in conveying a more intuitive understanding of the algorithm's performance and highlight any distinct advantages or shortcomings it may exhibit in relation to other baselines.