Simple Minimax Optimal Byzantine Robust Algorithm for Nonconvex Objectives with Uniform Gradient Heterogeneity
We propose a new simple minimax optimal Byzantine robust algorithm for nonconvex federated learning under uniform gradient heterogeneity assumption.
摘要
评审与讨论
This paper proposes a new Byzantine robust algorithm called Momentum Screening (MS) for nonconvex federated learning. MS uses a simple screening test to detect and remove potentially malicious gradients from Byzantine workers. The remaining gradients are aggregated using standard momentum SGD. The algorithm is adaptive to the Byzantine fraction .
This paper gives theoretical analysis on the proposed algorithm, showing that MS achieves an optimization error of O() for the class . A matching minimax lower bound is also provided for . This rate differs from the rate for the class .
Experiments on MNIST and CIFAR10 with neural nets demonstrate MS outperforms existing methods like Centered Clipping, especially for small . This aligns with the better dependence on .
优点
- The proposed algorithm (MS) is simple to implement yet performs well. It also does not need to know the Byzantine fraction in advance, which is practical.
- The rate of convergence of MS is better than the previous best known rate of under the (corrected) condition when , so the method is preferred if Byzantine workers are very few.
- The author also provide a computationally efficient algorithm for their proposed MS method, whose performance is only worse than the original one by a constant factor.
- Some experiments on MNIST and CIFAR10 with neural nets shows that MS outperforms other method.
缺点
-
In the literature the rate is derived from , while this paper gives the rate O() for . Now, to give a better rate, one needs where the RHS is since . Therefore, the requirement of is wrong throughout the paper (the authors give ). The authors even did not notice this mistake when they write (in Section 7) but in fact Byzantine fraction . Such mistake makes me doubt the correctness of the proof in this paper, but I do not have enough time to check the whole proof.
-
As argued in this paper, , meaning that the method is only favourable when is very small, which seems to be not practical in the Byzantine workers setting. Moreover, since and are different hypothesis classes, directly comparing rates seems to be improper. An analysis of MS in is also needed.
-
Although the hyperparameter is adaptive to the Byzantine fraction , it has to be be chosen according to , which is unknown in priori, so an inproper choice of could harm the performance of the algorithm. It would be favourable to provide an empirical way to choose .
-
For the presentation of the paper, it would be clearer if the author provides a sketch of the proof rather than presenting directly some propositions.
问题
- Could the authors comment more on the relation between and , particularly with some real datasets?
Thank you for your important comments.
About Weakness 1.
As you pointed out, the expression should be fixed as (please check the revision paper). We thank you for your pointing out these typos and apologize for them, but, they are simple systematic typos and do not affect the proofs given in Sections A and B.
About Weakness 2 and Question 1
We believe that is reasonable because is not too small in practice. In fact, after reading your review, we conducted additional experiments to address the concern that is very small in practice. Specifically, we examined the empirical values of on the settings of Section 6 and additionally on Fed-EMNIST, which is a real FL dataset. The experimental results can be found in Section E of the revised paper (p.36). From Figure 14 in Section E, we can see that was in on MNIST and CIFAR10, and was in on Fed-EMNIST. From these observations, we conclude that the condition is practical enough and the benefits of MS are not so limited.
Moreover, since and are different hypothesis classes, directly comparing rates seems to be improper. An analysis of MS in is also needed.
We disagree with the first point. In general, it is very common in the machine learning and optimization literature to show that an algorithm achieves a better rate (generalization error, optimization error, etc.) for a smaller hypothesis class than the rate of another algorithm targeting a larger hypothesis class. For example, it is well known that a typical theory of LASSO shows that LASSO achieves a better generalization error than OLS when the true parameter is sparse, where LASSO assumes a smaller hypothesis class than OLS does.
Regarding the second point, we agree that an analysis of MS in may be necessary to judge which algorithm is better in . However, our main focus is to construct a minimax optimal algorithm for the hypothesis class , and as shown in Figure 14 in Section E of the revised paper, it is empirically justified to assume that the local objectives are in for a reasonable because empirically. Thus, this point does not detract from the importance of this study.
About Weakness 3.
In our experiments, as described in Section D.2, we used a heuristic strategy to stabilize the performance that was increased by times from the original as long as , although , which is the limit of () , had to be roughly tuned. As you said, it is practically desirable that the algorithm be adaptive not only to Byzantine fraction but also to heterogeneity (or ). This direction is definitely an important future work.
About Weakness 4.
Thanks for your suggestion. Actually, we give a brief overview of the analysis in the first part of Section 4 to clarify our proof strategy. If that is not enough, we would like to add a more detailed sketch of the proof.
We would be very happy if your concerns were addressed and the score would be raised.
Thank you for your response. It clarifies my particular concern on the quantity . I would like to raise my score.
This paper considers the problem of federated learning with Byzantine workers who can send arbitrary responses to the central server. In the non-IID case where the local distributions of non-Byzantine workers are heterogeneous, the standard aggregations will fail empirically, as shown in previous works. In this paper, the authors developed a new, simple byzantine robust algorithm that have better minimax optimal optimization error compared to the best previous algorithm when the maximum gradient heterogeneity is not much larger than the average gradient heterogeneity, whose optimality in this parameter regime is demonstrated by establishing a lower bound result. Moreover, the authors conducted numerical experiments to support their theoretical analysis.
优点
The algorithm is novel and simple, makes it relatively easy to be implemented in practice. Moreover, the improvement in the minimax optimal optimization error is significant in the parameter regime where the maximum gradient heterogeneity is around the same order as the average gradient heterogeneity, which seems like a common assumption in various practical situations. The performance of the algorithm is also well demonstrated in the various numerical experiments.
缺点
The convergence rate in terms of the number of steps might not be optimal. In particular, the algorithm is a momentum-based method, however, the convergence rate exhibits the form of a non-momentum based method, and it is unclear to me why the momentum is needed here.
问题
Will the convergence rate of the algorithm remain unchanged if the momentum is removed? Or, is there a better momentum-based algorithm that has better convergence rate?
伦理问题详情
N/A
Thank you for your insightful comments and questions.
The convergence rate in terms of the number of steps might not be optimal. In particular, the algorithm is a momentum-based method, however, the convergence rate exhibits the form of a non-momentum based method, and it is unclear to me why the momentum is needed here.
Will the convergence rate of the algorithm remain unchanged if the momentum is removed? Or, is there a better momentum-based algorithm that has better convergence rate?
Our algorithm relies on a heavy-ball method rather than the famous Nesterov's acceleration method. Thus, it is natural that the convergence rate matches that of non-momentum based methods when , since a heavy-ball method does not improve the convergence rate of non-momentum methods at least in the standard nonconvex optimization theory (see, for example, Mai and Johansson, 2020 [1]). Although the main focus of this paper is on the asymptotic optimization error, obtaining the optimal convergence rate in terms of the number of steps based on Nesterov's acceleration is an interesting future direction.
The reason why we introduce the momentum in our algorithm is that the momentum mitigates the effect of the stochastic noise by canceling out the noise thanks to the accumulation of the previous stochastic gradients. This is very important when because the screening algorithm judges the workers to be non-Byzantine or Byzantine based on comparing the distance between the workers' outputs. If the momentum is removed, the screening algorithm must aggregate the stochastic gradients only at the current iteration, which will degrade the performance of the distance-based detection of the Byzantine workers due to the large stochastic noise. As a result, the convergence rate will be degraded when the momentum is removed.
[1] Mai and Johansson, 2020: Convergence of a Stochastic Gradient Method with Momentum for Non-Smooth Non-Convex Optimization.
I would like to thank the authors for answering my questions, and I remain my rating.
The paper studies nonconvex federated learning (FL) in the presence of byzantine workers with a fraction of out of the workers. Then the authors proposed the Momentum Screening (MS) algorithm for such setting, achieving error rate for -uniform gradient heterogeneity, and showed the minimax optimality of the proposed method in such setting. Experimental results are then given to validate the MS algorithm.
优点
The algorithmic structure of the MS algorithm is simple and can adapt to the Byzantine fractions , all of which can be practically attractive. Furthermore, the minimax optimality results seem like the first of its kind for such setting of -uniform gradient heterogeneity.
缺点
- The consideration of algorithmic design for uniform gradient heterogeneity as in this paper has been done in the literature. In fact, the rate achieved here seems to be the same as the CCLIP method (Karimireddy et al. (2022)) (ref [1] as below for convenience). Yet, such literature was not well discussed enough in the paper.
- Following up the above point, many results in the paper are quite the same as those in CCLIP without improvement, and the analysis is quite natural and motivated from previous work. The true technical novelty of the paper, besides the MS method with simplicity, is perhaps the fact that they proved lower bound in the minimax sense for uniform gradient heterogeneity. However, this is quite a natural extension from the first work that proved such results for the case of mean gradient heterogeneity.
- Systematic typo throughout the paper: note that yours is better than CCLIP when . Can you give a sense of what can be in real datasets, especially those considered in your experiments? Because I think such fraction can be very small in practice, which is also acknowledged in your Section 2.1. So the regime in which MS provides benefits is in fact quite limited.
问题
Please see weaknesses.
Thank you for your helpful feedback.
About Weaknesses 1 and 2.
First of all, we want to emphasize that the most critical difference of our study from CCLIP paper [1] is the improvement of the optimization error; our obtained error is times smaller than that of CCLIP in the best case (i.e., ). Since is often a small value in distributed learning systems, this improvement is quite important from both a theoretical and a practical point of view. The condition holds empirically in our experiments (please see the comments to ``About Weakness 3'').
Below, we summarize the main differences between our Momentum Screening (MS) and CCLIP [1] that are mentioned in our paper.
- Algorithmically, as shown in Section 3, our algorithm relies on the screening technique rather than clipping, which is critical for our theoretical analysis.
- Theoretically, as described in Section 1 (Main contribution and Related work) and Section 4 (Remark 2), the optimization error can be times better than that of CCLIP.
- Empirically, as provided in Section 7, we demonstrated the consistent superiority of our method over CCLIP in various numerical experiments.
Also, the derivation and the results of the aggregation error bound (Proposition 2) and the momentum diameter bound (Proposition 3) under the uniform gradient heterogeneity condition is the key technical part of our analysis, and is clearly different from the analysis of CCLIP, where only the mean gradient heterogeneity condition is assumed and the aggregation error bound is times worse than ours due to the clipping bias.
About Weakness 3.
First, the expression was fixed in the revised paper as . The typos do not affect the proofs given in Sections A and B. We thank you for pointing this out and apologize for these typos.
Second, regarding your concern, we recognize that empirical validation of the condition is critical in our study. After reading your review, we examined the empirical values of on the settings of Section 6 and additionally on Fed-EMNIST, which is known as a FL dataset for a realistic situation. From Figure 14 in Section E of the revision paper (p.36), we can see that was in on MNIST and CIFAR10, and was in on Fed-EMNIST. Thus, we conclude that the condition is practical enough and the benefits of MS are not so limited.
We would be very happy if your concerns were addressed and the score would be raised.
The AC has been reading the papers and the discussions between authors and reviewers recently. Currently, the AC has the following questions for this paper, can the authors also provide some comments?
(1). Assumption 3 and 4 are strong assumptions that are not adopted in the existing works like CClip. Therefore, MS seems to achieve the proposed improvement by much stronger assumptions.
(2). Assumption 3 seems redundant as the boundedness of Assumption 4 is stronger than the sub-Gaussian assumed in Assumption 3.
(3). As CClip algorithm already achieves an complexity, hence there seem to be an effective regime within which the lower bound holds. That is, there should be some such that the derived lower bound holds for . However, this fact is not reflected in Theorem 2.
Thank you for your careful reading and important questions.
About Question (1):
Assumption 3: As you said, Assumption 3 is actually stronger than the standard Stochastic Gradient Variance Boundedness (SGVB) assumption, which requires for minibatch stochastic gradient at of worker . However, this stronger requirement comes from the fact that our theory derives a high-probability bound instead of an expectation bound, which is a much stronger result. Assumption 3 is a standard assumption for deriving a high-probability bound in the stochastic optimization literature not only in the context of the Byzantine robust optimization; in general, deriving an expectation bound requires SGVB, and deriving a high-probability bound requires Assumption 3 (see, for example, Section 2.3 of [1]).
[1] Jin et al., 2019: On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points.
Assumption 4: Assumption 4 has very little effect on our theoretical results even if can be much larger than in Assumption 3. In fact, in Theorem 1, depends only log log order on the non-dominant terms of the convergence rate (and never depends on the final optimization error ), as described immediately after Assumption 4 in Section 2. Thus, Assumption 4 is not restrictive at all.
About Question (2):
This is not true. Generally, in Assumption 3 can be much smaller than in Assumption 4, since approaches to zero as the minibatch size goes to . In Theorem 1, in Assumption 3 arises explicitely in the convergence rate. In contrast, as mentioned in the answer to Question (1), in Assumption 4 depends only log log order on the non-dominant terms of the convergence rate. Thus, Assumption 3 is not redundant, since replacing with gives a much worse bound.
About Question (3):
First, we would like to clarify your concern. We think your concern is why the upper bound of CClip seems to be better than our lower bound for some when . Is our understanding correct? If so, we would like to address this important point.
In fact, the lower bound derived in Theorem 2 holds for any for function class and does not contradict the upper bound of CClip for any . This point is explained as follows. First, please note that the theory of CClip essentially gives an upper bound for . Using this fact, a trivial upper bound by applying CClip theory to () is larger than our lower bound in Theorem 2, for any . Please note that for any in general, and thus the theory of CClip never guarantees the upper bound for for any . Thus, the upper bound of CClip for does not contradict the lower bound in Theorem 2 for any .
We would be happy to clarify any further concerns and questions.
We thank all reviewers for their valuable comments.
As mentioned in the replies to Reviewers p51w and BuTUe, we did additional experiments on the validity of the condition on Fed-EMNIST, which is a real federated learning dataset. From Figure 14 in Section E of the revision paper (p.36), we found that was in 0.34 ~ 0.7 and not so small, and thus the condition can be practical enough.
We then also performed accuracy comparisons of the proposed method with the existing methods against the five attacks described in Section 6 for the case of on Fed-EMNIST. From Table 3 in Section F of the revision paper, we found that the proposed method still consistently outperformed the other methods in terms of the worst best test accuracy against the five attacks.
We hope that if these results are helpful in addressing your concerns.
This paper studies the nonconvex federated learning problem with Byzantine workers. Let the fraction of Byzantine workers be , a momentum SGD enhanced with a screening test is proposed, and an error can be achieved, in addition, an lower bound has been provided. When is a small quantity and is not much larger than , then this is a large improvement compared to the . However, the stronger result is based on the stronger light-tailed noise assumption, without which the high probability bound and the screening would not be valid. Moreover, the convergence rate in terms of the number of steps might not be optimal. Overall, the screening technique, the lower bound, and the improved final accuracy are all valid contribution in this paper.
As the reviewers are having a split reviews for this paper, the AC also read this paper to understand the concerns from the negative, which is about the difference of the algorithm and the results compared with the CCLIP method (Karimireddy et al. (2022)). The AC thinks that the authors' response to the negative reviewer is valid (which is not responded by reviewer), and the authors also clarified a few additional questions from the AC. As a consequence, the AC decides to override the negative review and propose an accept to this paper.
为何不给更高分
Stronger assumption than before, and the convergence rate in terms of the number of steps might not be optimal.
为何不给更低分
The paper does have some theoretical merit in proposing the screening technique, the lower bound, and the improved final accuracy.
Accept (poster)