LoCoDL: Communication-Efficient Distributed Learning with Local Training and Compression
Reviews and Discussion
This paper proposes LoCoDL, an algorithm that combines communication compression with local training. The authors prove convergence under standard assumptions, achieving rates comparable to existing SOTA algorithms. The experimental results also show that LoCoDL performs best among the tested algorithms.
Strengths
- The combination of communication compression and local training is novel.
- The convergence results achieve the SOTA rate for large $n$ and a nearly SOTA rate for small $n$.
- The algorithm behaves empirically better than ADIANA, an existing theoretically SOTA algorithm.
- The algorithm is simple.
- The target problem setting is novel and general.
Weaknesses
- Throughout the four experimental settings, the number of nodes, $n$, is comparable to the feature dimension $d$. As the convergence rate of LoCoDL is suboptimal when $n$ is small, I believe it important to compare LoCoDL with ADIANA when $n$ is at least 10 or 100 times smaller than $d$, to see whether LoCoDL beats ADIANA in these scenarios.
- The experimental datasets are relatively small. It's recommended to conduct experiments on MNIST or larger datasets.
- It is not easy to capture the intuition behind each algorithm line. It's recommended to give more detailed explanations on how the algorithm is developed.
Questions
- Existing works widely use the error feedback mechanism from EF21 to remove the gradient dissimilarity bound condition. Does LoCoDL overcome this issue by the primal-dual format? Is the primal-dual format somehow equivalent to error feedback, or has it been used by prior works for the same reason? I'm not questioning the novelty of the proposed algorithm but rather curious about the intuition behind it.
- Is it possible to extend LoCoDL to stochastic settings or generally convex settings? Are there any technical difficulties?
- When the second function in the objective is zero, the paper considers an equivalent case where part of the strong convexity of the $f_i$ is transferred to it. However, the strong convexity constant may not be available for many objective functions. Should we tune it as a hyperparameter? If we apply LoCoDL to the vanilla case where the second function is zero, are there any difficulties in the convergence analysis?
Limitations
As stated in the conclusion, the algorithm is limited to the single-directional, deterministic setting without partial participation.
Thank you for your positive evaluation and for acknowledging our contributions.
Weaknesses
- I believe it important to compare LoCoDL with ADIANA when $n$ is at least 10 or 100 times smaller than $d$, to see whether LoCoDL beats ADIANA in these scenarios.
- It's recommended to conduct experiments on MNIST or larger datasets.
We have followed your suggestion and run additional experiments, please see the rebuttal to all reviewers for the results. This confirms that LoCoDL significantly outperforms ADIANA.
- It is not easy to capture the intuition behind each algorithm line. It's recommended to give more detailed explanations on how the algorithm is developed.
Thank you for this constructive suggestion. We will make use of the extra page to better describe the steps of LoCoDL in Section 2.2.
Questions
- Existing works widely use the error feedback mechanism from EF21 to remove the gradient dissimilarity bound condition. Does LoCoDL overcome this issue by the primal-dual format? Is the primal-dual format somehow equivalent to error feedback, or has it been used by prior works for the same reason? I'm not questioning the novelty of the proposed algorithm but rather curious about the intuition behind it.
This is an excellent and deep question. We can note the following.
- In Condat et al., "EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization," NeurIPS 2022, it is shown that the error feedback mechanism of EF21 and the variance-reduction technique of DIANA, based on compressing the difference between gradients and control variates, are essentially the same.
- In Condat and Richtárik, "RandProx: Primal-Dual Optimization Algorithms with Randomized Proximal Updates," ICLR 2023, it is shown that the control variates introduced in Scaffnew to correct for the client drift can be viewed as dual variables for the dualized consensus constraint $x_1 = \dots = x_n$.
- Still, the two variance reduction techniques, EF21/DIANA/EF-BV on the one hand and Scaffnew/RandProx on the other hand, are of a different nature, as far as we can tell, and we are not aware of a primal-dual interpretation of the former.
- Thus, the idea of compressing the difference between the local model estimates and an anchor, which are all primal variables, is not related to the primal-dual view; the latter is used to analyze the overall algorithm, in particular how the random errors propagate between the primal and the dual variables (a minimal sketch of this primal-dual view is given below).
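For illustration, here is a minimal sketch of the primal-dual view mentioned above, in generic notation of our own (the paper's exact symbols and scaling conventions may differ): the consensus-constrained reformulation, and the dual variables attached to the consensus constraint, which is the role that Scaffnew's control variates play according to the RandProx paper.

```latex
% Consensus-constrained reformulation (generic notation for illustration).
\begin{align*}
\min_{x \in \mathbb{R}^d} \ \frac{1}{n}\sum_{i=1}^n f_i(x)
\quad \Longleftrightarrow \quad
\min_{x_1,\dots,x_n \in \mathbb{R}^d} \ \frac{1}{n}\sum_{i=1}^n f_i(x_i)
\quad \text{s.t.} \quad x_1 = \cdots = x_n .
\end{align*}
% Dualizing the consensus constraint, with multipliers constrained to sum
% to zero, yields the saddle-point problem
\begin{align*}
\min_{x_1,\dots,x_n} \ \max_{u_1,\dots,u_n:\ \sum_i u_i = 0} \
\frac{1}{n}\sum_{i=1}^n \Big( f_i(x_i) + \langle u_i, x_i \rangle \Big),
\end{align*}
% where the multipliers u_i are the dual variables with which Scaffnew's
% control variates can be identified (up to rescaling).
```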
- Is it possible to extend LoCoDL to stochastic settings or generally convex settings?
Allowing for stochastic gradients with or without variance reduction instead of full and exact gradients does not seem to be difficult. But it is related to computation-efficiency, not communication-efficiency, and is worth a study in a full separate paper to investigate the different tradeoffs. We have not looked at the extension to the general convex setting. It is certainly possible to derive rates without much difficulty. Obtaining convergence to a stationary point in the nonconvex setting would be more interesting, but our efforts so far have been unsuccessful.
- The strong convexity constant may not be available for many objective functions. Should we tune it as a hyperparameter?
We study linear convergence rates when the functions are smooth and strongly convex. Indeed, this case requires transferring some amount of strong convexity from the $f_i$ to the second function, for which a lower bound on their strong convexity constant is needed (see the sketch below). Underestimating this lower bound will slow down convergence. We note that the same idea of transferring strong convexity to get linearly converging algorithms has been used in other works, e.g., Grudzien et al., "Can 5th Generation Local Training Methods Support Client Sampling? Yes!," AISTATS 2023.
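For concreteness, here is a minimal sketch of this strong-convexity transfer in generic notation of our own (we write $g$ for the second function and $\mu$ for a known lower bound on the strong convexity constant of the $f_i$; the paper's exact symbols and constants may differ).

```latex
% Strong-convexity transfer (generic notation for illustration).
% If every f_i is mu-strongly convex and g is merely convex, the
% objective can be rewritten, without changing its value, as
\begin{align*}
\frac{1}{n}\sum_{i=1}^n f_i(x) + g(x)
\;=\;
\frac{1}{n}\sum_{i=1}^n
\underbrace{\Big( f_i(x) - \tfrac{\mu}{2}\|x\|^2 \Big)}_{\text{still convex and smooth}}
\;+\;
\underbrace{\Big( g(x) + \tfrac{\mu}{2}\|x\|^2 \Big)}_{\mu\text{-strongly convex}} .
\end{align*}
% Underestimating mu keeps the identity valid but transfers less strong
% convexity to g, hence a slower linear rate.
```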
- If we apply LoCoDL to the vanilla case where the second function is zero, are there any difficulties in the convergence analysis?
Without any change, if the second function is zero, and hence not strongly convex, there is no contraction applied to the corresponding variable, so linear convergence is lost.
Limitations
As stated in the conclusion, the algorithm is limited to the single-directional, deterministic setting without partial participation.
Deriving efficient algorithms with bidirectional compression is a hot and very challenging topic and our main priority. We believe the key idea introduced in LoCoDL, of maintaining an additional model estimate (the anchor) alongside the local estimates, will play a major role. Partial participation can be implemented in LoCoDL, but it will be more satisfactory to develop it in the setting of bidirectional compression.
Thank you for the detailed reply. All my concerns have been well addressed and I do not have further questions.
Thank you again for your thorough and positive evaluation. We are kindly counting on you to support the paper toward acceptance during this discussion phase with the AC.
This paper proposes LoCoDL, a new GD-based distributed training algorithm that employs both communication compression (CC) and local training (LT). It achieves double acceleration and a SOTA convergence rate for strongly convex problems.
A crux of the algorithmic improvement is maintaining two local estimates, intuitively enabling efficient LT (similarly to SCAFFOLD) and efficient CC (i.e., compressing values' differences instead of values themselves).
The paper offers a thorough theoretical analysis that proves the main claim and conducts some experiments that show LoCoDL's benefits compared to previous algorithms.
Strengths
This paper is timely and important, and I enjoyed reading it. It appears to set a new bar for distributed communication complexity in the strongly convex case (and with full participation?).
While some works assume similarity between local client functions, this work allows these functions to be arbitrarily different. Also, interpreting the added term is intuitive and compelling (viewpoints 1-4).
The theoretical claims are rigorously proven, and some experiments demonstrate the efficiency of LoCoDL compared to previous techniques.
Weaknesses
The practical applicability of LoCoDL is unclear. Namely, it does not apply to NNs and possibly is less efficient for partial participation use cases (this part is unclear). Either providing concrete evidence of why this contribution is important for modern practical use cases or slightly rephrasing the paper as a theoretical (and important) contribution would strengthen the claim.
Strengthening the evaluation section is also advised. The submission would be stronger if the authors provided an experiment other than logistic regression to demonstrate the efficiency of LoCoDL on another task.
Additional points:
- “are smooth, so their gradients will be called.” This sentence is unclear.
- “is slower than broadcasting the same message to an arbitrary number of clients.” Are there any real FL systems that employ broadcasting? Or do you mean sending the same message? (The term “broadcast” may be confusing here.)
- “In this work, we focus on the uplink communication complexity, which is the bottleneck in practice.” The second part of the sentence should be softened or extended with real evidence that this is the case.
- “No other compressor can be used, which notably rules out any type of quantization.” Why is this the case? Why can quantization not be applied according to the selected pattern?
- “Instead of the cumbersome permutation-based compressor of the latter.” Is there a specific challenge in implementing permutation-based compressors? Maybe it's worth specifying setups where this is insufficient or cannot be applied.
- “Thus, LoCoDL sets new standards in terms of communication efficiency.” Do you mean: our experiments indicate that...?
- What is the communication complexity with partial participation (PP)? When considering PP, is LoCoDL the current SOTA, or are there better alternatives?
- It seems that some elements of LoCoDL have some similarities to DoCoFL [1], which also uses an anchor for the model that allows clients to obtain only a compressed correction to that anchor. Can the authors shed light on this similarity?
[1] Dorfman, Ron, et al. "DoCoFL: Downlink compression for cross-device federated learning." International Conference on Machine Learning. PMLR, 2023.
Questions
See weaknesses.
Limitations
The paper does not have a dedicated limitation section. The conclusions section discusses potential future extensions. Outlining the limitations clearly is advised. For example, is the PP use case relevant here?
Thank you for your positive evaluation and for acknowledging our contributions.
Weaknesses
- LoCoDL does not apply to NNs
This work is indeed mainly theoretical: we develop a new algorithmic framework combining the mechanisms of local training and compression, and validate it with proven acceleration in the strongly convex setting. For the nonconvex problem of training NNs, which poses specific challenges, the proof techniques are significantly different, and we currently don't know how to analyze LoCoDL in this setting. More generally, the use of variance reduction to correct for client drift and compression error is an open and debated question; see for instance:
"On the Ineffectiveness of Variance Reduced Optimization for Deep Learning" Neurips 2019
"On the effectiveness of partial variance reduction in federated learning with heterogeneous data", CVPR 2023
For the ultimate quest of training NNs in a distributed or federated way, we believe that our paper is a first milestone, whose original ideas could foster further research in the community about nonconvex settings.
- LoCoDL possibly is less efficient for partial participation use cases (this part is unclear).
It is possible to allow for partial participation (PP) in LoCoDL. We did not focus on this, because there are several primal and dual variables, so a neat design would involve PP and bidirectional compression jointly. We leave such a difficult study for future work.
- Strengthening the evaluation section is also advised.
We have run additional experiments; please see the rebuttal to all reviewers.
Questions
- "are smooth, so their gradients will be called." This sentence is unclear.
We mean that for a smooth function, it is natural to use the gradient and not the proximity operator. We will reformulate this sentence.
- “is slower than broadcasting the same message to an arbitrary number of clients.” Are there any real FL systems that employ broadcasting? Or do you mean sending the same message?
Yes, we just mean sending the same message in parallel to all clients. We will change the term broadcast.
- The second part of the sentence should be softened or extended with real evidence that this is the case.
We have in mind federated learning (FL) with communication happening via the internet or mobile phone network. We will add references supporting the claim that uplink is more expensive than downlink communication.
- "No other compressor can be used, which notably rules out any type of quantization." Why is this the case?
In CompressedScaffnew, the compressed vectors do not tend to zero. Therefore, for the algorithm to be variance reduced, the following property must hold: if the compressed vectors are all equal, then there is zero compression error. The only known mechanism satisfying this property is sparsification with correlated rand-k compressors that select elements according to a shared random permutation (see the toy sketch below).
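As a toy illustration of this property (a minimal sketch in our own notation, not CompressedScaffnew's actual implementation): when a shared random permutation partitions the coordinates among the clients, the scaled sparsified vectors average back to the original vector exactly whenever all clients hold the same vector.

```python
import numpy as np

def permutation_partition(d, n, rng):
    """Randomly partition the d coordinates among the n clients.

    Assumes n divides d for simplicity (each client keeps d // n coordinates).
    """
    perm = rng.permutation(d)
    return np.split(perm, n)

def compress(x, coords, n):
    """Correlated sparsifier: keep only `coords`, scale by n (unbiased)."""
    out = np.zeros_like(x)
    out[coords] = n * x[coords]
    return out

# Toy check of the key property: if all clients hold the same vector,
# the average of the compressed vectors has zero compression error.
rng = np.random.default_rng(0)
d, n = 12, 4
parts = permutation_partition(d, n, rng)

x = rng.standard_normal(d)          # same vector on every client
avg = np.mean([compress(x, parts[i], n) for i in range(n)], axis=0)
print(np.allclose(avg, x))          # True: exact reconstruction

# With independent rand-k compressors (no shared permutation), the
# average would only be unbiased, not exact, even if all vectors agree.
```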
- Is there a specific challenge in implementing permutation-based compressors?
Such compressors are correlated and select elements according to a shared random permutation. A first implementation is that all clients use pseudo-random generators that remain synchronized at all times (see the sketch below). In case a client becomes inactive for any reason, synchronization may be lost. A second implementation would be that, before compression, the server samples the permutation and sends information to all clients to identify it. This is not very challenging but is detrimental to robustness. The ability of LoCoDL to use independent compressors makes it more versatile for practical FL use cases.
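A minimal sketch of the first implementation (our own illustration, with hypothetical names): each client derives the permutation for a given communication round deterministically from a seed agreed upon once at setup, so all clients produce the same permutation without any per-round coordination, as long as they agree on the round counter.

```python
import numpy as np

def round_permutation(d, round_t, shared_seed=2024):
    """Every client derives the same permutation for round `round_t`
    from a seed agreed upon once at setup -- no per-round communication."""
    rng = np.random.default_rng((shared_seed, round_t))
    return rng.permutation(d)

# Two clients, no exchange about the permutation, yet identical output:
print(np.array_equal(round_permutation(8, round_t=3),
                     round_permutation(8, round_t=3)))   # True
```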
- Do you mean: our experiments indicate that...?
Yes, our experiments indicate that LoCoDL outperforms other algorithms. We will reformulate the sentence.
- What is the communication complexity with partial participation (PP)?
The communication complexity in number of rounds, with appropriate parameter choices, is as given in the paper. With independent rand-k compressors, the complexity in number of reals is the same as in Table 2. So, up to some constants, we can indeed make this choice and get the same asymptotic complexity. If quantization is applied in addition to rand-k, we just have to be careful to keep the compressor parameters within the required range.
- It seems that some elements of LoCoDL have some similarities to DoCoFL. Can the authors shed light on this similarity?
Thank you for pointing out the nice paper on DoCoFL. We will cite it and consider it in our future work on bidirectional compression. Indeed, to efficiently compress models, it is natural to compress the difference between models and anchors, and the challenge is to come up with good anchors. The last model sent by the server to all clients is too old after several local GD steps and not a good anchor: linear convergence can be proved, but there is no acceleration, so the benefits of local training are lost. In LoCoDL, the anchor is a dedicated variable that is updated in parallel to the local model estimates. Thus, it is a dynamic anchor, in contrast to a static anchor that would be updated once in a while. This is key to obtaining acceleration (see the toy sketch below).
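The following toy sketch (our own illustration with a generic rand-k compressor, not LoCoDL's actual update rules) shows why compressing the difference with a dynamic anchor is attractive: as the local estimate and the anchor approach each other, the compressed message and its error both vanish, whereas compressing the model itself keeps injecting error of the order of the model's magnitude.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased rand-k sparsifier: keep k random coordinates, scale by d/k."""
    d = v.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = (d / k) * v[idx]
    return out

rng = np.random.default_rng(1)
d, k = 100, 10
x_star = rng.standard_normal(d)            # 'solution' the iterates approach

for dist in [1.0, 1e-2, 1e-4]:             # how close iterates are to x_star
    x_i    = x_star + dist * rng.standard_normal(d)   # local model estimate
    anchor = x_star + dist * rng.standard_normal(d)   # dynamic anchor tracking the iterates

    err_diff  = np.linalg.norm(rand_k(x_i - anchor, k, rng) - (x_i - anchor))
    err_model = np.linalg.norm(rand_k(x_i, k, rng) - x_i)
    print(f"dist={dist:.0e}  error(diff)={err_diff:.2e}  error(model)={err_model:.2e}")
# error(diff) shrinks with dist, while error(model) stays of order ||x_i||.
```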
Limitations
The conclusions section discusses potential future extensions. Outlining the limitations clearly is advised. For example, is the PP use case relevant here?
We discuss venues for future work in the conclusion, such as bidirectional compression and PP. One can view it as a limitation that LoCoDL does not have these features yet.
I appreciate the detailed answers. Following the rebuttal, I have decided to raise my score from 6 to 7.
Thank you very much for your thorough evaluation of our work and for acknowledging its merits by raising your score.
Thank you again for your thorough and positive evaluation. We are kindly counting on you to support the paper toward acceptance during this discussion phase with the AC.
This paper proposes an algorithm (LoCoDL) that leverages the two well-known techniques of local training and compression to reduce the communication load in distributed learning.
Strengths
The paper addresses the interesting problem of distributed learning.
Weaknesses
- Generally, compressing the error and feeding it back into the updates is a well-known technique to reduce the variance in distributed learning. The idea of the algorithm is marginal with respect to previously known algorithms (feeding back the error and aggregating with proper coefficients).
- Also, the experiments should include accuracy versus iterations (or time) to show after how many iterations (or how much time) the performance shown in Figure 1 is achieved. So, there is a lot of work needed to improve the experimental part.
Questions
What are the new directions in distributed learning?
Limitations
justification, experiments
- compressing the error and feeding it back to the updates is a well-known technique to reduce the variance in distributed learning
Do you mean like in DIANA? We have discussed existing algorithms and techniques in the paper. The contribution is the combination of local training, like in the popular FedAvg algorithm, and compression. This way we obtain a double acceleration, with a complexity that depends on $\sqrt{\kappa}$ and $\sqrt{d}$ when $n$ is large.
You don't give any argument or reference to support your claim that our contribution is not valuable.
- the experiments should include the accuracy versus iteration (or time).
Do you want to see the plots with the number of iterations on the horizontal axis instead of the number of communicated bits? When communication is much more expensive than computation, which is precisely the point of local training in federated averaging, it is meaningless to measure the number of iterations. Our goal is to reduce the communication complexity, which is what we illustrate.
We have run additional experiments; please see the rebuttal to all reviewers.
Regarding the first comment: yes, one can consider DIANA as a reference. But the idea of feeding back the error is a very well-known technique and has been considered in several works; the authors can search for a complete list of them. The main point of this comment is that the authors failed to clearly identify the difference between their method and the previous literature. The method is a heuristic one without a theoretical justification. I believe that this issue has been raised by another reviewer.
Regarding the second comment: The plot of accuracy versus epochs gives the transitional behavior of the proposed algorithm. If the authors are familiar with the paper of DIANA, they can look at Figure 3.
The method is a heuristic one without a theoretical justification.
No, quite the opposite: our work is theoretically grounded, and accelerated convergence of our new algorithm LoCoDL is proved in Theorem 3.1 and Corollary 3.2.
the authors failed to clearly identify the difference between their method and previous literature.
The comparison is clearly summarized in Tables 1 and 2. The state of the art and our contributions, which improve upon it, are discussed in detail in the first 6 pages of the paper.
We don't understand your negative stance, as you don't provide any argument or reference to support it.
Let us state again that we prove an accelerated communication complexity that depends on $\sqrt{\kappa}$ and $\sqrt{d}$. DIANA with rand-1 compressors, for instance, has a complexity that is accelerated neither in $\kappa$ nor in $d$ (for large $n$).
The plot of accuracy versus epochs gives the transitional behavior of the proposed algorithm. If the authors are familiar with the paper of DIANA, they can look at Figure 3.
We are familiar with DIANA. In Figure 3 of arXiv:1901.09269v3, the training and testing accuracy are shown for a convex experiment with the softmax cross-entropy loss on the CIFAR10-DNN dataset. The two plots are nearly identical. The accuracy increases monotonically and there is no phase transition. The numbers of iterations, epochs, and communicated bits are proportional to each other for a given method, so taking one or the other on the horizontal axis only changes the horizontal scaling of the plot; the visible behavior remains the same.
Dear reviewer,
We would be glad to discuss our contributions further with you. The two other reviewers are convinced by the merits of our work and are clearly in favor of acceptance, with scores of 7.
Do you have any questions or suggestions for improvement?
We have run additional experiments, as recommended by the reviewers, to further compare LoCoDL and the state-of-the-art ADIANA. In all cases, LoCoDL significantly outperforms ADIANA. Attached is a PDF showing some results, including results on the MNIST dataset and in the regime d > 100n (where ADIANA has a better theoretical complexity).
The reviews of this paper are mixed, and the final decision points slightly downwards due to issues raised on both the theory and the experimental side.
While the paper claims to be theoretical, at least in terms of theoretical results there is no edge compared to methods such as ADIANA. There are also areas where the level of rigor could be improved for a theory paper; for example: 1) functions of certain quantities appear before those quantities are defined; 2) the level of rigor around the concept of compression was questioned, noting that it is not a well-defined compression paradigm, since there is a bijective map between the two. It would be helpful to clarify this and ideally provide an entirely formal notion.
On the experimental side, other concerns were raised, including: 1) experiments were done only with logistic regression, which is insufficient to show that the proposed scheme is better; 2) plots showing accuracy vs. the number of iterations are absent.
Finally, the reviewers have mixed opinions about the novelty of the proposed algorithm compared to ADIANA.