Don't Compress Gradients in Random Reshuffling: Compress Gradient Differences
In this paper we introduce an improved method that utilizes communication compression with variance reduction and sampling without replacement.
Abstract
Reviews and Discussion
This paper analyzes compressed Federated Learning algorithms in the setting of Random Reshuffling. It first provides theory for RR with compression (Q-RR), and several variants (based on existing methods for compressed SGD) which have improved convergence guarantees. The paper puts forward the hypothesis that naive Q-RR does not yield improvements over Q-SGD (based on theory and experiments). The theoretical results then show that the proposed DIANA-RR indeed improves over its SGD variant when the number of samples is small compared to the number of iterations.
Strengths
The proposed algorithms are natural extensions of existing methods (DIANA, NASTYA, ...) to the RR setting. All methods are supported with theory for the strongly convex, smooth case. The theory further sheds light on possible improvements from RR in the setup of Federated Learning with compressed communication. The provided experiments nicely complement the theoretical results (with some caveats, see below). One of the main takeaways (also reflected in the title of the paper) is that the DIANA variance-reduction technique for compression yields improvements from RR compared to the SGD variant.
Weaknesses
- On the main hypothesis that reducing the compression variance brings improvement, it is yet unclear whether this is also a necessary condition. The paper does provide a convergence result for Q-RR that does not improve over Q-SGD; but what would be preferable is a lower bound showing that Q-RR cannot improve over Q-SGD in a worst-case example. This is currently missing to complete the picture. While the experimental results presented in the main text also support the hypothesis, there are some caveats (see next point).
- The results for CIFAR10 in the main text suggest that Q-RR is on par with Q-SGD. However, in the appendix it becomes clear that this experiment refers to only training the last layer (see Fig. 12). When training the whole network (Fig. 10), the results suggest that Q-RR is actually much better than Q-SGD, and the gap is roughly equal to DIANA vs. DIANA-RR. The main text does not mention this, so this result is hard to find, as it is only presented in the appendix. I would ask the authors to clarify this part in the main text - in particular, because this result weakens one of the hypotheses, namely that the DIANA technique is necessary to get improvements (at least for nonconvex problems, like the ResNet example is). If I missed something important here, please let me know - I am happy to discuss details.
- The logistic regression examples show a clearer picture, but only three datasets are tested, and all of them are very low-dimensional. I would suggest adding more high-dimensional datasets, or other deep learning experiments, to gain more insight into whether there is a difference between Q-SGD and Q-RR.
- All theoretical results only apply to strongly convex, smooth functions. While this is a standard framework for optimization analysis, it is very far from most modern machine learning applications, which restricts the applicability of the results to such nonconvex, nonsmooth problems.
Questions
This question relates more broadly to Random Reshuffling: if you use mini-batching and do RR with the standard PyTorch framework, then the composition of the mini-batches will be different in every epoch (the DataLoader shuffles, and then partitions into batches). In that case, the theoretical framework of the paper does not fully apply anymore, correct? This is because in every epoch you would have a different set of component functions, where each one is a mini-batch loss.
Is there a way to overcome this small inconsistency between the Random Reshuffling analysis and actual usage? Or is there RR analysis that explicitly takes mini-batching into account?
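To make the distinction in this question concrete, here is a minimal sketch (our own illustration; function names are hypothetical) contrasting the setting typically analyzed in RR theory - fixed mini-batch losses whose *order* is reshuffled - with DataLoader-style reshuffling, where the *composition* of each batch changes every epoch:

```python
import random

def epochs_fixed_batches(n, batch_size, num_epochs, seed=0):
    """RR over *fixed* mini-batches (the setting RR theory covers):
    partition the samples into batches once, then reshuffle only the
    order of those batches each epoch."""
    rng = random.Random(seed)
    batches = [tuple(range(i, i + batch_size)) for i in range(0, n, batch_size)]
    for _ in range(num_epochs):
        order = list(batches)
        rng.shuffle(order)              # batch composition never changes
        yield order

def epochs_dataloader_style(n, batch_size, num_epochs, seed=0):
    """DataLoader-style RR: reshuffle the *samples* each epoch and then
    partition, so batch composition changes from epoch to epoch."""
    rng = random.Random(seed)
    for _ in range(num_epochs):
        idx = list(range(n))
        rng.shuffle(idx)                # shuffle samples, then cut into batches
        yield [tuple(sorted(idx[i:i + batch_size]))
               for i in range(0, n, batch_size)]
```

In the first variant each epoch minimizes the same finite sum of mini-batch losses, so the analysis applies verbatim; in the second, the component functions themselves are redrawn each epoch.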
Limitations
I commented on possible limitations in terms of content above.
On the main hypothesis that reducing the compression variance brings improvement, it is yet unclear whether this is also a necessary condition. The paper does provide a convergence result for Q-RR that does not improve over Q-SGD; but what would be preferable is a lower bound that shows how Q-RR can not improve over Q-SGD in a worst-case example. This is currently missing to complete the picture. While the experimental result presented in the main text also support the hypothesis, there are some caveats (see next point).
As the reviewer fairly notices, we provide only upper convergence bounds: according to them, Q-RR is not better than QSGD, and DIANA-RR has a better worst-case guarantee than both methods and the original DIANA. Deriving algorithm-dependent lower bounds for Q-RR is indeed an interesting research question that would complement our work and support the hypothesis. Nevertheless, our experiments also support our hypothesis, as the reviewer noticed; therefore, we view deriving such a lower bound as of secondary importance here. Moreover, it might require substantial work deserving a separate paper, since we are not aware of any algorithm-dependent lower bounds in the literature on distributed optimization with compression.
The results for CIFAR10 in the main text suggest that Q-RR is on par with Q-SGD. However, in the appendix it becomes clear that this experiment refers to only training the last layer (see Fig. 12). When training the whole network (Fig. 10), the results suggest that Q-RR is actually much better than Q-SGD, and the gap is roughly equal to DIANA vs. DIANA-RR. The main text does not mention this, so this result is hard to find if it's only presented in the appendix. I would ask the authors to clarify this part in the main text - in particular, because this result weakens one of the hypotheses that the DIANA technique is necessary to get improvements (at least for nonconvex problems, like the ResNet example is). If I missed something important here, please let me know - I am happy to discuss details.
Although we do not have a formal proof explaining this phenomenon, we conjecture that it can be related to the significant over-parameterization occurring during the training of a large model on a relatively small dataset. That is, the model can almost perfectly fit the training data on all clients, leading to a decrease of the heterogeneity parameter. In this case, there is no need for shifts, since the variance coming from compression naturally goes to zero, and the complexities of QSGD and DIANA match (see Table 1). In this situation, Q-RR performs better than QSGD, since the compression does not spoil the convergence of RR. Therefore, DIANA-type shifts are not always necessary to get improvements. Nevertheless, we conjecture that they are necessary when the datasets are larger and more complex (since in this case the models do not perfectly fit the data).
The logistic regression examples show a clearer picture, but only three datasets are tested, and all of them are very low-dimensional. I would suggest adding more high-dimensional datasets, or other deep learning experiments, to gain more insight into whether there is a difference between Q-SGD and Q-RR.
We agree with the reviewer that the additional experiments are important and would strengthen the paper. We will do our best to include them in the final version. However, we also would like to emphasize that our work is primarily theoretical, and the experiments are mainly needed to illustrate and support our theoretical findings.
All theoretical results only apply to strongly convex, smooth functions. While this is a standard framework for optimization analysis, it is very far from most modern machine learning applications, thus restricting the applicability of the results for such nonconvex, nonsmooth problems.
We agree that many practically important problems are non-convex. However, the main goal of our paper was to understand how to combine RR and unbiased compression properly. This research question is not related to the convexity of the problem, and it was natural for us to investigate it in the strongly convex case first. Moreover, in many cases, convex optimization serves as inspiration for methods that work well even for non-convex problems. For example, the well-known Adam uses a momentum trick, even though momentum does not theoretically improve convergence in non-convex cases. Moreover, momentum was initially proposed by Boris Polyak to accelerate the convergence rate of GD in a convex setup (more specifically, for strongly convex quadratics). Next, very successful optimizers for DL proposed recently, such as D-Adaptation [1], Prodigy [2], and Schedule-Free SGD [3], were designed and analyzed for convex problems only. Nevertheless, they work well in the training of neural networks. To sum up, we believe the initial step in the direction we consider is natural.
[1] Defazio & Mishchenko. Learning-Rate-Free Learning by D-Adaptation. ICML 2023
[2] Mishchenko & Defazio. Prodigy: An expeditiously adaptive parameter-free learner. ICML 2024
[3] Defazio et al. The Road Less Scheduled. arXiv preprint arXiv:2405.15682
Dear authors,
Thank you for the detailed rebuttal.
Regarding your hypothesis on Fig 12 vs. Fig 10: if the underlying reason is overparametrization, then this should be relatively easy to verify by running a comparison on a larger/more complex dataset than CIFAR (e.g., variants of ImageNet). Given the current results in the paper, the conjecture that DIANA-type shifts are necessary for improvement seems slightly weak, as I argued in my initial review.
Altogether, it seems that we agree more or less on the limitations mentioned in the review, which is why I will keep my score.
Dear Reviewer,
We would like to kindly emphasize the following four points.
The experiments presented in our paper are intended for illustrative purposes. The primary focus of the paper is to highlight the fundamental complexities and limits of algorithmic behavior. Conducting additional experiments, whether 1 or 100, may offer some insights but would not alter the key findings.
In our experience, the computational demands of this work are significant. Our algorithm is distributed across 10 clients, making it challenging to scale beyond this setup, especially in an academic environment. For instance, performing experiments beyond ResNet-18@CIFAR-10@FP64 with 10 clients is near the limit of what is feasible in academia. While it might be possible to scale to 100 clients in an industrial setting with dedicated computational resources, doing so would require a specialized compute cluster or the use of low-precision computation techniques.
ResNet-18 has approximately 11 million parameters (more precisely, 11,173,962). Given that the CIFAR-10 dataset contains only 50,000 training samples, this places us in an over-parameterized regime ("d >> m"). Using a larger model would only exacerbate this over-parameterization.
Training larger models, such as VGG-16 with 138 million parameters, would require roughly 100 times more computational time. Moreover, these models would still be over-parameterized. Increasing the size of the training dataset would also proportionally increase the computation time. Additionally, our simulations involve serialized computation, and it is crucial to conduct multiple runs to properly tune the step-size using the strategy outlined in Section B.2.4, "Tuning Process."
We appreciate your understanding and consideration of these points.
Wait... I'm just curious and a little bit confused: why exactly are you using FP64 for training? If all the computation is in FP64, then it is not surprising that the computational demands are significant. However, this kind of setting is very uncommon in academia. Even lower-end hardware such as consumer-level GPUs like the RTX 3080/4080 is commonly used with FP32 or FP16. I've been in academia and I also have experience with limited budgets, but I haven't heard of any commonly used hardware (CPU or GPU) that doesn't support FP32.
And honestly, from my own experience, even using pure CPU (yes, my lab is very poor) to train ResNet-18 on CIFAR-10, it only takes 1-2 minutes per epoch. So from my point of view, the common reason people do not scale resnet18@cifar10 to 100 clients is that the computation is too fast and scaling beyond 10 clients is meaningless. And I'm pretty sure that vgg@cifar100 is feasible in academia.
Dear Reviewer,
Thank you for your insightful comments.
We opted for FP64 (IEEE 754) due to its superior numerical stability compared to FP32, FP16, and BFloat16. Our primary intention in using FP64 was to address potential numerical instabilities that can arise with FP32. While FP32 and FP16 are commonly used for inference tasks, the choice of precision for training depends on the specific requirements of the task. In certain cases, FP32 may be sufficient, but for others, FP64 is necessary to ensure stability.
The performance gain from switching from FP64 to FP32 can indeed vary based on the GPU model. For instance, the NVIDIA A100 40GB GPU used in our experiments offers approximately a two-fold increase in computational throughput with FP32 compared to FP64. In contrast, GPUs such as the RTX 3080 that you referred to may exhibit a more substantial difference, with potential speedups of up to 64 times. The choice of precision is influenced by the specific architecture of the GPU, and these characteristics can differ across various GPU models and updates.
In our simulation involving 10 clients sharing a common dataset, we ran 2000 rounds/epochs for fine-tuning. Based on your estimate of 2 minutes per epoch, the total computation time would be approximately 66 hours per run (2 minutes/epoch × 2000 epochs ≈ 66 hours). Taking into account the grid search with 18 preset learning rates, 5 sets of decay parameters, and 4 algorithms, the total estimated computation time would be around 23,760 hours (66 hours × 18 × 5 × 4). We recognize that this represents a substantial amount of time.
Our simulation infrastructure was designed to maximize computational efficiency, and we would be pleased to provide additional details about the actual simulation time in the camera-ready version of the paper.
Additionally, we understand there may have been some misunderstanding regarding the feasibility of comparing multiple algorithms. To clarify, conducting a comprehensive comparison involving four algorithms with an extensive grid of hyperparameters is particularly challenging for models larger than ResNet-18 on CIFAR10. To cover 23,760 hours of training would indeed require approximately 40 GPUs running continuously for about 25 days. Nonetheless, we have conducted numerous experiments to ensure a thorough and fair comparison, as illustrated in Figure 13 on page 24.
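For reference, the grid-search estimate above is easy to reproduce. The small sketch below (our own, with all numbers taken from the thread) redoes the arithmetic without rounding the per-run time to 66 hours, which is why it lands at roughly 24,000 rather than 23,760 GPU-hours - the same order of magnitude either way:

```python
# Reproducing the compute estimate from the discussion above
# (all numbers are taken verbatim from the thread).
minutes_per_epoch = 2                            # reviewer's CPU estimate
epochs = 2000                                    # rounds/epochs per run
hours_per_run = minutes_per_epoch * epochs / 60  # ~66.7 h (thread rounds to 66)

learning_rates = 18                              # grid-search dimensions
decay_settings = 5
algorithms = 4
total_hours = hours_per_run * learning_rates * decay_settings * algorithms

gpus = 40
days = total_hours / gpus / 24                   # continuous occupancy of 40 GPUs
print(round(total_hours), round(days, 1))        # ~24000 hours, ~25 days
```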
Thank you once again for your valuable feedback. We hope this clarification addresses your concerns.
The authors propose communication-efficient distributed learning algorithms with faster convergence rates than previous methods. Their starting point is mini-batch stochastic gradient descent (SGD): Previous works have shown that random reshuffling (RR), i.e. sampling mini-batches without replacement (in other words, having clearly defined training epochs), has better theoretical convergence guarantees than sampling with replacement.
Inspired by this, in the first part of the paper, the authors propose the most natural extension of RR to the distributed learning setting: the clients quantise the gradients they compute in order to communicate them to the server efficiently. The authors derive a bound on the quantised RR (Q-RR) convergence rate (assuming the loss is convex and its gradient is Lipschitz) and show that as the quantisation error goes to 0, their bound recovers the convergence results for non-quantised RR. However, when the quantisation error is nonzero, Q-RR no longer has an advantage over quantised SGD (Q-SGD), as quantisation introduces an additional error term that negates the benefits of RR. The authors use a variance reduction technique called DIANA to deal with this issue and name this improved method DIANA-RR. They show that DIANA-RR removes the troublesome error term and attains the faster convergence rate of RR even when the gradients are quantised.
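The mechanism behind DIANA ("compress gradient differences", as in the paper's title) can be sketched in a few lines of plain Python. This is our own minimal illustration, not the authors' implementation: `rand_k` is a generic unbiased rand-k sparsifier, and the client compresses the difference between its gradient and a learned shift, so the compressed message - and with it the compression variance - shrinks as the gradients stabilise:

```python
import random

def rand_k(x, k, rng):
    """Unbiased rand-k sparsification: keep k random coordinates,
    rescaled by d/k so that E[Q(x)] = x (variance parameter
    omega = d/k - 1)."""
    d = len(x)
    keep = set(rng.sample(range(d), k))
    return [x[i] * d / k if i in keep else 0.0 for i in range(d)]

class DianaClient:
    """Minimal sketch of DIANA's shift mechanism (our illustration,
    not the authors' code). The client compresses the *difference*
    between its gradient and a local shift h, then moves h toward the
    gradient; the server mirrors h, so h + message is an unbiased
    gradient estimate whose compression error shrinks over time."""

    def __init__(self, dim, alpha, k, seed=0):
        self.h = [0.0] * dim      # shift, kept in sync with the server
        self.alpha = alpha        # shift step size (theory uses alpha <= k/d)
        self.k = k
        self.rng = random.Random(seed)

    def message(self, grad):
        diff = [g - h for g, h in zip(grad, self.h)]
        msg = rand_k(diff, self.k, self.rng)          # compress g - h
        self.h = [h + self.alpha * m for h, m in zip(self.h, msg)]
        return msg

# With a constant gradient, the shift locks onto it, so the compressed
# difference -- and hence the compression variance -- goes to zero.
client = DianaClient(dim=5, alpha=0.4, k=2, seed=1)
grad = [1.0, -2.0, 3.0, 0.5, -1.5]
for _ in range(300):
    msg = client.message(grad)
```

Compressing the gradient directly, by contrast, keeps injecting variance proportional to the gradient norm at the optimum, which is exactly the error term the paper shows Q-RR cannot remove.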
In the second part of the paper, the authors observe the same behaviour when the clients perform the model updates and send quantised weight differences instead of gradients. That is, the authors show that quantising the weight differences introduces too much noise, which negates the benefits of RR. However, introducing the same variance reduction technique as in the SGD case restores the benefits of RR.
Finally, the authors perform some basic federated learning experiments and show that their proposed methods outperform the competitors even when the losses involved are non-convex.
Strengths
The authors provide a rigorous, insightful analysis of simple extensions of previous methods. Concretely, their analysis reveals the flaws of naively quantising SGD with random reshuffling, which motivates the more involved, control-iterate-based extensions. They then show that the more involved methods retain the benefits of their original, unquantised counterparts. I checked the proofs of theorems in appendices C.1 and D.1, and I believe they are correct.
Weaknesses
I should note that I am not an expert in federated learning, so I cannot fully evaluate the significance of the authors' contributions or how realistic some of their assumptions are in practice.
From a technical perspective, my main concern is that I am unsure how challenging it is to realise random reshuffling in practice. As I understand, using the authors' algorithms does not allow the server to start the next training epoch before every client sends updates for the current epoch. This sounds like a severe limitation, as in the worst case, some of the clients might drop out (e.g. due to connectivity issues) and never send any updates, causing the training loop to get stuck. Could the authors please comment on this? If this is a genuine concern, then the authors should discuss this limitation in the main text.
The other issue with the paper is that the writing needs to be improved significantly. While the introduction section is well-written, it is three pages long, which means there is not enough space to expose the large amount of content that the authors wish to include at the level of formality they want to use. Concretely, the paper goes from a high-level discussion in the introduction section to being immensely technical in section two, with many new quantities and notations introduced without explaining what they represent. For example, what is the interpretation of the shuffling radius, or the quantity in Eq (6)? Even more importantly, the methods the authors base their work on, namely DIANA and NASTYA, are not explained; the reader only has the pseudocode given in Algorithms 2-4 to work with. The best illustration of this is the disconnect between the paper title and its contents: there is no explicit mention of compressing gradient differences in the main text. I believe the authors' use of DIANA inspires the title, but I am not fully sure. Furthermore, the pseudocode in Algorithms 1-4 is ambiguous, as the authors mix what happens on the client's side and on the server's side. I suggest the authors break up the code into two parts, one demonstrating the server-side operations and one showing the clients' side.
Besides this, more minor weaknesses include:
- I find Eq (1) and similar equations combining definitions with statements very confusing; please separate the two.
- As defined, assumption (1) doesn't make sense; the expectations have no meaning, as the operator is not defined to be stochastic; please make the meaning precise.
- The symbol denoting the batch size is technically undefined; it should be included in the chain of equalities on line 34
- Table 1 is difficult to interpret, and it is unclear to me what value it brings to the paper. I would either move it to the appendix or cut it altogether and use the space to address my main concern, which I outlined above.
- Over what is the expectation in Def 2.1?
- Figures 2 and 3 should be renamed to Figures 1a and 1b, respectively.
- Merge the two panels in Figure 4
- Algorithm 1, steps 2 and 3: the symbol is undefined
Questions
n/a
Limitations
The authors don't discuss any of their methods' limitations (see, e.g., my main technical concern in the weaknesses section); I think such a discussion would significantly strengthen the paper.
From a technical perspective, my main concern is that I am unsure how challenging it is to realise random reshuffling in practice.
RR is a well-known technique exploited in many applications: ML training, SGD, Monte Carlo methods, etc. For example, in many ML frameworks, such as TensorFlow or PyTorch, random reshuffling of the training dataset is performed at the beginning of each epoch so that the models do not learn the order of the training data.
The specific implementation and initialization of random reshuffling depend on the programming language, library, or framework being used. Most modern data processing and machine learning libraries provide built-in functions for random reshuffling, making it easy to integrate into various workflows.
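As a concrete single-machine illustration of the per-epoch reshuffling described above (our own sketch with hypothetical names, not the paper's distributed algorithm), one epoch of RR-SGD draws a fresh permutation and visits every sample exactly once:

```python
import random

def rr_sgd(x0, grad_i, n, lr, epochs, seed=0):
    """Single-machine RR-SGD sketch: each epoch draws a fresh permutation
    of the n samples and takes one gradient step per sample, so every
    sample is visited exactly once per epoch."""
    rng = random.Random(seed)
    x = x0
    for _ in range(epochs):
        perm = list(range(n))
        rng.shuffle(perm)               # the per-epoch reshuffling
        for i in perm:
            x -= lr * grad_i(x, i)
    return x

# Toy problem: f_i(x) = 0.5 * (x - a[i])**2, so the minimiser of the
# average loss is mean(a) = 2.5.
a = [1.0, 2.0, 3.0, 4.0]
x_final = rr_sgd(0.0, lambda x, i: x - a[i], n=4, lr=0.05, epochs=200)
# x_final hovers near 2.5, up to an O(lr) residual typical of
# constant-step-size RR.
```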
As I understand, using the authors' algorithms does not allow the server to start the next training epoch before every client sends updates for the current epoch. This sounds like a severe limitation, as in the worst case, some of the clients might drop out (e.g. due to connectivity issues) and never send any updates, causing the training loop to get stuck. Could the authors please comment on this?
Partial participation and the asynchronous setting are important aspects of distributed optimization. However, we believe that one paper cannot cover all possible aspects of the field; otherwise, each paper would be several hundred pages long. In this work, we focus on data sampling, not client sampling. We could extend our results to the partial participation setting, but we believe it would make the results harder for readers to follow. For the sake of readability, a paper should focus on one or two particular ideas, and the questions we address (how to combine RR, unbiased compression, and, optionally, local steps) are challenging and interesting on their own.
The other issue with the paper is that the writing needs to be improved significantly. While the introduction section is well-written, it is three pages long, which means there is not enough space to expose the large amount of content that the authors wish to include at the level of formality they want to use. Concretely, the paper goes from a high-level discussion in the introduction section to being immensely technical in section two, with many new quantities and notations introduced without explaining what they represent. For example, what is the interpretation of the shuffling radius, or the quantity in Eq (6)? Even more importantly, the methods the authors base their work on, namely DIANA and NASTYA, are not explained; the reader only has the pseudocode given in Algorithms 2-4 to work with. The best illustration of this is the disconnect between the paper title and its contents: there is no explicit mention of compressing gradient differences in the main text. I believe the authors' use of DIANA inspires the title, but I am not fully sure. Furthermore, the pseudocode in Algorithms 1-4 is ambiguous, as the authors mix what happens on the client's side and on the server's side. I suggest the authors break up the code into two parts, one demonstrating the server-side operations and one showing the clients' side.
We thank the reviewer for the suggestions. If our paper gets accepted, we will have an extra page to add more clarifications to the main text. We also can shorten the introduction. We believe these adjustments can be easily done.
I find Eq (1) and similar equations combining definitions with statements very confusing; please separate the two.
Eq (1) is the standard finite sum formulation of the optimization objective, see (Mishchenko et al., 2019; Stich, 2020; Haddadpour et al., 2021).
As defined, assumption (1) doesn't make sense; the expectations have no meaning as is not stochastic; please make the meaning precise.
We consider only the case when the compression operator is a random operator, e.g., see (Horvath et al., 2019). This assumption is satisfied by many compression operators, e.g., random sparsification and quantization.
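For instance, random sparsification satisfies the unbiasedness part of such an assumption. The small check below (our own illustration) verifies $\mathbb{E}[\mathcal{Q}(x)] = x$ for rand-k sparsification exactly, by enumerating all equally likely k-subsets rather than sampling:

```python
import itertools

def rand_k_outcomes(x, k):
    """All equally likely outcomes of rand-k sparsification: each
    k-subset of coordinates is kept and rescaled by d/k."""
    d = len(x)
    for keep in itertools.combinations(range(d), k):
        yield [x[i] * d / k if i in keep else 0.0 for i in range(d)]

def exact_mean(x, k):
    """Average over all outcomes -- the exact expectation E[Q(x)]."""
    outs = list(rand_k_outcomes(x, k))
    return [sum(o[i] for o in outs) / len(outs) for i in range(len(x))]

x = [3.0, -1.0, 2.0, 0.5]
m = exact_mean(x, 2)   # coincides with x coordinate-wise: E[Q(x)] = x
```

Each coordinate survives in a fraction k/d of the subsets and is scaled by d/k when it does, so the two factors cancel in expectation.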
The symbol denoting the batch size is technically undefined; it should be included in the chain of equalities on line 34.
Thank you for noting this; we will define the batch-size symbol in the chain of equalities on line 34, as advised.
Table 1 is difficult to interpret, and it is unclear to me what value it brings to the paper. I would either move it to the appendix or cut it altogether and use the space to address my main concern, which I outlined above.
Table 1 is very important, and it summarizes our main theoretical results. We will address your concerns without removing the table.
Over what is the expectation in Def 2.1?
The expectation is taken with respect to data permutation, which is the essential part of the random reshuffling method.
Figures 2 and 3 should be renamed to Figures 1a and 1b, respectively.
Thank you for the suggestion. We will update the paper accordingly.
Merge the two panels in Figure 4
We believe that merging would make the plot more complicated. In the first panel, one can observe that the performance of QSGD and Q-RR is similar; in the second, it is clear that DIANA-RR works better than DIANA. To compare Q-RR and DIANA-RR, it is enough to compare the obtained accuracies.
Algorithm 1, steps 2 and 3, : the symbol is undefined
We will make this adjustment.
I thank the authors for their rebuttal. Unfortunately, my main concerns remain unaddressed.
Given the authors' response, I take it that my understanding of their algorithm is correct and that their proposed algorithm requires full participation from each client in each epoch. I am quite concerned about this, as 1) each epoch of the training algorithm thus scales with the time required by the slowest client, which could be very large or even infinite (if a client drops out), and 2) it is unclear how this could be mitigated.
Overall, while I definitely think the authors' work is addressing an underexplored gap, my current understanding is that their algorithm requires significantly more stringent conditions to hold to work well in practice. From a more empirical perspective, I think the authors' convergence results should be discussed with the computational time in mind to provide a fair comparison with other methods.
For example, imagine a certain "best-case scenario": the time it takes for the clients to compute the gradients and communicate them back to the server is exponentially distributed with rate 1. Is it worth using the authors' algorithm and waiting for the slowest client to finish its computation to benefit from the improved convergence rate, or would it be more beneficial to use QSGD, which has much looser requirements and allows us to perform many more updates, though with worse theoretical guarantees?
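This scenario can be made quantitative: with $M$ clients whose times are i.i.d. Exp(1), a synchronous epoch waits for the slowest client, whose expected time is the harmonic number $H_M = \sum_{i=1}^{M} 1/i \approx \ln M$, versus an expected time of 1 for any single client. A small self-contained check of this standard fact (our illustration, not from the paper):

```python
import random

def harmonic(M):
    """H_M = sum_{i=1}^M 1/i: the expected maximum of M i.i.d. Exp(1)
    variables, i.e. the mean synchronous epoch time with M clients."""
    return sum(1.0 / i for i in range(1, M + 1))

def simulated_epoch_time(M, trials, seed=0):
    """Monte-Carlo estimate: wait for the slowest of M exponential clients."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.expovariate(1.0) for _ in range(M))
    return total / trials

print(harmonic(10))   # ~2.93: with 10 clients, a synchronous epoch takes
                      # ~3x the time of a single client on average
```

So with 10 clients the synchronization overhead is roughly a factor of 3, growing only logarithmically in $M$ - a useful baseline when weighing the improved rate against per-update wall-clock time.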
I would also like to ask the authors to answer the questions I ask in the paragraph in my review beginning with the phrase "The other issue...".
Thank you for your response!
To address the issue related to the full participation case, we have provided theoretical results specifically for the partial participation setting. In this context, we analyze the scenario where only a subset of clients, with cardinality $C$, is sampled. We have included theoretical results for the Q-NASTYA and DIANA-NASTYA methods in this setting.
Furthermore, we plan to present similar theoretical results for the Q-RR and DIANA-RR methods in the camera-ready version of the paper. This additional analysis will provide a more comprehensive view of our methods' performance under various participation scenarios. Please review the theoretical results provided below, starting with the Q-NASTYA method in the Partial Participation setting. We hope these results clarify our approach and address the concerns raised.
Theorem 1. Let step sizes satisfy the following equations
Under the Assumptions 1, 2, 3 it holds
\mathbb{E}\left[\left\Vert x_T-x^{\star}\right\Vert^2\right] \leq & \left(1-\frac{\eta \mu}{2}\right)^T\left\Vert x_0-x^{\star}\right\Vert^2+\frac{9}{2}\frac{\gamma^2 n L_{\max }}{\mu}\left(\frac{1}{M} \sum_{m=1}^M \sigma_{\star, m}^2+n \sigma_{\star}^2\right) +8 \frac{\eta}{\mu}\left(\frac{\omega}{C} \sigma_{\star}^2+\frac{M-C}{C \max (M-1,1)} \sigma_{\star}^2\right) \end{aligned}$$ where $$\sigma_{\star}^2=\frac{1}{M} \sum_{m=1}^M\left\Vert\nabla f_m\left(x^{\star}\right)\right\Vert^2, \quad \sigma_{\star, m}^2=\frac{1}{n}\left\Vert\nabla f_m^i\left(x^{\star}\right)\right\Vert^2.$$ As we can see, there is an additional error term proportional to $ \frac{M-C}{C \max (M-1,1)} $ that arises due to client sampling in the partial participation setting. Note that when $ C=M $ (all clients are participating), this error term vanishes, allowing us to recover the previous result for the full participation case. This shows the consistency of our theoretical framework across different participation scenarios. We start from Lemma F.1: **Lemma F.1. (updated)** Let Assumptions 1, 2, 3 hold. Then, for all $t \geq 0$ the iterates produced by Q-NASTYA satisfy $$\mathbb{E}\_{\mathcal{Q}, S\_t}\left[\left\Vert g\_t\right\Vert^2\right] \leq \frac{2 L_{\max }^2\left(1+\frac{\omega}{C}\right)}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2 +8 L_{\max }\left(1+\frac{\omega}{C}\right)\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) +4\left(\frac{\omega}{C}+\frac{M-C}{C \max \{M-1,1\}}\right) \sigma_{\star}^2$$ where $\mathbb{E}\_{\mathcal{Q}, S\_t}$ is expectation w.r.t. $\mathcal{Q}, S\_t$ and $ \sigma_{\star}^2=\frac{1}{M} \sum_{m=1}^M\left\Vert \nabla f_m\left(x^{\star}\right)\right\Vert^2. 
$ **Proof**: Using $\mathbb{E}\left[\Vert \xi\Vert^2\right]=\mathbb{E}\left[\Vert\xi-\mathbb{E}[\xi]\Vert^2\right]+\Vert\mathbb{E} \xi\Vert^2$, we obtain $ \mathbb{E}\_{\mathcal{Q}}\left[\left\Vert g\_t\right\Vert^2\right]= & \mathbb{E}\_{\mathcal{Q}}\left[\left\Vert\frac{1}{C} \sum\_{m \in S_t}\left(\mathcal{Q}\left(\frac{1}{n} \sum\_{i=0}^{n-1} \nabla f_m^{\pi\_m^i}\left(x\_{t, m}^i\right)\right)-\frac{1}{n} \sum\_{i=0}^{n-1} \nabla f_m^{\pi\_m^i}\left(x\_{t, m}^i\right)\right)+\frac{1}{C n} \sum\_{m \in S_t}\sum\_{i=0}^{n-1} \nabla f_m^{\pi\_m^i}\left(x\_{t, m}^i\right)\right\Vert^2\right] \\\\ =&\frac{1}{C^2} \mathbb{E}\_{\mathcal{Q}}\Vert\sum\_{m \in S\_t}(\underbrace{\mathcal{Q}\left(\frac{1}{n} \sum\_{i=0}^{n-1} \nabla f\_m^{\pi\_m^i}\left(x\_{t, m}^i\right)\right)-\frac{1}{n} \sum\_{i=0}^{n-1}\nabla f\_m^{\pi\_m^i}\left(x\_{t, m}^i\right)}_{=\xi\_m}\Vert^2] \\\\ &+\left\Vert \frac{1}{C n} \sum\_{m \in S_t}\sum\_{i=0}^{n-1} \nabla f\_m^{\pi_m^i}\left(x\_{t, m}^i\right)\right\Vert^2 \\\\ =&\frac{1}{C^2} \mathbb{E}\_{\mathcal{Q}}\left[\sum\_{m \in S_t}\left\Vert \xi\_m\right\Vert^2+\sum\_{m, l \in S_t: m \neq l} 2\left\langle\xi\_m,\xi\_l\right\rangle\right]+\left\Vert\frac{1}{C n} \sum\_{m \in S\_t}\sum\_{i=0}^{n-1} \nabla f\_m^{\pi\_m^i}\left(x\_{t, m}^i\right)\right\Vert^2. 
$ Using independence between $\xi_m$ and $\xi_l$ for different $m, l$ and Using (2), (3), we get $ \mathbb{E}\_{\mathcal{Q}}\left[\left\Vert g\_t\right\Vert^2\right]= & \frac{1}{C^2} \sum\_{m \in S\_t} \mathbb{E}\_{\mathcal{Q}}\left[\left\Vert \mathcal{Q}\left(\frac{1}{n} \sum\_{i=0}^{n-1} \nabla f\_m^{\pi\_m^i}\left(x\_{t, m}^i\right)\right)-\frac{1}{n} \sum\_{i=0}^{n-1} \nabla f\_m^{\pi\_m^i}\left(x\_{t, m}^i\right)\right\Vert^2\right] \\\\ &+\left\Vert\frac{1}{C n} \sum_{m \in S_t} \sum_{i=0}^{n-1} \nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)\right\Vert^2 \\\\ \leq & \frac{\omega}{C^2} \sum_{m \in S_t}\left\Vert\frac{1}{n} \sum_{i=0}^{n-1} \nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)\right\Vert^2+\left\Vert\frac{1}{C n} \sum_{m \in S_t} \sum_{i=0}^{n-1} \nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)\right\Vert^2. $Rewriting previous inequality and using , we have
$
\mathbb{E}_{\mathcal{Q}}\left[\left\Vert g_t\right\Vert ^2\right] \leq & \frac{2 \omega}{C^2} \sum_{m \in S_t}\left\Vert\frac{1}{n} \sum_{i=0}^{n-1}\left(\nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)-\nabla f_m^{\pi_m^i}\left(x_t\right)\right)\right\Vert^2+\frac{2 \omega}{C^2} \sum_{m \in S_t}\left\Vert\nabla f_m\left(x_t\right)\right\Vert^2 \\ & +2\left\Vert\frac{1}{C n} \sum_{m \in S_t} \sum_{i=0}^{n-1}\left(\nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)-\nabla f_m^{\pi_m^i}\left(x_t\right)\right)\right\Vert^2+2\left\Vert\frac{1}{C} \sum_{m \in S_t} \nabla f_m\left(x_t\right)\right\Vert^2 \\ \leq & \frac{2\left(1+\frac{\omega}{C}\right)}{C} \sum_{m \in S_t}\left\Vert\frac{1}{n} \sum_{i=0}^{n-1}\left(\nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)-\nabla f_m^{\pi_m^i}\left(x_t\right)\right)\right\Vert^2 \\ & +\frac{2 \omega}{C^2} \sum_{m \in S_t}\left\Vert\nabla f_m\left(x_t\right)\right\Vert^2+2\left\Vert\frac{1}{C} \sum_{m \in S_t} \nabla f_m\left(x_t\right)\right\Vert^2.
$
Using $L_{\max}$-smoothness and convexity of the functions $f_m^{\pi_m^i}$ and $f_m$, we obtain
$
\mathbb{E}_{\mathcal{Q}}\left[\left\Vert g_t\right\Vert^2\right] \leq & \frac{2\left(1+\frac{\omega}{C}\right)}{C n} \sum_{m \in S_t} \sum_{i=0}^{n-1}\left\Vert \nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)-\nabla f_m^{\pi_m^i}\left(x_t\right)\right\Vert^2+\frac{4 \omega}{C^2} \sum_{m \in S_t}\left\Vert \nabla f_m\left(x_t\right)-\nabla f_m\left(x^{\star}\right)\right\Vert^2 \\ & +\frac{4 \omega}{C^2} \sum_{m \in S_t}\left\Vert\nabla f_m\left(x^{\star}\right)\right\Vert^2+4\left\Vert\frac{1}{C} \sum_{m \in S_t}\left(\nabla f_m\left(x_t\right)-\nabla f_m\left(x^{\star}\right)\right)\right\Vert^2+4\left\Vert\frac{1}{C} \sum_{m \in S_t} \nabla f_m\left(x^{\star}\right)\right\Vert^2 \\ \leq & \frac{2 L_{\max }^2\left(1+\frac{\omega}{C}\right)}{C n} \sum_{m \in S_t} \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2+\frac{8 L_{\max }\left(1+\frac{\omega}{C}\right)}{C} \sum_{m \in S_t} D_{f_m}\left(x_t, x^{\star}\right) \\ & +\frac{4 \omega}{C^2} \sum_{m \in S_t}\left\Vert\nabla f_m\left(x^{\star}\right)\right\Vert^2+4\left\Vert\frac{1}{C} \sum_{m \in S_t} \nabla f_m\left(x^{\star}\right)\right\Vert^2.
$
Taking expectation w.r.t. $S_t$ and using uniform sampling, we obtain
$
\mathbb{E}_{\mathcal{Q}, S_t}\left[\left\Vert g_t\right\Vert^2\right] \leq & \frac{2 L_{\max }^2\left(1+\frac{\omega}{C}\right)}{n} \mathbb{E}_{S_t}\left[\frac{1}{C} \sum_{m \in S_t} \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2\right]+8 L_{\max }\left(1+\frac{\omega}{C}\right) \mathbb{E}_{S_t}\left[\frac{1}{C} \sum_{m \in S_t} D_{f_m}\left(x_t, x^{\star}\right)\right] \\ & +\frac{4 \omega}{C} \mathbb{E}_{S_t}\left[\frac{1}{C} \sum_{m \in S_t}\left\Vert \nabla f_m\left(x^{\star}\right)\right\Vert^2\right]+4 \mathbb{E}_{S_t}\left[\left\Vert \frac{1}{C} \sum_{m \in S_t} \nabla f_m\left(x^{\star}\right)\right\Vert^2\right] \\ \leq & \frac{2 L_{\max }^2\left(1+\frac{\omega}{C}\right)}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2+\frac{8 L_{\max }\left(1+\frac{\omega}{C}\right)}{M} \sum_{m=1}^M D_{f_m}\left(x_t, x^{\star}\right) \\ & +\frac{4 \omega}{C} \frac{1}{M} \sum_{m=1}^M\left\Vert \nabla f_m\left(x^{\star}\right)\right\Vert^2+4 \frac{M-C}{M C \max \{M-1,1\}} \sum_{m=1}^M\left\Vert \nabla f_m\left(x^{\star}\right)\right\Vert^2.
$
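As an illustration (ours, not part of the paper), the unbiasedness and variance conditions (2)-(3) used in the bound above can be checked numerically for a Rand-$k$ sparsifier, an unbiased compressor with $\omega = d/k - 1$:

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased Rand-k sparsifier: keep k random coordinates, rescale by d/k.
    It satisfies E[Q(x)] = x and E||Q(x) - x||^2 = (d/k - 1) ||x||^2."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    q = np.zeros_like(x)
    q[idx] = x[idx] * (d / k)
    return q

rng = np.random.default_rng(0)
d, k = 20, 5
omega = d / k - 1                                    # compressor variance parameter
x = rng.standard_normal(d)
samples = np.stack([rand_k(x, k, rng) for _ in range(100_000)])
bias = np.linalg.norm(samples.mean(axis=0) - x)      # close to 0 (unbiasedness)
var = np.mean(np.sum((samples - x) ** 2, axis=1))    # close to omega * ||x||^2
print(bias, var / (omega * np.sum(x ** 2)))
```

For Rand-$k$ the variance bound holds with equality, which is why the printed ratio concentrates around 1.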
In the next part, we prove the theorem.
Proof of Theorem 1: Taking expectation w.r.t. $\mathcal{Q}$ and $S_t$ and using the updated Lemma F.1, we get
$
\mathbb{E}_{\mathcal{Q}, S_t}\left[\left\Vert x_{t+1}-x^{\star}\right\Vert^2\right]= & \left\Vert x_t-x^{\star}\right\Vert^2-2 \eta \mathbb{E}_{\mathcal{Q}, S_t}\left[\left\langle g_t, x_t-x^{\star}\right\rangle\right]+\eta^2 \mathbb{E}_{\mathcal{Q}, S_t}\left[\left\Vert g_t\right\Vert^2\right] \\ \leq & \left\Vert x_t-x^{\star}\right\Vert^2-2 \eta \mathbb{E}_{\mathcal{Q}, S_t}\left[\left\langle\frac{1}{C} \sum_{m \in S_t} \mathcal{Q}\left(\frac{1}{n} \sum_{i=0}^{n-1} \nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right)\right), x_t-x^{\star}\right\rangle\right] \\ & +\frac{2 \eta^2 L_{\max }^2\left(1+\frac{\omega}{C}\right)}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2+8 \eta^2 L_{\max }\left(1+\frac{\omega}{C}\right)\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) \\ & +4 \eta^2\left(\frac{\omega}{C}+\frac{M-C}{C \max \{M-1,1\}}\right) \sigma_{\star}^2 \\ \leq & \left\Vert x_t-x^{\star}\right\Vert^2-2 \eta \frac{1}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\langle\nabla f_m^{\pi_m^i}\left(x_{t, m}^i\right), x_t-x^{\star}\right\rangle \\ & +\frac{2 \eta^2 L_{\max }^2\left(1+\frac{\omega}{C}\right)}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2+8 \eta^2 L_{\max }\left(1+\frac{\omega}{C}\right)\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) \\ & +4 \eta^2\left(\frac{\omega}{C}+\frac{M-C}{C \max \{M-1,1\}}\right) \sigma_{\star}^2.
$
Using Lemma F.2, we obtain
$
\mathbb{E}_{\mathcal{Q}, S_t}\left[\left\Vert x_{t+1}-x^{\star}\right\Vert^2\right] \leq & \left\Vert x_t-x^{\star}\right\Vert^2-\frac{\eta \mu}{2}\left\Vert x_t-x^{\star}\right\Vert^2-\eta\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) \\ & +8 \eta^2 L_{\max }\left(1+\frac{\omega}{C}\right)\left(f\left(x_t\right)-f\left(x^{\star}\right)\right)+\frac{\eta L_{\max }}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2 \\ & +\frac{2 \eta^2 L_{\max }^2\left(1+\frac{\omega}{C}\right)}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2+4 \eta^2\left(\frac{\omega}{C}+\frac{M-C}{C \max \{M-1,1\}}\right) \sigma_{\star}^2 \\ \leq & \left(1-\frac{\eta \mu}{2}\right)\left\Vert x_t-x^{\star}\right\Vert^2-\eta\left(1-8 \eta L_{\max }\left(1+\frac{\omega}{C}\right)\right)\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) \\ & +\frac{\eta L_{\max }\left(1+2 \eta L_{\max }\left(1+\frac{\omega}{C}\right)\right)}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2 \\ & +4 \eta^2\left(\frac{\omega}{C}+\frac{M-C}{C \max \{M-1,1\}}\right) \sigma_{\star}^2.
$
Using Lemma F.3, we have
$
\mathbb{E}_{\mathcal{Q}, S_t}\left[\left\Vert x_{t+1}-x^{\star}\right\Vert^2\right] \leq & \left(1-\frac{\eta \mu}{2}\right)\left\Vert x_t-x^{\star}\right\Vert^2-\eta\left(1-8 \eta L_{\max }\left(1+\frac{\omega}{C}\right)\right)\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) \\ & +\eta L_{\max }\left(1+2 \eta L_{\max }\left(1+\frac{\omega}{C}\right)\right) \cdot 8 \gamma^2 n^2 L_{\max }\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) \\ & +\eta L_{\max }\left(1+2 \eta L_{\max }\left(1+\frac{\omega}{C}\right)\right) \cdot 2 \gamma^2 n\left(\frac{1}{M} \sum_{m=1}^M \sigma_{\star, m}^2+n \sigma_{\star}^2\right) \\ & +4 \eta^2\left(\frac{\omega}{C}+\frac{M-C}{C \max \{M-1,1\}}\right) \sigma_{\star}^2.
$
Using (8) and recursively unrolling the resulting inequality, we finish the proof.
I thank the authors for their detailed response. They have partially addressed my concerns, and I raise my score to recognise this. However, I still have three important concerns:
- what exactly is the algorithm the authors propose for partial participation? Does the server wait for the first clients to return responses and then start a new epoch?
- given the authors' partial participation bound, is it always worth using the partial participation algorithm above, which includes RR, compared to QSGD? Is there a situation where QSGD is more worth it, at least according to the bounds?
- what is the interpretation of the shuffling radius and the quantity in Eq (6)? Is DIANA the namesake of the paper title?
what exactly is the algorithm the authors propose for partial participation? Does the server wait for the first clients to return responses and then start a new epoch?
We propose the Q-NASTYA and DIANA-NASTYA methods with Partial Participation. In our approach, we consider uniform sampling with a cohort size equal to $C$, meaning that each possible subset of size $C$ has an equal probability of being selected. Under this sampling scheme, the probability of any particular client being chosen is $C/M$. By assigning this probability to all clients, we can effectively say that the server waits for a randomly selected set of clients. Due to time constraints during the discussion period, we have focused on this uniform sampling approach. However, we acknowledge that more complex sampling scenarios could provide additional insights and potentially improve the method's performance. We plan to explore and incorporate these more sophisticated sampling strategies in the camera-ready version of the paper.
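A minimal sketch (ours, for illustration only) of the uniform cohort sampling described above: every size-$C$ subset is equally likely, so each client is included with probability $C/M$, which can be checked empirically:

```python
import numpy as np

rng = np.random.default_rng(1)
M, C, T = 10, 3, 100_000          # M clients, cohort size C, T sampled rounds
counts = np.zeros(M)
for _ in range(T):
    cohort = rng.choice(M, size=C, replace=False)  # uniform size-C cohort
    counts[cohort] += 1
print(counts / T)                 # each entry close to C / M = 0.3
```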
given the authors' partial participation bound, is it always worth using the partial participation algorithm above, which includes RR, compared to QSGD? Is there a situation where QSGD is more worth it, at least according to the bounds?
QSGD was not analyzed in the Partial Participation setting, but if we were to conduct such an analysis, we would expect similar terms arising from Partial Participation. However, when the compression variance is dominant, Q-RR does not offer significant benefits compared to Q-SGD in either the full participation or the partial participation setting. According to the theoretical bounds, Q-RR becomes more advantageous only when the compression is relatively mild. This is the primary motivation behind introducing the DIANA-NASTYA method: by reducing the compression variance, we aim to make the method more effective and beneficial across different scenarios, especially when dealing with varying levels of participation and compression.
what is the interpretation of the shuffling radius and the quantity in Eq (6)?
The concept of the shuffling radius is analogous to the standard variance-at-the-optimum term used in SGD analysis, but it is specifically adapted for Random Reshuffling methods. The shuffling radius captures the dispersion of the algorithm's iterates around the optimum caused by the stochasticity of the process. We argue that it serves as a natural counterpart to the standard variance term commonly utilized in SGD analysis. Additionally, Lemma 2.1 provides both upper and lower bounds for the shuffling radius in terms of the standard variance at the optimum. This relationship further underscores the significance of the shuffling radius as a key quantity for understanding the behavior of Random Reshuffling methods.
In Equation 6, we have a Lyapunov function that consists of two terms: the distance between the current point and the optimum, and a weighted sum of the distances between learnable shifts (control variables) and gradients at the optimum. Including these two terms in the Lyapunov function implies that by minimizing this function, we simultaneously move towards the optimum and learn the shifts.
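For concreteness, a sketch of the Lyapunov function described above, in the notation of the appendix derivations (here $A > 0$ is a weighting constant and $h_m^\star = \nabla f_m(x^\star)$ are the optimal shifts):

```latex
\Psi_t \;=\; \underbrace{\left\Vert x_t - x^{\star} \right\Vert^2}_{\text{distance to the optimum}}
\;+\; \underbrace{\frac{A}{M} \sum_{m=1}^{M} \left\Vert h_{t,m} - h_m^{\star} \right\Vert^2}_{\text{quality of the learned shifts}}
```

Driving $\Psi_t$ to zero forces both terms to vanish, which is exactly the "move to the optimum while learning the shifts" behavior described above.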
Is DIANA the namesake of the paper title?
DIANA is a variance reduction technique developed to address the variance introduced by compression [1]. We chose to use a similar naming convention for our algorithms because our approach is conceptually aligned with this technique. However, it is important to note that the DIANA technique required significant modifications to be effective in our context. Specifically, we had to adapt it to handle sampling without replacement and to function properly in scenarios involving partial participation. These adjustments were crucial for ensuring that the technique performs well under the updated conditions described in our paper.
[1] Horváth, Samuel, et al. "Stochastic distributed learning with gradient quantization and double-variance reduction." Optimization Methods and Software 38.1 (2023): 91-106.
If we have addressed all the raised questions, could you please consider increasing the score? If you have any further questions, we are open to answering them.
I thank the authors for their extensive and thorough response; they have addressed all my concerns. I am happy to recommend acceptance for the paper and raise my score accordingly.
To improve the camera-ready version of the paper, I would like to emphasise how important it is that the authors include at least a synopsis of our discussion here, especially their responses in the reply titled "Response to three concerns."
Dear Reviewer,
Thank you for raising the score and engaging in the discussion!
We will be sure to incorporate the clarifications and summary from our discussion into the paper. We plan to use an additional page in the main paper for these updates and will also include further details in the supplementary materials.
Best regards, Authors
Dear Reviewer,
Please note that we provided theoretical results not only for Q-NASTYA in the partial participation regime but also for DIANA-NASTYA in the same setting. Similar to the full participation case, this approach allows us to reduce compression variance in Q-NASTYA, which may be the dominant factor. Even without this technique, Q-NASTYA in the partial participation case remains beneficial if the compression is minimal. However, when variance reduction for compression is applied, DIANA-NASTYA in the partial participation setting becomes particularly advantageous.
Let the step sizes satisfy the following equations:
Under Assumptions 1, 2, and 3, it holds that
Note that we eliminate the variance term proportional to $\frac{\omega}{C}$. In the Partial Participation regime, we have a variance term proportional to $\frac{M-C}{C(M-1)}$, which equals zero if $C = M$. This term decreases as $C$ grows, so we achieve the expected linear speedup.
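In symbols, the partial-participation variance term discussed above (as it appears in the final recursion) vanishes at full participation:

```latex
\frac{2\eta^2 (M - C)}{C (M - 1)}\,\sigma_{\star}^2
\;\xrightarrow[\;C \to M\;]{}\; 0
```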
We start from STEP 1: we need to estimate the inner product term. We have
STEP 2: we need to bound the second moment of the gradient estimator. We have
We will continue in the next parts.
We continue the derivations for STEP 2:
Taking expectation by subsampling, we have
Thus, we have
STEP 3: Note that
Taking expectation by compression, we have
Taking expectation by subsampling, we have
We will continue in the next parts.
Taking expectation by subsampling, we have
Thus, we have
$
\mathbb{E}_{S_t, Q_t}\left[\frac{1}{M} \sum_{m=1}^M\left\Vert h_{t+1, m}-h_m^{\star}\right\Vert^2\right]= & \frac{(1-\alpha) C}{M^2} \sum_{m=1}^M\left\Vert h_{t, m}-h_m^{\star}\right\Vert ^2+\frac{\alpha C}{M^2} \sum_{m=1}^M\left\Vert g_{t, m}-h_m^{\star}\right\Vert^2 \\ & +\frac{M-C}{M} \mathbb{E}_{S_t, Q_t}\left[\frac{1}{M-C} \sum_{m \notin S_t}\left\Vert h_{t, m}-h_m^{\star}\right\Vert^2\right] \\ = & \frac{(1-\alpha) C}{M^2} \sum_{m=1}^M\left\Vert h_{t, m}-h_m^{\star}\right\Vert^2+\frac{\alpha C}{M^2} \sum_{m=1}^M\left\Vert g_{t, m}-h_m^{\star}\right\Vert^2 \\ & +\frac{M-C}{M} \frac{1}{M} \sum_{m=1}^M\left\Vert h_{t, m}-h_m^{\star}\right\Vert^2 \\ \leq & \left(1-\frac{\alpha C}{M}\right) \frac{1}{M} \sum_{m=1}^M\left\Vert h_{t, m}-h_m^{\star}\right\Vert^2 \\ & +\frac{2 \alpha L_{\max }^2 C}{M^2 n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2+\frac{4 L_{\max } \alpha C}{M^2} \sum_{m=1}^M D_{f_m}\left(x_t, x^{\star}\right)
$
STEP 4: Defining the Lyapunov function as follows,
we have
$
\mathbb{E}_{\mathcal{Q}, S_t}\left[\Psi_{t+1}\right] \leq & \left(1-\frac{\eta \mu}{2}\right)\left\Vert x_t-x^{\star}\right\Vert^2-\eta\left(1-4 L_{\max } \eta\right)\left(f\left(x_t\right)-f\left(x^{\star}\right)\right) \\ & +\eta L_{\max }\left(1+4\left(1+\frac{\omega}{C}\right) L_{\max } \eta\right) \frac{1}{M n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2 \\ & +\frac{2 \eta^2 \omega}{C} \frac{1}{M} \sum_{m=1}^M\left\Vert \nabla f_m\left(x_t\right)-h_{t, m}\right\Vert^2+\frac{2 \eta^2(M-C)}{C(M-1) M} \sum_{m=1}^M\left\Vert h_m^{\star}\right\Vert^2 \\ & +\left(1-\frac{\alpha C}{M}\right) \frac{A}{M} \sum_{m=1}^M\left\Vert h_{t, m}-h_m^{\star}\right\Vert^2 \\ & +\frac{2 \alpha L_{\max }^2 A C}{M^2 n} \sum_{m=1}^M \sum_{i=0}^{n-1}\left\Vert x_{t, m}^i-x_t\right\Vert^2+\frac{4 L_{\max } \alpha A C}{M}\left(f\left(x_t\right)-f\left(x^{\star}\right)\right)
$
Setting A = and using Lemma F.3, we have
$
\mathbb{E}\left[\Psi_{t+1}\right] \leq & \left(1-\frac{\eta \mu}{2}\right) \mathbb{E}\left[\left\Vert x_t-x^{\star}\right\Vert^2\right]+\left(1-\frac{\alpha C}{M}+\frac{4 \omega}{\lambda C}\right) \frac{\lambda \eta^2}{M} \sum_{m=1}^M \mathbb{E}\left[\left\Vert h_{t, m}-h_m^{\star}\right\Vert^2\right] \\\\ & -\eta\left(1-8 \eta L_{\max }\left(1+\frac{\omega}{C}\right)-4 \eta L_{\max } \alpha \lambda \frac{C}{M}\right) \mathbb{E}\left[f\left(x_t\right)-f\left(x^{\star}\right)\right] \\\\ & +8 \gamma^2 n^2 L_{\max }^2 \eta\left(1+4 \eta L_{\max }\left(1+\frac{\omega}{C}\right)+2 \eta L_{\max } \alpha \lambda \frac{C}{M}\right) \mathbb{E}\left[f\left(x_t\right)-f\left(x^{\star}\right)\right] \\\\ & +2 \gamma^2 n^2 L_{\max }^2 \eta\left(1+4 \eta L_{\max }\left(1+\frac{\omega}{C}\right)+2 \eta L_{\max } \alpha \lambda \frac{C}{M}\right)\left(\frac{1}{M} \sum_{m=1}^M \sigma_{\star, m}^2+n \sigma_{\star}^2\right) \\\\ & +\frac{2 \eta^2(M-C)}{C(M-1)} \sigma_{\star}^2.
$
Selecting the parameters as above and applying the previous steps, we obtain
$
\mathbb{E}\left[\Psi_{t+1}\right] \leq & \left(1-\frac{\eta \mu}{2}\right) \mathbb{E}\left[\Psi_t\right]+3 \gamma^2 n^2 L_{\max }^2 \eta\left(\frac{1}{M} \sum_{m=1}^M \sigma_{\star, m}^2+n \sigma_{\star}^2\right)+\frac{2 \eta^2(M-C)}{C(M-1)} \sigma_{\star}^2 \\\\ & -\eta\left(\frac{1}{2}-10 \gamma^2 n^2 L_{\max }^2\right) \mathbb{E}\left[f\left(x_t\right)-f\left(x^{\star}\right)\right] \\\\ \leq & \left(1-\frac{\eta \mu}{2}\right) \mathbb{E}\left[\Psi_t\right]+3 \gamma^2 n^2 L_{\max }^2 \eta\left(\frac{1}{M} \sum_{m=1}^M \sigma_{\star, m}^2+n \sigma_{\star}^2\right)+\frac{2 \eta^2(M-C)}{C(M-1)} \sigma_{\star}^2,
$
Unrolling the recursion, we obtain the final result.
Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in the distributed training of machine learning models. This work points out that existing work only considers with-replacement sampling of stochastic gradients, while in practice stochastic methods based on without-replacement sampling perform better. This paper fills this gap in the literature by analyzing, for the first time, methods that combine gradient compression and without-replacement sampling.
Strengths
(1) This work introduces a unique combination of gradient compression and random reshuffling, which addresses a gap in the existing literature. This innovative approach aims to improve convergence rates and communication efficiency in distributed learning.
(2) This work summarizes previous algorithms while providing rigorous theoretical analysis and convergence proofs for the proposed new methods (Q-RR, DIANA-RR, Q-NASTYA, DIANA-NASTYA). This solid theoretical foundation helps in understanding the behavior and benefits of the new algorithms.
(3) The article is clearly structured, logically coherent, and developed step by step. The research is clearly presented.
Weaknesses
While the experiments conducted are thorough, they are limited to certain types of tasks (logistic regression and ResNet-18 on CIFAR-10). The generalizability of the results to other models and datasets is not fully explored, which could limit the broader applicability of the findings.
Q-NASTYA and DIANA-NASTYA algorithms are too dependent on hyperparameters, which requires additional computation and management, affecting the practical application and deployment of the algorithms, especially in large-scale distributed environments.
Questions
When tuning parameters (e.g., the step size), could you try a more efficient tuning scheme to increase the scalability of the algorithm? Are there plans to develop automated methods for tuning parameters, to reduce the complexity of manual tuning and increase the feasibility of the algorithm in practical applications?
Given the increased implementation complexity of the proposed algorithms such as DIANA-RR, are there any performance evaluations in practical applications, especially regarding how well they perform in large-scale distributed systems?
Limitations
The algorithms in this work rely on many parameters, but lack corresponding experimental evidence. For example, DIANA-RR uses multiple shift vectors ($n$ of them), while DIANA uses only one shift vector, and there are no experimental results showing the effect of the number of shift vectors on performance. Likewise, there are no detailed experimental results showing the effect of different compression levels on the performance of the algorithms.
We thank the reviewer for the detailed feedback and a very positive evaluation of our work.
While the experiments conducted are thorough, they are limited to certain types of tasks (logistic regression and ResNet-18 on CIFAR-10). The generalizability of the results to other models and datasets is not fully explored, which could limit the broader applicability of the findings.
Thank you for your comment! We will try to do additional experiments on other datasets and models.
Q-NASTYA and DIANA-NASTYA algorithms are too dependent on hyperparameters, which requires additional computation and management, affecting the practical application and deployment of the algorithms, especially in large-scale distributed environments.
When tuning parameters (e.g. step size), could you try a more efficient way of tuning the parameters to increase the scalability of the algorithm? Whether there are plans to develop automated methods for tuning parameters to reduce the complexity of manual tuning and increase the feasibility of the algorithm in practical applications?
Thank you for your comment! We agree that dependence on hyperparameters is a limitation of our methods (though many methods in the field have the same limitation). We will think about adaptive versions of the proposed methods (and the incorporation of recently proposed adaptive methods such as D-Adaptation, Prodigy, and Schedule-Free SGD). We leave this for future work. Regarding the tuning of the parameters of our methods in practice, standard schemes for tuning the stepsizes (e.g., search on a logarithmic scale) should work well.
Due to the increased implementation complexity of the algorithms proposed by DIANA-RR et al. are there any performance evaluations in practical applications, especially how well they perform in large-scale distributed systems?
We will do our best to run the methods with a larger number of workers. Then, we can measure the efficiency of the methods in terms of the number of communicated bits to achieve convergence w.r.t. different metrics (top-1 accuracy, train loss, gradient norm).
The algorithm in this work relies on many parameters, but lacks corresponding experimental proofs. For example, DIANA-RR uses multiple shift vectors (n of them), while DIANA uses only 1 shift vector, and there are no experimental results showing the effect of different number of shift vectors on the performance. There are no detailed experimental results showing the effect of different compression levels on the performance of the algorithm at different compression levels. etc.
We agree with the reviewer that all of these experiments are important and would strengthen the paper. We will do our best to include them in the final version. In particular, we conducted the experiment with a local shift for DIANA-RR and attached the results to the general rebuttal message. Our results indicate that the usage of multiple shifts is important for DIANA-RR, since otherwise, the method does not work better than DIANA.
Regarding the experiments with different compression levels, we expect the behaviour to be intuitive: the less we compress, the better all methods converge, and the smaller the impact of the shifts.
However, we also would like to emphasize that our work is primarily theoretical, and the experiments are mainly needed to illustrate and support our theoretical findings.
Thanks for addressing the questions and comments in the previous round.I also read the others' comments and remain positive for the rating.
This paper provides an analysis of methods with gradient compression and without-replacement sampling. Based on this analysis, the paper proposes several new algorithms for distributed optimization with compressed communication, including Q-RR and its variants Q-NASTYA, which compress gradient differences in the scenario of data random shuffling. Empirical results on logistic regression and ResNet-18 on CIFAR-10 show that the proposed algorithms outperform the baselines.
Strengths
-
This paper provides an analysis of methods with gradient compression and without-replacement sampling.
-
Based on this analysis, this paper proposes several new algorithms for distributed optimization with communication, including Q-RR and its variants Q-NASTYA, which compress gradient differences in the scenario of data random shuffling.
-
Empirical results on logistic regression and ResNet-18 on CIFAR-10 show that the proposed algorithms outperform the baselines.
Weaknesses
-
The idea of compressing gradient differences is actually not very novel. EF21 [1] proposed compressing gradient differences a long time ago. The algorithm of EF21 also looks very similar to DIANA-RR. I think DIANA-RR is basically a combination of DIANA and EF21. Unfortunately, I lost track of the huge number of variants of DIANA, so I'm not exactly sure whether the variant of DIANA + EF21 has already existed somewhere.
-
Since EF21 already uses compression of gradient differences, I think EF21 is a better baseline than QSGD. However, EF21 is not included in the experiments.
References:
[1] Richtárik, Peter et al. “EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback.” Neural Information Processing Systems (2021).
Questions
-
Regardless of DIANA itself, what exactly is the difference between the "compress gradient differences" proposed in this paper and the algorithm of EF21?
-
Could EF21 also be included in the experiments as a baseline?
Limitations
Except for the weakness mentioned above, the limitations of this paper is well discussed and addressed.
... including Q-RR and its variants Q-NASTYA, which compress gradient differences in the scenario of data random shuffling.
Just to clarify, Q-NASTYA is not a variant with compressed gradient differences. The variants that utilize compressed gradient differences are DIANA-RR and DIANA-NASTYA, as they are based on the DIANA technique, which specifically compresses gradient differences.
The algorithm of EF21 also looks very similar to DIANA-RR. I think DIANA-RR is basically a combination of DIANA and EF21. Unfortunately, I lost track of the huge number of variants of DIANA, so I'm not exactly sure whether the variant of DIANA + EF21 has already existed somewhere.
Regardless of DIANA itself, what exactly is the difference between the "compress gradient differences" proposed in this paper and the algorithm of EF21?
From these comments, we see that you have some questions related to DIANA, EF21, and our DIANA-RR. We explain the main features of those methods and emphasize the differences between them.
DIANA. The DIANA method was proposed by [Mishchenko et al., 2019] as a fix to QDGD (quantized distributed gradient descent), leading to linear convergence to the exact solution in the strongly convex case, in contrast to QDGD. As you can see, for QSGD there is a neighbourhood term corresponding to data heterogeneity in the complexity (see Table 1). The same phenomenon occurs for QDGD. This is why DIANA can be seen as a variance reduction technique for unbiased compression. Next, DIANA is designed for unbiased compression operators (see Assumption 1); this is crucial for the analysis. The third property of DIANA, which can be inferred from the complexity results, is that the convergence rate improves when the number of clients $M$ increases. In other words, the neighbourhood terms decrease as $M$ grows.
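The DIANA shift mechanism can be sketched in a few lines (our toy illustration, with a Rand-$k$ sparsifier standing in for the compressor $\mathcal{Q}$): compressing the difference $g - h$ and moving the shift $h$ toward $g$ makes the compression error vanish as $h \to g$:

```python
import numpy as np

def rand_k(x, k, rng):
    """Unbiased Rand-k compressor (a stand-in for Q, our assumption)."""
    q = np.zeros_like(x)
    idx = rng.choice(x.size, size=k, replace=False)
    q[idx] = x[idx] * (x.size / k)   # keep k random coordinates, rescale by d/k
    return q

rng = np.random.default_rng(2)
d, k = 10, 3
alpha = k / d                    # alpha = 1/(1+omega), the standard DIANA choice
h = np.zeros(d)                  # learnable shift
g = rng.standard_normal(d)       # a fixed gradient, for illustration
for _ in range(500):
    delta = rand_k(g - h, k, rng)   # only the compressed difference is communicated
    h = h + alpha * delta           # shift drifts toward g
print(np.linalg.norm(g - h))        # tiny: the variance from compression is removed
```

In the full method the gradient changes across iterations, but the same contraction of $\Vert g - h\Vert$ is what removes the compression-variance neighbourhood term.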
EF21. EF21 is an improved version of the well-known error feedback technique [Seide et al., 2014]. EF21 works for a wider class of compressors called biased compressors. However, its convergence rate does not improve when $M$ is large. Also, the gradient estimator in EF21 is biased, which brings another challenge for the convergence analysis. Moreover, a naive combination of EF21 with SGD requires the use of large batch sizes; see [1].
DIANA-RR. This new algorithm combines the DIANA technique and sampling without replacement (a.k.a. RR). However, the combination is not straightforward (we do not simply plug the RR estimator into DIANA) and requires several modifications: first of all, we do not wait until each client goes through the whole dataset to send information at the end of each epoch. Instead, we construct the algorithm so that each client sends a compressed gradient to the server at every step of the epoch, and the server makes a gradient step. To set up such an approach correctly, we equip each client with its own set of $n$ shift vectors, which differs from the original DIANA method.
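To make this structure concrete, here is a hypothetical single-client sketch of DIANA-RR's bookkeeping (compression omitted, so the transmitted delta equals the exact gradient difference; the toy objective, step sizes, and variable names are ours): one shift vector per sample and a fresh permutation each epoch:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, eta, alpha = 5, 4, 0.02, 0.5
B = rng.standard_normal((n, d))          # toy data: f_i(x) = 0.5 * ||x - b_i||^2
grad = lambda i, x: x - B[i]             # minimizer of (1/n) sum_i f_i is mean(b_i)
H = np.zeros((n, d))                     # n shift vectors, one per sample
x = np.zeros(d)
for epoch in range(300):
    perm = rng.permutation(n)            # random reshuffling: sampling w/o replacement
    for i in perm:
        g = grad(i, x)
        delta = g - H[i]                 # this difference is what would be compressed
        x = x - eta * (H[i] + delta)     # reduces to a plain gradient step here
        H[i] = H[i] + alpha * delta      # update the shift of sample i only
print(np.linalg.norm(x - B.mean(axis=0)))   # small: x approaches the minimizer
```

The per-sample shifts `H[i]` are the feature that distinguishes this from plugging the RR estimator into vanilla DIANA, which maintains a single shift per client.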
As one can see from the above explanation, DIANA, DIANA-RR, and EF21 are completely different methods, and DIANA-RR is not a combination of DIANA and EF21, as the reviewer claims.
[1] Fatkhulin et al. Momentum provably improves error feedback! NeurIPS 2024.
The idea of compressing gradient differences is actually not very novel. EF21 [1] proposed compressing gradient differences a long time ago.
We do not claim to be the first to propose using gradient differences to improve the performance of FL algorithms with communication compression. This was proposed in the DIANA paper, which is much older than EF21. One of the main messages of our work is that the naive combination of RR and gradient compression (i.e., Q-RR) yields no improvement over vanilla QSGD. Therefore, we combine two techniques, RR and DIANA, to improve Q-RR, but the combination is not trivial, as we explain above.
Since EF21 already uses compression of gradient differences, I think EF21 is a better baseline than QSGD. However, EF21 is not included in the experiments.
As we mentioned earlier, EF21 is a method designed for biased compression, which is not the focus of our current work. In this paper, we concentrate on unbiased compression techniques. We plan to explore biased compression in future research.
Could EF21 also be included in the experiments as a baseline?
Since we do not consider biased compression in this paper, we do not see a clear reason why we should add EF21 to the comparison. Our work is primarily theoretical, and the main goal of our experiments is to illustrate and support our theoretical findings. Thus, experiments with EF21 would be completely unrelated.
I thank the authors for their feedback. However, it seems that the authors misunderstood my major concern. Note that I didn't ask the authors to extend their theory or experiments to biased compressors. On the contrary, I simply think that EF21 could also be applied to unbiased compressors, as I quote from the EF21 paper: "It is well known that, in a certain sense, the latter class contains the former." where the "latter class" is biased compressors and the "former" is unbiased compressors. Also, the appendix of the EF21 paper contains some theoretical analysis of applying the rand-k compressor to EF21. Then, considering that EF21 is also a technique that compresses gradient differences, and given its similarity to DIANA-RR (I mean, if you remove Line 8 of Algorithm 2, then it looks exactly the same as EF21), it seems to be a better choice as a baseline in the experiments compared to QSGD. After all, QSGD is too naive and simple as a baseline. Furthermore, since the experiments of the EF21 paper have a very similar setting to this paper (logistic regression and CIFAR-10), I don't see any problem why the algorithms proposed in this paper could not be compared to EF21 empirically. Since my concerns about the experiment baselines are mostly unresolved, I tend to keep the negative score.
Dear Reviewer,
Thank you for your response and for providing further clarification! We now have a better understanding of your concern. We agree that EF21 can indeed be used with unbiased compressors, and in such scenarios, its convergence behavior would likely be similar to that of the DIANA method. In our work, we compared our method with DIANA, not just QSGD, recognizing the importance of this comparison.
However, since EF21 has some differences from DIANA, we acknowledge that an empirical comparison between our method and EF21 would be beneficial. We appreciate your suggestion and will include this comparison in the experimental section of the camera-ready version. This addition will provide a more comprehensive evaluation of our method.
Thank you again for your valuable feedback!
If we have addressed all the concerns you raised, would you be willing to consider increasing the score? If you have any additional questions, we would be happy to answer them.
Dear reviewer,
We also would like to further clarify why, in the submitted version of our work, we do not compare our methods with EF21. The key reason is that, since EF21 is designed to handle the larger class of biased (contractive) compressors, EF21 does not improve with the number of clients $n$. This is not a limitation of EF21 but rather a fundamental theoretical difference between the best-known results for unbiased and biased compressors. More precisely, for existing methods designed to work with unbiased compressors, such as QSGD and DIANA, and for the new methods proposed in our work (Q-RR, DIANA-RR, Q-NASTYA, DIANA-NASTYA), the main terms in the complexity bounds depend on the compression parameter $\omega$ as $\omega/n$, i.e., the complexity improves as $n$ grows. In contrast, the existing complexity bound for EF21 is proportional to $\omega$ and does not improve with the growth of $n$; see [1].
That is, for unbiased compressors, EF21 has worse theoretical rates than the methods considered and proposed in our paper. Since our work mainly focuses on theory, we did not compare our methods with EF21 either theoretically or empirically. Nevertheless, as we acknowledged in the previous message, we are committed to including a numerical comparison of the proposed methods with EF21 in the final version of our paper.
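The difference in scaling can be illustrated with a toy computation for the rand-k compressor (the formula omega = d/k - 1 is the standard variance parameter of rand-k; the factors below are schematic and are not the exact complexity bounds of either method):

```python
# Illustrative scaling only: for the unbiased rand-k sparsifier on d coordinates,
# omega = d/k - 1.  A DIANA/QSGD-style main term scales like 1 + omega/n and so
# improves with more clients n, whereas a bound proportional to omega does not.

def rand_k_omega(d: int, k: int) -> float:
    """Variance parameter of the unbiased rand-k compressor."""
    return d / k - 1.0

omega = rand_k_omega(1000, 10)   # omega = 99 for d = 1000, k = 10
for n in (1, 10, 100):
    print(n, 1 + omega / n, 1 + omega)
```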
We hope that we resolved your concern. If you agree that this is the case, then we kindly ask you to increase the score. If you have any additional questions, we kindly ask you to communicate them to us: we would be happy to answer them.
References
[1] Richtárik et al. "EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback". NeurIPS 2021
Dear Reviewer zJsy,
We hope that our previous responses have effectively clarified the issues you raised. In addition to responding to your comments directly, we have also provided detailed explanations regarding your hardware and implementation queries under the response to Reviewer X78L, including an explanation of why the presented neural network experiment took 25 days of pure execution time. We trust that our responses have clarified the technical aspects you were concerned about.
Moreover, in response to Reviewer BTLT’s request, we have developed new theoretical results for the Partial Participation case in the Q-NASTYA and DIANA-NASTYA methods. This addition is particularly significant because partial participation is a crucial aspect of Federated Learning, where not all clients can be involved in every round of aggregation. We hope that you find this contribution valuable and consider it when reviewing your score.
If we have satisfactorily addressed all the concerns you raised, we would greatly appreciate it if you could consider adjusting your score accordingly. Should you have any further questions or require additional clarification, we would be more than happy to provide any further details.
Best regards,
Authors
We thank the reviewers for their feedback and time. We addressed all the questions, comments, and concerns raised by the reviewers in separate messages.
Following Reviewer r1qn's question about the importance of using multiple shift vectors in DIANA-RR, we conducted additional experiments with a modification of DIANA-RR that uses just one shift vector per client (DIANA-RR-1S). The results can be found in the pdf file attached to this message. They indicate that using multiple shifts is important for DIANA-RR: otherwise, the method performs no better than DIANA.
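The role of the per-sample shifts can be sketched as follows (a simplified NumPy illustration, not the paper's exact pseudocode; `compress` is a rand-k sparsifier, and step sizes and the model update itself are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(v, k=2):
    """Unbiased rand-k sparsifier: keep k random coordinates, rescale by d/k."""
    idx = rng.choice(v.size, size=k, replace=False)
    out = np.zeros_like(v)
    out[idx] = v[idx] * (v.size / k)
    return out

def diana_rr_epoch(grads, shifts, alpha=0.5):
    """One reshuffled epoch with a separate shift per sample (DIANA-RR style).
    grads[i] stands for the gradient of sample i at the current iterate."""
    perm = rng.permutation(len(grads))           # sampling without replacement
    msgs = []
    for i in perm:
        delta = compress(grads[i] - shifts[i])   # compress the *difference*
        msgs.append(shifts[i] + delta)           # server reconstructs this estimate
        shifts[i] = shifts[i] + alpha * delta    # update only this sample's shift
    return msgs, shifts
```

In the one-shift ablation (DIANA-RR-1S), `shifts[i]` would be replaced by a single vector updated at every step, so consecutive samples in the permutation perturb each other's reference point, which is consistent with the weaker performance we observed.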
Dear Reviewers,
Thank you for your reviews, comments, and responses during the discussion. We would like to highlight that, in response to Reviewer BTLT’s request, we have provided new results for the Q-NASTYA and DIANA-NASTYA methods in the Partial Participation regime, where only a subset of clients participate in aggregation. These results are particularly important since, in Federated Learning applications, it is often not feasible to involve all clients in the aggregation process. We will incorporate these results into the camera-ready version.
Additionally, we have addressed various presentation issues in our rebuttals to all reviewers and will integrate these clarifications into the final version as well.
Please note that we conducted additional experiments for Reviewer r1qn. Due to the limited discussion period, we only provided experiments using the DIANA-RR method with one shift, but we plan to include more experiments as requested by several reviewers.
We also addressed the concerns raised by Reviewer zJsy. In particular, we provided further explanations regarding the use of EF21 as a baseline. Additionally, we shared all the technical details related to the choice of experimental configurations for the classical convolutional networks and clarified the reasons for their high computational cost.
As a result of resolving the issues mentioned above and providing new results, the average score has increased from 4.75 to 5.5. We are still open to further discussion and clarification. With this momentum, we hope that all issues can be resolved and that the paper will be recommended for acceptance.
Best regards,
Authors
In the context of communication-efficient distributed finite-sum optimization, the authors show that it is possible to prove that random reshuffling can converge faster than sampling with replacement, a fact long known empirically to practitioners. They first show that the naive combination of gradient quantization and random reshuffling (Q-RR) does not lead to a better upper bound than quantized SGD with sampling with replacement (Q-SGD). They then prove an improved bound for DIANA-RR, which combines the compression of gradient differences proposed in DIANA with Q-RR.
The paper is mostly clearly written, except that the phrase "compress gradient differences" in the title seems to have caused confusion with some reviewers, because the paper is not the first to propose this.
Some reviewers thought the paper was lacking novelty because the proposed method was a combination of DIANA with random reshuffling. The authors have responded well during the rebuttal and clarified that the combination is not at all trivial.
In terms of practical applicability, the method seems to require FP64 precision in its current form. This looks like a hurdle the authors need to address for the scalability and wider adoption of the method. The authors responded that the contribution of the paper is primarily theoretical.
As a theory paper, the biggest weakness to me is that it remains unclear whether variance reduction through DIANA is necessary, both in theory and in practice, especially when the model is over-parametrized. The story would have been more convincing if the authors could show a lower bound on the achievable error of the Q-RR method.
Overall, I feel the scope of the paper is fairly narrow (gradient quantization and random reshuffling; it is not clear whether the method can tolerate low numerical precision, or whether the DIANA variance-reduction technique is required when the model is over-parametrized), but it makes an interesting contribution.