PaperHub
Overall rating: 5.7/10 (Poster; 3 reviewers; min 4, max 7, std 1.2)
Individual ratings: 7, 4, 6
Confidence: 4.0 · Correctness: 3.0 · Contribution: 2.3 · Presentation: 2.7
NeurIPS 2024

Sketching for Distributed Deep Learning: A Sharper Analysis

Links: OpenReview · PDF
Submitted: 2024-05-15 · Updated: 2025-01-16
TL;DR

We provide a tighter analysis of sketching in distributed learning that eliminates the dimension dependence without imposing unrealistic restrictive assumptions in the distributed learning setup.

Abstract

Keywords
sketching, distributed learning, federated learning, optimization

Reviews and Discussion

Review (Rating: 7)

This paper considers the sketch-DL framework for distributed / federated learning studied by prior works such as Rothchild et al. and Song et al. The authors identify that in these works, the convergence either depends on the dimension (under rather minimal assumptions), which is unfavorable, or relies on stronger assumptions, such as the sketched gradient having heavy-hitter coordinates so that recovery via Top-$r$ suffices. The model studied in this paper is a deep, feedforward neural net with a 1-Lipschitz activation (say ReLU). They assume the loss function is PL and that the eigenvalues of the predictor Hessian are dominated by a few top eigenvalues. Under these assumptions, they manage to remove the linear dependence on the dimension for convergence from Song et al. They also perform experiments confirming that using sketching matrices such as CountSketch gives better performance than local Top-$k$ or FetchSGD due to Rothchild et al. without error feedback.
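For context, the linear sketch-and-desketch (sk-desk) pipeline discussed here can be illustrated with a minimal CountSketch example in NumPy; the function names and the single shared hash seed are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def make_countsketch(dim, buckets, seed=0):
    """Random bucket assignments and signs defining a CountSketch S: R^dim -> R^buckets."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, buckets, size=dim)       # bucket index for each coordinate
    s = rng.choice([-1.0, 1.0], size=dim)        # random sign for each coordinate
    return h, s

def sk(g, h, s, buckets):
    """Sketch a gradient vector g (a linear map)."""
    table = np.zeros(buckets)
    np.add.at(table, h, s * g)                   # table[h[i]] += s[i] * g[i]
    return table

def desk(table, h, s):
    """Unbiased de-sketch: estimate each coordinate from its bucket."""
    return s * table[h]

# Toy round: the server aggregates sketched client gradients, then de-sketches once.
dim, buckets, n_clients = 10_000, 1_000, 4
h, s = make_countsketch(dim, buckets)            # shared randomness across clients
client_grads = [np.random.randn(dim) for _ in range(n_clients)]
agg_sketch = sum(sk(g, h, s, buckets) for g in client_grads)   # linearity of sketching
g_hat = desk(agg_sketch, h, s) / n_clients       # estimate of the average gradient
print(np.linalg.norm(g_hat - np.mean(client_grads, axis=0)))
```

Because sk is linear, the server can aggregate sketches before a single desk call, which is the property the review highlights as making it easy to layer mechanisms such as DP on top.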

Strengths

This paper resolves a big issue from Song et al., where the convergence depends linearly on the dimension. The authors provide a tighter analysis and a sharper bound by examining a different set of assumptions that utilize information from the Hessian, going beyond first-order information. This is quite important: in some sense, this paper obtains the best of both worlds, since the algorithm it examines is the simplest form of applying linear sketching due to Song et al., into which one could easily integrate extra mechanisms such as DP. The new set of assumptions imposed in this paper is quite reasonable, and the paper also performs experiments confirming that simply applying sketches is enough to guarantee fast convergence.

Weaknesses

There are two main weaknesses of this paper.

  • Fair comparison with the analysis of Song et al. This paper makes 3 assumptions: 1. the loss function is PL, 2. the eigenvalues of the Hessian are dominated by a few top eigenvalues, and 3. a uniform upper bound on the $\ell_2$ norm of the gradient. For a fair comparison, the corresponding results in Song et al. assume 1. the loss function is strongly convex and 2. the loss function is smooth. Note that Song et al. do not need to assume a uniform upper bound on the size of all gradients, which is commonly criticized as an unrealistic assumption. Moreover, Song et al. provide a slew of results for convex and non-convex loss functions. In my opinion, this paper should provide a more comprehensive comparison with Song et al. For example, what is the crux of removing the linear dependence on the dimension? Is it due to PL, restricted strong smoothness (RSS), or the uniform bound on the gradient? I would assume it's due to the RSS assumption, but the authors should explicitly spell it out. Also, is it possible to remove the gradient upper bound assumption?

  • Experimental comparison with Rothchild et al. Figure 1 in the main body of the paper only compares the sketching algorithm with FetchSGD without error feedback, and hence obtains a better dimension-reduction ratio vs. accuracy, which is not surprising, as FetchSGD requires error-feedback correction to make the algorithm correct. In Figure 2 on page 23, the authors did compare with FetchSGD with error feedback, and the "correct" FetchSGD does have superior performance. I do understand the authors' intention here, as the vanilla FetchSGD is nonlinear and hence very hard to integrate with other pieces such as DP. From this perspective, the sketching algorithm in Song et al. slightly trades performance for a simpler and more general algorithm. Still, I don't think it's fair to put a figure in the main body comparing the sketching algorithm with a "wrong" version of FetchSGD. The authors could probably put Figure 2 in the main body and provide comprehensive justification on why one would expect FetchSGD to perform better and why the sketching algorithm could potentially yield more use cases in practice.

Questions

A few typos:

  • In References, citation [56] should be Sketching as a tool for numerical linear algebra instead of Computational advertising: Techniques for targeting relevant ads.

  • On line 592, end of page 15, what is the matrix $\|\cdot\|_2$ norm? Is that the spectral norm?

A few questions:

  • Can you extend the analysis to non-PL losses (say the loss is only convex, or even non-convex) similar to Song et al.?

  • I think it will be greatly beneficial to have a table comparing the set of assumptions posed in this paper and Song et al., together with different convergence results.

Limitations

Yes.

Author Response

We thank the reviewer for the suggestion. We will include a table comparing our results with [Song et al. 2023] in the final paper. Here, we provide a brief summary:

Q1. Comparison with [Song et al. 2023]: Before comparing, we briefly summarise the differences in our setup, assumptions, and results:

  1. Choice of $f$: While [Song et al. 2023] provides results for the $\operatorname{sk}$-$\operatorname{desk}$ framework for convex/non-convex functions $f$ satisfying $\beta$-smoothness, our work looks at the case where $f$ is an overparametrized neural network and exploits the RSS property of such networks.

  2. Assumption on the loss function $\mathcal{L}(\theta)$: [Song et al. 2023] provides results under various assumptions (convex, non-convex, strongly convex). In our case, motivated by recent works that show neural networks satisfy the PL condition [Zhou et al.], [Hardt et al.], we assume the loss function $\mathcal{L}$ satisfies the PL condition. As mentioned by the reviewer, our results can be compared with the strong-convexity results of [Song et al. 2023].

  3. Uniform bound on the gradient: Note that our analysis for the single-step case does not require a uniform bound on the gradient, showing that we remove the dimension dependence using RSS. For $K>1$ steps, under the PL condition, we require some additional assumptions to show convergence. Motivated by works showing neural losses satisfy the PL condition [Zhou et al.], [Hardt et al.], we have done the current analysis only for the PL case, leaving further extensions for future work.

Under our assumptions and setup (Section 3 in our paper), we provide a comparison with [Song et al. 2023] in Table 1 for $K=1$ local steps. We achieve an ambient-dimension-independent linear convergence under only the RSS assumption. We also provide additional results for the non-convex setting and show that we achieve dimension independence under this setting too. For $K>1$, we compare our results in Table 2.

| Loss function ($\mathcal{L}$) | Analysis | Iteration Complexity | Assumption |
|---|---|---|---|
| PL | Ours (Theorem 1) | $\mathcal{O}\left(\frac{(1+\varepsilon)^2 c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\left[1+\kappa(\varepsilon^2+2\varepsilon)\right]}{\mu(1-\varepsilon)^2}\log\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | RSS ($\kappa = o(1)$) |
| SC | [Song et al. 2023] (Theorem E.1) | $\mathcal{O}\left(\frac{\left(c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\right) p(1+\varepsilon)^2}{\mu(1-\varepsilon)^2}\log\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | $\beta$-smoothness |
| Non-convex | Ours (Theorem 2) | $\mathcal{O}\left(\frac{(1+\varepsilon)^2 c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\left[1+\kappa(\varepsilon^2+2\varepsilon)\right]}{(1-\varepsilon)^2}\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | RSS ($\kappa = o(1)$) |
| Non-convex | [Song et al. 2023] (Theorem E.3) | $\mathcal{O}\left(\frac{\left(c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\right) p(1+\varepsilon)^2}{(1-\varepsilon)^2}\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | $\beta$-smoothness |

Table 1: Comparison of iteration complexities and assumptions for $K=1$ local steps. $c_s, c_l$ are constants defined in Assumption 3.2 and $\varrho, c_H$ are defined in Lemma C.1 and Lemma C.2, respectively. Note that $\varrho$ and $c_H$ are $\mathcal{O}(\mathrm{poly}(L))$ and can be seen as independent of the ambient dimension for wide networks. Under our assumptions and setup (Section 3), according to Lemma C.4 the loss function $\mathcal{L}$ can be shown to be $\beta$-smooth with $\beta = c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}$. $\varepsilon$ is the sketching error and $\epsilon$ refers to the optimization error tolerance.

| Loss function ($\mathcal{L}$) | Analysis | Iteration Complexity | Assumption |
|---|---|---|---|
| PL | Ours (Theorem 2) | $\mathcal{O}\left(\frac{\varrho^2 + \frac{c_H}{\sqrt{m}} + \frac{\varepsilon\kappa c_H}{\sqrt{m}}}{\mu^2(1-\varepsilon)^2\epsilon}\log\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta^{*})}{\epsilon}\right)\right)$ | RSS, Bounded Gradient |
| SC | [Song et al. 2023] (Theorem F.10, $\sigma=0$) | $\mathcal{O}\left(\frac{p\beta K}{\mu}\log\left(\frac{\beta\,\mathbb{E}[\Vert\theta_0-\theta^{*}\Vert_2^2]}{\epsilon}\right)\right)$ | $\beta$-smoothness |

Table 2: Comparison of iteration complexities and assumptions for $K>1$ local steps. $c_s, c_l$ are constants defined in Assumption 3.2 and $\varrho, c_H$ are defined in Lemma C.1 and Lemma C.2, respectively. Note that $\varrho$ and $c_H$ are $\mathcal{O}(\mathrm{poly}(L))$ and can be seen as independent of the ambient dimension for wide networks.


Q2. Regarding Figure 2 Thank you for your valuable feedback and for understanding the intention behind our experimental design. We appreciate your insightful comments regarding the comparison between the sketching algorithm and FetchSGD. We will revise the main body of the paper to include Figure 2 and provide a comprehensive discussion on the performance comparison.


Q3. Typos: On line 592, end of page 15, the matrix norm is the spectral norm. We will add this to the notation for the final draft. We thank the reviewer for pointing out the citation error and the typos. We will update them in the revised manuscript.

[Zhou et al.] Characterization of gradient dominance and regularity conditions for neural networks, 2017. URL https://arxiv.org/abs/1710.06910.

[Hardt et al.] M. Hardt and T. Ma. Identity matters in deep learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=ryxB0Rtxx.

Comment

I thank the authors for answering my questions. In particular, I find it very interesting that you need the bounded-gradient assumption for $K>1$ while for $K=1$ it's not needed. It might be worth discussing this in the revision. I'm also happy to see the authors include a table comparing results with Song et al., as this is an important contribution of this paper. I'll raise my score to 7.

Review (Rating: 4)

This paper provides an ambient-dimension-independent convergence and communication-complexity analysis for sketching-based gradient descent algorithms using the second-order geometry of the loss Hessian.

Strengths

  • This paper proves that the dimension dependence comes from the global smoothness assumption and provides a convergence analysis of the sketching method that is independent of the ambient dimension.

  • The analysis of this paper does not require the heavy-hitter assumption and the Top-$r$ sparsification operation.

Weaknesses

  • This paper studies gradient descent (GD) and uses additional assumptions, limiting the application of the analytical results.

  • Comparison experiments are not reasonable, and the experimental results are roughly plotted.

Questions

  1. Why does this paper not study the more widely used SGD algorithm? This paper studies GD but compares it with FetchSGD in theory and experiment, which is not quite reasonable.

  2. The theoretical result for multiple local steps in Theorem 2 does not converge to 0 with the number of iterations, so why do the authors say that the result is similar to the sublinear convergence rate of SGD? Also, the authors should provide more explanatory notes on why the GD algorithm with multiple local steps behaves similarly to SGD.

  3. What is the notation $\mathbf{v}_0$ in Assumption 3.3? It is not defined in the text nor used elsewhere. Is the notation taken from the literature [28]?

  4. As the authors state, this paper does not make innovations to the algorithm. Therefore, the authors should focus on describing the theoretical contributions of this paper and should not demonstrate the superiority of the algorithm by unfair comparison with flawed algorithms (biased compression algorithms without error feedback).

  5. The authors did not provide code, nor did they describe the experimental setup in detail. The reviewer wonders if it is feasible to simulate 100 clients for full-batch gradient descent (i.e., a batch size of 500 per client) using a single NVIDIA Titan X GPU. Also, it is not reasonable to run the mini-batch based FetchSGD algorithm using the full batch setting.

  6. The symbolic notation and written presentation in lines 137 to 153 of this paper are very similar to that of literature [28, Sections 3 and 4].

  7. The references of the paper are badly formatted, and many of them do not contain information on publication conferences or journals, e.g. [57, 58, 59, 60, 61, 62, 63, 64, 65, 67, 70, 71, 76].

Limitations

Please refer to Weaknesses and Questions.

Author Response

We thank the reviewer for carefully reading our paper and providing constructive feedback. Below we provide answers to each of the concerns:

Q1: SGD vs. GD: First, we point out that our theoretical result in Theorem 1 must be compared to Theorem E.1 in [Song et al. 2023], which also studies GD. Further, in Section 3.2 of our paper we explicitly show that any analysis that uses strong smoothness will eventually pick up dimension dependence. Our aim in the paper is to show the inherent ambient-dimension dependence in previous sketching-based analyses ([Song et al. 2023]), which can be shown in the GD analysis and will also carry over to the SGD analysis. Our main result is to get rid of this dependence, and we provide a sharper analysis that depends on the sum of absolute values of the eigenvalues of the Hessian, which is potentially much smaller. Such an analysis can be extended to SGD under suitable assumptions.


Q2: On the use of SGD in the experimental section: We do not claim that we provide superior performance compared to FetchSGD. We would like to point out that [Rothchild et al. 2020] acknowledges in Appendix B.3 that a $\operatorname{sk}$-$\operatorname{desk}$-based framework is comparable in performance to the uncompressed version; however, they are unable to provide a theoretical analysis and thus use an additional Top-r sparsification step. We fill in this gap through our results. Additionally, the Top-r sparsification step in FetchSGD makes their algorithm non-linear. On the other hand, our experiments show that the $\operatorname{sk}$-$\operatorname{desk}$ framework achieves comparable performance with FetchSGD (Figure 2 in the Appendix) while retaining linearity, which allows easy integration with Differential Privacy.


Q3: Similarity to SGD: We provide the theoretical result for the $K$-step case with a constant learning rate. As a result, there is a term that decreases exponentially fast and a variance term that depends on the choice of learning rate. Hence, our result is similar to the analysis of SGD with a constant learning rate under gradient noise. In the analysis of Theorem 2 in our paper, by choosing a decaying learning rate $\eta_t = \mathcal{O}(1/T)$, as is common in SGD analysis, we can show that the result converges to $0$, i.e. $\mathcal{L}(\theta_T) - \mathcal{L}(\theta^{*}) = \mathcal{O}(1/T)$, and achieve sublinear convergence rates, as is standard in the literature [Karimi et al.]. We thank the reviewer for pointing this out and we will add a remark clarifying this in the final draft.
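For concreteness, the behaviour described above can be summarised by a schematic recursion; the generic residual constant $c$ below is illustrative and not the paper's exact bound. Writing $\delta_t := \mathcal{L}(\theta_t) - \mathcal{L}(\theta^{*})$, a PL contraction with a per-round drift residual gives

$$\delta_{t+1} \le (1-\eta\mu)\,\delta_t + c\,\eta^2 \quad\Longrightarrow\quad \delta_T \le (1-\eta\mu)^T\,\delta_0 + c\,\eta^2\sum_{j=0}^{T-1}(1-\eta\mu)^j \le (1-\eta\mu)^T\,\delta_0 + \frac{c\,\eta}{\mu},$$

i.e. an exponentially decaying term plus an $\eta$-dependent floor, which shrinks as the (constant or decaying) step size is reduced.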


Q4. Notation in Assumption 3.3: Yes, the notation is taken from the literature [Banerjee et al.], [Liu et al.]. $\theta_t$ refers to the parameter vector $\theta$ (as in (3) in our paper) at iteration $t$, and thus $\theta_0$ refers to the weights at initialization. We will add a clarifying remark in the final draft.


Q5. Clarification on the experimental setup: We thank the reviewer for pointing out the confusion around the computing requirement for the experiments. We managed to run our experiments on one GPU by (1) accumulating the model updates on the fly once we compute the model update of a client and (2) moving the model updates and/or accumulated updates from the GPU to the CPU. Accumulating the updates on the fly, instead of computing the updates in parallel and accumulating them all together, is only for the sake of compute efficiency due to the compute constraints of our research environment. Rest assured, this modification does not lead to any change in the empirical results. However, to clarify any confusion around the computing needs, we will add this explanation to the revised paper.
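A minimal sketch of the sequential, on-the-fly accumulation described above (the function and variable names are hypothetical, not the authors' code):

```python
import numpy as np

def simulate_round(global_params, client_datasets, local_update, sketch_fn):
    """Process clients one at a time on a single GPU, accumulating sketched updates on the fly."""
    acc = None
    for data in client_datasets:
        upd = local_update(global_params, data)   # this client's full-batch model update
        skd = sketch_fn(upd)                      # compress immediately
        acc = skd if acc is None else acc + skd   # running sum; valid because sketching is linear
        del upd, skd                              # free memory before the next client
    return acc / len(client_datasets)             # sketched average update for the server step
```

Because the sketch is a linear map, summing sketched updates one by one yields the same aggregate as sketching the sum of all updates, so the sequential schedule changes only memory usage, not the result.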


Q6. Symbolic notation and citation errors: We indeed re-use some of the results and notation from [Banerjee et al.], [Liu et al.] in lines 137 to 153; we have stated this explicitly in line 589. Thank you for pointing out the errors in the citations; we will add publication details to the references in the final draft.

[Karimi et al.] Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition, 2020. URL https://arxiv.org/abs/1608.04636.

[Banerjee et al.] Restricted strong convexity of deep learning models with smooth activations. In The Eleventh International Conference on Learning Representations, 2023.

[Liu et al.] On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33:15954–15964, 2020.

Comment

Thanks to the authors for their responses. However, they did not exactly answer my questions, for example

  • Why does the convergence result of GD not converge to zero with the number of iterations? Even with a constant learning rate. [Question 2]
  • What is the notation $\mathbf{v}_0$ in Assumption 3.3? [Question 3]
  • Is the comparison in Figure 1 from the main paper reasonable? [Question 4]

Moreover, I believe that it is very inappropriate to directly use large portions from other papers (lines 137 to 153). [Question 6]

Therefore, I will retain the original assessment.

Comment
  1. Convergence result of GD (Question 2): The final result in Theorem 2 in our paper has been stated in a form similar to Theorem 5.1 in [Song et al. 2023]. Note that our result can be made arbitrarily small, either by choosing a small constant step size or a decreasing step size. We provide one such step-size choice for the communication-efficiency comparison (Theorem 3 in our paper) to make the final result $\mathcal{L}(\theta_T) - \mathcal{L}(\theta^{*}) \le \epsilon$, and the statement for the decreasing step size is in the rebuttal provided above (Question 3).

    Explanation for the residual term in the K-step analysis: In $K$-local-step GD, note that the average of the local gradients is no longer the true gradient (as in the $1$-local-step analysis). To see this, each client performs multiple GD steps on its individual loss function. Following the notation in our paper, $\theta_{c,t,k} = \theta_{c,t,k-1} - \eta_{\text{local}}\nabla \mathcal{L}_c(\theta_{c,t,k-1})\ \forall k=1,\ldots,K$. This causes a drift in the local parameters $\theta_{c,t,k}$ at each of the clients, and thus $K$-local-step GD is not the same as performing GD on $\mathcal{L}$ (see the toy sketch after this list). Under the bounded-gradient assumption and Lipschitz smoothness of the objective function, we can bound this drift, which leads to the residual term in each iteration (line 652, eq. 80 in our paper, and eq. 20 in [Song et al. 2023]). Accumulating the residual over all $T$ iterations leads to the final result, which depends on $\eta$.


  2. Notation in Assumption 3.3 (Question 3): We thank the reviewer for pointing this out. We will rectify this in the main draft. For clarification, following eq. (3) in our paper,

$$\theta := \big((\vec{W}^{(1)})^\top,\ldots,(\vec{W}^{(L)})^\top, \mathbf{v}^\top\big)^\top \in \mathbb{R}^p.$$

The iterate at iteration $t$ is given as

$$\theta_t := \big((\vec{W}_t^{(1)})^\top,\ldots,(\vec{W}_t^{(L)})^\top, \mathbf{v}_t^\top\big)^\top \in \mathbb{R}^p.$$

Following this, the iterate at initialization $\theta_0$ is given as

$$\theta_0 := \big((\vec{W}_0^{(1)})^\top,\ldots,(\vec{W}_0^{(L)})^\top, \mathbf{v}_0^\top\big)^\top \in \mathbb{R}^p.$$

Assumption 3.3 defines the initialization parameters $\vec{W}_0^{(l)}\ \forall l \in L$ and $\mathbf{v}_0$.


  3. Running FetchSGD with full batch (Question 4): Since we provide an analysis for GD, we run FetchSGD in the full-batch setting for a fair comparison. We recognize that FetchSGD is typically designed for mini-batch training, but given that the theoretical framework in [Rothchild et al. 2020] can be extended to GD under similar assumptions, we believe it is appropriate to compare FetchSGD with our algorithm under a full-batch setting. That said, we would be grateful if the reviewer could elaborate on their concerns regarding the use of FetchSGD in a full-batch setting. This would help us better understand and address any issues in our revised manuscript. A goal of our work is to show that the Top-r sparsification and error-feedback mechanism is not required for the analysis of sketched distributed learning algorithms, and thus we show the comparison with FetchSGD. Since [Rothchild et al. 2020] are the first to propose using sketching for distributed learning, we refer to the Top-r + Sketching version of the $\operatorname{sk}$-$\operatorname{desk}$ algorithm as FetchSGD without Error Feedback in Figure 1. We agree with the reviewer that error feedback is required for biased compression algorithms, and we will revise the main body of the paper to include Figure 2 and provide a more comprehensive discussion on the performance comparison.

  4. Usage of notation and assumptions from previous papers (Question 6): We have cited [28] right at line 137, before using the notation. However, based on the reviewer's feedback, we will make this more explicit in the subsequent draft. Note that this is only the mathematical setup and assumptions, and therefore they would be the same across papers that use the same model and assumptions. This is only generic notation that we are re-purposing for consistency, and we do not claim any novelty here. We are using the notation from previous papers to keep everything streamlined, and we have cited any results that we use appropriately.
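As referenced in item 1 above, here is a toy numerical sketch of the client-drift effect for $K>1$ local steps; the per-client quadratic losses and all names are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Two clients with quadratic losses L_c(theta) = 0.5 * a[c] * (theta - b[c])^2,
# so the global loss is their average and its gradient at theta is mean(a * (theta - b)).
a = np.array([1.0, 4.0])        # per-client curvatures
b = np.array([0.0, 1.0])        # per-client minimizers
eta, K, theta = 0.1, 5, 2.0     # local step size, number of local steps, shared global iterate

def local_update(theta0, a_c, b_c):
    """Run K local GD steps on one client's loss and return the accumulated movement."""
    th = theta0
    for _ in range(K):
        th -= eta * a_c * (th - b_c)
    return theta0 - th

avg_local = np.mean([local_update(theta, a[c], b[c]) for c in range(2)])
global_equiv = eta * K * np.mean(a * (theta - b))   # K steps' worth of the true global gradient at theta
print(avg_local, global_equiv)  # the two differ; the gap is the drift residual bounded in the analysis
```

With the bounded-gradient assumption, this per-round gap can be bounded, which is exactly the residual term accumulated over $T$ iterations in the response above.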

[Song et al. 2023] Z. Song, Y. Wang, Z. Yu, and L. Zhang. Sketching for first order method: Efficient algorithm for low-bandwidth channel and vulnerability. In International Conference on Machine Learning, pages 32365–32417. PMLR, 2023.

[Rothchild et al. 2020] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora. FetchSGD: Communication-efficient federated learning with sketching. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8253–8265. PMLR, 13–18 Jul 2020.

Review (Rating: 6)

This paper presents a novel analysis of sketching-based distributed learning algorithms for deep neural networks. The authors provide a tighter, dimension-independent convergence analysis by leveraging the restricted strong smoothness (RSS) property of deep learning loss functions. This work addresses a gap between existing theoretical analyses (which suggest dimension dependence) and empirical results (which show competitive performance without such dependence). The paper provides theoretical guarantees, improved communication complexity bounds, and empirical validation of their approach.

Strengths

  1. The paper presents the first dimension-independent convergence analysis for sketching in distributed deep learning without requiring restrictive assumptions like heavy-tailed distributions or top-r sparsification.
  2. The work bridges a gap between theory and practice, explaining why sketching-based methods perform well empirically despite previous pessimistic theoretical bounds. The authors present experimental results that support their theoretical findings, demonstrating the effectiveness of their approach on real-world datasets.
  3. The paper covers both single-step and multi-step local update scenarios, providing a thorough theoretical treatment.

Weaknesses

  1. While the paper removes some restrictive assumptions from previous work, it still relies on certain assumptions about the loss function (e.g., PL condition) and the eigenvalue distribution of the Hessian. The robustness of the results to violations of these assumptions could be discussed further.
  2. While the paper provides a thorough theoretical analysis, more discussion on the practical aspects of implementing the proposed approach in large-scale distributed learning systems could be beneficial.

Questions

  1. Could you provide insights on how your approach might perform on other types of deep learning tasks beyond image classification (e.g., natural language processing or reinforcement learning)?
  2. What are the main challenges in implementing your approach in practical, large-scale distributed learning systems?

Limitations

Yes.

Author Response

We are thankful to the reviewer for the suggestions and comments. Below we provide answers to each of the concerns:

Q1. Robustness to violation of the assumption on eigenvalues: Under the assumption on the eigenvalues, i.e. $\kappa = o(1)$, we provide a dimension-independent convergence rate. However, our results hold for any $\kappa$, and thus even for a slower eigendecay such that $\kappa = \mathcal{O}(\sqrt{p})$, our analysis using second-order structure shows that $b = \mathcal{O}(\sqrt{p})$ should suffice to achieve faster convergence. To see why this is true, recall from Theorem D.1 in our paper that the sketching dimension is $b = \tilde{\mathcal{O}}(1/\varepsilon^2)$; thus setting $b = \mathcal{O}(\sqrt{p})$ gives $\varepsilon = \tilde{\mathcal{O}}(1/p^{1/4})$. As a result, the leading terms involving $\kappa$ in the expression for the learning rate $\eta = \frac{(1-\varepsilon)}{(1+\varepsilon)^2 c_s\varrho^2 + \frac{c_H c_l}{\sqrt{m}}(1 + \kappa(2\varepsilon+\varepsilon^2))}$ are $\frac{\kappa\varepsilon}{\sqrt{m}} = \mathcal{O}(L^{1/4})$, which is quite small compared to the ambient dimension $p = m^2 L$ (for width $m$ and $L$ layers), leading to a faster convergence rate. In contrast, the results from [Song et al. 2023] have learning rate $\eta = \mathcal{O}(b/p)$, which needs $b = \mathcal{O}(p)$ for dimension-independent convergence. As a result, our results provide a better interpolation of the convergence rate under assumption violations than [Song et al. 2023], which only provides the blanket pessimistic upper bound of $\mathcal{O}(b/p)$. Furthermore, we present our results on the decay of eigenvalues across a variety of models/datasets in Fig. 1 and Fig. 2 in the attached PDF (in the general response), showing robustness to architecture and dataset choices.
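Spelling out the arithmetic behind the $\mathcal{O}(L^{1/4})$ claim above, using $p = m^2 L$ and the relation $b = \tilde{\mathcal{O}}(1/\varepsilon^2)$ quoted from Theorem D.1:

$$b = \Theta(\sqrt{p}) \;\Rightarrow\; \varepsilon = \tilde{\mathcal{O}}(p^{-1/4}), \qquad \frac{\kappa\,\varepsilon}{\sqrt{m}} = \tilde{\mathcal{O}}\!\left(\frac{\sqrt{p}\cdot p^{-1/4}}{\sqrt{m}}\right) = \tilde{\mathcal{O}}\!\left(\frac{(m^2 L)^{1/4}}{\sqrt{m}}\right) = \tilde{\mathcal{O}}(L^{1/4}),$$

so the $\kappa$-dependent term in the step size stays small even under the slower $\kappa = \mathcal{O}(\sqrt{p})$ eigendecay.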


Q2. Robustness to violation of the assumption on the PL condition: Our dimension-independent results depend on the restricted strong smoothness (RSS) property of overparametrized neural networks and are not due to the assumption of the PL condition. In Table 1, we provide dimension-independent results even for non-convex $\mathcal{L}$.


| Loss function ($\mathcal{L}$) | Analysis | Iteration Complexity | Assumption |
|---|---|---|---|
| PL | Ours (Theorem 1) | $\mathcal{O}\left(\frac{(1+\varepsilon)^2 c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\left[1+\kappa(\varepsilon^2+2\varepsilon)\right]}{\mu(1-\varepsilon)^2}\log\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | RSS ($\kappa = o(1)$) |
| SC | [Song et al. 2023] (Theorem E.1) | $\mathcal{O}\left(\frac{\left(c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\right) p(1+\varepsilon)^2}{\mu(1-\varepsilon)^2}\log\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | $\beta$-smoothness |
| Non-convex | Ours (Theorem 2) | $\mathcal{O}\left(\frac{(1+\varepsilon)^2 c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\left[1+\kappa(\varepsilon^2+2\varepsilon)\right]}{(1-\varepsilon)^2}\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | RSS ($\kappa = o(1)$) |
| Non-convex | [Song et al. 2023] (Theorem E.3) | $\mathcal{O}\left(\frac{\left(c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}\right) p(1+\varepsilon)^2}{(1-\varepsilon)^2}\left(\frac{\mathcal{L}(\theta_0)-\mathcal{L}(\theta_{*})}{\epsilon}\right)\right)$ | $\beta$-smoothness |

Table 1: Comparison of iteration complexities and assumptions for $K=1$ local steps. $c_s, c_l$ are constants defined in Assumption 3.2 and $\varrho, c_H$ are defined in Lemma C.1 and Lemma C.2 in our paper, respectively. Note that $\varrho$ and $c_H$ are $\mathcal{O}(\mathrm{poly}(L))$ and can be seen as independent of the ambient dimension for wide networks. Under our assumptions and setup (Section 3), according to Lemma C.4 the loss function $\mathcal{L}$ can be shown to be $\beta$-smooth with $\beta = c_s\varrho^2 + \frac{c_l c_H}{\sqrt{m}}$. $\varepsilon$ is the sketching error and $\epsilon$ refers to the optimization error tolerance.


Q3. Performance on tasks beyond image classification and practical challenges:

While we focus on the theoretical analysis of sketching methods, past works [Rothchild et al. 2020], [Melas-Kyriazi et al. 2022] provide extensive empirical results comparing sketching-based approaches across the vision and language domains. For practical purposes, the large memory footprint of generating and storing large sketch tables and the fast computation of sketches are a challenge. Also, due to the high variance of the estimator $R^{\top}R$ (Lemma B.1 in our paper), it is important to apply some form of variance reduction, which can be explored in future work. A practical design decision is trading off this overhead for reduced communication requirements and the downstream benefit of a differential privacy implementation.

[Song et al. 2023] Z. Song, Y. Wang, Z. Yu, and L. Zhang. Sketching for first order method: Efficient algorithm for low-bandwidth channel and vulnerability. In International Conference on Machine Learning, pages 32365–32417. PMLR, 2023.

[Rothchild et al. 2020] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora. FetchSGD: Communication-efficient federated learning with sketching. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8253–8265. PMLR, 13–18 Jul 2020.

[Melas-Kyriazi et al. 2022] L. Melas-Kyriazi and F. Wang. Intrinsic gradient compression for scalable and efficient federated learning. In B. Y. Lin, C. He, C. Xie, F. Mireshghallah, N. Mehrabi, T. Li, M. Soltanolkotabi, and X. Ren, editors, Proceedings of the First Workshop on Federated Learning for Natural Language Processing (FL4NLP 2022), pages 27–41, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.fl4nlp-1.4.

Comment

Thank you for your response on the questions! After reading the other reviews, I would like to keep my current score of 6.

Author Response

General Response:

We thank the reviewers for their positive comments about our work and their efforts in providing meaningful feedback. In particular, we thank Reviewer LRSh for stating that our results are quite important, resolve a big issue from previous work under reasonable assumptions, and have meaningful implications for downstream works in Differential Privacy; Reviewer wSrY for finding that our work is thorough and addresses a gap between theory and practice; and lastly, Reviewer E7W7 for pointing out that our work gets rid of unrealistic assumptions.

We have addressed the questions and comments of reviewers in individual responses. Below we provide a brief description of the contributions and the novel aspects of our work:

  1. We identify a gap between the theory and practice of sketching methods for distributed learning -- sketching methods incur a dependence on the ambient dimension in the convergence rate [Song et al. 2023], while in practice they are observed to be on par with their uncompressed counterpart (Appendix B.3 of [Rothchild et al. 2020]).

  2. We identify and demonstrate the widely used global smoothness assumption on loss functions as the root cause of the dimension dependence. To alleviate this dimension dependence, [Rothchild et al. 2020] make assumptions on the heavy-tailed nature of the gradient distribution and perform additional Top-r sparsification, resulting in a biased algorithm.

  3. We get rid of this dimension dependence while avoiding these assumptions by exploiting the second-order structure of loss Hessians and the fast decay of Hessian eigenvalues [Ghorbani et al.], [Zhang et al.], [Xie et al.]. Our results instead depend on the sum of the absolute values of the eigenvalues of the Hessian, which is potentially much smaller. We also provide empirical evidence verifying this claim.

[Song et al. 2023] Z. Song, Y. Wang, Z. Yu, and L. Zhang. Sketching for first order method: Efficient algorithm for low-bandwidth channel and vulnerability. In International Conference on Machine Learning, pages 32365–32417. PMLR, 2023.

[Rothchild et al. 2020] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora. FetchSGD: Communication-efficient federated learning with sketching. In H. Daumé III and A. Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 8253–8265. PMLR, 13–18 Jul 2020.

[Ghorbani et al.] An investigation into neural net optimization via Hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.

[Zhang et al.] Y. Zhang, C. Chen, T. Ding, Z. Li, R. Sun, and Z.-Q. Luo. Why transformers need Adam: A Hessian perspective. arXiv preprint arXiv:2402.16788, 2024.

[Xie et al.] Z. Xie, Q.-Y. Tang, Y. Cai, M. Sun, and P. Li. On the power-law Hessian spectrums in deep learning, 2022. URL https://arxiv.org/abs/2201.13011.

Final Decision

The paper considers distributed learning with compressed communication of updates. It gives novel theoretical insights when sketching is used as the compression mechanism, under realistic assumptions, in particular studying the dimension dependence. Some concerns remained that the baseline algorithms in the experiments were weak (they should be changed to include methods using error feedback to ensure convergence). Nevertheless, the overall positive assessment remained, and we hope the authors will incorporate the feedback mentioned by the reviewers.