4.3

/10

Rejected4 位审稿人

最低3最高6标准差1.3

3.5

置信度

ICLR 2024

Federated Learning Under Second-Order Data Heterogeneity

Konstantin Mishchenko,Rui Li,Hongxiang Fan,Stylianos Venieris

OpenReview PDF

提交: 2023-09-23更新: 2024-02-11

TL;DR

We study the problem of nonconvex federated optimization under second-order data heterogeneity (bounded Hessian differences) and present an algorithm called SABER

摘要

We consider the problem of Federated Learning over clients with heterogeneous data. We propose an algorithm called SABER that samples a subset of clients and tasks each client with its own local subproblem. SABER provably reduces client drift by incorporating an estimate of the global update direction and regularization into each client's subproblem. Under second-order data heterogeneity with parameter $\delta$, we prove that the method's communication complexity for nonconvex problems is $O\left(\delta\varepsilon^2\sqrt{M}\right)$. In addition, for problems satisfying $\mu$-Polyak-Lojasiewicz condition, the method converges linearly with communication complexity of $O\left(\left(\frac{\delta}{\mu}\sqrt{M} + M\right)\log\frac{1}{\varepsilon}\right)$. To showcase the empirical performance of our method, we compare it to standard baselines such as FedAvg on a few empirical problems.

关键词

federated learningdata heterogeneityoptimizationtheory

评审与讨论

审稿意见

评分: 3置信度: 42023-10-16

This paper proposes a federated learning algorithm called SABER that seeks to mitigate client drift by making each client optimize a modified local function that is the sum of the actual local function and a bias correction term like SCAFFOLD (Karimireddy et al., 2020) and a prox term like FedProx (Li et al., 2020a). Under second-order heterogeneity with parameter $\delta$ (Assumption 1 in the paper), SABER is shown to converge to an $\varepsilon$ -stationary point in $\mathcal{O}(\frac{\sqrt{M} \delta}{\varepsilon^2})$ rounds of communication for smooth non-convex functions (the paper mistakenly reports the communication complexity as $\mathcal{O}(\sqrt{M} \delta \varepsilon^2)$ throughout). In addition, under the $\mu$ -PL assumption, the complexity improves to $\mathcal{O}((\frac{\delta}{\mu}\sqrt{M} + M) \log \frac{1}{\varepsilon})$ . Empirical results on CIFAR-10 and FEMNIST show the benefits of SABER compared to some baseline algorithms.

优点

This paper proposes a way to address the important issue of client drift in federated learning. Specifically, it maintains one control variate for all the clients which is efficient.
Sharp $\mathcal{O}(\frac{\sqrt{M}}{\varepsilon^{2}})$ complexity in the smooth non-convex case (although this result is under the strong assumption of $w_{k+1} = \text{argmin } \phi_k(w)$ as I have mentioned in Weakness #1).
The second-order heterogeneity assumption is well motivated.
Appreciable improvement over FedAvg, FedProx and SCAFFOLD (although I have some reservations about these results which I have described in Weakness #4).

缺点

The authors mistakenly report the communication complexity as $\mathcal{O}(\sqrt{M} \delta \varepsilon^2)$ for the smooth non-convex case; I guess it should be $\mathcal{O}(\frac{\sqrt{M} \delta}{\varepsilon^2})$ .

Weaknesses/Concerns:

1. The definition of $w_{k+1} \approx \text{argmin } \phi_k(w)$ in Algorithm 1 is very vague; how is $\approx$ quantified? Further, for the theoretical results in Section 3.2, as far as I understand (by looking at the proof of Lemma 3), $w_{k+1}$ has been taken to be exactly $\text{argmin } \phi_k(w)$ . The authors have not discussed this point. It is unreasonable to expect that $\phi_k(w)$ can be exactly minimized and I'd like to see a result where $w_{k+1}$ is an approximate minimizer of $\phi_k(w)$ , for e.g., an $\varepsilon$ -stationary point of $\phi_k(w)$ as written in Section 3.1.

2. Moreover, the complexity of minimizing $\phi_k(w)$ to obtain $w_{k+1}$ has been completely ignored in Section 3.2; this is an important aspect that has not been considered. This could have been captured if there were local steps in the proposed algorithm; specifically, I'd find the result more convincing if there were some $H \geq 1$ local steps of (S)GD on $\phi_k(w)$ to approximately minimize it and the final convergence bound in Theorem 1 would be a function of both $K$ and $H$ .

I understand that the emphasis is probably on reducing the communication complexity but deriving results assuming that the local surrogate functions can be exactly minimized is unrealistic and too strong according to me. I'd like to see the theoretical results capture the inexactness of the approximate minimizer of $\phi_k(w)$ as this would probably entail practically important tradeoffs.

3. [1] and [2] (cited below) are two relevant works that have not been touched upon by this paper. Step 5 in Algorithm 1 of this paper is virtually the same as Step 4 in Algorithm 1 (PAGE) of [1]. Note that [1] is for the centralized setting. [2] proposes a federated version of PAGE called FedPAGE. But unlike SABER, (Fed)PAGE does not need to minimize any surrogate function. Moreover, (Fed)PAGE also attains $\mathcal{O}(\frac{\sqrt{M}}{\varepsilon^{2}})$ complexity. So it appears that the algorithms proposed in [1] and [2] are similar to SABER and attain the same complexity as SABER without its compute-intensive part, viz., minimizing $\phi_k(w)$ . A detailed comparison of SABER with (Fed)PAGE is needed and this includes experiments.

[1]: Li, Zhize, et al. "PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization." International conference on machine learning. PMLR, 2021.

[2]: Zhao, Haoyu, Zhize Li, and Peter Richtárik. "FedPAGE: A fast local stochastic gradient method for communication-efficient federated learning." arXiv preprint arXiv:2108.04755 (2021).

4. I also have some concerns w.r.t. the experimental results in the paper. As per Section 4.1, $p=0.5$ and 50 out of 100 clients are used in line 8 of Algorithm 2 for CIFAR-10. So 50% of the clients are used in every other round. It is not clear to me if the benefits of SABER are because of this. For FEMNIST, I could not find the total number of clients (the authors should clearly state this) but it is mentioned that 100 clients are used in line 8 of Algorithm 2 again with $p = 0.5$ . Anyway, I think $p = 0.5$ is too large and the algorithm is practically useful only if it works with much smaller values of $p$ such as 0.1, etc. I'd like to see an ablation study showing the effect of varying $p$ . Also, it is not clear to me if the hyper-parameters were tuned for the other algorithms. Finally, as I mentioned in point #3, there should be some empirical comparisons with FedPAGE too given its similarity to SABER.

Due to the above weaknesses, I can only give a score of 3 for now.

问题

Please address the weaknesses.

2023-11-21

Thank you for providing your feedback on our submission.

It appears to us that the main concern comes from solving the local subproblem, so we answer it here in detail. The definition $w_{k+1}\approx \arg\min_w \phi_k(w)$ requires two things that are used in the proof: 1) descent in the local objective, $\phi_k(w_{k+1}) \le \phi_k(w_k) - \frac{1}{2\eta}\Vert w_{k+1} - w_k\Vert^2$ (used in the proof of Lemma 1), and 2) $\Vert\nabla \phi_k(w_{k+1})\Vert\le \varepsilon$ (used in the proof of Lemma 3). Both conditions can be achieved by applying gradient descent to $\phi_k(w)$ . As is known for nonconvex functions, it will take $H=O(\varepsilon^{-2})$ iterations to satisfy both conditions.
Thank you for providing additional references. We have cited SARAH, Spider, and EF21 for the use of recursive updates, but we see that PAGE uses the same mechanism too. As for the complexity of SABER and FedPAGE, the key difference is that the latter operates under $L$ -smoothness of the problem. As discussed in prior work on second-order heterogeneity, for instance by Khaled & Jin (2023), $L$ -smoothness implies second-order heterogeneity with constant $\delta = 2L$ , but the reverse is not true. In other words, second-order heterogeneity is strictly more general and $\delta$ can be orders of magnitude smaller than $L$ . Moreover, as far as we can see, if we apply PAGE to the local subproblem of SABER without regularization, then we immediately recover FedPAGE as a special case.
We will update the experiments for the camera-ready version to add FedPAGE as a baseline as the reviewer requested. We tried to do it before the end of the discussion period, but couldn't due to the short discussion period and since the FedPAGE paper doesn't provide code.
Regarding the hyperparameter search, we tuned the hyperparameters for all methods, not just ours.

评论- Reply

2023-11-22

Thanks for the rebuttal.

The proof of Lemma 3 in the current version still assumes $\nabla \phi_k(w_{k+1}) = 0$ , i.e., $\phi_k$ needs to be exactly minimized. I had suggested the authors show a result where $\phi_k$ is approximately minimized, but that has not been done.
Overall, there don't appear to be any meaningful differences or improvements over FedPAGE. Unless there are any major challenges in deriving results under second-order heterogeneity compared to $L$ -smoothness, I don't think this is significant enough.
My concern about large values of $p$ (in Weakness #4) has not been addressed.

I will keep my score.

评论- Updated version

2023-11-23

We have uploaded a revision of our paper without the condition $\nabla \phi_k(w_{k+1})=0$ .
The main 2 challenges in working with second-order smoothness are that 1) the update in expectation is not equal to the full gradient, so we had to relate stationarity to the full gradient in Lemma 3 using several intermediate steps, 2) we do not have Lipschitz gradients and standard descent Lemma that follow from $L$ -smoothness.
We agree that using a smaller $p$ is of interest for making the method practical but we didn't have enough time to present an ablation study as we had to work on the rebuttals and participate in paper discussions as reviewers too.

审稿意见

评分: 3置信度: 32023-11-01

Authors propose a novel algorithm for FL under Hessian similarity assumption. Authors provide theoretical analysis for the proposed method, showing $O(\delta \epsilon^{-2} \sqrt{M})$ complexity for general non convex problems and $\tilde{O}(\delta/\mu \sqrt{M} + M)$ complexity for problems under PL condition. Experiments on logistics and NNs are provided and support the theory.

优点

Novel method SABER for non convex FL under Hessian similarity.
Theoretical analysis, showing that the convergence rate of SABER aligns with the rate of Gradient Descent ( $O(\epsilon^{-2})$ ).
Extensive numerical experiments supporting the theory.

缺点

Solution of the subproblem. The subproblem can be non convex. In this case GD for the subproblem would require $\epsilon^{-2}$ iterations to find $\epsilon$ -stationary point. That might significantly slow down the convergence of the whole procedure. Moreover, it is assumed that the subproblem could be solved exactly. In my point of view, this not only makes the method impractical, but also simplifies the theoretical analysis.
Partial participation. Authors claim that SABER allows for partial participation (PP). As far as I understand, SABER with PP is listed as Algorithm 2. However, theoretical analysis is only presented for Algorithm 1, where participation of all clients is necessary, consequently relegating Algorithm 2 to a heuristic approach.
The experiments were conducted with a single local step for all methods. This setup might not be appropriate for Federated Learning (FL) algorithms with local steps, given that the primary aim is to reduce the number of communications by leveraging local updates.

问题

Abstract, first bulletpoint on page 3, page 6. I think it should be $O(\delta \epsilon^{-2} \sqrt{M})$ instead of $O(\delta \epsilon^{2} \sqrt{M})$ .
I think it is better to replace argmin in algorithm with first-order optimality condition and $\phi_k(w_{k+1}) \leq \phi_k(w_k)$ , as according to the proof in non convex case finding the actual argmin is not necessary.

2023-11-17

We thank the reviewer for their time. Please find our answers to weaknesses and questions below.

As the reviewer correctly pointed out, gradient descent would require $\epsilon^{-2}$ iterations to find the $\epsilon$ -stationary point of the local objective. This is, however, expected since employing GD makes it a method from the first-order class, for which there exist a lot of lower bounds. Woodworth et al. (2023) studied a method similar to ours in non-federated context, and the complexity improvement was in having the smoothness constant replaced by the heterogeneity constant, while in terms of $\epsilon$ , it remained the same.
Partial participation is indeed an important feature of our method, but it actually is supported by both Algorithm 1 and Algorithm 2. Notice step 4 in Algorithm 1, where we select a single client that will be used to perform the update.
To clarify, the clients perform a single local epoch, which is not the same as performing a single local step. Each epoch includes multiple local steps as the clients go over the dataset during the epoch.
Thanks for pointing out the typo in the abstract, we also found similar typos in the text, and fixed all of them.
Thank you for the suggestion, you are right that it is not necessary to find the actual argmin. We will make it clear in the paper.

2023-11-21

Thank you for your response!

My main concern remains the same: the assumption of an exact solution to the subproblem appears to be a limitation from both theoretical and practical standpoints, in my opinion.
In Algorithm 1, during step 4, you calculate the full gradient with a probability of $p$ , involving all workers. Could you clarify why this is considered a method with partial participation, given that it requires the involvement of all clients by design?
Thank you for providing clarification.

Considering the points mentioned above, I maintain my current evaluation score.

2023-11-21

Thanks for interacting with us, we really appreciate that you read our response.

In fact, as pointed out by Reviewer M25p, the subproblem does not need to be solved exactly. The statement $w_{k+1}\approx \arg\min_w \phi_k(w)$ requires two things that are used in the proof: 1) descent in the local objective, $\phi_k(w_{k+1}) \le \phi_k(w_k) - \frac{1}{2\eta}\Vert w_{k+1} - w_k\Vert^2$ (used in the proof of Lemma 1), and 2) $\Vert \nabla \phi_k(w_{k+1})\Vert\le \varepsilon$ (used in the proof of Lemma 3). Both conditions can be achieved by applying gradient descent or any other (for instance, stochastic) optimizer to $\phi_k(w)$ .
You're right that this step requires all clients to participate, but this only happens with probability $p$ , while the majority of methods need full participation at every iteration. Furthermore, it can be relaxed to use just a subset of clients, this will only affect the first step in the proof of Lemma 2. If we denote the sampled subset of clients $\tilde S_k$ for updating $\mathbf{v}_k$ , then Lemma 2 would have an extra term $\mathbb{E}[\Vert \frac{1}{|\tilde S_k|} \sum_m \nabla f_m(w_k) - \nabla f(w_k) \Vert^2]$ . If we additionally assume it's upper bounded by $\sigma^2$ (the variance over clients), then we'll get convergence up to $O(\sigma^2)$ error. We originally did not include this derivation because we wanted to keep the theory simple and pay more attention to the novel aspects. We can include this in the camera-ready version to make it clear that one can completely avoid full participation.

2023-11-21

Thank you for the response!

Then, please, present a new version of a proof including inaccuracy in subproblem. You can upload a revised version.
So, Algorithm 1 requires full participation of all clients and cannot be claimed as an algorithm with partial participation. Then, please, correct corresponding lines in the paper. You can provide proof for partial participation and Algorithm 2 in revised version.

I maintain my current evaluation score.

评论- Submission updated

2023-11-23

As requested, we have uploaded a revised version of our paper with the inaccuracy, and we updated all proofs accordingly.
We have included an analysis for the update of $v_k$ as given in Algorithm 2 with partial participation. We will also do an extra revision later to correct any remaining claims regarding partial participation.

审稿意见

评分: 6置信度: 32023-11-01

The authors propose a method for improving FL in the presence of second order data heterogeneity. This is achieved by mitigating client drift over the training duration by estimating the global update direction while also regularizing the local objective.

优点

The paper does good over outlining prior art and motivating the problem sufficiently. The tackled domain is of particular importance in Federated Learning as assuming in certain instances first-order heterogeneity can be problematic. The method seems sound and the theoretical bounds provided for both general cases as well as in the case of μ-PL are also appreciated.

The paper is quite dense but given the content packed I found it easy to read.

缺点

There are a few questions that I would like to ask the authors,

The method is claimed to be stateless by design, but what the authors mean in practice is not clearly defined nor tested in the experiments.
Lack of discussion on what happens in the case of node errors, communication drop-outs, and/or stragglers.
Lack of discussion on what happens if clients have different dataset sizes and how this affect the training.

问题

I have a few questions about the manuscript,

Have the authors tampered with the template? I found that I could not search using my accessibility tools and no text was selectable - is there any particular reason for doing this? I find that people that use specialty equipment might struggle with this...
Why the code was not included in the paper submission? I understand that this can be made public after review but I think it is an essential for proper review and it can be attached as supplementary material not disclosed publicly.
Clarify what the authors mean by "stateless by design" as per weaknesses above.
What happens if a client drops during the computation? Is that handled?
Does the framework assume equal participation of clients? What happens in the case that there are clients that "dominate" the training? Does this affect the end result (perhaps, due to the uniform sampling used...?)?

2023-11-17

Thank you for your positive feedback! We address the pointed weaknesses and answer the questions below in a combined list.

Thank you for asking about the stateless nature, it is an important feature of our method and it seems we forgot to properly define it when first mentioning it. Concretely, we assume that the clients do not carry states from round to round. For example, they may randomly drop out and never come back again.
Issues such as node errors, communication drop-outs, and stragglers that the reviewer mentions are indeed common in federated learning. The proximal term (regularization) used in the subproblem of SABER was originally motivated in FedProx to mitigate the issue of stragglers. For example, when one client does much more work than the others, the update magnitude is still limited thanks to the regularization. We will add a discussion on the consequences of using regularization in the paper.
Similarly, when the clients have different dataset sizes, the standard technique is to sample clients with a non-uniform sampling. All of our techniques are orthogonal and directly applicable to that case too.
We used the standard ICLR template but we split the pdf into the main body and the supplementary and this seems to have affected the text. We will split the pdf differently to fix the issue.
We apologize for not including the code in the submission, we were working hard on the manuscript and didn’t find enough time to clean the code to be included.
If a client drops during computation, the fact that our theory works under second-order heterogeneity implies that there should be no bias as long as we can sometimes get an update from that client to maintain a meaningful estimate of the global direction $v_k$ . In other words, the penalty for the client dropping is that the update would be less useful, but it will still be unbiased.
If some clients dominate the training in terms of how much work they can do, this issue is already handled using the regularization term. If you mean that they dominate in terms of participation, then the required change would be to sample them less frequently. We can add a discussion on how the theory works in this case.

评论- Read the buttal

2023-11-23

Thanks for the explanations. However, given the rest of the other reviews, comments, and tacking in account your rebuttal - I am keeping my score.

审稿意见

评分: 5置信度: 42023-11-01

The manuscript studies the federated leaning problem over clients with second-order data heterogeneity. The authors propose an algorithm called SABER, which combines FedProx and SCAFFOLD. Theoretically, they derive the communication complexity of SABER for both general non-convex and $\mu$ -PL objective functions. Several experiments are performed to evaluate the proposed algorithm.

优点

The problem studied in this paper is well motivated.

The authors carefully compare the difference between first-order and second-order data heterogeneity.

They provide a convergence analysis for SABER with deterministic gradient by constructing a local objective function consisting of a bias correction term and a regularization term.

缺点

The novelty of the paper seems limited. The local objective function constructed by the author is an intuitive combination of FedProx and SCAFFOLD. The similar idea can be found in Lin et al. (2023)
The algorithm design of SABER is not surprising to the reviewer in the sense that the proposed SABER algorithm uses the similar idea of SVRG to deal with data heterogeneity.
Technically, the main proof techniques used in the paper are standard in federated learning. The authors make no concrete improvements to existing SOTA results. Furthermore, it seems that there is no significant difference in the techniques used to analyze the objective function that satisfies the \mu-PL condition and the convex condition, especially for the deterministic gradient scenario.
The writing of the paper needs to be improved. In section 3.3, the result for \mu-PL objectives should be formulated as a formal theorem.

Reference:

Dachao Lin, Yuze Han, Haishan Ye, and Zhihua Zhang. Stochastic Distributed Optimization under Average Second-order Similarity: Algorithms and Analysis. arXiv preprint arXiv:2304.07504, 2023.

问题

Refer to weeknesses.

2023-11-17

We thank the reviewer for their feedback.

As the reviewer pointed out, the local objective construction is intuitive and is similar to SCAFFOLD and FedProx, but we believe this is a positive feature. Not only it connects naturally to the existing literature, it also lets us intuitively explain the meaning of every term in the subproblem. Note that constructing the subproblem itself was never supposed to be the main contribution, it’s rather in the analysis of the method. Moreover, the analysis of our paper doesn’t have anything in common neither with SCAFFOLD nor with FedProx.
Even if there are similarities with SVRG, our analysis is not based on it, so we are not sure why it is seen as a weakness by the reviewer. Furthermore, unlike SVRG, our algorithm does not store an anchoring point that is reset in the outer loop.
We would like to point out some key differences between our method and SVRS, as well as their analyses. First of all, unlike SVRS, SABER does not use $f_1$ at every iteration of the algorithm. In terms of analysis, starting from Lemma D.1 of Lin et. al, one can notice the term $D_h(x, x_t)$ . Then, in the proof of Theorem 3.3, in Appendix D.2, they set $x=x_*$ , where $x_*$ is the optimal solution. Our theory, in contrast, never uses the optimal solution, which might not even since we don't assume convexity. Likewise, we never use the distance term from the SVRS theory, so there are considerable differences in the proofs.
We are happy to rewrite Section 3.3 to improve the structure.

评论- Response to the rebuttal

2023-11-23

The reviewer thank the authors's effort in the reply which have partially addressed the reviewer's concerns. The reviewer would thus raise the score.

AC 元评审

2023-12-09

The paper considers the problem of federated learning in the presence of heterogeneous data. Under the assumption of second-order data heterogeneity, it proposes an algorithm called SABER, which can reduce client drift by estimating the global update direction as well as employing regularization.

The paper is well written and the problem is timely; however, there are serious concerns regarding the novelty of the results and the contributions of the paper in light of prior works. Also, the numerical section can be greatly improved to consider the related methods and to provide a more thorough comparisons amongst similar techniques.

为何不给更高分

There are serious concerns regarding the novelty of the results and the contributions of the paper in light of prior works. Also, the numerical section can be greatly improved to consider the related methods and to provide a more thorough comparisons amongst similar techniques.

为何不给更低分

N/A

最终决定Reject

2024-01-16

Reject