Near-Optimal Distributed Minimax Optimization under the Second-Order Similarity
Abstract
Reviews and Discussion
The paper proposes SVOGS, an improved algorithm for distributed minimax optimization based on client mini-batch sampling and gradient variance reduction. Theoretical rates on communication complexity and gradient computations are provided, along with their lower bounds. The analysis shows that SVOGS achieves the corresponding lower bounds. A numerical example is provided to justify the efficacy.
Strengths
- The presentation is clear and the notation is clean.
- The paper is theoretically solid, with comprehensive theoretical results. The lower bounds, although built on techniques and results from prior works, provide a novel contribution to the field of minimax optimization.
Weaknesses
The main drawback is the experiment section. The proposed method is only tested on one small dataset, and the paper lacks basic information on how SVOGS is implemented and tuned (e.g., the parameters in Algorithm 1). It seems to me that the proposed method requires much heavier fine-tuning than prior methods and is thus less practically convenient. The implementation details should be provided to justify the true practical value of this method.
Also, it should be tested on more datasets.
Questions
See above
Limitations
NA
The implementation details
We tune the step-size of SVOGS from . The probability is tuned from , where . The batch size is determined from , with .
We set the other parameters by following our theoretical analysis. We set the average weight as . For the momentum parameter, we set for the convex-concave case and for the strongly-convex-strongly-concave case, where we estimate by for problem (13). For the sub-problem solver, we set its step-size according to the smoothness parameter of sub-problem (2), i.e., .
In addition, we estimate the smoothness parameter and the similarity parameter by following the strategy in Appendix C of Ref. [5].
With the above setting, our SVOGS achieves better performance than the baselines. We are happy to provide the detailed parameter settings in our revision.
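The parameter-estimation step can be illustrated with a rough sketch. Everything below is hypothetical: the random matrices merely stand in for the local Hessians of a problem like (13), and the estimates are taken as the largest spectral norms over nodes, in the spirit of the strategy in Appendix C of Ref. [5].

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8  # hypothetical: 5 nodes, dimension 8

# Hypothetical local quadratics f_i(z) = 0.5 z^T H_i z, built as a shared
# component plus a small node-specific symmetric perturbation, so that the
# Hessians are similar across nodes (mimicking second-order similarity).
G = rng.standard_normal((d, d))
H_shared = G @ G.T / d
perturbs = [rng.standard_normal((d, d)) for _ in range(n)]
hessians = [H_shared + 0.05 * (P + P.T) / 2 for P in perturbs]
H_avg = sum(hessians) / n

def spectral_norm(A):
    # Largest singular value; for symmetric A this is max |eigenvalue|.
    return np.linalg.norm(A, 2)

# Smoothness estimate: L must upper-bound ||∇²f_i(z)|| for every node.
L_est = max(spectral_norm(H) for H in hessians)
# Similarity estimate: δ must upper-bound ||∇²f_i(z) − ∇²f(z)|| for every node.
delta_est = max(spectral_norm(H - H_avg) for H in hessians)

print(f"L ≈ {L_est:.3f}, δ ≈ {delta_est:.3f}")
```

Since the node-specific perturbation is small here, the estimated δ comes out much smaller than L, which is exactly the regime where similarity-based rates improve on smoothness-based ones.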
It should be tested on more datasets
We have provided additional experimental results on the datasets w8a (, ) and covtype (, ). Please refer to the PDF file in the Author Rebuttal. We can observe that the proposed SVOGS performs better than the baselines on these datasets. We are happy to include the additional experimental results in our revision.
The paper studies distributed min-max optimization under the assumption of second-order data similarity, i.e., that the Hessians of the objectives at different nodes are close enough. For the classes of (strongly)-convex-(strongly)-concave functions with Lipschitz gradients, lower complexity bounds are proposed. Moreover, an algorithm, SVOGS, is proposed that exactly reaches the lower bounds in communications and matches them up to a logarithmic factor in local gradient calls. Therefore, the paper almost closes the gap in distributed centralized min-max optimization with data similarity.
Strengths
- The problem, assumptions and results are clearly stated.
- An algorithm optimal in communications and near-optimal in gradient calls is proposed. That closes the complexity gap for the class of distributed min-max optimization problems with second-order similarity.
- Overall, the paper has a readable structure.
Weaknesses
There still remains a gap in the local gradient calls complexity. Maybe it can be overcome with the usage of gradient sliding technique.
Questions
I did not find the difference between the number of communication rounds and communication complexity. It seems that Algorithm 1 sends the same amount of information at each communication round. So why is communication complexity different from the number of communication rounds?
Limitations
The work is theoretical and does not have negative societal impact.
Why is the communication complexity different from the number of communication rounds?
Recall that the communication complexity in our paper refers to the overall volume of information exchanged among the nodes. We take the convex-concave case (Theorem 1) as an example to explain why the communication complexity differs from the number of communication rounds.
- The results of Theorem 1 (discussed in Lines 147-177) show that we require communication rounds of to achieve the desired accuracy .
- We require a communication complexity of to achieve in the initialization.
- The index set with means that the sum in Line 6 of Algorithm 1 takes a communication complexity of at each iteration.
- The communication for the full gradient term in Line 6 of Algorithm 1 does not need to be performed at every iteration. Notice that Line 10 performs with probability and performs (no communication) with probability , which means only the case of needs to communicate to obtain the full gradient, with a communication complexity of . Therefore, the overall expected communication complexity related to the full gradient term is at each iteration.
Based on the above analysis, we conclude that the overall communication complexity is , which is different from the number of communication rounds .
In addition, full participation methods (e.g., EG [24], SMMDS [5], and EGS [25]) require communication of the full gradient at every iteration, which leads to a more expensive communication complexity of .
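The accounting above can be sanity-checked with a toy simulation. The values of n, b, and p below are placeholders, not the paper's tuned parameters: each round costs b messages for the sampled mini-batch, plus n messages for a full-gradient synchronization that is triggered only with probability p, so the expected per-round cost is b + p·n rather than n.

```python
import random

random.seed(0)
n = 100       # hypothetical number of nodes
b = 10        # hypothetical mini-batch size (sampled nodes per round)
p = b / n     # hypothetical sync probability, mirroring a p ~ b/n choice
rounds = 200_000

total_messages = 0
for _ in range(rounds):
    total_messages += b          # the sampled nodes always communicate
    if random.random() < p:
        total_messages += n      # occasional full-gradient synchronization

avg_cost = total_messages / rounds
expected = b + p * n             # = 2b when p = b/n
print(f"average per-round cost: {avg_cost:.2f}, expected: {expected:.2f}")
```

With p of order b/n, the expected per-round communication stays O(b) even though a full synchronization over all n nodes happens occasionally, which is exactly why the communication complexity is smaller than (number of rounds) × n.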
There still remains a gap in the local gradient calls complexity. Maybe it can be overcome with the usage of gradient sliding technique.
The comparisons in Tables 1-2 show that the local gradient call complexities in our results match the lower bounds up to logarithmic factors. We thank the reviewer for pointing out that this logarithmic gap could possibly be closed using the gradient sliding technique. We believe this is an interesting future direction.
Dear Authors,
Thank you for the response.
This manuscript considers solving (strongly)-convex-(strongly)-concave distributed minimax optimization problems. The authors propose a stochastic variance-reduced optimistic gradient sliding method with node sampling, named SVOGS, which achieves a complexity that nearly matches the obtained lower bound.
Strengths
- The paper improves the complexity of existing algorithms through the adoption of node sampling, especially when dealing with a large number of nodes. The theoretical results are well-supported by simulation experiments.
- The complexity results are thoroughly compared with existing results.
- The paper is well-organized and easy to read.
Weaknesses
- The algorithm design is straightforward. The novelty of the algorithm is limited to the node sampling aspect, which is a simple and obvious adaptation of existing methods, e.g., Ref. [5]. The authors should further clarify the unique characteristics of the algorithm.
- It seems to the reviewer that incorporating uniform and independent node sampling (cf. Line 5 in Algorithm 1) does not present any new challenges in the convergence analysis compared to existing methods. The authors should further clarify the technical contribution of their approach.
- The experiments utilize toy datasets, e.g., a9a, which may limit the generalizability of the results.
- The rationality of Assumption 2 should be discussed in more detail to ensure its validity and relevance to the study.
- Theorems 1 and 2 require many hyperparameters to be set to exact values that depend on the parameters of the problem. In practice, these parameters are often very difficult to obtain, which may weaken the theoretical results of this work.
Questions
- The authors bring the results for minimization in Ref. [25] directly to compare with the results for the minimax problem in this paper. Does this mean that the minimax problem in this paper does not have additional difficulties compared with the minimization problem?
- The distinction between the optimistic gradient and extra-gradient methods for minimax problems should be clarified. The reviewer is curious whether the EG method in Ref. [25], combined with node sampling, could also achieve similar results.
- Refer to Weaknesses for more concerns to be addressed properly.
Limitations
The (strongly) convex (strongly) concave distributed minimax optimization problem considered in this work is not applicable to most machine learning tasks and is thus of limited significance in this area.
Results for minimization in Ref. [25]
Ref. [25] studies both minimization and minimax problems. We only compare with their results for the minimax problem (Tables 1-3).
The minimax problem is indeed more difficult than the minimization problem. For example, Section 3 of Ref. [25] considers the strongly convex minimization problem, achieving complexities that depend on and , while Section 4 of Ref. [25] considers the strongly-convex-strongly-concave minimax (strongly monotone variational inequality) problem, achieving complexities that depend on and . Notice that the lower bounds in Section 6.2 of our paper show that such dependence on and cannot be improved for this minimax problem, which means it is indeed more difficult.
Combine EG [25] with node sampling
Compared with our result, applying node sampling (variance reduction) [1] to the EG method [25] would lead to an additional term in the complexity of communication rounds, because the EG framework cannot benefit from mini-batch sampling.
For ease of understanding, we follow the notation and setting of Algorithm 1 for a single machine in Ref. [1] to illustrate this issue (the distributed case is similar). The essential step in the analysis of Algorithm 1 in Ref. [1] is their equation (9) in Lemma 2.9, that is
where and Assumption 1(iv) is the mean-squared Lipschitz continuity condition on such that for all .
Directly adapting the above analysis to our distributed problem would lead to an extra term of in the complexity of communication rounds (compared with methods without node sampling), since the third line in the loop of Algorithm 1 of Ref. [1] draws only one sample . To match the results of our paper, we introduce mini-batch sampling as in our Algorithm 1 (Line 6) into the EG framework, that is, we replace the term
with
where follows the notation in our Algorithm 1.
Then the above derivation becomes
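A minimal numerical sketch of the point being made: the mini-batch variance-reduced estimator g = (1/|S|) Σ_{i∈S} (∇f_i(z) − ∇f_i(w)) + ∇f(w) stays unbiased for ∇f(z) while its variance shrinks roughly by the batch size, which is what the single-sample EG adaptation forgoes. All gradients below are synthetic placeholders, not the paper's actual objectives.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50, 4  # hypothetical: 50 nodes, dimension 4

# Hypothetical per-node gradients at the current point z and the anchor point w.
grads_z = rng.standard_normal((n, d))
grads_w = rng.standard_normal((n, d))
full_grad_z = grads_z.mean(axis=0)
full_grad_w = grads_w.mean(axis=0)

def estimator(batch_size, trials=20000):
    """Draw S uniformly without replacement and form the variance-reduced estimate."""
    samples = np.empty((trials, d))
    for t in range(trials):
        S = rng.choice(n, size=batch_size, replace=False)
        samples[t] = (grads_z[S] - grads_w[S]).mean(axis=0) + full_grad_w
    return samples

for b in (1, 10):
    g = estimator(b)
    bias = np.linalg.norm(g.mean(axis=0) - full_grad_z)
    variance = g.var(axis=0).sum()
    print(f"batch size {b}: empirical bias ≈ {bias:.3f}, variance ≈ {variance:.3f}")
```

The single-sample case (batch size 1) corresponds to the sampling in Algorithm 1 of Ref. [1]; its much larger variance is the source of the extra term in the communication-round complexity discussed above.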
Thank you for the detailed response. Since most of my concerns have been well addressed, I raise my overall rating to weak accept.
We thank the reviewers for their appreciation of our work.
Both Reviewer Vpvu and Reviewer DkHm have raised questions about the experiments; our responses on the implementation details and on additional datasets are given in the individual replies above.
This paper studies convex-concave minimax problems in distributed settings, focusing specifically on settings where the instance satisfies a second-order similarity bound. The latter condition refers to an upper bound on the Hessian of the difference between all pairs of objectives within the finite sum. Under this assumption, it is shown that algorithms converge with improved rates compared to those that only quantify convergence via smoothness (it should be mentioned that the rates under second-order similarity are never worse than those based on smoothness, from the fact that -smoothness implies -similarity). The paper also provides matching lower bounds.
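The parenthetical claim that smoothness always implies second-order similarity (up to a constant factor) follows from the triangle inequality; a one-line sketch, writing $f$ for the average of the $f_i$ and assuming each $f_i$ is $L$-smooth:

```latex
\|\nabla^2 f_i(z) - \nabla^2 f(z)\|
  \le \|\nabla^2 f_i(z)\| + \|\nabla^2 f(z)\|
  \le L + L = 2L,
```

so every $L$-smooth instance is automatically $(2L)$-similar, and similarity-based rates can only improve on smoothness-based ones.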
Reviewers were mostly positive about this submission. However, it appears that the technical contributions of this work are somewhat marginal:
- The idea of using gradient sliding appears in [5], and the idea of subsampling nodes is quite standard in distributed optimization.
- The lower bound is also an adaptation of [42] which is by now quite standard.
Furthermore, the property of Hessian similarity is quite strong. Despite these concerns, and due to the mostly positive assessment of the work, I recommend acceptance.