Newton Meets Marchenko-Pastur: Massively Parallel Second-Order Optimization with Hessian Sketching and Debiasing
Abstract
Reviews and Discussion
This paper proposes a distributed Newton method in which the server performs the Newton update and the workers use Hessian sketching to return a nearly unbiased estimate of the inverse Hessian to the server. The solution is based on random matrix theory, specifically the Marchenko-Pastur law.
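As a reading aid, here is a minimal, hypothetical sketch of a debias-and-average step of the kind described above, assuming Gaussian sketches and the classical inverse-Wishart correction factor (m - d - 1)/m for the unregularized case; the paper's Marchenko-Pastur-based debiasing and its Eq. (7) may differ, and all names below are illustrative.

```python
import numpy as np

def debiased_sketched_newton_direction(A, grad, m, n_workers, seed=0):
    """Hypothetical sketch of the debias-and-average scheme summarized above,
    for the unregularized Gauss-Newton Hessian H = A^T A.

    Each simulated worker draws an independent Gaussian sketch S with i.i.d.
    N(0, 1/m) entries, inverts the sketched Hessian A^T S^T S A, and rescales
    by the inverse-Wishart correction (m - d - 1) / m so that the estimate of
    H^{-1} grad is unbiased; the server then averages the directions.
    Requires m > d + 1.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    debias = (m - d - 1) / m
    directions = []
    for _ in range(n_workers):
        S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian sketch, E[S^T S] = I
        SA = S @ A                                     # m x d, cheap when m << n
        H_sketch = SA.T @ SA                           # unbiased estimate of A^T A
        directions.append(debias * np.linalg.solve(H_sketch, grad))
    return np.mean(directions, axis=0)                 # averaged (debiased) direction


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((2000, 20))
    g = rng.standard_normal(20)
    exact = np.linalg.solve(A.T @ A, g)
    approx = debiased_sketched_newton_direction(A, g, m=200, n_workers=50)
    print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```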
Strengths
The solution in this paper is novel to me. I think the idea is very interesting.
Weaknesses
See the Questions section.
Questions
- The paper says "The connection to the Marchenko-Pastur law treats in a uniform framework—and recovers almost immediately—earlier results on Hessian sketch debiasing for distributed Newton methods (Dereziński et al., 2020; Zhang & Pilanci, 2023)." Can the authors compare the works of (Dereziński et al., 2020; Zhang & Pilanci, 2023) with this work, so that I can better understand the contributions of this paper?
- Why do the authors claim that the non-asymptotic error guarantee in Section 3.1 is dimension-free? I cannot understand this, because of the requirement in Theorem 3.6. I can understand that if we have enough workers, then the parameters can be chosen so that the bound is dimension-free. But in applications I think the number of workers is fixed, and then, to make the error smaller than a given tolerance, the relevant parameter has to be taken large in a way that depends on the dimension. From this point of view, the guarantee seems to depend on the dimension.
- (Continuing my second question.) What is the total complexity of minimizing a self-concordant function? By Theorem 4.1, if the required accuracy condition holds, the algorithm recovers the rate of Newton's method; but according to Theorem 3.6, the quantity required there may be very large. The advantage of Newton's method is that it can have a superlinear convergence rate, yet if the total complexity has a polynomial dependence on the inverse of the target accuracy, the overall rate seems to be only sublinear. If so, what is the advantage of Newton's method compared to gradient methods? In other words, can the authors compare the total complexity of their Newton method with that of distributed gradient methods (including accelerated variants)?
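For reference, the requested comparison can be framed in terms of standard iteration counts; the display below uses only generic textbook rates and symbols (κ for the condition number, ε for the target accuracy, m for the sketch size, d for the dimension) and does not restate any result from the paper.

```latex
% Generic framing of the requested comparison (standard rates only, nothing from the paper):
%   total work = (number of iterations) x (cost per iteration).
\[
  \mathrm{Cost}_{\text{accel.\ gradient}}
    \;\asymp\; \underbrace{\sqrt{\kappa}\,\log\tfrac{1}{\varepsilon}}_{\text{iterations}}
      \cdot\; c_{\mathrm{grad}},
  \qquad
  \mathrm{Cost}_{\text{sketched Newton}}
    \;\asymp\; N_{\mathrm{Newton}}(\varepsilon)\cdot c_{\mathrm{sketch}}\!\big(m(\varepsilon, d),\, d\big),
\]
% where c_grad is the cost of one (distributed) gradient step, N_Newton the Newton
% iteration count from Theorem 4.1, and m(eps, d) the sketch size required by
% Theorem 3.6.  The question is how m(eps, d) scales, since it drives the
% per-iteration cost c_sketch and hence the total.
```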
This paper proposes a novel sketched Newton method, provides a convergence analysis of the algorithm, and gives non-asymptotic guarantees.
Strengths
The paper proposes a novel sketched Newton method, provides a convergence analysis of the algorithm, and gives non-asymptotic guarantees. The method is designed for the massively parallel setting. The theory in this paper is good.
Weaknesses
This paper claims that the proposed method is motivated by serverless cloud computing, in particular the "function as a service" (FaaS) model. However, Eq. (7) is a biased estimate of the inverse Hessian. Thus, even if many machines run in parallel, this does not yield a proportional speed-up: the convergence rate of the proposed algorithm with multiple machines running in parallel is the same as that of running on a single machine.
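The statistical fact behind this concern can be stated independently of the paper: averaging independent copies of a biased estimator reduces the variance but leaves the bias untouched.

```latex
% For i.i.d. estimators \hat{p}_1, ..., \hat{p}_m of a target p (e.g. a Newton
% direction H^{-1} g), the mean-squared error of their average decomposes as
\[
  \mathbb{E}\Big\|\tfrac{1}{m}\sum_{i=1}^{m}\hat p_i - p\Big\|^2
  \;=\;
  \underbrace{\big\|\mathbb{E}[\hat p_1]-p\big\|^2}_{\text{squared bias (independent of } m)}
  \;+\;
  \underbrace{\tfrac{1}{m}\,\mathbb{E}\big\|\hat p_1-\mathbb{E}[\hat p_1]\big\|^2}_{\text{variance term, shrinks as } 1/m},
\]
% so any residual bias puts a floor on the error that adding machines cannot remove;
% this is why (near-)unbiasedness of the returned inverse-Hessian estimate matters
% for a speed-up proportional to the number of machines.
```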
The experiments show that the bias of the proposed algorithm is lower than that of other algorithms and that the algorithm achieves lower iteration complexity. However, the corresponding computational cost is not reported. Thus, the comparisons in this paper do not seem sufficient to demonstrate the advantages of the proposed algorithm.
Questions
No
The key idea of the paper is to relate random matrix theory to one of the key optimization methods, Newton's method. Convergence is proved for strongly convex self-concordant functions.
Strengths
- A new (to the best of my knowledge) matrix sketching scheme is presented.
- The proofs look correct.
Weaknesses
- Convergence is proved only for self-concordant functions. The result would be more solid if the authors considered general strongly convex functions.
- There is no comparison with other methods in the experiments. There is a huge family of Newton-like methods in which the Hessian is approximated, ranging from quasi-Newton techniques to newer Hutchinson-based [1] or Fisher approximations [2]. Even Adam can be viewed as a method with a diagonal approximation of the Hessian. I could also mention Shampoo [3], SOAP [4], or [5]. (A minimal sketch of the Hutchinson-style diagonal estimator is included at the end of this Weaknesses section.)
[1] Jahani, M., Rusakov, S., Shi, Z., Richtárik, P., Mahoney, M. W., and Takáč, M. (2021). Doubly adaptive scaled algorithm for machine learning using second-order information.
[2] Amari, S. (2021). Information geometry.
[3] Gupta, V., Koren, T., and Singer, Y. (2018). Shampoo: Preconditioned stochastic tensor optimization.
[4] Vyas, N., Morwani, D., Zhao, R., Shapira, I., Brandfonbrener, D., Janson, L., and Kakade, S. (2024). SOAP: Improving and stabilizing Shampoo using Adam.
[5] Doikov, N., Jaggi, M., et al. (2023). Second-order optimization with lazy Hessians.
- There is an important point in the proof of Theorem 3.6: the failure probability is assumed to be small; what do we need to do if it is larger than the required threshold? Maybe I misunderstood something.
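Returning to the second weakness above: for concreteness, a minimal, hedged sketch of the Hutchinson-style diagonal Hessian estimator referred to in [1], using only Hessian-vector products. The quadratic test function and all names below are illustrative assumptions, not code from the paper or from [1].

```python
import numpy as np

def hutchinson_diag_hessian(hvp, d, n_samples=2000, seed=0):
    """Estimate diag(H) from Hessian-vector products via Hutchinson's identity
    diag(H) = E[z * (H z)] for Rademacher probes z (elementwise product)."""
    rng = np.random.default_rng(seed)
    est = np.zeros(d)
    for _ in range(n_samples):
        z = rng.choice([-1.0, 1.0], size=d)   # Rademacher probe vector
        est += z * hvp(z)                     # elementwise z * (H @ z)
    return est / n_samples


if __name__ == "__main__":
    # Illustrative quadratic with a known Hessian H.
    rng = np.random.default_rng(1)
    B = rng.standard_normal((30, 30))
    H = B.T @ B
    approx = hutchinson_diag_hessian(lambda v: H @ v, d=30)
    print(np.max(np.abs(approx - np.diag(H)) / np.diag(H)))  # max relative error per entry
```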
Questions
See point 3 in the Weaknesses section.
The paper presents a new optimization method for distributed centralized optimization, where all workers have access to the full data. The method is based on the damped Newton method and an inexact approximation of the inverse Hessian obtained by sketching with a Marchenko-Pastur scheme. The authors present extensive theoretical support for their method, with a focus on the debiasing properties of the proposed approximation.
Strengths
I think that the results are new. I appreciate the proposed approximation and its guarantees in their own right, even without the distributed setting. This approach is fresh for the optimization community. The paper is technically involved but well written.
Weaknesses
- One of the main weaknesses, from my point of view, is the experiments section. a) Firstly, there is no comparison with other methods, so it is unclear how fast the proposed method is. For example, it would be interesting to see how far it is from the exact Newton method. b) Secondly, ridge regression seems to be a very simple task for Newton-based methods, as it is a quadratic function with relatively small noise and a large regulariser. It might be much more useful to show how good the inverse approximation is depending on the conditioning of the Hessian and on the sketch size, similar to Fig. 1. (A hypothetical sketch of such an experiment is included at the end of this Weaknesses section.)
- Small remark: I recommend changing the indexing in formulas 2, 3, and 4, since the gradient is evaluated at the current iterate and should carry the corresponding index, and the Hessian is likewise computed at that iterate.
- Also, the authors could improve the related-literature section with some distributed Newton and quasi-Newton methods. Some possible references:
[1] Safaryan, M., et al. "FedNL: Making Newton-type methods applicable to federated learning." (2021) arXiv preprint arXiv:2106.02969.
[2] Agafonov, A., et al. "FLECS: A federated learning second-order framework via compression and sketching." (2022) arXiv preprint arXiv:2206.02009.
[3] Qian, X., et al. "Basis matters: better communication-efficient second order methods for federated learning." (2021) arXiv preprint arXiv:2111.01847.
[4] Elgabli, A., et al. "FedNew: A communication-efficient and privacy-preserving Newton-type method for federated learning." (2022) In International Conference on Machine Learning (pp. 5861-5877). PMLR.
[5] Bischoff, S., et al. "On second-order optimization methods for federated learning." (2021) arXiv preprint arXiv:2109.02388.
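Returning to point (b) of the first weakness: a hypothetical sketch of the suggested experiment, sweeping the Hessian condition number and the sketch size and measuring the relative error of a debiased sketched inverse against the exact inverse. The Gaussian sketch and the (m - d - 1)/m inverse-Wishart correction are assumptions for illustration and may differ from the paper's actual estimator.

```python
import numpy as np

def sketched_inverse_error(A, m, n_trials=20, seed=0):
    """Average relative error of a debiased sketched inverse of H = A^T A,
    using Gaussian sketches and the (m - d - 1)/m inverse-Wishart correction."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    H_inv = np.linalg.inv(A.T @ A)
    errs = []
    for _ in range(n_trials):
        S = rng.standard_normal((m, n)) / np.sqrt(m)
        SA = S @ A
        approx = (m - d - 1) / m * np.linalg.inv(SA.T @ SA)
        errs.append(np.linalg.norm(approx - H_inv) / np.linalg.norm(H_inv))
    return float(np.mean(errs))


if __name__ == "__main__":
    n, d = 4000, 50
    rng = np.random.default_rng(1)
    U, _ = np.linalg.qr(rng.standard_normal((n, d)))    # orthonormal columns
    for cond in [1e1, 1e3, 1e5]:                        # Hessian condition number sweep
        s = np.sqrt(np.geomspace(1.0, cond, d))         # singular values of A
        A = U * s                                       # cond(A^T A) == cond
        for m in [100, 400, 1600]:                      # sketch size sweep
            print(f"cond={cond:.0e}  m={m}  rel_err={sketched_inverse_error(A, m):.3f}")
```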
Questions
- Could you clarify the distributed setting paradigm? As I understand it, all workers have access to the whole function and therefore compute the same gradient. This setting is quite unusual to me, especially from a federated learning and privacy perspective.
- Do this method and approximation work for purely convex functions without regularisation? Also, might they work for non-convex functions in practice?
- How good is the proposed approximation compared to Krylov and Nyström-based approximations? (A minimal sketch of a Nyström baseline is included at the end of this Questions section.)
- It might be interesting to apply this approximation to the new branch of methods based on cubic regularisation:
[6] Scieur, Damien. "Adaptive Quasi-Newton and Anderson acceleration framework with explicit global (accelerated) convergence rates." International Conference on Artificial Intelligence and Statistics. PMLR, 2024
[7] Kamzolov, Dmitry, et al. "Cubic Regularization is the Key! The First Accelerated Quasi-Newton Method with a Global Convergence Rate of $O(k^{-2})$ for Convex Functions." arXiv preprint arXiv:2302.04987 (2023).
[8] Jiang, Ruichen, et al. "Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate." International Conference on Artificial Intelligence and Statistics. PMLR, 2024.
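Regarding the Krylov/Nyström question above: a minimal, hedged sketch of what a randomized Nyström baseline for approximating a PSD Hessian could look like. It is purely illustrative, not the paper's method, and not necessarily the exact variant the reviewer has in mind.

```python
import numpy as np

def nystrom_psd_approx(H, rank, seed=0):
    """Randomized Nystrom approximation of a PSD matrix H:
    H_hat = (H Q) (Q^T H Q)^+ (H Q)^T for a random orthonormal test matrix Q."""
    rng = np.random.default_rng(seed)
    d = H.shape[0]
    Q, _ = np.linalg.qr(rng.standard_normal((d, rank)))   # orthonormal d x rank
    C = H @ Q                                             # d x rank
    W = Q.T @ C                                           # rank x rank core matrix
    return C @ np.linalg.pinv(W) @ C.T


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    B = rng.standard_normal((200, 40))
    H = B @ B.T                                           # PSD with rank 40
    H_hat = nystrom_psd_approx(H, rank=40)
    print(np.linalg.norm(H - H_hat) / np.linalg.norm(H))  # near zero when rank matches
```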
All 4 reviewers (scores 8, 8, 6, 6) agreed that the paper should be accepted. I have read the reviews, the rebuttals, and the discussion. Moreover, I have had a quick read of the main body of the paper. I agree that the paper develops some very interesting new results, with potential for further extensions.
- The method and the theoretical results are new and interesting.
- The paper is well written.
- Related work is (after rebuttal) in a good shape.
I recommend the paper for acceptance.
Additional Comments from Reviewer Discussion
Several reviewers raised concerns and asked questions: for example, concerns about uncited related literature (resolved by the authors updating their paper), a question about the possibility of extending the results to the nonconvex regime (addressed satisfactorily in a reply), a comparison to Krylov and Nyström approximations (addressed in a reply), and more.
In summary, the rebuttal was effective, and the answers satisfactory. Moreover, some of the issues raised were addressed by the authors updating their paper, which is now a bit more complete as a result.
Accept (Poster)