PaperHub

Overall rating: 6.0 / 10 (Rejected, 4 reviewers)
Individual ratings: 6, 6, 6, 6 (min 6, max 6, std 0.0)
Confidence: 3.8 · Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.3
ICLR 2025

SANIA: Polyak-type Optimization Framework Leads to Scale Invariant Stochastic Algorithms

Submitted: 2024-09-16 · Updated: 2025-02-05
TL;DR

Polyak step-size based optimization framework SANIA generalizes some of the most popular adaptive optimization methods while also making them scale invariant.
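For context, the classical stochastic Polyak step-size on which this framework builds can be written (standard form from the SPS literature, not necessarily the exact expression used in the paper) for a sampled loss $f_i$ as

$$\gamma_t = \frac{f_i(w_t) - f_i^*}{\|\nabla f_i(w_t)\|^2},$$

where $f_i^*$ denotes the minimum (or a lower bound) of the sampled loss; for non-negative losses under the interpolation assumption, one commonly takes $f_i^* = 0$.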

Abstract

Keywords

optimization, learning rate free, adaptive optimizers, polyak step-size

Reviews and Discussion

Official Review
Rating: 6

This paper presents SANIA, a method that does not require manual fine-tuning of the learning rate in commonly used stochastic optimization algorithms, leading to faster optimization. The authors present a framework which generalises common stochastic optimization algorithms. The authors also consider affine and scale invariance, which seeks to address poorly scaled or ill-conditioned problems. The authors compare the performance of SANIA with commonly used algorithms (which have had fine-tuned parameters) for training classifiers on MNIST, FashionMNIST, CIFAR10 and SVHN.

Strengths

Provides a novel formulation of stochastic optimization algorithms and a novel method that does not require fine-tuning of the learning rate parameter. Investigation into scale invariance is novel. Theorems, equations and ideas are presented clearly. Numerical experiments show that the SANIA methodology achieves similar performance to algorithms with fine-tuned learning rates, and improved robustness as the training curves fluctuate less.

Weaknesses

Numerical experiments lack a comparison to other options for tuning-free methods.

Minor comments:

  • Occasional incorrect grammar, including after equation (7): "This leads us to Stochastic Polyak step-size method." should read "This leads us to the Stochastic Polyak step-size method." There is also bad grammar in the statement of Theorem 1.
  • "Another way to derive this formulation is by solving" (4): I think it may be better to reference Appendix B.2 to make it clear why this holds.
  • "interpolation condition" isn't very clearly defined in my opinion; it may be better to be more explicit.
  • "in practice it displays better convergence to the true Hessian than other similar methods like BFGS": I think this needs a reference.

Questions

On page 2, SPS has been derived; why not just cite it? The difference between $g_t$ and $m_t$ is unclear to me; they often seem to mean the same thing. Why not keep the notation consistent? In (6) you have written that $\| w \|_{B_t}$ is a Euclidean norm; would it be better to say "$\| \cdot \|_{B_t}$ is a Euclidean norm"?

Comment

Dear Reviewer iCAR,

We sincerely appreciate your thorough feedback and constructive comments, which we believe will significantly enhance the clarity and readability of our paper. We are deeply encouraged that you found our work novel and interesting. Moving forward, we are committed to addressing your inquiries and concerns in detail.

Your observation regarding numerical experiments comparing SANIA to other tuning-free methods is highly relevant and aligns with one of the directions we are actively pursuing in our ongoing research.

We apologize for not providing a clearer definition of the “interpolation condition.” To address this, we have added a concise paragraph that offers additional details and intuition about this assumption on page 3.

Comment

Reviewer: "In Page 2, SPS has been derived, why not just cite it?"

The main reason for including the derivation of SPS is to provide intuition for readers who may not be familiar with this method. We believe that presenting this compact and concise derivation enhances the coherence and accessibility of our paper without detracting from the key ideas discussed later.

Lastly, we have corrected minor grammatical errors and addressed the missed citations you kindly pointed out.

Thank you once again for your valuable feedback.

Comment

Dear Reviewer iCAR,

Thank you again for your thoughtful feedback on our paper. We have addressed all the questions and concerns you raised in our responses. We wanted to kindly follow up to see if you have any additional questions or require further clarifications. If our responses address your concerns, we would greatly appreciate it if you could consider updating your score to reflect this.

Thank you for your time and consideration.

Comment

Thank you for the thorough response. I have no more questions and will keep my score the same.

Official Review
Rating: 6

The work tries to incorporate a Polyak-type optimization framework into existing optimizers for machine learning. In the new method, the optimization direction and step size are determined automatically by a simple local optimization at each step. The new method is not difficult to implement and appears to work. From the numerical results, the test accuracy of networks trained with the new method is comparable to existing methods (but not better than them).

Strengths

The new method can be naturally combined with existing methods, and the implementation is not difficult.

Weaknesses

From the numerical results, the new method does not show improvements compared to previous methods.

Questions

  1. Instability of the Euler method and noise are believed to be helpful for generalization (see Ref. [1]). A relatively large step size helps the training process jump out of bad local minima. In the new framework, when the steps are spontaneously obtained, the instability is also almost spontaneously inhibited. This may be a severe problem for the new framework.

  2. From the numerical results, it seems that the test accuracy of SANIA is slightly worse than existing methods. The author should also compare the results with SGD, since it usually has competitive generalization ability.

  3. In Line 209, what is the meaning of "Otherwise, we replaced step-size parameter $\gamma_t$ to parameter $f_i^*$?". The authors should explain more clearly.

  4. Due to the anisotropic property of the loss landscape, the estimate $f^*$ may not be a good guess for the local optimization. How will this affect the performance of the new method?

[1] The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent, Lei Wu, Weijie J. Su, ICML 2023.

Comment

Reviewer: "From the numerical results, it seems that the test accuracy of SANIA is slightly worse than existing methods. The author should also compare the results with SGD, since it usually has competitive generalization ability."

We agree that SGD is a strong baseline with well-established generalization capabilities. To address this, we have conducted additional experiments comparing SANIA to manually fine-tuned SGD on real datasets. These comparisons have been included in Appendix E (Figure 5). The results indicate that SANIA performs competitively with SGD while eliminating the need for a tuned learning rate schedule, highlighting its practical utility. Due to time constraints, we were unable to include large-scale experiments; however, we plan to explore a broader range of problems with SANIA in future work. If we obtain the results in time, we will add large-scale experiments by the end of the discussion deadline and upload the updated findings.

Thank you for this valuable suggestion.

Comment

Reviewer: "Due to the anisotropic property of the loss landscape, the estimate f∗ may not be a good guess for the local optimization. How will this affect the performance of the new method?"

Could the reviewer kindly elaborate on this comment?

We are grateful for your feedback and will incorporate detailed responses and additional experiments to address your concerns in the revised manuscript. Thank you for your valuable insights and suggestions.

Comment

Reviewer: "In Line 209, what is the meaning of 'Otherwise, we replaced step-size parameter γt to parameter fi∗?' The authors should explain more clearly."

We recognize that this statement required clarification. It has been revised, and we trust it is now clearer. Please refer to the end of page 4 for the updated wording.

Comment

Reviewer: "Instability of the Euler method and noises are believed to be helpful for generalization... This may be a severe problem for the new framework."

It is true that injecting noise can play a regularizing role in training for stochastic gradient descent methods. Please note that our method can, and does, take large step sizes, which allows it to jump out of local minima. Regarding the reviewer’s remark about our step size spontaneously inhibiting desired instabilities, we are unsure of the intended meaning. Could the reviewer kindly clarify their comment? If the concern pertains to whether SANIA’s step-size determination process prevents the addition of controlled noise or perturbations to the optimization process, the answer is no. In fact, it would be possible to explore how such modifications might enhance generalization performance, and we view this as an interesting avenue for future research.

We thank the reviewer for bringing the additional reference to our attention, and we will include it in the revised version of the paper.

Comment

Reviewer: "From the numerical methods, the new method does not show improvements compared to previous methods."

We acknowledge that SANIA’s performance, as presented, is comparable to existing methods and does not show significant improvements. One key difference, however, is that SANIA operates without requiring hyperparameter tuning, unlike traditional optimizers such as Adam or Adagrad, which often require careful tuning to achieve their best performance. This feature of SANIA can potentially help to save substantial amounts of resources. To further illustrate this, we have added additional experiments to Appendix E (Figures 3, 5, 6, 7) comparing SANIA’s performance against Adam, Adagrad, SGD and KATE under suboptimal tuning conditions and their best-tuned step sizes. These results highlight that SANIA’s performance remains robust across a wider range of scenarios.

Comment

Dear Reviewer zkYN,

We sincerely thank you for your thoughtful review of our work and for highlighting both the strengths and weaknesses of our proposed method. We deeply appreciate your acknowledgment of the soundness, ease of realization, and presentation quality of our approach. Your constructive feedback and insightful questions have guided us to identify areas for improvement and further experimentation. Below, we provide detailed responses to the points raised under the weaknesses and questions sections.

Comment

Dear Reviewer zkYN,

Thank you again for your thoughtful feedback on our paper. We have addressed all the questions and concerns you raised in our responses. We wanted to kindly follow up to see if you have any additional questions or require further clarifications. If our responses address your concerns, we would greatly appreciate it if you could consider updating your score to reflect this.

Thank you for your time and consideration.

Official Review
Rating: 6

This paper introduces a cubic Newton method with an adaptive Polyak step-size, enhancing robustness and efficiency in non-convex tasks. It also proposes scale-invariant variants of AdaGrad and Adam, which improve optimizer performance on poorly scaled data. Extensive experiments validate the effectiveness of these methods across diverse optimization scenarios.

Strengths

  1. Thorough theoretical analysis.
  2. Clear writing

Weaknesses

  1. It would be great to have theoretical/practical comparisons with Sophia [1], which also uses a Hutchinson-based method to compute its preconditioner.
  2. KATE [2] removes the square root to ensure scale invariance of Adagrad. It would be great to have a theoretical/empirical comparison with KATE.

I am willing to improve the score if the above are addressed.

[1] Liu, Hong, et al. "Sophia: A scalable stochastic second-order optimizer for language model pre-training." arXiv preprint arXiv:2305.14342 (2023).
[2] Choudhury, Sayantan, et al. "Remove that Square Root: A New Efficient Scale-Invariant Version of AdaGrad." arXiv preprint arXiv:2403.02648 (2024).
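For readers less familiar with Hutchinson-type preconditioners, the sketch below (our illustration, not Sophia's or the paper's implementation; the matrix `H` and probe count are hypothetical) shows how the diagonal of a Hessian can be estimated from Hessian-vector products with Rademacher probes:

```python
# Minimal sketch (illustrative only): Hutchinson-style estimate of diag(H)
# using only Hessian-vector products and Rademacher probes.
import numpy as np

rng = np.random.default_rng(0)
d = 6
G = rng.normal(size=(d, d))
H = G @ G.T                              # hypothetical symmetric "Hessian"

def hvp(v):
    # Hessian-vector product oracle; in practice this comes from autodiff.
    return H @ v

n_probes = 2000
est = np.zeros(d)
for _ in range(n_probes):
    z = rng.choice([-1.0, 1.0], size=d)  # Rademacher probe
    est += z * hvp(z)                    # E[z * (H z)] = diag(H)
est /= n_probes

print(np.round(np.diag(H), 2))
print(np.round(est, 2))                  # Monte Carlo estimate, close to the true diagonal
```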

Questions

What are your thoughts on how learning rate schedules such as cosine decay, etc., compose with the $\lambda_t$ schedule defined in (19)?

How does $\lambda_t$ change during training?

Comment

We have included updated experiments with KATE; however, due to time constraints, we are still running experiments with Sophia. Nevertheless, we have found both of them to be highly relevant to our research and have included them in Lemma 5 in the Appendix as special cases of the SANIA general framework.

Thank you again for your valuable feedback and thoughtful suggestions.

Please let us know if some concerns are still unanswered.

Comment

Question: What are your thoughts on how learning rate schedules such as cosine decay, etc., compose with the $\lambda_t$ schedule defined in (19)? How does $\lambda_t$ change during training?

It is true that learning rate schedules such as exponential decay, cosine annealing, and other procedures are widely used in practice. However, one problem persists – different schedules are often suited to different scenarios. For example:

  1. Exponential decay works best for problems where a steady reduction in the learning rate is advantageous.
  2. Reduce-on-plateau is ideal for training pipelines where the convergence speed varies significantly.
  3. Cosine annealing is well-suited for tasks that benefit from smooth reductions in the learning rate, such as fine-tuning.
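For reference, the cosine-annealing schedule mentioned above is usually written (textbook form, not a formula from the paper) as

$$\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right),$$

where $T$ is the total number of steps (or the length of one restart cycle).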

In contrast, the $\lambda_t$ schedule has an update rule directly influenced by the function landscape, which may alleviate the need for manually selecting a specific learning rate schedule. To provide insights into the behavior of $\lambda_t$, we have included numerical experiments demonstrating the evolution of SANIA’s step size compared to fine-tuned methods and methods employing learning rate schedules.

Due to time constraints, we present these experiments using a synthetic dataset. Nonetheless, we are eager to investigate the behavior of $\lambda_t$ in large-scale problems as part of future work. Please find the aforementioned experiments in Figure 10 of the Appendix in the revised version of our paper.

Comment

Dear Reviewer zTra,

We are extremely delighted to learn that you found our work easy to understand, and we sincerely thank you for appreciating our theoretical analysis. We have carefully studied the two papers you kindly shared with us and found both to be highly relevant to our work. We have now referenced and discussed both of these works in the revised version of our paper.

Comment

Dear Reviewer zTra,

Thank you again for your thoughtful feedback on our paper. We have addressed all the questions and concerns you raised in our responses. We wanted to kindly follow up to see if you have any additional questions or require further clarifications. If our responses address your concerns, we would greatly appreciate it if you could consider updating your score to reflect this.

Thank you for your time and consideration.

Comment

Thank you for the effort put into the rebuttal. I highly recommend comparing empirically with Sophia for future revisions. I decided to maintain my score.

Comment

Dear Reviewer zTra,

Thank you for your thoughtful feedback and for highlighting the importance of empirical comparisons with Sophia and KATE.

We have included updated experiments with KATE and discussed both works as special cases in our general framework.

Regarding Sophia, we are pleased to inform you that we are already running experiments, and the results will be included in future revisions. We greatly appreciate your recognition of the theoretical analysis and practical relevance of our work, and if our ongoing efforts address your concerns, we would be grateful if you could consider an adjustment to your score.

Official Review
Rating: 6

This paper proposes a training algorithm generalizing the "Polyak step-size" to second-order algorithms, such as cubic Newton or quasi-Newton.

Let $f$ be a function to minimize and $w_t$ the current estimate of the argmin. For an order-1 method, the Polyak step-size comes down to choosing the step-size such that we would attain the minimum $w^*$ in one step if $f$ was affine between $w_t$ and $w^*$. For order-2 methods, the authors distinguish the metric used for the parameter space, denoted by $B_t$, and the metric related to the curvature of the local approximation of $f$, represented by a matrix $D_t$.

Then, the authors propose variations of existing algorithms based on their generalization of Polyak step-size.

Strengths

Originality

As far as I know, the work presented in this paper is original.

Clarity

Overall, the paper is easy to follow.

Quality

I am very grateful to the authors for providing additional experiments (Appendix E), which show not only test accuracy, but also test loss. It would have been even better to provide the training loss, since the proposed algorithms have been designed to improve the optimization process (not the generalization).

Overall, the paper seems to be correct.

Significance

This paper provides practical uses of SP2: A Second Order Stochastic Polyak, Li et al., 2023. So the present paper is relevant for the community.

Weaknesses

Originality

This paper can be seen as a practical application of SP2: A Second Order Stochastic Polyak, Li et al., 2023.

Clarity

At the end of Section 2, the authors use the notations "SANIA $I_d$", "SANIA $(V^{-1})^2$" and "SANIA $\mathrm{diag}(H^{-1})$", while writing that there is no preconditioning. Thus, I understand that "SANIA $A$" refers to Eqn. (6) with $D_t = A$. But there is no formal definition of "SANIA $A$". It should appear somewhere.

Just above Eqn. (9), one can read: $m_t = m_t$, which should be "$m_t = g_t$" (?).

Several mistakes in Figure 1, first plot:

  • the legend is difficult to understand, since several curves have the same label;
  • some curves do not seem to correspond to their label: apparently, light green should be "SANIA $(V^{-1})^2$";
  • the purple curve is dotted, while it is not dotted in the legend.

Questions

Could the authors provide a formal link between scale invariance and using SANIA? It seems that there is some overlap between the two, but SANIA does not ensure scale invariance by itself. It would be interesting for potential users of SANIA to provide conditions for their algorithm to be scale-invariant.

Addition to Fig. 5: how does SANIA compare to concurrent methods in terms of training loss?

Comment

Finally, we have updated Figure 1 based on your feedback and added training loss to our plots (Figures 5, 6, and 7). We hope you find the revised version to be an improvement.

Thank you once again for your insightful comments and suggestions.

Comment

Question: Could the authors provide a formal link between scale invariance and using SANIA? It seems that there is some overlap between the two, but SANIA does not ensure scale invariance by itself. It would be interesting for potential users of SANIA to provide conditions for their algorithm to be scale-invariant.

Response: It is accurate that SANIA does not guarantee scale invariance by itself; to design an algorithm that enjoys this property, one must ensure the following conditions are satisfied:

  1. Diagonal Transformations: The algorithm should behave identically under transformations of the form $w \to Vw$, where $V$ is a diagonal, non-degenerate matrix. This ensures it is independent of feature scaling.

  2. Invariant Preconditioning: Any preconditioning applied, such as in adaptive gradient methods, must adapt only to scales (diagonal transformations) and not to rotations. This implies avoiding operations like taking square roots of preconditioners if they break scale invariance.

  3. Scale-Invariant Updates: The step size or learning rate, denoted as $\gamma_t$, should remain unaffected by scaling transformations. This requires careful design of step-size rules.

  4. Consistent Function Values: The function values and gradients evaluated during optimization must remain consistent after scaling, ensuring identical convergence paths.

  5. Bijective Mapping: There must be a one-to-one mapping between the scaled and original parameter spaces, ensuring equivalent updates in both spaces.

These principles are critical for algorithms like SANIA AdaGrad-SQR and SANIA Adam-SQR, which maintain identical performance across scaled datasets. For proofs and further details, please refer to Appendices D.1 and D.2.
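To make condition 1 concrete, the following small numerical check (our sketch on a toy quadratic, not the authors' code; the matrices, step sizes, and step counts are hypothetical) verifies that a diag(Hessian) preconditioner without square roots yields iterates that are unaffected by a diagonal rescaling $w = Vu$, whereas plain gradient steps are not:

```python
# Minimal sketch (not the authors' code): on a quadratic, a diag(Hessian)
# preconditioner (no square root) makes the update invariant to a diagonal
# rescaling w = V u, while plain gradient steps are not.
import numpy as np

rng = np.random.default_rng(0)
d = 5
G = rng.normal(size=(d, d))
A = 0.1 * G @ G.T + np.eye(d)                     # SPD Hessian of f(w) = 0.5 w'Aw - b'w
b = rng.normal(size=d)
V = np.diag(10.0 ** rng.uniform(-1, 1, size=d))   # hypothetical bad scaling

def iterate(w0, hess, grad, diag_precond, lam, steps=10):
    """Run w <- w - lam * B^{-1} grad(w), with B = diag(hess) or B = I."""
    w = w0.copy()
    B = np.diag(np.diag(hess)) if diag_precond else np.eye(d)
    for _ in range(steps):
        w = w - lam * np.linalg.solve(B, grad(w))
    return w

grad_f  = lambda w: A @ w - b               # original coordinates
grad_fs = lambda u: V @ (A @ (V @ u) - b)   # scaled coordinates: f(V u)
w0 = rng.normal(size=d)
u0 = np.linalg.solve(V, w0)

# diag(Hessian) preconditioning: the scaled run, mapped back through V,
# reproduces the original iterates -> invariant to diagonal scaling.
w_orig   = iterate(w0, A, grad_f, True, lam=1.0)
w_mapped = V @ iterate(u0, V @ A @ V, grad_fs, True, lam=1.0)
print(np.allclose(w_orig, w_mapped))        # True

# Plain gradient steps (B = I) lose this property.
g_orig   = iterate(w0, A, grad_f, False, lam=1e-2)
g_mapped = V @ iterate(u0, V @ A @ V, grad_fs, False, lam=1e-2)
print(np.allclose(g_orig, g_mapped))        # False in general
```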

Comment

Reviewer: " At the end of Section 2, the authors use the notations: "SANIA IdI_d", "SANIA (V1)2(V^{-1})^2" and "SANIA diag(H1)\mathrm{diag}(H^{-1})", while writing that there is no preconditioning. Thus, I understand that "SANIA AA" refers to Eqn. (6) with Dt=AD_t = A. But there is no formal definition of "SANIA AA". It should appear somewhere. "

We apologize for any misunderstanding regarding the notation used to describe different preconditioning matrices for SANIA. In Section 2.4 (Affine and Scale Invariance), we discuss several matrices used for preconditioning in SANIA to illustrate scale invariance (note that the choice of $I$ as the preconditioner will not lead to invariance). Before detailing them, we would like to remind you that the update rule for these methods is described in Eq. (18):

$$w_{t+1} = w_t - \lambda_t B_t^{-1} m_t.$$

In this update rule, the selection of $B_t$ plays a crucial role in the algorithm, and Section 2.4 proposes choices for $B_t$ that ensure SANIA remains scale invariant.

  1. SANIA $\mathrm{diag}((V^{-1})^2)$: One option is to utilize the squared inverse of the vector $V$ used to scale the dataset, which simulates badly scaled data.
  2. SANIA $\mathrm{diag}(H^{-1})$: Another approach is to calculate the diagonal of the Hessian matrix.
  3. SANIA $I_d$: As a baseline ablation study, we include SANIA $I_d$, where $B_t = I_d$, effectively using no preconditioner.

Since Section 2.4 does not mention SANIA $A$, could you please clarify what you are referring to as SANIA $A$?

Comment

Dear Reviewer jmo1,

We sincerely thank you for your thorough review, positive comments on the originality and clarity of our paper, and for recognizing its significance to the optimization community. We have carefully considered your remarks, concerns, and questions, and we will do our utmost to address them effectively.

Comment

Dear Reviewer jmo1,

Thank you again for your thoughtful feedback on our paper. We have addressed all the questions and concerns you raised in our responses. We wanted to kindly follow up to see if you have any additional questions or require further clarifications. If our responses address your concerns, we would greatly appreciate it if you could consider updating your score to reflect this.

Thank you for your time and consideration.

Comment

Dear Authors,

Thank you for your answer. I have no more questions or remarks, and I have decided to maintain my rating.

Comment

Dear Reviewer jmo1,

Thank you once again for your thorough review and constructive feedback, which have significantly contributed to improving our paper. We are glad to hear that all your concerns have been addressed and that you have no further questions or remarks.

Given the originality, relevance, and correctness of our work, as well as the improvements made, such as clarifying SANIA notations, correcting Figure 1, and including training loss plots, we believe the revised submission provides substantial value to the community.

If you feel our revisions fully address your feedback, we kindly ask you to consider updating your score to reflect this. We greatly appreciate your time, effort, and support throughout this process.

Comment

We would like to sincerely thank all the reviewers for their careful reading of our submission and their thoughtful and constructive feedback. We greatly appreciate the time and effort spent on reviewing our work and for providing suggestions that will help improve our paper.

We are grateful that all reviewers recognized several key strengths of our paper. Reviewer jmo1 appreciated the originality of our work and acknowledged that it is relevant to the community, particularly highlighting the practical application of the generalized Polyak step-size in second-order methods. Reviewer zTra commended the thorough theoretical analysis and clear writing, noting the robustness and efficiency of our proposed method across diverse non-convex optimization scenarios. Reviewer zkYN emphasized the ease of integrating our framework with existing optimizers and mentioned the excellent presentation of our ideas. Finally, Reviewer iCAR highlighted the novelty of our formulation, especially the investigation into scale invariance, and acknowledged the robustness of SANIA in comparison to fine-tuned algorithms. These strengths collectively underscore the significance and potential impact of our contribution to the field.

In response to the issues raised, we have carefully addressed all the points mentioned in the reviews, including clarifications, additional discussions, and improvements to the experimental section. We have uploaded a revised version of the paper, which includes these updates and directly addresses all the comments and suggestions. We hope that these changes address your concerns and improve the overall quality of our submission.

Thank you again for your valuable feedback and suggestions. We are confident that these improvements have strengthened our paper significantly.

Comment

Dear Reviewers,

With the discussion period extended, we would like to take this opportunity to further refine and improve our paper based on your insights. We have thoroughly addressed all your questions and concerns but feel the discussion has not yet reached its full potential.

Could you kindly advise us on any additional steps we could take for you to consider revising your score?

We deeply appreciate the time and effort you have dedicated to reviewing our work and value your feedback greatly.

Best regards, Authors

AC Meta-Review

This paper studies a general framework for Preconditioned and Second-order Polyak methods. In particular, the paper proposes a first stochastic cubic Newton method with Polyak step-size and also introduces new scale-invariant versions of AdaGrad and Adam, which make them invariant to some basis transformations. The paper also provides a few experiments to demonstrate the performance of the algorithm. The improvements (if any) are not very significant, but the authors highlight that the key benefit of this work is the limited need for hyperparameter tuning.

The reviews for the paper are very much on the borderline. The reviewers, while acknowledging that the work is interesting, also highlighted the lack of comparison with schedule-free optimization and the somewhat limited empirical evidence as weaknesses. After reading through the paper, I was extremely disappointed with the empirical evidence, and I believe these weaknesses are very important to address before publication. The paper's core claim is about eliminating the need for manual step-size hyperparameter setting; however, the experiments are done on extremely small settings (where hyperparameter tuning is of not much value) and lack proper comparisons. I think this paper would greatly benefit from a thorough empirical analysis on larger settings (e.g., ImageNet, ResNet, ViT). I recommend rejection in the current form.

Additional Comments on Reviewer Discussion

The authors addressed some important concerns of the reviewers (especially related to the presentation). The authors also added additional comparisons. However, I think that the empirical analysis of the paper is extremely limited and does not support the core claims of the paper (as discussed in the meta-review).

Final Decision

Reject