Newton Losses: Using Curvature Information for Learning with Differentiable Algorithms
Abstract
Reviews and Discussion
The paper presents an alternative backpropagation scheme for deep learning with algorithmic losses that combines a preconditioned step on the loss with a gradient step on a least-squares objective. Two preconditioning methods are investigated: using the Hessian or the empirical Fisher. Experiments demonstrate that the proposed plugin consistently improves the performance of algorithms across architectures and losses on two benchmarks. An ablation study of the optional additional Tikhonov regularization is given, as well as a discussion of runtime comparisons.
Strengths
- The main strength of the paper is its experimental evaluation. Two relevant benchmarks are analyzed. 4 losses are considered in the first benchmark, 3 in the second benchmark. Ablation studies and runtime comparisons provide a rather full picture of the algorithm.
- Overall, the proposed method clearly provides gains across settings. Its theoretical motivation may be unclear, but such experimental evidence invites further research on the subject.
Weaknesses
- The soundness of the approach from a theoretical viewpoint is lacking. However, it is probably better to have clear experimental evaluations than wobbly theoretical explanations. And theoretical explanations can be given later.
Questions
- Why do the authors multiply an inverse of the Hessian with the gradients? Such an operation amounts to solving a linear system and is always best handled with matrix-free linear algebra solvers that can make use of Hessian-vector products.
- Can the approach be cast as an appropriate block-coordinate minimization of the Lagrangian associated with the composition of the loss and the network? As the authors acknowledge, if both steps in 2.a, 2.b are gradient steps, we retrieve classical gradient back-propagation, as demonstrated in [57]. Such a setting may help uncover the principles underlying the improved performance.
- I don't see how the learning rates were tuned in the experiments. Since it is generally a crucial hyperparameter for optimization methods, can the authors comment on this?
- The authors keep on mentioning vanishing/exploding gradient phenomena. Could the authors actually demonstrate this issue numerically? I suppose the vanishing/exploding gradient phenomena depends on the number of steps taken by the algorithm defining the loss. Maybe one could correlate the number of steps taken by the underlying algorithm and the improved performance of the proposed approach.
Limitations
Limitations (or rather scope) of the approach are discussed in Appendix A.
We sincerely thank the reviewer for dedicating their time to evaluating our work and for helping us improve it further. Below, we have addressed your insightful comments.
Weaknesses
The soundness of the approach from a theoretical viewpoint is lacking. However, it is probably better to have clear experimental evaluations than wobbly theoretical explanations. And theoretical explanations can be given later.
Indeed, the focus of our paper is on the experimental part rather than on a detailed theoretical analysis (4 pages of experiments compared to 2 pages for the derivation of the method).
We consider Section 3.2 to be a derivation of the method rather than a theory section with theoretical guarantees.
Questions
Why do the authors multiply an inverse of the Hessian with the gradients? Such an operation amounts to solving a linear system and is always best handled with matrix-free linear algebra solvers that can make use of Hessian-vector products.
In all our experiments, the cost of inverting the Hessian and the multiplication is negligible due to the small sizes of the involved spaces. We will add a remark mentioning the solver-based alternative option.
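For interested readers, a minimal sketch of such a matrix-free alternative (conjugate gradients on the damped system, using only Hessian-vector products; the toy loss, dimensions, and damping value below are placeholders):

```python
import torch

def hvp(loss_fn, z, v):
    # Hessian-vector product of loss_fn at z with direction v, via double backprop.
    z = z.detach().requires_grad_(True)
    (g,) = torch.autograd.grad(loss_fn(z), z, create_graph=True)
    (hv,) = torch.autograd.grad(g @ v, z)
    return hv

def cg_solve(matvec, b, iters=25, tol=1e-10):
    # Solve A x = b with conjugate gradients, using only matrix-vector products.
    x = torch.zeros_like(b)
    r = b.clone()            # residual for x = 0
    p = r.clone()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy example: compute a damped Newton direction (H + lam*I)^{-1} g
# without ever forming the Hessian H explicitly.
loss_fn = lambda z: (z.sin() + 0.5 * z ** 2).sum()   # placeholder loss
z = torch.randn(16)
(g,) = torch.autograd.grad(loss_fn(z.requires_grad_(True)), z)
lam = 1e-3                                           # Tikhonov damping
direction = cg_solve(lambda v: hvp(loss_fn, z, v) + lam * v, g)
```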
Can the approach be cast as an appropriate block-coordinate minimization of the Lagrangian associated with the composition of the loss and the network? As the authors acknowledge, if both steps in 2.a, 2.b are gradient steps, we retrieve classical gradient back-propagation, as demonstrated in [57]. Such a setting may help uncover the principles underlying the improved performance.
This is an interesting question. As the reviewer correctly points out, if both steps (2a) and (2b) are gradient steps, this is equivalent to using a gradient descent step for (1), as shown in Lemma 2.
Beyond that special case, we are not aware of whether the combination of steps (2a) and (2b) recovers any other classical approaches, such as block-coordinate minimization. Our notation may suggest a similarity to (block-)coordinate descent methods. However, unlike coordinate descent methods, we do not split our variables into sub-coordinates.
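For concreteness, here is a simplified sketch of one update, combining the Tikhonov-damped Newton step (2a) in the space of the network outputs with the gradient step (2b) on the resulting least-squares objective (PyTorch; `algorithmic_loss`, the damping value, and the dense Hessian solve are illustrative placeholders and omit details of our actual implementation):

```python
import torch

def newton_loss_update(model, x, algorithmic_loss, optimizer, lam=1e-1):
    # Step (2a): a Tikhonov-damped Newton step on the algorithmic loss, taken
    # in the space of the network outputs z, yields a target z_star.
    z = model(x).detach()
    z_req = z.clone().requires_grad_(True)
    (g,) = torch.autograd.grad(algorithmic_loss(z_req), z_req)
    H = torch.autograd.functional.hessian(algorithmic_loss, z)
    n = z.numel()
    step = torch.linalg.solve(H.reshape(n, n) + lam * torch.eye(n), g.reshape(n))
    z_star = (z.reshape(n) - step).reshape_as(z)

    # Step (2b): a plain gradient step on the squared distance between the
    # network output and the target z_star (the least-squares sub-problem).
    optimizer.zero_grad()
    mse = 0.5 * ((model(x) - z_star) ** 2).sum()
    mse.backward()
    optimizer.step()
    return mse.item()
```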
I don't see how the learning rates were tuned in the experiments. Since it is generally a crucial hyperparameter for optimization methods, can the authors comment on this?
For the sorting and ranking experiments, we use a learning rate of 0.001, which is an established value for the 100,000-step training on this benchmark. We want to point out that, as we use Adam, the optimizer is (apart from numerical precision) agnostic to constant factors in the gradient magnitude.
For the shortest path experiments, we determine the optimal initial learning rate for the baseline and then apply the same learning rate for Newton Losses; the values for the SS / PO methods and the grid from AlgoVision are provided on page 15. The learning rate decay schedule is predefined by the benchmark.
The authors keep on mentioning vanishing/exploding gradient phenomena. Could the authors actually demonstrate this issue numerically? I suppose the vanishing/exploding gradient phenomena depends on the number of steps taken by the algorithm defining the loss.
To illustrate the characteristics of the losses and their gradients, we provide 6 figures on the supplemental author response PDF page. We plot the gradient of the NeuralSort and logistic DSN losses as a function of one of the five input dimensions for the n=5 case. In the illustrated example, one can observe that both algorithms experience exploding gradients when the inputs are too far away from each other (which is also controllable via the steepness / tau), see Figures (c) / (d). Further, we can observe that the gradients themselves can be quite chaotic, making the optimization of the loss rather challenging. In Figures (e) / (f), we illustrate how using the Fisher Newton Loss recovers from the exploding gradients observed in (c) / (d).
Maybe one could correlate the number of steps taken by the underlying algorithm and the improved performance of the proposed approach.
We agree; from Table 1, we can derive such a comparison: a DSN with 10 inputs uses twice the number of steps of a DSN with 5 inputs. Here, we can see that the performance improvements with 10 inputs are greater than those in the 5-input case.
Please let us know whether you have any other questions or concerns regarding our paper and response, or whether we have successfully answered and resolved all of your questions and concerns.
I thank the authors for their answers to my questions and comments.
- Surely, the paper would benefit from a better theoretical understanding.
- That said, the paper presents an extensive set of experiments and clear improvements with the proposed method. These experimental results may serve as a basis for a better understanding of the approach from a mathematical viewpoint. The numerous experiments diagnose the approach and provide something to build upon. Though I would not argue for a strong accept, I believe the value of this experimental work should not be completely dismissed, and I maintain my score.
- Regarding the regularization comments of reviewer tok2, some recent works have analyzed the use of quadratic regularizations of the Newton method with proven convergence guarantees (see [1] and further references in [1]). Most importantly, cubic regularizations can be expensive to compute and are thus not always practical. Given that this paper is interested in practical improvements, analyzing the performance with either the empirical Fisher* or Newton's method with quadratic regularization is valuable empirically.
[1] Super-Universal Regularized Newton Method, https://arxiv.org/pdf/2208.05888v1
- To avoid further confusion about the Fisher matrix and its "empirical part", I would suggest that the authors rather speak about the second (uncentered) moments of the gradients.
Dear Reviewer nJzH,
thank you for acknowledging our rebuttal and your positive remarks.
Thanks for the reference to [1], we will add a discussion to the camera-ready. Also, thanks for the "second (uncentered) moments" suggestion; we will adjust the paper accordingly.
(Meta: Regarding your "Though I would not argue for a strong accept", we just wanted to make sure that you are aware that between the currently selected "weak accept" and a "strong accept", there is also the score "accept" this year.)
The paper proposes second-order optimization with splitting for objectives that arise as smoothed versions of hard problems such as sorting and ranking, in order to address the problem of vanishing/exploding gradients.
Strengths
The paper gives a well-written and very complete description of the algorithms for reproducibility, which is a very good thing in itself.
Weaknesses
- Insufficient experiments. I'd appreciate adding a comparison with the SFA technique from the following work, as it relies only on first-order information: https://arxiv.org/pdf/2003.02122
Questions
- Considering Newton's method in the case of non-convex objectives is a mistake, no matter how much it is regularized, as long as the regularization is L2. Have the authors considered cubic regularization? E.g., https://link.springer.com/article/10.1007/s10107-006-0706-8
- Adding to the first question, for ranking objectives, Hessians are expected to converge to zero. Have you considered an increasing learning rate schedule? I am somewhat sure that this Hessian/Fisher type of method, due to vanishing gradients, also vanishes, resulting in effectively increasing the learning rate up to . I would appreciate experiments against the first-order method with the learning rate scheduler growing up to that value.
Limitations
Yes
We sincerely thank the reviewer for dedicating their time to evaluating our work and for helping us improve it further. Below, we have addressed your insightful comments.
Weaknesses
- Insufficient experiments. I'd appreciate adding a comparison with the SFA technique from the following work, as it relies only on first-order information: https://arxiv.org/pdf/2003.02122
Thank you for this suggestion. We will include this paper in the related work. However, we were not able to find any source code for this work, and reimplementing it from scratch within the author response period is not feasible. We will contact the authors regarding the source code and, if possible, include these results in the camera-ready. Nevertheless, we would like to point out the extensiveness of our experimental results, as we apply Newton Losses to 11 algorithms, which comprises direct comparisons to the methods of 6 published articles (and on the benchmarks that these works utilize).
Questions
Considering Newton's method in the case of non-convex objectives is a mistake, no matter how much it is regularized, as long as the regularization is L2. Have the authors considered cubic regularization? E.g., https://link.springer.com/article/10.1007/s10107-006-0706-8
We thank the reviewer for highlighting Nesterov's work on cubic regularization. As noted in the mentioned paper, various regularization techniques can be applied when using the Newton method for non-convex or ill-conditioned problems. The most classical technique is Tikhonov regularization (or Levenberg-Marquardt regularization). Given that our Tikhonov-regularized version already demonstrates good empirical performance, we did not explore more sophisticated regularization techniques such as cubic regularization. However, this is indeed an interesting direction for future research. We also want to point out that we always include results with the empirical Fisher, which, by definition, is always positive semi-definite.
Adding to the first question, for ranking objectives, Hessians are expected to converge to zero. Have you considered an increasing learning rate schedule? I am somewhat sure that this Hessian/Fisher type of method, due to vanishing gradients, also vanishes, resulting in effectively increasing the learning rate up to . I would appreciate experiments against the first-order method with the learning rate scheduler growing up to that value.
As the optimizer for all experiments in this paper is the Adam optimizer, such effects can largely be excluded, as multiplying the loss by a constant factor does not affect the updates (apart from numerical precision effects). Using the Adam optimizer is the established experimental setting on each of these benchmarks. In Adam, the sign and the relative magnitude of the gradient compared to previous steps are the important factors. Thus, if the gradient magnitudes increase or decrease over time by a certain factor, it should not significantly affect training, as Adam normalizes for gradient magnitudes over time (via the square root of the uncentered second-moment estimate). In an ablation study conducted prior to writing the paper (not included in the paper), we multiplied the loss value (and thus also the gradient) by constant factors and observed that this had no effect on optimization.
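To illustrate this invariance with a self-contained toy example (not one of the experiments from the paper; the model and constants below are arbitrary):

```python
import torch

# Two identical linear models trained with Adam on the same data; for the second
# one the loss is multiplied by a constant factor of 10. Up to the eps term in
# Adam's denominator, the parameter trajectories coincide.
torch.manual_seed(0)
x, y = torch.randn(32, 8), torch.randn(32)
w1 = torch.nn.Parameter(torch.randn(8))
w2 = torch.nn.Parameter(w1.detach().clone())
opt1 = torch.optim.Adam([w1], lr=1e-3)
opt2 = torch.optim.Adam([w2], lr=1e-3)

for _ in range(200):
    for w, opt, scale in [(w1, opt1, 1.0), (w2, opt2, 10.0)]:
        opt.zero_grad()
        loss = scale * ((x @ w - y) ** 2).mean()
        loss.backward()
        opt.step()

print(torch.allclose(w1, w2, atol=1e-5))  # expected: True (up to Adam's eps term)
```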
Please let us know whether you have any other questions or concerns regarding our paper and response, or whether we have successfully answered and resolved all of your questions and concerns.
We thank the reviewer for highlighting Nesterov's work on cubic regularization. As noted in the mentioned paper, various regularization techniques can be applied when using the Newton method for non-convex or ill-conditioned problems. The most classical technique is Tikhonov regularization (or Levenberg-Marquardt regularization). Given that our Tikhonov-regularized version already demonstrates good empirical performance, we did not explore more sophisticated regularization techniques such as cubic regularization. However, this is indeed an interesting direction for future research.
That is the point: you'll need to over-regularize, offsetting your negative spectrum entirely, which happens to be as large as the positive one, as smoothed objectives are necessarily hyperbolic in nature. So while "you can", it is an absolutely meaningless type of regularization in this case.
As the optimizer for all experiments in this paper is the Adam optimizer, such effects can largely be excluded, as multiplying the loss by a constant factor does not affect the updates (apart from numerical precision effects).
It's about loss <-> lr scaling, not about the fact that it actually will have an effect if you make learning rate scheduled
I'll keep my score as is.
Dear Reviewer tok2,
thank you very much for responding to our rebuttal, and for your clarification.
That is the point: you'll need to over-regularize, offsetting your negative spectrum entirely, which happens to be as large as the positive one, as smoothed objectives are necessarily hyperbolic in nature. So while "you can", it is an absolutely meaningless type of regularization in this case.
We would like to kindly point out that the empirical Fisher is always PSD, which resolves this concern for the Newton Loss (Fisher). For the Hessian case, you are right that it could require over-regularizing, but the empirical results show that using the Hessian still works very well, in most cases even better than the Fisher variant.
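For concreteness, the empirical Fisher is an average of outer products of per-sample gradients, $F = \frac{1}{B}\sum_i g_i g_i^\top$, so $v^\top F v = \frac{1}{B}\sum_i (g_i^\top v)^2 \ge 0$ for any $v$. A small self-contained check (the per-sample loss below is a stand-in, not one of the algorithmic losses from the paper):

```python
import torch

# The empirical Fisher is the average of outer products of per-sample gradients,
# F = (1/B) * sum_i g_i g_i^T, and is therefore positive semi-definite.
torch.manual_seed(0)
B, n = 32, 5
z = torch.randn(B, n, requires_grad=True)
weights = torch.randn(B, n)
per_sample_loss = (z.softmax(dim=-1) * weights).sum(dim=-1)  # stand-in per-sample loss

grads = []
for i in range(B):
    (g,) = torch.autograd.grad(per_sample_loss[i], z, retain_graph=True)
    grads.append(g[i])          # gradient w.r.t. the i-th sample's inputs only
G = torch.stack(grads)          # shape (B, n)
F = G.T @ G / B                 # empirical Fisher, shape (n, n)

print(torch.linalg.eigvalsh(F).min() >= -1e-8)  # PSD up to numerical precision
```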
[...] resulting in effectively increasing the learning rate up to . I would appreciate experiments against the first-order method with the learning rate scheduler growing up to that value.
[...] about the fact that it actually will have an effect if you make learning rate scheduled
We apologize, as we might have misunderstood you earlier. To address this concern with concrete experiments, we have now run a set of experiments with learning rate scheduling. In particular, we schedule the learning rate, as suggested, to grow from the initial learning rate up to the suggested value. We use both linear as well as exponential growth for the learning rate. Further, we use NeuralSort (n=5) and ran experiments for two values of λ, covering the λ values of both the Hessian Newton Loss and the Fisher Newton Loss. We ran each experiment for 5 seeds.
| Method | Full Rankings Acc. | (Individual Ranks Acc.) |
|---|---|---|
| Baseline | 71.33±2.05 | 87.10±0.96 |
| Hessian Newton Loss (ours) | 83.31±1.70 | 92.54±0.73 |
| Fisher Newton Loss (ours) | 83.93±0.62 | 92.80±0.30 |
| GrowLR (, linear) | 69.99±3.17 | 86.39±1.58 |
| GrowLR (, exponential) | 70.66±1.21 | 86.73±0.58 |
| GrowLR (, linear) | 68.44±3.40 | 85.68±1.64 |
| GrowLR (, exponential) | 68.19±1.73 | 85.48±0.86 |
This shows that introducing a learning rate scheduler as suggested does not improve performance, and that the method is not "resulting in effectively increasing the learning rate up to ".
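For reference, the growth schedules can be implemented, e.g., as follows (the target factor, step count, and model below are placeholders, not the exact values used for the table):

```python
import torch

def growth_scheduler(optimizer, mode, target_factor, total_steps):
    # Grows the learning rate from its initial value up to target_factor * lr_0
    # over total_steps, either linearly or exponentially.
    if mode == "linear":
        lr_lambda = lambda t: 1.0 + (target_factor - 1.0) * min(t, total_steps) / total_steps
    else:  # "exponential"
        lr_lambda = lambda t: target_factor ** (min(t, total_steps) / total_steps)
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(5, 5)                               # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # base learning rate
scheduler = growth_scheduler(optimizer, "linear", target_factor=10.0, total_steps=100_000)
# Training loop: loss.backward(); optimizer.step(); scheduler.step()
```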
In case it is relevant to you, we also extend the table by loss factors (multiplying the loss by 0.1 or 10, and thus effectively multiplying the gradient by the same factor), which shows that loss factors do not have any significant effect.
| Method | Full Rankings Acc. | (Individual Ranks Acc.) |
|---|---|---|
| Loss factor 0.1 | 72.04±1.24 | 87.43±0.61 |
| Baseline (loss factor 1.0) | 71.33±2.05 | 87.10±0.96 |
| Loss factor 10.0 | 71.43±1.25 | 87.17±0.60 |
The paper proposes a new method to optimize complex, possibly non-smooth and algorithmic losses for neural networks. The approach is based on splitting the problem into a two-step procedure, where the first step constructs and optimizes the so-called Newton loss, and the second step is an SGD-type procedure on an MSE loss toward the result of the first step. The authors present a wide experimental comparison of the proposed Fisher and Newton approaches with existing methods.
Strengths
The paper has a strong literature review and motivation for solving different applications. The experimental section is well described, it contains 4 different non-trivial problems to solve. For the presented cases, the proposed methods outperform the baselines.
Weaknesses
The paper does not contain any proofs or convergence guarantees. The mathematical formulation of the main problem is also quite confusing for me. For example, is the vector fixed for all steps, or is it a batch of data? Is it a sum-type problem or an expectation problem? What are the properties of ? Is it differentiable, smooth? Some parts of the text say that the loss is non-smooth, and later we calculate the Hessian of such a function. In Formulas 1 and 2, it is not clear which are the fixed parameters or data. Should in 2a be $\theta_{t-1}$? Also, I think the mention of Lemma 2 in the main text could be very helpful.
For the experimental section, personally, it feels that most of the space is taken by the description of the problems and the setup and not by the actual comparison. As the paper is mostly experimental and empirical, one would expect a better comparison of the proposed methods with the multiple benchmarks. There are no convergence figures with the per-iteration or per-gradient performance. As the authors claim, the main issues in the existing approaches are vanishing and exploding gradients. However, I didn't find any clipping method in the comparison, which is a possible solution for exploding gradients.
Questions
See Weaknesses
Limitations
n/a
We sincerely thank the reviewer for dedicating their time to evaluating our work and for helping us improve it further. Below, we have addressed your insightful comments.
Weaknesses
The paper does not contain any proofs or convergence guarantees.
It is important to keep in mind that the methods to which we apply Newton Losses do not come with any theoretical convergence guarantees by themselves, even for plain SGD optimization, which stems from the non-convexity and complexity of such algorithmic loss functions. Providing convergence guarantees for a simplification that lies outside the space of our considered losses, e.g., a convex original loss function, while trivial, would be rather misleading in our opinion.
For example, is the vector fixed for all steps, or is it a batch of data?
We apologize for any confusion. The input vector throughout our paper refers to a batch of data. We will clarify this in the revised version.
Is it a sum-type problem or an expectation problem?
We generally consider non-separable loss functions. For example, for the ranking supervision experiments, the relation between individual samples and the sorted population is considered. Thus, these ranking losses cannot be expressed as a sum over individual samples. Between independent sets, however, the loss can be expressed as an expectation.
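To illustrate this non-separability with a small example (a NeuralSort-style sketch with made-up data, not our exact implementation): the loss below couples all samples of a set through a relaxed permutation matrix and therefore does not decompose into per-sample terms.

```python
import torch

def neuralsort(s, tau=1.0):
    # NeuralSort-style relaxed (row-stochastic) permutation matrix for scores s of shape (n,).
    n = s.shape[0]
    A = (s.unsqueeze(0) - s.unsqueeze(1)).abs()                          # pairwise |s_j - s_i|
    col_sums = A.sum(dim=0, keepdim=True)                                # sum_k |s_j - s_k|, shape (1, n)
    scaling = (n + 1 - 2 * torch.arange(1, n + 1)).float().unsqueeze(1)  # (n + 1 - 2i), shape (n, 1)
    return torch.softmax((scaling * s.unsqueeze(0) - col_sums) / tau, dim=-1)

# Ranking-supervision loss over one set of n samples: cross-entropy between the
# relaxed permutation of the predicted scores and the ground-truth permutation.
# Every entry of P_hat depends on all n scores, so the loss does not decompose
# into a sum of per-sample terms.
scores = torch.randn(5, requires_grad=True)
true_perm = torch.eye(5)[torch.randperm(5)]
P_hat = neuralsort(scores)
loss = -(true_perm * torch.log(P_hat + 1e-9)).sum()
loss.backward()
```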
What are the properties of ? Is it differentiable, smooth? Some parts of the text say that the loss is non-smooth, and later we calculate the Hessian of such a function.
We apologize for the confusion. We assume that the loss function is twice differentiable, ensuring the existence of a Hessian, or at least once differentiable, ensuring the existence of the gradient and the empirical Fisher matrix. For settings with non-smooth objectives (e.g., the shortest-path problem), we apply smoothing techniques to approximate the loss with a smooth counterpart, such as stochastic smoothing (see Section 4.2.2), such that the loss function to which we apply Newton Losses is smooth. We will clarify this in the revised version.
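As a sketch of what such a stochastic-smoothing surrogate can look like in general (the placeholder `hard_loss`, the noise scale, and the sample count are illustrative; this generic Gaussian-smoothing estimator is not our exact setup for the shortest-path experiments):

```python
import torch

def smoothed_loss_and_grad(hard_loss, z, sigma=0.1, n_samples=64):
    # Monte Carlo estimate of the Gaussian-smoothed loss E_eps[ hard_loss(z + sigma*eps) ]
    # and of its gradient via the score-function identity
    #   grad_z E[ hard_loss(z + sigma*eps) ] = E[ hard_loss(z + sigma*eps) * eps ] / sigma.
    eps = torch.randn(n_samples, *z.shape)
    values = torch.stack([hard_loss(z + sigma * e) for e in eps])   # shape (n_samples,)
    loss = values.mean()
    centered = values - values.mean()                               # baseline for variance reduction
    grad = (centered.view(-1, *([1] * z.dim())) * eps).mean(dim=0) / sigma
    return loss, grad

# Placeholder non-smooth objective; a shortest-path cost would take its place.
hard_loss = lambda w: w.round().abs().sum()
z = torch.randn(12)
loss, grad = smoothed_loss_and_grad(hard_loss, z)
```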
In Formulas 1 and 2, it is not clear which are the fixed parameters or data. Should in 2a be $\theta_{t-1}$?
Thank you for spotting this typo. Indeed, in Equation (2a), it should be $\theta_{t-1}$. We will fix it.
Also, I think the mention of Lemma 2 in the main text could be very helpful.
Thank you for the suggestion. We will provide a discussion of Lemma 2 in the main text in the revised version.
For the experimental section, personally, it feels that most of the space is taken by the description of the problems and the setup and not by the actual comparison. As the paper is mostly experimental and empirical, one would expect a better comparison of the proposed methods with the multiple benchmarks.
Thank you for this suggestion. As we have an additional page for the camera-ready, we will extend the discussion of the results and actual comparisons in the camera-ready; however, we feel that it remains important to maintain the detailed discussions of the benchmarks as well as continuous relaxations that we apply Newton Losses to.
There are no convergence figures with the per-iteration or per-gradient performance.
Could you please clarify what precisely you would like us to plot, considering that we show a per-epoch convergence plot in Figure 3?
As the authors claim, the main issues in the existing approaches are vanishing and exploding gradients. However, I didn't find any clipping method in the comparison, which is a possible solution for exploding gradients.
Thank you for this suggestion. On these types of algorithmic loss benchmarks, to our knowledge, gradient clipping has not been previously considered in the literature. If you have a corresponding reference, we would greatly appreciate it. Nevertheless, we have implemented gradient clipping for Table 1:
| n | NeuralSort | SoftSort | Logistic DSN | Cauchy DSN |
|---|---|---|---|---|
| n=5 | | | | |
| n=10 | | | | |
The results show that gradient clipping yields consistent improvements for the first 3 algorithmic losses, but unfortunately it degrades Cauchy DSNs, even compared to the vanilla method. These results are averaged over 5 seeds. In each case, Newton Losses perform substantially better than gradient clipping.
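For readers who wish to reproduce this comparison, a minimal sketch of gradient clipping in a training step (global-norm clipping; the threshold below is a placeholder, not the value tuned for our runs):

```python
import torch

def train_step(model, optimizer, loss_fn, x, y, max_norm=1.0):
    # Standard training step with global-norm gradient clipping before the update.
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```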
Please let us know whether you have any other questions or concerns regarding our paper and response, or whether we have successfully answered and resolved all of your questions and concerns.
Dear Authors,
thank you for the detailed answers and comments.
Could you please clarify what precisely you would like us to plot, considering that we show a per-epoch convergence plot in Figure 3?
The same as Figure 3, but for all problems. They can be presented in the Appendix for the lack of space. Also, the dependence on gradient/Hessian-vector products or time performance could be interesting to have.
Regarding the rebuttals, I would like to increase my score from 4 to 5.
Dear Reviewer qoSU,
Thank you very much for responding to our rebuttal and for increasing your score.
To address your clarification regarding the plot request, we offer to include the same as Figure 3, but for all problems, in the appendix of the camera-ready. We can also produce additional versions with time on the x-axis. Regarding plotting the "dependence on gradient/Hessian-vector products", could you clarify what you mean, such that we can also include it in the appendix? Was "dependence on gradient/Hessian-vector products" meant as an equivalent to the "time performance", or was it, e.g., the number of floating-point operations?
Best regards,
The Authors
Dear Authors,
Thank you! Regarding the "dependence on gradient/Hessian-vector products", yes, I meant it as an alternative to the time plots, as some authors prefer this type of plot over the time-based one. I am not sure what the best choice is for your paper; that's why I listed both of them. Thank you!
Dear Reviewer qoSU,
Thank you very much for the clarification! Accordingly, we will include both per-epoch and per-time plots in the appendix of the camera-ready version.
Please let us know in case any further questions or concerns come up.
1-page rebuttal attachment: Illustrations of the gradient of the NeuralSort and logistic DSN losses.
Given my own evaluation of this work and taking into account the reviewers' expertise, I recommend accepting this interesting submission.