PaperHub
7.0 / 10
Spotlight · 4 reviewers
Ratings: 6, 7, 7, 8 (min 6, max 8, std dev 0.7)
Confidence: 3.5 · Correctness: 2.8 · Contribution: 3.3 · Presentation: 3.3
NeurIPS 2024

Neglected Hessian component explains mysteries in sharpness regularization

OpenReview · PDF
Submitted: 2024-05-16 · Updated: 2024-11-06
TL;DR

Understanding the neglected indefinite part of the Hessian explains important phenomena in sharpness regularization

Abstract

Keywords
sharpness, flatness, regularization

Reviews and Discussion

Review
Rating: 6

It is known that SAM can improve generalization, while weight noise and gradient penalties often fail to. This work reports that the structure of the Hessian can explain the inconsistency, identifying the key role of the NME. It first studies gradient penalties and shows that methods using second-order information are sensitive to activation functions because of the NME. Moreover, since the NME matters for Hessian penalties, weight noise, which minimizes the NME, often behaves poorly.

Strengths

  • This work reports a novel and interesting conclusion that the Hessian structure, particularly the overlooked NME, can explain the poor performance of gradient penalties and Hessian penalties. This is significant for understanding the second order information in optimization and generalization of deep learning.

  • The theoretical analysis seems clear and reasonable.

  • This work empirically verified the theoretical results and identified the key role of NME.

Weaknesses

  • The empirical results only include ResNets. Might the empirical conclusions also depend on the model architecture? How about results for very simple models (FCN/LR) and very complex models (Transformers)? Note that they have very different loss landscapes.

  • Training with weight noise only implicitly penalizes the Hessian. This work often conflates Hessian penalties and weight noise.

  • This work provides insights but does not show how to improve or design better second-order methods. Doing so would further support its conclusions.

Questions

Please see the weaknesses.

Limitations

This work did not discuss the limitations.

Author Response

We thank the reviewer for their questions, and answer some here.

The empirical results only include ResNets. Might the empirical conclusions also depend on the model architecture? How about results for very simple models (FCN/LR) and very complex models (Transformers)? Note that they have very different loss landscapes.

We agree that studying more architectures and datasets would be interesting; for this initial work we focused on very detailed experiments in our chosen settings rather than shallow experiments across a broad range of settings.

Training with weight noise only implicitly penalizes the Hessian. This work often conflates Hessian penalties and weight noise.

In Sections 5.1 and 5.2, we review the links made in the literature between weight noise and Hessian penalties in order to frame our work. However, in all our experiments we make the procedure clear, and we mainly focus on explicit regularization of the Hessian/Gauss-Newton trace.

This work provides insights but does not show how to improve or design better second-order methods. Doing so would further support its conclusions.

We agree that additional research is needed to further improve second-order methods and regularizers; however, our work shows that the Gauss-Newton trace is another useful regularizer. In addition, our synthetic second-derivative approach in Section 4.5 shows how to overcome the poor second derivatives of ReLU in a practical, efficient way.
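The activation-sensitivity point can be illustrated with a short numerical sketch (an illustrative snippet, not code from the paper): ReLU's second derivative is zero almost everywhere, so it contributes no curvature through the activation, whereas GELU's second derivative is nonzero.

```python
import numpy as np

def second_derivative(f, x, h=1e-3):
    # Central finite-difference estimate of f''(x).
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h**2

relu = lambda x: np.maximum(x, 0.0)
# tanh approximation of GELU (Hendrycks & Gimpel).
gelu = lambda x: 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

for x in (-1.0, 0.5, 2.0):
    # ReLU's estimate is ~0 away from the kink; GELU's is nonzero, so
    # GELU can carry activation curvature where ReLU cannot.
    print(f"x={x:+.1f}  relu''~{second_derivative(relu, x):+.4f}  gelu''~{second_derivative(gelu, x):+.4f}")
```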

Comment

Thanks for the rebuttal. It addressed some of my concerns.

I tend to keep the rating of this work as Weak Accept.

Review
Rating: 7

The authors examine the performance of training methods for neural networks that utilize (approximate) second-order information. They note that often only the curvature of the loss function is taken into account (rather than that of the model) and demonstrate that this approach leads to drawbacks in training, in particular regarding sharpness regularization such as gradient penalties / weight noise. They examine the effect of including / excluding second-order information in the form of regularizers on generalization, based on numerical experiments.

Strengths

The authors demonstrate a problem in learning regularizations that has so far not been addressed, namely the varying performance of the regularizers and the mixed success of, for example, gradient norm penalties. The observations made have the potential to aid in the development of learning algorithms.

Weaknesses

In terms of the use of second derivatives, the manuscript sends a bit of a mixed message. The initial experiment in Section 4 seems to show that accurate second-order information about the model is useful when using gradient penalties in learning algorithms, while the Hessian penalties in Section 5 seem to paint a different picture, where the full Hessian trace penalty performs the worst, so neglecting Hessian components may in fact not be such a bad idea in general.

Questions

  • 83 - 84: It would be helpful to reformulate the definition of the trace to make clear that the trace being the sum of eigenvalues is an established result rather than a definition of the trace operator, to avoid confusion.
  • 123: The function L changes its meaning halfway through the manuscript. Initially, it depends on z and y, i.e., on the model output and the labels. Starting with line 123 it becomes a function whose sole input is a vector of weights. I think this is supposed to be L(z(theta, x), y) for fixed training data (x, y). Please introduce another function to make this clearer to the reader.
  • Some abbreviations could be given once in full in order to jog the memory of readers who are not very familiar with the precise context (NTK, ReLU / GELU)
  • 143: Should this be SGD (there are no batches to be seen) or normal gradient descent?
  • 226: "Minimizing the NME is detrimental" Shouldn't this state that "ignoring" the NME is a problem?

Limitations

NA

Author Response

We thank the reviewer for their comments, and will fix the errors noted. We address some selected concerns below.

In terms of the use of second derivatives the manuscript sends a bit of a mixed message.

The overall point about second derivatives is subtle. Our work suggests that second derivatives in the update rule are helpful (the full NME information enters through Hessian-vector products). In contrast, the full Hessian trace penalty has second-derivative information in the regularizer; minimizing this information during optimization reduces the effect of second derivatives in the update rule. It also introduces third derivatives into the update rule, and these higher-order derivatives in the updates seem to hurt training. It remains an open question whether other forms of higher-order derivatives are useful.
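To make this concrete, here is a small illustrative sketch (hypothetical toy code, not from the paper) of how second derivatives enter the update rule when a gradient-norm penalty is used: the gradient of 0.5 * ||grad L||^2 is the Hessian-vector product H @ grad L, estimated below with finite differences.

```python
import numpy as np

def grad(f, x, h=1e-5):
    # Central-difference gradient of a scalar function f at x.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

def penalty_grad(f, x, h=1e-5):
    # d/dx [0.5 * ||grad f(x)||^2] = H(x) @ grad f(x): a Hessian-vector
    # product, i.e. second derivatives appear in the update rule.
    g = grad(f, x)
    n = np.linalg.norm(g)
    if n == 0.0:
        return np.zeros_like(x)
    v = g / n
    hvp = (grad(f, x + h * v) - grad(f, x - h * v)) / (2.0 * h)  # H @ v
    return n * hvp  # H @ g

# Toy quadratic: f(x) = 0.5 x^T A x, so H = A, grad f = A x, H @ grad f = A A x.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
f = lambda x: 0.5 * x @ A @ x
x = np.array([1.0, -2.0])
print(penalty_grad(f, x))  # matches A @ A @ x = [0., -5.]
```

Penalizing the Hessian trace itself would push these second derivatives into the regularizer instead, so differentiating the penalty brings third derivatives into the update, as the response notes.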

143: Should this be SGD (there are not batches to be seen) or a normal gradient descent?

We wrote the rule for a single batch; in practice this update rule would be combined with batching/SGD.

226: "Minimizing the NME is detrimental" Shouldn't this state that "ignoring" the NME is a problem?

This should read “Minimizing the NME trace is detrimental”, since the GN trace penalty (which has no NME in the regularizer) works well.

Comment

Thank you for the response. Barring these small concerns, I think the paper provides interesting insights into regularization techniques, and I recommend its acceptance.

Review
Rating: 7

This paper investigates the importance of considering second order information, specifically the structure of the Hessian of the loss, in deep learning. It decomposes the Hessian into the Gauss-Newton matrix and the Nonlinear Modeling Error (NME) matrix, with focus on the often-overlooked NME. Through empirical and theoretical evidence, the study demonstrates the significance of the NME in the performance of gradient penalties and their sensitivity to activation functions. The difference in regularization performance between gradient penalties and weight noise is also attributed to the NME. The findings underscore the need to consider the NME in experimental design and theoretical analysis for sharpness regularization, potentially leading to new classes of second order algorithms that utilize the loss landscape geometry differently.

Strengths

  1. The paper is clearly written and easy to follow.

  2. The paper focuses on an important topic in the area of sharpness learning, which is worthy of investigation.

  3. The paper provides a novel understanding of the connections between SAM and gradient norm penalty through the perspective of NME, offering interesting insights.

  4. The paper highlights the pitfalls of using different activations with the Hessian, providing valuable guidance for practical training.

Weaknesses

  1. How do you solve SAM? Have you also neglected the second-order term in your empirical analysis? What would be the effect if this term were not neglected in your situation?

  2. Could you provide some demonstrations regarding the off-diagonal elements in the NME? It is suggested that these elements may also play an important role.

  3. I recommend that the theoretical analysis be expressed in a more formal and clear style.

  4. There are a couple of typos: Line 133, "p = 1" -> "ρ = 1"; Line 208, "if this is link" -> "if this link".

Questions

See Weakness.

Limitations

I have not found any discussions about the limitations and potential negative societal impact. But in my opinion, this may not be a problem, since the work only focuses on understanding the sharpness learning. Still, it is highly encouraged to add corresponding discussions.

Author Response

We would like to thank the reviewer for their insightful review. We address the reviewer's questions and suggestions below.

How do you solve SAM? Have you also neglected the second-order term in your empirical analysis? What would be the effect if this term were not neglected in your situation?

Our experimental results for SAM are based on the SAM algorithm proposed by Foret et al. Specifically, the SAM update does not utilize any Hessian (but does evaluate an extra gradient). The reason the Hessian is absent from the SAM update is due to the approximation employed by Foret et al (using stop_gradient on the adversarial perturbation). Therefore, our implementation of SAM, which aligns with that of Foret et al, indeed neglects the Hessian. We have not investigated the implications of retaining the Hessian in the SAM formulation because this was already explored by Foret et al. They conducted this experiment and observed that by keeping the Hessian, SAM's performance slightly decreases compared to when the Hessian is neglected (which is consistent with our story). This is discussed in Figure 4 of their paper.
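As a point of reference for the approximation described here, the SAM update of Foret et al. can be sketched as follows (a toy numpy sketch under our own simplifying assumptions, not the paper's implementation). The adversarial perturbation is treated as a constant, so no Hessian term enters the update:

```python
import numpy as np

def sam_step(theta, grad_fn, rho=0.05, lr=0.1):
    # SAM as in Foret et al.: perturb along the normalized gradient, then
    # apply the gradient evaluated at the perturbed point while treating
    # the perturbation as a constant (the stop_gradient approximation),
    # so no Hessian term appears in the update.
    g = grad_fn(theta)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    g_adv = grad_fn(theta + eps)
    return theta - lr * g_adv

# Toy quadratic bowl: loss = 0.5 theta^T A theta, grad = A theta.
A = np.diag([4.0, 1.0])
grad_fn = lambda th: A @ th
theta = np.array([2.0, -3.0])
for _ in range(100):
    theta = sam_step(theta, grad_fn)
print(np.linalg.norm(theta))  # small: iterates settle near the minimum
```

Retaining the Hessian would mean differentiating through eps as well, which is the variant Foret et al. found to perform slightly worse.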

To derive PSAM from SAM Equation 6, we employ a first-order approximation as shown in Equations 9-11. This differs from the approximation used in the original SAM algorithm [Foret et al., 2021]. While exploring a second-order approximation could be a promising avenue, it falls outside the scope of this current work.

Could you provide some demonstrations regarding the off-diagonal elements in the NME? It is suggested that these elements may also play an important role.

The experiments in Section 4.4 show that the off-diagonal elements of the NME by themselves are not sufficient to obtain good results with PSAM. In these experiments we artificially zero out the diagonal elements of the NME of GELU. It would be interesting to see how well PSAM with ONLY the diagonal elements performs, but we did not have time in the rebuttal period to design and run this experiment.

I recommend that the theoretical analysis be expressed in a more formal and clear style.

Thank you for your suggestion. We will improve the style of the theoretical analysis and address the typos you identified.

Comment

Thanks for the authors' kind response. Having no further concerns, I believe that this paper serves as an excellent resource for exploring gradient regularization and SAM, and I strongly recommend its acceptance.

Review
Rating: 8

This paper studies the influence of the second-order component of the Hessian in sharpness-aware minimization and optimization methods that involve gradient penalties. First, they show that the Hessian decomposes into the component of the Hessian that people usually consider [GN] and a term that includes the second derivative with respect to the parameters [NME]. From there, they demonstrate that a gap between penalty sharpness-aware minimization and general sharpness-aware minimization exists when activations are ReLU but not when activations are GeLU. They study how the NME explains this gap by showing that ablating the NME component in GeLU recovers the ReLU generalization. Further, they also show that including a term related to the NME in the ReLU case removes this gap. Finally, they ask the question of whether explicitly accounting for the NME in sharpness-aware minimization / gradient-penalty via weight noise can improve generalization of solutions, finding that it cannot in general. In the end, this work, to my understanding, deepens our understanding of the role of the usually-ignored NME term in various settings.

Strengths

  • identifies a part of the Hessian that is not usually considered in sharpness-aware minimization and isolates its effect in various settings, showing that in some cases it is relevant for understanding the behavior, while in others there are reasons not to include it in algorithms.
  • theoretical analysis as well as experiments on real-scale datasets
  • appropriate ablations to study the effects of terms, mainly on outcome. I would be interested in future work looking at dynamics, as well!
  • I like that the work studies the nuances of the effects of the NME, not just tells a single story and leaves it at that.

Weaknesses

Overall, I feel the work studies the questions asked in detail. See some questions / writing suggestions below.

Questions

  • In section 2 when deriving the NME, can you include an explicit example where the NME is large? Either analytically or a plot demonstrating the evolution of, e.g., its Frobenius norm over the course of training.
  • small thing: perhaps when you mention the 3 datasets, reverse the order to have them in increasing order of difficulty
  • One thing you do not discuss but that I would want to know more about: the PSAM method should only work when rho is small. And indeed, when rho is small, the methods match even for ReLU. Of course, since GeLU shows the match even at the values of rho where there is a gap for ReLU, the size of rho does not seem to be the determining factor. But could you still comment on how to know whether the issue is just the size of rho?
  • any comments on dynamics of training with and without the additional penalty in section 5.2? Seems like that might also be interesting.

Limitations

the work analyzes the effect of the NME in various settings and is clear about the particular settings which it studies.

Author Response

We thank the reviewer for their thorough review. We address the reviewer's questions and suggestions below.

In section 2 when deriving the NME, can you include an explicit example where the NME is large? Either analytically or a plot demonstrating the evolution of, e.g., its Frobenius norm over the course of training.

The suggestion to identify specific examples where the NME is large is valuable. We are actively investigating this, but due to time constraints in preparing the rebuttal we don’t have definite results yet.

small thing: perhaps when you mention the 3 datasets, reverse the order to have them in increasing order of difficulty

We will change the order as you suggested.

One thing you do not discuss but that I would want to know more about: the PSAM method should only work when rho is small. And indeed, when rho is small, the methods match even for ReLU. Of course, since GeLU shows the match even at the values of rho where there is a gap for ReLU, the size of rho does not seem to be the determining factor. But could you still comment on how to know whether the issue is just the size of rho?

Further evidence that the size of rho isn't the sole factor can be found in Figure 3, which shows that the gap for ReLU can be narrowed even at larger rho values by using a synthetic activation NME.

Comment

Thanks for your response! And Figure 3 is noted, thanks!

Final Decision

The paper investigates the inconsistencies observed in the performance of sharpness-aware minimization compared to other regularization techniques like weight noise and gradient penalties in deep learning. The authors reveal that these inconsistencies are connected to the structure of the Hessian of the loss function, specifically its decomposition into the Gauss-Newton matrix and an indefinite matrix termed the Nonlinear Modeling Error (NME) matrix.
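As a concrete illustration of this decomposition (a hypothetical one-neuron example, not taken from the paper), one can verify numerically that for a squared loss the parameter Hessian is exactly the Gauss-Newton term plus the NME term:

```python
import numpy as np

# One-neuron model z(theta) = w2 * tanh(w1 * x) with squared loss
# L = 0.5 * (z - y)^2. The parameter Hessian splits into the Gauss-Newton
# term dz dz^T (loss curvature) and the NME term (z - y) * d2z, which
# carries the model's own second derivatives.
x, y = 0.7, 0.3
w1, w2 = 1.2, -0.8

t = np.tanh(w1 * x)
z = w2 * t
dz = np.array([w2 * x * (1 - t**2), t])                    # dz/dtheta
d2z = np.array([[-2 * w2 * x**2 * t * (1 - t**2), x * (1 - t**2)],
                [x * (1 - t**2), 0.0]])                    # d2z/dtheta^2

gn = np.outer(dz, dz)       # Gauss-Newton (L''(z) = 1 for squared loss)
nme = (z - y) * d2z         # Nonlinear Modeling Error

def loss(w):
    return 0.5 * (w[1] * np.tanh(w[0] * x) - y) ** 2

# Full Hessian of the loss by central finite differences, for comparison.
h, w0 = 1e-4, np.array([w1, w2])
H = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        ei, ej = np.eye(2)[i] * h, np.eye(2)[j] * h
        H[i, j] = (loss(w0 + ei + ej) - loss(w0 + ei - ej)
                   - loss(w0 - ei + ej) + loss(w0 - ei - ej)) / (4 * h**2)

print(np.max(np.abs(gn + nme - H)))  # ~0: GN + NME recovers the Hessian
```

The same identity, with the loss Hessian in place of the scalar L''(z), underlies the GN/NME split discussed in the reviews.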

Given that all referees were positive regarding this work, I recommend acceptance to NeurIPS 2024. Nevertheless, I recommend following the referees' suggestions for the camera-ready version, especially:

  • identify specific examples where the NME is large

  • rewrite the parts where reviewers saw inconsistencies

  • investigate the role of off-diagonal elements.