Tilted Sharpness-Aware Minimization
We propose a theoretically principled optimization objective, along with algorithms to solve it, that generalizes SAM and achieves better performance.
Abstract
Reviews and Discussion
This paper introduces Tilted Sharpness-Aware Minimization (TSAM), a novel extension of Sharpness-Aware Minimization (SAM) designed to further enhance generalization in deep learning models. While SAM aims to minimize the worst-case loss in a local neighborhood, it overlooks many neighboring regions that may also contribute significant losses. TSAM addresses this limitation by exponentially tilting the loss landscape, assigning greater weight to neighbors with higher losses. The authors theoretically demonstrate that as the tilt scalar t increases, TSAM favors flatter minima. Additionally, they prove that TSAM achieves a tighter generalization bound than SAM for modest t. Extensive experiments validate the effectiveness of the proposed method.
Questions for Authors
Please see above.
Claims and Evidence
Yes.
Methods and Evaluation Criteria
- In Alg. 1, at each training step, the method samples multiple random perturbations ε via the HMC algorithm (Alg. 3). This sampling requires one backpropagation per sampled ε to compute the proposal gradient. Additionally, the parameter update requires another backpropagation per sampled ε to compute the gradient at the perturbed parameters. In total, a single parameter update requires roughly twice as many backpropagations as sampled perturbations, leading to a high computational cost that may be prohibitive.
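For intuition, below is a minimal sketch of a tilted-average parameter update (not the paper's Alg. 1): a plain Gaussian proposal stands in for the HMC sampler of Alg. 3, and the values of `t`, `rho`, `m`, and the learning rate are illustrative assumptions. It makes the per-update gradient-evaluation count explicit.

```python
import torch

# Toy model and batch, purely for illustration.
model = torch.nn.Linear(10, 1)
criterion = torch.nn.MSELoss()
x, y = torch.randn(32, 10), torch.randn(32, 1)

t, rho, m, lr = 1.0, 0.05, 4, 0.1   # tilt, perturbation scale, #perturbations, step size (assumed)
base = [p.detach().clone() for p in model.parameters()]
losses, grads, backprops = [], [], 0

for _ in range(m):
    # Draw a perturbation of the parameters. A real implementation would use the
    # HMC sampler (Alg. 3), which costs an additional backprop per sample for its proposal.
    with torch.no_grad():
        for p, b in zip(model.parameters(), base):
            p.copy_(b + rho * torch.randn_like(b))
    loss = criterion(model(x), y)
    model.zero_grad()
    loss.backward()                      # one backprop per sampled perturbation
    backprops += 1
    losses.append(loss.detach())
    grads.append([p.grad.detach().clone() for p in model.parameters()])

# Exponential tilting: neighbors with higher loss get exponentially larger weight.
w = torch.softmax(t * torch.stack(losses), dim=0)

# Take an SGD step from the unperturbed parameters using the tilted-average gradient.
with torch.no_grad():
    for j, (p, b) in enumerate(zip(model.parameters(), base)):
        tilted_grad = sum(w[i] * grads[i][j] for i in range(m))
        p.copy_(b - lr * tilted_grad)

print(f"backprops in this update: {backprops}; the HMC sampler of Alg. 3 would add roughly one more per sample")
```

The gradient evaluations in the loop, plus those inside the sampler, are the source of the per-update overhead described above.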
Theoretical Claims
Yes.
Experimental Design and Analysis
- Some popular SAM variants, such as ASAM [1], GSAM [2], GAM [3], are not included as baselines in the experiments.
- The experiments are primarily conducted on small-scale datasets. Larger-scale datasets, such as ImageNet, should be used to further validate the effectiveness of the proposed method.
[1] ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, ICML 2021
[2] Surrogate Gap Minimization Improves Sharpness-Aware Training, ICLR 2022
[3] Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization, CVPR 2023
Supplementary Material
Yes.
Relation to Broader Scientific Literature
The proposed method contributes to the broader scientific literature by introducing the idea of leveraging exponential tilting to reweight local minima, thereby addressing the drawbacks of SAM.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Strengths
- The idea of leveraging exponential tilting to reweight local minima is novel.
- The paper provides a series of theoretical analyses to further validate the effectiveness of the proposed method.
Other Comments or Suggestions
- There are some typos. For example, the title should be corrected from "Reweighting Local Mimina with Tilted SAM" to "Reweighting Local Minima with Tilted SAM". Additionally, in line 26, the phrase "such algorithms (and its variants) that reply on one or few steps" should be "such algorithms (and its variants) that rely on one or few steps".
We thank the reviewer for the time and valuable comments. We hope our response below can address the reviewer's concerns.
[more experiment results] We appreciate the reviewer's suggestions for adding more baselines. We have cited the papers mentioned by the reviewer in the related work. We did not compare against them in our original submission mainly because they are proposed to solve a SAM-like minimax-style objective, whereas TSAM proposes a new objective function, different from SAM's, along with an algorithm to solve it. As plotted in Figures 3 and 4 in the appendix, even if we can solve the minimax SAM objective perfectly (in that particular toy problem, we did a brute-force search over the optimal parameters in the one-dimensional space), TSAM solutions still demonstrate more desirable properties than SAM solutions (they are smoother).
However, we agree that more results will strengthen our experiments, and we compare with both GAM [3] and GSAM [2]. We leave out ASAM because GSAM achieves better performance than it (Figure 5 of the GSAM paper). For GSAM [2], we tune all the hyperparameters, including the alpha parameter from a grid of {0.01, 0.02, 0.03} as suggested by the paper. The GAM algorithm [3] can be expensive as it requires computing Hessian-vector products. We implemented its inexpensive approximation, as described in the GAM paper and used in its open-sourced code.
Results are shown in the table below. We report the best test accuracy (%) of each baseline on each dataset. For instance, on noisy CIFAR100, both GSAM and GAM overfit to noise at the later stage of training, so we apply early stopping to obtain their best test performance.
| Method | CIFAR100 w/ WideResNet | CIFAR100 w/ ResNet18 | noisy CIFAR100 w/ ResNet18 | noisy CIFAR100 w/ WideResNet | DTD w/ ViT | DTD w/ WideResNet |
|---|---|---|---|---|---|---|
| SGD | 73.22 | 71.39 | 61.01 | 57.02 | 66.38 | 16.97 |
| TSAM | 80.85 | 77.78 | 69.98 | 70.26 | 68.82 | 18.63 |
| GSAM [2] | 78.21 | 77.95 | 66.84 | 70.35 | 68.67 | 18.13 |
| GAM [3] | 78.01 | 77.55 | 67.26 | 65.32 | 68.28 | 17.97 |
[1] ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, ICML 2021.
[2] Surrogate Gap Minimization Improves Sharpness-Aware Training, ICLR 2022.
[3] Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization, CVPR 2023.
[compute cost] We agree that TSAM requires more gradient evaluations per iteration, as discussed and acknowledged in the paper. However, we have evaluated TSAM and all the baselines with the same computational budget (Section 5), e.g., by letting baselines run 3x or 5x longer than TSAM, or by optimizing the inner max of SAM in a more fine-grained way using more gradient computation (ESAM2). We see that TSAM still outperforms those approaches on a variety of tasks and models. The trend still holds even if we further increase the runtime of ERM (SGD) or SAM. In addition, TSAM is inherently easy to parallelize since all the perturbations of model parameters are independent of each other. If we optimize our implementation by sampling perturbations in parallel, TSAM would incur the same per-iteration runtime as the vanilla SAM algorithm.
[typos] Thanks for pointing out the typos in the submission. We will fix them in the next version.
The authors propose Tilted SAM (TSAM), which builds upon SAM in order to smooth out the optimization process using exponential tilting. Unlike SAM, which focuses on the worst-case loss within a neighborhood, TSAM reweights local solutions based on their loss values, favoring flatter minima. The authors claim that this makes TSAM easier to optimize and improves generalization performance across image and text tasks. They develop an algorithm using Hamiltonian Monte Carlo to efficiently estimate gradients for TSAM. Empirical results show that TSAM achieves better test performance and flatter minima compared to SAM and ERM.
Questions for Authors
Please refer to my previous comments.
Claims and Evidence
Several claims made in the paper are well-supported, including that TSAM leads to flatter solutions by averaging over multiple perturbations. While TSAM is backed by strong theoretical analysis, the claim that TSAM achieves better generalization performance is still quite strong, whereas the performance gap seems marginal in most experiments. The authors also show that the smoothness introduced by TSAM reduces optimization difficulties; however, it still takes at least 3x more time to compute compared to ERM and SAM.
Methods and Evaluation Criteria
- Overall, the methodology is described clearly and backed by detailed analysis on several toy examples.
- The authors propose sampling multiple ε's from a distribution to obtain an empirical gradient estimate, with the idea that the full gradient is a tilted average of the per-perturbation gradients. While the idea is interesting, it introduces significant computational challenges, as efficiently sampling from this distribution and computing the associated gradients can be resource-intensive and slow. Are there other techniques to avoid this issue?
Theoretical Claims
The proposed method is well-grounded in theory (with proofs provided for major claims).
Experimental Design and Analysis
- The experiments involve different types of datasets and models and TSAM shows marginal improvements in all the cases.
- The additional analysis on hyper-parameters and ablations are quite insightful.
Supplementary Material
- Overall, the provided implementation details are comprehensive and sufficient for ensuring reproducibility.
- After looking at the runtime comparison, it is quite clear that the overhead involved in TSAM is quite significant.
- Why would scheduling t not have any effect on the performance? Can the authors elaborate on this?
Relation to Broader Scientific Literature
The idea behind TSAM is to improve SAM by smoothing the optimization process using exponential tilting to prioritize higher-loss perturbations. This is related to several earlier works on flat minima for improved generalization. It also leverages Hamiltonian Monte Carlo (HMC) for efficient gradient estimation. TSAM extends ideas from average-perturbed sharpness and noise-perturbed loss techniques to improve generalization and optimization stability.
Essential References Not Discussed
I think the literature review is sufficient, and the authors have effectively supported their ideas with relevant references and prior research.
Other Strengths and Weaknesses
Please refer to my previous comments.
Other Comments or Suggestions
Please refer to my previous comments.
We thank the reviewer for the time and valuable comments.
[expensive computation] We would like to clarify that the runtime numbers in Table 5 are reported for the same number of iterations of all methods (to illustrate the per-iteration cost), while the final test performance in the experiment section (Section 5) is obtained by letting all methods run the same amount of wall-clock time. Across all datasets, we consistently observe that under the same runtime, TSAM outperforms other algorithms.
In addition, TSAM is inherently easy to parallelize since all the perturbations of model parameters are independent of each other. If we optimize our practical implementation in this way, TSAM would incur the same per-iteration runtime as the vanilla SAM algorithm (which consists of one step of gradient ascent and one step of gradient descent per iteration).
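To illustrate this independence argument, here is a minimal sketch (assuming PyTorch 2.x's `torch.func`, not the authors' actual implementation) that evaluates all sampled perturbations in one batched call and combines their gradients with tilted weights; `t`, `rho`, and `m` are illustrative values.

```python
import torch
from torch.func import functional_call, grad, vmap

model = torch.nn.Linear(10, 1)                      # toy model for illustration
params = {k: v.detach() for k, v in model.named_parameters()}
x, y = torch.randn(32, 10), torch.randn(32, 1)
t, rho, m = 1.0, 0.05, 4                            # assumed tilt, perturbation scale, #perturbations

def loss_fn(p, x, y):
    return torch.nn.functional.mse_loss(functional_call(model, p, (x,)), y)

# Stack m independently perturbed copies of the parameters along a new leading dim.
perturbed = {k: v.unsqueeze(0) + rho * torch.randn(m, *v.shape) for k, v in params.items()}

# Evaluate all perturbations in a single vectorized call (conceptually, one worker per perturbation).
losses = vmap(loss_fn, in_dims=(0, None, None))(perturbed, x, y)          # shape (m,)
grads = vmap(grad(loss_fn), in_dims=(0, None, None))(perturbed, x, y)     # dict of (m, ...) tensors

# Tilted weights and the combined gradient used for the parameter update.
w = torch.softmax(t * losses, dim=0)
tilted_grad = {k: (w.view(m, *[1] * (g.dim() - 1)) * g).sum(0) for k, g in grads.items()}
```

Since the per-perturbation terms share no state, they could equally be dispatched to separate devices, which is the sense in which the per-iteration wall-clock cost can approach that of vanilla SAM.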
[scheduling t] We note that scheduling t with a fixed constraint set (defining the neighborhood of model parameters) does not change the final performance significantly as long as we start from or end with the same values of t. When t is zero, TSAM reduces to an average-perturbation-based sharpness-aware objective, which still accounts for some notion of sharpness of the loss surface. In the future, we plan to further investigate scheduling t together with the radius parameter of the constraint set, where it is possible to recover vanilla ERM with a mini-batch SGD optimizer.
Thank you for clarifying my concerns. Based on the responses from the authors, I'll maintain my original score.
The authors propose TSAM, a version of SAM where, instead of taking a max over the loss around a point in parameter space during training, they take a weighted average. Since the objective is not tractable for such weighting functions, the authors develop a sampling-based approximation. It is then shown that this indeed leads to (modest) gains in generalization performance. The authors also put some effort into theoretical analysis.
Questions for Authors
The central definition for TSAM (eq. 3) seems to be mathematically equivalent to the local entropy in [1] (see the first equation there and replace w' by w + epsilon), up to the choice of measure/distance function. The paper [1] and the preceding work contain extensive analysis of this kind of setup, and I think it would be good to relate these streams of work (just to be clear, the work in [1] is quite different after that basic definition, so there is no issue of being too similar, but at the same time it is clearly related and it is not cited/discussed).
Claims and Evidence
The claim that TSAM leads to flatter minima and better generalization performance is well-supported by experiments and theoretical analysis. However, the gains in performance are very small and not entirely consistent (see Table 1).
Methods and Evaluation Criteria
The authors compare to several other methods and over a wide range of datasets. This part of the paper is sufficient.
Theoretical Claims
While I did not follow all mathematical analysis in detail, the general methods and results seems to be in line with the existing literature.
Experimental Design and Analysis
No.
Supplementary Material
No.
Relation to Broader Scientific Literature
The work presents an incremental improvement over SAM. Together with the theoretical foundations I think this is not a problem, however.
Essential References Not Discussed
Some important papers in the field are not cited, especially Pittorino et al., "Entropic gradient descent algorithms and wide flat minima", which seems to have the same setup as the authors (see below).
Other Strengths and Weaknesses
- I think the basic idea is novel and the execution in the paper is good.
- The experiments are well done and wide-ranging, even though the results are not entirely convincing (see above)
- The paper is well written
- The theoretical aspect is well developed
Other Comments or Suggestions
I would suggest exploring different measures of flatness to make the work stronger.
We thank the reviewer for the time and positive assessment of our work! We would like to address the remaining questions/concerns as follows.
[experimental improvements] We observe statistically significant improvements of TSAM compared with the baselines. In terms of the test loss, TSAM outperforms baselines by a large margin (Table 3 in Appendix C.3). Furthermore, as TSAM is a new objective function minimizing a weighted combination of bad losses, we can further improve the current algorithm by combining it with other optimization techniques, such as applying variance reduction to obtain better gradient estimates, or incorporating adaptivity to precondition the estimated gradients. We leave a more fine-grained exploration of solvers as future work.
[other measures of sharpness and related work] Thanks for the suggestions and for pointing out the related work. In our work, we have explored two metrics of sharpness, as discussed in Section 5.2. The first sharpness measure (results visualized in Table 2) is closely related to the local entropy notion mentioned in the related work (Pittorino et al.), and we agree the formulations are conceptually related as well. We will add discussions around the connections with Pittorino et al. Our formulation is inspired by exponential tilting, which inherently has rich connections with different areas such as information theory, applied probability, and optimization [1]. We are certainly interested in developing other sharpness measures based on the properties of our objectives, leveraging connections with prior works, and we will add discussions on this in the next version.
[1] On Tilted Losses in Machine Learning: Theory and Applications, JMLR.
This paper proposes Tilted SAM (TSAM), a smoothed version of SAM based on exponential tilting. Its smoothness enables easier optimization and better generalization.
Questions for Authors
Do you have insights on why the algorithm still works even with N=1 and "accept the generated ϵ with probability 1"?
Claims and Evidence
All the claims are clear to me except one: "Empirically, TSAM arrives at flatter local minima and results in superior test performance than the baselines of SAM and ERM across a range of image and text tasks."
The connection between flatter local minima and superior test performance is not clear to me. It would be nice to refer to some papers or add an explanation when this is first claimed.
Methods and Evaluation Criteria
Yes. They use test accuracy for classification tasks and sharpness when comparing to ERM and SAM.
Theoretical Claims
Yes, I checked theorem 3.6 in the main text.
Experimental Design and Analysis
Yes. I checked all of their experiments. I have a question on one experiment in the appendix:
In Figure 7, the CIFAR100 test accuracy for t=20 seems to still be increasing. What happens after 200 epochs?
Supplementary Material
Yes, Appendix A B.1, and C.
Relation to Broader Scientific Literature
This paper applies exponential tilting to smooth SAM, which is a novel combination.
Essential References Not Discussed
See "Claims And Evidence"
Other Strengths and Weaknesses
Strengths: The theoretical part is solid.
Weakness: [limited impact] According to Table 5, the running time of TSAM on CIFAR100 is 10 times that of ERM and 5 times that of SAM. Meanwhile, the improvement from TSAM is not significant (~1-2%), as shown in Table 1, across various tasks.
Other Comments or Suggestions
NA
We thank the reviewer for the time and valuable feedback.
[connections between flat local minima and improved generalization] Note that by stating "TSAM arrives at flatter local minima and results in superior test performance…", we do not intend to claim that flat local minima lead to better generalization in all cases. The exact relation between the two in deep learning is still an open question, as pointed out in our submission citing several related works (Section 2). Our study is partly motivated by the empirical success of a series of SAM works under which the sharpness of local minima is significantly reduced (under various notions of sharpness), and we aim to develop a better formulation along this line by considering and reweighting multiple local minima in a principled framework. We will further clarify this in the next version.
[runtime of TSAM] We agree that TSAM requires more gradient evaluations per iteration, as discussed and acknowledged in the paper. However, we have evaluated TSAM and all the baselines with the same computational budget (Section 5), e.g., by letting baselines run 3x or 5x longer than TSAM, or by optimizing the inner max of SAM in a more involved way with more gradient computation (ESAM2). We see that TSAM still outperforms those approaches on a variety of tasks and models. The trend still holds even if we further increase the runtime of ERM (SGD) or SAM. In addition, TSAM is inherently easy to parallelize since all the perturbations of model parameters are independent of each other. If we optimize our implementation in this way, TSAM would incur the same per-iteration runtime as the vanilla SAM algorithm.
[other questions] (a) The performance of both TSAM and SAM would improve very slightly after 200 epochs, but TSAM still outperforms SAM. (b) When N=1, we run gradient ascent once to locate the regions with (relatively) large losses, where the first-order gradient information serves as a strong signal for which areas to sample from. Additionally, the gap between losses can be magnified by using a large value of t during reweighting. We will explain the intuition in more detail in the next version.
Thank you for your replies.
However, we have evaluated TSAM and all the baselines with the same computational budget (Section 5), e.g., by letting baselines run 3x or 5x longer than TSAM
Could you point out in which table/figure you run ERM or other baselines longer than TSAM?
Also, I notice that in Section 5 you mentioned: "Despite the existence of adaptive methods for SGD and SAM (Kingma & Ba, 2014; Kwon et al., 2021), we do not use adaptivity for any algorithm for a fair comparison." Does it mean that in all the ERM results, SGD was used instead of Adam/AdamW?
We thank the reviewer for the time and additional questions.
In almost all of the numbers presented in our main results, we use the same computation budget for the baselines and TSAM. For instance, in Table 1 on both image and text data, except for SAM, which runs the same number of iterations as TSAM, all other baselines (ESAM1, ESAM2, PGN, RSAM) use the same number of gradient evaluations as TSAM. ESAM1 refers to SAM running longer. We explained the protocol in Section 5.1, and will further clarify this in the next version.
Yes, we use SGD instead of adaptive optimizers for ERM for a fair comparison, which is also a standard baseline in prior SAM related works [e.g., 1,2,3]. If we use adaptive optimizers to solve ERM, we would need to incorporate adaptivity into all methods for solving SAM and TSAM as well, which the current baselines (e.g., PGN, RSAM) and the TSAM algorithm do not account for.
[1] Kwon et al., ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks.
[2] Liu et al., Random Sharpness-Aware Minimization.
[3] Foret et al., Sharpness-Aware Minimization for Efficiently Improving Generalization.
This paper proposes a generalized and smoothed variant of SAM called Tilted SAM (TSAM). While SAM optimizes the worst-case loss within a parameter neighborhood, it struggles to find the true maximum loss point and ignores other potentially sharp regions. Instead of focusing solely on the worst perturbation, TSAM reweights all neighborhood points based on their exponentiated loss, effectively considering multiple high-loss regions. A tilt scalar t parameterizes this process, yielding a smooth objective that interpolates between minimizing the average neighborhood loss and recovering SAM's min-max objective. The authors theoretically show and empirically demonstrate that TSAM leads to flatter solutions and better generalization bounds than SAM for modest t. To overcome the challenge of optimizing TSAM, which requires sampling perturbations from a complex distribution, they adapt an efficient Hamiltonian Monte Carlo algorithm for gradient estimation.
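For concreteness, the interpolation described above can be sketched with the standard exponential-tilting form of the objective; the exact definition in the paper (its eq. 3) may normalize differently, and the uniform perturbation distribution over a radius-ρ ball is an assumption here:

$$
\mathcal{L}_{\mathrm{TSAM}}(x;t)=\frac{1}{t}\log\,\mathbb{E}_{\epsilon\sim\mathcal{U}(\|\epsilon\|\le\rho)}\!\left[e^{\,t\,L(x+\epsilon)}\right],
\qquad
\lim_{t\to 0^{+}}\mathcal{L}_{\mathrm{TSAM}}(x;t)=\mathbb{E}_{\epsilon}\!\left[L(x+\epsilon)\right],
\qquad
\lim_{t\to\infty}\mathcal{L}_{\mathrm{TSAM}}(x;t)=\max_{\|\epsilon\|\le\rho}L(x+\epsilon).
$$

Larger t places exponentially more weight on high-loss neighbors, which is the sense in which TSAM recovers SAM's min-max objective in the limit.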
Reviewers generally acknowledged the novelty of applying exponential tilting to smooth the SAM objective and appreciated the solid theoretical analysis provided, which demonstrated properties like favoring flatter solutions and achieving potentially tighter generalization bounds than standard SAM. Despite that, reviewers raised concerns about the practical trade-offs. A major recurring point across multiple reviews (Sf29, WTKN, YVio) was the substantially higher computational cost per iteration compared to ERM and SAM, as evidenced by the runtime tables. While acknowledging that the method finds flatter minima and shows performance improvements, reviewers questioned whether the observed gains (often described as marginal or modest, ~1-2%) justified the increase in training time or gradient evaluations per step. Less critical concerns included the missing comparisons against more recent SAM variants like ASAM, GSAM, and GAM (YVio), the need for validation on larger datasets like ImageNet (YVio), and the missing discussion of closely related work (F33F).
In their rebuttal, authors clarified that while the per-iteration cost is higher, their main experimental results were obtained using an equal total computational budget for all methods (meaning baselines like SAM/ERM were run for proportionally longer). They also argued that TSAM is inherently parallelizable, potentially matching SAM's per-iteration time with optimized implementations. Furthermore, they provided new experimental results showing TSAM outperformed requested baselines GSAM and GAM under equal compute budgets. They also provided clarifications on the flatness-generalization connection, cited the related work pointed out, and addressed other minor technical questions.
The final sentiment remained somewhat mixed; while reviewers acknowledged the clarifications, some concerns about the cost-benefit ratio persisted, leading to borderline recommendations ranging from accept to weak reject (averaging to weak accept). Considering the authors' responses, particularly their clarification on the algorithm's parallelizability, which addresses a major concern and enhances practical utility, and their inclusion of new comparisons against baselines like GSAM and GAM, I believe this paper offers significant enough contributions to be useful for the community. Therefore, I recommend acceptance.