PaperHub
Rating: 6.0 / 10 (Poster, 4 reviewers)
Scores: 6, 6, 6, 6 (min 6, max 6, std 0.0)
Confidence: 4.0
ICLR 2024

Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models

Submitted: 2023-09-16 · Updated: 2024-03-06

Abstract

Keywords
Sharpness-Aware Minimization, Molecular Graph

Reviews and Discussion

Official Review (Rating: 6)

This paper designs a computation-efficient variant of sharpness-aware minimization (SAM) for molecular graph transformers.

Strengths

S1. This paper observes that flat minima generalize better for molecular graph transformers. This extends the boundary of sharpness awareness to a new domain.

S2. To overcome the computational burden, this paper leverages domain-specific knowledge to design GraphSAM. Supporting theories are derived to explain the reasoning behind the design.

S3. GraphSAM improves throughput while achieving performance comparable to SAM. Compared to other computation-efficient variants of SAM, GraphSAM performs better in the considered setting.

Weaknesses

W1. Can the authors elaborate more on equation (3)? In particular, are there any specific reasons for choosing $\epsilon_t$ as the moving average of $w_t/\|w_t\|$? What will happen if $\epsilon_t$ becomes the moving average of $w_t$?

W2. What are the specific reasons behind the choice of StepLR for the $\rho$-schedulers? Do other approaches, such as cosine schedulers or inverse square root schedulers, help?

Questions

Q1. Table 1 does not fit into the paper properly.

Q2. Some of the expressions are unnecessarily complicated. For example, in equations (2) and (3), $\mathrm{sign}(\epsilon_t)|\epsilon_t|$ can be replaced by $\epsilon_t$.

Q3. Missing references on SAM variants.

[1] Du, Jiawei, Daquan Zhou, Jiashi Feng, Vincent YF Tan, and Joey Tianyi Zhou. "Sharpness-Aware Training for Free." arXiv preprint arXiv:2205.14083 (2022).

[2] Li, Bingcong, and Georgios B. Giannakis. "Enhancing Sharpness-Aware Optimization Through Variance Suppression." arXiv preprint arXiv:2309.15639 (2023).

Comment

We greatly appreciate your careful review and would like to respond to the questions and weaknesses you raised one by one.

###[W1 - What will happen if $\epsilon_{t+1}$ becomes the moving average of $w_t$? — It can cause gradient explosion.]

Thank you for your insightful observation. Using $w_t$ directly without normalization could make $\epsilon_{t+1}$ grow excessively large, which might trigger a gradient explosion. Such an occurrence would critically disrupt the training process, rendering it not only ineffective but also at risk of divergence. In Eq. (3), the choice to use $w_t/\|w_t\|$ for updating $\epsilon_{t+1}$ is deliberate and crucial for the stability and effectiveness of the update. This formulation ensures that the update direction of $\epsilon_{t+1}$ is determined by $w_t$ while its magnitude is controlled by the factor $(1-\beta)$. By normalizing $w_t$ with its magnitude $\|w_t\|$, we prevent the scale of $w_t$ from disproportionately influencing the update of $\epsilon_{t+1}$. If we were to use $w_t$ directly, the absence of normalization would lead to significant issues. Our experiments have shown that $\|w_t\| \gg \|\epsilon_t\|$, further underscoring the necessity of this normalization step.
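To make the role of the normalization concrete, here is a minimal sketch of the moving-average update discussed above, assuming Eq. (3) takes the form $\epsilon_{t+1} = \beta\,\epsilon_t + (1-\beta)\,w_t/\|w_t\|$ (the function and variable names are ours, not the paper's):

```python
import torch

def update_perturbation(eps, w, beta=0.9, rho=0.05, normalize=True):
    """Moving-average approximation of the perturbation gradient.
    eps: current estimate (epsilon_t); w: previous updating gradient (w_t);
    beta: moving-average coefficient; rho: neighborhood radius.
    With normalize=False, w enters at its raw scale, so eps can grow with
    ||w|| and destabilize training (the "gradient explosion" case above)."""
    direction = w / (w.norm() + 1e-12) if normalize else w
    eps_next = beta * eps + (1.0 - beta) * direction
    # Rescale onto the rho-ball before perturbing the weights, as SAM does.
    eps_hat = rho * eps_next / (eps_next.norm() + 1e-12)
    return eps_next, eps_hat

# Illustration: with a large raw gradient (||w|| >> ||eps||), the normalized
# update stays bounded while the unnormalized one scales with ||w||.
eps = torch.zeros(10)
w = 50.0 * torch.randn(10)
print(update_perturbation(eps, w, normalize=True)[0].norm().item())
print(update_perturbation(eps, w, normalize=False)[0].norm().item())
```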

###[W2 - Do cosine schedulers or inverse square root schedulers help? — They have some effect, but StepLR is the best.]

Thank you for your question regarding the choice of the StepLR scheduler for the $\rho$-schedulers in our model. The decision to use StepLR was primarily guided by its simplicity and effectiveness in controlling the perturbation radius in discrete steps. It allows a straightforward reduction of $\rho$ at predetermined epochs, which we found beneficial for the steady convergence of our model's parameters to a flat region. In response to your suggestion, we have added the following two sets of experiments:

CosineLR: $\rho_{new} = \frac{\rho_{initial}}{2}\left(1+\cos\left(\frac{epoch \cdot \pi}{EPOCH}\right)\right)$

InverseSquareLR: $\rho_{new} = \frac{\rho_{initial}}{\sqrt{\max(epoch,\,k)}}$

where $epoch$ is the current training epoch, $EPOCH$ is the total number of training epochs, $k=1$, and $\rho_{initial} = 0.05$.

| GROVER+GraphSAM | BBBP (ROC-AUC↑) | Tox21 (ROC-AUC↑) | Sider (ROC-AUC↑) | ClinTox (ROC-AUC↑) | ESOL (RMSE↓) | Lip (RMSE↓) |
|---|---|---|---|---|---|---|
| Original model | 0.917 | 0.822 | 0.649 | 0.853 | 0.639 | 0.671 |
| +StepLR | 0.928 | 0.846 | 0.665 | 0.866 | 0.625 | 0.654 |
| +CosineLR | 0.921 | 0.828 | 0.651 | 0.856 | 0.633 | 0.661 |
| +InverseSquareLR | 0.924 | 0.836 | 0.654 | 0.858 | 0.628 | 0.665 |
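For readers who want to reproduce the comparison, a minimal sketch of the three $\rho$-schedulers is given below (our own illustrative code; the StepLR decay interval and factor are hypothetical, while $\rho_{initial}=0.05$ and $k=1$ match the settings above):

```python
import math

def step_lr_rho(epoch, rho_init=0.05, step_size=30, gamma=0.5):
    # StepLR-style schedule: shrink rho by `gamma` every `step_size` epochs
    # (step_size and gamma are illustrative, not the paper's exact values).
    return rho_init * (gamma ** (epoch // step_size))

def cosine_lr_rho(epoch, total_epochs, rho_init=0.05):
    # CosineLR: rho_initial / 2 * (1 + cos(epoch * pi / EPOCH))
    return 0.5 * rho_init * (1.0 + math.cos(epoch * math.pi / total_epochs))

def inverse_square_lr_rho(epoch, rho_init=0.05, k=1):
    # InverseSquareLR: rho_initial / sqrt(max(epoch, k))
    return rho_init / math.sqrt(max(epoch, k))

for e in (1, 10, 50, 100):
    print(e, step_lr_rho(e), cosine_lr_rho(e, 100), inverse_square_lr_rho(e))
```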

We acknowledge that other approaches, such as cosine schedulers or inverse square root schedulers, could also be effective. While these alternative schedulers present intriguing possibilities, our initial experiments with StepLR showed the most promising results, which led to its selection for the current study. However, we recognize the value in exploring these alternative scheduling methods in future work, as they may offer different advantages in specific training contexts or with certain types of datasets.

###[Q1 - Table 1 does not fit into the paper properly. — Fixed, and thank you for the close read!]

###[Q2 - $\mathrm{sign}(\epsilon_{t})|\epsilon_{t}|$ can be replaced by $\epsilon_{t}$! — Yes, you are right and we fixed it.]

###[Q3 - Missing references on SAM variants. — We have added them.]

Thank you for highlighting the omission of references related to SAM variants in our manuscript. In our revised manuscript, we will include and discuss the two SAM variants [1, 2] in the related work.

References:

[1] Du, Jiawei, Daquan Zhou, Jiashi Feng, Vincent YF Tan, and Joey Tianyi Zhou. "Sharpness-Aware Training for Free." arXiv preprint arXiv:2205.14083 (2022).

[2] Li, Bingcong, and Georgios B. Giannakis. "Enhancing Sharpness-Aware Optimization Through Variance Suppression." arXiv preprint arXiv:2309.15639 (2023).

Official Review (Rating: 6)

The paper proposes GraphSAM to reduce the training cost of sharpness-aware minimization (SAM) and improve the generalization performance of graph transformer models. GraphSAM uses the updating gradient of the previous step to approximate the perturbation gradient at the intermediate steps smoothly and theoretically proves that the loss landscape of GraphSAM is limited to a small range centered on the expected loss of SAM. Extensive experiments on six datasets with different tasks demonstrate the effectiveness and efficiency of GraphSAM.

Strengths

  • Motivation is clear: SAM is a powerful optimizer but doubles the computational cost compared with the base optimizer. Thus, improving its efficiency (the focus of this paper) is an important problem
  • GraphSAM improves the efficiency of SAM by approximating the perturbation gradient $\epsilon_{t+1}$ as the moving average of $\epsilon_t$ and $\omega_t$, based on two observations: (i) $\epsilon_t$ is close to $\epsilon_{t+1}$ (Figure 3(a)) and (ii) $\epsilon_{t+1}$ is close to $\omega_t$ (Figure 3(c))
  • By approximating the perturbation gradient, GraphSAM almost does not need to compute the perturbation gradient (only once per epoch). Thus, GraphSAM is efficient
  • Experimental results on the graph dataset show that GraphSAM is comparable with SAM (Tables 1 and 2) and is more efficient (Table 2)

Weaknesses

  • writing:
    • The colors in Figure 1 are inconsistent with the observations: "compared to the base optimizer of Adam (the blue bar)", but the blue is LookSAM? also, SAM (cyan)?
    • Figure 1, it is better to include the proposed GraphSAM as well
    • Related works: "most of the work still ignores the fact of SAM’s double overhead (Damian et al., 2021; Kwon et al., 2021;Wang et al., 2022)." I think some methods have attempted to mitigate this issue, e.g., AE-SAM, SS-SAM, better to discuss them here
    • Section 3, "SAM consumes double overhead due to the extra computation of perturbation gradient compared with the base optimizer." existing methods (AE-SAM, SS-SAM) have tried to mitigate this issue, better to discuss them here, and why they cannot be used for graph datasets, this is also the motivation for the proposed GraphSAM
    • Figure 3, sub-caption for each subfigure
    • Table 1: also compare with GROVER+AE-SAM/SS-SAM
    • "AE-SAM and RST periodically compute the update gradient by different strategies": AE-SAM is an adaptive strategy, not a periodic strategy
  • it seems that the proposed GraphSAM is general, not limited to graph transformers, and can also be used in computer vision. Thus, it is better to conduct some experiments on computer vision.
  • ablation study for hyperparameters $\gamma, \lambda, \beta$
  • Theorem 2: based on the proof, the proportion ratio depends on $\theta$ (Eq. (7)), which is trivial. If we assume the loss is Lipschitz, then the constant can be independent of $\theta$.

Questions

See the questions in the weaknesses.

======

I have read the rebuttal and other reviews. Most of my concerns are resolved, so I maintain my score of 6.

Comment

Thank you for the many valuable comments on writing; we have corrected all of them. Notably, the SS-SAM you mention is the same paper as the RST method compared in our paper.

###[W1.1 - The colors in Fig.1 are inconsistent — Fixed, and thank you for the close read! ]

###[W1.2 - Add GraphSAM in Fig.1 — We have added GraphSAM in Fig.1.]

###[W1.3 and W1.4 - Discuss existing efficient methods, e.g., AE-SAM and SS-SAM. — We have discussed and compared them in the Appendix.]

Indeed, we have provided a detailed comparison of these efficient SAM versions, including AE-SAM and SS-SAM(RST), in Appendix A.4 and A.5. We specifically discuss why these methods are less suited for molecular graph datasets and how GraphSAM is uniquely designed to address the challenges inherent to such data.

For AE-SAM, the trend of its squared stochastic gradient norms on molecular data is not consistent with that observed in the CV domain. This reduces its sensitivity, so its performance does not match what it achieves in CV, although it still works better than the other efficient methods.

For SS-SAM, the advantage is efficiency, but the disadvantage is obvious: randomly deciding whether or not to recompute the gradient can reduce performance. Its performance in the CV literature is also poor.

###[W1.5 - Sub-caption for each subfigure. — We have placed the sub-captions for all subfigures in the figure description.]

###[W1.6 - Table 1: also compare with GROVER+AE-SAM/SS-SAM. — We have compared with CoMPT and Grover (+ every efficient SAM) in detail in Table 2 and Table 7.]

###[W1.7 - Description of AE-SAM is incorrect. — Fixed, and thank you for the close read!]

###[W2 - Can GraphSAM also be used in CV? — Yes, but it does not perform ideally.]

Unfortunately, GraphSAM does not perform well in the CV domain. We found experimentally that GraphSAM is a good approximation for graph transformers rather than for models in the CV domain. Although our proposed method is designed based on empirical observations, Observations 1 and 2 appear only in graph transformers and do not appear in CV models [1]. In CV, the model's perturbation gradient variations are often consistent with the updating gradients, which is not the case on molecular graphs.

To further address this concern and provide a holistic discussion, we add the following experiments to compare the performance of LookSAM [1], AE-SAM, and GraphSAM on CV and molecular datasets, respectively. LookSAM and GraphSAM are efficient SAMs designed based on the gradient variation patterns of models in their respective domains, while AE-SAM is an adaptively updating efficient method that compares the gradient norm against a pre-defined threshold.

| ResNet-18 (CV) | CIFAR-10 | CIFAR-100 |
|---|---|---|
| +SGD | 95.41 | 78.21 |
| +SAM | 96.53 | 80.18 |
| +LookSAM | 96.28 | 79.91 |
| +AE-SAM | 96.49 | 80.31 |
| +GraphSAM | 95.86 | 78.69 |

| CoMPT (Graph) | Tox21 | Sider | ClinTox |
|---|---|---|---|
| +Adam | 82.81 | 62.17 | 91.41 |
| +SAM | 83.96 | 64.33 | 92.73 |
| +LookSAM | 82.55 | 62.54 | 91.64 |
| +AE-SAM | 83.33 | 63.16 | 91.92 |
| +GraphSAM | 84.11 | 64.58 | 93.78 |

We observe that LookSAM performs better on the image datasets but not on the molecular graphs investigated in this work. Conversely, our GraphSAM works well on molecules but gives the worst performance on the image benchmarks. This is because model gradient variations differ across domains, and it is challenging to transfer efficient SAMs designed around domain-specific gradient patterns. AE-SAM achieves acceptable performance on the molecular graphs since it monitors the perturbation gradient norms of graph transformers to decide when gradient re-computation is necessary, but it is not as good as GraphSAM, which accurately fits the perturbation gradients at each step. In summary, GraphSAM is optimized particularly for graph transformers based on the empirical observations of gradient variations.

References:

[1] Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, Yang You. Towards efficient and scalable sharpness-aware minimization. CVPR 2022.

Comment

###[W3 - Ablation study for hyperparameters $\beta$, $\gamma$, $\lambda$? — We have included experiments for hyperparameter analysis in the Appendix.]

Thank you for your inquiry regarding the hyperparameter ablation study. Given the length constraints of the main paper, we have included detailed ablation experiments in the Appendix: Table 3 for $\beta$, Table 9 for $\gamma$, and Fig. 7 for $\lambda$. These sections offer a comprehensive view of how the hyperparameters influence the performance and behavior of our proposed method.

###[W4 - If we assume the loss is Lipschitz, then the constant can be independent of $\theta$? — Your assumption is correct, but it does not solve the problem in SAM.]

The essence of using a first-order Taylor expansion is to compute the gradient at the current parameter $\theta$, which provides directional information beyond the value of $\theta$ itself. Lipschitz continuity of a function bounds how much the function can change between two points. While it is a valuable property, especially for ensuring the stability and robustness of optimization algorithms, it does not directly provide a method for finding flat minima. In the context of SAM and GraphSAM, the Lipschitz constant tells us the maximum rate at which the loss can change, but it does not directly facilitate optimization towards flat regions of the loss landscape.

Both Sharpness-Aware Minimization (SAM) and its variants use a first-order Taylor expansion in their formulations, primarily because it offers a practical and computationally efficient way to approximate the worst-case loss in a neighborhood around the current parameter values.
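For reference, the standard first-order derivation behind SAM's perturbation (as in the original SAM formulation), which makes explicit how the Taylor expansion is used:

$$\hat{\epsilon}^{S} = \arg\max_{\|\epsilon\|_2 \le \rho} L_G(\theta+\epsilon) \approx \arg\max_{\|\epsilon\|_2 \le \rho}\Big(L_G(\theta) + \epsilon^{\top}\nabla_{\theta} L_G(\theta)\Big) = \rho\,\frac{\nabla_{\theta} L_G(\theta)}{\|\nabla_{\theta} L_G(\theta)\|_2}.$$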

Consequently, we still choose the first-order Taylor expansion as a derivation condition in this paper, instead of assuming the loss is Lipschitz.

Official Review (Rating: 6)

The study considers sharpness-aware minimization (SAM), a technique from computer vision that is effective in eliminating sharp local minima from training trajectories and countering generalization degradation. However, its dual gradient computations during optimization increase the time cost. To address this, a novel algorithm named GraphSAM is proposed. It lowers the training expense of SAM while enhancing the generalization performance of graph transformer models. The approach is underpinned by two main strategies: gradient approximation and loss landscape approximation. Empirical tests across six datasets validate GraphSAM's superiority, particularly in refining the model update process. Anonymized code is also provided for further reference.

Strengths

(1) The paper is exceptionally coherent in its writing. The content is presented seamlessly, and the expression is clear.

(2) This paper is both conceptually and technically innovative. It uniquely reutilizes the updating gradient from the previous step, approximating the perturbation gradient in an efficient manner.

(3) The experimental design of this study is well-conceived. It systematically validates the representational capabilities of GraphSAM. Additionally, the appendix offers an extensive set of supplementary experiments.

Weaknesses

(1) In Observation 1, the author notes the differing changes of the perturbation gradient and the updating gradient throughout the training process. I wonder, is this a general phenomenon? Could more case studies be provided to illustrate this further?

(2) The results presented in Table 1 seem to show limited improvement on certain datasets. I suggest the author consider incorporating an efficiency aspect into the table. Efficiency is a crucial contribution, yet it hasn't been emphasized adequately in the experimental section, which seems inconsistent.

Questions

See the weaknesses.

Comment

We greatly appreciate your careful review and would like to respond to the questions and weaknesses you raised one by one.

###[W1 - Are the differing changes of the gradients a general phenomenon? — Yes, they are.]

The observation of differing changes in the perturbation gradient and the updating gradient throughout the training process is indeed a general phenomenon. To further substantiate this, we have enriched Figure 8 in Appendix A8 with additional case studies illustrating these variations, which are consistent with the conclusions of Observation 1.

###[W2 - Efficiency is a crucial contribution; add experiments to illustrate it. — Agreed, we have compared efficiency in Table 2 and Table 7.]

We have included comprehensive efficiency comparisons in Table 2 of the main paper and Table 7 of Appendix A.5. These tables specifically highlight the efficiency aspects of GraphSAM, offering a balanced view of both its performance and computational efficiency.

Official Review (Rating: 6)

The authors study Sharpness Aware Minimization (SAM), an effective training objective in the image domain to avoid "sharp minima", in the context of graph transformers for molecular data. The proposed efficient SAM-alternative, termed GraphSAM, empirically performs on par with SAM while being more efficient.

Strengths

  1. More efficient approach for SAM (roughly 30%-50% faster training)
  2. GraphSAM consistently improves the performance of the base model
  3. Approach is ablated and rich explanations are provided
  4. The paper is well organized and it is easy to follow.

Weaknesses

  1. The relation to graph machine learning is a bit obscure: is GraphSAM only a good approximation for graph transformers? After all, GraphSAM does not rely on anything graph-related besides the observations made with SAM + graph transformers. Although it is fine to focus on molecular data, the paper would benefit significantly from a holistic discussion (and some rudimentary experiments).
  2. The theoretical statements could be better motivated and more decisively embedded into the story. It remains a bit vague what the actual implications are. For example, the conclusion of Theorem 1 is speculative "by replacing the inner maximum of LG(θ + εˆS) with the maximum of LG(θ + εˆG), we tend to smooth the worse neighborhood loss in the loss landscape. In other words, if one could minimize the upper bound given by LG(θ + εˆG), the final loss landscape is as smooth as that obtained from SAM." Alternatively, the wording may be improved. It is not clear why optimizing the upper bound necessarily results in a "loss landscape is as smooth as that obtained from SAM".
  3. Proof of Theorem 1 relies on empirical observations and (from my perspective) strong assumptions. However, the authors do not sufficiently discuss this in the main part and, thus, the current presentation is misleading. In other words, these assumptions, etc. should be made explicit in a rather prominent way in the main part (e.g. "and after ||ω0 ||2 experimental analysis, the updating gradient ω ≫ ε " is not stated clearly in the main part). I think it would be better to drop the term "Theorem" and rather give some intuitive, mathematically motivated explanation.

Minor:

  1. Missing space in the first line of page 2
  2. The bold highlighting in Table 1 is counter-intuitive. Perhaps add an additional marker to highlight the best model per task.

Questions

  1. Is GraphSAM only a good approximation for graph transformers? How is GraphSAM working, e.g., in the image domain?
  2. Are there any observations of how the behavior of the graph transformer changes if trained with GraphSAM? For example, do the attention scores then align better with the graph connectivity, or do they become smoother (higher entropy)?
  3. Is it possible to compare GraphSAM with other efficient SAM derivates besides SAM-k (e.g., see related work section)?
  4. Is the proportionality in Theorem 2 not merely caused by the first-order Taylor approximation (Eq. 6 in A1.1)? How do the authors know that the linear term is a sufficient approximation?
Comment

###[M1 - Missing space — Fixed, and thank you for the close read! ]

###[M2 - Counter-intuitive bold highlighting — Sure, we have changed it]

In the revised manuscript, we maintain the bold highlighting for the best optimizer of graph transformer per task, but also add 'underline' as an additional marker to further emphasize the top-performing model in each category.

###[Q2 - How does the behavior of the graph transformer change when trained with GraphSAM? — When the model is trained with GraphSAM, the average entropy of the nodes' attention score matrices for the same molecular graph is higher.]

This is a wonderful suggestion. Regarding the attention matrices over the BBBP dataset, we observe that models optimized with Adam, GraphSAM, and SAM obtain average entropies of 2.4130, 2.7239, and 2.8382, respectively. This means that applying SAM-style approaches smooths the attention scores and helps avoid overfitting. We have added the attention heat map analysis to the appendix. In addition, Figure 5 in the main paper shows that incorporating GraphSAM leads to smoother training and testing loss curves and reduces the gap between them. This smoother convergence suggests that GraphSAM helps the model learn a more robust and stable representation, which indicates a more evenly distributed attention across the graph's features.
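For transparency, a minimal sketch of how such an average attention entropy can be computed from a row-stochastic attention matrix (our own illustrative code, not the exact evaluation script used in the paper):

```python
import torch

def average_attention_entropy(attn):
    """attn: (num_nodes, num_nodes) attention matrix whose rows sum to 1.
    Returns the mean Shannon entropy over rows; higher values indicate
    attention spread more evenly across the nodes of the molecular graph."""
    row_entropy = -(attn * (attn + 1e-12).log()).sum(dim=-1)
    return row_entropy.mean().item()

# Toy example: uniform attention has maximal entropy, peaked attention has low entropy.
uniform = torch.full((8, 8), 1.0 / 8)
peaked = torch.eye(8) * 0.93 + 0.01
peaked = peaked / peaked.sum(dim=-1, keepdim=True)
print(average_attention_entropy(uniform), average_attention_entropy(peaked))
```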

###[Q3 - Can GraphSAM be compared with other efficient SAM variants? — Yes, we have compared with them, and GraphSAM is both efficient and high performing.]

We have compared with other efficient SAMs (e.g., LookSAM, AE-SAM, and RST) in Table 2 of the main paper and Table 7 of Appendix A.5. We observe that GraphSAM is both efficient and high-performing on molecular graphs, while the other efficient SAMs deliver poor generalization results. The details can be found in the paper.

###[Q4 - Is the proportionality in Theorem 2 not merely caused by the first-order Taylor approximation? How do the authors know that the linear term is a sufficient approximation? — Yes, we leverage the first-order Taylor approximation, but it is accurate enough.]

The proportionality in Theorem 2 is indeed analyzed mainly according to the first-order Taylor approximation. We are aware that although the higher-order terms could offer a more nuanced view, the first-order term often dominates in practical scenarios, making it a sufficient approximation for the purpose of our analysis. Take the second-order Taylor expansion as an example:

$L_{G}(\theta+\hat{\epsilon}) \approx L_G(\theta) + \hat{\epsilon}^{\top}\nabla_\theta L_G(\theta) + \frac{1}{2}\hat{\epsilon}^{\top}\mathbf{H}(\theta)\,\hat{\epsilon},$

where $\mathbf{H}(\theta)$ is the Hessian matrix (i.e., the matrix of second-order derivatives) of the loss with respect to $\theta$, and $\hat{\epsilon}$ is the mapped perturbation gradient obtained by Eq. (2). Since $\|\hat{\epsilon}\|_{2}$ is quite small, typically on the order of $10^{-3}$, the second-order term and all higher-order terms can be omitted. Hence the first-order Taylor expansion is a sufficient approximation.
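As a rough order-of-magnitude check (illustrative numbers: the $10^{-3}$ scale is from the statement above, and the $O(1)$ Hessian norm is an assumption):

$$\Big|\tfrac{1}{2}\,\hat{\epsilon}^{\top}\mathbf{H}(\theta)\,\hat{\epsilon}\Big| \le \tfrac{1}{2}\,\|\mathbf{H}(\theta)\|_2\,\|\hat{\epsilon}\|_2^{2} \approx \tfrac{1}{2}\cdot 1\cdot(10^{-3})^{2} = 5\times 10^{-7},$$

which is negligible compared with the first-order term $\hat{\epsilon}^{\top}\nabla_\theta L_G(\theta)$, whose magnitude scales with $\|\hat{\epsilon}\|_2 \approx 10^{-3}$.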

Comment

I thank the authors for the careful and elaborate response.

My points have been addressed. The authors, though, could elaborate more in the main part why GraphSAM is only a good approximation for molecular tasks.

Comment

We greatly appreciate your careful review and would like to respond to the questions and weaknesses you raised one by one.

###[W1 & Q1 - Is GraphSAM only a good approximation for graph transformers? How does GraphSAM work, e.g., in the image domain? — Yes, GraphSAM only works well in the molecular graph domain.]

We found experimentally that GraphSAM is a good approximation for graph transformers rather than for models in the CV domain. Although our proposed method is designed based on empirical observations, Observations 1 and 2 appear only in graph transformers and do not appear in CV models [1]. In CV, the model's perturbation gradient variations are often consistent with the updating gradients, which is not the case on molecular graphs.

To further address this concern and provide a holistic discussion, we add the following experiments to compare the performance of LookSAM [1], AE-SAM, and GraphSAM on CV and molecular datasets, respectively. LookSAM and GraphSAM are efficient SAMs designed based on the gradient variation patterns of models in their respective domains, while AE-SAM is an adaptively updating efficient method that compares the gradient norm against a pre-defined threshold.

| ResNet-18 (CV) | CIFAR-10 | CIFAR-100 |
|---|---|---|
| +SGD | 95.41 | 78.21 |
| +SAM | 96.53 | 80.18 |
| +LookSAM | 96.28 | 79.91 |
| +AE-SAM | 96.49 | 80.31 |
| +GraphSAM | 95.86 | 78.69 |

| CoMPT (Graph) | Tox21 | Sider | ClinTox |
|---|---|---|---|
| +Adam | 82.81 | 62.17 | 91.41 |
| +SAM | 83.96 | 64.33 | 92.73 |
| +LookSAM | 82.55 | 62.54 | 91.64 |
| +AE-SAM | 83.33 | 63.16 | 91.92 |
| +GraphSAM | 84.11 | 64.58 | 93.78 |

We observe that LookSAM performs better on the image datasets but not on the molecular graphs investigated in this work. Conversely, our GraphSAM works well on molecules but gives the worst performance on the image benchmarks. This is because model gradient variations differ across domains, and it is challenging to transfer efficient SAMs designed around domain-specific gradient patterns. AE-SAM achieves acceptable performance on the molecular graphs since it monitors the perturbation gradient norms of graph transformers to decide when gradient re-computation is necessary, but it is not as good as GraphSAM, which accurately fits the perturbation gradients at each step. In summary, GraphSAM is optimized particularly for graph transformers based on the empirical observations of gradient variations.

###[W2 - The conclusion of Theorem 1 is not clear. — We have added the following new explanation.]

Thank you for your constructive feedback regarding the theoretical aspects of our paper. We hope the following revision addresses your concerns. The revised sentence is: "Recalling the min-max optimization problem of SAM in Eq. (1), if we replace the inner maximum objective from $L_{G}(\theta + \epsilon^{S})$ to $L_{G}(\theta + \epsilon^{G})$, the graph transformer is motivated to smooth a worse neighborhood loss in the loss landscape. In other words, the proposed GraphSAM aims to minimize a rougher neighborhood loss, whose value is theoretically larger than that of SAM, and thereby obtain a smoother landscape associated with the desired generalization."

###[W3 - Proof of Theorem 1 relies on empirical observations and strong assumptions. — Agreed, we have now changed it.]

Thanks for your suggestion; we have uniformly changed 'Theorem' to 'Conjecture'. A conjecture is a statement or conclusion based on partial evidence rather than rigorous proof. We have rewritten Theorem 1 to include the experimental assumption that supports our conclusion.

The new Conjecture 1 is: Let $\hat{\epsilon}_{S}$ and $\hat{\epsilon}_{G}$ denote the perturbation weights of SAM and GraphSAM, respectively, where we omit the subscript $t$ for simplicity. Suppose that $\omega/\|\omega\| \gg \epsilon$, as empirically discussed in Observation 1, and $|\hat{\epsilon}_{S}| < |\hat{\epsilon}_{G}|$ for $\rho > 0$, designating $\hat{\epsilon}_{S}$ as the ground truth. We have:

$$L_{G}(\theta+\hat{\epsilon}_{S}) \leq L_{G}(\theta+\hat{\epsilon}_{G}).$$

References:

[1] Yong Liu, Siqi Mai, Xiangning Chen, Cho-Jui Hsieh, Yang You. Towards efficient and scalable sharpness-aware minimization. CVPR 2022.

Comment

Dear Reviewer QvPb,

We sincerely appreciate your thoughtful feedback. In response to your suggestion, we elaborate in Observation 1 of the main text on why GraphSAM is only a good approximation specifically for molecular tasks. Meanwhile, we compare GraphSAM with other efficient SAMs designed for the CV domain and analyze their limitations in the molecular graph domain in detail as shown in Appendix A.4.

Best regards,

The Authors of "Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models"

Comment

We thank all the reviewers for their time and effort in helping us improve the quality of the paper. We are glad that all reviewers agreed that our method, GraphSAM for graph transformers, is efficient and effective. All reviewers also recognized the contribution of GraphSAM in the domain of molecular graphs.

We have updated the paper to incorporate the constructive suggestions. We summarize the major changes below; the changed text is marked in blue in the main manuscript:

  1. [QvPb] Updated the description and explanation of Theorem 1. Also changed the word 'Theorem' to 'Conjecture' in the main manuscript.
  2. [QvPb, 1va4, sT1K] Additional experimental observations of the perturbation gradient $\epsilon$ and the updating gradient $\omega$ in Appendix A8.
  3. [QvPb] Additional attention heat map experiments with Adam, SAM, and GraphSAM in Appendix A8.
  4. [QvPb, mSQh] Additional experiments of GraphSAM in CV domain in Appendix A8.
  5. [QvPb, 1va4, mSQh] Modifications to Figure 1 and Table 1 and some writing issues of the main manuscript.
AC Meta-Review

By incorporating lessons from an empirical analysis of the optimization dynamics of graph transformers and SAM on molecular structure prediction tasks, this paper develops a new variant of SAM suitable for this domain. The new variant (GraphSAM) exploits the observed gradient dynamics to improve the computational efficiency of SAM. The proposed algorithm is supported by some theoretical arguments (although under strong assumptions), and the empirical verification of the method is reasonably thorough (multiple datasets and an ablation study). The authors have also shared the code of their algorithm. The paper is an interesting example of leveraging domain-specific knowledge to design a new variant of SAM.

The reviewers find the approach interesting and the empirical performance impressive. There were some clarification questions during the rebuttal, to which the authors replied with additional experiments (including results of GraphSAM and other methods on vision tasks) and revised parts of the paper to address these concerns. All reviewers are leaning toward accept. Reviewer 1va4 responded that "The concerns are discussed in detail, and addressed". However, I did not see a change in their score, and they were non-responsive despite reminders, so it is unclear whether they were willing to increase their initial score.

The paper is borderline accept, but given that the reviewers' recommendations are all on the accept side, that they found most of their concerns addressed, and that the paper presents strong empirical performance, I recommend accept. Please incorporate the feedback from the reviewers and the clarifications provided in the rebuttal into the final version of the paper.

Why not a higher score

The theoretical arguments could have been more rigorous.

Why not a lower score

The reviewers' recommendations are consistently on the accept side, and the empirical results of the paper are impressive.

Final Decision

Accept (poster)