PaperHub

Average rating: 4.8 / 10 · Decision: Rejected · 4 reviewers
Ratings: 3, 6, 5, 5 (min 3, max 6, std 1.1) · Average confidence: 3.3
Venue: ICLR 2024

Promoting Exploration in Memory-Augmented Adam using Critical Momenta

Submitted: 2023-09-21 · Updated: 2024-02-11
TL;DR

We propose a memory-augmented adaptive optimizer that promotes exploration of the loss landscape and finds flatter solutions.

Abstract

Keywords
Adaptive optimization, deep learning, memory-augmented optimizers, momentum

Reviews and Discussion

Official Review
Rating: 3

The paper addresses the challenge of poor generalization performance in adaptive gradient methods like Adam, hypothesizing that adding an exploration strategy with additional memory of gradient information could improve performance. Building on previous work, which used a memory buffer for storing "critical gradients," the authors propose a new memory-augmented Adam optimizer that stores "critical momenta" (CM). This new approach aims to prevent gradient cancellation and allows the optimizer to escape sharp minima and converge to flat regions which are more likely to generalize. The authors validate their approach through convergence analysis and empirical benchmarks, showing improvements in model performance in both supervised and online learning settings.
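To make my reading of the mechanism concrete, here is a rough Python sketch of how a critical-momenta buffer might plug into an Adam-style update. This is not the authors' implementation: the names (`CriticalMomentaBuffer`, `adam_cm_step`), the mean aggregation of buffered momenta, and the 50/50 blend with the current momentum are assumptions made purely for illustration.

```python
import heapq
import torch

class CriticalMomentaBuffer:
    """Fixed-capacity buffer of momenta keyed by gradient norm (illustrative, not the authors' code)."""
    def __init__(self, capacity=5, decay=0.7):
        self.capacity, self.decay = capacity, decay
        self.heap = []            # min-heap of (priority, insertion_id, momentum)
        self._counter = 0

    def maybe_add(self, grad_norm, momentum):
        # Decay stored priorities so stale momenta are eventually replaced.
        self.heap = [(p * self.decay, i, m) for p, i, m in self.heap]
        heapq.heapify(self.heap)
        entry = (grad_norm, self._counter, momentum.detach().clone())
        self._counter += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, entry)
        elif grad_norm > self.heap[0][0]:     # replace the lowest-priority momentum
            heapq.heapreplace(self.heap, entry)

    def aggregate(self):
        # One simple aggregation: the mean of the buffered momenta (an assumption, not the paper's rule).
        return torch.stack([m for _, _, m in self.heap]).mean(dim=0)


def adam_cm_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One Adam-style step whose first moment is blended with the aggregated buffer contents."""
    m, v, buf, t = state["m"], state["v"], state["buffer"], state["t"] + 1
    m = betas[0] * m + (1 - betas[0]) * grad
    v = betas[1] * v + (1 - betas[1]) * grad * grad
    buf.maybe_add(grad.norm().item(), m)
    m_eff = 0.5 * (m + buf.aggregate())       # 50/50 blend is an illustrative choice, not the paper's formula
    m_hat = m_eff / (1 - betas[0] ** t)
    v_hat = v / (1 - betas[1] ** t)
    new_param = param - lr * m_hat / (v_hat.sqrt() + eps)
    state.update(m=m, v=v, t=t)
    return new_param, state
```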

Strengths

  • The paper introduces a new concept of storing "critical momenta" in a memory buffer, aiming to overcome the limitations of previous memory-augmented optimizers.
  • Intuitions for why this approach is helpful are built by analyzing toy functions.
  • The method has potential applications in both supervised and online learning, showing empirical improvements over vanilla Adam and its combinations with CG/SAM.
  • The paper is written with a clear structure and logical flow.

Weaknesses

My main concerns are regarding the memory cost of the approach and whether these costs are justified given the plethora of other alternatives to Adam. The paper does not sufficiently discuss the computational overhead introduced by the memory component, which is critical for practical deployments, especially in resource-constrained environments.

The experimental results indicate some improvements, but these could potentially be attributed to the additional use of memory and hyperparameters. It would be beneficial for the paper to discuss whether Adam+CM offers any regret guarantees or optimality conditions given the added computational complexity, i.e., could another algorithm do better?

One potential alternative baseline that comes to mind is to consider averaging the weights via approaches such as Polyak averaging or Stochastic Weight Averaging. These methods have been shown to improve generalization performance and only require one additional copy of the parameters if one employs the exponential moving average trick. The paper could be strengthened by comparing Adam+CM to these alternatives--perhaps the two approaches are orthogonal.
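As a concrete reference for the "one additional copy of the parameters" point, a minimal sketch of the exponential-moving-average trick (the names and the decay value are illustrative, not taken from the paper under review):

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    """Keep an exponential moving average of the weights; only one extra parameter copy is needed."""
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1 - decay)

# Usage sketch: after (or during) training, evaluate ema_model instead of model.
# ema_model = copy.deepcopy(model)
# for batch in loader:
#     ...                              # regular optimizer step on model
#     update_ema(ema_model, model)
```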

Questions

  • I would consider additional measures of sharpness as well. For instance, it seems common to consider $\lambda_{max}/\lambda_5$ (Jastrzebski et al. (2020)) or measures of how quickly the loss changes in a local neighborhood. The interaction between sharpness, distance travelled, and generalization performance seems quite complicated generally, as it is also a function of the learning rate and other hyperparameters. This should be made clear in Figure 7 and Figure 8.
  • In general, as mentioned in weaknesses, I would be interested in a more thorough study on how additional memory can best be used. It seems like C is generally set to 5 in the grid search -- can we get away with a smaller C? What if we used different values for CG and CM?
Comment

We would like to thank the reviewer for the positive assessment of our paper, in particular regarding how it overcomes previous limitations, the intuitions built for the approach, and its potential applications. We also appreciate the comments and provide our responses below:

Reviewer’s Comment: “The paper does not sufficiently discuss the computational overhead introduced by the memory component … whether Adam+CM offers any regret guarantees or optimality conditions given the added computational complexity”

Response: We agree that our method introduces a memory cost. However, many deep learning applications still involve relatively few parameters; in these cases, our method can deliver exploration advantages and better generalization. We also emphasize that our findings have broader implications beyond buffer utilization: integrating memory-augmented optimizer concepts demonstrates how Adam can effectively navigate the loss surface and discover flatter regions. In essence, our work introduces a new perspective on optimization focused on efficient exploration of the loss surface, and tackling the memory issue is a promising direction for future work. We would also like to highlight that our method does not introduce more hyper-parameters than the existing memory-augmented method CG. Furthermore, our grid search experiments, detailed in Appendix A.2, indicate that, unlike CG, our method consistently performs best with C=5 and decay=0.7 across all cases. This observation suggests that a grid search over these hyperparameters may not be necessary in real-world scenarios.
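To make the memory overhead concrete, below is a rough back-of-the-envelope estimate (not taken from the paper) of the extra storage a buffer of C momenta requires; the parameter counts in the example comments are approximate and purely illustrative.

```python
def cm_buffer_overhead_mb(num_params, capacity=5, bytes_per_value=4):
    """Extra memory (in MiB) for storing `capacity` fp32 momentum copies; activations and Adam's own state excluded."""
    return capacity * num_params * bytes_per_value / (1024 ** 2)

print(cm_buffer_overhead_mb(11e6))   # e.g., a ~11M-parameter ResNet-18 -> roughly 210 MiB
print(cm_buffer_overhead_mb(66e6))   # e.g., a ~66M-parameter transformer -> roughly 1260 MiB
```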

Reviewer’s Comment: “I would consider additional measures of sharpness as well. … The interaction between sharpness, distance travelled, and generalization performance seems quite complicated generally as it is also a function of the learning rate and other hyperparameters. This should be made clear in Figure 7 and Figure 8”

Response: As suggested by the reviewer, we ran our CIFAR10/100 experiments with another definition of sharpness, defined in [1], that measures how quickly the loss changes in a local neighborhood. We track m-sharpness while training ResNet34 with different optimizers under the same hyper-parameter setup. We draw similar conclusions as presented in Figure 8 (top-left), i.e., Adam+CM has lower m-sharpness than the other baselines. We have added these results in Appendix A.4.3. We agree that the interaction between sharpness, distance travelled, and generalization performance is a function of the learning rate and other hyperparameters. However, since the plots in Figure 8 correspond to the best-performing setup for each optimizer, these metrics indicate why Adam+CM achieves better overall performance. We wanted to highlight that it is the exploration property of Adam+CM, rather than a particular hyperparameter setup, that finds flatter regions and obtains the best performance. We have made this clearer in Fig 7 and Fig 8 in the revision.

[1] Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.
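For concreteness, the sketch below shows one way such a neighborhood-based sharpness proxy could be approximated, using a single SAM-style ascent step of radius rho. It is only an illustrative approximation (the function name, the rho value, and the single-step ascent are assumptions), not the exact protocol behind the numbers in Appendix A.4.3, and it assumes the model is in eval mode.

```python
import torch

def m_sharpness_estimate(model, loss_fn, batches, rho=0.05):
    """Rough m-sharpness proxy: mean loss increase after one SAM-style ascent step of radius rho."""
    params = [p for p in model.parameters() if p.requires_grad]
    increases = []
    for x, y in batches:                              # each batch plays the role of one m-sized subset
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        grad_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads)) + 1e-12
        with torch.no_grad():
            for p, g in zip(params, grads):           # move weights to the (approximate) worst point nearby
                p.add_(rho * g / grad_norm)
            perturbed_loss = loss_fn(model(x), y)
            for p, g in zip(params, grads):           # undo the perturbation
                p.sub_(rho * g / grad_norm)
        increases.append((perturbed_loss - loss).item())
    return sum(increases) / len(increases)
```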

Reviewer’s Comment: “consider averaging the weights via approaches such as Polyak averaging or Stochastic Weight Averaging … The paper could be strengthened by comparing Adam+CM to these alternatives--perhaps the two approaches are orthogonal.”

Response: We appreciate the suggestions made by the reviewer about comparing Adam+CM with alternatives like Polyak averaging or Stochastic Weight Averaging. However, as also pointed out by the reviewer, these two approaches are orthogonal and could seamlessly be integrated to improve performance further.

Reviewer’s Comment: “Can we get away with smaller C? What if we used a different value for CG and CM?”

Response: As mentioned earlier, CM consistently performs optimally with C=5 and decay=0.7 across all cases, unlike CG which may require a larger C. However, upon the suggestion made by the reviewer, we reduced C further in CM, and we performed the same grid search. We observed that with a smaller C, the validation performance in CIFAR10/100 experiments slightly reduced. We have included the results in the Appendix where we show a sensitivity analysis using buffer size on CIFAR10/100 experiments.

In conclusion, we truly value your reviews. We hope that the latest revisions made in our submission meet the high standards expected by the conference. If we have adequately responded to your concerns and questions, we kindly ask for your consideration in increasing the score for our submission. However, if you still have any additional queries or concerns, please let us know.

Official Review
Rating: 6

This paper introduces a memory-augmented version of Adam named Adam+CM, which maintains a buffer of critical momenta to enhance exploration and avoid sharp minima during training. The authors have conducted theoretical convergence analysis for the proposed method. Furthermore, they provided empirical results to substantiate the claims regarding the properties of the method. By presenting results in online learning and supervised learning scenarios (NLP + CV), the authors demonstrate that the proposed method achieves superior generalization performance compared to baseline methods.

Strengths

  • The proposed method is well-motivated, and to the best of my knowledge, it presents a novel approach to optimization.
  • The presentation is lucid and easy to follow.
  • The paper offers comprehensive evidence to back the primary claims, including theoretical analysis, numerical simulation results with toy examples, and real-world practical experiments.

Weaknesses

  • (minor) The paper would be more convincing if larger-scale empirical experiments on NLP were conducted.

Questions

  • As a proposed deep learning optimization method, it would be more convincing if comparisons were made on larger-scale language model optimization tasks as well.
Comment

We would like to express our gratitude to the reviewer for the positive evaluation of our paper regarding its novelty, clarity, and the supporting evidence for the primary claims in our paper. We also appreciate the comment and provide our response below:

Reviewer’s Comment: “ The paper would be more convincing if larger-scale empirical experiments on NLP were conducted.”

Response: We would like to thank the reviewer for this suggestion. Experiments involving larger-scale NLP often come with the caveat of a sophisticated LR scheduler designed particularly for AdamW. This LR scheduler has a huge impact on the optimization trajectory, and any modification of the optimization process may result in sub-optimal performance. Experiments with Adam+CM on transformers may, therefore, require a careful exploration of LR scheduling, a consideration that falls beyond the scope of our paper. In contrast, our primary objective in this work is to present a new perspective on optimization. We focus on integrating exploration capabilities into adaptive optimizers, which could result in finding flatter solutions and improved generalization.
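For context, the scheduler referred to here is typically a linear warmup followed by cosine decay; a minimal sketch of such a schedule is shown below, with all numeric values chosen purely for illustration (they are not taken from any experiment in the paper).

```python
import math

def warmup_cosine_lr(step, base_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=1e-5):
    """Linear warmup followed by cosine decay, the kind of schedule commonly paired with AdamW."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```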

If your concern has been addressed, we would appreciate it if you could support our work by increasing your score. If there are more questions/concerns, please let us know.

Comment

I am grateful for the authors' detailed response. As my expertise does not primarily lie in this specific field of research, I will rely on the insights and evaluations of the other reviewers, who possess a more profound understanding and confidence in this domain.

Official Review
Rating: 5

Building upon the framework of critical gradients, which maintains a buffer containing a limited history of gradients from previous iterations, this paper takes a similar approach by maintaining a buffer of previous momenta, which they call critical momenta (CM). The proposed method is expected to effectively escape sharp basins and converge to flat loss regions. They also investigate the effectiveness of combining CM with sharpness-aware minimization (SAM).

Strengths

The preliminary experiments using artificial loss functions such as the Goldstein-Price loss function and the Ackley loss function show a clear advantage of the CM variants in exploring the loss surface and finding lower-loss regions near the global solution.

Weaknesses

Although the experiments using artificial loss functions are promising, the results on actual deep neural networks are not as good. The fact that both CM methods overfit after convergence for the LSTM on PTB in Figure 6 raises a major concern and contradicts the main claim of the paper that CM methods can find flatter minima.

The scores reported in this paper seem to be considerably lower than what is reported on "Papers with Code". For example, EfficientNet-B0 on ImageNet should achieve a top-1 validation accuracy of 76.3% [https://paperswithcode.com/paper/efficientnet-rethinking-model-scaling-for], but the results reported in the paper range between 65.4-71.7%. Also, Adam+SAM performs very poorly in this specific case and is significantly worse than the original Adam, but no explanation is given for this poor performance.

It would seem like the current experiments would be strongly affected by the batch size, but the batch size is not included in their hyperparameter search.

Questions

The escape ratio in Figure 5 is monotonically increasing with the sharpness coefficient. What happens if it is increased further?

In Figure 6, the validation perplexity of Adam+CM shows a peculiar behavior where it initially drops very fast but gradually increases and eventually crosses over with Adam+SAM. Is there any explanation as to why Adam+CM behaves this way when it is supposed to explore flatter regions compared to other optimizers?

Why is Adam+SAM performing so poorly for EfficientNet-B0 on ImageNet in Table 2?

Comments

The correspondence between the two plots in Figure 4 would be more obvious if the color of the lines were matched.

Comment

We would like to thank the reviewer for the evaluation of our paper, particularly for the positive assessment of the results on the toy examples. We also appreciate the comments and provide our responses below:

Reviewer’s Comment: “CM methods overfit after convergence for LSTM”

Response: Thank you for your attentive review of our submission. Yes, for the LSTM, the optimizers seem to suffer from over-fitting. A similar phenomenon was previously observed in the critical gradients paper [1]. Therefore, as one popular way to control aggressive exploration, we reduce the learning rate in this experiment. We apply a similar technique to the other optimizers as well, which results in further improvements in their performance. However, as observed in Figure 6, Adam+CM still outperforms the baselines by converging faster to a lower validation perplexity.

[1] McRae, P. A., Parthasarathi, P., Assran, M., & Chandar, S. (2022). Memory Augmented Optimizers for Deep Learning. Proceedings of International Conference on Learning Representations.

Reviewer’s Comment: “EfficientNet-B0 on ImageNet should achieve a top-1 validation accuracy of 76.3%”

Response: We appreciate your attention to the results presented in our submission. Please note that EfficientNet-B0 on ImageNet achieves a top-1 validation accuracy of 76.3% using SGD with momentum. While we use the same setup that reproduces these results, we replace the SGD with Adam and other baselines for comparison with Adam+CM. This is because our objective is to specifically improve Adam by integrating exploration capabilities into it. This integration aims to result in the discovery of flatter solutions and improved generalization.

Reviewer’s Comment: “batch size is not included in their hyperparameter search”

Response: We appreciate your recommendation for including batch size in the hyper-parameter search. We have added a section in Appendix (A.4.1) focused on comparing different batch sizes, particularly on CIFAR10/100 experiments. In summary, the results indicate that the reported batch size (=64) used in our experiments indeed results in the best performance when compared with other batch sizes.

Reviewer’s Comment: “Fig 5: what happens if sharpness coefficient is increased further”

Response: As we increase the sharpness coefficient further, all the optimizers tend to escape the sharp minimum more often, and hence they all eventually achieve an escape ratio of 1. Through Fig 5, however, we wanted to highlight that the sharpness of the minimum does not need to be high for Adam+CM, since it is able to explore the loss landscape better by escaping such minima.

In conclusion, we truly value your reviews. We hope that the revisions and clarifications will influence and improve your overall opinion. If we've managed to resolve your principal concerns and questions, we would appreciate it if you could further support our work by increasing your score. On the other hand, if there are remaining issues or questions on your mind, we're more than willing to address them.

Official Review
Rating: 5

This paper presents a memory-augmented variant of the Adam optimizer that maintains a memory of critical momenta. The proposed method aims to address the gradient cancellation problem of Adam with critical gradients, encouraging exploration toward flatter minima. It is shown to improve performance across various tasks, including language modeling and image classification, by allowing the optimizer to escape sharp basins and converge in flatter regions.

Strengths

The paper is well-written and clear, and uses both toy examples and real-world problems to demonstrate the effectiveness of the proposed method.

Weaknesses

  1. Prioritizing momenta with large gradient norms is not well motivated. It is unclear if and how this choice affects the performance of the method.
  2. The paper seems to associate flat minima with global minima or lower loss values in some of the toy examples, which is not always the case in practice. Indeed, the training loss of a flat minimum is usually higher than that of a sharp minimum on real-world datasets.
  3. The proposed method is partly motivated by the performance gap between SGD and Adam, but is not compared to SGD in the experiments. Moreover, it would be interesting to see if critical momenta improve the performance of SGD as well.
  4. The baselines are not particularly strong, and it is a bit strange that Adam+SAM performs almost the same or even worse than vanilla Adam on image classification tasks.
  5. The results shown in Fig. 5 could be sensitive to hyperparameters such as learning rate, which is not discussed in the paper.

Questions

See above.

Comment

We would like to thank the reviewer for the evaluation of our paper and for recognizing its clarity and the effectiveness of the proposed method. We also appreciate the comments and provide our responses below:

Reviewer’s Comment: “Prioritizing momenta with large gradient norms is not well motivated.”

Response: Thank you for pointing out the need to explicitly motivate the use of gradient norms as priorities. The intuition for prioritizing large gradient norms comes from data-pruning approaches, which are also discussed in recent works. For example, [1] mentions that by using a large gradient norm, their approach identifies a smaller set of gradients to store that aids the optimization process. Furthermore, [2] identified that samples with larger gradient norms can be used to identify a smaller subset of the training data that is important for generalization, and hence prune the data as a pre-processing step.

[1] McRae, P. A., Parthasarathi, P., Assran, M., & Chandar, S. (2022). Memory Augmented Optimizers for Deep Learning. Proceedings of International Conference on Learning Representations.

[2] Paul, M., Ganguli, S., & Dziugaite, G. K. (2021). Deep learning on a data diet: Finding important examples early in training. Advances in Neural Information Processing Systems, 34, 20596-20607

Reviewer’s Comment: “The paper seems to associate flat minima with global minima or lower loss values in some of the toy examples, which is not always the case in practice. Indeed, the training loss of a flat minimum is usually higher than that of a sharp minimum on real-world datasets.”

Response: While we agree that a flat minimum doesn't always result in low loss values, several studies show that flat loss surfaces generalize better [3], [4]. This phenomenon is also seen in our experiments where we show a decrease in sharpness as well as improved validation performance on CIFAR10/100 datasets simultaneously. Moreover, we also show how changing hyper-parameter C affects the sharpness of the solution.

[3] Foret, P., Kleiner, A., Mobahi, H., & Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412.

[4] Andriushchenko, M., & Flammarion, N. (2022). Towards understanding sharpness-aware minimization. International Conference on Machine Learning (pp. 639-668). PMLR.

Reviewer’s Comment: “It would be interesting to see if critical momenta improve the performance of SGD as well.”

Response: The focus of our paper is on improving adaptive optimizers (particularly Adam). In line with previous work on the topic, our analysis of existing optimizers has revealed that the poor generalization of adaptive optimizers in settings like image classification stems from a limited exploration of the loss surface. Our proposed method Adam+CM is a way to promote exploration while preserving the fast adaptive nature of such optimizers. Thus, we evaluate Adam+CM by comparing it with existing variants of Adam.

Although we think this comparison extends beyond the paper’s intended scope, we ran an experiment with a non-adaptive optimizer in response to the reviewer’s suggestion. We perform a hyperparameter grid search on CIFAR10/100 experiments and report the results in Appendix A.5. We observed that SGD, SGD+CG, and SGD+CM perform similarly but there is a speed up by both memory-augmented optimizers.

Reviewer’s Comment: “Adam+SAM performs almost the same or even worse than vanilla Adam on image classification tasks.”

Response: This is indeed what we consistently observed in our experiments. Note that SAM was initially introduced to enhance the flatness of solutions in SGD. To the best of our knowledge, the combination of Adam and SAM, i.e. Adam+SAM, has not been widely adopted in practice by the community for image classification in ResNet-based models. Given that our approach aims to promote exploration to identify flatter solutions, it was nevertheless a logical choice to use the addition of SAM to Adam as a baseline for comparison.

Reviewer’s Comment: “The results shown in Fig. 5 could be sensitive to hyperparameters such as learning rate.”

Response: The experiments associated with the toy examples in Fig 5 correspond to the same learning rate (=0.1) for comparison of our proposed optimizer with the baselines. This learning rate was obtained by performing a grid search with set [0.1, 0.01, 0.001, 0.0001]. We have included these hyper-parameters details in the appendix of the revision.

In conclusion, your review has been valuable in improving the quality and clarity of our research. If your major questions and concerns have been addressed, we would appreciate it if you could support our work by increasing your score. If there are more questions/concerns, please let us know.

Comment

We highly appreciate the valuable comments and feedback from all the reviewers. We are grateful for the acknowledgment from Reviewers Jwku, GnW1, and qx25 regarding the clear and logical flow of content, the novelty of the proposed approach by Reviewers GnW1 and qx25, the insightful analysis using toy examples by Reviewers V8UZ, qx25, GnW1, improved results as noted by Reviewer Jwku and GnW1, and the comprehensive analysis on complex datasets highlighted by Reviewer GnW1. Special thanks to Reviewer GnW1 for recognizing the theoretical analysis and potential applications of our proposed optimizer in both supervised and online learning scenarios, as also noted by Reviewer qx25.

Based on general feedback, we would like to highlight that the primary contribution of our paper is to approach the optimization problem from the perspective of the loss surface exploration issue. Our proposed method, Adam+CM, draws inspiration from memory-augmented optimizers. Not only does it promote exploration, but it also tackles the gradient cancellation problem observed in the previous baseline (CG) and identifies a flatter solution compared to the well-known SAM. Ours is the first work that proposes a controlled exploration method for the loss landscape designed especially for adaptive optimizers.

We address specific concerns and present a revised manuscript, with the following changes highlighted in blue:

  • As suggested by Reviewer qx25, we added a line at the end of Section 5.2, clarifying that the results in Fig. 7 and Fig. 8 are associated with the best-performing setup for the given optimizers, indicating how exploration of loss landscape by Adam+CM helps.
  • We also added Table 12 in the Appendix to report the best hyperparameter setup obtained by grid search on toy examples (Reviewer Jwku).
  • We added a section in Appendix (A.4) that includes:
    • Sensitivity analysis (Fig 16) with different batch sizes and smaller buffer sizes as suggested by Reviewer V8UZ and Reviewer qx25 respectively.
    • We showed the improvement in the flatness of the solution using another definition of sharpness in Fig 17 as suggested by Reviewer qx25.
  • We added another section in Appendix (A.5) to present experiments with non-adaptive optimizer SGD as suggested by Reviewer Jwku.
AC Meta-Review
  • The study tackles the issue of subpar generalization in adaptive gradient techniques like Adam, suggesting that incorporating a strategy that retains historical gradient information might enhance results. Building on prior work that uses a memory buffer to preserve "critical gradients," the submission introduces a memory-augmented Adam optimizer that instead archives "critical momenta" (CM). This method enables the optimizer to evade sharp minima and gravitate towards flatter areas, which tend to generalize better.
  • However, it contains several flaws:
    • The experimental results are not quite convincing. For example, the LSTM experiments exhibit over-fitting problems, and the reported baseline results are much lower than the original ones. Although the authors explain the reasons in the response, baselines set up by the authors themselves (such as replacing SGD with Adam on EfficientNet) are likely to have suboptimal hyperparameters.
    • The motivation may be improper in some cases. The submission claims that non-adaptive methods always generalize better, which is not well supported. For example, the current SoTA models on ImageNet-1K are typically trained with adaptive methods, even ResNet-based models. As shown in the results (Adam+SAM) presented in the paper and pointed out by the reviewer, finding flat minima based on Adam does not seem to bring significant gains.
    • The memory cost is hard to accept, especially for large-scale models, which limits the application scenarios of this method.

Why not a higher score

The current version has too many weaknesses that would be hard to fix, especially the weak experimental results and the high memory cost.

Why not a lower score

N/A

Final Decision

Reject