Do Mice Grok? Glimpses of Hidden Progress in Sensory Cortex
We find evidence for rich feature learning in mouse piriform cortex during overtraining and propose it is driven by approximate margin-maximization, a known cause of grokking in deep learning.
Abstract
Reviews and Discussion
The paper demonstrates that animals can learn tasks and improve generalizability through over-training. The study begins by reproducing and reanalyzing experiments conducted on mice (Berners-Lee et al., 2023) and then proceeds to run toy experiments as proof-of-concept illustrations.
Strengths
The paper addresses interesting topics at the intersection of foundation models and cognitive science, particularly within the realm of behavioral psychology.
Weaknesses
1. Writing Quality:
- Citations: The poor writing significantly hinders the evaluation of this work. For instance, the authors misuse \citet and \citep throughout the paper. The frequency of these errors affects the paper's professionalism; please revise these accordingly.
- Confusing Notation: Although Berners-Lee et al. (2023) represent each odor as a vector in their figures, the description "n-hot vector of possible odorants" may confuse readers. Given that ICLR's primary audience is in machine learning, they may interpret this as a traditional vector representation, though in the authors' experiment it represents a combination of odorants rather than a vector.
- Undefined Notations: Several notations lack definitions (e.g., [N, T] in line 755) or exact values (e.g., n in the n-hot vector).
2. Validity of Experimental Setting:
The results could be heavily dependent on the experimental setting. In the default setup, each odor is represented as a combination of odorants, defined by an n-hot vector, with limited overlap between odorants. Given the network's simple architecture (a single fully connected layer), each odorant is almost linearly separable, meaning there is minimal interference among components. This could explain why the model resists overfitting and maintains increasing test accuracy even during overtraining. Conversely, as seen in Figure 6, when odor overlap is greater, neither the Fisher Linear Discriminant nor the test accuracy increases significantly beyond 100 epochs, at which point the model reaches perfect training accuracy.
3. Loss Interpretation:
The authors claim in line 370 that after achieving perfect training accuracy, the loss "plateaus." However, it appears to continue decreasing (though this could be subjective). Additionally, perfect training accuracy does not imply the model has fully learned from the training data, especially since cross-entropy loss is used, and accuracy is based on the argmax classification.
4. Miscellaneous
- Lines 457–484: This section needs refinement. It is unclear how the original features differ from pre-trained features (L461–463), what the relationship is between the quantities in L465, and how the original model's readout contrasts with the randomly initialized network (L482).
- Figure 1: The caption refers to (a) and (b), but these labels are absent in the figure.
- Line 355: "are are" should be corrected to "are."
Questions
I am unclear about the contribution of the section "A mathematical model based on rich learning" on page 9. If the authors use a kernel method, isn’t the result somewhat trivial? Furthermore, the logical connection between the content of this section and the conclusion — “An animal overtrained on a task will have better task-model alignment and a richer feature set than an animal whose learning was stopped after behavioral plateau” — is unclear. Could the authors elaborate on this section?
Thank you for the review, and we're sorry to hear you disliked our work. We've made substantial changes in response to your feedback. To clarify: this is not a work on "the intersection of foundation models and cognitive science" -- the paper is neither centrally about foundation models nor cognitive science. Instead, it argues that dynamics in existing neural data from genuine biological networks exhibit close similarities with a phenomenon hitherto seen and studied only in deep learning, presents new evidence for this case, and constructs and interprets simple machine learning models to flesh out a theoretical basis for the argument.
Writing Quality
We've made sure to make correct use of citations throughout, and have clarified vector notation as well as made sure to define notation and exact values where relevant, including in all the places you identified.
Validity of the experimental setting
As we understand it, you're saying here that the learning during overtraining may not take place for odors with larger overlap with the target, for instance in our Figure 6. This is consistent with our framing of the phenomenon as margin maximization between target and nontarget, which have zero overlap and are thus far apart. If the probes are closer in representational space to the target (very high overlap) than to the nontarget, they will never be separated by a hyperplane equidistant between the target and nontarget cluster centers. This is a feature, not a bug, of our framework. Our goal here is to model the neural data, where the mice are shown overlap-1 probe odors as the held-out test examples, so the increase in accuracy on these overlap-1 probe trials with overtraining is the object we seek to explain, and our margin-maximization model (as Figure 6 shows) indeed does this.
Loss interpretation
We agree with your statement that accuracy and loss are distinct measures of performance and that the former is a discontinuous metric. We also agree the training loss must still be decreasing (even if very slowly or imperceptibly) after accuracy saturates, so long as the weights continue to move and the test loss continues to fall; this is tautologically true by the definition of gradient descent. Your point here is exactly what we're trying to communicate: much work in the grokking literature in deep learning outlines how small changes in train loss at late time (after training accuracy saturates) can lead to large changes in test accuracy (see [Kumar et al., 2023], [Lyu et al., 2023] for examples). In the neural data reanalysis, we show how despite no outward changes in mouse behavior, there is a continued decrease in training loss during overtraining. This is one sense in which "mice grok" -- because we see it leads to stronger downstream accuracy.
Miscellaneous and Questions
We've addressed all of these minor aesthetic issues. Further, we've reworked the overtraining reversal section entirely to make it clearer and more detailed. We've focused on making the connection between overtraining, feature learning, and adaptation to reversal clear, both theoretically in a linear network whose dynamics we analyze but also empirically in terms of interpretable feature learning in the model.
We've also made further additions to the manuscript that we believe strengthen it. In addition to the reworked overtraining reversal section, the new manuscript we've uploaded (changes in blue):
- Demonstrates the nature of the objective is the key to this late-time learning, both through a loss ablation that shows a causal link between margin maximization and grokking (Appendix A.6) and by constructing a new, more biologically realistic model of piriform cortex (Appendix A.7) and showing the phenomenology persists there if the same loss function is used.
- Contains new statistical comparisons and ablations between the piriform separation during overtraining and random firing rate baselines in Appendix A.5.
Overall, we thank the reviewer for their feedback, and would appreciate if they would reconsider their assessment in light of our response and new additions to the manuscript. It seems the fundamental message of the paper was not appreciated, and much of the criticism was about minor aesthetic and prose details rather than about the core premise: proposing and modeling a new neural phenomenon using the rich language of machine learning. Thank you.
--
References.
Kumar, Tanishq, et al. "Grokking as the transition from lazy to rich training dynamics." arXiv preprint arXiv:2310.06110 (2023).
Lyu, Kaifeng, et al. "Dichotomy of early and late phase implicit biases can provably induce grokking." arXiv preprint arXiv:2311.18817 (2023).
I appreciate the authors for dealing with my concerns. Please inform me if I misunderstood any parts, especially the experimental set-up which is my biggest concern.
Validity of the experimental setting
I would like to clarify my concern. According to the paper, the authors use a few-layer MLP. If there is no overlap between odors modeled as n-hot vectors, the first layer can be decomposed into a separate weight matrix for each odor (linearly separable). In other words, if a new odor has no overlap with the training odors, the corresponding weight vectors are never updated. Moreover, I am also concerned about whether the authors' claim, based on synthetic data, is directly applicable to realistic computational neuroscience.
To address my concern, I recommend modeling each odor as a joint Gaussian distribution (e.g., simply applying Gaussian smoothing to the odors to make the inputs non-zero continuous variables), rather than as an n-hot vector.
Loss Interpretation
In the neural data reanalysis, we show how despite no outward changes in mouse behavior, there is a continued decrease in training loss during overtraining. This is one sense in which "mice grok" -- because we see it leads to stronger downstream accuracy.
Could you please elaborate on how to calculate training loss in the neural data reanalysis? If I understand correctly, the reanalysis uses the actual neural recording and other data from Berners-Lee et al. (2023). I wonder how to calculate training loss in a real behavioral experiment.
Miscellaneous and Questions
Thank you for providing information. I will read revisions and other reviews to adjust my rating.
Validity of the Experimental Setting
We think this issue might be a miscommunication/misconception. If the concerns stem from the impression that we are passing raw n-hot vectors into the MLP, we should clarify this is not what we are doing. In fact, we are already doing something similar to what you suggest: passing non-zero, continuous embeddings into the network. The n-hot vectors are passed through a random projection before being fed into the network. We state this explicitly in the paragraph:
We take our earlier definition of odors as n-hot vectors of odorants (chemicals) that we pass through a random projection, inspired by the functional role of the glomeruli (Blazing & Franks, 2020), with n = 10, k = 100. We train a one-hidden layer multilayer perceptron (MLP) on these embeddings...
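For concreteness, a minimal sketch of this kind of setup (assuming PyTorch; the embedding and hidden widths below are illustrative and not taken from the paper) might look like:

```python
import torch
import torch.nn as nn

k, n, d_embed, d_hidden, n_classes = 100, 10, 64, 128, 2   # d_embed, d_hidden are illustrative

def n_hot_odor(k, n):
    """Random n-hot vector over k possible odorants."""
    v = torch.zeros(k)
    v[torch.randperm(k)[:n]] = 1.0
    return v

# Fixed Gaussian random projection, standing in for the glomerular stage.
projection = torch.randn(k, d_embed)

# One-hidden-layer MLP trained on the projected (continuous, non-zero) embeddings.
mlp = nn.Sequential(
    nn.Linear(d_embed, d_hidden),
    nn.ReLU(),
    nn.Linear(d_hidden, n_classes),
)

odor = n_hot_odor(k, n)
embedding = odor @ projection     # continuous embedding fed to the MLP
logits = mlp(embedding)
```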
Loss Interpretation
You are right that there is obviously no notion of literal "training loss" for learning taking place in mouse piriform cortex, since the explicit form of the objective optimized by cortex is not known. Rather, what we are referring to in that quote is (a) the continued separation of class representations of training odors that we present in Figures 2 and 3, as well as (b) the continued increase in margin of the associated representations that we present in Figure 4. As we can see in the PCA diagrams in Figure 5, the decreased training loss is a direct consequence of the separation of class representations in the synthetic setting. It is this separation (and associated increase in margin) that we are referring to as close proxies for "training loss" during mouse overtraining.
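As one possible way to make the margin proxy concrete, a minimal sketch (assuming scikit-learn; a near-hard-margin linear SVM has geometric margin 2/||w||, and this is only one way such a quantity could be operationalized, not necessarily the paper's exact analysis):

```python
import numpy as np
from sklearn.svm import LinearSVC

def estimated_margin(X, y, C=1e6):
    """Geometric margin of a (nearly) hard-margin linear separator.

    X: (trials, neurons) firing-rate features; y: binary labels (e.g. target vs nontarget).
    A very large C approximates a hard-margin SVM, whose margin is 2 / ||w||.
    """
    clf = LinearSVC(C=C, max_iter=100_000).fit(X, y)
    return 2.0 / np.linalg.norm(clf.coef_.ravel())

# e.g. compute estimated_margin(X_session, y_session) per session and track it across overtraining days
```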
Thank you for engaging with our response, and please let us know what we can do to make it easier for you to consider reassessing your score.
I appreciate the author's response.
Regarding the Validity of the Experimental Setting, could you please elaborate on how the random projection is implemented? It is difficult to tell from both the original manuscript and the revision how it works.
[Original Version]
We define odors as n-hot vectors of odorants (chemicals) in {0, 1}^k that we pass through a random projection, inspired by the functional role of the glomeruli.
[Revision]
We take our earlier definition of odors as n-hot vectors of odorants (chemicals) that we pass through a random projection, inspired by the functional role of the glomeruli (Blazing & Franks, 2020), with n = 10, k = 100. We train a one-hidden layer multilayer perceptron (MLP) on these embeddings...
Regarding Loss Interpretation, I would recommend changing the term "training loss" to something else to reduce readers' confusion, as the authors acknowledged.
FYI, in L538, "model the data of (Berners-Lee et al., 2023)" should use \citet.
The random projection is a multiplication of the sparse input vectors representing odors by a matrix whose entries are sampled from a unit Gaussian, then kept fixed throughout. So the input vectors to the MLP are of the form Ax, where x is an n-hot vector representing a fixed odor and A is a fixed random matrix. The key point is that these vectors will have nonzero overlap almost surely, even if the expected overlap is zero. Importantly, this is the point of the model -- in the experimental setting of [Berners-Lee et al., 2023], zero-overlap odors are passed into the mouse olfactory cortex, whose early layers involve a random projection.
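A minimal numpy sketch of this point (the embedding width d is illustrative and not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, d = 100, 10, 64                       # d is an illustrative embedding width

A = rng.standard_normal((k, d))             # fixed random projection, unit Gaussian entries

# Two odors with disjoint n-hot supports: zero overlap in odorant space.
x1 = np.zeros(k); x1[:n] = 1.0
x2 = np.zeros(k); x2[n:2 * n] = 1.0
print(x1 @ x2)                              # 0.0

# After the projection, the embeddings overlap almost surely (even though expected overlap is zero).
e1, e2 = x1 @ A, x2 @ A
print(e1 @ e2)                              # nonzero with probability 1
```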
We also feel the reviewer is focusing on the wrong details. The synthetic model, as reviewer CDUr notes, captures, at an abstract level, the key computational principles behind the main findings of the paper, which are Figures 2, 3, and 4. In response to reviewers, during the rebuttal period we added a new model that explicitly studies "realistic computational neuroscience," taking olfactory cortex biology much more seriously. This model, in Appendix A.7, is a literal interpretation of piriform anatomy, including modeling differing cell types, recurrent feedback, sparsity constraints, and more. We find that, given the same objective, the late-time learning in test loss is preserved.
Thank you for the suggestion to reword "training loss" and for the citet note. We will make both changes for the camera-ready version, as we cannot currently edit the manuscript. We hope during discussion the reviewer engages with other reviewers to discuss and appreciate the central finding of the paper: Section 3 demonstrates mice are continuing to learn even as their behavior remains unchanged, leading to abrupt generalization downstream, and models of grokking in deep learning make extremely non-obvious predictions (margin maximization) about piriform data that turn out to be true! We think this is exceptionally interesting, and do appreciate the reviewer's willingness to engage in discussion even if we feel they have not appreciated the central contents of the paper.
I appreciate the authors for their last-minute response. The reason I focused on those details is that if the random projection is linear, it still makes the problem trivial, and that severely degrades the authors' claim. I checked Appendix A.7; the authors use a non-linear activation function and the result holds as in the main paper. Consequently, I raised my rating.
This paper examines an intriguing phenomenon where neural representations in mouse sensory cortex continue to evolve and improve even after behavioral performance has reached ceiling levels (Berners-Lee et al., 2023). The authors reanalyze neural recording data from mice learning an odor discrimination task and find evidence for "hidden progress" in how odor representations separate in piriform cortex during overtraining, similar to the "grokking" phenomenon observed in deep neural networks. They propose that this continued refinement of representations reflects implicit margin maximization and demonstrate this using a simplified neural network model. The paper makes novel connections between neuroscience and machine learning while offering new insights into both domains.
Strengths
- Makes an important connection between seemingly disparate phenomena in biological and artificial neural networks (hidden representational changes during overtraining, grokking, reversal learning)
- Provides a compelling mathematical framework (margin maximization) that helps explain the neurobiological data
- Offers a novel explanation for the classic "overtraining reversal effect" from psychology in terms of rich feature learning
- Good empirical validation through neural data analysis and synthetic modeling
- Mostly clear writing and effective visualization
Weaknesses
- The characterization of their model as "biologically faithful" feels overstated, given the significant abstractions from real neural circuits
- The paper should cite relevant literature on deep learning approaches to perceptual learning, particularly work by Wenliang and Seitz (2019) that examines how perceptual learning can be explained as changing readouts from complex features
- The explanation of margin maximization was a bit inscrutable, particularly the part about distinguishing between different notions when a decoder is being retrained on evolving features. It could use a rewrite and tightening up, perhaps a subfigure.
Questions
Suggestions for Improvement:
- Tone down claims about biological fidelity and instead emphasize that the model captures key computational principles at an abstract level
- Add discussion of connections to deep learning literature on perceptual learning, which provides complementary perspectives on how representations evolve during training
- Clarify the precise definition of margin maximization being used, particularly in the context of evolving representations.
Thank you for the detailed and positive review. We've rephrased the description of the synthetic model in the main text, adopting your excellent description, added the relevant citations to the literature you brought up, and reworked the explanation of margin maximization, in direct response to your suggestions.
We've also made a number of substantive additions to the manuscript we believe make it stronger and clearer. In particular, the updated manuscript we have uploaded contains changes in blue, and:
- Demonstrates the nature of the objective is the key to this late-time learning, both through our hinge loss ablation and by constructing a new, more biologically realistic model of piriform cortex (Appendix A.7), and showing the phenomenology persists there if the same loss function is used.
- Includes a discussion of what biological analogs of such an objective might be in cortex, relating to existing work on how uncertainty is stored and used to guide behavior in rodent cortex.
- Contains new statistical comparisons and ablations between the piriform separation during overtraining and random firing rate baselines in Appendix A.5.
- Contains an entirely reworked overtraining reversal section to include a worked out mathematical example of how features can be re-used under label inversion, making it both clearer and more detailed.
We'd love to hear anything we can do to further improve your assessment of our work, and thank you for the detailed and thoughtful review!
This study examines grokking, an interesting phenomenon observed in both deep learning models and in animal learning. The authors re-examine a published dataset that tracks neural activity in the posterior piriform cortex (PPC) of mice trained to perform a multi-odor discrimination task. The authors found that the class margin measured in the neural activity space of PPC continued to increase after performance accuracy has reached a ceiling. In a simple MLP with one hidden layer trained to perform an in silico “odor” discrimination task, the authors found continued improvement in class margin measured in latent space, and “grokking” phenomenon in probe trial accuracy. Using this simple model, the authors also provide theoretical explanation and empirical demonstration of overtrained reversal, a phenomenon long reported in experimental psychology.
Strengths
The study draws an interesting connection between the grokking phenomenon in deep learning and to the gradual evolution of neural representation in mice overtrained on classification tasks. By using a simple MLP model to demonstrate the link between increase in class margin and the emergence of “grokking”, the authors highlight a promising mechanism by which grokking may emerge in biological and artificial learning systems.
Weaknesses
- The architecture of the MLP model is likely far removed from the actual structure and connectivity of the PPC, hence it appears more like a toy model to demonstrate the co-occurrence of margin maximization and grokking. Whether a similar phenomenon would emerge in more sophisticated models of olfactory cortex (or olfactory circuits in lower animals) remains to be tested;
- While the emergence of grokking in the MLP model is interesting, the authors have yet to test whether the observed increase in class margin is causally related to grokking. The authors cite related theory, e.g. setting margin maximization as an optimization objective could drive the emergence of grokking. This proposed causal link could be conveniently tested in the MLP model.
- Without further analysis, it is unclear whether the link between class margin maximization and grokking holds in more complex DL models, e.g., models trained to classify images. While the authors provide a theoretical basis for how overtrained reversal occurs, a theoretical advance on why grokking occurs remains lacking.
Questions
Could the authors comment specifically on:
- How results from the MLP model would generalize to more biologically constrained models of the PPC (or other olfactory processing circuits)?
- How results from MLP could generalize to more complex DL models for classification?
- Demonstrate a causal link between margin maximization and the emergence of grokking in the MLP model?
Thanks for the review, and close read! As we understand it, there are three explicit points to be addressed. It's most natural to address your questions in reverse order. Our updated manuscript contains changes in blue throughout.
- Demonstrate a causal link between margin maximization and the emergence of grokking in the MLP model?
Certainly. A natural way to ablate margin is to change the objective from one known to drive asymptotic margin maximization to another that drives margin maximization only up to a certain point, after which the loss vanishes. This is a "hard margin" objective, and hinge loss is an example. In the new Appendix A.6, we repeat our experiment with hinge loss and a fixed margin and find that when we ablate late-time margin maximization, the test loss plateaus when the train loss does and stops improving at late time, so the grokking effect vanishes. We can also see in the accompanying PCA plots that the separation between target and probes is worse in that setting as a result. This strongly suggests margin maximization is driving late-time learning in our setting.
- How results from MLP could generalize to more complex DL models for classification?
We want to emphasize our claim is not a universality claim about margin maximization driving grokking in all possible deep learning settings. Our focus is exclusively on a) piriform cortex, and b) our synthetic model thereof. We aim to focus on answering the question of how much such a phenomenology describes the neural data we revisit. While it's certainly plausible one might find a setting where, for instance, a vision model groks on CIFAR in such a manner, we think in this case looking for/including such a setting would actively distract from our topic of study, which is a deep-learning motivated analysis of overtraining dynamics in piriform cortex.
- How results from the MLP model would generalize to more biologically constrained models of the PPC (or other olfactory processing circuits)?
We can check this directly. This offers us another chance to try to falsify our claim that the objective, not the architecture, is key here when it comes to margin maximization. In the new Appendix A.7, we consider an MLP architecture that takes posterior piriform biology much more seriously, modeling the different cell types as different layers with different initializations, their sparse vs. dense and excitatory vs. inhibitory effects with dropout and differing signs on the forward pass, as well as recurrent feedback effects with a leaky ReLU. We find this model -- when trained with cross-entropy -- exhibits the same phenomenology in the same way, supporting the idea that the task and objective, not the details of the architecture, are what matter here. We elaborate on what the corresponding objective might be in cortex by discussing related work from rodent neuroscience in the Discussion section.
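As a rough illustration only of the flavor of such a model -- the layer sizes, dropout rate, and sign handling below are placeholders, not the actual Appendix A.7 implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PiriformLikeMLP(nn.Module):
    """Rough sketch only: distinct layers for distinct cell populations, dropout as a
    sparsity constraint, a negative-signed inhibitory pathway, and leaky ReLU standing
    in for recurrent feedback. All sizes and rates are placeholders."""

    def __init__(self, d_in=64, d_exc=200, d_inh=50, n_classes=2):
        super().__init__()
        self.exc = nn.Linear(d_in, d_exc)      # dense excitatory population
        self.inh = nn.Linear(d_exc, d_inh)     # inhibitory interneurons
        self.drop = nn.Dropout(p=0.7)          # sparse firing
        self.readout = nn.Linear(d_exc, n_classes)

    def forward(self, x):
        h = F.leaky_relu(self.exc(x))          # leaky ReLU: crude stand-in for recurrence
        h = self.drop(h)
        inhibition = F.relu(self.inh(h)).mean(dim=-1, keepdim=True)
        # trained with the same cross-entropy objective as the simple MLP
        return self.readout(h - inhibition)    # inhibitory drive enters with negative sign
```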
We emphasize, both here and in our updated manuscript, that truly the only way to check whether the margin-maximization hypothesis is an accurate description of late-time learning in cortex is to run new and targeted experiments in sensory cortex itself, rather than construct even more machine-learning-esque models. To this end, we lay out -- in detail -- an experimental proposal and our falsifiable hypotheses, and a plan for how this can be done in the future, in Appendix Section A.2, to make it easier for experimentalists to test our hypotheses.
Other substantive additions to the manuscript:
- A reworked overtraining reversal section to include a fleshed out mathematical example of how features can be re-used under label inversion, making it both clearer and more detailed, with both theory and experiments.
- Statistical comparison of representational separation that takes place in cortex during overtraining to random/control baselines in the new Appendix A.5.
We think overall both the additions specifically in response to your questions, and otherwise, significantly strengthen the paper. We'd appreciate it if you'd consider raising your score, or otherwise suggest additional things we can do for you to do so.
Thank you for your thoughtful review; we'd love feedback on our substantial additions and many new experiments done directly in response to your comments, and would appreciate if you consider raising your score as the rebuttal period comes to a close. Thank you!
We would appreciate any engagement with our many new and updated experiments which we've added expressly to address the excellent questions posed by the reviewer. We would appreciate it if the reviewer considers raising their score if they find the new experiments helpful. Thank you.
The authors examine the phenomenon of overtraining in posterior piriform cortex of mice. They show that class separation continues after performance saturates. They further show that the margin of the maximum margin classifier increases in the overtraining period. They construct a simple model, which is a one-hidden layer MLP, that shows this overtraining behavior. They propose that this model provides an explanation for over-training reversal in animal learning.
Strengths
The paper is very clear and well presented. The basic result is simple and backed by both empirical data and a simple model. Most of the figures are very clear and concise. That the observation can be explained by so simple a model is nice. The explanation of reversal training is nice.
Weaknesses
I think there are a few improvements that could be made to the figures:
Figure 1 could use a quantitative indication of performance, so that it is clear when overtraining starts (as opposed to this being buried in the caption).
Figure 2 is slightly confusing in terms of which column in each row represents what. Do I understand correctly that the first column is the average and the others are the first and last 3 days, respectively? If so, this could be labeled better. If not, this could be labeled much better.
In Figure 3, maybe add an indication of the cluster mean with a line separating the clusters to make the separation more clear? It’s not bad now, but some of these are rather close.
In Figure 5, panel labeled “Epoch 1000”, the target dot is obscured.
Questions
This seems like it would be a very general phenomenon, in particular because natural gradient descent dynamics should keep lowering the loss, which by construction has to maximize some discriminant in the data being classified. At the same time, the training samples are discrete so you are guaranteed to saturate the training error (i.e. you can’t get better than perfect on a finite set). Your kernel argument at the end seems to back up this intuition. What properties of networks would you envision would prevent this behavior? Or what kinds of tasks?
Thanks for the review! Your aesthetic notes have all been addressed in the updated manuscript we've uploaded -- changes in blue text throughout. Note the target dot being obscured was not an accident; part of the point being made is that the target and probes are not easily separable at the end of training, but become separable during overtraining. Overtraining starts at the beginning of Fig 1, and we've put that in the plot title so it's front and center, as well as included column titles (mice names) + cluster centers in Figs 2 and 3, respectively.
Your question about a non-example is a good one. First, to directly address the question we've added a specific non-example: ablating cross-entropy to hinge loss in the MLP experiment. Hinge loss enforces only a hard margin, so that after all points are classified correctly with some fixed margin, the loss vanishes, in contrast to cross-entropy, where it does not. We see the phenomenon vanishes concomitantly, strongly suggesting margin maximization plays a causal role in driving grokking in our model. These new experiments and ablations can be found in Appendix A.6.
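To illustrate the distinction between the two objectives, a minimal sketch for a binary problem with labels in {-1, +1} (the margin value here is arbitrary):

```python
import torch
import torch.nn.functional as F

def hinge_loss(scores, labels, margin=1.0):
    """Zero once every example is correct by at least `margin`: no late-time learning signal."""
    return F.relu(margin - labels * scores).mean()

def logistic_loss(scores, labels):
    """Cross-entropy for +/-1 labels: strictly positive, so it keeps pushing margins apart."""
    return F.softplus(-labels * scores).mean()

scores = torch.tensor([2.5, 3.0, -4.0])      # all examples already correct by more than the margin
labels = torch.tensor([1.0, 1.0, -1.0])
print(hinge_loss(scores, labels).item())     # 0.0  -> gradients vanish, late-time learning stops
print(logistic_loss(scores, labels).item())  # small but nonzero -> margins keep increasing
```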
In addition to addressing your aesthetic comments and demonstrating a causal link to address your question, our updated manuscript includes much more content. It now
- Demonstrates the nature of the objective is the key to this late-time learning, both through our hinge loss ablation and by constructing a new, more biologically realistic model of piriform cortex (Appendix A.7), and showing the phenomenology persists there if the same loss function is used.
- Includes a discussion of what biological analogs of such an objective might be in cortex, relating to existing work on how uncertainty is stored and used to guide behavior in rodent cortex.
- Contains new statistical comparisons and ablations between the piriform separation during overtraining and random firing rate baselines in Appendix A.5.
- Contains a reworked overtraining reversal section to include a worked out mathematical example of how features can be re-used under label inversion, making it both clearer and more detailed.
We think that including substantially more content while simultaneously polishing the figures and prose strengthens the submission significantly. We'd appreciate it if you considered raising your score as a result, or pointed us to further desiderata for the manuscript. Thanks!
We thank the reviewer for their review, and would invite the reviewer to give feedback on our substantial additions and many new experiments done directly in response to their valuable feedback, and would appreciate if they consider raising their score as the review period approaches a close. Thank you!
We are writing to gently request the reviewer engage with our comments if possible as the now extended rebuttal period is approaching its end, with still no response. We have added several experiments directly in response to reviewer concerns that we think make the manuscript much more convincing, and would appreciate if the reviewer points us to further requests or considers raising their score. Thank you.
This paper uses neural recordings from mice trained to perform odor discrimination tasks to argue that task-specific representations continue to evolve after accuracy on the task saturates. The paper connects this finding to margin of a classifier and suggests that learning in the animal could be maximizing the margin. Using a two layer linear network as the mathematical model, the paper also describes how “overtraining” results in features that are suitable for reversal tasks, something that has also been observed in animals.
This is a simple result that is presented in a sound, well-written paper. I recommend that it be accepted. The authors are encouraged to incorporate the narrative and reviewer comments from the rebuttal while preparing the camera-ready version of this manuscript.
Additional Comments from Reviewer Discussion
Reviewer KXYs had some stylistic comments and some clarification questions. The authors have addressed them in the rebuttal, in addition to some new experiments on hinge loss vs. cross-entropy loss.
Reviewer aDZL had concerns about whether using a single-layer network to distinguish between neural firings can shed light upon the hypothesized margin maximization going on in the posterior piriform cortex. Their second comment asked why the observed "grokking" is connected to margin maximization. These comments have been addressed satisfactorily in the rebuttal.
Reviewer CDUr had a largely positive review and they wanted the authors to tone down the narrative in certain places.
Reviewer CcRH had a very negative review initially, which focused on stylistic comments and the validity of the experimental approach. The score was 1/10. This was also pointed out by another reviewer. Some of the harsh criticism was perhaps due to ambiguities in the grokking literature itself; e.g., it is true that small changes in the training loss near the global minimum can lead to improvements in the test error long after the training error has gone to zero. After a little bit of back and forth, the authors managed to convince the reviewer of the validity of this analysis.
Accept (Poster)