Emergence in non-neural models: grokking modular arithmetic via average gradient outer product
We show that "emergence" in the task of grokking modular arithmetic occurs in feature learning kernels using the Average Gradient Outer Product (AGOP) and that the features take the form of block-circulant features.
Abstract
Reviews and Discussion
This paper investigates grokking in the common scenario of modular arithmetic. In contrast to previous approaches, the authors show that grokking occurs when using the RFM learning algorithm. This enables them to isolate the feature learning as the source of grokking in these setups. By defining two progress measures, they demonstrate that the feature learning dynamics do not exhibit two distinct phases. Finally, they compare their results to the traditional NN approach and show that their findings are consistent with previous attempts.
Questions for the Authors
N/A
Claims and Evidence
Their claims are sufficiently supported by comprehensive numerical experiments.
Methods and Evaluation Criteria
The methods and evaluations are fine.
Theoretical Claims
The paper mainly contains observations that are supported by strong numerical evidence.
Experimental Design and Analysis
All of the experiments look valid.
Supplementary Material
I also briefly looked into the appendix, which contains additional numerical experiments and derivations.
Relation to Prior Literature
This paper can help to figure out the mechanism behind the well-known phenomenon of grokking in modular addition, which could advance our understanding of the interplay between training and testing dynamics.
Missing Important References
N/A
Other Strengths and Weaknesses
Strengths:
- This paper shows that grokking in modular arithmetic can also occur using RFM, which is a novel contribution.
- This approach allows the authors to isolate feature learning as the central mechanism of grokking in this setup.
- They demonstrate that generalization requires a circulant structure in the feature matrix, which appears important for understanding modular arithmetic tasks.
- The authors define two progress measures that, unlike test accuracy, exhibit a linear change even in the early stages of training. This is significant as it suggests that there is no real phase transition during feature learning. I also found the discussion of the required a priori knowledge interesting.
- They compare their findings to the modular arithmetic NN approach (Gromov, etc.), showing that it relies on the same principles and that the new progress measures could also be applied there.
- The paper is very well-written.
Weaknesses:
- I felt that some details are still unclear and may need further investigation:
- What is the fundamental mechanism behind the delayed generalization observed here? In other words, why does generalization accuracy behave so differently from the continuous progress measure? It appears that the jump in accuracy occurs when the continuous progress measure saturates. Is this consistent? Have you identified a specific threshold in the progress measures?
- Generalization does not occur below a certain threshold of the training data fraction. What is the interplay between this threshold and your results?
- Finally, since the paper focuses on the specific task of modular arithmetic, the scope of the results is naturally limited.
Overall, the paper is well-written and contains important insights, so I believe it should be accepted.
Other Comments or Suggestions
I believe that further investigation of the behaviour of the progress measures for different ratios could be beneficial. For example, it could be interesting to see the AGOP alignment (of the current iteration versus the final iteration) and the circulant deviation at several different ratios (below and above the critical threshold). Fig. 10 is also interesting in this context: the AGOP still changes quite a lot from fraction 40% to 45%, while the test accuracy is already 1 in both cases. It could be interesting to see this graph for more fractions (larger than 45%) and also for the circulant deviation measure.
Typos:
- Line 70: "and and test accuracy remain at the constant"
- Around Eq. (6), probably should be (by the way, a brief justification for the choice would be nice).
We thank the reviewer for their detailed feedback. We address their questions and comments below.
I felt that some details are still unclear and may need further investigation: What is the fundamental mechanism behind the delayed generalization observed here? In other words, why does generalization accuracy behave so differently from the continuous progress measure?
There is a sharp transition in feature alignment. Once the features are aligned beyond a certain threshold, the accuracy (and loss) both improve sharply. Thus the distinction is not between continuous or discrete measures but between measures of loss and measures of feature alignment. Why there is a discrepancy between our progress measures and accuracy / loss and formalizing the mechanism behind the delayed generalization is an important question. A simplified direction to understand this process theoretically would be to analyze how random circulant features enable generalization on modular arithmetic.
It appears that the jump in accuracy occurs when the continuous progress measure saturates. Is this consistent? Have you identified a specific threshold in the progress measures?
We have not observed a direct relationship between the jump in accuracy and saturation of progress measures. While it might look like it happens in Figure 2, it is hard to visually evaluate feature improvements as the alignment is close to 1 and the curve looks flat even as alignment continues to improve.
Generalization does not occur below a certain threshold of the training data fraction. What is the interplay between this threshold and your results?
Both training iterations and number of samples will improve feature quality. Thus, as generalization error is a sharp function of this quality, we see sharp transitions with respect to both.
Finally, since the paper focuses on the specific task of modular arithmetic, the scope of the results is naturally limited.
Like prior works (e.g. [1,2]), we focus on the setting of modular arithmetic because (1) these tasks clearly exhibit the sharp transition from trivial-to-perfect generalization, and (2) there exist analytic solutions to the task that we can compare the learned algorithms to. In particular, there is substantial evidence that the algorithm implemented by neural networks is the Fourier Multiplication Algorithm [2,3]. Moreover, it appears that non-parametric methods without the ability to learn features are unable to learn modular arithmetic, hence these tasks are a strong test-bed for the predictive power of feature learning through AGOP.
[1] Power et al., Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, Mathematical Reasoning in General Artificial Intelligence Workshop, ICLR 2021.
[2] Nanda et al., Progress measures for grokking via mechanistic interpretability, ICLR, 2023.
[3] Zhong et al., The clock and the pizza: Two stories in mechanistic explanation of neural networks, NeurIPS, 2023.
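For completeness, one standard way to write the Fourier Multiplication Algorithm for addition mod $p$ (a sketch of the idea, not the paper's exact notation): with $\omega = e^{2\pi i/p}$, one-hot encodings $e_a, e_b \in \mathbb{R}^p$, and DFT $\widehat{e}[k] = \sum_j e[j]\,\omega^{-jk}$,

$$\widehat{e_a}[k] = \omega^{-ak}, \qquad \widehat{e_a}[k]\,\widehat{e_b}[k] = \omega^{-(a+b)k}, \qquad \frac{1}{p}\sum_{k=0}^{p-1} \omega^{ck}\,\widehat{e_a}[k]\,\widehat{e_b}[k] = \mathbf{1}\{c \equiv a+b \pmod{p}\},$$

so multiplying the DFTs of the two one-hot inputs componentwise and inverting the transform recovers the one-hot encoding of $(a+b) \bmod p$.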
I believe that further investigation of the behaviour of the progress measures for different ratios could be beneficial. For example, it could be interesting to see the AGOP alignment (of the current iteration versus the final iteration) and the circulant deviation at several different ratios (below and above the critical threshold). Fig. 10 is also interesting in this context: the AGOP still changes quite a lot from fraction 40% to 45%, while the test accuracy is already 1 in both cases. It could be interesting to see this graph for more fractions (larger than 45%) and also for the circulant deviation measure.
Please find a new figure here showing circulant deviation and AGOP alignment evolution over time for modular addition at training fractions 5%, 15%, 25%, 35%, 45%, 55%, …, 95%: https://ibb.co/ytKK0p5. As the reviewer points out, we observe that the AGOP alignment can continue to increase after test accuracy is at 100%.
We provide another new plot showing that the circulant deviation at the final iteration rapidly decreases with training size, and is very close to 0 for training fractions larger than 55%: https://ibb.co/9Hp1F3yr.
Typos: Line 70: "and and test accuracy remain at the constant" Around Eq. (6), probably should be (by the way, a brief justification for the choice would be nice).
Thank you for noticing this; we will fix the typo in our revision. The choice of s = 1/2 is motivated by the case of two-layer linear networks, where the NFA is exact with this exponent [1].
[1] Radhakrishnan, Belkin, Drusvyatskiy. “Linear Recursive Feature Machines provably recover low-rank matrices”, PNAS 2025.
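For concreteness, this exponent enters the feature update M ← G^s applied to the AGOP G. A minimal numpy sketch of that step (illustrative function and variable names, not the paper's code):

```python
import numpy as np

def psd_power(G, s=0.5):
    """Raise a symmetric PSD matrix G to the power s via its eigendecomposition."""
    evals, evecs = np.linalg.eigh(G)
    evals = np.clip(evals, 0.0, None)   # guard against tiny negative eigenvalues
    return (evecs * evals**s) @ evecs.T

# Hypothetical RFM feature update with s = 1/2, where `agop` is the
# average gradient outer product of the current predictor:
# M = psd_power(agop, s=0.5)
```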
I thank the authors for their detailed response. I would like to maintain my positive score.
This paper studies the phenomenology of learning with recursive feature machines for modular addition, subtraction, multiplication, and division. Modular addition has become a standard setting for studying grokking as an example algorithmic task and has been studied for learning with neural networks, typically with quadratic activation functions. This paper learns the task with recursive feature machines that exhibit similarities with feature learning in neural networks. The paper also gives explicit constructions that transform the unit vectors by circulant matrices, after which quadratic kernel regression yields a generalizing solution.
Questions for the Authors
Why is AGOP alignment measured by vectorizing the matrices? Why not the trace-based alignment $\mathrm{Tr}(A^\top B) / (\|A\|_F \|B\|_F)$? I believe this is the standard way to measure the alignment of matrices AND it would capture similarities of eigenvectors, unlike the vectorized alignment.
Claims and Evidence
Claims are supported by clear and convincing numerical simulations and intuitive explanations.
Methods and Evaluation Criteria
Yes.
Theoretical Claims
Yes (but not in detail). There is one Theorem in the main paper that shows that the quadratic kernel learns features after transforming the data by the discrete Fourier transform. This Theorem is informally stated in the main paper and a more detailed version is given in the Appendix.
Experimental Design and Analysis
Yes. All the Figures in the main are good.
Supplementary Material
I just skimmed Appendix G.
Relation to Prior Literature
The extent to which learning with recursive feature machines correlates with feature learning in neural networks is an active research area. Also, it is an open question whether this can be shown mathematically in simple settings. Our fundamental understanding of feature learning in neural networks is so far limited to learning in the infinite-width mean-field regime or to learning multi-index models. This paper contributes to understanding feature learning for the modular addition (and related) tasks.
Missing Important References
references are good
Other Strengths and Weaknesses
Strengths: Experiments are thorough and explanations are insightful. It was already known that neural networks learn the Fourier Multiplication Algorithm. This recursive kernel learning gives a more transparent framework for understanding grokking. A kernel machine without feature learning also achieves zero training error but does not generalize, whereas learning features with RFM (= a data transformation) gives a generalizing solution. That's neat! Applying (A) the correct transformation to the data plus a kernel machine is better than (B) learning with kernel machines alone, which is quite interesting (also interesting is whether (A) beats neural networks).
Weaknesses:
- line 359--361 (right): already said before
- Theorem 5.1 could come earlier in the paper and also be written formally as done in Appendix
- "well-known results for multiplicative group" may not be well known to ML audience. It would be better to state these group theory references formally in math symbols instead of words (and add refs).
Other Comments or Suggestions
Suggestions
- Please use the standard definition of the Jacobian without a transpose in Def 2.1 and write $G = J_f^\top J_f$ to avoid confusion.
We thank the reviewer for their in-depth feedback on our submission. We will address their comments and questions here.
Theorem 5.1 could come earlier in the paper and also be written formally as done in Appendix "well-known results for multiplicative group" may not be well known to ML audience. It would be better to state these group theory references formally in math symbols instead of words (and add refs).
Thank you for this feedback. We will include the Theorem formally in the main text in the updated manuscript, and more clearly reference Appendix E where we discuss the re-ordering.
Please use the standard definition of the Jacobian without a transpose in Def 2.1 and write $G = J_f^\top J_f$ to avoid confusion.
Thank you for this suggestion, we will use the standard definition to avoid confusion.
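For clarity, under that convention, with $J_f \in \mathbb{R}^{n \times d}$ the Jacobian whose $i$-th row is $\nabla f(x_i)^\top$, the scalar-output AGOP reads (our restatement, up to the averaging convention used in the paper):

$$G \;=\; \frac{1}{n} J_f^\top J_f \;=\; \frac{1}{n}\sum_{i=1}^{n} \nabla f(x_i)\,\nabla f(x_i)^\top.$$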
Why is AGOP alignment measured by vectorizing the matrices? Why not the trace-based alignment $\mathrm{Tr}(A^\top B) / (\|A\|_F \|B\|_F)$? I believe this is the standard way to measure the alignment of matrices AND it would capture similarities of eigenvectors, unlike the vectorized alignment.
In fact, $\mathrm{Tr}(A^\top B)$ equals the inner product of the vectorized $A$ and $B$, and the Frobenius norms $\|A\|_F$, $\|B\|_F$ equal the Euclidean norms of their vectorizations; thus our formulation is indeed a standard measure of matrix similarity.
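A short numpy check of this equivalence (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))

# Trace inner product equals the inner product of the vectorized matrices,
# and the Frobenius norm equals the Euclidean norm of the vectorization.
assert np.isclose(np.trace(A.T @ B), A.ravel() @ B.ravel())
assert np.isclose(np.linalg.norm(A, "fro"), np.linalg.norm(A.ravel()))
```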
This paper studies the grokking phenomenon in modular arithmetic. The main idea is that grokking happens because models slowly learn the right features, not because of any specific neural network architecture or gradient-based optimization. The authors use Recursive Feature Machines (RFMs) to show that even when training loss is zero from the start, the features evolve gradually until they reach a structure that allows generalization. In their case, the key structure is a block-circulant feature matrix that connects to the Fourier Multiplication Algorithm, which exactly solves the modular arithmetic task.
Questions for the Authors
Have you tested your approach on any tasks beyond modular arithmetic to see if the delayed feature evolution and grokking phenomenon persist? How sensitive are your progress measures to different kernel choices or initializations in RFMs? Can you provide more insights into how the reordering using the discrete logarithm affects the observed feature structure?
Claims and Evidence
The authors claim that grokking is not specific to neural networks and is not tied to gradient-based methods. They show this by demonstrating grokking in RFMs, where the training loss is zero but the model only generalizes after many iterations when the features are more fleshed out. They support this with evidence such as gradual improvements in Circulant Deviation and AGOP Alignment, measures that capture how close the features are to the ideal block-circulant structure.
Methods and Evaluation Criteria
The paper uses a mix of theory and experiments on modular arithmetic tasks like addition, subtraction, multiplication, and division. The models are evaluated not only on standard metrics like test accuracy and test loss but also on new progress measures that track feature quality over time. The experimental design compares RFMs with neural networks and examines how features evolve during training despite the immediate reduction in training loss.
Theoretical Claims
On the theory side, the paper argues that the key to grokking is the emergence of a block-circulant feature matrix. They show that when the learned feature matrix attains this specific structure, the model effectively implements the Fourier Multiplication Algorithm. A central result (Theorem 5.1) proves that a kernel machine with a quadratic kernel and a block-circulant Mahalanobis matrix will compute modular arithmetic operations exactly as the Fourier Multiplication Algorithm does. The math leverages the fact that circulant matrices are diagonalizable using the Discrete Fourier Transform, linking the learned feature structure directly to the generalization ability.
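The diagonalization fact the theorem rests on is easy to verify numerically; a small self-contained check (an illustrative sketch, not code from the paper):

```python
import numpy as np

p = 7
c = np.random.default_rng(0).standard_normal(p)

# Circulant matrix with first column c: C[i, j] = c[(i - j) mod p]
C = np.array([[c[(i - j) % p] for j in range(p)] for i in range(p)])

# Unitary DFT matrix F[j, k] = exp(-2*pi*i*j*k/p) / sqrt(p)
jj, kk = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
F = np.exp(-2j * np.pi * jj * kk / p) / np.sqrt(p)

# C is diagonalized by the DFT, with eigenvalues given by the DFT of c.
assert np.allclose(F @ C @ F.conj().T, np.diag(np.fft.fft(c)))
```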
Experimental Design and Analysis
The experimental design centers on training RFMs on modular arithmetic tasks. Despite perfect training loss from the outset, the test accuracy remains near chance level for many iterations. The authors analyze how the feature matrix, particularly the AGOP, evolves over time. They introduce and track two progress measures (Circulant Deviation and AGOP Alignment) to show that the feature structure gradually becomes more organized. They also include experiments with neural networks and tests using random circulant transformations to further support the claim that the emergence of the correct feature structure is what drives grokking.
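To make the two measures concrete, here is a hedged sketch of how they could be computed: the AGOP alignment is the cosine similarity of vectorized matrices, while the circulant-deviation function below is only an illustrative proxy (projection onto constant wrapped diagonals) and may differ from the paper's exact definition.

```python
import numpy as np

def agop_alignment(A, B):
    """Cosine similarity between the vectorized matrices A and B."""
    a, b = A.ravel(), B.ravel()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def circulant_deviation(M):
    """Illustrative proxy: relative distance from M to its best circulant
    approximation, obtained by averaging along wrapped diagonals."""
    p = M.shape[0]
    diag_means = np.array([np.mean([M[i, (i - d) % p] for i in range(p)])
                           for d in range(p)])
    C = np.array([[diag_means[(i - j) % p] for j in range(p)] for i in range(p)])
    return np.linalg.norm(M - C) / np.linalg.norm(M)
```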
Supplementary Material
The supplementary material includes detailed proofs of the main theoretical results, such as the diagonalization of circulant matrices and the formal proof of Theorem 5.1. Additional appendices explain experimental details like the reordering of feature matrices using the discrete logarithm, extra experiments on multi-task grokking, and studies on enforcing circulant structure during training. This extra material helps to reinforce and elaborate on the main claims of the paper.
Relation to Prior Literature
This work builds on the growing literature on grokking, which has mainly focused on neural networks and delayed generalization. It relates to prior studies that view grokking as a transition from a lazy memorization regime to one where rich, useful features are learned. The paper connects these ideas with the concept of feature learning in kernel methods and neural networks and also ties into Fourier-based methods for solving modular arithmetic. In doing so, it challenges the idea that grokking is solely a function of network architecture or gradient-based optimization. It's an interesting addition to papers like the 2024 one, "Grokking as the transition from lazy to rich training dynamics," which highlight the importance of optimization dynamics and architecture, versus this one, which highlights feature learning as the key driver.
Missing Important References
While the paper is well-grounded in the literature on grokking and feature learning, it could benefit from more discussion of works on implicit bias in optimization and how these biases shape feature learning.
Other Strengths and Weaknesses
A clear strength of the paper is its novel approach to isolating feature learning from the usual gradient-based optimization by using RFMs. The theoretical connection between block-circulant features and the Fourier Multiplication Algorithm is super cool and interesting. On the downside, the experiments are limited to modular arithmetic tasks, so it is uncertain how well these insights will carry over to more complex, real world dynamics. Also, the paper focuses heavily on RFMs, which may limit how easily it can be generalized to other architectures.
Other Comments or Suggestions
No comments
We thank the reviewer for their positive comments and thorough feedback. We will answer their questions below.
While the paper is well-grounded in the literature on grokking and feature learning, it could benefit from more discussion of works on implicit bias in optimization and how these biases shape feature learning.
We would appreciate specific references that the reviewer feels are relevant. We would be happy to extend the discussion to include them.
On the downside, the experiments are limited to modular arithmetic tasks, so it is uncertain how well these insights will carry over to more complex, real world dynamics. Also, the paper focuses heavily on RFMs, which may limit how easily it can be generalized to other architectures.
Applying RFM is crucial for establishing that (1) grokking is not tied to neural networks, as it is a non-neural model, (2) it is not tied to gradient-based optimization or training loss, and (3) it can be induced solely through feature learning. Further, RFM allows us to isolate the ability of feature learning through AGOP to induce generalization in surprising settings. As mentioned in another response, there is also substantial evidence that the algorithm implemented by neural networks is the Fourier Multiplication Algorithm. The fact that RFM recovers the FMA (in our theoretical setting) and the features learned by neural networks (empirically) suggests that feature learning through AGOP is a useful lens for studying neural networks. This is especially true as non-parametric methods without the AGOP appear unable to learn modular arithmetic tasks.
Have you tested your approach on any tasks beyond modular arithmetic to see if the delayed feature evolution and grokking phenomenon persist?
We have ongoing results where grokking occurs with RFM for other algorithmic tasks, including sparse parities, number representations based on the Chinese remainder theorem, and certain group structures.
How sensitive are your progress measures to different kernel choices or initializations in RFMs?
The choice of kernel can affect the rate at which grokking occurs in RFM or whether it happens at all. Quadratic and Gaussian kernels are suited for modular arithmetic due to the target function being quadratic in Fourier space, while RFM with the Laplace kernel does not appear to generalize. In all of our experiments, we use the default initialization for RFM with the identity matrix.
Can you provide more insights into how the reordering using the discrete logarithm affects the observed feature structure?
Re-ordering by the discrete logarithm only affects the visualization of the features, as RFM is invariant to re-ordering of the input coordinates. We re-order coordinates for the multiplication and division tasks in order to visualize the circulant structure, which is otherwise obscured. The features are circulant after re-ordering because modular multiplication/division (excluding the zero element) are equivalent to addition/subtraction after transformation with the discrete logarithm. Note that it is necessary to know the structure of that re-ordering in order to invert it and recover the "underlying algorithm".
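To illustrate the identity underlying this re-ordering, here is a small self-contained check (illustrative code, not from our experiments), using p = 11 and primitive root g = 2: re-indexing by the discrete logarithm turns multiplication mod p into addition mod p − 1, which is why the re-ordered features become circulant.

```python
p, g = 11, 2                                   # 2 is a primitive root mod 11

# Discrete logarithm table over the nonzero residues: dlog[g^k mod p] = k
dlog = {pow(g, k, p): k for k in range(p - 1)}

for a in range(1, p):
    for b in range(1, p):
        # log(a*b) = log(a) + log(b)  (mod p-1)
        assert dlog[a * b % p] == (dlog[a] + dlog[b]) % (p - 1)
```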
The paper focuses on "grokking" -- a phenomenon that has attracted a lot of interest recently, mostly because it is not what the ML community was used to observing regarding training dynamics and generalization.
The paper's main point seems to be that grokking is not specific to neural networks and to SGD training. Instead, the authors show that grokking is also observed with Recursive Feature Machines (RFMs) -- when trained on modular arithmetic tasks. Another main point is that what drives generalization is the learning of structured feature representations -- and the paper proposes a couple of metrics (such as "circulant deviation") to show this gradual progress of the model towards generalization.
Questions for the Authors
- Is there an information-theoretic way to show why circulant matrices emerge?
Claims and Evidence
I find the paper a bit confusing on its "positioning" with respect to prior work in this research area.
First, the paper focuses exclusively on modular arithmetic -- but grokking is a more general phenomenon and it has been observed also in language models among others. This has to be clarified somehow: even though the paper helps us better understand the process of learning structured features in a very specific task that relates to modulo arithmetic, its claims and evidence cannot be generalized beyond such tasks.
Methods and Evaluation Criteria
I like how the paper uses both mathematical insights and experimental results -- and overall I found it to be rigorous, with clear definitions, methods, and metrics.
Theoretical Claims
The most important mathematical result of the paper is (in my own words): a good solution for the modulo arithmetic problem is the Fourier multiplication algo (FMA). The paper shows that circulant matrices allow kernel machines to solve modulo arithmetic using FMA.
Experimental Design and Analysis
The paper is quite strong in terms of experimental design and analysis -- showing several results. Some of the empirical claims that I found more interesting are: a) that even RFMs can show grokking behavior, b) that a metric such as AGOP shows gradual improvement during learning -- as opposed to accuracy, c) that circulant features solve the modular arithmetic problem using FMA.
Supplementary Material
I admit that I went through the supp material superficially.
Relation to Prior Literature
The paper has connections with broader questions related to learning theory, emergent phenomena in learning, and foundations of AI.
Missing Important References
None that I know of.
Other Strengths and Weaknesses
- The paper certainly contributes to the fundamental understanding of grokking -- but it does not explain some key characteristics of grokking (such as what determines how long it will take, say in epochs, until the model learns those structured features?)
- And as previously mentioned, the paper is quite narrow -- focusing only on modular arithmetic.
Other Comments or Suggestions
I think that the paper can be improved if the authors find a way to describe more clearly, and from the first section of the paper, what is the key "take home message" of this study.
We thank the reviewer for their detailed feedback. We address their questions and comments below.
First, the paper focuses exclusively on modular arithmetic -- but grokking is a more general phenomenon and it has been observed also in language models among others. This has to be clarified somehow: even though the paper helps us better understand the process of learning structured features in a very specific task that relates to modulo arithmetic, its claims and evidence cannot be generalized beyond such tasks.
Like prior works (e.g. [1,2]), we focus on the setting of modular arithmetic because (1) these tasks clearly exhibit the sharp transition from trivial-to-perfect generalization, and (2) there exist analytic solutions to the task that we can compare the learned algorithms to. In particular, there is substantial evidence that the algorithm implemented by neural networks is the Fourier Multiplication Algorithm [2,3]. Moreover, it appears that non-parametric methods without the ability to learn features are unable to learn modular arithmetic, hence these tasks are a strong test-bed for the predictive power of feature learning through AGOP.
[1] Power et al., Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets, Mathematical Reasoning in General Artificial Intelligence Workshop, ICLR 2021.
[2] Nanda et al., Progress measures for grokking via mechanistic interpretability, ICLR, 2023.
[3] Zhong et al., The clock and the pizza: Two stories in mechanistic explanation of neural networks, NeurIPS, 2023.
The paper certainly contributes to the fundamental understanding of grokking -- but it does not explain some key characteristics of grokking (such as what determines how long it will take, say in epochs, until the model learns those structured features?)
While this question is beyond the scope of this paper, we do observe the rate at which features evolve through the AGOP alignment and circulant deviation progress measures. We find that while the features make progress linearly across iterations, generalization exhibits a sudden phase transition, suggesting that error is a sharp function of feature quality specifically. We are exploring theoretical analyses to potentially identify the threshold at which better features lead to improved generalization.
Is there an information-theoretic way to show why circulant matrices emerge?
One possibility to understand why circulant matrices emerge on these tasks is through their effect on the Mahalanobis kernel. As circulant matrices are diagonalized by the DFT matrix, we can view inner products through the block-circulant matrix as applying the DFT to each of the one-hot encoded integer arguments, then re-weighting the frequency components by the (complex) eigenvalues of the circulant sub-blocks. We suspect the frequency reweighting induces linear separability of the DFT vectors, enabling generalization with kernel regression. We are currently exploring this direction.
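To spell this out in symbols (a sketch under the stated assumptions: a single circulant block $M = F^{*}\,\mathrm{diag}(\lambda)\,F$, with $F$ the unitary DFT matrix, $\lambda$ the eigenvalues of $M$, and real inputs $x, x'$):

$$x^\top M x' \;=\; (F x)^{*}\,\mathrm{diag}(\lambda)\,(F x') \;=\; \sum_{k=0}^{p-1} \lambda_k\, \overline{\hat{x}_k}\,\hat{x}'_k, \qquad \hat{x} = F x,$$

so the Mahalanobis inner product is a frequency-reweighted inner product of the DFTs of the inputs; the block-circulant case applies this block-wise to the one-hot encodings of the two arguments.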
This paper receives positive reviews from all reviewers. The reviewers find that the paper is well-written, that the theoretical connection between block-circulant features and the Fourier Multiplication Algorithm is interesting, and that the experimental insights contribute to the fundamental understanding of grokking in non-neural methods. The authors' rebuttal helps address some issues with the scope of the paper. Based on the consistently positive reviews and the authors' rebuttal, I recommend acceptance.