Large Language Models as Model Organisms for Human Associative Learning
LLMs show human-like representational change during associative learning, driven by pair similarity and modulated by prior knowledge.
Abstract
Reviews and Discussion
The main objective of this paper is to assess whether the computational models underlying LLMs are useful for testing hypotheses about human memory dynamics, offering a scale and degree of experimental control that is rarely achievable in biological studies. More precisely, the question is whether LLMs learn token representations that evolve during associative learning in ways that mimic those observed in humans, with emphasis on the Non-Monotonic Plasticity Hypothesis (NMPH).
To do so, the authors establish a simple and clear experimental protocol that does not involve fine-tuning LLMs, as it operates "in context". Associative learning is simulated by presenting LLMs with token pairs repeated a certain number of times within the context (hence, in-context learning). The token pair representation is measured in the last layer of the LLM architecture, before the output logits are produced. Representation dynamics are measured as the difference between the cosine similarity of the pair before learning and after a number of in-context repetitions.
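For concreteness, here is a minimal sketch of this measurement in the non-predictive control setting the authors describe in their rebuttal; the model choice, prompt layout, and position indexing are illustrative assumptions, not the paper's exact implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # illustrative; any of the tested causal LMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def pair_similarity(x: str, y: str, r: int) -> float:
    """Cosine similarity between the last-layer states of the final x and y
    after r in-context repetitions (assumes x and y are single tokens)."""
    prompt = " ".join(f"{x} {y}" for _ in range(r))   # "x y x y ... x y"
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    h = out.hidden_states[-1][0]                      # last layer, (seq_len, dim)
    h_x, h_y = h[-2], h[-1]                           # states at the final x and y positions
    return torch.cosine_similarity(h_x, h_y, dim=0).item()

# Representational change: similarity after r repetitions minus a pre-learning
# baseline (r=1 is a rough proxy; the paper's exact baseline may differ).
delta = pair_similarity("tokA", "tokB", r=50) - pair_similarity("tokA", "tokB", r=1)
```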
Results indicate that: 1) representation dynamics exhibit a non-monotonic pattern of representational change, consistent with the NMPH, and 2) these dynamics are influenced by vocabulary interference, reflecting a joint influence of pairwise similarity and global contextual competition within the model's prior knowledge.
Strengths and Weaknesses
Strengths:
- In my personal opinion, I believe this kind of measurement study is very important and very interesting. Explaining LLM behavior through the lens of neuroscience is informative, but it is also very promising to understand whether LLMs could be thought of as "surrogate human brains" to test hypotheses and replicate experiments that would otherwise be cumbersome if applied to real biological systems.
- The article is very clear, the editorial quality meets high standards, and the experimental protocol is sound from the statistical point of view.
- Results corroborate many findings and discussions that have been presented in the Workshop on Representational Alignment (https://representational-alignment.github.io/2025/).
- The idea of studying vocabulary interference as a mechanism modulating representational change is a nice contribution of this work; it cannot be achieved in a real biological system, yet it provides some explanation of such dynamics.
- The modified greedy coordinate gradient technique to find candidate token pairs is smart!
Weaknesses:
- The main figure of the paper, Fig. 2(b), indicates a very weak U-shape, in my opinion. Based on reference [28], it is difficult to quantitatively compare the "strength" of misalignment between representations. Despite extremely low standard errors, the NMPH pattern appears only during the so-called "consolidation" phase, and it is hard to tell at which number of repetitions this really happens. Indeed, Fig. 2(a) has an x-axis that does not report the exact numerical value of r.
- Fig. 2(b) presents (if I understood correctly) an aggregate rendering across all tested models. This hinders an appropriate understanding of the phenomenon and its relation to a given LLM's behavior.
- The interpretation of Fig. 3(b) is somewhat speculative. Also, the values in the x-axis interval [0.9, 1] contradict the explanation given.
- Finally, as noted by the authors, one big limitation of the current work is that the token pairs do not represent "real-world" natural semantics. Given the main paper and the appendix, it was hard for me to understand exactly what the selected token pairs were.
Questions
- In Fig. 2(a), can you clarify the x-axis? Given the comments in the forgetting phase paragraph (lines 208-210), I think there has been some normalization, for otherwise the ticks are not consistent with the explanations.
- In Fig. 2(b), at which repeated co-occurrence count r is the change computed? If I understand lines 220-222 correctly, this is the average value aggregated across models and similarity groups, but nothing is said about r. Are you also aggregating along the number of repeated co-occurrences?
- In Fig. 2(b) you claim that with moderate pre-learning similarity (0.6, 0.75) there is a substantial decrease in pairwise similarity during consolidation. I see a delta of approximately 0.15 and I cannot wrap my head around the idea that this is substantial. I also read reference [28] and there is no easy quantitative estimation of what constitutes a substantial deviation from a "pattern separation" hypothesis. This observation, combined with the doubts raised in the previous question, makes me wonder if the U-shape we observe is significant or not.
- Although you acknowledge it as a limitation of your work, I wonder: what exactly are the token pairs you selected in your study? How would your results change if you considered token pairs that were more natural, real-world examples?
- As a follow-up to the previous question, how would the conclusions of this work change if we considered more realistic patterns than single token pairs, for example sentences? In this case, we would need a method to "mangle" sentences in a controlled manner to gauge dissimilarity.
- A final technical question: in equation (2) and in line 129 you specify how to compute the pairwise similarity, and how h_x is obtained. However, I am not sure I get a grip on how h_y is obtained. If the model's task is to predict y, then you would not have access to its latent representation, right? So are you actually feeding the model y, such that you can measure the contextualized latent representation of y? Any clarification on this point would be very useful to understand the nitty-gritty details of your method.
Limitations
Yes.
Final Justification
I was convinced by the authors' rebuttal, both to my review and to the other reviews (irrespective of their ratings). I think this is a very interesting article, and I hope the authors will prepare a new version of the paper including all the necessary "disclaimers" they discussed in answering my questions.
Formatting Concerns
No issues remarked.
We thank the reviewer for the insightful feedback.
Summary of main points:
- Q1: we clarify Fig. 2a and 2b and show that mid-similarity pair differentiation remains statistically significant after correction for multiple comparisons. While [28] does not provide quantitative predictions, our results are consistent with its core theoretical claims, even though we agree that integration effects for high-similarity pairs are weaker.
- Q3: we clarify Fig 3 and our proposed vocabulary interference measure as a plausible factor influencing representational dynamics. This interpretation remains exploratory, as global stimulus interference has not, to our knowledge, been systematically studied in neuroscience studies. Estimating such effects in humans would require modeling an individual’s lifetime exposure to concepts, a challenging task. Nonetheless, by leveraging the fully observable space of LLMs, we are able to investigate this underexplored factor, grounding our interpretation in neuroscience principles.
- Q4: we tested WordNet-based token pairs and found that high vocabulary interference led to a shift from U-shaped to monotonic representational change. This supports our claim in Sec 3 that NMPH predictions depend on global similarity structure.
Q1 [Fig 2b & significance of NMPH]: In Fig 2b we aggregate the representational change across models, similarity groups, and repetition counts that correspond to the consolidation phase (as defined per model in Tab 3, Appendix C). We will further include per-model and per-repetition plots in the appendix so readers can assess the consistency and robustness of the U-shape across conditions.
The work that proposed the NMPH [28] provides a theoretical account of how representational change should vary with memory coactivation, but it does not specify quantitative predictions, making it difficult to define how pronounced a U-shape must be to support the hypothesis. For empirical context, [39] report a differentiation effect of approximately -0.1 (Pearson correlation) for mid-similarity pairs in human fMRI data. While this offers a useful benchmark, our analysis uses cosine similarity and is conducted on LLM representations, so direct comparisons of effect size are not possible. Nevertheless, the ~0.15 change we observe is robust and statistically significant across all 6 models (after multiple comparisons correction; see reply to reviewer ctgT). While we do not observe clear integration for high-similarity pairs, a factor that contributes to the U-shape being less pronounced, the differentiation in the mid-similarity range remains consistent with the core predictions of the NMPH, and we believe this provides meaningful support for the hypothesis as illustrated in Fig. 2b.
Q2 [Fig 2a]: We normalize the x-axis to allow alignment of different models based on their phase transitions (from encoding to consolidation to forgetting). Each model enters and exits these phases at different repetition counts due to a slight variability in their learning dynamics (as can be observed in Tab. 3 of Appendix C). To enable visual comparison across models, we map the repetitions for each phase into fixed intervals on the x-axis: (0,1) for encoding, (1,2) for consolidation, and (2,3) for forgetting. Within each model, the actual repetition counts in a phase are linearly scaled to fit within these ranges. This approach ensures that phase-aligned trends are visible despite underlying differences in repetition scale. We will revise the figure caption and text to more clearly state this normalization and add the original accuracy per model in the Appendix. Please let us know if further clarification is necessary.
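A minimal sketch of this phase normalization; the phase boundaries below are illustrative, the actual per-model values being those of Tab. 3:

```python
import numpy as np

def normalize_repetitions(reps, phases):
    """Map raw repetition counts into fixed unit intervals per phase:
    encoding -> (0, 1), consolidation -> (1, 2), forgetting -> (2, 3)."""
    out = []
    for r in reps:
        # phases are ordered (encoding, consolidation, forgetting) as (lo, hi) ranges
        for k, (lo, hi) in enumerate(phases):
            if r <= hi or k == len(phases) - 1:
                out.append(k + (r - lo) / (hi - lo))  # linear scaling within the phase
                break
    return np.array(out)

# e.g. a model that encodes over reps 1-10, consolidates over 10-200, forgets over 200-400
x = normalize_repetitions([1, 5, 10, 100, 200, 400], [(1, 10), (10, 200), (200, 400)])
# -> [0.0, 0.44, 1.0, 1.47, 2.0, 3.0]
```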
Q3 [Fig 3]: We agree that the interpretation of Fig 3b involves a degree of speculation. This is, in part, because global stimulus interference has not been systematically studied in memory neuroscience, to the best of our knowledge. Estimating such global effects in humans would require modeling an individual's prior exposure to all words and concepts, which is extremely challenging and remains an open problem in cognitive neuroscience. For example, [1] explicitly discusses the difficulty of estimating and accounting for the impact of priors in human memory studies. By leveraging the fully observable representational space of LLMs, we are able to explore this under-investigated factor and propose vocabulary interference as a potential mechanism influencing the different and seemingly contradictory findings in humans. While our interpretation is necessarily exploratory, we believe it offers a plausible explanation grounded in principles from neuroscience: specifically, that memory systems must balance the competing demands of integration and interference resolution. We will revise the main text to (1) make the speculative nature of the explanation more explicit, and (2) more clearly connect it to the underlying theory and the observed behavior in Fig. 3(b).
Q4 [Semantically meaningful tokens]: Our study intentionally used token pairs selected for their pre-learning similarity, regardless of semantic meaning. This approach is similar to the use of synthetic stimuli in [39], which highlights the need to sample across the full similarity spectrum (especially the mid-similarity range) to test the NMPH effectively. Real-word tokens are unevenly distributed across this space, making precise control difficult. Our main goal in this paper is not to study meaning, but the structural dynamics of representational change in response to learning. However, we do agree that using meaningful tokens is an important analysis for understanding how the NMPH applies in real-world learning. We sampled new token pairs, restricting the search space to single-token words from WordNet. This filtering step reduced the usable vocabulary to approximately 20% of the original token set in the two models tested at the time of this reply.
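For illustration, here is a minimal sketch of this kind of vocabulary filtering; the tokenizer choice, lemma handling, and helper names are our assumptions, not the paper's code:

```python
import nltk
from nltk.corpus import wordnet as wn
from transformers import AutoTokenizer

nltk.download("wordnet", quiet=True)
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Single-word WordNet lemmas (drop multi-word entries such as "ice_cream").
lemmas = {l.name() for s in wn.all_synsets() for l in s.lemmas() if "_" not in l.name()}

def is_single_token(word: str) -> bool:
    """True if the tokenizer encodes the word as exactly one token."""
    return len(tok.encode(word, add_special_tokens=False)) == 1

usable = sorted(w for w in lemmas if is_single_token(w))
print(f"{len(usable)} usable words")  # should reflect the reduced (~20%) vocabulary
```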
We first examined how pairwise similarity and vocabulary interference were distributed within this space, anticipating that the limited vocabulary might bias the pairs toward a narrower range of interference values. We found that real-word token similarity more closely reproduces the vocabulary interference distribution. According to our framework in Sec. 3, this implies that most of these pairs would experience high vocabulary competition during prediction. Based on this observation, we hypothesized that the representational change curves would no longer exhibit a U-shaped trend but instead resemble a linearly decreasing pattern, as seen in the orange and blue curves of Fig. 3b. This prediction was confirmed in our results: post-learning similarity decreased monotonically with pre-learning similarity, supporting the idea that highly similar token pairs undergo representational changes influenced by the degree of vocabulary interference.
We will include this new analysis in the revised paper and revise our claims accordingly. Specifically, we will clarify that while the NMPH provides a useful theoretical framework, its behavioral predictions may depend on the distributional properties of real-world stimuli. Our findings suggest that to fully explain representational change in naturalistic learning contexts, NMPH-like theories may need to account for global similarity structure, such as vocabulary interference, as a modulating factor.
Q5 [h_y]: In our setup, the model is prompted to predict a two-token sequence, where the first predicted token should be y for the prediction to be considered correct. We obtain h_x as the final hidden state of the input tokens, and h_y as the hidden state of the first predicted token (that is, the first token generated by the model, which is not part of the input). Although y is not explicitly fed into the model, we can still access its contextualized representation through the generation process. To validate the robustness of our method, we also conducted a control experiment in a non-predictive setting, where both x and y were given to the model as input, and h_y was extracted directly from the input sequence. The results, including the U-shaped pattern, were qualitatively similar in both predictive and non-predictive settings. We will revise the main text to clarify this distinction between h_x and h_y.
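For concreteness, a minimal sketch of how h_x and h_y could be read off in the predictive setting with a HuggingFace causal LM; the greedy decoding, prompt layout, and variable names are our assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

# r repetitions of "x y" followed by a final cue x; the model should predict y next.
prompt = " ".join(["tokA tokB"] * 50) + " tokA"
ids = tok(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    out = model(ids, output_hidden_states=True)
    h_x = out.hidden_states[-1][0, -1]       # final hidden state of the input (the cue x)
    y_pred = out.logits[0, -1].argmax()      # first predicted token (greedy decoding)

    # Append the predicted token and re-run to read its contextualized state:
    # y is never given as input, only generated.
    ext = torch.cat([ids, y_pred.view(1, 1)], dim=1)
    out2 = model(ext, output_hidden_states=True)
    h_y = out2.hidden_states[-1][0, -1]      # contextualized representation of predicted y

post_sim = torch.cosine_similarity(h_x, h_y, dim=0)  # post-learning pair similarity
```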
Q6 [Extension to sentences]: We appreciate the reviewer’s question. Extending this work to sentence-level inputs is an interesting direction, but it introduces several challenges. First, it is nontrivial to construct sentence pairs with controlled and quantifiable pre-learning similarity, which is essential for testing nonmonotonic representational change as predicted by the NMPH framework. Second, it is unclear what specific patterns one should expect from sentence-level associations, given the added complexity of syntax and semantics. Our study intentionally focused on a simplified associative learning task using token pairs, in line with the experimental design used in neuroscience (e.g., [Reb1]), to evaluate whether theoretical predictions about hippocampal learning dynamics extend to LLMs. This controlled setup allowed us to isolate and investigate representational changes in a targeted and interpretable way.
[Reb1] Stark, Shauna M., et al. "What's in a context? Cautions, limitations, and potential paths forward." Neuroscience Letters 680 (2018): 77-87.
We believe we have addressed the main points and welcome any follow-up questions.
Dear authors, thank you for your rebuttal. I have also read the other reviews and the respective rebuttals.
The clarifications, answers to my questions, and additional discussions are convincing to me. I strongly suggest including the discussion from your answer to Q1 in the revised paper, to make sure readers understand the statistical significance of your results.
I will raise my score.
Thanks
Dear reviewer,
Thank you again for your feedback and for taking the time to read our rebuttal.
We’re pleased that you found our clarifications helpful and the discussion convincing. Regarding Q1, thank you for your suggestion. We had initially considered this analysis but opted for a simpler approach, thinking it might be more accessible for the target audience. That said, we agree your suggestion is more rigorous and appropriate, and we will incorporate it in the revised version to better clarify the statistical significance of our results.
Thank you once more for your careful review and for raising your score.
The authors in this paper ask whether LLMs associate words and concepts in line with the Non-Monotonic Plasticity Hypothesis (NMPH), as observed in humans. The objective of this study is to use in-context learning (no fine-tuning, to avoid any changes to learned weights) and examine whether the model starts predicting y when it sees x, just from seeing them co-occur after r repetitions. This study offers a unique way of not only understanding the internal dynamics and the representational space of pre-trained LLMs but also offers a novel way of using LLMs as a computational tool to understand associative learning in humans. After performing experiments across a wide range of LLMs, the authors report that LLMs do show the pattern of NMPH, akin to humans.
Specifically, the authors first compute the cosine similarity between tokens (x, y) before and after r repetitions to observe how closely these tokens are integrated or differentiated over time. Finally, the authors perform one more experiment to check for vocabulary interference by sampling random tokens from the model's vocabulary. This is done to find the similarities between lookalikes of token y in the (x, y) token pair, where x is fixed.
It is important to note that the authors conduct all these experiments by using the representations from the last layer of the models.
This is a novel, clean, and direct way of testing cognitive theories of the human brain, such as NMPH.
Strengths and Weaknesses
Strengths:
- The experiments are conducted across six LLMs of varying architecture sizes, and the results are consistent across the models reported in the main text. I don't think the authors need to include more LLMs to check for generalization; however, I do think that some additional evaluations in the form of visualizations would increase the quality of this paper. I will add those additional evaluations in the weaknesses section. Please look at those points.
- The results are very clearly plotted, and a consistent pattern is observed in all the graphs.
- The authors build this research by shifting their focus from behavioral outcomes to the internal representational dynamics, an attempt to understand associative memory in LLMs. I think this is an important line of research with a lot of potential and research gaps to address in the future.
- The authors use In-Context Learning (ICL), which I found to be an optimal choice for probing the LLMs, as it allows the authors to be consistent across different models of varying architecture sizes. This also prevents the need for fine-tuning models on task-specific datasets, which could introduce biases.
- A follow-up experiment is also done to check the similarities between the y token from the (x, y) pair and similar tokens to y from the model vocabulary.
Weaknesses:
- The foundation of this research is built upon the fact that associative memory in the human brain follows the NMPH when associating words and concepts. However, I find no concrete evidence in the paper suggesting or explaining why this is tricky/difficult to study with human subjects. What are the core problems of conducting such research with human subjects? Why is this study not possible with biological systems? What behavioral evidence is already present, and why is it not sufficient to explain this hypothesis? Why is the use of LLMs the only alternative option (if that's the case)? To strengthen the claims made in this study, the authors should address these points early in the paper. If the authors think they have already addressed all these questions, then please highlight the line numbers, and I will recheck whether they answer these points.
- I agree with the statement in lines 37-41 where the authors highlight the unique property of ICL in LLMs; however, I would like to add that this is highly dependent on how the LLMs were pre-trained. Biological systems, especially human infants, don't have access to billions or trillions of word tokens like LLMs. The authors should consider adding that they are mapping the results of the LLMs to those of mature human brains. This requires a small tweak in the language used in the paper, but it is an important one, as the study compares the learning mechanisms of human brains and LLMs.
- The authors don't show whether the claims made in this study are a result of the training data, the model architecture, or a combination of both. This is important for understanding why LLMs show this human-like behavior and how this behavior even emerges in the first place. If the authors can evaluate an untrained (or random) LLM architecture and re-run the same experiments on it, it will increase the quality of this paper. I will increase a point in the quality assessment if the authors can report the performance of an untrained network.
- I did not find any example of the (x, y) token pair in the manuscript or the appendix. I request that an example be given in the main manuscript to better understand the input type. The input pair (x, y) can be as simple as ("UK", "London") or much more complex. Therefore, an example prompt should be given in the paper. On a similar note, did the authors try different types of prompts when running the experiments? If yes, did they observe any difference in results?
- The study has shown that the internal dynamics of LLMs change over time with repetitive cues. This is shown using cosine similarity (manuscript) and heatmaps (appendix). To further increase the quality of this study, I suggest including a visualization using either t-SNE or PCA, showing how the internal representations of the tokens shift before and after learning (see the sketch after this list).
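For concreteness, a minimal sketch of the suggested visualization; the hidden states here are random placeholders standing in for the before/after representations:

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Placeholder hidden states: (n_tokens, hidden_dim) before and after in-context learning.
rng = np.random.default_rng(0)
H_before = rng.normal(size=(24, 4096))
H_after = H_before + 0.3 * rng.normal(size=(24, 4096))

# Fit PCA on the pooled states so both snapshots share one 2-D projection.
pca = PCA(n_components=2).fit(np.vstack([H_before, H_after]))
b, a = pca.transform(H_before), pca.transform(H_after)

plt.scatter(b[:, 0], b[:, 1], label="before")
plt.scatter(a[:, 0], a[:, 1], label="after")
for p, q in zip(b, a):  # arrows showing how each token's representation shifted
    plt.arrow(p[0], p[1], q[0] - p[0], q[1] - p[1], alpha=0.3, length_includes_head=True)
plt.legend()
plt.show()
```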
Questions
- In Figure 1B, the token pairs (x, y) that are passed as input to the model are shown in different colors. What's the significance of using different colors for the same input? Is it done to show the time scale? If yes, then I would recommend using a single color and then showing the time scale with an arrow.
- I am curious whether the authors found any LLMs that did not align with any of the three phases (encoding, consolidation, and forgetting). If so, the authors can consider adding that to the appendix. Reporting such models won't necessarily weaken the claim in the main text, but rather provide insights about models that don't show human-like learning.
- What would happen if the models were given some noise in between the repetitions? For instance, how would an input like xy,xy,xy,yx,xy,xy,av,xy,xy,... affect the results shown in the graphs in Fig. 2?
- Did the authors observe any pattern between successful and failed LLMs, such as architecture size, vocabulary size, or anything else?
Suggestions:
- Please focus on the points that I have given under the 'weaknesses' title above to improve the quality of the paper. Those are my main concerns and suggestions for improving the paper.
- I would be very interested to see how the results turn out in the case of representations from the intermediate layers. I highly encourage the authors to probe the intermediate layers in future work.
- While I have already noted this under the "Weaknesses" section above, I would like to emphasize it again. The authors should consider including results from a randomly initialized LLM (i.e., without loading pretrained weights) to test whether the architecture alone is sufficient to produce some of the patterns observed in the trained models. Additionally, a promising direction for future work would be to experiment with training data that follows a developmental curriculum, mimicking how human infants learn progressively over different age stages.
Limitations
The authors have addressed some limitations of the paper. However, since the study draws parallels between the learning mechanisms of the human brain and LLMs, the following limitation should also be acknowledged:
LLMs are pre-trained on billions to trillions of tokens, whereas humans, especially infants, are exposed to a significantly smaller amount of data during early development. This stark difference in training exposure raises questions about the direct comparability of learning dynamics between the two systems. A solution to this limitation is to train a model from scratch on a curriculum (developmental) data of different age groups and evaluate the model after each stage.
Final Justification
The authors have addressed all my concerns in the rebuttal, and I am convinced by their response. Therefore, I am increasing my score.
Formatting Concerns
There are no paper formatting concerns that I have noticed.
We thank the reviewer for the insightful feedback.
Summary of main points:
- Q1: studying NMPH in humans is challenging due to limited control over pre-learning similarity, task-dependent effects, and constraints like noise and cost. Human studies also struggle to account for global knowledge structure. We argue that LLMs, though biologically distinct, offer a valuable model system for testing representational-level hypotheses under controlled and fully observable conditions.
- Q4: we considered this, but untrained LLMs are unsuitable for our paired-associate task due to the lack of ICL abilities.
- Q5: we used synthetic token pairs selected by pre-learning similarity, independent of semantics, to ensure full coverage of the similarity spectrum, especially the mid-similarity range critical for testing NMPH [39]. Token pair examples per group are now included. We also explored real-word tokens and found their similarity aligns with vocabulary interference, shifting the representational change curve from U-shaped to monotonic. This supports our claim that global interference is a key modulating factor.
Q1 [Foundation of the research]: Studying the NMPH in humans presents several challenges. First, it is extremely difficult to precisely control the pre-learning similarity between associated items, a core requirement for detecting nonmonotonic representational change, as emphasized in [39]. Defining what counts as "0.1" vs. "0.5" similarity in humans is nontrivial, costly, and time-consuming. For this reason, [39] relied on representational inversions of DNNs to generate visual stimuli with controlled perceptual similarity. Second, the similarity level at which differentiation emerges varies by task and stimulus set, making it unclear in advance which mid-similarity range will reveal the effect. This requires dense sampling across the similarity continuum, adding further complexity. Finally, as with any human behavioral experiment, results are constrained by cost, limited trials, participant fatigue, and noise.

Beyond pairwise similarity, additional factors (such as global, vocabulary-level interference) may also influence learning dynamics. Measuring this in humans would require estimating an individual's lifetime exposure to words and concepts, which is inherently difficult and practically infeasible. As a result, human studies typically do not account for how prior knowledge structure might modulate representational change. There is empirical support for all three hypotheses discussed in our paper [6, 9, 10, 28, 30, 32]. However, these findings are often empirically inconsistent, and it remains unclear when each dynamic arises, under what conditions, and which factors drive the observed differences. We propose that global interference may be one such factor influencing these outcomes. For instance, in memory systems such as the hippocampus, similar representations are thought to compete during retrieval, and the system works to disambiguate overlapping information. [9] note, "The hippocampus is believed to reduce memory interference by disambiguating neural representations of similar events." This suggests that similarity beyond just the target pair (i.e., relative to the broader knowledge space) may play an important role in shaping representational dynamics.

LLMs offer a tractable tool for testing representational-level hypotheses. Unlike biological systems, they provide full access to internal representations, explicit control over exposure, similarity, and repetition, and complete observability of prior knowledge. This enables precise computation of local and global similarity, systematic testing of learning dynamics, and exploration of how broader structure shapes outcomes. While we do not claim that LLMs are mechanistically equivalent to human brains, we view them as complementary model systems, ideal for generating and testing hypotheses that are difficult or impossible to probe directly in human subjects.
Q2 [Comparison with mature human brains]: We agree that LLM pretraining differs from human developmental learning. Our comparisons focus on mature human brains, aligning with the existing neuroscience literature on representational change. We will revise the paper to make our focus more clear, and note this in the limitations section.
Q3 [Factors influencing results]: We tested 6 LLMs with diverse architectures and pretraining data and observed consistent trends across them, suggesting that the representational changes we study reflect a broader pattern not tied to any one model or dataset.
Q4 [Evaluating untrained LLM]: While testing untrained models is a useful control in some contexts, our task requires models to be able to effectively learn paired associates. This is not possible with untrained LLMs. Still, we agree it would be valuable future work to study how these representational dynamics emerge during training, especially under a curriculum that mimics human developmental learning.
Q5 [Examples of token pairs]: Below we provide randomly selected examples per similarity group, based on optimization for llama2-7b. Similar examples for other models will also be included in the appendix.
Our study intentionally used token pairs selected for their pre-learning similarity, regardless of semantic meaning. This approach is similar to the use of synthetic stimuli in [39], which highlights the need to sample across the full similarity spectrum (especially the mid-similarity range) to test the NMPH effectively. Real-word tokens are unevenly distributed across this space, making precise control difficult. Our main goal in this paper is not to study meaning, but the structural dynamics of representational change in response to learning. We agree that using meaningful tokens is important for assessing how NMPH applies in real-world learning. To explore this, we sampled token pairs restricted to single-token WordNet words, reducing the usable vocabulary to ~20% in the two models tested. We found that real-word token similarity closely aligns with vocabulary interference, implying that most pairs face high competition during prediction. Based on this, we hypothesized – and confirmed – that the representational change curve would shift from a U-shape to a linearly decreasing pattern (similar to orange and blue curves in Fig 3b). We will include this analysis in the revised paper and clarify that while NMPH remains a useful framework, its predictions may depend on the similarity structure of real-world stimuli. Our findings suggest that global interference should be considered as a modulating factor in naturalistic learning settings. See reply to reviewer ctgT for more details and semantically-meaningful token pair examples.
Synthetic token pairs:
| Similarity Range | Pair 1 | Pair 2 | Pair 3 |
|---|---|---|---|
| 0.1–0.15 | (Liter, CLARE) | (artifactId, gew) | (emat, SOUR) |
| 0.15–0.2 | (Ste, UITableView) | (Pers, pmatrix) | (ries, pragma) |
| 0.2–0.25 | (it, Autres) | (bt, Autres) | (Bad, tf) |
| 0.25–0.3 | (VD, Autres) | (Vertical, ierte) | (coordinate, gesch) |
| 0.3–0.35 | (elf, ScrollView) | (Tr, named) | (Else, newcommand) |
| 0.35–0.4 | (DER, stackexchange) | (ific, ently) | (von, trightarrow) |
| 0.4–0.45 | (uk, ThreadPool) | (vez, ISBN) | (under, rov) |
| 0.45–0.5 | (illet, cially) | (icio, atr) | (ptop, Wikimedia) |
| 0.5–0.55 | (bootstrap, rach) | (utes, Vorlage) | (iveau, tersuch) |
| 0.55–0.6 | (mittel, umbn) | (Series, notify) | (Problem, emptyset) |
| 0.6–0.65 | (fte, zott) | (Length, TRUE) | (elve, PDF) |
| 0.65–0.7 | (nings, setAttribute) | (isen, issenschaft) | (ouv, schluss) |
| 0.7–0.75 | (ru, occup) | (result, utzt) | (aka, rola) |
| 0.75–0.8 | (cock, eland) | (hib, heast) | (prepare, Once) |
| 0.8–0.85 | (relation, emptyset) | (reen, bmatrix) | (uliar, ienn) |
| 0.85–0.9 | (Italie, urre) | (cement, cement) | (onna, onna) |
| 0.9–0.95 | (aped, aped) | (loster, loster) | (lict, lict) |
We did not use alternative prompt templates, opting for a minimal prompting setup to isolate learning effects and reduce confounds from prompt structure.
Q6 (intermediate layers): We agree this is an interesting point. We examined intermediate layers in an appendix figure, which we later found to have an error. Since NeurIPS does not support figures or PDFs during the rebuttal, we summarize the results below for two different experimental settings:
- Using token pairs optimized for similarity in earlier layers (similar to Fig 7a, but fixing a coding error), our corrected results show:
  - Intermediate and late layers show a monotonic pattern (e.g., similar to Fig 2b, green line), with significant differentiation for high-similarity pairs (>0.7 pre-learning similarity).
  - Intermediate layers show stronger differentiation (greater decreases in similarity) than late layers.
  - Early layers (1 & 2) behave more erratically, lacking a clear trend.
- Using token pairs optimized for last-layer similarity (same tokens as Fig 2b). This analysis is still running, but we will update the rebuttal when our results come in.
Other comments:
[Learning phases across LLMs]: All models exhibited encoding and consolidation phases; only two showed a forgetting phase. No models failed to show encoding or consolidation.
[Noise between tokens]: We intentionally avoided adding noise (e.g., distractors or varying context) to isolate learning-related representational changes between specific token pairs. This controlled setup minimizes confounds and aligns with our goal of probing structural learning dynamics rather than robustness to noise.
[Successful and failed LLMs]: If the reviewer refers to whether models exhibit forgetting, we address this in Appendix C.1, which summarizes model-specific behaviors.
We believe we have addressed the main points and welcome any follow-up questions.
I thank the authors for addressing all my queries in a detailed manner.
I am satisfied with the response that I have received from the authors, and I will increase my score based on that.
Finally, I encourage authors to show examples of token pairs as shown here and add a paragraph on their research foundation to better introduce readers to their work.
Dear reviewer,
Thank you for your feedback and for taking the time to engage with our work. We’re glad to hear that our responses addressed your questions satisfactorily.
We also thank you for the suggestion to include examples of token pairs and to expand on the research foundation. We agree that these additions will help better contextualize our contributions for readers, and we will incorporate them into the revised version of the paper.
Thank you once more for your careful review and for raising your score.
This paper uses a transformer language model as a stand-in for a human subject to discover how the perceived similarity between two items changes as they are seen together. In particular, it attempts to figure out which of the "Hebbian Learning," "Pattern Separation," and "Non-Monotonic Plasticity" hypotheses holds. The paper measures interference between two tokens in terms of the median pairwise similarity to other tokens, and change in similarity using the difference in cosine similarity before and after in-context learning. The results show that items perceived to be different become more similar after association, items that are somewhat similar grow further apart, and items already thought to be similar remain at the same level of similarity. The paper identifies this result with the prediction of the NMP hypothesis.
Strengths and Weaknesses
Strengths
I like the idea of using LMs to test hypotheses on human learning. In effect, this assumes that various artifacts of human learning are attributes more of the “learning” aspect, not the “human” aspect. Being more controllable and interpretable, LMs are good candidates. The method is adequately described. I was mostly able to follow the methodology of the paper without much difficulty.
Weaknesses
- I am confused by the given definition of vocabulary interference. I was unable to understand where the three hypotheses say anything about interference (I thought they only talk about similarity).
- The given definition of interference is confusing for several reasons:
  - It is not symmetric between x and y, both in its formulation and in the fact that the hidden representation of y is conditioned on x.
  - I am not convinced that the cosine similarity between two embeddings should be interpreted as the interference between the corresponding tokens.
  - Even if we ignore the above point, the average similarity between y and other tokens t in context would tell us how much y interferes with these other tokens on average, not how much it interferes with x.
- I don't think that the results necessarily support the NMP hypothesis. My reading of Figure 2 is that low-similarity pairs become more similar, whereas mid-to-high similarity pairs remain largely the same; the decreases for all except the 0.65-0.7 bracket seem rather small to me. This would then contrast with the NMPH, which instead would suggest (1) that mid-similarity pairs differentiate, and (2) that high-similarity pairs become even more similar. Basically, the curve does not look U-shaped as it should.
Questions
- Can you explain why (ref. Weakness 2) you take your definition of vocabulary interference to be what you have in your paper?
- Do you think that the “forgetting” phase is just an artifact of older/weaker models, and is not canonically a part of the learning process? Do you have any hypotheses as to why other models don’t show this behavior?
- Can you clarify what role interference plays in the original NMP hypothesis?
Limitations
I think it bears mention in the paper that Language Models may not necessarily be faithful stand-ins for humans in cognitive neuroscience. Although I still would find work with LMs useful in formulating and refining hypotheses, one eventually needs to actually study humans to make conclusions about humans. This point should be mentioned in the paper.
Final Justification
The authors have sufficiently clarified the role of interference in the NMPH, and that the change they observe is consistent with the changes that other works exploring the NMPH have observed. Therefore, I will raise my score to "3: Borderline reject." The reason for still erring on the rejection side is that I still feel that the results are not entirely convincing. Granted that other works have also not found high correlations for the NMPH predictions, but I believe that also speaks to the NMPH not being a highly predictive theory to begin with. On the other hand, I'd like to stress that I am not very knowledgeable in the area of cognitive neuroscience and that my expertise lies squarely with LLMs. Therefore, if any conflict of opinions were to arise, other reviewers' points should supersede mine.
Formatting Concerns
None
We thank the reviewer for the valuable feedback.
Summary of main points:
- Q1: we clarify that our vocabulary interference metric quantifies how densely a token is represented in embedding space. This measure is grounded in cognitive theories of memory interference. We find that higher vocabulary interference is associated with reduced pairwise similarity after learning, across all pre-learning similarity levels. We also find that NMPH-like effects fade under higher vocabulary interference, which may help account for some of the divergent patterns observed in neuroscience studies (differentiation, integration, non-monotonicity).
- Q2: while the NMPH provides a theoretical account of how representational change should vary with memory coactivation, it lacks quantitative predictions, making it difficult to define how pronounced a U-shape must be to support the hypothesis. Our findings nonetheless align with its core idea: we observe statistically significant differentiation for mid-similarity pairs across six models, particularly in the low vocabulary interference range. Although integration is weaker for high-similarity pairs, the overall pattern supports NMPH's central claims and is consistent with prior empirical work [39].
Q1 [Clarifying vocabulary interference]: We appreciate the reviewer's comments and the opportunity to clarify. In our paper, we considered the possibility that some of the divergent or controversial findings in the neuroscience literature could be due to a variable difficult to measure in human studies: the global associative structure between stimuli, which in our work we call vocabulary interference. This is intended to measure representational competition from the entire stimulus space, that is, how many other tokens in the vocabulary are highly similar to a given token x. This link between similarity and long-term memory interference is supported by decades of psychology and neuroscience literature [Reb1-Reb5, 9, 23], leading to an influential computational model [23] showing how the hippocampus can resolve interference by disambiguating similar representations. This perspective suggests that representational similarity beyond the immediate pair (i.e., within the model's broader representational space) could play a key role in modulating learning dynamics, and may help explain variability in empirical outcomes.

We operationalize this idea by computing the average cosine similarity between x and all other vocabulary tokens prior to learning. Regarding the conditioning: in our experimental design, we analyze the final pair in a stimulus stream such as x y, x y, ..., x y. The representations we use, h_x and h_y, are both contextualized within the same prompt. We compute the similarity between these two vectors to capture how much the model binds x to y after learning. The vocabulary interference measure is computed before learning, and is not used to define this similarity, but to assess whether regions with higher representational density tend to show weaker learning (i.e., smaller representational change). Importantly, we find that when we control for vocabulary interference, the post-learning similarity curves change systematically: higher vocabulary interference is associated with reduced pairwise similarity after learning, across all pre-learning similarity levels.

We will revise the text to clarify the motivation, definition, and limitations of this interference measure. If further concerns remain, we welcome continued discussion as we aim to make the paper as clear and rigorous as possible.
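As a minimal sketch of this operationalization (the use of the input embedding matrix and mean aggregation are our assumptions about details not fully specified here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").eval()

def vocabulary_interference(token: str) -> float:
    """Mean cosine similarity between a token's embedding and every other
    vocabulary embedding, computed before any in-context learning."""
    emb = model.get_input_embeddings().weight.detach()   # (vocab_size, dim)
    e = torch.nn.functional.normalize(emb, dim=-1)
    tid = tok.convert_tokens_to_ids(token)               # SentencePiece tokens may need a leading "▁"
    sims = e @ e[tid]                                    # cosine similarity to all tokens
    sims = torch.cat([sims[:tid], sims[tid + 1:]])       # drop self-similarity
    return sims.mean().item()
```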
Q2 [Validity of NMPH results]: The work that proposed the NMPH [28] provides a theoretical account of how representational change should vary with memory coactivation, but it does not specify quantitative predictions, making it difficult to define how pronounced a U-shape must be to support the hypothesis. For empirical context, [39] report a differentiation effect of approximately -0.1 (Pearson correlation) for mid-similarity pairs in human fMRI data. While this offers a useful benchmark, our analysis uses cosine similarity and is conducted on LLM representations; thus, direct comparisons of effect size are not possible. Nevertheless, the ~0.15 change we observe is robust and statistically significant across all 6 models (after multiple comparisons correction; see reply to reviewer ctgT). While we do not observe clear integration for high-similarity pairs, a factor that contributes to the U-shape being less pronounced, the differentiation in the mid-similarity range remains consistent with core predictions of the NMPH, and we believe this provides meaningful support for the hypothesis as illustrated in Fig. 2b.
Q3 [Forgetting phase]: While a detailed analysis of the forgetting phase is beyond the scope of this paper, we provide initial observations and hypotheses in lines (208-215) and Appendix C.1. Specifically, we note that only two models exhibit this behavior. In one case, we suspect it may be related to the model’s use of a sliding window attention (SWA) mechanism, which could limit its ability to integrate information over longer repetition streams. In the other, we hypothesize that high vocabulary interference may play a role in suppressing learned associations. While we do not claim that forgetting is a canonical part of the learning process across all LLMs, we believe these preliminary findings highlight interesting model-specific dynamics that could lead to further investigation in future work.
Limitations: We will add a statement in the limitations section acknowledging that LLMs are not direct stand-ins for humans, and that empirical validation in human studies remains essential.
References:
[Reb1] A. D. Baddeley, “The Influence of Acoustic and Semantic Similarity on Long-term Memory for Word Sequences,” Quarterly Journal of Experimental Psychology, vol. 18, no. 4, pp. 302–309, Nov. 1966.
[Reb2] M. C. Anderson and B. A. Spellman, “On the status of inhibitory mechanisms in cognition: Memory retrieval as a model case,” Psychological Review, vol. 102, no. 1, pp. 68–100, 1995.
[Reb3] M. C. Anderson, C. Green, and K. C. McCulloch, “Similarity and inhibition in long-term memory: Evidence for a two-factor theory,” Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 26, no. 5, pp. 1141–1159, 2000.
[Reb4] J. B. Caplan, M. Rehani, and J. C. Andrews, “Associations Compete Directly in Memory,” Quarterly Journal of Experimental Psychology, vol. 67, no. 5, pp. 955–978, May 2014.
[Reb5] Y. Zhao, A. J. H. Chanales, and B. A. Kuhl, “Adaptive Memory Distortions Are Predicted by Feature Representations in Parietal Cortex,” J. Neurosci., vol. 41, no. 13, pp. 3014–3024, Mar. 2021.
We hope we have addressed the reviewer's main concerns and would happily go into more detail if there are any remaining questions.
Thank you to the authors for their rebuttal. The clarification of the role of interference and the description of prior work helped to give me a clearer picture of the scenario. In light of these clarifications, I agree that the given results are interesting. I still have some reservations about the fact that the correlations observed are weak (it is true that prior work also observes weak correlations, but this also makes the takeaways of both the prior work and the current one a bit blurrier). Nevertheless, I find the rebuttal largely informative and convincing.
Therefore, I will raise my score.
Dear reviewer,
Thank you again for your engagement with our work and for taking the time to review our rebuttal.
We’re glad to hear that the clarifications helped provide a clearer picture and that you found the results interesting. We also appreciate your point about the correlations, and we will be careful to reflect this in the final version to ensure the takeaways are appropriately framed.
Thank you again for your constructive feedback and for raising your score.
The paper studies how Large Language Models (LLMs) learn associations between word pairs through repeated exposure in their input context. The authors test if LLMs show the same learning patterns as human brains, specifically the Non-Monotonic Plasticity Hypothesis (NMPH). They find that when token pairs are shown many times, moderately similar pairs become less similar while very different or very similar pairs stay the same. They test this on 6 different LLMs and introduce a new factor called "vocabulary interference" that affects learning.
Strengths and Weaknesses
Strengths: The authors present a creative and novel approach by using large language models as experimental subjects to examine theories about human memory. This is an innovative bridge between cognitive neuroscience and AI research that opens new ways to study learning at scale. The authors demonstrate good scientific rigor by testing their hypothesis across six different models, including various versions of Llama, Mistral, and Gemma, which strengthens the generalizability of their findings. One particularly valuable contribution is the introduction of "vocabulary interference", a measure of how many similar tokens compete with the target word. This concept is difficult to measure in human studies but easy to calculate in LLMs, providing new insights into how global similarity affects learning. The technical execution is also solid, with the authors developing a gradient-based search method to find token pairs with precisely controlled similarity levels, ensuring systematic coverage across the similarity range rather than relying on random sampling.

Weaknesses: The paper has a critical concern in Appendix B that undermines its entire premise. When the authors test their hypothesis using representations from earlier layers of the models, the U-shaped learning pattern completely disappears. This means their main finding only exists in the final layer of the models, not throughout the learning process, yet they never properly explain why this happens or what it means for their theory.

The statistical analysis also has serious issues: they perform different tests across similarity groups and learning phases but never correct for multiple comparisons, which means some of their "significant" results could just be random chance. Moreover, the authors only use 12 token pairs in each similarity group, which is a very small sample size for making broad claims about learning patterns.

Another major concern is that they select token pairs based purely on mathematical similarity in the model's representation space, not on meaningful semantic relationships. While the paper doesn't give specific examples of the tokens used, they admit to using an artificial selection process that may not reflect how humans actually learn associations between words.

Perhaps most fundamentally, the paper tries to draw parallels between how LLMs process information through attention mechanisms and how the human brain forms memories through synaptic changes. These are completely different processes: one is a mathematical operation that happens instantly without changing the model, while the other involves physical changes in brain connections over time. The authors don't adequately address why this comparison is valid, which calls into question the entire premise of using LLMs as "model organisms" for human memory.
Questions
- Why does the U-shaped pattern only appear in the final layer when you test representations from earlier layers? This seems to contradict your claim that LLMs demonstrate general associative learning principles similar to the human brain. If this is truly a fundamental learning mechanism, please provide a mechanistic explanation for this layer-specific effect and discuss whether this limits your biological parallels.
- Can you validate your findings using semantically meaningful word pairs from established human association databases? Your current approach uses tokens selected purely for their geometric properties in embedding space, which may not reflect natural language associations.
- What do the attention patterns look like during the learning process? Since LLMs implement associative learning through attention mechanisms rather than weight changes, analyzing how attention weights shift across repetitions could show the actual mechanism behind your observed representational changes. Do high-interference tokens show more distributed attention patterns?
- Could you please standardize the y-axis scales across all panels in Figure 7? The current presentation with different scales makes it very difficult to assess whether the effects in earlier layers are genuinely absent or just smaller in magnitude.
- How do you distinguish your "forgetting phase" from catastrophic interference effects in neural networks? The pattern you observe, where only two models show forgetting at very different repetition counts, seems more consistent with capacity limitations or positional encoding artifacts than a genuine memory consolidation process.
- Additionally, could you provide the actual token examples used in your experiments? The paper mentions using gradient-optimized tokens but never shows what these look like, making it impossible to assess their linguistic validity.
Limitations
The authors acknowledge some limitations like focusing only on the final layer and using artificial token pairs, but they fail to address the most critical limitation revealed by their own data. The fact that the non-monotonic plasticity pattern completely disappears when examining earlier layers fundamentally undermines their comparison to biological memory systems. The authors also fail to discuss whether their experimental setup reflects real learning situations. In their experiments, they repeat the same two tokens together hundreds or thousands of times in a row, which is nothing like how people actually learn word associations in everyday life. When humans learn that two concepts are related, it happens through varied contexts, conversations, and experiences over time, not through endless mechanical repetition. This extremely artificial learning environment might create patterns that tell us nothing about how human memory actually works in natural settings.
Final Justification
I increased the quality score and the rating score after reviewing the author's responses.
Formatting Concerns
None observed.
We thank the reviewer for the valuable feedback.
Summary of main points:
- Q1: motivated the parallels between LLM and human associative learning by drawing on a line of prior works showing a functional equivalence between ICL and SGD.
- Q2: applied Benjamini-Yekutieli corrections across similarity groups and learning phases. Differentiation in mid-similarity range remains significant.
- Q4: showed that WordNet-restricted tokens exhibit monotonic representational change, consistent with high global interference. This strengthens the point that NMPH predictions depend on global similarity structure.
- Q6: fixed Fig 7a analysis; now shows monotonic differentiation across intermediate and late layers. Added new analysis using last-layer-optimized pairs (in progress).
Q1 [Parallels between human and LLM associative learning]: Thank you for raising this fundamental point. While it may appear unintuitive on the surface, we were motivated to investigate the possible parallels due to some recent successes in interpretability work on ICL. Several papers demonstrated that ICL performs updates similar to gradient descent [38, Reb2, Reb3, Reb4], suggesting that it may be closer to 'synaptic weight updates' than previously expected. Another paper drew a deep parallel between the mechanisms of induction heads in ICL and a well-validated model of human episodic memory (i.e., fast learning over a stream of episodes) [19].

We also do not think the field widely agrees that attention mechanisms are "completely different processes" than human learning for long-term information storage. Some work has drawn an analogy between attention mechanisms and the older concept of 'fast weights' in machine learning [Reb1, Reb2]. That is, one can view ICL as using attention layers to establish temporary weights that allow the LLM to flexibly learn patterns on a fast, short-term basis. While ICL does not directly modify the synaptic weights of the LLM, we do not think this invalidates the comparison with human learning. Rather, we believe the conceptual analogies of 'synapse' and 'neuron' break down when comparing very short-term learning between humans and LLMs.
Q2 [Statistical tests]: Thank you for raising this point. In our revised analysis, we apply Benjamini-Yekutieli correction to account for multiple comparisons across the 17 similarity groups and 3 learning phases. The findings remain significant, showing differentiation in the mid-similarity range (0.55-0.75). This is consistent with the original nonmonotonic pattern.
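For reference, a minimal sketch of such a correction using statsmodels; the p-values below are placeholders standing in for the 17 x 3 per-group, per-phase tests:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
p_values = rng.uniform(1e-4, 0.1, size=17 * 3)  # placeholder: one test per (similarity group, phase)

# Benjamini-Yekutieli controls the false discovery rate under arbitrary dependence.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")
```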
Q3 [Small sample size]: We extended the number of pairmates for 2 models (due to time constraints), sampling 100 pairs per similarity group, and found similar non-monotonic trends. We will add this to the revised paper.
Q4 [Semantically meaningful tokens]: Our study intentionally used token pairs selected for their pre-learning similarity, regardless of semantic meaning. This mirrors the use of synthetic stimuli in [39], which highlights the need to sample across the full similarity spectrum to test the NMPH effectively. Real-word tokens are unevenly distributed across this space, making precise control difficult. Our focus is on the structural dynamics of representational change during learning, not semantic meaning. That said, we agree that using meaningful tokens is important to assess whether NMPH applies in real-world contexts. To this end, we sampled new token pairs restricted to single-token words from WordNet, reducing the vocabulary to ~20% of the original token set in the two models tested so far. We first examined how pairwise similarity and vocabulary interference were distributed in this constrained space, anticipating that the limited vocabulary might bias the pairs toward narrower interference ranges. Indeed, the similarity distribution of real-word token pairs aligned more closely with the vocabulary-interference distribution. As outlined in Sec. 3, this suggests that most of these pairs face high competition during prediction. Based on this, we hypothesized that representational change would shift from a U-shaped to a linearly decreasing pattern, as in the orange and blue curves of Fig 3b. Our results confirmed this: post-learning similarity decreased monotonically with pre-learning similarity, supporting the idea that highly similar pairs are influenced by vocabulary interference. We'll include this new analysis in the revised paper and update our claims. Specifically, we'll emphasize that NMPH is observed only under specific global interference conditions. Our findings suggest that to fully distinguish between competing hypotheses, global similarity structures, like vocabulary interference, must be considered as modulating factors.
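A minimal sketch of this vocabulary restriction, assuming NLTK's WordNet and a Hugging Face tokenizer (the model name is a placeholder, and our actual filtering rules may differ slightly):

```python
from nltk.corpus import wordnet as wn          # requires nltk.download("wordnet")
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")  # placeholder

# All alphabetic WordNet lemma names (note: includes abbreviations and acronyms)
lemmas = {name for name in wn.all_lemma_names() if name.isalpha()}

# Keep only words the tokenizer encodes as exactly one token
single_token_words = [w for w in lemmas
                      if len(tokenizer.encode(w, add_special_tokens=False)) == 1]

print(f"{len(single_token_words)} single-token WordNet words "
      f"(~{len(single_token_words) / tokenizer.vocab_size:.0%} of the vocabulary)")
```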
Q5 [Examples of token pairs]: Here are randomly selected examples optimized for llama-7b. Similar examples for other models will be included in the appendix. Note: WordNet includes abbreviations and acronyms.
Synthetic token pairs:
| Similarity Range | Pair 1 | Pair 2 | Pair 3 |
|---|---|---|---|
| 0.1–0.15 | (Liter, CLARE) | (artifactId, gew) | (emat, SOUR) |
| 0.15–0.2 | (Ste, UITableView) | (Pers, pmatrix) | (ries, pragma) |
| 0.2–0.25 | (it, Autres) | (bt, Autres) | (Bad, tf) |
| 0.25–0.3 | (VD, Autres) | (Vertical, ierte) | (coordinate, gesch) |
| 0.3–0.35 | (elf, ScrollView) | (Tr, named) | (Else, newcommand) |
| 0.35–0.4 | (DER, stackexchange) | (ific, ently) | (von, trightarrow) |
| 0.4–0.45 | (uk, ThreadPool) | (vez, ISBN) | (under, rov) |
| 0.45–0.5 | (illet, cially) | (icio, atr) | (ptop, Wikimedia) |
| 0.5–0.55 | (bootstrap, rach) | (utes, Vorlage) | (iveau, tersuch) |
| 0.55–0.6 | (mittel, umbn) | (Series, notify) | (Problem, emptyset) |
| 0.6–0.65 | (fte, zott) | (Length, TRUE) | (elve, PDF) |
| 0.65–0.7 | (nings, setAttribute) | (isen, issenschaft) | (ouv, schluss) |
| 0.7–0.75 | (ru, occup) | (result, utzt) | (aka, rola) |
| 0.75–0.8 | (cock, eland) | (hib, heast) | (prepare, Once) |
| 0.8–0.85 | (relation, emptyset) | (reen, bmatrix) | (uliar, ienn) |
| 0.85–0.9 | (Italie, urre) | (cement, cement) | (onna, onna) |
| 0.9–0.95 | (aped, aped) | (loster, loster) | (lict, lict) |
Semantic token pairs:
| Similarity Range | Pair 1 | Pair 2 | Pair 3 |
|---|---|---|---|
| 0.1–0.15 | (mix, loaded) | (defined, standard) | (pub, any) |
| 0.15–0.2 | (identifier, astern) | (layout, eclipse) | (defined, slash) |
| 0.2–0.25 | (online, pop) | (suite, abb) | (pub, format) |
| 0.25–0.3 | (absolute, pus) | (round, pa) | (annotation, hum) |
| 0.3–0.35 | (series, math) | (black, roc) | (gas, bat) |
| 0.35–0.4 | (spec, cock) | (information, leg) | (argument, lear) |
| 0.4–0.45 | (contra, architecture) | (dictionary, ike) | (rooms, ho) |
| 0.45–0.5 | (gen, nil) | (factory, acre) | (shadow, nih) |
| 0.5–0.55 | (gen, raise) | (time, eb) | (zero, iga) |
| 0.55–0.6 | (dale, person) | (dawn, esp) | (irs, ante) |
| 0.6–0.65 | (any, essen) | (final, mission) | (gi, dim) |
| 0.65–0.7 | (cap, bind) | (mus, skim) | (dd, safe) |
| 0.7–0.75 | (bye, anas) | (izar, through) | (lined, click) |
| 0.75–0.8 | (replace, stock) | (unction, week) | (execution, frame) |
| 0.8–0.85 | (geometry, list) | (locale, embed) | (partition, brand) |
| 0.85–0.9 | (opacity, fragment) | (render, inflate) | (analysis, section) |
| 0.9–0.95 | (gable, board) | (volution, ship) | (slider, simple) |
Q6 [Intermediate layer analysis]: We appreciate you raising this concern. We will revise Fig. 7 to fix an analysis error and to add a new analysis. Since NeurIPS does not support figures or PDFs during the rebuttal, we summarize changes below:
- Using token pairs optimized for similarity in earlier layers, similar to Fig 7a. Here we discovered an error in the code that indexed the wrong layers for averaging. When fixed, we saw:
- Intermediate and late layers show a monotonic pattern (e.g. similar to Fig 2b, green line), with significant differentiation for high-similarity pairs (>0.7 pre-learning similarity).
- Intermediate layers show stronger differentiation (greater decreases in similarity) than late layers.
- Early layers (1 & 2) behave more erratically, lacking a clear trend.
- New analysis using the same token pairs as Fig 2b (optimized for last layer similarity), replacing Fig 7b. This is still running, but we will update the rebuttal when our results come in. We note that in our setup, the learning process happens across repetitions, not across layers, so representational dynamics may change along the layer depth.
Q7 [Forgetting phase vs catastrophic interference]: While the "forgetting phase" in ICL might look like catastrophic interference, it is not, since there are no weight updates. Instead, we attribute the accuracy drop to model-specific and reversible effects such as sliding-window attention or vocabulary interference.
Q8 [Attention pattern]: While this is an important direction, we excluded it due to time constraints and our focus on representational dynamics. We see it as a valuable direction for future work.
[Reb1] J. Ba, G. E. Hinton, V. Mnih, J. Z. Leibo, and C. Ionescu, “Using Fast Weights to Attend to the Recent Past,” in Advances in Neural Information Processing Systems 29, 2016, pp. 4331–4339.
[Reb2] I. Schlag, K. Irie, and J. Schmidhuber, “Linear Transformers Are Secretly Fast Weight Programmers,” in Proceedings of the 38th International Conference on Machine Learning, PMLR, Jul. 2021, pp. 9355–9366.
[Reb3] B. Dherin, M. Munn, H. Mazzawi, M. Wunder, and J. Gonzalvo, "Learning without Training: The Implicit Dynamics of In-Context Learning," arXiv preprint arXiv:2507.16003, 2025.
[Reb4] D. Dai, Y. Sun, L. Dong, Y. Hao, S. Ma, Z. Sui, and F. Wei, "Why Can GPT Learn In-Context? Language Models Implicitly Perform Gradient Descent as Meta-Optimizers," arXiv preprint arXiv:2212.10559, 2022.
We hope we have addressed the reviewer's main concerns, and we would happily go into more detail if there are any remaining questions.
Dear Reviewer,
We would like to follow up regarding Q6, the analysis of intermediate layers.
As described in our main paper, we optimized token pairs to achieve specific pre-learning similarities at particular layers. In the main analyses, inspired by the methodology in [39], we focused on the last layer to directly control representations influencing model behavior on the ICL associative learning task.
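To illustrate what optimizing pairs for a given layer involves in practice, here is a brute-force sampling sketch; this is not the modified greedy coordinate gradient procedure from the paper, and the model name, candidate count, and target bin are placeholders:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "huggyllama/llama-7b"                        # placeholder
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True).eval()

layer = -1                                    # last layer, as in the main analyses
cand = torch.randperm(tok.vocab_size)[:500]   # small random candidate set (expensive!)

with torch.no_grad():
    # Hidden state of each token presented alone, read out at the target layer
    reps = torch.stack([model(input_ids=t.view(1, 1)).hidden_states[layer][0, 0]
                        for t in cand])

# Pairwise cosine similarities among all candidates
sims = F.cosine_similarity(reps.unsqueeze(1), reps.unsqueeze(0), dim=-1)

# Harvest candidate pairs whose pre-learning similarity falls in a target bin
lo, hi = 0.55, 0.60
upper = torch.triu(torch.ones_like(sims, dtype=torch.bool), diagonal=1)
pairs = [(int(cand[i]), int(cand[j]))
         for i, j in zip(*torch.where(upper & (sims > lo) & (sims < hi)))]
```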
In the paper and during the rebuttal (see Q6), we described analyses where pair similarity was optimized and assessed at each respective layer (i.e., token pairs were found for a specific layer and analyzed post-learning in that same layer) to determine intermediate-layer representation-change trends. For those intermediate layers, we found monotonic trends, with pronounced differentiation for pairs with high pre-learning similarity (>0.7) in mid and late layers.
In a follow-up experiment, we analyzed representation changes for pairs that were optimized for the last layer (the same token pairs as in Fig. 2), but assessed across different earlier layers. The goal here was to track how the non-monotonic similarity pattern emerges throughout the model as these pairs are processed in earlier layers.
During the consolidation phase:
- Early to mid layers (up to half the model depth) showed relatively flat or mildly decreasing trends across pre-learning similarity. All observed changes were increases in pair similarity (i.e., all y-values were >0), indicating representational integration.
- Mid-late layers (the second half of the model) began exhibiting a monotonic decrease in similarity.
- Late layers (the layers just before the analyzed layer in Fig 2b) revealed the emergence of a non-monotonic, U-shaped trend. Although the U-shape is visually apparent, the minimum of the pattern did not show significant differentiation.
In summary, representations first integrate across the full range of pre-learning pair similarity. As pairs are progressively processed by deeper layers, this turns into a monotonic decreasing trend and, finally, into the U-shape observed in Fig 2b.
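For clarity, a minimal sketch of this layer-wise read-out; the prompt format and read-out positions are illustrative, and `model`, `tok_a`, `tok_b` are assumed from a setup like the one sketched above (with `output_hidden_states=True`):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def pair_similarity_per_layer(model, tok_a, tok_b, n_reps):
    """Cosine similarity between the pairmates' hidden states at every layer,
    after presenting the pair n_reps times in context (illustrative format)."""
    ids = torch.tensor([[tok_a, tok_b] * n_reps])
    hidden = model(input_ids=ids).hidden_states        # one tensor per layer
    # Read out the final occurrence of each pairmate in the context
    return [F.cosine_similarity(h[0, -2], h[0, -1], dim=0).item() for h in hidden]

# Representational change per layer: post-learning minus pre-learning similarity
pre   = pair_similarity_per_layer(model, tok_a, tok_b, n_reps=1)
post  = pair_similarity_per_layer(model, tok_a, tok_b, n_reps=200)
delta = [b - a for a, b in zip(pre, post)]   # >0: integration, <0: differentiation
```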
To further explore this observation, we investigated how vocabulary interference (global interference) affects the U-shaped pattern across these different layers. We observe a general trend suggesting that deeper layers operate under a higher vocabulary-interference regime. As a consequence, the last layer experiences greater global competition among possible token predictions, which contributes to its greater tendency to differentiate pairs with mid-range pre-learning similarity. This result aligns with observations in Fig. 3 and supports our claim that global interference modulates this non-monotonic phenomenon.
We hope these clarifications address your concerns. Please let us know if additional details or explanations are needed.
Thank you for your thorough rebuttal and the additional experiments you conducted. I appreciate the statistical corrections and the new WordNet analysis, which clearly strengthen the technical aspects of your work. However, while I have raised my score to 4, I think the core issue remains: the U-shaped pattern only appears in the final layer, not throughout the model, which makes the comparison to human brain memory processes questionable. While your paper makes interesting technical contributions about how LLMs process token associations, it doesn't convincingly show that LLMs actually work like human memory.
Dear reviewer,
Thank you for your feedback and for taking the time to review our rebuttal. We appreciate your acknowledgment of the statistical corrections and the additional WordNet analysis, and we’re glad to know that you found these contributions strengthened the technical aspects of the work.
We understand your concern that the U-shaped pattern appears most significantly in the final layer. However, we note that this aligns with a sensory-information processing hierarchy: the hippocampus, which is implicated in memory-related representational change and is the region where a U-shaped pattern has been observed [39], sits at the top of the processing stream for memory [Reb5, Reb6].
Importantly, our goal is not to claim that LLMs replicate human memory mechanisms. Rather, as outlined in Fig. 1, our focus is on testing specific hypotheses drawn from cognitive neuroscience within the LLM framework. LLMs offer scalability and experimental control that are often difficult to achieve in biological systems, making them valuable tools for investigating representational dynamics in a controlled setting. The emergence of the U-shaped pattern, particularly in later layers, suggests that LLMs can exhibit memory-related behaviors that are functionally similar to those observed in the brain.
This does not imply a mechanistic equivalence, but it highlights the potential of using LLMs as model organisms for exploring memory-related hypotheses. As noted in Q1, our work is motivated by growing evidence of functional parallels between LLM mechanisms and aspects of human memory. Just as mice and humans differ in many respects yet share core principles that enable generalization, LLMs can serve as tractable testbeds for studying representational change in complex systems. While these analogies are not one-to-one, they provide a compelling foundation for studying memory-like processes in LLMs through the lens of cognitive neuroscience.
We will clarify this framing more explicitly in the revised manuscript to make the intended analogy more precise and grounded in prior work.
Thank you again for your constructive comments and for raising your score.
[Reb5] R. C. O'Reilly, Y. Munakata, M. J. Frank, and T. E. Hazy, Computational Cognitive Neuroscience, 5th ed., 2020. https://compcogneuro.org/book
[Reb6] P. Lavenex and D. G. Amaral, "Hippocampal-neocortical interaction: A hierarchy of associativity," Hippocampus, vol. 10, no. 4, pp. 420–430, 2000.
Dear Reviewer ctgT,
Thank you for providing a thoughtful review. This is a reminder that the author-reviewer discussion period is about to close.
Did the authors' reply address any of your initial concerns, and are there any additional points the authors could clarify?
Dear reviewers & AC,
Thank you for taking the time to review our work and provide valuable feedback!
As we approach the end of the discussion period, we'd like to summarize the main points raised and the improvements that will be incorporated into our revised manuscript:
- Statistical analyses & significance of NMPH (ctgT, 5X6h, nm9c): The reviewers commented on the absence of quantitative predictions in the original NMPH paper [28] and requested clarification regarding statistical significance. Specifically, [28] provides a theoretical account of non-monotonic representational change but lacks quantitative predictions. For context, [39] reports a ~-0.1 differentiation effect in human fMRI data, measured with Pearson correlation. While we use cosine similarity and LLM representations (see the measure definitions after this list; the magnitudes are therefore not directly comparable), we observe a robust ~-0.15 effect across six models. To strengthen our claim that the results are robust and significant, we applied false discovery rate correction across 17 similarity groups and 3 learning phases. After correction, mid-similarity token pairs in the 0.55-0.75 range continued to show significant differentiation, consistent with the theoretical predictions of NMPH and with human empirical findings. We will update the main paper to report the corrected statistics.
- Motivation and interpretation of our work (ctgT, 5X6h, Sfgb): We strengthened our research motivation by clearly articulating methodological challenges in human studies of NMPH, such as difficulty in controlling pre-learning similarity, practical constraints (cost, limited trials, participant fatigue), and accurately estimating global interference (e.g., lifetime word exposure). Although there is empirical support for all three hypotheses we discuss [6, 9, 10, 28, 30, 32], findings in humans are inconsistent, and it remains unclear what drives this variability. We provide new evidence that global interference is a modulating factor, consistent with hippocampal theories of disambiguation under competition [9]. Inspired by recent work drawing parallels between LLM mechanisms (e.g., induction heads, fast weights) and synaptic learning or episodic memory [19, 38, Reb2-4], we emphasize how LLMs offer precise control and observability, providing a powerful and systematic framework for testing representational change hypotheses when human experimentation is limited. These changes and clarifications will be integrated into the revised manuscript.
- Examples of token pairs (ctgT, Sfgb, nm9c): As requested, we provided randomly selected token pair examples from each similarity group. We will also include analogous examples from additional models in the appendix for comprehensive reference.
- Semantically meaningful tokens (ctgT, Sfgb, nm9c): Originally, our token pairs were selected solely based on pre-learning similarity, regardless of semantic meaning, similar to the synthetic stimuli in [39], to allow full-spectrum sampling. To address questions about real-world relevance, we conducted a new analysis using semantically meaningful real-word tokens from WordNet (which restricts coverage to roughly 20% of the vocabulary). This revealed significant vocabulary-level interference, shifting the pattern of representational change from a U-shape to a linearly decreasing trend. We will include this analysis and emphasize global interference as a critical factor modulating representational dynamics in this naturalistic setting.
- Intermediate layers analysis (ctgT, Sfgb): We corrected an indexing error in Appendix Fig. 7 and added new analyses explaining how the non-monotonic pattern emerges across layers. Early layers show integration, mid-to-late layers show monotonic differentiation, and only the second half of the layers exhibit a U-shaped trend. We also found that deeper layers face higher global interference, supporting our claim that interference drives the emergence of non-monotonicity (Fig. 3). These findings will be included in the revision.
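As referenced in the first bullet above, the human effect in [39] uses Pearson correlation while ours uses cosine similarity. For reference, the standard definitions (our notation): Pearson correlation is cosine similarity after mean-centering each vector, which is why the two effect sizes are related in spirit but not directly comparable in magnitude.

```latex
\cos(x, y) = \frac{x^\top y}{\lVert x \rVert \, \lVert y \rVert},
\qquad
r(x, y) = \frac{(x - \bar{x}\mathbf{1})^\top (y - \bar{y}\mathbf{1})}
               {\lVert x - \bar{x}\mathbf{1} \rVert \, \lVert y - \bar{y}\mathbf{1} \rVert}
```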
We again thank the reviewers for their feedback and hope these clarifications strengthen and improve the quality of our paper.
This paper presents a novel and creative approach for using LLMs as model organisms to investigate the Non-Monotonic Plasticity Hypothesis (NMPH) from cognitive neuroscience. The work is interdisciplinary, and all four reviewers consider the approach novel and original. The paper introduces a few new analysis approaches (e.g. gradient-based), and the reviewers generally highlight the technically solid execution of the described experiments. A few concerns about the statistical analysis were raised during the review and discussion period, and additional analyses were discussed. The authors promised that the camera-ready revision would include some improvements (listed below), including an updated statistical analysis. Overall, the reviewers agree that the paper presents a valuable contribution, that the manuscript is well written, and that the study is well executed with no major concerns. I think we can thus recommend this paper for acceptance.