A Model of Place Field Reorganization During Reward Maximization
Optimizing basis function parameters using the Temporal Difference error recapitulates several neural phenomena, and improves the speed and flexibility of policy learning
Abstract
Reviews and Discussion
This paper develops a reinforcement learning (RL) model to explain how hippocampal place fields reorganize during reward-based navigation. In the model, Gaussian radial basis functions (place fields) receive continuous spatial inputs, and feed into an actor-critic framework that learns to navigate in 1D and 2D environments. At each time step, the temporal difference (TD) error modulates both the actor-critic synapses and the place field parameters (amplitude, center, and width). Through online updates, the model captures three key experimental observations about place fields: (1) increased density at the reward location, (2) backward elongation against the movement direction, and (3) continuous drift of individual place fields even when behavior stabilizes. The authors show that place field reorganization under TD error significantly improves policy convergence by providing more discriminative spatial representations. Perturbative analysis clarifies why fields near high-value locations experience stronger shifts. Additionally, introducing noise to place field parameter updates leads to representational drift without disrupting navigational performance, and in fact aids adaptation to new reward targets.
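For orientation, here is a minimal sketch of one learning step in the spirit of the model described above. The variable names, learning rates, and exact gradient expressions are illustrative assumptions rather than the authors' implementation, and the actor update is omitted for brevity: each place field is a Gaussian bump over position, the critic's value estimate is a weighted sum of field activities, and the TD error scales the updates of both the readout weights and the field parameters.

```python
import numpy as np

# Illustrative hyperparameters (assumed, not the paper's exact values)
n_fields, lr_w, lr_field, gamma, noise_sd = 50, 0.05, 0.01, 0.95, 1e-3
centers = np.linspace(0.0, 1.0, n_fields)   # place field centers
widths = np.full(n_fields, 0.05)            # place field widths
amps = np.ones(n_fields)                    # place field amplitudes
w_critic = np.zeros(n_fields)               # critic readout weights

def place_activity(x):
    """Gaussian radial basis (place field) activity at 1D position x."""
    return amps * np.exp(-((x - centers) ** 2) / (2.0 * widths ** 2))

def noisy_td_step(x, x_next, reward):
    """One TD update of the critic weights and the place field parameters."""
    phi, phi_next = place_activity(x), place_activity(x_next)
    delta = reward + gamma * w_critic @ phi_next - w_critic @ phi  # TD error

    # Standard semi-gradient TD(0) update of the value readout
    w_critic[:] += lr_w * delta * phi

    # Field parameters follow the TD-error-scaled gradient of the value
    # estimate, plus Gaussian noise (the "Noise+TD" idea); note no
    # constraints are enforced here, unlike a careful implementation.
    grad_amp = w_critic * phi / amps
    grad_center = w_critic * phi * (x - centers) / widths ** 2
    grad_width = w_critic * phi * (x - centers) ** 2 / widths ** 3
    for param, grad in ((amps, grad_amp), (centers, grad_center), (widths, grad_width)):
        param += lr_field * delta * grad + noise_sd * np.random.randn(n_fields)
    return delta
```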
Questions for Authors
- Does your model support any type of remapping when the reward location is changed?
- Could you elaborate more on the comparison between your RM model and other RL models? In particular, why is this specific algorithm able to explain all three phenomena while the others cannot?
Claims and Evidence
The authors' major claim is that their model can replicate three different phenomena observed in place cells, which has been demonstrated by their simulations.
Methods and Evaluation Criteria
yes
Theoretical Claims
I specifically looked at the perturbative expansions in the paper's Appendix and found no apparent errors.
Experimental Design and Analysis
yes.
Supplementary Material
Yes. I checked the perturbation part in the supplementary information.
Relation to Broader Scientific Literature
The authors situate their model in the intersection of hippocampal research and RL, demonstrating a reward-based alternative to SR or purely mechanistic place field accounts, while also addressing representational drift seen in modern neural data.
Missing Important References
no.
Other Strengths and Weaknesses
Strengths
- Unified three phenomena (clustering, elongation, and drift) in one model.
- Provided a computational goal (gaining reward) for the hippocampus to form spatial representations.
Weaknesses
- The learning of the proposed model relies heavily on reward, which is not true for humans and animals.
- The tasks may be too simple.
- Directly applying an RL algorithm to place cells via backpropagation may be biologically implausible.
Other Comments or Suggestions
no
We thank the reviewer for their comments. We hope our response clarifies most of the concerns.
proposed model will highly rely on reward.
While the current model proposes a reward dependent objective, we have also proposed a non-reward-dependent objective (Metric Representation) which recapitulates place field elongation against the trajectory, but not a high density at the target (Fig. 2C). Optimizing place fields either using the MR objective as a standalone or as an auxiliary improves the policy convergence rate (Fig. S15, red and brown) compared to fixed representations (blue), consistent with Fang & Stachenfeld 2024.
Tasks can be too simple.
We kindly refer the reviewer to our response to Reviewer 1 (Qrmv) that “the environments are also very simple”.
biologically implausible.
The question we were pursuing was whether there is a single, simple normative model that can recapitulate several learning-induced changes in place field representations. To do this, we felt some level of biological unrealism was permissible. Nevertheless, the reviewer is correct. We raised this biological implausibility as a limitation and have discussed avenues to make it plausible (e.g. random feedback with local learning rules). We are currently working on a biologically plausible representation learning model and preliminary results are promising.
1. Remapping when reward has been changed?
Yes, when the target changes, some place cells that were coding for the initial reward location shift to the new reward location (Fig. S3), replicating the remapping phenomenon (Gauthier et al. 2018). Additionally, place fields that were not initially coding for reward but were in the vicinity of the new reward location were recruited to code for the new target. Hence, the model does support partial remapping when the target changes. However, (1) we do not see the same proportion of reward place cells coding for the new target, and (2) reward-coding place fields shift gradually to the new target location, whereas we might expect place fields to rapidly shift or jump to the new target location (Krishnan et al. 2022). The timescales for reward-based remapping need additional experimental data for verification.
2. Comparison between your RM model and other RL models?
We will add a comparison to other RL models to the discussion, namely in three aspects:
Reward maximization (RM) algorithms (TD vs Q): Our RM model maximizes rewards by optimizing the place field representations for policy and value estimation. Other RL algorithms (e.g. Q-learning or SARSA) likewise seek to maximize cumulative discounted rewards. Due to the reward dependency in these RL objectives, we expect other reward-maximizing RL algorithms to learn representations similar to those of our actor-critic based model.
Architecture (GBF vs MLP): Deep RL models that use MLPs for representation learning have also shown high density at the reward location and a backward shift of features (Fang & Stachenfeld 2024 ICLR). Adding Gaussian noise to deep network parameters also elicits a form of representational drift (Aitken et al. 2022). Although we constrained place field tuning curves to Gaussian distributions to analytically study representation learning, we believe insights from these analyses can translate to deep RL models.
Learning objectives (TD vs MR vs SR): The SR model learns transition probabilities based on the agent's policy. That is, when the policy changes, the transition probabilities change, and the SR fields subsequently change. Hence, SR fields do not influence policy learning, making this model inadequate for studying representation learning for policy learning. If the agent has a reward-maximizing policy, then the SR model recapitulates high field density at the reward location and field elongation against the trajectory (Fig. 2). Hence, the SR model seems to require two disparate components, making it less parsimonious. Furthermore, the SR model is not a good candidate to study representational drift, as SR fields have to be anchored to fixed representations (Eq. 72). That is, we should not expect SR field centroids to drift to a new location as observed in neural data (Fig. 3D Ziv et al. 2013; Fig. 5H Qin et al. 2023). Conversely, the MR agent uses a reward-independent objective (Foster et al. 2000) and learns to self-localize by path integration. Given its non-dependency on rewards, place fields do not reorganize to show a high field density at targets (Fig. 2C), though the objective does cause fields to increase in size and shift backwards against the trajectory (Fig. 2A,B). We did not study representational drift using the MR objective in this paper, though this could be a future direction.
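For contrast, a standard feature-based SR update (in the spirit of Stachenfeld et al. 2017; the notation here is ours and the exact anchoring in Eq. 72 may differ) could be sketched as follows. The successor matrix is driven by the transitions the current policy produces and only reweights the fixed basis, which is why SR fields follow the policy rather than shaping it and cannot drift away from their anchors.

```python
import numpy as np

n_fields, lr_sr, gamma_sr = 50, 0.1, 0.95
M = np.eye(n_fields)  # successor matrix over the fixed place field basis

def sr_step(phi, phi_next):
    """TD update of the successor representation for fixed features phi.

    The SR field of unit i is row i of M read out through the fixed basis,
    so SR fields can only re-weight the anchored place fields.
    """
    delta_sr = phi + gamma_sr * M @ phi_next - M @ phi
    M[:] += lr_sr * np.outer(delta_sr, phi)
    return delta_sr
```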
We hope these discussions suffice to explain why the proposed model (Noisy GBF optimized by TD error) is a suitable candidate while being parsimonious and anatomically grounded in biological neural circuits. We kindly request the reviewer to reconsider the score.
I appreciate the authors for their genuine and honest reply.
In this work, the authors develop a reinforcement learning model of place field organization. They consider three effects that have been observed regarding place fields in biology (a higher density at reward locations, an elongation backwards along the trajectory, and a drift observed during stable behavior). The authors show that their model reproduces these three effects.
Questions for Authors
N/A
Post-rebuttal update: while I don't believe that my concerns have been addressed, they are mostly related to the significance of the results, which is subjective. Upon that consideration, I raise my score by one step.
Claims and Evidence
While the model reproduces the three effects observed in the place cells in biology (and that is the claim put forward in the paper), there is no reason to believe that the opposite is true, i.e., that the place fields in biology are organized similarly to the proposed model. While the overall direction of the paper is interesting and, with the proposed set of tools, it could be possible to test the argument of biological relevance, this has not been done in this paper, sadly limiting its impact. I elaborate in the sections below.
Methods and Evaluation Criteria
While it is generally a good idea to consider biological observations and see whether a model reproduces them, several considerations should typically be put in place, including the generality of the model (i.e. the same model should reproduce all the observations) and the consideration of alternative plausible models (i.e. models that look equally plausible based on prior research but do not reproduce the observed phenomena). What I found problematic with the methods in this work is that the model is modified ad hoc: for the three phenomena observed in the place fields, there are three corresponding modifications of the model: the first model is formulated to reproduce the higher density of the place fields at the reward locations; then successor representation agents and metric representation agents are introduced to reproduce the elongation of the place fields backwards along the trajectory; and, finally, noise is introduced to reproduce the representation drift. This strategy raises the question of the proposed model's generality. Then, while ablation experiments are provided (which is a good thing), they only consider the properties of the place fields (i.e. the centers, widths, and amplitudes) while the aforementioned design choices (successor representations, noise) are not evaluated. Thus, while the model indeed reproduces the biological observations, there's no evidence that biological place fields form in accordance with this model.
Theoretical Claims
While the theoretical claims seem correct (as in: the model reproduces the said effects) and are confirmed by the simulations, this does not address the model’s applicability issue as raised above.
Experimental Design and Analysis
See Methods and Evaluation Criteria section above.
Supplementary Material
I have looked through the Supplementary Material mainly focusing on the derivations.
Relation to Broader Scientific Literature
While a lot of relevant literature is cited, the model design choices here, as well as the comparison with baseline models, could be better informed by the literature. Specifically, there’s substantial literature on the neuroanatomy of the reward circuit including the mapping of the actor-critic model that could be used / discussed. Additionally, there’s substantial literature on the place fields, including in works considering the hippocampus and the entorhinal cortex. These works could be discussed here and compared as to whether they reproduce the same three effects. Finally, there’s vast literature on the representation drift; whole conferences are held on that topic. This literature could also be considered and discussed here. Overall, considering and discussing this literature in the follow-up work on this project may make it a much stronger and more well-founded contribution than the current version.
Missing Important References
See Relation to Broader Scientific Literature section above.
Other Strengths and Weaknesses
The text is written clearly; the ideas and the scope of work are well-articulated, making it easy to read and review.
Other Comments or Suggestions
N/A
We thank the reviewer for their comments. We hope our response clarifies most of the concerns.
place fields in biology are organized similarly
We wanted to ask if a single, simple normative model can recapitulate several learning-induced changes in place field representations. We agree that there could be other mechanisms that elicit the same phenomena, which we briefly explored. Still, we think the proposed model is the most parsimonious in explaining the learning dynamics observed in place fields when animals are learning to navigate.
We do explore how other objectives also elicit similar representations in the subsequent section, i.e. the path integration TD error (MR - metric representation) and state prediction (SR - successor representation) objectives. While the representations learned by these 3 different objectives can be similar (Fig. 2A,B), the dynamics of how place field representations change are different (Fig. 2C,D,E), making this a prediction of our proposed model.
generality of the model
We would like to clarify that our single proposed model (noisy Gaussian basis function parameters modified by the TD error - Noise+TD) replicates all 3 phenomena without ad hoc modifications. Perhaps this was unclear, which we clarify below.
first model is formulated
While our partial model (TD) was explained first (Fig. 1), our full proposed model (Noise+TD) recapitulates high field density at targets (Fig. 3B, Fig. S3A,B).
successor representation agents and metric representation agents are introduced
We clarify that the successor representation (SR) and metric representation (MR) agents were introduced as alternative comparisons to our proposed agent (Noise+TD) where field parameters were optimized only by the TD error, not by the SR and MR objectives. Both partial (Fig. 2) and full Noise+TD (Fig. 3B) agents show field elongation against the trajectory when all the place field parameters are optimized.
noise is introduced
The stochasticity in the partial TD model was insufficient for representational drift (Fig. S7). Hence, our full model (Noise+TD) includes noisy parameter updates. To reiterate, adding noise to place field parameters did not prevent the model from demonstrating (1) high field density at the reward location (Fig. 3B and Fig. S3), and (2) field elongation (Fig. 3B).
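Schematically (our notation, a plausible reading of "field parameters optimized by the TD error" rather than the paper's exact equations), each field parameter $\theta \in \{\text{amplitude}, \text{center}, \text{width}\}$ is updated as

$$\theta_{t+1} = \theta_t + \eta\,\delta_t\,\frac{\partial \hat V_t}{\partial \theta} + \sigma\,\xi_t, \qquad \xi_t \sim \mathcal{N}(0,1),$$

so the TD-driven term shrinks as the value estimate becomes accurate, while the noise term keeps individual fields moving, producing drift without destroying the population code.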
Hence, our full proposed model (Noise+TD) replicates all 3 neural phenomena, without needing ad-hoc modifications. We hope this clarification addresses the reviewer’s main concern about the model’s generality, and we will clarify this in the manuscript.
consideration of alternative plausible models
The SR and MR objectives are alternative plausible models to show that different objectives can recapitulate the field elongation while the representation dynamics are different (Fig. 2E), making this a prediction of our model.
design choices (successor representations, noise) are not evaluated
We would like to clarify that we did evaluate various design choices. SR does not influence policy learning, as the policy influences the SR fields instead. Hence, the influence of SR fields on policy learning was not evaluated. Instead, we explored the influence of the MR objective on policy learning. Fig. S15 shows that optimizing place field parameters using MR improved policy convergence (red) compared to using fixed place fields (blue). However, the rate of improvement was not as significant as when optimizing place fields using the TD objective (purple). Using MR as an auxiliary objective (brown) to reward maximization (TD error) showed slightly faster policy convergence, consistent with Fang & Stachenfeld (2024). We described this result in lines 372-375, right column.
We refer the reviewer to Fig. 4C, Fig. S11 and Fig. S12, which show the functional role of noise in policy learning when the target shifts or the obstacle location changes. Specifically, when the target consistently shifts to a new location, partial TD agents without noise (blue) fail to learn the new target locations, suggesting they are trapped in a local minimum. Instead, Noise+TD agents can continually learn the newly shifted targets. Hence, noisy place field representations increase the agent's flexibility to learn new targets.
To conclude, our full proposed Noise+TD model replicates all 3 neural phenomena and demonstrates faster policy convergence when the task structure changes.
better informed by the literature
We refer the reviewer to the specific parts of the manuscript (Sec. 2 & 5) where we believe a discussion of the relevant literature has been included. It would be helpful if the reviewer could point to any literature we missed, which we can add as further discussion.
mapping of the actor-critic
Lines 68 to 84 left column
hippocampus and entorhinal cortex
Hippocampus: Lines 80 to 83 left column Entorhinal cortex: lines 84-87 right column, 393-394 right column
representation drift
Lines 66-82 right column, 403-434 left column
Thank you for your response.
-
Re: re: ad-hoc: while the final model, as you clarify, accounts for all three phenomena, the noise wasn't needed for the first two of them. Regarding SR / MR, thanks for the correction; you are right in that it doesn't apply to my ad hoc point here, though it then transfers to my next point below.
-
Re: re: alternative models: as the SR / MR models reproduce the observed effects, this doesn't allow for the distinction between these models here. I do acknowledge the difference in the predictions of these models regarding the representation dynamics, however, that has yet to be confirmed with neural data.
-
Re: re: literature: just to clarify (my initial statement may have been obscure here), I didn't mean that you didn't cite the literature at all --- you clearly cited a lot of it --- I meant that the specifics outlined in this literature could have informed what goes into the final model, further constraining the design choices.
Overall, for now there is no change in my points as I see them (that is, that the model should be further constrained to biology to offer instrumental predictions for neuroscience and meet what's typically required for ICML). I look forward to a thorough discussion with the other reviewers to see what they think about it and am happy to revisit my score upon this discussion. Also, please feel free to follow up on my response if there are things to be added or clarified.
-
Re: ad-hoc: Yes, noise was not needed for the first two, but it is needed for the third phenomenon. To reiterate, our single model with noise recapitulates all 3 phenomena, which addresses your major concern in the initial review about model generality.
-
Re: alternative models: It may be true that we can't tell apart RM/MR/SR from existing data, but we do make predictions about how to distinguish them. This is a pretty important contribution. One important goal of mathematical modeling is making predictions that can be tested experimentally, and this difference in dynamics is one. We would like to assert that this is a strength, not a weakness. Additionally, we would like to reiterate that we had performed additional evaluation using alternative models in the initial manuscript, which also addresses your 2nd concern in the initial review.
-
Re: literature: Our goal here is not to replicate every known mechanistic detail about place cells (as discussed in related works), but to come up with a minimal model that captures as many phenomena as possible. This requires deliberate decisions to omit certain granularities. Hence, it is true that this necessitates some disconnect from mechanisms, but it allows parsimony and interpretability. This kind of mathematical modeling is very common, appreciated, and has resulted in new insights into neural systems (e.g. mean-field firing rate models simplify neural diversity to subpopulations, continuous attractor networks model head-direction and grid cells, Hopfield networks capture one-shot associative memory, etc.). Hence, the simplification of our model is a strength, not a weakness.
Since the manuscript has results for points #1 (ad-hoc) and #2 (alternative model), addressing your major concern in the original review, we feel it is fair for the score to be increased.
This paper proposes a model inspired by place fields in the hippocampus and how they could be used to develop representations for reinforcement learning. The authors argue that their model aligns with phenomena observed in neuroscience experiments, specifically a high density of activity around reward locations, elongation of representations along the paths taken by the agents, as well as stable policy learning despite representational drift. The authors evaluated their hypothesis on simple 1D linear track and 2D environments, which are commonly used in computational neuroscience studies. Experiments with different targets as different tasks were also used to study the effects of place field updates due to these changes. Analyses were done to determine where learning the parameters of the place fields was better or worse than keeping them fixed. Overall, the paper contributes a model that shows how place-field-like representations can be used for reinforcement learning.
Questions for Authors
I would be happy to hear from the authors regarding the points I made in the Weaknesses and comments sections. Depending on how the discussion and rebuttal proceed with the other reviewers and myself, I would be open to increasing my score. Here are some further questions:
-
Figure 2A. Why does the SR field size decrease at the later stage?
-
Figure 3B, second column where T = 75000. Why is the representation similarity matrix visualisation vastly different from the others in B?
-
Line 318 in the left column. There is a statement on the elongation of fields by SR being subtle. Why is this the case? Is this due to the discount factor? Would a different or higher discount factor in Eq. 71 lead to stronger elongation?
Claims and Evidence
Yes, they were supported by the experiments and analyses described in the paper.
Methods and Evaluation Criteria
The proposed methods and evaluation make sense for evaluating the hypotheses proposed by the authors.
Theoretical Claims
While there aren't any mathematical proofs included in the main paper, I have looked through the math equations in the main paper. To the best of my knowledge, they seemed correct.
Experimental Design and Analysis
Yes, I checked the experimental designs, which are focused on simple 2D mazes. The inputs to the agents are place fields which are pre-defined using Gaussian distributions. Despite having multiple targets as different tasks, the structure of the environment remains fixed.
Supplementary Material
Yes, I reviewed the supplementary materials briefly, with most of my time focusing on the details of how the place fields were defined, the mathematical derivations of the learning algorithm in Appendix A, and how the Metric Representation agent is learned in Appendix D. Since I am familiar with Successor Representations, I didn't spend too much time on Appendix C. There are many analyses done by the authors which are included as Supplementary Figures, 15 of them in total. Unfortunately, due to time constraints, it is hard to delve into the details of these figures.
Relation to Broader Scientific Literature
There is a huge body of work looking at place cell-like representations for reinforcement learning, particularly in the computational neuroscience field. This paper aligns well with many such studies, where the basis features are deemed to be Gaussian-like and hence are pre-defined as Gaussian distributions. Many of the related studies the authors have included as references also mainly study efficiency and efficacy using simple navigation tasks.
Missing Important References
I believe that the authors have cited relevant references to the best of my knowledge.
Other Strengths and Weaknesses
Strengths
- The literature review is thorough and extensive. This helps the reader understand how the authors' work relates to the broader field of place-cell-inspired representations and reinforcement learning.
- Lots of analyses and ablation studies were performed to convince the readers that the proposed model fulfils the three criteria, quoting my summary above: A) high density of activity around reward locations, B) elongation of representations along paths taken by the agents, and C) stable policy learning despite representational drift.
- The math equations were presented clearly in both the main paper and the supplementary section. Details about the baseline models such as Successor Representation and Metric Representation were also provided.
- The captions of the figures provide clear descriptions of what the plots are showing.
Weaknesses
- The writing of the paper is very dense. There is a lot of information packed into the main paper, with constant and important references made to the supplementary section. It is clear that the authors have done a lot of work and are trying to pack as much as they can into the main paper, but this makes it harder for the reader to follow along.
- Many of the figures are pixelated if you zoom in or use a monitor to read the paper. I highly recommend using vector based graphics for your plots and figures for better visualisations and readability.
- Only simple environments with pre-defined features were considered. There is no evidence that this model can handle complex tasks and environments, making the proposed model somewhat toy-ish.
Other Comments or Suggestions
I feel that this paper is more suitable as a computational neuroscience contribution than what the authors claim in their impact statement: "Advance the field of Machine Learning." The reason being that the inputs are simple, pre-determined Gaussian features. Therefore, is it surprising that the resulting representations are transformed Gaussians, as seen in Figure 2F?
Secondly, it is unclear if the phenomena observed could also be realised using pixel or other high-dimensional observations.
Thirdly, the environments are also very simple, ranging from a 1D linear track to a 2D maze. Would the results also hold, or could a different set of conclusions arise, when the environment changes as the target changes?
We thank the reviewer for their comments. We hope our response clarifies most of the concerns.
The writing of the paper is very dense
We will reduce the density by keeping descriptions (e.g. parameter choices in the methods, the description of the path integration based TD error in Section 4.2, and figure captions) to a minimum and moving supplementary figures to the main paper. Additionally, we will have another page for the final draft.
figures are pixelated
We will use vector-based graphics e.g. .eps instead of .png for the revision!
is it surprising that the resulting representations are transformed Gaussians
Yes, we agree that the paper mainly advances the field of computational neuroscience, and we have made this edit: "This paper presents work whose goal is to advance the field of Computational Neuroscience." Transformed Gaussians are expected, but it is surprising that the TD objective is sufficient to capture the three different phenomena, which have previously been described by three disparate mechanisms: (1) a high field density is needed at salient stimuli (e.g. the reward location) to maximize Fisher information (Ganguli & Simoncelli 2014), (2) predictive coding of future location shows fields elongating and shifting backwards while enveloping obstacles (Stachenfeld et al. 2017), and (3) optimizing noisy parameters shows a stable population code while individual place fields drift (Qin et al. 2023). Hence, it is not at all expected that the Gaussian basis function transformations seen in our model recapitulate all three phenomena, including those shown by the SR algorithm.
phenomena…using pixel or other high-dimensional observations
Yes, this is a good question. Fang & Stachenfeld 2024 showed that optimizing deep representations using the metric representation-like objective recapitulates the backward shift of features. Nevertheless, whether representations in deep networks follow the dynamics observed in the current model is part of our future work. With respect to the current paper, we believe that this is a different question from the original goal, which was to analytically study how representation learning evolves in predefined place cell descriptions.
environments are also very simple
We used simple tasks to match the environmental setup used in experiments so as to study and predict place field representation dynamics. We explored an elevated level of task complexity by changing the target or obstacle location, to serve as predictions of place field representations to test in experiments. Nevertheless, it is possible that place fields behave differently in different or more complex environments. As a follow-up, we are working on: multiple rewards (Lee et al. 2020 Cell), uncertainty in the reward distribution (Tessarau et al. 2024 bioRxiv), and a sequential two-alternative forced choice task (Yaghoubi et al. 2024 bioRxiv). We would appreciate the reviewer's recommendations on other complexities to explore.
1. Why do SR field sizes decrease at the later stage?
Fig. 2C shows how the individual place fields change when optimizing the parameters using the SR algorithm. In the early phases of learning, place fields at the start location (purple) increase in size since the agent spends a large amount of time at the start doing random walks (black). As the agent learns to spend more time at the reward location, the fields at the start location decrease in size while those at the reward location increase. However, the rate of decrease in SR field sizes at the start location (purple) is more significant than the rate of increase in field sizes at the reward (green), resulting in a slight mean decrease in field size in Fig. 2A.
2. Why is the representation similarity matrix visualization different from the others?
This was an outlier example in which the similarity matrix at T=75000 was slightly different from the other time points. The population code becomes similar again as optimization continues. Fig. 3D shows the similarity matrix autocorrelation for the plots in Fig. 3B, remaining largely stable (orange) with small deviations. Subtle differences in representational similarity matrices have also been observed in experimental data (Fig. 5F, Qin et al. 2023).
3. Elongation of fields by SR being subtle
Increasing the SR discount factor (Eq. 71) to 0.95, 0.99, 0.999, and 0.9999 led to a faster increase in successor field magnitudes rather than field width. Yet, the increase in SR field width was still significantly smaller than the field elongation observed when optimizing using the TD error. The SR fields in our model are anchored to fixed place fields (Eq. 72), and the SR fields learn to represent the transition probabilities between each of these fixed place fields. Increasing the distance between the fixed place fields or reducing their width could show a faster increase in SR width compared to the increase in SR magnitudes.
The paper proposes a reward maximization framework that uses the temporal difference error to drive place field reorganization and improve policy learning. The model reproduces three phenomena observed in place cells, which is a strength of the paper from a computational neuroscience perspective. Most reviewers think the paper has a sufficient literature review and has done sufficient analysis and ablation studies. Although one reviewer was concerned about the experimental verification of the proposed framework, they eventually recognized the merit of the proposed testable predictions after the rebuttal (I highly recommend reading the author-reviewer conversations). Combined, I recommend accepting this paper.