PaperHub
5.5 / 10
Poster · 4 reviewers
Reviewer scores: 4, 3, 2, 3 (min 2, max 4, std 0.7)
ICML 2025

Can Biologically Plausible Temporal Credit Assignment Rules Match BPTT for Neural Similarity? E-prop as an Example

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We investigate whether the pursuit of biologically plausible learning rules prioritizes plausibility at the level of synaptic implementation (e.g., locality) at the expense of reproducing brain-like neural activity.

Abstract

Keywords
Computational neuroscience · biologically-plausible learning rules · recurrent neural networks · neural data similarity · neural representations

Reviews and Discussion

Official Review (Rating: 4)

The authors use the Procrustes method to compare how well recurrent neural networks trained with eligibility propagation match data from neural recordings in monkeys, relative to networks trained with backpropagation through time. As secondary results, the authors add a theoretical point suggesting that eligibility propagation can find the same solutions as backpropagation through time, and they show that other factors, such as initialization, can matter more than the learning rule for matching neural dynamics.

Questions for the Authors

N/A

Claims and Evidence

There are two main problems with the claims.

Focus on one specific case while the question is general. I think that the question they formulate is too broad for a short article. The paper simply shows that e-prop converges to a similar solution to BPTT, and even that only for certain parameters. This is not the claim of the title, and it only narrowly answers their main motivating question (lines 70 to 73). Two issues:

  • The single theoretical argument is about e-prop resembling BPTT, not about any other rule. I would not claim to be answering the more general question from this.
  • It examines only e-prop, but the claim is much more general. If the authors' pipeline is very flexible, they should test some other rules to validate their method (even if they perform worse). I don't fully see why they did not include ModProp in the main text.

Learning rules and architectures. The authors argue that they evaluate bio-plausible learning rules. However, e-prop is a combination of an architecture and a learning rule, and so are other methods that they cite (equilibrium propagation, for example, requires a specific architecture). The learning rule has a role to play, but only in combination with a specific architecture.

Suggestion. The results seem correct and interesting, but the title and some of the keywords are misleading. I find it hard to conclude anything about bio-plausible learning rules, other than that there is one model, including one specific rule, that seems to converge to a similar solution to BPTT. I would change the title to "e-prop converges to BPTT" or something along those lines. In the discussion, the authors can make the point that this is an example showing that there are bio-plausible learning rules that can do as well as BPTT.

Methods and Evaluation Criteria

It is not clear, given the question, why they focus only on tasks with dynamics rather than the very common set-up involving vision, where most recent bio-plausible models have been formulated. I mention this because the original source for using BPTT (Mante and Sussillo, 2014) has limitations that are noted by Valerio Mante himself (Pagan, M., Tang, V.D., Aoi, M.C., et al. Individual variability of neural computations underlying flexible decisions. Nature, 2024). Also, nothing is wrong with Procrustes analysis, but it seems to be one choice among many, and it is unclear whether the results would differ with other methods. None of these points mean that the methods are wrong, just that the claims need to be clarified.

Theoretical Claims

Theorem 1 is correct. However, as I understand it, the result seems trivial, and I would like to see what I am missing. If the point is to argue that e-prop converges to BPTT, it should suffice to state that both have the same fixed points (without error there won't be gradients). Also, in a problem with two different solutions it is perfectly possible that BPTT/e-prop converges to one or the other, so it is clear that they could converge to different ones.

Experimental Design and Analysis

Seems correct. More examples of learning rules would be necessary.

Supplementary Material

The theorem and ModProp both seem correct.

Relation to Prior Literature

The authors argue that learning rules should match recordings, so I would point out that there is a very wide literature on neural recordings investigating learning rules. The whole idea of testing whether a learning rule makes sense should at the very least mention that such rules should be compatible with the experimental evidence aimed at investigating precisely which learning rules are found in biological neurons (not only theoretical properties such as locality). Just a few examples: STDP (Bi and Poo, 1998) or differential Hebbian learning can be used in supervised learning (Xie & Seung, NeurIPS 1999) and to approximate backprop in some settings (Aceituno et al., Front. Comput. Neurosci. 2023); burst-induced plasticity (Remy & Spruston, PNAS 2007) is definitely observed in the brain and can also approximate BP (Payeur et al., Nat. Neurosci. 2021).

Also, the authors should at least mention that there is a debate about whether the brain uses backpropagation at all, given that it seems to be the main benchmark used; see, for example, Song et al. (Nat. Neurosci., 2024).

Missing Important References

Pagan, M., Tang, V.D., Aoi, M.C. et al. Individual variability of neural computations underlying flexible decisions. Nature (2024)

There are architectures similar to equilibrium propagation or deep feedback control that do not make the equilibrium assumption (assuming by that they mean static inputs and responses):

  • Gilra, A. and Gerstner, W. (2017). Predicting non-linear dynamics by stable local learning in a recurrent spiking neural network. eLife.
  • Kaleb, Klara, et al. "Feedback control guides credit assignment in recurrent neural networks." The Thirty-eighth Annual Conference on Neural Information Processing Systems.

Other Strengths and Weaknesses

I think it is a good point to show that the hyperparameters matter more than the learning rules. However, this seems to indicate that the question is ill-posed, as it appears to be the learning capacity that matters rather than whether the rule is really biological or not. The authors could comment on that point.

Other Comments or Suggestions

N/A. It is well written with good graphics. Also, no need to be overly polite in the responses; I would simply like clarifications or corrections (also if I am wrong).

Author Response

A more focused title/claim and why mainly one bio-plausible rule. We thank the reviewer for this thoughtful and detailed feedback. We agree that our study does not aim to characterize all biologically plausible learning rules; rather, our goal is to show that at least one such rule—e-prop—can match BPTT in neural similarity when trained to similar performance, thereby demonstrating existence rather than universality; as the reviewer noted, universality would be beyond the scope of a short paper. To clarify our focus, we will:

  • Update the title to: “Can biologically plausible temporal credit assignment rules match BPTT for neural similarity? E-prop as an example”
  • Modify lines 70–73 to end with: “— or can some such rules, under the right conditions, yield representations as brain-like as those learned by BPTT?”
  • Revise the first contribution point to state explicitly that this is an existence result
  • Emphasize in the Discussion that e-prop serves as a concrete example, not a general case

With this reframing, we believe that providing a single positive example is sufficient for the contribution we aim to make, though we do explore additional rules in appendices for curiosity.

Focusing on tasks with dynamics. While vision tasks are important, our study focuses on temporal credit assignment—a critical challenge in biological learning. This focus, aligned with prior work on dynamic tasks (e.g., Bellec et al., 2020), allows us to evaluate how well models capture evolving neural activity over time rather than static averages, making it ideal for representational similarity analyses. Our updated title now reflects this focus, and we will note that extension to vision tasks is an interesting direction for future work.

Coupling of learning rule and architecture. We agree that learning rules are often developed with specific architectures in mind, and we acknowledge this broader issue. In our study, we intentionally fix the task, architecture, and initialization to isolate the effect of the learning rule on neural similarity, following prior work (e.g., Liu et al., 2022). The architecture used is a standard RNN widely adopted in computational neuroscience (e.g., Yang & Wang, 2020). We will add a discussion point noting the importance of studying rule–architecture interactions more systematically in future work.
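For concreteness, here is a minimal sketch of the kind of fixed setup referred to here, assuming a vanilla rate RNN with a rectified-tanh ("retanh") nonlinearity and gain-scaled Gaussian recurrent weights; these specific choices are illustrative assumptions, not the authors' exact code.

```python
import numpy as np

def retanh(x):
    """Rectified tanh: nonnegative 'firing rates' (assumed reading of 'retanh')."""
    return np.maximum(0.0, np.tanh(x))

def init_rnn(n_units, n_inputs, gain=1.0, seed=0):
    """Gain-scaled Gaussian recurrent weights; `gain` plays the role of the
    initialization scale swept in Fig. 2. Input weights use 1/sqrt(n) scaling."""
    rng = np.random.default_rng(seed)
    W_rec = gain * rng.standard_normal((n_units, n_units)) / np.sqrt(n_units)
    W_in = rng.standard_normal((n_units, n_inputs)) / np.sqrt(n_inputs)
    return W_rec, W_in

def rnn_step(h, x, W_rec, W_in, alpha=0.1):
    """One leaky rate update: h <- (1 - alpha) * h + alpha * retanh(W_rec h + W_in x)."""
    return (1.0 - alpha) * h + alpha * retanh(W_rec @ h + W_in @ x)
```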

Additional similarity measures. Please see our response under “Alternative Metrics” to Reviewer AQPQ, including new results using additional metrics (Table 3).

Theoretical claim. We agree the result may appear intuitive, but BPTT and e-prop do not generally share the same fixed points (e.g., $\hat{y}'(W) = 0$ yields a fixed point for BPTT but not generally for e-prop); even when fixed points overlap, the dynamics of convergence may differ due to distinct basins of attraction. Also, slightly more precise than saying they sometimes converge to different solutions is the sign condition on the initialization for e-prop, which determines whether it converges to $W^*$. That said, we recognize the theoretical result is limited in scope and may distract from our main contributions. As noted in our response to Reviewer xVGX, we will revise the manuscript to de-emphasize the theoretical component and clarify that the theorem serves only as a simple illustrative example, not a general convergence result. Our aim is to highlight, in a concrete toy setting, e-prop’s sensitivity to initialization as a motivation for future theoretical work.
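To make the fixed-point contrast concrete, one hedged way to write it (our notation for a generic 1D toy setting; the paper's exact construction may differ) is:

```latex
% Gradient descent on E(W) = (1/2) \hat{y}(W)^2 in a 1D toy setting.
% BPTT uses the true sensitivity \hat{y}'(W); an e-prop-style rule replaces it
% with an eligibility-trace estimate e(W) (notation ours, for illustration).
\begin{align}
  \Delta W_{\text{BPTT}}   &\propto -\,\hat{y}(W)\,\hat{y}'(W), \\
  \Delta W_{\text{e-prop}} &\propto -\,\hat{y}(W)\,e(W).
\end{align}
% The BPTT update vanishes wherever \hat{y}'(W) = 0, even if \hat{y}(W) \neq 0;
% the e-prop update need not, and it decreases E only while
% \operatorname{sign}(e(W)) = \operatorname{sign}(\hat{y}'(W)),
% which is where the sign condition on the initialization enters.
```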

BPTT limitations. We agree. While BPTT-trained RNNs are widely used in brain modeling, recent studies (e.g., Pagan et al., 2024) show that BPTT tends to favor a limited subset of solutions. Zahorodnii et al. (bioRxiv, 2025) further demonstrate that handcrafted solutions can yield more brain-like dynamics. We will also cite Song et al. (2024) and others who question the usage of BPTT by the brain (see also significance of the e-prop–BPTT comparison in response to Reviewer 23UQ). Together, these points highlight the limitations of BPTT as a benchmark and motivate future work on alternative rules and rule–architecture co-design; we will update the manuscript accordingly.

Additional references. In the updated manuscript, we will discuss and cite all references the reviewer mentioned.

Is the question ill-posed? This is an excellent point. In under-constrained systems with unknown architectural variables, multiple learning rules may converge to similar solutions. Constraining the architecture using experimental data—or incorporating data during learning—could help narrow the search space, making this a promising direction for future work. We will update the manuscript with this point. That said, our goal is not to pinpoint the brain’s actual learning rule, but to test whether any existing biologically plausible rule (e.g., e-prop) can match BPTT in neural similarity, the de facto benchmark for brain-like models.

Reviewer Comment

The rebuttal addresses my points and I will update my score.

Author Comment

We thank this reviewer for their insightful and actionable feedback.

Official Review (Rating: 3)

This paper examines the properties of recurrent neural networks trained on experimentally motivated tasks using both Backpropagation Through Time (BPTT) and biologically plausible learning rules, in particular e-prop, a truncated approximation of BPTT that uses only local information for weight updates. They demonstrate that e-prop (as well as ModProp) can achieve competitive similarity with neural recordings when matched for task performance with BPTT. They also investigate the effects of learning rate and weight initialization scale on performance-matched similarity scores, finding a profound impact of initialization and a small effect of learning rate. Comparing neural distances to a baseline constructed by splitting neural recordings into multiple populations, they show that RNNs can reach this baseline level of similarity for a simpler, but not a more complex, task. To explain the similarity between e-prop and BPTT, they show that e-prop will converge to the same solution as BPTT in a linear RNN with a single recurrent unit solving an effective root-finding task.

给作者的问题

  1. What parameter is varied to change task performance along the curves in Figures 1B, 2A-B, and 3A? Is the sample size or training time changing? How does this compare across the different training algorithms?

Claims and Evidence

Empirical claims are well supported.

Methods and Evaluation Criteria

The evaluation that is missing is the amount of data / training time required to achieve equal performance across learning rules.

Theoretical Claims

I found the main text to be very misleading about the extent of the theoretical claims proven in the SI. You can't say in the abstract that you "provide theoretical results demonstrating that e-prop can converge to the same solution as BPTT, depending on the initialization," give only an informal statement of the theorem in the main text that glosses over the fact that the RNN considered in the theorem is only one-dimensional, and finally get to that limitation in the SI. There is no argument given for higher-dimensional systems, and this should be made clear in the main text.

Furthermore, the assumptions of Theorem 4.1 are so restrictive that they actually bypass any meaningful consideration of the BPTT algorithm. The same exact theorem could be applied by replacing BPTT with "algorithm Z" and assuming that "algorithm Z" converges to some fixed point $w^*$ which solves the task; then so will e-prop if it converges within a small neighborhood of $w^*$. This is more a statement about the task considered than about the e-prop algorithm, which only converges in this case when $\hat{y}'(W(0))$ and $x_{T-1}$ have the same sign, which essentially puts in as a hypothesis that the e-prop updates will move in the right direction.

I think these limitations in the theoretical claims need to be mentioned more transparently in the text, and I would urge the authors to extend their theorem to the multivariate linear case, where similar tasks have been explored theoretically (see e.g. https://arxiv.org/abs/2006.11036)

Experimental Design and Analysis

I didn't check beyond reading what's in the paper.

Supplementary Material

Reading the proof of Theorem 4.1, I don't see any mistakes in the math.

Relation to Prior Literature

This paper relates to a large body of literature comparing the representations learned by neural networks to neural activity measured in behaving animals. Also to a large body of work seeking and analyzing biologically plausible learning rules that can achieve competitive performance compared to non-plausible algorithms, namely Backpropagation.

Missing Important References

none that are obvious to me

Other Strengths and Weaknesses

Strengths:

The introduction is well-written with a good set of references to related work.

The main point demonstrated here -- that bio-plausible learning rules can achieve competitive neural data similarity to BPTT -- is a useful fact that will interest many people in the field.

Weaknesses: The captions of the figures are a bit short, and including more detail could enhance readability.

Theoretical contribution is not consistent with the message presented in the abstract or main text

Some details are missing or not elaborated clearly in the main text (see questions).

Other Comments or Suggestions

I think there's an indexing error in Equation 6 in the SI. As it currently reads, you get $\frac{\partial h_{l, t}}{\partial W_{h, ij}} = \frac{\partial h_{l, t}}{\partial W_{h, ij}} + \cdots$, implying the rest of the expression is zero.

Author Response

Theoretical claims. We appreciate the reviewer’s constructive feedback and acknowledge that the main text did not clearly convey the limited scope of our theoretical result. We agree that the theorem is restrictive in two key ways: (1) it applies only to a 1D setting, and (2) W* can be the solution of any algorithm Z, not specifically BPTT.

We recognize that our presentation may have unintentionally suggested a broader theoretical contribution than intended, potentially distracting from our main message. In response, we will make the following updates to our manuscript to (A) clarify the theorem’s assumptions and scope in the main text, and (B) re-center the reader’s attention on our core empirical focus and existence-based framing:

  • Remove all phrases referring to the theoretical results in abstract and Contributions, including “to support this further, we provide theoretical results …”, to avoid implying that we seek to establish a general theoretical framework for e-prop convergence, which we do not aim to do.
  • Clarify the restrictive assumptions in the main text’s informal theorem statement by replacing “linear RNN” with “1D linear RNN”, and “found via BPTT” with “found via an arbitrary algorithm (e.g., BPTT)”, to acknowledge that our result does not specifically hinge on BPTT alone.
  • Update Appendix B heading to “Convergence and divergence of e-prop in a toy setting”
  • Explain the theorem’s purpose in the surrounding text, clarifying that it serves as an existence demonstration in a toy setting to motivate future theoretical work: “We provide a 1D linear RNN example showing that e-prop can match BPTT under certain initializations, but diverge under others. While this illustrates key ideas—such as e-prop’s sensitivity to initialization—extending the analysis to higher-dimensional RNNs is a substantial effort (e.g., Schuessler et al., NeurIPS’20) and beyond the scope of this paper.”

To offer more information about the last point, our rationale for focusing on a toy (1D) setup is twofold. First, to our knowledge, there is no existing theoretical framework for e-prop convergence that we could readily build upon, and developing such a framework (e.g., in the spirit of Schuessler et al.) would be a major effort in its own right and beyond the scope of this work. Second, the toy setting lets us illustrate the key existence story we wish to highlight: there are indeed cases where e-prop converges to the same solution as BPTT, and there are cases where it fails. This demonstration is not meant to be comprehensive, but rather to motivate future investigations (e.g., broader architectures, higher-dimensional RNNs, etc.).

By making these changes, we hope our updated manuscript will better convey that our theorem is meant primarily to provide an example (“e-prop can match BPTT under certain initial conditions, but not always”) rather than to establish a general theoretical foundation for e-prop, and to motivate future theoretical work that examines the insights from this simple toy model in more realistic settings. We hope these revisions clarify that the theoretical result plays a limited, illustrative role and is not essential to our main aims.

Amount of data/training required to achieve equal performance. We agree this is an important point. While we did not include these comparisons originally, they have been addressed in prior work evaluating the “(1) good neuroscience task performance” criterion (see Fig. 3, Bellec et al., 2020), which shows that e-prop typically requires more training iterations than BPTT to reach the same accuracy. Our findings are consistent with this, and we will include this information and citation in the revised manuscript.

Additional questions. Indeed, task performance is varied by changing the number of training iterations. At each iteration, we compute both task accuracy and neural similarity, and plot one against the other. As training progresses, models move from low accuracy/high distance (upper left) to high accuracy/low distance (bottom right) in Figures 1-3. The same procedure is applied to all learning rules. We will add this explanation to the caption for clarity.
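A minimal sketch of the bookkeeping this describes (the training and evaluation callables here are placeholders, not the authors' code):

```python
import numpy as np

def accuracy_distance_curve(train_step, task_accuracy, hidden_activity,
                            neural_distance, reference_activity,
                            n_iters=5000, eval_every=100):
    """Record (task accuracy, neural distance) pairs over training iterations.

    train_step, task_accuracy, hidden_activity, and neural_distance are
    user-supplied callables (placeholders); reference_activity is the
    trial-averaged neural data the model is compared against.
    """
    curve = []
    for it in range(n_iters):
        train_step()                              # one BPTT / tBPTT / e-prop update
        if it % eval_every == 0:
            acc = task_accuracy()                 # normalized task accuracy
            dist = neural_distance(hidden_activity(), reference_activity)
            curve.append((acc, dist))
    return np.array(curve)                        # rows move toward high acc / low dist
```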

Longer captions. In addition to the above explanation, we will ensure all figure captions include complete axis label explanations and refer to the relevant appendix sections. We acknowledge that short captions are more common in the appendix, and will revise those to match the clarity of the main text, including color legend descriptions (e.g., Fig. 8B).

Other comments. We thank the reviewer for catching the indexing error and will correct it in the updated version.

Reviewer Comment

Thank you for addressing my concerns. I think these changes will improve the paper substantially. I am raising my score to a 3.

Author Comment

Thank you so much again for your valuable and constructive feedback.

Official Review (Rating: 2)

This paper investigates whether biologically plausible learning rules (e-prop) in RNNs can achieve neural activity similarity to biological data comparable to models trained with BPTT. Using primate datasets and Procrustes distance as a similarity metric, the authors demonstrate that e-prop-trained models match BPTT in neural data similarity at equivalent task accuracies. Additionally, they show that architecture and initialization significantly influence similarity, often surpassing differences caused by learning rules.

Questions for the Authors

  1. Does the dominance of initialization over learning rules hold across tasks with varying temporal dependencies (e.g., memory-intensive tasks)?
  2. How would results change if combining Procrustes with dynamical metrics (e.g., DSA) into a composite score?
  3. Could the conclusions extend to SNN or non-primate datasets?

Claims and Evidence

The evidence is robust for the claims overall, except that reliance on Procrustes distance alone may miss nuances captured by other metrics. The study lacks experimental comparisons to demonstrate the superiority of Procrustes distance over alternative similarity metrics.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Theorem 4.1 is proven in Appendix B.

Experimental Design and Analysis

Baseline comparisons (Fig. 5) and ablation studies (Fig. 7) strengthen validity. But tasks are limited to two datasets; generalization to other species or tasks is unexplored.

Supplementary Material

I checked Theorem 4.1 in Appendix B.

Relation to Prior Literature

The work extends prior studies on bio-plausible learning (e.g., Lillicrap et al., 2020; Richards et al., 2019) by systematically evaluating neural similarity—a less-explored aspect. It aligns with Yamins et al. (2014) on performance-similarity correlations but adds insights into initialization’s role.

Missing Important References

The article omits some of the latest biologically plausible algorithms such as BrainScale: Enabling scalable online learning in spiking neural networks.

Other Strengths and Weaknesses

Strengths:

The article is fluent and easy to understand overall and provides insights into initialization.

Weaknesses:

  1. Using existing similarity evaluation metrics to assess the algorithm's performance on neural datasets, the article lacks significant contributions and novelty.
  2. Limited discussion of alternative similarity metrics (e.g., CKA, RSA).
  3. The analysis is restricted to comparing only one biologically plausible algorithm (e-prop) with BPTT, leading to limited validity and generalizability of the evaluation.

Other Comments or Suggestions

'We used retanh to mimic type-1 neuronal firing': should it be 'Tanh'?

Ethics Review Concerns

none

Author Response

Alternative metrics. We thank this reviewer for their constructive comments and now add additional metrics to the manuscript:

Table 3: Additional measures.

| Rule | CCA | CKA | Duong et al., 2023 |
| --- | --- | --- | --- |
| BPTT | 0.283 ± 0.011 | 0.160 ± 0.018 | 0.349 ± 0.038 |
| e-prop | 0.281 ± 0.012 | 0.155 ± 0.009 | 0.394 ± 0.072 |

These additional results again support the insignificant differences across the rules (p=0.834, 0.683, 0.373, for the respective measures).

While we provided initial justification for using Procrustes in the Discussion, we will expand on it in the revised manuscript. In addition to its interpretability and geometric grounding, a recent study (Cloos et al., ICLR 2025) found that optimizing Procrustes may better preserve task-relevant features than other metrics. This suggests it can capture more meaningful neural structure. They also observed that Procrustes is stricter than CKA: high Procrustes implies high CKA, but not vice versa (see also Harvey et al., UniReps 2024). We will add this discussion.
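For reference, a minimal sketch of how Procrustes distance and linear CKA can be computed from two trial-averaged activity matrices (one common implementation, assuming both matrices already share the same number of columns, e.g. after projecting to a common dimensionality; the paper's exact preprocessing may differ):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

def procrustes_distance(X, Y):
    """Residual after optimally rotating Y onto X (smaller = more similar).

    X, Y: (time*conditions, n_units) trial-averaged activity, columns = units.
    Both are centered and scaled to unit Frobenius norm before alignment.
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    X = X / np.linalg.norm(X)
    Y = Y / np.linalg.norm(Y)
    R, _ = orthogonal_procrustes(Y, X)      # orthogonal map taking Y closest to X
    return np.linalg.norm(X - Y @ R)

def linear_cka(X, Y):
    """Linear centered kernel alignment between the same two activity matrices."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den
```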

Novelty & contribution. We thank the reviewer and respectfully clarify that while we use existing tools, our contribution lies in addressing a fundamental, underexplored question: How do biologically plausible gradient approximations affect neural activity similarity? Using e-prop as an example, we show that despite gradient truncation, it is possible to match the similarity of BPTT (before the truncation). This is not obvious a priori—e-prop < BPTT or > BPTT were both plausible. The finding of matched similarity is thus nontrivial, and we further analyze why this occurs (Figs. 3 & Appx Fig. 8), as well as how confounds like architecture/initialization dominate (Fig. 2). These insights also help refine the precise future questions that can be systematically explored across learning rules, architectures, datasets, and metrics — all of which our flexible pipeline is designed to support. We will revise the Discussion to reflect these points and believe our insights meaningfully advance this emerging area, warranting consideration for ICML.

Why focus mainly on one bio-plausible rule. We thank the reviewer for raising this point. Our goal is not to claim universality — that all biologically plausible rules match BPTT in neural similarity — but to demonstrate existence, by showing that e-prop can do so under certain conditions. To support this existence claim, showing a single such example should suffice, which we believe is both non-trivial and novel, especially given how underexplored representational similarity is in this context.

That said, we sympathize with the reviewer and acknowledge that our current title may overstate the scope. To avoid this confusion, we plan to revise the title to better reflect our focus on a single illustrative case — e.g., “Can biologically plausible temporal credit assignment rules match BPTT for neural similarity? E-prop as an example”. Additional planned updates are outlined in our response to Reviewer buDs.

Questions. We thank the reviewer for those meaningful questions and will add them to our manuscript:

  1. This is a nuanced question that touches on how task timescale and complexity modulate the roles of both initialization and learning rules. Due to rebuttal word limits, we focus here on timescale. Prior work shows that performance gaps between BPTT and e-prop widen with longer task timescales (Liu et al., 2021). If neural similarity correlates with task performance—as supported by Yamins et al. (2014) and our results—we would expect a growing similarity gap as well (although we'd be violating the assumption of matched accuracy for comparison). Testing this directly would require datasets with varying temporal dependencies, which we view as a promising direction for future work.
  2. Intuitively, if both metrics show similar results for e-prop and BPTT, a composite score (based on weighted or normalized average) would reflect the same conclusion. However, we prefer to report each metric separately to avoid arbitrary weighting and allow for clearer interpretation. We’ll add composite scores to the manuscript for completeness.
  3. This is a profound question that touches on translational relevance: to what extent do findings from one species (e.g., primates) generalize to others (e.g., rodents or humans)? The widespread use of model organisms in neuroscience is based on shared circuit motifs and learning mechanisms. While we expect our conclusions may extend to other species or even SNNs with similar dynamics, systematic cross-species and model-type studies will be essential. We view this as an exciting direction for future work.

Other comments. We will cite and discuss BrainScale and others in Related Works. We did not use Tanh because mimicking type-1 firing requires a nonnegative firing rate (hence retanh); we will clarify this point in the manuscript.

Official Review (Rating: 3)

This paper compares the representations learned by BackProp Through Time (BPTT), truncated BPTT (tBPTT), and e-prop–a “biologically plausible” learning rule designed as a model of neural plasticity–to those learned in the brains of monkeys performing several tasks. Specifically, the distance between neural representations and learned model representations is calculated via Procrustes analysis. The authors find that the representational distance between e-prop and neural data is not much larger than that between BPTT, or tBPTT, and neural data. To further bolster this favourable comparison of e-prop and BPTT, a theorem is proven showing that, in linear models, the solution found by e-prop can match that of BPTT if certain initial conditions are met. Lastly, the authors demonstrate that initial conditions can actually have a larger effect on representational similarity than the learning rule itself.

Update after rebuttal

Thank you to the authors for the many clarifications, the additional experiments, and the statistical significance analysis! I still feel that the theoretical results are exceedingly limited, but I believe that, with the rebuttal updates, the experiments alone make the paper worthy of consideration. For this reason I have updated my score to a 3.

Questions for the Authors

  1. The reviewer wonders if the authors could motivate a little more the significance of seeing similar representations between e-prop and BPTT. Given that e-prop tries to approximate BPTT this might not seem that surprising. Moreover, from a neuroscience perspective–given the biological implausibility of BPTT, is it such a good benchmark to compare with?
  2. Why use the Hatsopoulos dataset in place of Mante in Fig. 5?
  3. The reviewer noted that the objective in the theorem’s proof is to optimize the 1D linear RNN so that its hidden state is zero after $T$ time steps. Of course, given any parameterization with $W < 1$ the hidden state will converge to zero for large $T$. Do the authors not think that, for this reason, the studied case might be too simple?
  4. Would it be possible to plot appendix Fig.7 and Fig.1 on the same footing (i.e. noise or no noise in both cases)?

Claims and Evidence

The claims seem reasonably well-supported, but it is at times difficult to parse noise from signal. For this reason the reviewer suggests statistical tests and tabular display of data (see Weaknesses).

Methods and Evaluation Criteria

The datasets used seem reasonable.

Theoretical Claims

The proof was skimmed but not checked in detail.

Experimental Design and Analysis

Experimental design and analysis seems sound.

Supplementary Material

Portions of Appendix B and C.

Relation to Prior Literature

As discussed in the paper, past work in computational neuroscience has focused on proposing “biologically plausible” alternatives to BPTT, as candidate models for learning in the brain–many of which were discussed in the paper (e.g. e-prop, RFLO). Most of this literature has evaluated these models based on (1) whether they are local in space and time (not requiring information about spatially distant synapses, or temporally distant timesteps, to update the given synapse at the given time), and (2) whether they can learn reasonably difficult tasks. As of yet, there has not been a significant amount of work comparing the representations learned by these bio-plausible algorithms. This is the gap that the paper intends to address.

Missing Important References

Recent work has similarly begun comparing representations learned by different types of algorithms–for example, Brain-like neural dynamics for behavioral control develop through reinforcement learning by Codol et al. (2024, bioRxiv)–which could be a useful reference. Also, the term “biologically plausible” is somewhat underdetermined, which could be worth discussing. One potential reference for this is Formalizing locality for normative synaptic plasticity models by Bredenberg et al. (2023, NeurIPS).

A potential issue related to this paper is the recent finding in certain deep learning architectures that, under sufficient conditions, many models exhibit similar internal representations. An example reference here would be The Platonic Representation Hypothesis, Huh et al., 2024.

Other Strengths and Weaknesses

Strengths

The paper seems to be addressing a meaningful problem that has been understudied in the computational neuroscience community, and has selected datasets that are well-known and well-studied with which to address the problem. The paper seems to be well written, and the flow of ideas is logical and well-presented. By and large, the authors have done a solid job situating their work, with a respectable background section and citations.

Weaknesses

The primary weaknesses that the reviewer sees with the paper are issues of (1) clarity when it comes to displaying and analyzing the data, (2) significance of differences in data, (3) and depth of content provided in the paper. To elaborate:

  1. The data that is reported in the main body of the paper is relatively noisy and the effects are not huge. It is at times also difficult to compare across plots–for example, if one wishes to compare the effect of initialization using BPTT to e-prop (Fig. 2A to Fig. 2B). The reviewer would thus suggest (i) statistical testing, and (ii) reports of numerical results, e.g. in a table, to better support the claims.
  2. Critically, it is also difficult to get an idea of what kinds of differences between representational similarity are large and which are small (in part due to the issues mentioned in the above bullet point). Figure 5 takes a good first step towards this by comparing to a noise floor and a kind of '(relevant) noise ceiling'. However, it exhibits the issues of displaying data and statistical testing mentioned above, and doesn’t investigate the 2nd main dataset used in the study (Mante). Including a kind of ‘noise ceiling’ used in Figure 5–a comparison with a matched randomly initialized network (ideally with the best performing initialization scaling) in the main plots (and proposed table) would be helpful. One might also be able to normalize somehow by the difference between this ‘ceiling’ and the ‘noise floor’ provided by a neural data-to-neural data comparison or the best performing model for the given plot.
  3. This point is somewhat more subjective, but the reviewer wonders whether the paper provides enough content to warrant publication at a venue like ICML. The reviewer thinks this perception could be a function of the fact that only the Procrustes results are included in the main body of the paper. If the figures were made more concise so that some alternative tests (like the DSA results and UMAP examples) could be moved into the main text this should help. Also, comparing with another metric (e.g. CCA, or trying to come up with a paradigm where representation similarity analysis could be used), could also be interesting.

If the authors satisfactorily address these weaknesses and the below points the reviewer would likely raise their score.

Other Comments or Suggestions

  • Line 13 LHS: first sentence seems a little unnecessarily wordy
  • Line 17 LHS: “approximate” => “approximating”
  • Line 23 RHS: suggestion: “approach” used twice very close together; perhaps use a different word
  • Fig 1B: do the authors know why there is so much more noise on the Mante dataset?
  • Fig 1: could be useful to state truncation length here.
  • Fig 2: it would be useful to have some numerical quantification of how the variance due to initialization exceeds that of the learning rule
  • Fig 3: could be useful to include color in this figure; the greyscale makes it very difficult to tell differences between lines and points
  • Fig 4: could be good to explain what the “condition component 1” represents in this context
  • Fig 5: why use Hatsopoulos dataset and not Mante?
  • Fig 5: if the reviewer understands correctly, the noise floor was calculated by comparing the similarity of different sets of recorded neurons. Is this the case? If so, why choose to compute similarity between different sets of neurons rather than between different sample points (matched for subject and test condition)?
  • Line 420 RHS: final sentence seems rather verbose and doesn’t contribute much insight.
  • The reviewer noted that the theorem is for 1D linear RNNs, and it’s not mentioned that only the 1D case is studied in the main body of the paper. Perhaps this should be mentioned?
Author Response

Tabular results and statistical test for Fig 2. We appreciate this reviewer's concrete feedback and tabulate the results:

Table 1: Results from Fig2. *Noise ceiling corresponds to untrained models.

| Rule | gain=0.0 | gain=0.5 | gain=1.0 | gain=1.5 |
| --- | --- | --- | --- | --- |
| BPTT | 0.461 ± 0.006 | 0.423 ± 0.005 | 0.398 ± 0.007 | 0.428 ± 0.010 |
| e-prop | 0.461 ± 0.007 | 0.437 ± 0.009 | 0.407 ± 0.008 | 0.432 ± 0.008 |
| Noise ceil.* | 0.565 ± 0.002 | 0.529 ± 0.005 | 0.467 ± 0.005 | 0.508 ± 0.017 |

Non-overlapping error bars across rows suggest differences across gains (fixed rule), while overlapping bars within columns suggest similarity across rules (fixed gain). Values are mean ± standard deviation across seeds. Noise ceiling refers to untrained models. Gain=2.0 is excluded due to poor e-prop performance. Distances were measured at normalized accuracy ≈ 0.8 to ensure matched performance. To further support these trends, we find BPTT at gain=1.0 vs 0.0 differs significantly (p=2.07e-5), as does e-prop (p=1.3e-4). In contrast, BPTT vs e-prop yields no significant difference at fixed gain=1.0 (p=0.173) or gain=0.0 (p=0.985). We report these two gains for brevity (due to rebuttal char limits), but similar trends hold at other values and are available upon request.
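For clarity, a small sketch of the matched-performance selection described above, i.e. picking the distance at the first checkpoint whose normalized accuracy reaches the target; the curve format follows the training-loop sketch given earlier and is an assumption, not the authors' code.

```python
import numpy as np

def distance_at_matched_accuracy(curve, target_acc=0.8):
    """Distance at the first checkpoint reaching `target_acc` normalized accuracy.

    curve: array of (accuracy, distance) rows recorded during training.
    Returns NaN if the run never reaches the target accuracy.
    """
    accs, dists = curve[:, 0], curve[:, 1]
    reached = np.nonzero(accs >= target_acc)[0]
    if reached.size == 0:
        return np.nan
    return dists[reached[0]]
```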

Fig 5 updates. As suggested, we add tabulated bar values and statistical tests below; we apologize for initially omitting Mante13 (due to an earlier code we wrote for Hats07), but now include it and observe similar trends as Suss15 — trained BPTT/e-prop outperform the noise ceiling but still show a gap from the noise floor:

Table 2: Neural similarity scores (Fig5).

| Dataset | Noise floor | BPTT (trained) | e-prop (trained) | BPTT (init) | e-prop (init) |
| --- | --- | --- | --- | --- | --- |
| Hats07 | 0.401 ± 0.024 | 0.395 ± 0.020 | 0.442 ± 0.030 | 0.538 ± 0.041 | 0.540 ± 0.040 |
| Suss15 | 0.373 ± 0.009 | 0.406 ± 0.012 | 0.411 ± 0.012 | 0.498 ± 0.015 | 0.494 ± 0.014 |
| Mante13 | 0.704 ± 0.004 | 0.721 ± 0.008 | 0.727 ± 0.010 | 0.800 ± 0.008 | 0.801 ± 0.008 |

As noted, distances here use fewer neurons (to compute the noise floor), so values may differ slightly from earlier plots. Consistent with the bar plots, the gap from the noise floor is insignificant for Hats07 (p=0.311, BPTT-trained), but significant for Mante13 (p=1.29e-5) and Suss15 (p=5.02e-6). We will append these results to the updated manuscript. We will also add a noise ceiling to the main plots and visualization using normalized values to the manuscript.

Additional content in the main text. As suggested, we will add DSA and UMAP results to the main text. We’ve clarified the choice of Procrustes and added new results using CCA and CKA (see Table 3 for Reviewer AQPQ).

Other suggestions and questions. We will incorporate all these helpful suggestions to improve clarity. Fig2&5 comments are addressed above. For Fig5, the reviewer is correct: the noise floor was computed by comparing different neuron subsets matched across time and condition. Our baseline compares time-varying firing rates across conditions (not trial-to-trial variability), aligning with common practice in systems neuroscience. The variance in Fig. 1B (Mante) is largely due to the absence of learning rate decay; adding decay stabilizes training without altering the main trend — we will include this plot. We will also revise the main text to clarify the toy theorem’s assumptions, limited scope, and illustrative purpose, motivating comprehensive future work (see also Reviewer xVGX). We will cite and discuss all references the reviewer mentioned. Finally, we will update Fig1&7 to be on the same footing.

The significance of the e-prop–BPTT comparison. This is an important question. Beyond the motivation in lines 70–73, BPTT remains a widely used benchmark in brain modeling, especially in seminal works (Yamins, DiCarlo, Mante, Sussillo, …). Comparing it to a biologically plausible alternative like e-prop offers broadly relevant insight into neural similarity. While BPTT has known limitations (see response to Reviewer buDs), our pipeline aims to evaluate — and eventually surpass — it as a benchmark in future studies.

More broadly, our study addresses a fundamental and underexplored question: how do bio-plausible gradient approximations affect neural similarity? Using e-prop as a case study (demonstrating existence, not universality), we show that despite gradient truncation, it can match BPTT (before the truncation). This was not obvious a priori; both “e-prop < BPTT” and “e-prop > BPTT” were plausible outcomes (see further discussions on novelty&contribution with AQPQ). Our framework enables detecting future models that outperform BPTT. We will add these discussion points to the manuscript.

Reviewer Comment

Thank you to the authors for the detailed response! I'm excited to see the addition of tables and statistical tests, and to hear about the adjustments that will be made to the text. I have a couple extra questions:

  1. would it be possible to add p-values for each relevant comparison in the table, and mention the method used to correct for multiple comparisons?

  2. I'm still a little confused as to why the noise floor was computed by comparing different subsets of neurons. If the dataset contains multiple samples for each task condition, would it be possible to show an extra control comparing across samples instead?

Thanks again!

Author Comment

We thank you for your meticulous review and insightful comments. Apologies for the delay, as we took a few days to make sure we addressed your points carefully.

Question 1. As suggested, we will add p-values to the table to make them more visible.

Modified Table 1 with p-values in the table (similar trend as before):

| Rule | gain=0.0 | gain=0.5 | gain=1.0 | gain=1.5 | p-val (gain=1.0 vs 0.0) | p-val (gain=1.0 vs 0.5) | p-val (gain=1.0 vs 1.5) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| BPTT | 0.461 ± 0.006 | 0.423 ± 0.005 | 0.398 ± 0.007 | 0.428 ± 0.010 | 2.07e-5 | 1.94e-3 | 5.61e-3 |
| e-prop | 0.461 ± 0.007 | 0.437 ± 0.009 | 0.407 ± 0.008 | 0.432 ± 0.008 | 1.30e-4 | 5.71e-3 | 7.61e-3 |
| p-val (e-prop vs BPTT) | 0.985 | 0.078 | 0.173 | 0.587 | - | - | - |
| Noise ceil. | 0.565 ± 0.002 | 0.529 ± 0.005 | 0.467 ± 0.005 | 0.508 ± 0.017 | - | - | - |

We report p-values for all gain comparisons (fixing the learning rule) relative to gain=1.0, as it showed optimal performance and neural similarity in Fig2. Initially, we used uncorrected p-values from independent two-sample t-tests; since only a small number of planned comparisons were performed and the p-values were well below standard thresholds, we did not apply correction at first. For completeness, we now include Bonferroni-corrected p-vals: for e-prop, gain=1.0 vs {0.0, 0.5, 1.5} yields p = {7.80e-4, 3.43e-2, 4.57e-2}; for BPTT, the values are {1.24e-4, 1.16e-2, 3.37e-2} — all remaining significant post-correction, reinforcing the trend that gain influences neural similarity. In contrast, rule comparisons (fixing gain) yielded uncorrected p > 0.05, so no correction was needed. We will clarify this in the revised manuscript.
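A minimal sketch of the testing procedure described above (independent two-sample t-tests, with the Bonferroni correction applied over the planned comparisons); the arrays in the usage example are placeholders, not the paper's per-seed values.

```python
import numpy as np
from scipy.stats import ttest_ind

def compare_groups(scores_a, scores_b, n_comparisons=1):
    """Independent two-sample t-test with an optional Bonferroni correction.

    scores_a, scores_b: per-seed neural distances for two conditions
    (e.g. BPTT at gain=1.0 vs. gain=0.0). Returns (raw p, corrected p).
    """
    _, p_raw = ttest_ind(scores_a, scores_b)
    return p_raw, min(1.0, p_raw * n_comparisons)

# Hypothetical usage with placeholder per-seed values:
a = np.array([0.398, 0.405, 0.391, 0.402, 0.396])
b = np.array([0.461, 0.455, 0.468, 0.459, 0.463])
print(compare_groups(a, b, n_comparisons=3))
```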

We will also add the p-values, given in our initial rebuttal, directly into Table 2 (a new column with p-vals for trained BPTT vs. noise floor) and Table 3 (a new row with p-vals for BPTT vs. e-prop across the additional metrics). We apologize that we cannot copy Tables 2&3 here again due to the character limit.

Question 2. This is an insightful suggestion, and we apologize for not elaborating more initially due to the 5000-character limit. The main reason we did not compute the noise floor via trial-based subsampling is due to the substantial trial-to-trial variability in single-trial neuronal firing estimates (Cunningham et al., 2009; Kay et al., 2024), which can dominate the distance and obscure meaningful similarities, even between similar neural responses. As shown in Table 4 (Hats07 as an example), data-data distance is much larger when computed from single trials and decreases with trial averaging. This reflects how averaging helps recover the underlying condition-specific firing rates, while single-trial estimates remain noisy. Although the dataset includes multiple trials per condition, reliable single-trial estimation—e.g., via LFADS (Pandarinath et al., 2018)—remains an open challenge. Due to this, and following common practice in systems neuroscience, we focused on trial-averaged neural activity to assess model- and data-data similarity; we will clarify this in the revised manuscript.

Table 4: illustrating greater data-data distance when computed from single trials; note that these numbers are not comparable to those in Fig 5 due to different ways of subsampling.

| Number of trials averaged | 1 | 5 | 10 | 15 | 20 |
| --- | --- | --- | --- | --- | --- |
| Distance | 0.513 ± 0.008 | 0.358 ± 0.005 | 0.280 ± 0.004 | 0.238 ± 0.003 | 0.212 ± 0.003 |

Thanks again for your thoughtful questions. We’d also like to clarify why the data-data baseline shown in Figure 5 is informative in this context and will update the appendix to include this explanation. As the reviewer noted, it is computed by comparing similarity between different subsets of recorded neurons, matched across timepoints and conditions. While one could compute alternative baselines (e.g., comparing trial-to-trial fluctuations across the population using simultaneously recorded data), this would require reliable single-trial firing rate estimates, which remain challenging (as explained above). Since our study, like most RNN modeling work in systems neuroscience, focuses on trial-averaged neural activity, we opted for a baseline that asks: if more neurons had been recorded from the same region, would we be able to distinguish them from our model units? That is the question our baseline is designed to assess.
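As a concrete illustration of that baseline, a hedged sketch of the neuron-split comparison (random halves of the recorded population, matched in time and condition; the paper's exact subsampling may differ):

```python
import numpy as np

def neuron_split_baseline(rates, distance_fn, n_splits=20, seed=0):
    """Data-data baseline: distance between two random halves of the recorded units.

    rates: trial-averaged activity, shape (time*conditions, n_units).
    distance_fn: callable comparing two such matrices (e.g. a Procrustes distance).
    """
    rng = np.random.default_rng(seed)
    n_units = rates.shape[1]
    half = n_units // 2
    dists = []
    for _ in range(n_splits):
        perm = rng.permutation(n_units)
        half_a, half_b = perm[:half], perm[half: 2 * half]   # equal-sized halves
        dists.append(distance_fn(rates[:, half_a], rates[:, half_b]))
    return float(np.mean(dists)), float(np.std(dists))
```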

Overall, we agree this is an important question and will add a discussion in the updated manuscript reflecting all points above. We thank the reviewer again for raising this point, which touches on a deeper methodological question for future research.

Final Decision

The authors address an interesting problem, namely the biological plausibility of representations built by various learning algorithms.

Most reviewers suggested acceptance. While one reviewer leaned towards rejection, this decision seemed to stem largely from a different appreciation of / interest in the submission's general problem.

Indeed, one possible drawback is that this submission is perhaps less suited to ICML than to, say, ICLR or NeurIPS, due to being oriented towards biological plausibility rather than actual "machine learning" per se. While I believe the overlap in interest is sufficient to recommend acceptance, I would understand a different decision.