Flow-field inference from neural data using deep recurrent networks
Abstract
Reviews and Discussion
This work sets out to infer the latent variables and their time dynamics from observed neural recordings. To achieve this, the authors developed FINDR, essentially a recurrent neural network with multi-layer perceptrons defining the flow maps of the latent variables, whereas neural activities are defined as linear projections of the latent variables followed by a softplus (an analytically nicer form of ReLU). The key contribution to the literature seems to be the use of a specific deep architecture, though similar approaches exist in the published literature.
After the rebuttal
I believe that, following the promised revisions, this work now meets the criteria for acceptance at ICML. That said, I still feel that the clarity and overall impact of the paper could have been significantly improved had the authors chosen to present the full scope of the work in a single manuscript. The mention of a companion bioRxiv paper on scientific applications of FINDR (introduced only in the author response, see: "we discuss scientific findings using FINDR in a separate bioRxiv paper") clarified some of the unease I initially experienced while reviewing this submission. While this fact explains certain omissions that I felt would have benefited this work significantly (which is one big confound we can never account for in our reviews), it also highlights how a more integrated presentation could have elevated the contribution well beyond the acceptance threshold.
Questions for Authors
I have the following questions for the authors:
- Could you clarify how the model parameters for the benchmarks were chosen? Did you perform cross-validation to optimize them? How were the parameters for FINDR chosen?
- Could you address my comments in Claims And Evidence above: specifically, how held-out neurons were trained to connect to latent variables, and what is meant here in terms of latent identifiability?
- How does your method compare to CEBRA ([1], published a while back) and MARBLE (https://www.nature.com/articles/s41592-024-02582-2), which just came out but has been on bioRxiv for a while? To clarify, for MARBLE, I am not asking for comparisons since it can be considered concurrent work, though it should be cited as such; I am just asking for clarification.
I believe this work can be a good fit for ICML, but as it stands, it requires substantial revisions. I am not sure if the limited interactions of the conference format allow such nuanced discussions. Hence, I recommend the authors address the above concerns and resubmit to the next conference cycle, though if they choose to do a rebuttal, I remain optimistic and ready to change my evaluation if substantial evidence is presented.
Claims and Evidence
I believe that some of the claims made in this work are not clearly supported by empirical or theoretical evidence. Specifically:
- Identifiability: It is not clear what latent identifiability is, or how it is achieved. To my understanding, CEBRA ([1]) has solved a significant problem in this literature by proving linear identifiability in latent variables. This claim should be made clearer, as it is central to the latent variable identification literature.
- Fig. 3 experiments: The benchmarked models do not appear to be optimized; rather, were default parameters from demos used? If true, model parameters should be optimized using cross-validation. The authors should also provide clear information on how projections from held-out neurons were trained for all methods. For instance, LFADS uses nonlinear decoders. Did the authors train linear readouts? If so, this is also suboptimal. Overall, there is little to no information in the paper about the methodology behind Fig. 3.
- Fig. 4: This figure does not support the claim that other methods do not find consistent representations across folds. For this claim, I believe there should be some form of statistical testing. Moreover, since latent variables are often only linearly identifiable, it is likely that the autoLFADS results might be quite consistent after linear transformations.
Methods and Evaluation Criteria
To my understanding, the field has moved away from LFADS. On the other hand, rSLDS is mainly used for its interpretability, not its computational power. Also, please see the latest implementation of rSLDS, which seems to provide significant improvements compared to the traditional use (https://nips.cc/virtual/2024/poster/95587).
In general, I believe benchmarking should include CEBRA [1], or models derived from CEBRA-like architectures that also incorporate dynamical modeling of latent states (as is done in this work). See, for instance, [2-3]. Additionally, I think at least one component of this paper should include a (low-rank) RNN benchmark. See, for instance, [4-5]. Before moving towards a deep recurrent network, one might ask whether shallow low-rank RNNs have similar explanatory power. For instance, [5] has shown a result similar to Fig. 2 in this work.
Theoretical Claims
NA.
Experimental Design and Analyses
For Fig. 2, it may be more interesting to add examples other than the flip-flop task. To solve flip-flops, networks simply generate bistable dynamics. Tasks such as sine-wave generation (limit cycle) and delayed addition/multiplication (line attractor) could bring additional breadth to this work and increase its appeal to the broader NeuroAI community.
Supplementary Material
I did not review the full SM. I looked at the identifiability part and searched for how held-out neurons were designed.
Relation to Broader Scientific Literature
As noted earlier, the key contribution to the literature was somewhat unclear to me. Authors state that "The goal of FINDR is to 1) compress the activity of a large population of neurons at time t to an abstract low-dimensional representation, and 2) learn the “rules” of how this representation evolves over time. "
As stated, these are very broad statements that apply to many works in the field [1-5]. I was not able to identify these works as cited in the manuscript, and I think there are several other relevant ones that are cited within these works.
Essential References Not Discussed
Please see the end of this report. The current manuscript is missing some key references. This list is not exhaustive, but may be helpful.
Other Strengths and Weaknesses
As a big strength, I want to note that goal 2 stated by the authors is very interesting! A fully static model like CEBRA cannot address this, though later variations did try to advance CEBRA in this regard. That being said, I have always wondered about the following question when reading papers with such claims: how can we trust the flow maps outside the regions where data are observed? In the relevant literature, [6-7] have proven this for rSLDS by performing optogenetic manipulations. I am not sure whether such drastic experiments are needed to support this claim, but as noted above, I do not believe Fig. 4 is sufficient either.
Other Comments or Suggestions
References:
[1] Schneider, S., Lee, J. H., & Mathis, M. W. (2023). Learnable latent embeddings for joint behavioural and neural analysis. Nature, 617(7960), 360-368.
[2] Abbaspourazad, H., Erturk, E., Pesaran, B., & Shanechi, M. M. (2024). Dynamical flexible inference of nonlinear latent factors and structures in neural population activity. Nature Biomedical Engineering, 8(1), 85-108.
[3] Chen, C., Yang, Z., & Wang, X. (2025). Neural Embeddings Rank: Aligning 3D latent dynamics with movements. Advances in Neural Information Processing Systems, 37, 141461-141489.
[4] Pals, M., Sağtekin, A. E., Pei, F. C., Gloeckler, M., & Macke, J. H. (2024, June). Inferring stochastic low-rank recurrent neural networks from neural data. In The Thirty-eighth Annual Conference on Neural Information Processing Systems.
[5] Valente, A., Pillow, J. W., & Ostojic, S. (2022). Extracting computational mechanisms from neural data using low-rank RNNs. Advances in Neural Information Processing Systems, 35, 24072-24086.
[6] Vinograd, A., Nair, A., Kim, J. H., Linderman, S. W., & Anderson, D. J. (2024). Causal evidence of a line attractor encoding an affective state. Nature, 634(8035), 910-918.
[7] Liu, M., Nair, A., Coria, N., Linderman, S. W., & Anderson, D. J. (2024). Encoding of female mating dynamics by a hypothalamic line attractor. Nature, 634(8035), 901-909.
More relevant work: https://www.nature.com/articles/s41593-020-00733-0, https://www.nature.com/articles/s41586-023-06714-0, https://www.nature.com/articles/s41467-018-06560-z, also see works from Durstewitz's and Ostojic's groups.
Ethics Review Issues
NA
We thank the reviewer for their valuable feedback. While we can’t update our submission now, we will revise it accordingly.
Identifiability: Our definition (Appendix A.2) differs from [1], which builds on Roeder et al., 2021 ([A]), in two ways:
- Following Wang et al., 2021, we define latent z as identifiable if z_1 != z_2 implies that p(y|z = z_1, \theta) != p(y|z = z_2, \theta), where y is the observed neural activity (this is Eq. (1) and L636). Since we use a linear projection C for p(y|z, \theta), this is satisfied if C in Eq. (1) is injective. This is a weaker condition than [A], which requires both z and \theta to be identifiable. For us, this is difficult as there is always some invertible A such that y = Cz = CAA^{-1}z. This was our motivation for performing SVD on C = U S V^T, and defining z_tilde = S V^T z so that distances in the space of z_tilde are preserved in the space of y (A.3). All FINDR analyses use z_tilde. FINDR's latents (z_tilde) are thus identifiable up to an orthogonal transformation. (A minimal code sketch of this transformation is given at the end of this response.)
- FINDR is a generative model, while CEBRA is a discriminative model. [1] and [A] focus on conditions for linear identifiability in discriminative models.
We will move A.2-A.3 to the main text, and clarify differences from [1].
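A minimal numpy sketch of the transformation in A.3 (variable names and sizes are illustrative, not taken from our codebase):

```python
import numpy as np

rng = np.random.default_rng(0)
N, L = 50, 2                      # neurons, latent dimensions (illustrative sizes)
C = rng.normal(size=(N, L))       # loading matrix from Eq. (1), assumed full column rank
z = rng.normal(size=(L, 100))     # an illustrative latent trajectory, L x T

# SVD of the loading matrix: C = U S V^T
U, S, Vt = np.linalg.svd(C, full_matrices=False)

# Transformed latents used for all FINDR analyses: z_tilde = S V^T z
z_tilde = S[:, None] * (Vt @ z)

# Distances between latent states are preserved in the readout space, because
# C z = U (S V^T z) = U z_tilde and U has orthonormal columns.
d_readout = np.linalg.norm(C @ z[:, 0] - C @ z[:, 1])
d_tilde = np.linalg.norm(z_tilde[:, 0] - z_tilde[:, 1])
assert np.allclose(d_readout, d_tilde)
```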
Fig. 3: We regret if there has been a misunderstanding. We did perform 5-fold CV to optimize hyperparameters of all models (paragraph starting in L332, 1st column). We used a hyperparameter-optimized LFADS (autoLFADS), using config as in L796. For SLDS and rSLDS, we optimized over discrete latents (L308).
Following Pei et al., 2022, we held out 20% of neurons by supplying the encoder with 80% held-in neurons. Then, the decoder in Eq. (1) reconstructed all neurons. We trained FINDR on 3/5 of the trials on all neurons, validated it on 1/5 of the trials on all neurons via grid search (A.1.6), and tested performance on the remaining 1/5 on held-out neurons. We performed the same procedure for autoLFADS. In Pandarinath et al., 2018, they used a linear decoder with exponential nonlinearity, and not a nonlinear decoder.
We will clarify details on evidence-conditioned PSTH R^2, normalized log-likelihood, and held-out neuron training.
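For concreteness, a schematic of the trial and neuron split described above (array sizes and variable names are illustrative, not from our codebase):

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_bins, n_neurons = 448, 100, 60                 # illustrative sizes
spikes = rng.poisson(0.1, size=(n_trials, n_bins, n_neurons))

# 20% of neurons are held out; only held-in neurons are supplied to the encoder.
neurons = rng.permutation(n_neurons)
held_in, held_out = neurons[: int(0.8 * n_neurons)], neurons[int(0.8 * n_neurons):]

# 5-fold trial split: 3/5 train, 1/5 validation (hyperparameter grid search), 1/5 test.
# Training and validation use all neurons in the decoder of Eq. (1);
# the test metric is computed only on held-out neurons.
trials = rng.permutation(n_trials)
folds = np.array_split(trials, 5)
train_trials = np.concatenate(folds[:3])
valid_trials, test_trials = folds[3], folds[4]

encoder_input = spikes[np.ix_(train_trials, np.arange(n_bins), held_in)]
test_target = spikes[np.ix_(test_trials, np.arange(n_bins), held_out)]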
Fig. 4: Here, axes represent PC 1 and PC 2 of latents. For autoLFADS, the latent trajectories projected onto these axes were not consistent, whereas for FINDR, they were. To further evaluate consistency across folds, we sorted single-trial trajectories by evidence sign and computed the trial average of each group. Then, we calculated Pearson’s |r| of these trajectories between fold 1 and 2 for each latent axis and took the average of |r| across the axes. With this, FINDR folds were consistent by 0.99. In contrast, for autoLFADS (after doing A.3 transformation, just like FINDR), this was 0.53.
Unlike the transformation in A.3 which preserves distances in the latent space, linearly transforming autoLFADS fold 1 to match fold 2 increased |r| to 0.99, but, by doing linear transform, we stretch the latent space, so the distance in latent space is not preserved in the neural space (ignoring softplus). Without A.3, we wouldn’t be able to say e.g., the first latent dimension explains most of the variance of the task-relevant component of neural data.
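A minimal sketch of the consistency metric described above (function and variable names are ours; inputs are per-trial latent trajectories already transformed as in A.3):

```python
import numpy as np

def fold_consistency(z_fold1, z_fold2, evidence_sign):
    """Average |Pearson r| between evidence-sign-conditioned trial averages of two folds.
    z_fold*: (n_trials, n_bins, n_latents); evidence_sign: (n_trials,) with values -1 or +1."""
    rs = []
    for latent in range(z_fold1.shape[-1]):
        traces1, traces2 = [], []
        for sign in (-1, 1):
            idx = evidence_sign == sign
            traces1.append(z_fold1[idx, :, latent].mean(axis=0))   # trial average per group
            traces2.append(z_fold2[idx, :, latent].mean(axis=0))
        a, b = np.concatenate(traces1), np.concatenate(traces2)
        rs.append(abs(np.corrcoef(a, b)[0, 1]))
    return float(np.mean(rs))
```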
Benchmarking & Expressivity: As the reviewer points out, [1] is static and is not a dynamical model in the sense that it does not learn representations like our Eq. (2), and we can’t perform fixed point analysis. This is similar for [3]. The model in [2] is a finite-dimensional LDS with a nonlinear decoder, meaning it can’t learn nonlinear dynamics like bistable attractors.
Regarding explanation power (expressivity) of low-rank RNNs and benchmarking RNNs, Kim et al., 2023 defines a measure of practical expressivity and performs extensive analyses comparing RNNs of different architectures, including the one we use. Low-rank RNNs are a special case of our single-hidden-layer MLP without gating (Mastrogiuseppe & Ostojic, 2018).
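To make this correspondence concrete (a schematic argument in the notation of Mastrogiuseppe & Ostojic, 2018, not a reproduction of our equations; we ignore inputs and noise and assume the state stays in the column space of m): a rank-R rate network \tau \dot{x} = -x + (1/N) m n^T \phi(x), with x = m\kappa for an R-dimensional latent \kappa, reduces to

$$\tau\,\dot{\kappa} = -\kappa + \tfrac{1}{N}\, n^{\top}\,\phi(m\,\kappa),$$

i.e., a leak plus a drift of the form W_2 \phi(W_1 \kappa) with W_1 = m, W_2 = n^T / N, and hidden width N, which is a single-hidden-layer MLP with no gating.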
Fig. 2: Fig. 2 doesn’t show bistable dynamics, but a disk attractor (L203, 1st column). A 1-D variant similar to this task would generate a line attractor. We find that FINDR can also recover limit cycles and bistable attractors in synthetic data.
FINDR’s goals: We will clarify that the two goals are field-wide, not just FINDR’s. We will make our contributions, including task-relevant and irrelevant dynamics, clearer with bullet points.
Flow map confidence: Please see Reviewer PTUj’s Q1.
Relevant work: We will cite Hu et al., Pals et al., and MARBLE as important concurrent work, and cite references mentioned by the reviewer. In particular, MARBLE estimates flow fields in neural space before embedding them in latent space, whereas FINDR estimates flows in latent space. Estimating flows in high-D space could be more sensitive to noise, and integrating state-space modeling (like the one here) with MARBLE could be an interesting future direction.
I would like to thank the authors for their well-structured rebuttal that clearly engages with the critiques raised in my original review. I found the following responses particularly convincing:
- The clarification on the methodology was much appreciated. The held-out neuron evaluation follows the correct procedure (kudos for fairness here), and the parameter optimization strategy is indeed reasonable (apologies for overlooking this). I recommend making both aspects clearer in the main methods section. I also appreciated the discussion on latent interpretability/identifiability - please ensure the manuscript is self-contained so that readers are not required to consult other works to follow these key ideas.
- I have a better appreciation for Fig. 2 now. I had missed that input strength was varied, making the emergence of a planar attractor quite reasonable. That said, it is still somewhat difficult to disambiguate this from bistable dynamics visually since the latter would also have very slow dynamics around the origin compared to the far edges. You might consider showing a potential function (e.g., in 3D) or a 1D projection to make the planar/disk attractor structure more visually explicit.
- The discussion on Kim et al. (2023) was helpful, and I commend the authors’ willingness to balance their claims throughout the rebuttal.
However, I remain unconvinced on the following points:
- Trust in flow fields: The response to Reviewer PTUj does not fully address my concern. I still do not see an example where learned flow fields meaningfully generalize beyond steady-state. For instance, how well does FINDR capture neural activity later in the trial versus trial-averaged activity? What about novel trials far from the training set? Such analyses would be necessary but not sufficient for a final publication.
- Benchmarking and expressivity: While CEBRA is indeed static, one can still extract latent variables and then fit dynamical models post hoc. From an experimental viewpoint, this approach can be just as informative, especially given the difficulty of validating the inferred flow fields (see my point above). In my humble opinion, this critique could have been better addressed via experiments rather than theoretical distinctions.
- Fig. 4: I’m still not fully convinced by the response. FINDR defines latent variables up to rotation, while linearly identifiable models allow both rotation and stretching - so it's unsurprising that FINDR captures geometry more precisely, while models like autoLFADS may preserve topology. This is an interesting observation, but perhaps not as impactful as initially framed. If the authors wish to emphasize it, I suggest expanding the theoretical motivation and clarifying the preprocessing (e.g., whether latent variables were standardized before PCA).
Overall, I believe the methodology is sound, but the paper would benefit from substantial revisions in presentation, benchmarking, and a clear focus on real-world (biology) applications. In particular, applying FINDR to a neural recording dataset to uncover a compelling insight would strengthen its contributions, possibly even making it a strong spotlight candidate at a future venue. That being said, at this time, I do not believe the manuscript meets the bar for ICML, but I am increasing my score to reflect my confidence in the methodology and the thoughtful rebuttal. For a final publication, the three remaining points I raised above should be sufficiently addressed. I am happy to reevaluate if authors provide evidence addressing these concerns.
I also want to emphasize that this line of work is highly promising for neuroscience, and with one round of major revisions, it is likely to become suitable for a top ML venue. In the event of a rejection by the AC in line with my recommendation, I would strongly encourage the authors to address this feedback and resubmit to NeurIPS.
Edit: I had some more time to think about it. I think the added benefit of having this work published overwhelms the weaknesses stemming from the benchmarking concerns. As long as the following are satisfied by April 8th and answered by authors with clear evidence, I will support an acceptance:
- Please show us whether the model extrapolates beyond the training regions in one way or another. It doesn't have to be perfect, but this has to be present in the manuscript and has to be quantified.
- Please confirm and commit to revising Fig. 4 to tone down the claims made about consistency across cross-validation folds. You are welcome to use the geometry vs topology distinction, but the quantitative results after the linear transformation (r=.99) should be there, i.e., make it clear that topology is preserved in both methods and the added benefit of the SVD is to preserve geometry in this case. I am asking for the specific wording of this to be present in the rebuttal response.
- Please add CEBRA to Fig. 4 as a third method.
We thank the reviewer for the opportunity to revise and improve our manuscript. We have tried our best to incorporate the suggested changes:
(1) Please see our response to Reviewer PTUj.
(2) Thank you for the suggestion---we will revise Fig. 4 and the main text to clarify our definition of consistency and ensure that quantitative results after linear transformation (|r|=.99) are included for autoLFADS. We will also clarify that while we find empirical evidence that FINDR discovers consistent representations for this dataset, consistency is not theoretically guaranteed and should be verified empirically on new datasets.
While we found that the autoLFADS factor trajectories are topologically consistent across folds, we also wanted to check whether the dynamics found by the LFADS generator for each fold are consistent with each other by identifying approximate attractors for each fold’s generator and seeing whether they match. Ext. Data Fig. 4 shows that FINDR consistently reveals two approximate attractor points corresponding to left and right choices, and that the trajectories, by 3s onwards, reach one of the two points.
Do we find similar bistable attractor-like structures across folds for autoLFADS? To see this, for each fold’s autoLFADS, we ran the trained generator forward in time for 5s, starting from the initial conditions inferred from the encoder. (448 initial conditions, because there were a total of 448 trials in this dataset.) We found that while autoLFADS states reached steady-state by 5s (states moved minimally during the 4-5s period), they did not form two clusters as would be expected from bistable attractors, both for folds 1 and 2. (We will include this figure in the revision.) Importantly, to see if the distribution of the fold-1 states in approximate steady-state match the distribution of the fold-2 states, we affine transformed the autoLFADS latent trajectories from 4s to 5s in fold 1 to match those in fold 2, and applied the Pearson’s |r| metric above. We found that they were not consistent (|r|=0.22). This suggests that even when the autoLFADS factor trajectories across folds are topologically consistent (|r|=0.99), this does not guarantee that the underlying dynamics that generated the trajectories by autoLFADS are consistent. Using a similar procedure for FINDR, we found |r|=0.94, consistent with the visualization in Ext. Data Fig. 4. For this |r|, we didn’t have to do an affine transformation and could simply use z_tilde’s from both folds.
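The affine alignment used above can be obtained with ordinary least squares before computing |r|; a minimal sketch (function and variable names are illustrative, not from our codebase):

```python
import numpy as np

def affine_align(z1, z2):
    """Least-squares affine map A, b such that z1 @ A + b approximates z2.
    z1, z2: (n_points, n_latents) steady-state latents from folds 1 and 2."""
    X = np.hstack([z1, np.ones((z1.shape[0], 1))])   # append intercept column
    W, *_ = np.linalg.lstsq(X, z2, rcond=None)
    A, b = W[:-1], W[-1]
    return z1 @ A + b

def mean_abs_r(a, b):
    """Average |Pearson r| across latent dimensions."""
    return float(np.mean([abs(np.corrcoef(a[:, i], b[:, i])[0, 1])
                          for i in range(a.shape[1])]))
```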
(3) Yes, we will include CEBRA results for all 5 folds in Fig. 4b. In the past few days, we trained CEBRA-Time on folds 1 and 2 using hyperparameters from: https://cebra.ai/docs/demo_notebooks/CEBRA_best_practices.html#Items-to-consider. For both folds, when we color-coded the latents by evidence strength (like in Fig. 4), we saw a gradient in the latent space respecting the evidence strength.
We also trained a Euclidean-distance model ("offset10-model-mse", output dim=2). Although no theoretical guarantees on linear identifiability are provided in [1] or [A] for this model (in contrast to the cosine-distance model), we did see empirically that for this dataset, two folds were consistent by |r|=0.99. We also saw that the parts of the state space traversed by the trajectories depend on evidence strength.
However, as the reviewer pointed out, it is difficult to perform fixed point analysis on the latents of CEBRA without additionally fitting a dynamical model, e.g., FINDR or rSLDS. Whether combining CEBRA with dynamical models improves interpretability is an interesting future direction, but beyond the scope of this work.
A key distinction between FINDR and CEBRA, for this dataset, is that we find that sensory inputs perturb dynamics roughly along PC 1 in the latent space of FINDR, but this would be difficult to know using CEBRA.
Summary: Our new analyses, together with results in our original submission, support that FINDR representations are consistent across folds---not only topologically but also geometrically---and reveal dynamical consistency, specifically two approximate slow points associated with left/right choices. Among all methods tested, only FINDR achieves both:
-
Strong performance on neural data
-
Discovery of consistent low-D dynamical representation, with interpretable slow points
Clarification:
"In particular, applying FINDR to a neural recording dataset to uncover a compelling insight would strengthen its contributions,"
We would like to clarify that we did apply FINDR to real neural recordings---Fig. 3 and 4 are from neuropixels data. While our focus here is on the methods, we discuss scientific findings using FINDR in a separate bioRxiv paper.
We also mention in the Discussion what the representations found by FINDR mean in decision-making (L377, 2nd column), and more broadly, how FINDR could have a potential impact in neuroscience (L382, 2nd column).
Authors introduce a new method for latent variable inference of neural data. The essence is a sequential variational autoencoder. The main innovation is a “prior” which encourages the latent variables to satisfy an ODE. Using this method, they show that low-D latents are recovered in synthetic examples. When compared with other methods that are also limited to low-D latents, the performance on real data is better for very low-D, and comparable for slightly higher dimensions.
After rebuttal
After reading all the rebuttals and discussion with all reviewers, I am keeping my score. The topic is important, and many methods have been introduced in recent years. While the proposed method here is novel and seems promising, the limited comparison to other methods and benchmarks weakens the contribution.
Questions for Authors
In the identifiability section (A.2), you write that rank(C) = L. How does this fit with the results of Fig. 2d for L > 2? Equation 3: should this be sqrt(dt)?
Claims and Evidence
The main claims are partially supported. The method is able to recover low-D synthetic latents. But, the abstract claims that “FINDR outperforms existing methods in capturing the heterogeneous responses of individual neurons”. Existing methods (e.g. LFADS) were not tested without the limitation of low-D latents.
Methods and Evaluation Criteria
There are existing benchmarks (neural latents benchmark, also cited in this paper) that offer quantitative measures of prediction (without limiting latent dimensionality). These were not used in the present work.
Theoretical Claims
Irrelevant
Experimental Design and Analyses
The synthetic example is indeed a case where we expect a 2D continuous attractor, and the method recovers it successfully. The irrelevant dynamics (constant bias) are perhaps not challenging enough. As mentioned above, it would be good to test the method on other benchmarks, such as neural latents.
Supplementary Material
Yes. All
Relation to Broader Scientific Literature
This is part of the latent inference line of work. The authors mention these other works. The emphasis here is on a low-D latent space, which can improve interpretability.
Essential References Not Discussed
Two works that also emphasize the low-D latents. Valente, Adrian, Jonathan W. Pillow, and Srdjan Ostojic. “Extracting Computational Mechanisms from Neural Data Using Low-Rank RNNs.” Advances in Neural Information Processing Systems 35 (December 6, 2022): 24072–86. Pals, Matthijs, A. Erdem Sağtekin, Felix Pei, Manuel Gloeckler, and Jakob H. Macke. “Inferring Stochastic Low-Rank Recurrent Neural Networks from Neural Data.” arXiv, February 26, 2025. https://doi.org/10.48550/arXiv.2406.16749.
Other Strengths and Weaknesses
Main strength is a new approach to encourage interpretable latents, which is an important goal. The comparison to other techniques and other benchmarks is somewhat lacking, which weakens the paper.
Other Comments or Suggestions
none
We thank the reviewer for the constructive feedback. We appreciate their suggestion to test FINDR on the Neural Latents Benchmark (NLB) to further support our claims. We agree that evaluating FINDR on multiple real datasets, including the public training and validation datasets available from NLB, could strengthen our findings. However, as the reviewer pointed out, we think that one of the main goals of our paper is to be able to discover interpretable dynamics (i.e., performing analyses similar to Fig. 4), rather than solely outperforming other models on neural activity prediction (though we should make sure that our model performs reasonably well on this). Since analyses similar to Fig. 4 for the models currently fit to datasets in NLB (for models where this is possible, like rSLDS) are not readily available, we think the dataset we use is as good a choice as datasets in NLB.
The winning model of the NLB challenge in 2021 was based on transformers, which does not give a dynamical interpretation. Similarly, autoLFADS (with or without the rank bottleneck) doesn’t enable plotting low-dimensional vector fields (even when L<=3). We suspect that transformers and autoLFADS with L>20 will outperform FINDR given that autoLFADS with L=20 performed similarly to FINDR with L=2 in terms of R^2 (Fig. 4a). However, given that FINDR performs reasonably well in predicting responses from held-out neurons (Fig. 3c), in addition to being able to provide low-dimensional flow fields, we think this is where the strength of FINDR lies. We will clarify in the main text that FINDR only outperforms existing methods in low dimensions, and revise the abstract so that it reads “we demonstrate that FINDR performs competitively against existing methods…”.
Regarding A.2, thank you for pointing this out. We should have been clearer. The rank of C is computed with numpy.linalg.matrix_rank, which computes the SVD and counts the number of singular values greater than a tolerance tol. By default, numpy.linalg.matrix_rank sets tol = S.max(axis=-1, keepdims=True) * max(C.shape[-2:]) * numpy.finfo(S.dtype).eps, where S is the vector of singular values. While, for example, the column for L=6 in Fig. 2d shows that variance explained is close to 1 after L=2, the singular values of C here were 96.303, 86.11196, 9.6449, 8.212957, 7.338667, and 1.3648185, all of which were greater than tol (0.0057 for this C).
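For reference, the computation we describe corresponds to the following (the default tolerance formula is taken from the numpy documentation; the loading matrix below is only an illustrative stand-in):

```python
import numpy as np

# Illustrative N x L loading matrix; in the paper C is learned. Here we only
# reproduce how numpy.linalg.matrix_rank counts singular values above tol.
C = np.random.default_rng(0).normal(size=(67, 6))

S = np.linalg.svd(C, compute_uv=False)                       # singular values of C
tol = S.max() * max(C.shape[-2:]) * np.finfo(S.dtype).eps    # numpy's default tolerance
rank = int((S > tol).sum())
assert rank == np.linalg.matrix_rank(C)
```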
Regarding Equation 3---thank you for catching this! This is a typo, and it should have been \sqrt(dt).
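For readers checking this correction: under a standard Euler-Maruyama discretization of a generic latent SDE dz = \mu(z, u) dt + \Sigma^{1/2} dW (a schematic form, not a reproduction of the paper's exact Eqs. (2)-(3)), the noise term scales with \sqrt{\Delta t} rather than \Delta t:

$$z_{t+\Delta t} = z_t + \mu(z_t, u_t)\,\Delta t + \Sigma^{1/2}\sqrt{\Delta t}\;\epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, I).$$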
Thanks for all the clarifications to my comments and those of the other reviewers.
Perhaps a more relevant benchmark is the computation through dynamics benchmarks. I'm aware that it's quite new, but it could still strengthen the paper to compare existing methods in such a manner.
Another relevant recent work (but on arxiv since 2022) is Langdon and Engel (Nat Neur 2025).
And another preprint is Versteeg, Sedler, McCart, Pandarinath (arxiv 2023), where in Figure 6 performance of ODIN (and other models) is compared to the peak performance of autoLFADS.
Thank you for suggesting more references/benchmark relevant to our work! We will make sure to cite them in our revision.
FINDR (Flow-field Inference from Neural Data using deep Recurrent networks) is an unsupervised method for discovering low-dimensional neural population dynamics. It uses a gated neural drift function, decomposing spiking activity into task-relevant latents and a time-varying bias capturing non-task effects. A sequential variational autoencoder ensures accurate data reconstruction and interpretable dynamics. FINDR offers an interpretable approach for modeling complex neural computations through low-dimensional flow-field visualizations and analysis.
Questions for Authors
1. The flow field is learned from encoder-inferred trajectories reflecting real data; how accurate might the learned dynamics be in latent regions not sampled by the dataset?
2. Why is it necessary to optimize d first, then the other parameters? How does the approach ensure that task-relevant and irrelevant features are distinctly captured by d vs. the latent variables?
3. To what extent does FINDR preserve geometric relationships in the latent space? For instance, if an animal navigates a T-maze, would the model maintain the T-shaped structure in the low-dimensional representation?
4. How large must the MLP be (e.g., number of layers and hidden units) for stable training and robust performance?
Claims and Evidence
The authors claimed that FINDR is able to compress the neural data into an abstract low-dimensional latent space and learn the dynamics in that latent space, which has been supported by their experiments.
Methods and Evaluation Criteria
Yes, the cross-validated log-likelihood and PSTH-based R² are standard metrics for spike-train models, helping verify predictive accuracy and interpretability. The validity of their main methodological component, the gated drift function, was established in previous literature rather than in the present work.
Theoretical Claims
The main theoretical aspect is that the gating structure of the drift function improves the expressivity, stability, and trainability of the neural SDE, referencing prior work on gated neural ODEs, though there are no formal proofs in the text. Other claims seem correct.
Experimental Design and Analyses
The synthetic data experiments were designed to confirm whether FINDR could recover known continuous attractors, a standard check for dimensionality and topology in latent-variable models. The real-data analysis employed a five-fold cross-validation scheme, ensuring robust validation of held-out neurons. Both approaches appear well-executed and adhere to common practices (e.g., log-likelihood and PSTH R²). A minor limitation is that the real-data setup features only a two-choice decision-making task, which might not reflect more complex or multidimensional behaviors.
Supplementary Material
I reviewed A.1.1, which provides details on task-irrelevant feature learning.
Relation to Broader Scientific Literature
FINDR extends classical state-space models (LDS, GPFA) and deep generative approaches (LFADS) by using a gated nonlinear architecture that separates task-irrelevant components. This design builds on continuous-attractor concepts and addresses mixed selectivity, yielding stable, interpretable latent spaces and contributing new insights into neural dynamics and computational neuroscience.
Essential References Not Discussed
not to my knowledge
Other Strengths and Weaknesses
- Strengths
1. The paper is well organized, with a clear exposition of the model through equations and figures.
2. It is novel in how it explicitly separates task-irrelevant features, an approach that helps maintain a stable latent space across trials.
3. Both synthetic and real datasets are used to verify the model, enhancing confidence in its validity.
- Weakness
The real-data experiments focus on a two-choice decision-making task, which may be relatively simple.
Other Comments or Suggestions
To my understanding, the decoder should take d as input to reconstruct the firing rates, but d is not included in the figure.
We thank the reviewer for the positive feedback. We will change Fig. 1 to include d in the revision.
Q1: We should have clarified that the colored trajectories in Fig. 4b represent trial averages sorted by evidence strength, and that the region inside the dotted line represents the part of the state space visited by single-trial trajectories. We did not show parts of the state space not visited by single-trial trajectories.
In the revision, we could show a “confidence heatmap” showing the probability of the single-trial trajectories visiting a particular region of the state space.
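A minimal sketch of how such an occupancy/confidence map could be computed (assuming single-trial latent trajectories are stacked in an array; function and variable names are illustrative, not from our codebase):

```python
import numpy as np

def occupancy_map(z_trials, n_bins=50):
    """Fraction of single-trial time points falling in each bin of the
    (z_tilde_1, z_tilde_2) plane. z_trials: (n_trials, n_timesteps, >=2)."""
    pts = z_trials[..., :2].reshape(-1, 2)
    hist, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=n_bins)
    return hist / hist.sum(), xedges, yedges   # plot with, e.g., plt.pcolormesh
```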
Q2: By design, task-irrelevant components do not use the task-relevant inputs u to capture within- and across-trial baseline firing activity. We reasoned that by optimizing d first, we let the model predict neural activity as much as it can without task-relevant information. If this part is successful, then the dynamics capturing the residual of the task-irrelevant component from inputs u would be task-relevant.
Q3: The Euclidean distance in the neural state space (ignoring softplus) is preserved in the latent space. For more details, please see Appendix A.3 and response to Reviewer UnSA on identifiability.
Q4: We perform a hyperparameter grid search to ensure that we analyze latent representations from a model with good training and performance (Appendix A.1.6). We found that the optimal size of the MLP and other important hyperparameters depend on the dataset.
Thanks for the reply! The answers properly addressed my concerns.
Following up on Q1, to my understanding, your model can be extrapolated to regions outside the dotted enclosure. To what extent does it still preserve meaningful dynamics in these regions?
Thank you for this important question. Reviewer UnSA raised a similar point, and we address both here. If we understand correctly, the question is whether the FINDR-inferred flow field can be trusted outside the dotted enclosure (i.e., outside the region of state space visited by the single-trial trajectories).
We would like to clarify that we do not make any claims about the accuracy of vector arrows outside the enclosure in this manuscript or elsewhere, and the method is not built or intended to work outside the enclosure. It is intended to succinctly summarize and characterize the dynamics underlying the training dataset. Extrapolating beyond it would involve further assumptions beyond the scope of FINDR.
That said, we recognize that in some cases, knowing the dynamics only within the dotted region is not sufficient to draw conclusions about the dynamics of the system. This is especially the case if the model has to infer the underlying flow field from only a few dynamical trajectories (e.g., Hu et al., 2024, Nair et al., 2023), as there can be multiple possible flow fields that could have generated those few trajectories. This was less the case for our real-world application in Fig. 3 and 4, where the dataset consisted of the rat performing a single decision per trial, with a total of 448 trials. Because the inferred flow field needs to satisfy 4/5 of the 448 dynamical trajectories (as we do 5-fold CV), this gives us more confidence in our estimate of the flow field.
If one had to work with a limited number of trajectories, and to truly test whether the model can extrapolate outside the enclosure, as Reviewer UnSA pointed out, one could experimentally perturb the neural state out of the enclosure, and evaluate whether the relaxation dynamics of the neural trajectory follow the flow field learned by the model. We think perturbations are beyond the scope of our work.
The problem of whether we can identify the correct flow field when there is a limited number of trajectories is a general problem that applies not just to FINDR but to all dynamical systems identification models. In other fields where we have side information about the system, e.g., that the system needs to conserve energy, having such information built into the model has been shown to be helpful for accurately identifying the system (e.g., [a], [b]). In neuroscience, this is a much harder challenge, but we think it is an important future direction to develop flexible models that can integrate multiple types of data/information to help constrain the space of possible flow fields, especially in scenarios where there are a limited number of dynamical trajectories (e.g., development, learning). Using simpler models to fit a limited number of trajectories, without side information, will not necessarily generalize well to the region of the unobserved state space.
The Discussion in L417, 1st column, is also relevant to some aspects discussed here.
New results on extrapolation performance: To directly assess how well FINDR extrapolates from training data, we trained FINDR to the same dataset used in Fig. 4, but held out the last 0.1s of each trial (=10 time steps; a single trial was typically <1s long in this task). We then computed 5-fold CV R^2 for each time step between z_tilde from the full vs. the held-out models. We found that R^2 starts around 0.70 for one step forward and monotonically decreases to 0.63 by the end of the trial. We also confirmed that the 5-fold CV log-likelihood score for neural activity (one used in Fig. 3b) is similar for both models for the held-out 10 time steps (0.028 for full vs 0.024 for held-out). We will include a relevant figure for this in the Appendix.
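A sketch of the per-time-step comparison just described (illustrative names; z_full and z_heldout denote latents from the model trained on full trials and on trials with the last 0.1 s removed, respectively, aligned as in A.3):

```python
import numpy as np

def per_timestep_r2(z_full, z_heldout):
    """R^2 between latents from the two models at each extrapolated time step.
    Inputs: (n_trials, n_extrapolated_steps, n_latents)."""
    r2 = []
    for t in range(z_full.shape[1]):
        y, yhat = z_full[:, t, :].ravel(), z_heldout[:, t, :].ravel()
        ss_res = np.sum((y - yhat) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2.append(1.0 - ss_res / ss_tot)
    return np.array(r2)   # expected to decrease gradually with extrapolation horizon
```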
Summary:
-
Extrapolating beyond the dotted enclosure requires either experimental perturbation data or additional assumptions on the dynamics that are beyond the scope of this work.
-
Nonetheless, FINDR can extrapolate across trial epochs. The model trained on the initial periods of the trial can well predict single-trial trajectories in the later period.
[a] Ahmadi, A. A. & El Khadir, B. (2023). Learning Dynamical Systems with Side Information. SIAM Review.
[b] Greydanus, S. et al. (2019). Hamiltonian Neural Networks. NeurIPS.
The paper introduces FINDR, a method for inferring low-dimensional stochastic dynamics from neural recordings. The proposed approach first estimates the bias of the observation function (called the task-irrelevant component of the spiking activity) by solving a regression problem for the average firing rates of each neuron. This aims to capture fluctuations in the spiking activity not attributed to the latent dynamics. The authors first estimate an across-trial bias by fitting a linear model with raised-cosine basis functions to the trial-averaged firing rates, capturing slow fluctuations in each neuron’s baseline activity. They then model fast within-trial fluctuations using another linear model with raised-cosine basis functions and sum both components to obtain the overall time-varying bias for each neuron. The main inference method uses a sequential variational autoencoder with a semi-orthogonal loading matrix to infer a low-dimensional latent representation of neural population activity and to learn the underlying stochastic dynamics (flow-field inference). The authors model the drift function with gated multilayer perceptrons. The optimization uses backpropagation through time to compute the gradients.
The authors first demonstrate the method on synthetic data, showing that it correctly estimates the underlying dynamics and performs better than other models when the latent dimension is correct. Then, on selected recordings from rat PFC during a decision-making task, the authors show that FINDR outperforms SLDS, rSLDS, autoLFADS, and GPFA in predicting held-out neuron activity. They also show that the inferred flow fields are consistent across cross-validation folds and identify attractors related to the decision-making task.
Overall, the method demonstrates good performance. However, its technical implementation is somewhat involved. Nevertheless, the complexity of the approach is well suited for an application-focused paper.
Questions for Authors
- In FINDR, the SDE’s time constant (as used in Eqs. 2–3) sets the timescale of the intrinsic latent dynamics. Do you estimate it directly from the data, or is it fixed a priori? Have you observed any relationship between this timescale and factors such as the required number of recorded neurons or their average firing rates for accurate inference of the latent dynamics?
- Do you assume that the external inputs are known? If not, how do you estimate them? If you do assume that they are known, how do you set them up in the examples in Figure 3 and 4?
- I have a few questions regarding the estimation of the bias term of the observation model, which the authors call "task irrelevant activity". The authors infer this bias first, before learning the dynamics and the other parameters.
- 1] How uniquely can you identify the bias term of each neuron, as opposed to attributing a shift of the dynamics part to this bias?
- 2] Does the proposed approach estimate the d values that were used when generating the spike trains in the simulated experiments, or are they identified just well enough to allow estimation of the latent dynamics? Can the authors provide a scatter plot of the d_estimated vs d_true values to demonstrate how well this part of the method works?
- 3] Moreover, I think it would be useful to perform an ablation study for this part of the estimation, to test how well the inference method would work without estimating d, or without the individual across-trial and within-trial d components.
- In the paper by Genkin & Engel mentioned above, the authors discuss overfitting of latent stochastic dynamical models as the amount of spiking data increases. Do you observe something similar in your framework, requiring the relevant parameter to be adjusted, or does your current selection suffice to overcome this issue?
- Among the model parameters to be inferred is the noise covariance matrix of the latent process. How do you estimate it, and how accurate is this estimation on the synthetic dataset?
- How necessary is the gating function of the gated neural ODE for accurate inference of the latent dynamics? Would a neural ODE without gating fail to capture the latent flow field?
- What are the computational demands of each part of the proposed inference method, and how does it compare to the methods used as baselines?
Claims and Evidence
- The main claims are that FINDR infers accurate dynamics, outperforms existing methods, and provides interpretable visualizations. The synthetic data experiments support the first claim by showing correct latent dimension recovery and attractor structure.
- The gated parametrisation of the drift is more expressive and trainable compared to non-gated models. I am not sure that there is demonstrated evidence for this in the paper.
- Consistency across folds and attractor visualization in flow fields support the third claim. However, such interpretable visualisation of flow fields can be made for all other baseline methods used for comparison. The interpretability of flow fields is demonstrated but might benefit from more quantitative measures. The separation of task-relevant/irrelevant components is shown to improve consistency, but the paper doesn't explore what exactly the task-irrelevant components capture.
- The paper asserts that it can effectively disentangle task-relevant from task-irrelevant neural activity by estimating two components of the bias d, designed to capture within-trial and across-trial fluctuations that are not explained by the latent dynamics. However, the current version of the paper lacks clear evidence or numerical validation demonstrating the accuracy of this estimation, and the parameters of this part of the method are not rigorously analysed. In Ext. Data Fig. 5, the authors claim that by setting a single constant bias across trials the estimated flow fields are less consistent across splits, but to my eye the projected trajectories in the five splits of that figure seem rather consistent, and the inconsistency seems to occur in the parts of the (projected) state space not visited by the system (see also my questions to the authors below on this part).
Methods and Evaluation Criteria
The proposed methods and evaluation criteria are well-motivated and suited for inferring neural population dynamics. The authors first apply the method on synthetic data with known underlying dynamics, and demonstrate the performance in terms of normalised log-likelihood on test data. The authors could probably use a metric that quantifies differences between ground truth and estimated flow field for this setting with known dynamics to directly show the accuracy of the flow field inference part.
The use of synthetic datasets, with known underlying attractor structures, provides a controlled environment to verify that the method can accurately recover latent dynamics. The application to real neural recordings using cross-validated metrics (normalized log-likelihood, R²) offers a framework for assessing performance on recorded data with unknown latent dynamics.
The authors could further report forward prediction metrics.
Theoretical Claims
There are no theoretical claims in the paper.
Experimental Design and Analyses
No
Supplementary Material
Yes A.1. and A.2
Extended Data Figures 1, 4, 5, 6, 2, 3
Relation to Broader Scientific Literature
The paper integrates and extends several strands of research in computational neuroscience and machine learning. First, it builds on previous approaches for modeling neural population activity via low-dimensional latent variables, such as GPFA, LFADS, and variants of switching linear dynamical systems. Most available frameworks consider latent dynamics that are deterministic, linear, or autonomous. The work further builds on the neural ODE literature (and related differential approaches) by adapting these techniques specifically to spiking neural data.
Essential References Not Discussed
- Genkin, M., Hughes, O., & Engel, T. A. (2021). Learning non-stationary Langevin dynamics from stochastic observations of latent trajectories. Nature communications.
- Genkin, M., & Engel, T. A. (2020). Moving beyond generalization to accurate interpretation of flexible models. Nature machine intelligence.
- Schimel, M., Kao, T. C., Jensen, K. T., & Hennequin, G. (2022). iLQR-VAE: control-based learning of input-driven dynamics with applications to neural data. ICLR.
- Zhao, Y., & Park, I. M. (2020). Variational online learning of neural dynamics. Frontiers in computational neuroscience.
Other Strengths and Weaknesses
Strengths:
- Well-organized and clearly written paper.
- Good performance on demonstrated datasets.
- Consistency of estimated dynamics across splits.
- Needs more evidence, but the bias estimation part is interesting, and to my knowledge the approach is novel.
Weaknesses:
- The method is rather technically involved, with multiple steps and hyperparameter dependencies, potentially limiting accessibility.
- Although a hyperparameter grid-search is described, additional explicit analyses demonstrating robustness or sensitivity across hyperparameter choices and initialisations is missing.
- Also see my questions below.
Other Comments or Suggestions
The authors comment on the identifiability of the observation model in the appendix, but they don't discuss the identifiability of the latent dynamics (drift and diffusion of the SDE) when observed through spikes.
I wonder how the method would compare, in terms of accuracy and the required amount of data, with the methods of Duncker et al., 2019 and Genkin & Engel, 2020, when applied to a synthetic dataset with autonomous latent dynamics.
We appreciate the reviewer’s thoughtful comments.
Necessity of gating: Kim et al., ICML, 2023 ([B]) shows that gating increases expressivity and trainability of the dynamics function. Their Fig. 1 suggests gating is necessary for correctly inferring dynamics in our synthetic dataset.
Flow fields and interpretability: We respectfully disagree with the reviewer that flow fields can be visualized for all other baselines. While rSLDS allows flow field visualization, SLDS, autoLFADS, and GPFA do not. rSLDS failed to capture neural data (Fig. 3d), likely due to learning a constant d, and not a time-varying d, so we did not plot flow fields estimated using rSLDS.
Previous work has proposed that the model is interpretable if it discovers representations of computation that we can identify in the low-D dynamics ([B]; Duncker et al., ICML, 2019). Accordingly, one way to quantify interpretability in the flow field is in terms of the log-likelihood and R^2 metrics of how well neural data is captured in low-D (L<=3). Using these metrics, FINDR was the only model that captured neural data well in low-D.
Task-irrelevant components and consistency in flow field: By design, task-irrelevant components do not use the task-relevant inputs u to capture within- and across-trial baseline firing activity. In Ext. Data Fig. 5, colored trajectories represent trial averages sorted by evidence strength (as in Fig. 4b), and the region inside the dotted line represents the part of the state space visited by single-trial trajectories.
Learning time-varying d improves performance (Ext. Data Fig. 5, last panel), and is necessary---when d is constant, the median evidence-sign conditioned PSTH R^2 is -0.009. We will include this result as the first panel, move the last panel next to this panel, and clarify the definition of the dotted line.
Limited accessibility: We will release our code with documentation and a tutorial notebook.
Model robustness: We find that the second best hyperparameter choice gives a representation consistent with the best choice. We also find a consistent representation when we train the model with a different initialization.
Identifiability: Please see response to Reviewer UnSA.
Additional comparisons with autonomous dynamics: While we did not directly compare to Duncker et al. and Genkin & Engel, we will add analysis showing that FINDR correctly captures autonomous limit cycle dynamics in a synthetic dataset.
Questions
(1) \tau=0.1s was a fixed hyperparameter (A.1.7), but the timescale of the dynamics is adaptive. That is, it depends on z and u through the gating function G. [B] analyzed expressivity as a function of \tau (their Fig. 3), and showed that as \tau increases, the drift function \mu needs more parameters to fit data well. Given the number of our model parameters and trial durations (<1s), \tau=0.1s was a reasonable choice.
(2) Yes, external inputs u are given to the model. We will clarify in our revision that in Fig. 3 and 4, u_t was 2-D: [0;0] (no click), [0;1] (right click), [1;0] (left click).
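For example, a click train could be binned into this 2-D input as follows (a sketch; the function, variable names, and bin size are illustrative and not our preprocessing code):

```python
import numpy as np

def encode_clicks(left_click_times, right_click_times, n_bins, dt=0.01):
    """Return u with shape (n_bins, 2): column 0 flags left clicks, column 1 right clicks."""
    u = np.zeros((n_bins, 2))
    for t in left_click_times:
        u[min(int(t / dt), n_bins - 1), 0] = 1.0   # [1, 0] = left click in this bin
    for t in right_click_times:
        u[min(int(t / dt), n_bins - 1), 1] = 1.0   # [0, 1] = right click in this bin
    return u                                        # rows equal to [0, 0] = no click
```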
(3-1) If there is a constant shift in dynamics that is independent of u, this will be included as bias d and not as part of z. For the task-irrelevant component, because we use a linear basis function model, the solution we obtain is analytical and unique, given the regularization coefficients, which we optimize using the validation dataset (A.1.1).
(3-2) We validated d_estimated against d_true using real datasets. Across-trial d_estimated closely matches true across-trial firing rate fluctuations, and within-trial d_estimated matches the observed PSTH (R^2=0.82). We will include relevant figures in the Appendix.
(3-3) Our ablation study (Ext. Data Fig. 5, last panel) shows that without estimating time-varying d, the model fails to capture data (median R^2: -0.009).
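To illustrate the point in (3-1) that the solution for d is analytic and unique given the regularization coefficients, here is a generic ridge-regression sketch for one neuron (the raised-cosine basis construction and our exact regularizer are not reproduced; names are illustrative):

```python
import numpy as np

def fit_bias(B, y, lam):
    """Ridge solution for basis-function weights: minimize ||B w - y||^2 + lam * ||w||^2.
    B: (n_timepoints, n_basis) raised-cosine basis evaluated at each time point,
    y: (n_timepoints,) firing-rate target, lam: regularization coefficient."""
    n_basis = B.shape[1]
    w = np.linalg.solve(B.T @ B + lam * np.eye(n_basis), B.T @ y)
    return B @ w   # estimated time-varying bias d for this neuron
```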
(4) Similar in spirit to the approach in Genkin & Engel, we split the data into five different folds to find features that are consistent across folds. We consistently find two approximate point attractors with the selection of \beta=2 (Ext. Data Fig. 4). We will mention this, and cite references suggested by the reviewer.
(5) We learned the parameters of the diagonal elements of \Sigma (L99, 2nd column, Eq. (25)) directly via SGD. For generating the ground truth latents, we first generated the latents and added noise to each timestep with N(0, \sigma^2=0.01). The inferred \Sigma, after affine transformation to match ground truth latents, was found to be [[0.026, 0.001]; [0.001, 0.018]].
(6) For Fig. 3, the inference of the task-irrelevant component doesn’t require GPUs, and takes <30 minutes. Task-relevant component (jax/flax-based) takes <1.5 hours per hyperparameter configuration on a single A100 GPU. It typically took total ~5 hours for FINDR to do grid search and complete training on our cluster. AutoLFADS (PyTorch) took ~6 hours. For SLDS, rSLDS and GPFA, no GPU was used, and typical runtime was <1 hour.
I thank the authors for the response! I appreciate the detailed explanations.
I agree that among the selected baselines only rSLDS allows for flow field visualization, but most state-space models, like the ones referenced in my Essential References Not Discussed, also do.
A limitation of the approach is the requirement of knowing the external inputs, but I find it acceptable given the other contributions, especially the fitting of the task-irrelevant component, which is often treated as a constant in similar models.
I consider that a comparison with one of the mentioned papers that consider autonomous latent stochastic dynamics would considerably improve the paper.
I have a small question in Eq. 25: Why do you pass the noise variables through the sigmoid function? Doesn't this limit the value of the noise that could be affecting the latent dynamics, or am I missing something?
Thank you for your response to our rebuttal! We will make sure to cite the references in the Essential References Not Discussed.
While external inputs u were given to FINDR in our application, it is relatively straightforward to build a controller to infer inputs, similar to how it is done in LFADS, without altering the training pipeline. Thus, we don’t necessarily see this as a limitation of the method itself, though inferring inputs has not been done here.
We also appreciate your question regarding the sigmoid. In Eq. (1), Cz=C*(1/a)*a*z, and we can define C’=C*(1/a) and z’=a*z for any scalar a (for more details, please see Identifiability in the response to Reviewer UnSA). Thus, the sigmoid, in principle, shouldn’t affect model expressivity. Empirically, we found that using the sigmoid instead of the softplus function was helpful for training stability.
This study develops a new method for latent variable inference from neural data. The essence is a sequential variational autoencoder. The main innovation is a type of prior which encourages the latent variables to satisfy an ODE. Using this method, the study shows that low-dimensional latent variables are recovered in synthetic examples. When compared with other methods that are also limited to low-dimensional latent variables, the performance on real data is better for very low dimensions and comparable for slightly higher dimensions. The reviewers appreciated that the paper makes a case for acceptance at ICML but also raised suggestions, both for citation of additional literature and for additional analyses and results, which are expected to be implemented in a revision.