PaperHub
Overall rating: 8.2/10 · Oral · 4 reviewers
Reviewer ratings: 5, 5, 5, 5 (min 5, max 5, std 0.0) · Confidence: 3.8
Novelty: 3.3 · Quality: 3.0 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

High-dimensional neuronal activity from low-dimensional latent dynamics: a solvable model

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

We show that high-dimensional neural activity can arise from low-dimensional latent dynamics, both in RNNs and in the brain.

Abstract

Keywords
recurrent neural networks, neuronal recordings, visual cortex, latent variable models, PCA, eigenvalue decay, mean-field limit

Reviews and Discussion

Review (Rating: 5)

This paper introduces theoretical models and empirical results supporting the hypothesis that high-dimensional neural activity can arise from a low-dimensional latent space combined with nonlinear activations. The paper begins by presenting an RNN model whose dynamics can be explicitly solved. This allows for direct analysis of the covariance eigenspectrum of the model’s post-activation activity, which the paper shows to be high-dimensional. The paper then relaxes assumptions on this solvable model and proposes a conjecture about the decay exponent of the covariance eigenspectrum in this relaxed model. Finally, the paper develops the Neural Cross-Encoder (NCE), a nonlinear autoencoder-inspired architecture, which they apply to neural recordings from mice during a visual task. They find that the NCE distinguishes between response types with low- and high-dimensional pre-activations.

Strengths and Weaknesses

Strengths: This paper tackles an interesting and fundamental question about how we should interpret and test for latent dimensionality in neural population activity. Overall, the paper is easy to read and the technical steps are clearly explained. The solvable RNN model presented in Section 2 is relatively simple yet illustrative of their hypothesis that post-activation activity may still be generated from a low-d pre-activation latent space. Their overarching question and empirical results in Section 4 are well-motivated by real findings from visual cortical population activity.

Weaknesses: My main concern with this paper is about the cohesiveness between the theoretical results presented in Sections 2-3 and the empirical results presented in Section 4. Right now, the paper reads as three separate contributions across these sections, and it is not clear to me how the theoretical models allow us to reinterpret the notion of dimensionality in neural recordings. I would have liked to see some experiments about how well Conjecture 1 holds for fitted NCE models of different dimensionalities d in Section 4. Otherwise, I am not sure about the role of Section 3 in this paper: the conclusion that high-d activity might arise from a nonlinear transformation of low-d states is already apparent from the solvable RNN model.

I also have some concerns about the Neural Cross-Encoder methodology. In 4.2, the authors state that the NCE has a 3-layer encoder to allow for flexible estimation of latent variables; however, the choice of 3 layers is a bit arbitrary and might also lead to overfitting in some cases. I think the authors should address how one would decide the number of encoder layers / flexibility of the encoder. It is also not clear to me why the NCE is better at discarding variability that is not shared across neurons compared to a standard autoencoder fit on the entire set of neurons, since that should also discard such variability. I would have liked to see a more detailed explanation to motivate NCE as well as empirical comparisons with the standard autoencoder.

Finally, some suggestions on clarity:

  • On line 135 the authors introduce the notation $\mathbf{z}$, and it would help to draw the connection to $\boldsymbol{\kappa}$ from the previous section.
  • Line 200 typo in “variability”

Questions

  • In Fig 3I, it seems that the linear model is overfitting for large d while the NCE does not. Why do you think this is the case, especially since the NCE is more flexible than the linear model?
  • How does the NCE deal with noise in the observed neural recordings and still extract seemingly meaningful latent variables (at least for the drifting gratings and spontaneous response types)?

Limitations

The authors address limitations about potential difficulties when fitting NCE, and in particular they point out that even if NCE is unable to find a well-fitting model, this does not mean such a model doesn't exist. I agree with this limitation and I think it could be better incorporated into the interpretation of results (section 4.3) as well. For example, on line 217 the authors write that the results on natural images suggest that natural images produce high-d representations in the space of pre-activations. However, according to the limitation, this is not necessarily the case -- it could also be due to getting stuck in local optima during training.

Justification for Final Rating

The authors have sufficiently addressed my concerns and questions during the rebuttal period. I appreciate the additional text in the introduction which clarifies the link between the theoretical and empirical contributions of the paper. I also am pleased to see the additional experimental results showing that the NCE is quite robust to varying the layer size, number of layers, and numbers of cells. The authors also provided some helpful insight into the differences between the NCE and standard autoencoder, as well as the NCE's ability to extract latents given noisy observations. Thus, I chose to raise my score to a 5.

Formatting Concerns

N/A

Author Response

The reviewer gives an accurate and comprehensive summary of the paper, stating that we tackle “an interesting and fundamental question”.

Answers to the weaknesses

1. Lack of cohesiveness between the theoretical results presented in Section 2-3 and the empirical results presented in Section 4.

We use the experimental data (section 4) to provide examples confirming the theoretical result (sections 2,3) that high post-activation dimension can arise from either low or high pre-activation dimension: drifting gratings and spontaneous activity (high post-activation dimension, low pre-activation dimension) vs. natural image responses (high post- and pre-activation dimension).

To clarify the link between the three result sections, we have added the following text in the introduction

Here, we first construct a solvable RNN model that reconciles the low- and high-dimensional perspectives on population activity by carefully disambiguating the linear dimension of the system before and after the neurons' nonlinearity, which we refer to as the pre- and post-activation dimension, respectively. This dichotomy refines the usual distinction between linear and “intrinsic” dimension (Stringer et al. Nature 2019; Jazayeri and Ostojic CONB 2021; Humphries NBDT 2021), since the intrinsic dimension of a system is the same before and after any continuous, injective nonlinearity. Using the notions of pre- and post-activation linear dimensions, we show that our RNN can be exactly reduced to a low-dimensional dynamical system in the space of pre-activations, making the pre-activations low-dimensional. Then, we show that these latent dynamics generate high-dimensional post-activation activity that has a power-law covariance eigenspectrum. (In this work, dimension will always refer to linear dimension, unless stated otherwise.)

Before analyzing experimental recordings, we revisit the spectral theory of infinite-width neural networks (random feature kernels) [Refs.] to quantitatively relate the pre-activation dimension, the neuronal activation function, and the post-activation covariance eigenspectrum. This three-way relationship tells us that high-dimensional activity is consistent with both low- and high-dimensional pre-activations. [...]
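As a minimal illustration of this point, consider the following toy sketch (not the paper's exact model or parameters): a ring of tuned neurons driven by a single circular latent has pre-activations confined to a 2-d linear subspace, yet a rectified-power nonlinearity spreads the post-activation variance over many dimensions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, p = 500, 2000, 2.0                   # neurons, samples, rectification power (illustrative)
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)   # preferred angles on a ring
z = rng.uniform(0, 2 * np.pi, T)           # a single circular latent variable

# pre-activations lie in a 2-d linear subspace spanned by (cos z, sin z)
x = np.cos(theta[None, :] - z[:, None])    # shape (T, N)
# post-activations: rectified power nonlinearity
r = np.maximum(x, 0.0) ** p

def cov_eigenspectrum(M):
    return np.linalg.eigvalsh(np.cov(M.T))[::-1]

pre, post = cov_eigenspectrum(x), cov_eigenspectrum(r)
print("pre-activation eigenvalues above 1e-8:", int((pre > 1e-8).sum()))    # -> 2
print("post-activation eigenvalues above 1e-8:", int((post > 1e-8).sum()))  # -> many
```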

1b. I am not sure about the role of Section 3 in this paper

Section 2 provided one example of a phenomenon described more generally in section 3: determining the pre-activation dimensionality from post-activation eigenspectrum is an ill-posed problem. We add to this a specific conjecture relating three concepts: the post-activation and pre-activation dimensionalities, and neuronal nonlinearities. While this conjecture is not required for Section 4, we believe it is of sufficient general interest to warrant inclusion.

2. How well does Conjecture 1 hold for fitted NCE models of different dimensionalities in Section 4?

This is a natural question to ask. However, answering it would require a reliable and accurate method for estimating the eigenspectrum decay rate of noisy (and finite) data, and we have not yet found a satisfying method for this, which prevents us from answering the question.

3. 3 layers is a bit arbitrary and might also lead to overfitting in some cases. I think the authors should address how one would decide the number of encoder layers / flexibility of the encoder.

We agree that the current justification included in the manuscript for the selection of hyperparameters, i.e. the number/width of encoder layers and the number of source/target neurons is not sufficient. We present additional data to justify these decisions.

The results are only mildly sensitive to the depth and width of the encoder, which were selected via grid search on an example dataset. To demonstrate the minor effect of encoder architecture, we present the fraction of explainable variance explained for two example datasets as we vary the number of layers and the size of the first layer. For multi-layer encoders, each successive layer is smaller by a factor of two.

Drifting gratings (d = 4):

Layer size \ # Layers      3       2       1
1000                       0.97    0.97    0.96
500                        0.97    0.97    0.94
250                        0.95    0.96    0.94

Natural images (d = 32):

Layer size \ # Layers      3       2       1
1000                       0.91    0.89    0.89
500                        0.91    0.92    0.89
250                        0.92    0.91    0.91

Varying the number of source and target neurons up to 2000 does not substantially affect the results shown in Figure 3. We selected 500 and 1000 as intermediate values for computational reasons, since we fit 13 models with 5 shuffles each in 10 sessions for a total of 650 models (we have added 3 more experimental sessions since the original submission, confirming and strengthening results). To illustrate this stability, in the table below, we show the fraction of explainable variance explained for a model with d = 4 on drifting grating responses, for NCE and RRR models, across combinations of source and target cells.

N_target \ N_source    100                      500            1000           2000
100                    NCE: 0.98 / RRR: 0.92    0.96 / 0.88    0.96 / 0.86    0.97 / 0.86
500                    0.98 / 0.91              0.97 / 0.85    0.94 / 0.84    0.93 / 0.84
1000                   0.98 / 0.89              0.97 / 0.84    0.94 / 0.83    0.91 / 0.83
2000                   0.99 / 0.90              0.97 / 0.85    0.97 / 0.87    0.92 / 0.84

Increasing the number of cells beyond 2000 somewhat reduces performance and we suggest two potential explanations. First, due to ethical constraints our experiments are of limited duration, hence a typical session produces ~1000 training examples. This may not be enough to train a relatively large encoder network with >1000 input neurons. Second, only a fraction of neurons recorded have reliable stimulus responses, and including more cells introduces a large amount of non-stimulus-related variance.

Of the remaining hyperparameters, we have found that the performance is not sensitive to variations in learning rate, batch size, and momentum by exploring the hyperparameter space with the Optuna optimizer (Akiba et al. KDD 2019).

We will repeat the measurements shown in the tables above across multiple sessions and different values of d, and include them as supplementary figures in the appendix. We will also include in the appendix the details of hyperparameter selection, described above.

4. Why is NCE better at discarding variability that is not shared across neurons compared to a standard autoencoder fit on the entire set of neurons, since that should also discard such variability?

A standard autoencoder would discard non-shared variability only when the bottleneck d is small. When d is large (without suitable regularization), nothing prevents an autoencoder from accounting for non-shared variability. For example, if d = N, an autoencoder could learn a function that is close to the identity function (see Steck NeurIPS 2020 and references therein). Because we want to look at the effect of varying d, we need to exclude the tendency of a standard autoencoder to predict independent variability better with larger d, which precludes estimating the pre-activation dimensionality.

By design, NCE discards variability that is not shared across neurons, since it predicts the activity of target neurons from a disjoint set of source neurons, which prevents learning the identity function.
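For readers following along, here is a minimal sketch of such a cross-encoder, assembled from details given elsewhere in this rebuttal (a 3-layer ReLU encoder, a shared rectified-power exponent p, readout weights u, biases c, and baseline rates r). Layer sizes are illustrative, training specifics (knowledge-distillation pretraining, iterative optimization) are omitted, and this is not the authors' code.

```python
import torch
import torch.nn as nn

class NeuralCrossEncoder(nn.Module):
    """Sketch of a cross-encoder: latents are inferred from SOURCE neurons and used to
    predict a DISJOINT set of TARGET neurons, so the identity map is unavailable."""

    def __init__(self, n_source, n_target, d, hidden=(500, 250, 100)):
        super().__init__()
        layers, width = [], n_source
        for h in hidden:                              # 3-layer ReLU encoder
            layers += [nn.Linear(width, h), nn.ReLU()]
            width = h
        layers.append(nn.Linear(width, d))            # d-dimensional latent bottleneck
        self.encoder = nn.Sequential(*layers)
        self.readout = nn.Linear(d, n_target)         # readout weights u and biases c
        self.log_p = nn.Parameter(torch.zeros(()))    # shared activation exponent p
        self.baseline = nn.Parameter(torch.zeros(n_target))  # baseline rates r

    def forward(self, a):                             # a: activity of source neurons
        z = self.encoder(a)                           # inferred latents (pre-activations)
        p = self.log_p.exp()
        return torch.relu(self.readout(z)).pow(p) + self.baseline  # b = phi_p(u z + c) + r

# usage sketch: predict 1000 target neurons from 500 disjoint source neurons with d = 4
model = NeuralCrossEncoder(n_source=500, n_target=1000, d=4)
a = torch.randn(128, 500)          # placeholder source activity
b_tilde = torch.rand(128, 1000)    # placeholder recorded target activity
loss = ((model(a) - b_tilde) ** 2).mean()   # squared-error loss ||b - b_tilde||^2
loss.backward()
```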

Suggestions on clarity

On l. 135, we will replace “denoted $\mathbf{z}$” by “henceforth denoted by $\mathbf{z}$ (instead of $\boldsymbol{\kappa}$)”, and we will correct the typo on l. 200.

Answers to questions

  1. The iterative optimization procedure for the NCE described in the Appendix, where the model with the best validation loss is selected at each phase, has an effect similar to early stopping and helps prevent overfitting. We believe this keeps the NCE from overfitting to the noise in the data, whereas L2 regularization is not sufficient to prevent overfitting in high-d linear models. Note that when training the linear model, at each d we train with 6 different values of the regularization parameter, and we have selected the range such that the chosen models are not at the extrema, so increasing the L2 regularization does not prevent the overfitting (a sketch of this baseline is given after this list). Note that all reported scores are on a held-out test set.
  2. This is a good question because the encoder in NCE is taking noisy inputs (the activity of the source neurons). Our best guess for how NCE can extract seemingly noiseless latents is the following: The use of cross-prediction during training strongly encourages NCE to learn encoder parameters leading to the inference of latents representing shared and distributed axes of variability across the population (in the space of pre-activations). To infer these latents, the weights in the encoder have to be distributed. In turn, such distributed weights help the encoder to produce outputs where input noise is averaged out.
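Below is a sketch of the ridge-regularized reduced-rank regression baseline mentioned in point 1 above (a generic RRR recipe with placeholder data and an illustrative regularization grid, not necessarily the exact implementation used in the paper):

```python
import numpy as np

def reduced_rank_ridge(A, B, d, lam):
    """Rank-d linear prediction of target activity B (trials x n_target) from source
    activity A (trials x n_source): ridge solution, then projection of the fitted
    values onto their top-d principal subspace."""
    W_full = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ B)
    _, _, Vt = np.linalg.svd(A @ W_full, full_matrices=False)
    P = Vt[:d].T @ Vt[:d]              # rank-d projector in target-neuron space
    return W_full @ P

# sweep d and a small grid of ridge strengths, keeping the best held-out score
rng = np.random.default_rng(0)
A_tr, B_tr = rng.standard_normal((800, 500)), rng.standard_normal((800, 1000))
A_te, B_te = rng.standard_normal((200, 500)), rng.standard_normal((200, 1000))
for d in (1, 2, 4, 8, 16):
    scores = []
    for lam in np.logspace(-2, 3, 6):  # 6 regularization values, as described above
        W = reduced_rank_ridge(A_tr, B_tr, d, lam)
        scores.append(1 - ((B_te - A_te @ W) ** 2).mean() / B_te.var())
    print(d, round(max(scores), 3))
```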

Limitations

Regarding the reviewer’s comment under Limitations, we will add, on l. 218, “(but see Limitations below)”.

Final remarks

In conclusion, we hope that the text we added to the introduction (which should improve the cohesiveness of the paper), the additional analyses and explanation of the choice of NCE hyperparameters, and the clarification of the difference between NCE and a standard autoencoder will prompt the reviewer to reconsider their rating.

Comment

Thank you for the thoughtful and detailed response to my concerns. I appreciate the additional text in the introduction which clarifies the link between the theoretical and empirical contributions of the paper. I also am pleased to see the additional experimental results showing that the NCE is quite robust to varying the layer size, number of layers, and numbers of cells. The authors also provided some helpful insight into the differences between the NCE and standard autoencoder, as well as the NCE's ability to extract latents given noisy observations. Again, thanks to the authors for their hard work and clear responses during this rebuttal period! I will raise my score to a 5.

Comment

Thank you for your thorough and thoughtful reviews; your suggestions have improved the overall quality of the paper. We appreciate your time and effort, and thank you for reconsidering your score.

Review (Rating: 5)

The authors provide theoretical proof, using an analytically solvable recurrent neural network model, that high-dimensional observations can be generated from low-dimensional latent variables. Further, they demonstrate that the generated high-dimensional observations can have a long-tailed covariance eigenspectrum, as is commonly seen in real neural data. Finally, the authors present a novel nonlinear latent variable model for calcium recordings, called the Neural Cross-Encoder, and use it to demonstrate that high-dimensional neural activity in some regimes (i.e., spontaneous activity; responses to drifting gratings) can be described using low-dimensional latents, while in other regimes (i.e., responses to natural images) it cannot.

Strengths and Weaknesses

Strengths

  • The theoretical/analytical work presented in Sections 2 and 3 is the strongest part of the manuscript. Proofs and derivations (not closely checked) seem complete and well-supported by the cited prior work.

  • Proofs provided in the appendix were helpful for understanding the main results / claims in the manuscript.

Weaknesses

  • The relevance of the experimental analyses to the theoretical work in sections 2 and 3 is not particularly clear. Moreover, the results/claims in section 4 (i) are not particularly novel / surprising, (ii) can be improved with additional experimental details and a few more/modified analyses, and (iii) do not necessarily help prove / support the theoretical claims made previously – as currently presented. Please see questions below for specific recommendations and points of confusion where additional clarification is needed.

Questions

  1. It would be helpful if the authors could more clearly communicate that the analyses provided in section 4 are not related to the theoretical work in section 2 (i.e., solvable RNN model), but rather only for supporting / validating the claims made in section 3 (i.e., the relationship between pre-activation dimensionality, activation function, and post-activation eigenspectrum). Also, it would be informative/interesting if the authors visualized the covariance eigenspectrum of the true vs. model-predicted neural activity. As the authors pointed out, the goal here isn’t to estimate (or predict/recreate) the tail of the eigenspectrum, but it would be interesting to see how the decay changes as a function of dimensionality + activation function in real data.
  2. How are the pre-activation dimensionality results in section 4 novel / surprising / noteworthy? First, natural images are statistically complex stimuli and so it is not entirely surprising that their latent representations are of higher dimensionality than the drifting gratings. Second, how do we know that the results are not a byproduct of the model class used in NCE? It might be that for more complex stimuli, which require more complex neural encoding, the current encoder architecture choice is not sufficiently descriptive to learn the complex neural encoding [1]. Third, comparing against only a linear activation function seems insufficient because calcium recordings are known to be nonlinear. It would be interesting / more compelling if the authors explored using different activation functions and looked at the impact on explained variance and the covariance eigenspectrum. The analyses provided in section 3 only considered general rectified power activation functions so it would be more compelling / demonstrative if the authors compared the performance of NCE within this family of functions (not just against linear, i.e., reduced rank regression).
  3. It would be helpful if the authors could provide additional information in the Appendix about which specific regions were recorded and roughly how many neurons in each of the regions were determined “responsive”. Were neuron splits between source and target region agnostic (the authors stated that they collected data from primary and higher visual cortices)? It would be helpful if authors could also provide information on how many neurons and from which regions were assigned to source and target. This information may be important towards interpreting the results for each of the three task conditions, especially gratings vs natural images.
  4. What was the encoder architecture used in NCE? Please add this to the Appendix.
  5. Is the analysis presented in section 2 only limited to neurons on the unit circle/sphere manifold? Would the authors expect that this analysis could generalize to other manifolds?

Minor

  • Could the authors provide an explanation as to why, specifically for the spontaneous condition, multiple timepoints were used as input for estimating the latent variable at a single timepoint? Was something similar attempted for the stimulus conditions as well? Why/why not? If yes, what were the results?
  • Was there a particular reason why the authors first pretrained the NCE model using knowledge distillation rather than directly training on the target neurons?
  • In Equation (2), how did the authors choose the values for $J$ and $\Delta$?
  • Could the authors clarify the following: “...per experiment, across 7 experiments (3 spontaneous activity, 3 natural images, 1 drifting gratings, across three mice)...” Does this mean 7 experiments were conducted for each of the 3 mice or across the 3 mice aggregated (i.e., only 1 mouse had the drifting gratings stimulus or all 3).
  • Fig. 3F would benefit from a few more y-axis labels for the points not close to the extrema.
  • In line 101, minor typo that it should be a pair of neurons $i$ and $j$. Also minor typo line 905 in Appendix: “standard deviation [of]”. Also, the equations following lines 980 and 981 in Appendix E have a reversed bracket in the Appendix, e.g., $\theta \sim \mathcal{U}(\left[0, 2\pi\right])$
  • In Appendix line 1044 (last line of page 27), it would be useful to remind readers of what $\mathcal{F}$, $a$, and $b$ denote, especially as $\mathcal{F}$ and $a$ were not defined.
  • Clarification for the equations after line 897 of Appendix A: is that an inner product or elementwise product in the integral: $(\cos(\theta), \sin(\theta)) \cdot (\cos(v), \sin(v))$?

References

[1] D.L.K. Yamins, H. Hong, C.F. Cadieu, E.A. Solomon, D. Seibert, & J.J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex, Proc. Natl. Acad. Sci. U.S.A. 111 (23) 8619-8624, https://doi.org/10.1073/pnas.1403112111 (2014).

Limitations

Limitations with respect to the experimental results in section 4 and the NCE model architecture are mostly sufficient. I recommend the following revision: “The fact that we were not able to accurately predict neuronal responses to natural images with a low-dimensional NCE model does not necessarily exclude the possibility for such a model to exist, and one could possibly find it with a larger training set or a different model class/hyperparameter configuration.” This is because the proposed NCE model may not have sufficient descriptive power/modeling capacity to robustly identify the shared variance in populations of neurons, especially in the context of more semantically complex stimuli wherein the encoding is less clear / more distributed [1,2].

Also, depending on the answer to my 5th question above, it may also be worth explicitly mentioning that the theoretical analysis provided in section 2 is limited to only specific manifold structures (e.g., unit ring/sphere).

References:

[1] Bolaños, F., Orlandi, J.G., Aoki, R. et al. Efficient coding of natural images in the mouse visual cortex. Nat Commun 15, 2466 (2024). https://doi.org/10.1038/s41467-024-45919-3

[2] Michael Weliky, József Fiser, Ruskin H Hunt, David N Wagner. Coding of Natural Scenes in Primary Visual Cortex. Neuron 37(4), 703-718, 2003.

Justification for Final Rating

The authors present novel theoretical work that aims to address a very important question in neuroscience: the relationship between high-dimensional observable neural activity and the low-dimensional latent dynamics that drive it. In addition to complete derivations and proofs, the authors also provide relatively complete empirical demonstrations supporting their theoretical work. Moreover, the authors provided ample ablations, clarifications, and explanations during the rebuttal period that better demonstrated the robustness of their proposed model and its ability to interrogate the true pre-activation dimensionality in the measured system. Although I believe certain open questions remain (i.e., extending the theoretical work towards more general manifold definitions, demonstrating how well their model can reconstruct the covariance eigenspectrum of the observations), the work presented here will be an important contribution to the field. I recommend the paper for acceptance.

Formatting Concerns

N/A

Author Response

We thank the reviewer for the high quality of their review and for highlighting the strengths and clarity of our theoretical results. Below are our answers to their questions.

Question 1

We understand what the reviewer means when they challenge the link between Sections 2 and 4 but we would like to argue that there is a clear line of reasoning connecting them.

Specifically, we use the experimental data to provide examples confirming the theoretical result that high post-activation dimension can arise from either low or high pre-activation dimension: drifting gratings and spontaneous activity (high post-activation dimension, low pre-activation dimension) vs. natural image responses (high post- and pre-activation dimension).

To clarify this line of reasoning, we have added the following piece of text in the introduction:

Here, we first construct a solvable RNN model that reconciles the low- and high-dimensional perspectives on population activity by carefully disambiguating the linear dimension of the system before and after the neurons' nonlinearity, which we refer to as the pre- and post-activation dimension, respectively. This dichotomy refines the usual distinction between linear and “intrinsic” dimension (Stringer et al. Nature 2019; Jazayeri and Ostojic CONB 2021; Humphries NBDT 2021), since the intrinsic dimension of a system is the same before and after any continuous, injective nonlinearity. Using the notions of pre- and post-activation linear dimensions, we show that our RNN can be exactly reduced to a low-dimensional dynamical system in the space of pre-activations, making the pre-activations low-dimensional. Then, we show that these latent dynamics generate high-dimensional post-activation activity that has a power-law covariance eigenspectrum. (In this work, dimension will always refer to linear dimension, unless stated otherwise.)

Before analyzing experimental recordings, we revisit the spectral theory of infinite-width neural networks (random feature kernels) [Refs.] to quantitatively relate the pre-activation dimension, the neuronal activation function, and the post-activation covariance eigenspectrum. This three-way relationship tells us that high-dimensional activity is consistent with both low- and high-dimensional pre-activations. [...]

Hence, we need latent variable models to infer the pre-activation dimensionality (Section 4).

Regarding the second part of Question 1, we agree with the reviewer that it would be interesting to see how the eigenspectrum decay rate changes with the dimensionality of the latents and neuronal nonlinearities on NCE fitted to neural data. Unfortunately, we are not able to do this analysis at this stage because we do not have a method for estimating the decay rate of the eigenspectrum that is reliable and accurate enough, when the data is noisy and the number of neurons is relatively small (the number of target neurons is 1,000).

Question 2

(First)

The idea that neural activity is low-dimensional is dominant in the neuroscience literature (e.g. Gao & Ganguli CONB 2015, Gallego et al Neuron 2017 and refs. within). Recent observations of high post-activation dimension have challenged this view, but leave open the possibility that pre-activation dimension could be high or low.

For example, whether spontaneous activity in visual cortex has high or low pre-activation dimensionality is not known: the study of (Stringer et al. Science 2019) and the recent work of (Manley et al. Neuron 2024) stress the high-dimensional aspect of spontaneous activity and do not suggest the pre-activations could be low-dimensional. Hence, the low-dimensional pre-activations we find during spontaneous activity are novel. We consider low pre-activation spontaneous dimensionality our main empirical finding.

(Second)

An important difference between this work and previous studies (eg Yamins et al) is that we are predicting neurons from neurons, rather than from images. Yamins and others use deep networks to process images, but a single layer to predict neurons from latents.

We will add to the Limitations the suggested text. Also, to demonstrate the robustness of results to architecture changes, we present the fraction of explainable variance explained for two example datasets as we vary the number of layers, and the size of the first layer. For multi-layer encoders, each successive layer is smaller by a factor of two.

Drifting gratings (d = 4):

Layer size \ # Layers      3       2       1
1000                       0.97    0.97    0.96
500                        0.97    0.97    0.94
250                        0.95    0.96    0.94

Natural images (d = 32):

Layer size \ # Layers      3       2       1
1000                       0.91    0.89    0.89
500                        0.91    0.92    0.89
250                        0.92    0.91    0.91

Furthermore, because we use the same model architecture across all experimental conditions (drifting gratings, natural images, and spontaneous activity), the comparison between these three conditions says something about the data that is not simply a byproduct of the model architecture. In addition, the fact that NCE always performs above the linear baseline excludes trivial cases of severe under- or over-fitting, which could have led to misleading results.

(Third)

We followed the reviewer’s suggestion and performed an additional analysis comparing NCE with varying fixed activation parameters p. We focused on the case of drifting gratings with d = 4 and fitted NCE with p ranging from 0.5 to 4. The results are summarized in the table below.

p       R²
0.5     0.155
0.75    0.161
1.0     0.165
1.25    0.167
1.5     0.167
1.75    0.169
2.0     0.169
3.0     0.164
4.0     0.110

These results indicate that within the standard range p ∈ [1, 2] (the fitted p is 1.89 in the original analysis), the R² is relatively stable. This numerical experiment suggests that the estimation of the pre-activation dimensionality should not be too sensitive to p, as long as p is within the standard range p ∈ [1, 2].

Question 3

We will add to the appendix:

The recordings covered part of primary visual cortex and the higher visual areas AM, PM, LM, and RL, with occasional inclusion of some somatosensory areas. We confirmed the recorded area via retinotopic mapping using sparse noise stimuli. To determine stimulus responsiveness, we used the signal-related variance metric to determine what fraction of a neuron’s variance could be explained by stimuli (described in Stringer et al. Nature 2019). In a typical recording with 22,000 neurons, the top 1000 cells had 49% stimulus-related variance, the next groups of 1000 had 22%, 14%, 10%, and so on, with the 10,000th cell at ~3%. The responsive neurons were distributed across primary and higher visual cortices, and the selection of source and target neurons was agnostic to location.
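One common split-half recipe for this kind of signal-related variance metric is sketched below (a hedged illustration; the exact computation in the paper, following Stringer et al. Nature 2019, may differ in detail):

```python
import numpy as np

def signal_related_variance(r1, r2):
    """Fraction of each neuron's variance that is stimulus-related, estimated from two
    repeats r1, r2 of shape (n_stimuli, n_neurons): the covariance of a neuron's
    responses across repeats estimates its signal variance, divided by total variance."""
    r1c, r2c = r1 - r1.mean(0), r2 - r2.mean(0)
    signal_var = (r1c * r2c).mean(0)              # per-neuron covariance across repeats
    total_var = 0.5 * (r1c.var(0) + r2c.var(0))
    return signal_var / total_var

# placeholder usage: rank neurons and keep the most stimulus-driven ones
r1, r2 = np.random.rand(700, 2000), np.random.rand(700, 2000)
frac = signal_related_variance(r1, r2)
most_responsive = np.argsort(frac)[::-1][:1000]
```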

We agree that comparing the latent dimensionality of responses across different visual areas is an interesting question, though it is outside of the scope of this work.

Question 4

We will add to the appendix:

The encoder is a 3-layer feedforward network with ReLU activation functions. The hidden layers contain (500, 250, 100) units.

Question 5

We believe that the general idea of having a recurrent network with low-dimensional dynamics but high-dimensional activity is a general phenomenon that is not specific to the circle manifold we consider. However, finding a model whose latent dynamics and covariance eigenspectrum can be analytically solved, that is not a ring model, remains an open theoretical problem.

Minor comments

  1. The stimulus responses are the mean activity over a response window of 2.0s for gratings and 0.8s for natural images, and this averaging changes the statistics of the activity and helps reduce the shot/independent noise. To have a similar effect in spontaneous activity, we chose to use multiple timepoints in the source neurons to better estimate the latents.
  2. We found that the knowledge distillation method allowed the NCE to learn in fewer steps.
  3. We chose $J$ and $\Delta$ such that, in Eq. (5), the expression for the latent dynamics is as simple as possible.
  4. The experiment numbers reported are aggregated across all mice. Since the submission, we have expanded the analysis to additional sessions of drifting gratings, such that we now have three mice with at least one session of spontaneous activity, images, and gratings, which has confirmed the results presented.
  5. We will add intermediate ticks to 3F.
  6. We will correct these typos.
  7. We will update the notation to include $\tilde{b}$ to denote the recorded neural activity of the target neurons, and we will clarify the definitions of $a$ as the recorded activity of source neurons and $b$ as the predicted activity of the target neurons. Hence the expression for the loss in l. 1044 will be updated to $\|b - \tilde{b}\|^2$, where $b = \phi(u \mathcal{E}(a) + c) + r$, and we will remove $\mathcal{F}$.
  8. It is an inner product. We will clarify this in the Appendix.

Final remarks

In summary, the experimental results demonstrate the theoretically predicted consistency of either low or high pre-activation dimension with high post-activation dimension. The low spontaneous pre-activation dimensionality our method finds is a novel finding. We have also provided additional evidence showing that in the neuron-to-neuron prediction setting, the complexity of the encoder may not be a limiting factor since our results are robust to changes of encoder architecture. In light of the clarification we have provided and the additional analyses we have performed, we hope that the reviewer will reconsider their rating.

Comment

Thank you for the very thorough and complete response to my questions and comments. Your explanations, additional analyses, and planned revisions to the manuscript have resolved any questions/doubts that I had. I will raise my score to a 5.

Question 1

Thank you for the explanation. The connection between sections 2 and 4 are clearer now. Further, the rationale given with respect to estimating the eigenspectrum’s decay rate is fair.

Question 2

Thank you for clarifying the distinction between pre- and post-activation dimensionality. This makes the contribution of your work clear and distinct. The ablations also help to better demonstrate the robustness of your model/approach.

Once again, thank you for your efforts and the thoughtful rebuttal.

Comment

Thank you for your thoughtful review, we believe the suggested revisions have improved the paper. We appreciate your time and effort in closely engaging with our work, and thank you for reconsidering your rating.

Review (Rating: 5)

The contributions of the paper are twofold: First, it presents an analytically solvable recurrent neural network model that combines effectively low-dimensional dynamics of the neuron activations in the pre-activation space (activations inside the output function) with very high-dimensional dynamics, exhibiting a power-law eigenspectrum, for the post-activations (after the output nonlinearity). The eigenspectrum of the post-activations is derived for different types of low-dimensional solutions for the pre-activations, and, using RKHS theory, a relationship between the form of the activation function (rectified power function) and the decay exponent of the spectrum is derived and numerically verified.

The second contribution is an analysis of the estimated pre-activation dimensionality of mouse visual cortex responses using a neural cross-encoder architecture. This analysis shows low dimensionality for grating responses and high dimensionality for natural image responses.

Strengths and Weaknesses

The mathematical analysis part of the paper is very strong and combines methods from probability theory, neural field theory, and functional analysis / RKHS theory to derive and analyze the solutions of recurrent networks with very specific assumptions for the lateral coupling structure.

It is not entirely clear to what extent the results break down for more general forms of lateral connectivity (different width of the interaction kernel, deviation from the cosine shape). Generality in the presence of such changes would be required for the model to be valid for larger classes of real neurons.

Questions

What exactly of the model in (13) is fitted to the neural activities?

Apparently sloppy notation in L. 117 ff.: for even $m$ the eigenvalues should always be 0. Should not $\sin(m\theta)$ be the eigenfunction for the odd ones?

L. 957 (Appendix): should the second $z_s$ be $z_t$?

Is the factor $2/\pi$ in the equation below L. 921 correct?

(17) => C_{ij} = 4 * [term in (18)] - 4 * (1/4)

Limitations

Limitations are thoroughly discussed.

Justification for Final Rating

The authors' feedback is plausible and addresses my concerns as far as possible. I thus support the publication of this paper.

Formatting Concerns

none

Author Response

The reviewer highlights the strengths of the theory we present in the paper and they make detailed and accurate comments on the mathematical parts of the paper, which are very much appreciated.

Answers to Strengths and Weaknesses

The reviewer asks whether the solvable model can be generalized to lateral connectivity that is not of cosine form, Eq. (2). This is an excellent question. In fact, we can get a very similar solvable model with a more localized connectivity profile by replacing the cosine connectivity, Eq. (2), with a connectivity of the form $W_{ij}=J\sum_{l=1}^4 \beta_l \cos(l(\theta_i - \theta_j) - \Delta)$. This extended model is also solvable and can produce a limit cycle on the circle in the space of pre-activations while exhibiting high-dimensional post-activations, as the original model does. The analysis of this extended model will be presented in a new section in the Supplementary Material and briefly mentioned in the main text.
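Both connectivity profiles are straightforward to construct; the sketch below uses illustrative values of J and Δ, with the Fourier coefficients (β_1, ..., β_4) = (1/3, 1, 1, 1/3) quoted in the follow-up comment further down this thread.

```python
import numpy as np

N, J, Delta = 1000, 1.0, 0.5                 # illustrative values, not the paper's
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
dtheta = theta[:, None] - theta[None, :]

# original cosine connectivity, Eq. (2)
W_cos = J * np.cos(dtheta - Delta)

# extended, more localized profile with four Fourier components,
# using (beta_1, ..., beta_4) = (1/3, 1, 1, 1/3) from the follow-up comment below
beta = (1 / 3, 1.0, 1.0, 1 / 3)
W_ext = J * sum(b * np.cos(l * dtheta - Delta) for l, b in enumerate(beta, start=1))
```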

Answers to Questions

Q: What exactly of the model in (13) is fitted to the neural activities?

A: The parameters that are learned are the activation parameter p (shared across target neurons), the vector of biases $\mathbf{c}$, the matrix of readout weights $\mathbf{u}$, and a vector of baseline firing rates $\mathbf{r}$. This is stated in the sentence following Eq. (13). The latent variables $\mathbf{z}$ are the output of the encoder and the input of the decoder; hence, they are inferred but not learned. To clarify the notation, we will update the text so that $\tilde{b}$ denotes the recorded activity of target neurons, and $b$ denotes the NCE reconstruction of these neurons.

Q: Apparently sloppy notation in L. 117 ff.: for even $m$ the eigenvalues should always be 0. Should not $\sin(m\theta)$ be the eigenfunction for the odd ones?

A: There was indeed a minor mistake in Eq. (8). The expression in Eq. (8) will be replaced by $\lambda_n = \frac{4}{\pi^2}\left(2 \left\lfloor \frac{n-1}{2} \right\rfloor + 1\right)^{-2}, \forall n \in \mathbb{N}_*$.
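A quick numerical check of the corrected expression (our arithmetic on the formula, not a result from the paper): the eigenvalues come in degenerate pairs, decay as n^{-2}, and the series sums to 1 by a standard series identity.

```python
import numpy as np

def lam(n):
    """lambda_n = (4 / pi^2) * (2 * floor((n - 1) / 2) + 1)^(-2), for n = 1, 2, ..."""
    return (4 / np.pi**2) * (2 * ((n - 1) // 2) + 1) ** (-2.0)

n = np.arange(1, 9)
print(lam(n))   # degenerate pairs: ~0.405, 0.405, 0.045, 0.045, 0.0162, 0.0162, ...
# decay ~ n^{-2}; since sum_{k>=0} (2k+1)^{-2} = pi^2 / 8, the full series sums to 1
print(lam(np.arange(1, 2_000_001)).sum())   # -> approximately 1
```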

Q: L. 957 (Appendix): should the second $z_s$ be $z_t$?

A: The reviewer is right, we will correct this typo.

Q: Is the factor $2/\pi$ in the equation below L. 921 correct?

A: The factor is correct, but there were indeed some factors of $1/(2\pi)$ missing in ll. 915-920. We will correct these typos.

Q: (17) => C_{ij} = 4 * [term in (18)] - 4 * (1/4)

A: We will make the recommended modification, which helps readability.

Final remarks

We hope that the strength of our theoretical results and their relevance to several disciplines related to the theory of neural systems, the fact that our solvable model can be extended to lateral connectivity that goes beyond the cosine form, and the mathematical clarification/corrections we have made (thanks to the reviewer’s detailed and accurate comments) will prompt them to reconsider their rating.

Comment

Thanks a lot for clarifying my points and your willingness to correct the small mathematical errors. In terms of the extension of your analysis to the more general connectivity profile described by a Fourier series up to order 4: can it be said briefly what the critical properties of the lateral interaction have to be to make your theoretical approach applicable? Or is there only a very small number of special functions where your analysis is suitable?

Comment

We thank the reviewer for their excellent question. In the revised version of this work, we will include only the additional connectivity example with four Fourier components, as it is the only case for which we have verified (through another collaboration) that the model produces a limit cycle when $(\beta_1,\beta_2,\beta_3,\beta_4)=(\frac{1}{3},1,1,\frac{1}{3})$, while remaining solvable. It is possible that our analysis could be generalized to other connectivity kernels that are Hilbert–Schmidt integral operators on the ring; however, we are not able to make such a statement at this stage.

Comment

Many thanks for your response that is fully sufficient for me.

Review (Rating: 5)

The authors present a theoretical framework to resolve the apparent contradiction between experimentally observed high-dimensional neuronal activity and commonly hypothesized low-dimensional neuronal latents. Following their theoretical model of neuronal computations as a linear-nonlinear system, the authors propose a latent variable identification method, namely the Neural Cross-Encoder, to identify the low-dimensional latents underlying high-dimensional neuronal activity. They demonstrate that this method reveals a low-dimensional latent variable space underlying spontaneous neuronal responses from the visual cortex and responses to drifting gratings, but not responses to naturalistic stimuli. Taken together, this work advances our understanding of neuronal computations in response to stimuli as well as builds on methods to identify latent variables underlying these computations.

Strengths and Weaknesses

Strengths:

  1. The paper is well-written, reasonably easy to follow and deals with a very pertinent question in the field. Resolving this apparent contradiction in theoretical/modelling approaches and experimental observations will have a significant impact on unifying and advancing the field further.
  2. The theoretical connections between the covariance matrix eigenspectrum and the eigenspectrum of the integral operator is very interesting and well-described. Section 3, overall, is a nice read and sets up the reader for the NCE method.
  3. The NCE method description is very well written and shows promising results, especially on the neuronal recordings. The semantic relevance of some of the identified latent variables is also a very interesting result, and potentially highlights the utility of the method.

Weaknesses:

  1. I think a core contribution of the paper is the NCE method, and the paper falls short of establishing the correctness and utility of the method. Firstly, the method has certain hyperparameters (e.g., the size of the source and target neuronal populations) that seem to have been chosen without proper justification. Moreover, the authors do not comment (in the main text) on the robustness of the method to the choice of these hyperparameter values. Finally, the linear model baseline seems like a weak comparison and does not convincingly demonstrate the utility of the method. Ideally, the authors would also show the method's ability to uncover latent variables in some simulation setting before demonstrating their findings in recorded neuronal data.
  2. While the theoretical contributions are very interesting, they rely on the assumption of infinitely many neurons. Real systems are finite-dimensional, and it would be interesting to see how these correspondences hold in high- but finite-dimensional systems. At the least, I think the authors should add a note or comment on these assumptions.
  3. Conjecture 1 is interesting (and a very cool result), but it warrants further justification. In the current version, the reader lacks context as to how the authors came up with this conjecture. The authors could probably add a few statements to improve the readability of this section.

Questions

  1. While I enjoyed reading Section 2 (and the appendix), I am not quite sure what it adds to the paper. Section 3 and further sections do not require an RNN model assumption, and can be understood even without the results in Section 2. Could the authors clarify why they feel Section 2 is important in the context of the paper?
  2. The choice of the size of source and target neuronal populations is unclear. Do the results change if these choices are changed? Also, can you add an experiment to demonstrate the utility of NCE in simulated settings?
  3. Can you compare NCE to CEBRA or Rastermap or other latent variable methods? Given this is one of the core contributions of this work, it is imperative to show that NCE is indeed a performant latent variable method, at least comparable to existing methods.
  4. In the theory section, the latent variables were assumed to be drawn i.i.d. from a uniform distribution. Can you comment on this assumption, how reasonable is it and whether this can be relaxed? Does this affect NCE performance?

Limitations

Yes.

Justification for Final Rating

The authors' responses clarified my concerns around the method details and hyperparameter choices. While I believe more work is required to concretely establish the relevance (and novel contribution) of Section 2 as well as the utility of NCE compared to other methods in the literature, I believe this is a technically solid submission that makes a notable contribution towards resolving the apparent contradiction between empirically observed high-dimensional neural representations and the commonly hypothesized low-dimensional latent factors. Moreover, I think NCE is a neat contribution and will have a wide impact on the field of neural data analysis. Therefore, I recommend accepting this submission.

Formatting Concerns

N/A

Author Response

The reviewer appreciates the relevance of our results for systems neuroscience and highlights the quality of our text.

Weakness 1

(i) Choice of NCE hyperparameters

The reviewer is correct, the current manuscript does not provide sufficient justification of the selection of 500 source and 1000 target neurons and other parameters. Varying the number of source and target neurons up to 2000 does not substantially affect the results shown in Figure 3. We selected 500 and 1000 as intermediate values for computational reasons, since we fit 13 models with 5 shuffles each in 10 sessions for a total of 650 models (we have added 3 more experimental sessions since the original submission, confirming and strengthening results). To illustrate this stability, in the table below, we show the fraction of explainable variance explained for a model with d = 4 on drifting grating responses, for NCE and RRR models, across combinations of source and target cells.

N_target \ N_source    100                      500            1000           2000
100                    NCE: 0.98 / RRR: 0.92    0.96 / 0.88    0.96 / 0.86    0.97 / 0.86
500                    0.98 / 0.91              0.97 / 0.85    0.94 / 0.84    0.93 / 0.84
1000                   0.98 / 0.89              0.97 / 0.84    0.94 / 0.83    0.91 / 0.83
2000                   0.99 / 0.90              0.97 / 0.85    0.97 / 0.87    0.92 / 0.84

We note that increasing the number of source cells to 2000 and beyond somewhat reduces performance, and we suggest two potential explanations. First, due to experimental constraints, a typical session produces ~1000 training examples. This may not be enough to train a large encoder network with >1000 input neurons. Second, only a fraction of neurons recorded have reliable stimulus responses, and including more cells introduces a large amount of non-stimulus-related variance.

Of the remaining hyperparameters, we have found that the performance is not sensitive to variations in learning rate, batch size, and momentum by exploring the hyperparameter space with the Optuna optimizer (Akiba et al. KDD 2019). The results are mildly sensitive to the depth and width of the encoder network and to weight decay; these hyperparameters were selected via grid search on an example dataset.

To demonstrate the extent of the dependence on encoder architecture, we present the fraction of explainable variance explained for two example datasets as we vary the number of layers, and the size of the first layer. For multi-layer encoders, each successive layer is smaller by a factor of two.

Drifting gratings (d = 4):

Layer size \ # Layers      3       2       1
1000                       0.97    0.97    0.96
500                        0.97    0.97    0.94
250                        0.95    0.96    0.94

Natural images (d = 32):

Layer size \ # Layers      3       2       1
1000                       0.91    0.89    0.89
500                        0.91    0.92    0.89
250                        0.92    0.91    0.91

We will repeat the measurements shown in the tables above across multiple sessions and different values of d, and include them as supplementary figures in the appendix. We will also include in the appendix the details of hyperparameter selection, described above.

(ii) Linear model baseline

The linear peer-prediction method we use as a baseline is closely related to linear models used in several influential works, e.g., for the estimation of dimensionality of communication subspaces (Semedo et al. Cell 2019) or spontaneous activity (Stringer et al. Science 2019). For a discussion of alternative latent variable models as a baseline, see our response to Question 3 below.

(iii) Testing NCE on simulated data

We performed a test of the NCE method on simulated data generated by the solvable model presented in Section 2. This new analysis shows that the NCE can perfectly recover the two latent variables and it gives an accurate estimate of the pre-activation dimensionality (which is 2). In comparison, the linear model overestimates the dimensionality, estimating it at 4. These new results will be presented in an additional figure placed at the beginning of Section 4, making a natural link between the theoretical results (Sections 2 and 3) and the data analysis part of our work (Section 4).

Weakness 2

Our theory, which assumes an infinite population, gives a good prediction of the dynamics and covariance eigenspectrum of finite-size RNNs. To verify this claim, we simulated a network with 1000 neurons (N = 1000) and showed that it produces a limit cycle similar to that shown in Fig. 1D and that the 400 largest eigenvalues of its covariance matrix closely match our theoretical prediction. This new result will be mentioned in the main text.
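In the same spirit, a finite-size spot check can be sketched without integrating the full dynamics: sample an idealized limit-cycle (rotating-bump) trajectory for the pre-activations of N = 1000 ring neurons, apply a rectified-power nonlinearity, and fit a power law to the leading covariance eigenvalues. The trajectory, exponent, and fitting range below are illustrative and differ from the exact model and parameters used in the paper.

```python
import numpy as np

N, T, p = 1000, 4000, 1.0                  # neurons, timepoints, power (illustrative)
theta = np.linspace(0, 2 * np.pi, N, endpoint=False)
t = np.linspace(0, 2 * np.pi, T, endpoint=False)          # one period of the cycle
x = np.cos(theta[None, :] - t[:, None])                    # pre-activations (rank 2)
r = np.maximum(x, 0.0) ** p                                # post-activations

eig = np.linalg.eigvalsh(np.cov(r.T))[::-1]                # covariance eigenspectrum
k = np.arange(1, 401)
slope = np.polyfit(np.log(k), np.log(np.maximum(eig[:400], 1e-300)), 1)[0]
print(f"power-law exponent fitted to the 400 largest eigenvalues: {slope:.2f}")
```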

Weakness 3

We agree with the reviewer. The sentence preceding the statement of the conjecture, “Known results and simulations suggest the following conjecture”, was too concise. We will replace this sentence by

Drawing intuition from Fourier analysis, the smoothness of a function (here the kernel) should be related to the decay rate of its Fourier transform (here the eigenspectrum): the smoother the function, the faster the decay rate of its Fourier transform. Known results on the eigenvalues of random feature kernels for the cases p = 0 and p = 1, with c = 0, confirm this intuition and show how it extends to general integer d (Bach JMLR 2017). Extrapolating those results to any nonnegative p and any real c, we get the following conjecture:

Question 1

The paper starts with the solvable RNN model for two reasons.

  • First, the RNN model highlights the nontriviality of the notion of “pre-activation dimensionality” which this paper aims to introduce. The model shows that the autonomous dynamics (i.e. dynamics that are not input-driven) of an RNN can be both low-dimensional in the space of pre-activations and high-dimensional in the space of post-activations, which is a somewhat counterintuitive fact. This fact provides a mechanistic motivation for the notion of “pre-activation dimensionality”, as opposed to the more phenomenological notion of intrinsic dimensionality often considered in neuroscience. To make this point clearer, we will add the following paragraph in the introduction:

Here, we first construct a solvable RNN model that reconciles the low- and high-dimensional perspectives on population activity by carefully disambiguating the linear dimension of the system before and after the neurons' nonlinearity, which we refer to as the pre- and post-activation dimension, respectively. This dichotomy refines the usual distinction between linear and “intrinsic” dimension (Stringer et al. Nature 2019; Jazayeri and Ostojic CONB 2021; Humphries NBDT 2021), since the intrinsic dimension of a system is the same before and after any continuous, injective nonlinearity. Using the notions of pre- and post-activation linear dimensions, we show that our RNN can be exactly reduced to a low-dimensional dynamical system in the space of pre-activations, making the pre-activations low-dimensional. Then, we show that these latent dynamics generate high-dimensional post-activation activity that has a power-law covariance eigenspectrum. (In this work, dimension will always refer to linear dimension, unless stated otherwise.)

  • Second, the RNN model we present in Section 2 constitutes an original contribution to the theory of large neural networks, in and of itself. While, in this paper, we use the model to reconcile different views on dimensionality in systems neuroscience, this model should also be of interest to researchers in computational/theoretical neuroscience, machine learning, and the physics of complex systems, who are interested in the dynamics of RNNs.

Question 2

See answer to Weakness 1(i) above.

Question 3

NCE, CEBRA, and Rastermap are three different dimensionality reduction methods (the latter two being primarily for exploratory analyses) and are therefore hard to compare (e.g., it is difficult to compare CEBRA and Rastermap, hence (Stringer et al. Nat. Neuro. 2025) does not contain such comparisons). The purpose of NCE is to estimate the pre-activation dimensionality of neuronal activity, and it tries to do so with as little inductive bias as possible. To our knowledge, there is no other latent variable model that is specifically designed to estimate pre-activation dimensionality. Methods such as LFADS and its derivatives could be used to estimate the pre-activation dimensionality, but they come with complex inductive biases that are not necessary for this task. How NCE compares to methods such as LFADS (or RADICaL), which have more inductive bias, when the number of recorded neurons is large is an interesting question that lies beyond the scope of this work.

Question 4

In Section 3, we assume that the inputs are uniformly distributed on the sphere. This is an assumption used for mathematical convenience only and it is probably not biologically realistic. Interestingly, a recent paper by (Li et al. JMLR 2024) (cited in the text) shows that results on the decay rate of the eigenspectrum of random feature kernels with uniformly distributed inputs on the sphere can be generalized to a broad class of distributions on more general domains. We will briefly mention this in the main text.

In G.3 of the Supplementary Material, we show that NCE can infer latents that are correlated with running. Since these latents (and running) are not uniformly distributed, this suggests that NCE performs well even when the latents are not uniformly distributed.

Comment

I would like to thank the authors for responding to my comments and questions. I believe all the proposed changes will improve the overall quality of the paper and clarify the contributions of this work to the reader.

Some final thoughts:

  1. I would like to thank the authors for adding the results on hyperparameter choices and clarifying the assumptions around infinite neurons and uniformly distributed latent variables.

  2. The relevance of Section 2 is also clearer to me now, although I still believe it might be shifting the focus from the core contributions of the paper. I agree with the authors that using the RNN model as a motivating example to demonstrate NCE is a good idea and should help bridge that gap. However, I am not totally convinced that Section 2 is a novel contribution in itself towards understanding the dynamics of RNNs. This is primarily because the model is specifically designed with a particular weight matrix. While I agree with its value in acting as a setup for validating and demonstrating NCE, its utility in understanding RNNs is limited in its current presentation.

  3. I understand the point that the authors make about the difference in the goal of NCE vs that of CEBRA or Rastermap from a scientific point of view. Moreover, I agree with the authors that comparing to LFADS (which has inductive biases incorporated) is a separate question and lies outside the scope of the current work. However, from an application point of view, I am not totally convinced whether NCE should be used over these other methods to infer latent factors in neural population data. To that end, I had suggested that the authors compare NCE with other methods that might be used by practitioners.

I thank the authors once again for their hard work and would like to congratulate them on this amazing work. With the changes proposed by the authors, I believe this work will be of great value and interest to the community and therefore recommend accepting this work to NeurIPS.

Comment

We thank you for your thoughtful comments and appreciate your interest in our work. Your suggestions have improved the overall quality and impact of the paper. We will take your final thoughts into consideration for any future follow-up work.

Final Decision

Since Stringer et al.'s 2019 paper, there has not been a satisfactory explanation of the relationship between high-dimensional observable neural activity and the low-dimensional latent dynamics that drive it in the early visual system. This paper presents a very strong mathematical analysis that combines methods from various subdisciplines to derive and analyze the solutions of recurrent networks with very specific assumptions for the lateral coupling structure. The low-dimensional pre-activations and power-law post-activations offered as a possible resolution represent significant progress and open future research directions. The authors also present a novel methodological contribution, called the NCE, to analyze the high-dimensional deconvolved neural responses.