PaperHub
Rating: 7.3/10 (Spotlight, 3 reviewers)
Scores: 7, 8, 7 (min 7, max 8, std 0.5)
Confidence: 3.7 | Correctness: 3.3 | Contribution: 3.0 | Presentation: 3.0
NeurIPS 2024

Nonlinear dynamics of localization in neural receptive fields

Submitted: 2024-05-15 | Updated: 2024-11-06
TL;DR

We analyze the dynamics of emergent localization in receptive fields of a nonlinear neural network, which enables us to bind emergence to specific higher-order statistical properties of the input data.

Abstract

Keywords
localization, receptive fields, learning dynamics, emergence

Reviews and Discussion

Review (Rating: 7)

This paper investigates when localized receptive fields arise in supervised neural networks. Extending a recent work of Ingrosso and Goldt, the authors propose that simple single-neuron models learn localized receptive fields when trained on data with sufficiently negative excess kurtosis, while if the excess kurtosis is sufficiently positive they learn delocalized receptive fields.

Strengths

The topic of this paper is of broad interest in both machine learning and neuroscience, and on the whole I think this manuscript makes a worthy contribution on top of the work of Ingrosso and Goldt. There are some weaknesses which dampen my enthusiasm (see below), but on the whole I favor acceptance.

Weaknesses

I have two primary concerns about the results presented:

  • First, Lemma 3.1's treatment of time is not sufficiently precise. Can you provide a more precise answer than simply "early in training" or "before A.3 is violated"? Indeed, the logic in the paragraph beginning on Line 174 is not clear. What you show is that, in some cases (see concern below), eq. (5) generates localized RFs in a similar location to those observed in actual training. This does not necessarily mean that "the gradient flow in Eq. (5) holds sufficiently long to detect the emergence of localization in the weights," as you write in Lines 177-178. Moreover, you have not in fact defined what you mean by "detect the emergence of localization;" this must be reified. Can you see a change in participation ratio, even if only numerically, before the approximation breaks down?

  • Second, the experiments are rather limited, and rely largely on exemplars rather than systematic statistical investigation. This is important given the gap in Lemma 3.1: the authors rely on experiments to justify their claim that this approximation provides meaningful information about when localization will occur, but all they actually show is that there is resemblance in a few cases. The paper would be much stronger if the authors could also show that their claims hold statistically over many realizations of the data generation process and training procedure.

Questions

  • A neuroscientific quibble: in the first sentence of the Introduction (Line 19), one need not restrict attention to the "mammalian nervous system." There are numerous examples of localized receptive fields in non-mammalian species; see for instance the beautiful works of Knudsen & Konishi on audition in owls.

  • The authors might consider citing Sengupta et al., "Manifold-tiling Localized Receptive Fields are Optimal in Similarity-preserving Neural Networks" (NeurIPS 2018) in their discussion of unsupervised learning algorithms that give localized receptive fields.

  • Another potentially relevant reference is Shinn, "Phantom oscillations in principal component analysis" (PNAS 2023).

  • The reference in Footnote 3 is wrong; it should be to Appendix C.2 not C.3.

  • In Line 187, there is a small typo: "termdepends" -> "term depends"

  • Using a perceptually uniform non-grayscale colormap to represent time might improve the legibility of the plots relative to the grayscale used in the submitted manuscript.

  • The final paragraph of the conclusion is not tied to anything that came before; if you care about orientation selectivity you should mention & measure it in earlier portions of the paper. Otherwise, this should be omitted.

Limitations

The authors do a largely adequate job of discussing the limitations of their work, up to the technical weaknesses noted above.

Author Response

We thank the reviewer for their extremely helpful feedback, in particular for making us aware of relevant work and clarifying minor misconceptions. We also appreciate your insistence that we more comprehensively quantify and replicate our analyses.

Lemma 3.1's treatment of time is not sufficiently precise.

We thank the reviewer for this criticism. Indeed, Lemma 3.1 does not provide a precise criterion for when the approximation breaks down. This is because the only assumption that is violated as a result of localization, A3, does not break down at any predictable time $t$. To make this claim clearer, we have included Fig. 2 & 3 in the Global Response. We see across examples that the analytical model closely tracks the empirical one until a strongly localized bump emerges in the weights; when no localization emerges, the analytical model maintains its precision through training. Please see the figure and its caption for more detailed information. We would happily incorporate suggestions from the reviewer on how best to represent or quantify this notion of time.

the experiments are rather limited, and rely largely on exemplars rather than systematic statistical investigation

We thank the reviewer especially for stating this limitation. To address this, as well as other reviewers' concerns about the precision of our model and analysis, we have included Fig. 1 in the Global Response. This figure shows the IPR against the excess kurtosis for various values of $g$ and $k$, respectively, for `NLGP` and `Kur`.

one need not restrict attention to the "mammalian nervous system"

Thank you for the reference; Knudsen & Konishi (1978) is indeed relevant. We will broaden our terminology from "the mammalian nervous system" to "animal nervous systems".

Knudsen & Konishi (1978). Center-surround organization of auditory receptive fields in the owl. DOI:10.1126/science.

consider citing Sengupta et al.

Thank you for the reference; we will add Sengupta et al. (2018) to the list of unsupervised learning algorithms producing localized receptive fields. As far as we understand, this algorithm does not optimize an explicit or implicit sparsity criterion as in sparse coding, so it is also an example of an alternative model for the emergence of localization. [Har+20], cited in the submission, is another such example; we will incorporate both of these references in the submission via a discussion of their modelling assumptions with respect to those we employ.

The final paragraph of the conclusion on orientation selectivity is not tied to anything that came before ...

We are asked by NeurIPS to address the limitations of our work, and since we display oriented receptive fields in Figure 1 (left, center) we would like to call attention to the fact that we cannot yet capture this feature in our analytical model, and neither do the simulated receptive fields appear oriented. To improve on this point, we will introduce the terminology of oriented receptive fields when referring to Figure 1 (left, center) in the introduction so that this terminology does not spring out of nowhere in the conclusion, as the reviewer has pointed out. Our analytical and simulated receptive fields are not oriented to any degree, so we do not think it is necessary to explicitly measure this property.

typo, corrections

use a perceptually uniform non-grayscale colormap

We particularly appreciate the reviewer's thorough examination of our work and their identification of typographical errors. A compiled list of the typographical corrections we plan to implement is provided in the global response. We also intend to reformat our plots to use a non-grayscale colormap to improve legibility, which another reviewer encouraged as well. Please see Fig. 4 in the attached PDF for an example.

Comment

Thank you for your thorough reply to my comment. I think you've adequately addressed my comments and those of the other referees, so I will raise my score as I think this paper should be accepted.

Review (Rating: 8)

SETTING: Olshausen & Field famously showed that requiring natural image patches to be constructed from a small number of independent elements from an overcomplete dictionary populates that dictionary with spatially localized feature detectors. But localization also appears in DNNs trained on discriminative-learning tasks. It is also ubiquitous in the cortex. This raises the question of what fundamentally drives the formation of localized receptive fields.

APPROACH: The MS attempts to explain the emergence of localized RFs by reference to the statistics of the stimuli (inputs). In particular, the authors examine the time evolution of network weights trained with gradient descent to discriminate "natural-image-like" stimuli--analytically for a single (ReLU) unit of a neural network, and in simulation for a weighted sum of multiple such units (a two-layer NN with one output unit).

More precisely, the authors consider stimuli/inputs with circulant covariance matrices, a 1D idealization of natural images (the covariance matrices of natural images are block-circulant with circulant blocks). Under some additional simplifying assumptions (see below), they derive an ODE for the time evolution of the weights when the network is trained to discriminate stimuli according to whether their spatial correlations are "long" or "short." They show that this ODE depends only on the marginal distribution of a single "pixel" rather than the joint (and is valid until localization begins). Investigation of this ODE (Appendix C.1) reveals that localization will occur (w large for some entries, small for the rest) when the inputs have (sufficiently) negative excess kurtosis. The authors also demonstrate this in simulation with data distributions with varying levels of excess kurtosis, both by examining the learned receptive fields and by numerically integrating the ODE. Finally, they extend their simulations to two-layer NNs.

Strengths

The MS persuasively presents a new route to localized receptive fields. The proofs look correct (but I did not verify every line) and the simulations confirm that they are not undermined by the various approximations employed along the way. The problem is an important one and of long-standing interest to the field (computational neuroscience).

Weaknesses

This reviewer struggles to see how these results can be generalized beyond this very simplified setting of, essentially, a "network" Y = ReLU(<w, X>) (or its cousin, the SCM). Introducing just one more learnable layer breaks (or at least vitiates) the connection between stimulus kurtosis and localized receptive fields (as the authors show in Fig. 6).

This makes it hard to find this explanation of RF localization more compelling than the classical one in terms of efficient codes.

Questions

--One of the chief motivations for the study is that even DNNs trained on supervised-learning tasks acquire localized RFs. In support of this claim they cite the AlexNet paper and some papers on visualizations of convnets. Obviously the convolutions in these networks enforce spatial localization. Do the authors have in mind here localized (bandpass) frequency RFs?

--Is negative excess kurtosis consistent with the statistics of natural scenes?

--Can the authors provide more intuition about assumptions A1 and A2, and how restrictive they are?

--In the proof, l. 634, shouldn't it be P(<w,X> > 0) = 1/2? (There is no > 0 in the expression in the MS. Am I misunderstanding this?)

Limitations

N/A

Author Response

We thank the reviewer for their excellent feedback. We especially appreciated your precise and astute questions with regard to the validity and generalizability of our model, and also for reading closely enough to identify a typo in our proof.

how these results can be generalized beyond this very simplified setting ... one more learnable layer breaks (or at least vitiates) the connection between stimulus kurtosis and localized receptive fields

The reviewer correctly points out a key limitation of this work: extensions to more complicated models quickly lead to difficulties in extending our analytical approach and its interpretation. Indeed, we observe that experiments with two-layer networks are noisier (as the reviewer pointed out in referencing Figure 6), as did [IG22].

However, we consider it worth mentioning that some aspects of our analytical approach seem to extend to the two-learnable-layer regime, though we reserve this as an avenue for future work. As an example, using the approach we have developed, consider the steady-state equation for a single-neuron model with two learnable layers, $\mathbf{w}_1$ and $w_2$, after plugging in for $w_2$:

$$\varphi\left( \frac{\Sigma_1 \mathbf{w}_1}{\sqrt{\langle \Sigma_1 \mathbf{w}_1, \mathbf{w}_1 \rangle}} \right) = \frac{1}{\sqrt{2 \pi}} \frac{(\Sigma_0 + \Sigma_1) \mathbf{w}_1}{\sqrt{\langle (\Sigma_0 + \Sigma_1) \mathbf{w}_1, \mathbf{w}_1 \rangle}}$$

This steady state is very similar to the one implied by Eq. 5 in Lemma 3.1, the only difference being an additional scaling term in the denominator of the right side. We have found that manipulating this term does not substantially change whether a given $\varphi$ yields localized receptive fields, but this observation is still preliminary. However, dynamics play a very important role in the precise structure obtained at convergence, something we are exploring for potential future work. As the reviewer points out, adding another learnable layer obfuscates the relation between stimulus kurtosis and localization because the second-layer weight can be pulled into the ReLU term and viewed as a rescaling of the stimulus. A thorough analysis will likely not be as simple as just studying $\varphi$ and will likely require a careful analysis of initializations, but we expect that general strategies and ideas from this work will carry over into future ones.

hard to find this explanation of RF localization more compelling than the classical one in terms of efficient codes

We would like to emphasize that we don't wish to claim that our approach is much more compelling than the sparse coding work of [OF96]. Instead, we hope to present it as an alternative bottom-up, learning model that should be investigated further because of its ability to reproduce key qualitative phenomena of visual receptive fields with an alternative set of assumptions that some may view as more minimal (due to lack of sparsity regularization).

the convolutions in these networks (such as AlexNet) enforce spatial localization

The kernel size of the first-layer convolutional kernels is indeed usually taken to be much smaller than the size of the input image, thus enforcing a degree of spatial localization. AlexNet, for example, receives an input image of size $224 \times 224$ with an $11 \times 11$ kernel in the first layer. Here, we mean to refer to the further localization that emerges within these convolutional kernels, within that kernel bandwidth; i.e., the oriented receptive fields visible in Figure 1 have width in the edge direction much smaller than 11.

Is negative excess kurtosis consistent with the statistics of natural scenes?

While we focus on analyzing the model of emergent localized receptive fields of [IG22], this is a great question about broader relevance. We can provide some context in the case of simple cells in primary visual cortex (V1), of which sparse coding is classically taken as a model [OF97]. Retinal ganglion cells prior to V1 are understood to perform edge detection. Edge detection naturally induces concentrated marginals, corresponding to positive and negative edges, with a large amount of mass at zero, corresponding to no edge. Such distributions typically have negative excess kurtosis (unless a very large amount of mass is at zero, corresponding to a nearly uniform input).
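As a rough numerical illustration of this intuition (ours, not part of the submission), consider a symmetric three-point marginal with mass at $-1$, $0$, and $+1$ as a caricature of an edge-detector output; its excess kurtosis is negative unless the mass at zero dominates:

```python
import numpy as np

def excess_kurtosis_three_point(p_edge):
    """Excess kurtosis of a symmetric marginal with P(X = -1) = P(X = +1) = p_edge
    and P(X = 0) = 1 - 2 * p_edge (a caricature of an edge-detector output)."""
    vals = np.array([-1.0, 0.0, 1.0])
    probs = np.array([p_edge, 1.0 - 2.0 * p_edge, p_edge])
    mean = probs @ vals                    # 0 by symmetry
    var = probs @ (vals - mean) ** 2       # = 2 * p_edge
    fourth = probs @ (vals - mean) ** 4    # = 2 * p_edge
    return fourth / var**2 - 3.0           # = 1 / (2 * p_edge) - 3

for p_edge in [0.45, 0.30, 0.10, 0.02]:
    kurt = excess_kurtosis_three_point(p_edge)
    print(f"mass at zero = {1 - 2 * p_edge:.2f}: excess kurtosis = {kurt:+.2f}")
```

The excess kurtosis here is $1/(2 p_{\text{edge}}) - 3$, so it is negative whenever the edge states carry more than a third of the probability mass, and becomes large and positive only when nearly all of the mass sits at zero.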

Can the authors provide more intuition about assumptions A1 and A2, and how restrictive they are?

We thank the reviewer for this question. We attempted to address this in lines 160-163 of the submission, which we expand on here.

A1 and A2 are motivated by the limiting cases of the `NLGP`$(g)$ data model from [IG22]: $g \to 0$ (no localization) and $g \to \infty$ (localization). A1 is implied by the weaker assumption $\mathbb{E}[X_j \mid X_i = x_i, Y = y] \propto x_i$ after applying S3. A2 captures that when conditioning on $X_i$, nearby entries in $\mathbf{X}$ are almost deterministic, while distant ones are unaffected. The $i$-th row and column of $\operatorname{Cov}[X_j \mid X_i = x_i, Y = y]$ are zero. Local dependence implies entries $(k, j)$ of the conditional covariance for $k, j$ near $i$ should be small, while distant ones remain unchanged. Weak dependence (S1) implies $\sigma_i^y$ is largest near $i$ and zero elsewhere. Subtracting $\sigma_i^y {\sigma_i^y}^\top$ from $\Sigma_y$ expresses these intuitions, supported by the `NLGP` limiting cases. As such, we consider these assumptions to be relatively loose. Fig. 1 of the Global Response, which shows a sharp uptick in IPR as excess kurtosis becomes negative, supports this.

... l. 634 shouldn't it be P(<w,X> > 0) = 1/2?

Yes; thank you for the correction.

Comment

Thanks for answering my questions. I stand by my high rating.

Review (Rating: 7)

The authors analytically derive the learning dynamics of (extremely) simple neural network models and characterize conditions under which units learn localized receptive fields. This builds on celebrated work in computational neuroscience on sparse coding and independent components analysis, offering a new perspective for biological findings.

Strengths

The paper is beautifully written (but see comments about figures below). It is well-motivated and accessible to a broad audience. The target audience for this kind of work may be somewhat niche, but it fits within the scope of the NeurIPS conference. The authors summarize prior work concisely and clearly. The assumptions of the analysis are stated precisely. The proofs in the appendix are well written, and the main claims of the theory are tested experimentally (though see below for some areas I am a bit unclear on).

Weaknesses

To enable analytic tractability, the paper relies on very strong and simplifying assumptions. Figures 4, 5, 6, and the right hand side of Figure 3 are hard to see. It could be helpful to add color and show fewer lines. Also, the description of these figures is the one part of the text I had trouble following. Quantifying the outcomes somehow would be helpful for me to understand exactly what I'm supposed to be looking at in these examples. For instance, in Figure 4A, the claim is that the model learns an oscillatory filter that resembles a sinusoid. But the left panel in Fig 4A, though oscillatory, doesn't look sinusoidal?

Ultimately I believe these weaknesses can be addressed during the rebuttal phase and the strengths outweigh the weaknesses.

Questions

  • In addition to showing examples in Figures 4, 5, and 6, can you somehow quantify localization?

Limitations

Limiting assumptions are clearly stated and discussed.

Author Response

We thank the reviewer for their well-formulated concerns and suggestions, especially with regard to improving the interpretability and readability of our work.

Figures 4, 5, 6, and the right hand side of Figure 3 are hard to see. It could be helpful to add color and show fewer lines.

We thank the reviewer for the suggestion to improve legibility of the receptive field evolution. In the Global Response, we have plotted receptive fields with a non-grayscale (blue-red) colormap to improve legibility, which another reviewer suggested as well.

the description of these figures is the one part of the text I had trouble following.

Quantifying the outcomes somehow would be helpful for me to understand exactly what I'm supposed to be looking at in these examples. In addition to ... Figures 4, 5, and 6, can you somehow quantify localization?

We thank the reviewer for the suggestion to quantify the localization phenomenon. In Fig. 1-3 of the Global Response, we quantify localization with the inverse participation ratio (IPR), defined as $\operatorname{IPR}(\mathbf{w}) = \left(\sum_{i=1}^D w_i^4\right) / \left(\sum_{i=1}^D w_i^2\right)^2$, where $w_i$ is the magnitude of dimension $i$ of the weight $\mathbf{w}$. This measure, also used by [IG22], is large when proportionally few weight dimensions "participate" (have large magnitude), and small when weight dimension magnitudes are more uniform.
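To make the measure concrete, here is a minimal NumPy sketch (our illustration, not from the submission or the Global Response) comparing the IPR of a localized bump against a delocalized oscillation:

```python
import numpy as np

def ipr(w):
    """Inverse participation ratio: large when few dimensions carry most of the weight norm."""
    w = np.asarray(w, dtype=float)
    return np.sum(w**4) / np.sum(w**2) ** 2

D = 100
idx = np.arange(D)
localized = np.exp(-0.5 * ((idx - D // 2) / 2.0) ** 2)   # narrow Gaussian bump
delocalized = np.cos(2 * np.pi * idx / D)                # oscillation spread over all entries

print(f"IPR(localized)   = {ipr(localized):.3f}")   # roughly 0.2: few entries dominate
print(f"IPR(delocalized) = {ipr(delocalized):.3f}")  # 3 / (2 * D) = 0.015: weight spread out
```

The exact values depend on the bump width and on $D$; what matters is the order-of-magnitude gap between the two regimes.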

For instance, in Figure 4A, the claim is that the model learns an oscillatory filter that resembles a sinusoid. But the left panel in Fig 4A, though oscillatory, doesn't look sinusoidal?

In Fig. 4 of the Global Response, we quantify this claim by fitting a sinusoid to the resulting receptive field, and find that we can do so very well. We also rescale the receptive field to make the sinusoidal structure of the steady state more apparent. Please see the figure caption for more details.

Comment

Thanks for the clarifications. The sinusoids are still a bit hard to see, but the new figure is an improvement. Perhaps an "approximate" sinusoid. Overall I think this is minor; I retain my positive score.

Author Response

We thank the reviewers for their excellent feedback. In particular, we thank each of the reviewers for providing specific, actionable questions and concerns. We have done our best to address each of these in our responses.

Errata and minor corrections

We also thank the reviewers for reading our manuscript very closely and identifying typos. Those corrections, along with additional ones we identified upon our own re-reading, are listed below:

  1. Line 163: "does and not does" -> "does and does not"

  2. Line 171: "$o(1)$" -> "$o_N(1)$", to clarify dependence w.r.t. dimension $N$ and not time $t$

  3. Line 187: "termdepends" -> "term depends"

  4. Line 238: "steady" -> "steady state"

  5. Line 634: $\mathbb{P}(\langle \mathbf{w}, \mathbf{X} \rangle) = \frac{1}{2}$ -> $\mathbb{P}(\langle \mathbf{w}, \mathbf{X} \rangle > 0) = \frac{1}{2}$.

  6. Line 649: "blow up" -> "dominate"

On extension to 2D

We would also like to clarify a potential misconception about our work. Our analysis is presented in terms of $N$-dimensional inputs (i.e., $\mathbf{X} \in \mathbb{R}^N$), which are most naturally interpreted as 1D images. However, our analysis extends naturally to 2D images as well through (un)vectorization, just as is done experimentally in [IG22]. To do this, we construct $N^2$-dimensional signals with a special covariance given by $\Sigma_y = \tilde{\Sigma}^{a}_y \otimes \tilde{\Sigma}^{b}_y \in \mathbb{R}^{N^2 \times N^2}$, where $\tilde{\Sigma}^{a}_y, \tilde{\Sigma}^{b}_y \in \mathbb{R}^{N \times N}$ are covariances along the two axes of the 2D image. $N^2$-dimensional data sampled from such a distribution can be turned into a 2D image by inverting the vectorization operation: $\tilde{\mathbf{X}} \equiv \operatorname{vec}^{-1}(\mathbf{X}) \in \mathbb{R}^{N \times N}$, which puts the first $N$ entries of $\mathbf{X}$ in the first column of $\tilde{\mathbf{X}}$, the next $N$ entries of $\mathbf{X}$ in the second column of $\tilde{\mathbf{X}}$, and so on. This is the procedure we used to generate the 2D receptive fields in Figure 1 (Right) in the manuscript for the SCM.
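An informal sketch of this construction (our illustration; the circulant axis covariances below are hypothetical stand-ins for $\tilde{\Sigma}^a_y$ and $\tilde{\Sigma}^b_y$, and a plain Gaussian sample stands in for the data model):

```python
import numpy as np
from scipy.linalg import circulant

def axis_covariance(N, length_scale):
    """Hypothetical circulant covariance along one image axis, built as A @ A.T
    with A a symmetric circulant Gaussian smoother so the result is PSD by construction."""
    dist = np.minimum(np.arange(N), N - np.arange(N))         # circular distance to index 0
    A = circulant(np.exp(-dist**2 / (2 * length_scale**2)))   # symmetric circulant smoother
    return A @ A.T

N = 16
Sigma_a = axis_covariance(N, length_scale=1.0)   # e.g. "short" correlations along one axis
Sigma_b = axis_covariance(N, length_scale=4.0)   # e.g. "long" correlations along the other
Sigma = np.kron(Sigma_a, Sigma_b)                # N^2 x N^2 covariance of the vectorized image

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=np.zeros(N * N), cov=Sigma)  # one N^2-dimensional sample

# Invert vectorization: first N entries -> first column, next N -> second column, ...
X_img = X.reshape(N, N, order="F")               # vec^{-1}(X), an N x N image
```

The Gaussian draw here is only to illustrate the Kronecker covariance and the $\operatorname{vec}^{-1}$ step; it is not the data model used in the paper.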

We focused on the 1D case because it is, following the argument above, the most general case, and also the most natural to consider with a feedforward network that does not make any assumptions on intra-layer connectivity.

Final Decision

This paper describes novel analysis of the emergence of localized receptive fields using top-down efficiency constraints. All three reviewers felt that it was above the threshold for acceptance, and I'm pleased to report that it has been accepted to NeurIPS. Congratulations! Please revise the manuscript according to the reviewer comments and discussion points.