PaperHub
Overall: 7.3/10 · Poster · 4 reviewers
Ratings: 5, 5, 5, 3 (min 3, max 5, std 0.9)
Confidence: 4.8
Novelty: 2.3 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.5
NeurIPS 2025

Explicitly Modeling Subcortical Vision with a Neuro-Inspired Front-End Improves CNN Robustness

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We introduce Early Vision Networks (EVNets), hybrid CNNs combining biologically inspired subcortical and cortical front-end blocks, which improve V1 alignment and significantly boost robustness to different image perturbations.

Abstract

Keywords
Computational Neuroscience · Computer Vision · Object Recognition · Convolutional Neural Networks

Reviews and Discussion

Review (Rating: 5)

The authors investigate a new front-end for deep convolutional neural networks (CNNs) that implements a fixed-weight model of the retina and LGN. This builds on prior work showing that a V1 model as a front-end can improve the robustness of object recognition. The authors implement their subcortical model as a front-end to ResNet-50 and compare the performance of three different models: the base ResNet-50; the VOneResNet-50 with the V1 front-end from Dapello et al.; and the EVResNet-50 with their subcortical (retina + LGN) block followed by the V1 block as a front-end. They assess these models' performance in terms of their similarity to neural recordings from the primate visual system, and in terms of their object recognition performance (clean test accuracy, robustness to domain shifts, and robustness to corruptions).

The authors observe that, by most metrics, EVResNet > VOneResNet > ResNet in terms of similarity to neural recordings from the primate visual system, and that ResNet > VOneResNet > EVResNet in terms of clean test accuracy on ImageNet. EVResNet yielded better accuracy on most domain-shifted and corrupted ImageNet tasks than did VOneResNet or ResNet.

Strengths and Weaknesses

Strengths:

  • The authors have correctly identified a key weakness of the Dapello et al. work, in that it ignores the subcortical contributions to vision. They have done a good job of paralleling that line of work in building their subcortical block.

  • The improvements in robustness to corruption and domain shifts are notable, although they come at the expense of a reduction in robustness to adversarial attack (vs VOneNet).

  • The ablation study (removing M cell pathway) is nice to see.

Weaknesses:

  • The work did not consider how their approach might compare to other avenues for improving computer vision models. E.g., how does their addition of fixed-weight neural models compare to self-supervised pretraining, attention layers, and other components of SOTA vision models? Performing this comparison would help to understand where the field should invest its effort in advancing computer vision.

Questions

  • Rich Sutton's "Bitter Lesson" teaches us that approaches that scale well to more data and larger compute resources tend to outcompete finely-tuned and less-scalable architectures. I am of course open to being shown otherwise (and I would welcome counterexamples to the bitter lesson!!). With that in mind, I think it would be helpful to compare the "add neural model front-ends" approach with the "scale up simple stuff" approach that is exemplified by SSL pretraining, vision transformers, etc. That comparison would help us to understand where to invest our energies if we want to push forward the SOTA.

  • Given the random noise in the neural activities, would it make sense to shut off that source of randomness during inference (as in dropout training)? Alternately, perhaps the authors would benefit from repeating the noisy inference multiple times for each input image (with different random noise draws) and averaging over the resultant category labels. This could leverage any stochastic resonance effects and possibly improve categorization.

  • In Fig. 2, it would help to show real brain data alongside the two model curves so readers can tell at a glance which model class better matches the brain data.

  • In terms of light adaptation, the authors use a global luminance adaptation (Eq. 2) whereas the biological visual system has both global and local light adaptation, and the local adaptation is substantial. E.g., each individual photoreceptor adapts to the local light intensity. This has the benefit of minimizing the effects of spatial lighting variation on perception, and the same property could help a lot in robustness to weather shifts (for example).

  • I wonder if it would help to give skip connections around these early blocks. (I.e., to have skip connections from pixels to VOne block and to the downstream ResNet, and to have skips that send the subcortical block's output directly to the downstream ResNet). Then, the ResNet could use these pre-computed features from the neural model when they are useful, and could also find other features in the pixel inputs in cases where those are not provided by the neural model blocks. In the case of skip connections from the subcortical block to the ResNet (bypassing the VOne block), this would mimic the known projections from LGN to V2.

  • I also wonder if it would help to let the parameters of the early neural blocks be trainable. They could still be initialized to the values used by the authors, but this would allow the models to find better parameters than those hard-coded ones.

Limitations

Yes

Justification for Final Rating

The authors presented excellent justifications for their modeling choices and added some thoughtful experimental details (e.g., using ensemble approaches by averaging over noisy inference steps). This improved my opinion of the work so I increased my score from 3->5.

Formatting Issues

No

Author Response

We thank the reviewer for the thoughtful and constructive feedback. We address the reviewer’s main concerns and questions below.

Comparison to broader computer vision methods

We agree that comparing against non-biological robustness methods is important. For this reason, we include both adversarial training and PRIME augmentation as strong baselines (Table 5). EVNet+PRIME outperforms both, indicating that the biological priors introduced by EVNets offer complementary and additive gains to established training-based defenses. We focus on these particular methods because they remain the most effective and widely adopted approaches for improving adversarial robustness and robustness to image corruptions. Adversarial training (e.g., PGD) is a long-standing standard for defending against white-box attacks, while PRIME outperforms other augmentation techniques on ImageNet-C [1]. In addition, earlier forms of data augmentation, as used in PRIME, have been shown to outperform a 1000-fold increase in training data on ImageNet-C, as well as large-scale pretraining and even self-attention mechanisms on out-of-domain datasets such as ImageNet-R [2]. Furthermore, the objective of our work is not to present a new SOTA defense but to evaluate whether modeling the early visual system in greater detail enhances both robustness and brain alignment. Thus, comparing primarily against biologically inspired and standard CNNs is most appropriate for isolating the effects of architectural priors. Moreover, many attention mechanisms are hypothesized to arise at higher cortical stages, beyond the scope of our subcortical and V1 front-ends [3]. Nonetheless, we acknowledge the importance of this comparison and will expand the discussion in the final version to better contextualize our findings within broader computer vision research.

Scaling in opposition to neuro-inspiration

We thank the reviewer for invoking Rich Sutton's "Bitter Lesson." While the lesson cautions against manual design in favor of scalable approaches in task-optimization regimes, we believe the story in neurally-aligned vision is more nuanced. In fact, prior work [4] suggests that bigger models are not necessarily more brain-like. Additionally, recent evidence from large-scale benchmarking [5] shows that neural alignment with the primate visual cortex saturates with scale, even as task performance and behavioral alignment continue to improve. Crucially, models with strong architectural priors such as CNNs achieve higher neural alignment than scale-heavy models like ViTs, especially in low-data regimes. Furthermore, human vision itself is highly efficient, achieving strong generalization and robustness under tight biological constraints. Our work aims to mimic this efficiency by introducing compact, fixed-weight modules that encode inductive biases aligned with the structure of early primate vision. In resource-constrained applications, such efficient and interpretable models may offer practical advantages over large-scale transformers. We will explicitly discuss these tradeoffs and include a broader reflection on the tension between scaling and structure in the final manuscript.

Stochasticity during inference

We thank the reviewer for the insightful suggestion regarding removing stochasticity during inference. The Poisson-like nature of the stochasticity shifts the activation distribution asymmetrically, so removing stochasticity after training the model with it impairs performance (clean accuracy declined from 71.7% to 64.5%). This phenomenon is also present in VOneNets. On the other hand, the alternative strategy of averaging the logits of multiple stochastic forward passes per image, mirroring an ensemble strategy, is a valid one. Averaging over 3 forward passes increased ImageNet accuracy from 71.7% to 72.2%, corruption robustness from 41.5% to 42.4%, and OOD generalization from 38.7% to 39.2%. We will perform an extensive analysis of how performance varies with the number of forward passes and include it in the final manuscript.
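The ensemble strategy described in this response can be sketched as follows. This is a minimal NumPy illustration; `stochastic_forward` is a hypothetical stand-in for a full noisy model pass, not the paper's actual model:

```python
import numpy as np

def stochastic_forward(x, rng):
    # Hypothetical stand-in for a noisy model pass: fixed class logits
    # perturbed by roughly centered Poisson-like noise.
    logits = np.array([2.0, 1.0, 0.5])
    return logits + rng.poisson(lam=0.5, size=logits.shape) - 0.5

def averaged_prediction(x, n_passes=3, seed=0):
    # Average logits over several stochastic forward passes, then take
    # the argmax, mirroring the ensemble strategy described above.
    rng = np.random.default_rng(seed)
    logits = np.mean([stochastic_forward(x, rng) for _ in range(n_passes)],
                     axis=0)
    return int(np.argmax(logits))
```

With more passes, the averaged logits concentrate around their noise-free means, which is why the ensemble recovers (and slightly exceeds) the single-pass accuracy reported above.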

Empirical tuning curves

We agree with the reviewer that including real neural tuning curves would enhance Figure 2. Our model tuning curves were carefully fitted and aligned to mean LGN response properties from multiple studies, as described in Section 2.2 and Supplementary Material E.2. However, due to the heterogeneity of biological neurons, there is no single “canonical” response to compare against. Our goal was to model the average tuning behavior of foveal P and M cells. That said, we appreciate the need for transparency and will include representative empirical curves alongside our model fits in the Supplementary Material, along with citations and commentary.

Light adaptation pooling size

Indeed, the biological visual system includes both global and local light adaptation. In early experiments, we implemented a local light adaptation layer using a Gaussian filter. However, during Bayesian optimization, the filter radius consistently grew toward the size of the image, effectively becoming global. To avoid unnecessary compute, we opted for global luminance normalization. That said, we emphasize that our model retains multiple local adaptation mechanisms such as DoG filters and a contrast normalization layer, which provide spatially localized gain control and surround modulation. We will clarify this design choice and include ablation results in the Supplementary Material.
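As a rough illustration of the design choice above (the exact form of Eq. 2 is in the paper; dividing by the global mean luminance is an assumption made here for the sketch), global light adaptation makes the response invariant to a uniform rescaling of the image:

```python
import numpy as np

def global_light_adaptation(img, eps=1e-6):
    # Normalize by the global mean luminance; one common form of
    # global light adaptation (the paper's Eq. 2 may differ in detail).
    return img / (img.mean() + eps)
```

Scaling the whole image by a constant leaves the output essentially unchanged; a local variant would instead enforce this invariance per region, at extra compute cost.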

Skip connections

We appreciate the creative architectural suggestion and agree that skip connections from the SubcorticalBlock to the downstream CNN could reflect known projections from LGN to V2. While direct skips from the pixels to V1 are less biologically plausible, incorporating LGN-V2 bypasses is indeed worth exploring. We plan to investigate this for the final version. Such modifications could offer greater flexibility for the network to use or ignore early features depending on the context.
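The proposed wiring could be sketched as follows (a hypothetical sketch with placeholder callables and shapes, not the paper's architecture):

```python
import numpy as np

def forward_with_skips(pixels, subcortical, vone, backbone):
    # Route the subcortical output both through the VOne block and
    # directly to the backbone (mimicking the LGN -> V2 projection).
    s = subcortical(pixels)                  # retina + LGN features
    v = vone(s)                              # V1 features
    feats = np.concatenate([v, s], axis=0)   # channel-wise skip
    return backbone(feats)
```

The backbone then sees both the V1 features and the raw subcortical features, and can learn to weight or ignore either pathway.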

Trainable front-ends

We fully acknowledge the potential benefits of allowing early blocks to be fine-tuned. However, our focus was on designing a fixed-weight front-end that is biologically grounded and not driven by task-based optimization. This approach improves interpretability and alignment with neuroscientific models, and is motivated by prior work showing that fixed-weight architectures like the VOneBlock can achieve competitive robustness. Nonetheless, we agree that enabling training of these early modules is a promising direction. Prior work [6] has shown the utility of this strategy: introducing trainable divisive normalization in a VOneNet-style architecture leads to both enhanced robustness and improved alignment with V1 recordings, as measured by BrainScore. More broadly, letting the front-end adapt to the task might allow neural-like operations to emerge rather than be hand-specified. For example, [7] showed that DoG-like filters can emerge when a convolutional network is constrained by architectural bottlenecks similar to the primate optic tract. This suggests a compelling hybrid route: initialization with biological priors followed by task-dependent fine-tuning, thereby balancing biological plausibility and representational flexibility. We will elaborate on this in the revised Discussion and highlight the possibility of initializing from neuro-inspired weights but allowing task-driven fine-tuning.
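The hybrid route mentioned above, biological initialization followed by optional fine-tuning, amounts to masking updates for the frozen front-end parameters. A schematic sketch (parameter names are illustrative, not the paper's):

```python
import numpy as np

def sgd_step(params, grads, frozen, lr=0.1):
    # One SGD step in which parameters listed in `frozen` keep their
    # (biologically initialized) values while the rest are trained.
    return {
        name: p if name in frozen else p - lr * grads[name]
        for name, p in params.items()
    }
```

Unfreezing the front-end is then just removing names from `frozen`, so the same training loop covers both the fixed-weight and the fine-tuned regimes.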

References

[1] A. Modas, R. Rade, G. Ortiz-Jiménez, S.-M. Moosavi-Dezfooli, and P. Frossard, ‘PRIME: A Few Primitives Can Boost Robustness to Common Corruptions’, in Computer Vision -- ECCV 2022, 2022, pp. 623–640.

[2] D. Hendrycks et al., ‘The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization’, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8320–8329.

[3] S. J. Luck, L. Chelazzi, S. A. Hillyard, and R. Desimone, ‘Neural Mechanisms of Spatial Selective Attention in Areas V1, V2, and V4 of Macaque Visual Cortex’, Journal of Neurophysiology, vol. 77, no. 1, pp. 24–42, 1997.

[4] J. Kubilius et al., ‘Brain-Like Object Recognition with High-Performing Shallow Recurrent ANNs’, in Advances in Neural Information Processing Systems, 2019, vol. 32.

[5] A. Gokce and M. Schrimpf, ‘Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream’. 2025.

[6] A. Cirincione et al., ‘Implementing Divisive Normalization in CNNs Improves Robustness to Common Image Corruptions’, in SVRHM 2022 Workshop @ NeurIPS, 2022.

[7] J. Lindsey, S. A. Ocko, S. Ganguli, and S. Deny, ‘The effects of neural resource constraints on early visual representations’, in International Conference on Learning Representations, 2019.

Comment

Thanks for the thorough responses to my critiques. My opinion of the work has strengthened as a result, and I have increased my score from 3-->5.

Comment

Thank you for your reconsideration. We're glad our responses helped clarify the contributions of our work and will make sure to include them in the final manuscript.

Review (Rating: 5)

This paper introduces the SubCorticalBlock, a new processing block to be used in conjunction with VOneNet convolutional neural networks that simulates processing strategies known to exist in subcortical processing regions of the primate visual system. SubCorticalBlocks are composed of four primary processing mechanisms: Difference-of-Gaussians (DoG) convolutions, light adaptation, contrast normalization, and subcortical noise injection, which are intended to mimic neural processing strategies found in the retina and LGN. The resulting networks, EVNets, are shown to achieve greater alignment with primate V1 responses (as evaluated through BrainScore’s neural predictivity and tuning-alignment evaluations) and improvements in corrupt-image robustness and out-of-distribution robustness compared to baseline ResNet50 and VOneResNet50 models.

Strengths and Weaknesses

Strengths:

  • The SubCortical block is (in theory) agnostic to the downstream CNN processing. The parameters of this processing block are fixed, making it a processing block that could easily be added onto the front of other CNNs.
  • Fixed parameters of the SubCortical block are learned directly from alignment with biological data, providing additional hints to the utility of biological priors in deep neural network processing.
  • The improvement in V1 response property scores is large compared to the ResNet50 and VOneResNet50 baselines. A V1 response property score of 0.808 puts this model at rank 6 for this score among public BrainScore models.
  • In appendix E2, authors show that the SubCorticalBlock generalizes beyond the ResNet50 architecture by adding it to the front of an EfficientNet model, demonstrating generality of the proposed processing block.
  • Paper is clearly written and motivated
  • Ablation studies clearly show the utility of each implemented mechanism of the SubCortical block on most evaluated benchmarks.

Weaknesses:

  • A full breakdown of model performance across each different corruption type would be appreciated. This may help reveal what robustness benefits are actually seen as compared to expectation. For instance, do the contrast normalization and light adaptation mechanisms of the SubCortical block improve robustness to the contrast and brightness corruptions of ImageNet-C?
  • No explanation is provided for why a different receptive field is used (7 degree visual angle) as compared to VOneNets and many existing brain score models. Receptive field size decisions can have a reasonably high impact on brain score results, so should be considered for an additional ablation study.
  • Little insight is provided into why specific tuning properties improve (or worsen) with the introduction of these subcortical processing mechanisms.
  • As is common in the related literature, little insight (beyond numerical results from ablation studies) is provided into WHY we observe improvements in robustness, or into the tradeoffs of using these new processing strategies.

Questions

  • I was left wondering if artificial neural activity measured from the SubCorticalBlock would actually predict biological neural activity measured from subcortical regions (i.e., not just V1) better than models without this processing block. Is there any available dataset that could support this evaluation? If so, a quick experiment may be to evaluate RSA or linear predictivity of SubCorticalBlock representations with this neural data as compared to baseline ResNet50.
  • Is the biological parameterization necessary? For instance, how would different parameterizations impact V1 alignment and robustness scores.
  • Is the VOneBlock necessary? How does EVResNet50 model performance change without it?
  • Table 5: Is an evaluation available for VOneResNet50+PRIME?
  • Tables D1 and D2 provide ablation study results for EVResNet50 on the corrupt image and out of distribution robustness benchmarks. Providing adversarial robustness results for each ablation model could additionally be insightful.

Limitations

Yes

Justification for Final Rating

I maintain my original score (5) for this paper based on the strengths of the paper that I originally listed. In follow-up discussion with the authors, they have additionally helped resolve my remaining concerns regarding receptive field and parameterization choices and have helped detail a subset of their results more clearly.

Formatting Issues

N/A

Author Response

We thank the reviewer for the thoughtful and constructive feedback. We address the reviewer’s main concerns and questions below.

Model performance across individual corruptions

We thank the reviewer for pointing out this omission. We will include figures in the Supplementary Material that display model accuracy across different corruption severities and types. Notably, our functional analysis of the SubcorticalBlock suggests that its integration offers greater robustness to contrast and brightness corruptions compared to others. However, this is not observed in practice: EVNets achieve an accuracy of 32.1% for contrast (vs. 31.6% for VOneNets and 39.4% for ResNet50) and 62.0% for brightness (vs. 64.3% for VOneNets and 66.4% for ResNet50). Although these corruptions are primarily related to changes in luminance and contrast, it is important to acknowledge the additional loss of information that occurs due to encoding quantization in 8-bit RGB with gamma correction. Reducing contrast before saving an image increases quantization error, resulting in more pronounced discretization artifacts. Similarly, increasing brightness leads to overexposed images, where pixel values are clipped to the maximum allowable value, further contributing to information degradation. Ideally, if brightness were adjusted by scaling an image by a constant, without any value clipping or gamma correction, the light-adaptation layer of the SubcorticalBlock would generate the same activation map as for the original image. Considering that this mapping does not reflect true brightness variations, it is not surprising that the gains in brightness robustness are similar to those observed with other types of corruption.
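The quantization argument can be checked numerically. This is a self-contained sketch, not the ImageNet-C pipeline: reducing contrast before 8-bit quantization and then undoing the contrast change amplifies the quantization error relative to quantizing the original image.

```python
import numpy as np

def quantize_8bit(img):
    # Round to 256 levels and back to [0, 1], as when saving the image.
    return np.round(np.clip(img, 0.0, 1.0) * 255) / 255

def roundtrip_error(img, contrast):
    # Reduce contrast about 0.5, quantize, undo the contrast change,
    # and report the worst-case deviation from the original.
    low = 0.5 + contrast * (img - 0.5)
    restored = 0.5 + (quantize_8bit(low) - 0.5) / contrast
    return np.abs(restored - img).max()
```

At contrast 0.1 the worst-case round-trip error is roughly an order of magnitude larger than at contrast 1.0, which is the irreversible information loss discussed above.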

Field of View Changes

In our models, the field of view (FoV) defines the portion of the visual field occupied by the object being classified. The original 8deg FoV in VOneNet was chosen as a rough dataset-wide average, but ImageNet exhibits substantial variability in object size and framing. Moreover, standard ImageNet training involves random cropping, sometimes reducing the input to as little as 8% of the original image, further increasing the variation in apparent FoV during training. In earlier experiments on Tiny ImageNet, we used a 2deg FoV for 64×64 images, corresponding to 32 pixels per degree (ppd), which closely matches the 28 ppd used in the original VOneNet and follows precedents in adapting VOneNets to this dataset [1–3]. To maintain consistent resolution across input sizes, we selected a 7deg FoV for ImageNet-scale images, which keeps the spatial resolution constant and allows for a fair comparison across different input scales while keeping receptive field sizes fixed. Although we did not include a direct baseline with an 8deg FoV for comparison, our VOneNet implementation with a 7deg FoV actually improves over the original VOneNet in clean accuracy (72.9% vs. 71.7%), mean corruption accuracy (40.4% vs. 40.0%), and adversarial robustness (51.5% vs. 51.1%, based on comparable attack conditions) [1]. These results suggest that the change in FoV did not negatively impact performance and may have modestly improved it.
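The resolution match described above is simple arithmetic (assuming the standard 224×224 ImageNet crop, which this response does not restate explicitly):

```python
def pixels_per_degree(image_px, fov_deg):
    # Pixels per degree of visual angle for a square input.
    return image_px / fov_deg

# 64 px over 2 deg (Tiny ImageNet) and 224 px over 7 deg (ImageNet)
# both give 32 ppd, close to the original VOneNet's 28 ppd.
```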

Clarification of tuning property gains

We agree that the manuscript would benefit from a deeper discussion of how subcortical computations influence V1 tuning properties, and we will expand on this point in the revised Discussion section. The observed improvements in V1 alignment, particularly for surround modulation and RF size, stem from the normalization stages implemented in the SubcorticalBlock. These mechanisms enforce local competition and approximate extraclassical phenomena which the VOneBlock alone cannot capture due to its purely classical RF structure. Our results suggest that upstream normalization acts as a surrogate for some of the missing recurrent (and possibly feedback) processes in V1, enhancing the biological plausibility and functional fidelity of the V1 responses observed downstream. This supports the idea that some extraclassical properties can emerge from subcortical preprocessing alone.

Reason behind robustness improvements and need for biological parameterization

We agree that further insights into the source of robustness gains are valuable, and we will expand on this point in the revised Discussion section. As noted in prior work using the VOneNet architecture while varying the V1 properties [3], corruption robustness does not depend on exactly matching biological parameter distributions but rather on the broader structure and covariance of some of these properties. In the case of subcortical computations, our hypothesis is that the normalization layers promote local competition and dynamic-range compression, which are known to improve robustness by reducing sensitivity to input perturbations. Meanwhile, the DoG filtering improves feature selectivity while performing low-pass filtering, mitigating high-frequency noise.
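The band-pass character of DoG filtering can be seen directly from the kernel (sigmas here are illustrative, not the paper's fitted values): the center and surround Gaussians cancel for uniform inputs, so the kernel sums to approximately zero and suppresses the DC component, while the Gaussian envelope damps high-frequency noise.

```python
import numpy as np

def dog_kernel(size=15, sigma_c=1.0, sigma_s=2.0):
    # Difference-of-Gaussians: excitatory center minus inhibitory
    # surround, each a normalized 2D Gaussian.
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    center = np.exp(-r2 / (2 * sigma_c**2)) / (2 * np.pi * sigma_c**2)
    surround = np.exp(-r2 / (2 * sigma_s**2)) / (2 * np.pi * sigma_s**2)
    return center - surround
```

The kernel's near-zero sum is what removes sensitivity to uniform luminance, complementing the explicit light-adaptation stage.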

Evaluating neural predictivity of subcortical neurons

We acknowledge the reviewer’s suggestion to assess alignment with subcortical recordings. However, our SubcorticalBlock is designed to match the average response properties of primate LGN neurons, and does not attempt to capture their full variability or even single-unit responses, in contrast with the VOneBlock. Because of this design choice, applying RSA or linear predictivity on existing single-unit LGN datasets would not be meaningful as our model lacks the heterogeneity required for such comparisons. We believe that future work should focus on expanding the dimensionality and diversity of the SubcorticalBlock before using it in predictive frameworks.

Need for the VOneBlock

In early experiments, we evaluated a version of our model that only includes the SubcorticalBlock. While it retained some robustness benefits, it consistently underperformed the full EVNet architecture in corruption robustness. This suggests that the VOneBlock and SubcorticalBlock confer complementary inductive biases, with one targeting contrast and luminance invariance (Subcortical) and the other contributing spatial and orientation-specific processing. We will include the corresponding ablation results in the final Supplementary Material.

VOneNet + PRIME

Our intent was not to benchmark EVNet against VOneNet under all conditions, but to examine how architectural and training-based robustness strategies interact. Since the impact of combining biologically inspired front-ends with PRIME had not been studied before, we focused on EVNet+PRIME to highlight its cumulative gains. That said, as including VOneResNet50+PRIME in our study would be interesting, we will include these results in our final manuscript.

Adversarial Robustness on Ablations

Due to computational constraints and the high cost of performing Monte Carlo PGD attacks with 32 samples per iteration under stochastic noise, we did not include adversarial robustness results for all ablations. However, we agree this would offer valuable insight, particularly in understanding the role of specific mechanisms (e.g., contrast normalization, light adaptation) under gradient-based attacks. We will include these evaluations in the final manuscript and update the Supplementary Material accordingly.
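For reference, the Monte Carlo PGD step described above averages input gradients over several noise draws before taking the signed step (an expectation-over-transformation style attack). A minimal sketch with a placeholder gradient function; `grad_fn` stands in for backprop through the stochastic model:

```python
import numpy as np

def eot_pgd_step(x, x_clean, grad_fn, rng, n_samples=32, eps=0.03, step=0.01):
    # Average the input gradient over several noise draws (32 here,
    # matching the text), take a signed step, and project back into
    # the eps-ball around the clean input.
    g = np.mean(
        [grad_fn(x, rng.normal(size=x.shape)) for _ in range(n_samples)],
        axis=0,
    )
    return np.clip(x + step * np.sign(g), x_clean - eps, x_clean + eps)
```

The 32 gradient evaluations per iteration are what make this attack costly to run across every ablation model.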

References

[1] A. Baidya, J. Dapello, J. J. DiCarlo, and T. Marques, ‘Combining Different V1 Brain Model Variants to Improve Robustness to Image Corruptions in CNNs’, in SVRHM 2021 Workshop @ NeurIPS, 2021.

[2] A. Cirincione et al., ‘Implementing Divisive Normalization in CNNs Improves Robustness to Common Image Corruptions’, in SVRHM 2022 Workshop @ NeurIPS, 2022.

[3] R. Barbulescu, T. Marques, and A. L. Oliveira, ‘Contribution of V1 Receptive Field Properties to Corruption Robustness in CNNs’, in ECAI 2024, IOS Press, 2024, pp. 658–665.

Comment

Thank you, authors, for the time and effort that you have put into answering my questions. I appreciate the detailed responses and encourage you to include these additional analyses and details as discussion points or supplemental material in your revised manuscript.

Comment

Thank you for your feedback. We are glad the responses were helpful and will make sure to incorporate them in the final manuscript.

Review (Rating: 5)

The manuscript proposes a biologically plausible front-end layer for convolutional neural networks, aiming to better align representations between artificial and biological neural systems. To this end, the authors introduce a SubCorticalBlock that incorporates several stages inspired by the early visual system, including light adaptation, centre-surround modulation, contrast normalisation, and neural noise. The proposed framework is extensively evaluated across various datasets, including V1 response properties, corrupted versions of ImageNet, and adversarial attack scenarios. The main finding is that incorporating the SubCorticalBlock generally improves network performance on degraded or "unclean" images and enhances robustness to adversarial attacks—suggesting a closer alignment with how humans process visual information.

Strengths and Weaknesses

  1. The manuscript is well written and easy to follow. The source code is publicly available, which enhances transparency and reproducibility. The idea of the SubCorticalBlock is scientifically sound, well executed, and represents a valuable multidisciplinary contribution.

  2. A significant limitation of the current manuscript is that the proposed biologically inspired SubCorticalBlock is only tested as the front-end of two architectures: ResNet-50 and EfficientNet-B0. It is not evident that the same front-end would provide similar benefits for other convolutional architectures. This assumption should be validated through additional experiments. In particular, I recommend evaluating simpler architectures such as AlexNet and VGG, which are in some ways more biologically plausible and could offer additional insight.

  3. While the availability of source code supports reproducibility, the manuscript lacks important implementation details necessary for fully understanding the method and interpreting the results. For example, what are the sizes of the designed DoG kernels in the P and M pathways? What values are used for surround inhibition? A tabular summary of these key parameters—presumably optimised through hyperparameter tuning—would be highly informative.

  4. There is a substantial body of classical computer vision literature that incorporates similar biological mechanisms to those proposed in this manuscript. For example, Difference-of-Gaussians (DoG) models have been extensively used to describe surround modulation in boundary detection, or to model beyond classical receptive fields in colour constancy algorithms. This literature is currently absent from the related work section. Including it would help contextualise the benefits of biologically inspired algorithms beyond the deep learning era and demonstrate continuity with earlier work in computational neuroscience and vision science.

Questions

  1. The rationale behind certain modifications to the original VOneNet architecture is not clearly explained. For example, why was a 7-degree field of view (FoV) used instead of the original 8 degrees? Is this change supported by neuroscientific evidence, or was it an engineering decision? If it is the latter, it is important to clarify how sensitive the reported results are to this parameter. A similar clarification is needed for the changes made to the Gabor Filter Bank (GFB).

  2. It is unclear why P-cells were excluded from the ablation experiments. Given their relevance to visual processing, an investigation of their contribution would provide a more complete understanding of the subcomponents' functional roles.

  3. It would strengthen the evaluation to include ablation results for V1 response properties, similar to those reported in Table 1 of the main manuscript. This would offer a more comprehensive view of how each component influences biological plausibility. Similar ablation results for (EV)EfficientNet-B0 would also be valuable.

  4. The ablation results suggest that modelling M-cells may not significantly benefit performance. This raises the broader question of whether the performance differences observed across the components are statistically significant. In some cases, improvements appear to be marginal (i.e. in the decimal range), which might not be meaningful without appropriate statistical testing.

  5. What is the total number of parameters introduced by the proposed subcortical layer? It would be useful to understand how this additional complexity contributes to the observed performance gains.

Limitations

Yes.

Justification for Final Rating

I believe the manuscript builds nicely on an important previous work (VOneNets), and the findings are of multidisciplinary interest. Therefore, I increase my rating to 5.

Formatting Issues

  1. It would be beneficial for readers to have the names of all authors in the Reference List instead of abbreviating it with 'et al.'
Author Response

We thank the reviewer for their constructive feedback and for highlighting the strengths of our work, including the scientific soundness of the SubcorticalBlock and the manuscript’s clarity. We address the reviewer’s main concerns and questions below.

Generalization to Other Architectures

Our initial focus on ResNet-50 and EfficientNet-B0 was shaped both by computational constraints and by the precedent established in prior work on VOneNets [1], which demonstrated that biologically inspired front-ends can generalize well across diverse CNN backbones, including AlexNet and CORNet. Nonetheless, to further assess the generality of EVNets, we are conducting new experiments applying the EVNet front-end to other simpler back-end architectures, such as AlexNet, and will include these results in the final version.

Implementation Details and Parameter Summary

We agree that greater transparency in parameterization would be helpful. In the final version, we will include a table summarizing all SubcorticalBlock parameters, including DoG kernel sizes, surround inhibition values, and parameters derived from our neurophysiologically-constrained hyperparameter tuning. While many values were derived through empirical alignment with subcortical data (see Table E5 in Supp. E.2), we acknowledge the need for easier reference.

Connection to Classical Computer Vision Literature

We appreciate the reviewer’s suggestion to better contextualize our approach within the classical computer vision literature. In the final version, we will add references and discussion on early computational models of center-surround mechanisms and DoG-based edge and contrast processing (e.g., Marr & Hildreth [2]), as well as their influence on tasks such as boundary detection and color constancy. We agree that this will provide valuable historical perspective and highlight the continuity between classical vision and modern biologically-inspired deep learning.

Field of View Changes

In our models, the field of view (FoV) defines the portion of the visual field occupied by the object being classified. The original 8 deg FoV in VOneNet was chosen as a rough dataset-wide average, but ImageNet exhibits substantial variability in object size and framing. Moreover, standard ImageNet training involves random cropping, sometimes reducing the input to as little as 8% of the original image, further increasing the variation in apparent FoV during training. In earlier experiments on Tiny ImageNet, we used a 2 deg FoV for 64×64 images, corresponding to 32 pixels per degree (ppd), which closely matches the 28 ppd used in the original VOneNet and follows precedents in adapting VOneNets to this dataset [3–5]. To maintain consistent resolution across input sizes, we selected a 7 deg FoV for ImageNet-scale images, which allows for fair comparison across different input scales while keeping receptive field sizes fixed. Although we did not include a direct baseline with an 8 deg FoV for comparison, our VOneNet implementation with a 7 deg FoV actually improves over the original VOneNet in clean accuracy (72.9% vs. 71.7%), mean corruption accuracy (40.4% vs. 40.0%), and adversarial robustness (51.5% vs. 51.1%, based on comparable attack conditions) [1]. These results suggest that the change in FoV did not negatively impact performance and may have modestly improved it.

Gabor Filter Bank Changes

Regarding the Gabor Filter Bank (GFB), we extended its spatial frequency (SF) range to better match empirical V1 distributions. In the original VOneNet implementation, the range of SFs processed by the GFB was sampled from a binned empirical distribution of V1 receptive field preferences [5], but was constrained by the Nyquist frequency of the model. For 224×224 images, the Nyquist frequency $f_N$ along the diagonal is given by $\frac{1}{2\sqrt{2}}\ \text{px}^{-1}$. With the original 8 deg FoV, this yields $f_N$ = 9.9 cpd, and the maximum SF used in the GFB was 5.6 cpd, as the next bin (8 cpd) was considered too close to the Nyquist limit to be reliably represented. By decreasing the FoV to 7 deg, we increase the model’s resolution, and the Nyquist frequency rises to 11.3 cpd. This allows us to safely include higher-frequency components in the GFB, expanding the range of SFs modeled to 8 cpd and thus achieving better coverage of the empirical V1 SF distribution, while maintaining a similar safety margin with respect to the Nyquist limit. This modification enhances the biological plausibility of the GFB’s tuning properties without altering the overall architecture of the VOneBlock.
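For concreteness, the resolution bookkeeping above can be sketched in a few lines (the helper names are illustrative, not from our codebase; the 224 px image size and FoV values are those discussed in the text):

```python
import math

def pixels_per_degree(image_px: int, fov_deg: float) -> float:
    """Pixels per degree of visual angle for a square input."""
    return image_px / fov_deg

def nyquist_cpd(image_px: int, fov_deg: float) -> float:
    """Diagonal Nyquist frequency in cycles per degree:
    (1 / (2 * sqrt(2))) px^-1 converted to cpd via the ppd."""
    return pixels_per_degree(image_px, fov_deg) / (2 * math.sqrt(2))

# 224 px at an 8 deg FoV gives ~9.9 cpd; shrinking the FoV to 7 deg
# raises the resolution to 32 ppd and the Nyquist limit to ~11.3 cpd.
```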

Ablations

Our initial ablation strategy focused on understanding how individual components modulate the behavior of a linear P-cell pathway, which we used as a baseline due to its known dominance in ventral stream projections [7]. Omitting the P pathway would remove both chromatic information and high-acuity processing, making the ablated model qualitatively unlike the full system. On the other hand, omitting the M pathway did not significantly affect performance, which is aligned with the hypothesis that it does not contribute significantly to ventral stream processing. However, we agree that including a P-cell ablation improves completeness and interpretability, and we have since trained one seed of an EVResNet-50 with the P-cell pathway omitted. This model achieves an accuracy of 62.6% (vs. 71.7% for the full EVResNet-50) on the ImageNet validation set, a mean corruption accuracy of 35.6% (vs. 41.5%), and a mean out-of-domain accuracy of 33.9% (vs. 38.7%). We appreciate the suggestion to evaluate the impact of each component on V1 response properties. To support a deeper understanding of their individual contributions to V1 alignment, we will add a table with ablation results to the Supplementary Material. We will also extend the ablation results to the other back-end architectures used.

Statistical Significance

While some improvements may appear modest (e.g., sub-percentage gains), the consistency of results across seeds and benchmarks (e.g., shape bias, corruption robustness) suggests genuine benefits. Evaluating statistical significance between different models would require training a much larger number of seeds.

Parameters introduced by the Subcortical layer

Regarding model complexity, we emphasize that the SubcorticalBlock introduces no trainable parameters. All components (DoG filters, normalization stages, and noise generators) are parameterized based on biological priors and remain fixed during training. The complete SubcorticalBlock configuration relies on six hyperparameters (see Supplementary Table E6), which are optimized to align model responses with empirical subcortical properties.
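For illustration, a fixed-weight center-surround Difference-of-Gaussians filter of the kind used in the SubcorticalBlock can be built with no trainable parameters; the kernel size, sigmas, and surround gain below are hypothetical placeholders, not the tuned values from Supplementary Table E6:

```python
import numpy as np

def dog_kernel(size: int, sigma_c: float, sigma_s: float, k_s: float = 1.0) -> np.ndarray:
    """Difference-of-Gaussians kernel: excitatory center minus scaled inhibitory surround."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    def gauss(sigma):
        g = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
        return g / g.sum()  # unit-volume normalization
    return gauss(sigma_c) - k_s * gauss(sigma_s)

# Center-on kernel: positive at the center, inhibitory in the surround,
# with near-zero DC response when k_s = 1 (both Gaussians sum to one).
k = dog_kernel(15, sigma_c=1.0, sigma_s=3.0)
```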

Reference list formatting

We thank the reviewer for the suggestion. In the final camera-ready version, we will ensure that all author names are listed in full in the References section.

References

[1] J. Dapello, T. Marques, M. Schrimpf, F. Geiger, D. Cox, and J. J. DiCarlo, ‘Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations’, in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 13073–13087.

[2] D. Marr, E. Hildreth, and S. Brenner, ‘Theory of edge detection’, Proceedings of the Royal Society of London. Series B. Biological Sciences, vol. 207, no. 1167, pp. 187–217, 1980.

[3] A. Baidya, J. Dapello, J. J. DiCarlo, and T. Marques, ‘Combining Different V1 Brain Model Variants to Improve Robustness to Image Corruptions in CNNs’, in SVRHM 2021 Workshop @ NeurIPS, 2021.

[4] A. Cirincione et al., ‘Implementing Divisive Normalization in CNNs Improves Robustness to Common Image Corruptions’, in SVRHM 2022 Workshop @ NeurIPS, 2022.

[5] R. Barbulescu, T. Marques, and A. L. Oliveira, ‘Contribution of V1 Receptive Field Properties to Corruption Robustness in CNNs’, in ECAI 2024, IOS Press, 2024, pp. 658–665.

[6] R. L. De Valois, D. G. Albrecht, and L. G. Thorell, ‘Spatial Frequency Selectivity of Cells in Macaque Visual Cortex’, Vision Research, vol. 22, pp. 545–559, 1982.

[7] J. Nassi and E. M. Callaway, ‘Parallel processing strategies of the primate visual system’, Nature Reviews Neuroscience, vol. 10, pp. 360–372, 2009.

评论

Thank you for your response. I believe the manuscript builds nicely on an important previous work (VOneNets), and the findings are of multidisciplinary interest. Therefore, I increase my rating to 5.

评论

Thank you for your reconsideration. We're glad that you find the work a valuable contribution, and we appreciate your support for its multidisciplinary relevance.

审稿意见
3

The authors introduce Early Vision Networks (EVNets) as an extension of standard convolutional deep networks with bio-inspired early-layer computations. The authors first process images with two parallel pathways inspired by the parvo- and magnocellular pathways in biological vision. Responses in both pathways are subject to a battery of contour-strengthening (DoG filters) and normalization operations. The authors perform experiments with the proposed EVNet (which prepends a subcortical block and VOneBlock to a ResNet-50) and show that its robustness to a variety of image distortions is significantly better than that of a baseline ResNet-50, although they observe a decline in clean-image accuracy.

优缺点分析

Strengths:

  • Theoretical soundness: I believe the paper proposes a well-motivated extension of VOneBlock with magno- and parvocellular pathways, which seem to have slightly different response properties, as shown in Fig. 1C. This could potentially be a beneficial inductive prior to add to DNNs for enhancing their robustness.
  • Reproducing receptive field properties of biological neurons: Figure 2 is interesting, it is nice that the authors are able to replicate the receptive field tuning observed in biological cells using EVNets.

Weaknesses:

  • Novelty: First, several prior works have shown that adding the components of EVNet mentioned here (opponent-channel DoG filters, low- and high-SF pathways, contrast normalization, stochastic responses) can enhance the robustness of DNNs; some of this was in fact shown in the VOneNet paper that the authors are extending. For this reason, I don't find a significantly novel contribution in this work.
  • Weak baselines: The proposed EVNets are only evaluated against VOneNet and an ImageNet-trained ResNet-50. If the authors want to convincingly present bio-inspired operations as a valid defense against image deformations, they must compare to other, non-bio-inspired approaches to enhancing perceptual robustness in DNNs. Also, given the several recent advancements in image-processing deep neural networks, I'm skeptical that the advantages of adding the subcortical module and VOneBlock, claimed via comparisons with ResNet-50, will transfer to recent DNNs that have scaled significantly in number of parameters, training dataset size, and training objectives.
  • Clarity on how the paper influences neuroscience / machine learning: There are several bio-inspired architectures like CORNets [1], hGRU [2], ConvRNNs [3] etc. and this paper seems to be a new addition to the line of such architectures. Given the low novelty, I am unsure what we learn about visual neuroscience that we haven't already known from the previous architectures. Since the clean image accuracy of EVNets is worse when compared to ResNet-50, there isn't a strong architectural contribution from a machine learning standpoint as well.

References:

  1. Kubilius, J., Schrimpf, M., Kar, K., Rajalingham, R., Hong, H., Majaj, N., ... & DiCarlo, J. J. (2019). Brain-like object recognition with high-performing shallow recurrent ANNs. Advances in neural information processing systems, 32.
  2. Linsley, D., Kim, J., Veerabadran, V., Windolf, C., & Serre, T. (2018). Learning long-range spatial dependencies with horizontal gated recurrent units. Advances in neural information processing systems, 31.
  3. Nayebi, A., Bear, D., Kubilius, J., Kar, K., Ganguli, S., Sussillo, D., ... & Yamins, D. L. (2018). Task-driven convolutional recurrent models of the visual system. Advances in neural information processing systems, 31.

问题

Please refer to my review above. My main questions are about the novelty of the proposed EVNet architecture, how it advances either our understanding of biological/machine visual perception or how it improves the state of the art in deep-learning based computer vision.

局限性

yes.

最终评判理由

I have read the author rebuttal and other reviews. I am raising my score since some of my concerns are addressed by the rebuttal. I am not recommending acceptance since my concerns about novelty and performance gains are not adequately addressed.

格式问题

NA.

作者回复

We thank the reviewer for their assessment and appreciate the recognition of our work’s theoretical grounding and our efforts to replicate biological receptive field properties. We address the main points of concern below.

Novelty

While it is true that individual components of our model (e.g., DoG filters and normalization mechanisms) have been previously explored, our contribution lies in the systematic integration of these components into a unified module that simulates responses of subcortical neurons in the primate retina and LGN. Crucially, this is not a heuristic assembly: we introduce a novel, neurophysiologically-constrained hyperparameter tuning procedure that calibrates the SubcorticalBlock according to average empirical response property values. This represents a methodological advance over prior work, which often relied on loosely inspired architectural motifs without systematic alignment to subcortical physiology. Additionally, EVNets establish a modular architecture, in the form of a CNN, that explicitly models the early visual cascade from the retina up to V1, enabling a layer-wise commitment to biological function. This fills a notable gap in the literature and in platforms like Brain-Score, which, despite offering 100+ benchmarks for measuring alignment across the whole ventral stream, still lack tools for evaluating subcortical alignment, partially because no models have made this level of biological commitment until now. Therefore, while the building blocks may not be individually novel, their integrated and parameter-constrained formulation, as well as the broader architectural strategy of EVNets, represents a meaningful step forward.

Baselines and Comparisons

We agree that comparing against non-biological robustness methods is important. For this reason, we include both adversarial training and PRIME augmentation as strong baselines (Table 5). EVNet+PRIME outperforms both, indicating that the biological priors introduced by EVNets offer complementary and additive gains over established training-based defenses. We focus on these particular methods because they remain the most effective and widely adopted approaches for improving adversarial robustness and robustness to image corruptions. Adversarial training (e.g., PGD) is a long-standing standard for defending against white-box attacks, while PRIME outperforms other augmentation techniques on ImageNet-C [1]. In addition, earlier forms of the data augmentations used in PRIME have been shown to outperform a 1000-fold increase in training data on ImageNet-C, as well as large-scale pretraining and even self-attention mechanisms on out-of-domain datasets such as ImageNet-R [2]. Furthermore, the objective of our work is not to present a new SOTA defense but to evaluate whether modeling the early visual system in greater detail enhances both robustness and brain alignment. Thus, comparing primarily against biologically-inspired and standard CNNs is most appropriate for isolating the effects of architectural priors. As for the back-end architectures used, in Supplementary Material D.2 we show results for EVNets with an EfficientNet-B0 back-end, where the robustness benefits persist, and we will include an additional back-end architecture in the final manuscript.

Impact on neuroscience and machine learning

While EVNets belong to the broader category of bio-inspired models, they target a distinct and underexplored segment of the visual hierarchy. Unlike prior work focusing on high-level cortical motifs such as recurrence (e.g., CORnet, hGRU, ConvRNNs), our architecture explicitly models early vision through a modular assembly of empirically constrained subcortical and V1-like components. This focus on simulating retinal and LGN computations fills a gap left by most prior biologically-inspired CNNs, which typically begin at or beyond V1. Moreover, as shown in our work, such early vision modules are not mutually exclusive with cortical models; rather, they are complementary. EVNets provide a foundation that can be integrated with cortical motifs, potentially yielding even greater brain alignment and robustness gains.

From a neuroscience perspective, our findings demonstrate that (1) enhanced V1 tuning and extra-classical properties can emerge solely from upstream (subcortical) computations, and (2) compositionality of biologically grounded modules can improve overall alignment without additional supervised fitting. From an ML perspective, we show that (1) explicitly modeling subcortical mechanisms can serve as a robustness-enhancing architectural prior, and (2) these neuro-inspired priors are orthogonal to training-based methods, as shown by the additive improvements when EVNets are combined with PRIME. Thus, the work contributes insights into how biological fidelity at early stages can inform both our understanding of visual representations and robust ML design.

In sum, we believe our paper contributes a meaningful advance by introducing a novel, empirically aligned model of subcortical processing, demonstrating emergent V1 phenomena from upstream modules, and showing robustness gains that are complementary to SOTA training methods. We hope these clarifications will help the reviewer reconsider the novelty and impact of our work.

References

[1] A. Modas, R. Rade, G. Ortiz-Jiménez, S.-M. Moosavi-Dezfooli, and P. Frossard, ‘PRIME: A Few Primitives Can Boost Robustness to Common Corruptions’, in Computer Vision -- ECCV 2022, 2022, pp. 623–640.

[2] D. Hendrycks et al., ‘The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization’, in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8320–8329.

评论

We hope you had a chance to read our rebuttal. Please let us know if there are any remaining questions or concerns about our work that we could help clarify. We would be happy to provide further details if needed.

评论

@Reviewer g9Dr -- could you please let us know whether the rebuttal addresses your concerns? I am especially eager to hear your thoughts regarding baseline comparisons. Thanks

评论

Dear Authors,

Thank you for your detailed response to my review. I appreciate your elaboration of the model's novelty, comparison to stronger baselines and clarifying the contribution to machine learning / neuroscience. Here are my thoughts after reading your rebuttal and other reviews.

  1. Model's novelty: I agree with the authors that the proposed combination of M and P cells, DoG filter convolution, contrast normalization and neural stochasticity is not arbitrary and appreciate the resemblance to biological anatomical constraints. I believe the introduction of M and P cell pathways along with DoG filter-based convolution is the main innovation, since divisive contrast normalization and neuronal stochasticity have been widely studied in prior work. It would be great if the authors could please emphasize this in their submission. I'm still unconvinced that these interventions meet the originality standards of the main conference track at NeurIPS.

  2. Robustness improvements produced by EVNet: I thank the authors for highlighting Table 5 with robustness results compared against PRIME and Adversarial Training. I see comparison to VOneNet in the top part of the table but the authors don't compare EVNet + PRIME / AT with VOneNet + PRIME / AT. I think this is an important comparison to make especially since Table 4 shows that VOneResNet50 is largely outperforming EVResNet-50 on adversarial robustness. Overall, I don't see clear trends that EVNet is better than ResNet-50 and VOneResNet-50.

  3. Contributions to ML / Neuroscience: Thanks to the authors for highlighting their efforts to bridge the gap in simulating the retinal and thalamocortical pathways in CNNs. This is a significant contribution, I agree with the authors that prior work mostly focuses on just retinal / V1 modeling.

In summary, the author rebuttal certainly improved my understanding of their contributions and addressed some of my critiques. However, I maintain my stance of low novelty and unclear gains wrt VOneNet that shares a bunch of components with EVNet. I am raising my score since some of my concerns were addressed, but still feel that the paper has significant room for improvement before acceptance at NeurIPS owing to the mixed improvement signals over VOneNet and potentially low novelty.

评论

Thank you for your reconsideration and constructive feedback.

We agree that a key source of novelty in our work lies in the biologically inspired integration of P and M-cell pathways. We will make this distinction more explicit in the revised manuscript, setting it apart from previous approaches such as DoG filtering, divisive normalization, and noise injection.

Regarding our robustness improvements, we agree that including EVNet and VOneNet results under both PRIME and adversarial training would strengthen our claims. Due to computational constraints, we can't train both EVNet and VOneNet on ImageNet with adversarial training, but we'll include the VOneNet+PRIME results in the final manuscript.

最终决定

The paper introduces Early Vision Networks (EVNets), which combine biologically-inspired subcortical processing (retina and LGN) with existing V1-inspired modules to improve CNN robustness. The key contribution is a fixed-weight SubcorticalBlock that systematically integrates Difference-of-Gaussians filtering, contrast normalization, and neural noise, calibrated to neurophysiological data, with the goal of enhancing both brain alignment and robustness to corruptions and domain shifts.

The reviewers generally found the idea scientifically sound and well-motivated, with clear writing, code availability, and some notable robustness gains. Strengths include stronger alignment with V1 benchmarks and improved robustness to image perturbations. Weaknesses centered on limited novelty (building heavily on VOneNet with only incremental extensions), relatively weak or incomplete baseline comparisons (e.g., not fully evaluated against other SOTA robustness methods or newer architectures), and limited insight into why the observed improvements arise. Some reviewers also flagged missing implementation details and limited ablation coverage.

The authors’ rebuttal and subsequent exchanges clarified their novelty claims (emphasizing systematic integration of subcortical pathways and empirically constrained hyperparameters), provided additional baseline results (including PRIME and adversarial training), and committed to expanding analyses (e.g., ablations, parameter tables, classical vision literature, and more architectures). These responses addressed several clarity concerns and convinced most reviewers that the work represents a meaningful, if incremental, advance. Still, skepticism remains regarding originality relative to prior VOneNet work and the magnitude of robustness improvements.

After discussion, two reviewers maintained strong support based on the quality of execution and multidisciplinary relevance, while one reviewer remained cautious about novelty but acknowledged that the rebuttal improved their assessment. Overall, the consensus is that the contribution, though incremental, is solid and of interest to the NeurIPS community, with sufficient empirical support and reproducibility.

Given all this, the AC recommends the paper be accepted. The work is not a major breakthrough but provides a useful addition to the growing line of bio-inspired models, and the community can benefit from its careful integration of subcortical mechanisms. The authors are encouraged to strengthen their comparisons and better contextualize the novelty in the camera-ready version.

公开评论

We thank all reviewers for their constructive and insightful feedback. We carefully revised the manuscript. Below we summarize the main changes introduced in the revised version.

EVNet Variants

We expanded the analysis of architectural contributions by training 7 EVResNet-50 variants: 6 targeted ablations and 1 architectural addition. In addition to the ablations previously reported, we separately removed the P pathway and the VOneBlock, and introduced a new LGN–V2 skip connection, implemented by concatenating SubcorticalBlock activations with the VOneBlock bottleneck output. We tested the adversarial performance of all EVNet variants under a restricted attack set. The results are summarized in Table 6. We further evaluated all variants on V1 predictivity, V1 response property alignment, and shape bias (Table D1).
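As a rough sketch of the skip connection just described (the channel counts and spatial sizes below are hypothetical, not the exact model dimensions):

```python
import numpy as np

# Hypothetical activation shapes (batch, channels, height, width); the real
# channel counts come from the SubcorticalBlock and VOneBlock configurations.
subcortical_out = np.zeros((1, 6, 56, 56))   # SubcorticalBlock activations
vone_bottleneck = np.zeros((1, 64, 56, 56))  # VOneBlock bottleneck output

# LGN–V2 skip: concatenate along the channel axis before the back-end,
# so downstream layers see both subcortical and V1-like features.
skip_input = np.concatenate([subcortical_out, vone_bottleneck], axis=1)
```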

Discussion

We clearly emphasized that the explicit modeling of the P and M pathways represents the core novelty of our work, distinguishing it from previous approaches. We expanded the discussion on the subcortical influence on cortical processing, and provide a refined interpretation of robustness gains. Additionally, we broadened the connection to the computer vision literature and included a reflection on the “bitter lesson” trade-off, and highlighted the potential of neuro-inspired initialization for future work.

Baselines

We added VOneNet + PRIME results to Table 5. EVNet + PRIME continues to outperform this configuration in terms of the aggregate Robustness Score. To validate generalization, we extended our back-end study by incorporating CORnet-Z as a third architecture alongside ResNet50 and EfficientNet-B0. This addition demonstrates the complementarity of EVNet with other neuro-inspired backbones while preserving architectural simplicity. EVCORnet-Z achieved a higher Robustness Score than its unmodified counterpart, reinforcing the generality of the approach.

Supplementary Material

We included all SubcorticalBlock parameter values and justified the design choices regarding FoV and GFB modifications. We clarified the motivation for adopting local light adaptation over global normalization and added a new figure (Fig. C4) showing the top-1 accuracy across corruption severities. A new ensembling experiment was added to evaluate the effect of multiple forward passes on clean, corrupted, and OOD data. Finally, we included representative empirical tuning curves that illustrate the surround suppression and contrast saturation effects in V1 neurons.

Light Adaptation

During adversarial testing, we identified that the light adaptation module used in the SubcorticalBlock could unintentionally induce gradient masking [1] because of the instability introduced by the $\gamma$ exponent, which attenuated PGD attack efficacy for stronger attacks under the $L_1$ and $L_2$ norms. To prevent this effect, we revised the normalization equation to follow a smoother, strictly monotonic form, consistent with previous formulations [2, 3]:

$$\mathbf{x}_\text{LA} = \frac{\mathbf{x} - \bar{\mathbf{x}}}{\bar{\mathbf{x}}}$$

Empirically, the new formulation produced minor changes in aggregate results. Although V1 predictivity decreased slightly (a 0.015 drop w.r.t. the EVNet with the previous version), we measured an increase in alignment for V1 response properties (+0.018) and shape bias (+2.6%). Clean-image accuracy decreased marginally (-1.4%), while the overall Robustness Score improved slightly (+0.7%), mainly because of enhanced adversarial robustness (+2.4%).
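A minimal sketch of the revised light adaptation, assuming $\bar{\mathbf{x}}$ is a mean luminance estimate (approximated here by a global mean for simplicity; the model uses a local estimate):

```python
import numpy as np

def light_adapt(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Revised light adaptation x_LA = (x - mean) / mean: smooth and strictly
    monotonic in x, unlike the previous exponent-based form."""
    m = x.mean() + eps  # stands in for x-bar; a local luminance estimate in the model
    return (x - m) / m
```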

With this correction, we decreased the ensemble size used for adversarial attacks, employing the same attack as in the original VOneNet study [4], ensuring comparability. We also reproduced all attack controls used in the original paper and confirmed that EVNet exhibits consistent robustness without signs of gradient masking. Finally, all EVNet-dependent models were retrained with the new light adaptation module.

Related Work and Formatting

We expanded the Related Work section with a new paragraph reviewing early computational models, and revised the reference list to include complete author names.


References

[1] Anish Athalye, et al. “Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples”. Proceedings of the 35th International Conference on Machine Learning. ICML, 2018.

[2] Matteo Carandini and David J. Heeger. “Normalization as a canonical neural computation”. Nature Reviews Neuroscience, 2012.

[3] Alexander Berardino, et al. “Eigen-Distortions of Hierarchical Representations”. Advances in Neural Information Processing Systems. NIPS, 2017.

[4] Joel Dapello, et al. “Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations”. Advances in Neural Information Processing Systems. NeurIPS, 2020.