Brain-inspired $L_p$-Convolution benefits large kernels and aligns better with visual cortex
Abstract
Review and Discussion
This paper primarily investigates variations in local connectivity patterns within CNNs, examining whether incorporating biologically inspired connectivity structures can improve model performance and increase alignment with brain representations. Specifically, the authors introduce Lp-convolution, which utilizes the multivariate p-generalized normal distribution (MPND). The proposed adaptable Lp-masks aim to bridge the gap between artificial and biological connectivity patterns, finding optimal configurations through task-based adaptation to enable strong performance in tasks requiring flexible receptive field shapes.
Strengths
- Experimental results indicate that CNN neural representations exhibit a stronger alignment with the visual cortex when the Lp-mask shape approximates a Gaussian distribution.
- Testing the conformational adaptability of Lp-masks in the Sudoku challenge yielded interesting results, highlighting the flexibility of this approach.
Weaknesses
- Consistency in terminology would improve clarity; alternating between “Lp-convolution” and “Lp-mask” can be confusing. Using a single term throughout would make the concepts easier to follow.
- The mention of Vision Transformers (ViTs) in the introduction feels tenuous, as they are not included in subsequent experiments, nor are they closely related to the main theme of the paper.
- In lines 108-110, where it is stated that “CNNs have rectangular, dense, and uniformly distributed connections, as opposed to the circular, sparse, and normally distributed connections in biological neurons,” this description would benefit from supporting references regarding the shapes of receptive fields in biological neurons. It’s also worth questioning whether this statement accurately characterizes CNN weights, as CNNs trained to model retinal ganglion cells, for instance, have demonstrated sparse weight patterns ([1]-[5]).
- Lines 137-138 mention that “we optimized parameters of p and σ in MPND (Fig.1e, Eq.1)...” However, Eq.1 and the text do not define σ. It’s also recommended that the authors confirm Eq.1’s form by referencing the standard expression of a multivariate Gaussian function.
- Integrating Lp-masks in CNNs does not appear to significantly improve recognition accuracy across datasets. Comparing this approach to ViTs, it’s unclear if it achieves current state-of-the-art performance.
- The justification for using large, sparse kernels feels somewhat weak. Aside from achieving marginal improvements in RSM alignment with the visual cortex, it’s unclear how this approach benefits contemporary computer vision tasks.
References:
[1] Maheswaranathan, Niru, et al. "Deep learning models reveal internal structure and diverse computations in the retina under natural scenes." BioRxiv (2018): 340943.
[2] Tanaka, Hidenori, et al. "From deep learning to mechanistic understanding in neuroscience: the structure of retinal prediction." Advances in neural information processing systems 32 (2019).
[3] Lindsey, Jack, et al. "A unified theory of early visual representations from retina to cortex through anatomically constrained deep CNNs." arXiv preprint arXiv:1901.00945 (2019).
[4] Yan, Qi, et al. "Revealing fine structures of the retinal receptive field by deep-learning networks." IEEE transactions on cybernetics 52.1 (2020): 39-50.
[5] Zheng, Yajing, et al. "Unraveling neural coding of dynamic natural visual scenes via convolutional recurrent neural networks." Patterns 2.10 (2021).
Questions
- In the authors' claim regarding "Lp-convolution with biological constraint," specifically the "Gaussian structured sparsity," what theoretical and empirical evidence supports this biological constraint?
- Across various experiments, since p is a learnable parameter, what typical values does it converge to, and are there any observable trends or variations across different datasets? Could the authors interpret these findings in relation to biological insights?
We sincerely thank the reviewer for recognizing our contributions, particularly acknowledging our effort to demonstrate the effectiveness of Lp-Convolution through the Sudoku Challenge. Additionally, we deeply value the constructive feedback provided, which offers significant opportunities to enhance the quality of our paper. Below, we address the reviewer's comments and provide detailed clarifications.
Weakness 1) Inconsistent and Confusing Terminology for Key Components
First, we would like to clarify the terminology:
- Lp-Mask: the trainable mask applied to convolutional weights (Eqn. 2).
- Lp-convolution: the overall convolution process incorporating the Lp-mask (Eqn. 3).
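As a rough illustration of these two components, the sketch below builds an Lp-mask from a simplified, isotropic form of the MPND envelope, exp(-(|dx|^p + |dy|^p) / σ^p), and applies it channel-wise to convolutional weights. The function names and the exact normalization here are ours, not the paper's; Eqns. 1-3 give the precise trainable formulation:

```python
import numpy as np

def lp_mask(kernel_size, p, sigma):
    """Simplified Lp-mask: an MPND-style envelope exp(-(|dx|^p + |dy|^p) / sigma^p).

    p = 2 gives a Gaussian-like (circular) mask; large p approaches a
    uniform (rectangular) mask, recovering standard convolution.
    """
    c = (kernel_size - 1) / 2.0
    ax = np.arange(kernel_size) - c
    dx, dy = np.meshgrid(ax, ax, indexing="ij")
    return np.exp(-(np.abs(dx) ** p + np.abs(dy) ** p) / (sigma ** p))

def lp_conv_weights(weights, p, sigma):
    """Apply the Lp-mask channel-wise to a (out, in, k, k) weight tensor."""
    k = weights.shape[-1]
    return weights * lp_mask(k, p, sigma)

w = np.random.randn(8, 4, 7, 7)          # a large 7x7 kernel bank
masked = lp_conv_weights(w, p=2.0, sigma=2.0)  # Gaussian-shaped connectivity
```

The masked weights would then be used in an ordinary convolution; in the actual method, p (and σ) are trainable so the mask shape adapts to the task.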
To address this, we will revise the manuscript as follows:
- Clearly highlight the distinction between the two terms in Section 3.
- Use “Lp-convolution” consistently throughout the text, reserving “Lp-mask” for contexts where it is explicitly relevant.
We believe these changes will significantly enhance the clarity of the manuscript and minimize any potential confusion. Thank you for highlighting this important point.
Weakness 2) Irrelevant Mention of Vision Transformers (ViTs)
While ViTs are not the primary focus of our study, we included them in the introduction to provide broader insight—that CNNs can be advantageous over ViTs in data-hungry regimes due to their inductive bias. However, as the reviewer pointed out, without subsequent experimental results to support this point, we agree that our intention may not be effectively conveyed.
To address this, we compared ViTs with Lp-models, as shown in the following table:
| TinyImageNet | Top-1(%) | FLOPs(G) | Params(M) |
|---|---|---|---|
| ViT-32x32 | 49.88 | 4.37 | 87.6 |
| ViT-16x16 | 54.20 | 16.87 | 86.0 |
| AlexNet | 52.25 | 0.71 | 57.82 |
| Lp2-AlexNet | 54.13 | 3.41 | 68.6 |
| Lp2-VGG-16 | 69.96 | 83.74 | 200.5 |
| Lp2-ResNet-18 | 68.45 | 9.86 | 61.5 |
| Lp2-ResNet-34 | 70.43 | 19.93 | 116.6 |
| Lp2-ConvNeXt-T | 70.72 | 5.42 | 33.8 |
As shown in the table above, Lp2-AlexNet achieves performance comparable to ViT-16x16 on TinyImageNet with a significantly lower parameter count and computational cost, demonstrating its efficiency. We will include this result to clarify the relationship between ViTs and CNNs. Thank you for raising this important point.
Weakness 3) Lack of Supporting References for Biological Comparisons
We agree that lines 108–110 should be carefully addressed with supporting references, as they represent a key premise of our work. Additionally, the observation that CNNs can exhibit sparse weight patterns during biological modeling is an important point that warrants further discussion.
To address these points, we will revise the manuscript lines 108-110 as follows:
- "Standard CNN architectures are typically designed with rectangular, dense, and uniformly distributed connections [6-9], in contrast to the circular, sparse, and normally distributed connections commonly observed in biological neurons [10-12]. Early studies in biological modeling using CNNs have shown that task-specific adaptations can lead to sparse weight patterns [1-5]. These insights demonstrate the adaptability of CNNs and highlight the potential for bridging artificial and biological connectivity patterns."
These revisions will provide stronger support for our claims and improve the manuscript’s precision.
References
[1-5] See reviewer's references.
[6] LeCun et al. "Gradient-based learning applied to document recognition." IEEE (1998)
[7] Krizhevsky et al. "ImageNet classification with deep convolutional neural networks." NeurIPS (2012)
[8] Simonyan and Andrew. "Very deep convolutional networks for large-scale image recognition." ICLR (2015)
[9] He et al. "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." ICCV (2015)
[10] Lerma-Usabiaga et al. "Population receptive field shapes in early visual cortex are nearly circular." J Neurosci (2021)
[11] Seeman et al. "Sparse recurrent excitatory connectivity in the microcircuit of the adult mouse and human cortex." eLife (2018)
[12] Hage et al. "Synaptic connectivity to L2/3 of primary visual cortex measured by two-photon optogenetic stimulation." eLife (2022)
Weakness 4) Undefined Variables and Ambiguity in Key Equations
We are grateful to the reviewer for bringing this to our attention. We will ensure that the variable σ (currently defined only in the legend of Fig. 1e) is explicitly defined and clearly stated in the main text to enhance clarity and precision. We will also properly cite the following reference for Eqn. 1:
- Goodman, Irwin R., and Samuel Kotz. "Multivariate θ-generalized normal distributions." Journal of Multivariate Analysis (1973)
We appreciate the reviewer’s sharp observation regarding the undefined variables and ambiguity in key equations.
Question 1) Evidence for Gaussian Sparsity as Biological Constraints
Thank you for raising this important question. Since Gaussian sparsity is a core assumption in our model as a biological constraint, it is essential to address this thoroughly. Below, we provide both theoretical and empirical evidence supporting Gaussian structured sparsity.
1. Theoretical Evidence
- Sparse Coding Theory: Sparse Coding Theory posits that neural systems optimize sensory representations by minimizing redundancy. Learning a sparse code for natural images leads to the emergence of simple-cell receptive field properties [1]. This process can be linked with Gaussian priors, where synaptic weights follow a Gaussian distribution with most connections being weak and a few strong, promoting efficient information encoding [2].
- Effective Receptive Field (ERF) Theory: In convolutional neural networks, the actual influence of input pixels on an output neuron decreases in a Gaussian manner from the center of the theoretical receptive field [3]. This means that while the theoretical receptive field defines the maximum possible area of influence, the ERF is effectively smaller and Gaussian-shaped, with central pixels contributing most significantly to the neuron's output.
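The Gaussian fall-off of the ERF can be illustrated with a toy computation: the influence footprint of a layer stack is the repeated convolution of each layer's (uniform) kernel, and by the central limit theorem repeated convolution of box filters approaches a Gaussian. A minimal 1-D numpy sketch (illustrative only, not from the paper):

```python
import numpy as np

# Influence of input positions on one output unit after stacking n layers,
# each with a uniform 3-tap kernel: the composite profile is the n-fold
# convolution of box filters, which tends to a Gaussian (central limit theorem).
profile = np.array([1.0])
box = np.ones(3) / 3.0
for _ in range(10):
    profile = np.convolve(profile, box)

center = len(profile) // 2  # influence peaks at the receptive field center
```

Although the theoretical receptive field here spans all 21 input positions, the profile is bell-shaped: central inputs dominate while edge contributions are vanishingly small, matching the ERF result of Luo et al. [3].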
2. Empirical Evidence
- Supporting References: These references demonstrate that both the anatomical and functional distributions of synapses in the visual cortex predominantly follow a Gaussian-like distribution [4-5].
- Analysis Result: In Appendix A.3, we demonstrate that in vivo functional synapse data [5] from awake mouse V1 follow a Gaussian distribution.
These foundations affirm Gaussian sparsity as a biologically plausible constraint within our Lp-convolution framework. We will include this additional section in the manuscript to reflect these points comprehensively. We thank the reviewer for their valuable feedback, which has enhanced the clarity and robustness of our manuscript.
References
[1] Olshausen and Field. "Emergence of simple-cell receptive field properties by learning a sparse code for natural images." Nature (1996)
[2] Olshausen and Millman. "Learning sparse codes with a mixture-of-Gaussians prior." NeurIPS (1999)
[3] Luo et al. "Understanding the effective receptive field in deep convolutional neural networks." NeurIPS (2016)
[4] Hellwig et al. "A quantitative analysis of the local connectivity between pyramidal neurons in layers 2/3 of the rat visual cortex." Biological cybernetics (2000)
[5] Rossi et al. "Spatial connectivity matches direction selectivity in visual cortex." Nature (2020)
Question 2) What are the trends in p and their biological meaning?
We refer the reviewer to Appendix A.16 for detailed analyses and provide the following summary.
1. Distribution and Convergence of p
The learned p values, as shown in Appendix A.16.1 and A.16.2, are dispersed around distinct values such as 2, 4, 8, and 16 without overlapping. While these distributions vary slightly across datasets and model architectures, a general trend is observed: p consistently converges towards smaller values compared to its initialization, suggesting a decreasing tendency throughout training.
2. Layer-wise Trend
Appendix A.16.3 and A.16.4 further reveal layer-specific trends. In early layers, p predominantly decreases, likely reflecting the refinement of basic feature representations. Conversely, in late layers, p tends to increase, indicating enhanced integration of higher-level features and abstraction. These contrasting trends align with the hierarchical processing patterns observed in neural networks and biological systems.
3. Biological Meaning
The convergence of p towards smaller values supports our hypothesis that p learns to reinforce biological constraints. This is consistent with effective receptive field theory and our findings in Figure 1, where sensory input receptive fields in trained models resemble Gaussian distributions, a natural phenomenon in biological sensory systems. However, as p remains sensitive to its initial values, ensuring convergence to globally optimal values remains a challenge, which we aim to address in future research.
Weakness 5) Limited Improvements and Not SoTA
We agree with the reviewer’s observation that the performance gains seem modest and far from SoTA. However, we kindly request the reviewer to consider the distinct values of our work beyond achieving SoTA. Specifically, our contributions lie in:
- exploring the potential of novel biologically inspired inductive biases, and
- developing a new, easily pluggable module for CNNs.
1. Novel inductive bias
We acknowledge that prior works exploring biological ideas focus on providing novel insights and hence often fail to surpass standard ML methods in raw performance [1-3]. Nonetheless, we have aimed to provide not only novel insights but also practical improvements to the ML community. We have demonstrated consistent improvements across various architectures and tasks, underscoring the robustness and versatility of our approach. While our results are not SoTA, we believe our work still makes meaningful contributions to the field by introducing a novel inductive bias to the community.
2. Easily pluggable module
Historically, CNNs have evolved through the introduction of innovative modules. For example, the depthwise convolution module was originally proposed in MobileNet [5] to improve computational efficiency rather than to achieve SoTA performance, yet it now plays a pivotal role in SoTA architectures like ConvNeXt and RepLKNet [6, 7]. Similarly, CoAtNet, a precursor to Astroformer, leveraged a hybrid module combining depthwise convolution and self-attention, enabling Astroformer to achieve SoTA solely through architectural refinements [8]. In this context, our proposed Lp-convolution module offers practical value as a pluggable component that is easily integrated into existing architectures, facilitating flexible and efficient deployment (see Appendix A.21).
We believe that exploring novel bio-inspired algorithms and providing new convolutional modules enrich the ML community by offering both theoretical insights and practical tools. By contributing an additional design choice that enhances robustness and versatility across architectures, our work supports innovation in ML model design.
References
[1] Pogodin, Roman, et al. "Towards biologically plausible convolutional networks." NeurIPS (2021)
[2] Liu, Yuhan Helena, et al. "Biologically-plausible backpropagation through arbitrary timespans via local neuromodulators." NeurIPS (2022)
[3] Kao, Chia Hsiang, and Bharath Hariharan. "Counter-Current Learning: A Biologically Plausible Dual Network Approach for Deep Learning." NeurIPS (2024)
Weakness 6) Weak Justification for Large, Sparse Kernels and Unclear How RSM Benefits Contemporary Vision Tasks
1. The justification for using large, sparse kernels
The use of large kernels enables the model to cover the input space more effectively with fewer layers compared to smaller kernels [1, 2]. However, simply increasing the kernel size does not guarantee performance improvements, as shown in Table 1 (Base vs. Large). This is presumably because larger kernels inadvertently incorporate irrelevant global information, which can hinder performance relative to smaller kernels that rely on locality inductive biases to extract local features hierarchically. This is where sparsity plays a key role. We introduce sparsity constraints to optimize the use of large kernels, ensuring they focus only on relevant global information while mitigating the disadvantages of naïvely expanding kernel sizes, as supported by our Sudoku experiments.
We will include this discussion in the revised manuscript to better justify our approach and clarify its effectiveness.
2. RSM for Contemporary Vision Task
While RSM analysis itself is not directly tied to contemporary vision tasks, it serves as a critical tool for exploring biologically inspired design principles that can inform AI model design. In this study, we used RSM analysis to measure the alignment between AI models and the brain under our novel inductive bias, rather than to address contemporary vision tasks such as classification, detection, or segmentation. The essential question driving this work is whether CNNs behave like the brain and, more importantly, what insights can be gained by applying brain-derived inductive biases to AI models.
References
[1] Ding, Xiaohan, et al. "Scaling up your kernels to 31x31: Revisiting large kernel design in cnns." CVPR (2022)
[2] Luo et al. "Understanding the effective receptive field in deep convolutional neural networks." NeurIPS (2016)
Thank you for your valuable feedback—it’s been incredibly helpful in improving our work. As today is the final day for reviews, please don’t hesitate to ask if you have any remaining questions. We’d be happy to clarify anything to ensure the work is as strong as possible.
This paper introduces Lp-convolution, a novel approach to convolutional neural networks (CNNs) inspired by biological visual processing. The work addresses fundamental differences between artificial and biological visual systems: while traditional CNNs employ rectangular, dense, and uniform connectivity patterns, biological visual systems feature circular, sparse, and normally distributed connections. Additionally, the paper tackles the longstanding challenge that large kernel sizes in CNNs typically don't improve performance despite increased parameters.
The key innovation is the introduction of Lp-convolution, which uses multivariate p-generalized normal distribution (MPND) to bridge these biological-artificial differences. The method implements trainable "Lp-masks" that can adapt their shape through parameters, enabling flexible receptive field shapes that better match biological patterns. Technically, this is achieved by applying channel-wise Lp-masks that overlay onto convolutional kernels, with shape parameters that can be trained for task-dependent adaptation.
The authors demonstrate several significant findings: Lp-convolution improves the performance of CNNs with large kernels, with optimal results achieved when the initial p parameter approaches 2 (matching biological Gaussian distribution). Moreover, neural representations show better alignment with the visual cortex when connectivity patterns are more biologically plausible.
The practical impact of this work is threefold: it enables effective utilization of larger kernels in CNNs, achieves more biologically plausible artificial neural networks, and maintains compatibility with existing CNN architectures.
Strengths
The paper is overall well-written, with sections that tell a clear, sequential story. The usage of bold characters to highlight important parts is particularly appreciated. This is a very strong contribution across multiple aspects:
- Connectivity patterns as inductive biases are largely unexplored within this community. The biological inspiration effectively guides the search for plausible connectivity patterns, and the approach proposed in this submission is particularly apt. The work presents a complete narrative, from biological mechanism inspiration to the implementation of Lp-convolution for neural activity prediction in V1 and representational similarity analysis.
- The paper's approach to addressing large kernel network training challenges could potentially bridge the performance gap between transformers and CNNs in image classification tasks.
- The mathematical formulation is sound and accessible, with figures (e.g., Figures 1 and 2) that effectively illustrate concepts and build intuition about parameter effects.
- The choice of the Sudoku challenge adds significant value, serving as an excellent demonstration of the model's capabilities in an easily understandable context (especially for mask shapes).
- The Appendix comprehensively addresses potential questions, demonstrating thorough consideration of the work's implications and limitations.
Weaknesses
I don't think significant weaknesses are present in this work. The paper should be accepted as it is.
Questions
(More of a curiosity) For future developments of this work, it would be interesting to explore connections with anisotropic diffusion (Perona & Malik, Scale-space and edge detection using anisotropic diffusion, 1990). In standard convolution, there exists a well-established mapping between convolution operators and isotropic diffusion processes (as explored in Scale-Space theory, particularly in Koenderink, The structure of images, 1987; and Lindeberg, Scale Space theory in computer vision, 1994). How might Lp-convolution relate to or extend these theoretical frameworks?
We sincerely thank the reviewers for their overwhelmingly positive and encouraging feedback. We deeply appreciate the recognition of our efforts to craft a comprehensive narrative bridging neuroscience and AI. The absence of identified weaknesses and the reviewers' enthusiasm for our approach reinforce our confidence in the significance and robustness of this work. This feedback inspires us to further explore and advance the potential of biologically inspired neural networks.
Question 1) Connections with Anisotropic Diffusion
Thank you for your insightful and constructive comment. We greatly appreciate the suggestion to explore connections between Lp-convolution and anisotropic diffusion (Perona & Malik, 1990), as well as the broader scale-space theory outlined by Koenderink (1987) and Lindeberg (1994). While our current work emphasizes practical and empirical aspects of Lp-convolution, we agree that its theoretical mapping to diffusion processes could be a fascinating direction for future research.
In particular, the parameterized adaptability of p in Lp-convolution provides a natural mechanism to bridge isotropic and anisotropic processes. For example, varying p could function similarly to the diffusion coefficient in anisotropic diffusion, dynamically controlling feature emphasis and smoothing based on local structures. This adaptability could also extend scale-space representations by offering a flexible, multi-scale feature extraction framework. Thank you again for this stimulating idea, which offers valuable guidance for future exploration. We will carefully consider this perspective in our ongoing and future work. Again, we are truly delighted to receive such positive evaluations of our research.
Thank you for your valuable feedback—it’s been incredibly helpful in improving our work. As today is the final day for reviews, please don’t hesitate to ask if you have any remaining questions. We’d be happy to clarify anything to ensure the work is as strong as possible.
The paper proposes a brain-inspired approach of constraining the weights of convnets with a p-generalized Gaussian envelope. The authors demonstrate some minor improvements in performance on relatively small image datasets such as CIFAR-100 and TinyImageNet. They further claim that the learned representations of more "brain-like" convnets have higher representational similarity to the mouse visual system than their more classical counterparts.
Strengths
- Novel inductive bias for convnets motivated by biology
- Overall fairly well-written paper
- Well-motivated and well-executed experiments
Weaknesses
- The effect sizes are really small, calling into question the practical impact
- Only "toy" datasets are explored
- Experiment on representational similarity not convincing
Detailed explanation
While I find the paper well motivated and the idea original, I see the paper mostly as a negative result given the small effect sizes observed across most of the tables and the "toy" nature of datasets such as CIFAR-100 and TinyImageNet. The paper now has undergone several revisions that only reinforce this conclusion.
There is a statistically significant improvement due to Lp-Conv for some classical architectures, but not all of them (e.g. ResNet). Generally, the improvements are small (a few percent). Given that architectural modifications alone can now push accuracy on CIFAR-100 >90% (https://arxiv.org/abs/2304.05350v2), the 1–2% improvements in the 60–70% range feel insignificant. Experiments on more modern architectures such as RepLKNet (Table 3) show the same pattern, if anything with decreasing effect size (<1% improvement). Similarly, the transfer learning experiment using ConvNeXt-V2 (Table 4) shows close to no effect. There are no experiments on closer-to-real-world datasets like ImageNet (although that's by now a fairly standard problem that can be done on a consumer GPU), although I should say that I do not expect major effects in that experiment, either. The data simply show that the inductive bias doesn't do much.
The experiment on representational similarity yields equally small effect sizes, again insignificant on many architectures. In addition, the comparison is done to several mouse visual areas, some of which aren't even part of the "ventral" stream for which a convnet trained on image classification would be a reasonable model.
Questions
None
Weakness 3) Experiment on Representational Similarity Not Convincing
We appreciate the reviewer’s feedback and understand the concerns regarding the representational similarity (RSM) results, specifically:
- small effect sizes and modest significance
- inclusion of non-ventral stream regions in the analysis.
However, we respectfully argue that our RSM results are compelling and well-supported in addressing the primary objective of our study: to explore whether introducing biologically inspired constraints into CNNs enhances their alignment with the brain. Below, we provide detailed responses to each concern.
1. Small Effect Size and Modest Significance
While the observed effect sizes in RSM analysis may appear modest, we believe our results are robust, meaningful, and well-aligned with prior research. Key points supporting this include:
- Robustness Across Architectures: The observed trends consistently demonstrate that CNNs incorporating Gaussian sparsity (a biological constraint) achieve better alignment with brain representations than others. These trends hold across various CNN architectures, despite the inherent variability and complexity of biological data.
- Meaningful Effect Sizes Relative to Accuracy Gains: The SSM improvements for Lp models (e.g., AlexNet) reach approximately 3%, significantly larger than the corresponding Top-1 accuracy gains of around 1%. This disparity underscores that the observed RSM differences reflect meaningful biological alignment rather than statistical noise.
- Reproducibility with Prior Work: While the absolute SSM values may seem modest, their range (0.2-0.4) aligns closely with prior findings, such as those reported by Shi et al. (2019) [1]. Moreover, we used the same codebase as previous work [2], ensuring methodological consistency and the validity of our comparisons.
These points collectively demonstrate that our RSM results, while modest in absolute terms, are biologically relevant and sufficiently robust to support one of our main objectives.
2. Inclusion of Non-Ventral Stream Regions
In response to the reviewer's concern about including non-ventral stream regions, we note that our methodology follows established practices from prior studies [1-3], which analyze a broad range of visual areas in the mouse cortex. Our rationale for including both ventral and dorsal regions is as follows:
- Holistic Evaluation of Representational Capacity: While ventral stream regions like VISp and VISl are directly relevant to "what" tasks such as image classification, dorsal regions (e.g., VISam, VISpm, VISal) offer insights into the broader representational capacity of CNNs. By analyzing both streams, we aimed to provide a more comprehensive evaluation of CNNs in relation to biological systems.
- Focused Analysis on Ventral Stream Regions: To ensure clarity, region-specific SSM analyses are provided in Appendix A.12. These results clearly show that the ventral stream region VISl consistently achieves the highest SSM values across all CNN architectures. Consequently, the maximum SSM values reported in Figure 6 are derived from VISl, confirming that our primary analysis is appropriately focused on ventral stream processing and directly relevant to tasks like image classification.
By incorporating both ventral and dorsal regions, we situate our findings within a comprehensive framework for evaluating CNNs in the context of biological visual systems. The dominance of VISl in our results further substantiates the alignment of CNNs with ventral stream processing, reinforcing the validity and significance of our conclusions.
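The RSM/SSM comparison discussed above can be sketched as follows: build a representational similarity matrix per system from a stimulus-by-unit response matrix, then score alignment as the Spearman correlation between the RSMs' upper triangles. A minimal numpy sketch (variable names and the toy data are ours, not from the paper's codebase):

```python
import numpy as np

def rsm(responses):
    """Representational similarity matrix: stimulus-by-stimulus
    correlation of a (n_stimuli, n_units) response matrix."""
    return np.corrcoef(responses)

def ssm(resp_a, resp_b):
    """Spearman correlation between the upper triangles of two RSMs
    (one common way to score RSM alignment; assumes no rank ties)."""
    iu = np.triu_indices(resp_a.shape[0], k=1)
    va, vb = rsm(resp_a)[iu], rsm(resp_b)[iu]
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return float(np.corrcoef(rank(va), rank(vb))[0, 1])

rng = np.random.default_rng(0)
brain_resp = rng.standard_normal((20, 50))   # e.g. 20 stimuli x 50 V1 neurons
# A "model" sharing part of the brain's response geometry, plus noise:
model_resp = brain_resp[:, :30] + 0.1 * rng.standard_normal((20, 30))
score = ssm(brain_resp, model_resp)
```

A model whose representational geometry resembles the brain's yields a higher SSM than an unrelated one, which is the quantity compared across Lp and baseline architectures here.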
References
[1] Shi et al. "Comparison against task driven artificial neural networks reveals functional properties in mouse visual cortex." NeurIPS (2019)
[2] Bakhtiari et al. "The functional specialization of visual cortex emerges from training parallel pathways with self-supervised predictive learning." NeurIPS (2021)
[3] Shi, Jianghong et al. "MouseNet: A biologically constrained convolutional neural network model for the mouse visual cortex." PLOS Computational Biology (2022)
Weakness 2) Evaluation Limited to "Toy" Datasets
Given the availability of well-established large-scale datasets like ImageNet, smaller datasets such as CIFAR-100 and TinyImageNet are often perceived as 'toy' datasets. However, these smaller datasets are pivotal for examining novel inductive biases, as highlighted in numerous prior studies [1–5]. This is because, as data size increases, the impact of inductive biases tends to diminish due to scaling effects [6, 7]. For example, CNNs outperform ViTs in data-scarce regimes due to their strong locality inductive bias, while enriched data can obscure the benefits of this bias [1, 7]. Thus, the choice of these smaller datasets was essential for effectively investigating the potential of our proposed inductive biases.
At the same time, we fully recognize the importance of evaluating our method on closer-to-real-world datasets like ImageNet-1k to better understand its broader applicability and potential impact. As this study represents an initial exploration of a novel inductive bias, our primary goal was not to compete for SoTA performance but to demonstrate the conceptual strength and practical utility of our approach. Rather than training models from scratch on ImageNet-1k, we deliberately adopted a transfer learning strategy to integrate our method with ImageNet-1k pretrained SoTA models (Figure 3).
The transfer learning results highlight both the conceptual and practical advantages of our method (Table 4). Although the observed effect sizes were modest, our method seamlessly integrated with pretrained models, enhancing their adaptability without diminishing their performance. This outcome underscores the utility and flexibility of our approach, demonstrating its potential to advance transfer learning applications and contribute to the study and application of novel inductive biases.
References
[1] Lee et al., "Vision Transformer for Small-Size Datasets." arXiv preprint arXiv:2112.13492 (2021)
[2] Verma et al., "Manifold Mixup: Learning Better Representations by Interpolating Hidden States." ICML (2019)
[3] Zagoruyko et al., "Wide Residual Networks." BMVC (2016)
[4] Feng et al., "Conv2NeXt: Reconsidering ConvNeXt Network Design for Image Recognition." CAIT (2022)
[5] Hu et al., "Unlocking Deterministic Robustness Certification on ImageNet." NeurIPS (2024)
[6] Bachmann et al., "Scaling MLPs: A Tale of Inductive Bias." NeurIPS (2024)
[7] Zhang et al., "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond." IJCV (2023)
We appreciate the reviewer's constructive critique, which reflects the high standards of ICLR. We acknowledge the reviewer’s concerns regarding the lack of the ImageNet-1k benchmark and the absence of SoTA results, especially in light of multiple revision opportunities. While we share the ambition of presenting such results, at this early exploratory stage of our novel idea, we regret that we were not yet able to achieve those outcomes. That said, we firmly believe that our research still offers meaningful contributions to the ML community, and we hope the reviewer will appreciate the broader value of our work in advancing novel directions within the field. Below, we address the concerns raised and provide clarifications where needed.
Weakness 1) Small Effect Sizes and Limited Practical Impact
We agree with the reviewer’s observation that the effect sizes might appear small compared to the state-of-the-art (SoTA) models. Indeed, it is impressive that Astroformer achieves 93.4%, outperforming the second-best PyramidNet (89.9%) by more than 3% solely through architectural modifications [1]. However, we kindly request the reviewer to consider the distinct values of our work beyond achieving SoTA performance. Specifically, our contributions lie in:
- exploring the potential of novel biologically inspired inductive biases, and
- developing a new, easily pluggable module for CNNs.
1. Novel inductive bias
We acknowledge that prior works exploring biological ideas focus on providing novel insights and hence often fail to surpass original ML methods in raw performance [2-4]. Nonetheless, we have aimed to provide not only novel insights but also practical improvements to the ML community. We have demonstrated consistent improvements across various architectures and tasks, underscoring the robustness and versatility of our approach. While our results are not SoTA, we believe our work still makes a meaningful contribution to the field by introducing a novel inductive bias to the community.
2. Easily pluggable module
Historically, CNNs have evolved through the introduction of innovative modules. For example, the depthwise convolution module was originally proposed in MobileNet to improve computational efficiency rather than to achieve SoTA [5], and it now plays a pivotal role in SoTA architectures like ConvNeXt and RepLKNet [6, 7]. Similarly, CoAtNet, a precursor to Astroformer, leveraged a hybrid module combining depthwise convolution and self-attention, enabling Astroformer to achieve SoTA solely through architectural refinements [8]. In this context, our proposed Lp-convolution module offers practical value as a pluggable component that is easily integrated into existing architectures, facilitating flexible and efficient deployment (see Appendix A.21).
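As a rough illustration of this pluggability (a minimal numpy sketch under our own assumptions: an axis-aligned, isotropic mask and a zero-padding scheme; the paper's actual Lp-Convolution uses trainable mask parameters, so `lp_mask` and `plug_into_large_kernel` here are hypothetical helpers, not the authors' implementation):

```python
import numpy as np

def lp_mask(size, p, sigma):
    # Unnormalized multivariate p-generalized normal evaluated on a kernel grid;
    # p=2 is Gaussian-like, large p approaches a flat rectangle.
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    m = np.exp(-((np.abs(xx) / sigma) ** p + (np.abs(yy) / sigma) ** p))
    return m / m.max()

def plug_into_large_kernel(weight, big_size, p=2.0, sigma=1.5):
    # Zero-pad a pretrained k x k kernel into a larger canvas and overlay the
    # Lp-mask, so the enlarged kernel initially behaves like the original one.
    k = weight.shape[0]
    pad = (big_size - k) // 2
    return np.pad(weight, pad) * lp_mask(big_size, p, sigma)
```

Because the mask equals 1 at the center and decays outward, the pretrained central weights pass through unchanged at initialization, which is what makes dropping the module into an existing architecture cheap.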
We believe that exploring novel bio-inspired algorithms and providing new convolutional modules enrich the machine learning community by offering both theoretical insights and practical tools. By contributing an additional design choice that enhances robustness and versatility across architectures, our work supports innovation in machine learning model design.
References
[1] Dagli, Rishit. "Astroformer: More data might not be all you need for classification." arXiv preprint arXiv:2304.05350 (2023)
[2] Pogodin, Roman, et al. "Towards biologically plausible convolutional networks." NeurIPS (2021)
[3] Liu, Yuhan Helena, et al. "Biologically-plausible backpropagation through arbitrary timespans via local neuromodulators." NeurIPS (2022)
[4] Kao, Chia Hsiang, and Bharath Hariharan. "Counter-Current Learning: A Biologically Plausible Dual Network Approach for Deep Learning." NeurIPS (2024)
[5] Howard, Andrew G. "Mobilenets: Efficient convolutional neural networks for mobile vision applications." arXiv preprint arXiv:1704.04861 (2017)
[6] Liu, Zhuang, et al. "A convnet for the 2020s." CVPR (2022)
[7] Ding, Xiaohan, et al. "Scaling up your kernels to 31x31: Revisiting large kernel design in cnns." CVPR (2022)
[8] Dai, Zihang, et al. "Coatnet: Marrying convolution and attention for all data sizes." NeurIPS (2021)
Thank you for the detailed response. I remain unconvinced and will maintain my rating.
- I do see that it's a novel inductive bias and easily pluggable – but to what end? What problem does it solve? Being biologically inspired is not an end in itself if it does not address an open problem.
- Again, what problem does this inductive bias solve? We have strong backbones that can be used for transfer learning and we have better methods for these toy tasks. If your inductive bias solves some data-scarce problem (better than previous methods, including transfer learning from backbones that exist and can be downloaded), I suggest you present the evidence. Table 4 is one example for super tiny effects, so I don't consider it support for your claim that your method is useful.
- I do not agree that they are robust across architectures. The differences are not statistically significant in 3 of 5 architectures. Percent improvements between RSA and accuracy are not comparable, as they are not on the same scale. I did not criticize the absolute magnitude of your RSA values; I am aware of them often being small, especially in the mouse when compared to task-trained CNNs.
Re 2) Again, what problem does this inductive bias solve? [...] I suggest you present the evidence. [...]
In addition to the Large Kernel Problem we’ve discussed in Re 1), we also tried to showcase the strengths of our method through its application to transfer learning. However, the effect size from the transfer learning results might feel underwhelming. We understand the reviewer's position: to claim our method is effective, we need to back it up with stronger evidence.
So, how about we take a look at our robustness experiments (Table 2)? The effect size there is much more significant. For example, in the case of the ConvNeXt network, the raw values show that our method almost doubles the robustness performance compared to the baseline across various types of corruption.
| Corruption | Base | Large | Lp2 | Lp4 | Lp8 | Lp16 |
|---|---|---|---|---|---|---|
| brightness | 23.28±7.14 | 31.23±1.61 | 39.26±4.76 | 31.68±9.26 | 33.93±6.33 | 34.27±3.94 |
| contrast | 17.08±6.12 | 26.26±2.06 | 34.91±5.46 | 27.25±9.38 | 29.29±6.62 | 29.17±4.08 |
| defocus_blur | 24.26±7.11 | 31.35±1.42 | 39.35±4.50 | 32.45±9.11 | 34.46±6.20 | 34.88±3.45 |
| elastic_transform | 20.40±5.67 | 27.67±1.11 | 33.43±4.08 | 27.24±7.44 | 29.46±4.88 | 29.83±3.26 |
| fog | 20.08±6.60 | 28.48±1.95 | 36.37±4.97 | 29.17±9.13 | 31.15±6.45 | 31.41±4.28 |
| frost | 17.36±5.79 | 27.44±1.84 | 33.68±5.21 | 26.46±8.32 | 28.28±5.77 | 28.29±4.19 |
| gaussian_blur | 24.34±7.07 | 31.53±1.50 | 39.37±4.46 | 32.51±9.21 | 34.65±6.16 | 35.11±3.28 |
| gaussian_noise | 20.64±5.80 | 30.74±1.30 | 34.57±4.24 | 27.69±7.07 | 29.75±4.12 | 29.86±3.59 |
| glass_blur | 13.96±2.99 | 27.66±0.96 | 24.52±4.78 | 19.43±3.19 | 21.19±1.98 | 19.99±4.36 |
| impulse_noise | 20.81±6.02 | 29.87±1.49 | 34.47±4.12 | 27.84±7.46 | 29.74±4.56 | 29.84±3.46 |
| jpeg_compression | 21.98±6.34 | 31.09±1.42 | 35.81±3.94 | 29.37±7.65 | 31.58±4.99 | 31.79±3.46 |
| motion_blur | 22.40±5.93 | 29.13±1.16 | 35.79±4.24 | 29.44±7.69 | 31.49±5.44 | 31.93±2.75 |
| pixelate | 24.24±6.78 | 31.32±1.69 | 38.56±4.42 | 31.62±8.63 | 33.87±5.64 | 33.93±3.64 |
| saturate | 11.64±4.90 | 19.34±2.39 | 26.10±4.28 | 20.33±7.28 | 22.25±5.10 | 22.14±3.41 |
| shot_noise | 22.07±6.28 | 31.24±1.49 | 36.61±4.64 | 29.46±7.78 | 31.56±4.84 | 31.80±3.53 |
| snow | 18.86±5.75 | 28.61±1.55 | 33.54±4.17 | 27.04±7.64 | 29.07±4.92 | 28.58±3.76 |
| spatter | 21.98±6.62 | 30.63±1.62 | 36.47±4.63 | 29.54±8.20 | 31.66±5.31 | 31.82±3.57 |
| speckle_noise | 21.60±6.00 | 31.15±1.44 | 36.48±4.50 | 29.47±7.79 | 31.56±4.92 | 31.99±3.56 |
| zoom_blur | 22.07±6.07 | 28.46±0.98 | 35.63±4.55 | 29.33±7.30 | 30.97±5.18 | 31.73±2.70 |
We haven’t fully investigated why Gaussian sparsity has such a dramatic impact on robustness yet. However, we believe this result supports the claim that our method is beneficial for model robustness, in addition to its benefits for the Large Kernel Problem.
Re 3) I do not agree that they are robust across architectures. The differences are not statistically significant in 3 of 5 architectures. Percent improvements between RSA and accuracy are not comparable, as they are not on the same scale. I did not criticize the absolute magnitude of your RSA values; I am aware of them often being small, especially in the mouse when compared to task-trained CNNs.
First off, thank you for clarifying that your critique isn’t about whether our experiments were conducted properly, but rather about whether the statistical significance level of our results is strong enough to support our claims. That’s a really helpful distinction, and we appreciate you pointing it out.
We observed a mostly decreasing trend as p_init increased, and when we said “robust,” we meant that the trends were generally consistent across different conditions. However, as you rightly pointed out, the significance levels stand out primarily in AlexNet and ConvNeXt.
Throughout previous revisions, we have also included the neural activity prediction of V1 (Table 5) using Lp-CNNs to support our claim. When we replaced the CNN in a CNN-GRU network (same model from previous studies) with our Lp-CNN, we found that Gaussian sparsity achieved the best performance. We think this result, combined with our RSA results, could support the idea that Gaussian sparsity allows CNNs to better align with the brain. We’d love to hear your thoughts on this—do you think it strengthens our argument, or is there something else we should consider?
Thank you again for taking the time to provide feedback. It means a lot to us. Again, it’s not just about getting higher scores—your constructive critiques are what truly help us improve. Please don’t hesitate to share more of your thoughts anytime. We deeply value them!
Thank you for taking the time to respond to our writing. It genuinely means a lot to us that you’ve come back to offer feedback. We believe helping us means more than just increasing our scores. We sincerely welcome your constructive criticism at any time—it has been an invaluable foundation for shaping our research direction. Below, we clarify points raised in the reviewer's additional feedback.
Re 1) I do see that it's a novel inductive bias and easily pluggable – but to what end? What problem does it solve? Being biologically inspired is not an end in itself if it does not address an open problem.
Yes, we completely agree with your point: “Being biologically inspired is not an end.” You pointed out that it’s unclear what open problem our method is addressing, and we think this might be because we didn’t effectively communicate our problem statement. In fact, we’ve proposed an open problem here: the “Large Kernel Problem.”
Let us explain. We revisit the large kernel problem as introduced in line 58. Large kernels in CNNs often fail to show consistent performance improvements as kernel sizes increase, even when additional parameters are allocated [1]. This raises an important question in machine learning: Can we expect performance gains by expanding kernel sizes horizontally, beyond the traditional vertical stacking of layers?
To clarify this further, please take another look at Table 1. When comparing the (Base) condition with the (Large) condition—where kernel sizes are increased—we see that performance generally declines across models, except for ResNet. This underscores our key point: given the large kernel problem, the comparison that truly matters is Large vs. Lp-Conv, not Base vs. Lp-Conv.
Here’s why our method stands out: it is explicitly designed so that, regardless of the initial p-value, the Base model’s performance serves as a lower bound for the expected performance of a large-kernel CNN with Lp-Masks. This makes our approach far more stable than simply training large kernels arbitrarily, while still achieving significant performance gains.
How is this possible?
This stability arises because the enlarged kernels overlaid with Lp-Masks prioritize the central parameters for learning, as if the kernel were initialized like the Base model’s (see Figures 2c, d, e or Figure 3). Over time, the Lp-Mask dynamically changes its conformation in a task-dependent manner by expanding, contracting, narrowing, elongating, or rotating as needed. To validate this mechanism, we performed the Sudoku task (please see the trained individual Lp-Masks in Appendix A.7—it’s worth checking out how they adapted their shapes during training!). For example, in Sudoku—which involves row, column, and box constraints but no diagonal ones—we found that no Lp-Mask formed a diagonal shape, underscoring its adaptability. This dynamic behavior is key to the performance gains we observed.
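To illustrate the conformational flexibility described above, here is a minimal numpy sketch. It is our own simplification: an explicit rotation angle and per-axis widths stand in for the paper's trainable transform, so the `lp_mask` signature here is an assumption, not the authors' code.

```python
import numpy as np

def lp_mask(size, p, sigma_x, sigma_y, theta=0.0):
    # Rotatable, anisotropic Lp-mask: the mask can elongate (sigma_x != sigma_y),
    # expand/contract (scaling both sigmas), or rotate (theta, in radians).
    ax = np.arange(size) - (size - 1) / 2
    xx, yy = np.meshgrid(ax, ax)
    xr = np.cos(theta) * xx + np.sin(theta) * yy
    yr = -np.sin(theta) * xx + np.cos(theta) * yy
    m = np.exp(-((np.abs(xr) / sigma_x) ** p + (np.abs(yr) / sigma_y) ** p))
    return m / m.max()
```

For example, `lp_mask(7, 2, 3.0, 1.0)` gives a horizontally elongated Gaussian-like mask, while raising `p` flattens the mask toward a rectangle over the same extent.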
Did our method solve the Large Kernel Problem?
In a way, we believe it depends on how you interpret the results. Our approach is not just about achieving dramatic performance gains immediately but rather about providing a guaranteed lower bound on performance when enlarging kernels in CNNs.
To put this in perspective, think about how vertically extending a model—for instance, increasing ResNet-18 to ResNet-34 by doubling the number of layers—yields approximately a 1.25% performance gain. Similarly, our method achieves a stable 2.64% performance improvement on ResNet-18 by horizontally extending the model size through kernel enlargement. From our perspective, this represents a meaningful step toward addressing the Large Kernel Problem.
By stabilizing performance when increasing kernel sizes, our method offers a solid foundation for further exploration and refinement. We’re happy to discuss and answer more questions regarding this perspective—thank you for your thoughtful engagement!
References
[1] Peng, Chao, et al. "Large kernel matters--improve semantic segmentation by global convolutional network." Proceedings of the IEEE conference on computer vision and pattern recognition. 2017.
Thank you for your valuable feedback—it’s been incredibly helpful in improving our work. As today is the final day for reviews, please don’t hesitate to ask if you have any remaining questions. We’d be happy to clarify anything to ensure the work is as strong as possible.
This paper introduces Lp-Convolution by integrating the multivariate p-generalized normal distribution into masks for convolution filters. This allows the network to adapt to different receptive field shapes and train efficiently with large kernels. The paper shows Lp-Convolution has an advantage in tasks such as the Sudoku challenge.
Strengths
The idea of Lp-convolution is original and could have wide application in visual tasks that require flexible receptive field sizes, or more generally in tasks that require both local and global information from visual input. Showing its ability to transfer to any pretrained network greatly lowers the threshold for applying this to a wide range of tasks. The choice of the Sudoku task and the follow-up ablation analysis is solid and demonstrates the strength of this method well. Out of all papers that take inspiration from neuroscience and try to utilize it to improve neural nets, this paper stands out in actually providing a fundamentally different implementation of CNNs.
Other than the points mentioned above, the benchmark testing was thorough and presented clearly. The visualization of both the mask and convolution is very helpful for understanding the concepts. The writing is very clear.
Weaknesses
The paper does show through Tables 1-4 that Lp-CNNs can train with large kernels and have some advantage in robustness as well as accuracy in benchmark tests. However, the improvements over baseline models are small. I don't think these numbers convince me of how useful Lp-CNNs could be. Aside from the Sudoku task, the paper didn't really show the advantage of efficiently trained large-kernel Lp-CNNs through a task that could actually benefit from large kernels. I would suggest including more tasks that require processing of context, or even tasks that ViTs excel at, for comparison.
For the robustness benchmark as well as the Sudoku task, it could be informative to include performance of ViTs as well.
Lastly, it is pretty well established throughout the paper that p = 2 is the most useful for most tasks and most closely resembles the biological system. I am not sure it is worth having another section (Sec. 6) dedicated to comparing the similarity of RSMs across different values of p. If the authors were to demonstrate it is also a better model for brain representational alignment, then I would recommend a more thorough study including more datasets, brain regions, and variations of CNNs.
Questions
See weakness for suggestions.
Potential typo: line 159: a solution to the of large kernel problem in CNN -> a solution of the large ...
Weakness 2) Missing ViT Baselines in Robustness and Sudoku Benchmarks
Thank you for pointing out the lack of ViT baselines for the Sudoku task. While we relied on a previously reported CNN model [1], we could not find ViT implementations specifically designed for this problem. Although Recurrent Transformer models, such as Yang et al. [2], achieve over 95% accuracy on textual Sudoku, no comparable ViT implementation exists for direct benchmarking.
To address this, we designed SudokuViT, a custom ViT architecture tailored for Sudoku puzzles, with a final linear projection reshaped into [batch_size, 9, height, width]. Two variations were explored:
- SudokuViT3x3: Processes 3 by 3 patches as sequences.
- SudokuViT1x1: Treats each pixel as a token, effectively functioning as a standard Transformer.
Unfortunately, SudokuViT3x3 struggled to train effectively, achieving only 27% number accuracy. SudokuViT1x1 showed slight improvement, with 49% number accuracy, but still performed poorly on Sudoku-specific metrics such as row-column accuracy and box accuracy.
| Model | Loss | RowColAcc | BoxAcc | SudokuAcc | NumberAcc |
|---|---|---|---|---|---|
| SudokuViT3x3 | 1.753 | 0.08% | 0.03% | 0.00% | 27% |
| SudokuViT1x1 | 1.282 | 0.20% | 0.20% | 0.00% | 49% |
These results suggest that ViTs face significant challenges in capturing the grid-like structure and spatial relationships inherent to Sudoku puzzles, which CNNs handle effectively. However, we acknowledge that this performance could also stem from our inability to identify an effective training strategy or optimized model design for SudokuViT. Thus, it may not be appropriate to treat these results as definitive baselines for ViT performance on Sudoku.
For future work, we recommend referring to Yang et al. [2], where a Recurrent Transformer approach demonstrated strong performance on textual Sudoku, as a more suitable baseline for Transformer-based methods.
We appreciate the reviewer's suggestion, as it allowed us to explore the limitations and potential of ViT architectures for structured reasoning tasks like Sudoku. This exploration provides valuable insights into the architectural trade-offs between CNNs and Transformers in such domains.
References
[1] Oinar. “How to solve sudoku with convolutional neural networks (cnn).” GitHub link (2021)
[2] Yang et al. "Learning to solve constraint satisfaction problems with recurrent transformer." ICLR (2023)
Weakness 3) Redundant Section on RSM Similarity Comparison
Thank you for your valuable feedback on the RSM experiments. We recognize that the necessity of this section may not be immediately clear. However, one of the key goals of our study is to investigate whether integrating biologically observed inductive biases into CNNs not only enhances engineering performance but also improves alignment with brain representations.
Section 6 plays a critical role in demonstrating this connection. For example, biases inspired by V1 (e.g., Gaussian sparsity, Appendix A.3) are shown to improve representational alignment with neural activity while simultaneously offering engineering benefits. This dual advantage underscores the value of biologically motivated constraints in informing model design.
We agree that further studies involving more datasets, brain regions, and CNN variations would strengthen our findings. As part of these efforts, we demonstrated neural activity prediction experiments (Table 5), which provide additional evidence of the alignment between CNN representations and biological data.
We hope this clarification justifies the inclusion of Section 6, emphasizing its role in illustrating the complementary relationship between neuroscience and AI.
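For readers unfamiliar with the analysis, the RSA pipeline behind Section 6 can be sketched as follows. This is a generic sketch of standard representational similarity analysis (Pearson RSMs compared with a Spearman correlation, no tie handling), not the authors' exact code or preprocessing.

```python
import numpy as np

def rsm(responses):
    # responses: [n_stimuli, n_units] -> [n_stimuli, n_stimuli] similarity
    # matrix, correlating stimulus response patterns across units.
    return np.corrcoef(responses)

def rsa_score(rsm_a, rsm_b):
    # Spearman correlation between the upper-triangular entries of two RSMs.
    # The double-argsort rank transform assumes no ties, which holds for
    # continuous responses.
    iu = np.triu_indices_from(rsm_a, k=1)
    ra = np.argsort(np.argsort(rsm_a[iu])).astype(float)
    rb = np.argsort(np.argsort(rsm_b[iu])).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))
```

A model RSM and a brain RSM built over the same stimulus set can then be scored with `rsa_score`, with higher values indicating closer representational alignment.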
Thanks for the detailed response and additional experiments. I have no further questions.
We sincerely thank the reviewer for their thoughtful and encouraging feedback, which deeply motivates us. Below, we address the concerns raised and provide further clarifications.
Weakness 1) Limited Utility Demonstrated for Large Kernel CNNs
Thank you for your valuable feedback. We would like to address your concerns by highlighting the strengths of our research and providing our perspective on the issues raised.
1. Lp-Convolution for Large Kernel Problem
Firstly, we reiterate our discussion of the large kernel problem mentioned on line 58. Recent large kernel CNNs do not guarantee performance improvements when increasing the kernel size, even when additional parameters are allocated (see Tables 1 and 4). This raises a critical machine learning question: Can we expect additional performance gains by moving beyond traditional vertical layer stacking and expanding the kernel size horizontally?
In Tables 1 and 4, the performance gains when comparing our method with legacy models (Base vs. Lp-Conv) may appear modest. However, when applied to large kernel models (Large vs. Lp-Conv), we observe a significant increase regardless of the initial value of p. Specifically, in Table 3, applying Lp-Convolution to RepLKNet—a model that already utilizes large kernel sizes—resulted in a 1% performance improvement with almost no additional computational cost. We believe this constitutes a meaningful enhancement, demonstrating the practical benefits and reinforcing the potential of large kernel CNNs to improve both efficiency and performance.
2. Strong Robustness with Lp-Convolution
In Table 4, we demonstrate that the Lp-Convolution model with p=2 outperforms the Base model in 57 corruption scenarios, whereas the Base model does not achieve a single win. This consistent superiority is reflected not only in the win counts but also in the absolute performance values, which we include in the raw performance table for ConvNeXt below.
| Corruption | Base | Large | Lp2 | Lp4 | Lp8 | Lp16 |
|---|---|---|---|---|---|---|
| brightness | 23.28±7.14 | 31.23±1.61 | 39.26±4.76 | 31.68±9.26 | 33.93±6.33 | 34.27±3.94 |
| contrast | 17.08±6.12 | 26.26±2.06 | 34.91±5.46 | 27.25±9.38 | 29.29±6.62 | 29.17±4.08 |
| defocus_blur | 24.26±7.11 | 31.35±1.42 | 39.35±4.50 | 32.45±9.11 | 34.46±6.20 | 34.88±3.45 |
| elastic_transform | 20.40±5.67 | 27.67±1.11 | 33.43±4.08 | 27.24±7.44 | 29.46±4.88 | 29.83±3.26 |
| fog | 20.08±6.60 | 28.48±1.95 | 36.37±4.97 | 29.17±9.13 | 31.15±6.45 | 31.41±4.28 |
| frost | 17.36±5.79 | 27.44±1.84 | 33.68±5.21 | 26.46±8.32 | 28.28±5.77 | 28.29±4.19 |
| gaussian_blur | 24.34±7.07 | 31.53±1.50 | 39.37±4.46 | 32.51±9.21 | 34.65±6.16 | 35.11±3.28 |
| gaussian_noise | 20.64±5.80 | 30.74±1.30 | 34.57±4.24 | 27.69±7.07 | 29.75±4.12 | 29.86±3.59 |
| glass_blur | 13.96±2.99 | 27.66±0.96 | 24.52±4.78 | 19.43±3.19 | 21.19±1.98 | 19.99±4.36 |
| impulse_noise | 20.81±6.02 | 29.87±1.49 | 34.47±4.12 | 27.84±7.46 | 29.74±4.56 | 29.84±3.46 |
| jpeg_compression | 21.98±6.34 | 31.09±1.42 | 35.81±3.94 | 29.37±7.65 | 31.58±4.99 | 31.79±3.46 |
| motion_blur | 22.40±5.93 | 29.13±1.16 | 35.79±4.24 | 29.44±7.69 | 31.49±5.44 | 31.93±2.75 |
| pixelate | 24.24±6.78 | 31.32±1.69 | 38.56±4.42 | 31.62±8.63 | 33.87±5.64 | 33.93±3.64 |
| saturate | 11.64±4.90 | 19.34±2.39 | 26.10±4.28 | 20.33±7.28 | 22.25±5.10 | 22.14±3.41 |
| shot_noise | 22.07±6.28 | 31.24±1.49 | 36.61±4.64 | 29.46±7.78 | 31.56±4.84 | 31.80±3.53 |
| snow | 18.86±5.75 | 28.61±1.55 | 33.54±4.17 | 27.04±7.64 | 29.07±4.92 | 28.58±3.76 |
| spatter | 21.98±6.62 | 30.63±1.62 | 36.47±4.63 | 29.54±8.20 | 31.66±5.31 | 31.82±3.57 |
| speckle_noise | 21.60±6.00 | 31.15±1.44 | 36.48±4.50 | 29.47±7.79 | 31.56±4.92 | 31.99±3.56 |
| zoom_blur | 22.07±6.07 | 28.46±0.98 | 35.63±4.55 | 29.33±7.30 | 30.97±5.18 | 31.73±2.70 |
When comparing these values, it becomes evident that our method exhibits exceptional robustness across various types of corruption.
3. Comparison with ViT
Thank you for your valuable feedback. While our primary focus is on improving CNN architectures, we agree with the reviewer that including a comparison to ViTs can better highlight the utility and relevance of our proposed method. To address this, we conducted experiments with ViT base models on TinyImageNet, and the results are summarized below:
| TinyImageNet | Top-1(%) | FLOPs(G) | Params(M) |
|---|---|---|---|
| ViT-32x32 | 49.88 | 4.37 | 87.6 |
| ViT-16x16 | 54.20 | 16.87 | 86.0 |
| AlexNet | 52.25 | 0.71 | 57.82 |
| Lp2-AlexNet | 54.13 | 3.41 | 68.6 |
| Lp2-VGG-16 | 69.96 | 83.74 | 200.5 |
| Lp2-ResNet-18 | 68.45 | 9.86 | 61.5 |
| Lp2-ResNet-34 | 70.43 | 19.93 | 116.6 |
| Lp2-ConvNeXt-T | 70.72 | 5.42 | 33.8 |
As shown in the table above, Lp2-AlexNet achieves performance comparable to ViT-16x16 on TinyImageNet with a significantly lower parameter count and computational cost, demonstrating its efficiency. Thank you for raising this important point. We will include this comparison in our revised manuscript to provide a clearer perspective on the relationship between ViTs and CNNs.
Thank you for your valuable feedback—it’s been incredibly helpful in improving our work. As today is the final day for reviews, please don’t hesitate to ask if you have any remaining questions. We’d be happy to clarify anything to ensure the work is as strong as possible.
Review Summary
We sincerely appreciate the reviewers' time and effort in carefully evaluating our paper. Some comments were incredibly encouraging, acknowledging the value of our efforts so far, while others provided constructive critiques that highlighted expectations aligned with the high standards of ICLR. Addressing these comments has been invaluable in refining our understanding of the impact and utility of our work, as well as its position within the broader field. We briefly summarize the strengths and weaknesses pointed out by reviewers.
- Strengths
- Novel Idea on Biologically Inspired Inductive Bias (Reviewer 7Z3f, nN4T, ZVMo)
- Broad Applicability of Our Method (Reviewer nN4T, ZVMo, ozim)
- Comprehensive Experimental Validation and Robustness (Reviewer 7Z3f, nN4T, ZVMo, ozim)
- Clear Writing and Effective Visualizations (Reviewer 7Z3f, ZVMo, nN4T)
- Demonstration of Model Flexibility with Sudoku Challenge (Reviewer nN4T, ZVMo, ozim)
- Weakness
- Not SoTA or Modest Improvements in Performance (Reviewer 7Z3f, nN4T, ozim)
- Limited Datasets and Tasks (Reviewer 7Z3f, nN4T)
- Weak Representational Similarity Analysis (Reviewer 7Z3f, nN4T, ozim)
- Unclear Relation to ViT Integration (Reviewer nN4T, ozim)
General Response
Our research represents an ambitious attempt to bridge neuroscience and AI, exploring whether biologically-inspired inductive biases can not only enhance CNNs but also make them more aligned with the brain's mechanisms. This paper is not the conclusion but the beginning—a foundational step toward conceptualizing and demonstrating this idea's potential.
We recognize the reviewers’ concerns about modest performance improvements, limited engagement with SoTA benchmarks, and the scope of tasks explored. These are valid limitations of our current study, and we appreciate the reviewers’ insights. However, the true value of our work lies in its vision and foundation: we have created an easily pluggable convolutional module that opens the door for future researchers to build on our findings and explore these possibilities further.
Moreover, this research is more than performance numbers—it provides a unique contribution by showing that biologically-observed mechanisms, when tested in AI models, reveal meaningful inductive biases critical to visual processing. This aspect of our work contributes to a deeper understanding of how neuroscience can inform AI, offering a perspective that extends beyond conventional performance-focused computer vision research.
We sincerely hope that reviewers can see this work not as a final statement but as a first, necessary step—a catalyst for deeper exploration at the intersection of neuroscience and AI. With this in mind, we have carefully structured our rebuttal to address the reviewers' valuable feedback while staying true to the motivation and purpose behind this study.
The paper introduces Lp-convolution, a novel approach inspired by the connectivity patterns observed in the brain’s visual cortex. By utilising the multivariate p-generalised normal distribution (MPND), the authors create Lp-masks that allow receptive fields (RFs) in CNNs to adapt in shape, scale, and orientation. This flexibility addresses a key challenge in CNN design known as the large kernel problem, where increasing kernel size often leads to diminished performance. The proposed Lp-convolution overcomes this by enabling large-kernel CNNs to achieve better performance, particularly when the Lp-mask configuration aligns with biologically inspired patterns (e.g. when p = 2, which reflects the Gaussian-like sparsity observed in biological connectivity). Furthermore, the study demonstrates that models employing Lp-convolution achieve stronger alignment with neural representations in the visual cortex, as evidenced by representational similarity analysis (RSA) with mouse visual cortex data.
A key strength and contribution of Lp-convolution lies in its ability to adapt receptive fields in a task-specific manner, enabling CNNs to handle diverse input features more effectively. This adaptability contributes to significant improvements in model robustness, as shown in experiments with CIFAR-100-C, where models with Lp-convolution outperform traditional CNNs. The approach also allows for more efficient transfer learning, allowing existing pre-trained models to incorporate Lp-masks with minimal computational cost and performance drop. Importantly, the method is generalizable, being compatible with a wide range of architectures, from traditional models like AlexNet and ResNet to modern large-kernel architectures like RepLKNet and ConvNeXt.
Nevertheless Lp-convolution introduces certain complexities. The inclusion of trainable parameters (C and p) increases model complexity and requires careful hyperparameter tuning. Model performance is sensitive to the choice of the initial p-value, and selecting the right value for different architectures and tasks can be non-trivial. Moreover, while Lp-masks provide flexibility and task-specific adaptability, the underlying decision-making process within the model is less interpretable than simpler, more transparent CNN designs. The paper also lacks a comprehensive discussion of the computational overhead introduced by Lp-convolution, particularly in large-scale training scenarios.
In summary, this paper builds a neat bridge between biological and artificial intelligence by introducing Lp-convolution, a method that enhances CNN adaptability, robustness, and performance in large-kernel models. While it introduces new complexities, its biologically inspired design offers a fresh perspective on how insights from neuroscience can drive the development of more effective machine learning models, something that was recognised by the majority of the reviewers.
Additional comments from the reviewer discussion
The reviewers had mixed opinions. Reviewer ZVMo and Reviewer nN4T were strongly in favour of accepting the paper, praising the originality and practical impact of Lp-convolution. They highlighted its potential for bridging biological and artificial visual processing, its compatibility with existing CNN architectures, and its utility in large-kernel visual processing tasks. They appreciated the authors’ clear explanations, visualisations, and comprehensive experimental validation, noting the paper’s strong presentation. Reviewer nN4T also noted the method’s potential to inspire future “brain-inspired” works due to its unique integration with existing models.
On the other hand, Reviewer 7Z3f raised some key concerns. While they acknowledged the novelty of the inductive bias and its biological inspiration, they questioned its practical utility, arguing that being “biologically inspired” was not a sufficient justification on its own. The reviewer criticised the small effect sizes and the use of “toy” datasets like CIFAR-100 and TinyImageNet, suggesting that the results failed to provide compelling evidence of significant performance improvements. They also raised concerns about the lack of experiments on larger, real-world datasets like ImageNet and claimed that the representational similarity analysis (RSA) was unconvincing, as the observed differences were modest and not statistically significant across several architectures.
The authors’ responses aimed to address these points. They defended their choice of datasets, arguing that smaller datasets were essential for studying novel inductive biases and that larger datasets might obscure these effects. To address the “toy dataset” critique, they pointed to the transfer learning results and highlighted improvements in robustness, especially on CIFAR-100-C, where their method consistently outperformed the baseline across various types of corruptions. Regarding representational similarity, they argued that while effect sizes were small, they were still biologically meaningful and aligned with previous studies on neural alignment. They further clarified that their primary goal was not to achieve state-of-the-art performance but to introduce a biologically inspired, adaptable convolutional module that could be “easily pluggable” into existing architectures like AlexNet, ResNet, and ConvNeXt.
The final reviewer scores were 8, 8, 6 (bumped up from 5) and 3, which on balance provide good evidence that the paper has merit and has passed the threshold for acceptance, while recognising that it leaves some concerns unresolved.
Accept (Poster)