Task-Optimized Convolutional Recurrent Networks Align with Tactile Processing in the Rodent Brain
Task-optimized convolutional recurrent neural networks trained on realistic tactile inputs align with rodent somatosensory data, suggesting that tactile processing in the brain relies on temporally precise representations shaped by categorization-driven optimization.
Abstract
Reviews and Discussion
The paper tackles rodent whisker-based touch from both an ML and a neuroscience angle. The authors (i) port the WHISKiT Physics simulator to a 30-whisker mouse array and generate two large ShapeNet-derived datasets of realistic force-and-torque sequences; (ii) introduce a PyTorch "Encoder-Attender-Decoder" search space that spans feed-forward, state-space, and convolutional recurrent (ConvRNN) encoders plus optional attention modules; (iii) train 62 model variants under supervised categorization and contrastive self-supervised objectives that use force/torque-specific augmentations; and (iv) compare each layer's representations to barrel-cortex population recordings via noise-corrected RSA.
They find that (i) ConvRNN encoders (e.g., IntersectionRNN) outperform ResNet and SSM baselines on tactile categorization, with (ii) the best ConvRNN-based models “saturating” explainable neural variance, matching or exceeding inter-animal consistency without any fitted read-outs. Task accuracy and neural alignment are linearly related, and contrastive SimCLR models (with tactile augmentations) equal supervised models in neural predictivity, offering a label-free proxy. GPT-style attenders give modest but consistent gains over no-attention or Mamba variants.
Strengths and Weaknesses
This paper delivers an end-to-end demonstration from realistic whisker physics through large-scale training to barrel-cortex validation. It is allegedly the first work to reach the noise ceiling in tactile cortex and thereby may illuminate key somatosensory inductive biases. The open release of code, simulator, and data further boosts value for both neuroscience and robotics. However, the current draft may over-claim: training on passive ShapeNet sweeps but testing on a six-stimulus active-whisking set risks spurious ceiling-level fits, and the architecture rankings lack formal statistics.
QUALITY
Strengths:
- From a biomechanically faithful simulator to large-scale model training and direct neural comparison, the experimental flow is coherent and correctly executed.
- 60+ Encoder-Attender-Decoder variants cover feed-forward, state-space, recurrent and attention architectures under both supervised and self-supervised regimes.
- Split-half reliability filtering, noise-ceiling correction and layer-wise RSA are the accepted best practices for small-stimulus neural datasets.
Weaknesses:
- Models learn on passive sweeps of ShapeNet objects but are judged on active whisking to six simple shapes. This mismatch may inflate RSA by chance; an ablation including these test shapes in the training mix (held out for validation) is missing.
- Bar plots lack formal significance tests; we do not know if IntersectionRNN > ResNet is reliable across mice.
- Six objects restrict the dimensionality of representational geometry and make it easier to “hit the ceiling.” Confidence intervals on the noise ceiling itself are not shown.
- Search budget (GPU-days, hyper-parameter tuning strategy) is only sketched, making it hard to gauge efficiency or carbon cost.
CLARITY
Strengths:
- Figures are information-dense yet readable; architecture, dataset and result diagrams are clear.
- The summary of tactile augmentations vs. image-style augmentations is easy to follow and motivates the SSL results.
Weaknesses:
- Introduction and related-work sections are bloated with robotics-hardware history that obscures the main contributions.
- Key implementation details (augmentation hyper-parameters, exact RSA equation, IntersectionRNN definition) are relegated to the appendix. Some should appear in the main text.
SIGNIFICANCE
Strengths:
- First convincing “goal-driven” account of somatosensory cortex; demonstrates that recurrent encoders plus force/torque-aware contrastive learning outperform common feed-forward baselines.
Weaknesses:
- Impact on mainstream ML may be limited. The architectural lesson (“use ConvRNNs for smooth temporal signals”) is incremental relative to earlier vision work.
- The neural dataset’s narrow scope means conclusions about “saturating variance” could be overturned once richer recordings emerge.
ORIGINALITY
Strengths:
- Combines three elements that have not previously appeared together: (i) a 3-D physically correct whisker array, (ii) a systematic temporal-network search that spans ConvRNN, SSM and attention, and (iii) noise-corrected RSA on barrel-cortex data.
- Shows that SimCLR with bespoke tactile augmentations rivals supervised learning. This is a novel result in somatosensation and a direct answer to an open question from Zhuang et al. 2017.
Weaknesses:
- The Encoder-Attender-Decoder framing is more a tidy taxonomy than a fundamentally new modelling concept; each module comes from prior work.
- Attention module analysis is modest. Gains are small and mechanistic interpretation remains speculative.
Questions
Following up on my comments above, the paper would benefit from a targeted robustness ablation, stronger significance tests, and a leaner, more transparent methods presentation. Specifically:
- Models are trained on passive sweeps of 9k ShapeNet meshes but evaluated neurally on active whisking to six cup-like shapes. Ceiling-level RSA might therefore hinge on an accidental projection onto a tiny stimulus sub-space. This needs at least one ablation to show robustness. For instance: retrain (or at least fine-tune) the best ConvRNN-EAD on a dataset that includes the six concave/convex objects (held out only for validation) and report RSA, to verify it remains near ceiling after ablation.
- The lack of statistical testing undermines confidence in architecture ranking. We cannot tell if IntersectionRNN > ResNet is reliable across mice. To address this, the authors should provide hierarchical statistics (e.g. linear mixed-effects with random intercept per mouse) or paired permutation tests on RSA scores.
- Six stimuli yield a noisy ceiling estimate. Please report the split-half reliability distribution and give bootstrapped 95% CIs for the inter-animal RSA. Add SEM bars or violin plots for the model-to-mouse RSA distribution (n = 11). Clearer ceilings would raise confidence; wide uncertainty left unacknowledged would lower the significance of the results.
- Several critical details are buried or missing. Please move to main text or clearly state in appendix: full definition of IntersectionRNN cell and unroll scheme, exact tactile augmentation parameters (rotation range, flip prob., temporal reversal prob.), and optimizer/learning-rate schedule/batch size/etc.
Limitations
Mostly yes. Related to my comments above, the training-test domain gap should be better acknowledged/addressed.
Justification of Final Rating
The authors have addressed my main concerns (passive sweep vs active whisking, noise ceiling, stats) and pointed me to other details that I either overlooked or that were clarified. I have raised my score accordingly.
Formatting Issues
None noted.
Thank you for taking the time to review and provide your valuable feedback! We address the weaknesses and questions as follows.
Passive sweeps vs. active whisking and including neural fit shapes during training
This is a great question. We are already generalizing because the shapes used in the neural evaluation are not present, even in distribution, in the ShapeNet training data. The reason we chose passive sweeps for the large sweeping dataset on ShapeNet is that, especially with the temporal flipping used in the tactile augmentations, active whisking is isomorphic to doing multiple passive sweeps. Therefore, passive whisking is a more scalable implementation of active whisking, which is important for generating pretraining dataset diversity, which in turn is important for matching brain data by the “Contravariance Principle” [1, 2, 3].
[1] R. Cao and D. Yamins. “Explanatory Models in Neuroscience, Part 1: Taking Mechanistic Abstraction Seriously.” Cognitive Systems Research, vol. 87, Sep. 2024, p. 101244. ScienceDirect, https://doi.org/10.1016/j.cogsys.2024.101244.
[2] D. Yamins and J. DiCarlo. “Using Goal-Driven Deep Learning Models to Understand Sensory Cortex.” Nature Neuroscience, vol. 19, no. 3, Mar. 2016, pp. 356–65. PubMed, https://doi.org/10.1038/nn.4244.
[3] A. Nayebi*, N. C. Kong*, C. Zhuang, J. L. Gardner, A. M. Norcia, and D. L. Yamins. “Mouse Visual Cortex as a Limited Resource System That Self-Learns an Ecologically-General Representation.” PLoS Computational Biology, vol. 19, no. 10, Oct. 2023, p. e1011506. PubMed, https://doi.org/10.1371/journal.pcbi.1011506.
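To make the augmentation claim concrete, here is a minimal NumPy sketch of two of the tactile augmentations (the tensor shape, probabilities, and whisker ordering are illustrative assumptions; the exact parameters are given in the paper's appendix):

```python
import numpy as np

def temporal_flip(x, p=0.5, rng=None):
    """Reverse the sweep in time with probability p.

    A time-reversed passive sweep reads like the return phase of an
    active whisk, which is why multiple passive sweeps plus temporal
    flipping can stand in for active whisking.
    """
    rng = rng or np.random.default_rng()
    return x[::-1].copy() if rng.random() < p else x

def whisker_flip(x):
    """Mirror the whisker array left/right (mimics a mirrored head pose).

    Assumes the 30 whiskers are ordered as two 15-whisker sides.
    """
    w = x.shape[1]
    left, right = x[:, : w // 2], x[:, w // 2 :]
    return np.concatenate([right, left], axis=1)

# Hypothetical sweep: 100 time steps, 30 whiskers, 3 forces + 3 torques.
sweep = np.random.default_rng(0).normal(size=(100, 30, 6))
aug = whisker_flip(temporal_flip(sweep, p=1.0))
```

Both operations are involutions, so applying either twice recovers the original sweep, which keeps the augmented views label-preserving.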
Low number of stimuli and confidence intervals for the noise ceiling
The mouse neural dataset we are evaluating against only has 2 different shapes and 3 distances which we can use as unique stimuli. This was the most recent viable candidate for somatosensory cortex data that was available. We agree this is an unfortunate limitation that reduces the generalizability of our findings; though we note that low numbers of stimuli are a common occurrence in neuroscience and our task-optimized modeling approach, along with our inter-animal consistency analysis, highlights the need for future neural datasets to move forward in this concrete direction of collecting more stimuli. This was noted in the “Limitations and Future Directions” paragraph in the Discussion section: “Most importantly, current tactile neural datasets remain limited in stimulus diversity and the number of object conditions tested, restricting the captured neural variability and could be the reason why inter-animal consistency values are lower for the statistical average between animals than the current pointwise empirical maximum of 1.3.” (Note: Statistically, Pearson's correlation value (r) is within the range [-1, 1] but the split-half noise correction can result in values >1 when self-consistency is low.) Finally, since we train the models on the large-scale tactile whisking task rather than fit to the neural data directly, we note that our models can be evaluated without change on future neural datasets that collect more stimuli. Therefore, our models and results here are not the final say, but a platform for future exploration and evaluation.
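The parenthetical note about corrected values exceeding 1 follows directly from the standard attenuation correction, which divides the raw correlation by the geometric mean of the split-half reliabilities. A toy numeric sketch (the values are illustrative, not from our data):

```python
import math

def noise_corrected(r_xy, r_xx, r_yy):
    """Split-half noise correction: raw correlation divided by the
    geometric mean of the two split-half reliabilities."""
    return r_xy / math.sqrt(r_xx * r_yy)

# A raw Pearson r lies in [-1, 1], but with low reliabilities (0.3 and
# 0.4 here) a raw correlation of 0.5 corrects to ~1.44, i.e. above 1.
r = noise_corrected(r_xy=0.5, r_xx=0.3, r_yy=0.4)
```

This is why low self-consistency in the neural data can push noise-corrected scores past the nominal bound of 1.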
Statistical significance of model comparisons across mice
The error bars in Figure 4 show the SEM of the median neural fit score across animals. Additionally, a Wilcoxon signed-rank test on the neural scores across 11 mice between Inter+SimCLR (best neural score model) and ResNet+Mamba+SimCLR (best task score among self-supervised models) gives statistic = 2.0, p = 0.0425, which is significant (p < 0.05). We will include this additional result in the paper!
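An exact paired sign-flip permutation test, one of the reviewer's suggested alternatives, is cheap at n = 11 (only 2^11 sign patterns). A sketch with hypothetical per-mouse scores (not our actual values):

```python
import itertools
import numpy as np

def paired_sign_flip_test(a, b):
    """Exact paired permutation (sign-flip) test: two-sided p-value for
    mean(a - b) under random sign flips of the per-mouse differences."""
    d = np.asarray(a) - np.asarray(b)
    observed = abs(d.mean())
    count = 0
    for signs in itertools.product([1, -1], repeat=len(d)):
        # Count sign patterns at least as extreme as the observed mean.
        if abs((d * np.array(signs)).mean()) >= observed - 1e-12:
            count += 1
    return count / 2 ** len(d)

# Hypothetical per-mouse neural scores (n = 11), not the paper's values.
inter_simclr = [0.81, 0.78, 0.84, 0.80, 0.79, 0.83, 0.77, 0.82, 0.80, 0.85, 0.79]
resnet_mamba = [0.74, 0.75, 0.80, 0.73, 0.76, 0.78, 0.74, 0.77, 0.75, 0.79, 0.76]
p = paired_sign_flip_test(inter_simclr, resnet_mamba)
```

Because the test enumerates all sign patterns, the p-value is exact and makes no distributional assumptions about the RSA scores.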
Search budget and computational efficiency
We already report the GPU hours in line 191 in the main text and the training details (e.g., hyper-parameter tuning strategy) in appendix A2 Model Training. We will provide a more detailed description on the search and training cost in our revised paper.
Missing details in main text & relevance of robotics in introduction
With the additional page provided for the camera-ready version, we are happy to include the augmentation hyper-parameters, exact RSA equation, IntersectionRNN definition in the main text. We will also revise the robotics-hardware history in the introduction section to clarify the primary message and result of our work.
Impact on mainstream ML and contribution to architectural insights for temporal tactile modeling
One of our main contributions is the Encoder-Attender-Decoder (EAD) framework, which enables comprehensive and systematic model search across diverse neural network architectures. Despite the similar architectural conclusion, our tactile setting differs from prior vision work: the model's inputs are forces/torques rather than pixels, and such signals are far less intuitive to interpret. We also found that we had to create our own data augmentations for self-supervised tactile learning, as typical image augmentations did not scale for this dataset [Figure 5c]. Furthermore, to the best of our knowledge, we are the first to compare tactile models to brain data in a neural alignment test; hence, our findings are supported by both the EAD framework and the neural alignment results for temporal tactile processing.
The EAD architecture as a unified search framework
As mentioned in lines 3-4 of the abstract and lines 166-168, we agree that the EAD framework offers a comprehensive way to search through different model architectures rather than a fundamentally new modeling concept for tactile processing.
Attention module analysis and performance gains
Compared to the choice of encoders and augmentations, the performance gains of the different attention modules are indeed relatively small, as discussed in the paragraph “GPT-based Attenders Provide Modest Improvements in Task Performance and Neural Alignment” (lines 274-285); hence, more of our analyses focus on the encoders and augmentations.
Thank you for these clarifications. I would encourage the authors to explain the "passive sweep vs active whisking" early on in their camera-ready manuscript. I have no further questions
Thank you very much for your comments and suggestions, we will make sure to include this clarification early on in the manuscript.
This paper systematically explores neural network architectures for tactile perception inspired by the rodent whisker system. The authors propose an "Encoder-Attender-Decoder" (EAD) framework and train various models using realistic simulated tactile data from virtual whisker interactions with 3D objects. They demonstrate that convolutional recurrent neural networks (ConvRNNs) outperform other architectures on tactile object categorization tasks. These ConvRNN models exhibit remarkable alignment with neural recordings from rodent somatosensory cortex, surpassing average inter-animal consistency. This research quantitatively identifies nonlinear recurrence as a key inductive bias in rodent tactile processing, guiding future development in embodied tactile AI systems.
Strengths and Weaknesses
Strengths:
[S1] The proposed Encoder-Attender-Decoder framework enables systematic exploration of neural architecture space, moving beyond ad-hoc architecture selection. It facilitates comprehensive comparisons of multiple models against biological data, highlighting essential inductive biases relevant to rodent tactile processing.
[S2] The paper provides valuable insights and findings without explicitly fitting models to neural data. The ConvRNN encoders effectively saturate explainable neural variance in rodent somatosensory cortex, and demonstrate a clear correlation between task performance and neural fit.
Weaknesses:
[W1] In Figure 4, nearly all models—including randomly initialized versions—outperform the a2a baseline regarding neural fit. Additionally, certain untrained models outperform their trained counterparts. The method used to evaluate neural fit and the impact of model training need further investigation.
[W2] The neural data task involves only six distinct stimuli (concave/convex at three distances), potentially simplifying variance saturation compared to scenarios with more diverse stimulus sets.
Questions
See Weaknesses.
Additional question: Are there any models that perform lower than the a2a baseline in neural fit? For example, untrained models with different architectures?
Limitations
yes
Formatting Issues
N/A
Thank you for taking the time to review and provide your valuable feedback! We address the weaknesses and questions as follows.
Low number of stimuli
The mouse neural dataset we are evaluating against only has 2 different shapes and 3 distances which we can use as unique stimuli. This was the most recent viable candidate for somatosensory cortex data that we could find.
Model neural fit scores compared to a2a baseline
All models are indeed above the mean animal-to-animal (a2a) fit, but below the maximum a2a baseline. This is noted in the caption but we will update the figure to make this clearer. (Note: Statistically, Pearson's correlation value (r) is within the range [-1, 1] but the split-half noise correction can result in values >1 when self-consistency is low.)
We noted both of these limitations in the “Limitations and Future Directions” paragraph in the Discussion section (lines 351-354): “Most importantly, current tactile neural datasets remain limited in stimulus diversity and the number of object conditions tested, restricting the captured neural variability and could be the reason why inter-animal consistency values are lower for the statistical average between animals than the current pointwise empirical maximum of 1.3.”
Follow-up question: Why do randomly initialized (untrained) models perform so well, even better than many trained models? Do you have any explanation for this phenomenon?
Thank you for your question! Randomly initialized networks are highly non-random functions, as they are architectures selected to perform the task well when trained. Therefore, architecture matters a lot, and this is a well-noted phenomenon in NeuroAI across brain regions [1, 2, 3]. Doing this comparison in our manuscript allows us to isolate the contributions of the architecture vs. the loss function for every (architecture, loss function) pair. The models that saturate our noise ceiling (bars on far right in Fig. 4b) are noticeably improved when trained. Thank you for asking this, we will include this explanation in our final revision.
[1] Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., & DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the national academy of sciences, 111(23), 8619-8624.
[2] A. Nayebi*, N. C. Kong*, C. Zhuang, J. L. Gardner, A. M. Norcia, and D. L. Yamins. “Mouse Visual Cortex as a Limited Resource System That Self-Learns an Ecologically-General Representation.” PLoS Computational Biology, vol. 19, no. 10, Oct. 2023, p. e1011506. PubMed, https://doi.org/10.1371/journal.pcbi.1011506.
[3] Schrimpf, M., Kubilius, J., Lee, M. J., Murty, N. A. R., Ajemian, R., & DiCarlo, J. J. (2020). Integrative benchmarking to advance neurally mechanistic models of human intelligence. Neuron, 108(3), 413-423.
This work focuses on rodent tactile processing, and evaluates neural networks trained on biomechanically-realistic force / torque tactile sequences. The results show that recurrent ConvRNNs (IntersectionRNNs) are better at tactile categorization and neural alignment than feedforward ResNet and attention-based models. The models show impressive neural alignment, saturating the explainable neural variability and surpassing inter-animal consistency benchmarks with both supervised and contrastive self-supervised objectives.
Strengths and Weaknesses
Strengths
- First application of model-brain-alignment methods to tactile processing across a broad set of model architectures and both supervised and self-supervised training objectives
- Novel Encoder-Attender-Decoder parameterization
- Direct neural comparison between a large set of models and rodent somatosensory cortical responses
- Training models on a synthetic dataset (simulated objects interacting with biomechanically-realistic rodent whisker model) is creative and provides a strong validation of the bio-mechanical model (given the success of the resulting convRNN models in predicting real neural data)
Weaknesses
- The alignment with real rodent neural data appears to be based on RDMs computed from just 6 stimuli. This might be the best dataset available, but this dataset would seem to provide a rather coarse measure and should temper claims about the strength of the alignment between models and real neural data. Reading the abstract I was excited by the finding that the neural alignment scores saturated the noise ceiling, but less surprised that this was the case after reading that the analysis focuses on 6 stimuli. The results are still promising, but I would not necessarily expect “saturating the noise ceiling” to hold up under a larger, more varied dataset.
Questions
Results Figure 4. How can the animal-to-animal scores be so low (<.20) and the model-to-animal scores so high (approaching 1.0, noise adjusted)? How could one model RDM be almost perfectly correlated (noise adjusted) with each individual animal RDM (without any tuning to individuals), but those animal RDMs be so poorly correlated with each other?
Results Figure 4. Can a model score well just by distinguishing convex from concave? Is there reliable variation between near-medium-far (within convex only; or within concave only), and if so, can the models capture this variation as well? This would at least suggest that the models are capable of capturing “fine-grained differences in real neural data” (within the constraints of this particular real-neural dataset). Relatedly, since the RDMs are so small, it would be useful to depict them in a Figure, to clarify what the “neural target” is for these analyses
For task-performance, is it fair to use “top-5 classification” as your measure of task-performance for the self-supervised models, since this is not in fact the task that the model was trained on? I suppose the more neutral measure of task-performance would be some version of the task loss (maybe somehow normalized to allow comparison across tasks), or even a complementary “instance-recognition” measure (how well individual samples are discriminated from the rest), which is more in line with the objective of the self-supervised models.
Regarding the strong neural alignment of the self-supervised models and its parity with the supervised models, I would find these results more compelling if it could be clearly demonstrated that there’s some fine-grained signal here (not just getting the convex vs. concave distinction, assuming that’s PC1 for these neural data, but also getting the near-medium-far variation within convex, or separately within concave).
Were the rodent neural data combined across brain regions? Why not do separate analyses for S1, S2? Or are the RDMs too similar between these regions to bother with the breakdown?
Minor
- How was the explainable variance ceiling estimated?
- Did you identify the best-fitting layer in separate split of data than what you used to provide the final score for that layer? If so, this could be clarified in the final paragraph of the methods (pg. 6, just before Results).
- Can the authors include an explanation for what the IntersectionRNN is?
- How ethologically valid do you think the SimCLR variant is? Could you have had the simulated rodent actually “sample the object twice” and still get strong learning?
- Are there any other “precision measurements” or “fine-grained” analyses afforded by the rodent neural dataset, e.g., tuning of individual neurons?
- Isn’t the importance of an inductive bias of “non-linear recurrent hierarchical network” expected from prior work on rodent tactile processing?
- I didn’t really understand how this work might impact robotics exactly. How might these insights be integrated into robotic systems? Are tactile robotics not already using recurrent CNNs? Or is the idea that recurrent EAD systems are new and should be used?
Limitations
yes
Justification of Final Rating
I'm convinced that the main limitations of this work are inherited from limitations in the behavioral dataset, but I would encourage the authors to raise these limitations more clearly to make sure they are more clear to readers. In my view the methods and approach are "out in front" of the dataset, and so therefore the claims are a bit "out in front" of the data as well. But I think the strengths of the methods introduced, and advances made here in neural modeling in this domain (which basically expose the need for better neural data) are important and likely will have a broad impact.
Formatting Issues
none
Thank you for taking the time to review and provide your valuable feedback! We address the weaknesses and questions as follows.
Low number of stimuli
The mouse neural dataset we are evaluating against only has 2 different shapes and 3 distances which we can use as unique stimuli. This was the most recent viable candidate for somatosensory cortex data that was available. We agree this is an unfortunate limitation that reduces the generalizability of our findings; though we note that low numbers of stimuli are a common occurrence in neuroscience. Our task-optimized modeling approach, along with our inter-animal consistency analysis, highlights the need for future neural datasets to move forward in this concrete direction of collecting more stimuli. This was noted in the “Limitations and Future Directions” paragraph in the Discussion section (lines 351-354). (Note: Statistically, Pearson's correlation value (r) is within the range [-1, 1] but the split-half noise correction can result in values >1 when self-consistency is low.) Finally, since we train the models on the large-scale tactile whisking task rather than fit to the neural data directly, we note that our models can be evaluated without change on future neural datasets that collect more stimuli.
Discrepancy between model-to-animal and animal-to-animal alignment
The model-to-animal neural score is obtained by using the maximum alignment score across the model to each animal. As a result, the models indeed achieve a score above the mean animal-to-animal (a2a) fit, but all are also below the maximum a2a baseline. (This is noted in the caption but we will update the figure to make this clearer.) In other words, the variability between animals is very high, but there is always an animal pair that matches very closely. The inter-animal consistency score calculation is included in the Appendix (A3, Eq. 1); it is the RSA score between the response patterns for each pair of animals.
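For reference, a minimal sketch of the RSA computation described here: correlation-distance RDMs over the stimuli, then correlating RDM upper triangles. This simplifies the full pipeline in Appendix A3 (Eq. 1); shapes and data are illustrative:

```python
import numpy as np

def rdm(responses):
    """responses: (n_stimuli, n_units) -> (n_stimuli, n_stimuli)
    representational dissimilarity matrix, 1 - Pearson r between
    stimulus response patterns."""
    return 1.0 - np.corrcoef(responses)

def rsa(resp_a, resp_b):
    """Pearson correlation between the upper triangles of two RDMs."""
    iu = np.triu_indices(resp_a.shape[0], k=1)
    return np.corrcoef(rdm(resp_a)[iu], rdm(resp_b)[iu])[0, 1]

rng = np.random.default_rng(0)
animal_a = rng.normal(size=(6, 50))                    # 6 stimuli x 50 units
animal_b = animal_a + 0.1 * rng.normal(size=(6, 50))   # a similar "animal"
score = rsa(animal_a, animal_b)                        # close to 1 here
```

With only 6 stimuli the upper triangle has just 15 entries, which is also why the ceiling estimate is noisy.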
Distinguishing different types of stimulus and variations
No, we find that matching the neural data is not simply explained by distinguishing convex from concave; rather, the models more reliably decode near-medium-far, suggesting that there is more variability along this axis for neural predictivity than merely convex vs. concave. Below, we show that the decoding performance of the best models is above chance for near-medium-far. For each of the 6 stimuli, we train a logistic regression model on the 5 other stimuli and test on the single held-out stimulus. While the number of stimuli is too low to make a definitive claim, interestingly our most task-performant self-supervised model had the highest stimulus decoding score.
| Model | Note | Shape (chance=0.5) | Distance (chance=0.33) |
|---|---|---|---|
| Inter+SimCLR | Best neural score | 0.0 | 0.0 |
| Zhuang+GPT+Supervised | Best task score | 0.33 | 0.5 |
| Inter+GPT+Supervised | Best neural score out of supervised models | 0.0 | 0.5 |
| ResNet+Mamba+SimCLR | Best task score out of self-supervised models | 0.0 | 0.67 |
| Mouse Neural Data | Score of stimuli decoded per mouse, then averaged across mice | 0.44 | 0.0 |
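The hold-one-stimulus-out protocol above can be sketched as follows, with a nearest-centroid decoder standing in for the logistic regression we used, and synthetic features standing in for model embeddings (all names and values here are hypothetical):

```python
import numpy as np

# Six stimuli: 2 shapes (concave/convex) x 3 distances (near/medium/far).
stimuli = [(s, d) for s in range(2) for d in range(3)]

def embed(shape, dist, rng):
    """Hypothetical stand-in for a model embedding of one stimulus."""
    base = np.array([0.5 * shape, 1.5 * dist, 0.0])
    return base + 0.1 * rng.normal(size=3)

def loo_distance_decoding(rng):
    """Hold out one stimulus, fit per-distance centroids on the other
    five, then predict the held-out stimulus's distance label."""
    feats = {st: embed(*st, rng) for st in stimuli}
    correct = 0
    for held in stimuli:
        train = [st for st in stimuli if st != held]
        cents = {d: np.mean([feats[st] for st in train if st[1] == d], axis=0)
                 for d in range(3)}
        pred = min(cents, key=lambda d: np.linalg.norm(feats[held] - cents[d]))
        correct += pred == held[1]
    return correct / len(stimuli)

acc = loo_distance_decoding(np.random.default_rng(0))  # above chance (0.33)
```

Note that because each distance appears for both shapes, every held-out fold still covers all three distance classes in training, which is what makes the protocol well-posed.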
Finally, a figure of the RDMs for the best neural-fitting supervised and self-supervised models is already included in the appendix (Figure A3).
Task performance measurement other than “top-5 accuracy”
For self-supervised learning, all the models are first pre-trained with self-supervised losses. After we have the best pre-trained checkpoint, we freeze the model weights and add a trainable classification head, which is a common approach for evaluating downstream task performance for self-supervised models [1]. This assembled model is then fine-tuned with labels in a supervised fashion; we save the best model based on its accuracy on the validation set and report the test accuracy (see lines 187-189). Hence, reporting task performance as “top-5 accuracy” is consistent for both supervised and self-supervised learning. In our preliminary experiments, we tried self-supervised learning with the instance recognition loss [1]; however, training was highly unstable and the results were not satisfactory, which suggests that instance recognition as an objective/measurement is less promising for tactile processing. Thank you for bringing this up, we will mention this in our revised paper.
[1] Wu Z, Xiong Y, Yu SX, Lin D. Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. p. 3733–3742.
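The frozen-encoder linear evaluation can be sketched as follows, in pure NumPy, with a random frozen projection standing in for the pretrained backbone and a toy binary task (the real pipeline uses our pretrained encoders and the 117-way category labels):

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "encoder": stands in for the pretrained SimCLR backbone.
W_enc = rng.normal(size=(20, 64)) / np.sqrt(20)
def encode(x):
    # Weights are never updated -- only the head below is trained.
    return np.maximum(x @ W_enc, 0.0)

# Toy labeled data: class = sign of the first input dimension.
x = rng.normal(size=(200, 20))
y = (x[:, 0] > 0).astype(int)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def loss(W):
    p = softmax(encode(x) @ W)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

# Trainable linear classification head on top of frozen features.
W = np.zeros((64, 2))
loss_before = loss(W)
feats = encode(x)
for _ in range(300):  # plain gradient descent on the head only
    p = softmax(feats @ W)
    p[np.arange(len(y)), y] -= 1.0
    W -= 0.01 * feats.T @ p / len(y)
loss_after = loss(W)
acc = (softmax(feats @ W).argmax(axis=1) == y).mean()
```

Only `W` is updated, so the probe measures what the frozen representation already encodes, which is the point of the linear-evaluation protocol.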
Brain regions for rodent neural data & potential more fine-grained analyses
It is all from S1 (barrel cortex specifically). Our current neural dataset does not parametrize convexity and concavity for us to run a tuning curve analysis. However, our focus is on building the models which will be useful for such future datasets and subsequent downstream analyses to further separate models, since our model parameters are optimized on a task (rather than our specific neural dataset) and can be readily evaluated in new settings.
Identifying the best-fitting layer
The best neural fitting layer is the layer with the highest neural score, so a separate split of data was not used.
Explanation of the IntersectionRNN
The following equations define the recurrent computations at layer $\ell$ and time step $t$ for the IntersectionRNN cell; more details about the computations of the different RNN cells used in our paper can be found in [2], and we plan to include these details in our revised paper.

$$\tilde{h}_t^{\ell} = \tanh\left(W_h * x_t^{\ell} + U_h * h_{t-1}^{\ell} + b_h\right)$$
$$\tilde{y}_t^{\ell} = \mathrm{ReLU}\left(W_y * x_t^{\ell} + U_y * h_{t-1}^{\ell} + b_y\right)$$
$$g_t^{h} = \sigma\left(W_{gh} * x_t^{\ell} + U_{gh} * h_{t-1}^{\ell} + b_{gh}\right)$$
$$g_t^{y} = \sigma\left(W_{gy} * x_t^{\ell} + U_{gy} * h_{t-1}^{\ell} + b_{gy}\right)$$
$$h_t^{\ell} = g_t^{h} \odot h_{t-1}^{\ell} + \left(1 - g_t^{h}\right) \odot \tilde{h}_t^{\ell}$$
$$y_t^{\ell} = g_t^{y} \odot x_t^{\ell} + \left(1 - g_t^{y}\right) \odot \tilde{y}_t^{\ell}$$

Notation
- $x_t^{\ell}$ is the input at time $t$ and layer $\ell$ (with $y_t^{\ell}$ the output passed in the depth direction, i.e. $x_t^{\ell+1} = y_t^{\ell}$)
- $h_t^{\ell}$ is the state at time $t$ and layer $\ell$
- $W$ and $U$ represent learnable weight matrices
- $b$ are bias terms
- $\odot$ denotes element-wise multiplication
- $\sigma$ is the sigmoid function
- $\tanh$ is the hyperbolic tangent function
- $\mathrm{ReLU}$ is the rectified linear unit function
- $*$ denotes a linear transformation or convolution depending on context
[2] Aran Nayebi, Javier Sagastuy-Brena, Daniel M. Bear, Kohitij Kar, Jonas Kubilius, Surya Ganguli, David Sussillo, James J. DiCarlo, Daniel L. K. Yamins bioRxiv 2021.02.17.431717; doi: https://doi.org/10.1101/2021.02.17.431717
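A minimal fully connected NumPy sketch of this cell (our encoders use the convolutional variant from [2]; the dimensions and initialization here are illustrative, and the cell assumes matching input/state sizes so the depth gate can mix the input directly into the output):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class IntersectionRNNCell:
    """One IntersectionRNN step: gated updates along both the recurrent
    (h) and depth (y) directions."""
    def __init__(self, dim, rng):
        s = 1.0 / np.sqrt(dim)
        self.p = {k: rng.uniform(-s, s, size=(dim, dim))
                  for k in ["Wh", "Uh", "Wy", "Uy", "Wgh", "Ugh", "Wgy", "Ugy"]}
        self.b = {k: np.zeros(dim) for k in ["h", "y", "gh", "gy"]}

    def step(self, x, h_prev):
        p, b = self.p, self.b
        h_cand = np.tanh(x @ p["Wh"] + h_prev @ p["Uh"] + b["h"])
        y_cand = np.maximum(x @ p["Wy"] + h_prev @ p["Uy"] + b["y"], 0.0)  # ReLU
        g_h = sigmoid(x @ p["Wgh"] + h_prev @ p["Ugh"] + b["gh"])
        g_y = sigmoid(x @ p["Wgy"] + h_prev @ p["Ugy"] + b["gy"])
        h = g_h * h_prev + (1.0 - g_h) * h_cand  # recurrent direction
        y = g_y * x + (1.0 - g_y) * y_cand       # depth direction (to layer l+1)
        return y, h

rng = np.random.default_rng(0)
cell = IntersectionRNNCell(dim=16, rng=rng)
h = np.zeros(16)
for t in range(5):  # unroll over a short force/torque snippet
    x = rng.normal(size=16)
    y, h = cell.step(x, h)
```

Because the state update is a convex combination of the previous state and a tanh candidate, the hidden state stays bounded across the unroll.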
Ethological validity of the SimCLR loss
The augmentations for SimCLR approximate the head motion of the rodents (see Fig. 2b, middle; for example, the rotation augmentation mimics the rodent rotating its head while sensing the same object). From a training perspective, this enforces invariance across views of the same object; hence, the “sample-twice” mechanism of SimCLR, together with the proposed tactile augmentations, ethologically approximates how rodents obtain different views of the same object by moving their heads.
Inductive biases and prior work in rodent tactile processing
As mentioned in lines 98-101, prior work in rodent tactile processing is limited to supervised learning and a small number of model architectures, and crucially does not compare the models to brain data. With our Encoder-Attender-Decoder framework, we can systematically explore and search for the inductive biases necessary for rodent tactile processing, which is further strengthened by our results in neural alignment, as discussed in lines 320-324. To sum up, our work is fundamentally different from previous works, and the findings are well-supported by both task performance and neural alignment of the models under a comprehensive search across a broad range of neural network architectures (e.g., SSMs, RNNs, transformers).
Integration of insights into robotic tactile systems
There is currently no established or widely-used method to process tactile data in robotics, due to so much variety in hardware implementation. Yes, our intention is to convey that recurrent EADs would be a good candidate for such tactile hardware as it becomes more capable. At the moment, animals and simulation environments are more capable than current robotic hardware; therefore, our motivation for this work is to better understand a biological tactile sensing system so that, down the line, we can try to replicate biological tactile capabilities in robots using the inductive biases discovered in this paper. As such, the primary result of this work is neuroscientific in scope and not currently deployed on robots. Still, our work (1) provides new tactile augmentations which may also be used to train self-supervised policies on robot tactile data, and (2) serves as a validation of low-dimensional force sensors as sufficient for object discrimination, as opposed to requiring high-resolution images as tactile data (visuotactile sensors like GelSight). Both provide strong evidence for developing future robotic tactile systems based on force/torque sensors, rather than sensors that rely purely on vision, and can potentially be adapted for building such models.
I'm convinced that the main limitations of this work are inherited from limitations in the behavioral dataset, but I would encourage the authors to raise these limitations more clearly so they are more apparent to readers. For example, I thought I was confused/mistaken about the number of conditions when I saw that RSA was being used (my sense is that an RDM over 6 conditions is less common / discouraged in comp neuro, especially with a clear 2-factor experimental design where other methods like linear mixed-effects regression might be sensible).
Thank you for the helpful suggestion. We agree that the limited number of conditions is a core constraint of the neural dataset. While this point is already noted in our Discussion (Limitations), we will elevate it to its own subsection to make it more prominent for readers.
We chose RSA over linear regression in part because the small number of conditions makes linear models prone to overfitting, particularly with any hyperparameter tuning. RSA, being parameter-free and widely used in NeuroAI, provides a more robust and interpretable comparison under these constraints.
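To make the parameter-free nature of this comparison concrete, here is a minimal RSA sketch: a representational dissimilarity matrix (RDM) is computed per system, and alignment is the Spearman correlation between the two RDMs, with no fitted read-out. The array shapes and function names are illustrative assumptions, not the paper's actual pipeline (which additionally applies noise correction).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """Condensed representational dissimilarity matrix:
    1 - Pearson correlation between condition response patterns.
    responses: (n_conditions, n_features) array."""
    return pdist(responses, metric="correlation")

def rsa_score(model_features, neural_responses):
    """Spearman correlation between model and neural RDMs.
    Parameter-free: no regression weights are fitted."""
    rho, _ = spearmanr(rdm(model_features), rdm(neural_responses))
    return rho

# Illustrative shapes only: 6 stimulus conditions,
# 128 model units vs. 50 recorded neurons.
rng = np.random.default_rng(0)
model = rng.standard_normal((6, 128))
neural = rng.standard_normal((6, 50))
print(rsa_score(model, neural))
```

With only 6 conditions, each RDM has just 15 unique pairwise entries, which is exactly why a heavily parameterized linear mapping would be easy to overfit here.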
More broadly, we see our study as a first step, and hope it motivates the adoption of richer stimulus designs to complement increasingly powerful neural recordings as an actionable and concrete next step for the community. This dimension, stimulus diversity, is often underemphasized in experimental neuroscience, where the focus tends to be on recording more neurons rather than more conditions. Because our models are task-optimized rather than response-optimized, they can be directly tested on such future datasets as they become available.
This paper introduces an Encoder-Attender-Decoder (EAD) framework for modeling tactile perception in rodents by training temporal neural networks on realistic tactile sequences simulated from a rodent whisker array. The authors build the ShapeNet Whisking Dataset, a synthetic tactile dataset generated by simulating a rodent's whisker array interacting with 9,981 3D objects from ShapeNet across 117 categories. Using the WHISKiT Physics simulator, the authors modeled forces and torques at the bases of 30 whiskers during object sweeps with varied speeds, heights, rotations, and distances, creating rich temporal tactile signals reflective of real whisker sensing. The authors systematically compare convolutional recurrent neural networks (ConvRNNs), feedforward ResNets, and attention/state-space models. The results emphasize the importance of recurrent processing and self-supervision for capturing tactile representations, advancing both scientific understanding and robust tactile perception in embodied AI.
Strengths and Weaknesses
Strengths:
- The paper is easy to follow, with major contributions mentioned clearly in the Introduction.
- The motivations behind this paper are of significant importance, in that they build a systematic study of encoder, attention, and decoder architectures for inferring sensory semantics from simulated rodent whiskers, which can pave the way for architecture design in robotic tactile sensing.
- The authors present the ShapeNet Whisking Dataset, a synthetic tactile dataset generated by simulating a rodent's whisker array interacting with 9,981 3D objects from ShapeNet across 117 categories.
- The authors' insight about the contrasting usefulness of recurrent and attention-based architectures, namely that the temporally smooth signals from tactile inputs benefit from convolutional and recurrent mechanisms that integrate information locally over time, holds in the presented results.
Weaknesses:
- While this study itself is meaningful, the connection to a real-life application, say, in robotics, is far-fetched. Is there any evidence that the takeaways may hold for any available sensing modalities such as force/torque or vision-based touch sensing from GelSight?
- Although the simulations are biologically realistic, models are trained on synthetic data rather than real tactile sensor readings, leaving a gap between simulation and potential real-world heterogeneity in tactile data.
Questions
- Did you explore other recurrent variants (e.g. vanilla RNNs, bi-directional RNNs), and how did they perform?
- ShapeNet is a human-centric dataset. How do you think the neural alignment results would change if the dataset contained ecologically relevant rodent objects (e.g. burrow structures, food items)?
- How transferable are your ConvRNN-based models to real robotic tactile sensors such as GelSight, given mechanical differences?
Limitations
Yes.
Final Justification
The authors addressed all the questions I raised in the rebuttal, specifically around exploring alternate RNN architectures. I am happy to recommend this paper for acceptance and will keep my score of 5 (Accept).
Formatting Concerns
No concerns.
Thank you for taking the time to review and provide your valuable feedback! We address the weaknesses and questions as follows.
Transferability to robotic tactile sensors
Our motivation for this work is to better understand a biological tactile sensing system so that, down the line, we can try to replicate biological tactile capabilities in robots using the inductive biases discovered in this paper. As such, the primary result of this work is in a neuroscientific scope and not deployed on robots. Still, our work (1) provides new tactile augmentations which may also be used to train self-supervised policies on robot tactile data, and (2) serves as a validation of low-dimensional force sensors as sufficient for object discrimination, as opposed to requiring high resolution images as tactile data (visuotactile sensors like GelSight).
Simulation-to-reality gap in tactile modeling
We used the most biologically accurate simulation model of whiskers available, based on real mouse morphology, in order to reduce the gap; but without real whisker force sensors, we are unable to evaluate the extent of the gap ourselves. Especially with tactile sensors, the wide variety of sensor implementations (optical, capacitance, mechanical, magnetic, pneumatic, etc.) and the high amount of noise and variability even within sensor types make collecting quality real data very difficult. A discussion of the limitations of current hardware is included in the Introduction: current bio-inspired robotic whisker sensors are limited by hardware complexity, poor multi-stimuli discrimination, and mechanical shortcomings that hinder accurate tactile sensing compared to biological whiskers. While the creation and validation of a real whisker-array force sensor is outside the scope of our current neuroscientific evaluation, we agree that it would be beneficial to evaluate our models on real whisker-like hardware once available.
Exploration of alternative recurrent architectures
Thank you for your suggestion! The RNN architectures explored in our paper currently include: time-decay (the original Zhuang model), GRU, LSTM, UGRNN, and IntersectionRNN. We did not consider bi-directional architectures for supervised learning, as they break the biologically plausible temporal unrolling mechanism in which each layer is sequentially activated as time increases. For self-supervised learning, we incorporate the idea of time reversal in our proposed tactile augmentations (see Fig. 2b, middle, temporal), which flip the input temporally and are similar in spirit to bi-directional RNNs.
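The temporal-reversal augmentation described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's released code; the function name, the `(time, n_whiskers, n_channels)` layout, and the flip probability `p` are all assumptions.

```python
import numpy as np

def temporal_flip(x, p=0.5, rng=None):
    """Randomly reverse a tactile sequence along the time axis.
    x: array of shape (time, n_whiskers, n_channels), e.g. force/torque
    signals at each of 30 whisker bases. Applied with probability p as
    one stochastic view in a SimCLR-style contrastive pipeline."""
    rng = rng or np.random.default_rng()
    if rng.random() < p:
        return x[::-1].copy()  # reverse time, keep whisker/channel axes
    return x

# 4 timesteps, 3 whiskers, 2 channels (forces and torques)
seq = np.arange(24.0).reshape(4, 3, 2)
flipped = temporal_flip(seq, p=1.0)
assert np.array_equal(flipped[0], seq[-1])
```

In a contrastive objective, two such independently augmented views of the same sweep would form a positive pair, so the encoder is pushed toward features invariant to playback direction.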
We further replace the recurrent cell with a vanilla RNN and conduct experiments on the following architectures:
- Zhuang+GPT (best task-performant and neural-fit architecture for supervised learning)
- Zhuang+IntersectionRNN+SimCLR (best neural-fit architecture for self-supervised learning)
The best task-performing architecture for self-supervised learning is ResNet+Mamba+SimCLR, which does not include any RNNs; hence, it is not included in these additional experiments. We present the task performance (top-5 classification accuracy) and neural fits of the vanilla RNN variants below.
| model | Zhuang-vanilla-RNN+GPT | Zhuang-vanilla-RNN+SimCLR | supervised best | self-supervised best |
|---|---|---|---|---|
| top-5 acc. | 64.11 | 14.48 | 69.53 | 22.68 |
| neural fit | 0.90 | 0.69 | 0.95 | 0.96 |
We can observe from the table above that the vanilla RNN provides no further improvement in either task performance or neural fit over the existing RNN architectures.
Ecological relevance of ShapeNet for neural alignment
Although ShapeNet is not ecologically relevant for mice, we note that untrained models are worse than trained models at neural predictivity, suggesting that training on this dataset is beneficial. There are no suitable rodent-relevant datasets, so we would have to hand-craft one by selecting a few objects. Arguably, the dataset diversity in ShapeNet should prepare the model for more downstream tasks, while a smaller but rodent-specific dataset may not be large or diverse enough to do well. Finally, in NeuroAI more broadly, it has been shown that dataset diversity is critical for model-brain alignment, not just task performance; this is known as the "Contravariance Principle" [1, 2, 3].
[1] R. Cao and D. Yamins. “Explanatory Models in Neuroscience, Part 1: Taking Mechanistic Abstraction Seriously.” Cognitive Systems Research, vol. 87, Sep. 2024, p. 101244. ScienceDirect, https://doi.org/10.1016/j.cogsys.2024.101244.
[2] D. Yamins and J. DiCarlo. “Using Goal-Driven Deep Learning Models to Understand Sensory Cortex.” Nature Neuroscience, vol. 19, no. 3, Mar. 2016, pp. 356–65. PubMed, https://doi.org/10.1038/nn.4244.
[3] A. Nayebi*, N. C. Kong*, C. Zhuang, J. L. Gardner, A. M. Norcia, and D. L. Yamins. “Mouse Visual Cortex as a Limited Resource System That Self-Learns an Ecologically-General Representation.” PLoS Computational Biology, vol. 19, no. 10, Oct. 2023, p. e1011506. PubMed, https://doi.org/10.1371/journal.pcbi.1011506.
I thank the authors for the rebuttal. They addressed all the concerns that I raised. I am happy to recommend this paper for acceptance and will keep my score.
This paper systematically explores neural network architectures for tactile perception inspired by the rodent whisker system. The authors propose an Encoder-Attender-Decoder framework and train various models using realistic simulated tactile data from virtual whisker interactions with 3D objects. They demonstrate that convolutional recurrent neural networks outperform other architectures on tactile object categorization tasks and exhibit remarkable alignment with neural recordings from rodent somatosensory cortex. The paper reports the first application of model-brain-alignment methods to tactile processing across a broad set of model architectures and both supervised and self-supervised training objectives. It provides a powerful and direct comparison between a large set of models and rodent somatosensory cortical responses, surpassing average inter-animal consistency. It quantitatively identifies nonlinear recurrence as a key inductive bias in rodent tactile processing, guiding future development in embodied tactile AI systems. Because of these many strengths, the paper is deemed of great interest to the NeurIPS community.