PaperHub
7.2 / 10 · Spotlight · 4 reviewers (ratings: 4, 5, 4, 2; min 2, max 5, std dev 1.1)
ICML 2025

Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-08-07
TL;DR

We systematically explored scaling laws for primate vision models and discovered that neural alignment stops improving beyond a certain scale, even though behavior keeps aligning better.

Abstract

Keywords

scaling laws · neural alignment · behavioral alignment · computer vision · primate visual ventral stream

Reviews and Discussion

Official Review (Rating: 4)

This paper explores how scaling model size and dataset size affects the alignment of artificial neural networks with primate visual ventral stream neural responses and behavior. The scaling laws are investigated over diverse models on benchmarks spanning V1, V2, V4, IT, and behavioral data. The authors offer interesting findings regarding neural alignment and behavioral alignment.

Questions for Authors

See Other Strengths And Weaknesses.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes

Experimental Design and Analysis

Yes

Supplementary Material

Yes

Relation to Prior Literature

The scaling laws of large models have been extensively studied and applied. This paper explores a different kind of scaling law, from the perspective of brain alignment. The authors offer an interesting perspective on how mainstream models align with human core object recognition behavior, and on the impact of this biological alignment.

Missing Essential References

No

Other Strengths and Weaknesses

Strengths

  • The paper is well-written, well-organized and easy to follow.
  • The paper provides extensive experiments across various model architectures and well-illustrated visualizations.
  • The authors provide several interesting findings, including (1) scaling is particularly beneficial for higher-level visual areas, and (2) neural alignment saturates in most conditions whereas behavioral alignment continues to improve with scaling.

Weaknesses

  • The practical implications of the authors' conclusions remain unclear. For instance, while the authors highlight that models with strong architectural inductive biases, such as convolution-based ResNets, show better neural alignment trends than transformer-based ViTs, ViTs are still more widely adopted in the community due to their superior overall performance compared to convolutional networks.

  • The authors reveal the scaling laws relating neural alignment and behavioral alignment to data size and model size through a series of impressive experiments. However, what is the relationship between improving neural alignment and enhancing downstream performance? What insights can these scaling laws provide for improving model design?

Other Comments or Suggestions

None

Author Response

Thank you for your thoughtful review — we’re glad you found the paper well-written and organized, appreciated the breadth of our experiments and visualizations, and found the findings insightful. We respond to each of your comments below point-by-point.

The practical implications of the authors' conclusions remain unclear. For instance, while the authors highlight that models with strong architectural inductive biases, such as convolution-based ResNets, show better neural alignment trends than transformer-based ViTs, ViTs are still more widely adopted in the community due to their superior overall performance compared to convolutional networks.

Since our primary goal is modeling brain information processing (i.e., neuroscience), we consider task performance mainly as a proxy rather than the ultimate target. Indeed, our experiments show ViTs excel at behavioral alignment (Figs. 5a, 6b) but underperform relative to convolutional models in neural alignment benchmarks (Figs. 5a, 6a). This also holds for very large models trained on massive datasets (Fig S2, ConvNeXt vs ViT). Practically, this suggests a promising path forward could involve hybrid architectures that combine biological inductive biases (e.g., convolutional or hierarchical connectivity) with transformer-based scalability and flexibility such as a VOneBlock (Dapello et al. 2020) combined with ViTs (Dosovitskiy et al. 2020), thus balancing task performance with improved neural alignment. Specifically, one viable strategy could be distilling representations from high-performing transformer models into architectures with hierarchical inductive biases resembling biological circuitry.
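
For concreteness, a minimal, hypothetical sketch of this hybrid direction is shown below: a frozen Gabor filter bank stands in for a VOneBlock-style V1 front end and feeds a small transformer encoder. This is an illustration of the design idea only, not a trained or validated architecture, and it omits the simple/complex-cell nonlinearities and stochastic neural noise of a full VOneBlock.

```python
# Hypothetical sketch of a VOneBlock-style front end feeding a small transformer.
# NOT the trained model from the paper: a real VOneBlock (Dapello et al., 2020)
# also includes simple/complex-cell nonlinearities and neural noise; a frozen
# Gabor filter bank stands in for the fixed V1-like prior here.
import math
import torch
import torch.nn as nn

def gabor_bank(n_filters=32, size=15):
    """Frozen bank of Gabor filters at varied orientations and frequencies."""
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                            torch.linspace(-1, 1, size), indexing="ij")
    filters = []
    for i, theta in enumerate(torch.linspace(0, math.pi, n_filters)):
        freq = 2.0 + 4.0 * (i % 4)                    # a few spatial frequencies
        rot = xs * math.cos(theta) + ys * math.sin(theta)
        filters.append(torch.exp(-(xs**2 + ys**2) / 0.3)
                       * torch.cos(2 * math.pi * freq * rot))
    return torch.stack(filters).unsqueeze(1)          # (n_filters, 1, size, size)

class VOneViTSketch(nn.Module):
    def __init__(self, n_filters=32, dim=192, depth=4, n_classes=1000):
        super().__init__()
        self.v1 = nn.Conv2d(1, n_filters, 15, stride=4, padding=7, bias=False)
        self.v1.weight.data = gabor_bank(n_filters)
        self.v1.weight.requires_grad = False          # frozen biological prior
        self.patch = nn.Conv2d(n_filters, dim, 8, stride=8)  # patchify V1 maps
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                             # x: (B, 1, 224, 224) grayscale
        tokens = self.patch(self.v1(x)).flatten(2).transpose(1, 2)  # (B, 49, dim)
        return self.head(self.encoder(tokens).mean(dim=1))

print(VOneViTSketch()(torch.randn(2, 1, 224, 224)).shape)  # torch.Size([2, 1000])
```

The point of such a design is to keep a fixed biological prior at the earliest stage while preserving the scalability and flexibility of the transformer stages above it.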

The authors reveal the scaling laws relating neural alignment and behavioral alignment to data size and model size through a series of impressive experiments. However, what is the relationship between improving neural alignment and enhancing downstream performance? What insights can these scaling laws provide for improving model design?

In this study, we primarily focus on building better models of the human brain, rather than developing better ML models for various tasks. In this context, our findings show a clear correlation between behavioral alignment and task performance (Figs. 1, 6b), consistent with the widely accepted notion in the ML community that increased compute leads to better-performing models. In contrast, neural alignment shows positive but diminishing returns with improved object recognition performance (Fig. 6a), indicating that improving neural alignment may not always directly translate into better downstream task performance.

Practically, our scaling laws offer concrete guidance for future model design targeting neural alignment. First, our results strongly suggest prioritizing larger, richer, and more diverse datasets, as they consistently yield greater neural alignment improvements compared to merely scaling model complexity (Fig. 3a). Second, our findings emphasize the crucial role of biologically inspired inductive biases, particularly in scenarios constrained by limited compute or data, as these priors substantially enhance alignment efficiency (Figs. 2, 3b). Lastly, the graded scaling effect across the cortical hierarchy (Fig. 5) indicates that early visual regions (V1/V2) are the most challenging to align through scale alone, suggesting that architectures incorporating stronger biologically-informed priors (e.g., VOneNet-style architectures) may be necessary to achieve further improvements.

Thanks again for your review. We ask the reviewer to consider raising their score in light of this paper's placement in computational neuroscience and its focus on modeling the brain.

Reviewer Comment

The authors' response has addressed all my concerns, so I have decided to raise my rating.

Official Review (Rating: 5)

This paper seeks to measure scaling laws for task-optimized models of the primate visual ventral stream. Several models from multiple families were trained with different amounts of compute and training data. They were then compared on their alignment to different areas of the visual cortex (V1, V2, V4, and IT), as well as on behavioral data from non-human primates. Scaling laws were then fit to relate how neural and behavioral alignment scale with FLOPs and data. The top result is that neural alignment asymptotes with scaling. The paper then breaks down these results in terms of areas, architectures, dataset dependence, etc.
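
For context, a simplified sketch of how such neural alignment scores are typically computed is shown below: model activations are linearly mapped to recorded responses and scored by held-out correlation. The actual Brain-Score pipeline additionally uses PLS regression and noise-ceiling normalization; the arrays here are synthetic placeholders.

```python
# Simplified sketch of a neural-predictivity score. `features` stands in for
# model activations (stimuli x units) and `responses` for recorded firing rates
# (stimuli x neurons); both are synthetic placeholders. The actual Brain-Score
# pipeline uses PLS regression and noise-ceiling correction, omitted here.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
features = rng.normal(size=(640, 512))
responses = features[:, :64] @ rng.normal(size=(64, 32)) + rng.normal(size=(640, 32))

def neural_alignment(features, responses, n_splits=5):
    """Median cross-validated per-neuron correlation on held-out stimuli."""
    per_neuron = []
    for train, test in KFold(n_splits, shuffle=True, random_state=0).split(features):
        pred = Ridge(alpha=1.0).fit(features[train], responses[train]).predict(features[test])
        per_neuron += [np.corrcoef(pred[:, i], responses[test, i])[0, 1]
                       for i in range(responses.shape[1])]
    return float(np.median(per_neuron))

print(f"neural alignment (median r): {neural_alignment(features, responses):.3f}")
```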

Questions for Authors

N/A

Claims and Evidence

This is a very thorough investigation of scaling laws in visual cortex, fitting hundreds of models on a broad range of downstream tasks. The authors' top-line result, that alignment saturates, is surprising (though in line with prior literature); it's very well-supported; and it is a significant conceptual advance for the field.

Methods and Evaluation Criteria

The authors fit a broad range of models: ResNet, AlexNet, EfficientNet, ViTs, ConvNeXt, CORnet. They selected two core datasets, ImageNet and EcoSet, along with alternatives including iNaturalist, infiMNIST, Places365, etc. They also try alternative training setups beyond supervised learning, including SimCLR, DINO, and adversarial fine-tuning. This is a very broad range of models that supports their main conclusions.

Theoretical Claims

N/A

Experimental Design and Analysis

The methods are all standard and well-accepted; where the paper innovates is in the thoroughness of its evaluation. I have no qualms.

Supplementary Material

I briefly perused the supplementary material. There's interesting information here, including the correlation of the private and public benchmarks, the cross-checking against existing pre-trained models, and the evolution of alignment over training. There's a sufficient amount of material here that this short paper could be a journal submission.

Relation to Prior Literature

Much has been written about the fact that Brain-Scores and similar measures are saturated, and that multiple models converge on the same (wrong) representations; examples include Conwell et al. (2024) and Linsley et al. (2023). However, this submission stands out for its completeness, the depth of its analysis, and its focus on primate vision.

Missing Essential References

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their positive feedback and thoughtful comments. We are particularly encouraged that the reviewer highlighted the key strengths that we were indeed most excited about in our paper:

  • Our systematic, extensive evaluation of scaling laws across hundreds of models and diverse benchmarks, providing new insight into the role of architectural bias and training choices, and a recipe for optimal compute allocation.
  • Thorough comparisons across various architectures (ResNet, EfficientNet, ViT, ConvNeXt, CORnet), multiple datasets (ImageNet, EcoSet, iNaturalist, infiMNIST, Places365), and alternative training methods (SimCLR, DINO, adversarial training).
  • Clear evidence showing neural alignment saturates with increased model and dataset scaling, providing a significant conceptual step forward to encourage the exploration of new modeling ideas.
  • The depth and rigor of our analyses, including validation with held-out benchmarks and alignment dynamics during training, showing for instance that under controlled model training, task performance and brain alignment remain weakly correlated.
  • Comprehensive results specifically targeted at the primate visual cortex, offering deeper insights beyond prior work, such as the ordered effect of scale on alignment and the role of inductive biases.

We greatly appreciate the reviewer's supportive assessment and enthusiasm for our study.

Official Review (Rating: 4)

In the paper "Scaling Laws for Task-Optimized Models of the Primate Visual Ventral Stream", the authors investigate scaling laws for alignment of machine learning models to the primate visual processing. They assess neural alignment as well as behavioral alignment, and find that there is a saturation point in neural alignment, but not in behavioral alignment. They further evaluate the properties of models with strong inductive bias versus weak inductive bias, and find that models with stronger inductive bias are more sample- and compute-efficient. Further analyses include the individual assessment of areas along the visual hierarchy and the comparison of different training strategies.

Questions for Authors

In line 313, right column, you speculate that factors other than task performance influence neural alignment. Do you have any ideas about what these factors might be?

Claims and Evidence

The authors claim that there is a saturation point in neural alignment, but not in behavioral alignment. They also claim that increasing both parameter count and training dataset size improves alignment, with data providing more gains over model scaling. Architectures with stronger inductive bias and datasets with higher-quality images are more sample- and compute-efficient. Finally, they claim that model alignment with higher-level brain regions and especially behavior benefits the most from scaling. They support these claims with a wide range of experiments and fits of scaling laws to the data. Given the range of models and analyses that they carried out, the evidence for their claims looks solid to me.

Methods and Evaluation Criteria

The authors use benchmarks from Brain-Score, which includes comparisons of model outputs to recordings from the primate visual ventral system, as well as a comparison of model predictions with behavioral data. They use a range of model architectures, including AlexNet, ResNets, CORnet-S, EfficientNets, Vision Transformers, and ConvNeXt, and fit various functional forms of the scaling laws to account for inductive biases in the model architectures and their data requirements. Models were assessed on a range of image datasets, including ImageNet and EcoSet, as well as a range of other image datasets. They also assessed alternative training strategies, including SimCLR, adversarial training, supervised learning, and self-supervised learning. Bootstrapping was used to quantify uncertainty. The selection of models includes state-of-the-art models and a wide range of architectures, which is a strength of the paper. The authors also provide a detailed analysis of the training strategies, which is important for understanding the results. The evaluation criteria are well-defined and the methods are appropriate for the research questions.
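
As a concrete illustration of the bootstrap step mentioned above, a minimal sketch follows; the resampling unit used here (a flat array of per-site scores) is an assumption rather than necessarily the paper's exact procedure.

```python
# Minimal percentile-bootstrap sketch for an alignment score's uncertainty.
# `scores` is a synthetic placeholder; the paper's actual resampling unit
# (sites, stimuli, or models) may differ.
import numpy as np

rng = np.random.default_rng(42)
scores = rng.normal(loc=0.45, scale=0.08, size=100)   # stand-in alignment values

def bootstrap_ci(values, n_boot=10_000, ci=95):
    """Percentile bootstrap confidence interval for the mean."""
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    lo, hi = np.percentile(means, [(100 - ci) / 2, 100 - (100 - ci) / 2])
    return values.mean(), lo, hi

mean, lo, hi = bootstrap_ci(scores)
print(f"alignment = {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```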

Theoretical Claims

The paper does not make theoretical claims.

Experimental Design and Analysis

There are experiments assessing model size and dataset training size, as well as the effect of inductive bias on alignment. The authors also assess the alignment of models with different brain areas and behavior. Then there are experiments using alternative training strategies. The breadth of the experiments is a strength of the paper, and the authors provide a detailed description of the methods used.

Supplementary Material

The supplementary material includes brief implementation details, descriptions of the image datasets, and additional results. There is a validation on private data, a pretraining analysis, an assessment of training evolution, and additional results on the effects of the training strategy. These are all relevant to the main text and provide additional insights.

Relation to Prior Literature

The functional form of the misalignment follows Hoffmann et al., 2022, a study of scaling in large language models. There are changes to the function for optimal allocation of compute, to account for empirical observations. The benchmarking itself follows Schrimpf et al., 2018. Linsley et al., 2023 showed that large DNNs rely on different visual features than those encoded by area IT. In light of these previous works, it might be expected that scaling up models and data would be insufficient to achieve better alignment with neural data. Therefore, the main findings of the paper are not all that surprising, and I find the novelty of the paper to be limited. The findings are empirical in nature and do not provide a deeper understanding of the reasons behind the observed scaling laws, other than the inductive biases of the models.
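
For readers unfamiliar with that functional form, the hedged sketch below fits a saturating Chinchilla-style law to synthetic alignment data; the parametrization follows Hoffmann et al. (2022) as cited, while the data and initial values are purely illustrative.

```python
# Hedged sketch: fitting a Chinchilla-style law to misalignment scores,
#   misalignment(N, D) ~ E + A / N**alpha + B / D**beta   (Hoffmann et al., 2022)
# with N = parameter count, D = training samples. Data are synthetic; the
# paper fits its own measured alignment values.
import numpy as np
from scipy.optimize import curve_fit

def misalignment(ND, E, A, B, alpha, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

rng = np.random.default_rng(1)
N = 10 ** rng.uniform(6, 9, size=200)      # 1M .. 1B parameters
D = 10 ** rng.uniform(4, 7, size=200)      # 10K .. 10M samples
y = misalignment((N, D), 0.35, 2.0, 5.0, 0.30, 0.28) + rng.normal(0, 0.01, 200)

popt, _ = curve_fit(misalignment, (N, D), y, p0=[0.3, 1.0, 1.0, 0.3, 0.3],
                    bounds=(0, np.inf), maxfev=20_000)
E, A, B, alpha, beta = popt
print(f"E={E:.3f}  alpha={alpha:.3f}  beta={beta:.3f}")
# E is the asymptote: the alignment gap that no amount of scaling removes.
```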

Missing Essential References

To the best of my knowledge, the paper discusses all essential references.

Other Strengths and Weaknesses

The paper is well-written with good structure, sufficient details for making the work reproducible and a clear presentation of the results.

Other Comments or Suggestions

Figure panel 3b is not described in the text. It would be helpful to include a brief description of the figure in the main text.

Author Response

We appreciate your positive review and are glad you found the paper well-written, the results clear, and the evidence supporting our claims solid. Below, we address your questions point by point.

In light of these previous works, it might be expected that scaling up models and data would be insufficient to achieve better alignment with neural data. Therefore, the main findings of the paper are not all that surprising, and I find the novelty of the paper to be limited. The findings are empirical in nature and do not provide a deeper understanding of the reasons behind the observed scaling laws, other than the inductive biases of the models.

Our work substantially advances the previous literature by providing a more systematic, controlled, and comprehensive analysis that isolates how individual factors (model parameters, dataset size, and compute resources) each influence alignment with multiple brain regions and behavior. Earlier studies, by contrast, used heterogeneously pre-trained models and compared them to limited brain datasets.

  • Generalization of brain regions: Prior works focused narrowly (e.g., only IT in Linsley et al., 2023) or used noisy fMRI signals (Conwell et al., 2022). We evaluate alignment across the full primate ventral stream (V1–IT) using high-resolution intracortical recordings, along with object recognition behavior—both novel in this context.

  • Model comparability: Previous studies used pretrained models with inconsistent training recipes, making comparisons hard. Our models are all trained from scratch under uniform conditions, enabling a fair, apples-to-apples comparison across architectures and datasets.

  • Our comprehensive study led to updated findings: We show that task performance does correlate with neural alignment (Fig. 6a), but with diminishing returns—contrary to prior conclusions. Moreover, our work quantifies the individual effects of data, parameters, and compute (Figs. 2–4), and provides a parametric framework for understanding scaling (Figs. 4–7), going well beyond simple correlations in earlier work.

  • Finally, we contribute new insights such as a graded scaling effect across the visual hierarchy and a clear behavioral/neural alignment dissociation (Fig. 5), offering deeper understanding of how brain-like representations emerge. We’ll emphasize these distinctions more clearly in the revised manuscript and welcome suggestions for where to further highlight them.

Figure panel 3b is not described in the text.

We thank the reviewer for pointing out this oversight. Figure 3b illustrates that model families with weaker inductive biases (e.g., ViT and ConvNeXt) begin training with lower neural alignment scores and consequently require larger training datasets to achieve alignment comparable to architectures with stronger inductive biases (e.g., ResNet, EfficientNet). Additionally, we find that recurrence, as implemented in CORnet architectures, provides a significant initial advantage in alignment relative to purely convolutional models, particularly in low-data regimes. However, this advantage diminishes with increased training data. Overall, these findings emphasize that strong inductive biases—such as convolution and recurrence—facilitate better alignment when data is limited, whereas extensive task-driven optimization eventually mitigates differences across architectures. We will integrate this explanation into the main text of the revised manuscript.

In line 313, right column, you speculate that factors other than task performance influence neural alignment. Do you have any ideas about what these factors might be?

Our findings suggest that architectural inductive biases significantly influence neural alignment independently from task performance. Additionally, as illustrated in Figure 5b, higher cortical regions and behavioral alignment benefit more from task optimization compared to early visual areas. Moreover, supplementary analysis (Figure S7) shows that correlations between alignment and task performance also follow the cortical hierarchy. These observations imply that incorporating stronger biologically-inspired priors (e.g., convolutional constraints, local receptive fields, recurrence) specifically for modeling early visual regions, combined with more flexible, data-driven layers for higher cortical areas—akin to the design philosophy of architectures like VOneNet—may yield improved neural alignment. Furthermore, explicitly incorporating neural data into training procedures (via co-training or fine-tuning, see our response to the reviewer weze) represents a promising additional strategy to surpass the current limitations of purely task-optimized models.

Thanks again for your review. In light of the systematic advances over prior work—such as our controlled experimental setup, broader neural benchmarks, and novel findings on graded scaling and behavioral dissociation—we kindly ask you to consider raising your score.

Reviewer Comment

I thank the authors for their detailed and structured rebuttal. The authors put forward good points regarding the novelty of their work and I encourage them to highlight these better in the final version of the manuscript if it gets accepted. I will raise my score to 4.

Official Review (Rating: 2)

The paper introduces scaling laws for task-optimized models in fitting neural recordings. These laws stem from the observation that neural networks trained on classification tasks have emerged as the most effective models for decoding neural activity in the brain. The study then evaluates a measure of alignment across neural and behavioral benchmarks using publicly available datasets from Brain-Score. Finally, the authors draw conclusions about how dataset size and architecture type influence behavioral responses, shedding light on their roles in neural decoding performance.

Questions for Authors

  • How can the scaling laws inform the future selection of architectures and data to improve our understanding of the visual cortex?
  • How can biological inductive bias be quantitatively described, so that the claim "Architectural Bias influences alignment in behavior" is easier to understand?

Claims and Evidence

While the claim that 'architectures with stronger inductive biases are more sample- and compute-efficient' is compelling, the evidence provided is limited to convolutional neural networks (CNNs). Other architectures with well-established inductive biases, such as CORnet or recurrent neural networks (RNNs), are notably absent from the analysis. Including these architectures would strengthen the validity of the claim, as their exclusion leaves critical gaps in testing the generalizability of the hypothesis.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I checked the derivation of the scaling laws; it seems correct.

Experimental Design and Analysis

I checked the experimental design for:

  • Total FLOPs vs. alignment score.
  • Parameter size vs. alignment score.
  • Number of samples vs. alignment score.
  • Learning dynamics and architecture type.
  • Adversarial fine-tuning and its benefits to neural fit.

Supplementary Material

I reviewed all the supplemental material. I checked the details on the datasets, the validation on private data, the models pretrained via publicly available repos such as torchvision, the training evolution, the performance of untrained models, the effect of the training objective, and the impact of SimCLR training across regions.

Relation to Prior Literature

The paper offers a wide range of experiments that validate the scaling laws; however, the results described seem to resemble findings that were reported before. For instance, the effect of model size on neural alignment has been previously reported (Linsley et al., 2023; Conwell et al., 2022).

Missing Essential References

The citations cover most of the literature in the field.

Other Strengths and Weaknesses

Strengths:

The paper offers a good view of the progress in the field in terms of neural and behavioral benchmarks. It offers a good overview, covering a very long list of models.

Weakness:

The main limitation I find is that the paper gives little indication of how to move forward and of how the scaling laws can inform architecture or data-diet development.

Other Comments or Suggestions

No.

Author Response

Thank you for your review. We're glad you found the paper informative and comprehensive in its overview of models and progress on neural and behavioral benchmarks. We respond to each of your comments below.

Claims and Evidence (limited model set):

The reviewer might have missed this in the paper, but a core focus of our work is to evaluate a variety of architectures (Figs 2, 3, S2) from standard CNNs (ResNet, AlexNet, EfficientNet) to recent ones (ConvNeXt), Vision Transformers (ViT, DaVit, FastViT, LeViT, MaxxViT, MobileViT) and CORnet. The recurrent architecture CORnet is notably present in all our analyses (lines 133-145 left column; Figs 3, 4, 6) but we do not observe a difference in its alignment at larger scales. Interestingly in Fig 3b, CORnet initially exhibits improved alignment in low-data conditions, indicating that recurrence might provide a meaningful inductive bias for efficiency under limited training samples. However, as the amount of training data increases, the advantage offered by recurrence diminishes, suggesting that deeper feedforward models can approximate recurrent processing by effectively "unrolling" recurrent computations through additional layers. Thus, recurrence remains beneficial primarily in data-constrained scenarios, providing sample efficiency advantages, but larger datasets reduce its relative advantage. Inspired by the reviewer’s question, we will highlight this point more prominently in the updated paper.

The paper offers a wide range of experiments that validate the scaling laws; however, the results described seem to resemble findings that were reported before.

Please see our first response to the reviewer 6Eek.

Weaknesses

Please see the answer below for your first question.

How can the scaling laws inform the future selection of architectures and data to improve our understanding of the visual cortex?

Our scaling laws explicitly guide future architecture and dataset development across varying computational budgets. Specifically, our results suggest that architectures with biologically-inspired inductive biases—like convolution or recurrence—achieve superior alignment more efficiently, especially under limited compute (Figs 2, 3b, 4b). Furthermore, our findings strongly advocate for developing richer and more ecologically valid datasets, which consistently yield greater alignment improvements compared to increasing model complexity alone (Fig 3a, Sec 3.3). Additionally, the graded effect of scaling observed across the cortical hierarchy (Fig. 5) suggests that early visual areas (V1/V2) benefit relatively little from scaling alone, highlighting the need for stronger biologically-informed inductive biases early in the processing pathway, e.g. VOneNet. Finally, we hypothesize that to push neural encoding models beyond the current alignment plateau, explicitly incorporating neural recordings into model training may be necessary. Our preliminary results on fine-tuning models on IT neural data (Papale et al., 2024) are promising: https://imgur.com/a/okfmRu8 (anonymous link for figures). Fine-tuning on a single region from another dataset improves neural alignment, with a stronger effect for a weakly-biased model (ViT-Small) than for a CNN (ResNet50). We will clearly discuss these directions and practical implications in our revised manuscript.
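
To make the co-training idea concrete, a minimal, hypothetical sketch of jointly optimizing a classification loss and a neural-regression loss on recorded IT responses is given below; the backbone choice, readout, loss weight, and data shapes are illustrative assumptions rather than the exact fine-tuning recipe used in the preliminary results.

```python
# Hedged sketch: co-training a backbone on classification plus a regression
# loss onto recorded IT responses. The ResNet-50 backbone, linear readout,
# loss weight, and tensor shapes are illustrative assumptions, not the recipe.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)                 # random init for the sketch
neural_readout = nn.Linear(backbone.fc.in_features, 128)  # predict 128 IT sites

def penultimate(x):
    """ResNet-50 features just before the classification head."""
    x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
    x = backbone.layer4(backbone.layer3(backbone.layer2(backbone.layer1(x))))
    return torch.flatten(backbone.avgpool(x), 1)

opt = torch.optim.Adam(list(backbone.parameters())
                       + list(neural_readout.parameters()), lr=1e-4)
images = torch.randn(8, 3, 224, 224)              # placeholder image batch
labels = torch.randint(0, 1000, (8,))             # placeholder class labels
it_responses = torch.randn(8, 128)                # placeholder recorded rates

feats = penultimate(images)
loss = (nn.functional.cross_entropy(backbone.fc(feats), labels)
        + 1.0 * nn.functional.mse_loss(neural_readout(feats), it_responses))
opt.zero_grad()
loss.backward()
opt.step()
print(f"joint loss: {loss.item():.3f}")
```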

How can biological inductive bias be quantitatively described, so that the claim "Architectural Bias influences alignment in behavior" is easier to understand?

We note that the exact subtitle ("Architectural Inductive Bias Influences Alignment and Scaling Behavior") was not specifically about behavioral alignment but rather about scaling properties in general, i.e. the “behavior of scaling”. We see how the phrasing can be misleading and will revise it for clarity.

Our results show that biological inductive biases mainly affect the scaling properties of neural alignment, especially in low-data regimes. Specifically, Figure 5b illustrates clearly that behavioral alignment benefits significantly more from task optimization compared to neural alignment. Further, Figure 5a demonstrates that for behavioral alignment, models with strong versus weak inductive biases closely follow the same scaling trajectory, while significant differences remain for neural benchmarks. Indeed, after extensive training, models with weaker inductive biases (e.g., ViTs) even achieve slightly higher absolute behavioral alignment scores than models with stronger inductive biases.

Quantitatively, these biases shift neural alignment scaling more than behavior. Exploring other priors—like local connectivity or predictive coding—is an exciting direction.

Thank you again for your review. We kindly ask the reviewer to consider raising their score and take into account the results that might have been missed in the first pass, such as the diversity of architectures, the novelty of our findings, and the impact on future model building.

Final Decision

The paper systematically investigates how behavioral alignment (= task performance) and neural alignment (= representational similarity with the brain) scale with model size across a set of visual recognition models. The authors conclude that while behavioral alignment keeps improving, neural alignment has saturated. While this observation is not entirely new, the current paper presents a thorough, in-depth analysis of the problem that goes substantially beyond prior work in its breadth and level of systematic investigation. As a result, I believe this is a significant and important contribution that identifies pure scaling as a dead end and implies that other approaches (e.g., direct fits to neural data, more diverse behavioral tasks) are necessary. I strongly recommend acceptance.