PaperHub
Score: 7.3/10
Decision: Rejected · 4 reviewers
Ratings: 5, 3, 5, 5 (min 3, max 5, std 0.9)
Confidence: 3.8
Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Instruction-Tuned Video-Audio Models Elucidate Functional Specialization in Brain

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29

Abstract

Keywords
brain encoding, fMRI, multimodal instruction-tuned models, Video LLMs, Audio LLMs, multi-modal stimuli, Transformers, interpretability

Reviews and Discussion

Official Review
Rating: 5

The paper investigates the representational similarity of instruction-tuned multimodal language models with the human brain, and shows that instruction-tuned models are more aligned than non-instruction-tuned or unimodal models. The authors also conduct analyses showing that instructions for different tasks elicit representations that best align with different brain regions.

Strengths and Weaknesses

Strengths:

  • Nice motivation for studying instruction-tuned models, i.e., they can be applied to a wide variety of tasks and even generalize to new tasks, much like humans, when given a new instruction prompt, as compared to models fine-tuned for a single task.
  • The paper conducted many detailed analyses to study why these models have similar representations to human brains. In particular, it was very interesting to see that task-specific instructions generated task-specific representations that preferentially align with different brain regions, showing that instruction-tuned models modulate their responses based on the cognitive task demands.

Weaknesses:

  • For one of the results, that instruction-tuning improves neural predictivity, it would be a fairer control if they selected instruction-tuned models that could be compared against their pretrained, non-instruction-tuned counterparts. This would make a stronger claim about the role of instruction-tuning on brain alignment. Also, a further control would be to disentangle the roles of learning to follow instructions from the data used for instruction tuning.

Questions

None

Limitations

Yes

Justification for Final Rating

Weakness 1 -- partially addressed. I acknowledge the comparison of a pretrained vs. instruction-tuned counterpart, though I think it would be more rigorous to have more than one such comparison.

Formatting Concerns

None

Author Response

We thank the reviewer for their positive, insightful, and valuable comments and suggestions, which are crucial for further strengthening our manuscript.

Q1. selected instruction-tuned models that could be compared against their pretrained, non-instruction-tuned counterparts

Thank you for this insightful suggestion.

  • Based on the reviewer's suggestion, we conducted two additional experiments: (i) VILA-7B: a pretrained vision-language model with few-shot prompting capabilities, (ii) Qwen-2.5-Omni-7B, a pretrained-only counterpart to the instruction-tuned Qwen-2.5-VL-7B-Instruct model.
    • For both VILA and Qwen-2.5-Omni, we report alignment scores across whole brain, language, visual, and auditory regions, alongside all previously evaluated models.
| Model | Whole Brain | Language | Visual | Auditory |
| --- | --- | --- | --- | --- |
| InstructBLIP | 0.669 ± 0.028 | 0.693 ± 0.009 | 0.713 ± 0.015 | 0.664 ± 0.027 |
| VideoLLaVA | 0.650 ± 0.022 | 0.688 ± 0.025 | 0.696 ± 0.014 | 0.649 ± 0.027 |
| LLaVA-Next | 0.678 ± 0.018 | 0.730 ± 0.020 | 0.751 ± 0.007 | 0.751 ± 0.044 |
| Qwen-2.5-VL | 0.746 ± 0.024 | 0.759 ± 0.012 | 0.744 ± 0.026 | 0.791 ± 0.021 |
| LLaVA-OneVision | 0.666 ± 0.017 | 0.694 ± 0.003 | 0.709 ± 0.017 | 0.653 ± 0.048 |
| VideoChat | 0.679 ± 0.013 | 0.676 ± 0.008 | 0.688 ± 0.020 | 0.672 ± 0.056 |
| QwenAudio | 0.473 ± 0.012 | 0.523 ± 0.012 | 0.489 ± 0.016 | 0.526 ± 0.046 |
| Kimi | 0.523 ± 0.012 | 0.563 ± 0.012 | 0.529 ± 0.016 | 0.566 ± 0.046 |
| TVLT | 0.584 ± 0.006 | 0.571 ± 0.006 | 0.584 ± 0.005 | 0.547 ± 0.006 |
| VideoMAE | 0.541 ± 0.023 | 0.533 ± 0.006 | 0.571 ± 0.007 | 0.530 ± 0.006 |
| AST | 0.477 ± 0.018 | 0.529 ± 0.006 | 0.529 ± 0.006 | 0.472 ± 0.004 |
| Qwen-2.5-Omni | 0.601 ± 0.011 | 0.615 ± 0.027 | 0.629 ± 0.027 | 0.629 ± 0.027 |
| VILA | 0.628 ± 0.023 | 0.647 ± 0.012 | 0.610 ± 0.055 | 0.610 ± 0.055 |

We make the following observations based on the updated results in the Table:

  • Both VILA and Qwen-2.5-Omni, which are pretrained multimodal models without instruction tuning, perform significantly better than TVLT and other unimodal baselines across all brain regions, including whole brain, language, visual, and auditory.
  • However, instruction-tuned video MLLMs continue to outperform these pretrained models, demonstrating higher alignment across all regions.
  • These findings support the conclusion that instruction tuning provides additional gains beyond few-shot prompting in pretrained multimodal models.

In response to the reviewer’s suggestion, we will include the updated results from pretrained models with in-context learning capabilities, along with their corresponding brain activation maps, in the final version.

Comment

Thanks for the new experimental results. I acknowledge the comparison of pretrained vs instruction-tuned counterpart, though I think it would be more rigorous to have more than 1 comparison. Nevertheless, this result would be helpful to include to improve the paper.

Comment

We appreciate the reviewer's positive feedback and are confident that it has enhanced the paper's quality.

Official Review
Rating: 3

This paper investigates the alignment between brain activity and instruction-tuned multimodal large language models (MLLMs) during naturalistic movie viewing (video and audio). While previous studies have shown that MLLMs outperform unimodal models in brain alignment, most have focused on unimodal settings or non-instruction-tuned models. To address this gap, the paper evaluates brain predictivity using instruction-specific embeddings from six video and two audio instruction-tuned MLLMs across 13 task-specific video instructions. Experimental results show that instruction-tuned video MLLMs significantly outperform non-instruction-tuned multimodal (by ~15%) and unimodal models (by ~20%) in predicting neural activity. Moreover, the study demonstrates that task-specific instructions lead to disentangled, modality-specific brain representations and that MLLM layers exhibit hierarchical alignment with the brain—from early sensory regions to higher-order cognitive areas. These findings highlight the importance of instruction tuning for enhancing brain-model alignment and offer new insights into joint multimodal information processing in the brain and AI systems.

Strengths and Weaknesses

Strengths:

  • The paper investigates how task-specific instructions influence the degree of brain alignment in instruction-tuned video and audio MLLMs. The analysis includes a range of popular instruction-tuned video/audio MLLMs, multiple task settings evaluated using Pearson Correlation (PC), and region-specific analyses covering language, visual, and auditory areas.
  • The paper presents several interesting observations regarding the alignment between MLLM representations and brain activity, both at the whole-brain level and within specific functional regions (language, visual, and auditory), offering valuable insights into the hierarchical and modality-specific nature of this alignment.

Weaknesses:

  • The current evaluation relies solely on Pearson Correlation (PC). For a more rigorous assessment, it would be helpful to include additional metrics—such as rank-based correlation scores (e.g., Spearman’s or Kendall’s tau) and value-based metrics like Mean Squared Error (MSE)—to better capture the fidelity and robustness of brain-model alignment.
  • The paper would benefit from a more detailed comparison and discussion with existing MLLM-based brain encoding and decoding methods. This would help contextualize the proposed approach and clarify its novelty and contributions relative to prior work.
  • Figure 3 presents results where each voxel is color-coded based on the instruction that yields the highest normalized brain alignment. However, the visualization is difficult to interpret, and the resulting task-specific brain maps do not appear to reveal clear or meaningful spatial patterns.
  • The MLLMs/LLMs are the core component of the method design, yet the checklist claims they were used only for grammar checking.

Questions

Please refer to items in Weaknesses.

Limitations

yes.

Justification for Final Rating

I have read the reviews and the author’s response. After discussion with the authors during the rebuttal, I remain unconvinced by certain specific points in the clarification and will therefore keep my score as a borderline reject. I would like to clarify that my expertise is not specifically in brain encoding.

Formatting Concerns

no.

Author Response

We thank the reviewer for their positive, insightful, and valuable comments and suggestions, which are crucial for further strengthening our manuscript.

Q1. The evaluation relies solely on Pearson Correlation (PC). Could you include additional metrics (e.g., Spearman’s, MSE) to better assess brain-model alignment?

Thank you for this question.

  • We would like to clarify that Pearson Correlation (PC) is the standard evaluation metric widely used in brain encoding studies (e.g., Schrimpf et al., 2021; Wehbe et al., 2014; Huth et al., 2016; Toneva et al., 2019; Antonello et al., 2021; Tuckute et al., 2024; Deniz et al., 2019; Oota et al., 2022, Caucheteux et al. 2022, Goldstein et al. 2022).

  • We further quantify these model predictions by measuring normalized brain alignment, where the resulting model prediction correlations are divided by the estimated cross-subject prediction accuracy, as established in prior work [Schrimpf et al., 2021; Oota et al., 2024; Alkhamissi et al., 2024; Oota et al., 2025].

  • In response to the reviewer’s suggestion, we additionally computed Spearman correlation for one representative model and subject.

    • Across the whole brain, we observed a mean correlation of 0.144 using Pearson and 0.146 using Spearman, indicating that both metrics yield similar voxelwise alignment performance.
    • These results suggest that model-brain alignment is robust across both linear (Pearson) and rank-based (Spearman) correlation measures, supporting the use of Pearson as a reliable evaluation standard in this context (a minimal sketch of these metrics follows below).
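For concreteness, the following is a minimal Python sketch of the voxelwise evaluation described above: Pearson and Spearman correlations between predicted and measured responses, plus a normalized alignment obtained by dividing by an estimated cross-subject ceiling. The array shapes, synthetic data, and ceiling values are placeholder assumptions, not the paper's actual data or pipeline.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)

# Placeholder arrays: (timepoints, voxels) predicted and measured responses.
T, V = 300, 100
y_pred = rng.standard_normal((T, V))
y_true = 0.3 * y_pred + rng.standard_normal((T, V))   # synthetic stand-in for fMRI data

# Voxelwise linear (Pearson) and rank-based (Spearman) correlations.
pearson = np.array([pearsonr(y_pred[:, v], y_true[:, v])[0] for v in range(V)])
spearman = np.array([spearmanr(y_pred[:, v], y_true[:, v])[0] for v in range(V)])

# Normalized brain alignment: divide raw correlations by an estimated
# cross-subject prediction ceiling per voxel (placeholder value here).
ceiling = np.full(V, 0.6)
normalized_alignment = pearson / ceiling

print(f"mean Pearson:               {pearson.mean():.3f}")
print(f"mean Spearman:              {spearman.mean():.3f}")
print(f"mean normalized alignment:  {normalized_alignment.mean():.3f}")
```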

[Schrimpf et al., 2021] The neural architecture of language: Integrative modeling converges on predictive processing. PNAS, 2021.

[Alkhamissi et al., 2024] Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network. NAACL, 2025.

[Oota et al., 2024] Speech language models lack important brain relevant semantics. ACL, 2024.

[Oota et al., 2025] Multi-modal brain encoding for multi-modal stimuli. ICLR, 2025.

[Wehbe et al., 2014] Simultaneously uncovering the patterns of brain regions involved in different story reading subprocesses. PLoS One, 2014.

[Huth et al., 2016] Natural speech reveals the semantic maps that tile human cerebral cortex. Nature, 2016.

[Toneva & Wehbe, 2019] Interpreting and improving natural-language processing (in machines) with natural language processing (in the brain). NeurIPS, 2019.

[Goldstein et al., 2022] Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 2022.

[Antonello et al., 2021] Low-Dimensional Structure in the Space of Language Representations is Reflected in Brain Responses. NeurIPS, 2021.

[Caucheteux et al., 2022] Brains and algorithms partially converge in natural language processing. Communications Biology, 2022.

[Tuckute et al., 2024] Driving and suppressing the human language network using large language models. Nature Human Behaviour, 2024.

[Deniz et al., 2019] The representation of semantic information across human cerebral cortex during listening versus reading is invariant to stimulus modality.

Q2. The paper would benefit from a more detailed comparison with existing MLLM-based brain encoding and decoding methods to better clarify its novelty.

Thank you for this question.

  • As indicated in Table 4 (Appendix), researchers have recently investigated brain alignment using multimodal models (Dong & Toneva, 2023; Oota et al., 2025; Subramanian et al., 2024; Nakagi et al., 2024) and MLLMs with unimodal stimuli (Oota et al., 2025). We provide a comprehensive overview of these works in multimodal evaluation settings in Table 4.

  • Building on this literature, our study introduces and systematically evaluates instruction-tuned video and audio MLLMs, which had not been previously examined in brain alignment tasks.

  • To our knowledge, this is the first study to:

    • Apply instruction-tuned video and audio MLLMs to model multimodal naturalistic stimuli,
    • Perform task-specific decomposition across brain regions to understand representational specialization,
    • Compare instruction-tuned, unimodal, and non-instruction-tuned multimodal baselines within a unified encoding framework.
  • While we acknowledge the growing interest in brain decoding using MLLMs (e.g., reconstructing images or captions from brain data), our focus is strictly on brain encoding, i.e., predicting brain activity from representations of external stimuli. As such, comparisons with decoding studies fall outside the scope of our contributions.

[Dong & Toneva, 2023] Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain. ICLR Workshop, 2023.

Q3. Figure 3 visualization is difficult to interpret, and the task-specific brain maps do not appear to reveal meaningful spatial patterns.

Thank you for this question.

  • We would like to clarify that the goal of Figure 3 is to support our central claim that no single task representation best predicts the entire brain. Instead, we observe spatial dissociation across tasks, which suggests that instruction-tuned video and audio MLLMs learn task-specific representations that align with functionally distinct brain regions.

For example, we find that:

  • Sound event detection representations strongly align with the auditory cortex and temporal lobe, and
  • Audio captioning representations are more predictive of language-associated regions and frontal cortex.

This task-level separation indicates that instruction-tuned MLLMs disentangle modality- and function-specific signals, producing semantically distinct embeddings aligned with neural specialization. In addition, a detailed explanation of Figure 3 is provided in Section 4.2, and we will revise the figure to improve clarity.

Q4. The checklist says LLMs were only used for grammar checking, but the method relies heavily on MLLMs/LLMs. Why?

We thank the reviewer for pointing this out and agree that our initial checklist response was unclear.

  • In this study, we use instruction-tuned MLLMs, multimodal video-audio models, and unimodal video and audio models as frozen feature extractors to obtain stimulus representations for our brain encoding experiments (a minimal sketch of this frozen feature-extraction setup follows below).
  • These models were not developed or fine-tuned as part of our contribution but were used off-the-shelf. Beyond this methodological usage, LLMs were used only for grammar checking during manuscript preparation.
  • In the final version, we will revise the checklist to accurately state that LLMs were used for feature extraction in the core method and for grammar assistance during manuscript preparation.
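As a minimal sketch of the frozen feature-extraction setup described above, the snippet below uses a generic Hugging Face instruction-tuned checkpoint and a text-only instruction as a stand-in; the actual MLLMs, prompts, and video/audio preprocessing in the paper differ.

```python
# Hedged sketch: frozen feature extraction with an instruction prompt.
# "Qwen/Qwen2.5-7B-Instruct" is a placeholder checkpoint, not one of the exact
# video/audio MLLMs evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()  # frozen feature extractor: no fine-tuning, no gradient updates

# Example task instruction (hypothetical; the paper uses 13 task-specific instructions).
instruction = "Describe the main events happening in this movie clip."

with torch.no_grad():
    inputs = tokenizer(instruction, return_tensors="pt")
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors of shape (batch, seq_len, hidden_dim).
# Mean-pool over tokens to get one feature vector per layer for the encoding models.
layer_features = [h.mean(dim=1).squeeze(0) for h in outputs.hidden_states]
print(len(layer_features), layer_features[0].shape)
```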
Comment

Thanks for the responses. The authors stated that "Pearson Correlation (PC) is the standard evaluation metric widely used in brain encoding studies." While this is common practice, it does not necessarily mean it is the correct or only metric. The extensive experiments do present meaningful observations, but they clearly require further explanation. For example, the color-coded voxel visualization does not effectively convey readable information. Additionally, previous efforts on connecting and aligning brain data with MLLMs should be discussed in the related work section on brain encoding and decoding. I am not fully convinced, so I am keeping my score unchanged. However, I want to clarify that I do not work specifically in brain encoding.

Comment

We thank the reviewer for their feedback.

  • We acknowledge that in brain encoding studies, commonly used evaluation metrics include Pearson correlation, 2-versus-2 (2V2) accuracy, and R² score. For brain decoding models, typical metrics include pairwise accuracy, rank accuracy, R² score, and mean squared error.

  • Unlike brain encoding or decoding models that rely on linear regression, CKA and RSA are nonparametric similarity metrics that do not require additional training, making them complementary tools for model-brain alignment evaluation.

  • In this study, we primarily use Pearson correlation and also quantify model predictions by measuring normalized brain alignment, where the resulting model prediction correlations are divided by the estimated cross-subject prediction accuracy, as established in prior work.

  • Regarding related work, we have already provided a detailed discussion in Appendix Sections A and B.

  • We sincerely appreciate the reviewer’s time and effort in evaluating our work, especially given their noted background in brain encoding studies. We are confident that the feedback provided will help further improve the clarity and impact of the paper.

Official Review
Rating: 5

This paper performs comparisons of many different multimodal and unimodal neural network models with human neural data, using an encoding model framework. They use data from a movie-watching task (Neuromod) and fit single-voxel encoding models. They show that instruction-tuned models outperform several baselines at predicting voxel activation. They also compare models that are probed using different task instructions, and find that representations elicited by different task instructions align with different brain areas.

Strengths and Weaknesses

  • Strength: This approach of using instruction-tuned models to predict human multimodal movie-watching data seems to be novel. The results demonstrate that the instruction-tuned models outperform many baselines.

  • Strength: The comparisons across different task-specific instructions are very interesting, and this is a novel usage of instruction-tuned models. This has the potential to yield new neuroscientific insights.

  • Weakness: The baseline choices need some more justification. From Table 1, it looks like the instruction-tuned video models are mostly deeper (more layers) than the other models. Can their better performance be attributed to depth/architecture differences? Also, what datasets are the models trained on? Are the datasets similar in their overall size and diversity? One additional baseline I didn’t see is a model trained on video data with a supervised classification objective (something like TimeSformer, for example). There is VideoMAE, which is unsupervised, and TVLT, which has video and audio. Would a supervised video model be closer to the instruction-tuned video models? If these baselines can be better justified and controlled, it will improve the impact of these results.

  • Weakness: For the task-specific analyses (Figure 3, Figure 5), the results are mostly shown on flatmap surfaces, and the patterns can be a bit hard to see, in terms of which areas are best predicted by which tasks. The results don’t always match up visually with what is in the text. For example, dorsal parietal cortex is supposed to have more variance explained by the spatial understanding task (line 301), but in Figure 5 it doesn’t look like those areas are actually included (IPS regions are not colored in Figure 5). These descriptions should be carefully checked to make sure they capture what the data show. It would help to see these results quantified for certain ROIs (language areas, etc.), so that we can more easily see the patterns. Or if it’s possible to label the ROIs on the flatmaps, this would help too.

Questions

  • In the section on banded ridge regression (line 171), it’s a bit confusingly worded. Banded ridge regression is usually just used to set the regularization strength for different feature groups, how did you use this to decompose the variance? Did you also use variance partitioning?

  • Figure 5, middle panel – this color scheme of blue and green is very hard to see, I would suggest a different color scheme.

Limitations

Yes.

Justification for Final Rating

The authors have addressed my concerns in their rebuttal - in particular, I am more convinced now that the advantage of the instruction-tuned models is actually due to instruction tuning, not just an artifact of model depth or dataset properties. I am raising my recommendation to "Accept".

Formatting Concerns

No concerns.

Author Response

We thank the reviewer for their positive, insightful, and valuable comments and suggestions, which are crucial for further strengthening our manuscript.

Q1.1 Can the better performance of instruction-tuned video models be explained by differences in depth or architecture?

Thank you for this insightful question.

  • We agree with the reviewer that instruction-tuned video and audio MLLMs generally have more layers compared to the non-instruction-tuned multimodal (TVLT) and unimodal baselines. However, our results suggest that depth alone does not account for the observed performance differences. Notably, only the instruction-tuned video MLLMs show significantly higher brain alignment across all regions, while the audio MLLMs—despite having similar depth—only show improvements in the auditory cortex. This disparity indicates that instruction-tuning, rather than depth alone, plays a key role in enhancing brain alignment.
  • Furthermore, our analysis (see Fig. 4) shows that different layers of the MLLMs align with distinct brain regions in a hierarchical manner, consistent with prior findings in neuroscience. If model depth were the primary driver of performance, we would expect a more uniform improvement across brain regions and models, which is not supported by our results.

Q1.2 What datasets are the models trained on? Are the datasets similar in their overall size and diversity?

Thank you for this important question.

  • We now provide additional discussion of model training data to clarify both the dataset composition and diversity for each model.
  • In the table below, we provide a detailed summary of the datasets each model was trained on, along with information about dataset size and content diversity.
| Model | Training Data | Size |
| --- | --- | --- |
| InstructBLIP-Video | 26 public datasets: MSCOCO, TextCaps, NoCaps, VQA v2, iVQA, MSRVTT-QA, MSVD-QA | Not specified |
| Video-LLaVA | LAION-CC-SBU (558K), Valley (702K), LLaVA-Instruct (665K image-text + 100K video-text) | ~1.3M total |
| LLaVA-NeXT-Video | LLaVA-Video-178K synthetic + real video QA/caption data | ~1.4M total |
| Qwen2.5-VL | Web-scale 4T tokens (text + multimodal), post-trained on 2M mixed samples | 4T tokens + 2M samples |
| VideoChat-R1 | Qwen2.5-VL backbone, 18K video instruction samples (temporal grounding, tracking, QA) | 18K samples |
| LLaVA-One-Vision | OneVision dataset (3.2M single-image + 1.6M multi-image/video), Evo-Instruct (143K) | ~5M total |
| TVLT | HowTo100M, YTTemporal180M (video-audio pairs) | Large-scale (not stated) |
| VideoMAE | Kinetics-400, Something-Something v2, Epic Kitchens 100 | ~550K videos |
| AST | AudioSet, ESC-50, Speech Commands | ~2M+ audio clips |

Summary:

The instruction-tuned video MLLMs differ substantially in training sources:

  • Some (like LLaVA-NeXT-Video and Video-LLaVA) include explicit video instruction datasets, while others (like InstructBLIP-Video and LLaVA-OneVision) primarily adapt image-based instructions to video via frame sampling.
  • Training set sizes are on the order of several million samples across all instruction-tuned video MLLMs, with varying modality mixes and task diversity.
  • The unimodal and non-instruction-tuned baselines are trained on narrower or more specialized domains, lacking the instruction-driven, multimodal task generalization seen in instruction-tuned MLLMs.

Overall, we hypothesize that the use of task-specific instructions during training is a key factor driving the similarly high levels of brain alignment observed across instruction-tuned video MLLMs, regardless of differences in architecture or dataset scale.

Q1.3 One additional baseline: Would a supervised video model (TimeSformer) be closer to the instruction-tuned video models?

Thank you for this insightful suggestion.

  • Based on the reviewer's suggestion, we now perform one additional baseline experiment with the TimeSformer model, using the “facebook/timesformer-base-finetuned-k400” checkpoint to extract feature representations (a feature-extraction sketch is provided after the observations below).
  • We report the normalized brain alignment across whole brain, language, visual, and auditory regions for TimeSformer alongside all previously evaluated models in the table below. This replicates the results shown in Fig. 2, now including TimeSformer for direct comparison.
| Model | Whole Brain | Language | Visual | Auditory |
| --- | --- | --- | --- | --- |
| InstructBLIP | 0.669 ± 0.028 | 0.693 ± 0.009 | 0.713 ± 0.015 | 0.664 ± 0.027 |
| VideoLLaVA | 0.650 ± 0.022 | 0.688 ± 0.025 | 0.696 ± 0.014 | 0.649 ± 0.027 |
| LLaVA-Next | 0.678 ± 0.018 | 0.730 ± 0.020 | 0.751 ± 0.007 | 0.751 ± 0.044 |
| Qwen-2.5-VL | 0.746 ± 0.024 | 0.759 ± 0.012 | 0.744 ± 0.026 | 0.791 ± 0.021 |
| LLaVA-OneVision | 0.666 ± 0.017 | 0.694 ± 0.003 | 0.709 ± 0.017 | 0.653 ± 0.048 |
| VideoChat | 0.679 ± 0.013 | 0.676 ± 0.008 | 0.688 ± 0.020 | 0.672 ± 0.056 |
| QwenAudio | 0.473 ± 0.012 | 0.523 ± 0.012 | 0.489 ± 0.016 | 0.526 ± 0.046 |
| Kimi | 0.523 ± 0.012 | 0.563 ± 0.012 | 0.529 ± 0.016 | 0.566 ± 0.046 |
| TVLT | 0.584 ± 0.006 | 0.571 ± 0.006 | 0.584 ± 0.005 | 0.547 ± 0.006 |
| VideoMAE | 0.541 ± 0.023 | 0.533 ± 0.006 | 0.571 ± 0.007 | 0.530 ± 0.006 |
| TimeSformer | 0.574 ± 0.003 | 0.598 ± 0.016 | 0.596 ± 0.018 | 0.496 ± 0.037 |
| AST | 0.477 ± 0.018 | 0.529 ± 0.006 | 0.529 ± 0.006 | 0.472 ± 0.004 |
  • While TimeSformer demonstrates stronger performance than unimodal baselines (e.g., VideoMAE, AST) and is comparable to the non-instruction-tuned multimodal model TVLT, it still significantly underperforms relative to instruction-tuned video MLLMs across the whole brain, language, visual, and auditory regions.
  • This further reinforces our conclusion that instruction tuning, rather than supervision alone, is critical for learning representations that align with fMRI brain activity.
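For reference, here is a hedged sketch of how clip-level features might be extracted from the TimeSformer checkpoint named above, assuming the Hugging Face transformers interface (TimesformerModel); the authors' actual frame sampling, preprocessing, and layer selection may differ.

```python
# Hedged sketch: clip-level feature extraction with TimeSformer (not the authors' exact pipeline).
import numpy as np
import torch
from transformers import AutoImageProcessor, TimesformerModel

ckpt = "facebook/timesformer-base-finetuned-k400"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = TimesformerModel.from_pretrained(ckpt)
model.eval()  # frozen feature extractor

# Dummy clip: 8 RGB frames of 224x224 (the base k400 checkpoint expects 8 frames).
frames = [np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8) for _ in range(8)]

with torch.no_grad():
    inputs = processor(frames, return_tensors="pt")   # pixel_values: (1, 8, 3, 224, 224)
    outputs = model(**inputs)                          # last_hidden_state: (1, tokens, 768)

# Mean-pool over spatiotemporal tokens to obtain one feature vector per clip,
# which can then be fed into the voxelwise encoding model.
clip_feature = outputs.last_hidden_state.mean(dim=1).squeeze(0)
print(clip_feature.shape)  # torch.Size([768])
```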

Q2.1 There seems to be a discrepancy between the text (spatial understanding) and Figure 5b flatmap (IPS regions not highlighted). Can you clarify?

Thank you for pointing this out and providing a valuable suggestion.

  • We acknowledge that the intraparietal sulcus (IPS) was not included in the current ROI set because our analyses were constrained to Fedorenko’s language network, which excludes dorsal parietal regions. This omission likely caused the discrepancy noted in the flatmaps shown in Figure 5.
  • Following your suggestion, we will include key IPS subregions (specifically IP1, IP2, and LIPd from the Glasser atlas) in the final version of the figure to more accurately reflect spatial task-related variance in the dorsal parietal cortex, especially for the spatial understanding task (line 301), and will ensure that descriptions in the text accurately match the data.

Q2.2 It would help to see these results quantified for certain ROIs or label the ROIs on the flatmaps to make the patterns clearer.

Thank you for this question.

  • Based on the reviewer’s suggestion, we now provide normalized brain alignment values for both spatial and temporal understanding tasks in two key ROIs: the intraparietal sulcus (IPS) and the posterior cingulate cortex (PCC), which is part of Fedorenko’s language network.
  • Consistent with prior work, IPS shows stronger involvement in spatial processing (Sack, 2009; Papadopoulos et al., 2018), while PCC is active in both domains (Coull & Nobre, 2008):

IPS:

  • Spatial: 0.723±0.020
  • Temporal: 0.672±0.013

PCC:

  • Spatial: 0.675±0.020
  • Temporal: 0.651±0.012

These findings support the functional specialization of IPS for spatial cognition, and a more domain-general role of PCC across both spatial and temporal understanding.

Sack, 2009. Parietal cortex and spatial cognition. Behavioural brain research

Coull & Nobre, 2008. Dissociating explicit timing from temporal expectation with fMRI. Current opinion in neurobiology

Papadopoulos et al. 2018. Functional subdivisions within the human intraparietal sulcus are involved in visuospatial transformation in a non‐context‐dependent manner. HBM

  • We have labeled several language and visual ROIs on the flatmap in Appendix Figure 6. In line with the reviewer’s suggestion, we will provide quantitative results for multiple language ROIs in the final version of the manuscript.

Q3. In the section on banded ridge regression (line 171), it’s a bit confusingly worded. How did you use this to decompose the variance? Did you also use variance partitioning?

Thank you for this question.

  • We would like to clarify that different task-specific representations require different levels of regularization, which is why we use banded ridge regression.
  • This approach enables separate regularization strengths (as pointed out by the reviewer) for each feature group (e.g., different tasks) and incorporates an implicit feature-space selection mechanism that ignores the contribution of non-predictive or redundant feature spaces.
  • For single-modal baselines (i.e., those with only one feature space), we use banded ridge regression with a single group, which still allows us to maintain consistency across evaluations.
  • To measure per-task predictive performance in the joint model, we use the split-correlation measure, which computes the correlation between the predicted and actual brain responses from each task-specific feature space independently.
  • This allows us to interpret how much each task-specific representation contributes within the joint predictive framework. Implementation ref: predict_and_score_weighted_kernel_ridge(Ks_test, dual_weights, deltas, Y_test, split=split, n_targets_batch=n_targets_batch, score_func=score_func).
  • In addition, we also perform individual task modeling and conduct variance partitioning between task pairs to evaluate unique and shared variance explained.
  • Appendix Table 13 reports the results of variance partitioning across all 13 video tasks, averaged across subjects, for the whole brain, visual, and language regions using the Qwen-2.5-VL model.

We will clarify these distinctions more explicitly in the final revised version of the manuscript.
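To make the split-correlation idea concrete, below is a minimal NumPy sketch, not the himalaya-based pipeline referenced above: a joint ridge model is fit on concatenated task-specific feature bands (with fixed per-band scalings standing in for the band-wise regularization that banded ridge learns via cross-validation), the joint prediction is decomposed into per-band contributions, and each contribution is correlated with the measured responses. All data, dimensions, and scalings are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: T timepoints, V voxels, two task-specific feature bands.
T, V = 200, 50
X_bands = [rng.standard_normal((T, 80)), rng.standard_normal((T, 40))]  # e.g., two instruction tasks
Y = rng.standard_normal((T, V))                                         # placeholder fMRI responses

# Fixed per-band scalings stand in for band-specific regularization; in the
# himalaya implementation referenced above these are learned by cross-validation.
deltas = np.array([0.0, -0.5])                     # assumed values for illustration
X_scaled = [np.exp(d) * X for d, X in zip(deltas, X_bands)]
X_joint = np.hstack(X_scaled)

# Joint ridge solution: W = (X^T X + alpha I)^-1 X^T Y
alpha = 10.0
XtX = X_joint.T @ X_joint + alpha * np.eye(X_joint.shape[1])
W = np.linalg.solve(XtX, X_joint.T @ Y)

# Split predictions: each band's additive contribution to the joint prediction.
W_bands = np.split(W, [X_scaled[0].shape[1]], axis=0)
Y_pred_per_band = [Xb @ Wb for Xb, Wb in zip(X_scaled, W_bands)]

def voxelwise_corr(a, b):
    """Pearson correlation per voxel (column)."""
    a = a - a.mean(0)
    b = b - b.mean(0)
    return (a * b).sum(0) / (np.linalg.norm(a, axis=0) * np.linalg.norm(b, axis=0))

# Split correlation: how well each task-specific band predicts each voxel.
# In practice this is computed on held-out test data rather than training data.
for i, Y_hat in enumerate(Y_pred_per_band):
    print(f"band {i}: mean split correlation = {voxelwise_corr(Y_hat, Y).mean():.3f}")
```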

Q4. I would suggest a different color for Figure 5b.

Thank you for this valuable suggestion. For the middle panel in Figure 5, we will update the visualization using a more distinguishable and accessible color palette in the final revised version.

Comment

I appreciate the authors’ detailed response and their additional experiments. With these additional details, I am now more convinced about the claims of the paper, especially the advantage for the instruction-tuned models over baselines. I recommend acceptance of the paper.

Comment

We appreciate the reviewer’s feedback and recommendation for acceptance. We are confident that the additional baseline experiments have strengthened the paper and improved its quality. We kindly request that you consider updating your evaluation score to reflect your revised assessment.

Official Review
Rating: 5

This work evaluates how instructed video MLLM features align with brain activity through a series of experiments. The authors rely on movies combining audio and video and the corresponding brain responses from 4 human subjects. The hypotheses evaluated are (a) that instructed MLLM features provide better alignment than non-instructed models and unimodal models (audio and video alone, respectively), (b) that the instructions, grouped by type of request (e.g., asking about visual cues, or about the narrative aspects of the movie clips), align with specific brain regions (e.g., narrative understanding connects more strongly with language-specific brain regions), and finally (c) that an alignment can also be found between the actual layers of such architectures and brain regions. For instance, a parallel is drawn between early layers, commonly used for the models to attend to input features, and visual-understanding brain regions. Likewise, mid-/late-layers, commonly dedicated to transferring information between input tokens, seem to show stronger alignment with brain regions dedicated to more abstract reasoning/tasks. The experiments are supported by statistical testing and smaller feature-analysis experiments (Appendix D).

Strengths and Weaknesses

This work is interesting for a few aspects:

  • It connects neuroscientific considerations with off-the-shelf trained video MLLMs, and the authors are able to find interesting signal in their experiments. The clear use of statistical testing is a plus for this work.
  • However, while statistically significant, some experiments and their conclusions seem a stretch. The largest contentious point is the authors opposing instructed vs. non-instructed video MLLMs and concluding that non-instructed models show weaker alignment. I found this part misleading. Not only do the authors use a single non-instructed video MLLM from 2022 with 60% fewer layers than all the other models, but they also compare it to state-of-the-art models trained at very large scale such as Qwen and the LLaVA series. It is likely that the strong alignment might be explained by (a) just more training compute, as well as (b) just the difference in parameter count between TVLT and the rest of those models.
  • There are pre-trained models (e.g., image-text-only models like OpenFlamingo or VILA) that I'd expect to show quite strong results (in few-shot settings). The 'instructed' (explicit prompting) advantage as advanced by the authors is likely just an artifact, and few-shot (implicit prompting) evaluation of strong pre-trained models would very likely show the same impact. As a matter of fact, some pre-trained models in the text-image field are sometimes stronger than their instruction-tuned counterparts, especially on unseen tasks (e.g., see MM1 30B pretrained performing addition with images of birthday candles).
  • Besides those possible flaws in the reasoning, I think this work brings an interesting discussion and could be published to bring that discussion to a larger audience, especially to communities outside of ML.

Questions

  • While the claim multimodal vs. unimodal makes sense, the claim instructed vs. non-instructed is clearly a stretch. Adding it to the title of this work is slightly risky given that this is not clearly established.
    • (A) Have you considered that your experiments are just showing that some models are larger and more trained than others? TVLT is close to being a toy model compared to Video-LLaVA and InstructBLIP, just in terms of size and the sheer compute thrown into their respective training recipes.
    • (B) If you really want to prove that non-instructed is weaker, I urge you to demonstrate it with pre-trained models of similar size and training compute. I know first-hand that pre-trained models are quite often very strong, and that, as a matter of fact, SFT sometimes makes them weaker in certain aspects (e.g., unseen tasks like the candle addition in MM1).

Limitations

  • Explained above.

Justification for Final Rating

EDIT: Authors added new results regarding the distinction instructed vs. non-instructed. Given those results and the nuance they bring, I update my scores to reflect that addition.

Formatting Concerns

  • No concerns. The figures are well done and vectorized. Matching fonts as well.
Author Response

We thank the reviewer for their positive, insightful and valuable comments and suggestions, which are crucial for further strengthening our manuscript.

Q1. Strong pre-trained models (e.g., OpenFlamingo, VILA) can perform well in few-shot settings, sometimes surpassing instruction-tuned counterparts on unseen tasks (e.g., MM1 30B). Could the reported benefits stem from few-shot prompting capabilities of pre-trained models rather than instruction tuning?

  • We would like to clarify that our definition of a non-instruction-tuned multimodal model refers specifically to multimodal models (like TVLT) that are jointly pretrained on video and audio without any instruction tuning or in-context learning capabilities. We do not consider in-context learning models in this category, as those still rely on prompting capabilities, albeit implicitly. We will clarify this in the revised draft.

  • However, we agree with the reviewer that a few-shot prompting comparison will also add good value to the current paper.

  • Based on the reviewer's suggestion, we conducted two additional experiments: (i) VILA-7B: a pretrained vision-language model with few-shot prompting capabilities, (ii) Qwen-2.5-Omni-7B, a pretrained-only counterpart to the instruction-tuned Qwen-2.5-VL-7B-Instruct model.

    • Since the OpenFlamingo model has version compatibility issues with PyTorch and the MM1 model is not publicly available, we chose the VILA model and extracted video features from image frames.
    • For both VILA and Qwen-2.5-Omni, we report alignment scores across whole brain, language, visual, and auditory regions, alongside all previously evaluated models.
| Model | Whole Brain | Language | Visual | Auditory |
| --- | --- | --- | --- | --- |
| InstructBLIP | 0.669 ± 0.028 | 0.693 ± 0.009 | 0.713 ± 0.015 | 0.664 ± 0.027 |
| VideoLLaVA | 0.650 ± 0.022 | 0.688 ± 0.025 | 0.696 ± 0.014 | 0.649 ± 0.027 |
| LLaVA-Next | 0.678 ± 0.018 | 0.730 ± 0.020 | 0.751 ± 0.007 | 0.751 ± 0.044 |
| Qwen-2.5-VL | 0.746 ± 0.024 | 0.759 ± 0.012 | 0.744 ± 0.026 | 0.791 ± 0.021 |
| LLaVA-OneVision | 0.666 ± 0.017 | 0.694 ± 0.003 | 0.709 ± 0.017 | 0.653 ± 0.048 |
| VideoChat | 0.679 ± 0.013 | 0.676 ± 0.008 | 0.688 ± 0.020 | 0.672 ± 0.056 |
| QwenAudio | 0.473 ± 0.012 | 0.523 ± 0.012 | 0.489 ± 0.016 | 0.526 ± 0.046 |
| Kimi | 0.523 ± 0.012 | 0.563 ± 0.012 | 0.529 ± 0.016 | 0.566 ± 0.046 |
| TVLT | 0.584 ± 0.006 | 0.571 ± 0.006 | 0.584 ± 0.005 | 0.547 ± 0.006 |
| VideoMAE | 0.541 ± 0.023 | 0.533 ± 0.006 | 0.571 ± 0.007 | 0.530 ± 0.006 |
| AST | 0.477 ± 0.018 | 0.529 ± 0.006 | 0.529 ± 0.006 | 0.472 ± 0.004 |
| Qwen-2.5-Omni | 0.601 ± 0.011 | 0.615 ± 0.027 | 0.629 ± 0.027 | 0.629 ± 0.027 |
| VILA | 0.628 ± 0.023 | 0.647 ± 0.012 | 0.610 ± 0.055 | 0.610 ± 0.055 |

We make the following observations based on the updated results in the Table:

  • Both VILA and Qwen-2.5-Omni, which are pretrained multimodal models without instruction tuning, perform significantly better than TVLT and other unimodal baselines across all brain regions, including whole brain, language, visual, and auditory.
  • However, instruction-tuned video MLLMs continue to outperform these pretrained models, demonstrating higher alignment across all regions.
  • These findings support the conclusion that instruction tuning provides additional gains beyond few-shot prompting in pretrained multimodal models.

In response to the reviewer’s suggestion, we will include the updated results from pretrained models with in-context learning capabilities, along with their corresponding brain activation maps, in the final version.

Comment

Dear Reviewer 4P2s,

We appreciate your strong positive feedback and the effort you have invested in evaluating our work.

Following your suggestion, we have conducted additional experiments using the VILA model and included the updated results in our response. We kindly request that you review our response and consider updating your evaluation based on the revisions made.

Comment

Dear Authors,

Thank you for the extra experiments conducted on VILA and other pre-trained models. This is an interesting outcome, and it at least nuances the point of instructed vs. non-instructed. I think the paper would benefit substantially from integrating those results and the nuance into that instructed vs. non-instructed distinction.

Given that the authors will integrate those results and the nuance, I stand by my original review and recommend this paper be accepted, so the wider community can discuss those results.

Comment

We appreciate the reviewer's positive feedback and are confident that it has enhanced the paper's quality.

Final Decision

This paper explores how instruction-tuned multimodal large language models (MLLMs) align with human brain activity during naturalistic movie viewing, using fMRI data and an encoding framework. The authors claim that instruction-tuned models outperform unimodal and non-instruction-tuned multimodal baselines and that task-specific instructions map to distinct brain regions.

The strengths of the work include its novelty in connecting instruction-tuned models to brain alignment (PiJw, Mu2g), the careful statistical testing, and the breadth of analyses across multiple instructions and regions (4P2s, 6gym). The framing is timely and could be of interest to both the ML and neuroscience communities.

However, there are significant weaknesses. The central claim—that instruction tuning itself drives better brain alignment—remains insufficiently supported. Reviewers noted that the baselines are poorly controlled: the non-instructed TVLT model is much smaller and older than the instruction-tuned models, confounding claims about tuning vs. scale (4P2s, PiJw). The evaluation also relies almost exclusively on Pearson correlation, without complementary robustness metrics (6gym). Moreover, the neuroscientific contribution is limited; inter-brain mapping results are described as speculative, and interpretations sometimes overreach relative to the evidence.

During rebuttal, the authors added results with VILA and Qwen-Omni, partially addressing baseline concerns, but these additions remain insufficient to firmly isolate the effect of instruction tuning. Given these limitations, I recommend rejection.