Brain-Informed Fine-Tuning for Improved Multilingual Understanding in Language Models
Abstract
Reviews and Discussion
This paper presents a method for fine-tuning language models using bilingual brain data to enhance their semantic understanding in multilingual contexts. The core idea leverages neural signals to adapt language models, aligning their representations more closely with human brain activity during language processing. This approach demonstrates potential for improving cross-lingual generalization by grounding model training in biologically inspired signals.
Strengths and Weaknesses
Strengths:
- The paper is clearly written, with straightforward experimental setups, and Table 1 presents intuitive results.
- The release of a benchmark dataset constitutes a tangible contribution to the community.
Weaknesses:
- The proposed method (naïve full-model fine-tuning of a pretrained language model) lacks technical novelty, significantly limiting its conceptual contribution.
- As shown in Table 1, fMRI-guided fine-tuning yields marginal and inconsistent performance gains across tasks (particularly in subtables (a) and (c)). These results undermine the framework's robustness and call into question its practical utility.
Questions
- Is it because the performance is not satisfactory that the results of fine-tuning other models are placed in the appendix? I suggest placing the results of multiple models in the main text, which could reveal whether the approach generalizes across architectures or is specific to large-scale pretrained models.
- The flat map visualizations in Figure 2 appear cluttered. What specific insights do they intend to convey? Clarifying the key takeaways (e.g., spatial patterns of cross-lingual alignment) would strengthen interpretability.
- Could the authors explicitly articulate the practical contributions of this work? Given the modest performance gains (Table 1), what justifies its potential for acceptance beyond the dataset release?
Limitations
See Weaknesses.
Final Justification
Although I still do not fully agree with the authors' points regarding novelty, I have decided to raise my score to 3.
Formatting Issues
No.
Clarification of fine-tuning methodology
We thank the reviewer for their valuable comment. We would like to highlight that the novelty of our work lies in the design of the brain-informed fine-tuning pipeline. Our pipeline is a general, end-to-end framework that addresses key limitations of prior brain-to-model alignment methods and enables direct integration of brain data into language model fine-tuning.
Most previous approaches aligned stimulus features to brain responses via preprocessing steps based on rigid assumptions; for instance, using fixed word presentation rates or downsampling (e.g., Schwartz et al., 2019). These designs impose strong constraints. In contrast, our pipeline allows for:
- variable-length input (since word timing per repetition time (TR) can vary),
- more flexible and generalizable alignment applicable across modalities and tasks, since we pass raw text as input,
- dynamic stimulus-TR alignment as part of the learning process rather than a static preprocessing step (a minimal sketch of this setup follows below).
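For concreteness, here is a minimal, hypothetical PyTorch-style sketch of such an end-to-end setup. All names (BrainTuner, tr_index, n_voxels) and the mean-pooling choice are illustrative assumptions rather than our exact implementation; the key point is that word-to-TR alignment and the voxel projection sit inside the training graph rather than in a preprocessing step.

```python
# Illustrative sketch only; module names, pooling, and shapes are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel

class BrainTuner(nn.Module):
    """Pretrained LM plus a learnable projection into voxel space.

    Variable-length word sequences are pooled per TR inside forward(),
    so stimulus-TR alignment is learned jointly with the LM weights."""
    def __init__(self, model_name: str, n_voxels: int):
        super().__init__()
        self.lm = AutoModel.from_pretrained(model_name)   # fine-tuned end to end
        self.dropout = nn.Dropout(0.2)
        self.to_voxels = nn.Linear(self.lm.config.hidden_size, n_voxels)

    def forward(self, input_ids, attention_mask, tr_index):
        # tr_index[b, t]: the TR each token falls into (word count per TR can vary)
        h = self.lm(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        n_trs = int(tr_index.max()) + 1
        pooled = []
        for tr in range(n_trs):
            mask = ((tr_index == tr) & attention_mask.bool()).unsqueeze(-1).float()
            pooled.append((h * mask).sum(1) / mask.sum(1).clamp(min=1.0))
        tr_feats = torch.stack(pooled, dim=1)             # [batch, n_trs, hidden]
        return self.to_voxels(self.dropout(tr_feats))     # predicted response per TR
```

During fine-tuning, the predicted per-TR responses are compared with the recorded BOLD responses under the chosen training objective (e.g., the contrastive NT-Xent loss), so the alignment itself is shaped by the gradient signal.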
To illustrate this advantage, we implemented the preprocessing-based alignment (as in prior work) and fine-tuned a new model on bilingual data from one participant. The mean encoding performance is 0.13 for the preprocessing-based method, compared to 0.16 for our brain-informed fine-tuned model. We hope this highlights the practical and conceptual contributions of our approach. We will add this reasoning in the camera-ready to ensure clarity.
Clarification on gains in Table 1
Thank you for raising this point. The initial Table 1 reported downstream task performance for a single participant, which may have appeared inconsistent. We have revised Table 1 to show average performance across all six bilingual participants, with standard deviation. The overall trends remain consistent with those initially reported. Due to space constraints, please see the updated Table 1 included in our response to reviewer hVbZ. This aggregated analysis strengthens the robustness and practical utility of our approach, showing consistent benefits across participants and models. Please note that models were fine-tuned separately for each participant and language.
Q1
Thank you for the question. We placed the results in the appendix due to space constraints, not due to lack of performance. In fact, all three models (mBERT, XLM-R, and XGLM) show improvements (see below) following brain-informed fine-tuning. This table shows that fine-tuning improves performance on most tasks: mBERT (7/9 tasks), XLM-R (9/9), and XGLM (6/9), with especially large gains for XLM-R (e.g., RTE: 54.15 → 62.71, MRPC: 69.61 → 77.94).
Updated Supplementary Table 2. a) Fine-tuning and Evaluation in the Same Language
| GLUE Task | mBERT | mBERT‑ft‑en | XLM‑R‑en | XLM‑R‑ft‑en | XGLM‑en | XGLM‑ft‑en |
|---|---|---|---|---|---|---|
| CoLA | 42.68 | 40.05 | 46.30 | 50.20 | 40.11 | 41.22 |
| SST‑2 | 89.68 | 90.14 | 92.16 | 92.55 | 90.94 | 90.48 |
| MRPC (Acc.) | 84.80 | 84.80 | 69.61 | 77.94 | 71.32 | 71.81 |
| MRPC (F1) | 88.56 | 89.01 | 81.55 | 84.10 | 81.52 | 82.39 |
| STS‑B (Pearson) | 88.06 | 88.42 | 84.96 | 85.18 | 81.43 | 82.36 |
| STS‑B (Spearman) | 87.76 | 88.22 | 84.84 | 85.10 | 81.47 | 82.41 |
| QQP (Acc.) | 90.22 | 90.47 | 90.28 | 90.61 | 88.93 | 90.11 |
| QQP (F1) | 86.70 | 87.07 | 87.05 | 87.45 | 83.11 | 85.33 |
| MNLI‑m | 82.09 | 82.66 | 84.16 | 84.46 | 76.62 | 78.85 |
| MNLI‑mm | 82.38 | 82.97 | 84.22 | 84.40 | 78.80 | 79.01 |
| QNLI | 91.14 | 91.18 | 90.17 | 90.24 | 88.94 | 88.44 |
| RTE | 67.15 | 65.70 | 54.15 | 62.71 | 53.79 | 53.43 |
| WNLI | 53.52 | 56.34 | 43.66 | 56.34 | 50.34 | 52.41 |
Q2
Thank you for the helpful feedback. We acknowledge that Figure 2 has a lot of information. The main takeaway from Figure 2 is that across both languages and model types (monolingual and multilingual), bilingual brain-informed fine-tuning consistently yields better encoding performance than the vanilla models in several brain regions.
While spatial patterns vary due to inter-individual differences (see Appendix Figure 5), we observe improvements in encoding performance in several high-level language areas. Given this variability, the improvements in downstream task performance from fine-tuning remain robust. We will provide separate flatmaps for each fine-tuning condition in the supplementary material to improve clarity.
Q3
Thank you for this important question. Our work demonstrates that brain-informed fine-tuning can reliably enhance multilingual language model performance, even with limited data. Please note that each model was fine-tuned using 2.5 hours of naturalistic brain data per participant. We adopt a small-N design (Smith & Little, 2018), conducting full replication across participants by fine-tuning and evaluating models individually. While the performance gains may appear modest, they are robust across participants. Our results contribute to work on aligning brain and language model representations. To our knowledge, this is the first study to leverage bilingual brain data for fine-tuning, showing gains in multilingual understanding. This work offers a step toward bridging neuroscience and NLP to interpret and model multilingual semantic representations.
Dear Reviewer ZxVC,
We appreciate your feedback and the effort you have invested in evaluating our work.
We have carefully addressed the points you raised regarding the fine-tuning methodology and the performance gains in Table 1. Based on our responses, all four other reviewers have updated their evaluations.
We kindly request you to verify our response and consider updating your evaluation score accordingly.
Thanks for the clarification. Although I still do not fully agree with the authors' points regarding novelty, I have decided to raise my score to 3.
We appreciate the reviewer’s thoughtful feedback, and willingness to raise the score.
We respectfully acknowledge the reviewer's concerns regarding the novelty of our approach. We would be grateful if you could share more specific feedback on which aspects you feel are lacking. This would help us better address any remaining misunderstandings and improve the clarity of our contributions.
To summarize, we believe the key contributions of our work are:
- An end-to-end brain-informed fine-tuning pipeline, in contrast to prior work that depends on fixed, preprocessed alignment steps.
- To our knowledge, the first use of bilingual brain data to fine-tune both monolingual and multilingual LLMs.
- A comprehensive evaluation across multiple models (both monolingual and multilingual) and training objectives, demonstrating consistent improvements in both brain alignment and downstream NLP tasks across participants.
These contributions were positively recognized by four reviewers (jJEB, hVbZ, 4NET, and xVwE).
Should you have any further questions or suggestions, we are ready to provide additional information or clarification as needed.
The paper introduces a novel approach for improving multilingual understanding in language models by finetuning them using brain activity recorded from bilingual individuals reading the same stories in English and Chinese. The proposed method aligns model representations with bilingual brain data to induce shared semantic representations. The authors finetune both monolingual and multilingual models using fMRI data and evaluate their performance on brain encoding and standard NLP benchmarks across multiple languages. Results show consistent improvements: finetuned models better predict brain responses and outperform their vanilla counterparts in downstream tasks, not only in the finetuned language but also in the participants’ second language and even in unseen languages. Importantly, these gains are specific to bilingual brain data and not replicable with monolingual data, suggesting that shared bilingual neural representations contribute to broader language generalization.
Strengths and Weaknesses
Strengths
- This paper presents a timely and compelling contribution to the NeuroAI community. It is the first to explore brain-informed fine-tuning using bilingual brain data, demonstrating its benefits not only for brain predictivity but also for performance on downstream NLP tasks such as GLUE.
- The proposed fine-tuning pipeline is technically novel, integrating both temporal delay modeling and downsampling directly into the model architecture, which sets it apart from prior approaches that treat these as preprocessing steps.
- The authors also provide a thorough and well-documented explanation of their voxelwise encoding methodology, employing 3-lobe Lanczos interpolation, finite impulse response (FIR) filtering, validation-based layer selection, and evaluating on held-out stories, all of which enhance reproducibility and exemplify best practices for measuring brain-model alignment.
- The paper includes several insightful analyses, such as comparisons of training objectives, evaluations across same-language, cross-language, and zero-shot generalization settings, and examinations of finetuning effects using language-selective, semantically-selective, and whole-brain regions, which I personally found quite interesting.
Weaknesses
- The main limitation of the work lies in the use of relatively outdated, encoder-only model architectures for fine-tuning. The authors mention that results for LLaMA-3.2-1B (a more modern decoder-based model) are included in the supplementary materials (line 129), but no such results are actually provided. Including them would be informative to see whether the benefits of brain-informed finetuning scale with more recent, larger models.
- While the proposed finetuning pipeline is novel, the paper does not include ablations to assess the individual contribution of components like temporal delay modeling or the specific interpolation method. This makes it hard to isolate what aspects of the pipeline are most critical to the observed improvements.
- The presentation of results could be significantly improved. The tables are difficult to parse, with multiple subtables embedded within a single table definition, which hampers readability.
- The paper does not include any quantitative tables comparing the brain encoding performance across different models, relying instead on qualitative inspection of brain surface plots. To strengthen the evaluation, I recommend reporting average Pearson correlation scores (either across all voxels or within specific ROIs) to allow for more rigorous and interpretable model comparisons.
- Finally, many critical results that substantiate the paper's core claims are relegated to the appendix. It would strengthen the paper to move such results into the main text to better support the narrative and improve clarity for the reader.
Typo
Line 341: captures → captured
Questions
- Why did you focus mainly on encoder-based models (e.g., BERT, mBERT) for brain-informed finetuning? Did you observe any challenges or limitations when attempting to finetune decoder-only models like LLaMA?
- You integrated downsampling and temporal-delay modeling directly into the model architecture. Did you test the impact of this integration via an ablation (e.g., using fixed preprocessing instead)?
- Given the relatively small number of stories used per language, how sensitive is the finetuning process to the quantity of brain data? It would be interesting to explore some sort of scaling law for the proposed finetuning method.
- Do you have any insight into what aspects of language (e.g., semantics, syntax) are being shaped or improved by the brain-informed finetuning?
Limitations
Yes
Final Justification
The authors have conducted additional experiments and provided reasonable responses to some of my remaining concerns. However, the paper would still need additional experiments (e.g., on larger-scale models) and improved clarity to warrant a stronger accept.
Formatting Issues
The tables seem not to follow NeurIPS paper formatting guidelines.
Q1.1
Thank you for this question. We focused primarily on encoder-based models (BERT, mBERT) for two reasons:
- Consistency with prior work: Previous brain-informed tuning approaches with monolingual English brain data have predominantly used encoders (e.g., BERT, Wav2Vec, Whisper, WavLM) due to their strong alignment with text and speech features.
- Bilingual dataset requirements: Our dataset contains bilingual brain recordings. To ensure architectural uniformity across English and Chinese experiments, we used BERT-base, BERT-Chinese, and mBERT, which share a comparable encoder backbone.
However, our experiments are not limited to encoders. We also fine-tuned XLM-R (encoder) and XGLM (decoder), showing that our brain-informed fine-tuning approach generalizes across architectures (see Supplementary Table 2). This table shows that fine-tuning improves performance on most tasks: mBERT (7/9 tasks), XLM-R (9/9), and XGLM (6/9), with especially large gains for XLM-R (e.g., RTE: 54.15 → 62.71, MRPC: 69.61 → 77.94).
Updated Supplementary Table 2. a) Fine-tuning and Evaluation in the Same Language
| GLUE Task | mBERT | mBERT‑ft‑en | XLM‑R‑en | XLM‑R‑ft‑en | XGLM‑en | XGLM‑ft‑en |
|---|---|---|---|---|---|---|
| CoLA | 42.68 | 40.05 | 46.30 | 50.20 | 40.11 | 41.22 |
| SST‑2 | 89.68 | 90.14 | 92.16 | 92.55 | 90.94 | 90.48 |
| MRPC (Acc.) | 84.80 | 84.80 | 69.61 | 77.94 | 71.32 | 71.81 |
| MRPC (F1) | 88.56 | 89.01 | 81.55 | 84.10 | 81.52 | 82.39 |
| STS‑B (Pearson) | 88.06 | 88.42 | 84.96 | 85.18 | 81.43 | 82.36 |
| STS‑B (Spearman) | 87.76 | 88.22 | 84.84 | 85.10 | 81.47 | 82.41 |
| QQP (Acc.) | 90.22 | 90.47 | 90.28 | 90.61 | 88.93 | 90.11 |
| QQP (F1) | 86.70 | 87.07 | 87.05 | 87.45 | 83.11 | 85.33 |
| MNLI‑m | 82.09 | 82.66 | 84.16 | 84.46 | 76.62 | 78.85 |
| MNLI‑mm | 82.38 | 82.97 | 84.22 | 84.40 | 78.80 | 79.01 |
| QNLI | 91.14 | 91.18 | 90.17 | 90.24 | 88.94 | 88.44 |
| RTE | 67.15 | 65.70 | 54.15 | 62.71 | 53.79 | 53.43 |
| WNLI | 53.52 | 56.34 | 43.66 | 56.34 | 50.34 | 52.41 |
Q1.2
When applying brain-tuning to decoder-only models such as XGLM and LLaMA, we observed two challenges:
- Data efficiency: these larger generative models require more neural data and are prone to overfitting.
- Task mismatch: our downstream evaluations focus on understanding rather than generation.
Despite this, we observed small gains, suggesting that the method is architecture-agnostic but currently more effective on encoder models given data and task constraints.
Please see below for results with LLaMA‑3.2‑1B fine-tuned using our brain-informed approach with LoRA adaptation applied to layer 9 (similar to Antonello et al., 2024). Models were evaluated on three downstream datasets (Wang et al. 2019, Sakaguchi et al., 2019). Brain-informed fine-tuning yields improvements over the base model across two datasets, further supporting the generalizability of our approach to different architectures and low-parameter adaptation methods (LoRA). We note that improvements are small, likely due to the parameter-efficient adaptation (LoRA on a single layer). Unfortunately, we couldn’t perform full fine-tuning on this model due to computational constraints.
| Dataset | LLaMA‑3.2‑1B | LLaMA‑3.2‑1B‑ft-en | LLaMA‑3.2‑1B‑ft-zh |
|---|---|---|---|
| BoolQ | 0.5790 | 0.5848 | 0.5830 |
| COPA | 0.5534 | 0.5534 | 0.5534 |
| Winogrande | 0.4945 | 0.4961 | 0.4961 |
| Average | 0.5423 | 0.5447 | 0.5441 |
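For reference, a hypothetical sketch of restricting LoRA to a single layer with the Hugging Face peft library is shown below. The rank, target modules, and dropout are illustrative assumptions, not the exact configuration used for the results above.

```python
# Hypothetical LoRA configuration sketch; hyperparameters are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
lora_cfg = LoraConfig(
    r=8,                                  # low-rank update dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    layers_to_transform=[9],              # adapt only layer 9, as described above
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the LoRA weights remain trainable
```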
Q2
In response to your suggestion, we implemented an alternative version of our pipeline that mirrors the preprocessing-based approach used in prior work; stimulus-TR alignment is performed prior to fine-tuning. Specifically, we constructed stimulus-TR pairs similar to Schwartz et al. (2019). We then fine-tuned the language model using this pre-aligned input and bilingual brain data from one participant. The mean encoding performance is 0.13 for the preprocessing-based method, compared to 0.15 for the vanilla model and 0.16 for our fine-tuned model. The table below presents downstream task performance (on the GLUE benchmark) for this preprocessing-based method (denoted as BERT-ft-en (Ablation)). This allows direct comparison with our proposed end-to-end brain-informed fine-tuning pipeline. We observe a substantial drop in performance for the preprocessing-based method (BERT-ft-en (Ablation)). This suggests that stimulus-TR alignment performed prior to the fine-tuning procedure likely discards valuable temporal and contextual information. In contrast, our approach integrates brain alignment during fine-tuning, enabling robust downstream performance.
| GLUE Task | BERT-en | BERT-ft-en (Mean±Std) | BERT-ft-en (Ablation) |
|---|---|---|---|
| CoLA | 53.38 | 55.11 ± 0.75 | 0.00 |
| SST-2 | 92.08 | 92.58 ± 0.39 | 50.92 |
| MRPC (Acc.) | 79.41 | 80.55 ± 1.01 | 68.38 |
| MRPC (F1) | 86.27 | 86.91 ± 0.75 | 81.22 |
| STS-B (Pears.) | 88.06 | 88.10 ± 0.07 | 0.008 |
| STS-B (Spear.) | 87.65 | 87.55 ± 0.18 | 0.028 |
| QQP (Acc.) | 90.84 | 90.79 ± 0.03 | 75.28 |
| QQP (F1) | 87.70 | 87.71 ± 0.19 | 62.81 |
| MNLI-m | 84.38 | 84.40 ± 0.09 | 41.74 |
| MNLI-mm | 84.64 | 84.55 ± 0.20 | 41.48 |
| QNLI | 91.45 | 91.49 ± 0.09 | 58.32 |
| RTE | 67.15 | 67.32 ± 0.17 | 52.71 |
| WNLI | 49.30 | 55.59 ± 1.31 | 43.66 |
Q3
Thank you for this thoughtful question. We analyzed scaling behavior on high-performing voxels using up to 6 stories (~300 TRs each). While there is a mild upward trend in encoding performance (mean r: 0.151 → 0.162), the effect is small relative to the standard errors, likely due to the limited dataset size. We expect clearer scaling trends to emerge with larger brain datasets (Antonello et al., 2023).
| No. of Stories | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Mean (r) | 0.1515 | 0.1508 | 0.1500 | 0.1571 | 0.1624 | 0.1625 |
| SE | 0.0052 | 0.0054 | 0.0053 | 0.0102 | 0.0149 | 0.0150 |
[Antonello et al., 2023] Scaling laws for language encoding models in fMRI
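For transparency, the mean r and SE per story count can be computed along the lines of the sketch below; we assume here that the SE is taken across the set of high-performing voxels, and the array name is a placeholder.

```python
import numpy as np

def summarize_encoding(r_per_voxel: np.ndarray) -> tuple[float, float]:
    """Mean correlation and standard error over the selected voxels."""
    mean_r = float(r_per_voxel.mean())
    se = float(r_per_voxel.std(ddof=1) / np.sqrt(r_per_voxel.size))
    return mean_r, se
```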
Q4
Thank you for this interesting question. We observe that language-selective and semantic-selective voxels provide stronger gains in downstream task performance compared to using whole-brain voxels, suggesting that brain-informed fine-tuning primarily enhances semantic representations. Please see Appendix Table 5 for details. Our current work focused on improving encoding and task performance of bilingual brain-informed fine-tuning. However, we agree that interpreting which specific aspects of language (semantics, syntax, etc.) are altered is an important next step. We plan to explore attribution-based analyses and causal graph methods (Lindsey et al. 2025) in future work to better understand changes introduced with fine-tuning.
[Lindsey et al. 2025] On the Biology of a Large Language Model
With the extra page allowance, we will move key results from the appendix into the main text to improve clarity as suggested.
Thank you for your response and for conducting the additional experiments. I will keep my positive score as is.
We appreciate the reviewer's positive feedback and are confident that it has enhanced the paper's quality.
This article investigates the consequences of fine-tuning a monolingual or multilingual language model with fMRI brain data using either monolingual or bilingual participants. The models are evaluated on their ability to predict brain activity on a held-out dataset and on natural language processing benchmarks. According to the authors, the results show that fine-tuning on brain data from bilingual subjects improves the performance of the models on all these tasks, implying that the models benefited from the neural representations of the bilingual participants.
Strengths and Weaknesses
The paper is rather well structured, well referenced and well written, and tackles an interesting question using a large variety of tasks and control conditions. It makes use of an fMRI dataset that seems to be novel and could be of interest in itself.
The key question, "whether fine-tuning language models with bilingual brain data can elicit multilingual capabilities in language models", is only supported by quite fragile results that are in my view over-interpreted. Fine-tuning using bilingual brain data is said to be the cause of the improved multilingual capabilities of the language models, and this point is key to differentiate this paper from existing literature. In spite of the many experiments, the results are not so convincing and are reported in a way that favors the authors' hypotheses; upon closer scrutiny, they do not strongly support the claim (see below for more details).
Despite the clear qualities listed above, this work is not quite polished, and some content that the paper announces is missing. In particular, I could not find the results with the Llama-3.2 model, which would have been interesting as it is the only decoder-based model and also the only recent model that the authors investigate (this model is announced on p.4, but appears neither in the appendix nor in the supplementary information). Moreover, the results with the alternative loss function coined "hybrid" in the paper are not reported in the Supplementary as announced (see section 3.3 p.4).
Although the paper produces a lot of results, these are actually not much discussed. It might have been interesting to discuss whether the type of task in which the models improve or not makes sense. More importantly, what about the other models that were tested? As said before, the Llama-3.2 results are not presented, but the Supplementary presents results for XLM-R and XGLM. The results for these models are reported in a way that makes it impossible to check whether the approach actually works: the bold face seems to indicate which fine-tuned model wins, but the relevant comparison is base model vs. fine-tuned model, so the base-model results should be reported, and they are missing. Only then can we appreciate whether fine-tuning these models on brain data improves their NLP abilities.
Given the main point of the paper, it would be nice to include a control task where we would expect to see no difference after fine-tuning, such as a non-language task like arithmetic.
The submission would strongly benefit from sharing the fMRI data that was collected, as the scientific community working on multilingualism in the brain and LLMs might benefit from it.
Here are some more detailed comments and suggestions.
p.4 Why use the last hidden layer for the representation? It is well known that peak predictivity between hidden states of language models and brain activity comes from middle layers (see e.g. Jain & Huth, 2018; Toneva & Wehbe, 2019; Caucheteux & King 2022, among others). In fact this is what is done in section 3.4, for the voxel-wise encoding (by the way, the precise layer is given for BERT, layer 7, but not for all the other models that were studied). It seems that the goal is the same: predict brain activity using the representations learned by the LLM. This is also important given the paper's focus on multilingualism and existing works that show the interplay between layer depth and multilingual computations (e.g. Wendler et al, 2024; Zhao et al, 2024).
p.4 In order to facilitate the comparison between the different losses that were studied, it would be helpful to put all of them in the same table (see table in Supplementary; by the way, the so-called "hybrid" loss is missing here, contrary to what is stated on p.4). Moreover, the results for mBERT-ft-en with Ridge seem clearly better; it is not obvious that NT-Xent is the best choice. Having all of them side by side might help the comparison.
p.6 Fig. 2 There are many improvements located in the visual areas (same for other models, see Fig. 1 in supplementary info). It looks like fine-tuning finds its way, but in a way that is not related to language, which might run counter to the main argument. This deserves more discussion.
p.6 The control baselines are appreciated but hard to read. Some comparative statistics might help, as would a visualization on a brain map of the difference between brain-informed and control models. The same goes for the cross-participant results: the maps are not easy to assess objectively (see e.g. participant 5), and descriptive statistics might help (e.g. mean r per subject).
p.7 There seems to be an issue with the results relative to the WNLI task. All the results on fine-tuned models (but not the base vanilla model) have a performance of 56.34, which improves over the original model, but is the exact same value whatever the conditions, for BERT-ft-en, mBERT-ft-en, BERT-ft-zh, mBERT-ft-zh (p. 8), mBERT-ft-en, bilingual or monolingual (p.9) and even the models with alternative losses (Supplementary p.1) and the other models provided in Supplementary p.3.
p.7 Overall, the results are not so clear and the effects quite small. These results would be more meaningful if they could be replicated with other participants and models. The part with the models is supposed to be done, but the results are not really reported as far as I understand.
p.7 Concerning the results on fine-tuning on bilingual brain data vs monolingual ones. This is an important section given the point of the paper. The reporting of the results seems misleading. Ignoring the WNLI results (see above), if one looks at Table 2, the bilingual vs monolingual battle is 4/8 vs 4/8 for English (rather than 5/7 as reported) and 4/7 for Chinese (rather than 5/7; one is actually a tie). So the results are fragile, which is concerning given the importance of this in the general point the authors are making (see Discussion notably). It seems important to study these effects on other participants and other models.
More minor points.
p. 4 What is the dropout value that you use? How was it chosen, and how does it affect the results?
Fig. 2 p.6 There is an error when calling appendix D: it is actually E.2
Table 5 p.26 Ordering the tasks as in the table of the main paper would ease the comparisons.
Questions
See Strengths And Weaknesses section above.
Limitations
yes
Final Justification
I believe that the proposed changes, with the new results and discussion, will improve the manuscript. The effects that are observed, although interesting, are quite subtle, which limits the overall impact of the work, but there is still value in this original work. As a consequence, I raise my score to 4.
Formatting Issues
Nothing to report.
Inclusion of control task
Thank you for the helpful suggestion. We ran a control experiment on the GSM8K dataset (arithmetic reasoning task), and as expected found no performance gains; both vanilla and fine-tuned models achieved 0 accuracy and F1.
p.4 Clarification on usage of last hidden layer for the representation
Thank you for this important question. We think the confusion arises from the assumption that both analyses, brain-informed fine-tuning and voxel-wise encoding, have the same goal, i.e., predicting brain activity from pretrained LLM representations.
This is not the case. There are three scenarios in our analysis. First, in brain-informed fine-tuning (Section 3.3), our aim is to induce a brain-informed bias into the language model's representations by performing full end-to-end fine-tuning of all layers using brain data, without any layer selection. Second, we use the representations from the last hidden layer, following standard practice for downstream NLP tasks. Third, once the model is fine-tuned, our goal is to evaluate how these brain-tuned language model representations perform, particularly in how well they predict brain activity. In this third case, we follow established procedures from the literature and focus on intermediate layers (e.g., layer 7 for BERT). We will revise the section to make this explicit and ensure clarity.
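As a concrete illustration of the third case, intermediate-layer features can be read out from a (fine-tuned) BERT-style model as sketched below; the checkpoint name, sentence, and pooling are placeholders.

```python
# Sketch of extracting an intermediate layer for voxel-wise encoding; names are placeholders.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

inputs = tok("The narrator crossed the river at dawn.", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).hidden_states   # tuple: embeddings + one entry per layer
features = hidden_states[7].mean(dim=1)             # layer 7, mean-pooled over tokens
```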
p.4 Comparison of different losses
Thank you for this important point, and apologies for the oversight regarding the missing hybrid loss. Please see the updated Supplementary Table 1 below for a full side-by-side comparison of all five loss functions, including the newly added hybrid loss, for mBERT-ft-en across GLUE tasks.
Updated Supplementary Table 1 a) Fine-tuning and Evaluation in the Same Language
| GLUE Task | mBERT | Contrastive | MSE | Ridge | Spatial | Hybrid |
|---|---|---|---|---|---|---|
| CoLA | 42.68 | 40.05 | 43.28 | 44.40 | 41.14 | 40.36 |
| SST-2 | 89.68 | 90.14 | 88.53 | 91.17 | 90.94 | 90.37 |
| MRPC (Acc.) | 84.80 | 84.80 | 84.10 | 86.27 | 84.56 | 84.07 |
| MRPC (F1) | 88.56 | 89.01 | 88.59 | 89.96 | 88.69 | 88.33 |
| STS-B (Pearson) | 88.06 | 88.42 | 86.14 | 88.36 | 85.84 | 87.88 |
| STS-B (Spearman) | 87.76 | 88.22 | 86.10 | 88.25 | 85.96 | 87.82 |
| QQP (Acc.) | 90.22 | 90.47 | 90.36 | 90.57 | 90.40 | 90.55 |
| QQP (F1) | 86.70 | 87.07 | 87.01 | 87.24 | 86.97 | 87.23 |
| MNLI-m | 82.09 | 82.66 | 81.27 | 82.33 | 82.24 | 82.31 |
| MNLI-mm | 82.38 | 82.97 | 82.23 | 82.98 | 82.69 | 82.83 |
| QNLI | 91.14 | 91.18 | 91.38 | 91.34 | 91.38 | 91.41 |
| RTE | 67.15 | 65.70 | 67.51 | 66.06 | 68.59 | 65.62 |
| WNLI | 53.52 | 56.34 | 56.34 | 56.34 | 56.34 | 57.75 |
We agree that Ridge loss achieves the strongest downstream performance on several tasks. We would like to note, however, that the contrastive loss yielded the best brain encoding performance (mean r = 0.163, SE = 0.0097), slightly outperforming Ridge (mean r = 0.1622) and MSE (mean r = 0.1612). Given that contrastive objectives are commonly used in prior brain decoding studies and offer strong alignment with brain representations (Défossez et al., 2023; Levy et al., 2025), we retained the contrastive loss as our main objective to ensure comparability with existing literature.
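For completeness, a minimal sketch of the NT-Xent objective over predicted and recorded TR responses is given below; the temperature and the in-batch negative scheme are illustrative assumptions rather than our exact training setup.

```python
import torch
import torch.nn.functional as F

def nt_xent(pred: torch.Tensor, brain: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Contrastive (NT-Xent) loss: each predicted TR response should be most
    similar to its own recorded response, with other TRs serving as negatives."""
    pred = F.normalize(pred, dim=-1)               # [n_trs, n_voxels]
    brain = F.normalize(brain, dim=-1)
    logits = pred @ brain.T / tau                  # pairwise cosine similarities
    targets = torch.arange(pred.size(0), device=pred.device)
    return F.cross_entropy(logits, targets)
```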
We will revise the manuscript to clarify this rationale and include both within-language and cross-language results for all five loss functions (including the hybrid loss) in the Appendix to provide a comprehensive side-by-side comparison.
[Défossez et al., 2023] Decoding speech perception from non-invasive brain recordings
[Levy et al., 2025] Brain-to-Text Decoding: A Non-invasive Approach via Typing
p.6 Clarification on improvements in visual areas
Thank you for this thoughtful observation. We agree that improvements in visual areas warrant further discussion. Please note this is a reading experiment, where variation in visual brain areas is correlated with stimulus presentation, especially in naturalistic settings. In addition, semantic features from language models are known to spuriously predict brain activity in visual areas even after regressing out low-level visual information (Oota et al., 2024; Deniz et al., 2019). Importantly, our analyses using semantically selective voxels (which explicitly exclude visual regions) show consistent performance gains from brain-informed fine-tuning. This indicates that the improvements are not driven by visual areas but reflect genuine alignment with higher-level semantic processing. We will add a discussion of this point in the revised manuscript.
[Oota et al., 2024] Speech language models lack important brain relevant semantics
[Deniz et al., 2019] The Representation of Semantic Information Across Human Cerebral Cortex During Listening Versus Reading Is Invariant to Stimulus Modality
p.6 Comparative statistics for control baselines
Thank you for the helpful suggestion. We will include visualizations of differences in encoding performance in the brain to better highlight improvements from brain-informed fine-tuning over controls. Additionally, we will provide descriptive statistics, including mean Δr per subject, across baselines, models, and cross-participant analyses in the camera-ready version. For now, here are summary statistics for BERT-en, averaged across participants (well-predicted voxels; r > 0.1), to aid interpretation:
- Δr̄ (vanilla – fine-tuned): -0.037 ± 0.00037
- Δr̄(vanilla – participant transfer): -0.00055 ± 0.00029
- Δr̄(vanilla – control (TR shuffle)): 0.133
- Δr̄(vanilla – control (mBERT)): 0.136
As can be seen from the summary statistics reported above, our results remain consistent when Δr is taken into account.
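A sketch of how such per-model Δr̄ summaries can be computed is given below (the r > 0.1 threshold is applied to the vanilla model's voxel-wise correlations, as stated above; array names are placeholders).

```python
import numpy as np

def delta_r_summary(r_vanilla: np.ndarray, r_other: np.ndarray, thresh: float = 0.1):
    """Mean difference in voxel-wise encoding r over well-predicted voxels."""
    keep = r_vanilla > thresh                        # well-predicted voxels only
    diff = r_vanilla[keep] - r_other[keep]           # vanilla minus comparison model
    return float(diff.mean()), float(diff.std(ddof=1) / np.sqrt(keep.sum()))
```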
p.7 Clarification on WNLI task
Thank you for pointing this out. The identical score (56.34) across some fine-tuned models on WNLI is due to the small dataset size and integer-valued confusion matrices (please refer to the GLUE benchmark FAQ). Note that our fine-tuned models do yield slightly different predictions, but the resulting accuracy rounds to the same value. For example, here are the confusion matrices for three different fine-tuned models (for one participant):
- BERT-ft-en: [[33, 18], [13, 7]]
- mBERT-ft-en: [[32, 17], [14, 8]]
- mBERT-ft-zh: [[34, 19], [12, 6]]

We've updated Table 1 to report mean ± standard deviation across participants to better highlight subtle differences and clarify variability.
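As a quick check of the rounding effect described above, the accuracies implied by these confusion matrices can be recomputed directly (reading the diagonal as correct predictions); each model classifies 40 of the 71 WNLI examples correctly, hence the identical 56.34.

```python
confusion = {
    "BERT-ft-en":  [[33, 18], [13, 7]],
    "mBERT-ft-en": [[32, 17], [14, 8]],
    "mBERT-ft-zh": [[34, 19], [12, 6]],
}
for name, m in confusion.items():
    correct = m[0][0] + m[1][1]                     # diagonal entries = correct predictions
    total = sum(sum(row) for row in m)
    print(f"{name}: {100 * correct / total:.2f}")   # 56.34 for all three (40/71)
```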
p.7 Comparison across participants and models
Thank you for the helpful feedback. We have now revised Table 1 to report the average performance across all six bilingual participants, along with standard deviations. All models were fine-tuned separately for each participant and language. The overall trends remain consistent with those initially reported, helping strengthen the robustness of the findings. Due to space constraints, please see the updated supplementary table 2 included in our response to reviewer xVwE. Regarding model comparisons, the current supplementary material includes fine-tuned results for mBERT, XLM-R, and XGLM. We apologize for the oversight in not including the performance of their corresponding vanilla models and LLaMA-3.2-1B. Please see the updated table below. This table shows that fine-tuning improves performance on most tasks: mBERT (7/9 tasks), XLM-R (9/9), and XGLM (6/9), with especially large gains for XLM-R (e.g., RTE: 54.15 → 62.71, MRPC: 69.61 → 77.94).
Please see below for results with LLaMA‑3.2‑1B fine-tuned using our brain-informed approach with LoRA adaptation applied to layer 9 (similar to Antonello et al., 2024). Models were evaluated on three downstream datasets (Wang et al. 2019, Sakaguchi et al., 2019). Brain-informed fine-tuning yields improvements over the base model across two datasets, further supporting the generalizability of our approach to different architectures and low-parameter adaptation methods (LoRA). We note that improvements are small, likely due to the parameter-efficient adaptation (LoRA on a single layer). Unfortunately, we couldn’t perform full fine-tuning on this model due to computational constraints.
| Dataset | LLaMA‑3.2‑1B | LLaMA‑3.2‑1B‑ft-en | LLaMA‑3.2‑1B‑ft-zh |
|---|---|---|---|
| BoolQ | 0.5790 | 0.5848 | 0.5830 |
| COPA | 0.5534 | 0.5534 | 0.5534 |
| Winogrande | 0.4945 | 0.4961 | 0.4961 |
| Average | 0.5423 | 0.5447 | 0.5441 |
We will update the supplementary section to include these results for completeness.
[Wang et al. 2019] SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems, NeurIPS-2019
[Sakaguchi et al. 2019] WinoGrande: An Adversarial Winograd Schema Challenge at Scale, 2019
p.7 Clarification and extension of bilingual vs monolingual brain data
Thank you for pointing this out. To address the reviewer’s concern about robustness, we increased the number of monolingual participants by adding two additional participants from another dataset (LeBel et al., 2023). Due to space constraints, please see this extended comparison with 3 monolingual participants (updated table 2) included in our response to reviewer hVbZ. The results show that bilingual brain fine-tuning consistently outperforms the mBERT baseline across tasks and offers slightly stronger generalization than fine-tuning on individual monolingual participants. Bilingual brain fine-tuning outperforms monolingual fine-tuning on 7 out of 8 tasks on the CLUE benchmark, with equal performance on the remaining task (ChID). We will correct the misleading wording and be more explicit in the final version.
[LeBel et al., 2023] A natural language fMRI dataset for voxelwise encoding models
Minor point; p.4: We use a dropout of 0.2 for brain-informed fine-tuning to mitigate overfitting on limited brain data, and 0.1 for downstream tasks, following standard BERT fine-tuning defaults. We will include this information in the camera ready version.
Minor point; Fig. 2 p.6 and Table 5 p.26: Thanks for catching this. We’ll correct the appendix reference and reorder tasks accordingly.
Thank you to the authors for the rebuttal, clarifications, and proposed changes, notably regarding the missing content.
The limitations should be clearly stated, without overemphasizing the results, as the observed effects are quite small.
Although the results are not very strong, the newly computed reproductions on other monolingual participants make them more convincing and robust.
It remains unclear whether the full data will be made available.
Overall, I believe that the proposed changes, with new results and discussion, will improve the manuscript. As a consequence, I will raise my score.
> […] on the GSM8K dataset (arithmetic reasoning task), and as expected found no performance gains; both vanilla and fine-tuned models achieved 0 accuracy and F1.
I appreciate the effort, but the fact that the vanilla model achieves 0% accuracy makes this benchmark pretty ill-suited (or the model; maybe a better model would be preferable), as it does not provide the sensitivity necessary to investigate the change (or lack thereof) in performance due to fine-tuning (especially given the quite subtle effects that are observed).
We appreciate the reviewer’s thoughtful feedback, positive reassessment, and willingness to raise the score. We are confident that the proposed changes have strengthened the paper and improved its quality.
- We agree with the reviewer that GSM8K is a CoT-style reasoning dataset and is not well-suited for BERT-like models without step-by-step generation. Following the reviewer’s suggestion, we evaluated our models on the Math_QA dataset, which is more appropriate for classification-based models like BERT. On a 1000-example subset, both the vanilla and fine-tuned models achieved the same F1 score of 0.33, indicating no significant gain post fine-tuning on this task. We appreciate the reviewer’s input in guiding this additional analysis.
- We will also clearly state the limitations in the final version to ensure the results are appropriately framed.
We kindly request that you consider updating your evaluation score to reflect your revised assessment.
Thank you to the authors for their response and the new analysis. I will update my final rating to a 4.
This work is part of a broader research effort exploring the potential benefits of incorporating brain data into the training of large language models (LLMs). Specifically, the authors empirically examine the effects of fine-tuning both monolingual and multilingual LLMs (e.g., BERT and mBERT) using bilingual (English and Chinese) brain data acquired through functional Magnetic Resonance Imaging (fMRI). The evaluation focuses on two aspects: voxel-wise brain activity reconstruction and performance on downstream natural language processing (NLP) tasks.
Contributions
- Introduction of a novel data modality: This study is the first to leverage bilingual brain data for fine-tuning LLMs.
- Development of an end-to-end brain-informed fine-tuning pipeline: The authors propose a novel pipeline that improves upon previous, less integrated approaches.
- Comprehensive evaluation: The study presents an extensive assessment of both monolingual and multilingual models fine-tuned with bilingual brain data, evaluating their performance on voxel-wise brain activity reconstruction and downstream NLP tasks in both English and Chinese.
Technical summary
This is primarily an empirical study, and its methodology involves the following components:
- Data collection: The authors collected a new bilingual fMRI dataset recorded while participants read naturalistic narratives in both English and Mandarin Chinese.
- Model fine-tuning pipeline: The brain recordings were preprocessed and used to fine-tune several LLMs, including monolingual BERT and multilingual models such as mBERT, XLM-R, XGLM, and LLaMA-3.2. The fine-tuning procedure involves supervised learning, using token sequences from the narratives as inputs and the corresponding BOLD (Blood Oxygen Level Dependent) fMRI responses as the target. A projection layer maps the model representations to brain activity.
Experimental design/evaluation
The fine-tuned models are compared to their non-fine-tuned counterparts across two “dimensions”:
- Voxelwise encoding modeling: This analysis evaluates whether brain-informed fine-tuning enhances alignment between model representations and brain activity. The primary question is whether the fine-tuned model can better predict brain responses compared to the original (vanilla) model.
- Downstream NLP task performance: Models are evaluated on standard NLP benchmarks, GLUE (for English) and CLUE (for Chinese), under several setups:
a. Fine-tuning and evaluation in the same language;
b. Cross-lingual transfer between English and Chinese (both languages are known to the participants);
c. Zero-shot transfer to unseen languages, including French, German, Japanese, Spanish, and Korean (XGLUE/XTREME benchmarks).
Main findings
According to the authors’ interpretation, the main findings are as follows:
- Fine-tuning improves voxelwise encoding performance.
- Fine-tuning also improves performance on downstream NLP tasks in most cases, including within-language, cross-lingual, and zero-shot transfer scenarios.
- Fine-tuning on bilingual brain data (English and Chinese) results in better performance in both known and unseen languages compared to fine-tuning on monolingual brain data (English only).
Strengths and Weaknesses
Strengths
I found this work to have the following strengths:
Clarity: The manuscript is well-written and well-structured. The narrative is logical and easy to follow, with a clear separation between primary and secondary information, the latter being appropriately placed in the appendix. The experimental design is extensive and broad, covering various setups.
Originality: To the best of my knowledge, the idea of using brain data from bilingual individuals, while simple, is quite novel conceptually. Since bilingual brains may encode conceptual information in a less language-specific way, this approach could plausibly enhance skill transfer and generalization when used to fine-tune large language models (LLMs).
Significance: This work is significant in that it contributes to a better understanding of the parallels/interconnection between language regression models (such as LLMs) and neuroscientific findings from studies of bilingual individuals. It helps bridge the gap between artificial and biological representations of language. Furthermore, I believe this work opens promising directions for future research, such as exploring other brain modalities or studying individuals with additional cognitive abilities (e.g., multilingualism), that may further enhance conceptual understanding in LLMs.
Weaknesses
From my perspective, the primary weaknesses of this study arise from the interpretation and presentation of results:
- Small sample size: The most significant limitation is the small number of participants, which the authors acknowledge in the limitations section. While this is understandable, given the complexity of the experimental setup, it limits the generalizability and statistical power of the findings.
- No statistical significance reports: The results are reported using mean values only, without accompanying measures of variability such as standard deviation or standard error. These metrics are critical for evaluating the robustness of the findings. Although this omission is transparently noted in the NeurIPS checklist (Section 7), it nonetheless weakens the statistical interpretation of the results.
- Ambiguity in reported metrics: Tables 1 and 2 lack clarity regarding the exact metrics reported and their corresponding ranges. For example, it is unclear in some cases which metric is being reported (unless explicitly specified), and the absence of reference ranges makes interpretation more difficult.
For a complete and detailed account of both major and minor issues, please refer to the “Questions” section.
Questions
I would like to thank the authors for the interesting paper and the considerable effort invested in this work. However, there are several points that I believe require further attention/work. I have divided these into major issues, which should be prioritized, and minor ones, which should be addressed if time permits.
Major Comments/Questions
I will consider increasing the quality score and overall score (rating) if the following points are addressed:
- Sample size: While I fully understand the complexity associated with empirical research involving human participants, the sample size in this study appears to be very limited. I strongly encourage the authors to consider increasing the sample size by recruiting additional participants. If expanding the entire dataset is only partially feasible, I suggest targeting a subset of the experiments for additional data collection and explaining that explicitly. In the case that further recruitment is not possible at all, please provide a clear justification for the current sample size and its sufficiency for the conclusions drawn.
- Reporting variability: Please report standard deviations (or standard errors) alongside mean values to provide a clearer picture of variability. Even with a small sample size (e.g., 6–7 participants), such measures are crucial for assessing the robustness and reliability of the results. Although the standard deviation may be less stable with a small sample size, its inclusion contributes to transparency.
- Clarity in Table 1: Table 1 presents some ambiguities regarding how performance metrics are reported. For instance, in the STS-B task from the GLUE benchmark, results are typically evaluated using Pearson or Spearman correlation coefficients, which range from [-1, 1]. However, the values reported fall in the 85–90 range, which is unclear.
In addition to incorporating standard deviations (see point 2), I recommend the following:
- Clearly specify the evaluation metric used for each task (currently done only for tasks with more than one metric).
- Include reference ranges for chosen metrics. These changes will improve clarity and facilitate a more accurate interpretation of both absolute and relative performance.
- Monolingual participant experiments: The experiment involving a monolingual (English-speaking) participant currently includes only one individual. I strongly recommend increasing the number of monolingual participants to match the bilingual sample size for comparability. If additional recruitment is not feasible, I suggest re-positioning these results as preliminary, moving them to the appendix, and referencing them in the main manuscript. Additionally, revise the main text to ensure that claims are appropriately moderated. Should you decide to include more monolingual participants, please also apply recommendations from points (2) and (3) to this subset of the data.
- Clarification on brain region analysis (Section 4.2): In Section 4.2, the paragraph titled “Performance on downstream tasks with language- or semantically-selective voxels” mentions the use of selective voxels. Does this imply that the results reported in Tables 1 and 2 are based on whole-brain data? If so, and unless I have missed something, this distinction should be made explicit in the manuscript. Clarifying whether the reported results are based on whole-brain activity or specific voxel subsets is important for interpreting the findings.
Minor Comments/Questions
While addressing the following points may not be critical to the paper’s core contributions, doing so would enhance the overall quality. I would consider increasing the quality score and overall score (rating), but I strongly encourage the authors to prioritize the major comments above in case of time limitations:
- Line 137: I would appreciate the inclusion of additional visual elements to clarify the model architecture, specifically the roles of the 3-lobe Lanczos, FIR filtering, and BOLD processes. Although these components are not introduced in this work, they are integral to the proposed method. Visual illustrations would be especially helpful for readers unfamiliar with these elements. Ideally, such visuals should be included in the main manuscript; at the very least, they should appear in the appendix.
- Line 148: Consider relocating the entire Training Protocol section to Appendix G to conserve space. This section does not provide method-specific insights central to the main narrative and could be referenced as needed.
- Line 153: First, please include an explicit reference to the appendix section where the loss functions are compared. Second, a brief explanation as to why NT-Xent appears to perform better would be helpful. If the reason is speculative or not yet understood, stating so would improve transparency and avoid ambiguity.
- Figure 2: The names of the brain regions are not clearly legible. Please improve the visibility of these labels. Additionally, consider referencing Figures 3 and 4 earlier in the text to clarify the distinction between language-selective and semantically-selective brain regions. Without a clear visual cue, it is difficult to interpret statements from Figure 2 captions such as: “Voxel colors reflect the best-performing model: black for vanilla, blue for fine-tuned with whole-brain, magenta for fine-tuned with language, and yellow for semantically-selective brain regions.”
General Advice
The manuscript presents a range of experimental design choices that are intellectually compelling and, in my view, constitute one of the strengths of the work (as I indicated above). However, these choices also lead to a combinatorial explosion of experimental configurations, which can be time-consuming to conduct properly. I recommend that the authors prioritize the most principled setups and place greater emphasis on improving statistical rigor and clarity in the reporting of results, in case of time limitations.
Limitations
Yes
Final Justification
The authors have addressed my concerns point by point, and I am satisfied with their responses. I have accordingly raised the quality score and overall ranking.
Formatting Issues
I did not identify any issues with formatting.
Major comments
Q1 sample size
Thank you for raising this important concern. Our study shows that brain-informed fine-tuning can reliably enhance multilingual understanding. We employ a small-N design, where a language model is fine-tuned and evaluated independently for each participant, enabling robust within-subject inference (Smith & Little, 2018). Importantly, each participant in our dataset contributes two hours of naturalistic brain recordings, which are critical for capturing the complexity of language representations in the brain (Hamilton & Huth, 2018) and for enabling meaningful learning through fine-tuning. While we recognize that larger sample sizes would strengthen the generalizability of our findings, further data collection is not feasible at this stage due to resource constraints (four 2.5-hour data-collection sessions, plus data analysis, preprocessing, and subject recruitment time). That said, we fully agree with the reviewer's suggestion and currently plan to expand the bilingual dataset and extend our main findings to new participants in future work.
We have increased the number of monolingual participants, please see response to Q4 below.
[Smith & Little, 2018] Small is beautiful: In defense of the small-N design
[Hamilton & Huth, 2018] The revolution will not be controlled: natural stimuli in speech neuroscience
Q2 Reporting variability
Models were fine-tuned separately for each participant and language (note Methods 3.3). We have now revised Table 1 to report the average performance across all six bilingual participants, along with standard deviations. As shown in the updated results (see below), the results remain consistent with those originally reported in the paper. We thank the reviewer for this helpful feedback as this change should help better demonstrate the robustness of our findings.
Updated Table 1, a) Fine-tuning and Evaluation in the Same Language
| GLUE Task | BERT-en | BERT-ft-en (Mean ± Std) | mBERT | mBERT-ft-en (Mean ± Std) |
|---|---|---|---|---|
| CoLA (MCC) | 53.38 | 55.11 ± 0.75 | 42.68 | 39.58 ± 1.82 |
| SST-2 (Acc.) | 92.08 | 92.58 ± 0.39 | 89.68 | 90.64 ± 0.54 |
| MRPC (Acc.) | 79.41 | 80.55 ± 1.01 | 84.80 | 85.39 ± 0.51 |
| MRPC (F1) | 86.27 | 86.91 ± 0.75 | 88.56 | 89.20 ± 0.27 |
| STS-B (Pears.) | 88.06 | 88.10 ± 0.07 | 88.06 | 88.47 ± 0.09 |
| STS-B (Spear.) | 87.65 | 87.55 ± 0.18 | 87.76 | 88.28 ± 0.12 |
| QQP (Acc.) | 90.84 | 90.79 ± 0.03 | 90.22 | 90.43 ± 0.04 |
| QQP (F1) | 87.70 | 87.71 ± 0.19 | 86.70 | 87.08 ± 0.02 |
| MNLI-m (Acc.) | 84.38 | 84.40 ± 0.09 | 82.09 | 82.26 ± 0.27 |
| MNLI-mm (Acc.) | 84.64 | 84.55 ± 0.20 | 82.38 | 82.73 ± 0.14 |
| QNLI (Acc.) | 91.45 | 91.49 ± 0.09 | 91.14 | 91.24 ± 0.06 |
| RTE (Acc.) | 67.15 | 67.32 ± 0.17 | 67.15 | 65.63 ± 0.51 |
| WNLI (Acc.) | 49.30 | 55.59 ± 1.31 | 53.52 | 56.34 ± 0.00 |
| CLUE Task | BERT-zh | BERT-ft-zh (Mean ± Std) | mBERT | mBERT-ft-zh (Mean ± Std) |
|---|---|---|---|---|
| AFQMC (Acc.) | 75.25 | 74.50 ± 0.48 | 69.74 | 70.70 ± 0.99 |
| CMNLI (Acc.) | 80.50 | 80.87 ± 0.04 | 78.66 | 78.99 ± 0.14 |
| CSL (F1.) | 80.18 | 80.82 ± 0.42 | 81.10 | 81.60 ± 0.56 |
| IFLYTEK (Acc.) | 60.25 | 60.48 ± 0.25 | 56.52 | 56.92 ± 0.17 |
| TNEWS (Acc.) | 56.24 | 56.48 ± 0.08 | 54.77 | 55.08 ± 0.16 |
| TNEWS (F1.) | 55.17 | 55.39 ± 0.08 | 53.69 | 53.81 ± 0.20 |
| ChID (Acc.) | 10.66 | 10.66 ± 0.00 | 10.66 | 10.66 ± 0.00 |
| C³ (F1.) | 49.74 | 49.83 ± 0.39 | 49.42 | 50.25 ± 0.27 |
Updated Table 1, b) Cross-Language Transfer Between Known Languages
| GLUE Task | BERT-zh | BERT-ft-zh (Mean ± Std) | mBERT | mBERT-ft-zh(Mean ± Std) |
|---|---|---|---|---|
| CoLA (MCC) | 51.13 | 53.82 ± 1.18 | 40.96 | 40.99 ± 1.08 |
| SST-2 (Acc.) | 92.26 | 92.77 ± 0.15 | 88.27 | 90.30 ± 0.66 |
| MRPC (Acc.) | 76.96 | 77.53 ± 0.28 | 82.84 | 85.20 ± 0.93 |
| MRPC (F1) | 83.38 | 84.39 ± 0.17 | 87.15 | 89.16 ± 0.78 |
| STS-B (Pears.) | 87.21 | 87.96 ± 0.12 | 87.09 | 88.30 ± 0.17 |
| STS-B (Spear.) | 86.73 | 87.67 ± 0.18 | 86.97 | 88.16 ± 0.23 |
| QQP (Acc.) | 90.65 | 90.86 ± 0.12 | 89.91 | 90.39 ± 0.13 |
| QQP (F1) | 87.53 | 87.76 ± 0.04 | 86.27 | 87.05 ± 0.09 |
| MNLI-m (Acc.) | 84.00 | 84.16 ± 0.05 | 81.52 | 81.97 ± 0.14 |
| MNLI-mm (Acc.) | 83.91 | 84.16 ± 0.10 | 81.94 | 82.78 ± 0.28 |
| QNLI (Acc.) | 91.27 | 91.42 ± 0.05 | 90.55 | 91.23 ± 0.26 |
| RTE (Acc.) | 67.15 | 67.91 ± 0.08 | 66.06 | 67.21 ± 0.67 |
| WNLI (Acc.) | 53.52 | 56.34 ± 0.00 | 54.93 | 55.25 ± 0.84 |
| CLUE Task | BERT-en | BERT-ft-en (Mean ± Std) | mBERT | mBERT-ft-en (Mean ± Std) |
|---|---|---|---|---|
| AFQMC (Acc.) | 69.00 | 69.02 ± 0.03 | 69.74 | 71.25 ± 0.97 |
| CMNLI (Acc.) | 68.34 | 68.78 ± 0.13 | 78.66 | 78.87 ± 0.07 |
| CSL (F1) | 71.20 | 72.33 ± 0.66 | 81.10 | 81.51 ± 0.20 |
| IFLYTEK (Acc.) | 47.86 | 47.27 ± 0.56 | 56.52 | 56.92 ± 0.31 |
| TNEWS (Acc.) | 50.92 | 51.42 ± 0.16 | 54.77 | 54.99 ± 0.20 |
| TNEWS (F1) | 50.15 | 50.68 ± 0.07 | 53.69 | 53.76 ± 0.04 |
| ChID (Acc.) | 10.66 | 10.66 ± 0.00 | 10.66 | 10.66 ± 0.00 |
| C³ (F1) | 42.64 | 42.99 ± 1.49 | 49.42 | 49.72 ± 0.35 |
Updated Table 1, c) Zero-Shot Transfer to Unseen Languages (each cell reports mBERT / mBERT-ft-en, Mean ± Std)
| XGLUE Task | German (de) | French (fr) | Spanish (es) | Japanese (ja) | Korean (ko) |
|---|---|---|---|---|---|
| PAWS-X (Acc.) | 84.00 / 83.62 ± 1.10 | 88.10 / 87.17 ± 0.81 | 87.05 / 86.50 ± 0.42 | 78.70 / 76.55 ± 2.12 | 78.75 / 76.90 ± 0.35 |
| XNLI (Acc.) | 70.60 / 70.94 ± 0.14 | 72.69 / 73.30 ± 0.80 | 74.10 / 74.62 ± 0.05 | 86.00 / 87.97 ± 0.05 | 86.00 / 87.82 ± 0.25 |
| NER (F1) | 10.67 / 12.85 ± 2.02 | 06.67 / 11.09 ± 1.09 | 07.33 / 12.11 ± 0.67 | 13.00 / 11.34 ± 2.11 | 12.00 / 12.25 ± 1.71 |
Q3 Clarity in Table 1
We understand how this may have caused confusion. To maintain consistency across tasks, we followed the common practice of the GLUE leaderboard and reported all metrics on a 0-100 scale. For example, a Pearson correlation of 0.88 is reported as 88.0. To improve clarity, we will list the metric next to each task name (as in the tables above) and add a footnote to all relevant tables explicitly indicating the evaluation metric's range and the scaling.
For reference, please see the table below:
| GLUE Task | Metric | Reference Range | Scaled |
|---|---|---|---|
| CoLA | Matthews Correlation | [-1, 1] | ×100 |
| SST-2 | Accuracy | [0, 1] | ×100 |
| MRPC | F1 Score / Accuracy | [0, 1] | ×100 |
| STS-B | Pearson / Spearman Correlation | [-1, 1] | ×100 |
| QQP | F1 Score / Accuracy | [0, 1] | ×100 |
| MNLI | Accuracy | [0, 1] | ×100 |
| QNLI | Accuracy | [0, 1] | ×100 |
| RTE | Accuracy | [0, 1] | ×100 |
| WNLI | Accuracy | [0, 1] | ×100 |
| CLUE Task | Metric | Reference Range | Scaled |
|---|---|---|---|
| AFQMC | Accuracy | [0, 1] | ×100 |
| CMNLI | Accuracy | [0, 1] | ×100 |
| CSL | F1 Score | [0, 1] | ×100 |
| IFLYTEK | Accuracy | [0, 1] | ×100 |
| TNEWS | Accuracy / F1 Score | [0, 1] | ×100 |
| ChID | Accuracy | [0, 1] | ×100 |
| C³ | Exact Match / F1 Score | [0, 1] | ×100 |
| XGLUE Task | Metric | Reference Range | Scaled |
|---|---|---|---|
| PAWS-X | Accuracy | [0, 1] | ×100 |
| XNLI | Accuracy | [0, 1] | ×100 |
| NER | F1 Score | [0, 1] | ×100 |
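As a concrete illustration of this convention (not the exact reporting code), each raw metric is simply multiplied by 100 before it is entered into the tables:

```python
# Minimal illustration of the reporting convention used in all tables:
# raw metrics in their native range are multiplied by 100 before reporting.
def to_reported_scale(raw_value: float) -> float:
    """Scale a raw metric (accuracy/F1 in [0, 1] or a correlation in [-1, 1]) by 100."""
    return round(raw_value * 100.0, 2)

print(to_reported_scale(0.8806))  # a Pearson correlation of 0.8806 is reported as 88.06
print(to_reported_scale(0.5338))  # a Matthews correlation of 0.5338 is reported as 53.38
```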
Q4 Monolingual participant experiments
In response to your feedback, we identified two additional monolingual participants from another dataset (LeBel et al., 2023) and have now included them in the analysis. With this expanded comparison group (3 monolingual participants), the results show that bilingual brain fine-tuning consistently outperforms the mBERT baseline across tasks and offers slightly stronger generalization than fine-tuning on individual monolingual participants. Bilingual brain fine-tuning outperforms monolingual fine-tuning on 7 out of 8 tasks on the CLUE benchmark, with equal performance on the remaining task (ChID). We thank the reviewer for this suggestion; this extended analysis better supports the contribution of our paper and makes our results more convincing.
Updated Table 2. b) Cross-Language Transfer Between Known Languages
| CLUE Task | mBERT | Bilingual mBERT-ft-en (Mean ± Std) | Monolingual mBERT-ft-en (Mean ± Std) |
|---|---|---|---|
| AFQMC (Acc.) | 69.74 | 71.25 ± 0.97 | 70.33 ± 0.48 |
| CMNLI (Acc.) | 78.66 | 78.87 ± 0.07 | 78.60 ± 0.12 |
| CSL (F1) | 81.10 | 81.51 ± 0.20 | 81.09 ± 0.11 |
| IFLYTEK (Acc.) | 56.52 | 56.92 ± 0.31 | 56.87 ± 0.37 |
| TNEWS (Acc.) | 54.77 | 54.99 ± 0.20 | 54.61 ± 0.30 |
| TNEWS (F1) | 53.69 | 53.76 ± 0.04 | 53.42 ± 0.09 |
| ChID (Acc.) | 10.66 | 10.66 ± 0.00 | 10.66 ± 0.00 |
| C³ (Acc.) | 49.42 | 49.72 ± 0.35 | 49.28 ± 0.22 |
[LeBel et al., 2023] A natural language fMRI dataset for voxelwise encoding models
Q5 Clarification on brain region analysis
The reviewer is correct that Tables 1 and 2 are based on fine-tuning with whole-brain data. This is noted in the captions, but we will revise the text to make this distinction more explicit in both the main text and the captions.
Minor comments
Q1: We agree this would improve clarity and will include graphical illustrations of the requested components in the appendix for the camera-ready version.
Q2: We agree with the reviewer and will move the Training Protocol section to Appendix G in the camera-ready version.
Q3: Thank you for this suggestion. We would like to note that contrastive loss yielded better brain encoding performance (mean r = 0.163, SE = 0.0097), slightly outperforming Ridge (mean r = 0.1622) and MSE (mean r = 0.1612). Given that contrastive objectives are commonly used in prior brain decoding studies and offer strong alignment with brain representations (Défossez et al., 2023; Levy et al., 2025), we retained it as our main objective to ensure comparability with existing literature. We will revise the reference to Supplementary Table 1 and include this reasoning in the revised version for clarity.
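For concreteness, the sketch below shows the kind of contrastive (InfoNCE-style) alignment objective we refer to, written in PyTorch; the temperature, symmetric formulation, and tensor names are illustrative assumptions rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(text_feats, brain_feats, temperature=0.07):
    """InfoNCE-style loss that pulls each TR's text embedding toward the
    corresponding TR's brain response and pushes it away from other TRs.

    text_feats:  (n_trs, d) model-derived features, one row per TR
    brain_feats: (n_trs, d) brain responses projected to the same dimension
    """
    text_feats = F.normalize(text_feats, dim=-1)
    brain_feats = F.normalize(brain_feats, dim=-1)
    logits = text_feats @ brain_feats.T / temperature        # (n_trs, n_trs) similarities
    targets = torch.arange(text_feats.size(0), device=text_feats.device)
    # Symmetric cross-entropy: match text -> brain and brain -> text
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```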
Q4: Thanks for highlighting this; we indeed overlooked this. We’ll improve label visibility, enlarge ROI names, and reference Figures 3 and 4 earlier to ensure clarity.
Dear Authors,
Thank you for addressing the points I raised in my initial review, including recruiting additional participants for the study. Regarding Q3, you may specify these ranges in the table captions or elsewhere in the text to conserve space. I am raising my quality score from 2 to 3 and my overall ranking from 4 to 5.
We appreciate the reviewer's positive feedback and are confident that incorporating it has enhanced the paper's quality.
The paper uses fMRI data obtained during story reading to fine-tune an LLM. The paper suggests that this type of fine-tuning improves voxelwise encoding modeling compared to the baselines. Additionally, the paper claims that fine-tuning the LLM on bilingual brain data improves downstream task performance on standard NLP benchmarks.
Strengths and Weaknesses
Clarity: The paper is very difficult to understand: after two reads of the paper and going through all of the appendices, many aspects of the data, task, and other design decisions are still unclear to me. There are a lot of appendices, including one on extended related work, all of which are necessary to follow the paper. This makes me wonder whether a journal submission, where there is more space for discussion, would be a better format for the current work.
Quality: the main claims of the paper are only weakly supported. The baselines for the voxelwise modeling are fairly weak and the differences between bilingual and monolingual models are very small and hard to interpret without understanding the variability between models (see Questions for more details).
Significance and originality are hard for me to assess since I’m not an expert in the field. The paper tackles a very interesting question — whether fine-tuning LLMs with fMRI data obtained on language tasks improves model’s performance on standard NLP benchmarks, asking specifically whether models benefit from bilingual data over monolingual data. Based on the Related Work section, we already know that brain encoding and NLP performance improves after fine-tuning on monolingual fMRI data. Thus, the main contribution of the current work is evidence of the difference in performance between monolingual and bilingual brain data, which is hard to assess given the provided evidence.
Questions
The paper uses some terminology that may not be known to the broad NeurIPS audience without defining it: What is brain encoding? What is a TR?
A few design decisions that are unclear to me:
- What is the motivation for including language and semantically-selective regions into the analysis?
- One of the contributions of the paper is providing a novel brain-informed fine-tuning pipeline, while prior work has performed the same steps as part of pre-processing. Why is this important? Is the pipeline more efficient, performant, etc? Some comparison to prior work or motivation would be nice to have if the pipeline itself is listed as a novel contribution.
Voxelwise encoding modeling: The baselines (temporally mis-aligned data and mBERT embeddings) seem a bit too weak of a comparison for fine-tuning with the actual fMRI data. I’d encourage the authors to consider a stronger baseline.
Performance on downstream tasks:
- The fMRI dataset contains 6 bilingual and 1 monolingual participants. How was bilingual fine-tuning performed — was one model fine-tuned on one participant's data? That is, are the post-fine-tuning numbers in Table 1 averages over 6 fine-tuned models? Or was it a randomly chosen participant? Or something else entirely?
- I find it difficult to interpret the findings in Table 1 and Table 2 without knowing how much variability there is across models trained on different participants since the differences are very small.
- Similarly, the comparison between the bilingual and monolingual performance (Table 2) shows very small differences. Since we're essentially looking at 2 participants in this table, my summary of the findings is that bilingual and monolingual brain fine-tuning perform broadly similarly. Given the current evidence, I'm not convinced that bilingual brain fine-tuning outperforms the monolingual one.
Model choices: The paper states that it fine-tunes 4 multilingual models: mBERT, XLM-R, XGLM, and LLaMA-3.2-1b. However, I can only find results for the BERT models, either in the main text or the appendices. Is the paper missing the Supplementary? (The authors refer to the Supplementary in multiple spots, but it doesn't exist; I assumed it was another name for the appendix, but I'm no longer sure.) Either way, I'd like to see how stable the results are across different architectures.
Limitations
yes
Final Justification
The authors have conducted additional analyses to address my concerns (including adding monolingual participants to their study) and clarified the points that were unclear in the original submission. My only remaining concern is how many of the clarifications and elaborations provided in the responses to reviewers will make it into an already tight paper to improve its clarity, given the space constraints.
Formatting Issues
n/a
Clarification on the Main Objective of the Paper
Thank you for your comment. Prior studies have fine-tuned speech-based language models with brain data and shown improvements in encoding performance (Moussa et al., 2025; Vattikonda et al., 2025). In contrast, our study focuses on text-based language models, which already exhibit strong semantic capabilities and have been shown to align well with brain activity. The key question we address is whether bilingual brains can serve as a useful supervisory signal to enhance the multilingual capabilities of language models. Specifically, we ask whether fine-tuning with a bilingual brain, which is known to encode a shared semantic space across languages (Chen et al., 2024), can benefit both monolingual and multilingual language models.
The comparison between fine-tuning with monolingual and bilingual brain data serves primarily as a control to assess whether the observed improvements in multilingual performance are attributable to brain data in general, or specifically to the bilingual brain.
Clarification on Methods
Thank you for the helpful feedback. We will include additional citations and elaborations in the revised version to explain the foundational concepts more clearly for a broader audience. To clarify a few terms: "brain encoding" refers to predicting voxel-wise fMRI timecourses from stimulus features, with performance measured as the prediction accuracy for each voxel; each voxel in our data corresponds to a 2.24 × 2.24 × 4.1 mm³ volume of the brain. "TR" stands for repetition time, the acquisition time for each fMRI volume; in our case, the TR was 2.0045 seconds.
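For readers less familiar with this terminology, the sketch below illustrates a standard voxelwise encoding evaluation (ridge regression from stimulus features to voxel timecourses, scored by per-voxel Pearson correlation); the regularization strength and variable names are illustrative assumptions, not our exact settings.

```python
import numpy as np
from sklearn.linear_model import Ridge

def voxelwise_encoding_score(train_feats, train_bold, test_feats, test_bold, alpha=1.0):
    """Fit one ridge regression from stimulus features to all voxels, then
    report the Pearson correlation between predicted and held-out BOLD
    timecourses for each voxel.

    train_feats / test_feats: (n_trs, n_features) stimulus features per TR
    train_bold  / test_bold : (n_trs, n_voxels) fMRI responses per TR
    """
    model = Ridge(alpha=alpha).fit(train_feats, train_bold)
    pred = model.predict(test_feats)                              # (n_trs, n_voxels)
    pred_z = (pred - pred.mean(0)) / (pred.std(0) + 1e-8)
    true_z = (test_bold - test_bold.mean(0)) / (test_bold.std(0) + 1e-8)
    return (pred_z * true_z).mean(0)                              # per-voxel Pearson r
```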
Clarification on design decisions
Response for Q1
For speech-based models, Vattikonda et al. (2025) showed that fine-tuning with different brain regions can yield different results, as regions may compete during fine-tuning and dominate the loss (see also Antonello et al., 2023). To prevent noisy voxels or voxels from non-language/non-semantic areas from dominating the loss, we fine-tuned two complementary models for each participant: one using only brain responses from language regions and another using semantically-selective regions. The semantically-selective regions were chosen in a data-driven manner for each participant based on encoding performance, while the language regions were defined using a group-level template (Fedorenko et al., 2010). These variants showed slightly better performance than the whole-brain model on some tasks, as reported in Appendix Section F.
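A minimal sketch of how such region-restricted fine-tuning can be set up, assuming a boolean voxel mask per ROI; the function and variable names are hypothetical, not our exact code.

```python
import numpy as np

def restrict_to_roi(bold, roi_mask):
    """Keep only the voxels inside a region of interest before they enter the
    fine-tuning loss, so that out-of-ROI voxels cannot dominate the objective.

    bold:     (n_trs, n_voxels) whole-brain responses per TR
    roi_mask: (n_voxels,) boolean array, True for language or semantically-selective voxels
    """
    return bold[:, roi_mask]

# Hypothetical usage: fine-tune one model per ROI definition
# lang_bold = restrict_to_roi(whole_brain_bold, language_roi_mask)
# sem_bold  = restrict_to_roi(whole_brain_bold, semantic_roi_mask)
```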
Response for Q2
Thank you for the insightful question. In fMRI studies, text stimuli are typically presented either in a controlled manner, for example as a fixed number of words per repetition time (TR) (e.g., four words per TR, each shown for 0.5 seconds, as in Schwartz et al., 2019), or as continuous input that is later downsampled during preprocessing. This limits flexibility, especially in settings where word timing is uneven. Performing stimulus-TR alignment as a preprocessing step (prior to fine-tuning) likely discards valuable temporal and contextual information.
In contrast, our pipeline introduces stimulus-TR alignment during fine-tuning, which has three key advantages (a brief sketch follows the list below):
- variable-length input (since word timing per repetition time (TR) can vary)
- more flexible and generalizable alignment, applicable across modalities and tasks, since raw text is passed as input
- dynamic stimulus-repetition time (TR) alignment as part of the learning process, not a static preprocessing step
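To make the dynamic alignment concrete, here is a minimal sketch of how a variable number of words can be pooled into per-TR features inside the training loop, assuming word-onset timestamps are available for each story; the mean-pooling choice and names are illustrative, not our exact implementation (TR = 2.0045 s as in our data).

```python
import torch

TR = 2.0045  # seconds per fMRI volume in our data

def pool_embeddings_per_tr(token_embs, word_onsets, n_trs):
    """Dynamically align a variable number of words to each TR inside the
    training loop: every token is assigned to the TR during which its word
    was presented, and token embeddings are mean-pooled within each TR.

    token_embs:  (n_tokens, d) contextual embeddings from the language model
    word_onsets: (n_tokens,) onset time in seconds of the word each token belongs to
    """
    tr_index = torch.clamp((word_onsets / TR).long(), min=0, max=n_trs - 1)
    pooled = torch.zeros(n_trs, token_embs.size(-1), dtype=token_embs.dtype)
    counts = torch.zeros(n_trs, 1, dtype=token_embs.dtype)
    pooled.index_add_(0, tr_index, token_embs)
    counts.index_add_(0, tr_index, torch.ones(len(tr_index), 1, dtype=token_embs.dtype))
    return pooled / counts.clamp(min=1)   # (n_trs, d), one pooled vector per TR
```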
In response to your suggestion, we implemented an alternative version of our pipeline that mirrors the preprocessing-based approach used in prior work: stimulus-TR alignment is performed prior to fine-tuning. Specifically, we constructed stimulus-TR pairs similar to Schwartz et al. (2019). We then fine-tuned the language model using this pre-aligned input and bilingual brain data from one participant. The mean encoding performance is 0.13 for the preprocessing-based method, compared to 0.15 for the vanilla model and 0.16 for our fine-tuned model. The table below presents downstream task performance (on the GLUE benchmark) for this preprocessing-based method (denoted as BERT-ft-en (Ablation)), allowing a direct comparison with our proposed end-to-end brain-informed fine-tuning pipeline. We observe a substantial drop in performance for the preprocessing-based method, suggesting that stimulus-TR alignment performed prior to fine-tuning discards valuable temporal and contextual information. In contrast, our approach integrates brain alignment during fine-tuning, enabling robust downstream performance.
| GLUE Task | BERT-en | BERT-ft-en (Mean ± Std) | BERT-ft-en (Ablation) |
|---|---|---|---|
| CoLA (MCC) | 53.38 | 55.11 ± 0.75 | 0.00 |
| SST-2 (Acc.) | 92.08 | 92.58 ± 0.39 | 50.92 |
| MRPC (Acc.) | 79.41 | 80.55 ± 1.01 | 68.38 |
| MRPC (F1) | 86.27 | 86.91 ± 0.75 | 81.22 |
| STS-B (Pears.) | 88.06 | 88.10 ± 0.07 | 0.008 |
| STS-B (Spear.) | 87.65 | 87.55 ± 0.18 | 0.028 |
| QQP (Acc.) | 90.84 | 90.79 ± 0.03 | 75.28 |
| QQP (F1) | 87.70 | 87.71 ± 0.19 | 62.81 |
| MNLI-m (Acc.) | 84.38 | 84.40 ± 0.09 | 41.74 |
| MNLI-mm (Acc.) | 84.64 | 84.55 ± 0.20 | 41.48 |
| QNLI (Acc.) | 91.45 | 91.49 ± 0.09 | 58.32 |
| RTE (Acc.) | 67.15 | 67.32 ± 0.17 | 52.71 |
| WNLI (Acc.) | 49.30 | 55.59 ± 1.31 | 43.66 |
Clarification on baseline for Voxelwise encoding modeling
We agree that strong baselines are important for assessing the value of fine-tuning with bilingual brain data. Our current baselines are designed to test two specific hypotheses. First, to assess whether the fMRI signal alone, without meaningful temporal alignment to language, is sufficient, we fine-tuned with temporally misaligned fMRI data. Second, to test whether encoding improvements could stem from a different multilingual representation that does not involve brain data, we used mBERT embeddings as a baseline.
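For reference, the misaligned control can be constructed by, for example, circularly shifting the responses along the TR axis, preserving the signal's own statistics while breaking the stimulus-response correspondence; the shift range below is an illustrative assumption, not our exact procedure.

```python
import numpy as np

def temporally_misalign(bold, rng=None, min_shift=20):
    """Create a temporally misaligned control by circularly shifting the
    fMRI responses along the TR axis.

    bold: (n_trs, n_voxels) responses for one story; assumes the story has
          comfortably more than 2 * min_shift TRs.
    """
    rng = rng or np.random.default_rng(0)
    shift = rng.integers(min_shift, bold.shape[0] - min_shift)
    return np.roll(bold, shift, axis=0)
```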
Clarification on Downstream tasks
Response for Q1
Thank you for this helpful question. Models were fine-tuned separately for each participant and language (see Methods 3.3); we will explicitly highlight this in the paper to ensure clarity. The numbers reported in Table 1 correspond to Participant 1. We have revised Table 1 to report the average performance across all six bilingual participants, along with standard deviations. Due to space constraints, please see the updated Table 1 included in our response to reviewer hVbZ. As shown in the updated results, the overall trends remain consistent with those originally reported in the paper. We appreciate the suggestion, as this change makes the findings more interpretable and robust.
Response for Q2 and Q3
We agree with the concern regarding the small differences and sample size reported in Tables 1 and 2.
Regarding Table 1, we have revised Table 1 to report the average performance across all six bilingual participants, along with standard deviations (see comment above).
Regarding Table 2, we agree that the small sample size limited the strength of the conclusions. In response to your feedback, we identified two additional monolingual participants from another dataset (LeBel et al., 2023) and have now included them in the analysis. Due to space constraints, please see the updated Table 2 included in our response to reviewer hVbZ. With this expanded comparison group (3 monolingual participants), the results are consistent across participants: bilingual brain fine-tuning consistently outperforms the mBERT baseline across tasks and offers slightly stronger generalization than fine-tuning on individual monolingual participants. Bilingual brain fine-tuning outperforms monolingual fine-tuning on 7 out of 8 tasks on the CLUE benchmark, with equal performance on the remaining task (ChID). We thank the reviewer for this suggestion; this extended analysis better supports the contribution of our paper and makes our results more convincing.
[LeBel et al., 2023] A natural language fMRI dataset for voxelwise encoding models
Clarification on model choice
Please note that the supplementary material is indeed different from the appendix; it is separately downloadable. The supplementary material (referred to in the main text) includes comparison results for various model architectures: across mBERT, XLM-R, and XGLM. We apologize for the oversight and have now also included results for LLaMA-3.2-1B.
Due to space constraints, please see the results and comparison for mBERT, XLM-R, and XGLM (updated Supplementary Table 2) in our response to reviewer xVwE. This table shows that fine-tuning improves performance on most tasks: mBERT (7/9 tasks), XLM-R (9/9), and XGLM (6/9), with especially large gains for XLM-R (e.g., RTE: 54.15 → 62.71, MRPC: 69.61 → 77.94).
Due to space constraints, please see the results for LLaMA-3.2-1B in our response to reviewer 4NET. Bilingual brain fine-tuning outperforms the vanilla model on 2 out of 3 tasks. Please note that, due to computational constraints, LLaMA was fine-tuned using LoRA applied only at layer 9; as a result, the observed performance gains on downstream tasks are smaller than for the other models.
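For transparency, below is a hedged sketch of how a single-layer LoRA setup of this kind could be configured with the Hugging Face peft library; the rank, alpha, dropout, and target modules are illustrative assumptions, not our exact hyperparameters.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load the base model (requires access to the gated meta-llama checkpoint).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")

# Restrict the LoRA adapters to the attention projections of a single
# transformer layer (layer 9); rank, alpha, and dropout are illustrative.
config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    layers_to_transform=[9],
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the layer-9 adapter weights are trainable
```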
Thank you for highlighting the points we had overlooked. Your feedback has helped us improve the clarity of the paper. We will incorporate all the tables and explanations provided above into the camera-ready version.
I appreciate the authors' comprehensive answers to the reviewers' questions and suggestions. If the authors manage to make all the modifications mentioned in their answers and include the clarifications and additional analyses in the updated manuscript, then this submission becomes stronger. Clarifying the contributions of the novel pipeline, along with the ablation, and adding 2 monolingual participants to the analysis significantly improve the contribution of the paper. While the post-fine-tuning improvements are small (and the authors should be clear about this), there is some signal there, which I think would be of interest to the computational neuroscience community. I will be increasing the quality and clarity scores and the overall assessment.
My apologies for missing the supplemental! As for the LLaMA-3.2-1B results, I'm not sure how much they add to the overall story. While I'd like to see the performance of the SOTA architecture on this task, the model wasn't fine-tuned following the same procedure as the other models, which makes it hard to interpret the results.
We thank the reviewer for their thoughtful feedback, positive reassessment, and willingness to raise the clarity, quality, and overall assessment scores. We are confident that the proposed changes have strengthened the paper and improved its quality.
We would also like to clarify that GLUE tasks require architecture-specific fine-tuning, whereas LLaMA-3.2-1B demonstrates strong performance without full task-specific fine-tuning. For this reason, we chose to evaluate it on other popular benchmarks more suitable for generative models. We will clarify this text in the final version.
Overview:
This paper studies the impact of fMRI data collected during story reading on fine-tuning LLMs, showing that fine-tuning on bilingual brain data improves voxelwise encoding modeling. The primary experiments are conducted with English and Chinese. Results show consistent improvements on standard NLP tasks including NLI, CLUE, and MCQ tasks.
Strengths:
- The paper is well written, taking time to illustrate concepts from cognitive science that are helpful for the reader. The overall idea is presented clearly, and the experiments are all thorough.
- The paper shows clear improvements on downstream tasks and also conducts studies with human participants. In addition, while the authors have not committed to releasing the dataset, the paper contributes a new dataset that can be used to study LLM performance with fMRI data.
Weakness:
- This paper is primarily a study of the feasibility and potential benefit of fine-tuning LLMs on fMRI data. Many reviewers emphasized the demonstrated improvements with fMRI data, as well as the creation of the dataset, as the main contributions of the work. However, the study has several limitations, including a limited number of participants and experiments only on small models. It is difficult to know how these results would extend to larger models and to other downstream tasks beyond the standard NLP benchmarks considered.
- The paper was missing some comparisons with prior work (for which it claims an improved fine-tuning pipeline with less preprocessing), some additional analyses, and evaluations with more participants. These were added in the rebuttal and should appear in the final version of the paper.
- The paper could benefit from editing the final set of tables on pages 8-9, as they are densely aggregated and difficult to read.
Main Reasons to Accept or Reject:
- Accept: An interesting setup that shows improvements for bilingual scenarios, and the potential release of a dataset to aid in studying this.
- Reject: Lack of comparisons on modern models. It is hard to say whether this will be useful for the larger decoder models that are more conventionally studied.
Rebuttal Period: All reviewers engaged in the rebuttal, and many increased their scores. One reviewer still has an outstanding concern about novelty. Overall, the main concerns are whether the approach will work on more current architectures and whether the data will be released (unclear from the checklist). Without the dataset, I don't think this work will be very reproducible.
Final Recommendations:
The paper conducts an interesting study into the finetuning of LLMs on fMRI data. The paper is overall strong, but can still benefit from incorporating reviewer feedback around clarity of contributions, and discussing the limitations.