Topoformer: brain-like topographic organization in Transformer language models through spatial querying and reweighting
We spatialize self-attention to organize transformer representations as brain-like topographic maps
Abstract
Reviews and Discussion
In this paper, the authors propose to include topographic constraints in the Transformer model by using spatial querying and spatial reweighting, resulting in "Topoformers". The authors demonstrate their methods both in simple examples and in a BERT architecture on large language corpora. They found that the Topoformers performed slightly worse than standard Transformers, but allowed better interpretability. Furthermore, the authors compared the emergent topographic organization of the language areas obtained from fMRI experiments with that of the model, and reported similarity between the two. Overall, I find the ideas interesting and I enjoyed reading the paper. However, the explanation and quantification of the fMRI results are not clear, and it was difficult for me to judge the strength of the results. I will explain in more detail below.
Strengths
Most parts of the paper were very well written. The motivation and the computational approaches were nicely explained.
The idea of incorporating "connectivity constraints" into transformer language models appears novel, despite recent work on including such constraints in deep network models of vision.
The ideas behind the modeling approach are well explained and the results are promising.
The paper combines modeling and empirical studies based on fMRI experiments.
Weaknesses
The main weakness I see is that the approach to linking the topographic structure of the fMRI data to that of the Topoformer model was not well explained and is questionable. This concerns Section 3.3, Fig 1B, and Fig 6. Also see my questions below.
Questions
— What does Fig 1b plot exactly? What do the different colors represent? The description in the caption was insufficient for understanding what's really going on. The same questions apply to Fig 6.
— How should the statistics in Fig 4 be interpreted? The calculation of the topographic statistic is based on Equation 4, but there is a lack of explanation regarding how to interpret it and why it is a good statistic to use.
— In Section 3.3, the authors stated "we then visualized the weights in the cortical MNI surface space (Gao et al., 2015a)". What do the "weights" mean here?
— When connecting the data with the model, the authors say "Because we have already shown the spatial smoothness of these components, if we see smoothness in the correlation of these components with voxels/units in the target space, we can infer that there is a correspondence between the two topographies." This is not obvious to me. I would appreciate it if the authors could unpack the argument. Furthermore, how is the strength of the correspondence quantified?
— The authors performed a control analysis in Appendix A.6.1, i.e., "We sanity-checked that a control model did not show any topographic correspondence with the brain organization (see Appendix A.6.1)." But I don't understand why this is the appropriate control, because Fig 9 is not about the mapping between the PCs and single units for the control model and brain response. I'd think one would need a figure analogous to Fig 5, i.e., showing the Brain-PC correlation map under the control model.
We appreciate the reviewer's generally positive assessment of our work, and hope to clarify many points in our response.
In response to your one listed weakness: we have substantially revised the brain modeling component of the work, to make alignment between brain and model topographies clearer. We now use a technique called partial least squares singular value decomposition (PLSSVD) that projects brain and model embeddings into a joint space, by performing SVD on the cross-covariance matrix. This allows us to recover shared dimensions, and evaluate the correspondence of brain and model projections along these dimensions using held-out sentences. This allowed us to discover alignment between the representations, and reveal the topographic nature of the shared dimensions. We have also added multiple control analyses, discussed in our comments to the other reviewers, that we believe help clarify the contribution of our work.
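For readers unfamiliar with the technique, the PLS-SVD procedure described above can be sketched as follows. This is a minimal illustration on synthetic data; the array names, dimension counts, and number of retained dimensions are our own assumptions, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: responses to the same sentences in two systems
# (sentences x voxels for the brain, sentences x units for the model).
n_train, n_test, n_voxels, n_units = 200, 50, 300, 400
brain = rng.standard_normal((n_train + n_test, n_voxels))
model = rng.standard_normal((n_train + n_test, n_units))

# Center both views using training sentences only.
mu_b, mu_m = brain[:n_train].mean(0), model[:n_train].mean(0)
Xb, Xm = brain[:n_train] - mu_b, model[:n_train] - mu_m

# SVD of the cross-covariance matrix yields paired weight vectors:
# columns of U for brain voxels, rows of Vt for model units,
# defining the shared dimensions.
C = Xb.T @ Xm / (n_train - 1)
U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Project held-out sentences into the joint space; alignment is the
# per-dimension correlation of the two projections.
k = 5
Pb = (brain[n_train:] - mu_b) @ U[:, :k]
Pm = (model[n_train:] - mu_m) @ Vt[:k].T
align = [np.corrcoef(Pb[:, i], Pm[:, i])[0, 1] for i in range(k)]
```

With real, related representations the held-out per-dimension correlations would be substantially above zero; here, with independent random data, they hover near zero.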
Answers to questions:
- Figure 1b has been removed, but generally, we believe we have vastly improved the visual presentation of our results, which we hope can help clarify the reviewer's understanding of our work.
- We apologize for the missing information. We have added this information to the text. The generic topographic statistic is a measure of the degree to which response correlations decay with distance, over some maximum distance. In the model, we compute this for a range of distances and take the mean, as well as plotting the statistic over all distances for layer 15 (Figure 3).
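To make the statistic concrete, here is a hedged sketch of one way such a correlation-decay measure could be computed; the exact form of the paper's Equation 4 (normalization, distance binning) may differ.

```python
import numpy as np

def topographic_stat(responses, positions, max_dist):
    """Negative correlation between pairwise response correlations and
    pairwise unit distances, over pairs closer than max_dist. A high
    value means nearby units respond more similarly than distant ones."""
    r = np.corrcoef(responses.T)                      # (n_units, n_units)
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    iu = np.triu_indices_from(d, k=1)                 # unique unit pairs
    mask = d[iu] < max_dist
    return -np.corrcoef(r[iu][mask], d[iu][mask])[0, 1]

# Toy example: smooth 10x10 response maps made by box-filtering noise,
# so nearby units are correlated by construction.
rng = np.random.default_rng(1)
raw = rng.standard_normal((50, 12, 12))               # 50 stimuli
smooth = sum(raw[:, i:i + 10, j:j + 10]
             for i in range(3) for j in range(3)) / 9
responses = smooth.reshape(50, 100)
grid = np.stack(np.meshgrid(np.arange(10), np.arange(10)), -1).reshape(-1, 2)
stat = topographic_stat(responses, grid.astype(float), max_dist=5.0)
print(stat)
```

On these spatially smoothed toy maps the statistic comes out clearly positive, whereas for unsmoothed random responses it would sit near zero.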
- By weights, we meant the weights of a given principal component. We have now substantially revised our presentation of the results to focus more on the PLS-SVD results, but the same idea holds, only now the weights are supervised for alignment across brain and model representations.
- We thank you for raising this point, which was also raised by another reviewer. Now, rather than relying on heuristics such as this one, we have taken substantial efforts to demonstrate the alignment of brain and model topographies, with the PLS-SVD approach, as well as targeted control analyses using a different brain network and an untrained Topoformer-BERT variant. We find that the alignment between trained Topoformer and language networks is substantially stronger than that including either the control network or untrained variant; however, the untrained variant also shows a degree of low-dimensional alignment.
- A related issue regarding the comparison of topographies between two different systems is that, without an explicit theory of the global organization, and a careful consideration of the shape of the cortical sheet, one cannot simply compare topographies one-to-one. Thus, our PLS-SVD approach is a useful alternative approach to demonstrate alignment and visualize the individual topographies of each representation in their native spatial layout. Tackling the global layout is a long-term problem, but is of great interest for future work.
- We apologize for the confusion. That figure (now Figure 9, A.7) merely demonstrates the lack of topographic organization in a control model without the topographic constraint. Moreover, we hope that the various controls already discussed take care of any remaining concerns raised in this comment.
Thank you again for your insightful comments and questions. We hope our responses clarify the work, and persuade you of its importance and rigor.
This paper introduces the Topoformer, a transformer model, here applied to language, that imposes spatial structure on the latent dimensions involved in the attention process. It uses a binary matrix encoding spatial proximity that is placed between the K and Q matrices that create the attention matrix. Further, local connectivity is used after the application of the V matrix.
This transformer is trained first as a 1-layer version on IMDB, and later as a large BERT version.
Transformer activations are inspected and found to have spatial structure.
A relationship is drawn between this spatial structure and that found in an fMRI language experiment.
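The spatial querying mechanism summarized above can be sketched roughly as follows. This is an illustrative single-head implementation under our own assumptions (neighborhood construction, row normalization); it is not the authors' exact formulation, and it omits the spatial reweighting applied after the V matrix.

```python
import numpy as np

def neighborhood_matrix(grid_w, radius):
    """Binary neighborhood mask over grid_w x grid_w latent positions,
    row-normalized so each key sees an average of its local queries.
    (The construction here is a hypothetical stand-in.)"""
    coords = np.stack(np.meshgrid(np.arange(grid_w), np.arange(grid_w)),
                      -1).reshape(-1, 2).astype(float)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    M = (d <= radius).astype(float)
    return M / M.sum(1, keepdims=True)

def spatial_querying_attention(x, Wq, Wk, Wv, M):
    """Single-head attention with a bilinear form Q M K^T, so attention
    scores pair each key with a local pool of queries rather than a
    single query."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = (Q @ M @ K.T) / np.sqrt(Q.shape[-1])
    A = np.exp(scores - scores.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)                     # softmax over keys
    return A @ V

rng = np.random.default_rng(2)
d_model = 64                                          # an 8 x 8 latent grid
M = neighborhood_matrix(8, radius=1.5)
x = rng.standard_normal((5, d_model))                 # 5 tokens
Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / 8 for _ in range(3))
out = spatial_querying_attention(x, Wq, Wk, Wv, M)
```

Setting M to the identity recovers standard single-head attention, which makes clear that spatial querying is a drop-in modification of the score computation.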
Strengths
This paper introduces the idea of spatial structure in the attention mechanism of a transformer, devises a method to achieve it, and shows that the spatial structure indeed appears.
Weaknesses
This paper is unfortunately quite badly written, but thankfully many aspects can be improved straightforwardly.
- The abstract and intro contain falsehoods and non sequiturs and would largely benefit from being made more concise. Example of a falsehood: Abstract, sentence 2 (convnets do exhibit and exploit spatial structure). Example of a non sequitur: Intro, paragraph 3, "Despite the success of these LMs, the fact that their architecture is not compatible with spatial constraints of the biological cortex is a fundamental limitation for the growing research enterprise that uses LMs as models of human intelligence"; the one simply does not follow from the other, on any level, especially without concrete evidence.
The paper would largely benefit from having such problematic statements removed, since they contribute to the impression that the authors might want to pull the wool over the reader's eyes.
- Variables used in formulas are not introduced, and formulas are hard to find and unexplained. See, e.g., the first math items in 2.1. Further, see the reference to a non-existent "Equation 4" (which may correspond to Eq. 4 found in the appendix, but the reader can't know this, because it is possible that enumeration restarts; on top of this, the equation in the appendix is stated without any context or description of what is what).
- The link to brain organization is strongly suggested or stated throughout the paper but is unfortunately very tenuous. An example is Fig 1, where learned spatial attention structure is juxtaposed with some brain flatmaps and the term "brain-like" is used. There are many possible reasons for smooth variation of brain data: for starters, there is the smoothing employed in the pre-processing, the mapping to MNI space, and the subsequent mapping to the cortical surface. Functional brain regions do exist, but absent individual localizers it is unclear whether the cropped regions are even correct. It would be good to give a map showing any form of predictive power of any language model on the selected voxels. As an aside, it would have been great to have an idea, on a global surface or a 3D brain, of where this region actually lies. Also, the figure caption says "Brain responses are visualized", while the diagram mentions PCs, which are not brain responses but summaries of them.
Related to both 1 and 3, the statement "Because we have already shown the spatial smoothness of these components, if we see smoothness in the correlation of these components with voxels/units in the target space, we can infer that there is a correspondence between the two topographies." is incorrect. One would see patterns and smoothness in the correlation of the Topoformer pixels with many things, possibly even certain noise vectors (though with lower correlation values), and very likely with brain data from other regions. The latter would be a good test: it would be very useful to see the correlation drop when using, e.g., data from a visual area or a motor area. The currently important thing to read from the diagram is the relatively high correlation value, which shows that there is a correspondence between the transformer activations and brain activity. However, this is also true for unstructured transformers.
All in all, it would be very helpful if the authors backed their statements up with actual analyses, and made statements sharp and concise, accompanied by proof wherever they are not established fact.
It is worth considering removing the brain analysis part and focusing on more detailed interpretability of the transformer model. E.g., why is the focus on the final layer of the Topoformer model? What do the other ones look like? Even if they exhibit the spatial property less, this would be interesting. A comprehensive study would have been more helpful than a link to brain activity that does not strongly corroborate similar topographic information, but only shows what several papers have already shown: a correspondence of these models to brain activity.
Questions
Why is locality enforced using a matrix M that forces averaging of the adjacent keys and queries? A priori, this does not actually have to enforce similarity between the connected locations, but it does seem to be enough pressure to make the locations become similar to their average. A different approach would be to simply constrain each location to only be able to use information from a neighborhood. Was this approach considered at all?
While we are disappointed by the reviewer's negative overall view of our paper, we greatly appreciate the many important critiques, and believe that our answers should help the reviewer see our paper more favorably. Let us address the concerns in turn:
- We appreciate the reviewer’s careful assessment of our writing. Regarding the first “falsehood”, the reviewer misunderstands our point: indeed, convnets exhibit and exploit spatial structure, but it is of a different form (explicitly spatial information in the stimulus) than the spatial structure explored in this work (spatial organization of learned features). A more similar form of “spatial structure” to that exploited by convnets would be the sequence position, but this is not the target of our work. Rather, we are interested in exploring the spatial organization of featural information, not immediately available in the input. Beyond this, we have given the paper a comprehensive edit and hope that the reviewer will find the text less problematic and more factual and descriptive, backed up by several further analyses.
- Equation 4 indeed was in the appendix, and we apologize for not explaining it more clearly. This has been fixed. Additionally, we have added an intuitive explanation of this statistic: “A high value of the statistic indicates that nearby units tend to be more correlated in their response pattern across sequences than distant units.”
- We appreciate the insightful comments. We have re-analyzed the data and now use a version without smoothing, and have moved the functionally localized language network results (previously in the appendix) to the main paper, to quell any concerns about individual localization or false smoothness. The smooth organization of the brain data is still very clear and statistically significant, as now confirmed by a permutation testing analysis. To be clear, none of the statistics that are computed are affected by the surface projection, which is merely a visualization tool. Lastly, we have moved towards lateral surface views, as we agree that they are substantially easier for the average reader to interpret, and more clearly demonstrate the topographic organization.
Regarding the paragraph below point 3: we thank the reviewer for correctly pointing out this poorly framed claim. We have now added additional control analyses to make the point of greater topographic alignment clearer.
- First, we applied a new analysis approach leveraging partial least squares singular value decomposition (PLS-SVD), where we can project the brain and model representations into a shared space (Figure 5). Using held-out data, we demonstrated that the low dimensional variability in these models could be substantially aligned. Visualizing the weights of these aligned components, we found that each was highly topographic. Regardless of any arguments about control regions or control models, this demonstrates significant generalizable alignment of low-dimensional topographic variability in the model and functional language network.
- Second, we added a control analysis of a separate brain region composed of motor and somatosensory strips, demonstrating weaker alignment of these voxels with TopoBERT, and worse encoding model prediction.
- Third, we added a control analysis of an untrained TopoBERT model, demonstrating weaker alignment of untrained vs. trained TopoBERT representations and brain representations beyond the first dimension, and poorer encoding model prediction (albeit above chance). Other studies have demonstrated significant brain predictivity of untrained transformer models, and our results corroborate this while demonstrating greater brain alignment with the trained model.
- Fourth, as mentioned, we computed the generic topography statistic for each language region and the control region, and demonstrated that while the values were somewhat low, they were significantly greater than the chance values estimated through a permutation of voxel responses with respect to positions. In summary, while it is true that a topographically organized system will always show topographic correlation structure with whatever it is being compared to, in many cases these correlations would be weak and insignificant. The stronger correlations between trained TopoBERT and the language network – both of which we demonstrate to exhibit topographic variation across natural language sentences – demonstrate an intriguing alignment of low dimensions. We do not want to make strong claims about the particular similarity of the spatial frequency, etc., of the emergent topographies at this stage, as optimizing for such properties is beyond the scope of this initial paper.
We will address the final paragraph and question in a follow up comment.
Regarding the paragraph beginning "It is worth considering":
- No paper has quantified the fine-grained spatial topography of regions of the language system, so we believe that establishing the topographic organization of the language network is a critical motivation for the work presented here, and thus an important aspect of the work for the audience we hope will receive it. That being said, we agree that further comparisons are an interesting path for future work beyond the scope of a single conference paper. We hope that the reviewer agrees that our new analyses of the brain data are a worthy inclusion in the paper.
- We looked at the final layer because we expect it to be the most context-aware layer, and thus limited our visualization to a single layer. The other layers look similar as suggested by the statistic. A more comprehensive analysis of all layers, particularly looking at the nature of their representations rather than just the presence of topography -- leveraging similar approaches used here, such as the new selectivity test suite -- is unfortunately not possible within the scope of the paper, but would be an excellent direction for future work.
Answer to question:
- The use of a mask in spatial querying is a mathematically simple way of introducing locality over feature dimensions. In a standard self-attention layer, each key is associated with a single query, so there is no limiting beyond this that could be done. Rather, spatial querying expands the amount of information that a given key has access to, in terms of its local pool of queries. That being said, we acknowledge there are other possible ways of introducing locality, such as a locally-biased weight matrix that allows learned spatial combinations of queries, rather than an average, to be associated with a given key. Alternatively, and perhaps this is what you meant by “constraining each location to only be able to use information from a neighborhood”, locality could be introduced by making the query and key matrices locally connected. One advantage of our formulation of spatial querying compared to the latter approach is that it does not require spatial organization of its inputs, and allows the network to combine information from the full input embedding before organizing it again. We agree that these other options would be interesting to explore, albeit beyond the scope of this first paper on topographic transformers. Lastly, we agree that spatial querying encourages, rather than forces, the development of local similarity in queries and keys, since this allows for reasonable local averages of queries and keys.
Thank you very much for all the insightful comments. We hope we have addressed most of your concerns, and that you will consider raising your score to reflect this.
The authors introduce a method for integrating topographic organization into the keys, queries, values, and outputs of the self-attention mechanism of a transformer. They validate that this indeed yields topographic organization on the simple IMDB task. When their method is applied to language modeling (BERT), the authors identify topographic organization with respect to semantic clusters of topics (music / sport). They then measure the topographic organization of human subjects (through fMRI) with respect to language, and compare these results to the topographic organization of their Topoformer-BERT model through correlation analysis. They show that there is indeed strong spatial correlation between their model and the human fMRI data, indicating some similarity of topographic maps. Finally, they show that this topographic organization does not significantly hurt the performance of their Topoformers compared to non-topographic baselines.
Post Rebuttal
We thank the authors for their extensive response to our review. We have looked at the updated PDF and find it to be a significant improvement over the original version and therefore maintain our recommendation that this is a good paper that should be accepted.
Strengths
- Novelty: this is the first model which integrates topographic organization into a transformer (to the best of my knowledge), and furthermore one of the first to do so for language.
- The comparison with human fMRI data adds an important degree of grounding to the paper, which only adds to its impact.
- The quality of the writing in the paper is very high in general, and specifically the introduction is very succinctly written and elegantly describes topographic models.
- The use of a topographic metric to quantify topographic organization across multiple layers in an easily digestible way is a welcome finding in the field.
- Topographic organization in artificial neural networks is an important but understudied topic, and I believe this paper will draw important attention to it in the future. Furthermore, I believe this paper has the potential to become a foundational paper in the field, given its application to transformers and language modeling at this crucial point in time.
Weaknesses
- The appendix is poorly formatted and appears almost incomplete. There are equations (such as (4)) which are important to the text but are left with undefined symbols. Furthermore, some figures (such as the Brain-PC correlation with a non-topographic control) appear to be missing (or is this Figure 9?).
- No baseline for IMDB accuracy performance without topographic organization.
- Only a single attention head is used in all experiments, limiting the comparison with most state-of-the-art transformer architectures. (To be clear, I think the Topoformer-BERT also only uses a single head, but I could not find this in the text.)
- Too many important aspects are relegated to the Appendix making the text challenging to read. The authors should improve the formatting of the appendix (ensure headings for sub-sections are in the correct location) and potentially move some figures (such as Figure 9, and equation 4) to the main text.
- Given the authors mention that they compared many spatial receptive field sizes, it would be nice (and make sense) if they included these results in the appendix. Furthermore, it would be helpful to understand how the authors ultimately settled on their final choices of RF sizes as this is not described in the text.
- The topographic organization for the values and fc_out appears quite weak (despite the topographic metric). This makes one either question the topographic metric, or question the usefulness of this model for organizing the actual output of self-attention. It would be helpful if the authors could comment on this. (To be clear, despite this weak organization of values, I still think there is significant value in the strong organization of keys and queries and therefore do not think this should be held strongly against the quality of this paper).
- It would be helpful if the authors included additional controls comparing non-topographic models with brain PCs for the correlation plots of Section 3.3. Without these, the strengths of the conclusions drawn from this correlation analysis are significantly lowered.
Questions
- Is there any intuition for why, in Figure 1, the keys and queries show strong selectivity for negative sentiment but not positive, while the values appear to show only very strong selectivity (very little intermediate), but for both classes? Is this somehow a result of the different mechanisms used to induce topographic organization (bilinear vs. matrix-vector product)? Similarly, is there a reason why these components appear to have patchier organization in the BERT setting as well?
- In the appendix you say SQR was less stable and required larger batch size, do you have intuition for why this might be?
- Would it be possible to again compute the topography metric (Equation 4) on the correlation maps between the model and brain PCs? Would this make sense as another quantitative metric of the alignment of the two representations?
- Do the PCs found for the human participants correspond to any semantic categories of the text?
We thank the reviewer for an extremely positive review, noting the novelty and potential for impact of our work, while also providing several important criticisms that were easily addressed. Let us address the weaknesses in turn. While they were not numbered, we will number our responses (in order) for the sake of organization.
- We apologize for the earlier poor formatting of our appendix. We have edited this to ensure greater readability, and sufficient information to understand each equation.
- We apologize for the lack of a control comparison for IMDB accuracy. We have hereby trained a non-topographic counterpart (identical to the topographic model, just without Spatial Querying or Spatial Reweighting), and this model achieved a performance of 0.83 on the IMDB test set. We have added this in the appropriate section: “In comparison, an identical 1-layer Transformer model without spatial querying achieved an accuracy of 0.83.”
- We trained both single-head and multihead non-topographic models, and reported GLUE results in Table 1. While the multihead model performs the best, a single head does surprisingly well, and our topographic variant performs nearly as well as the non-topographic control model.
- As stated before, we have improved the formatting of the appendix substantially. Unfortunately, we are heavily limited by space and have had to keep only the most important results in the main text, while trying to provide as much relevant information as we can to the reader. We hope that any limitations on readability are outweighed by the benefits of greater information and robustness of our results.
- This is a great point. We have included these analyses in Appendix A.9.
- The topographic organization of values and fc_out representations is a bit weaker, as you point out. We were also interested in this finding. The statistic has been updated slightly to look at the decay of response correlation in terms of the correlation of pairwise correlations and distances, as should be clearer now from Appendix 3.2. It still shows the same result. It appears to us that the fc_out layer in particular is a bit sparser in its activity --- more specifically, having few units with very high activations --- which makes the visualizations look a bit weaker. In any case, for space considerations, we have opted to focus primarily on the keys, while quantifying all of the layers for breadth of consideration. Many other formulations of spatial constraints are possible (Gaussian probability masks, wiring penalties rather than fixed local connectivities, etc.), and it is possible that other formulations would work better in the scaled-up model. However, we note that the small-scale model did not exhibit this same issue, showing substantial organization in the values and fc_out representations, suggesting it is not a fundamental limitation of spatial reweighting.
- This is an excellent point, and we have now added multiple controls. Moreover, we have now used a cross-validated approach to assess the alignment of brain and model topographies, which reduces the need for controls in the first place. Nevertheless, we added a control analysis of a separate brain region composed of motor and somatosensory strips, demonstrating weaker alignment of these voxels with TopoBERT, and worse encoding model prediction. We also added a control analysis of an untrained TopoBERT model, demonstrating weaker alignment of untrained vs. trained TopoBERT representations and brain representations beyond the first dimension, and poorer encoding model prediction (albeit above chance).
Answers to questions:
- Interesting question! This is not found for all models, so we believe this was a random occurrence for this one model. We believe the patchier representation in BERT may owe to the greater effective dimensionality required when representing the features of a much larger corpus of linguistic information.
- Unfortunately, we do not yet have a strong intuition for this; however, it may be due to the positivity constraint on the feedforward connections, which is not present in the SQ model.
- The topography metric requires multiple trials, so it cannot be computed on the model correlation maps. However, we have added a quantification of brain topography using this statistic, finding significant smoothness (albeit reduced now that we have eliminated the smoothing from our preprocessing pipeline). While the statistic is not perfect, it is one useful way of summarizing organization quantitatively across a number of spatial representations.
We will address the final question in a follow-up comment.
- The PCs are quite difficult to interpret in both the brain and the model; however, their alignment is intriguing. We find this interpretability problem fascinating and challenging. We made some headway on this in the model, with our new suite of selectivity tests. We find substantial selectivity for several semantic distinctions, including animacy and concreteness. This is more difficult in the brain, given that we have less ability to run controlled experiments. However, we are excited by the possibility of further work using computational models to drive insights that predict particular selectivities in the brain, and believe the Topoformer will be helpful in this vein.
Thank you again, so very much, for your insights, critiques, and questions!
This paper proposes a novel approach to training Transformer language models with topographic organization, called Topoformer. The key idea of Topoformer is to arrange the keys and queries of the self-attention mechanism on a 2D grid, and to associate local pools of queries with a given key. This allows Topoformer to learn topographic representations of language, which are more interpretable and efficient than the unstructured representations learned by traditional Transformer models.
Strengths
The proposed method, Topoformer, is a novel approach to training Transformer language models with topographic organization. Topoformer has been shown to be feasible on a 1-layer sentiment classification task and to perform on par with a non-topographic control architecture on downstream NLP benchmarks. Topoformer has also been shown to yield similar forms of topographic organization for linguistic information as that present in the language network of individual subjects.
Weaknesses
The paper does not provide any concrete examples of how Topoformers can be used to improve the interpretability of NLP models. The paper does not evaluate Topoformer on a variety of different NLP tasks.
Questions
How can Topoformers be used to improve the interpretability of NLP models? Will the Topoformer increase the performance of the downstream task?
We appreciate the reviewer's generally positive remarks on our paper, and are glad to address their concerns.
Regarding the weakness of interpretability: we have added a new suite of 8 tasks designed to probe interpretability. We find selectivity for several semantic attributes such as animacy and concreteness. This selectivity exists within generally highly similar mean activations (true of our model, and of standard language models), but is simply visualized in our 2D topographic layout. While much future work remains to be done to interpret language models (with superposition being a major obstacle), we believe our work provides some important first steps and a useful architecture for future work that can benefit from its simple topographic priors.
We believe this also addresses the previous concern regarding evaluation on multiple tasks. Additionally, we already analyzed the Topoformer-BERT model on the GLUE benchmark suite, which contains several tasks (Table 1). We believe our evaluation of Topoformer-BERT is extensive and demonstrates similar performance to a non-topographic model. Beyond this, state-of-the-art performance is not our goal, as we do not have the resources to train massive language models. We believe that our simple architectural motifs should be scalable to much larger models that can be trained by groups with more resources.
Biological brains feature spatial organization in the arrangement of neurons. However, the representations of current machine learning models lack such organization, posing challenges for interpretation. This study introduces a novel approach, consisting of Spatial Querying and Spatial Reweighting, which results in interpretable topographic organization. The primary contributions can be summarized as follows: the authors introduce a Topoformer with topographic organization and demonstrate this organization with a 1-layer Topoformer on a sentiment analysis task; they subsequently scale the proposed method to BERT, achieving competitive results on NLP benchmarks; and further experimental studies reveal that Topoformers exhibit a linguistic organization similar to that found in human brain activity.
Strengths
- Addressed the architectural disparity between current Transformer models and the biological brain by inducing a topographic organization of features within the Transformer.
- The topographic organization of the Topoformer yields competitive performance compared to the Vanilla Transformer model in small-scale sentiment analysis and benchmark GLUE tasks.
- The proposed method is scalable for both small and larger-scale datasets.
- The alignment between the way information is organized in the Topoformer and in the human language network is clearly shown using a brain dataset.
Weaknesses
- Although the novelty of the paper is interesting, it lacks specific experimental details:
  - How is the optimal number of tokens in local spatial querying chosen?
  - With the introduction of local pooling of spatial queries, the parameter difference between Topoformer and vanilla BERT is not provided.
  - How did the authors generate Fig 4? What does "Stat Value" refer to? How is the Stat Value determined across layers for queries, keys, values, and fc_out?
  - What does fc_out refer to? It is not defined or referenced anywhere in the paper except in Fig 4.
  - Typically, a BERT-base model comprises 12 encoder layers; however, Fig 4 depicts 15 layers. Could the authors explain why?
- The validation of the proposed Topographic Transformer model appears insufficient. Similar to BERTology studies, has the proposed Topoformer been assessed for its ability to capture the hierarchy of linguistic structure (such as early layers capturing surface features, intermediate layers capturing syntax, and later layers representing semantics)?
- Why does the Topoformer with a single head result in better accuracy scores on the GLUE benchmark compared to the multihead attention?
- What is the complexity of self-attention in Topoformer after introducing the local spatial querying mechanism? Did the authors maintain the same local spatial querying across layers, or did they increase the local pool size with the depth of the layer?
- The clarity can be improved.
- Several typos: Fig 4 uses fc_out, while the rest of the paper uses fc-out and FC-Out.
Questions
- What is the representational similarity between Topoformer and Vanilla BERT across layers? During fine-tuning of Topoformer, similar to fine-tuned BERT, are the last layers significantly affected?
- Please check weaknesses for the remaining questions.
Ethics Concerns Details
N/A
We thank the reviewer for noting the strengths of our paper and providing some relevant critiques. Let us address the weaknesses in turn:
1. Regarding experimental details:
   a. There is a misunderstanding of spatial querying: our spatial operations do not operate over tokens, but over the features of a given token. Regardless, we have added an additional analysis of the role of RF sizes for spatial querying and reweighting, using the small-scale model.
   b. Topoformer-BERT and the non-topographic control model are identical except for the presence of spatial querying and reweighting. We now make this clearer in the text.
   c. Regarding the generic topography statistic, we have modified the statistic for greater robustness to scale and discussed it more thoroughly, both in the appendix and in what is now Figure 3.
   d. fc_out is the output of the self-attention layer. We hope that our updated Figure 1 helps to clarify this.
   e. We selected 16 layers because we followed the protocol of the CRAMMING paper (Geiping et al., 2022). Throughout our paper we depict 16 layers; however, they are zero-indexed. We have clarified this in the manuscript; thank you for catching it.
2. We have added substantial new assessments of topographic selectivity in Topoformer-BERT; please see Figure 4. Beyond this, we found that Topoformer-BERT performs generally on par with a non-topographic control, as shown across the variety of benchmarks within the GLUE composite benchmark, and thus we have no reason to believe that the factors you mention would differ. In general, we believe these aspects of interpretability are beyond the scope of the present paper, which is designed to introduce the algorithm, provide some intuitions about its performance, and make comparisons with the human brain's language network.
3. This is only the case for a single metric (CoLA); overall, the multihead model does perform better. However, the Topoformer performs similarly to, albeit somewhat worse than, the matched single-head non-topographic model.
4. The complexity is unchanged and is the same as the original attention mechanism: O(n²). We use a single RF value for spatial querying and reweighting across the network, but further exploration of structured variation, as you suggest, would be interesting for future work.
5. We apologize for any lack of clarity and have significantly revised the manuscript to address this.
6. We thank you for catching these typos and have fixed them.
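One hedged way to see why the complexity is unchanged: spatial reweighting can be read as an elementwise scaling of the attention-output projection (fc_out) by a fixed distance kernel over the 2D feature grid. This is a sketch under our own assumptions; the function names and the Gaussian kernel choice are illustrative, and the paper's exact parameterization may differ. The scaling alters neither tensor shapes nor the O(n²) token-token attention cost.

```python
import numpy as np

def distance_kernel(grid_w, sigma):
    """Gaussian falloff between the 2D grid positions of feature dims
    (an illustrative kernel choice, not necessarily the paper's)."""
    d = grid_w * grid_w
    pos = np.array([divmod(i, grid_w) for i in range(d)], dtype=float)
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    return np.exp(-dist ** 2 / (2 * sigma ** 2))

def spatially_reweighted_fc_out(H, W_out, kernel):
    """fc_out with weights scaled by grid distance: same shapes and
    parameter count as a plain linear layer, only a locality prior
    on which input units can drive which output units."""
    return H @ (W_out * kernel)
```

Since the kernel is a fixed (d, d) matrix applied elementwise to the existing weight matrix, no parameters are added and the forward cost remains that of a standard linear projection.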
Answers to questions:
- We expect that the representational similarity between Topoformer-BERT and non-topographic variants is generally on par with that between randomly initialized variants of non-topographic BERT models. We make no claims about the topographic bias fundamentally altering the representations, only that they allow the representations to be visualized in 2D.
Thank you again for your insightful comments. Addressing them has improved our paper substantially. We hope to have convinced you of the rigor, quality, and contribution of our work.
We thank all the reviewers for their incredibly helpful and thoughtful comments. According to the reviewers, the main strengths of the paper are 1) Novelty and strong motivation (LDRr, DNiv, S9Nw, 73Ff), 2) Competitive performance of the Topoformer compared to the Vanilla Transformer (LDRr, DNiv), 3) Interesting comparison to the human language network (LDRr, DNiv, S9Nw, 73Ff).
Additionally, it was mentioned that “this paper has the potential to become a foundational paper in the field given its application to transformers and language modeling at this crucial point in time.” (S9Nw).
We agree with all of these points, and were encouraged to see the reviewers note them.
Several weaknesses and questions were raised. We greatly value this feedback and believe the paper has been substantially improved in response. We have done our absolute best to address each of them in direct response to each reviewer.
Some main highlights are as follows:
- A new PLS-SVD approach to characterize the alignment between brain and model topographic representations, together with several control analyses of its robustness and of the robustness of topographic organization in both the model and the brain.
- A new suite of 8 selectivity tests to provide greater interpretability into the topographic selectivity of the Topoformer-BERT model
- Analysis of the role of the RF parameter in emergent topography and performance in the small-scale model
- Improved clarity of writing
We hope that our revision and our responses can significantly persuade each reviewer that their concerns have been addressed, and that this work is sufficiently valuable to warrant publication in ICLR.
Thank you again for your hard work in reviewing our paper.
The work explores adding a notion of locality to the attention mechanism of transformers via spatial querying and reweighting operations. The proposed network architecture is evaluated on NLP tasks and fMRI decoding.
The notion of locality is clear from a neuroscience perspective and is also commonly considered in the NLP literature (though mostly for computational reasons). Past literature has shown that interactions between ML and neuroscience can offer inspiring contributions, yet the current contribution remains only moderately convincing in its ability to impact either the field of NLP or neuroscience.
Why Not a Higher Score
The paper develops an interesting perspective on bringing locality into transformers, yet its claims of potential impact for either the NLP or the neuroscience community fail to convince.
Why Not a Lower Score
The study has no clear experimental flaw and the question remains interesting, yet the results are not there yet.
Reject