PaperHub
ICLR 2025 · Rejected · 4 reviewers
Average rating: 6.0 / 10 (min 3, max 10, std 2.5) · Individual ratings: 5, 10, 6, 3
Confidence: 3.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 3.3

Disentangling and Integrating Relational and Sensory Information in Transformer Architectures

Submitted: 2024-09-28 · Updated: 2025-02-05
TL;DR

We introduce an extension of the Transformer architecture with explicit relational computational mechanisms, integrating sensory and relational processing.

Abstract

Keywords
relational learning, transformers, inductive biases, sensory, relational, architecture, attention

Reviews and Discussion

Official Review
Rating: 5

The paper proposes to add a relational attention mechanism to the transformer architecture in order to improve relational reasoning performance.

Strengths

  1. The proposed method is shown to outperform vanilla transformers on a small set of synthetic or simple real-world tasks.
  2. The paper is written clearly and easy to follow.

Weaknesses

  1. There is limited novelty. The relational attention sounds very similar to the graph attention network to me. Also, the idea of having a perception module followed by relational reasoning networks has been explored for many years in the visual reasoning domain (e.g., [1,2]). The Symbolic attention also sounds like the Link mechanism in the Retriever [3].
  2. The experiments are only performed on a small set of simpler tasks. I wonder how the proposed method will perform for more complex tasks.

[1] Amizadeh, Saeed, et al. "Neuro-symbolic visual reasoning: Disentangling." International Conference on Machine Learning. PMLR, 2020.
[2] Wang, Duo, Mateja Jamnik, and Pietro Lio. "Abstract diagrammatic reasoning with multiplex graph networks." arXiv preprint arXiv:2006.11197 (2020).
[3] Yin, Dacheng, et al. "Retriever: Learning content-style representation as a token-level bipartite graph." arXiv preprint arXiv:2202.12307 (2022).

Questions

None

Comment

Relation to Link Attention in the Retriever

The Symbolic attention also sounds like the Link mechanism in the Retriever [3].

Symbolic attention and the Link mechanism of the Retriever are distinct mechanisms. While the Retriever is an interesting work, it has different motivations and implementations. The key idea behind the Retriever is to separate permutation-invariant information from the rest of the information. This is achieved through an autoencoder architecture with a permutation-invariant encoder. We don't see a direct connection to symbolic attention, aside from the involvement of an attention operation in the Retriever.


On the experiments

The experiments are only performed on a small set of simpler tasks. I wonder how the proposed method will perform for more complex tasks.

We respectfully disagree with this characterization. Our suite of experiments covers a range of tasks, data modalities, and architectural variants, which include both controlled synthetic tasks and large-scale complex real-world tasks. This was a recognized point of strength in all other reviews (mxrQ, YUpf, qVFZ).

Below, we aim to summarize the experimental component of the paper.

  1. Sec 4.1: We begin with a synthetic benchmark of relational tasks, called "relational games". This benchmark was studied in a series of prior works on relational architectures, and gives us a way to evaluate our proposed model in a controlled environment. The benchmark contains a suite of 5 different tasks, where we evaluate learning curves (i.e., data-efficiency) and compare to standard Transformers. We show that our model is significantly more data-efficient.
  2. Sec 4.2: We evaluate symbolic reasoning via a set of mathematical problem-solving tasks. These tasks are modeled as sequence-to-sequence tasks, using an encoder-decoder architecture. We demonstrate improved performance compared to a standard Transformer, across different model sizes and parameter scales.
  3. Sec 4.3: We evaluate our model on visual processing via image recognition tasks. We use a ViT-style architecture on these tasks, processing the input image as a sequence of patches. We demonstrate improved performance, showing that relational processing can be useful for visual processing tasks such as image recognition.
  4. Sec 4.4: We evaluate our model on autoregressive language modeling using a causal decoder-only architecture. We evaluate scaling laws with respect to both data size and model size, and show improvements in both data efficiency and parameter efficiency compared to standard Transformers. Our models go up to 1.3 billion parameters, roughly matching the scale of GPT-2.

These experiments show that the DAT architecture yields improved performance across a wide range of tasks (symbolic reasoning, image recognition, language modeling), data modalities (e.g., text, vision), and architectural variants (e.g., encoder-only, decoder-only, encoder-decoder, and ViT-style).

Moreover, we note that we build on a line of work on relational architectures and inductive biases [Ref 15-22]. The empirical evaluation of this prior work was mainly limited to synthetic tasks, like the relational games benchmark of Sec 4.1. Thus, one of the key contributions of our work is to integrate relational neural mechanisms into a general architectural framework (namely, the Transformer) and demonstrate that relational neural mechanisms and inductive biases confer performance benefits on complex real-world tasks, like language modeling and image recognition. We believe this is an important contribution to this line of work.



Thank you for your review. We hope we were able to clarify the novelty of our architectural proposals. Please let us know if we have addressed your concerns or if you have any remaining concerns. We look forward to your response.

Comment

Dear reviewer,

Thank you again for your review. As the discussion period is coming to an end (Dec 2nd), we kindly invite you to review our responses above. We hope we were able to address your primary concerns and clarify the novelty and contributions of our work. In particular, we hope we were able to clarify the distinction between the GAT architecture and the architecture proposed in this work. We'd be happy to address any further questions.

Sincerely,

The Authors

Comment

Thank you for your review. We aim to address each point raised in detail.

Below, we provide a summary of our responses:

  • Concern: relational attention is "very similar" to graph attention networks (GAT).
    • Response: This characterization is inaccurate: relational attention is a distinct mechanism from GAT. We provide a detailed explanation of the differences below.
  • Concern: experiments are performed on a small set of simpler tasks.
    • Response: We respectfully disagree. While our experiments include synthetic benchmarks to enable controlled evaluations with respect to previously-studied relational tasks, they also include complex real-world tasks such as image recognition and language modeling. Our experiments span a diverse range of task paradigms (sequence classification, sequence-to-sequence, autoregressive next-token prediction), data modalities (text and vision), and architectural variants (encoder-only, decoder-only, encoder-decoder, ViT-style). Our models go up to 1.3B parameters in size, and we include an analysis of scaling laws compared to standard Transformers.

Difference between relational attention and GAT

The relational attention sounds very similar to the graph attention network to me.

This characterization is inaccurate. The graph attention network (GAT) layer is essentially self-attention with a mask corresponding to graph neighborhoods. Thus, it is no more similar to relational attention than standard self-attention is. The only common feature between GAT and our proposed relational attention mechanism is that both involve computing attention scores (a feature shared with standard attention). We explain in more detail below.

The standard attention mechanism of Transformers (Vaswani et al. 2017) takes the form $h_i' = \sum_{j} \alpha_{ij} W_v h_j$, where $\alpha_{ij}$ are attention scores and $h_i$ are the hidden embeddings.

A GAT layer (Velickovic et al. 2018; Eq. 4) updates node embeddings at each layer via a similar operation: $h_i' = \sigma\big(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\big)$, where $\alpha_{ij}$ are attention scores computed similarly to the dot-product attention mechanism used in Transformers, and $\sigma$ is an optional non-linearity. The main difference is the attention mask representing the graph neighborhoods $\mathcal{N}_i$.

The relational attention mechanism proposed in our work is very different to both GAT and standard attention: $h_i' = \sum_{j} \alpha_{ij} \big(W_r\, r(h_i, h_j) + W_s s_j\big)$, where $r(\cdot, \cdot) \in \mathbb{R}^{d_r}$ is a learned relation function, $s_j \in \mathbb{R}^{d}$ is a "symbol vector" which "points to" object $j$, and $W_r, W_s$ are learned linear maps. Instead of attending to the embeddings $h_j$ of the objects in the context, relational attention attends to and retrieves learned relations $r(h_i, h_j)$ between the query object and the context objects. Here, $r(\cdot, \cdot)$ is modeled as a series of inner product comparisons under different feature projections.

(Note that we presented the single-head version for each of the three mechanisms for clarity and simplicity, but all have multi-head variants.)

As you can see, while graph attention networks have a similar form to standard Transformer attention, our proposal of relational attention bears little resemblance to graph attention networks. In particular, standard self-attention and GAT both only model a selection criterion that determines how to aggregate the neighbors' embeddings. In attention terminology, the values in standard self-attention and GAT are the feature embeddings of the neighbors. By contrast, in relational attention, the values are representations of the relations between the receiver (query object) and sender (context object). This is a fundamental difference, and is the key to our proposed architecture.
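To make the contrast concrete, here is a minimal single-head sketch in NumPy (toy dimensions, randomly initialized weights and symbols, no masking or multi-head logic). It is an illustration of the equations above, not the authors' implementation.

```python
# Minimal single-head sketch: standard attention vs. relational attention.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d, d_r = 5, 16, 4                       # sequence length, model dim, relation dim
H = rng.standard_normal((n, d))            # hidden embeddings h_1, ..., h_n
S = rng.standard_normal((n, d))            # symbol vectors s_1, ..., s_n ("pointers" to objects)

# Attention scores: computed the same way in both mechanisms
Wq_attn, Wk_attn = rng.standard_normal((2, d, d))
alpha = softmax((H @ Wq_attn) @ (H @ Wk_attn).T / np.sqrt(d), axis=-1)   # alpha[i, j]

# Standard attention: the values are the (sensory) object embeddings
Wv = rng.standard_normal((d, d))
out_standard = alpha @ (H @ Wv)            # h_i' = sum_j alpha_ij W_v h_j

# Relational attention: the values are relations r(h_i, h_j) tagged with a symbol
Wq_rel = rng.standard_normal((d_r, d, d))  # one projection pair per relation dimension
Wk_rel = rng.standard_normal((d_r, d, d))
Q = np.einsum('lab,ib->lia', Wq_rel, H)    # Q[l, i] = Wq_rel[l] @ h_i
K = np.einsum('lab,ib->lia', Wk_rel, H)    # K[l, j] = Wk_rel[l] @ h_j
R = np.einsum('lia,lja->ijl', Q, K)        # R[i, j, l] = <Wq_rel[l] h_i, Wk_rel[l] h_j>
Wr = rng.standard_normal((d_r, d))         # maps relation vectors to model dim
Ws = rng.standard_normal((d, d))           # maps symbols to model dim
values = R @ Wr + (S @ Ws)[None, :, :]     # values[i, j] = W_r r_ij + W_s s_j
out_relational = np.einsum('ij,ijd->id', alpha, values)   # h_i' = sum_j alpha_ij values[i, j]
```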

Official Review
Rating: 10

The paper describes a new architecture designed to make it easier for transformers to work with relational information. The authors provide several experiments that indicate the architecture outperforms standard transformers.

Strengths

This is a strong paper, and I recommend accepting it. The authors take a well-motivated challenge (helping transformers work with relational reasoning) and define a natural extension to the transformer architecture that attacks that challenge.

The basic idea of exchanging "relational" information seems solid, and the given architecture seems like a relatively simple way to achieve that, essentially adding one more bilinear function to the mix. Crucially, the new architecture decouples strength of attention from strength of relation (this distinguishes it from an earlier proposal known as relational cross attention). One might worry about the added parameters, but of course the authors are careful to control for that variable.

The experimental data is impressive, especially because the architecture seems to work on a broad set of tasks. It's interesting that even image recognition improves. (This might be a hint that the new architecture, although meant to represent relations, might have other benefits as well.) I appreciated the careful comparisons with baselines.

Overall, this seems like an important contribution to the literature.

Weaknesses

I believe there are some opportunities for improving the exposition of this paper. To begin with, "sensory" doesn't seem like the right metaphor. I realize the cognitive science origin, but I also think it's worth being careful with brain metaphors. I wonder if it would be better to talk in terms of "unary" vs. "binary" attention heads, or "first-order" vs. "relational," perhaps.

The first paragraph (and maybe much of the second) of the introduction seem unnecessary, and it might be possible to cut them entirely.

I found the first few explanations of the architecture confusing, and didn't really understand the "type" of r or symbols until I got to the explicit formulas. I wonder if it's worth making this a little more precise earlier.

The theorem in 2.4 gets very little play, and I'm not sure how important it is. I'd recommend either relegating this entirely to the appendix, or spending a bit more time explaining why it matters here. (One issue is that plain-vanilla transformers are computationally very powerful already, so it's not clear what this theorem adds.)

The second paragraph of section 3.1 seemed redundant, and might be able to be cut.

Figure 5 is potentially interesting, but I wonder if the story could be illustrated better by picking one layer, and showing attention for all the "normal" heads vs. the "relation" heads. I also wonder if there is any "low-hanging fruit" for other visualizations. For example, for training when the relation matrices are not constrained to be symmetric, do they ever end up learning to be near-symmetric? That said, this is a long paper already, and the authors explicitly mention interpretability as future work, so this is certainly an optional change!

As just mentioned—and I can't fault the authors for this—this paper has a huge amount of material, mostly in the appendices. All of that is good and necessary, but it's easy to miss details when reviewing, which is why I've put a relatively low confidence score in my review.

Questions

If you cut some of the text as suggested above, you might have more room for future work. Do you have thoughts about how this might apply to other architectures, such as graph neural nets?

Comment

Figure 5 is potentially interesting, but I wonder if the story could be illustrated better by picking one layer, and showing attention for all the "normal" heads vs. the "relation" heads. I also wonder if there is any "low-hanging fruit" for other visualizations. For example, for training when the relation matrices are not constrained to be symmetric, do they ever end up learning to be near-symmetric? That said, this is a long paper already, and the authors explicitly mention interpretability as future work, so this is certainly an optional change!

We'd certainly love to explore interpretability further, and your suggestions make sense! What we'd ultimately like to do is to compare the structure between three things: the attention scores in standard attention, the attention scores in relational attention, and the relations in relational attention. There are many interesting questions here, and we'd like to take the time to address them rigorously and quantitatively. For the current paper, however, we will also think about other possibilities to improve Figure 5 and provide a visualization of the learned relations.

As just mentioned—and I can't fault the authors for this—this paper has a huge amount of material, mostly in the appendices. All of that is good and necessary, but it's easy to miss details when reviewing, which is why I've put a relatively low confidence score in my review.

Totally understandable! Our goal was to make the main body of the paper as self-contained as possible, while including a more thorough presentation of the relevant details in the appendix. We think we will be able to further improve the presentation with your helpful feedback.

Please let us know if there are any details that we can help to clarify for you.


If you cut some of the text as suggested above, you might have more room for future work. Do you have thoughts about how this might apply to other architectures, such as graph neural nets?

The case of graph neural networks is interesting, and deserves an in-depth discussion. A key aspect in the case of graph neural networks is the distinction between edges on the graph (and possible features of these edges), and the relations between nodes (the terminology can be overloaded here sometimes). This can enable some interesting interaction between these two aspects, since in the standard message-passing paradigm for GNNs, the role of the edges is to control the flow of node information. Whereas the DAT considers fixed "graphs" (i.e., either fully-connected or causal), a GNN-variant of our proposal could enable some interesting interaction between the direction of information propagation (i.e., edges) and the relational content of the information being propagated.

Another interesting question is whether it is possible to integrate an analogous notion of relational processing in recurrent sequence models such as the recent SSM class of models, or if such relational processing is unique to attentional models like Transformers that have direct access to the entire context.



We'd like to thank you again for your thorough and thoughtful review and your many helpful suggestions! We really appreciate and value your thoughts and feedback, and we think it has helped us improve the exposition of the paper.

Comment

We'd like to sincerely thank you for your deep engagement with our work, and your thorough evaluation and valuable feedback! We are encouraged that you found the problem to be "well-motivated", the architectural proposal to be a "natural extension to the transformer architecture", and the experimental evaluation to be "impressive", covering a "broad set of tasks" with "careful comparisons with baselines".

We appreciate your many useful and thoughtful suggestions around presentation. We will carefully consider all of them.

Below, we respond to some of your comments.


The experimental data is impressive, especially because the architecture seems to work on a broad set of tasks. It's interesting that even image recognition improves.

This was interesting to us as well! As discussed in the paper, in the work on relational inductive biases that most influenced ours, empirical evaluation was mostly limited to synthetic benchmarks, similar to the relational games benchmark in Section 4.1 [see e.g., references 15-22 in the paper]. So it was an open question as to whether these ideas can yield improvements in complex (and messy) real-world tasks, like image recognition and language modeling. This was a big part of the motivation for our work, and the decision to build on the powerful Transformer framework. We were certainly encouraged to see that these relational mechanisms yield meaningful performance improvements across a range of complex tasks, while maintaining the generality of the Transformer architecture!


Crucially, the new architecture decouples strength of attention from strength of relation (this distinguishes it from an earlier proposal known as relational cross attention).

Yes! As you may have seen, we have a section in the appendix where we discuss the relationship between relational attention and RCA, and present some exploratory experiments considering the performance of a DAT variant with RCA. Interestingly, we find that although RCA performs comparably on the synthetic relational games experiments, it is significantly worse at language modeling: an RCA-based DAT performs seemingly identically to standard Transformers, and loses the improvement due to relational attention.


I believe there are some opportunities for improving the exposition of this paper. To begin with, "sensory" doesn't seem like the right metaphor. I realize the cognitive science origin, but I also think it's worth being careful with brain metaphors. I wonder if it would be better to talk in terms of "unary" vs. "binary" attention heads, or "first-order" vs. "relational," perhaps.

These are interesting suggestions. We take your point about the accuracy of the term "sensory". The term sensory refers to the fact that the values contain features of the objects. Although we like the cognitive science reference, we agree that such brain metaphors can sometimes be misleading. We will carefully consider your suggestions.


The first paragraph (and maybe much of the second) of the introduction seem unnecessary, and it might be possible to cut them entirely.

It's always helpful to get feedback on exposition and presentation. We will try to make the presentation more succinct.


I found the first few explanations of the architecture confusing, and didn't really understand the "type" of r or symbols until I got to the explicit formulas. I wonder if it's worth making this a little more precise earlier.

This is useful feedback! This was a concern for us while writing as well, and it's useful to have this confirmed. We will aim to revise accordingly.


The theorem in 2.4 gets very little play, and I'm not sure how important it is. I'd recommend either relegating this entirely to the appendix, or spending a bit more time explaining why it matters here. (One issue is that plain-vanilla transformers are computationally very powerful already, so it's not clear what this theorem adds.)

Thank you for this feedback on the exposition. Our intention for placing the theorem in the main text of the paper is to try to give some intuition about the class of functions that relational attention computes, which we thought might be helpful for readers who like more formal statements. In particular, the theorem aims to make clear how the attention criterion is decoupled from the relation being modeled.

But given your feedback, we will carefully think about the presentation of this Theorem, and perhaps either expand on its significance or move it to the appendix.

Official Review
Rating: 6

This paper introduces a novel attention mechanism by which transformers are encouraged to explicitly represent information about relations between elements of a sequence or set of tokens. This relational information is encoded separately from sensory information, or information about the specific entities that enter into a relation. This new dual attention transformer architecture is evaluated on a diverse series of modalities and tasks, demonstrating enhanced performance in language modeling, a simple visual classification task, and a toy relational reasoning dataset.

优点

The proposed mechanism is novel, yet is a natural extension of the attention mechanisms within a standard transformer. The idea of explicitly representing relations between entities is an important one that merits further investigation. Empirically, the DAT appears to be more data efficient than the standard transformer, which is especially important in the case of language modeling.

Weaknesses

While the work is interesting and the idea of building in inductive biases toward representing relational information is potentially useful, there are several framing and experimental issues that should be addressed. First, the claim that standard attention mechanisms only represent sensory information is empirically false. The authors themselves cite several works describing how attention in language models often captures syntactic information, which is inherently relational. Furthermore, recent work has explicitly found that finetuned ViTs represent sensory information in early layers, but represent abstract relational information in their later layers [1]. Finally, “register tokens” – tokens that encode little information about their local image patch, but instead represent global information about an image – have been discovered in large pretrained ViTs [2]. Their very existence challenges the notion that attention mechanisms are only transmitting “sensory information”. At the very least, the sensory information is not necessarily tied to the pixels and token embeddings that the activation vector appears to correspond to.

The proposed method has at least two separate important components: the representation of relational information, and the tying of key and query matrices. It appears very important to test a variant of the standard transformer subject to tied key and query matrices. For example, Figure 8 shows that removing this symmetry condition deteriorates the benefits of relational attention quite a bit, especially in low-data regimes. This should also be done for the experiments presented in section 4.3. Similarly, when using position-relative symbol assignment, a control that modifies the standard transformer with relative positional embeddings should also be included. This is especially important because prior work has demonstrated that these positional embeddings can aid transformers on synthetic generalization tasks [3].

In section 4.4, the authors suggest that relational attention might be especially useful in that it captures semantic, rather than syntactic, relations between words. However, this same phenomenon also happens in standard language models! A quick check with GPT2-small using the same stimulus reveals similar attention patterns for the token “model” (Layer 0, Head 0, verifiable using this colab: https://colab.research.google.com/github/neelnanda-io/TransformerLens/blob/main/demos/Main_Demo.ipynb) . This section should thus be revised to properly contextualize the utility of relational attention in contrast to existing models.

[1] Lepori, Michael A., et al. "Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects." arXiv preprint arXiv:2406.15955 (2024).
[2] Darcet, Timothée, et al. "Vision transformers need registers." arXiv preprint arXiv:2309.16588 (2023).
[3] Csordás, Róbert, Kazuki Irie, and Jürgen Schmidhuber. "The devil is in the detail: Simple tricks improve systematic generalization of transformers." arXiv preprint arXiv:2108.12284 (2021).

Questions

How are you generating attention scores for tokens after “model” and “state” in Figure 5? Is this not a causal language model?

Comment

Weight-tying and symmetry of relations

The proposed method has at least two separate important components: the representation of relational information, and the tying of key and query matrices.

We appreciate your thoughtfulness and attention to detail here.

First, we'd like to provide a couple points of clarification on weight-tying of key and query matrices:

  • Attention scores ($\alpha_{ij}$) are computed identically in both standard attention and relational attention, without weight-tying (i.e., $W_q^{attn} \neq W_k^{attn}$). The relations $r_{ij}$ are only present in relational attention, and this is where we sometimes experiment with weight-tying (i.e., $W_q^{rel} = W_k^{rel}$); a short identity after this list illustrates why this tying makes the relations symmetric.
  • Symmetry of $r_{ij}$ plays a minor role overall in the experiments, yielding performance improvements only in the relational games experiments (section 4.1), where symmetry aligns with task requirements (e.g., same/different relations). In the image recognition experiments (section 4.3), symmetry of $r_{ij}$ has no significant effect. Sections 4.2 and 4.3 do not use symmetric relations.
  • The importance of symmetry as an inductive bias in relational learning was discussed in prior work that considered the relational games benchmark as well, e.g., [Ref 20].
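For completeness, here is the one-line identity behind the symmetry point above (a sketch using the inner-product parameterization of $r(\cdot,\cdot)$ described earlier in this discussion, not text from the paper): tying the relation projections forces each relation coordinate to be symmetric in its arguments, while the attention scores, whose query/key maps remain untied, stay asymmetric.

```latex
\[
  \text{If } W_{q,\ell}^{rel} = W_{k,\ell}^{rel} = W_\ell \text{ for all } \ell, \text{ then}\quad
  r_\ell(x_i, x_j)
    = \langle W_\ell x_i,\, W_\ell x_j \rangle
    = \langle W_\ell x_j,\, W_\ell x_i \rangle
    = r_\ell(x_j, x_i),
  \qquad \ell = 1, \dots, d_r .
\]
```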

It appears very important to test a variant of the standard transformer subject to tied key and query matrices.

As explained above, the attention scores in both models are computed without weight-tying. Nonetheless, inspired by your question, we were curious to see what effect weight-tying the attention scores would have.

We carried out an additional set of experiments that evaluates models with symmetric attention scores, via weight-tying $W_q^{attn} = W_k^{attn}$. Note that it is possible to do this in either standard attention or relational attention, and this is distinct from weight-tying $W_q^{rel} = W_k^{rel}$. We found the effects of this to be mixed, but relatively small. In the standard Transformer, this resulted in a decrease in performance on the 'same', 'between', and 'match pattern' tasks, and an increase in performance in the 'occurs' and 'xoccurs' tasks. In the DAT model, this resulted in a decrease in performance across all tasks.

These results fit with our intuition that it is important to decouple the attention criterion from the relations.

This should also be done for the experiments presented in section 4.3.

Thank you for the suggestion. The paper includes an ablation over symmetry for the experiments in Section 4.3, described in Appendix C.3. We find that symmetry of relations in relational attention does not have a significant effect, and the performance difference is within the margin of error.


Positional encoding

Similarly, when using position-relative symbol assignment, a control that modifies the standard transformer with relative positional embeddings should also be included.

We'd like to clarify that we use the same positional encoding method in both the Transformer baselines and the DAT model in all experiments. Different positional encoding methods are used in different tasks (e.g., RoPE for language modeling, learned positional embeddings for ViT), but they are the same across different baselines within each experiment.

The symbols are separate from the positional encoding, serving a different purpose. Symbols are used only inside relational attention, whereas positional encoding is used in both standard attention and relational attention. Please see lines 268-272 for a description of how positional encoding is applied. For example, in positional encoding methods that are applied to the attention scores (e.g., RoPE), these are applied by modifying $W_q^{attn}, W_k^{attn}$, and are applied identically in standard attention and relational attention.

Recall that position-relative symbols, although related to position-relative encoding in that they encode position-relative information, are a distinct concept. For example, the position-relative bias of models like T5 modify the attention scores by adding a bias. Position-relative symbols do not touch the attention scores, and are instead part of the "values", serving as an annotation for the retrieved relations that refers or points to the source object in the relation.

Comment

Thank you for your review and your helpful comments. We appreciate your positive feedback regarding the novelty of our proposed architecture, how it's a natural extension of the Transformer framework, and the strength of the empirical results.

We aim to address each of your concerns in turn, and look forward to further discussion with you!

Below, we summarize our responses to the concerns you raised, with a more detailed response in the sections that follow:

  • Concern: terminology regarding propagation of sensory and relational information in standard attention vs relational attention.
    • Response: We clarify that the distinction lies in the type of information being propagated (the values), rather than the attention selection criterion (attention scores).
  • Concern: effect of weight-tying of query/key maps in relational attention in experiments of section 4.1.
    • Response: We clarify that attention scores are computed identically in both relational and standard attention, with weight-tying applied only to the relations ($W_q^{rel} = W_k^{rel}$). To ensure thoroughness, we conducted additional experiments with weight-tying in the attention scores ($W_q^{attn} = W_k^{attn}$), discussed in detail below.
  • Concern: The consistency of the use of positional encoding across baselines, and the relationship to symbol assignment mechanisms.
    • Response: We clarify that the same positional encoding method is used across different models in a given experiment. The symbol assignment mechanism is separate, and pertains to the values rather than attention scores.
  • Concern: linguistic interpretation of the relational representations learned in the DAT language models (Figure 5).
    • Response: We agree that the picture is more complicated than the brief discussion in the paper may suggest. We provide a more detailed discussion, and will make appropriate revisions to this section to reflect the underlying complexity.

Clarification about terminology

First, the claim that standard attention mechanisms only represent sensory information is empirically false. The authors themselves cite several works describing how attention in language models often captures syntactic information, which is inherently relational.

There is a subtle but crucial distinction here. Our claim here refers specifically to the information being propagated (i.e., the values), not the attention scores (see e.g., L118-120, L136-137, L147-148).

In standard attention, the attention scores can be interpreted as relations that define a selection mechanism for information retrieval. In language models, these scores often correlate with syntactic relations, as noted in our paper and as you note in the review. However, the crucial point is that in standard attention the values retrieved represent sensory information (object embeddings), not relational information. The attention score relations are computed as an intermediate step in an information retrieval operation, but the relations themselves are not explicitly represented in the updated embeddings. This observation has also been made in prior work, including [Ref 20,21].

By contrast, in relational attention, the values retrieved represent relations between the receiver (query object) and sender (context object). A key aspect of our proposal is that it decouples the relations used to model the attention scores from relations in the value embeddings. While standard attention and relational attention model the selection mechanism (i.e., attention scores) in the same way, the values retrieved are sensory (object embeddings) in the former but relational (a separate set of learned relations) in the latter.

Since this distinction is subtle, we will revise the paper to clarify this point early on, explicitly highlighting the distinction between attention scores and values in representing relational information.


"Abstract Relational Information" in deeper layers of ViTs

recent work has explicitly found that finetuned ViTs represent sensory information in early layers, but represent abstract relational information in their later layers ...

We thank you for pointing us to these interesting references.

We wish to clarify that our claim that standard attention is sensory while relational attention is relational refers only to a single layer of attention. We do not claim that it is impossible for a sufficiently-large Transformer model to learn relational representations, given enough data. Relational representations can emerge in deep Transformers, for example by composing multiple layers with MLPs that learn to disentangle objects and compute relations between them.

Rather, our claim is that incorporating explicit relational computational mechanisms (i.e., relational attention) enhances the efficiency and effectiveness of Transformers. This is supported by our experimental results, which demonstrate improved parameter efficiency and data efficiency.

Comment

Semantic vs Syntactic Relations in Attention Scores of Standard Transformers

We appreciate your engagement and attention to detail here! Thank you also for the specific reference.

We agree that much more exploration is needed to understand what types of relations the relational attention mechanism captures. The brief discussion on this in the paper reflects our initial qualitative observations, but a more thorough quantitative investigation is needed. We agree that the distinction is not as clear and simple as "purely syntactic" vs "purely semantic", and will revise the text in that section of the paper to emphasize the underlying complexity.

We make a few comments to clarify some key ideas and share our conceptual model for understanding the different types of circuits captured by relational attention and standard attention.

  • In standard attention, attention scores ($\alpha_{ij}$) model selection criteria (i.e., which token to attend to), but do not directly update the embeddings. In contrast, in relational attention, relations ($r(x_i, x_j)$) are used to update the embeddings directly and are distributed vector representations (not normalized like $\alpha_{ij}$).
  • While both $\alpha_{ij}$ and $r(x_i, x_j)$ can be viewed as "relations," their roles are fundamentally different: $\alpha_{ij}$ is a selection criterion, whereas $r(x_i, x_j)$ updates the receiver's embedding, forming more complex computational circuits.
  • The attention scores $\alpha_{ij}$ and relations $r_{ij}$ ought to be understood through their functional roles. For example, the presence of syntactic correlates in attention scores reflects the usefulness of syntax information (e.g., subject-predicate) as a selection criterion. Similarly, the relations in relational attention ought to be understood through the usefulness of retrieving a particular relation function (whether it is semantic or syntactic).
  • As you point out, attention scores in standard Transformer models (e.g., GPT2) can also reflect semantic relations, similar to the relations $r_{ij}$ in DAT depicted in Figure 5. We will revise Section 4.4 to contextualize this observation and more clearly highlight the distinct advantages and utility of relational attention.
  • We have created an interactive webapp for exploring the activations of trained DAT language models on different inputs. We hope this will allow people to develop intuitions about this new architecture, and facilitate follow-up work. A link will be included in the deanonymized version.

Question on Figure 5

How are you generating attention scores for tokens after “model” and “state” in Figure 5? Is this not a causal language model?

Yes, this is a causal language model. In Figure 5, we are plotting the relations $\mathbf{r}_{ij} = r(x_i, x_j)$, not the attention scores $\alpha_{ij}$ (which would be zero for $j > i$). Recall that the same relation function $r(\cdot, \cdot)$ is applied across all pairs of objects. While the relation to future objects will be masked out by the attention scores, we can still inspect $\mathbf{r}_{ij}$ for the purposes of interpretability.

Although this is explained in the caption (L507), we will make sure to emphasize this and clarify that these are not attention scores to avoid the confusion. This point of confusion may be part of some of your other concerns (e.g., on weight-tying or interpretation of attention scores).


Thank you again for your thoughtful review and your helpful comments. We believe your feedback has helped us improve the paper significantly. We hope to have addressed your concerns and answered your questions. Please let us know if there is anything else we can clarify or address. We look forward to your response and continued discussion.

Comment

I appreciate the detailed reply to my comments and concerns. I believe many of my main worries regarding experimental design and methodology were substantively addressed, though I am not very convinced by the arguments re: positional encodings. Just to clarify, I was looking for a means of injecting position-relative information into both standard transformers and DAT, such that one can dissociate the impact of injecting this information (at all) from the specific impact of injecting this information by using the proposed relational attention mechanism.

Regarding the sensory/relational distinction, it is crucial to revise the main paper to make explicit that standard attention is sensory only within a single layer, and that it is plausible that many of these relations may be captured by a sufficiently well trained transformer. In other words, the standard attention operation is only sensory insofar as the hidden state entering into the operation does not encode relational information.

I am raising my score to reflect the revisions and responses described above.

Comment

I believe many of my main worries regarding experimental design and methodology were substantively addressed

Thank you for your response and for engaging with us in this important discussion. We are glad that we were able to substantively address many of your main concerns.

though I am not very convinced by the arguments re: positional encodings. Just to clarify, I was looking for a means of injecting position-relative information into both standard transformers and DAT, such that one can dissociate the impact of injecting this information (at all) from the specific impact of injecting this information by using the proposed relational attention mechanism.

We appreciate your clarification and agree that this is a meaningful and valid concern. The main point we were trying to make in our earlier response is that positional information plays a different functional role in the symbol assignment mechanism compared to traditional positional encoding methods (more on this later). However, we agree with you about the importance of dissociating the impact of positional information in the symbols from the primary relational mechanisms.

We conducted additional ablative experiments to address this concern.


Additional Ablative Experiments

As you noted, the current version of the mathematics experiments (Sec 4.2) uses relative-positional symbols as the symbol assignment mechanism in our DAT model. We conducted additional experiments where we replace this with symbolic attention [L134-143] (i.e., the same symbol assignment mechanism used in the language modeling experiments). We present some preliminary results in the table below.

| Task | Model | Accuracy |
| --- | --- | --- |
| polynomials__expand | Transformer | 89.2 ± 0.5% |
| polynomials__expand | DAT | 93.3% (91.4 ± 0.9%) |
| polynomials_add | Transformer | 87.6 ± 0.2% |
| polynomials_add | DAT | 89.1% (88.7 ± 0.0%) |
| algebra__sequence_next_term | Transformer | 93.4 ± 2.0% |
| algebra__sequence_next_term | DAT | 98.8% (98.7 ± 0.3%) |

For DAT models, the number outside the parenthesis is the accuracy obtained for the newly-trained DAT model with symbolic attention, and the number in the parenthesis is the performance obtained by the model with relative-positional symbols (as reported in the original version of the paper).

We observe that DAT with symbolic attention performs no worse than the version with relative-positional symbols, suggesting that the performance improvement is primarily a result of the relational computational mechanisms rather than any positional information that may be injected by the symbols.

These results are from some initial experimental runs with 4-layer models. Running the remaining experiments (including multiple trials for each configuration to compute confidence intervals as with the current results) will take a few days. These ablative experiments will be added to the final version of the paper. We thank you for the suggestion, and believe that this helps improve the paper by further supporting the main claims and dissociating the impact of positional information in the symbols from the primary relational mechanisms.

Comment

Further discussion on functional role of symbol assignment mechanism

The main point we were trying to make in our earlier response is that positional information plays a different functional role in the symbol assignment mechanism compared to traditional positional encoding methods.

In particular, relational attention has the (simplified) form $\sum_j \alpha_{ij}(\mathbf{r}_{ij} + s_j)$. This updates an object's hidden state with information that says "I have the relation $\mathbf{r}_{ij}$ with the object referred to by the symbol $s_j$". Without the symbols, the receiver does not know the identity of the object that the relation $\mathbf{r}_{ij}$ involves. When the symbol assignment mechanism is relative-positional, $s_j$ encodes the relative position $j-i$ such that the hidden state is updated with the information "I have the relation $\mathbf{r}_{ij}$ with the object $j-i$ positions away from me". We think of the symbol assignment mechanism (whether positional, relative-positional, or symbolic attention) as playing a supporting role in relational attention to identify or "point to" the object involved in the relation; the primary computation lies in the relations $\mathbf{r}_{ij} \in \mathbb{R}^{d_r}$.

By contrast, relative positional encoding methods typically inject positional information into the attention scores by adding a bias based on relative position. For example, T5-style relative positional encoding adds a learned bias $b_{j-i}$, ALiBi adds a fixed bias $m_h \cdot |j - i|$, and RoPE rotates the query and key vectors proportionally to relative position, $q_i^\top R(j-i) k_j$. Note that relative-positional symbols do not modify the attention scores, and a separate positional encoding method would be necessary in order to attend based on position.
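As a small illustration of this functional difference (a hedged sketch with made-up toy dimensions, not code from the paper or any specific library): a T5-style relative bias changes where attention looks by modifying the pre-softmax scores, while a relative-positional symbol leaves the scores untouched and instead tags the retrieved relation with the offset on the value side.

```python
# Toy sketch: relative-position bias (modifies scores) vs. relative-positional
# symbols (modify the retrieved values). Dimensions and weights are made up.
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 16
offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # offsets[i, j] = j - i

# (a) T5-style relative positional encoding: a learned bias added to the attention scores
scores = rng.standard_normal((n, n))                       # pre-softmax logits q_i^T k_j
bias = rng.standard_normal(2 * n - 1)                      # one learned bias per offset
scores_with_bias = scores + bias[offsets + n - 1]          # where attention looks now depends on position

# (b) Relative-positional symbols: a learned symbol per offset, entering on the value side
symbol_library = rng.standard_normal((2 * n - 1, d))       # stands in for W_s s_{j-i}, one per offset
S_rel = symbol_library[offsets + n - 1]                    # S_rel[i, j] = symbol for offset j - i
R = rng.standard_normal((n, n, d))                         # stands in for W_r r_ij (already in model dim)
values = R + S_rel                                         # retrieved content: relation + positional "pointer"
# The attention scores in (b) are untouched; position only annotates what is retrieved.
```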

Although we think this distinction is important, we agree with you that relative-positional symbols do inject positional information into the model (in some form), and it would be useful to know whether this alone accounts for the performance improvements, or if the relational attention mechanism is useful more generally. This is precisely what we aimed to understand in the ablative experiments above.

Presentation/Exposition

Regarding the sensory/relational distinction, it is crucial to revise the main paper to make explicit that standard attention is sensory only within a single layer, and that it is plausible that many of these relations may be captured by a sufficiently well trained transformer.

We agree about the importance of this distinction, and will make sure to emphasize it in the revised version of the paper.


Thank you again for your engagement throughout the discussion period and for your constructive feedback. We hope we were able to address your final concern regarding ablating relative-positional information in the symbol assignment mechanism.

Official Review
Rating: 3

This paper proposes a modification of multi-head attention to better capture the relational information between objects. The proposed alternative is named dual attention, which concatenates the results of self-attention and the results of a new module named relational attention. Here, relational attention is similar to self-attention, except that the value to aggregate is replaced by a weighted sum between a relation vector computed from each pair of objects and a symbol vector. Experiments on synthetic and real data demonstrate data efficiency and parameter efficiency.

Strengths

  • The paper is well written and easy to follow
  • experiments are diverse in domains and support the main claims

Weaknesses

  • The idea does not seem very novel or original. There are many attempts to integrate relational information into the attention mechanism in the Graph Neural Network community, with the closest one I can find being "Learning Graph Representations Through Learning and Propagating Edge Features" (https://ieeexplore.ieee.org/document/10004977). Specifically, Eq. 2 directly gives the general form of the proposed relational attention. This previous work goes on with slightly different parametrization of f and g, i.e. concatenation instead of dot product etc, but has an overall very similar central idea. This paper should at least cite this line of work and compare against them as baselines.

Questions

Are there experiments with the learned Symbolic Attention?

Comment

Contributions of this work

Finally, we would like to remind the reviewer of our main contributions in this work:

  • The proposal of a neural mechanism for routing and processing relational information within the Transformer framework. The proposed relational attention mechanism is based on an attentional operation that selectively retrieves relational information from the context, tagging it with symbolic identifiers.
  • The proposal of the dual attention mechanism, a variant of multi-head attention with two distinct types of attention heads, enabling routing and processing of both sensory information and relational information.
  • The proposal of a corresponding extension to the Transformer framework called the Dual Attention Transformer (DAT), which integrates sensory and relational information in a unified architecture. The strength of this framework is that it is as general as the Transformer framework, while yielding benefits in flexibility, data efficiency, and parameter efficiency. In particular, it supports all architecture variants of the standard Transformer (e.g., encoder-only, decoder-only, encoder-decoder, ViT, etc.), and can be applied across a diverse range of tasks and data modalities.
  • We evaluate the proposed DAT architecture on a diverse set of tasks ranging from synthetic relational benchmarks to complex real-world tasks such as language modeling and visual processing, demonstrating notable improvements across all tasks. This in particular includes an analysis of scaling laws, which demonstrate greater data efficiency and parameter efficiency at large scales.


Thank you for your review. We hope we were able to clarify the novelty of our architectural proposals. Please let us know if we have addressed your concerns or if you have any remaining concerns. We look forward to your response.

Comment

Dear reviewer,

Thank you again for your review. As the discussion period is coming to an end (Dec 2nd), we kindly invite you to review our responses above. We hope we were able to address your primary concerns and clarify the novelty and contributions of our work. In particular, we hope the discussion above clarifies the differences in scope, approach, and application between our work and the paper you mentioned. We'd be happy to address any further questions.

Sincerely,

The Authors

Comment

Zhang et al.'s architectural proposal is different

Specifically, Eq. 2 directly gives the general form of the proposed relational attention.

It is important to note that Eq. 2 is not a concrete architectural proposal, but rather a generic formulation of the problem of propagating edge features or relational information. The problem of processing relational information is a fundamental one, and arises in many settings, including graph representation learning and sequence modeling, even though these settings are distinct.

This previous work goes on with slightly different parametrization of f and g, i.e. concatenation instead of dot product etc, but has an overall very similar central idea.

As the review recognizes, Zhang et al.'s architectural proposal is distinct from ours. However, we'd like to emphasize that the difference is much more fundamental than the review suggests.

The proposal of Zhang et al. is an operation which processes a graph consisting of a collection of nodes $\{x_i\}_{i \in \mathcal{N}}$, edges $\mathcal{E} \subset \mathcal{N} \times \mathcal{N}$, and edge features $\{e_{uv} : (u,v) \in \mathcal{E}\}$. They propose updating the edge features by applying a linear map to the concatenation of the initial edge features and the pair of node features: $e_{uv}' = V \cdot \mathrm{concat}(x_u, e_{uv}, x_v)$. This is then aggregated by an attention mechanism (though not the standard dot-product attention typically used in Transformers).

In contrast, our relational attention mechanism is defined as:

$$
\begin{aligned}
\mathrm{RelAttn}(x, (y_1, \dots, y_n)) &= \sum_{i} \alpha_{i}(x, \boldsymbol{y}) \big(W_r\, r(x, y_i) + W_s s_i\big), \\
r(x, y_i) &= \big(\langle W_{q,\ell}^{rel}\, x,\ W_{k,\ell}^{rel}\, y_i \rangle\big)_{\ell \in [d_r]}, \\
(s_1, \dots, s_n) &= \mathrm{SymbolRetriever}(\boldsymbol{y};\, S_{lib}),
\end{aligned}
$$

where $\alpha_i(x, \boldsymbol{y})$ are dot-product attention scores, $r(x, y_i) \in \mathbb{R}^{d_r}$ is a relation function parameterized as a series of inner product comparisons under different learned feature projections, and $s_i$ is a vector which acts as a pointer or reference to the object $y_i$ the relation is with.
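A toy side-by-side sketch of the two update rules, based only on the formulations as summarized in this response (made-up dimensions; not code from either paper):

```python
# Contrast: Zhang et al. update a *given* edge feature by a linear map of a
# concatenation, whereas relational attention *computes* a relation as inner
# products of learned projections, with no edge features in the input.
import numpy as np

rng = np.random.default_rng(0)
d, d_e, d_r = 8, 4, 3
x_u, x_v = rng.standard_normal((2, d))      # node / object features
e_uv = rng.standard_normal(d_e)             # input edge feature (exists only in the GNN setting)

# Zhang et al. (as described above): linear map over concatenated node and edge features
V = rng.standard_normal((d_e, d + d_e + d))
e_uv_updated = V @ np.concatenate([x_u, e_uv, x_v])

# Relational attention (this work): inner products under learned feature projections
Wq_rel = rng.standard_normal((d_r, d, d))
Wk_rel = rng.standard_normal((d_r, d, d))
r_uv = np.array([(Wq_rel[l] @ x_u) @ (Wk_rel[l] @ x_v) for l in range(d_r)])
```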

To summarize, we'd like to highlight some key differences between the two architectural proposals and unique aspects of our proposal:

  • Setting: Zhang et al.'s proposal is a GNN that operates over graph inputs with edge features euve_{uv} as part of the input. We tackle sequence modeling within the Transformer framework, with no graph or edge features as input.
  • Relation modeling: The way that the relations are modeled differs fundamentally. Modeling relations as inner products of learned feature maps is a key aspect of our architectural design; it enables computing explicit comparisons between objects. In contrast, Zhang et al. model updated edge features as a linear map applied to the pair of node features and the initial edge features (which are an input to the model).
  • Symbol assignment mechanisms: The use of symbol assignment mechanisms, serving as pointers to objects in relational processing, is unique to our model and to the sequence modeling setting (as opposed to graph processing).
  • Dual attention mechanisms: Our proposal includes dual attention, a variant of multi-head attention with both sensory and relational processing mechanisms. This is a novel contribution of our work.
  • Dual Attention Transformer: The proposal of a corresponding extension to the transformer framework, the Dual Attention Transformer (DAT), is a novel contribution of our work.

This paper should at least cite this line of work and compare against them as baselines.

Thank you for pointing out this related line of work. We agree that citing and discussing these papers will strengthen our discussion of related models, and we will add an expanded related work section to do so.

Question on Symbolic Attention

Questions: Are there experiments with the learned Symbolic Attention?

Yes, the language modeling experiments of section 4.4 use Symbolic Attention. By interpreting symbolic attention as a learned differentiable equivalence class over embeddings, we posit that symbolic attention learns to represent semantic structures, perhaps analogous to synsets. We are excited to explore this further in future work as part of a broader mechanistic interpretability investigation.

Comment

Thank you for your review, and for your positive comments regarding the presentation of the paper and the strength and diversity of our experiments.

We appreciate the reviewer's reference to the Zhang et al. paper. Our work shares certain high-level similarities with their proposal in the sense that both approaches seek to integrate relational representations into neural models. However, we'd like to highlight that our approach is distinct in both scope and application: While Zhang et al. address the propagation of edge features in GNNs for graph representation learning, our focus is on integrating relational representation learning specifically within the Transformers framework, targeting sequence modeling tasks (e.g., language modeling).

Below, we will provide clarification on the goal of our work, and a detailed discussion on the distinction between our work and Zhang et al. We will also expand the related work section and incorporate a more detailed discussion of relevant work from the GNN community, including the work of Zhang et al.


Clarification of goals and setting

We'd like to clarify the goals of our paper and the setting we are targeting.

What this paper is about: Introducing explicit relational computational mechanisms into the Transformer framework, to form an architecture that integrates sensory and relational processing. Our focus is specifically on the Transformer architecture. The goal is to enhance data efficiency and enable greater flexibility through new types of computational circuits that compose sensory and relational computation.

What this paper is not about: Graph neural networks, or neural models operating on graph-structured data. We are not tackling the problem of propagating edge features along graphs in GNNs. We'd like to emphasize that the term "relation" in our work does not refer to edges on the graph, but rather refers to internal feature representations that capture comparisons between objects in the input.


Discussion of Zhang et al. (2024)

We would like to provide clarification on:

  1. The differences in the overall goal and setting between Zhang et al and our work
  2. The differences in proposed architectures

Zhang et al. has different goals and tackles a different setting: Graph Representation Learning vs Sequence Modeling

Our work studies integrating relational computational mechanisms in Transformers, while Zhang et al. studies integrating edge features into the message-passing operation of GNNs specifically. Although Transformers and GNNs can be linked (by viewing Transformers as GNN-variants operating on a fully connected graph), they are distinct architectural paradigms that tackle a different class of tasks and have various differing considerations.

The work by Zhang et al. specifically focuses on propagating edge features within a GNN through a message-passing paradigm. Here, edge features are a core part of the input (along with graph edges and node features) and are propagated along graph edges during the message-passing operations. For example, Zhang et al. applies this to molecular graphs, where the edge features are bond types. Their approach is structurally and conceptually distinct from the mechanisms we develop within the Transformer framework.

One easy way to see this is by noting the difference in the experimental settings tackled by each paper.

Zhang et al.:

  • molecular graphs (ZINC): graph regression
  • macromolecular graphs (PROTEINS, ENZYMES): graph classification
  • synthetic benchmarks generated by stochastic block models (PATTERN, CLUSTER): node classification

Our work:

  • visual relational reasoning: classification (encoder-only)
  • mathematical problem-solving: seq2seq (encoder-decoder)
  • image recognition: classification (ViT-style)
  • language modeling: autoregressive (decoder-only)

These are distinct settings: Zhang et al. tackles graph representation learning, focusing on graph-structured data such as molecular graphs, whereas we tackle sequence modeling in the Transformer framework.

Comment

Dear all,

We would like to thank the reviewers for their reviews and thoughtful feedback, which has helped us to further improve the paper. This message summarizes the strengths and concerns raised in the reviews, our responses during the rebuttal, and the revisions made to improve the paper.


Summary of Strengths

  • Novel and well-motivated architectural proposal
    • "The authors take a well-motivated challenge (helping transformers work with relational reasoning) and define a natural extension to the transformer architecture that attack that challenge." (mxrQ)
    • "The proposed mechanism is novel, yet is a natural extension of the attention mechanisms within a standard transformer." (YUpf)
  • Strong and diverse experimental evaluation
    • "The experimental data is impressive, especially because the architecture seems to work on a broad set of tasks. [...] I appreciated the careful comparisons with baselines." (mxrQ)
    • "Empirically, the DAT appears to be more data efficient than the standard transformer, which is especially important in the case of language modeling." (YUpf)
    • "experiments are diverse in domains and supporting main claims" (qVFZ)
  • Clear and effective exposition
    • "The paper is written clearly and easy to follow." (r4cd)
    • "The paper is well written and easy to follow" (qVFZ)

Summary of Concerns and Responses

Below, we will summarize each review separately, highlighting the concerns raised and summarizing our responses and revisions. We refer to our individual responses to each reviewer for further details.


Reviewer mxrQ

We deeply appreciate reviewer mxrQ's enthusiasm for our work and their thoughtful, detailed feedback.

The primary concerns raised relate to exposition and presentation. We are especially grateful for the reviewer’s specific and constructive suggestions, which will significantly enhance the clarity and quality of the final version of the paper.


Reviewer YUpf

We'd like to thank reviewer YUpf for their detailed review and constructive feedback. We are grateful for the reviewer's response to our rebuttal, in which they stated that many of their main concerns were "substantively addressed".

  1. Concern: Impact of weight-tying of query/key maps in experiments of section 4.1
    • Response: We provided additional ablative experiments to address this specific question.
  2. Concern: Role of positional information in symbol assignment
    • Response: Additional experiments using a symbol assignment mechanism without position-relative encoding (symbolic attention) confirmed our conclusions.
  3. Concern: Clarification on terminology
    • Response: We point to our detailed response, where we clarify terminology and discuss the references mentioned by the reviewer.

The two remaining reviews were short, and we unfortunately did not receive a response to our rebuttal from the reviewers. We summarize our responses below, which we hope clarify and address their main concerns.


Reviewer r4Cd

  1. Concern: "There is limited novelty. The relational attention sounds very similar to the graph attention network to me."
    • Response: This characterization is inaccurate. The only feature that GAT and our proposed relational attention mechanism have in common is the computation of attention scores, which standard Transformer attention shares as well.
  2. Concern: "The experiments are only performed on a small set of simpler tasks."
    • Response: Our empirical evaluation includes complex real-world tasks, such as image recognition and language modeling. Models with up to 1.3 billion parameters were evaluated to analyze scaling laws. These aspects were recognized as strengths by all other reviewers.

Reviewer qVFZ

  1. Concern: The review questions the novelty of our work, citing a paper from the graph neural network community, "Learning Graph Representations Through Learning and Propagating Edge Features" by Zhang et al. (2023), which studies edge feature propagation in GNNs.
    • Response: We point to our response to the review for a detailed discussion on the differences in scope, setting, and methodology between our work and Zhang et al. Here, we summarize the main points:
      1. The two works study different problems: relational reasoning in Transformers vs. edge feature propagation in GNNs.
      2. The architectural proposals are distinct (see rebuttal for detailed discussion).
      3. The setting and application scope are distinct: sequence modeling within the Transformer framework (e.g., language modeling) vs graph processing within the GNN framework (e.g., molecular graph classification).

We hope our responses and revisions effectively address the concerns raised.

Best Regards,

Authors

AC Meta-Review

This paper proposes to add a relational attention mechanism to the transformer architecture in order to improve relational reasoning performance. Here, the added relational attention mechanism is similar to self-attention, except that the value to aggregate is replaced by a weighted combination of a relation vector, computed from each pair of objects, and a symbol vector. Experiments on synthetic and real data demonstrate data efficiency and parameter efficiency.
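For concreteness, the aggregation described in this summary can be sketched roughly as follows: attention weights are computed from queries and keys as usual, but the value aggregated for the pair (i, j) combines a projected relation vector r_ij with a projected symbol s_j. The form of r_ij (inner products under several learned projections), the parameter names, and all dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationalAttentionSketch(nn.Module):
    """Standard attention weights, but the aggregated value for pair (i, j)
    combines a projected relation vector r_ij with a projected symbol s_j."""

    def __init__(self, d_model: int, d_rel: int = 8):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.rel_q = nn.Linear(d_model, d_model * d_rel)   # pairwise relation "queries"
        self.rel_k = nn.Linear(d_model, d_model * d_rel)   # pairwise relation "keys"
        self.w_r = nn.Linear(d_rel, d_model)               # map relation vector to model space
        self.w_s = nn.Linear(d_model, d_model)             # map symbols to model space
        self.d_rel = d_rel

    def forward(self, x: torch.Tensor, symbols: torch.Tensor) -> torch.Tensor:
        # x, symbols: (B, T, d_model); symbols could come e.g. from symbolic attention.
        B, T, D = x.shape
        attn = F.softmax(self.q(x) @ self.k(x).transpose(-1, -2) / D ** 0.5, dim=-1)  # (B, T, T)

        rq = self.rel_q(x).view(B, T, self.d_rel, D)
        rk = self.rel_k(x).view(B, T, self.d_rel, D)
        r = torch.einsum("bird,bjrd->bijr", rq, rk) / D ** 0.5   # (B, T, T, d_rel) relation vectors

        values = self.w_r(r) + self.w_s(symbols).unsqueeze(1)    # (B, T, T, d_model)
        return torch.einsum("bij,bijd->bid", attn, values)       # (B, T, d_model)


ra = RelationalAttentionSketch(d_model=64)
x = torch.randn(2, 10, 64)
symbols = torch.randn(2, 10, 64)
out = ra(x, symbols)   # (2, 10, 64)
```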

Strengths: The idea of explicitly representing relations between entities is an important one that merits further investigation. Experimental results are very promising; in particular, the method appears to be more data efficient than the standard transformer, which is especially important in the case of language modeling.

Weaknesses: Novelty is overstated, and the paper makes little effort to relate its work to the literature (especially the prior work on graph networks and message passing networks).

This is a difficult paper to decide on. While reviewer mxrQ is clearly excited about the work, their score of 10 seems disproportionately high. Being somewhat familiar with the graph neural networks literature, I tend to agree with the criticisms of reviewers r4Cd and qVFZ (scores 5 and 3), who point to a lack of novelty, but their reviews are very short and do not seem sufficiently thorough. The results are certainly promising and, in my opinion, worth publishing. At the same time, I am unable to ignore the lack of scholarship in the paper and the insistence of the authors that their method has nothing to do with graph networks. This seems especially strange since they themselves introduce their method as a form of message passing.

Overall I recommend rejecting this paper and would encourage the authors to make an effort to situate their work relative to the literature on message passing networks.

Additional Comments from the Reviewer Discussion

Both reviewers r4Cd and qVFZ criticise a lack of novelty and point to papers about graph neural networks as relevant prior work. The authors respond by explaining in detail why their work has nothing in common with these papers except attention scores. They mostly point to very different applications and emphasize that their work is meant to operate on language and vision rather than on graphs. In my opinion, this completely misses the point. The applications are indeed different, but the motivation and the methods share many commonalities. Transformers have been applied to graphs and graph neural networks to vision. None of this is acknowledged in the paper.

See also metareview above.

Final Decision

Reject