Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Hierarchy
We mathematically generalize the Softmax attention mechanism to hierarchical and multi-modal settings.
Abstract
Reviews and Discussion
This paper derives self-attention as a gradient descent step to minimise the variational upper bound of entropy and formulates a new hierarchical self-attention by changing its energy function. There are experiments comparing this new hierarchical self-attention with the standard self-attention on multi-modal data and zero-shot transfer learning.
Strengths
- The notation is clear and the problem is well motivated.
- The authors make an effort to provide background information, which could help those unfamiliar with the topic to understand the context.
Weaknesses
- The derivation of self-attention does not consider the value matrix, which is a crucial component. In addition, another paper [1] has already derived attention from the update rule of Hopfield networks, which is fairly similar to this work.
- The formulation of hierarchical self-attention (HSA) is unclear and key methodological steps are missing, making it difficult to fully understand the approach.
- The experiments are limited and do not provide strong evidence of the method's effectiveness. They are performed on small datasets like IMDB, and it is not convincing that the chosen datasets have a hierarchical structure that requires a new attention mechanism. The multi-modal experiment shows only marginal improvements, and there is a significant drop in accuracy on the zero-shot transfer learning.
- While HSA may use fewer FLOPs, it is doubtful that the actual run time is lower, given the algorithm used in HSA.
[1] Ramsauer, Hubert, Bernhard Schäfl, Johannes Lehner, Philipp Seidl, Michael Widrich, Thomas Adler, Lukas Gruber et al. "Hopfield networks is all you need." arXiv preprint arXiv:2008.02217 (2020).
Questions
- What is the novelty in this paper's derivation of self-attention in comparison to Hopfield networks? Why are the differences between the gradient update in the entropy minimisation and self-attention not important?
- Please clarify, clearly and simply, the exact steps required for HSA, and provide the run time during training and inference.
- Can HSA be applied to larger and more complex standard datasets like ImageNet or C4/WikiText-103? And can it be shown that these datasets benefit from using a hierarchical structure?
We appreciate the reviewer's feedback and hope we can clarify some of the points raised here.
Re: the lack of value projection: first of all, please note that in both our proposed Algorithms 1-3 as well as all of our experiments we have indeed incorporated value projections. Furthermore, in order to justify the incorporation of value projections within our theoretical derivation, we have added a new appendix (Appendix C in the new draft) explaining how value projections theoretically emerge in our statistical mechanical interpretation of self-attention. (TL;DR: value projections emerge as a learnable step-size in the gradient update in Eq. (6), realized as gradient projections.)
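To give a schematic sense of that argument within this thread (the precise statement is in Appendix C and Eq. (6) of the paper; the notation below is purely illustrative and not the paper's exact formulation): promoting the scalar step size of the entropy-minimization gradient step to a learnable linear map recovers the familiar value vectors.

```latex
% Illustrative sketch only; see Appendix C and Eq. (6) for the exact statement.
\begin{align*}
  x_i &\leftarrow x_i + \eta \sum_j \operatorname{softmax}_j\!\left(q_i^\top k_j\right) k_j
      && \text{(scalar step size $\eta$)} \\
  x_i &\leftarrow x_i + \sum_j \operatorname{softmax}_j\!\left(q_i^\top k_j\right) W_V\, k_j
      && \text{(learnable step size $W_V$, i.e.\ values $v_j = W_V k_j$)}
\end{align*}
```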
Re: "key methodological steps are missing" We would appreciate more concrete feedback here as to which exact steps are missing, so we can further explain them in the next version of the draft. We have done our best to clearly illustrate our methodological contributions, mathematical derivations and the empirical study in the main paper as well as the appendices, but we'd like to know how we can further improve the presentation.
Re: the scale of datasets We have added a new set of experiments in Appendix I in the new draft with larger datasets and models, and the results are still consistent with what was reported in Sec 5.1 and 5.2.
Re: whether these datasets have hierarchical structure or not Please note that written text in general has a semantically meaningful hierarchical structure (words, sentences, paragraphs, sections, etc.), which can be seen as an abstraction of semantics at different levels and can therefore be exploited by the HSA framework regardless of which dataset it comes from.
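As a toy illustration of such a nested structure (using a hypothetical nested-dictionary encoding purely for exposition; this is not the data structure used in our codebase):

```python
# Hypothetical nested-signal encoding of a short document (illustration only).
document = {
    "level": "document",
    "children": [
        {
            "level": "paragraph",
            "children": [
                {"level": "sentence", "children": ["The", "cat", "sat", "."]},
                {"level": "sentence", "children": ["It", "purred", "."]},
            ],
        },
        {
            "level": "paragraph",
            "children": [
                {"level": "sentence", "children": ["Then", "it", "slept", "."]},
            ],
        },
    ],
}

def count_leaves(node):
    """Count the word-level tokens (leaf nodes) of the nested signal."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child in node["children"])

print(count_leaves(document))  # 11 leaf tokens across 3 sentences and 2 paragraphs
```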
Re: "The multi-modal experiment only has marginal improvements" The improvement in the multi-modal experiment is indeed statistically significant.
Re: "drop in accuracy on the zero-shot transfer learning." First of all, please note that the experiments reported in Sec 5.3 are not transfer learning; we are not changing the task here. Second, these experiments are designed to examine the practical implications of Theorem 1 by measuring how much accuracy drops when the standard attention matrix is replaced by the closest hierarchical attention matrix to it (in terms of the KL-divergence) in a completely zero-shot manner. And while the performance gap is more pronounced for certain tasks/layers, for others it is surprisingly narrow, which indicates an inherent redundancy in the attention calculation and HSA's ability to exploit it, even in a fully zero-shot setting.
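For intuition only, the toy sketch below illustrates the kind of replacement we mean: within-subtree attention weights are kept, each cross-subtree block is collapsed to a single shared weight, and rows are renormalized. The aggregation shown here is a simple mean, not the KL-optimal construction characterized by Theorem 1, so this snippet should be read as an illustration rather than our implementation.

```python
import numpy as np

def blockify_attention(A, groups):
    """Toy illustration: keep within-subtree attention, collapse each
    cross-subtree block of weights to one shared value, then renormalize
    rows so each row remains a probability distribution.  (The KL-optimal
    aggregation of Theorem 1 may differ from this simple mean.)"""
    B = A.copy()
    for i, rows in enumerate(groups):
        for j, cols in enumerate(groups):
            if i != j:  # leaves belonging to two non-overlapping sub-trees
                B[np.ix_(rows, cols)] = A[np.ix_(rows, cols)].mean()
    return B / B.sum(axis=1, keepdims=True)

# Four tokens partitioned into two sub-trees {0, 1} and {2, 3}.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4))
A = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
A_hier = blockify_attention(A, groups=[[0, 1], [2, 3]])
print(np.round(A_hier, 3))
```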
Re: novelty of deriving self-attention compared to Hopfield networks The emergence of self-attention within Hopfield networks and within our entropy minimization framework are indeed quite similar, as both are different forms of energy minimization. However, please note that the derivation of the standard Softmax attention is not the main contribution of our work. We have simply used the energy-minimization derivation of standard Softmax attention as the vehicle for extending Softmax attention to nested signals, an extension which we have further shown to be optimal in terms of the KL-divergence. To the best of our knowledge, this theoretical approach to deriving hierarchical self-attention is novel.
Re: runtime vs FLOPs The HSA algorithm presented in Algorithms 1-3 is the theoretical depiction of our dynamic programming methodology. In practice, however, we have implemented the parallelized version of the algorithm in our codebase, which we intend to release as well. Nevertheless, there is a constant overhead to HSA which masks the runtime gains for the small- and moderate-scale problems reported in the paper. However, as we have already observed for auto-regressive generation tasks with much larger models, the runtime difference as well as the memory-footprint gap becomes quite pronounced. We defer these results to future work, where we will delve comprehensively into hierarchical auto-regressive generation.
Re: "Why are the differences between the gradient update in the entropy minimisation and self-attention not important?" They are, but those differences are somewhat orthogonal to the HSA operation itself. As we have shown in Sec 5.1, 5.2 vs Sec 5.3, HSA can be applied in both architectural settings.
Re: "clarify clearly and simply the exact steps required for HSA" The exact steps to perform HSA are detailed in Algorithm 1-3 and Eq. 15 in Appendix D.
Re: "Can HSA be applied to larger and more complex standard datasets like ImageNet or C4/WikiText-103? And can it be shown that these datasets benefit from using a hierarchical structure?" HSA can be applied to a large variety of single-modal or multi-modal datasets, and in most cases, due to the introduction of the block constraint, the computational complexity of the attention calculation will drop. However, whether the introduction of hierarchy benefits the model statistically is a different question, whose answer heavily depends on the dataset and the end task. For example, for textual data, since written language has a semantically meaningful abstraction of meaning through words, sentences, paragraphs, etc., the incorporation of that hierarchy via HSA is likely to result in performance improvement. However, for image data, the abstraction of meaning through hierarchical patching of the pixels is not always semantically meaningful, which means that the success of HSA in improving the statistical complexity in such cases heavily depends on the task.
Dear Reviewer Lkye,
Thank you again for raising important questions and concerns. We hope that our new version of the draft, with the newly added experiments and appendices, addresses most of your concerns. We would appreciate it if you could let us know whether you have any further concerns or questions, as the discussion phase is ending soon.
Best regards, The authors
Thank you to the authors for their efforts in the rebuttal; I appreciate the additions included in the revised manuscript and the new experimental results. From your clarifications, I do believe that there is a novel contribution in extending the energy minimization technique to derive a hierarchical self-attention, and the theoretical justifications are good.
However, my concerns remain about the extensiveness of the experiments, which I do not feel provide sufficient empirical evidence in support of HSA. Further, the possible increase in run time is another disadvantage of the method.
Due to these reasons, I increase my score to 5 but not any further. Lastly, I sincerely apologise for my late reply.
We appreciate your rigorous feedback and comments, which have further improved our work. Also, thank you for recognizing the theoretical contribution of our work and increasing your score as a result. We hope to add extra experiments to the final draft in order to further enrich the empirical side of the paper.
In this paper, the authors aim to obtain a hierarchical self-attention mechanism. They introduce the new concept of a nested signal and use the notion of entropy minimisation to derive a form of hierarchical attention that they claim is optimal. They provide evaluations on a few datasets, mainly comparing flat attention with the proposed hierarchical self-attention; comparisons with DeepSet are also provided. The main focus is on theoretical validity.
Strengths
- They consider the idea of hierarchical self-attention, using their definition of a nested signal, and a hierarchical self-attention mechanism that is derived theoretically using ideas of entropy minimization
- They provide comparisons with the flat attention mechanism, and the hierarchical self-attention mechanism is observed to be superior to flat attention
- Finally, they consider the zero-shot generalisation of a pretrained flat attention transformer to a hierarchical transformer and provide comparisons
Weaknesses
- The hierarchical self-attention transformer is only benchmarked against flat self-attention and DeepSet. The authors have observed that various hierarchical self-attention methods have been proposed in the literature. If, as they claim, the proposed hierarchical self-attention is indeed better than other proposed hierarchical self-attention mechanisms, then it is imperative to benchmark their work against such methods.
- The notion of self-attention was principally introduced to reduce inductive bias - for instance, the translation invariance induced by convolutional neural networks. Is re-introducing an explicit inductive bias not detrimental? Can the analysis be done based on this notion of inductive bias?
- There have been a substantial number of works that have considered harmonic-analysis-based operators, such as adaptive Fourier neural operators, scattering vision transformers and many other neural-operator-based attention mechanisms, that have demonstrated superior performance with extensive benchmarking. In contrast, the proposed work does not benchmark or compare against other such methods for inducing structure in a transformer.
- While the authors have described their approach to hierarchical self-attention, it is not clear how their approach conceptually differs from previous hierarchical self-attention. The formulations have been considered theoretically in this work, but the resulting loss function is quite close to loss functions used previously. So, an analysis of and comparison with previous hierarchical approaches would be expected.
Questions
The clarifications requested are any that the authors can provide with respect to the weaknesses mentioned above.
-- Update post rebuttal - In view of the clarifications provided, I have updated my score from 3 to 5.
We thank the reviewer for their insightful feedback.
Re: comparison to other hierarchical methods We agree with the reviewer that in general it would be highly desirable to compare the proposed HSA framework to other hierarchical self-attention approaches. The main reason we did not perform such a comparison is that some of these methods use the notion of "hierarchy" in a somewhat different sense than the one we use within HSA, while those with similar interpretations did not provide their codebases. Regardless, we intend to add such comparisons to our draft. Nevertheless, it should also be emphasized that in this work, our main goal is not to propose the empirically best hierarchical self-attention algorithm out there. Instead, we aim to propose theoretical constructs and rigorous derivations that seamlessly generalize the notion of Softmax self-attention to multi-scale and multi-modal settings.
Re: hierarchical self-attention and inductive bias. We agree with the reviewer that self-attention was in part introduced to overcome the restrictions of other neural architectures dictated by their inherent inductive biases. However, some domains, such as language or multi-modal problems, semantically have scale-separation inductive biases embedded in them, which do benefit the statistical complexity of the model (as our experiments show) if they can somehow be injected into it. But the standard self-attention mechanism cannot naturally incorporate this knowledge, and that is where HSA comes into the equation. Furthermore, even when such scale-separation inductive biases are not semantically grounded in a problem, by imposing a hierarchical structure on the problem, HSA can effectively act as a mechanism for arriving at a better trade-off between accuracy and efficiency depending on the use case.
Re: comparison with other self-attention operators. Please note that in this work our goal is not to develop a universally "best" attention mechanism that beats every other attention operation in the literature on all benchmarks. Instead, we are proposing a rigorous, mathematical construct (i.e. the nested signal) that can represent multi-scale, potentially multi-geometry problems within the transformer framework. Furthermore, we have extended the classical Softmax attention mechanism to this construct using our statistical mechanical derivation of self-attention, which we have further proved, theoretically, to produce the closest attention weight matrix to that of the classical Softmax attention. In that sense, comparing our methodology against other generic alternative attention mechanisms is not quite relevant to what we would like to achieve in this paper.
Re: how conceptually HSA differs from previous hierarchical self-attention First, please note that in this paper, we do not propose any contribution regarding the loss function or the backward pass in general; all the contributions of this work concern solely the forward-pass architecture of the model. As for the conceptual differences between HSA and other hierarchical transformer frameworks, there are two fundamental notions that differentiate HSA from previous work: (1) HSA is derived for our proposed notion of a nested signal, which is a very versatile tool to represent multi-scale, multi-modal data; this is in contrast to previous work that often operates under single-modality settings and/or with fixed, rigid hierarchical structures. In that sense, our framework can be seamlessly applied to a vast and diverse set of multi-scale and multi-modal problems, well beyond what we have presented in the paper. (2) The way we calculate the attention weights in HSA is theoretically grounded and, as we proved in Theorem 1, can be seen as the direct generalization of Softmax attention to hierarchical, multi-modal settings. In that sense, our method differs from other previous work that often defines hierarchical attention heuristically. Due to its theoretical optimality, as we have shown in Sec. 5.3, HSA can replace the classical Softmax attention operator post-training in a completely zero-shot manner to reduce FLOPs with minimal accuracy degradation.
Thank you for the clarification. I do agree that providing a theoretical construction based on the notion of nested signal and the arising constrained attention construct is interesting and useful.
What I am less clear about is, empirically, which is the optimal structure that can be learned and has good empirical trade-offs. There have been works that induce hierarchical constraints. Previously, there have been similar attempts at introducing sparsity constraints - for instance, the work 'Generating Long Sequences with Sparse Transformers' by Rewon Child et al. introduced block-diagonal sparsity constraints that they showed scaled better than flat attention networks. Similarly, this was then extended to a more general learned sparsity constraint in 'Adaptively Sparse Transformers' by Correia et al., EMNLP 2019. Given the significant past exploration, a new work that provides a theoretical construct needs to relate the various constraints induced and provide comparisons to the variants in order to validate the contribution. I believe the present work does not yet achieve that aim. I would be happy to be corrected.
We truly appreciate the reviewer's insightful feedback, which has led us to further enrich the present work. Indeed, the previous extensive work on sparse attention is relevant to our current work in the sense that sparsity is another way of enforcing constraints on attention computation and thereby reducing its degrees of freedom. However, we would like to emphasize that in this work we are generalizing the standard Softmax attention (which is a dense attention) to the hierarchical structure of nested signals, which is already given via the scale-separation inductive bias inherent in the problem at hand. Our derivation guarantees that the resulting HSA mechanism is indeed the generalization of Softmax attention to hierarchy, both theoretically and empirically.
Therefore, we believe that sparse attention, like any flat attention mechanism, is tangential to the general methodology of HSA for deriving hierarchical attention. To further show this point theoretically, and also to respond to the reviewer's request regarding relating our work to previous work, we have added a new appendix (Appendix J) in the latest draft, where we theoretically show that sparse attention can also be cast into our energy minimization framework. We have further used this result to generalize sparse attention to the hierarchical structure of nested signals, much like what we did for Softmax (dense) attention in the main paper. In other words, despite their similarity in enforcing constraints, HSA and sparse attention are somewhat orthogonal to each other, and HSA can equally be employed to generalize sparse attention to hierarchy.
This independence also shows itself empirically, as these methodologies capture different aspects of the problem. For instance, in problems where one is interested in eliciting an overall understanding of the given (nested) signal (e.g. classification) AND has access to a meaningful hierarchical abstraction of semantics within the input signal, incorporating HSA with the known hierarchical structure makes sense. On the other hand, if a token-level, sparse attention pattern is more relevant (e.g. in-context retrieval), then learning sparse attention is the way to go in order to impose structure on the problem. Finally, one can also envision scenarios where a meaningful hierarchical structure is available and, at the same time, the attention patterns at each level of the hierarchy are believed to be sparse. In this case, one may use the extension of HSA with sparse attention, which we have introduced in Appendix J.
Thanks to the authors for the new appendix J that provides a theoretical framework to analyse sparse attention. This is useful and the extension to hierarchical sparse attention is also interesting.
In view of the revision, I am updating my score to 5 from 3. The main reason for not increasing my score further is that I believe there is a need for empirical validation of the proposed method against some constrained attention methods, such as other hierarchical notions or sparsity-based constraints. The main question is when the proposed method proves to be empirically more useful than these other notions. This need not always be the case. For instance, in a setting where there is no hierarchical organization, the proposed method need not improve over any other notion of constraints. However, I believe we do need to see at least some more empirical validation beyond the limited validation provided so far.
I do appreciate the theoretical characterization and value the efforts by the authors for the same.
Thank you again for your helpful comments and feedback which further enriched our work. We also appreciate that you have increased your score. We are hoping to add the empirical study you suggested in the next version of the draft, which hopefully can further address your concern.
The paper proposes to adapt the transformer architecture to multi-modality signals or signals presented at different scales by a hierarchical self-attention (HSA) mechanism for nested signals. The paper first generalizes the classical Softmax attention mechanism (traditional transformer) by representing it as a solution to an entropy minimization problem of a simple signal (without hierarchy). Then, based on this insight, the paper derives a theoretically grounded mechanism for calculating self-attention within nested signals. The authors evaluate the proposed approach in language and multi-model news classification problems and show improved performance. In addition, the authors show that the proposed approach can be utilized for fine-tuning, where a specific form of hierarchical information is revealed at test time - the simple self-attention operation can be exchanged for HSA, leading to efficient calculations at test time without the need for re-training.
Strengths
- The paper proposes a novel, theoretically grounded mechanism for hierarchical self-attention that elegantly generalizes the classical Softmax self-attention for flat signals.
- The paper is well-written and easy to follow. The example of website representation demonstrates the use cases of the proposed algorithm and provides convincing motivation.
- The paper includes a thorough discussion of the related work.
Weaknesses
Limitations: The authors did not address all the limitations of the work. For example:
- Adding the inductive bias of the signals' structure during training can constrain the model to output only signals with this specific structure. In addition, it's not straightforward to indicate the structure of the signals in every dataset.
- The authors acknowledge that the algorithm is not readily parallelizable on standard GPUs and requires dedicated hardware for acceleration - strangely, this is omitted from the main paper and only mentioned in the appendix.
Comparisons:
Although the authors mention that previous works attempt to incorporate hierarchy and multimodality within transformers (based on ad hoc heuristics), the proposed method is compared only to flat self-attention and DeepSet (work from 2017).
Questions
- Please address the aforementioned weaknesses.
- In section 5.3 the authors mention that when a pre-trained flat self-attention is replaced with HSA and fine-tuned, the results match the performance of the original flat self-attention, with improved computational cost. Why doesn't incorporating the inductive bias improve the performance as well, given that the model can be fine-tuned until convergence?
- The authors mention that the leading algorithms usually incorporate tricks such as using multiple loss functions to achieve SOTA performance. Is it possible to incorporate the same tricks with the HSA models?
minor typo: Line 254: Eq. equation 4 → equation 4
We appreciate the reviewer's constructive feedback and insightful comments.
Re: Adding the inductive bias of hierarchy We agree that injecting an inductive bias into a model can be generally restrictive. However, some domains, such as language or multi-modal problems, semantically have scale-separation inductive biases embedded in them, which do benefit the statistical (on top of the computational) complexity of the model (as our experiments show) if they are injected into it. In that sense, for such domains, we would argue it makes sense to do so. Nevertheless, as the reviewer correctly pointed out, arriving at such a semantically meaningful hierarchical structure is not straightforward in every problem. In such cases, the application of HSA with an "imposed" hierarchical structure acts as a flexible mechanism to change the trade-off between accuracy and efficiency (e.g., improving efficiency at the cost of minimal accuracy degradation).
Re: the algorithm not being parallelizable As we mentioned in Appendix D.4, the dynamic programming procedure presented in Algorithms 1-3 is not parallel, as it is basically a tree-traversal algorithm. However, as we explain next in that appendix, it is indeed parallelizable, and we have illustrated there how to do so. Furthermore, our implementation does not require any dedicated hardware for acceleration and uses the same standard operations in PyTorch. We are planning to release our code, so hopefully that will be helpful to other researchers seeking to implement HSA in a parallel manner.
Re: comparisons with other hierarchical methods We agree with the reviewer that in general it would be highly desirable to compare the proposed HSA framework to other hierarchical self-attention approaches. The main reason we did not perform such a comparison is that some of these methods use the notion of "hierarchy" in a somewhat different sense than the one we use within HSA, while those with similar interpretations did not provide their codebases. Regardless, we intend to add such comparisons to our draft. Nevertheless, it should also be emphasized that in this work, our main goal is not to propose the empirically best hierarchical self-attention algorithm out there. Instead, we aim to propose theoretical constructs and rigorous derivations that seamlessly generalize the notion of Softmax self-attention to multi-scale and multi-modal settings.
Re: replacing standard self-attention with HSA and fine-tuning Please note that in the experiments in Sec 5.3, we did not fine-tune the models after replacement; the reported results were obtained in a completely zero-shot manner without any re-training or fine-tuning. Furthermore, the hierarchical structures used in Sec 5.3 are not based on the semantic structure of the text but are merely fixed, imposed hierarchies. That is, we are not really injecting any semantically meaningful inductive bias into the model using these hierarchies. Given these two experimental settings, accuracy degradations are indeed expected. But the reason we ran the experiments in Sec 5.3 under these adverse settings is that we wanted to examine the practical implications of Theorem 1. In particular, we wanted to see how much overall accuracy degradation occurs if we simply replace the standard attention weight matrices in a pre-trained transformer with (imposed) hierarchical ones that have the minimal KL-divergence to the original matrices. And interestingly, the results indeed confirmed the practical importance of the theoretical result in Theorem 1.
Re: incorporating other tricks during training Our proposed contributions in this paper are solely concerned with the forward-pass architecture of the model, so most of the tricks used during the backward pass in standard transformers can be equally applied to an HSA-based transformer model as well. The main reason we have not incorporated any such techniques in our empirical study is that we were not mainly after beating the SOTA, but purely measuring the impact of introducing HSA while excluding any other potential factor impacting the overall performance of the models.
Dear Reviewer GYaf,
Thank you again for your thoughtful and constructive feedback on our initial draft. We hope we could address most of your questions and concerns via our explanations and additions to the latest draft. If you have any more questions, please let us know before the discussion phase ends.
Best regards, The authors
I appreciate the authors' response to my review. While most of my concerns have been addressed, I remain concerned about the absence of comparisons with hierarchical methods, which I believe is crucial for comprehensively evaluating the advantages of the proposed framework. Therefore, I maintain my original score, though with low confidence.
We'd like to thank you for your insightful feedback that has led us to improve our work further. We are hoping we can add extra experiments for comparing against other hierarchical methods to the final version of the draft.
The authors present a new way to formulate the attention mechanism of Transformer architectures based on a formal mathematical construct that is able to capture/model the hierarchical structure of multi-modal multi-scale data with different underlying signal geometries in a consistent and general manner.
Strengths
Originality & Significance:
- Motivation of the method based on a clear observation of shortcomings in previous heuristic approaches, leading to the introduction of a well-formalised mathematical foundation to model different signal geometries in one consistent framework
- Proposed method applicable to wide range/combination of modalities due to the very few assumptions that are made (nested signals very flexible in terms of exact composition / instantiation of internal nodes)
- Authors also provide a way to incorporate their contributions into already trained ‘conventional’ Transformers, further expanding the applicability of their findings
Quality:
- Contributions are well-placed within the wider area of research, good overview of related methods
- Good range of experiments presented that mostly support the underlying intuition and argument of the paper
Clarity:
- Contributions and underlying motivation clearly stated in intro
- Explanations well-supported through the two main Figures
- The paper is well written and easy to read and follow; longer proofs moved to the appendix, and the paper therefore provides a good level of depth to follow without interrupting the reading flow
Weaknesses
TLDR; I do very much like the underlying idea, but have some questions and concerns that I’d like the authors to clarify & address.
- The potential limitation of the structural assumptions made during the derivation of ‘conventional’ softmax attention, and how this affects the transferability of the presented concepts/insights to current Transformer architectures, is not (sufficiently) discussed – see questions
- A comparison of the obtained results with a ‘current’ Transformer of the same size (which does not necessarily adhere to the constraints defined in ll. 260ff) is missing – see questions
- Experiments are exclusively at a minuscule scale for Transformer models
Questions
Main concerns, questions & potential improvements:
- Four main constraints/variations that deviate from current Transformer layouts are set out in lines 260ff to match the attention formulation. Is this only required to show the mathematical equivalence with the ‘popular’ softmax self-attention, or actually used throughout the experiments? In either case, I’d like to know to what extent this affects the generality of the authors’ proposed approach, specifically:
  - No separate value projection, but use of keys as values: This might severely limit the expressivity of the architecture. Do the authors see this as a potential drawback, and could this be fixed/accommodated for?
  - Post-norm instead of pre-norm for LayerNorm: Pre-norm significantly improved training stability, which is especially important for larger architectures. As the authors’ experiments mainly tackle small scales, is there a possibility this might become restrictive? And could this be accommodated somehow?
- Results reported for non-hierarchical approach: The authors compare their HSA to FSA, i.e. flat self-attention. Does this still include the constraints mentioned before? If so, how would a ‘current’ Transformer perform on these tasks (w/o these modifications)?
- L 317 f: 'Approximation of attention-weight between any leaf node in A and any leaf node in B by one value': Is there an underlying intuition why this would make sense? I can see this at higher levels, but when thinking about one modality being an image and another one a sentence describing it – I would expect sub-parts (e.g. tokens) of each modality to definitely have different attention weights, as some words will refer to specific parts of the image? (same applies to audio, etc.)
- Replacing self-attention in pre-trained transformer-based models by HSA w/o retraining: As the structure of HSA is essentially different, how is this performed? Assuming a relatively recent architecture with Q, K, V and pre-norm – how would HSA fit in there? Would the attention matrix simply be replaced while still using the same projection matrices and layer structure? Also: Replacing degrades performance, as discussed towards the end of Sec 5 – would this be ‘solved’ when training an entire (larger) model from scratch with HSA? -> Also see next question:
- Scale of used models: Table 1 shows experiments for a model with 1.2M params, which is extremely small (even for small and efficient models). This goes up to 12M later on, which is still extremely small.
  -> Do the authors have any impression of how their approach scales to larger models? I do understand computational constraints, but some insight would be very helpful for the reader! (especially given that even highly-efficient models working on multiple modalities are typically orders of magnitude larger).
Additional comments:
- L 228 ff: Would the dimensionality of the key and query variables really need to be the same as the input dimension? I assume the key and query space could be of any dimension?
-> Note: This distinction might be important when thinking about the actual dimensionality of key/queries within a Multi-Head Self-Attention, where each head is of lower dimension than the input/embeddings
Update / Adjustment (rebuttal phase):
Given the lack of response by the authors to my and other reviewers' questions, I'm slightly downgrading my current score from 6 to 5. I'd encourage the authors to respond as there is still time left for discussion.
Update post-rebuttal phase:
Given the provided rebuttals, I have raised my score (as discussed below) to 8. While there are some remaining weaknesses around the empirical validity/usefulness compared to other hierarchical approaches, I do think this work passes the bar given the insights it provides.
We appreciate the reviewer's constructive comments and feedback to further improve our work.
Re: the restrictions of the proposed formulation
- The (lack of) value projection: first of all, please note that in both our proposed Algorithms 1-3 as well as all of our experiments we have indeed incorporated value projections. Furthermore, in order to justify the incorporation of value projections within our theoretical derivation, we have added a new appendix (Appendix C in the new draft) explaining how value projections theoretically emerge in our statistical mechanical interpretation of self-attention. (TL;DR: value projections emerge as a learnable step-size in the gradient update in Eq. (6), realized as gradient projections.)
- Pre-norm vs. post-norm: The reviewer is indeed right that in larger-scale problems post-norm may become restrictive. Note that the introduction of post-norm is merely a technical trick to arrive at the standard self-attention formulation in Proposition 1. In practice, however, we can simply use pre-norm, and since the outputs would be bounded anyway, Proposition 1 would still hold. In fact, in our empirical study in Sec 5.3, we have used pre-norm. We have further added new experiments in the new draft that use the standard pre-norm (see the comment below).
Re: comparison to "current" transformers The FSA results in Sec 5.1 and 5.2 are obtained under the same architectural constraints as those of HSA. However, we have added a new experimental study in Appendix I where we compare HSA against regular transformers, with neither of the two models bearing the restrictions mentioned in the main text anymore. The results of these experiments show that HSA still does better than regular transformers while significantly reducing the number of FLOPs required to perform self-attention.
Re: lacking finer attention weights between two modalities This is indeed an acute observation by the reviewer. Generally speaking, the scale-separation inductive bias incorporated within our framework dictates the block constraint between any two non-overlapping sub-trees of a nested signal, and that is where the computational (and often the statistical) gain comes from. However, as the reviewer pointed out, in certain use cases, finer attention weights between sub-parts of different modalities are highly desirable. To accommodate such cases within our proposed framework, one can simply put the sub-parts of the two interacting sub-trees under one parent node, which is equivalent to placing the sub-parts on a common geometrical domain. This forces the algorithm to calculate separate attention weights among the sub-parts. Finally, if such a common geometrical domain is not easily attainable for the two modalities (e.g. in the case of text and image), then one can resort to "no geometry", which is equivalent to an unordered-set geometry in our framework.
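To make this concrete (again with a hypothetical nested-dictionary encoding used purely for exposition, not our actual data structure): placing the interacting sub-parts directly under one shared parent node, rather than leaving them in two sibling sub-trees, is what yields individual attention weights among them.

```python
# Before: two sibling sub-trees -> HSA assigns a single block-level attention
# weight between the image patches and the caption span.
separate = {
    "level": "region",
    "children": [
        {"level": "image patches", "children": ["patch_07", "patch_08"]},
        {"level": "caption span", "children": ["a", "red", "car"]},
    ],
}

# After: the sub-parts share one parent (a common geometrical domain), so HSA
# computes pairwise attention weights among all five leaves.
merged = {
    "level": "image-text region",
    "children": ["patch_07", "patch_08", "a", "red", "car"],
}
```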
Re: replacing regular self-attention with HSA in Sec 5.3 The HSA-RoBERTa model introduced in Sec 5.3 is obtained by replacing the attention calculation operator with HSA while keeping everything else (including pre-norms, the architecture of each layer and the learned weights) intact. The results in Sec 5.3 are obtained after replacement without any fine-tuning, hence the degradation in accuracy. However, as we showed in the newly added experiments in Appendix I, by training from scratch we can indeed beat the performance of the baseline while reducing the number of FLOPs.
Re: insight into scaling up to much larger models As stated in the paper, HSA reduces the number of attention weights from O(M^2 b^2) to O(M b^2), where M is the number of internal nodes and b is the average branching factor of the hierarchy. Since the overhead of HSA is O(1), this reduction becomes more visible in terms of the memory footprint of the model as the model and context/input sizes grow, which means that we would be able to fit larger models in the restricted GPU memory for training.
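As a back-of-the-envelope illustration of this scaling (the numbers are hypothetical, and we assume roughly N ≈ M·b leaf tokens, consistent with the flat count above):

```python
# Hypothetical hierarchy: M internal nodes with average branching factor b.
M, b = 1024, 32
N = M * b                       # ~32,768 leaf tokens seen by flat attention
flat_weights = N ** 2           # O(M^2 b^2): ~1.07e9 attention weights
hsa_weights = M * b ** 2        # O(M b^2):   ~1.05e6 attention weights
print(flat_weights // hsa_weights)  # ~1024x fewer weights to materialize
```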
Re: the dimensionality of query and keys We thank the reviewer for noticing the typo. Indeed, the attention dimensionality does not need to be the same as the input dimensionality, as is the case in all of our experiments. We have fixed the typo in the new draft.
We apologize for the late response. We were preparing the new experimental results to address some of the reviewer's questions while at the same time experiencing loss of power due to a cyclone in our area during the last few days. We hope our answers can clarify the questions raised by the reviewers.
Dear Reviewer 5fDx,
Your initial comments on our original submission led us to further improve the draft by adding new experiments and clarifications; we really appreciate that! We are hoping our latest draft and explanations have addressed most of your initial questions and concerns. If there's anything left unclear or unanswered, we appreciate if you can let us know before the discussion phase ends soon.
Best regards, The authors
I'd like to sincerely thank the authors for the work they have put into the rebuttal, and the additional explanations and insights.
Most of my questions and concerns have been addressed; I have also thoroughly read the other reviewers' opinions and the provided rebuttals -- and have to somewhat agree with reviewer TzdN in their point of empirical 'usefulness' when compared to other hierarchical approaches.
However, I do think that this work offers several additional insights (especially on the theoretical side) and has the ability to connect different aspects within one joint theoretical framework -- which for me outweighs the slightly limited empirical side. (Any additions there would, however, definitely further strengthen the paper.)
Overall, I think this work does provide a 'sufficient' number of new insights to the community, and is in addition very well written (and I'm therefore raising my score to 8).
We really appreciate your detailed comments and constructive feedback which led us to further improve our work. Also, thank you for recognizing the theoretical significance of our work and increasing your score. We agree that adding those additional experiments will enrich our paper further on the empirical side and therefore we are hoping we can add those comparisons to the final version of the draft.
Dear Reviewers,
As we are getting close to the end of the discussion period, we'd like to invite you to consider giving us further feedback regarding our responses to your initial questions and concerns. We would like to be sure we could address your concerns to the large extent.
We greatly appreciate your insightful comments, as they have already helped us enrich our paper in its latest draft.
Sincerely, The authors
This paper introduces a hierarchical self-attention (HSA) mechanism for handling nested and multi-modal signals. By deriving self-attention as a gradient descent step for entropy minimization, the authors propose a new formulation optimized for hierarchical data. The method improves performance on language and multi-modal tasks, and offers theoretical insights into adapting transformers for multi-scale data.
Overall, this paper received mixed scores. While it presents strong theoretical analysis for the proposed HSA, the empirical validation remains insufficient. Limited comparisons with other methods and existing constraints on nested signals highlight the need for more comprehensive experiments to provide meaningful insights. The ACs acknowledge the value of the theoretical contributions but emphasize that further empirical work is essential to substantiate the claims and demonstrate the broader applicability of HSA.
Additional Comments on the Reviewer Discussion
All four reviewers actively participated in the discussion and rebuttal stages.
Reviewer TzdN first raised the concern that the proposed HSA is only benchmarked against flat self-attention and DeepSet, and that it is not clear how their approach conceptually differs from previous hierarchical self-attention. The formulations have been treated theoretically in this paper, but the resulting method is quite close to the sparse attention used previously. The authors' response is that sparsity is another way of enforcing constraints on attention computation, whereas in this work they are generalizing the standard Softmax attention (dense attention) to the hierarchical structure of signals. Overall, there is currently no empirical validation of the proposed method against constrained attention methods such as sparsity-based constraints.
Reviewer GYaf gave a high rating of 8 with confidence 2, recognizing that the theoretically grounded HSA generalizes classical Softmax for flat signals. However, they raised concerns about the algorithm's limited parallelizability on standard GPUs, and the insufficient comparison with methods beyond flat self-attention and DeepSet (proposed in 2017). The authors addressed these concerns in Appendix D4, suggesting sparse tensors to mitigate the shortcomings of the dynamic programming in Algorithms 1–3, and clarified that the notion of "hierarchy" in prior work differs from their usage in HSA. The ACs thought these rebuttals somewhat unconvincing.
Reviewer 5fDx also rated the paper an 8 with confidence 3, praising its well-formalized mathematical foundation for modeling diverse signal geometries in a unified framework. While they agreed with Reviewer TzdN's concern about limited comparisons with alternative approaches, 5fDx thought that the theoretical contributions and insights outweigh the limited empirical validation.
After rebuttal, Reviewer Lkye maintained concerns about the limited extensiveness of the experiments, which do not provide sufficient empirical evidence for HSA. They also highlighted potential increases in runtime as another drawback of the method.
Overall, the paper offers strong theoretical analysis for the proposed HSA, but the empirical validation is notably lacking. The limited comparisons with other methods and existing constraints on nested signals suggest the need for more comprehensive experiments to provide meaningful insights. The ACs believe that while the theoretical contributions are valuable, further empirical work is required to substantiate the claims and broader applicability of HSA.
Reject