PaperHub
Overall rating: 6.3/10
Poster · 6 reviewers (scores: 4, 2, 3, 3, 4, 4; min 2, max 4, std 0.7)
ICML 2025

MoH: Multi-Head Attention as Mixture-of-Head Attention

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We propose Mixture-of-Head attention (MoH) that outperforms multi-head attention even by using only 50%~90% of the attention heads.

Abstract

In this work, we upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational costs while maintaining or surpassing the previous accuracy level. We show that multi-head attention can be expressed in the summation form. Drawing on the insight that not all attention heads hold equal significance, we propose Mixture-of-Head attention (MoH), a new architecture that treats attention heads as experts in the Mixture-of-Experts (MoE) mechanism. MoH has two significant advantages: First, MoH enables each token to select the appropriate attention heads, enhancing inference efficiency without compromising accuracy or increasing the number of parameters. Second, MoH replaces the standard summation in multi-head attention with a weighted summation, introducing flexibility to the attention mechanism and unlocking extra performance potential. Extensive experiments on ViT, DiT, and LLMs demonstrate that MoH outperforms multi-head attention by using only 50%$\sim$90% of the attention heads. Moreover, we demonstrate that pre-trained multi-head attention models, such as LLaMA3-8B, can be further continue-tuned into our MoH models. Notably, MoH-LLaMA3-8B achieves an average accuracy of 64.0% across 14 benchmarks, outperforming LLaMA3-8B by 2.4% by utilizing only 75% of the attention heads. We believe the proposed MoH is a promising alternative to multi-head attention and provides a strong foundation for developing advanced and efficient attention-based models.
Keywords
Multi-Head Attention, Mixture of Experts, Foundation Models

Reviews and Discussion

Review (Score: 4)

This paper proposes Mixture-of-Head attention (MoH) to replace standard multi-head attention (MHA) in Transformers. The key idea is to treat each attention head as an expert in a mixture-of-experts framework. The experiments demonstrate that MoH can be applied to Vision Transformers (ViT) for image classification, DiT for diffusion-based image generation, and LLMs. The results show that MoH can achieve competitive performance with only 50%-90% of the heads/experts active.

Strengths:

The idea of applying mixture of experts to attention heads is intuitive and effective.

MoH can achieve competitive performance with only 50%-90% of the heads/experts.

Detailed ablation study and well-organized paper structure.

Weaknesses:

The idea is not entirely new. It shares a similar idea with MoA, plus implementation optimizations such as shared heads and two-stage routing.

There are no experiments comparing with MoA.

Given these strengths and weaknesses, I am leaning towards weak accept. I am willing to raise my scores accordingly if the authors eventually address these concerns.

Update after rebuttal

The authors addressed my concern about the MoA comparison, so I raised my score to 4.

Questions for the Authors

Does the implementation of MoH affect GPU memory usage? If so, does it increase or decrease the total memory usage? In that case, can we support larger or smaller models by using MoH?

Claims and Evidence

  1. Claim: not all attention heads hold equal significance

The claim is supported by Voita et al. (2019) and Michel et al. (2019). I suggest also providing state-of-the-art mechanistic explanations presented in 2024, such as [1], [2], or [3].

  2. Claim: MoH outperforms multi-head attention by using only 50%∼90% of the attention heads

Empirical evaluations on ViT, DiT, and LLMs consistently show that MoH performs better.

Methods and Evaluation Criteria

MoH replaces multi-head attention with a MoH module, adding a router, as in MoE architectures, to select the top-K heads while keeping some heads always active. This makes sense for Transformers.

Theoretical Claims

Appendix A provides the theoretical claim that MoH is superior to vanilla multi-head attention. Given the 8-page limit, it is understandable to include it in the appendix.

The claim seems to tell a story that reduced redundancy and greater differentiation are better choices and lead to a better model architecture. However, if that were true, a complete MoH should be used, with all attention heads routed through the MoE-based design. Instead, this paper deploys a mixed design in which some attention heads are specialized and some are shared. In this case, specialization is not always good; finding a good balance between specialization and generalization is more important, and how to find such a balance remains an open question.

Experimental Design and Analysis

There are no experiments comparing with MoA. Although Section 5 discusses the difference between MoH and MoA, no experimental results support this claim.

As MoA is only validated on language tasks, can the authors provide a comparison between MoA and MoH on language tasks?

Table 5 presents the ablation study on the impact of each component of the proposed MoH; is the first row equivalent to MoA? Note that these results come from image classification, not from language tasks.

Supplementary Material

The paper provides additional code for MoH-DiT and MoH-ViT in the supplementary material; I reviewed the supplementary materials and ran MoH-ViT on my own.

Relation to Prior Literature

Attention Head Pruning: Prior research [1,2,3] shows that many heads can be removed without noticeable harm; MoH extends this insight by routing attention heads as experts, with performance gains.

Essential References Not Discussed

[1] Wu, Wenhao, et al. "Retrieval head mechanistically explains long-context factuality." arXiv preprint arXiv:2404.15574 (2024).

[2] Fu, Yu, et al. "Not all heads matter: A head-level KV cache compression method with integrated retrieval and reasoning." arXiv preprint arXiv:2410.19258 (2024).

[3] Xiao, Guangxuan, et al. "Duoattention: Efficient long-context llm inference with retrieval and streaming heads." arXiv preprint arXiv:2410.10819 (2024).

These papers [1,2,3] propose the same insight that not all attention heads hold equal significance.

Other Strengths and Weaknesses

Strengths: Extensive experimentation on vision classification, diffusion-based generation, and large language modeling; the method supports both pretraining and fine-tuning.

Other Comments or Suggestions

Typo: In Sec. 4.3, "Please refer to the Appendix for detailed hyper-parameter settings (Tab. C)" links to the wrong table (Table 3).

Author Response

We sincerely appreciate your thoughtful comments and for recognizing our work as "intuitive and effective," acknowledging that "MoH can achieve competitive performance," and highlighting our "detailed ablation study and well-organized paper structure." Below, we address your questions in detail.

Q1: The idea is not entirely new. It shares a similar idea with MoA.

A1: We explain this problem in three aspects:

  • In terms of motivation, MoH aims to make the attention mechanism more efficient and effective without adding extra parameters. In contrast, MoA, like MoE, focuses on increasing model size while keeping inference costs low. Therefore, the model settings of MoH are more stringent than those of MoA.
  • In terms of methodology, we maintain the original structure of multi-head attention as much as possible, allowing MoH to seamlessly replace standard multi-head attention across different tasks without extra tuning. In contrast, MoA introduces shared keys and values to add heads without increasing the KV cache, which disrupts the original multi-head attention design.
  • In terms of flexibility, we show that pre-trained multi-head attention models can be further continue-tuned into our MoH models, making MoH highly practical. In contrast, MoA integrates multi-head attention with MoE but relies on shared keys and values, requiring training from scratch, which reduces its flexibility.

Q2: No experiments compared with MoA.

A2: Thanks for your insightful advice. As suggested, we compare MoA and MoH on the translation task and additionally provide results for image classification. As shown in the table below, MoH outperforms MoA, primarily for two reasons: (1) MoA's use of shared keys and values across heads limits its expressiveness; (2) MoH's shared heads and two-stage routing improve the model's ability to capture general knowledge (please refer to our response (A1) to Reviewer 6acz).

| Method | WMT14 En-De (BLEU) | Image Classification (Acc) |
| --- | --- | --- |
| MoA | 28.3 | 75.4% |
| MoH | 29.0 | 78.6% |

Q3: How to find a good balance between specialization and generalization.

A3: There are two ways to choose shared heads and routed heads: (1) manual configuration and (2) learning through learnable masks. In the manuscript, we show the results of manual configuration. For the learnable mask approach, we follow a method similar to [1]. Specifically, we introduce a mask module with binary values {0,1} applied to the heads. These masks are dynamically learned rather than statically assigned, allowing the model to determine which heads should be shared. The remaining heads are designated as routed heads, and we set K based on a predefined ratio. For example, if the ratio is 1/4 and there are 8 routed heads, then K is set to 2. As shown in the table below, our latest experimental results show that learning through learnable masks is generally better than manual configuration.

| Method | # Activated Heads (%) | Image Classification (Acc) |
| --- | --- | --- |
| Baseline | 100 | 84.8% |
| Manual Configuration | 75 | 84.9% |
| Manual Configuration | 50 | 84.7% |
| Learning Through Learnable Masks | 50 | 85.0% |

[1] Liu, Peiyu, et al. "MOEfication by Experts as Masks."
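
For concreteness, the following is a minimal, hypothetical sketch of the learnable-mask idea described above (a straight-through sigmoid mask is assumed here for illustration; the exact scheme following [1] may differ):

```python
import torch
import torch.nn as nn


class SharedHeadMask(nn.Module):
    """Hypothetical sketch: learn a binary {0,1} mask over heads marking shared heads."""

    def __init__(self, num_heads: int):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_heads))  # one learnable logit per head

    def forward(self) -> torch.Tensor:
        probs = torch.sigmoid(self.logits)
        hard = (probs > 0.5).float()
        # Straight-through estimator: binary mask in the forward pass,
        # sigmoid gradients in the backward pass.
        return hard + probs - probs.detach()


mask = SharedHeadMask(num_heads=16)()  # 1 -> shared head, 0 -> routed head
num_routed = int((mask == 0).sum())
K = max(1, num_routed // 4)            # e.g. a 1/4 ratio over the routed heads, as above
```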

Q4: Table 5 presents the ablation study on the impact of each component of the proposed MoH; is the first row equivalent to MoA?

A4: The first row in Table 5 follows a structure similar to MoA, but without shared keys and values or the use of additional z-loss.

Q5: Providing state-of-the-art mechanistic explanations presented in 2024.

A5: Thanks for your valuable suggestion. We have added your additional references to the Introduction to better explain that not all attention heads hold equal significance.

Q6: TYPO: Sec 4.3 "Please refer to the Appendix for detailed hyper-parameter settings (Tab. C)" link to wrong table of Table 3.

A6: Thank you for your thorough review. This issue may be a bug in LaTeX, and we will work on fixing it.

Q7: Does the implementation of MoH affect GPU memory usage? If so, does it increase or decrease the total memory usage? In that case, can we support larger or smaller models by using MoH?

A7: As shown in our response (A2) to Reviewer 6acz, MoH slightly reduces GPU memory usage, though the difference is not significant. This is because GPU memory is primarily used to store model parameters, gradients, and the KV cache. Since MoH only optimizes attention computation, it does not substantially reduce the GPU memory of these three components. Although MoH doesn't allow training larger models, it can make training and inference faster.

Review (Score: 2)

The paper proposes Mixture-of-Head (MoH), a replacement for the standard attention mechanism, in which attention heads can be adaptively switched on and off and reweighted for each token. This proposal is motivated by the already studied redundancy/specialization of attention heads, and the authors show that MoH can maintain or even improve the performance of a variety of transformer networks across different tasks while activating just a fraction of the attention parameters.

Questions for the Authors

  1. It's unclear to me how the dynamic routing works at the single token level. This is a core contribution of the paper (as stated repeatedly, including in the abstract), so I would like to better understand it and see it more clearly explained in the paper, where I feel most of the focus is at the head level. What exactly happens when a token $t$ in sentence $S$ is routed/activated in head $h$? Does it interact only with the other tokens routed to $h$?

Claims and Evidence

Claims are generally well supported by evidence. My only concern in this direction is that, in most cases, performance enhancements provided by MoH are marginal or even absent compared with the standard multi-head attention mechanism, which is probably insufficient to sustain the claim that MoH consistently enhances model performance.

Methods and Evaluation Criteria

Yes, methods and evaluations are reasonable. The accessibility of the method could be improved by rephrasing the paragraph on two-step routing, as the roles of the learnable projection matrices are not immediately clear.

Theoretical Claims

N/A

Experimental Design and Analysis

Experimental design is generally sound, though unclear in some parts. Specifically, the difference between the setting of ViT and DiT w.r.t. Llama could be made clearer. From my understanding, the MoH module is trained on a pre-trained transformer in both cases. What is then the peculiarity of continual tuning in the case of Llama? Moreover, in the cases of ViT and DiT, it is not clear whether shared heads are assigned and what strategy is used to choose them.

Supplementary Material

I tried having a look at the code to clarify a doubt I expressed in the "Questions" part. Other than that, no.

Relation to Prior Literature

N/A (there is some information regarding this in other sections of the form).

Essential References Not Discussed

I would point the authors to the following related works that might be worth discussing in the manuscript:

  • Interpreting CLIP's Image Representation via Text-Based Decomposition; Gandelsman et al., ICLR 2024. In this paper, the authors find that the attention heads of CLIP tend to specialize in specific input attributes.

  • Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP; Balasubramanian et al., NeurIPS 2024. This paper generalizes such findings to non-contrastive Vision Transformers;

  • ResiDual Transformer Alignment with Spectral Decomposition; Basile et al., 2024. In this paper, the specialization property of CLIP heads is used to show that few heads can outperform the whole model on zero-shot classification tasks.

Other Strengths and Weaknesses

As an additional strength, I think the idea of adding a MoE-like routing in the attention mechanism is fascinating and worth investigating, especially for efficiency purposes (but also interpretability!).

On the weaknesses side, they are scattered around aspect-specific parts of the review (e.g., Experimental Designs).

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate your thoughtful comments and for recognizing that "the idea of adding a MoE-like routing in the attention mechanism is fascinating and worth investigating, especially for efficiency purposes." Below, we address your questions in detail.

Q1: In most cases, performance enhancements provided by MoH are marginal.

A1: We explain this problem in two aspects:

  • The motivation for our work is to upgrade the multi-head attention mechanism to reduce computational costs while maintaining or surpassing the previous accuracy level. Therefore, our method activates fewer parameters than multi-head attention, and it performs better when activating the same number of parameters.
  • To demonstrate the robustness of our method, we only replace multi-head attention with MoH in various structures while keeping the original training parameters unchanged. Our latest experimental results indicate that, with tuning, our method can achieve even higher performance.
| Method | # Activated Heads (%) | Image Classification (Acc) |
| --- | --- | --- |
| Multi-Head Attention | 100 | 84.8% |
| MoH | 50 | 85.0% |

Q2: The accessibility of the method could be improved by rephrasing the paragraph on two-step routing.

A2: Thanks for your insightful advice. As suggested, we have rewritten the paragraph on two-step routing to make it easier for readers to understand.

Q3: The difference between the setting of ViT and DiT w.r.t. Llama.

A3: We explain our experimental setup in detail:

  • ViT for Image Classification. We trained a MoH model from scratch based on TransNeXt. To ensure a fair comparison, we only replace the standard multi-head attention with the proposed MoH, while keeping all other training parameters identical to TransNeXt.
  • DiT for Class-Conditional Image Generation. We trained a MoH model from scratch based on DiT by replacing the standard multi-head attention with the proposed MoH. We also keep all other training parameters identical to DiT.
  • Training LLMs from Scratch. We trained the LLMs from scratch, maintaining the multi-head attention baseline with exactly the same training parameters as the MoH model.
  • Continue-Tuning LLaMA3-8B. To significantly enhance the applicability of the proposed MoH method, we attempt to further continue-tune pre-trained multi-head attention models, such as LLaMA3-8B, into MoH models.

Q4: What strategy is used to choose shared heads?

A4: All MoH models contain shared heads. There are two ways to choose shared heads: (1) manual configuration and (2) learning through learnable masks. In the manuscript, we show the results of manual configuration. For the learnable mask approach, we follow a method similar to [1]. Specifically, we introduce a mask module with binary values {0,1} applied to the heads. These masks are dynamically learned rather than statically assigned, allowing the model to determine which heads should be shared. The remaining heads are designated as routed heads, and we set K based on a predefined ratio. For example, if the ratio is 1/4 and there are 8 routed heads, then K is set to 2. As shown in the table below, our latest experimental results show that learning through learnable masks is generally better than manual configuration.

| Method | # Activated Heads (%) | Image Classification (Acc) |
| --- | --- | --- |
| Baseline | 100 | 84.8% |
| Manual Configuration | 50 | 84.7% |
| Learning Through Learnable Masks | 50 | 85.0% |

[1] Liu, Peiyu, et al. "MOEfication by Experts as Masks."

Q5: Essential references not discussed.

A5: Thanks for your advice. We have expanded [1,2,3] and improved the discussion of related works.

Q6: How the dynamic routing works at the single token level.

A6: We input the sentence "Give me a short introduction to large language model." into MoH-LLaMA3-8B. We find that the tokens forming a phrase share most of their activated heads. For example, all three tokens in "large language model" activate heads {6,7,8,9,11,12}. For other tokens, no clear pattern emerged in the activation of routed heads. Notably, besides the routed heads, the shared heads create a stable semantic interaction between all tokens.

| Token | IDs of the Activated Heads |
| --- | --- |
| Give | 0, 3, 7, 8, 9, 12, 14, 15 |
| me | 2, 7, 8, 10, 11, 13, 14, 15 |
| a | 0, 1, 4, 5, 6, 8, 9, 12 |
| short | 1, 4, 5, 6, 9, 11, 12, 15 |
| introduction | 0, 2, 3, 6, 7, 8, 9, 11 |
| to | 1, 2, 7, 8, 10, 12, 13, 14 |
| large | 1, 6, 7, 8, 9, 11, 12, 15 |
| language | 6, 7, 8, 9, 10, 11, 12, 15 |
| model | 0, 6, 7, 8, 9, 10, 11, 12 |
| . | 0, 2, 3, 8, 9, 10, 13 |
Reviewer Comment

Thank you for your reply and your work. However, I believe my central question is still unanswered by the previous comment. I'm referring to what the authors labelled "Q6".

To rephrase it, I would like to have a better understanding of what changes in the attention mechanism in a specific head when only a few tokens are routed to that head. This is key to understanding the method's validity, and the lack of a direct answer raises concerns about whether this aspect has been sufficiently considered.

Author Comment

Thank you for your invaluable feedback. We truly appreciate the time and effort you've dedicated to thoroughly reviewing our paper.

First, we would like to clarify some important details about our approach. In our method, if a token $x_t$ does not select head $h_i$, it will not compute the query $x_t W_Q^i$ and the attention value for that head. However, the token will still compute the key $x_t W_K^i$ and value $x_t W_V^i$ of head $h_i$. This is because other tokens, such as $x_{t'}$, may select head $h_i$, and in that case $x_{t'}$ will need the key and value of $x_t$ to compute the attention value.

We give the pseudo-code below:

  • For each token $x_t$:
    • For each attention head $h_i$:
      • If $h_i$ is selected by token $x_t$:
        • Compute and cache key $K^i_t = x_t W_K^i$ and value $V^i_t = x_t W_V^i$
        • Compute query $Q^i_t = x_t W_Q^i$
        • Compute the attention value using the KV cache of all tokens
      • If $h_i$ is not selected by token $x_t$:
        • Still compute and cache key $K^i_t = x_t W_K^i$ and value $V^i_t = x_t W_V^i$

It is worth noting that the computational overhead of calculating $K$ and $V$ is relatively small. For example, in self-attention computations with a dimension of 512 and a sequence length of 8192, the calculation of $K$ and $V$ accounts for only about 5% of the total computation. The proportion of computational overhead from $K$ and $V$ further decreases as the sequence length increases.
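
For readers who want to see the overall mechanism end to end, the following is a minimal, hypothetical PyTorch sketch of MoH-style attention, not the authors' implementation: K and V are computed for all heads, each token keeps its shared heads plus its top-K routed heads, and the head outputs are combined by a weighted summation. For clarity it computes all head outputs densely and masks the unselected routed heads instead of skipping their query/attention computation, and the router and gating module names (`router`, `shared_router`, `alpha`) are assumptions rather than the paper's exact Eqs. (5)-(6).

```python
import torch
import torch.nn as nn


class MoHAttention(nn.Module):
    """Hypothetical sketch of Mixture-of-Head (MoH) attention."""

    def __init__(self, dim: int, num_heads: int = 8, num_shared: int = 2, top_k: int = 2):
        super().__init__()
        assert dim % num_heads == 0
        self.h, self.hs, self.k = num_heads, num_shared, top_k
        self.dh = dim // num_heads
        self.q_proj, self.k_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.v_proj, self.o_proj = nn.Linear(dim, dim), nn.Linear(dim, dim)
        self.router = nn.Linear(dim, num_heads - num_shared)  # scores for routed heads
        self.shared_router = nn.Linear(dim, num_shared)       # weights for shared heads
        self.alpha = nn.Linear(dim, 2)                        # shared-vs-routed balance

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, D = x.shape
        # K and V are computed for every head, even heads a token does not select.
        q = self.q_proj(x).view(B, T, self.h, self.dh).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.h, self.dh).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.h, self.dh).transpose(1, 2)
        attn = (q @ k.transpose(-2, -1)) / self.dh ** 0.5
        heads = attn.softmax(dim=-1) @ v                      # (B, h, T, dh)

        # Per-token gate: shared heads always active, routed heads sparsified via top-K.
        routed = self.router(x).softmax(dim=-1)               # (B, T, h - hs)
        val, idx = routed.topk(self.k, dim=-1)
        routed_gate = torch.zeros_like(routed).scatter(-1, idx, val)
        shared_gate = self.shared_router(x).softmax(dim=-1)   # (B, T, hs)
        a = self.alpha(x).softmax(dim=-1)                     # (B, T, 2)
        gate = torch.cat([a[..., :1] * shared_gate, a[..., 1:] * routed_gate], dim=-1)

        # Weighted summation of head outputs (dense compute + masking, for clarity only).
        out = heads.transpose(1, 2) * gate.unsqueeze(-1)      # (B, T, h, dh)
        return self.o_proj(out.reshape(B, T, D))


x = torch.randn(2, 16, 64)
print(MoHAttention(dim=64)(x).shape)  # torch.Size([2, 16, 64])
```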

Your suggestion to conduct a detailed analysis of each head's attention map is very insightful. As per your recommendation, we have visualized the attention maps of MoH in both MoH-ViT-B and MoH-LLaMA3-8B.

For MoH-ViT-B, Figure 1 presents a comprehensive visualization of the 4 shared heads and 28 routed heads for 49 tokens in an image. Our observations show that the shared heads tend to focus on larger areas, while the routed heads focus more on finer details in the image. Figure 2 provides an example where the shared heads focus on a broad area, while the routed heads focus on the image’s finer details. This result further confirms that the shared heads tend to learn general knowledge, while the routed heads focus on learning more specialized knowledge.

For MoH-LLaMA3-8B, we visualize the attention maps for 16 shared heads and 16 routed heads for the sentence "Give me a short introduction to large language model." in Figure 3. We observe that shared heads may tend to learn fixed patterns, such as focusing solely on the query token. In contrast, the attention patterns of the routed heads are more diverse. Figure 4 provides an example.

In summary, since all keys and values are computed in MoH, its attention mechanism has the same range as that of multi-head attention. As a result, in our experiment, we can directly replace multi-head attention with MoH, and the model still performs well without modifying any training parameters. Besides, in MoH, shared heads and routed heads are responsible for learning global knowledge and specialized knowledge, respectively. As a result, the redundancy of attention heads in MoH may be lower than in multi-head attention. Finally, the combination of routed heads in MoH introduces more variability, suggesting that MoH may have a higher performance ceiling than multi-head attention.

We sincerely hope that our responses have addressed your concerns. We will include the important discussions mentioned above in the final manuscript and highlight them for clarity. If anything is still unclear or needs more explanation, we are happy to provide further details.

If our response has resolved your question, we kindly and humbly ask you to consider updating your score, as your affirmation would mean a great deal to us and help us improve our work.

Review (Score: 3)

The paper proposes leveraging the Mixture-of-Experts (MoE) mechanism to upgrade the standard Multi-Head Attention into a novel Mixture-of-Heads (MoH) Attention. Specifically, MoH replaces the standard summation in multi-head attention with a weighted summation, where the weights are determined by a newly introduced Two-Stage Routing strategy. Experiments on Vision Transformers (ViT), Diffusion Transformers (DiT), and large language models (LLMs) are conducted to demonstrate the effectiveness of the proposed MoH technique.

Questions for the Authors

Please see the above comments.

Claims and Evidence

Partially.

There are inconsistencies between the theoretical formulation of MoH and its implementation in the LLaMA3-8B experiment. Specifically, the router used in the experiments differs from the one described in Eqs. (5) and (6); moreover, the weighting mechanism also deviates from the theoretical presentation. These discrepancies raise concerns about the alignment between the proposed theory and its practical application, and they should be addressed to ensure the validity of the results.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

NA.

Experimental Design and Analysis

Yes.

Supplementary Material

No.

Relation to Prior Literature

The key contributions of the paper are related to building a general transformer-based network architecture.

Essential References Not Discussed

No.

Other Strengths and Weaknesses

Strengths:

  • The proposed MoH technique is simple and easy to integrate into existing architectures, making it a practical contribution to the field.

  • The idea of using a weighted summation to enhance multi-head attention is intuitively appealing and expected to improve performance.

  • The experiments span a wide range of tasks, including classification, generation, and LLMs. The results are promising and suggest broad applicability.

  • The paper is well-written and easy to follow.

Weaknesses:

  • There are inconsistencies between the theoretical formulation of MoH and its implementation in the LLaMA3-8B experiment. Specifically, the router used in the experiments differs from the one described in Eqs. (5) and (6); moreover, the weighting mechanism also deviates from the theoretical presentation. These discrepancies raise concerns about the alignment between the proposed theory and its practical application, and they should be addressed to ensure the validity of the results.

  • The paper does not provide a clear rationale for how the value of K in the Top-K selection is determined. This is a critical parameter that directly impacts the behavior of the MoH mechanism. Besides, the relationship between the K value and the ratio of shared heads is not discussed. Understanding this relationship is essential for interpreting the results and optimizing the method for different tasks.

Other Comments or Suggestions

No.

Author Response

We sincerely appreciate your thoughtful comments and for recognizing that our method is a "practical contribution," "can be fine-tuned from pre-trained multi-head attention models," and "the results are promising and suggest broad applicability." Below, we address your questions in detail.

Q1: There are inconsistencies between the theoretical formulation of MoH and its implementation in the LLaMA3-8B experiment.

A1: We explain this problem in three aspects:

  • Experiments on ViT, DiT, and LLMs are conducted with models trained from scratch, giving us more flexibility to modify the architecture. However, LLaMA3-8B is a pre-trained model trained on 15T tokens, while we had only 400B tokens (about 3% of the pre-training data) for continue-tuning. This limited our ability to alter the model structure significantly. To address this, we adjusted our original formula to maximize weight reuse from the pre-trained model while preserving its output distribution.

  • We have conducted experiments on LLMs trained from scratch, demonstrating the advantages of our method. Since much of the current research focuses on fine-tuning open-source LLMs, we introduced additional experiments to continue-train LLaMA3-8B. This contribution has been acknowledged by all other reviewers.

  • To the best of our knowledge, our work is the first attempt to reduce the computational cost of the attention mechanism without degrading model performance by continue-tuning a pre-trained model. We consider our proposed training techniques to be a valuable additional contribution.

Q2: The paper does not provide a clear rationale for how the value of K in the Top-K selection is determined.

A2: Thanks for your insightful advice. There are two ways to set this up: (1) manual configuration and (2) learning through learnable masks. In the manuscript, we show the results of manual configuration. For the learnable mask approach, we follow a method similar to [1]. Specifically, we introduce a mask module with binary values {0,1} applied to the heads. These masks are dynamically learned rather than statically assigned, allowing the model to determine which heads should be shared. The remaining heads are designated as routed heads, and we set K based on a predefined ratio. For example, if the ratio is 1/4 and there are 8 routed heads, then K is set to 2. As shown in the table below, our latest experimental results show that learning through learnable masks is generally better than manual configuration.

| Method | # Activated Heads (%) | Image Classification (Acc) |
| --- | --- | --- |
| Baseline | 100 | 84.8% |
| Manual Configuration | 75 | 84.9% |
| Manual Configuration | 50 | 84.7% |
| Learning Through Learnable Masks | 50 | 85.0% |

[1] Liu, Peiyu, et al. "MOEfication by Experts as Masks."

Q3: The relationship between the K value and the ratio of shared heads is not discussed.

A3: Thanks for your valuable suggestion. It is worth noting that when using learnable masks to determine the number of shared heads, we only need to set the ratio of K among the routed heads. As shown in the table below, this ratio is a trade-off parameter. If the ratio is too small, the number of activated heads will be insufficient, potentially degrading performance. Conversely, if the ratio is too large, it reduces the model's sparsity, limiting the efficiency improvements.

| Method | Ratio of K | Image Classification (Acc) |
| --- | --- | --- |
| Learning Through Learnable Masks | 1/8 | 84.8% |
| Learning Through Learnable Masks | 1/4 | 85.0% |
| Learning Through Learnable Masks | 1/2 | 84.9% |

We sincerely thank you for your constructive comments. We will add the above important discussions in the final manuscript and highlight them. Thanks again for taking the time and effort on our paper.

Review (Score: 3)

The paper introduces MoH (Mixture-of-Head Attention), a novel perspective on multi-head attention that formulates each head as an expert in a Mixture of Experts (MoE) framework. It employs a two-stage routing mechanism—comprising shared and non-shared experts—to reduce computational costs while enhancing accuracy. The authors validate their approach through experiments on both training-from-scratch and fine-tuning settings across vision tasks (such as image classification and class-conditional image generation) and language tasks. The results demonstrate that MoH outperforms or matches vanilla multi-head attention while activating fewer parameters.


Update after rebuttal

After reading the rebuttal, I appreciate that the authors thoroughly attempted to address and clarify all of my questions. However, my primary concern remains the incremental nature of the current version of the paper.

First of all, the idea of using shared and non-shared heads originates from DeepSeek-MoE. While the authors claim to extend this idea by analyzing the differences between shared and routed heads through feature rank analysis, I view this more as an experimental insight rather than a fundamentally novel contribution.

Moreover, the core idea of the paper is based on the equivalence between the summation and concatenation forms of multi-head attention. Building on this, the authors propose treating each head as an expert and applying a mixture-of-experts framework. This idea is relatively straightforward, and many other components are drawn from existing literature. In response to my concerns, the authors explained their method for integrating MoE into multi-head attention in the presence of dynamic KV cache. Specifically, they compute the key and value for all heads, even those not selected by a given token. While this is a thoughtful engineering solution, I do not find it substantial enough to constitute the primary contribution of the work. Since the core idea of the paper is simple and the effectiveness is demonstrated through empirical results, I think the paper requires stronger mathematical justification to support the effectiveness of MoE, which could enhance the overall contribution.

Nevertheless, I appreciate the new perspective introduced in the paper, especially the reinterpretation of multi-head attention in summation form, which to my knowledge has not been explored in previous works. Along with the improved empirical results, I believe this work sheds light on potential advancements in multi-head attention using MoE and paves the way for more theoretical works in this field.

Therefore, I have increased my score from 2 to 3 and leaned towards acceptance after considering the rebuttal.


Questions for the Authors

See the weaknesses mentioned in comments about methodology and experimental results above. Furthermore, I have additional questions for the authors:

  1. What are the motivations to define $\alpha_1$ and $\alpha_2$ as in Eq. 6 rather than considering them as hyperparameters?

  2. Explain the parameter-free router mentioned in Section 4.4 (line 319)?

Claims and Evidence

  1. The motivation for proposing MoH as a dynamic-head routing mechanism is clear. It builds on existing literature showing that redundant heads in traditional multi-head attention can be pruned to reduce computational cost while maintaining accuracy. However, while prior work supports this claim when heads are combined through concatenation, the proposed method instead uses summation. This raises concerns about whether the claim from the literature still holds in this new form.

  2. I cannot find any theoretical guarantee for the effectiveness of the proposed methods, particularly when the heads are combined using summation via MoE gate. Even the appendix does not provide a clear justification or proof.

  3. The methodology is incremental. I will provide detailed comments in the following section.

  4. Existing works that explore the mixture of heads in attention mechanism are not thoroughly discussed in the related works section.

Methods and Evaluation Criteria

While I appreciate the authors' effort in developing a general framework that integrates Mixture of Experts (MoE) into attention layers in Transformers, the paper appears to be an incremental extension of existing work rather than presenting a novel contribution. Additionally, the writing and the formulation of the proposed methods present several issues.

  1. Most of the components of the proposed methods are directly adapted from previous works, making the paper more of a combination of existing approaches rather than a novel contribution.
  • The core idea of two-stage routing involves introducing shared experts to capture global information, as presented in [1]. The original motivation of Dai et al. for this approach was to enhance expert specialization. However, this paper adapts it directly without discussing its suitability or potential additional benefits for the proposed mixture-of-head attention setting.

  • The load-balancing loss is also directly adapted from previous works.

  2. The summation form of combining heads is inconsistent with the authors' claim.
  • The authors claim that in the proposed method, each head is divided by rows and then concatenated according to Eq. (3). Nevertheless, only $W_O$ is divided into smaller $W_O^{i}$, and $H_i$ remains undiscussed, meaning that $H_i$ is obtained as in vanilla multi-head attention and is not divided by rows. I recommend the authors state clearly that the expert defined here is $H_i W_O^{i}$, to avoid the confusion that the experts are $W_O^{i}$ only.
  3. The lack of discussion of the newly defined experts.
  • While traditional MoE considers each expert to be an FFN, the new formulation of each expert $i$ as $H_i W_O^{i}$ needs further theoretical discussion regarding its effectiveness, convergence rate, and optimization scheme.
  4. The abuse of the notation $W$ in computing the gating score makes the paper confusing.
  • As far as I understand, the paper uses $W_r$ and $W_s$ as the expert embeddings. However, the notation $W$ and its description as a projection matrix may be confusing, as it resembles the projection matrices used in a standard Transformer.

References

[1] Dai, Damai, et al. "Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models." arXiv preprint arXiv:2401.06066 (2024).

Theoretical Claims

I find no theoretical guarantees or mathematical proofs in the paper.

Experimental Design and Analysis

I have checked all experimental parts and found that the experimental design is appropriate, including a wide range of tasks from vision to language. However, I still have concerns about the authors' claim that the new method can be fine-tuned from pretrained models, which is presented as an advantage compared to [1].

  • As far as I understand, the authors initialize the router weights and copy the pretrained weights for $W_O$, $W_Q^{i}$, and $W_K^{i}$ to fine-tune the model. However, with that scheme, the choice of the shared heads and the division of $W_O$ into smaller $W_O^{i}$ will affect the initialization of the model (e.g., which pretrained head is better to share).
  • Additionally, if the proposed method can be fine-tuned by copying weights from pretrained models, then other approaches involving a mixture of heads, such as [1], can also leverage this technique by copying pretrained weights and initializing routers (e.g., sparse upcycling [2], [3]). This suggests that the empirical advantages of MoH are not unique since upcycling techniques could be applied to other methods as well, diminishing the claimed advantage of MoH.

References

  1. Zhang, Xiaofeng, et al. "Mixture of attention heads: Selecting attention heads per token." arXiv preprint arXiv:2210.05144 (2022).

  2. Komatsuzaki, Aran, et al. "Sparse upcycling: Training mixture-of-experts from dense checkpoints." arXiv preprint arXiv:2212.05055 (2022).

  3. Zhang, Qizhen Irene, et al. "Bam! just like that: Simple and efficient parameter upcycling for mixture of experts." Advances in Neural Information Processing Systems 37 (2024): 56304-56321.

Supplementary Material

I have reviewed the code provided in the supplementary materials. While the authors include code for reproducing the vision tasks, they do not provide code for the language tasks. Additionally, I would appreciate it if the authors could acknowledge the base code they adapted or explicitly state in the README file whether the code was written from scratch.

Relation to Prior Literature

From my own perspective, this work provides a promising suggestion to advance attention-based models despite the weaknesses mentioned above.

Essential References Not Discussed

The authors should provide a more in-depth discussion of related works involving the mixture of heads in Section 2, along with a critical analysis of their weaknesses to better highlight the paper’s contribution. While I acknowledge the comparison with Mixture of attention heads (MoA) [1] in Section 5, the broader literature on this topic remains insufficiently discussed ([2], [3]).


References

[1] Zhang, Xiaofeng, et al. "Mixture of attention heads: Selecting attention heads per token." arXiv preprint arXiv:2210.05144 (2022).

[2] Csordás, Róbert, et al. "Switchhead: Accelerating transformers with mixture-of-experts attention." Advances in Neural Information Processing Systems 37 (2024): 74411-74438.

[3] Hao Peng, Roy Schwartz, Dianqi Li, and Noah A. Smith. A mixture of $h-1$ heads is better than $h$ heads. In Proc. Association for Computational Linguistics (ACL), pages 6566–6577, Virtual only, July 2020.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

I suggest that the authors move the comparison between MoH and MoA to the related works and consider the limitations of Mixture of attention heads (MoA) as a motivation for proposing MoH.

Author Response

We sincerely appreciate your thoughtful comments and your recognition that our method "provides a promising suggestion to advance attention-based models." Below, we provide detailed responses to your questions.

Q1: While prior work supports this claim when heads are combined through concatenation, the proposed method instead uses summation.

A1: First, in the forward pass, the summation and concatenation forms give the same results, i.e.,

$$\textrm{MultiHead}(X, X') = \sum_{i=1}^{h} H^{i} W_O^{i} = \textrm{Concat}(H^{1}, H^{2}, \ldots, H^{h})\, W_O.$$

Second, the gradients are also identical, i.e.,

$$\frac{\partial\, \textrm{MultiHead}(X, X')}{\partial H^{i}} = W_O^{i}.$$

Therefore, the summation and concatenation forms are mathematically equivalent.
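
The equivalence is easy to check numerically; below is a small, self-contained verification sketch (dimensions are arbitrary and chosen only for illustration):

```python
import torch

torch.manual_seed(0)
h, T, dh = 4, 5, 8                                  # heads, tokens, per-head dimension
H = [torch.randn(T, dh) for _ in range(h)]          # per-head outputs H^i
W_O = torch.randn(h * dh, h * dh)                   # output projection W_O

# Concatenation form: Concat(H^1, ..., H^h) W_O
concat_form = torch.cat(H, dim=-1) @ W_O

# Summation form: sum_i H^i W_O^i, where W_O^i is the i-th row block of W_O
sum_form = sum(H[i] @ W_O[i * dh:(i + 1) * dh] for i in range(h))

print(torch.allclose(concat_form, sum_form, atol=1e-5))  # True
```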

Q2: Theoretical guarantee for the effectiveness of the proposed methods.

A2: Thanks for your valuable suggestion. As mentioned by Reviewer xCL1, we put the theoretical guarantee that MoH is superior to multi-head attention in Appendix A. Specifically, we proved that MoH not only improves efficiency and model performance but also helps different attention heads to specialize better compared to multi-head attention, from both theoretical and experimental perspectives.

Q3: The methodology is incremental.

A3: The motivation for our work is to upgrade the multi-head attention mechanism, the core of the Transformer model, to reduce computational costs while maintaining or surpassing the previous accuracy level, rather than making improvements to the MoE. We propose combining multi-head attention and MoE, so we adopt the MoE technique:

  • For the two-stage routing, as shown in our response (A1) to Reviewer 6acz, the reason for our two-stage routing is to capture general knowledge. Besides, we compare the gradients and training data distribution of shared heads and routed heads in Appendix Table A to further demonstrate that shared heads play a key role in capturing general knowledge.

  • For the load-balancing loss, since we decompose multi-head attention into a summation form, which is similar to MoE structure, we directly adopt the auxiliary loss used in MoE.

Q4: The lack of discussion of the new experts defined.

A4: As suggested, we have defined the expert as $H^i W_O^i$ to avoid possible misunderstanding.

Q5: Further theoretical discussion regarding the effectiveness, the convergence rate and the optimization scheme.

A5: Mathematically, we prove that the summation and concatenation forms are equivalent. Besides, we show in Appendix A that the gradient per head in the MoH differs from the gradient of multi-head attention by only a single weight. Finally, we replace multi-head attention with MoH in various structures while keeping the original training parameters unchanged. The experimental results demonstrate that MoH enhances the performance of multi-head attention, providing experimental evidence of its effectiveness.
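
To make the per-head gradient remark concrete, here is a brief sketch under the weighted-summation form, writing $g_i$ for the routing weight of head $i$ (the appendix derivation may differ in its details):

$$\textrm{MoH}(X, X') = \sum_{i=1}^{h} g_i\, H^{i} W_O^{i} \quad\Longrightarrow\quad \frac{\partial\, \textrm{MoH}(X, X')}{\partial H^{i}} = g_i\, W_O^{i},$$

which differs from the multi-head gradient $W_O^{i}$ only by the single weight $g_i$ (the routing weight is computed from the token input, not from $H^{i}$, so no additional terms appear).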

Q6: The abuse of the notation $W$.

A6: As suggested, we have replaced the notation $W$ in the router with $E$, referring to them as expert embeddings.

Q7: Concerns about the claim of the author that the new methods can be fine-tuned from pretrained models.

A7: We explain this problem in three aspects:

  • We simply select the first 16 attention heads of each layer as shared heads. Even if the structure is not optimal, the experimental results show that MoH-LLaMA3-8B has a significant advantage over LLaMA3-8B. This result shows the robustness of our method.
  • Unlike the MoE upcycling technique, which copies the FFN to increase the model size, our MoH prunes the original model to reduce the activated parameters, which is more challenging.
  • To the best of our knowledge, our work is the first attempt to reduce the computational cost of the attention mechanism without degrading model performance by continue-tuning a pre-trained model.

Q8: Acknowledge the base code they adapted.

A8: Thanks for your advice. We will acknowledge the contributors in our official code, and release our trained models.

Q9: Existing works are not thoroughly discussed.

A9: As suggested, we have expanded and improved the discussion of related works.

Q10: What are the motivations to define $\alpha_1$ and $\alpha_2$ as in Eq. 6 rather than considering them as hyperparameters?

A10: We choose to predict $\alpha_1$ and $\alpha_2$ based on the input so that different tokens can dynamically combine general knowledge from the shared heads and specialized knowledge from the routed heads.

Q11: Explain the parameter-free router mentioned in Section 4.4 (line 319)?

A11: We use the $\ell_2$ norm (which measures the magnitude of a vector in Euclidean space) of each head to represent its importance. We then normalize the $\ell_2$ norms of all heads using SoftMax. This simple router achieves results comparable to learnable routers.
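
A minimal sketch of one possible reading of this parameter-free router (the tensor layout is an assumption):

```python
import torch


def parameter_free_router(head_outputs: torch.Tensor) -> torch.Tensor:
    """Score each head by the L2 norm of its output, then normalize with SoftMax.

    head_outputs: (batch, tokens, heads, head_dim)
    returns:      (batch, tokens, heads) routing scores
    """
    scores = head_outputs.norm(p=2, dim=-1)  # magnitude of each head's output vector
    return scores.softmax(dim=-1)
```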

Reviewer Comment

Dear Authors,

Thank you for spending the time to reply to my review. I understand that your primary goal is to improve multi-head attention in Transformers rather than enhancing MoE, as well as to demonstrate the equivalence between the summation form and the concatenation form in Section 3.1. However, I still find that your response does not fully address my concerns.

  1. Since the main objective is to improve multi-head attention, the only key contribution of the manuscript appears to be the demonstration that the summation form and the concatenation form are equivalent, and from that summation form, the authors then propose to consider each head as an expert and apply MoE to Multi-head Transformer. While this is an interesting finding, I do not find it substantial enough for publication.

Furthermore, the theoretical explanation in Appendix A.1 suggests that in multi-head attention, each expert processes a subset of data, while shared experts enhance specialization among the remaining ones. This concept appears to be an intuitive adaptation from general MoE and especially DeepSeek MoE [1]. However, I am not convinced that Table A offers a proper theoretical proof demonstrating that experts indeed become more specialized or that MoH outperforms standard multi-head attention beyond the intuition drawn from [1].

  2. In A9, the authors stated:

    "As suggested, we have expanded and improved the discussion of related works."

    However, I could not find this expanded discussion in your response to my review. I kindly ask the authors to either provide additional rebuttal comments or direct me to the relevant section where these improvements have been made. Specifically, I would like to see a detailed comparison between MoH and prior works on mixture of heads in Transformers.

PS. Until now, my opinion remains largely unchanged. I appreciate the authors' effort in adapting Mixture of Experts to enhance Multi-head Attention, as well as their response to my review. However, I still consider this a borderline paper, and I am currently leaning towards rejection. Nevertheless, I am open to reconsidering my score if the authors can address all my concerns outlined above.

Thank you.


References.

[1] Dai, Damai, et al. "Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models." arXiv preprint arXiv:2401.06066 (2024).


Update: I thank the authors for their thorough response to my (additional) comments. Although I believe the current version of the paper is borderline and requires additional mathematical justification for the effectiveness of MoH rather than relying solely on experimental results, this paper provides a new perspective with improved performance, which sheds light on potential advancements in Multi-head Attention using MoE and paves the way for more theoretical work in this field. Therefore, I have increased my score from 2 to 3. I hope the authors can include a more detailed discussion (which was briefly touched upon during the rebuttal phase) in the revised version of the manuscript.

Author Comment

Thank you so much for your valuable feedback. We truly appreciate the time and effort you've spent carefully reviewing our paper. We're sorry that our previous responses didn't fully meet your expectations, and we will provide more detailed answers to your questions below.

Q1: The contribution of the manuscript is not substantial enough.

A1: We explain this problem in three aspects:

  • First, combining MoE with attention mechanisms is not simple. MoE activates multiple FFNs sparsely, with each token computing independently. In contrast, attention relies on a dynamic KV cache, where different tokens may activate different heads. We use the approach outlined in our "Reply Rebuttal Comment" to Reviewer 6df6 to achieve sparse activation of attention heads. Other methods [1,2,3] avoid this complexity, leading to many drawbacks (please refer to A2 below).
  • Second, as noted by Reviewer xCL1, we provide evidence that MoH outperforms multi-head attention in Appendix A. Specifically, in Table B of Appendix A.1, we calculate the similarity of attention patterns and output features across different attention heads. As shown in the table below, the similarity in MoH is lower than in standard multi-head attention, indicating reduced redundancy and better differentiation among the attention heads in MoH. (Given a pair of attention score matrices $A$ and $A'$, we calculate the similarity of attention patterns as $1 - \frac{1}{2}\mathbb{E}[\,\|A-A'\|_{1}\,]$. Since attention scores form a probability distribution for each query, the similarity always lies between 0 and 1; a small code sketch of this metric is given after the tables below.)

| Method | Attention Pattern Similarity (ViT) | Attention Pattern Similarity (LLM) | Output Feature Cosine Similarity (ViT) | Output Feature Cosine Similarity (DiT) |
| --- | --- | --- | --- | --- |
| Multi-Head Attention | 0.5159 | 0.4795 | 0.0411 | 0.2550 |
| MoH | 0.3978 | 0.4333 | 0.0165 | 0.2042 |
  • Third, we take it a step further than DeepSeekMoE. We provide evidence that shared heads and routed heads capture different types of knowledge. Experimentally, we analyze the feature rank of shared and routed heads. A lower feature rank means a higher correlation between features from different samples, indicating that the features capture more general knowledge. As shown in the table below, the feature rank of the shared heads is much lower than that of the routed heads, suggesting that shared heads capture common information, while routed heads focus on sample-specific details.

| Model | Hidden Size | Feature Rank of Shared Heads | Feature Rank of Routed Heads |
| --- | --- | --- | --- |
| ViT | 768 | 164 | 270 |
| LLM | 1536 | 1123 | 1441 |
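
As referenced above, a minimal sketch of the attention-pattern similarity metric (assuming two row-stochastic attention matrices of equal shape):

```python
import torch


def attention_pattern_similarity(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
    """Similarity = 1 - 0.5 * E[ ||A - B||_1 ], where each row of A and B sums to 1.

    The per-row term 0.5 * |A - B|_1 is a total-variation distance in [0, 1],
    so the resulting similarity also lies in [0, 1].
    """
    tv = 0.5 * (A - B).abs().sum(dim=-1)  # per-query total-variation distance
    return 1.0 - tv.mean()


A = torch.softmax(torch.randn(8, 8), dim=-1)
B = torch.softmax(torch.randn(8, 8), dim=-1)
print(attention_pattern_similarity(A, B))
```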

Q2: Detailed comparison between MoH and prior works.

A2: We give a detailed comparative discussion below:

  • MoA [1], like MoE, focuses on increasing model size while keeping inference costs low. Since duplicating attention heads increases the KV cache, MoA uses a fully shared KV for all heads; MoA thus avoids a dynamic KV cache across heads. As we replied to Reviewer xCL1, this design greatly limits the performance of MoA.
  • SwitchHead [2] applies MoE to the Q projection, K projection, V projection, and output projection, instead of sparse activation at the head level. It is worth noting that the expert in the MoE used by SwitchHead is a single linear layer rather than the common MLP: $Q^i = \mathrm{MoE}_Q^i(x, \text{expert}{=}\text{Linear})$, $K^i = \mathrm{MoE}_K^i(x, \text{expert}{=}\text{Linear})$, $V^i = \mathrm{MoE}_V^i(x, \text{expert}{=}\text{Linear})$, $W^i_O = \mathrm{MoE}_O^i(x, \text{expert}{=}\text{Linear})$.
  • MAE [3] does not use sparse activation, and its G-step and F-step iterative optimization strategy significantly increases the training cost: $\mathrm{MAE}(x) = \sum_{i=1}^{h} g_{i}(x)\,\frac{h}{h-1}\bigl(-H_i + \sum_{j=1}^{h} H_j\bigr).$

In contrast, MoH aims to make attention more efficient without adding extra parameters. Besides, MoH preserves much of the multi-head attention structure, offering three key advantages:

  • In our experiments, we replaced multi-head attention with MoH without changing any training parameters. In contrast, prior works [1,2,3] modify the structure, requiring them to be reconfigured.
  • We show that a pretrained model can be further continue-tuned into our MoH models by low-cost training, which is very important in the era of large models, because most researchers do not have enough computing power to train a large model from scratch.
  • MoH can be easily adapted across a variety of popular dense and MoE-based model frameworks. Prior works [1,2] have only been compared with MoE-based language methods.

We will include the important discussions in the final manuscript. If our response has resolved your question, we kindly and humbly ask you to consider updating your score.

[1] Zhang, Xiaofeng, et al. "Mixture of attention heads: Selecting attention heads per token."

[2] Csordás, Róbert, et al. "Switchhead: Accelerating transformers with mixture-of-experts attention."

[3] Hao Peng, Roy Schwartz, Dianqi Li, and Noah A. Smith. A mixture of $h-1$ heads is better than $h$ heads.

Review (Score: 4)

The paper introduces Mixture-of-Head Attention (MoH) as an enhancement to the multi-head attention (MHA) mechanism in Transformer models, aiming to reduce computational costs while maintaining or improving model accuracy. The key insight is that not all attention heads contribute equally, and some can be pruned or dynamically selected without significantly affecting performance. Inspired by Mixture-of-Experts (MoE) models, MoH treats attention heads as experts and introduces a router that selects the most relevant heads for each token. This allows MoH to activate only a subset of attention heads dynamically, improving efficiency without increasing the number of model parameters. Additionally, MoH replaces the standard summation of multi-head attention with a weighted summation, enhancing flexibility.

The authors validate MoH across multiple architectures, including Vision Transformers (ViT), Diffusion Transformers (DiT), and Large Language Models (LLMs). Results show that MoH can match or outperform standard MHA while using only 50%–90% of the attention heads. Notably, MoH-LLaMA3-8B achieves a 2.4% accuracy improvement over LLaMA3-8B using only 75% of the attention heads.


Update Post-rebuttal

I am happy with the authors response and would like to see the paper accepted.


Questions for the Authors

If MoH is inspired by Mixture-of-Experts, then why does it manually designate some heads as "always active" instead of letting the router handle all head selection dynamically?

If certain heads are critical enough to be always active, why doesn’t the routing function naturally prioritize them?

Claims and Evidence

The paper presents MoH (Mixture-of-Head Attention) as an alternative to standard Multi-Head Attention (MHA) and claims that it improves efficiency and accuracy across multiple architectures (ViTs, DiTs, and LLMs). While the empirical results generally support these claims, some claims require further validation. For instance, the authors claim that MoH improves efficiency without increasing the parameter count, but they do not provide FLOP counts or memory benchmarks to confirm the efficiency gains. Moreover, the authors state that MoH's routing strategy is optimal for balancing shared and routed heads, but they do not compare this strategy against, say, dynamic attention strategies such as sparse attention.

Methods and Evaluation Criteria

MoH seems to be well-suited for improving Transformer efficiency, and the evaluation covers a diverse range of architectures (ViTs, DiTs, and LLMs) using 14 benchmark datasets (e.g., MMLU, CEVAL, GSM8K, TruthfulQA), making the results broadly applicable. However, the paper lacks efficiency metrics (e.g., FLOPs, memory usage, latency) despite claiming computational improvements. Additionally, MoH’s manual selection of shared vs. routed heads is not compared to fully dynamic routing methods (e.g., MoE-based attention), leaving some doubts about whether the hybrid design is necessary. Addressing these gaps could strengthen the empirical validation of MoH.

Theoretical Claims

The core theoretical claims about selective activation and weighted summation are well-supported, but the manual head selection strategy and efficiency gains lack mathematical justification.

Experimental Design and Analysis

The experimental design provides a broad evaluation of MoH across ViTs, DiTs, and LLMs, using many benchmark datasets (e.g., MMLU, CEVAL, GSM8K, TruthfulQA), which strengthens the generalization claims. The experimental setup is strong in terms of dataset diversity, but the lack of efficiency benchmarks and alternative routing comparisons weakens MoH's empirical claims.

Supplementary Material

Skimmed over it.

Relation to Prior Literature

The paper builds on prior work in Multi-Head Attention and Mixture-of-Experts models by introducing Mixture-of-Head Attention, which selectively activates attention heads to improve efficiency. The concept of routing-based selection is inspired by MoE-based Transformers, but MoH applies it at the attention head level instead of full feedforward layers, making it a more lightweight alternative. The idea of structured sparsity in Transformers aligns with research on sparse attention mechanisms and adaptive token selection methods, though MoH introduces a unique hybrid model where some heads are manually fixed while others are routed dynamically. However, the paper does not compare MoH against fully dynamic routing mechanisms like MoE-based attention, which raises questions about whether its manual head selection strategy is necessary.

Essential References Not Discussed

The paper provides theoretical justification for MoH, focusing on the routing mechanism and its impact on efficiency and expressivity. The derivations related to selective head activation and weighted summation appear logically consistent, following standard MoE formulations. However, the manual selection of shared heads is not rigorously justified—there is no proof explaining why certain heads must always be active, rather than letting the routing function learn to retain essential heads automatically. Additionally, the paper does not compare MoH's routing formulation against fully dynamic MoE-based attention models, leaving its mathematical superiority unverified.

If MoH is inspired by Mixture-of-Experts, then why does it manually designate some heads as "always active" instead of letting the router handle all head selection dynamically? I think this partially undermines the motivation for using a routing mechanism in the first place.

Other Strengths and Weaknesses

The paper presents a refinement of Multi-Head Attention by introducing Mixture-of-Head Attention, which selectively activates heads to improve efficiency without increasing parameter count. This hybrid approach balances shared and dynamically routed heads, offering a new perspective on structured sparsity in Transformers. The significance is high, as MoH generalizes across ViTs, DiTs, and LLMs, making it applicable to a wide range of architectures. However, clarity is hindered by the lack of justification for manually fixing some heads as always active, which contradicts the motivation for routing-based selection. Additionally, claims about computational efficiency are not supported by FLOP/memory benchmarks, and no comparisons are made against alternative dynamic attention mechanisms (e.g., MoE-based attention, sparse routing). Strengthening these aspects would solidify MoH’s contributions and further validate its impact.

Other Comments or Suggestions

None.

Author Response

We sincerely thank the reviewer for the constructive comments, and for noting that our method provides "a new perspective on structured sparsity in Transformers" and that "the significance is high." We address the questions as below.

Q1: If MoH is inspired by Mixture-of-Experts, then why does it manually designate some heads as "always active" instead of letting the router handle all head selection dynamically?

A1: We explain this problem in three aspects:

  • The load-balance loss (also called the MoE loss) pushes experts to specialize in specific areas rather than capture general knowledge. As formalized below, the load-balance loss encourages all routed experts to be chosen equally often and to become specialized. However, this conflicts with the idea that essential experts should be selected more frequently and should learn broader, general knowledge. Because of the load-balance loss, even though some heads are critical enough to remain active at all times, the routing function cannot naturally prioritize them (a code sketch of this loss is given after this list).

$$L_b = \sum_{i=h_s+1}^{h} P_i f_i,$$

$$P_i = \frac{1}{T} \sum_{t=1}^{T} \text{Softmax}(W_{r} x_{t})_{i-h_s},$$

$$f_i = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}(\text{Token } x_t \text{ selects Head } i).$$

  • From the perspective of training stability, keeping some heads active at all times helps maintain stable gradients. If all heads are selected freely, gradient and loss spikes can increase significantly, reducing training efficiency.

  • In recent MoE work [1], some experts are also selected as shared experts to extract general knowledge. Besides, in attention mechanisms, some heads may capture common knowledge across different contexts, such as grammatical rules in language. Inspired by this idea, we designate a subset of heads as shared heads that remain always activated. We also compare the gradients and training data distribution of shared heads and routed heads in Appendix Table A to further demonstrate that shared heads play a key role in capturing general knowledge.

[1] Dai, Damai, et al. "Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models." arXiv preprint arXiv:2401.06066 (2024).
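For concreteness, the following is a minimal PyTorch-style sketch of the load-balance loss $L_b$ defined above, computed over the routed heads only; the function and tensor names are illustrative, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def load_balance_loss(router_logits, selected_idx, num_routed_heads):
    """Load-balance loss L_b over routed heads (shared heads are excluded).

    router_logits: [T, num_routed_heads]  router scores W_r x_t per token
    selected_idx:  [T, k]                 indices of the routed heads chosen per token
    """
    # P_i: average routing probability assigned to routed head i across tokens.
    probs = F.softmax(router_logits, dim=-1)                                      # [T, R]
    P = probs.mean(dim=0)                                                         # [R]

    # f_i: fraction of tokens that actually select routed head i.
    one_hot = F.one_hot(selected_idx, num_routed_heads).sum(dim=1).clamp(max=1)   # [T, R]
    f = one_hot.float().mean(dim=0)                                               # [R]

    # L_b = sum_i P_i * f_i: minimized when load is spread evenly across heads,
    # which is exactly what prevents the router from favoring a few "critical" heads.
    return (P * f).sum()
```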

Q2: Claims about computational efficiency are not supported by FLOPs/memory benchmarks.

A2: Thanks for your valuable suggestion. We provide a comparison of the efficiency of MoH and the multi-head attention mechanism in Table 7, and we have added FLOPs and memory as suggested. MoH surpasses the standard multi-head attention mechanism, and its advantage becomes more pronounced as the input sequence length increases.

We find that MoH slightly reduces GPU memory usage, though the difference is not significant. This is because GPU memory is primarily used to store model parameters, gradients, and the KV cache. Since MoH only optimizes attention computation, it does not substantially reduce the GPU memory of these three components.

| Method | # Head Num | # Head Dim | # Sequence Length | # Activated Heads (%) | Time (ms) | FLOPs | Memory |
|---|---|---|---|---|---|---|---|
| Multi-Head Attention | 32 | 64 | 256 | 100 | 0.360 | 23M | 998M |
| MoH | 32 | 64 | 256 | 90 | 0.352 | 22M | 980M |
| MoH | 32 | 64 | 256 | 75 | 0.321 | 18M | 978M |
| MoH | 32 | 64 | 256 | 50 | 0.225 | 13M | 978M |
| Multi-Head Attention | 32 | 64 | 512 | 100 | 1.376 | 83M | 3356M |
| MoH | 32 | 64 | 512 | 90 | 1.351 | 77M | 3354M |
| MoH | 32 | 64 | 512 | 75 | 1.180 | 65M | 3328M |
| MoH | 32 | 64 | 512 | 50 | 0.863 | 45M | 3302M |

Q3: No comparisons are made against alternative dynamic attention mechanisms (e.g., MoE-based attention, sparse routing).

A3: Thanks for your insightful advice. As shown in the table below, MoH outperforms both sparse attention and MoE-based attention. These results also demonstrate the benefits of shared heads from an experimental perspective.

| Method | # Activated Heads (%) | Image Classification (Acc) |
|---|---|---|
| Sparse attention S | - | 78.4% |
| MoE-based attention S | 75 | 75.6% |
| MoH S | 75 | 78.6% |

We sincerely thank you for your constructive comments. We will add the above important discussions in the final manuscript and highlight them. Thanks again for taking the time and effort on our paper.

Reviewer Comment

I thank the authors for their rebuttal. I am not sure what the second part (5th row to 8th row) of Table 7 refers to, but I am satisfied with their answers. I will maintain my score. Appreciate the time and efforts you put into the rebuttal.

Author Comment

Thank you for your valuable feedback. We truly appreciate the time and effort you spent reviewing our paper. We are pleased to know that our response was to your satisfaction.

In the second part (the 5th row to the 8th row) of Table 7, we increase the input sequence length. For rows 1 to 4, the input length is 256. For rows 5 to 8, it is 512.

We test different sequence lengths because sparse activation methods have an extra routing cost $C_{routing}$:

  • The routing cost $C_{routing}$ grows linearly with the input length $L$, i.e., $C_{routing} \propto L$.
  • However, the computational cost $C_{attn}$ of attention grows quadratically with the input length, i.e., $C_{attn} \propto L^2$.

We want to observe if our method performs better with longer sequences. As shown in Table 7, as the input length $L$ increases, our proposed MoH shows a greater speed advantage.
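As a rough illustration of this argument (assuming $C_{attn} \propto L^2$ per active head and $C_{routing} \propto L$; the constants in the snippet below are arbitrary placeholders, not measured values):

```python
# Back-of-envelope: routing overhead relative to the attention savings from MoH.
# Constants are arbitrary; only the scaling with sequence length L matters.
def relative_routing_overhead(L, num_heads=32, active_ratio=0.75, c_attn=1.0, c_route=1.0):
    attn_full = c_attn * num_heads * L**2                 # all heads active
    attn_moh  = c_attn * num_heads * active_ratio * L**2  # only the activated heads
    routing   = c_route * num_heads * L                   # per-token routing cost
    return routing / (attn_full - attn_moh)               # shrinks as L grows

for L in (256, 512, 1024):
    print(L, relative_routing_overhead(L))  # 0.015625, 0.0078125, 0.00390625
```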

We are sorry that the contents of Table 7 are somewhat confusing because we did not describe the table in sufficient detail. We will revise the caption of Table 7 to make it easier to follow.

We sincerely appreciate your support for our work. We kindly ask you to consider raising your score, as your encouragement is very important to us. It also helps more people discover and benefit from our research. Thank you for your understanding and support.

Review
4

This paper aims to enhance the efficiency of multi-head self-attention by integrating a mixture of experts into the attention mechanism. The authors propose Mixture-of-Head attention, which selectively activates a subset of attention heads for each token and computes a weighted sum over the selected heads to produce the final output. The approach demonstrates improved efficiency and performance across several tasks, including image classification, image generation, and both fine-tuning and training large language models from scratch.

Questions for the Authors

Please refer to the discussion above.

Claims and Evidence

The paper claims to reduce computational costs while maintaining comparable or favorable performance. The experiments confirm this to a certain extent. However, one weakness of the method is its relatively high activation rate.

Methods and Evaluation Criteria

The proposed method is validated on well-known tasks, including image classification and generation, as well as training and fine-tuning LLMs.

Theoretical Claims

The paper does not introduce new theoretical contributions, so there are no proofs to verify.

Experimental Design and Analyses

The experiments in the paper are well-conducted, and the efficiency of the proposed method is confirmed across various tasks and datasets. Additionally, the experimental details are clearly reported.

However, one major weakness of the paper is the lack of discussion on the standard MoE applied to feedforward layers:

  • Lack of comparison with a mixture of feedforward experts: Traditional MoE is typically applied to feedforward layers and has demonstrated significant improvements in both efficiency (via a highly sparse activation rate) and performance. Furthermore, training a feedforward MoE is generally simpler than training the proposed MoH.

  • Lack of integration with mixture of feedforward experts: Would incorporating MoE in both feedforward and attention layers result in even better results? It would be great to explore this possibility and see if it provides further efficiency gains and performance improvements.

Supplementary Material

I have reviewed the additional discussions, experiments, and some implementation details.

Relation to Prior Literature

This introduces a new type of mixture of experts, which I find both important and interesting.

Essential References Not Discussed

I did not identify any essential or significant related works that are missing.

Other Strengths and Weaknesses

Please refer to the discussion above.

Other Comments or Suggestions

Please refer to the discussion above.

Author Response

We sincerely appreciate your thoughtful comments and for recognizing that "The experiments in the paper are well-conducted." Below, we address your questions in detail.

Q1: One weakness of the method is its relatively high activation rate.

A1: To demonstrate the robustness of our method, we only replace multi-head attention with MoH in various structures while keeping the original training parameters unchanged. Therefore, the results presented in the manuscript may not be optimal. Our latest experimental results indicate that, with tuning, our method can achieve even higher performance with a lower activation rate. Besides, we are developing deep learning-based methods to automatically determine the optimal activation ratio. We believe that our proposed MoH is promising and can be further optimized for even better performance.

| Method | # Activated Heads (%) | Image Classification (Acc) |
|---|---|---|
| Multi-Head Attention | 100 | 84.8% |
| MoH | 50 | 85.0% |

Q2: Lack of comparison with mixture of feedforward experts.

A2: Thanks for your insightful advice. We explain the difference between MoH and MoE from the following three aspects:

  • Attention and FFNs are the core components of Transformers. While MoE applies sparse activation at the FFN level, MoH introduces sparsity at the attention level. MoH not only extends the scope of MoE but also offers a more effective approach to reducing Transformer computation. This is particularly significant because, as the input sequence length increases, FFN computation grows linearly while attention computation scales quadratically (a brief cost estimate is given after this list). Consequently, MoH has greater potential to alleviate the computational burden of Transformers.
  • MoH presents greater technical challenges than MoE. Unlike the MoE upcycling technique, which copies the FFN to increase the model size, our MoH prunes the original model to reduce the activated parameters, making it more challenging.
  • MoH naturally leverages the multi-head structure in Transformers, while MoE requires additional FFN replication. From this perspective, MoH matches the original Transformer design better.
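As a back-of-envelope reminder of the scaling difference mentioned in the first point above (standard per-layer Transformer cost estimates with hidden size $d$ and sequence length $L$; constants omitted):

$$C_{\text{FFN}} = O(L\,d^{2}), \qquad C_{\text{attn}} = O(L\,d^{2} + L^{2}\,d),$$

so once $L$ grows beyond $d$, the quadratic $L^{2}d$ term dominates the total cost, which is why sparsifying attention heads has more headroom at long sequence lengths than sparsifying FFNs.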

Q3: Lack of integration with mixture of feedforward experts.

A3: Thanks for your valuable suggestion. Due to the time constraints of the rebuttal, we designed a small 28M-parameter model as a baseline, with a training budget of 100 epochs on ImageNet-1K classification. As shown in the table below, MoH can be combined with MoE: MoE enhances the model's performance by replicating the FFN, while MoH improves computational efficiency through the dynamic activation of attention heads.

| Method | # Params | # Activated Heads (%) | Image Classification (Acc) |
|---|---|---|---|
| Multi-Head Attention | 28M | 100 | 77.0% |
| MoH | 28M | 75 | 77.2% |
| MoE Top-1/2E | 45M | 100 | 78.1% |
| MoH & MoE Top-1/2E | 45M | 75 | 78.1% |

We sincerely thank you for your constructive comments. We will add the above important discussions in the final manuscript and highlight them. Thanks again for taking the time and effort on our paper.

Final Decision

In this paper, the authors propose Mixture-of-Head attention (MoH), which treats attention heads as experts in a mixture of experts. The proposed method allows each token to select the appropriate attention heads and adds extra flexibility to the attention mechanism, potentially improving the performance of Transformers.

After the rebuttal, the paper still received mixed reviews. On the positive side, most of the reviewers agree that the proposed method is sound and the experiments are well-conducted (e.g., they consist of a broad evaluation across ViTs, DiTs, and LLMs, using many benchmark datasets, including MMLU, CEVAL, GSM8K, and TruthfulQA). On the negative side, the method is not novel and is quite incremental given the existing literature on mixtures of attention heads and Deepseek-MoE. Furthermore, the effectiveness of MoH is demonstrated purely through empirical results; it is important to have a mathematical justification for its effectiveness.

After considering both the strengths and weaknesses of the paper, I think that the contributions of the proposed framework MoH are still sufficient and of interest to ICML. Therefore, I recommend accepting it in its current form. However, I encourage the authors to address the reviewers’ suggestions and integrate their feedback into the camera-ready version of their paper.