PaperHub

Overall score: 8.2 / 10
Decision: Spotlight (4 reviewers)
Ratings: 5, 5, 6, 4 (min 4, max 6, std 0.7)
Confidence: 4.0
Novelty: 3.0, Quality: 3.5, Clarity: 3.5, Significance: 3.3
TL;DR

MoBA is a dynamic sparse attention mechanism for long-context LLMs. It applies the Mixture of Experts (MoE) principle to attention, allowing the model to autonomously decide where to focus attention during training, without relying on predefined structural biases.

Abstract

Keywords
LLM, Sparse Attention, Long Context

Reviews and Discussion

Review
Rating: 5

This paper proposes MoBA, a method inspired by the Mixture of Experts (MoE) paradigm. The key idea is to divide the context into blocks and select the top-k most relevant blocks for attention computation. The approach effectively accelerates the prefill phase in long-context scenarios, though it does not provide significant speedups during decoding. The authors also released a flash_attn_varlen-based implementation to support reproducibility.

优缺点分析

Pros:

The paper is clearly written with a well-structured and logical flow, making it easy to read.

The core idea is neat and is shown to be empirically effective.

While the method appears simple, it is actually non-trivial. The authors identify the issue of unstable gradients introduced by the router and mitigate it with a simple hybrid strategy, highlighting the thoughtfulness of the design.

Code is released.

Cons:

All attention heads within a group are restricted to attend to the same selected blocks, which may degrade performance. I encourage the authors to add an ablation study to evaluate the impact of this design choice.

The routing mechanism requires compressing all keys within a block into a single vector, which may lead to information loss. For instance, in a synthetic experiment with keys and values as k₁:v₁, k₂:v₂, ..., kₙ:vₙ — as n increases, can MoBA still perform NIAH (needle-in-a-haystack retrieval) for any possible key? I encourage the authors to include such an experiment and analyze its dependence on block size. If a clear improvement is demonstrated, I would consider increasing my score.

It would be valuable to include results on the GSM-Infinite benchmark.

Questions

See cons.

Limitations

N/A

Final Justification

Good paper; it can be accepted.

Formatting Concerns

N/A

Author Response

Thank you for your positive feedback and insightful questions! Below we clarify the three main concerns.

(W1) Regarding the restriction that all attention heads within a group attend to the same selected blocks:

We would like to clarify that MoBA does not impose such a restriction on block selection. MoBA allows different attention heads to independently select their own top-k KV blocks. In fact, this restriction is introduced by NSA --- quoting from Section 2.3.2 of the NSA paper: "This aggregation ensures consistent block selection across heads within the same group."

(W2) Regarding MQAR-style synthetic experiment:

We train a MoBA model from scratch on MQAR (multi-query associative recall)-style synthetic data. The synthetic data construction follows Arora et al. [1], where each input sequence consists of a number of key-value pairs followed by queries (for example: A 4 B 3 C 6 F 1 E 2 -> A ? C ? F ? E ? B ?).
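For concreteness, here is a minimal sketch of one way such MQAR-style sequences could be generated (an illustrative construction following the description above; the vocabularies, counts, and the helper name make_mqar_example are assumptions, not the exact data pipeline used):

```python
import random

def make_mqar_example(num_pairs, num_queries, key_vocab, value_vocab, rng=random):
    """Build one MQAR-style sequence: key-value pairs, then queries whose answers
    must be recalled from the pairs (e.g. A 4 B 3 C 6 -> A ? C ?).
    Illustrative sketch only; vocabularies and formatting are assumptions."""
    keys = rng.sample(key_vocab, num_pairs)               # distinct keys
    values = [rng.choice(value_vocab) for _ in keys]
    context = [tok for k, v in zip(keys, values) for tok in (k, v)]

    queried = rng.sample(range(num_pairs), num_queries)   # which keys to query
    prompt = [tok for i in queried for tok in (keys[i], "?")]
    targets = [values[i] for i in queried]                # supervise only the "?" positions
    return context + prompt, targets

# Example: 5 key-value pairs, query 4 of them.
tokens, answers = make_mqar_example(
    num_pairs=5, num_queries=4,
    key_vocab=list("ABCDEFGH"), value_vocab=[str(i) for i in range(10)],
)
```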

The model is a 16-layer, 16-head dense Transformer with Multi-Head Attention (MHA). It has 822M non-embedding parameters, a hidden dimension of 2048, and employs the GLU activation in the feed-forward blocks. In our MoBA configuration, we segment the input into chunks of 256 tokens and select the top-4 experts per token; the overall sequence length is 2048.

We observe that the LM loss on the query ("?") tokens eventually drops to effectively zero (below 1e-7), meaning that MoBA can fit MQAR-style synthetic data. We will add more details in the next version of our manuscript.

[1] Arora, Simran, et al. "Zoology: Measuring and Improving Recall in Efficient Language Models." Proceedings of the 12th International Conference on Learning Representations (ICLR), 2024.

(W3) Regarding GSM-Infinite:

We evaluated Llama-8B-1M-MoBA and Llama-8B-1M-Full on a subset of the GSM-Infinite task. Because the rebuttal window is short, our protocol deviates slightly from the original paper:

  • We sample 200 examples from GSM-Infinite, whose number of ops ranges from 2 to 20 (sampled uniformly).
  • We report exact-match accuracy.

  Accuracy (%)                         MoBA Prefill + Full Decode   MoBA Prefill + MoBA Decode   Full
  gsm_infinite_sample200_hard_128k     10.5                         9.5                          12
  gsm_infinite_sample200_medium_128k   8.5                          9.5                          9
  Average                              9.5                          9.5                          10.5

Both MoBA and Full perform poorly yet comparably on this subset, in terms of average accuracy for ops <= 20 in 128K context. The above results suggest that the present base model (Llama-8B) and our long-context training recipe are not yet strong enough to reveal meaningful differences between full and dynamic sparse attention on GSM-Infinite. We will present a more rigorous evaluation in the next version of the manuscript.

As a reference, the average accuracy for ops <= 30 of Llama-3.1-8B-Instruct reported in the GSM-Infinite paper [1] is 21.86%.

[1] Zhou, Yang, et al. "GSM-Infinite: How Do Your LLMs Behave over Infinitely Increasing Context Length and Reasoning Complexity?." arXiv preprint arXiv:2502.05252 (2025).

Comment

I have increased my score to 5.

Review
Rating: 5

This paper introduces MoBA, a sparse attention mechanism for long-context LLMs. It partitions the input into blocks and lets each attention head dynamically select top‑k relevant blocks using mean-pooled key vectors, while ensuring causal masking. The method is fully compatible with standard Transformer architectures and shows strong results up to 1M-token contexts.
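For illustration, here is a minimal PyTorch-style sketch of the block-routing step described above (an assumed re-implementation for exposition only; the function moba_block_gating and its tensor shapes are hypothetical, and the authors' released flash_attn_varlen-based implementation is considerably more involved):

```python
import torch

def moba_block_gating(q, k, block_size, top_k):
    # q, k: (T, d) query / key vectors of one attention head over a length-T sequence.
    # Returns a (T, num_blocks) boolean mask: which KV blocks each query routes to.
    T, d = q.shape
    num_blocks = (T + block_size - 1) // block_size

    # One representative vector per block: mean-pooled keys.
    pooled = torch.stack([
        k[b * block_size:(b + 1) * block_size].mean(dim=0)
        for b in range(num_blocks)
    ])                                                    # (num_blocks, d)

    # Query-to-block affinity scores.
    scores = q @ pooled.T                                 # (T, num_blocks)

    # Causality: a query must not route to blocks that lie entirely in its future.
    q_block = torch.arange(T) // block_size               # block index of each query
    future = torch.arange(num_blocks)[None, :] > q_block[:, None]
    scores = scores.masked_fill(future, float("-inf"))

    # Top-k gating over past/current blocks; the query's own block is always kept.
    mask = torch.zeros(T, num_blocks, dtype=torch.bool)
    topk_idx = scores.topk(min(top_k, num_blocks), dim=-1).indices
    mask.scatter_(1, topk_idx, True)
    mask &= ~future                                       # discard any -inf picks
    mask[torch.arange(T), q_block] = True                 # current block is always attended
    return mask
```

Attention would then be computed only over the keys/values of the selected blocks, with the usual causal mask applied inside the query's own block.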

Strengths and Weaknesses

Strengths:

  1. The proposed method is novel and elegant. Its per-head, parameter-free routing is simple yet effective, avoiding complex structures while retaining flexibility.

  2. Experiments are comprehensive and well-executed. MoBA performs comparably to full attention while offering large speedups, and ablations support the design choices.

  3. Strong scaling to long contexts. Even at a million-token length, it maintains stable loss and efficiency, which is rare for sparse attention methods.

Weaknesses:

  • The use of mean pooling for block representation may dilute important token-level signals and lead to suboptimal routing. This is a minor suggestion, however, and does not affect my overall positive assessment of the paper.

Questions

See weaknesses.

Limitations

Yes

Final Justification

The response addresses my concern, and I maintain my positive rating.

Formatting Concerns

No paper formatting concerns.

Author Response

Thank you for your positive feedback and insightful questions! Below we clarify the main concerns.

(W1) Regarding the use of mean pooling for block representation:

Besides mean pooling, we did ablate another strategy --- in addition to the top blocks selected by the mean-pooling gate, we also select top blocks according to max pooling and min pooling, respectively. This does lead to slightly better performance, implying that the design of a learnable and more representative block representation is a promising direction for future research on trainable dynamic sparse attention. We will add more details in the next version of our manuscript.
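As a rough sketch of this ablation (assuming the same query-to-block scoring as the main method; select_blocks_multi_pool and its shapes are hypothetical, not the authors' implementation):

```python
import torch

def select_blocks_multi_pool(q, k_blocks, top_k):
    """Union of the top-k blocks chosen by mean-, max-, and min-pooled block keys.
    q: (d,) a single query vector; k_blocks: list of (block_len, d) key tensors.
    Illustrative only: the main method uses the mean-pooled gate alone."""
    reps = {
        "mean": torch.stack([blk.mean(dim=0) for blk in k_blocks]),
        "max":  torch.stack([blk.max(dim=0).values for blk in k_blocks]),
        "min":  torch.stack([blk.min(dim=0).values for blk in k_blocks]),
    }
    selected = set()
    for rep in reps.values():                  # each rep: (num_blocks, d)
        scores = rep @ q                       # query-to-block affinity
        k_eff = min(top_k, len(k_blocks))
        selected.update(scores.topk(k_eff).indices.tolist())
    return sorted(selected)                    # block indices to attend to
```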

Comment

Thanks for your reply. I will maintain my positive evaluation for this paper.

Review
Rating: 6

This work presents Mixture of Block Attention (MoBA), an approach that applies the top-k gating mechanism of Mixture of Experts (MoE) to attention. The targeted scenario is long-context Large Language Models (LLMs), where the attention mechanism requires each query token to attend to all key and value tokens, causing quadratic time complexity. MoBA partitions the context into blocks and lets the model learn which blocks a query token should attend to via the top-k gating mechanism, along with specific designs to preserve causality in auto-regressive LLMs. Integration with FlashAttention is implemented to achieve high efficiency. The experimental results validate MoBA's effectiveness at preserving good long-context ability while configured with considerable sparsity.

Strengths and Weaknesses

Strengths:

(S1) Timely and important topic. The proposed technique is also well motivated and well designed.

(S2) The in-depth discussion about additional design choices, including the granularity of block partitioning and the layer-wise integration with full attention, is a joy to read. This shows that MoBA is able to generalize to a broader range of settings.

(S3) The experimental results are promising.


Weaknesses:

(W1) Limited discussion of related work on sparse LLM inference. In particular, QUEST [12] has quite a similar idea to MoBA. In my humble opinion, MoBA seems more generalizable and can be integrated with training. However, such differences should be discussed/compared in detail.

(W2) When assessing efficiency, only the attention mechanism is evaluated. Although the other layers, such as the FFN, are unaltered, it would be helpful to see how the proposed technique contributes to end-to-end efficiency. Additionally, it would be nice to see a figure demonstrating the Pareto front of MoBA in terms of efficiency and long-context capability.

(W3) Minor: The models trained from scratch using MoBA are relatively small (from 545M to 2.1B), and those trained with SFT are not so-called “large” (8B).

Questions

Please address the weaknesses:

  • discuss/compare with QUEST;
  • provide end-to-end efficiency numbers;
  • provide a Pareto front in terms of efficiency and long-context capability.

Limitations

Yes

Final Justification

My concerns have been addressed. I suggest adding the discussion and results into the paper.

Formatting Concerns

  • Please fix the quotation marks in line 108.
  • Please use more readable font sizes in figures.

Author Response

Thank you for your positive feedback and insightful questions! Below we clarify the four main concerns.

(W1 and Q1) Regarding discussion about related works on sparse LLM inference (in particular QUEST [12]):

QUEST pioneered inference-time dynamic sparse attention and also employs top-k block selection. The key difference is that QUEST uses a more sophisticated block representation that combines max and min pooling, whereas MoBA currently uses simple mean pooling.

To further answer your question, we ablated adding max- and min-pooled block representations on top of mean pooling and observed a small but consistent LM loss improvement, confirming that richer and learnable block representations are promising. We will add more details in the next version of our manuscript.

(W2 and Q2) Regarding end-to-end efficiency:

The end-to-end efficiency of Llama-8B-1M-MoBA and Llama-8B-1M-Full can be found in the following table:

  Context   Full Prefill Time (s)   MoBA Prefill Time (s)   Speed-up
  256 K     5.67                    3.55                    1.60×
  512 K     19.59                   8.40                    2.33×
  1 M       73.97                   22.15                   3.34×

Note that Llama-8B-1M-MoBA is a hybrid model in which the last 3 layers use full attention.

(W3) Regarding scaling law:

We appreciate the valuable suggestion. Limited by computing resources, the largest model we trained for the scaling-law experiments is 2.1B, and we agree that training larger models would help build a better understanding of MoBA.

(Q3) Regarding the Pareto front in terms of efficiency and long-context capability:

We investigate the Pareto frontier by studying the relationship between attention sparsity and LM loss. We pre-train a model with 822M non-embedding parameters at a 32K context length. We divide the 32K context into different numbers of blocks (ranging from 4 to 512) and let MoBA select a fixed top-k of 3 blocks. In our experiments, a sparsity of around 3/32 (i.e., each query attends to roughly 9.4% of the context) turns out to be a good balance between efficiency (sparsity) and long-context capability. We will add more details in the next version of our manuscript.

  Number of Blocks   4      8      32     128    512
  LM loss            2.48   2.48   2.47   2.53   2.60

Comment

The rebuttal has addressed my concerns. I will keep my review positive.

Review
Rating: 4

This paper introduces MoBA, a novel sparse attention architecture inspired by the MoE mechanism. MoBA dynamically selects the most relevant key blocks for each query token, enabling the model to maintain sparsity without losing critical information. The authors implement MoBA using FlashAttention, which significantly improves its computational efficiency in practice. Experimental results further demonstrate that MoBA achieves performance comparable to full attention on both general language tasks and long-context benchmarks.

Strengths and Weaknesses

Strengths:

Very solid paper. The paper is clearly written. The method is designed in a highly intuitive and natural way, without relying on overly complex or engineering-heavy mechanisms. The experiments demonstrate the strong applicability of the proposed method.

Weaknesses:

The experiments lack comparison with other dynamic sparse attention baseline methods. From the paper, I cannot tell whether MoBA outperforms related work in terms of performance or efficiency.

It also lacks ablation studies on the configuration of hyperparameters, e.g., the configuration of hybrid attention. As a result, one cannot tell how to choose the hyperparameters in their own settings, or whether the hyperparameters are sensitive.

Questions

  1. I understand that the whole recipe for post-training a long-context LLM is expensive, but I have a question that I cannot infer from the existing experiments: Why do you choose hybrid attention with the last 3 layers using full attention? Is this configuration only applicable to the Llama family, or to all model families? If my concerns about the experiments are resolved, I am willing to raise my score.

  2. Just curious, the primary purpose of this method is to improve the efficiency of attention during training and generation, but it does not reduce the memory footprint of the KV cache, right?

Limitations

N/A

Final Justification

Since the authors did not provide experimental results to address my concerns, but only stated that they would include them in a future manuscript, I will maintain my score. However, as I am not particularly familiar with this field, please primarily refer to the opinions of the other reviewers.

Formatting Concerns

N/A

Author Response

Thank you for your positive feedback and insightful questions! Below we clarify the three main concerns.

(W1) Regarding MoBA vs other dynamic sparse attention baseline methods:

Although we do not include a comparison with other dynamic sparse attention methods (e.g., DeepSeek's NSA) in our manuscript, we would like to refer to an independent benchmark by Tilderesearch ("Sparsity is Cool", Dhruv Pai et al. [1]), which rigorously compares MoBA with DeepSeek's NSA on efficiency, perplexity, and attention patterns. We will cite and discuss this report in the next version of our manuscript.

[1] Dhruv Pai, Timor Averbuch, Mason Wang, Ben Keigwin. "Sparsity is Cool". Tilderesearch Tech blog.

(W2 and Q1) Regarding hyperparameters, e.g., the configuration of hybrid attention:

We briefly discuss the motivation for switching the last layers to full attention in Section 3.2 (Lines 222-232). In summary, we hypothesize that replacing the last few MoBA layers with full attention during SFT alleviates the sparse-gradient problem (especially for long-prefill tasks) caused by loss masking, enabling gradients from the unmasked tokens to propagate along the entire sequence and thus achieving a better SFT loss.

To further answer your question, we ablated another layer-hybrid strategy during SFT --- switching the first, the middle, and the last layer from MoBA to full attention --- and observed a worse SFT loss compared to switching the last 3 layers to full attention. We will add more ablation studies in the next version of our manuscript.
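As a sketch, the layer-hybrid strategies compared here can be expressed as a simple per-layer schedule (the function hybrid_layer_schedule and the strategy names are hypothetical, for illustration only):

```python
def hybrid_layer_schedule(num_layers, strategy="last3"):
    """Return, per layer, whether to use full attention ("full") or MoBA ("moba").
    Illustrative sketch of the hybrid strategies discussed above:
      - "last3": the last three layers use full attention (the SFT configuration);
      - "first_mid_last": first, middle, and last layers use full attention (ablated variant)."""
    kinds = ["moba"] * num_layers
    if strategy == "last3":
        for i in range(max(0, num_layers - 3), num_layers):
            kinds[i] = "full"
    elif strategy == "first_mid_last":
        for i in (0, num_layers // 2, num_layers - 1):
            kinds[i] = "full"
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return kinds

# e.g. hybrid_layer_schedule(32, "last3") marks layers 29-31 as full attention.
```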

(Q2) Regarding KV-Cache:

MoBA does not reduce the KV cache; it only reduces the compute cost of the attention computation. The memory footprint remains identical to full attention.

Comment

Since the authors did not provide experimental results to address my concerns, but only stated that they would include them in a future manuscript, I will maintain my score. However, as I am not particularly familiar with this field, please primarily refer to the opinions of the other reviewers.

Final Decision

The paper presents MoBA, a novel approach for scaling large language models (LLMs) to handle long contexts efficiently. MoBA employs a Mixture of Block Attention mechanism, which allows the model to dynamically select the most relevant key blocks for each query token, maintaining sparsity without losing critical information. This approach is designed to adhere to the 'less structure' principle, enabling the model to learn where to focus attention during training. The paper also highlights the method's ability to transition between full and sparse attention, offering a key advantage in efficiency. Experimental results show that MoBA performs comparably to full attention on both general language tasks and long-context benchmarks, and it has already been deployed in production workloads with long-context requirements. The authors address reviewer concerns by providing additional comparisons with other dynamic sparse attention methods, clarifying the configuration of hybrid attention, and discussing the trade-off between efficiency and long-context capability.