Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More
We propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) using a decoder-only Transformer.
Abstract
Review and Discussion
The paper introduces a simple modification to the standard causal next token prediction training framework of decoder-only LLMs, by masking a random fraction of the input tokens. After masking, the objective is still next token prediction. Via a set of experiments, the authors show the effectiveness of this paradigm, in particular in long context tasks with scarce relevant information. The authors hypothesize that by masking a fraction of the input tokens, the model is encouraged to sharpen its attention to distinguish between relevant and irrelevant parts of the context. To explore this hypothesis, they analyze the attention distribution and show that models trained with their technique place more attention weight on relevant parts while overall increasing the variance over unmasked tokens. This suggests that indeed the model distinguishes better between relevant and irrelevant parts.
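The recipe the reviewer summarizes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the `MASK_ID` value is a hypothetical token id, and the 15% ratio follows the pre-training masking ratio mentioned later in the discussion:

```python
import random

MASK_ID = 0          # hypothetical [MASK] token id (an assumption)
MASK_RATIO = 0.15    # pre-training masking ratio discussed in the thread

def meap_mask(tokens, ratio=MASK_RATIO, seed=None):
    """Randomly replace a fraction of input tokens with MASK_ID.

    The targets remain the original (unmasked) next tokens, so the
    objective is still plain next-token prediction; the model merely
    sees a partially corrupted context.
    """
    rng = random.Random(seed)
    n_mask = max(1, int(len(tokens) * ratio))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    inputs = masked[:-1]   # standard next-token shift
    targets = tokens[1:]   # loss computed against the original tokens
    return inputs, targets
```

Unlike BERT-style MLM, the loss here still covers every position of the shifted sequence, which is consistent with the claim that the method adds no extra training objective or computational overhead during pre-training.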
Questions for the Authors
Nothing to add.
Claims and Evidence
The claims are well supported by experiments.
Methods and Evaluation Criteria
The experimental setup is appropriate for the problem and results are convincing.
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analysis
The experimental setup is appropriate for the problem and results are convincing.
Supplementary Material
The paper is self-contained. I didn't see a need to review the supplemental material.
Relation to Prior Work
The idea is a simple modification of the standard LLM training framework. There are previous works on combining training techniques, including token masking, but I am not aware of a published work that does what this paper suggests.
Missing Essential References
Not that I am aware of.
Other Strengths and Weaknesses
The paper is well written and easy to read. The idea is simple, which is a plus, yet surprisingly effective.
Other Comments or Suggestions
Table 2: the drop in NTP performance when moving from 40B to 60B tokens is suspicious. Can you please explain?
We sincerely appreciate your valuable review. We're glad you found our paper to be self-contained and our results convincing.
Q1: Table 2: the drop in NTP performance when moving from 40B to 60B tokens is suspicious. Can you please explain?
Thank you for your sharp observation. One plausible explanation is that extended training with NTP might cause the model to progressively lose sensitivity to infrequent data points. Rare examples become increasingly diluted, reducing their influence on the model’s parameters. As NTP training progresses, earlier exposure to rare instances may be forgotten or down-weighted due to the overwhelming presence of more common patterns.
However, training on sufficiently more data (e.g., 200B tokens) can help address this issue by ensuring that rare instances are represented with enough frequency and variety to leave a stronger imprint on the model. With more examples—both common and rare—the model has more opportunities to generalize from diverse patterns and avoid overfitting to dominant trends.
In contrast, MEAP, which explicitly masks tokens and predicts them, tends to emphasize different parts of the input, including rare tokens, making it more naturally sensitive to infrequent phenomena. Still, a sufficiently robust dataset can help NTP models maintain a better balance across the frequency spectrum.
The authors propose Mask-Enhanced Autoregressive Prediction (MEAP), a novel training paradigm integrating Masked Language Modeling (MLM) into the traditional Next-Token Prediction (NTP) objective. The key idea is randomly masking a fraction of input tokens during autoregressive pre-training, improving key information retrieval and long-context reasoning in decoder-only Transformers. The authors claim substantial improvements in benchmarks such as Needle-in-a-Haystack and Multi-Document QA without incurring additional computational overhead. MEAP is argued to improve attention differentiation by reducing unnecessary context attention, achieving significant performance gains over baseline NTP-trained models.
update after rebuttal
I would like to thank the authors and the other reviewers for the discussion. I increased my score to accept, as explained in my comments.
Questions for the Authors
- Did you train both the NTP and MEAP models in the reported experiments, or did you use a pre-trained NTP model and train only the MEAP model yourself? The comparison would be problematic in the latter case.
- Why was Deepseek-V3 selected specifically as the hallucination detector? Did you explore other models or methods for hallucination evaluation? Clarification on this would affect confidence in your hallucination claims.
- Have you evaluated the robustness of MEAP across drastically different language domains (e.g., highly technical versus colloquial language) to justify its broader applicability?
- How sensitive is MEAP's performance to different types of masking strategies (e.g., structured masking, linguistic units versus random masking)?
Claims and Evidence
- The primary claim—that MEAP significantly improves key information retrieval and long-context reasoning—is strongly supported by experimental results on multiple datasets (e.g., Needle in a Haystack and Multi-Document QA).
- The claim regarding the reduced hallucination rate in summarization tasks (Table 4) is less convincingly evidenced, since hallucination is assessed indirectly by another LLM (Deepseek-V3). This indirect evaluation introduces potential biases stemming from the correctness and robustness of Deepseek-V3 itself. Introducing a synthetic dataset with verifiable answers would strengthen the claim.
- The claim that MEAP's improvements result from enhanced attention distinguishability is theoretically plausible but lacks detailed analysis across diverse experimental conditions, particularly variations in masking strategies.
- The claim that MEAP achieves "similar performance with only 60B tokens compared to NTP's 200B" is striking, but the experiment's control conditions are insufficiently described. It remains unclear whether other hyperparameters were optimally adjusted to fully exploit NTP's potential.
Methods and Evaluation Criteria
- The evaluation methods used (Needle-in-a-Haystack, MDQA, and various reasoning benchmarks) are appropriate and well-selected to demonstrate MEAP's strengths.
- The use of Deepseek-V3 as the hallucination detector for summarization datasets introduces potential bias, as this model might have intrinsic limitations and inaccuracies.
Theoretical Claims
The paper does not contain formal theoretical proofs.
Experimental Design and Analysis
- Lack of ablations on model components - The experimental setup lacks essential ablation experiments on key design decisions, such as the masking percentage and the choice of masking during pre-training versus fine-tuning. Although Table 7 presents results on masking ratios, it is very brief and lacks deeper insight into why different mask ratios significantly affect performance. Experimenting with various masking strategies would also shed light on the method's effectiveness.
- Single model size for fine-tuning - The fine-tuning experiments only evaluate the 8B Llama-3 variant, raising concerns about generalizability across other sizes and architectures.
Supplementary Material
I have reviewed the supplementary material; it is detailed and generally clear.
Relation to Prior Work
The authors position their work well against key prior work, clearly highlighting distinctions from pure MLM models (BERT, RoBERTa, XLNet), pure NTP models (GPT, LLaMA), and unified paradigms (UniLM, UL2). The integration of masking into autoregressive prediction without additional computational overhead is a well-contextualized contribution within this literature landscape.
Missing Essential References
- "Sparse and Continuous Attention Mechanisms" (Martins et al., 2022) on sparse attention mechanisms, which implicitly relate to the paper's core mechanism (reducing attention to irrelevant tokens).
- "Needle in the Haystack for Memory Based Large Language Models" (Nelson et al., 2024)
Other Strengths and Weaknesses
Strengths:
- Creative integration of MLM into decoder-only Transformers, maintaining computational simplicity.
- Strong empirical demonstration of improved performance across multiple tasks.
- Clearly presented motivation and overall straightforward implementation.
Weaknesses:
- Insufficient theoretical justification or rigorous analysis behind the chosen masking ratio (15% pre-training, 10% fine-tuning), despite its significant impact.
- Attention score variance and decay analysis (Section 5.1) is presented briefly without rigorous statistical backing or detailed statistical analysis of robustness.
Other Comments or Suggestions
- Table formatting is generally clear but lacks sufficient explanation in the captions—especially how percentages are computed.
- The paper is generally well-written but suffers from redundancy, notably between the abstract and introduction. The introduction could be more concise.
We sincerely thank you for your valuable comments!
Q1: Variations in masking strategies.
We added three distinct masking strategies as you suggested: Random Masking, 5-Span Masking (spanning 5 consecutive tokens), and 50-Span Masking, using a 0.3B parameter model pre-trained on 5B tokens. While they achieve comparable performance on commonsense reasoning, random masking achieves the best results on the Multi-Document QA task, possibly because its unpredictable mask encourages the model to develop more robust attention mechanisms across the entire context window.
| Method | ARC-c | ARC-e | BoolQ | PIQA | HellaSwag | WinoGrande | OBQA | Average | MDQA |
|---|---|---|---|---|---|---|---|---|---|
| MEAP Random Mask | 21.84 | 35.44 | 61.25 | 61.04 | 29.50 | 51.46 | 27.40 | 41.13 | 0.218 |
| MEAP 5-Span Mask | 21.42 | 35.61 | 60.40 | 62.08 | 29.81 | 51.07 | 27.60 | 41.14 | 0.168 |
| MEAP 50-Span Mask | 23.46 | 36.20 | 59.54 | 62.84 | 30.43 | 50.99 | 28.00 | 41.64 | 0.189 |
| NTP | 18.00 | 37.75 | 58.44 | 62.62 | 28.56 | 50.67 | 13.60 | 40.09 | 0.187 |
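For reference, the span variants in the table above can be sketched as follows. This is a hedged illustration only; the exact span-sampling procedure used in the rebuttal experiments is not specified, and `MASK_ID` is a hypothetical token id:

```python
import random

MASK_ID = 0  # hypothetical mask token id (an assumption)

def span_mask(tokens, ratio=0.15, span_len=5, seed=None):
    """Mask contiguous spans of `span_len` consecutive tokens until
    roughly `ratio` of the sequence is covered. With span_len=1 this
    reduces to random token masking."""
    rng = random.Random(seed)
    masked = list(tokens)
    target = int(len(tokens) * ratio)
    covered = set()
    while len(covered) < target:
        start = rng.randrange(0, max(1, len(tokens) - span_len + 1))
        for p in range(start, min(start + span_len, len(tokens))):
            covered.add(p)
            masked[p] = MASK_ID
    return masked
```

One plausible reading of the MDQA result is that contiguous spans make the mask positions locally predictable, whereas scattered random masks force the model to attend broadly across the context.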
Q2: Did you train both the NTP and MEAP models in the reported experiments?
Yes, we trained both NTP and MEAP models in identical experimental settings. Both models used the same datasets, training steps, and hyperparameters, with the only difference being the masking mechanism in MEAP. This ensures fair comparison and reliability of our results.
Q3: Deepseek-V3 as the hallucination evaluator?
We followed the setting of Diff Transformer, using LLMs as judges. While the limited context window does not allow us to design rigorous synthetic datasets, we added two more LLMs as judges to show the robustness of our evaluation.
| Method (Judge) | XSum | MultiNews | WikiSum |
|---|---|---|---|
| NTP (Deepseek-V3) | 0.09 | 0.17 | 0.24 |
| MEAP (Deepseek-V3) | 0.13 | 0.19 | 0.33 |
| NTP (Qwen-Plus) | 0.16 | 0.11 | 0.21 |
| MEAP (Qwen-Plus) | 0.19 | 0.14 | 0.27 |
| NTP (GPT-4o) | 0.14 | 0.10 | 0.19 |
| MEAP (GPT-4o) | 0.16 | 0.13 | 0.24 |
Q4: Different model sizes for fine-tuning.
We added fine-tuning results on various pre-trained LLMs. While MEAP achieves on-par or better results than NTP on commonsense reasoning, it consistently outperforms NTP on multi-document QA tasks, exhibiting consistent improvements across all tested model architectures and sizes.
| Model | Method | ARC-c | ARC-e | BoolQ | PIQA | HellaSwag | WinoGrande | OBQA | Average |
|---|---|---|---|---|---|---|---|---|---|
| Llama-3.2-3B | NTP | 47.95 | 69.07 | 75.54 | 76.50 | 72.43 | 64.33 | 44.40 | 64.32 |
| Llama-3.2-3B | MEAP | 49.32 | 73.06 | 71.80 | 77.53 | 74.26 | 68.51 | 44.60 | 65.58 |
| Qwen2.5-14B | NTP | 53.67 | 74.71 | 86.73 | 77.64 | 78.44 | 68.19 | 48.00 | 69.63 |
| Qwen2.5-14B | MEAP | 56.83 | 79.38 | 87.37 | 79.33 | 79.37 | 72.69 | 47.40 | 71.77 |
| Mistral-7B-0.2 | NTP | 35.67 | 60.10 | 75.81 | 71.22 | 63.03 | 61.40 | 35.40 | 57.52 |
| Mistral-7B-0.2 | MEAP | 37.20 | 59.18 | 72.63 | 73.50 | 64.08 | 61.17 | 35.60 | 57.62 |
On the multi-document QA task (20-document setting; the columns indicate the position of the gold document):
| Model | Method | 1 | 5 | 10 | 15 | 20 | Average |
|---|---|---|---|---|---|---|---|
| Llama3.2-3B | NTP | 13.60 | 12.09 | 12.54 | 12.69 | 14.35 | 13.05 |
| Llama3.2-3B | MEAP | 23.47 | 20.34 | 20.38 | 21.96 | 23.65 | 21.96 |
| Mistral-7B-0.2 | NTP | 36.96 | 30.55 | 27.82 | 27.55 | 38.79 | 32.33 |
| Mistral-7B-0.2 | MEAP | 37.91 | 32.98 | 31.46 | 32.22 | 43.45 | 35.60 |
| Qwen2.5-14B | NTP | 60.00 | 51.98 | 56.01 | 56.05 | 63.39 | 57.49 |
| Qwen2.5-14B | MEAP | 61.69 | 53.71 | 57.21 | 56.65 | 66.29 | 59.11 |
Q5: How robust is MEAP across different language domains?
We verified cross-domain robustness by training on GSM8K and testing on MathQA. MEAP achieved a 9.2% improvement over NTP, demonstrating that MEAP's gain also robustly holds in the math domain.
| Method | MathQA |
|---|---|
| NTP | 28.0 |
| MEAP | 30.6 |
Q6: Attention score variance and decay analysis (Section 5.1) is presented briefly without rigorous statistical backing.
To evaluate the statistical significance of our analysis, we sampled 10 different question-answering examples, compared the attention distributions of NTP and MEAP during inference, and computed t-statistics and p-values. All tests reached statistical significance, confirming that MEAP produces systematic changes in the attention distribution.
| Sequence Length | Metric | Value | T-Statistic | P-Value |
|---|---|---|---|---|
| 1024 | Attention Score Decay | 34.08% | -25.71 | <0.000001 |
| 1024 | Attention Variance Increase | 12.66% | 12.26 | <0.000001 |
| 4096 | Attention Score Decay | 53.34% | -9.97 | <0.000001 |
| 4096 | Attention Variance Increase | 7.80% | 5.22 | <0.000001 |
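The rebuttal does not state which test statistic was used; assuming a standard Welch two-sample t-test on the per-example measurements, the computation would look like this:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic for two independent samples, e.g. the
    per-example attention-variance measurements of MEAP vs. NTP
    (a sketch; the authors' exact procedure is not specified)."""
    va, vb = variance(a), variance(b)  # sample variances
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))
```

With 10 examples per group, the p-value would then come from a t-distribution with Welch-Satterthwaite degrees of freedom, e.g. via `scipy.stats.ttest_ind(a, b, equal_var=False)`.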
Q7: Essential references missing and redundancy issues.
We thank you for pointing out Martins et al. (2022) on sparse attention mechanisms and Nelson et al. (2024) on information retrieval. We will add a detailed discussion in our revision. We will also simplify the introduction to highlight the key innovations and remove unnecessary repetition.
Thanks for the detailed rebuttal. I increased the score to reflect the impact of the changes on the manuscript.
This paper introduces Mask-Enhanced Autoregressive Prediction (MEAP). In particular, MEAP incorporates the masked language modeling technique into the next-token prediction setting by randomly masking out a small portion of input tokens and training the model with standard next-token prediction. This method is applied to both pretraining and finetuning. The authors conduct experiments on long-context information retrieval benchmarks (NIAH and MDQA) and a long-context reasoning benchmark (M-RS) and observe performance gains over NTP training. The authors also show the effectiveness of MEAP by analyzing the attention scores.
Questions for the Authors
- What is the intuition behind copying the input again in the fine-tuning case? What if you just fine-tune as in the pre-training case?
- Line 299, left column: "Only the masked tokens are predicted during fine-tuning." What does this mean? It does not seem to match the formula on Line 122, right column.
Claims and Evidence
Claim: MEAP improves performance in key-information retrieval and long-context modeling
The authors evaluate MEAP's performance on NIAH and MDQA and show improvement over the NTP approach, demonstrating that the key-information retrieval capability is improved with MEAP. For the long-context reasoning task, the authors conduct experiments on the Multi-Needle Reasoning Task. I would also recommend that the authors evaluate on HELMET [1]. HELMET provides a comprehensive evaluation of long-context tasks, and I wonder whether MEAP is better than NTP on all types of long-context tasks, or whether there is any type of long-context task on which MEAP does not improve much.
Claim: The effectiveness of MEAP arises from its capability to promote more distinguishable attention scores
The author shows an analysis in Figure 6 that shows MEAP effectively encourages the model to assign higher attention score on the answer part.
[1] Yen, Howard, et al. "Helmet: How to evaluate long-context language models effectively and thoroughly." arXiv preprint arXiv:2410.02694 (2024).
Methods and Evaluation Criteria
The method and evaluation method makes sense. As mentioned in the section above, I also recommend evaluating on the HELMET benchmark.
Theoretical Claims
N/A
Experimental Design and Analysis
(See Claims And Evidence section)
Supplementary Material
The supplementary material contains the code for this work.
Relation to Prior Work
N/A
Missing Essential References
N/A
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Q1: Intuition behind copying input for fine-tuning
The duplication approach addresses a critical constraint in supervised fine-tuning: SFT answers are often extremely short (typically 5-15 tokens). With such limited tokens, conventional masking would risk fragmenting the semantic coherence of these responses. Even masking 15% of a 10-token answer could remove key semantic components. By providing a duplicated sequence, we ensure semantic preservation while still enforcing the masking mechanism that drives attention specialization. Our experiments with single-sequence masking during fine-tuning (see "Single Mask" row in the fine-tuning results table) demonstrate substantially degraded performance across all tasks, confirming this design decision's importance.
Performance comparison on various commonsense reasoning tasks. Results are measured by fine-tuning with Llama-3-8B.
| Method | ARC-c | ARC-e | BoolQ | PIQA | HellaSwag | WinoGrande | OBQA | Average |
|---|---|---|---|---|---|---|---|---|
| NTP | 53.58 | 81.10 | 83.98 | 79.27 | 62.74 | 72.06 | 39.40 | 67.30 |
| MEAP | 55.12 | 83.21 | 83.82 | 81.01 | 63.31 | 74.27 | 38.20 | 68.42 |
| Single Mask | 44.03 | 73.95 | 80.89 | 77.86 | 54.55 | 66.06 | 35.00 | 61.76 |
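The dual-sequence construction described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code; `MASK_ID`, the `IGNORE` label convention, and the 10% fine-tuning mask ratio are assumptions drawn from the surrounding discussion:

```python
import random

MASK_ID = 0    # hypothetical mask token id
IGNORE = -100  # label value skipped by typical cross-entropy losses

def build_meap_sft_example(tokens, ratio=0.10, seed=None):
    """Build the dual-sequence fine-tuning input: the first copy is
    trained with standard NTP, while the masked second copy
    contributes loss only at the masked positions."""
    rng = random.Random(seed)
    n = len(tokens)
    mask_pos = set(rng.sample(range(n), max(1, int(n * ratio))))
    second = [MASK_ID if i in mask_pos else t for i, t in enumerate(tokens)]
    inputs = tokens + second

    # labels[i] is the target when predicting inputs[i+1]
    labels = []
    for j in range(1, 2 * n):
        if j < n:                # first copy: standard NTP loss
            labels.append(tokens[j])
        else:                    # second copy: loss only on masked tokens
            i = j - n
            labels.append(tokens[i] if i in mask_pos else IGNORE)
    return inputs, labels
```

The "Single Mask" row in the table above would correspond to masking the one and only sequence directly, without the clean first copy, which removes the semantic-preservation benefit the authors describe.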
Q2: Clarification on Line 299
Thank you for pointing out this inconsistency. The statement 'Only the masked tokens are predicted during fine-tuning' indeed does not align with the formula on Line 122. We will revise it as follows in the final manuscript.
where the second sequence is a copy of the original sequence. The cross-entropy loss is computed over the subset of answer tokens and masked tokens.
Q3: HELMET Benchmark Results
Following your recommendation, we evaluated MEAP on the HELMET benchmark. The results strongly support our claims:
| Model | HotpotQA (RougeL Recall) | Natural Questions (RougeL Recall) | PopQA (RougeL Recall) | TriviaQA (RougeL Recall) | MS MARCO (NDCG@5) |
|---|---|---|---|---|---|
| MEAP-1.1B | 7.92 | 6.94 | 22.21 | 17.22 | 18.38 |
| NTP-1.1B | 3.43 | 5.75 | 6.31 | 8.00 | 0.0 |
| Improvement | 4.49 | 1.19 | 15.9 | 9.22 | 18.38 |
While HELMET provides comprehensive evaluation, many of its test cases require context lengths up to 130K tokens, which exceeds our model's current context capacity. We evaluated our approach on the subset of HELMET tests compatible with our model's context window. These results demonstrate that MEAP delivers consistent improvements across diverse question-answering tasks, with particularly substantial gains on challenging information retrieval benchmarks like PopQA (+15.9) and MS MARCO (+18.38). We will clarify this limitation in our revised manuscript and properly cite the HELMET benchmark as recommended.
Thanks for clarifying my questions. I have some follow-up questions on the finetuning method for MEAP:
- In the loss you show in your rebuttal, why is the loss calculated on the masked tokens?
- For finetuning, what will happen if you just randomly mask the question part and do not mask the response part?
- For finetuning, if you repeat the input as on Line 122, wouldn't that make training less efficient? For example, in NTP a single forward pass calculates the loss over the whole sequence, but if you repeat the input, you can only calculate the loss once per pass, because during the first repeat you don't want the token being predicted to be seen. If so, I think this should be mentioned in the paper. If not, can you better explain, or point me to the code that implements this (how you do the forward pass and calculate the loss)?
Q1: Why is loss calculated on the masked token?
A1: For fine-tuning, we duplicate the input: the loss on the original copy is the standard NTP loss, while the loss on the duplicated copy is calculated only on the masked tokens. This duplication approach addresses a critical constraint in supervised fine-tuning: SFT answers are often extremely short (typically 5-15 tokens). With such limited tokens, conventional masking would risk fragmenting the semantic coherence of these responses. Even masking 15% of a 10-token answer could remove key semantic components.
By including masked tokens in the loss function, we force the model to develop better attention mechanisms to recover masked information. This promotes more robust context understanding since the model must learn to utilize surrounding context to predict these masked portions.
Q2: For finetuning, what will happen if you just randomly mask the question part and do not mask the response part?
A2: We added the experiment as you requested. Results are measured by fine-tuning Llama-3-8B on commonsense tasks. As shown, masking only the question part achieves worse performance than both NTP and MEAP.
| Method | ARC-c | ARC-e | BoolQ | PIQA | HellaSwag | WinoGrande | OBQA | Average |
|---|---|---|---|---|---|---|---|---|
| NTP | 53.58 | 81.10 | 83.98 | 79.27 | 62.74 | 72.06 | 39.40 | 67.30 |
| MEAP | 55.12 | 83.21 | 83.82 | 81.01 | 63.31 | 74.27 | 38.20 | 68.42 |
| Random Masking Questions | 48.98 | 78.32 | 83.18 | 80.14 | 58.63 | 70.48 | 33.60 | 66.05 |
Q3: Efficiency considerations for dual-sequence approach
A3: During fine-tuning, NTP calculates the loss over (T−1) tokens. In contrast, MEAP calculates the loss over the (T−1) tokens of the first repetition plus the masked tokens of the second repetition.
Repeating the input does increase the training overhead compared to standard NTP, as MEAP’s input sequence is longer. However, this overhead is effectively amortized by MEAP’s higher data utilization efficiency. As shown in Figure 5 of our submission, MEAP requires only 50% of the epochs to process a similar number of tokens, while still notably outperforming NTP. This demonstrates that MEAP’s training overhead is well offset by its effectiveness.
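As a concrete illustration of the counts above (a sketch; `mask_ratio` follows the 10% fine-tuning ratio mentioned in the discussion):

```python
def supervised_tokens(T, mask_ratio=0.10, meap=False):
    """Number of positions contributing to the loss per example:
    NTP trains on the T-1 next-token targets; dual-sequence MEAP
    adds loss terms only at the masked positions of the second copy."""
    base = T - 1                           # standard NTP targets
    if meap:
        return base + int(T * mask_ratio)  # plus masked positions
    return base
```

For T = 1024 and a 10% mask ratio, MEAP supervises 1125 positions per example versus 1023 for NTP, while the forward pass runs over a sequence twice as long; this is the overhead the authors argue is amortized by higher data utilization.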
Our implementation can be found in MEAP-SFT/MEAP-SFT.py, lines 479-568.
This paper integrates masked language modeling into next-token prediction for decoder-only Transformers. Reviewers all give positive scores, citing its clear motivation, simplicity, and strong empirical results demonstrating performance gains across multiple tasks. The creative yet computationally simple integration of MLM is also commended. However, several issues remain: the chosen masking ratios lack theoretical justification, the analysis of attention scores lacks statistical rigor, the claims about reduced summarization hallucinations are weakly supported, and the link between the improvements and attention distinguishability needs further exploration. Overall, the AC agrees this is an interesting, simple yet effective training paradigm for LLMs, and thus recommends acceptance.