PaperHub
Overall: 6.8/10 · Poster · 4 reviewers
Ratings: 4, 4, 4, 5 (min 4, max 5, std 0.4)
Confidence: 3.5 · Novelty: 2.5 · Quality: 2.0 · Clarity: 2.5 · Significance: 2.5
NeurIPS 2025

Rope to Nope and Back Again: A New Hybrid Attention Strategy

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

This paper analyzes existing methods for long-context modeling and compares various attention variants. It introduces a hybrid architecture that offers a more efficient and scalable solution with better performance for extended sequence processing.

Abstract

Keywords
LLM, Pretraining, Long Context, Hybrid Attention

Reviews and Discussion

Review
Rating: 4

This paper presents a new hybrid attention strategy combining Rotary Position Embedding (RoPE) and No Positional Embedding (NoPE) for long-context large language models (LLMs). The authors analyze existing attention mechanisms and highlight their limitations, then propose an architecture that alternates between RoPE and NoPE layers. This hybrid model outperforms traditional RoPE-based methods on long-context tasks and is efficient in both training speed and memory usage, providing a more effective solution for handling long input sequences.
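For orientation, here is a minimal sketch of how such a layer-wise assignment could look (our illustration only; the interleaving ratio, window size, and naming are assumptions, not the authors' implementation):

```python
# Minimal sketch of layer-wise RoPE/NoPE interleaving (illustrative only; the ratio,
# window size, and naming below are assumptions, not the authors' configuration).

def build_layer_config(num_layers: int, ratio: int = 3, window: int = 4096):
    """Assign each layer a position-embedding type and attention span.

    Every (ratio + 1)-th layer uses NoPE with full attention for long-range
    retrieval; the remaining layers use RoPE with sliding-window attention.
    """
    layers = []
    for i in range(num_layers):
        if i % (ratio + 1) == 0:
            layers.append({"pos_emb": "nope", "attention": "full"})
        else:
            layers.append({"pos_emb": "rope", "attention": f"sliding_window({window})"})
    return layers

if __name__ == "__main__":
    for idx, spec in enumerate(build_layer_config(num_layers=8)):
        print(idx, spec)
```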

Strengths and Weaknesses

Strengths

  1. The paper introduces a unique hybrid attention mechanism, combining RoPE and NoPE, offering an innovative solution for long-context modeling.
  2. It provides a clear comparison of existing attention mechanisms, making the analysis accessible and well-structured.
  3. The proposed model significantly improves performance and efficiency, making it a valuable contribution to the development of large language models.

Weaknesses

  1. The paper adjusts the model architecture but lacks an analysis of the training overhead introduced by these adjustments.
  2. The findings in this paper are similar to those in [1], but the paper does not explain the differences.
  3. The paper uses only RoPE as the baseline, lacking validation of the method's effectiveness; it could benefit from a comparison with p-RoPE [1].

[1] Round and Round We Go! What makes Rotary Positional Encodings useful?

Questions

See Weaknesses.

Limitations

yes

Justification for Final Rating

Thank you for the response. I have raised my score accordingly.

Formatting Issues

Table 1, Table 2, Figure 1, Table 3, and Table 4 exceed the page margins.

Author Response

Thank you for the valuable feedback. We address the comments as follows:

  1. Training overhead analysis: We discuss relevant aspects in Section 5.3, Impacts on Training and Inference. Our findings indicate that introducing hybrid attention can significantly accelerate both training and serving. Additional benefits arise when combining it with specialized kernels such as ring attention, which reduces communication overhead.

  2. Comparison with p-RoPE: Thank you for pointing this out. In the second paragraph of Section 3, we briefly mention variants that combine NoPE and RoPE along the head dimensions (e.g., GPT-J/NeoX). The p-RoPE work further analyzes RoPE, focusing on the impact of different frequency bands, and concludes that low and high frequencies focus on semantic and positional attention, respectively. This is similar to our finding that NoPE focuses more on information retrieval and RoPE more on recency bias. The difference is that p-RoPE removes low frequencies along the head dimensions (like the partial RoPE used in prior work), whereas we combine RoPE and NoPE across layers, which has more engineering benefits as pointed out in Section 5.3.

We are likely unable to conduct another pretraining run from scratch with this variant, as the resource requirements for this experiment are very demanding (similar to why we did not run NoPE + SWA). However, we will include a reference to the p-RoPE paper and expand our discussion of its relationship to our method in the same paragraph in the final version.

We appreciate your thoughtful feedback and look forward to further discussions to enhance our work.

Comment

Thank you for the authors' response. However, as the authors acknowledge in their rebuttal, the novelty of the findings remains limited when compared to [1]. Furthermore, the set of baselines used for the experimental comparison is not sufficiently comprehensive. Strengthening these aspects would significantly improve the paper. Given that the authors have not demonstrated a convincing distinction or provided further specific experimental results, I will maintain my score.

[1] Round and Round We Go! What makes Rotary Positional Encodings useful?

Comment

We would like to respectfully clarify that the claim—“the authors acknowledge in their rebuttal, the novelty of the findings remains limited when compared to [1]”—is not supported by anything stated in our previous responses. At no point did we concede that the novelty of our work is limited relative to [1], and such a characterization is inaccurate.

To clarify again the distinction between our work and [1] (as well as other partial RoPE approaches), we emphasize that our method introduces a layer-wise combination of NoPE and RoPE, which leads to two key contributions:

  1. Division of Labour: Our analysis shows that different layers naturally specialize in distinct roles—such as retrieval versus local aggregation—leading to an emergent functional separation (see Section 2) and the subsequent efficiency gains. To our knowledge, this insight has not been demonstrated in prior work. We also hope that our findings can inspire further discussion and exploration of this phenomenon within the research community.

  2. Efficiency Gains: While head-wise or dim-wise mixing of RoPE and NoPE can preserve performance, they do not yield improvements in training or inference efficiency over dense models. In contrast, our proposed RNoPE-SWA design achieves up to 4× speed-up and substantial KV-cache savings (see Section 5.3).

Comparing with [1] or other "partial RoPE" approaches: on the analysis side, [1] offers a valuable analysis of RoPE and NoPE characteristics and the role of frequency components. Our paper performs a similar analysis but focuses on layer-wise composition, which results in significant efficiency improvements over [1]. Moreover, our work includes full training from scratch of a substantial 8B model, further supporting the robustness and reproducibility of our approach, something not covered in [1].

We believe these points underscore the novelty and practical value of our contributions beyond what is explored in [1].

We agree that the relationship between “partial RoPE” and relevant prior work deserves further discussion. We appreciate this valuable point and will include an expanded discussion in the paper to better contextualize our approach.

[1] Round and Round We Go! What makes Rotary Positional Encodings useful?

Comment

Thank you for the detailed response. The observation that “different layers and heads naturally specialize in distinct roles” was also established in [1], which provides a comprehensive analysis of this phenomenon. Although the RNope method itself differs from the approach in [1], the analysis presented is largely analogous. Consequently, the methodological contribution seems limited, offering only incremental novelty.

[1] Round and Round We Go! What makes Rotary Positional Encodings useful?

Comment

We would like to clarify that in the statement “different layers and heads naturally specialize in distinct roles,” our emphasis is on the cross-layer combination of RoPE and NoPE, whereas [1] only explores their combination within each layer. This cross-layer integration is important not only from a mechanistic analysis perspective, but also because it enables substantial efficiency gains compared to [1] or any other model using purely dense attention.

For example, with an interleaving ratio of 3:1, our method achieves up to a 4× speedup and KV cache savings relative to a dense-attention counterpart. Moreover, our experiments show that it delivers significantly higher performance than the dense model while using 3–4× fewer FLOPs. This results in a “free lunch” phenomenon, where both efficiency and performance improve simultaneously. Our analysis also offers a plausible explanation for the underlying mechanisms of this phenomenon to a meaningful extent, which to the best of our knowledge, were largely omitted by other works on hybrid attentions. In comparison, the proposed approach in [1] provides no conspicuous efficiency gains on speed or memory usage.
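To make the scale of the claimed KV-cache savings concrete, here is a back-of-the-envelope sketch under assumed dimensions (32 layers, 128k context, 4k sliding window; these numbers are our assumptions, not the paper's exact configuration):

```python
# Rough KV-cache comparison between a dense model (every layer caches the full
# context) and a 3:1 interleaved model (only 1 in 4 layers is full attention; the
# sliding-window layers cache just the last `window` tokens). Numbers are assumed.

def kv_cache_slots(num_layers: int, context: int, full_layers: int, window: int):
    dense = num_layers * context
    hybrid = full_layers * context + (num_layers - full_layers) * min(window, context)
    return dense, hybrid

num_layers, context, window = 32, 131_072, 4_096
full_layers = num_layers // 4                      # 3:1 SWA-to-full interleaving

dense, hybrid = kv_cache_slots(num_layers, context, full_layers, window)
print(f"dense  : {dense:,} cached (key, value) positions")
print(f"hybrid : {hybrid:,} cached (key, value) positions")
print(f"saving : {dense / hybrid:.1f}x")           # ~3.7x at this context length
```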

We would also like to highlight that our paper contains additional analyses that we believe are important and valuable to the community. One notable example is our investigation of QK-Norm—its negative impact on long-context performance, and the underlying reason why this occurs. To our knowledge, this is the first time this widely used training technique has been formally examined in the context of long-context scenarios.

The potential significance of this finding can be illustrated with a concrete example:

  • The well-known Gemma 3 model adopts a similar architecture with interleaved global–local attention using solely RoPE, yet it underperforms on long-context evaluations such as RULER. Specifically, the Gemma 3 27B model scores 0.91 at 8k, but drops to 0.66 at 128k—a nearly 30% performance loss—while also lagging behind other established open-source models.

  • In contrast, RNoPE-SWA, despite using significantly less compute and being almost 4× smaller in size, shows only a ~17% drop on RULER QA and ~7% drop on RULER Retrieval. While we acknowledge that this is not an apples-to-apples comparison, the substantial difference offers an intuitive perspective on the impact.

We believe this gap is very likely due to the adoption of QK-Norm, and our paper provides a clear analysis explaining why this degradation occurs. We therefore expect these findings to be of direct benefit to the broader research community.

We would encourage the reviewer to consider these factors when assessing the novelty and potential impact of our work, and we welcome further discussion on these points.

Comment

Thank you for the response. I have raised my score accordingly.

Comment

Dear Reviewer LEho,

Today is the last day to engage in the discussion with the authors. The authors have replied to your most recent comment. Please engage at the earliest convenience.

Best, AC

Review
Rating: 4

The paper focuses on position embedding for better long-context modeling. Experiments are conducted to compare three position embedding methods, including RoPE, Query-Key normalization, and NoPE. Based on the empirical results, the paper proposes combining rotary position embedding (RoPE) and no position embedding (NoPE) for better long-context performance.

Strengths and Weaknesses

Strengths:

  • The proposed hybrid position embedding method seems effective in improving model performance.

  • Generally, the paper is clear and easy to follow.

Weaknesses:

  • It lacks a deep analysis or sufficient supportive empirical evidence for the choice of hybrid position embedding with both RoPE and NoPE. While the paper provides observations in Section 2, there is no analysis on why the RoPE and NoPE are complementary. After all, as shown in Section 2.1 (Figures 1 and 2, Table 2), RoPE and NoPE are similar when the model employs a single position embedding.

  • It requires more comparison to the baseline models to truly verify the effectiveness of the proposed method. While a larger θ enhances the long-context performance of RoPE, it would only be fair to compare the proposed hybrid position embedding to RoPE with varying θ. Meanwhile, the proposed method uses RoPE with sliding window attention; the baseline should also include models with sliding window attention.

Questions

  • As mentioned in the Weaknesses part, more comparisons between the proposed method and various baseline variants (various θ values for RoPE, RoPE with sliding window attention) could further validate the effectiveness of the proposed method. I look forward to further results with more fair comparisons.

  • It would be better if the authors could provide a more comprehensive analysis of why choosing NoPE and RoPE for the hybrid position embedding. A deeper analysis of the motivation and insights could strengthen the paper, as it currently only considers three position embedding variants in Section 2.

I look forward to the authors' reply and will adjust my rating accordingly.

Limitations

The authors could address more of the limitations of the proposed method. Currently, there seems to be little discussion of the limitations.

Justification for Final Rating

This paper proposes to mix the RoPE and NoPE for better long-context modeling. Experiments are conducted to compare three position embedding methods, including RoPE, Query-Key normalization, and NoPE.

Formatting Issues

The width of some tables should be adjusted.

Author Response

Thank you for the valuable feedback. We address the comment as follows:

  1. Analysis of RoPE, NoPE, and QK-Norm, and rationale for interleaving RoPE and NoPE for improved performance:

    In Figure 1, we analyzed the attention mass on the needle tokens across sequence lengths and model variants. From these observations, we infer that NoPE exhibits an advantage in its ability to focus on relevant information, based on the following points:

  • Attention mass differences at fixed sequence length: For example, in Figure 1 (sequence length 32k), the attention mass of RoPE, NoPE, and QK-Norm is 0.02, 0.03, and 0.01, respectively. While the absolute values are small, the relative differences are notable: RoPE ≈ 2× QK-Norm, and NoPE ≈ 3× QK-Norm.

  • Attention mass differences across sequence lengths: When comparing the attention mass over needle tokens across sequence lengths, we observe that QK-Norm drops the most (from 0.017 to 0.005, >3× reduction), RoPE drops from 0.033 to 0.015 (>2× reduction), while NoPE drops from 0.045 to 0.025 (<2× reduction). This suggests that NoPE degrades less when sequence length increases.

  • Needle score and prior findings: From Table 2 (and corroborating prior studies), QK-Norm performs poorly on long-context tasks, likely due to its reduced ability to attend to relevant tokens—a hypothesis we further analyze in Appendix B.

These differences led us to hypothesize that RoPE and NoPE could be complementary, with RoPE explicitly modeling positional information and recency bias, and NoPE improving long-context retention and reducing degradation across long sequences.

Motivated by this, we explored interleaving RoPE and NoPE to combine their strengths. This design decision was made as an exploration. We also made several other attempts to improve the model's capabilities, which we did not include as they either performed poorly or were less relevant. As we also state in the paper, prior work such as GPT-J/NeoX combines NoPE and RoPE (partial RoPE), but not across layers.

Thank you for pointing this out. We will make our reasoning clearer in the final version by including numeric comparisons and percentage differences.
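For readers unfamiliar with the metric, the following is one plausible way to compute the needle attention-mass statistic discussed above from a model's attention weights (our sketch of the idea, not necessarily the authors' exact procedure; the sequence length and needle span are made up):

```python
import numpy as np

# Sum the attention probability a query assigns to the needle-token span.
# (Illustrative sketch of the metric only; not the authors' evaluation code.)

def needle_attention_mass(attn_row: np.ndarray, needle_span: range) -> float:
    """attn_row: one query position's attention distribution over all key positions."""
    return float(attn_row[list(needle_span)].sum())

rng = np.random.default_rng(0)
seq_len = 32_768
scores = rng.normal(size=seq_len)
attn_row = np.exp(scores - scores.max())           # numerically stable softmax
attn_row /= attn_row.sum()

needle_span = range(10_000, 10_016)                # hypothetical needle location
print(f"attention mass on needle: {needle_attention_mass(attn_row, needle_span):.4f}")
```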


  2. Comparison with baselines using larger RoPE theta values and sliding window attention.

Thank you for this important suggestion. We plan to include experiments utilizing varying θ values for RoPE in the final version. Due to computational constraints, these may be limited to the fine-tuning (SFT) stage. Our current baseline uses a θ of 8 million, which aligns with industry-standard long-context training regimes. For example:

  • LLaMA 3 adopts θ = 500k to support an 8k context window
  • Qwen‑2.5, supporting up to 128 k tokens, employs θ around 10 million
  • Cohere Command R, also designed for 128 k context length, uses θ on the order of 8 million

While we acknowledge that these models also incorporate techniques like Dual‑Chunk Attention or the Yarn mechanism to enhance long-context performance, the overarching trend of scaling θ to a few million for extended context remains consistent.
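As a quick illustration of why θ is scaled into the millions for long contexts, the standard RoPE wavelength formula 2π·θ^(2i/d) can be evaluated for a few base values (the head dimension and θ values below are chosen for illustration, not taken from the paper):

```python
import math

# Longest per-dimension wavelength of standard RoPE as a function of the base theta.
# A larger theta stretches the low-frequency dimensions so they rotate slowly enough
# to remain informative over much longer distances. (Illustrative values only.)

def rope_wavelengths(head_dim: int, theta: float):
    # wavelength of dimension pair i is 2*pi / theta**(-2i/d) = 2*pi * theta**(2i/d)
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]

head_dim = 128
for theta in (10_000, 500_000, 8_000_000):
    longest = max(rope_wavelengths(head_dim, theta))
    print(f"theta={theta:>9,}: longest wavelength ≈ {longest:,.0f} positions")
```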


  3. On including baseline models with sliding-window attention

Thank you for this perceptive suggestion. We agree that adding baseline comparisons that also utilize sliding‑window attention (SWA) would strengthen the completeness and fairness of our evaluation. However, we did not include those in the current version for the following reasons:

  • Resource constraints: Pretraining models from scratch or fine-tuning major variants with SWA is computationally expensive and falls beyond our current resource budget.

  • Low practical impact of pure NoPE‑only variants: In our preliminary experiments, models using only NoPE—regardless of whether with SWA—showed significantly worse performance, including large increases in perplexity. This suggests limited practical value for pure NoPE baselines in long‑context regimes.

  • Focus on core contributions: We prioritized experiments directly aligned with our hypothesis: the hybrid combination of RoPE and NoPE. As such, we ran QK‑Norm with sliding‑window attention—but omitted it from the paper because it did not illuminate the central mechanisms we study.

The goals of this paper are to show:

  • Analysis that different attention mechanisms and combinations across layers can result in interesting attention patterns and further understanding this pattern can help us improve the model.
  • Experiments that focus on demonstrating that RNoPE‑SWA can easily outperform prevailing long-context recipes with lower compute and serving overhead without tuning the θ values. This "free-lunch" phenomenon also corresponds to findings of other recent works on hybrid attention.
  • We understand that there are possibly other ways to combine different mechanisms to improve, even just between NoPE and RoPE. We think it will be good for further work to explore and refine on top of this approach.

We appreciate your thoughtful feedback and look forward to further discussions to enhance our work.

Comment

I appreciate the authors' response. However, without any further results, I will maintain my rating. Considering the resource constraint, I would encourage the authors to provide partial results within the two weeks given for the rebuttal and discussion period.

Comment

Thank you for your comment. We have explained our rationale for the experiments in responses to other reviewers, but to summarise: our work began with the analysis in Section 2, where we explored different positional embeddings and identified interesting attention patterns. These findings directly informed our modelling choices. We believe that the analysis in Sections 2 and 3 is valuable in its own right, because:

  • Among the many works employing hybrid-attention architectures, ours is one of the very few that investigates the underlying mechanisms of their effectiveness. For example, the findings of "division of labour" across different layers can be interesting and informative for future model design.

  • Our discussion of QK-Norm, to our knowledge the first in the community, provides a plausible explanation for why models like Gemma 3, which uses a similar interleaved global–local attention architecture with solely RoPE, underperform on long-context evaluations such as RULER. For example, Gemma 3 27B scores 0.91 at 8k but drops to 0.66 at 128k—a ~30% loss—while also lagging behind other open-source models. In contrast, RNoPE-SWA, despite requiring significantly less compute and being almost 4× smaller, shows only ~17% and ~7% drops on RULER QA and RULER Retrieval, respectively. While this is not an exact apples-to-apples comparison, the large gap highlights the likely role of QK-Norm, which our paper analyses in detail.

We also want to point out that our proposed method—derived from this attention analysis—offers clear efficiency advantages in speed and memory usage, while achieving better performance than other methods that combine NoPE and RoPE such as [1] or a dense baseline that utilize full attention. This "free lunch" phenomenon also stems from the analysis we performed in previous sections.

Given the scope and length of the analysis in Sections 2 and 3—and the substantial resources required—we found it difficult to allocate additional space or compute for more ablations and baselines in Sections 4 and 5. Instead, we ensured that our baseline is strong, well-trained, and uses hyperparameters (such as RoPE θ) validated by industry practice. While we would like to add further experiments, the review timeline makes it unlikely that this can be completed before the discussion period ends.

Another reason we structured our experiments in this way is that long-context evaluations can be highly sensitive to the amount of training data. With insufficient training tokens, the conclusions drawn can change dramatically. For example, the ALiBi paper [2] reported extraordinary context extrapolation capabilities, but subsequent works [3, 4] found that part of this effect was attributable to insufficient pretraining tokens and a limited receptive field. Given limited compute resources, we wanted to ensure that—even with a limited number of experiments or baselines—each run received sufficient training tokens to produce meaningful and robust results.

We kindly ask the reviewer to take these factors into account when assessing the novelty and potential impact of our work, and we look forward to future discussions.

[1] Round and Round We Go! What makes Rotary Positional Encodings useful?

[2] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

[3] BTLM-3B-8K: 7B Parameter Performance in a 3B Parameter Model

[4] Dissecting Transformer Length Extrapolation via The Lens of Receptive Field Analysis

Comment

Dear Reviewer qPgF,

Note that the authors have replied to your comment on Aug. 4th. Please engage ASAP and clarify as much as possible so that we have a productive discussion.

Thanks!

AC

Comment

Dear Reviewer qPgF,

Today is the last day to engage in the discussion. Please do so at the earliest convenience.

Best,

AC

Comment

Dear Authors,

I appreciate your further clarification. I will increase my rating. However, I would still suggest you add more experiments in the future.

Review
Rating: 4

The paper proposes to interleave NoPE and RoPE layer by layer so that it can get the advantages of both methods: NoPE for better retrieval ability, and RoPE for better local modeling. The authors make some observations on the attention distributions of the models, and then use the proposed method to show improvements on downstream long-context retrieval tasks. Performance gains are more obvious as the context length grows.

Strengths and Weaknesses

Weaknesses:

  • Why didn't the experiment show NoPE as a baseline? I guess NoPE will still outperform RNoPE on some retrieval tasks, and it will also be good at short-context tasks.
  • Insufficient ablations: a good ablation of the method will include the following ablated variants:
    1. RoPE
    2. RoPE with swa
    3. NoPE
    4. NoPE with swa.
      Both accuracy and efficiency should be discussed for these variants. The current experiments use only a single "Baseline", which is not professional and cannot prove the effectiveness/efficiency of RNoPE-SWA. If efficiency is important, you should turn Section 5.3 into a table comparing the efficiency of all the different variants.
  • The even, one-by-one stacking of RoPE and NoPE is not well supported by ablations either. In my opinion, the early transformer layers should encode more local information, so it is more useful to place RoPE-SWA there, while the later layers can encode more of the global information that requires NoPE.
  • Again, the ablation of how to mix RoPE/NoPE layers is covered only in the sentence "we perform an ablation study on the interleaving ratio of full attention and sliding window layers, testing the configurations of 1:1, 1:3, and 1:7." Where are your ablation results? How can I trust this sentence without numbers/significance differences to back it up? There are no tables/numbers for it. The whole paper so far reads to me like very unprofessional work from an undergraduate course project.

Questions

  • What does qc mean in Figure 2? If it means "query and completion", please say so explicitly.
  • Why is NoPE not included as a baseline?
  • Can you add NoPE and NoPE+SWA to the ablation?
  • Can you include more ablation on the design choices?

Limitations

yes

Justification for Final Rating

The authors addressed most of my concerns, and I raised the score accordingly.

Formatting Issues

N/A

Author Response

Thank you for your thoughtful feedback. Below we clarify several misunderstandings and explain the scope and focus of our work as well as answering questions.

First we want to emphasize that the goals of this paper are:

  • Illustrate with analysis that different position embeddings can demonstrate different attention distributions. We then use this insight to hypothesize that mixing them may combine their strengths.
  • Understanding these patterns and combining position embeddings in a certain way (RNoPE-SWA) can easily outperform the prevailing SOTA long-context scaling method based on RoPE-NTK scaling and its variants, and RNoPE-SWA can achieve this with much less compute during long-context training and serving. Overall, RNoPE-SWA, without much hyper-parameter tuning, easily outperforms the SOTA baseline with much less compute required -- a "free-lunch" phenomenon observed in other hybrid attention works (MiniMax-Text-01, Jamba, YOCO, etc.) as well.
  • Note that our aim is not to provide exhaustive ablations for RNoPE‑SWA, but to demonstrate the interesting phenomenon that reduced compute can yield better results under some design choices.

To give a brief summary of how we approached the problem:

  • In Sections 2.1 and 2.2, we analyzed the evaluation scores and attention distributions of variants with RoPE, NoPE and QK-Norm. Based on the analysis, we formed some intuitions and made some hypotheses on how we could combine different types of position embeddings to potentially improve model performance.
  • In Section 2.2.3, we showed that a model with interleaved NoPE and RoPE layers demonstrates a very interesting attention distribution, with distinct attention characteristics across layers with different position embedding types. The combination also showed excellent long-context performance, which corresponds to our analysis and conjecture that models with this attention distribution might work better on long-context tasks than existing SOTA baselines trained with RoPE-NTK. We also shrank the attention span of the RoPE layers based on these observations, to significantly bring down the compute requirement during long-context training and serving.
  • Based on the above analysis, we chose to interleave NoPE and RoPE with a certain ratio and sliding window attention as our final model design.
  • We compare our model architecture with the prevailing method (a dense architecture with RoPE-NTK training), which most open-source SOTA models adopt more or less (LLaMA 3, Qwen-2.5, DeepSeek-V3, Command R, etc.)

Note that our final method and architecture were derived not from exhaustive ablations but from empirical observations and analyses in several small‑scale pre‑experiments. The primary goal of this paper is to demonstrate that this specific approach outperforms the prevailing SOTA long‑context extension training recipe based on RoPE NTK scaling. Determining whether other combinations or hyper-parameter tuning can surpass our proposed method is beyond this work’s scope and left for future research.


We will answer the questions with the above paper goals in mind:

  1. Why didn't the experiment show NoPE as a baseline?

Prior work (e.g., [37], [71]) has already shown that pure NoPE lags behind RoPE on standard benchmarks, despite strong extrapolation ability. In Sec. 2.1–2.2 we independently confirm that NoPE underperforms RoPE-NTK on both retrieval accuracy and perplexity, which motivated focusing on mixed architectures rather than repeating a known weak baseline. We eliminated QK-Norm from subsequent experiments for the same reason.


  2. The paper should include all of the variants below in the ablation:
  • RoPE
  • RoPE + SWA
  • NoPE
  • NoPE + SWA

Note that we derived our final architecture, RNoPE-SWA, from the analysis in Section 2 rather than from exhaustive searches over architecture combinations. During the analysis, we pointed out a few interesting attention patterns and already eliminated some bad design choices such as pure NoPE or QK-Norm. Other combinations of position embeddings could potentially achieve on-par or better performance, with the same or different attention patterns, compared to RNoPE-SWA. Several potential variants could be tested, such as RoPE + SWA, ALiBi + SWA, partial RoPE, or YaRN + NoPE + SWA, and we think they are worth exploring in the future. However, the goal of the paper is to illustrate, through the findings from the analysis and the experimental results, that even with much less compute, this particular combination -- RNoPE-SWA -- can easily outperform the SOTA methods that use a dense architecture and RoPE-NTK. Note that this "free-lunch" phenomenon also corresponds to findings of other recent works on hybrid attention, and the analysis conducted in the paper can serve as a step towards explaining this behavior.

To sum up the reasons that prevents us from including some experiments:

  • Resource constraints: Pretraining models from scratch or fine-tuning major variants with SWA is computationally expensive and falls beyond our current resource budget.

  • Low practical impact of pure NoPE‑only variants: In our preliminary experiments, models using only NoPE—regardless of whether with SWA—showed significantly worse performance, including large increases in perplexity. This suggests limited practical value for pure NoPE baselines in long‑context regimes.

  • Focus on core contributions: Rather than prove RNoPE‑SWA is optimal among all hybrids, we aim to show it can already surpass SOTA long‑context extension architectures/recipes.


  3. The paper should explain why RoPE and NoPE are stacked evenly instead of assigning RoPE to early layers and NoPE to later layers

Similar to our reasoning above, this can be one of the explorations for future work. It may or may not work, but comparing all available variants is not the goal of this paper.


  4. Layer interleaving ratio ablations are not shown explicitly in the paper.

Thanks for pointing this out. We will include more information about this in the Appendix. We originally left them out because we felt they were not central to our core message, for the following reasons:

  • At the end of Section 2, we arrived at RNoPE-10k-SWA with 1:1 interleaving. We ablated and found 1:3 to be a good ratio with a decent level of sparse layers. Going to 1:7 seems to stretch too far given the model size and number of layers. This turned out to be different again when we further scaled the model parameters later on; it seems the model can tolerate a higher level of sparsity as it grows bigger with more layers, which makes sense intuitively. However, this is less relevant to the core concept the paper wants to convey.
  • Any 1:X interleaving yields a theoretical compute cost of 1/(1 + X) compared to a dense model at very long sequence lengths. Thus, the "free-lunch" trade-off—reduced compute with better accuracy—holds regardless of the exact ratio.

While exploring how far we can push sparsity is an interesting avenue (and subject to future work), omitting these details does not undermine our main narrative: that a simple sparse‑dense interleaving outperforms dense RoPE‑NTK baselines without needing to fine‑tune hyperparameters for optimal sparsity.


Answers to Questions:

  1. What does qc mean in Figure 2? -- Thanks for spotting this; we will rectify the typo in the figure accordingly.
  2. Can you include more ablation on the design choices? -- Similar to our answers above: we refined our design and obtained the final architecture based on the analysis in Section 2 and part of Section 3 instead of massive ablations.

We appreciate your thoughtful feedback and look forward to further discussions to enhance our work.

Comment

Thanks for your response. I still feel that ablation studies are important for understanding the mechanism behind each design choice. For example, the original "Attention Is All You Need" paper did a solid comparison of design choices in its Table 3. It may be okay to omit the ablations, but it will make your paper less convincing.

I totally understand there is a computation budget constraint. However, people usually do ablations at smaller scales; you can run the ablation study using smaller models and less data (e.g., half the model size and 10% of the data uses only 5% of the budget), which makes the experiments easier. Also, I am not asking you to try all the potential variants (e.g., ALiBi + SWA, partial RoPE, or YaRN + NoPE + SWA); as you mentioned in your rebuttal, they are out of scope for our discussion here. I am just asking for the ablations "related" to your final design choice, RNoPE-SWA, and its components, to justify their usefulness/contributions in the final design.

Rather than prove RNoPE‑SWA is optimal among all hybrids, we aim to show it can already surpass SOTA long‑context extension architectures/recipes.

I think for a scientific paper accepted to a conference like NeurIPS, the duty is to discover novel scientific findings supported by evidence, not just to present strong empirical results. If someone built a SOTA model without understanding the interplay between its components, it may serve well in product applications, but it falls short of delivering scientific understanding.

In short, I think this paper has strong potential, and a focused ablation study could substantially strengthen the contribution. I hope the authors consider this in a future revision or follow-up submission.

Comment

We would like to respectfully clarify that the claim regarding a “lack of ablation” appears to stem from a possible misunderstanding or limited engagement with the paper—particularly Section 2, which is dedicated entirely to ablation studies and related analyses.

Contrary to the implication, resource constraints were not the primary reason certain ablations were omitted. Instead, our focus was on maintaining clarity and coherence around the central message of the paper. We examined the attention distribution closely and gave a thorough analysis of its impact on model performance and the underlying mechanisms. At the end of Section 2, we removed certain components or variants (e.g., pure NoPE or QK-Norm) based on our analysis or ablations.

We encourage a careful re-reading of both Section 2 and our previous response, where we outlined the rationale behind our experimental design, the logical flow of our conclusions, and the reasoning for excluding specific variants. For example, the discussion on the “division of labour” attention pattern provides a key insight into why our hybrid attention model—combining NoPE with full attention and RoPE with SWA—was chosen over, for example, the RoPE or NoPE only variants.

We believe the paper offers substantial empirical evidence and logical reasoning to support our modeling decisions and scientific claims. Moreover, while many recent works adopt hybrid attention mechanisms, few attempt to investigate the underlying dynamics in such depth. We see our contribution as a step toward filling that gap through thoughtful experimentation and interpretation.

Comment

Thanks for the reply. I double-checked Section 2 and agree that it can serve as a form of ablation study, although not as diverse as what we saw in the Section 5 main experiments. Most of my concerns were resolved, and I will raise my score accordingly.

Review
Rating: 5

The paper presents a comprehensive analysis of different attention and positional encoding mechanisms in large language models, specifically for handling long contexts. The authors begin by empirically evaluating the performance and attention patterns of models using Rotary Position Embedding (RoPE), No Positional Embedding (NoPE), and Query-Key Normalization (QK-Norm). They identify that NoPE excels at information retrieval over long distances by relying on semantic similarity, but lacks in general performance. Conversely, standard RoPE models with extensions struggle with very long contexts, and QK-Norm, while stabilizing training, degrades long-context performance by creating noisy, unfocused attention patterns.

Strengths and Weaknesses

Strengths:

  • The paper is exceptionally well-written and structured. The narrative flows logically from a clear problem statement, through a comprehensive analysis of existing methods, to the proposal and validation of a new solution. The use of attention pattern visualizations (Fig. 1, 2) provides strong, intuitive evidence for the paper's claims about the behavior of RoPE, NoPE, and QK-Norm, making the motivation for the hybrid model very convincing.
  • The contribution is highly timely. Improving long-context performance while simultaneously increasing computational efficiency is an important goal in LLM research. The proposed RNope-SWA architecture offers a practical and effective solution that addresses both performance and efficiency, a rare combination. The results, particularly the near-perfect scores on NIAH at 256k and the substantially smaller performance drop on Ruler (Table 6), are impressive and demonstrate a clear state-of-the-art advancement over the baseline.
  • While the components themselves (RoPE, NoPE, SWA) are not new, their combination into a functionally specialized, interleaved architecture is novel and insightful. The core idea that different layers can be specialized for different tasks (long-range retrieval vs. local modeling) is powerful. The paper provides strong empirical backing for this "division of labor" hypothesis, moving beyond simple engineering to offer a new architectural principle.

Weaknesses

  • The authors state that they performed an ablation study on the interleaving ratio of NoPE to RoPE+SWA layers, testing 1:1, 1:3, and 1:7, and found 1:3 to be optimal (lines 217-219). However, the full results of this important ablation are not presented in the paper or appendix. This is a key design parameter, and its justification currently relies on a statement rather than evidence.
  • The paper makes a key observation that a large RoPE theta value in the hybrid model "introduces noise into the overall architecture, which disrupts the NoPE layers' ability to...perform retrieval effectively" (lines 184-186). While the data supports this observation, the paper could benefit from a deeper explanation or hypothesis as to why this occurs. Is it due to the flatter attention distributions from RoPE interfering with the sharp, similarity-based attention of NoPE? A more detailed discussion would strengthen the theoretical underpinnings.

Questions

See Weaknesses.

Limitations

No limitations

Justification for Final Rating

I think this is a great paper, and we should accept it.

Formatting Issues

N/A

Author Response

Thank you for the valuable feedback.

  1. Interleaving ratio ablation omitted: Thank you for highlighting this. We agree that this is an important point, and we will include the full ablation results in the appendix in the revised version. Briefly, our findings are as follows:

In Section 2, we presented the RNoPE-10k-SWA model using a 1:1 interleaving ratio. We later ablated different ratios and found that 1:3 offered a favorable trade-off between performance and the number of sparse layers. In contrast, a 1:7 ratio introduced excessive sparsity given the model size and number of layers, leading to a noticeable drop in the Needles score. Interestingly, this trend shifted when scaling the model further: larger models with more layers tolerated higher levels of sparsity, which aligns with our intuition that deeper networks can better distribute and specialize functionality.

We initially omitted these details because we felt they were tangential to the core contributions. Regardless of the exact ratio, any 1:X interleaving configuration yields a theoretical compute cost of 1/(1 + X) compared to a dense baseline model at very long sequence lengths. Thus, the “free lunch” phenomenon—achieving reduced compute with improved accuracy—holds across a range of interleaving ratios.
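For intuition, here is a rough attention-cost comparison behind the 1/(1 + X) figure, under assumed dimensions (32 layers, 4k window, 256k context; these are our illustrative values, and MLP/projection cost, identical for both models, is ignored):

```python
# Full-attention layers scale quadratically with sequence length; sliding-window
# layers scale only linearly. The assumed layer count, window, and context length
# below are for illustration and are not the paper's configuration.

def attention_cost(num_layers: int, seq_len: int, full_layers: int, window: int):
    full = full_layers * seq_len * seq_len
    swa = (num_layers - full_layers) * seq_len * min(window, seq_len)
    return full + swa

num_layers, window, seq_len = 32, 4_096, 262_144
dense = attention_cost(num_layers, seq_len, full_layers=num_layers, window=window)

for x in (1, 3, 7):                                # 1:X full-to-SWA interleaving
    hybrid = attention_cost(num_layers, seq_len, full_layers=num_layers // (1 + x),
                            window=window)
    print(f"1:{x} -> {hybrid / dense:.2f}x of dense attention cost "
          f"(theoretical limit 1/(1+{x}) = {1 / (1 + x):.2f})")
```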


  2. Thanks for bringing this up; we think this is also one of the very interesting findings of the paper. We actually gave this a lot of thought but ended up not including much detail about it in the original version of the paper. We will try to add as much as possible in the revised version. Our ideas are actually similar to yours (flatter attention distributions from RoPE interfering with the sharp, similarity-based attention of NoPE):
  • This paper has shown that the original dense, RoPE-based transformer architecture has a lot of noise in its attention mechanism after training. This noise prevents the model from forming useful attention patterns, especially at longer contexts. We think transformer models try to "fight" this design flaw during gradient descent but are not able to completely denoise their attention. This could be why, when we visualize the attention distributions over the needles, there is still significant mass on irrelevant contextual tokens (e.g., with QK-Norm being the highest).

  • From our understanding, the "division of labor" seen in Figure 2 indicates that, by constructing the model with interleaved NoPE/RoPE layers, gradient descent will "push" the NoPE layers to learn information retrieval more than the RoPE layers, since NoPE has no built-in recency bias at all, whereas the RoPE layers will be "pushed" to aggregate local information. We think how well this system trains under this setting depends on how much inductive bias we give to each type of layer. Increasing RoPE's theta value expands its receptive field, increasing its tendency to attend to distant tokens and introducing more noise into the system. Cutting the context off with a sliding window strictly imposes this inductive bias on the RoPE layers, ensuring they focus solely on recent context, and puts fewer obstacles in the way of the model following the "division of labor" pattern during training.

We believe this phenomenon warrants further theoretical analysis, including the development of formal bounds or derivations to better explain the observed effects. Some related insights can be found in the minimax-text-01 report (https://arxiv.org/pdf/2501.08313), particularly Section 2.2.4, which we found to be highly thought-provoking.

We sincerely appreciate your thoughtful review and look forward to incorporating this feedback to strengthen our paper.

Comment

Thank you to the authors for providing the rebuttal. While I acknowledge that other reviewers have expressed concerns about this work, I believe this paper makes an important contribution that will prompt valuable discussion within the community regarding long-context model architectures. The insights presented in the paper deserve wider consideration, so I will maintain my initial positive scores.

Final Decision

This paper presents a thorough analysis of attention mechanisms—RoPE, NoPE, and QK-Norm—highlighting their strengths and limitations in long-context language modeling. It uncovers distinctive attention patterns that influence performance and offers architectural insights for improving model design. Building on these findings, the authors propose a novel hybrid attention mechanism that outperforms RoPE-based transformers on long-context tasks while remaining competitive on shorter ones.

After the discussions between authors and reviewers the major concerns were:

  1. Insufficient experiments and/or ablations (R-V4bg, R-qPgF, R-qtcq); and
  2. Missing distinction between the work in [1] and the submission (R-LEho).

However, the authors replied to these points and most reviewers were satisfied with the authors' replies. In the end, all reviewers support the acceptance of this work. We encourage the authors to include the beneficial material discussed with the reviewers and to clearly state the distinction between this work and [1].