Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Abstract
Reviews and Discussion
This paper introduces Mixture-of-Recursions (MoR), a framework that aims to integrate two mainstream model efficiency techniques—parameter sharing and dynamic computation (i.e., adaptive depth)—into a single, unified architecture. The core idea is to reuse a shared block of layers within a recursive Transformer to achieve parameter efficiency, while simultaneously employing a lightweight, token-level router to dynamically assign different recursion depths to each token.
Strengths and Weaknesses
Strengths
-
The idea of combining parameter sharing and dynamic computation within a single recursive framework is highly novel and forward-looking. Developing more efficient large language models without sacrificing performance is a core challenge in the field, and this paper's research direction is well-aligned with this goal, making it highly significant.
-
The design of the MoR framework is well-considered, encompassing not only dynamic routing but also a corresponding KV cache strategy to form a complete solution. The paper's explanation of the methodology is clear.
-
The authors evaluate their model across multiple scales and computational budgets and conduct several ablation studies (e.g., on parameter sharing and routing strategies), which demonstrates the effort invested in validating their method.
Weaknesses
-
A core claim of the paper is that MoR is more efficient to train, but this is primarily justified using theoretical FLOPs rather than actual wall-clock time. The dynamic routing and conditional computation logic introduced by MoR will incur additional overhead, which may offset or even negate the FLOPs savings on real hardware. The advantage of a truly efficient architecture should ultimately be reflected in end-to-end training time. The lack of a comparison based on actual training time undermines the claim of efficiency.
-
The paper's main results rely on the expert-choice routing strategy, but the ablation study shows that this method requires an auxiliary loss to function correctly, which adds complexity and potential instability to training. Meanwhile, the simpler "Token-choice" strategy performs significantly worse. This raises doubts about whether the success of the MoR framework is overly dependent on a specific, finicky routing mechanism rather than a clean, robust design.
-
The comparison between the M-Cyc and Cyc strategies may fail to properly control for variables. M-Cyc has independent first and last layers, meaning its shared recursive module has fewer parameters than that of Cyc. Therefore, it is unclear whether M-Cyc's superiority stems from its structural advantage or simply from its smaller, easier-to-optimize recursive module. A more rigorous ablation study is needed to disentangle these factors.
Questions
Please refer to my weakness part.
Limitations
yes
Formatting Concerns
N/A.
We thank the reviewer for highlighting MoR's strengths: unified parameter sharing with dynamic computation, dynamic routing paired with a KV-cache, its clear methodology, and comprehensive multi-scale ablations. Below, we address the reviewer’s specific questions and concerns (quoted).
[W1] A core claim of the paper is that MoR is more efficient to train, but this is primarily justified using theoretical FLOPs rather than actual wall-clock time. The dynamic routing and conditional computation logic introduced by MoR will incur additional overhead, which may offset or even negate the FLOPs savings on real hardware. The advantage of a truly efficient architecture should ultimately be reflected in end-to-end training time. The lack of a comparison based on actual training time undermines the claim of efficiency.
[A1] Thank you for raising this interesting point. We want to highlight that the theoretical efficiency of MoR directly translates to a reduction in actual wall-clock training time. In MoR, each expert (recursion step) selects only the top-$k$ tokens among a total of $N$ input tokens. As shown in Table 2 (right), MoR significantly reduces the theoretical computational cost compared to the vanilla baseline. For a detailed derivation of these expressions, please refer to our first response [A1] to reviewer cDNG.
As shown in Table 3R, MoR achieves a substantial reduction in wall-clock training time compared to the vanilla Transformer under identical training steps. This confirms that the theoretical efficiency of our method directly translates to practical speedup in real-world training scenarios.
[Table 3R] Comparison of wall-clock training time for MoR and Vanilla 360M Transformers, measured under the same number of training tokens (20B). Note that recursive baselines have almost the same wall-clock time as Vanilla models. We trained models with FlashAttention-2 and Hugging Face Accelerate on 4 H100 GPUs.
| | Vanilla | MoR-2 | MoR-3 |
|---|---|---|---|
| Training Time (hr) | 30.2 | 25.1 (17%↓) | 23.6 (22%↓) |
Although MoR introduces dynamic routing, the associated overhead (routers and indexing) is minimal in practice: the router consists of a single linear layer, and each token passes through it only $N_r$ times (e.g., 2, 3, or 4 recursions). As a result, the efficiency gains predicted by our theoretical analysis are realized as faster wall-clock performance.
We would also like to note that the throughput gain reported in Figure 4a is measured in wall-clock time rather than theoretical calculation.
[W2] The paper's main results rely on the expert-choice routing strategy, but the ablation study shows that this method requires an auxiliary loss to function correctly, which adds complexity and potential instability to training. Meanwhile, the simpler "Token-choice" strategy performs significantly worse. This raises doubts about whether the success of the MoR framework is overly dependent on a specific, finicky routing mechanism rather than a clean, robust design.
[A2] Thank you for raising this point. We'd like to clarify that we didn't do extensive engineering or pick a finicky router for the MoR framework. Instead, we found that a much simpler and more intuitive expert-choice routing consistently demonstrated robust performance, which is why we adopted it.
[Auxiliary Loss] The expert-choice routing strategy does not introduce significant complexity or instability in practice. There are typically two solutions to address information leakage in expert-choice: auxiliary loss and auxiliary router. The auxiliary loss is more straightforward and easy to implement—for example, it can be expressed with only a few lines of pseudocode as below.
router_weights = Router(x)                        # per-token routing scores
selected_tokens = TopK(router_weights, k=top_k)   # indices of the selected top-k tokens
targets = zeros_like(router_weights)
targets[selected_tokens] = 1                      # binary target: 1 if selected, else 0
sampling_loss = BCE(router_weights, targets) / (batch_size * seq_len)   # auxiliary router loss
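For reference, a minimal runnable PyTorch sketch of this auxiliary loss is shown below. This is our own illustration with hypothetical names (e.g., `RecursionRouter`), not the exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch (our own illustration, not the exact training code): a
# single-linear-layer router trained with a BCE auxiliary loss to score its
# own top-k selections highly, mitigating information leakage at inference.
class RecursionRouter(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, x: torch.Tensor, top_k: int):
        # x: [batch, seq_len, hidden]
        scores = torch.sigmoid(self.proj(x)).squeeze(-1)   # [batch, seq_len]
        topk_idx = scores.topk(top_k, dim=-1).indices      # tokens kept at this recursion
        targets = torch.zeros_like(scores)
        targets.scatter_(-1, topk_idx, 1.0)                # 1 for selected tokens, 0 otherwise
        # Mean reduction divides by batch_size * seq_len, as in the pseudocode above.
        aux_loss = F.binary_cross_entropy(scores, targets.detach())
        return topk_idx, scores, aux_loss

# Toy usage with dummy inputs.
router = RecursionRouter(hidden_size=64)
topk_idx, scores, aux_loss = router(torch.randn(2, 16, 64), top_k=8)
```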
In contrast, the auxiliary router has the drawback of requiring a different router for training and inference, and both must be used during training. We believe our finding that the auxiliary loss is more robust and stable to use across all experiments is therefore very significant (refer to Table 4 (left) and Table 11).
[Expert-Choice vs. Token-Choice] We would like to emphasize that expert-choice routing is much simpler than token-choice routing and is better suited to recursion steps.
- Expert-choice selects a fixed number of tokens per expert, enabling consistent capacity allocation (see Section 2.2.1 and Table 2 (left)), which leads to more stable training.
- Token-choice for MoR shows unstable performance that is sensitive to hyperparameter settings. We tested various configurations—such as different load balancing strategies, router activation functions, and z-loss for stability—but as shown in Table 4 (right) and Table 12, it was very difficult to balance expert allocation. It actually results in high MaxVio values (i.e., the router failed to balance the load) and higher loss across training phases. We believe the heterogeneous experts in token-choice make training more difficult to stabilize.
- Implementation complexity is even higher for token-choice. Because it does not guarantee static capacity, the number of tokens processed by each expert can vary within a batch, requiring more engineering like additional padding for efficient batching.
We conducted thorough ablations of these design choices (Table 4, Table 11 and Table 12) and concluded that the simple expert-choice routing with an auxiliary loss is better suited to MoR’s recursive nature and robustly shows good performance across all settings.
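To make the static-capacity point above concrete, below is a toy sketch (our own simplification, not the paper's implementation) of hierarchical expert-choice filtering, where each recursion step keeps a fixed, linearly shrinking top-k subset of the tokens that survived the previous step:

```python
import torch

# Toy sketch (our own simplification): hierarchical expert-choice filtering.
# Each recursion step keeps a fixed top-k subset of the previously active
# tokens, so per-step capacity is static and known in advance.
def hierarchical_expert_choice(scores: torch.Tensor, num_recursions: int) -> torch.Tensor:
    seq_len = scores.shape[0]
    active = torch.arange(seq_len)                    # indices of still-active tokens
    depths = torch.ones(seq_len, dtype=torch.long)    # every token does at least 1 recursion
    for r in range(2, num_recursions + 1):
        # Linearly shrinking capacity: fraction (N_r - r + 1) / N_r of the sequence.
        k = max(1, seq_len * (num_recursions - r + 1) // num_recursions)
        keep = scores[active].topk(k).indices
        active = active[keep]
        depths[active] = r                            # these tokens recurse one step deeper
    return depths                                     # assigned recursion depth per token

depths = hierarchical_expert_choice(torch.rand(12), num_recursions=3)  # depths in {1, 2, 3}
```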
[W3] The comparison between the M-Cyc and Cyc strategies may fail to properly control for variables. M-Cyc has independent first and last layers, meaning its shared recursive module has fewer parameters than that of Cyc. Therefore, it is unclear whether M-Cyc's superiority stems from its structural advantage or simply from its smaller, easier-to-optimize recursive module. A more rigorous ablation study is needed to disentangle these factors.
[A3] Thank you for pointing this out. Achieving a perfectly rigorous ablation between the Middle-Cycle and Cycle strategies is inherently difficult. Since Middle-Cycle has two unshared layers, it is not trivial to exactly match both the number of unique parameters and the effective depth for a given recursion number. For example, in Table 9, M-Cyc often has one or two more layers than the Cyc cases. This is the main reason why M-Cyc performs much better than Cyc (not because it has a smaller recursive module; larger models generally perform better in language modeling).
Another reason why we opt for M-Cyc is from its uptraining results. In Table 10, despite a parameter difference of only about 10M (i.e., one or two layers), the M-Cyc strategy demonstrated significantly better performance. When uptrained from a pretrained checkpoint, the non-shared first and last layers likely had a substantial impact. As a result, the robust performance of M-Cyc across diverse settings led us to choose it.
However, since this is not a fair comparison, we also compared the Cycle and Middle-Cycle strategies while fixing the number of unique layers, in a setting that favors the Cycle strategy. Specifically, at the 730M model size, M-Cyc with a recursion number of 3 uses 8 layers for the recursion block and 2 unshared layers (1 + 8 x 3 + 1), resulting in 10 unique layers and 26 effective layers. We compared this to the Cycle strategy with 10 layers per recursion block (10 x 3), which also has 10 unique layers but a total of 30 effective layers.
[Table 4R] Comparison of M-Cyc and Cyc sharing strategies under the same number of unique parameters. All models use 730M scales (non-embedding parameters) as the base model with 3 recursions. The models are trained from scratch on 10B tokens using the Fineweb-Edu dataset.
| Model | $N_r$ | Param | Train loss | LD | HS | PQ | WG | ARC | MMLU | Average |
|---|---|---|---|---|---|---|---|---|---|---|
| Cyc | 3 | 252M | 2.7573 | 28.72% | 37.31% | 65.56% | 52.01% | 39.94% | 27.85% | 41.90% |
| M-Cyc | 3 | 252M | 2.7552 | 29.75% | 37.73% | 66.00% | 52.49% | 40.49% | 27.50% | 42.32% |
M-Cyc performs better than Cycle despite having fewer effective layers (26 vs. 30). However, we need to validate this across more scales to draw a more trustworthy conclusion. We appreciate the valuable comments.
We appreciate your thoughtful feedback and your valuable time. As the discussion period comes to an end, we would like to check if our responses have addressed your three main concerns.
Here is a brief summary of our responses to your concerns:
-
Practical speed improvements: The benefits of our MoR are not limited to a reduction in theoretical FLOPs. We have demonstrated that this translates to gains in actual wall-clock time. The observed training speed-up correlates closely with the theoretical FLOPs reduction, and this efficiency extends to inference, where MoR achieves significantly higher actual throughput.
-
Simplicity and stability of Expert-choice: We opted for Expert-choice routing over Token-choice due to practical advantages. Token-choice methods often suffer from training instability caused by heterogeneity between experts and introduce batching inefficiencies due to load imbalance. In contrast, the Expert-choice approach is far simpler, leading to a more straightforward implementation (due to perfect load balancing) and stable training. The auxiliary loss is also easy to implement and has proven robust across nearly all experimental settings.
-
Superiority of M-Cyc design: A perfectly controlled comparison between Cyc and M-Cyc is challenging due to architectural differences that prevent matching both parameter size and effective depth simultaneously. However, as shown in Table 4R, when we match the parameter count, M-Cyc slightly outperforms Cyc despite being at a disadvantage of 4 fewer effective layers. Furthermore, M-Cyc demonstrated substantially superior performance in the uptraining setting (Table 10), convincing us to use it.
If there are any remaining issues preventing an increase in your score, we would be grateful if you could let us know so we may further clarify or improve the submission.
The paper aims at improving the efficiency of recursive transformers, where the input is looped through the model for several recursive steps. The efficiency gains are obtained along three main axes. First, a routing mechanism allocates tokens to different loops; two variants are proposed, either sequentially dropping tokens at each loop or allocating tokens to different numbers of loops before starting them. Second, two alternatives are proposed for caching the activations, specifically tailored to the recursive computation performed by the model. Third, different ways of grouping layers are explored. The results show that recursive transformers with these modifications become as compute-efficient as vanilla transformers despite using far fewer parameters. Several ablations compare all the proposed alternatives in this setup.
Strengths and Weaknesses
Strengths
- Strong improvements over baselines: The proposed MoR architecture significantly outperforms other recursive baselines across compute budgets, achieving lower validation loss and higher few-shot accuracy (e.g., Table 3) while maintaining efficiency through dynamic recursion. It is impressive that it is as FLOPs efficient as a vanilla transformer despite using much less parameters.
- Efficient use of compute: The model demonstrates compute optimality consistent with scaling laws (e.g., Chinchilla scaling). The models and baselines seem well tuned. I verified with quick calculations that, at the given FLOPs budget, the vanilla model with 730M parameters should be optimal, which indeed appears to be true from Figure 3.
- Advances in recurrent efficiency: The focus on improving efficiency in recurrent architectures, which are typically FLOP-intensive, addresses an important practical bottleneck.
- Pareto improvements in throughput: MoR delivers substantial throughput gains, positioning it as a practical solution for high-speed inference, which is also a remarkable achievement.
Weaknesses
- More recursion is detrimental?: Across the results in the paper, such as the iso-FLOP comparisons, the fixed-token setting, and the throughput plots, it seems that more recursive steps give worse performance. This seems to indicate some issues with the experimental setup. I would expect performance to saturate, but not degrade, with more recursions. In this context, trainability issues do not seem to be addressed. Or could it be that the combination of the routing mechanism and a high number of recursions causes signal degeneracies? Some of these issues were addressed in related works, see e.g. "Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach".
- Test-time scaling interpretation: The conclusion that “allocating more recursion steps at inference can improve generation quality” is overstated. The results suggest that across models, fewer recursion steps are generally better, and additional recursion helps only within a given model.
- Shallow semantic analysis: The discussion of how the router aligns recursion depth with semantic importance is somewhat superficial. The mechanism by which token context determines exit decisions could be better examined. See also questions on this point.
- Vanilla still outperforms at iso-parameter: In iso-parameter settings, the vanilla Transformer continues to outperform recurrent variants, including MoR, as seen in some configurations in Figure 4. While not a major flaw given the paper’s focus on efficiency, this limitation could be acknowledged more explicitly.
- Minor: Decoding logic clarity: The description of the decoding process—particularly how tokens that exit early are dynamically substituted, and how masking and caching are handled—is insufficiently detailed. This is important for interpreting the throughput results of Figure 4a and understanding practical deployment implications.
- Minor: Caching mechanism explanation: The prose explaining the caching mechanism (e.g., lines 143–149) could be clearer, especially in detailing the logic behind the claimed memory savings and consistency of KV caches. Right now, it feels a bit too compressed.
Overall, I am in favor of acceptance, although I will consider increasing my score based on the authors' rebuttal, especially on the issue of why more recursions do not improve performance.
Questions
-
Effective FLOPs-parameter trade-off: For fixed FLOPs, MoR with $N_r = 2$ strikes an optimal balance, while deeper recursion degrades performance as the parameter count is reduced disproportionately. Can the authors expand on this point? Is the model saturating its capacity with too many recursions until becoming inefficient?
-
Does the model decode quickly because the prediction is easier, i.e., the model is more confident in its prediction? What if a token were forced to be decoded despite the router allocating it to the next step?
Limitations
yes
Formatting Concerns
no
We thank the reviewer for highlighting MoR’s superior performance and compute efficiency to other baselines.
We'd like to clarify the misunderstanding regarding MoR performance degradation as the recursion number increases; put simply, a higher recursion number means more parameter sharing and thus fewer unique trainable parameters. When applying a recursion number of 2, 3, or 4 to a 360M base model using the Middle-Cycle strategy, the unique parameters are reduced to 167M, 118M, and 98M, respectively. Please refer to the "Param" column in Table 3 for verification.
This particular setup aims to highlight a different kind of trade-off than the one explored in [1]. Specifically, more recursion with fewer parameters can lead to better throughput through our early-exiting mechanism (Figure 4a), while introducing some performance drop. Conversely, [1] aims to show that more recursion with a fixed parameter size leads to better performance, at the cost of lower throughput.
Below, we address the reviewer’s specific questions and concerns (quoted).
[W1] More recursion is detrimental?: It seems that more recursive steps give worse performance. I would expect performance to saturate, but not degrade, with more recursions.
[A1] We would like to clarify that the observed performance degradation when increasing the number of recursions is due to the reduction in the number of unique parameters, not a flawed experimental setup or an inherent issue with the recursion and routing mechanism itself. By increasing the recursion number in the model configuration, MoR proportionally reduces the number of unique parameterized layers, thereby decreasing the overall model capacity.
At the same time, to understand how the number of recursions affects performance while keeping the number of unique parameters fixed, we conducted an experiment as follows.
[Table 2R] Comparison of MoR with different numbers of recursions under the same number of parameters. Vanilla equals MoR with recursion 1. All models are trained on 10B tokens.
| Model | $N_r$ | Param | Train loss | Val loss | LD | HS | PQ | WG | ARC | MMLU | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | 1 | 118M | 2.9182 | 2.9678 | 25.81% | 33.08% | 61.81% | 51.46% | 38.10% | 26.64% | 39.48% |
| MoR | 2 | 118M | 2.8767 | 2.9205 | 27.09% | 33.84% | 62.13% | 53.04% | 36.77% | 26.88% | 39.96% |
| MoR | 3 | 118M | 2.8667 | 2.9111 | 27.38% | 34.60% | 63.17% | 51.46% | 37.15% | 26.83% | 40.10% |
This aligns perfectly with the reviewer's insight and [1], demonstrating that the MoR model can scale performance by increasing recursion steps with fixed model sizes.
[W2] Test-time scaling interpretation: The results suggest that across models, fewer recursion steps are generally better, and additional recursion helps only within a given model.
[A2] Similar to the previous point, as we increase the recursion steps in MoR, the number of unique parameters decreases proportionally. This reduction in model capacity is the main reason for the gradual decline in overall performance as $N_r$ increases—not an inherent limitation of MoR itself.
We also want to emphasize that in Figure 5c, we show how the MoR model scales up performance by allowing more recursion steps within a given model like in [1]. This demonstrates that, within a single MoR model, performance increases as the model forwards more recursion blocks, and saturates as it reaches the maximum recursion step.
[W3, Q2] Shallow semantic analysis: The discussion of how the router aligns recursion depth with semantic importance is somewhat superficial.
[A3] Thank you for raising this point. The allocation of recursion depth reflects the contextual predictability of the subsequent token. If predicting the next token is relatively easy, the model can skip the later recursion steps.
For example, in Figure 1 (right), the latter subword parts within a word (e.g., "-ensively" in "defensively") are usually easier to predict and often require only one recursion (i.e., "def" goes through one recursion). Also, the opening tokens of function words (e.g., `(`, `{`, `.`, `,`) typically require fewer steps, while their corresponding closing parts or the tokens immediately following them may need deeper recursion. Please refer to our detailed response [A3] to reviewer cDNG regarding the intuitions behind recursion depth.
In Figure 5c, we limited the maximum recursion steps within a given MoR model. For example, "Loop 1 in MoR-3 performance" denotes that we forced tokens to be decoded at the first loop, even though the router had allocated them to the second loop. While their performance was quite robust, it was lower than when allowing more recursions based on the router's decision.
[W4] Vanilla still outperforms at iso-parameter: In iso-parameter settings, the vanilla Transformer continues to outperform recurrent variants, including MoR, as seen in Figure 4.
[A4] We'd like to clarify again that these are not iso-parameter settings. None of our experiments comparing Vanilla, Recursive, and MoR models are iso-parameter comparisons: the Recursive and MoR models actually have fewer unique parameters than the vanilla Transformer due to parameter sharing through recursion.
The main point of our work is to demonstrate that MoR can closely match the performance of a larger vanilla model despite using fewer parameters, while achieving faster speed and greater memory efficiency. We will state this distinction more clearly in the revised version.
[W5] Minor: Decoding logic clarity: The description of the decoding process—particularly how tokens that exit early are dynamically substituted, and how masking and caching are handled—is insufficiently detailed.
[A5] We appreciate the reviewer’s feedback and will provide a more thorough explanation in the updated version of our paper.
MoR models can leverage a continuous depth-wise batching inference system [2], thanks to their recurrent architecture and early-exiting mechanisms. This enables batch inference even when tokens are at different depths (recursion steps), because they go through the same set of parameters.
Once queries are enqueued, the router first decides whether to apply another recursion or to exit and proceed to the next token. Queries that proceed to the next token pass through the output layer and token embedding. They are then added back to the batch and efficiently computed together with samples that move on to the next recursion step. After each recursion block, every token obtains a new decision from the router, and this process repeats iteratively.
At this point, since the context length varies with recursion depth, we need to carefully handle padding and masking for the KV caches. In our experiments for Figure 4a, we utilized a static cache, FlashAttention-2 variable-length decoding, and torch.compile. However, there is significant room for further engineering optimization here. Ultimately, early-exiting reduces redundant recursion depths and the number of activated tokens at deeper recursion steps, leading to improved speed and throughput.
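As a rough illustration of this decoding loop, here is a toy, runnable sketch (our own simplification with stand-in modules, not the released inference system):

```python
import torch
import torch.nn as nn

# Toy sketch of continuous depth-wise batching: requests whose current tokens
# sit at different recursion depths share one batched pass through the same block.
hidden, vocab, max_depth = 32, 100, 3
block = nn.Linear(hidden, hidden)     # stand-in for the shared recursion block
router = nn.Linear(hidden, 1)         # stand-in for the single-linear router
lm_head = nn.Linear(hidden, vocab)
embed = nn.Embedding(vocab, hidden)

states = torch.randn(4, hidden)       # 4 in-flight requests at mixed depths
depths = torch.tensor([1, 2, 1, 3])
outputs = [[] for _ in range(4)]

with torch.no_grad():
    for _ in range(6):                # a few global decoding steps
        h = block(states)             # one batched pass of the shared block
        go_deeper = (torch.sigmoid(router(h)).squeeze(-1) > 0.5) & (depths < max_depth)
        for i in range(len(states)):
            if go_deeper[i]:          # router keeps this token for another recursion
                states[i], depths[i] = h[i], depths[i] + 1
            else:                     # early exit: emit the token, start the next position
                tok = int(lm_head(h[i]).argmax())
                outputs[i].append(tok)
                states[i], depths[i] = embed(torch.tensor(tok)), 1
```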
[W6] Minor: Caching mechanism explanation: The prose explaining the caching mechanism could be clearer, especially in detailing the logic behind the claimed memory savings and consistency of KV caches.
[A6] We acknowledge that the explanation of the caching mechanism was too compressed. Please refer to our response to reviewer cDNG [A1] for a more detailed explanation.
To summarize, recursion-wise KV caching selectively caches KV pairs: only tokens routed to a specific recursion step store their key–value entries at that level. This design means that a linearly decreasing fraction of tokens is active at each recursion step (specifically, $(N_r - r + 1)/N_r$, where $r$ denotes the recursion step starting from 1). As a result, this method reduces total KV memory and IO costs to roughly $\frac{N_r+1}{2N_r}$ of the vanilla approach.
In recursive sharing, we cache KV pairs exclusively at the initial recursion step and reuse them across all subsequent recursions. This can further lower memory usage (to $1/N_r$ of vanilla), though it still incurs the same IO as vanilla at each recursion.
Both strategies also reduce attention FLOPs relative to vanilla ($O(N^2)$ per layer), either by operating attention on fewer active tokens (recursion-wise; $O(k_r^2)$ per layer) or by shortening the query length while keeping the full key length (recursive sharing; $O(k_r \cdot N)$ per layer), where $k_r$ is the number of active tokens at recursion step $r$.
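For concreteness, the arithmetic behind these factors (our own restatement, normalizing the per-step cost of a full sequence to 1 and writing $r$ for the recursion step):

$$
\underbrace{\sum_{r=1}^{N_r}\frac{N_r-r+1}{N_r}}_{\text{recursion-wise}} \;=\; \frac{N_r+1}{2}
\qquad\text{vs.}\qquad
\underbrace{N_r}_{\text{vanilla}},
$$

so recursion-wise caching costs a fraction $\frac{N_r+1}{2N_r}$ of the vanilla KV memory and IO (about two-thirds for $N_r=3$, approaching one half for large $N_r$), while recursive sharing stores only $1/N_r$ of the memory but still reads the cache at every step.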
[Q1] Effective FLOPs-parameter trade-off: For fixed FLOPs, MoR with $N_r = 2$ strikes an optimal balance, while deeper recursion degrades performance as the parameter count is reduced disproportionately.
[A7] We believe this point has already been addressed in our previous responses. To directly answer the additional aspect of your question: for a fixed FLOPs budget, as recursion increases, the effective capacity per step is reduced due to parameter sharing (as indicated by the "Param" column in Table 3). Thus, the observed degradation is expected, since the performance change is proportional to this parameter allocation. At the same time, we want to emphasize that MoR with more recursions offers significant efficiency benefits: smaller parameter size, lower KV memory footprint, and higher throughput.
References
[1] Geiping et al. “Scaling up test-time compute with latent reasoning: A recurrent depth approach”. arXiv preprint 2025.
[2] Bae et al. “Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA”. ICLR 2025.
I thank the authors for their response. Indeed it was not entirely clear to me that the experiments in Figure 4 were not iso-parameters. I also appreciate the extra scaling experiment that verifies that increasing the number of recursions improves performances. I keep my positive score.
Thank you for your feedback. We would like to clarify that Figure 4 does not represent an "iso-parameter" setting. This figure illustrates the throughput of the specific models detailed in our main experiment, Table 3.
To be precise:
-
The "Vanilla" point corresponds to the first row of Table 3, which has 315M non-embedding parameters and an NLL (negative log likelihood) of 2.7824.
-
The "MoR-2, MoR-3, and MoR-4" points correspond to the first three models in the gray-highlighted MoR section. As listed, their non-embedding parameter counts are 167M, 118M, and 98M, with corresponding NLLs of 2.7511, 2.7925, and 2.8204.
We believe your confusion may stem from two different experimental designs for Recursive Transformers:
-
Constant unique parameters [1]: Like in Table 2R, applying more recursions increases the effective depth of models that share the same number of unique parameters.
-
Constant effective depth [2]: Like in Table 3 and Figure 4, the total effective depth is held constant across all models. Consequently, applying more recursions decreases the number of unique parameters.
We follow prior work ("Relaxed Recursive Transformers" [2]), which is designed to show how increasing recursions can improve parameter efficiency and throughput. As an example, assume a 12-layer base model: we construct MoR models with an effective depth of 12. An MoR-2 model built from this base would therefore have 6 unique layers, an MoR-3 model 4 unique layers, and an MoR-4 model 3. This is exactly why models with more recursions end up with fewer parameters, and thereby lower performance but higher throughput.
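As a quick sanity check of these counts (our own arithmetic for the constant-effective-depth setup described above):

```python
# Our own arithmetic for the constant-effective-depth setup described above.
effective_depth = 12
for n_r in (2, 3, 4):
    unique_layers = effective_depth // n_r   # size of the shared recursion block
    print(f"MoR-{n_r}: {unique_layers} unique layers x {n_r} recursions "
          f"= {unique_layers * n_r} effective layers")
# MoR-2: 6 unique, MoR-3: 4 unique, MoR-4: 3 unique -- matching the counts above.
```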
We hope this explanation clarifies our setup and resolves the misunderstanding regarding the non-iso-parameter setting of Figure 4 (which just shows throughput of models in Table 3).
[1] Geiping et al. “Scaling up test-time compute with latent reasoning: A recurrent depth approach”. arXiv preprint 2025.
[2] Bae et al. “Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA”. ICLR 2025.
In this work, the authors present Mixture-of-Recursions (MoR), a unified framework for improving both the training and inference efficiency of Transformer models. MoR combines two key ideas: parameter sharing through recursive reuse of a shared stack of layers, and adaptive computation over the tokens (i.e., not all tokens proceed to full depth or receive full compute). This allows the model to allocate computation proportionally to importance, reducing redundant computation and significantly lowering overall cost compared to conventional baselines. MoR is shown to dominate the Pareto frontier in iso-FLOP settings, consistently achieving lower validation perplexity with better efficiency. Additionally, this conditional token computation delivers strong inference speedup across several model sizes. These results highlight MoR as a compelling direction for scaling model quality without incurring full-scale computational costs.
Strengths and Weaknesses
Strengths:
-
Clarity and Presentation: This paper is well written, with diagrams and figures that clearly convey the core ideas. In particular, Figure 1 effectively illustrates how computation is allocated per output token during the decoding stage, providing intuitive insight into the mechanism of Mixture-of-Recursions (MoR).
-
KV Caching Design: The approach to key-value caching is both novel and well-motivated. The authors show thoughtful design by selectively storing KV entries only for tokens that pass through routed blocks, minimizing unnecessary memory use. Meanwhile, the layer-sharing component ensures full-context awareness during generation, demonstrating a strong understanding of the importance of maintaining contextual integrity in autoregressive decoding.
-
Strong Empirical Results: The experimental results are compelling. MoR consistently dominates the compute-performance Pareto frontier across model scales and benchmarks. The ablations on cache strategy, throughput, and router configuration provide valuable insights into the contribution of each component and the rationale behind the final design choices. The detailed router analysis further supports the effectiveness of the method.
Weaknesses: While the paper is overall well-executed and well-motivated, I believe the methodology could benefit from deeper discussion in relation to prior work on conditional depth computation. In particular, "Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models" (arXiv 2024) is notably relevant. That paper also explores token-level conditional routing using depth-based computation and introduces a similar routing heatmap to Figure 1 in this work.
Given these similarities, it would be valuable for the authors to clarify how MoR goes beyond a naive combination of Mixture-of-Experts (MoE) and Mixture-of-Depths (MoD). Without this, one might interpret MoR as simply integrating known ideas (layer skipping + routing) under a recursive framing. An in-depth comparison or positioning in the related work would help distinguish the unique contributions of this paper.
Questions
The main question I have is whether the authors could explicitly contrast MoR with a conceptual combination of MoE and MoD. Since the layer router could be viewed as a generalization of expert selection, and the token-router can be viewed similarly as depth allocation, it would be useful to understand what architectural or algorithmic elements of MoR are fundamentally new, beyond being a combination of existing approaches (refer to weaknesses).
Limitations
yes
Final Justification
I have reviewed the rebuttal and maintain my score. The paper is technically solid, and the authors' agreement to revise the text with the distinctions satisfies my previous concerns.
Formatting Concerns
No formatting concerns.
We appreciate the reviewer's thorough review and positive feedback. We're glad that the reviewer found our core idea clear and well-motivated, and our experimental results strong. Below, we address the reviewer’s specific questions and concerns (quoted).
[W1] While the paper is overall well-executed and well-motivated, I believe the methodology could benefit from deeper discussion in relation to prior work on conditional depth computation, such as MoD and MoE. An in-depth comparison or positioning in the related work would help distinguish the unique contributions of this paper.
[A1] Thank you for suggesting that we clarify the distinctive contributions of MoR compared to the relevant prior works MoD and MoE. We will add this discussion to the revised manuscript.
[Comparison with MoD]
While both MoR and MoD explore token-level routing over network depth, MoR extends this into a unified framework by incorporating parameter sharing, early-exiting, and tailored KV caching mechanisms:
-
Parameter sharing: MoR uses recursive blocks as the routing primitive, enabling substantial parameter efficiency through weight sharing. This contrasts with MoD, which does not leverage parameter sharing and thus cannot achieve the same reductions in model size without sacrificing performance.
-
Early-exiting and throughput gains: In MoR, our hierarchical filtering scheme does not allow for “deactivated” tokens to be reactivated in subsequent blocks. Once a token is "early-exited" by the router at a certain recursion depth, its computation is considered complete, and it does not participate in further recursive blocks. Thus, early-exiting and parameter-sharing structure enables continuous depth-wise batching during inference, which significantly boosts throughput by reducing idle time and maximizing resource utilization. In contrast, MoD's throughput gains are potentially limited. While MoD also reduces per-block computation based on the number of activated tokens, its mechanism does allow for deactivated tokens to be reactivated later. This reactivation mechanism prevents the same level of idle time reduction achievable with MoR's early-exiting mechanism and depth-wise batching.
-
KV caching mechanism: Both MoR and MoD improve memory and throughput efficiency by using attention mechanisms only on activated tokens. We further propose a KV sharing method to optimize memory even more by leveraging the recursive structure of our MoR blocks and reusing the KV cache from the first recursion step.
[Comparison with MoE]
MoE and MoR share a high-level similarity in their pursuit of token-level conditional computation via a routing mechanism. This enables us to explore various techniques developed for MoE, such as expert-choice and token-choice strategies and efforts to mitigate their inherent problems.
However, a fundamental difference lies in their routing objectives: MoE's router directs tokens to different FFN-based experts, allowing each token to find its optimal parameter path, whereas MoR's router learns to decide whether to apply further recursion to a token.
Importantly, MoR and MoE are orthogonal and can be easily integrated. It is possible to replace the FFN within an MoR recursion block with a conventional MoE layer. This combined approach would offer two axes of adaptive computation for activated tokens: not only an adaptive path based on recursion depth (MoR) but also selection among FFN experts (MoE), potentially leading to even greater model capacity and specialization.
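As a purely conceptual sketch of this combination (our own illustration, not part of the paper), the FFN inside the shared recursion block could itself be a small MoE layer:

```python
import torch
import torch.nn as nn

# Conceptual sketch (our own, not from the paper): an MoE feed-forward layer
# that could replace the dense FFN inside a shared MoR recursion block,
# combining depth-wise (MoR) and expert-wise (MoE) adaptive computation.
class MoEFFN(nn.Module):
    def __init__(self, d_model: int, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: [tokens, d_model]
        weights = self.gate(x).softmax(dim=-1)              # routing weights over FFN experts
        top = weights.argmax(dim=-1)                         # top-1 expert per token (simplified)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top == e
            if mask.any():
                out[mask] = expert(x[mask]) * weights[mask, e:e + 1]
        return out

moe_ffn = MoEFFN(d_model=64)
y = moe_ffn(torch.randn(10, 64))   # tokens already selected by the recursion router
```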
This paper proposes a new method named Mixture-of-Recursions (MoR), which considers both parameter and computation efficiency. It combines an (expert-choice or token-choice) router with recursive layers, along with KV cache sharing strategies, to improve model performance, and can achieve similar or even better performance than a vanilla transformer with one third of the parameters when training from scratch.
Strengths and Weaknesses
Strengths:
- The ideas are clear and effective, and the writing is excellent and very easy to follow
- The experimental results are comprehensive and impressive. Utilizing recursive layers saves parameters but usually causes a small drop in performance, whereas the token router here lets the model learn better on important tokens and improves performance. This paper considers different model scales, compute budgets, and combinations of strategies, and shows a consistent advantage over vanilla recursive layers.
- Many ablation studies show how to combine different routers, architectures, etc.
Weakness:
- Giving the calculation process for Table 2 (right) in the appendix and referencing it in Table 2 would be helpful for readers, because the values depend on a specific capacity factor sequence ($N_r/N_r, \ldots, 1/N_r$).
- It seems that as the compute budget increases, the performance gain from MoR decreases. And on small models (like 135M), it performs worse than vanilla models.
Questions
Could you give some intuitions about the example in Fig. 1 (right, bottom): why are words like "---", "and", and "those" passed twice, while words like "comfortable" and "views" only once?
Limitations
The authors mention their limitations in Appendix A.
Final Justification
The experimental results are very solid and the architecture is interesting; I think it can definitely be accepted.
Formatting Concerns
NA
We thank the reviewer for their careful review and thoughtful comments. We appreciate that the reviewer found our idea to be effective and clear, and our experimental results to be comprehensive. We provide responses to the reviewer’s specific questions and concerns (quoted) below.
[W1] Giving the calculation process for Table 2 (right) in the appendix and referencing it in Table 2 would be helpful for readers, because the values depend on a specific capacity factor sequence ($N_r/N_r, \ldots, 1/N_r$).
[A1] Thank you for your valuable suggestion. We will include a detailed explanation of the cost efficiency numbers in the appendix of the revised version and refer to it from the table caption. Below is a summary of the calculation process for Table 2 (right). Please note, these values are calculated based on the Cycle sharing strategy. The Middle-Cycle strategy differs slightly, as it includes single layers at both the beginning and the end, which may lead to some variations.
[KV Cache Memory and IO Comparison]
We consider the cost of processing a full sequence at each recursion step as 1 unit. Table 2 (right) presents the calculated values across the entire model for each strategy, normalized against the Vanilla strategy.
-
Vanilla: Every token is activated over all $N_r$ recursion steps (i.e., all layers). This means it requires $N_r$ times the per-step KV memory and corresponding IO (loading time) to fetch the data.
-
Recursion-wise KV Caching: The number of tokens selected at each recursion step decreases linearly. Specifically, a fraction $(N_r - r + 1)/N_r$ of the tokens is activated at step $r$ (i.e., $N_r/N_r$ at the first step, then $(N_r-1)/N_r$, and so on, down to $1/N_r$). Summing these fractions across all steps gives $\sum_{r=1}^{N_r} \frac{N_r - r + 1}{N_r} = \frac{N_r + 1}{2}$. This calculation shows that both the KV memory and IO requirements for this strategy are $\frac{N_r+1}{2}$ units in total, i.e., $\frac{N_r+1}{2N_r}$ of Vanilla.
-
Recursive Sharing: In this approach, the KV cache generated for the full sequence during the very first recursion step is continuously reused. Consequently, the memory cost is just 1 unit ($1/N_r$ of Vanilla). However, fetching operations are still required at all $N_r$ steps, making the IO cost $N_r$ units (the same as Vanilla).
[Attention FLOPs Comparison]
For the Attention FLOPs per layer, we report values normalized by the Vanilla case.
-
Vanilla: The attention mechanism exhibits quadratic complexity with respect to the sequence length ($O(N^2)$ per layer).
-
Recursion-wise KV Caching: In this strategy, we activate only the top-$k_r$ tokens at each recursion step $r$, where $k_r = N \cdot (N_r - r + 1)/N_r$. Consequently, the attention mechanism has a complexity of $O(k_r^2)$ per layer.
-
Recursive Sharing: When using sharing, the query length is reduced to the top-$k_r$ tokens, but the key states still cover the full sequence length $N$. This results in a complexity of $O(k_r \cdot N)$ per layer.
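As a quick numeric check of these factors (our own arithmetic, assuming the Cycle strategy and the linearly decreasing capacity schedule above, with $N_r = 3$):

```python
# Our own numeric check for N_r = 3 (Cycle strategy, linearly decreasing capacity).
N_r = 3
f = [(N_r - r + 1) / N_r for r in range(1, N_r + 1)]   # active fraction per step: [1, 2/3, 1/3]

# Whole-model KV cache, normalized against Vanilla's N_r units:
kv_mem_recursion_wise = sum(f) / N_r      # (N_r + 1) / (2 * N_r) = 2/3
kv_io_recursion_wise = sum(f) / N_r       # same as memory        = 2/3
kv_mem_recursive_sharing = 1 / N_r        # only the first step is cached = 1/3
kv_io_recursive_sharing = 1.0             # still fetched at every step

# Per-layer attention FLOPs at each recursion step, as a fraction of Vanilla's N^2:
attn_recursion_wise = [fr ** 2 for fr in f]   # k_r^2 / N^2   -> [1.00, 0.44, 0.11]
attn_recursive_sharing = list(f)              # k_r * N / N^2 -> [1.00, 0.67, 0.33]

print(kv_mem_recursion_wise, kv_mem_recursive_sharing)
print(attn_recursion_wise, attn_recursive_sharing)
```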
[W2] It seems that as increasing compute budget, the performance gain from MoR is decreasing. And on small models (like 135M), it performs worse than vanilla models.
[A2] In Figure 3, Recursive Transformer and MoR models are constructed by applying recursion to a base Vanilla model. Therefore, the trainable parameters for Recursive Transformer and MoR (with a recursion step of 3) are approximately three times fewer than those of Vanilla models. Consequently, a direct comparison based solely on FLOPs, as shown in Figure 3, can be misleading, as it doesn't account for the differing parameter counts that impact total training steps and convergence speed.
Therefore, to accurately assess the models' performance in relation to their actual parameter size and computational budget, we refer to Figure 5a, which provides a more straightforward comparison by also considering the number of trainable parameters. Figure 5a visualizes the scaling law results from Figure 3 by fitting them to axes of model size and validation loss. Recursive (blue) and MoR (red) are positioned to the left of the Vanilla curve (green) because they use a recursion step of 3. Figure 5a shows several key insights:
-
As the compute budget increases, the MoR model demonstrates strong scalability. The gap between the optimal points of MoR (indicated by stars) does not shrink with increasing FLOPs, confirming that MoR is a highly scalable architecture.
-
MoR consistently outperforms the Recursive baseline. While recursive baselines with fixed recursion depth might appear to converge better at larger compute budgets, the similar gaps between the optimal points of MoR and Recursive at higher budgets suggest that MoR's performance advantage will likely persist. It's crucial to note that vanilla recursive models require an additional training phase for early-exiting [1], which can be sensitive to hyperparameters and introduce performance drops.
-
MoR demonstrates comparable performance to Vanilla Transformers given the same training budget. Across three different budgets, the scaling curve just shifts to the left due to the reduction in model size with recursion. We further validated the performance for the 1.7B model with a 6.85e19 FLOPs budget in Table 1R, which also showed comparable performance.
[Table 1R] Comparison of MoR, Recursive, and Vanilla 1.7B models under a 6.85e19 FLOPs budget. We used the expert-choice router and recursion-wise KV caching for MoR, and $N_r = 2$ for the MoR and Recursive models.
| Model | Share | $N_r$ | Param | FLOPs | Tokens | LD | HS | PQ | WG | ARC | MMLU | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Vanilla | - | - | 1.61B | 6.85e19 | 20B | 40.7% | 49.2% | 71.3% | 53.2% | 30.8% | 47.3% | 48.8% |
| Recursive | M-Cyc | 2 | 0.87B | 6.85e19 | 20B | 37.3% | 46.5% | 68.9% | 52.6% | 29.6% | 44.2% | 46.5% |
| MoR | M-Cyc | 2 | 0.87B | 6.85e19 | 26B | 40.2% | 48.0% | 69.6% | 52.8% | 30.6% | 47.0% | 48.0% |
Looking at the optimal scaling policy in Figure 5a, MoR performance can be lower when the shared block size is very small (using an adequately larger model is better at the same FLOPs). Notably, a 135M-based MoR model has roughly 45M unique parameters—nearly three times fewer than vanilla models. However, we would like to emphasize this is simply a performance-efficiency trade-off; MoR achieves much faster training speeds (wall-clock time) and nearly double the throughput gain. To ensure a fairer and more accurate comparison, we should consider “speed” in addition to performance.
[Q1] Could you give some intuitions about the example in Fig. 1 (right, bottom): why are words like "---", "and", and "those" passed twice, while words like "comfortable" and "views" only once?
[A3] Thanks for highlighting these valuable points. In Figure 1 (right) and Figure 10 (in the Appendix), we show the recursion depth performed for each subword token. To clarify, this depth indicates how easily the next token can be predicted within its given context. Essentially, the allocation of recursion depth reflects the contextual predictability of the subsequent token.
Through several examples, we made the following observations:
-
The second (or subsequent) subword part of a word is often straightforward to predict, thus requiring fewer steps. For instance, in a word like "defensively," only a single recursion is performed to predict "-ensively," (i.e., “def” is highlighted in purple) which aligns with our intuition.
-
In many cases, the initial generation (opening) of function words like `---`, `(`, `{`, `.`, `,` appears easy for models to predict. However, predicting their closing parts or the first token immediately following their opening might be more challenging, often requiring 2 (blue) or 3 (red) recursions.
-
Generally, more content-rich or less common words are typically assigned deeper recursion, as expected.
We will make sure to clarify these points in the revised version of the paper. For full transparency, we avoided cherry-picking samples; the examples discussed were simply the first 10 samples from our validation set.
References
[1] Bae et al. “Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA”. ICLR 2025.
Thanks for the detailed response! It addresses most of my concerns. I would like to keep my positive score since I think the experimental results are solid. BTW, I'm still a little confused about the observation in A3. Don't "({.," words appear more common, and maybe require more recursions due to multiple prediction possibilities? That seems a little conflicting with point 3, in that less common words would instead need shallower recursion?
Thank you for your thoughtful feedback and positive score. Regarding Observation 3, our core idea is that the number of recursions a model needs to predict a token also depends on how common that token is. We believe that high-frequency tokens, like the function words and punctuation you mentioned ("({.,"), may be easy for the model to predict and thus require fewer recursions. In contrast, less common or content-rich tokens are harder to predict (since the model has not observed them as often), and we expect them to require more recursions.
Dear Reviewers,
As we are approaching the final day of the review process, we kindly ask if you had a chance to review our responses. We have done our best to address your concerns, and would appreciate it if you could reconsider your score if our contributions merit it.
Please let us know if you have any remaining concerns.
Thank you.
The manuscript combines effective transformer models in a clever way that improves the computational efficiency of the model. This "Mixture of Recursions (MoR)" model is a transformer-based architecture that shares parameter weights across layers and uses a heuristic to dynamically determine the depth for different aspects of the network. Both early exit and parameter sharing are proven successful methods, and it isn't a surprise that combining them in an intelligent way is also effective.
The reviewers are unanimous in their support of the manuscript with numerous strengths listed. The summary from the author's final rebuttal lists the strengths as: Novel and significant MoR framework combining parameter sharing with dynamic computation (@cDNG, @MjJH, @kjG6). Strong empirical results outperforming baselines and dominating the compute-performance frontier (@cDNG, @MjJH, @eKta, @kjG6). Remarkable compute and throughput efficiency for practical high-speed inference (@MjJH, @eKta). Extensive ablation studies providing valuable insights into design choices (@cDNG, @MjJH, @kjG6).
These are accurately listed and reflect well the reviewers remarks.
The main weakness of the manuscript is that there isn't much surprising here, nor is it clear that other researchers developing ML models would gain significant insight from the manuscript. That said, it can be expected that practitioners would appreciate and benefit from the results.