PaperHub
7.0 / 10
Poster · 3 reviewers
Min 3 · Max 4 · Std 0.5
Ratings: 4, 4, 3
ICML 2025

xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference

OpenReview · PDF
Submitted: 2025-01-22 · Updated: 2025-07-24
TL;DR

We introduce xLSTM 7B, a Large Language Model based on the xLSTM architecture with targeted optimizations for fast and efficient inference.

Abstract

Keywords

xLSTM, LLM, inference, inference time, inference speed, Transformer

Reviews and Discussion

Official Review
Rating: 4

This work builds on the work by Beck et al. on language modeling with xLSTMs. It introduces several architecture modifications and tricks that support efficiency and training stability, and it introduces an (open sourced) pre-trained 7B parameter language model based on mLSTM. The paper contains a detailed description and evaluation of these modifications with ablations.

Questions for Authors

See previous section.

Claims and Evidence

Yes, claims are supported with detailed empirical evidence (mostly in the form of ablation studies).

Methods and Evaluation Criteria

Yes, models are evaluated on standard language modeling and long context benchmark tasks.

Theoretical Claims

N/A

Experimental Design and Analysis

Experimental results are thorough and detailed.

Supplementary Material

Yes

Relation to Prior Work

This is an empirical study, and prior work is discussed (as it should be) primarily through quantitative comparisons to existing models.

Missing Important References

N/A

Other Strengths and Weaknesses

The paper introduces a strong new open-source pre-trained model, which is a valuable contribution.

As stated in the paper, the performance of the model ranks somewhere in the mid-range among other similarly sized models. The paper also states that with a better training set mix (more data, better curation, more emphasis on math and code in the early training phase), performance could match stronger models. This is a bit of a strange claim to make, given that the main aim of this paper is to provide an empirical study and improvement of the existing architecture (from Beck et al.). Why not attempt and report on those suggested improvements?

Other Comments or Suggestions

There should be citations for RMSNorm and LayerNorm the first time they occur (line 187 or before).

Figure 5: to clarify: these results all include the time it takes to consume the prefill tokens?

How sensitive is model performance with respect to the choice of soft-capping parameters? Also, why were the values set to 15 and 30 for gates and logits, respectively?

It is interesting that learnable input gates do not matter much, except for long-context evaluations. Any insights into why that is?

Pre- and post-up projection should be described in the paper to make it more self-contained. Currently, the paper relies on definitions in Beck et al. and Gu & Dao.

What about non-linear models? RNNs hold the promise to beat transformers in tasks that rely heavily on state-tracking (as opposed to natural language modeling, where this does not seem to be the case as much), but they require non-linearity (or at least special considerations regarding the eigenvalues of the transition matrix). Have the authors considered exploring sLSTM blocks over just mLSTM in this study? Related to this point, as well as the authors' point on the dependence of performance on training set amount and mix: I am wondering how much the introduction of non-linearities at the cost of parallelism during training would really reduce efficiency, and how much this would really matter with respect to downstream metrics - in particular, in tasks that are heavy on state tracking requirements (such as code and math)?

Author Response

We thank the reviewer for their helpful feedback. We appreciate that the reviewer finds that our claims are supported by evidence, that our experimental results are thorough and detailed, and that our strong model is a valuable contribution to the open-source community.

Attempt on Suggested Improvements

Starting from the models in the xLSTM paper, this study already improves the pre-training dataset by switching from SlimPajama, used in the xLSTM paper, to the DCLM dataset. At the time of training xLSTM 7B, this was, to the best of our knowledge, probably the best open-source pre-training dataset.

Other competitor models were trained on custom internal pre-training datasets (Codestral-Mamba, Falcon-Mamba), which makes re-training on comparable data infeasible.

While the main focus of this study is architecture optimizations, improving the pre-training data is an exciting future opportunity to train even better xLSTM models, and we will continue to work towards this goal.

Figure 5: Time to first token includes the prefill time

Yes, for the time-to-first-token metric in Figure 5 the prefill time is included. Figure 5 (left) measures the time to the first token (latency), i.e. the time until the model answers with the first token. Figure 5 (right) measures the time until the model has answered with 100 tokens, which corresponds to a mix of prefill and generation time.
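To make the metric concrete, here is a minimal, hypothetical timing sketch (not the benchmark code used in the paper); `model` and `prefill_ids` are assumed to be a Huggingface-style causal LM and a prompt tensor on GPU, and generation is timed end to end so the prefill pass is included.

    import time
    import torch

    @torch.no_grad()
    def time_to_n_tokens(model, prefill_ids, n_tokens):
        """Wall-clock time to produce n_tokens, including the prefill pass."""
        torch.cuda.synchronize()
        start = time.perf_counter()
        model.generate(prefill_ids, max_new_tokens=n_tokens, do_sample=False)
        torch.cuda.synchronize()
        return time.perf_counter() - start

    # ttft = time_to_n_tokens(model, prefill_ids, n_tokens=1)    # Figure 5 (left): time to first token
    # t100 = time_to_n_tokens(model, prefill_ids, n_tokens=100)  # Figure 5 (right): time to 100 tokens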

Soft-capping parameters

This is an interesting question that we have not yet investigated in our work. The logit soft-capping values were taken from the Gemma 2 technical report. For the gate soft-capping, the value is intended to be well outside the interesting initialization range (see the attached additional paper).
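For reference, soft-capping as described in the Gemma 2 report squashes pre-activations smoothly into the range (-cap, cap) with a scaled tanh. A minimal sketch (not the authors' kernel code), with the caps used in the paper shown in the comments:

    import torch

    def soft_cap(x: torch.Tensor, cap: float) -> torch.Tensor:
        # Squash values smoothly into the open interval (-cap, cap).
        return cap * torch.tanh(x / cap)

    # Caps used in the paper: 15 for gate pre-activations, 30 for output logits.
    # gate_preact = soft_cap(gate_preact, cap=15.0)
    # logits      = soft_cap(logits, cap=30.0)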

Input gate & long context evaluations

Many of the long-context tasks in RULER contain only small parts of highly relevant text among a lot of unnecessary "filler" text. The exponential input gate can increase the magnitude of the important parts within the linear matrix memory, and this seems to help improve task performance. Still, it might be interesting to look at some examples qualitatively to support this mechanistic explanation.
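As a toy numerical illustration of this effect (not from the paper): because the input gate is exponential, a modest difference in gate pre-activations translates into a large difference in how strongly a token is written into the matrix memory.

    import math

    # Hypothetical pre-activations for a relevant "needle" token and a filler token.
    relevant_preact, filler_preact = 2.0, 0.0
    ratio = math.exp(relevant_preact) / math.exp(filler_preact)
    print(ratio)  # ~7.4: the relevant token is written ~7.4x more strongly into memory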

Non-linear RNNs

While non-linear models have the theoretical advantage of state-tracking capabilities, in our preliminary experiments we did not see any benefit from including them for human language modeling tasks. Our xLSTM 7B architecture tries to maximize speed in both training and inference while maintaining high language modeling performance, which is why sLSTM was not included here. In recent work, researchers have shown how to maximize speed for sLSTM and other non-linear RNN variants, but the training speeds reported there are still far behind what can be achieved with the mLSTM (as a linear RNN) and Transformers [1].

We agree with the reviewer that it is an interesting question whether state-tracking-capable architectures such as non-linear RNNs show benefits on math and code downstream tasks; this should be investigated further in future work.

We thank the reviewer again for their valuable comments that help to improve our paper.

[1] Pöppel, Korbinian, Maximilian Beck, and Sepp Hochreiter. "FlashRNN: I/O-Aware Optimization of Traditional RNNs on modern hardware." The Thirteenth International Conference on Learning Representations. 2025.

Official Review
Rating: 4

In this paper, the authors introduce a new 7B LLM xLSTM 7B. The model is built upon optimized xLSTM architecture to achieve better training stability and efficiency. Extensive experiments show that xLSTM 7B is memory and computation efficient compared to attention-based models and Mamba-based models, and achieves comparable performance to recurrent models of a similar size.

Update after rebuttal

I thank the authors for the effort they made in the rebuttal regarding the comparison with optimized Transformer inference. I keep my positive score.

Questions for Authors

In the paper, you run the speed experiments with Huggingface transformers, so I wonder what the inference speed of xLSTM would look like compared to attention-based LLMs served with vLLM or other inference engines that provide optimized inference speed.

Claims and Evidence

  • The authors claim the xLSTM is efficient in inference GPU memory consumption and computation.
    • This claim is supported by experiments in Figure 4, 5, 6.
  • The authors claim XLSTM 7B is comparable to existing LLMs.
    • This is evidenced by results in Table 1 and Table 6, showing close or superior performance to SOTA recurrent LLMs while surpassing some attention-based LLMs.
  • The optimized xLSTM architecture is claimed to improve training stability.
    • This is evidenced in Appendix C.2. For example, with the softcap, the training gradient norm is smaller.

Methods and Evaluation Criteria

The methods and evaluation criteria make sense. For both performance and efficiency, the adopted metrics are widely used, and the compared baseline models are also common ones.

Theoretical Claims

The paper does not provide theoretical claims.

Experimental Design and Analysis

I checked the validity of all experimental designs, and they make sense to me.

  • The benchmarks in Table 1 and the models chosen are widely used ones, which leads to a valid comparison.
  • The speed benchmarks consider long context token generation speed, memory consumption, and time to first token. All these metrics are reasonable for evaluation.

Supplementary Material

I checked all supplementary material.

Relation to Prior Work

The work is related to all LLM architectures, especially recurrent LLMs.

Missing Important References

I am familiar with general LLM architectures. For example,

[1] Grattafiori, Aaron, et al. "The llama 3 herd of models." arXiv preprint arXiv:2407.21783 (2024).
[2] Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv preprint arXiv:2312.00752 (2023).

Other Strengths and Weaknesses

Strength:

  • The paper is well-written and easy to follow.

Weakness:

  • Figures 1 and 7 are confusing. While I can guess the meaning of most blocks without their names, it would be better to add names.

Other Comments or Suggestions

None.

Author Response

We thank the reviewer for their helpful feedback. We appreciate that the reviewer finds our paper well-written and easy to follow, that our claims are supported by evidence, and that our experimental design is valid and reasonable.

Architecture Illustrations

We agree with the reviewer that our illustrations of the architecture could be improved by adding names to the respective blocks. This is a good suggestion, and we will update our figures in the final version of the paper.

Metrics with Optimized Inference Engines (like vLLM)

We expect that, in such engines, the speed of xLSTM would only get faster than the numbers currently reported in the paper, which are based on Huggingface models optimized with torch.compile and CUDA graphs.

We acknowledge that our current benchmark setup does not use any production-grade inference serving framework (e.g., vLLM), and we are open about this in the paper. We did our best to optimize the pure PyTorch Huggingface implementations as much as possible.

For a fair comparison, we compare baselines from Huggingface with the same optimizations as for xLSTM (torch.compile and CUDA graphs).
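As a rough sketch of the kind of optimization referred to here (this is not the authors' exact benchmark harness; `model` and `warmup_ids` are placeholders), torch.compile in "reduce-overhead" mode uses CUDA graphs to remove per-step kernel-launch overhead during decoding:

    import torch

    model = model.to("cuda").eval()
    # "reduce-overhead" mode enables CUDA graphs for low-latency repeated calls.
    model.forward = torch.compile(model.forward, mode="reduce-overhead")

    with torch.no_grad():
        _ = model(input_ids=warmup_ids)  # warm-up call triggers compilation and graph capture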

However, we agree with the reviewer that it would also be interesting to measure the speed in these optimized inference environments. Currently, xLSTM is not yet integrated into these inference frameworks, but we intend to do so in the future.

Since some of our baselines are already integrated into vLLM, we added additional benchmark results for FalconMamba, CodestralMamba, Llama2 and Llama3 served with vLLM.

We observe a speed-up for Llama 3 at longer context lengths, as well as small speed-ups for the Mamba-based models compared to our optimized HuggingFace versions. However, xLSTM 7B is still the fastest model (even when compared to baselines from optimized frameworks like vLLM), both in generation and prefill, with a larger margin at longer contexts due to the quadratic scaling of Transformers. See the results at: https://i.postimg.cc/1XcMCyQV/Rebuttal-Plots.jpg

We thank the reviewer again for their valuable comments that help to improve our paper.

Official Review
Rating: 3

This paper mainly scales xLSTM to 7B parameters using several optimization techniques.

Questions for Authors

  1. The paper lacks a discussion of the core design, the mLSTM cell; is there any deep insight or theoretical analysis?

  2. It would be better to open-source the project.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Not many theoretical claims.

Experimental Design and Analysis

Yes.

Supplementary Material

Yes, the experimental parts.

Relation to Prior Work

Other LLM models.

Missing Important References

Yes.

Other Strengths and Weaknesses

Strengths:

  1. The contribution of this paper is clear: scaling the xLSTM model to 7B parameters using optimization techniques.

  2. The experimental results show the trade-off: it seems that its performance is not as good as Mamba's, but its latency is good.

  3. This framework may be interesting for the LLM community.

Weaknesses:

  1. The paper lacks a discussion of the core design, the mLSTM cell; there is no deep insight or theoretical analysis.

  2. It would be better to open-source the project.

Other Comments or Suggestions

No.

Author Response

We thank the reviewer for highlighting our optimizations as a clear contribution, and for seeing strengths in the trade-offs we make, our good latency, and the interesting xLSTM framework.

Discussion on the mLSTM cell

We agree with the reviewer that there is only a brief discussion of the mLSTM cell in Section 2 of our paper. The reason is that our goal in Section 2 is only to review the fundamentals of the mLSTM. For a more in-depth discussion, we would like to point the reviewer to the original xLSTM paper (https://arxiv.org/abs/2405.04517), where the mLSTM was introduced.

The main motivation for using the mLSTM in xLSTM 7B is its high efficiency due to its full parallelizability. Moreover, in xLSTM 7B we still rely on two main features of the mLSTM: the improved gating with exponential gates and the enhanced memory capacity of its matrix memory, both of which we found to be beneficial in language modeling.
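For intuition, here is a simplified, unstabilized single-step sketch of the mLSTM recurrence from the xLSTM paper (the production kernels stabilize the exponential gates in log space and process the sequence in parallel; this step-wise form is illustrative only, and the output gate is omitted):

    import torch

    def mlstm_step(C, n, q, k, v, i_preact, f_preact):
        """One mLSTM step: C is the (d x d) matrix memory, n the normalizer state."""
        i_t = torch.exp(i_preact)                 # exponential input gate
        f_t = torch.sigmoid(f_preact)             # forget gate (sigmoid variant)
        C = f_t * C + i_t * torch.outer(v, k)     # write value/key outer product into the matrix memory
        n = f_t * n + i_t * k                     # update normalizer state
        h = (C @ q) / torch.clamp(torch.abs(n @ q), min=1.0)  # normalized memory readout
        return C, n, h

The exponential input gate and the matrix memory C are the two features referred to above.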

Open-Sourcing and Code Release

Unfortunately, we did not upload the code as part of the submission, but we can assure you that it will be open sourced.

We thank the reviewer again for the helpful feedback and hope they find their main concerns addressed.

Final Decision

This paper improves the xLSTM architecture and scales it to the 7B parameter model size. It proposes several ways to increase the efficiency and stability of xLSTM, especially at scale. The resulting model performs well compared to related recurrent models as well as Transformers, with strong efficiency at inference time. All reviewers are positive about this work.