PaperHub
Overall score: 8.0 / 10 (Spotlight; 3 reviewers; min 7, max 10, std 1.4)
Individual scores: 10, 7, 7
Confidence: 4.3
Correctness: 3.3
Contribution: 3.7
Presentation: 3.3
TL;DR

We extend the LSTM architecture with exponential gating and new memory structures and show that this new xLSTM performs favorably on large-scale language modeling tasks.

Abstract

Keywords
LSTM, LLM, Language modeling, NLP, Memory

Reviews and Discussion

Review
Score: 10

This paper introduces the Extended Long Short-Term Memory (xLSTM), which enhances traditional LSTMs with exponential gating and new matrix memory. These improvements address LSTM limitations and enhance their memory storage capability, as well as the ability to revise storage decisions and be more parallelizable. The authors test the efficacy of their method with a set of very exhaustive experiments, which confirm that xLSTMs are competitive with both SSMs and Transformer-based architectures.

Strengths

I believe this paper is exceptionally strong and represents a highly innovative contribution to the field of sequence modeling. It is compelling to see that traditional RNN-based architectures, with some modifications, can outperform Transformer-based architectures, which aligns with some of the recent trends in this field [1,2]. I hope this direction gains traction within the community. This paper also reinforces the notion that old ideas, with slight adjustments, can effectively compete with the most popular models today. Time will tell if these architectures will become a viable alternative to their attention-based counterparts.

Here are some of the key strengths of the paper:

  1. An innovative approach to replacing input gates in LSTMs to improve storage decisions.
  2. An intuitive methodology for storing and retrieving information in memory in matrix form.
  3. Linear computation and constant memory complexity with respect to sequence length.
  4. Extremely exhaustive evaluation in both language-based and synthetic tasks.

[1] Gu, Albert, Karan Goel, and Christopher Ré. "Efficiently modeling long sequences with structured state spaces." arXiv preprint arXiv:2111.00396 (2021).

[2] Orvieto, Antonio, et al. "Resurrecting recurrent neural networks for long sequences." International Conference on Machine Learning. PMLR, 2023.

Weaknesses

I think the paper has been executed almost flawlessly, so I will forgo this section and will simply add some questions in the section below.

Questions

For my interest, I would like the authors to clarify a few things:

  1. Structure of Memory Mixing Matrices: As far as I can tell, the idea of matrix-based updates has been used for a while now. I do appreciate the introduction of different “heads” through a block-diagonal structure. In some sense, this approach could be viewed as restrictive, as it imposes a specific form on the update matrix. I wonder if there might be any benefit to considering a more general structure (which could be viewed as a sort of graph) that could be learned from the downstream task loss?

  2. State Normalization: The idea of normalizing the state to avoid overflow using the additional state m_t is clever. However, at some point I presume i_t has to be calculated and stored as well. Could this still lead to some overflow issues? What did you observe in practice?

  3. Matrix Memory: I find this idea very intuitive. While the authors have focused on using this for the cell state of the LSTM in this context, I wondered if they had tested the same idea for other (even vanilla) RNNs. Should similar improvements in memory performance be expected intuitively?

  4. Overall Architectural Decisions: What was the thought process behind designing the whole block structure (e.g., 1D convolutional layers, etc.) shown in the appendices? Was this inspired by previous high-performing sequence modeling architectures? Is this something that the authors tuned significantly, or is there potential for even better performance?

  5. Extrapolation: Do the results in Figure 3 suggest that xLSTMs could train on shorter context lengths while retaining the ability to extrapolate to longer sequences? Note that this is just a question to get the authors' opinion; I am not suggesting that they should run experiments along these lines, as the cost would be too high.

  6. Other Applications: This paper has focused on text data as well as other synthetic tasks, but I was wondering if the authors would expect it to work well for other data types (e.g., temporal or image data)?

Limitations

Yes, the limitations of the proposed method are mentioned in the paper in a dedicated section.

Author Response

We would like to thank the reviewer for the excellent score and the follow-up questions:

  1. It is possible to use a different structure to make the recurrent matrix more parameter-efficient. We chose the block-diagonal form because of its similarity to Transformer heads and because it allows hardware-optimal training on GPUs.

  2. We did not see any overflow issues for mLSTM or sLSTM. However, as the normalizer is aggregating, it might be limited by numerical ranges at some point if the forget gates stay too close to one. The normalizer state is bounded from below by one for both sLSTM and mLSTM, so the division is numerically stable (see the sketch after this list).

  3. In principle, a matrix memory should enhance other (vanilla) RNNs as well. A purely recurrent version (as for vanilla tanh RNNs) would be slow, since it cannot be computed efficiently on GPUs, which is why we did not test it in this work.

  4. Architecture decisions: This was inspired by previous work on State Space models (H3, Mamba). The convolution turns out to be important and the convolution window size can be tuned further. There is still a lot of architectural search space to explore.

  5. Extrapolation: The results in Figure 3 suggest this capability: shorter context lengths can be used for training while allowing a larger context during inference. However, there are fundamental limitations for retrieval-focused tasks such as the phone book task or the needle-in-a-haystack tasks, due to the fixed-size hidden state.

  6. As the model performs well on the tested sequence modeling tasks, we expect it to perform well on time series too. For vision, we expect good results as well, although the sequential inductive bias in our model seems unintuitive.
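
To make points 2 and 3 more concrete, here is a minimal sketch of one step of a matrix-memory update with a stabilized exponential input gate and a normalizer state (hypothetical function and variable names, simplified shapes; not the exact formulation from the paper):

```python
import torch

def mlstm_step(C, n, m, q, k, v, i_pre, log_f):
    # C: (d, d) matrix memory, n: (d,) normalizer state, m: scalar stabilizer state.
    # i_pre is the input-gate preactivation, log_f the log of the forget gate.
    m_new = torch.maximum(log_f + m, i_pre)        # keep the exponentials in a safe range
    i = torch.exp(i_pre - m_new)                   # stabilized exponential input gate
    f = torch.exp(log_f + m - m_new)               # stabilized forget gate
    C = f * C + i * torch.outer(v, k)              # covariance (outer-product) update
    n = f * n + i * k                              # normalizer state update
    h = C @ q / torch.clamp(torch.abs(n @ q), min=1.0)  # retrieval; denominator >= 1
    return C, n, m_new, h

d = 8
C, n, m = torch.zeros(d, d), torch.zeros(d), torch.tensor(0.0)
q, k, v = torch.randn(d), torch.randn(d), torch.randn(d)
C, n, m, h = mlstm_step(C, n, m, q, k, v, torch.tensor(2.0), torch.tensor(-0.1))
```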

Comment

Thank you for your clarifications, and congratulations on your work!

Review
Score: 7

Proposes some extensions to the LSTM. Specifically:

  • exponential gating with appropriate normalization and stabilization techniques

  • sLSTM with a scalar memory, a scalar update, and new memory mixing. This still needs to be calculated sequentially.

  • mLSTM that is fully parallelizable with a matrix memory (instead of just a vector) and a covariance update rule

  • putting sLSTM & mLSTM together into a residual block: the xLSTM block

  • Extended LSTM (xLSTM) blocks are residually stacked into xLSTM architecture

Tested for language modeling.

Compared to Transformers and State Space Models.

Strengths

Scales up LSTM models.

Proposes extensions to the standard LSTM model which perform really well.

Lots of experiments. (Although most are only in the appendix...)

Open code.

Weaknesses

Some of the notation is confusing. Specifically, I'm never really sure whether it's about scalars or vectors. I also think there are some errors regarding this. I don't understand why it is written in scalar notation at all.

There is never really an exact definition of the whole model. We can only infer it from the figures. (Or the code.)

The scaling laws comparison is a bit flawed: I don't think you can just compare the number of parameters across different architectures. More reasonable would be to compare inference time.

Questions

In all the formulas, it's not immediately clear whether e.g. h_t, c_t etc are scalar values or vectors. I assume those are vectors. It would be good to write that explicitly (sth like c_t, h_t, ... \in \R^D) or so.

However, then it says that w_z, w_i, w_f, w_o are also vectors (eq 4-6). This is weird. That should be matrices, or not? Also the r_z, r_i, r_f, r_o, they are just called "weights" (p.3 line 94), but that should be matrices as well, specifically all in \R^{D \times D}. It would help to state that explicitly. The notation is also a bit uncommon. Most common is to use capital letters for matrices.

Or is h_t here really a scalar, not a vector, e.g. in eq 2? But then, in eq 4-7, the multiplication with r is wrong? Or are those r also scalars? But this is not a normal LSTM then, because it would mix also the other dimensions.

I see some different definition in B.1. Is this just now a different notation, specifically in vector notation? Or is this really an alternative different model? As this says "vector notation", that really means that the initial equations (eq 1-7 and more) are all on scalars? Specifically, for example, eq 27 does not fit together with eq 4. If r is a vector in eq 4, then it just multiplies with a scalar h_{t-1} in eq 4, which is wrong.

Or later in Sec 2.1, it then says "In later formulations, multiple memory cells were combined in a vector, which allows the usage of recurrent weight matrices". So actually only this aspect really leads to the vanilla LSTM. It means the presented formula in Sec 2.1 are really not the same as B.1. And also the Sec 2.2 formula are also not what you actually use, because then later you also have matrices R.

I think this is all confusing. Why present the scalar variants at all, when they are never used like this?

Now, for sLSTM, it was said that sLSTM has scalar memory and a scalar update. What does this mean? What is scalar about it? In what way is it different from the normal LSTM?

I see in appendix B.2 that "The matrices Rz , Ri , Rf , Ro are block-diagonal". Is this specific only for the sLSTM? For the vanilla LSTM, you have fully dense matrices, right? So is this actually the main difference to the vanilla LSTM? Block-diagonal refers to "scalar"? So it's not really scalar?

What is the motivation of exponential gates? To be able to open the gate and even amplify the input (gate > 1)? But why exp, why not relu or some other function which does not grow so fast?

sLSTM Stabilized Version only in appendix. Why? Is this not so important?

sLSTM normalization (n_t), how important?

Sec 2.2 "we introduce a normalizer state that sums up the product of input gate times all future forget gates." - I think some word is missing.

Sec 2.3 "we increase the LSTM memory cell from a scalar c ∈ R to a matrix C ∈ Rd×d" - this is misleading, or I don't understand it. Actually you increase the LSTM memory cell from a vector to a matrix, not from a scalar to a matrix, or not? Or do you really have a different matrix C for every dimension, i.e. you actually have a 3D tensor as memory when there are multiple cells? Specifically, in eq 16-24, the i_t, f_t, h_t, q_t, etc, are those scalars or vectors now? In eq 16, when i_t/f_t are scalars, that means when you now have multiple cells, that the C_t becomes a 3D tensor?

Sec 2.4: xLSTM blocks and xLSTM architecture: I think it's bad to refer to the appendix for figures. I think those are crucial aspects about the whole model, which should be in the main text. Also, there should be some exact formulas to define the model, not just figures.

Sec 2.4 xLSTM architecture: "LayerNorm ... See last column in Figure 5." - when I look at the last column in Fig 5, I only see some gray blocks. I don't see LayerNorm there.

Sec 2.4 xLSTM architecture: The only real "definition" of the model is Figure 5? In Figure 5, when it says xLSTM blocks, that means either a sLSTM block (like Figure 6) or a mLSTM block (like Figure 7)? So the light gray blocks are mLSTM, and the dark gray blocks are sLSTM in the last column in Fig 5?

The sLSTM and mLSTM blocks (Fig 6, Fig 7) have many further aspects which are not really discussed, like the group norm, etc. How did you end up with this specific architecture of the blocks? Have you tested any variations on that?

Sec 2.4: From Fig 5, it seems that the amount of sLSTM and mLSTM blocks is not the same? This is not really discussed or even mentioned at all? It seems to be also a crucial aspect of the whole xLSTM model. Only later in Sec 4, I read "For all experiments, we use the notation xLSTM[a:b] for the ratio a/b of mLSTM-based versus sLSTM-based xLSTM blocks.".

Figure 4, scaling laws: I don't think it is reasonable to have the number of parameters on the x axis. More reasonable is to put e.g. the inference time at the x axis.

Relation to multiplicative LSTM (https://arxiv.org/abs/1609.07959)?

Relation to associative LSTM (https://arxiv.org/abs/1602.03032)?

Study on activation functions? Or only tanh?

Limitations

Author Response

Thank you very much for your review, which helps to improve our paper!

We are sorry that the scalar notation for the LSTM and sLSTM (eqs. 2-14) caused confusion. This notation was chosen to reflect the original LSTM idea of a single cell and to make the distinction from matrix memory cells more pronounced. The scalar notation assumes that an LSTM cell is defined by a scalar input and forget gate. The vectorized version (where we stack multiple cells in a vector) with cell interactions (i.e. memory mixing, eqs. 25-37) is probably more familiar from recent literature. Some of our contributions, namely the block-diagonal recurrent matrix R for the sLSTM, essentially rely on this vectorized formulation. We checked all notations again and they should all be correct and consistent. For all our formulas we use the convention:

  • non-bold, lower case letters represent scalars
  • bold, lower case letters represent vectors
  • bold, capitalized letters represent matrices

According to this convention the difference between LSTM and sLSTM formulation in equations (2-14) and equations (25-37) is that (25-37) essentially stack multiple memory cells / hidden states into a vector.

In short, the main differences between the original LSTM and the sLSTM are: the sLSTM has exponential gating whereas the LSTM has sigmoid gating, and the sLSTM has block-diagonal recurrent weights R whereas the LSTM has dense weights. (However, one could also apply block-diagonal recurrent weights R to the original LSTM.)
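
As a minimal illustration of the block-diagonal structure (hypothetical sizes, not the exact configuration used in our experiments):

```python
import torch

heads, head_dim = 4, 16
D = heads * head_dim
# One small recurrent matrix per head; stacked block-diagonally, hidden states
# are only mixed within a head, similar to heads in a Transformer.
R_blocks = [torch.randn(head_dim, head_dim) for _ in range(heads)]
R = torch.block_diag(*R_blocks)                   # (D, D) block-diagonal recurrent matrix
h_prev = torch.randn(D)
recurrent_preact = R @ h_prev
# Equivalent per-head form, which maps to efficient batched matmuls on GPU:
per_head = torch.einsum('hij,hj->hi', torch.stack(R_blocks), h_prev.view(heads, head_dim))
assert torch.allclose(recurrent_preact, per_head.reshape(-1), atol=1e-5)
```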

The motivation for the exponential gate is its similarity to a running softmax and the fact that, no matter how strongly an input has been amplified, a later input can still surpass it in weight through an additive increment in the input gate preactivation. This way it is always possible to revise a previous storage decision. The stabilization is essential for both variants but was moved to the appendix for space reasons.

Regarding the cell expansion from scalar to matrix: we assume one cell to be defined by one scalar input and forget gate. In this sense it is an expansion from scalar to matrix. There are different heads for the mLSTM in parallel, so effectively this is a third dimension orthogonal to the matrix state.
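
As a toy illustration of the "revising storage decisions" intuition above (not code from the paper): when the forget gate is effectively open, the exponential input gates together with the normalizer act like a running softmax over past inputs, so a later input whose gate preactivation is larger by an additive margin dominates, no matter how strongly earlier inputs were amplified.

```python
import torch

i_pre = torch.tensor([5.0, 1.0, 9.0])    # input-gate preactivations over time
values = torch.tensor([0.1, 0.2, 0.3])   # toy scalar "values" to be stored
weights = torch.softmax(i_pre, dim=0)    # exp(i_pre) / sum(exp(i_pre))
out = (weights * values).sum()
print(weights)                           # the last step dominates with weight ~0.98
print(out)                               # ~0.296: the earlier storage decision is revised
```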

Regarding the complete model figures: we placed them in the appendix because of their size (for readability) and the limited space in the main paper. The specific positions of sLSTM vs. mLSTM blocks influence training performance, and the models shown performed well. The shown ratios performed well on the specific tasks; especially for language modeling there seems to be an emphasis on memory rather than on the state-tracking capabilities of the sLSTM. The tanh activation function for the sLSTM enables stable training with long contexts; other activation functions were not tested.

Relation to multiplicative LSTM: Multiplicative LSTMs modify the memory-mixing part of LSTMs to an input-dependent recurrent matrix (see their Eqn. 8). Their gating is equivalent to vanilla LSTMs and the architecture is not sequence-parallelizable.

Relation to associative LSTM: Associative LSTMs act on complex numbers and also use a kind of state expansion via multiple "copies" with different (non-learnable) random permutations, whereas the mLSTM uses learnable projections for the key and value vectors that form the cell state update. They do not have a parallelizable version and use vanilla LSTM gating.

Our scaling laws (with the number of parameters on the x-axis) show how effectively the model parameters are used. We agree that inference time is very important. Therefore, in additional experiments, we have measured the inference time of the models, which we present in the attached figures. Please find these figures in the supplementary one-page PDF attached to our general response. The inference time behavior for other model sizes is similar.

Note that these measurements do not include optimized kernels for the mLSTM.

We address the following of your points directly below:

“Sec 2.4 xLSTM architecture: "LayerNorm ... See last column in Figure 5." - when I look at the last column in Fig 5, I only see some gray blocks. I don't see LayerNorm there.”

Indeed, this is a typo. We mean the last two columns. In the blocks on the left you see the pre-layernorm architecture of the transformer and skip connections. The last column shows the final stacking. Thank you for reading the paper so carefully.

“Sec 2.4 xLSTM architecture: The only real "definition" of the model is Figure 5? In Figure 5, when it says xLSTM blocks, that means either a sLSTM block (like Figure 6) or a mLSTM block (like Figure 7)? So the light gray blocks are mLSTM, and the dark gray blocks are sLSTM in the last column in Fig 5?”

Correct. Light gray = mLSTM, dark gray = sLSTM.

“Sec 2.4: From Fig 5, it seems that the amount of sLSTM and mLSTM blocks is not the same? This is not really discussed or even mentioned at all? It seems to be also a crucial aspect of the whole xLSTM model. Only later in Sec 4, I read "For all experiments, we use the notation xLSTM[a:b] for the ratio a/b of mLSTM-based versus sLSTM-based xLSTM blocks.".”

We view the block ratio as a hyperparameter that needs to be tuned for the respective task. In the experiment section we detail which ratio we used for each experiment.

“The sLSTM and mLSTM blocks (Fig 6, Fig 7) have many further aspects which are not really discussed, like the group norm, etc. How did you end up with this specific architecture of the blocks? Have you tested any variations on that?”

Our Pre-Up Projection Block (Fig. 7) was inspired by previous work like Mamba, H3 or Retention. The Post-Up Projection Block was inspired by Transformer architectures like Llama or GPT. We refer to these publications for details.

We hope that we could answer your questions and clarify your concerns. If you find them addressed properly we kindly ask you to raise your score.

Comment

Thank you for the rebuttal.

I think it would still be helpful to use the common vector/matrix notation to introduce and define the models, and to explicitly define what the variables are (e.g. $x \in \mathbb{R}^D$, $M \in \mathbb{R}^{D_\text{in} \times D_\text{out}}$ or so), and if you think the scalar notation is useful/interesting from a historical perspective, move that to the appendix.

Regarding scaling laws (Fig 4): What I mean is, instead of having the number of parameters on the x-axis, put the decoding runtime on the x-axis. Or maybe the FLOPs needed to compute the model, to keep it hardware/implementation independent. And leave the validation perplexity on the y-axis. I think the decoding runtime is more relevant than the number of parameters when comparing different model types.

Comment

Thank you for your response.

You are right, we should update the notation here and define variables explicitly as we did for the mLSTM.

Also regarding the scaling laws, we agree that inference time is a very important measure; we tried to deliver this in our rebuttal document. Decoding FLOPs are a hardware-independent measure of the compute needed for a decoding step. Still, modern hardware with accelerated matrix multiplication can compute an order of magnitude more FLOPs in matrix multiplications (as for mLSTM, sLSTM, and Transformer decoding - NVIDIA A100: 312 TFLOP/s) than FLOPs in scalar operations (as dominant in RWKV4 and Mamba - NVIDIA A100: 39 TFLOP/s).
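
As a back-of-the-envelope illustration (the per-token FLOP count below is a made-up placeholder; only the A100 throughput numbers are taken from above), identical decoding FLOPs can translate into very different wall-clock times depending on whether they are spent in matrix multiplications or in scalar operations:

```python
flops_per_token = 2e9                     # hypothetical decoding FLOPs per token
t_matmul = flops_per_token / 312e12       # if spent in matrix multiplications
t_scalar = flops_per_token / 39e12        # if spent in scalar operations
print(f"{t_matmul * 1e6:.1f} us vs {t_scalar * 1e6:.1f} us")  # ~6.4 us vs ~51.3 us (8x gap)
```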

The decoding time depends heavily on the context length for the Transformer (Llama) architecture, whereas it is constant for recurrent models, so the scaling laws would look different for different context sizes. Similarly, the inference FLOPs for the Transformer are not constant over different context lengths, unlike for the recurrent models. Hence, again, the scaling laws would heavily depend on the context length.

This is why we chose to measure the actual inference time of our models, as well as of equally sized competitors, and compared them for different context lengths. Note that we expect further speedup factors from using custom decoding kernels for the xLSTM variants.

The reason why we initially chose the number of model parameters as the scaling-law criterion is that, when scaling recurrent models to larger sizes, GPU memory (HBM) becomes the limiting factor at some point: the majority of the memory is used by the model itself. Therefore, we see the number of model parameters as an important measure for which to provide scaling laws.

Comment

I agree that it's definitely not straightforward to compare the capacity of the model. Using FLOPs was a suggestion, and surely, as you point out, there are many details you need to further define. E.g. pick some reasonable context length. But I think this would still be more reasonable than just counting the number of parameters. The number of parameters is really the most misleading measure you can have here.

Using the actual inference time is also fine. But my intention is to have this for the scaling law figure (Fig 4 in the paper), on the x-axis, for different model sizes, comparing the different models, and then the perplexity on the y-axis.

Comment

Thank you for your updated answer. We agree that there should be more scaling-law plots that also consider inference time or compute in addition to the number of parameters. Besides the context length, the batch size is also a somewhat free parameter (if not limited by GPU memory) when measuring inference time. We will perform additional experiments on this and add more scaling-law figures with other quantities on the x-axis (similar to Fig. 4).

Review
Score: 7

The paper introduces xLSTM, an advanced variant of traditional LSTMs, incorporating two extensions aimed at boosting its memory capacity and performance. The first enhancement, termed sLSTM, modifies the standard LSTM by integrating exponential input and forget gates alongside a stabilizing normalizer term, designed to refine gate operations and ensure network stability.

The second mLSTM involves the incorporation of matrix-form hidden states into sLSTM, reminiscent of architectural advancements observed in models like the GLA, RetNet, and RWKV5/6. This approach significantly augments the memory capacity of the LSTM units, aligning with recent trends in RNN design that prioritize enhanced memory retention.

The xLSTM architecture is constructed through a residual stacking of xLSTM token mixing layers coupled with Linear layers, mirroring the design principles employed in architectures such as Llama. This configuration facilitates efficient information propagation and processing within the network.

The authors successfully scaled xLSTMs to 1B parameters, trained on hundreds of billions of tokens, and found that xLSTMs perform better than existing Transformer, SSM, and other linear-attention architectures. As for associative recall abilities, xLSTMs exhibit better performance when evaluated on MQAR tasks. The paper also includes an exploration of scaling laws, revealing that xLSTMs maintain favorable performance characteristics when scaled to larger model sizes.

Strengths

The paper proposes a new RNN variant, denoted as xLSTM, making two extensions, i.e., sLSTM and mLSTM, to the classical LSTM, which has the potential to rejuvenate classical RNNs in the LLM era. Experiments on 340M / 1B sized models show that xLSTMs perform very well on language modeling tasks. The analysis indicates that the xLSTM variant excels in retrieval-intensive scenarios, a critical advancement given the acknowledged limitations in memory capacity that have historically constrained RNNs. I look forward to seeing the performance of xLSTMs scaled to larger sizes.

Weaknesses

The overall design of the mLSTM is not that novel. Notably, architectures such as RetNet and GLA have successfully applied 2-D matrix-formed hidden states in RNN architectures, which greatly enlarge their memory capabilities.

xLSTMs additionally propose an exponential gating, which, as claimed by the authors, enables the ability to solve state-tracking problems. The authors verify the effectiveness on some synthetic tasks in Appendix D.1. However, I believe this design can greatly hurt the parallelizability of xLSTMs. To enable stable training, the authors have to maintain i and f in log space first, at the sacrifice of hardware optimizations, making it hard to benefit from tensor-core acceleration, a feature that is pivotal for achieving high computational efficiency. This partially explains why their kernel implementations are much slower than flash-attention. Moreover, the input and forget gates have to be kept as scalars, which I believe is inferior to the more fine-grained gating of GLA.

Questions

If possible, I would like to know how xLSTMs compare to GLA.

Limitations

The proposed exponential gating makes xLSTM hard to parallelize, hindering its ability to scale to larger sizes compared to other existing architectures.

Author Response

Thank you very much for your review which helps to improve our work!

Indeed the reviewer is right, GLA and Retention are similar to mLSTM in that sense as they both have a matrix cell state, too. However, we highlight that neither Retention nor GLA have an (input-dependent) input gate and Retention has no input-dependent gates at all. mLSTM has an input-dependent exponential input gate and an input-dependent sigmoid forget gate. We would like to point the reviewer to Table 6 in the Appendix, where we carefully ablate the design decisions of the different gating mechanisms and also relate to the other models like Retention. We find that our Exponential Gating with all gates trainable performs the best and introduce Exponential Gating as a gating mechanism that is applicable to matrix memory cells (mLSTM) as well as scalar memory cells (sLSTM).

For a direct comparison to GLA in terms of language modeling performance measured in perplexity (PPL) on the validation set, we refer to Table 1 in the main paper. There, we show that our xLSTM variants xLSTM[1:0] (i.e. mLSTM only) and xLSTM[7:1] (i.e. ratio mLSTM:sLSTM blocks = 7:1) achieve a PPL of 13.43 and 13.48, respectively, and significantly outperform RetNet (16.23 PPL) and GLA (16.15 PPL). As noted by the reviewer, we keep the per-head input and forget gates of the mLSTM as scalars and then broadcast these scalars along the head dimensions. We do not think that this decreases performance compared to the dimension-wise gating of GLA (see Table 1 and Table 6 in our paper). This design decision has also been made by the authors of Mamba 2 when moving from Mamba 1 to Mamba 2 (see [1] and Table 1 in [2]). Additionally, having the gates as scalars simplifies the parallel form even though we maintain the forget gates in log space, since we can first accumulate across the head dimension and then multiply by the gates (compare e.g. to equation (4) in [2]). Apart from that, the chunkwise formulation of linear attention in GLA [2] is applicable to the mLSTM, too.
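
A minimal sketch of what we mean by broadcasting scalar per-head gates over the matrix memory (hypothetical shapes; stabilizer and normalizer states omitted for brevity):

```python
import torch

B, H, d = 2, 4, 16                        # batch size, heads, head dimension
i_gate = torch.rand(B, H, 1, 1)           # one scalar input gate per head
f_gate = torch.rand(B, H, 1, 1)           # one scalar forget gate per head
k = torch.randn(B, H, d, 1)
v = torch.randn(B, H, 1, d)
C_prev = torch.randn(B, H, d, d)
# The scalar gates broadcast over the (d, d) matrix memory of each head,
# in contrast to the dimension-wise gating used in GLA.
C = f_gate * C_prev + i_gate * (k @ v)    # batched outer-product update per head
```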

We believe that in our paper we did not sufficiently separate exponential gating and memory mixing, which are independent architecture characteristics. Exponential gating alone does not hinder parallelization. When we use memory mixing as in the sLSTM, we cannot parallelize the computation. In this case, the reason for the non-parallelizability is memory mixing, i.e. the dependence of the gates on the previous hidden states via recurrent weights (see r_{i,f,z,o} and R_{i,f,z,o} in equations (11-14) and (34-37)), not exponential gating. Our formal language experiments indicate that exponential gating in combination with memory mixing can solve the Parity task, which is one simple instance of a problem that requires state tracking (see Table 8, rows: LSTM, xLSTM[0:1] (i.e. sLSTM only), xLSTM[1:1]). Since the mLSTM alone (xLSTM[1:0]), which has exponential gating but no memory mixing, cannot solve the task, and the original LSTM, which has memory mixing but no exponential gating, can solve it, we think that memory mixing is crucial for the state-tracking capability. We are sorry that we did not state this clearly in the main paper and will do so in a potential camera-ready version.

When we apply Exponential Gating to the original LSTM, we obtain the sLSTM (i.e. xLSTM[0:1]), which improves performance on language modeling by almost 9 PPL points (difference between lines 3 and 4 in the upper part of Table 6). Note that both the original LSTM and the sLSTM have memory mixing and are not parallelizable.

In contrast to the sLSTM, the mLSTM is fully parallelizable analogous to Transformers with Self-Attention. We outline the parallel formulation (forward and backward pass) of the mLSTM with Exponential Gating in Section B.3 in the Appendix.

Despite the non-parallelizability of the sLSTM, we propose a more efficient variant of memory mixing compared to the original LSTM and use optimized CUDA kernels for the sLSTM, which is less than two times slower than the parallel mLSTM implementation.

We hope that we could clarify your concerns and kindly ask you to raise your score if you find them addressed properly.

[1] Dao, Tri, and Albert Gu. "Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality." arXiv preprint arXiv:2405.21060 (2024).

[2] Yang, Songlin, et al. "Gated linear attention transformers with hardware-efficient training." arXiv preprint arXiv:2312.06635 (2023).

Comment

Thank you for your thoughtful replies.

Overall, I believe this is an excellent paper that significantly exceeds the acceptance standards of NeurIPS. The paper presents a range of meaningful analyses and comprehensive comparisons, from which I have learned a great deal. However, I intend to maintain my score, in part because other reviewers have given very high evaluations, and also to reflect my concerns regarding efficiency, which is crucial when scaling to larger models.

A few suggestions: could the authors include some discussion of ABC [1], which I believe employs similar ideas of exponential decay, even though the approaches are not entirely the same?

[1] https://arxiv.org/abs/2110.02488

Comment

Thank you for your response. We are happy you like our work. For large-scale pre-training, there might be efficiency concerns for the sLSTM (xLSTM[:1]) part, which trades sequence-parallelization for the ability to do state tracking with memory mixing (in our efficiency-tuned variant). Still, the mLSTM (xLSTM[1:0]) alone is fully sequence-parallelizable (as is, for example, GLA [1]), outperforms all other models in language modeling, and, as shown in our rebuttal plots, scales very well at inference - even without custom kernels.

Thank you for the interesting find. The mentioned ABC architecture multiplies the keys and values with learnable "control vectors" parameterized by the exponential function, which is similar to our exponential input gates. It uses a different state and a reduced form of regular (i.e. softmax) attention, as in Linformer [2]. We will include this in the related work section.

[1] https://arxiv.org/abs/2312.06635

[2] https://arxiv.org/abs/2006.04768

Comment

Thank you for your hard work. I raised the score to 7.

Author Response

We thank all reviewers for their comments and constructive feedback. In a potential camera-ready version of our paper we will address all of your comments and feedback, which have considerably improved our paper.

We thank the reviewers WcbD & ZvGL for appreciating our extensive experiments that we conducted over the last year.

We answer all questions for each reviewer directly below their reviews.

Reviewer WcbD proposed to measure the inference (wall-clock) time and compare xLSTM to the other baselines. We agree that this is a very interesting experiment.

Therefore, we have measured the time for text generation for our Transformer (Llama architecture), Mamba, and RWKV4 baselines and our two xLSTM variants, xLSTM[1:0] and xLSTM[7:1]. We compare the generation time for the 1.3B-sized models with a context pre-fill of 16 tokens, and generation speeds for 64 tokens at varying context pre-fill. The results are shown in the attached PDF.

Due to the recurrent nature and the fixed state size of Mamba, RWKV4 and xLSTM, these models have a constant generation time per token, independent of long pre-fill contexts. Remarkably, xLSTM[1:0] with torch.compile and no custom GPU kernels is on par with a Huggingface Llama implementation in generation speed for small contexts and greatly surpasses it for long contexts.

Comment

Dear reviewers,

The author-reviewer discussion period will end soon. Please make sure you read the authors' rebuttal and respond to it. If you have additional questions after reading the rebuttal, please discuss them with the authors.

AC

Final Decision

This paper proposes the so-called extended long short-term memory (xLSTM) model, which includes numerous enhanced features such as exponential gating and modified memory structures in the traditional LSTM framework. These enhanced features can effectively mitigate some of the well-known limitations of LSTMs (e.g. being poor at revising storage decisions, limited storage capacity, and lack of parallelizability). The authors conducted extensive experiments comparing the proposed xLSTM with state-of-the-art Transformers and state space models in both performance and scaling, and demonstrated its effectiveness. Overall, this is an interesting paper and may have a good impact on the machine learning community. Concerns raised by the reviewers were addressed in the rebuttal and the following discussions. The authors may consider restructuring the paper a bit, as some of the important content is in the appendix.