Mamba: Linear-Time Sequence Modeling with Selective State Spaces
We introduce a selection mechanism into state space models, leading to state-of-the-art results on general sequence modeling including language.
Abstract
Reviews and Discussion
The authors study the recent state-space model (SSM) family of efficient sequence architectures and address some of their challenges related to the inability to perform content-based reasoning. The core contribution of the work is the addition of a selection mechanism to the SSM architecture, which results in a simple and scalable architecture, Mamba. They then demonstrate the superiority of Mamba on standard language benchmarks, as well as on DNA and audio modeling. The authors also contribute an efficient implementation and benchmarks of Mamba on modern hardware.
Strengths
S1: The paper addresses pressing problems in sequence modeling very efficiently and effectively.
S2: The authors have identified simple toy tasks, such as selective copying and associative recall, that enable them to make design choices which have state-of-the-art impact on real-world data.
S3: The connection to the role of gating mechanisms in RNNs is well-appreciated.
S4: The empirical part of the paper is very thorough, and the results are strong.
Weaknesses
I do not identify any major weaknesses of the paper.
Questions
I am curious if we could build a better understanding of the selection mechanism that you propose. In Theorem 1 you link that mechanism to gating in RNNs as a special case. Is it possible to understand better the generalization through some discussion / qualitative examples?
We are glad that the reviewer appreciates Mamba’s relationship to gating mechanisms and finds the empirical results strong.
Regarding the connection to gating: Theorem 1 establishes the relationship for a special case of $s_\Delta$ and $\tau_\Delta$. More broadly, the $\Delta$ parameter of SSMs plays the role of generalized gating, and we view it as an elegant way to automatically induce gating (instead of the conventional heuristic explanation of gates).
In general, $\Delta$ controls the balance between how much to focus on or ignore the current input $x_t$. It is analogous to the role of the gate $g_t$ in Theorem 1: mechanically, a large $\Delta$ resets the state $h$ and focuses on the current input $x_t$, while a small $\Delta$ persists the state and ignores the current input. SSMs (equations (1)-(2)) can be interpreted as a continuous system discretized by a timestep $\Delta$, and in this context the intuition is that a large $\Delta$ represents the system focusing on the current input for longer (thus "selecting" it and forgetting its current state), while a small $\Delta$ represents a transient input that is ignored.
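For concreteness, the gated form from this special case can be restated as follows (a sketch in the paper's notation; the precise statement is Theorem 1 with $N = 1$, $A = -1$, $B = 1$, $s_\Delta = \mathrm{Linear}(x)$, $\tau_\Delta = \mathrm{softplus}$):

```latex
% Gated recurrence arising as a special case of the selective SSM (Theorem 1):
g_t = \sigma\!\left(\mathrm{Linear}(x_t)\right), \qquad
h_t = (1 - g_t)\, h_{t-1} + g_t\, x_t
```

A large $\Delta_t$ corresponds to $g_t \to 1$ (reset the state onto $x_t$), and a small $\Delta_t$ to $g_t \to 0$ (keep the previous state), matching the intuition above.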
We plan to include this discussion, as well as an extended discussion on the intuition and role of the other parameters of selective SSMs ($\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$), in an extended preprint that will be released together with the open source release of the model.
This paper upgrades S4 by making the token-mixing matrix data-dependent and introduces the Mamba architecture. On the other hand, although the FFT can no longer be used, the authors provide a linear-time algorithm for the computation, resulting in linear computational complexity. The effectiveness of the proposed method is validated on multiple datasets.
Strengths
The paper is written in a clear and understandable manner, with a well-defined approach and simple yet effective improvement strategies.
Weaknesses
The paper lacks references to some relevant works, such as [1], [2], [3], and [4], which discuss linear attention methods, and [5], which is a long-convolution method. These references are completely absent from the paper. I suggest that the authors add these citations to provide a more comprehensive review of related work.
[1] Zhen Qin, Weixuan Sun, Hui Deng, Dongxu Li, Yunshen Wei, Baohong Lv, Junjie Yan, Lingpeng Kong, and Yiran Zhong. cosFormer: Rethinking Softmax in Attention. In ICLR, 2022.
[2] Lin Zheng, Jianbo Yuan, Chong Wang, and Lingpeng Kong. Efficient Attention via Control Variates. In ICLR, 2023.
[3] Lin Zheng, Chong Wang, and Lingpeng Kong. Linear Complexity Randomized Self-attention Mechanism. In ICML, 2022.
[4] Zhen Qin, Xiaodong Han, Weixuan Sun, Dongxu Li, Lingpeng Kong, Nick Barnes, and Yiran Zhong. The Devil in Linear Transformer. In EMNLP, 2022.
[5] Zhen Qin, Xiaodong Han, Weixuan Sun, Bowen He, Dong Li, Dongxu Li, Yuchao Dai, Lingpeng Kong, and Yiran Zhong. Toeplitz Neural Network for Sequence Modeling. In ICLR, 2023.
Questions
- Adding extrapolation experiments to the language model would be interesting.
- The ablation analysis in Table 6 should be more comprehensive, with a total of $2^3 = 8$ possible combinations. I suggest that the authors include the remaining two combinations.
- What is your setting for the scaling laws? Why is your ratio of token count to model size the same as in the Chinchilla paper? I suppose the FLOPs of Transformers and SSMs would differ given the same total parameter count; is this important to the final performance (accuracy)?
- How did you parameterize the first convolutional layer in the Mamba block?
- Providing more implementation detail, such as offering core code, would be very helpful.
We are glad the reviewer found the ideas in the paper simple, compelling, and clear.
Citations and related work
We appreciate the reviewer’s suggestions for related work, which are all very relevant and appropriate. We have included all of them, along with more linear attention variants such as RFA, cosFormer, Performer, TransNormer, RAFA, and TNN, in the extended related work.
Adding extrapolation experiments to the language model would be interesting.
Our original submission includes length extrapolation results for the Induction Heads task, which is hypothesized to be deeply related to the generalization abilities of language models [Anth]. In Figure 3 we show that Mamba can extrapolate from sequences of length 256 to length 1000000, while no other model (including attention with variants of relative positional embeddings) extrapolates beyond length 512.
In the shared response, we include new results on measuring the length extrapolation abilities of Mamba compared to a standard Transformer.
The ablation analysis in Table 6 should be more comprehensive, with a total of $2^3 = 8$ possible combinations. I suggest that the authors include the remaining two combinations.
We chose to exclude some partial combinations for the sake of clarity. This table includes the full ablations.
| Selective $\Delta$ | Selective $B$ | Selective $C$ | Perplexity |
|---|---|---|---|
|  |  |  | 10.93 |
|  | + |  | 10.15 |
|  |  | + | 9.98 |
| + |  |  | 9.81 |
|  | + | + | 9.49 |
| + | + |  | 9.21 |
| + |  | + | 9.07 |
| + | + | + | 8.71 |
What is your setting for the scaling laws? Why is your ratio of token count to model size the same as in the Chinchilla paper? I suppose the FLOPs of Transformers and SSMs would differ given the same total parameter count; is this important to the final performance (accuracy)?
The FLOPs of Transformers and SSMs do not differ significantly at shorter sequence lengths (e.g. length 2048), and we believe that the standard Chinchilla scaling laws should apply. As shown in the difference between Figure 4 (Left) and 4 (Right), the FLOPs of Transformers vs. SSMs do differ at longer sequence lengths. It would be very interesting for future work to explore whether SSMs have different scaling laws especially with differing context lengths.
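As a rough back-of-the-envelope sketch of why the counts are close at length 2048 (our own approximation, not numbers from the paper), the dominant per-layer terms are:

```latex
% Approximate per-layer FLOP scaling (sketch; constants omitted):
\text{attention: } O(L d^2)\ \text{(projections)} + O(L^2 d)\ \text{(attention map)}
\qquad\text{vs.}\qquad
\text{selective SSM: } O(L d^2)\ \text{(projections)} + O(L d N)\ \text{(scan)}
```

With $L \approx 2048$, model dimension $d$ on the order of $10^3$, and a small state size $N$, both are dominated by the $O(Ld^2)$ projection term; the quadratic $O(L^2 d)$ term only becomes dominant at much longer sequence lengths.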
How did you parameterize the first convolutional layer in the Mamba block?
We use a standard local (causal) convolution of width 4. It is a depthwise convolution, i.e. it operates independently per channel. This follows similar architecture designs in H3, Hyena, and RWKV.
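For concreteness, a minimal PyTorch sketch of a width-4 depthwise causal convolution of this kind (the class name and dimensions are our own illustrative choices, not the released implementation):

```python
import torch
import torch.nn as nn

class CausalDepthwiseConv(nn.Module):
    """Depthwise (per-channel) causal convolution, a sketch of the kind of layer
    described above; kernel width 4 follows the response, other choices are assumptions."""

    def __init__(self, d_model: int, width: int = 4):
        super().__init__()
        # groups=d_model -> each channel is convolved independently (depthwise)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=width,
                              groups=d_model, padding=width - 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model); Conv1d expects (batch, channels, length)
        x = x.transpose(1, 2)
        y = self.conv(x)[..., : x.shape[-1]]  # drop trailing outputs so position t sees only inputs <= t
        return y.transpose(1, 2)

# usage sketch: y = CausalDepthwiseConv(d_model=64)(torch.randn(2, 128, 64))
```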
Providing more implementation detail, such as offering core code, would be very helpful.
As mentioned in the shared response, we are planning to release all code and pretrained models for the community.
The paper proposes a new class of selective state space models (SSMs) for sequence modeling that achieves Transformer-quality performance while scaling linearly in sequence length. The paper addresses the key problem in SSMs of selecting data, i.e., selecting particular inputs. The paper presents a hardware-aware algorithm that computes the model recurrently with a scan instead of a convolution, avoiding materializing the expanded state to reduce memory usage. This results in faster computation than previous methods.
The paper simplifies prior deep sequence model architectures into a homogeneous architecture, called Mamba, incorporating the selective SSMs. Mamba enjoys fast inference, linear scaling, and improved performance on long sequences.
In the results the authors show that Mamba achieves state of the art on synthetic tasks, audio/genomics modeling, and language modeling and outperforms Transformers of the same size on language modeling in both pretraining and downstream tasks.
The results suggest selective SSMs and the Mamba architecture could be a strong candidate for a general sequence model backbone for foundation models across modalities. The paper demonstrates the potential for linear-time models to match or exceed the performance of quadratic Transformers.
Strengths
- A key limitation of prior SSMs is the inability to efficiently select data in an input-dependent manner. The paper introduces a key mechanism by parameterizing the SSM parameters based on the input, allowing the model to filter out irrelevant information and remember relevant information indefinitely.
- The results compared to Pythia and Transformers on many benchmarks are impressive.
Weaknesses
- The model still has a quadratic memory requirement during training like Transformers.
Questions
- Have you evaluated scaling behavior beyond 1.4B parameters? How does it compare to Transformers at 10B scales?
- The input selection mechanism introduces additional hyperparameters. How sensitive are the results to hyperparameters like the projection rank?
We appreciate the reviewer’s close reading of the paper and that the reviewer found our results compelling. We respond to the reviewer’s questions below.
The model still has a quadratic memory requirement during training like Transformers.
We would like to clarify that Mamba’s memory requirement scales linearly in sequence length; the memory bottleneck is simply the size of the activation tensors. This is true of most deep sequence models, including Transformers (when using an optimized implementation such as FlashAttention). Mamba is in fact very memory efficient; we have included additional comparisons in Appendix F.5 (see response 4/4 to Reviewer du8a for numbers).
Have you evaluated scaling behavior beyond 1.4B parameters? How does it compare to Transformers at 10B scales?
We have managed to scale Mamba to the next standard model size of 2.7-2.8B parameters. Results are shown in the general response; it continues to strongly outperform open-source Transformers of the same size, and match Transformers of the next larger size (7B parameters).
The input selection mechanism introduces additional hyperparameters. How sensitive are the results to hyperparameters like the projection rank?
This is an excellent question; many new models come with new hyperparameters that must be tuned. Mamba has a few main axes of variation:
- The projection rank of the $\Delta$ parameter, as the reviewer noted
- The state size $N$; larger states can perform better but slow down the model
- Prior works on SSMs have had many variants of the $\mathbf{A}$ initialization.
Ablations for all of these have been included in the original submission (Tables 7, 8, and 9). We see that increasing the projection rank and $N$ both offer small improvements for a modest increase in parameters; however, they do not offer a significant performance difference. In fact, all our experiments did no tuning on any of these hyperparameters (fixing the projection rank to be a constant fraction of the model dimension; fixing $N$; and fixing the $\mathbf{A}$ initialization). We strongly believe that Mamba can be a promising drop-in replacement for attention in general foundation models.
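For illustration only, a hypothetical configuration sketch collecting these axes of variation; the names and default values below are our own assumptions for the sketch, not values quoted from this response.

```python
from dataclasses import dataclass

@dataclass
class SelectiveSSMConfig:
    """Hypothetical config for the hyperparameter axes discussed above.
    Defaults are illustrative assumptions, not values quoted from the paper."""
    d_model: int = 768        # model dimension D
    d_state: int = 16         # state size N: larger can help slightly but slows the model
    dt_rank: int = 0          # rank of the Delta projection; 0 means "derive from d_model"
    a_init: str = "default"   # which A initialization to use (prior SSM work varies this)

    def __post_init__(self):
        if self.dt_rank == 0:
            # fix the projection rank to a constant fraction of the model dimension
            self.dt_rank = max(1, self.d_model // 16)
```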
This paper proposes Mamba, which is a linear-time sequence model with selective state spaces. The authors propose to modify conventional state space models (SSMs) such that the modified models are input-dependent. The authors further propose engineering techniques for performance optimization. Experiments are conducted to demonstrate the effectiveness of the proposed method. In particular, several flavors of pre-trained models are provided.
Strengths
- The proposed Mamba method includes a simple modification to the conventional SSM: additional modules make the SSM parameters dependent on the inputs. SSMs are known for their computational difficulties, and the authors address this issue with several performance optimization techniques.
- The authors pre-train several variants of Mamba, ranging from 130M to 1.4B parameters. These pre-trained models show performance improvements compared with the baselines in the paper.
Weaknesses
Concerns about model design:
- The motivation of Mamba is to address the drawbacks of recurrent models while improving the efficiency of attention-based models. There are many works following the same direction: S4-diagonal [1], SGConv [2], MEGA [3], SPADE [4], and many efficient Transformer models (e.g., [5]). All of these models achieve near-linear complexity, and the authors need to compare Mamba with these works in terms of both model performance and efficiency. For model performance, some simple experiments such as language modeling on WikiText-103 should suffice.
- Many attention-based Transformer models show length generalization ability, i.e., models can be trained on a shorter sequence length and tested on a longer sequence length. Some examples include relative positional encoding (T5) and Alibi [6]. Because SSMs are in general sequential, does Mamba have this length generalization ability?
Concerns about experiments:
- The authors need to compare with stronger baselines. The authors acknowledge that H3 was used as a motivation for the model architecture; however, they did not compare with H3 in the experiments. From Table 4 in [7], the ppl of H3 is 8.8 (125M), 7.1 (355M), and 6.0 (1.3B) on the Pile dataset, which is considerably better than Mamba's. The authors need to show comparisons with H3.
- For the pre-trained models, the authors only show results on zero-shot inference. This setting is quite limited and the results cannot support the effectiveness of Mamba well. I suggest the authors run more long-sequence experiments such as document summarization, where the input sequence is naturally long (e.g., the average sequence length of the arXiv dataset is greater than 8k).
- One of the main contributions that the authors claim is long-sequence modeling. The authors should compare with more baselines on LRA (Long Range Arena), which is essentially the standard benchmark for long-sequence understanding.
- Memory benchmarking is missing. Even though Section 4.5 is titled “speed and memory benchmark”, only speed comparisons are presented. Also, the authors should provide more detailed setups for Figure 8 (left), e.g., model layers, model sizes, details of the convolution, etc. Could the authors provide some intuition for why FlashAttention is the slowest when the sequence length is very large (Figure 8, left)?
[1] https://arxiv.org/pdf/2203.14343.pdf
[2] https://arxiv.org/pdf/2210.09298.pdf
[3] https://arxiv.org/pdf/2209.10655.pdf
[4] https://arxiv.org/pdf/2212.08136.pdf
[5] https://arxiv.org/pdf/2202.10447.pdf
[6] https://arxiv.org/pdf/2108.12409.pdf
[7] https://arxiv.org/pdf/2212.14052.pdf
Questions
See above
One of the main contributions that the authors claim is long sequence modeling. The authors should compare with more baselines on LRA (Long Range Arena), which is essentially the standard benchmark for long sequence understanding.
As elaborated in the first part of our response, in this paper we have deliberately focused on more impactful settings than small benchmarks such as LRA. We offer several additional points of justification:
- Our emphasis on the importance of pretraining is underscored by recent works such as [Amos], which indicate that Transformers are competitive with state-of-the-art SSM models on LRA when allowed pretraining. This supports our claim that small-scale train-from-scratch benchmarks such as LRA may not be the most appropriate for measuring the quality of new foundation model architectures at scale, where thus far only Transformers have been effective.
- LRA is not actually long-range compared to our selection of tasks. Our shortest-range task is standard language modeling (Figure 4), for which we use sequences of length 8192, which is already longer than (or on the same order of magnitude as) all LRA tasks. On the other hand, we have multiple settings (induction heads, genomics, and audio) involving sequences of length >1000000.
[Amos] “Never Train from Scratch: Fair Comparison of Long-Sequence Models Requires Data-Driven Priors”. https://arxiv.org/abs/2310.02980
Memory benchmarking is missing. Even though Section 4.5 is titled “speed and memory benchmark”, only speed comparisons are presented.
The memory usage is simply the size of the activation tensors, as with most deep sequence models (including FlashAttention). Mamba is in fact very memory efficient; we have included additional measurements of the training memory requirements of 125M models on 1 A100 80GB GPU. Each batch consists of sequences of length 2048. We compare to the most memory-efficient Transformer implementation we are aware of (with kernel fusion from torch.compile and with FlashAttention-2).
| Batch size | Transformer (w/ FlashAttention-2) | Mamba |
|---|---|---|
| 1 | 4.6GB | 4.8GB |
| 2 | 5.2GB | 5.8GB |
| 4 | 6.9GB | 7.3GB |
| 8 | 11.5GB | 12.3GB |
| 16 | 20.7GB | 23.1GB |
| 32 | 34.5GB | 38.2GB |
Mamba's memory requirement is comparable to a similar-sized Transformer with an extremely optimized implementation, and we expect further improvement in Mamba's memory footprint in the future. We have added this Table to Appendix F.5.
Also, the authors should provide more detailed setups of Figure 8 left, e.g., model layers, model sizes, details of the convolution, etc.
These details have been reported in the original submission. The main body directs to Appendix F.5, which includes these details:
- These benchmarks measure only the core operation (scan, convolution, attention), not the full block (e.g. they do not include the QKV projections of attention), so there is no notion of layers.
- The model dimension is fixed, and the sequence length is swept over a wide range (the exact values are listed in Appendix F.5).
- The convolution is an FFT-based convolution using PyTorch primitives (a sketch of this kind of convolution is given below).
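For reference, a minimal sketch of what an FFT-based causal convolution built from PyTorch primitives can look like (our own illustrative version, not the exact benchmarked code):

```python
import torch

def fft_causal_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    """FFT-based long causal convolution sketch.
    u: (batch, length, channels) input; k: (length, channels) causal kernel."""
    L = u.shape[1]
    n = 2 * L  # zero-pad to 2L so circular convolution matches linear (causal) convolution
    U = torch.fft.rfft(u, n=n, dim=1)
    K = torch.fft.rfft(k, n=n, dim=0)
    y = torch.fft.irfft(U * K.unsqueeze(0), n=n, dim=1)
    return y[:, :L]  # keep only the outputs for positions 0..L-1
```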
Could the authors provide some intuition for why FlashAttention is the slowest when the sequence length is very large (Figure 8 left)?
Attention (including FlashAttention) is a quadratic-time operator, and therefore is significantly slower than subquadratic operators (e.g. convolutions and linear recurrences) such as Mamba when the sequence length is long.
Many attention-based Transformer models show length generalization ability, i.e., models can be trained on a shorter sequence length and tested on a longer sequence length. Some examples include relative positional encoding (T5) and Alibi [6]. Because SSMs are in general sequential, does Mamba have this length generalization ability?
Our original submission includes length extrapolation results for the Induction Heads task, which is hypothesized to be deeply related to the generalization abilities of language models [Anth]. In Figure 3 we show that Mamba can extrapolate from sequences of length 256 to length 1000000, while no other model (including attention with variants of relative positional embeddings) extrapolates beyond length 512.
In the shared response, we include new results on measuring the length extrapolation abilities of Mamba compared to a standard Transformer.
[Anth] “In-context Learning and Induction Heads.” https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html
The authors need to compare with stronger baselines. The authors acknowledge that H3 was used as a motivation for the model architecture; however, they did not compare with H3 in the experiments. From Table 4 in [7], the ppl of H3 is 8.8 (125M), 7.1 (355M), and 6.0 (1.3B) on the Pile dataset, which is considerably better than Mamba's. The authors need to show comparisons with H3.
H3 is one of our main baselines, and is featured prominently in almost all results in the paper:
- The synthetic tasks, Selective Copying (Table 1) and Induction Heads (Figure 3).
- For Chinchilla scaling laws (Figure 4), we used a much stronger version of H3 that we call H3++ (details in Appendix F.2.1).
- For genomics, we compare against Hyena (Figure 5), which is the successor of H3.
- We have since downloaded the pretrained H3 models from [7] and ran the suite of evaluations (see shared response).
The reviewer mentions that the reported perplexity numbers from [7] are better than ours. However, as mentioned in the caption of our Table 2, perplexity scores are only comparable when models are trained with the same tokenizer. The H3 paper used the GPT-2 tokenizer, so its numbers are not comparable to ours, which use the now-standard NeoX tokenizer. Instead, the downstream evaluations are directly comparable, and there Mamba performs significantly better than H3.
For the pre-trained models, the authors only show results on zero-shot inference. This setting is quite limited and the results cannot support the effectiveness of Mamba well. I suggest the authors run more long-sequence experiments such as document summarization, where the input sequence is naturally long (e.g., the average sequence length of the arXiv dataset is greater than 8k).
We agree that the reviewer’s suggestion is a great idea, and in fact are planning to pursue such long-context applications in follow-up work. Unfortunately, these evaluations require an extensive fine-tuning pipeline in a non-autoregressive setting which fall beyond the scope of this paper; however, we strongly believe that Mamba will be effective in such settings.
The motivation of Mamba is to address the drawbacks of recurrent models while improving the efficiency of attention-based models. There are many works following the same direction: S4-diagonal [1], SGConv [2], MEGA [3], SPADE [4], and many efficient Transformer models (e.g., [5]). All of these models achieve near linear complexity, and the authors need to compare Mamba with these works in terms of both model performance and efficiency. For model performance, some simple experiments such as language modeling on Wikitext-103 should suffice.
As stated in the shared response, we deliberately focused on the perplexity of larger-scale pretraining rather than small-scale benchmarks. Nevertheless, Mamba strongly outperforms all suggested models and many more on WikiText-103, as one might expect from our general results on language.
First, we compare Mamba in the exact same setting as the Hyena paper [Poli, Table 4.3]. Beyond their reported numbers, we tune our own strong Transformer baseline. We then swap out the model for Mamba, which improves over our Transformer by 1.7 ppl and the original baseline Transformer by 2.3 ppl.
| Model (125M) | WikiText-103 ppl | Δ to Transf. |
|---|---|---|
| Transformer | 18.6 | 0.0 |
| Hybrid H3 | 18.5 | -0.1 |
| Performer | 26.8 | 8.2 |
| Reformer | 25.6 | 7.0 |
| AFT-conv | 28.2 | 9.6 |
| Linear Attn. | 25.6 | 7.0 |
| Hyena | 18.5 | -0.1 |
| Transf. (ours) | 18.0 | -0.6 |
| Mamba (ours) | 16.3 | -2.3 |
Therefore, none of these subquadratic architectures improves over a comparable Transformer baseline. Moreover, our own Transformer baseline (18.0 ppl) above is stronger than that of prior work, and our Mamba model (without tuning) is significantly better still (16.3 ppl), all in the same setting.
Next, we show results from several of the reviewer’s suggested baselines [2,3,4] as reported by their original papers. Because these results were obtained in potentially different codebases and training setups, we also report each paper’s own Transformer baseline as well as the improvement.
| Model | WikiText-103 ppl | Transformer ppl | Δ |
|---|---|---|---|
| SGConv + Attn [2] | 18.70 | 18.3 | 0.40 |
| MEGA [3] | 18.07 | 18.3 | -0.23 |
| SPADE [4] | 18.5 | 18.8 | -0.3 |
The improvements of these methods are all small (at most -0.3 ppl) compared to our results above (-2.3 ppl). Additionally, none of these three methods is linear-time: they are hybrid models that explicitly use quadratic attention. Without attention, these methods are S4-based LTI models that do not perform well on language, which is the main thesis of our paper.
Finally, we analyze the reviewer’s last suggested baseline FLASH [5]. While [5] does not report WikiText-103, we found that the recent preprint [Qin] performs controlled comparisons of a wide range of models including FLASH. Their results [Qin, Table 1] are reproduced here.
| Model (44M) | WikiText-103 ppl | Δ to Transf. |
|---|---|---|
| Transformer | 24.78 | 0.0 |
| FLASH | 26.70 | 1.92 |
| Linear Attn. | 28.05 | 3.27 |
| Performer | 63.16 | 38.38 |
| cosFormer | 27.06 | 2.28 |
| gMLP | 29.13 | 4.35 |
| Syn(D) | 32.43 | 7.65 |
| S4 | 39.66 | 14.88 |
| DSS | 41.07 | 16.29 |
| GSS | 30.74 | 5.96 |
| RWKV | 25.07 | 0.29 |
| LRU | 31.12 | 6.34 |
| TNN | 24.67 | -0.11 |
| HGRN | 24.82 | 0.04 |
Consistent with the rest of these results and the main thesis of our paper, essentially none of the other existing subquadratic architectures improves over the basic Transformer.
[Poli] “Hyena Hierarchy: Towards Larger Convolutional Language Models.” https://arxiv.org/abs/2302.10866
[Qin] “Hierarchically Gated Recurrent Neural Network for Sequence Modeling.” https://arxiv.org/abs/2311.04823
We greatly appreciate the reviewer’s feedback and suggestions. We would like to first address the reviewer’s main high-level concerns, and then respond to each detailed point individually.
The reviewer’s main concerns are about our baselines and benchmarks, with suggestions to compare against a broader range of attention-free or hybrid models on small-scale benchmarks such as WikiText-103 and Long Range Arena. We would like to reiterate the main point of our new model as stated in the Introduction of the original submission:
We propose a new class of selective state space models, that improves on prior work on several axes to achieve the modeling power of Transformers while scaling linearly in sequence length.
We would like to note that almost all of the reviewer’s suggested baselines are either:
- Already included in our baselines: For example, S4D, DSS [1], and SGConv [2] are special cases of linear time-invariant (LTI) models that are nearly identical and perform very similarly to S4 and Hyena, two of our main baselines. Similarly, H3 [7] as well as Hyena are some of our main comparisons for language modeling (Figure 4).
- Not Transformer quality, or not linear-time: A central thesis of our paper is that prior linear-time models (which are almost all LTI) do not match the performance of Transformers on information-dense data such as language. Indeed, models such as SGConv [2], MEGA [3], and SPADE [4] explicitly use hybrid models incorporating quadratic attention to achieve their results on WikiText-103 (which are still worse than ours; see new results below). If the attention is removed, all of these models become LTI state-space models like the above and would perform similarly to S4 (significantly worse than Mamba/S6).
In the detailed response below, we expand on comparisons on WikiText-103 for all the above suggested models and many more. Nevertheless, we would like to emphasize that our original submission has taken care to compare against the most modern and well-known subquadratic models that have been evaluated on large-scale pretraining such as Hyena, RWKV, and RetNet.
Additionally, the Introduction states:
We empirically validate Mamba’s potential as a general sequence foundation model backbone, in both pretraining quality and domain-specific task performance, on several types of modalities and settings.
Foundation modeling – e.g. demonstrating scaling abilities on large-scale pretraining data – has become a fundamental paradigm in modern machine learning that is arguably more important than individual benchmarks. While small-scale benchmarks such as WikiText-103 (around 100M tokens) and Long Range Arena (LRA) have been popular for demonstrating the potential of new models such as SSMs, our work deliberately moves beyond these toward showing that SSMs – with our new contributions – can actually be effective at scale (e.g. on the Pile dataset - 300B tokens - 3000x the size of WikiText-103).
We thank each reviewer for their careful reading of our work and their thoughtful feedback. All reviewers agreed that our paper addresses an important problem in sequence modeling, offers a simple and motivated improvement to prior work, and has compelling empirical results. Reviewers asked perceptive questions and comments, which are answered in detail in individual responses and have improved our submission. We have additionally run many new empirical results and analysis, including:
- We have added evaluation results for the H3 model, one of our main prior works.
- We have scaled the fully-trained models to the next larger model size (2.8B parameters, 300B tokens on the Pile), where it continues to outperform all comparable open-source models of the same size (e.g. Pythia 2.8B) while matching the performance of models twice its size (e.g. Pythia 7B, RWKV 7B).
- We show here results for length extrapolation beyond the training length, which Reviewers du8a and 5ZBk asked about.
- We have new results for WikiText-103. In our detailed response to Reviewer du8a, we analyze results from several papers to show that Mamba significantly outperforms over 20 other recent subquadratic sequence models.
In this shared response, we elaborate on some of the new results.
Evaluation of pretrained H3 models
We have downloaded the pretrained H3 models of sizes 125M-2.7B parameters from [Dao] and ran our suite of evaluations. As H3 is one of our primary motivations and our goal was to address its main weakness (linear time invariance), Mamba is significantly better on all language evaluations.
| Model | Lambada | HellaSwag | PIQA | Arc-E | Arc-C | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| Hybrid H3-130M | 25.8 | 31.7 | 64.2 | 44.4 | 24.2 | 50.6 | 40.2 |
| Mamba-130M | 44.2 | 35.2 | 64.5 | 48.0 | 24.2 | 52.3 | 44.7 |
| Hybrid H3-360M | 48.0 | 41.5 | 68.1 | 51.4 | 24.7 | 54.1 | 48.0 |
| Mamba-370M | 55.6 | 46.5 | 69.5 | 55.1 | 28.0 | 55.3 | 51.7 |
| Hybrid H3-1.3B | 49.6 | 52.6 | 71.3 | 59.2 | 28.1 | 56.9 | 53.0 |
| Mamba-1.4B | 65.0 | 59.1 | 74.2 | 65.5 | 32.8 | 61.5 | 59.7 |
| Hybrid H3-2.7B | 55.7 | 59.7 | 73.3 | 65.6 | 32.3 | 61.4 | 58.0 |
| Mamba-2.8B | 69.2 | 66.1 | 75.2 | 69.7 | 36.3 | 63.5 | 63.3 |
Note also that these H3 models are themselves hybrid models using quadratic attention, while our pure model with only linear-time Mamba layers significantly outperforms them on every metric.
[Dao] “Hungry Hungry Hippos: Towards Language Modeling with State Space Models.” https://arxiv.org/abs/2212.14052
3B parameter Mamba model
Compared to the open source models at the 3B scale trained to the same token count (300B), Mamba is better on every evaluation result. It is even comparable to models at the 7B scale: when comparing Mamba (2.8B) to OPT, Pythia, and RWKV (7B), Mamba has the best average score and best or second-best score on every benchmark.
| Model | Lambada | HellaSwag | PIQA | Arc-E | Arc-C | WinoGrande | Avg |
|---|---|---|---|---|---|---|---|
| GPT-Neo 2.7B | 62.2 | 55.8 | 72.1 | 61.1 | 30.2 | 57.6 | 56.5 |
| Hybrid H3-2.7B | 55.7 | 59.7 | 73.3 | 65.6 | 32.3 | 61.4 | 58.0 |
| OPT-2.7B | 63.6 | 60.6 | 74.8 | 60.8 | 31.3 | 61.0 | 58.7 |
| Pythia-2.8B | 64.7 | 59.3 | 74.0 | 64.1 | 32.9 | 59.7 | 59.1 |
| RWKV-3B | 63.9 | 59.6 | 73.7 | 67.8 | 33.1 | 59.6 | 59.6 |
| Mamba-2.8B | 69.2 | 66.1 | 75.2 | 69.7 | 36.3 | 63.5 | 63.3 |
| OPT-6.7B | 67.7 | 67.2 | 76.3 | 65.6 | 34.9 | 65.5 | 62.9 |
| Pythia-6.9B | 67.1 | 64.0 | 75.2 | 67.3 | 35.5 | 61.3 | 61.7 |
| RWKV-7.4B | 67.2 | 65.5 | 76.1 | 67.8 | 37.5 | 61.0 | 62.5 |
Length extrapolation
We have included an attached figure that evaluates length extrapolation of pretrained 3B parameter language models: https://ibb.co/XVT0hGJ
This figure was generated by taking the validation set of the Pile dataset, feeding in each example with no padding or concatenation, and measuring the loss per token. The mean loss (log perplexity) for each position is plotted. Note for example that the perplexity for the first token is high because it has no context, and improves up to the training context length (2048) for both Mamba and the baseline Transformer (Pythia). Interestingly, Mamba’s perplexity improves significantly past its training context up to around length 3000.
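A minimal sketch of this evaluation procedure (names such as `model(...).logits` and `tokenizer.encode` are assumptions about an HF-style interface, not the exact evaluation code):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_loss_per_position(model, tokenizer, texts, max_len=4096, device="cuda"):
    """Feed each example separately (no padding or concatenation), record the
    next-token loss at every position, and average across examples per position."""
    sums, counts = torch.zeros(max_len), torch.zeros(max_len)
    for text in texts:
        ids = torch.tensor(tokenizer.encode(text)[:max_len], device=device).unsqueeze(0)
        if ids.shape[1] < 2:
            continue
        logits = model(ids).logits  # (1, L, vocab); assumes an HF-style output object
        # loss at position t is for predicting token t+1 given tokens 0..t
        losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none").float().cpu()
        sums[: losses.shape[0]] += losses
        counts[: losses.shape[0]] += 1
    return sums / counts.clamp(min=1)  # mean loss (log perplexity) per token position
```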
We would like to emphasize that length extrapolation was not a direct motivation of our model and we view these capabilities as a bonus feature. We note that
- The baseline model (Pythia) here was not trained with length extrapolation in mind, and there are perhaps other Transformer variants that are more generalizable (e.g. T5 or Alibi relative positional encodings).
- We are not aware of any open-source 3B models trained on the Pile using relative positional encodings, and thus could not create this comparison.
- Mamba, like Pythia, was not trained with length extrapolation in mind, so the two are directly comparable. Just as Transformers have many techniques (e.g. different positional embeddings) to improve their capabilities on axes such as length generalization, in future work it may be interesting to derive SSM-specific techniques for similar capabilities.
We thank reviewers du8a and 5ZBk for the interesting suggestion.
Code and model release
We are planning to open-source all model code as well as the weights of our pretrained models of all sizes (125M to 2.8B parameters). We hope that Mamba presents a valuable contribution to the community.
This paper introduces a novel variant of state space models designed for long-range language modeling. The conducted experiments reveal notable advancements in comparison to existing models under the perplexity metric for language modeling tasks. Notably, two reviewers provided highly positive assessments (although one of them has limited prior experience with language models). However, a third reviewer, a more experienced expert in language models, raised two significant concerns pertaining to the benchmark and evaluation metric:
- Absence of results on LRA (Long Range Arena): The reviewer underscored the omission of results on LRA, a widely acknowledged benchmark for long sequence modeling. LRA's inclusion has been customary in prior research on state space models, making it imperative for a comprehensive evaluation.
- Evaluation using perplexity: The reviewer questioned the reliance on perplexity as the major metric for evaluation. References were made to Sun et al. (2021), suggesting that lower perplexities may not necessarily imply improved modeling abilities for end NLP applications. This claim has been further strengthened by Zhang et al. (2023), which highlighted the limitations of some Transformer models that achieve lower perplexity but struggle in generation tasks such as summarization and question answering.
Additionally, a minor concern was raised regarding the potential performance gap of long-range language models in short text sequences. I recommended the inclusion of supplementary experimental results to address this aspect.
To reconcile these differing perspectives, discussions were initiated with the reviewer du8a and subsequently with the senior area chair. After a meticulous examination of the paper and considering the valid concerns raised, the final decision was to recommend rejection. The concerns, particularly those related to experimental methodology and the chosen evaluation metric, were deemed substantial and not adequately addressed in the provided rebuttal. We believe that the paper could substantially benefit from addressing these concerns through adding additional experiments.
[1] Sun et al. Do Long-Range Language Models Actually Use Long-Range Context? EMNLP 2021.
[2] Zhang et al. Efficient Long-Range Transformers: You Need to Attend More, but Not Necessarily at Every Layer. EMNLP 2023.
Why not a higher score
See comments
Why not a lower score
NA
Reject
What a shame ...
Such a good paper, I don't understand why it was rejected. The reviewers overlooked a lot of things...
Simply focusing on experiment results to reject a good paper is not a good practice.
So, the big question is: why the rejection?
The reviewers' main concerns, namely the lack of LRA benchmarking and the sole focus on language modeling perplexity, are, to put it bluntly, off-target. Demanding LRA results for a language model is like asking a star basketball player to prove their skills on a football field. The tasks are just too different. While both LRA and language modeling deal with sequences, LRA focuses on long-range dependencies in algorithmic tasks, whereas language modeling deals with understanding and generating human language. Success in one doesn’t guarantee success in the other. Mamba already demonstrated state-of-the-art results in language modeling, a significant achievement on its own.
Similarly, criticizing the paper for solely focusing on perplexity while completely overlooking the included downstream task evaluations – which show Mamba outperforming other language models – is simply baffling. The authors clearly went beyond just perplexity to show Mamba's real-world potential.
And then there's the comment about Mamba’s memory complexity being quadratic "just like transformers." This shows a fundamental misunderstanding. Transformers have quadratic compute costs, not memory (they have linear). This error makes me question how thoroughly the reviewers actually delved into Mamba's architecture.