Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Abstract
Reviews and Discussion
The paper introduces Quamba2, a PTQ framework for State Space Models, which aims to reduce model size and improve computational efficiency while maintaining performance. Quamba2 supports multiple bit-width configurations (W8A8, W4A8, and W4A16) for different deployment scenarios, such as cloud services and edge devices. The framework leverages an offline sort-and-cluster approach for quantizing inputs and per-state-group quantization for input-dependent parameters, resulting in significant speed-ups and memory reduction with minimal accuracy loss. The paper demonstrates Quamba2's effectiveness through extensive experiments and evaluations on various tasks and datasets, including MMLU.
Questions to Authors
Please see the above comments.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
There are no theoretical claims.
Experimental Design and Analysis
Yes
Supplementary Material
Yes, all parts.
Relation to Prior Work
This paper proposes a new method to solve the quantization problem in SSMs.
Missing Important References
"Post-training quantization for vision transformer. Z Liu, Y Wang, K Han, W Zhang, S Ma, W Gao."
Other Strengths and Weaknesses
Strengths:
- Quamba2 supports multiple bit-width configurations (W8A8, W4A8, W4A16), addressing diverse deployment needs (e.g., cloud throughput vs. edge-device efficiency).
- The offline cluster-aware weight reordering and per-state-group quantization effectively mitigate SSMs’ sensitivity to quantization errors, improving precision without fine-tuning.
Weaknesses:
- While emphasizing speed-ups, the paper does not quantify the computational cost of offline weight reordering and clustering, which could affect deployment practicality.
- The evaluation focuses only on Mamba1/Mamba2. It remains unclear whether Quamba2 generalizes to other SSM architectures (e.g., S4, DSS, DenseMamba).
- Offline weight rearrangement may complicate integration into existing inference pipelines, especially for dynamic or frequently updated models.
Other Comments or Suggestions
Please see the above comments.
We thank the reviewer for their time and thoughtful questions. We address the reviewer's concerns point by point below.
While emphasizing speed-ups, the paper does not quantify the computational cost of offline weight reordering and clustering, which could affect deployment practicality.
We provide the detailed breakdown of the GPU hours on A5000 for our offline clustering, scale calibration, quantization, and weight packing in Table R5.
| Table R5 | Group & Reorder | Scale Calib. | GPTQ | Weight Packing | Total (hr) |
|---|---|---|---|---|---|
| Quamba2-2.7B | 0.04 | 0.05 | 0.07 | 0.01 | 0.17 |
| Quamba2-8B | 0.05 | 0.17 | 0.10 | 0.03 | 0.35 |
We note that this is a one-time cost. The quantized model can be deployed to various devices and applications.
The evaluation focuses only on Mamba1/Mamba2. It remains unclear whether Quamba2 generalizes to other SSM architectures (e.g., S4, DSS, DenseMamba).
We apply our method to the latest Jamba-1.6-Mini, a large SSM-based language model with 52 billion parameters (12B active / 52B total), and show the result in Table R12. Following the official guidelines [8], we quantize the Transformer and MoE blocks with bitsandbytes [9, 10], while keeping the Mamba blocks in half-precision. Next, we apply different precision levels to the Mamba blocks using our framework. We observe that the accuracy degradation after quantizing the Mamba blocks to low bit-width is approximately 1%. This demonstrates that our quantization framework enables effective low-bit quantization for large-scale Mamba-Transformer hybrid models, addressing a key limitation of existing LLM quantization techniques.
| Table R12 | Transformer | MoE | Mamba | Avg Acc |
|---|---|---|---|---|
| Jamba-1.6-Mini | W8A8 | W8A8 | FP16 | 78.3% |
| Quamba2-Jamba-1.6-Mini | W8A8 | W8A8 | W8A8 | 77.2% |
| | W8A8 | W8A8 | W4A8 | 77.3% |
| | W8A8 | W8A8 | W4A16 | 78.0% |
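For concreteness, a minimal loading sketch is shown below. It assumes the Hugging Face Transformers and bitsandbytes interface described in the official guidelines [8]; the `llm_int8_skip_modules` argument follows that model card, and the exact arguments used in our pipeline may differ slightly.

```python
# Minimal sketch (assumptions noted above): quantize the Transformer and MoE
# blocks of Jamba-1.6-Mini to 8-bit with bitsandbytes while keeping the Mamba
# blocks in half-precision, as recommended in [8]. The half-precision Mamba
# blocks are then quantized separately with our framework (Table R12).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_skip_modules=["mamba"],  # leave the Mamba blocks in half-precision
)

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/AI21-Jamba-Mini-1.6",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
    device_map="auto",
)
```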
Offline weight rearrangement may complicate integration into existing inference pipelines, especially for dynamic or frequently updated models.
Our framework adds only 0.05 GPU hours (~3 minutes) on an A5000 to the common GPTQ quantization pipeline, while boosting the accuracy by 13.7% for Quamba2-8B-W4A8, as shown in Table R13.
| Table R13 | Group & Reorder | Scale Calib. | GPTQ | Weight Packing | Total (hr) | Acc. |
|---|---|---|---|---|---|---|
| GPTQ | - | 0.17 | 0.10 | 0.03 | 0.30 | 55.1% |
| Ours | 0.05 | 0.17 | 0.10 | 0.03 | 0.35 | 68.8% |
Notably, our framework also supports LoRA modules for various downstream applications. Once the trained LoRA weights are fused into the main model weights, the same weight reordering and quantization pipeline is reused and applied.
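For illustration, the sketch below shows this fuse-then-quantize flow; the tensor names and shapes are hypothetical, and only the standard LoRA merge W' = W + (alpha / r) * B A is assumed.

```python
# Illustrative sketch: fuse trained LoRA factors into the base weight, then
# reuse the same cluster-aware reordering and quantization pipeline on the
# fused weight. Shapes and names are hypothetical.
import torch

def fuse_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
              alpha: float, r: int) -> torch.Tensor:
    """Standard LoRA merge: W' = W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (B @ A)

W = torch.randn(4096, 4096)   # base weight (out_features, in_features)
A = torch.randn(16, 4096)     # LoRA down-projection (r, in_features)
B = torch.randn(4096, 16)     # LoRA up-projection (out_features, r)

W_fused = fuse_lora(W, A, B, alpha=32.0, r=16)
# W_fused then goes through the same weight reordering and GPTQ steps
# as the original base weight.
```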
[8] https://huggingface.co/ai21labs/AI21-Jamba-Mini-1.6
[9] Dettmers, et al. "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale." arXiv (2022).
[10] Dettmers, et al. "8-bit Optimizers via Block-wise Quantization." ICLR (2022).
The paper introduces Quamba2, a robust and scalable post-training quantization framework tailored for Selective State Space Models (SSMs), specifically Mamba1 and Mamba2. Quamba2 leverages structural properties unique to SSMs, such as channel order preservation and activation persistence, through novel techniques like sort-and-cluster and per-state-group quantization, significantly enhancing quantization accuracy. The framework supports various bit-width configurations, achieving notable speedups and a 4× reduction in memory usage, with minimal accuracy loss. Extensive experiments demonstrate Quamba2's effectiveness, outperforming state-of-the-art approaches in both speed and accuracy, while maintaining generalizability across diverse tasks, including the challenging MMLU dataset.
Questions to Authors
See above.
Claims and Evidence
Yes, the claims made in the submission are generally well-supported by clear and convincing evidence. The paper provides comprehensive experimental results demonstrating significant speedups and memory reductions across various bit-width configurations (W8A8, W4A8, W4A16), comparing directly to relevant baselines such as MambaQuant and Quamba, thus adequately supporting claims of superiority. Key conceptual contributions, including "sort-and-cluster" and "per-state-group quantization," are clearly illustrated through ablation studies, justifying their effectiveness in addressing quantization-induced errors in Selective SSMs.
Methods and Evaluation Criteria
Yes, the proposed claims, including improved accuracy, speedup, and memory reduction via techniques like sort-and-cluster, per-state-group quantization, and cluster-aware weight reordering, are clearly supported by detailed experiments on multiple benchmarks (LAMBADA, HellaSwag, PIQA, ARC, and WinoGrande), latency profiling, and memory comparisons.
Theoretical Claims
The paper does not present explicit theoretical proofs requiring validation.
Experimental Design and Analysis
Yes, the experiments are methodologically sound and comprehensive: the authors clearly justified the benchmark choices, carefully detailed their quantization methods, and presented extensive latency and accuracy comparisons. The ablation studies effectively validate individual contributions of their techniques, such as sort-and-cluster and per-state-group quantization, which clearly isolate and demonstrate their empirical benefits.
However, the evaluation of mixed-precision robustness relies on a limited set of datasets (primarily MMLU); a broader evaluation could strengthen the robustness claims.
Supplementary Material
Yes, including results for six zero-shot downstream tasks, and the implementation and evaluation details of the Quamba2 framework.
Relation to Prior Work
The paper's key contributions connect closely to prior literature on efficient neural network quantization and compression, especially works focusing on selective SSMs. Quamba2 extends previous findings by introducing novel techniques, e.g., sort-and-cluster and per-state-group quantization, informed by SSM-specific properties (channel order preservation and activation persistence), to significantly reduce quantization errors. The paper also leverages established methods such as Hadamard transformations and weight reordering to further enhance quantization precision.
Missing Important References
No.
Other Strengths and Weaknesses
Strengths:
- Creative combination of existing ideas, e.g., quantization, weight reordering with novel SSM-specific insights.
- Strong practical significance for efficient deployment of large SSMs.
- Rigorous and clear experimental design and ablation studies.
Weaknesses:
- Limited evaluation of generalizability, primarily focused on MMLU.
Other Comments or Suggestions
See above.
We appreciate the reviewer’s positive feedback and respond to the concerns as follows.
Limited evaluation of generalizability, primarily focused on MMLU.
We evaluate our framework on additional tasks in Table R11 to show that W4AX improves generalizability. We include BoolQ (accuracy), and the generation-based tasks Natural Questions (NQ) (exact match) and SquadV2 (F1).
| Table R11 | Precision | BoolQ | NQ | Squadv2 |
|---|---|---|---|---|
| Mamba2 | FP16 | 76.5 | 17.2 | 51.9 |
| Quamba2 | W8A8 | 68 | 15.0 | 43.6 |
| | W4A8 | 64.6 | 14.2 | 45.9 |
| | W4A16 | 71.2 | 16.6 | 50.7 |
| | W4AX | 73.2 | 14.9 | 47.4 |
We sincerely appreciate the reviewer's positive feedback and will include the additional evaluations in the final version.
This paper proposes a quantization scheme for SSMs such as Mamba2. By leveraging the channel order preserving and activation persistence characteristics of SSMs, the authors mainly utilize two existing techniques, reordering and Hadamard rotation, to alleviate the quantization difficulty for SSMs and improve the resulting performance. Because not all of the additional quantization-related components are handled by offline processing, the authors also profile model inference and verify real end-to-end speed-ups. Furthermore, the authors discuss the speed/accuracy trade-off between different quantization configurations, e.g., W8A8, W4A8, W4A16, and mixed precision. Lastly, the authors demonstrate that for model sizes greater than 2.7B, quantizing the embeddings and lm_head to 4-bit has almost no accuracy penalty, so it can be used to further reduce model size for deployment.
Questions to Authors
Please see "Weakness" above.
Claims and Evidence
Yes
Methods and Evaluation Criteria
Yes
Theoretical Claims
NA, no theoretical derivations in this work.
Experimental Design and Analysis
Yes, looks reasonable, no issues.
Supplementary Material
Yes, Appendix A to C.
Relation to Prior Work
The work improves upon previous SSM quantization works. It does not seem to improve W8A8 accuracy compared to MambaQuant (Table 6), but it enables 4-bit options that improve generation speed in memory-bound cases (Table 5).
Missing Important References
No
Other Strengths and Weaknesses
Strength
- well written and easy to follow.
- covered most of the important quantization considerations and demonstrated real speed-up using custom kernels.
- promised to open source kernels.
Weakness
- Even though most of the additional processing is offline, some of the steps sound a bit more time-consuming than others, e.g., clustering. It would be beneficial to give the reader a rough idea and a few examples of the GPU hours of the key steps, maybe a breakdown table similar to Table 8 or 9.
- The value of quantizing the SSD scan and Conv layer is a bit unclear. The authors mention that the purpose of quantizing the SSD to 8-bit is to reduce memory pressure, but the general understanding is that SSD and Conv are very light in computation and constant in memory with respect to sequence length, unlike a KV cache. In fact, Table 3 shows that with input seq_len=1024, bs=8, INT8 SSD only saves ~0.5 ms out of ~3 ms; compared to Table 5 (input seq_len=1024, bs=1), where time-to-first-token for the INT8 case is ~122 ms, the saving seems negligible. Maybe the authors could elaborate more on SSD quantization and provide some justification for doing so. It would also be helpful to add this option to one of the ablation tables (8 to 10) and show the impact of SSD quantization on accuracy.
- Experiments 5.1, Line 346, "...percentile clipping on the input SSM, ..." may need to be a little more specific. For example, does "input SSM" refer to one, some, or all of (A, B, C, x)? Which percentiles were used? A simple sentence or two to explain the motivation or impact would be nice.
Other Comments or Suggestions
NA
We thank the reviewer for their detailed review and positive feedback. We address the reviewer’s questions below.
a breakdown table for GPU hours.
We provide the detailed breakdown of GPU hours for Quamba2. We report the GPU hours on A5000 for offline clustering, scale calibration, quantization, and weight packing in Table R5.
| Table R5 | Group & Reorder | Scale Calib. | GPTQ | Weight Packing | Total (hr) |
|---|---|---|---|---|---|
| Quamba2-2.7B | 0.04 | 0.05 | 0.07 | 0.01 | 0.17 |
| Quamba2-8B | 0.05 | 0.17 | 0.10 | 0.03 | 0.35 |
show the impact of SSD quantization on accuracy.
We show the impact of quantizing the causal convolution and SSD scan for Mamba2-2.7B in Table R6. We quantize the weights to 4-bit with different activation bit-width settings and report the accuracy on the Lambada dataset. We apply our quantization techniques to the SSD inputs (B, C, x).
| Table R6 | Inproj output (Conv input) | Conv output (SSD input) | SSD output + Had (Outproj input) | Acc. |
|---|---|---|---|---|
| | FP16 | FP16 | FP16 | 68.8% |
| | Int8 | FP16 | FP16 | 68.0% |
| | Int8 | Int8 | FP16 | 67.2% |
| | Int8 | Int8 | Int8 | 65.6% |
The value of quantizing SSD scan and Conv layer is a bit unclear.
The SSD update is a latency and memory bottleneck during generation, especially as batch size increases, although the SSD scan remains minor during prefilling. We profile latency (µs) of the linear layer, causal convolution, and SSD update, along with memory usage (MB) during generation for Quamba2-8B W4A8 and W4A16. As shown in Table R7, cached SSM and convolutional states grow linearly with batch size, eventually exceeding model size and dominating both latency and memory in large-batch scenarios.
| Table R7 | Model size (MB) | bsize | Linear (µs) | Conv (µs) | SSD (µs) | conv size (MB) | state size (MB) |
|---|---|---|---|---|---|---|---|
| Quamba2-8B-W4A8 | 4049 | 1 | 95.08 | 2.2 | 5.825 | 2.2 | 56 |
| | | 64 | 155.7 | 22.2 | 219.99 | 140 | 3584 |
| | | 128 | 254.37 | 53.2 | 466.65 | 280 | 7168 |
| Quamba2-8B-W4A16 | 4056 | 1 | 96.9 | 1.79 | 6.22 | 4.4 | 112 |
| | | 64 | 170.59 | 25.2 | 397.84 | 280 | 7168 |
| | | 128 | 364.22 | 47.6 | 785.46 | 560 | 14336 |
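As a back-of-the-envelope check on the scaling in Table R7, the sketch below reproduces the approximate conv/state cache sizes for the W4A8 setting under assumed Mamba2-8B-like hyperparameters (the layer count, head, group, and state dimensions below are our assumptions for illustration and are not read from the table).

```python
# Back-of-the-envelope sketch of cached state growth (assumed hyperparameters):
# 56 layers, d_inner=8192, 128 heads x head_dim 64, d_state=128, 8 state groups,
# conv width 4, and 1-byte (int8) cached states as in W4A8 (use 2 bytes for W4A16).
L, D_INNER, N_HEADS, HEAD_DIM, D_STATE, N_GROUPS, D_CONV = 56, 8192, 128, 64, 128, 8, 4
BYTES = 1

def cache_mb(batch_size: int) -> tuple:
    ssm_elems = N_HEADS * HEAD_DIM * D_STATE                   # per layer, per sample
    conv_elems = (D_INNER + 2 * N_GROUPS * D_STATE) * D_CONV   # per layer, per sample
    to_mb = lambda elems: L * batch_size * elems * BYTES / 2**20
    return to_mb(conv_elems), to_mb(ssm_elems)

for b in (1, 64, 128):
    conv_mb, state_mb = cache_mb(b)
    print(f"batch={b:3d}  conv ~{conv_mb:.1f} MB  ssm_state ~{state_mb:.0f} MB")
# prints roughly: 2.2/56 MB (b=1), 140/3584 MB (b=64), 280/7168 MB (b=128)
```

Under these assumptions, the cached SSM state grows linearly with batch size and quickly dwarfs the ~4 GB of packed model weights, which is why we also quantize the SSD inputs and cached states.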
Our W4A8 models with 8-bit SSD inputs halve the cached state size and improve time-per-output-token (TPOT, i.e., generation) latency, as shown in Table R8. End-to-end TPOT latency is reported in milliseconds (ms), and OOM indicates an out-of-memory error.
| Table R8 | Bit-width | b=1 | b=32 | b=64 | b=128 | b=256 |
|---|---|---|---|---|---|---|
| Mamba2-8B | FP16 | 22.73 | 35.74 | 49.63 | OOM | OOM |
| Quamba2-8B | W8A8 | 12.61 | 23.83 | 30.82 | 44.85 | 79.65 |
| | W4A8 | 7.43 | 15.05 | 24.65 | 44.54 | 85.26 |
| | W4A16 | 7.58 | 20.58 | 38.48 | 74.25 | OOM |
These results justify our motivation for quantizing the SSD inputs and the cached SSM states to 8-bit.
Experiments 5.1 "input SSM" refers to one, some, or all in (A, B, C, x)? Which percentile were used? explaining the motivation or impact would be nice.
For the Mamba2 baseline, we follow the setting in Quamba [5] and use their official implementation. Specifically, we apply the percentile clipping to the x input activation and per-tensor quantization for B and C for Mamba2 models. We list the clipping percentiles in Table R9.
| Table R9 | x perc. |
|---|---|
| 2.7B | 0.9995 |
| 8B | 0.999 |
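For clarity, the sketch below illustrates how such a percentile-clipped, per-tensor int8 scale can be calibrated for x. This is an illustrative reimplementation rather than the official Quamba code, and the function names and shapes are ours.

```python
# Illustrative sketch (not the official Quamba implementation): calibrate a
# symmetric per-tensor int8 scale for the SSM input x by clipping at a high
# percentile of the absolute activation values from calibration batches.
import torch

def percentile_scale(calib_acts: list, percentile: float = 0.9995,
                     n_bits: int = 8) -> torch.Tensor:
    flat = torch.cat([a.flatten().abs().float() for a in calib_acts])
    clip_val = torch.quantile(flat, percentile)   # e.g. 0.9995 (2.7B) or 0.999 (8B)
    qmax = 2 ** (n_bits - 1) - 1                  # 127 for int8
    return clip_val / qmax

def quantize_x(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
```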
In Table R10, we apply clipping to (B, C, x) for Quamba2-8B-W4A8 and compare it to our proposed weight reordering and clustering. We apply per-group GPTQ quantization to the weights and Hadamard transforms to the SSD output. We report the accuracy on the Lambada dataset.
| Table R10 | Bit-width | B/C | x | Acc. |
|---|---|---|---|---|
| Mamba2-8B | FP16 | - | - | 70.9% |
| Quamba2-8B | W4A8 | Per-tensor | Clipping | 55.1% |
| | W4A8 | Clipping | Clipping | 58.7% |
| | W4A8 | PerSG | Clipping | 60.7% |
| | W4A8 | PerSG | SnC | 68.8% |
We sincerely thank you for your valuable input, which prompted us to clarify the details of our work.
[5] Chiang, et al. "Quamba: A Post-Training Quantization Recipe for Selective State Space Models." ICLR (2025).
Thanks for the clarifications. Please do include the justifications in the revised version. I'd like to keep my assessment unchanged.
Thank you again for your positive feedback and thorough review. We will incorporate the results into the revised version.
This work introduces Quamba2, a novel post-training quantization (PTQ) framework designed for State Space Models (SSMs), particularly the Mamba1 and Mamba2 architectures. The work addresses the challenge of efficiently scaling SSMs for deployment in cloud and edge computing environments by optimizing low-bit-width quantization techniques (W8A8, W4A8, and W4A16). Unlike previous methods that struggle with quantization-induced errors in SSMs, Quamba2 leverages channel order preservation and activation persistence to improve quantization accuracy. The framework employs a sort-and-cluster technique for input processing and per-state-group quantization for model parameters, ensuring computational consistency while minimizing accuracy loss. Experimental results demonstrate that Quamba2-8B achieves up to 3× generation speed-up, 4× memory reduction, and only a 1.6% average accuracy drop compared to full-precision models.
Questions to Authors
Please clarify the novelty and the experimental design choices as explained above.
Claims and Evidence
While the claims regarding the technicality and results are well-supported, some claims remain problematic in my opinion.
- The paper claims Quamba2 outperforms "several state-of-the-art SSM quantization methods," but there are comparisons to only two previous methods (MambaQuant and Quamba).
- While the paper demonstrates deployment on Nvidia Nano 8G, it is not clear whether Quamba2 can enable real-world performance for edge applications better than non-SSM quantized models.
- The paper mentions evolutionary search for mixed precision configurations but doesn't thoroughly explain the search space and constraints.
Methods and Evaluation Criteria
Mostly yes.
Theoretical Claims
Yes.
Experimental Design and Analysis
While the experimental design is generally sound and appropriate for the claims made, it would be better if the authors could add more statistical rigor and clearly report the experimental parameters.
- The paper doesn't clearly specify the batch sizes for all accuracy evaluations, which could affect performance metrics. Statistical significance testing is absent when comparing different methods, making it difficult to determine if observed differences are meaningful or just noise.
- The ablation studies focus primarily on the 8B model; additional ablations on smaller models would strengthen generalizability claims.
- The 3:1 ratio of W4A8:W4A16 seems predetermined rather than being a result of optimization.
Supplementary Material
Yes, all of it.
Relation to Prior Work
The paper's novelty is somewhat incremental. While Quamba2 introduces techniques like sort-and-cluster and per-state-group quantization specifically for SSMs, these methods largely adapt existing quantization approaches to the unique properties of State Space Models. The core insights about channel persistence and state persistence in SSMs are intriguing, but the paper primarily builds upon and extends two recent works (MambaQuant and Quamba) rather than presenting fundamentally new quantization paradigms. Core approaches like weight grouping, per-channel/group quantization scaling, Hadamard transforms to smooth activation distributions, and mixed-precision configurations are all established techniques in the Transformer quantization literature. The mixed-precision configuration approach, while practical, employs standard evolutionary search methods seen in other domains.
Missing Important References
- This paper does not cite some of the relevant literature on quantization of Mamba models (when there are not a lot of works yet on this topic). For example, [1] explores binary (1-bit) quantization for state space models and addresses some of the same challenges with linear recurrence sensitivity. While more extreme than Quamba2's focus, this work provides important context on the quantization limits of SSMs, and it would be important to compare the performance-efficiency trade-off with this work.
- The paper uses evolutionary search for mixed precision but does not cite relevant prior work on hardware-aware neural architecture search that specifically targets bit-width selection, such as HAQ [2] or HAWQ [3], which pioneered similar approaches for Transformer models.
[1] Tang, S., Ma, L., Li, H., Sun, M., & Shen, Z. (2024). Bi-mamba: Towards accurate 1-bit state space models. arXiv preprint arXiv:2411.11843.
[2] Wang, K., Liu, Z., Lin, Y., Lin, J., & Han, S. (2019). HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
[3] Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., & Keutzer, K. (2019). HAWQ: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV).
Other Strengths and Weaknesses
Please see the points above.
Other Comments or Suggestions
N/A
We thank the reviewer for their insightful comments. We address the reviewer's concerns below. References [1-3] are from the review.
The paper's novelty is somewhat incremental.
Our SSM quantization framework introduces novel observations, open-sources low bit-width kernels, and explores mixed-precision strategies to improve generalization, which distinguishes it from prior SSM quantization work [4, 5]. We summarize the key novelty and contributions of our work:
- We identify three key properties in SSMs: channel order, channel persistence, and state persistence.
- Based on these, we propose sort-and-cluster and per-state-group quantization (see the sketch after this list), validated on large-scale datasets.
- We explore W4AX mixed-precision quantization to boost robustness and generalization of low-bit SSMs.
- Our framework supports W4A8, W4A16, W4AX, and W8A8, enabling flexible deployment and speedups across platforms.
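The sketch below illustrates the sort-and-cluster idea referenced in the list above. The cluster count and the equal-size grouping rule are simplifications for illustration and do not reproduce the exact recipe in the paper.

```python
# Conceptual sketch of sort-and-cluster (simplified, illustrative grouping rule):
# sort channels by their calibrated maxima, which persist across inputs, then
# assign one quantization scale per cluster of similarly-sized channels.
import torch

def sort_and_cluster(calib_x: torch.Tensor, n_clusters: int = 8, n_bits: int = 8):
    """calib_x: (num_tokens, num_channels) calibration activations."""
    ch_max = calib_x.abs().amax(dim=0)    # per-channel maxima (persistent across inputs)
    order = torch.argsort(ch_max)         # channel reordering, applied offline to the weights
    sorted_max = ch_max[order]
    groups = torch.chunk(torch.arange(order.numel()), n_clusters)
    qmax = 2 ** (n_bits - 1) - 1
    scales = torch.stack([sorted_max[g].max() / qmax for g in groups])
    return order, groups, scales
```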
Outperform only two previous methods.
To the best of our knowledge, the most up-to-date, peer-reviewed, and strongest SSM baselines were included by the Jan 30 deadline. We compare against and outperform the two latest state-of-the-art SSM quantization methods, MambaQuant and Quamba, which were accepted to ICLR 2025 on Jan 22. We will change "several state-of-the-art" to "the two latest state-of-the-art" in our manuscript.
Compared with 1-bit Bi-mamba
We compare our framework against the 1-bit Bi-Mamba on Mamba2-2.7B in terms of storage, training tokens, GPU hours, and cost versus the average accuracy over PIQA, HS, WG, ARC-e, and ARC-c in Table R1. We estimate the A100 GPU hours for a fair comparison.
| Table R1 | Storage | Tokens | GPU Hrs | Cost | Avg Acc |
|---|---|---|---|---|---|
| Bi-mamba-2.7B-W1A16 | 0.55 GB | 105B | 7822 | $32070 | 52.68 |
| Quamba2-2.7B-W4A8 | 1.4 GB | 0.18M | 0.05 | $0.205 | 61.36 |
Although QAT-based Bi-Mamba offers better storage efficiency, it requires several orders of magnitude more tokens, GPU hours, and cost (more than 100,000×) compared to our PTQ framework.
The paper doesn't clearly specify the batch sizes for all accuracy evaluations. Statistical significance testing is absent.
We use a batch size of 16, a fixed random seed, and report the average accuracy over five runs (Section 5.1) for all experiments. We include standard deviation in Table R2 for Quamba2-8B-W4A8. Notably, our method outperforms all baselines on Mamba2 beyond the reported deviations.
| Table R2 | LA | HS | PIQA | Arc-E | Arc-C | WG | Avg |
|---|---|---|---|---|---|---|---|
| Quamba2-8B-W4A8 | 68.6±0.10 | 77.1±0.23 | 79.1±0.21 | 75.0±0.24 | 46.0±0.39 | 68.7±0.28 | 69.1±0.17 |
The ablation studies focus primarily on the 8B model.
We conduct the same ablation study for Quamba2-2.7B-W4A8 and show the results in Table R3. We note that the group size of B/C for the 2.7B model is one, making per-state-group quantization equivalent to per-tensor quantization. Our framework in the last row outperforms all other settings.
| Table R3 | Weights: PerG | Weights: GPTQ | Had. | B/C: PerSG | x: SnC | Acc. |
|---|---|---|---|---|---|---|
| FP16 | - | - | - | - | - | 69.5% |
| W4A8 | ✓ | - | - | - | - | fail |
| | ✓ | ✓ | - | - | - | 39.8% |
| | ✓ | ✓ | ✓ | - | - | 51.2% |
| W4A8 (Ours) | ✓ | ✓ | ✓ | - | ✓ | 65.6% |
Quamba2 better than non-SSM quantized models on edge?
We include the 4-bit Llama-3-8B [7] in Table R4 for reference. Due to time constraints, we report only the Time-To-Last-Token (TTLT) in seconds on an A5000 GPU, using 2K input tokens and 2K generated tokens, along with the average accuracy across six zero-shot tasks. We will include latency results on the Nano 8G in the next version of our manuscript.
| Table R4 | Bit-width | Avg Acc | TTLT (2k+2k) | Memory |
|---|---|---|---|---|
| Llama-3-8B | FP16 | 70.4% | 48.52 | 15.4G |
| Llama-3-8B-QServe | W4A8KV4 | 69.0% | 24.63 | 5.7G |
| Mamba2-8B | FP16 | 70.7% | 47.3 | 15.7G |
| Quamba2-8B (ours) | W4A8 | 69.1% | 15.4 | 4.0G |
Explain the search space and constraints. The 3:1 ratio of W4A8/A16 seems predetermined.
Our search space consists of N^2 configurations, where N denotes the number of layers. We fix the W4A8/A16 ratio and search for the precision for each layer to achieve the best accuracy. We will include more configurations in our manuscript.
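The sketch below illustrates this kind of ratio-preserving, layer-wise search. `evaluate_accuracy` is a placeholder for evaluating a quantized configuration, and the simple mutation scheme shown is an illustration rather than our exact evolutionary search.

```python
# Illustrative sketch: search per-layer W4A8/W4A16 assignments while keeping the
# overall ratio fixed (assumes 0 < n_w4a16 < n_layers). `evaluate_accuracy` is a
# placeholder that scores a candidate configuration on a validation set.
import random

def search_mixed_precision(n_layers, n_w4a16, evaluate_accuracy,
                           generations=50, population=16):
    def random_config():
        cfg = ["W4A8"] * n_layers
        for i in random.sample(range(n_layers), n_w4a16):
            cfg[i] = "W4A16"
        return cfg

    def mutate(cfg):
        # swap one W4A16 layer with one W4A8 layer so the ratio stays fixed
        cfg = list(cfg)
        i = random.choice([k for k, p in enumerate(cfg) if p == "W4A16"])
        j = random.choice([k for k, p in enumerate(cfg) if p == "W4A8"])
        cfg[i], cfg[j] = cfg[j], cfg[i]
        return cfg

    best = max((random_config() for _ in range(population)), key=evaluate_accuracy)
    for _ in range(generations):
        candidate = max((mutate(best) for _ in range(population)), key=evaluate_accuracy)
        if evaluate_accuracy(candidate) >= evaluate_accuracy(best):
            best = candidate
    return best
```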
Evolutionary search for mixed precision has been seen in other domains. Cite relevant work on hardware-aware NAS targeting bit-width selection.
Our work addresses the performance gap for low bit-width SSMs on large datasets via mixed precision with generic speedups, which differs from work on CNNs [2, 3], Transformers [6], and prior SSM quantization [4, 5]. We will cite and discuss the relevant work in our final version.
[4] Xu, et al. "MambaQuant: Quantizing the Mamba Family with Variance Aligned Rotation Methods." ICLR (2025).
[5] Chiang, et al. "Quamba: A Post-Training Quantization Recipe for Selective State Space Models." ICLR (2025).
[6] Zhao, et al. "Automatic mixed-precision quantization search of bert." IJCAI (2021).
[7] Lin, et al. "QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving." MLSYS (2025).
The authors debut Quamba2, a new post-training quantization (PTQ) method for Mamba state-space models (SSMs). Unlike recent Mamba PTQ methods which may struggle at W4A8 precision--MambaQuant (which applies the Karhunen-Loève transformation and equalizes SSM-channel variance) and Quamba (which performs symmetric uniform quantization)--Quamba2 introduces a sort-and-cluster technique for input processing and per-state-group quantization to preserve accuracy at lower precisions. The paper identifies and leverages the channel order preserving property and activation persistence of SSMs. Across both Mamba-1 (1.4B, 2.8B) and Mamba-2 (2.7B, 8B) models, zero-shot evaluations (averaged across the standard datasets used in the Mamba-1 paper), 5-shot MMLU, and several precisions (W8A8, W4A8, W4A16), the authors show the effectiveness of their approach; they remain competitive with MambaQuant at 8-bit precisions while outperforming both MambaQuant and Quamba at 4-bit precisions. Additionally, they perform extensive ablations across Mamba-2 model sizes, showing that language modeling degradation decreases as model sizes grow. This is an overall interesting study on PTQ for Mamba SSMs; I look forward to the ensuing code release.
Reviewer hMgy raised concerns with the compute overhead of the clustering step, but the authors provided additional experiments during the rebuttal demonstrating the modest runtime (in GPU hours) of each Quamba2 PTQ step for the Mamba-2-2.7B and Mamba-2-8B models. The authors successfully addressed the other reviewers' concerns, pointing to the novelty of their approach compared to the very recently published methods MambaQuant and Quamba.