PaperHub
Rating: 7.0/10
Poster · 3 reviewers
Scores: 4, 3, 4 (min 3, max 4, std. dev. 0.5)
ICML 2025

Understanding Input Selectivity in Mamba: Impact on Approximation Power, Memorization, and Associative Recall Capacity

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

In this work, we demystify the role of input selectivity in Mamba, investigating how it impacts its function approximation power, long-term memory, and associative recall capabilities.

Abstract

Keywords
Mamba · State Space Model · Input selectivity · Function Approximation · Memorization · MQAR

Reviews and Discussion

Review (Rating: 4)

This paper analyzes the flexibility of Mamba, showing that the S6 layer can i) project the input onto a Haar wavelet basis, ii) counteract memory decay, and iii) solve the multi-query associative recall (MQAR) tasks proposed by Arora et al. While this is mostly a theoretical paper, the authors demonstrate the applicability of their theory on the approximation of discontinuous functions and on the counteraction of memory decay via the KEEP n-TH task, which requires memorization of the n-th token in a sequence. For the full Mamba model, they confirm empirically that the model sizes prescribed theoretically by their analytical solutions to the MQAR and INDUCTION HEADS tasks are tight in practice.
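
To make the two synthetic tasks concrete, here is a minimal data-generation sketch. The exact token layout, vocabulary sizes, and training targets used in the paper may differ; the function names below (keep_nth_example, mqar_example) are purely illustrative.

    import random

    def keep_nth_example(seq_len, n, vocab_size=16):
        """KEEP n-TH (schematic): the target is the n-th token of the input, so a
        sequence model must carry that token across all remaining steps."""
        tokens = [random.randrange(vocab_size) for _ in range(seq_len)]
        return tokens, tokens[n - 1]  # n is 1-indexed

    def mqar_example(num_pairs, num_queries, vocab_size=64):
        """MQAR (multi-query associative recall, Arora et al.): a prefix of key-value
        pairs is followed by several queried keys; each query must be answered with
        the value bound to that key earlier in the sequence."""
        keys = random.sample(range(vocab_size), num_pairs)
        values = [random.randrange(vocab_size) for _ in range(num_pairs)]
        binding = dict(zip(keys, values))
        prefix = [tok for kv in zip(keys, values) for tok in kv]  # k1 v1 k2 v2 ...
        queries = random.sample(keys, num_queries)
        targets = [binding[q] for q in queries]
        return prefix + queries, targets

    print(keep_nth_example(seq_len=10, n=3))
    print(mqar_example(num_pairs=4, num_queries=2))

Both tasks probe memory: KEEP n-TH isolates long-term retention of a single token, while MQAR additionally requires binding and retrieving multiple key-value associations.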

Update after rebuttal

I did not see anything to alter my score. I maintain a positive impression of the work.

Questions for Authors

None!

Claims and Evidence

These results are proven rigorously.

Methods and Evaluation Criteria

The empirical studies add a little, as they show that the theoretical bounds are not too loose.

Theoretical Claims

I have read the proof sketches in the main text but not the detailed proofs in the supplementary material.

Experimental Design and Analysis

I have looked at the experiments but have not delved too deeply into them, this being mostly a theoretical paper.

Supplementary Material

Mostly Section D, with additional experimental detail.

Relation to Prior Work

For several years now, there has been theoretical work probing which tasks transformers are able to solve. The work contrasting transformers and state space models has mostly focused on what transformers can do and state space models cannot (see, for example, Jelassi et al., 2024). However, as new state space models begin to achieve competitive performance, theoretical interest in the capacity of state space models increases. This work is a useful addition to such literature.

Missing Important References

I am not in a position to comment on it.

Other Strengths and Weaknesses

The paper is quite clearly written.

Other Comments or Suggestions

None!

Author Response

Common Response

We refer the reviewer to the Common Response in the Rebuttal to WpKK.

Individual Response

We thank the reviewer once again for their positive feedback, and remain open to incorporating any suggestions for improvement or answering any questions they might have.

Reviewer Comment

I would like to maintain my score. Best of luck!

Review (Rating: 3)

This paper provides theoretical justifications for the selective SSM layer (S6) in the Mamba architecture. They show that:

  1. S6 has better expressiveness than the S4D layer;
  2. S6 suffers from exponential memory decay;
  3. a 1-layer Mamba (with S6) solves MQAR tasks with an SSM mixer; MQAR is an information-retrieval task that normally requires more than two transformer layers to solve.

The authors use sufficient numerical results to back up their theoretical findings.
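
To make the S6 vs. S4D contrast concrete, here is a toy single-channel NumPy sketch of where input selectivity enters the recurrence. The parameterization (w_delta, w_B, w_C) is a deliberate simplification and not the actual Mamba layer, which uses learned projections, a ZOH discretization of B, and many channels.

    import numpy as np

    def s4d_scan(x, lam, B, C, delta):
        """S4D-like diagonal SSM: A_bar, B, C are the same at every time step."""
        h = np.zeros(lam.shape[0])
        A_bar = np.exp(delta * lam)           # fixed decay per state dimension
        y = np.empty_like(x)
        for t, x_t in enumerate(x):
            h = A_bar * h + delta * B * x_t   # input-independent transition
            y[t] = C @ h
        return y

    def s6_scan(x, lam, w_delta, w_B, w_C):
        """S6-like diagonal SSM: Delta_t, B_t, C_t depend on the token x_t, so the
        model can choose, per input, how fast to forget and what to write/read."""
        h = np.zeros(lam.shape[0])
        y = np.empty_like(x)
        for t, x_t in enumerate(x):
            delta_t = np.log1p(np.exp(w_delta * x_t))  # softplus -> positive step size
            B_t, C_t = w_B * x_t, w_C * x_t            # input-dependent projections
            h = np.exp(delta_t * lam) * h + delta_t * B_t * x_t
            y[t] = C_t @ h
        return y

    rng = np.random.default_rng(0)
    N, T = 4, 16
    x = rng.standard_normal(T)
    lam = -np.abs(rng.standard_normal(N))     # negative eigenvalues for stability
    print(s4d_scan(x, lam, rng.standard_normal(N), rng.standard_normal(N), delta=0.1))
    print(s6_scan(x, lam, w_delta=0.5, w_B=rng.standard_normal(N), w_C=rng.standard_normal(N)))

The only structural difference between the two scans is whether delta, B, and C are allowed to depend on x_t, which is exactly the input selectivity whose effects the paper analyzes.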

Update after rebuttal

Questions for Authors

.

Claims and Evidence

Yes.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

Yes. I checked the lemmas and theorems and their proofs.

Experimental Design and Analysis

Yes. I checked all verification experiments and their figures and tables.

Supplementary Material

Yes, I skimmed through both the proofs and the code. I did not run the code, and I did not check the proofs line by line.

Relation to Prior Work

This work connects to both the transformer and Mamba literature. The relevant work is covered in Sec. 2.

Missing Important References

Not that I can think of.

Other Strengths and Weaknesses

Strengths

  • Clarity: the language and formatting of this paper are of very high quality.

  • Originality: the results and proposed method are original. There are lots of Mamba papers, but this one is very refreshing yet solid.

  • Significance: I believe this work is significant. The theory echoes closely with practice and provides practically useful insights for practitioners.

Weaknesses

  • (minor): SD4 is mentioned without definition in the abstract (and on the first page).

Overall, a good theory paper with well-executed numerical support. I lean toward acceptance.

Other Comments or Suggestions

.

Author Response

Common Response

We refer the reviewer to the Common Response in the Rebuttal to WpKK.

Individual Response

SD4 is mentioned without definition in abstract (and the 1st page).
We thank the reviewer for pointing out the mistake: this has been corrected in the text. We further thank the reviewer for their favorable review, and remain ready to address any question or suggestion for improvement they may have.

Review (Rating: 4)

This paper aims to understand the effect of gating in Mamba models in terms of function approximation power, long-term memory, and associative recall capabilities. Both theoretical derivations and empirical results are provided.

Questions for Authors

See the questions under "Theoretical Claims".

Claims and Evidence

Three major claims, as outlined in the paper:

  • The S6 layer is more expressive than S4D because S6 can represent projections onto Haar wavelets, which presumably reflects the capability of approximating discontinuous functions.
  • S6 (akin to RNN models) suffers from memory decay.
  • There exists an analytical solution to MQAR using the Mamba architecture, which reveals the advantage of input selectivity.

All claims are supported by theoretical derivations, along with some constructed tasks to show their tightness.
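
As a quick numerical illustration of the first claim (standard wavelet math, not the paper's S6 construction): projecting a discontinuous step function onto Haar wavelets and reconstructing it from a few dyadic levels shows how such a basis captures jumps.

    import numpy as np

    def haar(t):
        """Haar mother wavelet on [0, 1)."""
        return np.where((0 <= t) & (t < 0.5), 1.0,
                        np.where((0.5 <= t) & (t < 1.0), -1.0, 0.0))

    def psi(j, k, t):
        """Dyadic dilation/translation: psi_{j,k}(t) = 2^{j/2} * psi(2^j t - k)."""
        return 2 ** (j / 2) * haar(2 ** j * t - k)

    t = np.linspace(0, 1, 4096, endpoint=False)
    dt = t[1] - t[0]
    f = (t >= 0.3).astype(float)                  # a discontinuous target function

    # L2 projection onto the constant (scaling) function plus Haar wavelets up to level J
    J = 5
    approx = np.full_like(t, np.sum(f) * dt)      # mean component <f, 1> * 1
    for j in range(J):
        for k in range(2 ** j):
            coeff = np.sum(f * psi(j, k, t)) * dt
            approx += coeff * psi(j, k, t)

    print("max reconstruction error:", np.max(np.abs(f - approx)))

The partial sum matches the step function everywhere except inside the dyadic cell containing the jump, and that cell shrinks as the level J grows, which is why a Haar-type basis is a natural probe for approximating discontinuous functions.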

Methods and Evaluation Criteria

The empirical evaluation relies on the MQAR benchmark, which is widely used to assess associative recall capability.

Theoretical Claims

I have checked the function approximation analysis and the sensitivity analysis in Sec. 4.1 and 4.2. Below are questions on whether the analysis can be generalized:

  • Does Mamba2 fit into the function approximation analysis?
  • Does the analysis shed light on the expressivity differences between Mamba1 and Mamba2?

Experimental Design and Analysis

The MQAR and Induction Head experiments are well-suited for differentiating the expressive power of S6 and S4D.

Supplementary Material

Yes, I skimmed the proof on approximation power.

Relation to Prior Work

The submission connects to the broad research effort on sub-quadratic/efficient architectures of LLMs. While most existing work focuses on empirical evaluations of pretraining results, this work provides a valuable perspective in understanding a core design element (i.e., gating) of Mamba models, and might inspire follow-up work on better parameterization of gating.

Missing Important References

NO.

Other Strengths and Weaknesses

NO.

Other Comments or Suggestions

No.

Author Response

Common Response

We are grateful for the positive comments from reviewers on our paper, particularly regarding its clarity (kLHf: “this paper is of very high quality”; R4Me: “quite clearly written”), soundness (kLHf: “very refreshing yet solid”; R4Me: “These results are proven rigorously”), and impact (kLHf: “I believe this work is significant. The theory echoes closely with practice and provides practically useful insights for practitioners”; R4Me: “a useful addition to such literature”; WpKK: “this work provides a valuable perspective in understanding a core design element...and might inspire follow-up work”).

We answer the remaining questions from the reviewers in the Individual Response section. Additionally, if the reviewers have any other remarks or recommendations that can help further improve the quality of the paper, we remain at their disposal to address them.

Individual Response

Does Mamba2 fit into the function approximation analysis?
Indeed it does, and the reviewer is right that it is worth highlighting more clearly. We do briefly mention (Sec. 3, after Eq. (7)) how the Mamba2 layer is a simplification of Mamba that prescribes a state matrix parameterized by a single scalar, $\boldsymbol{\Lambda} = \lambda \boldsymbol{I}$. Substituting this into (8) would remove the dependency of $\lambda$ on the hidden-state component $n$, that is, we would have $\lambda_n \equiv \lambda$ for all $n = 1, \ldots, N$ in the integral. Nonetheless, for the proof in Sec. 4.1 it suffices to set $\lambda_n = -1$ for all $n$ (see Line 605), so the proof still holds in the Mamba2 framework. A similar reasoning also applies to Sec. 4.2, where again for Mamba2 $\lambda_n \equiv \lambda$ would be constant over $n = 1, \ldots, N$, but this still does not affect the overall derivation. Following the reviewer's remark, we have added a note in the main text on the validity of both Thm. 1 and Lem. 1, 2 for Mamba2.
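
Schematically, the parameterizations discussed in this response can be summarized as follows (a restatement for readability, not the paper's exact Eq. (7)-(8)):

    \[
    \begin{aligned}
    \text{Mamba (S6):}\quad & \boldsymbol{\Lambda} = \operatorname{diag}(\lambda_1, \ldots, \lambda_N), && \lambda_n \ \text{free per state dimension } n, \\
    \text{Mamba2:}\quad & \boldsymbol{\Lambda} = \lambda \boldsymbol{I}, && \lambda_n \equiv \lambda \ \text{for all } n = 1, \ldots, N, \\
    \text{Proof of Sec. 4.1:}\quad & \lambda_n = -1 \ \text{for all } n, && \text{so the construction is unaffected by the Mamba2 restriction.}
    \end{aligned}
    \]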

Does the analysis shed light on the expressivity differences between Mamba1 and Mamba2?
The analysis in Sec. 4 (see also the response above) highlights that, from the point of view of function expressivity alone, if we consider only the SSM layer, then there is no expressivity difference between Mamba and Mamba2 in approximating Haar wavelets: Mamba2's simplification of setting $\lambda_n \equiv \lambda$ across state dimensions does not hinder this ability.
Nonetheless, the performance difference between Mamba and Mamba2 has been clearly shown empirically. Motivated by this, we extended our analysis to the whole architecture in Sec. 5. Our results in Thm. 2 and 3 highlight that, thanks to its per-SSM-parameter short-convolution layers, Mamba2 can recover more parameter-efficient solutions than Mamba for the MQAR synthetic task, hinting at a possible explanation for its superior performance. We believe analyzing other, more complex tasks is an interesting next step toward shedding more light on the expressivity differences between Mamba and Mamba2.
We thank the reviewer for raising this point. We took this chance to update the manuscript to highlight more clearly the implications that Thm. 2 and Thm. 3 have for the relative parameter efficiency of Mamba and Mamba2 in solving MQAR.

Final Decision

The paper received strong reviews. Reviewers emphasize clarity, originality, and significance. Several reviewers checked the proofs and attested to their rigor. The paper is well-written and tackles important open problems in state space modeling.