Generalization Error Analysis for Selective State-Space Models Through the Lens of Attention
The paper derives generalization bounds for selective SSMs using connections to self-attention, showing that spectral properties of the state matrix influence generalization.
Abstract
Reviews and Discussion
This paper investigates how well selective SSMs generalize as the sequence length grows.
The authors reuse covering-number tools developed for Transformer theory. They give generalization upper (Theorem 3.3) and lower (Theorem 4.1) bounds for selective SSMs, showing that when the transition matrix has non-negative spectral abscissa, an exponential dependence on sequence length is tight and unavoidable.
The authors ground their theory with two experiments, each initialized with a different spectral abscissa s_A:
- Experiment 1 (unstable start, s_A>0) – The model stays in the unstable regime on long sequences, so training diverges.
- Experiment 2 (stable start, s_A ≤ 0) – The model remains stable, and its train-test gap stays flat as sequence length grows, demonstrating length-independent generalization.
Strengths and Weaknesses
Strengths.
- Novel theory: Theorems 3.3 and 4.1 give the first length-independent generalization bound for selective SSMs, plugging a gap left by prior LTI-SSM work.
- Insight: The bound scales as e^{s_A T} for s_A>0 but is flat in T when s_A<0, pinpointing stability as the decisive factor in SSM initialization.
- Transparency: Assumptions 3.1–3.6 state all Lipschitz and norm constraints, and every proof step in App. A cites them, making the analysis easy to audit.
Weaknesses.
- Narrow empirics: Only two tasks and a single-block model; no Transformer or S4 baseline, so real-world relevance is unclear.
- No statistical rigor: All plots come from one seed, with no error bars or significance tests.
- Assumptions unverified: The paper never logs input-token norms or Δ parameters, so it’s unknown whether training stays within the bound’s regime.
- Related-work coverage is thin, and the limitations section is vague.
Questions
Suggestions:
- Broaden the empirical study:
- Train deeper selective-SSM stacks (6–8 blocks) to test whether instability compounds with depth.
- Add matched-parameter baselines—Transformer, Performer, and S4D—on standard long-range tasks such as ListOps and PG-19.
- Sweep the initial spectral abscissa s_A while holding all other hyperparameters fixed to isolate the stability factor.
- Clarify theory: Either tighten the lower bound to close the residual O(T) gap or argue why that gap is information-theoretic and cannot be removed.
- Verify assumptions: Log input-token norms and key parameter magnitudes during training; confirm they stay within the theoretical bounds or explain any deviations.
Limitations
The authors do not adequately address the paper’s limitations. While they point to Section 4, it only compares their bounds to prior work and omits key issues like limited experiments, unverified assumptions, and evaluation scope. There is no dedicated limitations section, despite the checklist guidance to include one.
Justification for Final Rating
I thank the authors for their detailed rebuttal. They addressed several concerns, including statistical rigor (by reporting results over multiple seeds), empirical verification of theoretical assumptions (via parameter norm logs), and additional results on ListOps. While these clarifications improve the presentation and strengthen the empirical support, the scope of evaluation remains narrow and the relevance to full-model SSM architectures is still limited.
Given the paper’s theoretical novelty and the transparent analysis of selective SSM generalization, I am maintaining my borderline accept rating.
Formatting Issues
I did not notice any major formatting issues in this paper.
We thank the reviewer for recognizing the novelty and clarity of our theoretical contributions. Due to character limits, we kindly refer the reviewer to our responses to other reviewers for some overlapping points, and we apologize for any inconvenience this may cause. Below is our response to the weaknesses and questions.
Weaknesses:
- Our work focuses on the theoretical analysis of a single selective SSM block rather than extensive empirical benchmarking. However, we acknowledge the limited number of baselines in this paper and address this concern by providing additional empirical results on the ListOps benchmark, as detailed in our response to the third weakness raised by Reviewer WxU3.
- We agree with the reviewer that statistical rigor is important. To address this, we repeated the length-independence experiment for the majority task using five different random seeds and ensured that training was successful in each run. The table below reports the mean and standard deviation across seeds:
| Seq Len | 50 | 100 | 150 | 200 | 250 | 300 | 350 | 400 |
|---|---|---|---|---|---|---|---|---|
| Train Mean | 0.95 | 0.96 | 0.96 | 0.97 | 0.98 | 0.98 | 0.97 | 0.98 |
| Train Std | 0.013 | 0.008 | 0.008 | 0.009 | 0.015 | 0.008 | 0.006 | 0.007 |
| Test Mean | 0.88 | 0.87 | 0.89 | 0.89 | 0.87 | 0.87 | 0.87 | 0.88 |
| Test Std | 0.028 | 0.013 | 0.018 | 0.019 | 0.027 | 0.018 | 0.014 | 0.015 |
| Gen Gap | 0.0758 | 0.0926 | 0.0732 | 0.0790 | 0.1024 | 0.1068 | 0.0996 | 0.1044 |
As shown in the table, the generalization gap shows no clear trend, and the small standard deviations indicate statistical significance. Similar results for ListOps are reported in the response to Reviewer WxU3. IMDb is omitted due to space constraints.
- We thank the reviewer for this valuable comment and fully agree that logging parameter norms to verify theoretical assumptions is important. Our main assumptions (3.1-3.4) require certain SSM parameter norms to remain bounded by constants. However, these constants are neither universal nor predetermined, which makes direct verification against a single fixed numerical threshold less intuitive.
Still, we were on the same page with the reviewer and had already checked these norms during our experiments. We monitored the norms of the SSM parameters appearing in Assumptions 3.1–3.4, along with the spectral abscissa, throughout training. Our observations confirm that these norms remain stable and do not exhibit unbounded growth. Regarding the input tokens, they are generated via a learnable linear projection layer. As such, their norm is controlled by design, and our boundedness assumption is satisfied by construction. For completeness, we include the input norm here.
The logs are available in the GitHub repository linked in Footnote 2 (page 8). As a minor note, the repository records −s_A (instead of s_A) due to an earlier version of our code that computed stability margins directly via negative spectral abscissa values. These results were omitted from the main paper due to space constraints, but we will include a section in the appendix during revision. Additionally, we provide an example log table in our response to Question 3 below.
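For illustration, here is a minimal sketch of this kind of per-epoch logging (the parameter names `A`, `W_B`, `W_C` and the dictionary layout are hypothetical stand-ins, not the exact variables of our implementation):

```python
import numpy as np

def spectral_abscissa(A: np.ndarray) -> float:
    """Largest real part among the eigenvalues of the continuous-time state matrix."""
    return float(np.max(np.linalg.eigvals(A).real))

def log_norms(params: dict) -> dict:
    """Record the spectral abscissa and the spectral norms of the given parameter matrices."""
    record = {"s_A": spectral_abscissa(params["A"])}
    for name, W in params.items():
        record[f"||{name}||"] = float(np.linalg.norm(W, ord=2))
    return record

# Illustrative shapes only; in practice this would be called once per epoch on the current weights.
rng = np.random.default_rng(0)
params = {"A": -0.1 * np.eye(4) + 0.01 * rng.standard_normal((4, 4)),
          "W_B": rng.standard_normal((4, 16)),
          "W_C": rng.standard_normal((4, 16))}
print(log_norms(params))
```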
- We thank the reviewer for this feedback. For Related Work, please refer to our response to the second weakness raised by Reviewer h1Pz. Regarding Limitations, we opted to discuss them within their respective sections, but we agree that a dedicated Limitations section would improve clarity and will include one in the appendix of the revised manuscript.
Questions:
- Broadening the empirical study:
- We appreciate this suggestion and agree that exploring the effect of depth is an interesting direction. However, selective SSM blocks by themselves are not typically stacked to form complete models. Architectures such as Mamba use more complex blocks that include additional components such as convolution layers, projection layers, and residual connections. Our work focuses on the role of the selective SSM block within such models, so the current analysis does not directly extend to multi-block architectures. Investigating these interactions at the full-model level is an important avenue for future work.
- We thank the reviewer for the suggestion. However, we would like to clarify that our work focuses on theoretical analysis rather than empirical benchmarking. The goal is to understand the generalization behavior of selective SSMs, not to propose a new algorithm to improve performance metrics and compare it empirically against other architectures. In this setting, models are evaluated based on the structure and dependencies of their generalization bounds, not performance metrics. Therefore, matching parameter counts is not particularly informative, as there is no performance index to optimize. Instead, we compare theoretical bounds directly, as shown in Table 1. The softmax attention bound included there corresponds to Transformers, while the LTI-SSM bound captures models like S4D. To our knowledge, there are currently no generalization error bounds for Performer that would allow a fair comparison.
We agree with the reviewer that verifying our analysis on a task like ListOps would be valuable. For an additional experiment on ListOps, please refer to our response to the third weakness raised by Reviewer WxU3. Regarding PG-19, we note that the generalization assumptions and theorems used in our analysis, which are based on Rademacher complexity, are developed for supervised learning settings and rely on conditions like Assumption 3.3 regarding the loss function. Since PG-19 is a generative language modeling task, it does not fit within this theoretical framework.
- We thank the reviewer for the insightful suggestion. The following table presents the results of sweeping the initial spectral abscissa s_A on the IMDb dataset at a fixed sequence length, while holding all other hyperparameters fixed.
| Initial s_A \ Epoch | 1 | 3 | 5 | 7 | 9 |
|---|---|---|---|---|---|
| 0.1 | 0.078 | 0.119 | 0.125 | 0.126 | 0.126 |
| 0.08 | 0.078 | 0.099 | 0.102 | 0.103 | 0.013 |
| 0.06 | 0.086 | 0.103 | 0.105 | 0.106 | 0.106 |
| 0.04 | 0.082 | 0.097 | 0.100 | 0.100 | 0.100 |
| 0.02 | -0.012 | -0.031 | -0.063 | -0.068 | -0.057 |
| 0 | -0.027 | -0.066 | -0.073 | -0.072 | -0.068 |
| -0.02 | -0.024 | -0.053 | -0.046 | -0.061 | -0.057 |
| -0.04 | -0.029 | -0.070 | -0.067 | -0.065 | -0.048 |
| -0.06 | -0.056 | -0.078 | -0.026 | -0.068 | -0.082 |
| -0.08 | -0.068 | -0.037 | -0.043 | -0.046 | -0.033 |
| -0.1 | -0.065 | -0.060 | -0.113 | -0.119 | -0.086 |
These results correspond to the left plots in Figure 1 of the paper and show consistent trends. Unstable initializations (s_A(0) > 0) generally lead to training failure, except for the initializations closest to zero, where the spectral abscissa is effectively driven toward zero from above, stabilizing the system. In contrast, stable initializations (s_A(0) ≤ 0) remain stable throughout training. The corresponding loss curves (shown in the right plots of Figure 1) decrease drastically for the stabilized initializations, from the first epoch in most cases and only after epoch 7 in one case, while remaining high for the rest. Due to space constraints, these loss curves are omitted here. We have also performed the same analysis for the Majority task and will include those results in the appendix of the revised manuscript.
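For concreteness, a small sketch of one way such a sweep can be set up, using a diagonal-shift construction to place the initial spectral abscissa exactly at a target value (this construction is an illustrative assumption on our part, not necessarily the initialization scheme used in the paper):

```python
import numpy as np

def init_state_matrix(n: int, target_s_A: float, seed: int = 0) -> np.ndarray:
    """Random continuous-time state matrix whose spectral abscissa equals target_s_A."""
    rng = np.random.default_rng(seed)
    M = rng.standard_normal((n, n)) / np.sqrt(n)
    current = np.max(np.linalg.eigvals(M).real)
    # Shifting by a multiple of the identity shifts every eigenvalue's real part by the same amount.
    return M + (target_s_A - current) * np.eye(n)

for s in [0.1, 0.02, 0.0, -0.02, -0.1]:
    A = init_state_matrix(8, s)
    print(f"target s_A = {s:+.2f}, achieved = {np.max(np.linalg.eigvals(A).real):+.6f}")
```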
- The lower bound we provide is on the Rademacher complexity, not directly on the generalization error. This approach is inspired by [Bartlett et al., 2017, Spectrally-normalized margin bounds for neural networks], where a similar lower bound is derived to assess the tightness of the upper bound. In their work, as in ours, the lower bound is not tight but still provides valuable theoretical insights. Our result shows that, under the current analysis based on Rademacher complexity, the upper bound is tight up to a factor of O(T). We have explicitly noted this as a limitation in the paper (line 277). Nonetheless, the lower bound highlights a key contribution of the paper: for the case s_A > 0, the exponential dependence in the upper bound cannot be removed.
- The following table provides an example log from Experiment 1 (IMDb). The spectral abscissa is already plotted in Figure 1 (bottom left, orange line). At the beginning of a stable training cycle, in which s_A is successfully pushed toward zero from above and the loss drops significantly within 10 epochs, the parameter norms do not show strictly increasing trends. This supports the claim that these norms remain bounded. Additional logs are available in the anonymous GitHub repository linked on page 8, footnote 2.
| Epoch | s_A | Parameter norms (8 columns) | | | | | | | | Loss |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.100 | 1.43 | 4.12 | 7.70 | 51.00 | 7.40 | 48.50 | 9.51 | 7.39 | – |
| 1 | 0.069 | 1.46 | 3.89 | 7.12 | 45.90 | 6.85 | 43.40 | 9.41 | 7.07 | |
| 3 | 0.005 | 1.62 | 3.56 | 6.53 | 41.40 | 6.33 | 39.10 | 9.21 | 6.49 | 347.71 |
| 5 | -0.006 | 1.58 | 3.44 | 6.21 | 39.53 | 6.15 | 39.03 | 9.01 | 6.19 | 2.04 |
| 7 | -0.009 | 1.58 | 3.38 | 6.05 | 38.56 | 6.13 | 39.01 | 8.84 | 6.00 | 1.00 |
| 9 | -0.012 | 1.61 | 3.35 | 5.97 | 38.03 | 6.12 | 40.00 | 8.67 | 5.82 | 0.67 |
I thank the authors for the detailed and thoughtful rebuttal. The theoretical contribution is clear, and I appreciate the effort to follow up with additional experiments, clarification on the stability sweep, and confirmation that key assumptions hold during training.
The multi-seed results on the Majority task are helpful and show that the generalization gap remains stable. The s_A sweep nicely demonstrates how stability affects learning, and it’s great to see the connection to training dynamics made explicit. I also appreciate the discussion on the lower bound and the role of Rademacher complexity.
That said, I still encourage the authors to include the ListOps results and seed-wise curves in the appendix, clarify the units and metrics in the s_A sweep table, and consider adding a brief limitations section for completeness. Making the logged norms and assumptions easier to inspect (e.g., a table or plot) would also improve transparency.
Overall, this is a solid contribution to the theory of SSMs, and I’m happy to raise my score to a Borderline accept, assuming these clarifications are incorporated.
Thank you for your positive comments. We’re glad to hear that our clarifications and additional experiments were helpful. You mentioned that you plan to raise your score. Just to clarify, your score was already a 4 (borderline accept) before the discussion. Did you mean you’re considering increasing it to 5, or keeping it at 4? Either way, we appreciate your thoughtful feedback and will incorporate your suggestions to improve the paper.
I'm sorry. I'm happy to keep my score at Borderline accept. Thank you for answering all my questions!
This work derives a probabilistic generalization error guarantee for the class of discretized continuous-time selective SSMs, namely the Mamba architecture. The generalization gap is upper bounded by the function class's Rademacher complexity, which is in turn upper bounded through a covering argument in conjunction with Dudley's integral formula. Parts of the analysis are adapted from a framework previously developed for a similar analysis of linear Transformers. The authors run two sets of experiments aimed at understanding the relationship between system “stabilization” and training sequence length, as well as the effect of the sequence length on the generalization error when the SSM's continuous-time transition matrix has a negative spectral abscissa, in which case the error is independent of the sequence length.
Strengths and Weaknesses
Strengths
- The paper employs relatively demanding technical concepts in its derivations, yet the main text is still written clearly and accessibly.
- The results in Figure 1 are very interesting. Not only does training with overly long sequences fail to stabilise an unstably initialized system, but the spectral abscissa monotonically increases with the sequence length for lengths at which a stable point is reached through training. As the authors mention, the training “stabilizes the system just enough to preserve rich temporal information”, and more information tends to be preserved with longer training sequences.
- The paper seems to propose the first generalization bound on selective SSMs derived through a covering-based upper bound on the function class's Rademacher complexity. By doing this, it fills an important knowledge gap in the statistical learning theory of foundation models.
- Figure 2 experimentally confirms the authors’ main finding, that SSMs with a negative spectral abscissa will exhibit a length-independent generalization error. It is a novel and interesting result.
Weaknesses
- The analysis only considers SSMs obtained through the discretisation of a continuous-time SSM, in particular Mamba. While the authors are correct in stating that this assumption is more aligned with how such models are used in practice, a more general result should perhaps directly consider discrete-time time-variant SSMs in full generality. Similarly to the evolution from S4 and related models to the LRU (Orvieto et al. 2023), we might expect to see an evolution from Mamba to models that discard the assumption of a discretised continuous-time system.
- While Figure 2 does demonstrate that, with a stable system, the generalization error tends to be constant with respect to the sequence length, the exponential dependence of the generalization error on the sequence length with unstable systems was not experimentally shown. The results shown in Figure 1 seem to point to a phase transition, rather than smooth, albeit very rapid, exponential growth.
- Ideally the length-independence of the generalization gap such as shown in Figure 2 would have been analysed on a more diverse set of tasks.
Questions
- I cannot entirely clearly see why the generalization bound is length-independent when the spectral abscissa of A is negative. Examining the relevant factor in the generalization bound under the assumption of a negative spectral abscissa, the behaviour still seems to be asymptotically linear in the sequence length T. Could the authors please comment on this aspect, seeing as it's possibly the core message of the paper.
- How do the authors explain the model’s length-independent high performance on majority? In my understanding, predicting majority seems to require an overview of the entire sequence, yet the state of an SSM with a negative spectral abscissa should exhibit an exponentially decreasing dependence on distant past states, at least in the case of an LTI system. Is the situation drastically different with selective SSMs, and do you think that the SSM behaviour will stay consistent on the majority task even with significantly longer inputs?
Limitations
Yes
Justification for Final Rating
The paper appears to present the first generalization bound for selective SSMs, derived via a covering-based upper bound on the function class’s Rademacher complexity. This contribution addresses a notable gap in the statistical learning theory of foundation models and represents a meaningful theoretical advancement. On the downside, the experimental evaluation is relatively limited, with modest results likely due to the use of small-scale models.
Formatting Issues
No
We thank the reviewer for their positive feedback and insightful comments.
Weaknesses:
- Thanks for raising the very interesting idea of directly considering discrete-time systems. In this case, the discrete-time counterpart of the discretized continuous-time state-space model (SSM) would be a parameter-dependent system, where the system matrix depends on the input at each time step. In principle, the stability of such a model can be assessed by replacing the continuous-time spectral abscissa with the spectral radius and requiring that the spectral radius remain below one for all possible values of the input. However, this condition could be potentially very conservative, depending on the nature of the dependence of the system matrix on the input. We believe that a less conservative approach could be based on imposing that the joint spectral radius of the set of admissible transition matrices is less than one. Computing this joint spectral radius is, in general, intractable. However, in the case where the input can be associated with the nodes of a graph, one can use the approach introduced by A. A. Ahmadi et al. in "Joint Spectral Radius and Path-Complete Graph Lyapunov Functions," SIAM J. Control and Optimization, 52(1), pp. 687–717, 2014. Nevertheless, pursuing this approach would require defining a new class of Mamba models with a specific dependence of the system matrix on the input that allows for exploiting these results. In turn, this would require first considering different models of this dependence, including the simpler case where the system matrix is constant, and benchmarking their performance and the tradeoff between model complexity and generalization properties. While we agree that this is a research question worth pursuing, it is beyond the scope of the present paper.
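To make the conservative per-input check concrete, the following toy sketch samples inputs and probes whether the spectral radius of an (assumed) affinely input-dependent discrete-time transition matrix stays below one; the dependence chosen here is purely hypothetical and only illustrates the check, not a construction from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4, 8
A0 = -0.2 * np.eye(n)                       # nominal (stable) part
W = 0.05 * rng.standard_normal((n, n, d))   # hypothetical input-dependent part

def Abar(u: np.ndarray) -> np.ndarray:
    """Discrete-time transition matrix under an assumed affine dependence on the input u (Euler step of size 1)."""
    return np.eye(n) + A0 + W @ u

# Conservative probe: spectral radius < 1 over many sampled inputs (a sanity check, not a proof).
radii = [np.max(np.abs(np.linalg.eigvals(Abar(u)))) for u in rng.standard_normal((1000, d))]
print("max sampled spectral radius:", round(float(max(radii)), 4))
```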
- We agree with the reviewer that Experiment 1 does not directly measure the generalization error in the unstable regime. Its purpose is to demonstrate that when s_A > 0, training fails for long sequences, and for short sequences s_A quickly moves toward negative values during early training. This means that in practice, the unstable regime is not sustained, especially for longer sequences. Importantly, our theoretical bounds are asymptotic and become meaningful only at long sequences, precisely where training fails in the unstable regime. This makes it infeasible to conduct experiments to validate the generalization bounds in that setting. Since test error evaluation is only meaningful if training is successful, generalization behavior cannot be studied when training fails. For further clarification, please refer to our response to the third weakness raised by Reviewer h1Pz.
- Our work is primarily theoretical, following a research tradition where results are supported through rigorous analysis or minimal experimental validation, as in [Bartlett 2017], [Edelman 2022], and [Trauger 2024]. That said, we fully acknowledge the importance of empirical validation, particularly for the length-independence of generalization. Such validation is challenging because it requires evaluating the same task across varying sequence lengths, a difficult goal for real-world datasets due to truncation, padding, or information loss. Synthetic datasets offer a controlled setting where sequence length can be varied without altering the task content. Accordingly, we include experiments on ListOps, as suggested by Reviewer vv3d. This task, part of the Long Range Arena benchmark, requires hierarchical reasoning over long sequences. Unlike simpler synthetic tasks, ListOps evaluates a model's ability to maintain long-range dependencies over nested structures, making it a meaningful test for generalization across sequence lengths.
Below, we present our results on ListOps for sequence lengths ranging from 100 to 1,000 tokens, with significance analysis performed by averaging over five different seeds.
| Seq Len | 100 | 200 | 300 | 400 | 500 | 600 | 700 | 800 | 900 | 1000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Train Mean | 0.2154 | 0.2106 | 0.1980 | 0.2040 | 0.2154 | 0.2012 | 0.2156 | 0.1974 | 0.2150 | 0.2218 |
| Train Std | 0.0150 | 0.0068 | 0.0134 | 0.0070 | 0.0050 | 0.0128 | 0.0069 | 0.0154 | 0.0096 | 0.0112 |
| Test Mean | 0.1578 | 0.1638 | 0.1488 | 0.1662 | 0.1528 | 0.1638 | 0.1468 | 0.1666 | 0.1694 | 0.1498 |
| Test Std | 0.0236 | 0.0124 | 0.0119 | 0.0132 | 0.0116 | 0.0103 | 0.0119 | 0.0137 | 0.0061 | 0.0091 |
| Gen Gap | 0.0576 | 0.0468 | 0.0492 | 0.0378 | 0.0626 | 0.0374 | 0.0688 | 0.0308 | 0.0456 | 0.0720 |
As the results show, the generalization gap does not exhibit a clear trend with respect to sequence length. While the accuracy values may seem low, it is important to note that even full-scale multi-layer models like Linear Transformer and Performer perform similarly to or worse than our single-layer selective SSM block (with N=4 states per channel and d=16 channels) on this task, reaching 0.16 and 0.18, respectively [Tay 2020]. This highlights the inherent difficulty of the task.
Questions:
- We genuinely appreciate the reviewer for pointing out this issue. The term in question appears in the proof as a result of evaluating the sum at the end of Lemma D.6 (line 611), which gives its closed-form expression. There was a typo in the main theorem where the exponent in the numerator of the second term is missing. When this is corrected, it becomes clear that the bound asymptotically approaches a finite limit as T → ∞, confirming that the generalization bound becomes independent of sequence length in the stable regime.
- The majority task is essentially counting the number of ones in a sequence, which corresponds to an integrator state-space model in discretized form; the continuous-time integrator has spectral abscissa s_A = 0. In theory, given sufficient training data and successful optimization, the learned model should converge to this form, yielding s_A = 0. In practice, however, due to numerical errors and finite data, we rarely observe s_A = 0 exactly; instead, s_A tends to be a very small value close to zero. For extremely long sequences, where even linear (not just exponential) accumulation of error is not tolerable, we agree with the reviewer that training may fail. Additionally, compared to LTI SSMs, the selective ones have more freedom to adjust the determining quantity defined in Theorem 3.3 to stabilize training. They may perform better by adjusting the input-dependent parameters, as opposed to relying solely on the state matrix in LTI SSMs.
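As a small worked example of this integrator view (a scalar-state simplification we add here for illustration, not a formula from the paper):

```latex
% Continuous-time integrator: \dot{x}(t) = 0\cdot x(t) + u(t), so A = 0 and s_A = \max\operatorname{Re}\lambda(A) = 0.
% Exact discretization with step \Delta gives \bar{A} = e^{0\cdot\Delta} = 1 and \bar{B} = \Delta, hence
\[
x_t = x_{t-1} + \Delta\, u_t, \qquad x_T = \Delta \sum_{t=1}^{T} u_t ,
\]
% so for binary inputs u_t \in \{0,1\} the state is proportional to the count of ones, which is what the
% majority task requires; any s_A < 0 would instead discount a token at time t roughly as e^{s_A (T - t)}.
```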
Thank you very much for your response, which adequately addresses all of my questions and concerns with the paper. I will keep my score to recommend acceptance.
This paper studies the generalization ability of selective state space models (selective SSMs), such as Mamba. Specifically, the authors derive an upper bound on the covering number of selective SSMs and use it to evaluate their generalization error. The key finding is that the generalization error is governed by the spectral abscissa s_A, i.e., the largest real part of the eigenvalues of the state matrix that appears in selective SSMs. The authors show that when s_A > 0, the generalization error grows exponentially with the sequence length, while when s_A < 0, the error becomes independent of sequence length. The proof leverages the structural similarity between selective SSMs and Transformers, enabling the use of existing theoretical results for Transformers. The empirical section evaluates how s_A affects the generalization ability of selective SSMs using both synthetic and real-world datasets.
Strengths and Weaknesses
Strengths:
(S1) The main theoretical result is novel in showing that SSM-based models can achieve length-independent generalization error for s_A < 0.
(S2) The experiments provide practical insights. Specifically, when initialized with s_A > 0, the model drives s_A below zero during training if the sequence length is short enough for sufficient stabilization, while for longer sequences, s_A remains positive. Additionally, it is shown that initializing s_A near zero leads to generalization even for long sequences.
Weaknesses:
(W1) The abstract claims “we analyze how the spectral abscissa of the continuous-time state matrix governs the model’s training dynamics and its ability to generalize across sequence lengths.” However, the paper does not actually analyze the training dynamics.
(W2) The fact that the properties of the recurrent state matrix are important for the generalization of RNNs is already known in prior work. The authors likely consider it important that they extend this discussion to selective SSMs, and I agree. However, the paper lacks sufficient explanation of how the proof differs from those for RNNs. If similar techniques are used, this should be made explicit.
(W3) As noted in Remark 3.2, the phrase “through the lens of attention” appears to refer to the use of generalization error analysis techniques for Transformers. However, it is not clear from the main text exactly which results from Transformer theory are used. Although the proof sketch of Theorem 3.3 contains the sentence “Recognizing the attention mechanism in (8), Lemma C.3 covers the parameters {WB, WC, q, w} by treating them as linear function classes,” this explanation is too abstract and difficult to understand.
(W4) Minor: One of the symbols in eq. (15) should be bold.
Questions
(Q1) When s_A is small (i.e., very negative), it is expected that the model's expressivity for long sequences is limited. Can the authors provide any insights into the trade-off between expressivity and generalization with respect to s_A?
(Q2) In Experiment 1, the model failed to generalize for long sequences, but in Experiment 2, it did generalize. Is it correct to understand this as showing that initializing s_A at or below zero makes the model more likely to generalize even for long sequences?
(Q3) Why did you plot the loss in Experiment 1 and the accuracy in Experiment 2?
Limitations
yes
Justification for Final Rating
This paper theoretically investigates the generalization ability of selective SSMs, exemplified by Mamba. The claims are mathematically rigorous; in particular, the paper provides a novel result: for s_A < 0, the upper bound on the generalization error is independent of sequence length. It is also commendable that the authors empirically show that the value of s_A at initialization influences subsequent training.
On the other hand, I have the following concerns about the submission:
- The technical differences from existing work on generalization error in RNNs are unclear.
- The paper does not sufficiently explain in what sense the analysis is “through the lens of attention.”
- There is no discussion of how the conditions imposed for the error analysis affect expressivity.
The authors provided adequate explanations for these points in the rebuttal, and improvements are expected in the final version. While the current lack of clarity keeps my overall evaluation borderline, considering the paper’s strengths, I will assign an accept-leaning rating.
Formatting Issues
Nothing to comment here.
We appreciate the reviewer for their thoughtful comments, recognizing the novelty of our work and the insights we provide. Below, we address the weaknesses and questions they raised.
(W1) We appreciate the reviewer's observation. We agree that the paper does not provide a formal analysis of training dynamics. Our focus is on generalization. Nonetheless, Experiment 1 was included to show that the spectral abscissa influences whether training succeeds, particularly as sequence length increases. In this sense, s_A indirectly governs whether the model enters the regime where generalization bounds apply. Following this, Experiment 2 focuses on the stable regime, where training converges and generalization can be meaningfully evaluated (a more elaborate discussion is included in our response to Reviewer h1Pz). To avoid misinterpretation of this connection, we will revise the abstract as follows: ``Using this result, we analyze how the spectral abscissa of the continuous-time state matrix influences the model's stability during training and its ability to generalize across sequence lengths.''
(W2) Thank you for recognizing the value of extending generalization analysis from RNNs to selective SSMs. We attempted to outline the distinct challenges posed by selective SSMs in the proof sketch, but we appreciate the opportunity to clarify them more explicitly here:
(i) In RNNs, the state matrix is a learnable parameter and can typically be treated as fixed when analyzing generalization. Its effect is often controlled by bounding its operator norm directly.
(ii) In contrast, selective SSMs involve an input-dependent state matrix, which varies with the input sequence through the input-dependent discretization step. Because no fixed bound on this matrix applies across all inputs, we cannot treat it as a constant in the analysis, which poses a new challenge. To address this, we treat the input-dependent discretization step (as well as the input-dependent B and C parameters) as linear function classes, similar to attention mechanisms. This part of the proof is therefore fully based on results from generalization for attention. This is a technical departure from standard RNN analysis toward the methods developed for Transformers.
(iii) We then observe that the spectral abscissa of the continuous-time state matrix provides an input-independent upper bound on the spectral radius of the discrete-time state matrix. This contrasts with the use of the operator norm in RNNs. We believe there is no straightforward way to relate properties of the continuous-time matrix directly to the norm of its discrete-time counterpart.
(iv) After the above steps, we use a telescoping summation. As the reviewer noted, this step is indeed similar to the treatment in RNN generalization bounds. There is a novel technical subtlety here: we use the Gelfand formula to relate the spectral radius of the discrete-time state matrix to the norm of its powers for a sufficiently large power, which allows this step (see the short numerical illustration at the end of this response). Recall that in the previous step, from the continuous-time matrix we could only bound the spectral radius, not the norm.
(v) Finally, we combine the covering numbers, bound the Rademacher complexity, and derive the generalization bound. This part follows standard techniques established in the study of neural networks.
We will revise the main text to explicitly mention the connection to RNN generalization bounds, especially in the telescoping step, and clarify where our techniques diverge from and extend prior approaches.
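The numerical illustration of the Gelfand-formula step mentioned above: a toy check (with an arbitrary matrix, unrelated to any trained model) that ‖M^k‖^{1/k} approaches the spectral radius ρ(M) as k grows, which is the relation used in that step:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
M = 0.9 * M / np.max(np.abs(np.linalg.eigvals(M)))     # rescale so the spectral radius is 0.9

rho = np.max(np.abs(np.linalg.eigvals(M)))             # spectral radius
print("||M|| =", round(float(np.linalg.norm(M, ord=2)), 4), " rho(M) =", round(float(rho), 4))
for k in [1, 5, 20, 100]:
    # Gelfand's formula: ||M^k||^{1/k} -> rho(M) as k -> infinity.
    val = np.linalg.norm(np.linalg.matrix_power(M, k), ord=2) ** (1.0 / k)
    print(f"k = {k:3d}:  ||M^k||^(1/k) = {val:.4f}")
```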
(W3) We thank the reviewer for raising this concern and understand that this connection deserves a clearer explanation. Overall, selective SSMs differ from RNNs in the broader sense that the discretization step, B, and C are input-dependent rather than fixed. To handle this, we treat these components as linear function classes, an approach directly inspired by the treatment of the query, key, and value maps in generalization analysis for Transformers. This is most clearly reflected in Equation (16), where the structure mirrors linear attention: the input-dependent B can be viewed as analogous to keys, C as queries, and their inner product resembles the attention weights with a causal mask. This analogy allows us to apply existing bounds on the covering numbers of attention mechanisms. Specifically, Lemma C.3 draws from such results to bound the complexity of the input-dependent parameterizations. Without adopting this perspective, one would need to develop entirely new theorems and lemmas to handle such input-dependent operators. This connection is further validated in Proposition 3.4, where under simplifying Assumptions 3.5 and 3.6, our bound recovers the generalization result for linear attention. To clarify this connection, we will revise the proof sketch and Remark 3.2 to state explicitly that step (i) of our proof (as outlined in W2) is where the attention-based generalization framework is applied.
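To make the analogy explicit, here is a minimal numerical sketch of the simplified case where the state transition is the identity, so the recurrence collapses to a causal linear-attention form; the single scalar channel and the shapes are our own illustrative assumptions, not the exact construction in Equation (16):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 6, 8, 4                       # sequence length, token dim, state dim
U = rng.standard_normal((T, d))         # input tokens
W_B, W_C = rng.standard_normal((n, d)), rng.standard_normal((n, d))
B, C = U @ W_B.T, U @ W_C.T             # input-dependent "keys" B_t and "queries" C_t
u = U[:, 0]                             # one scalar channel being mixed, for simplicity

# Recurrent form with identity transition: h_t = h_{t-1} + B_t u_t,  y_t = <C_t, h_t>.
h, y_rec = np.zeros(n), np.zeros(T)
for t in range(T):
    h = h + B[t] * u[t]
    y_rec[t] = C[t] @ h

# Attention-like form: y_t = sum_{s <= t} <C_t, B_s> u_s, i.e. query-key scores with a causal mask.
scores = np.tril(C @ B.T)               # entry (t, s) = <C_t, B_s>, kept only for s <= t
y_att = scores @ u

print(np.allclose(y_rec, y_att))        # True: both forms compute the same outputs
```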
(W4) Thank you for pointing out this mistake; we will fix it in the revised manuscript.
(Q1) We appreciate the reviewer's insightful question. When s_A is highly negative or positive, the induced critical quantity (as defined in Theorem 3.3) becomes much smaller than 1 or much larger than 1, respectively. This causes the influence of past inputs on the output at time t to decay or grow exponentially, limiting the model's ability to balance information across time. Specifically, when s_A is highly negative, the model primarily focuses on recent inputs, whereas when s_A is positive, early inputs dominate. In both extremes, expressive power suffers. For a formal result on this phenomenon, we refer the reviewer to Theorem 2 in Block-Biased Mamba for Long-Range Sequence Processing by Annan Yu (2025), which analyzes expressivity in the setting of a diagonal state matrix. While their setting differs from ours, the core intuition carries over to general selective SSMs. The combination of this result with the contributions of our paper suggests a natural trade-off: to achieve good generalization, s_A is driven toward negative values (to ensure the critical quantity stays below 1), but it operates near zero to preserve expressivity by keeping that quantity close to 1. This conclusion is highlighted in bold in the final paragraph of Section 5.
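As a rough illustration of this intuition (a simplification we add here, assuming a constant unit discretization step rather than the general input-dependent one), the contribution of the input at step s to the output at step t is attenuated or amplified roughly as e^{s_A (t − s)}:

```latex
\[
\bigl\|\text{effect of } u_s \text{ on } y_t\bigr\| \;\lesssim\; e^{\,s_A (t-s)}
\quad\Longrightarrow\quad
\begin{cases}
\text{decays exponentially in } (t-s), & s_A < 0,\\[2pt]
\text{is preserved}, & s_A = 0,\\[2pt]
\text{grows exponentially in } (t-s), & s_A > 0.
\end{cases}
\]
```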
(Q2) Yes, the results support the conclusion that initializing s_A near zero makes the model more likely to generalize for long sequences. In Experiment 1, s_A starts positive and is pushed toward zero within the first few epochs for shorter sequences, allowing training to succeed. However, for longer sequences, this stabilization fails, leading to divergence. The key takeaway is that generalization can only be meaningfully evaluated when training succeeds, which happens when s_A is small enough by the end of training. In Experiment 2, initialization at or below zero makes training more likely to complete successfully across all sequence lengths.
(Q3) The loss is inversely related to accuracy. In Experiment 1, we plotted loss versus epochs because it is standard practice to track loss during training. In contrast, Experiment 2 evaluates already trained models, where reporting train/test accuracy is more common, especially in classification tasks.
Thank you to the authors for their detailed response.
Our focus is on generalization. Nonetheless, Experiment 1 was included to show that the spectral abscissa influences whether training succeeds, particularly as sequence length increases. … To avoid misinterpretation of this connection,
Thank you for the clarification. I agree that the experiments demonstrate that s_A is related to whether training succeeds. However, I think “training dynamics” overstates the claim, so I look forward to seeing the phrasing revised.
we attempted to outline the distinct challenges posed by selective SSMs in the proof sketch, but we appreciate the opportunity to clarify them more explicitly here:
Overall, selective SSMs differ from RNNs in the broader sense that the discretization step, B, and C are input-dependent rather than fixed. To handle this,
Thank you for the further explanation. With these clarifications, the key parts of the proof—as well as the similarities and differences from prior work—became much clearer to me. I believe adding these explanations to the revised version will make the paper more accessible to readers.
The combination of this result with the contributions of our paper suggests a natural trade-off: to achieve good generalization, s_A is driven toward negative values (to ensure the critical quantity stays below 1), but it operates near zero to preserve expressivity by keeping that quantity close to 1.
Let me restate my question: if the critical quantity is made closer to 1, expressivity increases, but wouldn't generalization degrade, based on Theorem 3.3? (Please correct me if I'm mistaken.) Can you provide any quantitative results on this trade-off?
Yes, the results support the conclusion that initializing s_A near zero makes the model more likely
Thank you for the answer. I value your inclusion of the dynamics of in the paper, which helps connect the theory with the experimental observations.
I would like to once again thank the authors for their detailed rebuttal. Some of my concerns have been resolved. Based on your answers to the additional question above, I will reassess the contribution of the paper and determine my final rating.
We thank the reviewer for their positive feedback and we are glad that we could clarify their concerns. Regarding the question on the expressivity–generalization trade-off:
Note that the relevant term in Theorem 3.3 (corrected as indicated by Reviewer WxU3, and derived in Lemma D.6) admits a closed-form expression.
This expression does not diverge as the critical quantity approaches 1, even though a vanishing factor appears in the denominator; applying L'Hôpital's rule twice yields a finite limit. Please see the derivation at the end of this comment.
Therefore, although increasing the critical quantity degrades generalization (as you note), the generalization bound does not blow up as that quantity reaches 1; rather, it remains bounded and grows quadratically with T. Generalization becomes severely damaged only when the quantity exceeds 1 (i.e., when s_A > 0), where the bound becomes exponentially increasing in T.
Regarding quantitative results, the experiments are finite-horizon whereas the bound we have is asymptotic. This allows the critical quantity to be slightly above 1 in practice for short sequences (e.g., length 100 or 200), which is acceptable. For longer sequences, where our theoretical results become most relevant, we consistently observe that s_A becomes negative, so the quantity falls below 1. In Experiment 1, we observe that it tends to be between 0.95 and 0.99 when training succeeds, balancing the trade-off between generalization and expressivity.
Proof that the bound remains finite as the critical quantity approaches 1: the numerator and denominator of the term, together with their first derivatives, both vanish at that point, so we compute first and second derivatives and evaluate the limit via two applications of L'Hôpital's rule, obtaining a finite value.
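Since the exact displayed expression lives in Lemma D.6 and is not reproduced here, the following is only a representative calculation with a generic geometric-type term, writing x = e^{s_A}; the actual term in the paper may differ, but this illustrates how such a 0/0 limit resolves to a finite, quadratic-in-T value:

```latex
% Representative term only (illustrative; not the exact expression from Lemma D.6), with x = e^{s_A}:
\[
S_T(x) \;=\; \sum_{t=1}^{T-1} t\, x^{t}
\;=\; \frac{x\bigl(1 - T x^{T-1} + (T-1) x^{T}\bigr)}{(1-x)^{2}} .
\]
% Numerator and denominator, together with their first derivatives, vanish at x = 1, so two
% applications of L'Hopital's rule give
\[
\lim_{x \to 1} S_T(x) \;=\; \frac{T(T-1)}{2},
\]
% which is finite for every fixed T and grows quadratically in T.
```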
I appreciate the authors’ clear response. Since the trade-off between the expressivity of SSMs and their learnability (including generalization) is often discussed, I think that addressing this point somewhere would increase the impact of the paper.
I have reconsidered my rating and will increase it by one point.
The paper derives covering number-based generalization bounds for selective SSMs by connecting them to attention mechanisms. The main result shows that generalization depends on the spectral abscissa s_A of the continuous-time state matrix: when s_A < 0, the bound is length-independent, but when s_A > 0, it grows exponentially with the sequence length T.
Strengths and Weaknesses
Strengths:
- The paper is well written.
- The topic of generalization bounds and guarantees that can be obtained for selective SSMs such as Mamba is important as our theoretical understanding of these models is still lagging behind their immense popularity.
Weaknesses:
- The paper is overly complicated. While the reviewer can appreciate the technical difficulties in analyzing such models and incorporating the selectivity mechanism, the authors could convey the main ideas behind the theoretical claims without the technical overhead introduced in Section 2 (specifically, many of the notations presented are not used in the main paper).
- The paper lacks an in-depth discussion of related works. While many relevant works are discussed for the sake of the presentation and comparison in Table 1 - a dedicated section that discusses different theoretical results on the theory of generalization and optimization of sequential models is missing.
- The experimental section is very weak: the first experiment does not prove anything about generalization. When using an unstable initialization for the state matrix, the authors are simply showing that optimization fails; this has nothing to do with generalization.
Questions
- In the IMDb experiment: why truncate at 300? Why not use the maximal sequence length?
- I'm not sure what the purpose of Figure 1 is: if you use an unstable initialization for the state matrix, what is the experiment supposed to convey? The conclusion I can draw from this experiment is that optimization fails when the sequence length is too large.
- Experiments meant to demonstrate that the unstable case leads to "bad generalization" cannot be decoupled from the question of optimization. Do the authors have other means to show that generalization is bad in the unstable case?
Limitations
This work is theoretical and assumptions are stated clearly.
Formatting Issues
N/A
We thank the reviewer for recognizing the clarity and significance of our work and for their constructive feedback. Below, we address the weaknesses and questions raised:
Weaknesses:
- We thank the reviewer for their time and for raising this point, which allows us to clarify the motivation behind the structure of our paper. The main finding of this paper is a generalization error bound, around which all of our analysis and experiments revolve. While the high-level idea is simple, the derivation requires a deep understanding of specific topics in both learning theory and dynamical systems. We agree with the reviewer that these topics might be complicated for readers, and to avoid this problem we took the following measures:
- Section 2 provides a mathematically rigorous background on selective SSMs, ensuring all components are well-defined. This was intended to remove ambiguity present in prior works, including foundational SSM papers, where key concepts are often described visually or heuristically.
- Section 3 conveys the intuition behind Rademacher complexity, covering numbers, and their connection to generalization in a verbal manner before presenting our main result. The detailed technical proof is omitted from the main body for accessibility and presented in the appendix instead.
Regarding notation, all introduced symbols are used in the paper except for one, which was included for completeness alongside its counterpart. We believe introducing notation upfront promotes clarity, even if some terms are referenced only once. Further simplification would risk omitting critical definitions and make the main contribution, our generalization analysis, difficult to follow. Our intent was to provide a clear and self-contained exposition without overwhelming the reader with technical details in the main text.
- We thank the reviewer for this valuable feedback. The paper already contains a significant amount of material, so a dedicated Related Work section with detailed discussion was omitted to keep the main text concise. Instead, we aimed to include recent and significant works in the Introduction and provide direct comparisons with SOTA results on other models in Table 1 and the analysis section to show how our results extend them. Nonetheless, we agree with the reviewer that a more in-depth discussion of related work would be beneficial. In the revision, we will include a dedicated Related Work section, placed in the appendix to maintain readability. This section will survey generalization error and sample complexity bounds across several fields, including machine learning, deep learning, system identification, and reinforcement learning. It will review generalization bounds based on model architecture, covering neural networks, RNNs, LTI-SSMs, and Transformers. In addition, it will discuss key analytical tools such as VC dimension, covering numbers, Rademacher complexity, and information-theoretic approaches.
- We thank the reviewer for this important observation and the opportunity to clarify the purpose of Experiment 1. As the reviewer noted, this experiment illustrates stability during training and does not directly address generalization. At first glance, a comprehensive generalization experiment might seem to require evaluating the generalization gap for both the stable (s_A < 0, or more precisely, in terms of the quantity defined in Theorem 3.3) and unstable (s_A > 0) regimes across varying sequence lengths. However, it is important to note that the generalization gap is meaningful only after training has converged. Experiment 1 is included to demonstrate that, in practice, the unstable case does not arise: regardless of the initial spectral abscissa, training pushes s_A toward negative values within the first few epochs. Its purpose is to justify why we do not report generalization gaps for the unstable region. In that regime, training does not succeed, so the model does not perform well on the training data, and thus it is meaningless to talk about how well it performs on the test data. With this clarification, we proceed to Experiment 2, which focuses on the stable regime (s_A ≤ 0), where the generalization bounds are meaningful and empirically supported. We will revise the text to make this reasoning explicit so that readers clearly understand the role of Experiment 1 and its connection to Experiment 2.
Questions:
- For the IMDb dataset, in the first experiment, input sequences are truncated to a range of shorter lengths. For longer sequences, training fails at the initial stage, with the loss diverging to infinity. The second experiment truncates sequences to lengths extending well beyond 300 to demonstrate the length-independence of generalization, which is the main objective of that experiment. Therefore, we do not limit truncation to length 300; rather, we go beyond it based on the experimental goal.
- We tried to address this point in our response to the third weakness, as it is closely related. We agree with the reviewer's conclusion that Figure 1 demonstrates training failure rather than generalization performance. The purpose of Figure 1 is to show that it is not possible to obtain a well-trained unstable model. When the model does not perform well on the training data, it is not informative to evaluate its performance on the test data, rendering the generalization gap meaningless. This supports our focus on the stable regime in Experiment 2, where training succeeds and generalization can be meaningfully evaluated.
- We appreciate the reviewer's insightful question. To support the claim that generalization degrades in the unstable regime, we provide a theoretical justification in Theorem 4.1. This result establishes a lower bound on the Rademacher complexity of the selective SSM class when s_A > 0, showing that it grows exponentially with the sequence length T. Since the generalization error is upper-bounded by the Rademacher complexity, this implies that no sub-exponential upper bound is possible in this regime. As for experimental validation, we agree with the reviewer that optimization and generalization are inherently coupled in this setting: meaningful generalization analysis requires successful training. Moreover, the bounds derived in our theorems are asymptotic and become valid only for large values of T, precisely where training fails when s_A > 0. For these reasons, it was not feasible for us to conduct empirical generalization experiments in the unstable regime.
Thank you for the detailed rebuttal and responses to the points raised. I have also reviewed the other reviews and the authors’ rebuttals.
My main concern remains that the empirical methodology does not adequately distinguish between optimization failure and generalization failure, which limits the strength of the empirical evidence. Additionally, the manuscript still prioritizes technical derivations at the expense of a fuller discussion of relevant prior work.
Taken together, these issues remain significant enough that my original borderline rejection recommendation stands.
We thank the reviewer for carefully reading our rebuttal and for their thoughtful comments. We would like to clarify that the limitation raised is not due to our experimental design, but rather reflects a fundamental property of the problem setting: the generalization gap is not well-defined when training fails, as neither the training nor test errors are computable in a meaningful way. In such cases, their difference—the generalization gap—does not yield informative value. Both optimization and generalization success or failure stem from the same underlying factor: whether the spectral abscissa of the state matrix is non-positive. The optimization problem solved during SSM training is highly non-convex, so there are no guarantees that a gradient-based search will converge. Our results, including the sweep experiment, demonstrate that optimization success is directly tied to the non-positivity of s_A. Moreover, as both our empirical and theoretical analyses show, this same condition guarantees successful generalization. Finally, our lower bound on the Rademacher complexity (Section 4.1) shows that if s_A > 0, then the generalization gap grows exponentially with sequence length.
We agree that the discussion of related work in the main text is not exhaustive. However, we made a deliberate effort to include key and recent results in the Introduction, particularly those most relevant to our setting. These include classical covering-based bounds for neural networks [Bartlett et al., NeurIPS 2017; Zhang, JMLR 2002] and recent generalization analyses for Transformers [Edelman et al., ICML 2022; Trauger and Tewari, AISTATS 2024], which directly inform our treatment of selective SSMs.
More importantly, Section 4 of the manuscript is dedicated to a detailed analysis, comparing our generalization bounds with prior work on RNNs, Transformers, and LTI-SSMs—arguably the most relevant sequence processing models for comparison. This discussion is central and was intentionally integrated with the main results to highlight the technical contributions in context with prior work. While a longer survey-style treatment was not possible due to space and scope constraints and would shift the focus from our contributions, we believe the current discussion sufficiently supports our claims. Nonetheless, we acknowledge the value of a broader overview and will include a dedicated Related Work section in the appendix of the revised version.
This paper provides a theoretical generalization analysis of selective state-space models (SSMs), which form the core of the Mamba architecture. The authors derive a novel covering number–based bound for selective SSMs, extending recent theoretical results on Transformers. A central insight is that the spectral abscissa of the continuous-time state matrix governs both training dynamics and the model’s ability to generalize across sequence lengths. Overall, the work offers a clear and timely theoretical contribution to understanding SSMs and their generalization properties.
While there are some limitations–particularly that the experimental results are not very strong–the reviewers agree that the paper tackles an important problem (generalization in SSMs) with a novel and well-founded methodology (covering numbers applied to selective SSMs). Therefore, in line with reviewers WxU3, Bnv5, and vv3d, I recommend acceptance of this paper.