PaperHub
Score: 7.3 / 10
Spotlight · 4 reviewers
Ratings: 4, 5, 5, 4 (min 4, max 5, std dev 0.5)
Confidence: 4.3
Novelty: 3.3 · Quality: 2.8 · Clarity: 3.0 · Significance: 3.3
NeurIPS 2025

Structured Sparse Transition Matrices to Enable State Tracking in State-Space Models

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We propose a parametrisation of SSM transition matrices that enables SSMs to track states of arbitrary finite-state automata while keeping the cost of the parallel scan comparable to that of diagonal SSMs.

Abstract

Modern state-space models (SSMs) often utilize structured transition matrices which enable efficient computation but pose restrictions on the model’s expressivity, as measured in terms of the ability to emulate finite-state automata (FSA). While unstructured transition matrices are optimal in terms of expressivity, they come at a prohibitively high compute and memory cost, even for moderate state sizes. We propose a structured sparse parametrization of transition matrices in SSMs that enables FSA state tracking with provably optimal state size and depth, while keeping the computational cost of the recurrence comparable to that of diagonal SSMs. Our method, PD-SSM, parametrizes the transition matrix as the product of a column one-hot matrix ($P$) and a complex-valued diagonal matrix ($D$). As a result, the computational cost of parallel scans scales linearly with the state size. Theoretically, the model is BIBO-stable and can emulate any $N$-state FSA with one layer of dimension $N$ and a linear readout of size $N ×N$, significantly improving on all current structured SSM guarantees. Experimentally, the model significantly outperforms a wide collection of modern SSM variants on various FSA state tracking tasks. On multivariate time-series classification, it outperforms neural controlled differential equations, a paradigm explicitly built for time-series analysis. Finally, we integrate PD-SSM into a hybrid Transformer-SSM architecture and demonstrate that the model can effectively track the states of a complex FSA in which transitions are encoded into sets of variable-length English sentences. The code is available at https://github.com/IBM/expressive-sparse-state-space-model.
Keywords
State-Space Models · Expressiveness · Efficiency · Matrix Parametrisation · State-Tracking · Finite-State Automata

Reviews and Discussion

Review
Rating: 4

This paper proposes PD-SSM, a structured state-space model with a novel input-dependent structured state-transition matrix, $A(u_t) = P(u_t) D(u_t)$, where $P(u_t)$ is a binary matrix with one non-zero element per column and $D(u_t)$ is a complex diagonal matrix. This structure fits within a large family of existing structures, including the diagonal structure of Mamba, the dense structure of IDS4, Linear NCDEs, and SD-SSM, the block-diagonal structure of block-diagonal input dependent linear RNNs, and the diagonal-plus-low-rank structure of DeltaNet.
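For illustration, the snippet below is a minimal sketch (not the authors' implementation) of why applying such a transition to the hidden state costs $O(N)$: the `perm` vector, the variable names, and the use of a true permutation (rather than a general column one-hot map) are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code): applying A = P D, with P column one-hot
# and D complex diagonal, to a hidden state in O(N) instead of O(N^2).
import torch

N = 4
perm = torch.tensor([2, 0, 3, 1])                              # hypothetical: column j has its 1 at row perm[j]
d = torch.polar(torch.rand(N), 2 * torch.pi * torch.rand(N))   # complex diagonal entries, magnitude < 1

x = torch.randn(N, dtype=torch.cfloat)                         # previous hidden state
b = torch.randn(N, dtype=torch.cfloat)                         # input term B u_t

# Dense reference: build P explicitly and multiply.
P = torch.zeros(N, N, dtype=torch.cfloat)
P[perm, torch.arange(N)] = 1.0
x_dense = P @ (d * x) + b

# Sparse evaluation: scale by d, then route entry j to row perm[j].
x_sparse = torch.zeros_like(x)
x_sparse[perm] = d * x          # perm is a permutation here; a general column one-hot P would need a scatter-add
x_sparse = x_sparse + b

assert torch.allclose(x_dense, x_sparse)
```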

The paper shows that PD-SSM has the same asymptotic computational complexity as using diagonal transition matrices and can solve any finite state automata with a hidden dimension that scales linearly with the number of states. Additionally, PD-SSM empirically solves a number of state-tracking benchmarks and outperforms Mamba and S7 on a subset of the long range arena benchmark.

Strengths and Weaknesses

Strengths

  • The paper is well written.

  • The PD structure is well motivated, both from a theoretical cost standpoint and from the analysis showing that any FSA can be represented in an efficient manner.

  • The benchmark datasets are suitable, and the PD structure clearly demonstrates an increase in expressivity over diagonal matrices.

  • The natural language state-tracking benchmark provides a good indication that the results on the regular language benchmarks translate into real-world language modelling performance.

Weaknesses

  • The paper limits the theoretical and empirical results to only diagonal, dense, and PD structures. However, as noted by the authors on lines 43 and 44, both block-diagonal (BD) and diagonal-plus-low-rank (DPLR) structures exist in the literature. Table 1 and the empirical results neglect these additional structures, and no arguments are made as to why the PD structure should be favoured over BD and DPLR.

  • Both Table 2 and Table 3 are currently incomplete. Table 2 is missing cycle navigation and even pairs, both of which SD-SSM solves perfectly, and Table 3 is missing the pathfinder datasets (which the authors note is due to an issue with their compute cluster). The paper would be significantly improved by completing the missing datasets and expanding the empirical results to include BD and DPLR structured matrices.

  • Given that computational cost is a main motivating factor for the PD structure, the empirical runtime of the model is not sufficiently discussed in the main body of the text. As shown in the Appendix, despite having the same asymptotic computational cost, going from diagonal to a PD structure can lead to an order of magnitude longer training time per step.

  • The main body claims $C_2 \times C_{30}$ and $D_{30}$ are trained on sequences up to length 40, but the Appendix claims they were trained on sequences up to length 90.

Questions

Questions

  • How were the hyperparameters chosen for each dataset?

Suggestions

  • As mentioned, a theoretical and empirical comparison to the already existing structures in the literature is critical for a paper proposing a new structure. This addition would significantly improve the paper. Given an appropriate demonstration of the benefits of the PD structure over BD and DPLR, I would raise my score.

  • A log scale on the y-axis of figure 4 would make the comparison with the diagonal structure easier.

  • After the submission deadline, this paper introduced two more possible structures for the transition matrix, and introduced a unifying framework for input-dependent structured state transition matrices. I mention it here as I believe including this paper in the discussion of alternative structures would be beneficial. However, there is naturally no requirement to engage with it given that it appeared after submission.

Limitations

Yes.

Final Justification

The authors have significantly improved the empirical evaluation of their proposed structure, expanding both the datasets and baselines. This, coupled with their existing theoretical results, represents a substantial contribution to the literature on state transition matrices. I am happy to recommend this paper for acceptance.

Formatting Concerns

None.

Author Response

We thank the reviewer for their careful reading of our paper. We are glad that the reviewer finds the paper well written and that they appreciate the theoretical analysis and motivation. We are pleased that they find our benchmarks suitable and effective, even as indicators for real-world language modeling benefits.


Weaknesses 1 and 2: We agree that the paper would benefit from a more thorough comparison with the variety of modern SSMs, in particular those adopting BD and DPLR transition matrices. To address this, we provide significantly extended results to strengthen our contribution.

Firstly, the table below extends the experimental evaluation from Walker et al. (2025). We now evaluate +10 recurrent and parallelizable sequence models on four FSA emulation tasks. We fully conform to their experimental setup which adopts a two-layer architecture and uses a single, fixed set of hyperparameters on all tasks. While the authors used the best results obtained with state size 128 or 512, we fixed the state size to 128. Notably, our model sets the state-of-the-art performance with no hyperparameter tuning.

| Model | Cycle Nav. | Even Pairs | Mod Arith. | Parity | Average |
|---|---|---|---|---|---|
| xLSTM[1:1] | 56.6 ±3.6 | 100.0 ±0.0 | 34.4 ±1.4 | 100.0 ±0.0 | 72.8 |
| DeltaNet | 49.0 ±5.4 | 100.0 ±0.0 | 44.2 ±8.0 | 58.4 ±1.1 | 62.9 |
| DeltaNet[−1, 1] | 47.3 ±6.0 | 100.0 ±0.0 | 63.9 ±7.2 | 97.2 ±1.7 | 77.1 |
| Gated DeltaNet | 55.9 ±7.8 | 100.0 ±0.0 | 49.9 ±7.5 | 56.6 ±0.7 | 65.6 |
| Gated DeltaProduct[-1,1] | 61.4 ±12.0 | 99.6 ±0.6 | 58.0 ±20.0 | 98.8 ±1.5 | 79.5 |
| RWKV-7 | 37.8 ±5.0 | 88.1 ±14.2 | 41.6 ±13.8 | 51.0 ±0.3 | 54.6 |
| Transformer | 24.4 ±0.5 | 90.4 ±10.4 | 23.6 ±0.7 | 52.2 ±0.4 | 47.7 |
| Mamba | 50.7 ±7.8 | 100.0 ±0.0 | 38.1 ±6.0 | 54.5 ±1.5 | 60.8 |
| BD4–LNCDE | 100.0 ±0.0 | 91.3 ±12.2 | 45.6 ±7.8 | 88.4 ±13.9 | 81.3 |
| D–DE16–LNCDE | 77.3 ±21.3 | 81.4 ±9.9 | 94.5 ±5.2 | 81.0 ±18.1 | 83.6 |
| DPLR4–LNCDE | 83.9 ±20.1 | 97.7 ±5.2 | 26.0 ±0.7 | 79.6 ±20.3 | 71.8 |
| PD-SSM | 97.7 ±0.02 | 96.2 ±0.07 | 96.4 ±0.04 | 99.7 ±0.03 | 97.5 |

On the Pathfinder dataset, our model achieves 62.61% average accuracy. The average LRA accuracy of our method is now 74.43%, compared to the 66.44% achieved by Mamba and 73.89% achieved by S7.

To extend the evaluation of our method to more real-world datasets, we further add results on time-series classification with long-range dependencies, following Walker et al. (2025). To obtain the results, we ran a grid search over learning rates in {1e-5, 1e-4, 1e-3} and number of layers in {1, 2}. The state size is 16, and the embedding size is 64. Notably, this hyperparameter grid is significantly less extensive than the one used to evaluate the baseline methods. We happily report that PD-SSM also natively achieves a new state-of-the-art for SSMs on this dataset, notably without adopting a CDE structure.

| Dataset | S6 | Log-NCDE | D-LNCDE | BD-LNCDE | DE-LNCDE | PD-SSM |
|---|---|---|---|---|---|---|
| EigenWorms | 85.0 ±16.1 | 85.6 ±5.1 | 80.0 ±5.4 | 86.7 ±4.1 | 87.8 ±5.7 | 87.2 ±5.1 |
| EthanolConcentration | 26.4 ±6.4 | 34.4 ±6.4 | 25.8 ±3.9 | 28.6 ±6.4 | 25.6 ±3.7 | 38.7 ±6.6 |
| Heartbeat | 76.5 ±8.3 | 75.2 ±4.6 | 72.9 ±5.0 | 75.2 ±6.0 | 76.1 ±3.1 | 76.8 ±3.0 |
| MotorImagery | 51.3 ±4.7 | 53.7 ±5.3 | 54.0 ±7.3 | 58.2 ±4.1 | 54.0 ±5.3 | 55.4 ±5.0 |
| SelfRegulationSCP1 | 82.8 ±2.7 | 83.1 ±2.8 | 83.5 ±2.1 | 84.9 ±1.9 | 83.5 ±6.6 | 82.8 ±3.2 |
| SelfRegulationSCP2 | 49.9 ±9.5 | 53.7 ±4.1 | 53.0 ±5.8 | 53.3 ±7.5 | 46.3 ±3.6 | 54.0 ±2.8 |
| Average | 62.0 | 64.3 | 61.5 | 64.5 | 62.2 | 65.8 |

In addition to the experimental results, we want to highlight that we provided theoretical arguments in the paper on why PD is favourable over the BD and DPLR structures, although we agree with the reviewer that we did not explicitly spell them out:

  • We proved the existence of FSAs that cannot be emulated by an SSM with state size less than N-1.
  • We proved the fact that a single-layer PD-SSM can emulate any N-state FSA with state size N.

Taken together, these results clearly speak against adopting block-diagonal (BD) transition matrices, since they are theoretically less expressive than PD-SSMs (they cannot mix between blocks) but more expensive ($O(b^2 d L)$ vs. our $O(d L)$). In addition, theoretical constructions for DPLR models to emulate FSAs rely either on an exponential number of layers and DPLR factors (Grazzi et al., 2024) or on MLPs with exponentially large hidden layers (Peng, Zhang, Goldstein et al., 2025), in stark contrast with the guarantees provided by our parametrization. We will expand on this analysis in a revision of our text.


Weakness 3: Due to the restricted number of pages in the initial submission, we were indeed unable to expand on this point further in the main body. Still, as the reviewer points out, we provide a detailed breakdown of the empirically observed computational cost in the Appendix, and further mention this in the limitations section. While our implementation is not optimized on the kernel level, Table S8 suggests that the generation of P matrices is responsible for most of the compute. Indeed, the table demonstrates that a parallel scan with PD transition matrices only has a 2x overhead compared to diagonal transition matrices. To address the reviewer's concern, we will switch to a log-log plot for Figure 4 in the main body of the text, move Table S8 from the Appendix to the main sections, and more thoroughly discuss the P matrix generation as a bottleneck. We suggest the following additional text:

As shown in Figure 4, PD-SSM scales significantly better than SD-SSM. PD-SSM’s higher runtime compared to diagonal SSM stems primarily from the additional operations in the generation of P: the simultaneous soft selection of transition matrices $M_i$ across the sequence during parallel execution incurs additional compute and memory. Table S8 confirms these findings and suggests potential improvements in future work.
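To make the "soft selection" step concrete, the sketch below shows one schematic way an input-dependent mixture over a small dictionary of K candidate matrices could be computed and then hardened to a single one-hot choice with a straight-through estimator. This is only an illustration of the kind of per-step work that dominates the runtime; the exact PD-SSM generator may differ, and the function and parameter names (`select_matrix`, `W_sel`, `dictionary`) are ours, not the paper's.

```python
# Schematic only: input-dependent soft selection over K candidate matrices,
# hardened to a one-hot choice with a straight-through gradient.
import torch
import torch.nn.functional as F

def select_matrix(u_t, W_sel, dictionary):
    """u_t: (d_in,), W_sel: (K, d_in), dictionary: (K, N, N)."""
    logits = W_sel @ u_t                              # one score per candidate matrix
    soft = F.softmax(logits, dim=-1)                  # differentiable mixture weights
    hard = F.one_hot(soft.argmax(-1), soft.numel()).to(soft.dtype)
    weights = hard + soft - soft.detach()             # hard forward pass, soft backward pass
    return torch.einsum('k,kij->ij', weights, dictionary)
```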


Weakness 4: We appreciate the reviewer catching this inconsistency. The two tasks, $C_2 \times C_{30}$ and $D_{30}$, were indeed trained on sequences of length up to 90.


Question: Hyperparameters

  • Long-Range Arena: We performed a grid search defined by all combinations of the following values: Learning rate in {1e-5, 5e-5, 1e-4, 5e-4, 1e-3}, weight decay in {0.001, 0.01, 0.1, 0.05}, and dropout in {0.0, 0.1, 0.3}. On each task, we report the average accuracy over three random seeds. The embedding dimension and state size were both fixed to 128, with the exception of the Retrieval dataset in which we reduced it to 64. We used Adam with the default parameters (0.9, 0.999). The hyperparameters of the models we compare with were obtained via an extensive Bayesian hyperparameter search, as outlined in (Soydan, Zubic et al., 2025).

  • FSA Results (Section 4.2): We ran a grid search over learning rates in {5e-5, 1e-4, 5e-4, 1e-3}, state sizes in {64, 128}, and the number of dictionary matrices (K) in {6, 8}. The reported baseline results are taken from Terzic et al. (2025), who ran similar hyperparameter grid searches for all baseline methods.

  • New FSA Results: We re-used the hyperparameters from the referenced paper (Walker et al., 2025) with no tuning and with the state size fixed to 128.

  • Time Series: We ran a grid search over learning rates in {1e-5, 1e-4, 1e-3} and number of layers in {1, 2}. The state dimensionality was fixed to 16, and the embedding dimensionality was fixed to 64. All other hyperparameters were re-used from the referenced paper (Walker et al. 2025). Notably, this hyperparameter grid is significantly less extensive than the one used to evaluate the baseline methods, see Walker et al. (2024).


Suggestions:

  1. Comparison with BD and DPLR: Addressed in the two tables above, as well as in our theoretical arguments.

  2. Log-scale y axis for runtime: Agreed, a logarithmic plot will provide more information than the current format.

  3. Discussion of the Structured-LNCDE Paper: We became aware of the paper shortly after submission and thank the reviewer for bringing it up. We agree that it is very relevant related work and we now thoroughly engage with it through both experimental and theoretical claims, and accordingly made a benchmarking comparison with it (see above tables). Interestingly, the referenced work theoretically analyses sparse SSMs. The authors note that unstructured sparse SSMs do not provide any efficiency benefits over unstructured dense SSMs, and note that they anticipate that ongoing work on sparse matrices will enable future gains in efficiency. Our work exactly presents a structured sparse SSM with significant efficiency gains compared to unstructured variants.


Once again, we thank the reviewer for their thoughtful feedback, which enabled us to substantially improve the paper, and hope that by addressing all their concerns, we can convince them to improve their score.

Comment

Thank you for your comprehensive and detailed response.

The additional experimental results are greatly appreciated and strengthen the paper.

However, while investigating reasons for the low standard deviation on the formal language tasks, I noticed that Lines 79-81 of range_evaluation.py are:

  random.seed(1)
  np.random.seed(1)
  t.manual_seed(1)

This resets the global RNG every time you call range_evaluation, leading to a periodic training loop, and more importantly meaning that the training runs for different random seeds are not independent. I confirmed this behaviour by adding

print(t.rand(1,))

after each range_evaluation.range_evaluation call on Line 215. This prints the same number each time. I apologise for not addressing the rest of your response in this reply, but I wanted to give you the maximum time to address this bug, if it has affected the results in the paper.
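[For readers following along, a generic illustration of the contrast being described: re-seeding the global RNG inside a helper makes subsequent global draws repeat on every call, whereas a local `torch.Generator` leaves the global stream untouched. This is a sketch, not code from the repository; the helper names are hypothetical.]

```python
# Generic illustration (not repository code) of global re-seeding vs. a local generator.
import torch

def helper_with_global_reset():
    torch.manual_seed(1)                  # resets the global RNG on every call
    return torch.rand(1)

def helper_with_local_generator(seed=1):
    g = torch.Generator().manual_seed(seed)
    return torch.rand(1, generator=g)     # reproducible, global RNG untouched

helper_with_global_reset();   print(torch.rand(1))   # same value after every call
helper_with_local_generator(); print(torch.rand(1))  # global stream keeps advancing
```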

Comment

Thank you again for your thorough response. I intend to raise my score.

Having said this, I still have some concerns about the presented empirical results that I give below.

Regular Language Results

As noted in my previous reply, PD-SSM’s standard deviations are two orders of magnitude smaller than those of the other models. Looking at the code from Walker et al. (2025), each seed uses a different randomly generated train and validation set. Therefore, identical performance across seeds is surprising, especially on length-generalisation tasks where the OOD set changes each time. Have you investigated the cause of this? Is the model converging to the same weights each time and if so, is there a pattern to the types of sequences the model struggles to solve?

UEA Results

SLiCEs did not undergo a hyper-parameter optimisation on this dataset, instead using the hyper-parameters that were optimal for the Log-NCDE, to isolate the impact of varying the vector field structure. The results remain impressive when comparing to those models that did undergo hyper-parameter optimisations, but it would be good to include a comparison to oscillatory state space models, presented at ICLR this year, which are the current state-of-the-art on this benchmark.

Theoretical Expressiveness

I am not convinced that BD transition matrices are less expressive than PD transition matrices. Although BD does not allow cross-block interaction, it allows richer intra-block dynamics that PD may not capture. I would encourage providing a formal statement of this result, if you intend to include it. That said, the optimality with respect to state size and better scaling in terms of number of parameters are strong aspects of the PD state transition matrix.

Comment

Thank you very much for being considerate in this matter. The intended effect of these lines of code is ensuring that all models are trained and evaluated on the same data. The global setting of seeds to 1 happens only after model initialisation, meaning that it affects the randomly sampled training and evaluation data; it does not, however, affect the random model initialisation. To ensure that all models are trained on exactly the same data, the seed should have been set to 1 before the first training batch is sampled. We instead set the manual seed at model validation, meaning that most but not all of the training data is the same across models. The evaluation data is shared across the models. Compared with standard practices which use the exact same data to train different models, our code can only increase the variance of the results. The referenced lines of code induce the effect of training models on a fixed dataset for a certain number of epochs.

We should also clarify that the code which manually sets the seed to 1 in range_evaluation.py was used to train models on FSA tasks in our section 4.2, in which we only reported the best-performing model out of three random seeds. To obtain the results we used in the rebuttal, which indeed exhibit a very low standard deviation, we imported our model into the training pipeline provided by the authors of the paper you referenced, "Structured Linear CDEs: Maximally Expressive and Parallel-in-Time Sequence Models" by Walker et al. 2025. We only made a single modification to their training pipeline, that of removing the padding of sampled sequences by a fixed length. Concretely, the authors initially padded all of the sampled sequences with 256 zeros, which very significantly and unnecessarily increased the training time (the maximum training sequence length is 40, to which we would then always append 256 inconsequential zeros). We instead pad each sampled batch such that all of the sequences are as long as the longest sampled sequence. We have made no other modification to their code, and the omission of the padding should have no consequence on the final results, given that the gradients are masked for the padded parts of the sequence, and the validation accuracy computation also naturally takes padding into account.

To be fully specific, in the PyTorch GitHub repo provided by the above referenced paper, in the train.py file, lines 364 and 367 have None as the padding length in our implementation, rather than the 40 or 256 the authors used. This influences the collate_fn function in file data_dir/dataloaders.py, so that the three conditional lines of code 240-242 are no longer executed.

To be fully certain that this has no effect on the results, we have re-trained our model with padding length set to 40, the same as what was used in the baseline LNCDE methods, and we observed no effect on the results. All of the runs of PD-SSM converge to almost perfect accuracy with very low standard deviation.
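[As a rough illustration of the padding change described above, a generic collate function that pads each batch only to its longest sequence and returns a mask might look as follows. This is a sketch under our own naming (`collate_to_longest`), not the code from the referenced repository.]

```python
# Generic sketch (not the referenced repository's collate_fn): pad each batch to its
# longest sequence and return a mask so that loss/accuracy ignore padded positions.
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_to_longest(batch):
    """batch: list of (sequence, label), where sequence has shape (T_i, d)."""
    seqs, labels = zip(*batch)
    lengths = torch.tensor([s.shape[0] for s in seqs])
    padded = pad_sequence(seqs, batch_first=True)                       # (B, max_i T_i, d)
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]    # True on real tokens
    labels = torch.stack([torch.as_tensor(l) for l in labels])
    return padded, mask, labels
```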

Comment

Regular Language Results

You are correct to question the surprisingly low standard deviation. The models are not learning the same weights; our reported deviation numbers are in the wrong units and should be multiplied by a factor of 100 (only for the regular language table). We apologize for the confusion.

UEA Results

Thank you for pointing this out. The correct statement should have read "Notably, this hyperparameter grid is significantly less extensive than the one used to evaluate the baseline SSM methods, see Walker et al. (2024)". Indeed, SLiCEs did not undergo a hyperparameter optimization; the baseline methods that did undergo such a search include all of the SSMs and LNCDE in Walker et al. 2024, as well as the work on oscillatory SSMs by Rusch and Rus at ICLR 2025.

In an effort to equalize the scale of the hyperparameter search among the SSM methods, we now extended the hyperparameters search of our PD-SSM using the following grid: Learning rate in {1e-5, 1e-4, 1e-3}, number of layers in {2, 4, 6}, state dimension in {16, 64, 128} and embedding dimension in {16, 64, 128}. We report the results below:

| Model | EW | EC | HB | MI | SPC1 | SPC2 | Avg. |
|---|---|---|---|---|---|---|---|
| S6 | 85.0 ±16.1 | 26.4 ±6.4 | 76.5 ±8.3 | 51.3 ±4.7 | 82.8 ±2.7 | 49.9 ±9.5 | 62.0 |
| LinOSS-IMEX | 80.0 ±2.7 | 29.9 ±1.02 | 75.5 ±4.3 | 57.9 ±5.3 | 87.5 ±4.03 | 58.9 ±8.11 | 65.0 |
| LinOSS-IM | 95.0 ±4.41 | 29.9 ±0.62 | 75.8 ±3.7 | 60.0 ±7.5 | 87.8 ±2.62 | 58.2 ±6.9 | 67.8 |
| PD-SSM | 88.3 ±6.4 | 33.1 ±2.2 | 80.0 ±2.6 | 59.3 ±5.5 | 81.4 ±2.9 | 56.8 ±8.2 | 66.5 |

Note that certain results we had previously reported are missing; these results were obtained using 1 layer and are excluded because the baselines were not evaluated using 1 layer. The baselines were also not evaluated using state size 128, with the grid rather being {16, 64, 256}. We used {16, 64, 128} to more quickly obtain results relevant for this discussion.

Theoretical Expressiveness

Let us expand a bit on our statements. Our formal analysis is entirely based on Proposition 3, which states the following: For any $N > 1$ there exists an FSA with $N$ states that cannot be emulated by any single-layer SSM with state size less than $N-1$ under unique state encodings. The proof is in the appendix, section C. The assumption of unique state encodings means that each state of the FSA gets assigned to a unique vector in $\mathbb{R}^d$.

The proposition makes no assumption on the structure of the transition matrix. The richer intra-block dynamics of dense (or block-diagonal) SSMs do not influence the conclusion.

This implies that a block-diagonal SSM would need blocks of size at least $N-1$ in order to emulate the $N$-state finite-state automaton. The cost of matrix-matrix multiplication would then be $O(N^3)$, as compared to $O(N)$ in our sparse parametrisation.

We know that our sparse parametrization can emulate any $N$-state FSA with state size $N$. To see this, notice that we can identify each FSA state with its one-hot representation. The mapping that we describe in section 2.2 then results in a matrix with a column one-hot structure.
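[A tiny worked example of this construction, ours rather than the paper's: with one-hot state encodings, the transition for a fixed input symbol is exactly a column one-hot matrix, here for the 3-state cyclic automaton $C_3$ under the symbol "advance by one".]

```python
# Worked example (illustrative): emulating the 3-state cyclic automaton C_3 with
# one-hot state vectors and a column one-hot transition matrix.
import numpy as np

N = 3
delta = {0: 1, 1: 2, 2: 0}            # state -> next state for this symbol
A = np.zeros((N, N))
for state, nxt in delta.items():
    A[nxt, state] = 1.0               # column `state` is one-hot at row `nxt`

x = np.eye(N)[0]                      # one-hot encoding of state 0
for _ in range(4):
    x = A @ x                         # state stays one-hot and follows the FSA
print(int(x.argmax()))                # 1, i.e. state (0 + 4) mod 3
```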

You are correct to point out that this does not prove that the PD method is in general more expressive than a block-diagonal system. When we made that statement, we implicitly assumed that both SSMs operate using the same state size and that the block-diagonal system has more than one block. In this case, there exists the difference in expressivity described above. Concretely, under the assumption of unique state encodings and an SSM hidden dimension of $N$, there exists an $N$-state FSA that can be emulated by PD-SSM but cannot be emulated by a block-diagonal SSM with maximum block size less than $N-1$.

One can question the assumption of unique state encodings. However, if we do not assume unique state encodings and instead allow arbitrary state representations, a single-unit SSM can represent a very large number of states, bounded only by the numerical representational capacity, as discussed in footnote 2 on page 6. We thus believe that our assumption is realistic, but we are open to alternative analyses. One might, for example, assume that a cluster of vectors at Euclidean distance $\epsilon$ from a centroid encodes a state.

Of course, our analysis is relevant in the context of FSAs. One can easily imagine that, under a different setting, block-diagonal SSMs might have advantages exactly due to the intra-block dynamics which you mention. There might be patterns that the sparse transitions of PD-SSM cannot capture. Some insight may be gained from the Walker et al. (2025) analysis of sparse transition matrices; the analysed matrices are however not column one-hot.

Comment

Thank you for your detailed response. I address each point individually below.

Regular Language Results

This was my initial suspicion, although I disregarded the idea, as it would imply that PD-SSM achieves $99.7 \pm 3\%$ on parity, which I don't believe is possible. I would appreciate if the authors could provide the corrected row for PD-SSM with the true standard deviations, although I do not consider this a pre-requisite to me raising my score. The important aspect is including the comparison with alternative structures in the final version of the paper, which I believe it is clear the authors intend to do.

UEA Results

Thank you for clarifying the comparison and including the additional baselines. I feel these results are a strong addition to the paper.

Theoretical Expressiveness

I apologise; I interpreted your statement about theoretical expressivity as being a general statement, as opposed to one in the context of FSA. I agree with your expanded analysis and think it makes another strong addition to the paper.

Comment

Thank you for the very constructive review and discussion. We're glad to hear you appreciate our additions to the paper.


Regular Language Results

The attention to detail is very much appreciated; we clearly misreported this standard deviation too. The correct results are the following:

| Model | Cycle Nav. | Even Pairs | Mod Arith. | Parity | Average |
|---|---|---|---|---|---|
| PD-SSM | 99.5 ±0.68 | 99.7 ±0.28 | 96.2 ±3.4 | 99.9 ±0.07 | 98.8 |

Most of the newly reported accuracies are higher because the training has now fully completed; the previously reported accuracies were obtained during model training, before the 100'000 training steps had finished.

The result on modular arithmetic is now lower by 0.2%. This happened after introducing the padding by 40 zeros in the code, mentioned in our first response after the rebuttal. We are not sure how to explain the discrepancy, but it is only 0.2%.

We certainly intend to report the full comparison with the alternative methods included in the original table.


Please let us know in case you have any further questions.

Comment

Thank you for your continued responses.

I have no further questions, and am now happy to recommend the paper for acceptance.

Review
Rating: 5

This paper addresses a fundamental limitation in State Space Models (SSMs): the trade-off between expressiveness and computational efficiency. While diagonal SSM blocks are computationally tractable, they suffer from limited expressive power. Conversely, unstructured transition matrices offer full expressiveness but at prohibitive computational costs. The authors propose a novel middle-ground approach using structured matrices with a PD parameterization, where P is a one-hot matrix computed via a complex procedure and D is diagonal. This parameterization maintains efficiency while forming a monoid under multiplication, enabling parallel scan algorithms.

The paper establishes theoretical foundations by measuring expressive power through the class of Finite State Automata that SSMs can emulate (without post-SSM nonlinearity). To address the non-trivial computations required for P, which present differentiation challenges, the authors provide implementation strategies and circumvention methods. The theoretical claims are empirically validated through testing on state tracking problems.

Strengths and Weaknesses

Strengths

  1. Clear Problem Formulation: The paper tackles one of the most important open problems in the SSM literature with a well-motivated theoretical framework.
  2. Theoretical Rigor: The proposed architecture is theoretically grounded and addresses fundamental expressiveness limitations of diagonal SSMs.
  3. Practical Considerations: The authors provide detailed analysis of training difficulties and propose concrete solutions for implementation challenges.
  4. Novel Contribution: The PD parameterization represents a principled approach to balancing expressiveness and computational efficiency.

Weaknesses

  1. Insufficient Experimental Validation: The experimental section has significant gaps that undermine the paper's contributions:
    • Missing experimental setup details: Critical information about hyperparameter selection, training procedures, and experimental configurations is absent. Specifically: How were hyperparameters chosen? Were hyperparameter sweeps conducted? How was fair comparison between different models ensured? The learning rate appears constant across S2 and S3, and hyperparameter choices for Mamba are not visible.
    • Lack of comparison with recent models: No benchmarking against important recent SSM variants including Mamba2, DeltaNet, Gated DeltaNet, and RWKV-7, which are essential baselines for evaluating the proposed method
    • Limited scope of experiments: While the focus on toy models and limited language modeling experiments may be acceptable for an initial architecture introduction (following similar literature precedents), the current experimental validation is insufficient to demonstrate the method's practical advantages
  2. Presentation Issues: Appendix A (Algebra and FSA) contains poorly organized algebraic content with numerous inconsistencies and unclear definitions that require substantial polishing for clarity. For example, Appendix A2 inconsistently switches between Z/nZ, Zn, and Cn to denote the same group. The mathematical exposition suffers from such notation inconsistencies, missing quantifications, and terminological confusion that hinder comprehension.

Questions

Minor Issues and Suggestions

  • Lines 34-36: The universal approximation claim is correct and actually supports your arguments even more strongly. Consider clarifying that in the cited result, the approximation is actually achieved by the neural network after the SSM, while the SSM itself primarily reconstructs the input. Since your approach focuses on the SSM without post-processing nonlinearity, making this distinction explicit would better highlight the significance of your theoretical contribution.
  • Lines 79-82: The invertibility of A, while implied by perturbation arguments, should be stated more explicitly.
  • Lines 86-87: (Pedantic comment) in addition to the stated condition the matrices need to be all diagonalizable by themselves as well.
  • Lines 144-148: The choice of P through post-processed mixture seems convoluted. Please elaborate on:
    • Whether this was the original design choice
    • Alternative approaches considered (e.g., input-dependent low-rank followed by hardmax)
    • Comparative analysis of different parameterization strategies
  • Line 165: (Pedantic comment) Property 2 alone is insufficient to justify associative scan usage. A remark after line 157 demonstrating that matrices in H can be written as PD would improve rigor.
  • Section 4.2: Consider expanding the analysis with additional comparisons that would strengthen the paper's impact. I assume only SSM blocks are compared here - it would be valuable to explore whether including MLPs afterwards changes the relative performance. Additionally, reporting how many layers Mamba requires to reach 100% accuracy would provide an extremely compelling addition to demonstrate the efficiency gains of your approach.
  • Lines 227-231: The architecture description needs expansion, potentially in an appendix.

Appendix Issues

  • Notation Consistency: Inconsistent use of Z/nZ, Zn, and C_n for the same group throughout
  • Lines 40-41: T(A) notation unclear; specify the second element of delta in Q and define σ₁,...,T
  • Lines 53-57: Quantify all mathematical objects
  • Lines 66-67: Inconsistent switching between "group" and "automaton" terminology
  • Line 141: Verify if C₄/C₂ × C₄ should be (C₂ × C₄)/C₄; inconsistent Z₄ and C₄ usage
  • Lines 144-146: Notation {e}/C₄, {e}/Z₄, {e}/G should be reversed
  • Line 173: Reference Definition 12 when first using "Uniform" in Definition 10
  • Line 181: There appears to be an inconsistency regarding fan-in restrictions. The text states that "NC1 allows for logarithmic depth, but restricts the fan-in of the gates to 2," while the definition only mentions that fan-in is bounded. Please clarify this discrepancy.
  • Line 231: Use standard notation instead of ∂⁺ for total derivative
  • Line 267: In the equation "xₖ = Aₖ xₖ₋₁ + bₖ", the subscript of A and b should be "k-1" for consistency with both the proof and the results presented below.
  • Line 273: Quantify P appropriately
  • Equation below 269: (Pedantic comment) Specify that α and β are not all zero in summation equalities

Recommendation

This paper addresses an important problem in the SSM literature with a theoretically sound and well-motivated approach. The core contribution—finding a middle ground between diagonal SSMs and unstructured matrices through the PD parameterization—is valuable and tackles a fundamental limitation in the field. However, the current submission has notable gaps that prevent clear acceptance at this stage.

I would genuinely like to accept this paper as it shows strong theoretical promise and addresses a central problem in the SSM literature. However, it requires multiple small but targeted improvements: the appendix and proofs need polishing, the experiments need improvement, and experimental details must be included to ensure fair comparisons with competing models. With proper revision addressing these specific issues, this work would clearly merit acceptance.

Limitations

yes

Final Justification

I recommend acceptance. The authors have addressed all raised concerns, expanded the experimental evaluation, and clarified the theoretical framework. The work tackles a fundamental limitation of modern SSMs and represents a meaningful contribution that merits publication.

Formatting Concerns

No concerns noticed.

Author Response

We thank the reviewer for their exceptionally careful reading of our paper. We are glad that they recognize the importance of the studied problem and the theoretical and methodological contributions of our work. We are pleased that the reviewer recognizes the empirical analysis and practical considerations addressed. It is very encouraging to hear that our work would merit acceptance given an appropriate revision, and we are especially thankful for the constructive and comprehensive list of suggestions for improvements.


Insufficient Experimental Validation

Experimental Setup Details: The text we submitted indeed omits important information on the experimental setup, including the optimizer, loss functions, and hyperparameter search procedures. The accidental omission of these details is unfortunate, and we have made sure to include them in full detail in the final text. For example, the experimental details on LRA only listed the hyperparameters for each task. In contrast, the improved text now reads as follows:

The optimal hyperparameters for our model on the long-range arena dataset were obtained via a grid search defined by all combinations of the following values: Learning rate in {1e-5, 5e-5, 1e-4, 5e-4, 1e-3}, weight decay in {0.001, 0.01, 0.1, 0.05}, and dropout in {0.0, 0.1, 0.3}. On each task, we report the average accuracy over three random seeds. The embedding dimension and state size were both fixed to 128, with the exception of the Retrieval dataset in which we reduced it to 64. We used Adam with the default parameters (0.9, 0.999). The hyperparameters of the models we compare with were obtained via an extensive Bayesian hyperparameter search, as outlined in (Soydan, Zubic et al., 2025).

We agree that clearly stating such information is central to gauge the fairness of our comparison, and have added analogous descriptions to all of the tasks considered.

Comparison with Recent Models and Limited Experimental Scope: We agree that a thorough comparison with the variety of modern SSMs is of central importance. While we did address the two extreme ends of the spectrum, diagonal and dense transition matrices, a more thorough comparison with a range of established models was missing. To address this, we now provide a significantly extended comparison with related methods.

Firstly, we extended the table from Walker et al. (2025) by evaluating our proposed model on the full set of state-tracking tasks from Deletang et al. (2023). We fully conform to their experimental setup which adopts a two-layer architecture and uses a single, fixed set of hyperparameters on all tasks. While the authors used the best results obtained with state size 128 or 512, we fixed the state size to 128. A summary of central results is shown below. Notably, our model sets the state-of-the-art performance with no hyperparameter tuning.

| Model | Cycle Nav. | Even Pairs | Mod Arith. | Parity | Average |
|---|---|---|---|---|---|
| DeltaNet | 49.0 ±5.4 | 100.0 ±0.0 | 44.2 ±8.0 | 58.4 ±1.1 | 62.9 |
| DeltaNet[−1, 1] | 47.3 ±6.0 | 100.0 ±0.0 | 63.9 ±7.2 | 97.2 ±1.7 | 77.1 |
| Gated DeltaNet | 55.9 ±7.8 | 100.0 ±0.0 | 49.9 ±7.5 | 56.6 ±0.7 | 65.6 |
| Gated DeltaProduct[-1,1] | 61.4 ±12.0 | 99.6 ±0.6 | 58.0 ±20.0 | 98.8 ±1.5 | 79.5 |
| RWKV-7 | 37.8 ±5.0 | 88.1 ±14.2 | 41.6 ±13.8 | 51.0 ±0.3 | 54.6 |
| Transformer | 24.4 ±0.5 | 90.4 ±10.4 | 23.6 ±0.7 | 52.2 ±0.4 | 47.7 |
| Mamba | 50.7 ±7.8 | 100.0 ±0.0 | 38.1 ±6.0 | 54.5 ±1.5 | 60.8 |
| BD4–LNCDE | 100.0 ±0.0 | 91.3 ±12.2 | 45.6 ±7.8 | 88.4 ±13.9 | 81.3 |
| D–DE16–LNCDE | 77.3 ±21.3 | 81.4 ±9.9 | 94.5 ±5.2 | 81.0 ±18.1 | 83.6 |
| PD-SSM | 97.7 ±0.02 | 96.2 ±0.07 | 96.4 ±0.04 | 99.7 ±0.03 | 97.5 |

While the table above significantly extends the set of experiments and comparisons we already report on, it is still limited to FSA emulation. We extend the evaluation of our method to more real-world tasks by including additional experiments on time-series classification with long-range dependencies.

| Dataset | S6 | Log-NCDE | D-LNCDE | BD-LNCDE | DE-LNCDE | PD-SSM |
|---|---|---|---|---|---|---|
| EigenWorms | 85.0 ±16.1 | 85.6 ±5.1 | 80.0 ±5.4 | 86.7 ±4.1 | 87.8 ±5.7 | 87.2 ±5.1 |
| EthanolConcentration | 26.4 ±6.4 | 34.4 ±6.4 | 25.8 ±3.9 | 28.6 ±6.4 | 25.6 ±3.7 | 38.7 ±6.6 |
| Heartbeat | 76.5 ±8.3 | 75.2 ±4.6 | 72.9 ±5.0 | 75.2 ±6.0 | 76.1 ±3.1 | 76.8 ±3.0 |
| MotorImagery | 51.3 ±4.7 | 53.7 ±5.3 | 54.0 ±7.3 | 58.2 ±4.1 | 54.0 ±5.3 | 55.4 ±5.0 |
| SelfRegulationSCP1 | 82.8 ±2.7 | 83.1 ±2.8 | 83.5 ±2.1 | 84.9 ±1.9 | 83.5 ±6.6 | 82.8 ±3.2 |
| SelfRegulationSCP2 | 49.9 ±9.5 | 53.7 ±4.1 | 53.0 ±5.8 | 53.3 ±7.5 | 46.3 ±3.6 | 54.0 ±2.8 |
| Average | 62.0 | 64.3 | 61.5 | 64.5 | 62.2 | 65.8 |

PD-SSMs achieve a new state-of-the-art among the evaluated state space models on this dataset (taken from Walker et al. 2025), notably even without building on controlled differential equation (CDE) architectures. S6 is the previous best-performing evaluated non-CDE SSM, with a more extensive overview of SSM results provided in Walker et al. (2024).


Presentation Issues

We thank the reviewer for pointing out the notational and conceptual inconsistencies. We aimed to provide a self-contained tutorial on the relevant concepts and frameworks. We are thus particularly grateful for the detailed proofread provided by the reviewer. In addition to their suggestions, we have found several unfortunate inconsistencies that we have since corrected. For example, the initial submission's Theorem 3 states the following:

Under the assumption that $TC^0 \neq NC^1$, constant depth and width diagonal log-precision SSMs can be emulated by a logspace-uniform $TC^0$ circuit.

The assumption is, however, not required for the stated conclusion, and the implications of the conclusion are not clearly spelled out. The improved version of Theorem 3 is now summarized as:

Under the assumption that $TC^0 \neq NC^1$, log-precision diagonal SSMs cannot emulate automata with non-solvable transformation groups.


Addressing the Suggestions:

Lines 79-82: The text now explicitly states: $\forall \varepsilon > 0,\ \forall A \in \mathbb{R}^{N\times N},\ \exists A_\varepsilon \in \mathbb{R}^{N \times N}$, invertible $W \in \mathbb{C}^{N \times N}$, and diagonal $\Lambda \in \mathbb{C}^{N \times N}$ s.t. $A_\varepsilon = W \Lambda W^{-1}$ and $\|A - A_\varepsilon\|_F < \varepsilon$.

Lines 86-87: Assuming the $A(u_t)$ are diagonalizable under the same change of basis already implies individual diagonalizability.

Lines 144-148: The specific choice of the P parametrization does in fact provide certain practical benefits compared to alternatives such as a linear generator by reducing the number of parameters and operations. We now include a paragraph elaborating on those.

Section 4.2: We agree that the proposed experiments are interesting, and we did in fact report results with these two ablations. Due to the page constraint, the ablations are part of the appendix (E.2 and E.3). We deeply appreciate the suggestion for improvement; a more extensive version of the ablations would also benefit the main text.

Lines 227-231: The new version of the paper provides a full sketch of the architecture, highlighting architectural details such as positioning of the norms and the design of the residual connections.


Addressing the Appendix Issues:

  1. We now refer to the cyclic group using Zn.
  2. We now provide a definition of $\delta(\cdot, \sigma_{1,\dots,T}): Q \rightarrow Q$ as the composition of the single-step $Q \rightarrow Q$ functions $\delta(\cdot, \sigma_1), \dots, \delta(\cdot, \sigma_T)$, with $\sigma_i \in \Sigma$.
  3. We now denote the group set as G and denote the binary operation by $\cdot$. We have appropriately quantified all objects in the sense that it is always clear that the elements listed in the different properties belong to the group set G.
  4. We now make no references to automata in this section.
  5. The factor group should indeed be written as (Z₂ × Z₄)/Z₄. We have corrected this error and ensured consistent notation throughout.
  6. See above.
  7. We have removed the mention of uniformity in this definition.
  8. The definition of $NC^1$ has been revised to correctly state that the gate fan-in is restricted to 2.
  9. We have revised the text to better conform to standard notation, writing total derivatives using $d$ and partial derivatives using $\partial$.
  10. We have corrected the typos in the proof and subsequent results to ensure consistency with the presented indexing.
  11. The text now makes it clear that P is an element of the monoid H.
  12. We have incorporated the helpful suggestion.

Due to space constraints, we could not address each individual suggestion in the rebuttal. Once again, we thank the reviewer for their thoughtful feedback, including their extensive list of comments and suggestions. Addressing their concerns and following their suggestions allowed us to substantially improve the paper, and by doing so, we hope to convince them to recommend acceptance.

Comment

Dear Reviewer XSo7,

Thank you very much for acknowledging our rebuttal. Once again, we are grateful for your thorough and constructive review, which has greatly contributed to improving both the quality and scope of our paper.

In particular, we have taken care to address your concerns by:

  • Extending our experimental validation to include comparisons with more recent SSMs,
  • Evaluating our method on natural data, specifically time-series classification tasks with long-range dependencies,
  • And refining the theoretical formulation and presentation throughout the paper.

We would be happy to engage in further discussion should you have any additional questions or comments during the discussion phase.

Comment

Thank you for the detailed response, I have raised my score to a 5.

Review
Rating: 5

Building on prior work, this paper proposes a new parameterization of transition matrices for state-space models that achieve greater expressivity for state tracking/automata simulation with lower memory. Concretely:

  1. The paper proposes parameterizing transition matrices as column one-hot times diagonal (PD) form
  2. Theoretically, the paper shows this parameterization unlocks expressivity for state tracking with linear memory in the convolutional form, which is reduced compared to a dense parameterization
  3. Empirically, the authors show this parameterization attains better runtime, better performance on fully synthetic and natural-language state tracking, and strong Long Range Arena performance relative to other time-variant SSMs.

Strengths and Weaknesses

Strengths

  1. The paper is well motivated, clear, and rigorous
  2. The proposed PD parameterization is a compelling method to achieve greater expressivity while minimizing memory overhead.
  3. The empirical evaluation is fairly extensive, spanning synthetic state tracking evaluations and more natural tasks

Weaknesses

In addition to the existing empirical evaluations, the paper could benefit from reporting language modeling performance (e.g., cross entropy or bits/byte) on a standard dataset.

The following claim is made regarding other methods that balance efficiency and expressivity via SSM transition matrix parameterizations:

However, these approaches either scale unfavorably and thus prohibit large-scale training, or are not expressive enough to encode arbitrary N-state FSA with a reasonably-sized model such as a single layer with state dimensionality N.

This claim is made without rigorous support: you have not shown in what sense other methods like RWKV-7 scale poorly. E.g., RWKV-7 also uses a constrained parameterization where each matrix has rank 1. The products of those matrices may have higher rank, which suggests that SCAN might take more memory, but you have not shown this rigorously, so this sentence comes off as too strong. You should either elaborate a bit about the other methods you are comparing against or hedge this sentence appropriately.

Regarding scalability, some of these other methods (e.g., RWKV-7) have scaled up beyond the experiments in this paper. So while they might "scale unfavorably" compared to your approach in principle, they have indeed been scaled up.

Theoretically, time-invariant (fixed) diagonal transition matrices suffice to guarantee universal approximation to any regular causal sequence-to-sequence map (Orvieto et al. 2024).

I don't understand this sentence in the intro. Clarify this and explain how it is consistent with the reduced expressivity of diagonal transition matrices for state tracking.

Questions

“a finite-state automaton with N states that cannot be emulated by any single-layer SSM” What does emulation precisely mean? An SSM can classify the output state of the FSA using a linear layer?

What about the other direction: Given a single layer of your PD SSMs, if we want to capture its output with a WFA, how many states would we need?

Regarding your natural language state tracking task: what does a state correspond to? a physical location? Show a full example input.

In 4.4, why are time-invariant models better than time-variant models? Is this expected/found elsewhere in the literature?

Limitations

The major limitation I see is the lack of larger-scale language modeling experiments: if these were present and validated that the architecture could be scaled up (e.g., with an appropriate hardware kernel) and achieve good performance, I would consider raising my score. I would like to see the cross entropy attained by an SSM with this parameterization or even scaling laws or measures of compute efficiency relative to other architectures. However, I understand these experiments could require significant resources, and could reasonably be relegated to future work motivated by this paper.

Final Justification

I appreciate the new empirical results and maintain my high score.

Formatting Concerns

N/A

Author Response

We are glad that the reviewer has a positive impression of our paper, highlighting its clear motivation, rigor, and strong empirical evaluation, as well as the premise of our PD parameterization in enhancing expressivity with minimal memory overhead.


Extended evaluation, including standard datasets:

We share the reviewer's interest in how the increased expressiveness translates to other impactful tasks, such as language modeling, code generation, or mathematical problem solving, tasks typically executed by LLMs. Such an evaluation was unfortunately out of the scope of our paper, but it is future work we are deeply interested in. Based on recent literature, increased SSM expressivity can translate to practical benefits in other real-world tasks. Grazzi et al. (ICLR 2025) demonstrate that increasing the expressivity of DeltaNet, which can be interpreted as a matrix-valued SSM, translates to significantly reduced perplexity on code generation and mathematical problem solving. Hu et al. (ACL 2025) consistently observe that models pre-trained on formal languages, specifically context-free and context-sensitive languages, exhibit increased performance on standard NLP tasks. A large-scale evaluation of the benefits of our concrete implementation is certainly something we plan to address in future work.

While we do not provide language modeling performance, we have extended the evaluation of our model to an extensive collection of baselines (+10 new models) on: 1) four tasks in FSA emulation (two new tasks compared to our submission); 2) a real-world dataset in long-range multi-class time-series classification.

The table below extends the experimental evaluation from Walker et al. (2025). We now evaluate +10 sequence models on four FSA emulation tasks. Notably, our model sets the state-of-the-art performance in this setup with no hyperparameter tuning.

| Model | Cycle Nav. | Even Pairs | Mod Arith. | Parity | Average |
|---|---|---|---|---|---|
| sLSTM | 39.2 ±4.3 | 100.0 ±0.0 | 34.8 ±1.5 | 100.0 ±0.0 | 68.5 |
| xLSTM[1:1] | 56.6 ±3.6 | 100.0 ±0.0 | 34.4 ±1.4 | 100.0 ±0.0 | 72.8 |
| DeltaNet | 49.0 ±5.4 | 100.0 ±0.0 | 44.2 ±8.0 | 58.4 ±1.1 | 62.9 |
| DeltaNet[−1, 1] | 47.3 ±6.0 | 100.0 ±0.0 | 63.9 ±7.2 | 97.2 ±1.7 | 77.1 |
| Gated DeltaNet | 55.9 ±7.8 | 100.0 ±0.0 | 49.9 ±7.5 | 56.6 ±0.7 | 65.6 |
| Gated DeltaProduct[-1,1] | 61.4 ±12.0 | 99.6 ±0.6 | 58.0 ±20.0 | 98.8 ±1.5 | 79.5 |
| RWKV-7 | 37.8 ±5.0 | 88.1 ±14.2 | 41.6 ±13.8 | 51.0 ±0.3 | 54.6 |
| Transformer | 24.4 ±0.5 | 90.4 ±10.4 | 23.6 ±0.7 | 52.2 ±0.4 | 47.7 |
| Mamba | 50.7 ±7.8 | 100.0 ±0.0 | 38.1 ±6.0 | 54.5 ±1.5 | 60.8 |
| BD4–LNCDE | 100.0 ±0.0 | 91.3 ±12.2 | 45.6 ±7.8 | 88.4 ±13.9 | 81.3 |
| D–DE16–LNCDE | 77.3 ±21.3 | 81.4 ±9.9 | 94.5 ±5.2 | 81.0 ±18.1 | 83.6 |
| PD-SSM | 97.7 ±0.02 | 96.2 ±0.07 | 96.4 ±0.04 | 99.7 ±0.03 | 97.5 |

The table below reports the performance of a variety of parallelizable sequence models on time-series classification with sequences of lengths up to over 17k. S6 is the previous best-performing evaluated SSM, with a more extensive overview of SSM results provided in Walker et al. (2024).

| Dataset | S6 | Log-NCDE | D-LNCDE | BD-LNCDE | DE-LNCDE | PD-SSM |
|---|---|---|---|---|---|---|
| EigenWorms | 85.0 ±16.1 | 85.6 ±5.1 | 80.0 ±5.4 | 86.7 ±4.1 | 87.8 ±5.7 | 87.2 ±5.1 |
| EthanolConcentration | 26.4 ±6.4 | 34.4 ±6.4 | 25.8 ±3.9 | 28.6 ±6.4 | 25.6 ±3.7 | 38.7 ±6.6 |
| Heartbeat | 76.5 ±8.3 | 75.2 ±4.6 | 72.9 ±5.0 | 75.2 ±6.0 | 76.1 ±3.1 | 76.8 ±3.0 |
| MotorImagery | 51.3 ±4.7 | 53.7 ±5.3 | 54.0 ±7.3 | 58.2 ±4.1 | 54.0 ±5.3 | 55.4 ±5.0 |
| SelfRegulationSCP1 | 82.8 ±2.7 | 83.1 ±2.8 | 83.5 ±2.1 | 84.9 ±1.9 | 83.5 ±6.6 | 82.8 ±3.2 |
| SelfRegulationSCP2 | 49.9 ±9.5 | 53.7 ±4.1 | 53.0 ±5.8 | 53.3 ±7.5 | 46.3 ±3.6 | 54.0 ±2.8 |
| Average | 62.0 | 64.3 | 61.5 | 64.5 | 62.2 | 65.8 |

As we can see, our method also achieves a new state-of-the-art among the evaluated SSMs on this dataset (taken from Walker et al. 2025), outperforming approaches that rely on controlled differential equation (CDE) architectures.


Clarifying our statements on the scaling of related methods:

The reviewer is correct to point out that RWKV-7 and related methods have indeed been scaled to larger experimental setups. Our statement on unfavorable scaling preventing large-scale training does not refer to the family of matrix-valued SSMs that adopt a diagonal plus low-rank (DPLR) transition matrix structure, such as RWKV-7. The comment rather refers to unstructured transition matrices from Terzic et al. (2025), which do enable the emulation of any N-state FSA using a single layer with state size N, yet are prohibitively expensive in a larger setting. We have made sure to explicitly state these facts in a revised version of the paper.

We additionally acknowledge that our submission did not provide sufficient theoretical detail on RWKV-7 and related DPLR methods. We have referred to previous work which provides bounds on the model size necessary to emulate FSAs using such approaches, yet we have not clearly stated what those bounds are and how they compare to ours.

To clarify what we mean by the statement that DPLR-based methods are not expressive enough to encode any N-state FSA with a reasonably-sized model, we turn to Theorem 4 from Grazzi, Siems et al. (2024), which exactly considers SSMs with diagonal plus low-rank transition matrices. We paraphrase the theorem to focus on the relevant aspects of it as follows:

SSMs with state transition matrices that are repeated products of Generalized Householder matrices, each with eigenvalues in the range $[-1, 1]$, can recognize any regular language. In particular, every FSA $\mathcal{A} = (\Sigma, Q, q_0, \delta)$ can be implemented by a finite-precision SSM with $s \leq 2^{|Q|}$ layers, where the state is of size $n \leq |Q|$ and the number of Generalized Householder factors in the transition matrix is $p \leq s$.

In this construction, the number of layers is exponential in the number of states, as is the number of factors which build up the transition matrix. In contrast, our model allows the emulation of any N-state FSA using one layer and a state size of N, where the size of the readout is linear in the number of automaton states.

In a separate result, Peng, Zhang, Goldstein et al. (2025) show that the RWKV-7 model is able to emulate an arbitrary FSA using four layers. However, the provided construction relies on MLP layers which implement lookup tables whose sizes are exponential in the size of the state set, which becomes infeasible for even moderately large FSAs.

We have now added a paragraph to the paper that contextualizes our theoretical results through comparison with these previous theoretical results on model expressivity, and thank the reviewer for bringing this up.


Clarifying the universal approximation property of diagonal SSMs

The referenced result on universal function approximation assumes a mapping from a sufficiently regular fixed-length sequence to another fixed-length sequence. The setting of tokens living in a finite set, such as what we have, is drastically different and not discussed in their work. In addition, the cited work makes no assumption on the numerical precision used by the SSM, and the latent dimension of the SSM and the subsequent MLP scale with the length of the sequence. As such, the framework and the assumptions used by the authors are vastly different from the empirically validated circuit-complexity bounds presented in Merrill et al. (2024), which form the background of our work.


Questions:

  1. We indeed refer to accurate linear classification of the SSM hidden states into the classes defined by the automaton states. In our tasks, the input to the SSM is a sequence of one-hot encoded entries from the FSA vocabulary, and the model, with a linear layer at the output, is trained to predict the automaton state given such a sequence of inputs.

  2. The number of possible values that the entries of the hidden state can assume is very large, essentially limited only by numerical representation capacity. An interesting approach towards answering this question may make use of methods for automatic extraction of FSAs from the model, such as e.g. the one proposed in Weiss et al. (2019).

  3. In this task, the states are fully defined by the underlying automaton. The difference between this task and the one we presented in Section 4.2 lies in the encoding of the inputs. Whereas the inputs in the "synthetic state tracking" task were one-hot encodings of a discrete set of inputs, the inputs we consider in "natural-language state tracking" are LLM-generated embeddings of meaningful English sentences.

  4. This result is indeed expected and is found elsewhere in literature, e.g., Soydan, Zubic et al. (2025).


We thank the reviewer for their thoughtful review and hope that we were able to address all their questions and concerns.

Comment

We again appreciate the positive score.

With regards to the exponentially wide MLP hidden state, we are concretely referring to the final sentence in appendix D.2 of the arXiv version of the RWKV-7 paper, which states the following: "Our construction uses MLPs to implement lookup tables with sizes on the order of $|\Sigma|^{2n}$, which may require MLP layers that are exponentially wide in the number of states of the original DFA."

Comment

Thanks for your response! I appreciate the new empirical results and maintain my high score.

One clarification regarding the discussion of RWKV-7 in the rebuttal:

However, the provided construction relies on MLP layers which implement lookup tables whose sizes are exponential in the size of the state set

I don't think this is correct: a lookup table should be of polynomial (not exponential) size in the number of states, since it maps each (state, token) pair to a state, and thus takes size roughly $\lvert Q \rvert^2 \cdot \lvert \Sigma \rvert$. However, it is possible for RNNs to do better than this, which is perhaps the intuition you're alluding to. See: https://arxiv.org/html/2402.15814v2
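
For a concrete sense of the gap between the two expressions, here is a quick back-of-the-envelope comparison (illustrative values only, assuming $n = |Q|$ in the quoted bound):

```python
# Illustrative comparison of the two lookup-table size expressions
# (hypothetical FSA sizes, just to show the scaling behaviour).
for n_states, n_symbols in [(4, 2), (8, 2), (16, 4)]:
    quoted_bound = n_symbols ** (2 * n_states)  # |Sigma|^(2n), as quoted from the RWKV-7 appendix
    plain_table = n_states ** 2 * n_symbols     # |Q|^2 * |Sigma|, a direct (state, token) -> state table
    print(n_states, n_symbols, quoted_bound, plain_table)
```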

Review
4

This paper proposes a novel parameterization of the transition matrices in selective state-space models (SSMs). Specifically, the authors replace diagonal transition matrices with a product of a column one-hot matrix (P) and a complex-valued diagonal matrix (D). Through both theoretical and empirical analysis, they argue that this new parameterization effectively models arbitrary finite-state automata (FSA). They also demonstrate improved efficiency by comparing the empirical runtime of their method against SD-SSM and Diagonal SSM baselines.
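
To make the efficiency claim concrete, the sketch below (ours, not the authors' implementation; the actual model uses a complex-valued $D$, kept real here for simplicity) applies $A(u_t) = P(u_t) D(u_t)$ to a state vector in $O(N)$ by storing $P$ as a per-column row-index vector:

```python
import torch

def apply_pd_transition(x, row_idx, d):
    """Apply A = P D to a state vector x in O(N).

    x:       (N,) state vector
    row_idx: (N,) long tensor; row_idx[j] is the row of the single
             non-zero entry in column j of the column one-hot matrix P
    d:       (N,) diagonal of D
    """
    y = d * x                          # D x: elementwise product, O(N)
    out = torch.zeros_like(y)
    out.index_add_(0, row_idx, y)      # P (D x): scatter-add, O(N)
    return out

# Toy example: N = 3, P encodes the map 0 -> 1, 1 -> 2, 2 -> 0, and D = I.
x = torch.tensor([1.0, 0.0, 0.0])
row_idx = torch.tensor([1, 2, 0])
d = torch.ones(3)
print(apply_pd_transition(x, row_idx, d))   # tensor([0., 1., 0.])
```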

Strengths and Weaknesses

Strengths

  1. The paper is clearly written and easy to follow.
  2. It presents a strong theoretical foundation, supported by empirical improvements in both performance and efficiency.
  3. The work addresses an important problem in the expressivity and computational efficiency of SSMs.
  4. The proposed changes are simple to implement yet yield meaningful benefits.

Weaknesses

The analysis is centered entirely around finite-state automata. While FSAs are a useful proxy, state-space models are increasingly applied in real-world tasks such as natural language processing and computer vision. It remains unclear whether the proposed parameterization would translate into practical benefits in these broader domains.

Questions

How does the FSA-based analysis inform real-world applications? Do FSAs share structural or behavioral similarities with typical use cases of SSMs in NLP or vision tasks?

Limitations

yes

Justification for Final Rating

Overall, I recommend the acceptance of the paper because:

  • The paper presents a strong theoretical foundation, supported by empirical improvements in both performance and efficiency.
  • The work addresses an important problem in the expressivity and computational efficiency of SSMs.
  • The proposed changes are simple to implement yet yield meaningful benefits.
  • Effectiveness in realistic scenarios was addressed in the rebuttal with time-series classification experiments.

Formatting Concerns

No concerns.

Author Response

We thank the reviewer for their thoughtful feedback and are pleased they found the paper clear and appreciated our focus on the expressivity-efficiency trade-off in state-space models (SSMs), as well as the simplicity and theoretical depth of our proposed parameterization.


On the FSA focus and practical relevance

We agree that real-world applicability is critical and appreciate the concern about whether FSA-based analysis generalizes beyond synthetic settings. Our core contribution is a state-space model that can exactly emulate arbitrary finite-state automata (FSA) while maintaining the parallel scan efficiency of diagonal SSMs. This property significantly enhances the expressive power of parallel sequence models—an ability particularly relevant in domains requiring structured reasoning, such as code generation or formal language understanding.

Although large-scale NLP and vision tasks were beyond the scope of this submission, we note growing evidence that architectural expressivity grounded in formal structures like FSAs does translate into practical benefits. For example, Grazzi et al. (ICLR 2025) demonstrate that making DeltaNet—a matrix-valued SSM—more expressive in terms of its ability to emulate finite-state automata directly translates to improved performance on tasks like code generation and mathematical problem solving. Hu et al. (ACL 2025) similarly show that pretraining on formal languages enhances downstream NLP task performance. These works reinforce our belief that FSA emulation capabilities can be a building block for success in broader domains.


New real-world benchmarks

In our work, we introduce the first state-space model capable of perfectly tracking the states of finite-state automata (FSA) while maintaining the parallel scan cost at the same asymptotic level as diagonal state-space models. Through experiments on the Long Range Arena (LRA), we already observed practical benefits on text and image classification.

To address the reviewer's helpful suggestion for more practical evaluations, we added experiments on six multiclass time-series classification datasets (table below), covering long sequences of up to 17,000 time steps. This benchmark spans real-world domains such as healthcare and could be evaluated within the rebuttal window. To obtain the results, we ran a grid search over learning rates in {1e-5, 1e-4, 1e-3} and number of layers in {1, 2}. The state dimensionality was fixed to 16, and the embedding dimensionality was fixed to 64. All other hyperparameters were re-used from the referenced paper (Walker et al. 2025). Notably, this hyperparameter grid is significantly less extensive than the one used to evaluate the baseline methods (see Walker et al. 2024), yet we still achieve state-of-the-art results among the evaluated methods. S6 is the previous best-performing non-CDE SSM in this evaluation; a more extensive overview of SSM results, including LRU, S5, and Mamba, is provided in Walker et al. (2024).

| Dataset | S6 | Log-NCDE | D-LNCDE | BD-LNCDE | DE-LNCDE | PD-SSM |
| --- | --- | --- | --- | --- | --- | --- |
| EigenWorms | 85.0 ±16.1 | 85.6 ±5.1 | 80.0 ±5.4 | 86.7 ±4.1 | 87.8 ±5.7 | 87.2 ±5.1 |
| EthanolConcentration | 26.4 ±6.4 | 34.4 ±6.4 | 25.8 ±3.9 | 28.6 ±6.4 | 25.6 ±3.7 | 38.7 ±6.6 |
| Heartbeat | 76.5 ±8.3 | 75.2 ±4.6 | 72.9 ±5.0 | 75.2 ±6.0 | 76.1 ±3.1 | 76.8 ±3.0 |
| MotorImagery | 51.3 ±4.7 | 53.7 ±5.3 | 54.0 ±7.3 | 58.2 ±4.1 | 54.0 ±5.3 | 55.4 ±5.0 |
| SelfRegulationSCP1 | 82.8 ±2.7 | 83.1 ±2.8 | 83.5 ±2.1 | 84.9 ±1.9 | 83.5 ±6.6 | 82.8 ±3.2 |
| SelfRegulationSCP2 | 49.9 ±9.5 | 53.7 ±4.1 | 53.0 ±5.8 | 53.3 ±7.5 | 46.3 ±3.6 | 54.0 ±2.8 |
| Average | 62.0 | 64.3 | 61.5 | 64.5 | 62.2 | 65.8 |

PD-SSM achieves a new state of the art among the evaluated state-space models on this benchmark (results taken from Walker et al. 2025), notably without building on controlled differential equation (CDE) architectures.

These findings highlight a growing consensus around the practical benefits of expressive architectures, and they motivate our ongoing work. While the LRA benchmark offers valuable insight, we acknowledge its somewhat synthetic nature. Our newly added time-series experiments help address this by demonstrating real-world relevance. A comprehensive evaluation of our model on large-scale NLP tasks remains an exciting direction for future work.


Extended automaton emulation results

We also significantly expanded our comparison on FSA emulation tasks. We now include over 10 recent baselines across four automaton challenges, two more than in the original submission. Concretely, we extended the table from Walker et al. (2025) by evaluating our proposed model on the full set of state-tracking tasks from Deletang et al. (2023). We fully conform to their experimental setup, which adopts a two-layer architecture and uses a single, fixed set of hyperparameters across all tasks. While those authors reported the best results obtained with state size 128 or 512, we fixed the state size to 128. A summary of the central results is reported in the table below. Notably, our model achieves state-of-the-art performance without any hyperparameter tuning.

| Model | Cycle Nav. | Even Pairs | Mod. Arith. | Parity | Average |
| --- | --- | --- | --- | --- | --- |
| sLSTM | 39.2 ±4.3 | 100.0 ±0.0 | 34.8 ±1.5 | 100.0 ±0.0 | 68.5 |
| xLSTM[1:1] | 56.6 ±3.6 | 100.0 ±0.0 | 34.4 ±1.4 | 100.0 ±0.0 | 72.8 |
| DeltaNet | 49.0 ±5.4 | 100.0 ±0.0 | 44.2 ±8.0 | 58.4 ±1.1 | 62.9 |
| DeltaNet[−1, 1] | 47.3 ±6.0 | 100.0 ±0.0 | 63.9 ±7.2 | 97.2 ±1.7 | 77.1 |
| Gated DeltaNet | 55.9 ±7.8 | 100.0 ±0.0 | 49.9 ±7.5 | 56.6 ±0.7 | 65.6 |
| Gated DeltaProduct[-1,1] | 61.4 ±12.0 | 99.6 ±0.6 | 58.0 ±20.0 | 98.8 ±1.5 | 79.5 |
| RWKV-7 | 37.8 ±5.0 | 88.1 ±14.2 | 41.6 ±13.8 | 51.0 ±0.3 | 54.6 |
| mLSTM | 87.4 ±4.5 | 100.0 ±0.0 | 29.0 ±5.1 | 50.9 ±0.3 | 66.8 |
| Transformer | 24.4 ±0.5 | 90.4 ±10.4 | 23.6 ±0.7 | 52.2 ±0.4 | 47.7 |
| Mamba | 50.7 ±7.8 | 100.0 ±0.0 | 38.1 ±6.0 | 54.5 ±1.5 | 60.8 |
| D–LNCDE | 73.5 ±4.7 | 100.0 ±0.0 | 20.8 ±0.3 | 91.9 ±14.5 | 71.6 |
| WH–LNCDE | 66.7 ±7.9 | 87.5 ±17.2 | 23.9 ±1.3 | 62.1 ±2.2 | 60.1 |
| BD4–LNCDE | 100.0 ±0.0 | 91.3 ±12.2 | 45.6 ±7.8 | 88.4 ±13.9 | 81.3 |
| D–DE16–LNCDE | 77.3 ±21.3 | 81.4 ±9.9 | 94.5 ±5.2 | 81.0 ±18.1 | 83.6 |
| DPLR4–LNCDE | 83.9 ±20.1 | 97.7 ±5.2 | 26.0 ±0.7 | 79.6 ±20.3 | 71.8 |
| PD-SSM | 97.7 ±0.02 | 96.2 ±0.07 | 96.4 ±0.04 | 99.7 ±0.03 | 97.5 |

Conclusion

While our focus is on improving the theoretical expressiveness of SSMs via FSA emulation, we have taken concrete steps to show this expressiveness translates into real-world benefits on long-range, structured sequence tasks. A comprehensive evaluation on large-scale NLP or vision datasets remains an exciting direction for future work, and we appreciate the reviewer’s encouragement in that direction.

Comment

Thank you very much for your response and for acknowledging our rebuttal. We are glad that the additional results successfully addressed your sole concern regarding the practical value of our work.

You mentioned maintaining your score in support of recommending acceptance. For clarity, did you mean that you are keeping your score at 4, or are you considering an increase to 5 (accept)? In either case, we sincerely appreciate your support and remain available for any further questions or clarifications you may have.

Comment

Thank you for the rebuttal! I believe the additional discussions clarified the practical value of this work. I will keep my score to recommend acceptance.

Final Decision

In this paper, the authors propose PD-SSM, a new parametrization of transition matrices for selective state-space models. Instead of diagonal matrices (efficient but limited) or unstructured matrices (expressive but costly), PD-SSM represents transitions as the product of a column one-hot matrix (P) and a complex diagonal matrix (D). This keeps efficiency comparable to diagonal SSMs while allowing exact emulation of finite-state automata (FSA). The authors perform theoretical analysis to show expressivity guarantees and performance improvements, and experiments covering synthetic FSA tracking, natural-language encoded FSAs, and Long Range Arena benchmarks.

Initial reviews were broadly positive. Reviewers agreed that the parametrization is simple and theoretically grounded, and that the experiments support improvements in efficiency and expressivity (R-pBTL, R-DCmS). While some reviewers appreciated that the evaluation was already extensive, it was generally pointed out that the scope was still limited: R-pBTL questioned whether the FSA focus translates to broader domains such as NLP or vision, while R-DCmS noted the absence of language modeling results and said that claims about other approaches (e.g., RWKV) scaling poorly may have been overstated.

In the rebuttal, the authors added experiments on six multiclass time-series classification datasets with sequences up to 17k steps, showing PD-SSM performing better than prior SSMs. They also cited related work suggesting that improved expressivity grounded in FSAs can transfer to NLP and code generation tasks. For the point raised regarding the RWKV comparison, they offered clarification on the scaling claims by distinguishing unstructured SSMs from diagonal-plus-low-rank methods such as RWKV, and added more context on how their theoretical results compare to prior work. Reviewers were satisfied with these clarifications and maintained positive recommendations.

Overall, the paper presents a parametrization that expands the expressivity of SSMs while keeping efficiency, with extensive experiments and theoretical analysis backing up the claims. I recommend the paper be accepted, with the expectation that the final version will include the extended baselines, time-series results, and revised comparisons offered in the rebuttal.