PaperHub
4.9/10
Poster · 4 reviewers
Min 2 · Max 3 · Std dev 0.4
Ratings: 3, 3, 3, 2
ICML 2025

UnHiPPO: Uncertainty-aware Initialization for State Space Models

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

HiPPO extension based on linear stochastic control theory and the Kalman filter making SSMs more robust against noise

Abstract

Keywords

state space, uncertainty, hippo, mamba, kalman, noise, filter

Reviews and Discussion

Review (Rating: 3)

This paper studies the HiPPO framework with noisy data. While the original HiPPO(-LegS) framework is based on projecting a function onto Legendre polynomials, it assumes that the input function is noise-free. The paper proposes an alternative way of formulating the HiPPO framework, called UnHiPPO, which is based on modeling the posterior of the states given the noisy measurements. The UnHiPPO framework introduces a new hyperparameter $\sigma$ that is related to the noise level, and ablation studies have been carried out to show its necessity.
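For readers less familiar with the construction under review, a minimal sketch of the HiPPO-LegS matrices and their online recurrence follows. The Euler discretization and the start at $t = 0.1$ (to sidestep the stiff early steps) are simplifications chosen for brevity here, not what the papers use:

```python
import numpy as np

def hippo_legs(n):
    """HiPPO-LegS matrices for the ODE  d/dt c(t) = -(1/t) A c(t) + (1/t) B f(t)."""
    A = np.zeros((n, n))
    B = np.sqrt(2.0 * np.arange(n) + 1.0)
    for i in range(n):
        for j in range(n):
            if i > j:
                A[i, j] = np.sqrt((2 * i + 1) * (2 * j + 1))
            elif i == j:
                A[i, j] = i + 1
    return A, B

# Online compression of f(t) = sin(2 pi t): the coefficients c approximate the
# Legendre expansion of f over the whole history [0, t] at every step.
n, dt = 8, 1e-3
A, B = hippo_legs(n)
c = np.zeros(n)
for k in range(100, 1001):
    t = k * dt
    c = c + (dt / t) * (-A @ c + B * np.sin(2 * np.pi * t))
```

The recurrence treats $f(t)$ as a deterministic control input; the review's point is that nothing in this construction models $f(t)$ as a noisy observation.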

Update after rebuttal

I thank the author(s) for their response and would maintain my score.

Questions for Authors

  1. My main question is about the application of the HiPPO in LSSL. As mentioned earlier, I think there is a gap between theory and practice. The following subquestions may be useful to consider:
    1. In Figure 7, you showed an ablation study that changing $\sigma^2$ can vary the performance of the model. If you look at the model more carefully, then only a small proportion of $\sigma^2$ values leads to better performance than the model initialized by HiPPO. This made me wonder: is it the noise-aware mechanism or something else that determines the performance of the model as the hyperparameter varies? By "something else", I mean things like the magnitudes of the eigenvalues of the discrete-time matrix $\mathbf{A}$ that control the system's memory (e.g., [1] and [2]), the imaginary parts of the eigenvalues of the continuous-time matrix $\mathbf{A}$ that control the frequency bias and the approximation-estimation tradeoff (e.g., [3] and [4]), and the stability of the parameterization of the system (e.g., [5]). Can the authors show more empirical studies of whether the noise-awareness is really the decisive factor? For example, it would be useful to show Figure 7 multiple times given different $\rho^2$ and see how the optimal $\sigma^2$ changes.
    2. As mentioned in the manuscript, what is used in theory (a time-varying system with a $1/t$ scaling) is different from what is used in practice (a time-invariant system that drops the $1/t$). How does the theory account for this change, and what is the noise-awareness of the time-invariant system?
    3. HiPPO and UnHiPPO are only initializations. You have to train the systems eventually. How stable is UnHiPPO when trained?
  2. On page 6, you mentioned that "An unfortunate side effect is that $\sigma^2$ cannot be interpreted as the noise variance of the data directly." While I understand that one does not have $\sigma^2 = \rho^2$, I wonder if there is a connection between them. That is, if I know the noise level in the input, is there a principled way for me to select or scale the hyperparameter $\sigma^2$?
  3. Can UnHiPPO be thought of as a regularization scheme, where $\sigma^2$ controls the magnitude of regularization? This seems pretty clear in Figures 3-4. I wonder if some theory can be derived in this direction.
  4. The UnHiPPO discussed in this paper is only for HiPPO-LegS, while there are also many variants of HiPPO (e.g., HiPPO-LegT). Can the work in this paper be extended to those? To be clear, I am not looking for detailed derivations of each of them; I only wonder whether the analysis in this paper is generic or ad hoc to HiPPO-LegS.

[1] Antonio Orvieto et al., Resurrecting recurrent neural networks for long sequences, International Conference on Machine Learning, 2023.

[2] Naman Agarwal et al., Spectral state space models, arXiv preprint arXiv:2312.06837, 2023.

[3] Annan Yu et al., Tuning frequency bias of state space models, International Conference on Learning Representations, 2025.

[4] Fusheng Liu and Qianxiao Li, Autocorrelation Matters: Understanding the Role of Initialization Schemes for State Space Models, International Conference on Learning Representations, 2025.

[5] Shida Wang and Qianxiao Li, StableSSM: Alleviating the curse of memory in state-space models through stable reparameterization, International Conference on Machine Learning, 2024.

Claims and Evidence

I find the claims made in the submission clear and convincing, modulo the first point I raised in the weaknesses section.

Methods and Evaluation Criteria

The proposed method makes sense for the problem of interest.

Theoretical Claims

I checked the derivations of the formulas in the main text. I did not check the mathematics in the supplementary material.

Experimental Design and Analyses

I agree with the soundness of the experimental designs. I do have some questions about the analyses, which I raise in the questions section.

Supplementary Material

I scanned through the entire supplementary material without carefully checking its correctness.

Relation to Prior Literature

This paper studies the noise handling of a specific initialization scheme of LSSL/SSM. Both the objective (making the model noise-aware) and the class of models are of interest to many sequential and time series problems.

Missing Essential References

NA

Other Strengths and Weaknesses

I am ambivalent about my recommendation of this manuscript. On the one hand, the derivation of the UnHiPPO framework in this paper is beautiful and principled; on the other hand, I am not totally persuaded by the significance of the contribution to the application of LSSLs/SSMs, as there is a gap between the theory of the (Un)HiPPO framework and its practical success in such sequence models (as outlined in the questions section). In the end, I vote for weak acceptance, keeping in mind that such a gap is not an intrinsic flaw of the manuscript itself and that the UnHiPPO framework is potentially useful in fields other than LSSLs/SSMs or even deep learning.

Strengths:

  • Noise-aware modeling is an important topic and has not been considered in the HiPPO literature.
  • The derivation of the UnHiPPO framework in this paper is beautiful and principled.

Weaknesses:

  • While the UnHiPPO framework is derived from mathematical principles, a theoretical comparison of how UnHiPPO and HiPPO handle noise is lacking. That is, I would like to see a clear and convincing theorem showing that UnHiPPO is more robust to noise than HiPPO.
  • The UnHiPPO state transition matrix $\mathbf{A}$ lacks a simplified representation (e.g., diagonal-plus-low-rank or diagonal). This makes it harder to apply this initialization in the more efficient S4/S4D/S5 models.
  • The presentation of the UnHiPPO framework can potentially be improved. See the comments section.

Other Comments or Suggestions

  1. I suggest clearly stating the issue with a noisy function $f$ and the objective of designing a noise-aware system before starting to derive it.
  2. The manuscript is very well-written and easy to follow up to the end of Section 3. In Section 4, the discussion is a bit confusing and required several reads for me to figure out the core idea. That is, (7)-(8) require more explanation of how they connect to (4). The UnHiPPO systems are not derived in the way an ML researcher would standardly imagine, where one seeks parameters of an autoregressive system with given inputs. The idea is more one of "input matching given the guessed states." Making this explicit at the beginning of Section 4 would avoid confusion.
  3. I cannot understand the following sentence in Section 4: "In contrast to what one might expect, the observations $y_{k_t}$ do not exist in HiPPO and, instead, the signal $f(t)$ corresponds to the control signal $u_t$." Please consider rephrasing or further explaining it.
  4. At the end of Section 2, I suggest expanding "best possible compression of $f$" to "best possible compression of $f$ in the $L^2$ space" to make the statement more precise.
Author Response

Thank you for your review and careful reading.

Phrasing

We will clarify Section 4 based on your feedback. The phrase "observations do not exist in HiPPO" refers to the fact that the data does not take the role of an observation in HiPPO, but rather of a control signal. We will update the section to make this clear. We have also adopted your clarification "best possible compression of $f$ in the $L^2$ space".

Effect of noise level

https://figshare.com/s/132436b00e91612513fd

We ran two experiments similar to Figure 7, but this time the noise level $\rho^2$ is set to a lower and a higher value. The best results were obtained for $\sigma^2 = 10^{10}$. However, as we increase the noise level $\rho^2$, we notice that higher values such as $\sigma^2 = 10^{14}$ perform comparatively well, as opposed to what happens in Figure 7.

What is the effect of ignoring the time-variance of the true dynamics?

We examined the effect of this empirically in Figure 5 and the surrounding text in the paragraph "Time-invariant Dynamics", but we did not analyze it theoretically.

How can we select σ2\sigma^2?

At the moment, our best strategy for choosing $\sigma^2$ is empirical. The reason that $\sigma^2$ cannot be chosen as the noise level in the data comes down to $\Sigma = I$ in the Kalman filter. Because this adds uncertainty to all degrees of the polynomial representation at each step, it also increases the uncertainty in the high-degree components. Even though the dynamics are regularized, the highest degrees still grow quickly, so $\sigma^2$ needs to be of a similar order of magnitude as $B_H^{\mathsf{T}} P_k^{-} B_H$ in $s_k$ to have an effect. Choosing the diagonal of $\Sigma$ to decay quickly should be able to counteract this effect, though it is unclear exactly how quickly it would need to decay.
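The magnitude argument above can be illustrated on a toy Kalman filter. Everything below (dimensions, dynamics, the all-ones observation row standing in for $B_H^{\mathsf{T}}$) is a hypothetical stand-in, not the actual UnHiPPO system; the point is only that with process noise $\Sigma = I$, the state-uncertainty term $B_H^{\mathsf{T}} P_k^{-} B_H$ in the innovation variance $s_k$ dwarfs a $\sigma^2$ of ordinary size:

```python
import numpy as np

n = 8                                # hypothetical state dimension
A = 0.9 * np.eye(n)                  # toy stable dynamics, not the UnHiPPO matrix
H = np.ones((1, n))                  # toy observation row (plays the role of B_H^T)
sigma2 = 1.0                         # measurement-noise hyperparameter
Q = np.eye(n)                        # process noise with Sigma = I, as in the paper

P = np.eye(n)
for _ in range(50):
    P = A @ P @ A.T + Q              # predict:  P_k^- = A P A^T + Sigma
    s = sigma2 + float(H @ P @ H.T)  # innovation variance  s_k = sigma^2 + H P_k^- H^T
    K = P @ H.T / s                  # Kalman gain
    P = P - K @ (H @ P)              # update

print(sigma2 / s)                    # fraction of s_k contributed by sigma^2
```

Because the predict step re-adds $\Sigma = I$ at every iteration, $P_k^{-} \succeq I$ always holds, so the $H P_k^{-} H^{\mathsf{T}}$ term alone is at least the squared norm of the observation row; $\sigma^2$ only matters once it reaches a comparable order of magnitude.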

Can UnHiPPO be thought of as a regularization scheme, where σ2\sigma^2 controls the magnitude of regularization?

Yes, as we have shown in Figure 4, $\sigma^2$ directly controls how sensitive the system is to high-frequency components / noise in the data.

Can UnHiPPO be extended to HiPPO variants other than LegS?

We decided to focus on the LegS variant of HiPPO because it is the most relevant one in the literature. Extending the approach to other polynomial bases should be possible, because we only use properties of Legendre polynomials explicitly in Section 4.1 to derive the regularized HiPPO matrix. The same derivation should be theoretically possible in other bases, though it might turn out to be either unnecessary if the basis behaves well under extrapolation, or fail to produce a closed form of $Q_i$.

Review (Rating: 3)

The paper extends HiPPO by incorporating uncertainty-awareness, enhancing the robustness of state space models (SSMs) to noise.

  • The study is limited to SC10; testing on additional datasets and tasks would improve generalizability, a limitation acknowledged in the paper.
  • The choice of regularization method is not thoroughly analyzed, though the proposed approach is discussed in detail. Exploring alternative strategies would strengthen the contribution.
  • The $\sigma^2$ hyperparameter is crucial but lacks a systematic selection method across different datasets, despite some analysis of its impact.
  • Claims of negligible runtime cost are supported by experiments, but further evaluation on larger datasets would provide a clearer comparison with standard HiPPO.
  • No structured formulation for the UnHiPPO matrix is provided, which may impact computational efficiency—an issue noted by the authors.
  • The closed-form solution is preferred over the trapezoidal rule, and while its instability in UnHiPPO is noted, further explanation could clarify this choice.
  • More analysis of the impact of different $\Sigma$ structures on performance would be beneficial.
  • Empirical comparisons with existing methods would help contextualize UnHiPPO’s advantages and limitations beyond theoretical discussion.

Questions for Authors

See summary

Claims and Evidence

See summary

Methods and Evaluation Criteria

See summary

Theoretical Claims

See summary

Experimental Design and Analyses

See summary

Supplementary Material

See summary

Relation to Prior Literature

See summary

Missing Essential References

See summary

Other Strengths and Weaknesses

See summary

Other Comments or Suggestions

See summary

Author Response

Thank you for your review. Please see Figure 9 in Appendix B for a visualization of the effect of different discretizations. It demonstrates the instability of some methods in both HiPPO and UnHiPPO and in particular the remarkable stability of the closed-form solution.
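The qualitative behavior referenced here can be illustrated on a generic stiff linear system (a toy example, not the paper's Figure 9): forward Euler diverges at a step size where the trapezoidal (bilinear) rule remains stable, and the closed-form matrix exponential is exact:

```python
import numpy as np

# Stiff, stable diagonal system dx/dt = A x, discretized three ways.
rates = np.array([-1.0, -50.0])
A = np.diag(rates)
dt = 0.1
I = np.eye(2)

euler = I + dt * A                                          # forward Euler
bilinear = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)  # trapezoidal rule
exact = np.diag(np.exp(dt * rates))                         # closed form exp(dt A)

def rollout(M, steps=100):
    """Iterate x_{k+1} = M x_k from x_0 = (1, 1)."""
    x = np.ones(2)
    for _ in range(steps):
        x = M @ x
    return x
```

For the fast mode, Euler's amplification factor is $1 + 0.1 \cdot (-50) = -4$, so iterates blow up, while both the bilinear and exact maps keep it inside the unit circle.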

Review (Rating: 3)

The paper proposes to extend high-order polynomial projection operators (HiPPO) that are used to initialise the dynamics of recent state space models. HiPPO theory is agnostic to measurement noise. The paper extends HiPPO operators to capture uncertainty arising from measurement noise. Specifically, the paper proposes to infer the posterior pdf of the HiPPO coefficients conditional on the noisy observations. The parameters of the pdf are estimated using a Kalman filter. The updated Kalman state (i.e., the mean of the posterior) itself is a first-order difference equation, where the transition of states from time step t-1 to t is captured by a transition matrix, and where the excitation is projected into state-space by an input vector. The transition matrix and input vector are therefore synonymous to the HiPPO matrices, but – owing to their Bayesian formulation – capture uncertainty in the measurement noise. The paper embeds the proposed matrices within LSSL (Gu & Dao 2021). The results of the resulting uncertainty-aware LSSL are compared against LSSL for a 10-class subset of the Speech Commands dataset. Results indicate improvements of up to approximately 1.5 percentage points of the proposed model over the baseline.

Update after rebuttal:

I thank the authors for their rebuttal. I particularly appreciate their effort in providing new results for different datasets to address my concerns. In my opinion, the idea of capturing uncertainty to improve robustness to noise is interesting and very promising. Unfortunately, similar to the results in the paper, the new results provided during the rebuttal also demonstrate only small improvements (< ~3%) compared to the LSSL baseline. In my opinion, more convincing results are required to evidence the paper's claim, i.e., that the proposed initialisation improves robustness against noise. I therefore maintain my score. Again, I think this is a promising approach and I would encourage the authors to undertake a more thorough experimental investigation with the aim to identify scenarios in which the proposed approach clearly outperforms LSSL.

Questions for Authors

  1. P. 6, paragraph following (29): “In contrast to LSSL where we can get the discretized dynamics at any t directly, for UnLSSL we compute them for all integer steps t ∈ [tmax] and then select a subset.” – I assume that this is necessary since the uncertainty-aware transition matrix and input vector are obtained from the discrete-time update equation in (24). Please can you clarify this point?
  2. Same paragraph: “Instead, we also compute all intermediate steps, which mirrors the more realistic setting where we also observe data at 1, 2, . . . , t − 1.” – Is the initialization of UnLSSL therefore data dependent?
  3. How were the noise experiments conducted? What type of noise was added? At what SNR?

Claims and Evidence

The main claim of the paper is that the performance and noise robustness of models initialised with HiPPO can be improved by capturing explicitly uncertainty in the measurement noise. The mathematical formulation is presented in a clear and concise manner. Convincing examples are provided in Section 5 to illustrate the benefits of the proposed method.

Methods and Evaluation Criteria

In my opinion, the proposed method provides an elegant and simple approach to embed measurement uncertainty in HiPPO. The dataset that was selected is appropriate to validate the proposed methodology, given the inherent challenges arising from speech in noise. While the results presented in the paper are sufficient for validation, they showcase only small improvements in accuracy. To help identify the benefits (and potential limitations) of the proposed model, I would have expected a more thorough study involving different tasks and datasets for evaluation. For example, most SSMs since LSSL are evaluated on the LRA benchmark.

Theoretical Claims

See “Claims and Evidence”

Experimental Design and Analyses

In general, the experimental design provides the necessary information about the models to reproduce the studies. However, there is some information missing about the distortion of the speech signals:

  1. Based on the information provided in Section 6, I gather that the authors added noise to Speech Commands to investigate performance under varying conditions. Is this correct? If so, my main concern is that Speech Commands was recorded in varying acoustic conditions, i.e., the clips already include background noise at varying noise levels. The results in Figure 6 seem to indicate LSSL actually performs equally well as UnLSSL at low noise levels (i.e., when no additional noise was added to the noisy Speech Commands utterances). This seems contradictory to the main claim of the paper.
  2. It is unclear how noise was added. Were the signals normalised to a particular level prior to adding noise? What is the target SNR? How is this handled considering that clips often contain long periods of silence / background noise with very short speech utterances?
  3. It is unclear what type of noise was added. The type of noise will have a significant impact on the results. For example, white noise is relatively easy to remove. However, realistic noise sources are rarely white. Have the signals been distorted with realistic noise sources, such as speech-like noise or music? Which noise dataset was used?

Provided that my understanding of the noise experiments is correct, I recommend repeating the experiments with a dataset that is recorded in anechoic or studio conditions (e.g., TIMIT, VCTK). It would also be possible to estimate the noise levels in Speech Commands from periods of speech inactivity.

Supplementary Material

Supplementary Material contains code

Relation to Prior Literature

A timely paper that fits well within recent work on the initialisation of SSMs, including, e.g., [1] below. In my opinion, the novel contribution of the paper is the extension of HiPPO to a framework that embeds uncertainty.

[1] Liu & Li, “Autocorrelation Matters: Understanding the Role of Initialization Schemes for State Space Models”, ICLR 2025

Missing Essential References

N/A

Other Strengths and Weaknesses

Strengths: This is a very well-written paper that provides clear motivation and justification for assumptions, coherent and linear explanations, and helpful illustrative examples. I particularly appreciated the concise summary of HiPPO in Section 3.

Weaknesses: The figures in Section 6 provide nice visualisations of the general trend of accuracy and error curves. However, considering the small differences in performance between UnLSSL and LSSL, tables are required to provide a precise comparison of performance. For example, in Figure 7 at $\sigma^2 = 10^{10}$, I struggle to determine whether the difference in accuracy is one percentage point, more, or less.

Other Comments or Suggestions

N/A

Author Response

Thank you for your review.

How was the noise added? How were the signals normalized?

We follow the LSSL implementation and the code from the authors' repository. After the audio files are loaded, the signals are divided by 32k for normalization. After the noise is added, we apply z-score normalization.

What type of noise was added?

The two noise sources in equations (7) and (8), namely $\mathrm{d}\beta$ and $\varepsilon_{t_k}$, are Gaussian. In line with the theory, we sampled Gaussian noise with zero mean and fixed variance. We have not experimented with other noise sources.

Computation of parameters:

P. 6, paragraph following (29): "In contrast to LSSL where we can get the discretized dynamics at any t directly, for UnLSSL we compute them for all integer steps $t \in [t_{\max}]$ and then select a subset." – I assume that this is necessary since the uncertainty-aware transition matrix and input vector are obtained from the discrete-time update equation in (24). Please can you clarify this point?

That is correct. In the end, the parameters are tied to Kalman filtering in Equation (23), which is a linear dynamical system (LDS) that evolves with the transition matrix computed for each time step. We briefly mentioned this in the next sentence: "In theory, we could jump to any $t$ directly in the Kalman update in Equation (23), but that would increase the uncertainty as if there was no data before, changing the dynamics."

Same paragraph: “Instead, we also compute all intermediate steps, which mirrors the more realistic setting where we also observe data at 1, 2, . . . , t − 1.” – Is the initialization of UnLSSL therefore data dependent?

No, it just means that we compute the parameters for the case where we observe data at each integer time step. Otherwise, the parameters would be computed as if we had not observed any data for longer stretches of time, which would inflate the uncertainty estimate and the smoothing.

How were the noise experiments conducted? What type of noise was added? At what SNR?

For each audio signal, we compute the standard deviation of the signal and add random noise scaled by a factor times the computed standard deviation. We used the factor values 0.0, 1e-7, 1e-6, 3.16e-6, 1e-5, 1.77e-5, 3.16e-5, 5.62e-5, 1e-4. We observed that values larger than 1e-4 repress most of the audio signals in SC10 and leave nothing but noise for this dataset.

Since we do not add the noise based on amplitudes, but based on the standard deviation, we do not have a constant SNR value. For example, an audio signal with a constant value would never be affected by noise, even when the noise factor is non-zero, as the standard deviation of the signal itself is 0.

We ran the LSSL model with HiPPO and UnHiPPO initialization on the same data and report averages over three different seeds. The complete configuration for the noise experiments is available in the supplementary material in config/experiment/sc-raw-noise.yaml.
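The described procedure can be sketched as follows. The waveform, function name, and noise factor below are illustrative stand-ins, not the actual pipeline code:

```python
import numpy as np

def add_relative_noise(signal, factor, rng):
    """Add white Gaussian noise scaled by the signal's own standard deviation,
    then z-score normalize, mirroring the procedure described above."""
    noisy = signal + factor * signal.std() * rng.standard_normal(signal.shape)
    return (noisy - noisy.mean()) / noisy.std()

rng = np.random.default_rng(0)
# Stand-in for a loaded 16-bit waveform; divided by 32k as in the LSSL code.
raw = rng.integers(-32768, 32768, size=16000).astype(np.float64)
signal = raw / 32000.0
noisy = add_relative_noise(signal, 3.16e-5, rng)
```

Note that because the noise scale is tied to the signal's standard deviation rather than a target SNR, a constant signal would receive zero noise, which is exactly the authors' point about why no constant SNR can be reported.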

Reviewer Comment

Thank you for the response. I understand that the evaluation is based on an existing paper. However, this does not resolve my concerns regarding the experimental setup. I am concerned that the results are inconclusive considering that white Gaussian noise is added to signals that are already distorted by varying levels of background noise.

Author Comment

Thank you for your response.

Based on your initial review, we evaluated our model on RWCP-SSD, a dataset of non-speech, dry sounds recorded in a professional anechoic studio with a reported signal-to-noise ratio of 50 dB.

We used two subsets of the data for two separate classification tasks. On the first subset of about 3500 recordings, we try to detect the material of colliding objects (wood, metal, plastic, or ceramic). The second subset of about 4200 recordings contains characteristic sounds of various objects, e.g. metal articles (coin, bell), paper (tearing, dropping book), musical instruments (drum, bugle), electronic sound (phone, toy), mechanical sound (spring, stapler), which we try to distinguish.

https://figshare.com/s/132436b00e91612513fd

As before, we normalize the data, add white noise w.r.t. the standard deviation of the audio signal, and apply z-score normalization. We ran the same noise experiment that we showed in Figure 6 and 8 in the paper. The results show that the UnHiPPO initialization also improves the robustness to noise on these two new datasets.

Review (Rating: 2)

This paper investigates state space models (SSMs) through the lens of linear stochastic control theory and proposes a novel initialization method to enhance robustness against input noise. The authors first reformulate the linear recurrence in SSMs as a homogeneous linear dynamical system with noise, replacing the input signals with their closed-form reconstructions. Next, they derive a regularized HiPPO formulation by enforcing the online approximator to extrapolate linearly and maintain a consistent time derivative with the closed-form approximation at the boundary. Building on this new dynamic system, the authors compute the posterior mean estimate under Gaussian noise, which ultimately yields an improved initialization for the state transition matrix in SSMs.

Questions for Authors

A question is whether the proposed initialization remains compatible with computationally efficient parameterizations, which serve as a core advantage of SSMs. The original HiPPO matrix has been shown to admit a normal-plus-low-rank decomposition, enabling fast computation. Subsequent work further simplifies this structure by approximating HiPPO with diagonal matrices. It remains unclear whether the UnHiPPO formulation retains or admits similar structural properties.

Claims and Evidence

Most claims in the paper are clear and well-supported. However, I have some reservations regarding certain steps in the derivation of the UnHiPPO initialization:

To obtain data-free dynamics, the authors substitute $f(t)$ with $\hat{f}_{\le t}(t)$. While this choice eases the derivation, it introduces a discrepancy between the dynamics used for initialization and those used during actual signal processing. Further justification would strengthen this step.

Additionally, as far as I understand, models such as LSSL and S4 do not learn time-dependent matrices $A$ and $B$ directly. Instead, they learn static matrices $A$ and $B$, which are then converted into time-varying forms during training or inference. In contrast, the proposed initialization appears to generate a separate $A_k$ and $B_k$ for each timestep. It remains unclear whether the proposed method requires time-wise parameterization in practice, and if not, how the initialization is reconciled with the time-independent parameterization commonly used in these models.

Methods and Evaluation Criteria

The primary goal of this paper is to enhance the robustness of state space models (SSMs) against input noise. The proposed method is an initialization scheme that provides a posterior estimate of the signal under Gaussian noise at the initialization stage. The derivation appears rigorous and well-grounded. However, when integrated into an SSM, the scheme seems to instantiate time-dependent parameters $A_k$ and $B_k$, which may diverge from the common practice of using time-invariant parameters $A$ and $B$ that are converted to $A_k$ and $B_k$ on the fly during training or inference. The experiments focus primarily on evaluating the robustness of the proposed approach under noise perturbations, which is well-aligned with the stated objective.

Theoretical Claims

All derivations look correct to me.

Experimental Design and Analyses

The experimental setup feels somewhat limited. Unlike many recent works on SSMs that evaluate on the Long Range Arena benchmark [1], this paper validates the proposed approach on a relatively small-scale dataset. Additionally, the comparisons are restricted to LSSL with HiPPO initialization, without including other competitive baselines. It would strengthen the empirical evaluation to include comparisons with more recent and widely adopted models such as S4 [2], DSS [3], or other state-of-the-art SSM variants.

[1] Tay et al., Long Range Arena: A Benchmark for Efficient Transformers

[2] Gu et al., Efficiently Modeling Long Sequences with Structured State Spaces

[3] Gupta et al., Diagonal State Spaces are as Effective as Structured State Spaces

Supplementary Material

The supplementary materials were not reviewed in detail. However, based on the structure of the main paper, it appears that no critical components of the proposed method are deferred to the appendix.

Relation to Prior Literature

The proposed method is highly relevant to recent advancements in sub-quadratic sequence modeling, which aim to improve the efficiency and effectiveness of long-sequence processing. The underlying dynamical system studied in this work forms the core foundation for this class of models.

Missing Essential References

Two core works on the initialization and parameterization of SSMs appear to be missing:

[1] Gupta et al., Diagonal State Spaces are as Effective as Structured State Spaces

[2] Gu et al., How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

Other Strengths and Weaknesses

Strengths:

  • Analyzing SSMs through the lens of stochastic linear control theory provides a novel perspective and a powerful theoretical tool. I found Section 4.1 particularly insightful, in particular, the idea of regularizing the extrapolation behavior of SSMs using boundary conditions. The authors may also want to emphasize that this regularized formulation contributes directly to improving robustness against noise.

Weaknesses:

  • The writing could benefit from greater clarity. For example, it took me some time to fully understand how $\hat{f}_{\le t}$ depends on $\tau$. If I understand correctly, $\hat{f}$ should be expressed as a function of $t$ and $\tau$ separately rather than ambiguously as a function of one variable.

  • While the derived initialization is theoretically interesting, the empirical analysis could be strengthened. I recommend evaluating the method on more diverse tasks, such as language modeling, to demonstrate broader applicability. Additionally, it would be valuable to discuss how this initialization connects with recent advances in SSMs, such as Mamba [1], and whether it can be integrated into these newer architectures.

[1] Gu et. al. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Other Comments or Suggestions

I caught one typo:

Ln 279-280, a GELU nonlinearity followed a linear layer -> a GELU nonlinearity following a linear layer

Author Response

Thank you for your detailed review.

Time-dependence of matrices

It is correct that SSMs learn $A$ and $B$ and discretize them on the fly. However, at least for LSSL, the time step at which the learned matrices are discretized is fixed per feature to make the model adapt to multiple timescales and does not actually vary with time. Therefore, it is effectively equivalent to learning $A_k$ and $B_k$ directly.
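What "a fixed time step per feature" means can be sketched generically. The bilinear rule and the timescale values below are illustrative choices for the sketch, not necessarily the exact discretization LSSL uses:

```python
import numpy as np

def bilinear_discretize(A, B, dt):
    """Bilinear (Tustin) discretization of dx/dt = A x + B u at step size dt."""
    I = np.eye(A.shape[0])
    Ad = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
    Bd = np.linalg.solve(I - dt / 2 * A, dt * B)
    return Ad, Bd

# One continuous-time pair (A, B), discretized once per feature/channel at a
# fixed, feature-specific timescale (hypothetical dt values).
n = 4
A = -np.eye(n)                       # toy stable dynamics
B = np.ones((n, 1))
timescales = [1e-3, 1e-2, 1e-1]
per_feature = [bilinear_discretize(A, B, dt) for dt in timescales]
```

Each channel thus gets its own constant discrete pair $(A_k, B_k)$; since the dt per channel never changes over time, learning the discrete pairs directly is equivalent, which is the authors' point.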

Experiments

We did not evaluate on LRA because its synthetic tasks have discrete input data, which does not fit the assumptions of our derivation: there, we assume that the noise is Gaussian and therefore continuous. Adapting the method to discrete data and discrete noise would be nontrivial.

Typo and related work

Thank you for your careful reading. We have added the missing "by" to clarify that the GELU nonlinearity comes first and included the two works you mentioned in our related work section.

Compatibility with efficient parametrizations

As we described under Limitations (Section 8), we were unfortunately not able to derive a similarly structured representation of our initialization, because the pseudo-inverse and the Kalman filter equations pose a significant challenge, as we elaborate in the following.

$\mathbf{A}_{\mathrm{R}}$ is not a normal matrix; therefore, there is no trivial diagonalization of it. Previously, Gu et al. (S4) used correction matrices to obtain skew-symmetric matrices, which are in turn a special case of normal matrices. Here, though, finding a correction matrix is not trivial due to the pseudo-inverse. Consider the definition of the pseudo-inverse for a matrix with linearly independent columns,


$$\mathbf{M}^{\dagger} = (\mathbf{M}^{\ast}\mathbf{M})^{-1}\mathbf{M}^{\ast}$$

where $(\cdot)^{\ast}$ denotes the conjugate transpose, which is equivalent to the transpose for a real matrix. Since $\mathbf{A}_{\mathrm{R}}$ is real, we can safely use

$$\mathbf{M}^{\dagger} = (\mathbf{M}^{\mathsf{T}}\mathbf{M})^{-1}\mathbf{M}^{\mathsf{T}}.$$

Then, the pseudo-inverse of (19) becomes

$$\begin{pmatrix}\mathbf{I}\\ \mathbf{B}_{\mathrm{H}}^{\mathsf{T}}\\ \mathbf{Q}^{\mathsf{T}}\end{pmatrix}^{\dagger} = \left(\begin{pmatrix}\mathbf{I}\\ \mathbf{B}_{\mathrm{H}}^{\mathsf{T}}\\ \mathbf{Q}^{\mathsf{T}}\end{pmatrix}^{\mathsf{T}}\begin{pmatrix}\mathbf{I}\\ \mathbf{B}_{\mathrm{H}}^{\mathsf{T}}\\ \mathbf{Q}^{\mathsf{T}}\end{pmatrix}\right)^{-1}\begin{pmatrix}\mathbf{I}\\ \mathbf{B}_{\mathrm{H}}^{\mathsf{T}}\\ \mathbf{Q}^{\mathsf{T}}\end{pmatrix}^{\mathsf{T}}$$

and $\mathbf{A}_{\mathrm{R}}$ can be written explicitly as

$$\mathbf{A}_{\mathrm{R}} = \Bigg(\underbrace{\begin{pmatrix}\mathbf{I}\\ \mathbf{B}_{\mathrm{H}}^{\mathsf{T}}\\ \mathbf{Q}^{\mathsf{T}}\end{pmatrix}^{\mathsf{T}}\begin{pmatrix}\mathbf{I}\\ \mathbf{B}_{\mathrm{H}}^{\mathsf{T}}\\ \mathbf{Q}^{\mathsf{T}}\end{pmatrix}}_{\text{first part}}\Bigg)^{-1}\underbrace{\begin{pmatrix}\mathbf{I}\\ \mathbf{B}_{\mathrm{H}}^{\mathsf{T}}\\ \mathbf{Q}^{\mathsf{T}}\end{pmatrix}^{\mathsf{T}}\begin{pmatrix}\mathbf{A}_{\mathrm{H}}^{\mathsf{T}} - \mathbf{I}\\ 2\mathbf{Q}^{\mathsf{T}}\\ \mathbf{0}\end{pmatrix}}_{\text{second part}}$$

The first part is a symmetric, square matrix; therefore, its inverse is also symmetric, and this part would not be a problem for diagonalization if it stood by itself, or alongside another symmetric matrix.

The second part does not result in a symmetric, skew-symmetric, or, in general, normal matrix. Therefore, the overall result needs a correction matrix to become diagonalizable.

However, a hand calculation requires inverting the first part, and even then, a multiplication with a non-normal matrix must be computed. Together, this makes it too complicated to reveal the structure of the matrix so that it can be manipulated into a normal form.
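The building blocks of this argument are easy to check numerically. The sketch below uses random stand-ins for $\mathbf{B}_{\mathrm{H}}$ and $\mathbf{Q}$ (not the actual HiPPO quantities; the identity block alone guarantees full column rank) and verifies both the explicit pseudo-inverse formula and the symmetry of the "first part":

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Stacked matrix (I; B_H^T; Q^T) with random stand-ins for B_H^T and Q^T.
stacked = np.vstack([np.eye(n),
                     rng.standard_normal((1, n)),   # plays the role of B_H^T
                     rng.standard_normal((n, n))])  # plays the role of Q^T

gram = stacked.T @ stacked                 # "first part": symmetric and invertible
pinv_explicit = np.linalg.solve(gram, stacked.T)

# The explicit formula (M^T M)^{-1} M^T agrees with the SVD-based pseudo-inverse,
# and the Gram matrix (hence its inverse) is symmetric.
assert np.allclose(pinv_explicit, np.linalg.pinv(stacked))
assert np.allclose(gram, gram.T)
```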

Final Decision

The paper has been well received by all the reviewers, who also point out that it is theoretically sound and easy to follow.

The main point of criticism has to do with the moderate improvement over HiPPO-initialized LSSLs/SSMs. I agree with that concern, but at the same time I also see why the reviewers focus on the elegance of the proposed approach and its "principled formulation".

I suggest the authors take the time to incorporate the reviewers' feedback, together with the additional experiments, into the final version of the manuscript.

Congratulations on getting your paper accepted.